Python-docx 读取word.docx内容
第一次写博客,也不知道要写点儿什么好,所以就把我在学习Python的过程中遇到的问题记录下来,以便之后查看,本人小白,写的不好,如有错误,还请大家批评指正!
中文编码问题总是让人头疼,想要用Python读取word中的内容,用open()经常报错,上网一搜结果发现了Python有专门读取.docx的模块python_docx(只能读取.docx文件,不能读取.doc文件),用起来很方便。
安装python-docx:
pip install python_docx
(注意:不是pip install docx ! docx也可以安装,但总是报错,缺少exceptions,无法导入)
接下来就可以用Python_docx 来读取word文本了。
代码如下:
import docx from docx import Document path = "C:\\Users\\Administrator\\Desktop\\word.docx" document = Document(path) for paragraph in document.paragraphs: print(paragraph.text)
运行即可输出文本。
我尝试用docx读取.doc文本
代码如下:
import os import docx for filename in os.listdir(os.getcwd()): if filename.endswith('.doc'): print(filename[:-4]) doc = docx.Document(filename[:-4]+".docx") for para in doc.paragraphs: print (para.text)
结果报错:docx.opc.exceptions.PackageNotFoundError: Package not found。还是无法识别doc
引用1楼,“改变拓展名并没有改变其编码方式,因此无法读取文本内容,需将doc文件另存为docx文件后再用python-docx读取其内容”
# Document 还有添加标题、分页、段落、图片、章节等方法,说明如下
| add_heading(self, text='', level=1) | Return a heading paragraph newly added to the end of the document, | containing *text* and having its paragraph style determined by | *level*. If *level* is 0, the style is set to `Title`. If *level* is | 1 (or omitted), `Heading 1` is used. Otherwise the style is set to | `Heading {level}`. Raises |ValueError| if *level* is outside the | range 0-9. | | add_page_break(self) | Return a paragraph newly added to the end of the document and | containing only a page break. | | add_paragraph(self, text='', style=None) | Return a paragraph newly added to the end of the document, populated | with *text* and having paragraph style *style*. *text* can contain | tab (``\t``) characters, which are converted to the appropriate XML | form for a tab. *text* can also include newline (``\n``) or carriage | return (``\r``) characters, each of which is converted to a line | break. | | add_picture(self, image_path_or_stream, width=None, height=None) | Return a new picture shape added in its own paragraph at the end of | the document. The picture contains the image at | *image_path_or_stream*, scaled based on *width* and *height*. If | neither width nor height is specified, the picture appears at its | native size. If only one is specified, it is used to compute | a scaling factor that is then applied to the unspecified dimension, | preserving the aspect ratio of the image. The native size of the | picture is calculated using the dots-per-inch (dpi) value specified | in the image file, defaulting to 72 dpi if no value is specified, as | is often the case. | | add_section(self, start_type=2) | Return a |Section| object representing a new section added at the end | of the document. The optional *start_type* argument must be a member | of the :ref:`WdSectionStart` enumeration, and defaults to | ``WD_SECTION.NEW_PAGE`` if not provided. | | add_table(self, rows, cols, style=None) | Add a table having row and column counts of *rows* and *cols* | respectively and table style of *style*. *style* may be a paragraph | style object or a paragraph style name. If *style* is |None|, the | table inherits the default table style of the document. | | save(self, path_or_stream) | Save this document to *path_or_stream*, which can be eit a path to | a filesystem location (a string) or a file-like object.
docx还有许多其它功能,还正在学习中,详见官方文档:https://python-docx.readthedocs.io/en/latest/user/quickstart.html