Python导出Docx的Comment

用Python写了一段代码,用于将docx中的comment进行导出,导出到另外一个docx文件

 

用到的Lib:

docx,zipfile,bs4

原理:docx文件重命名为.zip文件,然后解压缩,会发现很多个xml文件

其中comment存放在 word/comments.xml。

另外一个文件 word/commentsExtended.xml,存放comment之间的关系,以及是否完成。

用bs4的BeautifulSoup对文件进行Parse

docx文件comment示例:

 

代码:

 1 from zipfile import ZipFile
 2 from bs4 import BeautifulSoup
 3 from docx import Document
 4 
 5 DOC = Document('untitled.docx')
 6 
 7 
 8 def export_docx(filename):
 9     z = ZipFile(filename)   #Mac下无法直接通过zipfile解压,可以手动重命名文件再解压
10     xml = z.read('word/comments.xml')
11     xml_ex = z.read('word/commentsExtended.xml')
12     wordObj = BeautifulSoup(xml.decode("utf-8"), features="html.parser")
13     wordObjEx = BeautifulSoup(xml_ex.decode("utf-8"), features="html.parser")
14 
15     new_heading = 'Comments From ' + filename
16     print(new_heading)
17     DOC.add_heading(new_heading, 1)
18 
19     # 用于存放comment之间的从属关系的字典,每条comment都有一个独特id
20     commentRelationDic = {}
21     comments = wordObjEx.findAll("w15:commentex")
22     for comment in comments:
23         id = comment.attrs['w15:paraid']
24         if 'w15:paraidparent' in comment.attrs.keys():
25             commentRelationDic[id] = True
26         else:
27             commentRelationDic[id] = False
28     comments = wordObj.findAll("w:comment")
29     for comment in comments:
30         author = comment.attrs['w:author']
31         id = comment.next.attrs['w14:paraid']
32         if not id in commentRelationDic:
33             DOC.add_paragraph('\n')
34             print('\n')
35         elif not commentRelationDic[id]:
36             DOC.add_paragraph('\n')
37             print('\n')
38         texts = comment.findAll("w:t")
39         comment_text = ''
40         for text in texts:
41             comment_text += text.text
42         p = DOC.add_paragraph()
43         p.add_run(author + '\n').bold = True
44         p.add_run(comment_text)
45         print("%s :"%author)
46         print(comment_text)
47     DOC.add_page_break()
48     DOC.save('Comments.docx')
49 
50 
51 if __name__ == '__main__':
52     export_docx(DOC)

 

posted @ 2021-03-02 19:47  Asp1rant  阅读(302)  评论(0编辑  收藏  举报