pypandoc库实现文档转换

写在前面：

对于python程序员来说，文件格式之间转换很常用，尤其是把我们爬虫爬到的内容转换成想要的文档格式时。这几天看到一个网站上有许多文章，个人很喜欢，直接复制太麻烦，为了将爬到的html文件以word .doc 文件的格式存储到自己的数据库，选用了pypandoc库。

这个库语法简单，瞄一眼就能会，就跟我一起来看看吧。

安装

安装一般先装pandoc 然后安装pypandoc库

1.window

1>安装pandoc：直接下载windows版本的.msi文件即可，传送门 https://github.com/jgm/pandoc/releases/

2>安装pypandoc库：命令行安装，或者直接idea安装

pip install pypandoc

2.ubuntu

sudo apt install pandoc
pip install pypandoc

使用

pypandoc主要有3个函数：

1.convert()
2.convent_file()
3.convenr_text()

其中convert（）官方建议不用，容易产生歧义。

convert_file()和convent_text()区别就在于一个接收文件参数，一个接收文本参数。结合下面例子一看即懂：

import pypandoc

"""
　　pypandoc.convert_file(source_file, to, format=None, extra_args=(), encoding='utf-8',
                 outputfile=None, filters=None, verify_format=True)
    参数说明：
    source_file:源文件路径
    to：输入应转换为的格式；可以是'pypandoc.get_pandoc_formats（）[1]`之一
    format：输入的格式；将从具有已知文件扩展名的源_文件推断；可以是“pypandoc.get_pandoc_formats（）[1]”之一（默认值= None)
    extra_args：要传递给pandoc的额外参数（字符串列表）(Default value = ())
    encoding：文件或输入字节的编码 (Default value = 'utf-8')
    outputfile：转换后的内容输出路径+文件名，文件名的后缀要和to的一致，如果没有，则返回转换后的内容（默认值= None)
    filters – pandoc过滤器，例如过滤器=['pandoc-citeproc']
    verify_format：是否对给定的格式参数进行验证，（pypandoc根据文件名截取后缀格式，与用户输入的format进行比对）
    
    pypandoc.convert_text(source, to, format, extra_args=(), encoding='utf-8',
                     outputfile=None, filters=None, verify_format=True):
    参数说明： 
    source：字符串        
    其余和canvert_file()相同       

"""

# 将当前目录下html目录中的1.html网页文件直接转换成.docx文件，文件名为file1.docx，并保存在当前目录下的doc文件夹中
pypandoc.convert_file('./html/1.html', 'docx', outputfile="./doc/file1.docx")

# 将当前目录下html目录中的1.html网页文件 读取出来，然后转换成.docx文件，文件名为file2.docx，并保存在当前目录下的doc文件夹中
with open('./html/1.html', encoding='utf-8') as f:
    f_text = f.read()
pypandoc.convert_text(f_text, 'docx', 'html', outputfile="./doc/file2.docx")

可转格式

支持输入格式：

biblatex, bibtex, commonmark, commonmark_x, creole, csljson, csv, docbook, docx, dokuwiki, epub, fb2, gfm, haddock, html, ipynb, jats, jira, json, latex, man, markdown, markdown_github, markdown_mmd, markdown_phpextra, markdown_strict, mediawiki, muse, native, odt, opml, org, rst, rtf, t2t, textile, tikiwiki, twiki, vimwiki

支持输出格式：

asciidoc, asciidoctor, beamer, biblatex, bibtex, commonmark, commonmark_x, context, csljson, docbook, docbook4, docbook5, docx, dokuwiki, dzslides, epub, epub2, epub3, fb2, gfm, haddock, html, html4, html5, icml, ipynb, jats, jats_archiving, jats_articleauthoring, jats_publishing, jira, json, latex, man, markdown, markdown_github, markdown_mmd, markdown_phpextra, markdown_strict, mediawiki, ms, muse, native, odt, opendocument, opml, org, pdf, plain, pptx, revealjs, rst, rtf, s5, slideous, slidy, tei, texinfo, textile, xwiki, zimwiki

一图知晓：

posted @ 2021-09-22 15:32 微笑_百年阅读(3919) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部