Reading a PDF with Python and exporting it to txt or HTML
Article link: https://www.cnblogs.com/tujia/p/16670374.html
1. Install pdfminer.six
pip install pdfminer.six
2. Read the PDF from code
from io import StringIO

from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

# Collect the converted output in memory first.
output_string = StringIO()
with open('test.pdf', 'rb') as fin:
    # Export as plain text:
    # extract_text_to_fp(fin, output_string)
    # Export as HTML (codec=None because the target is a text buffer, not a binary one):
    extract_text_to_fp(fin, output_string, laparams=LAParams(), output_type='html', codec=None)

# Write the result to disk as UTF-8.
with open('test.html', 'w', encoding='utf-8') as f:
    f.write(output_string.getvalue().strip())
Official documentation:
https://pdfminersix.readthedocs.io/en/latest/tutorial/highlevel.html
https://pdfminersix.readthedocs.io/en/latest/reference/highlevel.html
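Since the txt and HTML exports differ only in the output_type argument, the choice can be derived from the name of the output file. The following is a minimal sketch of that idea, not something from the pdfminer.six docs: the names OUTPUT_TYPES, output_type_for and convert_pdf are hypothetical helpers introduced here for illustration, and it assumes pdfminer.six is installed when convert_pdf is called.

```python
from io import StringIO
from pathlib import Path

# Hypothetical mapping from output file suffix to a pdfminer.six output_type value.
OUTPUT_TYPES = {".txt": "text", ".html": "html", ".xml": "xml"}


def output_type_for(out_path):
    """Return the output_type string matching the suffix of out_path."""
    suffix = Path(out_path).suffix.lower()
    if suffix not in OUTPUT_TYPES:
        raise ValueError(f"unsupported output suffix: {suffix!r}")
    return OUTPUT_TYPES[suffix]


def convert_pdf(pdf_path, out_path):
    """Convert pdf_path to text, HTML or XML, chosen by the suffix of out_path."""
    # Imported here so that output_type_for can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    buffer = StringIO()
    with open(pdf_path, "rb") as fin:
        # codec=None because the buffer is a text stream, as in the example above.
        extract_text_to_fp(
            fin,
            buffer,
            laparams=LAParams(),
            output_type=output_type_for(out_path),
            codec=None,
        )
    with open(out_path, "w", encoding="utf-8") as fout:
        fout.write(buffer.getvalue().strip())
```

With this in place, convert_pdf('test.pdf', 'test.txt') would write plain text and convert_pdf('test.pdf', 'test.html') would write HTML, without editing the call to extract_text_to_fp each time.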
3. Read the PDF with the command-line script
https://pdfminersix.readthedocs.io/en/latest/tutorial/commandline.html
https://pdfminersix.readthedocs.io/en/latest/reference/commandline.html
Explanation: omitted (see the documentation linked above).
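For readers who want to drive the command-line tool from Python, here is a rough sketch. It assumes the pdf2txt.py script that ships with pdfminer.six is on the PATH, and it uses only the -o (output file) and -t (output type) options described in the command-line reference. The function names build_pdf2txt_command and run_pdf2txt are hypothetical, chosen for this example.

```python
import subprocess


def build_pdf2txt_command(pdf_path, out_path, output_type="html"):
    """Build the argument list for the pdf2txt.py script (assumed to be on PATH)."""
    return ["pdf2txt.py", "-o", out_path, "-t", output_type, pdf_path]


def run_pdf2txt(pdf_path, out_path, output_type="html"):
    """Run pdf2txt.py; subprocess raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_pdf2txt_command(pdf_path, out_path, output_type), check=True)
```

Calling run_pdf2txt('test.pdf', 'test.html') would then be roughly equivalent to typing pdf2txt.py -o test.html -t html test.pdf in a shell. On systems where the script is not directly executable, the command may need to be prefixed with the Python interpreter.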
End.