Extracting Text from PDF Files with Python
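Requirements
Today the academic affairs office exported the grades for our whole year, and I got a shock: our names are not in the file names. Do we really have to open the files one by one? That quickly looked absurd, because there are more than 800 grade reports.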
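No worries: life is short, I use Python. The idea is simple. Use pdfminer to read the text in each PDF file, and what remains is a search problem with time complexity \(O(n)\).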
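The search step can be sketched without any PDF library. The following is a minimal illustration, not the script used in this post: it assumes the text of each file has already been extracted into a dict, and the names `find_matches` and `texts` are hypothetical. Each file's text is checked once per name, so for a short, fixed list of names the work grows linearly with the number of files.

```python
def find_matches(texts, names):
    """Map each name to the files whose extracted text contains it.

    texts: dict of file name -> extracted text (assumed already available)
    names: list of names to look for
    """
    matches = {name: [] for name in names}
    # One pass over the files; each text is checked against every name.
    for file_name, content in texts.items():
        for name in names:
            if name in content:
                matches[name].append(file_name)
    return matches


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = {
        "0001.pdf": "Name: Zhang San  Score: 90",
        "0002.pdf": "Name: Li Si  Score: 85",
    }
    print(find_matches(sample, ["Zhang San", "Wang Wu"]))
```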
Code
- File directory structure
Here scores holds the grade reports for the whole college, results stores the search results, and main.py is the core code.
import os
from shutil import copy
from io import StringIO
# process_pdf comes from the older pdfminer API (e.g. the pdfminer3k package);
# pdfminer.six does not provide it.
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

# A forward slash avoids the invalid "\s" escape and works on every platform.
score_dir = "./scores"


def readPdf(pdf_file):
    """Return all text extracted from an open binary PDF file."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr=rsrcmgr, outfp=retstr, laparams=laparams)
    process_pdf(rsrcmgr=rsrcmgr, device=device, fp=pdf_file)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    return content


if __name__ == '__main__':
    # Called "names" rather than "list" so the built-in list type is not shadowed.
    names = ["王五", "张三", "李四"]
    file_list = os.listdir(score_dir)
    for i in file_list:
        # read pdf content
        with open(os.path.join(score_dir, i), "rb") as f:
            content = readPdf(f)
        # search name in content
        for search_name in names:
            if search_name in content:
                # save to results
                copy(os.path.join(score_dir, i), os.path.join("./results/", search_name + ".pdf"))
                # # save time (left disabled: removing from a list while
                # # iterating over it skips items)
                # names.remove(search_name)
        # only takes effect if the removal above is enabled
        if len(names) == 0:
            break
    print("Search successfully!")
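The commented-out `remove` call hints at a speed-up: stop looking for a name once its file has been found, and stop reading files once every name has been found. Removing items from a list while looping over that same list skips elements, so the sketch below tracks the outstanding names in a set and loops over a snapshot of it. This is an illustrative variant, not the original script: `search_and_copy` and its `read_text` parameter are hypothetical, and it assumes each name appears in only one file.

```python
import os
from shutil import copy


def search_and_copy(score_dir, result_dir, names, read_text):
    """Copy the first file containing each name to result_dir/<name>.pdf.

    read_text is any callable that takes an open binary file and returns
    its text (for example a pdfminer-based reader). Returns the set of
    names that were not found.
    """
    os.makedirs(result_dir, exist_ok=True)
    remaining = set(names)
    for file_name in sorted(os.listdir(score_dir)):
        if not remaining:
            break  # every name has been found, no need to read more files
        path = os.path.join(score_dir, file_name)
        with open(path, "rb") as f:
            content = read_text(f)
        # Iterate over a snapshot so that discarding from the set is safe.
        for name in list(remaining):
            if name in content:
                copy(path, os.path.join(result_dir, name + ".pdf"))
                remaining.discard(name)
    return remaining
```

Passing the reader in as a parameter keeps the search logic independent of the PDF library, so it can be tried out with plain text files before pointing it at the real grade reports.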
Summary
This was a chance to review some related syntax: os.listdir(score_dir) and copy(os.path.join(score_dir, i), os.path.join("./results/", search_name + ".pdf")).
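As a quick refresher on those two calls, here is a small self-contained sketch that uses a temporary directory instead of the real scores folder. The helper `copy_all` and the file names are made up for illustration. `os.listdir` returns bare entry names, so each one has to be joined with its directory before it can be opened or copied, and `shutil.copy` returns the path of the new file.

```python
import os
import tempfile
from shutil import copy


def copy_all(src_dir, dst_dir):
    """Copy every entry of src_dir into dst_dir and return the new paths.

    Assumes src_dir contains only regular files.
    """
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for entry in sorted(os.listdir(src_dir)):  # bare names, no directory part
        src_path = os.path.join(src_dir, entry)
        copied.append(copy(src_path, os.path.join(dst_dir, entry)))
    return copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "scores")
        os.makedirs(src)
        with open(os.path.join(src, "0001.pdf"), "wb") as f:
            f.write(b"placeholder")
        print(copy_all(src, os.path.join(tmp, "results")))
```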
This article is from cnblogs (博客园), author: CuriosityWang. When reposting, please cite the original link: https://www.cnblogs.com/curiositywang/p/16456547.html