Extracting Text from PDFs with Python

The Problem

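Today the Academic Affairs Office exported the grades for our entire year, and one look gave me a shock: our names are not in the file names. Do we really have to open the files one by one? That would be absurd, because there are more than 800 grade reports.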

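No worries. Life is short, I use Python. The idea is simple: use pdfminer to read the text inside each PDF file, and what remains is a search problem with time complexity \(O(n)\) in the number of files.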
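To make the search step concrete, here is a minimal sketch that assumes the text of every PDF has already been extracted into a dictionary. The function name `find_matches`, the dictionary shape, and the sample data are all made up for illustration; they are not part of the original script.

```python
def find_matches(texts, names):
    """Map each name to the first file whose text contains it.

    texts: dict of {filename: extracted text} (hypothetical input shape)
    names: list of names to look for
    """
    matches = {}
    # One pass over the files: linear in the number of documents
    for filename, text in texts.items():
        for name in names:
            if name not in matches and name in text:
                matches[name] = filename
    return matches


# Made-up sample data standing in for text extracted from the PDFs
sample_texts = {
    "0001.pdf": "Name: 张三  Math: 90",
    "0002.pdf": "Name: 李四  Math: 85",
    "0003.pdf": "Name: 赵六  Math: 70",
}
result = find_matches(sample_texts, ["张三", "李四", "王五"])
print(result)
```

A name that appears in no file is simply absent from the returned dictionary, which makes it easy to see who is still missing.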

Code

  • Directory structure

Here, scores holds the grade reports for the whole college, results stores the search results, and main.py is the core code.

import os
from shutil import copy
from io import StringIO
# process_pdf is the pdfminer3k API; pdfminer.six may not provide it
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

score_dir = "./scores"  # forward slash avoids the invalid "\s" escape sequence


def readPdf(pdf_file):
    """Return all text found in a PDF opened in binary mode."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr=rsrcmgr, outfp=retstr, laparams=laparams)

    process_pdf(rsrcmgr=rsrcmgr, device=device, fp=pdf_file)
    device.close()

    content = retstr.getvalue()
    retstr.close()

    return content


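if __name__ == '__main__':

    # names to look for (renamed so the built-in `list` is not shadowed)
    names = ["王五", "张三", "李四"]

    # make sure the output directory exists before copying into it
    os.makedirs("./results", exist_ok=True)

    file_list = os.listdir(score_dir)

    for i in file_list:

        # read pdf content
        with open(os.path.join(score_dir, i), "rb") as f:
            content = readPdf(f)

        # search name in content; iterate over a copy so removal is safe
        for search_name in names[:]:
            if search_name in content:
                # save to results
                copy(os.path.join(score_dir, i), os.path.join("./results/", search_name + ".pdf"))
                # save time: a name that was found need not be searched again
                names.remove(search_name)

        # stop early once every name has been found
        if len(names) == 0:
            break

    if names:
        print("Not found:", names)
    else:
        print("Search finished, all names found!")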
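The script above relies on `process_pdf`. As an alternative sketch, assuming pdfminer.six is the installed package, its `pdfminer.high_level.extract_text` function can read a file in a single call. The helper names `read_pdf_text` and `contains_name`, and the `extractor` parameter, are hypothetical and only there so the logic can be tried without a real PDF.

```python
def read_pdf_text(path, extractor=None):
    """Return the text of the PDF at `path`.

    `extractor` is a hypothetical hook: any callable that takes a path and
    returns a string. When omitted, pdfminer.six's extract_text is used.
    """
    if extractor is None:
        # Imported lazily so this module loads even without pdfminer.six
        from pdfminer.high_level import extract_text
        extractor = extract_text
    return extractor(path)


def contains_name(path, name, extractor=None):
    """Check whether `name` appears in the text extracted from `path`."""
    return name in read_pdf_text(path, extractor)
```

With pdfminer.six installed, a call such as `contains_name("./scores/some_file.pdf", "张三")` would do the same check as the inner loop of main.py; the file name here is a placeholder.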

Summary

This was a chance to review some related syntax: os.listdir(score_dir) and copy(os.path.join(score_dir, i), os.path.join("./results/", search_name + ".pdf")).
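As a quick refresher on those two calls, the sketch below runs them inside a temporary directory so nothing on disk is touched. The file names are placeholders, not the real exported reports.

```python
import os
import tempfile
from shutil import copy

with tempfile.TemporaryDirectory() as root:
    scores = os.path.join(root, "scores")
    results = os.path.join(root, "results")
    os.makedirs(scores)
    os.makedirs(results)

    # Create a placeholder file standing in for one exported PDF
    with open(os.path.join(scores, "0001.pdf"), "wb") as f:
        f.write(b"placeholder")

    # os.listdir returns bare file names, without the directory prefix
    listed = os.listdir(scores)

    # shutil.copy returns the path of the newly created file
    dest = copy(os.path.join(scores, listed[0]), os.path.join(results, "found.pdf"))
    copied = os.listdir(results)

print(listed, copied)
```

Because os.listdir returns only names, each one has to be joined with the directory again before opening or copying, which is why main.py calls os.path.join(score_dir, i) in both places.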

posted @ 2022-07-07 23:02  CuriosityWang