Text Parsing for Large Language Models

PDF


pdfplumber

pdfplumber processes a PDF page by page. It can extract a page's text, its tables, and more.

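import pdfplumber

# Raw string, so the backslashes in the Windows path are not read as escapes
with pdfplumber.open(r"E:\新员工\【学员讲义】企业文化.pdf") as pdf:
    page01 = pdf.pages[9]  # pages is 0-indexed, so this is the 10th page
    text = page01.extract_text()  # extract the page text
    print(text)

    table = page01.extract_tables()  # extract every table on the page
    print(table)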

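Advantages:

1. pdfplumber gives easy access to detailed information about every object in a PDF, and its text and table extraction methods are high-level yet customizable: you can tune the parameters to suit the shape of a particular table.

2. Most importantly, the author of pdfplumber still actively maintains the library, whereas the equally popular PyPDF2 is no longer maintained (its development continues under the name pypdf).

Disadvantages:

1. Extracted English text can come out with no spaces between words.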
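
Both points above, the tunable extraction methods and the missing spaces, come down to parameters. The sketch below is a minimal illustration, not a recommended configuration. It passes a smaller `x_tolerance` to `extract_text` (pdfplumber treats characters closer together than this many points as part of one word, so lowering it from the default of 3 often restores spaces) and explicit `table_settings` to `extract_tables`. The helper names `normalize_table` and `extract_page` and the value 1.5 are assumptions made for illustration.

```python
def normalize_table(rows):
    """Replace empty (None) cells with '' and flatten line breaks inside cells."""
    return [
        ["" if cell is None else str(cell).replace("\n", " ") for cell in row]
        for row in rows
    ]


def extract_page(pdf_path, page_index, x_tolerance=1.5):
    """Return (text, tables) for one page; page_index is 0-based."""
    import pdfplumber  # imported lazily so normalize_table works without it

    # Assumed starting point: detect cells from ruling lines drawn in the PDF.
    table_settings = {
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
    }
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        text = page.extract_text(x_tolerance=x_tolerance) or ""
        tables = [
            normalize_table(table)
            for table in page.extract_tables(table_settings=table_settings)
        ]
    return text, tables
```

For tables drawn without ruling lines, switching both strategies to "text" is usually the next thing to try.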

PyPDF2

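PyPDF2 is a pure-Python PDF library. It can read document information (title, author, and so on) and write, split, and merge PDF documents; it can also add watermarks to PDFs and encrypt or decrypt them.

Documentation: https://pythonhosted.org/PyPDF2 and https://www.w3cschool.cn/pypdf2/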

import PyPDF2

def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as file:
        # PdfReader, .pages and extract_text() are the current names; the older
        # PdfFileReader, getPage and extractText were removed in PyPDF2 3.0
        pdf_reader = PyPDF2.PdfReader(file)
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text

pdf_path = 'example.pdf'
extracted_text = extract_text_from_pdf(pdf_path)
print(extracted_text)


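Version: 2.6.0

Disadvantages:

1. English text is extracted poorly: words come out as separate letters, for example "S o l a r".

2. Tables are not recognized.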
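
One way to live with the letter-spacing problem is to repair the text after extraction. The following is a heuristic sketch only, not part of PyPDF2: it assumes the damage looks exactly like the "S o l a r" example (single letters separated by single spaces) and joins any run of three or more such letters. The function name `collapse_spaced_letters` is hypothetical.

```python
import re

# Three or more single letters separated by single spaces, e.g. "S o l a r".
_SPACED_WORD = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")


def collapse_spaced_letters(text):
    """Join runs of space-separated single letters back into one word."""
    return _SPACED_WORD.sub(lambda m: m.group(0).replace(" ", ""), text)


print(collapse_spaced_letters("S o l a r energy"))  # prints "Solar energy"
```

The heuristic cannot tell where one damaged word ends and the next begins, so two adjacent spaced-out words separated by a single space would be fused into one. Treat it as a stopgap rather than a fix.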

pdf2docx

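A Python library that converts PDF files to docx.

The project extracts the data in a PDF file with PyMuPDF, parses the layout of the content (paragraphs, images, tables, and so on) with its own rules, and then generates the docx file with python-docx.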

from pdf2docx import parse

# Raw strings, so the backslashes in the Windows paths are not read as escapes
pdf_file = r'E:\新员工\【学员讲义】企业文化.pdf'
docx_file = r'E:\新员工\【学员讲义】企业文化2.docx'

# convert pdf to docx
parse(pdf_file, docx_file)
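
`parse` converts the whole file in one call. When only part of a document is needed, pdf2docx also provides a `Converter` class whose `convert` method accepts 0-based `start` and `end` page indexes. The sketch below assumes that API; `default_docx_path` and `convert_pages` are hypothetical helper names, and the import is deferred so the path helper can be used on its own.

```python
from pathlib import Path


def default_docx_path(pdf_path):
    """Put the .docx next to the source PDF, keeping the file name."""
    return Path(pdf_path).with_suffix(".docx")


def convert_pages(pdf_path, docx_path=None, start=0, end=None):
    """Convert pages [start, end) of a PDF to docx; page indexes are 0-based."""
    from pdf2docx import Converter  # imported lazily; requires pdf2docx

    target = Path(docx_path) if docx_path else default_docx_path(pdf_path)
    converter = Converter(str(pdf_path))
    try:
        converter.convert(str(target), start=start, end=end)
    finally:
        converter.close()  # release the underlying PyMuPDF document
    return target
```

For example, `convert_pages("example.pdf", start=0, end=3)` would write the first three pages to example.docx.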

PDFminer

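PDFMiner ships with two command-line tools, pdf2txt.py and dumppdf.py. pdf2txt.py extracts all the text content of a PDF file, but it cannot recognize text that has been drawn as an image; that requires image recognition (OCR). To parse an encrypted PDF you must supply its password, and from a PDF whose permissions forbid extraction you get no text at all.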

https://pdfminersix.readthedocs.io

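For each LTPage object, the elements are traversed from top to bottom, and each one is identified as one of the following components:

LTFigure: an area of the page that holds a figure or image.
LTTextContainer: a group of text lines inside a rectangular area (a paragraph). It is further broken down into a list of LTTextLine objects, each of which is a list of LTChar objects that store a single character together with its metadata.
LTRect: a two-dimensional rectangle, which can be used within an LTPage to frame placeholder areas, panels, figures, or tables.

So once the page has been reconstructed in Python, its elements can be classified as LTFigure (images or figures), LTTextContainer (text), or LTRect (rectangles, which usually indicate a table).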
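
The top-to-bottom traversal deserves a concrete illustration, because PDF coordinates start at the bottom-left corner of the page: the element with the largest `y1` is the one nearest the top. The sketch below sorts elements accordingly and, in `describe_pages`, lists the class name of every top-level element returned by pdfminer.six's `extract_pages`. The function names and the stand-in demo objects are assumptions made for illustration.

```python
from types import SimpleNamespace


def top_to_bottom(elements):
    """Sort layout elements in reading order: top first, then left to right.

    A larger y1 (the element's top edge) means the element sits higher
    on the page, so y1 is sorted in descending order.
    """
    return sorted(elements, key=lambda el: (-el.y1, el.x0))


def describe_pages(pdf_path):
    """Yield (page_number, class_name, bbox) for every top-level element."""
    from pdfminer.high_level import extract_pages  # requires pdfminer.six

    for page_number, page in enumerate(extract_pages(pdf_path), start=1):
        for element in top_to_bottom(page):
            yield page_number, type(element).__name__, element.bbox


# Stand-in objects (not real pdfminer elements) to show the ordering.
demo = [
    SimpleNamespace(name="footer", x0=50, y1=40),
    SimpleNamespace(name="title", x0=50, y1=780),
    SimpleNamespace(name="body", x0=50, y1=700),
]
print([el.name for el in top_to_bottom(demo)])  # ['title', 'body', 'footer']
```

On a real file, `for row in describe_pages("example.pdf"): print(row)` shows at a glance which of the three element kinds a page contains.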

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

# Open the PDF file
with open('example.pdf', 'rb') as file:
    # Create a PDFResourceManager object
    resource_manager = PDFResourceManager()
    # Create a StringIO object to hold the extracted text
    output = StringIO()
    # Create a TextConverter object
    converter = TextConverter(resource_manager, output, laparams=LAParams())
    # Create a PDFPageInterpreter object
    interpreter = PDFPageInterpreter(resource_manager, converter)
    # Parse the document page by page
    for page in PDFPage.get_pages(file):
        interpreter.process_page(page)
    converter.close()
    # Get the extracted text
    text = output.getvalue()
    print(text)


# The high-level API does the same in a single call
from pdfminer.high_level import extract_text
extracted_text = extract_text('example.pdf')
print(extracted_text)


Pdfminer.six

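Pdfminer.six is a fork of PDFMiner.

It is a tool for parsing and extracting information from PDF documents. It focuses on getting and analyzing text data, metadata, and images, and it can also report the exact position, font, or color of text. Pdfminer.six extracts a page's text directly from the source code of the PDF. It is built in a modular way, so each of its components can be replaced easily.

Official documentation: https://pdfminersix.readthedocs.io/en/latest/index.html

Code reference: https://mp.weixin.qq.com/s/5OQGOT1rllEAE8_L-v4L9A

In a brief trial it did not seem especially powerful.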

pymupdf

import fitz  # PyMuPDF

def MuPDF_extract_text_from_pdf(path):
    doc = fitz.open(path)
    all_content = []
    # Collect the plain text of each page in order
    for page in doc.pages():
        all_content.append(page.get_text())
    doc.close()
    text = '\n'.join(all_content)
    # Uncomment to strip every line break from the result:
    # text = ''.join(text.split('\n'))
    return text


papermage

from papermage.recipes import CoreRecipe

recipe = CoreRecipe()
doc = recipe.run("example.pdf")
for page in doc.pages:
    for row in page.rows:
        print(row.text)

xpdf

Xpdf is a free PDF viewer and toolkit. It includes a text extractor, an image converter, an HTML converter, and other tools, most of which are open source.

http://www.xpdfreader.com/

img2table

https://mp.weixin.qq.com/s/kPxiw4sgrr8Z3dHUt60dyw

Combining the modules above

https://mp.weixin.qq.com/s/4mg59Sb7TzaoXVctEMJVWw  Extracting text from PDFs with Python (including tables and OCR on images)

# Read the PDF
import PyPDF2
# Analyze the PDF layout and extract text
from pdfminer.high_level import extract_pages, extract_text
from pdfminer.layout import LTTextContainer, LTChar, LTRect, LTFigure
# Extract text from tables in the PDF
import pdfplumber
# Extract images from the PDF
from PIL import Image
from pdf2image import convert_from_path
# Run OCR to extract text from images
import pytesseract
# Remove the intermediate files created along the way
import os

# Crop an image element out of a PDF page
def crop_image(element, pageObj):
    # Bounding box of the element; in PDF coordinates y0 is the bottom edge and y1 the top
    [image_left, image_bottom, image_right, image_top] = [element.x0, element.y0, element.x1, element.y1]
    # Shrink the page's media box to that bounding box
    pageObj.mediabox.lower_left = (image_left, image_bottom)
    pageObj.mediabox.upper_right = (image_right, image_top)
    # Put the cropped page into a new PDF
    cropped_pdf_writer = PyPDF2.PdfWriter()
    cropped_pdf_writer.add_page(pageObj)
    # Save the cropped PDF to a new file
    with open('cropped_image.pdf', 'wb') as cropped_pdf_file:
        cropped_pdf_writer.write(cropped_pdf_file)

# Render the first page of a PDF file to a PNG image
def convert_to_images(input_file):
    images = convert_from_path(input_file)
    image = images[0]
    output_file = "PDF_image.png"
    image.save(output_file, "PNG")

# Extract text from an image with OCR
def image_to_text(image_path):
    # Read the image
    img = Image.open(image_path)
    # Run OCR on the image
    text = pytesseract.image_to_string(img)
    return text

# Extract one table from a page

def extract_table(pdf_path, page_num, table_num):
    # Open the PDF file; the context manager closes it again
    with pdfplumber.open(pdf_path) as pdf:
        # Find the page being examined
        table_page = pdf.pages[page_num]
        # Extract the requested table
        table = table_page.extract_tables()[table_num]
    return table

# Convert a table into a string
def table_converter(table):
    table_string = ''
    # Iterate over each row of the table
    for row in table:
        # Remove line breaks from wrapped text and write empty cells as 'None'
        cleaned_row = ['None' if item is None else item.replace('\n', ' ') for item in row]
        # Join the cells with '|' and end the row with '\n'
        table_string += ('|' + '|'.join(cleaned_row) + '|' + '\n')
    # Drop the trailing newline
    table_string = table_string[:-1]
    return table_string


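# Extract the text and the formatting of a text element

def text_extraction(element):
    # Extract the text of the element
    line_text = element.get_text()

    # Inspect the formatting of the text:
    # collect every font name and size that appears in the element
    line_formats = []
    for text_line in element:
        if isinstance(text_line, LTTextContainer):
            # Iterate over each character in the text line
            for character in text_line:
                if isinstance(character, LTChar):
                    # Append the character's font name
                    line_formats.append(character.fontname)
                    # Append the character's font size
                    line_formats.append(character.size)
    # Keep the unique font names and sizes
    format_per_line = list(set(line_formats))

    # Return a tuple of the text and its formats
    return (line_text, format_per_line)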

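# Path of the PDF to process
pdf_path = r'E:\5.pdf'


# Create a PDF file object
pdfFileObj = open(pdf_path, 'rb')
# Create a PDF reader object
pdfReaded = PyPDF2.PdfReader(pdfFileObj)

# Dictionary that collects the extracted text of each page
text_per_page = {}
# Walk through the pages of the PDF
for pagenum, page in enumerate(extract_pages(pdf_path)):

    # Initialize the variables needed to extract text from the page
    pageObj = pdfReaded.pages[pagenum]
    page_text = []
    line_format = []
    text_from_images = []
    text_from_tables = []
    page_content = []
    # Index of the table currently being examined
    table_num = 0
    first_element = True
    table_extraction_flag = False
    # Open the PDF file with pdfplumber
    pdf = pdfplumber.open(pdf_path)
    # Find the page being examined
    page_tables = pdf.pages[pagenum]
    # Find the tables on this page
    tables = page_tables.find_tables()
    # The tables have been located, so release the file
    pdf.close()

    # Collect all elements of the page together with their top edge
    page_elements = [(element.y1, element) for element in page]
    # Sort the elements from the top of the page to the bottom
    page_elements.sort(key=lambda a: a[0], reverse=True)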

    # Go through the elements that make up the page
    for i, component in enumerate(page_elements):
        # Position of the top edge of the element in the PDF
        pos = component[0]
        # The layout element itself
        element = component[1]

        # Check whether the element is a text element
        if isinstance(element, LTTextContainer):
            print(i, 'LTTextContainer')
            # Check whether the text belongs to a table
            if not table_extraction_flag:
                # Extract the text and formatting of the text element
                (line_text, format_per_line) = text_extraction(element)
                # Append the text to the page text
                page_text.append(line_text)
                # Append the formatting of the text
                line_format.append(format_per_line)
                page_content.append(line_text)
            else:
                # Skip text that appears inside a table
                pass

        # Check whether the element is an image
        if isinstance(element, LTFigure):
            print(i, 'LTFigure')
            # Crop the image out of the PDF
            crop_image(element, pageObj)
            # Convert the cropped PDF to an image
            convert_to_images('cropped_image.pdf')
            # Extract the text from the image
            image_text = image_to_text('PDF_image.png')
            text_from_images.append(image_text)
            page_content.append(image_text)
            # Add a placeholder to the text and format lists
            page_text.append('image')
            line_format.append('image')
        # Check whether the element is a rectangle, which may belong to a table
        if isinstance(element, LTRect):
            print(i, 'LTRect')
            # First rectangle of a table that has not been extracted yet
            if first_element and (table_num + 1) <= len(tables):
                # Extract the table
                table = extract_table(pdf_path, pagenum, table_num)
                # Convert the table into a structured string
                table_string = table_converter(table)
                # Append the table string to the lists
                text_from_tables.append(table_string)
                page_content.append(table_string)
                # Set the flag so the table's own text is skipped
                table_extraction_flag = True
                # The following rectangles belong to the same table
                first_element = False
                # Add a placeholder to the text and format lists
                page_text.append('table')
                line_format.append('table')

            # The table ends when the next element is not a rectangle
            # (or when this rectangle is the last element on the page)
            if i + 1 >= len(page_elements) or not isinstance(page_elements[i + 1][1], LTRect):
                table_extraction_flag = False
                first_element = True
                table_num += 1

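    # Build the dictionary key for this page
    dctkey = 'Page_' + str(pagenum)
    # Store the lists collected for this page under that key
    text_per_page[dctkey] = [page_text, line_format, text_from_images, text_from_tables, page_content]

# Close the PDF file object
pdfFileObj.close()

# Delete the intermediate files, if any were created
for temp_file in ('cropped_image.pdf', 'PDF_image.png'):
    if os.path.exists(temp_file):
        os.remove(temp_file)

# Show the table text of the third page (this assumes the PDF has at least three pages).
# Each value of text_per_page is
# [page_text, line_format, text_from_images, text_from_tables, page_content]
result = ''.join(text_per_page['Page_2'][3])
print(result)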
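
To make the table format produced by the script concrete, here is a standalone sketch of the row-flattening step applied to a small hand-written table. The function name `table_to_text` and the sample rows are made up for illustration; the nested-list shape matches what pdfplumber's `extract_tables` returns for one table, with `None` standing for an empty cell.

```python
def table_to_text(table):
    """Flatten one extracted table into a '|'-delimited line per row."""
    lines = []
    for row in table:
        # Empty cells become the string 'None'; wrapped text loses its line breaks
        cells = ["None" if cell is None else cell.replace("\n", " ") for cell in row]
        lines.append("|" + "|".join(cells) + "|")
    return "\n".join(lines)


rows = [["Name", "Role"], ["Li\nWei", None]]
print(table_to_text(rows))
# |Name|Role|
# |Li Wei|None|
```

Keeping one row per line with explicit delimiters preserves the row and column structure when the text is later split into chunks for a language model.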


doc docx

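python-docx: its advantage is that it can parse docx documents. Its drawback is that it cannot parse doc documents directly; they must first be converted to docx.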

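# pip install python-docx
from docx import Document

file_path = 'example.docx'  # placeholder path
doc = Document(file_path)

# Iterate over every paragraph in the document
text = []
for paragraph in doc.paragraphs:
    text.append(paragraph.text)
print('\n'.join(text))
print('*' * 200)

# Iterate over every table in the document
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)
print('&' * 50)

# List the images embedded in the document. Inline shapes carry no file name,
# so read the image parts from the document's relationships instead
for rel in doc.part.rels.values():
    if 'image' in rel.reltype:
        print(rel.target_ref)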
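
The text above says that doc files must be converted to docx first but does not name a tool. One common option, assumed here rather than taken from the original, is LibreOffice in headless mode. The sketch builds the command line separately from running it; the function names are hypothetical, and the executable name `soffice` may differ by platform.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the LibreOffice command line that converts one file to docx."""
    return [
        soffice,
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def doc_to_docx(doc_path, out_dir="."):
    """Convert a .doc file and return the path of the resulting .docx."""
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".docx")
```

The returned path can then be passed straight to `Document(...)` from python-docx.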


docx2pdf

from docx2pdf import convert
file = r'xx.docx'
convert(file, "output.pdf")

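Layout analysis

The role of text segmentation models in document parsing

Distinguishing single-column from two-column layouts

New tools I have not studied yet

PyMuPDF4LLM: a revolution in PDF extraction

Extracting structured knowledge from unstructured PDFs with an LLM (PyMuPDF4LLM)

A PDF document extraction and parsing API that uses state-of-the-art OCR and models served through Ollama

The references below cover many more methods; see them for details.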

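References:

https://mp.weixin.qq.com/s/9L_LwJvwn_F9C89J-yYezA  Summary of open-source document parsing tools in the LLM era, with technical reflections (covers many file types; very detailed code)

https://zhuanlan.zhihu.com/p/344384506  A complete guide to working with PDFs in Python: pdfplumber and PyPDF2

https://mp.weixin.qq.com/s/gCU1hYmmHpqiV9APHotrYA  Converting PDF to Word in just two lines of code

https://mp.weixin.qq.com/s/W1TciuOp4FTBU09LHQYptQ  How do AI document assistants process PDFs?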

Posted on 2023-09-05 09:08 by 努力的孔子
