[949] Using re to extract unstructured tables of PDF files
Here is the problem: this unstructured table in a PDF file cannot be extracted directly as a table. We can only extract the full text of each page.
My task is to extract the Place ID, Place Name, and Title Details. Only Title Details entries that match a pattern like 00XXX0000 (numbers + letters + numbers) are kept.
Another issue: the extracted text contains stray \n or \n\n sequences.
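As a rough sketch of the two ideas above (marking the line breaks, then keeping only numbers + letters + numbers tokens), something like the following could work. The sample text, the `mark_cell_breaks` and `lot_plans` helper names, and the `\b` word boundaries are assumptions made for illustration; real text extracted from the PDF will look different.

```python
import re

# Hypothetical page text, invented for illustration; real extracted text will differ.
sample = (
    "1\n \nHoward War Memorial\n \nCnr William and \nSteley Streets Howard"
    "\n \n12RP34567, Reserve R123\n \n2\n \n"
)

def mark_cell_breaks(text, marker="#####"):
    """Replace each "\\n \\n" with a visible marker, then drop the remaining newlines."""
    text = text.replace("\n \n", marker)
    return text.replace("\n", "")

# numbers + letters + numbers, e.g. 12RP34567 (the word boundaries are an added assumption)
LOT_PLAN = re.compile(r"\b(\d+)([A-Za-z]+)(\d+)\b")

def lot_plans(text):
    """Return (lot, plan) pairs for every token shaped like digits + letters + digits."""
    return [(lot, letters + number) for lot, letters, number in LOT_PLAN.findall(text)]

flat = mark_cell_breaks(sample)
print(flat)
print(lot_plans(flat))
```

Tokens such as `R123`, which do not start with a number, are skipped by this pattern.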
The script:
import re
import PyPDF2
import pandas as pd

# Specify the path to the PDF file
pdf_path = r"D:\Bingnan_Li\01_Tasks\11_20231109_PDF_reading\Planning_LGA\Fraser Coast Regional Council\DOCSHBCC__3131535_v6_Cover_sheet_of_Local_Heritage_Register_.pdf"

# Extract all the text from the PDF file page by page
with open(pdf_path, "rb") as file:
    # Create a PDF reader object
    # (PyPDF2 3.x renamed these calls to PdfReader, reader.pages[i] and extract_text())
    pdf_reader = PyPDF2.PdfFileReader(file)
    page_text = ""
    # From Page 2 to Page 6 (getPage uses zero-based indices)
    for i in range(2, 7):
        page = pdf_reader.getPage(i)
        page_text += page.extractText()

a = page_text

# In order to match the text better, we replace the "\n" and "\n \n"
a = a.replace("\n \n", "#####")
a = a.replace("\n", "")
# Delete the "*" in the text
a = a.replace("*", "")

# Try to match text like this:
# "#####1#####Howard War Memorial#####Cnr William and#####Steley Streets Howard#####Refer to Queensland Heritage Register Place ID 600545#####A, B, D, E, G#####2##########"
# (###[#]+[\d]{1,3}###[#]+) tries to match "#####1#####"
# (.*?) tries to match the middle part
# (###[#]+[\d]{1,3}###[#]+) tries to match "#####2##########"
# [\d]{1,3} means a number with 1, 2 or 3 digits
pattern = r"(###[#]+[\d]{1,3}###[#]+)(.*?)(###[#]+[\d]{1,3}###[#]+)"

# Create an empty DataFrame
df = pd.DataFrame(columns=["ID", "Heritage Name", "Lot", "Plan", "LotPlan"])

# Get all the matches.
# We cannot use re.findall() here, because its matches never overlap: the closing
# delimiter of one record ("#####2##########") is also the opening delimiter of the
# next record, so that next record would be missed.
# So each time we only find the first match, then move the string to the right
# and search for the first match again. In the end we get all the matches.
while True:
    match = re.search(pattern, a)
    if not match:
        break
    print(match.group(1), match.group(2))

    # From the Title Details, we need to match the lot and the plan
    pattern_2 = r"([0-9]+)([a-zA-Z]+)([0-9]+)"
    matches_2 = re.findall(pattern_2, match.group(2))
    for m_2 in matches_2:
        # Add this information to the DataFrame
        df.loc[len(df)] = [
            match.group(1).replace("#", ""),
            match.group(2).split("#####")[0],
            m_2[0],
            m_2[1] + m_2[2],
            m_2[0] + m_2[1] + m_2[2],
        ]

    # Step back 20 characters so the closing delimiter can start the next match
    a = a[match.end() - 20:]

# drop_duplicates() returns a new DataFrame, so the result must be assigned
df = df.drop_duplicates()
df.index = range(len(df))
df
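The comment in the script about re.findall() can be checked on a small string. The sketch below uses a made-up `text` in the same delimiter layout, and it restarts the search at `match.start(3)` (the start of the closing delimiter) instead of stepping back a fixed 20 characters; that restart point is a variation chosen for the short example, not what the script does.

```python
import re

# Small made-up string in the same "#####N#####" layout that the script expects.
text = "#####1#####Alpha#####2#####Beta#####3#####Gamma#####4#####"
pattern = r"(###[#]+[\d]{1,3}###[#]+)(.*?)(###[#]+[\d]{1,3}###[#]+)"

# findall consumes the closing delimiter of each match,
# so the record that begins at that delimiter is skipped.
found = [middle for _, middle, _ in re.findall(pattern, text)]

# Search-and-restart: resume at the closing delimiter so it can open the next record.
looped = []
rest = text
while True:
    match = re.search(pattern, rest)
    if not match:
        break
    looped.append(match.group(2))
    rest = rest[match.start(3):]

print(found)   # every second record is missing
print(looped)  # all records
```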
Another example:
import re

m = "#####234#####Alex Smith, and Jay#####12#######Lucy, Lily, and Jerry#####134########Tim, Tom, and Tommy#####1#######" + \
    "Alex Smith233, and Jay#####233#######Lucy, Lily, and Jerry233#####34########Tim, Tom, and Tommy23233#####14#######"

p = r"(##[#]+[\d]{1,3}##[#]+)(.*?)(##[#]+[\d]{1,3}##[#]+)"

while True:
    tmp = re.search(p, m)
    if not tmp:
        break
    # Full match: opening delimiter + middle part + closing delimiter
    print(tmp.groups()[0] + tmp.groups()[1] + tmp.groups()[2])
    # The same three parts with the "#" characters stripped from the delimiters
    print(tmp.groups()[0].replace("#", ""), tmp.groups()[1], tmp.groups()[2].replace("#", ""))
    print()
    # Step back 20 characters so the closing delimiter can start the next match
    m = m[tmp.span()[1]-20:]
Output:
#####234#####Alex Smith, and Jay#####12#######
234 Alex Smith, and Jay 12

#####12#######Lucy, Lily, and Jerry#####134########
12 Lucy, Lily, and Jerry 134

#####134########Tim, Tom, and Tommy#####1#######
134 Tim, Tom, and Tommy 1

#####1#######Alex Smith233, and Jay#####233#######
1 Alex Smith233, and Jay 233

#####233#######Lucy, Lily, and Jerry233#####34########
233 Lucy, Lily, and Jerry233 34

#####34########Tim, Tom, and Tommy23233#####14#######
34 Tim, Tom, and Tommy23233 14
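The loop steps back 20 characters after each match so that the closing delimiter can open the next record. One possible alternative, sketched here and not part of the original approach, is to put the closing delimiter inside a lookahead `(?=...)`. The lookahead checks and captures the delimiter without consuming it, so a single re.findall() call can walk through every record. The names `delim`, `p_lookahead` and `rows` are made up for this sketch.

```python
import re

m = "#####234#####Alex Smith, and Jay#####12#######Lucy, Lily, and Jerry#####134########Tim, Tom, and Tommy#####1#######" + \
    "Alex Smith233, and Jay#####233#######Lucy, Lily, and Jerry233#####34########Tim, Tom, and Tommy23233#####14#######"

# Same delimiter as in the example above
delim = r"##[#]+[\d]{1,3}##[#]+"

# The closing delimiter sits in a lookahead, so it is captured but not consumed
# and is still available as the opening delimiter of the next match.
p_lookahead = rf"({delim})(.*?)(?=({delim}))"

rows = [
    (opening.replace("#", ""), middle, closing.replace("#", ""))
    for opening, middle, closing in re.findall(p_lookahead, m)
]

for row in rows:
    print(*row)
```

This should give the same six (number, text, number) triples as the loop, without depending on the fixed offset of 20.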