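Preface
This article shares and summarizes ways to use Python to read common file types such as PDF, Word, Excel, PPT, CSV, and TXT, and to extract all of their text.
The libraries and methods shown here are certainly not the only ones that can read these files. The code in this article has one main goal: a convenient way to extract all the text from a file.
For more on how to use these libraries, consult their official documentation.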
Reading PDF text: PyPDF2
| import PyPDF2 |
| |
| def read_pdf_to_text(file_path): |
|     with open(file_path, 'rb') as pdf_file: |
|         pdf_reader = PyPDF2.PdfReader(pdf_file) |
|         contents_list = [] |
|         for page in pdf_reader.pages: |
|             # extract_text() returns the text of a single page |
|             content = page.extract_text() |
|             contents_list.append(content) |
|     return '\n'.join(contents_list) |
| |
| read_pdf_to_text('xxx.pdf') |
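Pages without a text layer (scanned images, for example) can come back as empty strings, which leaves blank lines in the joined result. Below is a minimal sketch of one way to skip them. The helper name `join_page_texts` is hypothetical, and it is written against any object that has an `extract_text()` method rather than against PyPDF2 specifically.

```python
def join_page_texts(pages, separator='\n'):
    """Join the text of each page, skipping pages with no extractable text.

    `pages` is assumed to be an iterable of objects exposing extract_text(),
    such as the pages of a PDF reader. This helper is illustrative only.
    """
    texts = []
    for page in pages:
        text = page.extract_text()
        # Skip None, empty strings, and whitespace-only pages
        if text and text.strip():
            texts.append(text.strip())
    return separator.join(texts)
```

With the function from this section, it could be called as `join_page_texts(pdf_reader.pages)` inside the `with` block.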
Reading Word text: docx2txt
Requires: pip install docx2txt python-docx
| import os |
| |
| import docx2txt |
| from docx import Document |
| |
| def convert_doc_to_docx(doc_file, docx_file): |
|     # Caveat: python-docx only opens .docx (Office Open XML) files, so this |
|     # fails on a legacy binary .doc; convert those with an external tool first. |
|     doc = Document(doc_file) |
|     doc.save(docx_file) |
| |
| def read_docx_to_text(file_path): |
|     # docx2txt.process returns the document text as a single string |
|     text = docx2txt.process(file_path) |
|     return text |
| |
| |
| if __name__ == '__main__': |
|     source_file = '***.doc' |
|     file_path = os.path.dirname(source_file) |
|     file_name = os.path.splitext(os.path.basename(source_file))[0] |
|     if source_file.endswith('.doc'): |
|         docx_file = os.path.join(file_path, file_name + '.docx') |
|         convert_doc_to_docx(source_file, docx_file) |
|     else: |
|         docx_file = source_file |
|     read_docx_to_text(docx_file) |
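If installing a third-party library is not an option, note that a .docx file is a ZIP archive whose main body is stored in `word/document.xml`. The following is a rough standard-library sketch that reads the body paragraphs directly. The function name `read_docx_paragraphs` is hypothetical, and the sketch ignores headers, footers, and footnotes, so treat it as an illustration rather than a replacement for docx2txt.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

def read_docx_paragraphs(file_path):
    """Return the text of each non-empty paragraph in the document body.

    `file_path` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(file_path) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    paragraphs = []
    for para in root.iter(W_NS + 'p'):
        # A paragraph's text is spread across one or more <w:t> elements
        text = ''.join(node.text or '' for node in para.iter(W_NS + 't'))
        if text:
            paragraphs.append(text)
    return paragraphs
```

Joining the result with `'\n'.join(...)` gives a single string comparable to the other readers in this article.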
Reading Excel text: pandas
pandas can of course read more than Excel files; it also handles CSV, JSON, and other formats.
| import pandas as pd |
| |
| def read_excel_to_text(file_path): |
|     excel_file = pd.ExcelFile(file_path) |
|     sheet_names = excel_file.sheet_names |
|     text_list = [] |
|     for sheet_name in sheet_names: |
|         # Parse each sheet into a DataFrame and render it as plain text |
|         df = excel_file.parse(sheet_name) |
|         text = df.to_string(index=False) |
|         text_list.append(text) |
|     return '\n'.join(text_list) |
| |
| read_excel_to_text('xxx.xlsx') |
Reading PPT text: pptx (the python-pptx package)
| from pptx import Presentation |
| |
| def read_pptx_to_text(file_path): |
|     prs = Presentation(file_path) |
|     text_list = [] |
|     for slide in prs.slides: |
|         for shape in slide.shapes: |
|             # Only shapes with a text frame carry text |
|             if shape.has_text_frame: |
|                 text_frame = shape.text_frame |
|                 text = text_frame.text |
|                 if text: |
|                     text_list.append(text) |
|     return '\n'.join(text_list) |
| |
| read_pptx_to_text('xxx.pptx') |
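Reading CSV, TXT, and other plain text: just open() and read()
| def read_txt_to_text(file_path): |
|     # Assumes the file is UTF-8; change the encoding if yours differs |
|     with open(file_path, 'r', encoding='utf-8') as f: |
|         text = f.read() |
|     return text |
| |
| read_txt_to_text('xxx.csv') |
| read_txt_to_text('xxx.txt') |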
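A text-mode `open()` decodes with a single encoding, so a file saved in a different one raises `UnicodeDecodeError` or comes back garbled. Here is a minimal sketch that tries a list of candidate encodings in order. The function name `read_text_with_fallback` and the default candidates (`utf-8`, then `gbk`) are assumptions chosen for illustration; pick the encodings your files actually use.

```python
def read_text_with_fallback(file_path, encodings=('utf-8', 'gbk')):
    """Return the file's text, decoded with the first encoding that works."""
    with open(file_path, 'rb') as f:
        raw = f.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # This candidate cannot decode the bytes; try the next one
            continue
    raise ValueError(f'could not decode {file_path!r} with any of {encodings}')
```

Order matters here: a permissive encoding placed first may decode bytes "successfully" into the wrong characters, so stricter encodings such as UTF-8 should come first.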
Reading any file format
| # Map each supported extension to the reader function defined above |
| support = { |
|     'pdf': read_pdf_to_text, |
|     'docx': read_docx_to_text, |
|     'xlsx': read_excel_to_text, |
|     'pptx': read_pptx_to_text, |
|     'csv': read_txt_to_text, |
|     'txt': read_txt_to_text, |
| } |
| |
| def read_any_file_to_text(file_path): |
|     file_suffix = file_path.split('.')[-1] |
|     # Look up the reader by extension and call it |
|     func = support.get(file_suffix) |
|     if func is None: |
|         return 'This file format is not supported yet' |
|     text = func(file_path) |
|     return text |
| |
| read_any_file_to_text('xxx.pdf') |
| read_any_file_to_text('xxx.docx') |
| read_any_file_to_text('xxx.xlsx') |
| read_any_file_to_text('xxx.pptx') |
| read_any_file_to_text('xxx.csv') |
| read_any_file_to_text('xxx.txt') |
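Taking the suffix with `file_path.split('.')[-1]` is case-sensitive (`xxx.PDF` would not match `pdf`) and returns the whole path when the file name contains no dot. A small sketch of a more tolerant lookup key, using `pathlib` from the standard library, is shown below. The helper name `get_file_suffix` is hypothetical.

```python
from pathlib import Path

def get_file_suffix(file_path):
    """Return the lowercase extension without the dot, or '' if there is none."""
    # Path.suffix looks only at the final path component
    return Path(file_path).suffix.lower().lstrip('.')
```

Inside `read_any_file_to_text`, the result of `get_file_suffix(file_path)` could be passed to `support.get(...)` in place of the split-based suffix.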