[1035] Extract the content from online PDF file or PDF URL

Certainly! When working with online PDFs using the pyPDF2 library in Python, you can retrieve the content from a PDF file hosted at a URL. Let’s explore a couple of ways to achieve this:

Using requests (Python 3.x and higher): If you’re using Python 3.x (which is recommended), you can use the requests library to fetch the PDF content and then read it directly using pyPDF2. Here’s an example:

import io
import requests
from pyPDF2 import PdfReader

url = "https://www.example.com/sample.pdf"
response = requests.get(url, timeout=120)
on_fly_mem_obj = io.BytesIO(response.content)
pdf_file = PdfReader(on_fly_mem_obj)

# Now you can work with the PDF content

Replace "https://www.example.com/sample.pdf" with the actual URL of the PDF you want to read.

Remember to handle exceptions (such as network errors or invalid URLs) appropriately in your code. Also, adjust the code snippets according to your specific use case.

Feel free to choose the method that suits your Python version and requirements! If you have any more questions or need further assistance, feel free to ask. 📄🐍😊 Learn more ¹ ² ³ ⁴

import io
import requests
from PyPDF2 import PdfReader

url = "https://ntepa.nt.gov.au/__data/assets/pdf_file/0003/287130/wastehandler_epl156_darwin_toilet_hire.pdf"
response = requests.get(url, timeout=120)
on_fly_mem_obj = io.BytesIO(response.content)
pdf_reader = PdfReader(on_fly_mem_obj)
# Extract text from the first page
first_page = pdf_reader.pages[0]

extracted_text = " ".join(first_page.extract_text().split("\n"))

pattern = re.escape("Description ") + r"(.*?)$"

info = re.findall(pattern, extracted_text, re.MULTILINE | re.DOTALL)[0].strip()

print(info)

Darwin Toilet Hire (ABN 24 633 497 293) collects nightsoil from portable toilets and transports the waste to Power and Water  Corporation treatment facilities.

posted on 2024-07-18 12:01 McDelfino 阅读(3) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

alex_bn_lee

导航

公告

[1035] Extract the content from online PDF file or PDF URL