trafilatura
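trafilatura is a Python library designed for extracting the core content of web pages.
It is particularly suited to applications that need the main text of an HTML page, such as the article body and title, while leaving out navigation bars, advertisements, sidebars, and other non-essential content.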
Installation
pip install trafilatura
Example
import trafilatura

# URL of the page to process
url = "https://www.cnblogs.com/baby123/p/18755330"

# Download the page
downloaded = trafilatura.fetch_url(url)

# Extract the main text content
result = trafilatura.extract(downloaded)
print(result)
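The example above hands the result of fetch_url straight to extract. As far as I know, trafilatura.fetch_url returns None when the download fails, and trafilatura.extract returns None when it finds no main content, so a slightly more defensive variant may be useful. The following is only a minimal sketch: the helper name `fetch_and_extract` and its `fetch` / `extract` parameters are hypothetical and exist mainly so the logic can be exercised without network access.

```python
from typing import Callable, Optional


def fetch_and_extract(
    url: str,
    fetch: Optional[Callable[[str], Optional[str]]] = None,
    extract: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Download a page and return its main text, or None if either step fails.

    `fetch` and `extract` are hypothetical injection points; when omitted,
    trafilatura.fetch_url and trafilatura.extract are used.
    """
    if fetch is None or extract is None:
        # Imported lazily so the helper can be tested with stand-in functions.
        import trafilatura

        fetch = fetch or trafilatura.fetch_url
        extract = extract or trafilatura.extract

    downloaded = fetch(url)
    if downloaded is None:
        # Download failed: nothing to extract.
        return None
    # May still be None if no main content was detected.
    return extract(downloaded)


# Typical use (performs a real download):
# text = fetch_and_extract("https://www.cnblogs.com/baby123/p/18755330")
# print(text if text is not None else "nothing extracted")
```

Returning None for both failure cases keeps the calling code simple; if you need to tell a failed download apart from an empty extraction, the two checks can be reported separately.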
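For sites that load their content dynamically, you may need to fetch the fully rendered HTML with a tool such as Playwright or Selenium first, and then pass it to Trafilatura for extraction.
The trade-off is speed: launching a browser and waiting for the page to render is slower than a plain HTTP download.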
import asyncio
import time

import trafilatura
from playwright.async_api import async_playwright


async def fetch_dynamic_content(browser, url):
    page = await browser.new_page()
    try:
        await page.goto(url)
        # Wait for 'networkidle' so that dynamically loaded content is present
        await page.wait_for_load_state('networkidle')
        html_content = await page.content()
        return html_content
    finally:
        await page.close()


def extract_core_content(html_content):
    # Extract the core content with Trafilatura
    result = trafilatura.extract(html_content)
    return result


async def main():
    start_count = time.perf_counter()
    urls = ["http://jinan.tianqi.com/"]  # add more URLs to test concurrent fetching
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        tasks = [fetch_dynamic_content(browser, url) for url in urls]
        results = await asyncio.gather(*tasks)
        core_contents = []
        for html_content in results:
            core_content = extract_core_content(html_content)
            core_contents.append(core_content)
            print(core_content)
        await browser.close()
    end_count = time.perf_counter()
    elapsed_time = round(end_count - start_count, 2)
    print(f"Elapsed time for this run: {elapsed_time} s")


# Run the main function
if __name__ == "__main__":
    asyncio.run(main())
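Because browser rendering is the slow path, one possible approach is to try a plain download first and start the browser only when the static HTML yields little or no text. The sketch below illustrates that idea and is not part of Trafilatura or Playwright: `extract_with_fallback`, `render_with_playwright`, and the `min_chars` threshold of 200 are all hypothetical names and values chosen for illustration. It uses Playwright's synchronous API to keep the example short.

```python
from typing import Callable, Optional, Tuple


def extract_with_fallback(
    url: str,
    fetch_static: Callable[[str], Optional[str]],
    fetch_rendered: Callable[[str], Optional[str]],
    extract: Callable[[str], Optional[str]],
    min_chars: int = 200,  # assumed threshold for "enough text"; tune per site
) -> Tuple[Optional[str], str]:
    """Return (text, source), where source is "static" or "rendered"."""
    # Fast path: plain HTTP download.
    html = fetch_static(url)
    text = extract(html) if html else None
    if text and len(text) >= min_chars:
        return text, "static"

    # Slow path: render the page in a browser, then extract again.
    html = fetch_rendered(url)
    text = extract(html) if html else None
    return text, "rendered"


def render_with_playwright(url: str) -> str:
    """Fetch fully rendered HTML using Playwright's sync API."""
    # Imported lazily so the browser dependency is only needed on the slow path.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()


# Typical wiring (performs real network access):
# import trafilatura
# text, source = extract_with_fallback(
#     "http://jinan.tianqi.com/",
#     trafilatura.fetch_url,
#     render_with_playwright,
#     trafilatura.extract,
# )
# print(source, text)
```

A length threshold is a crude signal. Some pages return a long static shell with no real article text, so the check may need to be adapted to the sites you target.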