trafilatura

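trafilatura is a Python library designed to extract the main content from web pages.

It is especially suited to applications that need the primary text of an HTML page, such as the article body and title, while leaving out navigation bars, ads, sidebars, and other non-essential content.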

Installation

pip install trafilatura

Example

import trafilatura

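# URL of the page to process
url = "https://www.cnblogs.com/baby123/p/18755330"
# Download the page; fetch_url returns None if the download fails
downloaded = trafilatura.fetch_url(url)
if downloaded is None:
    raise SystemExit(f"Could not download {url}")
# Extract the main text; extract returns None if nothing usable is found
result = trafilatura.extract(downloaded)
print(result)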

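For sites that load their content dynamically, you may first need a tool such as Playwright or Selenium to obtain the fully rendered HTML, and then pass that HTML to Trafilatura for extraction.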

This is slower, however, because every page has to be loaded in a browser.
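Given that overhead, one option is to render only the pages that actually need it: try the plain download first, and fall back to the browser when extraction returns little or nothing. The following is a minimal sketch of that idea, not something the library provides. The function name `extract_with_fallback`, its parameters, and the `min_chars` threshold of 200 are all assumptions made for illustration. The fetch and extract steps are passed in as functions so the logic stays independent of any particular tool.

```python
from typing import Callable, Optional


def extract_with_fallback(
    url: str,
    fetch_static: Callable[[str], Optional[str]],
    fetch_rendered: Callable[[str], Optional[str]],
    extract: Callable[[str], Optional[str]],
    min_chars: int = 200,  # hypothetical threshold; tune it for your pages
) -> Optional[str]:
    """Try the cheap static fetch first; render in a browser only if needed."""
    html = fetch_static(url)
    text = extract(html) if html else None
    if text and len(text) >= min_chars:
        return text

    # Static HTML was missing or yielded too little text: pay for rendering.
    rendered = fetch_rendered(url)
    rendered_text = extract(rendered) if rendered else None

    # Prefer the rendered result, but keep the short static text if rendering failed.
    return rendered_text or text
```

With Trafilatura, `fetch_static` could be `trafilatura.fetch_url` and `extract` could be `trafilatura.extract`, while `fetch_rendered` would wrap a browser tool such as Playwright. Pages that extract well from static HTML then never touch the browser.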

import asyncio
from playwright.async_api import async_playwright
import trafilatura
import time

async def fetch_dynamic_content(browser, url):
    page = await browser.new_page()
    try:
        await page.goto(url)
        # Wait for 'networkidle' so that dynamically loaded content is present
        await page.wait_for_load_state('networkidle')
        html_content = await page.content()
        return html_content
    finally:
        await page.close()

def extract_core_content(html_content):
    # Extract the main content with Trafilatura
    result = trafilatura.extract(html_content)
    return result

async def main():
    start_count = time.perf_counter()
    
    urls = ["http://jinan.tianqi.com/"]  # add more URLs here to test parallel processing
    
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        tasks = [fetch_dynamic_content(browser, url) for url in urls]
        
        results = await asyncio.gather(*tasks)
        
        core_contents = []
        for html_content in results:
            core_content = extract_core_content(html_content)
            core_contents.append(core_content)
            print(core_content)
        
        await browser.close()
    
    end_count = time.perf_counter()
    elapsed_time = round(end_count - start_count, 2)
    print(f"Elapsed time: {elapsed_time} s")

# Run the main function
if __name__ == "__main__":
    asyncio.run(main())
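
When more URLs are added to the list, the script above opens one page per URL at the same time, and a single failing page makes `asyncio.gather` raise. A possible refinement, sketched here under assumptions, is to cap concurrency with an `asyncio.Semaphore` and collect failures instead of raising. The helper name `gather_limited` and the default `limit` of 3 are hypothetical and not part of Playwright or Trafilatura.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def gather_limited(
    items: List[str],
    worker: Callable[[str], Awaitable[Any]],
    limit: int = 3,  # hypothetical default; pick what your machine can handle
) -> List[Any]:
    """Run worker(item) for every item, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: str) -> Any:
        async with semaphore:
            return await worker(item)

    # return_exceptions=True keeps one failing item from aborting the rest;
    # failures come back as exception objects in the result list.
    return await asyncio.gather(
        *(guarded(item) for item in items), return_exceptions=True
    )
```

In the script above, the `tasks` list and the `asyncio.gather` call could be replaced by `results = await gather_limited(urls, lambda url: fetch_dynamic_content(browser, url))`. Any entry in `results` that is an exception would then need to be skipped before it is passed to `extract_core_content`.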

 

posted @ 2025-03-19 23:01  慕尘