使用requests及lxml爬取教程示例

很多教程网站都是静态html，爬取起来相对容易，使用requests请求页面后把响应内容保存为html文件即可。
一般爬取步骤如下：

从首页解析出课程列表，包含课程标题和URL链接
请求课程页面，解析出文章列表，包含文章标题和文章URL链接
请求文章页面，将响应内容保存为html文件

分析页面源代码，排除掉导航栏的URL，只提取下面的课程链接，参考XPath为 //div[contains(@class,"w3-row")]//a，

获取课程列表的方法如下

import requests
from lxml import etree

def get_course_list():
    start_url = 'https://www.tizi365.com/'
    course_list = []
    res = requests.get(start_url)
    root = etree.HTML(res.text)
    links = root.xpath('//div[contains(@class,"w3-row")]//a')
    for link in links:
        course_name = ''.join(link.itertext())[1:]  # 获取当前节点所有子节点文本，并去掉第一个图标字符
        course_url = link.attrib['href']  # 获取url链接
        course_list.append((course_name, course_url))
    return course_list

导入pprint使用 pprint(get_course_list()), 输出结果如下：

[('提示词工程（吴恩达）', 'https://www.tizi365.com/topic/1407.html'),
 ('ChatGPT开发教程', 'https://www.tizi365.com/topic/191.html'),
 ('Go语言教程', '/archives/415.html'),
 ('echo教程', '/archives/28.html'),
 ('sqlx教程', '/archives/100.html'),
 ('html模板引擎教程', '/archives/85.html'),
 ('ProtoBuf 入门教程', '/archives/367.html'),
 ('grpc 框架教程', '/archives/391.html'),
 ('Go Micro微服务框架', '/archives/478.html'),
 ('Consul教程', 'https://www.tizi365.com/archives/501.html'),
 ('etcd 教程', 'https://www.tizi365.com/archives/557.html'),
 ('RabbitMQ 教程', 'https://www.tizi365.com/course/1.html'),
 ('Golang RabbitMQ 教程', 'https://www.tizi365.com/topic/17.html'),
 ('Java RabbitMQ 教程', 'https://www.tizi365.com/topic/22.html'),
... ...

获取文章列表

可以看到有部分url不完整，需要拼接上base_url，网站服务地址。
因此我们可以封装一个get方法，补全URL并直接方法响应文本。

def get(url):
    base_url = 'https://www.tizi365.com'
    if not url.startswith('http'):
        url = f'{base_url}{url}'  # 补全URL
    return requests.get(url).text  # 直接返回响应文本

获取文章列表的方法和课程列表类似，请求课程页面，分析出章节菜单的链接，参考XPath: //div[contains(@class,"w3-sidebar")]//a。
参考代码如下：

def get_article_list(course_url):
    article_list = []
    res = get(course_url)
    root = etree.HTML(res)
    links = root.xpath('//div[contains(@class,"w3-sidebar")]//a')
    for link in links:
        article_name = link.text
        article_url = link.attrib['href']
        article_list.append((article_name, article_url))
    return article_list

下载文章
下载文章非常简单，请求文章链接，并保存文件即可，这里我们使用上面封装的get方法，来处理文章URL不全的场景，示例代码如下：

def download_article(article_url, output_file):
    res = get(article_url)
    with open(output_file, 'w') as f:
        f.write(res)

流程组装
首先我们要确定保存的根目录，例如我们保存到脚本同级的data目录中，每个课程新建一个目录，课程的文章保存到该目录中。
我们可以使用 data_dir = os.path.abspath('./data') 得到课程根目录路径，使用 course_dir = os.path.join(data_dir, course_name) 得到课程的目录，
用 os.makedirs(course_dir) 来新建目录（需要判断目录是否已创建），同时我们可以用线程池来多线程下载文章。

import os.path
from concurrent.futures.thread import ThreadPoolExecutor

def main():
    data_dir = os.path.abspath('./data')
    course_list = get_course_list()
    pool = ThreadPoolExecutor(max_workers=10)
    for course_name, course_url in course_list:
        print(f'下载课程 {course_name}')
        output_dir = os.path.join(data_dir, course_name)
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
        article_list = get_article_list(course_url)
        for article_name, article_url in article_list:
            print(f' {article_name} {article_url}')
            output_file = os.path.join(output_dir, f'{article_name.replace(" ", "")}.html')
            pool.submit(download_article, article_url, output_file)

运行main()函数，输出结果如下：

下载课程 提示词工程（吴恩达）
 1.简介 /topic/1407.html
 2.Prompt设计原则 /topic/1450.html
 3.迭代优化Prompt /topic/1498.html
 4.文本概括Prompt设计 /topic/1509.html
 5.利用AI推理能力 /topic/1589.html
 6.文本转换Prompt设计 /topic/1628.html
 7.AI写作Prompt设计 /topic/1659.html
 8.聊天机器人Prompt设计 /topic/1672.html
下载课程 ChatGPT开发教程
 1.OpenAI API 介绍 /topic/191.html
 ... ...

执行完成结果如下图：

posted @ 2023-07-14 11:30 韩志超阅读(262) 评论(1) 编辑收藏举报

刷新页面返回顶部

...

临渊

使用requests及lxml爬取教程示例

公告