Web Scraping Practice: Scraping a Web Novel

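This practice project uses the Beautiful Soup library. Beautiful Soup is a Python library for extracting data from HTML or XML. It works with the parser of your choice to parse a whole HTML document and search it quickly.
Before starting, make sure Beautiful Soup is installed in your environment. The script also uses the requests library, so install it as well if it is missing.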

pip install beautifulsoup4
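
As a quick check that the install worked, here is a minimal sketch that parses a small hand-written HTML string instead of a live page. The markup and the paths in it are made up for illustration; only `BeautifulSoup`, `find`, `select`, and `get_text` come from the library.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics a chapter list; not taken from any real site
html = """
<html><body>
  <h1>Sample Novel</h1>
  <ol class="clearfix">
    <li><a href="/chapter/1.html">Chapter 1</a></li>
    <li><a href="/chapter/2.html">Chapter 2</a></li>
  </ol>
</body></html>
"""

# Parse with Python's built-in HTML parser
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag
title = soup.find('h1').get_text(strip=True)

# select() takes a CSS selector and returns every matching tag
links = [(a.get_text(strip=True), a['href']) for a in soup.select('ol.clearfix li a')]

print(title)
for text, href in links:
    print(text, href)
```

If this runs without an `ImportError`, the library is available, and the same `select` pattern should carry over to a real index page once you adjust the selector to that site's markup.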

Code:

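import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# URL of the novel's index page
url = 'https://example.com/index'  # replace with the index page of the novel you want to scrape

# Send an HTTP GET request for the index page
response = requests.get(url, timeout=10)
if response.status_code == 200:
    # Parse the page with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find the chapter links
    chapter_links = soup.select('ol.clearfix li a')  # assumes the links are <a> tags inside <li> items of an <ol class="clearfix">
    print(chapter_links)
    # Open the output file in append mode; the raw string keeps the backslashes
    # literal, and utf-8 avoids encoding errors when writing Chinese text
    with open(r'E:\project\python\Cold.txt', 'a', encoding='utf-8') as f:
        for link in chapter_links:
            # href may be relative, so resolve it against the index URL
            chapter_url = urljoin(url, link['href'])
            chapter_response = requests.get(chapter_url, timeout=10)
            if chapter_response.status_code == 200:
                chapter_soup = BeautifulSoup(chapter_response.text, 'html.parser')
                # Extract the chapter title and text
                chapter_title = chapter_soup.find('h1').text  # assumes the chapter title is in an h1 tag
                f.write(chapter_title + '\n')
                chapter_content = chapter_soup.find('div', id='content').text  # assumes the chapter text is in the div with id 'content'
                f.write(chapter_content + '\n')  # save the extracted text to the file
                # Print the chapter title as a progress indicator
                print(chapter_title)
                # print(chapter_content)
                print('------------end------------')
            else:
                print(f"Failed to retrieve chapter. Status code: {chapter_response.status_code}")
else:
    print(f"Failed to retrieve webpage. Status code: {response.status_code}")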
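
The script above assumes every chapter page has both an h1 tag and a div with id 'content'. If either is missing, `find` returns `None` and reading `.text` raises `AttributeError`. One possible way to handle this, and to make the parsing testable without a network connection, is to separate parsing from downloading. The sketch below does that. `extract_chapter_links` and `extract_chapter` are hypothetical helper names, not part of the original script, and the sample HTML and URLs are invented. Links are resolved with `urllib.parse.urljoin` because `href` values on index pages are often relative.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def extract_chapter_links(index_html, base_url):
    """Return absolute chapter URLs found in the index page HTML."""
    soup = BeautifulSoup(index_html, 'html.parser')
    # Same selector as the script; a[href] skips anchors that have no href
    return [urljoin(base_url, a['href']) for a in soup.select('ol.clearfix li a[href]')]


def extract_chapter(chapter_html):
    """Return (title, content), or None if the expected tags are missing."""
    soup = BeautifulSoup(chapter_html, 'html.parser')
    title_tag = soup.find('h1')
    content_tag = soup.find('div', id='content')
    if title_tag is None or content_tag is None:
        return None
    # Join the text fragments with newlines so <br>-separated lines stay apart
    return title_tag.get_text(strip=True), content_tag.get_text('\n', strip=True)


# Invented sample pages, used only to exercise the helpers
index_html = (
    '<ol class="clearfix">'
    '<li><a href="/book/1.html">Chapter 1</a></li>'
    '<li><a href="2.html">Chapter 2</a></li>'
    '</ol>'
)
chapter_html = '<h1>Chapter 1</h1><div id="content">First line.<br>Second line.</div>'

links = extract_chapter_links(index_html, 'https://example.com/book/index')
chapter = extract_chapter(chapter_html)
missing = extract_chapter('<p>no chapter here</p>')

print(links)
print(chapter)
print(missing)
```

In the main loop, you would pass `response.text` and `chapter_response.text` to these helpers and skip any chapter for which `extract_chapter` returns `None`.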

posted @ 2024-03-20 16:37 a_u
