About the Biquge scraper

Tested in 2024: this pirate site is no longer accessible.

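I recently learned XPath, so I used Biquge for practice.
The code may be a bit messy.
I'm still learning and working on improving it.
For now, this post is a record of where things stand.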

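I wrote a single-book version and a multi-book version.
For now only the single-book version is recorded here.
The rest will be synced later.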

Version	Synced	Updated	Comments
----	----	----	----
Single book	✔	2021.2.18	✘
Multiple books	✘	Unknown	✘

Single book

from lxml import etree
import requests
from bs4 import BeautifulSoup

# Novel URL (here, the table-of-contents page for the novel 元尊)
url = 'https://www.52bqg.net/book_103752/'

res = requests.get(url)
res.raise_for_status()  # stop here if the page could not be fetched
print('Page fetched, parsing...')

ress = BeautifulSoup(res.text, 'lxml')

html = etree.HTML(str(ress))

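# Every <dd> in the chapter list holds one chapter link
result = html.xpath('//div[@class="box_con"]/div[@id="list"]/dl/dd')
num = 1       # counter for the first 12 <dd> entries
zuihou = []   # "last": links from the first 12 entries, appended at the end
shunxu = []   # "order": the remaining links, in page order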

urltou = 'https://www.52bqg.net/book_103752/'
for i in result:
    href = i.xpath('./a/@href')
    if not href:
        # skip <dd> entries that contain no link
        continue
    if num <= 12:
        num = num + 1
        zuihou.append(urltou + href[0])
    else:
        shunxu.append(urltou + href[0])

# Put the first 12 links after all the others
shunxu.extend(zuihou)

for url in shunxu:
    res = requests.get(url)
    ress = BeautifulSoup(res.text, 'lxml')
    html = etree.HTML(str(ress))
    # Chapter title (the page's <h1>)
    heading = '『{}』'.format((html.xpath('//h1/text()'))[0])
    print(heading + ' fetched')

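    # Chapter text
    txt = html.xpath('//div[@id="content"]/text()')
    if not txt:
        # nothing to write for this chapter
        continue

    a = len(txt)  # number of text nodes

    # Drop the last node when it contains 'ps'
    if 'ps' in txt[a-1]:
        num = a-2
    else:
        num = a-1
    # Append to a file named after the chapter title
    with open(heading, 'a', encoding='utf-8', newline='') as file:
        for i in range(num):
            # txt[0] is skipped; split/join strips all whitespace in the node
            file.write(''.join(txt[1+i].split()))
            file.write('\n')

    print('Chapter text fetched')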
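
Since the site can no longer be reached, the two pure-Python steps of the script (reordering the chapter links and cleaning the chapter text) can still be tried offline. Below is a minimal sketch of those steps as standalone functions. The function names `order_chapters` and `clean_chapter` and the `latest_count` parameter are made up for illustration and do not appear in the script above; `urljoin` is swapped in for the plain string concatenation, on the assumption that the hrefs are relative to the book page.

```python
from urllib.parse import urljoin


def order_chapters(hrefs, base_url, latest_count=12):
    """Return absolute chapter URLs with the first `latest_count` links moved to the end.

    `hrefs` holds one entry per <dd>: the href string, or None when the
    <dd> has no link. This mirrors the zuihou/shunxu split in the script.
    (Hypothetical helper, not part of the original code.)
    """
    # Drop entries without a link, then make every link absolute.
    links = [urljoin(base_url, h) for h in hrefs if h]
    return links[latest_count:] + links[:latest_count]


def clean_chapter(txt):
    """Apply the same cleanup as the script to one chapter's text nodes.

    (Hypothetical helper, not part of the original code.)
    """
    if not txt:
        return []
    # Drop the last node when it contains 'ps', as the script does.
    end = len(txt) - 2 if 'ps' in txt[-1] else len(txt) - 1
    # Skip txt[0] and strip all whitespace inside each remaining node.
    return [''.join(node.split()) for node in txt[1:end + 1]]
```

For example, `order_chapters(['3.html', '2.html', '1.html'], 'https://www.52bqg.net/book_103752/', latest_count=2)` moves the first two links behind the third, and `clean_chapter` can be fed the list returned by `html.xpath('//div[@id="content"]/text()')`.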

posted @ 2021-02-18 19:37 X同学
