https://www.cnblogs.com/longhai3/longhai

PYTHON >> Scraping a Novel with a Web Crawler

Posted by 凡是过去,皆为序曲
import re
import time

import requests
from bs4 import BeautifulSoup

# Example chapter URLs (the host names are truncated in the original post):
# https://wwcom/28_28714/19953985.html
# https://wwwrg/83_83488/28981145.html


def URL00(a):
    """Build the URL of the chapter page with numeric ID `a`."""
    url = 'https://wwworg/83_83488/' + str(a) + '.html'
    return url


def DOWN00(a):
    """Download chapter `a` and return its heading and body as one text block."""
    # A timeout keeps the crawl from hanging forever on a stalled connection.
    strhtml = requests.get(URL00(a), timeout=10)
    strhtml.encoding = "UTF-8"
    soup = BeautifulSoup(strhtml.text, 'lxml')

    # Body text.
    # The selector comes from the browser's developer tools: "Copy" ➔ "Copy Selector".
    data02 = soup.select('#read > div.container > div:nth-child(3) > div > div.panel.panel-default > div.panel-body.content-body.content-ext')
    data02 = str(data02)
    data02 = re.findall(r'>(.*?)</div>', data02, re.S)
    data02 = ''.join(data02) + "\n"

    # Title and chapter heading.
    data01 = soup.select('#read > div.container > div:nth-child(3) > div > div.panel.panel-default > div.panel-heading')
    data01 = str(data01)
    data01 = re.findall(r'">(.*?)</div>', data01, re.S)
    data01 = ''.join(data01) + "\n\n"

    # Heading, body, then a separator line between chapters.
    data = data01 + data02 + "=" * 40 + "\n\n"
    data = data.replace('<br/>', '')
    data = data + "\n"
    return data


def SAVE00(data0):
    """Append one chapter to the output file."""
    # Append mode creates the file when it does not exist yet,
    # so no separate fallback to write mode is needed.
    with open(r"TXT0XZ.txt", 'a', encoding='utf-8') as f:
        f.write(data0)


def JINDU00(n):
    """Draw a 100-character progress bar for percentage `n`, then pause."""
    n = int(n)
    print('\r' + '#' * n + '=' * (100 - n), end="")
    # The pause also throttles the crawl to one request every two seconds.
    time.sleep(2)


if __name__ == "__main__":
    # for i in range(28981145, 28981146):  # single-chapter test run
    for i in range(28981145, 28981337):
        n = (i - 28981145) / (28981337 - 28981145) * 100
        JINDU00(n)
        TXT0 = DOWN00(i)
        SAVE00(TXT0)
    print("\nDone!")
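
JINDU00 computes the bar width, draws the bar, and sleeps, all in one function. As a rough sketch of one way to separate these concerns, the bar can be rendered by a pure function that only returns a string. The name `render_bar` and its `done`, `total`, and `width` parameters are hypothetical and do not appear in the script above.

```python
def render_bar(done, total, width=100):
    """Return a text progress bar of `width` cells for `done` out of `total` items."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Clamp so that an out-of-range count never yields a bar longer than `width`.
    done = max(0, min(done, total))
    filled = done * width // total
    return '#' * filled + '=' * (width - filled)


if __name__ == "__main__":
    # Small demonstration: four items, eight-cell bar, one line per step.
    for finished in range(5):
        print(render_bar(finished, 4, width=8))
```

Counting finished chapters rather than the index of the chapter about to be fetched lets the bar reach its full width after the last chapter, which the percentage passed to JINDU00 never quite does.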
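
DOWN00 turns the selected elements back into a string and then recovers the text with regular expressions. Beautiful Soup can also return the text directly through `get_text`, which may be less fragile when the markup changes slightly. The sketch below is only an illustration under several assumptions: `SAMPLE_HTML` is an invented stand-in for a chapter page (the real site's markup may differ), the selectors are shortened versions of the ones in the script, `extract_chapter` is a hypothetical name, and the built-in `html.parser` is used so that the example runs without `lxml` or a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a chapter page; the real site's markup may differ.
SAMPLE_HTML = """
<div id="read">
  <div class="panel panel-default">
    <div class="panel-heading">Chapter 1</div>
    <div class="panel-body content-body content-ext">First line.<br/>Second line.</div>
  </div>
</div>
"""


def extract_chapter(html):
    """Return (heading, body) as plain text, or None if either part is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    heading = soup.select_one('#read div.panel-heading')
    body = soup.select_one('#read div.panel-body.content-body')
    if heading is None or body is None:
        return None
    # get_text drops the tags; the separator inserts a newline between the
    # text fragments that <br/> had split apart.
    return heading.get_text(strip=True), body.get_text('\n', strip=True)


if __name__ == "__main__":
    print(extract_chapter(SAMPLE_HTML))
```

Unlike the script, which deletes every `<br/>` and so joins the lines together, this variant keeps the line breaks. Returning None for a page without the expected elements also gives the caller a chance to skip or retry that chapter instead of writing an empty entry.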

 
