Web Scraper Practice

There is a real treasure of a website that collects and organizes many of 木原音濑's works. The original site's font is tiny and hard on the eyes when reading on a phone, so I wrote a scraper to export everything as txt files.

import requests
from bs4 import BeautifulSoup

r = requests.get("")  # homepage of the original site
soup = BeautifulSoup(r.text, "html.parser")
contents = soup.find_all("p")

# Export each chapter as a txt file
i = 0
for c in contents:
    if c.a:
        i += 1
        series = c.strong.text.replace('/', ' ')  # '/' is not allowed in filenames
        books = c.find_all("a")
        j = 0
        for b in books:
            if b.attrs['href'].split('.')[-1] == "jpg":  # skip image links
                continue
            j += 1
            filename = str(i) + "-" + series + "-" + str(j) + "-" + b.text + ".txt"
            res = requests.get(b.attrs['href'])
            res.encoding = 'gb18030'  # the mojibake here took me forever to sort out
            book_soup = BeautifulSoup(res.text, 'html.parser')
            for body in book_soup.find_all('body'):
                s = body.get_text()
                print(s)
                with open(filename, 'w', encoding='utf-8') as f:
                    f.write(s)
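The mojibake came from requests falling back to ISO-8859-1 when the server sends no charset header, while the pages are actually GB-encoded. A minimal stdlib-only sketch of what goes wrong and why decoding as gb18030 fixes it (the sample string here is just illustrative):

```python
def decode_gb_page(raw: bytes) -> str:
    """Decode raw HTTP body bytes from a GB-encoded page.

    gb18030 is a superset of gbk and gb2312, so it safely covers
    all three encodings commonly used by older Chinese sites.
    """
    return raw.decode('gb18030')

# The bytes as they actually arrive over the wire:
raw = "木原音濑".encode('gb18030')

mojibake = raw.decode('latin-1')   # requests' fallback guess -> garbled text
fixed = decode_gb_page(raw)        # explicit decoding -> readable text
print(fixed)
```

An alternative is setting `res.encoding = res.apparent_encoding`, which sniffs the charset from the body bytes, but it is only a best guess, so hard-coding gb18030 is safer when the site's encoding is known.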

  

 

posted @ 徐钏