Web Scraper Practice

There is a real treasure of a website that collects and organizes many of 木原音濑's works. The original site's font is tiny and hard on the eyes when reading on a phone, so I wrote a scraper to export everything as txt files.

import requests
from bs4 import BeautifulSoup

r = requests.get("")  # homepage of the original site
soup = BeautifulSoup(r.text, "html.parser")
contents = soup.find_all("p")

# Export each chapter as a txt file
i = 0
for c in contents:
    if c.a:
        i += 1
        series = c.strong.text.replace('/', ' ')  # '/' is not allowed in filenames
        books = c.find_all("a")
        j = 0
        for b in books:
            if b.attrs['href'].split('.')[-1] == "jpg":  # skip image links
                continue
            j += 1
            filename = str(i) + "-" + series + "-" + str(j) + "-" + b.text + ".txt"
            res = requests.get(b.attrs['href'])
            res.encoding = 'gb18030'  # the mojibake here took me forever to sort out
            book_soup = BeautifulSoup(res.text, 'html.parser')
            for body in book_soup.find_all('body'):
                s = body.get_text()
                print(s)
                with open(filename, 'w', encoding='utf-8') as f:
                    f.write(s)
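The mojibake came from requests falling back to ISO-8859-1 when the server sends no charset header, while the pages are actually GB-encoded. A minimal stdlib-only sketch of what goes wrong and why decoding as gb18030 fixes it (the sample string here is just illustrative):

```python
def decode_gb_page(raw: bytes) -> str:
    """Decode raw HTTP body bytes from a GB-encoded page.

    gb18030 is a superset of gbk and gb2312, so it safely covers
    all three encodings commonly used by older Chinese sites.
    """
    return raw.decode('gb18030')

# The bytes as they actually arrive over the wire:
raw = "木原音濑".encode('gb18030')

mojibake = raw.decode('latin-1')   # requests' fallback guess -> garbled text
fixed = decode_gb_page(raw)        # explicit decoding -> readable text
print(fixed)
```

An alternative is setting `res.encoding = res.apparent_encoding`, which sniffs the charset from the body bytes, but it is only a best guess, so hard-coding gb18030 is safer when the site's encoding is known.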

  

 

posted @ 徐钏