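Python BeautifulSoup4 Crawler Basics and Multithreading Study Notes

A basic crawler exercise against Cui Qingcai's (崔庆才) practice site, https://ssr1.scrape.center

Sites I learned from: w3school, 92python, runoob

BeautifulSoup official documentation: readthedocs

Total time spent: 2 hours (the code is at the end)

What I learned: the threading library, the time library, the json library, the BeautifulSoup4 scraping library, and basic Python syntax

Study advice: I started with zero crawler experience and did not watch any tutorial videos. I only kept the official bs4 documentation open, which is fairly detailed. The key is to press F12 and inspect the page's DOM structure, and to use a search engine a lot. I list the questions I searched for below for reference. The libraries in this area are mature, so none of this is hard; being able to apply them is enough.

Pitfalls I ran into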

4. How to write a loop that runs 10 times in Python

for i in range(10):
    print("123")

5. Python ternary operator

Python ternary (conditional) operator usage explained

The ternary (conditional) operator is written with if else in the following form:

exp1 if condition else exp2
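As a small illustration of the form above, here is a sketch that guards against a missing value before calling a method on it, which is how the crawler code at the end uses the operator. The helper name `clean` is made up for this example.

```python
def clean(text):
    # Hypothetical helper: value_if_true if condition else value_if_false.
    # Returns the stripped text, or "" when text is None or empty.
    return text.strip() if text else ""


print(clean("  1993-07-26  "))
print(repr(clean(None)))
```

The condition is evaluated first, so `text.strip()` is never called on `None`.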

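6. Garbled Chinese output from json.dumps

By default json.dumps escapes non-ASCII characters as \uXXXX sequences, so Chinese text looks garbled in the output. Add the parameter ensure_ascii=False: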

json.dumps(data, indent=4, ensure_ascii=False)
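To make the difference visible, the sketch below serializes the same dictionary twice. The sample record is invented for the example and is not taken from the crawl output.

```python
import json

# Sample record made up for illustration.
data = {"name": "霸王别姬", "score": 9.5}

escaped = json.dumps(data)                       # non-ASCII becomes \uXXXX escapes
readable = json.dumps(data, ensure_ascii=False)  # characters are kept as they are

print(escaped)
print(readable)
```

Both strings decode back to the same data with `json.loads`; only the on-disk readability differs.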

7. Rounding to two decimal places in Python

Rounding to two decimal places in Python - psztswcbyy - 博客园
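The linked article is not reproduced here, so the following is only a minimal sketch of two common ways to do it with the standard library. The variable names are illustrative.

```python
elapsed = 3.14159

rounded = round(elapsed, 2)    # returns a float
formatted = f"{elapsed:.2f}"   # returns a string

# round() drops trailing zeros because the result is a float;
# string formatting always keeps two digits after the point.
short_float = round(2.5, 2)
short_text = f"{2.5:.2f}"

print(rounded, formatted, short_float, short_text)
```

The crawler code at the end uses `round(end-start, 2)` to print the elapsed time.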

8. Python encoding error UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 33: illegal..

file = open("data.json", "w")
# change to
file = open("data.json", "w", encoding="utf-8")
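The sketch below reproduces the cause under one assumption: that the platform default encoding is GBK (as on many Chinese-locale Windows systems), which is what `open(..., "w")` would then use implicitly. The sample string and the temporary path are made up for the example.

```python
import os
import tempfile

# A non-breaking space (\xa0) often appears in scraped HTML text.
text = "score:\xa09.5"

try:
    text.encode("gbk")   # GBK has no mapping for \xa0
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# Passing encoding="utf-8" explicitly avoids depending on the platform default.
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, encoding="utf-8") as f:
    round_trip = f.read()

print(gbk_ok, round_trip == text)
```

Using `with` also closes the file automatically, even if the write raises.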

9. Using re.compile() for regular expressions

Using re.compile() for regular expressions - 精灵码农 - CSDN博客
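As a rough sketch of how `re.compile()` could be applied to this crawl, the example below precompiles a pattern once and reuses it to pull the running time out of strings such as "171 分钟". The pattern, the constant `MINUTES`, and the helper `parse_minutes` are assumptions for illustration; the crawler code at the end uses `strip` instead.

```python
import re

# Hypothetical pattern: capture the digits that precede "分钟" (minutes).
MINUTES = re.compile(r"(\d+)\s*分钟")


def parse_minutes(text):
    # search() scans the whole string; match() would only test the start.
    match = MINUTES.search(text)
    return int(match.group(1)) if match else None


print(parse_minutes("171 分钟"))
print(parse_minutes("no duration here"))
```

Compiling once is mainly a readability and reuse choice here: the compiled object can be shared by every call instead of repeating the pattern string.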

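10. Single-threaded vs. multithreaded timing comparison

The synopsis of each movie lives on its own detail page, so 100 movies mean 100 detail pages; together with the 10 listing pages that is 110 pages in total.

Elapsed time, single thread (110 pages)

Elapsed time, one thread per listing page (110 pages)

Elapsed time, single thread, with the detail-page synopsis fetch commented out (10 pages)

Elapsed time, one thread per listing page, with the detail-page synopsis fetch commented out (10 pages)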
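The comparison can be sketched without touching the network. In the example below, `fake_fetch` is a stand-in that sleeps instead of calling `requests.get`, and the function names and the 0.1 second delay are assumptions chosen for illustration, so the timings it prints say nothing about the real site.

```python
import threading
import time


def fake_fetch(page, results):
    # Stand-in for a page download: sleeping simulates waiting on the network.
    time.sleep(0.1)
    results.append(page)  # list.append is safe to call from several threads in CPython


def run_sequential(pages):
    results = []
    start = time.perf_counter()
    for page in pages:
        fake_fetch(page, results)
    return time.perf_counter() - start, results


def run_threaded(pages):
    results = []
    threads = [threading.Thread(target=fake_fetch, args=(p, results)) for p in pages]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every worker before stopping the clock
    return time.perf_counter() - start, results


pages = range(1, 6)
seq_time, seq_results = run_sequential(pages)
thr_time, thr_results = run_threaded(pages)
print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

Threads help here because the work is waiting, not computing: while one thread sleeps (or waits for a response), the others can proceed. The threaded results may arrive in any order.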

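11. How to use the developer tools

Open your browser and go to Scrape | Movie
Press F12 to open the developer console
Right-click the content you are interested in on the page and choose Inspect
You can see that the content we want sits in an h2 tag with class="m-b-sm"; look up the relevant methods in the bs4 documentation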
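Once the inspector shows the tag and class, the lookup in bs4 might look like the sketch below. The HTML fragment is hand-written to imitate what the inspector shows (the movie title and the `/detail/1` link are assumptions), and it uses the built-in `html.parser` so that only bs4 itself needs to be installed.

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration; the real page has more markup around it.
html = """
<a class="name" href="/detail/1">
  <h2 class="m-b-sm">霸王别姬 - Farewell My Concubine</h2>
</a>
"""

soup = BeautifulSoup(html, "html.parser")

# class is a reserved word in Python, so bs4 spells the keyword class_.
title = soup.find("h2", class_="m-b-sm")
chinese_name, english_name = [part.strip() for part in title.string.split("-")]

# Attributes are read with dictionary-style access, not .string.
link = soup.find("a", class_="name")["href"]

print(chinese_name, english_name, link)
```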

My code

import requests
import threading
import time
import json
from bs4 import BeautifulSoup

# @ Split the URL to crawl into parts
host = 'https://ssr1.scrape.center'
api = '/page/{}'
page = 1
data = []


def MySpider(host, api, page):
    web = requests.get(host+api.format(page))

    # @ Parser recommended by the official docs
    web = BeautifulSoup(web.text, 'lxml')

    # print(web.title)

    # @ Use the browser's F12 developer tools to inspect the DOM and find the class of the layer we care about
    res = web.find_all(class_="el-card__body")

    # print(res.__len__())
    # print(res[0])

    # @ Ten entries per page; res is a list
    for item in res:
        # ^ Read the movie name
        Name = item.find(class_="name")
        name = Name.h2.string.split('-')
        # @ strip() removes surrounding whitespace (or the given characters)
        chineseName = name[0].strip()
        englishName = name[1].strip()
        # print(chineseName, englishName)
        # ^ Read the movie's detail-page URL
        href = host+Name['href']
        # print(href)
        # ^ Read the cover image URL
        # @ src is an attribute, so it is not read through .string
        imgSrc = item.find('img', class_="cover")['src']
        # print(imgSrc)
        # ^ Read the category tags
        tags = []
        Tags = item.find_all(class_="category")
        for tag in Tags:
            tags.append(tag.span.string)
        # print(tags)
        # ^ Read the remaining info
        Info = item.find_next(class_="info")
        Infos = Info.find_all('span')
        address = Infos[0].string
        # @ Named duration so the local variable does not shadow the time module
        duration = float(Infos[2].string.strip(' 分钟'))
        # ^ Move on to the next info div
        Info = Info.find_next(class_="info")
        # @ Ternary operator: fall back to "" when there is no release date
        release = Info.span.string.strip(' 上映') if Info.span else ""
        # print(address, duration, release)
        # ^ Read the score (with type conversion)
        score = float(item.find(class_='score').string)
        # print(score)
        # ^ Read the plot synopsis from the detail page
        WebDetail = requests.get(href)
        WebDetail = BeautifulSoup(WebDetail.text, 'lxml')
        detail = WebDetail.find(class_="drama").p.string.strip()
        # print(detail)
        # ^ test
        print("Movie:", chineseName, address, score)
        data.append(
            dict(
                chineseName=chineseName,
                englishName=englishName,
                tags=tags,
                address=address,
                time=duration,
                release=release,
                score=score,
                desc=detail,
                imgSrc=imgSrc
            )
        )


# ^ Measure elapsed time
start = time.perf_counter()
threads = []
# @ The listing pages are numbered 1 to 10, so start the range at 1
for page in range(1, 11):
    # MySpider(host, api, page)
    t = threading.Thread(target=MySpider, args=(host, api, page))
    threads.append(t)
    t.start()

# ^ Wait for all worker threads to finish before the main thread continues
for t in threads:
    t.join()

end = time.perf_counter()
print("Total time:", round(end-start, 2), "seconds")

print(data)
# ^ Pretty-print the JSON and keep non-ASCII characters unescaped
data = json.dumps(data, indent=4, ensure_ascii=False)
# ^ encoding="utf-8" prevents errors when saving
with open("data.json", "w", encoding="utf-8") as file:
    file.write(data)
print(data)

Posted 2022-03-12 02:56 by 小能日记
