使用 Python 爬取豆瓣电影 Top250 多页数据

使用 Python 爬取豆瓣电影 Top250 多页数据

创建时间:2024-08-11

一、完整代码

'''
抓取单贞数据 中的评分 简介 评价人数
将上面的改为多页
https://movie.douban.com/top250?start=0
'''
import requests
from lxml import etree

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0'
}
url = 'https://movie.douban.com/top250?start=0'

pag_num = int(input('请输入需要抓取的页数,一页25条数据,共10页:')) * 25 - 25
if pag_num <= 0:
    print('输入错误!!')
    exit(-1)
for num in range(0, pag_num + 25, 25):
    url = f'https://movie.douban.com/top250?start={num}'
    # print(url)

    res = requests.get(url, headers=header)
    # print(res.status_code)
    res.encoding = res.apparent_encoding
    tree = etree.HTML(res.text)
    movies_name = tree.xpath('//ol[@class="grid_view"]//img/@alt')
    movies_start = tree.xpath('//ol[@class="grid_view"]//span[@class="rating_num"]/text()')
    movies_intro = tree.xpath('//ol[@class="grid_view"]//span[@class="inq"]//text()')
    movies_evaluators = tree.xpath('//ol[@class="grid_view"]//div[@class="star"]/span[4]/text()')

    for name, start, intro, evaluator in zip(movies_name, movies_start, movies_intro, movies_evaluators):
        movie = name + ',' + start + ',' + intro + ',' + evaluator + '\n'
        print(f'{name}-----下载成功~~~~')
        with open('top250.txt', 'a+', encoding='utf-8') as file:
            file.write(movie)

1.1 效果

二、代码解析

2.1 设置了请求的头部信息

模拟正常的浏览器访问,以避免被网站识别为爬虫并阻止访问。

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0'
}

2.2 参数设置

根据输入,通过循环构建不同的 URL 来访问不同的页面。

pag_num = int(input('请输入需要抓取的页数,一页 25 条数据,共 10 页:')) * 25 - 25
if pag_num <= 0:
    print('输入错误!!')
    exit(-1)
for num in range(0, pag_num + 25, 25):
    url = f'https://movie.douban.com/top250?start={num}'

2.3 解析数据

在获取到每个页面的 HTML 内容后,使用 lxml 的 etree 模块解析页面,并通过 xpath 表达式提取出电影的名称、评分、简介和评价人数等信息。

res = requests.get(url, headers=header)
res.encoding = res.apparent_encoding
tree = etree.HTML(res.text)
movies_name = tree.xpath('//ol[@class="grid_view"]//img/@alt')
movies_start = tree.xpath('//ol[@class="grid_view"]//span[@class="rating_num"]/text()')
movies_intro = tree.xpath('//ol[@class="grid_view"]//span[@class="inq"]//text()')
movies_evaluators = tree.xpath('//ol[@class="grid_view"]//div[@class="star"]/span[4]/text()')

2.4 合并保存

将提取到的每部电影的信息组合成一条记录,并保存到一个名为 top250.txt 的文件中。

for name, start, intro, evaluator in zip(movies_name, movies_start, movies_intro, movies_evaluators):
    movie = name + ',' + start + ',' + intro + ',' + evaluator + '\n'
    print(f'{name}-----下载成功~~~~')
    with open('top250.txt', 'a+', encoding='utf-8') as file:
        file.write(movie)
posted @ 2024-08-11 13:51  随风小屋  阅读(13)  评论(0编辑  收藏  举报