Scraping Douban Movie Review Data

Created: 2024-08-12

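This post collects the code for scraping Douban movie review data: the cover image, the title, the review text, and the content of each review's detail page.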

1. Full code

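'''
https://movie.douban.com/review/best/
Scrape the cover, title, and review text
Scrape the complete review text, i.e. what is shown after clicking "expand"
Scrape the data on the current review's detail page
Scrape multiple pages of reviews: cover, title, full review text, and detail page data
'''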
import json
import re
import requests
from lxml import etree

url = 'https://movie.douban.com/review/best/'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0'
}
response = requests.get(url, headers=header)
response.encoding = 'utf-8'
html = response.text
tree = etree.HTML(html)
fm_urls = tree.xpath('//div[@class="review-list chart "]//a[@class="subject-img"]/img/@src')
bt_list = tree.xpath('//div[@class="main-bd"]/h2/a/text()')

# Optional: download each cover image (the ./imgs/ folder must already exist)
# for fm, bt in zip(fm_urls, bt_list):
#     res = requests.get(fm, headers=header)
#     with open('./imgs/' + bt + '.jpg', 'wb') as f:
#         f.write(res.content)
#         print(bt + ' saved')
details_list = tree.xpath('//div[@class="main-bd"]/h2/a/@href')
details_urls = []
for i in details_list:
    num = re.findall(r'\d+', i)[0]
    details_url = f'https://movie.douban.com/j/review/{num}/full'
    details_urls.append(details_url)
# Example of a full-text URL: https://movie.douban.com/j/review/15980218/full
for bt, details_url in zip(bt_list, details_urls):
    response = requests.get(details_url, headers=header)
    response.encoding = response.apparent_encoding
    data = json.loads(response.text)
    # print(data['body'])
    datatree = etree.HTML(data['body'])
    details = datatree.xpath('//text()')
    # print(details)
    detailedInfo = '\n' + bt + '\n' + ''.join(details)
    with open('detail.txt', 'a+', encoding='utf-8') as f:
        f.write(detailedInfo)
        print(f'{bt}: full review content downloaded')

    # exit()

2. Code walkthrough

2.1 Basic setup

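Import the required libraries, then set the URL of the Douban review page and a request header whose User-Agent makes the request look like it comes from a real browser.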

import json
import re
import requests
from lxml import etree

url = 'https://movie.douban.com/review/best/'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0'
}

2.2 Send the request and get the page's HTML

response = requests.get(url, headers=header)
response.encoding = 'utf-8'
html = response.text
tree = etree.HTML(html)
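
The request above fetches only the first page of the list, although the comment at the top of the script names multi-page scraping as a goal. A minimal sketch of building the list-page URLs follows. It assumes the list is paginated with a `start` query parameter that grows by 20 per page; both the parameter name and the page size are assumptions and should be checked against the site's own "next page" link. `page_urls` is a hypothetical helper, not part of the original script.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/review/best/'


def page_urls(pages, page_size=20):
    """Return list-page URLs for the first `pages` pages.

    Assumption: the site paginates with ?start=<offset>, where the
    offset grows by `page_size` per page. Verify before relying on it.
    """
    urls = []
    for page in range(pages):
        query = urlencode({'start': page * page_size})
        urls.append(f'{BASE_URL}?{query}')
    return urls


for u in page_urls(3):
    print(u)
```

Each URL can then go through the same request and `etree.HTML` steps as the single page, ideally with a short pause between requests.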

2.3 Extract the cover image URLs and titles with XPath

fm_urls = tree.xpath('//div[@class="review-list chart "]//a[@class="subject-img"]/img/@src')
bt_list = tree.xpath('//div[@class="main-bd"]/h2/a/text()')

2.4 Save the cover images

import os

os.makedirs('./imgs', exist_ok=True)  # open() fails if the folder is missing
for fm, bt in zip(fm_urls, bt_list):
    res = requests.get(fm, headers=header)
    with open('./imgs/' + bt + '.jpg', 'wb') as f:
        f.write(res.content)
        print(bt + ' saved')
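
The loop above uses the review title directly as a file name, so a title containing characters such as `/` or `?` can make `open()` fail. The following is a minimal sketch of one way to guard against that. `safe_filename` and `cover_path` are hypothetical helpers introduced here for illustration, and the set of replaced characters is an assumption based on common Windows and POSIX restrictions.

```python
import re
from pathlib import Path

# Characters that are commonly rejected in file names (assumed set).
_ILLEGAL = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(title, fallback='untitled'):
    """Replace characters that are illegal in file names with '_'."""
    cleaned = _ILLEGAL.sub('_', title).strip(' .')
    return cleaned or fallback


def cover_path(title, folder='./imgs'):
    """Build the target path for a cover, creating the folder if needed."""
    directory = Path(folder)
    directory.mkdir(parents=True, exist_ok=True)
    return directory / (safe_filename(title) + '.jpg')
```

In the loop, `open(cover_path(bt), 'wb')` could then stand in for the string concatenation.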

2.5 Get each review's detail-page URL and build the full-text request link

details_list = tree.xpath('//div[@class="main-bd"]/h2/a/@href')
details_urls = []
for i in details_list:
    num = re.findall(r'\d+', i)[0]
    details_url = f'https://movie.douban.com/j/review/{num}/full'
    details_urls.append(details_url)
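
`re.findall(r'\d+', i)[0]` takes the first run of digits anywhere in the link, which works as long as nothing earlier in the URL contains a digit. A slightly stricter sketch anchors the match to the `/review/` path segment. `full_review_url` is a hypothetical helper, and the link shape `.../review/<id>/` is an assumption inferred from the endpoint used above.

```python
import re

# Matches the numeric ID in a link such as .../review/15980218/
_REVIEW_ID = re.compile(r'/review/(\d+)')


def full_review_url(href):
    """Return the full-text endpoint for a review link, or None."""
    match = _REVIEW_ID.search(href)
    if match is None:
        return None
    return f'https://movie.douban.com/j/review/{match.group(1)}/full'


print(full_review_url('https://movie.douban.com/review/15980218/'))
```

Returning `None` for links that do not match lets the caller skip them instead of failing with an `IndexError`.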

2.6 Fetch each review's full content and save it

for bt, details_url in zip(bt_list, details_urls):
    response = requests.get(details_url, headers=header)
    response.encoding = response.apparent_encoding
    data = json.loads(response.text)
    # print(data['body'])
    datatree = etree.HTML(data['body'])
    details = datatree.xpath('//text()')
    # print(details)
    detailedInfo = '\n' + bt + '\n' + ''.join(details)
    with open('detail.txt', 'a+', encoding='utf-8') as f:
        f.write(detailedInfo)
        print(f'{bt}: full review content downloaded')
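
`xpath('//text()')` returns every text node, including the whitespace-only nodes that sit between tags, so `''.join(details)` may keep stray line breaks and run paragraphs together. One possible clean-up step is sketched below. `clean_text` is a hypothetical helper, and the sample list only imitates what the XPath call might return.

```python
def clean_text(nodes):
    """Join text nodes into lines, dropping whitespace-only nodes."""
    lines = []
    for node in nodes:
        # str.split() with no arguments splits on any Unicode whitespace,
        # so this collapses runs of spaces and newlines into one space.
        text = ' '.join(node.split())
        if text:
            lines.append(text)
    return '\n'.join(lines)


sample = ['\n  ', 'First paragraph.', '\n', '  Second paragraph. ', '\n']
print(clean_text(sample))
```

With this helper, `detailedInfo` could be built from `clean_text(details)` in place of `''.join(details)`.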

3. Results

3.1 Covers

3.2 Review information

posted @ 2024-08-12 22:58  随风小屋