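Scraping All Full-Size Skin Images from the Honor of Kings Official Site

Created: March 1, 2024

Background

Scrape the skin artwork for the game Honor of Kings (王者荣耀) from its official site.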

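Approach

Inspecting the HTML structure of the hero list page, we can find the link to each hero's detail page.

Opening a hero's detail page shows where the skin names and the image address live: the names are stored in the `data-imgname` attribute, and every image URL is a shared prefix followed by the skin's index and `.jpg`.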
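
As a minimal sketch of that composition rule, the snippet below builds the image URLs for one hero from a prefix and a skin count. The prefix is the sample value printed in the debug comment of step 5 in this post; the function name `build_skin_urls` and the count of 4 are assumptions used only for illustration.

```python
def build_skin_urls(url_prefix, skin_count):
    """Return the full-size image URL of every skin of one hero.

    url_prefix -- protocol-relative prefix ending in "-bigskin-"
    skin_count -- number of skins; skins are numbered from 1
    """
    return [f"https:{url_prefix}{index}.jpg" for index in range(1, skin_count + 1)]


# Sample prefix copied from the debug comment in step 5.
prefix = "//game.gtimg.cn/images/yxzj/img201606/skin/hero-info/506/506-bigskin-"
urls = build_skin_urls(prefix, 4)
for url in urls:
    print(url)
```

The rest of the script is essentially about obtaining these two inputs from the page.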

Result

Full code

# Import modules
import os
import re
import requests
from bs4 import BeautifulSoup
# Create the output folder
folder_path = "./honorOfKings"
if not os.path.exists(folder_path):
    os.makedirs(folder_path)
# Set the URL to scrape and the request headers
url = "https://pvp.qq.com/web201605/herolist.shtml"  # Honor of Kings hero list
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.203",
}

# Send the HTTP request and get the page content
html = requests.request("get", url=url, headers=headers).text
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html, "lxml")
# Initialize an empty list to store the hero detail page links
# (named hero_links so it does not shadow the built-in list)
hero_links = []

for i in soup.find_all(attrs={"class": "herolist clearfix"}):
    for o in i.find_all("a"):
        hero_links.append("https://pvp.qq.com/web201605/" + o.get("href"))
count = 1
for ix in hero_links:  # Iterate over each hero link
    # Send the HTTP request and get the hero page content
    htm = requests.get(url=ix, headers=headers).content
    soup1 = BeautifulSoup(htm, "lxml")
    # Extract the skin names and build [index, name] pairs
    for img_pf in soup1.find_all(attrs={"class": "pic-pf-list pic-pf-list3"}):
        img_pf_name = img_pf.get('data-imgname').split('|')
        temp_list = []
        for index, pf_no in enumerate(img_pf_name):
            temp_list.append([index + 1, pf_no])
        # Extract the skin image URL prefix and download every skin
        for z in soup1.find_all(attrs={"class": "zk-con1"}):
            Get_style = z.get('style')
            img_url = re.search(r"//.*-", str(Get_style)).group()
            for io in temp_list:
                hero_name = soup1.find(attrs={"class": "cover-name"}).get_text()
                style_name = re.search(r'[\u4e00-\u9fa5]+', str(io[1])).group()
                style_index = io[0]
                hero_url = "https:" + str(img_url) + str(style_index) + ".jpg"
                pic = requests.get(hero_url).content
                with open(os.path.join(folder_path, f"{hero_name}-{style_name}.jpg"), "wb") as f:
                    f.write(pic)
                print(f"Downloading === hero: [{hero_name}], skin: [{style_name}] === image #{count}.")
                count += 1
# count starts at 1 and is incremented after each download, so subtract 1 for the total
print(f"All images downloaded! Total images: {count - 1}.")

Step-by-step explanation of the code

0. Import the modules and check whether the folder exists

# Import modules
import os
import re

import requests
from bs4 import BeautifulSoup
# Create the output folder
folder_path = "./honorOfKings"
if not os.path.exists(folder_path):
    os.makedirs(folder_path)

1. Set the request headers and the URL

# Set the URL to scrape and the request headers
url = "https://pvp.qq.com/web201605/herolist.shtml"  # Honor of Kings hero list
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.203",
}

2. Send the request and parse the page with BeautifulSoup

# Send the HTTP request and get the page content
html = requests.request("get", url=url, headers=headers).text
# requests.request("get", url=url, headers=headers) sends a GET request for the target page.
# .text returns the body of the HTTP response decoded as text.
# print(html)
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html, "lxml")
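
One thing to watch for with `.text`: `requests` decodes the response body using the encoding it infers from the HTTP headers, which may not match the page's real encoding. The sketch below reproduces the effect offline with plain byte strings. It assumes the page is served as GBK (an assumption; check the page's `<meta charset>`), and the sample text is only an illustration. This is why the XPath variant in step 8 sets `resp.encoding = resp.apparent_encoding` before reading `resp.text`.

```python
# Assumption: the server sends GBK-encoded bytes for this page.
raw_bytes = "王者荣耀".encode("gbk")

# Decoding with the wrong codec yields mojibake instead of raising an error.
wrong = raw_bytes.decode("iso-8859-1")
# Decoding with the right codec restores the original text.
right = raw_bytes.decode("gbk")

print(repr(wrong))
print(right)
```

In this script the list page is only mined for `href` values, which are plain ASCII, and the hero pages are parsed from `.content` (raw bytes), which lets BeautifulSoup detect the encoding itself.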

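3. Find the links to the hero detail pages

# Initialize an empty list to store the hero detail page links
# (named hero_links so it does not shadow the built-in list)
hero_links = []
# Find every element whose class attribute is "herolist clearfix";
# this is the container that holds the hero entries.
for i in soup.find_all(attrs={"class": "herolist clearfix"}):
    # For each container i, find all <a> elements inside it.
    for o in i.find_all("a"):
        # o.get("href") returns the link's relative address; prepend the
        # site prefix to build the full URL and store it in the list.
        hero_links.append("https://pvp.qq.com/web201605/" + o.get("href"))
# print(hero_links)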
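
String concatenation works as long as every `href` is relative to the same directory. A slightly more defensive option is `urllib.parse.urljoin` from the standard library, which resolves relative links against the page URL and leaves absolute ones untouched. A minimal sketch; the `href` values below are hypothetical examples, not copied from the site:

```python
from urllib.parse import urljoin

LIST_URL = "https://pvp.qq.com/web201605/herolist.shtml"

# Hypothetical href values, standing in for o.get("href").
sample_hrefs = ["herodetail/105.shtml", "herodetail/506.shtml"]

# urljoin replaces the last path segment of LIST_URL with the relative href.
resolved_links = [urljoin(LIST_URL, href) for href in sample_hrefs]
print(resolved_links)
```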

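4. Iterate over each hero link (for later processing)

Extract the skin information from each hero page according to the rule described earlier.

# This runs inside the loop over the hero links; soup1 is the parsed detail page of the current hero.
for img_pf in soup1.find_all(attrs={"class": "pic-pf-list pic-pf-list3"}):
    # data-imgname holds all skin names of the hero, separated by "|"
    img_pf_name = img_pf.get('data-imgname').split('|')
    # temp_list stores [index, name] pairs; the 1-based index is used later to build the image URL
    temp_list = []
    for index, pf_no in enumerate(img_pf_name):
        temp_list.append([index + 1, pf_no])
    # Next step: extract the skin image URL and download
    print('----->', temp_list)  # [[1, '流云之翼&0'], [2, '荷鲁斯之眼&0'], [3, '纤云弄巧&65'], [4, '时之祈愿&95']]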
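
The entries in `data-imgname` carry a `&<number>` suffix, as the debug output above shows. A small sketch of splitting that suffix off at this stage rather than later with a regex; `parse_skin_names` is a hypothetical helper name, and the assumption is that `&` never appears inside a skin name:

```python
def parse_skin_names(imgname_attr):
    """Split a data-imgname value into (index, skin_name) pairs, 1-based."""
    result = []
    for index, entry in enumerate(imgname_attr.split("|"), start=1):
        # partition keeps everything before the first "&"; if no "&" is
        # present, the whole entry is kept as the name
        skin_name, _, _ = entry.partition("&")
        result.append((index, skin_name))
    return result


# Sample value reconstructed from the debug output above.
sample = "流云之翼&0|荷鲁斯之眼&0|纤云弄巧&65|时之祈愿&95"
parsed = parse_skin_names(sample)
print(parsed)
# [(1, '流云之翼'), (2, '荷鲁斯之眼'), (3, '纤云弄巧'), (4, '时之祈愿')]
```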

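5. Use a regular expression to extract the image URL prefix from each hero page

for z in soup1.find_all(attrs={"class": "zk-con1"}):
    Get_style = z.get('style')
    img_url = re.search(r"//.*-", str(Get_style)).group()
    # re.search(r"//.*-", str(Get_style)) looks for the first position in
    # str(Get_style) where the pattern r"//.*-" matches.
    # The pattern reads as follows:
    #   //  matches two forward slashes.
    #   .*  "." matches any character except a newline; "*" repeats it zero
    #       or more times, greedily, so the match extends as far as possible.
    #   -   matches a hyphen (here, the last one in the string).
    # .group() returns the whole matched text.
    # print(img_url) # //game.gtimg.cn/images/yxzj/img201606/skin/hero-info/506/506-bigskin-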
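
Because `.*` is greedy, the match runs to the last hyphen in the string, not the first, which is what keeps the whole prefix up to `-bigskin-`. The sketch below contrasts the greedy pattern with a non-greedy one on a made-up `style` value (the real attribute may be formatted differently):

```python
import re

# Hypothetical style value; the real attribute may be formatted differently.
style = (
    "background:url('//game.gtimg.cn/images/yxzj/img201606/skin/"
    "hero-info/506/506-bigskin-1.jpg') center 0"
)

# Greedy: ".*" expands to the end, then backtracks to the LAST "-".
greedy = re.search(r"//.*-", style).group()
# Non-greedy: ".*?" stops at the FIRST "-", which cuts the path short.
lazy = re.search(r"//.*?-", style).group()

print(greedy)
print(lazy)
```

Note that the greedy version depends on nothing after the image index containing a hyphen.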

6. Build the URL of each skin image for the hero and save the image

for io in temp_list:
    hero_name = soup1.find(attrs={"class": "cover-name"}).get_text()
    style_name = re.search(r'[\u4e00-\u9fa5]+', str(io[1])).group()
    # [\u4e00-\u9fa5] is a character class that matches any Unicode character
    # in that range, which covers the commonly used Chinese characters.
    # + means the character class must occur one or more times.
    style_index = io[0]
    hero_url = "https:" + str(img_url) + str(style_index) + ".jpg"
    pic = requests.get(hero_url).content
    with open(os.path.join(folder_path, f"{hero_name}-{style_name}.jpg"), "wb") as f:
        f.write(pic)
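
Note that `[\u4e00-\u9fa5]+` returns only the first run of Chinese characters, so a skin name containing Latin letters, digits, or punctuation would be cut short, and a name with no Chinese characters at all would make `.group()` fail on `None`. A possible alternative, sketched under the assumption that the name is everything before `&`; `skin_file_path` and the sample names are hypothetical:

```python
import re
from pathlib import Path


def skin_file_path(folder, hero_name, raw_skin_name):
    """Build the output path for one skin image.

    raw_skin_name is one data-imgname entry such as "纤云弄巧&65".
    """
    # Assumption: the display name is everything before the first "&".
    skin_name = raw_skin_name.split("&")[0].strip()
    # Replace characters that Windows forbids in file names.
    safe = re.sub(r'[\\/:*?"<>|]', "_", f"{hero_name}-{skin_name}")
    return Path(folder) / f"{safe}.jpg"


# Hypothetical sample values for illustration.
print(skin_file_path("honorOfKings", "云中君", "纤云弄巧&65"))
print(skin_file_path("honorOfKings", "Hero", "Skin: No.1&0"))
```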

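7. Print progress information

    print(f"Downloading === hero: [{hero_name}], skin: [{style_name}] === image #{count}.")
    count += 1
# count starts at 1 and is incremented after each download, so subtract 1 for the total
print(f"All images downloaded! Total images: {count - 1}.")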

8. XPath can be used for the extraction as well. Try it yourself if you are interested

import requests
from lxml import etree
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.203",
}
url = "https://pvp.qq.com/web201605/herolist.shtml"

resp = requests.get(url, headers=headers)
# Use the encoding detected from the body instead of the one guessed from the headers
resp.encoding = resp.apparent_encoding

# Parse the response content into an etree object
xp = etree.HTML(resp.text)

# Get the hero detail page links from the list page
img_url = xp.xpath("//ul[@class='herolist clearfix']/li/a/@href")
print(img_url)
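
To see what that XPath expression selects without hitting the network, here is a small offline sketch. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and is swapped in here for `lxml` purely for illustration; the markup is a hypothetical, simplified stand-in for the real page. Since `ElementTree` cannot select attributes with `/@href`, the `href` values are read with `.get()` instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the hero list markup.
sample = """
<div>
  <ul class="herolist clearfix">
    <li><a href="herodetail/105.shtml">Hero A</a></li>
    <li><a href="herodetail/506.shtml">Hero B</a></li>
  </ul>
  <ul class="other"><li><a href="ignored.shtml">x</a></li></ul>
</div>
""".strip()

root = ET.fromstring(sample)
# Same structure as the lxml query: <a> inside <li> inside the hero list <ul>.
anchors = root.findall(".//ul[@class='herolist clearfix']/li/a")
hrefs = [a.get("href") for a in anchors]
print(hrefs)
# ['herodetail/105.shtml', 'herodetail/506.shtml']
```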

Result of the lxml script:

9. Source code and images for this project

Link: https://pan.baidu.com/s/1T0mndsObKmiX9Iu5sZn08A?pwd=oib1
Extraction code: oib1

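Summary

Besides the BeautifulSoup approach used in the main script, the same data can be extracted with XPath; if you are interested, try it by following step 8 above. The official site also has many other pages, such as the items page, and these can be scraped as well by adapting this program.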

posted @ 2024-12-05 14:33 随风小屋
