Web Scraping in Practice (8): Scraping Meme Packs


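I. Site Analysis

1. Requirements analysis

Why do some people never seem to run out of images in a QQ meme battle? With the small script built in this post, you can finally stop losing those battles.

To get there, we crawl the whole fabiaoqing.com site so that you end up with a large collection of meme packs.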

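2. Page analysis

Inspecting the network traffic shows that the links are already present in the page's HTML rather than loaded afterwards by a separate request, so we can request each URL directly and parse the response.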

3. Link analysis

Here we analyse the links, starting from https://fabiaoqing.com/bqb/lists/type/hot.html:

Hot
https://fabiaoqing.com/bqb/lists/type/hot/page/1.html  page 1
https://fabiaoqing.com/bqb/lists/type/hot/page/2.html  page 2
https://fabiaoqing.com/bqb/lists/type/hot/page/n.html  page n

Couples
https://fabiaoqing.com/bqb/lists/type/liaomei/page/1.html
https://fabiaoqing.com/bqb/lists/type/liaomei/page/2.html
https://fabiaoqing.com/bqb/lists/type/liaomei/page/n.html

The list pages of the other categories follow the same pattern: only the category name and the page number change.
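The URL pattern above can be captured in a small helper. This is only a sketch: the function name `list_page_url` and the template constant are illustrative, and the category slugs are the ones that appear in the URLs of this post.

```python
# Template inferred from the list-page URLs shown above.
LIST_PAGE_TEMPLATE = "https://fabiaoqing.com/bqb/lists/type/{category}/page/{page}.html"


def list_page_url(category: str, page: int) -> str:
    """Build the URL of one list page for a category slug such as "hot"."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return LIST_PAGE_TEMPLATE.format(category=category, page=page)


if __name__ == "__main__":
    print(list_page_url("hot", 1))
    print(list_page_url("liaomei", 2))
```

The code later in the post uses `%d` string templates for the same purpose; both approaches produce identical URLs.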

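4. Detail page analysis

Inspecting the traffic again shows that the image links are also stored in the page source. Note that, to implement lazy loading, the site does not put the image URL in the src attribute up front but in data-original. However, once the images are downloaded they turn out to be rather small. How do we fix that?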
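To make the lazy-loading detail concrete, here is a minimal sketch that prefers `data-original` over `src` when reading `<img>` tags. It uses the standard-library `html.parser` so that it runs without third-party packages (the post itself uses lxml and a regular expression). The sample HTML is hypothetical and only imitates the attribute layout described above.

```python
from html.parser import HTMLParser


class LazyImageParser(HTMLParser):
    """Collect image URLs, preferring data-original over src."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Lazy-loaded images keep the real URL in data-original;
        # src usually holds only a placeholder.
        url = attr_map.get("data-original") or attr_map.get("src")
        if url:
            self.urls.append(url)


def extract_image_urls(html: str) -> list:
    parser = LazyImageParser()
    parser.feed(html)
    return parser.urls


# Hypothetical markup, not copied from the real site.
SAMPLE_HTML = (
    '<div><img class="lazy" src="/static/placeholder.gif" '
    'data-original="http://tva3.sinaimg.cn/bmiddle/e16fc503gy1h3q1s3nl8tg20ge0geqv5.gif"></div>'
)

if __name__ == "__main__":
    print(extract_image_urls(SAMPLE_HTML))
```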

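Clicking an image opens its detail page, which shows a larger version. Do we have to visit every detail page to get the download link, or is there a less tedious way?

Comparing the URLs of the small and the large image shows that they differ by a single word, so we can get the large image directly with replace instead of visiting the detail page:

http://tva3.sinaimg.cn/bmiddle/e16fc503gy1h3q1s3nl8tg20ge0geqv5.gif
http://tva3.sinaimg.cn/large/e16fc503gy1h3q1s3nl8tg20ge0geqv5.gif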
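As a small illustration of the replacement, the sketch below swaps the size segment of the URL. The helper name `to_large_url` is made up for this example. It replaces `/bmiddle/` including the slashes, and only once, so a file name that happens to contain the word is left alone.

```python
def to_large_url(url: str) -> str:
    """Turn a medium-size image URL into the large-size one."""
    return url.replace("/bmiddle/", "/large/", 1)


if __name__ == "__main__":
    small = "http://tva3.sinaimg.cn/bmiddle/e16fc503gy1h3q1s3nl8tg20ge0geqv5.gif"
    print(to_large_url(small))
```

URLs without a `/bmiddle/` segment are returned unchanged.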

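5. Workflow

Loop over the list-page URL of each category with a for loop
Collect the link of every meme pack in that category
Parse every image in each pack and download it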
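The outer part of this workflow can be sketched as a generator that yields every list-page URL. The page counter is passed in as a function so the sketch runs without network access. The names `iter_page_urls` and `count_pages` are illustrative and not part of the original code.

```python
from typing import Callable, Iterable, Iterator


def iter_page_urls(templates: Iterable[str],
                   count_pages: Callable[[str], int]) -> Iterator[str]:
    """Yield the URL of every list page of every category.

    templates   -- URL templates with a %d placeholder for the page number
    count_pages -- returns the number of pages, given a category's first page URL
    """
    for template in templates:
        total = count_pages(template % 1)
        for page in range(1, total + 1):
            yield template % page


if __name__ == "__main__":
    demo_templates = ["https://fabiaoqing.com/bqb/lists/type/hot/page/%d.html"]
    # Pretend every category has exactly two pages.
    for url in iter_page_urls(demo_templates, lambda first_page_url: 2):
        print(url)
```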

II. Writing the Code

1. Parsing the page count

import os
import re

import requests
from lxml import etree  # parse the HTML with XPath this time
from fake_useragent import UserAgent  # random User-Agent header


if not os.path.exists("./表情包"):  # output directory ("meme packs")
    os.mkdir("./表情包")
add_url = "https://fabiaoqing.com"  # prefix for the relative pack URLs
# A hand-written list of URL templates; it could also be collected by the crawler
base_urls = [
    "https://fabiaoqing.com/bqb/lists/type/hot/page/%d.html",
    "https://fabiaoqing.com/bqb/lists/type/liaomei/page/%d.html",
    "https://fabiaoqing.com/bqb/lists/type/qunliao/page/%d.html",
    "https://fabiaoqing.com/bqb/lists/type/doutu/page/%d.html",
    "https://fabiaoqing.com/bqb/lists/type/duiren/page/%d.html",
    "https://fabiaoqing.com/bqb/lists/type/emoji/page/%d.html",
]
headers = {
    "user-agent": UserAgent().random,
}


def get_num(url):
    """Return how many pages one category has."""
    resp = requests.get(url, headers=headers)
    resp.encoding = resp.apparent_encoding
    html = etree.HTML(resp.text)
    temp = html.xpath('//*[@id="mobilepage"]/text()')[0]  # element that holds the page count
    num = int(re.search(r"\d+", temp).group())  # first number in that text
    return num


page = get_num(base_urls[0] % 1)
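`get_num` assumes that the XPath query returns at least one text node and that the text contains a number; if either assumption fails it raises an exception. A more defensive version of the number-parsing step might look like the sketch below. The helper name `first_int` and the sample strings are invented for illustration, since the post does not show the actual text of the `mobilepage` element.

```python
import re
from typing import Optional


def first_int(text: str) -> Optional[int]:
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


if __name__ == "__main__":
    # Made-up sample strings, only to show the behaviour.
    print(first_int("25 pages"))
    print(first_int("no digits here"))
```

Note that, like the original, this picks the first number it finds, so it is only correct if the page count is the first number in the text.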

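2. Collecting the packs of one category

def get_eve_express(page_url):
    """Collect the link and title of every pack on one list page."""
    resp = requests.get(page_url, headers=headers)
    resp.encoding = resp.apparent_encoding
    html = etree.HTML(resp.text)
    a = html.xpath('//*[@id="bqblist"]/a')  # the <a> tags that link to each pack
    # Walk through the <a> tags
    for i in a:
        href = i.xpath("./@href")[0]
        href = add_url + href  # build the absolute URL
        title = i.xpath("./@title")[0]
        dic = {
            "href": href,
            "title": title
        }
        # The download function will be called here; for now we print the
        # dictionary and pick one of them for testing in the next step
        print(dic)


get_eve_express(base_urls[0] % 1)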
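The string concatenation `add_url + href` works as long as every `href` is a path starting with `/`. A slightly more forgiving alternative is `urllib.parse.urljoin` from the standard library, which also copes with links that are already absolute. The wrapper name `absolute_url` is illustrative.

```python
from urllib.parse import urljoin

SITE_ROOT = "https://fabiaoqing.com"


def absolute_url(href: str) -> str:
    """Resolve a link found on the site against the site root."""
    return urljoin(SITE_ROOT, href)


if __name__ == "__main__":
    print(absolute_url("/bqb/detail/id/54885.html"))
    print(absolute_url("https://fabiaoqing.com/bqb/detail/id/54885.html"))
```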

3. Saving a pack

test_dic = {'href': 'https://fabiaoqing.com/bqb/detail/id/54885.html', 'title': '站岗小狗表情包 \u200b_斗图表情包（8个表情）'}


def get_down_url(dic):
    """Download every image of one pack and store it on disk."""
    if not os.path.exists(f"./表情包/{dic['title']}"):
        os.mkdir(f"./表情包/{dic['title']}")
    resp = requests.get(dic['href'], headers=headers)
    # The real image URL sits in data-original because of lazy loading
    info = re.findall('<img class="bqbppdetail lazy" data-original="(?P<href>.*?)" src', resp.text)
    for img_url in info:
        # Switch to the large version of the image
        img_url = img_url.replace("bmiddle", "large")
        img_resp = requests.get(img_url)
        name = img_url.split("/")[-1]  # use the last path segment as the file name
        with open(f"./表情包/{dic['title']}/{name}", "wb") as f:
            f.write(img_resp.content)
    print(f"Saved meme pack: {dic['title']}")


get_down_url(test_dic)
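The pack title is used directly as a directory name. The test title above contains a zero-width space (`\u200b`), and other titles may contain characters that some file systems reject. One possible way to clean titles first is sketched below. The helper `safe_dirname` and its character list (those forbidden on Windows, plus the zero-width space) are assumptions, not part of the original code.

```python
import re

# Characters not allowed in Windows file names, plus the zero-width space.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\u200b]')


def safe_dirname(title: str, fallback: str = "untitled") -> str:
    """Strip characters that are unsafe in a directory name."""
    cleaned = _INVALID_CHARS.sub("", title).strip().rstrip(".")
    return cleaned or fallback


if __name__ == "__main__":
    print(safe_dirname("站岗小狗表情包 \u200b_斗图表情包（8个表情）"))
    print(safe_dirname("a/b:c"))
```

With such a helper, `get_down_url` could build its paths from `safe_dirname(dic['title'])` instead of the raw title.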

III. Complete Code

import os
import re

import requests
from lxml import etree  # parse the HTML with XPath this time
from fake_useragent import UserAgent  # random User-Agent header


if not os.path.exists("./表情包"):  # output directory ("meme packs")
    os.mkdir("./表情包")
add_url = "https://fabiaoqing.com"  # prefix for the relative pack URLs


# A hand-written list of URL templates; it could also be collected by the crawler
base_urls = [
    "https://fabiaoqing.com/bqb/lists/type/hot/page/%d.html",
    "https://fabiaoqing.com/bqb/lists/type/liaomei/page/%d.html",
    "https://fabiaoqing.com/bqb/lists/type/qunliao/page/%d.html",
    "https://fabiaoqing.com/bqb/lists/type/doutu/page/%d.html",
    "https://fabiaoqing.com/bqb/lists/type/duiren/page/%d.html",
    "https://fabiaoqing.com/bqb/lists/type/emoji/page/%d.html",
]
headers = {
    "user-agent": UserAgent().random,
}


def get_num(url):
    """Return how many pages one category has."""
    resp = requests.get(url, headers=headers)
    resp.encoding = resp.apparent_encoding
    html = etree.HTML(resp.text)
    temp = html.xpath('//*[@id="mobilepage"]/text()')[0]  # element that holds the page count
    num = int(re.search(r"\d+", temp).group())  # first number in that text
    return num


def get_down_url(dic):
    """Download every image of one pack and store it on disk."""
    if not os.path.exists(f"./表情包/{dic['title']}"):
        os.mkdir(f"./表情包/{dic['title']}")
    resp = requests.get(dic['href'], headers=headers)
    # The real image URL sits in data-original because of lazy loading
    info = re.findall('<img class="bqbppdetail lazy" data-original="(?P<href>.*?)" src', resp.text)
    for img_url in info:
        # Switch to the large version of the image
        img_url = img_url.replace("bmiddle", "large")
        img_resp = requests.get(img_url)
        name = img_url.split("/")[-1]  # use the last path segment as the file name
        with open(f"./表情包/{dic['title']}/{name}", "wb") as f:
            f.write(img_resp.content)
    print(f"Saved meme pack: {dic['title']}")


def get_eve_express(page_url):
    """Download every pack linked from one list page."""
    resp = requests.get(page_url, headers=headers)
    resp.encoding = resp.apparent_encoding
    html = etree.HTML(resp.text)
    a = html.xpath('//*[@id="bqblist"]/a')  # the <a> tags that link to each pack
    # Walk through the <a> tags
    for i in a:
        href = i.xpath("./@href")[0]
        href = add_url + href  # build the absolute URL
        title = i.xpath("./@title")[0]
        dic = {
            "href": href,
            "title": title
        }
        # Parse the pack's detail page and download its images
        get_down_url(dic)


def main():
    for i in base_urls:
        num = get_num(i % 1)  # page count of this category
        for j in range(1, num + 1):
            get_eve_express(i % j)  # download every pack on page j


if __name__ == "__main__":
    main()
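The complete script sends its requests back to back and stops at the first network error. As a possible refinement, the sketch below pauses between requests and skips URLs that fail. The function `crawl_politely` and its parameters are hypothetical; the fetch function is injected so that the loop can be tried without touching the network.

```python
import time
from typing import Callable, Dict, Iterable


def crawl_politely(urls: Iterable[str],
                   fetch: Callable[[str], bytes],
                   delay_seconds: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep) -> Dict[str, bytes]:
    """Fetch each URL in turn, pausing between requests and skipping failures."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        try:
            results[url] = fetch(url)
        except Exception as exc:  # keep going if a single URL fails
            print(f"skipped {url}: {exc}")
    return results


if __name__ == "__main__":
    # Fake fetcher that just echoes the URL, so the demo needs no network.
    fetched = crawl_politely(["a", "b"], lambda u: u.encode(), delay_seconds=0.1)
    print(sorted(fetched))
```

In the real script, `fetch` could be something like `lambda u: requests.get(u, headers=headers, timeout=10).content`; the ten-second timeout is an arbitrary choice.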

posted @ 2022-07-07 21:59 Kenny_LZK
