Getting to Know the Scrapy Framework (10) ------ Crawling Images from a Mobile App

This article takes the Douyu mobile app as its example: the goal is to crawl the large streamer images under the "颜值" (Looks) keyword. Click here for the prerequisite phone and Fiddler configuration.

Once Fiddler and the phone are configured, open the Douyu app and capture traffic with Fiddler. The responses here are all JSON, so we can extract the data directly with the json module.
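For reference, here is a minimal sketch of pulling fields out of such a JSON response. The field names are inferred from the spider code further below; the layout of the real payload may differ:

import json

# hypothetical, abbreviated payload shaped like the captured response
sample = '''
{
    "error": 0,
    "data": [
        {"nickname": "someone", "room_id": "123456",
         "anchor_city": "成都", "vertical_src": "https://.../a.jpg"}
    ]
}
'''
node_list = json.loads(sample)["data"]
print(node_list[0]["vertical_src"])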

First, find the request URL. The URL the author originally captured did not work as-is; the real, usable one has the form https://capi.douyucdn.cn*******&offset=0. It would not be appropriate to publish someone else's resource URL in full here, so please capture the traffic and find it yourself. The URL looks like this:

[Screenshot of the captured URL in Fiddler, omitted here]

Using this URL, crawl the data with Scrapy:

Create the project: scrapy startproject douyu

Create the spider: scrapy genspider yanzhi douyucdn.cn

Define the crawl targets in items.py:

import scrapy


class DouyuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # streamer nickname
    nick_name = scrapy.Field()
    # room number
    room_id = scrapy.Field()
    # city the streamer is based in
    anchor_city = scrapy.Field()
    # image URL
    image_link = scrapy.Field()
    # path the image file is stored under
    image_path = scrapy.Field()

    source = scrapy.Field()
    utc_time = scrapy.Field()
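As a side note, DouyuItem instances behave like dicts, which is how the spider and pipelines below read and write them:

from douyu.items import DouyuItem

item = DouyuItem()
item["nick_name"] = "test"
print(dict(item))   # {'nick_name': 'test'}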

Write the spider file:

import scrapy
import json
from douyu.items import DouyuItem


class YanzhiSpider(scrapy.Spider):
    name = 'yanzhi'
    allowed_domains = ['douyucdn.cn']
    offset = 0
    base_url = 'https://capi.douyucdn.cn**************&offset='
    start_urls = [base_url + str(offset)]

    def parse(self, response):
        node_list = json.loads(response.body.decode())["data"]

        # stop paginating once the API returns an empty list
        if not node_list:
            return

        for node in node_list:
            item = DouyuItem()
            item["nick_name"] = node["nickname"]
            item["room_id"] = node["room_id"]
            item["anchor_city"] = node["anchor_city"]
            item["image_link"] = node["vertical_src"]
            yield item

        # each page holds 20 entries; request the next page
        self.offset += 20
        yield scrapy.Request(url=self.base_url + str(self.offset), callback=self.parse)
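To sanity-check the parsed fields before the image pipeline is in place, Scrapy's built-in feed export can dump the items to a file: scrapy crawl yanzhi -o yanzhi.json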

Write the pipeline file:

from scrapy.pipelines.images import ImagesPipeline
from douyu.settings import IMAGES_STORE
from datetime import datetime
import scrapy
import os


class ImageSource(object):
    def process_item(self, item, spider):
        # tag each item with the spider that produced it and a UTC timestamp
        item["source"] = spider.name
        item["utc_time"] = str(datetime.utcnow())
        print("**" * 20)
        return item


class DouyuImagesPipeline(ImagesPipeline):

    # send a request for each image URL
    def get_media_requests(self, item, info):
        # take the image URL from the item
        image_link = item["image_link"]
        print(image_link)
        # the response is saved automatically under the configured store path
        yield scrapy.Request(url=image_link)

    def item_completed(self, results, item, info):
        # each result describes one downloaded image; keep the paths
        # of the successful downloads
        image_path = [x["path"] for ok, x in results if ok]
        print(results)

        # the path the image was originally saved under
        old_name = IMAGES_STORE + "/" + image_path[0]
        # rename it to <nickname>.jpg
        new_name = IMAGES_STORE + "/" + item["nick_name"] + ".jpg"
        item["image_path"] = new_name
        try:
            # move the file to the new path / file name
            os.rename(old_name, new_name)
        except Exception as e:
            print("[INFO]: image already renamed\n", e)
        return item
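One caveat: nick_name comes straight from the API and may contain characters that are illegal in file names (a / on Linux, for instance), which would make os.rename fail. A minimal sanitizing helper, purely a suggested addition and not part of the original code, could look like:

import re

def safe_filename(name, default="unnamed"):
    # replace characters that are problematic in file names
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", name).strip("_")
    return cleaned or default

# e.g. in item_completed:
# new_name = IMAGES_STORE + "/" + safe_filename(item["nick_name"]) + ".jpg"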

Write the downloader middleware:

import random
from douyu.settings import USER_AGENTS as UA


class UserAgentMiddleware(object):

    """
        Assign a random User-Agent to every outgoing request.
    """

    def process_request(self, request, spider):
        user_agent = random.choice(UA)
        # note: the header name is "User-Agent", with a hyphen
        request.headers['User-Agent'] = user_agent
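The middleware logic is easy to sanity-check outside of a crawl. The sketch below builds a bare Request and applies the middleware by hand; the UA list is abbreviated, and since the spider argument is unused, None is passed:

import random
from scrapy.http import Request

UA = ["Mozilla/5.0 (test-agent-1)", "Mozilla/5.0 (test-agent-2)"]

class UserAgentMiddleware(object):
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(UA)

req = Request("https://capi.douyucdn.cn/")
UserAgentMiddleware().process_request(req, None)
print(req.headers['User-Agent'])   # one of the two test agents, as bytes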

Enable the pipelines and the downloader middleware in settings.py:

IMAGES_STORE = '/home/dan/data/images'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3192.0 Safari/537.36"
]


# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'douyu.middlewares.UserAgentMiddleware': 543,
}


# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douyu.pipelines.ImageSource': 100,
    'douyu.pipelines.DouyuImagesPipeline': 200,
}
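Depending on your Scrapy version and defaults, two more settings may be worth checking; these are an assumption on my part, not something from the original post:

# possible additions to settings.py (verify against your own setup)
ROBOTSTXT_OBEY = False   # the JSON API is not a normal web page
DOWNLOAD_DELAY = 0.5     # throttle requests a little, out of courtesy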

Some of these settings were already configured earlier; they are simply written together here for consistency.

For convenient execution and data management, we create a .py file that runs the spider and then removes the redundant full directory:

import os

print("Starting the spider")
os.system("scrapy crawl yanzhi")
print("Removing the redundant directory")
# note: os.rmdir only succeeds if the directory is empty
os.rmdir("/home/dan/data/images/full")
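If any rename failed and files were left behind in full, os.rmdir will raise; a more forgiving alternative (an assumption, not in the original script) is shutil:

import shutil

# removes the directory even if leftover files remain inside it
shutil.rmtree("/home/dan/data/images/full", ignore_errors=True)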

 

Run the code above; part of the results and data are shown below:

Data saved under the images folder:

[Screenshot of the downloaded, renamed images, omitted here]

The code above is quite simple, so no line-by-line explanation is given; if you are missing the basics, go review them first.

Excited yet? The code is all here, so get moving!!!

 

posted @ 2018-05-24 21:28  巴蜀秀才