A First Look at the Scrapy Framework (6): Downloader Middleware (China Air-Quality Data Example)

  While crawling, you may run into dynamic pages where a plain request cannot return the target data, so you either have to reverse-engineer the site's JavaScript or use Selenium to fetch the resource. Here we will use Selenium to load the data and hand the rendered result back to the engine for scheduling and parsing.

Step 1: Define the crawl targets and write the items file

import scrapy


class MiddlewareItem(scrapy.Item):
    # city name
    city_name = scrapy.Field()
    # month of the data
    city_month = scrapy.Field()
    # date of the record
    city_date = scrapy.Field()
    # AQI for that day
    city_AQI = scrapy.Field()
    # air quality grade
    grade = scrapy.Field()
    # PM2.5 level
    pm2_5 = scrapy.Field()
    # PM10 level
    pm10 = scrapy.Field()
    # SO2 level
    so2 = scrapy.Field()
    # CO level
    co = scrapy.Field()
    # NO2 level
    no2 = scrapy.Field()
    # O3 level
    o3 = scrapy.Field()

    # data source and collection time, filled in by the pipelines
    source = scrapy.Field()
    utc_time = scrapy.Field()
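A scrapy.Item behaves like a dict, so the fields declared above can be filled and read with ordinary mapping syntax. A quick check (just an illustration with made-up values, not part of the project files):

from middleware.items import MiddlewareItem

item = MiddlewareItem(city_name="Beijing", city_AQI="57")   # keyword construction works like dict()
item["grade"] = "Good"                                      # so does key assignment
print(dict(item))        # {'city_name': 'Beijing', 'city_AQI': '57', 'grade': 'Good'}
print(item.get("pm2_5")) # declared but unset fields return None via .get()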

Step 2: Create the spider file and write the spider code

import scrapy
from middleware.items import MiddlewareItem


class AirSpider(scrapy.Spider):
    name = 'air'
    allowed_domains = ['aqistudy.cn']
    base_url = 'https://www.aqistudy.cn/historydata/'
    start_urls = [base_url]

    def parse(self, response):
        print('Crawling the city list...')
        # only a small slice of cities is taken to keep the demo manageable
        city_list = response.xpath('//div[@class="all"]//li/a/text()').extract()[12:14]
        link_list = response.xpath('//div[@class="all"]//li/a/@href').extract()[12:14]
        for city, link in zip(city_list, link_list):
            city_link = self.base_url + link
            yield scrapy.Request(url=city_link, callback=self.city, meta={"city": city})

    def city(self, response):
        print("Crawling the month list for a city...")
        # again, only one month per city for the demo
        month_list = response.xpath('//tr/td/a/text()').extract()[1:2]
        month_link = response.xpath('//tr/td/a/@href').extract()[1:2]
        for month, link in zip(month_list, month_link):
            next_page = self.base_url + link
            yield scrapy.Request(url=next_page, callback=self.detail,
                                 meta={"city": response.meta["city"], "month": month})

    def detail(self, response):
        print("Crawling the daily air-quality records...")
        city = response.meta["city"]
        month = response.meta["month"]
        tr_list = response.xpath('//tr')
        tr_list.pop(0)  # drop the table header row
        for node in tr_list:
            item = MiddlewareItem()  # one fresh item per table row
            item["city_name"] = city
            item["city_month"] = month
            item["city_date"] = node.xpath("./td[1]/text()").extract_first()
            item["city_AQI"] = node.xpath("./td[2]/text()").extract_first()
            item["grade"] = node.xpath("./td[3]/span/text()").extract_first()
            item["pm2_5"] = node.xpath("./td[4]/text()").extract_first()
            item["pm10"] = node.xpath("./td[5]/text()").extract_first()
            item["so2"] = node.xpath("./td[6]/text()").extract_first()
            item["co"] = node.xpath("./td[7]/text()").extract_first()
            item["no2"] = node.xpath("./td[8]/text()").extract_first()
            item["o3"] = node.xpath("./td[9]/text()").extract_first()
            yield item

  The code above crawls the site's air-quality data for the whole country; because the full dataset is very large, only a small representative slice is taken here. The basic idea is: first request the history page to get a link for each city, then follow a city's link to get its month list and month links, and finally follow each month link to fetch the detailed daily air-quality data. Apart from the city list, every page is rendered dynamically, so to get correct data at download time we have to write a custom downloader middleware.
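Since only the history page is static, the city-list XPath used in parse can be sanity-checked in a scrapy shell session before worrying about the dynamic pages (a quick check, not part of the project files):

# run: scrapy shell https://www.aqistudy.cn/historydata/
# then, inside the shell:
response.xpath('//div[@class="all"]//li/a/text()').extract()[:5]   # first few city names
response.xpath('//div[@class="all"]//li/a/@href').extract()[:5]    # their relative links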

Step 3: Write the downloader middleware

import random
import time
import scrapy
from selenium import webdriver
from middleware.settings import USER_AGENTS as UA


class UserAgentMiddleware(object):

    """
        Assign a random User-Agent to every outgoing request.
    """

    def process_request(self, request, spider):
        user_agent = random.choice(UA)
        request.headers['User-Agent'] = user_agent
        # request.meta['proxy'] = ''  # set a proxy here if needed
        print('*' * 30)
        print(request.headers['User-Agent'])


class SeleniumMiddleware(object):

    def process_request(self, request, spider):
        # the history page is static, so only the dynamic pages go through Selenium
        if request.url != 'https://www.aqistudy.cn/historydata/':
            self.driver = webdriver.Chrome()
            self.driver.get(request.url)
            time.sleep(2)  # give the page time to render its data
            html = self.driver.page_source
            self.driver.quit()
            return scrapy.http.HtmlResponse(url=request.url, body=html, encoding="utf-8", request=request)

  The downloader middleware module has two classes: one assigns a random User-Agent to each request, and the other uses Selenium to fetch the data and returns a response object built from the rendered page.
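One thing worth knowing: as soon as a middleware's process_request returns a Response, Scrapy skips the remaining downloader middlewares and the real download for that request, which is exactly how the rendered page gets fed back to the engine. Also, opening and quitting a new Chrome instance for every request is slow. Below is a minimal sketch of a variant that shares one headless driver for the whole crawl and closes it when the spider finishes; the class name, the headless option and the signal wiring are my own additions, not part of the original code:

import time

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver


class SharedSeleniumMiddleware(object):
    """Hypothetical variant: one headless Chrome instance shared by all requests."""

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')           # assumes a Chrome build with headless support
        self.driver = webdriver.Chrome(options=options)  # Selenium 3.8+ keyword

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # close the browser when the spider closes
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        if request.url == 'https://www.aqistudy.cn/historydata/':
            return None                               # let Scrapy download the static page itself
        self.driver.get(request.url)
        time.sleep(2)                                 # crude wait; an explicit WebDriverWait would be nicer
        return HtmlResponse(url=request.url,
                            body=self.driver.page_source,
                            encoding='utf-8',
                            request=request)

    def spider_closed(self, spider):
        self.driver.quit()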

Step 4: Write the pipelines

import json
from datetime import datetime


class MiddlewarePipeline(object):

    def open_spider(self, spider):
        self.file = open('air.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(content)
        return item

    def close_spider(self, spider):
        self.file.close()


class AreaPipeline(object):

    def process_item(self, item, spider):
        item["source"] = spider.name
        item["utc_time"] = str(datetime.utcnow())
        return item

  The data is still saved as JSON, except that this time the data source and the collection time are added to every item. This is a programming habit that pays off later, for example when records from several spiders end up in the same store.
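Because MiddlewarePipeline writes one JSON object per line (JSON Lines), the output file is easy to load back for later analysis; a minimal sketch:

import json

with open('air.json', encoding='utf-8') as f:
    records = [json.loads(line) for line in f if line.strip()]

print(len(records), 'records')
print(records[0]['city_name'], records[0]['city_AQI'])   # fields declared in items.py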

Step 5: Configuration

# downloader middlewares
DOWNLOADER_MIDDLEWARES = {
   'middleware.middlewares.UserAgentMiddleware': 543,
   'middleware.middlewares.SeleniumMiddleware': 300,
}
# item pipelines
ITEM_PIPELINES = {
   'middleware.pipelines.MiddlewarePipeline': 300,
   'middleware.pipelines.AreaPipeline': 200,
}
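Strictly speaking this already works: the custom UserAgentMiddleware runs at priority 543, after Scrapy's built-in user-agent middleware (default priority 400), so it simply overwrites the header. If you prefer to make the override explicit, the built-in middleware can be switched off by mapping it to None; a possible variant of the setting:

DOWNLOADER_MIDDLEWARES = {
   'middleware.middlewares.UserAgentMiddleware': 543,
   'middleware.middlewares.SeleniumMiddleware': 300,
   # optional: disable Scrapy's built-in user-agent middleware
   'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}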

  USER_AGENTS is not shown above; note that it is a constant we have to add ourselves, so it should really be defined before writing the rest of the code. For convenience it is collected here:

USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3192.0 Safari/537.36",
]

 

At this point the code can run. Note that the chromedriver executable has to be installed for Selenium to drive Chrome; you could also use a headless browser such as PhantomJS.

 
