A First Look at the Scrapy Framework (6) ------ Downloader Middleware (China Weather Data Case Study)
While crawling, you may run into dynamic pages where a plain request cannot retrieve the target data, so you either have to reverse-engineer the site's JavaScript or use selenium to fetch the rendered page. Here we will use selenium to load the data and hand the result back to the engine for scheduling and parsing.
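Before the project code, it helps to recall how a downloader middleware can take over a request: if process_request returns None, Scrapy downloads the page as usual, but if it returns a Response object, the built-in downloader is skipped and that response goes straight back to the engine. A minimal sketch of the idea (the class name and the meta flag are illustrative only, not part of this project):

from scrapy.http import HtmlResponse


class ExampleDownloaderMiddleware(object):

    def process_request(self, request, spider):
        # Returning None lets Scrapy download the page normally.
        # Returning a Response short-circuits the built-in downloader entirely.
        if request.meta.get('render_with_browser'):  # hypothetical flag
            html = '<html><body>rendered elsewhere</body></html>'  # placeholder body
            return HtmlResponse(url=request.url, body=html,
                                encoding='utf-8', request=request)
        return None

This is exactly the mechanism the SeleniumMiddleware below relies on.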
Step 1: Define the crawl targets and write the items file
import scrapy


class MiddlewareItem(scrapy.Item):
    # city name
    city_name = scrapy.Field()
    # month the record belongs to
    city_month = scrapy.Field()
    # date of the record
    city_date = scrapy.Field()
    # AQI for the day
    city_AQI = scrapy.Field()
    # air quality grade
    grade = scrapy.Field()
    # PM2.5 level
    pm2_5 = scrapy.Field()
    # PM10 level
    pm10 = scrapy.Field()
    # SO2 level
    so2 = scrapy.Field()
    # CO level
    co = scrapy.Field()
    # NO2 level
    no2 = scrapy.Field()
    # O3 level
    o3 = scrapy.Field()

    # data source and collection time, filled in by the pipelines
    source = scrapy.Field()
    utc_time = scrapy.Field()
Step 2: Create the spider file and write the spider code
import scrapy
from middleware.items import MiddlewareItem


class AirSpider(scrapy.Spider):
    name = 'air'
    allowed_domains = ['aqistudy.cn']
    base_url = 'https://www.aqistudy.cn/historydata/'
    start_urls = [base_url]

    def parse(self, response):
        print('Crawling city information...')
        # only a small slice of cities is taken to keep the demo quick
        city_list = response.xpath('//div[@class="all"]//li/a/text()').extract()[12:14]
        link_list = response.xpath('//div[@class="all"]//li/a/@href').extract()[12:14]
        for city, link in zip(city_list, link_list):
            city_link = self.base_url + link
            yield scrapy.Request(url=city_link, callback=self.city, meta={"city": city})

    def city(self, response):
        print("Crawling the city's month information...")
        month_list = response.xpath('//tr/td/a/text()').extract()[1:2]
        month_link = response.xpath('//tr/td/a/@href').extract()[1:2]
        for month, link in zip(month_list, month_link):
            next_page = self.base_url + link
            yield scrapy.Request(url=next_page, callback=self.detail,
                                 meta={"city": response.meta["city"], "month": month})

    def detail(self, response):
        print("Crawling the city's daily air quality data...")
        city = response.meta["city"]
        month = response.meta["month"]
        item = MiddlewareItem()
        item["city_name"] = city
        item["city_month"] = month
        tr_list = response.xpath('//tr')
        tr_list.pop(0)  # drop the table header row
        for node in tr_list:
            item["city_date"] = node.xpath("./td[1]/text()").extract_first()
            item["city_AQI"] = node.xpath("./td[2]/text()").extract_first()
            item["grade"] = node.xpath("./td[3]/span/text()").extract_first()
            item["pm2_5"] = node.xpath("./td[4]/text()").extract_first()
            item["pm10"] = node.xpath("./td[5]/text()").extract_first()
            item["so2"] = node.xpath("./td[6]/text()").extract_first()
            item["co"] = node.xpath("./td[7]/text()").extract_first()
            item["no2"] = node.xpath("./td[8]/text()").extract_first()
            item["o3"] = node.xpath("./td[9]/text()").extract_first()
            yield item
The code above crawls a site's nationwide air quality data. Because the full data set is very large, only a small slice is taken as a representative sample. The basic flow is: first request the history-data page to get a link for every city, then follow a city's link to get its months and the corresponding links, and finally follow each month's link to collect the detailed daily air quality data. Apart from the city list, all of these pages are rendered dynamically, so to receive correct data at download time we have to write our own downloader middleware.
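If you want to sanity-check the XPath expressions for the static city-list page before running the whole project, scrapy shell is handy; the month and detail pages are rendered by JavaScript, so the shell will not show their data, which is exactly why the middleware is needed. Assuming the page structure has not changed, something like:

$ scrapy shell "https://www.aqistudy.cn/historydata/"
>>> response.xpath('//div[@class="all"]//li/a/text()').extract()[:5]
>>> response.xpath('//div[@class="all"]//li/a/@href').extract()[:5]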
Step 3: Write the downloader middlewares
import random
import time
import scrapy
from selenium import webdriver
from middleware.settings import USER_AGENTS as UA


class UserAgentMiddleware(object):

    """
    Assign a random User-Agent to every request.
    """

    def process_request(self, request, spider):
        user_agent = random.choice(UA)
        request.headers['User-Agent'] = user_agent
        # request.meta['proxy'] = ''  # set a proxy here if needed
        print('*' * 30)
        print(request.headers['User-Agent'])


class SeleniumMiddleware(object):

    def process_request(self, request, spider):
        # the city-list page is static; only the month and detail pages go through the browser
        if request.url != 'https://www.aqistudy.cn/historydata/':
            self.driver = webdriver.Chrome()
            self.driver.get(request.url)
            time.sleep(2)  # give the JavaScript time to render the data
            html = self.driver.page_source
            self.driver.quit()
            return scrapy.http.HtmlResponse(url=request.url, body=html,
                                            encoding="utf-8", request=request)
The downloader middleware file defines two classes: one assigns a random User-Agent to every request, and the other uses selenium to fetch the rendered page and returns it to the engine as a response object.
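Opening and quitting a fresh Chrome instance for every request works, but it is slow. A common variation, not part of the original project, is to keep a single driver alive for the whole crawl by hooking the spider_opened and spider_closed signals. A sketch of that idea:

import time

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver


class SharedDriverSeleniumMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        mw = cls()
        crawler.signals.connect(mw.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
        return mw

    def spider_opened(self, spider):
        self.driver = webdriver.Chrome()  # one browser for the whole crawl

    def spider_closed(self, spider):
        self.driver.quit()

    def process_request(self, request, spider):
        if request.url == 'https://www.aqistudy.cn/historydata/':
            return None  # the city list is static, let Scrapy download it
        self.driver.get(request.url)
        time.sleep(2)  # give the JavaScript time to render
        return HtmlResponse(url=request.url, body=self.driver.page_source,
                            encoding='utf-8', request=request)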
Step 4: Write the pipelines
import json
from datetime import datetime


class MiddlewarePipeline(object):

    def open_spider(self, spider):
        self.file = open('air.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write one JSON object per line
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(content)
        return item

    def close_spider(self, spider):
        self.file.close()


class AreaPipeline(object):

    def process_item(self, item, spider):
        # stamp every item with its source spider and a UTC timestamp
        item["source"] = spider.name
        item["utc_time"] = str(datetime.utcnow())
        return item
The data is still saved as JSON, only this time each item is also stamped with its data source and the time it was collected. Recording these is a good programming habit that saves a lot of trouble later.
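Because every line of air.json is a separate JSON object, reading the results back for later analysis is straightforward; for example (assuming the file sits in the directory you run the script from):

import json

with open('air.json', encoding='utf-8') as f:
    records = [json.loads(line) for line in f if line.strip()]

# e.g. group the daily AQI values by city
by_city = {}
for record in records:
    by_city.setdefault(record['city_name'], []).append(record['city_AQI'])
print({city: len(days) for city, days in by_city.items()})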
Step 5: Configuration
# downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'middleware.middlewares.UserAgentMiddleware': 543,
    'middleware.middlewares.SeleniumMiddleware': 300,
}
# item pipelines
ITEM_PIPELINES = {
    'middleware.pipelines.MiddlewarePipeline': 300,
    'middleware.pipelines.AreaPipeline': 200,
}
The numbers are priorities: for item pipelines, lower values run first, so AreaPipeline (200) stamps the source and timestamp before MiddlewarePipeline (300) writes the item to disk. Note that USER_AGENTS is not shown above; it is a constant we have to add to settings.py ourselves, so it should be in place before the middleware code that uses it. For convenience, it is listed here in one place:
USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3192.0 Safari/537.36",
]
At this point the code is ready to run, but note that chromedriver has to be installed (and on your PATH) for selenium to drive Chrome; alternatively you can use a headless browser such as PhantomJS.
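The spider is started from the project root with scrapy crawl air. If you would rather not have a browser window pop up for every request, Chrome itself can run headless instead of switching to PhantomJS; a sketch, assuming a reasonably recent selenium (older versions use the chrome_options keyword instead of options):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')     # no visible browser window
options.add_argument('--disable-gpu')  # often recommended on Windows
driver = webdriver.Chrome(options=options)

The same driver construction can be dropped into SeleniumMiddleware.process_request in place of the plain webdriver.Chrome() call.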
Pure love, only for China.