Scrapy middleware
Using downloader middleware
- Scrapy has two kinds of middleware: spider middleware and downloader middleware
- Downloader middleware is the one used most often
- Downloader middleware:
  - Purpose: intercept requests and responses in batches
  - Intercepting requests:
    - UA spoofing: give the outgoing requests as many different User-Agent identities as possible
      - request.headers['User-Agent'] = 'xxx'
    - Proxy: request.meta['proxy'] = 'http://ip:port'
  - Intercepting responses: tamper with the response data or replace the response object entirely
1 Request-interception middleware
- Purpose:
  - UA spoofing: give the intercepted requests as many different User-Agent identities as possible
    - request.headers['User-Agent'] = 'xxx'
  - Proxy: request.meta['proxy'] = 'http://ip:port'
- Example: crawling the 4567tv video site
  - spider.py:
```python
import scrapy

from moviespider.items import MoviespiderItem


class MovieSpiderSpider(scrapy.Spider):
    name = 'movie_spider'
    # allowed_domains = ['https://www.4567tv.tv/index.php/vod/show/class/动作/id/1.html']
    start_urls = ['https://www.4567tv.tv/index.php/vod/show/class/动作/id/1.html']
    url = 'https://www.4567tv.tv/index.php/vod/show/class/动作/id/1/page/%d.html'
    pageNum = 1

    def parse(self, response):
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            title = li.xpath('./div[1]/a/@title').extract_first()
            detail_url = 'https://www.4567tv.tv' + li.xpath('./div[1]/a/@href').extract_first()
            item = MoviespiderItem()
            item['title'] = title
            # meta is a dict that gets passed on to the callback
            yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        # receive the meta dict via response.meta
        item = response.meta['item']
        desc = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]/text()').extract_first()
        item['desc'] = desc
        yield item
```
- items.py
  - define the title and desc fields (a minimal sketch follows below)
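A minimal items.py sketch, assuming the field names the spider assigns above:

```python
import scrapy


class MoviespiderItem(scrapy.Item):
    title = scrapy.Field()  # movie title from the list page
    desc = scrapy.Field()   # movie description from the detail page
```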
- pipelines.py
  - storage (a minimal sketch follows below)
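The notes only say "storage"; a minimal pipeline sketch that writes each item to a local text file could look like this (the class name and the movies.txt file name are assumptions; the pipeline also has to be enabled via ITEM_PIPELINES in settings.py):

```python
class MoviespiderPipeline(object):
    fp = None

    def open_spider(self, spider):
        # runs once when the spider starts: open the output file
        self.fp = open('movies.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write one line per movie
        self.fp.write('{}: {}\n'.format(item['title'], item['desc']))
        return item

    def close_spider(self, spider):
        # runs once when the spider finishes: close the file
        self.fp.close()
```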
- middlewares.py (the downloader-middleware file):
```python
import random

from scrapy import signals

user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
]  # common browser User-Agent strings (see the full list below)

PROXY_http = [
    '153.180.102.104:80',
]
PROXY_https = [
    '120.83.49.90:9000',
]


class MoviespiderDownloaderMiddleware(object):
    # intercepts normal requests; the request parameter is the intercepted Request object
    def process_request(self, request, spider):
        print('i am process_request()')
        # UA spoofing: give the intercepted requests as many different identities as possible
        request.headers['User-Agent'] = random.choice(user_agent_list)
        # proxy handling
        if request.url.split(':')[0] == 'http':
            request.meta['proxy'] = 'http://' + random.choice(PROXY_http)    # http://ip:port
        else:
            request.meta['proxy'] = 'https://' + random.choice(PROXY_https)  # https://ip:port
        return None

    # intercepts responses; the response parameter is the intercepted Response object
    def process_response(self, request, response, spider):
        print('i am process_response()')
        return response

    def process_exception(self, request, exception, spider):
        print('i am process_exception()')
        # intercept requests that raised an exception, fix them, then resend them
        # proxy handling
        if request.url.split(':')[0] == 'http':
            request.meta['proxy'] = 'http://' + random.choice(PROXY_http)    # http://ip:port
        else:
            request.meta['proxy'] = 'https://' + random.choice(PROXY_https)  # https://ip:port
        return request  # resend the corrected request
```
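For the middleware to actually run it also has to be enabled in settings.py. A minimal sketch, assuming the default module path Scrapy generates for the project (543 is the template's default priority):

```python
# settings.py (excerpt)
DOWNLOADER_MIDDLEWARES = {
    'moviespider.middlewares.MoviespiderDownloaderMiddleware': 543,
}
```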
- Common browser User-Agent strings:
```python
user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
    "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
    "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
    "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
    "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
    "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]
```
2 Response-interception middleware
- Purpose: intercepting responses lets you tamper with the response data or replace the response object entirely
- Using selenium inside Scrapy (a minimal spider-side sketch follows this list):
  - Instantiate the browser object in the spider class's constructor
  - Close the browser in the spider class's closed(self, spider) method
  - Perform the browser automation inside the middleware
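A minimal sketch of the spider-side pieces just described, with the browser created in the constructor and closed in closed(); the chromedriver path is an assumption, and the full example below creates the browser as a class attribute instead, which works the same way:

```python
import scrapy
from selenium import webdriver


class NewsSpider(scrapy.Spider):
    name = 'news'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # instantiate the browser once, when the spider is created
        self.bro = webdriver.Chrome(executable_path='./chromedriver')

    def closed(self, spider):
        # close the browser when the spider closes
        self.bro.quit()
```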
- Example:
  - Requirement: crawl the news titles and contents under five sections of NetEase News (domestic, international, military, aviation, drones); the data on the section pages is loaded dynamically
  - spider.py:
```python
# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver

from wangyinews.items import WangyinewsItem


class NewsSpider(scrapy.Spider):
    name = 'news'
    # allowed_domains = ['www.wangyi.com']
    start_urls = ['https://news.163.com']
    five_model_urls = []
    bro = webdriver.Chrome(executable_path=r'D:\教学视频\python 爬虫\tools\chromedriver.exe')

    # parse the urls of the five sections, then send manual requests for each of them
    def parse(self, response):
        model_index = [3, 4, 6, 7, 8]
        li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
        for index in model_index:
            li = li_list[index]
            # the url of one of the five sections
            model_url = li.xpath('./a/@href').extract_first()
            self.five_model_urls.append(model_url)
            # send a manual request for each section url
            yield scrapy.Request(model_url, callback=self.parse_model)

    # parse the news titles and detail-page urls inside each section page
    # problem: this response (as downloaded by Scrapy) does not contain the dynamically loaded news data
    def parse_model(self, response):
        div_list = response.xpath('/html/body/div[1]/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
        for div in div_list:
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()
            item = WangyinewsItem()
            item['title'] = title
            # request the detail page and parse the news content there
            yield scrapy.Request(detail_url, callback=self.parse_new_content, meta={'item': item})

    # parse the news content on the detail page
    def parse_new_content(self, response):
        item = response.meta['item']
        content = response.xpath('//*[@id="endText"]//text()').extract()
        item['content'] = content
        yield item

    # runs last: close the browser when the spider closes
    def closed(self, spider):
        self.bro.quit()
```
- items.py
  - define the title and content fields (a minimal sketch follows below)
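A minimal items.py sketch for the news project, assuming the field names the spider assigns above:

```python
import scrapy


class WangyinewsItem(scrapy.Item):
    title = scrapy.Field()    # news title from the section page
    content = scrapy.Field()  # list of text nodes extracted from the detail page
```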
- pipelines.py
  - storage (same pattern as the pipeline sketch in the first example)
- middlewares.py:
```python
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from time import sleep

from scrapy import signals
from scrapy.http import HtmlResponse


class WangyinewsDownloaderMiddleware(object):

    def process_request(self, request, spider):
        return None

    # spider is the instance of the spider class defined in the spider file
    def process_response(self, request, response, spider):
        # intercept every response object
        # 1. find the five responses that do not meet our needs (the five section pages)
        #    - each response corresponds to exactly one request
        #    - once we can locate the five requests, we can locate their responses
        #    - the five requests can be located by the five section urls
        #    summary: url ==> request ==> response
        # 2. fix (replace) the five unsatisfactory response objects
        # spider.five_model_urls: the urls of the five sections
        bro = spider.bro
        if request.url in spider.five_model_urls:
            # this response belongs to one of the five sections:
            # render the page with selenium so it contains the dynamically loaded news data
            bro.get(request.url)
            sleep(1)
            page_text = bro.page_source  # includes the dynamically loaded news data
            new_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
            return new_response
        return response

    def process_exception(self, request, exception, spider):
        pass
```