Scrapy middleware extension: retrying through a proxy on specific response status codes
0. References
1. Approach
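In real-world crawling, requests that are sent too frequently are often temporarily redirected to a login page (302) or even refused outright (403). For such responses, the request can be retried once through a proxy:
(1) Following the built-in redirect.py module, pass the response through unchanged when dont_redirect, handle_httpstatus_list, or a similar condition applies.
(2) If condition (1) is not met and the response status code is 302 or 403, re-issue the request through the proxy.
(3) If the status code is still 302 or 403 after going through the proxy, drop the request.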
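The three rules above can be sketched as a small pure function that does not depend on Scrapy. This is only an illustration of the decision flow: the function name `decide` and its string return values are hypothetical and are not part of Scrapy or of the middleware in section 2.

```python
PROXY_STATUS = (302, 403)


def decide(status, meta, spider_handled=()):
    """Return 'pass', 'retry_with_proxy', or 'drop' for one response.

    status         -- HTTP status code of the response
    meta           -- the request's meta dict
    spider_handled -- the spider's handle_httpstatus_list, if any
    """
    # Rule (1): the caller wants to handle this response itself.
    if (meta.get('dont_redirect', False)
            or status in spider_handled
            or status in meta.get('handle_httpstatus_list', [])
            or meta.get('handle_httpstatus_all', False)):
        return 'pass'

    if status in PROXY_STATUS:
        # Rule (2): first occurrence, so retry once through the proxy.
        if not meta.get('auto_proxy'):
            return 'retry_with_proxy'
        # Rule (3): the proxy was already used and the status is unchanged.
        return 'drop'

    # Any other status is passed along untouched.
    return 'pass'
```

For example, `decide(302, {})` yields `'retry_with_proxy'`, while `decide(302, {'auto_proxy': True})` yields `'drop'`.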
2. Code
Save as /site-packages/my_middlewares.py
from urllib.parse import urljoin  # Python 3 standard library (the original used six.moves)

from scrapy.exceptions import IgnoreRequest
from w3lib.url import safe_url_string


class MyAutoProxyDownloaderMiddleware(object):
    """Retry a request once through a proxy when its response status is in PROXY_STATUS."""

    def __init__(self, settings):
        self.proxy_status = settings.get('PROXY_STATUS', [302, 403])
        # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=proxy#module-scrapy.downloadermiddlewares.httpproxy
        self.proxy_config = settings.get('PROXY_CONFIG', 'http://username:password@some_proxy_server:port')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(settings=crawler.settings)

    # See /site-packages/scrapy/downloadermiddlewares/redirect.py
    def process_response(self, request, response, spider):
        # Same pass-through conditions as the built-in RedirectMiddleware
        if (request.meta.get('dont_redirect', False) or
                response.status in getattr(spider, 'handle_httpstatus_list', []) or
                response.status in request.meta.get('handle_httpstatus_list', []) or
                request.meta.get('handle_httpstatus_all', False)):
            return response

        if response.status in self.proxy_status:
            # The redirect target is only used in the log message
            if 'Location' in response.headers:
                location = safe_url_string(response.headers['location'])
                redirected_url = urljoin(request.url, location)
            else:
                redirected_url = ''

            # AutoProxy for first time
            if not request.meta.get('auto_proxy'):
                request.meta.update({'auto_proxy': True, 'proxy': self.proxy_config})
                new_request = request.replace(meta=request.meta, dont_filter=True)
                new_request.priority = request.priority + 2
                spider.log('Will AutoProxy for <{} {}> {}'.format(
                    response.status, request.url, redirected_url))
                return new_request
            # IgnoreRequest for second time
            else:
                # Logger.warn is deprecated; use warning
                spider.logger.warning('Ignoring response <{} {}>: HTTP status code still in {} after AutoProxy'.format(
                    response.status, request.url, self.proxy_status))
                raise IgnoreRequest

        return response
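The middleware resolves the Location header against the request URL only to produce a readable log message. The following minimal sketch shows what that resolution does using the standard library alone; the helper name `resolve_location` and the `example.com` URL are illustrative assumptions, not part of the original code.

```python
from urllib.parse import urljoin


def resolve_location(request_url, location):
    """Resolve a Location header value against the URL that was requested."""
    return urljoin(request_url, location)


# A path-only Location is resolved against the host of the request URL.
print(resolve_location('http://httpbin.org/status/302', '/redirect/1'))
# -> http://httpbin.org/redirect/1

# An absolute Location replaces the request URL entirely.
print(resolve_location('http://httpbin.org/status/302', 'https://example.com/login'))
# -> https://example.com/login
```

The first call matches the redirect target that appears in the "Will AutoProxy" log line in section 4.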
3. Usage
(1) Add the following to the project's settings.py. Note that the middleware's order value must lie between those of the default RedirectMiddleware and HttpProxyMiddleware.
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'my_middlewares.MyAutoProxyDownloaderMiddleware': 601,
    # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}
PROXY_STATUS = [302, 403]
PROXY_CONFIG = 'http://username:password@some_proxy_server:port'
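Why the order value matters can be illustrated with a small sketch. Scrapy calls process_request hooks in increasing order value and process_response hooks in decreasing order value. The code below only models that sorting with plain dictionaries; the names `ORDERS`, `request_order`, and `response_order` are hypothetical and do not exist in Scrapy.

```python
# Order values taken from the settings snippet above.
ORDERS = {
    'RedirectMiddleware': 600,
    'MyAutoProxyDownloaderMiddleware': 601,
    'HttpProxyMiddleware': 750,
}


def request_order(orders):
    """Middleware names in the order their process_request hooks run."""
    return sorted(orders, key=orders.get)


def response_order(orders):
    """Middleware names in the order their process_response hooks run."""
    return sorted(orders, key=orders.get, reverse=True)


print(request_order(ORDERS))
print(response_order(ORDERS))
```

With a value of 601, the custom process_response runs just before RedirectMiddleware's, so a 302 reaches it before RedirectMiddleware would follow the redirect.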
4. Results
2018-07-18 18:42:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/> (referer: None)
2018-07-18 18:42:38 [test] DEBUG: Will AutoProxy for <302 http://httpbin.org/status/302> http://httpbin.org/redirect/1
2018-07-18 18:42:43 [test] DEBUG: Will AutoProxy for <403 https://httpbin.org/status/403>
2018-07-18 18:42:51 [test] WARNING: Ignoring response <302 http://httpbin.org/status/302>: HTTP status code still in [302, 403] after AutoProxy
2018-07-18 18:42:52 [test] WARNING: Ignoring response <403 https://httpbin.org/status/403>: HTTP status code still in [302, 403] after AutoProxy
Proxy server log:
squid [18/Jul/2018:18:42:53 +0800] "GET http://httpbin.org/status/302 HTTP/1.1" 302 310 "-" "Mozilla/5.0" TCP_MISS:HIER_DIRECT
squid [18/Jul/2018:18:42:54 +0800] "CONNECT httpbin.org:443 HTTP/1.1" 200 3560 "-" "-" TCP_TUNNEL:HIER_DIRECT