Scrapy spider middleware: the offsite and referer middlewares
The environment is a Python 3.6 environment created with Anaconda. On macOS, activate it with:

```bash
source activate python36
```
Create a test project and generate a CrawlSpider:

```
mac@macdeMacBook-Pro:~$ source activate python36
(python36) mac@macdeMacBook-Pro:~$ cd /www
(python36) mac@macdeMacBook-Pro:/www$ scrapy startproject testMiddlewile
New Scrapy project 'testMiddlewile', using template directory '/Users/mac/anaconda3/envs/python36/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /www/testMiddlewile

You can start your first spider with:
    cd testMiddlewile
    scrapy genspider example example.com
(python36) mac@macdeMacBook-Pro:/www$ cd testMiddlewile/
(python36) mac@macdeMacBook-Pro:/www/testMiddlewile$ scrapy genspider -t crawl yeves yeves.cn
Created spider 'yeves' using template 'crawl' in module:
  testMiddlewile.spiders.yeves
(python36) mac@macdeMacBook-Pro:/www/testMiddlewile$
```
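The generated spider at testMiddlewile/spiders/yeves.py follows Scrapy's standard crawl template; reconstructed from that template (not copied from the project), it should look roughly like this:

```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class YevesSpider(CrawlSpider):
    name = 'yeves'
    allowed_domains = ['yeves.cn']   # this is what OffsiteMiddleware checks against
    start_urls = ['http://yeves.cn/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item
```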
Start the spider:
```bash
scrapy crawl yeves
```
```
(python36) mac@macdeMacBook-Pro:/www/testMiddlewile$ scrapy crawl yeves
2019-11-10 09:10:27 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: testMiddlewile)
2019-11-10 09:10:27 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 13:42:17) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.7, Platform Darwin-17.7.0-x86_64-i386-64bit
2019-11-10 09:10:27 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'testMiddlewile', 'NEWSPIDER_MODULE': 'testMiddlewile.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['testMiddlewile.spiders']}
2019-11-10 09:10:27 [scrapy.extensions.telnet] INFO: Telnet Password: 29995a24067c48f8
2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-11-10 09:10:27 [scrapy.core.engine] INFO: Spider opened
2019-11-10 09:10:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-11-10 09:10:27 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-10 09:10:27 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yeves.cn/robots.txt> from <GET http://yeves.cn/robots.txt>
2019-11-10 09:10:30 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.yeves.cn/robots.txt> (referer: None)
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 9 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 14 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 15 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 21 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 22 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 27 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 28 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 29 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 30 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 31 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 32 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 36 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 37 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 39 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 41 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 42 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 43 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 47 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 48 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 49 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 53 without any user agent to enforce it on.
2019-11-10 09:10:30 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yeves.cn/> from <GET http://yeves.cn/>
2019-11-10 09:10:30 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.yeves.cn/robots.txt> (referer: None)
```
From the log output above you can see that Scrapy enables five spider middlewares by default:
```
2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
```
To read the source code in PyCharm, first import the middlewares:
```python
from scrapy.spidermiddlewares.offsite import OffsiteMiddleware
from scrapy.spidermiddlewares.referer import RefererMiddleware
from scrapy.spidermiddlewares.httperror import HttpErrorMiddleware
from scrapy.spidermiddlewares.urllength import UrlLengthMiddleware
from scrapy.spidermiddlewares.depth import DepthMiddleware
```
The offsite middleware
Hold Option and click the import to jump into the OffsiteMiddleware source:
```python
"""
Offsite Spider Middleware

See documentation in docs/topics/spider-middleware.rst
"""
import re
import logging
import warnings

from scrapy import signals
from scrapy.http import Request
from scrapy.utils.httpobj import urlparse_cached

logger = logging.getLogger(__name__)


class OffsiteMiddleware(object):

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.stats)
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def process_spider_output(self, response, result, spider):
        for x in result:
            if isinstance(x, Request):
                if x.dont_filter or self.should_follow(x, spider):
                    yield x
                else:
                    domain = urlparse_cached(x).hostname
                    if domain and domain not in self.domains_seen:
                        self.domains_seen.add(domain)
                        logger.debug(
                            "Filtered offsite request to %(domain)r: %(request)s",
                            {'domain': domain, 'request': x}, extra={'spider': spider})
                        self.stats.inc_value('offsite/domains', spider=spider)
                    self.stats.inc_value('offsite/filtered', spider=spider)
            else:
                yield x

    def should_follow(self, request, spider):
        regex = self.host_regex
        # hostname can be None for wrong urls (like javascript links)
        host = urlparse_cached(request).hostname or ''
        return bool(regex.search(host))

    def get_host_regex(self, spider):
        """Override this method to implement a different offsite policy"""
        allowed_domains = getattr(spider, 'allowed_domains', None)
        if not allowed_domains:
            return re.compile('')  # allow all by default
        url_pattern = re.compile("^https?://.*$")
        for domain in allowed_domains:
            if url_pattern.match(domain):
                message = ("allowed_domains accepts only domains, not URLs. "
                           "Ignoring URL entry %s in allowed_domains." % domain)
                warnings.warn(message, URLWarning)
        domains = [re.escape(d) for d in allowed_domains if d is not None]
        regex = r'^(.*\.)?(%s)$' % '|'.join(domains)
        return re.compile(regex)

    def spider_opened(self, spider):
        self.host_regex = self.get_host_regex(spider)
        self.domains_seen = set()


class URLWarning(Warning):
    pass
```
__init__: plain initialization; it just stores the stats collector.
from_crawler: called by Scrapy's middleware manager to build the middleware instance; it also connects spider_opened to the spider_opened signal.
process_spider_output: processes the requests and items coming out of the spider, filtering out offsite requests.
should_follow: decides whether a request's host should be followed.
get_host_regex: builds the host regular expression from the spider's allowed_domains.
spider_opened: a signal handler that runs when the spider opens, building host_regex and domains_seen.
Call flow: from_crawler -> __init__ -> spider_opened -> get_host_regex.
The offsite middleware simply checks whether each URL about to be requested belongs to the domains the spider allows, preventing the crawl from drifting off to other sites. The allowed domains are declared on the spider:

```python
allowed_domains = ['yeves.cn']
```
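To see what get_host_regex and should_follow amount to, here is a minimal standalone sketch (the domain list and test hostnames are just examples) that rebuilds the same pattern:

```python
import re

# Rebuild the pattern get_host_regex derives from allowed_domains = ['yeves.cn'].
allowed_domains = ['yeves.cn']
domains = [re.escape(d) for d in allowed_domains]
host_regex = re.compile(r'^(.*\.)?(%s)$' % '|'.join(domains))

# should_follow() only lets a request through if its hostname is the allowed
# domain itself or one of its subdomains.
for host in ['yeves.cn', 'www.yeves.cn', 'blog.yeves.cn', 'example.com', 'notyeves.cn']:
    print(host, bool(host_regex.search(host)))
# yeves.cn True, www.yeves.cn True, blog.yeves.cn True,
# example.com False, notyeves.cn False
```

Note that, as the process_spider_output code above shows, requests created with dont_filter=True bypass this check.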
The referer middleware
The referer middleware exists mainly because some resources, images in particular, can only be fetched when the request carries a Referer header identifying where it came from; the anti-hotlinking rules you can configure on Alibaba Cloud OSS are a typical example.
It works by taking the URL of the previous request (the page the new request was extracted from) and using it as the Referer of the next request.
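As a rough illustration (the spider and the image URL are hypothetical, not part of the project above): a request that an anti-hotlinking rule would otherwise reject can carry the page it was found on as its Referer, which is essentially what the middleware does automatically for every request a spider yields:

```python
import scrapy


class ImageDemoSpider(scrapy.Spider):
    # Hypothetical spider, only to illustrate the Referer idea.
    name = 'image_demo'
    start_urls = ['https://www.yeves.cn/']

    def parse(self, response):
        # Requests yielded from a callback normally get this response's URL as
        # their Referer via RefererMiddleware; setting the header by hand is only
        # needed for requests built outside that flow.
        yield scrapy.Request(
            'https://example.com/some-image.jpg',  # made-up image URL
            headers={'Referer': response.url},
            callback=self.save_image,
        )

    def save_image(self, response):
        self.logger.info('got %d bytes, Referer was %s',
                         len(response.body),
                         response.request.headers.get('Referer'))
```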
The source code is as follows:
```python
class RefererMiddleware(object):

    def __init__(self, settings=None):
        self.default_policy = DefaultReferrerPolicy
        if settings is not None:
            self.default_policy = _load_policy_class(
                settings.get('REFERRER_POLICY'))

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('REFERER_ENABLED'):
            raise NotConfigured
        mw = cls(crawler.settings)

        # Note: this hook is a bit of a hack to intercept redirections
        crawler.signals.connect(mw.request_scheduled, signal=signals.request_scheduled)

        return mw

    def policy(self, resp_or_url, request):
        """
        Determine Referrer-Policy to use from a parent Response (or URL),
        and a Request to be sent.

        - if a valid policy is set in Request meta, it is used.
        - if the policy is set in meta but is wrong (e.g. a typo error),
          the policy from settings is used
        - if the policy is not set in Request meta,
          but there is a Referrer-policy header in the parent response,
          it is used if valid
        - otherwise, the policy from settings is used.
        """
        policy_name = request.meta.get('referrer_policy')
        if policy_name is None:
            if isinstance(resp_or_url, Response):
                policy_header = resp_or_url.headers.get('Referrer-Policy')
                if policy_header is not None:
                    policy_name = to_native_str(policy_header.decode('latin1'))
        if policy_name is None:
            return self.default_policy()
        cls = _load_policy_class(policy_name, warning_only=True)
        return cls() if cls else self.default_policy()

    def process_spider_output(self, response, result, spider):
        def _set_referer(r):
            if isinstance(r, Request):
                referrer = self.policy(response, r).referrer(response.url, r.url)
                if referrer is not None:
                    r.headers.setdefault('Referer', referrer)
            return r
        return (_set_referer(r) for r in result or ())

    def request_scheduled(self, request, spider):
        # check redirected request to patch "Referer" header if necessary
        redirected_urls = request.meta.get('redirect_urls', [])
        if redirected_urls:
            request_referrer = request.headers.get('Referer')
            # we don't patch the referrer value if there is none
            if request_referrer is not None:
                # the request's referrer header value acts as a surrogate
                # for the parent response URL
                #
                # Note: if the 3xx response contained a Referrer-Policy header,
                #       the information is not available using this hook
                parent_url = safe_url_string(request_referrer)
                policy_referrer = self.policy(parent_url, request).referrer(
                    parent_url, request.url)
                if policy_referrer != request_referrer:
                    if policy_referrer is None:
                        request.headers.pop('Referer')
                    else:
                        request.headers['Referer'] = policy_referrer
```
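Which policy it applies is configurable. A minimal sketch for settings.py (the setting name is Scrapy's; the chosen value is only an example):

```python
# settings.py (sketch): the default policy RefererMiddleware falls back to.
# "same-origin" sends a Referer only for requests to the same site.
REFERRER_POLICY = 'same-origin'
```

As the policy() method above shows, a single request can also override this through meta, e.g. scrapy.Request(url, meta={'referrer_policy': 'no-referrer'}).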
A spider middleware can implement several hooks; OffsiteMiddleware only uses process_spider_output. A minimal pass-through middleware using all of them is sketched after this list.
process_start_requests: called once, over the spider's start requests.
process_spider_input: called for each response before it reaches the spider callback.
process_spider_output: called over the requests and items the spider callback yields.
process_spider_exception: called when the callback (or a process_spider_input hook) raises an exception.
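A rough, hypothetical sketch (the class name and logging are made up; only the method signatures come from Scrapy's spider middleware interface):

```python
# middlewares.py (hypothetical example, not generated by the project above)
import logging

logger = logging.getLogger(__name__)


class LoggingSpiderMiddleware(object):
    """A pass-through spider middleware that just logs what flows through each hook."""

    @classmethod
    def from_crawler(cls, crawler):
        # Called by the middleware manager to build the instance, like OffsiteMiddleware above.
        return cls()

    def process_start_requests(self, start_requests, spider):
        # Runs once, over the spider's start requests.
        for request in start_requests:
            logger.debug('start request: %s', request.url)
            yield request

    def process_spider_input(self, response, spider):
        # Runs for every response before it reaches the spider callback.
        logger.debug('response in: %s', response.url)
        return None

    def process_spider_output(self, response, result, spider):
        # Runs over everything the spider callback yields (requests and items).
        for item_or_request in result:
            yield item_or_request

    def process_spider_exception(self, response, exception, spider):
        # Runs when the callback or a process_spider_input hook raises an exception.
        logger.warning('error handling %s: %r', response.url, exception)
        return None
```

To enable it, it would be registered in settings.py, for example SPIDER_MIDDLEWARES = {'testMiddlewile.middlewares.LoggingSpiderMiddleware': 543}.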