Scrapy spider middleware: the Offsite and Referer middlewares

The environment is a Python 3.6 environment created with Anaconda.

On macOS, activate it first:

source activate python36

mac@macdeMacBook-Pro:~$     source activate python36
(python36) mac@macdeMacBook-Pro:~$     cd /www
(python36) mac@macdeMacBook-Pro:/www$     scrapy startproject testMiddlewile
New Scrapy project 'testMiddlewile', using template directory '/Users/mac/anaconda3/envs/python36/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /www/testMiddlewile
 
You can start your first spider with:
    cd testMiddlewile
    scrapy genspider example example.com
(python36) mac@macdeMacBook-Pro:/www$     cd testMiddlewile/
(python36) mac@macdeMacBook-Pro:/www/testMiddlewile$        scrapy genspider -t crawl yeves yeves.cn
Created spider 'yeves' using template 'crawl' in module:
  testMiddlewile.spiders.yeves
(python36) mac@macdeMacBook-Pro:/www/testMiddlewile$    

  

Start the spider:

scrapy crawl yeves

 

(python36) mac@macdeMacBook-Pro:/www/testMiddlewile$     scrapy crawl yeves
2019-11-10 09:10:27 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: testMiddlewile)
2019-11-10 09:10:27 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 13:42:17) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.7, Platform Darwin-17.7.0-x86_64-i386-64bit
2019-11-10 09:10:27 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'testMiddlewile', 'NEWSPIDER_MODULE': 'testMiddlewile.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['testMiddlewile.spiders']}
2019-11-10 09:10:27 [scrapy.extensions.telnet] INFO: Telnet Password: 29995a24067c48f8
2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-11-10 09:10:27 [scrapy.core.engine] INFO: Spider opened
2019-11-10 09:10:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-11-10 09:10:27 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-10 09:10:27 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yeves.cn/robots.txt> from <GET http://yeves.cn/robots.txt>
2019-11-10 09:10:30 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.yeves.cn/robots.txt> (referer: None)
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 9 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 14 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 15 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 21 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 22 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 27 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 28 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 29 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 30 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 31 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 32 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 36 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 37 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 39 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 41 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 42 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 43 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 47 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 48 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 49 without any user agent to enforce it on.
2019-11-10 09:10:30 [protego] DEBUG: Rule at line 53 without any user agent to enforce it on.
2019-11-10 09:10:30 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yeves.cn/> from <GET http://yeves.cn/>
2019-11-10 09:10:30 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.yeves.cn/robots.txt> (referer: None)
2019-11-10 09:10:30 [protego] DEBUG: Rule at l

  

From the log output above you can see that Scrapy enables five spider middlewares by default:

 

2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
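As a side note (a sketch, not part of the original run), this default list comes from SPIDER_MIDDLEWARES_BASE and can be adjusted through the SPIDER_MIDDLEWARES setting in settings.py; assigning None disables a middleware, and a number sets its order:

# settings.py
SPIDER_MIDDLEWARES = {
    # disable the default offsite filtering entirely
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
    # register a custom middleware (hypothetical dotted path) at order 543
    'testMiddlewile.middlewares.MySpiderMiddleware': 543,
}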

 

To read the source code in PyCharm, first import the middlewares:

from scrapy.spidermiddlewares.offsite import OffsiteMiddleware
from scrapy.spidermiddlewares.referer import RefererMiddleware

from scrapy.spidermiddlewares.httperror import HttpErrorMiddleware
from scrapy.spidermiddlewares.urllength import UrlLengthMiddleware
from scrapy.spidermiddlewares.depth import DepthMiddleware

  

The Offsite middleware

Hold Option and click the class name in PyCharm to jump into the OffsiteMiddleware source:

"""
Offsite Spider Middleware
 
See documentation in docs/topics/spider-middleware.rst
"""
import re
import logging
import warnings
 
from scrapy import signals
from scrapy.http import Request
from scrapy.utils.httpobj import urlparse_cached
 
logger = logging.getLogger(__name__)
 
 
class OffsiteMiddleware(object):
 
    def __init__(self, stats):
        self.stats = stats
 
    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.stats)
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o
 
    def process_spider_output(self, response, result, spider):
        for x in result:
            if isinstance(x, Request):
                if x.dont_filter or self.should_follow(x, spider):
                    yield x
                else:
                    domain = urlparse_cached(x).hostname
                    if domain and domain not in self.domains_seen:
                        self.domains_seen.add(domain)
                        logger.debug(
                            "Filtered offsite request to %(domain)r: %(request)s",
                            {'domain': domain, 'request': x}, extra={'spider': spider})
                        self.stats.inc_value('offsite/domains', spider=spider)
                    self.stats.inc_value('offsite/filtered', spider=spider)
            else:
                yield x
 
    def should_follow(self, request, spider):
        regex = self.host_regex
        # hostname can be None for wrong urls (like javascript links)
        host = urlparse_cached(request).hostname or ''
        return bool(regex.search(host))
 
    def get_host_regex(self, spider):
        """Override this method to implement a different offsite policy"""
        allowed_domains = getattr(spider, 'allowed_domains', None)
        if not allowed_domains:
            return re.compile('')  # allow all by default
        url_pattern = re.compile("^https?://.*$")
        for domain in allowed_domains:
            if url_pattern.match(domain):
                message = ("allowed_domains accepts only domains, not URLs. "
                           "Ignoring URL entry %s in allowed_domains." % domain)
                warnings.warn(message, URLWarning)
        domains = [re.escape(d) for d in allowed_domains if d is not None]
        regex = r'^(.*\.)?(%s)$' % '|'.join(domains)
        return re.compile(regex)
 
    def spider_opened(self, spider):
        self.host_regex = self.get_host_regex(spider)
        self.domains_seen = set()
 
 
class URLWarning(Warning):
    pass

 

__init__: class initialization.

from_crawler: called by Scrapy's middleware manager; it builds and returns the middleware instance.

process_spider_output: processes the requests and items the spider yields.

should_follow: decides whether a request should be followed.

get_host_regex: builds the regular expression for the allowed domains.

spider_opened: connected to the spider_opened signal in from_crawler; it runs when the spider is opened and prepares host_regex and domains_seen.

Call flow: from_crawler -> __init__ -> spider_opened -> get_host_regex

The offsite middleware checks whether the URL about to be requested matches the domains the spider is allowed to crawl, preventing the crawl from wandering off to other sites. The allowed domains are declared on the spider:

allowed_domains = ['yeves.cn']
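
Below is a minimal sketch (not from the original post) that rebuilds the same kind of regex get_host_regex() produces and runs the should_follow() check by hand, so you can see which URLs the middleware would filter:

import re
from urllib.parse import urlparse

allowed_domains = ['yeves.cn']

# same pattern shape as OffsiteMiddleware.get_host_regex():
# the host must be the allowed domain itself or any subdomain of it
host_regex = re.compile(
    r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains)
)

def should_follow(url):
    # hostname can be None for malformed URLs (e.g. javascript links)
    host = urlparse(url).hostname or ''
    return bool(host_regex.search(host))

print(should_follow('https://www.yeves.cn/archives/1'))  # True  - subdomain of yeves.cn
print(should_follow('https://yeves.cn/about'))           # True  - the domain itself
print(should_follow('https://www.baidu.com/'))           # False - would be filtered as offsite

Requests built with dont_filter=True skip this check entirely, as the process_spider_output code above shows.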

 

The Referer middleware exists mainly because some resources can only be fetched when the request carries a Referer header identifying where the visit came from, for example the hotlink protection that can be configured on Alibaba Cloud OSS.

It works by setting the URL of the previous request (the parent response) as the Referer of the current request.

The source code is as follows:

class RefererMiddleware(object):
 
    def __init__(self, settings=None):
        self.default_policy = DefaultReferrerPolicy
        if settings is not None:
            self.default_policy = _load_policy_class(
                settings.get('REFERRER_POLICY'))
 
    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('REFERER_ENABLED'):
            raise NotConfigured
        mw = cls(crawler.settings)
 
        # Note: this hook is a bit of a hack to intercept redirections
        crawler.signals.connect(mw.request_scheduled, signal=signals.request_scheduled)
 
        return mw
 
    def policy(self, resp_or_url, request):
        """
        Determine Referrer-Policy to use from a parent Response (or URL),
        and a Request to be sent.
 
        - if a valid policy is set in Request meta, it is used.
        - if the policy is set in meta but is wrong (e.g. a typo error),
          the policy from settings is used
        - if the policy is not set in Request meta,
          but there is a Referrer-policy header in the parent response,
          it is used if valid
        - otherwise, the policy from settings is used.
        """
        policy_name = request.meta.get('referrer_policy')
        if policy_name is None:
            if isinstance(resp_or_url, Response):
                policy_header = resp_or_url.headers.get('Referrer-Policy')
                if policy_header is not None:
                    policy_name = to_native_str(policy_header.decode('latin1'))
        if policy_name is None:
            return self.default_policy()
 
        cls = _load_policy_class(policy_name, warning_only=True)
        return cls() if cls else self.default_policy()
 
    def process_spider_output(self, response, result, spider):
        def _set_referer(r):
            if isinstance(r, Request):
                referrer = self.policy(response, r).referrer(response.url, r.url)
                if referrer is not None:
                    r.headers.setdefault('Referer', referrer)
            return r
        return (_set_referer(r) for r in result or ())
 
    def request_scheduled(self, request, spider):
        # check redirected request to patch "Referer" header if necessary
        redirected_urls = request.meta.get('redirect_urls', [])
        if redirected_urls:
            request_referrer = request.headers.get('Referer')
            # we don't patch the referrer value if there is none
            if request_referrer is not None:
                # the request's referrer header value acts as a surrogate
                # for the parent response URL
                #
                # Note: if the 3xx response contained a Referrer-Policy header,
                #       the information is not available using this hook
                parent_url = safe_url_string(request_referrer)
                policy_referrer = self.policy(parent_url, request).referrer(
                    parent_url, request.url)
                if policy_referrer != request_referrer:
                    if policy_referrer is None:
                        request.headers.pop('Referer')
                    else:
                        request.headers['Referer'] = policy_referrer
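
To see the middleware in action, here is a small sketch (assumed names, not from the original post) of the two ways its behaviour is usually configured: the REFERER_ENABLED / REFERRER_POLICY settings that feed from_crawler and __init__, and the per-request referrer_policy meta key that policy() reads:

# in settings.py (project-wide defaults):
# REFERER_ENABLED = True            # from_crawler raises NotConfigured when False
# REFERRER_POLICY = 'same-origin'   # loaded into self.default_policy in __init__

import scrapy

class RefererDemoSpider(scrapy.Spider):
    name = 'referer_demo'                    # hypothetical spider
    start_urls = ['https://www.yeves.cn/']

    def parse(self, response):
        # the 'referrer_policy' meta key overrides the project setting
        # for this single request (see RefererMiddleware.policy())
        yield scrapy.Request(
            'https://www.yeves.cn/archives/1',
            meta={'referrer_policy': 'same-origin'},
            callback=self.parse_article,
        )

    def parse_article(self, response):
        # the Referer header was filled in by process_spider_output()
        self.logger.info('Referer sent: %s',
                         response.request.headers.get('Referer'))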

  

A spider middleware can implement several hook methods; of these, the offsite middleware only uses process_spider_output (a minimal skeleton is sketched after this list):

process_spider_input (3)

process_spider_output (2)

process_start_requests (1)

process_spider_exception
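
Here is a minimal spider-middleware skeleton (an illustrative sketch, not from the original post; the class name is hypothetical) showing each of the hooks listed above. It would be enabled through the SPIDER_MIDDLEWARES setting in settings.py:

class MySpiderMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        # called by the middleware manager to build the instance
        return cls()

    def process_start_requests(self, start_requests, spider):
        # iterates over the spider's initial requests
        for request in start_requests:
            yield request

    def process_spider_input(self, response, spider):
        # runs for each response before it reaches the spider callback;
        # should return None or raise an exception
        return None

    def process_spider_output(self, response, result, spider):
        # runs over everything the spider callback yields
        # (the only hook OffsiteMiddleware implements)
        for request_or_item in result:
            yield request_or_item

    def process_spider_exception(self, response, exception, spider):
        # runs when the callback (or another middleware) raises an exception;
        # returning None lets other middlewares handle it
        return None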

 
