Scrapy: losing meta information

While scraping second-hand housing listings from 58.com (58同城), I hit a snag: the detail page obfuscates the price with a custom encoding, so the only way to get it is to grab each listing's price from the listing page at the same time as its detail-page URL, and pass it on to the next callback through the request's meta.

The problem: the callback can't find the meta set by the original request:

Traceback (most recent call last):
  File "c:\users\chen\python36\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\users\chen\python36\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
    for x in result:
  File "c:\users\chen\python36\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\users\chen\python36\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\users\chen\python36\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\oldHouse\oldHouse\spiders\old58House.py", line 69, in parse_detail
    item = response.meta['item']
KeyError: 'item'

My first guess: the request goes through various retries and gets redirected to jump_url or the firewall page, and maybe the retry and redirect middlewares only carry over the new URL without preserving the original meta. So how do these two middlewares actually handle a request?

1. First, the redirect middleware, scrapy.downloadermiddlewares.redirect.BaseRedirectMiddleware. The key code is in the _redirect method:

    def _redirect(self, redirected, request, spider, reason):
        ttl = request.meta.setdefault('redirect_ttl', self.max_redirect_times)
        redirects = request.meta.get('redirect_times', 0) + 1

        if ttl and redirects <= self.max_redirect_times:
            redirected.meta['redirect_times'] = redirects
            redirected.meta['redirect_ttl'] = ttl - 1
            redirected.meta['redirect_urls'] = request.meta.get('redirect_urls', []) + \
                [request.url]
            redirected.dont_filter = request.dont_filter
            redirected.priority = request.priority + self.priority_adjust
            logger.debug("Redirecting (%(reason)s) to %(redirected)s from %(request)s",
                         {'reason': reason, 'redirected': redirected, 'request': request},
                         extra={'spider': spider})
            return redirected
        else:
            logger.debug("Discarding %(request)s: max redirections reached",
                         {'request': request}, extra={'spider': spider})
            raise IgnoreRequest("max redirections reached")

As you can see, the only meta operations in _redirect are maintaining the redirect TTL and the redirect counter (redirect_ttl and redirect_times); the original meta is not lost.
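The `redirected` request that `_redirect` receives is built in the middleware's `process_response` via `request.replace(url=redirected_url)`, and `Request.replace` copies every attribute that isn't explicitly overridden, meta included. A toy stand-in (not scrapy's real `Request` class) illustrates the copy semantics:

```python
class FakeRequest:
    """Simplified stand-in for scrapy.Request, just enough to show
    that replace() carries attributes over unless overridden."""

    def __init__(self, url, meta=None, dont_filter=False, priority=0):
        self.url = url
        self.meta = dict(meta) if meta else {}
        self.dont_filter = dont_filter
        self.priority = priority

    def replace(self, **kwargs):
        # Any attribute not named in kwargs keeps its current value,
        # which is why meta survives a redirect in the real middleware.
        for attr in ('url', 'meta', 'dont_filter', 'priority'):
            kwargs.setdefault(attr, getattr(self, attr))
        return FakeRequest(**kwargs)


original = FakeRequest('http://real_url', meta={'item': {'price': '165万'}})
redirected = original.replace(url='http://jump_url')
assert redirected.meta == {'item': {'price': '165万'}}  # meta survives
```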

2. Next, the retry middleware, scrapy.downloadermiddlewares.retry.RetryMiddleware. The key code is in the _retry method:

    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1

        retry_times = self.max_retry_times

        if 'max_retry_times' in request.meta:
            retry_times = request.meta['max_retry_times']

        stats = spider.crawler.stats
        if retries <= retry_times:
            logger.debug("Retrying %(request)s (failed %(retries)d times): %(reason)s",
                         {'request': request, 'retries': retries, 'reason': reason},
                         extra={'spider': spider})
            retryreq = request.copy()
            retryreq.meta['retry_times'] = retries
            retryreq.dont_filter = True
            retryreq.priority = request.priority + self.priority_adjust

            if isinstance(reason, Exception):
                reason = global_object_name(reason.__class__)

            stats.inc_value('retry/count')
            stats.inc_value('retry/reason_count/%s' % reason)
            return retryreq

As you can see, the only meta operation in _retry is updating the retry counter (retry_times); the original meta is not lost either.
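The counter logic above can be sketched in isolation. This is a simplified stand-in for `_retry`'s meta handling, not scrapy itself; it only models what happens to the meta dict when a request is retried:

```python
def fake_retry(meta, max_retry_times=2):
    """Return the meta dict of the retried request, or None if the
    retry limit is exhausted (the request is dropped)."""
    retries = meta.get('retry_times', 0) + 1
    # A per-request 'max_retry_times' key overrides the global setting,
    # just like in the real middleware.
    limit = meta.get('max_retry_times', max_retry_times)
    if retries > limit:
        return None
    new_meta = dict(meta)            # request.copy() keeps the whole meta
    new_meta['retry_times'] = retries
    return new_meta


meta = {'item': {'price': '165万'}}
meta = fake_retry(meta)
assert meta == {'item': {'price': '165万'}, 'retry_times': 1}
meta = fake_retry(meta)
assert meta['retry_times'] == 2
assert fake_retry(meta) is None      # third retry exceeds the default limit
```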

It turns out the framework is fine, and my initial hunch was pointing in the right direction: the culprit was my own custom redirect middleware, which rebuilt the request and dropped the meta:

    return Request(request.url, callback=spider.parse_detail, dont_filter=True)

When real_url got redirected to the firewall page, my middleware refused to follow the redirect and re-requested real_url instead. Crucially, the new request did not carry real_url's meta, so this is exactly where the meta was lost!

First attempt at a fix:

    return Request(request.url, callback=spider.parse_detail, meta=response.meta, dont_filter=True)

Since I had debugged into the redirect middleware and seen that response.url and request.url were identical, I assumed meta=response.meta would work just as well as meta=request.meta. Wrong: it raises the following error:

"Response.meta not available, this response is not tied to any request"

That means this response hasn't been tied to any request yet. Reading the source shows that the engine is what ties a response to its request:

source: scrapy/core/engine.py, lines 230~241, scrapy 1.5.0

    def _download(self, request, spider):
        slot = self.slot
        slot.add_request(request)
        def _on_success(response):
            assert isinstance(response, (Response, Request))
            if isinstance(response, Response):
                response.request = request # tie request to response received
                logkws = self.logformatter.crawled(request, response, spider)
                logger.log(*logformatter_adapter(logkws), extra={'spider': spider})
                self.signals.send_catch_log(signal=signals.response_received, \
                    response=response, request=request, spider=spider)
            return response

A request travels to the spider like this:

1) request --> 2) downloader middleware --> 3) downloader --> 4) downloader middleware --> 5) engine --> 6) spider middleware --> 7) spider

My middleware is running at stage 4), before the engine has tied the response to its request, so the error is to be expected. (The spider can use response.meta because it sits at stage 7), after the binding has happened.)
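The error message itself comes from the Response.meta property, which simply delegates to self.request.meta. A simplified stand-in mirroring that property (the real one lives in scrapy.http.response.Response) makes the failure mode easy to reproduce:

```python
class FakeRequest:
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = dict(meta) if meta else {}


class FakeResponse:
    """Stand-in mirroring scrapy's Response.meta property."""

    def __init__(self, url, request=None):
        self.url = url
        self.request = request  # the engine sets this after the download

    @property
    def meta(self):
        try:
            return self.request.meta
        except AttributeError:
            raise AttributeError("Response.meta not available, "
                                 "this response is not tied to any request")


resp = FakeResponse('http://real_url')         # stage 4): not tied yet
try:
    resp.meta                                  # raises: no request bound
except AttributeError as exc:
    assert 'not tied to any request' in str(exc)

resp.request = FakeRequest(resp.url, meta={'item': 1})  # stage 5): engine ties it
assert resp.meta == {'item': 1}                # stage 7): the spider reads it
```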

Second attempt:

    return Request(request.url, callback=spider.parse_detail, meta=request.meta, dont_filter=True)

This time the meta came through just fine.

The takeaway for me: when something breaks, suspect your own code before suspecting the framework. Scrapy is pretty solid after all 😂

posted on 2019-04-22 14:59 by Tarantino