Debugging a crawl with the Scrapy shell

Reference: Scrapy shell — Scrapy 2.6.2 documentation

Use the scrapy.shell.inspect_response function to debug in the middle of a crawl:

Example: invoking the shell from inside a spider

import scrapy
class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]
    
    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response   # import inspect_response here
            inspect_response(response, self)            # drop into the shell
        # Rest of parsing code.
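
To try this out, start the spider the usual way; assuming the spider file lives inside a Scrapy project, that is simply:

scrapy crawl myspider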

Once the spider runs and hits the matching response, it drops into a shell like this:

2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x1e16b50>
...
>>> response.url
'http://example.org'
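
(The truncated banner above normally lists every object the shell exposes, such as request, response, settings and spider, along with shortcuts like shelp() and view(response).)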


Here you can check whether the extraction you expected actually works:

>>> response.xpath('//h1[@class="fn"]')
[]
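
When a selector comes up empty like this, a couple of broader queries can help narrow down why. The follow-ups below are illustrative, not part of the original transcript:

>>> response.xpath('//h1').getall()        # is there any <h1> on the page at all?
>>> response.css('h1::attr(class)').get()  # if so, which class does it actually carry?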

If things still look wrong after that, you can open the response in a browser to see what Scrapy actually downloaded:

>>> view(response)
True
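
view(response) saves the response body to a temporary file and opens it in your default browser, so what you see is exactly the HTML Scrapy received rather than what a JavaScript-enabled browser would render. Note that the fetch() shortcut cannot be used in a shell opened by inspect_response, because the Scrapy engine is blocked while the shell is running.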

Finally, press Ctrl-D (Ctrl-Z on Windows) to exit the shell and let the crawl resume:

>>> ^D
2014-01-23 17:50:03-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
...
