Debugging a crawl with the Scrapy shell
Reference: Scrapy shell — Scrapy 2.6.2 documentation
Use the scrapy.shell.inspect_response function to inspect responses while a crawl is running:
Example: invoking the shell from inside a spider
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response  # import inspect_response here
            inspect_response(response, self)  # drop into an interactive shell

        # Rest of parsing code.
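Assuming the spider file is saved inside a Scrapy project (the project name is up to you), it is started with the usual command:

scrapy crawl myspider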
When the spider runs and reaches the matching response, it drops into a shell session like this:
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x1e16b50>
...
>>> response.url
'http://example.org'
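Besides response, the shell banner lists other live objects such as crawler, spider, request, and settings. A quick sketch of inspecting them; the outputs shown assume the example spider above:

>>> spider.name
'myspider'
>>> request.url
'http://example.org'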
At this point, check whether the extraction you expected actually works:
>>> response.xpath('//h1[@class="fn"]')
[]
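The empty list means nothing matched. Before reaching for the browser, a few quick probes in the shell can narrow down why; the selectors below are illustrative, not taken from a real page:

>>> response.xpath('//h1').getall()       # is there any <h1> at all?
>>> response.css('h1.fn')                 # the same query as a CSS selector
>>> response.xpath('//*[@class="fn"]')    # is the class on a different tag?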
If the selectors still come up empty, open the response in your browser to inspect what the spider actually received:
>>> view(response)
True
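view(response) writes the response body to a temporary local file and opens it in the default browser, so the page shown is exactly what the spider downloaded (content rendered by JavaScript, for example, will be missing). You can also peek at the raw markup directly in the shell; a minimal sketch:

>>> response.text[:200]              # first 200 characters of the received HTML
>>> 'class="fn"' in response.text    # does the attribute occur anywhere in the body?

Note that the fetch shortcut is not usable in this shell, since the Scrapy engine is blocked while the shell is open.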
Finally, press Ctrl-D (or Ctrl-Z on Windows) to exit the shell and resume the crawl:
>>> ^D
2014-01-23 17:50:03-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
...