Debugging a crawl with the Scrapy shell

Reference: Scrapy shell — Scrapy 2.6.2 documentation

Use the scrapy.shell.inspect_response function to debug in the middle of a crawl:

Example: invoking the shell from within a spider

import scrapy
class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]
    
    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response   # import inspect_response here
            inspect_response(response, self)            # drop into the shell
        # Rest of parsing code.
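
To try this yourself, run the spider with the usual crawl command (this assumes the spider lives inside a Scrapy project and that myspider matches its name attribute):

scrapy crawl myspider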

When the spider runs and hits the matching response, it will drop into a shell session like this:

2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x1e16b50>
...
>>> response.url
'http://example.org'
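
Besides response, the shell exposes several other objects you can inspect, including request, spider and settings; type shelp() to list them all. A short transcript sketch:

>>> request.url
'http://example.org'
>>> spider.name
'myspider'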


At this point you can check whether the extraction you expect actually works:

>>> response.xpath('//h1[@class="fn"]')
[]
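
An empty list means the selector matched nothing, not that the page failed to load. You can iterate on the expression right in the shell; for instance, the heading on example.org carries no class attribute, so a looser selector would match (a sketch, with the output assumed from the current example.org page):

>>> response.xpath('//h1/text()').getall()
['Example Domain']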

It doesn't (the selector matches nothing), so you can open the response in a browser to inspect it:

>>> view(response)
True
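
view(response) saves the response body to a temporary file and opens it in your browser, so you see exactly what Scrapy received rather than what a browser would fetch. If you suspect the server is serving the crawler different content, you can also re-fetch from inside the shell with a modified request (a sketch; the User-Agent value is just an illustration):

>>> from scrapy import Request
>>> fetch(Request(response.url, headers={"User-Agent": "Mozilla/5.0"}))

fetch() then replaces the request and response objects in the shell with the new ones.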

Finally, press Ctrl-D (Ctrl-Z on Windows) to exit the shell and let the crawl continue:

>>> ^D
2014-01-23 17:50:03-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
...
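
As a closing note, inspect_response is only needed mid-crawl; to experiment with selectors before writing any spider code, you can launch the shell directly against a URL:

scrapy shell "http://example.org"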
