2. Python Scrapy shell study notes

Scrapy shell

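The Scrapy shell is an interactive shell in which you can try out and debug your scraping code very quickly, without having to run a spider. It is meant for testing data extraction code, but because it is also a regular Python shell you can use it to test any kind of code.

Configuration

From the official documentation: if IPython is installed, the Scrapy shell will use it instead of the standard Python console. The IPython console is much more powerful and provides smart auto-completion and colorized output, among other things.

We highly recommend installing IPython, especially if you are working on a Unix system (where IPython excels). See the IPython installation guide for details.

Scrapy also supports bpython, and will try to use it where IPython is unavailable.

To choose the shell explicitly, add a shell option to the [settings] section of your project's scrapy.cfg, for example ipython.

For the project from note 1 in E:\pythoncode, the setting looks like this: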

[settings]
shell = ipython
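As a quick check of the configuration above, here is a minimal sketch that reads the shell option back with Python's standard configparser module. The file contents are inlined as a string, and the default = myproject.settings line is an assumption about what a typical scrapy.cfg contains, not something taken from the project in these notes.

```python
import configparser

# Hypothetical scrapy.cfg contents, inlined so the sketch runs anywhere.
# Only the "shell = ipython" line comes from the notes above.
cfg_text = """
[settings]
default = myproject.settings
shell = ipython
"""

parser = configparser.ConfigParser()
parser.read_string(cfg_text)

# Fall back to the plain Python console when no shell option is present.
shell_name = parser.get("settings", "shell", fallback="python")
print(shell_name)
```

To check a real project, parser.read("scrapy.cfg") can replace read_string.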

Starting the shell

From the command line, run:
scrapy shell <url>

The Scrapy shell can also open local files, given either as a file path or as a file:// URL:

scrapy shell X:///XXX/XXX/XXX/XXX.html
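If you are unsure how to write the URL for a local file, the following sketch builds one with the standard pathlib module. The file name page.html is a hypothetical placeholder, and the file does not need to exist for the URL to be built.

```python
from pathlib import Path

# Hypothetical local HTML file; replace with the page you saved.
html_path = Path("page.html")

# as_uri() needs an absolute path, so resolve the path first.
file_url = html_path.resolve().as_uri()

# Print a command line that can be pasted into a terminal.
print(f'scrapy shell "{file_url}"')
```

On Windows the result looks like file:///E:/..., and on Unix like file:///home/...; the exact value depends on the current directory.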

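Usage

The Scrapy shell is just a regular Python console (or an IPython console, if you have it available) that provides some additional shortcut functions for convenience.

Available shortcuts

shelp() - print help with the list of available objects and shortcuts.

fetch(url[, redirect=True]) - fetch a new response from the given URL and update all related objects accordingly. Redirects are followed by default.

fetch(request) - fetch a new response from the given scrapy.Request and update all related objects accordingly.

view(response) - open the given response in your local web browser for inspection.

Available Scrapy objects

The Scrapy shell automatically creates some convenient objects from the downloaded page, such as the Response object and the Selector object.

crawler - the current Crawler object.
spider - the Spider known to handle the URL, or a generic Spider object if no spider was found for the current URL.
request - the Request object of the last fetched page. You can modify this request with replace(), or fetch a new request (without leaving the shell) with the fetch shortcut.
response - the Response object containing the last fetched page.
settings - the current Scrapy settings.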

Example of a shell session

First, change to E:\pythoncode, then start the shell:

scrapy shell "https://www.baidu.com" --nolog

The shell prints the available objects and shortcuts:

[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x0000000000B27390>
[s] item {}
[s] request <GET https://www.baidu.com>
[s] settings <scrapy.settings.Settings object at 0x0000000004BA03C8>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser

Next, enter:
 response.css("div.celltop a b::text").extract_first()
 'Information'

fetch("http://www.guoxuedashi.com/")
Note: the URL must include its scheme (http:// or https://).
Note: if scrapy shell was started without --nolog, a log line like the following is also shown:
DEBUG: Crawled (200)XXXXXXXXXXXXXXXXXX

response.css("a[target=_blank]::text").extract_first()
'四库全书'

request = request.replace(method="POST")

fetch(request)

Note: "POST", "GET", "PUT", "HEAD" and so on are HTTP request methods. GET is the usual choice; POST is used here only as an example.

response.status
200

Note: 200 is the HTTP status code of the response (OK).

from pprint import pprint

pprint(response.headers)

Note: pprint is a "pretty" version of print; it formats nested data structures so they are easier to read.
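To see what pprint does without a live response, the sketch below formats a plain dict that stands in for response.headers. The header names and values are hypothetical, and a real Scrapy headers object is not a plain dict, so this only illustrates the formatting.

```python
from pprint import pformat

# Hypothetical stand-in for response.headers (bytes keys, list values).
headers = {
    b"Server": [b"hypothetical-server"],
    b"Content-Type": [b"text/html; charset=UTF-8"],
    b"Content-Length": [b"1256"],
}

# pformat() returns the string that pprint() would print.
formatted = pformat(headers)
print(formatted)
```

When the dict does not fit on one line, each key is placed on its own line, and keys are sorted by default.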

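Invoking the shell from a spider

Sometimes you want to inspect the response being processed at a certain point of your spider, if only to check that the response you expect is getting there.

This can be done with the scrapy.shell.inspect_response function.

Create a spider file in E:\pythoncode\myproject\spiders with the following code: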

import scrapy


class MySpider(scrapy.Spider):
    name = "scrapy_sh"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

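    def parse(self, response):
        # Open the shell only for the example.org response.
        if ".org" in response.url:
            # Imported here so it is only loaded when the shell is needed.
            from scrapy.shell import inspect_response
            # Pauses the crawl and opens a shell bound to this response.
            inspect_response(response, self)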

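Note: when the spider runs (scrapy crawl scrapy_sh), the shell opens as soon as the example.org response reaches parse: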
response.url
'http://example.org'

response.css("p::text").extract()
["This domain is established to be used for illustrative examples in doc..........."]

view(response)
True

Note: press Ctrl+Z (on Windows) or Ctrl+D (on Unix) to exit the shell and resume the crawl.

Source: https://docs.scrapy.org/en/latest/topics/shell.html

posted @ 2018-10-14 21:58 cqdef_xxx
