1. Python Scrapy Command Line Tool: Study Notes (Command line tool)
The command-line tool
Scrapy is controlled through the scrapy command-line tool, referred to here as the "Scrapy tool" to distinguish it from its sub-commands, which we simply call "commands" or "Scrapy commands".
The Scrapy tool provides multiple commands for multiple purposes, and each command accepts a different set of arguments and options.
Creating a project
scrapy startproject myproject [project_dir]
Create the project from the command line:
scrapy startproject myproject E:\pythoncode\
This creates the myproject project under E:\pythoncode.
Then:
cd E:\pythoncode
Note: if project_dir is not specified, project_dir will be the same as myproject.
Controlling projects
For example: scrapy genspider mydomain mydomain.com
This creates a spider named mydomain that crawls the mydomain.com site.
You can see this spider's code under E:\pythoncode\myproject\spiders.
scrapy -h
We see the following:
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
[ more ] More commands available when run from project directory
Use "scrapy <command> -h" to see more info about a command
Note: in other words, if something is unclear, run scrapy <command> -h and learn it yourself.
The startproject command
Syntax: scrapy startproject <project_name> [project_dir]
Creates a new Scrapy project.
Example:
scrapy startproject myproject
The genspider command
Syntax: scrapy genspider [-t template] <name> <domain>
Creates a new spider file.
Example:
E:\pythoncode>scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
Note: these are the templates available to -t, from basic through xmlfeed.
E:\pythoncode>scrapy genspider example example.com
Created spider 'example' using template 'basic' in module:
  myproject.spiders.example
Note: scrapy genspider xx xx.com is equivalent to scrapy genspider -t basic xx xx.com.
E:\pythoncode>scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl' in module:
  myproject.spiders.scrapyorg
Note: a spider created with -t crawl is different from one created with -t basic; presumably the templates exist to cover the different needs of different sites.
This is just a convenient shortcut for creating spiders from predefined templates, and it is certainly not the only way to create them. Instead of using this command, you can write the spider source file yourself.
The crawl command
Syntax: scrapy crawl <spider>
Example:
E:\pythoncode>scrapy crawl mydomain
[scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: myproject)
.................
Note: the crawl command is used constantly in the tutorial examples; it starts crawling with the given spider.
The check command
Syntax: scrapy check [-l] <spider>
Example:
E:\pythoncode>scrapy check -l mydomain
E:\pythoncode>scrapy check -l
Note: nothing is printed here. check runs the spider's contract checks, rules written in the docstrings of its callbacks, and -l lists them; since this project's spiders define no contracts, there is nothing to list or report.
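The empty output makes more sense with a sketch of what check actually looks at: contract lines (@url, @returns, @scrapes) in a callback's docstring. A plain class is used here purely for illustration; in a real project the class subclasses scrapy.Spider:

```python
class MydomainSpiderSketch:
    # Illustrative only: a real spider subclasses scrapy.Spider.
    name = "mydomain"

    def parse(self, response):
        """Contracts that `scrapy check` would run for this callback:

        @url http://www.example.com/
        @returns items 1 10
        @scrapes title
        """
        pass
```

With docstrings like this in place, scrapy check -l lists the contracts and scrapy check fetches the @url page and verifies the @returns/@scrapes claims; without them you get the empty output shown above.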
The list command
Syntax: scrapy list
Example:
E:\pythoncode>scrapy list
example
mydomain
scrapyorg
Note: lists the names of all spiders in the project.
The edit command
Syntax: scrapy edit <spider>
Edit the given spider using the editor defined in the EDITOR environment variable or (if unset) the EDITOR setting.
This command is provided only as a convenience shortcut for the most common case, the developer is of course free to choose any tool or IDE to write and debug spiders.
Note: not entirely sure about this one, so the official wording is quoted above.
The fetch command
Syntax: scrapy fetch <url>
Example:
E:\pythoncode>scrapy fetch --nolog http://www.example.com/some/page.html
<?xml version="1.0" encoding="iso-8859-1"?>
.....................................
Note: downloads the given URL with the Scrapy downloader and writes the content to standard output.
E:\pythoncode>scrapy fetch --nolog --headers http://www.example.com/
> Accept-Language: en
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> User-Agent: Scrapy/1.5.1 (+https://scrapy.org)
> Accept-Encoding: gzip,deflate
.......................................................
Note: --headers shows what identity you present when visiting the URL (the request headers sent, and the response headers returned).
The view command
Syntax: scrapy view <url>
Example:
E:\pythoncode>scrapy view https://movie.douban.com/
Note: did you get a 403 error after it opened? Douban checks what identity you present when you visit. Let's see what --headers reveals:
E:\pythoncode>scrapy fetch --headers --nolog https://movie.douban.com/
> Accept-Encoding: gzip,deflate
> User-Agent: Scrapy/1.5.1 (+https://scrapy.org)
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en
>
< Server: dae
< Content-Type: text/html
< Date: Sun, 07 Oct 2018 15:06:06 GMT
Note: the User-Agent is Scrapy/1.5.1 (+https://scrapy.org); remember to change it the next time you crawl sites like this. If this is unfamiliar, see https://blog.csdn.net/u012195214/article/details/78889602
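One common way to change that identity is to override the default User-Agent in the project's settings.py. A sketch of the fragment (the browser string shown is just an example, not anything prescribed by Scrapy):

```python
# myproject/settings.py (fragment)
# Replace the default "Scrapy/x.y (+https://scrapy.org)" User-Agent,
# which sites such as douban reject with 403.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0 Safari/537.36"
)
```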
The shell command
Syntax: scrapy shell [url]
Example:
E:\pythoncode>scrapy shell https://movie.douban.com/
[scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: myproject)
........................
Note: the shell seems to work best with IPython installed; once inside, learn it from the tutorial examples on the official site.
E:\pythoncode>scrapy shell --nolog https://movie.douban.com/ -c "(response.status, response.url)"
(403, 'https://movie.douban.com/')
Note: 403 means access to https://movie.douban.com/ was refused; 200 would mean success. These are HTTP response codes; if they are unfamiliar, see https://blog.csdn.net/jackfrued/article/details/25662527
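The codes in that tuple are standard HTTP status codes, so you can decode them with Python's standard library alone; a quick illustration:

```python
from http import HTTPStatus

# Map the two codes mentioned above to their standard meanings.
print(HTTPStatus(403).phrase)  # Forbidden: the server refused the request
print(HTTPStatus(200).phrase)  # OK: the request succeeded
```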
The parse command
Syntax: scrapy parse <url> [options]
Fetches the given URL and parses it with the spider that handles it.
Example: the spiders here do not implement the needed parsing, so there is no example.
The settings command
Syntax: scrapy settings [options]
Example:
E:\pythoncode>scrapy settings --get BOT_NAME
myproject
Note: the project name.
E:\pythoncode>scrapy settings --get DOWNLOAD_DELAY
0
Note: the download delay. If unsure, open scrapy.cfg in the project directory and look for yourself, plus run scrapy settings -h.
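The scrapy.cfg mentioned above is a plain INI file, so you can also inspect it with Python's standard configparser. A sketch using the default contents startproject writes for the myproject example (your file may also contain commented-out deploy options):

```python
from configparser import ConfigParser

# Default scrapy.cfg contents written by `scrapy startproject myproject`.
cfg_text = """\
[settings]
default = myproject.settings

[deploy]
project = myproject
"""

parser = ConfigParser()
parser.read_string(cfg_text)
# The [settings] section tells the scrapy tool which settings module to load.
print(parser["settings"]["default"])  # myproject.settings
```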
The runspider command
Syntax: scrapy runspider <spider_file.py>
Runs a self-contained spider in a Python file, without creating a project.
Example:
E:\pythoncode>scrapy runspider E:\pythoncode\myproject\spiders\mydomain.py
[scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: myproject)
...................................
The version command
Syntax: scrapy version [-v]
Example:
E:\pythoncode>scrapy version
Scrapy 1.5.1
E:\pythoncode>scrapy version -v
Scrapy       : 1.5.1
lxml         : XXX
libxml2      : XXX
cssselect    : XXX
parsel       : XXX
w3lib        : XXX
Twisted      : XXX
Python       : 3.XXXX
pyOpenSSL    : XXX
cryptography : XXX
Platform     : XXX
Note: the version numbers of the libraries Scrapy depends on.
Official wording: Prints the Scrapy version. If used with -v it also prints Python, Twisted and Platform info, which is useful for bug reports.
The bench command
Syntax: scrapy bench
Note: runs a benchmark test.
Official wording: Run a quick benchmark test.
If unsure: https://docs.scrapy.org/en/latest/topics/benchmarking.html#benchmarking
Custom project commands
I have not tried these yet; something to explore some day.
There is always more to learn; the source of all of the above is the official documentation: https://docs.scrapy.org/en/latest/topics/commands.html