Web Crawling with Scrapy
Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, such as data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is versatile: it is used for data mining, monitoring, and automated testing.
Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows.
Scrapy consists of the following components:
- Engine
Handles the data flow between all components of the system and triggers events (the core of the framework).
- Scheduler
Accepts requests from the engine, pushes them onto a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs (the addresses, or links, of the pages to crawl): it decides which URL to crawl next and removes duplicate URLs.
- Downloader
Downloads page content and returns it to the spiders. (Scrapy's downloader is built on Twisted's efficient asynchronous model.)
- Spiders
Spiders do the main work: they extract the information you need from specific pages, i.e. the so-called items. You can also extract links from a page and let Scrapy go on to crawl the next one.
- Item Pipeline
Processes the items the spiders extract from pages; its main jobs are persisting items, validating them, and cleaning out unneeded data. After a page has been parsed by a spider, its items are sent to the pipeline and pass through several processing stages in a specific order.
- Downloader middlewares
A framework of hooks between the engine and the downloader; they mainly process the requests and responses passing between the two.
- Spider middlewares
A framework of hooks between the engine and the spiders; their main job is to process the spiders' response input and request output.
- Scheduler middlewares
Middleware between the engine and the scheduler that processes the requests and responses sent from the engine to the scheduler.
The data flow in Scrapy is controlled by the execution engine. The steps are:
- 1. The spiders generate requests and hand them to the engine.
- 2. The engine passes each request on to the scheduler, which stores the requests in a queue and hands one back each time the engine asks for the next.
- 3. The scheduler returns the next URL to crawl to the engine.
- 4. The engine sends the scheduled request to the downloader, passing it through the downloader middlewares (each middleware has at least two methods: one for requests and one for responses).
- 5. Once the download completes, the downloader builds a response and returns it through the downloader middlewares back to the engine.
- 6. The engine passes the response through the spider middlewares to the spiders. A spider does two things: it generates requests, and it binds a callback function to each request. Spiders are responsible only for parsing, not for storage.
- 7. After parsing, the spider returns the parsed items, plus any (follow-up) new requests, to the engine; the items are then handled by the item pipelines.
- 8. The engine sends the scraped items (returned by the spider) to the item pipeline, which persists them, e.g. into a database. If the data is invalid, it can be re-wrapped as a request and handed back to the scheduler.
- 9. The process repeats (from step 2) until there are no more requests in the scheduler, and the engine closes the site.
In short, the Scrapy run loop is:
- The engine takes a URL from the scheduler for the next crawl.
- The engine wraps the URL in a Request and passes it to the downloader.
- The downloader fetches the resource and wraps it in a Response.
- The spider parses the Response.
- If an Item is parsed out, it is handed to the item pipeline for further processing.
- If a URL is parsed out, it is handed to the scheduler to wait its turn to be crawled.
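The loop above can be sketched in plain Python. This is a toy simulation of the engine/scheduler/downloader/spider interaction, not Scrapy's actual implementation; the `Request` class, `download` function, and `parse` callback are simplified stand-ins:

```python
from collections import deque

class Request:
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback

def download(request):
    # Toy downloader: pretend every URL returns its own text as the "page".
    return {"url": request.url, "body": f"page at {request.url}"}

def parse(response):
    # Toy spider callback: yield one item, and follow one link from the start page.
    yield {"item": response["body"]}
    if response["url"] == "http://example.com/":
        yield Request("http://example.com/page2", callback=parse)

def crawl(start_urls):
    scheduler, items, seen = deque(), [], set()
    for url in start_urls:                        # 1. spider produces initial requests
        scheduler.append(Request(url, parse))
    while scheduler:                              # 9. repeat until the scheduler is empty
        request = scheduler.popleft()             # 3. scheduler returns the next request
        if request.url in seen:                   # duplicate URL filtering
            continue
        seen.add(request.url)
        response = download(request)              # 4-5. downloader builds a response
        for result in request.callback(response): # 6. spider parses the response
            if isinstance(result, Request):
                scheduler.append(result)          # new requests go back to the scheduler
            else:
                items.append(result)              # 7-8. items go to the pipeline
    return items

print(crawl(["http://example.com/"]))
```

The same yield-a-Request-or-an-Item dispatch is what real Scrapy callbacks rely on.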
The Scrapy framework is divided into seven core components:
1. Engine (ENGINE)
The engine controls the data flow between all components of the system and triggers events when certain actions occur. See the data-flow section above for details.
2. Scheduler (SCHEDULER)
Accepts requests from the engine, pushes them onto a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs: it decides which URL to crawl next and removes duplicates.
3. Downloader (DOWNLOADER)
Downloads page content and returns it to the engine. The downloader is built on Twisted's efficient asynchronous model.
4. Spiders (SPIDERS)
Spiders are developer-defined classes used to parse responses, extract items, or send new requests.
5. Item Pipelines (ITEM PIPELINES)
Responsible for processing items after they are extracted; typical operations are cleaning, validation, and persistence (for example, storing to a database).
6. Downloader Middlewares
Specific hooks sitting between the engine and the downloader that process the responses the downloader passes to the engine. They provide a convenient mechanism for extending Scrapy by plugging in custom code.
7. Spider Middlewares
Specific hooks sitting between the engine and the spiders that process a spider's input (responses) and output (items and requests). They likewise provide a convenient mechanism for extending Scrapy with custom code.
I. Installation
```shell
# Linux
pip3 install scrapy   # the twisted module must be installed first

# Windows
# a. pip3 install wheel
# b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
# c. cd into the download directory and run:
pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
# d. pip3 install scrapy
# e. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/
```
II. Basic Usage
1. Basic commands
```shell
# Step 1: create a project (creates a project directory in the current
# directory, similar to Django)
scrapy startproject <project_name>

# Step 2: create a spider
cd <project_name>
scrapy genspider [-t template] <name> <domain>
# e.g.:
scrapy genspider -t basic oldboy oldboy.com
scrapy genspider -t xmlfeed autohome autohome.com.cn
# List available templates:     scrapy genspider -l
# Show a template's contents:   scrapy genspider -d <template_name>

# Show the list of spiders in the project
scrapy list

# Run a single spider (first cd into the project directory)
scrapy crawl <spider_name>
```
The command-line tool
```shell
# 1. Get help
scrapy -h
scrapy <command> -h

# 2. There are two kinds of commands: project-only commands must be run
#    inside a project directory, while global commands can run anywhere.
# Global commands:
startproject   # create a project
genspider      # create a spider
settings       # inside a project directory, shows that project's settings
runspider      # run a standalone Python spider file, no project needed
shell          # scrapy shell <url>: interactive debugging, e.g. testing selector rules
fetch          # fetch a single page independently of any project; can show request headers
view           # download a page and open it in the browser, useful for spotting Ajax-loaded data
version        # scrapy version shows the Scrapy version; scrapy version -v also shows dependency versions
# Project-only commands:
crawl          # run a spider; requires a project, and make sure ROBOTSTXT_OBEY = False in settings
check          # check the project for syntax errors
list           # list the spiders contained in the project
edit           # open a spider in an editor (rarely used)
parse          # scrapy parse <url> --callback <fn>: verify that a callback function works
bench          # scrapy bench: stress test

# 3. Official documentation
# https://docs.scrapy.org/en/latest/topics/commands.html
```
```shell
# Global commands can be run from any directory, independently of a project;
# project commands must be run inside the project directory.
scrapy startproject Myproject          # 1. create a project
cd Myproject
scrapy genspider baidu www.baidu.com   # 2. create a spider; "baidu" is the spider
                                       #    name used to locate it (a default start
                                       #    URL is generated from the domain)
scrapy settings --get BOT_NAME         # 3. read a value from the settings

# Global:
scrapy runspider baidu.py                    # 4. run a standalone spider file
scrapy runspider AMAZON\spiders\amazon.py    # 5. run a spider by file path
# Inside the project:
scrapy crawl amazon                          # run a spider by name

# robots.txt, the anti-crawling convention: a file on the target site stating
# which pages may and may not be crawled. Whether ignoring it is legal varies
# by jurisdiction. Scrapy defaults to ROBOTSTXT_OBEY = True;
# set ROBOTSTXT_OBEY = False to stop honoring robots.txt.

scrapy shell https://www.baidu.com     # 6. send a request directly to the target site
# In the shell:
#   response
#   response.status
#   response.body
#   view(response)
scrapy view https://www.taobao.com     # 7. if the rendered page is incomplete, the
                                       #    missing parts are loaded via Ajax; this
                                       #    helps locate the problem quickly
scrapy version                         # 8. show the Scrapy version
scrapy version -v                      # 9. also show the versions of Scrapy's dependencies
scrapy fetch --nolog http://www.logou.com            # 10. fetch the response body
scrapy fetch --nolog --headers http://www.logou.com  # 11. fetch the response headers

# Example output:
(venv3_spider) E:\twisted\scrapy框架\AMAZON>scrapy fetch --nolog --headers http://www.logou.com
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en
> User-Agent: Scrapy/1.5.0 (+https://scrapy.org)
> Accept-Encoding: gzip,deflate
>
< Content-Type: text/html; charset=UTF-8
< Date: Tue, 23 Jan 2018 15:51:32 GMT
< Server: Apache
# ">" marks the request headers, "<" marks the response headers.

scrapy check                           # 12. check the spiders for errors
scrapy list                            # 13. list all spider names
scrapy parse http://quotes.toscrape.com/ --callback parse  # 14. verify that the callback runs
scrapy bench                           # 15. stress test
```
Specifying an output format:
```shell
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.json
scrapy crawl quotes -o quotes.jl       # JSON Lines format
scrapy crawl quotes -o quotes.csv      # CSV format
scrapy crawl quotes -o quotes.marshal  # Python marshal serialization
# Remote destinations are also supported:
scrapy crawl quotes -o ftp://user:pass@ftp.example.com/path/quotes.csv
```
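The `.jl` (JSON Lines) format stores one JSON object per line, which is what makes it convenient for incrementally exported feeds. A quick sketch of how such a file's contents are parsed (the sample content here is made up):

```python
import json

# Each line of a .jl feed is one independent JSON document.
sample = '{"text": "quote one"}\n{"text": "quote two"}\n'
items = [json.loads(line) for line in sample.splitlines()]
print(items)  # one dict per line
```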
2. Project structure and a minimal spider
```
project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spider1.py
            spider2.py
            spider3.py
```
File overview:
- scrapy.cfg: the project's main configuration. (The actual crawler-related settings live in settings.py.)
- items.py: defines the data models used to structure scraped data, similar to Django's models.
- pipelines.py: data-processing behavior, e.g. persisting structured data.
- settings.py: configuration such as recursion depth, concurrency, download delay, and so on.
- spiders/: the spiders directory; create files here and write the crawling rules in them.
Note: spider files are usually named after the target site's domain.
```python
import scrapy

class XiaoHuarSpider(scrapy.spiders.Spider):
    name = "baidu"                   # spider name (required)
    allowed_domains = ["baidu.com"]  # allowed domains
    start_urls = [
        "http://www.baidu.com/",     # starting URL
    ]

    def parse(self, response):
        # Callback invoked with the result of fetching each start URL
        pass
```
```python
# Workaround for console encoding errors on Windows, such as:
#   UnicodeEncodeError: 'gbk' codec can't encode character '\u2764' in position ...
import sys, io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')
```
```python
# By default spiders are run from the command line. To run them from
# PyCharm, create entrypoint.py in the project directory:
from scrapy.cmdline import execute

# execute(['scrapy', 'crawl', 'amazon', '--nolog'])  # suppress log output
# execute(['scrapy', 'crawl', 'amazon'])

# To pass command-line arguments to the spider, equivalent to
# `scrapy crawl amazon1 -a keyword=iphone8`:
execute(['scrapy', 'crawl', 'amazon1', '-a', 'keyword=iphone8', '--nolog'])

# execute(['scrapy', 'crawl', 'amazon1'])
```
3. Demo
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import HtmlXPathSelector, Selector
from scrapy.http import Request

class XiaohuarSpider(scrapy.Spider):
    name = 'xiaohuar'
    allowed_domains = ['xiaohuar.com']
    start_urls = ['http://www.xiaohuar.com/hua/']

    def parse(self, response):  # callback
        # Deprecated style:
        # hxs = HtmlXPathSelector(response)
        # result = hxs.select('//a[@class="item_list"]')

        hxs = Selector(response=response)
        user_list = hxs.xpath('//div[@class="item masonry_brick"]')
        for item in user_list:
            price = item.xpath('./span[@class="price"]/text()').extract_first()
            url = item.xpath('div[@class="item_t"]/div[@class="class"]//a/@href').extract_first()
            print(price, url)

        result = hxs.xpath('//a[re:test(@href,"http://www.xiaohuar.com/list-1-\d+.html")]/@href')
        print(result)
        result = ['http://www.xiaohuar.com/list-1-1.html', 'http://www.xiaohuar.com/list-1-2.html']

        # Recursive crawling rule: yielding a Request object puts it into
        # Scrapy's scheduler for recursive crawling; yielding other objects
        # sends them to the pipeline for persistence.
        for url in result:
            yield Request(url=url, callback=self.parse)
```
To run this spider, cd into the project directory in a terminal and execute:

```shell
scrapy crawl xiaohuar --nolog
```
The key points in the code above:
- Request is a class that wraps a user request; yielding it from a callback tells Scrapy to continue crawling.
- HtmlXPathSelector structures the HTML and provides selector functionality (it is being deprecated in favor of Selector).
4. Selectors
```
scrapy shell https://doc.scrapy.org/en/latest/_static/selectors-sample1.html

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

In [5]: response.selector.xpath('//title/text()').extract_first()
Out[5]: 'Example website'

In [8]: response.selector.css('title::text').extract_first()
Out[8]: 'Example website'

# .selector can be omitted:
In [16]: response.css('title::text').extract_first()
Out[16]: 'Example website'

In [17]: response.xpath('//title/text()').extract_first()
Out[17]: 'Example website'

In [18]: response.xpath('//title/text()')
Out[18]: [<Selector xpath='//title/text()' data='Example website'>]

# With CSS selectors, text is taken with e.g. .entry-header h1::text,
# and attributes with e.g. .entry-header a::attr(href)

# In '//div[@id="images"]', "images" must be in double quotes because the
# expression itself is wrapped in single quotes:
In [20]: response.xpath('//div[@id="images"]').css('img::attr(src)')
Out[20]:
[<Selector xpath='descendant-or-self::img/@src' data='image1_thumb.jpg'>,
 <Selector xpath='descendant-or-self::img/@src' data='image2_thumb.jpg'>,
 <Selector xpath='descendant-or-self::img/@src' data='image3_thumb.jpg'>,
 <Selector xpath='descendant-or-self::img/@src' data='image4_thumb.jpg'>,
 <Selector xpath='descendant-or-self::img/@src' data='image5_thumb.jpg'>]

In [23]: response.xpath('//div[@id="images"]').css('img::attr(src)').extract()
Out[23]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

In [21]: response.xpath('//div[@id="images"]').css('img::attr(src)').extract_first()
Out[21]: 'image1_thumb.jpg'

# extract_first's default parameter is returned when the attribute is not
# found, preventing an error:
In [25]: response.xpath('//div[@id="images"]').css('img::attr(srcc)').extract_first(default='replace')
Out[25]: 'replace'

# Getting attributes with XPath and CSS:
In [27]: response.xpath('//a/@href')
Out[27]:
[<Selector xpath='//a/@href' data='image1.html'>,
 <Selector xpath='//a/@href' data='image2.html'>,
 <Selector xpath='//a/@href' data='image3.html'>,
 <Selector xpath='//a/@href' data='image4.html'>,
 <Selector xpath='//a/@href' data='image5.html'>]

In [28]: response.xpath('//a/@href').extract()
Out[28]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [29]: response.css('a::attr(href)')
Out[29]:
[<Selector xpath='descendant-or-self::a/@href' data='image1.html'>,
 <Selector xpath='descendant-or-self::a/@href' data='image2.html'>,
 <Selector xpath='descendant-or-self::a/@href' data='image3.html'>,
 <Selector xpath='descendant-or-self::a/@href' data='image4.html'>,
 <Selector xpath='descendant-or-self::a/@href' data='image5.html'>]

In [30]: response.css('a::attr(href)').extract()
Out[30]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

# Getting text with XPath and CSS:
In [35]: response.css('a::text')
Out[35]:
[<Selector xpath='descendant-or-self::a/text()' data='Name: My image 1 '>,
 <Selector xpath='descendant-or-self::a/text()' data='Name: My image 2 '>,
 <Selector xpath='descendant-or-self::a/text()' data='Name: My image 3 '>,
 <Selector xpath='descendant-or-self::a/text()' data='Name: My image 4 '>,
 <Selector xpath='descendant-or-self::a/text()' data='Name: My image 5 '>]

In [36]: response.css('a::text').extract_first()
Out[36]: 'Name: My image 1 '

In [37]: response.xpath('//a/text()')
Out[37]:
[<Selector xpath='//a/text()' data='Name: My image 1 '>,
 <Selector xpath='//a/text()' data='Name: My image 2 '>,
 <Selector xpath='//a/text()' data='Name: My image 3 '>,
 <Selector xpath='//a/text()' data='Name: My image 4 '>,
 <Selector xpath='//a/text()' data='Name: My image 5 '>]

In [38]: response.xpath('//a/text()').extract_first()
Out[38]: 'Name: My image 1 '

# Selecting tags whose attribute contains a given substring:
In [42]: response.xpath('//a[contains(@href, "image")]/@href')
Out[42]:
[<Selector xpath='//a[contains(@href, "image")]/@href' data='image1.html'>,
 <Selector xpath='//a[contains(@href, "image")]/@href' data='image2.html'>,
 <Selector xpath='//a[contains(@href, "image")]/@href' data='image3.html'>,
 <Selector xpath='//a[contains(@href, "image")]/@href' data='image4.html'>,
 <Selector xpath='//a[contains(@href, "image")]/@href' data='image5.html'>]

In [46]: response.css('a[href*="image"]::attr(href)')
Out[46]:
[<Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image1.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image2.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image3.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image4.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image5.html'>]

# CSS attribute selectors (http://www.w3school.com.cn/cssref/css_selectors.asp):
# [attribute^=value]  a[src^="https"]  selects every <a> whose src starts with "https"
# [attribute$=value]  a[src$=".pdf"]   selects every <a> whose src ends with ".pdf"
# [attribute*=value]  a[src*="abc"]    selects every <a> whose src contains "abc"

# Chained selection, XPath:
In [49]: response.xpath('//a[contains(@href, "image")]/img/@src').extract()
Out[49]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

# Chained selection, CSS:
In [54]: response.css('a[href*=image] img::attr(src)').extract()
Out[54]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

# Regular expressions via .re(), on both CSS and XPath results:
In [55]: response.css('a::text').re('Name:(.*)')
Out[55]:
[' My image 1 ',
 ' My image 2 ',
 ' My image 3 ',
 ' My image 4 ',
 ' My image 5 ']

In [56]: response.xpath('//a/text()').re('Name:(.*)')
Out[56]:
[' My image 1 ',
 ' My image 2 ',
 ' My image 3 ',
 ' My image 4 ',
 ' My image 5 ']

# re_first returns only the first match:
In [61]: response.css('a::text').re_first('Name:(.*)').strip()
Out[61]: 'My image 1'

In [59]: response.xpath('//a/text()').re_first('Name:(.*)').strip()
Out[59]: 'My image 1'
```
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from scrapy.selector import Selector, HtmlXPathSelector
from scrapy.http import HtmlResponse


html = """<!DOCTYPE html>
<html>
<head lang="en">
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
    <ul>
        <li class="item-"><a id='i1' href="link.html">first item</a></li>
        <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
        <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
    </ul>
    <div><a href="llink2.html">second item</a></div>
</body>
</html>
"""
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
hxs = HtmlXPathSelector(response)
print(hxs)
hxs = Selector(response=response).xpath('//a')  # all <a> tags in the page
for i in hxs:
    print(i)
print(hxs)

hxs = Selector(response=response).xpath('//a[2]')  # empty: this selects <a> tags
                                                   # that are the second <a> child
                                                   # of their parent, and no parent
                                                   # here has two <a> children
for i in hxs:
    print(i)
print(hxs)

hxs = Selector(response=response).xpath('//a[@id]')  # <a> tags that have an id attribute
print(hxs)

hxs = Selector(response=response).xpath('//a[@id="i1"]')  # <a> tags with id="i1"
print(hxs)

hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')  # href="link.html" AND id="i1"
print(hxs)

hxs = Selector(response=response).xpath('//a[contains(@href, "link2")]')  # contains(): <a> tags whose href contains "link2"
print(hxs)

hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')  # href starts with "link"
print(hxs)

hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]')  # regex match: <a> tags whose id matches "i" + digits
print(hxs)
hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract()  # text of those <a> tags
print(hxs)
hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract()  # href of those <a> tags
print(hxs)
hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()  # all @href values along the path (a list)
print(hxs)
hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()  # first @href value along the path
print(hxs)


# Relative XPath
ul_list = Selector(response=response).xpath('//body/ul/li')
for item in ul_list:
    # v = item.xpath('./a/span')
    # or
    # v = item.xpath('a/span')
    # or
    v = item.xpath('*/a/span')
    print(v)
```
```python
for item in obj:
    price = item.xpath('.//span[@class="price"]/text()').extract_first()
    url = item.xpath('div[@class="item_t"]/div[@class="class"]//a/@href').extract_first()

# In general:
#   //  searches all descendants
#   /   searches direct children only

# Relative to the current node:
#   item.xpath('./a')   # search among the current node's children
#   item.xpath('a')     # same: search among the current node's children
#   item.xpath('.//a')  # search among all of the current node's descendants

# A few simple examples:
#   /html/head/title        : the <title> element under <head> in the HTML document
#   /html/head/title/text() : the text inside that <title> element
#   //td                    : all <td> elements
#   //div[@class="mine"]    : all <div> elements with class="mine"
```
Basic path expressions:
Expression | Description |
---|---|
nodename | Selects all child nodes of the named node. |
/ | Selects from the root node. |
// | Selects matching nodes anywhere in the document, regardless of their position. |
. | Selects the current node. |
.. | Selects the parent of the current node. |
@ | Selects attributes. |
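These path expressions can also be tried outside Scrapy: Python's standard-library `xml.etree.ElementTree` supports a limited XPath subset, which is handy for quick experiments. (A sketch only; Scrapy's `Selector`, backed by lxml, supports much more, including `contains()` and `re:test()`.)

```python
import xml.etree.ElementTree as ET

html = """
<body>
  <ul>
    <li class="item-0"><a id="i2" href="llink.html">first item</a></li>
    <li class="item-1"><a href="llink2.html">second item</a></li>
  </ul>
</body>
"""
root = ET.fromstring(html)

# //a -> all <a> descendants (ElementTree spells it ".//a")
links = root.findall(".//a")
print([a.get("href") for a in links])       # attributes are read with .get()

# //a[@id="i2"] -> <a> elements with a matching attribute value
print(root.find(".//a[@id='i2']").text)
```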
Note: set DEPTH_LIMIT = 1 in settings.py to limit the crawl ("recursion") depth.
III. Customizing Scrapy
1. Start URLs and parse
```python
import scrapy
from scrapy.http import Request

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']  # start URLs

    def start_requests(self):
        # Override Spider.start_requests to customize the callback name
        for url in self.start_urls:
            yield Request(url, dont_filter=True, callback=self.parse1)

    def parse1(self, response):
        pass
```
2. POST requests and request headers
Recall how the requests module sends GET or POST requests:
```python
requests.get(params={}, headers={}, cookies={})
requests.post(params={}, headers={}, cookies={}, data={}, json={})

# Scrapy's Request constructor, by comparison, takes:
#   url,
#   method='GET',
#   headers=None,
#   body=None,
#   cookies=None,
#   ...
```
Parameters for sending GET and POST requests in Scrapy:
```python
# GET request:
#   url, method='GET', headers={}, cookies={}   (dict or CookieJar)

# POST request:
#   url, method='POST', headers={}, cookies={}, body=None

# Content-Type: application/x-www-form-urlencoded; charset=UTF-8
# For this Content-Type the body must be a string like "arg1=66&arg2=66&oneMonth=1":
form_data = {
    'user': 'alex',
    'pwd': 123,
}
import urllib.parse
data = urllib.parse.urlencode({'k1': 'v1', 'k2': 'v2'})  # dict -> "k1=v1&k2=v2"

# e.g. "phone=86155fa&password=asdf&oneMonth=1"

# Content-Type: application/json; charset=UTF-8
# For this Content-Type the body is simply json.dumps() of the dict:
# json.dumps({'k1': 'v1', 'k2': 'v2'})  -> '{"k1": "v1", "k2": "v2"}'

# Example:
Request(
    url='http://dig.chouti.com/login',
    method='POST',
    headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
    body='phone=8615131255089&password=pppppppp&oneMonth=1',
    callback=self.check_login
)
```
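The two body encodings can be checked directly with the standard library. A minimal sketch; the form fields are made up for illustration:

```python
import json
import urllib.parse

form = {"phone": "8615131255089", "password": "pppppppp", "oneMonth": 1}

# Body for Content-Type: application/x-www-form-urlencoded
urlencoded_body = urllib.parse.urlencode(form)
print(urlencoded_body)  # phone=8615131255089&password=pppppppp&oneMonth=1

# Body for Content-Type: application/json
json_body = json.dumps(form)
print(json_body)
```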
3. Cookies
Use a CookieJar to extract the cookies returned in a response.
Option 1: build a cookie dict from the jar and send it with the request:
```python
def login(self, response):
    self.cookie_dict = {}
    cookie_jar = CookieJar()                                # create the CookieJar
    cookie_jar.extract_cookies(response, response.request)  # extract cookies from the response
    for k, v in cookie_jar._cookies.items():
        for i, j in v.items():
            for m, n in j.items():
                self.cookie_dict[m] = n.value

    req = Request(
        url='http://dig.chouti.com/login',
        method='POST',
        headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
        body='phone=8615131255089&password=pppppppp&oneMonth=1',
        cookies=self.cookie_dict,   # cookies passed as a dict
        callback=self.check_login
    )
    yield req
```
Option 2 (recommended): pass the CookieJar object itself to the Request:
```python
def login(self, response):
    self.cookie_jar = CookieJar()
    self.cookie_jar.extract_cookies(response, response.request)

    req = Request(
        url='http://dig.chouti.com/login',
        method='POST',
        headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
        body='phone=8615131255089&password=pppppppp&oneMonth=1',
        cookies=self.cookie_jar,    # cookies passed as a CookieJar object
        callback=self.check_login
    )
    yield req
```
```python
class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, dont_filter=True, callback=self.parse1)

    def parse1(self, response):
        """Fetch the home page and log in."""
        # response.text holds the full home page
        from scrapy.http.cookies import CookieJar
        self.cookie_jar = CookieJar()
        self.cookie_jar.extract_cookies(response, response.request)  # collect the response cookies

        post_dict = {
            'phone': '8617748232617',
            'password': 'password',
            'oneMonth': 1,
        }

        import urllib.parse
        # urlencode converts the dict into "phone=86123&password=123&oneMonth=1" format
        data = urllib.parse.urlencode(post_dict)
        # Send the POST request to log in
        yield Request(
            url='http://dig.chouti.com/login',
            method='POST',
            cookies=self.cookie_jar,
            body=data,
            headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            callback=self.parse2
        )

    def parse2(self, response):
        """Handle the login result."""
        print(response.text)
        # Fetch the news list
        yield Request(url='http://dig.chouti.com', cookies=self.cookie_jar, callback=self.parse3)

    def parse3(self, response):
        """Upvote every item on the page."""
        hxs = Selector(response)
        linkid_list = hxs.xpath('//div[@class="news-pic"]/img/@lang').extract()
        print(linkid_list)
        for link_id in linkid_list:
            # upvote each item by its id
            base_url = "https://dig.chouti.com/link/vote?linksId={0}".format(link_id)
            yield Request(url=base_url, method='POST', cookies=self.cookie_jar, callback=self.parse4)

        # Find all pagination links, e.g. https://dig.chouti.com/all/hot/recent/2
        # hxs.xpath('//div[@id="dig_lcpage"]//a/@href')
        page_list = hxs.xpath('//a[@class="ct_pagepa"]/@href').extract()
        for page in page_list:
            page_url = "https://dig.chouti.com%s" % page
            yield Request(url=page_url, method='GET', cookies=self.cookie_jar, callback=self.parse3)

    def parse4(self, response):
        # print the upvote result
        print(response.text)
```
Persisting to a file with a pipeline
```python
# jandan.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import HtmlXPathSelector, Selector
from scrapy.http.request import Request


class JianDanSpider(scrapy.Spider):
    name = 'jiandan'
    allowed_domains = ['jandan.net']
    start_urls = ['http://jandan.net/']

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, dont_filter=True, callback=self.parse1)

    def parse1(self, response):
        hxs = Selector(response)
        a_list = hxs.xpath('//div[@class="indexs"]/h2')
        for tag in a_list:
            url = tag.xpath('./a/@href').extract_first()
            text = tag.xpath('./a/text()').extract_first()
            from ..items import Sp2Item
            yield Sp2Item(url=url, text=text)  # triggers process_item in the pipeline


# pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.exceptions import DropItem


class Sp2Pipeline(object):
    def __init__(self, val):
        self.val = val
        self.f = None

    def process_item(self, item, spider):
        """
        :param item: the object yielded by the spider; items arrive one at a
                     time, one per loop iteration in the spider
        :param spider: the spider object, e.g. JianDanSpider(); it has
                       attributes such as spider.name
        """
        a = item['url'] + ' ' + item['text'] + '\n'
        self.f.write(a)  # persist to a file
        return item

    @classmethod
    def from_crawler(cls, crawler):
        """
        Creates the pipeline instance. The crawler argument wraps
        crawler-wide state, so settings values can be read here (setting
        keys must be uppercase): crawler.settings.get('MMMM')
        """
        print('Running the pipeline\'s from_crawler to create the instance')
        val = crawler.settings.get('MMMM')
        return cls(val)  # create the instance; __init__ stores val on it

    def open_spider(self, spider):
        """Called when the spider starts."""
        self.f = open('a.log', 'a+', encoding='utf-8')
        print('spider opened')

    def close_spider(self, spider):
        """Called when the spider closes."""
        self.f.close()
        print('spider closed')


# items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Sp2Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    """Defines the structured, regularized data fields."""
    url = scrapy.Field()
    text = scrapy.Field()
```
Notes on running multiple pipelines in order:
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class Sp2Pipeline(object):
    def __init__(self, val):
        self.val = val
        self.f = None

    def process_item(self, item, spider):
        """
        :param item: the object yielded by the spider
        :param spider: the spider object, e.g. JianDanSpider(); it has
                       attributes such as spider.name
        """
        # Pipelines are global: whenever any spider yields an item, every
        # registered pipeline runs. Use spider.name to run a pipeline
        # selectively:
        if spider.name == 'jiandan':
            pass
        print(item)

        a = item['url'] + ' ' + item['text'] + '\n'
        self.f.write(a)
        # return item  # passes the item on to the next pipeline's
        #              # process_item; without a return, None is passed on,
        #              # but the next pipeline still runs
        # from scrapy.exceptions import DropItem
        # raise DropItem()  # DropItem: the next pipelines' process_item no longer runs

    @classmethod
    def from_crawler(cls, crawler):
        """
        Creates the pipeline instance; crawler gives access to settings
        values (setting keys must be uppercase): crawler.settings.get('MMMM')
        """
        print('Running Sp2Pipeline.from_crawler to create the instance')
        val = crawler.settings.get('MMMM')
        return cls(val)

    def open_spider(self, spider):
        """Called when the spider starts."""
        self.f = open('a.log', 'a+', encoding='utf-8')
        print('spider opened')

    def close_spider(self, spider):
        """Called when the spider closes."""
        self.f.close()
        print('spider closed')


class Sp3Pipeline(object):
    def __init__(self):
        self.f = None

    def process_item(self, item, spider):
        """
        :param item: the object yielded by the spider
        :param spider: the spider object, e.g. JianDanSpider()
        """
        print('in the pipeline3--------------------------')
        return item

    @classmethod
    def from_crawler(cls, crawler):
        """Creates the pipeline instance."""
        # val = crawler.settings.get('MMMM')
        print('Running Sp3Pipeline.from_crawler to create the instance')
        return cls()

    def open_spider(self, spider):
        """Called when the spider starts."""
        print('in the pipeline3-------------------------- spider opened')
        # self.f = open('a.log', 'a+')

    def close_spider(self, spider):
        """Called when the spider closes."""
        print('in the pipeline3-------------------------- spider closed')
```
Summary

Why use pipelines: if scraped data is saved inside the spider itself, files and DB connections get opened and closed over and over. Putting persistence in a pipeline means the file or DB connection is opened just once, before the crawl starts.

Prerequisites for a pipeline to run:
- The spider yields an Item object, e.g. `yield Item(url=url, text='asdasd')` (meaning: persist this Item by handing it to the pipeline). Define the Item class in items.py and import it in the spider.
- Register the pipeline in settings:

      ITEM_PIPELINES = {
          'sp2.pipelines.Sp2Pipeline': 300,  # lower number = higher priority
          'sp2.pipelines.Sp3Pipeline': 100,
      }

Execution flow of a custom pipeline:
1. Before the crawl starts:
   - Scrapy checks (via hasattr/getattr) whether the custom pipeline class has an initializing from_crawler method. If it does, the object is created with `obj = Cls.from_crawler(crawler)`; otherwise with `obj = Cls()`.
   - `obj.open_spider()` is called; do pre-crawl work here, such as opening files or database connections.
2. While the spiders run from the scheduler, each yielded item triggers `obj.process_item()`.
3. If more pipelines are registered, their process_item methods run in turn on the item (their open_spider/close_spider methods also run at start-up and shutdown). To stop an item from reaching later pipelines:
       from scrapy.exceptions import DropItem
       raise DropItem()  # the next pipelines' process_item no longer runs for this item
4. When the crawl ends, `obj.close_spider()` is called.

Note: pipelines are global. Whenever any spider yields an item, every registered pipeline processes it; use spider.name inside process_item to run a pipeline selectively.
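The ordering and DropItem behavior described above can be sketched without Scrapy. This is a toy model; the class names, the `url` field, and the local `DropItem` stand-in are made up for illustration:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""

class DedupPipeline:
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        if item["url"] in self.seen:
            raise DropItem(item["url"])  # later pipelines never see this item
        self.seen.add(item["url"])
        return item

class SavePipeline:
    def __init__(self):
        self.saved = []

    def process_item(self, item, spider):
        self.saved.append(item)
        return item

def run_pipelines(pipelines, items, spider=None):
    # Pipelines run in priority order; each one's return value feeds the next.
    for item in items:
        try:
            for p in pipelines:
                item = p.process_item(item, spider)
        except DropItem:
            continue  # item dropped; move on to the next one

dedup, save = DedupPipeline(), SavePipeline()
run_pipelines([dedup, save],
              [{"url": "a.html"}, {"url": "a.html"}, {"url": "b.html"}])
print(save.saved)  # the duplicate "a.html" was dropped before reaching SavePipeline
```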
Additional notes:
1 scrapy:中文文档 2 http://scrapy-chs.readthedocs.io/zh_CN/latest/ 3 英文: 4 https://docs.scrapy.org/en/latest/topics/spiders.html 5 6 7 allowed_domains: 8 每次生成一个Request时,判断当前url是否匹配allowed_domains指定的域名,匹配则爬取,否则丢弃; 9 包含了spider允许爬取的域名(domain)列表(list)。 当 OffsiteMiddleware 启用时, 域名不在列表中的URL不会被跟进。 10 11 custom_settings: 12 spiders中以类属性且为字典形式(键名:settings中的变量名)局部自定义的配置,会覆盖settings中的全局配置,可以设置spiders各自的useragent 13 A dictionary of settings that will be overridden from the project wide configuration when running this spider. It must be defined as a class attribute since the settings are updated before instantiation. 14 15 16 from_crawler(crawler, *args, **kwargs) 17 可以在spiders中定义,也可以在pipeline中定义;作用:利用它的crawler参数,可以通过settings中配置参数来构造类(通过crawler对象,可以获取scrapy所有核心组件,如全局配置的每个信息) 18 19 This is the class method used by Scrapy to create your spiders. 20 You probably won’t need to override this directly, since the default implementation acts as a proxy to the __init__() method, calling it with the given arguments args and named arguments kwargs. 21 Nonetheless, this method sets the crawler and settings attributes in the new instance, so they can be accessed later inside the spider’s code. 
22 23 参数: 24 crawler (Crawler instance) – crawler to which the spider will be bound 25 args (list) – arguments passed to the __init__() method 26 kwargs (dict) – keyword arguments passed to the __init__() method 27 28 class QuotesSpider(scrapy.Spider): 29 name = 'quotes' 30 allowed_domains = ['quotes.toscrape.com'] 31 start_urls = ['http://quotes.toscrape.com/'] 32 33 34 def __init__(self, mongo_url, mongo_db, *args, **kwargs): 35 super(QuotesSpider, self).__init__(*args, **kwargs) 36 self.mongo_url = mongo_url 37 self.mongo_db = mongo_db 38 39 @classmethod 40 def from_crawler(cls, crawler): 41 print('in MongoPipeline from_crawler ---------------------------------') # step 1 42 return cls( 43 mongo_url=crawler.settings.get('MONGO_URL'), 44 mongo_db=crawler.settings.get('MONGO_DB') 45 ) 46 47 48 49 # spider的日志: 50 Logging from Spiders 51 Scrapy provides a logger within each Spider instance, which can be accessed and used like this: 52 import scrapy 53 54 class MySpider(scrapy.Spider): 55 56 name = 'myspider' 57 start_urls = ['https://scrapinghub.com'] 58 59 def parse(self, response): 60 self.logger.info('Parse function called on %s', response.url) 61 62 63 That logger is created using the Spider’s name, but you can use any custom Python logger you want. For example: 64 import logging 65 import scrapy 66 67 logger = logging.getLogger('mycustomlogger') 68 69 class MySpider(scrapy.Spider): 70 71 name = 'myspider' 72 start_urls = ['https://scrapinghub.com'] 73 74 def parse(self, response): 75 logger.info('Parse function called on %s', response.url) 76 77 85 86 87 pipeline: 可用来数据清洗、存储 88 89 # https://docs.scrapy.org/en/latest/topics/item-pipeline.html# 90 91 # from_crawler方法: 92 # 一般用他来获取scrapy的settings配置信息 93 from_crawler(cls, crawler) 94 If present, this classmethod is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. 
Crawler object provides access to all Scrapy core components like settings and signals; it is a way for pipeline to access them and hook its functionality into Scrapy. 95 Parameters: crawler (Crawler object) – crawler that uses this pipeline 96 97 98 99 Item pipeline example 100 例1: 101 Price validation and dropping items with no prices 102 Let’s take a look at the following hypothetical pipeline that adjusts the price attribute for those items that do not include VAT (price_excludes_vat attribute), and drops those items which don’t contain a price: 103 104 from scrapy.exceptions import DropItem 105 106 class PricePipeline(object): 107 108 vat_factor = 1.15 109 110 def process_item(self, item, spider): 111 if item['price']: 112 if item['price_excludes_vat']: 113 item['price'] = item['price'] * self.vat_factor 114 return item 115 else: 116 raise DropItem("Missing price in %s" % item) 117 118 119 例2: 120 Write items to a JSON file 121 The following pipeline stores all scraped items (from all spiders) into a single items.jl file, containing one item per line serialized in JSON format: 122 import json 123 124 class JsonWriterPipeline(object): 125 126 def open_spider(self, spider): 127 self.file = open('items.jl', 'w') 128 129 def close_spider(self, spider): 130 self.file.close() 131 132 def process_item(self, item, spider): 133 line = json.dumps(dict(item)) + "\n" 134 self.file.write(line) 135 return item 136 注意: 137 The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports:https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports 138 139 140 141 例3:存储到MongoDB: 142 Write items to MongoDB 143 In this example we’ll write items to MongoDB using pymongo. MongoDB address and database name are specified in Scrapy settings; MongoDB collection is named after item class. 
The main point of this example is to show how to use the from_crawler() method and how to clean up the resources properly:
pymongo official API: https://api.mongodb.com/python/current/api/index.html

import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item

Example 4: deduplication in a pipeline
Duplicates filter
A filter that looks for duplicate items, and drops those items that were already processed. Let's say that our items have a unique id, but our spider returns multiple items with the same id:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
from twisted.enterprise import adbapi


class MySQLPipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection parameters from the project settings
        cls.MYSQL_DB_NAME = crawler.settings.get("MYSQL_DB_NAME", 'scrapy_default')
        cls.HOST = crawler.settings.get("MYSQL_HOST", 'localhost')
        cls.PORT = crawler.settings.get("MYSQL_PORT", 3306)
        cls.USER = crawler.settings.get("MYSQL_USER", 'root')
        cls.PASSWD = crawler.settings.get("MYSQL_PASSWORD", '123456')
        return cls()

    def open_spider(self, spider):
        self.dbpool = adbapi.ConnectionPool('pymysql', host=self.HOST, port=self.PORT, user=self.USER,
                                            passwd=self.PASSWD, db=self.MYSQL_DB_NAME, charset='utf8')

    def close_spider(self, spider):
        self.dbpool.close()

    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self.insert_db, item)
        # query.addErrback(self.handle_error)
        return item

    def insert_db(self, tx, item):
        values = (
            item['houseRecord'], item['price'], item['unitPrice'], item['room'], item['type'], item['area'],
            item['communityName'], item['areaName'], item['visitTime'], str(item['base']), str(item['transaction']),
            item['base_more'], item['tags'], item['url']
        )
        sql = 'INSERT INTO lianjia(houseRecord,price,unitPrice,room,type,area,communityName,areaName,visitTime,' \
              'base,transaction,base_more,tags,url) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'
        tx.execute(sql, values)

    # def handle_error(self, failure):
    #     print(failure)
# pipelines.py
import pymysql

class MysqlPipeline(object):
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            database=crawler.settings.get('MYSQL_DB'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT')
        )

    def open_spider(self, spider):
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.database, charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        data = dict(item)
        keys = ', '.join(data.keys())
        values = ', '.join(['%s'] * len(data))
        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item

    def close_spider(self, spider):
        self.db.close()


# settings
MYSQL_HOST = '192.168.1.110'
MYSQL_DB = 'images360'
MYSQL_PORT = 3306  # Note: the port must be an int, not a quoted string
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
from twisted.enterprise import adbapi
import MySQLdb.cursors


class WebcrawlerScrapyPipeline(object):
    '''Pipeline that saves items to the database.
    1. Configure it in settings.py
    2. yield item in your own spider class and it runs automatically'''

    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        '''1. @classmethod declares a class method, as opposed to the usual instance method.
        2. A class method's first argument is cls (short for "class", the class itself); an instance method's first argument is self, an instance of the class.
        3. It can be called on the class, e.g. C.f(), like a static method in Java.'''
        # Read the database parameters configured in settings
        dbparams = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset='utf8',  # set the charset, otherwise Chinese text may come back garbled
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=False,
        )
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbparams)  # ** expands the dict into keyword arguments, i.e. host=xxx, db=yyy, ...
        return cls(dbpool)  # the pool is handed to the class and is then reachable through self

    # called by the pipeline machinery for every item
    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self._conditional_insert, item)  # run the insert
        query.addErrback(self._handle_error, item, spider)  # attach the error handler
        return item

    # write to the database; the SQL statement lives here
    def _conditional_insert(self, tx, item):
        sql = "insert into jsbooks(author,title,url,pubday,comments,likes,rewards,views) values(%s,%s,%s,%s,%s,%s,%s,%s)"
        params = (item['author'], item['title'], item['url'], item['pubday'], item['comments'], item['likes'], item['rewards'], item['reads'])
        tx.execute(sql, params)

    # error handler
    def _handle_error(self, failure, item, spider):
        print(failure)


# https://blog.csdn.net/xdl1278/article/details/79056380
from twisted.enterprise import adbapi
from MySQLdb import cursors

class MySQL_Twisted_Pipelines(object):
    # Asynchronous MySQL pipeline based on twisted.enterprise.adbapi
    # def __init__(self, dbpool):
    #     self.dbpool = dbpool

    # @classmethod
    # def from_settings(cls, settings):
    #     dbpool = adbapi.ConnectionPool("MySQLdb", **settings["MYSQL_INFO"], cursorclass=cursors.DictCursor)
    #     return cls(dbpool)

    # @classmethod
    # def from_crawler(cls, crawler):
    #     settings = crawler.settings
    #     dbpool = adbapi.ConnectionPool("MySQLdb", **settings["MYSQL_INFO"], cursorclass=cursors.DictCursor)
    #     return cls(dbpool)

    def open_spider(self, spider):
        self.dbpool = adbapi.ConnectionPool("MySQLdb", **spider.settings["MYSQL_INFO"], cursorclass=cursors.DictCursor)

    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self.db_insert, item)
        query.addErrback(self.handle_error)
        return item

    def handle_error(self, failure):
        print(failure)

    def db_insert(self, cursor, item):
        sql_insert = 'INSERT INTO teachers(name,title,info) VALUES (%s,%s,%s)'
        cursor.execute(sql_insert, (item['name'], item['title'], item['info']))


# Pretend to be a regular browser and throttle requests to avoid getting banned
DOWNLOAD_DELAY = 0.25  # 250 ms of delay
# Install MySQLdb
pip3 install mysqlclient
# pipeline:
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        return item

    def get_media_requests(self, item, info):
        yield Request(item['url'])


# settings
IMAGES_STORE = './images'

# enable the pipeline
ITEM_PIPELINES = {
    'images360.pipelines.ImagePipeline': 300,
}
Deduplication: avoiding repeat visits to the same URL
By default Scrapy deduplicates requests with scrapy.dupefilters.RFPDupeFilter; the related settings are:
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
DUPEFILTER_DEBUG = False
JOBDIR = "path where the visit log is saved, e.g. /root/"  # the final path is /root/requests.seen
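What requests.seen records is a fingerprint per request, not the raw URL. The stdlib-only sketch below illustrates that idea; it is not Scrapy's actual algorithm (which also hashes the method, body and performs fuller canonicalization), and `request_fingerprint` / `SimpleDupeFilter` are illustrative names, not Scrapy API:

```python
import hashlib
from urllib.parse import urlparse, parse_qsl, urlencode

def request_fingerprint(method, url):
    """Build a stable fingerprint: the same resource yields the same
    hash, even if the query parameters appear in a different order."""
    parts = urlparse(url)
    # Sort the query string so ?a=1&b=2 and ?b=2&a=1 collide on purpose
    query = urlencode(sorted(parse_qsl(parts.query)))
    canonical = f"{method.upper()}|{parts.scheme}://{parts.netloc}{parts.path}?{query}"
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

class SimpleDupeFilter:
    """In-memory fingerprint set, mirroring the request_seen() contract."""
    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, method, url):
        fp = request_fingerprint(method, url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False
```

With JOBDIR set, the real filter additionally appends each new fingerprint to requests.seen so a paused crawl can resume without revisiting URLs.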
The DUPEFILTER_DEBUG and JOBDIR settings are specific to the RFPDupeFilter class; when writing your own dedupe rule you can change them as needed. Create a custom dedupe file (e.g. rep.py) in the project directory and declare it in settings:
class RepeatUrl:
    def __init__(self):
        self.visited_url = set()  # kept in the memory of the current process; could also live in memcache or redis

    @classmethod
    def from_settings(cls, settings):
        """
        Called at initialization
        :param settings:
        :return:
        """
        return cls()

    def request_seen(self, request):
        """
        Check whether the current request has already been visited
        :param request:
        :return: True if already visited; False otherwise
        """
        if request.url in self.visited_url:
            return True
        self.visited_url.add(request.url)
        return False

    def open(self):
        """
        Called when crawling starts
        :return:
        """
        print('open replication')

    def close(self, reason):
        """
        Called when the spider finishes
        :param reason:
        :return:
        """
        print('close replication')

    def log(self, request, spider):
        """
        Log filtered requests
        :param request:
        :param spider:
        :return:
        """
        print('repeat', request.url)

# declared in the settings file:
DUPEFILTER_CLASS = 'sp2.rep.RepeatUrl'
Dedupe rules should be shared among spiders: once one spider has crawled a URL, no other spider should crawl it again. Possible implementations:

# Method 1:
1. Add a class attribute:
visited = set()  # class attribute

2. Inside the parse callback:
def parse(self, response):
    if response.url in self.visited:
        return None
    .......
    self.visited.add(response.url)

# Method 1 improved: URLs can be very long, so store a hash of the URL instead
def parse(self, response):
    url = md5(response.request.url)
    if url in self.visited:
        return None
    .......
    self.visited.add(url)

# Method 2: Scrapy's built-in deduplication
In the settings file:
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'  # the default dedupe rule; fingerprints are kept in memory
DUPEFILTER_DEBUG = False
JOBDIR = "path where the visit log is saved, e.g. /root/"  # final path /root/requests.seen; fingerprints are persisted to that file

The built-in rule defaults to RFPDupeFilter; requests just use Request(..., dont_filter=False). Passing dont_filter=True tells Scrapy not to dedupe that URL.

# Method 3:
We can also model a custom dedupe rule on RFPDupeFilter;
read the source of scrapy.dupefilters.RFPDupeFilter and model your class on BaseDupeFilter.

# Step 1: create a custom dedupe file cumstomdupefilter.py in the project directory
'''
# roughly what Scrapy does to instantiate the filter:
if hasattr(MyDupeFilter, 'from_settings'):
    func = getattr(MyDupeFilter, 'from_settings')
    obj = func(settings)
else:
    obj = MyDupeFilter()
'''
class MyDupeFilter(object):
    def __init__(self):
        self.visited = set()

    @classmethod
    def from_settings(cls, settings):
        '''read the settings file'''
        return cls()

    def request_seen(self, request):
        '''"has this request been seen before" - the method the dedupe machinery actually calls'''
        if request.url in self.visited:
            return True
        self.visited.add(request.url)
        return False

    def open(self):  # can return deferred
        '''called on open'''
        pass

    def close(self, reason):  # can return a deferred
        pass

    def log(self, request, spider):  # log that a request has been filtered
        '''logging'''
        pass

# Step 2: in settings.py
# DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'  # the class looked up by default
# custom dedupe rule:
DUPEFILTER_CLASS = 'AMAZON.cumstomdupefilter.MyDupeFilter'

# Source walkthrough:
from scrapy.core.scheduler import Scheduler
# see Scheduler.enqueue_request: self.df.request_seen(request)
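Method 1 and its md5 improvement can be combined into one runnable sketch. The class and method names below are illustrative, not Scrapy API; the key points are the class-level set (shared by every instance) and the fixed-length digest:

```python
import hashlib

class VisitedFilterSpider:
    # Class attribute: shared by every instance of the class, so all
    # spiders built on it see the same visited set (method 1 above).
    visited = set()

    @staticmethod
    def url_md5(url):
        # A fixed-length digest keeps memory bounded even for very long URLs
        return hashlib.md5(url.encode('utf-8')).hexdigest()

    def should_parse(self, url):
        """Return True the first time a URL is seen, False afterwards."""
        key = self.url_md5(url)
        if key in self.visited:
            return False
        self.visited.add(key)
        return True
```

Note this only shares state within one process; sharing across processes or machines needs an external store such as redis, as the RepeatUrl comment above suggests.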
Custom extensions:
Custom extensions (similar to Django's signals):
1. Django's signals are pre-reserved extension points: once a signal fires, the corresponding handler runs.
2. The advantage of Scrapy's custom extensions is that we can attach functionality at any point of the crawl we want, whereas the hooks provided by the other components only run at fixed places.
'''
engine_started = object()
engine_stopped = object()
spider_opened = object()
spider_idle = object()
spider_closed = object()
spider_error = object()
request_scheduled = object()
request_dropped = object()
response_received = object()
response_downloaded = object()
item_scraped = object()
item_dropped = object()
'''
from scrapy import signals


# 1. Create a new file next to settings.py, e.g. extensions.py, with the following content
class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        val = crawler.settings.getint('MMMM')
        ext = cls(val)
        print('in the signals=================================')
        # crawler.signals.connect: register a signal handler with Scrapy
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)

        return ext

    def spider_opened(self, spider):
        print('signals==========================open')

    def spider_closed(self, spider):
        print('signals==========================close')

# 2. Enable it in settings:
# EXTENSIONS = {
#     'sp1.extensions.MyExtension': 200,
# }
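The connect/fire mechanics that crawler.signals.connect relies on can be sketched with a toy dispatcher. This is illustrative only (Scrapy's real SignalManager is built on pydispatch and passes more metadata); note that signals really are just unique sentinel objects, exactly like the `object()` list above:

```python
class ToySignalManager:
    """Toy version of crawler.signals: map signal objects to receivers."""
    def __init__(self):
        self._receivers = {}

    def connect(self, receiver, signal):
        # Register a callable to be invoked whenever `signal` fires
        self._receivers.setdefault(signal, []).append(receiver)

    def send(self, signal, **kwargs):
        # Fire the signal: call every registered receiver with the payload
        return [receiver(**kwargs) for receiver in self._receivers.get(signal, [])]

# Signals are plain sentinel objects, as in scrapy.signals
spider_opened = object()
spider_closed = object()

events = []
signals_mgr = ToySignalManager()
signals_mgr.connect(lambda spider: events.append(('open', spider)), signal=spider_opened)
signals_mgr.connect(lambda spider: events.append(('close', spider)), signal=spider_closed)
signals_mgr.send(spider_opened, spider='myspider')
signals_mgr.send(spider_closed, spider='myspider')
```

In real Scrapy the engine fires these signals at the corresponding lifecycle points, which is why MyExtension's methods run without being called explicitly anywhere.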
Scrapy middleware:
Defined in the project's middlewares.py; the middleware class name is up to you, just declare it in the settings file.
Spider middleware:
When all spiders need uniform handling, consider a spider middleware:
class SpiderMiddleware(object):

    def process_spider_input(self, response, spider):
        """
        Runs after the download finishes, before the response is handed to parse
        :param response: the response returned by the Downloader; response.request (the request that produced it) is attached to it
        :param spider:
        :return:
        """
        pass

    def process_spider_output(self, response, result, spider):
        """
        Called when the spider has finished processing and yields new crawl requests
        :param response: the response that was just processed
        :param result: the objects yielded by the spider, Items or Requests
        :param spider:
        :return: must return an iterable of Request or Item objects
        """
        return result

    def process_spider_exception(self, response, exception, spider):
        """
        Called on exceptions
        :param response:
        :param exception:
        :param spider:
        :return: None to pass the exception on to later middlewares; or an iterable of Response or Item objects, handed to the scheduler or pipeline
        """
        return None

    def process_start_requests(self, start_requests, spider):
        """
        Called when the spider starts
        :param start_requests: iterable built from the spider's start_urls
        :param spider:
        :return: an iterable of Request objects
        """
        return start_requests  # if something other than start_requests is returned, it replaces the URLs to crawl

# register; several middlewares may be listed, lower values run with higher priority:
SPIDER_MIDDLEWARES = {
    'sp1.middlewares.Sp1SpiderMiddleware': 543,
}
Downloader middleware (commonly used to set request headers, e.g. proxies):
Requests coming from the scheduler pass through the downloader middleware, where the following methods can handle them:
# Requests coming from the scheduler pass through the downloader middleware:
class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        When a request is about to be downloaded, every downloader middleware's process_request is called
        :param request: everything about the request (url, callback, headers, ...); you can inject a proxy here, and every later request carries it automatically
        :param spider:
        :return:
            None: do nothing; later middlewares run and the request is downloaded. Typical use: setting a proxy - set it in process_request, return None, and the chain continues;
            Response object: the result is considered already fetched; process_request stops (used for custom downloads) and process_response starts; from scrapy.http import Response
            Request object (rare): the middleware chain stops and the Request goes back into the scheduler, to be re-issued on the next scheduling pass; from scrapy.http import Request
            raise IgnoreRequest (rare): process_request stops and process_exception starts
        """
        '''
        # Use case 1: set a proxy
        request.method = "POST"  # force every request to POST
        request.headers['kkk'] = 'vvvv'
        request.meta['proxy'] = 'http://111.11.228.75:80'  # set the proxy
        return None
        '''

        """
        # Use case 2: custom download
        from scrapy.http import Response
        import requests
        v = requests.get('http://www.baidu.com')
        data = Response(url='xxxxxxxx', body=v.content, request=request)
        return data  # returning a Response object means the result is already fetched; process_request stops and the downloader is skipped
        """
        pass

    def process_response(self, request, response, spider):  # by the time process_response runs, the download has finished
        """
        Called on the way back; can be used to inspect the response headers and body
        :param response:
        :param result:
        :param spider:
        :return:
            Response object: handed on to the other middlewares' process_response;
            Request object: the middleware chain stops and the request goes back into the scheduler to be downloaded again
            raise IgnoreRequest: Request.errback is called
        """
        print('response1')
        '''
        Use case 1: normalize the encoding of every response;
        # from scrapy.http import Response
        # response.encoding = 'utf-8'  # treat every response as utf-8, so response.text is always utf-8
        Use case 2:
        subclass Response to fit your business needs
        '''
        return response

    def process_exception(self, request, exception, spider):
        """
        Scrapy calls process_exception() when a download handler or a process_request() (from a downloader middleware) raises an exception (including an IgnoreRequest exception),
        e.g. a timeout or a connection failure
        :param response:
        :param exception:
        :param spider:
        :return:
            None: the exception is passed on to later middlewares' process_exception;
            Response object: the process_exception chain stops and the middlewares' process_response methods run instead;
            Request object: the middleware chain stops and the request is re-scheduled for download - useful to retry a failed request (optionally with extra parameters, a different proxy, ...)
        """
        return request  # re-issue the request after a failure


# DOWNLOADER_MIDDLEWARES = {
#     'sp3.middlewares.Sp3DownloaderMiddleware': 543,
# }
# defined in middlewares.py:
import re
import random

class RandomUserAgentMiddleware(object):
    def __init__(self, arg):
        self.user_agents = arg

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        val = crawler.settings.get('USER_AGENT')
        return cls(str(val))

    def process_request(self, request, spider):
        ret = re.sub(r'\n\s+', '', self.user_agents)
        result = ret.split(',')
        request.headers['User-Agent'] = random.choice(result)


# add to settings:

USER_AGENT = """
'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)',
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2',
'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1'
"""

# verify:
class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        self.logger.warning(response.text)
# give Scrapy a dynamic, random User-Agent

# settings:
USER_AGENTS = [
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)',
]


# in middlewares.py:
import random
from settings import USER_AGENTS

class RandomUserAgent(object):
    def process_request(self, request, spider):
        useragent = random.choice(USER_AGENTS)
        request.headers.setdefault("User-Agent", useragent)
import json
import logging
from scrapy import signals
import requests


class ProxyMiddleware():
    def __init__(self, proxy_url):
        self.logger = logging.getLogger(__name__)
        self.proxy_url = proxy_url

    def get_random_proxy(self):
        try:
            response = requests.get(self.proxy_url)
            if response.status_code == 200:
                proxy = response.text
                return proxy
        except requests.ConnectionError:
            return False

    def process_request(self, request, spider):
        if request.meta.get('retry_times'):
            proxy = self.get_random_proxy()
            if proxy:
                uri = 'https://{proxy}'.format(proxy=proxy)
                self.logger.debug('Using proxy ' + proxy)
                request.meta['proxy'] = uri

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(
            proxy_url=settings.get('PROXY_URL')
        )


class CookiesMiddleware():
    def __init__(self, cookies_url):
        self.logger = logging.getLogger(__name__)
        self.cookies_url = cookies_url

    def get_random_cookies(self):
        try:
            response = requests.get(self.cookies_url)
            if response.status_code == 200:
                cookies = json.loads(response.text)
                return cookies
        except requests.ConnectionError:
            return False

    def process_request(self, request, spider):
        self.logger.debug('Fetching cookies')
        cookies = self.get_random_cookies()
        if cookies:
            request.cookies = cookies
            self.logger.debug('Using cookies ' + json.dumps(cookies))

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(
            cookies_url=settings.get('COOKIES_URL')
        )
# in middlewares.py:
import logging

class ZhihuuserProxyMiddleware(object):

    logger = logging.getLogger(__name__)

    def process_request(self, request, spider):
        self.logger.debug('Using Proxy------------------------')
        request.meta['proxy'] = 'http://92.222.150.204:2028'


# in settings:
DOWNLOADER_MIDDLEWARES = {
    'zhihuuser.middlewares.ZhihuuserProxyMiddleware': 543,
}
Custom commands:
- Create a directory (any name, e.g. commands) at the same level as spiders
- Inside it create a crawlall.py file (the file name becomes the command name)
from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # list of spiders
        spider_list = self.crawler_process.spiders.list()  # every spider under the spiders directory
        for name in spider_list:
            # initialize the crawl
            self.crawler_process.crawl(name, **opts.__dict__)
        # start all of the crawlers
        self.crawler_process.start()
- Add COMMANDS_MODULE = 'project_name.directory_name' to settings.py
- Run the command from the project directory: scrapy crawlall
Proxy settings, custom HTTPS certificates, and the settings file in detail:
1 # -*- coding: utf-8 -*- 2 3 # Scrapy settings for step8_king project 4 # 5 # For simplicity, this file contains only settings considered important or 6 # commonly used. You can find more settings consulting the documentation: 7 # 8 # http://doc.scrapy.org/en/latest/topics/settings.html 9 # http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html 10 # http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html 11 12 # 1. 爬虫名称 13 BOT_NAME = 'step8_king' 14 15 # 2. 爬虫应用路径 16 SPIDER_MODULES = ['step8_king.spiders'] 17 NEWSPIDER_MODULE = 'step8_king.spiders' 18 19 # Crawl responsibly by identifying yourself (and your website) on the user-agent 20 # 3. 客户端 user-agent请求头 21 # USER_AGENT = 'step8_king (+http://www.yourdomain.com)' 22 23 # Obey robots.txt rules 24 # 4. 禁止爬虫配置 25 # ROBOTSTXT_OBEY = False 26 27 # Configure maximum concurrent requests performed by Scrapy (default: 16) 28 # 5. 并发请求数 29 # CONCURRENT_REQUESTS = 4 30 31 # Configure a delay for requests for the same website (default: 0) 32 # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay 33 # See also autothrottle settings and docs 34 # 6. 延迟下载秒数 35 # DOWNLOAD_DELAY = 2 36 37 38 # The download delay setting will honor only one of: 39 # 7. 单域名访问并发数,并且延迟下次秒数也应用在每个域名 40 # CONCURRENT_REQUESTS_PER_DOMAIN = 2 41 # 单IP访问并发数,如果有值则忽略:CONCURRENT_REQUESTS_PER_DOMAIN,并且延迟下次秒数也应用在每个IP 42 # CONCURRENT_REQUESTS_PER_IP = 3 43 # 注意:设置上面两个参数会覆盖CONCURRENT_REQUESTS参数的值 44 45 # Disable cookies (enabled by default) 46 # 8. 是否支持cookie,cookiejar进行操作cookie 47 # COOKIES_ENABLED = True 48 # COOKIES_DEBUG = True # 会在爬虫执行日志中打印出cookie信息 49 50 # Disable Telnet Console (enabled by default) 51 # 9. Telnet用于查看当前爬虫的信息,操作爬虫等... 52 # 使用telnet ip port ,然后通过命令操作 ; 用于监控爬虫 53 # TELNETCONSOLE_ENABLED = True 54 # TELNETCONSOLE_HOST = '127.0.0.1' 55 # TELNETCONSOLE_PORT = [6023,] 56 57 58 # 10. 
对所有的请求,默认请求头,在spiders中定义的headers优先 59 # Override the default request headers: 60 # DEFAULT_REQUEST_HEADERS = { 61 # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 62 # 'Accept-Language': 'en', 63 # } 64 65 66 # Configure item pipelines 67 # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html 68 # 11. 定义pipeline处理请求 69 # ITEM_PIPELINES = { 70 # 'step8_king.pipelines.JsonPipeline': 700, # 值范围:[0-1000] 71 # 'step8_king.pipelines.FilePipeline': 500, 72 # } 73 74 75 76 # 12. 自定义扩展,基于信号进行调用 77 # Enable or disable extensions 78 # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html 79 # EXTENSIONS = { 80 # # 'step8_king.extensions.MyExtension': 500, 81 # } 82 83 84 # 13. 爬虫允许的最大深度,可以通过meta查看当前深度;0表示无深度 85 # DEPTH_LIMIT = 3 86 87 # 14. 爬取时,0表示深度优先Lifo(默认);1表示广度优先FiFo 88 89 # 后进先出,深度优先 90 # DEPTH_PRIORITY = 0 91 # SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue' 92 # SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue' 93 # 先进先出,广度优先 94 95 # DEPTH_PRIORITY = 1 96 # SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue' 97 # SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue' 98 99 # 15. 调度器队列 100 # SCHEDULER = 'scrapy.core.scheduler.Scheduler' 101 # from scrapy.core.scheduler import Scheduler 102 103 104 # 16. 访问URL去重 105 # DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl' 106 107 108 # Enable and configure the AutoThrottle extension (disabled by default) 109 # See http://doc.scrapy.org/en/latest/topics/autothrottle.html 110 111 """ 112 17. 自动限速算法 113 from scrapy.contrib.throttle import AutoThrottle 114 自动限速设置 115 1. 获取最小延迟 DOWNLOAD_DELAY 116 2. 获取最大延迟 AUTOTHROTTLE_MAX_DELAY 117 3. 设置初始下载延迟 AUTOTHROTTLE_START_DELAY 118 4. 当请求下载完成后,获取其"连接"时间 latency,即:请求连接到接受到响应头之间的时间 119 5. 用于计算的... 
AUTOTHROTTLE_TARGET_CONCURRENCY 120 target_delay = latency / self.target_concurrency 121 new_delay = (slot.delay + target_delay) / 2.0 # 表示上一次的延迟时间 122 new_delay = max(target_delay, new_delay) 123 new_delay = min(max(self.mindelay, new_delay), self.maxdelay) 124 slot.delay = new_delay 125 """ 126 127 # 开始自动限速 128 # AUTOTHROTTLE_ENABLED = True 129 # The initial download delay 130 # 初始下载延迟 131 # AUTOTHROTTLE_START_DELAY = 5 132 # The maximum download delay to be set in case of high latencies 133 # 最大下载延迟 134 # AUTOTHROTTLE_MAX_DELAY = 10 135 # The average number of requests Scrapy should be sending in parallel to each remote server 136 # 平均每秒并发数 137 # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 138 139 # Enable showing throttling stats for every response received: 140 # 是否显示 141 # AUTOTHROTTLE_DEBUG = True 142 143 # Enable and configure HTTP caching (disabled by default) 144 # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 145 146 147 """ 148 18. 启用缓存 # 没有网络时、测试时可以使用 149 目的用于将已经发送的请求或相应缓存下来,以便以后使用 150 151 from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware 152 from scrapy.extensions.httpcache import DummyPolicy 153 from scrapy.extensions.httpcache import FilesystemCacheStorage 154 """ 155 # 是否启用缓存策略 156 # HTTPCACHE_ENABLED = True 157 158 # 缓存策略:所有请求均缓存,下次在请求直接访问原来的缓存即可 159 # HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy" 160 # 缓存策略:根据Http响应头:Cache-Control、Last-Modified 等进行缓存的策略 161 # HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy" 162 163 # 缓存超时时间 164 # HTTPCACHE_EXPIRATION_SECS = 0 165 166 # 缓存保存路径 167 # HTTPCACHE_DIR = 'httpcache' 168 169 # 缓存忽略的Http状态码 170 # HTTPCACHE_IGNORE_HTTP_CODES = [] 171 172 # 缓存存储的插件 173 # HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' 174 175 176 """ 177 19. 
代理,需要在环境变量中设置 178 from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware 179 180 方式一:使用默认【局限:不能(不方便)随机取代理池中的代理】 181 在spiders中的爬虫中设置 182 import os 183 os.environ 184 { 185 http_proxy:http://root:woshiniba@192.168.11.11:9999/ 186 https_proxy:http://192.168.11.11:9999/ 187 } 188 import os 189 # 设置的key必须带有_proxy后缀名 190 os.environ['http_proxy'] = "http://root:woshiniba@192.168.11.11:9999/" 191 os.environ['https_proxy'] = "http://192.168.11.11:9999/" 192 os.environ['xx_proxy'] = "http://192.168.11.11:9999/" 193 194 方式二:使用自定义下载中间件(参考默认HttpProxyMiddleware来写的),可以 195 def to_bytes(text, encoding=None, errors='strict'): 196 if isinstance(text, bytes): 197 return text 198 if not isinstance(text, six.string_types): 199 raise TypeError('to_bytes must receive a unicode, str or bytes ' 200 'object, got %s' % type(text).__name__) 201 if encoding is None: 202 encoding = 'utf-8' 203 return text.encode(encoding, errors) 204 205 class ProxyMiddleware(object): # 类名自定义,在配置文件中DOWNLOADER_MIDDLEWARES变量中写好就OK 206 def process_request(self, request, spider): # 重写下载中间件的process_request方法 207 PROXIES = [ 208 {'ip_port': '111.11.228.75:80', 'user_pass': ''}, 209 {'ip_port': '120.198.243.22:80', 'user_pass': ''}, 210 {'ip_port': '111.8.60.9:8123', 'user_pass': ''}, 211 {'ip_port': '101.71.27.120:80', 'user_pass': ''}, 212 {'ip_port': '122.96.59.104:80', 'user_pass': ''}, 213 {'ip_port': '122.224.249.122:8088', 'user_pass': ''}, 214 ] 215 import random 216 proxy = random.choice(PROXIES) 217 if proxy['user_pass'] is not None: 218 request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port']) # python3要转字节,to_bytes方法参考HttpProxyMiddleware中的to_bytes方法 219 encoded_user_pass = base64.encodestring(to_bytes(proxy['user_pass'])) 220 request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass) # 代理本质加这个Proxy-Authorization请求头 221 print "**************ProxyMiddleware have pass************" + proxy['ip_port'] 222 else: 223 print "**************ProxyMiddleware no 
pass************" + proxy['ip_port'] 224 request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port']) 225 226 DOWNLOADER_MIDDLEWARES = { 227 'step8_king.middlewares.ProxyMiddleware': 500, 228 } 229 230 """ 231 232 """ 233 20. Https访问(本质通过下载中间件实现) 234 Https访问时有两种情况: 235 1. 要爬取网站使用的可信任证书(默认) 236 DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory" 237 DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory" 238 239 2. 要爬取网站使用的自定义证书,配置文件中加入如下配置: 240 DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory" 241 DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory" 242 243 # https.py 244 from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory 245 from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate) 246 247 class MySSLFactory(ScrapyClientContextFactory): 248 def getCertificateOptions(self): 249 from OpenSSL import crypto 250 v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read()) 251 v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read()) 252 return CertificateOptions( 253 privateKey=v1, # pKey对象 254 certificate=v2, # X509对象 255 verify=False, 256 method=getattr(self, 'method', getattr(self, '_ssl_method', None)) 257 ) 258 其他: 259 相关类 260 scrapy.core.downloader.handlers.http.HttpDownloadHandler 261 scrapy.core.downloader.webclient.ScrapyHTTPClientFactory 262 scrapy.core.downloader.contextfactory.ScrapyClientContextFactory 263 相关配置 264 DOWNLOADER_HTTPCLIENTFACTORY 265 DOWNLOADER_CLIENTCONTEXTFACTORY 266 267 """ 268 269 270 271 """ 272 21. 
Spider middlewares (爬虫中间件)

```python
class SpiderMiddleware(object):

    def process_spider_input(self, response, spider):
        '''
        Called for each downloaded response before it is handed to parse()
        :param response:
        :param spider:
        :return:
        '''
        pass

    def process_spider_output(self, response, result, spider):
        '''
        Called with what the spider returned, on its way out
        :param response:
        :param result:
        :param spider:
        :return: must be an iterable of Request or Item objects
        '''
        return result

    def process_spider_exception(self, response, exception, spider):
        '''
        Called when an exception is raised
        :param response:
        :param exception:
        :param spider:
        :return: None to pass the exception on to the following middlewares;
                 an iterable of Response or Item objects to hand to the
                 scheduler or the pipeline
        '''
        return None

    def process_start_requests(self, start_requests, spider):
        '''
        Called with the start requests when the spider is opened
        :param start_requests:
        :param spider:
        :return: an iterable of Request objects
        '''
        return start_requests
```

Built-in spider middlewares:

```python
'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': 500,
'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
'scrapy.contrib.spidermiddleware.urllength.UrlLengthMiddleware': 800,
'scrapy.contrib.spidermiddleware.depth.DepthMiddleware': 900,
```

```python
# from scrapy.contrib.spidermiddleware.referer import RefererMiddleware
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    # 'step8_king.middlewares.SpiderMiddleware': 543,
}
```

22. Downloader middlewares (下载中间件)

```python
class DownMiddleware1(object):
    def process_request(self, request, spider):
        '''
        Called for each request as it passes through every downloader
        middleware's process_request on its way to being downloaded
        :param request:
        :param spider:
        :return:
            None: continue on to the following middlewares and the download;
            Response object: stop calling process_request and start calling
                process_response;
            Request object: stop the middleware chain and hand the returned
                Request back to the scheduler;
            raise IgnoreRequest: stop calling process_request and start
                calling process_exception
        '''
        pass

    def process_response(self, request, response, spider):
        '''
        Called with the response on its way back from the downloader
        :param request:
        :param response:
        :param spider:
        :return:
            Response object: handed on to the other middlewares'
                process_response;
            Request object: stop the middleware chain; the request is
                rescheduled for download;
            raise IgnoreRequest: Request.errback is called
        '''
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        '''
        Called when a download handler or a downloader middleware's
        process_request() raises an exception
        :param request:
        :param exception:
        :param spider:
        :return:
            None: pass the exception on to the following middlewares;
            Response object: stop calling the remaining process_exception
                methods;
            Request object: stop the middleware chain; the request is
                rescheduled for download
        '''
        return None
```

Default downloader middlewares:

```python
{
    'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
    'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
    'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
    'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
    'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
    'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
    'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
}
```

```python
# from scrapy.contrib.downloadermiddleware.httpauth import HttpAuthMiddleware
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'step8_king.middlewares.DownMiddleware1': 100,
#    'step8_king.middlewares.DownMiddleware2': 500,
# }
```
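The return contract of `process_request` is what makes short-circuiting features possible: returning a Response object skips the download entirely. Below is a minimal sketch of an in-memory cache built on that rule; the `Request`/`Response` namedtuples and the `CacheMiddleware` name are hypothetical stand-ins for illustration, not Scrapy's own classes:

```python
from collections import namedtuple

# Hypothetical stand-ins for Scrapy's Request/Response, for illustration only
Request = namedtuple('Request', ['url'])
Response = namedtuple('Response', ['url', 'body'])


class CacheMiddleware(object):
    """Sketch: serve repeated URLs from memory instead of re-downloading."""

    def __init__(self):
        self.cache = {}

    def process_request(self, request, spider):
        # Returning a Response stops process_request and skips the download
        if request.url in self.cache:
            return Response(request.url, self.cache[request.url])
        return None  # None lets the next middleware / the downloader proceed

    def process_response(self, request, response, spider):
        # Remember the body on the way back so the next hit is served locally
        self.cache[request.url] = response.body
        return response
```

The first request for a URL returns None from `process_request` and is downloaded normally; once `process_response` has recorded the body, a second request for the same URL is answered from the cache without touching the network.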
TinyScrapy
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from twisted.web.client import getPage, defer
from twisted.internet import reactor

# 1. Basic usage
"""
def all_done(arg):  # runs once every crawl has finished, to end the loop
    reactor.stop()

def callback(contents):  # runs automatically as each crawl gets its result
    print(contents)


deferred_list = []

url_list = ['http://www.bing.com', 'http://www.baidu.com', ]
for url in url_list:
    deferred = getPage(bytes(url, encoding='utf8'))
    deferred.addCallback(callback)
    deferred_list.append(deferred)

dlist = defer.DeferredList(deferred_list)
dlist.addBoth(all_done)

reactor.run()  # event loop: watches which socket has completed
"""


# 2. Decorator-based (part one)
"""
def all_done(arg):
    reactor.stop()


def onedone(response):
    print(response)


@defer.inlineCallbacks
def task(url):
    deferred = getPage(bytes(url, encoding='utf8'))
    deferred.addCallback(onedone)
    yield deferred


deferred_list = []

url_list = ['http://www.bing.com', 'http://www.baidu.com', ]
for url in url_list:
    deferred = task(url)
    deferred_list.append(deferred)

dlist = defer.DeferredList(deferred_list)
dlist.addBoth(all_done)

reactor.run()
"""


# 3. Decorator-based (part two)
"""
def all_done(arg):
    reactor.stop()


def onedone(response):
    print(response)


@defer.inlineCallbacks
def task():
    deferred2 = getPage(bytes("http://www.baidu.com", encoding='utf8'))
    deferred2.addCallback(onedone)
    yield deferred2

    deferred1 = getPage(bytes("http://www.google.com", encoding='utf8'))
    deferred1.addCallback(onedone)
    yield deferred1


ret = task()
ret.addBoth(all_done)

reactor.run()
"""


# 4. Decorator-based, loop forever
"""
def all_done(arg):
    reactor.stop()


def onedone(response):
    print(response)


@defer.inlineCallbacks
def task():
    deferred2 = getPage(bytes("http://www.baidu.com", encoding='utf8'))
    deferred2.addCallback(onedone)
    yield deferred2

    # An empty Deferred that never fires stays pending forever. Scrapy relies
    # on this trick to keep running while the queue keeps taking crawl
    # requests; only once the queue is empty does it fire the callback to
    # shut itself down.
    stop_deferred = defer.Deferred()
    # stop_deferred.callback()  # firing it ends the monitored loop at once
    yield stop_deferred         # yielding the unfired Deferred keeps the loop alive


ret = task()
ret.addBoth(all_done)

reactor.run()
"""


# 5. Decorator-based, stop the event loop once everything has finished
"""
running_list = []
stop_deferred = None

def all_done(arg):
    reactor.stop()

def onedone(response, url):
    print(response)
    running_list.remove(url)

def check_empty(response):
    if not running_list:
        stop_deferred.callback('......')

@defer.inlineCallbacks
def task(url):
    deferred2 = getPage(bytes(url, encoding='utf8'))
    deferred2.addCallback(onedone, url)
    deferred2.addCallback(check_empty)
    yield deferred2

    stop_deferred = defer.Deferred()
    yield stop_deferred


ret = task("http://www.baidu.com")
ret.addBoth(all_done)

reactor.run()
"""


# reactor.callLater(0)  # ends the current Deferred; the event loop terminates too


# 6. Decorator-based, stop the event loop once everything has finished
"""
import queue

running_list = []
stop_deferred = None
q = queue.Queue()


def all_done(arg):
    reactor.stop()

def onedone(response):
    print(response)

def check_empty(response):
    if not running_list:
        stop_deferred.callback('......')


def open_spider():
    url = q.get()
    deferred = getPage(bytes(url, encoding='utf8'))
    deferred.addCallback(onedone)
    deferred.addCallback(check_empty)


@defer.inlineCallbacks
def task(start_url):
    q.put(start_url)
    open_spider()

    global stop_deferred
    stop_deferred = defer.Deferred()
    yield stop_deferred


li = []
ret = task("http://www.baidu.com")
li.append(ret)

lid = defer.DeferredList(li)
lid.addBoth(all_done)

reactor.run()
"""
```
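The trick running through examples 4-6 — park the loop on an unfired Deferred and fire it only once the work queue has drained — can be sketched without Twisted at all. A toy version for illustration (the `ToyDeferred` and `crawl` names are made up; `fetch` stands in for `getPage`):

```python
import queue


class ToyDeferred(object):
    """Toy stand-in for twisted's Deferred: stays pending until fired."""

    def __init__(self):
        self.called = False

    def callback(self, result):
        self.called = True


def crawl(start_urls, fetch):
    q = queue.Queue()
    for url in start_urls:
        q.put(url)

    stop_deferred = ToyDeferred()        # keeps the "loop" alive while unfired
    results = []
    while not stop_deferred.called:      # the event loop, reduced to a while
        if q.empty():
            stop_deferred.callback('......')  # queue drained: let the loop end
        else:
            results.append(fetch(q.get()))
    return results
```

Usage: `crawl(['a', 'b'], str.upper)` drains the queue and returns `['A', 'B']`; with an empty start list the stop deferred fires on the first pass and the loop exits immediately.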
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from twisted.web.client import getPage, defer
from twisted.internet import reactor
import queue


class Response(object):
    def __init__(self, body, request):
        self.body = body
        self.request = request
        self.url = request.url

    @property
    def text(self):
        return self.body.decode('utf-8')


class Request(object):
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback


class Scheduler(object):
    def __init__(self, engine):
        self.q = queue.Queue()
        self.engine = engine

    def enqueue_request(self, request):
        self.q.put(request)

    def next_request(self):
        try:
            req = self.q.get(block=False)
        except Exception as e:
            req = None

        return req

    def size(self):
        return self.q.qsize()


class ExecutionEngine(object):
    def __init__(self):
        self._closewait = None
        self.running = True
        self.start_requests = None
        self.scheduler = Scheduler(self)

        self.inprogress = set()

    def check_empty(self, response):
        if not self.running:
            self._closewait.callback('......')

    def _next_request(self):
        while self.start_requests:
            try:
                request = next(self.start_requests)
            except StopIteration:
                self.start_requests = None
            else:
                self.scheduler.enqueue_request(request)

        while len(self.inprogress) < 5 and self.scheduler.size() > 0:  # max concurrency: 5
            # this while loop never blocks
            request = self.scheduler.next_request()
            if not request:
                break

            self.inprogress.add(request)
            d = getPage(bytes(request.url, encoding='utf-8'))
            # Callback notes: if getPage fails, the errback chain runs; on
            # success the callback chain runs; addBoth runs either way.
            d.addBoth(self._handle_downloader_output, request)  # once data is back, handle the output
            d.addBoth(lambda x, req: self.inprogress.remove(req), request)
            d.addBoth(lambda x: self._next_request())

        if len(self.inprogress) == 0 and self.scheduler.size() == 0:
            # once this holds, fire the Deferred that keeps scrapy pinned open
            self._closewait.callback(None)

    def _handle_downloader_output(self, body, request):
        """
        Build the response, run the request's callback, and enqueue whatever
        requests the callback yields back.
        :param body:
        :param request:
        :return:
        """
        import types

        response = Response(body, request)
        func = request.callback or self.spider.parse
        gen = func(response)
        if isinstance(gen, types.GeneratorType):
            # a generator (yielded from the spider) goes back onto the queue;
            # it could also be an Item, which would go straight to the pipeline
            for req in gen:
                self.scheduler.enqueue_request(req)

    @defer.inlineCallbacks
    def start(self):
        self._closewait = defer.Deferred()
        yield self._closewait

    @defer.inlineCallbacks
    def open_spider(self, spider, start_requests):
        self.start_requests = start_requests
        self.spider = spider
        yield None
        reactor.callLater(0, self._next_request)


class Crawler(object):
    def __init__(self, spidercls):
        self.spidercls = spidercls

        self.spider = None
        self.engine = None

    @defer.inlineCallbacks
    def crawl(self):
        self.engine = ExecutionEngine()  # create the engine
        self.spider = self.spidercls()   # instantiate the spider class
        start_requests = iter(self.spider.start_requests())  # hand start_urls to the engine
        yield self.engine.open_spider(self.spider, start_requests)
        yield self.engine.start()  # pin the loop open so scrapy keeps running


class CrawlerProcess(object):
    def __init__(self):
        self._active = set()
        self.crawlers = set()

    def crawl(self, spidercls, *args, **kwargs):
        crawler = Crawler(spidercls)
        self.crawlers.add(crawler)

        d = crawler.crawl(*args, **kwargs)
        self._active.add(d)
        return d

    def start(self):
        dl = defer.DeferredList(self._active)
        dl.addBoth(self._stop_reactor)
        reactor.run()

    def _stop_reactor(self, _=None):
        reactor.stop()


class Spider(object):
    def start_requests(self):
        for url in self.start_urls:
            yield Request(url)


class ChoutiSpider(Spider):
    name = "chouti"
    start_urls = [
        'http://dig.chouti.com/',
    ]

    def parse(self, response):
        print(response.text)


class CnblogsSpider(Spider):
    name = "cnblogs"
    start_urls = [
        'http://www.cnblogs.com/',
    ]

    def parse(self, response):
        print(response.text)


if __name__ == '__main__':
    spider_cls_list = [ChoutiSpider, CnblogsSpider]

    crawler_process = CrawlerProcess()
    for spider_cls in spider_cls_list:
        crawler_process.crawl(spider_cls)

    crawler_process.start()
```