Scrapy

Scrapy is an extensible, full-featured crawling framework.

Scrapy uses the Twisted asynchronous networking library to handle network communication. Its overall architecture is roughly as follows.

Scrapy consists of the following components:

  • Engine (Scrapy Engine)
    Handles the data flow between all the other components and triggers events; it is the core of the framework.
  • Scheduler
    Accepts requests from the engine, pushes them onto a queue, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs (the links to be crawled): it decides which URL is fetched next and removes duplicate URLs.
  • Downloader
    Downloads page content and hands it back to the spiders. The downloader is built on top of Twisted's efficient asynchronous model.
  • Spiders
    Spiders do the real work: they extract the information they need from specific pages, i.e. the items. A spider can also extract links from a page and let Scrapy go on to crawl the next page.
  • Item Pipeline
    Processes the items extracted by the spiders. Its main jobs are persisting items, validating them, and cleaning out unneeded data. After a page has been parsed by a spider, the resulting items are sent to the item pipeline and processed by several components in a defined order.
  • Downloader Middlewares
    Hooks between the engine and the downloader that process the requests and responses passing between them.
  • Spider Middlewares
    Hooks between the engine and the spiders that process the spiders' response input and request output.
  • Scheduler Middlewares
    Hooks between the engine and the scheduler that process the requests and responses sent from the engine to the scheduler.

The Scrapy run loop roughly looks like this:

  1. The engine takes a URL from the scheduler for the next crawl.
  2. The engine wraps the URL in a Request and passes it to the downloader.
  3. The downloader fetches the resource and wraps it in a Response.
  4. A spider parses the Response.
  5. If it parses out an item, the item is handed to the item pipeline for further processing.
  6. If it parses out a URL, the URL is handed to the scheduler to wait for crawling.
Installation:

Linux/mac

- pip3 install scrapy

Windows:

- Install Twisted

a. pip3 install wheel

b. Download a Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

c. cd into the download directory and run: pip3 install Twisted-xxxxx.whl

- Install scrapy

d. pip3 install scrapy -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

- Install pywin32

e. pip3 install pywin32 -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

Quick start

scrapy startproject weiwu   # create a project
cd weiwu
 
scrapy genspider chouti chouti.com  # create a spider
 
scrapy crawl chouti --nolog     # run the spider (--nolog suppresses log output)

Project structure

 
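The project-structure screenshot from the original post is not reproduced here; a typical layout for the weiwu project created by the commands above (exact files vary slightly between Scrapy versions) looks roughly like this:

weiwu/
    scrapy.cfg
    weiwu/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            chouti.py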

File descriptions:

  • scrapy.cfg   the project's top-level configuration (the actual crawler settings live in settings.py)
  • items.py     data-storage templates used to structure scraped data, similar to Django models
  • pipelines    item-processing behaviour, e.g. persisting structured data
  • settings.py  configuration, e.g. recursion depth, concurrency, download delay
  • spiders      the spider directory: create spider files here and write the crawling rules in them

Note: spider files are usually named after the target site's domain.

# -*- coding: utf-8 -*-
import scrapy


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://dig.chouti.com/']

    def parse(self, response):
        print(response.text)

If the output is garbled on Windows, add the following:

import sys, io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

A first real example

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from ..items import WeiwuItem

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://dig.chouti.com/',]

    def parse(self, response):
        """
        当起始URL下载完毕后,自动执行parse函数:response封装了响应相关的所有内容。
        :param response:
        :return:
        """

        hxs = HtmlXPathSelector(response=response)

        # in the downloaded page, find the news entries
        items = hxs.xpath("//div[@id='content-list']/div[@class='item']")
        for item in items:
            href = item.xpath('.//div[@class="part1"]//a[1]/@href').extract_first()
            text = item.xpath('.//div[@class="part1"]//a[1]/text()').extract_first()
            item = WeiwuItem(title=text,href=href)
            yield item

        pages = hxs.xpath('//div[@id="page-area"]//a[@class="ct_pagepa"]/@href').extract()
        for page_url in pages:
            page_url = "https://dig.chouti.com" + page_url
            yield Request(url=page_url,callback=self.parse)

To run this spider, cd into the project directory in a terminal and run:

scrapy crawl chouti --nolog

The important points in the code above:

  • Request is a class that wraps a user request; yielding one from a callback tells Scrapy to keep crawling that URL.
  • HtmlXPathSelector structures the HTML and provides selector functionality.
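The spider imports WeiwuItem from ..items, whose definition is not shown in the original; based on the call WeiwuItem(title=text, href=href) above, a minimal items.py sketch (field names assumed from the spider code) would be:

# items.py (sketch)
import scrapy

class WeiwuItem(scrapy.Item):
    title = scrapy.Field()
    href = scrapy.Field()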

1. spider: write the crawler, parse responses and issue further requests.
def parse():
      - HtmlXPathSelector  # similar to BeautifulSoup (bs4): structures the HTML and provides selectors
      - yield item    # hand an Item object to the pipelines for persistence
      - yield request # a Request wraps a user request; yielding it from a callback continues crawling
2. item/pipelines   # formatting and persistence

Configuration:
ITEM_PIPELINES = {
            'weiwu.pipelines.WeiwuPipeline': 300,
          }
Usage:
 class WeiwuPipeline(object):

            def process_item(self, item, spider):
               self.f.write(item['href']+'\n')
               self.f.flush()   # persist here (this example writes to a file; a database would work the same way)

               return item

            def open_spider(self, spider):
               """
               爬虫开始执行时,调用
               :param spider:
               :return:
               """
               self.f = open('url.log','w')

            def close_spider(self, spider):
               """
               爬虫关闭时,被调用
               :param spider:
               :return:
               """
               self.f.close()

Selectors

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from scrapy.selector import Selector, HtmlXPathSelector
from scrapy.http import HtmlResponse
html = """<!DOCTYPE html>
<html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <ul>
            <li class="item-"><a id='i1' href="link.html">first item</a></li>
            <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
            <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
        </ul>
        <div><a href="llink2.html">second item</a></div>
    </body>
</html>
"""
response = HtmlResponse(url='http://example.com', body=html,encoding='utf-8')
# hxs = HtmlXPathSelector(response)
# print(hxs)
# hxs = Selector(response=response).xpath('//a')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[2]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
# print(hxs)
 
# ul_list = Selector(response=response).xpath('//body/ul/li')
# for item in ul_list:
#     v = item.xpath('./a/span')
#     # or
#     # v = item.xpath('a/span')
#     # or
#     # v = item.xpath('*/a/span')
#     print(v)

A selector's job is to turn the HTML string into objects you can query.

- Option 1:
response.xpath("//div[@id='content-list']/div[@class='item']")
- Option 2:
hxs = HtmlXPathSelector(response=response)
items = hxs.xpath("//div[@id='content-list']/div[@class='item']")
Lookup rules:
//a
//div/a
//a[re:test(@id, "i\d+")] # regular expression: [re:test(@attribute, pattern)]

items = hxs.xpath("//div[@id='content-list']/div[@class='item']")
for item in items:
    item.xpath('.//div')
Extracting results:
selector objects: xpath('/html/body/ul/li/a/@href')
list of strings:  xpath('/html/body/ul/li/a/@href').extract()
single value:     xpath('//body/ul/li/a/@href').extract_first()
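As a quick sketch of the difference, reusing the response object built in the selector example above:

sel = Selector(response=response)
print(sel.xpath('//ul/li/a/@href'))                  # a SelectorList of selector objects
print(sel.xpath('//ul/li/a/@href').extract())        # list of strings: ['link.html', 'llink.html', 'llink2.html']
print(sel.xpath('//ul/li/a/@href').extract_first())  # first string only: 'link.html'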
Full example:
from scrapy.selector import Selector, HtmlXPathSelector
from scrapy.http import HtmlResponse

html = """<!DOCTYPE html>
                <html>
                    <head lang="en">
                        <meta charset="UTF-8">
                        <title></title>
                    </head>
                    <body>
                        <ul>
                            <li class="item-"><a id='i1' href="link.html">first item</a></li>
                            <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
                            <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
                        </ul>
                        <div><a href="llink2.html">second item</a></div>
                    </body>
                </html>
                """
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
obj = response.xpath('//a[@id="i1"]/text()').extract_first()
print(obj)

Tip: in the browser, right-click → Inspect, select the tag in the Elements panel, then right-click → Copy → Copy XPath to get an XPath expression for that tag directly.

 

Example: log in to chouti.com automatically and upvote posts

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from scrapy.http.cookies import CookieJar
from scrapy import FormRequest


class ChouTiSpider(scrapy.Spider):
    # name of the spider; used on the command line to start it
    name = "chouti"
    # allowed domains
    allowed_domains = ["chouti.com"]

    cookie_dict = {}
    has_request_set = {}

    def start_requests(self):
        url = 'http://dig.chouti.com/'
        # return [Request(url=url, callback=self.login)]
        yield Request(url=url, callback=self.login)

    def login(self, response):
        cookie_jar = CookieJar()
        cookie_jar.extract_cookies(response, response.request)
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookie_dict[m] = n.value

        req = Request(
            url='http://dig.chouti.com/login',
            method='POST',
            headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            body='phone=8615131255089&password=pppppppp&oneMonth=1',
            cookies=self.cookie_dict,
            callback=self.check_login
        )
        yield req

    def check_login(self, response):
        req = Request(
            url='http://dig.chouti.com/',
            method='GET',
            callback=self.show,
            cookies=self.cookie_dict,
            dont_filter=True
        )
        yield req

    def show(self, response):
        # print(response)
        hxs = HtmlXPathSelector(response)
        news_list = hxs.select('//div[@id="content-list"]/div[@class="item"]')
        for new in news_list:
            # temp = new.xpath('div/div[@class="part2"]/@share-linkid').extract()
            link_id = new.xpath('*/div[@class="part2"]/@share-linkid').extract_first()
            yield Request(
                url='http://dig.chouti.com/link/vote?linksId=%s' %(link_id,),
                method='POST',
                cookies=self.cookie_dict,
                callback=self.do_favor
            )

        page_list = hxs.select('//div[@id="dig_lcpage"]//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
        for page in page_list:

            page_url = 'http://dig.chouti.com%s' % page
            import hashlib
            hash = hashlib.md5()
            hash.update(bytes(page_url,encoding='utf-8'))
            key = hash.hexdigest()
            if key in self.has_request_set:
                pass
            else:
                self.has_request_set[key] = page_url
                yield Request(
                    url=page_url,
                    method='GET',
                    callback=self.show
                )

    def do_favor(self, response):
        print(response.text)

Note: set DEPTH_LIMIT = 1 in settings.py to limit the depth of the "recursion".
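As a sketch of how that works (DEPTH_LIMIT goes into settings.py; the current depth is stored in response.meta by Scrapy's built-in DepthMiddleware, so it can be inspected in a callback):

# settings.py
DEPTH_LIMIT = 1   # only follow links one level deep from the start URLs

# inside a spider callback, the depth of the current response can be checked:
def parse(self, response):
    print(response.url, response.meta.get('depth', 0))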

Formatting items

The examples above are simple enough to handle everything inside parse. When more data processing is needed, Scrapy's items can be used to structure the data, which is then handed to the pipelines for uniform processing.

import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from scrapy.http.cookies import CookieJar
from scrapy import FormRequest


class XiaoHuarSpider(scrapy.Spider):
    # name of the spider; used on the command line to start it
    name = "xiaohuar"
    # allowed domains
    allowed_domains = ["xiaohuar.com"]

    start_urls = [
        "http://www.xiaohuar.com/list-1-1.html",
    ]
    # custom_settings = {
    #     'ITEM_PIPELINES':{
    #         'spider1.pipelines.JsonPipeline': 100
    #     }
    # }
    has_request_set = {}

    def parse(self, response):
        # analyse the page
        # find the content that matches the rules (images) and save it
        # find all <a> tags, then follow them, level by level

        hxs = HtmlXPathSelector(response)

        items = hxs.select('//div[@class="item_list infinite_scroll"]/div')
        for item in items:
            src = item.select('.//div[@class="img"]/a/img/@src').extract_first()
            name = item.select('.//div[@class="img"]/span/text()').extract_first()
            school = item.select('.//div[@class="img"]/div[@class="btns"]/a/text()').extract_first()
            url = "http://www.xiaohuar.com%s" % src
            from ..items import XiaoHuarItem
            obj = XiaoHuarItem(name=name, school=school, url=url)
            yield obj

        urls = hxs.select('//a[re:test(@href, "http://www.xiaohuar.com/list-1-\d+.html")]/@href')
        for url in urls:
            key = self.md5(url)
            if key in self.has_request_set:
                pass
            else:
                self.has_request_set[key] = url
                req = Request(url=url,method='GET',callback=self.parse)
                yield req

    @staticmethod
    def md5(val):
        import hashlib
        ha = hashlib.md5()
        ha.update(bytes(val, encoding='utf-8'))
        key = ha.hexdigest()
        return key
spiders/xiaohuar.py
import scrapy


class XiaoHuarItem(scrapy.Item):
    name = scrapy.Field()
    school = scrapy.Field()
    url = scrapy.Field()
items
import json
import os
import requests


class JsonPipeline(object):
    def __init__(self):
        self.file = open('xiaohua.txt', 'w')

    def process_item(self, item, spider):
        v = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(v)
        self.file.write('\n')
        self.file.flush()
        return item


class FilePipeline(object):
    def __init__(self):
        if not os.path.exists('imgs'):
            os.makedirs('imgs')

    def process_item(self, item, spider):
        response = requests.get(item['url'], stream=True)
        file_name = '%s_%s.jpg' % (item['name'], item['school'])
        with open(os.path.join('imgs', file_name), mode='wb') as f:
            f.write(response.content)
        return item
pipelines
ITEM_PIPELINES = {
   'spider1.pipelines.JsonPipeline': 100,
   'spider1.pipelines.FilePipeline': 300,
}
# The integer after each entry determines the order in which the pipelines run: items pass through pipelines from lower to higher numbers. By convention these numbers are defined in the 0-1000 range.
settings

Pipelines can do quite a bit more, for example:

from scrapy.exceptions import DropItem

class CustomPipeline(object):
    def __init__(self,v):
        self.value = v

    def process_item(self, item, spider):
        # process and persist the item

        # returning the item lets later pipelines keep processing it
        return item

        # to discard the item so that later pipelines never see it:
        # raise DropItem()


    @classmethod
    def from_crawler(cls, crawler):
        """
        初始化时候,用于创建pipeline对象
        :param crawler: 
        :return: 
        """
        val = crawler.settings.getint('MMMM')
        return cls(val)

    def open_spider(self,spider):
        """
        爬虫开始执行时,调用
        :param spider: 
        :return: 
        """
        print('000000')

    def close_spider(self,spider):
        """
        爬虫关闭时,被调用
        :param spider: 
        :return: 
        """
        print('111111')
A custom pipeline
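For this pipeline to run, it must be registered in settings.py, and the MMMM setting read by from_crawler must exist there as well. A sketch, with the module path assumed to follow the weiwu project used earlier and MMMM being just a placeholder name from the example above:

# settings.py (sketch)
MMMM = 10   # arbitrary value, read by CustomPipeline.from_crawler via crawler.settings.getint('MMMM')

ITEM_PIPELINES = {
    'weiwu.pipelines.CustomPipeline': 300,
}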

 

start_requests

def start_requests(self):
    for url in self.start_urls:
        yield Request(url=url, callback=self.parse2)


def start_requests(self):
    req_list = []
    for url in self.start_urls:
        req_list.append(Request(url=url, callback=self.parse2))
    return req_list


# Either form works: Scrapy converts the return value into an iterator internally.

For a spider, the crawl cycle goes through these stages:

  1. You start by generating the initial requests for the first URLs to crawl, and you specify a callback to be invoked with the responses downloaded for those requests.

    The first requests are obtained by calling start_requests(), which by default generates a Request for each URL in start_urls, with the parse method as the callback for those requests.

  2. In the callback you parse the response (the web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these. Returned requests carry their own callback (possibly the same one); Scrapy downloads them and their responses are handled by that callback.

  3. In callbacks you usually parse the page content with selectors (though you can also use BeautifulSoup, lxml, or any mechanism you prefer) and build items from the parsed data.

  4. Finally, the items returned from the spider are typically persisted to a database (in an item pipeline) or written to a file using Feed exports.

start_requests() is the spider's entry point. It must return an iterable or a generator of requests; the spider crawls the site based on these requests.

So far we have put the links to crawl into start_urls, but if there are many links and they follow a pattern, we need to override this method. First, let's look at what start_requests actually does:
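In older Scrapy versions the default implementation is roughly the following (a simplified sketch of Scrapy's Spider class; newer versions drop make_requests_from_url, but the idea is the same):

# simplified from scrapy/spiders/__init__.py (older Scrapy versions)
def start_requests(self):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)

def make_requests_from_url(self, url):
    return Request(url, dont_filter=True)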

As you can see, it simply reads the links from start_urls and turns each one into a Request via make_requests_from_url.

That means we can override start_requests and generate our own patterned links instead of listing them all in start_urls, as in the sketch below:
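A hedged sketch; the URL template here is made up purely for illustration:

import scrapy

class PagesSpider(scrapy.Spider):
    name = 'pages'

    def start_requests(self):
        # generate pages 1..10 instead of hard-coding them in start_urls
        for n in range(1, 11):
            url = 'http://example.com/list-%d.html' % n
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        pass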

Scrapy schedules the scrapy.Request objects returned by start_requests(). For each response it receives, it instantiates a Response object and calls the callback associated with the request (here the parse() method), passing the response as the argument.

A shortcut to the start_requests() method

Instead of implementing start_requests() to build scrapy.Request objects from URLs, you can simply define a start_urls attribute: a list of URLs. This list is used by the default start_requests() implementation to create the initial requests for your spider. Either approach works.

2. The parse method:

This method handles the download result of each request. The response argument is an instance of TextResponse: it holds the page content and offers further helpers for working with it.
The parse() method is usually used to parse the response, extract the scraped data as dicts, and/or find new URLs to follow and create new requests from them.

 

pipelines

Once data has been extracted successfully, it needs to be validated, stored, and so on; storage is used as the example here.
After an Item has been collected in a Spider, it is passed to the Item Pipeline, where a series of components process it in a defined order.
Pipelines typically perform operations such as (see the sketch after this list for points 2 and 3):
1. cleaning HTML data
2. validating scraped data (checking that items contain certain fields)
3. checking for (and dropping) duplicates
4. storing the scraped result in a database
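A hedged sketch of points 2 and 3, using DropItem as described later in this section; the field name href is assumed from the earlier examples:

from scrapy.exceptions import DropItem

class ValidateAndDedupPipeline(object):
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        if not item.get('href'):              # 2. validate: a required field is missing
            raise DropItem('missing href')
        if item['href'] in self.seen:         # 3. drop duplicates
            raise DropItem('duplicate item')
        self.seen.add(item['href'])
        return item                           # hand the item on to the next pipeline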

To begin, create a file MyPipelines.py under the scrapytest/ directory.

MyPipelines.py looks like this:

from scrapy.exceptions import DropItem
import json

class MyPipeline(object):
    def __init__(self):
        # open the output file
        self.file = open('data.json', 'w', encoding='utf-8')
    # this method processes each item
    def process_item(self, item, spider):
        # serialize the item's data as JSON
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        # write it to the file
        self.file.write(line)
        # hand the item on to the next pipeline
        return item
    # called when the spider is opened
    def open_spider(self, spider):
        pass
    # called when the spider is closed
    def close_spider(self, spider):
        pass

To use a Pipeline it first has to be registered.

Open settings.py, the project's configuration file,

and add:

ITEM_PIPELINES = {
    'scrapytest.MyPipelines.MyPipeline': 1,
}

This registers the Pipeline: scrapytest.MyPipelines.MyPipeline is the class being registered, and the value 1 is its priority, in the range 0-1000; lower values run earlier.

With that, a most basic crawl is complete.

Now run

scrapy crawl MySpider --nolog

and a data.json file appears in the project root directory, containing the scraped data.

With multiple pipelines, a lower value means a higher priority,
and whatever one pipeline's process_item returns is passed on to the next pipeline's process_item.


Tip: to drop an item so that later pipelines never see it:
from scrapy.exceptions import DropItem
class FilePipeline(object):

    def process_item(self, item, spider):
        print('writing to file', item['href'])

        # return item
        raise DropItem()


process_item(self, item, spider)

This method is called for every item pipeline component. process_item() must either return a dict with data, return an Item (or any subclass) object, return a Twisted Deferred, or raise a DropItem exception. Dropped items are no longer processed by further pipeline components.

Reading values from the configuration file before doing the pipeline work:
class FilePipeline(object):
    def __init__(self, path):
        self.path = path
        self.f = None

    @classmethod
    def from_crawler(cls, crawler):
        """
        初始化时候,用于创建pipeline对象
        :param crawler:
        :return:
        """
        path = crawler.settings.get('XL_FILE_PATH')
        return cls(path)

    def process_item(self, item, spider):
        self.f.write(item['href'] + '\n')
        return item

    def open_spider(self, spider):
        """
        爬虫开始执行时,调用
        :param spider:
        :return:
        """
        self.f = open(self.path, 'w')

    def close_spider(self, spider):
        """
        爬虫关闭时,被调用
        :param spider:
        :return:
        """
        self.f.close()

 

POST / request headers / cookies


POST with request headers:
from scrapy.http import Request
req = Request(
      url='http://dig.chouti.com/login',
      method='POST',
      body='phone=8613121758648&password=woshiniba&oneMonth=1',
      headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
      cookies={},
      callback=self.parse_check_login,
)
cookies:

# manual cookie handling:
cookie_dict = {}
cookie_jar = CookieJar()
cookie_jar.extract_cookies(response, response.request)
for k, v in cookie_jar._cookies.items():
    for i, j in v.items():
        for m, n in j.items():
            cookie_dict[m] = n.value

req = Request(
    url='http://dig.chouti.com/login',
    method='POST',
    headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
    body='phone=8615131255089&password=pppppppp&oneMonth=1',
    cookies=cookie_dict,  # attach the cookies manually
    callback=self.check_login
)
yield req

# automatic cookie handling (meta={'cookiejar': True}):
class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://dig.chouti.com/', ]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse_index, meta={'cookiejar': True})

    def parse_index(self, response):
        req = Request(
            url='http://dig.chouti.com/login',
            method='POST',
            headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            body='phone=8613121758648&password=woshiniba&oneMonth=1',
            callback=self.parse_check_login,
            meta={'cookiejar': True}
        )
        yield req

    def parse_check_login(self, response):
        # print(response.text)
        yield Request(
            url='https://dig.chouti.com/link/vote?linksId=19440976',
            method='POST',
            callback=self.parse_show_result,
            meta={'cookiejar': True}
        )

    def parse_show_result(self, response):
        print(response.text)

Whether cookies are handled at all is controlled in the configuration file:
# Disable cookies (enabled by default)
# COOKIES_ENABLED = False
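As an aside, the FormRequest imported (but unused) in the login example can express the same POST more declaratively: it url-encodes the form fields and sets the Content-Type header itself. A hedged sketch; the field names must match the site's actual login form:

from scrapy import FormRequest

def parse_index(self, response):
    yield FormRequest(
        url='http://dig.chouti.com/login',
        formdata={'phone': '8613121758648', 'password': 'woshiniba', 'oneMonth': '1'},
        callback=self.parse_check_login,
        meta={'cookiejar': True},   # let Scrapy's cookie middleware track the session
    )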

 

Avoiding duplicate visits

By default Scrapy uses scrapy.dupefilters.RFPDupeFilter for de-duplication. The related settings are:

DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
DUPEFILTER_DEBUG = False
JOBDIR = "directory where the record of seen requests is kept, e.g. /root/"  # the final path becomes /root/requests.seen

Custom URL de-duplication

class RepeatUrl:
    def __init__(self):
        self.visited_url = set()

    @classmethod
    def from_settings(cls, settings):
        """
        初始化时,调用
        :param settings: 
        :return: 
        """
        return cls()

    def request_seen(self, request):
        """
        检测当前请求是否已经被访问过
        :param request: 
        :return: True表示已经访问过;False表示未访问过
        """
        if request.url in self.visited_url:
            return True
        self.visited_url.add(request.url)
        return False

    def open(self):
        """
        开始爬去请求时,调用
        :return: 
        """
        print('open replication')

    def close(self, reason):
        """
        结束爬虫爬取时,调用
        :param reason: 
        :return: 
        """
        print('close replication')

    def log(self, request, spider):
        """
        记录日志
        :param request: 
        :param spider: 
        :return: 
        """
        print('repeat', request.url)

 

Creating a unique identifier for each request:

Question: do the visit records really have to live in a database?
They can be kept in a Redis set, i.e. the visit records can be stored in Redis.
Question: where exactly is dont_filter used?
from scrapy.core.scheduler import Scheduler

def enqueue_request(self, request):
      # request.dont_filter=False
         # self.df.request_seen(request):
         #   - True: already visited
         #   - False: not visited yet
      # request.dont_filter=True: always handed to the scheduler
      if not request.dont_filter and self.df.request_seen(request):
         self.df.log(request, self.spider)
         return False
      # if we get here, the request is pushed into the scheduler's queue
      dqok = self._dqpush(request)
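Scrapy builds such a unique identifier itself with request_fingerprint (a stable hash of method, URL and body), and those fingerprints can indeed live in Redis instead of a local set. A minimal sketch, assuming a local Redis server and the redis package are available:

import redis
from scrapy.utils.request import request_fingerprint

class RedisDupeFilter(object):
    """Same interface as the RepeatUrl example above, but stores fingerprints in Redis."""

    def __init__(self):
        self.conn = redis.Redis(host='127.0.0.1', port=6379)

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        fp = request_fingerprint(request)            # stable hash of method + url + body
        # sadd returns 1 if the value was new, 0 if it was already in the set
        return self.conn.sadd('dupefilter:seen', fp) == 0

    def open(self):
        pass

    def close(self, reason):
        pass

    def log(self, request, spider):
        pass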

 

Middleware

How do we attach the same request headers to every request the spider makes?

Option 1: add the headers to each Request object individually.

Option 2: use a downloader middleware.

1. Configuration:

DOWNLOADER_MIDDLEWARES = {
    'xianglong.middlewares.UserAgentDownloaderMiddleware': 543,
}

2. Write the class:

class SpiderMiddleware(object):

    def process_spider_input(self, response, spider):
        """
        Called after a download finishes, before the response is handed to parse.
        :param response:
        :param spider:
        :return:
        """
        pass

    def process_spider_output(self, response, result, spider):
        """
        Called with what the spider returns after processing a response.
        :param response:
        :param result:
        :param spider:
        :return: must return an iterable containing Request or Item objects
        """
        return result

    def process_spider_exception(self, response, exception, spider):
        """
        Called when an exception is raised.
        :param response:
        :param exception:
        :param spider:
        :return: None to let later middlewares handle the exception, or an iterable of Response or Item objects to hand to the scheduler or the pipelines
        """
        return None


    def process_start_requests(self, start_requests, spider):
        """
        Called when the spider starts.
        :param start_requests:
        :param spider:
        :return: an iterable of Request objects
        """
        return start_requests
spider middleware (shown for reference; the downloader middleware used for option 2 follows)
class UserAgentDownloaderMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        request.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"

        # return None            # continue with the remaining middlewares' process_request

        # from scrapy.http import Request
        # return Request(url='www.baidu.com')  # put a new request back into the scheduler; the current request is not processed any further

        # from scrapy.http import HtmlResponse  # run every middleware's process_response, starting from the last one
        # return HtmlResponse(url='www.baidu.com', body=b'asdfuowjelrjaspdoifualskdjf;lajsdf')

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
Option 3: the built-in downloader middleware
Configuration file:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'

Downloader Middleware

Downloader middlewares are hooks into Scrapy's request/response processing: a light, low-level system for globally modifying Scrapy's requests and responses.

  1. Activating a downloader middleware
    To activate a downloader middleware component, add it to the DOWNLOADER_MIDDLEWARES setting. This setting is a dict whose keys are middleware class paths and whose values are the middleware orders.

    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.CustomDownloaderMiddleware': 543,
    }
    

    To decide which order to assign to your middleware, look at the DOWNLOADER_MIDDLEWARES_BASE setting and pick a value according to where you want the middleware to sit. Order matters, because each middleware performs a different action and your middleware may depend on middlewares that run before (or after) it.

    To disable a built-in middleware (one defined in DOWNLOADER_MIDDLEWARES_BASE and enabled by default), you must list it in your project's DOWNLOADER_MIDDLEWARES setting with None as its value. For example, to disable the user-agent middleware:

    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.CustomDownloaderMiddleware': 543,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    }
    

    Finally, keep in mind that some middlewares need to be enabled through a particular setting; see each middleware's documentation for details.

  2. Writing your own downloader middleware

    Writing a downloader middleware is straightforward. Each middleware component is a Python class that defines one or more of the following methods:

    class scrapy.contrib.downloadermiddleware.DownloaderMiddleware
    

    process_request(request, spider)

    Parameters:
    *   request (Request object) – the request being processed
    *   spider (Spider object) – the spider this request belongs to

    This method is called for every request that passes through the downloader middleware.

    process_request() must return one of the following: None, a Response object, a Request object, or raise IgnoreRequest.

    If it returns None, Scrapy continues processing the request, running the corresponding method of the other middlewares until an appropriate download handler is called and the request is performed (its response downloaded).

    If it returns a Response object, Scrapy will not call any other process_request() or process_exception() method, nor the download function; it returns that response instead. The process_response() methods of installed middlewares are still called for every response.

    If it returns a Request object, Scrapy stops calling process_request methods and reschedules the returned request. Once the newly returned request has been performed, the middleware chain is called on its downloaded response.

    If it raises an IgnoreRequest exception, the process_exception() methods of the installed downloader middlewares are called. If none of them handles the exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).

    process_response(request, response, spider)

    Parameters:

    *   request (Request object) – the request that originated the response
    *   response (Response object) – the response being processed
    *   spider (Spider object) – the spider the response belongs to
    

    process_response() must return one of the following: a Response object, a Request object, or raise an IgnoreRequest exception.

    If it returns a Response (the same one that was passed in or a brand-new one), that response is processed by the process_response() methods of the remaining middlewares in the chain.

    If it returns a Request object, the middleware chain stops and the returned request is rescheduled for download. The behaviour is the same as when process_request() returns a request.

    If it raises an IgnoreRequest exception, the request's errback (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).

    process_exception(request, exception, spider)

    Parameters:

    *   request (Request object) – the request that caused the exception
    *   exception (Exception object) – the raised exception
    *   spider (Spider object) – the spider the request belongs to
    

    Scrapy calls process_exception() when a download handler or process_request() (from a downloader middleware) raises an exception (including an IgnoreRequest exception).

    process_exception() should return one of the following: None, a Response object, or a Request object.

    If it returns None, Scrapy continues processing this exception, running the process_exception() methods of the other installed middlewares until no middlewares are left and the default exception handling kicks in.

    If it returns a Response object, the process_response() method chain of the installed middlewares is started, and Scrapy will not call any other middleware's process_exception() method.

    If it returns a Request object, the returned request is rescheduled for download. This stops the execution of the middlewares' process_exception() methods, just as returning a response would.
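As one concrete use of process_exception, a downloader middleware can turn a failed download into a retry by returning a new Request. A hedged sketch (in a real project the built-in RetryMiddleware already does this; the meta key retry_times here is just an illustrative name):

class SimpleRetryMiddleware(object):
    MAX_RETRIES = 2

    def process_exception(self, request, exception, spider):
        retries = request.meta.get('retry_times', 0)
        if retries < self.MAX_RETRIES:
            spider.logger.info('retrying %s (%s)', request.url, exception)
            new_request = request.replace(dont_filter=True)  # bypass the dupefilter for the retry
            new_request.meta['retry_times'] = retries + 1
            return new_request                               # the returned Request is rescheduled
        return None                                          # give up: let other middlewares / errback handle it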

Question: how do you add a proxy in Scrapy?

Option 1: the built-in proxy support

# -*- coding: utf-8 -*-
import os
import scrapy
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def start_requests(self):
        os.environ['HTTP_PROXY'] = "http://192.168.11.11"

        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response)

Option 2: a custom downloader middleware

import random
import base64
import six


def to_bytes(text, encoding=None, errors='strict'):
    """Return the binary representation of `text`. If `text`
    is already a bytes object, return it as-is."""
    if isinstance(text, bytes):
        return text
    if not isinstance(text, six.string_types):
        raise TypeError('to_bytes must receive a unicode, str or bytes '
                        'object, got %s' % type(text).__name__)
    if encoding is None:
        encoding = 'utf-8'
    return text.encode(encoding, errors)


class MyProxyDownloaderMiddleware(object):
    def process_request(self, request, spider):
        proxy_list = [
            {'ip_port': '111.11.228.75:80', 'user_pass': 'xxx:123'},
            {'ip_port': '120.198.243.22:80', 'user_pass': ''},
            {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
            {'ip_port': '101.71.27.120:80', 'user_pass': ''},
            {'ip_port': '122.96.59.104:80', 'user_pass': ''},
            {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
        ]
        proxy = random.choice(proxy_list)
        if proxy['user_pass']:  # an empty string means no proxy authentication is needed
            request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
            # base64.encodestring was removed in Python 3.9; b64encode works everywhere
            encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass'])).decode('ascii')
            request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass)
        else:
            request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])

The middleware also has to be registered in the configuration file (uncomment to enable it):

DOWNLOADER_MIDDLEWARES = {
    # 'xiaohan.middlewares.MyProxyDownloaderMiddleware': 543,
}

Question: how does Scrapy handle HTTPS?

# configuration file
DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
DOWNLOADER_CLIENTCONTEXTFACTORY = "xiaohan.middlewares.MySSLFactory"
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)


class MySSLFactory(ScrapyClientContextFactory):
    def getCertificateOptions(self):
        from OpenSSL import crypto
        v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
        v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
        return CertificateOptions(
            privateKey=v1,  # a PKey object
            certificate=v2,  # an X509 object
            verify=False,
            method=getattr(self, 'method', getattr(self, '_ssl_method', None))
        )

What downloader middlewares are for: customizing requests and responses before and after every download, e.g. USER_AGENT, cookies, proxies.

Writing a custom downloader middleware takes the following steps:

1. Define a class:

class CustomMiddleware(object):

2. def process_request(self, request, spider):

   Must return one of: None, a Response, a Request, or raise IgnoreRequest. Usually this returns None so that the download proceeds.

3. def process_response(self, request, response, spider):

   Must return one of: a Response, a Request, or raise IgnoreRequest. Returning the response unchanged is the usual case:

          return response

4. def process_exception(self, request, exception, spider):

   Must return one of: a Response, a Request, or None. If you have no special exception handling, omit this method or simply pass.

5. Finally, register it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    '<project name>.middlewares.CustomMiddleware': 1,
}

 

Spider middleware

Configuration file:

SPIDER_MIDDLEWARES = {
    'xiaohan.middlewares.XiaohanSpiderMiddleware': 543,
}

Write the spider middleware class:

class XiaohanSpiderMiddleware(object):
 
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    # called after each download finishes, before parse runs
    def process_spider_input(self, response, spider):
        print('process_spider_input', response)
        return None

    def process_spider_output(self, response, result, spider):
        print('process_spider_output', response)
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    # triggered when the spider starts, when start_requests runs for the first time (only once)
    def process_start_requests(self, start_requests, spider):
        print('process_start_requests')
        for r in start_requests:
            yield r

Extensions: signals

A bare extension:

extends.py

class MyExtension(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        obj = cls()
        return obj

Configuration:

EXTENSIONS = {'xiaohan.extends.MyExtension': 500,}

 

Extension plus signals:

extends.py

from scrapy import signals


class MyExtension(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        obj = cls()
        # when the spider opens, every function connected to the spider_opened signal is triggered
        crawler.signals.connect(obj.spider_opened_handler, signal=signals.spider_opened)
        # when the spider closes, every function connected to the spider_closed signal is triggered
        crawler.signals.connect(obj.spider_closed_handler, signal=signals.spider_closed)
        return obj

    def spider_opened_handler(self, spider):
        print('open')

    def spider_closed_handler(self, spider):
        print('close')

Configuration:

EXTENSIONS = {'xiaohan.extends.MyExtension': 500,}

 

Appendix: the configuration file (annotated settings.py)

# -*- coding: utf-8 -*-

# Scrapy settings for step8_king project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

# 1. name of the bot
BOT_NAME = 'step8_king'

# 2. where the spider modules live
SPIDER_MODULES = ['step8_king.spiders']
NEWSPIDER_MODULE = 'step8_king.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# 3. client User-Agent request header
# USER_AGENT = 'step8_king (+http://www.yourdomain.com)'

# Obey robots.txt rules
# 4. whether to obey robots.txt (set to False to ignore it)
# ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# 5. maximum number of concurrent requests
# CONCURRENT_REQUESTS = 4

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# 6. download delay in seconds
# DOWNLOAD_DELAY = 2


# The download delay setting will honor only one of:
# 7. concurrent requests per domain; the download delay is also applied per domain
# CONCURRENT_REQUESTS_PER_DOMAIN = 2
# concurrent requests per IP; if set, CONCURRENT_REQUESTS_PER_DOMAIN is ignored and the download delay is applied per IP
# CONCURRENT_REQUESTS_PER_IP = 3

# Disable cookies (enabled by default)
# 8. whether cookies are enabled; a cookiejar is used to manage them
# COOKIES_ENABLED = True
# COOKIES_DEBUG = True

# Disable Telnet Console (enabled by default)
# 9. the Telnet console lets you inspect and control the running crawler, etc.
#    connect with: telnet ip port, then issue commands
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]


# 10. default request headers
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }


# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# 11. item pipelines that process scraped items
# ITEM_PIPELINES = {
#    'step8_king.pipelines.JsonPipeline': 700,
#    'step8_king.pipelines.FilePipeline': 500,
# }



# 12. custom extensions, invoked via signals
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     # 'step8_king.extensions.MyExtension': 500,
# }


# 13. maximum crawl depth allowed; the current depth can be read from meta; 0 means unlimited
# DEPTH_LIMIT = 3

# 14. crawl order: 0 means depth-first (LIFO, the default); 1 means breadth-first (FIFO)

# last in, first out: depth-first
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
# first in, first out: breadth-first

# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

# 15. the scheduler queue
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler


# 16. URL de-duplication (dupefilter class)
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'


# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html

"""
17. 自动限速算法
    from scrapy.contrib.throttle import AutoThrottle
    自动限速设置
    1. 获取最小延迟 DOWNLOAD_DELAY
    2. 获取最大延迟 AUTOTHROTTLE_MAX_DELAY
    3. 设置初始下载延迟 AUTOTHROTTLE_START_DELAY
    4. 当请求下载完成后,获取其"连接"时间 latency,即:请求连接到接受到响应头之间的时间
    5. 用于计算的... AUTOTHROTTLE_TARGET_CONCURRENCY
    target_delay = latency / self.target_concurrency
    new_delay = (slot.delay + target_delay) / 2.0 # 表示上一次的延迟时间
    new_delay = max(target_delay, new_delay)
    new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
    slot.delay = new_delay
"""

# enable auto-throttling
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 10
# The average number of requests Scrapy should be sending in parallel to each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = True

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings


"""
18. 启用缓存
    目的用于将已经发送的请求或相应缓存下来,以便以后使用
    
    from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.extensions.httpcache import FilesystemCacheStorage
"""
# whether the cache is enabled
# HTTPCACHE_ENABLED = True

# cache policy: every request is cached; the next identical request is served straight from the cache
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# cache policy: cache according to HTTP response headers such as Cache-Control and Last-Modified
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

# cache expiration time in seconds
# HTTPCACHE_EXPIRATION_SECS = 0

# directory where the cache is stored
# HTTPCACHE_DIR = 'httpcache'

# HTTP status codes that should not be cached
# HTTPCACHE_IGNORE_HTTP_CODES = []

# cache storage backend
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


"""
19. 代理,需要在环境变量中设置
    from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware
    
    方式一:使用默认
        os.environ
        {
            http_proxy:http://root:woshiniba@192.168.11.11:9999/
            https_proxy:http://192.168.11.11:9999/
        }
    方式二:使用自定义下载中间件
    
    def to_bytes(text, encoding=None, errors='strict'):
        if isinstance(text, bytes):
            return text
        if not isinstance(text, six.string_types):
            raise TypeError('to_bytes must receive a unicode, str or bytes '
                            'object, got %s' % type(text).__name__)
        if encoding is None:
            encoding = 'utf-8'
        return text.encode(encoding, errors)
        
    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            PROXIES = [
                {'ip_port': '111.11.228.75:80', 'user_pass': ''},
                {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
            ]
            proxy = random.choice(PROXIES)
            if proxy['user_pass'] is not None:
                request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
                encoded_user_pass = base64.encodestring(to_bytes(proxy['user_pass']))
                request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass)
                print "**************ProxyMiddleware have pass************" + proxy['ip_port']
            else:
                print "**************ProxyMiddleware no pass************" + proxy['ip_port']
                request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
    
    DOWNLOADER_MIDDLEWARES = {
       'step8_king.middlewares.ProxyMiddleware': 500,
    }
    
"""

"""
20. Https访问
    Https访问时有两种情况:
    1. 要爬取网站使用的可信任证书(默认支持)
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"
        
    2. 要爬取网站使用的自定义证书
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"
        
        # https.py
        from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
        from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)
        
        class MySSLFactory(ScrapyClientContextFactory):
            def getCertificateOptions(self):
                from OpenSSL import crypto
                v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
                v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
                return CertificateOptions(
                    privateKey=v1,  # pKey对象
                    certificate=v2,  # X509对象
                    verify=False,
                    method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                )
    其他:
        相关类
            scrapy.core.downloader.handlers.http.HttpDownloadHandler
            scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
            scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
        相关配置
            DOWNLOADER_HTTPCLIENTFACTORY
            DOWNLOADER_CLIENTCONTEXTFACTORY

"""



"""
21. 爬虫中间件
    class SpiderMiddleware(object):

        def process_spider_input(self,response, spider):
            '''
            下载完成,执行,然后交给parse处理
            :param response: 
            :param spider: 
            :return: 
            '''
            pass
    
        def process_spider_output(self,response, result, spider):
            '''
            spider处理完成,返回时调用
            :param response:
            :param result:
            :param spider:
            :return: 必须返回包含 Request 或 Item 对象的可迭代对象(iterable)
            '''
            return result
    
        def process_spider_exception(self,response, exception, spider):
            '''
            异常调用
            :param response:
            :param exception:
            :param spider:
            :return: None,继续交给后续中间件处理异常;含 Response 或 Item 的可迭代对象(iterable),交给调度器或pipeline
            '''
            return None
    
    
        def process_start_requests(self,start_requests, spider):
            '''
            爬虫启动时调用
            :param start_requests:
            :param spider:
            :return: 包含 Request 对象的可迭代对象
            '''
            return start_requests
    
    内置爬虫中间件:
        'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
        'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': 500,
        'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
        'scrapy.contrib.spidermiddleware.urllength.UrlLengthMiddleware': 800,
        'scrapy.contrib.spidermiddleware.depth.DepthMiddleware': 900,

"""
# from scrapy.contrib.spidermiddleware.referer import RefererMiddleware
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
   # 'step8_king.middlewares.SpiderMiddleware': 543,
}


"""
22. 下载中间件
    class DownMiddleware1(object):
        def process_request(self, request, spider):
            '''
            请求需要被下载时,经过所有下载器中间件的process_request调用
            :param request:
            :param spider:
            :return:
                None,继续后续中间件去下载;
                Response对象,停止process_request的执行,开始执行process_response
                Request对象,停止中间件的执行,将Request重新调度器
                raise IgnoreRequest异常,停止process_request的执行,开始执行process_exception
            '''
            pass
    
    
    
        def process_response(self, request, response, spider):
            '''
            spider处理完成,返回时调用
            :param response:
            :param result:
            :param spider:
            :return:
                Response 对象:转交给其他中间件process_response
                Request 对象:停止中间件,request会被重新调度下载
                raise IgnoreRequest 异常:调用Request.errback
            '''
            print('response1')
            return response
    
        def process_exception(self, request, exception, spider):
            '''
            当下载处理器(download handler)或 process_request() (下载中间件)抛出异常
            :param response:
            :param exception:
            :param spider:
            :return:
                None:继续交给后续中间件处理异常;
                Response对象:停止后续process_exception方法
                Request对象:停止中间件,request将会被重新调用下载
            '''
            return None

    
    默认下载中间件
    {
        'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
        'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
        'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
        'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
        'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
        'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
        'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
        'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
        'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
        'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
        'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
    }

"""
# from scrapy.contrib.downloadermiddleware.httpauth import HttpAuthMiddleware
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'step8_king.middlewares.DownMiddleware1': 100,
#    'step8_king.middlewares.DownMiddleware2': 500,
# }
settings

Custom commands

  • Create a directory at the same level as spiders, e.g. commands.
  • In it, create a crawlall.py file (the file name becomes the custom command name), as shown below.
from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
crawlall.py
  • Add COMMANDS_MODULE = '<project name>.<directory name>' to settings.py.
  • Run the command from the project directory: scrapy crawlall
def run(self, args, opts):
    from scrapy.crawler import CrawlerProcess
    CrawlerProcess.crawl
    CrawlerProcess.start
    """
    self.crawler_process contains: _active = {d,}
    """
    self.crawler_process.crawl('chouti', **opts.__dict__)
    self.crawler_process.crawl('cnblogs', **opts.__dict__)
    #
    self.crawler_process.start()
A simplified look at the source
