The Scrapy crawler framework

Scrapy is not a library of functions but a crawler framework.
A crawler framework is a software architecture plus a collection of components for implementing crawlers.
The framework is a semi-finished product: it helps users build professional web crawlers.

1. Scrapy framework structure

 

Data flow, stage 1:

1 The Engine obtains crawl requests (Request) from a Spider.
2 The Engine forwards the requests to the Scheduler for scheduling.

Data flow, stage 2:

3 The Engine obtains the next request to crawl from the Scheduler.

4 The Engine sends the request to the Downloader through the downloader middleware.

5 After fetching the page, the Downloader builds a response (Response) and sends it back to the Engine through the middleware.

6 The Engine passes the response to the Spider through the spider middleware for processing.

Data flow, stage 3:

7 The Spider processes the response and produces scraped Items and new requests (Requests), which it hands back to the Engine.

8 The Engine sends the scraped items to the Item Pipeline (the framework's exit).
9 The Engine sends the new requests to the Scheduler.

The Engine controls the data flow between all modules, continuously fetching requests from the Scheduler until no requests remain.
Framework entry point: the Spider's initial requests.
Framework exit point: the Item Pipeline.
The Engine, Scheduler, and Downloader are already implemented by the framework; the Spiders and the Item Pipelines are written by the user.

Engine
(1) Controls the data flow between all modules
(2) Triggers events according to conditions

Downloader
Downloads web pages according to the requests

Scheduler

Schedules and manages all crawl requests

Downloader Middleware

Purpose: user-configurable control over the data passing between the Engine, the Scheduler, and the Downloader

Functions: modify, drop, or add requests or responses
Users may write configuration code here

Spider
(1) Parses the responses (Response) returned by the Downloader
(2) Produces scraped items
(3) Produces additional crawl requests (Request)

Users may write configuration code here

Item Pipelines
(1) Process the items produced by Spiders in a pipelined fashion
(2) Consist of a sequence of operations, like an assembly line; each operation is an Item Pipeline class
(3) Typical operations: cleaning, validating, and de-duplicating the HTML data in scraped items, and storing the data in a database
Users need to write configuration code here

Spider Middleware
Purpose: re-process requests and scraped items
Functions: modify, drop, or add requests or scraped items
2. Common Scrapy commands

startproject   create a new project                     scrapy startproject <name> [dir]
genspider      create a new spider in the project       scrapy genspider [options] <name> <domain>
settings       show the crawler settings                scrapy settings [options]
crawl          run a spider                             scrapy crawl <spider>
list           list all spiders in the project          scrapy list
shell          start an interactive shell for URL debugging   scrapy shell [url]
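
A typical workflow with these commands looks like the following (the project name day96 and the spider name chouti match the examples later in this post):

scrapy startproject day96              # create the project skeleton
cd day96
scrapy genspider chouti chouti.com     # generate day96/spiders/chouti.py
scrapy list                            # -> chouti
scrapy crawl chouti                    # run the spider
scrapy shell http://dig.chouti.com/    # debug selectors interactively against one URL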

3. Creating a Scrapy project

Working with the Scrapy framework mostly means writing configuration-style code.

Step 1: create a Scrapy project.
Pick a directory (e.g. D:\) and run the following command:

scrapy startproject <name> [dir]
The generated directory structure:
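
For a project created with scrapy startproject day96 (the name is just the one used in the examples below), the layout generally looks like this:

day96/
    scrapy.cfg            # deployment configuration
    day96/                # the project's Python module
        __init__.py
        items.py          # Item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # Item Pipelines
        settings.py       # project settings
        spiders/          # spider code goes here
            __init__.py

The spider file below (saved under spiders/, e.g. chouti.py) is the first example; earlier experiments are kept as comments.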


# -*- coding: utf-8 -*-
import scrapy
import sys
import io
from scrapy.http import Request
from scrapy.selector import Selector, HtmlXPathSelector
from ..items import ChoutiItem

sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='gb18030')

class ChoutiSpider(scrapy.Spider):
    name = "chouti"
    allowed_domains = ["chouti.com"]
    start_urls = ['http://dig.chouti.com/']

    visited_urls =set()

    # def start_requests(self):
    #     for url in self.start_urls:
    #         yield Request(url,callback=self.parse)

    def parse(self, response):
        # content = str(response.body, encoding='utf-8')
        # Find all <a> tags in the document
        # hxs = Selector(response=response).xpath('//a') # list of selector objects
        # for i in hxs:
        #     print(i) # selector object

        # Convert selector objects to strings
        # hxs = Selector(response=response).xpath('//div[@id="content-list"]/div[@class="item"]').extract()  # list of strings
        # hxs = Selector(response=response).xpath('//div[@id="content-list"]/div[@class="item"]')  # list of selector objects
        # for obj in hxs:
        #     a = obj.xpath('.//a[@class="show-content"]/text()').extract_first()
        #     print(a.strip())
        # Selector cheat sheet:
        """
        //   表示子孙中
        .//  当前对象的子孙中
        /    儿子
        /div 儿子中的div标签
        /div[@id="i1"]  儿子中的div标签且id=i1
        /div[@id="i1"]  儿子中的div标签且id=i1
        obj.extract()         # 列表中的每一个对象转换字符串 =》 []
        obj.extract_first()   # 列表中的每一个对象转换字符串 => 列表第一个元素
        //div/text()    获取某个标签的文本

       """

        # Get all page-number links on the current page

        # hxs = Selector(response=response).xpath('//div[@id="dig_lcpage"]//a/text()')
        # hxs0 = Selector(response=response).xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        # for item in hxs0:
        #     if item in self.visited_urls:
        #         print('already seen', item)
        #     else:
        #         self.visited_urls.add(item)
        #         print(item)

        # hxs2 = Selector(response=response).xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        # hxs2 = Selector(response=response).xpath('//a[starts-with(@href, "/all/hot/recent/")]/@href').extract()
    #     hxs2 = Selector(response=response).xpath('//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
    #     for url in hxs2:
    #         md5_url = self.md5(url)
    #         if md5_url in self.visited_urls:
    #             pass
    #             # print('already seen', url)
    #         else:
    #             self.visited_urls.add(md5_url)
    #             print(url)
    #             url = "http://dig.chouti.com%s" % url
    #             # add the newly discovered url to the scheduler
    #             yield Request(url=url, callback=self.parse)
    #             # a/@href                                              get an attribute
    #             # //a[starts-with(@href, "/all/hot/recent/")]/@href    href starting with ...
    #             # //a[re:test(@href, "/all/hot/recent/\d+")]           regular expression
    #             # yield Request(url=url, callback=self.parse)          # add the new url to the scheduler
    #             # override start_requests to choose the callback for the very first requests
    #
    #             # def show(self, response):
    #             #     print(response.text)
    #
    # def md5(self, url):
    #     import hashlib
    #     obj = hashlib.md5()
    #     obj.update(bytes(url, encoding='utf-8'))
    #     return obj.hexdigest()


        # hxs = HtmlXPathSelector(response)
        hxs1 = Selector(response=response).xpath('//div[@id="content-list"]/div[@class="item"]')  # list of selector objects
        for obj in hxs1:
            title = obj.xpath('.//a[@class="show-content color-chag"]/text()').extract_first().strip()
            href = obj.xpath('.//a[@class="show-content color-chag"]/@href').extract_first().strip()
            # print(title)
            # print(href)
            item_obj = ChoutiItem(title=title, href=href)
            # hand the item object to the pipelines
            yield item_obj
A first Scrapy spider

parse() handles the response: it parses the content (e.g. into items/dictionaries) and discovers new URLs to request.
Configuring the generated spider means specifying (1) the initial URL(s) and (2) how to parse the fetched pages.

yield and generators
A generator produces one value at each yield statement; the function is then frozen and, when resumed, produces the next value. In other words, a generator is a function that keeps producing values.
Advantages of a generator over producing the whole list at once:
1) uses less memory
2) responds sooner
3) more flexible to use
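
A minimal, Scrapy-independent sketch of the difference (the function names are made up):

def squares_list(n):
    return [i * i for i in range(n)]   # the whole list is built in memory first

def squares_gen(n):
    for i in range(n):
        yield i * i                    # produce one value, then freeze until resumed

for v in squares_gen(5):
    print(v)                           # prints 0 1 4 9 16, one value at a time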
4. Scrapy data types

The Request class:

class scrapy.http.Request()
A Request object represents one HTTP request.
It is generated by a Spider and executed by the Downloader.

Attribute / method   Description
.url      the URL the request targets
.method   the HTTP method, e.g. 'GET', 'POST'
.headers  dictionary-style request headers
.body     the request body, as a string
.meta     user-supplied extra information, used to pass data between Scrapy components
.copy()   copy the request
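
For example, a spider callback can build a Request by hand; the spider name, the page URL, and the 'page' meta key below are only illustrative:

import scrapy
from scrapy.http import Request

class PageSpider(scrapy.Spider):
    name = "page_example"                         # hypothetical spider
    start_urls = ['http://dig.chouti.com/']

    def parse(self, response):
        # follow an assumed second-page URL and pass extra data along in meta
        yield Request(
            url="http://dig.chouti.com/all/hot/recent/2",
            method='GET',
            meta={'page': 2},
            callback=self.parse_page,
        )

    def parse_page(self, response):
        # Response attributes from the tables in this section
        print(response.url, response.status, response.meta['page'])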

The Response class:

class scrapy.http.Response()
A Response object represents one HTTP response.
It is generated by the Downloader and processed by a Spider.

Attribute / method   Description
.url      the URL of the response
.status   the HTTP status code, e.g. 200
.headers  the response headers
.body     the response content, as a string
.flags    a set of flags
.request  the Request object that produced this Response
.copy()   copy the response

The Item class

class scrapy.item.Item()
An Item object represents a piece of information extracted from an HTML page.
It is generated by a Spider and processed by the Item Pipeline.
An Item behaves like a dictionary and can be manipulated with dictionary operations.
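
For example, using the ChoutiItem defined later in this post, an item is created and accessed like a dict (the field values here are made up):

import scrapy

class ChoutiItem(scrapy.Item):
    title = scrapy.Field()
    href = scrapy.Field()

item = ChoutiItem(title='a headline', href='http://dig.chouti.com/')
print(item['title'])                              # read a field like a dict key
item['href'] = 'http://dig.chouti.com/link/1'     # assign like a dict key (illustrative URL)
print(dict(item))                                 # convert to a plain dict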

Scrapy supports several ways of extracting information from HTML:

Beautiful Soup
lxml
re
XPath Selector
CSS Selector
Concurrency options in settings.py
Option  Description
CONCURRENT_REQUESTS               maximum number of concurrent requests performed by the Downloader, default 16
CONCURRENT_ITEMS                  maximum number of items processed concurrently in the Item Pipeline, default 100
CONCURRENT_REQUESTS_PER_DOMAIN    maximum number of concurrent requests per target domain, default 8
CONCURRENT_REQUESTS_PER_IP        maximum number of concurrent requests per target IP, default 0 (only takes effect when non-zero)
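
In settings.py these options are plain assignments, e.g. (the values shown are just one possible configuration):

# settings.py
CONCURRENT_REQUESTS = 16              # Downloader-wide limit
CONCURRENT_ITEMS = 100                # Item Pipeline limit
CONCURRENT_REQUESTS_PER_DOMAIN = 8    # per-domain limit
CONCURRENT_REQUESTS_PER_IP = 0        # 0 disables the per-IP limit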

5. Formatting and persisting items

import scrapy
import sys
import io
from scrapy.http import Request
from scrapy.selector import Selector, HtmlXPathSelector
from ..items import ChoutiItem


sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='gb18030')

class ChoutiSpider(scrapy.Spider):
    name = "chouti"
    allowed_domains = ["chouti.com"]
    start_urls = ['http://dig.chouti.com/']

    visited_urls = set()

    def parse(self, response):
        hxs1 = Selector(response=response).xpath('//div[@id="content-list"]/div[@class="item"]')  # list of selector objects
        for obj in hxs1:
            title = obj.xpath('.//a[@class="show-content"]/text()').extract_first().strip()
            href = obj.xpath('.//a[@class="show-content"]/@href').extract_first().strip()
            item_obj = ChoutiItem(title=title, href=href)

            # hand the item object to the pipelines
            yield item_obj
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ChoutiItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    href = scrapy.Field()
items.py
class Day96Pipeline(object):
    def process_item(self, item, spider):
        print(spider, item)
        # if spider.name == 'chouti'
        tpl = "%s\n%s\n\n" %(item['title'],item['href'])
        f = open('news.json', 'a')
        f.write(tpl)
        f.close()
pipelines.py
ITEM_PIPELINES = {
   'day96.pipelines.Day96Pipeline': 300,
}
settings

Custom pipeline

from scrapy.exceptions import DropItem

class Day96Pipeline(object):

    def __init__(self,conn_str):
        self.conn_str = conn_str

    @classmethod
    def from_crawler(cls, crawler):
        """
        初始化时候,用于创建pipeline对象
        :param crawler:
        :return:
        """
        conn_str = crawler.settings.get('DB')
        return cls(conn_str)

    def open_spider(self,spider):
        """
        爬虫开始执行时,调用
        :param spider:
        :return:
        """
        self.conn = open(self.conn_str, 'a')

    def close_spider(self,spider):
        """
        爬虫关闭时,被调用
        :param spider:
        :return:
        """
        self.conn.close()

    def process_item(self, item, spider):
        """
        每当数据需要持久化时,就会被调用
        :param item:
        :param spider:
        :return:
        """
        # if spider.name == 'chouti'
        tpl = "%s\n%s\n\n" %(item['title'],item['href'])
        self.conn.write(tpl)
        # 交给下一个pipeline处理
        return item
        # 丢弃item,不交给
        # raise DropItem()
Custom pipeline

6. Cookies

from scrapy.http.cookies import CookieJar

# inside a spider callback, extract the cookies set by the response:
cookie_obj = CookieJar()
cookie_obj.extract_cookies(response, response.request)
print(cookie_obj._cookies)
# -*- coding: utf-8 -*-
import scrapy
import sys
import io
from scrapy.http import Request
from scrapy.selector import Selector, HtmlXPathSelector
from ..items import ChoutiItem


sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='gb18030')
from scrapy.http.cookies import CookieJar
class ChoutiSpider(scrapy.Spider):
    name = "chouti"
    allowed_domains = ["chouti.com",]
    start_urls = ['http://dig.chouti.com/']

    cookie_dict = None
    def parse(self, response):
        print("spider.reponse",response)

        cookie_obj = CookieJar()
        cookie_obj.extract_cookies(response,response.request)
        self.cookie_dict = cookie_obj._cookies
        # send the username/password along with the cookies
        yield Request(
            url="http://dig.chouti.com/login",
            method='POST',
            body = "phone=8615131255089&password=woshiniba&oneMonth=1",
            headers={'Content-Type': "application/x-www-form-urlencoded; charset=UTF-8"},
            cookies=cookie_obj._cookies,
            callback=self.check_login
        )

    def check_login(self,response):
        print(response.text)
        yield Request(url="http://dig.chouti.com/",callback=self.good)


    def good(self,response):
        id_list = Selector(response=response).xpath('//div[@share-linkid]/@share-linkid').extract()
        for nid in id_list:
            print(nid)
            url = "http://dig.chouti.com/link/vote?linksId=%s" % nid
            yield Request(
                url=url,
                method="POST",
                cookies=self.cookie_dict,
                callback=self.show
            )

        page_urls = Selector(response=response).xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_urls:
            url = "http://dig.chouti.com%s" % page
            yield Request(url=url,callback=self.good)


    def show(self,response):
        print(response.text)
Logging in with cookies

7. Custom extensions

from scrapy import signals
class MyExtend:

    def __init__(self,crawler):
        self.crawler = crawler
        # register callbacks on the chosen signals
        crawler.signals.connect(self.start, signals.engine_started)
        crawler.signals.connect(self.close, signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def start(self):
        print('signals.engine_started.start')

    def close(self):
        print('signals.spider_closed.close')
extension
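
The extension only runs once it is enabled in settings.py; the module path below assumes the class above is saved in an extensions.py module of the day96 project (adjust to your own layout):

# settings.py
EXTENSIONS = {
    'day96.extensions.MyExtend': 300,
}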

8. Proxies

  from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware
    
    Method 1: use the built-in HttpProxyMiddleware, which reads the proxy address from environment variables:
        os.environ
        {
            http_proxy:  http://root:@ip:port/
            https_proxy: http://ip:port/
        }
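
    In other words, Method 1 only requires setting the proxy environment variables before the crawl starts; the address below is a placeholder:

    import os

    os.environ['http_proxy'] = 'http://root:@192.168.1.10:8000/'
    os.environ['https_proxy'] = 'http://192.168.1.10:8000/'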
    Method 2: use a custom downloader middleware:
    
    import base64
    import random

    import six

    def to_bytes(text, encoding=None, errors='strict'):
        if isinstance(text, bytes):
            return text
        if not isinstance(text, six.string_types):
            raise TypeError('to_bytes must receive a unicode, str or bytes '
                            'object, got %s' % type(text).__name__)
        if encoding is None:
            encoding = 'utf-8'
        return text.encode(encoding, errors)

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            PROXIES = [
                {'ip_port': '111.11.228.75:80', 'user_pass': ''},
                {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
            ]
            # pick a random proxy for this request
            proxy = random.choice(PROXIES)
            if proxy['user_pass'] is not None:
                request.meta['proxy'] = "http://%s" % proxy['ip_port']
                encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass']))
                request.headers['Proxy-Authorization'] = to_bytes('Basic ') + encoded_user_pass
                print("**************ProxyMiddleware have pass************" + proxy['ip_port'])
            else:
                print("**************ProxyMiddleware no pass************" + proxy['ip_port'])
                request.meta['proxy'] = "http://%s" % proxy['ip_port']
    
    DOWNLOADER_MIDDLEWARES = {
       'step8_king.middlewares.ProxyMiddleware': 500,
    }
Proxy middleware

9. Middleware

class SpiderMiddleware(object):

    def process_spider_input(self,response, spider):
        """
        下载完成,执行,然后交给parse处理
        :param response: 
        :param spider: 
        :return: 
        """
        pass

    def process_spider_output(self,response, result, spider):
        """
        spider处理完成,返回时调用
        :param response:
        :param result:
        :param spider:
        :return: 必须返回包含 Request 或 Item 对象的可迭代对象(iterable)
        """
        return result

    def process_spider_exception(self,response, exception, spider):
        """
        异常调用
        :param response:
        :param exception:
        :param spider:
        :return: None,继续交给后续中间件处理异常;含 Response 或 Item 的可迭代对象(iterable),交给调度器或pipeline
        """
        return None


    def process_start_requests(self,start_requests, spider):
        """
        爬虫启动时调用
        :param start_requests:
        :param spider:
        :return: 包含 Request 对象的可迭代对象
        """
        return start_requests
Spider middleware
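
A spider middleware takes effect only after it is listed in settings.py; the path below assumes the class above lives in the project's middlewares.py (the module names are assumptions):

# settings.py
SPIDER_MIDDLEWARES = {
    'day96.middlewares.SpiderMiddleware': 543,
}
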
class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        请求需要被下载时,经过所有下载器中间件的process_request调用
        :param request: 
        :param spider: 
        :return:  
            None,继续后续中间件去下载;
            Response对象,停止process_request的执行,开始执行process_response
            Request对象,停止中间件的执行,将Request重新调度器
            raise IgnoreRequest异常,停止process_request的执行,开始执行process_exception
        """
        pass



    def process_response(self, request, response, spider):
        """
        spider处理完成,返回时调用
        :param response:
        :param result:
        :param spider:
        :return: 
            Response 对象:转交给其他中间件process_response
            Request 对象:停止中间件,request会被重新调度下载
            raise IgnoreRequest 异常:调用Request.errback
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        当下载处理器(download handler)或 process_request() (下载中间件)抛出异常
        :param response:
        :param exception:
        :param spider:
        :return: 
            None:继续交给后续中间件处理异常;
            Response对象:停止后续process_exception方法
            Request对象:停止中间件,request将会被重新调用下载
        """
        return None
Downloader middleware

10. Custom commands

Besides the command line, Scrapy crawlers can also be started from code. Because Scrapy is built on the Twisted asynchronous networking library, the crawl must run inside the Twisted reactor.

from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # find the names of all spiders
        print(type(self.crawler_process))
        # <class 'scrapy.crawler.CrawlerProcess'>
        from scrapy.crawler import CrawlerProcess
        # What CrawlerProcess does internally, roughly:
        # 1. run the CrawlerProcess constructor
        # 2. the CrawlerProcess object knows the spiders declared in the settings
        #    2.1 create a Crawler for every spider
        #    2.2 run d = Crawler.crawl(...)   # ************************ #
        #            d.addBoth(_done)
        #    2.3 CrawlerProcess._active = {d, }
        # 3. dd = defer.DeferredList(self._active)
        #    dd.addBoth(self._stop_reactor)  # self._stop_reactor ==> reactor.stop()
        #    reactor.run

        # get the names of all spiders in the current project
        spider_list = self.crawler_process.spiders.list()
        # spider_list = ["chouti",'cnblogs']

        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
Custom command that runs all spiders
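
For Scrapy to discover the command, the file is usually placed in a commands package inside the project and registered with the COMMANDS_MODULE setting; the file name crawlall.py and the package path below are assumptions for illustration:

# assumed layout:
#   day96/
#       commands/
#           __init__.py
#           crawlall.py      # contains the Command class above
#       settings.py

# settings.py
COMMANDS_MODULE = 'day96.commands'

# the command name is the module file name, so it is run with:
#   scrapy crawlall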

 
