Scrapy Framework

I. Introduction

Scrapy is an open-source, collaborative framework originally designed for page scraping (more precisely, web scraping); it lets you extract the data you need from websites quickly, simply and extensibly. Today Scrapy is used far more broadly: for data mining, monitoring and automated testing, for consuming data returned by APIs (for example Amazon Associates Web Services), and as a general-purpose web crawler. Scrapy is built on Twisted, a popular event-driven Python networking framework, so it uses non-blocking (i.e. asynchronous) code to achieve concurrency. The overall architecture is roughly as follows.

1. Engine (ENGINE)

The engine controls the data flow between all components of the system and triggers events when certain actions occur. See the data-flow section of the official documentation for details.

2. Scheduler (SCHEDULER)

Accepts requests from the engine, pushes them onto a queue, and hands them back when the engine asks for the next one. You can think of it as a priority queue of URLs: it decides which URL to crawl next and also removes duplicate URLs.

3. Downloader (DOWNLOADER)

Downloads the page content and returns it to the ENGINE. The downloader is built on Twisted, an efficient asynchronous model.

4. Spiders (SPIDERS)

SPIDERS are classes written by the developer that parse responses, extract items, and send new requests.

5. Item Pipelines (ITEM PIPELINES)

Responsible for processing items after they have been extracted; mainly cleaning, validation and persistence (for example, saving to a database).

6. Downloader Middlewares (DOWNLOADER MIDDLEWARES)

Sit between the Scrapy ENGINE and the DOWNLOADER and handle the requests passed from the ENGINE to the DOWNLOADER as well as the responses passed from the DOWNLOADER back to the ENGINE. You can use this middleware to do the following (a minimal sketch appears at the end of this section):

1. process a request just before it is sent to the Downloader (i.e. right before Scrapy sends the request to the website)

2. change received response before passing it to a spider

3. send a new Request instead of passing received response to a spider

4. pass response to a spider without fetching a web page

5. silently drop some requests

7. Spider Middlewares (SPIDER MIDDLEWARES)

Sit between the ENGINE and the SPIDERS; their main job is to process the SPIDERS' input (responses) and output (requests and items).
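To make point 6 above concrete, here is a minimal, hedged sketch of a downloader middleware using those hooks. The class name, header and meta key are made up purely for illustration; a full, project-specific version appears in section IX.

from scrapy.http import HtmlResponse

class TagRequestMiddleware(object):
    # illustrative only: stamp every outgoing request and optionally
    # short-circuit the download by returning a Response directly
    def process_request(self, request, spider):
        request.headers['X-Debug'] = 'traced'          # (1) modify the request before download
        if request.meta.get('use_fake_page'):          # (4) return a Response to skip the real download
            return HtmlResponse(url=request.url, body=b'<html></html>',
                                encoding='utf-8', request=request)
        return None                                    # continue normal processing

    def process_response(self, request, response, spider):
        return response                                # (2) could alter the response here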

II. Installation

Windows:
	1. pip3 install wheel   # enables installing packages from .whl files; wheels available at https://www.lfd.uci.edu/~gohlke/pythonlibs
	2. pip3 install lxml
	3. pip3 install pyopenssl
	4. pip3 install pywin32
	5. Download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
	6. pip3 install <download_dir>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
	7. pip3 install scrapy


Linux:
	1. pip3 install scrapy
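On either platform you can confirm the install worked (the same command is listed again in the next section):

	scrapy version -v   # prints Scrapy's version plus the versions of its dependencies (Twisted, lxml, ...)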

III. Command-line tool

#1 View help
    scrapy -h
    scrapy <command> -h

#2 There are two kinds of commands: Project-only commands must be run from inside a project directory, Global commands can be run anywhere
    Global commands:
        startproject #create a new project (worth memorizing)
        genspider    #create a spider: scrapy genspider <name> <domain> (worth memorizing)
        settings     #show settings; run inside a project directory to see that project's settings
        runspider    #run a standalone python spider file without creating a project
        shell        #scrapy shell <url>  interactive debugging, e.g. to check whether a selector rule is correct
        fetch        #fetch a single page independently of any project; also lets you inspect the request headers
        view         #download a page and open it in the browser, handy for spotting which data is loaded via ajax
        version      #scrapy version shows the scrapy version; scrapy version -v also shows the versions of its dependencies (worth memorizing)

    Project-only commands:
        crawl        #run a spider; requires a project, and make sure ROBOTSTXT_OBEY = False in the settings (worth memorizing)
        check        #check the project for errors
        list         #list the spiders contained in the project
        edit         #open a spider in an editor, rarely used
        parse        #scrapy parse <url> --callback <callback>  useful for verifying that a callback works as expected
        bench        #scrapy bench  run a quick benchmark / stress test

#3 Official docs
    https://docs.scrapy.org/en/latest/topics/commands.html
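A typical first session with these commands might look like the following; the project and spider names are placeholders, not part of the example project used later in this post:

    scrapy startproject demo_project
    cd demo_project
    scrapy genspider tmall tmall.com     # generates demo_project/spiders/tmall.py
    scrapy crawl tmall --nolog           # run it (with ROBOTSTXT_OBEY = False in settings.py)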

IV. Project structure and spider basics

project_name/
   scrapy.cfg
   project_name/
       __init__.py
       items.py
       pipelines.py
       settings.py
       spiders/
           __init__.py
           spider1.py
           spider2.py
           spider3.py

File overview:

  • scrapy.cfg   the project's main/deploy configuration; spider-related settings live in settings.py.
  • items.py     data models for the structured data you scrape, similar to Django's Model
  • pipelines.py item-processing behaviour, e.g. persisting the structured data
  • settings.py  project settings such as recursion depth, concurrency, download delay, etc. Note: setting names must be UPPERCASE, otherwise they are ignored; correct form: USER_AGENT='xxxx'
  • spiders      the spiders directory: create spider files here and write the crawling rules in them

Note: spider files are usually named after the target site's domain.

#Create run.py in the project root:
from scrapy.cmdline import execute
execute(['scrapy','crawl','<spider_name>','--nolog'])  # the same arguments you would type on the command line (lets you run the spider from PyCharm)
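An equivalent approach is a small sketch using Scrapy's CrawlerProcess API instead of the cmdline module; the spider name 'tmall' is just a placeholder here:

# run_process.py — alternative to scrapy.cmdline.execute
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # load the project's settings.py
process.crawl('tmall')                            # spider name, as given by its name attribute
process.start()                                   # blocks until the crawl finishes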

V. Spiders

import scrapy


class TmallSpider(scrapy.Spider):
    name = 'tmall'   # identifier used by the crawl command (scrapy crawl tmall)
    allowed_domains = ['www.tmall.com']  # allowed domain scope
    start_urls = ['https://www.tmall.com/']  # start URL(s)
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
        }
    }  # per-spider settings, looked up before the project settings

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0], callback=self.parse)

    def parse(self, response):  # parse callback
        pass
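As a hedged sketch of what the parse callback inside this spider typically does (the selectors and dict keys below are illustrative, not tmall's real markup): yield scraped data and schedule follow-up requests.

    def parse(self, response):
        # yield structured data; it will flow to the item pipelines
        for row in response.css('div.product'):              # illustrative selector
            yield {
                'title': row.css('a::attr(title)').extract_first(),
                'price': row.css('em::text').extract_first(),
            }
        # follow pagination; response.follow resolves relative URLs
        next_page = response.css('a.next::attr(href)').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)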

VI. Selectors

import scrapy
from .. import items


class XiaopapaSpider(scrapy.Spider):
    name = 'xiaopapa'
    allowed_domains = ['list.tmall.com']
    start_urls = 'https://list.tmall.com/search_product.htm?q=%C4%D0%D7%B0&totalPage=1&jumpto=1'
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36 LIEBO"
        }
    }

    def start_requests(self):
        # errback handles any exception raised for this request
        yield scrapy.Request(url=self.start_urls, callback=self.totalPage_parse, dont_filter=True, errback=self.err_case)

    def totalPage_parse(self, response):
        # extract_first() returns the first match; extract() would return a list of all matches
        totalPage = int(response.css('[name="totalPage"]::attr(value)').extract_first())
        # rewrite the URL so the page number can be substituted in below
        url = self.start_urls.replace('&totalPage=1&jumpto=1', '&totalPage=%s&jumpto={}' % totalPage)
        for i in range(1, totalPage + 1):
            next_url = url.format(i)
            # request every page of the listing
            yield scrapy.Request(url=next_url, callback=self.parse_info, dont_filter=True, errback=self.err_case)

    def parse_info(self, response):
        product_selector_list = response.css('[class="product  "]')  # all product blocks on this page
        for product_selector in product_selector_list:
            # images are lazy-loaded, so fall back to the data-ks-lazyload attribute when src is missing
            # (the argument to extract_first is used as the default value)
            product_imgurl = product_selector.css('[class="productImg-wrap"] img::attr(src)').extract_first(
                product_selector.css('[class="productImg-wrap"] img::attr(data-ks-lazyload)').extract_first())
            # print(product_imgurl)
            product_title = product_selector.css('[class="productTitle"] a::attr(title)').extract_first(None)
            print(product_title)
            product_price = product_selector.css('[class="productPrice"] em::attr(title)').extract_first(None)
            # print(product_price)
            item = items.DemoItem()
            item['product_imgurl'] = product_imgurl
            item['product_title'] = product_title
            item['product_price'] = product_price
            yield item

    def err_case(self, failure):  # receives a twisted Failure
        print(failure)
        print("error callback fired")
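Since this section is about Selectors, a few standalone examples of the selector API used above may help; the HTML snippet is made up purely for illustration:

from scrapy.http import HtmlResponse

html = b'<div class="productTitle"><a title="shirt" href="/item/1">shirt</a></div>'
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')

response.css('.productTitle a::attr(title)').extract_first()   # 'shirt'
response.css('.productTitle a::text').extract()                # ['shirt'] - extract() returns all matches as a list
response.xpath('//a/@href').extract_first()                    # '/item/1' - the same idea with XPath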

VII. Items

import scrapy


class DemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    product_imgurl = scrapy.Field()  # define one Field per piece of data you want to collect
    product_title = scrapy.Field()
    product_price = scrapy.Field()

Note: uncomment ITEM_PIPELINES in settings.py; the lower the number, the higher the priority. For example:
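(the package name demo_project below is a placeholder for your own project package)

# settings.py
ITEM_PIPELINES = {
    'demo_project.pipelines.DemoPipeline': 300,   # lower number = higher priority
}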

VIII. Item Pipeline

import pymongo

class DemoPipeline(object):
    def __init__(self, MONGO_INFO_DICT, MONGO_DB, MONGO_TABLE):
        self.MONGO_INFO_DICT = MONGO_INFO_DICT
        self.MONGO_DB = MONGO_DB
        self.MONGO_TABLE = MONGO_TABLE

    @classmethod
    def from_crawler(cls, crawler):
        """
        Scrapy first checks (via getattr) whether we defined from_crawler; if so it
        uses it to build the instance instead of calling __init__ directly.
        """
        MONGO_INFO_DICT = crawler.settings.get('MONGO_INFO_DICT')  # the crawler already carries the settings, no import needed
        MONGO_DB = crawler.settings.get('MONGO_DB')
        MONGO_TABLE = crawler.settings.get('MONGO_TABLE')
        return cls(MONGO_INFO_DICT, MONGO_DB, MONGO_TABLE)

    def open_spider(self, spider):
        """
        Runs once when the spider starts.
        """
        print('spider started')
        self.client = pymongo.MongoClient(**self.MONGO_INFO_DICT)
        self.table = self.client[self.MONGO_DB][self.MONGO_TABLE]

    def close_spider(self, spider):
        """
        Runs once when the spider closes.
        """
        print('spider finished')
        self.client.close()

    def process_item(self, item, spider):
        product_dict = dict(item)
        self.table.insert_one(product_dict)  # insert_one replaces pymongo's deprecated insert()
        print("%s written to MongoDB" % product_dict)
        return item
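The three MONGO_* settings read in from_crawler are custom keys, so they have to be added to settings.py yourself; one possible shape, with placeholder values:

# settings.py — custom keys consumed by DemoPipeline.from_crawler
MONGO_INFO_DICT = {'host': '127.0.0.1', 'port': 27017}   # passed straight to pymongo.MongoClient(**...)
MONGO_DB = 'tmall'
MONGO_TABLE = 'products'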

IX. Downloader Middleware

from scrapy.exceptions import IgnoreRequest
from scrapy import signals
from scrapy.http import Response,Request
import requests

# the proxy-pool service comes from an earlier note in this series (crawler note 6)
def get_proxy():
    return requests.get("http://127.0.0.1:5010/get/").text

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))


class DemoSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class DemoDownloaderMiddleware1(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        print(get_proxy())
        request.meta['proxy'] = 'http://' + get_proxy()
        request.meta['download_timeout'] = 10
        # print(request.meta['proxy'])
        # return Request(url='https://www.baidu.com')

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        print("in process_response of middleware 1")
        return response
        # return Request(url='https://www.baidu.com')

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        print("in process_exception of middleware 1")
        proxy = request.meta.get('proxy').replace('http://','')
        delete_proxy(proxy)
        request.meta['proxy'] = 'http://' + get_proxy()
        return request

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

Notes:

1. Uncomment DOWNLOADER_MIDDLEWARES in settings.py; the lower the number, the higher the priority (a sample entry follows these notes).

2. Also pay attention to the return values of process_request and process_response:

a request exception is propagated back through process_exception, starting from the last middleware, before being handed to the errback;

a response exception goes straight to the errback (it is not handled by process_exception).

3. Proxy configuration (done in the middleware above via request.meta['proxy'])
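A sample settings.py entry matching note 1; the package name demo_project is a placeholder and 543 is simply Scrapy's conventional default value:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'demo_project.middlewares.DemoDownloaderMiddleware1': 543,   # lower number = higher priority
}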
