Scrapy Framework
1. Introduction
Scrapy is an open-source, collaborative framework that was originally designed for page scraping (more precisely, web scraping): it lets you extract the data you need from websites quickly, simply, and extensibly. Today Scrapy's uses are far broader, covering areas such as data mining, monitoring, and automated testing; it can also fetch data returned by APIs (such as Amazon Associates Web Services) or power general-purpose crawlers. Scrapy is built on Twisted, a popular event-driven Python networking framework, so it uses non-blocking (i.e. asynchronous) code to achieve concurrency. The overall architecture is roughly as follows:
1. Engine (ENGINE)
The engine controls the data flow between all components of the system and triggers events when certain actions occur. See the data-flow section of the official documentation for details.
2. Scheduler (SCHEDULER)
Accepts requests from the engine, pushes them onto a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs: it decides which URL to crawl next, and it also removes duplicate URLs.
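The "priority queue plus de-duplication" idea can be sketched in plain Python. This is an illustration of the concept only, not Scrapy's actual scheduler (which fingerprints whole requests, not just URLs):

```python
import heapq

class SimpleScheduler:
    """Toy model of Scrapy's scheduler: a priority queue of URLs
    that silently drops duplicates (illustration only)."""

    def __init__(self):
        self._heap = []      # (priority, order, url); lower number is served first
        self._seen = set()   # set used for de-duplication
        self._order = 0      # tie-breaker that preserves insertion order

    def enqueue(self, url, priority=0):
        if url in self._seen:   # duplicate URL: dropped
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, self._order, url))
        self._order += 1
        return True

    def next_request(self):
        # return the highest-priority URL, or None when the queue is empty
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Enqueuing the same URL twice returns `False` the second time, and `next_request()` always yields the lowest priority number first.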
3. Downloader (DOWNLOADER)
Downloads page content and returns it to the ENGINE. The downloader is built on Twisted's efficient asynchronous model.
4. Spiders (SPIDERS)
Spiders are developer-defined classes that parse responses, extract items, and issue new requests.
5. Item Pipelines (ITEM PIPELINES)
Process items after they have been extracted; typical work includes cleaning, validation, and persistence (e.g. saving to a database).
6. Downloader Middlewares (DOWNLOADER MIDDLEWARES)
Sit between the Scrapy engine and the downloader, processing requests passed from the ENGINE to the DOWNLOADER and responses passed from the DOWNLOADER back to the ENGINE. You can use this middleware to:
1. process a request just before it is sent to the downloader (i.e. right before Scrapy sends the request to the website)
2. change a received response before passing it to a spider
3. send a new request instead of passing a received response to a spider
4. pass a response to a spider without fetching a web page
5. silently drop some requests
7. Spider Middlewares
Sit between the ENGINE and the SPIDERS; their main job is to process spider input (responses) and spider output (requests).
2. Installation
Windows
(Note: on recent Python versions, pip3 install scrapy alone usually works; the Twisted wheel steps below are only needed when pip cannot build Twisted from source.)
1. pip3 install wheel  (enables installing packages from wheel files; wheel downloads: https://www.lfd.uci.edu/~gohlke/pythonlibs)
2. pip3 install lxml
3. pip3 install pyopenssl
4. pip3 install pywin32
5. Download the Twisted wheel file: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
6. pip3 install download_dir\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
7. pip3 install scrapy
Linux
1. pip3 install scrapy
3. Command-Line Tool
#1 View help
scrapy -h
scrapy <command> -h
#2 There are two kinds of commands: Project-only commands must be run from inside a project directory, while Global commands can be run anywhere.
Global commands:
startproject  # create a project (worth memorizing)
genspider     # create a spider: scrapy genspider <name> <domain> (worth memorizing)
settings      # if run inside a project directory, shows that project's settings
runspider     # run a standalone Python spider file without creating a project
shell         # scrapy shell <url>: interactive debugging, e.g. to check whether selector rules are correct
fetch         # fetch a single page independently of any project; also shows the request headers
view          # download a page and open it in a browser, handy for spotting which data is loaded by Ajax
version       # scrapy version shows Scrapy's version; scrapy version -v also shows the versions of its dependencies (worth memorizing)
Project-only commands:
crawl         # run a spider; requires a project, and make sure ROBOTSTXT_OBEY = False in settings.py (worth memorizing)
check         # check the project for syntax errors
list          # list the spiders in the project
edit          # open a spider in an editor; rarely used
parse         # scrapy parse <url> --callback <method>: verify that a callback works as expected
bench         # scrapy bench: run a quick benchmark
#3 Official documentation
https://docs.scrapy.org/en/latest/topics/commands.html
4. Project Structure and Spider Application Overview
project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spider1.py
            spider2.py
            spider3.py
File overview:
- scrapy.cfg: the project's main configuration, used when deploying Scrapy; crawler-related settings live in settings.py.
- items.py: data-storage templates for structured data, similar to Django's Model.
- pipelines.py: data-processing behavior, e.g. persisting structured data.
- settings.py: configuration such as recursion depth, concurrency, download delay, etc. Note: setting names must be UPPERCASE or they are ignored; correct form: USER_AGENT = 'xxxx'.
- spiders: the spider directory; create files here and write the crawl rules.
Note: spider files are usually named after the target site's domain.
# Create run.py in the project directory (so the spider can be launched from PyCharm):
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'spider_name', '--nolog'])  # the list is exactly what you would type on the command line
5. Spiders
import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'tmall'  # identifier used to run the spider (scrapy crawl tmall)
    allowed_domains = ['www.tmall.com']  # allowed domain scope
    start_urls = ['https://www.tmall.com/']  # starting URLs
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
        }
    }  # per-spider settings; consulted before the project-wide settings

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0], callback=self.parse)

    def parse(self, response):  # parse callback
        pass
6. Selectors
import scrapy
from .. import items

class XiaopapaSpider(scrapy.Spider):
    name = 'xiaopapa'
    allowed_domains = ['list.tmall.com']
    start_urls = 'https://list.tmall.com/search_product.htm?q=%C4%D0%D7%B0&totalPage=1&jumpto=1'
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36 LIEBO"
        }
    }

    def start_requests(self):
        # errback handles any exception raised for this request
        yield scrapy.Request(url=self.start_urls, callback=self.totalPage_parse,
                             dont_filter=True, errback=self.err_case)

    def totalPage_parse(self, response):
        # extract_first() returns the first match; extract() would return a list of all matches
        totalPage = int(response.css('[name="totalPage"]::attr(value)').extract_first())
        # rewrite the URL: fix totalPage, and turn jumpto into a format placeholder
        url = self.start_urls.replace('&totalPage=1&jumpto=1',
                                      '&totalPage=%s&jumpto={}' % totalPage)
        for i in range(1, totalPage + 1):
            next_url = url.format(i)
            # request every page
            yield scrapy.Request(url=next_url, callback=self.parse_info,
                                 dont_filter=True, errback=self.err_case)

    def parse_info(self, response):
        # all product blocks on this page
        product_selector_list = response.css('[class="product "]')
        for product_selector in product_selector_list:
            # images are lazy-loaded, so pass the lazy-load attribute as the
            # default: if src is missing, fall back to data-ks-lazyload
            product_imgurl = product_selector.css('[class="productImg-wrap"] img::attr(src)').extract_first(
                product_selector.css('[class="productImg-wrap"] img::attr(data-ks-lazyload)').extract_first())
            # print(product_imgurl)
            product_title = product_selector.css('[class="productTitle"] a::attr(title)').extract_first(None)
            print(product_title)
            product_price = product_selector.css('[class="productPrice"] em::attr(title)').extract_first(None)
            # print(product_price)
            item = items.DemoItem()
            item['product_imgurl'] = product_imgurl
            item['product_title'] = product_title
            item['product_price'] = product_price
            yield item

    def err_case(self, res):
        print(res)
        print("error callback fired")
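Two of the string tricks in the spider above can be exercised without Scrapy: the totalPage URL rewrite, and the `extract_first(default)` fallback, which simply means "first match, or the given default". A standalone sketch (the URL query here is shortened for readability):

```python
# URL-rewriting trick from totalPage_parse above
start_url = 'https://list.tmall.com/search_product.htm?q=x&totalPage=1&jumpto=1'
totalPage = 3  # pretend this was read from the page

# Fix totalPage to its real value and turn jumpto into a format placeholder.
template = start_url.replace('&totalPage=1&jumpto=1',
                             '&totalPage=%s&jumpto={}' % totalPage)
page_urls = [template.format(i) for i in range(1, totalPage + 1)]

# extract_first(default) boils down to "first element or default"; the same
# pattern covers lazy-loaded images (src missing, data-ks-lazyload present).
def first_or_default(matches, default=None):
    return matches[0] if matches else default

# src list is empty, so the lazy-load value wins
img = first_or_default([], first_or_default(['lazy.jpg']))
```

With `totalPage = 3` this produces three page URLs ending in `jumpto=1` through `jumpto=3`, and `img` resolves to the lazy-load value.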
7. Items
import scrapy

class DemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    product_imgurl = scrapy.Field()  # one Field per value you extract
    product_title = scrapy.Field()
    product_price = scrapy.Field()
Note: uncomment ITEM_PIPELINES in settings.py; the lower the number, the higher the priority.
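The setting the note refers to looks like the following; the module path is an assumption based on the default layout generated by startproject (your project name may differ):

```python
# settings.py: lower number = higher priority (pipeline runs earlier)
ITEM_PIPELINES = {
    'demo.pipelines.DemoPipeline': 300,
}
```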
8. Item Pipeline
import pymongo

class DemoPipeline(object):
    def __init__(self, MONGO_INFO_DICT, MONGO_DB, MONGO_TABLE):
        self.MONGO_INFO_DICT = MONGO_INFO_DICT
        self.MONGO_DB = MONGO_DB
        self.MONGO_TABLE = MONGO_TABLE

    @classmethod
    def from_crawler(cls, crawler):
        """
        Scrapy first checks (via getattr) whether we defined from_crawler;
        if so, it calls it to build the instance instead of calling
        __init__ directly.
        """
        # crawler.settings is available here, so no import is needed
        MONGO_INFO_DICT = crawler.settings.get('MONGO_INFO_DICT')
        MONGO_DB = crawler.settings.get('MONGO_DB')
        MONGO_TABLE = crawler.settings.get('MONGO_TABLE')
        return cls(MONGO_INFO_DICT, MONGO_DB, MONGO_TABLE)

    def open_spider(self, spider):
        """Runs once when the spider starts."""
        print('spider started')
        self.client = pymongo.MongoClient(**self.MONGO_INFO_DICT)
        self.table = self.client[self.MONGO_DB][self.MONGO_TABLE]

    def close_spider(self, spider):
        """Runs once when the spider closes."""
        print('spider finished')
        self.client.close()

    def process_item(self, item, spider):
        product_dict = dict(item)
        self.table.insert_one(product_dict)  # insert() is deprecated in pymongo 3+
        print("%s written successfully" % product_dict)
        return item
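from_crawler above reads three custom settings; they are not built into Scrapy, so they must be defined in settings.py with exactly these names. The values below are assumptions for illustration:

```python
# settings.py: custom keys read by DemoPipeline.from_crawler
MONGO_INFO_DICT = {'host': 'localhost', 'port': 27017}  # kwargs for MongoClient
MONGO_DB = 'tmall'        # database name (assumed)
MONGO_TABLE = 'products'  # collection name (assumed)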
9. Downloader Middleware
from scrapy.exceptions import IgnoreRequest
from scrapy import signals
from scrapy.http import Response, Request
import requests

# the proxy-pool service comes from a separate project (see notes part 6)
def get_proxy():
    return requests.get("http://127.0.0.1:5010/get/").text

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))
class DemoSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
class DemoDownloaderMiddleware1(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        print(get_proxy())
        request.meta['proxy'] = 'http://' + get_proxy()
        request.meta['download_timeout'] = 10
        # print(request.meta['proxy'])
        # return Request(url='https://www.baidu.com')

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        print("response middleware 1")
        return response
        # return Request(url='https://www.baidu.com')

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        print("exception middleware 1")
        # drop the dead proxy, then retry the request with a fresh one
        proxy = request.meta.get('proxy').replace('http://', '')
        delete_proxy(proxy)
        request.meta['proxy'] = 'http://' + get_proxy()
        return request

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Notes:
1. Uncomment DOWNLOADER_MIDDLEWARES in settings.py; the lower the number, the higher the priority.
2. Also pay attention to the return values of process_request and process_response:
   a request exception is passed back starting from the last middleware and then handed to errback;
   a response exception goes straight to errback (process_exception is not called).
3. Proxy configuration
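Enabling the middleware from this section in settings.py might look like the following; the module path is an assumption based on the default project layout:

```python
# settings.py: lower number = higher priority (closer to the engine)
DOWNLOADER_MIDDLEWARES = {
    'demo.middlewares.DemoDownloaderMiddleware1': 543,
}
```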