
5. Scrapy Request Passing - Distributed Crawling - Incremental Crawling

1. Request Passing

Deep crawling: the data to be scraped is not all stored on a single page.
Create the project:
    scrapy startproject moviePro
Create the spider file:
    cd moviePro
    scrapy genspider movie www.xxx.com
Run the project:
    scrapy crawl movie

Configuration file: settings.py

BOT_NAME = 'moviePro'

SPIDER_MODULES = ['moviePro.spiders']
NEWSPIDER_MODULE = 'moviePro.spiders'

# Request headers
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Only log errors
LOG_LEVEL = 'ERROR'
# Enable the item pipeline
ITEM_PIPELINES = {
   'moviePro.pipelines.MovieproPipeline': 300,
}

items.py: define two fields

import scrapy

class MovieproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    actor = scrapy.Field()

spiders/movie.py

# 4567 movie site: https://www.4567kan.com/index.php/vod/show/id/1.html

import scrapy
from moviePro.items import MovieproItem

class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567kan.com/index.php/vod/show/id/1.html']
    
    def parse(self, response):
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            title = li.xpath('./div/a/@title').extract_first()
            # build the detail page URL
            detail_url = 'https://www.4567kan.com'+li.xpath('./div/a/@href').extract_first()
            item = MovieproItem()
            item['title'] = title

            # send a GET request to the detail page URL
            # the meta dict will be passed along to the callback function
            yield scrapy.Request(detail_url,callback=self.parse_detail,meta={'item':item})
    
    def parse_detail(self,response):
        actor = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[3]/a/text()').extract_first()
        # retrieve the item passed in via response.meta
        item = response.meta['item']
        item['actor'] = actor

        yield item  # submit the item to the pipeline

pipelines.py

class MovieproPipeline:
    def process_item(self, item, spider):
        print(item)  # print to check whether the item arrived
        return item
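The pipeline above only prints the item. As a minimal sketch of actual persistence (not part of the original project; the filename movie.txt is an assumption), a file-based variant of the same pipeline could look like this:

# pipelines.py -- hypothetical file-persistence variant (movie.txt is an assumed filename)
class MovieproPipeline:
    fp = None

    def open_spider(self, spider):
        # called once when the spider starts: open the output file
        self.fp = open('movie.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write one line per movie: title and lead actor
        self.fp.write('{}: {}\n'.format(item['title'], item['actor']))
        return item

    def close_spider(self, spider):
        # called once when the spider closes: release the file handle
        self.fp.close()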

2. Middleware

- Middleware
    - Types of middleware:
        - Spider middleware
        - Downloader middleware (**)
    - Purpose: intercept requests and responses
    - Intercepting a request lets you:
        - modify request header information
        - apply a proxy

middlewares.py

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter

class MiddleproDownloaderMiddleware:
    # intercepts every request
    # request is the intercepted request object
    # spider is the spider instance defined in the spider file
    def process_request(self, request, spider):
        print('i am process_request')
        # modify request header information
        # request.headers['User-Agent'] = 'xxx'
        return None

    # intercepts responses
    # response: the intercepted response object
    # request: the request object that produced the intercepted response
    def process_response(self, request, response, spider):
        print('i am process_response')
        return response

    # intercepts request objects that raised an exception -- fix the request object here
    def process_exception(self, request, exception, spider):
        print('i am process_exception')
        
        # proxy handling -- useful when the site has banned this machine's IP
        request.meta['proxy'] = 'https://ip:port'
        return request  # re-send the corrected request
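Note that a downloader middleware only runs if it is enabled in settings.py. A minimal sketch, assuming the project is named middlePro (inferred from the class name above; 543 is the priority value from Scrapy's default template):

# settings.py -- enable the downloader middleware (project name middlePro is assumed)
DOWNLOADER_MIDDLEWARES = {
   'middlePro.middlewares.MiddleproDownloaderMiddleware': 543,
}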

3. Full-Site Data Crawling

- The CrawlSpider parent class
    1. Purpose: full-site data crawling
    2. Usage:
        - Create the project:
            scrapy startproject crawlPro    (crawlPro = project name)
        - Create the spider file:
            cd crawlPro
            scrapy genspider -t crawl sun www.xx.com    (sun = spider name)
        - Run the project:
            scrapy crawl sun
    3. Link extractor
        - Extracts links (URLs) according to the specified rule: LinkExtractor(allow=regex)
    4. Rule parser
        - Parses the page behind each extracted link with the specified callback function

spiders/sun.py

# Movie site: https://www.4567kan.com/index.php/vod/show/id/1.html
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SunSpider(CrawlSpider):
    name = 'sun'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567kan.com/index.php/vod/show/id/1.html']

    # link extractor: extracts links (URLs) according to the specified rule (allow=regex)
    link = LinkExtractor(allow=r'page/\d+\.html')

    rules = (
        # instantiate a Rule (rule parser) object
        # purpose: parse the page behind each link extracted by `link` with the specified callback function
        # follow=False parses only the pages extracted from the start page; follow=True keeps following links to reach every page
        Rule(link, callback='parse_item', follow=False),
    )
    # this method is called once for every link the extractor finds
    def parse_item(self, response):
        print(response)
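parse_item above only prints the response. As a hedged sketch (the detail-page regex below is an assumption about the site's URL structure, not taken from the original), a CrawlSpider can also register several rules at once, e.g. one that follows pagination links and one that parses detail pages:

# Hypothetical sketch: a CrawlSpider with two rules
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SunDetailSpider(CrawlSpider):
    name = 'sun_detail'
    start_urls = ['https://www.4567kan.com/index.php/vod/show/id/1.html']

    rules = (
        # follow pagination links without parsing them (no callback)
        Rule(LinkExtractor(allow=r'page/\d+\.html'), follow=True),
        # parse every detail page found on the listing pages (regex is assumed)
        Rule(LinkExtractor(allow=r'/vod/detail/id/\d+\.html'), callback='parse_detail', follow=False),
    )

    def parse_detail(self, response):
        # each matched detail page is handled here exactly once
        print(response.url)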

4. Distributed Crawling

Concept

1. Concept: build a distributed cluster whose machines jointly crawl the same web resource, each handling part of the work.
2. Why native Scrapy cannot do distributed crawling:
    - the scheduler cannot be shared
    - the pipeline cannot be shared
3. Solution:
    - component: scrapy-redis
    - purpose: provides a scheduler and a pipeline that can be shared
    - install: pip install scrapy-redis

Implementation workflow

0. Create the project
    - Create the project:
        scrapy startproject fbsPro    (fbsPro = project name)
    - Create the spider file:
        cd fbsPro
        scrapy genspider -t crawl fbs www.xx.com    (fbs = spider name)
1. Modify the spider file
    - Import: from scrapy_redis.spiders import RedisCrawlSpider
    - Change the spider class's parent to RedisCrawlSpider
    - Delete or comment out start_urls
    - Add a new attribute: redis_key = 'xxx' -- the name of the shared scheduler queue
    - Write the request and data-parsing logic
2. Modify settings.py
    - Specify the scrapy_redis pipeline:
        ITEM_PIPELINES = {
            'scrapy_redis.pipelines.RedisPipeline': 400
        }
    - Specify the scheduler:
        # use the scrapy-redis dedupe filter
        DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
        # use scrapy-redis's own scheduler
        SCHEDULER = "scrapy_redis.scheduler.Scheduler"
        # allow pausing and resuming (persist the queue)
        SCHEDULER_PERSIST = True
    - Specify the Redis database address:
        REDIS_HOST = 'IP address of the Redis service'
        REDIS_PORT = 6379
        REDIS_ENCODING = 'utf-8'
        REDIS_PARAMS = {'password': '123456'}
3. Configure Redis
    - Locate the Redis-x64 folder -- C:\Users\Administrator\Downloads\Redis-x64-5.0.10
    - Configuration file: redis.windows.conf or redis.windows-service.conf
    - Comment out line 56 to disable the default binding (#bind 127.0.0.1)
    - Disable protected mode on line 75 (protected-mode no)
    - Start the Redis server: redis-server.exe
    - Start the Redis client: redis-cli.exe

4. Run the project:
    - cd into the spiders directory
    - scrapy runspider fbs.py
5. Push a start URL into the scheduler queue (the queue lives in Redis):
    Movie site: https://www.4567kan.com/index.php/vod/show/id/1.html
    - In the redis-cli client: lpush fbsQueue www.xxx.com    (replace with the movie site URL above)
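Equivalently, the start URL can be pushed from Python with the redis client library instead of redis-cli (a sketch; the host below matches the example settings.py that follows):

# Hypothetical alternative to redis-cli: push the start URL with redis-py
from redis import Redis

conn = Redis(host='192.168.19.47', port=6379)
conn.lpush('fbsQueue', 'https://www.4567kan.com/index.php/vod/show/id/1.html')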

Code

settings.py

BOT_NAME = 'fbsPro'

SPIDER_MODULES = ['fbsPro.spiders']
NEWSPIDER_MODULE = 'fbsPro.spiders'

# Request headers
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'  # only log errors

# shared Redis pipeline
ITEM_PIPELINES = {
   'scrapy_redis.pipelines.RedisPipeline': 400,
}

# shared Redis scheduler
# use the scrapy-redis dedupe filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# use scrapy-redis's own scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# allow pausing and resuming (persist the queue)
SCHEDULER_PERSIST = True

# Redis database
REDIS_HOST = '192.168.19.47'
REDIS_PORT = 6379

items.py

import scrapy

class FbsproItem(scrapy.Item):
    # define the fields for your item here like:
    # add the fields
    title = scrapy.Field()
    actor = scrapy.Field()

spiders/fbs.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider
from fbsPro.items import FbsproItem

class FbsSpider(RedisCrawlSpider):
    name = 'fbs'
    # allowed_domains = ['www.xxx.com']
    # start_urls = ['http://www.xxx.com/']

    # name of the shared scheduler queue
    redis_key = 'fbsQueue'

    # extract pagination links
    link = LinkExtractor(allow=r'page/\d+\.html')
    rules = (
        Rule(link,callback='parse_item', follow=True),
    )

    def parse_item(self, response):
       li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
       for li in li_list:
           title = li.xpath('./div/a/@title').extract_first()
           # detail page URL
           detail_url = 'https://www.4567kan.com'+li.xpath('./div/a/@href').extract_first()
           item = FbsproItem()
           item['title'] = title
           yield scrapy.Request(detail_url,callback=self.parse_detail,meta={'item':item})
    
    def parse_detail(self,response):
        item = response.meta['item']
        actor = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[3]/a/text()').extract_first()
        item['actor'] = actor
        yield item
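Items collected by any node go through scrapy_redis.pipelines.RedisPipeline, which by default serializes them into a Redis list named '<spider name>:items'. A sketch for inspecting the collected results (assumes the default key name fbs:items and the Redis host from the settings.py above):

# Hypothetical check: read the items stored by the shared RedisPipeline
import json
from redis import Redis

conn = Redis(host='192.168.19.47', port=6379)
for raw in conn.lrange('fbs:items', 0, -1):  # default key: '<spider name>:items'
    print(json.loads(raw))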

5. Incremental Crawling

Overview

Incremental crawling:
    1. Purpose: monitor a website for newly added data.
    2. Record table: stores an identifier (data fingerprint) for every record already crawled, so that crawled data is never crawled again.
        - What serves as the record table: a Redis set

0. Create the project
    - Create the project:
        scrapy startproject zlsPro    (zlsPro = project name)
    - Create the spider file:
        cd zlsPro
        scrapy genspider -t crawl zls www.xx.com    (zls = spider name)
        

Code

settings.py

BOT_NAME = 'zlsPro'

SPIDER_MODULES = ['zlsPro.spiders']
NEWSPIDER_MODULE = 'zlsPro.spiders'

# Request headers
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# only log errors
LOG_LEVEL = 'ERROR'
# enable the item pipeline
ITEM_PIPELINES = {
   'zlsPro.pipelines.ZlsproPipeline': 300,
}

items.py

import scrapy

class ZlsproItem(scrapy.Item):
    # add the fields
    title = scrapy.Field()
    actor = scrapy.Field()

spiders/zls.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from zlsPro.items import ZlsproItem

class ZlsSpider(CrawlSpider):
    name = 'zls'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567kan.com/index.php/vod/show/id/1/page/1.html']

    # create a Redis database connection
    conn = Redis(host='127.0.0.1',port=6379)

    # extract pagination links
    link = LinkExtractor(allow=r'page/\d+\.html')

    rules = (
        # rule parser
        Rule(link, callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            title = li.xpath('./div/a/@title').extract_first()
            detail_url = 'https://www.4567kan.com' + li.xpath('./div/a/@href').extract_first()
            item = ZlsproItem()
            item['title'] = title

            # add this record to the record table
            res = self.conn.sadd('movie_urls', detail_url)
            if res == 1:
                print('New data found, crawling......')
                # send a request to fetch the detail data
                yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})
            else:
                print('No new data to crawl!')

    def parse_detail(self, response):
        item = response.meta['item']
        actor = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[3]/a/text()').extract_first()
        item['actor'] = actor
        yield item
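The spider above uses the detail URL itself as the fingerprint. When the URL alone is not a reliable marker of change (for example, the page content can change under the same URL), a hash of the parsed data can serve as the fingerprint instead. A minimal sketch of such a pipeline; the set name movie_data_fingerprints and the persistence step are assumptions, not taken from the original project:

# pipelines.py -- hypothetical variant: use a hash of the item data as the fingerprint
import hashlib

from redis import Redis


class ZlsproPipeline:
    conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        # build a data fingerprint from the parsed fields
        data = '{}{}'.format(item['title'], item['actor'])
        fingerprint = hashlib.sha256(data.encode('utf-8')).hexdigest()

        # sadd returns 1 only when the fingerprint was not in the set yet
        if self.conn.sadd('movie_data_fingerprints', fingerprint) == 1:
            print('New data, persisting:', item['title'])
            # persist the item here (file, database, ...)
        else:
            print('Duplicate data, skipping:', item['title'])
        return item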
