Distributed Crawlers

Concept: the same spider program runs on multiple machines, so a website's data is crawled in a distributed fashion.

 

Why can't native Scrapy implement a distributed crawler on its own?

a. The scheduler cannot be shared across machines

b. The pipeline cannot be shared across machines

 

The scrapy-redis component

A set of components developed specifically for Scrapy; it lets Scrapy run in a distributed fashion.

Install: pip install scrapy-redis

 

The distributed crawling workflow:

1. Edit the redis configuration file (see the sketch after this list)

2. Comment out bind 127.0.0.1 so that other machines can connect

3. Set protected-mode no to turn off protected mode

4. Start the redis server based on that configuration file

5. After creating the scrapy project, create a spider file based on CrawlSpider

6. Import the RedisCrawlSpider class and change the spider so that it inherits from that class

7. Replace start_urls with redis_key = 'xxx'

8. In the settings file, configure the pipeline to be the one integrated in scrapy-redis

9. In the settings file, switch the scheduler to the one integrated in scrapy-redis

10. Run the spider: scrapy runspider xxx.py

11. From a redis client: lpush <name of the scheduler queue> "<starting url>"
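
A minimal sketch of steps 1-4, assuming a Linux-style setup where the configuration file is redis.conf in the current directory:

# in redis.conf (steps 2 and 3)
# bind 127.0.0.1              <- comment this line out so other machines can connect
protected-mode no             # turn off protected mode

# step 4: start the redis server based on that configuration file
redis-server ./redis.conf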

 

If the redis server is not on the local machine, add the following to settings.py:
REDIS_HOST = 'IP address of the redis server'
REDIS_PORT = 6379

 

Spider code (qiubai.py)

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from redisPro.items import RedisproItem

from scrapy_redis.spiders import RedisCrawlSpider


class QiubaiSpider(RedisCrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['https://www.qiushibaike.com/pic/']
    # start_urls = ['http://https://www.qiushibaike.com/pic//']

    # name of the scheduler queue
    redis_key = 'qiubaispider'  # plays the same role as start_urls
    link = LinkExtractor(allow=r'/pic/page/\d+')

    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            img_url = 'https:' + div.xpath('.//div[@class="thumb"]/a/img/@src').extract_first()
            item = RedisproItem()
            item['img_url'] = img_url

            yield item
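
With the spider above in place, steps 10 and 11 look roughly as follows; this is a sketch that assumes the spider file is saved as qiubai.py:

# on every worker machine, from the spiders directory of the project
scrapy runspider qiubai.py

# in a redis client (e.g. redis-cli), seed the scheduler queue;
# the queue name must match redis_key = 'qiubaispider'
lpush qiubaispider https://www.qiushibaike.com/pic/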

 

Item for storing the parsed page data (items.py)

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class RedisproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    img_url = scrapy.Field()

 

Pipeline (pipelines.py)

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class RedisproPipeline(object):
    # default project pipeline; it is not enabled in settings.py below,
    # since the RedisPipeline that ships with scrapy-redis is used instead
    def process_item(self, item, spider):
        return item
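
Because the pipeline enabled in the settings below is scrapy_redis.pipelines.RedisPipeline, items are not handled locally but serialized and pushed into a redis list. A minimal sketch of reading them back with the redis-py client, assuming the default key pattern '<spider name>:items' used by scrapy-redis:

# read_items.py - illustrative helper, not part of the project itself
import json
import redis

r = redis.StrictRedis(host='172.20.10.9', port=6379)

# scrapy-redis stores each item as a JSON string in the list 'qiubai:items'
for raw in r.lrange('qiubai:items', 0, -1):
    item = json.loads(raw)
    print(item['img_url'])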

 

Settings (settings.py)

BOT_NAME = 'redisPro'

SPIDER_MODULES = ['redisPro.spiders']
NEWSPIDER_MODULE = 'redisPro.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'redisPro (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    # 'redisPro.pipelines.RedisproPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 300,
}


# use the deduplication filter provided by the scrapy-redis component
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# use the scheduler provided by the scrapy-redis component
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# keep the redis queues when a spider stops, so a crawl can be paused and resumed
SCHEDULER_PERSIST = True

# if the redis server is not on the local machine, configure it as follows
REDIS_HOST = '172.20.10.9'  # the redis server where the data is stored
REDIS_PORT = 6379
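
Once a crawl is running, its state can be checked from redis-cli; the key names below assume the scrapy-redis defaults, which prefix every key with the spider name:

redis-cli keys 'qiubai:*'             # e.g. qiubai:requests, qiubai:dupefilter, qiubai:items
redis-cli llen qiubai:items           # number of items stored so far
redis-cli lrange qiubai:items 0 0     # inspect one stored item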