Distributed crawling: Scrapy itself is not designed as a distributed crawling framework, but the third-party library scrapy-redis extends it with distributed-crawling support; together they form a distributed Scrapy crawler framework. In a distributed setup, some communication mechanism must coordinate the individual crawlers so that each one knows:
1. What its current crawl task is, i.e. download + extract data (task assignment)
2. Whether the current task has already been handled by another crawler (deduplication)
3. Where to store the scraped data (data storage)
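Before looking at the code, the three roles above can be sketched without Scrapy at all: a shared task queue, a seen-set, and an item store. The in-process deque and set below are hypothetical stand-ins for the Redis structures that scrapy-redis actually shares between hosts:

```python
from collections import deque

# Stand-ins for the shared Redis structures: a request queue (task
# assignment), a fingerprint set (dedup), and an item store (storage).
task_queue = deque(["http://example.com/page1", "http://example.com/page1",
                    "http://example.com/page2"])
seen = set()
stored_items = []

def crawl(url):
    # hypothetical "download + extract" step
    return {"url": url, "title": "title of %s" % url}

while task_queue:
    url = task_queue.popleft()       # 1. take the next crawl task
    if url in seen:                  # 2. skip tasks another worker already did
        continue
    seen.add(url)
    stored_items.append(crawl(url))  # 3. store the scraped data

print(len(stored_items))  # 2: the repeated page1 task was deduplicated
```

In the real framework, the queue and the seen-set live in Redis, so every crawler process on every host sees the same state.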
Preparation: install Redis and learn its basics (http://www.runoob.com/redis/redis-keys.html)
As usual, a screenshot of the crawl results comes first. Now let's get moving! QAQ
Start crawling:
1. First, look at the overall file layout of the distributed crawler
Books
    Books
        spiders
            __init__.py
            books.py
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
    scrapy_redis  (third-party library: https://github.com/rmax/scrapy-redis)
        __init__.py
        connection.py
        defaults.py
        dupefilter.py
        picklecompat.py
        pipelines.py
        queue.py
        scheduler.py
        spiders.py
        utils.py
    scrapy.cfg
2. It looks complicated, but little changes from an ordinary Scrapy project: the scripts under scrapy_redis need no modification, you only call into them.
books.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from Books.items import BooksItem
from scrapy_redis.spiders import RedisSpider

#class BooksSpider(scrapy.Spider):
class BooksSpider(RedisSpider):  # the key step: inherit from RedisSpider
    name = 'books'
    #allowed_domains = ['books.toscrape.com']
    #start_urls = ['http://books.toscrape.com/']  # comment out: the start URL is pushed via redis-cli after the spiders are running

    def parse(self, response):
        sels = response.css('article.product_pod')
        for sel in sels:
            book = BooksItem()  # create a fresh item for each book
            book["name"] = sel.css('h3 a::attr(title)').extract()[0]
            book["price"] = sel.css('div.product_price p::text').extract()[0]
            yield book
        links = LinkExtractor(restrict_css='ul.pager li.next').extract_links(response)
        if links:  # the last page has no "next" link
            yield scrapy.Request(links[0].url, callback=self.parse)
3. pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exceptions import DropItem
from scrapy.item import Item
import pymongo
import redis

class BooksPipeline(object):
    def process_item(self, item, spider):
        return item

class PriceConverterPipeline(object):  # convert the extracted price to another currency
    exchange_rate = 8.5309

    def process_item(self, item, spider):
        price = float(item['price'][1:]) * self.exchange_rate
        item['price'] = '$%.2f' % price
        return item

class DuplicatesPipeline(object):  # filter out duplicate items
    def __init__(self):
        self.set = set()

    def process_item(self, item, spider):
        name = item["name"]
        if name in self.set:
            raise DropItem("Duplicate book found: %s" % item)
        self.set.add(name)
        return item

class MongoDBPipeline(object):  # store items in MongoDB
    @classmethod
    def from_crawler(cls, crawler):
        cls.DB_URL = crawler.settings.get("MONGO_DB_URL", 'mongodb://localhost:27017/')
        cls.DB_NAME = crawler.settings.get("MONGO_DB_NAME", 'scrapy_data')
        return cls()

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.DB_URL)
        self.db = self.client[self.DB_NAME]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        collection = self.db[spider.name]
        post = dict(item) if isinstance(item, Item) else item
        collection.insert_one(post)
        return item

class RedisPipeline:  # store items in Redis
    def open_spider(self, spider):
        db_host = spider.settings.get("REDIS_HOST", '10.240.176.134')
        #db_host = spider.settings.get("REDIS_HOST", 'localhost')
        db_port = spider.settings.get("REDIS_PORT", 6379)
        db_index = spider.settings.get("REDIS_DB_INDEX", 0)
        #db_passwd = spider.settings.get('REDIS_PASSWD', 'redisredis')
        #self.db_conn = redis.StrictRedis(host=db_host, port=db_port, db=db_index, password=db_passwd)
        self.db_conn = redis.StrictRedis(host=db_host, port=db_port, db=db_index)
        self.item_i = 0

    def close_spider(self, spider):
        self.db_conn.connection_pool.disconnect()

    def process_item(self, item, spider):
        self.insert_db(item)
        return item

    def insert_db(self, item):
        if isinstance(item, Item):
            item = dict(item)
        self.item_i += 1
        self.db_conn.hmset('books12:%s' % self.item_i, item)
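The two custom processing steps above are easy to check in isolation. This is a minimal sketch that reproduces the PriceConverterPipeline arithmetic and the DuplicatesPipeline set logic with plain dicts, using a stand-in DropItem exception so it runs without Scrapy installed:

```python
# Stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy.
class DropItem(Exception):
    pass

exchange_rate = 8.5309

def convert_price(item):
    # same arithmetic as PriceConverterPipeline: strip the currency sign,
    # multiply by the exchange rate, re-format with a dollar sign
    price = float(item['price'][1:]) * exchange_rate
    item['price'] = '$%.2f' % price
    return item

seen_names = set()

def drop_duplicates(item):
    # same logic as DuplicatesPipeline: one in-memory set of seen names
    if item['name'] in seen_names:
        raise DropItem("Duplicate book found: %s" % item)
    seen_names.add(item['name'])
    return item

first = drop_duplicates(convert_price({'name': 'A Light in the Attic',
                                       'price': '£51.77'}))
print(first['price'])  # $441.64  (51.77 * 8.5309 = 441.6447)
```

Note that DuplicatesPipeline's set lives in a single process: on a 3-host deployment each host only dedupes its own items. Cross-host deduplication of *requests* is what scrapy_redis's RFPDupeFilter handles.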
4.1 settings.py
(1) Adding a proxy: in middlewares.py
class BooksSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.
    def __init__(self, ip=''):
        self.ip = ip

    def process_request(self, request, spider):
        # route every request through the proxy
        request.meta['proxy'] = 'http://10.240.252.16:911'
(2) In settings.py:
DOWNLOADER_MIDDLEWARES = {
    #'Books.middlewares.BooksDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 543,
    'Books.middlewares.BooksSpiderMiddleware': 125,
}
ITEM_PIPELINES = {
    #'Books.pipelines.BooksPipeline': 300,
    'Books.pipelines.PriceConverterPipeline': 300,
    'Books.pipelines.DuplicatesPipeline': 350,
    #'Books.pipelines.MongoDBPipeline': 400,
    'Books.pipelines.RedisPipeline': 404,
}
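The numbers in these dicts are priorities: Scrapy passes each item through the enabled pipelines in ascending order of value, so the price is converted (300) before deduplication (350) and storage (404). A quick way to see the resulting order for the three custom pipelines:

```python
ITEM_PIPELINES = {
    'Books.pipelines.PriceConverterPipeline': 300,
    'Books.pipelines.DuplicatesPipeline': 350,
    'Books.pipelines.RedisPipeline': 404,
}

# Scrapy activates pipelines sorted by ascending priority (0-1000).
order = [name.rsplit('.', 1)[-1]
         for name, prio in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
print(order)  # ['PriceConverterPipeline', 'DuplicatesPipeline', 'RedisPipeline']
```

Putting DuplicatesPipeline before RedisPipeline matters: a dropped duplicate never reaches storage.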
4.2 Basic settings
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 3
COOKIES_ENABLED = False
4.3 Storing into MongoDB
MONGO_DB_URL = 'mongodb://localhost:27017/'
MONGO_DB_NAME = 'eilinge'
FEED_EXPORT_FIELDS = ['name', 'price']  # field order for exported feeds
4.4 Redis scheduling and storage
REDIS_HOST = '10.240.176.134'
#REDIS_HOST = 'localhost'
REDIS_PORT = 6379
REDIS_DB_INDEX = 0
#REDIS_PASSWD = 'redisredis'
REDIS_URL = 'redis://10.240.176.134:6379'  # the Redis database shared by all crawlers
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'  # replace Scrapy's scheduler with the scrapy_redis one (errors out on FreeBSD: it needs to bind core, but the core path differs there)
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # use scrapy_redis's RFPDupeFilter for request deduplication
SCHEDULER_PERSIST = True  # keep (instead of clearing) the Redis request queue and dedup set after the spider stops
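RFPDupeFilter decides whether a request has already been seen by hashing its canonical method, URL, and body into a "request fingerprint" and adding the digest to a set kept in Redis, shared by every host. A simplified stand-in using hashlib and a plain Python set (the real fingerprint also canonicalizes the URL and can include headers):

```python
import hashlib

seen_fingerprints = set()  # scrapy_redis keeps this set in Redis, shared by all hosts

def request_fingerprint(method, url, body=b''):
    # simplified fingerprint: SHA-1 over method + url + body
    h = hashlib.sha1()
    h.update(method.encode())
    h.update(url.encode())
    h.update(body)
    return h.hexdigest()

def request_seen(method, url, body=b''):
    # mirrors the dupefilter's request_seen: True means "duplicate, skip it"
    fp = request_fingerprint(method, url, body)
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False

print(request_seen('GET', 'http://books.toscrape.com/'))  # False: first sight
print(request_seen('GET', 'http://books.toscrape.com/'))  # True: duplicate
```

Because the set is shared, a page queued by one host is never re-downloaded by another.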
Points to note:
1. If you have 3 servers that can crawl simultaneously, use scp to copy the Books project directory to each of them.
2. Run the spider on all 3 hosts with the same command: scrapy crawl books
3. 2018-09-03 12:30:47 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-09-03 12:31:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ... the spiders stop here
After startup, the start-URL list and request queue in Redis are both empty, so all 3 crawlers pause and wait. Set the starting crawl point from a Redis client on any host:
redis-cli -h 10.240.176.134
10.240.176.134:6379>lpush books:start_urls "http://books.toscrape.com"
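books:start_urls is an ordinary Redis list: lpush adds the URL, and each idle RedisSpider pops an entry from it to resume crawling. The key name defaults to '<spider name>:start_urls', which is why name = 'books' in the spider matters. The list semantics, simulated with a deque standing in for the Redis list:

```python
from collections import deque

start_urls = deque()  # stand-in for the Redis list 'books:start_urls'
start_urls.appendleft("http://books.toscrape.com")  # like LPUSH

# an idle spider pops a start URL from the shared list and starts crawling
url = start_urls.popleft()
print(url)  # http://books.toscrape.com
```

Once one spider crawls this page, the links it extracts go into the shared request queue, and all 3 hosts start pulling work from it.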
Supplementary notes:
Redis configuration file redis.conf:
#bind 127.0.0.1
bind 0.0.0.0             # accept requests from any IP
#requirepass redisredis  # require password authentication for remote connections
Restarting the Redis service on different systems:
1. Ubuntu: sudo service redis-server restart
2. Linux (Fedora): service redis restart
3. FreeBSD: service redis onerestart