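Distributed crawling with Redis
Concept: the same spider program runs on several machines at once, and together they crawl a website's data.
Plain Scrapy cannot run as a distributed crawler, for two reasons:
- The scheduler cannot be shared: each Scrapy process keeps its own request queue and duplicate filter.
- The pipeline cannot be shared: each Scrapy process stores its items on its own.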
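To make the first obstacle concrete, here is a minimal in-memory sketch of what a shared scheduler has to provide: a single request queue and a single set of already-seen URLs used by every worker. The class SharedScheduler, the function run_workers, the worker names and the link map are all hypothetical; scrapy-redis keeps the equivalent structures in Redis so that separate machines can reach them.

```python
from collections import deque


class SharedScheduler:
    """Hypothetical stand-in for the queue and dedup set that all workers share."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        # Schedule a URL only the first time any worker reports it
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_request(self):
        return self.queue.popleft() if self.queue else None


def run_workers(scheduler, worker_names, links):
    """Let the workers take turns pulling from the one shared queue.

    `links` maps a URL to the URLs found on that page (made-up data).
    """
    crawled = []
    turn = 0
    while True:
        url = scheduler.next_request()
        if url is None:
            break
        worker = worker_names[turn % len(worker_names)]
        turn += 1
        crawled.append((worker, url))
        for link in links.get(url, []):
            scheduler.enqueue(link)
    return crawled


scheduler = SharedScheduler()
scheduler.enqueue("/pic/")
links = {
    "/pic/": ["/pic/page/2", "/pic/page/3"],
    "/pic/page/2": ["/pic/page/3", "/pic/"],
}
result = run_workers(scheduler, ["worker-a", "worker-b"], links)
for worker, url in result:
    print(worker, url)
```

Because both workers consult the same seen set, /pic/page/3 is crawled once even though two pages link to it. With one private queue per process, each process would crawl it again.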
scrapy-redis: a set of components built specifically for Scrapy that lets it crawl in a distributed fashion. Install it with: pip install scrapy-redis
Distributed crawling workflow:
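1 Edit the Redis configuration file
- Comment out the line bind 127.0.0.1
- Change protected-mode yes to protected-mode no to turn off protected mode
2 Start the Redis server with that configuration file
3 Create the Scrapy project, then create a CrawlSpider-based spider file
4 Import the RedisCrawlSpider class: from scrapy_redis.spiders import RedisCrawlSpider
5 Replace start_urls with redis_key = 'xxx'
6 Write the parsing code
7 Configure the project's pipeline and scheduler to use the ones provided by scrapy-redis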
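ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
# Use the scrapy-redis duplicate filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scrapy-redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Keep the request queue and dedup records in Redis when the spider stops, so a crawl can be paused and resumed
SCHEDULER_PERSIST = True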
8 Configure the Redis server address and port
# Needed when the Redis server is not on the local machine
REDIS_HOST = '192.168.0.108'
REDIS_PORT = 6379
REDIS_PARAMS = {"password": 123456}
9 Run the spider file
scrapy runspider qiubai.py
10 Push a start URL into the scheduler queue (run this in a Redis client): lpush <value of redis_key> <start URL>
lpush qiubaispider https://www.qiushibaike.com/pic/
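Step 10 can also be done from Python instead of the Redis command-line client. The sketch below is only an illustration: seed_start_url and FakeRedis are hypothetical names, and FakeRedis is a tiny in-memory stand-in so the example runs without a server.

```python
def seed_start_url(client, redis_key, start_url):
    """Push one start URL onto the list the spider reads; returns the new list length."""
    return client.lpush(redis_key, start_url)


class FakeRedis:
    """In-memory stand-in that implements only lpush (an assumption for this demo)."""

    def __init__(self):
        self.lists = {}

    def lpush(self, key, *values):
        items = self.lists.setdefault(key, [])
        for value in values:
            items.insert(0, value)  # LPUSH adds at the head of the list
        return len(items)


client = FakeRedis()
length = seed_start_url(client, "qiubaispider", "https://www.qiushibaike.com/pic/")
print(length)
```

Against a real server, the same function should work with a redis-py client, for example redis.Redis(host='192.168.0.108', port=6379, password='123456'), in place of FakeRedis.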
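Implementation code
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider

# Assumed module path; adjust it to the items module of your own project
from redisPro.items import RedisproItem


class QiubaiSpider(RedisCrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['www.qiushibaike.com/pic']
    # start_urls = ['http://www.qiushibaike.com/pic/']
    # Name of the Redis list that start URLs are read from; it plays the role of start_urls
    redis_key = 'qiubaispider'

    link = LinkExtractor(allow=r'/pic/page/\d+')
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print('Start crawling')
        div_list = response.xpath('//*[@id="content-left"]/div')
        for div in div_list:
            print(div)
            src = div.xpath('.//div[@class="thumb"]/a/img/@src').extract_first()
            if src is None:
                # Skip entries that have no image
                continue
            # urljoin copes with absolute, relative and protocol-relative (//host/path) values
            img_url = response.urljoin(src)
            item = RedisproItem()
            item['img_url'] = img_url
            yield item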
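To see which pagination links the rule above would follow, the allow pattern can be tried on its own with the re module. This is only a sketch of the pattern matching, not of LinkExtractor itself. The helper is_followed and the candidate URLs are made up for illustration.

```python
import re

# Same pattern as the allow argument in the spider
ALLOW = re.compile(r'/pic/page/\d+')


def is_followed(url):
    """Return True when the pattern occurs anywhere in the URL."""
    return ALLOW.search(url) is not None


# Hypothetical URLs, used only to exercise the pattern
candidates = [
    "https://www.qiushibaike.com/pic/page/2/",
    "https://www.qiushibaike.com/pic/page/13/?s=1",
    "https://www.qiushibaike.com/pic/",
    "https://www.qiushibaike.com/text/page/2/",
]
followed = [url for url in candidates if is_followed(url)]
for url in followed:
    print(url)
```

Only the two /pic/page/<number> URLs match. The start page itself does not match the pattern, which is why it has to be pushed into Redis by hand.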
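Distributed crawler based on RedisSpider
Case requirement: crawl text-based news data from four sections (domestic, international, military, aviation).
- 1 Import the webdriver class in the spider file
- 2 Instantiate the browser in the spider class's constructor
- 3 Close the browser in the spider class's closed method
- 4 Drive the browser automation from the downloader middleware's process_response method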
wangyi.py:
# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver
from wanyiPro.items import WanyiproItem
from scrapy_redis.spiders import RedisSpider


class WangyiSpider(RedisSpider):
    name = 'wangyi'
    # allowed_domains = ['news.163.com']
    # start_urls = ['https://news.163.com/']
    redis_key = "wangyi"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Instantiate a browser object. The raw string keeps the backslashes of the Windows path literal.
        # executable_path is the Selenium 3 style; Selenium 4 passes the driver path through a Service object.
        self.bro = webdriver.Chrome(executable_path=r'G:\myprogram\路飞学城\第七模块\wanyiPro\chromedriver.exe')

    # The browser must be closed only after the whole crawl has finished
    def closed(self, reason):
        print('Spider closed')
        self.bro.quit()

    def parse(self, response):
        lis = response.xpath('//div[@class="ns_area list"]/ul/li')
        indexs = [3, 4, 6, 7]
        li_list = []  # li elements of the four sections: domestic, international, military, aviation
        for index in indexs:
            li_list.append(lis[index])
        # Get the link and the title text of each of the four sections
        for li in li_list:
            url = li.xpath('./a/@href').extract_first()
            title = li.xpath('./a/text()').extract_first()
            # print(url+":"+title)
            # Request each section URL to get its page data (headline, thumbnail, keywords, publish time, url)
            yield scrapy.Request(url=url, callback=self.parseSecond, meta={'title': title})

    def parseSecond(self, response):
        div_list = response.xpath('//div[@class="data_row news_article clearfix "]')
        for div in div_list:
            head = div.xpath('.//div[@class="news_title"]/h3/a/text()').extract_first()
            url = div.xpath('.//div[@class="news_title"]/h3/a/@href').extract_first()
            img_url = div.xpath('./a/img/@src').extract_first()
            tag_list = div.xpath('.//div[@class="news_tag"]//text()').extract()
            tags = []
            for t in tag_list:
                t = t.strip('\n \t')
                tags.append(t)
            tag = "".join(tags)
            # Read the section title passed along through meta
            title = response.meta['title']
            # format() avoids a TypeError when one of the values is None
            print("{}:{}:{}".format(head, url, img_url))
            # Instantiate the item and store the parsed values in it
            item = WanyiproItem()
            item['head'] = head
            item['url'] = url
            item['imgUrl'] = img_url
            item['tag'] = tag
            item['title'] = title
            # Request the article URL to parse the full news text
            yield scrapy.Request(url=url, callback=self.getContent, meta={'item': item})

    def getContent(self, response):
        # Get the item passed along through meta
        item = response.meta['item']
        # Parse the news text stored in the current page
        content_list = response.xpath('//div[@class="post_text"]/p/text()').extract()
        content = "".join(content_list)
        item['content'] = content
        yield item
middlewares.py:
import time

from scrapy import signals
from scrapy.http import HtmlResponse


class WanyiproDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Intercepts the response object that the downloader passes to the spider
        # request: the request that produced this response
        # response: the intercepted response object
        # spider: the instance of the spider class from the spider file
        print(request.url + " passed through the downloader middleware")
        # Replace the page data stored in the response for the four section pages
        if request.url in ['http://news.163.com/domestic/', 'http://news.163.com/world/',
                           'http://war.163.com/', 'http://news.163.com/air/']:
            spider.bro.get(url=request.url)
            js = 'window.scrollTo(0,document.body.scrollHeight)'
            spider.bro.execute_script(js)
            time.sleep(2)  # Give the browser some time to load the data
            # This page source includes the dynamically loaded news data
            page_text = spider.bro.page_source
            return HtmlResponse(url=spider.bro.current_url, body=page_text,
                                encoding='utf-8', request=request)
        else:
            return response
User-Agent pool and proxy IP pool:
from scrapy import signals
from scrapy.http import HtmlResponse
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random

user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
    "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
    "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
    "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
    "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
    "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]


# UA pool: a dedicated downloader middleware class built on UserAgentMiddleware
class RandomUserAgent(UserAgentMiddleware):
    def process_request(self, request, spider):
        # Pick a random UA value from the list
        ua = random.choice(user_agent_list)
        # Write it into the intercepted request as its User-Agent header
        request.headers.setdefault('User-Agent', ua)


# Proxy IPs available for selection
PROXY_http = [
    '153.180.102.104:80',
    '195.208.131.189:56055',
]
PROXY_https = [
    '120.83.49.90:9000',
    '95.189.112.214:35508',
]


# Swap the IP of every intercepted request
class Proxy(object):
    def process_request(self, request, spider):
        # Check the scheme of the intercepted request URL (http or https)
        # request.url looks like: http://www.xxx.com
        h = request.url.split(':')[0]  # scheme of the request
        if h == 'https':
            ip = random.choice(PROXY_https)
            request.meta['proxy'] = 'https://' + ip
        else:
            ip = random.choice(PROXY_http)
            request.meta['proxy'] = 'http://' + ip
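The scheme check in the Proxy middleware can be pulled out into a small function that is easy to try outside Scrapy. The sketch below is one possible variant, not the code above. It parses the URL with urllib.parse instead of splitting on ':', and the name choose_proxy and its rng parameter are hypothetical.

```python
import random
from urllib.parse import urlparse

# Same sample proxies as in the middleware
PROXY_http = ['153.180.102.104:80', '195.208.131.189:56055']
PROXY_https = ['120.83.49.90:9000', '95.189.112.214:35508']


def choose_proxy(url, rng=random):
    """Return a proxy URL whose scheme matches the scheme of the request URL."""
    scheme = 'https' if urlparse(url).scheme == 'https' else 'http'
    pool = PROXY_https if scheme == 'https' else PROXY_http
    return '{}://{}'.format(scheme, rng.choice(pool))


rng = random.Random(0)  # seeded generator so repeated runs pick the same proxies
https_proxy = choose_proxy('https://news.163.com/', rng)
http_proxy = choose_proxy('http://war.163.com/', rng)
print(https_proxy)
print(http_proxy)
```

Inside the middleware, the returned value would be assigned to request.meta['proxy'] exactly as before.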
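Steps for building a distributed crawler on RedisSpider
1 Import: from scrapy_redis.spiders import RedisSpider
2 Change the spider class's parent class to RedisSpider
3 Comment out the start URL list and add a redis_key attribute (the name of the scheduler queue)
4 Edit the Redis configuration file:
- Comment out the line bind 127.0.0.1
- Change protected-mode yes to protected-mode no to turn off protected mode
5 Configure Redis in settings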
REDIS_HOST = '192.168.0.108'
REDIS_PORT = 6379
REDIS_PARAMS = {"password": 123456}
# Use the scrapy-redis duplicate filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scrapy-redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Keep the request queue and dedup records in Redis when the spider stops, so a crawl can be paused and resumed
SCHEDULER_PERSIST = True
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
6 Run the spider file
scrapy runspider wangyi.py
7 Push a start URL into the scheduler queue
lpush wangyi https://news.163.com/
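Once the workers are running, the items written by RedisPipeline can be read back from Redis. The sketch below assumes the scrapy-redis defaults: items are serialized as JSON and appended to a list named <spider name>:items. If your settings override the items key or the serializer, adjust accordingly. The names read_items and FakeRedis are hypothetical, and FakeRedis is an in-memory stand-in so the example runs without a server.

```python
import json


def read_items(client, spider_name, start=0, end=-1):
    """Decode the JSON items stored under '<spider_name>:items'."""
    key = '{}:items'.format(spider_name)
    return [json.loads(raw) for raw in client.lrange(key, start, end)]


class FakeRedis:
    """In-memory stand-in that implements only lrange (an assumption for this demo)."""

    def __init__(self, lists):
        self.lists = lists

    def lrange(self, key, start, end):
        items = self.lists.get(key, [])
        # LRANGE treats `end` as inclusive; a negative end counts from the tail.
        # Negative start values are not handled in this simplified stand-in.
        if end < 0:
            end = len(items) + end
        return items[start:end + 1]


# Made-up item data shaped like the fields used by the wangyi spider
client = FakeRedis({
    'wangyi:items': [
        b'{"title": "domestic", "head": "example headline 1"}',
        b'{"title": "world", "head": "example headline 2"}',
    ],
})
items = read_items(client, 'wangyi')
for item in items:
    print(item['title'], item['head'])
```

With a real server, a redis-py client created from the REDIS_HOST, REDIS_PORT and password shown in the settings can be passed in place of FakeRedis.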