Scrapy Deduplication
I. Built-in
1. Module
from scrapy.dupefilters import RFPDupeFilter
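2. RFPDupeFilter methods
a. request_seen
Core idea: every time the spider yields a Request object, request_seen is called once for it (unless the request sets dont_filter=True).
Purpose: deduplication, so that the same URL is visited only once.
Implementation: the request is reduced to a fixed-length, unique fingerprint. If that fingerprint is already in the set, the method returns True to indicate the request has been seen; otherwise the fingerprint is added to the set.
1) request_fingerprint
Purpose: turns a request (its URL) into a fixed-length, unique value. A plain md5 of the URL string would give different values for the two URLs below, whereas request_fingerprint treats them as the same request because they differ only in the order of their query parameters.
Note: request_fingerprint() accepts only a Request object, not a URL string.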
from scrapy.utils.request import request_fingerprint
from scrapy.http import Request

# Same query parameters, different order
url1 = 'https://test.com/?a=1&b=2'
url2 = 'https://test.com/?b=2&a=1'
request1 = Request(url=url1)
request2 = Request(url=url2)

# request_fingerprint() accepts only Request objects
rfp1 = request_fingerprint(request=request1)
rfp2 = request_fingerprint(request=request2)
print(rfp1)
print(rfp2)

if rfp1 == rfp2:
    print('same URL')
else:
    print('different URL')
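To see why a plain md5 of the URL string is not enough, the sketch below hashes the same two URLs both ways using only the standard library. `raw_md5` and `canonical_fingerprint` are hypothetical helpers written for this illustration. Sorting the query parameters only approximates the idea and is not Scrapy's actual implementation, which also takes the HTTP method and request body into account.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def raw_md5(url: str) -> str:
    """Hash the URL string exactly as written."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def canonical_fingerprint(url: str) -> str:
    """Hypothetical helper: sort the query parameters, then hash."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


url1 = "https://test.com/?a=1&b=2"
url2 = "https://test.com/?b=2&a=1"

print(raw_md5(url1) == raw_md5(url2))  # False
print(canonical_fingerprint(url1) == canonical_fingerprint(url2))  # True
```

The raw hash sees two different strings, while the canonical form of both URLs is identical, so they collapse to one fingerprint.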
2) request_seen
def request_seen(self, request):
    # request_fingerprint turns the request (URL) into a unique, fixed-length value
    fp = self.request_fingerprint(request)
    if fp in self.fingerprints:
        return True  # already seen once
    self.fingerprints.add(fp)
    return False  # first time this request is seen
b. open
Defined in the parent class BaseDupeFilter; called when the spider starts.
def open(self):
    # spider starts
    pass
c. close
Called when the spider finishes.
def close(self, reason):
    # runs when the spider is closed
    pass
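One possible use of the open and close hooks is to keep fingerprints across runs: load them when the crawl starts and release the file when it ends. The sketch below shows that idea with the standard library only. `FileBackedSeenSet`, its `add` method, and the file name are assumptions made for this illustration, not part of Scrapy.

```python
import os
import tempfile


class FileBackedSeenSet:
    """Hypothetical sketch, not a Scrapy class."""

    def __init__(self, path: str):
        self.path = path
        self.fingerprints = set()
        self.file = None

    def open(self):
        # Load fingerprints saved by an earlier run, then append new ones.
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as f:
                self.fingerprints.update(line.strip() for line in f if line.strip())
        self.file = open(self.path, "a", encoding="utf-8")

    def add(self, fp: str):
        # Record a fingerprint in memory and on disk, once.
        if fp not in self.fingerprints:
            self.fingerprints.add(fp)
            self.file.write(fp + "\n")

    def close(self, reason: str):
        # Flush and release the file when the crawl ends.
        if self.file is not None:
            self.file.close()
            self.file = None


path = os.path.join(tempfile.mkdtemp(), "requests.seen")

first_run = FileBackedSeenSet(path)
first_run.open()
first_run.add("abc123")
first_run.close("finished")

second_run = FileBackedSeenSet(path)
second_run.open()
print("abc123" in second_run.fingerprints)  # True
second_run.close("finished")
```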
d. log
Logs a request that was filtered out as a duplicate.
def log(self, request, spider):
    # write the log entry
    pass
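As a rough illustration of what a log hook might do, the sketch below reports only the first duplicate unless a debug flag is set, and counts every filtered request. `LoggingDupeFilter`, the logger name, and the `filtered` counter are assumptions for this example. It is modelled loosely on the built-in filter's behaviour and is not its actual code.

```python
import logging

logger = logging.getLogger("dupefilter_sketch")


class LoggingDupeFilter:
    """Hypothetical sketch of a log() hook; not Scrapy's implementation."""

    def __init__(self, debug: bool = False):
        self.debug = debug      # True: log every filtered request
        self.logdupes = True    # set to False after the first message
        self.filtered = 0       # running count of filtered requests

    def log(self, request, spider):
        if self.debug:
            logger.debug("Filtered duplicate request: %s", request)
        elif self.logdupes:
            logger.debug(
                "Filtered duplicate request: %s - no more duplicates will be shown",
                request,
            )
            self.logdupes = False
        self.filtered += 1


df = LoggingDupeFilter()
for url in ["https://test.com/?a=1", "https://test.com/?a=1"]:
    df.log(url, spider=None)
print(df.filtered)  # 2
```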
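e. from_settings
How it works and what it is for: the same as from_crawler in pipelines. It is a class method that Scrapy calls to create the filter instance, passing in the project settings.
@classmethod
def from_settings(cls, settings):
    return cls()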
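The factory pattern above becomes more useful once the filter actually reads something from the settings. The sketch below is a minimal example under stated assumptions: `LimitedDupeFilter` and the setting name `MYDUPEFILTER_MAX_SIZE` are invented for illustration, and a plain dict stands in for Scrapy's settings object (both offer a `get(name, default)` method).

```python
class LimitedDupeFilter:
    """Hypothetical filter that reads one option from the settings."""

    def __init__(self, max_size: int = 0):
        self.max_size = max_size  # 0 means "no limit" in this sketch
        self.visited_fp = set()

    @classmethod
    def from_settings(cls, settings):
        # MYDUPEFILTER_MAX_SIZE is an invented setting name.
        return cls(max_size=int(settings.get("MYDUPEFILTER_MAX_SIZE", 0)))


# A plain dict stands in for the settings object here.
df = LimitedDupeFilter.from_settings({"MYDUPEFILTER_MAX_SIZE": "1000"})
print(df.max_size)  # 1000
```

Keeping configuration lookups inside from_settings leaves `__init__` free of any dependency on the settings object, which makes the class easier to construct directly in tests.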
II. Custom
1. Configuration file (settings.py)
# Built-in
# DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
DUPEFILTER_CLASS = 'toscrapy.dupefilters.MyDupeFilter'
2. Custom deduplication class (inherits from BaseDupeFilter)
from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint


class MyDupeFilter(BaseDupeFilter):

    def __init__(self):
        self.visited_fp = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        # If the fingerprint is already in the set, return True (already visited).
        # Otherwise record it so the request is fetched only this once.
        fp = request_fingerprint(request)
        if fp in self.visited_fp:
            return True
        self.visited_fp.add(fp)
        return False

    def open(self):  # can return a deferred
        print('spider opened')

    def close(self, reason):  # can return a deferred
        print('spider closed')

    def log(self, request, spider):  # log that a request has been filtered
        pass
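3. Prerequisite
The spider yields Request objects:
yield scrapy.Request(url=_next, callback=self.parse, dont_filter=False)
dont_filter must not be True, otherwise the request bypasses the dedup filter. Its default value is False.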
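The effect of dont_filter can be sketched as a short-circuit in front of request_seen. The example below uses stand-ins so that it runs without Scrapy: `FakeRequest`, `SetDupeFilter`, and `should_enqueue` are hypothetical names, and the check only mirrors the general shape of the scheduler's decision rather than reproducing its code.

```python
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Stand-in for scrapy.Request with only the fields this sketch needs."""
    url: str
    dont_filter: bool = False


class SetDupeFilter:
    """Minimal filter that keys on the raw URL (a simplification)."""

    def __init__(self):
        self.visited = set()

    def request_seen(self, request) -> bool:
        if request.url in self.visited:
            return True
        self.visited.add(request.url)
        return False


def should_enqueue(request, dupefilter) -> bool:
    # dont_filter=True short-circuits, so request_seen is never called.
    if not request.dont_filter and dupefilter.request_seen(request):
        return False
    return True


df = SetDupeFilter()
print(should_enqueue(FakeRequest("https://test.com/"), df))  # True
print(should_enqueue(FakeRequest("https://test.com/"), df))  # False
print(should_enqueue(FakeRequest("https://test.com/", dont_filter=True), df))  # True
```

The third call is accepted even though the URL was already visited, which is why a custom filter appears to do nothing when requests are yielded with dont_filter=True.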