Scrapy Depth and Priority
I. Depth
Set the maximum crawl depth in the configuration file settings.py:

```
DEPTH_LIMIT = 5
```
II. Priority
In the configuration file:

```
DEPTH_PRIORITY = 1
```
When DEPTH_PRIORITY is a positive number, the greater a request's depth, the lower its priority.
In the source code, the priority is adjusted as:

```
request.priority -= depth * self.prio
```
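A toy illustration (not Scrapy source) of what that adjustment does to a request's priority; here `prio` stands in for the DEPTH_PRIORITY setting and the starting priority is 0:

```python
def adjusted_priority(priority, depth, prio):
    # mirrors the source line: request.priority -= depth * self.prio
    return priority - depth * prio

for depth in range(4):
    print(depth, adjusted_priority(0, depth, prio=1))
# with a positive prio, deeper requests end up with lower (more negative)
# priority, so shallow pages are scheduled first (breadth-first tendency)
```

With a negative DEPTH_PRIORITY the sign flips, so deeper requests are favored instead (depth-first tendency).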
III. Source Code Analysis
1. Depth
```
import scrapy
from scrapy import Request


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        # response.request.meta = None
        # from scrapy.http import Response
        # response.request.meta -> response.meta
        from scrapy.spidermiddlewares.depth import DepthMiddleware
        print(response.request.url, response.meta.get('depth'))
        next_url = response.xpath('//div[@class="col-md-8"]/nav/ul[@class="pager"]'
                                  '/li[@class="next"]/a/@href').extract_first()
        # print(next_url)
        # join the relative URL onto the base URL
        _next = response.urljoin(next_url)
        # print(_next)
        # callback: the method that handles the response
        yield scrapy.Request(url=_next, callback=self.parse)
```
Premise: in Scrapy, a yielded Request object passes through the middlewares and then on to the scheduler...

If the yielded Request does not set meta, meta starts out empty (internally None until first access, when Scrapy lazily creates an empty dict).
In the parse method, response.request is the corresponding Request object, so response.request.meta is that same (initially empty) dict.
In scrapy.http.Response, response.meta is equivalent to response.request.meta, so response.meta is empty as well.
In the DepthMiddleware middleware, if 'depth' is not in response.meta, then response.meta['depth'] = 0.
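A minimal sketch (simplified from, not identical to, scrapy.http) of why response.meta and response.request.meta are equivalent: meta on the response is just a property that delegates to the attached request's dict.

```python
class Request:
    def __init__(self, url, meta=None):
        self.url = url
        # lazily default to an empty dict when no meta was passed
        self.meta = meta if meta is not None else {}


class Response:
    def __init__(self, request):
        self.request = request

    @property
    def meta(self):
        # delegates straight to the request that produced this response
        return self.request.meta
```

So when DepthMiddleware later writes response.meta['depth'] = 0, it is really seeding the request's meta dict.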
```
# result is the iterable holding the Request objects; it is filtered by _filter:
#   True  -> the request goes on to the scheduler
#   False -> the request is dropped
return (r for r in result or () if _filter(r))
```

When a request exceeds the maximum depth, _filter returns False. The maximum depth itself is read from the configuration file:

```
# set the maximum depth in the configuration file
maxdepth = settings.getint('DEPTH_LIMIT')
```
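The pieces above can be put together in a stripped-down sketch of DepthMiddleware.process_spider_output (the names follow the Scrapy source, but this class is an illustration, not the real middleware):

```python
class DepthMiddlewareSketch:
    def __init__(self, maxdepth, prio=1):
        # in Scrapy these come from the settings:
        #   maxdepth = settings.getint('DEPTH_LIMIT')
        #   prio     = settings.getint('DEPTH_PRIORITY')
        self.maxdepth = maxdepth
        self.prio = prio

    def process_spider_output(self, response, result):
        def _filter(request):
            # a child request is one level deeper than its response
            depth = response.meta['depth'] + 1
            request.meta['depth'] = depth
            if self.prio:
                # positive prio: deeper requests get a lower priority
                request.priority -= depth * self.prio
            if self.maxdepth and depth > self.maxdepth:
                return False  # over the limit: drop the request
            return True

        # first response ever seen: seed its depth at 0
        if 'depth' not in response.meta:
            response.meta['depth'] = 0
        return (r for r in result or () if _filter(r))
```

Any object with meta and priority attributes can stand in for a Request here, which makes the depth bookkeeping easy to trace by hand.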
2. Priority
To be continued...