5. Scrapy: request meta passing - distributed crawling - incremental crawling
1. Passing data between requests (request meta)
Deep crawling: the data to be scraped is not all stored on a single page.
Create the project:
scrapy startproject moviePro
Create the spider file:
cd moviePro
scrapy genspider movie www.xxx.com
Run the project:
scrapy crawl movie
Configuration file settings.py
BOT_NAME = 'moviePro'
SPIDER_MODULES = ['moviePro.spiders']
NEWSPIDER_MODULE = 'moviePro.spiders'
# Request header (User-Agent)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Only log errors
LOG_LEVEL = 'ERROR'
# Enable the item pipeline
ITEM_PIPELINES = {
    'moviePro.pipelines.MovieproPipeline': 300,
}
items.py: define two fields
import scrapy
class MovieproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    actor = scrapy.Field()
spiders/movie.py
# 4567 movie site: https://www.4567kan.com/index.php/vod/show/id/1.html
import scrapy
from moviePro.items import MovieproItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567kan.com/index.php/vod/show/id/1.html']

    def parse(self, response):
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            title = li.xpath('./div/a/@title').extract_first()
            # Build the URL of the detail page
            detail_url = 'https://www.4567kan.com' + li.xpath('./div/a/@href').extract_first()
            item = MovieproItem()
            item['title'] = title
            # Send a GET request to the detail page
            # The meta dict is passed along to the callback
            yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        actor = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[3]/a/text()').extract_first()
        # Retrieve the item passed over via meta from the response
        item = response.meta['item']
        item['actor'] = actor
        yield item  # Hand the item over to the pipeline
pipelines.py
class MovieproPipeline:
    def process_item(self, item, spider):
        print(item)  # Print to verify that data is arriving
        return item
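The pipeline above only prints each item. A minimal sketch of an alternative version that persists items to a local text file instead (the file name movie.txt and the output format are assumptions, not part of the original project):

class MovieproPipeline:
    fp = None

    def open_spider(self, spider):
        # Runs once when the spider starts: open the output file (hypothetical name)
        self.fp = open('./movie.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write one line per movie: title and actor
        self.fp.write('{}:{}\n'.format(item['title'], item['actor']))
        return item

    def close_spider(self, spider):
        # Runs once when the spider closes: release the file handle
        self.fp.close()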
2. Middleware
- Middleware
  - Types of middleware:
    - spider middleware
    - downloader middleware (**)
  - Purpose: intercept requests and responses
  - Intercepting requests:
    - modify the request headers
    - apply a proxy
middlewares.py
from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class MiddleproDownloaderMiddleware:
    # Intercepts every request
    # request: the intercepted request object
    # spider: the spider instance defined in the spider file
    def process_request(self, request, spider):
        print('i am process_request')
        # Modify the request headers here, e.g.:
        # request.headers['User-Agent'] = 'xxx'
        return None

    # Intercepts every response
    # response: the intercepted response object
    # request: the request object that produced this response
    def process_response(self, request, response, spider):
        print('i am process_response')
        return response

    # Intercepts requests that raised an exception -- repair the request
    def process_exception(self, request, exception, spider):
        print('i am process_exception')
        # Proxy handling -- useful when the site has banned this machine's IP
        request.meta['proxy'] = 'https://ip:port'
        return request  # Resend the repaired request
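The downloader middleware only takes effect after it is registered in settings.py. Assuming the project is named middlePro (as the generated class name above suggests), the corresponding setting looks like this:

# settings.py -- enable the downloader middleware (543 is the priority from the project template)
DOWNLOADER_MIDDLEWARES = {
    'middlePro.middlewares.MiddleproDownloaderMiddleware': 543,
}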
3. Full-site crawling
- CrawlSpider (parent class)
1. Purpose: crawl data from an entire site
2. Usage:
- Create the project:
scrapy startproject crawlPro (project name)
- Create the spider file:
cd crawlPro
scrapy genspider -t crawl sun (spider file name) www.xx.com
- Run the project:
scrapy crawl sun
3. Link extractor
- Extracts links (URLs) according to the rule given in LinkExtractor(allow=regex)
4. Rule parser
- Parses the pages behind the extracted links with the specified callback function
spiders/sun.py
# Movie site: https://www.4567kan.com/index.php/vod/show/id/1.html
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SunSpider(CrawlSpider):
    name = 'sun'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567kan.com/index.php/vod/show/id/1.html']
    # Link extractor: extracts links (URLs) matching the given rule (allow=regex)
    link = LinkExtractor(allow=r'page/\d+\.html')
    rules = (
        # Instantiate a Rule (rule parser) object
        # It parses the pages behind the extracted links with the given callback
        # follow=False: only handle the links extracted from the start pages; follow=True: keep following links to reach every page
        Rule(link, callback='parse_item', follow=False),
    )

    # This method is called once for every link the extractor finds
    def parse_item(self, response):
        print(response)
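parse_item above only prints the response object. For reference, a minimal sketch of what it could do instead, reusing the list-page XPath from the first example (adjust the XPath if the page layout differs):

    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            # Title of each movie on the current list page
            title = li.xpath('./div/a/@title').extract_first()
            print(title)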
4. Distributed crawling
Concepts:
1. Concept: build a distributed cluster whose machines jointly crawl the same set of web resources.
2. Why can't native Scrapy be used for distributed crawling?
- the scheduler cannot be shared
- the pipeline cannot be shared
3. Solution:
- component: scrapy-redis
- what it does: provides a scheduler and a pipeline that can be shared
- install: pip install scrapy-redis
Implementation steps
0. Create the project
- Create the project:
scrapy startproject fbsPro (project name)
- Create the spider file:
cd fbsPro
scrapy genspider -t crawl fbs (spider file name) www.xx.com
1. Modify the spider file
- import: from scrapy_redis.spiders import RedisCrawlSpider
- change the spider's parent class to RedisCrawlSpider
- delete or comment out start_urls
- add a new attribute: redis_key = 'xxx' -- the name of the shared scheduler queue
- write the request and data-parsing logic
2. Modify settings.py
- register the scrapy_redis pipeline:
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}
- register the scheduler:
# Use the scrapy-redis dedup filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use scrapy-redis's own scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Allow the crawl to be paused and resumed
SCHEDULER_PERSIST = True
- point to the redis database
REDIS_HOST = 'IP address of the redis server'
REDIS_PORT = 6379
REDIS_ENCODING = 'utf-8'
REDIS_PARAMS = {'password': '123456'}
3. Configure redis
- locate the Redis-x64 folder -- C:\Users\Administrator\Downloads\Redis-x64-5.0.10
- configuration file: redis.windows.conf or redis.windows-service.conf
- comment out line 56 to disable the default binding (#bind 127.0.0.1)
- disable protected mode at line 75 (protected-mode no)
- start the redis server: redis-server.exe
- start the redis client: redis-cli.exe
4. Run the project:
- cd into the spiders directory
- scrapy runspider fbs.py (the spider file name)
5. Push a starting URL into the scheduler queue (the queue lives in redis):
Movie site: https://www.4567kan.com/index.php/vod/show/id/1.html
- redis-cli client: lpush fbsQueue followed by the site URL (see the Python sketch below)
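The same push can also be done from Python with redis-py; a minimal sketch, assuming the redis server configured in the settings below (192.168.19.47):

from redis import Redis

conn = Redis(host='192.168.19.47', port=6379)
# Push the starting URL into the shared scheduler queue named by redis_key
conn.lpush('fbsQueue', 'https://www.4567kan.com/index.php/vod/show/id/1.html')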
Code
settings.py
BOT_NAME = 'fbsPro'
SPIDER_MODULES = ['fbsPro.spiders']
NEWSPIDER_MODULE = 'fbsPro.spiders'
# Request header (User-Agent)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'  # Only log errors
# The shared pipeline provided by scrapy-redis
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
# The shared scheduler provided by scrapy-redis
# Use the scrapy-redis dedup filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use scrapy-redis's own scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Allow the crawl to be paused and resumed
SCHEDULER_PERSIST = True
# redis database connection
REDIS_HOST = '192.168.19.47'
REDIS_PORT = 6379
items.py
import scrapy
class FbsproItem(scrapy.Item):
    # define the fields for your item here like:
    # fields to fill in
    title = scrapy.Field()
    actor = scrapy.Field()
spiders/fbs.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider
from fbsPro.items import FbsproItem


class FbsSpider(RedisCrawlSpider):
    name = 'fbs'
    # allowed_domains = ['www.xxx.com']
    # start_urls = ['http://www.xxx.com/']
    # Name of the shared scheduler queue
    redis_key = 'fbsQueue'
    # Extract the pagination links
    link = LinkExtractor(allow=r'page/\d+\.html')
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            title = li.xpath('./div/a/@title').extract_first()
            # URL of the detail page
            detail_url = 'https://www.4567kan.com' + li.xpath('./div/a/@href').extract_first()
            item = FbsproItem()
            item['title'] = title
            yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        item = response.meta['item']
        actor = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[3]/a/text()').extract_first()
        item['actor'] = actor
        yield item
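Once items start flowing, scrapy_redis's RedisPipeline stores the serialized items in a redis list whose default key is '<spider name>:items', i.e. fbs:items here. A quick way to check from Python, assuming the same redis server as in settings.py:

from redis import Redis

conn = Redis(host='192.168.19.47', port=6379)
print(conn.llen('fbs:items'))           # number of items scraped so far
print(conn.lrange('fbs:items', 0, 1))   # peek at the first two serialized items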
5. Incremental crawling
Overview
Incremental crawling:
1. Purpose: monitor a site for newly added data.
2. Record table: stores an identifier (data fingerprint) for every piece of data already crawled; anything already recorded is not crawled again.
- What serves as the record table: a redis set (see the sketch below)
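A quick illustration of why a redis set works as the record table: sadd returns 1 when the member is new and 0 when it is already present (the URL below is just a made-up example):

from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)
# First insert: the URL is new, sadd returns 1 -> crawl it
print(conn.sadd('movie_urls', 'https://www.4567kan.com/example.html'))  # 1
# Second insert: the URL already exists, sadd returns 0 -> skip it
print(conn.sadd('movie_urls', 'https://www.4567kan.com/example.html'))  # 0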
0. Create the project
- Create the project:
scrapy startproject zlsPro (project name)
- Create the spider file:
cd zlsPro
scrapy genspider -t crawl zls (spider file name) www.xx.com
Code
settings.py
BOT_NAME = 'zlsPro'
SPIDER_MODULES = ['zlsPro.spiders']
NEWSPIDER_MODULE = 'zlsPro.spiders'
# Request header (User-Agent)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Only log errors
LOG_LEVEL = 'ERROR'
# Enable the item pipeline
ITEM_PIPELINES = {
    'zlsPro.pipelines.ZlsproPipeline': 300,
}
items.py
import scrapy
class ZlsproItem(scrapy.Item):
    # fields to fill in
    title = scrapy.Field()
    actor = scrapy.Field()
spiders/zls.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from zlsPro.items import ZlsproItem


class ZlsSpider(CrawlSpider):
    name = 'zls'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567kan.com/index.php/vod/show/id/1/page/1.html']
    # Connection to the redis database (the record table lives here)
    conn = Redis(host='127.0.0.1', port=6379)
    # Extract the pagination links
    link = LinkExtractor(allow=r'page/\d+\.html')
    rules = (
        # Rule parser
        Rule(link, callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            title = li.xpath('./div/a/@title').extract_first()
            detail_url = 'https://www.4567kan.com' + li.xpath('./div/a/@href').extract_first()
            item = ZlsproItem()
            item['title'] = title
            # Try to add the detail URL to the record table (a redis set)
            res = self.conn.sadd('movie_urls', detail_url)
            if res == 1:
                print('New data found, crawling it...')
                # Send the request to fetch the detail page
                yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})
            else:
                print('No new data to crawl.')

    def parse_detail(self, response):
        item = response.meta['item']
        actor = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[3]/a/text()').extract_first()
        item['actor'] = actor
        yield item
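settings.py above registers zlsPro.pipelines.ZlsproPipeline, but the pipeline file itself is not shown. A minimal sketch (storing items in a redis list named movieData is an assumption; printing or writing to a file would work just as well):

class ZlsproPipeline:
    def process_item(self, item, spider):
        # Reuse the redis connection created on the spider
        conn = spider.conn
        # lpush only accepts flat values, so serialize the item dict to a string
        conn.lpush('movieData', str(dict(item)))
        return item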