Using the scrapy-splash component
With scrapy-splash, Scrapy can simulate a browser and execute JavaScript, so it can handle pages whose content is loaded dynamically by JS.
How to use the scrapy-splash component:
1. Start the Splash service with Docker
- Install Docker and configure a registry mirror (image accelerator)
- Pull the Splash image:
docker pull scrapinghub/splash
- Run the Splash service:
docker run --name splash_server -d -p 8050:8050 scrapinghub/splash --max-timeout 3600
To avoid 504 status codes, the --max-timeout option is appended to raise the maximum allowed render time. In addition, parameters can be passed with each request, e.g. args={'wait': 5, 'viewport': '1024x2480', 'timeout': 90, 'images': 0, 'resource_timeout': 10}. wait can be set as needed but must not exceed 30; resource_timeout is the maximum timeout for a single network request made during rendering, and timeout is the overall render timeout. The service and these arguments can be tested directly against the Splash HTTP API, as sketched below.
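A quick sanity check of the Splash service and these arguments: a sketch that calls Splash's render.html HTTP endpoint directly (this assumes Splash is reachable at http://localhost:8050; use the Docker host's IP instead if needed).

import requests

resp = requests.get(
    'http://localhost:8050/render.html',
    params={
        'url': 'https://example.com',
        'wait': 5,                # extra seconds to wait for JS after the page loads
        'timeout': 90,            # overall render timeout, must not exceed --max-timeout
        'resource_timeout': 10,   # timeout for each individual request made by the page
        'images': 0,              # skip image downloads to speed up rendering
        'viewport': '1024x2480',
    },
)
print(resp.status_code, len(resp.text))  # expect 200 and the length of the rendered HTML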
2. Install the scrapy-splash package in Python
pip3 install scrapy-splash
3. Using a proxy with scrapy-splash
- middlewares.py
# Set a random User-Agent
class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        useragent = get_ua()  # self-written helper that returns a random User-Agent
        request.headers['User-Agent'] = useragent

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass


# Set a random proxy
class RandomProxyMiddleware:
    def process_request(self, request, spider):
        proxy = get_proxies()  # self-written helper that returns a proxy dict
        request.meta['splash']['args']['proxy'] = proxy['http']

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass
Setting the User-Agent in the middleware works exactly as in an ordinary Scrapy spider, but note how the proxy is set: it is no longer request.meta['proxy'] = proxyServer, but request.meta['splash']['args']['proxy'] = proxyServer.
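If the same project also sends plain scrapy.Request objects, request.meta['splash'] will not exist and the assignment above raises a KeyError. A defensive variant (a sketch; get_proxies() is the same self-written helper as above):

class RandomProxyMiddleware:
    def process_request(self, request, spider):
        proxy = get_proxies()  # self-written helper, e.g. {'http': 'http://ip:port'}
        splash_meta = request.meta.get('splash')
        if splash_meta is not None:
            # SplashRequest keeps its arguments under meta['splash']['args'];
            # Splash itself will fetch the page through this proxy.
            splash_meta.setdefault('args', {})['proxy'] = proxy['http']
        else:
            # Plain scrapy.Request: use the standard Scrapy proxy meta key.
            request.meta['proxy'] = proxy['http']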
4. Changes to the project's settings file and spider files
- settings.py
# Spider middlewares
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# URL of the rendering service (the IP address is the Docker host's IP)
SPLASH_URL = 'http://192.168.99.100:8050'

# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'Spider.middlewares.RandomUserAgentMiddleware': 820,  # custom downloader middlewares, ordered after the ones above
    'Spider.middlewares.RandomProxyMiddleware': 843,
}

# Dedup filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Splash-aware HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

ITEM_PIPELINES = {
    'Spider.pipelines.SpiderPipeline': 300,
}
The custom middlewares must be registered with priorities that place them after the scrapy_splash middlewares.
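Besides the required settings above, scrapy-splash also supports a few optional settings that help with debugging; a sketch with typical values (adjust as needed):

SPLASH_LOG_400 = True         # log 400 responses returned by Splash (usually caused by bad render arguments)
SPLASH_COOKIES_DEBUG = False  # set to True to log cookies handled by SplashCookiesMiddleware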
- Spider files
- scrapy.Spider
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://xxx.com']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                args={'wait': 5, 'viewport': '1024x2480', 'timeout': 90,
                                      'images': 0, 'resource_timeout': 10},
                                dont_filter=True)  # wait: extra seconds to let JS run after the page loads

    def parse(self, response):
        pass
The request object changes from scrapy.Request to scrapy_splash.SplashRequest.
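When a plain wait is not enough (for example the page needs scrolling or finer control over rendering), SplashRequest can also call Splash's execute endpoint with a Lua script. A minimal sketch that only loads the page, waits, and returns the HTML (the spider name here is made up for illustration):

import scrapy
from scrapy_splash import SplashRequest

# Lua script executed inside Splash; 'args' holds the values passed via the SplashRequest args dict
LUA_SCRIPT = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(args.wait)
    return splash:html()
end
"""


class LuaTestSpider(scrapy.Spider):
    name = 'lua_test'
    start_urls = ['https://xxx.com']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='execute',  # run the Lua script instead of the default render.html
                args={'lua_source': LUA_SCRIPT, 'wait': 5, 'timeout': 90},
            )

    def parse(self, response):
        pass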
- CrawlSpider
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashTextResponse, SplashJsonResponse, SplashResponse
from scrapy.http import HtmlResponse


class TestSpider(CrawlSpider):
    name = 'test'
    start_urls = ['https://xxx.com']
    rules = (
        Rule(LinkExtractor(allow=r'regex for the links to follow'), callback='parse_list_page', follow=True),
    )

    # Build the initial requests
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                args={'wait': 5, 'viewport': '1024x2480', 'timeout': 90,
                                      'images': 0, 'resource_timeout': 10},
                                dont_filter=True)  # wait: extra seconds to let JS run after the page loads

    # Override the CrawlSpider method so extracted links are fetched with SplashRequest
    def _build_request(self, rule_index, link):
        return SplashRequest(
            url=link.url,
            callback=self._callback,
            errback=self._errback,
            meta=dict(rule=rule_index, link_text=link.text),
            args={'wait': 5, 'viewport': '1024x2480', 'timeout': 90,
                  'images': 0, 'resource_timeout': 10}
        )

    # Override the CrawlSpider method so Splash responses are also followed
    def _requests_to_follow(self, response):
        if not isinstance(response, (SplashTextResponse, SplashJsonResponse, SplashResponse, HtmlResponse)):
            return
        seen = set()
        for rule_index, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            for link in rule.process_links(links):
                seen.add(link)
                request = self._build_request(rule_index, link)
                yield rule.process_request(request, response)

    def parse_list_page(self, response):
        pass
The request object again changes from scrapy.Request to scrapy_splash.SplashRequest, and in addition the parent class's _build_request() and _requests_to_follow() methods must be overridden.
5. Example
Douban Movies: https://movie.douban.com/typerank?type_name=%E5%89%A7%E6%83%85&type=11&interval_id=100:90
- The movie information on the page is generated dynamically by JS:
- Source preview:
- Spider code:
import scrapy
from scrapy_splash import SplashRequest


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/typerank?type_name=%E5%89%A7%E6%83%85&type=11&interval_id=100:90']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, args={'wait': 10})

    def parse(self, response):
        movie_list = response.xpath('//div[@class="movie-list-panel pictext"]/div')
        for movie_item in movie_list:
            movie_name = movie_item.xpath('.//span[@class="movie-name-text"]/a/text()').extract_first()
            print(movie_name)
- Crawl results:
肖申克的救赎 霸王别姬 美丽人生 控方证人 伊丽莎白 阿甘正传 辛德勒的名单 背靠背,脸对脸 茶馆 控方证人 十二怒汉(电视版) 这个杀手不太冷 千与千寻 泰坦尼克号 忠犬八公的故事 十二怒汉 泰坦尼克号 3D版 灿烂人生 横空出世 高山下的花环
With the scrapy-splash component, the JS-rendered page source is retrieved successfully and the required data can be parsed out of it.
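To hand the results to the SpiderPipeline configured in ITEM_PIPELINES instead of printing them, parse() can yield items. A sketch of a replacement for DoubanSpider.parse above (the field names and the rating XPath are assumptions, not taken from the page):

def parse(self, response):
    movie_list = response.xpath('//div[@class="movie-list-panel pictext"]/div')
    for movie_item in movie_list:
        # Yield a plain dict; SpiderPipeline receives it as the item.
        yield {
            'name': movie_item.xpath('.//span[@class="movie-name-text"]/a/text()').extract_first(),
            # The class name below is an assumption about the page structure.
            'score': movie_item.xpath('.//span[@class="rating_num"]/text()').extract_first(),
        }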