Using the scrapy-splash component

scrapy-splash lets Scrapy render pages through a headless browser (Splash) that executes JavaScript, so it can handle pages whose content is loaded dynamically by JS.

Steps to use the scrapy-splash component:

1. Start the Splash service with Docker

  • Install Docker and configure an image registry mirror (accelerator)

  • Pull the Splash image

    docker pull scrapinghub/splash
  • Run the Splash service

    docker run --name splash_server -d -p 8050:8050 scrapinghub/splash --max-timeout 3600

    The --max-timeout option is appended to raise the maximum allowed render timeout and avoid HTTP 504 responses. In addition, when sending a request you can pass args={'wait':5, 'viewport':'1024x2480', 'timeout':90, 'images':0, 'resource_timeout':10}. wait can be set as needed but should not exceed 30; resource_timeout is the maximum timeout for a single network request made while rendering, and timeout is the overall render timeout. (A quick way to test the service directly is sketched below.)
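
    Before wiring Splash into Scrapy, its render.html endpoint can be called directly as a sanity check. Below is a minimal sketch, assuming Splash listens on http://localhost:8050 and the requests package is installed; the target URL is just a placeholder:

    import requests

    # Ask Splash to render a JS-heavy page and return the final HTML.
    # The parameters mirror the args discussed above.
    resp = requests.get(
        'http://localhost:8050/render.html',
        params={
            'url': 'https://example.com',   # placeholder target page
            'wait': 5,                      # seconds to wait after the page loads
            'timeout': 90,                  # overall render timeout (capped by --max-timeout)
            'resource_timeout': 10,         # timeout for each request the page makes
            'images': 0,                    # skip image downloads
        },
        timeout=120,
    )
    print(resp.status_code, len(resp.text))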

2. Install the scrapy-splash package for Python

  • pip3 install scrapy-splash

3. Using a proxy with scrapy_splash

  • middlewares.py

    # Set a random User-Agent
    class RandomUserAgentMiddleware:
    
        def process_request(self, request, spider):
            useragent = get_ua()  # custom helper that returns a random User-Agent string
            request.headers['User-Agent'] = useragent
    
        def process_response(self, request, response, spider):
            return response
    
        def process_exception(self, request, exception, spider):
            pass
    
    
    # Set a random proxy
    class RandomProxyMiddleware:
    
        def process_request(self, request, spider):
            proxy = get_proxies()  # custom helper returning a dict such as {'http': 'http://ip:port'}
            request.meta['splash']['args']['proxy'] = proxy['http']
    
        def process_response(self, request, response, spider):
            return response
    
        def process_exception(self, request, exception, spider):
            pass

    Setting the User-Agent in the middleware works the same as in an ordinary Scrapy spider, but note how the proxy is set: it is no longer request.meta['proxy'] = proxyServer, but request.meta['splash']['args']['proxy'] = proxyServer. (A hypothetical sketch of the get_ua() and get_proxies() helpers follows.)
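
    get_ua() and get_proxies() above are custom helpers that are not shown in the post; the following is a purely hypothetical sketch of what they might look like, using hard-coded pools instead of a real User-Agent library or proxy service:

    import random

    # Hypothetical stand-ins for the helpers used in the middlewares above.
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15',
    ]
    PROXIES = ['http://127.0.0.1:8888', 'http://127.0.0.1:8889']  # placeholder proxy servers

    def get_ua():
        """Return a random User-Agent string."""
        return random.choice(USER_AGENTS)

    def get_proxies():
        """Return a dict with an 'http' key, matching how proxy['http'] is used above."""
        return {'http': random.choice(PROXIES)}

    Note that request.meta['splash'] only exists on requests created with SplashRequest, so RandomProxyMiddleware as written assumes every request goes through Splash.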

4. Modify the project's settings file and spider files

  • settings.py

    # Spider middlewares
    SPIDER_MIDDLEWARES = {
       'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    
    # URL of the Splash rendering service
    SPLASH_URL = 'http://192.168.99.100:8050'  # IP address of the Docker host machine
    
    # Downloader middlewares
    DOWNLOADER_MIDDLEWARES = {
       'scrapy_splash.SplashCookiesMiddleware': 723,
       'scrapy_splash.SplashMiddleware': 725,
       'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
       'Spider.middlewares.RandomUserAgentMiddleware': 820,  # custom downloader middlewares, priorities after the scrapy_splash ones
       'Spider.middlewares.RandomProxyMiddleware': 843
    }
    # Deduplication filter
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    
    # Use Splash-aware HTTP caching
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
    
    ITEM_PIPELINES = {
       'Spider.pipelines.SpiderPipeline': 300,
    }

    Custom middlewares must be registered with priority values after (greater than) the scrapy_splash ones.

  • Spider files

    • scrapy.Spider
      # -*- coding: utf-8 -*-
      import scrapy
      from scrapy_splash import SplashRequest
      
      class TestSpider(scrapy.Spider):
          name = 'test'
          start_urls = ['https://xxx.com']
      
          def start_requests(self):
              for url in self.start_urls:
                  yield SplashRequest(url=url,
                                      args={'wait': 5, 'viewport': '1024x2480', 'timeout': 90, 'images': 0,
                                            'resource_timeout': 10}, dont_filter=True)  # 'wait' is how long Splash waits after the page loads
      
          def parse(self, response):
              pass

      The request object changes from scrapy.Request to scrapy_splash.SplashRequest.

    • CrawlSpider
      # -*- coding: utf-8 -*-
      import scrapy
      from scrapy_splash import SplashRequest
      from scrapy.linkextractors import LinkExtractor
      from scrapy.spiders import CrawlSpider, Rule
      from scrapy_splash import SplashTextResponse, SplashJsonResponse, SplashResponse
      from scrapy.http import HtmlResponse
      
      class TestSpider(CrawlSpider):
          name = 'test'
          start_urls = ['https://xxx.com']
          rules = (
              Rule(LinkExtractor(allow=r'pattern-matching-the-target-links'), callback='parse_list_page',
                   follow=True),
          )
      
          # Build the initial requests
          def start_requests(self):
              for url in self.start_urls:
                  yield SplashRequest(url=url,
                                      args={'wait': 5, 'viewport': '1024x2480', 'timeout': 90, 'images': 0,
                                            'resource_timeout': 10}, dont_filter=True)  # 'wait' is how long Splash waits after the page loads
      
          # Override the CrawlSpider method so that follow-up requests are sent with SplashRequest
          def _build_request(self, rule_index, link):
              return SplashRequest(
                  url=link.url,
                  callback=self._callback,
                  errback=self._errback,
                  meta=dict(rule=rule_index, link_text=link.text),
                  args={'wait': 5, 'viewport': '1024x2480', 'timeout': 90, 'images': 0,
                        'resource_timeout': 10}
              )
      
          # Override the CrawlSpider method
          def _requests_to_follow(self, response):
              if not isinstance(response, (SplashTextResponse, SplashJsonResponse, SplashResponse, HtmlResponse)):
                  return
              seen = set()
              for rule_index, rule in enumerate(self._rules):
                  links = [lnk for lnk in rule.link_extractor.extract_links(response)
                           if lnk not in seen]
                  for link in rule.process_links(links):
                      seen.add(link)
                      request = self._build_request(rule_index, link)
                      yield rule.process_request(request, response)
      
          def parse_list_page(self, response):
              pass

      The request object changes from scrapy.Request to scrapy_splash.SplashRequest, and the parent-class _build_request() and _requests_to_follow() methods need to be overridden: CrawlSpider's default _requests_to_follow() only extracts links from HtmlResponse objects, while Splash returns SplashTextResponse/SplashJsonResponse objects, so without the override the rules would never follow any links. These are CrawlSpider internals, so their signatures may vary between Scrapy versions.

5. Example

Douban Movies: https://movie.douban.com/typerank?type_name=%E5%89%A7%E6%83%85&type=11&interval_id=100:90

  • The movie information on the page is generated dynamically by JavaScript:

  • Page source preview:

  • Spider code:

    import scrapy
    from scrapy_splash import SplashRequest
    
    class DoubanSpider(scrapy.Spider):
        name = 'douban'
        allowed_domains = ['douban.com']
        start_urls = ['https://movie.douban.com/typerank?type_name=%E5%89%A7%E6%83%85&type=11&interval_id=100:90']
    
        def start_requests(self):
            for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, args={'wait': 10})
    
        def parse(self, response):
            # the movie list only exists in the JS-rendered HTML returned by Splash
            movie_list = response.xpath('//div[@class="movie-list-panel pictext"]/div')
            for movie_item in movie_list:
                movie_name = movie_item.xpath('.//span[@class="movie-name-text"]/a/text()').extract_first()
                print(movie_name)
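
    To run the spider, scrapy crawl douban from the project directory is enough; alternatively, a minimal runner script (assuming the standard Scrapy project layout used above):

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Load the project settings (including the scrapy_splash configuration above)
    # and run the 'douban' spider.
    process = CrawlerProcess(get_project_settings())
    process.crawl('douban')
    process.start()
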
  • Crawl results:

    肖申克的救赎
    霸王别姬
    美丽人生
    控方证人
    伊丽莎白
    阿甘正传
    辛德勒的名单
    背靠背,脸对脸
    茶馆
    控方证人
    十二怒汉(电视版)
    这个杀手不太冷
    千与千寻
    泰坦尼克号
    忠犬八公的故事
    十二怒汉
    泰坦尼克号 3D版
    灿烂人生
    横空出世
    高山下的花环

    With the scrapy-splash component we successfully obtained the JS-rendered page source and extracted the data we needed.

 

posted @ 2021-10-06 18:32  eliwang