Learning scrapy-splash

Prerequisites

docker

scrapy


Pages that load their content via JavaScript are awkward to scrape with Scrapy alone. Splash is a service that renders such pages and returns the result, and it integrates well with Scrapy. Both Splash and Scrapy support asynchronous processing, so rendering does not block the crawl. With Selenium, by contrast, each page is rendered and downloaded inside a Downloader Middleware, so the whole process is blocking: Scrapy has to wait for the render to finish before it can schedule other requests, which hurts throughput. For this reason, crawling with Splash is considerably faster than with Selenium.
First install Docker, then pull the image: docker pull scrapinghub/splash
Start Splash: docker run -p 8050:8050 scrapinghub/splash
Then check that it is reachable: curl http://localhost:8050


If the firewall has already been opened (or disabled), Splash can also be reached from a remote machine.
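The curl check above hits Splash's HTTP API directly. As a rough illustration of how that API is addressed, here is a small sketch that builds a render.html URL with a wait parameter (the helper name splash_render_url is my own, not part of scrapy-splash):

```python
from urllib.parse import urlencode

def splash_render_url(splash_base, target_url, wait=0.5):
    """Build a GET URL for Splash's render.html endpoint (illustrative helper)."""
    query = urlencode({'url': target_url, 'wait': wait})
    return splash_base.rstrip('/') + '/render.html?' + query

# Fetching this URL with curl or requests returns the rendered HTML
print(splash_render_url('http://localhost:8050', 'http://example.com'))
```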


Next, configure Scrapy by adding the following to settings.py:

# add the Splash URL and the Splash-aware dedup filter class
SPLASH_URL = 'http://192.168.99.100:8050'  
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'  
# add the Splash downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    #'splash_163.middlewares.Splash163DownloaderMiddleware': 543,
}
# add the Splash spider middleware
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
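If the project also enables Scrapy's HTTP cache, the scrapy-splash README additionally recommends a Splash-aware cache storage, so that different Splash requests to the same endpoint are not cached as one:

```python
# settings.py (optional; only needed when HTTPCACHE is enabled)
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```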

Then yield Splash requests in the spider:

from scrapy_splash import SplashRequest
...
...
yield SplashRequest(url, callback=self.parse_result,
    args={
        # optional; parameters passed to Splash HTTP API
        'wait': 0.5,
        # 'url' is prefilled from request url
        # 'http_method' is set to 'POST' for POST requests
        # 'body' is set to request body for POST requests
    },
    endpoint='render.json', # optional; default is render.html
    splash_url='<url>',     # optional; overrides SPLASH_URL
)

# Alternatively, build a plain Request and configure Splash through the meta attribute:
yield scrapy.Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,
            # 'url' is prefilled from request url
            # 'http_method' is set to 'POST' for POST requests
            # 'body' is set to request body for POST requests
        },
        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.html
        'splash_url': '<url>',      # optional; overrides SPLASH_URL
        'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
        'splash_headers': {},       # optional; a dict with headers sent to Splash
        'dont_process_response': True, # optional, default is False
        'dont_send_headers': True,  # optional, default is False
        'magic_response': False,    # optional, default is True
    }
})
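SplashRequest is essentially a convenience wrapper that fills in this same meta['splash'] dict for you. A much-simplified sketch of the mapping (my own illustration, not scrapy-splash's actual implementation, which handles many more options):

```python
def splash_meta(args, endpoint='render.html'):
    """Simplified sketch of the meta['splash'] dict that SplashRequest builds."""
    # the real wrapper also handles splash_url, headers, slot policy, etc.
    return {'splash': {'args': args, 'endpoint': endpoint}}

meta = splash_meta({'wait': 0.5}, endpoint='render.json')
```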

When the browser needs to perform actions on the page, a Lua script is required. With Lua, Splash can load pages and simulate clicks for pagination, much like Selenium.

script = """
function main(splash, args)
  -- url, wait and page all come from the args the spider passes
  splash.images_enabled = false
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  -- fill in the target page number and click the submit button
  local js = string.format("document.querySelector('#mainsrp-pager div.form > input').value=%d;document.querySelector('#mainsrp-pager div.form > span.btn.J_Submit').click()", args.page)
  splash:evaljs(js)
  assert(splash:wait(args.wait))
  return splash:png()
end
"""
from urllib.parse import quote

from scrapy import Spider
from scrapy_splash import SplashRequest

class TaobaoSpider(Spider):
    name = 'taobao'
    allowed_domains = ['taobao.com']
    base_url = 'https://s.taobao.com/search?q='

    def start_requests(self):
        for keyword in self.settings.get('KEYWORDS'):
            for page in range(1, self.settings.get('PAGE_NUM') + 1):
                url = self.base_url + quote(keyword)
                yield SplashRequest(url, callback=self.parse, endpoint='execute',
                                    args={'lua_source': script, 'page': page, 'wait': 3})
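The quote() call above percent-encodes the non-ASCII keyword so it is safe to place in a URL; for example (the keyword here is just for illustration):

```python
from urllib.parse import quote

base_url = 'https://s.taobao.com/search?q='
url = base_url + quote('羽毛球')  # UTF-8 bytes percent-encoded
print(url)
```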

Here is also a Lua script for a POST request:

script = """
function main(splash, args)
  local treat = require("treat")
  local json = require("json")
  -- POST a JSON body to the target URL
  local response = splash:http_post{args.url,
                                    body=json.encode({keywords="园林"})}
  splash:wait(10)
  return {
    html = treat.as_string(response.body),
    url = response.url,
    status = response.status
  }
end
"""
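On the Python side, the json.encode({keywords="园林"}) call in the script corresponds to an ordinary JSON string; the standard json module produces an equivalent body (whitespace may differ between the Lua and Python encoders):

```python
import json

# the POST body the Lua script sends, built with Python's json module
body = json.dumps({"keywords": "园林"}, ensure_ascii=False)
print(body)
```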
posted @ 2019-12-11 18:28  上周的bug还没搞好