A quick note on scrapy + browserless integration
I previously wrote a short note on configuring scrapy's s3 feed exports. The following covers integration with browserless, using browserless to render page content (especially useful for sites with ajax-based requests and lazy-loaded content).
Project setup
Mainly s3 (minio) and browserless
- docker-compose

version: "3"
services:
  browser:
    image: ghcr.io/browserless/chromium:latest
    environment:
      - CONCURRENT=40
      - QUEUED=20
      - CORS=true
      - CORS_MAX_AGE=300
      - DATA_DIR=/tmp/my-profile
      - TOKEN=6R0W53R135510
    volumes:
      - ./my-profile:/tmp/my-profile
    ports:
      - "3000:3000"
  minio:
    image: minio/minio
    ports:
      - "9000:9000"
      - "19001:19001"
    environment:
      MINIO_ACCESS_KEY: minio
      MINIO_SECRET_KEY: minio123
      # quoted so YAML does not coerce it to a boolean
      MINIO_API_SELECT_PARQUET: "on"
    command: server --console-address :19001 --quiet /data
How it works
As shown in the diagram, instead of requesting the original website directly, scrapy fetches content through the content api provided by browserless: browserless loads the actual website, waits for it to render, and returns the html content, which scrapy's parse callback then parses.
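The request to the content api can be sketched outside of scrapy as well. The following is a minimal sketch that builds the POST endpoint and body the api expects; `build_content_request` is a hypothetical helper, and the endpoint/token values match the docker-compose above:

```python
import json

# Hypothetical helper: build the POST request for the browserless
# /content api (endpoint and token taken from the docker-compose above).
def build_content_request(url, token="6R0W53R135510",
                          browserless="http://localhost:3000"):
    endpoint = f"{browserless}/content?token={token}"
    body = json.dumps({
        "url": url,
        # wait until the network is idle so ajax/lazy-loaded
        # content has been rendered before the html is returned
        "gotoOptions": {"waitUntil": "networkidle0"},
    })
    return endpoint, body

endpoint, body = build_content_request("https://cnblogs.com/rongfengliang")
print(endpoint)  # http://localhost:3000/content?token=6R0W53R135510
```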
Code walkthrough
A scrapy project can be created quickly with the scrapy startproject <name>
command; here I only cover the content handling.
- blog spider code
The whole thing is fairly simple: it is just a substitution. The scrapy.Request that originally targeted the website now goes to browserless instead, and the request to browserless is a POST.
import json

import scrapy

from dalongdemo.items import DalongdemoItem


class BlogsSpider(scrapy.Spider):
    name = "blogs"

    def start_requests(self):
        urls = [
            "https://cnblogs.com/rongfengliang",
        ]
        for url in urls:
            # payload for the browserless /content api: load the page and
            # wait until the network is idle so ajax/lazy content is rendered
            options = {
                "url": url,
                "gotoOptions": {
                    "waitUntil": "networkidle0"
                }
            }
            yield scrapy.Request(
                url="http://localhost:3000/content?token=6R0W53R135510",
                method="POST",
                dont_filter=True,
                headers={"Content-Type": "application/json"},
                body=json.dumps(options),
                callback=self.parse,
            )

    def parse(self, response):
        # browserless returns the fully rendered html,
        # so the usual css selectors work on the response
        for item in response.css("div.post"):
            dalongItem = DalongdemoItem()
            # selectors are relative to the current div.post
            dalongItem["title"] = item.css(".postTitle2 > span").get()
            dalongItem["content"] = item.css(".c_b_p_desc").get()
            yield dalongItem
- s3 feed exports configuration
settings.py
FEED_EXPORT_ENCODING = "utf-8"
FEEDS = {
    "s3://mydemo/%(name)s/%(name)s_%(time)s.jsonl": {
        "format": "jsonlines",
    }
}
AWS_ACCESS_KEY_ID = "minio"
AWS_SECRET_ACCESS_KEY = "minio123"
AWS_REGION_NAME = "us-east-1"
AWS_ENDPOINT_URL = "http://localhost:9000"
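Scrapy fills the %(name)s and %(time)s placeholders in the feed URI with the spider name and a timestamp; the expansion is ordinary python %-formatting, which can be checked standalone (the timestamp value here is illustrative):

```python
# scrapy expands feed URI placeholders via python %-formatting:
# "name" is the spider name, "time" a timestamp (illustrative value here)
uri = "s3://mydemo/%(name)s/%(name)s_%(time)s.jsonl" % {
    "name": "blogs",
    "time": "2024-01-01T00-00-00",
}
print(uri)  # s3://mydemo/blogs/blogs_2024-01-01T00-00-00.jsonl
```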
Result
Notes
The above is a simple example. For large-scale crawling we can deploy multiple browserless instances behind a load balancer to improve stability. That said, browserless is not always the best choice, since rendering through it is slower; when the site exposes an api, issuing requests against the api directly may be the better option.
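If the browserless instances are not fronted by a single lb, a simple client-side alternative is rotating endpoints in start_requests. A minimal round-robin sketch; the endpoint list is hypothetical (e.g. several instances from the docker-compose above on different ports):

```python
from itertools import cycle

# hypothetical pool of browserless content-api endpoints
ENDPOINTS = cycle([
    "http://localhost:3000/content?token=6R0W53R135510",
    "http://localhost:3001/content?token=6R0W53R135510",
])


def next_endpoint():
    # round-robin: each request goes to the next instance in the pool
    return next(ENDPOINTS)
```

In the spider, each yielded scrapy.Request would use next_endpoint() as its url instead of the hard-coded address.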
References
https://www.browserless.io/blog/scrapy-headless
https://github.com/rongfengliang/scrapy-browserless-learning
https://docs.scrapy.org/en/latest/intro/tutorial.html