Blocking ads with Playwright and adblocker
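Most websites today embed tracking scripts (analytics), and many also serve ads, which can get in the way when scraping. The community already provides a plugin for this
that we can use directly. Below is a short usage guide.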
Environment setup
Based on browserless
- docker-compose
version: "3"
services:
  browser:
    image: ghcr.io/browserless/chromium:latest
    environment:
      - CONCURRENT=40
      - QUEUED=20
      - CORS=true
      - CORS_MAX_AGE=300
      - DATA_DIR=/tmp/my-profile
      - TOKEN=6R0W53R135510
    volumes:
      - ./my-profile:/tmp/my-profile
    ports:
      - "3000:3000"
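The port mapping and TOKEN in the compose file determine the WebSocket endpoint a client connects to. As a minimal sketch, the endpoint could be assembled from configuration instead of being hard-coded. The helper name `buildEndpoint` and the environment variable names `BROWSERLESS_HOST`, `BROWSERLESS_PORT`, and `BROWSERLESS_TOKEN` are assumptions made for illustration; they are not defined by browserless.

```javascript
// Hypothetical helper: builds the browserless WebSocket endpoint from its parts.
function buildEndpoint({ host = "localhost", port = 3000, token }) {
  const url = new URL(`ws://${host}:${port}`);
  // The token must match the TOKEN value set in docker-compose.
  url.searchParams.set("token", token);
  return url.toString();
}

// The environment variable names below are illustrative assumptions.
const endpoint = buildEndpoint({
  host: process.env.BROWSERLESS_HOST || "localhost",
  port: Number(process.env.BROWSERLESS_PORT || 3000),
  token: process.env.BROWSERLESS_TOKEN || "6R0W53R135510",
});
console.log(endpoint);
```

The resulting string can be passed to `chromium.connectOverCDP` in place of a literal URL, which keeps the token out of the script itself.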
Usage
- package.json
{
  "name": "local_storage",
  "version": "1.0.0",
  "main": "index.js",
  "license": "MIT",
  "dependencies": {
    "@cliqz/adblocker-playwright": "^1.27.3",
    "cross-fetch": "^4.0.0",
    "playwright": "^1.44.0"
  },
  "scripts": {
    "start": "node dalong.js"
  }
}
- dalong.js
const { chromium } = require("playwright");
const { PlaywrightBlocker } = require("@cliqz/adblocker-playwright");
const fetch = require("cross-fetch");

(async () => {
  // Connect to the browserless container over CDP; the token must match TOKEN in docker-compose
  const browser = await chromium.connectOverCDP(
    "ws://localhost:3000?token=6R0W53R135510"
  );
  try {
    const bContext = await browser.newContext({
      userAgent:
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    });
    const page = await bContext.newPage();
    // Load the prebuilt ads and tracking filter lists, then enable blocking before navigating
    const blocker = await PlaywrightBlocker.fromPrebuiltAdsAndTracking(fetch);
    await blocker.enableBlockingInPage(page);
    await page.goto("https://www.cnblogs.com/rongfengliang", {
      waitUntil: "networkidle",
    });
    // Print the inner HTML of every element matching #content
    const results = await page.locator("#content").all();
    for (const row of results) {
      console.log(await row.innerHTML());
    }
  } finally {
    // Always release the remote browser session, even if a step above fails
    await browser.close();
  }
})();
- Request results
[Figure: blocked requests]
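The blocked requests can also be observed from code. The sketch below relies on the `request-blocked` event that the adblocker package documents for its blocker instances; the helper names `collectBlocked` and `countByHost` are hypothetical and exist only for this illustration.

```javascript
// Hypothetical helper: records the URL of every request the blocker rejects.
// Assumes `blocker` emits "request-blocked" with an object exposing `url`.
function collectBlocked(blocker) {
  const blockedUrls = [];
  blocker.on("request-blocked", (request) => {
    blockedUrls.push(request.url);
  });
  return blockedUrls;
}

// Hypothetical helper: groups blocked URLs by hostname to show which
// ad or tracking hosts were hit most often.
function countByHost(urls) {
  const counts = {};
  for (const u of urls) {
    const host = new URL(u).hostname;
    counts[host] = (counts[host] || 0) + 1;
  }
  return counts;
}
```

In dalong.js, `const blocked = collectBlocked(blocker)` would be called before `page.goto`, and `console.log(countByHost(blocked))` once the page has finished loading.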
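Notes
The above is a simple example. Blocking ads is useful in a number of scenarios, for example to reduce network overhead and speed up processing. adblocker offers considerably more than is shown here: we can also define our own blocking rules for more flexible handling.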
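As one possible way to customize blocking, the sketch below builds a blocker from hand-written rules instead of the prebuilt lists. It assumes Adblock-style network filters of the form `||host^` and the `parse` method shown in the package README; the helper names and the example hostnames are hypothetical.

```javascript
// Hypothetical helper: turns hostnames into Adblock-style network filters.
// "||example.com^" matches requests to example.com and its subdomains.
function toNetworkFilters(hosts) {
  return hosts.map((host) => `||${host}^`).join("\n");
}

// Builds a blocker from custom rules only (no prebuilt lists are fetched).
// The package is required lazily so the helper above stays dependency-free.
function createCustomBlocker(hosts) {
  const { PlaywrightBlocker } = require("@cliqz/adblocker-playwright");
  return PlaywrightBlocker.parse(toNetworkFilters(hosts));
}

// Usage inside an async function, with `page` created by Playwright:
//   const blocker = createCustomBlocker(["ads.example.com", "tracker.example.net"]);
//   await blocker.enableBlockingInPage(page);
```

Keeping the rule generation separate from blocker creation makes it easy to load the host list from a file or configuration later on.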
References
https://www.npmjs.com/package/@cliqz/adblocker-playwright
https://playwright.dev/docs/locators#lists