Blocking ads with Playwright and adblocker

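Most websites today embed tracking code (analytics), and many also carry ads, both of which can get in the way when scraping. The community already provides a plugin for this
that we can use directly. Below is a brief usage guide.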

Environment setup

Based on browserless

  • docker-compose
version: "3"
services:
  browser:
    image: ghcr.io/browserless/chromium:latest
    environment:
      - CONCURRENT=40
      - QUEUED=20
      - CORS=true
      - CORS_MAX_AGE=300
      - DATA_DIR=/tmp/my-profile
      - TOKEN=6R0W53R135510
    volumes:
      - ./my-profile:/tmp/my-profile
    ports:
      - "3000:3000"
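
The TOKEN set in the compose file has to be repeated in the WebSocket URL that a client connects with. As a minimal sketch, the endpoint can be assembled from its parts rather than hard-coded. The helper name buildEndpoint and the environment variable BROWSERLESS_TOKEN are assumptions made for illustration; browserless does not define them.

```javascript
// Build the browserless WebSocket endpoint from its parts.
// buildEndpoint and BROWSERLESS_TOKEN are hypothetical names used only in this sketch.
function buildEndpoint({ host = "localhost", port = 3000, token } = {}) {
  const url = new URL(`ws://${host}:${port}`);
  if (token) {
    // Must match the TOKEN value in the docker-compose file.
    url.searchParams.set("token", token);
  }
  return url.toString();
}

// Fall back to the token from the compose file when the variable is not set.
const endpoint = buildEndpoint({
  token: process.env.BROWSERLESS_TOKEN || "6R0W53R135510",
});
console.log(endpoint);
```

The resulting string can be passed to chromium.connectOverCDP in place of a literal URL.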

Usage

  • package.json
{
  "name": "local_storage",
  "version": "1.0.0",
  "main": "index.js",
  "license": "MIT",
  "dependencies": {
    "@cliqz/adblocker-playwright": "^1.27.3",
    "cross-fetch": "^4.0.0",
    "playwright": "^1.44.0"
  },
  "scripts": {
    "start": "node dalong.js"
  }
}
  • dalong.js
const { chromium } = require("playwright");
const { PlaywrightBlocker } = require("@cliqz/adblocker-playwright");
const fetch = require("cross-fetch");

(async () => {
  // Connect to the browserless container over CDP; the token must match TOKEN in docker-compose.
  const browser = await chromium.connectOverCDP(
    "ws://localhost:3000?token=6R0W53R135510"
  );
  try {
    const context = await browser.newContext({
      userAgent:
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    });
    const page = await context.newPage();

    // Load the prebuilt ads + tracking engine and enable it before navigating.
    const blocker = await PlaywrightBlocker.fromPrebuiltAdsAndTracking(fetch);
    await blocker.enableBlockingInPage(page);

    await page.goto("https://www.cnblogs.com/rongfengliang", {
      waitUntil: "networkidle",
    });

    // locator.all() resolves to an array with one locator per matching element.
    const rows = await page.locator("#content").all();
    for (const row of rows) {
      console.log(await row.innerHTML());
    }
  } finally {
    // Close the connection even if navigation or scraping fails.
    await browser.close();
  }
})();
  • Request results

Blocked requests
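
The blocked requests can also be observed from the script itself. The sketch below assumes the blocker object emits "request-blocked" and "request-redirected" events whose first argument carries a url property, as the adblocker README describes. The helper name logBlockedRequests is hypothetical.

```javascript
// Record and print the requests that the blocker stops.
// `blocker` is expected to be a PlaywrightBlocker instance; `log` defaults to console.log.
// logBlockedRequests is a hypothetical helper name used only in this sketch.
function logBlockedRequests(blocker, log = console.log) {
  const blocked = [];
  blocker.on("request-blocked", (request) => {
    blocked.push(request.url);
    log(`blocked: ${request.url}`);
  });
  blocker.on("request-redirected", (request) => {
    log(`redirected: ${request.url}`);
  });
  // The array is filled in as the page loads.
  return blocked;
}
```

Calling logBlockedRequests(blocker) before page.goto and inspecting the returned array afterwards gives a rough count of what was filtered out for a given page.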

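Notes

The above is a basic usage example. Blocking ads is quite useful in some scenarios, for example to reduce network overhead and speed up processing. adblocker actually offers considerably more than this: we can also define our own blocking rules for flexible handling.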
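
As one possible form of custom blocking, a blocker can be built from our own filter rules instead of the prebuilt lists. This is a minimal sketch, assuming PlaywrightBlocker.parse accepts a newline-separated list in Adblock Plus filter syntax. The helper names and the example hosts are chosen for illustration and do not come from the original setup.

```javascript
// Turn plain hostnames into Adblock Plus style rules ("||host^" matches the host and its subdomains).
function toFilterList(hosts) {
  return hosts.map((host) => `||${host}^`).join("\n");
}

// Example hosts, assumed for illustration only.
const CUSTOM_HOSTS = ["hm.baidu.com", "www.googletagmanager.com"];

// Build a blocker from custom rules only; no filter lists are downloaded.
// The require is placed inside the function so the helper above can be used on its own.
function buildCustomBlocker(hosts = CUSTOM_HOSTS) {
  const { PlaywrightBlocker } = require("@cliqz/adblocker-playwright");
  return PlaywrightBlocker.parse(toFilterList(hosts));
}
```

The returned blocker is used the same way as the prebuilt one: await blocker.enableBlockingInPage(page) before navigating.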

References

https://www.npmjs.com/package/@cliqz/adblocker-playwright
https://playwright.dev/docs/locators#lists

posted on 2024-06-24 08:00  荣锋亮