Ajax爬取

动态渲染工具安装(Playwright)

即我们直接操控浏览器来获取数据,接口数据都是加密的。但前端肯定有对应的一个解密程序,那么我们就可以进行操控浏览器来进行一个对应的解密。、

安装使用

pip3 install playwright
playwright install
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    for browser_type in [p.chromium, p.firefox, p.webkit]:
        browser = browser_type.launch(headless=False)
        page = browser.new_page()
        page.goto('https://www.baidu.com')
        page.screenshot(path=f'screenshot-{browser_type.name}.png')
        print(page.title())
        browser.close()

选择器

即我们去定位我们要去操控的元素

from playwright.sync_api import sync_playwright

def run(playwright):
    browser = playwright.firefox.launch(headless=False)
    context = browser.new_context()

    # Open new page
    page = context.new_page()

    # Go to https://www.baidu.com/
    page.goto("https://www.baidu.com/")

    # Click input[name="wd"]
    page.click("input[name=\"wd\"]")

    # Fill input[name="wd"]
    page.fill("input[name=\"wd\"]", "nba")

    # Click text=百度一下
    with page.expect_navigation():
        page.click("text=百度一下")

    context.close()
    browser.close()

with sync_playwright() as playwright:
    run(playwright)

当然这个也是支持使用XPath语法的

page.click("xpath=//button")

节点信息获取

属性获取

page.get_attribute(selector, name, **kwargs)

实例代码:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto('https://spa6.scrape.center/')
    page.wait_for_load_state('networkidle')
    href = page.get_attribute('a.name', 'href')
    print(href)
    browser.close()

多节点获取

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto('https://spa6.scrape.center/')
    page.wait_for_load_state('networkidle')
    elements = page.query_selector_all('a.name')
    for element in elements:
        print(element.get_attribute('href'))
        print(element.text_content())
    browser.close()

单节点获取

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto('https://spa6.scrape.center/')
    page.wait_for_load_state('networkidle')
    element = page.query_selector('a.name')
    print(element.get_attribute('href'))
    print(element.text_content())
    browser.close()

操纵api

事件操作

点击

我们选择了对应的一个元素后,肯定要进行对应的一个操作。

page.click(selector, **kwargs)

  • click_count:点击次数,默认为 1。
  • timeout:等待要点击的节点的超时时间,默认是 30 秒。
  • position:需要传入一个字典,带有 x 和 y 属性,代表点击位置相对节点左上角的偏移位置。
  • force:即使不可点击,那也强制点击。默认是 False。

输入

page.fill(selector, value, **kwargs)

  • 第一个参数是 selector
  • 第二个参数是 value,代表输入的内容,另外还可以通过 timeout 参数指定对应节点的最长等待时间。
page.click(selector, **kwargs)

def run(playwright):
    browser = playwright.firefox.launch(headless=False)
    context = browser.new_context()

    # Open new page
    page = context.new_page()

    # Go to https://www.baidu.com/
    page.goto("https://www.baidu.com/")

    # Click input[name="wd"]
    page.click("input[name=\"wd\"]")

    # Fill input[name="wd"]
    page.fill("input[name=\"wd\"]", "nba")

    # Click text=百度一下
    with page.expect_navigation():
        page.click("text=百度一下")

    context.close()
    browser.close()

页面等待

页面加载

使用waitForLoadState方法

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  await page.goto('https://example.com');
  await page.waitForLoadState('load'); // 等待页面加载完毕

  // 继续进行其他操作
  await page.click('text=More information...');

  await browser.close();
})();

waitForLoadState 方法等待页面达到指定的加载状态。常见的加载状态有:

  • "load": 页面的所有资源加载完毕。
  • "domcontentloaded": DOM 加载完毕。
  • "networkidle": 没有网络活动至少 500ms。

元素加载

使用waitForSelector

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  await page.goto('https://example.com');
  await page.waitForSelector('h1'); // 等待页面上出现 h1 标签

  // 继续进行其他操作
  await page.click('text=More information...');

  await browser.close();
})();

网络加载

使用waitUntil

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  await page.goto('https://example.com', { waitUntil: 'networkidle' });

  // 继续进行其他操作
  await page.click('text=More information...');

  await browser.close();
})();

自定义逻辑

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  await page.goto('https://example.com');
  await page.waitForFunction(() => document.querySelector('h1') !== null); // 等待 h1 标签出现

  // 继续进行其他操作
  await page.click('text=More information...');

  await browser.close();
})();

事件监听

使用on 方法,来监听页面中发生的各个事件

from playwright.sync_api import sync_playwright

def on_response(response):
    print(f'Statue {response.status}: {response.url}')

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.on('response', on_response)
    page.goto('https://spa6.scrape.center/')
    page.wait_for_load_state('networkidle')
    browser.close()

源码获取

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto('https://spa6.scrape.center/')
    page.wait_for_load_state('networkidle')
    html = page.content()
    print(html)
    browser.close()

网络劫持

使用def cancel_request(route, request),对网络进行劫持

#阻止加载图片,来提升性能
from playwright.sync_api import sync_playwright
import re

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()

    def cancel_request(route, request):
        route.abort()

    page.route(re.compile(r"(\.png)|(\.jpg)"), cancel_request)
    page.goto("https://spa6.scrape.center/")
    page.wait_for_load_state('networkidle')
    page.screenshot(path='no_picture.png')
    browser.close()


#控制页面显示内容
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()

    def modify_response(route, request):
        route.fulfill(path="./custom_response.html")

    page.route('/', modify_response)
    page.goto("https://spa6.scrape.center/")
    browser.close()

参考文章

https://cuiqingcai.com/202262.html

posted @   Ho1d_F0rward  阅读(14)  评论(0编辑  收藏  举报
相关博文:
阅读排行:
· 25岁的心里话
· 闲置电脑爆改个人服务器(超详细) #公网映射 #Vmware虚拟网络编辑器
· 基于 Docker 搭建 FRP 内网穿透开源项目(很简单哒)
· 零经验选手,Compose 一天开发一款小游戏!
· 一起来玩mcp_server_sqlite,让AI帮你做增删改查!!
点击右上角即可分享
微信分享提示