Ajax爬取
动态渲染工具安装(Playwright)
即我们直接操控浏览器来获取数据,接口数据都是加密的。但前端肯定有对应的一个解密程序,那么我们就可以进行操控浏览器来进行一个对应的解密。、
安装使用
pip3 install playwright
playwright install
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
for browser_type in [p.chromium, p.firefox, p.webkit]:
browser = browser_type.launch(headless=False)
page = browser.new_page()
page.goto('https://www.baidu.com')
page.screenshot(path=f'screenshot-{browser_type.name}.png')
print(page.title())
browser.close()
选择器
即我们去定位我们要去操控的元素
from playwright.sync_api import sync_playwright
def run(playwright):
browser = playwright.firefox.launch(headless=False)
context = browser.new_context()
# Open new page
page = context.new_page()
# Go to https://www.baidu.com/
page.goto("https://www.baidu.com/")
# Click input[name="wd"]
page.click("input[name=\"wd\"]")
# Fill input[name="wd"]
page.fill("input[name=\"wd\"]", "nba")
# Click text=百度一下
with page.expect_navigation():
page.click("text=百度一下")
context.close()
browser.close()
with sync_playwright() as playwright:
run(playwright)
当然这个也是支持使用XPath语法的
page.click("xpath=//button")
节点信息获取
属性获取
page.get_attribute(selector, name, **kwargs)
实例代码:
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=False)
page = browser.new_page()
page.goto('https://spa6.scrape.center/')
page.wait_for_load_state('networkidle')
href = page.get_attribute('a.name', 'href')
print(href)
browser.close()
多节点获取
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=False)
page = browser.new_page()
page.goto('https://spa6.scrape.center/')
page.wait_for_load_state('networkidle')
elements = page.query_selector_all('a.name')
for element in elements:
print(element.get_attribute('href'))
print(element.text_content())
browser.close()
单节点获取
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=False)
page = browser.new_page()
page.goto('https://spa6.scrape.center/')
page.wait_for_load_state('networkidle')
element = page.query_selector('a.name')
print(element.get_attribute('href'))
print(element.text_content())
browser.close()
操纵api
事件操作
点击
我们选择了对应的一个元素后,肯定要进行对应的一个操作。
page.click(selector, **kwargs)
- click_count:点击次数,默认为 1。
- timeout:等待要点击的节点的超时时间,默认是 30 秒。
- position:需要传入一个字典,带有 x 和 y 属性,代表点击位置相对节点左上角的偏移位置。
- force:即使不可点击,那也强制点击。默认是 False。
输入
page.fill(selector, value, **kwargs)
- 第一个参数是 selector
- 第二个参数是 value,代表输入的内容,另外还可以通过 timeout 参数指定对应节点的最长等待时间。
page.click(selector, **kwargs)
def run(playwright):
browser = playwright.firefox.launch(headless=False)
context = browser.new_context()
# Open new page
page = context.new_page()
# Go to https://www.baidu.com/
page.goto("https://www.baidu.com/")
# Click input[name="wd"]
page.click("input[name=\"wd\"]")
# Fill input[name="wd"]
page.fill("input[name=\"wd\"]", "nba")
# Click text=百度一下
with page.expect_navigation():
page.click("text=百度一下")
context.close()
browser.close()
页面等待
页面加载
使用waitForLoadState方法
const { chromium } = require('playwright');
(async () => {
const browser = await chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();
await page.goto('https://example.com');
await page.waitForLoadState('load'); // 等待页面加载完毕
// 继续进行其他操作
await page.click('text=More information...');
await browser.close();
})();
waitForLoadState
方法等待页面达到指定的加载状态。常见的加载状态有:
"load"
: 页面的所有资源加载完毕。"domcontentloaded"
: DOM 加载完毕。"networkidle"
: 没有网络活动至少 500ms。
元素加载
使用waitForSelector
const { chromium } = require('playwright');
(async () => {
const browser = await chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();
await page.goto('https://example.com');
await page.waitForSelector('h1'); // 等待页面上出现 h1 标签
// 继续进行其他操作
await page.click('text=More information...');
await browser.close();
})();
网络加载
使用waitUntil
const { chromium } = require('playwright');
(async () => {
const browser = await chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();
await page.goto('https://example.com', { waitUntil: 'networkidle' });
// 继续进行其他操作
await page.click('text=More information...');
await browser.close();
})();
自定义逻辑
const { chromium } = require('playwright');
(async () => {
const browser = await chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();
await page.goto('https://example.com');
await page.waitForFunction(() => document.querySelector('h1') !== null); // 等待 h1 标签出现
// 继续进行其他操作
await page.click('text=More information...');
await browser.close();
})();
事件监听
使用on 方法,来监听页面中发生的各个事件
from playwright.sync_api import sync_playwright
def on_response(response):
print(f'Statue {response.status}: {response.url}')
with sync_playwright() as p:
browser = p.chromium.launch(headless=False)
page = browser.new_page()
page.on('response', on_response)
page.goto('https://spa6.scrape.center/')
page.wait_for_load_state('networkidle')
browser.close()
源码获取
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=False)
page = browser.new_page()
page.goto('https://spa6.scrape.center/')
page.wait_for_load_state('networkidle')
html = page.content()
print(html)
browser.close()
网络劫持
使用def cancel_request(route, request),对网络进行劫持
#阻止加载图片,来提升性能
from playwright.sync_api import sync_playwright
import re
with sync_playwright() as p:
browser = p.chromium.launch(headless=False)
page = browser.new_page()
def cancel_request(route, request):
route.abort()
page.route(re.compile(r"(\.png)|(\.jpg)"), cancel_request)
page.goto("https://spa6.scrape.center/")
page.wait_for_load_state('networkidle')
page.screenshot(path='no_picture.png')
browser.close()
#控制页面显示内容
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=False)
page = browser.new_page()
def modify_response(route, request):
route.fulfill(path="./custom_response.html")
page.route('/', modify_response)
page.goto("https://spa6.scrape.center/")
browser.close()
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 25岁的心里话
· 闲置电脑爆改个人服务器(超详细) #公网映射 #Vmware虚拟网络编辑器
· 基于 Docker 搭建 FRP 内网穿透开源项目(很简单哒)
· 零经验选手,Compose 一天开发一款小游戏!
· 一起来玩mcp_server_sqlite,让AI帮你做增删改查!!