Using Playwright
1. Basic usage
- Synchronous mode
from playwright.sync_api import sync_playwright

url = 'https://www.baidu.com'

with sync_playwright() as p:
    for browser_type in [p.chromium, p.firefox, p.webkit]:
        browser = browser_type.launch(headless=False)
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path=f'sync-{browser_type.name}.png')
        print(page.title())
        browser.close()
- Asynchronous mode
import asyncio
from playwright.async_api import async_playwright

url = 'https://www.baidu.com'

async def main():
    async with async_playwright() as p:
        for browser_type in [p.chromium, p.firefox, p.webkit]:
            browser = await browser_type.launch()
            page = await browser.new_page()
            await page.goto(url)
            await page.screenshot(path=f'async-{browser_type.name}.png')
            print(await page.title())
            await browser.close()

asyncio.run(main())
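2. Code generation
Playwright can record the actions you perform in the browser and generate the corresponding code automatically, using the codegen command.
# Show the options of the codegen command
playwright codegen --help
# Example: launch Firefox and write the generated code to script.py
playwright codegen -o script.py -b firefox https://www.baidu.com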
3. Selectors
- Text selectors
page.click("text=Log in")
- CSS selectors
page.click("button")
page.click("#nav-bar .contact-us-item")
page.click("[data-test=login-button]")
page.click("[aria-label='Sign in']")
- XPath
# Prefix the expression with "xpath=" (selectors that start with // are also treated as XPath)
page.click("xpath=//button")
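4. Event listening
The page object provides an on method for listening to events that occur on the page, such as close, console, load, request, and response.
For data loaded via Ajax, encrypted parameters in the request are not a concern, because what we intercept is the final response.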
from playwright.sync_api import Playwright, sync_playwright

# def on_response(response):
#     """
#     Print every request and response shown in the browser's Network panel.
#     """
#     print(f'Status {response.status}: {response.url}')

def on_response(response):
    """
    Intercept Ajax requests through the response event and read the response body directly.
    """
    if "api/movie/" in response.url and response.status == 200:
        print(response.json())

def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    page = browser.new_page()
    # Listen for the response event and register on_response as the callback
    page.on('response', on_response)
    page.goto("https://spa6.scrape.center/")
    page.wait_for_load_state("networkidle")
    page.close()
    browser.close()

with sync_playwright() as playwright:
    run(playwright)
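In practice you usually want to keep the intercepted data instead of printing it. The sketch below is one possible variation on the handler above: it builds the callback with a closure so the matching JSON bodies end up in a list. The names make_collector and scrape are hypothetical helpers introduced for illustration, not part of Playwright; only page.on, goto, wait_for_load_state, and the response attributes url, status, and json() come from the library.

```python
from typing import Any, Callable, List, Tuple


def make_collector(url_fragment: str) -> Tuple[Callable[[Any], None], List[Any]]:
    """Build a 'response' handler that stores matching JSON bodies in a list.

    make_collector is a hypothetical helper, not a Playwright API.
    """
    collected: List[Any] = []

    def on_response(response: Any) -> None:
        # Same filter as the handler above: match on URL fragment and status.
        if url_fragment in response.url and response.status == 200:
            collected.append(response.json())

    return on_response, collected


def scrape(url: str, url_fragment: str) -> List[Any]:
    """Open url, wait until the network is idle, and return the collected bodies."""
    # Imported here so make_collector can be exercised without a browser installed.
    from playwright.sync_api import sync_playwright

    handler, collected = make_collector(url_fragment)
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("response", handler)
        page.goto(url)
        page.wait_for_load_state("networkidle")
        browser.close()
    return collected
```

Calling scrape("https://spa6.scrape.center/", "api/movie/") should return whatever JSON bodies the page loaded from that endpoint; the exact contents depend on the site.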
5. Common methods
- Get the page source: page.content()
- Click an element: page.click(selector, **kwargs) (see the official documentation)
- Type text: page.fill(selector, value, **kwargs)
- Get a node attribute: page.get_attribute(selector, name, **kwargs)
# Returns the attribute of a single node only
href = page.get_attribute("a.name", "href")
- Get multiple nodes: query_selector_all()
  - Node attribute: element.get_attribute(name)
  - Node text: element.text_content()
elements = page.query_selector_all("a.name")
for element in elements:
    href = element.get_attribute("href")
    text = element.text_content()
- Get a single node: query_selector()
element = page.query_selector("a.name")
href = element.get_attribute("href")
text = element.text_content()
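The node methods above combine naturally into a small extraction routine. The following is a minimal sketch, assuming the matched nodes are links with an href attribute; extract_links is a hypothetical helper name, and it relies only on query_selector_all, get_attribute, and text_content as listed above.

```python
from typing import Any, Dict, List, Optional


def extract_links(page: Any, selector: str) -> List[Dict[str, Optional[str]]]:
    """Return the href and text of every node matching selector.

    extract_links is a hypothetical helper; page is expected to be a
    Playwright Page (or any object with a compatible query_selector_all).
    """
    links: List[Dict[str, Optional[str]]] = []
    for element in page.query_selector_all(selector):
        text = element.text_content()
        links.append({
            # get_attribute returns None when the attribute is missing
            "href": element.get_attribute("href"),
            # text_content can also return None, so guard before strip()
            "text": text.strip() if text is not None else None,
        })
    return links
```

For example, extract_links(page, "a.name") gives one dictionary per matching node. Note that page.query_selector returns None when nothing matches, so the single-node variant should check the result before calling get_attribute or text_content on it.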