A First Look at pyquery (Introductory Examples)
A simple example (initializing from a string)
| from pyquery import PyQuery as pq |
| |
| html = ''' |
| <div> |
| <ul> |
| <li class="item-0"><a href="link1.html">first item</a></li> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-inactive"><a href="link3.html">third item</a></li> |
| <li class="item-1"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a> |
| </ul> |
| </div> |
| ''' |
| |
| doc = pq(html) |
| print(doc) |
| |
| # Output: |
| <div> |
| <ul> |
| <li class="item-0"><a href="link1.html">first item</a></li> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-inactive"><a href="link3.html">third item</a></li> |
| <li class="item-1"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a> |
| </li></ul> |
| </div> |
Initializing from a URL
| from pyquery import PyQuery as pq |
| import requests |
| |
| |
| # Let pyquery fetch the page itself |
| doc1 = pq(url='https://www.cnblogs.com/liyihua/') |
| print(doc1('title')) |
| |
| # Equivalent: fetch the page with requests, then parse the returned text |
| doc2 = pq(requests.get('https://www.cnblogs.com/liyihua/').text) |
| print(doc2('title')) |
| |
| # Output: |
| <title>李亦华 - 博客园</title> |
| |
| <title>李亦华 - 博客园</title> |
Initializing from a file
| from pyquery import PyQuery as pq |
| |
| doc = pq(filename='test.html') |
| print(doc('li')) |
| |
| # Output: |
| <li class="item-0"><a href="link1.html">first item</a></li> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-inactive"><a href="link3.html">third item</a></li> |
| <li class="item-1"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a> |
| </li> |
| |
| # Contents of test.html (the same markup as in the first example): |
| <div> |
| <ul> |
| <li class="item-0"><a href="link1.html">first item</a></li> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-inactive"><a href="link3.html">third item</a></li> |
| <li class="item-1"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a> |
| </ul> |
| </div> |
Basic CSS selectors in pyquery
An example to start:
| from pyquery import PyQuery as pq |
| |
| html = ''' |
| <div id="container"> |
| <ul class="list"> |
| <li class="item-0">first item</li> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> |
| <li class="item-1 active"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a></li> |
| </ul> |
| </div> |
| ''' |
| |
| doc = pq(html) |
| print(doc('#container .list li')) |
| |
| print( |
| type( |
| doc('#container .list li') |
| ) |
| ) |
| |
| # Output: |
| <li class="item-0">first item</li> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> |
| <li class="item-1 active"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a></li> |
| |
| <class 'pyquery.pyquery.PyQuery'> |
Finding nodes
Getting descendant nodes
Note: find() searches all descendant nodes; to search only direct children, use children() instead.
| from pyquery import PyQuery |
| |
| html = ''' |
| <div id="container"> |
| <ul class="list"> |
| <li class="item-0">first item</li> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> |
| <li class="item-1 active"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a></li> |
| </ul> |
| </div> |
| ''' |
| |
| doc = PyQuery(html) |
| items = doc('.list') |
| |
| print( |
| type(items), |
| items, |
| sep='\n' |
| ) |
| |
| print( |
| type(items.find('li')), |
| items.find('li'), |
| sep='\n' |
| ) |
| # Output: |
| <class 'pyquery.pyquery.PyQuery'> |
| <ul class="list"> |
| <li class="item-0">first item</li> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> |
| <li class="item-1 active"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a></li> |
| </ul> |
| |
| <class 'pyquery.pyquery.PyQuery'> |
| <li class="item-0">first item</li> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> |
| <li class="item-1 active"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a></li> |
Getting the parent node
| from pyquery import PyQuery |
| |
| html = ''' |
| <div id="container"> |
| <ul class="list"> |
| <li class="item-0">first item</li> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> |
| <li class="item-1 active"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a></li> |
| </ul> |
| </div> |
| ''' |
| |
| doc = PyQuery(html) |
| items = doc('.list') |
| |
| print(items, '\n') |
| |
| print( |
| type(items.parent()), |
| items.parent(), |
| sep='\n' |
| ) |
| # Output: |
| <ul class="list"> |
| <li class="item-0">first item</li> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> |
| <li class="item-1 active"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a></li> |
| </ul> |
| |
| |
| <class 'pyquery.pyquery.PyQuery'> |
| <div id="container"> |
| <ul class="list"> |
| <li class="item-0">first item</li> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> |
| <li class="item-1 active"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a></li> |
| </ul> |
| </div> |
Sibling nodes
| from pyquery import PyQuery |
| |
| html = ''' |
| <div id="container"> |
| <ul class="list"> |
| <li class="item-0">first item</li> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> |
| <li class="item-1 active"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a></li> |
| </ul> |
| </div> |
| ''' |
| |
| doc = PyQuery(html) |
| |
| # Select the third li: the one with both class item-0 and class active |
| items = doc('.list .item-0.active') |
| |
| print( |
| type(items.siblings()), |
| items.siblings(), |
| sep='\n' |
| ) |
| |
| print("\n", items.siblings('.active')) |
| # Output: |
| <class 'pyquery.pyquery.PyQuery'> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-0">first item</li> |
| <li class="item-1 active"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a></li> |
| |
| |
| <li class="item-1 active"><a href="link4.html">fourth item</a></li> |
Iterating over nodes
| from pyquery import PyQuery |
| |
| html = ''' |
| <div id="container"> |
| <ul class="list"> |
| <li class="item-0">first item</li> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> |
| <li class="item-1 active"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a></li> |
| </ul> |
| </div> |
| ''' |
| |
| doc = PyQuery(html) |
| lis = doc('li').items() |
| |
| for li in lis: |
|     print( |
|         li, |
|         type(li) |
|     ) |
| # Output: |
| <li class="item-0">first item</li> |
| <class 'pyquery.pyquery.PyQuery'> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <class 'pyquery.pyquery.PyQuery'> |
| <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> |
| <class 'pyquery.pyquery.PyQuery'> |
| <li class="item-1 active"><a href="link4.html">fourth item</a></li> |
| <class 'pyquery.pyquery.PyQuery'> |
| <li class="item-0"><a href="link5.html">fifth item</a></li> |
| <class 'pyquery.pyquery.PyQuery'> |
Getting information
- Getting attributes with the attr() method
| from pyquery import PyQuery |
| |
| html = ''' |
| <div id="container"> |
| <ul class="list"> |
| <li class="item-0">first item</li> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> |
| <li class="item-1 active"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a></li> |
| </ul> |
| </div> |
| ''' |
| |
| doc = PyQuery(html) |
| a = doc('.item-0.active a') |
| print( |
| a, |
| type(a), |
| a.attr('href'), |
| sep='\n' |
| ) |
| # Output: |
| <a href="link3.html"><span class="bold">third item</span></a> |
| <class 'pyquery.pyquery.PyQuery'> |
| link3.html |
- Getting text with the text() method
| from pyquery import PyQuery |
| |
| html = ''' |
| <div id="container"> |
| <ul class="list"> |
| <li class="item-0">first item</li> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> |
| <li class="item-1 active"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a></li> |
| </ul> |
| </div> |
| ''' |
| |
| doc = PyQuery(html) |
| li = doc('li') |
| |
| print( |
| li.html(), |
| li.text(), |
| type(li.text()), |
| sep='\n' |
| ) |
| # Output: |
| first item |
| first item second item third item fourth item fifth item |
| <class 'str'> |
Node manipulation
Adding and removing classes
add_class() adds a class to the selected nodes, and remove_class() removes one.
| from pyquery import PyQuery |
| |
| html = ''' |
| <div id="container"> |
| <ul class="list"> |
| <li class="item-0">first item</li> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> |
| <li class="item-1 active"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a></li> |
| </ul> |
| </div> |
| ''' |
| |
| doc = PyQuery(html) |
| li = doc('.item-0.active') |
| |
| print(li) |
| print(li.remove_class('active')) |
| print(li.add_class('active')) |
| # Output: |
| <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> |
| |
| <li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li> |
| |
| <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> |
| |
The attr, text, and html methods
| from pyquery import PyQuery |
| |
| html = ''' |
| <div id="container"> |
| <ul class="list"> |
| <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> |
| </ul> |
| </div> |
| ''' |
| |
| doc = PyQuery(html) |
| |
| li = doc('.item-0.active') |
| print(li) |
| |
| li.attr('name', 'link') |
| print(li) |
| |
| li.text('change item') |
| print(li) |
| |
| li.html('<span>change item</span>') |
| print(li) |
| # Output: |
| <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> |
| |
| <li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li> |
| |
| <li class="item-0 active" name="link">change item</li> |
| |
| <li class="item-0 active" name="link"><span>change item</span></li> |
Removing nodes
| from pyquery import PyQuery |
| |
| html = ''' |
| <div class="LeeHua"> |
| LiYihua |
| <ul class="201802004731">liyihua</ul> |
| </div> |
| ''' |
| |
| doc = PyQuery(html) |
| Leehua = doc('.LeeHua') |
| print("Output before removing the ul node:\n"+Leehua.text()) |
| |
| Leehua.find('ul').remove() |
| print("Output after removing the ul node:\n"+Leehua.text()) |
| # Output: |
| Output before removing the ul node: |
| LiYihua |
| liyihua |
| Output after removing the ul node: |
| LiYihua |
Pseudo-class selectors
Example:
| from pyquery import PyQuery |
| |
| html = ''' |
| <div class="wrap"> |
| <div id="container"> |
| <ul class="list"> |
| <li class="item-0">first item</li> |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> |
| <li class="item-1 active"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a></li> |
| </ul> |
| </div> |
| </div> |
| ''' |
| |
| doc = PyQuery(html) |
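| |
| # The li that is the first child of its parent |
| li = doc('li:first-child') |
| print(li) |
| |
| # The li that is the last child of its parent |
| li = doc('li:last-child') |
| print(li) |
| |
| # The second li (nth-child counts from 1) |
| li = doc('li:nth-child(2)') |
| print(li) |
| |
| # The li elements after the third one (gt uses a zero-based index) |
| li = doc('li:gt(2)') |
| print(li) |
| |
| # The li elements at even positions |
| li = doc('li:nth-child(2n)') |
| print(li) |
| |
| # The li elements whose text contains "second" |
| li = doc('li:contains(second)') |
| print(li) |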
| # Output: |
| <li class="item-0">first item</li> |
| |
| <li class="item-0"><a href="link5.html">fifth item</a></li> |
| |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| |
| <li class="item-1 active"><a href="link4.html">fourth item</a></li> |
| <li class="item-0"><a href="link5.html">fifth item</a></li> |
| |
| <li class="item-1"><a href="link2.html">second item</a></li> |
| <li class="item-1 active"><a href="link4.html">fourth item</a></li> |
| |
| <li class="item-1"><a href="link2.html">second item</a></li> |
CSS selector reference: http://www.w3school.com.cn/cssref/css_selectors.asp