Python 3 Web Scraping (7): Using the Parsing Library pyquery
Infi-chu:
http://www.cnblogs.com/Infi-chu/
pyquery is an HTML parsing library built around CSS selectors and a jQuery-style API.
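1. Initialization
Initializing from a string
```python
from pyquery import PyQuery as pq

html = '<ul><li>first item</li><li>second item</li></ul>'  # sample HTML text
doc = pq(html)  # pass in the HTML text
print(doc('li'))
```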
Initializing from a URL
```python
from pyquery import PyQuery as pq

doc = pq(url='http://www.baidu.com')  # the URL must include the scheme
print(doc('title'))

# Alternative: fetch the page with requests and pass in the response text
import requests

doc = pq(requests.get('http://www.baidu.com').text)
print(doc('title'))
```
Initializing from a file
```python
from pyquery import PyQuery as pq

doc = pq(filename='text.html')  # path to a local HTML file
print(doc('li'))
```
2. Basic CSS selectors
```python
from pyquery import PyQuery as pq

doc = pq(url='http://www.baidu.com')
# The selector is passed as a string
print(doc('#head .head_wrapper a'))
print(type(doc('#head .head_wrapper a')))
```
3. Finding nodes
Child nodes
```python
from pyquery import PyQuery as pq

doc = pq(url='http://www.baidu.com')
items = doc('.head_wrapper')
print(type(items))
print(items)
# find() returns every matching descendant; use children() for direct children only
lis = items.find('a')
print(type(lis))
print(lis)
```
Parent nodes
Use parent() to get the direct parent of a node.
Use parents() to get all of its ancestors.
Sibling nodes
Use siblings() to get the sibling nodes.
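4. Iteration
```python
from pyquery import PyQuery as pq

html = '<ul><li>first item</li><li>second item</li></ul>'  # sample HTML text
doc = pq(html)
lis = doc('li').items()  # items() returns a generator of PyQuery objects
print(type(lis))
for li in lis:
    print(li, type(li))
```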
5. Getting information
Getting attributes
Use attr() to get the value of an attribute.
```python
from pyquery import PyQuery as pq

doc = pq(url='http://www.baidu.com')
items = doc('.head_wrapper a')
print(items.attr('href'))  # attr() returns the value for the first matched node
# Equivalent attribute-style access
print(items.attr.href)

# Get the href of every <a>
a = doc('a')
for i in a.items():  # items() yields PyQuery objects, which support attr
    print(i.attr.href)
```
Getting text
Use text() to get the plain-text content as a string.
```python
from pyquery import PyQuery as pq

doc = pq(url='http://www.baidu.com')
a = doc('a')
print(a.text())  # no iteration needed: text() joins the text of all matched nodes
```
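Use html() to get a node's inner HTML, with the tags preserved.
```python
from pyquery import PyQuery as pq

doc = pq(url='http://www.baidu.com')
a = doc('a')
for i in a.items():  # html() only returns the first node's content, so iterate
    print(i)
    print(i.html())
```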
6. Node manipulation
addClass and removeClass
```python
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0 active"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
# No space between the classes: match nodes that carry both item-0 and active
li = doc('.item-0.active')
print(li)
li.removeClass('active')
print(li)
li.addClass('active')
print(li)
```
attr, text, and html
```python
from pyquery import PyQuery as pq

html = '''
<div class="div">
    <p>ASD</p>
    <ul class="list">
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    </ul>
</div>
'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.attr('name', 'link')  # two arguments: set the attribute
print(li)
li.text('changed item')  # with an argument: replace the text content
print(li)
li.html('<span>changed item</span>')  # with an argument: replace the inner HTML
print(li)
```
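remove()
```python
from pyquery import PyQuery as pq

html = '<div class="div"><p>ASD</p><ul class="list"><li>third item</li></ul></div>'  # sample HTML text
doc = pq(html)
res = doc('.div')
res.find('ul').remove()  # remove the <ul> node from the tree
print(res.text())        # the text of the removed node is no longer included
```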
7. Pseudo-class selectors
To be completed.