Python3爬虫(七) 解析库的使用之pyquery
Infi-chu:
http://www.cnblogs.com/Infi-chu/
pyquery专门针对CSS和jQuery的操作处理
1.初始化
字符串初始化
from pyquery import PyQuery as pq doc = pq(html) # 传入html文本 print(doc('li'))
URL初始化
from pyquery import PyQuery as pq doc = pq(url='www.baidu.com') print(doc('title')) # 另一种方法 from pyquery import PyQuery as pq import requests doc = pq(requests.get('http://www.baidu.com')) print(doc('title'))
文件初始化
from pyquery import PyQuery as pq doc = pq(filename='text.html') print(doc('li'))
2.基本CSS选择器
from pyquery import PyQuery as pq doc = pq(url='http://www.baidu.com') print(doc(#head .head_wrapper a)) print(type(doc(#head .head_wrapper a)))
3.查找节点
子节点
from pyquery import PyQuery as pq doc = pq(url='http://www.baidu.com') items = doc('.head_wrapper') print(type(items)) print(items) lis = items.find('a') # find()是查找符合条件的所有子孙节点,只查找子节点的可以使用children() print(type(lis)) print(lis)
父节点
使用parent()方法获取该节点的父节点
使用parents()方法获取该节点的祖先节点
兄弟节点
使用siblings()方法获取兄弟节点
4.遍历
from pyquery import PyQuery as pq doc = pq(html) lis = doc('li').items() print(type(lis)) for li in lis: print(li,type(li))
5.获取信息
获取属性
使用attr()方法获取属性(值)
from pyquery import PyQuery as pq doc = pq(url='http://www.baidu.com') items = doc('.head_wrapper') print(items.attr('href')) # 也可以写成 print(items.attr.href) # 获取所有a的属性 from pyquery import PyQuery as pq doc = pq(url='http://www.baidu.com') a = doc('a') for i in a: print(i.attr.href)
获取文本
使用text()方法获取纯文本纯字符串内容
from pyquery import PyQuery as pq doc = pq(url = 'http://www.baidu.com') a = doc('a') print(i.text()) # 无需遍历
使用html()方法保留标签内部的东西
from pyquery import PyQuery as pq doc = pq(url = 'http://www.baidu.com') a = doc('a') for i in a: print(i) print(i.html())
6.节点操作
addClass和removeClass
from pyquery import PyQuery as pq html = ''' <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class"bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0 active"><a href="link5.html">fifth item</a></li> </ul> </div> </div> ''' doc = pq(html) li = doc('.item-0 active') print(li) li.removeClass('active') print(li) li.addClass('active') print(li)
attr、text和html
from pyquery import PyQuery as pq html = ''' <div class="div"> <p>ASD</p> <ul class="list"> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> </ul> </div> ''' doc = pq(html) li = doc('.item-0 active') print(li) li.attr('name','link') print(li) li.text('changed item') print(li) li.html('<span>changed item</span>') print(li)
remove()
from pyquery import PyQuery as pq doc = pq(html) res = doc('.div') print(res.find('ul').remove().text())
7.伪类选择器
待完善