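The pyquery library

pyquery is another library for selecting and filtering elements in HTML.

The usual import is from pyquery import PyQuery as pq (note the capital P in PyQuery).

1. Initializing from a string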

html = '''
<div>
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('li'))

pq(html) returns a PyQuery object, doc.

doc('li') then returns all the li tags in doc:

<li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
Output of the string initialization example

 

2. Initializing from a URL

from pyquery import PyQuery as pq
doc = pq(url='http://www.baidu.com')
print(doc('head'))

 

3. Initializing from a file

from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
print(doc('li'))

 

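4. CSS selectors

For example, .list .item-0.active (with a space after .list) selects tags that carry both the item-0 and active classes and are nested inside a tag with the list class.

li.siblings() returns all tags at the same level as li, excluding li itself.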

 

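5. Finding elements

Use find to search inside a selection. For example, items = doc('.list') followed by lis = items.find('li') finds the li tags under the tag with the list class.

Parent element: container = items.parent() returns the tag one level up.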

 

6. Iterating over elements

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-3 active"><a href="link4.html">fourth item</a></li>
             <li class="item-4"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-2.active')
print(li)

# Second example: iterate over every li with items()
from pyquery import PyQuery as pq
doc = pq(html)
lis = doc('li').items()
print(type(lis))
for li in lis:
    print(li)

items() returns a generator (generators are covered in the Python basics notes).

Iterating over lis yields each element in turn:

<class 'generator'>
<li class="item-0">first item</li>
             
<li class="item-1"><a href="link2.html">second item</a></li>
             
<li class="item-2 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             
<li class="item-3 active"><a href="link4.html">fourth item</a></li>
             
<li class="item-4"><a href="link5.html">fifth item</a></li>
Output of iterating over elements

 

7. Getting attributes

Use a.attr('href') or a.attr.href

from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-2.active a')
print(a)
print(a.attr('href'))
print(a.attr.href)

The output is:

<a href="link3.html"><span class="bold">third item</span></a>
link3.html
link3.html

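This selects the a tag and then looks up the href attribute on that tag. If the attribute is not there, the result is None; pyquery does not go looking for it anywhere else.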

 

8. Getting text content

from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-2.active a')
print(a)
print(a.text())

 

text() returns all the text inside the a tag, including the text of its child tags. For example, given

<li class="item-2 active"><a href="link3.html">hello world<span class="bold">third item</span></a></li>

the result contains both hello world and third item.

 

9. DOM manipulation

removeClass  addClass

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-2.active')
print(li)
li.removeClass('active')
print(li)
li.addClass('active')
print(li)

 

removeClass('active') removes active from the class attribute of the selected li tag.

addClass('active') then adds it back.

 

attr   css

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-2.active')
print(li)
li.attr('name', 'link')
print(li)
li.css('font-size', '14px')
print(li)

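attr('name', 'link') adds a name attribute to the tag, and css('font-size', '14px') adds a font-size declaration to its style attribute.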

 

remove

html = '''
<div class="wrap">
    Hello, World
    <p>This is a paragraph.</p>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
wrap.find('p').remove()
print(wrap.text())


#Hello, World This is a paragraph.
#Hello, World

After the p child is removed, text() returns only the text directly under the div tag, without the child tag's content.

 

10. Pseudo-class selectors

from pyquery import PyQuery as pq
doc = pq(html)                  # html is the list markup from section 6, not the .wrap snippet above
li = doc('li:first-child')
print(li)                       # <li class="item-0">first item</li>
li = doc('li:last-child')
print(li)                       # <li class="item-4"><a href="link5.html">fifth item</a></li>
li = doc('li:nth-child(2)')
print(li)                       # <li class="item-1"><a href="link2.html">second item</a></li>
li = doc('li:gt(2)')
print(li)                       # <li class="item-3 active"><a href="link4.html">fourth item</a></li>
                                # <li class="item-4"><a href="link5.html">fifth item</a></li>
li = doc('li:nth-child(2n)')
print(li)                       # <li class="item-1"><a href="link2.html">second item</a></li>
                                # <li class="item-3 active"><a href="link4.html">fourth item</a></li>
li = doc('li:contains(second)')
print(li)                       # <li class="item-1"><a href="link2.html">second item</a></li>

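Note the indexing: nth-child counts from 1 (nth-child(2) is the second li), while gt(n) uses a 0-based index (gt(2) selects the items whose index is greater than 2, i.e. the fourth and fifth).

:contains(xxx) selects tags whose text contains xxx.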

 

posted @ 2020-08-31 15:38  yxlll