BeautifulSoup

Basic usage of BeautifulSoup:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''


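# Next, create the BeautifulSoup object. There are two ways to do this.
# The first is from a string:
soup = BeautifulSoup(html, 'lxml')
# The second is from a file, assuming the html string above has been saved as index.html:
# soup = BeautifulSoup(open('index.html'), 'lxml')

# The document is converted to Unicode, and HTML entities are converted to Unicode characters.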
print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
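
The file-based route mentioned in the comments above can be sketched as follows. This is only an illustration: the markup is a shortened stand-in for the sample document, the temporary file plays the role of index.html, and 'html.parser' is used instead of 'lxml' so the snippet runs without the lxml package installed.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Shortened stand-in for the sample document (an assumption for this sketch).
html = ("<html><head><title>The Dormouse's story</title></head>"
        "<body><p class=\"title\">hi</p></body></html>")

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file standing in for index.html.
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)

    # An open file handle can be passed in place of a string.
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

soup_from_string = BeautifulSoup(html, "html.parser")

title_from_file = soup_from_file.title.string
title_from_string = soup_from_string.title.string
print(title_from_file)
print(title_from_file == title_from_string)
```

Opening the file in a with block closes the handle once parsing is done, which the one-line open('index.html') form does not do.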

The following example gives a quick overview of bs4 and shows what it can do:

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
print(soup.p)
print(soup.p["class"])
print(soup.a)
print(soup.find_all('a'))
print(soup.find(id='link3'))

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

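Tag selectors

Add the following code to the quick-start example:
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)

Writing soup.<tag name> returns that tag together with its contents.
One thing to watch out for: if the document contains several tags with that name, only the first one is returned. Above, soup.p returns just the first p tag even though the document contains several.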
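
A small sketch of the "first match only" behaviour described above. The three-paragraph markup is made up for the illustration, and 'html.parser' is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Made-up markup with three <p> tags.
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time...</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p               # attribute access: the first <p> only
all_p = soup.find_all("p")     # every <p>, in document order

print(first_p["class"])        # class is multi-valued, so this is a list
print(len(all_p))
print(first_p is all_p[0])
```

When every matching tag is needed, find_all() is the tool to reach for; soup.p behaves like find('p').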



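Getting content: soup.title.string

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
"""
The three attributes .string, .strings and .stripped_strings.
.string follows a distinctive rule. If a tag contains no other tags, .string returns the text inside it. If a tag contains exactly one child tag, .string returns the text of that innermost tag. If a tag contains more than one child node, it is ambiguous which child .string should refer to, so .string is None.
"""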
html = '''                             
<html><head><title>The Dormouse's story
<body>                                 
<p class="title"><b>The Dormouse's stor

<p class="story">Once upon a time there
<a href="http://example.com/elsie" clas
<a href="http://example.com/lacie" clas
<a href="http://example.com/tillie" cla
and they lived at the bottom of a well.
<p class="story">...</p>               
'''
soup = BeautifulSoup(html, 'lxml',)
# To get the text inside a tag, use .string
print(soup.head.string)
print(soup.title.string)
print(soup.html.string)
print('-' * 50)
# .strings is for tags that contain more than one string; it can be iterated over in a loop.
for string in soup.strings:
    print(string)
print('+' * 50)
# .stripped_strings skips whitespace-only strings and strips leading and trailing whitespace from the rest.
for q in soup.stripped_strings:
    print(q)

Output:

The Dormouse's story

The Dormouse's story

None
--------------------------------------------------
The Dormouse's story



The Dormouse's stor


Once upon a time there

...


++++++++++++++++++++++++++++++++++++++++++++++++++
The Dormouse's story
The Dormouse's stor
Once upon a time there
...
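
The three .string cases from the docstring above can be checked side by side. This is a minimal sketch on made-up markup (the ids one, two and three are hypothetical), again using 'html.parser'.

```python
from bs4 import BeautifulSoup

html = ('<div><p id="one">plain text</p>'
        '<p id="two"><b>nested text</b></p>'
        '<p id="three">two <i>children</i></p></div>')
soup = BeautifulSoup(html, "html.parser")

one = soup.find(id="one").string      # only child is a string
two = soup.find(id="two").string      # only child is <b>, whose only child is a string
three = soup.find(id="three").string  # a string plus an <i> tag: ambiguous, so None

# When .string is None, .stripped_strings still reaches every piece of text.
pieces = list(soup.find(id="three").stripped_strings)

print(one)
print(two)
print(three)
print(pieces)
```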

Nested selection

Tag selectors can be chained to reach a nested tag directly:

print(soup.head.title.string)

 


Getting the name: soup.title.name
 
#!/usr/bin/env python
# -*- coding:utf-8 -*-
"""
Tag: a Tag object corresponds to a tag in the original XML or HTML document, for example <title>The Dormouse's story</title> or <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
Extract title: print(soup.title)
Extract a: print(soup.a)
Extract p: print(soup.p)

The two most important attributes of a Tag are name and attributes. Every Tag has a name, available as .name.
"""

from bs4 import BeautifulSoup

html = '''                             
<html><head><title>The Dormouse's story
<body>                                 
<p class="title"><b>The Dormouse's stor

<p class="story">Once upon a time there
<a href="http://example.com/elsie" clas
<a href="http://example.com/lacie" clas
<a href="http://example.com/tillie" cla
and they lived at the bottom of a well.
<p class="story">...</p>               
'''

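# Next, create the BeautifulSoup object
# from a string:
soup = BeautifulSoup(html, 'lxml', )
print(soup.name)   # The soup object itself is special: its name is [document]. For every other tag, .name is the name of the tag itself.
print(soup.title.name)
print(soup.p.sting)  # 'sting' is a misspelling of .string: bs4 treats it as a lookup for a <sting> child tag, finds none, and prints None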

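"""
Tag: the name can be read and also changed. Changing it affects all HTML generated from the current BeautifulSoup object.
"""
soup.title.name = "cc"
print(soup.title)  # None: no <title> tag is left after the rename
print(soup.cc)  # the former <title> tag, now renamed to <cc>
# Now for Tag attributes: <p class="title"><b>The Dormouse's story</b></p> has a "class" attribute whose value is "title". Tag attributes are accessed the same way as dictionary entries.
print(soup.p['class'])
print(soup.p.get('class'))

# Dot access also works: .attrs returns all attributes of a Tag.


# As with name, the attributes and contents of a tag can be modified.
soup.p['class'] = 'cc'
print(soup.p)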

Output:

[document]
title
None
None
<cc>The Dormouse's story
</cc>
['title']
['title']
<p class="cc"><b>The Dormouse's stor

</b></p>

 

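Getting attributes

print(soup.p.attrs['name'])
print(soup.p['name'])
Both forms return the value of the p tag's name attribute, provided the tag has one. If the attribute is missing, as it is in the sample document above, both raise KeyError.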
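
A short sketch of attribute access, including the missing-attribute case. The name="dormouse" attribute is a hypothetical addition for the illustration; the sample document in this article does not have it.

```python
from bs4 import BeautifulSoup

# name="dormouse" is a made-up attribute so that the lookup has something to return.
soup = BeautifulSoup('<p class="title" name="dormouse">text</p>', "html.parser")
p = soup.p

name_value = p["name"]         # same value as p.attrs["name"]
missing = p.get("id")          # .get() returns None instead of raising
all_attrs = p.attrs            # dictionary of every attribute on the tag

try:
    p["id"]                    # subscript access raises KeyError when the attribute is absent
    raised = False
except KeyError:
    raised = True

print(name_value)
print(missing)
print(all_attrs)
print(raised)
```

p.get() is the safer choice when an attribute may be absent, for the same reason dict.get() is.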

 

Parent and ancestor nodes

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
html = '''                             
<html><head><title>The Dormouse's story
<body>                                 
<p class="title"><b>The Dormouse's stor

<p class="story">Once upon a time there
<a href="http://example.com/elsie" clas
<a href="http://example.com/lacie" clas
<a href="http://example.com/tillie" cla
and they lived at the bottom of a well.
<p class="story">...</p>               
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(soup.title.parent)  # parent node
# The .parents attribute yields every ancestor of an element in turn. Here it walks from the <a> tag up to the root of the document.
print(soup.a)
for p in soup.a.parents:
    print(p.name)

Output:

<title>The Dormouse's story
</title>
<head><title>The Dormouse's story
</title></head>
<a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a>
p
body
html
[document]

Sibling nodes

soup.a.next_siblings: all following siblings
soup.a.previous_siblings: all preceding siblings
soup.a.next_sibling: the next sibling node
soup.a.previous_sibling: the previous sibling node

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
html = '''                             
<html><head><title>The Dormouse's story
<body>                                 
<p class="title"><b>The Dormouse's stor

<p class="story">Once upon a time there
<a href="http://example.com/elsie" clas
<a href="http://example.com/lacie" clas
<a href="http://example.com/tillie" cla
and they lived at the bottom of a well.
<p class="story">...</p>               
'''
soup = BeautifulSoup(html, 'lxml')
# Sibling nodes: in the soup.prettify() output we can see that <a> has several siblings. Siblings are nodes on the same level as the current node. .next_sibling returns the next sibling and .previous_sibling returns the previous one; if there is no such node, the result is None.
#

print(soup.p.next_sibling)
print('-' * 50)
print(soup.p.previous_sibling)
print('#' * 50)
print(soup.p.next_sibling.next_sibling)
for i in soup.p.next_siblings:
    print(repr(i))

Output:

<p class="story">Once upon a time there
<a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a></p>
--------------------------------------------------


##################################################


<p class="story">Once upon a time there
<a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a></p>
'\n'
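
As the output above shows, a sibling is often just a whitespace or punctuation string rather than a tag. A minimal sketch of the difference between the raw sibling attributes and the find_next_sibling() family, on made-up markup and with 'html.parser':

```python
from bs4 import BeautifulSoup

html = """<p class="story"><a id="link1">Elsie</a>,
<a id="link2">Lacie</a> and
<a id="link3">Tillie</a>;</p>"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="link1")

raw_next = first.next_sibling               # the text node between the two links
tag_next = first.find_next_sibling("a")     # skips text nodes and returns the next <a>
later_ids = [a["id"] for a in first.find_next_siblings("a")]
nothing_before = first.previous_sibling     # link1 is the first child of <p>

print(repr(raw_next))
print(tag_next["id"])
print(later_ids)
print(nothing_before)
```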
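
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
# To move to the next or previous node, use .next_element and .previous_element. Unlike .next_sibling and .previous_sibling, they are not limited to siblings: they follow parse order across all levels. For example, in <head><title>The Dormouse's story</title></head> the next element after <head> is <title>.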
html = '''                             
<html><head><title>The Dormouse's story
<body>                                 
<p class="title"><b>The Dormouse's stor

<p class="story">Once upon a time there
<a href="http://example.com/elsie" clas
<a href="http://example.com/lacie" clas
<a href="http://example.com/tillie" cla
and they lived at the bottom of a well.
<p class="story">...</p>               
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.head)
print(soup.head.next_element)
# To walk through all following or preceding nodes, use the .next_elements and .previous_elements iterators, which move forward or backward through the parsed document.
print('-' * 50)
for element in soup.a.next_elements:
    print(repr(element))

Output:

<head><title>The Dormouse's story
</title></head>
<title>The Dormouse's story
</title>
--------------------------------------------------
'...'
'\n'
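
To make the contrast between "next element" and "next sibling" concrete, here is a small sketch on made-up markup. It uses 'html.parser', which keeps the markup as written; the names next_el, next_sib and following are only for this illustration.

```python
from bs4 import BeautifulSoup, Tag

markup = ("<head><title>The Dormouse's story</title></head>"
          "<body><p>text</p></body>")
soup = BeautifulSoup(markup, "html.parser")
head = soup.head

next_el = head.next_element      # parse order: the first node inside <head>
next_sib = head.next_sibling     # same level: the <body> tag

# Walk everything that was parsed after <head>, tags and strings alike.
following = [el.name if isinstance(el, Tag) else str(el)
             for el in head.next_elements]

print(next_el.name)
print(next_sib.name)
print(following)
```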

Child nodes and descendant nodes:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
html = '''                             
<html><head><title>The Dormouse's story
<body>                                 
<p class="title"><b>The Dormouse's stor

<p class="story">Once upon a time there
<a href="http://example.com/elsie" clas
<a href="http://example.com/lacie" clas
<a href="http://example.com/tillie" cla
and they lived at the bottom of a well.
<p class="story">...</p>               
'''
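# Child nodes: .contents and .children are two very important Tag attributes
soup = BeautifulSoup(html, 'lxml')
print(soup.head.contents)
print(len(soup.head.contents))
print(soup.head.contents[0].string)
# Strings have no .contents attribute because they cannot have child nodes.
# .children returns an iterator over the child nodes, which can be used in a loop.
for child in soup.head.children:
    print(child)
print('-' * 50)
# .contents and .children only cover the direct children of a Tag.
# .descendants iterates recursively over all of a Tag's descendants.
for c in soup.head.descendants:
    print(c)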

Output:

[<title>The Dormouse's story
</title>]
1
The Dormouse's story

<title>The Dormouse's story
</title>
--------------------------------------------------
<title>The Dormouse's story
</title>
The Dormouse's story
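
The difference between direct children and descendants is easier to see on a tag with some nesting. A minimal sketch on made-up markup, using 'html.parser':

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<ul><li>Foo <b>bold</b></li><li>Bar</li></ul>", "html.parser")
ul = soup.ul

direct = [child.name for child in ul.children]   # the two <li> tags only
n_contents = len(ul.contents)                    # .contents is the same nodes as a list

# .descendants goes all the way down, mixing tags and strings.
deep_tags = [d.name for d in ul.descendants if isinstance(d, Tag)]
deep_text = [str(d) for d in ul.descendants if not isinstance(d, Tag)]

print(direct)
print(n_contents)
print(deep_tags)
print(deep_text)
```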

 

Standard selectors

find_all(name,attrs,recursive,text,**kwargs)

Searches the document by tag name, attributes, or text content.

Using name:

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print('-' * 50)
print(type(soup.find_all('ul')[0]))

Output:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
--------------------------------------------------
<class 'bs4.element.Tag'>

The results can be searched again with find_all, for example to get all the li tags under each ul:

for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

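attrs takes a dictionary of attribute names and values to match. class needs special care because it is a reserved word in Python: as a keyword argument it must be written with a trailing underscore, soup.find_all(class_='element'), while the dictionary form uses the real attribute name, soup.find_all(attrs={'class': 'element'}). Common attributes such as id can be passed directly as keyword arguments without attrs.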
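
The attribute filters just described can be sketched like this. The markup is a shortened, made-up variant of the panel example, parsed with 'html.parser'.

```python
from bs4 import BeautifulSoup

html = ('<ul class="list" id="list-1">'
        '<li class="element">Foo</li><li class="element">Bar</li></ul>'
        '<ul class="list list-small" id="list-2">'
        '<li class="element">Jay</li></ul>')
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find_all(attrs={"class": "element"})  # dictionary form: real attribute name
by_kwarg = soup.find_all(class_="element")            # keyword form: class_ with an underscore
by_id = soup.find_all(id="list-1")                    # id needs no special handling
small = soup.find_all("ul", class_="list-small")      # matches one class among several

print([li.get_text() for li in by_attrs])
print(by_attrs == by_kwarg)
print(by_id[0]["class"])
print(small[0]["id"])
```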

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))

Output (text='Foo' returns the matching strings, not the tags that contain them):

['Foo', 'Foo']
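
A sketch of text matching in a little more detail. It assumes Beautiful Soup 4.4.0 or later, where the argument is also available under the name string (text is the older spelling); the markup is made up and parsed with 'html.parser'.

```python
import re

from bs4 import BeautifulSoup

html = ('<ul><li class="element">Foo</li>'
        '<li class="element">Bar</li>'
        '<li class="element">Foo</li></ul>')
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Foo")                        # the strings themselves
pattern = soup.find_all(string=re.compile("^(Foo|Bar)$"))  # a regular expression also works
tags = soup.find_all("li", string="Foo")                   # tag name + string: the <li> tags
owners = [s.parent.name for s in exact]                    # step from a string back to its tag

print(exact)
print(pattern)
print(tags)
print(owners)
```

Combining a tag name with string is usually the more convenient form, since the result is a list of tags that can be searched or read further.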

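Other uses:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
import re
"""
find_all searches the Tag nodes below the current Tag and checks whether each one matches the filter.
find_all(name, attrs, recursive, text, **kwargs)

name: finds all tags with the given name; string objects (text nodes) are ignored. name can be a string, a regular expression, a list, True, or a function. The simplest filter is a string: pass one to a search method and BeautifulSoup looks for tags whose name matches it exactly.
"""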

html = '''                             
<html><head><title>The Dormouse's story
<body>                                 
<p class="title"><b>The Dormouse's stor

<p class="story">Once upon a time there
<a href="http://example.com/elsie" clas
<a href="http://example.com/lacie" clas
<a href="http://example.com/tillie" cla
and they lived at the bottom of a well.
<p class="story">...</p>               
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('b'))
# If a regular expression is passed, BeautifulSoup filters tag names with its search() method.
for tag in soup.find_all(re.compile('^b')):
    print(tag.name)
print('*' * 50)
# If a list is passed, BeautifulSoup returns anything that matches any item in the list.
print(soup.find_all(['a', 'b']))
print('@' * 50)
# True matches every tag.
for ti in soup.find_all(True):
    print(ti)

print('#' * 50)
# If none of these filters fits, define a function that takes a single Tag argument and returns True when the tag matches and False otherwise.

"""
def hasClass_id(tag):
    return tag.has_attr('class') and tag.has_attr('id')
print(soup.find_all(hasClass_id))
"""

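# kwargs: any keyword argument that is not one of the built-in search parameters is treated as a filter on the tag attribute of that name. The value can be a string, a regular expression, a list, or True. With an id argument, BeautifulSoup filters on each tag's "id" attribute.
print(soup.find_all(id='link2'))


# With an href argument, BeautifulSoup filters on each tag's 'href' attribute.
print(soup.find_all(href=re.compile('elsie')))
print(soup.find_all(id=True))

# To filter by class, append an underscore, because class is a Python keyword.
print(soup.find_all('a', class_='sister'))
print('c' * 50)
# Several keyword arguments can be combined to filter on multiple attributes at once:
print(soup.find_all(href=re.compile('elsie'), id='link1'))
"""
# Some attribute names cannot be used as keyword arguments, such as HTML5 data-* attributes. Pass them through attrs instead:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')

data_soup.find_all(attrs={"data-foo": "value"})
"""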

Output:

[<b>The Dormouse's stor

</b>]
body
b
**************************************************
[<b>The Dormouse's stor

</b>, <a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a>]
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
<html><head><title>The Dormouse's story
</title></head><body>
<p class="title"><b>The Dormouse's stor

</b></p><p class="story">Once upon a time there
<a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a></p>
</body></html>
<head><title>The Dormouse's story
</title></head>
<title>The Dormouse's story
</title>
<body>
<p class="title"><b>The Dormouse's stor

</b></p><p class="story">Once upon a time there
<a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a></p>
</body>
<p class="title"><b>The Dormouse's stor

</b></p>
<b>The Dormouse's stor

</b>
<p class="story">Once upon a time there
<a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a></p>
<a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a>
##################################################
[]
[<a a="" and="" at="" bottom="" cla="" clas="" class="story" href="http://example.com/elsie" lived="" of="" the="" they="" well.="">...</a>]
[]
[]
cccccccccccccccccccccccccccccccccccccccccccccccccc
[]
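
The function filter is left commented out in the script above, so here is a runnable sketch of it together with the regular-expression and list filters. The markup is made up so that every filter has something to match, and 'html.parser' is used; has_class_and_id is a name chosen for this illustration.

```python
import re

from bs4 import BeautifulSoup

html = ('<p class="story">'
        '<a href="/elsie" class="sister" id="link1">Elsie</a>'
        '<a href="/lacie" class="sister">Lacie</a>'
        '<b id="bold">x</b></p>')
soup = BeautifulSoup(html, "html.parser")


def has_class_and_id(tag):
    """Return True for tags that define both a class and an id."""
    return tag.has_attr("class") and tag.has_attr("id")


matched_ids = [tag["id"] for tag in soup.find_all(has_class_and_id)]
b_names = [tag.name for tag in soup.find_all(re.compile("^b"))]   # names starting with b
a_or_b = [tag.name for tag in soup.find_all(["a", "b"])]          # any name in the list
with_id = [tag.name for tag in soup.find_all(id=True)]            # any tag that has an id

print(matched_ids)
print(b_names)
print(a_or_b)
print(with_id)
```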

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
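# find_all() returns every match, which can be slow on a large document tree. If you do not need all the results, use the limit parameter to cap how many are returned. Once the number of matches reaches limit, the search stops and returns.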
print(soup.find_all('a', limit=2))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

 

 

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
# When find_all() is called on a Tag, BeautifulSoup searches all of that tag's descendants. To search only the tag's direct children, pass recursive=False.
print(soup.find_all('title'))
print(soup.find_all('title', recursive=False))

Output:

[<title>The Dormouse's story</title>]
[]
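
The empty list above is expected: title is not a direct child of the soup object. A sketch where recursive=False and limit both return something, using a made-up nested list and 'html.parser':

```python
from bs4 import BeautifulSoup

html = '<ul id="outer"><li>one<ul><li>nested</li></ul></li><li>two</li></ul>'
soup = BeautifulSoup(html, "html.parser")
outer = soup.find(id="outer")

all_items = outer.find_all("li")                    # descendants at any depth
top_items = outer.find_all("li", recursive=False)   # direct children only
first_two = outer.find_all("li", limit=2)           # stop after two matches

print(len(all_items))
print([li.get_text() for li in top_items])
print([li.get_text() for li in first_two])
```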

 

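find

find(name,attrs,recursive,text,**kwargs)
find returns the first matching element, or None if nothing matches.

Some related methods:
find_parents() returns all matching ancestors; find_parent() returns the nearest matching ancestor.
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
find_all_next() returns all matching nodes after the current node; find_next() returns the first one.
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first one.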
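
A compact sketch of this family of methods, starting from the middle link of a made-up three-link paragraph (parsed with 'html.parser'):

```python
from bs4 import BeautifulSoup

html = ('<p class="story"><a id="link1">Elsie</a>'
        '<a id="link2">Lacie</a><a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")
middle = soup.find(id="link2")

parent_name = middle.find_parent("p").name
after = [a["id"] for a in middle.find_next_siblings("a")]
before = [a["id"] for a in middle.find_previous_siblings("a")]
next_a = middle.find_next("a")["id"]          # first <a> later in the document
prev_a = middle.find_previous("a")["id"]      # first <a> earlier in the document
nothing = soup.find("table")                  # no match: find() gives None, not an error

print(parent_name, after, before, next_a, prev_a, nothing)
```

Because find() returns None when nothing matches, chaining straight onto its result (for example soup.find('table').get_text()) raises AttributeError on a miss, so a None check is worth adding in real code.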

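Prefer the lxml parser; fall back to html.parser when necessary.
Tag selectors (soup.tag) offer only weak filtering but are fast.
Use find() and find_all() to match a single result or multiple results.
If you are comfortable with CSS selectors, use select().
Remember the common ways of getting attribute values and text.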

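CSS selectors

Pass a CSS selector straight to select() to make a selection.
Anyone familiar with front-end work probably knows CSS already, and the syntax here is the same:
. selects by class, # selects by id
tag1,tag2 finds all tag1 and all tag2 elements
tag1 tag2 finds all tag2 elements inside tag1
[attr] finds all tags that have the given attribute
[attr=value] for example, [target=_blank] finds all tags with target=_blank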
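
Each rule in the list above, tried on a small made-up document. The sketch assumes a Beautiful Soup release that ships with the soupsieve package (4.7 or later), which is what backs select(), and uses 'html.parser'.

```python
from bs4 import BeautifulSoup

html = ('<div class="panel"><h4 id="head">Hello</h4>'
        '<a href="/x" target="_blank">x</a><a href="/y">y</a>'
        '<ul><li class="element">Foo</li></ul></div>')
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select(".element")            # . for class
by_id = soup.select("#head")                  # # for id
either = soup.select("h4, li")                # tag1,tag2
nested = soup.select("div li")                # tag1 tag2
has_attr = soup.select("[target]")            # [attr]
attr_eq = soup.select('a[target="_blank"]')   # [attr=value]

print([t.name for t in by_class])
print([t.name for t in by_id])
print([t.name for t in either])
print([t.name for t in nested])
print([t["href"] for t in has_attr])
print([t["href"] for t in attr_eq])
```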

 

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

Output:

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>

Getting content

get_text() returns the text content of a tag.

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())

Output:

Foo
Bar
Jay
Foo
Bar
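
get_text() also accepts a separator and a strip flag, which helps when a tag holds several pieces of text. A minimal sketch on made-up markup with stray whitespace, using 'html.parser':

```python
from bs4 import BeautifulSoup

html = '<ul><li class="element"> Foo </li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

raw = soup.ul.get_text()                               # all text joined as is
joined = soup.ul.get_text(separator="|", strip=True)   # stripped pieces joined with |
items = [li.get_text(strip=True) for li in soup.select("li")]

print(repr(raw))
print(joined)
print(items)
```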

Getting attributes
Attributes can be read with [attribute name] or attrs[attribute name].

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
Output:
list-1
list-1
list-2
list-2

 

 

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
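# Elements can also be located with CSS selectors. In CSS, a tag name is written as is, a class name is prefixed with '.', and an id is prefixed with '#'. The same syntax is used here to filter elements via soup.select(), which returns a list.
# 1. Searching by tag name (finds a tag directly, as well as tags nested under or adjacent to another tag)
# Direct lookup
print(soup.select('title'))
# Multi-level lookup
print(soup.select('html head title'))
# Direct children: the title tag directly under head
print(soup.select('head > title'))
# The tag with id='link1' directly under p
print(soup.select('p > #link1'))
# Siblings
# All following siblings of id='link1' that have class=sister
print(soup.select('#link1 ~ .sister'))

# The sibling immediately after id="link1" that has class=sister
print(soup.select('#link1 + .sister'))

Output:

[<title>The Dormouse's story</title>]
[<title>The Dormouse's story</title>]
[<title>The Dormouse's story</title>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]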

 

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.sister'))
print(soup.select('[class~=sister]'))

# Search by tag id
print(soup.select('#link1'))
print(soup.select('a#link2'))

# Search by whether an attribute is present
print(soup.select('a[href]'))

# Search by attribute value
print(soup.select('a[href="http://example.com/elsie"]'))
print(soup.select('a[href^="http://example.com/"]'))
print(soup.select('a[href*=".com/el"]'))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

 

 
posted @ 2018-11-29 13:52  zqxqx