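The BeautifulSoup Library

(1) Introduction

BeautifulSoup is a flexible, convenient library for parsing web pages. It handles documents efficiently, supports multiple parsers, and lets you extract information from a page without writing regular expressions.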

Commonly used parsers include Python's built-in html.parser, lxml (in HTML or XML mode), and html5lib.

(2) Usage in Detail

1. Basic usage

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())    # pretty-print the parsed tree; the parser has already added the missing closing tags
print(soup.title.string)  # the text inside the <title> tag

The output shows that the incomplete markup (the missing </body> and </html> tags) was completed automatically.

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story
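To see the tag completion in isolation, here is a minimal sketch. It assumes the built-in html.parser (so it runs without lxml installed), and the `broken` fragment is a made-up example rather than part of the sample document.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment in which neither <p> nor <b> is ever closed.
broken = "<p class='story'>Once upon a time <b>there were three"

# html.parser ships with Python; the article's examples use 'lxml' instead.
soup = BeautifulSoup(broken, "html.parser")

# Serializing the tree shows the closing tags the parser added.
repaired = str(soup)
print(repaired)
```

Different parsers repair broken markup in slightly different ways (lxml, for instance, also wraps the fragment in html and body tags), so the exact result depends on the parser you choose.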


2. Tag selectors

The previous snippet briefly used soup.title to locate the title tag. Next, let's look at tag selectors in more detail.

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

The examples that follow all parse this sample HTML.

Selecting elements

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.title, type(soup.title))  # the <title> tag and its type
print(soup.head)  # the <head> tag
print(soup.p)     # the first <p> tag

Output:
<title>The Dormouse's story</title> <class 'bs4.element.Tag'>
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

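The results read naturally. Note from the p result that attribute-style access such as soup.p returns only the first matching tag, even though the document contains several p tags.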
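The first-match behaviour can be checked with a small sketch. The two-paragraph `html_doc` below is a made-up document, and html.parser is assumed so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical document with two <p> tags.
html_doc = "<div><p>first</p><p>second</p></div>"
soup = BeautifulSoup(html_doc, "html.parser")

# Attribute-style access returns only the first matching tag ...
first = soup.p
# ... and is a shorthand for find() with the tag name.
same_as_find = soup.find("p") is first

# find_all() is needed to reach every match.
all_texts = [p.get_text() for p in soup.find_all("p")]
print(first, same_as_find, all_texts)
```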

Getting a tag's name

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)

Output:
title

Getting attributes

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])  # there are two ways to read an attribute
print(soup.p['name'])

Output:
dromouse
dromouse
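Two details are worth knowing when reading attributes: subscripting a missing attribute raises KeyError, and class comes back as a list because it can hold several values. A minimal sketch, assuming html.parser and a made-up one-tag document:

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a two-valued class attribute and no title attribute.
markup = '<a class="sister big" href="http://example.com/elsie" id="link1">Elsie</a>'
a = BeautifulSoup(markup, "html.parser").a

href = a["href"]          # plain string
classes = a["class"]      # multi-valued attribute, returned as a list
title = a.get("title")    # missing attribute: get() returns None instead of raising
title_or_default = a.get("title", "untitled")

print(href, classes, title, title_or_default)
```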

Getting the content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)

Output:
The Dormouse's story
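One caveat: .string only has a value when the tag has a single child (or a chain of single children, as with the p and b tags above). When a tag has several children, .string is None and get_text() is the usual alternative. A small sketch, assuming html.parser and a made-up paragraph:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with two children: a <b> tag and a trailing string.
soup = BeautifulSoup("<p><b>Bold</b> and plain</p>", "html.parser")
p = soup.p

single = p.b.string       # <b> has exactly one child, so .string works
multi = p.string          # <p> has two children, so .string is None
full_text = p.get_text()  # concatenates every string inside the tag

print(single, multi, full_text)
```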

Nested selection

Selections can be chained to drill down one level at a time.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)

Output:
The Dormouse's story

Children and descendants

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

The tree-navigation examples that follow all parse this sample HTML.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# .contents returns the tag's direct children (tags and strings) as a list
print(soup.p.contents)

Output:
['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']

Alternatively, use .children:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# .children yields the same direct children as .contents, but as an iterator
print(soup.p.children)
# Because the result is an iterator, loop over it to reach each node
for i, child in enumerate(soup.p.children):
    print(i, child)

Output:
<list_iterator object at 0x1064f7dd8>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  
            and
            
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.
We can also visit every descendant: .descendants walks each child and then, recursively, that child's own children.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)

Output:
<generator object descendants at 0x10650e678>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9  
            and
            
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
            and they lived at the bottom of a well.
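As the outputs above show, whitespace between tags appears as string nodes among the children. If only the child tags matter, one option is to filter by type. The sketch below assumes html.parser and uses a shortened, made-up version of the paragraph.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened paragraph: strings and tags are interleaved.
html_doc = "<p>\n  intro\n  <a id='link1'>Elsie</a>\n  <a id='link2'>Lacie</a>\n</p>"
soup = BeautifulSoup(html_doc, "html.parser")

all_children = list(soup.p.children)
# Keep only Tag objects, dropping the whitespace and text strings.
child_tags = [child for child in all_children if isinstance(child, Tag)]
ids = [tag["id"] for tag in child_tags]

print(len(all_children), ids)
```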

Parents and ancestors

Accessing the parent node:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)

Output:
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>

Accessing the ancestor nodes:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.parents)))

Output:
[(0, <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>), (1, <body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body>), (2, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body></html>), (3, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body></html>)]
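The last two entries in that output look identical because the final ancestor is the BeautifulSoup object itself, which prints the same way as the html tag. Printing only the names makes the chain easier to read. A minimal sketch, assuming html.parser and a stripped-down, made-up document:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document: html > body > p > a.
html_doc = "<html><body><p><a id='link1'>Elsie</a></p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# Walk from the nearest ancestor outwards and keep only the tag names.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```

The final name, [document], is the name Beautiful Soup gives to the BeautifulSoup object at the root of the tree.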

Siblings

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(list(enumerate(soup.a.next_siblings)))      # siblings after the tag
print(list(enumerate(soup.a.previous_siblings)))  # siblings before the tag

Output:
[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
[(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]
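Because strings count as siblings too, the node right after a tag is often just whitespace. The sketch below contrasts .next_sibling with find_next_sibling(), which skips to the next matching tag. It assumes html.parser and a made-up two-link paragraph.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph: two links separated by a newline.
html_doc = '<p><a id="link1">Elsie</a>\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")
first = soup.a

raw_next = first.next_sibling                   # the newline string between the links
next_link = first.find_next_sibling("a")        # skips strings, returns the next <a>
next_id = next_link["id"]

print(repr(raw_next), next_id)
```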

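Tag selection alone cannot cover every extraction task, so bs provides the following standard selectors as a more convenient tool.

3. Standard selectors

find_all( name , attrs , recursive , text , **kwargs )

Searches the document by tag name, attributes, or text content.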

name:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

The name examples below parse this sample HTML.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# Find and return every <ul> tag
print(soup.find_all('ul'))
# Each element of the result is a Tag object, so it can be searched further
print(type(soup.find_all('ul')[0]))

Output:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

The result contains two elements, and each one is a Tag object, so you can search inside each of them in turn:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# Find every <ul> tag
for ul in soup.find_all('ul'):
    # Then search inside each <ul> for its <li> tags
    print(ul.find_all('li'))

Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

attrs:

attrs takes a dictionary of attribute name/value pairs and returns the tags that match them, for example:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

The attrs examples below parse this sample HTML.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# Find the tags whose id attribute is 'list-1'
print(soup.find_all(attrs={'id': 'list-1'}))
# Find the tags whose name attribute is 'elements'
print(soup.find_all(attrs={'name': 'elements'}))

Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

Alternatively, pass the attribute directly to find_all as a keyword argument, which has the same effect. Because class is a reserved word in Python, the keyword is spelled class_:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# Find every tag with id='list-1'
print(soup.find_all(id='list-1'))
# Find every tag with class 'element'
print(soup.find_all(class_='element'))

Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

text:

Use text when matching on content. It matches the strings inside the document rather than the tags:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

The text example below parses this sample HTML.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# Match strings whose content is exactly 'Foo'
print(soup.find_all(text='Foo'))

Output:
['Foo', 'Foo']
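Content matching is not limited to exact strings. The sketch below passes a compiled regular expression and also combines a tag name with a content filter. It uses the string argument, which is the newer name for text in Beautiful Soup 4.4.0 and later; html.parser and the three-item list are assumptions made for this illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical list with three items.
html_doc = "<ul><li>Foo</li><li>Bar</li><li>Baz</li></ul>"
soup = BeautifulSoup(html_doc, "html.parser")

# A compiled pattern matches every string it can be found in.
ba_strings = [str(s) for s in soup.find_all(string=re.compile(r"^Ba"))]

# Combining a tag name with string= returns the tags, not the strings.
foo_tags = soup.find_all("li", string="Foo")

print(ba_strings, foo_tags)
```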

find( name , attrs , recursive , text , **kwargs )

find returns a single element (the first match), whereas find_all returns every match. Its arguments are the same as those of find_all.

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

The find examples below parse this sample HTML.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))        # the first <ul> tag
print(type(soup.find('ul')))
print(soup.find('page'))      # no <page> tag exists, so this prints None

Output:
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<class 'bs4.element.Tag'>
None
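Since find returns None when nothing matches, chaining another call onto the result raises AttributeError. A minimal sketch of a guard, assuming html.parser; the helper name `text_or_default` is made up for this illustration.

```python
from bs4 import BeautifulSoup


def text_or_default(soup, tag_name, default="(missing)"):
    """Return the text of the first matching tag, or a default if none exists."""
    node = soup.find(tag_name)
    # find() returns None when there is no match, so check before using it.
    return node.get_text() if node is not None else default


soup = BeautifulSoup("<div><h4>Hello</h4></div>", "html.parser")
heading = text_or_default(soup, "h4")
missing = text_or_default(soup, "page")
print(heading, missing)
```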

find_parents() find_parent()

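find_parents() returns all ancestor nodes; find_parent() returns the direct parent.

find_next_siblings() find_next_sibling()

find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.

find_previous_siblings() find_previous_sibling()

find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the nearest preceding sibling.

find_all_next() find_next()

find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.

find_all_previous() find_previous()

find_all_previous() returns all matching nodes before the current node; find_previous() returns the nearest matching node before it.

All of these methods take the same arguments as find_all; refer to them as needed.

Here is a simple example: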

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
ul = soup.find('ul')
# The next sibling of the first <ul>
print(ul.find_next_sibling())
# All ancestors of the first <ul>
print(ul.find_parents())
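The example above prints whole subtrees, which can be hard to read. The sketch below passes a tag name to the same methods and prints only identifying attributes. It assumes html.parser and a trimmed, made-up version of the panel markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed version of the panel markup.
html_doc = """
<div class="panel">
  <div class="panel-body">
    <ul id="list-1"><li>Foo</li></ul>
    <ul id="list-2"><li>Bar</li></ul>
  </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
ul = soup.find("ul")

# Next <ul> sibling of the first list.
next_id = ul.find_next_sibling("ul")["id"]
# Enclosing <div> tags, nearest first.
parent_classes = [div["class"][0] for div in ul.find_parents("div")]
# Previous <ul> sibling of the second list.
prev_id = soup.find(id="list-2").find_previous_sibling("ul")["id"]

print(next_id, parent_classes, prev_id)
```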

4. CSS selectors

Pass a CSS selector string directly to select() to make a selection. In a selector, . denotes a class and # denotes an id.

For example:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''


from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# Tags with class panel-heading inside a tag with class panel
print(soup.select('.panel .panel-heading'))
# <li> tags inside a <ul>
print(soup.select('ul li'))
# Tags with class element inside the tag with id list-2
print(soup.select('#list-2 .element'))
# Each element of the result is still a Tag object
print(type(soup.select('ul')[0]))

Output:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
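A few other selector forms can be handy alongside the descendant selectors shown above. The sketch below uses select_one() for a single match, a child combinator, and an attribute selector. It assumes html.parser, a trimmed, made-up copy of the list markup, and a Beautiful Soup version recent enough to bundle its soupsieve selector engine.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed copy of the two lists.
html_doc = """
<ul class="list" id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>
<ul class="list list-small" id="list-2"><li>Foo</li><li>Bar</li></ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# select_one() returns the first match (or None) instead of a list.
first_list_classes = soup.select_one("#list-1")["class"]
no_match = soup.select_one(".missing")

# Child combinator: <li> tags directly under a ul with class list-small.
small_items = [li.get_text() for li in soup.select("ul.list-small > li")]

# Attribute selector: <li> tags under the ul whose id equals list-1.
first_items = [li.get_text() for li in soup.select('ul[id="list-1"] li')]

print(first_list_classes, no_match, small_items, first_items)
```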

Once we have a tag, here is how to read its attributes:

from bs4 import BeautifulSoup

# Subscripting the tag is enough:
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

Output:
list-1
list-1
list-2
list-2

Once we have a tag, here is how to read the text it contains:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())

Output:
Foo
Bar
Jay
Foo
Bar
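Real pages often pad text with spaces and newlines. get_text() accepts a separator and a strip flag for that case. A minimal sketch, assuming html.parser and a made-up list with untidy whitespace:

```python
from bs4 import BeautifulSoup

# Hypothetical list whose items carry stray whitespace.
html_doc = "<ul>\n<li> Foo </li>\n<li>\n Bar\n</li>\n</ul>"
soup = BeautifulSoup(html_doc, "html.parser")

# strip=True trims each string and drops the ones that become empty.
items = [li.get_text(strip=True) for li in soup.select("li")]

# A separator joins the remaining strings of the whole subtree.
joined = soup.ul.get_text(" | ", strip=True)

print(items, joined)
```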

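Summary

Prefer the lxml parser; fall back to html.parser when necessary.
Tag selectors (soup.tag) offer little filtering power but are fast.
Use find() to match a single result and find_all() to match multiple results.
If you are comfortable with CSS selectors, use select().
Remember the common ways to get attributes (tag['attr'], tag.attrs) and text (.string, get_text()).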
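The first recommendation can be sketched as a small helper that tries lxml and falls back to html.parser when lxml is not installed. The helper name `make_soup` is made up for this illustration; FeatureNotFound is the exception Beautiful Soup raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when available, otherwise with the built-in html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser always ships with Python.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello</p>")
text = soup.p.get_text()
print(text)
```

Keep in mind that the two parsers can build slightly different trees from broken markup, so results may vary between machines when the fallback is taken.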

Posted 2018-10-05 09:52 by A-handsome-cxy
