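The Beautiful Soup Parsing Library

Introduction to Beautiful Soup

  Beautiful Soup is a Python library for parsing HTML and XML. It is a powerful parsing tool that extracts data by relying on a page's structure and the attributes of its nodes. With it, we no longer need to write complicated regular expressions; a few statements are enough to extract a given element from a page, which makes parsing code much quicker to write. Beautiful Soup does depend on an underlying parser, though. We generally use the lxml parser, which can parse both HTML and XML, is fast, and tolerates malformed markup well.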
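The parser dependency described above shows up as the second argument to the BeautifulSoup constructor. The following is a minimal sketch, not part of the original post: the markup is invented for illustration, and it uses Python's built-in "html.parser" instead of lxml so that it runs without installing lxml.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup (made up for this sketch): neither <p> nor <b> is closed.
broken = "<p>unclosed <b>bold"

# The second argument names the parser. "html.parser" ships with Python;
# "lxml", used in the rest of this post, has to be installed separately.
soup = BeautifulSoup(broken, "html.parser")

# The tree is still navigable because the unclosed tags are closed for us.
bold_text = soup.b.string
paragraph_text = soup.p.get_text()
print(bold_text)
print(paragraph_text)
```

Different parsers may repair the same broken markup in different ways, so it is worth sticking to one parser within a project.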

Using Beautiful Soup

A simple example

html = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
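# Initialize the BeautifulSoup object; the parser also repairs the incomplete HTML string
soup = BeautifulSoup(html,'lxml')
# Print the parsed document in a standard indented format
print(soup.prettify())
# Select the title node in the HTML, then get its text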
print(soup.title.string)

Output
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

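Node selectors: a node selector picks a node element by calling its tag name directly as an attribute, and the string attribute then returns the text inside that node. This style of selection is fast and is commonly used for pages whose node hierarchy is clear. It comes in the following forms:

  • Element selection: select a node by its element name.
  • Nested selection: every result returned by a node selector is of type bs4.element.Tag, so it can in turn be used to select further nodes. In the nested selection example, after getting the head node we call title on it to select the title node inside it.
  • Relational selection: sometimes the desired node cannot be selected in a single step. In that case we first select a node and then use it as a reference point to select its children, parent, siblings, and so on.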
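One detail of the string attribute is worth a small sketch: it only returns text when a node has a single child. The markup and variable names below are made up for illustration, and "html.parser" is used instead of lxml to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

# <b> has exactly one child (a text node), so .string returns that text.
only_child_text = soup.b.string

# <p> has two children (a text node and the <b> tag), so .string is None.
mixed_children = soup.p.string

# get_text() concatenates all text found below the node.
all_text = soup.p.get_text()

print(only_child_text, mixed_children, all_text)
```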

Element selection

html = """
<html><head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup

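# Initialize the BeautifulSoup object
soup = BeautifulSoup(html,'lxml')
# Get the title node
print(soup.title)
# Check the type of the title node (the tag together with its text): bs4.element.Tag
print(type(soup.title))
# Get the text of the title node
print(soup.title.string)
# Get the node's tag name via the name attribute
print(soup.title.name)
print(soup.head)
# Get the first p node
print(soup.p)
# Get all attributes of the first p node; returns a dict
print(soup.p.attrs)
# Get the value of the name attribute, then the value of the class attribute
print(soup.p.attrs['name'])
print(soup.p['class'])
# Get the text of the p node (soup.p is the first p node, so this is the first p node's text)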
print(soup.p.string)



Output
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
title
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
{'class': ['title'], 'name': 'dromouse'}
dromouse
['title']
The Dormouse's story
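The attribute access shown above can be sketched a little further. This is an illustrative snippet under assumed markup (the extra "big" class is invented), using "html.parser" rather than lxml: class is a multi-valued attribute and comes back as a list, other attributes come back as strings, and get() avoids the KeyError that square-bracket access raises for a missing attribute.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', "html.parser")
p = soup.p

classes = p["class"]             # multi-valued attribute: a list of class names
name_attr = p["name"]            # ordinary attribute: a plain string
missing = p.get("id")            # absent attribute: get() returns None
fallback = p.get("id", "no-id")  # or a default value of our choosing

try:
    p["id"]                      # square brackets raise KeyError when absent
    raised = False
except KeyError:
    raised = True

print(classes, name_attr, missing, fallback, raised)
```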

Nested selection

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
# Select the title node through nested selection
print(soup.head.title)
# Check the type of the title node: bs4.element.Tag
print(type(soup.head.title))
# Get the text of the title node
print(soup.head.title.string)

Output
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story

Relational selection:

Children and descendants

html = """
<html><head><title>The Dormouse's story</title>
</head>
<body>
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""

Get the direct children of the p node
from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
# Get all direct children of the p node: the children attribute returns an iterator (a list_iterator in the output), so just loop over it with for
print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(i,child)

Output
<list_iterator object at 0x0000000002E8C4E0>
0 
    Once upon a time there were three little sisters; and their names were
    
1 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4 

5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
    and they lived at the bottom of a well.
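As the output shows, the direct children include whitespace-only text nodes between the tags. If only element nodes are wanted, one possible approach is to keep the children that are Tag instances. This is a sketch with shortened, made-up markup and "html.parser" in place of lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<p class="story">
    Once upon a time
    <a id="link1">Elsie</a>
    <a id="link2">Lacie</a>
</p>"""

soup = BeautifulSoup(html, "html.parser")

# Text between tags is a NavigableString, not a Tag, so this keeps only elements.
tag_children = [child for child in soup.p.children if isinstance(child, Tag)]
child_ids = [child["id"] for child in tag_children]
print(child_ids)
```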


Get all descendants of the p node: use the descendants attribute, which returns a generator, and iterate over it with a for loop
from bs4 import BeautifulSoup

soup =BeautifulSoup(html,'lxml')
# Get all descendant nodes
print(soup.p.descendants)
# The result is a generator; iterate over it with for
for i,child in enumerate(soup.p.descendants):
    print(i,child)

Output
<generator object descendants at 0x0000000001EF1468>
0 
    Once upon a time there were three little sisters; and their names were
    
1 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
2 Elsie
3 

4 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
5 Lacie
6 

7 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
8 Tillie
9 
    and they lived at the bottom of a well.

Parent and ancestor nodes

html = """
<html><head><title>The Dormouse's story</title>
</head>
<body>
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
# Get the parent node: use the parent attribute
from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)

Output
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
</p>


Get the ancestor nodes: use the parents attribute
from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))

Output
<class 'generator'>
[(0, <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
</p>), (1, <body>
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>), (2, <html><head><title>The Dormouse's story</title>
</head>
<body>
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body></html>), (3, <html><head><title>The Dormouse's story</title>
</head>
<body>
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body></html>)]
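The last two entries in the output above print identically because the final ancestor is the BeautifulSoup object itself, which renders as the whole document. Printing the name of each ancestor makes this easier to see. The snippet is a sketch with shortened, made-up markup and "html.parser" in place of lxml; find_parent() is shown as one way to jump straight to a named ancestor.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>t</title></head><body><p><a href='#'>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk upwards from the first <a>; the last ancestor is the BeautifulSoup
# object, whose name is the special string "[document]".
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)

# find_parent() returns the nearest ancestor with the given tag name.
body = soup.a.find_parent("body")
print(body.name)
```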

Nodes at the same level (sibling nodes)

html = """
<html>
<body>
<p class="story">
    Once upon a time there were three little sisters; and their names were
   <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
Hello
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
# Get the next node at the same level
print('Next Sibling:',soup.a.next_sibling)
# Get the previous node at the same level
print('Prev Sibling:',soup.a.previous_sibling)
# Get all following nodes at the same level
print('Next Siblings:',list(enumerate(soup.a.next_siblings)))
# Get all preceding nodes at the same level
print('Prev Siblings:',list(enumerate(soup.a.previous_siblings)))

Output

Next Sibling: 
Hello
    
Prev Sibling: 
    Once upon a time there were three little sisters; and their names were
   
Next Siblings: [(0, '\nHello\n    '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \nand\n    '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n    and they lived at the bottom of a well.\n')]
Prev Siblings: [(0, '\n    Once upon a time there were three little sisters; and their names were\n   ')]
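Because the sibling attributes return text nodes as well as tags, the immediate neighbour of a tag is often just text. When the next sibling tag is what we are after, find_next_sibling() and find_next_siblings() accept a tag name and skip the text in between. This is a sketch with shortened, made-up markup and "html.parser" in place of lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = """<p>
    intro
    <a id="link1">Elsie</a>
    Hello
    <a id="link2">Lacie</a>
    and
    <a id="link3">Tillie</a>
</p>"""

soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate next sibling is the text between the first two <a> tags.
next_is_text = isinstance(first.next_sibling, NavigableString)

# Passing a tag name skips text nodes and returns matching tags only.
next_tag = first.find_next_sibling("a")
later_ids = [a["id"] for a in first.find_next_siblings("a")]
print(next_is_text, next_tag["id"], later_ids)
```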

Extracting information (text, attributes, and so on)

html = """
<html>
<body>
<p class="story">
    Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
print('Next Siblings:')
print(type(soup.a.next_sibling))
print(soup.a.next_sibling)
# Get the text of the node that follows the a node
print(soup.a.next_sibling.string)
print('Parents:')
print(type(soup.a.parents))
print(list(soup.a.parents)[0])
# Get the class attribute of the a node's nearest ancestor (its parent)
print(list(soup.a.parents)[0].attrs['class'])

Output
Next Siblings:
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
Parents:
<class 'generator'>
<p class="story">
    Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
['story']

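Method selectors: node selectors are fast, but in more complex cases they become cumbersome and inflexible. The find_all() and find() methods cover those cases: pass the appropriate arguments and they query for the elements you need. find_all() returns every element that matches the given conditions. Its API is find_all(name, attrs, recursive, text, **kwargs):

  • name: query elements by node (tag) name
  • attrs: query node elements by attribute
  • text: match the text of a node; the argument can be a string or a regular expression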
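The recursive argument in the signature above is not covered by the list, so here is a small sketch of it, together with the limit argument that find_all() also accepts. The nested list markup is invented for illustration and "html.parser" is used in place of lxml.

```python
from bs4 import BeautifulSoup

html = """
<ul id="outer">
  <li>Foo</li>
  <li>Bar
    <ul id="inner"><li>Nested</li></ul>
  </li>
  <li>Jay</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("ul", id="outer")

# Default (recursive=True): search all descendants, including the nested list.
all_items = outer.find_all("li")

# recursive=False: look only at direct children of the outer <ul>.
direct_items = outer.find_all("li", recursive=False)

# limit stops the search after the given number of matches.
first_two = outer.find_all("li", limit=2)

print(len(all_items), len(direct_items), len(first_two))
```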

find_all(): return all matching elements

Query elements by node name (name)

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
<div class="panel_body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
# Find all ul nodes; the result is a list
print(soup.find_all(name='ul'))
# Check the element type: bs4.element.Tag
print(type(soup.find_all(name='ul')[0]))
# Tag objects support nested queries
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    # Get the li node elements inside each ul
    for li in ul.find_all(name='li'):
        print(li.string)

Output
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar

Query node elements by attribute (attrs)

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
<div class="panel_body">
<ul class="list" id="list-1" name='elements'>
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="elements">Foo</li>
<li class="elements">Bar</li>
</ul>
</div>
</div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))
print(soup.find_all(attrs={'class':'elements'}))

Output
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="elements">Foo</li>, <li class="elements">Bar</li>]
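For common attributes, find_all() also accepts keyword arguments as a shorthand for the attrs dictionary. The sketch below uses shortened markup based on the example above and "html.parser" in place of lxml. It also shows why an HTML attribute called name still needs the attrs form: the name keyword is already taken by the tag name.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="elements">Bar</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Keyword shorthand for id.
by_id = soup.find_all(id="list-1")

# class is a Python keyword, so the argument is spelled class_.
# It matches any tag that has "list" among its classes.
by_class = soup.find_all(class_="list")

# name= means "tag name", so this looks for <elements> tags and finds none.
by_tag_name = soup.find_all(name="elements")

# An attribute that is literally called name must go through attrs.
by_name_attr = soup.find_all(attrs={"name": "elements"})

print(len(by_id), len(by_class), len(by_tag_name), len(by_name_attr))
```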

Match the text of a node (text)

html ='''
<div class="panel">
<div class="panel-body">
<a>Hello,this is a link</a>
<a>Hello,this is a link, too</a>
</div>
</div>'''

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
# Find every text node that matches the regular expression; the result is a list of strings, not tags
print(soup.find_all(text=re.compile('link')))

Output
['Hello,this is a link', 'Hello,this is a link, too']
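Recent Beautiful Soup releases (4.4.0 and later) call this argument string and keep text as the older spelling. The sketch below assumes such a version; the markup is shortened and "html.parser" stands in for lxml. It also shows two ways to get from matched text back to tags: combine a tag name with string, or follow the parent of each matched string.

```python
import re
from bs4 import BeautifulSoup

html = "<div><a>Hello, this is a link</a><a>Hello, this is a link, too</a><a>Other</a></div>"
soup = BeautifulSoup(html, "html.parser")

# string= on its own returns the matching text nodes.
strings = soup.find_all(string=re.compile("link"))

# Each text node knows the tag that contains it.
parent_names = [s.parent.name for s in strings]

# A tag name plus string= returns the tags whose text matches.
tags = soup.find_all("a", string=re.compile("too"))

print(strings, parent_names, tags)
```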

find(): return a single matching element

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
<div class="panel_body">
<ul class="list" id="list-1" name='elements'>
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="elements">Foo</li>
<li class="elements">Bar</li>
</ul>
</div>
</div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
print(soup.find(name='ul'))
print(type(soup.find(name='ul')))
print(soup.find(class_='list'))

Output
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<class 'bs4.element.Tag'>
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
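One practical difference from find_all() is what happens when nothing matches: find() returns None, while find_all() returns an empty list. The sketch below uses made-up markup and "html.parser" in place of lxml, and shows a guard before using the result.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul class="list"><li>Foo</li></ul>', "html.parser")

found = soup.find("ul")              # first matching Tag
missing = soup.find("table")         # no match: None
missing_all = soup.find_all("table") # no match: empty list

# Guard against None before calling methods on the result.
label = missing.get_text() if missing is not None else "(no table)"
print(found.name, missing, missing_all, label)
```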

CSS selectors: simply call the select() method and pass in the appropriate CSS selector

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" >
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading'))
# Get all li nodes under all ul nodes
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

Output
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
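Besides select(), there is select_one(), which returns only the first match (or None). The sketch below also tries a child combinator and an attribute selector. The markup is a shortened variant of the example above, "html.parser" stands in for lxml, and the particular selectors are chosen only for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Jay</li>
</ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first matching Tag, or None if nothing matches.
first_item = soup.select_one("#list-1 li")
no_table = soup.select_one("table")

# "ul.list-small > li": li elements that are direct children of that ul.
small_items = soup.select("ul.list-small > li")

# Attribute selector: li elements under the ul whose id equals "list-1".
by_attr = soup.select('ul[id="list-1"] li')

print(first_item.get_text(), no_table, len(small_items), len(by_attr))
```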

Nested selection with CSS selectors
from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

Output
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

Getting attributes
from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])

Output
list-1
list-2

Getting text
from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    # Two ways to get the text
    print('Get Text:',li.get_text())
    print('String:',li.string)

Output
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
Get Text: Jay
String: Jay
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
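The two methods give the same result above because each li contains a single text node. They differ once a node has nested tags or surrounding whitespace. The sketch below uses made-up markup and "html.parser" in place of lxml to show get_text() with its separator and strip arguments, plus the stripped_strings generator.

```python
from bs4 import BeautifulSoup

html = "<li class='element'>\n  Foo <b>Bar</b>\n</li>"
soup = BeautifulSoup(html, "html.parser")
li = soup.li

# .string is None here because li has more than one child node.
single = li.string

# get_text() joins every text node; strip=True trims each piece and drops
# empty ones, and separator is placed between the remaining pieces.
clean = li.get_text(separator=" ", strip=True)

# stripped_strings yields the same trimmed pieces one by one.
parts = list(li.stripped_strings)

print(single, clean, parts)
```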
posted @ 2018-12-25 18:47  Coolc