BeautifulSoup 的简单使用

Beautiful Soup初了解

解析工具Beautiful Soup，借助网页的结构和属性等特性来解析网页(简单的说就是python的一个HTML或XML的解析库)
Beautiful Soup支持的解析器有很多：Python标准库、lxml HTML解析器、lxmlXML解析器、html5lib

实例引入：

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml')
print(soup.p.string)

# 输出：
Hello

BeautifulSoup 的基本用法

实例引入：

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify(), soup.title.string, sep='\n\n')
# 初始化BeautifulSoup时，自动更正了不标准的HTML
# prettify()方法可以把要解析的字符串以标准的缩进格式输出
# soup.title 可以选出HTML中的title节点，再调用string属性就可以得到里面的文本了

# 输出：
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

The Dormouse's story

结点选择器

选择元素

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')

print(soup.title)               # 打印输出title节点的选择结果
print(type(soup.title))         # 输出soup.title类型
print(soup.title.string)        # 输出title节点的内容
print(soup.head)                # 打印输出head节点的选择结果
print(soup.p)                   # 打印输出p节点的选择结果


# 输出：
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

提取信息

说明：
调用string属性获取文本的值
利用那么属性获取节点的名称
调用attrs获取所有HTML节点属性

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')

print(soup.title.name)          # 选取title节点，然后调用name属性获得节点名称
# 输出：title
print(soup.title.string)        # 调用string属性，获取title节点的文本值
# 输出：The Dormouse's story

print(soup.p.attrs)             # 调用attrs，获取p节点的所有属性
# 输出：{'class': ['title'], 'name': 'dromouse'}

print(soup.p.attrs['name'])         # 获取name属性
# 输出：dromouse
print(soup.p['name'])               # 获取name属性
# 输出：dromouse

关联选择

子节点和子孙节点

contents属性获取直接子结点（生的的是列表）

from bs4 import BeautifulSoup

html = """
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
# 选取节点元素之后，可以调用contents属性获取它的直接子节点
print(soup.p.contents)

# 输出：
['\n   Once upon a time there were three little sisters; and their names were\n   ', <a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>, '\n   ,\n   ', <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>, '\n   and\n   ', <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>, '\n   ;\nand they lived at the bottom of a well.\n  ']
# 返回结果是一个列表，列表中的元素是所选节点的直接子节点（不包括孙节点）

children属性，返回结果是生成器类型。与contents属性一样，只是返回结果类型不同。

from bs4 import BeautifulSoup

html = """
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
  </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)                          # 输出：<list_iterator object at 0x1159b7668>
for i, child in enumerate(soup.p.children):
    print(i, child)


# for 循环的输出结果:
0 
   Once upon a time there were three little sisters; and their names were
   
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 
   ,
   
3 <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
4 
   and
   
5 <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
6 
   ;
and they lived at the bottom of a well.

descendants属性会递归查询所有子节点，得到所有子孙节点。

from bs4 import BeautifulSoup

html = """
 <html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)                          # 输出：<generator object Tag.descendants at 0x1131d0048>
for i, child in enumerate(soup.p.descendants):
    print(i, child)



# for 循环输出结果：
0 
   Once upon a time there were three little sisters; and their names were
   
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 
   ,
   
7 <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
8 
    Lacie
   
9 
   and
   
10 <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
11 
    Tillie
   
12 
   ;
and they lived at the bottom of a well.

父节点和祖先节点

parent获取某个节点的一个父结点

from bs4 import BeautifulSoup

html = """
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
   </a>
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)


# 输出：
<p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>

parent获取所有祖先结点

from bs4 import BeautifulSoup

 3 html = """
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
   </a>
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.a.parents, type(soup.a.parents), list(enumerate(soup.a.parents)), sep='\n\n')


# 输出：
<generator object PageElement.parents at 0x11c76e048>

<class 'generator'>

[(0, <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>), (1, <body>
<p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">
   ...
  </p>
</body>), (2, <html>
<head>
<title>
   The Dormouse's story
  </title>
</head>
<body>
<p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">
   ...
  </p>
</body>
</html>), (3, <html>
<head>
<title>
   The Dormouse's story
  </title>
</head>
<body>
<p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">
   ...
  </p>
</body>
</html>
)]

兄弟节点

from bs4 import BeautifulSoup

html = """
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
print(
    # 获取下一个兄弟元素
    {'Next Sibling': soup.a.next_sibling},
    # 获取上一个兄弟元素
    {'Previous Sibling': soup.a.previous_sibling},
    # 返回后面的兄弟元素
    {'Next Siblings': list(enumerate(soup.a.next_siblings))},
    # 返回前面的兄弟元素
    {'Previous Siblings': list(enumerate(soup.a.previous_siblings))},

    sep='\n\n'
)


# 输出：
{'Next Sibling': '\n   ,\n   '}

{'Previous Sibling': '\n   Once upon a time there were three little sisters; and their names were\n   '}

{'Next Siblings': [(0, '\n   ,\n   '), (1, <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>), (2, '\n   and\n   '), (3, <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>), (4, '\n   ;\nand they lived at the bottom of a well.\n  ')]}

{'Previous Siblings': [(0, '\n   Once upon a time there were three little sisters; and their names were\n   ')]}

提取信息

from bs4 import BeautifulSoup

html = """
<html>
 <body>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">Bob</a>
   <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
  </p>
 </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
print(
    'Next Sibling:',

    [soup.a.next_sibling],        # 获取上一个兄弟节点
    # \n
    type(soup.a.next_sibling),      # 上一个兄弟节点的类型
    # <class 'bs4.element.NavigableString'>
    [soup.a.next_sibling.string],     # 获取上一个兄弟节点的内容
    # \n
    sep='\n'
)

print(
    'Parent:',

    [type(soup.a.parents)],      # 获取所有的祖先节点
    # <class 'generator'>
    [list(soup.a.parents)[0]],           # 获取第一个祖先节点
    # <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">Bob</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
    [list(soup.a.parents)[0].attrs['class']],        # 获取第一个祖先节点的"class属性"的值
    # ['story']
    sep='\n'
)

# 为了输出返回的结果，均以列表形式


# 输出：
Next Sibling:
['\n']
<class 'bs4.element.NavigableString'>
['\n']
Parent:
[<class 'generator'>]
[<p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">Bob</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>]
[['story']]

嵌套选择

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

# 输出：
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story

方法选择器

find_all(name=None, attrs={}, recursive=True, text=None, limit=None)

查询所有符合条件的元素

from bs4 import BeautifulSoup

html = """
<div>
<ul>
<li class="item-O"><a href="linkl.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='li'),
      type(soup.find_all(name='li')[0]),
      sep='\n\n')


# 输出：
[<li class="item-O"><a href="linkl.html">first item</a></li>, <li class="item-1"><a href="link2.html">second item</a></li>, <li class="item-inactive"><a href="link3.html">third item</a></li>, <li class="item-1"><a href="link4.html">fourth item</a></li>, <li class="item-0"><a href="link5.html">fifth item</a>
</li>]

<class 'bs4.element.Tag'>


# 返回值是一个列表，列表的元素是名为"li"的节点，每个元素都是bs4.element.Tag类型


# 遍历每个a节点
from bs4 import BeautifulSoup

html = """
<div>
<ul>
<li class="item-O"><a href="linkl.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
"""

soup = BeautifulSoup(html, 'lxml')
li = soup.find_all(name='li')

for a in li:
    print(a.find_all(name='a'))

# 输出：
[<a href="linkl.html">first item</a>]
[<a href="link2.html">second item</a>]
[<a href="link3.html">third item</a>]
[<a href="link4.html">fourth item</a>]
[<a href="link5.html">fifth item</a>]

attires 参数

from bs4 import BeautifulSoup

html = """
<div>
<ul>
<li class="item-O"><a href="linkl.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
"""

soup = BeautifulSoup(html, 'lxml')

print(soup.find_all(attrs={'class': 'item-0'}))
print(soup.find_all(attrs={'href': 'link5.html'}))


# 输出：
[<li class="item-0"><a href="link5.html">fifth item</a>
</li>]
[<a href="link5.html">fifth item</a>]

# 可以通过attrs参数传入一些属性来进行查询，即通过特定的属性来查询
# find_all(attrs={'属性名': '属性值', ......})

text 参数

from bs4 import BeautifulSoup
import re

html = """
<div class="panel">
<div class="panel-body">
<a>Hello, this is a link</a>
<a>Hello, this is a link, too</a>
<div/>
<div/>
"""

soup = BeautifulSoup(html, 'lxml')

# 正则表达式规则对象
regular = re.compile('link')

# text参数课用来匹配节点的文本，传入的形式可以是字符串，也可以是正则表达式对象
print(soup.find_all(text=regular))

# 正则匹配输出
print(re.findall(regular, html))


# 输出：
['Hello, this is a link', 'Hello, this is a link, too']
['link', 'link']

说明：

find(name=None, attrs={}, recursive=True, text=None)
# 仅返回与给定条件匹配标记的第一个元素

CSS选择器

Beautiful Soup 提供了CSS选择器，调用select()方法即可
css选择器用法：http://www.w3school.com.cn/cssref/css_selectors.asp

方法

select(selector, namespaces=None, limit=None)

简单实例

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

ul_all = soup.select('ul')
print(ul_all)

for ul in ul_all:
    print()
    print(
        ul['id'],

        ul.select('li'),

        sep='\n'
    )


# 输出：
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]

list-1
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]

list-2
[<li class="element">Foo</li>, <li class="element">Bar</li>]

获取属性

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

ul_all = soup.select('ul')
print(ul_all)

for ul in ul_all:
    print()
    print(
        ul['id'],

        ul.attrs['id'],

        sep='\n'
    )

# 直接传入中括号和属性名  或者  通过attrs属性获取属性值 都可以成功获得属性值

# 输出：
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]

list-1
list-1

list-2
list-2

获取文本

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

ul_all = soup.select('li')
print(ul_all)

for li in ul_all:
    print()
    print(
        'get_text()方法获取文本：'+li.get_text(),

        'string属性获取文本：'+li.string,

        sep='\n'
    )


# 输出：
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

get_text()方法获取文本：Foo
string属性获取文本：Foo

get_text()方法获取文本：Bar
string属性获取文本：Bar

get_text()方法获取文本：Jay
string属性获取文本：Jay

get_text()方法获取文本：Foo
string属性获取文本：Foo

get_text()方法获取文本：Bar
string属性获取文本：Bar

posted @ 2019-07-10 12:35 LeeHua 阅读(284) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

Lee Hua's Blog

热爱编程 -- 写Bug