Using Beautiful Soup

1. Introduction

Simply put, Beautiful Soup is a Python library for parsing HTML and XML that makes it easy to extract data from web pages. It provides a handful of simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't need to think about encodings at all, unless the document doesn't specify one, in which case you only need to state the original encoding.
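
For example, a minimal sketch of declaring the original encoding via the from_encoding constructor argument (the GBK-encoded byte string here is a made-up example):

from bs4 import BeautifulSoup

# A byte string whose encoding is not declared anywhere in the markup.
markup = '<html><head><title>你好</title></head></html>'.encode('gbk')
# Tell Beautiful Soup what the original encoding was.
soup = BeautifulSoup(markup, 'lxml', from_encoding='gbk')
print(soup.title.string)  # 你好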

2. Preparation

Install Beautiful Soup.

 

a. Related links

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh

PyPI: https://pypi.python.org/pypi/beautifulsoup4

 

b. Install with pip3

  pip3 install beautifulsoup4

c. Install from a wheel

Download the wheel file from PyPI.

Then install the wheel file with pip, for example:
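
(The filename below is hypothetical; substitute the wheel file you actually downloaded.)

  pip3 install beautifulsoup4-4.x.x-py3-none-any.whl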

3. Using Beautiful Soup

1. Basic usage

from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>The Beautiful Suop</title>
</head>
<body>
<p class="title" name="dromouse"><b>The story</b></p>
<p class="story" >once upon a time there were three title sisters;and their name were
<a href="http://example.com/elsie" class="sister" id="link1">Elise</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)

The output is as follows:

<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   The Beautiful Suop
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The story
   </b>
  </p>
  <p class="story">
   once upon a time there were three title sisters;and their name were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elise
   </a>
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
    and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Beautiful Suop

Here we first declare a variable html, which holds an HTML string. Note that it is not a complete HTML string: the body and html nodes are not closed. We then pass it as the first argument to the BeautifulSoup constructor, with the second argument specifying the parser type (lxml here). This completes the initialization of the BeautifulSoup object, which we assign to the variable soup. From there we can call soup's methods and attributes to parse this HTML.

First, we call the prettify() method, which outputs the parsed string in standard, indented form. Note that the output contains the closing body and html nodes; in other words, Beautiful Soup automatically corrects non-standard HTML. This correction is not done by prettify() itself; it already happens during initialization.

Then we call soup.title.string, which prints the text content of the HTML title node. In other words, soup.title selects the title node from the HTML, and calling its string attribute gives us the text inside it.

2. Node selectors

You can select a node element simply by calling the node's name, and then call string to get the node's text. This way of selecting is very fast; if the structure around a single node is clear, it is a good choice.

♦ Selecting elements

from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>The Beautiful Suop</title>
</head>
<body>
<p class="title" name="dromouse"><b>The story</b></p>
<p class="story" >once upon a time there were three title sisters;and their name were
<a href="http://example.com/elsie" class="sister" id="link1">Elise</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)

The output is as follows:

<title>The Beautiful Suop</title>
<class 'bs4.element.Tag'>
The Beautiful Suop
<head>
<meta charset="utf-8"/>
<title>The Beautiful Suop</title>
</head>
<p class="title" name="dromouse"><b>The story</b></p>

Here we reuse the sample HTML from before. We first print the result of selecting the title node, then its type, then its text content. The type, <class 'bs4.element.Tag'>, is an important data structure in Beautiful Soup.

Next, we also try the head and p nodes. When selecting the p node, only the content of the first p node is printed. When there are multiple matching nodes, this approach only matches the first one and ignores the rest, as the sketch below confirms.
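
A minimal sketch confirming this, using the same soup object (find_all() is covered in detail later in the method-selector section):

print(len(soup.find_all('p')))          # 3: the document contains three p nodes
print(soup.p is soup.find_all('p')[0])  # True: soup.p is simply the first match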

♦ Extracting information

How do we get a node's attribute values, or its name?

(1) Getting the name

Use the name attribute to get the node's name:

print(soup.title.name)

Output:

title

(2) Getting attributes

Each node can have multiple attributes, such as id and class. After selecting a node, you can call attrs to get all of its attributes:

print(soup.p.attrs)
Output:
{'class': ['title'], 'name': 'dromouse'}

As you can see, attrs returns a dictionary that maps every attribute name to its value. To get the name attribute, just index by that key: attrs['name']. There is an even simpler way: put the attribute name in brackets directly after the node element:

print(soup.p['name'])
print(soup.p['class'])

Output:
dromouse
['title']

Note that some results come back as strings and some as lists. For example, the name attribute has a single value, so the result is a plain string, while class can hold several values, so the result is a list. You need to check which one you got in practice, as in the sketch below.
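
A minimal sketch of checking the returned type, reusing the soup object from above:

attrs = soup.p.attrs
print(type(attrs['name']))   # <class 'str'>: a single value
print(type(attrs['class']))  # <class 'list'>: class can hold several values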

(3) Getting content

Use string to get the content:

print(soup.p.string)

Output:
The story

The p node here is the first p node.

♦ Nested selection

In the examples above, each step returns a bs4.element.Tag, so we can keep selecting nodes on the result:

print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

Output:

<title>The Beautiful Suop</title>
<class 'bs4.element.Tag'>
The Beautiful Suop
♦ Associated selection

First select a node element, then use it as the base point to select its parent, children, siblings, and so on.

(1) Children and descendants

Use the contents attribute to get the direct children:

from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>The Beautiful Suop</title>
</head>
<body>
<p class="story" >once upon a time there were three title sisters;and their name were
<a href="http://example.com/elsie" class="sister" id="link1">
    <span>Elise</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)

Output:

['once upon a time there were three title sisters;and their name were\n', 
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elise</span> </a>, '\n',
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\n',
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ';\n
and they lived at the bottom of a well.\n
']

The p node contains both text and element nodes, so the result comes back as a list.

Using children gives the same elements, but this time it returns an iterator rather than a list:

print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

Output:

<list_iterator object at 0x0000016B477884A8>
0 once upon a time there were three title sisters;and their name were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elise</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4 

5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 ;
    and they lived at the bottom of a well.

Use the descendants attribute to get descendant nodes. It returns a generator, and the output now also contains the span node: descendants recursively walks every child, yielding all the descendants.
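
The loop that produces the output below is omitted in the original; it simply mirrors the children example above:

print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)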

<generator object descendants at 0x0000029DA472D9E8>
0 once upon a time there were three title sisters;and their name were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elise</span>
</a>
2 

3 <span>Elise</span>
4 Elise
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9 

10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 ;
    and they lived at the bottom of a well.

(2) Parent and ancestor nodes

Call parent to get a node's direct parent:

print(soup.a.parent)

Output:

<p class="story">once upon a time there were three title sisters;and their name were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elise</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>

Clearly, the direct parent of the a node is the p node, so the content of that p node is printed here.

Calling parents selects all ancestors, up to and beyond the grandparent. The result is a generator; here we enumerate it into a list of index/content pairs, where each element is one of the a node's ancestors.

print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))

Output:

<class 'generator'>
[(0, <p class="story">once upon a time there were three title sisters;and their name were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elise</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>), 

(1, <body> <p class="story">once upon a time there were three title sisters;and their name were <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elise</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well. </p> <p class="story">...</p> </body>),

(2, <html lang="en"> <head> <meta charset="utf-8"/> <title>The Beautiful Suop</title> </head> <body> <p class="story">once upon a time there were three title sisters;and their name were <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elise</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well. </p> <p class="story">...</p> </body></html>),

(3, <html lang="en"> <head> <meta charset="utf-8"/> <title>The Beautiful Suop</title> </head> <body> <p class="story">once upon a time there were three title sisters;and their name were <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elise</span> </a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well. </p> <p class="story">...</p> </body></html>)]

(3) Sibling nodes

To get nodes at the same level: next_sibling and previous_sibling return the node's next and previous sibling elements respectively, while next_siblings and previous_siblings return all the siblings after and before the node.

from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>The Beautiful Suop</title>
</head>
<body>
<p class="story" >once upon a time there were three title sisters;and their name were

<a href="http://example.com/elsie" class="sister" id="link1">
    <span>Elise</span>
</a>
hello
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>
"""
soup = BeautifulSoup(html, 'lxml')
print("Next Sibling:", soup.a.next_sibling)
print("Prev Sibling:", soup.a.previous_sibling)
print("Next Siblings:", list(soup.a.next_siblings))
print("Prev Siblings:", list(soup.a.previous_siblings))

Output:

Next Sibling: 
hello

Prev Sibling: once upon a time there were three title sisters;and their name were


Next Siblings: ['\nhello\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\n', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ';\n    and they lived at the bottom of a well.\n']
Prev Siblings: ['once upon a time there were three title sisters;and their name were\n\n']

(4) Extracting information

For a single node, you can call string, attrs, and similar attributes directly to get its text and attributes. For generators that yield multiple nodes, convert them to a list, pick the node you want, and then call string, attrs, etc. on it to get that node's text and attributes.

from bs4 import BeautifulSoup

html = """
<html lang="en">
<body>
<p class="story" >once upon a time there were three title sisters;and their name were
    <a href="http://example.com/elsie" class="sister" id="link1">
        <span>Elise</span>
    </a>
</p>
"""
soup = BeautifulSoup(html, 'lxml')

print(soup.a.next_sibling.string)
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])

Output:

<p class="story">once upon a time there were three title sisters;and their name were
    <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elise</span>
</a>
</p>
['story']

 

3. Method selectors

♦ find_all()

find_all() queries all elements that match the given criteria. Pass in some attributes or text and it returns every matching element; it is very powerful.

find_all(name, attrs, recursive, text, **kwargs)

(1) name

Query elements by node name:

from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
        
    </div>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))

Output:

[
<ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li> <li class="element">Bar</li> </ul>
]
<class 'bs4.element.Tag'>

We call find_all() with the name parameter set to ul, which queries all ul nodes and returns a list; every element is of type bs4.element.Tag. We can then continue with a nested query to find the li nodes inside each of them:

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

Iterate over each li and get its text content:

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar

 

(2) attrs

Query by the attributes you pass in:

print(soup.find_all(attrs={'id': 'list-1'}))

Output:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

For common attributes such as id and class, you can pass them directly as keyword arguments without attrs. Since class is a Python keyword, add a trailing underscore: class_='element'.

print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))

Output:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, 
<li class="element">Bar</li>,
<li class="element">Jay</li>,
<li class="element">Foo</li>,
<li class="element">Bar</li>]

(3) text

The text parameter matches node text. You can pass either a plain string or a regular expression object:

import re

print(soup.find_all(text=re.compile('F')))

Output:

['Foo', 'Foo']
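
The plain-string form only returns exact text matches; a minimal sketch using the same soup object:

print(soup.find_all(text='Foo'))  # ['Foo', 'Foo']: the two li elements whose text is exactly 'Foo'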

♦ The find() method

The find() method returns a single element: the first one that matches.

print(soup.find(name='ul'))
print(soup.find(class_='list'))

Output:

<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>

There are many similar methods (a short sketch follows the list):

find_parent(): returns the parent node

find_parents(): returns all ancestor nodes

find_next_sibling(): returns the first sibling node after this one

find_next_siblings(): returns all sibling nodes after this one

find_previous_sibling(): returns the first sibling node before this one

find_previous_siblings(): returns all sibling nodes before this one

find_next(): returns the first node after this one that matches the criteria

find_all_next(): returns all nodes after this one that match the criteria

find_previous(): returns the first node before this one that matches the criteria

find_all_previous(): returns all nodes before this one that match the criteria
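
A minimal sketch of a couple of these, reusing the ul/li soup from above (the expected results in the comments are assumptions based on that HTML):

li = soup.find(class_='element')   # the first li ("Foo" in list-1)
print(li.find_next_sibling('li'))  # <li class="element">Bar</li>
print(li.find_parent('ul'))        # the enclosing ul with id="list-1"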

 

4. CSS selectors

To use CSS selectors, just call the select() method and pass in the corresponding CSS selector:

print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))

Output:

[<div class="panel-heading">
<h4>hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
♦ Nested selection

Iterate over each ul node and select the li nodes inside it:

for ul in soup.select('ul'):
    print(ul.select('li'))

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
♦ Getting attributes
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

Output:

list-1
list-1
list-2
list-2
♦ Getting text

To get the text, you can use get_text() in addition to string:

# Get the text
for li in soup.select('li'):
    print(li.get_text())
    print(li.string)

Output:

Foo
Foo
Bar
Bar
Jay
Jay
Foo
Foo
Bar
Bar

The lxml parsing library is recommended.

Node selection by tag name is limited in power but fast.

Use find() and find_all() to match a single node or multiple nodes.

If you are familiar with CSS, you can use select() for matching.

 
