Python's BeautifulSoup Library

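1. Introduction to BeautifulSoup

Like lxml, Beautiful Soup is an HTML/XML parser whose main job is parsing HTML/XML documents and extracting data from them. Beautiful Soup works on the HTML DOM (Document Object Model): it loads the whole document and builds the complete tree out of Python objects, so its time and memory overhead is noticeably higher and it is generally slower than using lxml directly. In exchange, Beautiful Soup makes parsing HTML simple and its API is very friendly. It supports CSS selectors, can use the HTML parser from the Python standard library, and can also use lxml's HTML and XML parsers. Beautiful Soup 3 is no longer developed, so new projects should use Beautiful Soup 4.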

Installation and documentation:

1. Installation

# Install Beautiful Soup
pip install beautifulsoup4

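# Install a parser
Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. One of them is lxml. Depending on your setup, you can install lxml with one of the following commands (easy_install is deprecated, so pip is usually the best choice today):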

$ apt-get install python3-lxml

$ easy_install lxml

$ pip install lxml

Another option is html5lib, a parser written in pure Python that parses pages the same way a web browser does. You can install html5lib with one of the following commands:

$ apt-get install python3-html5lib

$ easy_install html5lib

$ pip install html5lib

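The table below lists the main parsers with their advantages and disadvantages. The official documentation recommends lxml as the parser because it is faster. On Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2, installing lxml or html5lib is essential, because the HTML parser built into the standard library of those versions is not reliable enough.

Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; moderate speed; lenient with malformed documents	Not very lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Very fast; lenient with malformed documents	Requires an external C library
lxml XML parser	`BeautifulSoup(markup, ["lxml", "xml"])` or `BeautifulSoup(markup, "xml")`	Very fast; the only supported XML parser	Requires an external C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Most lenient; parses pages the way a browser does; produces valid HTML5	Very slow; requires an external Python package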
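
Since the available parsers depend on what is installed, one way to cope is to try them in order of preference. The following is a minimal sketch, not part of the original article: the helper name make_soup and its preference order are assumptions made for illustration, while FeatureNotFound is the exception bs4 raises when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    make_soup is a hypothetical helper written for this example.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # The requested parser library is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, parser_used = make_soup("<p>Hello, <b>world</b></p>")
print(parser_used, soup.b.string)
```

Because html.parser ships with Python, the loop always has a last resort to fall back on.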

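2. Beautiful Soup documentation (Chinese translation): https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

3. Comparison of common parsing tools:

Tool	Parsing speed	Difficulty of use
BeautifulSoup	Slowest	Easiest
lxml	Fast	Easy
Regular expressions	Fastest	Hardest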

2. BeautifulSoup in Detail

2.1 Basic Usage

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object,
# using lxml as the parser
soup = BeautifulSoup(html,"lxml")

print(soup.prettify())
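
To make the role of prettify() more concrete, here is a small sketch comparing it with plain str(). It uses the built-in html.parser so that it runs without lxml; the tiny document is made up for the example.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body><p>Hi <b>there</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

pretty = soup.prettify()  # indented output with one node per line; returns a str
compact = str(soup)       # serialised markup without added indentation

print(pretty)
print(compact)
```

prettify() is meant for reading and debugging; because it adds whitespace, str(soup) is usually the better choice when the markup will be processed further.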

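2.2 The Four Common BeautifulSoup Objects

Beautiful Soup converts a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds:

Tag
NavigableString
BeautifulSoup
Comment

2.2.1 The Tag class

A Tag corresponds to a single HTML tag in the document. Example:

# -*- coding: utf-8 -*-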
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup  = BeautifulSoup(html,'lxml')
print(soup.title)    # <title>The Dormouse's story</title>
print(soup.head)     # <head><title>The Dormouse's story</title></head>
print(soup.a)        # <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
print(type(soup.p))  # <class 'bs4.element.Tag'>
# Writing soup.<tag name> returns that tag as a bs4.element.Tag object.
# Note that it only finds the first matching tag in the whole document;
# finding every matching tag is covered later.
# A Tag has two important attributes, name and attrs:
print(soup.name)          # [document]  (the soup object itself is special: its name is "[document]")
print(soup.head.name)     # head  (for ordinary tags, the name is the tag's own name)
print(soup.p.attrs)       # {'class': ['title'], 'name': 'dromouse'}  (dictionary of the p tag's attributes)
print(soup.p['class'])    # ['title']  (look up an attribute value by name; get() does the same for attributes that exist)
print(soup.p.get('class'))  # ['title']
soup.p['class'] = 'newclass'  # attributes and content can also be modified
print(soup.p)             # <p class="newclass" name="dromouse"><b>The Dormouse's story</b></p>

Tag
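
The attribute lookups above differ in two ways that are easy to miss: class is a multi-valued attribute and comes back as a list, and tag[...] raises KeyError for a missing attribute while get() returns None. A short sketch with a made-up p tag, using html.parser so that it runs without lxml:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" id="intro">Hi</p>', "html.parser")
p = soup.p

classes = p["class"]            # multi-valued attribute: a list of strings
pid = p["id"]                   # single-valued attribute: a plain string
missing = p.get("lang")         # absent attribute: None, no exception
fallback = p.get("lang", "en")  # get() also accepts a default value

try:
    p["lang"]                   # subscript access on a missing attribute
    raised = False
except KeyError:
    raised = True

print(classes, pid, missing, fallback, raised)
```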

2.2.2 The NavigableString class

Once you have a tag, you can get the text inside it through tag.string. Example:

print(soup.p.string)#The Dormouse's story
print(type(soup.p.string))# <class 'bs4.element.NavigableString'>
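
A NavigableString behaves like an ordinary Python string but also remembers its place in the tree. The sketch below illustrates both sides; the markup is a cut-down version of the sample document and html.parser is used only so the example has no extra dependencies.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p><b>The Dormouse's story</b></p>", "html.parser")
text = soup.b.string

is_nav = isinstance(text, NavigableString)  # the bs4 string type
is_str = isinstance(text, str)              # which is a subclass of str
parent_name = text.parent.name              # it still knows its parent tag
plain = str(text)                           # a plain str, detached from the tree
upper = text.upper()                        # normal str methods work

print(is_nav, is_str, parent_name, upper)
```

Converting with str() is useful when the text is kept long after parsing, since a NavigableString holds a reference to the whole tree.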

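2.2.3 The BeautifulSoup class

The BeautifulSoup object represents the whole document. Most of the time you can treat it as a Tag object: it supports most of the methods for navigating and searching the tree. Because it does not correspond to a real HTML or XML tag, it has no real name or attributes. It is still sometimes handy to look at its .name, so the BeautifulSoup object is given the special .name value "[document]".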

soup.name
# '[document]'

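2.2.4 The Comment class

Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but a few special objects remain. One of them is the comment:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "lxml")  # name the parser explicitly to avoid the "no parser specified" warning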
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>
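
Because Comment is a subclass of NavigableString, a comment can be mistaken for ordinary text. One way to tell them apart is an isinstance check, sketched below with a made-up snippet; the string= argument used here assumes bs4 4.4.0 or later.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup("<p>Hello<!-- internal note --> world</p>", "html.parser")

# Collect only the comment nodes.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# Collect every string node, then drop the comments to keep the visible text.
visible = [s for s in soup.p.find_all(string=True) if not isinstance(s, Comment)]

print(comments)
print("".join(visible))
```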

2.2.5 Practice

# Navigating the tree by tag name: selection is fast, but when several identical tags exist only the first is returned
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my p" class="title"><b id="bbb" class="boldest">The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

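#1. Usage
from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')
# soup=BeautifulSoup(open('a.html'),'lxml')

print(soup.p) # if several identical tags exist, only the first is returned
print(soup.a) # if several identical tags exist, only the first is returned

#2. Get the tag name
print(soup.p.name)

#3. Get the tag attributes
print(soup.p.attrs)

#4. Get the tag content
print(soup.p.string) # the text if p contains exactly one string, otherwise None
print(soup.p.strings) # a generator over all the text under p
print(soup.p.text) # all the text under p
for line in soup.stripped_strings: # strip surrounding whitespace and skip blank strings
    print(line)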


'''
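If a tag has more than one child node, .string cannot tell which child's text to use, so it returns None; if the tag has exactly one child, .string returns that child's text. In the structure below, for example, soup.p.string is None, but soup.p.strings still yields all the text.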
<p id='list-1'>
    哈哈哈哈
    <a class='sss'>
        <span>
            <h1>aaaa</h1>
        </span>
    </a>
    <b>bbbbb</b>
</p>
'''

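#5. Nested selection
print(soup.head.title.string)
print(soup.body.a.string)


#6. Children and descendants
print(soup.p.contents) # a list of all direct children of p
print(soup.p.children) # an iterator over the direct children of p

for i,child in enumerate(soup.p.children):
    print(i,child)

print(soup.p.descendants) # a generator over all descendants: every node nested under p
for i,child in enumerate(soup.p.descendants):
    print(i,child)

#7. Parent and ancestors
print(soup.a.parent) # the parent of the a tag
print(soup.a.parents) # a generator over all ancestors of the a tag: its parent, the parent's parent, and so on


#8. Siblings
print('=====>')
print(soup.a.next_sibling) # the next sibling
print(soup.a.previous_sibling) # the previous sibling

print(list(soup.a.next_siblings)) # all following siblings (next_siblings is a generator)
print(soup.a.previous_siblings) # all preceding siblings (printed as a generator object)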
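
The difference between .string, .stripped_strings, and get_text() on a tag with several children can be checked directly. This is a small sketch on a simplified version of the nested structure described in the note above; the markup is made up for the example and html.parser is used so that it runs without lxml.

```python
from bs4 import BeautifulSoup

html = "<p id='list-1'>Intro <a class='sss'><span>aaaa</span></a> <b>bbbbb</b></p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

whole = p.string                        # None: p has several children
single = soup.a.string                  # 'aaaa': a -> span -> one string
pieces = list(p.stripped_strings)       # every text fragment, whitespace stripped
joined = p.get_text(" ", strip=True)    # the same fragments joined by a space

print(whole, single, pieces, joined)
```

When the structure of a tag is not known in advance, get_text() is usually the safer choice, since it never returns None.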


# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
#3. Searching the tree
#1) The two most commonly used search methods are find and find_all.
# find returns as soon as it finds the first matching tag, so it returns a single element.
# find_all collects every matching tag and returns them all.
# The most common usage is to pass the tag name and the attrs argument to pick out the tags you need.
soup = BeautifulSoup(html,'lxml')
aList = soup.find_all('a',attrs={'id':'link2'})
print(aList)#[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
# Alternatively, pass the attribute name directly as a keyword argument:
soup.find_all("a",id='link2')
#2) The select method
# The methods above make it easy to find elements, but CSS selectors are sometimes more convenient.
# To use CSS selector syntax, call select. Some common selectors:
#a) By tag name
print(soup.select('a'))#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
#b) By class name
# Prefix the class name with a dot (.), for example to find tags with class=sister:
print(soup.select('.sister'))#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
#c) By id
# Prefix the id with a hash sign (#):
print(soup.select('#link1'))#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
#d) Combined selectors
# Combining tag names, class names, and ids works the same way as in a CSS file. For example, to find the element with id link1 inside a p tag, separate the two with a space:
print(soup.select("p #link1"))#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
# To match direct children only, separate with >:
print(soup.select('head > title'))#[<title>The Dormouse's story</title>]
#e) By attribute
# Attributes can also be part of the selector, written in square brackets. The attribute belongs to the same node as the tag, so there must be no space between them, otherwise nothing matches:
print(soup.select('a[href="http://example.com/elsie"]'))#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
#f) When searching by class or id, you can also filter by tag name by putting the tag name before the class or id
print(soup.select('p.title'))#[<p class="title"><b>The Dormouse's story</b></p>]
print(soup.select('a#link1'))#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
#g) Getting the content
# select always returns a list; iterate over it and call get_text() on each element to get its text.
print(soup.select('title'))
print(soup.select('title')[0].get_text())

for title in soup.select('title'):
    print(title.get_text())

CSS selectors
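
As a complement to select(), the sketch below uses select_one(), which returns the first match or None instead of a list, together with a child combinator and an attribute suffix selector. The markup is a trimmed copy of the sample document; html.parser is used so the example runs without lxml, and the selectors assume a bs4 version that bundles Soup Sieve (4.7 or later).

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("a.sister")                   # first match only
ids = [a["id"] for a in soup.select("p.story > a")]   # direct children of p.story
ends = soup.select('a[href$="tillie"]')               # href ending in "tillie"
nothing = soup.select_one("div.missing")              # no match gives None

print(first.get_text(), ids, [a.get_text() for a in ends], nothing)
```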

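# Searching the tree: Beautiful Soup defines many search methods. This part focuses on two of them, find() and find_all(); the other methods take similar arguments and are used in a similar way.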
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my p" class="title"><b id="bbb" class="boldest">The Dormouse's story</b>
</p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""


from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')

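#1. Five kinds of filter: string, regular expression, list, True, function
#1.1 String: a tag name
print(soup.find_all('b'))

#1.2 Regular expression
import re
print(soup.find_all(re.compile('^b'))) # tags whose name starts with b: the result contains body and b

#1.3 List: Beautiful Soup returns everything that matches any item in the list. The code below finds all <a> and <b> tags in the document:
print(soup.find_all(['a','b']))

#1.4 True: matches any value. The code below finds every tag, but returns no string nodes
print(soup.find_all(True))
for tag in soup.find_all(True):
    print(tag.name)

#1.5 Function: if no other filter fits, define a function that takes a single element as its argument and returns True when the element matches and False otherwise
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))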


#2. find_all( name , attrs , recursive , text , **kwargs )
#2.1 name: the value can be any kind of filter: a string, a regular expression, a list, a function, or True.
print(soup.find_all(name=re.compile('^t')))

#2.2 Keyword arguments: key=value, where value can be a filter: a string, a regular expression, a list, or True.
print(soup.find_all(id=re.compile('my')))
print(soup.find_all(href=re.compile('lacie'),id=re.compile(r'\d'))) # note: to filter on class, use class_
print(soup.find_all(id=True)) # tags that have an id attribute

# Some attribute names cannot be used as keyword arguments, such as HTML5 data-* attributes:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>','lxml')
# data_soup.find_all(data-foo="value") # SyntaxError: a keyword argument name cannot contain a hyphen
# But you can pass a dictionary through the attrs argument of find_all() to search on such attributes:
print(data_soup.find_all(attrs={"data-foo": "value"}))
# [<div data-foo="value">foo!</div>]

#2.3 Searching by class: the keyword is class_ (class is a reserved word in Python), and the value can be any of the five filters
print(soup.find_all('a',class_='sister')) # a tags with the class sister
print(soup.find_all('a',class_='sister ssss')) # matches the exact class string "sister ssss"; the same classes in a different order do not match
print(soup.find_all(class_=re.compile('^sis'))) # all tags with a class that starts with "sis"

#2.4 attrs
print(soup.find_all('p',attrs={'class':'story'}))

#2.5 text: the value can be a string, a list, True, or a regular expression (bs4 4.4.0 and later also accept this argument under the name string)
print(soup.find_all(text='Elsie'))
print(soup.find_all('a',text='Elsie'))

#2.6 limit: searching a very large tree can be slow. If you do not need every result, use limit to cap the number returned. It works like the LIMIT keyword in SQL: the search stops as soon as the number of results reaches the limit
print(soup.find_all('a',limit=2))

#2.7 recursive: tag.find_all() searches all descendants of the tag. To search only the tag's direct children, pass recursive=False.
print(soup.html.find_all('a'))
print(soup.html.find_all('a',recursive=False))

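'''
Calling a tag is like calling find_all()
find_all() is probably the most commonly used search method in Beautiful Soup, so there is a shortcut for it. A BeautifulSoup object or a tag object can be called like a function, with the same result as calling its find_all() method. These two lines are equivalent:
soup.find_all("a")
soup("a")
These two lines are also equivalent:
soup.title.find_all(text=True)
soup.title(text=True)
'''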
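#3. find( name , attrs , recursive , text , **kwargs )
# find_all() returns every matching tag in the document, but sometimes we only want one result. If the document has a single <body> tag, for instance, scanning for all of them is wasteful, and calling find_all() with limit=1 is clumsier than calling find() directly. The two lines below are nearly equivalent:

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]
soup.find('title')
# <title>The Dormouse's story</title>

# The only difference is that find_all() returns a list containing a single element, while find() returns the element itself.
# When nothing matches, find_all() returns an empty list and find() returns None.
print(soup.find("nosuchtag"))
# None

# soup.head.title is shorthand for navigating by tag name. It works by calling find() on the current tag repeatedly: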

soup.head.title
# <title>The Dormouse's story</title>
soup.find("head").find("title")
# <title>The Dormouse's story</title>

find()/find_all()
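
Since find() returns None when nothing matches, chaining attribute access onto its result can raise AttributeError or TypeError. A minimal sketch of guarding against that, together with a check of recursive=False; the helper first_href and the markup are assumptions made for this example.

```python
from bs4 import BeautifulSoup

html = "<html><body><div id='main'><a href='/x'>x</a></div></body></html>"
soup = BeautifulSoup(html, "html.parser")


def first_href(root, tag_name):
    """Return the href of the first <tag_name> under root, or None.

    first_href is a hypothetical helper written for this example.
    """
    tag = root.find(tag_name)
    if tag is None:          # find() returns None when nothing matches
        return None
    return tag.get("href")   # get() returns None if href is absent


href = first_href(soup, "a")
no_href = first_href(soup, "img")

deep = soup.body.find_all("a")                      # searches all descendants
shallow = soup.body.find_all("a", recursive=False)  # direct children of body only

print(href, no_href, len(deep), shallow)
```

The a tag sits inside a div, so it is a descendant of body but not a direct child, which is why the recursive=False search comes back empty.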

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}
url ='http://www.weather.com.cn/textFC/hb.shtml'
def parse_page(url):
    data = []
    response = requests.request(method='get',url =url,headers=headers)
    text = response.content.decode('utf-8')
    soup = BeautifulSoup(text,'html5lib')
    comMidtab = soup.find(name='div',class_ = 'conMidtab')
    tables = comMidtab.find_all('table')
    for table in tables:
        trs = table.find_all('tr')[2:]
        for index,tr in enumerate(trs):
            tds = tr.find_all('td')
            if index == 0:
                city = list(tds[1].stripped_strings)[0]
            else:
                city = list(tds[0].stripped_strings)[0]
            min_temp = list(tds[-2].stripped_strings)[0]
            data.append({'city':city,'min_temp':min_temp})
    return data

def main():
    AllData =[]
    urls = [
        'http://www.weather.com.cn/textFC/hb.shtml',
        'http://www.weather.com.cn/textFC/db.shtml',
        'http://www.weather.com.cn/textFC/hd.shtml',
        'http://www.weather.com.cn/textFC/hz.shtml',
        'http://www.weather.com.cn/textFC/hn.shtml',
        'http://www.weather.com.cn/textFC/xb.shtml',
        'http://www.weather.com.cn/textFC/xn.shtml',
        'http://www.weather.com.cn/textFC/gat.shtml'
    ]
    for url in urls:
        datas = parse_page(url)
        for data in datas:
            AllData.append(data)
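    # Sort by minimum temperature; convert to int so the order is numeric
    # rather than lexicographic (the scraped values are strings)
    AllData.sort(key=lambda data: int(data['min_temp']))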
    print(AllData)


if __name__=='__main__':
    main()

Scraping weather data from weather.com.cn
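
The scraped temperatures are strings, so the sort key matters. The sketch below shows the difference offline; the table is hypothetical sample data standing in for one scraped page, not real output from weather.com.cn.

```python
from bs4 import BeautifulSoup

# Hypothetical rows: city name and minimum temperature as text.
sample = """
<table>
<tr><td>Beijing</td><td>-3</td></tr>
<tr><td>Harbin</td><td>-12</td></tr>
<tr><td>Guangzhou</td><td>9</td></tr>
<tr><td>Haikou</td><td>18</td></tr>
</table>
"""
soup = BeautifulSoup(sample, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    city, min_temp = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append({"city": city, "min_temp": min_temp})

# Sorting the raw strings compares character by character.
by_text = [r["city"] for r in sorted(rows, key=lambda r: r["min_temp"])]
# Converting to int gives the numeric order.
by_number = [r["city"] for r in sorted(rows, key=lambda r: int(r["min_temp"]))]

print(by_text)
print(by_number)
```

With the string key, "18" sorts before "9", which is why converting to int before sorting is the safer choice.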

>>>> To be continued

posted @ 2019-03-22 16:40 enjoyzier
