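The Beautiful Soup Parsing Library
Introduction to Beautiful Soup
Beautiful Soup is a Python library for parsing HTML and XML. It is a powerful parsing tool that works from features of the page such as its structure and node attributes. With it we no longer need to write complex regular expressions; a few statements are enough to extract an element from a page, which makes parsing work much more efficient. Beautiful Soup does depend on an underlying parser, however. We generally use the lxml parser, which can parse both HTML and XML, is fast, and is tolerant of malformed markup.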
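As a minimal sketch of the parser dependency described above: the second argument to BeautifulSoup names the parser. The snippet assumes Python's built-in 'html.parser' so that it runs without lxml installed; the markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
markup = (
    "<html><head><title>Demo page</title></head>"
    "<body><p id='intro'>Hello</p></body></html>"
)

# 'html.parser' ships with Python; pass 'lxml' instead if lxml is installed.
soup = BeautifulSoup(markup, "html.parser")

title_text = soup.title.string              # text of the <title> node
intro_text = soup.find(id="intro").string   # text of the node with id="intro"
print(title_text, intro_text)
```

Different parsers can repair broken markup differently, so results on malformed pages may vary with the parser chosen.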
Using Beautiful Soup
A simple example

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup

    # Initialize the BeautifulSoup object; the parser also repairs the incomplete HTML string
    soup = BeautifulSoup(html, 'lxml')
    # Print the parsed document in a standard, indented format
    print(soup.prettify())
    # Select the title node and get its text
    print(soup.title.string)

Output:

    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        Elsie
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ; and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>
    The Dormouse's story
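Node selectors: a node selector picks a node element by calling the node name directly as an attribute; calling the string attribute on the result then returns the text inside the node. This style of selection is fast and is commonly used when the page has a clear node hierarchy. It comes in the following forms:
- Element selection: select a node by its element name.
- Nested selection: every result returned by a node selector is of type bs4.element.Tag, so we can keep calling node names on it to select further. For example, after getting the head node element, we can call title on it to select the title node element inside.
- Related selection: sometimes the desired node element cannot be selected in a single step. In that case we first select one node element and then, using it as a reference point, select its children, parent, siblings, and so on.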
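One detail worth illustrating about the string attribute mentioned above: it only returns text when the node has a single string inside it. The sketch below uses made-up markup and the built-in 'html.parser' (an assumption, chosen so the snippet runs without lxml) to contrast string with get_text().

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <p> contains a text node plus a nested <b> tag.
markup = "<p>Hello <b>world</b></p>"
soup = BeautifulSoup(markup, "html.parser")

single = soup.b.string         # <b> has exactly one text child
multiple = soup.p.string       # <p> has two children, so .string is None
full_text = soup.p.get_text()  # concatenates all text nested under <p>

print(repr(single), repr(multiple), repr(full_text))
```

When a node may contain mixed text and tags, get_text() is usually the safer choice.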
Element selection

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup

    # Initialize the BeautifulSoup object
    soup = BeautifulSoup(html, 'lxml')
    # Get the title node
    print(soup.title)
    # Check the type of the title node (the tag together with its text): bs4.element.Tag
    print(type(soup.title))
    # Get the text of the title node
    print(soup.title.string)
    # Get the node name through the name attribute
    print(soup.title.name)
    # Get the head node
    print(soup.head)
    # Get the first p node
    print(soup.p)
    # Get all attributes of the first p node; returns a dict
    print(soup.p.attrs)
    # Get the value of the name attribute
    print(soup.p.attrs['name'])
    # Get the value of the class attribute (a list, because class may hold several values)
    print(soup.p['class'])
    # Get the text of the p node (soup.p is the first p node, so this is the first p node's text)
    print(soup.p.string)

Output:

    <title>The Dormouse's story</title>
    <class 'bs4.element.Tag'>
    The Dormouse's story
    title
    <head><title>The Dormouse's story</title></head>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    {'class': ['title'], 'name': 'dromouse'}
    dromouse
    ['title']
    The Dormouse's story
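The attribute access shown above can be sketched a little further. This example uses a shortened, made-up p node and the built-in 'html.parser' (an assumption, not the lxml parser used in the article) to show square-bracket access next to Tag.get(), which returns a default instead of raising KeyError when the attribute is absent.

```python
from bs4 import BeautifulSoup

# Hypothetical node with a two-valued class attribute.
markup = '<p class="title main" name="dromouse"><b>Story</b></p>'
soup = BeautifulSoup(markup, "html.parser")
p = soup.p

classes = p["class"]             # class is multi-valued, so a list comes back
name_attr = p["name"]            # ordinary attributes come back as strings
missing = p.get("id")            # absent attribute: None instead of KeyError
fallback = p.get("id", "no-id")  # or a caller-supplied default
inner_text = p.string            # <p> has one child <b> with one string

print(classes, name_attr, missing, fallback, inner_text)
```

The last line also shows why soup.p.string works in the example above: when a tag's only child is another tag with a single string, the parent reports that same string.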
Nested selection

    html = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>'''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    # Nested selection of the title node element
    print(soup.head.title)
    # Check the type of the title node element: bs4.element.Tag
    print(type(soup.head.title))
    # Get the text of the title node
    print(soup.head.title.string)

Output:

    <title>The Dormouse's story</title>
    <class 'bs4.element.Tag'>
    The Dormouse's story
Related selection:
Children and descendants

    html = """
    <html><head><title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """

Getting the direct children of the p node:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    # Get all direct children of the p node: the children attribute returns an
    # iterator, so a for loop is enough to walk through it
    print(soup.p.children)
    for i, child in enumerate(soup.p.children):
        print(i, child)

Output (whitespace-only text nodes print as blank lines):

    <list_iterator object at 0x0000000002E8C4E0>
    0
        Once upon a time there were three little sisters; and their names were

    1 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    2

    3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    4

    5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    6
        and they lived at the bottom of a well.

Getting all descendants of the p node: the descendants attribute returns a generator, which we iterate over with a for loop.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    # Get all descendant nodes
    print(soup.p.descendants)
    # The result is a generator, so iterate over it with for
    for i, child in enumerate(soup.p.descendants):
        print(i, child)

Output:

    <generator object descendants at 0x0000000001EF1468>
    0
        Once upon a time there were three little sisters; and their names were

    1 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    2 Elsie
    3

    4 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    5 Lacie
    6

    7 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    8 Tillie
    9
        and they lived at the bottom of a well.
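The outputs above mix tags with text nodes, including whitespace-only ones. A small sketch of one way to keep only the tag children follows; it assumes the built-in 'html.parser' and a shortened, made-up version of the story markup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, hypothetical version of the story paragraph.
markup = """<p class="story">
    Once upon a time
    <a id="link1">Elsie</a>
    <a id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(markup, "html.parser")

# children yields both Tag objects and text nodes; keep only the tags.
tag_children = [child for child in soup.p.children if isinstance(child, Tag)]
ids = [tag["id"] for tag in tag_children]

# descendants also walks into each <a>, so it sees the two inner strings too.
child_count = len(list(soup.p.children))
descendant_count = len(list(soup.p.descendants))

print(ids, child_count, descendant_count)
```

Filtering with isinstance is only one option; calling find_all() on the node with recursive=False would also return just the tag children.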
Getting the parent and ancestor nodes

    html = """
    <html><head><title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """
    # Get the parent node element: use the parent attribute
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(soup.a.parent)

Output:

    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
        and they lived at the bottom of a well.
    </p>

Getting the ancestor node elements: use the parents attribute.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(type(soup.a.parents))
    print(list(enumerate(soup.a.parents)))

Output:

    <class 'generator'>
    [(0, <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
        and they lived at the bottom of a well.
    </p>), (1, <body>
    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    </body>), (2, <html><head><title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    </body></html>), (3, <html><head><title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    </body></html>)]
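The full ancestor dump above is hard to read, and its last two entries print identically. Listing only the node names makes the chain clearer: the final ancestor is the BeautifulSoup document object itself, whose name is '[document]'. The sketch below assumes the built-in 'html.parser' and a compact, made-up document.

```python
from bs4 import BeautifulSoup

# Compact, hypothetical document with one link inside a paragraph.
markup = (
    "<html><head><title>T</title></head>"
    "<body><p class='story'><a id='link1'>Elsie</a></p></body></html>"
)
soup = BeautifulSoup(markup, "html.parser")

# parents is a generator that walks from the direct parent up to the document.
ancestor_names = [node.name for node in soup.a.parents]
direct_parent = soup.a.parent.name

print(ancestor_names)
```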
Getting sibling nodes (nodes at the same level)

    html = """
    <html>
    <body>
    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
        Hello
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    # Get the next node at the same level
    print('Next Sibling:', soup.a.next_sibling)
    # Get the previous node at the same level
    print('Prev Sibling:', soup.a.previous_sibling)
    # Get all following nodes at the same level
    print('Next Siblings:', list(enumerate(soup.a.next_siblings)))
    # Get all preceding nodes at the same level
    print('Prev Siblings:', list(enumerate(soup.a.previous_siblings)))

Output:

    Next Sibling:
        Hello

    Prev Sibling:
        Once upon a time there were three little sisters; and their names were

    Next Siblings: [(0, '\n    Hello\n    '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, '\n    and\n    '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n    and they lived at the bottom of a well.\n')]
    Prev Siblings: [(0, '\n    Once upon a time there were three little sisters; and their names were\n    ')]
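As the output above shows, next_sibling often lands on a text node rather than a tag. The sketch below shows one way to skip straight to the next tag with find_next_sibling() and find_next_siblings(); it assumes the built-in 'html.parser' and a shortened, made-up fragment.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical fragment with text between the links.
markup = """<p><a id="link1">Elsie</a>
Hello
<a id="link2">Lacie</a>
and
<a id="link3">Tillie</a></p>"""
soup = BeautifulSoup(markup, "html.parser")
first = soup.a

raw_next = first.next_sibling                  # the text node holding "Hello"
next_tag = first.find_next_sibling("a")        # the next <a> tag, skipping text
later_ids = [a["id"] for a in first.find_next_siblings("a")]

print(repr(raw_next), next_tag["id"], later_ids)
```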
Extracting information (text, attributes, and so on)

    html = """
    <html>
    <body>
    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    </p>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print('Next Siblings:')
    print(type(soup.a.next_sibling))
    print(soup.a.next_sibling)
    # Get the text of the element that follows the a node
    print(soup.a.next_sibling.string)
    print('Parents:')
    print(type(soup.a.parents))
    print(list(soup.a.parents)[0])
    # Get the class attribute of the a node's first ancestor (its parent p node)
    print(list(soup.a.parents)[0].attrs['class'])

Output:

    Next Siblings:
    <class 'bs4.element.Tag'>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    Lacie
    Parents:
    <class 'generator'>
    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    </p>
    ['story']
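Method selectors: the node selectors above are fast, but for more complex selections they become cumbersome and inflexible. Methods such as find_all() and find() cover those cases: pass in the appropriate parameters and they return the elements you need. The find_all() method returns all elements that match the given conditions. Its API is: find_all(name, attrs, recursive, text, **kwargs)
- name: query elements by node name.
- attrs: query node elements by attribute, passed as a dict of attribute names and values.
- text: match the text of nodes; the argument can be a string or a regular expression. (Since Beautiful Soup 4.4.0 this parameter is also available under the name string.)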
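The parameters listed above can be sketched together in one place. The example also shows the recursive and limit arguments and the class_ keyword shortcut, which the list does not describe. It assumes the built-in 'html.parser' and a small, made-up list.

```python
from bs4 import BeautifulSoup

# Hypothetical list markup.
markup = """
<ul class="list" id="list-1">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
  <li class="element">Jay</li>
</ul>
"""
soup = BeautifulSoup(markup, "html.parser")

by_name = soup.find_all(name="li")                   # every <li>
by_attrs = soup.find_all(attrs={"id": "list-1"})     # match on an attribute dict
by_class = soup.find_all(class_="element", limit=2)  # keyword shortcut, capped at 2

# recursive=False restricts the search to direct children of the caller.
direct_li = soup.ul.find_all("li", recursive=False)
top_level_li = soup.find_all("li", recursive=False)  # no <li> sits directly under soup

print(len(by_name), by_attrs[0]["id"], len(by_class), len(direct_li), top_level_li)
```

The trailing underscore in class_ is needed because class is a reserved word in Python.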
The find_all() method: return all matching elements
Querying elements by node name (name)

    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        <div class="panel_body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    # Query all ul nodes; the result is a list
    print(soup.find_all(name='ul'))
    # Check the element type: bs4.element.Tag
    print(type(soup.find_all(name='ul')[0]))
    # Tag objects support nested queries
    for ul in soup.find_all(name='ul'):
        print(ul.find_all(name='li'))
        # Walk the li nodes inside this ul and print their text
        for li in ul.find_all(name='li'):
            print(li.string)

Output:

    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    <class 'bs4.element.Tag'>
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    Foo
    Bar
    Jay
    [<li class="element">Foo</li>, <li class="element">Bar</li>]
    Foo
    Bar
Querying node elements by attribute (attrs)

    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        <div class="panel_body">
            <ul class="list" id="list-1" name='elements'>
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="elements">Foo</li>
                <li class="elements">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    # Match on the id attribute
    print(soup.find_all(attrs={'id': 'list-1'}))
    # Match on the name attribute
    print(soup.find_all(attrs={'name': 'elements'}))
    # Match on the class attribute
    print(soup.find_all(attrs={'class': 'elements'}))

Output:

    [<ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]
    [<ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]
    [<li class="elements">Foo</li>, <li class="elements">Bar</li>]
Matching the text of nodes (text)

    html = '''
    <div class="panel">
        <div class="panel-body">
            <a>Hello,this is a link</a>
            <a>Hello,this is a link, too</a>
        </div>
    </div>'''
    import re
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    # Return every text node whose content matches the regular expression
    print(soup.find_all(text=re.compile('link')))

Output:

    ['Hello,this is a link', 'Hello,this is a link, too']
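Note that the result above is a list of text nodes, not tags. A sketch of how to get the enclosing tags instead: combine a tag name with the text filter. The example uses the string keyword, which assumes Beautiful Soup 4.4.0 or later, along with the built-in 'html.parser' and made-up markup.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: two links and a span that also mentions "link".
markup = (
    "<div><a>Hello, this is a link</a>"
    "<a>Another anchor</a>"
    "<span>link in a span</span></div>"
)
soup = BeautifulSoup(markup, "html.parser")

# With only a text filter, find_all returns the matching text nodes themselves.
strings = soup.find_all(string=re.compile("link"))

# With a tag name as well, it returns the tags whose .string matches.
tags = soup.find_all("a", string=re.compile("link"))

print(strings, tags)
```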
The find() method: return the first matching element

    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        <div class="panel_body">
            <ul class="list" id="list-1" name='elements'>
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="elements">Foo</li>
                <li class="elements">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    # Return the first ul node
    print(soup.find(name='ul'))
    # The result is a single Tag, not a list
    print(type(soup.find(name='ul')))
    # Return the first node whose class contains "list"
    print(soup.find(class_='list'))

Output:

    <ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <class 'bs4.element.Tag'>
    <ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
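One practical difference between find() and find_all() is what they return when nothing matches. A brief sketch, assuming the built-in 'html.parser' and made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup that contains no <table>.
markup = '<ul class="list" id="list-1"><li class="element">Foo</li></ul>'
soup = BeautifulSoup(markup, "html.parser")

first_li = soup.find("li")            # a single Tag
missing = soup.find("table")          # no match: None
all_missing = soup.find_all("table")  # no match: an empty list

# Guard before chaining, otherwise None.get_text() raises AttributeError.
table_text = missing.get_text() if missing is not None else ""

print(first_li.string, missing, all_missing, repr(table_text))
```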
CSS selectors: simply call the select() method and pass in the appropriate CSS selector.

    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1" >
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    # Nodes with class panel-heading inside nodes with class panel
    print(soup.select('.panel .panel-heading'))
    # All li nodes under all ul nodes
    print(soup.select('ul li'))
    # Nodes with class element inside the node with id list-2
    print(soup.select('#list-2 .element'))
    # Each result is a bs4.element.Tag
    print(type(soup.select('ul')[0]))

Output:

    [<div class="panel-heading">
    <h4>Hello</h4>
    </div>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]
    <class 'bs4.element.Tag'>

Nested selection

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    for ul in soup.select('ul'):
        print(ul.select('li'))

Output:

    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]

Getting attributes

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    for ul in soup.select('ul'):
        print(ul['id'])

Output:

    list-1
    list-2

Getting text

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    for li in soup.select('li'):
        # Two ways to get the text
        print('Get Text:', li.get_text())
        print('String:', li.string)

Output:

    Get Text: Foo
    String: Foo
    Get Text: Bar
    String: Bar
    Get Text: Jay
    String: Jay
    Get Text: Foo
    String: Foo
    Get Text: Bar
    String: Bar
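Beyond the descendant selectors used above, a few other selector forms can be sketched, together with select_one(), which returns only the first match. The example assumes a Beautiful Soup version whose select() is backed by the soupsieve package (4.7 or later), the built-in 'html.parser', and made-up markup.

```python
from bs4 import BeautifulSoup

# Hypothetical panel markup.
markup = """
<div class="panel">
  <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
  <ul class="list list-small" id="list-2">
    <li class="element">Baz</li>
  </ul>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")

first_li = soup.select_one("#list-1 li")    # first match only, as a Tag
direct_uls = soup.select("div.panel > ul")  # child combinator: direct children
small_items = soup.select("ul.list.list-small li")  # node carrying both classes
by_attr = soup.select('ul[id="list-2"]')    # attribute selector
no_match = soup.select_one("table")         # None when nothing matches

print(first_li.get_text(), len(direct_uls), [li.get_text() for li in small_items])
```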