Python Web Scraping: Using the Beautiful Soup Parsing Library (Part 5)
Beautiful Soup: Introduction
A third-party Python library for extracting data from HTML or XML.
Official site: http://www.crummy.com/software/BeautifulSoup/
Installation: pip install beautifulsoup4
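Beautiful Soup: Syntax
soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')
First argument: the HTML document, as a string
Second argument: the HTML parser to use
Third argument: the encoding of the HTML document (only used when the document is passed as bytes; it is ignored for a str)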
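The three constructor arguments can be tried out with a minimal sketch like the one below. The sample markup and the variable names are assumptions made up for illustration, and the built-in 'html.parser' is used so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html_text = "<html><head><title>Demo</title></head><body><p>你好</p></body></html>"

# Case 1: the markup is already a str, so no encoding argument is needed.
soup_from_str = BeautifulSoup(html_text, "html.parser")

# Case 2: the markup is raw bytes; from_encoding tells Beautiful Soup
# which encoding to use when decoding them.
html_bytes = html_text.encode("gbk")
soup_from_bytes = BeautifulSoup(html_bytes, "html.parser", from_encoding="gbk")

print(soup_from_str.title.string)
print(soup_from_bytes.p.string)
```

Passing from_encoding only makes sense in the second case, where Beautiful Soup has to decode the bytes itself.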
Beautiful Soup: Usage
Tag selector operations
Note: a tag selector returns only the first matching tag; this is how tag selectors behave.
Selecting elements
from bs4 import BeautifulSoup

# Sample markup copied from a live page. The fourth <li> is cut off
# inside its href attribute in the source and is kept as-is.
html_doc = '''
<div class="container">
<a href="/pc/home?sign=360_79aabe15" class="logo"></a>
<nav id="nnav" data-mod="nnav">
<div class="nnav-wrap">
<ul class="nnav-items" id="nnav_main">
<li data-index="0"><a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike">推荐<span></span></a></li><li data-index="1"><a class="nnav-item" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank" data-ch="good_safe2toera">新时代<span></span></a></li><li data-index="2"><a class="nnav-item" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank" data-ch="fun">娱乐<span></span></a></li><li data-index="3"><a class="nnav-item" href="/pc/home?
data-index="4"><a class="nnav-item" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank" data-ch="economy">财经<span></span></a></li>
'''
soup = BeautifulSoup(html_doc, 'lxml')
# Auto-complete the HTML and return it in standard HTML layout
print(soup.prettify())
# Print the first <a> tag
print(soup.a)
# Print the first <span> tag
print(soup.span)
The output is as follows (the line break inside one href comes from the sample markup being cut off in that attribute):
<html>
 <body>
  <div class="container">
   <a class="logo" href="/pc/home?sign=360_79aabe15">
   </a>
   <nav data-mod="nnav" id="nnav">
    <div class="nnav-wrap">
     <ul class="nnav-items" id="nnav_main">
      <li data-index="0">
       <a class="nnav-item" data-ch="youlike" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank">
        推荐
        <span>
        </span>
       </a>
      </li>
      <li data-index="1">
       <a class="nnav-item" data-ch="good_safe2toera" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank">
        新时代
        <span>
        </span>
       </a>
      </li>
      <li data-index="2">
       <a class="nnav-item" data-ch="fun" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank">
        娱乐
        <span>
        </span>
       </a>
      </li>
      <li data-index="3">
       <a class="nnav-item" href="/pc/home?
data-index=">
       </a>
       <a class="nnav-item" data-ch="economy" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank">
        财经
        <span>
        </span>
       </a>
      </li>
     </ul>
    </div>
   </nav>
  </div>
 </body>
</html>
<a class="logo" href="/pc/home?sign=360_79aabe15"></a>
<span></span>
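To make the "first match only" behaviour of tag selectors concrete, here is a small sketch. The markup is a made-up example, not taken from the page above. It compares attribute-style access with find() and find_all(), and shows what happens when the tag does not exist.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <a> tags and no <table> tag.
html_doc = '<div><a href="/first">one</a><a href="/second">two</a></div>'
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a             # attribute access: the first <a> only
same_link = soup.find("a")      # the explicit equivalent
all_links = soup.find_all("a")  # every <a> in the document
missing = soup.table            # no such tag, so this is None

print(first_link["href"])
print(first_link is same_link)
print(len(all_links))
print(missing)
```

Because a missing tag yields None rather than an error, it is worth checking the result before reading attributes from it.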
Getting the name
Getting attributes
Getting the content
from bs4 import BeautifulSoup

# Same sample markup as above; the fourth <li> is cut off in the source.
html_doc = '''
<div class="container">
<a href="/pc/home?sign=360_79aabe15" class="logo"></a>
<nav id="nnav" data-mod="nnav">
<div class="nnav-wrap">
<ul class="nnav-items" id="nnav_main">
<li data-index="0"><a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike">推荐<span></span></a></li><li data-index="1"><a class="nnav-item" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank" data-ch="good_safe2toera">新时代<span></span></a></li><li data-index="2"><a class="nnav-item" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank" data-ch="fun">娱乐<span></span></a></li><li data-index="3"><a class="nnav-item" href="/pc/home?
data-index="4"><a class="nnav-item" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank" data-ch="economy">财经<span></span></a></li>
'''
soup = BeautifulSoup(html_doc, 'lxml')
# Print the name of the first <a> tag
print(soup.a.name)
# Print the class attribute of the first <a> tag; both forms are equivalent
print(soup.a.attrs['class'])
print(soup.a['class'])
# Print the content of the first <a> tag (None here, because the tag is empty)
print(soup.a.string)
The output is as follows:
a
['logo']
['logo']
None
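The None printed for .string can be surprising, so the following sketch contrasts .string with get_text() and shows a safe way to read an attribute that may be absent. The markup is a hypothetical example chosen to cover each case.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a <p> with mixed content and an empty <a>.
html_doc = '<p class="intro lead" id="p1">Hello <b>world</b></p><a class="logo" href="/home"></a>'
soup = BeautifulSoup(html_doc, "html.parser")

p = soup.p
print(p.name)          # tag name
print(p["class"])      # class is multi-valued, so a list is returned
print(p["id"])         # single-valued attributes come back as str
print(p.get("title"))  # .get() returns None instead of raising KeyError
print(p.string)        # None: <p> has more than one child
print(p.get_text())    # concatenated text of all descendants
print(p.b.string)      # a tag with a single text child has a .string
print(soup.a.string)   # None: the tag is empty
```

In short, .string only works when a tag has exactly one child, while get_text() always returns a (possibly empty) string.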
Nested selection
from bs4 import BeautifulSoup

html_doc = '''
<a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike"><span>推荐</span></a>
'''
soup = BeautifulSoup(html_doc, 'lxml')
# Select the <span> nested inside the first <a> and print its text
print(soup.a.span.string)
The output is as follows:
推荐
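Chained selection works because each step returns a Tag, but the chain breaks as soon as one step finds nothing. The sketch below, using a shortened made-up version of the markup, shows one way to guard against that.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the markup above.
html_doc = '<a class="nnav-item" href="/pc/home"><span>推荐</span></a>'
soup = BeautifulSoup(html_doc, "html.parser")

# Each step returns a Tag, so attribute access can be chained.
label = soup.a.span.string
print(label)

# <a> has no <b> child, so soup.a.b is None; chaining further
# (for example soup.a.b.string) would raise AttributeError.
bold = soup.a.b
bold_text = bold.string if bold is not None else None
print(bold_text)
```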
Child and descendant node operations
Getting all child nodes
from bs4 import BeautifulSoup

html = '''
<div class="bc">
<span class="fl" style="padding-top: 1px;"><a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105" height="48" alt="新东方在线网络课堂"></a></span>
<span class="fl" style="padding-top: 6px;">
<a href="http://cet4.koolearn.com/" target="_blank" rel="nofollow" class="ky">四级</a>
<a title="新东方在线网络课堂" href="http://www.koolearn.com/" target="_self">新东方在线</a> > <a title="四级网络课堂" href="http://cet4.koolearn.com/" target="_self">四级</a> > <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文
</span>
<a href="http://www.xdf.cn/" target="_blank" rel="nofollow" class="fr logo_p2"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208" height="24"></a>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# Method 1: .contents returns the direct children as a list
print(soup.div.contents)
# Method 2: .children returns an iterator over the same children
print(soup.div.children)
for i, child in enumerate(soup.div.children):
    print(i, child)
The output is as follows (the children at indices 0, 2, 4 and 6 are the newline text nodes between the tags, so nothing visible follows those indices):
['\n', <span class="fl" style="padding-top: 1px;"><a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>, '\n', <span class="fl" style="padding-top: 6px;">
<a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四级</a>
<a href="http://www.koolearn.com/" target="_self" title="新东方在线网络课堂">新东方在线</a> > <a href="http://cet4.koolearn.com/" target="_self" title="四级网络课堂">四级</a> > <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文
</span>, '\n', <a class="fr logo_p2" href="http://www.xdf.cn/" rel="nofollow" target="_blank"><img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/></a>, '\n']
<list_iterator object at 0x0000000002E498D0>
0
1 <span class="fl" style="padding-top: 1px;"><a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>
2
3 <span class="fl" style="padding-top: 6px;">
<a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四级</a>
<a href="http://www.koolearn.com/" target="_self" title="新东方在线网络课堂">新东方在线</a> > <a href="http://cet4.koolearn.com/" target="_self" title="四级网络课堂">四级</a> > <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文
</span>
4
5 <a class="fr logo_p2" href="http://www.xdf.cn/" rel="nofollow" target="_blank"><img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/></a>
6
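Since the whitespace between tags shows up as children too, a common follow-up is to keep only the Tag children. The sketch below is one way to do that, assuming a simplified made-up version of the breadcrumb markup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Simplified, hypothetical version of the markup above.
html = '''
<div class="bc">
<span class="fl">first</span>
<span class="fl">second</span>
<a href="http://www.xdf.cn/">third</a>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# .children yields both tags and the newline strings between them.
all_children = list(soup.div.children)
# Keep only real tags by checking the node type.
tag_children = [child for child in soup.div.children if isinstance(child, Tag)]

print(len(all_children))
print([child.name for child in tag_children])
```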
Getting all descendant nodes
from bs4 import BeautifulSoup

html = '''
<div class="bc">
<span class="fl" style="padding-top: 1px;">
<a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105" height="48" alt="新东方在线网络课堂"></a></span>
<span class="fl" style="padding-top: 6px;">
<a href="http://cet4.koolearn.com/" target="_blank" rel="nofollow" class="ky">四级</a>
<a title="新东方在线网络课堂" href="http://www.koolearn.com/" target="_self">新东方在线</a> > <a title="四级网络课堂" href="http://cet4.koolearn.com/" target="_self">四级</a> > <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文</span>
<a href="http://www.xdf.cn/" target="_blank" rel="nofollow" class="fr logo_p2"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208" height="24"></a>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# .descendants is a generator that walks every nested node in document order
print(soup.div.descendants)
for i, child in enumerate(soup.div.descendants):
    print(i, child)
The output is as follows (indices 0, 2, 5, 7, 10, 20 and 23 are newline text nodes, so nothing visible follows them):
<generator object descendants at 0x00000000028F5AF0>
0
1 <span class="fl" style="padding-top: 1px;">
<a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>
2
3 <a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a>
4 <img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/>
5
6 <span class="fl" style="padding-top: 6px;">
<a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四级</a>
<a href="http://www.koolearn.com/" target="_self" title="新东方在线网络课堂">新东方在线</a> > <a href="http://cet4.koolearn.com/" target="_self" title="四级网络课堂">四级</a> > <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文</span>
7
8 <a class="ky" href="http://cet4.koolearn.com/" rel="nofollow" target="_blank">四级</a>
9 四级
10
11 <a href="http://www.koolearn.com/" target="_self" title="新东方在线网络课堂">新东方在线</a>
12 新东方在线
13  >
14 <a href="http://cet4.koolearn.com/" target="_self" title="四级网络课堂">四级</a>
15 四级
16  >
17 <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a>
18 英语四级词汇
19  > 正文
20
21 <a class="fr logo_p2" href="http://www.xdf.cn/" rel="nofollow" target="_blank"><img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/></a>
22 <img height="24" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208"/>
23
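The difference between children and descendants is easiest to see side by side. The following sketch uses a small made-up document and lists only the tag names, so the text nodes do not get in the way.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: <div> has two children, one of which is nested three levels deep.
html = '<div><span><a href="/x"><img src="x.jpg"></a></span><p>text</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Direct children only.
child_names = [c.name for c in soup.div.children if isinstance(c, Tag)]
# Every nested tag, in document order.
descendant_names = [d.name for d in soup.div.descendants if isinstance(d, Tag)]

print(child_names)
print(descendant_names)
```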
Parent and ancestor node operations
Getting the parent node and ancestor nodes
from bs4 import BeautifulSoup

html = '''
<div class="bc">
<span class="fl" style="padding-top: 1px;">
<a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105" height="48" alt="新东方在线网络课堂"></a></span>
<span class="fl" style="padding-top: 6px;">
<a href="http://cet4.koolearn.com/" target="_blank" rel="nofollow" class="ky">四级</a>
<a title="新东方在线网络课堂" href="http://www.koolearn.com/" target="_self">新东方在线</a> > <a title="四级网络课堂" href="http://cet4.koolearn.com/" target="_self">四级</a> > <a href="http://cet4.koolearn.com/cihui/" title="英语四级词汇">英语四级词汇</a> > 正文</span>
<a href="http://www.xdf.cn/" target="_blank" rel="nofollow" class="fr logo_p2"><img src="http://images.koolearn.com/fe_upload/2015_9_2_1441179317774.jpg" width="208" height="24"></a>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)   # the direct parent node
print(soup.a.parents)  # a generator over all ancestor nodes
The output is as follows:
<span class="fl" style="padding-top: 1px;">
<a href="http://www.koolearn.com/" target="_blank" title="新东方在线网络课堂"><img alt="新东方在线网络课堂" height="48" src="http://images.koolearn.com/fe_upload/2015_9_2_1441179226504.jpg" width="105"/></a></span>
<generator object parents at 0x00000000028C5B48>
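Printing .parents only shows a generator object, so the ancestors have to be iterated to be seen. A minimal sketch, assuming a trimmed made-up version of the markup and the built-in html.parser (which does not add html or body wrappers):

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical version of the markup above.
html = '<div class="bc"><span class="fl"><a href="http://www.koolearn.com/">link</a></span></div>'
soup = BeautifulSoup(html, "html.parser")

# .parent is a single Tag.
print(soup.a.parent.name)

# .parents walks upward until it reaches the BeautifulSoup object itself,
# whose name is the special value '[document]'.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```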
Sibling node operations
Getting sibling nodes
from bs4 import BeautifulSoup

html = '''
<div class="more_box" id="moreBox">
<h3>360识图</h3>
<a href="javascript:;" id="btnLoadMore" class="btn_loadmore">加载更多</a>
<p id="imgTotal" class="img_total">找到相关图片约 2637 张</p>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.a.next_siblings)      # the siblings that come after the node
print(soup.a.previous_siblings)  # the siblings that come before the node
The output is as follows:
<generator object next_siblings at 0x0000000002885B48>
<generator object previous_siblings at 0x0000000002885B48>
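Both sibling properties are generators, so they need to be iterated to see the actual nodes. The sketch below assumes a compact made-up version of the markup with no whitespace between the tags. With whitespace present, the newline text nodes would appear as siblings as well.

```python
from bs4 import BeautifulSoup

# Compact, hypothetical version of the markup above (no whitespace between tags).
html = '<div id="moreBox"><h3>360识图</h3><a id="btnLoadMore">more</a><p id="imgTotal">total</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Siblings after <a>, in document order.
following = [sibling.name for sibling in soup.a.next_siblings]
# Siblings before <a>, nearest first.
preceding = [sibling.name for sibling in soup.a.previous_siblings]

print(following)
print(preceding)
# The singular forms return just the adjacent node.
print(soup.a.next_sibling.name, soup.a.previous_sibling.name)
```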
Python generators
l = [x * x for x in range(10)]  # list comprehension: builds every value at once
g = (x * x for x in range(10))  # generator expression: produces values on demand
print(l)
print(g)
The output is as follows:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
<generator object <genexpr> at 0x000000000251C468>
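Here l is a list and g is a generator. The most basic difference in how they are created is that the list uses [ ] while the generator uses ( ).
To print the values one at a time, call next() to get the generator's next value.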
g = (x * x for x in range(10))
for i in range(10):
    print(next(g))
The output is as follows:
0
1
4
9
16
25
36
49
64
81
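Calling next() a fixed number of times only works when the length is known in advance. As a small sketch of the more usual patterns, the example below consumes a generator with list() and shows what happens once it is exhausted.

```python
squares = (x * x for x in range(3))

# Take the first value by hand.
first = next(squares)
print(first)

# list() (or a for loop) consumes whatever is left.
remaining = list(squares)
print(remaining)

# An exhausted generator raises StopIteration, unless a default is given.
fallback = next(squares, None)
print(fallback)

exhausted = False
try:
    next(squares)
except StopIteration:
    exhausted = True
print(exhausted)
```

This is why the generator-valued properties of Beautiful Soup, such as parents or next_siblings, are normally used in a for loop or wrapped in list().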
Standard selector operations
# find_all(name, attrs, recursive, string, **kwargs)
# Searches the document by tag name, attributes, or text and returns all matches.
# Older Beautiful Soup releases call the string parameter text.
import re
# Find all <a> tags
soup.find_all('a')
# Find all <a> tags whose link has the form /view/123.htm
soup.find_all('a', href='/view/123.htm')
soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
# Find all <div> tags whose class is abc and whose text is python
soup.find_all('div', class_='abc', string='python')
# Properties of a matched node:
# The tag name of the matched node
node.name
# The href attribute of a matched <a> node
node['href']
# The link text of a matched <a> node
node.get_text()
# find(name, attrs, recursive, string, **kwargs)
# Used the same way as find_all, but returns only the first match.
# find_parents() / find_parent()
# find_parents() returns all matching ancestor nodes; find_parent() returns the direct parent node.
# find_next_siblings() / find_next_sibling()
# find_next_siblings() returns all matching siblings after the node; find_next_sibling() returns the first one after it.
# find_previous_siblings() / find_previous_sibling()
# find_previous_siblings() returns all matching siblings before the node; find_previous_sibling() returns the first one before it.
# find_all_next() / find_next()
# find_all_next() returns all matching nodes after the node; find_next() returns the first one after it.
# find_all_previous() / find_previous()
# find_all_previous() returns all matching nodes before the node; find_previous() returns the first one before it.
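The listing above can be exercised end to end with a sketch like the following. The markup, the /view/ and /about.htm links, and the variable names are all assumptions invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup built to match the filters in the listing above.
html = '''
<div class="abc">python</div>
<div class="abc">java</div>
<ul>
<li><a href="/view/123.htm">first</a></li>
<li><a href="/view/456.htm">second</a></li>
<li><a href="/about.htm">about</a></li>
</ul>
'''
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")
exact = soup.find_all("a", href="/view/123.htm")
by_regex = soup.find_all("a", href=re.compile(r"/view/\d+\.htm"))
python_divs = soup.find_all("div", class_="abc", string="python")
first_link = soup.find("a")

print(len(all_links))
print([a["href"] for a in by_regex])
print(python_divs[0].get_text())

# Navigation helpers accept the same filters as find_all().
print(first_link.find_parent("li").name)
print(soup.find("li").find_next_sibling("li").a["href"])
print(first_link.find_next("a")["href"])
```

Passing a tag name to find_next_sibling() also skips the whitespace text nodes that sit between the <li> elements.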
Test example:
import bs4
html_doc='''
<div class="container"> <a href="/pc/home?sign=360_79aabe15" class="logo"></a> <nav id="nnav" data-mod="nnav"> <div class="nnav-wrap"> <ul class="nnav-items" id="nnav_main"> <li data-index="0"><a class="nnav-item" href="/pc/home?ch=youlike&sign=360_79aabe15" target="_blank" data-ch="youlike">推荐<span></span></a></li><li data-index="1"><a class="nnav-item" href="/pc/home?ch=good_safe2toera&sign=360_79aabe15" target="_blank" data-ch="good_safe2toera">新时代<span></span></a></li><li data-index="2"><a class="nnav-item" href="/pc/home?ch=fun&sign=360_79aabe15" target="_blank" data-ch="fun">娱乐<span></span></a></li><li data-index="3"><a class="nnav-item" href="/pc/home?
data-index="4"><a class="nnav-item" href="/pc/home?ch=economy&sign=360_79aabe15" target="_blank" data-ch="economy">财经<span></span></a></li><li data-index="5"><a class="nnav-item" href="/pc/home?ch=estate&sign=360_79aabe15" target="_blank" data-ch="estate">房产<span></span></a></li><li data-index="6"><a class="nnav-item" href="/pc/home?ch=car&sign=360_79aabe15" target="_blank" data-ch="car">汽车<span></span></a></li><li data-index="7"><a class="nnav-item" href="/pc/home?ch=sport&sign=360_79aabe15" target="_blank" data-ch="sport">体育<span></span></a></li><li data-index="8"><a class="nnav-item" href="/pc/home?ch=domestic&sign=360_79aabe15" target="_blank" data-ch="domestic">国内
'''
# Create the BeautifulSoup object
soup = bs4.BeautifulSoup(html_doc,'html.parser')
# Get all links
links = soup.find_all('a')
for link in links:
    print(link.name,link['href'],link.get_text())
# Get the link whose href is /pc/home?sign=360_79aabe15
link_node = soup.find('a',href='/pc/home?sign=360_79aabe15')
print(link_node.name,link_node['href'],link_node.get_text())
The output is as follows (the fifth link's href contains a line break because the sample markup is cut off inside that attribute):
a /pc/home?sign=360_79aabe15
a /pc/home?ch=youlike&sign=360_79aabe15 推荐
a /pc/home?ch=good_safe2toera&sign=360_79aabe15 新时代
a /pc/home?ch=fun&sign=360_79aabe15 娱乐
a /pc/home?
data-index= 财经
a /pc/home?ch=economy&sign=360_79aabe15 财经
a /pc/home?ch=estate&sign=360_79aabe15 房产
a /pc/home?ch=car&sign=360_79aabe15 汽车
a /pc/home?ch=sport&sign=360_79aabe15 体育
a /pc/home?ch=domestic&sign=360_79aabe15 国内
a /pc/home?sign=360_79aabe15
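The example reads link['href'] directly, which raises KeyError for any <a> tag that has no href attribute. A more defensive variant might look like the sketch below; the two-link markup is a made-up case chosen to trigger the problem.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> is a named anchor without href.
html_doc = '<a href="/pc/home">home</a><a name="top">anchor without href</a>'
soup = BeautifulSoup(html_doc, "html.parser")

rows = []
for link in soup.find_all("a"):
    href = link.get("href")  # None instead of KeyError when href is missing
    if href is None:
        continue
    rows.append((link.name, href, link.get_text()))

print(rows)
```

An alternative with the same effect is soup.find_all('a', href=True), which only returns the tags that carry an href attribute.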