Python: Beautiful Soup (web pages)

What is Beautiful Soup?

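Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the document. (From the official documentation.)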

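Beautiful Soup parses a document into a tree from which specific content can be picked out directly, which spares us the trouble of writing regular expressions.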
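To make the comparison with regular expressions concrete, here is a minimal sketch that extracts the same links both ways. The HTML snippet and the variable names are made up for illustration; they are not from the official documentation.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet, invented for this illustration.
html = '<p>Links: <a href="/a" class="x">A</a> <a class="y" href="/b">B</a></p>'

# Regex approach: the pattern has to anticipate attribute order and quoting.
regex_links = re.findall(r'<a[^>]*\bhref="([^"]*)"', html)

# Beautiful Soup approach: parse once, then ask for the attribute by name.
soup = BeautifulSoup(html, "html.parser")
soup_links = [a.get("href") for a in soup.find_all("a")]

print(regex_links)  # ['/a', '/b']
print(soup_links)   # ['/a', '/b']
```

Both give the same result here, but the regex only works because the snippet uses double-quoted attributes; the parsed tree does not depend on such details.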

Quick start

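from bs4 import BeautifulSoup

# Sample document used throughout this section (the example from the official docs)
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html_doc, 'html.parser')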

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

A few simple ways to navigate the structured data:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']  (class is a multi-valued attribute, so bs4 returns a list)

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Find the links of all <a> tags in the document:

for link in soup.find_all('a'):
    print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie

Get all the text content from the document:

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
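The raw `get_text()` output keeps the whitespace and blank lines of the original markup. As a small sketch of how to tidy that up, `get_text()` accepts `separator` and `strip` arguments, and `stripped_strings` yields the same pieces one by one. The HTML snippet below is a made-up example, not the document above.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with messy whitespace, invented for this illustration.
html = "<div>\n  <h1> Title </h1>\n  <p> First paragraph. </p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text piece and skips whitespace-only ones;
# separator is placed between the remaining pieces.
text = soup.get_text(separator=" ", strip=True)
pieces = list(soup.stripped_strings)

print(text)    # Title First paragraph.
print(pieces)  # ['Title', 'First paragraph.']
```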

Here we use bs4:

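1. Import the module:

from bs4 import BeautifulSoup

2. Choose a parser and parse the given content:

soup = BeautifulSoup(markup, parser)  # markup: the content to parse; parser: the parser name; returns a BeautifulSoup object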

Common parsers: html.parser, lxml, xml, html5lib

Some parsers have to be installed before use, for example: pip3 install lxml
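Since third-party parsers may or may not be installed, one option is to try them in order of preference and fall back to the built-in one. The following is only a sketch: `make_soup` and its `preferred` parameter are hypothetical names introduced here, while `FeatureNotFound` is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper for illustration; returns (soup, parser_name).
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, used = make_soup("<p>Hello</p>")
print(used, soup.p.string)
```

Because html.parser ships with Python, putting it last in the list means the helper should always find something to use.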

Beautiful Soup supports the HTML parser in Python's standard library, and it also supports several third-party parsers, such as lxml and html5lib.

Differences between parsers (excerpted from the official documentation)

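Beautiful Soup presents the same interface for different parsers, but the parsers themselves differ. The same document parsed by different parsers may produce trees with different structures.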
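A quick way to see such differences is to feed one invalid fragment to each parser and print the resulting tree. This sketch uses the `<a></p>` fragment discussed in the official documentation; parsers that are not installed are simply reported as missing, so only the html.parser result is guaranteed to appear.

```python
from bs4 import BeautifulSoup, FeatureNotFound

fragment = "<a></p>"  # invalid HTML: a closing </p> with no opening <p>

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(fragment, parser))
    except FeatureNotFound:
        results[parser] = "(not installed)"

for parser, tree in results.items():
    print(parser, "->", tree)
# html.parser -> <a></a>
```

html.parser drops the stray closing tag and does not add a surrounding document; the other parsers, when installed, may wrap the fragment in additional tags, which is why the same code can behave differently across environments.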

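The biggest differences are between the HTML parsers and the XML parsers. For a more detailed introduction, see the official documentation; happily, a fairly approachable Chinese version is available: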

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

posted @ 2023-06-16 11:20 新兵蛋Z
