Beautiful Soup Basic Usage

Quick Start

Installing Beautiful Soup

$ apt-get install python3-bs4

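Preparation

The following HTML snippet is used as an example throughout. It is an excerpt from Alice's Adventures in Wonderland (referred to below as the "Alice document"):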

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Parsing this snippet with BeautifulSoup produces a BeautifulSoup object, which can print the document as a consistently indented structure:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

Getting HTML fragments by tag and attribute (the find_all function)

soup.find_all("title")
# [<title>The Dormouse's story</title>]
# Get all title tags

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]
# Get all p tags whose class is title

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
# Get the tags whose id is link2

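find_all(name, attrs, recursive, string, limit, **kwargs)
Parameter description
name: the tag name to match
attrs: a dictionary of attribute filters; a bare string is matched against the CSS class
recursive: if False, only direct children are searched (the default is True)
string: match on text content instead of tag name
limit: stop after this many matches
**kwargs: any other keyword argument is treated as an attribute filter, for example id="link2"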
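The parameters of find_all can be tried out against the Alice document. The following is only a minimal sketch: the variable names are chosen for illustration, and the document is repeated (with its closing tags added) so that the snippet runs on its own.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# name: filter by tag name
links = soup.find_all("a")

# attrs: filter by a dictionary of attribute values
sisters = soup.find_all("a", attrs={"class": "sister"})

# recursive=False: only direct children of <html> are examined,
# and <title> sits inside <head>, so nothing is found
direct_titles = soup.html.find_all("title", recursive=False)

# string: match on text content; the results are strings, not tags
elsie = soup.find_all(string="Elsie")

# limit: stop after the first two matches
first_two = soup.find_all("a", limit=2)

# **kwargs: any other keyword is treated as an attribute filter
by_id = soup.find_all(id="link3")

print(len(links), len(sisters), len(direct_titles), len(first_two))
print(by_id[0]["href"])
```

Every call returns a group of results, so an index such as by_id[0] is needed before reading an attribute, even when only one tag matches.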

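Note!

When reading values, pay attention to whether a call returns a single tag or a list of tags.
That is the difference between find() and find_all(), and between select_one() and select().
When you use

find()
select_one()

you get a single tag (or None if nothing matches),
whose type is

<class 'bs4.element.Tag'>

so you can read a value directly with tag['class'].
When you use

find_all()
select()

you get a group of tags (it is still a group even if only one tag matches),
whose type is

# return type of find_all()
<class 'bs4.element.ResultSet'>
# return type of select() (newer Beautiful Soup releases may report ResultSet here, which is a subclass of list)
<class 'list'>

In that case, first pick the tag you want out of the list (ResultSet), then read the value,
for example tag[0]['class'].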
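The difference can be checked directly. The following is a minimal sketch using a one-link HTML string made up for illustration; the variable names are assumptions, not part of the Beautiful Soup API.

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet, Tag

# A made-up fragment with exactly one matching link
html = '<p class="story"><a class="sister" id="link1" href="http://example.com/elsie">Elsie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# find() and select_one() return a single Tag
single = soup.find("a")
single_css = soup.select_one("a.sister")

# find_all() and select() return a group, even for one match
group = soup.find_all("a")
group_css = soup.select("a.sister")

print(type(single))          # a Tag
print(type(group))           # a ResultSet, which behaves like a list
print(single["class"])       # class is multi-valued, so this is a list of class names
print(group[0]["id"])        # index into the group first, then read the attribute

# When nothing matches, find() returns None rather than an empty group
missing = soup.find("table")
print(missing)
```

Because find() can return None, it is worth checking the result before indexing into it with ['class'] or a similar lookup.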

Example

import requests
from bs4 import BeautifulSoup

url = "https://www.cnblogs.com/java-six/"

# Send a browser-like User-Agent so the site serves the normal page
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}

response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.text, 'html.parser')

# Collect every <a> tag whose class is postTitle2 (the post title links)
arr = soup.find_all('a', 'postTitle2')

# Print the index and the title text of each post
for i, link in enumerate(arr):
    print("This is " + str(i) + " element that " + str(link.span.string))
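The same extraction can be tried without a network request. The sketch below is only an illustration: extract_post_titles is a hypothetical helper, and sample_html is made-up markup that merely imitates the postTitle2 links, not the real page. It uses get_text() instead of .string, because .string is None when the span contains more than one child node.

```python
from bs4 import BeautifulSoup


def extract_post_titles(html):
    """Return the text of every <a class="postTitle2"> link in html.

    Hypothetical helper written for this example.
    """
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for link in soup.find_all("a", "postTitle2"):
        # Join the text pieces with a space and strip surrounding whitespace
        titles.append(link.get_text(" ", strip=True))
    return titles


# Made-up markup that imitates the structure the example above relies on
sample_html = """
<div><a class="postTitle2" href="/p/1.html"><span>First post</span></a></div>
<div><a class="postTitle2 vertical-middle" href="/p/2.html"><span>Second <b>post</b></span></a></div>
<div><a class="other" href="/p/3.html"><span>Not a post title</span></a></div>
"""

for index, title in enumerate(extract_post_titles(sample_html)):
    print("This is " + str(index) + " element that " + title)
```

Keeping the parsing in a function that takes an HTML string makes it possible to test the selector logic separately from the requests.get() call.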

posted @ 2022-07-30 15:32 又一岁荣枯
