The BeautifulSoup Library
Overview
BeautifulSoup (package name: beautifulsoup4) is a library for parsing, traversing, and maintaining a "tag tree".
Official site: https://www.crummy.com/software/BeautifulSoup/
Installing BeautifulSoup
pip install beautifulsoup4

import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
print(demo)
soup = BeautifulSoup(demo, 'html.parser')
print(soup.prettify())
Understanding BeautifulSoup
# Import styles
from bs4 import BeautifulSoup

import bs4
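# Usage
from bs4 import BeautifulSoup
# From a string of markup
soup = BeautifulSoup("<html>data</html>", "html.parser")
# From an open file (the with block closes the file after parsing)
with open("D:/demo.html", encoding="utf-8") as f:
    soup2 = BeautifulSoup(f, "html.parser")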
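The file-based form can be tried without depending on a fixed path such as D:/demo.html. The following is a minimal sketch that assumes a temporary file written on the spot; the file name demo.html and the variable file_soup are placeholders chosen for illustration.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Write a small HTML file to a temporary directory, then parse it
# from an open file handle instead of from a string.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.html"  # placeholder file name
    path.write_text("<html><body><p>data</p></body></html>", encoding="utf-8")
    with path.open(encoding="utf-8") as f:
        file_soup = BeautifulSoup(f, "html.parser")

# The tree is fully built during parsing, so it stays usable
# after the file has been closed and removed.
print(file_soup.p.string)
```

Passing the encoding explicitly avoids relying on the platform default when the file is opened.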
Parser | Usage | Requirement |
bs4's HTML parser | BeautifulSoup(mk, 'html.parser') | Install the bs4 library |
lxml's HTML parser | BeautifulSoup(mk, 'lxml') | pip install lxml |
lxml's XML parser | BeautifulSoup(mk, 'xml') | pip install lxml |
html5lib's parser | BeautifulSoup(mk, 'html5lib') | pip install html5lib |
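Because lxml and html5lib are optional installs, a script may want to fall back to whichever parser is present. The sketch below is one possible approach, not part of the library: make_soup is a hypothetical helper, and the preference order is an assumption. It relies on bs4 raising FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: build a soup with the first parser that is installed."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, parser_name = make_soup("<p>hello</p>")
print(parser_name, soup.p.string)
```

Different parsers may repair broken markup differently, so pinning one parser explicitly is usually the safer choice when results must be reproducible.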
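Element | Description |
Tag | A tag, the most basic unit of information; opened with <> and closed with </> |
Name | The tag's name; the name of <p>...</p> is 'p'. Access: <tag>.name |
Attributes | The tag's attributes, organized as a dictionary. Access: <tag>.attrs |
NavigableString | The non-attribute string inside a tag, i.e. the text between <> and </>. Access: <tag>.string |
Comment | The comment part of a string inside a tag; a special string type named Comment |
- Tag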
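The elements in the table can be seen side by side on a small inline document, which avoids a network request. This is a minimal sketch; the markup string is made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<p class="title" id="t1"><b>Hello</b></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                       # Tag: the first <p> in the document
print(type(tag).__name__)          # Tag
print(tag.name)                    # p
print(tag.attrs)                   # {'class': ['title'], 'id': 't1'}
print(tag.string)                  # Hello
print(type(tag.string).__name__)   # NavigableString
```

Note that class comes back as a list, because HTML allows several class names on one tag, while id is a plain string.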
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
print(soup.title)
tag = soup.a
print(tag)
- Tag name
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
print(soup.a.name)
print(soup.a.parent.name)
print(soup.a.parent.parent.name)
- Tag attrs (attributes)
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
tag = soup.a
print(tag.attrs)
print(tag.attrs['class'])
print(tag.attrs['href'])
print(type(tag.attrs))
print(type(tag))
- Tag NavigableString
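Indexing tag.attrs with a missing key raises KeyError, so looking up optional attributes with get can be more forgiving. A small sketch on an inline tag follows; the markup and the example.com URL are placeholders.

```python
from bs4 import BeautifulSoup

html = '<a class="py1 link" href="http://example.com/a" id="link1">A</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

print(tag["class"])             # ['py1', 'link'] (multi-valued attribute)
print(tag.get("href"))          # http://example.com/a
print(tag.get("title"))         # None: the attribute is absent
print(tag.get("title", "n/a"))  # n/a: explicit default
```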
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
print(soup.a)
print(soup.a.string)
print(soup.p)
print(soup.p.string)
print(type(soup.p.string))
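One detail worth knowing about .string: it returns None when a tag has more than one child, in which case get_text() collects all the text instead. The sketch below uses a made-up inline paragraph to show the difference.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>One <b>two</b> three</p>", "html.parser")

# <p> has three children (text, <b>, text), so .string is ambiguous.
print(soup.p.string)      # None
# get_text() concatenates every string below the tag.
print(soup.p.get_text())  # One two three
# <b> has exactly one string child, so .string works.
print(soup.b.string)      # two
```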
- Tag Comment
from bs4 import BeautifulSoup
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")
print(newsoup.b.string)
print(type(newsoup.b.string))
print(newsoup.p.string)
print(type(newsoup.p.string))
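Printing a Comment shows only its text, without the <!-- --> markers, so comments and ordinary text look the same in output. A type check tells them apart. This is a minimal sketch; the labels "comment" and "text" are chosen for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

kinds = []
for tag in (newsoup.b, newsoup.p):
    s = tag.string
    # Comment is a subclass of NavigableString, so test for Comment first.
    kind = "comment" if isinstance(s, Comment) else "text"
    kinds.append(kind)
    print(tag.name, kind, str(s))
```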
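Traversing HTML content with bs4
Attribute | Description |
.contents | A list of child nodes; all direct children of <tag> are stored in the list |
.children | An iterator over child nodes; similar to .contents, used to loop over direct children |
.descendants | An iterator over descendant nodes; covers all children and their children, used for looping |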
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.head)
print(soup.head.contents)
print(soup.body.contents)
print(len(soup.body.contents))
print(soup.body.contents[1])
# Iterate over direct children
for child in soup.body.children:
    print(child)
# Iterate over all descendants
for child in soup.body.descendants:
    print(child)
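The difference between the three attributes is easier to see on a tiny tree. The sketch below uses a made-up inline document; note that text nodes count as nodes too, so filtering with isinstance(node, Tag) is one way to keep only tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<body><p>One</p><p>Two <b>bold</b></p></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .contents is a list, so it supports len() and indexing.
print(len(body.contents))              # 2

# .children yields only the direct children.
child_names = [c.name for c in body.children]
print(child_names)                     # ['p', 'p']

# .descendants walks the whole subtree, including text nodes.
all_nodes = list(body.descendants)
descendant_tags = [d.name for d in all_nodes if isinstance(d, Tag)]
print(len(all_nodes), descendant_tags)  # 6 ['p', 'p', 'b']
```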
Attribute | Description |
.parent | The node's parent tag |
.parents | An iterator over the node's ancestor tags, used to loop over ancestors |
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.title.parent)
print(soup.html.parent)
print(soup.parent)
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
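Walking upward ends at the BeautifulSoup object itself, whose name is the special string '[document]' and whose own parent is None. A minimal sketch on an inline document, made up for illustration:

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a href='#'>x</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Collect the names of every ancestor of <a>, innermost first.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']

# The soup object is the root of the tree, so it has no parent.
print(soup.parent)     # None
```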
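Attribute | Description |
.next_sibling | The next sibling node in HTML document order |
.previous_sibling | The previous sibling node in HTML document order |
.next_siblings | An iterator over all following sibling nodes in HTML document order |
.previous_siblings | An iterator over all preceding sibling nodes in HTML document order |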
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.a.next_sibling)
print(soup.a.next_sibling.next_sibling)
print(soup.a.previous_sibling)
print(soup.a.previous_sibling.previous_sibling)
print(soup.a.parent)
# Iterate over following siblings (note the plural: next_siblings)
for sibling in soup.a.next_siblings:
    print(sibling)
# Iterate over preceding siblings (note the plural: previous_siblings)
for sibling in soup.a.previous_siblings:
    print(sibling)
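Siblings are nodes, not only tags: the text between two tags is itself a sibling. The sketch below, on a made-up inline paragraph, shows this and two ways to reach only the tag siblings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p><a>1</a> and <a>2</a>, <a>3</a></p>", "html.parser")
first = soup.a

# The immediate next sibling is the text node between the tags.
print(repr(str(first.next_sibling)))  # ' and '

# Keep only the siblings that are tags.
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
sibling_texts = [str(t.string) for t in tag_siblings]
print(sibling_texts)                  # ['2', '3']

# find_next_sibling skips text nodes and matches by tag name.
next_a = first.find_next_sibling("a")
print(next_a)                         # <a>2</a>
```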
Formatting and encoding HTML with bs4
Formatted output: print(soup.a.prettify())
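As a rough illustration of both formatting and encoding, the sketch below pretty-prints a small inline document containing non-ASCII text and then serializes it to UTF-8 bytes. The markup is made up for illustration, and the exact whitespace produced by prettify() is not shown because it may vary between versions.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>中文<a>link</a></p>", "html.parser")

# prettify() returns a str with one node per line, indented by depth.
pretty = soup.prettify()
print(pretty)

# str(soup) gives the compact form; encode() gives bytes for writing to disk.
compact = str(soup)
raw = soup.encode("utf-8")
print(compact)
print(type(raw).__name__, len(raw))
```

Inside Python the tree holds Unicode strings, so the Chinese text survives both the formatted and the encoded output unchanged.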