The BeautifulSoup Library

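Overview

BeautifulSoup (package name beautifulsoup4) is a library for parsing, traversing, and maintaining the "tag tree" of an HTML or XML document.

Official site: https://www.crummy.com/software/BeautifulSoup/

Installing BeautifulSoup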

pip install beautifulsoup4
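Installation test:
import requests
from bs4 import BeautifulSoup

# Fetch the demo page and keep its HTML source as text
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
print(demo)

# Parse the HTML with Python's built-in parser and print it with indentation
soup = BeautifulSoup(demo, 'html.parser')
print(soup.prettify())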

Understanding BeautifulSoup

# Two ways to import the library
from bs4 import BeautifulSoup

import bs4

# Usage
from bs4 import BeautifulSoup
# From a string of markup
soup = BeautifulSoup("<html>data</html>", "html.parser")
# From an open file; the with block closes the file after parsing
with open("D://demo.html") as f:
    soup2 = BeautifulSoup(f, "html.parser")
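
The two input styles above should give the same tree for the same markup. The following is a minimal sketch to check that; the temporary file and the sample markup are assumptions made for illustration and stand in for a real file such as demo.html.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Sample markup used for both input styles (illustrative only).
markup = "<html><body><p>data</p></body></html>"

# Style 1: parse directly from a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

# Style 2: parse from a file object. A temporary file stands in for a real HTML file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(markup)
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_string.p.string)  # data
print(str(soup_from_string) == str(soup_from_file))  # True
```

Passing an explicit encoding to open() avoids depending on the platform's default encoding when the file contains non-ASCII text.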

 

Beautiful Soup parsers (mk stands for the markup to parse)
Parser                         Usage                              Requirement
Python's built-in HTML parser  BeautifulSoup(mk, 'html.parser')   bs4 installed (html.parser ships with Python)
lxml HTML parser               BeautifulSoup(mk, 'lxml')          pip install lxml
lxml XML parser                BeautifulSoup(mk, 'xml')           pip install lxml
html5lib parser                BeautifulSoup(mk, 'html5lib')      pip install html5lib
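
Since lxml and html5lib are optional installs, a script may want to fall back to html.parser when they are missing. The sketch below shows one possible way to do that; make_soup is a hypothetical helper name, not part of bs4, and the preference order is an assumption.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    make_soup is a hypothetical helper for this example, not a bs4 function.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is not installed.
            continue
    raise RuntimeError("no usable parser found")


# Deliberately unclosed tags: each parser repairs them in its own way.
soup, used = make_soup("<p>Hello<b>world")
print(used)
print(soup.b.string)  # world
```

Different parsers can build slightly different trees from broken markup (for example, lxml and html5lib add html and body tags), so naming the parser explicitly keeps results reproducible across machines.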

 

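Basic elements of the Beautiful Soup classes
Element          Description
Tag              A tag, the most basic unit of information; opened with <> and closed with </>
Name             The tag's name; the name of <p>...</p> is 'p'. Access: <tag>.name
Attributes       The tag's attributes, organized as a dictionary. Access: <tag>.attrs
NavigableString  The non-attribute string inside a tag, i.e. the text between <> and </>. Access: <tag>.string
Comment          The comment part of a string inside a tag; a special kind of NavigableString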
  • Tag
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
# soup.<name> returns the first tag with that name in the document
print(soup.title)
tag = soup.a
print(tag)
  • Tag name
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
# Name of the first <a> tag, then of its parent and grandparent
print(soup.a.name)
print(soup.a.parent.name)
print(soup.a.parent.parent.name)
  • Tag attrs (attributes)
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
tag = soup.a
# attrs is a dictionary, so individual attributes are looked up by key
print(tag.attrs)
print(tag.attrs['class'])
print(tag.attrs['href'])
print(type(tag.attrs))
print(type(tag))
  • Tag NavigableString
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
# .string returns the text inside a tag as a NavigableString
print(soup.a)
print(soup.a.string)
print(soup.p)
print(soup.p.string)
print(type(soup.p.string))
  • Tag Comment
from bs4 import BeautifulSoup
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")
# .string drops the <!-- --> markers, so check the type to tell a comment from ordinary text
print(newsoup.b.string)
print(type(newsoup.b.string))
print(newsoup.p.string)
print(type(newsoup.p.string))
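
The five basic elements can also be tried without network access. The sketch below runs on a small inline HTML string; the markup and the visible_text helper are made up for illustration and are not part of bs4.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Illustrative markup: one tag with attributes and text, one tag holding only a comment.
html = '<p class="title main" id="intro"><b>Bold text</b></p><div><!--hidden note--></div>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
print(isinstance(tag, Tag))   # True
print(tag.name)               # p
print(tag.attrs)              # {'class': ['title', 'main'], 'id': 'intro'}
print(tag.get("href"))        # None: .get() avoids a KeyError for a missing attribute

text = soup.b.string
comment = soup.div.string
print(type(text) is NavigableString)   # True
print(isinstance(comment, Comment))    # True


def visible_text(node):
    """Return node.string unless it is a comment (hypothetical helper for this example)."""
    value = node.string
    return None if isinstance(value, Comment) else value


print(visible_text(soup.b))    # Bold text
print(visible_text(soup.div))  # None
```

Note that class is a multi-valued attribute, so bs4 returns it as a list, while id comes back as a plain string.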

Traversing HTML content with bs4

 

 

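Downward traversal of the tag tree
Attribute      Description
.contents      List of child nodes; all direct children of <tag> are stored in a list
.children      Iterator over child nodes; similar to .contents, used to loop over direct children
.descendants   Iterator over all descendant nodes (children, grandchildren, and so on), used for looping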

 

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.head)
print(soup.head.contents)
print(soup.body.contents)
print(len(soup.body.contents))
print(soup.body.contents[1])
# Loop over the direct children
for child in soup.body.children:
    print(child)
# Loop over all descendants
for child in soup.body.descendants:
    print(child)
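
One detail that is easy to miss in downward traversal: whitespace between tags is stored as string nodes, so .contents is usually longer than the number of child tags. A minimal sketch on an inline HTML string (the markup is an assumption made for illustration):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative markup; the newlines between tags become string nodes in the tree.
html = "<body>\n<p>one <b>bold</b></p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# Three newline strings plus two <p> tags.
print(len(body.contents))  # 5

# Keep only Tag objects to skip the whitespace strings.
child_tags = [child.name for child in body.children if isinstance(child, Tag)]
print(child_tags)  # ['p', 'p']

# .descendants also reaches nested tags, in document order.
descendant_tags = [node.name for node in body.descendants if isinstance(node, Tag)]
print(descendant_tags)  # ['p', 'b', 'p']
```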

 

Upward traversal of the tag tree
Attribute   Description
.parent     The node's parent tag
.parents    Iterator over the node's ancestor tags, used to loop over ancestors

 

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.title.parent)
# The parent of <html> is the BeautifulSoup object itself, so this prints the whole document
print(soup.html.parent)
# The BeautifulSoup object has no parent, so this prints None
print(soup.parent)
# Walk up through every ancestor of the first <a> tag
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
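
To see exactly what .parents yields, the following sketch collects the ancestor names of a link in a small inline document (the markup is an assumption made for illustration). The last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Illustrative markup with a link nested three levels deep.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .parents walks from the nearest ancestor up to the BeautifulSoup object.
ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)  # ['p', 'body', 'html', '[document]']

# The BeautifulSoup object sits at the top of the tree and has no parent.
print(soup.parent)  # None
```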

 

 

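Sibling traversal of the tag tree (among nodes under the same parent)
Attribute            Description
.next_sibling        The next sibling node in HTML text order
.previous_sibling    The previous sibling node in HTML text order
.next_siblings       Iterator over all following sibling nodes in HTML text order
.previous_siblings   Iterator over all preceding sibling nodes in HTML text order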

 

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.a.next_sibling)
print(soup.a.next_sibling.next_sibling)
print(soup.a.previous_sibling)
print(soup.a.previous_sibling.previous_sibling)
print(soup.a.parent)
# Loop over the following siblings (note the plural .next_siblings)
for sibling in soup.a.next_siblings:
    print(sibling)
# Loop over the preceding siblings (note the plural .previous_siblings)
for sibling in soup.a.previous_siblings:
    print(sibling)
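
Siblings are nodes, not necessarily tags: the text between two tags is a sibling too. The sketch below, on an inline HTML string chosen for illustration, shows this and one way to keep only the tag siblings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative markup: two links separated by plain text inside one paragraph.
html = "<p>Intro <a id='first'>A</a> and <a id='second'>B</a>.</p>"
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate neighbours of the first link are strings, not tags.
print(repr(first.previous_sibling))  # 'Intro '
print(repr(first.next_sibling))      # ' and '

# Filter the following siblings down to Tag objects only.
later_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
print(later_ids)  # ['second']

# find_next_sibling() does the same filtering for a given tag name.
print(first.find_next_sibling("a")["id"])  # second
```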

Formatting and encoding HTML with bs4

Formatted output: print(soup.a.prettify())
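
A short sketch of both topics in this section, formatting and encoding, using an inline HTML string chosen for illustration. With no arguments prettify() returns a str; with an encoding argument it returns bytes in that encoding.

```python
from bs4 import BeautifulSoup

# Illustrative markup containing non-ASCII text.
soup = BeautifulSoup("<p>中文<a href='#'>link</a></p>", "html.parser")

# prettify() places each tag and string on its own line, indented by depth.
pretty = soup.prettify()
print(pretty)
print(type(pretty))  # <class 'str'>

# Passing an encoding returns bytes, ready to be written to a file opened in binary mode.
encoded = soup.prettify(encoding="utf-8")
print(type(encoded))  # <class 'bytes'>
```

prettify() can be called on the whole soup or on a single tag such as soup.a; the added line breaks are meant for reading, so str(soup) is the better choice when whitespace must be preserved.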

 
