bs4一点知识
BeautifulSoup库解析器
解析器 | 使用方法 | 条件 |
---|---|---|
bs4的HTML解析器 | BeautifulSoup(mk,'html.parser') | 安装bs4库 |
lxml的HTML解析器 | BeautifulSoup(mk,'lxml') | pip install lxml |
lxml的XML解析器 | BeautifulSoup(mk,'xml') | pip install lxml |
html5lib的解析器 | BeautifulSoup(mk,'htmlslib') | pip install html5lib |
实例展示BeautifulSoup的基本用法:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.title #获取标题
<title>This is a python demo page</title>
>>> soup.a #获取a标签
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.title.string
'This is a python demo page'
>>> soup.prettify() #输出html标准格式内容
'<html>\n <head>\n <title>\n This is a python demo page\n </title>\n </head>\n <body>\n <p class="title">\n <b>\n The demo python introduces several python courses.\n </b>\n </p>\n <p class="course">\n Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n Basic Python\n </a>\n and\n <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n Advanced Python\n </a>\n .\n </p>\n </body>\n</html>'
>>> soup.a.name #每个<tag>都有自己的名字,通过<tag>.name获取
'a'
>>> soup.p.name
'p'
>>> tag = soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
>>>
6.5 标签树的遍历
标签树的下行遍历
标签树的上行遍历:遍历所有先辈节点,包括soup本身
标签树的平行遍历:同一个父节点的各节点间
实例演示:
from bs4 import BeautifulSoup
import requests
demo = requests.get("http://python123.io/ws/demo.html").text
soup = BeautifulSoup(demo,"html.parser")
#标签树的上行遍历
print("遍历儿子节点:\n")
for child in soup.body.children:
print(child)
print("遍历子孙节点:\n")
for child1 in soup.body.descendants:
print(child1)
print(soup.title.parent)
print(soup.html.parent)
for parent in soup.a.parents:
if parent is None:
print(parent)
else:
print(parent.name)
#标签树的平行遍历
print(soup.a.next_sibling)
print(soup.a.next_sibling.next_sibling)
print(soup.a.previous_sibling)
从网页上获得小说章节目录并找到正文网页和标题
data=[]
soup=BeautifulSoup(res.text,'html.parser')
for dd in soup.find_all("dd"):
link=dd.find("a")
if not link:
continue
data.append(("{}{}".format(url,link['href']),link.get_text()))
return data
精确匹配实例
对于有多个相同class的标签,对于有其他有别名都可以用findall同样检索.
<li class='item angs lock' id='1'></li>
<li class='item angs'></li>
<li class='item' id='2'></li>
可以用 pattern=findAll(name="li",class_="item")
检索,但是 pattern[1]['id']
就会出错,如果条件设计为 item angs lock
,第三个就检索不到
正确用法是 name='li',attrs={'class':'item',"data-id": True}
,这样只检索第一第三个标签
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· Manus重磅发布:全球首款通用AI代理技术深度解析与实战指南
· 被坑几百块钱后,我竟然真的恢复了删除的微信聊天记录!
· 没有Manus邀请码?试试免邀请码的MGX或者开源的OpenManus吧
· 【自荐】一款简洁、开源的在线白板工具 Drawnix
· 园子的第一款AI主题卫衣上架——"HELLO! HOW CAN I ASSIST YOU TODAY