Getting the text under a tag in BeautifulSoup
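Common methods:
Use the get_text() method to get all of the text under the current tag, including the text inside its descendant tags; the markup itself is stripped out automatically.
If the current tag's only child is a text node, use .string to get that text. When the tag has more than one child, .string returns None.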
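The difference between the two can be seen in a small sketch. The HTML snippet below is made up for illustration (it is not the demo page used later), so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
html = '<p class="title">Hello, <b>world</b>!</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# get_text() joins every text node under <p>, including the one inside <b>.
text_all = p.get_text()

# <b> has exactly one child and it is text, so .string returns that text.
only_b = p.b.string

# <p> has three children ("Hello, ", <b>, "!"), so .string is None.
p_string = p.string

print(text_all)
print(only_b)
print(p_string)
```

When it is unclear how many children a tag has, get_text() is usually the safer choice, since .string can silently return None.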
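Advanced methods:
If the text belongs to a child, sibling, or parent of this tag, the following traversal methods can be combined as needed to reach it: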
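1. Downward traversal
Downward traversal of the tag tree
.contents list of child nodes; stores all direct children of the tag in a list
.children iterator over the child nodes; like .contents, used to loop over the direct children
.descendants iterator over all descendant nodes (children, grandchildren, and so on), used for looping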
Test code:
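import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.head)  # the <head> tag and everything inside it
print(soup.head.contents)  # list of the direct child nodes of <head>
print(soup.body.contents)  # list of the direct child nodes of <body>
print(len(soup.body.contents))  # number of direct child nodes of <body> (not the nesting depth)
print(soup.body.contents[1])  # the second child node of <body>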
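To compare the three downward attributes side by side, here is a minimal offline sketch. The HTML string is an assumption made up for this example; on a real page the lists would usually also contain whitespace text nodes such as "\n".

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical snippet without whitespace between tags, for illustration only.
html = "<body><p>The <b>demo</b> page</p><a href='#'>link</a></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .contents is a real list, so len() and indexing work.
n_children = len(body.contents)

# .children is an iterator over the same direct children.
child_names = [c.name for c in body.children if isinstance(c, Tag)]

# .descendants walks the whole subtree in document order.
descendant_tags = [d.name for d in body.descendants if isinstance(d, Tag)]
text_nodes = [str(d) for d in body.descendants if isinstance(d, NavigableString)]

print(n_children)
print(child_names)
print(descendant_tags)
print(text_nodes)
```

Filtering with isinstance is one way to separate tags from text nodes, since both kinds of node appear in all three attributes.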
2. Upward traversal
.parent the parent of the node
.parents iterator over all ancestors of the node, used for looping
Test code:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
# print(soup.title.parent)
# print(soup.html.parent)
# Walk upward from the first <a> tag through all of its ancestors.
for parent in soup.a.parents:
    if parent is None:
        # Defensive check: guards against a missing parent.
        print(parent)
    else:
        print(parent.name)
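The upward walk can also be tried offline. This is a small sketch on a made-up HTML string (an assumption, not the demo page); it shows that the last ancestor is the BeautifulSoup object itself, whose name is "[document]", and that this object has no parent.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .parent is the single enclosing tag.
direct_parent = soup.a.parent.name

# .parents yields every ancestor, from the nearest one up to the soup object.
ancestor_names = [parent.name for parent in soup.a.parents]

# The BeautifulSoup object is the top of the tree, so its parent is None.
top_parent = soup.parent

print(direct_parent)
print(ancestor_names)
print(top_parent)
```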
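3. Sibling traversal
Sibling traversal of the tag tree
.next_sibling returns the next sibling node in HTML document order
.previous_sibling returns the previous sibling node in HTML document order
.next_siblings iterator; returns all following sibling nodes in HTML document order
.previous_siblings iterator; returns all preceding sibling nodes in HTML document order
Siblings are nodes that share the same parent. A sibling is not necessarily a tag; it can also be a text node.
Test code: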
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.a.next_sibling)  # the node right after the first <a> (may be text)
print(soup.a.next_sibling.next_sibling)  # the node after that one
print(soup.a.previous_sibling)  # the node right before the first <a>
print(soup.a.previous_sibling.previous_sibling)  # the node before that one (None if there is none)
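Because siblings are often text nodes, it can help to see the attributes on a small, self-contained input. The sketch below uses a made-up HTML string (an assumption, not the demo page) and also shows find_next_sibling(), which skips over text and returns the next matching tag.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
html = "<p>Intro: <a id='one'>first</a> and <a id='two'>second</a>.</p>"
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate neighbours of the first <a> are plain text nodes.
nxt = first.next_sibling
prev = first.previous_sibling

# find_next_sibling("a") ignores text nodes and returns the next <a> tag.
second = first.find_next_sibling("a")

# .next_siblings iterates over everything that follows inside the same parent.
later = list(first.next_siblings)

print(repr(nxt))
print(repr(prev))
print(second["id"])
print(len(later))
```

If only the text of a neighbouring tag is needed, calling get_text() on the result of find_next_sibling() avoids having to step over the text nodes by hand.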