BeautifulSoup
BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
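As a quick illustration of the four kinds of objects, the sketch below parses one small snippet and prints the class name of each node. The markup string and the variable names are made up for this example, and it uses Python's built-in "html.parser" rather than lxml only so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup: one tag holding a text node and a comment node.
markup = '<p class="note">Hello<!-- hidden remark --></p>'
soup = BeautifulSoup(markup, "html.parser")

tag = soup.p               # a Tag
text = tag.contents[0]     # a NavigableString
comment = tag.contents[1]  # a Comment

# Collect the class name of each object, from the document down to the comment.
kinds = [type(obj).__name__ for obj in (soup, tag, text, comment)]
print(kinds)
```

Comment is a subclass of NavigableString and BeautifulSoup is a subclass of Tag, which is why the later sections describe them as special cases of those two types.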
1. Tag:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<title>The Dormouses story</title>',"lxml")
tag = soup.title
print(tag)
>> <title>The Dormouses story</title>
A Tag has two important attributes: name and attrs.
soup = BeautifulSoup('<title>The Dormouses story</title>',"lxml")
tag = soup.title
print(tag.name)
>> title
# Changing the name attribute modifies the HTML document
soup = BeautifulSoup('<title>The Dormouses story</title>',"lxml")
tag = soup.title
tag.name = 'aaa'
print(tag)
>> <aaa>The Dormouses story</aaa>
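# A tag may have any number of attributes. For example, the tag <b class="boldest"> has a "class" attribute whose value is "boldest".
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>',"lxml")
tag = soup.b
print(tag.attrs)
>> {'class': ['boldest']}
# A tag's attributes are accessed the same way as a dictionary:
print(tag['class'])
>> ['boldest']
# A tag's attributes can be modified (left commented out here so that the outputs below still show the original class)
#tag['class'] = 'verybold'
#print(tag)
>> <b class="verybold">Extremely bold</b>
# A tag's attributes can be added
tag['grade'] = 'first'
print(tag)
>> <b class="boldest" grade="first">Extremely bold</b>
# A tag's attributes can be deleted
del tag['class']
print(tag)
>> <b grade="first">Extremely bold</b>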
2. NavigableString: BeautifulSoup uses the NavigableString class to wrap the strings inside a tag.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>',"lxml")
tag = soup.b
print(tag.string)
>> Extremely bold
# A string inside a tag cannot be edited in place, but it can be replaced with another string using replace_with():
tag.string.replace_with('change string')
print(tag)
>> <b class="boldest">change string</b>
3. BeautifulSoup: represents the entire content of a document. Most of the time it can be treated as a Tag object.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>',"lxml")
print(soup.attrs)
>> {}
print(soup.name)
>> [document]
4. Comment: a special type of NavigableString object that holds a comment from the document.
soup = BeautifulSoup('<b><!--Hey, buddy. Want to buy a used parser?--></b>','lxml')
tag = soup.b
comment = tag.string
print(comment)
>> Hey, buddy. Want to buy a used parser?
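Because a Comment is also a NavigableString, code that collects the strings of a tag may want to tell the two apart. The following is a minimal sketch of one way to do that with isinstance; the markup and the variable names are invented for the example, and "html.parser" is used only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: a paragraph with ordinary text followed by a comment.
soup = BeautifulSoup("<p>visible text<!-- internal note --></p>", "html.parser")

# Split the direct children of <p> into ordinary strings and comments.
visible = [str(node) for node in soup.p.contents if not isinstance(node, Comment)]
comments = [str(node) for node in soup.p.contents if isinstance(node, Comment)]

print(visible)
print(comments)
```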
Traversing the document tree:
(1) Tag names:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.head)  # access a tag by its name
print(soup.title)
print(soup.a)  # when several tags share the name, only the first one is returned
print(soup.find_all('a'))  # find every tag named a
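(2) .contents and .children
A tag's .contents attribute returns the tag's direct child nodes as a list; .children is an iterator over those same direct children.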
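A small sketch of the difference between the two attributes, using a made-up list snippet (the markup and variable names are assumptions for illustration, and "html.parser" is chosen only to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a list with two items.
soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul = soup.ul

# .contents is a plain Python list, so len() and indexing work on it.
children_list = ul.contents
print(len(children_list))
print(children_list[0])

# .children is meant to be looped over; here we collect each child's tag name.
child_names = [child.name for child in ul.children]
print(child_names)
```

Both attributes cover direct children only; nodes nested more deeply are not included.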
(3) .descendants: iterates over all descendant nodes (children, grandchildren, and so on).
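To show how .descendants differs from .children, the sketch below walks a small nested snippet both ways. The markup and variable names are invented for the example, and "html.parser" is used only so that the code runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: <b> is nested inside <p>, which is nested inside <div>.
soup = BeautifulSoup("<div><p>Hello <b>world</b></p></div>", "html.parser")
div = soup.div

# .children stops at the first level below <div>.
direct_children = [node.name for node in div.children if isinstance(node, Tag)]

# .descendants keeps going down, visiting tags and strings in document order.
all_tags = [node.name for node in div.descendants if isinstance(node, Tag)]
all_strings = [str(node) for node in div.descendants if not isinstance(node, Tag)]

print(direct_children)
print(all_tags)
print(all_strings)
```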