1. Quick operations:
# In the lines below, `x == y` means both expressions return the value shown in the comment.
soup.title == soup.find('title')
# <title>The Dormouse's story</title>
soup.title.name
# 'title'
soup.title.string == soup.title.text == soup.title.get_text()
# "The Dormouse's story"
soup.title.parent.name
# 'head'
soup.p == soup.find('p')  # dot access only returns the first matching tag under the current tag
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# ['title']  (class is a multi-valued attribute, so a list is returned)
soup.a == soup.find('a')
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all(['a','b']) # find all <a> tags and all <b> tags
soup.find_all(id=["link1","link2"]) # find all tags whose id is link1 or link2
soup.find(id="link3") # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
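The lookups above can be tried end to end with a small script. This is only a sketch: the `html_doc` below is a shortened stand-in for the document these notes assume, and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened stand-in for the html_doc used in the notes.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Dot access is shorthand for find(): both return the first match.
first_link = soup.a
same_as_find = first_link == soup.find("a")

# find_all() returns every match, in document order.
link_ids = [a["id"] for a in soup.find_all("a")]

# A list argument matches any of the listed tag names or attribute values.
a_and_b = [tag.name for tag in soup.find_all(["a", "b"])]
two_links = [tag.get_text() for tag in soup.find_all(id=["link1", "link2"])]

print(same_as_find, link_ids, a_and_b, two_links)
```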
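2. Beautiful Soup has four kinds of objects:
1. BeautifulSoup: the parsed document as a whole
2. Tag: an HTML/XML tag
3. NavigableString: the text inside a tag (Comment is a subclass of it)
4. Comment: a comment inside a tag; comment only, no regular text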
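A quick way to see the four kinds of objects is to parse a tag that holds both plain text and a comment and inspect the types. The markup here is invented for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup: one tag containing a text node and a comment node.
markup = "<p>plain text<!-- a comment --></p>"
soup = BeautifulSoup(markup, "html.parser")

p = soup.p
text_node, comment_node = p.contents  # the tag's two direct children

kinds = [
    type(soup).__name__,
    type(p).__name__,
    type(text_node).__name__,
    type(comment_node).__name__,
]
print(kinds)

# Comment is a special kind of NavigableString, and the BeautifulSoup
# object itself behaves like a Tag.
print(isinstance(comment_node, NavigableString), isinstance(soup, Tag))
```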
Tag attributes are read, set, and deleted exactly like the entries of a dictionary.
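A minimal sketch of the dictionary-style attribute operations, using a made-up `<a>` tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" id="link1">Elsie</a>', "html.parser"
)
link = soup.a

href = link["href"]          # read, like dict[key]; raises KeyError if missing
missing = link.get("title")  # like dict.get(): None when the attribute is absent

link["id"] = "first-link"    # overwrite an existing attribute
link["target"] = "_blank"    # add a new attribute
del link["href"]             # remove an attribute

# The underlying dictionary is available as .attrs.
print(link.attrs)
```

Note that `in` on a tag tests its children, not its attributes; use `link.has_attr("href")` or `"href" in link.attrs` to check for an attribute.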
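Multi-valued HTML attributes (this does not apply to XML):
A multi-valued attribute is one whose name can carry several values. Its value is always returned as a list, even when it holds only one value.
Examples: class, rel, rev, accept-charset, headers, accesskey
All other attributes are single-valued: even if the value contains several space-separated parts, it is returned as one string.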
soup.a['class']
# ['sister']
id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
id_soup.p['id']
# 'my id'
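The difference between multi-valued and single-valued attributes shows up on a single tag. The markup is invented for illustration, and `get_attribute_list()` is assumed to be available, which should hold for reasonably recent bs4 releases.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a two-value class and an id that contains a space.
soup = BeautifulSoup('<p class="body strikeout" id="my id">text</p>', "html.parser")
p = soup.p

classes = p["class"]                  # multi-valued: always a list
tag_id = p["id"]                      # single-valued: one string, space included
id_list = p.get_attribute_list("id")  # force a list for any attribute

# Assigning a list to a multi-valued attribute joins the values on output.
p["class"] = ["intro", "lead"]
rendered = str(p)

print(classes, repr(tag_id), id_list, rendered)
```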
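3. Getting the content of an HTML tag:
string: returns the text or the comment of a tag that has a single child (one or the other; if the tag contains both, the result is None)
strings: returns a generator over the text of all descendants (comments excluded)
stripped_strings: returns a generator over the text of all descendants (comments excluded, with the blank lines and surrounding whitespace that strings would yield removed)
text: returns text only, concatenating the content of all child tags
get_text(): returns text only, concatenating the content of all child tags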
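The accessors listed above can be compared side by side on one small fragment. The markup and variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second <p> holds both text and a comment.
soup = BeautifulSoup(
    "<div><p>one</p><p> two <!-- hidden --></p></div>", "html.parser"
)
first_p, second_p = soup.find_all("p")

single = first_p.string      # exactly one child, so its text is returned
ambiguous = second_p.string  # text plus comment: two children, so None

all_strings = list(soup.div.strings)        # comments are skipped
stripped = list(soup.div.stripped_strings)  # whitespace trimmed as well
text = soup.div.get_text()                  # same value as soup.div.text

print(repr(single), ambiguous, all_strings, stripped, repr(text))
```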
string:
from bs4 import BeautifulSoup

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comm = soup.b.string
print(comm)
# Hey, buddy. Want to buy a used parser?
print(type(comm))
# <class 'bs4.element.Comment'>
strings:
head_tag = soup.body
for s in head_tag.strings:
    print(repr(s))
Result:
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n '
'Elsie'
',\n '
'Lacie'
' and\n '
'Tillie'
';\n and they lived at the bottom of a well.\n '
'\n'
'...'
'\n'
stripped_strings:
head_tag = soup.body
for s in head_tag.stripped_strings:
    print(repr(s))
Result:
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'Elsie'
','
'Lacie'
'and'
'Tillie'
';\n and they lived at the bottom of a well.'
'...'
text:
soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.body
print(head_tag.text)
Result:

The Dormouse's story
Once upon a time there were three little sisters; and their names were
 Elsie,
 Lacie and
 Tillie;
 and they lived at the bottom of a well.

...
soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.body
print(repr(head_tag.text))
Result:
"\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\n Elsie,\n Lacie and\n Tillie;\n and they lived at the bottom of a well.\n \n...\n"
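As the repr output shows, text keeps every newline and indent from the source. One way to get a compact string, sketched here on a shortened, made-up fragment, is to pass a separator and strip=True to get_text():

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with the same kind of line breaks and indentation.
fragment = """<body>
<p><b>The Dormouse's story</b></p>
<p>Three sisters:
    <a>Elsie</a>,
    <a>Lacie</a>
</p>
</body>"""

soup = BeautifulSoup(fragment, "html.parser")

raw = soup.body.get_text()                     # whitespace preserved
compact = soup.body.get_text(" ", strip=True)  # pieces stripped, joined by " "
pieces = list(soup.body.stripped_strings)

print(repr(raw))
print(repr(compact))
```

With strip=True each piece is trimmed before joining, so the result equals `" ".join(soup.body.stripped_strings)`. A lone "," piece therefore ends up with a space in front of it.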