Daily Report

1. Quick operations:

soup.title  == soup.find('title')
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string  == soup.title.text  == soup.title.get_text()
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p   == soup.find('p')  # dot access returns only the first matching tag beneath the current tag
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']  (class is a multi-valued attribute, so the value is a list)

soup.a  == soup.find('a')
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(['a','b'])  # find every <a> tag and every <b> tag
soup.find_all(id=["link1","link2"])  # find every tag whose id is link1 or link2

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
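The snippets above assume a `soup` object that the post never constructs. A minimal sketch of that setup follows, assuming the sample document is the "three sisters" HTML from the Beautiful Soup documentation. The post does not show its own `html_doc`, so the whitespace here is an approximation and the variable names `title_text`, `parent_name`, `link_ids` and `by_list` are illustrative only.

```python
from bs4 import BeautifulSoup

# Assumed sample document; the post does not show its own html_doc.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Dot access is shorthand for find(): both return the first match.
assert soup.title == soup.find("title")
assert soup.a == soup.find("a")

title_text = soup.title.string          # the single string inside <title>
parent_name = soup.title.parent.name    # name of the enclosing tag
link_ids = [a["id"] for a in soup.find_all("a")]
by_list = soup.find_all(id=["link1", "link2"])  # a list matches any of its values

print(title_text, parent_name, link_ids, len(by_list))
```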

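　　2. Beautiful Soup has four kinds of objects:

　　　　1. BeautifulSoup: the parsed document as a whole

　　　　2. Tag: an HTML or XML tag

　　　　3. NavigableString: the text inside a tag; it can also hold comment content, because Comment is a subclass of NavigableString

　　　　4. Comment: a comment inside a tag; pure comment, with no ordinary text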
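The four types can be observed on a tiny document. This is only a sketch: the markup is hypothetical and was chosen so that one `<p>` holds both ordinary text and a comment.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup chosen to contain one of each object type.
markup = "<p>plain text<!-- a comment --></p>"
soup = BeautifulSoup(markup, "html.parser")

p = soup.p
text, comment = p.contents  # the two children of <p>, in document order

print(type(soup).__name__)     # BeautifulSoup
print(type(p).__name__)        # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment

# Comment subclasses NavigableString, so isinstance() is true for both;
# compare with type() when comments must be told apart from ordinary text.
is_sub = isinstance(comment, NavigableString)
```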

　　Tag attributes are read and modified exactly like the entries of a dictionary.
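A short sketch of the dictionary-style access mentioned above, using a hypothetical one-tag document. The attribute names `target` and the new `id` value are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document used only for illustration.
soup = BeautifulSoup('<a href="http://example.com/elsie" id="link1">Elsie</a>', "html.parser")
tag = soup.a

href = tag["href"]              # read; raises KeyError if the attribute is missing
missing = tag.get("title")      # read with a default of None
tag["id"] = "first-link"        # update an existing attribute
tag["target"] = "_blank"        # add a new attribute
del tag["href"]                 # delete an attribute
has_href = "href" in tag.attrs  # membership test on the underlying dict

print(tag.attrs)
```

The underlying dictionary is exposed as `tag.attrs`, which is convenient when all attributes need to be inspected at once.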

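　　Multi-valued attributes in HTML (this does not apply to XML):

　　　　A multi-valued attribute is one whose name can carry several values. Its value is always returned as a list, even when the attribute holds only one value.

　　　　Examples: class, rel, rev, accept-charset, headers, accesskey

　　　　Every other attribute is single-valued: even if its value contains several space-separated tokens, it is returned as one string.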

soup.a['class']  #['sister']


id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')  # name the parser explicitly to avoid a warning
id_soup.p['id']  # 'my id'
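The contrast between the two kinds of attribute can be seen on a single tag. The markup below is hypothetical; `get_attribute_list` is the Beautiful Soup helper that returns a list regardless of the attribute type.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: class holds two values, id contains a space.
soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
p = soup.p

classes = p["class"]                  # multi-valued: split on whitespace into a list
tag_id = p["id"]                      # single-valued: kept as one string
id_list = p.get_attribute_list("id")  # always a list, whatever the attribute

print(classes, repr(tag_id), id_list)
```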

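　　3. Outputting the contents of a tag in HTML:

　　　　string: returns the tag's single child string, which may be text or a comment. If the tag's only child is another tag, that child's string is returned; if the tag has more than one child, the result is None.

　　　　strings: a generator over the text of every descendant (comments excluded)

　　　　stripped_strings: like strings, but each string has its leading and trailing whitespace stripped and whitespace-only strings are skipped (comments excluded)

　　　　text: returns only the text, concatenating the text of all descendant tags into one string

　　　　get_text(): returns the same result as text; it also accepts a separator and a strip flag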

　　string:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comm = soup.b.string
print(comm)  # Hey, buddy. Want to buy a used parser?
print(type(comm))  #<class 'bs4.element.Comment'>
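The comment example covers only the single-child case. The following sketch, on hypothetical markup, shows the other two behaviours of `.string`: recursing into a lone child tag, and returning None when a tag has several children.

```python
from bs4 import BeautifulSoup

# Hypothetical markup covering the three cases.
markup = "<div><p><b>only child</b></p><p>text <i>and</i> a tag</p></div>"
soup = BeautifulSoup(markup, "html.parser")
first_p, second_p = soup.find_all("p")

direct = first_p.b.string       # <b> has one string child
nested = first_p.string         # <p> has one child tag, so .string recurses into <b>
mixed = second_p.string         # several children: .string is None
fallback = second_p.get_text()  # get_text() still joins every piece

print(repr(direct), repr(nested), mixed, repr(fallback))
```

When a tag may contain more than one child, `get_text()` is the safer choice because it never returns None.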

　　strings:

soup = BeautifulSoup(html_doc, 'html.parser')  # html_doc is the sample document from section 1
body_tag = soup.body
for s in body_tag.strings:
    print(repr(s))

Output:
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n        '
'Elsie'
',\n        '
'Lacie'
' and\n        '
'Tillie'
';\n        and they lived at the bottom of a well.\n    '
'\n'
'...'
'\n'

　　stripped_strings:

body_tag = soup.body
for s in body_tag.stripped_strings:
    print(repr(s))

Output:
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'Elsie'
','
'Lacie'
'and'
'Tillie'
';\n        and they lived at the bottom of a well.'
'...'

　　text:

soup = BeautifulSoup(html_doc, 'html.parser')
body_tag = soup.body
print(body_tag.text)

Output:
The Dormouse's story
Once upon a time there were three little sisters; and their names were
        Elsie,
        Lacie and
        Tillie;
        and they lived at the bottom of a well.
    
...

soup = BeautifulSoup(html_doc, 'html.parser')
body_tag = soup.body
print(repr(body_tag.text))

Output:
"\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\n        Elsie,\n        Lacie and\n        Tillie;\n        and they lived at the bottom of a well.\n    \n...\n"

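The raw `text` output keeps all of the source indentation. A minimal sketch of two common ways to tidy it follows, on a hypothetical fragment with similar indentation: `get_text()` with a separator and `strip=True`, and plain whitespace normalization with `split()` and `join()`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with indentation similar to the output above.
markup = """<p>Once upon a time
        <a>Elsie</a>,
        <a>Lacie</a> and
        <a>Tillie</a>; they lived at the bottom of a well.
    </p>"""
soup = BeautifulSoup(markup, "html.parser")

raw = soup.p.get_text()                    # same as .text, whitespace kept
pieces = list(soup.p.stripped_strings)     # each string stripped, blanks skipped
joined = soup.p.get_text(" ", strip=True)  # strip each piece, join with one space
normalized = " ".join(raw.split())         # collapse every run of whitespace

print(joined)
print(normalized)
```

Note the difference: joining stripped pieces puts a space before punctuation that follows a tag ("Elsie ,"), while normalizing the raw text keeps the punctuation attached ("Elsie,").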
Posted on 2021-05-17 19:16 by 樱岛麻衣daisuki
