Beautiful Soup Basics

1. Install the Beautiful Soup library:

pip install beautifulsoup4
pip install lxml  # the 'lxml' parser used in the examples below is a separate package

2. Import the bs4 library:

from bs4 import BeautifulSoup

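3. Create a BeautifulSoup object:

① From an HTML string:

soup = BeautifulSoup(html, 'lxml')  # name the parser explicitly to avoid a warning

② From an HTML file:

with open('index.html', encoding='utf-8') as f:  # utf-8 is an assumption about the file
    soup = BeautifulSoup(f, 'lxml')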

4. Pretty-print the HTML:

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())

5. Get a Tag object:

from bs4 import BeautifulSoup

html = '''<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.a)  # only the first <a> tag in the document
print(soup.p)  # only the first <p> tag in the document
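
Attribute-style access such as soup.a returns only the first matching tag. The sketch below contrasts it with find_all, which collects every match. It assumes a shortened, hypothetical document and uses the built-in 'html.parser' so that it runs without lxml; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical document used only for illustration.
doc = '''<p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>'''

soup = BeautifulSoup(doc, 'html.parser')

first_link = soup.a              # first <a> tag only
all_links = soup.find_all('a')   # list-like result holding every <a> tag
missing = soup.table             # None, because the document has no <table>

print(first_link['id'])
print([a['id'] for a in all_links])
print(missing)
```

Because a missing tag yields None rather than an error, it is worth checking the result before chaining further attribute access on it.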

6. Get a tag's attributes:

soup = BeautifulSoup(html, 'lxml')
print(soup.a.attrs)  # returned as a dict

# {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
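
Besides reading the whole attrs dict, a single attribute can be looked up directly on the tag. A minimal sketch, assuming a one-tag hypothetical document and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# One-tag hypothetical document used only for illustration.
doc = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(doc, 'html.parser')
link = soup.a

href = link['href']         # dictionary-style lookup; raises KeyError if absent
classes = link['class']     # class is multi-valued, so the value is a list
title = link.get('title')   # .get() returns None instead of raising

print(href)
print(classes)
print(title)
```

Using .get() is the safer choice when an attribute may be missing from some tags.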

7. Get a tag's text:

# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)

# The Dormouse's story
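
.string works above because the <p> tag has a single child chain ending in one string. When a tag has several children, .string is None, and get_text() is the usual alternative. A minimal sketch under those assumptions, with a hypothetical two-paragraph document and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document used only for illustration.
doc = ('<p class="title"><b>The Dormouse\'s story</b></p>'
       '<p class="story">Once upon a time <a>Lacie</a> lived.</p>')
soup = BeautifulSoup(doc, 'html.parser')
title_p, story_p = soup.find_all('p')

single = title_p.string        # one nested string, so it is returned
ambiguous = story_p.string     # several children, so this is None
full_text = story_p.get_text() # concatenates all nested strings

print(single)
print(ambiguous)
print(full_text)
```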

8. Traverse nodes:

(1) Direct children:

Key points:
.contents returns a list of the direct children
.children returns an iterator over the direct children
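
The two attributes above can be compared on a small example. This is a sketch on a hypothetical one-paragraph document, using the built-in 'html.parser'; note that text between tags counts as a child too.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical document: a <p> with two tags and one bare string as children.
doc = '<p><b>one</b>two<i>three</i></p>'
soup = BeautifulSoup(doc, 'html.parser')
p = soup.p

child_list = p.contents   # a real list, so it supports indexing and len()
print(len(child_list))
print(child_list[1])

# .children is consumed by iterating; keep only Tag children here.
tag_names = [child.name for child in p.children if isinstance(child, Tag)]
print(tag_names)
```

.contents is convenient when an index is needed, while .children suits a single pass over the children.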

(2) All descendants:

Key point:
.descendants returns an iterable over all descendant nodes (children, grandchildren, and so on)
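
To see how .descendants differs from .contents, the sketch below walks a small nested document. The document is hypothetical and the built-in 'html.parser' is used so that no extra install is required.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical nested document used only for illustration.
doc = '<div><p><b>one</b>two</p></div>'
soup = BeautifulSoup(doc, 'html.parser')
div = soup.div

direct = div.contents             # only the <p> tag
nested = list(div.descendants)    # <p>, <b>, 'one', 'two' in document order

nested_tags = [node.name for node in nested if isinstance(node, Tag)]
nested_strings = [str(node) for node in nested if not isinstance(node, Tag)]

print(len(direct), len(nested))
print(nested_tags)
print(nested_strings)
```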

 

posted @ 2017-07-11 23:14  还是原来那个我