Learning the BeautifulSoup Scraping Library (Part 3)

Traversing the document tree:

1. Finding child nodes

.contents

A tag's .contents attribute returns the tag's direct children as a list.

print soup.body.contents
print type(soup.body.contents)

Output:

[u'\n', The Dormouse's story, u'\n', Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1"></a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well., u'\n', ..., u'\n']

<type 'list'>
[Finished in 0.2s]
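
To make the "direct children as a list" point concrete, here is a minimal sketch. It assumes Python 3 with the bs4 package and the built-in "html.parser" backend, and it uses a small made-up snippet of HTML rather than the document parsed earlier in this series.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the document used in this post.
html = "<body><p>first</p><p>second <b>bold</b></p></body>"
soup = BeautifulSoup(html, "html.parser")

children = soup.body.contents       # direct children only, as a plain list
print(type(children))               # <class 'list'>
print(len(children))                # 2: the two <p> tags
print(children[0])                  # <p>first</p>
print(len(children[1].contents))    # 2: the text 'second ' and the <b> tag
```

Because .contents is an ordinary list, indexing and len() work on it directly. The nested <b> tag is not counted as a child of <body>; it only appears in the .contents of its own parent <p>.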

.children

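It does not return a list, but we can iterate over it to get all the child nodes.

If we print it, we can see that what it returns is an iterator over the list of children, not the list itself:

print soup.body.children

How do we get at its contents? Just loop over it: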

for child in soup.body.children:
    print child

Output:

The Dormouse's story

Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.

...

[Finished in 0.2s]
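
The difference between .children and .contents can be checked with a short sketch. It assumes Python 3 with bs4 and "html.parser", and the HTML string is a made-up example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the document used in this post.
html = "<body><p>first</p><p>second</p></body>"
soup = BeautifulSoup(html, "html.parser")

kids = soup.body.children
print(isinstance(kids, list))             # False: it is an iterator, not a list

# Iterating consumes the iterator and yields each direct child in order.
names = [child.name for child in kids]
print(names)                              # ['p', 'p']

# list() turns a fresh iterator back into the same sequence as .contents.
same = list(soup.body.children) == soup.body.contents
print(same)                               # True
```

In practice this means .children is the one to use in a for loop, and .contents is the one to use when you need an index or a length.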

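2. All descendant nodes

.descendants

.contents and .children contain only a tag's direct children. The .descendants attribute walks recursively through all of a tag's descendants. As with .children, we have to iterate over it to get at the contents.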

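for child in soup.descendants:
    print child

Running this prints every node in the document: first the outermost html tag, then the head tag and the nodes inside it, peeled off one layer at a time, and so on.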
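
A small side-by-side comparison may help show what "recursive" means here. This is only a sketch, assuming Python 3 with bs4 and "html.parser", run on a made-up one-line document.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the document used in this post.
html = "<body><p>one <b>two</b></p></body>"
soup = BeautifulSoup(html, "html.parser")

direct = [str(node) for node in soup.body.children]
nested = [str(node) for node in soup.body.descendants]

print(direct)   # ['<p>one <b>two</b></p>']
print(nested)   # ['<p>one <b>two</b></p>', 'one ', '<b>two</b>', 'two']
```

.children stops at the <p> tag, while .descendants goes on to visit the text inside it, the nested <b> tag, and finally the text inside <b>, in document order.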

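3. Node content

.string

If a tag contains no other tags, only text, .string returns that text. If a tag contains exactly one child tag, .string returns that child's .string, that is, the innermost text.

If a tag contains more than one child node, it is ambiguous which child .string should refer to, so .string is None.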

print soup.head.string
print soup.title.string
print soup.body.string

#The Dormouse's story
#The Dormouse's story
#None
[Finished in 0.2s]
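
The three cases described above can be reproduced in isolation. The sketch below assumes Python 3 with bs4 and "html.parser"; the three <div> tags and the variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one <div> per case.
html = "<div>plain</div><div><b>nested</b></div><div>x<b>y</b></div>"
soup = BeautifulSoup(html, "html.parser")

only_text, one_child, two_children = soup.find_all("div")

print(only_text.string)      # plain   (the tag holds just a string)
print(one_child.string)      # nested  (single child tag, so its string is used)
print(two_children.string)   # None    (a string plus a tag: ambiguous)
```

This is also why soup.body.string is None in the output above: <body> has several children, so there is no single string to return.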

4. Multiple strings

.strings

.strings gives every string inside a tag, but it has to be iterated over:

for string in soup.strings:
    print repr(string)

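.stripped_strings

The strings produced this way may include a lot of spaces and blank lines. .stripped_strings strips the surrounding whitespace from each string and skips strings that are whitespace only:

for string in soup.stripped_strings:
    print repr(string)

Output: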

u"The Dormouse's story"
u"The Dormouse's story"
u'Once upon a time there were three little sisters; and their names were'
u','
u'Lacie'
u'and'
u'Tillie'
u';\nand they lived at the bottom of a well.'
u'...'
[Finished in 0.2s]
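
To see what .stripped_strings actually removes, the two attributes can be compared on the same input. This is a sketch assuming Python 3 with bs4 and "html.parser", using a made-up snippet with deliberately messy whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with padding and a blank line between tags.
html = "<p>\n  Hello  \n</p>\n<p>world</p>"
soup = BeautifulSoup(html, "html.parser")

raw = list(soup.strings)             # strings exactly as they sit in the tree
clean = list(soup.stripped_strings)  # trimmed, whitespace-only strings dropped

print(raw)                           # still contains newlines and padding
print(clean)                         # ['Hello', 'world']
```

Note that only leading and trailing whitespace is trimmed. Whitespace in the middle of a string is kept, which is why the output above still shows a '\n' inside the last long string.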

5. Parent node

.parent

print soup.p.parent.name

print soup.head.title.string.parent.name

#body

#title
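
The same two lookups can be tried on a tiny document, along with what happens at the top of the tree. This is a sketch assuming Python 3 with bs4 and "html.parser"; the HTML is a made-up example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the document used in this post.
html = "<html><head><title>T</title></head><body><p>text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.p.parent.name)              # body
print(soup.title.string.parent.name)   # title: a string's parent is its tag
print(soup.html.parent is soup)        # True: the top tag's parent is the soup
print(soup.parent)                     # None: the soup object has no parent
```

Strings have parents too, so .parent works the same way whether you start from a tag or from a piece of text.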

6. Sibling nodes, next and previous elements, and so on are omitted here.

posted @ 2017-05-12 19:56 沉默的云
