[Repost] A very handy scraping package for Python: Beautiful Soup by example

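Original post: http://blog.csdn.net/watsy/article/details/14161201

First, the official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

If you have time, it is worth reading the package documentation itself.

Beautiful Soup has one very important advantage over other HTML parsers: the HTML is broken down into objects. The whole document is exposed through dictionary-style access (tag attributes) and list-style access (child nodes).

Compared with regex-based scrapers, it spares you the high cost of learning regular expressions.

Compared with XPath-based scrapers, it also saves learning time, although XPath is already fairly simple. (The Scrapy crawler framework uses XPath.)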
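As a minimal sketch of what "objects, dictionaries and lists" means in practice, the snippet below parses a one-line document that is made up for illustration (it is not from the original post). It assumes the parser bundled with Python, "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, parsed with the built-in "html.parser".
soup = BeautifulSoup('<p class="intro" id="first">Hello, <b>soup</b>!</p>', "html.parser")
p = soup.p                 # the first <p> tag, as an object

attrs = p.attrs            # dictionary-style view of the tag's attributes
children = p.contents      # list-style view of the tag's child nodes

print(attrs["id"])         # first
print(len(children))       # 3: the text "Hello, ", the <b> tag, the text "!"
print(children[1].name)    # b
```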

Installation
On Linux (Debian/Ubuntu) you can run

apt-get install python-bs4

You can also install it with one of Python's package tools

easy_install beautifulsoup4

pip install beautifulsoup4
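A quick way to confirm that the installation worked is to import the package and parse something trivial. This is only a sanity check; the sample markup is made up.

```python
import bs4
from bs4 import BeautifulSoup

# Print the installed version, then parse a trivial document.
print(bs4.__version__)

soup = BeautifulSoup("<p>ok</p>", "html.parser")
parsed_text = soup.p.string
print(parsed_text)  # ok
```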

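Usage overview
Here is how to use BeautifulSoup.

Parsing HTML is about extracting data, and that mostly comes down to a few things.

1: Getting the content of a given tag.

hello, watsy

hello, beautiful soup.

2: Getting an attribute of a given tag.

watsy's blog
3: Getting either of these requires the search methods.

The examples use the sample document from the official documentation.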
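html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Pretty-printed output.

from bs4 import BeautifulSoup
# Name the parser explicitly; "html.parser" ships with Python
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.prettify())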

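# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>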

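Getting the content of a given tag
soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u"The Dormouse's story"

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>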

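The examples above illustrate four things
1: Getting a tag

soup.title

2: Getting the tag's name

soup.title.name

3: Getting the content of the title tag

soup.title.string

4: Getting the name of the title tag's parent

soup.title.parent.name

As you can see, the API is very object-oriented.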
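Two details of this attribute-style navigation are easy to miss: soup.p returns only the first matching tag, and .string is None when a tag has more than one child. The sketch below shows both on a small made-up page (the markup is an assumption for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for this illustration.
html = """<html><head><title>Demo page</title></head>
<body><p class="lead"><b>Bold lead</b></p>
<p>Plain text and <a href="/next">a link</a>.</p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

title_name = soup.title.name            # 'title'
title_text = soup.title.string          # 'Demo page'
parent_name = soup.title.parent.name    # 'head'

first_p = soup.p                        # attribute access returns the FIRST <p> only
second_p = soup.find_all("p")[1]        # later tags need a search method

print(first_p.string)    # Bold lead (a single child chain, so .string resolves)
print(second_p.string)   # None (several children, so .string is ambiguous)
```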

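Extracting tag attributes
Next, how to extract attributes such as href.

soup.p['class']
# [u'title']

class can hold several values, so Beautiful Soup returns it as a list. To get an attribute, the pattern is
soup.tag['attribute name']

watsy's blog
The most common case is probably extracting a link like the one above.
The code is

soup.a['href']
Pretty easy, right?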
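When an attribute may be absent, indexing raises KeyError, while .get() returns None or a default. A small sketch on a single made-up link:

```python
from bs4 import BeautifulSoup

# Hypothetical link; the attribute values are for illustration only.
soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>',
    "html.parser",
)
link = soup.a

href = link["href"]                        # 'http://example.com/elsie'
classes = link["class"]                    # ['sister', 'big'] (multi-valued, so a list)
title = link.get("title")                  # None: the attribute is missing
fallback = link.get("title", "no title")   # explicit default instead of None

try:
    link["title"]                          # indexing a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True

print(href, classes, title, fallback, raised)
```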

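Searching and filtering
Now for the important part: searching the whole document and extracting matches.

A soup object provides find and find_all for searching. Internally, find is implemented by calling find_all, so only find_all is covered here.

def find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs):

Look at the parameters.
The first is the tag name and the second is the attributes to match. The third chooses whether to search all descendants recursively or only direct children, text matches on string content, and limit caps the number of results. **kwargs lets you pass attribute filters as keyword arguments.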
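The effect of each parameter is easier to see side by side. The sketch below runs the same search with different arguments on a small made-up fragment; the ids and class names are assumptions chosen for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two direct <p> children and one nested <p>.
html = ('<div id="root"><p class="a">1</p>'
        '<section><p class="b">2</p></section>'
        '<p class="a">3</p></div>')
soup = BeautifulSoup(html, "html.parser")
root = soup.find(id="root")

def texts(tags):
    """Return the string of each tag, to keep the comparisons short."""
    return [t.string for t in tags]

all_p = texts(root.find_all("p"))                         # ['1', '2', '3']
direct_p = texts(root.find_all("p", recursive=False))     # ['1', '3'] direct children only
first_two = texts(root.find_all("p", limit=2))            # ['1', '2'] stops after two hits
by_attrs = texts(root.find_all("p", attrs={"class": "a"}))  # ['1', '3']
by_kwarg = texts(root.find_all("p", class_="b"))          # ['2'] class_ avoids the keyword

# find() gives the same first element as find_all(..., limit=1)
same_first = root.find("p") is root.find_all("p", limit=1)[0]

print(all_p, direct_p, first_two, by_attrs, by_kwarg, same_first)
```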

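Usage examples.

# By tag name
soup.find_all('b')
# [<b>The Dormouse's story</b>]

# With a regular expression
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

# With a list
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]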

# With a function
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

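# By tag name and CSS class
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

# Filtering by tag
soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Filtering by attribute
soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# Filtering text with a regular expression
import re
soup.find(text=re.compile("sisters"))
# u'Once upon a time there were three little sisters; and their names were\n'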
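These filters combine naturally when collecting links. The sketch below gathers every href with a given class, then narrows by a regular expression on the attribute value. It uses a shortened copy of the sample paragraph and assumes a recent bs4 release, where the text filter is also available under the name string.

```python
import re
from bs4 import BeautifulSoup

# Shortened variant of the sample paragraph, for illustration.
html = """<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>"""
soup = BeautifulSoup(html, "html.parser")

# Every link target carrying the CSS class "sister"
hrefs = [a["href"] for a in soup.find_all("a", class_="sister")]

# Attribute values can be matched with a compiled regex as well
lacie_ids = [a["id"] for a in soup.find_all("a", href=re.compile(r"lacie$"))]

# Text search returns the string node; .parent leads back to its tag
sentence = soup.find(string=re.compile("sisters"))
owner = sentence.parent.name

print(hrefs)
print(lacie_ids, owner)
```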

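Getting content and strings
Getting a tag's string
title_tag = soup.title
title_tag.string
# u"The Dormouse's story"

Note that the value returned is a NavigableString that still references the parse tree. In real code, convert it to a plain string object with unicode(title_tag.string) (str(title_tag.string) on Python 3).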
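The conversion can be checked directly. This sketch is written for Python 3, so it uses str() where Python 2 code would use unicode().

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<title>The Dormouse's story</title>", "html.parser")

node = soup.title.string     # a NavigableString, still attached to the tree
plain = str(node)            # a plain str with no link back to the soup

print(isinstance(node, NavigableString))  # True
print(node.parent.name)                   # title
print(type(plain) is str)                 # True
print(plain == node)                      # True: same characters either way
```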

The .strings attribute returns a generator that iterates over all the text content beneath a tag (here, the whole soup)
for string in soup.strings:
    print(repr(string))
# u"The Dormouse's story"
# u'\n\n'
# u"The Dormouse's story"
# u'\n\n'
# u'Once upon a time there were three little sisters; and their names were\n'
# u'Elsie'
# u',\n'
# u'Lacie'
# u' and\n'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# u'...'
# u'\n'
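Most of those strings are pure whitespace. When that is unwanted, .stripped_strings and get_text() are the usual alternatives; a short sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with whitespace between and inside the tags.
soup = BeautifulSoup("<div>\n<p>Hello</p>\n<p>  world  </p>\n</div>", "html.parser")

raw = list(soup.strings)                 # includes the whitespace-only nodes
clean = list(soup.stripped_strings)      # stripped, with empty results dropped
joined = soup.get_text(" ", strip=True)  # one string, pieces joined by a space

print(raw)
print(clean)    # ['Hello', 'world']
print(joined)   # Hello world
```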

Getting contents
.contents returns a tag's child nodes as a list.
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>

title_tag.contents
# [u"The Dormouse's story"]
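.contents only covers direct children. For comparison, the sketch below sets it beside .children (an iterator over the same nodes) and .descendants (every nested node, text included) on a small made-up list.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical list markup for illustration.
soup = BeautifulSoup("<ul><li>one</li><li>two <b>bold</b></li></ul>", "html.parser")
ul = soup.ul

direct = ul.contents                    # list of the two <li> tags
via_iter = list(ul.children)            # same nodes, produced lazily
everything = list(ul.descendants)       # tags and strings at every depth

# Keep only the tags among the descendants
tag_names = [node.name for node in everything if isinstance(node, Tag)]

print(len(direct), len(via_iter), len(everything))  # 2 2 6
print(tag_names)                                    # ['li', 'li', 'b']
```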

That should cover the essentials. Everything else can be learned from the documentation.

Summary
In practice, usage mostly comes down to
soup = BeautifulSoup(data, "html.parser")
soup.title
soup.p['title']
divs = soup.find_all('div', content='tpc_content')
divs[0].contents[0].string
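Put together, a typical extraction routine might look like the sketch below. It is only an illustration: the helper name extract_blocks is hypothetical, and it assumes that tpc_content is a CSS class on the target div elements, which the snippet above does not confirm.

```python
from bs4 import BeautifulSoup

def extract_blocks(html, css_class="tpc_content"):
    """Return the visible text of every <div> carrying the given CSS class.

    Hypothetical helper; the class name is an assumption for this example.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(" ", strip=True)
            for div in soup.find_all("div", class_=css_class)]

# Made-up page with two matching blocks and one that should be skipped.
sample = """<div class="tpc_content">First post</div>
<div class="sidebar">Ignore me</div>
<div class="tpc_content"> Second <b>post</b> </div>"""

blocks = extract_blocks(sample)
print(blocks)  # ['First post', 'Second post']
```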

Posted 2014-12-06 15:13 by catmelo
