python BeautifulSoup

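I used to parse HTML with lxml and XPath. Recently I was pulled in to write a crawler and took over someone else's code, which uses BeautifulSoup. I learned a little of it and found it quite handy, so here are some brief usage notes. When I get the chance I'd like to run a comparison test against XPath.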

Initialization

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,"lxml")
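For a self-contained run, `html` can be any HTML string. A minimal sketch, assuming a hypothetical sample string and Python's built-in "html.parser" backend in place of "lxml", so that nothing beyond bs4 has to be installed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; in a real crawler this would be the downloaded page body.
html = "<html><title>title</title><body><p id='1'>hello1</p></body></html>"

# "html.parser" ships with Python; "lxml" is a separate install.
soup = BeautifulSoup(html, "html.parser")

print(type(soup).__name__)  # BeautifulSoup
print(soup.title)           # <title>title</title>
```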

Once you have the soup, you can start extracting the useful information. For the title, for example, you can use this directly:

soup.title

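The result still carries its tags, in a form like <title>title</title>, but obviously we only want the text inside. The quick and dirty option is to pull it out with a regular expression, which also works.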

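Early on I didn't know the API well and the project deadline was tight, so that is exactly what I did. Here is a quick tour of the API instead.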

print(soup.title.string)
print(soup.title.next)
print(soup.title.next_element)
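To see what these accessors return, here is a small sketch on a one-line document (the markup and the "html.parser" backend are assumptions made for the example):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><title>title</title><body></body></html>", "html.parser")

# .string is a NavigableString, which behaves like an ordinary str.
title_text = soup.title.string
print(title_text)                              # title

# The first node after the <title> tag is its own text node.
print(soup.title.next_element == title_text)   # True

# get_text() always returns a plain string.
print(soup.title.get_text())                   # title
```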

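All of these return the text inside the title. Note, however, that .string does not work well when the tag contains several child nodes (it returns None in that case). For example: <p class="hello" id="1">hello1<strong> world</strong></p>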

For this situation you need to use .strings, as shown here:

pc = soup.body.p
print(pc)
print(pc.string)
for s in pc.strings:
    print(s)
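The difference between .string and .strings can be checked on that exact paragraph. A minimal sketch, assuming the "html.parser" backend; get_text() is included as one more way to get the joined text:

```python
from bs4 import BeautifulSoup

html = '<body><p class="hello" id="1">hello1<strong> world</strong></p></body>'
soup = BeautifulSoup(html, "html.parser")
pc = soup.body.p

# <p> has two children (a text node and <strong>), so .string gives up.
print(pc.string)                      # None

# .strings walks every text node under <p> in document order.
parts = [str(s) for s in pc.strings]
print(parts)                          # ['hello1', ' world']

# get_text() joins the same pieces into a single string.
full_text = pc.get_text()
print(full_text)                      # hello1 world
```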

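Another point to note: soup.tag only returns the first matching element, so it is not much use when several elements share the same tag and you need all of them. In that case use the find_all function. For example: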

<html>
    <title>title</title>
    <body>
        <p id='1' class='hello'>hello1<strong> world</strong></p>
        <p id='2'>hello2
        </p>
        <p id='3'>hello3</p>
        <p id='4'>hello4</p>
        <img src="abc.jpeg"/>
        <a href="http://www.baidu.com"></a>
    </body>
</html>

To extract all of the p elements, you can use:

print(soup.body.find_all("p"))
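The return value behaves like a list of tags. A runnable sketch on a trimmed copy of the sample page (the trimmed markup and the "html.parser" backend are assumptions made for the example):

```python
from bs4 import BeautifulSoup

html = """
<html><title>title</title><body>
<p id='1' class='hello'>hello1<strong> world</strong></p>
<p id='2'>hello2</p>
<p id='3'>hello3</p>
<p id='4'>hello4</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag, in document order.
paragraphs = soup.body.find_all("p")
ids = [p["id"] for p in paragraphs]

print(len(paragraphs))  # 4
print(ids)              # ['1', '2', '3', '4']
```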

And what if I only want the p that has the class?

print(soup.body.find_all("p", attrs={"class": "hello"}))
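Besides the attrs dictionary, bs4 also accepts the class_ keyword argument (with a trailing underscore, since class is a reserved word in Python). A short sketch comparing the two forms, on assumed sample markup:

```python
from bs4 import BeautifulSoup

html = "<body><p id='1' class='hello'>hello1</p><p id='2'>hello2</p></body>"
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.body.find_all("p", attrs={"class": "hello"})
by_keyword = soup.body.find_all("p", class_="hello")

# Both forms select the same single paragraph.
attrs_ids = [p["id"] for p in by_attrs]
keyword_ids = [p["id"] for p in by_keyword]
print(attrs_ids)    # ['1']
print(keyword_ids)  # ['1']
```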

In the same way, we can extract only the p with id=3.
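A sketch of that id lookup, assuming the same kind of sample markup and the "html.parser" backend:

```python
from bs4 import BeautifulSoup

html = "<body><p id='2'>hello2</p><p id='3'>hello3</p><p id='4'>hello4</p></body>"
soup = BeautifulSoup(html, "html.parser")

# Attribute values are compared as strings, so the id is "3", not 3.
third = soup.body.find_all("p", attrs={"id": "3"})

print(len(third))   # 1
print(third[0])     # <p id="3">hello3</p>
```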

So here is the next question: how do I get the id of the p that has the class attribute?

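It's simple: once we have the right p, p['id'] gives us the value of its id. Keep in mind, though, that find_all can match several p elements (even though there is only one in this example), so what it returns is always a collection and we have to index into it: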

p = soup.body.find_all("p", attrs={"class": "hello"})
print(type(p))
print(p[0]['id'])
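Indexing with ['id'] raises KeyError when the attribute is absent. A hedged sketch of the gentler alternatives, .get() and .attrs, on assumed sample markup:

```python
from bs4 import BeautifulSoup

html = "<body><p id='1' class='hello'>hello1</p></body>"
soup = BeautifulSoup(html, "html.parser")

p = soup.body.find_all("p", attrs={"class": "hello"})
tag = p[0]

print(tag["id"])          # 1
print(tag.get("title"))   # None, since the tag has no title attribute
print(tag.attrs["id"])    # 1, read from the tag's attribute dictionary
```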

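Sometimes we don't need everything find_all returns and only want the first element that matches. Naturally there is a find function for that. It is used in much the same way, so I'll skip it here.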
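For completeness, a minimal sketch of find on assumed sample markup. It returns the tag itself rather than a collection, and None when nothing matches:

```python
from bs4 import BeautifulSoup

html = "<body><p id='1' class='hello'>hello1</p><p id='2'>hello2</p></body>"
soup = BeautifulSoup(html, "html.parser")

# find returns the first match directly, so no [0] is needed.
first = soup.body.find("p", attrs={"class": "hello"})
print(first["id"])   # 1

# With no match, find returns None instead of an empty list.
missing = soup.body.find("p", attrs={"class": "nope"})
print(missing)       # None
```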

Also worth noting are finding siblings and finding the parent node (the latter is used less often).

pc = soup.body.p

# Find its sibling nodes; next_siblings is consumed by iterating over it
for item in pc.next_siblings:
    print(str(item).replace("\n", ""))

# Find its next sibling
print(pc.find_next_sibling())

# Find the parent node
print(pc.parent)
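One thing that is easy to trip over: next_siblings also yields the text nodes (such as newlines) that sit between tags. A sketch that filters them out, assuming a trimmed sample page and the "html.parser" backend:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<body>
<p id='1'>hello1</p>
<p id='2'>hello2</p>
<img src="abc.jpeg"/>
</body>"""
soup = BeautifulSoup(html, "html.parser")
pc = soup.body.p

# Keep only real tags, skipping the whitespace text nodes in between.
tag_siblings = [s.name for s in pc.next_siblings if isinstance(s, Tag)]
print(tag_siblings)                          # ['p', 'img']

# Passing a tag name restricts the search to siblings with that name.
next_p = pc.find_next_sibling("p")
next_img = pc.find_next_sibling("img")
print(next_p["id"])                          # 2
print(next_img["src"])                       # abc.jpeg

print(pc.parent.name)                        # body
```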

Now for the ultimate move: how do we find an element that has both a class attribute and an id attribute?

def has_class_with_id(html):
    # True only for tags that carry both a class and an id attribute
    return html.has_attr('class') and html.has_attr('id')


result = soup.find_all(has_class_with_id)
for item in result:
    print(item)
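When find_all is given a function, it calls that function on each tag in the document and keeps the tags for which it returns True. A self-contained sketch on assumed sample markup:

```python
from bs4 import BeautifulSoup

html = """<html><title>title</title><body>
<p id='1' class='hello'>hello1</p>
<p id='2'>hello2</p>
<p class='bye'>no id here</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")


def has_class_with_id(tag):
    # Called once per tag (html, title, body, p, ...).
    return tag.has_attr("class") and tag.has_attr("id")


result = soup.find_all(has_class_with_id)
matched_ids = [tag["id"] for tag in result]
print(matched_ids)  # ['1']
```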

Here is a slightly harder one: how do I find the element with class=hello and id=1?

def is_right(html):
    # Debug output: show each tag and the attributes being tested
    print(html)
    print(html.has_attr('class'))
    print(html.has_attr('id'))
    if html.has_attr('class'):
        print(html['class'][0])
    if html.has_attr('id'):
        print(html['id'])
    print("")
    return html.has_attr('class') and html.has_attr('id') and html['class'][0] == "hello" and html['id'] == "1"


Note that class may hold several values, so its value is a list as well.
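A sketch of what the multi-valued class means in practice. The markup with two classes on one tag is an assumption made for the example. Testing membership with `in` avoids depending on which class comes first, and the keyword form of find_all is shown as a shorter alternative to a custom function:

```python
from bs4 import BeautifulSoup

html = ("<body><p id='1' class='big hello'>hello1</p>"
        "<p id='2' class='hello'>hello2</p></body>")
soup = BeautifulSoup(html, "html.parser")

# The class attribute comes back as a list, one entry per class name.
classes = list(soup.find("p")["class"])
print(classes)  # ['big', 'hello']


def is_right(tag):
    # .get avoids a KeyError on tags that lack the attribute.
    return "hello" in tag.get("class", []) and tag.get("id") == "1"


by_function = [t["id"] for t in soup.find_all(is_right)]
by_keyword = [t["id"] for t in soup.find_all("p", class_="hello", id="1")]
print(by_function)  # ['1']
print(by_keyword)   # ['1']
```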

posted @ 2016-05-19 11:10 LiuWei_Find
