BeautifulSoup4: using the find_all and get methods to extract information

Chinese documentation
HTML source used in the official tutorial:

<html>
    <head>
        <title>Page title</title>
    </head>
    <body>
        <p id="firstpara" align="center">
        This is paragraph<b>one</b>.
        </p>
        <p id="secondpara" align="blah">
        This is paragraph<b>two</b>.
        </p>
     </body>
</html>

Parameters of the find method and what they mean

find(name=None, attrs={}, recursive=True, text=None, **kwargs)

1. Searching by tag name:

find(tagname)          # match the tag named tagname, e.g. find('head')
find(list)             # match tags whose name is in the list, e.g. find(['head', 'body'])
find(dict)             # match tags whose name is a key of the dict, e.g. find({'head': True, 'body': True})
find(re.compile(''))   # match tags whose name matches the regex, e.g. find(re.compile('^p')) for names starting with p
find(function)         # match tags for which the function returns True, e.g. find(lambda tag: len(tag.name) == 1) for one-letter tag names
find(True)             # match every tag
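
The name filters listed above can be tried against the tutorial page. The following is a minimal sketch, assuming bs4 is installed and using the built-in "html.parser"; the variable names are only illustrative, and find_all is used instead of find so that every match is visible.

```python
import re
from bs4 import BeautifulSoup

# The tutorial page from the top of this post.
HTML = """<html>
    <head>
        <title>Page title</title>
    </head>
    <body>
        <p id="firstpara" align="center">
        This is paragraph<b>one</b>.
        </p>
        <p id="secondpara" align="blah">
        This is paragraph<b>two</b>.
        </p>
    </body>
</html>"""

soup = BeautifulSoup(HTML, "html.parser")

# A plain string matches one tag name.
by_name = soup.find("head").name

# A list matches any of the given names.
by_list = [t.name for t in soup.find_all(["head", "body"])]

# A compiled regex is tested against each tag name.
by_regex = [t.name for t in soup.find_all(re.compile("^p"))]

# A function receives each Tag object and returns True to keep it.
by_func = [t.name for t in soup.find_all(lambda tag: len(tag.name) == 1)]

# True matches every tag in the document.
all_tags = [t.name for t in soup.find_all(True)]

print(by_name, by_list, by_regex, by_func, all_tags)
```

Note that the function form is handed the whole tag, so the name is read as tag.name.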

2. Searching by attributes (attrs):

find(id='xxx')                                          # tag whose id attribute is xxx
find(attrs={'id': re.compile('xxx'), 'align': 'xxx'})   # id matches the regex and align is xxx
find(attrs={'id': True, 'align': None})                 # has an id attribute but no align attribute
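
A short sketch of attribute filtering, assuming bs4 is installed. The third paragraph in the sample is an addition for illustration and is not part of the tutorial page. The "has id but no align" case is written here as a function filter, which is my own choice for the sketch rather than the attrs form listed above, because it does not depend on how a particular bs4 version treats a None value.

```python
import re
from bs4 import BeautifulSoup

# Two paragraphs from the tutorial page plus one hypothetical paragraph
# without an align attribute (added only for this illustration).
HTML = """<p id="firstpara" align="center">This is paragraph<b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph<b>two</b>.</p>
<p id="thirdpara">No align attribute here.</p>"""

soup = BeautifulSoup(HTML, "html.parser")

# Keyword argument: exact match on one attribute.
exact = soup.find(id="firstpara")

# attrs dict: every condition must hold (regex on id, exact value on align).
combined = soup.find(attrs={"id": re.compile("para$"), "align": "blah"})

# True as a value means "the attribute exists, whatever its value".
with_id = soup.find_all(attrs={"id": True})

# Function filter: has an id attribute but no align attribute.
has_id_no_align = soup.find(
    lambda tag: tag.has_attr("id") and not tag.has_attr("align")
)

print(exact["id"], combined["id"], len(with_id), has_id_no_align["id"])
```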

Using BeautifulSoup4 to scrape book IDs from Douban

The code is as follows:

import requests
from bs4 import BeautifulSoup as bs

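# Start from one URL in Douban's '编程' (programming) category and collect the book IDs
url = 'https://book.douban.com/tag/编程?start=20&type=T'
res = requests.get(url)  # send the request
#print(res.encoding)    # shows the encoding requests detected for the page
#res.encoding = 'utf-8'   # use together with the line above: set the encoding explicitly if the text comes out garbled
html = res.text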
#print(html)

IDs = []
soup = bs(html, "html.parser")     # create a BeautifulSoup object from the page source
items = soup.find_all('a',attrs={'class':'nbg'})
#print(items)

for i in items:
    idl = i.get('href')
    #print(idl)
    id = idl.split('/')[4]
    print(id)
    IDs.append(id)
print('Number of book IDs collected on this page: %d' % len(IDs))

The first part fetches the page source with the requests module.
The second part parses the page with BeautifulSoup and pulls out the information we need.
- ```
soup  = bs(html,"html.parser")
```
  This line creates a variable that holds the page source after BeautifulSoup has parsed it.
- ```
items = soup.find_all('a',attrs={'class':'nbg'})
```
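  This line looks for a tags. A page has many a tags and we do not need all of them, so we add a second condition: the tag must also have the attribute class='nbg'. Only those a tags are kept. items is a list.
- Attribute conditions all go in the attrs dictionary. When an attribute's value is not fixed, write 'attribute name': True to match any value.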
- ```
for i in items:
    idl = i.get('href')
```
  This loop reads the value of the 'href' attribute from each a tag that met the conditions.
- ```
id = idl.split('/')[4]
```
  The 'href' value is a link, but we only need the ID, so we split the link on '/' and take the segment that holds the ID.
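
The steps above can be run offline against a small snippet. This is a minimal sketch, assuming bs4 is installed: the SAMPLE markup, the book IDs in it, and the helper name extract_ids are hypothetical and only imitate the shape of the listing page, which may differ in practice.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the listing page; IDs are made up.
SAMPLE = """
<ul>
  <li><a class="nbg" href="https://book.douban.com/subject/1000001/">cover</a></li>
  <li><a class="other" href="https://book.douban.com/subject/1000002/">ignored</a></li>
  <li><a class="nbg" href="https://book.douban.com/subject/1000003/">cover</a></li>
</ul>
"""


def extract_ids(html):
    """Return the ID segment of every <a class="nbg"> link in html."""
    soup = BeautifulSoup(html, "html.parser")
    ids = []
    for a in soup.find_all("a", attrs={"class": "nbg"}):
        href = a.get("href")  # get() returns None when the attribute is missing
        if not href:
            continue
        # 'https://book.douban.com/subject/1000001/'.split('/') gives
        # ['https:', '', 'book.douban.com', 'subject', '1000001', '']
        # so the ID sits at index 4.
        parts = href.split("/")
        if len(parts) > 4:
            ids.append(parts[4])
    return ids


ids = extract_ids(SAMPLE)
print(ids)
```

The length check is a small safeguard added for the sketch: index 4 is only meaningful when the link has the expected scheme, host, 'subject' segment and ID.
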
For a complete crawler example, see: 智联招聘 (Zhaopin) crawler
For BeautifulSoup's select (CSS selector) method, see the crawler example: 前程无忧 (51job) crawler
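
As a rough comparison with find_all, the same a tags can be picked out with a CSS selector. This is a minimal sketch, assuming bs4 with its CSS selector support installed; the SAMPLE markup and IDs are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the IDs are made up for illustration.
SAMPLE = """<a class="nbg" href="https://book.douban.com/subject/1000001/">cover</a>
<a class="other" href="https://book.douban.com/subject/1000002/">ignored</a>"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# find_all with an attrs dictionary, as used in this post.
via_find_all = [a.get("href") for a in soup.find_all("a", attrs={"class": "nbg"})]

# select with the CSS selector "a.nbg": a tags whose class includes nbg.
via_select = [a.get("href") for a in soup.select("a.nbg")]

print(via_find_all == via_select)
```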

posted @ 2017-02-12 01:23 晴空行
