Python for Data Analysis - 05 - pandas Basics

1. Series

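A Series is a one-dimensional array-like object made up of a sequence of values and an associated array of data labels, called its index. The simplest Series is created from just a sequence of values:

import pandas as pd

p=pd.Series([1,2,4,3])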

p
Out[5]: 
0    1
1    2
2    4
3    3
dtype: int64

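The index is on the left and the values are on the right. The values and index attributes return the underlying array and the index object: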

p.values
Out[6]: array([1, 2, 4, 3], dtype=int64)

p.index
Out[7]: Int64Index([0, 1, 2, 3], dtype='int64')

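A single value or a set of values in a Series can be selected by index label. Because p has the default integer index, the label is the integer 0, not the string '0':

p[0]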
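
Selection by label is easier to see with a string index. The following is a minimal sketch; the Series s and its labels are made up for illustration and are not part of the session above.

```python
import pandas as pd

# Hypothetical Series with string labels (illustrative values only)
s = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])

single = s['b']          # one label returns the scalar stored under it
subset = s[['c', 'a']]   # a list of labels returns a new Series in that order

print(single)
print(subset)
```

Passing a list of labels keeps the requested order, so the result here is indexed c, a rather than a, c.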

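NumPy-style operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, preserve the link between index and values. The examples below assume obj2 is a Series with a string index, defined as:

obj2 = pd.Series([4,7,-5,3],index = ['a','b','c','d'])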

obj2
Out[22]: 
a    4
b    7
c   -5
d    3
dtype: int64

obj2[obj2>0]
Out[26]: 
a    4
b    7
d    3
dtype: int64

obj2 * 2
Out[27]: 
a     8
b    14
c   -10
d     6
dtype: int64

import numpy as np

np.exp(obj2)
Out[29]: 
a      54.598150
b    1096.633158
c       0.006738
d      20.085537
dtype: float64

If the data is stored in a Python dict, a Series can be created directly from it:

sdata = {'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}

obj3 = pd.Series(sdata)

obj3
Out[34]: 
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

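When only a dict is passed, the index of the resulting Series consists of the dict's keys. If an index is also passed, values are looked up by key: 'California' has no entry in sdata, so its value is NaN, and 'Utah' is left out because it is not in states: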

states = ['California','Ohio','Oregon','Texas']

sdata = {'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}

obj4 = pd.Series(sdata,index = states)

obj4
Out[12]: 
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
dtype: float64
The isnull and notnull functions in pandas can be used to detect missing data:
pd.isnull(obj4)
Out[13]: 
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

pd.notnull(obj4)
Out[14]: 
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Equivalently, the Series instance methods work too, for example obj4.isnull().
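
As a rough sketch of how the boolean result can be used, the example below rebuilds obj4 from the same sdata and states, then counts the missing entries and keeps only the labels that have a value. The variable names mask, n_missing and present are chosen here for illustration.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

mask = obj4.isnull()             # method form, same result as pd.isnull(obj4)
n_missing = int(mask.sum())      # True counts as 1, so this counts missing entries
present = obj4[obj4.notnull()]   # boolean indexing keeps only non-missing labels

print(n_missing)
print(present)
```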

Both the Series object itself and its index have a name attribute:

obj4.name = 'population'

obj4.index.name = 'state'

obj4
Out[18]: 
state
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
Name: population, dtype: float64

2. DataFrame: a tabular data structure

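A DataFrame contains an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, etc.). A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index.

Because the columns share one index, every column must have the same length.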
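
The "dict of Series sharing one index" view can be sketched directly. The two Series below, x and y, are hypothetical and deliberately have partly different labels, to show that pandas builds one common index and fills the gaps with NaN.

```python
import pandas as pd

# Two hypothetical Series whose labels only partly overlap
x = pd.Series({'a': 1.0, 'b': 2.0})
y = pd.Series({'b': 20.0, 'c': 30.0})

# Each dict key becomes a column; the row index is the union of the labels
frame = pd.DataFrame({'x': x, 'y': y})

col = frame['x']   # pulling a column back out gives a Series on the shared index

print(frame)
```

Every column in frame has the same length because each one is reindexed to the shared row index.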

Constructing a DataFrame from a dict of equal-length lists:

data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
'year':[2000,2001,2002,2001,2002],
'pop':[1.5,1.7,3.6,2.4,2.9]}

frame = pd.DataFrame(data)

frame
Out[41]: 
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

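As with a Series, an index is assigned automatically. In the pandas version used here, the columns are placed in sorted order (newer versions keep the dict's insertion order).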

If you specify a sequence of columns, the columns are arranged in that order:

pd.DataFrame(data,columns=['year','state','pop'])
Out[46]: 
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9

As with a Series, if you pass a column that is not contained in the data, it appears with NA values in the result:

frame2 = pd.DataFrame(data,columns = ['year','state','pop','debt'],
index = ['one','two','three','four','five'])

frame2
Out[6]: 
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

frame2.columns
Out[7]: Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column of a DataFrame can be retrieved as a Series either with dict-like notation or by attribute:

frame2['state']
Out[8]: 
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

frame2.state
Out[9]: 
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

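The returned Series has the same index as the original DataFrame, and its name attribute has been set accordingly. Rows can also be retrieved by position or by label, for example with the loc indexer (older pandas versions used ix for this, which has since been removed):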

data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
'year':[2000,2001,2002,2001,2002],
'pop':[1.5,1.7,3.6,2.4,2.9]}

frame2 = pd.DataFrame(data,columns = ['year','state','pop','debt'],
index = ['one','two','three','four','five'])

frame2.loc['three']     # get the data for one row (ix was removed in pandas 1.0)
Out[7]: 
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
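
Row selection by label and by position can be sketched side by side. The example rebuilds frame2 from the data above; by_label and by_position are names chosen here for illustration.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

by_label = frame2.loc['three']    # row selected by its index label
by_position = frame2.iloc[2]      # row selected by 0-based integer position

print(by_label)
print(by_position)
```

Both calls return the same row as a Series whose name is the row label and whose index is the column names.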

frame2['debt'] = 16.5       # set every value in the debt column

frame2
Out[9]: 
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5

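When you assign a list or array to a column, its length must match the length of the DataFrame. If you assign a Series, its labels are aligned exactly to the DataFrame's index, and any holes are filled with missing values: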

frame2 = pd.DataFrame(data,columns = ['year','state','pop','debt'],
index = ['one','two','three','four','five'])

val = pd.Series([-1.2,-1.5,-1.7],index = ['two','four','five'])

frame2['debt']=val

frame2
Out[16]: 
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
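
The difference between assigning a list and assigning a Series can be sketched as follows. The small frame and the column names bad and partial are hypothetical; the point is that a list of the wrong length is rejected, while a Series is aligned by label.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002]},
                     index=['one', 'two', 'three'])

# A list is assigned by position, so its length must equal the number of rows
frame['debt'] = [0.1, 0.2, 0.3]

try:
    frame['bad'] = [1.0, 2.0]     # too short for three rows
    length_error = False
except ValueError:
    length_error = True           # pandas rejects the mismatched length

# A Series is aligned on labels: 'four' is not in the frame and is ignored,
# while 'one' and 'three' have no value and become NaN
frame['partial'] = pd.Series([-1.2, -1.5], index=['two', 'four'])

print(length_error)
print(frame)
```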

frame2['eastern'] = frame2.state == 'Ohio'
# assigning to a column that does not exist creates a new column
frame2
Out[18]: 
       year   state  pop  debt eastern
one    2000    Ohio  1.5   NaN    True
two    2001    Ohio  1.7  -1.2    True
three  2002    Ohio  3.6   NaN    True
four   2001  Nevada  2.4  -1.5   False
five   2002  Nevada  2.9  -1.7   False

del frame2['eastern']
# the del keyword deletes a column
frame2.columns
Out[20]: Index(['year', 'state', 'pop', 'debt'], dtype='object')

3. Nested dicts

pop={'Nevada':{2001:2.4,2002:2.9},
'Ohio':{2000:1.5,2001:1.7,2002:3.6}}

If this nested dict is passed to DataFrame, pandas interprets the outer dict keys as the columns and the inner keys as the row index:

frame3 = pd.DataFrame(pop)

frame3
Out[23]: 
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

The result can be transposed:

frame3.T
Out[24]: 
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6

The keys of the inner dicts are combined and sorted to form the final index. This does not happen if an index is specified explicitly.
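
A minimal sketch of the explicit-index case, reusing the pop dict from above. The index values [2001, 2002, 2003] are chosen here for illustration: 2000 is dropped because it is not requested, and 2003 has no data in either inner dict.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# With an explicit index, only the requested row labels are used, in that order
frame = pd.DataFrame(pop, index=[2001, 2002, 2003])

print(frame)
```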

If the index and columns of a DataFrame have their name attributes set, these names are displayed as well:

frame3.index.name = 'year';frame3.columns.name = 'state'

frame3
Out[27]: 
state  Nevada  Ohio
year               
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

As with a Series, the values attribute returns the data contained in the DataFrame as a two-dimensional ndarray:

frame3.values
Out[28]: 
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])
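
One detail worth knowing is what happens when the columns have different types. The sketch below uses a small hypothetical frame; the array's dtype is then chosen to accommodate all of the columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing integer, string and float columns
frame = pd.DataFrame({'year': [2000, 2001],
                      'state': ['Ohio', 'Nevada'],
                      'pop': [1.5, 2.4]})

mixed = frame.values                      # strings force a generic object array
numeric = frame[['year', 'pop']].values   # int and float columns upcast to float64

print(mixed.dtype, mixed.shape)
print(numeric.dtype)
```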

4. Index objects

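pandas Index objects are responsible for holding the axis labels and other metadata (such as the axis name). Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index: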

obj=pd.Series(range(3),index = ['a','b','c'])

index = obj.index

index
Out[33]: Index(['a', 'b', 'c'], dtype='object')

index[1:]
Out[34]: Index(['b', 'c'], dtype='object')

Index objects are immutable, so they cannot be modified in place.
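
Immutability can be checked with a short sketch: assigning to one element of an Index raises a TypeError, while replacing the whole index of a Series is allowed. The replacement labels x, y, z are made up for illustration.

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index

try:
    index[1] = 'd'        # item assignment on an Index is not supported
    raised = False
except TypeError:
    raised = True

# Replacing the entire index at once is fine; the old Index object is unchanged
obj.index = ['x', 'y', 'z']

print(raised)
print(obj)
```

Because an Index cannot change underneath you, it is safe to share the same Index object between several data structures.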

posted @ 2015-11-03 13:58  Groupe