The pandas module
Import conventions
>>> import numpy as np
>>> import pandas as pd
>>> from pandas import Series,DataFrame
Series
A one-dimensional array-like object made up of a sequence of data and an associated set of data labels (its index)
>>> obj=Series([4,7,-5,3])
>>> obj
0 4
1 7
2 -5
3 3
dtype: int64
Use the values and index attributes to get the array representation and the index object
>>> obj.values
array([ 4, 7, -5, 3])
>>> obj.index
RangeIndex(start=0, stop=4, step=1)
Create a Series with an index that labels each data point
>>> obj2=Series([4,7,-5,3],index=['d','b','a','c'])
>>> obj2
d 4
b 7
a -5
c 3
dtype: int64
>>> obj2.index
Index([u'd', u'b', u'a', u'c'], dtype='object')
Unlike a plain NumPy array, a Series lets you use index labels to select a single value or a set of values
>>> obj2['a']
-5
>>> obj2[['a','b','c']]
a -5
b 7
c 3
dtype: int64
A Series can also be thought of as a fixed-length, ordered dict
>>> 'b' in obj2
True
>>> 'e' in obj2
False
If the data is stored in a Python dict, you can create a Series directly from that dict
>>> sdata={'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
>>> obj3=Series(sdata)
>>> obj3
Ohio 35000
Oregon 16000
Texas 71000
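Utah 5000
dtype: int64
When only a dict is passed, the index of the resulting Series is the dict's keys (in sorted order in the pandas version shown here)
When an index is also passed, the values in sdata whose keys match the labels in states are placed in the order given by states; 'California' has no match, so its value is NaN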
>>> states=['California','Ohio','Oregon','Texas']
>>> obj4=Series(sdata,index=states)
>>> obj4
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
The pandas isnull and notnull functions detect missing data
>>> pd.isnull(obj4)
California True
Ohio False
Oregon False
Texas False
dtype: bool
Series has equivalent instance methods
>>> obj4.isnull()
California True
Ohio False
Oregon False
Texas False
dtype: bool
One of the most important Series features is data alignment: arithmetic operations automatically align data by index label
>>> obj3+obj4
California NaN
Ohio 70000.0
Oregon 32000.0
Texas 142000.0
Utah NaN
dtype: float64
Both the Series object and its index have a name attribute
>>> obj4.name='population'
>>> obj4.index.name='state'
>>> obj4
state
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
Name: population, dtype: float64
The index of a Series can be changed in place by assignment
>>> obj.index=['Bob','Steve','Jeff','Ryan']
>>> obj
Bob 4
Steve 7
Jeff -5
Ryan 3
dtype: int64
DataFrame
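To build a DataFrame, pass a dict of equal-length lists or NumPy arrays
An index is added automatically; in the pandas version shown here the columns are placed in sorted order (newer versions keep the dict's insertion order)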
>>> data={'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],'year':[2000,2001,2002,2001,2002],'pop':[1.5,1.7,3.6,2.4,2.9]}
>>> frame=DataFrame(data)
>>> frame
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
If you specify a sequence of columns, the DataFrame's columns are arranged in that order
>>> DataFrame(data,columns=['year','state','pop'])
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
As with Series, a requested column that is not found in the data is filled with NA values
>>> frame2=DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four','five'])
>>> frame2
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
>>> frame2.columns
Index([u'year', u'state', u'pop', u'debt'], dtype='object')
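A DataFrame column can be retrieved as a Series either by dict-like notation or by attribute. The returned Series has the same index as the DataFrame, and its name attribute is set accordingly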
>>> frame2['state']
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
Name: state, dtype: object
>>> frame2.year
one 2000
two 2001
three 2002
four 2001
five 2002
Name: year, dtype: int64
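Rows can also be retrieved by position or by name, for example with the ix indexing field (ix was deprecated in pandas 0.20 and removed in 1.0; loc and iloc replace it)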
>>> frame2.ix['three']
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
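Because the ix indexer no longer exists in pandas 1.0 and later, here is a minimal sketch of the same row lookup, assuming a recent pandas version. It rebuilds frame2 from this section and uses loc for label-based selection and iloc for position-based selection.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

# Label-based lookup: the row whose index label is 'three'
row_by_label = frame2.loc['three']

# Position-based lookup: the third row (positions start at 0)
row_by_position = frame2.iloc[2]

print(row_by_label)
```

Both lookups return the same row as a Series whose name is the row label.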
Columns can be modified by assignment
>>> frame2['debt']=16.5
>>> frame2['debt']=np.arange(5.)
>>> frame2
year state pop debt
one 2000 Ohio 1.5 0.0
two 2001 Ohio 1.7 1.0
three 2002 Ohio 3.6 2.0
four 2001 Nevada 2.4 3.0
five 2002 Nevada 2.9 4.0
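When you assign a list or array to a column, its length must match the length of the DataFrame
If you assign a Series, its labels are aligned exactly to the DataFrame's index, and missing values are inserted in any holes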
>>> val=Series([-1.2,-1.5,-1.7],index=['two','four','five'])
>>> frame2['debt']=val
>>> frame2
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
Assigning to a column that does not exist creates a new column
>>> frame2['eastern']=frame2.state == 'Ohio'
>>> frame2
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
The del keyword deletes a column
>>> del frame2['eastern']
>>> frame2
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
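In the pandas version used here, a column returned by indexing is a view on the underlying data, not a copy, so any in-place modification of the returned Series is reflected in the source DataFrame. With copy-on-write enabled in newer versions, such modifications no longer propagate
Another common form of data is a nested dict (a dict of dicts)
The outer dict keys become the columns and the inner keys become the row index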
>>> pop={'Nevada':{2001:2.4,2002:2.9},
... 'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
>>> frame3=DataFrame(pop)
>>> frame3
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
Transpose the result
>>> frame3.T
2000 2001 2002
Nevada NaN 2.4 2.9
Ohio 1.5 1.7 3.6
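The keys of the inner dicts are unioned and sorted to form the index of the result, unless an explicit index is passed, as here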
>>> DataFrame(pop,index=[2001,2002,2003])
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN
Set the name attributes of a DataFrame's index and columns
>>> frame3.index.name='year';frame3.columns.name='state'
>>> frame3
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
As with Series, the values attribute returns the data in the DataFrame, here as a two-dimensional ndarray
>>> frame3.values
array([[ nan, 1.5],
[ 2.4, 1.7],
[ 2.9, 3.6]])
If the DataFrame's columns have different dtypes, the dtype of the values array is chosen to accommodate all of the columns
>>> frame2.values
array([[2000, 'Ohio', 1.5, nan],
[2001, 'Ohio', 1.7, -1.2],
[2002, 'Ohio', 3.6, nan],
[2001, 'Nevada', 2.4, -1.5],
[2002, 'Nevada', 2.9, -1.7]], dtype=object)
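Index objects
Index objects are an important part of the pandas data model
They hold the axis labels and other metadata. Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index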
>>> obj=Series(range(3),index=['a','b','c'])
>>> index=obj.index
>>> index
Index([u'a', u'b', u'c'], dtype='object')
Index objects are immutable, which makes it safe to share an Index among multiple data structures
Reindexing
>>> obj=Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
>>> obj
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
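Calling reindex on this Series rearranges the data according to the new index and introduces missing values for any index labels that were not already present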
>>> obj2=obj.reindex(['a','b','c','d','e'])
>>> obj2
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
Specify a fill value to use instead of NaN
>>> obj2=obj.reindex(['a','b','c','d','e'],fill_value=0)
>>> obj2
a -5.3
b 7.2
c 3.6
d 4.5
e 0.0
dtype: float64
For ordered data such as time series, you may need to interpolate or fill values when reindexing
The method option does this
>>> obj3=Series(['blue','purple','yellow'],index=[0,2,4])
Values of the method option for reindex
ffill or pad: fill (carry) values forward
>>> obj3.reindex(range(6),method='ffill')
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
For DataFrame, reindex can alter the row index, the columns, or both. When passed just a sequence, it reindexes the rows
>>> frame=DataFrame(np.arange(9).reshape((3,3)),index=['a','c','d'],columns=['Ohio','Texas','California'])
>>> frame
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
>>> frame2=frame.reindex(['a','b','c','d'])
>>> frame2
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
Use the columns keyword to reindex the columns
>>> states=['Texas','Utah','California']
>>> frame.reindex(columns=states)
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
Rows and columns can be reindexed in one call, but interpolation is applied only along the rows (axis 0)
>>> frame.reindex(index=['a','b','c','d'],method='ffill',columns=states)
Texas Utah California
a 1 NaN 2
b 1 NaN 2
c 4 NaN 5
d 7 NaN 8
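With the label-indexing feature of ix, reindexing can be written more concisely (this relies on old behavior; in current pandas, selecting missing labels raises KeyError, so use reindex)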
>>> frame.ix[['a','b','c','d'],states]
Texas Utah California
a 1.0 NaN 2.0
b NaN NaN NaN
c 4.0 NaN 5.0
d 7.0 NaN 8.0
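Arguments of the reindex function
index       New sequence to use as the index
method      Interpolation (fill) method, e.g. ffill
fill_value  Substitute value to use where reindexing introduces missing data
limit       Maximum number of consecutive entries to fill when forward- or backward-filling
level       Match a simple Index on the given level of a MultiIndex; otherwise select a subset
copy        Defaults to True: always copy the data, even when the new index equals the old one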
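A small sketch of how method, limit, and fill_value interact, assuming a recent pandas version. The Series values here are illustrative only.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# method='ffill' carries the last valid value forward;
# limit=1 allows at most one consecutive filled slot per gap
filled = obj3.reindex(range(7), method='ffill', limit=1)

# fill_value replaces the NaN that reindex would otherwise introduce
obj = pd.Series([4.5, 7.2, -5.3], index=['d', 'b', 'a'])
padded = obj.reindex(['a', 'b', 'c', 'd'], fill_value=0)

print(filled)
print(padded)
```

Label 6 stays missing in filled because it is two steps past the last valid label, which exceeds limit=1.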
Dropping entries from an axis
>>> obj=Series(np.arange(5.),index=['a','b','c','d','e'])
>>> obj
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
>>> new_obj=obj.drop('c')
>>> new_obj
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
For DataFrame, index values can be dropped from either axis
>>> data=DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
>>> data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
>>> data.drop(['Colorado','Ohio'])
one two three four
Utah 8 9 10 11
New York 12 13 14 15
>>> data.drop('two',axis=1)
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
>>> data.drop(['two','four'],axis=1)
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
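Indexing, selection, and filtering
Series indexing works like NumPy array indexing, except that you can use the Series's index values rather than only integers
Slicing a Series with labels differs from normal Python slicing in that the endpoint is inclusive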
>>> data=DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
>>> data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
Slicing a DataFrame
>>> data[:2]
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
>>> data<5
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
>>> data[data<5]=0
>>> data
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
>>> data.ix['Colorado',['two','three']]
two 5
three 6
Name: Colorado, dtype: int64
>>> data.ix[['Colorado','Utah'],[3,0,1]]
four one two
Colorado 7 0 5
Utah 11 8 9
>>> data.ix[2]
one 8
two 9
three 10
four 11
Name: Utah, dtype: int64
>>> data.ix[:'Utah','two']
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int64
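Arithmetic and data alignment
Automatic data alignment introduces NA values at index labels that do not overlap
For DataFrame, alignment happens on both the rows and the columns
To substitute a value for the missing side instead, use the add method and pass the other operand together with a fill_value argument: obj.add(obj2,fill_value=0)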
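The difference between plain + and add with fill_value can be sketched as follows. The two frames df1 and df2 are illustrative names, not objects defined elsewhere in these notes.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

# Plain + leaves NaN wherever a label exists in only one operand
plain = df1 + df2

# add(..., fill_value=0) treats the missing side as 0 instead
filled = df1.add(df2, fill_value=0)

print(plain)
print(filled)
```

Here every position exists in at least one operand, so filled contains no NaN, while plain has NaN in row 3 and in column e.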
Operations between DataFrame and Series
>>> arr=np.arange(12.).reshape((3,4))
>>> arr
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
>>> arr[0]
array([ 0., 1., 2., 3.])
>>> arr-arr[0]
array([[ 0., 0., 0., 0.],
[ 4., 4., 4., 4.],
[ 8., 8., 8., 8.]])
This is called broadcasting
>>> frame=DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
>>> series=frame.ix[0]
>>> frame
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
>>> series
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
By default, arithmetic between a DataFrame and a Series matches the index of the Series against the DataFrame's columns, then broadcasts down the rows
>>> frame-series
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0
If an index value is not found in either the DataFrame's columns or the Series's index, the two objects are reindexed to form the union
>>> series2=Series(range(3),index=['b','e','f'])
>>> series2
b 0
e 1
f 2
dtype: int64
>>> frame+series2
b d e f
Utah 0.0 NaN 3.0 NaN
Ohio 3.0 NaN 6.0 NaN
Texas 6.0 NaN 9.0 NaN
Oregon 9.0 NaN 12.0 NaN
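To match on the rows and broadcast across the columns instead, you must use one of the arithmetic methods and pass axis=0
>>> series3=frame['d']
>>> frame.sub(series3,axis=0)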
b d e
Utah -1.0 0.0 1.0
Ohio -1.0 0.0 1.0
Texas -1.0 0.0 1.0
Oregon -1.0 0.0 1.0
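Function application and mapping
Many of the most common array statistics are implemented as DataFrame methods; for anything else, apply runs a function on each column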
>>> frame=DataFrame(np.random.randn(4,3),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
>>> frame
b d e
Utah -1.120701 -0.772813 -1.183221
Ohio -0.690566 0.610834 0.382371
Texas 0.287303 -0.001705 -1.055101
Oregon 1.149945 1.056177 -0.178909
>>> def f(x):
...     return Series([x.min(),x.max()],index=['min','max'])
...
>>> frame.apply(f)
b d e
min -1.120701 -0.772813 -1.183221
max 1.149945 1.056177 0.382371
To get a formatted string for each floating-point value in frame, apply a function element-wise with applymap
>>> format=lambda x:'%.2f' % x
>>> frame.applymap(format)
b d e
Utah -1.12 -0.77 -1.18
Ohio -0.69 0.61 0.38
Texas 0.29 -0.00 -1.06
Oregon 1.15 1.06 -0.18
Series has a map method for applying an element-wise function
>>> frame['e'].map(format)
Utah -1.18
Ohio 0.38
Texas -1.06
Oregon -0.18
Name: e, dtype: object
Sorting and ranking
>>> obj=Series(range(4),index=['d','a','b','c'])
>>> obj.sort_index()
a 1
b 2
c 3
d 0
dtype: int64
For DataFrame, you can sort by the index on either axis
Data is sorted in ascending order by default, but can also be sorted in descending order
>>> frame=DataFrame(np.arange(8).reshape((2,4)),index=['three','one'],columns=['d','a','b','c'])
>>> frame.sort_index(axis=1)
a b c d
three 1 2 3 0
one 5 6 7 4
>>> frame.sort_index(axis=1,ascending=False)
d c b a
three 0 3 2 1
one 4 7 6 5
To sort a Series by its values, use its sort_values method (older pandas versions called this order)
>>> obj=Series([4,7,-3,2])
>>> obj.sort_values()
2 -3
3 2
0 4
1 7
dtype: int64
When sorting, any missing values are placed at the end of the Series by default
>>> frame=DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})
>>> frame
a b
0 0 4
1 1 7
2 0 -3
3 1 2
>>> frame.sort_values(by='b')
a b
2 0 -3
3 1 2
0 0 4
1 1 7
Ranking
Ties are broken according to a rule; by default, rank assigns each group of equal values its mean rank
>>> obj=Series([7,-5,7,4,2,0,4])
>>> obj
0 7
1 -5
2 7
3 4
4 2
5 0
6 4
dtype: int64
>>> obj.rank()
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
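Values of the method option for breaking ties in rank
average  Default: assign the average rank to each entry in a group of equal values
min      Use the minimum rank for the whole group
max      Use the maximum rank for the whole group
first    Assign ranks in the order the values appear in the data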
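The four tie-breaking methods listed above can be compared side by side on the same Series used earlier in this section. This is a minimal sketch assuming a recent pandas version.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = obj.rank()                 # ties share the mean rank
lowest = obj.rank(method='min')      # ties share the lowest rank
highest = obj.rank(method='max')     # ties share the highest rank
first = obj.rank(method='first')     # ties ranked in order of appearance

print(pd.DataFrame({'value': obj, 'average': average, 'min': lowest,
                    'max': highest, 'first': first}))
```

The two 7s, for example, occupy ranks 6 and 7, so they receive 6.5, 6, 7, or 6 and 7 depending on the method.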
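Axis indexes with duplicate values
Many pandas functions (e.g. reindex) require unique labels, but uniqueness is not enforced
The is_unique attribute of an index tells you whether its labels are unique
>>> obj=Series(range(5),index=['a','a','b','b','c'])
>>> obj.index.is_unique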
False
Indexing a label that maps to multiple values returns a Series
>>> obj['a']
a 0
a 1
dtype: int64
Indexing a label that maps to a single value returns a scalar
>>> obj['c']
4
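Summarizing and computing descriptive statistics
sum computes sums; passing axis=1 sums across each row instead
NA values are excluded automatically unless the entire slice (row or column) is NA
This can be disabled with the skipna option: df.mean(axis=1,skipna=False)
describe produces multiple summary statistics in one call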
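A short sketch of these reductions on a small frame with missing values, assuming a recent pandas version. The frame df and its values are illustrative. Note that in current pandas the sum of an all-NA slice is 0 rather than NaN, so the all-NA case is shown with mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

col_sums = df.sum()                                # per column, NaN skipped
row_means = df.mean(axis=1)                        # per row, NaN skipped
row_means_strict = df.mean(axis=1, skipna=False)   # any NaN in a row gives NaN
summary = df.describe()                            # count, mean, std, min, quartiles, max

print(col_sums)
print(row_means)
print(row_means_strict)
print(summary)
```

Row c is entirely NA, so its mean is NaN in both variants. Row a has one NaN, so it is NaN only when skipna=False.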
Correlation and covariance
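The notes give no example for this heading, so the following is a minimal sketch using made-up toy values. It assumes the Series methods corr and cov and the DataFrame method corr from a recent pandas version.

```python
import pandas as pd

# Toy data: y is exactly 2*x, z decreases as x increases
df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                   'y': [2.0, 4.0, 6.0, 8.0],
                   'z': [4.0, 3.0, 2.0, 1.0]})

xy_corr = df['x'].corr(df['y'])   # Pearson correlation of two Series
xz_corr = df['x'].corr(df['z'])
xy_cov = df['x'].cov(df['y'])     # sample covariance (divides by n - 1)
corr_matrix = df.corr()           # pairwise correlations of all columns

print(xy_corr, xz_corr, xy_cov)
print(corr_matrix)
```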
Unique values, value counts, and membership
These methods extract information about the values in a one-dimensional Series
>>> obj=Series(['c','a','d','a','a','b','b','c','c'])
>>> uniques=obj.unique()
>>> uniques
array(['c', 'a', 'd', 'b'], dtype=object)
value_counts computes how often each value occurs in a Series
>>> obj.value_counts()
c 3
a 3
b 2
d 1
dtype: int64
isin performs a vectorized set-membership test and is useful for filtering a Series or DataFrame column down to a subset of values
>>> mask=obj.isin(['b','c'])
>>> mask
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
>>> obj[mask]
0 c
5 b
6 b
7 c
8 c
dtype: object
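Handling missing data
Missing data is common in most data analysis applications, and one goal of pandas is to make working with it as painless as possible
Python's built-in None value is also treated as NA
Filtering out missing data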
>>> from numpy import nan as NA
>>> data=Series([1,NA,3.5,NA,7])
>>> data
0 1.0
1 NaN
2 3.5
3 NaN
4 7.0
dtype: float64
>>> data.dropna()
0 1.0
2 3.5
4 7.0
dtype: float64
The same result using boolean indexing
>>> data[data.notnull()]
0 1.0
2 3.5
4 7.0
dtype: float64
For DataFrame, dropna by default drops any row containing a missing value
Passing how='all' drops only rows that are entirely NA
>>> data=DataFrame([[1.,6.5,3.],[1.,NA,NA],[NA,NA,NA],[NA,6.5,3.]])
>>> data
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
>>> data.dropna(how='all')
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
>>> data[4]=NA
>>> data
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
>>> data.dropna(axis=1,how='all')
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
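With time series data you may want to keep only rows that contain a certain number of observations; the thresh argument does this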
>>> df=DataFrame(np.random.randn(7,3))
>>> df
0 1 2
0 1.367974 -0.556556 0.679336
1 -0.480919 -1.535185 -0.299710
2 0.230583 0.140626 0.604209
3 0.437830 -0.467286 -0.859989
4 -0.254706 -0.227431 -0.956299
5 0.966204 -2.010860 -0.010693
6 -0.673721 1.497827 -0.257273
>>> df.ix[:4,1]=NA
>>> df.ix[:2,2]=NA
>>> df
0 1 2
0 1.367974 NaN NaN
1 -0.480919 NaN NaN
2 0.230583 NaN NaN
3 0.437830 NaN -0.859989
4 -0.254706 NaN -0.956299
5 0.966204 -2.010860 -0.010693
6 -0.673721 1.497827 -0.257273
Keep only rows that have at least 3 non-NA values
>>> df.dropna(thresh=3)
0 1 2
5 0.966204 -2.010860 -0.010693
6 -0.673721 1.497827 -0.257273
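To compare the dropna variants without random data, here is a minimal sketch on a fixed frame, assuming a recent pandas version. The frame is the same shape of example used earlier in this section.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 6.5, 3.0],
                   [1.0, np.nan, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 6.5, 3.0]])

any_na = df.dropna()                 # drop rows containing any NaN
all_na = df.dropna(how='all')        # drop rows that are entirely NaN
at_least_two = df.dropna(thresh=2)   # keep rows with >= 2 non-NaN values

print(any_na.index.tolist())
print(all_na.index.tolist())
print(at_least_two.index.tolist())
```

Row 1 has only one valid value, so it survives how='all' but not thresh=2.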
Filling in missing data
The fillna method is the main function for this
>>> df.fillna(0)
0 1 2
0 1.367974 0.000000 0.000000
1 -0.480919 0.000000 0.000000
2 0.230583 0.000000 0.000000
3 0.437830 0.000000 -0.859989
4 -0.254706 0.000000 -0.956299
5 0.966204 -2.010860 -0.010693
6 -0.673721 1.497827 -0.257273
Calling fillna with a dict lets you use a different fill value for each column
>>> df.fillna({1:0.5,3:-1})
0 1 2
0 1.367974 0.500000 NaN
1 -0.480919 0.500000 NaN
2 0.230583 0.500000 NaN
3 0.437830 0.500000 -0.859989
4 -0.254706 0.500000 -0.956299
5 0.966204 -2.010860 -0.010693
6 -0.673721 1.497827 -0.257273
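fillna returns a new object by default, but it can also modify the existing object in place
In the pandas version used here the in-place call returns a reference to the filled object; pandas 1.x and 2.x return None instead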
>>> _=df.fillna(0,inplace=True)
>>> df
0 1 2
0 1.367974 0.000000 0.000000
1 -0.480919 0.000000 0.000000
2 0.230583 0.000000 0.000000
3 0.437830 0.000000 -0.859989
4 -0.254706 0.000000 -0.956299
5 0.966204 -2.010860 -0.010693
6 -0.673721 1.497827 -0.257273
The interpolation methods available for reindex can also be used with fillna
>>> df=DataFrame(np.random.randn(6,3))
>>> df.ix[2:,1]=NA;df.ix[4:,2]=NA
>>> df
0 1 2
0 0.647866 0.891312 -0.211922
1 -1.455856 -0.629213 -1.043685
2 2.078467 NaN -0.067846
3 -0.223047 NaN 0.513800
4 0.306559 NaN NaN
5 0.404265 NaN NaN
Forward fill replaces each NA with the nearest preceding valid value in the same column
>>> df.fillna(method='ffill')
0 1 2
0 0.647866 0.891312 -0.211922
1 -1.455856 -0.629213 -1.043685
2 2.078467 -0.629213 -0.067846
3 -0.223047 -0.629213 0.513800
4 0.306559 -0.629213 0.513800
5 0.404265 -0.629213 0.513800
>>> df.fillna(method='ffill',limit=2)
0 1 2
0 0.647866 0.891312 -0.211922
1 -1.455856 -0.629213 -1.043685
2 2.078467 -0.629213 -0.067846
3 -0.223047 -0.629213 0.513800
4 0.306559 NaN 0.513800
5 0.404265 NaN 0.513800
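In recent pandas releases (2.1 and later) fillna(method=...) is deprecated in favor of the dedicated ffill and bfill methods. A minimal sketch of forward filling with and without limit, using an illustrative Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0, np.nan])

filled_all = s.ffill()          # carry the last valid value forward without limit
filled_two = s.ffill(limit=2)   # fill at most 2 consecutive NaN values per gap

print(filled_all.tolist())
print(filled_two.tolist())
```

With limit=2 the third NaN in the first gap is left unfilled.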
Hierarchical indexing
Hierarchical indexing lets you have multiple (two or more) index levels on a single axis
>>> data=Series(np.random.randn(10),index=[['a','a','a','b','b','b','c','c','d','d'],[1,2,3,1,2,3,1,2,2,3]])
>>> data
a 1 -0.521370
2 0.658209
3 0.841101
b 1 0.354237
2 -0.426983
3 0.835357
c 1 -0.246308
2 0.709859
d 2 -1.215098
3 0.400793
dtype: float64
This is the formatted output of a Series with a MultiIndex as its index
>>> data.index
MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
>>> data['b']
1 0.354237
2 -0.426983
3 0.835357
dtype: float64
Selecting from the inner level
>>> data[:,2]
a 0.658209
b -0.426983
c 0.709859
d -1.215098
dtype: float64
Hierarchical indexing plays an important role in reshaping data and in group-based operations such as building a pivot table
>>> data.unstack()
1 2 3
a -0.521370 0.658209 0.841101
b 0.354237 -0.426983 0.835357
c -0.246308 0.709859 NaN
d NaN -1.215098 0.400793
The inverse operation of unstack is stack
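A minimal sketch of the unstack and stack round trip on a fixed Series, assuming a recent pandas version. The explicit dropna call is an assumption made here so the result is the same whether or not the installed version keeps the NaN entries that unstack introduced.

```python
import pandas as pd

data = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
                 index=[['a', 'a', 'b', 'b', 'c'], [1, 2, 1, 2, 2]])

# Inner index level becomes the columns; ('c', 1) has no value, so it is NaN
wide = data.unstack()

# stack moves the columns back into the inner index level
restored = wide.stack().dropna()

print(wide)
print(restored)
```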
For DataFrame, either axis can have a hierarchical index
>>> frame=DataFrame(np.arange(12).reshape((4,3)),index=[['a','a','b','b'],[1,2,1,2]],columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])
>>> frame
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
Each level can have a name (a string or any other Python object)
>>> frame.index.names=['key1','key2']
>>> frame
           Ohio     Colorado
          Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
>>> frame.columns.names=['state','color']
>>> frame
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
With partial column indexing you can easily select groups of columns
A MultiIndex can also be created on its own and then reused
>>> pd.MultiIndex.from_arrays([['Ohio','Ohio','Colorado'],['Green','Red','Green']],names=['state','color'])
Reordering levels
swaplevel takes two level numbers or names and returns a new object with those levels interchanged
>>> frame.swaplevel('key1','key2')
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
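sortlevel sorts the data by the values in a single level (it was removed in later pandas versions; sort_index(level=...) replaces it)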
>>> frame
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
>>> frame.sortlevel(1)
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11
>>> frame.swaplevel(0,1).sortlevel(0)
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
b 6 7 8
2 a 3 4 5
b 9 10 11
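Data selection on a hierarchically indexed object performs much better if the index is lexicographically sorted starting from the outermost level, that is, the result of calling sortlevel(0) or sort_index()
Summary statistics by level
Many descriptive and summary statistics on DataFrame and Series have a level option that specifies the level on a given axis to aggregate by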
>>> frame.sum(level='key2')
state Ohio Colorado
color Green Red Green
key2
1 6 8 10
2 12 14 16
>>> frame.sum(level='color',axis=1)
color Green Red
key1 key2
a 1 2 1
2 8 4
b 1 14 7
2 20 10
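The level argument of sum used above was removed in pandas 2.0. A minimal sketch of the same two aggregations with groupby, assuming a recent pandas version; the column case transposes first as one possible way to group column levels.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum rows that share the same key2 value
by_key2 = frame.groupby(level='key2').sum()

# Sum columns that share the same color: transpose, group, transpose back
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```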
Using a DataFrame's columns
You may want to use one or more columns of a DataFrame as the row index, or move the row index into the DataFrame's columns
>>> frame=DataFrame({'a':range(7),'b':range(7,0,-1),'c':['one','one','one','two','two','two','two'],'d':[0,1,2,0,1,2,3]})
>>> frame
a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
The set_index() function converts one or more columns into the row index and returns a new DataFrame
>>> frame2=frame.set_index(['c','d'])
>>> frame2
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
By default those columns are removed from the DataFrame, but you can keep them
>>> frame.set_index(['c','d'],drop=False)
a b c d
c d
one 0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
two 0 3 4 two 0
1 4 3 two 1
2 5 2 two 2
3 6 1 two 3
reset_index does the opposite: the levels of the hierarchical index are moved back into the columns
>>> frame2.reset_index()
c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1
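A minimal sketch of the set_index and reset_index round trip on a smaller frame, assuming a recent pandas version. It also shows a tuple lookup on the resulting two-level index.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(4),
                      'c': ['one', 'one', 'two', 'two'],
                      'd': [0, 1, 0, 1]})

# Move columns c and d into a two-level row index
indexed = frame.set_index(['c', 'd'])

# Look up a single value by the (c, d) label pair
value = indexed.loc[('two', 1), 'a']

# Move the index levels back into ordinary columns
restored = indexed.reset_index()

print(indexed)
print(value)
print(restored)
```

After reset_index the former index levels come first in the column order, so restored has columns c, d, a.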