pandas Index Objects and Reindexing
1. Index
In pandas, the Index object stores axis labels and other metadata. An Index is immutable: users cannot modify its elements in place.
In [71]: import numpy as np
In [72]: import pandas as pd
In [73]: obj = pd.Series(range(3),index = ['a','b','c'])
In [74]: index = obj.index
In [75]: index
Out[75]: Index(['a', 'b', 'c'], dtype='object')
In [76]: index[1:]
Out[76]: Index(['b', 'c'], dtype='object')
In [77]: index[1] = 'f' # TypeError
In [8]: index.size
Out[8]: 3
In [9]: index.shape
Out[9]: (3,)
In [10]: index.ndim
Out[10]: 1
In [11]: index.dtype
Out[11]: dtype('O')
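The immutability shown in the session above can be reproduced as a standalone script. This is a minimal sketch: the names `mutated` and `new_index` are illustrative, and `Index.delete` followed by `Index.insert` is just one way to obtain a changed copy, not the only one.

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index

# Item assignment is rejected because an Index is immutable.
try:
    index[1] = "f"
    mutated = True
except TypeError:
    mutated = False

# To "change" a label, build a new Index instead:
# drop position 1, then insert "f" at that position.
new_index = index.delete(1).insert(1, "f")

print(mutated)          # False
print(list(new_index))  # ['a', 'f', 'c']
print(list(index))      # ['a', 'b', 'c'] (the original is untouched)
```

Both `delete` and `insert` return new Index objects, so the original `index` stays safe to share.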
Because an Index is immutable, it is safer to share one Index among several data structures:
In [78]: labels = pd.Index(np.arange(3))
In [79]: labels
Out[79]: Int64Index([0, 1, 2], dtype='int64')
In [80]: obj2 = pd.Series([2,3.5,0], index=labels)
In [81]: obj2
Out[81]:
0 2.0
1 3.5
2 0.0
dtype: float64
In [82]: obj2.index is labels
Out[82]: True
An Index is also essentially a container, so Python's in operator works on it:
In [84]: f2
Out[84]:
key    year     state  pop  debt
order
a      2000   beijing  1.5   NaN
b      2001   beijing  1.7   NaN
c      2002   beijing  3.6   1.0
d      2001  shanghai  2.4   2.0
e      2002  shanghai  2.9   NaN
f      2003  shanghai  3.2   3.0
In [86]: 'c' in f2.index
Out[86]: True
In [88]: 'pop' in f2.columns
Out[88]: True
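One consequence of the membership checks above is worth a quick sketch: on a Series, `in` tests the index labels, not the stored values. The Series `s` and the variable names below are made up for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

label_hit = "a" in s          # True: "a" is an index label
value_as_label = 10 in s      # False: 10 is a value, not a label
value_hit = 10 in s.values    # True: search the underlying values explicitly

print(label_hit, value_as_label, value_hit)
```

To test for a value rather than a label, check against `s.values` as above.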
Most importantly, a pandas Index can contain duplicate labels:
In [89]: dup_labels = pd.Index(['foo','foo','bar','bar'])
In [90]: dup_labels
Out[90]: Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
So, can a DataFrame have duplicate columns or a duplicate index?
Yes, it can, but try to avoid doing so:
In [91]: f2.index = ['a']*6
In [92]: f2
Out[92]:
key  year     state  pop  debt
a    2000   beijing  1.5   NaN
a    2001   beijing  1.7   NaN
a    2002   beijing  3.6   1.0
a    2001  shanghai  2.4   2.0
a    2002  shanghai  2.9   NaN
a    2003  shanghai  3.2   3.0
In [93]: f2.loc['a']
Out[93]:
key  year     state  pop  debt
a    2000   beijing  1.5   NaN
a    2001   beijing  1.7   NaN
a    2002   beijing  3.6   1.0
a    2001  shanghai  2.4   2.0
a    2002  shanghai  2.9   NaN
a    2003  shanghai  3.2   3.0
In [94]: f2.columns = ['year']*4
In [95]: f2
Out[95]:
   year      year  year  year
a  2000   beijing   1.5   NaN
a  2001   beijing   1.7   NaN
a  2002   beijing   3.6   1.0
a  2001  shanghai   2.4   2.0
a  2002  shanghai   2.9   NaN
a  2003  shanghai   3.2   3.0
In [96]: f2.index.is_unique # use this attribute to check whether the index labels are unique
Out[96]: False
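One reason to avoid duplicate labels is that label-based selection changes its return type depending on how many rows match. The sketch below uses a small made-up Series `s`; `Index.duplicated` is one way to locate and drop the repeats.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "a", "b"])

unique_hit = s.loc["b"]   # one match: a scalar
dup_hit = s.loc["a"]      # two matches: a Series

# duplicated() flags every repeat after the first occurrence of a label.
dup_mask = s.index.duplicated(keep="first")
deduped = s[~dup_mask]    # keep only the first row for each label

print(type(dup_hit).__name__)  # Series
print(list(dup_mask))          # [False, True, False]
print(list(deduped.index))     # ['a', 'b']
```

Code that expects a scalar from `s.loc[label]` can break silently once a label is duplicated, so checking `is_unique` up front is a cheap safeguard.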
Index objects also support the set operations intersection, union, difference, and symmetric difference (XOR), much like Python's built-in set.
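A minimal sketch of these set operations, using two made-up Index objects named `left` and `right`:

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

common = left.intersection(right)                   # labels in both
combined = left.union(right)                        # labels in either
only_left = left.difference(right)                  # labels in left but not in right
either_not_both = left.symmetric_difference(right)  # labels in exactly one

print(list(common))           # ['b', 'c']
print(list(combined))         # ['a', 'b', 'c', 'd']
print(list(only_left))        # ['a']
print(list(either_not_both))  # ['a', 'd']
```

Each method returns a new Index; neither operand is modified.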
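2. Reindexing
The reindex method conforms a pandas object to a new index. It does not modify the object in place: it returns a new object whose data is rearranged from the original to match the new index.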
In [96]: obj=pd.Series([4.5,7.2,-5.3,3.6],index = ['d','b','a','c'])
In [97]: obj
Out[97]:
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
reindex arranges the data according to the new index, and any label that does not exist in the original introduces a missing value:
In [99]: obj2 = obj.reindex(list('abcde'))
In [100]: obj2
Out[100]:
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
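As a small sketch of the behavior above, the script below confirms that the original Series is left unchanged and shows the `fill_value` parameter of reindex, which substitutes a constant for labels that are missing. The name `filled` and the choice of 0.0 are illustrative only.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

# reindex returns a new Series; missing label "e" gets 0.0 instead of NaN.
filled = obj.reindex(list("abcde"), fill_value=0.0)

print(list(obj.index))     # ['d', 'b', 'a', 'c'] (original order preserved)
print(list(filled.index))  # ['a', 'b', 'c', 'd', 'e']
print(filled["e"])         # 0.0
```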
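You can also specify how missing values are filled with the method parameter: ffill fills forward (each gap takes the previous valid value), and bfill fills backward (each gap takes the next valid value):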
In [101]: obj3 = pd.Series(['blue','purple','yellow'],index = [0,2,4])
In [102]: obj3
Out[102]:
0 blue
2 purple
4 yellow
dtype: object
In [103]: obj3.reindex(range(6),method='ffill')
Out[103]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
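For comparison with the ffill result above, here is a sketch of the same reindex with `method='bfill'`. The variable name `back` is illustrative.

```python
import pandas as pd

obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Backward fill: each new label takes the value of the next existing label.
back = obj3.reindex(range(6), method="bfill")

print(back.loc[1])           # purple
print(back.loc[3])           # yellow
print(pd.isna(back.loc[5]))  # True: there is no later label to fill from
```

Label 5 stays missing because backward filling needs a later existing label, and 4 is the last one.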
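For a two-dimensional object such as a DataFrame, calling reindex with only a list argument reindexes the rows by default. Use the columns keyword argument to reindex the columns instead: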
In [104]: f = pd.DataFrame(np.arange(9).reshape((3,3)),index=list('acd'),columns=['beijing','shanghai','guangzhou'])
In [105]: f
Out[105]:
   beijing  shanghai  guangzhou
a        0         1          2
c        3         4          5
d        6         7          8
In [106]: f2 = f.reindex(list('abcd'))
In [107]: f2
Out[107]:
   beijing  shanghai  guangzhou
a      0.0       1.0        2.0
b      NaN       NaN        NaN
c      3.0       4.0        5.0
d      6.0       7.0        8.0
In [112]: f3 = f.reindex(columns=['beijing','shanghai','xian','guangzhou'])
In [113]: f3
Out[113]:
   beijing  shanghai  xian  guangzhou
a        0         1   NaN          2
c        3         4   NaN          5
d        6         7   NaN          8
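Rows and columns can also be reindexed in a single call by passing both the index and columns keywords. The sketch below reuses the frame `f` from the session; the name `both` and the chosen labels are illustrative.

```python
import numpy as np
import pandas as pd

f = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=list("acd"),
    columns=["beijing", "shanghai", "guangzhou"],
)

# New row "b" and new column "xian" are filled with NaN;
# "shanghai" is dropped because it is not in the requested columns.
both = f.reindex(index=list("abcd"), columns=["beijing", "xian", "guangzhou"])

print(both.shape)                   # (4, 3)
print(both.loc["a", "beijing"])     # 0.0
print("shanghai" in both.columns)   # False
```

Note that reindex also acts as a selector here: labels left out of the new index or columns are removed from the result.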