Pandas入门学习笔记2
2 基本功能
只是一些基本功能,更深奥的内容用到再摸索。
2.1 重新索引
reindex是pandas的重要方法,举个例子:
In [101]: obj = Series([4,7,-5,3.4],index=['c','a','b','d'])
In [102]: obj
Out[102]:
c 4.0
a 7.0
b -5.0
d 3.4
dtype: float64
In [103]: obj2 = obj.reindex(['a','b','c','d','e'])
In [104]: obj2
Out[104]:
a 7.0
b -5.0
c 4.0
d 3.4
e NaN
dtype: float64
# 缺失值可以自定义
In [105]: obj.reindex(['a','b','c','d','e'],fill_value=0)
Out[105]:
a 7.0
b -5.0
c 4.0
d 3.4
e 0.0 #缺失值填充
dtype: float64
reindex的插值method选项:
参数 | 说明 |
---|---|
ffill或pad | 前向填充值 |
bfill或backfill | 后向填充值 |
In [106]: obj3 = Series(['blue','purple','yellow'],index=[0,2,4])
# 前向填充
In [107]: obj3.reindex(range(6),method='ffill')
Out[107]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
# 后向填充
In [109]: obj3.reindex(range(6),method='bfill')
Out[109]:
0 blue
1 purple
2 purple
3 yellow
4 yellow
5 NaN
dtype: object
针对DataFrame,可以修改行、列或两个都进行重新索引。
In [111]: frame = DataFrame(np.arange(9).reshape(3,3), index=['a','b','c'],colmns=['Ohio','Texas','California'])
In [112]: frame
Out[112]:
Ohio Texas California
a 0 1 2
b 3 4 5
c 6 7 8
In [113]: frame2 = frame.reindex(['a','b','c','d']) # 默认行索引
In [115]: frame2
Out[115]:
Ohio Texas California
a 0.0 1.0 2.0
b 3.0 4.0 5.0
c 6.0 7.0 8.0
d NaN NaN NaN
In [116]: states = ['Texas','Utah','California']
In [117]: frame.reindex(columns=states) #指定列索引
Out[117]:
Texas Utah California
a 1 NaN 2
b 4 NaN 5
c 7 NaN 8
# 对行、列都进行重新索引,
# 并且进行插值,但是只能在0轴进行,即按行应用。
In [118]: frame.reindex(index=['a','b','c','d'],method='ffill',columns=states)
Out[118]:
Texas Utah California
a 1 NaN 2
b 4 NaN 5
c 7 NaN 8
d 7 NaN 8
# 用ix更简洁。
In [119]: frame.ix[['a','b','c','d'],states]
Out[119]:
Texas Utah California
a 1.0 NaN 2.0
b 4.0 NaN 5.0
c 7.0 NaN 8.0
d NaN NaN NaN
reindex函数的参数
2.2 丢弃指定轴上的项
丢弃项,只要一个索引或列表即可。drop方法会返回一个删除了指定值的新对象。
In [120]: obj = Series(np.arange(5.),index=['a','b','c','d','e'])
In [121]: new_obj = obj.drop('c')
In [122]: new_obj
Out[122]:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
In [124]: obj.drop(['d','c'])
Out[124]:
a 0.0
b 1.0
e 4.0
dtype: float64
针对DataFrame,可以删除任意轴上的索引值。
In [125]: data = DataFrame(np.arange(16).reshape(4,4),index=['Ohio','Colorado',
...: 'Utah','New York'],columns=['one','two','three','four'])
In [126]: data
Out[126]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [127]: data.drop(['Colorado','Ohio'])
Out[127]:
one two three four
Utah 8 9 10 11
New York 12 13 14 15
In [128]: data.drop(['two',],axis=1)
Out[128]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
2.3 索引、选取和过滤
In [129]: obj = Series(np.arange(4.),index=['a','b','c','d'])
In [130]: obj['a'] #使用index索引
Out[130]: 0.0
In [131]: obj[0] #使用序号来索引
Out[131]: 0.0
In [132]: obj[1]
Out[132]: 1.0
In [133]: obj
Out[133]:
a 0.0
b 1.0
c 2.0
d 3.0
dtype: float64
In [134]: obj[1:2] # 使用序号切片
Out[134]:
b 1.0
dtype: float64
In [135]: obj[1:3]
Out[135]:
b 1.0
c 2.0
dtype: float64
In [136]: obj[obj<2] # 使用值判断
Out[136]:
a 0.0
b 1.0
dtype: float64
In [137]: obj['b':'c'] # 使用索引切片,注意是两端包含的。
Out[137]:
b 1.0
c 2.0
dtype: float64
In [138]: obj['b':'c'] = 100 # 赋值
In [139]: obj
Out[139]:
a 0.0
b 100.0
c 100.0
d 3.0
dtype: float64
针对DataFrame,索引就是获取一个或多个列。
使用列名:获取列
使用序号或bool值:获取行
In [140]: data
Out[140]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [141]:
In [141]: data['two'] # 获取第2列
Out[141]:
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
In [142]: data[['two','one']] # 按要求获取列
Out[142]:
two one
Ohio 1 0
Colorado 5 4
Utah 9 8
New York 13 12
In [143]: data[:2] # 获取前面两行,使用数字序号获取的是行
Out[143]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
In [144]: data[data['three']>5] # 获取第三列大于5的行
Out[144]:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
DataFrame在语法上与ndarray是比较相似的。
In [146]: data < 5
Out[146]:
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
In [147]: data[data<5] = 0
In [148]: data
Out[148]:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
索引字段ix:
可以通过Numpy的标记法以及轴标签从DataFrame中选取行和列的子集。
此外,ix得表述方式很简单
In [150]: data.ix['Colorado',['two','three']]
Out[150]:
two 5
three 6
Name: Colorado, dtype: int32
In [151]: data.ix[['Colorado','Utah'],[3,0,1]]
Out[151]:
four one two
Colorado 7 0 5
Utah 11 8 9
In [152]: data.ix[2]
Out[152]:
one 8
two 9
three 10
four 11
Name: Utah, dtype: int32
In [153]: data.ix[:'Utah','two']
Out[153]:
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int32
DataFrame的索引选项
2.4 算术运算和数据对齐
算术运算结果就是不同索引之间的并集,不存在的值之间运算结果用NaN表示。
In [4]: s1 = Series([-2,-3,5,-1],index=list('abcd'))
In [5]: s2 = Series([9,2,5,1,5],index=list('badef'))
In [6]: s1 + s2
Out[6]:
a 0.0
b 6.0
c NaN
d 4.0
e NaN
f NaN
dtype: float64
DataFrame也是一样,会同时发生在行和列上。
在算术方法中填充值
In [7]: df1 = DataFrame(np.arange(12.).reshape(3,4),columns=list('abcd'))
In [8]: df2 = DataFrame(np.arange(20.).reshape(4,5),columns=list('abcde'))
In [9]: df1
Out[9]:
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
In [10]: df2
Out[10]:
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 6.0 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
In [11]: df1 + df2 # 不填充值
Out[11]:
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 11.0 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
In [12]: df1.add(df2, fill_value=0) # 填充0
Out[12]:
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 11.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
In [13]: df1.reindex(columns=df2.columns, method='ffill')
Out[13]:
a b c d e
0 0.0 1.0 2.0 3.0 NaN
1 4.0 5.0 6.0 7.0 NaN
2 8.0 9.0 10.0 11.0 NaN
In [14]: df1.reindex(columns=df2.columns, fill_value=0) # 重新索引的时候也可以填充。
Out[14]:
a b c d e
0 0.0 1.0 2.0 3.0 0
1 4.0 5.0 6.0 7.0 0
2 8.0 9.0 10.0 11.0 0
可用的算术算法有:
- add:加法,
- sub:减法,
- div:除法
- mul:乘法
DataFrame和Series之间的运算
采用广播的方式,就是会按照一定的规律作用到整个DataFrame之中。
In [15]: frame = DataFrame(np.arange(12.).reshape(4,3),columns=list('bde'),index
...: =['Utah','Ohio','Texas','Oregon'])
In [16]: series = frame.ix[0] # 获取第一行
In [17]: frame
Out[17]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [18]: series
Out[18]:
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
In [19]: frame - series # 自动广播到其他行
Out[19]:
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0
In [20]: series2 = Series(np.arange(3),index=list('bef'))
In [21]: series2
Out[21]:
b 0
e 1
f 2
dtype: int64
In [22]: frame + series2 # 没有的列使用NaN
Out[22]:
b d e f
Utah 0.0 NaN 3.0 NaN
Ohio 3.0 NaN 6.0 NaN
Texas 6.0 NaN 9.0 NaN
Oregon 9.0 NaN 12.0 NaN
In [23]: series3 = frame['d'] # 获取列
In [24]: frame.sub(series3, axis=0) #列相减,指定axis
Out[24]:
b d e
Utah -1.0 0.0 1.0
Ohio -1.0 0.0 1.0
Texas -1.0 0.0 1.0
Oregon -1.0 0.0 1.0
2.5 函数应用和映射
Numpy中的通用函数(ufunc)也可以作用于pandas的Series和DataFrame对象。
In [31]: np.abs(frame)
Out[31]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [32]: np.max(frame)
Out[32]:
b 9.0
d 10.0
e 11.0
dtype: float64
DataFrame有一个apply方法,可以接受自定义函数。
In [33]: f = lambda x: np.max(x) - np.min(x)
In [34]: frame.apply(f)
Out[34]:
b 9.0
d 9.0
e 9.0
dtype: float64
In [35]: frame
Out[35]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [36]: f = lambda x : np.exp2(x)
In [37]: frame.apply(f)
Out[37]:
b d e
Utah 1.0 2.0 4.0
Ohio 8.0 16.0 32.0
Texas 64.0 128.0 256.0
Oregon 512.0 1024.0 2048.0
许多常用的方法,DataFrame已经实现,不需要使用apply方法自定义。
In [38]: f = lambda x: Series([np.max(x),np.min(x)],index=['max','min'])
In [39]: frame.apply(f)
Out[39]:
b d e
max 9.0 10.0 11.0
min 0.0 1.0 2.0
# 如果f函数是一个元素级别的函数,就使用applymap
In [40]: f = lambda x : '%.2f' % x
In [41]: frame.applymap(f)
Out[41]:
b d e
Utah 0.00 1.00 2.00
Ohio 3.00 4.00 5.00
Texas 6.00 7.00 8.00
Oregon 9.00 10.00 11.00
# 同样对于Series就使用map,与DataFrame的applymap是对应的。
In [43]: series
Out[43]:
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
In [44]: series.map(f)
Out[44]:
b 0.00
d 1.00
e 2.00
Name: Utah, dtype: object
2.6 排序与排名
排序
排序可以使用:
- sort_index方法:按索引排序,
- sort_value方法(order方法):按值排序,使用by参数
In [45]: obj = Series(range(4),index=list('dbca'))
In [46]: obj
Out[46]:
d 0
b 1
c 2
a 3
dtype: int64
In [47]: obj.sort_index()
Out[47]:
a 3
b 1
c 2
d 0
dtype: int64
In [50]: frame
Out[50]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [51]: frame.sort_index()
Out[51]:
b d e
Ohio 3.0 4.0 5.0
Oregon 9.0 10.0 11.0
Texas 6.0 7.0 8.0
Utah 0.0 1.0 2.0
In [52]: frame.sort_index(axis=0)
Out[52]:
b d e
Ohio 3.0 4.0 5.0
Oregon 9.0 10.0 11.0
Texas 6.0 7.0 8.0
Utah 0.0 1.0 2.0
In [53]: frame.sort_index(axis=1)
Out[53]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [54]: frame.sort_index(axis=1, ascending=False) # 倒序
Out[54]:
e d b
Utah 2.0 1.0 0.0
Ohio 5.0 4.0 3.0
Texas 8.0 7.0 6.0
Oregon 11.0 10.0 9.0
按值排序:
In [55]: s1 = Series([3,-2,-7,4])
In [56]: s1.order()
/Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: order is deprecated, use sort_values(...)
#!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
Out[56]:
2 -7
1 -2
0 3
3 4
dtype: int64
In [58]: frame.sort_index(by='b')
/Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
#!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
Out[58]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [59]: frame.sort_values(by='b')
Out[59]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
排名
rank方法,默认情况下为“相同的值分配一个平均排名”:
In [60]: s1 = Series([7,-5,7,4,2,0,4])
In [61]: s1.rank() # 可见0和2索引对应的值都是7,排名分别为6,7;因此取平均值6.5
Out[61]:
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
当然,有很多方法可以“打破”这种平级关系。
In [62]: s1.rank(method='first') # 按原始数据出现顺序排序
Out[62]:
0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64
In [63]: s1.rank(ascending=False, method='max') # 倒序,平级处理使用最大排名
Out[63]:
0 2.0
1 7.0
2 2.0
3 4.0
4 5.0
5 6.0
6 4.0
dtype: float64
DataFrame排名可以使用axis按行或按列进行排名。
2.7 带有重复值的轴索引
目前所有的例子中索引都是唯一的,而且如pandas中的许多函数(reindex)就要求索引唯一。
但是也不是强制的。
In [64]: obj = Series(range(5),index=list('aabbc'))
In [65]: obj
Out[65]:
a 0
a 1
b 2
b 3
c 4
dtype: int64
In [67]: obj.index.is_unique
Out[67]: False
In [68]: obj['a']
Out[68]:
a 0
a 1
dtype: int64
In [69]: obj['c']
Out[69]: 4
对于DataFrame,也是如此。
In [70]: df =DataFrame(np.random.randn(4,3),index=list('aabb'))
In [79]: df.ix['a']
Out[79]:
0 1 2
a 1.099692 -0.491098 0.625690
a -0.816857 1.025018 0.558494
In [80]: df.reindex(['b','a']) # 不能重新索引有重复索引的DataFrame
...
ValueError: cannot reindex from a duplicate axis
待续。。。