pandas arithmetic and functions
1. Arithmetic and broadcasting
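When two Series or DataFrame objects are combined arithmetically, the index of the result is the union of the two operands' indexes. Any label present in only one operand yields a missing value (NaN) in the result, and that NaN propagates into later operations. This is analogous to an outer join in a database. The sessions below assume `import numpy as np` and `import pandas as pd`.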
In [58]: s1 = pd.Series([4.2,2.6, 5.4, -1.9], index=list('acde'))
In [60]: s2 = pd.Series([-2.3, 1.2, 5.6, 7.2, 3.4], index= list('acefg'))
In [61]: s1
Out[61]:
a 4.2
c 2.6
d 5.4
e -1.9
dtype: float64
In [62]: s2
Out[62]:
a -2.3
c 1.2
e 5.6
f 7.2
g 3.4
dtype: float64
In [63]: s1+s2
Out[63]:
a 1.9
c 3.8
d NaN
e 3.7
f NaN
g NaN
dtype: float64
In [64]: s1-s2
Out[64]:
a 6.5
c 1.4
d NaN
e -7.5
f NaN
g NaN
dtype: float64
In [65]: s1* s2
Out[65]:
a -9.66
c 3.12
d NaN
e -10.64
f NaN
g NaN
dtype: float64
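The alignment step behind these results can be made explicit. The following is a minimal sketch, not part of the original session: it uses `Series.align` with `join='outer'` to reindex both operands to the union of their labels before adding them. The names `left`, `right`, and `total` are illustrative.

```python
import pandas as pd

s1 = pd.Series([4.2, 2.6, 5.4, -1.9], index=list('acde'))
s2 = pd.Series([-2.3, 1.2, 5.6, 7.2, 3.4], index=list('acefg'))

# Reindex both operands to the union of their labels (an outer join).
left, right = s1.align(s2, join='outer')

# A label missing from one operand is NaN there, so the sum is NaN as well.
total = left + right

print(list(total.index))        # union of both indexes
print(total.equals(s1 + s2))    # the + operator performs the same alignment
```

Spelling the alignment out this way is rarely needed in practice, but it shows where the NaN entries come from.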
In [66]: df1 = pd.DataFrame(np.arange(9).reshape(3,3),columns=list('bcd'),index=['one','two','three'])
In [67]: df2 = pd.DataFrame(np.arange(12).reshape(4,3),columns=list('bde'),index=['two','three','five','six'])
In [68]: df1
Out[68]:
       b  c  d
one    0  1  2
two    3  4  5
three  6  7  8
In [69]: df2
Out[69]:
       b   d   e
two    0   1   2
three  3   4   5
five   6   7   8
six    9  10  11
In [70]: df1 + df2
Out[70]:
         b   c     d   e
five   NaN NaN   NaN NaN
one    NaN NaN   NaN NaN
six    NaN NaN   NaN NaN
three  9.0 NaN  12.0 NaN
two    3.0 NaN   6.0 NaN
In practice, to keep NaN from affecting later steps, it is often useful to supply a fill value:
In [71]: df1.add(df2, fill_value=0)
Out[71]:
         b    c     d     e
five   6.0  NaN   7.0   8.0
one    0.0  1.0   2.0   NaN
six    9.0  NaN  10.0  11.0
three  9.0  7.0  12.0   5.0
two    3.0  4.0   6.0   2.0
In [74]: df1.reindex(columns=df2.columns, fill_value=0)  # a fill value can also be given when reindexing
Out[74]:
       b  d  e
one    0  2  0
two    3  5  0
three  6  8  0
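Note that the fill value is substituted only where one operand has a value and the other does not: the missing side is treated as the given value. It does not replace every NaN in the final result; a cell that is missing from both operands remains NaN.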
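The distinction can be checked directly. The sketch below rebuilds `df1` and `df2` from the session; the names `partial`, `still_missing`, and `complete` are illustrative. The final `fillna(0)` is one possible way to remove the remaining NaN values and is an addition to the original text.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape(3, 3),
                   columns=list('bcd'), index=['one', 'two', 'three'])
df2 = pd.DataFrame(np.arange(12).reshape(4, 3),
                   columns=list('bde'), index=['two', 'three', 'five', 'six'])

# fill_value=0 stands in for a value that is missing from exactly one operand.
partial = df1.add(df2, fill_value=0)

# Cells missing from both frames (for example row 'five', column 'c') stay NaN.
still_missing = int(partial.isna().sum().sum())
print(still_missing)

# To remove every NaN, post-process the result explicitly.
complete = partial.fillna(0)
print(complete.isna().any().any())
```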
The arithmetic methods of this kind are:
- add: addition
- sub: subtraction
- div: division
- floordiv: floor division
- mul: multiplication
- pow: exponentiation
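Each method in the list corresponds to an operator. The following sketch checks that correspondence on a small frame. The reversed variant `rdiv`, which is not mentioned above, is included as one example of the r-prefixed methods that pandas also provides; the `checks` dictionary is purely illustrative.

```python
import numpy as np
import pandas as pd

# Values start at 1 to avoid division by zero in the reversed division.
df = pd.DataFrame(np.arange(1, 7).reshape(2, 3), columns=list('bde'))

checks = {
    'add': df.add(2).equals(df + 2),
    'sub': df.sub(2).equals(df - 2),
    'mul': df.mul(2).equals(df * 2),
    'div': df.div(2).equals(df / 2),
    'floordiv': df.floordiv(2).equals(df // 2),
    'pow': df.pow(2).equals(df ** 2),
    # rdiv swaps the operands: df.rdiv(1) computes 1 / df.
    'rdiv': df.rdiv(1).equals(1 / df),
}
print(checks)
```

The method form is mainly useful when extra arguments such as `fill_value` or `axis` are needed, which the operators cannot accept.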
A DataFrame can also be combined with a Series. As with NumPy arrays of different dimensions, the operation is broadcast:
In [80]: df = pd.DataFrame(np.arange(12).reshape(4,3),columns=list('bde'),index=['one','two','three','four'])
In [81]: s = df.iloc[0]  # take the first row of df as a Series
In [82]: df
Out[82]:
       b   d   e
one    0   1   2
two    3   4   5
three  6   7   8
four   9  10  11
In [83]: s
Out[83]:
b 0
d 1
e 2
Name: one, dtype: int32
In [84]: df - s  # s is matched on the columns and subtracted from every row
Out[84]:
       b  d  e
one    0  0  0
two    3  3  3
three  6  6  6
four   9  9  9
In [85]: s2 = pd.Series(range(3), index=list('bef'))
In [86]: df + s2  # column labels found on only one side produce missing values
Out[86]:
         b   d     e   f
one    0.0 NaN   3.0 NaN
two    3.0 NaN   6.0 NaN
three  6.0 NaN   9.0 NaN
four   9.0 NaN  12.0 NaN
In [87]: s3 = df['d']  # take one column of df
In [88]: s3
Out[88]:
one 1
two 4
three 7
four 10
Name: d, dtype: int32
In [89]: df.sub(s3, axis='index')  # match on the row index and broadcast across the columns
Out[89]:
       b  d  e
one   -1  0  1
two   -1  0  1
three -1  0  1
four  -1  0  1
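In the last example, passing axis='index' (or axis=0) matches the Series against the row index, so the operation is broadcast in the other direction, across the columns.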
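The comparison with NumPy can be written out side by side. This is a sketch under the assumption that the labels already line up, so that only the broadcasting direction differs; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(4, 3)
df = pd.DataFrame(arr, columns=list('bde'),
                  index=['one', 'two', 'three', 'four'])

# Default: the Series labels match the columns; the operation repeats down the rows.
by_row = df - df.iloc[0]
by_row_np = arr - arr[0]           # shape (3,) broadcasts against (4, 3)

# axis='index': the Series labels match the row index; it repeats across the columns.
by_col = df.sub(df['d'], axis='index')
by_col_np = arr - arr[:, [1]]      # shape (4, 1) broadcasts against (4, 3)

print(np.array_equal(by_row.to_numpy(), by_row_np))
print(np.array_equal(by_col.to_numpy(), by_col_np))
```

Unlike NumPy, pandas aligns on labels first, which is why mismatched labels produce NaN rather than a shape error.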
2. Functions and mapping
Some NumPy universal functions (ufuncs) also work on pandas objects:
In [91]: df = pd.DataFrame(np.random.randn(4,3), columns=list('bde'),index = ['one','two','three','four'])
In [92]: df
Out[92]:
              b         d         e
one   -0.522310  0.636599  0.992393
two    0.572624 -0.451550 -1.935332
three  0.021926  0.056706 -0.267661
four  -2.718122 -0.740140 -1.565448
In [93]: np.abs(df)
Out[93]:
              b         d         e
one    0.522310  0.636599  0.992393
two    0.572624  0.451550  1.935332
three  0.021926  0.056706  0.267661
four   2.718122  0.740140  1.565448
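As the output suggests, the ufunc returns a pandas object with the labels intact rather than a bare array. A small sketch with made-up values (not the random data from the session) illustrates this:

```python
import numpy as np
import pandas as pd

# Illustrative values; the session above uses random data instead.
df = pd.DataFrame([[-0.5, 0.6], [0.5, -0.4]],
                  columns=list('bd'), index=['one', 'two'])

absolute = np.abs(df)               # NumPy ufunc applied element by element
print(type(absolute).__name__)      # the result is still a DataFrame
print(absolute.equals(df.abs()))    # same values as the DataFrame.abs method
```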
You can also define your own function and use the apply method provided by pandas to apply it to each column:
In [94]: f = lambda x: x.max() - x.min()
In [95]: df.apply(f)
Out[95]:
b 3.290745
d 1.376740
e 2.927725
dtype: float64
To apply f to each row instead, set axis='columns'. The applied function can also return a Series, in which case the final result is a DataFrame:
In [96]: df.apply(f, axis='columns')
Out[96]:
one 1.514703
two 2.507956
three 0.324367
four 1.977981
dtype: float64
In [97]: def f2(x):
    ...:     return pd.Series([x.min(), x.max()], index=['min', 'max'])
In [98]: df.apply(f2)
Out[98]:
            b         d         e
min -2.718122 -0.740140 -1.935332
max  0.572624  0.636599  0.992393
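For simple reductions like these, the same results can usually be obtained without apply. The sketch below is an alternative, not the approach used in the session: it relies on the built-in `max`, `min`, and `agg` methods, with the session's printed values hard-coded so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Values copied from the session output so the example is reproducible.
df = pd.DataFrame(
    [[-0.522310, 0.636599, 0.992393],
     [0.572624, -0.451550, -1.935332],
     [0.021926, 0.056706, -0.267661],
     [-2.718122, -0.740140, -1.565448]],
    columns=list('bde'), index=['one', 'two', 'three', 'four'])

# Column-wise range, equivalent to df.apply(f).
col_range = df.max() - df.min()

# Row-wise range, equivalent to df.apply(f, axis='columns').
row_range = df.max(axis='columns') - df.min(axis='columns')

# Several reductions at once, one result row per function name.
summary = df.agg(['min', 'max'])
print(summary)
```

apply remains the general tool when the per-column or per-row logic has no built-in equivalent.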
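There are also finer-grained counterparts of apply: applymap for a DataFrame and map for a Series. They operate on each element individually rather than on a whole row or column, as the following example shows. (In pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map.)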
In [99]: f3 = lambda x: '%.2f' % x
In [100]: df.applymap(f3)
Out[100]:
           b      d      e
one    -0.52   0.64   0.99
two     0.57  -0.45  -1.94
three   0.02   0.06  -0.27
four   -2.72  -0.74  -1.57
In [101]: df['d'].map(f3)  # select column d, which is a Series
Out[101]:
one 0.64
two -0.45
three 0.06
four -0.74
Name: d, dtype: object
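Because `DataFrame.applymap` is deprecated in recent pandas releases (2.1 and later), element-wise formatting of a whole frame can instead be built from `Series.map` alone. The following is a minimal sketch of that idea with two hard-coded rows; `two_decimals` is a hypothetical helper standing in for `f3`.

```python
import pandas as pd

df = pd.DataFrame(
    [[-0.522310, 0.636599], [0.572624, -0.451550]],
    columns=list('bd'), index=['one', 'two'])

def two_decimals(x):
    """Format one scalar with two decimal places (hypothetical helper)."""
    return f'{x:.2f}'

# Series.map works element by element; apply hands each column to it in turn.
formatted = df.apply(lambda col: col.map(two_decimals))

print(formatted)
```

This form avoids `applymap` entirely, so it does not depend on which of the two DataFrame method names the installed pandas version supports.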