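Pandas Alignment Operations and Functions
Alignment operations in Pandas
Alignment is an important step in data cleaning. Arithmetic between two objects is performed on matching index labels; a position that has no match in the other object becomes NaN in the result, and those NaN values can be filled afterwards.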
Alignment operations on Series
1. Series align on the row index
    # Imports used by all examples in this article
    import numpy as np
    import pandas as pd

    s1 = pd.Series(range(10, 20), index=range(10))
    s2 = pd.Series(range(20, 25), index=range(5))

    print('s1: ')
    print(s1)
    print('')
    print('s2: ')
    print(s2)
Output:
    s1: 
    0    10
    1    11
    2    12
    3    13
    4    14
    5    15
    6    16
    7    17
    8    18
    9    19
    dtype: int64

    s2: 
    0    20
    1    21
    2    22
    3    23
    4    24
    dtype: int64
2. Arithmetic on aligned Series
    s1 = pd.Series(range(10, 20), index=range(10))
    s2 = pd.Series(range(20, 25), index=range(5))

    print(s1)
    print(s2)
    # Labels 5-9 exist only in s1, so the sum is NaN there
    print(s1 + s2)
Output:
    0    10
    1    11
    2    12
    3    13
    4    14
    5    15
    6    16
    7    17
    8    18
    9    19
    dtype: int64
    0    20
    1    21
    2    22
    3    23
    4    24
    dtype: int64
    0    30.0
    1    32.0
    2    34.0
    3    36.0
    4    38.0
    5     NaN
    6     NaN
    7     NaN
    8     NaN
    9     NaN
    dtype: float64
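Alignment works on labels, not on positions, which is easier to see when the labels are strings in a different order. The following is a minimal sketch; the Series `a` and `b` and their labels are made up for illustration and are not part of the original example.

```python
import pandas as pd

# Illustrative data: labels overlap only partly and appear in a different order.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "w"])

# Values are paired by label: "y" with "y", "z" with "z".
# "x" exists only in a and "w" only in b, so both become NaN.
total = a + b
print(total)
```

Because NaN is a floating-point value, the result is float64 even though both inputs hold integers.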
Alignment operations on DataFrames
1. DataFrames align on both the row index and the column index
    df1 = pd.DataFrame(np.ones((2, 2)), columns=['a', 'b'])
    df2 = pd.DataFrame(np.ones((3, 3)), columns=['a', 'b', 'c'])

    print('df1: ')
    print(df1)
    print('')
    print('df2: ')
    print(df2)
Output:
    df1: 
         a    b
    0  1.0  1.0
    1  1.0  1.0

    df2: 
         a    b    c
    0  1.0  1.0  1.0
    1  1.0  1.0  1.0
    2  1.0  1.0  1.0
2. Arithmetic on aligned DataFrames
    df1 = pd.DataFrame(np.ones((2, 2)), columns=['a', 'b'])
    df2 = pd.DataFrame(np.ones((3, 3)), columns=['a', 'b', 'c'])

    print('df1: ')
    print(df1)
    print('')
    print('df2: ')
    print(df2)
    # Row 2 and column 'c' exist only in df2, so the sum is NaN there
    print('df1+df2: ')
    print(df1 + df2)
Output:
    df1: 
         a    b
    0  1.0  1.0
    1  1.0  1.0

    df2: 
         a    b    c
    0  1.0  1.0  1.0
    1  1.0  1.0  1.0
    2  1.0  1.0  1.0
    df1+df2: 
         a    b   c
    0  2.0  2.0 NaN
    1  2.0  2.0 NaN
    2  NaN  NaN NaN
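To see where the NaN values in a DataFrame sum come from, it can help to look at the aligned operands before the addition. The sketch below uses `DataFrame.align`, which by default performs an outer join on both axes; the variable names `left`, `right` and `result` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((2, 2)), columns=["a", "b"])
df2 = pd.DataFrame(np.ones((3, 3)), columns=["a", "b", "c"])

# align() reindexes both frames to the union of their row and column labels.
left, right = df1.align(df2)
print(left)    # df1 stretched to 3x3: row 2 and column "c" are NaN
print(right)   # df2 already covers every label, so nothing is missing

# The + operator does the same alignment internally before adding.
result = df1 + df2
print(result)
```

Every NaN in `result` corresponds to a NaN in `left`, since any arithmetic with NaN yields NaN.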
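Filling unaligned data before an operation
1. fill_value
When calling add, sub, div or mul, pass fill_value to specify a fill value. Positions that exist in only one of the two objects are then computed against the fill value instead of producing NaN.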
    import pandas as pd
    import numpy as np

    s1 = pd.Series(range(10, 20), index=range(10))
    s2 = pd.Series(range(20, 25), index=range(5))
    print(s1)
    print(s2)
    # Labels 5-9 are missing in s2, so s1 is added to -1 there
    print(s1.add(s2, fill_value=-1))

    df1 = pd.DataFrame(np.ones((2, 2)), columns=['a', 'b'])
    df2 = pd.DataFrame(np.ones((3, 3)), columns=['a', 'b', 'c'])
    print(df1)
    print(df2)
    # Cells missing in df1 are treated as 2.0 before subtracting df2
    print(df1.sub(df2, fill_value=2.))
Output:
    0    10
    1    11
    2    12
    3    13
    4    14
    5    15
    6    16
    7    17
    8    18
    9    19
    dtype: int64
    0    20
    1    21
    2    22
    3    23
    4    24
    dtype: int64
    0    30.0
    1    32.0
    2    34.0
    3    36.0
    4    38.0
    5    14.0
    6    15.0
    7    16.0
    8    17.0
    9    18.0
    dtype: float64
         a    b
    0  1.0  1.0
    1  1.0  1.0
         a    b    c
    0  1.0  1.0  1.0
    1  1.0  1.0  1.0
    2  1.0  1.0  1.0
         a    b    c
    0  0.0  0.0  1.0
    1  0.0  0.0  1.0
    2  1.0  1.0  1.0
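One detail worth knowing is that fill_value only replaces a value that is missing on one side. If a cell is missing in both operands, the result stays NaN. The sketch below uses two tiny made-up frames, `left` and `right`, that share no labels at all, to make this visible.

```python
import pandas as pd

# Illustrative frames with no labels in common.
left = pd.DataFrame({"a": [1.0]}, index=[0])
right = pd.DataFrame({"b": [5.0]}, index=[1])

result = left.add(right, fill_value=0)
print(result)

# (0, "a"): present only in left  -> 1.0 + 0
# (1, "b"): present only in right -> 0 + 5.0
# (0, "b") and (1, "a"): missing in both -> NaN
```

If the remaining NaN cells also need a value, a follow-up `fillna()` call on the result is one option.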
Function application in Pandas
apply and applymap
1. NumPy functions can be used directly
    df = pd.DataFrame(np.random.randn(5, 4) - 1)
    print(df)

    # NumPy functions operate element-wise on the whole DataFrame
    print(np.abs(df))
Output (the values are random and differ on each run):
              0         1         2         3
    0 -0.638228 -0.615340 -2.416771 -0.521187
    1 -0.978901 -0.765940 -0.821583 -0.109666
    2 -0.182581 -0.820414 -0.497785  1.638130
    3 -1.398201  0.893015 -1.109652 -1.740068
    4 -0.079365 -0.750413  0.847062 -1.175580
              0         1         2         3
    0  0.638228  0.615340  2.416771  0.521187
    1  0.978901  0.765940  0.821583  0.109666
    2  0.182581  0.820414  0.497785  1.638130
    3  1.398201  0.893015  1.109652  1.740068
    4  0.079365  0.750413  0.847062  1.175580
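Since the example above uses random numbers, the effect is easier to check with fixed values. The following sketch, with a made-up frame and made-up row labels, shows that NumPy functions can be chained and that the result is still a DataFrame with the original labels.

```python
import numpy as np
import pandas as pd

# Illustrative data with custom row labels.
df = pd.DataFrame({"a": [-1.0, 4.0], "b": [9.0, -16.0]}, index=["r1", "r2"])

# abs() first, then sqrt(); both are applied element-wise.
roots = np.sqrt(np.abs(df))
print(roots)
```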
2. Use apply to apply a function to each column or each row
    df = pd.DataFrame(np.random.randn(5, 4) - 1)
    print(df)

    # The lambda receives one column at a time as a Series
    print(df.apply(lambda x: x.max()))
Output:
              0         1         2         3
    0 -0.672592 -0.917094 -1.698291 -2.683744
    1 -1.593442  0.308978 -0.668113 -0.867197
    2 -1.023184 -0.406812 -1.993301 -0.516704
    3 -0.666674 -0.524327 -2.032358  0.192416
    4 -0.466286 -1.319539 -1.643544 -1.137968
    0   -0.466286
    1    0.308978
    2   -0.668113
    3    0.192416
    dtype: float64
Pay attention to the axis. The default is axis=0, which applies the function to each column; axis=1 applies it to each row.
    df = pd.DataFrame(np.random.randn(5, 4) - 1)
    print(df)

    # Default axis=0: one result per column
    print(df.apply(lambda x: x.max()))

    # axis=1: one result per row
    print(df.apply(lambda x: x.max(), axis=1))
Output:
              0         1         2         3
    0 -1.053992 -0.627906 -2.195281 -0.433810
    1 -1.838847  0.821711  0.005306 -0.485479
    2 -0.194641 -0.608357  0.476059 -0.989364
    3 -0.935286  0.370543 -0.316234 -0.482919
    4 -0.142188 -2.685907 -0.757193 -0.150942
    0   -0.142188
    1    0.821711
    2    0.476059
    3   -0.150942
    dtype: float64
    0   -0.433810
    1    0.821711
    2    0.476059
    3    0.370543
    4   -0.142188
    dtype: float64
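The axis rule can be verified by hand on a small fixed frame. This is a minimal sketch with made-up data; the names `col_max` and `row_max` are chosen for illustration.

```python
import pandas as pd

# Illustrative data, small enough to check by eye.
df = pd.DataFrame({"a": [1, 5, 3], "b": [4, 2, 6]})

# axis=0 (default): the function sees each column, result is indexed by column name.
col_max = df.apply(lambda col: col.max())

# axis=1: the function sees each row, result is indexed by row label.
row_max = df.apply(lambda row: row.max(), axis=1)

print(col_max)
print(row_max)
```

For a simple reduction like this, the built-in `df.max()` and `df.max(axis=1)` give the same values and are usually the more direct choice; apply is mainly useful for functions that have no built-in equivalent.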
3. Use applymap to apply a function to every individual value
    df = pd.DataFrame(np.random.randn(5, 4) - 1)
    print(df)

    # applymap calls the function once per cell; here each value
    # is formatted as a string with two decimal places
    f2 = lambda x: '%.2f' % x
    print(df.applymap(f2))
Output:
              0         1         2         3
    0 -1.477573 -2.256976 -1.665249  0.381750
    1 -1.748229 -0.457566 -1.138169 -1.741856
    2 -1.456192 -0.596993 -1.293459  1.057294
    3 -0.845528 -0.725874 -2.720255  0.472505
    4 -0.927104 -1.748213 -0.382931  0.046957
           0      1      2      3
    0  -1.48  -2.26  -1.67   0.38
    1  -1.75  -0.46  -1.14  -1.74
    2  -1.46  -0.60  -1.29   1.06
    3  -0.85  -0.73  -2.72   0.47
    4  -0.93  -1.75  -0.38   0.05
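Note that pandas 2.1 deprecated `DataFrame.applymap` in favor of `DataFrame.map`, which does the same element-wise job. The sketch below picks whichever method the installed version provides; the `elementwise` name and the sample data are assumptions made for this illustration.

```python
import pandas as pd

# Illustrative data.
df = pd.DataFrame({"x": [1.234, -0.5], "y": [10.0, 2.718]})

# Use DataFrame.map where it exists (pandas >= 2.1), otherwise fall back to applymap.
elementwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap

# The function is called once per cell and returns a formatted string.
formatted = elementwise(lambda v: "%.2f" % v)
print(formatted)

# If the values should stay numeric, round() avoids the conversion to strings.
rounded = df.round(2)
print(rounded)
```

The formatted frame holds strings, so it is suited to display only; further arithmetic should use the rounded numeric frame.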
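Sorting
1. Sorting by index
sort_index()
Sorting is ascending by default; pass ascending=False for descending order.
    s4 = pd.Series(range(10, 15), index=np.random.randint(5, size=5))
    print(s4)

    # Sort by index; sort_index() returns a new Series and leaves s4 unchanged
    print(s4.sort_index())
Output (the index is random and differs on each run):
    0    10
    2    11
    3    12
    4    13
    3    14
    dtype: int64
    0    10
    2    11
    3    12
    3    14
    4    13
    dtype: int64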
When sorting a DataFrame, pay attention to the axis: axis=0 sorts the row index, axis=1 sorts the column index.
    df4 = pd.DataFrame(np.random.randn(3, 5),
                       index=np.random.randint(3, size=3),
                       columns=np.random.randint(5, size=5))
    print(df4)

    # Sort the column labels in descending order; here: 4 2 1 1 0
    df4_isort = df4.sort_index(axis=1, ascending=False)
    print(df4_isort)
Output:
              1         1         4         2         0
    0  0.661257 -1.022631  0.337867 -0.680210  0.018720
    2  0.486521 -0.617665 -1.566189  1.484633  0.284891
    2 -0.902534  2.621820 -0.278090 -0.807439  1.121617
              4         2         1         1         0
    0  0.337867 -0.680210  0.661257 -1.022631  0.018720
    2 -1.566189  1.484633  0.486521 -0.617665  0.284891
    2 -0.278090 -0.807439 -0.902534  2.621820  1.121617
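The index sorting rules above can be reproduced without random labels. The following sketch uses fixed, made-up labels (including a duplicate) and also shows that sort_index returns a new object rather than modifying the original.

```python
import pandas as pd

# Illustrative Series with an unsorted index that contains a duplicate label.
s = pd.Series([10, 11, 12, 13, 14], index=[0, 2, 3, 4, 3])

asc = s.sort_index()                  # ascending by default
desc = s.sort_index(ascending=False)  # descending
print(asc)
print(desc)
print(s)  # the original Series keeps its order

# For a DataFrame, axis=1 sorts the column labels instead of the row labels.
df = pd.DataFrame([[1, 2, 3]], columns=["b", "c", "a"])
by_cols = df.sort_index(axis=1)
print(by_cols)
```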
2. Sorting by value
sort_values(by='column name')
Sorts the rows by the values in a uniquely named column. If more than one column has that name, an error is raised.
    df4 = pd.DataFrame(np.random.randn(3, 5))
    print(df4)

    # Sort rows by the values in column 0, largest first
    df4_vsort = df4.sort_values(by=0, ascending=False)
    print(df4_vsort)
Output:
              0         1         2         3         4
    0 -0.579405  1.055458 -2.274356 -1.215769  1.582240
    1  2.081478 -0.687347  0.854755 -0.011375 -2.779123
    2  1.824004 -1.294691  0.940245  1.626087 -0.539030
              0         1         2         3         4
    1  2.081478 -0.687347  0.854755 -0.011375 -2.779123
    2  1.824004 -1.294691  0.940245  1.626087 -0.539030
    0 -0.579405  1.055458 -2.274356 -1.215769  1.582240
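Two points about sort_values can be sketched with fixed data: `by` also accepts a list of columns, each with its own direction, and a duplicated column name makes the sort key ambiguous. The frames `df` and `dup` below are made up for illustration.

```python
import pandas as pd

# Illustrative data: sort by "k" ascending, then by "v" descending within equal "k".
df = pd.DataFrame({"k": [2, 1, 2], "v": [0.5, 0.9, 0.1]})
ordered = df.sort_values(by=["k", "v"], ascending=[True, False])
print(ordered)

# A frame whose two columns share the name "a": pandas cannot tell which one to use.
dup = pd.DataFrame([[3, 1], [2, 4]], columns=["a", "a"])
try:
    dup.sort_values(by="a")
    raised = False
except ValueError as exc:
    raised = True
    print("sort failed:", exc)
```

The row labels travel with the rows, so `ordered` keeps the original index values in their new order.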
Handling missing data
    df_data = pd.DataFrame([np.random.randn(3),
                            [1., 2., np.nan],
                            [np.nan, 4., np.nan],
                            [1., 2., 3.]])
    print(df_data.head())
Output:
              0         1         2
    0 -3.094288 -0.914912  2.419605
    1  1.000000  2.000000       NaN
    2       NaN  4.000000       NaN
    3  1.000000  2.000000  3.000000
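1. Detect missing values: isnull()
2. Drop missing data: dropna()
Depending on the axis argument, this drops the rows (axis=0, the default) or the columns (axis=1) that contain NaN.
3. Fill missing data: fillna()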
    df_data = pd.DataFrame([np.random.randn(3),
                            [1., 2., np.nan],
                            [np.nan, 4., np.nan],
                            [1., 2., 3.]])
    print(df_data.head())

    # isnull: True wherever a value is missing
    print(df_data.isnull())

    # dropna: drop rows containing NaN (default), then columns containing NaN
    print(df_data.dropna())
    print(df_data.dropna(axis=1))

    # fillna: replace every NaN with -100
    print(df_data.fillna(-100.))
Output:
              0         1         2
    0 -0.390745  1.712754 -0.156704
    1  1.000000  2.000000       NaN
    2       NaN  4.000000       NaN
    3  1.000000  2.000000  3.000000
           0      1      2
    0  False  False  False
    1  False  False   True
    2   True  False   True
    3  False  False  False
              0         1         2
    0 -0.390745  1.712754 -0.156704
    3  1.000000  2.000000  3.000000
              1
    0  1.712754
    1  2.000000
    2  4.000000
    3  2.000000
                0         1           2
    0   -0.390745  1.712754   -0.156704
    1    1.000000  2.000000 -100.000000
    2 -100.000000  4.000000 -100.000000
    3    1.000000  2.000000    3.000000
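Dropping every row with any NaN or filling with one constant is often too coarse. As a possible refinement, the sketch below counts missing values per column, keeps rows with a minimum number of valid values via the `thresh` parameter, and fills each column with its own mean. The data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with NaN in columns 0 and 2.
df = pd.DataFrame([[1.0, 2.0, np.nan],
                   [np.nan, 4.0, np.nan],
                   [1.0, 2.0, 3.0]])

# Number of missing values in each column.
n_missing = df.isnull().sum()
print(n_missing)

# Keep only rows that have at least 2 non-NaN values.
mostly_complete = df.dropna(thresh=2)
print(mostly_complete)

# Fill each column's NaN with that column's mean (mean() skips NaN by default).
filled = df.fillna(df.mean())
print(filled)
```

Whether the mean is a sensible replacement depends on the data; it is used here only to show that fillna accepts a per-column Series as well as a scalar.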