Handling Missing Data
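pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays. It is simply a sentinel that is easy to detect:

    import numpy as np
    from pandas import Series, DataFrame

    string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])

    string_data
    Out[71]:
    0     aardvark
    1    artichoke
    2          NaN
    3      avocado
    dtype: object

    string_data.isnull()
    Out[73]:
    0    False
    1    False
    2     True
    3    False
    dtype: bool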
Python's built-in None value is also treated as NA:

    string_data[0] = None
    string_data.isnull()
    Out[76]:
    0     True
    1    False
    2     True
    3    False
    dtype: bool
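The reason a dedicated detection method is needed can be shown in a few lines. The following is a minimal sketch (the variable names are illustrative, not from the original session): NaN never compares equal to itself, so an equality test cannot locate it, whereas isnull() flags both None and NaN.

```python
import numpy as np
import pandas as pd

# Illustrative data containing both kinds of missing marker.
values = pd.Series(["aardvark", None, np.nan, "avocado"])

# NaN is never equal to itself, so == cannot be used to find it.
nan_equals_itself = np.nan == np.nan

# isnull() treats None and NaN alike.
missing_mask = values.isnull()
missing_count = int(missing_mask.sum())

print(nan_equals_itself)
print(missing_mask.tolist())
print(missing_count)
```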
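NA handling methods

| Method | Description |
| --- | --- |
| dropna | Filters axis labels based on whether the values for each label contain missing data; a threshold controls how much missing data is tolerated |
| fillna | Fills missing data with a specified value or with an interpolation method such as ffill or bfill |
| isnull | Returns an object of boolean values marking which values are NA |
| notnull | Negation of isnull |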
Filtering Out Missing Data
Series
On a Series, dropna returns a Series containing only the non-null data and the corresponding index values:

    from numpy import nan as NA

    data = Series([1, NA, 3.5, NA, 7])

    data.dropna()
    Out[79]:
    0    1.0
    2    3.5
    4    7.0
    dtype: float64
The same result can be obtained with boolean indexing:

    data[data.notnull()]
    Out[80]:
    0    1.0
    2    3.5
    4    7.0
    dtype: float64
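The equivalence of the two approaches can be checked directly. This is a small sketch using the same values as the Series above; the renumbering step with reset_index is an addition for illustration and is only needed when the gaps in the index are unwanted.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

dropped = data.dropna()
masked = data[data.notnull()]

# Both approaches keep the original index labels 0, 2 and 4.
same_result = dropped.equals(masked)

# reset_index(drop=True) renumbers the surviving values from 0.
renumbered = dropped.reset_index(drop=True)

print(same_result)
print(renumbered)
```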
DataFrame
By default, dropna drops any row that contains a missing value:

    data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
                      [NA, NA, NA], [NA, 6.5, 3.]])

    cleaned = data.dropna()

    data
    Out[84]:
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    2  NaN  NaN  NaN
    3  NaN  6.5  3.0

    cleaned
    Out[85]:
         0    1    2
    0  1.0  6.5  3.0
Passing how='all' drops only those rows in which every value is NA:

    data.dropna(how='all')
    Out[86]:
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    3  NaN  6.5  3.0
To drop columns in the same way, pass axis=1:

    data[4] = NA

    data
    Out[90]:
         0    1    2   4
    0  1.0  6.5  3.0 NaN
    1  1.0  NaN  NaN NaN
    2  NaN  NaN  NaN NaN
    3  NaN  6.5  3.0 NaN

    data.dropna(axis=1, how='all')
    Out[91]:
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    2  NaN  NaN  NaN
    3  NaN  6.5  3.0
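The interaction between how and axis is easiest to see by comparing the shapes of all four combinations. The sketch below rebuilds the frame from this section (including the all-NA column 4); the result names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1.0, 6.5, 3.0],
                     [1.0, np.nan, np.nan],
                     [np.nan] * 3,
                     [np.nan, 6.5, 3.0]])
data[4] = np.nan  # a column that is entirely NA

rows_any = data.dropna()                   # every row has an NA in column 4
rows_all = data.dropna(how="all")          # only row 2 is entirely NA
cols_any = data.dropna(axis=1)             # every column has at least one NA
cols_all = data.dropna(axis=1, how="all")  # only column 4 is entirely NA

for name, frame in [("rows_any", rows_any), ("rows_all", rows_all),
                    ("cols_any", cols_any), ("cols_all", cols_all)]:
    print(name, frame.shape)
```

The default how='any' is strict enough here to remove every row or every column, which is why how='all' or thresh is often the more useful setting on sparse data.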
To keep only the rows that contain at least a given number of non-NA observations, use the thresh argument. In the example below, thresh=3 keeps the rows with at least three non-NA values:

    df = DataFrame(np.random.randn(7, 3))
    # .ix has been removed from pandas; .loc slices by label and includes the end label
    df.loc[:4, 1] = NA
    df.loc[:2, 2] = NA

    df
    Out[97]:
              0         1         2
    0  0.374594       NaN       NaN
    1 -1.839283       NaN       NaN
    2 -0.278500       NaN       NaN
    3 -0.153041       NaN -0.508259
    4  0.788720       NaN  0.522755
    5 -0.850456 -0.742876 -0.508570
    6 -0.811658 -1.395474  1.452715

    df.dropna(thresh=3)
    Out[98]:
              0         1         2
    5 -0.850456 -0.742876 -0.508570
    6 -0.811658 -1.395474  1.452715
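Because the session above uses random numbers, a deterministic sketch may make the thresh rule easier to verify by hand. The column names and values here are made up for illustration; the point is that a row survives when its count of non-NA values is at least thresh.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 5.0, 6.0],
    "c": [np.nan, 7.0, np.nan, 8.0],
})

# Number of non-NA values in each row.
non_na = df.notnull().sum(axis=1)

at_least_two = df.dropna(thresh=2)  # keeps rows with 2 or more non-NA values
all_three = df.dropna(thresh=3)     # keeps only fully populated rows

print(non_na.tolist())
print(at_least_two.index.tolist())
print(all_three.index.tolist())
```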
Filling In Missing Data
For most purposes, fillna is the main method to use. Calling fillna with a constant replaces every missing value with that constant:

    df.fillna(0)
    Out[99]:
              0         1         2
    0  0.374594  0.000000  0.000000
    1 -1.839283  0.000000  0.000000
    2 -0.278500  0.000000  0.000000
    3 -0.153041  0.000000 -0.508259
    4  0.788720  0.000000  0.522755
    5 -0.850456 -0.742876 -0.508570
    6 -0.811658 -1.395474  1.452715
Calling fillna with a dict fills each column with its own value. Here the key 3 matches no column in df, so it is ignored and column 2 keeps its NaN values:

    df.fillna({1: 0.5, 3: -1})
    Out[100]:
              0         1         2
    0  0.374594  0.500000       NaN
    1 -1.839283  0.500000       NaN
    2 -0.278500  0.500000       NaN
    3 -0.153041  0.500000 -0.508259
    4  0.788720  0.500000  0.522755
    5 -0.850456 -0.742876 -0.508570
    6 -0.811658 -1.395474  1.452715
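The per-column behavior of a dict argument can be confirmed on a tiny frame. This is a sketch with made-up values: columns named in the dict are filled, columns not named are left alone, and a key that matches no column (3 here) has no effect.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    0: [1.0, np.nan],
    1: [np.nan, 2.0],
    2: [np.nan, np.nan],
})

# Key 1 matches a column; key 3 does not exist in df.
filled = df.fillna({1: 0.5, 3: -1})

print(filled)
```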
fillna returns a new object by default, but it can also modify the existing object in place:

    _ = df.fillna(0, inplace=True)

    df
    Out[102]:
              0         1         2
    0  0.374594  0.000000  0.000000
    1 -1.839283  0.000000  0.000000
    2 -0.278500  0.000000  0.000000
    3 -0.153041  0.000000 -0.508259
    4  0.788720  0.000000  0.522755
    5 -0.850456 -0.742876 -0.508570
    6 -0.811658 -1.395474  1.452715
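The difference between the two modes is whether the original object changes. A minimal sketch, using an illustrative one-column frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0]})

# Default: a new object is returned and df itself is untouched.
copy_filled = df.fillna(0)
still_missing = int(df["x"].isnull().sum())

# inplace=True: df itself is modified.
df.fillna(0, inplace=True)
missing_after = int(df["x"].isnull().sum())

print(still_missing, missing_after)
```

Assigning the result of the default call (df = df.fillna(0)) is often the easier pattern to follow when reading code later, since the mutation is visible at the assignment.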
The interpolation methods available for reindex can also be used with fillna. Recent pandas releases deprecate the method argument of fillna in favor of the equivalent df.ffill() and df.bfill() methods, which accept the same limit argument:

    df = DataFrame(np.random.randn(6, 3))

    # .ix has been removed from pandas; .loc slices by label
    df.loc[2:, 1] = NA
    df.loc[4:, 2] = NA

    df
    Out[107]:
              0         1         2
    0  1.765621 -1.034028  0.303409
    1  0.668661 -2.071361  0.716810
    2  0.728906       NaN -1.767853
    3  1.587540       NaN  0.028572
    4 -0.247599       NaN       NaN
    5 -1.155067       NaN       NaN

    df.fillna(method='ffill')
    Out[108]:
              0         1         2
    0  1.765621 -1.034028  0.303409
    1  0.668661 -2.071361  0.716810
    2  0.728906 -2.071361 -1.767853
    3  1.587540 -2.071361  0.028572
    4 -0.247599 -2.071361  0.028572
    5 -1.155067 -2.071361  0.028572

    df.fillna(method='ffill', limit=2)
    Out[109]:
              0         1         2
    0  1.765621 -1.034028  0.303409
    1  0.668661 -2.071361  0.716810
    2  0.728906 -2.071361 -1.767853
    3  1.587540 -2.071361  0.028572
    4 -0.247599       NaN  0.028572
    5 -1.155067       NaN  0.028572
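The effect of limit is easier to trace on a short Series with fixed values. The sketch below uses the ffill() and bfill() methods rather than fillna(method=...), on the assumption that the reader may be running a pandas version where the method argument is no longer accepted; the data is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0, np.nan])

# Forward fill: each gap takes the last valid value before it.
filled_all = s.ffill()

# limit=2 fills at most two consecutive NA values per gap.
limited = s.ffill(limit=2)

# Backward fill: each gap takes the next valid value after it;
# the trailing NA has nothing after it and stays missing.
backfilled = s.bfill()

print(filled_all.tolist())
print(limited.isnull().tolist())
print(backfilled.isnull().tolist())
```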
fillna can be used for many other things as well. For example, you can pass the mean or median of a Series:

    data = Series([1., NA, 3.5, NA, 7])

    data.fillna(data.mean())
    Out[111]:
    0    1.000000
    1    3.833333
    2    3.500000
    3    3.833333
    4    7.000000
    dtype: float64
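The same idea extends to a DataFrame. As a sketch with illustrative column names: df.mean() returns a Series indexed by column label, and fillna aligns that Series with the columns, so each column is filled with its own mean (NA values are skipped when the mean is computed).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 4.0, 8.0],
})

column_means = df.mean()  # a -> 2.0, b -> 6.0
filled_mean = df.fillna(column_means)

# The median works the same way.
filled_median = df.fillna(df.median())

print(filled_mean)
```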
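fillna function arguments

| Argument | Description |
| --- | --- |
| value | Scalar value or dict-like object used to fill missing values |
| method | Interpolation method, such as 'ffill' or 'bfill'; defaults to None, in which case value must be supplied |
| axis | Axis to fill along; default axis=0 |
| limit | For forward and backward filling, the maximum number of consecutive values to fill |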
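The axis argument is the least obvious entry in the table, so here is a small sketch of how it changes the fill direction. It uses ffill(), which takes the same axis argument; the frame is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan, 3.0],
                   [np.nan, 5.0, np.nan]])

# axis=0 (default): values propagate down each column.
down = df.ffill()

# axis=1: values propagate left to right along each row.
across = df.ffill(axis=1)

print(down)
print(across)
```

A leading NA (the first row of a column for axis=0, or the first column of a row for axis=1) has no earlier value to copy and remains missing.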