Handling Missing Data

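pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays. It is simply a sentinel value that is easy to detect:

    import numpy as np
    from pandas import Series, DataFrame

    string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])

    string_data
    Out[71]: 
    0     aardvark
    1    artichoke
    2          NaN
    3      avocado
    dtype: object

    string_data.isnull()
    Out[73]: 
    0    False
    1    False
    2     True
    3    False
    dtype: bool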
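
One reason a dedicated detection method is needed: NaN never compares equal to anything, including itself, so an ordinary `==` test cannot find it. The snippet below is a minimal sketch of that behavior; the variable name `value` is only illustrative.

```python
import numpy as np
import pandas as pd

value = np.nan

# NaN is not equal to itself, so equality checks cannot detect it.
equal_to_itself = (value == value)

# pd.isnull recognizes both NaN and Python's None as missing.
nan_is_missing = bool(pd.isnull(value))
none_is_missing = bool(pd.isnull(None))

print(equal_to_itself, nan_is_missing, none_is_missing)
```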

Python's built-in None value is also treated as NA:

    string_data[0] = None
    string_data.isnull()
    Out[76]: 
    0     True
    1    False
    2     True
    3    False
    dtype: bool

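NA handling methods

Method    Description
dropna    Filters axis labels based on whether the values for each label contain missing data; a threshold controls how much missing data is tolerated
fillna    Fills in missing data with a specified value or an interpolation method (such as ffill or bfill)
isnull    Returns an object of boolean values indicating which values are missing (NA)
notnull   Negation of isnull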
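
As a quick illustration of the two detection methods in the table, the sketch below builds a small Series (hypothetical sample data) and counts its missing entries by summing the boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one None and one NaN among four entries.
s = pd.Series(["aardvark", None, np.nan, "avocado"])

missing_mask = s.isnull()    # True where the value is NA
present_mask = s.notnull()   # element-wise negation of isnull

# True counts as 1 when summed, so this is the number of missing values.
n_missing = int(missing_mask.sum())

print(missing_mask.tolist())
print(n_missing)
```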

 
Filtering Out Missing Data
 
Series
On a Series, dropna returns a new Series containing only the non-null data and its index values:

    from numpy import nan as NA

    data = Series([1, NA, 3.5, NA, 7])

    data.dropna()
    Out[79]: 
    0    1.0
    2    3.5
    4    7.0
    dtype: float64

The same result can be obtained with boolean indexing:

    data[data.notnull()]
    Out[80]: 
    0    1.0
    2    3.5
    4    7.0
    dtype: float64
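
The sketch below checks that the two approaches agree, and that neither modifies the original Series. The variable names `via_dropna` and `via_mask` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

via_dropna = data.dropna()          # drop missing entries
via_mask = data[data.notnull()]     # keep entries where the mask is True

# Both keep the original index labels of the surviving values.
print(list(via_dropna.index))
print(via_dropna.equals(via_mask))

# The original Series still has all five entries.
print(len(data))
```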

 

DataFrame
By default, dropna drops any row that contains a missing value:

    data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
                      [NA, NA, NA], [NA, 6.5, 3.]])

    cleaned = data.dropna()

    data
    Out[84]: 
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    2  NaN  NaN  NaN
    3  NaN  6.5  3.0

    cleaned
    Out[85]: 
         0    1    2
    0  1.0  6.5  3.0

Passing how='all' drops only the rows that are entirely NA:

    data.dropna(how='all')
    Out[86]: 
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    3  NaN  6.5  3.0

To drop columns instead, pass axis=1:

    data[4] = NA

    data
    Out[90]: 
         0    1    2   4
    0  1.0  6.5  3.0 NaN
    1  1.0  NaN  NaN NaN
    2  NaN  NaN  NaN NaN
    3  NaN  6.5  3.0 NaN

    data.dropna(axis=1, how='all')
    Out[91]: 
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    2  NaN  NaN  NaN
    3  NaN  6.5  3.0
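
To see how the `how` and `axis` arguments combine, here is a small sketch on a hypothetical frame in which every row has at least one NaN and one column is entirely NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "c" is all NaN, row 2 is all NaN.
frame = pd.DataFrame({
    "a": [1.0, 1.0, np.nan],
    "b": [6.5, np.nan, np.nan],
    "c": [np.nan, np.nan, np.nan],
})

rows_any = frame.dropna()                   # default how='any': drop rows with any NaN
rows_all = frame.dropna(how="all")          # drop only rows that are entirely NaN
cols_all = frame.dropna(axis=1, how="all")  # drop only columns that are entirely NaN

print(len(rows_any))
print(list(rows_all.index))
print(list(cols_all.columns))
```

With the default settings nothing survives here, because column "c" puts a NaN in every row; this is a common reason to reach for how='all' or a column drop first.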

To keep only the rows that contain at least a given number of non-NA observations, use the thresh argument:

    # The values are random, so your numbers will differ.
    df = DataFrame(np.random.randn(7, 3))
    # .ix has been removed from pandas; .loc slices by label (end label included).
    df.loc[:4, 1] = NA
    df.loc[:2, 2] = NA

    df
    Out[97]: 
              0         1         2
    0  0.374594       NaN       NaN
    1 -1.839283       NaN       NaN
    2 -0.278500       NaN       NaN
    3 -0.153041       NaN -0.508259
    4  0.788720       NaN  0.522755
    5 -0.850456 -0.742876 -0.508570
    6 -0.811658 -1.395474  1.452715

    df.dropna(thresh=3)
    Out[98]: 
              0         1         2
    5 -0.850456 -0.742876 -0.508570
    6 -0.811658 -1.395474  1.452715
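
Since thresh counts non-NA values per row, its effect can be reproduced by counting them explicitly. The following sketch uses a small deterministic frame (hypothetical data) so the result does not depend on random numbers.

```python
import numpy as np
import pandas as pd

# Rows 0..3 hold 3, 2, 1 and 0 non-NA values respectively.
frame = pd.DataFrame([
    [1.0, 2.0, 3.0],
    [1.0, np.nan, 3.0],
    [np.nan, np.nan, 3.0],
    [np.nan, np.nan, np.nan],
])

# Number of non-NA values in each row.
counts = frame.notnull().sum(axis=1)

# thresh=2 keeps rows with at least two non-NA values.
kept = frame.dropna(thresh=2)
manual = frame[counts >= 2]

print(counts.tolist())
print(list(kept.index), list(manual.index))
```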

 

Filling In Missing Data
 
For most purposes, fillna is the main tool. Calling fillna with a constant replaces every missing value with that constant:

    df.fillna(0)
    Out[99]: 
              0         1         2
    0  0.374594  0.000000  0.000000
    1 -1.839283  0.000000  0.000000
    2 -0.278500  0.000000  0.000000
    3 -0.153041  0.000000 -0.508259
    4  0.788720  0.000000  0.522755
    5 -0.850456 -0.742876 -0.508570
    6 -0.811658 -1.395474  1.452715

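Calling fillna with a dict lets you use a different fill value for each column:

    # Dict keys are column labels. Key 3 matches no column in df,
    # so it is ignored and column 2 keeps its NaN values.
    df.fillna({1: 0.5, 3: -1})
    Out[100]: 
              0         1         2
    0  0.374594  0.500000       NaN
    1 -1.839283  0.500000       NaN
    2 -0.278500  0.500000       NaN
    3 -0.153041  0.500000 -0.508259
    4  0.788720  0.500000  0.522755
    5 -0.850456 -0.742876 -0.508570
    6 -0.811658 -1.395474  1.452715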
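
A deterministic variant of the dict form may make the per-column behavior easier to verify. This is a minimal sketch on hypothetical data: columns named in the dict are filled, and a column left out of the dict is untouched.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "x": [1.0, np.nan],
    "y": [np.nan, 2.0],
    "z": [np.nan, np.nan],
})

# Fill "x" with 0.0 and "y" with -1.0; "z" is not in the dict, so it stays NaN.
filled = frame.fillna({"x": 0.0, "y": -1.0})

print(filled["x"].tolist())
print(filled["y"].tolist())
print(filled["z"].isnull().all())
```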

fillna returns a new object by default, but it can also modify the existing object in place:

    _ = df.fillna(0, inplace=True)

    df
    Out[102]: 
              0         1         2
    0  0.374594  0.000000  0.000000
    1 -1.839283  0.000000  0.000000
    2 -0.278500  0.000000  0.000000
    3 -0.153041  0.000000 -0.508259
    4  0.788720  0.000000  0.522755
    5 -0.850456 -0.742876 -0.508570
    6 -0.811658 -1.395474  1.452715
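
The difference between the two modes can be checked directly. In this sketch (hypothetical data), the default call leaves the original frame untouched, while inplace=True changes it.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": [1.0, np.nan, 3.0]})

# Default: a new object is returned and the original keeps its NaN.
copy_filled = frame.fillna(0)
original_still_has_nan = bool(frame["x"].isnull().any())

# In place: the original object itself is modified.
frame.fillna(0, inplace=True)
original_now_filled = not bool(frame["x"].isnull().any())

print(original_still_has_nan, original_now_filled)
```

In-place modification saves an assignment, but it also discards the information about which values were originally missing, so keeping the default is often the safer choice.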

The interpolation methods available for reindex can also be used with fillna:

    # The values are random, so your numbers will differ.
    df = DataFrame(np.random.randn(6, 3))

    # .ix is deprecated (and removed in later pandas); use .loc for label-based indexing.
    df.loc[2:, 1] = NA
    df.loc[4:, 2] = NA

    df
    Out[107]: 
              0         1         2
    0  1.765621 -1.034028  0.303409
    1  0.668661 -2.071361  0.716810
    2  0.728906       NaN -1.767853
    3  1.587540       NaN  0.028572
    4 -0.247599       NaN       NaN
    5 -1.155067       NaN       NaN

    # Newer pandas versions (2.1+) deprecate fillna(method=...);
    # df.ffill() is the equivalent spelling there.
    df.fillna(method='ffill')
    Out[108]: 
              0         1         2
    0  1.765621 -1.034028  0.303409
    1  0.668661 -2.071361  0.716810
    2  0.728906 -2.071361 -1.767853
    3  1.587540 -2.071361  0.028572
    4 -0.247599 -2.071361  0.028572
    5 -1.155067 -2.071361  0.028572

    # limit=2 fills at most two consecutive NaN values per gap.
    df.fillna(method='ffill', limit=2)
    Out[109]: 
              0         1         2
    0  1.765621 -1.034028  0.303409
    1  0.668661 -2.071361  0.716810
    2  0.728906 -2.071361 -1.767853
    3  1.587540 -2.071361  0.028572
    4 -0.247599       NaN  0.028572
    5 -1.155067       NaN  0.028572
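
Forward and backward filling can also be written with the dedicated ffill and bfill methods, which newer pandas releases recommend over the method= argument. The sketch below, on a hypothetical Series with a three-value gap, shows how limit caps the number of consecutive fills.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill: propagate the last valid value into the gap.
forward = s.ffill()

# With limit=2, only the first two NaN values of the gap are filled.
forward_limited = s.ffill(limit=2)

# Backward fill with limit=1: only the NaN right before 5.0 is filled.
backward_limited = s.bfill(limit=1)

print(forward.tolist())
print(forward_limited.isnull().tolist())
print(backward_limited.isnull().tolist())
```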

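fillna can do a lot more with a little creativity. For example, you can pass in the mean or median of a Series:

    # data was reassigned to a DataFrame above, so rebuild the Series first.
    data = Series([1., NA, 3.5, NA, 7])
    data.fillna(data.mean())
    Out[111]: 
    0    1.000000
    1    3.833333
    2    3.500000
    3    3.833333
    4    7.000000
    dtype: float64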
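
The same idea extends to a DataFrame: mean() returns one value per column, and fillna aligns those values with the columns by label. The sketch below uses hypothetical data and also shows the median variant for a Series.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],   # mean of non-NA values is 2.0
    "b": [np.nan, 4.0, 8.0],   # mean of non-NA values is 6.0
})

# frame.mean() skips NaN by default and yields a Series indexed by column label.
col_means = frame.mean()
filled_frame = frame.fillna(col_means)

# Median variant on a Series: the median of 1, 3.5 and 7 is 3.5.
s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])
filled_series = s.fillna(s.median())

print(filled_frame["a"].tolist(), filled_frame["b"].tolist())
print(filled_series.tolist())
```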

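fillna function arguments

Argument  Description
value     Scalar value or dict-like object used to fill missing values
method    Interpolation method, 'ffill' or 'bfill'. In current pandas the default is None, so either value or method must be supplied
axis      Axis to fill along; default axis=0
limit     For forward and backward filling, the maximum number of consecutive values to fill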
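
The axis argument matters mainly for forward and backward filling, where it sets the direction in which values are propagated. Here is a minimal sketch on hypothetical data, written with ffill (which takes the same axis argument): axis=0 fills down each column, axis=1 fills across each row.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([
    [1.0, np.nan, np.nan],
    [np.nan, 2.0, np.nan],
])

# axis=0 (default): propagate values downward within each column.
down = frame.ffill(axis=0)

# axis=1: propagate values left to right within each row.
across = frame.ffill(axis=1)

# Replace any remaining NaN with -1 only to make the printed result easy to read.
print(down.fillna(-1).values.tolist())
print(across.fillna(-1).values.tolist())
```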
 
posted @ 2018-08-06 18:09  平淡才是真~~