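pandas (5): Handling Missing Data and Hierarchical Indexing
pandas uses the floating-point value NaN to represent missing data in both floating-point and non-floating-point arrays. It is simply a marker that is easy to detect.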
>>> import numpy as np
>>> import pandas as pd
>>> from pandas import Series, DataFrame
>>> string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
>>> string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
>>> string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool
>>> string_data.notnull()
0     True
1     True
2    False
3     True
dtype: bool
>>> string_data.fillna("miss")
0     aardvark
1    artichoke
2         miss
3      avocado
dtype: object
>>> string_data  # fillna returned a new object; the original is unchanged
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
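NA handling methods
Method | Description |
dropna | Filters axis labels based on whether the values for each label contain missing data; a threshold controls how much missing data is tolerated |
fillna | Fills missing data with a specified value or with a fill method such as forward or backward fill |
isnull | Returns an object of boolean values indicating which values are missing; the returned object has the same type as the original |
notnull | The negation of isnull |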
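As a small illustration of the detection methods in the table, the sketch below uses made-up sample data (the names `s`, `df`, `mask` and `missing_per_column` are hypothetical). It shows that Python's None is treated as missing just like NaN, and that summing the boolean mask is one way to count missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data containing both None and np.nan
s = pd.Series(["aardvark", None, np.nan, "avocado"])

# isnull flags None as well as NaN
mask = s.isnull()

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, 6.0]})

# True counts as 1, so summing the mask counts missing values per column
missing_per_column = df.isnull().sum()

print(mask.tolist())
print(missing_per_column.to_dict())
```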
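A closer look at the dropna method:
Common parameters:
axis: the axis to filter along (0 for rows, 1 for columns)
how: 'any' or 'all'. 'any' (the default) drops a row or column that contains at least one missing value; 'all' drops it only when every value is missing
thresh: an integer giving the minimum number of non-missing values a row or column must have in order to be kept.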
Filtering out missing data
For a Series, there are two ways to do this:
>>> from numpy import nan as NA
>>> data = Series([1, NA, 3.2, NA, 5])
>>> data
0    1.0
1    NaN
2    3.2
3    NaN
4    5.0
dtype: float64
# Method 1: dropna
>>> data.dropna()
0    1.0
2    3.2
4    5.0
dtype: float64
# Method 2: boolean indexing with notnull
>>> data[data.notnull()]
0    1.0
2    3.2
4    5.0
dtype: float64
For a DataFrame, things are a bit more complicated. By default, dropna drops every row that contains a missing value.
>>> frame = DataFrame([[1, 6.5, 3], [1, NA, NA], [NA, NA, NA], [NA, 6.5, 3]])
>>> frame
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
>>> clean_data = frame.dropna()  # by default, drop every row that contains a missing value
>>> clean_data
     0    1    2
0  1.0  6.5  3.0
>>> frame.dropna(how='all')  # drop only the rows in which every value is missing
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
>>> frame.dropna(axis=1, how='all')  # drop columns in which every value is missing (none here)
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
>>> frame.dropna(thresh=2)  # drop rows with fewer than 2 non-missing values
     0    1    2
0  1.0  6.5  3.0
3  NaN  6.5  3.0
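In the session above no column is entirely missing, so the column-wise call returns the frame unchanged. The following sketch uses a hypothetical frame (the column names `a`, `b`, `c` are invented for illustration) in which one column is all NaN, to show axis=1 combined with how='all' and with thresh actually removing columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "b" is entirely missing
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan],
    "b": [np.nan, np.nan, np.nan],
    "c": [4.0, np.nan, np.nan],
})

# Drop only the columns in which every value is missing ("b")
no_empty_cols = df.dropna(axis=1, how="all")

# Keep only the columns with at least 2 non-missing values ("a")
at_least_two = df.dropna(axis=1, thresh=2)

print(list(no_empty_cols.columns))
print(list(at_least_two.columns))
```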
Filling in missing data
For a DataFrame object:
>>> df = DataFrame(np.random.randn(7, 3))
>>> df.loc[:4, 1] = NA  # label-based and inclusive: rows 0 through 4 (.loc replaces the removed .ix indexer)
>>> df.loc[:2, 2] = NA
>>> df
          0         1         2
0 -1.362151       NaN       NaN
1 -0.465262       NaN       NaN
2  0.037518       NaN       NaN
3 -2.895224       NaN -2.514141
4 -0.635875       NaN  1.722823
5 -0.479897  0.999354 -0.547433
6 -0.744960  0.363400  0.706812
>>> df.fillna(0)  # element-wise fill
          0         1         2
0 -1.362151  0.000000  0.000000
1 -0.465262  0.000000  0.000000
2  0.037518  0.000000  0.000000
3 -2.895224  0.000000 -2.514141
4 -0.635875  0.000000  1.722823
5 -0.479897  0.999354 -0.547433
6 -0.744960  0.363400  0.706812
# fill each column with a different value
>>> df.fillna({1: 0.5, 2: -1})
          0         1         2
0 -1.362151  0.500000 -1.000000
1 -0.465262  0.500000 -1.000000
2  0.037518  0.500000 -1.000000
3 -2.895224  0.500000 -2.514141
4 -0.635875  0.500000  1.722823
5 -0.479897  0.999354 -0.547433
6 -0.744960  0.363400  0.706812
>>> df.fillna(method='bfill')  # method selects forward ('ffill') or backward ('bfill') filling
          0         1         2
0 -1.362151  0.999354 -2.514141
1 -0.465262  0.999354 -2.514141
2  0.037518  0.999354 -2.514141
3 -2.895224  0.999354 -2.514141
4 -0.635875  0.999354  1.722823
5 -0.479897  0.999354 -0.547433
6 -0.744960  0.363400  0.706812
>>> df.fillna(method='bfill', limit=2)  # backward-fill at most two rows per gap
          0         1         2
0 -1.362151       NaN       NaN
1 -0.465262       NaN -2.514141
2  0.037518       NaN -2.514141
3 -2.895224  0.999354 -2.514141
4 -0.635875  0.999354  1.722823
5 -0.479897  0.999354 -0.547433
6 -0.744960  0.363400  0.706812
fillna returns a new object by default. To modify the original data in place, pass inplace=True.
>>> df.fillna(0, inplace=True)
>>> df
          0         1         2
0 -1.362151  0.000000  0.000000
1 -0.465262  0.000000  0.000000
2  0.037518  0.000000  0.000000
3 -2.895224  0.000000 -2.514141
4 -0.635875  0.000000  1.722823
5 -0.479897  0.999354 -0.547433
6 -0.744960  0.363400  0.706812
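fillna function arguments
Argument | Description |
method | Fill method: forward fill ('ffill') or backward fill ('bfill') |
value | Scalar value or dict-like object used to fill missing values |
axis | Axis to fill along |
inplace | Modify the calling object without producing a copy |
limit | Maximum number of consecutive values to fill when forward or backward filling |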
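The arguments above can also be reached through the dedicated ffill and bfill methods, which recent pandas releases recommend over fillna(method=...). The sketch below, on a made-up Series `s`, shows forward filling with and without limit, plus the common pattern of filling with the mean of the observed values.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two gaps
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: propagate the last valid value into each gap
forward = s.ffill()

# Same, but fill at most one consecutive missing value per gap
forward_limited = s.ffill(limit=1)

# Fill with a computed statistic; mean() skips NaN, so this is (1 + 4) / 2
with_mean = s.fillna(s.mean())

print(forward.tolist())
print(forward_limited.tolist())
print(with_mean.tolist())
```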
Hierarchical indexing
Hierarchical indexing lets you have multiple index levels on a single axis.
Creating a hierarchical index
>>> data = Series(np.random.randn(10),
...               index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
...                      [1, 2, 3, 1, 2, 3, 1, 2, 1, 2]])
>>> data
a  1   -0.450814
   2   -0.776317
   3   -0.140582
b  1   -0.717184
   2    0.943802
   3    0.972454
c  1   -0.390725
   2   -1.340875
d  1   -0.648987
   2   -0.960173
dtype: float64
>>> data.index
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 0, 1]])
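Besides passing a list of arrays as above, a MultiIndex can be built explicitly. The following is a minimal sketch using pd.MultiIndex.from_product, which forms every combination of the given labels; the level names `key` and `num` and the values are invented for illustration.

```python
import pandas as pd

# Every combination of outer labels ("a", "b") and inner labels (1, 2, 3)
index = pd.MultiIndex.from_product([["a", "b"], [1, 2, 3]], names=["key", "num"])

# Hypothetical values, one per index entry
data = pd.Series(range(6), index=index)

print(index.nlevels)   # number of index levels
print(data.index[4])   # the fifth (outer, inner) label pair
```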
Selecting subsets with a hierarchical index
>>> data['a']
1   -0.450814
2   -0.776317
3   -0.140582
dtype: float64
>>> data['c':'d']
c  1   -0.390725
   2   -1.340875
d  1   -0.648987
   2   -0.960173
dtype: float64
>>> data.loc[['a', 'c']]  # .loc replaces the removed .ix indexer
a  1   -0.450814
   2   -0.776317
   3   -0.140582
c  1   -0.390725
   2   -1.340875
dtype: float64
# selecting from the inner level
>>> data['a', 2]
-0.7763173836675796
>>> data[:, 2]
a   -0.776317
b    0.943802
c   -1.340875
d   -0.960173
dtype: float64
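The same selections can be written more explicitly with .loc and xs. This is a sketch on a small made-up Series (the values are hypothetical); xs with the level argument is one way to take a cross-section on the inner level.

```python
import pandas as pd

index = pd.MultiIndex.from_product([["a", "b", "c"], [1, 2]])
data = pd.Series([10, 11, 20, 21, 30, 31], index=index)

# Select several outer labels at once
outer = data.loc[["a", "c"]]

# Select a single element with a tuple of (outer, inner) labels
scalar = data.loc[("b", 2)]

# Cross-section: every entry whose inner label is 2
inner = data.xs(2, level=1)

print(outer.tolist())
print(scalar)
print(inner.to_dict())
```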
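stack and unstack convert between a hierarchically indexed Series and a DataFrame: unstack pivots an index level into columns, and stack pivots the columns back into an index level.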
>>> frame
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
>>> frame.stack()
0  0    1.0
   1    6.5
   2    3.0
1  0    1.0
3  1    6.5
   2    3.0
dtype: float64
>>> data.unstack()
          1         2         3
a -0.450814 -0.776317 -0.140582
b -0.717184  0.943802  0.972454
c -0.390725 -1.340875       NaN
d -0.648987 -0.960173       NaN
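To make the relationship concrete, here is a small round-trip sketch. It uses a made-up Series in which every outer label has the same inner labels, so no NaN appears and unstack followed by stack returns the original values.

```python
import pandas as pd

index = pd.MultiIndex.from_product([["a", "b"], [1, 2]])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# unstack moves the innermost index level into the columns
wide = s.unstack()

# stack moves the columns back into the innermost index level
restored = wide.stack()

print(wide.shape)
print(restored.tolist())
```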
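Reordering levels
swaplevel swaps two levels of a hierarchical index, identified by level number or by name.
sortlevel sorts the data by the values in a given level. It is deprecated; use sort_index(level=...) instead.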
>>> frame = DataFrame(np.random.randn(5, 4),
...                   index=[['a', 'a', 'a', 'b', 'b'], [1, 2, 3, 1, 2]],
...                   columns=pd.MultiIndex.from_arrays([['o', 'o', 'w', 'w'], [1, 2, 1, 2]],
...                                                     names=['color', 'num']))
>>> frame
color         o                   w
num           1         2         1         2
a 1    1.558178  1.614265  0.674642 -0.269209
  2   -0.324755 -0.486829 -1.086918 -0.496748
  3    0.283367 -0.518154  0.551998  0.747767
b 1    0.904257  1.315240  0.328065 -0.006465
  2    0.249438  0.946020  1.572290 -0.198329
>>> frame.index.names = ['name', 'age']
>>> frame
color            o                   w
num              1         2         1         2
name age
a    1    1.558178  1.614265  0.674642 -0.269209
     2   -0.324755 -0.486829 -1.086918 -0.496748
     3    0.283367 -0.518154  0.551998  0.747767
b    1    0.904257  1.315240  0.328065 -0.006465
     2    0.249438  0.946020  1.572290 -0.198329
>>> frame.swaplevel('name', 'age')
color            o                   w
num              1         2         1         2
age name
1   a     1.558178  1.614265  0.674642 -0.269209
2   a    -0.324755 -0.486829 -1.086918 -0.496748
3   a     0.283367 -0.518154  0.551998  0.747767
1   b     0.904257  1.315240  0.328065 -0.006465
2   b     0.249438  0.946020  1.572290 -0.198329
>>> frame.sortlevel(1)
__main__:1: FutureWarning: sortlevel is deprecated, use sort_index(level= ...)
color            o                   w
num              1         2         1         2
name age
a    1    1.558178  1.614265  0.674642 -0.269209
b    1    0.904257  1.315240  0.328065 -0.006465
a    2   -0.324755 -0.486829 -1.086918 -0.496748
b    2    0.249438  0.946020  1.572290 -0.198329
a    3    0.283367 -0.518154  0.551998  0.747767
>>> frame.sort_index(level=1)  # sortlevel is being retired; sort_index with the level option does the same job
color            o                   w
num              1         2         1         2
name age
a    1    1.558178  1.614265  0.674642 -0.269209
b    1    0.904257  1.315240  0.328065 -0.006465
a    2   -0.324755 -0.486829 -1.086918 -0.496748
b    2    0.249438  0.946020  1.572290 -0.198329
a    3    0.283367 -0.518154  0.551998  0.747767
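The two operations are often chained: swap the levels, then sort so that the new outer level is grouped together. Below is a deterministic sketch with a hypothetical single-column frame (the column name `value` and its numbers are made up) that relies only on swaplevel and sort_index.

```python
import pandas as pd

frame = pd.DataFrame(
    {"value": [10, 20, 30, 40, 50]},
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "a", "b", "b"], [1, 2, 3, 1, 2]],
        names=["name", "age"],
    ),
)

# Swap the two index levels by name; the row order itself is unchanged
swapped = frame.swaplevel("name", "age")

# Sort by the (now outer) age level; remaining levels break ties
ordered = swapped.sort_index(level="age")

print(list(ordered.index))
print(ordered["value"].tolist())
```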
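Summary statistics by level
Many descriptive and summary statistics on DataFrame and Series have a level option that specifies the level of an axis to aggregate over. The level option was removed in pandas 2.0; there, groupby(level=...) followed by the aggregation gives the same result.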
>>> frame.sum(level='age')
color         o                   w
num           1         2         1         2
age
1      2.462435  2.929505  1.002707 -0.275673
2     -0.075318  0.459191  0.485372 -0.695077
3      0.283367 -0.518154  0.551998  0.747767
>>> frame.sum(level='color', axis=1)
color            o         w
name age
a    1    3.172443  0.405433
     2   -0.811584 -1.583666
     3   -0.234786  1.299765
b    1    2.219497  0.321600
     2    1.195458  1.373961
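On pandas versions where sum no longer accepts level, the same aggregation can be expressed with groupby. The sketch below uses a small made-up frame with the same level names as the session above; transposing before and after the groupby is one way (an assumption about style, not the only option) to aggregate over a column level.

```python
import pandas as pd

frame = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "b"], [1, 2, 1]], names=["name", "age"]
    ),
    columns=pd.MultiIndex.from_arrays(
        [["o", "o", "w", "w"], [1, 2, 1, 2]], names=["color", "num"]
    ),
)

# Row-wise: add up all rows that share the same "age" label
by_age = frame.groupby(level="age").sum()

# Column-wise: transpose, group on the "color" level, then transpose back
by_color = frame.T.groupby(level="color").sum().T

print(by_age)
print(by_color)
```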
Using DataFrame columns as a hierarchical row index
>>> frame = DataFrame({'a': range(7), 'b': range(7, 0, -1),
...                    'c': ['o', 'o', 'o', 't', 't', 'f', 'f'],
...                    'd': [1, 2, 3, 4, 1, 2, 3]})
>>> frame
   a  b  c  d
0  0  7  o  1
1  1  6  o  2
2  2  5  o  3
3  3  4  t  4
4  4  3  t  1
5  5  2  f  2
6  6  1  f  3
>>> frame2 = frame.set_index(['c', 'd'])  # turn one or more columns into the row index
>>> frame2
     a  b
c d
o 1  0  7
  2  1  6
  3  2  5
t 4  3  4
  1  4  3
f 2  5  2
  3  6  1
>>> frame2.reset_index(['c', 'd'])  # move the hierarchical index levels back into columns
   c  d  a  b
0  o  1  0  7
1  o  2  1  6
2  o  3  2  5
3  t  4  3  4
4  t  1  4  3
5  f  2  5  2
6  f  3  6  1
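When columns are converted into a hierarchical row index, they are removed from the DataFrame by default. To keep them as columns as well, pass drop=False.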
>>> frame3 = frame.set_index(['c', 'd'], drop=False)
>>> frame3
     a  b  c  d
c d
o 1  0  7  o  1
  2  1  6  o  2
  3  2  5  o  3
t 4  3  4  t  4
  1  4  3  t  1
f 2  5  2  f  2
  3  6  1  f  3
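reset_index does not have to move every level back at once. The sketch below, on a smaller made-up frame, resets only the inner level via the level argument and then resets everything to recover the original columns.

```python
import pandas as pd

# Hypothetical three-row version of the frame used above
frame = pd.DataFrame({"a": [0, 1, 2], "c": ["o", "o", "t"], "d": [1, 2, 1]})

indexed = frame.set_index(["c", "d"])

# Move only the "d" level back into the columns; "c" stays as the index
partial = indexed.reset_index(level="d")

# Move every level back; the former index levels come first in column order
restored = indexed.reset_index()

print(list(partial.columns))
print(list(restored.columns))
```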