Pandas 缺失数据
第6章 缺失数据¶
In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('data/table_missing.csv')
df.head()
Out[1]:
一、缺失观测及其类型¶
In [2]:
df['Physics'].isna().head()
Out[2]:
In [3]:
df['Physics'].notna().head()
Out[3]:
对DataFrame使用会返回布尔表¶
In [4]:
df.isna().head()
Out[4]:
但对于DataFrame我们更关心到底每列有多少缺失值¶
In [5]:
df.isna().sum()
Out[5]:
此外,可以通过第1章中介绍的info函数查看缺失信息¶
In [6]:
df.info()
(b)查看缺失值的所以在行¶
以最后一列为例,挑出该列缺失值的行¶
In [7]:
df[df['Physics'].isna()]
Out[7]:
In [8]:
df[df.notna().all(1)]
Out[8]:
In [9]:
np.nan == np.nan
Out[9]:
In [10]:
np.nan == 0
Out[10]:
In [11]:
np.nan == None
Out[11]:
在用equals函数比较时,自动略过两侧全是np.nan的单元格,因此结果不会影响¶
In [12]:
df.equals(df)
Out[12]:
其次,它在numpy中的类型为浮点,由此导致数据集读入时,即使原来是整数的列,只要有缺失值就会变为浮点型¶
In [13]:
type(np.nan)
Out[13]:
In [14]:
pd.Series([1,2,3]).dtype
Out[14]:
In [15]:
pd.Series([1,np.nan,3]).dtype
Out[15]:
此外,对于布尔类型的列表,如果是np.nan填充,那么它的值会自动变为True而不是False¶
In [16]:
pd.Series([1,np.nan,3],dtype='bool')
Out[16]:
但当修改一个布尔列表时,会改变列表类型,而不是赋值为True¶
In [17]:
s = pd.Series([True,False],dtype='bool')
s[1]=np.nan
s
Out[17]:
In [18]:
df['ID'].dtype
Out[18]:
In [19]:
df['Math'].dtype
Out[19]:
In [20]:
df['Class'].dtype
Out[20]:
In [21]:
None == None
Out[21]:
它的布尔值为False¶
In [22]:
pd.Series([None],dtype='bool')
Out[22]:
修改布尔列表不会改变数据类型¶
In [23]:
s = pd.Series([True,False],dtype='bool')
s[0]=None
s
Out[23]:
In [24]:
s = pd.Series([1,0],dtype='bool')
s[0]=None
s
Out[24]:
在传入数值类型后,会自动变为np.nan¶
In [25]:
type(pd.Series([1,None])[1])
Out[25]:
只有当传入object类型是保持不动,几乎可以认为,除非人工命名None,它基本不会自动出现在Pandas中¶
In [26]:
type(pd.Series([1,None],dtype='O')[1])
Out[26]:
在使用equals函数时不会被略过,因此下面的情况下返回False¶
In [27]:
pd.Series([None]).equals(pd.Series([np.nan]))
Out[27]:
In [28]:
s_time = pd.Series([pd.Timestamp('20120101')]*5)
s_time
Out[28]:
In [29]:
s_time[2] = None
s_time
Out[29]:
In [30]:
s_time[2] = np.nan
s_time
Out[30]:
In [31]:
s_time[2] = pd.NaT
s_time
Out[31]:
In [32]:
type(s_time[2])
Out[32]:
In [33]:
s_time[2] == s_time[2]
Out[33]:
In [34]:
s_time.equals(s_time)
Out[34]:
In [35]:
s = pd.Series([True,False],dtype='bool')
s[1]=pd.NaT
s
Out[35]:
3. Nullable类型与NA符号¶
这是Pandas在1.0新版本中引入的重大改变,其目的就是为了(在若干版本后)解决之前出现的混乱局面,统一缺失值处理方法¶
"The goal of pd.NA is provide a “missing” indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type)."——User Guide for Pandas v-1.0¶
官方鼓励用户使用新的数据类型和缺失类型pd.NA¶
In [36]:
s_original = pd.Series([1, 2], dtype="int64")
s_original
Out[36]:
In [37]:
s_new = pd.Series([1, 2], dtype="Int64")
s_new
Out[37]:
它的好处就在于,其中前面提到的三种缺失值都会被替换为统一的NA符号,且不改变数据类型¶
In [38]:
s_original[1] = np.nan
s_original
Out[38]:
In [39]:
s_new[1] = np.nan
s_new
Out[39]:
In [40]:
s_new[1] = None
s_new
Out[40]:
In [41]:
s_new[1] = pd.NaT
s_new
Out[41]:
In [42]:
s_original = pd.Series([1, 0], dtype="bool")
s_original
Out[42]:
In [43]:
s_new = pd.Series([0, 1], dtype="boolean")
s_new
Out[43]:
In [44]:
s_original[0] = np.nan
s_original
Out[44]:
In [45]:
s_original = pd.Series([1, 0], dtype="bool") #此处重新加一句是因为前面赋值改变了bool类型
s_original[0] = None
s_original
Out[45]:
In [46]:
s_new[0] = np.nan
s_new
Out[46]:
In [47]:
s_new[0] = None
s_new
Out[47]:
In [48]:
s_new[0] = pd.NaT
s_new
Out[48]:
In [49]:
s = pd.Series(['dog','cat'])
s[s_new]
Out[49]:
In [50]:
s = pd.Series(['dog','cat'],dtype='string')
s
Out[50]:
In [51]:
s[0] = np.nan
s
Out[51]:
In [52]:
s[0] = None
s
Out[52]:
In [53]:
s[0] = pd.NaT
s
Out[53]:
此外,和object类型的一点重要区别就在于,在调用字符方法后,string类型返回的是Nullable类型,object则会根据缺失类型和数据类型而改变¶
In [54]:
s = pd.Series(["a", None, "b"], dtype="string")
s.str.count('a')
Out[54]:
In [55]:
s2 = pd.Series(["a", None, "b"], dtype="object")
s2.str.count("a")
Out[55]:
In [56]:
s.str.isdigit()
Out[56]:
In [57]:
s2.str.isdigit()
Out[57]:
4. NA的特性¶
In [58]:
True | pd.NA
Out[58]:
In [59]:
pd.NA | True
Out[59]:
In [60]:
False | pd.NA
Out[60]:
In [61]:
False & pd.NA
Out[61]:
In [62]:
True & pd.NA
Out[62]:
取值不明直接报错¶
In [63]:
#bool(pd.NA)
In [64]:
pd.NA ** 0
Out[64]:
In [65]:
1 ** pd.NA
Out[65]:
其他情况:¶
In [66]:
pd.NA + 1
Out[66]:
In [67]:
"a" * pd.NA
Out[67]:
In [68]:
pd.NA == pd.NA
Out[68]:
In [69]:
pd.NA < 2.5
Out[69]:
In [70]:
np.log(pd.NA)
Out[70]:
In [71]:
np.add(pd.NA, 1)
Out[71]:
In [72]:
pd.read_csv('data/table_missing.csv').dtypes
Out[72]:
In [73]:
pd.read_csv('data/table_missing.csv').convert_dtypes().dtypes
Out[73]:
二、缺失数据的运算与分组¶
1. 加号与乘号规则¶
使用加法时,缺失值为0¶
In [74]:
s = pd.Series([2,3,np.nan,4])
s.sum()
Out[74]:
使用乘法时,缺失值为1¶
In [75]:
s.prod()
Out[75]:
使用累计函数时,缺失值自动略过¶
In [76]:
s.cumsum()
Out[76]:
In [77]:
s.cumprod()
Out[77]:
In [78]:
s.pct_change()
Out[78]:
In [79]:
df_g = pd.DataFrame({'one':['A','B','C','D',np.nan],'two':np.random.randn(5)})
df_g
Out[79]:
In [80]:
df_g.groupby('one').groups
Out[80]:
三、填充与剔除¶
1. fillna方法¶
(a)值填充与前后向填充(分别与ffill方法和bfill方法等价)¶
In [81]:
df['Physics'].fillna('missing').head()
Out[81]:
In [82]:
df['Physics'].fillna(method='ffill').head()
Out[82]:
In [83]:
df['Physics'].fillna(method='backfill').head()
Out[83]:
(b)填充中的对齐特性¶
In [84]:
df_f = pd.DataFrame({'A':[1,3,np.nan],'B':[2,4,np.nan],'C':[3,5,np.nan]})
df_f.fillna(df_f.mean())
Out[84]:
返回的结果中没有C,根据对齐特点不会被填充¶
In [85]:
df_f.fillna(df_f.mean()[['A','B']])
Out[85]:
2. dropna方法¶
(a)axis参数¶
In [86]:
df_d = pd.DataFrame({'A':[np.nan,np.nan,np.nan],'B':[np.nan,3,2],'C':[3,2,1]})
df_d
Out[86]:
In [87]:
df_d.dropna(axis=0)
Out[87]:
In [88]:
df_d.dropna(axis=1)
Out[88]:
(b)how参数(可以选all或者any,表示全为缺失去除和存在缺失去除)¶
In [89]:
df_d.dropna(axis=1,how='all')
Out[89]:
(c)subset参数(即在某一组列范围中搜索缺失值)¶
In [90]:
df_d.dropna(axis=0,subset=['B','C'])
Out[90]:
四、插值(interpolation)¶
1. 线性插值¶
In [91]:
s = pd.Series([1,10,15,-5,-2,np.nan,np.nan,28])
s
Out[91]:
In [92]:
s.interpolate()
Out[92]:
In [93]:
s.interpolate().plot()
Out[93]:
此时的插值与索引无关¶
In [94]:
s.index = np.sort(np.random.randint(50,300,8))
s.interpolate()
#值不变
Out[94]:
In [95]:
s.interpolate().plot()
#后面三个点不是线性的(如果几乎为线性函数,请重新运行上面的一个代码块,这是随机性导致的)
Out[95]:
In [96]:
s.interpolate(method='index').plot()
#可以看到与上面的区别
Out[96]:
如果索引是时间,那么可以按照时间长短插值,对于时间序列将在第9章详细介绍¶
In [97]:
s_t = pd.Series([0,np.nan,10]
,index=[pd.Timestamp('2012-05-01'),pd.Timestamp('2012-05-07'),pd.Timestamp('2012-06-03')])
s_t
Out[97]:
In [98]:
s_t.interpolate().plot()
Out[98]:
In [99]:
s_t.interpolate(method='time').plot()
Out[99]:
In [100]:
ser = pd.Series(np.arange(1, 10.1, .25) ** 2 + np.random.randn(37))
missing = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29])
ser[missing] = np.nan
methods = ['linear', 'quadratic', 'cubic']
df = pd.DataFrame({m: ser.interpolate(method=m) for m in methods})
df.plot()
Out[100]:
In [101]:
s = pd.Series([1,np.nan,np.nan,np.nan,5])
s.interpolate(limit=2)
Out[101]:
(b)limit_direction表示插值方向,可选forward,backward,both,默认前向¶
In [102]:
s = pd.Series([np.nan,np.nan,1,np.nan,np.nan,np.nan,5,np.nan,np.nan,])
s.interpolate(limit_direction='backward')
Out[102]:
(c)limit_area表示插值区域,可选inside,outside,默认None¶
In [103]:
s = pd.Series([np.nan,np.nan,1,np.nan,np.nan,np.nan,5,np.nan,np.nan,])
s.interpolate(limit_area='inside')
Out[103]:
In [104]:
s = pd.Series([np.nan,np.nan,1,np.nan,np.nan,np.nan,5,np.nan,np.nan,])
s.interpolate(limit_area='outside')
Out[104]:
五、问题与练习¶
1. 问题¶
2. 练习¶
In [105]:
pd.read_csv('data/Missing_data_one.csv').head()
Out[105]:
In [106]:
pd.read_csv('data/Missing_data_two.csv').head()
Out[106]: