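Python Data Analysis: Data Cleaning, Transformation, Merging, and Reshaping (Part 2)
1: Removing duplicate data
Duplicate rows often show up in a DataFrame, as in the example below. (The sessions in this post assume numpy is imported as np, pandas as pd, and that Series and DataFrame have been imported from pandas.)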
In [7]: data=DataFrame({'k1':['one']*3+['two']*4,'k2':[1,1,2,3,3,4,4]})
In [8]: data
Out[8]:
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4
The duplicated method returns a boolean Series indicating, for each row, whether it repeats an earlier row:
In [9]: data.duplicated()
Out[9]:
0 False
1 True
2 False
3 False
4 True
5 False
6 True
dtype: bool
Since duplicates can be detected, they can also be dropped:
In [11]: data.drop_duplicates()
Out[11]:
    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4
By default, rows are compared on all columns. You can also restrict the comparison to a specific column:
In [12]: data.drop_duplicates('k1')
Out[12]:
    k1  k2
0  one   1
3  two   3
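As a minimal sketch of two related options, the snippet below uses the `subset` and `keep` parameters of `drop_duplicates` and `duplicated`. It rebuilds the same example data; the variable names `last_per_k1` and `all_dupes` are only illustrative.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# Compare rows on k1 only, and keep the LAST row of each duplicate group
# (the default, keep='first', keeps the first one instead).
last_per_k1 = data.drop_duplicates(subset=['k1'], keep='last')

# keep=False flags every member of a duplicated group, not just the repeats.
all_dupes = data.duplicated(keep=False)

print(last_per_k1)
print(all_dupes)
```

Keeping the last row rather than the first can matter when later rows are more recent records.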
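2: Transforming data using a function or mapping
When transforming a data set, you often want to derive new values from the values in a column. The map method handles this.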
In [13]: data=DataFrame({'food':['bacon','pulled pork','bacon','Pastrami','corned beef','Bacon','pastrami','honey ham','nova lox'],'ounces':[4,3,12,6,7.5,8,3,5,6]})
In [14]: data
Out[14]:
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0
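Suppose you want to add a column stating which animal each type of meat comes from. Start by writing a mapping from meat to animal:
In [15]: meat_to_animal={'bacon':'pig','pulled pork':'pig','pastrami':'cow','corned beef':'cow','honey ham':'pig','nova lox':'salmon'}
Then apply it. Because some food names are capitalized, first convert every name to lowercase, then map the result through meat_to_animal: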
In [16]: data['animal']=data['food'].map(str.lower).map(meat_to_animal)
In [17]: data
Out[17]:
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
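The sketch below shows two equivalent ways to write this kind of lookup, and what happens to a value that is missing from the mapping. The shortened data and the unmapped item 'tofu' are made up for illustration.

```python
import pandas as pd

# 'tofu' is a hypothetical item with no entry in the mapping.
food = pd.Series(['bacon', 'Pastrami', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

# Option 1: vectorized lowercase via the .str accessor, then a dict lookup.
animal = food.str.lower().map(meat_to_animal)

# Option 2: a single function that does both steps for each element.
animal_fn = food.map(lambda x: meat_to_animal.get(x.lower()))

print(animal)
print(animal_fn)
```

In both versions the unmapped item ends up as a missing value, so it is worth checking the result with `isna()` after mapping.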
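3: Replacing values
Earlier, the fillna method was used to fill in missing data, but the replace method is often a simpler way to substitute values.
In the data below, -000 stands for an invalid value. (Python reads the literal -000 as the integer 0, which is why the output shows 0.) The replace method swaps it out directly: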
In [18]: data=Series([1,-000,2,3,4])
In [19]: data
Out[19]:
0 1
1 0
2 2
3 3
4 4
dtype: int64
In [20]: data.replace(-000,np.nan)
Out[20]:
0 1.0
1 NaN
2 2.0
3 3.0
4 4.0
dtype: float64
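replace also accepts several values at once. The following sketch uses made-up sentinel values (-999 and -1000, which are assumptions for illustration and not taken from the data above) to show the list form and the dict form.

```python
import numpy as np
import pandas as pd

# -999 and -1000 are hypothetical sentinel values marking invalid entries.
data = pd.Series([1, -999, 2, -999, -1000, 3])

# A list of values, all replaced by the same substitute.
all_to_nan = data.replace([-999, -1000], np.nan)

# A dict gives each value its own substitute.
per_value = data.replace({-999: np.nan, -1000: 0})

print(all_to_nan)
print(per_value)
```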
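4: Discretization and binning
For analysis, continuous data often needs to be split into discrete groups ("bins"). This is called discretization. Here a list of ages is divided into the bins 18 to 25, 26 to 35, 36 to 60, and 61 to 100: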
In [21]: ages=[20,22,25,27,21,23,37,31,61,45,41,32]
In [22]: bins=[18,25,35,60,100]
In [23]: cats=pd.cut(ages,bins)
In [24]: cats
Out[24]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
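The result records the interval each value falls into. Raw intervals are not very convenient to read, so you can give each bin a label to make the result more intuitive: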
In [25]: group_name=['youth','youngAdult','MiddleAged','Senior']
In [27]: cats=pd.cut(ages,bins,labels=group_name)
In [28]: cats
Out[28]:
[youth, youth, youth, youngAdult, youth, ..., youngAdult, Senior, MiddleAged, MiddleAged, youngAdult]
Length: 12
Categories (4, object): [youth < youngAdult < MiddleAged < Senior]
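As a small sketch of what can be done with the binned result, the code below counts how many ages land in each bin and shows the `right=False` option, which closes each interval on the left instead of the right. The names `counts` and `left_closed` are illustrative.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
group_name = ['youth', 'youngAdult', 'MiddleAged', 'Senior']

# Default: intervals are closed on the right, so 25 belongs to (18, 25].
cats = pd.cut(ages, bins, labels=group_name)
counts = pd.Series(cats).value_counts(sort=False)

# right=False: intervals are closed on the left, so 25 belongs to [25, 35).
left_closed = pd.cut(ages, bins, right=False, labels=group_name)
left_counts = pd.Series(left_closed).value_counts(sort=False)

print(counts)
print(left_counts)
```

The only age that changes group between the two calls is 25, because it sits exactly on a bin edge.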
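If you pass cut a number of bins instead of explicit bin edges, it computes equal-width bins from the minimum and maximum of the data. Below, uniformly distributed random data is split into 4 bins, first with the default interval labels and then with custom names: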
In [29]: data=np.random.rand(20)
In [30]: pd.cut(data,4,precision=2)
Out[30]:
[(0.77, 0.97], (0.15, 0.36], (0.15, 0.36], (0.15, 0.36], (0.77, 0.97], ..., (0.56, 0.77], (0.15, 0.36], (0.15, 0.36], (0.77, 0.97], (0.77, 0.97]]
Length: 20
Categories (4, interval[float64]): [(0.15, 0.36] < (0.36, 0.56] < (0.56, 0.77] < (0.77, 0.97]]
In [32]: group=['first','second','third','fourth']
In [33]: pd.cut(data,4,precision=2,labels=group)
Out[33]:
[fourth, first, first, first, fourth, ..., third, first, first, fourth, fourth]
Length: 20
Categories (4, object): [first < second < third < fourth]
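To see where the equal-width edges come from, the sketch below asks `pd.cut` to return the edges it computed (`retbins=True`) and compares the interior edges with an evenly spaced grid between the minimum and the maximum. The seeded generator is an assumption added to make the run repeatable.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is repeatable (not part of the original session).
rng = np.random.default_rng(0)
data = rng.random(20)

# retbins=True returns the computed bin edges alongside the categories.
cats, edges = pd.cut(data, 4, retbins=True)

# Evenly spaced grid from min to max: 4 bins need 5 edges.
expected = np.linspace(data.min(), data.max(), 5)

print(edges)
print(expected)
```

The lowest edge comes out slightly below the minimum: pandas widens the range a little so that the smallest value still falls inside the first right-closed interval.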
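5: Detecting and filtering outliers
Take the following four columns of normally distributed data. The describe method summarizes each column: count, mean, standard deviation, minimum, maximum, and quartiles.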
In [35]: data=DataFrame(np.random.randn(1000,4))
In [36]: data.describe()
Out[36]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.067684     0.067924     0.025598    -0.002298
std       0.998035     0.992106     1.006835     0.996794
min      -3.428254    -3.548824    -3.184377    -3.745356
25%      -0.774890    -0.591841    -0.641675    -0.644144
50%      -0.116401     0.101143     0.002073    -0.013611
75%       0.616366     0.780282     0.680391     0.654328
max       3.366626     2.653656     3.260383     3.927528
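The summary shows values beyond 3 in absolute terms. One possible way to go from there, sketched under the assumption that "outlier" means an absolute value greater than 3, is to select the affected rows with a boolean mask and then cap the values. The threshold, the seed, and the variable names are illustrative choices.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is repeatable (not part of the original session).
rng = np.random.default_rng(12345)
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Rows in which at least one column exceeds the assumed threshold of 3.
outlier_rows = data[(data.abs() > 3).any(axis=1)]

# Cap every value to the interval [-3, 3].
capped = data.clip(lower=-3, upper=3)

print(outlier_rows)
print(capped.describe())
```

After capping, the min and max rows of describe can no longer go past -3 and 3.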
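6: Permutation and random sampling
The numpy.random.permutation function makes it easy to randomly reorder a Series or the rows of a DataFrame. Calling it with the length of the axis you want to reorder returns an array of integers in random order.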
In [37]: df=DataFrame(np.arange(5*4).reshape(5,4))
In [38]: df
Out[38]:
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
In [39]: sampler=np.random.permutation(5)
In [41]: sampler
Out[41]: array([1, 0, 2, 3, 4])
Passing sampler to take reorders the rows according to the row positions it contains:
In [40]: df.take(sampler)
Out[40]:
    0   1   2   3
1   4   5   6   7
0   0   1   2   3
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
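The same idea extends to random sampling. As a rough sketch: slicing a permutation gives a sample without replacement, and drawing random integer positions gives a sample with replacement. The seeded generator and the sample sizes are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))

# Seeded generator so the example is repeatable.
rng = np.random.default_rng(0)

# Without replacement: take the first 3 positions of a random permutation.
subset = df.take(rng.permutation(len(df))[:3])

# With replacement: draw 10 random row positions, repeats allowed.
with_replacement = df.take(rng.integers(0, len(df), size=10))

print(subset)
print(with_replacement)
```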