Data Processing with pandas

1. Removing duplicate rows

Use the duplicated() function to detect duplicate rows. It returns a boolean Series with one element per row; an element is True if that row is not the first occurrence of its contents.

- keep parameter: specifies which of the duplicated rows is kept (i.e. not marked as a duplicate)
  • Create a DataFrame containing duplicated rows
import numpy as np
from pandas import Series,DataFrame
# create a df
df = DataFrame(data=np.random.randint(0,100,size=(8,4)))
df.iloc[1] = [666,666,666,666]
df.iloc[3] = [666,666,666,666]
df.iloc[6] = [666,666,666,666]
df
     0    1    2    3
0   52    0   19   15
1  666  666  666  666
2   55   52   66   28
3  666  666  666  666
4   81   11   49   12
5   57   80   15    4
6  666  666  666  666
7   13   13   34   87
# several rows of the df were manually set to identical content

  • Use duplicated to inspect the duplicated rows; the ~ operator below negates the mask, so True marks the rows to keep
~df.duplicated(keep='first')
0     True
1     True
2     True
3    False
4     True
5     True
6    False
7     True
dtype: bool
df.loc[~df.duplicated(keep=False)]
    0   1   2   3
0  52   0  19  15
2  55  52  66  28
4  81  11  49  12
5  57  80  15   4
7  13  13  34  87
# drop the duplicated rows by index
indexs = df.loc[df.duplicated(keep='last')].index
df.drop(labels=indexs,axis=0)
     0    1    2    3
0   52    0   19   15
2   55   52   66   28
4   81   11   49   12
5   57   80   15    4
6  666  666  666  666
7   13   13   34   87
  • Removing the duplicated rows

  • Use the drop_duplicates() function to delete duplicated rows

    • drop_duplicates(keep='first'/'last'/False)
df.drop_duplicates(keep='first',inplace=False)
     0    1    2    3
0   52    0   19   15
1  666  666  666  666
2   55   52   66   28
4   81   11   49   12
5   57   80   15    4
7   13   13   34   87
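The workflow above can be condensed into a small self-contained sketch (the data here is made up, with a fixed seed so the counts are reproducible):

```python
import numpy as np
from pandas import DataFrame

np.random.seed(0)  # fixed seed so the result is reproducible
df = DataFrame(data=np.random.randint(0, 100, size=(5, 4)))
df.iloc[1] = [666, 666, 666, 666]
df.iloc[3] = [666, 666, 666, 666]

# keep='first' keeps the first occurrence; keep=False drops every copy
first_kept = df.drop_duplicates(keep='first')   # 4 rows: one 666-row survives
all_dropped = df.drop_duplicates(keep=False)    # 3 rows: both 666-rows removed
```

drop_duplicates also accepts a subset parameter (not used above) to restrict the comparison to particular columns.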

2. Mapping

1) The replace() function: replacing elements

Use the replace() function to map values to new values.

Series replacement

  • Single-value replacement
    • plain replacement
    • dict replacement (recommended)
  • Multi-value replacement
    • list replacement
    • dict replacement (recommended)
  • Parameters
    • to_replace: the element(s) to be replaced

Multi-value list replacement

s = Series(data=[3,4,5,6,8,10,9])
s.replace(to_replace=[5,6],value=['five','six'])
0       3
1       4
2    five
3     six
4       8
5      10
6       9
dtype: object

The other forms work analogously: single-value plain replacement (e.g. s.replace(to_replace=6, value='six')), single-value dict replacement, and multi-value dict replacement (e.g. s.replace(to_replace={5: 'five', 6: 'six'})).

Notes on replace parameters:

  • method: fill the replaced positions from an adjacent value
  • limit: cap the number of consecutive fills
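These two parameters are only honored when no explicit value is given, and they are deprecated in recent pandas releases. A version-safe sketch of the same "fill from the neighbour" idea, using mask() and ffill() instead (my own substitution, not from the original post):

```python
from pandas import Series

s = Series([3, 4, 5, 6, 8, 10, 9])

# equivalent of s.replace(to_replace=5, method='ffill'):
# blank the target value, then propagate the previous value forward
filled = s.mask(s == 5).ffill().astype(int)
print(list(filled))  # [3, 4, 4, 6, 8, 10, 9]
```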

DataFrame replacement

  • Single-value replacement

    • plain replacement: replace every matching element: to_replace=15, value='e'
    • per-column single-value replacement: to_replace={column_label: value_to_replace}, value='new_value'
  • Multi-value replacement

    • list replacement: to_replace=[], value=[]
    • dict replacement (recommended): to_replace={old_value: new_value}
df
     0    1    2    3
0   52    0   19   15
1  666  666  666  666
2   55   52   66   28
3  666  666  666  666
4   81   11   49   12
5   57   80   15    4
6  666  666  666  666
7   13   13   34   87
# replace every 666 in the whole df
df.replace(to_replace=666,value='sixsixsix')
           0          1          2          3
0         52          0         19         15
1  sixsixsix  sixsixsix  sixsixsix  sixsixsix
2         55         52         66         28
3  sixsixsix  sixsixsix  sixsixsix  sixsixsix
4         81         11         49         12
5         57         80         15          4
6  sixsixsix  sixsixsix  sixsixsix  sixsixsix
7         13         13         34         87
# dict form: replace the value 0 with 'zero' wherever it appears
df.replace(to_replace={0:'zero'})
     0     1    2    3
0   52  zero   19   15
1  666   666  666  666
2   55    52   66   28
3  666   666  666  666
4   81    11   49   12
5   57    80   15    4
6  666   666  666  666
7   13    13   34   87
# per-column form: replace 666 with 'six' only in column 2
df.replace(to_replace={2:666},value='six')
     0    1    2    3
0   52    0   19   15
1  666  666  six  666
2   55   52   66   28
3  666  666  six  666
4   81   11   49   12
5   57   80   15    4
6  666  666  six  666
7   13   13   34   87

Note: for DataFrame.replace, the method and limit parameters cannot be used.

============================================

Exercise 19:

Suppose Zhang San's and Li Si's score table contains full marks. The teacher treats a full mark as cheating and wants every full mark (including 150 and 300) set to 0. How can this be done?

============================================
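One possible approach to Exercise 19, using a made-up stand-in for the actual class table (the score frame below is hypothetical): treat both full-mark values as a multi-value replacement.

```python
from pandas import DataFrame

# hypothetical score table: 150 and 300 are the two possible full marks
score = DataFrame({'zhangsan': [150, 95, 300], 'lisi': [88, 150, 120]})

# replace every full mark with 0 in a single call
cleaned = score.replace(to_replace=[150, 300], value=0)
print(cleaned['zhangsan'].tolist())  # [0, 95, 0]
```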

2) The map() function: creating a new column. map is not a DataFrame method but a Series method

  • map() can build a new column by mapping an existing one

  • map() accepts a lambda expression

  • map() accepts a function, including a custom one

    e.g.: map({to_replace:value})

  • Note: aggregation functions such as sum, and for loops, cannot be passed to map()

  • Add a new column: add a column to df whose values are the Chinese names corresponding to the English names

dic = {
    'name':['jay','tom','jay'],
    'salary':[12000,7000,12000]
}
df = DataFrame(data=dic)
df
  name  salary
0  jay   12000
1  tom    7000
2  jay   12000
# mapping table
dic = {
    'jay':'周杰伦',
    'tom':'张三',
}
df['c_name'] = df['name'].map(dic)
df
  name  salary c_name
0  jay   12000    周杰伦
1  tom    7000     张三
2  jay   12000    周杰伦

Think of map as a computation tool; which computation it performs is decided by its argument (a lambda or a function).

  • Using a custom function
# the part of the salary above 3000 is taxed at 50%
df
  name  salary c_name
0  jay   12000    周杰伦
1  tom    7000     张三
2  jay   12000    周杰伦
def after_salary(s):
    if s <= 3000:
        return s
    else:
        return s - (s-3000)*0.5
df['after_sal'] = df['salary'].map(after_salary)
df
  name  salary c_name  after_sal
0  jay   12000    周杰伦     7500.0
1  tom    7000     张三     5000.0
2  jay   12000    周杰伦     7500.0
  • Using a lambda expression

Note: not every function can serve as map's argument; only a function that takes exactly one argument and returns a value can.
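The lambda bullet above has no code attached; here is a small sketch of the same tax rule from the custom-function example, written inline as a lambda:

```python
from pandas import DataFrame

df = DataFrame({'name': ['jay', 'tom'], 'salary': [12000, 7000]})

# same rule as after_salary, as a lambda: 50% tax on the part above 3000
df['after_sal'] = df['salary'].map(
    lambda s: s if s <= 3000 else s - (s - 3000) * 0.5)
print(df['after_sal'].tolist())  # [7500.0, 5000.0]
```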

============================================

Exercise 20:

Add two new columns holding Zhang San's and Li Si's score status: "failed" if the score is below 90, "excellent" if it is above 120, and "pass" otherwise.

[Hint] use a function as the argument to map

============================================
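A sketch of one possible answer to Exercise 20, again on a hypothetical score table standing in for the real one:

```python
from pandas import DataFrame

# hypothetical score table
score = DataFrame({'zhangsan': [85, 100, 130], 'lisi': [95, 121, 60]})

def grade(s):
    # below 90 fails, above 120 is excellent, everything else passes
    if s < 90:
        return 'failed'
    elif s > 120:
        return 'excellent'
    return 'pass'

# one status column per student, built with map
score['zhangsan_status'] = score['zhangsan'].map(grade)
score['lisi_status'] = score['lisi'].map(grade)
print(score['zhangsan_status'].tolist())  # ['failed', 'pass', 'excellent']
```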

3. Detecting and filtering outliers with aggregation

The df.std() function returns the standard deviation of each column of a DataFrame.

  • Create a 1000-row, 3-column df with values in [0, 1) and compute each column's standard deviation
df = DataFrame(data=np.random.random(size=(1000,3)),columns=['A','B','C'])
df
            A         B         C
0    0.860007  0.240142  0.972207
1    0.517836  0.448164  0.013285
2    0.814698  0.549706  0.400180
3    0.976913  0.759793  0.700096
4    0.257355  0.031429  0.530168
..        ...       ...       ...
997  0.631840  0.629386  0.573002
998  0.329112  0.152424  0.746059
999  0.263857  0.592957  0.804029

1000 rows × 3 columns

# twice the standard deviation of column C
std_twice = df['C'].std() * 2
# boolean mask: True where C exceeds the threshold
df['C'] > std_twice
# keep only the rows whose C value is within the threshold
df.loc[~(df['C'] > std_twice)]
            A         B         C
1    0.517836  0.448164  0.013285
2    0.814698  0.549706  0.400180
4    0.257355  0.031429  0.530168
6    0.714695  0.433587  0.404421
8    0.513298  0.084707  0.231510
..        ...       ...       ...
989  0.938419  0.693347  0.113945
991  0.112923  0.245280  0.490515
995  0.829652  0.712765  0.551807
997  0.631840  0.629386  0.573002

591 rows × 3 columns

df['C'] > std_twice
0       True
1      False
2      False
3       True
4      False
       ...  
996     True
997    False
998     True
999     True
Name: C, Length: 1000, dtype: bool
df.loc[df['C']  > std_twice]
indexs = df.loc[df['C']  > std_twice].index
indexs
Int64Index([  0,   3,   5,   7,  15,  21,  26,  27,  28,  32,
            ...
            982, 983, 985, 990, 992, 993, 994, 996, 998, 999],
           dtype='int64', length=409)
# blank out the outliers, then fill forward and backward from the neighbouring rows
df.loc[indexs,'C'] = np.nan
df.fillna(axis=0,method='ffill',inplace=True)
df.fillna(axis=0,method='bfill',inplace=True)
df
            A         B         C
0    0.860007  0.240142  0.013285
1    0.517836  0.448164  0.013285
2    0.814698  0.549706  0.400180
3    0.976913  0.759793  0.400180
4    0.257355  0.031429  0.530168
..        ...       ...       ...
997  0.631840  0.629386  0.573002
998  0.329112  0.152424  0.573002
999  0.263857  0.592957  0.573002

1000 rows × 3 columns

Apply a filter to df to remove data whose deviation is too large. Here the assumed condition is: a value in column C greater than twice column C's standard deviation.
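The whole detect-and-repair pipeline above, condensed into one runnable sketch (newer pandas deprecates fillna(method=...), so this uses the equivalent .ffill()/.bfill() methods):

```python
import numpy as np
from pandas import DataFrame

np.random.seed(1)
df = DataFrame(np.random.random(size=(1000, 3)), columns=['A', 'B', 'C'])

# outlier condition: a C value larger than twice C's standard deviation
std_twice = df['C'].std() * 2
outliers = df['C'] > std_twice

# option 1: simply filter the abnormal rows out
filtered = df.loc[~outliers]

# option 2: blank the abnormal values and fill them from neighbouring rows
df.loc[outliers, 'C'] = np.nan
df['C'] = df['C'].ffill().bfill()
```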

============================================

Exercise 21:

Create a new 10000×3 DataFrame of standard normal values (np.random.randn), then remove every row in which any element's absolute value (df.abs()) exceeds 3 times the standard deviation.

============================================
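A sketch of one possible answer to Exercise 21; any(axis=1) collapses the per-element condition into a per-row one:

```python
import numpy as np
from pandas import DataFrame

np.random.seed(0)
df = DataFrame(np.random.randn(10000, 3))

# drop a row when ANY of its elements exceeds 3 standard deviations in absolute value
cond = (df.abs() > 3 * df.std()).any(axis=1)
cleaned = df.loc[~cond]
```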

4. Sorting

Reordering with the take() function

- take() accepts a list of integer positions and reorders the df rows (or columns) to match that list
- e.g.: df.take([1,3,4,2,5])

np.random.permutation() can be used together with take() for random reordering

df.take([2,1,0],axis=1)
            C         B         A
0    0.013285  0.240142  0.860007
1    0.013285  0.448164  0.517836
2    0.400180  0.549706  0.814698
3    0.400180  0.759793  0.976913
4    0.530168  0.031429  0.257355
..        ...       ...       ...
997  0.573002  0.629386  0.631840
998  0.573002  0.152424  0.329112
999  0.573002  0.592957  0.263857

1000 rows × 3 columns

df.take(np.random.permutation(3),axis=1)
            B         A         C
0    0.240142  0.860007  0.013285
1    0.448164  0.517836  0.013285
2    0.549706  0.814698  0.400180
..        ...       ...       ...
997  0.629386  0.631840  0.573002
998  0.152424  0.329112  0.573002
999  0.592957  0.263857  0.573002

1000 rows × 3 columns

random_df = df.take(np.random.permutation(3),axis=1).take(np.random.permutation(1000),axis=0)
random_df
            B         C         A
901  0.842875  0.484079  0.396218
681  0.941300  0.160590  0.085862
97   0.401448  0.289703  0.619326
233  0.460187  0.374978  0.459398
766  0.231968  0.338906  0.836340
..        ...       ...       ...
700  0.596208  0.208127  0.104092
647  0.848684  0.136256  0.753361
692  0.575595  0.184419  0.465930

1000 rows × 3 columns

random_df[0:100]
            B         C         A
901  0.842875  0.484079  0.396218
681  0.941300  0.160590  0.085862
97   0.401448  0.289703  0.619326
..        ...       ...       ...
983  0.651799  0.539369  0.377911
512  0.556697  0.045425  0.233789

100 rows × 3 columns

  • np.random.permutation(x) generates a random permutation of the x integers from 0 to x-1
np.random.permutation(5)
array([1, 2, 3, 0, 4])

Random sampling

When the DataFrame is large enough, np.random.permutation(x) combined with take() implements random sampling.
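A sketch of that sampling idea; note that pandas also provides df.sample(n=...) as a built-in shortcut for the same job:

```python
import numpy as np
from pandas import DataFrame

np.random.seed(2)
df = DataFrame(np.random.random(size=(1000, 3)), columns=['A', 'B', 'C'])

# shuffle all 1000 row positions, then slice the first 100:
# a random sample of 100 distinct rows, without replacement
sample = df.take(np.random.permutation(1000), axis=0)[0:100]
print(sample.shape)  # (100, 3)
```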

5. Grouped data processing [key topic]

Aggregation is usually the last step of data processing: it reduces each group of values to a single number.

Grouped processing:

  • Split: divide the data into groups
  • Apply: apply a function to each group to transform its data
  • Combine: merge the per-group results

The core of grouped processing:
- the groupby() function
- the groups attribute shows how the rows were grouped
- e.g.: df.groupby(by='item').groups

Grouping

from pandas import DataFrame,Series
df = DataFrame({'item':['Apple','Banana','Orange','Banana','Orange','Apple'],
                'price':[4,3,3,2.5,4,2],
               'color':['red','yellow','yellow','green','green','green'],
               'weight':[12,20,50,30,20,44]})
df
    color    item  price  weight
0     red   Apple    4.0      12
1  yellow  Banana    3.0      20
2  yellow  Orange    3.0      50
3   green  Banana    2.5      30
4   green  Orange    4.0      20
5   green   Apple    2.0      44
  • Use groupby to split into groups
df.groupby(by='item',axis=0)
<pandas.core.groupby.DataFrameGroupBy object at 0x000000000EDABA20>
  • Use groups to inspect the grouping
# use the groups attribute to inspect the grouping
df.groupby(by='item',axis=0).groups
{'Apple': Int64Index([0, 5], dtype='int64'),
 'Banana': Int64Index([1, 3], dtype='int64'),
 'Orange': Int64Index([2, 4], dtype='int64')}
  • Aggregation after grouping: group members that can take part in the computation are aggregated; those that cannot (e.g. strings) are left out of the result
# goal: add a new column to df holding each fruit's mean price
df.groupby(by='item',axis=0).mean()
        price  weight
item
Apple    3.00      28
Banana   2.75      25
Orange   3.50      35
mean_price = df.groupby(by='item',axis=0)['price'].mean()
mean_price
item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64
dic = mean_price.to_dict()
df['mean_price'] = df['item'].map(dic)
df
color item price weight mean_price
0 red Apple 4.0 12 3.00
1 yellow Banana 3.0 20 2.75
2 yellow Orange 3.0 50 3.50
3 green Banana 2.5 30 2.75
4 green Orange 4.0 20 3.50
5 green Apple 2.0 44 3.00

Mean price of the fruits of each color:

color_mean_price = df.groupby(by='color',axis=0)['price'].mean()
color_mean_price
color
green     2.833333
red       4.000000
yellow    3.000000
Name: price, dtype: float64
dic = color_mean_price.to_dict()
df['color_mean_price'] = df['color'].map(dic)
df
    color    item  price  weight  mean_price  color_mean_price
0     red   Apple    4.0      12        3.00          4.000000
1  yellow  Banana    3.0      20        2.75          3.000000
2  yellow  Orange    3.0      50        3.50          3.000000
3   green  Banana    2.5      30        2.75          2.833333
4   green  Orange    4.0      20        3.50          2.833333
5   green   Apple    2.0      44        3.00          2.833333

Summary: the mean price per color has been merged back into df.

6. Advanced aggregation

After groupby, transform and apply can take a custom function for more general computations

  • df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
  • transform and apply both run the computation; just pass a function in
  • transform and apply also accept a lambda expression
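The equivalence in the first bullet can be checked directly on the fruit table from the previous section:

```python
from pandas import DataFrame

df = DataFrame({'item': ['Apple', 'Banana', 'Orange', 'Banana', 'Orange', 'Apple'],
                'price': [4, 3, 3, 2.5, 4, 2]})

# the built-in aggregation and the apply form give identical results
by_sum = df.groupby('item')['price'].sum()
by_apply = df.groupby('item')['price'].apply(sum)
print(by_sum.to_dict())  # {'Apple': 6.0, 'Banana': 5.5, 'Orange': 7.0}
```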
df.groupby(by='item')['price'].mean()
item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64
# compute the mean price of each fruit; selecting the column alone only builds a SeriesGroupBy object
df.groupby(by='item')['price']
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x1148c30f0>
df
     item  price   color  weight
0   Apple    4.0     red      12
1  Banana    3.0  yellow      20
2  Orange    3.0  yellow      50
3  Banana    2.5   green      30
4  Orange    4.0   green      20
5   Apple    2.0   green      44
df.groupby(by='item').groups
{'Apple': Int64Index([0, 5], dtype='int64'),
 'Banana': Int64Index([1, 3], dtype='int64'),
 'Orange': Int64Index([2, 4], dtype='int64')}
# use apply with a custom function to compute each fruit's mean price
# (the function must be defined before it is passed to apply)
def fun(s):
    total = 0
    for i in s:
        total += i
    return total / s.size
df.groupby(by='item')['price'].apply(fun)
item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64
#使用transform函数求出水果的平均价格
df.groupby(by='item')['price'].transform(fun)
0    3.00
1    2.75
2    3.50
3    2.75
4    3.50
5    3.00
Name: price, dtype: float64
# apply can also serve as a general computation tool, like map
s = Series(data=[1,2,3,4,5,6,7,87,9,9])
# s.map(lambda x: x * 10) gives the same result
s.apply(lambda x: x * 10)
posted @ 2019-07-01 16:34  海予心