Processing Data with pandas
1. Removing duplicate rows
Use the duplicated() function to detect duplicate rows. It returns a boolean Series with one element per row; the element is True if that row is not the first occurrence of its values.
- keep parameter: specifies which of the duplicated rows counts as the one to keep ('first', 'last', or False to mark every occurrence as a duplicate)
- Create a DataFrame that contains duplicate rows
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
#create a DataFrame of random integers
df = DataFrame(data=np.random.randint(0,100,size=(8,4)))
df.iloc[1] = [666,666,666,666]
df.iloc[3] = [666,666,666,666]
df.iloc[6] = [666,666,666,666]
df
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 52 | 0 | 19 | 15 |
1 | 666 | 666 | 666 | 666 |
2 | 55 | 52 | 66 | 28 |
3 | 666 | 666 | 666 | 666 |
4 | 81 | 11 | 49 | 12 |
5 | 57 | 80 | 15 | 4 |
6 | 666 | 666 | 666 | 666 |
7 | 13 | 13 | 34 | 87 |
#several rows of df were manually set to identical values above
- Use duplicated() to check which rows are duplicates; the negation (~) below marks the rows that will be kept
~df.duplicated(keep='first')
0 True
1 True
2 True
3 False
4 True
5 True
6 False
7 True
dtype: bool
df.loc[~df.duplicated(keep=False)]
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 52 | 0 | 19 | 15 |
2 | 55 | 52 | 66 | 28 |
4 | 81 | 11 | 49 | 12 |
5 | 57 | 80 | 15 | 4 |
7 | 13 | 13 | 34 | 87 |
#drop
indexs = df.loc[df.duplicated(keep='last')].index
df.drop(labels=indexs,axis=0)
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 52 | 0 | 19 | 15 |
2 | 55 | 52 | 66 | 28 |
4 | 81 | 11 | 49 | 12 |
5 | 57 | 80 | 15 | 4 |
6 | 666 | 666 | 666 | 666 |
7 | 13 | 13 | 34 | 87 |
- Dropping rows with duplicate values
- Use the drop_duplicates() function to delete duplicate rows
  - drop_duplicates(keep='first'/'last'/False)
df.drop_duplicates(keep='first',inplace=False)
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 52 | 0 | 19 | 15 |
1 | 666 | 666 | 666 | 666 |
2 | 55 | 52 | 66 | 28 |
4 | 81 | 11 | 49 | 12 |
5 | 57 | 80 | 15 | 4 |
7 | 13 | 13 | 34 | 87 |
2. Mapping
1) replace(): replacing values
Use the replace() function to map values to new values.
Series replacement operations
- Single-value replacement
  - Plain replacement
  - Dictionary replacement (recommended)
- Multi-value replacement
  - List replacement
  - Dictionary replacement (recommended)
- Parameters
  - to_replace: the value(s) to be replaced
Multi-value list replacement
s = Series(data=[3,4,5,6,8,10,9])
s.replace(to_replace=[5,6],value=['five','six'])
0 3
1 4
2 five
3 six
4 8
5 10
6 9
dtype: object
Single-value plain replacement
Single-value dictionary replacement
Multi-value dictionary replacement
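The other variants listed above follow the same pattern; a minimal sketch using the same Series s (the replacement strings are arbitrary):
#single-value plain replacement
s.replace(to_replace=5,value='five')
#single-value dictionary replacement
s.replace(to_replace={5:'five'})
#multi-value dictionary replacement
s.replace(to_replace={5:'five',6:'six'})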
replace() parameter notes:
- method: fill the positions of the replaced values from a neighboring value ('pad'/'ffill' or 'bfill')
- limit: the maximum number of consecutive fills
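A sketch of method and limit, assuming an older pandas version (as used in these notes) where Series.replace still accepts these parameters:
#replace 6 with the value immediately before it (forward fill)
s.replace(to_replace=6,method='ffill')
#fill from the following value instead, with at most one consecutive fill
s.replace(to_replace=6,method='bfill',limit=1)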
DataFrame replacement operations
- Single-value replacement
  - Plain replacement: replaces every matching element, e.g. to_replace=15, value='e'
  - Column-specific single-value replacement: to_replace={column_label: value_to_replace}, value='new_value'
- Multi-value replacement
  - List replacement: to_replace=[...], value=[...]
  - Dictionary replacement (recommended): to_replace={old_value: new_value}
df
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 52 | 0 | 19 | 15 |
1 | 666 | 666 | 666 | 666 |
2 | 55 | 52 | 66 | 28 |
3 | 666 | 666 | 666 | 666 |
4 | 81 | 11 | 49 | 12 |
5 | 57 | 80 | 15 | 4 |
6 | 666 | 666 | 666 | 666 |
7 | 13 | 13 | 34 | 87 |
df.replace(to_replace=666,value='sixsixsix')
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 52 | 0 | 19 | 15 |
1 | sixsixsix | sixsixsix | sixsixsix | sixsixsix |
2 | 55 | 52 | 66 | 28 |
3 | sixsixsix | sixsixsix | sixsixsix | sixsixsix |
4 | 81 | 11 | 49 | 12 |
5 | 57 | 80 | 15 | 4 |
6 | sixsixsix | sixsixsix | sixsixsix | sixsixsix |
7 | 13 | 13 | 34 | 87 |
df.replace(to_replace={0:'zero'})
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 52 | zero | 19 | 15 |
1 | 666 | 666 | 666 | 666 |
2 | 55 | 52 | 66 | 28 |
3 | 666 | 666 | 666 | 666 |
4 | 81 | 11 | 49 | 12 |
5 | 57 | 80 | 15 | 4 |
6 | 666 | 666 | 666 | 666 |
7 | 13 | 13 | 34 | 87 |
df.replace(to_replace={2:666},value='six')
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 52 | 0 | 19 | 15 |
1 | 666 | 666 | six | 666 |
2 | 55 | 52 | 66 | 28 |
3 | 666 | 666 | six | 666 |
4 | 81 | 11 | 49 | 12 |
5 | 57 | 80 | 15 | 4 |
6 | 666 | 666 | six | 666 |
7 | 13 | 13 | 34 | 87 |
Note: with a DataFrame, the method and limit parameters cannot be used.
============================================
Exercise 19:
Suppose the score table for Zhang San and Li Si contains perfect scores. The teacher regards these as cheating and wants every perfect score (including 150 and 300) recorded as 0. How can this be implemented?
============================================
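A possible solution sketch for Exercise 19, assuming a hypothetical score table ddd (the names, subjects and scores are illustrative, not the original exercise data):
#hypothetical score table: one column per student, one row per subject
ddd = DataFrame(data={'张三':[150,130,300],'李四':[120,150,290]},
                index=['语文','数学','理综'])
#record every perfect score (150 and 300) as 0
ddd.replace(to_replace=[150,300],value=0)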
2) map(): creating a new column. Note that map() is not a DataFrame method but a Series method.
- map() can map an existing column to a new column of data
- map() accepts a lambda expression
- map() accepts a function, including a custom one
  - eg: map({to_replace: value})
- Note: map() cannot take aggregation functions such as sum, nor a for loop
- Adding a column: add a column to df whose values are the Chinese names corresponding to the English names
dic = {
'name':['jay','tom','jay'],
'salary':[12000,7000,12000]
}
df = DataFrame(data=dic)
df
name | salary | |
---|---|---|
0 | jay | 12000 |
1 | tom | 7000 |
2 | jay | 12000 |
#mapping table: English name -> Chinese name
dic = {
'jay':'周杰伦',
'tom':'张三',
}
df['c_name'] = df['name'].map(dic)
df
name | salary | c_name | |
---|---|---|---|
0 | jay | 12000 | 周杰伦 |
1 | tom | 7000 | 张三 |
2 | jay | 12000 | 周杰伦 |
map() can be treated as a computation tool; which computation it performs is determined by the argument passed to it (a lambda or a function).
- Using a custom function
#the portion of the salary above 3000 is taxed at 50%
df
name | salary | c_name | |
---|---|---|---|
0 | jay | 12000 | 周杰伦 |
1 | tom | 7000 | 张三 |
2 | jay | 12000 | 周杰伦 |
def after_salary(s):
    if s <= 3000:
        return s
    else:
        return s - (s-3000)*0.5
df['after_sal'] = df['salary'].map(after_salary)
df
name | salary | c_name | after_sal | |
---|---|---|---|---|
0 | jay | 12000 | 周杰伦 | 7500.0 |
1 | tom | 7000 | 张三 | 5000.0 |
2 | jay | 12000 | 周杰伦 | 7500.0 |
- Using a lambda expression (see the sketch below)
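A minimal sketch of the lambda version, equivalent to the after_salary function above:
#the same tax rule expressed as a lambda passed to map
df['after_sal'] = df['salary'].map(lambda s: s if s <= 3000 else s - (s-3000)*0.5)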
Note: not every function can be passed to map(). Only a function that takes a single argument and returns a value can be used as the argument to map().
============================================
Exercise 20:
Add two new columns holding the score status for Zhang San and Li Si respectively: if a score is below 90 the status is "failed", if it is above 120 the status is "excellent", otherwise it is "pass".
[Hint] Use a function as the argument to map()
============================================
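A possible solution sketch for Exercise 20, reusing the hypothetical score table ddd from the Exercise 19 sketch (the status column names are illustrative):
#map each score to a status string
def status(score):
    if score < 90:
        return 'failed'
    elif score > 120:
        return 'excellent'
    else:
        return 'pass'
ddd['张三_status'] = ddd['张三'].map(status)
ddd['李四_status'] = ddd['李四'].map(status)
ddd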
3. Detecting and filtering outliers with aggregation operations
Use the df.std() function to compute the standard deviation of each column of a DataFrame.
- Create a 1000-row, 3-column df with values in the range (0, 1) and compute the standard deviation of each column
df = DataFrame(data=np.random.random(size=(1000,3)),columns=['A','B','C'])
df
A | B | C | |
---|---|---|---|
0 | 0.860007 | 0.240142 | 0.972207 |
1 | 0.517836 | 0.448164 | 0.013285 |
2 | 0.814698 | 0.549706 | 0.400180 |
3 | 0.976913 | 0.759793 | 0.700096 |
4 | 0.257355 | 0.031429 | 0.530168 |
5 | 0.356690 | 0.145748 | 0.869480 |
6 | 0.714695 | 0.433587 | 0.404421 |
7 | 0.243076 | 0.832746 | 0.607988 |
8 | 0.513298 | 0.084707 | 0.231510 |
9 | 0.608363 | 0.567508 | 0.243649 |
10 | 0.834919 | 0.595526 | 0.424388 |
11 | 0.430329 | 0.185759 | 0.316484 |
12 | 0.847074 | 0.248879 | 0.176468 |
13 | 0.372983 | 0.072461 | 0.012102 |
14 | 0.282422 | 0.431422 | 0.355343 |
15 | 0.007912 | 0.124476 | 0.801391 |
16 | 0.829466 | 0.677355 | 0.019161 |
17 | 0.857219 | 0.960232 | 0.513511 |
18 | 0.046890 | 0.183148 | 0.048427 |
19 | 0.019581 | 0.488118 | 0.248252 |
20 | 0.994921 | 0.541453 | 0.447042 |
21 | 0.670733 | 0.106632 | 0.870625 |
22 | 0.061870 | 0.699618 | 0.326010 |
23 | 0.472609 | 0.807719 | 0.483731 |
24 | 0.376022 | 0.932000 | 0.311720 |
25 | 0.866591 | 0.862012 | 0.195621 |
26 | 0.778366 | 0.090125 | 0.937042 |
27 | 0.405581 | 0.000708 | 0.878814 |
28 | 0.432118 | 0.061092 | 0.649828 |
29 | 0.805522 | 0.592622 | 0.359337 |
... | ... | ... | ... |
970 | 0.566573 | 0.128418 | 0.568342 |
971 | 0.327980 | 0.067132 | 0.656326 |
972 | 0.277012 | 0.705513 | 0.343562 |
973 | 0.036581 | 0.280018 | 0.215526 |
974 | 0.697460 | 0.739308 | 0.509766 |
975 | 0.667708 | 0.187524 | 0.832612 |
976 | 0.304881 | 0.320975 | 0.987438 |
977 | 0.314502 | 0.932980 | 0.905878 |
978 | 0.285760 | 0.366236 | 0.308724 |
979 | 0.103418 | 0.205835 | 0.371233 |
980 | 0.495616 | 0.954549 | 0.539369 |
981 | 0.874344 | 0.930889 | 0.825738 |
982 | 0.713143 | 0.923549 | 0.946772 |
983 | 0.377911 | 0.651799 | 0.700629 |
984 | 0.708385 | 0.114808 | 0.130637 |
985 | 0.589167 | 0.994180 | 0.756443 |
986 | 0.432346 | 0.629973 | 0.364151 |
987 | 0.159302 | 0.204101 | 0.569264 |
988 | 0.342391 | 0.229654 | 0.269073 |
989 | 0.938419 | 0.693347 | 0.113945 |
990 | 0.728082 | 0.486067 | 0.843194 |
991 | 0.112923 | 0.245280 | 0.490515 |
992 | 0.118851 | 0.725196 | 0.717325 |
993 | 0.481957 | 0.406391 | 0.632429 |
994 | 0.063079 | 0.219776 | 0.974884 |
995 | 0.829652 | 0.712765 | 0.551807 |
996 | 0.272144 | 0.823870 | 0.930134 |
997 | 0.631840 | 0.629386 | 0.573002 |
998 | 0.329112 | 0.152424 | 0.746059 |
999 | 0.263857 | 0.592957 | 0.804029 |
1000 rows × 3 columns
std_twice = df['C'].std() * 2
df['C'] > std_twice
df.loc[~(df['C'] > std_twice)]
A | B | C | |
---|---|---|---|
1 | 0.517836 | 0.448164 | 0.013285 |
2 | 0.814698 | 0.549706 | 0.400180 |
4 | 0.257355 | 0.031429 | 0.530168 |
6 | 0.714695 | 0.433587 | 0.404421 |
8 | 0.513298 | 0.084707 | 0.231510 |
9 | 0.608363 | 0.567508 | 0.243649 |
10 | 0.834919 | 0.595526 | 0.424388 |
11 | 0.430329 | 0.185759 | 0.316484 |
12 | 0.847074 | 0.248879 | 0.176468 |
13 | 0.372983 | 0.072461 | 0.012102 |
14 | 0.282422 | 0.431422 | 0.355343 |
16 | 0.829466 | 0.677355 | 0.019161 |
17 | 0.857219 | 0.960232 | 0.513511 |
18 | 0.046890 | 0.183148 | 0.048427 |
19 | 0.019581 | 0.488118 | 0.248252 |
20 | 0.994921 | 0.541453 | 0.447042 |
22 | 0.061870 | 0.699618 | 0.326010 |
23 | 0.472609 | 0.807719 | 0.483731 |
24 | 0.376022 | 0.932000 | 0.311720 |
25 | 0.866591 | 0.862012 | 0.195621 |
29 | 0.805522 | 0.592622 | 0.359337 |
30 | 0.456585 | 0.680231 | 0.494607 |
31 | 0.226859 | 0.694393 | 0.234240 |
34 | 0.548207 | 0.976733 | 0.039299 |
36 | 0.169733 | 0.003497 | 0.554480 |
37 | 0.369663 | 0.936746 | 0.415894 |
38 | 0.555920 | 0.172885 | 0.172166 |
40 | 0.919874 | 0.396758 | 0.002643 |
42 | 0.366576 | 0.835003 | 0.427422 |
44 | 0.488103 | 0.017431 | 0.281818 |
... | ... | ... | ... |
952 | 0.590463 | 0.606074 | 0.066888 |
953 | 0.043948 | 0.870355 | 0.143739 |
954 | 0.425745 | 0.468863 | 0.245972 |
955 | 0.626427 | 0.681483 | 0.067346 |
956 | 0.481236 | 0.223866 | 0.557637 |
957 | 0.506315 | 0.448762 | 0.408199 |
959 | 0.718023 | 0.280907 | 0.485149 |
960 | 0.531066 | 0.725144 | 0.074422 |
961 | 0.928132 | 0.104976 | 0.059524 |
962 | 0.634376 | 0.133937 | 0.154810 |
964 | 0.255689 | 0.127695 | 0.176475 |
965 | 0.048704 | 0.460688 | 0.085121 |
966 | 0.149927 | 0.479655 | 0.320294 |
967 | 0.115064 | 0.831730 | 0.279443 |
968 | 0.766362 | 0.544640 | 0.528170 |
970 | 0.566573 | 0.128418 | 0.568342 |
972 | 0.277012 | 0.705513 | 0.343562 |
973 | 0.036581 | 0.280018 | 0.215526 |
974 | 0.697460 | 0.739308 | 0.509766 |
978 | 0.285760 | 0.366236 | 0.308724 |
979 | 0.103418 | 0.205835 | 0.371233 |
980 | 0.495616 | 0.954549 | 0.539369 |
984 | 0.708385 | 0.114808 | 0.130637 |
986 | 0.432346 | 0.629973 | 0.364151 |
987 | 0.159302 | 0.204101 | 0.569264 |
988 | 0.342391 | 0.229654 | 0.269073 |
989 | 0.938419 | 0.693347 | 0.113945 |
991 | 0.112923 | 0.245280 | 0.490515 |
995 | 0.829652 | 0.712765 | 0.551807 |
997 | 0.631840 | 0.629386 | 0.573002 |
591 rows × 3 columns
df['C'] > std_twice
0 True
1 False
2 False
3 True
4 False
5 True
6 False
7 True
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 True
16 False
17 False
18 False
19 False
20 False
21 True
22 False
23 False
24 False
25 False
26 True
27 True
28 True
29 False
...
970 False
971 True
972 False
973 False
974 False
975 True
976 True
977 True
978 False
979 False
980 False
981 True
982 True
983 True
984 False
985 True
986 False
987 False
988 False
989 False
990 True
991 False
992 True
993 True
994 True
995 False
996 True
997 False
998 True
999 True
Name: C, Length: 1000, dtype: bool
df.loc[df['C'] > std_twice]
indexs = df.loc[df['C'] > std_twice].index
indexs
Int64Index([ 0, 3, 5, 7, 15, 21, 26, 27, 28, 32,
...
982, 983, 985, 990, 992, 993, 994, 996, 998, 999],
dtype='int64', length=409)
df.loc[indexs,'C'] = np.nan
#fill the removed outliers from neighboring values: forward fill first, then backward fill to cover any leading NaN
df.fillna(axis=0,method='ffill',inplace=True)
df.fillna(axis=0,method='bfill',inplace=True)
df
A | B | C | |
---|---|---|---|
0 | 0.860007 | 0.240142 | 0.013285 |
1 | 0.517836 | 0.448164 | 0.013285 |
2 | 0.814698 | 0.549706 | 0.400180 |
3 | 0.976913 | 0.759793 | 0.400180 |
4 | 0.257355 | 0.031429 | 0.530168 |
5 | 0.356690 | 0.145748 | 0.530168 |
6 | 0.714695 | 0.433587 | 0.404421 |
7 | 0.243076 | 0.832746 | 0.404421 |
8 | 0.513298 | 0.084707 | 0.231510 |
9 | 0.608363 | 0.567508 | 0.243649 |
10 | 0.834919 | 0.595526 | 0.424388 |
11 | 0.430329 | 0.185759 | 0.316484 |
12 | 0.847074 | 0.248879 | 0.176468 |
13 | 0.372983 | 0.072461 | 0.012102 |
14 | 0.282422 | 0.431422 | 0.355343 |
15 | 0.007912 | 0.124476 | 0.355343 |
16 | 0.829466 | 0.677355 | 0.019161 |
17 | 0.857219 | 0.960232 | 0.513511 |
18 | 0.046890 | 0.183148 | 0.048427 |
19 | 0.019581 | 0.488118 | 0.248252 |
20 | 0.994921 | 0.541453 | 0.447042 |
21 | 0.670733 | 0.106632 | 0.447042 |
22 | 0.061870 | 0.699618 | 0.326010 |
23 | 0.472609 | 0.807719 | 0.483731 |
24 | 0.376022 | 0.932000 | 0.311720 |
25 | 0.866591 | 0.862012 | 0.195621 |
26 | 0.778366 | 0.090125 | 0.195621 |
27 | 0.405581 | 0.000708 | 0.195621 |
28 | 0.432118 | 0.061092 | 0.195621 |
29 | 0.805522 | 0.592622 | 0.359337 |
... | ... | ... | ... |
970 | 0.566573 | 0.128418 | 0.568342 |
971 | 0.327980 | 0.067132 | 0.568342 |
972 | 0.277012 | 0.705513 | 0.343562 |
973 | 0.036581 | 0.280018 | 0.215526 |
974 | 0.697460 | 0.739308 | 0.509766 |
975 | 0.667708 | 0.187524 | 0.509766 |
976 | 0.304881 | 0.320975 | 0.509766 |
977 | 0.314502 | 0.932980 | 0.509766 |
978 | 0.285760 | 0.366236 | 0.308724 |
979 | 0.103418 | 0.205835 | 0.371233 |
980 | 0.495616 | 0.954549 | 0.539369 |
981 | 0.874344 | 0.930889 | 0.539369 |
982 | 0.713143 | 0.923549 | 0.539369 |
983 | 0.377911 | 0.651799 | 0.539369 |
984 | 0.708385 | 0.114808 | 0.130637 |
985 | 0.589167 | 0.994180 | 0.130637 |
986 | 0.432346 | 0.629973 | 0.364151 |
987 | 0.159302 | 0.204101 | 0.569264 |
988 | 0.342391 | 0.229654 | 0.269073 |
989 | 0.938419 | 0.693347 | 0.113945 |
990 | 0.728082 | 0.486067 | 0.113945 |
991 | 0.112923 | 0.245280 | 0.490515 |
992 | 0.118851 | 0.725196 | 0.490515 |
993 | 0.481957 | 0.406391 | 0.490515 |
994 | 0.063079 | 0.219776 | 0.490515 |
995 | 0.829652 | 0.712765 | 0.551807 |
996 | 0.272144 | 0.823870 | 0.551807 |
997 | 0.631840 | 0.629386 | 0.573002 |
998 | 0.329112 | 0.152424 | 0.573002 |
999 | 0.263857 | 0.592957 | 0.573002 |
1000 rows × 3 columns
Apply the filter condition to df to remove values that deviate too much: here the condition is that a value in column C is greater than twice the standard deviation of column C.
============================================
Exercise 21:
Create a 10000 x 3 DataFrame of standard normal values (np.random.randn) and remove every row in which any element's absolute value (df.abs()) exceeds 3 times the standard deviation.
============================================
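A possible solution sketch for Exercise 21 (df21 is a hypothetical name; a row is removed if any of its values exceeds 3 standard deviations of its column):
#10000 x 3 DataFrame of standard normal values
df21 = DataFrame(data=np.random.randn(10000,3))
#True for every row that contains at least one element beyond 3 standard deviations
cond = (df21.abs() > 3*df21.std()).any(axis=1)
#keep only the rows that do not violate the condition
df21 = df21.loc[~cond]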
4. Sorting
Sorting with the take() function
- take() accepts a list of positional indices (integers) and reorders the rows of df according to the order in that list (see the row-level sketch below)
- eg: df.take([1,3,4,2,5])
- np.random.permutation() can be used together with take() to shuffle the order randomly
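A minimal sketch of reordering rows by position with take(); the column reordering and the random shuffle follow in the cells below:
#select the rows at positions 1,3,4,2,5 of df, in that order
df.take([1,3,4,2,5],axis=0)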
df.take([2,1,0],axis=1)
C | B | A | |
---|---|---|---|
0 | 0.013285 | 0.240142 | 0.860007 |
1 | 0.013285 | 0.448164 | 0.517836 |
2 | 0.400180 | 0.549706 | 0.814698 |
3 | 0.400180 | 0.759793 | 0.976913 |
4 | 0.530168 | 0.031429 | 0.257355 |
5 | 0.530168 | 0.145748 | 0.356690 |
6 | 0.404421 | 0.433587 | 0.714695 |
7 | 0.404421 | 0.832746 | 0.243076 |
8 | 0.231510 | 0.084707 | 0.513298 |
9 | 0.243649 | 0.567508 | 0.608363 |
10 | 0.424388 | 0.595526 | 0.834919 |
11 | 0.316484 | 0.185759 | 0.430329 |
12 | 0.176468 | 0.248879 | 0.847074 |
13 | 0.012102 | 0.072461 | 0.372983 |
14 | 0.355343 | 0.431422 | 0.282422 |
15 | 0.355343 | 0.124476 | 0.007912 |
16 | 0.019161 | 0.677355 | 0.829466 |
17 | 0.513511 | 0.960232 | 0.857219 |
18 | 0.048427 | 0.183148 | 0.046890 |
19 | 0.248252 | 0.488118 | 0.019581 |
20 | 0.447042 | 0.541453 | 0.994921 |
21 | 0.447042 | 0.106632 | 0.670733 |
22 | 0.326010 | 0.699618 | 0.061870 |
23 | 0.483731 | 0.807719 | 0.472609 |
24 | 0.311720 | 0.932000 | 0.376022 |
25 | 0.195621 | 0.862012 | 0.866591 |
26 | 0.195621 | 0.090125 | 0.778366 |
27 | 0.195621 | 0.000708 | 0.405581 |
28 | 0.195621 | 0.061092 | 0.432118 |
29 | 0.359337 | 0.592622 | 0.805522 |
... | ... | ... | ... |
970 | 0.568342 | 0.128418 | 0.566573 |
971 | 0.568342 | 0.067132 | 0.327980 |
972 | 0.343562 | 0.705513 | 0.277012 |
973 | 0.215526 | 0.280018 | 0.036581 |
974 | 0.509766 | 0.739308 | 0.697460 |
975 | 0.509766 | 0.187524 | 0.667708 |
976 | 0.509766 | 0.320975 | 0.304881 |
977 | 0.509766 | 0.932980 | 0.314502 |
978 | 0.308724 | 0.366236 | 0.285760 |
979 | 0.371233 | 0.205835 | 0.103418 |
980 | 0.539369 | 0.954549 | 0.495616 |
981 | 0.539369 | 0.930889 | 0.874344 |
982 | 0.539369 | 0.923549 | 0.713143 |
983 | 0.539369 | 0.651799 | 0.377911 |
984 | 0.130637 | 0.114808 | 0.708385 |
985 | 0.130637 | 0.994180 | 0.589167 |
986 | 0.364151 | 0.629973 | 0.432346 |
987 | 0.569264 | 0.204101 | 0.159302 |
988 | 0.269073 | 0.229654 | 0.342391 |
989 | 0.113945 | 0.693347 | 0.938419 |
990 | 0.113945 | 0.486067 | 0.728082 |
991 | 0.490515 | 0.245280 | 0.112923 |
992 | 0.490515 | 0.725196 | 0.118851 |
993 | 0.490515 | 0.406391 | 0.481957 |
994 | 0.490515 | 0.219776 | 0.063079 |
995 | 0.551807 | 0.712765 | 0.829652 |
996 | 0.551807 | 0.823870 | 0.272144 |
997 | 0.573002 | 0.629386 | 0.631840 |
998 | 0.573002 | 0.152424 | 0.329112 |
999 | 0.573002 | 0.592957 | 0.263857 |
1000 rows × 3 columns
df.take(np.random.permutation(3),axis=1)
B | A | C | |
---|---|---|---|
0 | 0.240142 | 0.860007 | 0.013285 |
1 | 0.448164 | 0.517836 | 0.013285 |
2 | 0.549706 | 0.814698 | 0.400180 |
3 | 0.759793 | 0.976913 | 0.400180 |
4 | 0.031429 | 0.257355 | 0.530168 |
5 | 0.145748 | 0.356690 | 0.530168 |
6 | 0.433587 | 0.714695 | 0.404421 |
7 | 0.832746 | 0.243076 | 0.404421 |
8 | 0.084707 | 0.513298 | 0.231510 |
9 | 0.567508 | 0.608363 | 0.243649 |
10 | 0.595526 | 0.834919 | 0.424388 |
11 | 0.185759 | 0.430329 | 0.316484 |
12 | 0.248879 | 0.847074 | 0.176468 |
13 | 0.072461 | 0.372983 | 0.012102 |
14 | 0.431422 | 0.282422 | 0.355343 |
15 | 0.124476 | 0.007912 | 0.355343 |
16 | 0.677355 | 0.829466 | 0.019161 |
17 | 0.960232 | 0.857219 | 0.513511 |
18 | 0.183148 | 0.046890 | 0.048427 |
19 | 0.488118 | 0.019581 | 0.248252 |
20 | 0.541453 | 0.994921 | 0.447042 |
21 | 0.106632 | 0.670733 | 0.447042 |
22 | 0.699618 | 0.061870 | 0.326010 |
23 | 0.807719 | 0.472609 | 0.483731 |
24 | 0.932000 | 0.376022 | 0.311720 |
25 | 0.862012 | 0.866591 | 0.195621 |
26 | 0.090125 | 0.778366 | 0.195621 |
27 | 0.000708 | 0.405581 | 0.195621 |
28 | 0.061092 | 0.432118 | 0.195621 |
29 | 0.592622 | 0.805522 | 0.359337 |
... | ... | ... | ... |
970 | 0.128418 | 0.566573 | 0.568342 |
971 | 0.067132 | 0.327980 | 0.568342 |
972 | 0.705513 | 0.277012 | 0.343562 |
973 | 0.280018 | 0.036581 | 0.215526 |
974 | 0.739308 | 0.697460 | 0.509766 |
975 | 0.187524 | 0.667708 | 0.509766 |
976 | 0.320975 | 0.304881 | 0.509766 |
977 | 0.932980 | 0.314502 | 0.509766 |
978 | 0.366236 | 0.285760 | 0.308724 |
979 | 0.205835 | 0.103418 | 0.371233 |
980 | 0.954549 | 0.495616 | 0.539369 |
981 | 0.930889 | 0.874344 | 0.539369 |
982 | 0.923549 | 0.713143 | 0.539369 |
983 | 0.651799 | 0.377911 | 0.539369 |
984 | 0.114808 | 0.708385 | 0.130637 |
985 | 0.994180 | 0.589167 | 0.130637 |
986 | 0.629973 | 0.432346 | 0.364151 |
987 | 0.204101 | 0.159302 | 0.569264 |
988 | 0.229654 | 0.342391 | 0.269073 |
989 | 0.693347 | 0.938419 | 0.113945 |
990 | 0.486067 | 0.728082 | 0.113945 |
991 | 0.245280 | 0.112923 | 0.490515 |
992 | 0.725196 | 0.118851 | 0.490515 |
993 | 0.406391 | 0.481957 | 0.490515 |
994 | 0.219776 | 0.063079 | 0.490515 |
995 | 0.712765 | 0.829652 | 0.551807 |
996 | 0.823870 | 0.272144 | 0.551807 |
997 | 0.629386 | 0.631840 | 0.573002 |
998 | 0.152424 | 0.329112 | 0.573002 |
999 | 0.592957 | 0.263857 | 0.573002 |
1000 rows × 3 columns
random_df = df.take(np.random.permutation(3),axis=1).take(np.random.permutation(1000),axis=0)
random_df
B | C | A | |
---|---|---|---|
901 | 0.842875 | 0.484079 | 0.396218 |
681 | 0.941300 | 0.160590 | 0.085862 |
97 | 0.401448 | 0.289703 | 0.619326 |
233 | 0.460187 | 0.374978 | 0.459398 |
766 | 0.231968 | 0.338906 | 0.836340 |
728 | 0.500936 | 0.035901 | 0.790860 |
819 | 0.049642 | 0.201026 | 0.248039 |
26 | 0.090125 | 0.195621 | 0.778366 |
570 | 0.063299 | 0.377617 | 0.503758 |
833 | 0.681264 | 0.085146 | 0.907368 |
114 | 0.788951 | 0.343372 | 0.225716 |
822 | 0.026554 | 0.338009 | 0.545263 |
642 | 0.113898 | 0.455778 | 0.690428 |
352 | 0.162700 | 0.075357 | 0.392956 |
649 | 0.116311 | 0.029454 | 0.569790 |
460 | 0.567423 | 0.090965 | 0.959869 |
758 | 0.021598 | 0.424248 | 0.372490 |
166 | 0.498037 | 0.060493 | 0.475296 |
806 | 0.990978 | 0.549066 | 0.233480 |
676 | 0.896488 | 0.177937 | 0.750045 |
47 | 0.102752 | 0.568177 | 0.181398 |
648 | 0.287340 | 0.299368 | 0.540929 |
266 | 0.599017 | 0.133734 | 0.257810 |
639 | 0.164303 | 0.473354 | 0.345383 |
73 | 0.194962 | 0.456506 | 0.026895 |
49 | 0.400460 | 0.333073 | 0.212971 |
299 | 0.647202 | 0.227759 | 0.135387 |
633 | 0.693565 | 0.400427 | 0.277807 |
996 | 0.823870 | 0.551807 | 0.272144 |
625 | 0.719373 | 0.115486 | 0.233699 |
... | ... | ... | ... |
596 | 0.729380 | 0.135190 | 0.960753 |
121 | 0.519113 | 0.444025 | 0.718233 |
351 | 0.997335 | 0.075357 | 0.249110 |
661 | 0.086639 | 0.567981 | 0.462961 |
108 | 0.185337 | 0.244531 | 0.985059 |
678 | 0.006171 | 0.296101 | 0.840489 |
164 | 0.186113 | 0.555717 | 0.325252 |
701 | 0.524764 | 0.504779 | 0.907671 |
253 | 0.452349 | 0.267856 | 0.900799 |
147 | 0.998336 | 0.532935 | 0.633324 |
825 | 0.316121 | 0.102929 | 0.283131 |
749 | 0.742133 | 0.544720 | 0.474593 |
782 | 0.084357 | 0.442086 | 0.854215 |
355 | 0.866595 | 0.065320 | 0.474957 |
432 | 0.748315 | 0.095471 | 0.490352 |
386 | 0.734383 | 0.000074 | 0.779548 |
548 | 0.416296 | 0.031425 | 0.493160 |
923 | 0.133347 | 0.527376 | 0.701312 |
755 | 0.806736 | 0.395561 | 0.595543 |
481 | 0.349173 | 0.301911 | 0.940143 |
150 | 0.123704 | 0.304874 | 0.692476 |
426 | 0.654830 | 0.564152 | 0.627674 |
406 | 0.039941 | 0.356761 | 0.511435 |
324 | 0.214678 | 0.580897 | 0.376711 |
374 | 0.724043 | 0.039151 | 0.284666 |
214 | 0.570377 | 0.467031 | 0.343456 |
405 | 0.706814 | 0.464470 | 0.831942 |
700 | 0.596208 | 0.208127 | 0.104092 |
647 | 0.848684 | 0.136256 | 0.753361 |
692 | 0.575595 | 0.184419 | 0.465930 |
1000 rows × 3 columns
random_df[0:100]
B | C | A | |
---|---|---|---|
901 | 0.842875 | 0.484079 | 0.396218 |
681 | 0.941300 | 0.160590 | 0.085862 |
97 | 0.401448 | 0.289703 | 0.619326 |
233 | 0.460187 | 0.374978 | 0.459398 |
766 | 0.231968 | 0.338906 | 0.836340 |
728 | 0.500936 | 0.035901 | 0.790860 |
819 | 0.049642 | 0.201026 | 0.248039 |
26 | 0.090125 | 0.195621 | 0.778366 |
570 | 0.063299 | 0.377617 | 0.503758 |
833 | 0.681264 | 0.085146 | 0.907368 |
114 | 0.788951 | 0.343372 | 0.225716 |
822 | 0.026554 | 0.338009 | 0.545263 |
642 | 0.113898 | 0.455778 | 0.690428 |
352 | 0.162700 | 0.075357 | 0.392956 |
649 | 0.116311 | 0.029454 | 0.569790 |
460 | 0.567423 | 0.090965 | 0.959869 |
758 | 0.021598 | 0.424248 | 0.372490 |
166 | 0.498037 | 0.060493 | 0.475296 |
806 | 0.990978 | 0.549066 | 0.233480 |
676 | 0.896488 | 0.177937 | 0.750045 |
47 | 0.102752 | 0.568177 | 0.181398 |
648 | 0.287340 | 0.299368 | 0.540929 |
266 | 0.599017 | 0.133734 | 0.257810 |
639 | 0.164303 | 0.473354 | 0.345383 |
73 | 0.194962 | 0.456506 | 0.026895 |
49 | 0.400460 | 0.333073 | 0.212971 |
299 | 0.647202 | 0.227759 | 0.135387 |
633 | 0.693565 | 0.400427 | 0.277807 |
996 | 0.823870 | 0.551807 | 0.272144 |
625 | 0.719373 | 0.115486 | 0.233699 |
... | ... | ... | ... |
320 | 0.793754 | 0.270634 | 0.576142 |
991 | 0.245280 | 0.490515 | 0.112923 |
945 | 0.628873 | 0.315306 | 0.106584 |
464 | 0.279384 | 0.181075 | 0.813286 |
896 | 0.297097 | 0.105391 | 0.439719 |
450 | 0.541616 | 0.324493 | 0.078195 |
779 | 0.172180 | 0.212489 | 0.869420 |
492 | 0.068938 | 0.420931 | 0.288381 |
618 | 0.801756 | 0.422841 | 0.611366 |
815 | 0.866783 | 0.493224 | 0.364423 |
400 | 0.092009 | 0.511670 | 0.883439 |
139 | 0.475415 | 0.030180 | 0.009531 |
288 | 0.521498 | 0.047107 | 0.889200 |
600 | 0.162564 | 0.081140 | 0.450440 |
247 | 0.215706 | 0.420259 | 0.429032 |
741 | 0.546596 | 0.373861 | 0.668371 |
172 | 0.258265 | 0.503051 | 0.388976 |
520 | 0.409268 | 0.080479 | 0.597250 |
35 | 0.366329 | 0.039299 | 0.216926 |
995 | 0.712765 | 0.551807 | 0.829652 |
736 | 0.763012 | 0.411588 | 0.796950 |
383 | 0.329860 | 0.481572 | 0.282086 |
454 | 0.570412 | 0.025394 | 0.974983 |
457 | 0.692182 | 0.449229 | 0.092467 |
240 | 0.762304 | 0.188582 | 0.163220 |
613 | 0.534953 | 0.459497 | 0.542918 |
516 | 0.238180 | 0.019506 | 0.467064 |
439 | 0.386206 | 0.352531 | 0.892062 |
983 | 0.651799 | 0.539369 | 0.377911 |
512 | 0.556697 | 0.045425 | 0.233789 |
100 rows × 3 columns
- np.random.permutation(x) generates a random permutation of the x integers from 0 to x-1
np.random.permutation(5)
array([1, 2, 3, 0, 4])
Random sampling
When the DataFrame is large enough, np.random.permutation(x) combined with take() implements random sampling (see the sketch below).
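A minimal sketch of random sampling; the sample size 50 is arbitrary:
#shuffle all 1000 row positions, then keep the first 50 rows as a random sample
sample = df.take(np.random.permutation(1000),axis=0)[0:50]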
5. Grouped data processing (split-apply-combine) [key topic]
Aggregation is the final step of data processing; it usually reduces each group to a single value.
Grouped data processing:
- Split: first divide the data into groups
- Apply: apply functions to the different groups to transform their data
- Combine: merge the per-group results back together
The core of grouped data processing:
- the groupby() function
- the groups attribute, which shows how the data was grouped
  - eg: df.groupby(by='item').groups
Grouping
from pandas import DataFrame,Series
df = DataFrame({'item':['Apple','Banana','Orange','Banana','Orange','Apple'],
'price':[4,3,3,2.5,4,2],
'color':['red','yellow','yellow','green','green','green'],
'weight':[12,20,50,30,20,44]})
df
color | item | price | weight | |
---|---|---|---|---|
0 | red | Apple | 4.0 | 12 |
1 | yellow | Banana | 3.0 | 20 |
2 | yellow | Orange | 3.0 | 50 |
3 | green | Banana | 2.5 | 30 |
4 | green | Orange | 4.0 | 20 |
5 | green | Apple | 2.0 | 44 |
- Use groupby to perform the grouping
df.groupby(by='item',axis=0)
<pandas.core.groupby.DataFrameGroupBy object at 0x000000000EDABA20>
- Use the groups attribute to view the grouping
#use the groups attribute to view the grouping
df.groupby(by='item',axis=0).groups
{'Apple': Int64Index([0, 5], dtype='int64'),
'Banana': Int64Index([1, 3], dtype='int64'),
'Orange': Int64Index([2, 4], dtype='int64')}
- Aggregation after grouping: values that can take part in the computation are aggregated; values that cannot (such as strings) are left out of the result
#goal: add a new column to df holding the mean price of each fruit
df.groupby(by='item',axis=0).mean()
price | weight | |
---|---|---|
item | ||
Apple | 3.00 | 28 |
Banana | 2.75 | 25 |
Orange | 3.50 | 35 |
mean_price = df.groupby(by='item',axis=0)['price'].mean()
mean_price
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
dic = mean_price.to_dict()
df['mean_price'] = df['item'].map(dic)
df
color | item | price | weight | mean_price | |
---|---|---|---|---|---|
0 | red | Apple | 4.0 | 12 | 3.00 |
1 | yellow | Banana | 3.0 | 20 | 2.75 |
2 | yellow | Orange | 3.0 | 50 | 3.50 |
3 | green | Banana | 2.5 | 30 | 2.75 |
4 | green | Orange | 4.0 | 20 | 3.50 |
5 | green | Apple | 2.0 | 44 | 3.00 |
View the mean price of the fruit of each color
color_mean_price = df.groupby(by='color',axis=0)['price'].mean()
color_mean_price
color
green 2.833333
red 4.000000
yellow 3.000000
Name: price, dtype: float64
dic = color_mean_price.to_dict()
df['color_mean_price'] = df['color'].map(dic)
df
color | item | price | weight | mean_price | color_mean_price | |
---|---|---|---|---|---|---|
0 | red | Apple | 4.0 | 12 | 3.00 | 4.000000 |
1 | yellow | Banana | 3.0 | 20 | 2.75 | 3.000000 |
2 | yellow | Orange | 3.0 | 50 | 3.50 | 3.000000 |
3 | green | Banana | 2.5 | 30 | 2.75 | 2.833333 |
4 | green | Orange | 4.0 | 20 | 3.50 | 2.833333 |
5 | green | Apple | 2.0 | 44 | 3.00 | 2.833333 |
Summary: the mean price for each color has been merged back into df.
6. Advanced data aggregation
After grouping with groupby, transform and apply can also be used with custom functions to perform further computations.
- df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
- both transform and apply carry out the computation; simply pass a function to transform or apply
- transform and apply can also take a lambda expression (see the sketch below)
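A minimal sketch of the lambda form, equivalent to the mean computed in the cells below:
#mean price per fruit via apply with a lambda
df.groupby(by='item')['price'].apply(lambda p: p.mean())
#the same computation with transform returns a result aligned to the original rows
df.groupby(by='item')['price'].transform(lambda p: p.mean())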
df.groupby(by='item')['price'].mean()
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
#compute the mean price of each kind of fruit
df.groupby(by='item')['price']
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x1148c30f0>
df
item | price | color | weight | |
---|---|---|---|---|
0 | Apple | 4.0 | red | 12 |
1 | Banana | 3.0 | yellow | 20 |
2 | Orange | 3.0 | yellow | 50 |
3 | Banana | 2.5 | green | 30 |
4 | Orange | 4.0 | green | 20 |
5 | Apple | 2.0 | green | 44 |
df.groupby(by='item').groups
{'Apple': Int64Index([0, 5], dtype='int64'),
'Banana': Int64Index([1, 3], dtype='int64'),
'Orange': Int64Index([2, 4], dtype='int64')}
#define a custom averaging function, then use apply to compute the mean price per fruit
def fun(s):
    total = 0
    for i in s:
        total += i
    return total/s.size
df.groupby(by='item')['price'].apply(fun)
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
#use transform to compute the mean price of each fruit; the result is aligned to the original rows
df.groupby(by='item')['price'].transform(fun)
0 3.00
1 2.75
2 3.50
3 2.75
4 3.50
5 3.00
Name: price, dtype: float64
#apply can also be used in place of map as a computation tool
#func below is an illustrative element-wise function; it was not defined in the original notes
func = lambda x: x*10
s = Series(data=[1,2,3,4,5,6,7,87,9,9])
# s.map(func)
s.apply(func)