Advanced Pandas Operations
pandas data processing
1. Removing duplicate rows
Use the duplicated() function to detect duplicate rows. It returns a Series of booleans, one element per row; an element is True if that row is not the first occurrence of its values.
- keep parameter: specifies which of the duplicate rows to keep
Create a DataFrame containing duplicate rows
import numpy as np
import pandas as pd
from pandas import DataFrame
# create a df
df = DataFrame(data=np.random.randint(0,100,size=(12,7)))
df
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
0 | 18 | 75 | 98 | 68 | 33 | 35 | 15 |
1 | 48 | 71 | 36 | 34 | 41 | 17 | 46 |
2 | 3 | 28 | 81 | 21 | 86 | 53 | 85 |
3 | 15 | 35 | 35 | 90 | 63 | 83 | 14 |
4 | 12 | 36 | 65 | 79 | 25 | 53 | 95 |
5 | 98 | 63 | 4 | 58 | 35 | 64 | 80 |
6 | 31 | 61 | 23 | 33 | 80 | 53 | 60 |
7 | 52 | 47 | 60 | 58 | 54 | 35 | 17 |
8 | 7 | 92 | 42 | 61 | 31 | 40 | 56 |
9 | 76 | 45 | 30 | 42 | 74 | 83 | 53 |
10 | 69 | 2 | 89 | 99 | 12 | 51 | 62 |
11 | 17 | 86 | 1 | 76 | 40 | 34 | 41 |
# manually set several rows of df to identical contents
df.iloc[1] = [6,6,6,6,6,6,6]
df.iloc[8] = [6,6,6,6,6,6,6]
df.iloc[5] = [6,6,6,6,6,6,6]
df
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
0 | 18 | 75 | 98 | 68 | 33 | 35 | 15 |
1 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
2 | 3 | 28 | 81 | 21 | 86 | 53 | 85 |
3 | 15 | 35 | 35 | 90 | 63 | 83 | 14 |
4 | 12 | 36 | 65 | 79 | 25 | 53 | 95 |
5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
6 | 31 | 61 | 23 | 33 | 80 | 53 | 60 |
7 | 52 | 47 | 60 | 58 | 54 | 35 | 17 |
8 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
9 | 76 | 45 | 30 | 42 | 74 | 83 | 53 |
10 | 69 | 2 | 89 | 99 | 12 | 51 | 62 |
11 | 17 | 86 | 1 | 76 | 40 | 34 | 41 |
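Before dropping anything, duplicated() itself can be used to see which rows are flagged; a minimal sketch using the df above (the keep values shown are the options mentioned earlier):
df.duplicated(keep='first') # rows 5 and 8 are True: they repeat row 1
df.duplicated(keep='last') # rows 1 and 5 are True: only the last occurrence (row 8) counts as kept
df.duplicated(keep=False) # rows 1, 5 and 8 are all True: every duplicated row is flagged
df.loc[~df.duplicated(keep='first')] # keep only the first occurrence of each row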
- Use the drop_duplicates() function to delete duplicate rows
- drop_duplicates(keep='first'/'last'/False)
df.drop_duplicates(keep='last') # keep the last of each set of duplicate rows
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
0 | 26 | 9 | 31 | 11 | 38 | 46 | 22 |
2 | 89 | 24 | 53 | 28 | 64 | 89 | 40 |
3 | 7 | 80 | 43 | 91 | 32 | 95 | 6 |
4 | 96 | 92 | 58 | 55 | 82 | 73 | 21 |
6 | 43 | 1 | 13 | 54 | 24 | 34 | 43 |
7 | 75 | 32 | 88 | 85 | 40 | 29 | 41 |
8 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
9 | 93 | 3 | 62 | 4 | 60 | 51 | 78 |
10 | 58 | 11 | 63 | 42 | 62 | 30 | 10 |
11 | 89 | 93 | 96 | 49 | 23 | 40 | 57 |
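For comparison, a quick sketch of the other keep options on the same df (not executed in this notebook):
df.drop_duplicates(keep='first') # keeps row 1, drops rows 5 and 8
df.drop_duplicates(keep=False) # drops rows 1, 5 and 8 entirely
# drop_duplicates also accepts inplace=True to modify df directly instead of returning a copy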
2. Mapping
1) The replace() function: replacing values
DataFrame replacement operations
- Single-value replacement
    - plain replacement: replace every matching element, e.g. to_replace=15, value='e'
    - column-specific single-value replacement: to_replace={column label: value to replace}, value='value'
- Multi-value replacement (see the sketches after this list)
    - list replacement: to_replace=[...], value=[...]
    - dict replacement (recommended): to_replace={old value: new value}
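A few sketches of the forms listed above (the values are illustrative and do not come from the tables below):
df.replace(to_replace=6,value='six') # single-value replacement
df.replace(to_replace={5:6},value='six') # replace 6 only in the column labeled 5
df.replace(to_replace=[6,98],value=['six','ninety-eight']) # list replacement, matched element-wise
df.replace(to_replace={6:'six',98:'ninety-eight'}) # dict replacement (recommended)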
df
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
0 | 26 | 9 | 31 | 11 | 38 | 46 | 22 |
1 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
2 | 89 | 24 | 53 | 28 | 64 | 89 | 40 |
3 | 7 | 80 | 43 | 91 | 32 | 95 | 6 |
4 | 96 | 92 | 58 | 55 | 82 | 73 | 21 |
5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
6 | 43 | 1 | 13 | 54 | 24 | 34 | 43 |
7 | 75 | 32 | 88 | 85 | 40 | 29 | 41 |
8 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
9 | 93 | 3 | 62 | 4 | 60 | 51 | 78 |
10 | 58 | 11 | 63 | 42 | 62 | 30 | 10 |
11 | 89 | 93 | 96 | 49 | 23 | 40 | 57 |
Note: with a DataFrame, the method and limit parameters cannot be used.
df.replace(to_replace=6,value='six') # replace every 6 in the data with 'six'; by default the original table is not modified
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
0 | 18 | 75 | 98 | 68 | 33 | 35 | 15 |
1 | six | six | six | six | six | six | six |
2 | 3 | 28 | 81 | 21 | 86 | 53 | 85 |
3 | 15 | 35 | 35 | 90 | 63 | 83 | 14 |
4 | 12 | 36 | 65 | 79 | 25 | 53 | 95 |
5 | six | six | six | six | six | six | six |
6 | 31 | 61 | 23 | 33 | 80 | 53 | 60 |
7 | 52 | 47 | 60 | 58 | 54 | 35 | 17 |
8 | six | six | six | six | six | six | six |
9 | 76 | 45 | 30 | 42 | 74 | 83 | 53 |
10 | 69 | 2 | 89 | 99 | 12 | 51 | 62 |
11 | 17 | 86 | 1 | 76 | 40 | 34 | 41 |
df.replace(to_replace={6:'six'}) # same effect as above
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
0 | 26 | 9 | 31 | 11 | 38 | 46 | 22 |
1 | six | six | six | six | six | six | six |
2 | 89 | 24 | 53 | 28 | 64 | 89 | 40 |
3 | 7 | 80 | 43 | 91 | 32 | 95 | six |
4 | 96 | 92 | 58 | 55 | 82 | 73 | 21 |
5 | six | six | six | six | six | six | six |
6 | 43 | 1 | 13 | 54 | 24 | 34 | 43 |
7 | 75 | 32 | 88 | 85 | 40 | 29 | 41 |
8 | six | six | six | six | six | six | six |
9 | 93 | 3 | 62 | 4 | 60 | 51 | 78 |
10 | 58 | 11 | 63 | 42 | 62 | 30 | 10 |
11 | 89 | 93 | 96 | 49 | 23 | 40 | 57 |
df.replace(to_replace={5:6},value='six') # replace the 6s in the column labeled 5 with 'six'
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
0 | 26 | 9 | 31 | 11 | 38 | 46 | 22 |
1 | 6 | 6 | 6 | 6 | 6 | six | 6 |
2 | 89 | 24 | 53 | 28 | 64 | 89 | 40 |
3 | 7 | 80 | 43 | 91 | 32 | 95 | 6 |
4 | 96 | 92 | 58 | 55 | 82 | 73 | 21 |
5 | 6 | 6 | 6 | 6 | 6 | six | 6 |
6 | 43 | 1 | 13 | 54 | 24 | 34 | 43 |
7 | 75 | 32 | 88 | 85 | 40 | 29 | 41 |
8 | 6 | 6 | 6 | 6 | 6 | six | 6 |
9 | 93 | 3 | 62 | 4 | 60 | 51 | 78 |
10 | 58 | 11 | 63 | 42 | 62 | 30 | 10 |
11 | 89 | 93 | 96 | 49 | 23 | 40 | 57 |
2) The map() function
Creating a new column. map() is not a DataFrame method; it is a Series method.
- map() can map values into a new column
- map() can take a lambda expression
- map() can take a function, including a user-defined one
  eg: map({to_replace:value})
- Note: aggregation functions such as sum, and for loops, cannot be used inside map()
- Add a new column: add a column to df whose values are the English names corresponding to the Chinese names
dic = {
'name':['张三','周杰伦','张三'],
'salary':[20000,10000,20000]
}
df = DataFrame(data=dic)
df
 | name | salary |
---|---|---|
0 | 张三 | 20000 |
1 | 周杰伦 | 10000 |
2 | 张三 | 20000 |
# mapping table
dic = {
'张三':'tom',
'周杰伦':'jay'
}
df['e_name'] = df['name'].map(dic)
df
 | name | salary | e_name |
---|---|---|---|
0 | 张三 | 20000 | tom |
1 | 周杰伦 | 10000 | jay |
2 | 张三 | 20000 | tom |
map can be thought of as a computation tool; which computation it performs is determined by the argument passed to map (a lambda or a function)
- Using a user-defined function
def after_sal(s):
    return s - (s-3000)*0.5
# the portion of the salary above 3000 is taxed at 50%
df['after_sal'] = df['salary'].map(after_sal)
df
 | name | salary | e_name | after_sal |
---|---|---|---|---|
0 | 张三 | 20000 | tom | 11500.0 |
1 | 周杰伦 | 10000 | jay | 6500.0 |
2 | 张三 | 20000 | tom | 11500.0 |
df['after_sal'] = df['salary'].apply(after_sal) # apply is more efficient than map
df
 | name | salary | e_name | after_sal |
---|---|---|---|---|
0 | 张三 | 20000 | tom | 11500.0 |
1 | 周杰伦 | 10000 | jay | 6500.0 |
2 | 张三 | 20000 | tom | 11500.0 |
- Using a lambda expression
df['after_sal'] = df['salary'].apply(lambda x:x-(x-3000)*0.5) # or map
df
 | name | salary | e_name | after_sal |
---|---|---|---|---|
0 | 张三 | 20000 | tom | 11500.0 |
1 | 周杰伦 | 10000 | jay | 6500.0 |
2 | 张三 | 20000 | tom | 11500.0 |
Note: not every kind of function can be passed to map. Only a function that takes a single parameter and has a return value can be used as map's argument.
3. Detecting and filtering outliers with aggregation operations
The df.std() function returns the standard deviation of each column of a DataFrame object.
- Create a 1000-row, 3-column df with values in the range (0, 1), and compute the standard deviation of each column
df = DataFrame(data=np.random.random(size=(1000,3)),columns=['A','B','C'])
df
 | A | B | C |
---|---|---|---|
0 | 0.056365 | 0.080972 | 0.378327 |
1 | 0.371930 | 0.007791 | 0.318345 |
2 | 0.140999 | 0.921772 | 0.752930 |
3 | 0.877110 | 0.447756 | 0.760049 |
4 | 0.212178 | 0.143772 | 0.621486 |
5 | 0.255404 | 0.195473 | 0.008677 |
6 | 0.011568 | 0.308934 | 0.882607 |
7 | 0.470868 | 0.080049 | 0.285998 |
8 | 0.659013 | 0.794802 | 0.270541 |
9 | 0.315826 | 0.814653 | 0.906056 |
10 | 0.892474 | 0.301340 | 0.687254 |
11 | 0.015484 | 0.567598 | 0.043682 |
12 | 0.957620 | 0.967676 | 0.063608 |
13 | 0.102506 | 0.490077 | 0.235902 |
14 | 0.099083 | 0.778190 | 0.451824 |
15 | 0.023148 | 0.074169 | 0.589411 |
16 | 0.425894 | 0.772662 | 0.797658 |
17 | 0.939475 | 0.773502 | 0.766101 |
18 | 0.330299 | 0.984615 | 0.346554 |
19 | 0.882735 | 0.237546 | 0.847036 |
20 | 0.578589 | 0.730879 | 0.751632 |
21 | 0.504627 | 0.716272 | 0.386102 |
22 | 0.424879 | 0.231262 | 0.590047 |
23 | 0.580738 | 0.675268 | 0.726104 |
24 | 0.507248 | 0.136465 | 0.463764 |
25 | 0.421517 | 0.814806 | 0.449040 |
26 | 0.275373 | 0.935430 | 0.525679 |
27 | 0.404031 | 0.221492 | 0.730966 |
28 | 0.779142 | 0.063435 | 0.120807 |
29 | 0.618392 | 0.535934 | 0.554632 |
... | ... | ... | ... |
970 | 0.378107 | 0.687434 | 0.567923 |
971 | 0.876770 | 0.443219 | 0.236627 |
972 | 0.486757 | 0.416836 | 0.524889 |
973 | 0.886021 | 0.203959 | 0.789022 |
974 | 0.838247 | 0.279468 | 0.333581 |
975 | 0.762230 | 0.352878 | 0.550439 |
976 | 0.044568 | 0.680916 | 0.350743 |
977 | 0.031232 | 0.029839 | 0.918445 |
978 | 0.323142 | 0.686965 | 0.978349 |
979 | 0.746471 | 0.081773 | 0.729567 |
980 | 0.810169 | 0.793025 | 0.993532 |
981 | 0.480849 | 0.321984 | 0.233431 |
982 | 0.491794 | 0.056681 | 0.429988 |
983 | 0.278019 | 0.105290 | 0.435492 |
984 | 0.480974 | 0.098199 | 0.958667 |
985 | 0.465396 | 0.806955 | 0.668972 |
986 | 0.602675 | 0.966963 | 0.338542 |
987 | 0.051971 | 0.105833 | 0.132917 |
988 | 0.416362 | 0.861777 | 0.832573 |
989 | 0.951651 | 0.002912 | 0.942564 |
990 | 0.274033 | 0.071102 | 0.941272 |
991 | 0.632913 | 0.807060 | 0.540686 |
992 | 0.035006 | 0.526970 | 0.058584 |
993 | 0.368957 | 0.395593 | 0.210440 |
994 | 0.692847 | 0.655492 | 0.877564 |
995 | 0.245593 | 0.003551 | 0.913750 |
996 | 0.374804 | 0.311604 | 0.680521 |
997 | 0.355928 | 0.924330 | 0.224949 |
998 | 0.923060 | 0.834740 | 0.275359 |
999 | 0.905336 | 0.482290 | 0.722851 |
1000 rows × 3 columns
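The per-column standard deviations mentioned above can be checked directly; a quick sketch (the exact numbers vary from run to run because the data is random):
df.std() # standard deviation of each of the columns A, B and C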
Apply a filter condition to df to remove data whose deviation is too large: suppose the filter condition is that a value in column C is greater than twice the standard deviation of column C.
twice_std = df['C'].std() * 2
twice_std
0.5714973528631762
~(df['C'] > twice_std)
0 True
1 True
2 False
3 False
4 False
5 True
6 False
7 True
8 True
9 False
10 False
11 True
12 True
13 True
14 True
15 False
16 False
17 False
18 True
19 False
20 False
21 True
22 False
23 False
24 True
25 True
26 True
27 False
28 True
29 True
...
970 True
971 True
972 True
973 False
974 True
975 True
976 True
977 False
978 False
979 False
980 False
981 True
982 True
983 True
984 False
985 False
986 True
987 True
988 False
989 False
990 False
991 True
992 True
993 True
994 False
995 False
996 False
997 True
998 True
999 False
Name: C, Length: 1000, dtype: bool
df.loc[~(df['C'] > twice_std)]
 | A | B | C |
---|---|---|---|
0 | 0.056365 | 0.080972 | 0.378327 |
1 | 0.371930 | 0.007791 | 0.318345 |
5 | 0.255404 | 0.195473 | 0.008677 |
7 | 0.470868 | 0.080049 | 0.285998 |
8 | 0.659013 | 0.794802 | 0.270541 |
11 | 0.015484 | 0.567598 | 0.043682 |
12 | 0.957620 | 0.967676 | 0.063608 |
13 | 0.102506 | 0.490077 | 0.235902 |
14 | 0.099083 | 0.778190 | 0.451824 |
18 | 0.330299 | 0.984615 | 0.346554 |
21 | 0.504627 | 0.716272 | 0.386102 |
24 | 0.507248 | 0.136465 | 0.463764 |
25 | 0.421517 | 0.814806 | 0.449040 |
26 | 0.275373 | 0.935430 | 0.525679 |
28 | 0.779142 | 0.063435 | 0.120807 |
29 | 0.618392 | 0.535934 | 0.554632 |
31 | 0.616251 | 0.034984 | 0.342615 |
33 | 0.009574 | 0.195987 | 0.221378 |
35 | 0.721609 | 0.518311 | 0.561978 |
36 | 0.316993 | 0.678054 | 0.163737 |
37 | 0.494355 | 0.499986 | 0.560351 |
39 | 0.584863 | 0.881738 | 0.481162 |
43 | 0.160369 | 0.402388 | 0.208208 |
45 | 0.002698 | 0.576528 | 0.070493 |
47 | 0.764883 | 0.778927 | 0.494559 |
48 | 0.868643 | 0.392903 | 0.109240 |
49 | 0.058928 | 0.350504 | 0.497170 |
50 | 0.373490 | 0.783554 | 0.335720 |
55 | 0.638066 | 0.442382 | 0.173654 |
56 | 0.837218 | 0.722685 | 0.454352 |
... | ... | ... | ... |
943 | 0.322268 | 0.957496 | 0.108147 |
944 | 0.384463 | 0.490386 | 0.245737 |
945 | 0.382611 | 0.726888 | 0.345724 |
947 | 0.713337 | 0.828064 | 0.364005 |
948 | 0.818703 | 0.445825 | 0.281585 |
951 | 0.968651 | 0.897188 | 0.368103 |
952 | 0.136136 | 0.431300 | 0.543917 |
954 | 0.846105 | 0.064527 | 0.200963 |
955 | 0.708107 | 0.857570 | 0.475146 |
957 | 0.595819 | 0.060763 | 0.294676 |
958 | 0.268046 | 0.790128 | 0.342255 |
959 | 0.116645 | 0.968789 | 0.493773 |
967 | 0.073665 | 0.204168 | 0.286095 |
968 | 0.205796 | 0.596242 | 0.468190 |
970 | 0.378107 | 0.687434 | 0.567923 |
971 | 0.876770 | 0.443219 | 0.236627 |
972 | 0.486757 | 0.416836 | 0.524889 |
974 | 0.838247 | 0.279468 | 0.333581 |
975 | 0.762230 | 0.352878 | 0.550439 |
976 | 0.044568 | 0.680916 | 0.350743 |
981 | 0.480849 | 0.321984 | 0.233431 |
982 | 0.491794 | 0.056681 | 0.429988 |
983 | 0.278019 | 0.105290 | 0.435492 |
986 | 0.602675 | 0.966963 | 0.338542 |
987 | 0.051971 | 0.105833 | 0.132917 |
991 | 0.632913 | 0.807060 | 0.540686 |
992 | 0.035006 | 0.526970 | 0.058584 |
993 | 0.368957 | 0.395593 | 0.210440 |
997 | 0.355928 | 0.924330 | 0.224949 |
998 | 0.923060 | 0.834740 | 0.275359 |
559 rows × 3 columns
- Detect and filter missing values
    - dropna
    - fillna
- Detect and filter duplicate values
    - drop_duplicates(keep=...)
- Detect and filter outliers (see the sketch after this list)
    - define the condition that identifies an outlier
    - drop the rows that correspond to the outliers
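A compact sketch of the three cases listed above, reusing df and twice_std from this section (the fill value 0 is only illustrative):
df.dropna(axis=0) # drop the rows that contain missing values
df.fillna(value=0) # or fill the missing values instead of dropping them
df.drop_duplicates(keep='first') # drop duplicate rows
drop_index = df.loc[df['C'] > twice_std].index # condition that identifies the outlier rows
df.drop(labels=drop_index,axis=0) # drop the rows that correspond to the outliers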
4. Sorting
Sorting with the take() function
- take() accepts a list of positional indices (integers) and reorders the df according to the order of those indices in the list
- eg: df.take([1,3,4,2,5])
np.random.permutation() can be used to produce a random ordering
np.random.permutation(1000) # a random permutation of the integers 0-999
array([956, 614, 993, 437, 371, 215, 579, 282, 301, 646, 893, 7, 441,
539, 953, 794, 155, 370, 154, 100, 753, 793, 412, 867, 941, 998,
672, 590, 708, 1, 634, 899, 417, 242, 557, 122, 397, 850, 543,
560, 389, 896, 903, 505, 685, 334, 665, 460, 768, 937, 522, 637,
121, 605, 107, 130, 286, 532, 982, 563, 995, 89, 217, 218, 82,
781, 951, 798, 200, 947, 790, 398, 538, 411, 15, 44, 784, 205,
281, 314, 439, 132, 192, 238, 795, 470, 65, 842, 259, 426, 528,
383, 682, 750, 119, 465, 503, 278, 715, 603, 544, 265, 239, 569,
204, 616, 343, 710, 653, 256, 6, 873, 338, 27, 570, 707, 70,
73, 233, 838, 799, 266, 859, 279, 136, 479, 724, 870, 611, 574,
564, 655, 177, 39, 253, 148, 471, 317, 661, 851, 69, 523, 513,
928, 650, 23, 582, 622, 814, 959, 723, 938, 612, 912, 865, 402,
638, 80, 962, 214, 983, 194, 680, 758, 29, 74, 86, 102, 583,
695, 580, 835, 931, 832, 454, 258, 493, 967, 670, 555, 494, 501,
581, 591, 179, 354, 118, 671, 380, 732, 229, 719, 623, 874, 495,
944, 900, 123, 250, 628, 244, 872, 731, 625, 586, 57, 752, 596,
827, 775, 841, 163, 394, 833, 153, 669, 295, 826, 384, 890, 711,
60, 141, 237, 198, 404, 463, 712, 960, 749, 510, 866, 609, 26,
169, 372, 459, 365, 949, 124, 733, 12, 257, 668, 878, 487, 138,
652, 300, 219, 413, 445, 193, 207, 337, 779, 77, 95, 693, 812,
409, 33, 490, 992, 9, 167, 358, 743, 369, 99, 817, 542, 706,
289, 589, 666, 927, 391, 761, 844, 452, 66, 830, 498, 968, 689,
329, 508, 526, 335, 884, 129, 972, 507, 480, 274, 110, 425, 500,
388, 418, 869, 769, 251, 863, 456, 112, 247, 304, 478, 481, 429,
741, 241, 347, 37, 673, 427, 285, 415, 59, 853, 144, 822, 125,
455, 64, 332, 71, 971, 763, 804, 19, 191, 918, 608, 61, 327,
137, 116, 746, 482, 828, 766, 691, 424, 727, 468, 633, 302, 861,
848, 134, 704, 491, 320, 280, 660, 375, 846, 359, 987, 511, 342,
307, 399, 76, 825, 11, 28, 961, 485, 451, 675, 457, 618, 554,
551, 885, 531, 880, 534, 160, 607, 367, 374, 797, 910, 970, 595,
575, 756, 90, 897, 801, 49, 140, 985, 512, 577, 922, 168, 225,
360, 315, 350, 919, 231, 911, 631, 31, 774, 103, 186, 892, 293,
483, 149, 860, 887, 93, 340, 744, 908, 52, 196, 222, 955, 3,
930, 571, 484, 156, 50, 843, 599, 506, 936, 703, 881, 273, 520,
41, 85, 328, 223, 48, 492, 97, 56, 36, 974, 924, 656, 58,
649, 92, 114, 62, 173, 984, 973, 346, 573, 996, 597, 990, 667,
206, 917, 213, 272, 462, 686, 469, 472, 236, 643, 787, 224, 120,
255, 24, 171, 94, 904, 771, 344, 556, 981, 593, 988, 271, 762,
363, 254, 535, 361, 979, 303, 692, 964, 504, 150, 894, 349, 796,
714, 525, 943, 785, 260, 145, 292, 718, 811, 234, 641, 403, 818,
999, 461, 778, 802, 901, 352, 40, 515, 32, 877, 664, 323, 966,
635, 905, 754, 940, 810, 182, 75, 442, 308, 262, 776, 592, 267,
203, 294, 657, 34, 414, 405, 232, 151, 373, 601, 14, 807, 467,
421, 43, 935, 430, 287, 313, 283, 152, 516, 530, 356, 559, 518,
644, 889, 977, 521, 548, 381, 674, 929, 0, 916, 246, 540, 297,
67, 980, 422, 117, 772, 53, 13, 91, 46, 423, 509, 21, 128,
598, 115, 610, 679, 783, 264, 78, 270, 824, 311, 648, 220, 636,
226, 658, 886, 227, 268, 773, 620, 529, 864, 502, 567, 713, 963,
366, 210, 333, 249, 600, 701, 2, 640, 407, 745, 942, 113, 87,
390, 159, 188, 948, 957, 488, 351, 288, 245, 431, 248, 164, 767,
839, 702, 803, 792, 594, 837, 489, 934, 684, 386, 629, 519, 876,
63, 448, 98, 858, 378, 298, 368, 453, 25, 868, 624, 79, 133,
902, 906, 428, 401, 162, 157, 728, 950, 662, 190, 496, 568, 975,
952, 627, 909, 994, 131, 780, 751, 883, 871, 319, 722, 199, 536,
209, 821, 318, 290, 393, 35, 325, 187, 786, 681, 284, 514, 331,
647, 855, 143, 989, 642, 96, 676, 986, 561, 602, 336, 20, 379,
847, 735, 954, 645, 547, 357, 447, 435, 739, 228, 566, 305, 353,
158, 755, 716, 730, 856, 127, 47, 392, 862, 809, 720, 760, 432,
243, 932, 208, 382, 585, 747, 111, 836, 736, 700, 705, 615, 355,
18, 330, 820, 8, 857, 184, 175, 221, 737, 524, 697, 436, 395,
764, 939, 104, 759, 819, 240, 659, 147, 269, 387, 420, 621, 364,
926, 201, 549, 165, 696, 742, 997, 181, 277, 726, 10, 683, 991,
291, 81, 126, 68, 920, 808, 572, 740, 533, 699, 72, 146, 230,
888, 5, 606, 466, 263, 458, 898, 604, 385, 805, 105, 211, 945,
958, 721, 823, 376, 497, 545, 576, 738, 626, 852, 449, 541, 444,
406, 976, 88, 815, 552, 166, 183, 178, 438, 553, 84, 83, 717,
651, 782, 678, 324, 584, 42, 687, 517, 195, 106, 101, 933, 434,
348, 440, 587, 310, 923, 663, 921, 499, 565, 296, 38, 891, 895,
316, 30, 978, 677, 170, 322, 613, 546, 527, 630, 476, 174, 51,
816, 845, 185, 108, 17, 321, 813, 806, 109, 882, 197, 550, 907,
339, 698, 965, 362, 729, 914, 791, 694, 475, 879, 486, 309, 748,
326, 688, 202, 410, 915, 690, 854, 377, 341, 788, 22, 777, 275,
473, 261, 400, 45, 54, 135, 770, 189, 946, 562, 925, 537, 789,
312, 829, 725, 252, 800, 578, 446, 55, 419, 396, 4, 558, 212,
831, 450, 299, 161, 617, 345, 306, 757, 709, 180, 235, 433, 840,
477, 913, 474, 734, 408, 443, 834, 654, 875, 172, 632, 416, 16,
216, 464, 139, 619, 588, 969, 176, 276, 142, 639, 765, 849])
# shuffle both the rows and the columns randomly
df.take(indices=np.random.permutation(1000),axis=0).take(indices=np.random.permutation(3),axis=1)
 | A | C | B |
---|---|---|---|
810 | 0.056462 | 0.836914 | 0.105296 |
2 | 0.140999 | 0.752930 | 0.921772 |
721 | 0.941986 | 0.206568 | 0.283233 |
803 | 0.302248 | 0.027969 | 0.946815 |
46 | 0.576391 | 0.604795 | 0.199215 |
224 | 0.091563 | 0.448896 | 0.460941 |
682 | 0.081894 | 0.360009 | 0.174743 |
894 | 0.758221 | 0.311932 | 0.054626 |
389 | 0.951142 | 0.174418 | 0.764700 |
441 | 0.283697 | 0.577370 | 0.698306 |
350 | 0.623445 | 0.681211 | 0.547610 |
53 | 0.186217 | 0.617344 | 0.339724 |
467 | 0.231915 | 0.540558 | 0.972880 |
962 | 0.543442 | 0.895628 | 0.444214 |
598 | 0.516110 | 0.047393 | 0.670478 |
337 | 0.022056 | 0.222698 | 0.010719 |
481 | 0.182805 | 0.301250 | 0.652167 |
277 | 0.127561 | 0.749532 | 0.170472 |
162 | 0.767004 | 0.261541 | 0.381312 |
250 | 0.847071 | 0.344957 | 0.539958 |
416 | 0.369274 | 0.495600 | 0.393579 |
425 | 0.228196 | 0.273655 | 0.114908 |
843 | 0.394974 | 0.904397 | 0.875514 |
893 | 0.451844 | 0.336345 | 0.787189 |
492 | 0.516625 | 0.499929 | 0.350670 |
453 | 0.218878 | 0.957251 | 0.308231 |
186 | 0.611224 | 0.981765 | 0.809362 |
243 | 0.092659 | 0.374212 | 0.658671 |
522 | 0.773774 | 0.436375 | 0.037527 |
961 | 0.172133 | 0.762221 | 0.800747 |
... | ... | ... | ... |
624 | 0.587435 | 0.183552 | 0.831386 |
675 | 0.636248 | 0.542904 | 0.918788 |
861 | 0.519202 | 0.322943 | 0.315798 |
989 | 0.951651 | 0.942564 | 0.002912 |
136 | 0.940608 | 0.069835 | 0.504026 |
950 | 0.294872 | 0.712361 | 0.821118 |
529 | 0.648302 | 0.860493 | 0.626701 |
833 | 0.783501 | 0.823326 | 0.357173 |
173 | 0.181090 | 0.697154 | 0.906783 |
615 | 0.177069 | 0.732558 | 0.275658 |
182 | 0.091686 | 0.262477 | 0.340532 |
913 | 0.069850 | 0.903723 | 0.102737 |
417 | 0.353772 | 0.345310 | 0.618327 |
487 | 0.697415 | 0.083422 | 0.921608 |
345 | 0.331507 | 0.295755 | 0.995060 |
978 | 0.323142 | 0.978349 | 0.686965 |
197 | 0.947977 | 0.235533 | 0.295503 |
133 | 0.428408 | 0.963203 | 0.485624 |
214 | 0.861541 | 0.840486 | 0.435903 |
640 | 0.453934 | 0.807253 | 0.940066 |
977 | 0.031232 | 0.918445 | 0.029839 |
698 | 0.780159 | 0.042282 | 0.127449 |
427 | 0.326411 | 0.101616 | 0.915007 |
898 | 0.768911 | 0.231629 | 0.451036 |
77 | 0.718200 | 0.682757 | 0.986735 |
865 | 0.553171 | 0.535761 | 0.088467 |
513 | 0.203601 | 0.908238 | 0.116113 |
711 | 0.655778 | 0.164941 | 0.472295 |
685 | 0.012172 | 0.035356 | 0.501114 |
801 | 0.891855 | 0.355426 | 0.682663 |
1000 rows × 3 columns
- np.random.permutation(x) generates a random permutation of the x integers from 0 to x-1
df.take(indices=np.random.permutation(1000),axis=0).take(indices=np.random.permutation(3),axis=1)[0:5]
 | B | A | C |
---|---|---|---|
839 | 0.817163 | 0.346661 | 0.113644 |
708 | 0.644456 | 0.327089 | 0.081710 |
244 | 0.852833 | 0.366820 | 0.028656 |
728 | 0.627186 | 0.850947 | 0.375577 |
238 | 0.784179 | 0.764240 | 0.579280 |
Random sampling
When the DataFrame is large enough, np.random.permutation(x) combined with the take() function implements random sampling.
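A minimal sketch of sampling, say, 10 random rows (the sample size is arbitrary):
df.take(indices=np.random.permutation(len(df)),axis=0)[0:10] # shuffle the rows, then slice the first 10
# pandas also provides df.sample(n=10) as a direct alternative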
5. Grouped data processing [key topic]
Data aggregation is the last step of data processing; it usually reduces each group of values to a single value.
Grouped data processing:
- Split: first divide the data into groups
- Apply: apply a function to each group to transform its data
- Combine: merge the results obtained from the different groups
The core of grouped data processing:
- the groupby() function
- the groups attribute shows how the data was grouped
- eg: df.groupby(by='item').groups
Grouping
df = DataFrame({'item':['Apple','Banana','Orange','Banana','Orange','Apple'],
'price':[4,3,3,2.5,4,2],
'color':['red','yellow','yellow','green','green','green'],
'weight':[12,20,50,30,20,44]})
df
 | color | item | price | weight |
---|---|---|---|---|
0 | red | Apple | 4.0 | 12 |
1 | yellow | Banana | 3.0 | 20 |
2 | yellow | Orange | 3.0 | 50 |
3 | green | Banana | 2.5 | 30 |
4 | green | Orange | 4.0 | 20 |
5 | green | Apple | 2.0 | 44 |
- Grouping with groupby
df.groupby(by='item',axis=0)
<pandas.core.groupby.DataFrameGroupBy object at 0x00000235AA6F6C18>
- Use the groups attribute to view the grouping
# groupby groups the data but does not display the grouping by itself
df.groupby(by='item',axis=0).groups
{'Apple': Int64Index([0, 5], dtype='int64'),
'Banana': Int64Index([1, 3], dtype='int64'),
'Orange': Int64Index([2, 4], dtype='int64')}
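To pull a single group out as its own DataFrame, get_group can be used; a quick sketch:
df.groupby(by='item').get_group('Apple') # only the rows whose item is 'Apple'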
- Aggregation after grouping: within each group, values that can take part in the computation are computed; values that cannot are left out
# create a new column for df whose content is the mean price of each fruit
df.groupby(by='item',axis=0).mean()['price']
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
mean_price_series = df.groupby(by='item',axis=0)['price'].mean()
mean_price_series
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
# mapping table
dic = mean_price_series.to_dict()
df['mean_price'] = df['item'].map(dic)
df
 | color | item | price | weight | mean_price |
---|---|---|---|---|---|
0 | red | Apple | 4.0 | 12 | 3.00 |
1 | yellow | Banana | 3.0 | 20 | 2.75 |
2 | yellow | Orange | 3.0 | 50 | 3.50 |
3 | green | Banana | 2.5 | 30 | 2.75 |
4 | green | Orange | 4.0 | 20 | 3.50 |
5 | green | Apple | 2.0 | 44 | 3.00 |
Compute the mean price of apples
df
 | color | item | price | weight |
---|---|---|---|---|
0 | red | Apple | 4.0 | 12 |
1 | yellow | Banana | 3.0 | 20 |
2 | yellow | Orange | 3.0 | 50 |
3 | green | Banana | 2.5 | 30 |
4 | green | Orange | 4.0 | 20 |
5 | green | Apple | 2.0 | 44 |
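One way to answer this, reusing the grouped mean from above; a minimal sketch:
df.groupby(by='item')['price'].mean()['Apple'] # 3.0
df.loc[df['item']=='Apple']['price'].mean() # equivalent, with a boolean mask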
View the mean price of the fruits of each color
Combine: merge the mean price per color back into df
df['color_mean_price'] = df['color'].map(df.groupby(by='color')['price'].mean().to_dict())
df
 | color | item | price | weight | mean_price | color_mean_price |
---|---|---|---|---|---|---|
0 | red | Apple | 4.0 | 12 | 3.00 | 4.000000 |
1 | yellow | Banana | 3.0 | 20 | 2.75 | 3.000000 |
2 | yellow | Orange | 3.0 | 50 | 3.50 | 3.000000 |
3 | green | Banana | 2.5 | 30 | 2.75 | 2.833333 |
4 | green | Orange | 4.0 | 20 | 3.50 | 2.833333 |
5 | green | Apple | 2.0 | 44 | 3.00 | 2.833333 |
6. Advanced data aggregation
After grouping with groupby, transform and apply can also be used with a custom function to perform further computations
- df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
- both transform and apply perform the computation; just pass a function into transform or apply
- transform and apply can also take a lambda expression
df.groupby(by='item')['price'].mean()
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
def my_mean(s):
    total = 0  # avoid shadowing the built-in sum
    for i in s:
        total += i
    return total/len(s)
# use apply to compute the mean price of each fruit
df.groupby(by='item')['price'].apply(my_mean)
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
# use transform to compute the mean price of each fruit (the result is broadcast back to every row)
df.groupby(by='item')['price'].transform(my_mean)
0 3.00
1 2.75
2 3.50
3 2.75
4 3.50
5 3.00
Name: price, dtype: float64
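Note the difference in shape: apply returns one value per group (indexed by item), while transform broadcasts the group result back onto the original row index, which makes it convenient for adding a column directly. A short sketch (the column name item_mean_price is only illustrative):
df['item_mean_price'] = df.groupby(by='item')['price'].transform('mean') # aligned with df's own index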