Notes on *Python for Data Analysis* -- Data Aggregation and Grouping (groupby)
Data aggregation and grouping mainly covers:
splitting a pandas object by one or more keys (functions, arrays, or DataFrame column names)
computing statistics on the grouped data: counts, means, standard deviations, or user-defined functions
applying a variety of functions to the columns of a DataFrame
performing within-group transformations and other operations: normalization, linear regression, ranking, subset selection
pivot tables and cross-tabulations
Group-wise analysis
groupby
groupby puts a pandas object through a split-apply-combine process: the data is split into groups, a function is applied to each group, and the results are combined.
For example:
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'two', 'one', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df
  key1 key2     data1     data2
0    a  one  0.127349  1.477832
1    a  two -0.430755  0.397269
2    b  two -0.911272  0.669960
3    b  one  0.370609  0.469459
4    a  one  0.968631  0.885551

To group by key1 and compute the mean of data1, use groupby to do the split:

grouped = df['data1'].groupby(df['key1'])
grouped.mean()
key1
a    0.221742
b   -0.270332
Here the data has been aggregated by the group key, producing a new Series whose index consists of the unique values of key1.
Passing multiple arrays at once:

mean = df['data1'].groupby([df['key1'], df['key2']]).mean()
mean
key1  key2
a     one     0.547990
      two    -0.430755
b     one     0.370609
      two    -0.911272

The resulting Series has a hierarchical index:

mean.unstack()
key2       one       two
key1
a     0.547990 -0.430755
b     0.370609 -0.911272
The group keys can be any arrays of the appropriate length, or simply column names (strings or numbers).
In addition, the GroupBy size method returns a Series containing the size of each group.
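As a minimal sketch of both points, we can group a column by a plain Python list (the states list below is an illustrative assumption, not a column of the frame) and then call size():

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'two', 'one', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

# Any array of the right length works as a group key; this 'states'
# list is a hypothetical example, not part of df.
states = ['OH', 'CA', 'CA', 'OH', 'OH']
print(df['data1'].groupby(states).mean())  # one mean per state label

# size() returns a Series with the number of rows in each group
print(df.groupby(['key1', 'key2']).size())
```

Note that size() counts rows including NA values, unlike count().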
Iterating over groups
A GroupBy object supports iteration, whether grouped by a single key or by multiple keys:
for name, group in df.groupby('key1'):
    print(name)
    print(group)
a
  key1 key2     data1     data2
0    a  one -0.047866  0.323667
1    a  two  0.400731  0.521359
4    a  one  0.192612 -0.233259
b
  key1 key2     data1     data2
2    b  two -0.516440 -0.030872
3    b  one  1.204337 -1.718465
With multiple keys, the first element of each tuple is itself a tuple of key values:

for (k1, k2), group in df.groupby(['key1', 'key2']):
    print(k1, k2)
    print(group)
a one
  key1 key2     data1     data2
0    a  one -0.047866  0.323667
4    a  one  0.192612 -0.233259
a two
  key1 key2     data1     data2
1    a  two  0.400731  0.521359
b one
  key1 key2     data1     data2
3    b  one  1.204337 -1.718465
b two
  key1 key2    data1     data2
2    b  two -0.51644 -0.030872
We can then work with these pieces however we like, for example by collecting them into a dict:
pieces = dict(list(df.groupby('key1')))
pieces['a']
  key1 key2     data1     data2
0    a  one -0.047866  0.323667
1    a  two  0.400731  0.521359
4    a  one  0.192612 -0.233259
grouped = df.groupby(df.dtypes, axis=1)  # group the columns by their dtype
dict(list(grouped))
{dtype('float64'):       data1     data2
 0 -0.047866  0.323667
 1  0.400731  0.521359
 2 -0.516440 -0.030872
 3  1.204337 -1.718465
 4  0.192612 -0.233259,
 dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  two
 3    b  one
 4    a  one}
Selecting columns
Indexing a GroupBy object created from a DataFrame with a single column name, or with an array of column names, selects those columns for aggregation:
df.groupby(['key1', 'key2'])[['data2']].mean()  # aggregate only data2; the double brackets return a grouped DataFrame rather than a Series
              data2
key1 key2
a    one   0.045204
     two   0.521359
b    one  -1.718465
     two  -0.030872
Grouping with dicts and Series
With a dict:
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['A', 'B', 'C', 'D', 'E'])
people.iloc[2:4, 1:3] = np.nan  # insert some NA values (the book's .ix indexer has been removed from modern pandas)
people
          a         b         c         d         e
A  1.136587 -1.014446  0.516930 -1.032332 -0.638470
B  0.971402 -2.668267  0.741624  1.000780  0.207016
C -0.816918       NaN       NaN  0.693395  0.432147
D  0.056326       NaN       NaN  0.143980  0.103061
E -0.986889  0.881119 -0.034649  1.266522  0.240111

mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}  # map each column to a group label; unused keys such as 'f' are ignored
by_columns = people.groupby(mapping, axis=1)  # pass the dict to groupby
by_columns.sum()
       blue       red
A -0.515401 -0.516329
B  1.742404 -1.489849
C  0.693395 -0.384771
D  0.143980  0.159386
E  1.231873  0.134341
With a Series:
map_series = pd.Series(mapping)
map_series
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

people.groupby(map_series, axis=1).count()
   blue  red
A     2    3
B     2    3
C     1    2
D     1    2
E     2    3
Grouping with functions
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Tony', 'Steve', 'Jim', 'Travis'])
people.iloc[2:4, 1:3] = np.nan  # insert some NA values
people.groupby(len).sum()  # group the rows by the length of each name
          a         b         c         d         e
3  2.551209 -0.095830 -0.388745 -1.382552 -0.678547
4 -0.644126  1.371943 -0.316472  0.121421 -1.208201
5 -0.557207  0.000000  0.000000 -0.102257  0.438465
6  1.228693  0.056033 -0.472999  1.077108 -0.772639
Mixing arrays, lists, dicts, and functions
key_list = ['one', 'one', 'two', 'three', 'two']
people.groupby([len, key_list]).max()
                a         b         c         d         e
3 one    2.189451 -0.095830 -0.388745 -1.208221 -0.826994
  three  0.361758       NaN       NaN -0.174331  0.148447
4 one   -0.644126  1.371943 -0.316472  0.121421 -1.208201
5 two   -0.557207       NaN       NaN -0.102257  0.438465
6 two    1.228693  0.056033 -0.472999  1.077108 -0.772639
Grouping by hierarchical index levels
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df
cty          US                            JP
tenor         1         3         5         1         3
0     -0.248862 -1.474661 -0.483425  0.217972  0.957090
1      0.707428 -1.150710  0.101633  1.821193  2.027130
2      0.752401  0.850991  1.068320  0.327417 -0.331883
3     -0.032896  0.819291 -0.270272  1.750449  0.295412

hier_df.groupby(level='cty', axis=1).count()  # to aggregate over a hierarchical index, pass the level number or name via level
cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
Data aggregation
Aggregation refers to any data transformation that produces scalar values from arrays. Common aggregations include:
count---number of non-NA values
sum---sum of non-NA values
mean---mean of non-NA values
median---arithmetic median of non-NA values
std, var---standard deviation and variance (with denominator n-1)
min, max---minimum and maximum of non-NA values
prod---product of non-NA values
first, last---first and last non-NA values
quantile(0.9)---90% sample quantile (a Series method)
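A short sketch of a few of these built-in aggregations on a toy frame (the values are chosen only for illustration), showing that NA values are excluded:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, np.nan, 4.0, 5.0]})
grouped = df.groupby('key1')['data1']

print(grouped.count())        # non-NA counts: a -> 3, b -> 1
print(grouped.sum())          # NaN is skipped: a -> 8.0, b -> 4.0
print(grouped.quantile(0.9))  # 90% sample quantile within each group
```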
Of course, we can also define our own aggregation functions:
df
  key1 key2     data1     data2
0    a  one -0.047866  0.323667
1    a  two  0.400731  0.521359
2    b  two -0.516440 -0.030872
3    b  one  1.204337 -1.718465
4    a  one  0.192612 -0.233259

def peak_to_peak(arr):  # custom aggregation function: the range of the values
    return arr.max() - arr.min()

grouped = df.groupby('key1')
grouped.agg(peak_to_peak)  # pass it to the GroupBy object's agg method
         data1     data2
key1
a     0.448598  0.754618
b     1.720777  1.687592
Column-wise and multiple function application
tips = pd.read_csv('C:/Users/Administrator/Desktop/IPYTHON/tips.csv')
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips[:5]
   total_bill   tip     sex smoker  day    time  size   tip_pct
0       16.99  1.01  Female     No  Sun  Dinner     2  0.059447
1       10.34  1.66    Male     No  Sun  Dinner     3  0.160542
2       21.01  3.50    Male     No  Sun  Dinner     3  0.166587
3       23.68  3.31    Male     No  Sun  Dinner     2  0.139780
4       24.59  3.61  Female     No  Sun  Dinner     4  0.146808

grouped = tips.groupby(['sex', 'smoker'])  # group by sex and smoker
grouped_pct = grouped['tip_pct']
grouped_pct.agg('mean')  # function names can be passed as strings
sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771

grouped_pct.agg(['mean', 'std', peak_to_peak])  # to apply several functions at once, pass them as a list
                    mean       std  peak_to_peak
sex    smoker
Female No       0.156921  0.036421      0.195876
       Yes      0.182150  0.071595      0.360233
Male   No       0.160669  0.041849      0.220186
       Yes      0.152771  0.090588      0.674707

grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])  # (name, function) two-element tuples customize the column names
                     foo       bar
sex    smoker
Female No       0.156921  0.036421
       Yes      0.182150  0.071595
Male   No       0.160669  0.041849
       Yes      0.152771  0.090588

func = ['count', 'mean', 'max']
result = grouped[['tip_pct', 'total_bill']].agg(func)
result
               tip_pct                     total_bill
                 count      mean       max      count       mean    max
sex    smoker
Female No           54  0.156921  0.252672         54  18.105185  35.83
       Yes          33  0.182150  0.416667         33  17.977879  44.30
Male   No           97  0.160669  0.291990         97  19.791237  48.33
       Yes          60  0.152771  0.710345         60  22.284500  50.81

grouped.agg({'tip': np.max, 'size': 'sum'})  # to apply different functions to different columns, pass agg a dict mapping column names to functions
                 tip  size
sex    smoker
Female No        5.2   140
       Yes       6.5    74
Male   No        9.0   263
       Yes      10.0   150

grouped.agg({'tip_pct': ['min', 'max', 'mean', 'std'], 'size': 'sum'})  # the dict values can themselves be lists of functions
                 tip_pct                               size
                     min       max      mean       std  sum
sex    smoker
Female No       0.056797  0.252672  0.156921  0.036421  140
       Yes      0.056433  0.416667  0.182150  0.071595   74
Male   No       0.071804  0.291990  0.160669  0.041849  263
       Yes      0.035638  0.710345  0.152771  0.090588  150
Group-wise operations and transformations
transform applies a function to each group, then places the results in the appropriate locations:
Using the people data from before:

key = ['one', 'two', 'one', 'two', 'one']
people.groupby(key).mean()
            a         b         c         d         e
one -0.142316 -0.065082  2.255068  1.211343  0.817666
two  1.493163 -0.169286  0.671765  1.290332  0.093891

With transform, each row receives its group's value:

people.groupby(key).transform(np.mean)
               a         b         c         d         e
Joe    -0.142316 -0.065082  2.255068  1.211343  0.817666
Tony    1.493163 -0.169286  0.671765  1.290332  0.093891
Steve  -0.142316 -0.065082  2.255068  1.211343  0.817666
Jim     1.493163 -0.169286  0.671765  1.290332  0.093891
Travis -0.142316 -0.065082  2.255068  1.211343  0.817666
To subtract the group mean from each value, write a demeaning function and pass it to transform:
def demean(arr):
    return arr - arr.mean()

demeaned = people.groupby(key).transform(demean)
demeaned
               a         b         c         d         e
Joe     2.102734 -0.111083  0.242585 -0.334759 -0.008227
Tony    0.339766  0.000000  0.000000  0.630333 -0.157031
Steve  -1.583725       NaN       NaN  0.503897 -0.087889
Jim    -0.339766       NaN       NaN -0.630333  0.157031
Travis -0.519009  0.111083 -0.242585 -0.169138  0.096116
apply: splits the object being processed into pieces, invokes the passed function on each piece, then attempts to concatenate the pieces back together.
Using the tips data:
def top(df, n=5, column='tip_pct'):
    # return the n rows with the largest values in the given column
    return df.sort_values(by=column)[-n:]

top(tips, n=6)  # the 6 highest tip_pct values
     total_bill   tip     sex smoker  day    time  size   tip_pct
109       14.31  4.00  Female    Yes  Sat  Dinner     2  0.279525
183       23.17  6.50    Male    Yes  Sun  Dinner     4  0.280535
232       11.61  3.39    Male     No  Sat  Dinner     2  0.291990
67         3.07  1.00  Female    Yes  Sat  Dinner     1  0.325733
178        9.60  4.00  Female    Yes  Sun  Dinner     2  0.416667
172        7.25  5.15    Male    Yes  Sun  Dinner     2  0.710345

tips.groupby('smoker').apply(top)  # split by smoker, then apply top to each piece
            total_bill   tip     sex smoker   day    time  size   tip_pct
smoker
No     88        24.71  5.85    Male     No  Thur   Lunch     2  0.236746
       185       20.69  5.00    Male     No   Sun  Dinner     5  0.241663
       51        10.29  2.60  Female     No   Sun  Dinner     2  0.252672
       149        7.51  2.00    Male     No  Thur   Lunch     2  0.266312
       232       11.61  3.39    Male     No   Sat  Dinner     2  0.291990
Yes    109       14.31  4.00  Female    Yes   Sat  Dinner     2  0.279525
       183       23.17  6.50    Male    Yes   Sun  Dinner     4  0.280535
       67         3.07  1.00  Female    Yes   Sat  Dinner     1  0.325733
       178        9.60  4.00  Female    Yes   Sun  Dinner     2  0.416667
       172        7.25  5.15    Male    Yes   Sun  Dinner     2  0.710345

tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')  # extra arguments are passed through to the function
                 total_bill    tip     sex smoker   day    time  size   tip_pct
smoker day
No     Fri  94        22.75   3.25  Female     No   Fri  Dinner     2  0.142857
       Sat  212       48.33   9.00    Male     No   Sat  Dinner     4  0.186220
       Sun  156       48.17   5.00    Male     No   Sun  Dinner     6  0.103799
       Thur 142       41.19   5.00    Male     No  Thur   Lunch     5  0.121389
Yes    Fri  95        40.17   4.73    Male    Yes   Fri  Dinner     4  0.117750
       Sat  170       50.81  10.00    Male    Yes   Sat  Dinner     3  0.196812
       Sun  182       45.35   3.50    Male    Yes   Sun  Dinner     3  0.077178
       Thur 197       43.11   5.00  Female    Yes  Thur   Lunch     4  0.115982
Passing group_keys=False to groupby disables the hierarchical index formed from the group keys.
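A minimal sketch of the effect on toy data (not the tips dataset): with the default group_keys=True the group labels form the outer level of the result's index, while group_keys=False keeps only the original row labels.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 3]})
pick_max = lambda g: g.nlargest(1, 'val')  # keep each group's largest row

with_keys = df.groupby('key').apply(pick_max)
no_keys = df.groupby('key', group_keys=False).apply(pick_max)

print(with_keys.index.nlevels)  # 2: (group key, original row label)
print(list(no_keys.index))      # original row labels only
```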
Quantile and bucket analysis: combining pandas's binning tools with groupby makes it easy to perform bucket or quantile analysis on a dataset:
frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})
factor = pd.cut(frame.data1, 4)  # split data1 into 4 equal-length bins

def get_stats(group):  # statistics to compute on data2
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}

grouped = frame.data2.groupby(factor)  # grouping by factor means the data2 statistics are computed within bins of data1
grouped.apply(get_stats).unstack()
returns:
                   count       max      mean       min
data1
(-3.196, -1.594]    55.0  2.567283 -0.322128 -2.235570
(-1.594, 0.00267]  418.0  2.505320 -0.052012 -3.109679
(0.00267, 1.599]   467.0  3.187789  0.069884 -3.000375
(1.599, 3.195]      60.0  1.873327  0.022410 -1.934497

grouping = pd.qcut(frame.data1, 10, labels=False)  # qcut uses sample quantiles to form equal-size buckets; labels=False returns bucket numbers
grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
returns:
       count       max      mean       min
data1
0      100.0  2.567283 -0.184858 -2.235570
1      100.0  2.235197 -0.025099 -2.251106
2      100.0  2.011793 -0.096504 -2.184717
3      100.0  1.953811 -0.120498 -2.911569
4      100.0  3.124611  0.156681 -3.109679
5      100.0  3.067191  0.075880 -2.844021
6      100.0  2.328596  0.082201 -2.068833
7      100.0  3.187789  0.105087 -2.310712
8      100.0  2.624968 -0.047236 -3.000375
9      100.0  2.735516 -0.000431 -2.182246
Example: filling missing values with group-specific values
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.Series(np.random.randn(8), index=states)
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data
data.groupby(group_key).mean()  # the mean of each group

fill_mean = lambda g: g.fillna(g.mean())  # fill NA values with the group mean
data.groupby(group_key).apply(fill_mean)

fill_values = {'East': 0.5, 'West': -1}  # predefined fill values, looked up through each group's name attribute
fill_func = lambda g: g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)
Pivot tables and cross-tabulations