Data Aggregation and Group Operations: Study Notes, Part 2
2 Data Aggregation
Aggregation means any data transformation that produces scalar values from arrays.
In [5]: df
Out[5]:
data1 data2 key1 key2
0 -1.809227 2.231607 a one
1 1.663738 -0.113191 a two
2 0.411805 -0.842909 b one
3 1.244803 -1.147212 b two
4 1.014822 0.122223 a one
In [6]: grouped = df.groupby('key1')
In [7]: grouped['data1'].quantile(0.9) # the 0.9 quantile
Out[7]:
key1
a 1.533955
b 1.161503
Name: data1, dtype: float64
# sum, mean, and the other methods from before also work.
# You can also pass a custom function:
In [5]: def peak_to_peak(arr):
...: return arr.max()-arr.min()
In [9]: grouped.agg(peak_to_peak)
Out[9]:
data1 data2
key1
a 3.472965 2.344798
b 0.832998 0.304303
In [10]: grouped.describe()
Out[10]:
data1 data2
key1
a count 3.000000 3.000000
mean 0.289777 0.746880
std 1.846521 1.291188
min -1.809227 -0.113191
25% -0.397203 0.004516
50% 1.014822 0.122223
75% 1.339280 1.176915
max 1.663738 2.231607
b count 2.000000 2.000000
mean 0.828304 -0.995061
std 0.589018 0.215175
min 0.411805 -1.147212
25% 0.620055 -1.071137
50% 0.828304 -0.995061
75% 1.036554 -0.918985
max 1.244803 -0.842909
Methods optimized for groupby include count, sum, mean, median, std, var, min, max, prod, first, and last; custom functions passed to agg are considerably slower than these.
2.1 Column-Wise and Multiple Function Application
Column-wise aggregation is just agg, or optimized methods such as mean and sum. To apply several aggregation functions at once:
# Pass a list to agg; its elements are function names or function objects.
In [9]: df.groupby('key1').agg(['mean', peak_to_peak])
Out[9]:
data1 data2
mean peak_to_peak mean peak_to_peak
key1
a 0.143435 2.047937 0.934671 1.279921
b 0.037396 1.884494 -1.087873 1.018864
# Custom column names can be given as (name, function) tuples
In [10]: df.groupby('key1').agg([('Mean','mean'),('bar',np.std)])
Out[10]:
data1 data2
Mean bar Mean bar
key1
a 0.143435 1.125854 0.934671 0.673188
b 0.037396 1.332538 -1.087873 0.720446
# To apply different functions to different columns, pass a dict
In [11]: func_dict = {'data1':'mean','data2':np.std}
In [12]: df.groupby('key1').agg(func_dict)
Out[12]:
data1 data2
key1
a 0.143435 0.673188
b 0.037396 0.720446
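A dict value can itself be a list of functions, which the transcript above does not show; columns mapped to several functions then get a hierarchical (two-level) column index. A small sketch with illustrative data (the values differ from the df above):

```python
import pandas as pd

# Illustrative frame shaped like the df above (different values)
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [-1.8, 1.7, 0.4, 1.2, 1.0],
                   'data2': [2.2, -0.1, -0.8, -1.1, 0.1]})

# A dict value may itself be a list of functions; the result then
# carries a hierarchical column index such as ('data1', 'min')
result = df.groupby('key1').agg({'data1': ['min', 'max'], 'data2': 'mean'})
```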
2.2 Returning Aggregated Data Without Row Indexes
By default groupby uses the unique group keys as the result's index; pass as_index=False to disable this.
In [13]: df.groupby('key1',as_index=False).mean()
Out[13]:
key1 data1 data2
0 a 0.143435 0.934671
1 b 0.037396 -1.087873
# The same result, via reset_index
In [17]: df.groupby('key1').mean().reset_index()
Out[17]:
key1 data1 data2
0 a 0.143435 0.934671
1 b 0.037396 -1.087873
3 Group-Wise Operations and Transformations
Aggregation is only a special case of group operations: it accepts functions that reduce a one-dimensional array to a scalar.
The transform and apply methods covered in this section can perform many other kinds of group operations.
In [19]: df
Out[19]:
data1 data2 key1 key2
0 0.897195 0.693473 a one
1 0.683851 1.695230 a two
2 -0.904851 -0.578441 b one
3 0.979643 -1.597305 b two
4 -1.150742 0.415309 a one
In [22]: k1_mean = df.groupby('key1').mean().add_prefix('mean_')
In [25]: df.merge(k1_mean,left_on='key1',right_index=True)
Out[25]:
data1 data2 key1 key2 mean_data1 mean_data2
0 0.897195 0.693473 a one 0.143435 0.934671
1 0.683851 1.695230 a two 0.143435 0.934671
4 -1.150742 0.415309 a one 0.143435 0.934671
2 -0.904851 -0.578441 b one 0.037396 -1.087873
3 0.979643 -1.597305 b two 0.037396 -1.087873
# This merge-based approach is somewhat clumsy.
Let's try transform instead.
In [12]: key = ['one','two','one','two','one']
In [13]: people
Out[13]:
a b c d e
Joe -0.632221 0.074549 -0.964887 -2.144542 1.787703
Steve -0.442427 0.925263 0.646603 0.719694 -0.208777
Wes -0.019415 0.032652 -0.475912 0.490444 -0.305742
Jim 0.214406 -0.263113 -1.867597 1.762480 1.024320
Travis 0.348362 -0.650411 -1.660043 -0.332139 -0.669246
In [14]: people.groupby(key).mean() # per-group means, grouping by the key list
Out[14]:
a b c d e
one -0.101091 -0.181070 -1.033614 -0.662079 0.270905
two -0.114010 0.331075 -0.610497 1.241087 0.407772
In [15]: people.groupby(key).transform(np.mean) # broadcast the per-group means back to the original rows
Out[15]:
a b c d e
Joe -0.101091 -0.181070 -1.033614 -0.662079 0.270905
Steve -0.114010 0.331075 -0.610497 1.241087 0.407772
Wes -0.101091 -0.181070 -1.033614 -0.662079 0.270905
Jim -0.114010 0.331075 -0.610497 1.241087 0.407772
Travis -0.101091 -0.181070 -1.033614 -0.662079 0.270905
In [16]: def demean(arr):
...: return arr-arr.mean()
...:
# transform applies the function to each group and places the
# results in the appropriate positions.
In [17]: demeaned = people.groupby(key).transform(demean)
In [18]: demeaned
Out[18]:
a b c d e
Joe -0.531130 0.255619 0.068727 -1.482463 1.516798
Steve -0.328417 0.594188 1.257100 -0.521393 -0.616549
Wes 0.081676 0.213722 0.557702 1.152523 -0.576647
Jim 0.328417 -0.594188 -1.257100 0.521393 0.616549
Travis 0.449453 -0.469341 -0.626429 0.329940 -0.940151
In [19]: demeaned.groupby(key).mean()
Out[19]:
a b c d e
one 1.850372e-17 1.850372e-17 -1.110223e-16 -5.551115e-17 0.0
two 0.000000e+00 0.000000e+00 -1.110223e-16 0.000000e+00 0.0
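transform accepts any function that either produces a scalar to broadcast or a same-sized array. Beyond demeaning, a common use is within-group standardization (z-scores); a sketch with made-up people values (not the ones above) and the same key list:

```python
import numpy as np
import pandas as pd

# Illustrative frame and grouping key, mirroring the example above
people = pd.DataFrame(np.arange(20.0).reshape(5, 4),
                      columns=list('abcd'),
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']

# Standardize within each group: subtract the group mean,
# divide by the group standard deviation
zscores = people.groupby(key).transform(lambda x: (x - x.mean()) / x.std())
```

After this transform, each group has mean 0 and standard deviation 1 in every column, which can be checked the same way the demeaned result was checked above.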
3.1 apply: General split-apply-combine
Like agg, transform is constrained: the passed function must either produce a scalar that can be broadcast or an array of the same size.
apply is more general: it splits the object into pieces, invokes the function on each piece, and then attempts to concatenate the pieces together.
Hence "split-apply-combine".
In [14]: tips = pd.read_csv('tips.csv')
In [15]: tips['tip_pct']=tips['tip']/tips['total_bill']
In [16]: tips
Out[16]:
total_bill tip sex smoker day time size tip_pct
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
In [17]: def top(df, n=5, column='tip_pct'):
    ...:     return df.sort_values(by=column)[-n:]
    ...:
In [18]: top(tips, n=6)
Out[18]:
total_bill tip sex smoker day time size tip_pct
109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
232 11.61 3.39 Male No Sat Dinner 2 0.291990
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
In [19]: tips.groupby('smoker').apply(top)
Out[19]:
total_bill tip sex smoker day time size tip_pct
smoker
No 88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
51 10.29 2.60 Female No Sun Dinner 2 0.252672
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
Yes 109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
In [20]: result = tips.groupby('smoker')['tip_pct'].describe()
In [21]: result
Out[21]:
smoker
No count 151.000000
mean 0.159328
std 0.039910
min 0.056797
25% 0.136906
50% 0.155625
75% 0.185014
max 0.291990
Yes count 93.000000
mean 0.163196
std 0.085119
min 0.035638
25% 0.106771
50% 0.153846
75% 0.195059
max 0.710345
Name: tip_pct, dtype: float64
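The stacked Series shown above comes from an older pandas. In current pandas, describe on a grouped Series returns a DataFrame directly, with one row per group and one column per statistic. A sketch with synthetic stand-in data (tips.csv is not reproduced here):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the tips data
rng = np.random.default_rng(0)
tips = pd.DataFrame({'smoker': ['No'] * 6 + ['Yes'] * 4,
                     'tip_pct': rng.uniform(0.05, 0.3, 10)})

# In current pandas this is already a DataFrame: one row per group,
# columns count/mean/std/min/25%/50%/75%/max
result = tips.groupby('smoker')['tip_pct'].describe()
```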
Suppressing the group keys in the result index:
In [22]: tips.groupby('smoker', group_keys=False).apply(top)
Out[22]:
total_bill tip sex smoker day time size tip_pct
88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
51 10.29 2.60 Female No Sun Dinner 2 0.252672
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
3.2 Quantile and Bucket Analysis
Equal-length buckets with cut:
In [23]: frame = DataFrame({'data1':np.random.randn(1000),'data2':np.random.randn(1000)})
In [24]: factor = pd.cut(frame.data1, 4) # four equal-length buckets
In [25]: factor[:5]
Out[25]:
0 (-0.122, 1.486]
1 (-1.73, -0.122]
2 (1.486, 3.0932]
3 (-1.73, -0.122]
4 (-0.122, 1.486]
Name: data1, dtype: category
Categories (4, object): [(-3.344, -1.73] < (-1.73, -0.122] < (-0.122, 1.486] < (1.486, 3.0932]]
In [26]: def get_stats(group):
...: return {"min":group.min(),
...: 'max':group.max(),
...: 'count':group.count(),
...: 'mean':group.mean()}
...:
In [27]: grouped = frame.data2.groupby(factor)
In [28]: grouped.apply(get_stats).unstack()
Out[28]:
count max mean min
data1
(-3.344, -1.73] 38.0 2.219138 0.068483 -2.613319
(-1.73, -0.122] 421.0 3.024953 -0.043523 -2.733123
(-0.122, 1.486] 472.0 3.882440 -0.023289 -4.073087
(1.486, 3.0932] 69.0 2.520630 0.067271 -2.593762
Equal-size buckets (sample quantiles) with qcut:
In [29]: grouping = pd.qcut(frame.data1, 10, labels=False) # labels=False returns bucket numbers 0-9
In [30]: grouped= frame.data2.groupby(grouping)
In [31]: grouped.apply(get_stats).unstack()
Out[31]:
count max mean min
data1
0 100.0 2.219138 -0.140476 -2.613319
1 100.0 2.877679 0.047305 -2.483696
2 100.0 2.716929 -0.043014 -2.446612
3 100.0 2.387355 -0.080337 -2.733123
4 100.0 3.024953 0.031980 -2.157749
5 100.0 2.617329 0.011424 -2.048438
6 100.0 2.362811 -0.074931 -2.942770
7 100.0 3.882440 0.024096 -3.038111
8 100.0 2.587888 -0.014792 -4.073087
9 100.0 2.520630 0.018031 -2.613203
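The same per-bucket statistics can be computed more directly with agg, skipping the dict-returning function and the unstack step. A sketch that rebuilds a comparable frame (the random values differ from above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame({'data1': rng.standard_normal(1000),
                      'data2': rng.standard_normal(1000)})

# labels=False yields the bucket number (0-9) for each row
grouping = pd.qcut(frame['data1'], 10, labels=False)

# agg with a list of names gives one column per statistic directly
stats = frame['data2'].groupby(grouping).agg(['min', 'max', 'count', 'mean'])
```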
3.3 Example: Filling Missing Values with Group-Specific Values
In [32]: s = Series(np.random.randn(6))
In [33]: s[::2] = np.nan
In [34]: s
Out[34]:
0 NaN
1 0.818362
2 NaN
3 0.624027
4 NaN
5 -1.260203
dtype: float64
In [35]: s.fillna(s.mean()) # fill NaNs with the overall mean
Out[35]:
0 0.060729
1 0.818362
2 0.060729
3 0.624027
4 0.060729
5 -1.260203
dtype: float64
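The call above fills every hole with a single global mean. To fill with group-specific values, as the section title promises, one approach is to fill from the per-group means computed by transform; a sketch with a hypothetical East/West grouping:

```python
import numpy as np
import pandas as pd

# Hypothetical data: four eastern and four western states
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, np.nan],
                 index=states)

# transform('mean') broadcasts each group's mean back to the
# original index, so fillna picks the right group's value per row
group_means = data.groupby(group_key).transform('mean')
filled = data.fillna(group_means)
```

Each NaN is replaced by the mean of its own group: Vermont gets the East mean, Nevada and Idaho get the West mean.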
3.4 Example: Random Sampling and Permutation
Build a deck of cards:
In [36]: suits = ['H','S','C','D'] # the four suits
In [42]: card_val = (list(range(1,11))+[10]*3) * 4 # blackjack-style values: face cards count as 10
# card names A through K
In [44]: base_names = ['A'] + list(range(2,11)) + ['J','Q','K']
In [45]: cards =[]
In [46]: for suit in suits:
...: cards.extend(str(num)+suit for num in base_names)
...:
# Build a Series mapping card name to value
In [47]: deck = Series(card_val, index=cards)
In [48]: deck
Out[48]:
AH 1
2H 2
3H 3
4H 4
5H 5
6H 6
...
Draw five cards at random:
In [49]: def draw(deck,n=5):
...: return deck.take(np.random.permutation(len(deck))[:n])
...:
# np.random.permutation returns a random ordering of positions
In [50]: draw(deck) # draw five cards at random
Out[50]:
7C 7
2D 2
10D 10
2H 2
AD 1
dtype: int64
Draw two random cards from each suit:
In [51]: get_suit = lambda card:card[-1]
In [52]: deck.groupby(get_suit).apply(draw, 2)
Out[52]:
C 2C 2
9C 9
D 8D 8
6D 6
H AH 1
10H 10
S 3S 3
6S 6
dtype: int64
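If the extra suit level in the result index is unwanted, group_keys=False drops it, leaving only the card names. A sketch that rebuilds the deck from the steps above:

```python
import numpy as np
import pandas as pd

# Rebuild the deck as above
suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'Q', 'K']
cards = [str(num) + suit for suit in suits for num in base_names]
deck = pd.Series(card_val, index=cards)

def draw(deck, n=5):
    # take n cards at random positions
    return deck.take(np.random.permutation(len(deck))[:n])

get_suit = lambda card: card[-1]

# group_keys=False keeps only the card names in the index,
# without the added suit level
sample = deck.groupby(get_suit, group_keys=False).apply(draw, n=2)
```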
3.5 Example: Group Weighted Average and Correlation
In [54]: df = DataFrame({'category':list('aaaabbbb'),
...: 'data':np.random.randn(8),
...: 'weights':np.random.rand(8)})
In [55]: df
Out[55]:
category data weights
0 a -0.239087 0.172904
1 a -0.615509 0.511186
2 a 1.015113 0.778079
3 a -0.726262 0.284090
4 b 0.443518 0.189707
5 b 1.025131 0.764808
6 b -1.725867 0.108514
7 b 0.613947 0.696840
In [56]: grouped = df.groupby('category') # group by category
# Weighted average within each group
In [57]: get_wavg = lambda g: np.average(g['data'],weights=g['weights'])
In [58]: grouped.apply(get_wavg)
Out[58]:
category
a 0.130299
b 0.629996
dtype: float64
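The section title also mentions correlation coefficients, which the transcript does not reach. A sketch of a per-group correlation between two columns via apply, with synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({'category': list('aaaabbbb'),
                   'x': rng.standard_normal(8),
                   'y': rng.standard_normal(8)})

# Correlation of x with y, computed separately within each group
corr_xy = df.groupby('category')[['x', 'y']].apply(lambda g: g['x'].corr(g['y']))
```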