Notes on "Python for Data Analysis" -- Data Aggregation and Grouping (groupby)

Data aggregation and grouping mainly covers:

Splitting a pandas object into pieces using one or more keys (functions, arrays, or DataFrame column names)

Computing group summary statistics such as count, mean, standard deviation, or a user-defined function

Applying a wide variety of functions to the columns of a DataFrame

Within-group transformations and other manipulations: normalization, linear regression, ranking, subset selection

Pivot tables and cross-tabulations

Group-wise analysis

 

groupby 

groupby applies the split -- apply -- combine process to a pandas data object.

For example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key1':['a','a','b','b','a'],
                   'key2':['one','two','two','one','one'],
                   'data1':np.random.randn(5),
                   'data2':np.random.randn(5)})
df
This returns:
  key1 key2     data1     data2
0    a  one  0.127349  1.477832
1    a  two -0.430755  0.397269
2    b  two -0.911272  0.669960
3    b  one  0.370609  0.469459
4    a  one  0.968631  0.885551

To group the data by key1 and compute the mean of the data1 column, we use groupby to do the splitting:
grouped = df['data1'].groupby(df['key1'])
grouped.mean()

key1
a    0.221742
b   -0.270332

Here the data has been aggregated by the group key, producing a new Series indexed by the unique values in key1.

If we pass multiple arrays at once:
mean = df['data1'].groupby([df['key1'],df['key2']]).mean()
mean

key1  key2
a     one     0.547990
      two    -0.430755
b     one     0.370609
      two    -0.911272
The resulting Series now has a hierarchical index (MultiIndex):
mean.unstack()

key2    one    two
key1        
a    0.547990    -0.430755
b    0.370609    -0.911272

The group keys can be any arrays of the right length, or simply column names (strings or numbers).
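
For example (a minimal sketch, passing the column name rather than the Series itself):

df.groupby('key1')['data1'].mean()              # same result as grouping by df['key1']
df.groupby(['key1', 'key2'])['data1'].mean()    # multiple keys given by column name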

In addition, the GroupBy method size() returns a Series containing the size of each group.
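
A quick sketch of size():

df.groupby(['key1', 'key2']).size()   # Series of group sizes, indexed by (key1, key2)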

Iterating over groups

A GroupBy object can be iterated over with a single key or with multiple keys:

for name, group in df.groupby('key1'):
    print(name)
    print(group)

a
  key1 key2     data1     data2
0    a  one -0.047866  0.323667
1    a  two  0.400731  0.521359
4    a  one  0.192612 -0.233259
b
  key1 key2     data1     data2
2    b  two -0.516440 -0.030872
3    b  one  1.204337 -1.718465
for (k1, k2), group in df.groupby(['key1','key2']):
    print(k1, k2)
    print(group)

a one
  key1 key2     data1     data2
0    a  one -0.047866  0.323667
4    a  one  0.192612 -0.233259
a two
  key1 key2     data1     data2
1    a  two  0.400731  0.521359
b one
  key1 key2     data1     data2
3    b  one  1.204337 -1.718465
b two
  key1 key2    data1     data2
2    b  two -0.51644 -0.030872

We can then do whatever we like with these pieces of data, for example collect them into a dict:

pieces = dict(list(df.groupby('key1')))
pieces['a']

    key1    key2    data1    data2
0    a    one    -0.047866    0.323667
1    a    two    0.400731    0.521359
4    a    one    0.192612    -0.233259
group = df.groupby(df.dtypes, axis = 1)  # group the columns (axis=1) by their dtype
dict(list(group))

{dtype('float64'):       data1     data2
 0 -0.047866  0.323667
 1  0.400731  0.521359
 2 -0.516440 -0.030872
 3  1.204337 -1.718465
 4  0.192612 -0.233259, dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  two
 3    b  one
 4    a  one}

Selecting a column or subset of columns

Indexing a GroupBy object created from a DataFrame with a single column name (string) or an array of column names selects those columns for aggregation.

df.groupby(['key1','key2'])[['data2']].mean()   # compute the mean of the data2 column only, displayed as a DataFrame

              data2
key1 key2
a    one   0.045204
     two   0.521359
b    one  -1.718465
     two  -0.030872

What comes back from the indexing operation is an already-grouped DataFrame.
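
Indexing with a single string instead of a one-element list yields a grouped Series rather than a DataFrame; a minimal sketch:

s_grouped = df.groupby(['key1', 'key2'])['data2']   # SeriesGroupBy object
s_grouped.mean()                                    # Series with a (key1, key2) MultiIndex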

Grouping with dicts and Series

With a dict:

people = pd.DataFrame(np.random.randn(5,5),
                     columns = ['a','b','c','d','e'],
                     index = ['A','B','C','D','E'])
people.loc[people.index[2:4], ['b','c']] = np.nan  # add some NA values
people

          a         b         c         d         e
A  1.136587 -1.014446  0.516930 -1.032332 -0.638470
B  0.971402 -2.668267  0.741624  1.000780  0.207016
C -0.816918       NaN       NaN  0.693395  0.432147
D  0.056326       NaN       NaN  0.143980  0.103061
E -0.986889  0.881119 -0.034649  1.266522  0.240111

mapping = {'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}  # dict mapping column names to group names
by_columns = people.groupby(mapping, axis = 1)    # pass the dict to groupby
by_columns.sum()

        blue       red
A  -0.515401 -0.516329
B   1.742404 -1.489849
C   0.693395 -0.384771
D   0.143980  0.159386
E   1.231873  0.134341

With a Series:

map_series = pd.Series(mapping)

map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

people.groupby(map_series, axis = 1).count()

   blue  red
A     2    3
B     2    3
C     1    2
D     1    2
E     2    3

Grouping with functions

people = pd.DataFrame(np.random.randn(5,5),
                     columns = ['a','b','c','d','e'],
                     index = ['Joe','Tony','Steve','Jim','Travis'])
people.loc[people.index[2:4], ['b','c']] = np.nan  # add some NA values
people.groupby(len).sum()  # group the data by the length of each name

           a         b         c         d         e
3   2.551209 -0.095830 -0.388745 -1.382552 -0.678547
4  -0.644126  1.371943 -0.316472  0.121421 -1.208201
5  -0.557207  0.000000  0.000000 -0.102257  0.438465
6   1.228693  0.056033 -0.472999  1.077108 -0.772639

Mixing arrays, lists, dicts, and functions

key_list = ['one','one','two','three','two']
people.groupby([len,key_list]).max()
 
                 a         b         c         d         e
3 one     2.189451 -0.095830 -0.388745 -1.208221 -0.826994
  three   0.361758       NaN       NaN -0.174331  0.148447
4 one    -0.644126  1.371943 -0.316472  0.121421 -1.208201
5 two    -0.557207       NaN       NaN -0.102257  0.438465
6 two     1.228693  0.056033 -0.472999  1.077108 -0.772639

Grouping by index levels

 

columns = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],
                                    [1,3,5,1,3]], names = ['cty','tenor'])
hier_df = pd.DataFrame(np.random.randn(4,5),columns = columns)
hier_df


cty          US                            JP
tenor         1         3         5         1         3
0     -0.248862 -1.474661 -0.483425  0.217972  0.957090
1      0.707428 -1.150710  0.101633  1.821193  2.027130
2      0.752401  0.850991  1.068320  0.327417 -0.331883
3     -0.032896  0.819291 -0.270272  1.750449  0.295412

hier_df.groupby(level = 'cty', axis = 1).count()      # to aggregate over a level of a hierarchical index, pass the level number or name via level=

cty  JP    US
0    2    3
1    2    3
2    2    3
3    2    3

Data aggregation

Aggregation refers to any data transformation that produces scalar values from arrays. Common aggregation methods include:

count --- number of non-NA values

sum --- sum of non-NA values

mean --- mean of non-NA values

median --- arithmetic median of non-NA values

std, var --- standard deviation and variance (denominator n-1)

min, max --- minimum and maximum of non-NA values

prod --- product of non-NA values

first, last --- first and last non-NA values

quantile(0.9) --- the 90% sample quantile (a Series method; see the short sketch below)
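
For example, quantile is a Series method rather than a GroupBy-specific aggregation, but it still works through groupby (a minimal sketch using the df from above):

df.groupby('key1')['data1'].quantile(0.9)   # 90% sample quantile of data1 within each group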

Of course, we can also define our own aggregation functions, for example:

df

    key1    key2    data1    data2
0    a    one    -0.047866    0.323667
1    a    two    0.400731    0.521359
2    b    two    -0.516440    -0.030872
3    b    one    1.204337    -1.718465
4    a    one    0.192612    -0.233259

grouped = df.groupby('key1')      # regroup the whole DataFrame by key1

def peak_to_peak(arr):            # custom aggregation function
    return arr.max() - arr.min()
grouped.agg(peak_to_peak)         # just pass it to GroupBy's agg method

        data1    data2
key1        
a    0.448598    0.754618
b    1.720777    1.687592

Column-wise and multiple function application

tips = pd.read_csv('C:/Users/Administrator/Desktop/IPYTHON/tips.csv')
tips['tip_pct'] = tips['tip']/tips['total_bill']
tips[:5]
# returns:
    total_bill    tip    sex    smoker    day    time    size    tip_pct
0    16.99    1.01    Female    No    Sun    Dinner    2    0.059447
1    10.34    1.66    Male    No    Sun    Dinner    3    0.160542
2    21.01    3.50    Male    No    Sun    Dinner    3    0.166587
3    23.68    3.31    Male    No    Sun    Dinner    2    0.139780
4    24.59    3.61    Female    No    Sun    Dinner    4    0.146808
grouped = tips.groupby(['sex','smoker'])  # group by sex and smoker
grouped_pct = grouped['tip_pct']
grouped_pct.agg('mean')                # a function can be passed by name as a string
# returns:
sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
grouped_pct.agg(['mean','std',peak_to_peak])   # to apply a set of functions at once, pass them as a list
# returns:
                 mean    std    peak_to_peak
sex    smoker            
Female    No    0.156921    0.036421    0.195876
         Yes    0.182150    0.071595    0.360233
Male      No    0.160669    0.041849    0.220186
         Yes    0.152771    0.090588    0.674707
grouped_pct.agg([('foo','mean'),('bar',np.std)])   # custom column names can be given as (name, function) tuples
# returns:
                       foo    bar
sex    smoker        
Female    No    0.156921    0.036421
         Yes    0.182150    0.071595
Male     No    0.160669    0.041849
        Yes    0.152771    0.090588
func = ['count','mean','max']
result = grouped[['tip_pct','total_bill']].agg(func)
result
# returns:
               tip_pct                     total_bill
                 count      mean       max      count       mean    max
sex    smoker
Female No           54  0.156921  0.252672         54  18.105185  35.83
       Yes          33  0.182150  0.416667         33  17.977879  44.30
Male   No           97  0.160669  0.291990         97  19.791237  48.33
       Yes          60  0.152771  0.710345         60  22.284500  50.81
grouped.agg({'tip':np.max,'size':'sum'})     # to apply different functions to different columns, pass agg a dict mapping column names to functions
# returns:
    tip    size
sex    smoker        
Female    No    5.2    140
         Yes    6.5    74
Male     No    9.0    263
        Yes    10.0    150
grouped.agg({'tip_pct':['min','max','mean','std'],'size':'sum'})  # the dict values can themselves be lists of functions
# returns:
                tip_pct                                 size
                    min       max      mean       std   sum
sex    smoker
Female No      0.056797  0.252672  0.156921  0.036421   140
       Yes     0.056433  0.416667  0.182150  0.071595    74
Male   No      0.071804  0.291990  0.160669  0.041849   263
       Yes     0.035638  0.710345  0.152771  0.090588   150

Group-wise operations and transformations

transform applies a function to each group and then places the results in the appropriate locations:

Using the earlier people data as an example:
key = ['one','two','one','two','one']
people.groupby(key).mean()


             a         b         c         d         e
one  -0.142316 -0.065082  2.255068  1.211343  0.817666
two   1.493163 -0.169286  0.671765  1.290332  0.093891
After applying transform:
people.groupby(key).transform(np.mean)

               a         b         c         d         e
Joe    -0.142316 -0.065082  2.255068  1.211343  0.817666
Tony    1.493163 -0.169286  0.671765  1.290332  0.093891
Steve  -0.142316 -0.065082  2.255068  1.211343  0.817666
Jim     1.493163 -0.169286  0.671765  1.290332  0.093891
Travis -0.142316 -0.065082  2.255068  1.211343  0.817666

If we want to subtract the group mean from each value, we can write a demeaning function and pass it to transform:

def demean(arr):
    return arr - arr.mean()
demeaned = people.groupby(key).transform(demean)
demeaned

               a         b         c         d         e
Joe     2.102734 -0.111083  0.242585 -0.334759 -0.008227
Tony    0.339766  0.000000  0.000000  0.630333 -0.157031
Steve  -1.583725       NaN       NaN  0.503897 -0.087889
Jim    -0.339766       NaN       NaN -0.630333  0.157031
Travis -0.519009  0.111083 -0.242585 -0.169138  0.096116
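
As a quick check, the group means of the demeaned data should now be (numerically) zero:

demeaned.groupby(key).mean()   # values all very close to 0, up to floating-point error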

apply splits the object being manipulated into pieces, invokes the passed function on each piece, and then attempts to concatenate the pieces back together.

Using the tips data as an example:

def top(df, n = 5, column = 'tip_pct'):  # return the n rows with the largest values in the given column
    return df.sort_values(by = column)[-n:]
top(tips, n = 6)    # the 6 rows with the highest tip_pct

    total_bill    tip    sex    smoker    day    time    size    tip_pct
109    14.31    4.00    Female    Yes    Sat    Dinner    2    0.279525
183    23.17    6.50    Male    Yes    Sun    Dinner    4    0.280535
232    11.61    3.39    Male    No    Sat    Dinner    2    0.291990
67    3.07    1.00    Female    Yes    Sat    Dinner    1    0.325733
178    9.60    4.00    Female    Yes    Sun    Dinner    2    0.416667
172    7.25    5.15    Male    Yes    Sun    Dinner    2    0.710345

tips.groupby('smoker').apply(top)  # group by smoker and call apply with top

    total_bill    tip    sex    smoker    day    time    size    tip_pct
smoker                                    
No    88    24.71    5.85    Male    No    Thur    Lunch    2    0.236746
185    20.69    5.00    Male    No    Sun    Dinner    5    0.241663
51    10.29    2.60    Female    No    Sun    Dinner    2    0.252672
149    7.51    2.00    Male    No    Thur    Lunch    2    0.266312
232    11.61    3.39    Male    No    Sat    Dinner    2    0.291990
Yes    109    14.31    4.00    Female    Yes    Sat    Dinner    2    0.279525
183    23.17    6.50    Male    Yes    Sun    Dinner    4    0.280535
67    3.07    1.00    Female    Yes    Sat    Dinner    1    0.325733
178    9.60    4.00    Female    Yes    Sun    Dinner    2    0.416667
172    7.25    5.15    Male    Yes    Sun    Dinner    2    0.710345

tips.groupby(['smoker','day']).apply(top,n = 1,column = 'total_bill')
    total_bill    tip    sex    smoker    day    time    size    tip_pct
smoker    day                                    
No    Fri    94    22.75    3.25    Female    No    Fri    Dinner    2    0.142857
Sat    212    48.33    9.00    Male    No    Sat    Dinner    4    0.186220
Sun    156    48.17    5.00    Male    No    Sun    Dinner    6    0.103799
Thur    142    41.19    5.00    Male    No    Thur    Lunch    5    0.121389
Yes    Fri    95    40.17    4.73    Male    Yes    Fri    Dinner    4    0.117750
Sat    170    50.81    10.00    Male    Yes    Sat    Dinner    3    0.196812
Sun    182    45.35    3.50    Male    Yes    Sun    Dinner    3    0.077178
Thur    197    43.11    5.00    Female    Yes    Thur    Lunch    4    0.115982

Passing group_keys = False to groupby turns off the addition of the group keys to the resulting index.
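
A minimal sketch with the tips data from above:

tips.groupby('smoker', group_keys = False).apply(top)   # same rows as before, but without the 'No'/'Yes' outer index level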

Quantile and bucket analysis: combining pandas' binning functions (cut, qcut) with groupby makes it easy to perform bucket or quantile analysis on a dataset:

frame = pd.DataFrame({'data1':np.random.randn(1000),
                      'data2':np.random.randn(1000)})
factor = pd.cut(frame.data1,4)  # cut data1 into 4 equal-length buckets
def get_stats(group):           # statistics to compute on data2 within each bucket
    return {'min':group.min(),'max':group.max(),'count':group.count(),
            'mean':group.mean()}
grouped = frame.data2.groupby(factor)  # grouping data2 by factor means the statistics are computed within the data1 buckets
grouped.apply(get_stats).unstack()
This returns:
                     count       max      mean       min
data1
(-3.196, -1.594]      55.0  2.567283 -0.322128 -2.235570
(-1.594, 0.00267]    418.0  2.505320 -0.052012 -3.109679
(0.00267, 1.599]     467.0  3.187789  0.069884 -3.000375
(1.599, 3.195]        60.0  1.873327  0.022410 -1.934497
grouping = pd.qcut(frame.data1,10,labels = False)  # qcut uses sample quantiles to produce equal-sized buckets; labels = False returns just the bucket numbers
grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
This returns:
     count      max           mean         min
data1                
0    100.0    2.567283    -0.184858    -2.235570
1    100.0    2.235197    -0.025099    -2.251106
2    100.0    2.011793    -0.096504    -2.184717
3    100.0    1.953811    -0.120498    -2.911569
4    100.0    3.124611    0.156681    -3.109679
5    100.0    3.067191    0.075880    -2.844021
6    100.0    2.328596    0.082201    -2.068833
7    100.0    3.187789    0.105087    -2.310712
8    100.0    2.624968    -0.047236    -3.000375
9    100.0    2.735516    -0.000431    -2.182246

Example: filling missing values with group-specific values

states = ['Ohio','New York','Vermont','Florida','Oregon','Nevada','California','Idaho']
group_key = ['East'] * 4 + ['West'] *4
data = pd.Series(np.random.randn(8),index = states)
data[['Vermont','Nevada','Idaho']] = np.nan
data

data.groupby(group_key).mean()  # group means (NA values are excluded)

fill_mean = lambda g: g.fillna(g.mean())  # fill NA values with the group mean
data.groupby(group_key).apply(fill_mean)

fill_values = {'East':0.5,'West':-1}  # predefined fill values, looked up via each group's name attribute
fill_func = lambda g: g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)

Pivot tables and cross-tabulations
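
Pivot tables and cross-tabulations build on the same groupby machinery. A minimal sketch with the tips data from above (pivot_table aggregates with mean by default; crosstab counts group frequencies):

tips.pivot_table('tip_pct', index = ['sex','smoker'])  # mean tip_pct grouped by sex and smoker
pd.crosstab(tips.day, tips.smoker, margins = True)     # counts of smokers vs. non-smokers by day, with totals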

 
