数据聚合与分组运算学习笔记1

本章内容:

  • 根据一个或多个键拆分pandas对象;
  • 计算分数摘要统计;
  • 对DataFrame的列应用各种函数;
  • 应用组内转换或运算;
  • 计算透视表或交叉表;
  • 执行分位数分析等;

1 GroupBy技术

分组聚合演示:

In [4]: df = DataFrame({'key1':list('aabba'),'key2':['one','two','one','two','o
   ...: ne'],'data1':np.random.randn(5),'data2':np.random.randn(5)})

In [5]: df
Out[5]:
      data1     data2 key1 key2
0 -0.187814 -0.224238    a  one
1 -1.363033  0.087923    a  two
2  0.624801 -0.132735    b  one
3  0.643536  0.335998    b  two
4 -1.318154  0.208845    a  one

In [6]: grouped=df['data1'].groupby(df['key1'])  # 将data1数据按照key1进行分组

In [7]: grouped
Out[7]: <pandas.core.groupby.SeriesGroupBy object at 0x0000000007D7FEF0>

In [8]: grouped.mean()  # 求各分组的平均值
Out[8]:
key1
a   -0.956334
b    0.634169
Name: data1, dtype: float64

In [9]: means = df['data1'].groupby([df['key1'],df['key2']]).mean()

In [10]: means   #分层索引 进行分组
Out[10]:
key1  key2
a     one    -0.752984
      two    -1.363033
b     one     0.624801
      two     0.643536
Name: data1, dtype: float64

In [11]: means.unstack()  # 解stack
Out[11]:
key2       one       two
key1
a    -0.752984 -1.363033
b     0.624801  0.643536

# 数据和分组是Series,其实也可以是其他任意的数组,arange

In [12]: df.groupby('key1').mean()
Out[12]:  # key2不是数值列,因此属于“麻烦列”,不能用于计算mean,所以去除掉了。
         data1     data2
key1
a    -0.956334  0.024177
b     0.634169  0.101632

In [13]: df.groupby(['key1','key2']).mean()
Out[13]:
              data1     data2
key1 key2
a    one  -0.752984 -0.007696
     two  -1.363033  0.087923
b    one   0.624801 -0.132735
     two   0.643536  0.335998

# 返回分组后,每组的数量
In [17]: df.groupby(['key1','key2']).size()
Out[17]:
key1  key2
a     one     2
     two     1
b     one     1
     two     1
dtype: int64

1.1 对分组进行迭代

对groupby对象,可以进行迭代。

# name: set(key1)中的值,进行循环组名
# group:每个name对应的df的子集
In [18]: for name,group in df.groupby('key1'):
    ...:     print(name)
    ...:     print(group)
    ...:
a
      data1     data2 key1 key2
0 -0.187814 -0.224238    a  one
1 -1.363033  0.087923    a  two
4 -1.318154  0.208845    a  one
b
      data1     data2 key1 key2
2  0.624801 -0.132735    b  one
3  0.643536  0.335998    b  two

# 使用(k1,k2)元组来循环分组的组名
In [20]: for (k1,k2),group in df.groupby(['key1','key2']):
    ...:     print(k1,k2)
    ...:     print(group)
    ...:
('a', 'one')
      data1     data2 key1 key2
0 -0.187814 -0.224238    a  one
4 -1.318154  0.208845    a  one
('a', 'two')
      data1     data2 key1 key2
1 -1.363033  0.087923    a  two
('b', 'one')
      data1     data2 key1 key2
2  0.624801 -0.132735    b  one
('b', 'two')
      data1     data2 key1 key2
3  0.643536  0.335998    b  two

# 可以进行字典、列表操作
In [21]: dict(list(df.groupby('key1')))['b']
Out[21]:
      data1     data2 key1 key2
2  0.624801 -0.132735    b  one
3  0.643536  0.335998    b  two

In [24]: df.dtypes
Out[24]:
data1    float64
data2    float64
key1      object
key2      object
dtype: object

# 默认使用axis=0进行分类,也可以按照列进行分类
In [25]: df.groupby(df.dtypes,axis=1)
Out[25]: <pandas.core.groupby.DataFrameGroupBy object at 0x0000000007F024A8>

In [26]: grouped = df.groupby(df.dtypes,axis=1)

In [27]: dict(list(grouped))
Out[27]:
{dtype('float64'):       data1     data2
 0 -0.187814 -0.224238
 1 -1.363033  0.087923
 2  0.624801 -0.132735
 3  0.643536  0.335998
 4 -1.318154  0.208845, dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  one
 3    b  two
 4    a  one}

1.2 选取一个或一组列

In [28]: df.groupby('key1')['data1'] # 等价于[31]
Out[28]: <pandas.core.groupby.SeriesGroupBy object at 0x0000000007F02550>

In [29]: df.groupby('key1')[['data1','data2']] # 等价于[33]
Out[29]: <pandas.core.groupby.DataFrameGroupBy object at 0x0000000007E448D0

In [31]: df['data1'].groupby(df['key1'])
Out[31]: <pandas.core.groupby.SeriesGroupBy object at 0x0000000007F02198>

In [33]: df[['data1','data2']].groupby(df['key1'])
Out[33]: <pandas.core.groupby.DataFrameGroupBy object at 0x0000000007FE8EF0>

1.3 通过字典或Series进行分组

In [35]: people = DataFrame(np.random.randn(5,5),columns=list('abcde'),index=['
    ...: Joe','Steve','Wes','Jim','Travis'])

In [36]: people.ix[2:3,'b':'c'] = np.nan

In [37]: people
Out[37]:
               a         b         c         d         e
Joe     0.231921 -0.185109 -1.103585  0.919247  0.827190
Steve   0.336541 -0.358981 -0.897516  0.772835 -1.774432
Wes    -0.069938       NaN       NaN  0.268681 -1.199919
Jim     1.324175 -0.968378 -2.376905 -0.222219 -2.366535
Travis  0.815229 -1.139785 -2.045500 -1.576867 -0.578046

# 用字典进行分组
In [38]: mapping = {'a':'red','b':'green','c':'red','d':'red','e':'green'}

In [39]: people.groupby(mapping,axis=1)
Out[39]: <pandas.core.groupby.DataFrameGroupBy object at 0x000000000803A048>

In [40]: people.groupby(mapping,axis=1).sum()  # 将列进行重新映射
Out[40]:
           green       red
Joe     0.642080  0.047583
Steve  -2.133413  0.211860
Wes    -1.199919  0.198743
Jim    -3.334913 -1.274948
Travis -1.717832 -2.807137

# 同样的,也可以使用Series进行分组
In [44]: mapping = Series(mapping)

In [45]: people.groupby(mapping,axis=1).count()
Out[45]:
        green  red
Joe         2    3
Steve       2    3
Wes         1    2
Jim         2    3
Travis      2    3

1.4 通过函数进行分组

groupby也可以传入函数,分别在索引值上执行一便,并将返回值作为新的分组名称。

In [46]: people
Out[46]:
               a         b         c         d         e
Joe     0.231921 -0.185109 -1.103585  0.919247  0.827190
Steve   0.336541 -0.358981 -0.897516  0.772835 -1.774432
Wes    -0.069938       NaN       NaN  0.268681 -1.199919
Jim     1.324175 -0.968378 -2.376905 -0.222219 -2.366535
Travis  0.815229 -1.139785 -2.045500 -1.576867 -0.578046

# 按照人名长度进行分类,并计算对应列的和
In [47]: people.groupby(len).sum()
Out[47]:
          a         b         c         d         e
3  1.486159 -1.153487 -3.480490  0.965709 -2.739264
5  0.336541 -0.358981 -0.897516  0.772835 -1.774432
6  0.815229 -1.139785 -2.045500 -1.576867 -0.578046

In [48]: key_list = ['one','one','one','two','two']
# 按照人名、keylist进行分组,并计算组中最小值
In [49]: people.groupby([len,key_list]).min()
Out[49]:
              a         b         c         d         e
3 one -0.069938 -0.185109 -1.103585  0.268681 -1.199919
  two  1.324175 -0.968378 -2.376905 -0.222219 -2.366535
5 one  0.336541 -0.358981 -0.897516  0.772835 -1.774432
6 two  0.815229 -1.139785 -2.045500 -1.576867 -0.578046

1.5 按照索引级别进行分组

In [51]: columns = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],[1,3,5
    ...: ,1,3]],names=['cty','tenor'])

In [52]: hier_df = DataFrame(np.random.randn(4,5), columns=columns)

In [53]: hier_df
Out[53]:
cty          US                            JP
tenor         1         3         5         1         3
0     -0.295106  1.021162 -0.233712 -0.708512  0.617306
1      0.404333  1.224998 -0.169200  0.939965 -1.030823
2      0.095957  0.617305  0.867929  1.774633  0.800739
3      1.433123  0.355095 -0.377080 -0.360470  0.690263

In [54]: hier_df.groupby(level='cty',axis=1).count()  #按照级别cty进行分组
Out[54]:
cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3

待续。。。

posted @ 2017-02-13 14:32  felo  阅读(546)  评论(0编辑  收藏  举报