数据聚合与分组运算学习笔记1
本章内容:
- 根据一个或多个键拆分pandas对象;
- 计算分数摘要统计;
- 对DataFrame的列应用各种函数;
- 应用组内转换或运算;
- 计算透视表或交叉表;
- 执行分位数分析等;
1 GroupBy技术
分组聚合演示:
In [4]: df = DataFrame({'key1':list('aabba'),'key2':['one','two','one','two','o
...: ne'],'data1':np.random.randn(5),'data2':np.random.randn(5)})
In [5]: df
Out[5]:
data1 data2 key1 key2
0 -0.187814 -0.224238 a one
1 -1.363033 0.087923 a two
2 0.624801 -0.132735 b one
3 0.643536 0.335998 b two
4 -1.318154 0.208845 a one
In [6]: grouped=df['data1'].groupby(df['key1']) # 将data1数据按照key1进行分组
In [7]: grouped
Out[7]: <pandas.core.groupby.SeriesGroupBy object at 0x0000000007D7FEF0>
In [8]: grouped.mean() # 求各分组的平均值
Out[8]:
key1
a -0.956334
b 0.634169
Name: data1, dtype: float64
In [9]: means = df['data1'].groupby([df['key1'],df['key2']]).mean()
In [10]: means #分层索引 进行分组
Out[10]:
key1 key2
a one -0.752984
two -1.363033
b one 0.624801
two 0.643536
Name: data1, dtype: float64
In [11]: means.unstack() # 解stack
Out[11]:
key2 one two
key1
a -0.752984 -1.363033
b 0.624801 0.643536
# 数据和分组是Series,其实也可以是其他任意的数组,arange
In [12]: df.groupby('key1').mean()
Out[12]: # key2不是数值列,因此属于“麻烦列”,不能用于计算mean,所以去除掉了。
data1 data2
key1
a -0.956334 0.024177
b 0.634169 0.101632
In [13]: df.groupby(['key1','key2']).mean()
Out[13]:
data1 data2
key1 key2
a one -0.752984 -0.007696
two -1.363033 0.087923
b one 0.624801 -0.132735
two 0.643536 0.335998
# 返回分组后,每组的数量
In [17]: df.groupby(['key1','key2']).size()
Out[17]:
key1 key2
a one 2
two 1
b one 1
two 1
dtype: int64
1.1 对分组进行迭代
对groupby对象,可以进行迭代。
# name: set(key1)中的值,进行循环组名
# group:每个name对应的df的子集
In [18]: for name,group in df.groupby('key1'):
...: print(name)
...: print(group)
...:
a
data1 data2 key1 key2
0 -0.187814 -0.224238 a one
1 -1.363033 0.087923 a two
4 -1.318154 0.208845 a one
b
data1 data2 key1 key2
2 0.624801 -0.132735 b one
3 0.643536 0.335998 b two
# 使用(k1,k2)元组来循环分组的组名
In [20]: for (k1,k2),group in df.groupby(['key1','key2']):
...: print(k1,k2)
...: print(group)
...:
('a', 'one')
data1 data2 key1 key2
0 -0.187814 -0.224238 a one
4 -1.318154 0.208845 a one
('a', 'two')
data1 data2 key1 key2
1 -1.363033 0.087923 a two
('b', 'one')
data1 data2 key1 key2
2 0.624801 -0.132735 b one
('b', 'two')
data1 data2 key1 key2
3 0.643536 0.335998 b two
# 可以进行字典、列表操作
In [21]: dict(list(df.groupby('key1')))['b']
Out[21]:
data1 data2 key1 key2
2 0.624801 -0.132735 b one
3 0.643536 0.335998 b two
In [24]: df.dtypes
Out[24]:
data1 float64
data2 float64
key1 object
key2 object
dtype: object
# 默认使用axis=0进行分类,也可以按照列进行分类
In [25]: df.groupby(df.dtypes,axis=1)
Out[25]: <pandas.core.groupby.DataFrameGroupBy object at 0x0000000007F024A8>
In [26]: grouped = df.groupby(df.dtypes,axis=1)
In [27]: dict(list(grouped))
Out[27]:
{dtype('float64'): data1 data2
0 -0.187814 -0.224238
1 -1.363033 0.087923
2 0.624801 -0.132735
3 0.643536 0.335998
4 -1.318154 0.208845, dtype('O'): key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one}
1.2 选取一个或一组列
In [28]: df.groupby('key1')['data1'] # 等价于[31]
Out[28]: <pandas.core.groupby.SeriesGroupBy object at 0x0000000007F02550>
In [29]: df.groupby('key1')[['data1','data2']] # 等价于[33]
Out[29]: <pandas.core.groupby.DataFrameGroupBy object at 0x0000000007E448D0
In [31]: df['data1'].groupby(df['key1'])
Out[31]: <pandas.core.groupby.SeriesGroupBy object at 0x0000000007F02198>
In [33]: df[['data1','data2']].groupby(df['key1'])
Out[33]: <pandas.core.groupby.DataFrameGroupBy object at 0x0000000007FE8EF0>
1.3 通过字典或Series进行分组
In [35]: people = DataFrame(np.random.randn(5,5),columns=list('abcde'),index=['
...: Joe','Steve','Wes','Jim','Travis'])
In [36]: people.ix[2:3,'b':'c'] = np.nan
In [37]: people
Out[37]:
a b c d e
Joe 0.231921 -0.185109 -1.103585 0.919247 0.827190
Steve 0.336541 -0.358981 -0.897516 0.772835 -1.774432
Wes -0.069938 NaN NaN 0.268681 -1.199919
Jim 1.324175 -0.968378 -2.376905 -0.222219 -2.366535
Travis 0.815229 -1.139785 -2.045500 -1.576867 -0.578046
# 用字典进行分组
In [38]: mapping = {'a':'red','b':'green','c':'red','d':'red','e':'green'}
In [39]: people.groupby(mapping,axis=1)
Out[39]: <pandas.core.groupby.DataFrameGroupBy object at 0x000000000803A048>
In [40]: people.groupby(mapping,axis=1).sum() # 将列进行重新映射
Out[40]:
green red
Joe 0.642080 0.047583
Steve -2.133413 0.211860
Wes -1.199919 0.198743
Jim -3.334913 -1.274948
Travis -1.717832 -2.807137
# 同样的,也可以使用Series进行分组
In [44]: mapping = Series(mapping)
In [45]: people.groupby(mapping,axis=1).count()
Out[45]:
green red
Joe 2 3
Steve 2 3
Wes 1 2
Jim 2 3
Travis 2 3
1.4 通过函数进行分组
groupby也可以传入函数,分别在索引值上执行一便,并将返回值作为新的分组名称。
In [46]: people
Out[46]:
a b c d e
Joe 0.231921 -0.185109 -1.103585 0.919247 0.827190
Steve 0.336541 -0.358981 -0.897516 0.772835 -1.774432
Wes -0.069938 NaN NaN 0.268681 -1.199919
Jim 1.324175 -0.968378 -2.376905 -0.222219 -2.366535
Travis 0.815229 -1.139785 -2.045500 -1.576867 -0.578046
# 按照人名长度进行分类,并计算对应列的和
In [47]: people.groupby(len).sum()
Out[47]:
a b c d e
3 1.486159 -1.153487 -3.480490 0.965709 -2.739264
5 0.336541 -0.358981 -0.897516 0.772835 -1.774432
6 0.815229 -1.139785 -2.045500 -1.576867 -0.578046
In [48]: key_list = ['one','one','one','two','two']
# 按照人名、keylist进行分组,并计算组中最小值
In [49]: people.groupby([len,key_list]).min()
Out[49]:
a b c d e
3 one -0.069938 -0.185109 -1.103585 0.268681 -1.199919
two 1.324175 -0.968378 -2.376905 -0.222219 -2.366535
5 one 0.336541 -0.358981 -0.897516 0.772835 -1.774432
6 two 0.815229 -1.139785 -2.045500 -1.576867 -0.578046
1.5 按照索引级别进行分组
In [51]: columns = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],[1,3,5
...: ,1,3]],names=['cty','tenor'])
In [52]: hier_df = DataFrame(np.random.randn(4,5), columns=columns)
In [53]: hier_df
Out[53]:
cty US JP
tenor 1 3 5 1 3
0 -0.295106 1.021162 -0.233712 -0.708512 0.617306
1 0.404333 1.224998 -0.169200 0.939965 -1.030823
2 0.095957 0.617305 0.867929 1.774633 0.800739
3 1.433123 0.355095 -0.377080 -0.360470 0.690263
In [54]: hier_df.groupby(level='cty',axis=1).count() #按照级别cty进行分组
Out[54]:
cty JP US
0 2 3
1 2 3
2 2 3
3 2 3
待续。。。