pandas.DataFrame.groupby: group a DataFrame using a mapper or by a Series of columns

Syntax

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=_NoDefault.no_default, squeeze=_NoDefault.no_default, observed=False, dropna=True)

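The most commonly used parameters:

  • by: a mapping, function, label, or list of labels. Determines how the rows are assigned to groups.
  • axis: 0 (index) or 1 (columns), i.e. split along rows or along columns. The default is to split along rows.
  • level: an int, a level name, or a sequence of these; default None. Groups by one or more index levels. Do not specify it together with by.
  • as_index: bool, default True. For aggregated output, True returns an object with the group labels as its index.
  • dropna: bool, default True. If True, rows whose group key is NA are dropped; if False, NA is kept as a group key of its own.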
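As a rough illustration of the by options listed above, the sketch below groups a small made-up frame first with a dict mapping and then with a function. The frame df, the group names g1/g2 and the labels early/late are assumptions for this example only. In both cases the mapping or function is applied to the index labels.

```python
import pandas as pd

# Hypothetical data, not taken from the examples in this post
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

# A dict maps each index label to the name of its group
mapping = {"a": "g1", "b": "g1", "c": "g2", "d": "g2"}
by_mapping = df.groupby(mapping)["value"].sum()
print(by_mapping)

# A function is called once per index label and returns the group name
by_function = df.groupby(lambda label: "early" if label < "c" else "late")["value"].sum()
print(by_function)
```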

Code examples

import pandas as pd

# Build the DataFrame
d1 = [[3,"negative",2,1],[4,None,1,2],[5,"positive",0,2],[6,"positive",2,3],[3,"positive",6,4]]
df1 = pd.DataFrame(d1, columns=["xuhao","result","value1","value2"], index=["a","b","c","a","b"])
print(df1)
#    xuhao    result  value1  value2
# a      3  negative       2       1
# b      4      None       1       2
# c      5  positive       0       2
# a      6  positive       2       3
# b      3  positive       6       4

# Group by a single column with groupby()
groups1 = df1.groupby(['result']).mean()
print(groups1)
#              xuhao    value1  value2
# result
# negative  3.000000  2.000000     1.0
# positive  4.666667  2.666667     3.0
# With dropna=False, rows whose key is missing form their own NaN group
groups1_1 = df1.groupby(['result'],dropna=False).mean()
print(groups1_1)
#              xuhao    value1  value2
# result
# negative  3.000000  2.000000     1.0
# positive  4.666667  2.666667     3.0
# NaN       4.000000  1.000000     2.0

# Group by two columns with groupby()
groups2 = df1.groupby(["xuhao",'result']).mean()
print(groups2)
#                 value1  value2
# xuhao result
# 3     negative     2.0     1.0
#       positive     6.0     4.0
# 5     positive     0.0     2.0
# 6     positive     2.0     3.0

# Group by two columns and take the mean of only one column
groups3 = df1.groupby(["xuhao",'result'])["value1"].mean()
print(groups3)
# xuhao  result
# 3      negative    2.0
#        positive    6.0
# 5      positive    0.0
# 6      positive    2.0
# Name: value1, dtype: float64

# Set as_index=False so that the group labels stay as regular columns instead of becoming the index
groups4 = df1.groupby(["xuhao",'result'], as_index=False).mean() 
print(groups4)
#    xuhao    result  value1  value2
# 0      3  negative     2.0     1.0
# 1      3  positive     6.0     4.0
# 2      5  positive     0.0     2.0
# 3      6  positive     2.0     3.0

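# Group by the row index
# numeric_only=True skips the string column "result", which cannot be averaged
groups5 = df1.groupby(level=0).mean(numeric_only=True)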
print(groups5)
#    xuhao  value1  value2
# a    4.5     2.0     2.0
# b    3.5     3.5     3.0
# c    5.0     0.0     2.0

# When .apply() is used, group_keys defaults to True, so the group keys are added to the index of the result
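To make the effect of group_keys visible, here is a minimal sketch that passes it explicitly, assuming a recent pandas (1.5 or later). The frame df and the helper center are made up for this illustration. The helper returns a Series with the same index as its input, which is the case where group_keys changes the shape of the result.

```python
import pandas as pd

# Hypothetical data for this illustration
df = pd.DataFrame({"result": ["negative", "positive", "positive"],
                   "value": [2, 0, 6]})

def center(s):
    # Subtract the group mean; the result keeps the group's original index
    return s - s.mean()

# group_keys=True: the group label is added as an extra outer index level
with_keys = df.groupby("result", group_keys=True)["value"].apply(center)
print(with_keys)

# group_keys=False: only the original row index is kept
without_keys = df.groupby("result", group_keys=False)["value"].apply(center)
print(without_keys)
```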

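Note: df.groupby() returns a GroupBy object that yields (group key, sub-DataFrame) pairs when iterated. print() on that object shows only its type; to see the actual contents, convert it with list() or loop over it with for.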
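The note above can be sketched as follows. The small frame df is made up for this illustration; size() and get_group() are included as two other common ways to inspect the groups.

```python
import pandas as pd

# Hypothetical data for this illustration
df = pd.DataFrame({"result": ["negative", "positive", "positive"],
                   "value": [2, 0, 6]})
grouped = df.groupby("result")

# Printing the object shows only its type, not the grouped rows
print(grouped)

# Iterating yields (group key, sub-DataFrame) pairs
for key, sub_df in grouped:
    print(key)
    print(sub_df)

# list() materializes the same pairs
pairs = list(grouped)

# Number of rows per group, and the rows of one particular group
sizes = grouped.size()
positive_rows = grouped.get_group("positive")
print(sizes)
print(positive_rows)
```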

Functions for working with groupby objects

import pandas as pd
import numpy as np

# Build the DataFrame
d1 = [[3,"negative",2],[4,"negative",6],[11,"positive",0],[12,"positive",2]]
df1 = pd.DataFrame(d1, columns=["xuhao","result","value"])
print(df1)
#    xuhao    result  value
# 0      3  negative      2
# 1      4  negative      6
# 2     11  positive      0
# 3     12  positive      2

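# describe() shows summary statistics for each group: count, mean, standard deviation, minimum, quartiles (the 50% column is the median) and maximum
group1 = df1.groupby("result").describe()
#group1 = df1.groupby("result").describe()["value"] # value column only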
print(group1)
#          xuhao                                                 value
#          count  mean       std   min    25%   50%    75%   max count mean       std  min  25%  50%  75%  max
# result
# negative   2.0   3.5  0.707107   3.0   3.25   3.5   3.75   4.0   2.0  4.0  2.828427  2.0  3.0  4.0  5.0  6.0
# positive   2.0  11.5  0.707107  11.0  11.25  11.5  11.75  12.0   2.0  1.0  1.414214  0.0  0.5  1.0  1.5  2.0

# agg() aggregates each group; common functions include min, max, sum, mean, median, std, var and count
#group2 = df1.groupby("result").agg("mean")
#group2 = df1.groupby("result").agg("mean")["value"] # value column only
group2 = df1.groupby("result")["value"].agg("mean")
print(group2)
# result
# negative    4.0
# positive    1.0
# Name: value, dtype: float64

group3 = df1.groupby("result").agg({"xuhao":"sum","value":"mean"})  # a different statistic for each column
print(group3)
#           xuhao  value
# result
# negative      7    4.0
# positive     23    1.0
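Two further forms of agg() may be useful alongside the dict form: a list of function names, and named aggregation with keyword arguments. The sketch below rebuilds the same sample data under the name df so that it runs on its own; the output column names xuhao_sum and value_mean are chosen for this illustration.

```python
import pandas as pd

# Same sample data as above, rebuilt so the snippet is self-contained
df = pd.DataFrame({"xuhao": [3, 4, 11, 12],
                   "result": ["negative", "negative", "positive", "positive"],
                   "value": [2, 6, 0, 2]})

# A list of function names gives one output column per function
multi = df.groupby("result")["value"].agg(["min", "max"])
print(multi)

# Named aggregation: new_column=(source_column, function)
named = df.groupby("result").agg(xuhao_sum=("xuhao", "sum"),
                                 value_mean=("value", "mean"))
print(named)
```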

# transform() returns a result aligned with the original rows, so it can be assigned directly as a new column
df1["mean_value"] = df1.groupby("result")["value"].transform("mean")
print(df1) 
#    xuhao    result  value  mean_value
# 0      3  negative      2         4.0
# 1      4  negative      6         4.0
# 2     11  positive      0         1.0
# 3     12  positive      2         1.0
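Because the result of transform() has one value per original row, it combines naturally with row-wise arithmetic. As a small sketch, the snippet below computes each row's share of its group total; the frame df and the column name share are made up for this illustration.

```python
import pandas as pd

# Same sample data as above, rebuilt so the snippet is self-contained
df = pd.DataFrame({"result": ["negative", "negative", "positive", "positive"],
                   "value": [2, 6, 0, 2]})

# Group totals broadcast back to every row of the group
group_total = df.groupby("result")["value"].transform("sum")

# Each row's share of its own group's total
df["share"] = df["value"] / group_total
print(df)
```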

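# apply() runs a function, built-in or user-defined, on each group
# Select the numeric columns first: the string column "result" cannot be averaged
group4 = df1.groupby("result")[["xuhao", "value", "mean_value"]].apply(lambda g: g.mean())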
print(group4)
#           xuhao  value  mean_value
# result
# negative    3.5    4.0         4.0
# positive   11.5    1.0         1.0
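For the user-defined case, a minimal sketch might look like the following. The function value_range and the frame df are hypothetical names for this illustration; the function receives the value column of one group as a Series and returns a single number.

```python
import pandas as pd

# Same sample data as above, rebuilt so the snippet is self-contained
df = pd.DataFrame({"result": ["negative", "negative", "positive", "positive"],
                   "value": [2, 6, 0, 2]})

def value_range(s):
    # Spread of the values within one group
    return s.max() - s.min()

ranges = df.groupby("result")["value"].apply(value_range)
print(ranges)
```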
posted @ 2023-04-25 21:09  yayagogogo