pandas.DataFrame.groupby: group a DataFrame using a mapper or by a Series of columns
Syntax
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=_NoDefault.no_default, squeeze=_NoDefault.no_default, observed=False, dropna=True)
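Commonly used parameters:
- by: a mapping, function, label, or list of labels. Determines how the rows are assigned to groups.
- axis: 0 (index) or 1 (columns); whether to split along rows or along columns. Defaults to 0 (rows).
- level: an int, a level name, or a sequence of these; defaults to None. Groups by the given index level(s). Do not specify both by and level.
- as_index: bool, defaults to True. For aggregated output, True returns an object indexed by the group labels.
- dropna: bool, defaults to True. If True, rows whose group key is NA are dropped, so NA never forms a group of its own.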
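The less common forms of by (a function and a mapping) and the effect of as_index can be sketched as follows. The frames, the fruit labels and the name label_to_group are invented for illustration and do not appear in the examples below.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame(
    {"score": [1, 2, 3, 4]},
    index=["apple", "avocado", "banana", "blueberry"],
)

# by as a function: it is called on each index label; here, the first letter
by_func = df.groupby(lambda label: label[0]).sum()

# by as a mapping: index label -> group name
label_to_group = {"apple": "x", "avocado": "y", "banana": "x", "blueberry": "y"}
by_map = df.groupby(label_to_group).sum()

# as_index=False keeps the group label as an ordinary column
df_col = pd.DataFrame({"kind": ["x", "y", "x"], "score": [1, 2, 3]})
flat = df_col.groupby("kind", as_index=False).sum()

print(by_func)
print(by_map)
print(flat)
```

With as_index=False the result has a default integer index, which is often more convenient when the output is merged or exported afterwards.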
Code example
import pandas as pd
# Build the DataFrame
d1 = [[3,"negative",2,1],[4,None,1,2],[5,"positive",0,2],[6,"positive",2,3],[3,"positive",6,4]]
df1 = pd.DataFrame(d1, columns=["xuhao","result","value1","value2"], index=["a","b","c","a","b"])
print(df1)
#    xuhao    result  value1  value2
# a      3  negative       2       1
# b      4      None       1       2
# c      5  positive       0       2
# a      6  positive       2       3
# b      3  positive       6       4
# Group the DataFrame by a single column with groupby()
groups1 = df1.groupby(['result']).mean()
print(groups1)
#              xuhao    value1  value2
# result
# negative  3.000000  2.000000     1.0
# positive  4.666667  2.666667     3.0
groups1_1 = df1.groupby(['result'],dropna=False).mean()
print(groups1_1)
#              xuhao    value1  value2
# result
# negative  3.000000  2.000000     1.0
# positive  4.666667  2.666667     3.0
# NaN       4.000000  1.000000     2.0
# Group the DataFrame by two columns with groupby()
groups2 = df1.groupby(["xuhao",'result']).mean()
print(groups2)
#                 value1  value2
# xuhao result
# 3     negative     2.0     1.0
#       positive     6.0     4.0
# 5     positive     0.0     2.0
# 6     positive     2.0     3.0
# Group the DataFrame by two columns and take the mean of only one column
groups3 = df1.groupby(["xuhao",'result'])["value1"].mean()
print(groups3)
# xuhao  result
# 3      negative    2.0
#        positive    6.0
# 5      positive    0.0
# 6      positive    2.0
# Name: value1, dtype: float64
# Set as_index=False so that the result is not indexed by the group labels
groups4 = df1.groupby(["xuhao",'result'], as_index=False).mean()
print(groups4)
#    xuhao    result  value1  value2
# 0      3  negative     2.0     1.0
# 1      3  positive     6.0     4.0
# 2      5  positive     0.0     2.0
# 3      6  positive     2.0     3.0
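# Group by the row index
# numeric_only=True skips the string column "result"; without it, recent pandas versions raise a TypeError
groups5 = df1.groupby(level=0).mean(numeric_only=True)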
print(groups5)
#    xuhao  value1  value2
# a    4.5     2.0     2.0
# b    3.5     3.5     3.0
# c    5.0     0.0     2.0
# When .apply() is used, group_keys defaults to True, i.e. the group keys are added to the index of the result
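Note: df.groupby() returns a GroupBy object rather than a DataFrame, so print() shows only the object's type. Iterating over it yields (group key, sub-DataFrame) pairs; convert it with list() or loop over it with for to see the contents of each group.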
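A minimal sketch of inspecting the groups, as described in the note above. The small frame here is made up for illustration; get_group() and size() are additional GroupBy methods not covered elsewhere in this article.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame(
    {"result": ["negative", "positive", "positive"], "value": [2, 0, 6]}
)
grouped = df.groupby("result")

# Printing the GroupBy object shows only its type and memory address
print(grouped)

# Each iteration yields a (group key, sub-DataFrame) pair
for key, sub_df in grouped:
    print(key)
    print(sub_df)

# list() materialises the same pairs
pairs = list(grouped)

# Fetch a single group by its key, or count the rows in each group
positive = grouped.get_group("positive")
sizes = grouped.size()
print(sizes)
```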
Functions for working with GroupBy objects
import pandas as pd
import numpy as np
# Build the DataFrame
d1 = [[3,"negative",2],[4,"negative",6],[11,"positive",0],[12,"positive",2]]
df1 = pd.DataFrame(d1, columns=["xuhao","result","value"])
print(df1)
#    xuhao    result  value
# 0      3  negative      2
# 1      4  negative      6
# 2     11  positive      0
# 3     12  positive      2
# describe() shows summary statistics for each group: count, mean, standard deviation, min, quartiles (the 50% quartile is the median) and max
group1 = df1.groupby("result").describe()
#group1 = df1.groupby("result").describe()["value"]  # value column only
print(group1)
#           xuhao                                                  value
#           count  mean       std   min    25%   50%    75%   max  count  mean       std  min  25%  50%  75%  max
# result
# negative    2.0   3.5  0.707107   3.0   3.25   3.5   3.75   4.0    2.0   4.0  2.828427  2.0  3.0  4.0  5.0  6.0
# positive    2.0  11.5  0.707107  11.0  11.25  11.5  11.75  12.0    2.0   1.0  1.414214  0.0  0.5  1.0  1.5  2.0
# agg() performs aggregation; supported functions include min, max, sum, mean, median, std, var and count
#group2 = df1.groupby("result").agg("mean")
#group2 = df1.groupby("result").agg("mean")["value"]  # value column only
group2 = df1.groupby("result")["value"].agg("mean")  # select the column first, then aggregate
print(group2)
# result
# negative 4.0
# positive 1.0
# Name: value, dtype: float64
group3 = df1.groupby("result").agg({"xuhao": "sum", "value": "mean"})  # a different statistic for each column
print(group3)
#           xuhao  value
# result
# negative      7    4.0
# positive     23    1.0
# transform() returns a result aligned with the original rows, so it can be assigned directly as a new column
df1["mean_value"] = df1.groupby("result")["value"].transform("mean")
print(df1)
#    xuhao    result  value  mean_value
# 0      3  negative      2         4.0
# 1      4  negative      6         4.0
# 2     11  positive      0         1.0
# 3     12  positive      2         1.0
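# apply() runs a function on each group; custom functions work too
# The numeric columns are selected explicitly so that the string grouping column is not passed to the function
group4 = df1.groupby("result")[["xuhao", "value", "mean_value"]].apply(lambda g: g.mean())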
print(group4)
#           xuhao  value  mean_value
# result
# negative    3.5    4.0         4.0
# positive   11.5    1.0         1.0
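As a sketch of the custom-function case mentioned above, the example below computes the spread (max minus min) of value within each group. The helper value_range is hypothetical and written only for this illustration; the data repeats the result and value columns used earlier.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "result": ["negative", "negative", "positive", "positive"],
        "value": [2, 6, 0, 2],
    }
)


def value_range(series):
    """Hypothetical helper: spread of one group's values (max minus min)."""
    return series.max() - series.min()


# apply() calls the function once per group and collects the scalar results
ranges = df.groupby("result")["value"].apply(value_range)

# agg() accepts the same custom function
ranges_agg = df.groupby("result")["value"].agg(value_range)

print(ranges)
```

When the function reduces each group to a single value, agg() states the intent more directly; apply() is the more general tool for functions that return a Series or DataFrame per group.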