使用pyspark计算几何平均值

  根据Wikipedia的定义,几何平均值定义为n个数字乘积的n次根。根据同一页,几何平均值也可以表示为对数的算术平均值的指数。我将使用它来计算每一列的几何平均值。

  计算几何平均值,方法是将c1c2的列数据组合到一个名为value的新列中,该列将源列名称存储在column中。重新格式化数据后,通过按column (c1c2)分组并计算每个组的对数值的算术平均值的指数来确定几何平均值。在此计算中,NaN值被忽略。

%spark.pyspark
from pyspark.sql import functions as F

df = sqlContext.range(0, 10)    
df = df.select(F.rand(seed=10).alias("c1"), F.randn(seed=27).alias("c2"))
df_id = df.repartition(1).withColumn("id", F.monotonically_increasing_id()).repartition(2)
print("df_id表:")
df_id.show()

kvp = F.explode(F.array([F.struct(F.lit(c).alias("column"), F.col(c).alias("value")) for c in df.columns])).alias("kvp")
df_pivoted = df_id.select(['id'] + [kvp]).select(['id'] + ["kvp.column", "kvp.value"])
print("df_pivoted:")
df_pivoted.show()

df_geometric_mean = df_pivoted.groupBy(['column']).agg(F.exp(F.avg(F.log(df_pivoted.value))))
print("df_geometric_mean:")
df_geometric_mean.show()

df_geometric_mean.withColumnRenamed("EXP(avg(LOG(value)))", "geometric_mean").show()

结果:
df_id表:
+-------------------+--------------------+---+
|                 c1|                  c2| id|
+-------------------+--------------------+---+
| 0.5996723933366402| -0.6950946460326453|  8|
|0.16452185994603707|  0.8306551181802446|  3|
| 0.3654625958161396| 0.16420866768809156|  1|
| 0.4175019040792016| -1.0451987154313813|  2|
|0.34084319330900115| -1.4298752463301871|  6|
| 0.6396141227834357|-0.07706276399864868|  9|
| 0.9697474945375325|  0.5991404703866096|  5|
|0.18141810315190554|  1.3792681955381285|  4|
|0.03422639313807285| 0.45800664187768786|  0|
| 0.4725977369833597|-0.19668314148825305|  7|
+-------------------+--------------------+---+

df_pivoted:
+---+------+--------------------+
| id|column|               value|
+---+------+--------------------+
|  8|    c1|  0.5996723933366402|
|  8|    c2| -0.6950946460326453|
|  3|    c1| 0.16452185994603707|
|  3|    c2|  0.8306551181802446|
|  1|    c1|  0.3654625958161396|
|  1|    c2| 0.16420866768809156|
|  2|    c1|  0.4175019040792016|
|  2|    c2| -1.0451987154313813|
|  6|    c1| 0.34084319330900115|
|  6|    c2| -1.4298752463301871|
|  9|    c1|  0.6396141227834357|
|  9|    c2|-0.07706276399864868|
|  5|    c1|  0.9697474945375325|
|  5|    c2|  0.5991404703866096|
|  4|    c1| 0.18141810315190554|
|  4|    c2|  1.3792681955381285|
|  0|    c1| 0.03422639313807285|
|  0|    c2| 0.45800664187768786|
|  7|    c1|  0.4725977369833597|
|  7|    c2|-0.19668314148825305|
+---+------+--------------------+

df_geometric_mean:
+------+-------------------+
|column|EXP(avg(ln(value)))|
+------+-------------------+
|    c1|0.31407283822800264|
|    c2| 0.5528065693289157|
+------+-------------------+

+------+-------------------+
|column|EXP(avg(ln(value)))|
+------+-------------------+
|    c1|0.31407283822800264|
|    c2| 0.5528065693289157|
+------+-------------------+

 

posted @ 2022-05-07 17:13  干了这瓶老干妈  阅读(279)  评论(0编辑  收藏  举报
Live2D