PySpark programming practice (agg operations & custom aggregation functions)
agg operations & custom aggregation functions
agg with groupBy
In pyspark, only the agg aggregation actually collapses a field into an aggregated value; apply-style operations run row by row and do not perform real aggregation.
pyspark already defines many convenient aggregate functions for agg, which can be called directly.
from:
+---+-------------+-----+----+----+----+----+--------+
| ID|            P|index|xinf| xup|yinf|ysup|       M|
+---+-------------+-----+----+----+----+----+--------+
|  0|10279.9003906|   13| 0.3| 0.5| 2.5| 3.0|540928.0|
|  2|12024.2998047|   13| 0.3| 0.5| 2.5| 3.0|541278.0|
|  0|10748.7001953|   13| 0.3| 0.5| 2.5| 3.0|541243.0|
|  1|      10988.5|   13| 0.3| 0.5| 2.5| 3.0|540917.0|
+---+-------------+-----+----+----+----+----+--------+
to:
+---+-------------+-----+----+----+----+----+-----------------+
| ID|            P|index|xinf| xup|yinf|ysup|                M|
+---+-------------+-----+----+----+----+----+-----------------+
|  0|10514.3002929|   13| 0.3| 0.5| 2.5| 3.0|540928.0,541243.0|
|  2|12024.2998047|   13| 0.3| 0.5| 2.5| 3.0|         541278.0|
|  1|      10988.5|   13| 0.3| 0.5| 2.5| 3.0|         540917.0|
+---+-------------+-----+----+----+----+----+-----------------+
# So, I want to reduce by ID, calculate the mean of the P values, and concatenate the M values.
from pyspark.sql.functions import first, collect_list, mean
In:
df.groupBy("ID").agg(mean("P"), first("index"),
first("xinf"), first("xup"),
first("yinf"), first("ysup"),
collect_list("M"))
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
spark = SparkSession.builder.appName('pyspark - example agg').getOrCreate()
datavengers = [
("Carol","Data Scientist","USA",70000,5),
("Peter","Data Scientist","USA",90000,7),
("Clark","Data Scientist","UK",111000,10),
("Jean","Data Scientist","UK",220000,30),
("Bruce","Data Engineer","UK",80000,4),
("Thanos","Data Engineer","USA",115000,13),
("Scott","Data Engineer","UK",180000,15),
("T'challa","CEO","USA",300000,20),
("Xavier","Marketing","USA",100000,11),
("Wade","Marketing","UK",60000,2)
]
schema = ["Name","Job","Country","salary","seniority"]
df = spark.createDataFrame(data=datavengers, schema = schema)
df.printSchema()
OUT:
root
|-- Name: string (nullable = true)
|-- Job: string (nullable = true)
|-- Country: string (nullable = true)
|-- salary: long (nullable = true)
|-- seniority: long (nullable = true)
df.show()
OUT:
+--------+--------------+-------+------+---------+
| Name| Job|Country|salary|seniority|
+--------+--------------+-------+------+---------+
| Carol|Data Scientist| USA| 70000| 5|
| Peter|Data Scientist| USA| 90000| 7|
| Clark|Data Scientist| UK|111000| 10|
| Jean|Data Scientist| UK|220000| 30|
| Bruce| Data Engineer| UK| 80000| 4|
| Thanos| Data Engineer| USA|115000| 13|
| Scott| Data Engineer| UK|180000| 15|
|T'challa| CEO| USA|300000| 20|
| Xavier| Marketing| USA|100000| 11|
| Wade| Marketing| UK| 60000| 2|
+--------+--------------+-------+------+---------+
# Single-column aggregation
df.groupBy('Job').max('salary').show()
OUT:
+--------------+-----------+
| Job|max(salary)|
+--------------+-----------+
| CEO| 300000|
|Data Scientist| 220000|
| Marketing| 100000|
| Data Engineer| 180000|
+--------------+-----------+
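The same single-column aggregation can also be written through agg, which additionally lets you rename the result column (a small sketch reusing df and the f alias defined above):

df.groupBy('Job').agg(f.max('salary').alias('max_salary')).show()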
# Group by multiple keys and apply the same aggregation to several columns
df.groupBy('Job', 'Country').avg('salary', 'seniority').show(truncate=False)
OUT:
+--------------+-------+-----------+--------------+
|Job |Country|avg(salary)|avg(seniority)|
+--------------+-------+-----------+--------------+
|Marketing |UK |60000.0 |2.0 |
|Data Engineer |UK |130000.0 |9.5 |
|Data Scientist|UK |165500.0 |20.0 |
|Marketing |USA |100000.0 |11.0 |
|Data Scientist|USA |80000.0 |6.0 |
|CEO |USA |300000.0 |20.0 |
|Data Engineer |USA |115000.0 |13.0 |
+--------------+-------+-----------+--------------+
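An equivalent way to express "the same aggregation on several columns" is the dictionary form of agg (a sketch; note that the dict form does not let you alias the result columns):

df.groupBy('Job', 'Country').agg({'salary': 'avg', 'seniority': 'avg'}).show(truncate=False)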
# Apply different aggregations to multiple columns and rename the resulting columns with alias
df.groupBy("Job") \
.agg(f.sum("salary").alias("sum_salary"),
f.avg("salary").alias("avg_salary"),
f.min("salary").alias("min_salary"),
f.max("salary").alias("max_salary"),
f.mean("salary").alias("mean_salary")
) \
.show(truncate=False)
OUT:
+--------------+----------+----------+----------+----------+-----------+
|Job |sum_salary|avg_salary|min_salary|max_salary|mean_salary|
+--------------+----------+----------+----------+----------+-----------+
|CEO |300000 |300000.0 |300000 |300000 |300000.0 |
|Data Scientist|491000 |122750.0 |70000 |220000 |122750.0 |
|Marketing |160000 |80000.0 |60000 |100000 |80000.0 |
|Data Engineer |375000 |125000.0 |80000 |180000 |125000.0 |
+--------------+----------+----------+----------+----------+-----------+
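Because the aliased aggregates are ordinary columns of the result, they can be filtered or sorted afterwards, which plays the role of SQL's HAVING (a small sketch reusing the grouping above; the 100000 threshold is just an illustrative value):

(df.groupBy('Job')
   .agg(f.avg('salary').alias('avg_salary'))
   .where(f.col('avg_salary') > 100000)
   .orderBy(f.desc('avg_salary'))
   .show())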
Inspect the data profile with the .summary() method
df.select('salary').summary().show()
OUT:
+-------+-----------------+
|summary| salary|
+-------+-----------------+
| count| 10|
| mean| 132600.0|
| stddev|77113.48059119683|
| min| 60000|
| 25%| 80000|
| 50%| 100000|
| 75%| 180000|
| max| 300000|
+-------+-----------------+
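summary() also accepts the statistics you want, so the output can be restricted to just a few rows (a sketch; percentile strings such as '50%' are supported by DataFrame.summary):

df.select('salary').summary('count', 'mean', 'min', '50%', 'max').show()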
agg directly on a DataFrame
- Spark's agg can also be applied to a DataFrame without groupBy. In the simple case, the columns to aggregate can simply be listed one by one.
- In the more complex case, the expressions cannot be enumerated by hand, and a collection of expressions has to be built programmatically, as in the loop example below.
# ========================= Simple case =========================
data.show(5)
+--------+-------+--------+--------------------+-----+--------+
|glass_id|step_id|equip_id| timekey|label| unit_id|
+--------+-------+--------+--------------------+-----+--------+
|Y95PR090| 14200|A2PDC100|20190601094814153863| 1|A2PDC100|
|Y95PR090| 14207|A2VTM100|20190601120431648744| 1|A2VTM100|
|Y95PR090| 1420V|A2PVD100|20190601120428511673| 1|A2PVD100|
|Y95PR090| 14300|A2RSM100|20190601125957981111| 1|A2RSM100|
|Y95PR090| 14315|A2PHT500|20190601150105054455| 1|A2PHT500|
+--------+-------+--------+--------------------+-----+--------+
only showing top 5 rows
data.agg(mean('label')).show()
+------------------+
| avg(label)|
+------------------+
|0.7411402157164869|
+------------------+
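agg without groupBy treats the whole DataFrame as a single group and can take several expressions at once (a sketch against the same data frame; countDistinct comes from pyspark.sql.functions, and the alias names are just illustrative):

from pyspark.sql.functions import mean, countDistinct
data.agg(mean('label').alias('label_rate'),
         countDistinct('glass_id').alias('n_glass')).show()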
# ============ Build the collection of expressions with a loop ============
tips_.show(2)
+----------+----+----+
|total_bill| tip|size|
+----------+----+----+
| 16.99|1.01| 2.0|
| 10.34|1.66| 3.0|
+----------+----+----+
only showing top 2 rows
agglist = [mean(x) for x in tips_.columns]
agglist
Out[109]: [Column<b'avg(total_bill)'>, Column<b'avg(tip)'>, Column<b'avg(size)'>]
tips_.agg(*agglist)
Out[111]: DataFrame[avg(total_bill): double, avg(tip): double, avg(size): double]
tips_.agg(*agglist).show()
+------------------+----------------+-----------------+
| avg(total_bill)| avg(tip)| avg(size)|
+------------------+----------------+-----------------+
|19.785942643392282|2.99827868821191|2.569672131147541|
+------------------+----------------+-----------------+
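The same loop idea extends naturally to nicer column names (or to a different aggregate per column) by attaching alias while building the expression list (a sketch reusing tips_ from above; the avg_ prefix is just an illustrative naming choice):

from pyspark.sql.functions import mean
agglist = [mean(c).alias('avg_' + c) for c in tips_.columns]
tips_.agg(*agglist).show()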