PySpark programming practice (agg operations & custom aggregation functions)

agg operations & custom aggregation functions

agg with groupBy

In PySpark, the agg operation is what actually aggregates a column; apply-style operations run row by row and do not perform a real aggregation.

PySpark already defines many convenient functions for use with agg, which can be called directly to perform the computation.

from:
+---------+-------------+-----+-------+-------+-------+-------+--------+
|ID       |           P |index|xinf   |xup    |yinf   |ysup   |     M  |
+---------+-------------+-----+-------+-------+-------+-------+--------+
|        0|10279.9003906|   13|    0.3|    0.5|    2.5|    3.0|540928.0|
|        2|12024.2998047|   13|    0.3|    0.5|    2.5|    3.0|541278.0|
|        0|10748.7001953|   13|    0.3|    0.5|    2.5|    3.0|541243.0|
|        1|      10988.5|   13|    0.3|    0.5|    2.5|    3.0|540917.0|
+---------+-------------+-----+-------+-------+-------+-------+--------+

to:
+---------+-------------+-----+-------+-------+-------+-------+--------+
|ID       |           P |index|xinf   |xup    |yinf   |ysup   |     M  |
+---------+-------------+-----+-------+-------+-------+-------+--------+
|        0|10514.3002929|   13|    0.3|    0.5|    2.5|    3.0|540928.0,541243.0|
|        2|12024.2998047|   13|    0.3|    0.5|    2.5|    3.0|541278.0|
|        1|      10988.5|   13|    0.3|    0.5|    2.5|    3.0|540917.0|
+---------+-------------+-----+-------+-------+-------+-------+--------+

# Goal: group (reduce) by ID, take the mean of the P values, and collect the M values into a list.
from pyspark.sql.functions import first, collect_list, mean

In:
df.groupBy("ID").agg(mean("P"), first("index"),
                     first("xinf"), first("xup"), 
                     first("yinf"), first("ysup"), 
                     collect_list("M"))
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
spark = SparkSession.builder.appName('pyspark - example join').getOrCreate()

datavengers = [
    ("Carol","Data Scientist","USA",70000,5),
    ("Peter","Data Scientist","USA",90000,7),
    ("Clark","Data Scientist","UK",111000,10),
    ("Jean","Data Scientist","UK",220000,30),
    ("Bruce","Data Engineer","UK",80000,4),
    ("Thanos","Data Engineer","USA",115000,13),
    ("Scott","Data Engineer","UK",180000,15),
    ("T'challa","CEO","USA",300000,20),
    ("Xavier","Marketing","USA",100000,11),
    ("Wade","Marketing","UK",60000,2)
]

schema = ["Name","Job","Country","salary","seniority"]
df = spark.createDataFrame(data=datavengers, schema=schema)
df.printSchema()
OUT:
root
 |-- Name: string (nullable = true)
 |-- Job: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- seniority: long (nullable = true)
    
df.show()
OUT:
+--------+--------------+-------+------+---------+
|    Name|           Job|Country|salary|seniority|
+--------+--------------+-------+------+---------+
|   Carol|Data Scientist|    USA| 70000|        5|
|   Peter|Data Scientist|    USA| 90000|        7|
|   Clark|Data Scientist|     UK|111000|       10|
|    Jean|Data Scientist|     UK|220000|       30|
|   Bruce| Data Engineer|     UK| 80000|        4|
|  Thanos| Data Engineer|    USA|115000|       13|
|   Scott| Data Engineer|     UK|180000|       15|
|T'challa|           CEO|    USA|300000|       20|
|  Xavier|     Marketing|    USA|100000|       11|
|    Wade|     Marketing|     UK| 60000|        2|
+--------+--------------+-------+------+---------+
# Single-column aggregation
df.groupBy('Job').max('salary').show()
OUT:
+--------------+-----------+
|           Job|max(salary)|
+--------------+-----------+
|           CEO|     300000|
|Data Scientist|     220000|
|     Marketing|     100000|
| Data Engineer|     180000|
+--------------+-----------+
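
The shorthand above is equivalent to going through agg; a small sketch of the agg form, which additionally allows renaming the output column:

# Same result via agg(); alias() renames the output column.
df.groupBy('Job').agg(f.max('salary').alias('max_salary')).show()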

# Group by multiple keys and apply the same aggregation to several columns
df.groupBy('Job', 'Country').avg('salary', 'seniority').show()
OUT:
+--------------+-------+-----------+--------------+
|Job           |Country|avg(salary)|avg(seniority)|
+--------------+-------+-----------+--------------+
|Marketing     |UK     |60000.0    |2.0           |
|Data Engineer |UK     |130000.0   |9.5           |
|Data Scientist|UK     |165500.0   |20.0          |
|Marketing     |USA    |100000.0   |11.0          |
|Data Scientist|USA    |80000.0    |6.0           |
|CEO           |USA    |300000.0   |20.0          |
|Data Engineer |USA    |115000.0   |13.0          |
+--------------+-------+-----------+--------------+
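
The same grouped averages can also be written with the dictionary form of agg, as a quick sketch:

# Dictionary form: map each column name to the aggregate function to apply.
df.groupBy('Job', 'Country').agg({'salary': 'avg', 'seniority': 'avg'}).show()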

# Apply different aggregations to several columns and rename the resulting columns
df.groupBy("Job") \
    .agg(f.sum("salary").alias("sum_salary"), 
         f.avg("salary").alias("avg_salary"), 
         f.min("salary").alias("min_salary"), 
         f.max("salary").alias("max_salary"), 
         f.mean("salary").alias("mean_salary") 
     ) \
    .show(truncate=False)
OUT:
+--------------+----------+----------+----------+----------+-----------+
|Job           |sum_salary|avg_salary|min_salary|max_salary|mean_salary|
+--------------+----------+----------+----------+----------+-----------+
|CEO           |300000    |300000.0  |300000    |300000    |300000.0   |
|Data Scientist|491000    |122750.0  |70000     |220000    |122750.0   |
|Marketing     |160000    |80000.0   |60000     |100000    |80000.0    |
|Data Engineer |375000    |125000.0  |80000     |180000    |125000.0   |
+--------------+----------+----------+----------+----------+-----------+

Inspect an overview of the data with the .summary() method

df.select('salary').summary().show()
OUT:
+-------+-----------------+
|summary|           salary|
+-------+-----------------+
|  count|               10|
|   mean|         132600.0|
| stddev|77113.48059119683|
|    min|            60000|
|    25%|            80000|
|    50%|           100000|
|    75%|           180000|
|    max|           300000|
+-------+-----------------+
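
summary() also accepts the statistics to compute as arguments, so the overview can be limited to just the values of interest, for example:

# Restrict the summary to selected statistics.
df.select('salary').summary('count', 'mean', 'max').show()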
agg directly on a DataFrame
  • Spark's agg can also aggregate a DataFrame directly; in the simple case every column expression can be listed explicitly
  • In the complex case the expressions cannot be enumerated by hand, so a collection of expressions has to be built programmatically (see the loop example below)
# ========================= Simple case =========================
data.show(5)
+--------+-------+--------+--------------------+-----+--------+
|glass_id|step_id|equip_id|             timekey|label| unit_id|
+--------+-------+--------+--------------------+-----+--------+
|Y95PR090|  14200|A2PDC100|20190601094814153863|    1|A2PDC100|
|Y95PR090|  14207|A2VTM100|20190601120431648744|    1|A2VTM100|
|Y95PR090|  1420V|A2PVD100|20190601120428511673|    1|A2PVD100|
|Y95PR090|  14300|A2RSM100|20190601125957981111|    1|A2RSM100|
|Y95PR090|  14315|A2PHT500|20190601150105054455|    1|A2PHT500|
+--------+-------+--------+--------------------+-----+--------+
only showing top 5 rows

data.agg(mean('label')).show()
+------------------+
|        avg(label)|
+------------------+
|0.7411402157164869|
+------------------+
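
Several aggregate expressions can be listed in a single agg call on the DataFrame. A small sketch reusing the data DataFrame above (countDistinct comes from pyspark.sql.functions):

from pyspark.sql.functions import mean, countDistinct

# Average the label and count the distinct glass_ids in one pass.
data.agg(mean('label'), countDistinct('glass_id')).show()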


# ============ Build the collection of expressions with a loop ===============
tips_.show(2)
+----------+----+----+
|total_bill| tip|size|
+----------+----+----+
|     16.99|1.01| 2.0|
|     10.34|1.66| 3.0|
+----------+----+----+
only showing top 2 rows

agglist = [mean(x) for x in tips_.columns]
agglist
Out[109]: [Column<b'avg(total_bill)'>, Column<b'avg(tip)'>, Column<b'avg(size)'>]
tips_.agg(*agglist)
Out[111]: DataFrame[avg(total_bill): double, avg(tip): double, avg(size): double]
tips_.agg(*agglist).show()
+------------------+----------------+-----------------+
|   avg(total_bill)|        avg(tip)|        avg(size)|
+------------------+----------------+-----------------+
|19.785942643392282|2.99827868821191|2.569672131147541|
+------------------+----------------+-----------------+
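
Finally, for the custom aggregation functions mentioned in the title: when the built-in functions are not enough, a grouped-aggregate pandas UDF can be used inside agg. The sketch below is only a minimal illustration (it assumes pandas and pyarrow are installed; salary_range is a name chosen here for the example, not part of the original post) and reuses the datavengers DataFrame df from above.

from pyspark.sql.functions import pandas_udf, PandasUDFType

# Custom aggregation: receives all salary values of one group as a pandas
# Series and returns a single scalar (the in-group salary range).
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def salary_range(v):
    return float(v.max() - v.min())

df.groupBy('Job').agg(salary_range(df['salary']).alias('salary_range')).show()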

posted @ 2020-11-21 16:49  seekerJunYu