Spark Scala: grouping and taking the record(s) with the latest date (max date)
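Approach: compute the max year_joined per emp_id with groupBy/agg, join that result back onto the original DataFrame, and keep only the rows whose year_joined equals the group max. Rows that tie on the max date are all kept.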

// Sample data: (emp_id, name, superior_emp_id, year_joined, emp_dept_id, gender, salary)
val emp = Seq(
  (1, "Smith",    -1, "2018", "10", "M", 3000),
  (2, "Rose",      1, "2010", "20", "M", 4000),
  (1, "Williams",  1, "2020", "10", "M", 1000),
  (2, "Jones",     2, "2005", "10", "F", 2000),
  (1, "Brown",     2, "2020", "40", "",   -1),
  (6, "Brown",     2, "2010", "50", "",   -1)
)
val empColumns = Seq("emp_id", "name", "superior_emp_id", "year_joined",
  "emp_dept_id", "gender", "salary")

import spark.implicits._
val empDF = emp.toDF(empColumns: _*)
empDF.show(false)

scala> val b = empDF
scala> b.show
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|    name|superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|     1|   Smith|             -1|       2018|         10|     M|  3000|
|     2|    Rose|              1|       2010|         20|     M|  4000|
|     1|Williams|              1|       2020|         10|     M|  1000|
|     2|   Jones|              2|       2005|         10|     F|  2000|
|     1|   Brown|              2|       2020|         40|      |    -1|
|     6|   Brown|              2|       2010|         50|      |    -1|
+------+--------+---------------+-----------+-----------+------+------+

scala> val a = empDF.groupBy("emp_id").agg(max("year_joined").alias("max"))
a: org.apache.spark.sql.DataFrame = [emp_id: int, max: string]

scala> a.show
+------+----+
|emp_id| max|
+------+----+
|     1|2020|
|     6|2010|
|     2|2010|
+------+----+
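a now has exactly one row per emp_id, holding the latest year_joined for that group; joining it back onto b lets us filter the original rows against it.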

scala> b.join(a, Seq("emp_id"), "left").show
+------+--------+---------------+-----------+-----------+------+------+----+
|emp_id|    name|superior_emp_id|year_joined|emp_dept_id|gender|salary| max|
+------+--------+---------------+-----------+-----------+------+------+----+
|     1|   Smith|             -1|       2018|         10|     M|  3000|2020|
|     2|    Rose|              1|       2010|         20|     M|  4000|2010|
|     1|Williams|              1|       2020|         10|     M|  1000|2020|
|     2|   Jones|              2|       2005|         10|     F|  2000|2010|
|     1|   Brown|              2|       2020|         40|      |    -1|2020|
|     6|   Brown|              2|       2010|         50|      |    -1|2010|
+------+--------+---------------+-----------+-----------+------+------+----+
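Every original row now carries its group's max. An inner join would work just as well here, since every emp_id in b is also present in a.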

scala> b.join(a, Seq("emp_id"), "left").where("year_joined = max").show
+------+--------+---------------+-----------+-----------+------+------+----+
|emp_id|    name|superior_emp_id|year_joined|emp_dept_id|gender|salary| max|
+------+--------+---------------+-----------+-----------+------+------+----+
|     2|    Rose|              1|       2010|         20|     M|  4000|2010|
|     1|Williams|              1|       2020|         10|     M|  1000|2020|
|     1|   Brown|              2|       2020|         40|      |    -1|2020|
|     6|   Brown|              2|       2010|         50|      |    -1|2010|
+------+--------+---------------+-----------+-----------+------+------+----+
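The same result can also be obtained without the groupBy-plus-join round trip by using a window function. A minimal sketch, assuming the same empDF as above (note that year_joined is a string here, so max compares lexicographically, which is fine for four-digit years):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

// Attach the per-emp_id max of year_joined to every row, then keep
// only the rows that match it; ties on the max date are all kept,
// just like the join-based version above.
val w = Window.partitionBy("emp_id")
empDF
  .withColumn("max", max("year_joined").over(w))
  .where("year_joined = max")
  .show

Drop the helper column afterwards with .drop("max") if it is not needed in the output.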
References:
https://sparkbyexamples.com/spark/spark-sql-dataframe-join/
https://stackoverflow.com/questions/39699495/spark-2-0-groupby-column-and-then-get-maxdate-on-a-datetype-column?rq=1