Spark: computing the difference (diff) between consecutive records, e.g. time differences

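Sometimes you run into this scenario: you have a DataFrame and need to compute the difference between consecutive records within the same group. The value being compared is not limited to timestamps; it can be another data type as well.
Two tools are needed: a Spark window specification (Window) to group and order the records, and the lag function to fetch the value from the previous record.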

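// Assumes a SparkSession named `spark` is in scope (as in spark-shell)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.{col, round}
import org.apache.spark.sql.types.LongType
import spark.implicits._ // enables toDF and the $"..." column syntax

// Sample data: purchase times per device, stored as strings
val df = Seq(
    ("notebook", "2019-01-01 00:00:00"),
    ("notebook", "2019-01-10 13:02:00"),
    ("notebook", "2019-01-10 13:15:22"),
    ("small_phone", "2019-01-30 09:30:00"),
    ("small_phone", "2019-01-15 12:00:00"),
    ("small_phone", "2019-01-30 09:50:00"),
    ("small_phone", "2019-01-30 09:32:00"),
    ("big_phone", "2019-01-02 09:30:00") // day zero-padded so it matches the yyyy-MM-dd pattern
).toDF("device", "purchase_time").sort("device", "purchase_time")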

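// One partition per device, rows ordered by purchase time within each partition
val sessionWindow = Window.partitionBy("device").orderBy("purchase_time")
// lag(..., 1) returns the value from the previous row of the same partition;
// the first row of each partition has no predecessor, so pre_time is null there
val diffDf = df.withColumn("pre_time",
                          functions.lag($"purchase_time", 1).over(sessionWindow))
diffDf.show()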
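
To make the semantics of the window plus lag concrete, here is a minimal plain-Scala sketch that mimics the same steps on an ordinary collection: group by key, sort within each group, then pair every value with its predecessor. The object `LagSketch` and the helper `withPrevious` are hypothetical names introduced only for illustration; they are not part of Spark, and `None` stands in for the null that lag produces on the first row of a partition.

```scala
object LagSketch {
  // Hypothetical helper: mimics lag(value, 1) over partitionBy(key).orderBy(value)
  def withPrevious[K, V: Ordering](rows: Seq[(K, V)]): Seq[(K, V, Option[V])] =
    rows
      .groupBy(_._1) // partitionBy(key)
      .toSeq
      .flatMap { case (key, group) =>
        val sorted = group.map(_._2).sorted // orderBy(value)
        // Shift the sorted values down by one row; the first row has no predecessor
        val previous: Seq[Option[V]] = None +: sorted.init.map(Some(_))
        sorted.zip(previous).map { case (value, prev) => (key, value, prev) }
      }

  def main(args: Array[String]): Unit = {
    val purchases = Seq(
      ("notebook", "2019-01-10 13:02:00"),
      ("notebook", "2019-01-01 00:00:00"),
      ("small_phone", "2019-01-30 09:30:00"),
      ("small_phone", "2019-01-15 12:00:00")
    )
    // Group order from groupBy is unspecified, so sort before printing
    withPrevious(purchases).sortBy(r => (r._1, r._2)).foreach(println)
  }
}
```

Because the helper only requires an `Ordering` on the value type, the same idea carries over to numeric columns, which matches the point that the diff is not limited to time values.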

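// Parse both string columns into timestamps. In the pattern, MM is the month and mm is the minute.
val minutesDf = diffDf.withColumn("purchase_time",
                                  functions.to_timestamp(col("purchase_time"), "yyyy-MM-dd HH:mm:ss"))
                       .withColumn("pre_time",
                                  functions.to_timestamp(col("pre_time"), "yyyy-MM-dd HH:mm:ss"))
                       // Casting a timestamp to LongType yields epoch seconds; divide by 60 for minutes
                       .withColumn("minutes_diff",
                                  round((col("purchase_time").cast(LongType) - col("pre_time").cast(LongType)) / 60))
minutesDf.show()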
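
The arithmetic behind the minute difference can be cross-checked outside Spark. The sketch below uses `java.time` on two of the sample timestamps; the object `MinutesDiffCheck` and the helper `minutesBetween` are hypothetical names used only for this check. Note the pattern letters: `MM` is the month of the year, while `mm` is the minute of the hour. The helper returns a `Long`, whereas Spark's `round` on a double column yields a double.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

object MinutesDiffCheck {
  // MM = month of year, mm = minute of hour
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Hypothetical helper: difference in seconds, divided by 60 and rounded to whole minutes
  def minutesBetween(previous: String, current: String): Long = {
    val seconds = ChronoUnit.SECONDS.between(
      LocalDateTime.parse(previous, fmt),
      LocalDateTime.parse(current, fmt))
    math.round(seconds / 60.0)
  }

  def main(args: Array[String]): Unit = {
    // 13:02:00 to 13:15:22 is 802 seconds, about 13.37 minutes
    println(minutesBetween("2019-01-10 13:02:00", "2019-01-10 13:15:22")) // prints 13
  }
}
```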

posted @ 2021-07-17 12:06  real-zhouyc