SparkSql连续问题

计算连续

  1. 0_ : 表示按id分组时间排序的行号
  2. 1_ : 表示是否符合筛选条件, 例如卖出的钱/和上一条对比卖出的钱是不是2倍?
  3. 2_ : 找到连续行的结尾
  4. 3_ : 先filter 1_ 2_全是null的行, 用first( , true) 把连续的行分到一个组里.
consumer_id time code 0_row_num
part by consumer
order by time
1_lead 符合条件? 2_找一个结尾
1_ is null ? lag(0_): null
3_ 划分分组
过滤掉 1_ 2_全是null的行
first 忽略null
1 0 a 1 1 null 2
1 50 a 2 null 2 2
1 51 b 3 1 null 9
1 52 b 4 1 null 9
1 53 b 5 1 null 9
1 54 b 6 1 null 9
1 55 b 7 1 null 9
1 56 b 8 1 null 9
1 57 b 9 null 9 9
2 13 c 1 null null deleted
2 19 a 2 null null deleted
2 20 b 3 1 null 5
2 21 b 4 1 null 5
2 22 b 5 null 5 5

数据分组

  1. 将test表进行分组
  2. test 表仅有 n 列
  3. test2 加了row_number
n rn lag_n sum_lag_n group_num
1 1 1 1 0
1 2 1 2 0
1 3 1 3 0
0 4 1 4 0
1 5 0 4 -1
1 6 1 5 -1
1 7 1 6 -1
1 8 1 7 -1
1 9 1 8 -1
1 10 1 9 -1
1 11 1 10 -1
1 12 1 11 -1
0 13 1 12 -1
1 14 0 12 -2
1 15 1 13 -2
1 16 1 14 -2
1 17 1 15 -2
1 18 1 16 -2
1 19 1 17 -2
1 20 1 18 -2

解答

select
*,
sum_lag_n - rn as group_num
from(
  select
  test3.*,
  sum(lag_n) over(partition by 1 order by 1 rows between unbounded preceding and current row) as sum_lag_n
  from (
  select
    *,
    lag(n,1,1) over(partition by 1 order by 1) as lag_n
  from test2
  ) test3
) test4
posted @ 2021-05-11 11:03  一根咸鱼干  阅读(236)  评论(0编辑  收藏  举报