SparkSql连续问题
计算连续
- 0_ : 表示按id分组时间排序的行号
- 1_ : 表示是否符合筛选条件, 例如卖出的钱/和上一条对比卖出的钱是不是2倍?
- 2_ : 找到连续行的结尾
- 3_ : 先filter 1_ 2_全是null的行, 用first( , true) 把连续的行分到一个组里.
consumer_id | time | code | 0_row_num part by consumer order by time |
1_lead 符合条件? | 2_找一个结尾 1_ is null ? lag(0_): null |
3_ 划分分组 过滤掉 1_ 2_全是null的行 first 忽略null |
---|---|---|---|---|---|---|
1 | 0 | a | 1 | 1 | null | 2 |
1 | 50 | a | 2 | null | 2 | 2 |
1 | 51 | b | 3 | 1 | null | 9 |
1 | 52 | b | 4 | 1 | null | 9 |
1 | 53 | b | 5 | 1 | null | 9 |
1 | 54 | b | 6 | 1 | null | 9 |
1 | 55 | b | 7 | 1 | null | 9 |
1 | 56 | b | 8 | 1 | null | 9 |
1 | 57 | b | 9 | null | 9 | 9 |
2 | 13 | c | 1 | null | null | deleted |
2 | 19 | a | 2 | null | null | deleted |
2 | 20 | b | 3 | 1 | null | 5 |
2 | 21 | b | 4 | 1 | null | 5 |
2 | 22 | b | 5 | null | 5 | 5 |
数据分组
- 将test表进行分组
- test 表仅有 n 列
- test2 加了row_number
n | rn | lag_n | sum_lag_n | group_num |
---|---|---|---|---|
1 | 1 | 1 | 1 | 0 |
1 | 2 | 1 | 2 | 0 |
1 | 3 | 1 | 3 | 0 |
0 | 4 | 1 | 4 | 0 |
1 | 5 | 0 | 4 | -1 |
1 | 6 | 1 | 5 | -1 |
1 | 7 | 1 | 6 | -1 |
1 | 8 | 1 | 7 | -1 |
1 | 9 | 1 | 8 | -1 |
1 | 10 | 1 | 9 | -1 |
1 | 11 | 1 | 10 | -1 |
1 | 12 | 1 | 11 | -1 |
0 | 13 | 1 | 12 | -1 |
1 | 14 | 0 | 12 | -2 |
1 | 15 | 1 | 13 | -2 |
1 | 16 | 1 | 14 | -2 |
1 | 17 | 1 | 15 | -2 |
1 | 18 | 1 | 16 | -2 |
1 | 19 | 1 | 17 | -2 |
1 | 20 | 1 | 18 | -2 |
解答
select
*,
sum_lag_n - rn as group_num
from(
select
test3.*,
sum(lag_n) over(partition by 1 order by 1 rows between unbounded preceding and current row) as sum_lag_n
from (
select
*,
lag(n,1,1) over(partition by 1 order by 1) as lag_n
from test2
) test3
) test4