关于 hive 分桶重排序的一个栗子

需求

原始数据

year
tag
2014
1
2015
1
2016
0
2017
0
2018
0
2020
1
2021
1
2022
1

结果数据

2014
1
1
2015
1
2
2016
0
1
2017
0
2
2018
0
3
2020
1
1
2021
1
2
2022
1
3

说明:字段1是有序的,按照字段2分块计数,每当字段2变化,就重新开始计数,计数的结果当作字段3返回

数据准备

create table temp_block_sort_0309
(
    year int,
    tag  int
) stored as orc
    tblproperties ('orc.compress' = 'snappy');
insert into temp_block_sort_0309
values (2014, 1),
       (2015, 1),
       (2016, 0),
       (2017, 0),
       (2018, 0),
       (2020, 1),
       (2021, 1),
       (2022, 1);
select * from temp_block_sort_0309;

解题方案

解决这个问题可以分 3 步走
1、判定当前行和上一行的第二列是否相等,如果不等置为1,相等为0,作为flag列

select *, case when lag(tag, 1) over (order by year) != tag then 1 else 0 end as flag
from temp_block_sort_0309;

 

year
tag
flag
2014
1
0
2015
1
0
2016
0
1
2017
0
0
2018
0
0
2020
1
1
2021
1
0
2022
1
0
2、对 flag列进行累加,值作为分桶id
select *, sum(flag) over (order by year) tong_id
from (select *, case when lag(tag, 1) over (order by year) != tag then 1 else 0 end as flag
      from temp_block_sort_0309) t;
year
tag
flag
tong_id
2014
1
0
0
2015
1
0
0
2016
0
1
1
2017
0
0
1
2018
0
0
1
2020
1
1
2
2021
1
0
2
2022
1
0
2
3、按照累加的值(分桶id)分组 & 组内重排序

select year, tag, row_number() over (partition by tong_id order by year) js_id, flag
from (select *, sum(flag) over (order by year) tong_id
      from (select *, case when lag(tag, 1) over (order by year) != tag then 1 else 0 end as flag
            from temp_block_sort_0309) t) t
order by 1;
year
tag
js_id
flag_id
2014
1
1
0
2015
1
2
0
2016
0
1
1
2017
0
2
0
2018
0
3
0
2020
1
1
1
2021
1
2
0
2022
1
3
0

posted @ 2022-03-09 17:06  晓枫的春天  阅读(42)  评论(0编辑  收藏  举报