关于 hive 分桶重排序的一个栗子

需求

原始数据

year	tag
2014	1
2015	1
2016	0
2017	0
2018	0
2020	1
2021	1
2022	1

结果数据

2014	1	1
2015	1	2
2016	0	1
2017	0	2
2018	0	3
2020	1	1
2021	1	2
2022	1	3

说明：字段1是有序的，按照字段2分块计数，每当字段2变化，就重新开始计数，计数的结果当作字段3返回

数据准备

create table temp_block_sort_0309
(
    year int,
    tag  int
) stored as orc
    tblproperties ('orc.compress' = 'snappy');
insert into temp_block_sort_0309
values (2014, 1),
       (2015, 1),
       (2016, 0),
       (2017, 0),
       (2018, 0),
       (2020, 1),
       (2021, 1),
       (2022, 1);
select * from temp_block_sort_0309;

解题方案

解决这个问题可以分 3 步走

1、判定当前行和上一行的第二列是否相等,如果不等置为1，相等为0,作为flag列

select *, case when lag(tag, 1) over (order by year) != tag then 1 else 0 end as flag
from temp_block_sort_0309;

year	tag	flag
2014	1	0
2015	1	0
2016	0	1
2017	0	0
2018	0	0
2020	1	1
2021	1	0
2022	1	0

2、对 flag列进行累加，值作为分桶id

select *, sum(flag) over (order by year) tong_id
from (select *, case when lag(tag, 1) over (order by year) != tag then 1 else 0 end as flag
      from temp_block_sort_0309) t;

year	tag	flag	tong_id
2014	1	0	0
2015	1	0	0
2016	0	1	1
2017	0	0	1
2018	0	0	1
2020	1	1	2
2021	1	0	2
2022	1	0	2

3、按照累加的值（分桶id）分组 & 组内重排序

select year, tag, row_number() over (partition by tong_id order by year) js_id, flag
from (select *, sum(flag) over (order by year) tong_id
      from (select *, case when lag(tag, 1) over (order by year) != tag then 1 else 0 end as flag
            from temp_block_sort_0309) t) t
order by 1;

year	tag	js_id	flag_id
2014	1	1	0
2015	1	2	0
2016	0	1	1
2017	0	2	0
2018	0	3	0
2020	1	1	1
2021	1	2	0
2022	1	3	0

posted @ 2022-03-09 17:06 晓枫的春天阅读(42) 评论(0) 编辑收藏举报

刷新页面返回顶部

滴水穿石不是靠力，而是因为不舍昼夜。

关于 hive 分桶重排序的一个栗子

需求

数据准备

解题方案

公告