Hive 刷题——区间合并

需求描述

给定多个时间段,每个时间段分为开始时间、结束时间,将相互重叠的多个时间段合并为一个区间。

--数据:id、开始时间、结束时间
1001 16 18
1002 17 19
1003 29 32
1004 30 31
1005 57 58
1006 44 44
1007 56 57
1008 12 15

合并后结果如下;

flag    start_time    end_time
1    12    15
2    16    19
3    29    32
4    44    44
5    56    58

需求分析

解题要点:如何判断哪些区间是要合并的?其实换个角度就是哪些区间是交叉的,哪些是重复的?判断思路:如果将起始时间,和结束时间进行排序,当前行的起始时间小于等于上一行的结束时间,那么日期就存在交叉,存在重复的数据。根据该条件我们可以设置断点,然后用经典的思路sum() over()来获取分组id,问题便得到解决。第一步:按照起始时间和结束时间进行降序排序,获取上一行的结束时间,目的是为了比较 

with t as (
    select 1001 as id, 16 as start_time, 18 as end_time
    union all
    select 1002 as id, 17 as start_time, 19 as end_time
    union all
    select 1003 as id, 29 as start_time, 32 as end_time
    union all
    select 1004 as id, 30 as start_time, 31 as end_time
    union all
    select 1005 as id, 57 as start_time, 58 as end_time
    union all
    select 1006 as id, 44 as start_time, 44 as end_time
    union all
    select 1007 as id, 56 as start_time, 57 as end_time
    union all
    select 1008 as id, 12 as start_time, 15 as end_time)
select *, lag(end_time, 1, end_time) over (order by start_time ,end_time) as last_end_time  from t;
1008    12    15    15
1001    16    18    15
1002    17    19    18
1003    29    32    19
1004    30    31    32
1006    44    44    31
1007    56    57    44
1005    57    58    57

第二步:根据lag_end_time进行判断,当当前行的start_time <=  lag_end_time时候设置标记值0,否则为1(经典的按条件变化后的分组思路,这里一定是满足条件的时候置为0,不满足条件的时候置为1)

select t1.id,
       t1.start_time,
       t1.end_time,
       t1.last_end_time,
       if(start_time <= last_end_time, 0, 1) flag --条件成立的时候为 0,不成立的时候为 1
from (select *, lag(end_time, 1, end_time) over (order by start_time ,end_time) as last_end_time
      from t) t1
1008    12    15    15    0
1001    16    18    15    1
1002    17    19    18    0
1003    29    32    19    1
1004    30    31    32    0
1006    44    44    31    1
1007    56    57    44    1
1005    57    58    57    0

 第三步:按照sum() over()的方法获取分组id

with t as (
    select 1001 as id, 16 as start_time, 18 as end_time
    union all
    select 1002 as id, 17 as start_time, 19 as end_time
    union all
    select 1003 as id, 29 as start_time, 32 as end_time
    union all
    select 1004 as id, 30 as start_time, 31 as end_time
    union all
    select 1005 as id, 57 as start_time, 58 as end_time
    union all
    select 1006 as id, 44 as start_time, 44 as end_time
    union all
    select 1007 as id, 56 as start_time, 57 as end_time
    union all
    select 1008 as id, 12 as start_time, 15 as end_time)
select id, start_time, end_time, last_end_time, flag, sum(flag) over (order by start_time,end_time) grep_flag
      from (select t1.id,
                   t1.start_time,
                   t1.end_time,
                   t1.last_end_time,
                   if(start_time <= last_end_time, 0, 1) flag --条件成立的时候为 0,不成立的时候为 1
            from (select *, lag(end_time, 1, end_time) over (order by start_time ,end_time) as last_end_time
                  from t) t1) t
1008    12    15    15    0    0
1001    16    18    15    1    1
1002    17    19    18    0    1
1003    29    32    19    1    2
1004    30    31    32    0    2
1006    44    44    31    1    3
1007    56    57    44    1    4
1005    57    58    57    0    4

第四步:在分组里获取,最小值及最大值,最小值即为起始点,最大值即为结束点,分组id即为id

select grep_flag + 1 as id, min(start_time) start_time, max(end_time) end_time
from (select id, start_time, end_time, last_end_time, flag, sum(flag) over (order by start_time,end_time) grep_flag
      from (select t1.id,
                   t1.start_time,
                   t1.end_time,
                   t1.last_end_time,
                   if(start_time <= last_end_time, 0, 1) flag --条件成立的时候为 0,不成立的时候为 1
            from (select *, lag(end_time, 1, end_time) over (order by start_time ,end_time) as last_end_time
                  from t) t1) t) t
group by grep_flag + 1
order by 1;
1    12    15
2    16    19
3    29    32
4    44    44
5    56    58
posted @ 2023-02-26 09:26  晓枫的春天  阅读(269)  评论(0编辑  收藏  举报