Hive 刷题——会话划分问题
需求描述
现有页面浏览记录表(page_view_events)如下,表中有每个用户的每次页面访问记录。
user_id | page_id | view_timestamp |
---|---|---|
100 | home | 1659950435 |
100 | good_search | 1659950446 |
100 | good_list | 1659950457 |
100 | home | 1659950541 |
100 | good_detail | 1659950552 |
100 | cart | 1659950563 |
101 | home | 1659950435 |
101 | good_search | 1659950446 |
101 | good_list | 1659950457 |
101 | home | 1659950541 |
101 | good_detail | 1659950552 |
101 | cart | 1659950563 |
102 | home | 1659950435 |
102 | good_search | 1659950446 |
102 | good_list | 1659950457 |
103 | home | 1659950541 |
103 | good_detail | 1659950552 |
103 | cart | 1659950563 |
规定若同一用户的相邻两次访问记录时间间隔小于60s,则认为两次浏览记录属于同一会话。现有如下需求,为属于同一会话的访问记录增加一个相同的会话id字段,会话id格式为"user_id-number",其中number从1开始,用于区分同一用户的不同会话,期望结果如下:
user_id <int> (用户id) | page_id <string> (页面id) | view_timestamp <bigint> (浏览时间戳) | session_id <string> (会话id) |
---|---|---|---|
100 | home | 1659950435 | 100-1 |
100 | good_search | 1659950446 | 100-1 |
100 | good_list | 1659950457 | 100-1 |
100 | home | 1659950541 | 100-2 |
100 | good_detail | 1659950552 | 100-2 |
100 | cart | 1659950563 | 100-2 |
101 | home | 1659950435 | 101-1 |
101 | good_search | 1659950446 | 101-1 |
101 | good_list | 1659950457 | 101-1 |
101 | home | 1659950541 | 101-2 |
101 | good_detail | 1659950552 | 101-2 |
101 | cart | 1659950563 | 101-2 |
102 | home | 1659950435 | 102-1 |
102 | good_search | 1659950446 | 102-1 |
102 | good_list | 1659950457 | 102-1 |
103 | home | 1659950541 | 103-1 |
103 | good_detail | 1659950552 | 103-1 |
参考实现
select user_id,page_id,view_timestamp,concat(user_id,'-',flag)session_id from (select user_id, page_id,view_timestamp, sum(flag) over (partition by user_id order by view_timestamp) flag from (select user_id, page_id, view_timestamp, if(view_timestamp - last_view_timestamp > 60, 1, 0) flag --打标签,本次page的时间戳减去上个页面的时间戳,如果大于60秒,证明本次page是会话的初始page,打上一 from (select user_id, page_id, view_timestamp, lag(view_timestamp, 1, 0) over (partition by user_id order by view_timestamp) last_view_timestamp --取出上一个page页面的时间戳 from page_view_events) t) t) t;