使用hql-统计连续登陆的三天及以上的用户

@

这个问题可以扩展到很多相似的问题:连续几个月充值会员、连续天数有商品卖出、连续打车、连续逾期……

数据提供

 用户ID、登入日期
 user01,2018-02-28
 user01,2018-03-01
 user01,2018-03-02
 user01,2018-03-04
 user01,2018-03-05
 user01,2018-03-06
 user01,2018-03-07
 user02,2018-03-01
 user02,2018-03-02
 user02,2018-03-03
 user02,2018-03-06

输出字段

+---------+--------+-------------+-------------+--+
|   uid   | times  | start_date  |  end_date   |
+---------+--------+-------------+-------------+--+

解法一

先对每个用户的登录日期排序,然后拿第n行的日期,减第n-2行的日期,如果等于2,就说明连续三天登录了。

解法二

开窗,窗囗内部排序然后做差

rownumber() oover

建表

create table wedw_dw.t_login_info(
 user_id string  COMMENT '用户ID'
,login_date date COMMENT '登录日期'
)
row format delimited fields terminated by ',';

导数据

hdfs dfs -put /test/login.txt /data/hive/test/wedw/dw/t_login_info/

验证数据

select * from wedw_dw.t_login_info;
+----------+-------------+--+
| user_id  | login_date  |
+----------+-------------+--+
| user01   | 2018-02-28  |
| user01   | 2018-03-01  |
| user01   | 2018-03-02  |
| user01   | 2018-03-04  |
| user01   | 2018-03-05  |
| user01   | 2018-03-06  |
| user01   | 2018-03-07  |
| user02   | 2018-03-01  |
| user02   | 2018-03-02  |
| user02   | 2018-03-03  |
| user02   | 2018-03-06  |
+----------+-------------+--+

解决方案-使用解法二

select
 t2.user_id         as user_id,
 count(1)           as times,
 min(t2.login_date) as start_date,
 max(t2.login_date) as end_date
from
(
    select
     t1.user_id,
     t1.login_date,
     date_sub(t1.login_date,rn) as date_diff
    from
    (
        select
         user_id,
         login_date,
         row_number() over(partition by user_id order by login_date asc) as rn 
        from
        wedw_dw.t_login_info
    ) t1
) t2
group by 
 t2.user_id, t2.date_diff
having times >= 3;

结果

+----------+--------+-------------+-------------+--+
| user_id  | times  | start_date  |  end_date   |
+----------+--------+-------------+-------------+--+
| user01   | 3      | 2018-02-28   | 2018-03-02  |
| user01   | 4      | 2018-03-04  | 2018-03-07   |
| user02   | 3      | 2018-03-01   | 2018-03-03  |
+----------+--------+-------------+-------------+--+

思路

  1. 先把数据按照用户id分组,根据登录日期排序
select
	user_id
	,login_date
	,row_number() over(partition by user_id order by login_date asc) as rn 
	from
	wedw_dw.t_login_info

+----------+-------------+-----+--+
| user_id  | login_date  | rn  |
+----------+-------------+-----+--+
| user01   | 2018-02-28  | 1   |
| user01   | 2018-03-01  | 2   |
| user01   | 2018-03-02  | 3   |
| user01   | 2018-03-04  | 4   |
| user01   | 2018-03-05  | 5   |
| user01   | 2018-03-06  | 6   |
| user01   | 2018-03-07  | 7   |
| user02   | 2018-03-01  | 1   |
| user02   | 2018-03-02  | 2   |
| user02   | 2018-03-03  | 3   |
| user02   | 2018-03-06  | 4   |
+----------+-------------+-----+--+
  1. 用登录日期减去排序数字rn,得到的差值日期如果是相等的,则说明这两天肯定是连续的
select
     t1.user_id
    ,t1.login_date
    ,date_sub(t1.login_date,rn) as date_diff
    from
    (
        select
         user_id
        ,login_date
        ,row_number() over(partition by user_id order by login_date asc) as rn 
        from
        wedw_dw.t_login_info
    ) t1
    ;


+----------+-------------+-------------+--+
| user_id  | login_date  |  date_diff  |
+----------+-------------+-------------+--+
| user01   | 2018-02-28  | 2018-02-27  |
| user01   | 2018-03-01  | 2018-02-27  |
| user01   | 2018-03-02  | 2018-02-27  |
| user01   | 2018-03-04  | 2018-02-28  |
| user01   | 2018-03-05  | 2018-02-28  |
| user01   | 2018-03-06  | 2018-02-28  |
| user01   | 2018-03-07  | 2018-02-28  |
| user02   | 2018-03-01  | 2018-02-28  |
| user02   | 2018-03-02  | 2018-02-28  |
| user02   | 2018-03-03  | 2018-02-28  |
| user02   | 2018-03-06  | 2018-03-02  |
+----------+-------------+-------------+--+
  1. 根据user_id和日期差date_diff 分组,最小登录日期即为此次连续登录的开始日期start_date,最大登录日期即为结束日期end_date,登录次数即为分组后的count(1)
select
 t2.user_id         as user_id
,count(1)           as times
,min(t2.login_date) as start_date
,max(t2.login_date) as end_date
from
(
    select
     t1.user_id
    ,t1.login_date
    ,date_sub(t1.login_date,rn) as date_diff
    from
    (
        select
         user_id
        ,login_date
        ,row_number() over(partition by user_id order by login_date asc) as rn 
        from
        wedw_dw.t_login_info
    ) t1
) t2
group by 
 t2.user_id
,t2.date_diff
having times >= 3
;

+----------+--------+-------------+-------------+--+
| user_id  | times  | start_date  |  end_date   |
+----------+--------+-------------+-------------+--+
| user01   | 3      | 2018-02-28   | 2018-03-02  |
| user01    | 4      | 2018-03-04  | 2018-03-07  |
| user02   | 3      | 2018-03-01   | 2018-03-03  |
+----------+--------+-------------+-------------+--+
posted @ 2021-02-21 00:20  孙晨c  阅读(505)  评论(0编辑  收藏  举报