Hive 面试题——设计一个1~60天的注册、活跃留存表

需求描述

现有一个用户活跃表 user_active (user_id,active_date)、用户注册表 user_regist(user_id,regist_date),表中分区字段都为dt(yyyy-MM-dd),用户字段均为user_id;设计一张1-60天注册活跃留存表;

需求分析

需求中要求设计留存周期为1-60天的注册、活跃留存输出效果如下:

主要考察点: 一对多关联 

观察表结构:我们可以看出,针对每一天来讲分母固定,分子随留存周期变动,

第一步:求出注册表中,每日注册数,注册数作为分母,针对每一天的留存数据来讲是个固定值,这里用开窗求解该指标。

select user_id, regist_date, count(user_id) over (partition by regist_date) regist_cnt
from user_regist
where dt >= date_sub(current_date(), 60)
第二步:用户注册表作为主表,关联活跃表,关联健为user_id,由于一对多的关系,产生笛卡尔集;注意:活跃用户表,每天用户会有多次活跃注意去重
select t.regist_date,
       t.user_id,
       t.regist_cnt,
       t1.user_id,
       t1.active_date,
       datediff(t1.active_date, t.regist_date) adte_diff
from (select user_id, regist_date, count(user_id) over (partition by regist_date) regist_cnt
      from user_regist
      where dt >= date_sub(current_date(), 60)) t
         left join
     (select user_id, active_date
      from user_active
      where dt >= date_sub(current_date(), 60)
      group by user_id, active_date) t1
     on t.user_id = t1.user_id;

第三步:按照注册日期和留存周期分组,计算该留存周期下、该时间点下的活跃用户数

select t.regist_date,
       max(t.regist_cnt)                       regist_cnt,--注册用户数为固定值,使用 max 取就行
       datediff(t1.active_date, t.regist_date) adte_diff,
       count(t1.user_id)                       active_cnt
from (select user_id, regist_date, count(user_id) over (partition by regist_date) regist_cnt
      from user_regist
      where dt >= date_sub(current_date(), 60)) t
         left join
     (select user_id, active_date
      from user_active
      where dt >= date_sub(current_date(), 60)
      group by user_id, active_date) t1
     on t.user_id = t1.user_id
where datediff(t1.active_date, t.regist_date) >= 1
  and datediff(t1.active_date, t.regist_date) <= 60
group by t.regist_date, datediff(t1.active_date, t.regist_date)

第四步:计算率留存

select regist_date, date_diff, active_cnt, active_cnt / regist_cnt retention_rate
from (select t.regist_date,
             max(t.regist_cnt)                       regist_cnt,--注册用户数为固定值,使用 max 取就行
             datediff(t1.active_date, t.regist_date) date_diff,
             count(t1.user_id)                       active_cnt
      from (select user_id, regist_date, count(user_id) over (partition by regist_date) regist_cnt
            from user_regist
            where dt >= date_sub(current_date(), 60)) t
               left join
           (select user_id, active_date
            from user_active
            where dt >= date_sub(current_date(), 60)
            group by user_id, active_date) t1
           on t.user_id = t1.user_id
      where datediff(t1.active_date, t.regist_date) >= 1
        and datediff(t1.active_date, t.regist_date) <= 60
      group by t.regist_date, datediff(t1.active_date, t.regist_date)) t
where regist_cnt > 0;
posted @ 2023-02-12 07:19  晓枫的春天  阅读(132)  评论(0编辑  收藏  举报