Hive 面试题——设计一个1~60天的注册、活跃留存表
需求描述
现有一个用户活跃表 user_active (user_id,active_date)、用户注册表 user_regist(user_id,regist_date),表中分区字段都为dt(yyyy-MM-dd),用户字段均为user_id;设计一张1-60天注册活跃留存表;
需求分析
需求中要求设计留存周期为1-60天的注册、活跃留存输出效果如下:
主要考察点: 一对多关联
观察表结构:我们可以看出,针对每一天来讲分母固定,分子随留存周期变动,
第一步:求出注册表中,每日注册数,注册数作为分母,针对每一天的留存数据来讲是个固定值,这里用开窗求解该指标。
select user_id, regist_date, count(user_id) over (partition by regist_date) regist_cnt from user_regist where dt >= date_sub(current_date(), 60)
第二步:用户注册表作为主表,关联活跃表,关联健为user_id,由于一对多的关系,产生笛卡尔集;注意:活跃用户表,每天用户会有多次活跃注意去重
select t.regist_date, t.user_id, t.regist_cnt, t1.user_id, t1.active_date, datediff(t1.active_date, t.regist_date) adte_diff from (select user_id, regist_date, count(user_id) over (partition by regist_date) regist_cnt from user_regist where dt >= date_sub(current_date(), 60)) t left join (select user_id, active_date from user_active where dt >= date_sub(current_date(), 60) group by user_id, active_date) t1 on t.user_id = t1.user_id;
第三步:按照注册日期和留存周期分组,计算该留存周期下、该时间点下的活跃用户数
select t.regist_date, max(t.regist_cnt) regist_cnt,--注册用户数为固定值,使用 max 取就行 datediff(t1.active_date, t.regist_date) adte_diff, count(t1.user_id) active_cnt from (select user_id, regist_date, count(user_id) over (partition by regist_date) regist_cnt from user_regist where dt >= date_sub(current_date(), 60)) t left join (select user_id, active_date from user_active where dt >= date_sub(current_date(), 60) group by user_id, active_date) t1 on t.user_id = t1.user_id where datediff(t1.active_date, t.regist_date) >= 1 and datediff(t1.active_date, t.regist_date) <= 60 group by t.regist_date, datediff(t1.active_date, t.regist_date)
第四步:计算率留存
select regist_date, date_diff, active_cnt, active_cnt / regist_cnt retention_rate from (select t.regist_date, max(t.regist_cnt) regist_cnt,--注册用户数为固定值,使用 max 取就行 datediff(t1.active_date, t.regist_date) date_diff, count(t1.user_id) active_cnt from (select user_id, regist_date, count(user_id) over (partition by regist_date) regist_cnt from user_regist where dt >= date_sub(current_date(), 60)) t left join (select user_id, active_date from user_active where dt >= date_sub(current_date(), 60) group by user_id, active_date) t1 on t.user_id = t1.user_id where datediff(t1.active_date, t.regist_date) >= 1 and datediff(t1.active_date, t.regist_date) <= 60 group by t.regist_date, datediff(t1.active_date, t.regist_date)) t where regist_cnt > 0;