SQL经典用法之用户连续访问区间记录
案例:有一张DWS表记录了每个用户近30天活跃情况,需要通过该表记录得出用户近30天的连续访问间隔分布情况
逻辑梳理:
1,要通过埋点日志数据,筛选出最近30天的日志记录,然后按照用户ID分组,通过对用户活跃时间进行排序得出近30天的连续活跃区间情况:使用开窗函数ROW_NUMBER() 实现
得到如下一张表:
user_id rng_start rng_end
uid01 2022-05-20 2022-05-26
uid01 2022-05-29 2022-06-01
uid01 2022-06-03 2022-06-20
uid02 2022-05-22 2022-05-30
uid02 2022-06-03 2022-06-06
uid03 2022-05-20 2022-05-26
uid04 2022-06-03 2022-06-20
uid05 2022-06-19 2022-06-20
2,通过上面的表我们要得到下面这张表:
user_id jiange_days jiange_cnts
uid01 间隔0天 26
uid02 间隔0天 11
uid03 间隔0天 6
uid04 间隔0天 17
uid05 间隔0天 1
uid01 间隔2天 1
uid01 间隔3天 1
uid02 间隔4天 1
实现逻辑:
对每一个区间,x=rng_end - rng_start ,得到:间隔0天,x次
对于相邻区间,y=rng_2_start - rng_1_end,下一个区间开始-上个区间结束,得到间隔y天,1次
--sqltext:
--隔0天的次数
select user_id,
0 as jiange_days,
datediff(rng_end,rng_start) as jiange_cnts
from dws_api_uca_ rng
--隔n天的次数:lead() over()将两个相邻区间的结束 和起始放到一行
如:
user_id rng_start rng_end lead(rng_start) over()
uid01 2022-05-20 2022-05-26 2022-05-29
uid01 2022-05-29 2022-06-01 2022-06-03
uid01 2022-06-03 2022-06-20 null
select user_id,
datediff(lead(rng_start) over(partition by user_id order by rng_start), rng_end) as jiange_days
from dws_api_uca_ rng
3,合并间隔0天和间隔n天的数据
--隔0天的次数
select user_id,
concat('间隔',0,'天') as jiange_days,
datediff(rng_end,rng_start) as jiange_cnts
from dws_api_uca_ rng
union all
select user_id,
concat('间隔',jiange_days,'天') as jiange_days ,
sum(1) as jiange_cnts
from
(select user_id,
datediff(lead(rng_start) over(partition by user_id order by rng_start), rng_end) as jiange_days
from dws_api_uca_ rng
) t
where t.jiange_days is not null
group by user_id,jiange_days