hive SQL案例

  • 上个月用户连续n天登录天数

用户登录记录表user_login,包括用户id(user_id)、日期(login_date)
user_id(bigint) login_date(string)
12333256 2021-01-03
84272916 2021-01-03
94038271 2021-01-02
201934012 2021-01-03
过滤出当月数据:where login_date > ''
转换成date:to_date函数,参数默认按照"yyyy-MM-dd"的格式
去重:一天可能有多次登录记录(可以用distinct、group by、row_number去重)
开窗排序,生成rn1:按user_id分区,login_date排序
login_date与rn1作差得到新日期dt2:date_sub(login_date, rn1)
统计天数:按user_id、dt2分组,再使用count(1)统计连续登录天数
开窗排序,生成rn2:按user_id分区,count(1)排序
with t1 as (
    select
        user_id, 
        to_date(login_date) as dt1,
        row_number() over(partition by user_id order by to_date(login_date)) as rn1
    from db.user_login
    where login_date > '2023-07-31'
    group by user_id, to_date(login_date)
),
t2 as (
    select
        user_id,
        date_sub(dt1, rn1) as dt2
    from t1
),
t3 as (
    select
        user_id,
        dt2,
        count(1) as continuous_day,
        row_number() over(partition by user_id order by count(1) desc) rn2
    from t2
    group by user_id, dt2
)
select
    user_id,
    dt2
from t3
where rn2 = 1;
View Code

部分代码为:

select sku_id,
    create_date,
    date_sub(create_date, row_number() over (partition by sku_id order by create_date)) as sub_res
from order_detail
group by sku_id, create_date
-- 筛选出销售总额超过100的记录
having sum(price * sku_num) > 100
View Code
  • 求每个店铺访问次数TopN的访客信息

用户访问表user_log,包括user_id、shop

user_id shop
u1 a
u2 a
u3 b
u4 c

 

with t1 as {
    select
        shop,
        user_id,
        count(1) as cnt,
        row_number() over(partition by shop order by count(1) desc) as rn1
    from user_log
    group by shop, user_id
}

select
    shop,
    user_id,
    cnt
from t1
where rn1 <= n;
View Code
  • 计算出1日、3日、7日留存率

用户登录记录表user_log,包括user_id、log_date
user_id(bigint) log_date(string)
12333256 2021-01-03
84272916 2021-01-03
94038271 2021-01-02
201934012 2021-01-03
思路1:x天数据多次left join每天的数据
https://blog.csdn.net/weixin_45788152/article/details/110597806
思路2:x天的数据一次left join其他天数据,使用case when处理
 https://blog.csdn.net/cold___play/article/details/124053392
去重:一天可能有多次登录记录(可以用distinct、group by、row_number去重);
关联:表自身关联,过滤出右表日期大于左表日期的日志
求x日留存:使用datediff得到日期差求留存
计算留存率:x日留存率=x日留存用户数/基准日活跃用户数
with t1 as (
    select
        a.user_id as user_id,
        a.start_date as start_date,
        b.end_date as end_date,
        datediff(b.end_date, a.start_date) as diff_date
    from 
        (
        select
            user_id,
            to_date(log_date) as start_date
        from db.user_log
        group by user_id, to_date(log_date)) a
        left join
        (
        select
            user_id,
            to_date(log_date) as end_date
        from db.user_log
        group by user_id, to_date(log_date)) b
    on a.user_id = b.user_id
    where a.start_date < b.end_date
),
t2 as (
    select
        start_date,
        count(distinct user_id) as base_cnt,
        count(distinct if(diff_date = 1, user_id, null)) as remain_1d,
        count(distinct if(diff_date = 3, user_id, null)) as remain_3d,
        count(distinct if(diff_date = 7, user_id, null)) as remain_7d
    from t1
)
select
    start_date,
    base_cnt,
    remain_1d,
    remain_3d,
    remain_7d,
    concat(round(remain_1d/base_cnt * 100, 2), '%') as remain_1d_rate,
    concat(round(remain_3d/base_cnt * 100, 2), '%') as remain_3d_rate,
    concat(round(remain_7d/base_cnt * 100, 2), '%') as remain_7d_rate
from t2; 
View Code
  • pv累加趋势图

用户访问表user_log,包括user_id、log_date,求累计到当前月的总数
user_id(bigint) log_date(string)
12333256 2021-01-03
84272916 2021-01-03
94038271 2021-01-02
201934012

2021-01-03

 over里面的order by的作用:聚合函数默认从起始行到当前行范围内计算,如果没有排序意味着,聚合函数在整个分区范围中计算

select
    month,
    pv,
    sum(pv) over(order by month asc) as total_pv --从第一行加到当前行
from (
    select
        substr(log_date,1,7) as month,
        count(user_id) as pv
    from db.user_log
    group by substr(log_date,1,7)
) t1;
View Code

有用户访问表,包含字段用户id、直播间id、进入直播间时间、离开直播间时间,统计直播间最高在线人数

增加标志位,将进入记为1,出去记为-1
统计累计到每个时刻的在线人数
得到最高在线人数
select 
    room_id,
    max(online_num) as max_online_num
from(
    select
        uid,
        time,
        sum(p) over(partition by room_id order by time) online_num 
    from (
        select uid, room_id, entry_time as time, 1 as p from table
        union all
        select uid, room_id, leave_time as time, -1 as p from table
    ) t1
) t2
group by room_id;
View Code
  • 计算商品的售价涨幅

商品价格变更明细表sku_price_modify_detail,包括sku_id、new_price、change_date
sku_id new_price change_date
12333256 13 2021-01-03
84272916 54 2021-01-03
94038271 67 2021-01-02
201934012 21 2021-01-05
select t1.sku_id,t1.price_change
from
  (
    select sku_id,
      new_price - lead(new_price,1,0)over(partition by sku_id order by change_date desc) price_change,
      rank()over(partition by sku_id order by change_date desc) rn
    from
      sku_price_modify_detail
  ) t1
where rn=1
order by t1.price_change;
View Code
    • 求连续点击三次的用户数,而且中间不能有别人的点击

用户点击表user_click,包括user_id、click_time
user_id(bigint) click_time(string)
12333256 2021-01-03 10:23:01
12333256 2021-01-03 10:23:02
84272916 2021-01-03 10:23:03
12333256 2021-01-03 10:23:04
12333256 2021-01-03 10:23:05
12333256 2021-01-03 10:23:06
12333256 2021-01-03 10:23:07
 
row_number() over (order by click_time) as rank_1得到rank_1为 1 2 3 4 5 6 7
row_number() over (partition by user_id order by click_time)得到rank_2为 1 2 1 3 4 5 6

rank1 - rank2得到diff为 0 0 2 1 1 1 1

select distinct user_id
from    
(
   select *, rank_1 - rank2  as diff
   from
  (
      select *,
      row_number() over(order by click_time) as  rank_1
      row_number() over(partition by user_id order by click_time) as rank_2
      from user_click
   ) b
) c
group by diff, user_id
having count(diff) >=3
View Code
  • 用户行为分析

用户行为表tracking_log,包括user_id、op、log_time
user_id(bigint) op log_time(string)
201934012 A 2021-01-01 10:23:01
12333256 B 2021-01-02 10:23:02
84272916 A 2021-01-03 10:23:03
12333256 C 2021-01-04 10:23:04
94038271 A 2021-01-01 10:23:05
201934012 A 2021-01-02 10:23:06
12333256 B 2021-01-03 10:23:07

统计每天符合以下条件的用户数:A操作之后是B操作,AB操作必须相邻

 to_date(string timestr):返回时间字符串中的日期部分,比如to_date('2021-01-01 10:23:01')返回2021-01-01

 1 select date, count(*)
 2 from(
 3     select user_id
 4     from(
 5         select 
 6                 user_id,
 7                 to_date(log_time) date,
 8                 op,
 9                 lag(op, 1) over(partition by user_id, to_date(log_time) order by log_time) l
10         from tracking_log
11     ) a
12     where op='A' and l='B'
13 ) b
14 group by date;
15     
View Code

统计用户行为序列为A-B-D的用户数

其中:A-B之间可以有任何其他浏览记录(如C、E等),B-D之间除了C,记录可以有任何其他浏览记录(如A、E等)

select to_date(log_time) as dt
      ,count(distinct user_id)
from
(
    select user_id
          ,log_time
          ,concat_ws(',',collect_set(op) over(partition by user_id order by log_time)) as op_str
    from tracking_log
) t
where op_str like '%A%B%D' and op_str not like '%A%B%C%D'
group by to_date(log_time)
View Code
学生得分表user_score,包括user_id、score
user_id(bigint) score
201934012 82
12333256 62
84272916 67
12333256 87
94038271 91
201934012 38
12333256 56
 
select t0.sid as sid
      ,max(t0.avg_score) as avg_score
      ,max(t0.rn) as rn     
from
  (select sid
               ,score
               ,avg_score
               ,dense_rank() over(order by avg_score desc) rn
           from(
                     select sid
                              ,score
                              ,avg(score) over(partition by sid) as avg_score             
                     from user_score
                 ) m
  ) t0
  left semi join 
     ( select sid, sum(1) cnt
       from user_score
       where score < 60
       group by sid
       having cnt >= 2;
     ) t1
  on t0.sid=t1.sid
group by t0.sid
View Code

hive在0.13版本以后开始支持更多的子查询,如in ,not in的子查询,推荐使用 left semi join 进行计算,left semi join 比 in 效率更高

 

posted @ 2023-08-27 10:06  akia开凯  阅读(50)  评论(0编辑  收藏  举报