hive SQL案例
-
上个月用户连续n天登录天数
用户登录记录表user_login,包括用户id(user_id)、日期(login_date)
user_id(bigint) | login_date(string) |
12333256 | 2021-01-03 |
84272916 | 2021-01-03 |
94038271 | 2021-01-02 |
201934012 | 2021-01-03 |
过滤出当月数据:where login_date > ''
转换成date:to_date函数,参数默认按照"yyyy-MM-dd"的格式
转换成date:to_date函数,参数默认按照"yyyy-MM-dd"的格式
去重:一天可能有多次登录记录(可以用distinct、group by、row_number去重)
开窗排序,生成rn1:按user_id分区,login_date排序
login_date与rn1作差得到新日期dt2:date_sub(login_date, rn1)
统计天数:按user_id、dt2分组,再使用count(1)统计连续登录天数
开窗排序,生成rn2:按user_id分区,count(1)排序
![](https://images.cnblogs.com/OutliningIndicators/ContractedBlock.gif)
with t1 as ( select user_id, to_date(login_date) as dt1, row_number() over(partition by user_id order by to_date(login_date)) as rn1 from db.user_login where login_date > '2023-07-31' group by user_id, to_date(login_date) ), t2 as ( select user_id, date_sub(dt1, rn1) as dt2 from t1 ), t3 as ( select user_id, dt2, count(1) as continuous_day, row_number() over(partition by user_id order by count(1) desc) rn2 from t2 group by user_id, dt2 ) select user_id, dt2 from t3 where rn2 = 1;
部分代码为:
![](https://images.cnblogs.com/OutliningIndicators/ContractedBlock.gif)
select sku_id, create_date, date_sub(create_date, row_number() over (partition by sku_id order by create_date)) as sub_res from order_detail group by sku_id, create_date -- 筛选出销售总额超过100的记录 having sum(price * sku_num) > 100
-
求每个店铺访问次数TopN的访客信息
用户访问表user_log,包括user_id、shop
user_id | shop |
u1 | a |
u2 | a |
u3 | b |
u4 | c |
![](https://images.cnblogs.com/OutliningIndicators/ContractedBlock.gif)
with t1 as { select shop, user_id, count(1) as cnt, row_number() over(partition by shop order by count(1) desc) as rn1 from user_log group by shop, user_id } select shop, user_id, cnt from t1 where rn1 <= n;
-
计算出1日、3日、7日留存率
用户登录记录表user_log,包括user_id、log_date
user_id(bigint) | log_date(string) |
12333256 | 2021-01-03 |
84272916 | 2021-01-03 |
94038271 | 2021-01-02 |
201934012 | 2021-01-03 |
思路1:x天数据多次left join每天的数据
https://blog.csdn.net/weixin_45788152/article/details/110597806
思路2:x天的数据一次left join其他天数据,使用case when处理
https://blog.csdn.net/cold___play/article/details/124053392
View Code
View Code
View Code
View Code
View Code
View Code
去重:一天可能有多次登录记录(可以用distinct、group by、row_number去重);
关联:表自身关联,过滤出右表日期大于左表日期的日志
求x日留存:使用datediff得到日期差求留存
计算留存率:x日留存率=x日留存用户数/基准日活跃用户数
![](https://images.cnblogs.com/OutliningIndicators/ContractedBlock.gif)
with t1 as ( select a.user_id as user_id, a.start_date as start_date, b.end_date as end_date, datediff(b.end_date, a.start_date) as diff_date from ( select user_id, to_date(log_date) as start_date from db.user_log group by user_id, to_date(log_date)) a left join ( select user_id, to_date(log_date) as end_date from db.user_log group by user_id, to_date(log_date)) b on a.user_id = b.user_id where a.start_date < b.end_date ), t2 as ( select start_date, count(distinct user_id) as base_cnt, count(distinct if(diff_date = 1, user_id, null)) as remain_1d, count(distinct if(diff_date = 3, user_id, null)) as remain_3d, count(distinct if(diff_date = 7, user_id, null)) as remain_7d from t1 ) select start_date, base_cnt, remain_1d, remain_3d, remain_7d, concat(round(remain_1d/base_cnt * 100, 2), '%') as remain_1d_rate, concat(round(remain_3d/base_cnt * 100, 2), '%') as remain_3d_rate, concat(round(remain_7d/base_cnt * 100, 2), '%') as remain_7d_rate from t2;
-
pv累加趋势图
用户访问表user_log,包括user_id、log_date,求累计到当前月的总数
user_id(bigint) | log_date(string) |
12333256 | 2021-01-03 |
84272916 | 2021-01-03 |
94038271 | 2021-01-02 |
201934012 |
2021-01-03 |
over里面的order by的作用:聚合函数默认从起始行到当前行范围内计算,如果没有排序意味着,聚合函数在整个分区范围中计算
![](https://images.cnblogs.com/OutliningIndicators/ContractedBlock.gif)
select month, pv, sum(pv) over(order by month asc) as total_pv --从第一行加到当前行 from ( select substr(log_date,1,7) as month, count(user_id) as pv from db.user_log group by substr(log_date,1,7) ) t1;
有用户访问表,包含字段用户id、直播间id、进入直播间时间、离开直播间时间,统计直播间最高在线人数
增加标志位,将进入记为1,出去记为-1
统计累计到每个时刻的在线人数
得到最高在线人数
![](https://images.cnblogs.com/OutliningIndicators/ContractedBlock.gif)
select room_id, max(online_num) as max_online_num from( select uid, time, sum(p) over(partition by room_id order by time) online_num from ( select uid, room_id, entry_time as time, 1 as p from table union all select uid, room_id, leave_time as time, -1 as p from table ) t1 ) t2 group by room_id;
-
计算商品的售价涨幅
商品价格变更明细表sku_price_modify_detail,包括sku_id、new_price、change_date
sku_id | new_price | change_date |
12333256 | 13 | 2021-01-03 |
84272916 | 54 | 2021-01-03 |
94038271 | 67 | 2021-01-02 |
201934012 | 21 | 2021-01-05 |
![](https://images.cnblogs.com/OutliningIndicators/ContractedBlock.gif)
select t1.sku_id,t1.price_change from ( select sku_id, new_price - lead(new_price,1,0)over(partition by sku_id order by change_date desc) price_change, rank()over(partition by sku_id order by change_date desc) rn from sku_price_modify_detail ) t1 where rn=1 order by t1.price_change;
-
求连续点击三次的用户数,而且中间不能有别人的点击
用户点击表user_click,包括user_id、click_time
user_id(bigint) | click_time(string) |
12333256 | 2021-01-03 10:23:01 |
12333256 | 2021-01-03 10:23:02 |
84272916 | 2021-01-03 10:23:03 |
12333256 | 2021-01-03 10:23:04 |
12333256 | 2021-01-03 10:23:05 |
12333256 | 2021-01-03 10:23:06 |
12333256 | 2021-01-03 10:23:07 |
row_number() over (order by click_time) as rank_1得到rank_1为 1 2 3 4 5 6 7
row_number() over (partition by user_id order by click_time)得到rank_2为 1 2 1 3 4 5 6
rank1 - rank2得到diff为 0 0 2 1 1 1 1
![](https://images.cnblogs.com/OutliningIndicators/ContractedBlock.gif)
select distinct user_id from ( select *, rank_1 - rank2 as diff from ( select *, row_number() over(order by click_time) as rank_1 row_number() over(partition by user_id order by click_time) as rank_2 from user_click ) b ) c group by diff, user_id having count(diff) >=3
-
用户行为分析
用户行为表tracking_log,包括user_id、op、log_time
user_id(bigint) | op | log_time(string) |
201934012 | A | 2021-01-01 10:23:01 |
12333256 | B | 2021-01-02 10:23:02 |
84272916 | A | 2021-01-03 10:23:03 |
12333256 | C | 2021-01-04 10:23:04 |
94038271 | A | 2021-01-01 10:23:05 |
201934012 | A | 2021-01-02 10:23:06 |
12333256 | B | 2021-01-03 10:23:07 |
统计每天符合以下条件的用户数:A操作之后是B操作,AB操作必须相邻
to_date(string timestr):返回时间字符串中的日期部分,比如to_date('2021-01-01 10:23:01')返回2021-01-01
![](https://images.cnblogs.com/OutliningIndicators/ContractedBlock.gif)
1 select date, count(*) 2 from( 3 select user_id 4 from( 5 select 6 user_id, 7 to_date(log_time) date, 8 op, 9 lag(op, 1) over(partition by user_id, to_date(log_time) order by log_time) l 10 from tracking_log 11 ) a 12 where op='A' and l='B' 13 ) b 14 group by date; 15
统计用户行为序列为A-B-D的用户数
其中:A-B之间可以有任何其他浏览记录(如C、E等),B-D之间除了C,记录可以有任何其他浏览记录(如A、E等)
![](https://images.cnblogs.com/OutliningIndicators/ContractedBlock.gif)
select to_date(log_time) as dt ,count(distinct user_id) from ( select user_id ,log_time ,concat_ws(',',collect_set(op) over(partition by user_id order by log_time)) as op_str from tracking_log ) t where op_str like '%A%B%D' and op_str not like '%A%B%C%D' group by to_date(log_time)
学生得分表user_score,包括user_id、score
user_id(bigint) | score |
201934012 | 82 |
12333256 | 62 |
84272916 | 67 |
12333256 | 87 |
94038271 | 91 |
201934012 | 38 |
12333256 | 56 |
![](https://images.cnblogs.com/OutliningIndicators/ContractedBlock.gif)
select t0.sid as sid ,max(t0.avg_score) as avg_score ,max(t0.rn) as rn from (select sid ,score ,avg_score ,dense_rank() over(order by avg_score desc) rn from( select sid ,score ,avg(score) over(partition by sid) as avg_score from user_score ) m ) t0 left semi join ( select sid, sum(1) cnt from user_score where score < 60 group by sid having cnt >= 2; ) t1 on t0.sid=t1.sid group by t0.sid
hive在0.13版本以后开始支持更多的子查询,如in ,not in的子查询,推荐使用 left semi join 进行计算,left semi join 比 in 效率更高