尚硅谷大数据SQL题库 - 高级58道(答案解析)
高级题
计算2021年里有播放记录的每个视频的完播率(结果保留三位小数),并按完播率降序排序
注:视频完播率是指完成播放次数占总播放次数的比例。
简单起见,结束观看时间与开始播放时间的差>=视频时长时,视为完成播放。
1.2 表结构
用户-视频互动表k1_user_video_log
短视频信息表k2_video_info
1.3 建表/插数语句
DROP TABLE IF EXISTS k1_user_video_log, k2_video_info;
create table if not exists
k1_user_video_log
(
uid int,
video_id int,
start_time timestamp,
end_time timestamp,
if_follow tinyint,
if_like tinyint,
if_retweet tinyint,
comment_id int
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
k2_video_info
(
video_id int,
author int,
tag string,
duration int,
release_time timestamp
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO k1_user_video_log(uid, video_id, start_time, end_time, if_follow, i
f_like, if_retweet, comment_id) VALUES
(101, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:30', 0, 1, 1, null),
(102, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:24', 0, 0, 1, null),
(103, 2001, '2021-10-01 11:00:00', '2021-10-01 11:00:34', 0, 1, 0, 1732526),
(101, 2002, '2021-09-01 10:00:00', '2021-09-01 10:00:42', 1, 0, 1, null),
(102, 2002, '2021-10-01 11:00:00', '2021-10-01 11:00:30', 1, 0, 1, null);
INSERT INTO k2_video_info(video_id, author, tag, duration, release_time) VALUES
(2001, 901, '影视', 30, '2021-01-01 7:00:00'),
(2002, 901, '美食', 60, '2021-01-01 7:00:00'),
(2003, 902, '旅游', 90, '2021-01-01 7:00:00');
1.4 代码实现
SELECT t1.video_id, ROUND(sum(if(unix_timestamp(t1.end_time)-unix_timestamp(t1.start_time)>=t2.duration,1,0))/count(t1.video_id),3) as avg_comp_play_rate
FROM k1_user_video_log as t1
LEFT JOIN k2_video_info as t2
on t1.video_id=t2.video_id
WHERE year(start_time)=2021
GROUP BY t1.video_id
ORDER BY avg_comp_play_rate DESC;
示例数据的结果如下:
videojd avg_cornp_play_rate
视频2001在2021年10月有3次播放记录,观看时长分别为30秒、24秒、34秒,视频时长30秒,因此有两次是被 认为完成播放了的,故完播率为0.667 ;
视频2002在2021年9月和10月共2次播放记录,观看时长分别为42秒、30秒,视频时长60秒,故完播率为0.000。
SQL2 平均播放进度大于60%的视频类别-简单
2.1 题目需求
计算各类视频的平均播放进度,将进度大于60%的类别输出。
注:播放进度=播放时长÷视频时长*100%,当播放时长大于视频时长时,播放进度均记为100%。
结果保留两位小数,并按播放进度倒序排序。
2.2 表结构
用户-视频互动表k3_user_video_log
短视频信息表k4_video_info
2.3 建表/插数语句
create table if not exists
k3_user_video_log
(
uid int,
video_id int,
start_time timestamp,
end_time timestamp,
if_follow tinyint,
if_like tinyint,
if_retweet tinyint,
comment_id int
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
k4_video_info
(
video_id int,
author int,
tag string,
duration int,
release_time timestamp
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO k3_user_video_log(uid, video_id, start_time, end_time, if_follow, i
f_like, if_retweet, comment_id) VALUES
(101, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:30', 0, 1, 1, null),
(102, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:21', 0, 0, 1, null),
(103, 2001, '2021-10-01 11:00:50', '2021-10-01 11:01:20', 0, 1, 0, 1732526),
(102, 2002, '2021-10-01 11:00:00', '2021-10-01 11:00:30', 1, 0, 1, null),
(103, 2002, '2021-10-01 10:59:05', '2021-10-01 11:00:05', 1, 0, 1, null);
INSERT INTO k4_video_info(video_id, author, tag, duration, release_time) VALUES
(2001, 901, '影视', 30, '2021-01-01 7:00:00'),
(2002, 901, '美食', 60, '2021-01-01 7:00:00'),
(2003, 902, '旅游', 90, '2020-01-01 7:00:00');
2.4 代码实现
select
a1.video_id ,
a1.tag,
if(avg(rate)>100,100,avg(rate)) rate
from
(
select k3_user_video_log.uid id,k4_video_info.tag tag,k3_user_video_log.video_id,(unix_timestamp(end_time)-unix_timestamp(start_time))*100/k4_video_info.duration rate
from k3_user_video_log left join k4_video_info on k3_user_video_log.video_id = k4_video_info.video_id
)a1
group by a1.video_id,a1.tag
having avg(rate) > 60
order by avg(rate) desc;
输出:
影视|90.00%
美食|75.00%
解释:
影视类视频2001被用户101、102、103看过,播放进度分别为:30秒(100%)、21秒(70%)、30秒(100%),平均播放进度为90.00%(保留两位小数);
美食类视频2002被用户102、103看过,播放进度分别为:30秒(50%)、60秒(100%),平均播放进度为75.00%(保留两位小数);
SQL3 每类视频近一个月的转发量/率-中等
3.1 题目需求
统计在有用户互动的最近一个月(按包含当天在内的近30天算,比如10月31日的近30天为10.2~10.31之间的数据)中,每类视频的转发量和转发率(保留3位小数)。
注:转发率=转发量÷播放量。结果按转发率降序排序。
3.2 表结构
用户-视频互动表k5_user_video_log
短视频信息表k6_video_info
3.3 建表/插数语句
DROP TABLE IF EXISTS tb_user_video_log, tb_video_info;
create table if not exists
k5_user_video_log
(
uid int,
video_id int,
start_time timestamp,
end_time timestamp,
if_follow tinyint,
if_like tinyint,
if_retweet tinyint,
comment_id int
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
k6_video_info
(
video_id int,
author int,
tag string,
duration int,
release_time timestamp
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO k5_user_video_log(uid, video_id, start_time, end_time, if_follow, i
f_like, if_retweet, comment_id) VALUES
(101, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:20', 0, 1, 1, null)
,(102, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:15', 0, 0, 1, null)
,(103, 2001, '2021-10-01 11:00:50', '2021-10-01 11:01:15', 0, 1, 0, 1732526)
,(102, 2002, '2021-09-10 11:00:00', '2021-09-10 11:00:30', 1, 0, 1, null)
,(103, 2002, '2021-10-01 10:59:05', '2021-10-01 11:00:05', 1, 0, 0, null);
INSERT INTO k6_video_info(video_id, author, tag, duration, release_time) VALUES
(2001, 901, '影视', 30, '2021-01-01 7:00:00')
,(2002, 901, '美食', 60, '2021-01-01 7:00:00')
,(2003, 902, '旅游', 90, '2020-01-01 7:00:00');
3.4 代码实现
select
tag,
sum(if_retweet) retweet_cut,
cast(sum(if_retweet)/count(*) as decimal(16,3)) retweet_rate
from(
select
*
from k5_user_video_log
where start_time >= date_sub('2021-10-01',30)
) t1
join k6_video_info t2 on t1.video_id = t2.video_id
group by tag;
输出:
影视|2|0.667
美食|1|0.500
解释:
由表k5_user_video_log的数据可得,数据转储当天为2021年10月1日。近30天内,影视类视频2001共有3次播
放记录,被转发2次,转发率为0.667;美食类视频2002共有2次播放记录,1次被转发,转发率为0.500。
SQL4 每个创作者每月的涨粉率及截止当前的总粉丝量-中等
4.1 题目需求
计算2021年里每个创作者每月的涨粉率及截止当月的总粉丝量
注:涨粉率=(加粉量 - 掉粉量) / 播放量。结果按创作者ID、总粉丝量升序排序。
if_follow-是否关注为1表示用户观看视频中关注了视频创作者,为0表示此次互动前后关注状态未发生变化,为2表示本次观看过程中取消了关注。
4.2 表结构
用户-视频互动表k7_user_video_log
短视频信息表k8_video_info
4.3 建表/插数语句
create table if not exists
k7_user_video_log
(
uid int,
video_id int,
start_time timestamp,
end_time timestamp,
if_follow tinyint,
if_like tinyint,
if_retweet tinyint,
comment_id int
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
k8_video_info
(
video_id int,
author int,
tag string,
duration int,
release_time timestamp
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO tb_user_video_log(uid, video_id, start_time, end_time, if_follow, i
f_like, if_retweet, comment_id) VALUES
(101, 2001, '2021-09-01 10:00:00', '2021-09-01 10:00:20', 0, 1, 1, null)
,(105, 2002, '2021-09-10 11:00:00', '2021-09-10 11:00:30', 1, 0, 1, null)
,(101, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:20', 1, 1, 1, null)
,(102, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:15', 0, 0, 1, null)
,(103, 2001, '2021-10-01 11:00:50', '2021-10-01 11:01:15', 1, 1, 0, 1732526)
,(106, 2002, '2021-10-01 10:59:05', '2021-10-01 11:00:05', 2, 0, 0, null);
INSERT INTO tb_video_info(video_id, author, tag, duration, release_time) VALUES
(2001, 901, '影视', 30, '2021-01-01 7:00:00')
,(2002, 901, '影视', 60, '2021-01-01 7:00:00')
,(2003, 902, '旅游', 90, '2020-01-01 7:00:00')
,(2004, 902, '美女', 90, '2020-01-01 8:00:00');
4.4 代码实现
select
author,
month,
fans_growth_rate,
sum(month_add) over(partition by author order by month) total_fans
from(
select
author,
month,
cast((sum(add)-sum(lose))/ count(*) as decimal(16,3)) fans_growth_rate,
sum(add - lose) month_add
from (
select
video_id,
if(if_follow=1,1,0) add,
if(if_follow=2,1,0) lose,
date_format(start_time,'yyyy-MM') month
from k7_user_video_log
) t1
join k8_video_info t2 on t1.video_id = t2.video_id
group by author, month
order by month
) t2;
输出:
901|2021-09|0.500|1
901|2021-10|0.250|2
解释:
示例数据中表 k7_user_video_log 里只有视频2001和2002的播放记录,都来自创作者901,播放时间在2021年9月和10月;其中9月里加粉量为1,掉粉量为0,播放量为2,因此涨粉率为0.500(保留3位小数);其中10月里加粉量为2,掉份量为1,播放量为4,因此涨粉率为0.250,截止当前总粉丝数为2。
SQL5 近一个月发布的视频中热度最高的top3视频-困难
5.1 题目需求
找出近一个月发布的视频中热度最高的top3视频。
注:热度=(a*视频完播率+b*点赞数+c*评论数+d*转发数)*新鲜度;
新鲜度=1/(最近无播放天数+1);
当前配置的参数a,b,c,d分别为100、5、3、2。
最近播放日期以 end_time-结束观看时间 为准,假设为T,则最近一个月按 [T-29, T] 闭区间统计。
结果中热度保留为整数,并按热度降序排序。
5.2 表结构
用户-视频互动表k11_user_video_log
短视频信息表k12_video_info
5.3 建表/插数语句
create table if not exists
k11_user_video_log
(
uid int,
video_id int,
start_time timestamp,
end_time timestamp,
if_follow tinyint,
if_like tinyint,
if_retweet tinyint,
comment_id int
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
k12_video_info
(
video_id int,
author int,
tag string,
duration int,
release_time timestamp
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO tb_user_video_log(uid, video_id, start_time, end_time, if_follow, i
f_like, if_retweet, comment_id) VALUES
(101, 2001, '2021-09-24 10:00:00', '2021-09-24 10:00:20', 1, 1, 0, null)
,(105, 2002, '2021-09-25 11:00:00', '2021-09-25 11:00:30', 0, 0, 1, null)
,(102, 2002, '2021-09-25 11:00:00', '2021-09-25 11:00:30', 1, 1, 1, null)
,(101, 2002, '2021-09-26 11:00:00', '2021-09-26 11:00:30', 1, 0, 1, null)
,(101, 2002, '2021-09-27 11:00:00', '2021-09-27 11:00:30', 1, 1, 0, null)
,(102, 2002, '2021-09-28 11:00:00', '2021-09-28 11:00:30', 1, 0, 1, null)
,(103, 2002, '2021-09-29 11:00:00', '2021-09-29 11:00:30', 1, 0, 1, null)
,(102, 2002, '2021-09-30 11:00:00', '2021-09-30 11:00:30', 1, 1, 1, null)
,(101, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:20', 1, 1, 0, null)
,(102, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:15', 0, 0, 1, null)
,(103, 2001, '2021-10-01 11:00:50', '2021-10-01 11:01:15', 1, 1, 0, 1732526)
,(106, 2002, '2021-10-02 10:59:05', '2021-10-02 11:00:05', 2, 0, 1, null)
,(107, 2002, '2021-10-02 10:59:05', '2021-10-02 11:00:05', 1, 0, 1, null)
,(108, 2002, '2021-10-02 10:59:05', '2021-10-02 11:00:05', 1, 1, 1, null)
,(109, 2002, '2021-10-03 10:59:05', '2021-10-03 11:00:05', 0, 1, 0, null);
INSERT INTO tb_video_info(video_id, author, tag, duration, release_time) VALUES
(2001, 901, '旅游', 30, '2020-01-01 7:00:00')
,(2002, 901, '旅游', 60, '2021-01-01 7:00:00')
,(2003, 902, '影视', 90, '2020-01-01 7:00:00')
,(2004, 902, '美女', 90, '2020-01-01 8:00:00');
5.4 代码实现
select a.video_id, round((full_play_rate + like_rate + common_rate + retweet_rate) * fresh_rate, 0) high_index
from (
-- 求完播率
select video_id, sum(if(t >= duration, 1, 0)) * 100 / count(*) full_play_rate
from (
select k11_user_video_log.video_id,
unix_timestamp(end_time, 'yyyy-MM-dd HH:mm:ss:sss') -
unix_timestamp(start_time, 'yyyy-MM-dd HH:mm:ss:sss') t,
k12vi.duration
from k11_user_video_log
left join k12_video_info k12vi
on k11_user_video_log.video_id = k12vi.video_id) t1
group by video_id) a
left join (
-- 点赞、评论、转发
select video_id,
sum(if_like) * 5 like_rate,
sum(if(comment_id is not null, 1, 0)) * 3 common_rate,
sum(if_retweet) * 2 retweet_rate
from k11_user_video_log
group by video_id) b on a.video_id = b.video_id
left join (
-- 新鲜度
select video_id, 1 / (if(no_play_days > 29, 29, no_play_days) + 1) fresh_rate
from (
-- 最近无播放天数
select video_id,
end_date,
row_number() over (partition by video_id order by end_date desc) rk,
datediff(today, end_date) `no_play_days`
from (
-- 将结束时间转为日期,然后排序
select video_id, to_date(end_time) end_date, today
from k11_user_video_log
join(select max(to_date(end_time)) today
from k11_user_video_log
group by 1) tx
) t1) t2
where rk = 1) c
on a.video_id = c.video_id
order by high_index desc
limit 3;
输出:
旅游|2021-10-01|5|2
旅游|2021-10-02|5|3
旅游|2021-10-03|6|3
解释:
最近播放日期为2021-10-03,记作当天日期;近一个月(2021-09-04及之后)发布的视频有2001、2002、2003、2004,不过2004暂时还没有播放记录;视频2001完播率1.0(被播放次数4次,完成播放4次),被点赞3次,评论1次,转发2次,最近无播放天数为0,
因此热度为: (100*1.0+5*3+3*1+2*2)/(0+1)=122
同理,视频2003完播率0,被点赞数1,评论和转发均为0,最近无播放天数为3,因此热度为:(100*0+5*1+3*0+2*0)/(3+1)=1(1.2保留为整数)
SQL6 2021年11月每天的人均浏览文章时长-简单
6.1 题目需求
场景逻辑说明:
artical_id-文章ID代表用户浏览的文章的ID,artical_id-文章ID为0表示用户在非文章内容页 (比如App内的列表页、活动页等)。
问题:
统计2021年11月每天的人均浏览文章时长(秒数),结果保留1位小数,并按时长由短到长排序。
6.2 表结构
用户行为日志表m1_user_log
6.3 建表/插数语句
create table if not exists
m1_user_log
(
uid INT COMMENT '用户ID',
artical_id INT COMMENT '视频ID',
in_time timestamp COMMENT '进入时间',
out_time timestamp COMMENT '离开时间',
sign_in TINYINT COMMENT '是否签到'
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO m1_user_log(uid, artical_id, in_time, out_time, sign_in) VALUES
(101, 9001, '2021-11-01 10:00:00', '2021-11-01 10:00:31', 0),
(102, 9001, '2021-11-01 10:00:00', '2021-11-01 10:00:24', 0),
(102, 9002, '2021-11-01 11:00:00', '2021-11-01 11:00:11', 0),
(101, 9001, '2021-11-02 10:00:00', '2021-11-02 10:00:50', 0),
(102, 9002, '2021-11-02 11:00:01', '2021-11-02 11:00:24', 0);
6.4 代码实现
select
a.in_time,
round(cast((sum(view_time) / count(distinct uid)) as decimal(10,2)),1) avg_view_len_sec
from
(select
uid,
substr(in_time,1,10) in_time,
substr(out_time,1,10) out_time,
(unix_timestamp(out_time) - unix_timestamp(in_time)) view_time
from m1_user_log) a
group by a.in_time;
输出:
2021-11-01|33.0
2021-11-02|36.5
解释
11月1日有2个人浏览文章,总共浏览时长为31+24+11=66秒,人均浏览33秒;
11月2日有2个人浏览文章,总共时长为50+23=73秒,人均时长为36.5秒。
SQL7 2021年11月每天新用户的次日留存率-中等
7.1 题目需求
统计2021年11月每天新用户的次日留存率(保留2位小数)
注:次日留存率为当天新增的用户数中第二天又活跃了的用户数占比。
如果in_time-进入时间和out_time-离开时间跨天了,在两天里都记为该用户活跃过,结果按日期升序。
7.2 表结构
用户行为日志表m3_user_log
7.3 建表/插数语句
create table if not exists
m3_user_log
(
uid INT COMMENT '用户ID',
artical_id INT COMMENT '视频ID',
in_time timestamp COMMENT '进入时间',
out_time timestamp COMMENT '离开时间',
sign_in TINYINT COMMENT '是否签到'
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO m3_user_log(uid, artical_id, in_time, out_time, sign_in) VALUES (101, 0, '2021-11-01 10:00:00', '2021-11-01 10:00:42', 1), (102, 9001, '2021-11-01 10:00:00', '2021-11-01 10:00:09', 0), (103, 9001, '2021-11-01 10:00:01', '2021-11-01 10:01:50', 0), (101, 9002, '2021-11-02 10:00:09', '2021-11-02 10:00:28', 0), (103, 9002, '2021-11-02 10:00:51', '2021-11-02 10:00:59', 0), (104, 9001, '2021-11-02 10:00:28', '2021-11-02 10:00:50', 0), (101, 9003, '2021-11-03 11:00:55', '2021-11-03 11:01:24', 0), (104, 9003, '2021-11-03 11:00:45', '2021-11-03 11:00:55', 0), (105, 9003, '2021-11-03 11:00:53', '2021-11-03 11:00:59', 0), (101, 9002, '2021-11-04 11:00:55', '2021-11-04 11:00:59', 0);
7.4 代码实现
select t2.dt,round(count(t1.uid)/count(t2.uid),2) uv_left_rate
from(
select uid,min(date(in_time)) dt
from m3_user_log
group by uid
)t2
left join(
select uid,date(in_time) dt
from m3_user_log
union
select uid,date(out_time) dt
from m3_user_log
)t1
on t2.uid=t1.uid and datediff(t1.dt,t2.dt)=1
where date_format(t2.dt,'yyyy-MM')='2021-11'
group by t2.dt
order by t2.dt
输出:
2021-11-01|0.67
2021-11-02|1.00
2021-11-03|0.00
解释:
11.01有3个用户活跃101、102、103,均为新用户,在11.02只有101、103两个又活跃了,因此11.01的次日留存率为0.67;
11.02有104一位新用户,在11.03又活跃了,因此11.02的次日留存率为1.00;
11.03有105一位新用户,在11.04未活跃,因此11.03的次日留存率为0.00;
11.04没有新用户,不输出。
SQL8 每天的日活数及新用户占比-较难
8.1 题目需求
统计每天的日活数及新用户占比
注:新用户占比=当天的新用户数÷当天活跃用户数(日活数)。
如果in_time-进入时间和out_time-离开时间跨天了,在两天里都记为该用户活跃过。
新用户占比保留2位小数,结果按日期升序排序。
8.2 表结构
用户行为日志表m5_user_log
8.3 建表/插数语句
create table if not exists
m5_user_log
(
uid INT COMMENT '用户ID',
artical_id INT COMMENT '视频ID',
in_time timestamp COMMENT '进入时间',
out_time timestamp COMMENT '离开时间',
sign_in TINYINT COMMENT '是否签到'
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO m5_user_log(uid, artical_id, in_time, out_time, sign_in) VALUES
(101, 9001, '2021-10-31 10:00:00', '2021-10-31 10:00:09', 0),
(102, 9001, '2021-10-31 10:00:00', '2021-10-31 10:00:09', 0),
(101, 0, '2021-11-01 10:00:00', '2021-11-01 10:00:42', 1),
(102, 9001, '2021-11-01 10:00:00', '2021-11-01 10:00:09', 0),
(108, 9001, '2021-11-01 10:00:01', '2021-11-01 10:01:50', 0),
(108, 9001, '2021-11-02 10:00:01', '2021-11-02 10:01:50', 0),
(104, 9001, '2021-11-02 10:00:28', '2021-11-02 10:00:50', 0),
(106, 9001, '2021-11-02 10:00:28', '2021-11-02 10:00:50', 0),
(108, 9001, '2021-11-03 10:00:01', '2021-11-03 10:01:50', 0),
(109, 9002, '2021-11-03 11:00:55', '2021-11-03 11:00:59', 0),
(104, 9003, '2021-11-03 11:00:45', '2021-11-03 11:00:55', 0),
(105, 9003, '2021-11-03 11:00:53', '2021-11-03 11:00:59', 0),
(106, 9003, '2021-11-03 11:00:45', '2021-11-03 11:00:55', 0);
8.4 代码实现
select dt,count(distinct uid)DAU,
round(count(`if`(rk=1,1,null))/count(*),2)uv_new_ratio
from (
select date_format(in_time,'yyyy-MM-dd')dt,uid,artical_id,row_number() over (partition by uid order by date_format(in_time,'yyyy-MM-dd')) rk
from m5_user_log
)t1
group by dt;
输出:
2021-10-31|2|1.00
2021-11-01|3|0.33
2021-11-02|3|0.67
2021-11-03|5|0.40
解释:
2021年10月31日有2个用户活跃,都为新用户,新用户占比1.00;
2021年11月1日有3个用户活跃,其中1个新用户,新用户占比0.33;
SQL9 连续签到领金币 困难与SQL42相同
9.1 题目需求
场景逻辑说明:
artical_id-文章ID代表用户浏览的文章的ID,特殊情况artical_id-文章ID为0表示用户在非文章内容页(比如App内的列表页、活动页等)。注意:只有artical_id为0时sign_in值才有效。
从2021年7月7日0点开始,用户每天签到可以领1金币,并可以开始累积签到天数,连续签到的第3、7天分别可额外领2、6金币。
每连续签到7天后重新累积签到天数(即重置签到天数:连续第8天签到时记为新的一轮签到的第一天,领1金币)
问题:
计算每个用户2021年7月以来每月获得的金币数(该活动到10月底结束,11月1日开始的签到不再获得金币)。结果按月份、ID升序排序。
注:如果签到记录的in_time-进入时间和out_time-离开时间跨天了,也只记作in_time对应的日期签到了。
9.2 表结构
用户行为日志表m6_user_log
9.3 建表/插数语句
create table if not exists
m6_user_log
(
uid INT COMMENT '用户ID',
artical_id INT COMMENT '视频ID',
in_time timestamp COMMENT '进入时间',
out_time timestamp COMMENT '离开时间',
sign_in TINYINT COMMENT '是否签到'
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO m6_user_log(uid, artical_id, in_time, out_time, sign_in) VALUES
(101, 0, '2021-07-07 10:00:00', '2021-07-07 10:00:09', 1),
(101, 0, '2021-07-08 10:00:00', '2021-07-08 10:00:09', 1),
(101, 0, '2021-07-09 10:00:00', '2021-07-09 10:00:42', 1),
(101, 0, '2021-07-10 10:00:00', '2021-07-10 10:00:09', 1),
(101, 0, '2021-07-11 23:59:55', '2021-07-11 23:59:59', 1),
(101, 0, '2021-07-12 10:00:28', '2021-07-12 10:00:50', 1),
(101, 0, '2021-07-13 10:00:28', '2021-07-13 10:00:50', 1),
(102, 0, '2021-10-01 10:00:28', '2021-10-01 10:00:50', 1),
(102, 0, '2021-10-02 10:00:01', '2021-10-02 10:01:50', 1),
(102, 0, '2021-10-03 11:00:55', '2021-10-03 11:00:59', 1),
(102, 0, '2021-10-04 11:00:45', '2021-10-04 11:00:55', 0),
(102, 0, '2021-10-05 11:00:53', '2021-10-05 11:00:59', 1),
(102, 0, '2021-10-06 11:00:45', '2021-10-06 11:00:55', 1);
9.4 代码实现
select uid,mouth,sum(score) coin
from (
select uid,dt,mouth,sign,
`if`((rk=1 and score=5),score,`if`((rk=2 and score=5),3,`if`((rk=1 and score=3),3,sign))) score
from (
select uid,dt,mouth,sign,score,
row_number() over (partition by uid,score order by dt) rk
from (
select t1.uid uid,t1.dt dt,t1.mouth mouth,t1.sign_in sign,`if`(
(t1.sign_in=1 and lag(sign_in,1) over (partition by uid) = 1
and lag(sign_in,2) over (partition by uid) = 1
and lag(sign_in,3) over (partition by uid) = 1
and lag(sign_in,4) over (partition by uid) = 1),5,
`if`(
(t1.sign_in=1 and lag(sign_in,1) over (partition by uid) = 1
and lag(sign_in,2) over (partition by uid) = 1),3,sign_in
)) score
from (
select uid,date_format(in_time,'yyyy-MM-dd')dt,date_format(in_time,'yyyy-MM')mouth,sign_in
from m6_user_log
)t1
)t2
)t3
)t4
group by uid, mouth;
输出:
101|202107|15
102|202110|7
解释:
101在活动期内连续签到了7天,因此获得1*7+2+6=15金币;
102在10.01~10.03连续签到3天获得5金币
10.04断签了,10.05~10.06连续签到2天获得2金币,共得到7金币。
SQL10 计算商城中2021年每月的GMV-简单
10.1 题目需求
场景逻辑说明:
用户将购物车中多件商品一起下单时,订单总表会生成一个订单(但此时未付款,status-订单状态为0,表示待付款);当用户支付完成时,在订单总表修改对应订单记录的status-订单状态为1,表示已付款;若用户退货退款,在订单总表生成一条交易总金额为负值的记录(表示退款金额,订单号为退款单号, status-订单状态为2表示已退款)。
问题:
请计算商城中2021年每月的GMV,输出GMV大于10w的每月GMV,值保留到整数。
注:GMV为已付款订单和未付款订单两者之和。结果按GMV升序排序。
10.2 表结构
订单总表n1_order_overall
10.3 建表/插数语句
create table if not exists
n1_order_overall
(
order_id INT COMMENT '订单号',
uid INT COMMENT '用户ID',
event_time timestamp COMMENT '下单时间',
total_amount DECIMAL COMMENT '订单总金额',
total_cnt INT COMMENT '订单商品总件数',
status TINYINT COMMENT '订单状态'
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO n1_order_overall(order_id, uid, event_time, total_amount, total_cnt, `status`) VALUES (301001, 101, '2021-10-01 10:00:00', 15900, 2, 1), (301002, 101, '2021-10-01 11:00:00', 15900, 2, 1),
(301003, 102, '2021-10-02 10:00:00', 34500, 8, 0), (301004, 103, '2021-10-12 10:00:00', 43500, 9, 1), (301005, 105, '2021-11-01 10:00:00', 31900, 7, 1), (301006, 102, '2021-11-02 10:00:00', 24500, 6, 1), (391007, 102, '2021-11-03 10:00:00', -24500, 6, 2), (301008, 104, '2021-11-04 10:00:00', 55500, 12, 0);
10.4 代码实现
select
t1.month,t1.GMV
from
(select
date_format(event_time,"yyyy-MM") month,
sum(total_amount) over(partition by date_format(event_time,"yyyy-MM") ) GMV
from n1_order_overall
where status <> 2)t1
group by t1.month, t1.GMV;
输出:
2021-10|109800
2021-11|111900
解释:
2021年10月有3笔已付款的订单,1笔未付款订单,总交易金额为109800;2021年11月有2笔已付款订单,1笔未付款订单,总交易金额为111900(还有1笔退款订单由于已计算了付款的订单金额,无需计算在GMV中)。
SQL11 某店铺的各商品毛利率及店铺整体毛利率-中等
11.1 题目需求
请计算2021年10月以来店铺901中商品毛利率大于24.9%的商品信息及店铺整体毛利率。
注: 商品毛利率=(1-进价/平均单件售价)*100% ;
店铺毛利率=(1-总进价成本/总销售收入)*100% 。
结果先输出店铺毛利率,再按商品ID升序输出各商品毛利率,均保留1位小数。
11.2表结构
商品信息表n3_product_info
订单总表n4_order_overall
订单明细表n5_order_detail
11.3 建表/插数语句
create table if not exists
n3_order_overall
(
order_id INT COMMENT '订单号',
uid INT COMMENT '用户ID',
event_time timestamp COMMENT '下单时间',
total_amount DECIMAL COMMENT '订单总金额',
total_cnt INT COMMENT '订单商品总件数',
status TINYINT COMMENT '订单状态'
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
n4_product_info
(
product_id INT COMMENT '商品ID',
shop_id INT COMMENT '店铺ID',
tag string COMMENT '商品类别标签',
in_price DECIMAL COMMENT '进货价格',
quantity INT COMMENT '进货数量',
release_time timestamp COMMENT '上架时间'
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
n5_order_detail
(
order_id INT COMMENT '订单号',
product_id INT COMMENT '商品ID',
price DECIMAL COMMENT '商品单价',
cnt INT COMMENT '下单数量'
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO n3_order_overall(order_id, uid, event_time, total_amount, total_cnt,
`status`) VALUES
(301001, 101, '2021-10-01 10:00:00', 30000, 3, 1),
(301002, 102, '2021-10-01 11:00:00', 23900, 2, 1),
(301003, 103, '2021-10-02 10:00:00', 31000, 2, 1);
INSERT INTO n4_product_info(product_id, shop_id, tag, in_price, quantity, relea
se_time) VALUES
(8001, 901, '家电', 6000, 100, '2020-01-01 10:00:00'),
(8002, 902, '家电', 12000, 50, '2020-01-01 10:00:00'),
(8003, 901, '3C数码', 12000, 50, '2020-01-01 10:00:00');
INSERT INTO n5_order_detail(order_id, product_id, price, cnt) VALUES
(301001, 8001, 8500, 2),
(301001, 8002, 15000, 1),
(301002, 8001, 8500, 1),
(301002, 8002, 16000, 1),
(301003, 8002, 14000, 1),
(301003, 8003, 18000, 1);
11.4 代码实现
select product_id,
profit_rate
from(
select '店铺毛利率' product_id,
concat(round((1-sum(in_price * cnt) / sum(price * cnt))* 100,1) , '%') as profit_rate
from(
select shop_id,
in_price,
price,
cnt,
date_format(event_time,'yyyy-MM') event_time
from n4_product_info n4 left join n5_order_detail n5
on n4.product_id=n5.product_id
join n3_order_overall n3
on n3.order_id=n5.order_id
) t1
where t1.shop_id=901 and event_time>=2021-10
group by shop_id
union
select cast(t2.product_id as string) product_id,
concat(round((1-sum(in_price * cnt) / sum(price * cnt))* 100,1),'%') as profit_rate
from(
select n4.product_id product_id,
shop_id,
in_price,
price,
cnt,
date_format(event_time,'yyyy-MM') event_time
from n4_product_info n4 left join n5_order_detail n5
on n4.product_id=n5.product_id
join n3_order_overall n3
on n3.order_id=n5.order_id
) t2
where t2.shop_id=901 and event_time>=2021-10
group by product_id
having 1-sum(in_price * cnt) / sum(price * cnt) >0.249
order by product_id
) t3;
输出:
店铺汇总|31.0%
解释:
店铺901有两件商品8001和8003;8001售出了3件,销售总额为25500,进价总额为18000,毛利率为1-
18000/25500=29.4%,8003售出了1件,售价为18000,进价为12000,毛利率为33.3%;
店铺卖出的这4件商品总销售额为43500,总进价为30000,毛利率为1-30000/43500=31.0%
SQL12 零食类商品中复购率top3高的商品-中等
12.1 题目需求
请统计零食类商品中复购率top3高的商品。
注:复购率指用户在一段时间内对某商品的重复购买比例,复购率越大,则反映出消费者对品牌的忠诚度就越高,也叫回头率
此处我们定义:某商品复购率 = 近90天内购买它至少两次的人数 ÷ 购买它的总人数
近90天指包含最大日期(记为当天)在内的近90天。结果中复购率保留3位小数,并按复购率倒序、商品ID升序排序
12.2 表结构
商品信息表n6_product_info
订单总表n7_order_overall
订单明细表n8_order_detail
12.3 建表/插数语句
create table if not exists
n6_order_overall
(
order_id INT COMMENT '订单号',
uid INT COMMENT '用户ID',
event_time timestamp COMMENT '下单时间',
total_amount DECIMAL COMMENT '订单总金额',
total_cnt INT COMMENT '订单商品总件数',
status TINYINT COMMENT '订单状态'
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
n7_product_info
(
product_id INT COMMENT '商品ID',
tag string COMMENT '商品类别标签',
in_price DECIMAL COMMENT '进货价格',
quantity INT COMMENT '进货数量',
release_time timestamp COMMENT '上架时间'
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
n8_order_detail
(
order_id INT COMMENT '订单号',
product_id INT COMMENT '商品ID',
price DECIMAL COMMENT '商品单价',
cnt INT COMMENT '下单数量'
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO n6_product_info(product_id, shop_id, tag, in_price, quantity, relea
se_time) VALUES
(8001, 901, '零食', 60, 1000, '2020-01-01 10:00:00'),
(8002, 901, '零食', 140, 500, '2020-01-01 10:00:00'),
(8003, 901, '零食', 160, 500, '2020-01-01 10:00:00');
INSERT INTO n7_order_overall(order_id, uid, event_time, total_amount, total_cnt,
`status`) VALUES
(301001, 101, '2021-09-30 10:00:00', 140, 1, 1),
(301002, 102, '2021-10-01 11:00:00', 235, 2, 1),
(301011, 102, '2021-10-31 11:00:00', 250, 2, 1),
(301003, 101, '2021-11-02 10:00:00', 300, 2, 1),
(301013, 105, '2021-11-02 10:00:00', 300, 2, 1),
(301005, 104, '2021-11-03 10:00:00', 170, 1, 1);
INSERT INTO n8_order_detail(order_id, product_id, price, cnt) VALUES
(301001, 8002, 150, 1),
(301011, 8003, 200, 1),
(301011, 8001, 80, 1),
(301002, 8001, 85, 1),
(301002, 8003, 180, 1),
(301003, 8002, 140, 1),
(301003, 8003, 180, 1),
(301013, 8002, 140, 2),
(301005, 8003, 180, 1);
12.4 代码实现
select t4.product_id, round(t4.count/(t5.total-1), 3) cpr
from (
--左表查询每个商品复购次数
select product_id, count(product_id) count
from (
select product_id
from (
--筛选出每个商品重复下单的uid
select product_id, uid, event_time,
lag(event_time, 1, 000) over(partition by product_id, uid order by product_id) lag_time
from (select od.product_id, oo.uid, oo.event_time
from n8_order_detail od join n6_order_overall oo
on od.order_id = oo.order_id
where oo.status = 1
order by od.product_id, oo.event_time) t1
) t2
where datediff(event_time, lag_time) < 90
) t3
group by product_id) t4
join (
--右表查询总下单次数
select distinct product_id, count(product_id) over (partition by product_id) total
from n8_order_detail) t5
on t4.product_id = t5.product_id
order by t4.product_id;
输出:
8001|1.000
8002|0.500
8003|0.333
解释:
商品8001、8002、8003都是零食类商品,8001只被用户102购买了两次,复购率1.000;
商品8002被101购买了两次,被105购买了1次,复购率0.500;
商品8003被102购买两次,被101和105各购买1次,复购率为0.333。
SQL13 2021年国庆在北京接单3次及以上的司机 统计信息-简单
13.1 题目需求
场景逻辑说明:
用户提交打车请求后,在用户打车记录表生成一条打车记录,order_id-订单号设为null;
当有司机接单时,在打车订单表生成一条订单,填充order_time-接单时间及其左边的字段,start_time-开始计费的上车时间及其右边的字段全部为null,并把order_id-订单号和order_time-接单时间(end_time-打车结束时间)写入打车记录表;若一直无司机接单,超时或中途用户主动取消打车,则记录end_time-打车结束时间。
若乘客上车前,乘客或司机点击取消订单,会将打车订单表对应订单的finish_time-订单完成时间填充为取消时间,其余字段设为null。
当司机接上乘客时,填充订单表中该start_time-开始计费的上车时间。
当订单完成时填充订单完成时间、里程数、费用;评分设为null,在用户给司机打1~5星评价后填充。
问题:
请统计2021年国庆7天期间在北京市接单至少3次的司机的平均接单数和平均兼职收入(暂不考虑平台佣金,直接计算完成的订单费用总额),结果保留3位小数。
13.2 表结构
用户打车记录表p1_get_car_record
打车订单表p2_get_car_order
13.3 建表/插数语句
create table if not exists
p1_get_car_record
(
uid INT COMMENT '用户ID',
city string COMMENT '城市',
event_time timestamp COMMENT '打车时间',
end_time timestamp COMMENT '打车结束时间',
order_id INT COMMENT '订单号'
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
p2_get_car_order
(
order_id INT COMMENT '订单号',
uid INT COMMENT '用户ID',
driver_id INT COMMENT '司机ID',
order_time timestamp COMMENT '接单时间',
start_time timestamp COMMENT '开始计费的上车时间',
finish_time timestamp COMMENT '订单结束时间',
mileage DOUBLE COMMENT '行驶里程数',
fare DOUBLE COMMENT '费用',
grade TINYINT COMMENT '评分'
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO p1_get_car_record(uid, city, event_time, end_time, order_id) VALUES
(101, '北京', '2021-10-01 07:00:00', '2021-10-01 07:02:00', null),
(102, '北京', '2021-10-01 09:00:30', '2021-10-01 09:01:00', 9001),
(101, '北京', '2021-10-01 08:28:10', '2021-10-01 08:30:00', 9002),
(103, '北京', '2021-10-02 07:59:00', '2021-10-02 08:01:00', 9003),
(104, '北京', '2021-10-03 07:59:20', '2021-10-03 08:01:00', 9004),
(105, '北京', '2021-10-01 08:00:00', '2021-10-01 08:02:10', 9005),
(106, '北京', '2021-10-01 17:58:00', '2021-10-01 18:01:00', 9006),
(107, '北京', '2021-10-02 11:00:00', '2021-10-02 11:01:00', 9007),
(108, '北京', '2021-10-02 21:00:00', '2021-10-02 21:01:00', 9008) ;
INSERT INTO p2_get_car_order(order_id, uid, driver_id, order_time, start_time,
finish_time, mileage, fare, grade) VALUES
(9002, 101, 201, '2021-10-01 08:30:00', null, '2021-10-01 08:31:00', null, null
, null),
(9001, 102, 202, '2021-10-01 09:01:00', '2021-10-01 09:06:00', '2021-10-01 09:
31:00', 10.0, 41.5, 5),
(9003, 103, 202, '2021-10-02 08:01:00', '2021-10-02 08:15:00', '2021-10-02 08:
31:00', 11.0, 41.5, 4),
(9004, 104, 202, '2021-10-03 08:01:00', '2021-10-03 08:13:00', '2021-10-03 08:
31:00', 7.5, 22, 4),
(9005, 105, 203, '2021-10-01 08:02:10', '2021-10-01 08:18:00', '2021-10-01 08:
31:00', 15.0, 44, 5),
(9006, 106, 203, '2021-10-01 18:01:00', '2021-10-01 18:09:00', '2021-10-01 18:
31:00', 8.0, 25, 5),
(9007, 107, 203, '2021-10-02 11:01:00', '2021-10-02 11:07:00', '2021-10-02 11:
31:00', 9.9, 30, 5),
(9008, 108, 203, '2021-10-02 21:01:00', '2021-10-02 21:10:00', '2021-10-02 21:
31:00', 13.2, 38, 4);
13.4 代码实现
select city,
round(sum(ct)/count(*),3) avg_order_num,
round(sum(total_fare)/count(*),3) avg_income
from(select distinct city,
sum(fare) over(partition by driver_id) total_fare,
count(*) over(partition by driver_id) ct
from p2_get_car_order p2
left join p1_get_car_record p1
on p2.order_id = p1.order_id
where city='北京'
and date_format(order_time,'yyyy-MM-dd')>='2021-10-01'
and date_format(order_time,'yyyy-MM-dd')<='2021-10-07')t1
where ct>=3
group by city;
输出:
北京|3.500|121.000
解释:
在2021年国庆期间北京市的订单中,202共接了3单,兼职收入105;203接了4单,兼职收入137;201共接了1单,但取消了;
接单至少3次的司机有202和203,他两人全部总共接单数为7,总收入为242。因此平均接单数为3.500,平均收入为121.000;
SQL14 有取消订单记录的司机平均评分-简单
14.1 题目需求
请找到2021年10月有过取消订单记录的司机,计算他们每人全部已完成的有评分订单的平均评分及总体平均评分,保留1位小数。先按driver_id升序输出,再输出总体情况。
14.2 表结构
用户打车记录表p3_get_car_record
打车订单表p4_get_car_order
14.3 建表/插数语句
create table if not exists
p3_get_car_record
(
uid INT COMMENT '用户ID',
city string COMMENT '城市',
event_time timestamp COMMENT '打车时间',
end_time timestamp COMMENT '打车结束时间',
order_id INT COMMENT '订单号'
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
p4_get_car_order
(
order_id INT COMMENT '订单号',
uid INT COMMENT '用户ID',
driver_id INT COMMENT '司机ID',
order_time timestamp COMMENT '接单时间',
start_time timestamp COMMENT '开始计费的上车时间',
finish_time timestamp COMMENT '订单结束时间',
mileage DOUBLE COMMENT '行驶里程数',
fare DOUBLE COMMENT '费用',
grade TINYINT COMMENT '评分'
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO p3_get_car_record(uid, city, event_time, end_time, order_id) VALUES
(101, '北京', '2021-10-01 07:00:00', '2021-10-01 07:02:00', null),
(102, '北京', '2021-10-01 09:00:30', '2021-10-01 09:01:00', 9001),
(101, '北京', '2021-10-01 08:28:10', '2021-10-01 08:30:00', 9002),
(103, '北京', '2021-10-02 07:59:00', '2021-10-02 08:01:00', 9003),
(104, '北京', '2021-10-03 07:59:20', '2021-10-03 08:01:00', 9004),
(105, '北京', '2021-10-01 08:00:00', '2021-10-01 08:02:10', 9005),
(106, '北京', '2021-10-01 17:58:00', '2021-10-01 18:01:00', 9006),
(107, '北京', '2021-10-02 11:00:00', '2021-10-02 11:01:00', 9007),
(108, '北京', '2021-10-02 21:00:00', '2021-10-02 21:01:00', 9008),
(109, '北京', '2021-10-08 18:00:00', '2021-10-08 18:01:00', 9009);
INSERT INTO p4_get_car_order(order_id, uid, driver_id, order_time, start_time,
finish_time, mileage, fare, grade) VALUES
(9002, 101, 202, '2021-10-01 08:30:00', null, '2021-10-01 08:31:00', null, null
, null),
(9001, 102, 202, '2021-10-01 09:01:00', '2021-10-01 09:06:00', '2021-10-01 09:
31:00', 10.0, 41.5, 5),
(9003, 103, 202, '2021-10-02 08:01:00', '2021-10-02 08:15:00', '2021-10-02 08:
31:00', 11.0, 41.5, 4),
(9004, 104, 202, '2021-10-03 08:01:00', '2021-10-03 08:13:00', '2021-10-03 08:
31:00', 7.5, 22, 4),
(9005, 105, 203, '2021-10-01 08:02:10', null, '2021-10-01 08:31:00', null, null
, null),
(9006, 106, 203, '2021-10-01 18:01:00', '2021-10-01 18:09:00', '2021-10-01 18:
31:00', 8.0, 25.5, 5),
(9007, 107, 203, '2021-10-02 11:01:00', '2021-10-02 11:07:00', '2021-10-02 11:
31:00', 9.9, 30, 5),
(9008, 108, 203, '2021-10-02 21:01:00', '2021-10-02 21:10:00', '2021-10-02 21:
31:00', 13.2, 38, 4),
(9009, 109, 203, '2021-10-08 18:01:00', '2021-10-08 18:11:50', '2021-10-08 18:
51:00', 13, 40, 5);
14.4 代码实现
select cast(driver_id as string) driver_id,round(avg(grade) ,1) avg_grade from
(select t1.driver_id,t2.grade from
(select driver_id from p4_get_car_order where start_time is null)t1
join
(select * from p4_get_car_order where start_time is not null )t2 where t1.driver_id=t2.driver_id
) t3 group by driver_id
union all
select '总体' driver_id,round(sum(grade)/count(*),1) grade from (select t1.driver_id,t2.grade from
(select driver_id from p4_get_car_order where start_time is null)t1
join
(select * from p4_get_car_order where start_time is not null )t2 where t1.driver_id=t2.driver_id)
t4 order by driver_id;
输出:
202|4.3
203|4.8
总体|4.6
解释:
2021年国庆有未完成订单的司机有202和203;202的所有订单评分有:5、4、4,平均分为4.3;203的所有订单评分有:5、5、4、5,平均评分为4.8;总体平均评分为(5+4+4+5+5+4+5)/7=4.6
SQL15 每个城市中评分最高的司机信息-中等
15.1 题目需求
请统计每个城市中评分最高的司机平均评分、日均接单量和日均行驶里程数。
注:有多个司机评分并列最高时,都输出。
平均评分和日均接单量保留1位小数,
日均行驶里程数保留3位小数,按日均接单数升序排序。
15.2 表结构
用户打车记录表p5_get_car_record
打车订单表p6_get_car_order
15.3 建表/插数语句
create table if not exists
p5_get_car_record
(
uid INT COMMENT '用户ID',
city string COMMENT '城市',
event_time timestamp COMMENT '打车时间',
end_time timestamp COMMENT '打车结束时间',
order_id INT COMMENT '订单号'
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
p6_get_car_order
(
order_id INT COMMENT '订单号',
uid INT COMMENT '用户ID',
driver_id INT COMMENT '司机ID',
order_time timestamp COMMENT '接单时间',
start_time timestamp COMMENT '开始计费的上车时间',
finish_time timestamp COMMENT '订单结束时间',
mileage DOUBLE COMMENT '行驶里程数',
fare DOUBLE COMMENT '费用',
grade TINYINT COMMENT '评分'
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO p5_get_car_record(uid, city, event_time, end_time, order_id) VALUES
(101, '北京', '2021-10-01 07:00:00', '2021-10-01 07:02:00', null),
(102, '北京', '2021-10-01 09:00:30', '2021-10-01 09:01:00', 9001),
(101, '北京', '2021-10-01 08:28:10', '2021-10-01 08:30:00', 9002),
(103, '北京', '2021-10-02 07:59:00', '2021-10-02 08:01:00', 9003),
(104, '北京', '2021-10-03 07:59:20', '2021-10-03 08:01:00', 9004),
(105, '北京', '2021-10-01 08:00:00', '2021-10-01 08:02:10', 9005),
(106, '北京', '2021-10-01 17:58:00', '2021-10-01 18:01:00', 9006),
(107, '北京', '2021-10-02 11:00:00', '2021-10-02 11:01:00', 9007),
(108, '北京', '2021-10-02 21:00:00', '2021-10-02 21:01:00', 9008),
(109, '北京', '2021-10-08 18:00:00', '2021-10-08 18:01:00', 9009);
INSERT INTO p6_get_car_order(order_id, uid, driver_id, order_time, start_time,
finish_time, mileage, fare, grade) VALUES
(9002, 101, 202, '2021-10-01 08:30:00', null, '2021-10-01 08:31:00', null, null
, null),
(9001, 102, 202, '2021-10-01 09:01:00', '2021-10-01 09:06:00', '2021-10-01 09:
31:00', 10.0, 41.5, 5),
(9003, 103, 202, '2021-10-02 08:01:00', '2021-10-02 08:15:00', '2021-10-02 08:
31:00', 11.0, 41.5, 4),
(9004, 104, 202, '2021-10-03 08:01:00', '2021-10-03 08:13:00', '2021-10-03 08:
31:00', 7.5, 22, 4),
(9005, 105, 203, '2021-10-01 08:02:10', null, '2021-10-01 08:31:00', null, null
, null),
(9006, 106, 203, '2021-10-01 18:01:00', '2021-10-01 18:09:00', '2021-10-01 18:
31:00', 8.0, 25.5, 5),
(9007, 107, 203, '2021-10-02 11:01:00', '2021-10-02 11:07:00', '2021-10-02 11:
31:00', 9.9, 30, 5),
(9008, 108, 203, '2021-10-02 21:01:00', '2021-10-02 21:10:00', '2021-10-02 21:
31:00', 13.2, 38, 4),
(9009, 109, 203, '2021-10-08 18:01:00', '2021-10-08 18:11:50', '2021-10-08 18:
51:00', 13, 40, 5);
15.4 代码实现
select
city,
driver_id,
avg_grade,
avg_order_num,
avg_mileage
from(
select
city,
driver_id,
round(avg(grade),1) avg_grade,
round(count(car_order.order_id)/count(distinct date_format(order_time,'yyyy-MM-dd HH:mm:ss')),1) avg_order_num,
round(sum(mileage)/(count(distinct date_format(order_time,'yyyy-MM-dd HH:mm:ss'))),3) avg_mileage,
dense_rank() over(partition by city order by round(avg(grade),1) desc) rk
from p6_get_car_order car_order
inner join p5_get_car_record car_record
on car_order.order_id = car_record.order_id
group by city,driver_id
) t
where rk = 1
order by avg_order_num
输出:
北京|203|4.8|1.7|14.700
解释:
示例数据中,在北京市,共有2个司机接单,202的平均评分为4.3,203的平均评分为4.8,因此北京的最高评分的司机为203;203的共在3天里接单过,一共接单5次(包含1次接单后未完成),因此日均接单数为1.7;总行驶里程数为44.1,因此日均行驶里程数为14.700
SQL16 国庆期间近7日日均取消订单量-中等
16.1 题目需求
请统计国庆头3天里,每天的近7日日均订单完成量和日均订单取消量,按日期升序排序。结果保留2位小数。
注:有多个司机评分并列最高时,都输出。
平均评分和日均接单量保留1位小数,
日均行驶里程数保留3位小数,按日均接单数升序排序。
16.2 表结构
用户打车记录表p7_get_car_record
打车订单表p8_get_car_order
16.3 建表/插数语句
create table if not exists
p7_get_car_record
(
uid INT COMMENT '用户ID',
city string COMMENT '城市',
event_time timestamp COMMENT '打车时间',
end_time timestamp COMMENT '打车结束时间',
order_id INT COMMENT '订单号'
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
p8_get_car_order
(
order_id INT COMMENT '订单号',
uid INT COMMENT '用户ID',
driver_id INT COMMENT '司机ID',
order_time timestamp COMMENT '接单时间',
start_time timestamp COMMENT '开始计费的上车时间',
finish_time timestamp COMMENT '订单结束时间',
mileage DOUBLE COMMENT '行驶里程数',
fare DOUBLE COMMENT '费用',
grade TINYINT COMMENT '评分'
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO p7_get_car_record(uid, city, event_time, end_time, order_id) VALUES
(101, '北京', '2021-09-25 08:28:10', '2021-09-25 08:30:00', 9011),
(102, '北京', '2021-09-25 09:00:30', '2021-09-25 09:01:00', 9012),
(103, '北京', '2021-09-26 07:59:00', '2021-09-26 08:01:00', 9013),
(104, '北京', '2021-09-26 07:59:00', '2021-09-26 08:01:00', 9023),
(104, '北京', '2021-09-27 07:59:20', '2021-09-27 08:01:00', 9014),
(105, '北京', '2021-09-28 08:00:00', '2021-09-28 08:02:10', 9015),
(106, '北京', '2021-09-29 17:58:00', '2021-09-29 18:01:00', 9016),
(107, '北京', '2021-09-30 11:00:00', '2021-09-30 11:01:00', 9017),
(108, '北京', '2021-09-30 21:00:00', '2021-09-30 21:01:00', 9018),
(102, '北京', '2021-10-01 09:00:30', '2021-10-01 09:01:00', 9002),
(106, '北京', '2021-10-01 17:58:00', '2021-10-01 18:01:00', 9006),
(101, '北京', '2021-10-02 08:28:10', '2021-10-02 08:30:00', 9001),
(107, '北京', '2021-10-02 11:00:00', '2021-10-02 11:01:00', 9007),
(108, '北京', '2021-10-02 21:00:00', '2021-10-02 21:01:00', 9008),
(103, '北京', '2021-10-02 07:59:00', '2021-10-02 08:01:00', 9003),
(104, '北京', '2021-10-03 07:59:20', '2021-10-03 08:01:00', 9004),
(109, '北京', '2021-10-03 18:00:00', '2021-10-03 18:01:00', 9009);
INSERT INTO p8_get_car_order(order_id, uid, driver_id, order_time, start_time,
finish_time, mileage, fare, grade) VALUES
(9011, 101, 211, '2021-09-25 08:30:00', '2021-09-25 08:31:00', '2021-09-25 08:
54:00', 10, 35, 5),
(9012, 102, 211, '2021-09-25 09:01:00', '2021-09-25 09:01:50', '2021-09-25 09:
28:00', 11, 32, 5),
(9013, 103, 212, '2021-09-26 08:01:00', '2021-09-26 08:03:00', '2021-09-26 08:
27:00', 12, 31, 4),
(9023, 104, 213, '2021-09-26 08:01:00', null, '2021-09-26 08:27:00', null, null
, null),
(9014, 104, 212, '2021-09-27 08:01:00', '2021-09-27 08:04:00', '2021-09-27 08:
21:00', 11, 31, 5),
(9015, 105, 212, '2021-09-28 08:02:10', '2021-09-28 08:04:10', '2021-09-28 08:
25:10', 12, 31, 4),
(9016, 106, 213, '2021-09-29 18:01:00', '2021-09-29 18:02:10', '2021-09-29 18:
23:00', 11, 39, 4),
(9017, 107, 213, '2021-09-30 11:01:00', '2021-09-30 11:01:40', '2021-09-30 11:
31:00', 11, 38, 5),
(9018, 108, 214, '2021-09-30 21:01:00', '2021-09-30 21:02:50', '2021-09-30 21:
21:00', 14, 38, 5),
(9002, 102, 202, '2021-10-01 09:01:00', '2021-10-01 09:06:00', '2021-10-01 09:
31:00', 10.0, 41.5, 5),
(9006, 106, 203, '2021-10-01 18:01:00', '2021-10-01 18:09:00', '2021-10-01 18:
31:00', 8.0, 25.5, 4),
(9001, 101, 202, '2021-10-02 08:30:00', null, '2021-10-02 08:31:00', null, null
, null),
(9007, 107, 203, '2021-10-02 11:01:00', '2021-10-02 11:07:00', '2021-10-02 11:
31:00', 9.9, 30, 5),
(9008, 108, 204, '2021-10-02 21:01:00', '2021-10-02 21:10:00', '2021-10-02 21:
31:00', 13.2, 38, 4),
(9003, 103, 202, '2021-10-02 08:01:00', '2021-10-02 08:15:00', '2021-10-02 08:
31:00', 11.0, 41.5, 4),
(9004, 104, 202, '2021-10-03 08:01:00', '2021-10-03 08:13:00', '2021-10-03 08:
31:00', 7.5, 22, 4),
(9009, 109, 204, '2021-10-03 18:01:00', null, '2021-10-03 18:51:00', null, null
, null);
16.4 代码实现
select
*
from(
select distinct
dt,
round(sum(finish_num) over(order by dt rows 6 preceding) / 7,2) finish_num_7d,
round(sum(cancel_num) over(order by dt rows 6 preceding) / 7,2) cancel_num_7d
from(
select
date_format(order_time,'yyyy-MM-dd HH:mm:ss') dt,
count(order_time) total_num,
sum(if(start_time is null,1,0)) cancel_num,
sum(if(start_time is not null,1,0)) finish_num
from p7_get_car_record
group by dt
order by dt
) t1
)t2
where dt between '2021-10-01' and '2021-10-03'
输出:
2021-10-01|1.43|0.14
2021-10-02|1.57|0.29
2021-10-03|1.57|0.29
解释:
2021年9月25到10月3日每天的订单完成量为:2、1、1、1、1、2、2、3、1;每天的订单取消量为:0、1、0、0、0、0、0、1、1;因此10.1到10.3期间的近7日订单完成量分别为10、11、11,因此日均订单完成量为:1.43、1.57、1.57;
近7日订单取消量分别为1、2、2,因此日均订单取消量为0.14、0.29、0.29;
SQL17 工作日各时段叫车量、等待接单时间和调度时间-较难
17.1 题目需求
统计周一到周五各时段的叫车量、平均等待接单时间和平均调度时间。全部以event_time-开始打车时间为时段划分依据,平均等待接单时间和平均调度时间均保留1位小数,平均调度时间仅计算完成了的订单,结果按叫车量升序排序。
注:不同时段定义:早高峰 [07:00:00 , 09:00:00)、工作时间 [09:00:00 , 17:00:00)、晚高峰 [17:00:00 ,20:00:00)、休息时间 [20:00:00 , 07:00:00) 时间区间左闭右开(即7:00:00算作早高峰,而9:00:00不算做早高峰)
从开始打车到司机接单为等待接单时间,从司机接单到上车为调度时间。
17.2 表结构
用户打车记录表p9_get_car_record
打车订单表p10_get_car_order
17.3 建表/插数语句
create table if not exists
p9_get_car_record
(
uid INT COMMENT '用户ID',
city string COMMENT '城市',
event_time timestamp COMMENT '打车时间',
end_time timestamp COMMENT '打车结束时间',
order_id INT COMMENT '订单号'
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
p10_get_car_order
(
order_id INT COMMENT '订单号',
uid INT COMMENT '用户ID',
driver_id INT COMMENT '司机ID',
order_time timestamp COMMENT '接单时间',
start_time timestamp COMMENT '开始计费的上车时间',
finish_time timestamp COMMENT '订单结束时间',
mileage DOUBLE COMMENT '行驶里程数',
fare DOUBLE COMMENT '费用',
grade TINYINT COMMENT '评分'
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO p9_get_car_record(uid, city, event_time, end_time, order_id) VALUES
(107, '北京', '2021-09-20 11:00:00', '2021-09-20 11:00:30', 9017),
(108, '北京', '2021-09-20 21:00:00', '2021-09-20 21:00:40', 9008),
(108, '北京', '2021-09-20 18:59:30', '2021-09-20 19:01:00', 9018),
(102, '北京', '2021-09-21 08:59:00', '2021-09-21 09:01:00', 9002),
(106, '北京', '2021-09-21 17:58:00', '2021-09-21 18:01:00', 9006),
(103, '北京', '2021-09-22 07:58:00', '2021-09-22 08:01:00', 9003),
(104, '北京', '2021-09-23 07:59:00', '2021-09-23 08:01:00', 9004),
(103, '北京', '2021-09-24 19:59:20', '2021-09-24 20:01:00', 9019),
(101, '北京', '2021-09-24 08:28:10', '2021-09-24 08:30:00', 9011);
INSERT INTO p10_get_car_order(order_id, uid, driver_id, order_time, start_time,
finish_time, mileage, fare, grade) VALUES
(9017, 107, 213, '2021-09-20 11:00:30', '2021-09-20 11:02:10', '2021-09-20 11:
31:00', 11, 38, 5),
(9008, 108, 204, '2021-09-20 21:00:40', '2021-09-20 21:03:00', '2021-09-20 21:
31:00', 13.2, 38, 4),
(9018, 108, 214, '2021-09-20 19:01:00', '2021-09-20 19:04:50', '2021-09-20 19:
21:00', 14, 38, 5),
(9002, 102, 202, '2021-09-21 09:01:00', '2021-09-21 09:06:00', '2021-09-21 09:
31:00', 10.0, 41.5, 5),
(9006, 106, 203, '2021-09-21 18:01:00', '2021-09-21 18:09:00', '2021-09-21 18:
31:00', 8.0, 25.5, 4),
(9007, 107, 203, '2021-09-22 11:01:00', '2021-09-22 11:07:00', '2021-09-22 11:
31:00', 9.9, 30, 5),
(9003, 103, 202, '2021-09-22 08:01:00', '2021-09-22 08:15:00', '2021-09-22 08:
31:00', 11.0, 41.5, 4),
(9004, 104, 202, '2021-09-23 08:01:00', '2021-09-23 08:13:00', '2021-09-23 08:
31:00', 7.5, 22, 4),
(9005, 105, 202, '2021-09-23 10:01:00', '2021-09-23 10:13:00', '2021-09-23 10:
31:00', 9, 29, 5),
(9019, 103, 202, '2021-09-24 20:01:00', '2021-09-24 20:11:00', '2021-09-24 20:
51:00', 10, 39, 4),
(9011, 101, 211, '2021-09-24 08:30:00', '2021-09-24 08:31:00', '2021-09-24 08:
54:00', 10, 35, 5);
17.4 代码实现
select
period,count(*) get_car_num,
round(avg(wait_time/60),1),
round(avg(dispatch_time/60),1)
from(
select
case
when hour(event_time) >= 7 and hour(event_time) < 9 then '早高峰'
when hour(event_time)>=9 and hour(event_time) <17 then '工作时间'
when hour(event_time) >=17 and hour(event_time) <20 then '晚高峰'
else '休息时间'
end as period,unix_timestamp(order_time,'yyyy-MM-dd HH:mm:ss') - unix_timestamp(event_time,'yyyy-MM-dd HH:mm:ss') wait_time,
unix_timestamp(start_time,'yyyy-MM-dd HH:mm:ss') - unix_timestamp(order_time,'yyyy-MM-dd HH:mm:ss') dispatch_time
from (
select
event_time,order_time,start_time
from p9_get_car_record t1 join p10_get_car_order t2
on t1.order_id = t2.order_id
where pmod(datediff(event_time,'1920-01-01')-3,7) <> 6
and pmod(datediff(event_time,'1920-01-01')-3,7)<> 0)t1)t2
group by period
order by get_car_num;
输出:
工作时间|1|0.5|1.7
休息时间|1|0.7|2.3
晚高峰|3|2.1|7.3
早高峰|4|2.2|8.0
解释:
订单9017打车开始于11点整,属于工作时间,等待时间30秒,调度时间为1分40秒,示例数据中工作时间打车订单就一个,平均等待时间0.5分钟,平均调度时间1.7分钟。
SQL18 各城市最大同时等车人数-较难
18.1 题目需求
请统计各个城市在2021年10月期间,单日中最大的同时等车人数。
注:等车指从开始打车起,直到取消打车、取消等待或上车前的这段时间里用户的状态。
如果同一时刻有人停止等车,有人开始等车,等车人数记作先增加后减少。
结果按各城市最大等车人数升序排序,相同时按城市升序排序。
18.2 表结构
用户打车记录表p11_get_car_record
打车订单表p12_get_car_order
18.3 建表/插数语句
create table if not exists
p11_get_car_record
(
uid INT COMMENT '用户ID',
city string COMMENT '城市',
event_time timestamp COMMENT '打车时间',
end_time timestamp COMMENT '打车结束时间',
order_id INT COMMENT '订单号'
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
p12_get_car_order
(
order_id INT COMMENT '订单号',
uid INT COMMENT '用户ID',
driver_id INT COMMENT '司机ID',
order_time timestamp COMMENT '接单时间',
start_time timestamp COMMENT '开始计费的上车时间',
finish_time timestamp COMMENT '订单结束时间',
mileage DOUBLE COMMENT '行驶里程数',
fare DOUBLE COMMENT '费用',
grade TINYINT COMMENT '评分'
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO p11_get_car_record(uid, city, event_time, end_time, order_id) VALUES
(108, '北京', '2021-10-20 08:00:00', '2021-10-20 08:00:40', 9008),
(108, '北京', '2021-10-20 08:00:10', '2021-10-20 08:00:45', 9018),
(102, '北京', '2021-10-20 08:00:30', '2021-10-20 08:00:50', 9002),
(106, '北京', '2021-10-20 08:05:41', '2021-10-20 08:06:00', 9006),
(103, '北京', '2021-10-20 08:05:50', '2021-10-20 08:07:10', 9003),
(104, '北京', '2021-10-20 08:01:01', '2021-10-20 08:01:20', 9004),
(103, '北京', '2021-10-20 08:01:15', '2021-10-20 08:01:30', 9019),
(101, '北京', '2021-10-20 08:28:10', '2021-10-20 08:30:00', 9011);
INSERT INTO p12_get_car_order(order_id, uid, driver_id, order_time, start_time,
finish_time, mileage, fare, grade) VALUES
(9008, 108, 204, '2021-10-20 08:00:40', '2021-10-20 08:03:00', '2021-10-20 08:
31:00', 13.2, 38, 4),
(9018, 108, 214, '2021-10-20 08:00:45', '2021-10-20 08:04:50', '2021-10-20 08:
21:00', 14, 38, 5),
(9002, 102, 202, '2021-10-20 08:00:50', '2021-10-20 08:06:00', '2021-10-20 08:
31:00', 10.0, 41.5, 5),
(9006, 106, 203, '2021-10-20 08:06:00', '2021-10-20 08:09:00', '2021-10-20 08:
31:00', 8.0, 25.5, 4),
(9003, 103, 202, '2021-10-20 08:07:10', '2021-10-20 08:15:00', '2021-10-20 08:
31:00', 11.0, 41.5, 4),
(9004, 104, 202, '2021-10-20 08:01:20', '2021-10-20 08:13:00', '2021-10-20 08:
31:00', 7.5, 22, 4),
(9019, 103, 202, '2021-10-20 08:01:30', '2021-10-20 08:11:00', '2021-10-20 08:
51:00', 10, 39, 4),
(9011, 101, 211, '2021-10-20 08:30:00', '2021-10-20 08:31:00', '2021-10-20 08:
54:00', 10, 35, 5);
18.4 代码实现
select
city,max(p_num) max_wait_uv
from (
select
city,sum(diff) over(partition by city order by tt, diff desc ) p_num
from (
select
city,event_time tt,1 diff
from tb_get_car_record t1 join tb_get_car_order t2
on t1.order_id = t2.order_id
union all
select
city,if(start_time is not null,start_time,finish_time) tt,-1 diff
from tb_get_car_record t1 join tb_get_car_order t2
on t1.order_id = t2.order_id)t5
where date_format(tt,'yyyy-MM') = '2021-10'
)t6
group by city
order by max_wait_uv ,city ;
输出:
北京|5
解释:
由打车订单表可以得知北京2021年10月20日有8条打车记录,108号乘客从08:00:00等到08:03:00,118号乘客从08:00:10等到08:04:50....,由此得知08:02:00秒时刻,共有5人在等车。
SQL19 某宝店铺的SPU数量 简单
19.1 题目需求
11月结束后,小牛同学需要对其在某宝的网店就11月份用户交易情况和产品情况进行分析以更好的经营小店。请你统计每款的SPU(货号)数量,并按SPU数量降序排序。
19.2 表结构
产品情况表q1_product_tb
19.3 建表/插数语句
create table if not exists
q1_product_tb
(
item_id string ,
style_id string ,
tag_price int,
inventory int
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO q1_product_tb VALUES('A001', 'A', 100, 20);
INSERT INTO q1_product_tb VALUES('A002', 'A', 120, 30);
INSERT INTO q1_product_tb VALUES('A003', 'A', 200, 15);
INSERT INTO q1_product_tb VALUES('B001', 'B', 130, 18);
INSERT INTO q1_product_tb VALUES('B002', 'B', 150, 22);
INSERT INTO q1_product_tb VALUES('B003', 'B', 125, 10);
INSERT INTO q1_product_tb VALUES('B004', 'B', 155, 12);
INSERT INTO q1_product_tb VALUES('C001', 'C', 260, 25);
INSERT INTO q1_product_tb VALUES('C002', 'C', 280, 18);
19.4 代码实现
SELECT
style_id,
COUNT(*) SPU_num
FROM q1_product_tb
GROUP BY style_id
ORDER BY SPU_num DESC
输出:
B|4
A|3
C|2
SQL20 某宝店铺的实际销售额与客单价-简单
20.1 题目需求
11月结束后,小牛同学需要对其在某宝的网店就11月份用户交易情况和产品情况进行分析以更好的经营小店。请你统计实际总销售额与客单价(人均付费,总收入/总用户数,结果保留两位小数)
20.2 表结构
销售数据表q2_sales_tb
20.3 建表/插数语句
create table if not exists
q2_product_tb
(
sales_date date ,
user_id int ,
item_id string ,
sales_num int ,
sales_price int
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO q2_sales_tb VALUES('2021-11-1', 1, 'A001', 1, 90);
INSERT INTO q2_sales_tb VALUES('2021-11-1', 2, 'A002', 2, 220);
INSERT INTO q2_sales_tb VALUES('2021-11-1', 2, 'B001', 1, 120);
INSERT INTO q2_sales_tb VALUES('2021-11-2', 3, 'C001', 2, 500);
INSERT INTO q2_sales_tb VALUES('2021-11-2', 4, 'B001', 1, 120);
INSERT INTO q2_sales_tb VALUES('2021-11-3', 5, 'C001', 1, 240);
INSERT INTO q2_sales_tb VALUES('2021-11-3', 6, 'C002', 1, 270);
INSERT INTO q2_sales_tb VALUES('2021-11-4', 7, 'A003', 1, 180);
INSERT INTO q2_sales_tb VALUES('2021-11-4', 8, 'B002', 1, 140);
INSERT INTO q2_sales_tb VALUES('2021-11-4', 9, 'B001', 1, 125);
INSERT INTO q2_sales_tb VALUES('2021-11-5', 10, 'B003', 1, 120);
INSERT INTO q2_sales_tb VALUES('2021-11-5', 10, 'B004', 1, 150);
INSERT INTO q2_sales_tb VALUES('2021-11-5', 10, 'A003', 1, 180);
INSERT INTO q2_sales_tb VALUES('2021-11-6', 11, 'B003', 1, 120);
INSERT INTO q2_sales_tb VALUES('2021-11-6', 10, 'B004', 1, 150);
20.4 代码实现
SELECT
SUM(sales_price),
round(SUM(sales_price) / COUNT(DISTINCT user_id) ,2) per_trans
FROM q2_sales_tb
输出:
2725|247.73
SQL21 某宝店铺折扣率 中等
21.1 题目需求
11月结束后,小牛同学需要对其在某宝的网店就11月份用户交易情况和产品情况进行分析以更好的经营小店。请你统计折扣率(GMV/吊牌金额,GMV指的是成交金额)
21.2 表结构
产品情况表q3_product_tb
销售数据表q4_sales_tb
21.3 建表/插数语句
create table if not exists
q3_product_tb
(
item_id string ,
style_id string ,
tag_price int,
inventory int
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
q4_sales_tb
(
sales_date date ,
user_id int ,
item_id string ,
sales_num int ,
sales_price int
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO q3_product_tb VALUES('A001', 'A', 100, 20);
INSERT INTO q3_product_tb VALUES('A002', 'A', 120, 30);
INSERT INTO q3_product_tb VALUES('A003', 'A', 200, 15);
INSERT INTO q3_product_tb VALUES('B001', 'B', 130, 18);
INSERT INTO q3_product_tb VALUES('B002', 'B', 150, 22);
INSERT INTO q3_product_tb VALUES('B003', 'B', 125, 10);
INSERT INTO q3_product_tb VALUES('B004', 'B', 155, 12);
INSERT INTO q3_product_tb VALUES('C001', 'C', 260, 25);
INSERT INTO q3_product_tb VALUES('C002', 'C', 280, 18);
INSERT INTO q4_sales_tb VALUES('2021-11-1', 1, 'A001', 1, 90);
INSERT INTO q4_sales_tb VALUES('2021-11-1', 2, 'A002', 2, 220);
INSERT INTO q4_sales_tb VALUES('2021-11-1', 2, 'B001', 1, 120);
INSERT INTO q4_sales_tb VALUES('2021-11-2', 3, 'C001', 2, 500);
INSERT INTO q4_sales_tb VALUES('2021-11-2', 4, 'B001', 1, 120);
INSERT INTO q4_sales_tb VALUES('2021-11-3', 5, 'C001', 1, 240);
INSERT INTO q4_sales_tb VALUES('2021-11-3', 6, 'C002', 1, 270);
INSERT INTO q4_sales_tb VALUES('2021-11-4', 7, 'A003', 1, 180);
INSERT INTO q4_sales_tb VALUES('2021-11-4', 8, 'B002', 1, 140);
INSERT INTO q4_sales_tb VALUES('2021-11-4', 9, 'B001', 1, 125);
INSERT INTO q4_sales_tb VALUES('2021-11-5', 10, 'B003', 1, 120);
INSERT INTO q4_sales_tb VALUES('2021-11-5', 10, 'B004', 1, 150);
INSERT INTO q4_sales_tb VALUES('2021-11-5', 10, 'A003', 1, 180);
INSERT INTO q4_sales_tb VALUES('2021-11-6', 11, 'B003', 1, 120);
INSERT INTO q4_sales_tb VALUES('2021-11-6', 10, 'B004', 1, 150);
21.4 代码实现
select round(sum(sales_price) * 100 / sum(tag_price * sales_num), 2) discount_rate
from product_tb t1
join sales_tb t2 using (item_id);
输出:
93.97
SQL22 某宝店铺动销率与售罄率 较难
22.1 题目需求
11月结束后,小牛同学需要对其在某宝的网店就11月份用户交易情况和产品情况进行分析以更好的经营小店。
请你统计每款的动销率(pin_rate,有销售的SKU数量/在售SKU数量)与售罄率(sell-through_rate,GMV/备货值,备货值=吊牌价*库存数),按style_id升序排序
22.2 表结构
产品情况表q5_product_tb
销售数据表q6_sales_tb
22.3 建表/插数语句
create table if not exists q5_product_tb
(
item_id string ,
style_id string ,
tag_price int,
inventory int
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
q6_sales_tb
(
sales_date date ,
user_id int ,
item_id string ,
sales_num int ,
sales_price int
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO q5_product_tb VALUES('A001', 'A', 100, 20);
INSERT INTO q5_product_tb VALUES('A002', 'A', 120, 30);
INSERT INTO q5_product_tb VALUES('A003', 'A', 200, 15);
INSERT INTO q5_product_tb VALUES('B001', 'B', 130, 18);
INSERT INTO q5_product_tb VALUES('B002', 'B', 150, 22);
INSERT INTO q5_product_tb VALUES('B003', 'B', 125, 10);
INSERT INTO q5_product_tb VALUES('B004', 'B', 155, 12);
INSERT INTO q5_product_tb VALUES('C001', 'C', 260, 25);
INSERT INTO q5_product_tb VALUES('C002', 'C', 280, 18);
INSERT INTO q6_sales_tb VALUES('2021-11-1', 1, 'A001', 1, 90);
INSERT INTO q6_sales_tb VALUES('2021-11-1', 2, 'A002', 2, 220);
INSERT INTO q6_sales_tb VALUES('2021-11-1', 2, 'B001', 1, 120);
INSERT INTO q6_sales_tb VALUES('2021-11-2', 3, 'C001', 2, 500);
INSERT INTO q6_sales_tb VALUES('2021-11-2', 4, 'B001', 1, 120);
INSERT INTO q6_sales_tb VALUES('2021-11-3', 5, 'C001', 1, 240);
INSERT INTO q6_sales_tb VALUES('2021-11-3', 6, 'C002', 1, 270);
INSERT INTO q6_sales_tb VALUES('2021-11-4', 7, 'A003', 1, 180);
INSERT INTO q6_sales_tb VALUES('2021-11-4', 8, 'B002', 1, 140);
INSERT INTO q6_sales_tb VALUES('2021-11-4', 9, 'B001', 1, 125);
INSERT INTO q6_sales_tb VALUES('2021-11-5', 10, 'B003', 1, 120);
INSERT INTO q6_sales_tb VALUES('2021-11-5', 10, 'B004', 1, 150);
INSERT INTO q6_sales_tb VALUES('2021-11-5', 10, 'A003', 1, 180);
INSERT INTO q6_sales_tb VALUES('2021-11-6', 11, 'B003', 1, 120);
INSERT INTO q6_sales_tb VALUES('2021-11-6', 10, 'B004', 1, 150);
22.4 代码实现
with
a as (
select
style_id,
sum(inventory) sum_inventory,
sum(inventory * tag_price) sum_inv_amount
from product_tb
group by style_id
),
b as (
select
style_id,
sum(sales_num) sum_sales_num,
sum(sales_price) gmv
from sales_tb
join product_tb pt
on sales_tb.item_id = pt.item_id
group by style_id
)
select
a.style_id,
round(sum_sales_num / (sum_inventory - sum_sales_num) * 100, 2) pin_rate,
round(gmv / sum_inv_amount * 100, 2) pin_rate
from a
join b
on a.style_id = b.style_id
order by style_id;
输出:
A|8.33|7.79
B|14.81|11.94
C|10.26|8.75
SQL23 某宝店铺连续2天及以上购物的用户及其对应的天数-较难
23.1 题目需求
11月结束后,小牛同学需要对其在某宝的网店就11月份用户交易情况和产品情况进行分析以更好的经营小店。
请你统计连续2天及以上在该店铺购物的用户及其对应的次数(若有多个用户,按user_id升序排序)
23.2 表结构
销售数据表q7_sales_tb
23.3 建表/插数语句
create table if not exists
q7_sales_tb
(
sales_date date ,
user_id int ,
item_id string ,
sales_num int ,
sales_price int
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO q7_sales_tb VALUES('2021-11-1', 1, 'A001', 1, 90);
INSERT INTO q7_sales_tb VALUES('2021-11-1', 2, 'A002', 2, 220);
INSERT INTO q7_sales_tb VALUES('2021-11-1', 2, 'B001', 1, 120);
INSERT INTO q7_sales_tb VALUES('2021-11-2', 3, 'C001', 2, 500);
INSERT INTO q7_sales_tb VALUES('2021-11-2', 4, 'B001', 1, 120);
INSERT INTO q7_sales_tb VALUES('2021-11-3', 5, 'C001', 1, 240);
INSERT INTO q7_sales_tb VALUES('2021-11-3', 6, 'C002', 1, 270);
INSERT INTO q7_sales_tb VALUES('2021-11-4', 7, 'A003', 1, 180);
INSERT INTO q7_sales_tb VALUES('2021-11-4', 8, 'B002', 1, 140);
INSERT INTO q7_sales_tb VALUES('2021-11-4', 9, 'B001', 1, 125);
INSERT INTO q7_sales_tb VALUES('2021-11-5', 10, 'B003', 1, 120);
INSERT INTO q7_sales_tb VALUES('2021-11-5', 10, 'B004', 1, 150);
INSERT INTO q7_sales_tb VALUES('2021-11-5', 10, 'A003', 1, 180);
INSERT INTO q7_sales_tb VALUES('2021-11-6', 11, 'B003', 1, 120);
INSERT INTO q7_sales_tb VALUES('2021-11-6', 10, 'B004', 1, 150);
23.4 代码实现
select
user_id,count(1) cn
from(
select
user_id,sales_date,rn,date_sub(sales_date,rn) flag
from(
select
distinct sales_date,user_id,
row_number() over (partition by user_id order by sales_date) rn
from q7_sales_tb
)t1
)t2
group by user_id,flag
having cn>=2;
输出:
10|2
SQL24 牛客直播转换率 简单
24.1 题目需求
牛客某页面推出了数据分析系列直播课程介绍。用户可以选择报名任意一场或多场直播课。
请你统计每个科目的转换率(sign_rate(%),转化率=报名人数/浏览人数,结果保留两位小数)。
注:按照course_id升序排序。
24.2 表结构
课程表r1_course_tb
用户行为表r2_behavior_tb
24.3 建表/插数语句
create table if not exists
r1_course_tb
(
course_id int,
course_name string,
course_datetime string
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
r2_behavior_tb
(
user_id int,
if_vw int,
if_fav int,
if_sign int,
course_id int
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO r1_course_tb VALUES(1, 'Python', '2021-12-1 19:00-21:00');
INSERT INTO r1_course_tb VALUES(2, 'SQL', '2021-12-2 19:00-21:00');
INSERT INTO r1_course_tb VALUES(3, 'R', '2021-12-3 19:00-21:00');
INSERT INTO r2_behavior_tb VALUES(100, 1, 1, 1, 1);
INSERT INTO r2_behavior_tb VALUES(100, 1, 1, 1, 2);
INSERT INTO r2_behavior_tb VALUES(100, 1, 1, 1, 3);
INSERT INTO r2_behavior_tb VALUES(101, 1, 1, 1, 1);
INSERT INTO r2_behavior_tb VALUES(101, 1, 1, 1, 2);
INSERT INTO r2_behavior_tb VALUES(101, 1, 0, 0, 3);
INSERT INTO r2_behavior_tb VALUES(102, 1, 1, 1, 1);
INSERT INTO r2_behavior_tb VALUES(102, 1, 1, 1, 2);
INSERT INTO r2_behavior_tb VALUES(102, 1, 1, 1, 3);
INSERT INTO r2_behavior_tb VALUES(103, 1, 1, 0, 1);
INSERT INTO r2_behavior_tb VALUES(103, 1, 0, 0, 2);
INSERT INTO r2_behavior_tb VALUES(103, 1, 0, 0, 3);
INSERT INTO r2_behavior_tb VALUES(104, 1, 1, 1, 1);
INSERT INTO r2_behavior_tb VALUES(104, 1, 1, 1, 2);
INSERT INTO r2_behavior_tb VALUES(104, 1, 1, 0, 3);
INSERT INTO r2_behavior_tb VALUES(105, 1, 0, 0, 1);
INSERT INTO r2_behavior_tb VALUES(106, 1, 0, 0, 1);
INSERT INTO r2_behavior_tb VALUES(107, 1, 0, 0, 1);
INSERT INTO r2_behavior_tb VALUES(107, 1, 1, 1, 2);
INSERT INTO r2_behavior_tb VALUES(108, 1, 1, 1, 3);
24.4 代码实现
select r1ct.course_id,course_name, concat( round(sum(if_sign)/sum(if_vw)*100,2),'%') sign_rate
from r2_behavior_tb left join r1_course_tb r1ct on r2_behavior_tb.course_id = r1ct.course_id
group by r1ct.course_id,course_name
order by r1ct.course_id asc ;
输出:
1|Python|50.00
2|SQL|83.33
3|R|50.00
SQL25 牛客直播开始时各直播间在线人数 中等
25.1 题目需求
牛客某页面推出了数据分析系列直播课程介绍。用户可以选择报名任意一场或多场直播课。
请你统计直播开始时(19:00),各科目的在线人数
注:按照course_id升序排序。
25.2 表结构
课程表r3_course_tb
上课情况表r4_attend_tb
25.3 建表/插数语句
create table if not exists
r3_course_tb
(
course_id int,
course_name string,
course_datetime string
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
r4_attend_tb
(
user_id int,
course_id int,
in_datetime timestamp,
out_datetime timestamp
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO r3_course_tb VALUES(1, 'Python', '2021-12-1 19:00-21:00');
INSERT INTO r3_course_tb VALUES(2, 'SQL', '2021-12-2 19:00-21:00');
INSERT INTO r3_course_tb VALUES(3, 'R', '2021-12-3 19:00-21:00');
INSERT INTO r4_attend_tb VALUES(100, 1, '2021-12-1 19:00:00', '2021-12-1 19:28:00') ;
INSERT INTO r4_attend_tb VALUES(100, 1, '2021-12-1 19:30:00', '2021-12-1 19:53:00') ;
INSERT INTO r4_attend_tb VALUES(101, 1, '2021-12-1 19:00:00', '2021-12-1 20:55:00') ;
INSERT INTO r4_attend_tb VALUES(102, 1, '2021-12-1 19:00:00', '2021-12-1 19:05:00') ;
INSERT INTO r4_attend_tb VALUES(104, 1, '2021-12-1 19:00:00', '2021-12-1 20:59:00') ;
INSERT INTO r4_attend_tb VALUES(101, 2, '2021-12-2 19:05:00', '2021-12-2 20:58:00') ;
INSERT INTO r4_attend_tb VALUES(102, 2, '2021-12-2 18:55:00', '2021-12-2 21:00:00') ;
INSERT INTO r4_attend_tb VALUES(104, 2, '2021-12-2 18:57:00', '2021-12-2 20:56:00') ;
INSERT INTO r4_attend_tb VALUES(107, 2, '2021-12-2 19:10:00', '2021-12-2 19:18:00') ;
INSERT INTO r4_attend_tb VALUES(100, 3, '2021-12-3 19:01:00', '2021-12-3 21:00:00') ;
INSERT INTO r4_attend_tb VALUES(102, 3, '2021-12-3 18:58:00', '2021-12-3 19:05:00') ;
INSERT INTO r4_attend_tb VALUES(108, 3, '2021-12-3 19:01:00', '2021-12-3 19:56:00') ;
25.4 代码实现
select course_id, course_name, sum(c) ct
from (select course_id,course_name,start111,if(substring(in_datetime,1,16)<=start111 and substring(out_datetime,1,16)>=start111,1,0) c
from (select r3_course_tb.course_id,r3_course_tb.course_name,in_datetime,out_datetime,concat(substring(course_datetime,1,8),0,substring(course_datetime,9,7)) as start111
from r3_course_tb
join r4_attend_tb r4at on r3_course_tb.course_id = r4at.course_id) t1) t2
group by course_id,course_name;
输出:
1|Python|4
2|SQL|2
3|R|1
SQL26 牛客直播各科目平均观看时长-中等
26.1 题目需求
牛客某页面推出了数据分析系列直播课程介绍。用户可以选择报名任意一场或多场直播课。
请你统计每个科目的平均观看时长(观看时长定义为离开直播间的时间与进入直播间的时间之差,单位是分钟),输出结果按平均观看时长降序排序,结果保留两位小数。
26.2 表结构
课程表r5_course_tb
上课情况表r6_attend_tb
26.3 建表/插数语句
create table if not exists
r5_course_tb
(
course_id int,
course_name string,
course_datetime string
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
r6_attend_tb
(
user_id int,
course_id int,
in_datetime timestamp,
out_datetime timestamp
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO r5_course_tb VALUES(1, 'Python', '2021-12-1 19:00-21:00');
INSERT INTO r5_course_tb VALUES(2, 'SQL', '2021-12-2 19:00-21:00');
INSERT INTO r5_course_tb VALUES(3, 'R', '2021-12-3 19:00-21:00');
INSERT INTO r6_attend_tb VALUES(100, 1, '2021-12-1 19:00:00', '2021-12-1 19:28:00') ;
INSERT INTO r6_attend_tb VALUES(100, 1, '2021-12-1 19:30:00', '2021-12-1 19:53:00') ;
INSERT INTO r6_attend_tb VALUES(101, 1, '2021-12-1 19:00:00', '2021-12-1 20:55:00') ;
INSERT INTO r6_attend_tb VALUES(102, 1, '2021-12-1 19:00:00', '2021-12-1 19:05:00') ;
INSERT INTO r6_attend_tb VALUES(104, 1, '2021-12-1 19:00:00', '2021-12-1 20:59:00') ;
INSERT INTO r6_attend_tb VALUES(101, 2, '2021-12-2 19:05:00', '2021-12-2 20:58:00') ;
INSERT INTO r6_attend_tb VALUES(102, 2, '2021-12-2 18:55:00', '2021-12-2 21:00:00') ;
INSERT INTO r6_attend_tb VALUES(104, 2, '2021-12-2 18:57:00', '2021-12-2 20:56:00') ;
INSERT INTO r6_attend_tb VALUES(107, 2, '2021-12-2 19:10:00', '2021-12-2 19:18:00') ;
INSERT INTO r6_attend_tb VALUES(100, 3, '2021-12-3 19:01:00', '2021-12-3 21:00:00') ;
INSERT INTO r6_attend_tb VALUES(102, 3, '2021-12-3 18:58:00', '2021-12-3 19:05:00') ;
INSERT INTO r6_attend_tb VALUES(108, 3, '2021-12-3 19:01:00', '2021-12-3 19:56:00') ;
26.4 代码实现
select
course_id,
(to_unix_timestamp (out_datetime) - to_unix_timestamp(in_datetime)) during_time
from r6_attend_tb; --t1
----------------------------------------------------------------
select
t1.course_id,
-- round(avg(during_time)/60,2) avg_duringtime
-- cast(round(avg(during_time)/60,2) as decimal(16,4)),
cast(avg(during_time)/60 as decimal(16,2)) avg_duringtime
from (
select
course_id,
(to_unix_timestamp (out_datetime) - to_unix_timestamp(in_datetime)) during_time
from r6_attend_tb
) t1
group by t1.course_id; -- t2
-------------------------------------------------------------------
select
r.course_name,t2.avg_duringtime
from (
select
t1.course_id,
-- round(avg(during_time)/60,2) avg_duringtime
-- cast(round(avg(during_time)/60,2) as decimal(16,2)),
cast(avg(during_time)/60 as decimal(16,2)) avg_duringtime
from (
select
course_id,
(to_unix_timestamp (out_datetime) - to_unix_timestamp(in_datetime)) during_time
from r6_attend_tb
) t1
group by t1.course_id
) t2 join r1_course_tb r
on t2.course_id = r.course_id
order by avg_duringtime desc;
输出:
SQL|91.25
R|60.33
Python|58.00
SQL27 牛客直播各科目出勤率 较难
27.1 题目需求
牛客某页面推出了数据分析系列直播课程介绍。用户可以选择报名任意一场或多场直播课。
请你统计每个科目的出勤率(attend_rate(%),结果保留两位小数),出勤率=出勤(在线时长10分钟及以上)人数 / 报名人数,输出结果按course_id升序排序
注:按照course_id升序排序。
27.2 表结构
课程表r7_course_tb
用户行为表r8_behavior_tb
27.3 建表/插数语句
create table if not exists
r7_course_tb
(
course_id int,
course_name string,
course_datetime string
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
r8_behavior_tb
(
user_id int,
if_vw int,
if_fav int,
if_sign int,
course_id int
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
r9_attend_tb
(
user_id int,
course_id int,
in_datetime timestamp,
out_datetime timestamp
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO r7_course_tb VALUES(1, 'Python', '2021-12-1 19:00-21:00');
INSERT INTO r7_course_tb VALUES(2, 'SQL', '2021-12-2 19:00-21:00');
INSERT INTO r7_course_tb VALUES(3, 'R', '2021-12-3 19:00-21:00');
INSERT INTO r8_behavior_tb VALUES(100, 1, 1, 1, 1);
INSERT INTO r8_behavior_tb VALUES(100, 1, 1, 1, 2);
INSERT INTO r8_behavior_tb VALUES(100, 1, 1, 1, 3);
INSERT INTO r8_behavior_tb VALUES(101, 1, 1, 1, 1);
INSERT INTO r8_behavior_tb VALUES(101, 1, 1, 1, 2);
INSERT INTO r8_behavior_tb VALUES(101, 1, 0, 0, 3);
INSERT INTO r8_behavior_tb VALUES(102, 1, 1, 1, 1);
INSERT INTO r8_behavior_tb VALUES(102, 1, 1, 1, 2);
INSERT INTO r8_behavior_tb VALUES(102, 1, 1, 1, 3);
INSERT INTO r8_behavior_tb VALUES(103, 1, 1, 0, 1);
INSERT INTO r8_behavior_tb VALUES(103, 1, 0, 0, 2);
INSERT INTO r8_behavior_tb VALUES(103, 1, 0, 0, 3);
INSERT INTO r8_behavior_tb VALUES(104, 1, 1, 1, 1);
INSERT INTO r8_behavior_tb VALUES(104, 1, 1, 1, 2);
INSERT INTO r8_behavior_tb VALUES(104, 1, 1, 0, 3);
INSERT INTO r8_behavior_tb VALUES(105, 1, 0, 0, 1);
INSERT INTO r8_behavior_tb VALUES(106, 1, 0, 0, 1);
INSERT INTO r8_behavior_tb VALUES(107, 1, 0, 0, 1);
INSERT INTO r8_behavior_tb VALUES(107, 1, 1, 1, 2);
INSERT INTO r8_behavior_tb VALUES(108, 1, 1, 1, 3);
INSERT INTO r9_attend_tb VALUES(100, 1, '2021-12-1 19:00:00', '2021-12-1 19:28:00')
;
INSERT INTO r9_attend_tb VALUES(100, 1, '2021-12-1 19:30:00', '2021-12-1 19:53:00')
;
INSERT INTO r9_attend_tb VALUES(101, 1, '2021-12-1 19:00:00', '2021-12-1 20:55:00')
;
INSERT INTO r9_attend_tb VALUES(102, 1, '2021-12-1 19:00:00', '2021-12-1 19:05:00')
;
INSERT INTO r9_attend_tb VALUES(104, 1, '2021-12-1 19:00:00', '2021-12-1 20:59:00')
;
INSERT INTO r9_attend_tb VALUES(101, 2, '2021-12-2 19:05:00', '2021-12-2 20:58:00')
;
INSERT INTO r9_attend_tb VALUES(102, 2, '2021-12-2 18:55:00', '2021-12-2 21:00:00')
;
INSERT INTO r9_attend_tb VALUES(104, 2, '2021-12-2 18:57:00', '2021-12-2 20:56:00')
;
INSERT INTO r9_attend_tb VALUES(107, 2, '2021-12-2 19:10:00', '2021-12-2 19:18:00')
;
INSERT INTO r9_attend_tb VALUES(100, 3, '2021-12-3 19:01:00', '2021-12-3 21:00:00')
;
INSERT INTO r9_attend_tb VALUES(102, 3, '2021-12-3 18:58:00', '2021-12-3 19:05:00')
;
INSERT INTO r9_attend_tb VALUES(108, 3, '2021-12-3 19:01:00', '2021-12-3 19:56:00')
;
27.4 代码实现
-- 求出每个科目出勤人数
select
user_id,
course_id,
(to_unix_timestamp (out_datetime) - to_unix_timestamp(in_datetime)) / 60 during_time
from r6_attend_tb; --t1
select
t1.course_id,
t1.user_id,
sum(t1.during_time) during_time
from (
select
user_id,
course_id,
(to_unix_timestamp (out_datetime) - to_unix_timestamp(in_datetime)) / 60 during_time
from r6_attend_tb
) t1
group by t1.course_id,t1.user_id
order by course_id; -- t2
select
course_id,count(user_id) ct
from (
select
t1.course_id,
t1.user_id,
sum(t1.during_time) during_time
from (
select
user_id,
course_id,
(to_unix_timestamp (out_datetime) - to_unix_timestamp(in_datetime)) / 60 during_time
from r6_attend_tb
) t1
group by t1.course_id,t1.user_id
order by course_id
) t2
where t2.during_time >=10
group by course_id; --t3
-- 每个科目的报名人数
select
r.course_id,count(r.user_id) sign_num
from r2_behavior_tb r
where r.if_sign = 1
group by r.course_id; -- t4
-- 两表join求出出勤率,再与r1_course_tb表join,拿到course_name字段
select
t4.course_id,r.course_name,cast(t3.ct / t4.sign_num * 100 as decimal(16,2)) attend_rate
from (
select
course_id,count(user_id) ct
from (
select
t1.course_id,
t1.user_id,
sum(t1.during_time) during_time
from (
select
user_id,
course_id,
(to_unix_timestamp (out_datetime) - to_unix_timestamp(in_datetime)) / 60 during_time
from r6_attend_tb
) t1
group by t1.course_id,t1.user_id
--order by course_id
) t2
where t2.during_time >=10
group by course_id
) t3 join (
select
r.course_id,count(r.user_id) sign_num
from r2_behavior_tb r
where r.if_sign = 1
group by r.course_id
) t4
on t3.course_id = t4.course_id
join r1_course_tb r
on r.course_id = t4.course_id
order by t4.course_id; -- t5
-- 与r1_course_tb表join,拿到course_name字段
select
t5.course_id,r.course_name,t5.attend_rate
from (
select
t4.course_id,cast(round(t3.ct / t4.sign_num * 100,2) as decimal(16,2)) attend_rate
from (
select
course_id,count(user_id) ct
from (
select
t1.course_id,
t1.user_id,
sum(t1.during_time) during_time
from (
select
user_id,
course_id,
(to_unix_timestamp (out_datetime) - to_unix_timestamp(in_datetime)) / 60 during_time
from r6_attend_tb
) t1
group by t1.course_id,t1.user_id
order by course_id
) t2
where t2.during_time >=10
group by course_id
) t3 join (
select
r.course_id,count(r.user_id) sign_num
from r2_behavior_tb r
where r.if_sign = 1
group by r.course_id
) t4
on t3.course_id = t4.course_id
) t5 join r1_course_tb r
on t5.course_id = r.course_id
order by t5.course_id;
输出:
1|Python|75.00
2|SQL|60.00
3|R|66.67
SQL28 牛客直播各科目同时在线人数-较难
28.1 题目需求
牛客某页面推出了数据分析系列直播课程介绍。用户可以选择报名任意一场或多场直播课。
请你统计每个科目最大同时在线人数(按course_id排序)
注:按照course_id升序排序。
28.2 表结构
课程表r10_course_tb
上课情况表r11_attend_tb
28.3 建表/插数语句
create table if not exists
r10_course_tb
(
course_id int,
course_name string,
course_datetime string )
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
r11_attend_tb
(
user_id int,
course_id int,
in_datetime timestamp,
out_datetime timestamp
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO r10_course_tb VALUES(1, 'Python', '2021-12-1 19:00-21:00');
INSERT INTO r10_course_tb VALUES(2, 'SQL', '2021-12-2 19:00-21:00');
INSERT INTO r10_course_tb VALUES(3, 'R', '2021-12-3 19:00-21:00');
INSERT INTO r11_attend_tb VALUES(100, 1, '2021-12-1 19:00:00', '2021-12-1 19:28:00')
;
INSERT INTO r11_attend_tb VALUES(100, 1, '2021-12-1 19:30:00', '2021-12-1 19:53:00')
;
INSERT INTO r11_attend_tb VALUES(101, 1, '2021-12-1 19:00:00', '2021-12-1 20:55:00')
;
INSERT INTO r11_attend_tb VALUES(102, 1, '2021-12-1 19:00:00', '2021-12-1 19:05:00')
;
INSERT INTO r11_attend_tb VALUES(104, 1, '2021-12-1 19:00:00', '2021-12-1 20:59:00')
;
INSERT INTO r11_attend_tb VALUES(101, 2, '2021-12-2 19:05:00', '2021-12-2 20:58:00')
;
INSERT INTO r11_attend_tb VALUES(102, 2, '2021-12-2 18:55:00', '2021-12-2 21:00:00')
;
INSERT INTO r11_attend_tb VALUES(104, 2, '2021-12-2 18:57:00', '2021-12-2 20:56:00')
;
INSERT INTO r11_attend_tb VALUES(107, 2, '2021-12-2 19:10:00', '2021-12-2 19:18:00')
;
INSERT INTO r11_attend_tb VALUES(100, 3, '2021-12-3 19:01:00', '2021-12-3 21:00:00')
;
INSERT INTO r11_attend_tb VALUES(102, 3, '2021-12-3 18:58:00', '2021-12-3 19:05:00')
;
INSERT INTO r11_attend_tb VALUES(108, 3, '2021-12-3 19:01:00', '2021-12-3 19:56:00')
;
28.4 代码实现
select
t2.course_id course_id,
r.course_name course_name,
max(num) max_num
from (
select
course_id,
sum(flag) over(partition by course_id order by datetime ) num
from(
select
user_id,
course_id,
in_datetime datetime,
1 flag
from r11_attend_tb
union
select
user_id,
course_id,
out_datetime datetime,
-1 flag
from r11_attend_tb
)t1
)t2
join r10_course_tb r
on t2.course_id = r.course_id
group by t2.course_id,r.course_name
order by t2.course_id;
输出:
1|Python|4
2|SQL|4
3|R|3
SQL29 某乎问答11月份日人均回答量 简单
29.1 题目需求
请你统计11月份日人均回答量(回答问题数量/答题人数),按回答日期排序,结果保留两位小数
29.2 表结构
问答创作者回答情况表s1_answer_tb
29.3 建表/插数语句
create table if not exists
s1_answer_tb
(
answer_date date,
author_id int,
issue_id string,
char_len int;
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO s1_answer_tb VALUES('2021-11-1', 101, 'E001' ,150);
INSERT INTO s1_answer_tb VALUES('2021-11-1', 101, 'E002', 200);
INSERT INTO s1_answer_tb VALUES('2021-11-1',102, 'C003' ,50);
INSERT INTO s1_answer_tb VALUES('2021-11-1' ,103, 'P001', 35);
INSERT INTO s1_answer_tb VALUES('2021-11-1', 104, 'C003', 120);
INSERT INTO s1_answer_tb VALUES('2021-11-1' ,105, 'P001', 125);
INSERT INTO s1_answer_tb VALUES('2021-11-1' , 102, 'P002', 105);
INSERT INTO s1_answer_tb VALUES('2021-11-2', 101, 'P001' ,201);
INSERT INTO s1_answer_tb VALUES('2021-11-2', 110, 'C002', 200);
INSERT INTO s1_answer_tb VALUES('2021-11-2', 110, 'C001', 225);
INSERT INTO s1_answer_tb VALUES('2021-11-2' , 110, 'C002', 220);
INSERT INTO s1_answer_tb VALUES('2021-11-3', 101, 'C002', 180);
INSERT INTO s1_answer_tb VALUES('2021-11-4' ,109, 'E003', 130);
INSERT INTO s1_answer_tb VALUES('2021-11-4', 109, 'E001',123);
INSERT INTO s1_answer_tb VALUES('2021-11-5', 108, 'C001',160);
INSERT INTO s1_answer_tb VALUES('2021-11-5', 108, 'C002', 120);
INSERT INTO s1_answer_tb VALUES('2021-11-5', 110, 'P001', 180);
INSERT INTO s1_answer_tb VALUES('2021-11-5' , 106, 'P002' , 45);
INSERT INTO s1_answer_tb VALUES('2021-11-5' , 107, 'E003', 56);
29.4 代码实现
select
answer_date,
cast(count(*)/count(distinct author_id) as decimal(10,2)) per_num
from test.s1_answer_tb
group by answer_date
order by answer_date;
输出:
2021-11-01|1.40
2021-11-02|2.00
2021-11-03|1.00
2021-11-04|2.00
2021-11-05|1.25
SQL30 某乎问答高质量的回答中用户属于各级别的数量-中等
30.1 题目需求
回答字数大于等于100字的认为是高质量回答,请你统计某乎问答高质量的回答中用户属于1-2级、3-4级、5-6级的数量分别是多少,按数量降序排列
30.2 表结构
问答创作者信息表s2_author_tb
问答创作者回答情况表s3_answer_tb
30.3 建表/插数语句
create table if not exists
s2_author_tb
(
author_id int,
author_level int,
sex string
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
s3_answer_tb
(
answer_date date,
author_id int,
issue_id string,
char_len int;
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO s2_author_tb VALUES(101 , 6, 'm');
INSERT INTO s2_author_tb VALUES(102 , 1, 'f');
INSERT INTO s2_author_tb VALUES(103 , 1, 'm');
INSERT INTO s2_author_tb VALUES(104 , 3, 'm');
INSERT INTO s2_author_tb VALUES(105 , 4, 'f');
INSERT INTO s2_author_tb VALUES(106 , 2, 'f');
INSERT INTO s2_author_tb VALUES(107 , 2, 'm');
INSERT INTO s2_author_tb VALUES(108 , 5, 'f');
INSERT INTO s2_author_tb VALUES(109 , 6, 'f');
INSERT INTO s2_author_tb VALUES(110 , 5, 'm');
INSERT INTO s3_answer_tb VALUES('2021-11-1', 101, 'E001' ,150);
INSERT INTO s3_answer_tb VALUES('2021-11-1', 101, 'E002', 200);
INSERT INTO s3_answer_tb VALUES('2021-11-1',102, 'C003' ,50);
INSERT INTO s3_answer_tb VALUES('2021-11-1' ,103, 'P001', 35);
INSERT INTO s3_answer_tb VALUES('2021-11-1', 104, 'C003', 120);
INSERT INTO s3_answer_tb VALUES('2021-11-1' ,105, 'P001', 125);
INSERT INTO s3_answer_tb VALUES('2021-11-1' , 102, 'P002', 105);
INSERT INTO s3_answer_tb VALUES('2021-11-2', 101, 'P001' ,201);
INSERT INTO s3_answer_tb VALUES('2021-11-2', 110, 'C002', 200);
INSERT INTO s3_answer_tb VALUES('2021-11-2', 110, 'C001', 225);
INSERT INTO s3_answer_tb VALUES('2021-11-2' , 110, 'C002', 220);
INSERT INTO s3_answer_tb VALUES('2021-11-3', 101, 'C002', 180);
INSERT INTO s3_answer_tb VALUES('2021-11-4' ,109, 'E003', 130);
INSERT INTO s3_answer_tb VALUES('2021-11-4', 109, 'E001',123);
INSERT INTO s3_answer_tb VALUES('2021-11-5', 108, 'C001',160);
INSERT INTO s3_answer_tb VALUES('2021-11-5', 108, 'C002', 120);
INSERT INTO s3_answer_tb VALUES('2021-11-5', 110, 'P001', 180);
INSERT INTO s3_answer_tb VALUES('2021-11-5' , 106, 'P002' , 45);
INSERT INTO s3_answer_tb VALUES('2021-11-5' , 107, 'E003', 56);
30.4 代码实现
select
t2.level_cut,
count(*) num
from (
select
(case when t1.author_level<=2 then '1-2级'
when t1.author_level<=4 then '3-4级'
else '5-6级' end) level_cut,t1.author_id
from(
select
s3_answer_tb.author_id,s2_author_tb.author_level
from s3_answer_tb
join s2_author_tb
on s3_answer_tb.author_id=s2_author_tb.author_id
where s3_answer_tb.char_len>=100
)t1
)t2
group by t2.level_cut
order by num desc;
输出:
5-6级|12
3-4级|2
1-2级|1
SQL31 某乎问答单日回答问题数大于等于3个的所有用户-中等
31.1 题目需求
请你统计11月份单日回答问题数大于等于3个的所有用户信息(author_date表示回答日期、author_id表示创作者id,answer_cnt表示回答问题个数)
注:若有多条数据符合条件,按answer_date、author_id升序排序。
31.2 表结构
问答创作者回答情况表s4_answer_tb
31.3 建表/插数语句
create table if not exists
s4_answer_tb
(
answer_date date,
author_id int,
issue_id string,
char_len int;
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO s4_answer_tb VALUES('2021-11-1', 101, 'E001' ,150);
INSERT INTO s4_answer_tb VALUES('2021-11-1', 101, 'E002', 200);
INSERT INTO s4_answer_tb VALUES('2021-11-1',102, 'C003' ,50);
INSERT INTO s4_answer_tb VALUES('2021-11-1' ,103, 'P001', 35);
INSERT INTO s4_answer_tb VALUES('2021-11-1', 104, 'C003', 120);
INSERT INTO s4_answer_tb VALUES('2021-11-1' ,105, 'P001', 125);
INSERT INTO s4_answer_tb VALUES('2021-11-1' , 102, 'P002', 105);
INSERT INTO s4_answer_tb VALUES('2021-11-2', 101, 'P001' ,201);
INSERT INTO s4_answer_tb VALUES('2021-11-2', 110, 'C002', 200);
INSERT INTO s4_answer_tb VALUES('2021-11-2', 110, 'C001', 225);
INSERT INTO s4_answer_tb VALUES('2021-11-2' , 110, 'C002', 220);
INSERT INTO s4_answer_tb VALUES('2021-11-3', 101, 'C002', 180);
INSERT INTO s4_answer_tb VALUES('2021-11-4' ,109, 'E003', 130);
INSERT INTO s4_answer_tb VALUES('2021-11-4', 109, 'E001',123);
INSERT INTO s4_answer_tb VALUES('2021-11-5', 108, 'C001',160);
INSERT INTO s4_answer_tb VALUES('2021-11-5', 108, 'C002', 120);
INSERT INTO s4_answer_tb VALUES('2021-11-5', 110, 'P001', 180);
INSERT INTO s4_answer_tb VALUES('2021-11-5' , 106, 'P002' , 45);
INSERT INTO s4_answer_tb VALUES('2021-11-5' , 107, 'E003', 56);
31.4 代码实现
select
answer_date,
author_id,
count(issue_id) answer_cnt
from s4_answer_tb
group by answer_date,author_id
having answer_cnt>=3;
输出:
2021-11-02|110|3
SQL32 某乎问答回答过教育类问题的用户里有多少用户回答过职场类问题-中等
32.1 题目需求
请你统计回答过教育类问题的用户里有多少用户回答过职场类问题
若有多条数据符合条件,按answer_date、author_id升序排序。
32.2 表结构
问答题目信息表s5_issue_tb
问答创作者回答情况表s6_answer_tb
32.3 建表/插数语句
create table if not exists
s5_issue_tb
(
answer_date date,
author_id int,
issue_id string,
char_len int;
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
s6_answer_tb
(
answer_date date,
author_id int,
issue_id string,
char_len int;
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO s5_issue_tb VALUES('E001' ,'Education');
INSERT INTO s5_issue_tb VALUES('E002' ,'Education');
INSERT INTO s5_issue_tb VALUES('E003' ,'Education');
INSERT INTO s5_issue_tb VALUES('C001', 'Career');
INSERT INTO s5_issue_tb VALUES('C002', 'Career');
INSERT INTO s5_issue_tb VALUES('C003', 'Career');
INSERT INTO s5_issue_tb VALUES('C004', 'Career');
INSERT INTO s5_issue_tb VALUES('P001' ,'Psychology');
INSERT INTO s5_issue_tb VALUES('P002' ,'Psychology');
INSERT INTO s6_answer_tb VALUES('2021-11-1', 101, 'E001' ,150);
INSERT INTO s6_answer_tb VALUES('2021-11-1', 101, 'E002', 200);
INSERT INTO s6_answer_tb VALUES('2021-11-1',102, 'C003' ,50);
INSERT INTO s6_answer_tb VALUES('2021-11-1' ,103, 'P001', 35);
INSERT INTO s6_answer_tb VALUES('2021-11-1', 104, 'C003', 120);
INSERT INTO s6_answer_tb VALUES('2021-11-1' ,105, 'P001', 125);
INSERT INTO s6_answer_tb VALUES('2021-11-1' , 102, 'P002', 105);
INSERT INTO s6_answer_tb VALUES('2021-11-2', 101, 'P001' ,201);
INSERT INTO s6_answer_tb VALUES('2021-11-2', 110, 'C002', 200);
INSERT INTO s6_answer_tb VALUES('2021-11-2', 110, 'C001', 225);
INSERT INTO s6_answer_tb VALUES('2021-11-2' , 110, 'C002', 220);
INSERT INTO s6_answer_tb VALUES('2021-11-3', 101, 'C002', 180);
INSERT INTO s6_answer_tb VALUES('2021-11-4' ,109, 'E003', 130);
INSERT INTO s6_answer_tb VALUES('2021-11-4', 109, 'E001',123);
INSERT INTO s6_answer_tb VALUES('2021-11-5', 108, 'C001',160);
INSERT INTO s6_answer_tb VALUES('2021-11-5', 108, 'C002', 120);
INSERT INTO s6_answer_tb VALUES('2021-11-5', 110, 'P001', 180);
INSERT INTO s6_answer_tb VALUES('2021-11-5' , 106, 'P002' , 45);
INSERT INTO s6_answer_tb VALUES('2021-11-5' , 107, 'E003', 56);
32.4 代码实现
select
count(distinct (t2.author_id)) num
from (
select
t1.author_id,
issue_id
from (
select
author_id
from s5_issue_tb s5
join s6_answer_tb s6
on s5.issue_id = s6.issue_id
where issue_type = 'Education'
) t1
join s6_answer_tb
on t1.author_id = s6_answer_tb.author_id
) t2
join s5_issue_tb
on t2.issue_id = s5_issue_tb.issue_id
where issue_type = 'Career';
输出:
1
SQL33 某乎问答最大连续回答问题天数大于等于3天的用户及其对应等级-较难
33.1 题目需求
请你统计最大连续回答问题的天数大于等于3天的用户及其等级(若有多条符合条件的数据,按author_id升序排序)
33.2 表结构
问答创作者信息表s7_author_tb
问答创作者回答情况表s8_answer_tb
33.3 建表/插数语句
create table if not exists
s7_author_tb
(
author_id int,
author_level int,
sex string
)
row format delimited fields terminated by ','
stored as textfile;
create table if not exists
s8_answer_tb
(
answer_date date,
author_id int,
issue_id string,
char_len int;
)
row format delimited fields terminated by ','
stored as textfile;
INSERT INTO s7_author_tb VALUES(101 , 6, 'm');
INSERT INTO s7_author_tb VALUES(102 , 1, 'f');
INSERT INTO s7_author_tb VALUES(103 , 1, 'm');
INSERT INTO s7_author_tb VALUES(104 , 3, 'm');
INSERT INTO s7_author_tb VALUES(105 , 4, 'f');
INSERT INTO s7_author_tb VALUES(106 , 2, 'f');
INSERT INTO s7_author_tb VALUES(107 , 2, 'm');
INSERT INTO s7_author_tb VALUES(108 , 5, 'f');
INSERT INTO s7_author_tb VALUES(109 , 6, 'f');
INSERT INTO s7_author_tb VALUES(110 , 5, 'm');
INSERT INTO answer_tb VALUES('2021-11-1', 101, 'E001' ,150);
INSERT INTO answer_tb VALUES('2021-11-1', 101, 'E002', 200);
INSERT INTO answer_tb VALUES('2021-11-1',102, 'C003' ,50);
INSERT INTO answer_tb VALUES('2021-11-1' ,103, 'P001', 35);
INSERT INTO answer_tb VALUES('2021-11-1', 104, 'C003', 120);
INSERT INTO answer_tb VALUES('2021-11-1' ,105, 'P001', 125);
INSERT INTO answer_tb VALUES('2021-11-1' , 102, 'P002', 105);
INSERT INTO answer_tb VALUES('2021-11-2', 101, 'P001' ,201);
INSERT INTO answer_tb VALUES('2021-11-2', 110, 'C002', 200);
INSERT INTO answer_tb VALUES('2021-11-2', 110, 'C001', 225);
INSERT INTO answer_tb VALUES('2021-11-2' , 110, 'C002', 220);
INSERT INTO answer_tb VALUES('2021-11-3', 101, 'C002', 180);
INSERT INTO answer_tb VALUES('2021-11-4' ,109, 'E003', 130);
INSERT INTO answer_tb VALUES('2021-11-4', 109, 'E001',123);
INSERT INTO answer_tb VALUES('2021-11-5', 108, 'C001',160);
INSERT INTO answer_tb VALUES('2021-11-5', 108, 'C002', 120);
INSERT INTO answer_tb VALUES('2021-11-5', 110, 'P001', 180);
INSERT INTO answer_tb VALUES('2021-11-5' , 106, 'P002' , 45);
INSERT INTO answer_tb VALUES('2021-11-5' , 107, 'E003', 56);
33.4 代码实现
select
t3.author_id author_id,
s7_author_tb.author_level author_level,
t3.days_cnt days_cnt
from (
select
t2.author_id,count(*) days_cnt
from(
select
t1.author_id,date_sub(t1.answer_date,t1.rk) answer_date
from(
select
answer_date,author_id,dense_rank() over (partition by author_id order by answer_date) rk
from s8_answer_tb
group by answer_date, author_id
)t1
)t2
group by t2.author_id,answer_date
having days_cnt>=3
)t3
join s7_author_tb
on t3.author_id=s7_author_tb.author_id
order by author_id;
输出:
101|6|3
SQL34 找出所有科目成绩都大于某一学科平均成绩的学生
34.1 题目需求
找出所有科目成绩都大于某一学科平均成绩的学生
① 求出每个学科平均成绩
② 根据是否大于平均成绩记录标记flag,大于标记为0,否则标记为1
③ 根据学生id进行分组统计flag的和,和为0的则是所有学科都大于平均成绩
34.2 表结构
表名:score_info
字段:uid、subject_id,score
34.3 建表/插数语句
hive(default)>
create table score_info(
uid string, -- 学生id
subject_id string, -- 课程id
score int -- 课程分数
)
row format delimited fields terminated by '\t';
Insert into score_info values('1001','01',90);
Insert into score_info values('1001','02',90);
Insert into score_info values('1001','03',90);
Insert into score_info values('1002','01',85);
Insert into score_info values('1002','02',85);
Insert into score_info values('1002','03',70);
Insert into score_info values('1003','01',70);
1003 02 70
1003 03 85
34.4 代码实现
select
t2.uid
from(
select
t1.uid,
if(t1.score>t1.avg_score,0,1) flag
from(
select
uid,
score,
avg(score) over(partition by subject_id) avg_score
from score_info
)t1
)t2
group by uid
having sum(t2.flag)=0;
结果:
OK
t2.uid
1001
SQL35 某平台的用户访问数据
35.1 题目需求
要求使用SQL统计出每个用户的月累计访问次数及累计访问次数(注意日期数据的格式是否能解析)
① 修改数据格式,将日期中的"/"转换成"-"
② 计算每人单月访问量
③ 按月累计访问量
35.2 表结构
表名:action
字段:userId,visitDate,visitCount
35.3 建表/插数语句
hive(default)>
create table action(
userId string, -- 用户id
visitDate string, -- 访问日期
visitCount int -- 访问次数
)
row format delimited fields terminated by "\t";
Insert into action values('u01','2017/1/21',5),
('u02','2017/1/23',6),
('u03','2017/1/22',8),
('u04','2017/1/20',3),
('u01','2017/1/23',6),
('u01','2017/2/21',8),
('u02','2017/1/23',6),
('u01','2017/2/22',4);
35.4 代码实现
select
userId,
mn,
mn_count,
sum(mn_count) over(partition by userId order by mn)
from(
select
userId,
mn,
sum(visitCount) mn_count
from(
select
userId,
date_format(regexp_replace(visitDate,'/','-'),'yyyy-MM') mn,
visitCount
from action
)t1
group by userId,mn
)t2;
结果:
OK
t2.userid t2.mn t2.sum_month sum_acc
u01 2017-01 11 11
u01 2017-02 12 23
u02 2017-01 12 12
u03 2017-01 8 8
u04 2017-01 3 3
SQL36 电商店铺数据
36.1 题目需求
有50W个京东店铺,每个顾客访客访问任何一个店铺的任何一个商品时,都会产生一个访问日志,访问日志存储的表名。
求:每个店铺的UV(访客数)
每个店铺访问次数top3的访客信息。输出店铺名称、访客id、访问次数
36.2 表结构
表名:visit
字段名:user_id,shop
36.3 建表/插数语句
hive(default)>
create table visit(
user_id string,
shop string
)
row format delimited fields terminated by '\t';
Insert into visit values('u1','a'),
('u2','b'),
('u1','b'),
('u1','a'),
('u3','c'),
('u4','b'),
('u1','a'),
('u2','c'),
('u5','b'),
('u4','b'),
('u6','c'),
('u2','c'),
('u1','b'),
('u2','a'),
('u2','a'),
('u3','a'),
('u5','a'),
('u5','a'),
('u5','a');
36.4 代码实现
每个店铺UV(访客数)
hive(default)>
select
shop,
count(distinct user_id) uv
from visit
group by shop;
结果
OK
shop uv
a 4
b 4
c 3
每个店铺访问次数top3的访客信息。输出店铺名称、访客id、访问次数
hive(default)>
select
t2.shop,
t2.user_id,
t2.user_num
from (
select
t1.shop,
t1.user_id,
t1.user_num,
rank() over(partition by t1.shop order by t1.user_num desc ) rk
from (
select
shop,
user_id,
count(user_id) user_num
from visit
group by shop,user_id
) t1
) t2
where t2.rk <=3;
SQL37 某月销售指标
37.1 题目需求
给出 2017年每个月的订单数、用户数、总成交金额。
给出2017年11月的新客数(指在11月才有第一笔订单)
37.2 表结构
表名:order_tab
字段名:dt(下单日期)、order_id(订单编号)、user_id(用户ID)、amount(成交金额)
37.3 建表/插数语句
hive(default)>
create table order_tab(
dt string, -- 下单日期
order_id string, -- 订单id
user_id string, -- 用户id
amount decimal(10,2)-- 订单金额
)
row format delimited fields terminated by ',';
INSERT INTO ORDER_TAB VALUES('2017-01-01','10029028','1000003251',33.57),
('2017-01-02','10029029','1000003251',20.57),
('2017-02-01','10029030','1000003251',40.57),
('2017-02-02','10029031','1000003252',60.57),
('2017-02-03','10029032','1000003252',60.57),
('2017-11-03','10029033','1000003253',70.57),
('2017-11-03','10029034','1000003253',80.57);
37.4 代码实现
给出 2017年每个月的订单数、用户数、总成交金额。
hive(default)>
select
date_format(dt,'yyyy-MM') mn,
count(order_id) order_num,
count(distinct user_id) user_num,
sum(amount) sum_amount
from order_tab
where year(dt) = '2017'
group by date_format(dt,'yyyy-MM');
结果:
OK
mn order_num user_num sum_amount
2017-01 2 1 54.14
2017-02 3 2 161.71
2017-11 2 1 151.14
给出2017年11月的新客数(指在11月才有第一笔订单)
hive(default)>
select
user_id,
count(distinct user_id) user_num
from order_tab
group by user_id
having date_format(min(dt),'yyyy-MM')='2017-11';
结果:
OK
user_id user_num
1000003253 1
SQL38 统计所有用户和活跃用户的总数及平均年龄
38.1 题目需求
① 按照日期以及用户分组,按照日期排序并给出排名
② 计算日期及排名的差值
③ 过滤出差值大于等于2的,即为连续两天活跃的用户
④ 对数据进行去重处理(一个用户可以在两个不同的时间地点连续登录)
⑤ 计算活跃用户(两天连续有访问)的人数以及平均年龄
⑦ 对全量数据集进行按照用户去重
⑧ 计算所有用户的数量以及平均年龄
⑨ 将第⑤步和第⑦步的结果进行union all操作
活跃用户:指连续两天都有访问记录的用户
38.2 表结构
表名:user_age
字段名:dt(访问日期)、user_id(用户ID)、age(用户年龄)
38.3 建表/插数语句
hive(default)>
create table user_age(
dt string, -- 日期
user_id string, -- 用户
age int -- 年龄
)
row format delimited fields terminated by ',';
Insert into user_age values('2019-02-11','user_1',23),
('2019-02-11','user_2',19),
('2019-02-11','user_3',39),
('2019-02-11','user_1',23),
('2019-02-11','user_3',39),
('2019-02-11','user_1',23),
('2019-02-12','user_2',19),
('2019-02-13','user_1',23),
('2019-02-15','user_2',19),
('2019-02-16','user_2',19);
38.4 代码实现
hive(default)>
select
sum(user_total_count),
sum(user_total_avg_age),
sum(twice_count),
sum(twice_count_avg_age)
from (
select
0 user_total_count,
0 user_total_avg_age,
count(*) twice_count,
cast(sum(age)/count(*) as decimal(10,2)) twice_count_avg_age
from(
select
user_id,
min(age) age
from(
select
user_id,
min(age) age
from(
select
user_id,
age,
date_sub(dt,rk) flag
from(
select
dt,
user_id,
min(age) age,
rank() over(partition by user_id order by dt) rk
from user_age
group by dt,user_id
)t1
)t2
group by user_id,flag
having count(*)>=2
)t3
group by user_id
)t4
union all
select
count(*) user_total_count,
cast((sum(age)/count(*)) as decimal(10,1)),
0 twice_count,
0 twice_count_avg_age
from(
select
user_id,
min(age) age
from user_age
group by user_id
)t5
)t6;
另一种算法:
hive(default)>
select
sum(t3.user_num_all) user_num_all,
sum(t3.avg_age_all) avg_age_all,
sum(t3.user_num_active) user_num_active,
sum(t3.avg_age_active) avg_age_active
from (
select
count(t1.user_id) user_num_all,
cast((sum(t1.min_age) / count(t1.user_id)) as decimal(4,2)) avg_age_all,
0 user_num_active,
0 avg_age_active
from (
select
user_id,
min(age) min_age
from user_age
group by user_id
) t1
union
select
0 user_num_all,
0 avg_age_all,
count(t2.user_id) user_num_active,
cast((sum(t2.age)/count(t2.user_id)) as decimal(4,2)) avg_age_active
from (
select
distinct t1.user_id,
age
from (
select
user_id,
datediff(dt,lag(dt,1,'1900-01-01') over(partition by user_id order by dt )) date_diff,
age
from user_age
) t1
where t1.date_diff=1
) t2
) t3;
SQL39 用不同字段统计用户首次购买金额
39.1 题目需求
① 仅使用paymenttime字段,统计所有用户在今年(2022年)10月份第一次购买商品的金额
② 仅使用paymentdate字段,统计所有用户在今年(2022年)10月份第一次购买商品的金额
39.2 表结构
表名:ordertable
字段名:userid(用户ID)、money(金额)、paymenttime(支付时间戳)paymentdate(支付日期)、orderid(订单ID)
39.3 建表/插数语句
hive(default)>
create table ordertable(
userid string, -- 购买用户
money int, -- 金额
paymenttime string, -- 支付时间戳
paymentdate string,
orderid string -- 订单id
)
row format delimited fields terminated by '\t';
Insert into ordertable values ('1001',100,'1633609057000','2021-10-07','1111111111'),
('1002',110,'1636287457000','2021-11-07','2222222222'),
('1003',120,'1667823457000','2022-11-07','3333333333'),
('1004',130,'1665145057000','2022-10-07','4444444444'),
('1005',140,'1666613857000','2022-10-24','5555555555'),
('1005',150,'1665577057000','2022-10-12','6666666666'),
('1004',160,'1660306657000','2022-08-12','7777777777'),
('1003',170,'1665749857000','2022-10-14','8888888888'),
('1005',180,'1665922657000','2022-10-16','9999999999');
39.4 代码实现
使用paymenttime字段的答案
hive(default)>
select
t2.userid,
t2.money,
t2.orderid
from (
select
t1.userid,
t1.money,
t1.mnd,
t1.orderid,
rank() over(partition by userid order by t1.mnd) rk
from (
select
userid,
money,
from_unixtime(cast(paymenttime *0.001 as bigint) ,'yyyy-MM') mn,
from_unixtime(cast(paymenttime *0.001 as bigint) ,'yyyy-MM-dd') mnd,
orderid
from ordertable
) t1
where t1.mn='2022-10'
) t2
where t2.rk=1;
结果:
OK
t2.userid t2.money t2.orderid
1003 170 8888888888
1004 130 4444444444
1005 150 6666666666
使用paymentdate字段的答案
hive(default)>
select
t1.userid,
t1.paymentdate,
od.money
from
(select
userid,
min(paymentdate) paymentdate
from
ordertable
where
date_format(paymentdate,'yyyy-MM')='2022-10'
group by
userid)t1
join
ordertable od
on
t1.userid=od.userid
and
t1.paymentdate=od.paymentdate;
结果:
OK
t1.userid t1.paymentdate od.money
1003 2022-10-14 170
1004 2022-10-07 130
1005 2022-10-12 150
SQL40 用户ip地址
40.1 题目需求
求11月9号下午14点(14-15点),访问/api/user/login接口次数的top2的ip地址
① 过滤符合要求的数据
② 对数据进行分组统计,分组字段ip,统计记录个数
③ 按照每个ip的统计记录个数排序,取前2个。
40.2 表结构
有一个线上服务器访问日志,数据结构如下:
表名:ip_info
字段名:time1(访问时间)、interface(访问接口)、ip(访问的IP地址)
40.3 建表/插数语句
hive(default)>
create table ip_info(
time1 string, -- 访问时间
interface string, -- 访问接口
ip string -- 访问的ip地址
)
row format delimited fields terminated by '\t';
Insert into ip_info values('2016-11-09 14:22:05','/api/user/login','110.23.5.33'),
('2016-11-09 11:23:10','/api/user/detail','57.3.2.16'),
('2016-11-09 14:59:40','/api/user/login','200.6.5.166'),
('2016-11-09 14:22:05','/api/user/login','110.23.5.34'),
('2016-11-09 14:22:05','/api/user/login','110.23.5.34'),
('2016-11-09 14:22:05','/api/user/login','110.23.5.34'),
('2016-11-09 11:23:10','/api/user/detail','57.3.2.16'),
('2016-11-09 23:59:40','/api/user/login','200.6.5.166'),
('2016-11-09 14:22:05','/api/user/login','110.23.5.34'),
('2016-11-09 11:23:10','/api/user/detail','57.3.2.16'),
('2016-11-09 23:59:40','/api/user/login','200.6.5.166'),
('2016-11-09 14:22:05','/api/user/login','110.23.5.35'),
('2016-11-09 14:23:10','/api/user/detail','57.3.2.16'),
('2016-11-09 23:59:40','/api/user/login','200.6.5.166'),
('2016-11-09 14:59:40','/api/user/login','200.6.5.166'),
('2016-11-09 14:59:40','/api/user/login','200.6.5.166');
40.4 代码实现
hive(default)>
select
ip, interface, count(ip) ip_times
from ip_info
where date_format(time1,"yyyy-MM-dd HH") >= "2016-11-09 14"
and date_format(time1,"yyyy-MM-dd HH") <= "2016-11-09 15"
and interface ="/api/user/login"
group by ip, interface
order by ip_times desc
limit 2;
SQL41 账号查询
41.1 题目需求
查询各自区组的money排名前十的账号(分组取前10)
41.2 表结构
表名:account
字段名:dist_id(区组id)、account(账号)、gold(金币)
41.3 建表/插数语句
hive(default)>
CREATE TABLE `account`
(
`dist_id` int, -- '区组id'
`account` string, -- '账号'
`gold` string -- '金币'
)
row format delimited fields terminated by '\t';
Insert into account values(1003,'11133','23000'),
(1001,'11113','43000'),
(1003,'11134','41000'),
(1002,'11123','33000'),
(1003,'11135','31001'),
(1002,'11121','22000'),
(1002,'11122','31000'),
(1001,'11114','21000'),
(1001,'11115','11001'),
(1001,'11111','25500'),
(1001,'11112','34300'),
(1002,'11124','11700'),
(1002,'11125','23501'),
(1003,'11131','45800'),
(1003,'11132','14900'),
(1003,'11133','23400'),
(1001,'11113','43200'),
(1003,'11134','41700'),
(1002,'11123','33800'),
(1003,'11135','31901'),
(1002,'11121','22100'),
(1002,'11122','31200'),
(1001,'11114','21400'),
(1001,'11115','11101');
41.4 代码实现
hive(default)>
select
*
from account as a
where(
select
count(distinct(a1.gold))
from account as a1
where
a1.dist_id=a.dist_id
and a1.gold > a.gold
)<3;
SQL42 会员表
42.1 题目需求
(1)销售表中的销售记录可以是会员购买,也可以是非会员购买。(即销售表中的memberid可以为空);
(2)销售表中的一个会员可以有多条购买记录;
(3)退货表中的退货记录可以是会员,也可是非会员;
(4)一个会员可以有一条或多条退货记录。
分组查出销售表中所有会员购买金额,同时分组查出退货表中所有会员的退货金额,把会员id相同的购买金额-退款金额得到的结果更新到表会员表中对应会员的积分字段(credits保留两位小数)
42.2 表结构
① 会员表:member 字段名:memberid、credits(积分)
② 销售表:sale 字段名:memberid、MNAccount(购买金额)
③ 退货表:regoods 字段名:memberid、RMNAccount(退货金额)
42.3 建表/插数语句
// member会员表
hive(default)>
create table member(
memberid string,
credits decimal(10,2)
)
row format delimited fields terminated by '\t';
// sale销售表
hive(default)>
create table sale(
memberid string,
MNAccount double
)
row format delimited fields terminated by '\t;
// regoods退货表
hive(default)>
create table regoods(
memberid string,
RMNAccount double
)
row format delimited fields terminated by '\t';
Insert into sale values('1001',50.3),
('1002',56.5),
('1003',235),
('1001',23.6),
('1005',56.2),
(null,25.6),
(null,33.5);
Insert into regoods values('1001',20.1),
('1002',23.6),
('1001',10.1),
(null,23.5),
(null,10.2),
('1005',0.8);
42.4 代码实现
hive(default)>
insert into table member
select
t1.memberid,
MNAccount-RMNAccount
from
(select
memberid,
sum(MNAccount) MNAccount
from
sale
where
memberid!=''
group by
memberid
)t1
join
(select
memberid,
sum(RMNAccount) RMNAccount
from
regoods
where
memberid!=''
group by
memberid
)t2
on
t1.memberid=t2.memberid;
SQL43 统计出每门课都大于80分的学生姓名
43.1 题目需求
统计出每门课都大于80分的学生姓名
43.2 表结构
表名:stu_subject
字段名:name、kecheng、fenshu
43.3 建表/插数语句
hive(default)>
create table stu_subject(
name string,
kecheng string,
fenshu int
)
row format delimited fields terminated by ',';
Insert into stu_subject values('张三','语文',81),
('张三','数学',75),
('李四','语文',76),
('李四','数学',90),
('王五','语文',81),
('王五','数学',100),
('王五','英语',90);
43.4 代码实现
hive(default)>
select
distinct name
from stu_subject
where name not in (
select
distinct name
from stu_subject
where fenshu<=80
);
hive(default)>
select
name
from stu_subject
group by name
having min(fenshu)>80;
SQL44 删除除了id不同其他都相同的学生冗余信息
44.1 题目需求
删除除了id不同其他都相同的学生冗余信息
44.2 表结构
表名: student(学生表)
字段名:id、stu_id、name、subject_id、subject_name、score
44.3 建表/插数语句
hive(default)>
// 创建中件表
create table student(
id int,
stu_id string,
name string,
subject_id int,
subject_name string,
score string
)
row format delimited fields terminated by '\t';
// 创建支持事务的表
create table student1(
id int,
stu_id string,
name string,
subject_id int,
subject_name string,
score string
)
row format delimited fields terminated by '\t'
stored as orc
TBLPROPERTIES ("transactional"="true");
Insert into student1 values(1001,'1234','张三',11,'数学','80'),
(1002,'1234','张三',11,'数学','80'),
(1003,'1235','张21',12,'语文','90'),
(1004,'1235','张21',13,'英语','8'),
(1005,'1235','张21',11,'数学','70'),
(1006,'1234','张三',12,'语文','80'),
(1007,'1234','张三',11,'数学','90'),
(1008,'1234','张三',13,'英语','80'),
(1009,'1236','张22',11,'数学','80'),
(1010,'1234','张22',12,'语文','80');
44.4 代码实现
删除除了id不同其他都相同的学生冗余信息(切换事务管理器等配置)
hive(default)> set hive.support.concurrency=true;
hive(default)> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
hive(default)>
delete from student1
where id not in(
select
min(id)
from student
group by stu_id, name,subject_id,subject_name,score
);
SQL45 排列组合
45.1 题目需求
一个叫team的表,里面只有一个字段name,一共有4条纪录,分别是a,b,c,d,对应四个球队,现在四个球队进行比赛,用一条sql语句显示所有可能的比赛组合。
45.2 表结构
表名:team
字段名:name
45.3 建表/插数语句
Insert into team values('a'),('b'),('c'),('d');
45.4 代码实现
hive(default)>
select
a.name,
b.name
from team a, team b
where a.name < b.name;
SQL46 年月计算
46.1 题目需求
year month amount 1991 1 1.1 1991 2 1.2 1991 3 1.3 1991 4 1.4 1992 1 2.1 1992 2 2.2 1992 3 2.3 1992 4 2.4 | 转换成 | year m1 m2 m3 m4 1991 1.1 1.2 1.3 1.4 1992 2.1 2.2 2.3 2.4 |
46.2 表结构
表名:tableName
字段名:year、month、amount
46.3 建表/插数语句
hive(default)>
create table tableName(
year int,
month int,
amount double
)
row format delimited fields terminated by '\t';
Insert into tableName values(1991,1,1.1),
(1991,2,1.2),
(1991,3,1.3),
(1991,4,1.4),
(1992,1,2.1),
(1992,2,2.2),
(1992,3,2.3),
(1992,4,2.4);
46.4 代码实现
hive(default)>
select year,
(select amount from tableName m where month=1 and m.year= tableName.year) as m1,
(select amount from tableName m where month=2 and m.year= tableName.year) as m2,
(select amount from tableName m where month=3 and m.year= tableName.year) as m3,
(select amount from tableName m where month=4 and m.year= tableName.year) as m4
from tableName group by year;
结果(本地模式):
OK
1991 1.1 1.2 1.3 1.4
1992 2.1 2.2 2.3 2.4
Time taken: 15.687 seconds, Fetched: 2 row(s)
第二种方法
hive(default)>
select
year,
t1.amounts[0] m1,
t1.amounts[1] m2,
t1.amounts[2] m3,
t1.amounts[3] m4
from (
select
distinct year,
collect_list(amount) over(partition by year order by month rows between unbounded preceding and unbounded following) amounts
from tableName
) t1;
结果(本地模式):
OK
1991 1.1 1.2 1.3 1.4
1992 2.1 2.2 2.3 2.4
Time taken: 2.587 seconds, Fetched: 2 row(s)
SQL47 统计数据记录
47.1 题目需求
1 java 70 2 oracle 90 3 xml 40 4 jsp 30 5 servlet 80 | 1 java 70 pass 2 oracle 90 pass 3 xml 40 fail 4 jsp 30 fail 5 servlet 80 pass |
47.2 表结构
表名:course
字段名:courseid、coursename、score
47.3 建表/插数语句
hive(default)>
create table course(
courseid int,
coursename string,
score int
)
row format delimited fields terminated by '\t';
Insert into course values(1,'java',70),
(2,'oracle',90),
(3,'xml',40),
(4,'jsp',30),
(5,'servlet',80);
47.4 代码实现
统计数据中每条记录是否通过,分值大于60的pass,分值小于60的fail
hive(default)>
select
courseid,
coursename ,
score ,
if(score>=60, "pass","fail") as mark
from course
SQL48 统计商品指标
48.1 题目需求
给出所有购入商品为两种或两种以上的people_id记录
统计那些顾客购买了两种及以上的商品,列出顾客id,商品种类数量及商品名称
48.2 表结构
表名:shopMessage
字段名:people_id(顾客ID)、good_name(商品名称)、num(个数)
48.3 建表/插数语句
hive(default)>
create table shopMessage(
people_id int,
good_name string,
num int
)
row format delimited fields terminated by '\t';
Insert into shopMessage values(1001,'iPhone10',3),
(1002,'iPhone10',1),
(1001,'巴黎世家',10),
(1002,'巴黎世家',20),
(1003,'iPhone10',2),
(1001,'巴黎世家',30),
(1003,'vivo',1),
(1004,'vivo',1),
(1005,'巴黎世家',10),
(1004,'vivo',1),
(1006,'iPhone10',2),
(1001,'vivo',1),
(1003,'巴黎世家',2);
48.4 代码实现
给出所有购入商品为两种或两种以上的people_id记录
hive(default)>
select
*
from shopMessage
where people_id in (
select
people_id,
from shopMessage
group by people_id having count(*) >=2
);
统计那些顾客购买了两种及以上的商品,列出顾客id,商品种类数量及商品名称
hive(default)>
select
t1.people_id,
t1.good_type_num,
t1.good_type_names
from (
select
people_id,
size(collect_set(good_name) over(partition by people_id)) good_type_num,
collect_set(good_name) over(partition by people_id) good_type_names
from shopMessage
) t1
where t1.good_type_num >=2
group by t1.people_id, t1.good_type_num, t1.good_type_names;
结果(本地模式):
OK
1001 3 ["iPhone10","vivo","巴黎世家"]
1002 2 ["iPhone10","巴黎世家"]
1003 3 ["巴黎世家","iPhone10","vivo"]
Time taken: 2.735 seconds, Fetched: 3 row(s)
SQL49 统计每日的win个数和lose个数
49.1 题目需求
统计每日的win个数和lose个数
49.2 表结构
表名:info
字段名:date,result
49.3 建表/插数语句
hive(default)>
create table info(
`date` string,
result string
)
row format delimited fields terminated by ',';
Insert into info values ('2005-05-09','win'),
('2005-05-09','lose'),
('2005-05-09','lose'),
('2005-05-09','lose'),
('2005-05-10','win'),
('2005-05-10','lose'),
('2005-05-10','lose');
49.4 代码实现
hive(default)>
select
`date`,
sum(case when result = 'win' then 1 else 0 end) as win_num,
sum(case when result = 'lose' then 1 else 0 end) as lose_num
from info group by `date`;
结果(本地模式):
2005-05-09 1 3
2005-05-10 1 2
Time taken: 1.278 seconds, Fetched: 2 row(s)
SQL50 分区表插数、计算及数据分析
50.1 题目需求
① 在hive中创建分区表,并导入数据;
② 查询dt=’2018-09-01’里每个渠道的订单数,下单人数(去重),总金额;
③ 查询dt=’2018-09-01’里每个渠道的金额最大3笔订单;
④ 有一天发现订单数据重复,请分析原因;
50.2 表结构
表名:order,
字段名:order_id(订单ID),user_id(用户ID),amount(金额),pay_datetime(付费时间),channel_id(渠道ID),dt(分区字段)。
50.3 建表/插数语句
[atguigu@hadoop102 ~]$ vim /opt/module/hive/datas/order.txt
11111 1001 3400 2018-09-01 23:30:22 01
22222 1001 2500 2018-09-01 23:30:29 02
33333 1002 3200 2018-09-02 22:12:12 02
44444 1002 3400 2018-09-01 23:30:22 01
55555 1003 2500 2018-11-11 13:23:29 01
66666 1004 3200 2018-04-02 12:12:12 02
11122 1001 3400 2018-09-01 23:30:22 02
22233 1001 2500 2018-11-21 23:30:29 01
33344 1002 3200 2018-10-12 22:12:12 02
44455 1002 3400 2018-12-09 23:30:22 01
55566 1003 2500 2018-09-01 13:23:29 02
66677 1004 3200 2018-07-13 12:12:12 02
在hive中创建order表
hive(default)>
create table `order`(
order_id int,
user_id int,
amount double,
pay_datatime string,
channel_id int
)partitioned by(dt string)
row format delimited fields terminated by '\t';
在hive中创建中间表order_tmp
hive(default)>
create table order_tmp(
order_id int,
user_id int,
amount double,
pay_datatime timestamp,
channel_id int
)
row format delimited fields terminated by '\t';
向中间表加载数据
hive(default)> load data local inpath "/opt/module/hive/datas/order.txt" into table order_tmp;
通过中间表向分区表order中插入数据
hive(default)> insert into table `order` select order_id,user_id,amount,pay_datatime,channel_id, date_format(pay_datatime,'yyyy-MM-dd') from order_tmp;
50.4 代码实现
查询dt=’2018-09-01’里每个渠道的订单数,下单人数(去重),总金额;
hive(default)>
select
channel_id,
count(order_id) channel_order_num,
count(distinct user_id) channel_user_num,
sum(amount) channel_amount_sum
from `order`
where dt='2018-09-01'
group by channel_id;
结果:
OK
channel_id channel_order_num channel_user_num channel_amount_sum
1 2 2 6800.0
2 3 2 8400.0
Time taken: 22.712 seconds, Fetched: 2 row(s)
查询dt=’2018-09-01’里每个渠道的金额最大3笔订单;
hive(default)>
select
t1.channel_id,
t1.order_id,
t1.amount
from (
select
channel_id,
order_id,
amount,
rank() over(partition by channel_id order by amount desc ) rk
from `order`
where dt='2018-09-01'
) t1
where t1.rk<=3;
结果:
OK
t1.channel_id t1.order_id t1.amount
1 11111 3400.0
1 44444 3400.0
2 11122 3400.0
2 22222 2500.0
2 55566 2500.0
Time taken: 1.395 seconds, Fetched: 5 row(s)
有一天发现订单数据重复,请分析原因;
分析:
订单属于业务数据,在关系型数据库中不会存在数据重复
hive建表时也不会导致数据重复,
我推测是在数据迁移时,迁移失败导致重复迁移数据冗余了
SQL51 统计商品销售数
51.1 题目需求
统计最近一个月,销售数量最多的10个商品
统计最近一个月,每个种类里销售数量最多的10个商品(一个订单对应一个商品,一个商品对应一个品类)
51.2 表结构
表名 | t_order(订单表) | t_item(商品表) | t_item(商品表) |
字段名 | order_id(订单id) item_id (商品id) create_time(下单时间) amount (下单金额) | item_id(商品id) item_name(商品名称) category(品类) | item_id(商品id) item_name(商品名称) category_1(一级品类) category_2(二级品类) |
51.3 建表/插数语句
[atguigu@hadoop102 ~]$ vim /opt/module/hive/datas/t_order.txt
11123 1001 2018-09-01 23:30:22 3400
11121 1002 2018-09-02 13:30:22 3500
11112 1003 2018-09-01 03:12:22 2400
11111 1004 2018-09-12 12:20:22 3200
11133 1002 2018-09-23 13:34:22 5400
11134 1003 2018-09-26 09:35:22 8700
11212 1003 2018-09-15 08:36:22 2200
21123 1002 2018-08-01 23:30:22 3400
41121 1002 2018-08-02 13:30:22 3500
31112 1008 2018-08-01 03:12:22 2400
11311 1008 2018-08-01 12:20:22 3200
51133 1009 2018-08-01 13:34:22 5400
61134 10010 2018-08-01 09:35:22 8700
71212 1001 2018-08-01 08:36:22 2200
[atguigu@hadoop102 ~]$ vim /opt/module/hive/datas/t_item.txt
1001 iPhone10 01
1002 巴黎世家 02
1003 手纸 03
1004 vivo 04
1005 oppo 04
1006 耳机 05
1007 AR眼镜 05
1008 显示器 06
1009 笔记本电脑 06
1010 可乐 07
1011 雪碧 07
// 创建t_order订单表
hive(default)>
create table t_order(
order_id bigint,
item_id bigint,
create_time string,
amount bigint
)
row format delimited
fields terminated by '\t';
// 创建t_item商品表
create table t_item(
item_id bigint,
item_name string,
category string
)
row format delimited
fields terminated by '\t';
51.4 代码实现
统计最近一个月(这里可以理解为月份,也可以理解为30天),销售数量最多的10个商品
// 30天
hive(default)>
select
item_id,
count(order_id) num
from
t_order
where
abs(datediff(date_format(create_time, 'yyyy-MM-dd'), '2018-10-01'))<=30
group by
item_id
order by num desc;
结果:
OK
item_id num
1003 3
1002 2
1004 1
1001 1
Time taken: 2.65 seconds, Fetched: 4 row(s)
// 月份
hive(default)>
select
item_id,
count(order_id) num
from t_order
where year(create_time) ='2018' and month(create_time) ='9' -- 可使用now()函数获取当前日期
group by item_id
order by num desc;
结果:
OK
item_id num
1003 3
1002 2
1004 1
1001 1
Time taken: 2.65 seconds, Fetched: 4 row(s)
每个种类里销售数量最多的10个商品(一个订单对应一个商品,一个商品对应一个品类)
hive(default)>
select
t2.category,
t2.item_id,
t2.item_count
from (
select
t1.category,
t1.item_id,
count(t1.item_id) item_count,
rank() over(partition by t1.category, t1.item_id order by count(t1.item_id) desc ) rk
from
(
select
tor.order_id,
tor.item_id,
tit.item_name,
tit.category
from
t_order tor
join
t_item tit
on
tor.item_id = tit.item_id
) t1
group by t1.category, t1.item_id
order by item_count desc
) t2
where t2.rk<=10;
SQL52 统计用户行为
52.1 题目需求
统计平台的每一个用户发过多少日记、共获得多少点赞数。
统计平台的每一个用户发过的日记详情、总日记数量及每个日记的点赞数
52.2 表结构
52.3 建表/插数语句
hive(default)>
create table diary(
uid int,
log_id int,
log string
)
row format delimited fields terminated by '\t';
hive(default)>
create table `like`(
log_id int,
like_uid int
)
row format delimited fields terminated by '\t';
Insert into `diary` values
(1001,100101,'震惊!鸡是卵生动物'),
(1002,100201,'不要再吃隔夜菜了!'),
(1003,100301,'吃烧烤会致癌!18岁年轻小伙顿顿烧烤已进入ICU'),
(1001,100102,'震惊!这些明星都是外国人'),
(1001,100103,'震惊!水加热到100度后会消失'),
(1001,100104,'震惊!如果三天不喝热水,人就会感到非常口渴'),
(1002,100202,'不要再喝隔夜茶了!'),
(1003,100302,'吃辣椒可以预防癌症!18岁小伙顿顿离不开辣椒,从没有患过癌症');
Insert into `like` values (100101,10011),
(100101,10012),
(100101,10013),
(100102,10014),
(100102,10011),
(100103,10011),
(100103,10012),
(100103,10011),
(100101,10014),
(100104,10011),
(100104,10012),
(100104,10013),
(100201,10011),
(100201,10012),
(100202,10013),
(100202,10014),
(100301,10011),
(100301,10012),
(100301,10014),
(100302,10015);
52.4 代码实现
统计平台的每一个用户一共发过多少日记、共获得多少点赞数。
hive(default)>
with t1 as(
select
t2.uid,--用户Id
count( distinct t2.log_id) log_cnt
from (
select d.uid,d.log_id, d.log,l.like_uid from
diary d left join `like` l
on d.log_id = l.log_id
)t2
group by t2.uid
)
select
t1.uid,
t1.log_cnt,
t3.liked_sum
from t1 join (
select
t4.uid,
sum(t4.liked_cnt) liked_sum
from (
select
t5.uid,--用户Id
count(t5.like_uid) liked_cnt
from (
select d.uid,d.log_id, d.log,l.like_uid from
diary d left join `like` l
on d.log_id = l.log_id
) t5
group by t5.uid, t5.log_id
) t4
group by t4.uid
) t3 on t1.uid = t3.uid;
结果:
OK
t1.uid t1.log_cnt t3.liked_sum
1001 4 12
1002 2 4
1003 2 4
Time taken: 70.991 seconds, Fetched: 3 row(s)
统计平台的每一个用户发过的日记详情、总日记数量及每个日记的点赞数
hive(default)>
with t1 as(
select
d.uid,d.log_id,d.log,l.like_uid
from diary d join `like` l on d.log_id = l.log_id
)
select
t1.uid,
t1.log_id,
t1.log,
t1.like_uid,
count(distinct t1.log_id) over(partition by t1.uid order by t1.log_id rows between unbounded preceding and unbounded following) log_cnt,
count(t1.like_uid) over(partition by t1.uid,t1.log_id ) liked_cnt
from t1;
结果:
OK
t1.uid t1.log_id t1.log t1.like_uid log_cnt liked_cnt
1001 100101 震惊!鸡是卵生动物 10013 4 4
1001 100101 震惊!鸡是卵生动物 10011 4 4
1001 100101 震惊!鸡是卵生动物 10014 4 4
1001 100101 震惊!鸡是卵生动物 10012 4 4
1001 100102 震惊!这些明星都是外国人 10014 4 2
1001 100102 震惊!这些明星都是外国人 10011 4 2
1001 100103 震惊!水加热到100度后会消失 10011 4 3
1001 100103 震惊!水加热到100度后会消失 10011 4 3
1001 100103 震惊!水加热到100度后会消失 10012 4 3
1001 100104 震惊!如果三天不喝热水,人就会感到非常口渴 10011 4 3
1001 100104 震惊!如果三天不喝热水,人就会感到非常口渴 10013 4 3
1001 100104 震惊!如果三天不喝热水,人就会感到非常口渴 10012 4 3
1002 100201 不要再吃隔夜菜了! 10012 2 2
1002 100201 不要再吃隔夜菜了! 10011 2 2
1002 100202 不要再喝隔夜茶了! 10013 2 2
1002 100202 不要再喝隔夜茶了! 10014 2 2
1003 100301 吃烧烤会致癌!18岁年轻小伙顿顿烧烤已进入ICU 10012 2 3
1003 100301 吃烧烤会致癌!18岁年轻小伙顿顿烧烤已进入ICU 10011 2 3
1003 100301 吃烧烤会致癌!18岁年轻小伙顿顿烧烤已进入ICU 10014 2 3
1003 100302 吃辣椒可以预防癌症!18岁小伙顿顿离不开辣椒,从没有患过癌症 10015 2 1
Time taken: 44.122 seconds, Fetched: 20 row(s)
SQL53 找出T1表中最大版本号
53.1 题目需求
找出T1表中最大版本号
计算出所有版本号排序,要求对于相同的版本号,排序号并列
53.4 代码实现
hive(default)>
select
v_id,--版本号
max(split(v_id,".")[0]) v1,//主版本不会为空
max(if(split(v_id,".")[1]="",0,split(v_id,".")[1]))v2,--取出子版本并判断是否为空,并给默认值
max(if(split(v_id,".")[2]="",0,split(v_id,".")[2]))v3 --取出阶段版本并判断是否为空,并给默认值
from
t1;
计算出所有版本号排序,要求对于相同的版本号,排序号并列
hive(default)>
select
v_id,
rank() over(partition by v_id order by v_id)seq
from
t1;
SQL54 连续问题
54.1 题目需求
找出连续3天及以上减少碳排放量在100以上的用户
54.2 表结构
表名:test1
字段名:id(用户ID)、dt(日期)、lowcarbon(减少碳排放量)
54.3 建表/插数语句
hive(default)>
create table test1(
id bigint,
dt string,
lowcarbon bigint
)
row format delimited
fields terminated by '\t ';
Insert into test1 values(1001,'2021-12-12',123),
(1002,'2021-12-12',45),
(1001,'2021-12-13',43),
(1001,'2021-12-13',45),
(1001,'2021-12-13',23),
(1002,'2021-12-14',45),
(1001,'2021-12-14',230),
(1002,'2021-12-15',45),
(1001,'2021-12-15',23);
54.4 代码实现
hive(default)>
select
id,
flag,
count(*) ct
from
(select
id,
dt,
lowcarbon,
date_sub(dt,rk) flag
from
(select
id,
dt,
lowcarbon,
rank() over(partition by id order by dt) rk
from
(select
id,
dt,
sum(lowcarbon) lowcarbon
from test1
group by id,dt
having lowcarbon>100)t1)t2)t3
group by id,flag
having ct>=3;
SQL55 分组问题
55.1 题目需求
统计每个用户连续的访问记录中,如果时间间隔小于6s,就分为一个组(观察数是标准的时间戳吗?)
55.2 表结构
下列为电商公司用户访问时间数据 | 统计某个用户连续的访问记录,如果时间间隔小于60s,就分为一组 | |
id ts 1001 17523641234 1001 17523641256 1002 17523641278 1001 17523641334 1002 17523641434 1001 17523641534 1001 17523641544 1002 17523641634 1001 17523641638 1001 17523641654 | id ts group 1001 17523641234 1 1001 17523641256 1 1001 17523641334 2 1001 17523641534 3 1001 17523641544 3 1001 17523641638 4 1001 17523641654 4 1002 17523641278 1 1002 17523641434 2 1002 17523641634 3 |
55.3 建表/插数语句
hive(default)>
create table test2 (
id bigint,
ts bigint
)
row format delimited
fields terminated by '\t';
Insert into test2 values(1001,17523641234),
(1001,17523641256),
(1002,17523641278),
(1001,17523641334),
(1002,17523641434),
(1001,17523641534),
(1001,17523641544),
(1002,17523641634),
(1001,17523641638),
(1001,17523641654);
55.4 代码实现
hive(default)>
select
id,
ts,
sum(if(tsdiff>=60,1,0)) over(partition by id order by ts) groupid
from(
select
id,
ts,
ts-lagts tsdiff
from(
select
id,
ts,
lag(ts,1,0) over(partition by id order by ts) lagts
from
user_visit
)t1
)t2;
SQL56 间隔连续问题
56.1 题目需求
计算每个用户最大的连续登录天数,可以间隔一天。如:如果一个用户在1,3,5,6登录游戏,则视为连续6天登录。
56.2 表结构
某游戏公司记录的用户每日登录数据
表名:game_user
字段名:id(用户id)、dt(日期)
56.3 建表/插数语句
hive(default)>
create table game_user(
id bigint,
dt string
)
row format delimited
fields terminated by '\t';
Insert into game_user values(1001,'2022-05-01 23:21:33'),
(1003,'2022-05-02 23:21:33'),
(1002,'2022-05-01 23:21:33'),
(1003,'2022-05-01 23:21:33'),
(1001,'2022-05-03 23:21:33'),
(1003,'2022-05-04 23:21:33'),
(1002,'2022-05-01 23:21:33'),
(1001,'2022-05-05 23:21:33'),
(1001,'2022-05-01 23:21:33'),
(1002,'2022-05-06 23:21:33'),
(1001,'2022-05-06 23:21:33'),
(1001,'2022-05-07 23:21:33');
56.4 代码实现
hive(default)>
select
id,
max(days)+1
from (
select
id,
flag,
datediff(max(dt),min(dt)) days
from (
select
id,
dt,
sum(if(flag>2,1,0)) over(partition by id order by dt) flag
from (
select
id,
dt,
datediff(dt,lagdt) flag
from (
select
id,
dt,
lag(dt,1,'1970-01-01') over(partition by id order by dt) lagdt
from
game_user
)t1
)t2
)t3
group by id,flag
)t4
group by id;
第二种算法
hive(default)>
select
t3.id,
t3.days+1 as lianxu_day
from (
select
t2.id,
datediff(max(t2.dt),min(t2.last_dt)) as days
from (
select
t1.id,
t1.dt,
t1.last_dt,
datediff( t1.dt,t1.last_dt) chazhi
from (
select
id,
dt,
lag(dt,1,'1970-01-01') over(partition by id order by dt ) last_dt
from game_user
) t1
) t2
where t2.chazhi<=2
group by t2.id
) t3;
结果:
t3.id lianxu_day
1001 7
1002 3
1003 4
Time taken: 2.491 seconds, Fetched: 3 row(s)
SQL57 打折日期交叉问题
57.1 题目需求
计算每个商品总的打折销售天数
注意:其中的交叉日期。
比如vivo品牌,第一次活动时间为2021-06-05到2021-06-15,第二次活动时间为2021-06-09到2021-06-21其中9号到15号为重复天数,只统计一次,即vivo总打折天数为2021-06-05到2021-06-21共计17天。
57.2 表结构
数据为平台商品促销数据
表名:good_promotion
字段名:brand(品牌)、stt(打折开始日期)、edt(打折结束日期)。
57.3 建表/插数语句
hive(default)>
create table good_promotion(
brand string,
stt string,
edt string
)
row format delimited
fields terminated by '\t';
Insert into good_promotion values('oppo','2021-06-05','2021-06-09'),
('oppo','2021-06-11','2021-06-21'),
('vivo','2021-06-05','2021-06-15'),
('vivo','2021-06-09','2021-06-21'),
('redmi','2021-06-05','2021-06-21'),
('redmi','2021-06-09','2021-06-15'),
('redmi','2021-06-17','2021-06-26'),
('huawei','2021-06-05','2021-06-26'),
('huawei','2021-06-09','2021-06-15'),
('huawei','2021-06-17','2021-06-21');
57.4 代码实现
hive(default)>
select
brand,
sum(if(days>=0,days+1,0)) days
from (
select
brand,
datediff(edt,stt) days
from (
select
brand,
if(maxEdt is null,stt,if(stt>maxEdt,stt,date_add(maxEdt,1))) stt,
edt
from (
select
brand,
stt,
edt,
max(edt) over(partition by brand order by stt rows between UNBOUNDED PRECEDING and 1 PRECEDING) maxEdt
from good_promotion
)t1
)t2
)t3
group by brand;
SQL58 同时在线问题
58.1 题目需求
统计平台最高峰同时在线的主播人数
58.2 表结构
如下为某直播平台主播开播及关播时间
表名:show_user
字段名:id(主播id)、stt(开播时间)、edt(关播时间)
58.3 建表/插数语句
hive(default)>
create table show_user(
id bigint,
stt string,
edt string
)
row format delimited
fields terminated by '\t';
Insert into show_user values(1001,'2021-06-14 12:12:12','2021-06-14 18:12:12'),
(1003,'2021-06-14 13:12:12','2021-06-14 16:12:12'),
(1004,'2021-06-14 13:15:12','2021-06-14 20:12:12'),
(1002,'2021-06-14 15:12:12','2021-06-14 16:12:12'),
(1005,'2021-06-14 15:18:12','2021-06-14 20:12:12'),
(1001,'2021-06-14 20:12:12','2021-06-14 23:12:12'),
(1006,'2021-06-14 21:12:12','2021-06-14 23:15:12'),
(1007,'2021-06-14 22:12:12','2021-06-14 23:10:12');
58.4 代码实现
hive(default)>
select
max(sum_p)
from
(select
id,
dt,
sum(p) over(order by dt) sum_p
from
(select id,stt dt,1 p from test5
union
select id,edt dt,-1 p from show_user)t1)t2;
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 被坑几百块钱后,我竟然真的恢复了删除的微信聊天记录!
· 【自荐】一款简洁、开源的在线白板工具 Drawnix
· 没有Manus邀请码?试试免邀请码的MGX或者开源的OpenManus吧
· 园子的第一款AI主题卫衣上架——"HELLO! HOW CAN I ASSIST YOU TODAY
· 无需6万激活码!GitHub神秘组织3小时极速复刻Manus,手把手教你使用OpenManus搭建本