Atguigu (尚硅谷) Big Data SQL Question Bank - 58 Advanced Problems (Answers and Explanations)

Advanced Problems

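SQL1 Average completion rate of each video - Easy

1.1 Requirements

For every video that has play records in 2021, compute the completion rate (rounded to three decimal places) and sort the result by completion rate in descending order.

Note: a video's completion rate is the number of completed plays divided by the total number of plays.

For simplicity, a play counts as completed when the difference between the end time and the start time is >= the video duration.

1.2 Table structure

User-video interaction table k1_user_video_log

Short-video information table k2_video_info

1.3 Table creation / data insertion statements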

DROP TABLE IF EXISTS k1_user_video_log;

DROP TABLE IF EXISTS k2_video_info;

create table if not exists

k1_user_video_log

(

uid int,

video_id int,

start_time timestamp,

end_time timestamp,

if_follow tinyint,

if_like tinyint,

if_retweet tinyint,

comment_id int

)

row format delimited fields terminated by ','

stored as textfile;

create table if not exists

k2_video_info

(

video_id int,

author int,

tag string,

duration int,

release_time timestamp

)

row format delimited fields terminated by ','

stored as textfile;

INSERT INTO k1_user_video_log(uid, video_id, start_time, end_time, if_follow, if_like, if_retweet, comment_id) VALUES

(101, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:30', 0, 1, 1, null),

(102, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:24', 0, 0, 1, null),

(103, 2001, '2021-10-01 11:00:00', '2021-10-01 11:00:34', 0, 1, 0, 1732526),

(101, 2002, '2021-09-01 10:00:00', '2021-09-01 10:00:42', 1, 0, 1, null),

(102, 2002, '2021-10-01 11:00:00', '2021-10-01 11:00:30', 1, 0, 1, null);

INSERT INTO k2_video_info(video_id, author, tag, duration, release_time) VALUES

(2001, 901, '影视', 30, '2021-01-01 7:00:00'),

(2002, 901, '美食', 60, '2021-01-01 7:00:00'),

(2003, 902, '旅游', 90, '2021-01-01 7:00:00');

1.4 Implementation

SELECT t1.video_id, ROUND(sum(if(unix_timestamp(t1.end_time)-unix_timestamp(t1.start_time)>=t2.duration,1,0))/count(t1.video_id),3) as avg_comp_play_rate

FROM k1_user_video_log as t1

LEFT JOIN k2_video_info as t2

on t1.video_id=t2.video_id

WHERE year(start_time)=2021

GROUP BY t1.video_id

ORDER BY avg_comp_play_rate DESC;

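The result for the sample data is:

video_id avg_comp_play_rate

2001 0.667

2002 0.000

Explanation:

Video 2001 has 3 play records in October 2021 with watch times of 30, 24, and 34 seconds; the video is 30 seconds long, so two plays count as completed and the completion rate is 0.667.

Video 2002 has 2 play records in September and October 2021 with watch times of 42 and 30 seconds; the video is 60 seconds long, so the completion rate is 0.000.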
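
The arithmetic for SQL1 can be cross-checked outside Hive. The following is a minimal Python sketch, not part of the original answer; the function name `completion_rates` and the in-memory row layout are illustrative assumptions, and the rows are copied from the sample data.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (video_id, start_time, end_time) copied from the sample k1_user_video_log rows
plays = [
    (2001, "2021-10-01 10:00:00", "2021-10-01 10:00:30"),
    (2001, "2021-10-01 10:00:00", "2021-10-01 10:00:24"),
    (2001, "2021-10-01 11:00:00", "2021-10-01 11:00:34"),
    (2002, "2021-09-01 10:00:00", "2021-09-01 10:00:42"),
    (2002, "2021-10-01 11:00:00", "2021-10-01 11:00:30"),
]
# video_id -> duration in seconds, from the sample k2_video_info rows
durations = {2001: 30, 2002: 60, 2003: 90}


def completion_rates(plays, durations, year=2021):
    """Return [(video_id, rate)] sorted by rate descending (hypothetical helper)."""
    totals, done = {}, {}
    for vid, start, end in plays:
        s = datetime.strptime(start, FMT)
        e = datetime.strptime(end, FMT)
        if s.year != year:
            continue  # only plays that started in the requested year
        totals[vid] = totals.get(vid, 0) + 1
        if (e - s).total_seconds() >= durations[vid]:
            done[vid] = done.get(vid, 0) + 1
    rates = {vid: round(done.get(vid, 0) / n, 3) for vid, n in totals.items()}
    return sorted(rates.items(), key=lambda kv: -kv[1])


result = completion_rates(plays, durations)
print(result)
```

Video 2003 has no play records, so it does not appear in the result, which matches the behaviour of driving the join from the log table.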

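SQL2 Video categories with an average play progress above 60% - Easy

2.1 Requirements

Compute the average play progress of each video category and output the categories whose progress is above 60%.

Note: play progress = watch time ÷ video duration * 100%. When the watch time exceeds the video duration, the play progress counts as 100%.

Keep two decimal places and sort by play progress in descending order.

2.2 Table structure

User-video interaction table k3_user_video_log

Short-video information table k4_video_info

2.3 Table creation / data insertion statements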

create table if not exists

k3_user_video_log

(

uid int,

video_id int,

start_time timestamp,

end_time timestamp,

if_follow tinyint,

if_like tinyint,

if_retweet tinyint,

comment_id int

)

row format delimited fields terminated by ','

stored as textfile;

create table if not exists

k4_video_info

(

video_id int,

author int,

tag string,

duration int,

release_time timestamp

)

row format delimited fields terminated by ','

stored as textfile;

INSERT INTO k3_user_video_log(uid, video_id, start_time, end_time, if_follow, if_like, if_retweet, comment_id) VALUES

(101, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:30', 0, 1, 1, null),

(102, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:21', 0, 0, 1, null),

(103, 2001, '2021-10-01 11:00:50', '2021-10-01 11:01:20', 0, 1, 0, 1732526),

(102, 2002, '2021-10-01 11:00:00', '2021-10-01 11:00:30', 1, 0, 1, null),

(103, 2002, '2021-10-01 10:59:05', '2021-10-01 11:00:05', 1, 0, 1, null);

INSERT INTO k4_video_info(video_id, author, tag, duration, release_time) VALUES

(2001, 901, '影视', 30, '2021-01-01 7:00:00'),

(2002, 901, '美食', 60, '2021-01-01 7:00:00'),

(2003, 902, '旅游', 90, '2020-01-01 7:00:00');

2.4 Implementation

select
    tag,
    concat(cast(avg(rate) as decimal(16,2)), '%') avg_play_progress
from
(
    -- progress of each play in percent, capped at 100 per play
    select
        t2.tag tag,
        if(unix_timestamp(t1.end_time) - unix_timestamp(t1.start_time) >= t2.duration,
           100,
           (unix_timestamp(t1.end_time) - unix_timestamp(t1.start_time)) * 100 / t2.duration) rate
    from k3_user_video_log t1
    join k4_video_info t2 on t1.video_id = t2.video_id
) a1
group by tag
having avg(rate) > 60
order by avg(rate) desc;

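Output:

影视|90.00%

美食|75.00%

Explanation:

Video 2001 (category 影视) was watched by users 101, 102, and 103 with play progress of 30 seconds (100%), 21 seconds (70%), and 30 seconds (100%), so the average play progress is 90.00% (two decimal places).

Video 2002 (category 美食) was watched by users 102 and 103 with play progress of 30 seconds (50%) and 60 seconds (100%), so the average play progress is 75.00% (two decimal places).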
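
The capping rule matters: progress is limited to 100% per play before averaging, not after. A small Python sketch of that rule follows; the helper name `avg_progress` and the pre-computed watch times are illustrative assumptions derived from the sample rows.

```python
# (tag, watched_seconds, duration_seconds) derived from the sample rows
records = [
    ("影视", 30, 30),
    ("影视", 21, 30),
    ("影视", 30, 30),
    ("美食", 30, 60),
    ("美食", 60, 60),
]


def avg_progress(records, threshold=60):
    """Average per-play progress (percent, capped at 100) per tag, above threshold."""
    by_tag = {}
    for tag, watched, duration in records:
        pct = min(watched * 100 / duration, 100)  # cap each play, not the average
        by_tag.setdefault(tag, []).append(pct)
    averages = {tag: round(sum(v) / len(v), 2) for tag, v in by_tag.items()}
    return {tag: avg for tag, avg in averages.items() if avg > threshold}


progress = avg_progress(records)
print(progress)
```

Capping the average instead of each play would overstate a category in which one play ran far past the video length.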

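SQL3 Retweet count and retweet rate of each video category in the last month - Medium

3.1 Requirements

For the most recent month with user interaction (the last 30 days including the current day; for example, the last 30 days of October 31 cover 10.2~10.31), compute the retweet count and retweet rate (3 decimal places) of each video category.

Note: retweet rate = retweet count ÷ play count. Sort the result by retweet rate in descending order.

3.2 Table structure

User-video interaction table k5_user_video_log

Short-video information table k6_video_info

3.3 Table creation / data insertion statements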

DROP TABLE IF EXISTS k5_user_video_log;

DROP TABLE IF EXISTS k6_video_info;

create table if not exists

k5_user_video_log

(

uid int,

video_id int,

start_time timestamp,

end_time timestamp,

if_follow tinyint,

if_like tinyint,

if_retweet tinyint,

comment_id int

)

row format delimited fields terminated by ','

stored as textfile;

create table if not exists

k6_video_info

(

video_id int,

author int,

tag string,

duration int,

release_time timestamp

)

row format delimited fields terminated by ','

stored as textfile;

INSERT INTO k5_user_video_log(uid, video_id, start_time, end_time, if_follow, if_like, if_retweet, comment_id) VALUES

(101, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:20', 0, 1, 1, null)

,(102, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:15', 0, 0, 1, null)

,(103, 2001, '2021-10-01 11:00:50', '2021-10-01 11:01:15', 0, 1, 0, 1732526)

,(102, 2002, '2021-09-10 11:00:00', '2021-09-10 11:00:30', 1, 0, 1, null)

,(103, 2002, '2021-10-01 10:59:05', '2021-10-01 11:00:05', 1, 0, 0, null);

INSERT INTO k6_video_info(video_id, author, tag, duration, release_time) VALUES

(2001, 901, '影视', 30, '2021-01-01 7:00:00')

,(2002, 901, '美食', 60, '2021-01-01 7:00:00')

,(2003, 902, '旅游', 90, '2020-01-01 7:00:00');

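3.4 Implementation

select
    tag,
    sum(if_retweet) retweet_cnt,
    cast(sum(if_retweet) / count(*) as decimal(16,3)) retweet_rate
from (
    select video_id, if_retweet
    from k5_user_video_log
    -- the latest interaction date in the sample data is 2021-10-01;
    -- the inclusive 30-day window therefore starts 29 days earlier
    where to_date(start_time) >= date_sub('2021-10-01', 29)
) t1
join k6_video_info t2 on t1.video_id = t2.video_id
group by tag
order by retweet_rate desc;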

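Output:

影视|2|0.667

美食|1|0.500

Explanation:

From the data in k5_user_video_log, the data dump date is 2021-10-01. Within the last 30 days, video 2001 (category 影视) has 3 play records and was retweeted 2 times, a retweet rate of 0.667; video 2002 (category 美食) has 2 play records and was retweeted once, a retweet rate of 0.500.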
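
The "last 30 days including the current day" window is easy to get wrong by one day. A minimal Python sketch of the date arithmetic, with hypothetical helper names `last_30_days` and `in_window`:

```python
from datetime import date, timedelta


def last_30_days(latest):
    """Inclusive 30-day window ending on `latest`: [latest - 29 days, latest]."""
    return latest - timedelta(days=29), latest


def in_window(day, latest):
    start, end = last_30_days(latest)
    return start <= day <= end


# Example from the problem statement: the window for Oct 31 is Oct 2 .. Oct 31.
print(last_30_days(date(2021, 10, 31)))
# For the sample data (latest date 2021-10-01) the play on 2021-09-10 is included.
print(in_window(date(2021, 9, 10), date(2021, 10, 1)))
```

Subtracting 30 days instead of 29 would produce a 31-day window.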

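SQL4 Monthly fan growth rate and cumulative fan count of each creator - Medium

4.1 Requirements

For each creator, compute the fan growth rate of every month in 2021 and the total fan count up to that month.

Note: fan growth rate = (fans gained - fans lost) / play count. Sort the result by creator ID and total fan count in ascending order.

if_follow = 1 means the user followed the creator while watching the video, 0 means the follow state did not change during this interaction, and 2 means the user unfollowed the creator while watching.

4.2 Table structure

User-video interaction table k7_user_video_log

Short-video information table k8_video_info

4.3 Table creation / data insertion statements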

create table if not exists

k7_user_video_log

(

uid int,

video_id int,

start_time timestamp,

end_time timestamp,

if_follow tinyint,

if_like tinyint,

if_retweet tinyint,

comment_id int

)

row format delimited fields terminated by ','

stored as textfile;

create table if not exists

k8_video_info

(

video_id int,

author int,

tag string,

duration int,

release_time timestamp

)

row format delimited fields terminated by ','

stored as textfile;

INSERT INTO k7_user_video_log(uid, video_id, start_time, end_time, if_follow, if_like, if_retweet, comment_id) VALUES

(101, 2001, '2021-09-01 10:00:00', '2021-09-01 10:00:20', 0, 1, 1, null)

,(105, 2002, '2021-09-10 11:00:00', '2021-09-10 11:00:30', 1, 0, 1, null)

,(101, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:20', 1, 1, 1, null)

,(102, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:15', 0, 0, 1, null)

,(103, 2001, '2021-10-01 11:00:50', '2021-10-01 11:01:15', 1, 1, 0, 1732526)

,(106, 2002, '2021-10-01 10:59:05', '2021-10-01 11:00:05', 2, 0, 0, null);

INSERT INTO k8_video_info(video_id, author, tag, duration, release_time) VALUES

(2001, 901, '影视', 30, '2021-01-01 7:00:00')

,(2002, 901, '影视', 60, '2021-01-01 7:00:00')

,(2003, 902, '旅游', 90, '2020-01-01 7:00:00')

,(2004, 902, '美女', 90, '2020-01-01 8:00:00');

4.4 Implementation

select
    author,
    month,
    fans_growth_rate,
    sum(month_add) over(partition by author order by month) total_fans
from (
    select
        author,
        month,
        cast(sum(add - lose) / count(*) as decimal(16,3)) fans_growth_rate,
        sum(add - lose) month_add
    from (
        select
            video_id,
            if(if_follow=1,1,0) add,
            if(if_follow=2,1,0) lose,
            date_format(start_time,'yyyy-MM') month
        from k7_user_video_log
        where year(start_time) = 2021
    ) t1
    join k8_video_info t2 on t1.video_id = t2.video_id
    group by author, month
) t3
order by author, total_fans;

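Output:

901|2021-09|0.500|1

901|2021-10|0.250|2

Explanation:

In the sample data, k7_user_video_log only contains play records for videos 2001 and 2002, both from creator 901, with play times in September and October 2021. In September, 1 fan was gained, 0 were lost, and there were 2 plays, so the growth rate is 0.500 (3 decimal places). In October, 2 fans were gained, 1 was lost, and there were 4 plays, so the growth rate is 0.250 and the total fan count so far is 2.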
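
The monthly growth rate and the running fan total can be reproduced with a few lines of Python. This is an illustrative sketch only; `monthly_stats` is a hypothetical helper and the `(month, if_follow)` pairs are the sample plays of creator 901.

```python
# (month, if_follow) for each play of creator 901 in the sample data
plays = [
    ("2021-09", 0), ("2021-09", 1),
    ("2021-10", 1), ("2021-10", 0), ("2021-10", 1), ("2021-10", 2),
]


def monthly_stats(plays):
    """Return [(month, growth_rate, cumulative_fans)] in month order."""
    rows, total = [], 0
    for month in sorted({m for m, _ in plays}):
        flags = [f for m, f in plays if m == month]
        net = flags.count(1) - flags.count(2)  # gained minus lost
        total += net                           # running total, like sum() over(order by month)
        rows.append((month, round(net / len(flags), 3), total))
    return rows


stats = monthly_stats(plays)
print(stats)
```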

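SQL5 Top 3 hottest videos released in the last month - Hard

5.1 Requirements

Find the top 3 hottest videos among those released in the last month.

Note: heat = (a*completion rate + b*likes + c*comments + d*retweets) * freshness;

freshness = 1/(days since the last play + 1);

The current parameters a, b, c, d are 100, 5, 3, and 2.

The latest play date is based on end_time (the time watching ended). Call it T; the last month is then the closed interval [T-29, T].

Keep the heat as an integer and sort by heat in descending order.

5.2 Table structure

User-video interaction table k11_user_video_log

Short-video information table k12_video_info

5.3 Table creation / data insertion statements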

create table if not exists

k11_user_video_log

(

uid int,

video_id int,

start_time timestamp,

end_time timestamp,

if_follow tinyint,

if_like tinyint,

if_retweet tinyint,

comment_id int

)

row format delimited fields terminated by ','

stored as textfile;

create table if not exists

k12_video_info

(

video_id int,

author int,

tag string,

duration int,

release_time timestamp

)

row format delimited fields terminated by ','

stored as textfile;

INSERT INTO k11_user_video_log(uid, video_id, start_time, end_time, if_follow, if_like, if_retweet, comment_id) VALUES

(101, 2001, '2021-09-24 10:00:00', '2021-09-24 10:00:20', 1, 1, 0, null)

,(105, 2002, '2021-09-25 11:00:00', '2021-09-25 11:00:30', 0, 0, 1, null)

,(102, 2002, '2021-09-25 11:00:00', '2021-09-25 11:00:30', 1, 1, 1, null)

,(101, 2002, '2021-09-26 11:00:00', '2021-09-26 11:00:30', 1, 0, 1, null)

,(101, 2002, '2021-09-27 11:00:00', '2021-09-27 11:00:30', 1, 1, 0, null)

,(102, 2002, '2021-09-28 11:00:00', '2021-09-28 11:00:30', 1, 0, 1, null)

,(103, 2002, '2021-09-29 11:00:00', '2021-09-29 11:00:30', 1, 0, 1, null)

,(102, 2002, '2021-09-30 11:00:00', '2021-09-30 11:00:30', 1, 1, 1, null)

,(101, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:20', 1, 1, 0, null)

,(102, 2001, '2021-10-01 10:00:00', '2021-10-01 10:00:15', 0, 0, 1, null)

,(103, 2001, '2021-10-01 11:00:50', '2021-10-01 11:01:15', 1, 1, 0, 1732526)

,(106, 2002, '2021-10-02 10:59:05', '2021-10-02 11:00:05', 2, 0, 1, null)

,(107, 2002, '2021-10-02 10:59:05', '2021-10-02 11:00:05', 1, 0, 1, null)

,(108, 2002, '2021-10-02 10:59:05', '2021-10-02 11:00:05', 1, 1, 1, null)

,(109, 2002, '2021-10-03 10:59:05', '2021-10-03 11:00:05', 0, 1, 0, null);

INSERT INTO k12_video_info(video_id, author, tag, duration, release_time) VALUES

(2001, 901, '旅游', 30, '2020-01-01 7:00:00')

,(2002, 901, '旅游', 60, '2021-01-01 7:00:00')

,(2003, 902, '影视', 90, '2020-01-01 7:00:00')

,(2004, 902, '美女', 90, '2020-01-01 8:00:00');

5.4 Implementation

select a.video_id, round((full_play_rate + like_rate + common_rate + retweet_rate) * fresh_rate, 0) high_index

from (

-- completion rate, already multiplied by its weight of 100

select video_id, sum(if(t >= duration, 1, 0)) * 100 / count(*) full_play_rate

from (

select k11_user_video_log.video_id,

unix_timestamp(end_time) -

unix_timestamp(start_time) t,

k12vi.duration

from k11_user_video_log

left join k12_video_info k12vi

on k11_user_video_log.video_id = k12vi.video_id) t1

group by video_id) a

left join (

-- weighted likes, comments, and retweets

select video_id,

sum(if_like) * 5 like_rate,

sum(if(comment_id is not null, 1, 0)) * 3 common_rate,

sum(if_retweet) * 2 retweet_rate

from k11_user_video_log

group by video_id) b on a.video_id = b.video_id

left join (

-- freshness

select video_id, 1 / (if(no_play_days > 29, 29, no_play_days) + 1) fresh_rate

from (

-- days since the video was last played

select video_id,

end_date,

row_number() over (partition by video_id order by end_date desc) rk,

datediff(today, end_date) `no_play_days`

from (

-- convert the end time to a date, then rank

select video_id, to_date(end_time) end_date, today

from k11_user_video_log

join(select max(to_date(end_time)) today

from k11_user_video_log

group by 1) tx

) t1) t2

where rk = 1) c

on a.video_id = c.video_id

order by high_index desc

limit 3;

Output:

旅游|2021-10-01|5|2

旅游|2021-10-02|5|3

旅游|2021-10-03|6|3

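Explanation:

The latest play date is 2021-10-03, taken as the current date. The videos released within the last month (on or after 2021-09-04) are 2001, 2002, 2003, and 2004, although 2004 has no play records yet. Video 2001 has a completion rate of 1.0 (played 4 times, completed 4 times), 3 likes, 1 comment, 2 retweets, and 0 days since its last play,

so its heat is: (100*1.0+5*3+3*1+2*2)/(0+1)=122

Likewise, video 2003 has a completion rate of 0, 1 like, no comments or retweets, and 3 days since its last play, so its heat is (100*0+5*1+3*0+2*0)/(3+1)=1 (1.25 kept as an integer).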
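
The heat formula can be checked directly against the two worked figures in the explanation. The sketch below is illustrative; `heat` is a hypothetical helper, and its inputs are the numbers quoted in the explanation rather than values computed from the tables.

```python
def heat(completion_rate, likes, comments, retweets, idle_days, a=100, b=5, c=3, d=2):
    """heat = (a*completion + b*likes + c*comments + d*retweets) * freshness."""
    freshness = 1 / (idle_days + 1)  # idle_days = days since the last play
    return (a * completion_rate + b * likes + c * comments + d * retweets) * freshness


heat_2001 = heat(1.0, 3, 1, 2, idle_days=0)
heat_2003 = heat(0, 1, 0, 0, idle_days=3)
print(heat_2001, heat_2003)
```

The second value is 1.25 before it is reduced to the integer 1.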

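SQL6 Average article viewing time per user for each day of November 2021 - Easy

6.1 Requirements

Scenario:

artical_id is the ID of the article the user viewed; artical_id = 0 means the user was on a non-article page (such as a list page or a campaign page in the app).

Question:

Compute the average article viewing time per user (in seconds) for each day of November 2021, keep 1 decimal place, and sort from the shortest time to the longest.

6.2 Table structure

User behavior log table m1_user_log

6.3 Table creation / data insertion statements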

create table if not exists

m1_user_log

(

uid INT COMMENT '用户ID',

artical_id INT COMMENT '视频ID',

in_time timestamp COMMENT '进入时间',

out_time timestamp COMMENT '离开时间',

sign_in TINYINT COMMENT '是否签到'

)

row format delimited fields terminated by ','

stored as textfile;

INSERT INTO m1_user_log(uid, artical_id, in_time, out_time, sign_in) VALUES

(101, 9001, '2021-11-01 10:00:00', '2021-11-01 10:00:31', 0),

(102, 9001, '2021-11-01 10:00:00', '2021-11-01 10:00:24', 0),

(102, 9002, '2021-11-01 11:00:00', '2021-11-01 11:00:11', 0),

(101, 9001, '2021-11-02 10:00:00', '2021-11-02 10:00:50', 0),

(102, 9002, '2021-11-02 11:00:01', '2021-11-02 11:00:24', 0);

6.4 Implementation

select
    dt,
    round(sum(view_time) / count(distinct uid), 1) avg_view_len_sec
from (
    select
        uid,
        substr(in_time, 1, 10) dt,
        unix_timestamp(out_time) - unix_timestamp(in_time) view_time
    from m1_user_log
    -- only article pages viewed in November 2021
    where artical_id <> 0
      and date_format(in_time, 'yyyy-MM') = '2021-11'
) a
group by dt
order by avg_view_len_sec;

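Output:

2021-11-01|33.0

2021-11-02|36.5

Explanation:

On November 1, 2 users viewed articles for a total of 31+24+11=66 seconds, an average of 33 seconds per user.

On November 2, 2 users viewed articles for a total of 50+23=73 seconds, an average of 36.5 seconds per user.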

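SQL7 Next-day retention rate of new users for each day of November 2021 - Medium

7.1 Requirements

Compute the next-day retention rate of new users for each day of November 2021 (2 decimal places).

Note: the next-day retention rate is the share of users who were new on a given day and were active again the following day.

If in_time and out_time fall on different days, the user counts as active on both days. Sort the result by date in ascending order.

7.2 Table structure

User behavior log table m3_user_log

7.3 Table creation / data insertion statements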

create table if not exists

m3_user_log

(

uid INT COMMENT '用户ID',

artical_id INT COMMENT '视频ID',

in_time timestamp COMMENT '进入时间',

out_time timestamp COMMENT '离开时间',

sign_in TINYINT COMMENT '是否签到'

)

row format delimited fields terminated by ','

stored as textfile;

INSERT INTO m3_user_log(uid, artical_id, in_time, out_time, sign_in) VALUES (101, 0, '2021-11-01 10:00:00', '2021-11-01 10:00:42', 1), (102, 9001, '2021-11-01 10:00:00', '2021-11-01 10:00:09', 0), (103, 9001, '2021-11-01 10:00:01', '2021-11-01 10:01:50', 0), (101, 9002, '2021-11-02 10:00:09', '2021-11-02 10:00:28', 0), (103, 9002, '2021-11-02 10:00:51', '2021-11-02 10:00:59', 0), (104, 9001, '2021-11-02 10:00:28', '2021-11-02 10:00:50', 0), (101, 9003, '2021-11-03 11:00:55', '2021-11-03 11:01:24', 0), (104, 9003, '2021-11-03 11:00:45', '2021-11-03 11:00:55', 0), (105, 9003, '2021-11-03 11:00:53', '2021-11-03 11:00:59', 0), (101, 9002, '2021-11-04 11:00:55', '2021-11-04 11:00:59', 0);

7.4 Implementation

select t2.dt,round(count(t1.uid)/count(t2.uid),2) uv_left_rate

from(

select uid,min(date(in_time)) dt

from m3_user_log

group by uid

)t2

left join(

select uid,date(in_time) dt

from m3_user_log

union

select uid,date(out_time) dt

from m3_user_log

)t1

on t2.uid=t1.uid and datediff(t1.dt,t2.dt)=1

where date_format(t2.dt,'yyyy-MM')='2021-11'

group by t2.dt

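order by t2.dt;

Output:

2021-11-01|0.67

2021-11-02|1.00

2021-11-03|0.00

Explanation:

On 11.01 three users were active (101, 102, 103), all of them new; only 101 and 103 were active again on 11.02, so the next-day retention rate for 11.01 is 0.67.

On 11.02 there was one new user, 104, who was active again on 11.03, so the next-day retention rate for 11.02 is 1.00.

On 11.03 there was one new user, 105, who was not active on 11.04, so the next-day retention rate for 11.03 is 0.00.

On 11.04 there were no new users, so no row is output.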
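
The retention logic (first active day per user, then a lookup for the following day) can be sketched in Python as a cross-check. The helper name `next_day_retention` is hypothetical, the rows are reduced to `(uid, in_date, out_date)` from the sample data, and the November filter is omitted because every sample row already falls in November 2021.

```python
from datetime import date, timedelta

# (uid, in_date, out_date) reduced from the sample m3_user_log rows
logs = [
    (101, "2021-11-01", "2021-11-01"), (102, "2021-11-01", "2021-11-01"),
    (103, "2021-11-01", "2021-11-01"), (101, "2021-11-02", "2021-11-02"),
    (103, "2021-11-02", "2021-11-02"), (104, "2021-11-02", "2021-11-02"),
    (101, "2021-11-03", "2021-11-03"), (104, "2021-11-03", "2021-11-03"),
    (105, "2021-11-03", "2021-11-03"), (101, "2021-11-04", "2021-11-04"),
]


def next_day_retention(logs):
    # A session that spans midnight makes the user active on both days.
    active = set()
    for uid, d_in, d_out in logs:
        active.add((uid, date.fromisoformat(d_in)))
        active.add((uid, date.fromisoformat(d_out)))
    first_day = {}
    for uid, day in active:
        if uid not in first_day or day < first_day[uid]:
            first_day[uid] = day
    new_users, retained = {}, {}
    for uid, day in first_day.items():
        new_users[day] = new_users.get(day, 0) + 1
        if (uid, day + timedelta(days=1)) in active:
            retained[day] = retained.get(day, 0) + 1
    return {day.isoformat(): round(retained.get(day, 0) / n, 2)
            for day, n in sorted(new_users.items())}


retention = next_day_retention(logs)
print(retention)
```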

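SQL8 Daily active users and share of new users per day - Fairly hard

8.1 Requirements

Compute the number of daily active users and the share of new users for each day.

Note: share of new users = new users of the day ÷ active users of the day (DAU).

If in_time and out_time fall on different days, the user counts as active on both days.

Keep 2 decimal places for the share of new users and sort the result by date in ascending order.

8.2 Table structure

User behavior log table m5_user_log

8.3 Table creation / data insertion statements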

create table if not exists

m5_user_log

(

uid INT COMMENT '用户ID',

artical_id INT COMMENT '视频ID',

in_time timestamp COMMENT '进入时间',

out_time timestamp COMMENT '离开时间',

sign_in TINYINT COMMENT '是否签到'

)

row format delimited fields terminated by ','

stored as textfile;

INSERT INTO m5_user_log(uid, artical_id, in_time, out_time, sign_in) VALUES

(101, 9001, '2021-10-31 10:00:00', '2021-10-31 10:00:09', 0),

(102, 9001, '2021-10-31 10:00:00', '2021-10-31 10:00:09', 0),

(101, 0, '2021-11-01 10:00:00', '2021-11-01 10:00:42', 1),

(102, 9001, '2021-11-01 10:00:00', '2021-11-01 10:00:09', 0),

(108, 9001, '2021-11-01 10:00:01', '2021-11-01 10:01:50', 0),

(108, 9001, '2021-11-02 10:00:01', '2021-11-02 10:01:50', 0),

(104, 9001, '2021-11-02 10:00:28', '2021-11-02 10:00:50', 0),

(106, 9001, '2021-11-02 10:00:28', '2021-11-02 10:00:50', 0),

(108, 9001, '2021-11-03 10:00:01', '2021-11-03 10:01:50', 0),

(109, 9002, '2021-11-03 11:00:55', '2021-11-03 11:00:59', 0),

(104, 9003, '2021-11-03 11:00:45', '2021-11-03 11:00:55', 0),

(105, 9003, '2021-11-03 11:00:53', '2021-11-03 11:00:59', 0),

(106, 9003, '2021-11-03 11:00:45', '2021-11-03 11:00:55', 0);

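8.4 Implementation

select dt,
       count(distinct uid) dau,
       round(count(distinct if(dt = first_dt, uid, null)) / count(distinct uid), 2) uv_new_ratio
from (
    select uid, dt, min(dt) over (partition by uid) first_dt
    from (
        -- a session that spans midnight makes the user active on both days
        select uid, to_date(in_time) dt from m5_user_log
        union
        select uid, to_date(out_time) dt from m5_user_log
    ) t0
) t1
group by dt
order by dt;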

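Output:

2021-10-31|2|1.00

2021-11-01|3|0.33

2021-11-02|3|0.67

2021-11-03|5|0.40

Explanation:

On October 31, 2021, 2 users were active and both were new, so the share of new users is 1.00.

On November 1, 2021, 3 users were active and 1 of them was new, so the share of new users is 0.33.

On November 2, 2021, 3 users were active (104, 106, 108) and 2 of them were new (104, 106), so the share of new users is 0.67.

On November 3, 2021, 5 users were active and 2 of them were new (105, 109), so the share of new users is 0.40.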

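SQL9 Coins for consecutive sign-ins - Hard (same as SQL42)

9.1 Requirements

Scenario:

artical_id is the ID of the article the user viewed; as a special case, artical_id = 0 means the user was on a non-article page (such as a list page or a campaign page in the app). Note: sign_in is only meaningful when artical_id is 0.

Starting at 00:00 on July 7, 2021, a user earns 1 coin for each daily sign-in and starts accumulating consecutive sign-in days; on the 3rd and 7th consecutive day the user earns an extra 2 and 6 coins respectively.

After every 7 consecutive sign-in days the count restarts (that is, the 8th consecutive day counts as day 1 of a new round and earns 1 coin).

Question:

Compute the coins each user earned per month since July 2021 (the campaign ends at the end of October; sign-ins from November 1 onward earn no coins). Sort the result by month and user ID in ascending order.

Note: if in_time and out_time of a sign-in record fall on different days, only the date of in_time counts as a sign-in.

9.2 Table structure

User behavior log table m6_user_log

9.3 Table creation / data insertion statements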

create table if not exists

m6_user_log

(

uid INT COMMENT '用户ID',

artical_id INT COMMENT '视频ID',

in_time timestamp COMMENT '进入时间',

out_time timestamp COMMENT '离开时间',

sign_in TINYINT COMMENT '是否签到'

)

row format delimited fields terminated by ','

stored as textfile;

INSERT INTO m6_user_log(uid, artical_id, in_time, out_time, sign_in) VALUES

(101, 0, '2021-07-07 10:00:00', '2021-07-07 10:00:09', 1),

(101, 0, '2021-07-08 10:00:00', '2021-07-08 10:00:09', 1),

(101, 0, '2021-07-09 10:00:00', '2021-07-09 10:00:42', 1),

(101, 0, '2021-07-10 10:00:00', '2021-07-10 10:00:09', 1),

(101, 0, '2021-07-11 23:59:55', '2021-07-11 23:59:59', 1),

(101, 0, '2021-07-12 10:00:28', '2021-07-12 10:00:50', 1),

(101, 0, '2021-07-13 10:00:28', '2021-07-13 10:00:50', 1),

(102, 0, '2021-10-01 10:00:28', '2021-10-01 10:00:50', 1),

(102, 0, '2021-10-02 10:00:01', '2021-10-02 10:01:50', 1),

(102, 0, '2021-10-03 11:00:55', '2021-10-03 11:00:59', 1),

(102, 0, '2021-10-04 11:00:45', '2021-10-04 11:00:55', 0),

(102, 0, '2021-10-05 11:00:53', '2021-10-05 11:00:59', 1),

(102, 0, '2021-10-06 11:00:45', '2021-10-06 11:00:55', 1);

9.4 Implementation

select uid, month, sum(coin) coin
from (
    select uid,
           date_format(dt, 'yyyyMM') month,
           -- the position inside the current 7-day round decides the bonus
           case rn % 7 when 3 then 3 when 0 then 7 else 1 end coin
    from (
        select uid, dt,
               row_number() over (partition by uid, grp order by dt) rn
        from (
            -- consecutive dates share the same (date - row_number) anchor
            select uid, dt,
                   date_sub(dt, row_number() over (partition by uid order by dt)) grp
            from (
                select distinct uid, to_date(in_time) dt
                from m6_user_log
                where artical_id = 0 and sign_in = 1
                  and to_date(in_time) between '2021-07-07' and '2021-10-31'
            ) t1
        ) t2
    ) t3
) t4
group by uid, month
order by month, uid;

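Output:

101|202107|15

102|202110|7

Explanation:

User 101 signed in on 7 consecutive days during the campaign and therefore earned 1*7+2+6=15 coins.

User 102 signed in on 3 consecutive days from 10.01 to 10.03 and earned 5 coins;

the streak broke on 10.04, and 2 consecutive sign-ins on 10.05~10.06 earned 2 coins, for a total of 7 coins.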
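
The coin rule (1 coin per day, +2 on day 3 and +6 on day 7 of a streak, restart after day 7 or after a missed day) is compact enough to simulate. The following Python sketch is illustrative only; `coins_by_month` is a hypothetical helper, and it assumes the caller passes only sign-in dates inside the campaign window.

```python
from datetime import date, timedelta


def coins_by_month(sign_dates):
    """Total coins per 'yyyyMM' month for one user's sign-in dates."""
    totals, streak, prev = {}, 0, None
    for day in sorted(set(sign_dates)):
        consecutive = prev is not None and day - prev == timedelta(days=1)
        streak = streak + 1 if consecutive else 1
        pos = (streak - 1) % 7 + 1  # position inside the current 7-day round
        coin = 1 + (2 if pos == 3 else 6 if pos == 7 else 0)
        key = day.strftime("%Y%m")
        totals[key] = totals.get(key, 0) + coin
        prev = day
    return totals


user_101 = [date(2021, 7, d) for d in range(7, 14)]       # 7 consecutive days
user_102 = [date(2021, 10, d) for d in (1, 2, 3, 5, 6)]   # streak broken on Oct 4
print(coins_by_month(user_101), coins_by_month(user_102))
```

An 8-day streak yields 16 coins under this rule, since the 8th day starts a new round at 1 coin.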

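SQL10 Monthly GMV of the mall in 2021 - Easy

10.1 Requirements

Scenario:

When a user orders several items from the cart together, one order is created in the order table (unpaid at this point; status 0 means awaiting payment). When the payment completes, the status of that order is updated to 1 (paid). If the user returns the goods for a refund, a record with a negative total amount is created in the order table (the refund amount; the order number is the refund number, and status 2 means refunded).

Question:

Compute the monthly GMV of the mall in 2021 and output the months whose GMV exceeds 100,000, with values kept as integers.

Note: GMV is the sum of paid and unpaid orders. Sort the result by GMV in ascending order.

10.2 Table structure

Order table n1_order_overall

10.3 Table creation / data insertion statements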

create table if not exists

n1_order_overall

(

order_id INT COMMENT '订单号',

uid INT COMMENT '用户ID',

event_time timestamp COMMENT '下单时间',

total_amount DECIMAL COMMENT '订单总金额',

total_cnt INT COMMENT '订单商品总件数',

status TINYINT COMMENT '订单状态'

)

row format delimited fields terminated by ','

stored as textfile;

INSERT INTO n1_order_overall(order_id, uid, event_time, total_amount, total_cnt, `status`) VALUES (301001, 101, '2021-10-01 10:00:00', 15900, 2, 1), (301002, 101, '2021-10-01 11:00:00', 15900, 2, 1),

(301003, 102, '2021-10-02 10:00:00', 34500, 8, 0), (301004, 103, '2021-10-12 10:00:00', 43500, 9, 1), (301005, 105, '2021-11-01 10:00:00', 31900, 7, 1), (301006, 102, '2021-11-02 10:00:00', 24500, 6, 1), (391007, 102, '2021-11-03 10:00:00', -24500, 6, 2), (301008, 104, '2021-11-04 10:00:00', 55500, 12, 0);

10.4 Implementation

select
    date_format(event_time, 'yyyy-MM') month,
    round(sum(total_amount), 0) GMV
from n1_order_overall
-- refunds (status 2) are excluded; paid and unpaid orders both count
where status <> 2
  and year(event_time) = 2021
group by date_format(event_time, 'yyyy-MM')
having sum(total_amount) > 100000
order by GMV;

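Output:

2021-10|109800

2021-11|111900

Explanation:

October 2021 has 3 paid orders and 1 unpaid order, for a total transaction amount of 109800. November 2021 has 2 paid orders and 1 unpaid order, for a total of 111900 (there is also 1 refund record; because the paid order it refers to is already counted, the refund is not included in GMV).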

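SQL11 Gross margin of each product and of the whole shop - Medium

11.1 Requirements

For shop 901, compute the products whose gross margin since October 2021 is above 24.9%, together with the gross margin of the whole shop.

Note: product gross margin = (1 - purchase price / average unit selling price) * 100%;

shop gross margin = (1 - total purchase cost / total sales revenue) * 100%.

Output the shop gross margin first, then the product gross margins ordered by product ID ascending, all with 1 decimal place.

11.2 Table structure

Order table n3_order_overall

Product information table n4_product_info

Order detail table n5_order_detail

11.3 Table creation / data insertion statements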

create table if not exists

n3_order_overall

(

order_id INT COMMENT '订单号',

uid INT COMMENT '用户ID',

event_time timestamp COMMENT '下单时间',

total_amount DECIMAL COMMENT '订单总金额',

total_cnt INT COMMENT '订单商品总件数',

status TINYINT COMMENT '订单状态'

)

row format delimited fields terminated by ','

stored as textfile;

create table if not exists

n4_product_info

(

product_id INT COMMENT '商品ID',

shop_id INT COMMENT '店铺ID',

tag string COMMENT '商品类别标签',

in_price DECIMAL COMMENT '进货价格',

quantity INT COMMENT '进货数量',

release_time timestamp COMMENT '上架时间'

)

row format delimited fields terminated by ','

stored as textfile;

create table if not exists

n5_order_detail

(

order_id INT COMMENT '订单号',

product_id INT COMMENT '商品ID',

price DECIMAL COMMENT '商品单价',

cnt INT COMMENT '下单数量'

)

row format delimited fields terminated by ','

stored as textfile;

INSERT INTO n3_order_overall(order_id, uid, event_time, total_amount, total_cnt, `status`) VALUES

(301001, 101, '2021-10-01 10:00:00', 30000, 3, 1),

(301002, 102, '2021-10-01 11:00:00', 23900, 2, 1),

(301003, 103, '2021-10-02 10:00:00', 31000, 2, 1);

INSERT INTO n4_product_info(product_id, shop_id, tag, in_price, quantity, release_time) VALUES

(8001, 901, '家电', 6000, 100, '2020-01-01 10:00:00'),

(8002, 902, '家电', 12000, 50, '2020-01-01 10:00:00'),

(8003, 901, '3C数码', 12000, 50, '2020-01-01 10:00:00');

INSERT INTO n5_order_detail(order_id, product_id, price, cnt) VALUES

(301001, 8001, 8500, 2),

(301001, 8002, 15000, 1),

(301002, 8001, 8500, 1),

(301002, 8002, 16000, 1),

(301003, 8002, 14000, 1),

(301003, 8003, 18000, 1);

11.4 Implementation

select product_id,

profit_rate

from(

select '店铺汇总' product_id,

concat(round((1-sum(in_price * cnt) / sum(price * cnt))* 100,1) , '%') as profit_rate

from(

select shop_id,

in_price,

price,

cnt,

date_format(event_time,'yyyy-MM') event_time

from n4_product_info n4 left join n5_order_detail n5

on n4.product_id=n5.product_id

join n3_order_overall n3

on n3.order_id=n5.order_id

) t1

where t1.shop_id=901 and event_time>='2021-10'

group by shop_id

union

select cast(t2.product_id as string) product_id,

concat(round((1-sum(in_price * cnt) / sum(price * cnt))* 100,1),'%') as profit_rate

from(

select n4.product_id product_id,

shop_id,

in_price,

price,

cnt,

date_format(event_time,'yyyy-MM') event_time

from n4_product_info n4 left join n5_order_detail n5

on n4.product_id=n5.product_id

join n3_order_overall n3

on n3.order_id=n5.order_id

) t2

where t2.shop_id=901 and event_time>='2021-10'

group by product_id

having 1-sum(in_price * cnt) / sum(price * cnt) >0.249

order by product_id

) t3;

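Output:

店铺汇总|31.0%

8001|29.4%

8003|33.3%

Explanation:

Shop 901 has two products, 8001 and 8003. Product 8001 sold 3 units for a total of 25500 against a total purchase cost of 18000, a gross margin of 1-18000/25500=29.4%. Product 8003 sold 1 unit at 18000 against a purchase price of 12000, a gross margin of 33.3%.

The 4 units sold by the shop bring in total sales of 43500 against a total purchase cost of 30000, a gross margin of 1-30000/43500=31.0%.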
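
Both margins use the same formula, applied once per product and once over all of the shop's sales. A short Python cross-check follows; `margin_pct` is a hypothetical helper and the rows are the shop 901 sales from the sample data.

```python
# purchase price per product of shop 901, from the sample n4_product_info rows
in_price = {8001: 6000, 8003: 12000}
# (product_id, unit_price, quantity) for shop 901, from the sample n5_order_detail rows
sales = [(8001, 8500, 2), (8001, 8500, 1), (8003, 18000, 1)]


def margin_pct(sales, in_price):
    """Gross margin in percent: (1 - total cost / total revenue) * 100."""
    revenue = sum(price * qty for _, price, qty in sales)
    cost = sum(in_price[pid] * qty for pid, _, qty in sales)
    return round((1 - cost / revenue) * 100, 1)


shop_margin = margin_pct(sales, in_price)
product_margins = {
    pid: margin_pct([s for s in sales if s[0] == pid], in_price)
    for pid in sorted(in_price)
}
print(shop_margin, product_margins)
```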

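SQL12 Top 3 snack products by repurchase rate - Medium

12.1 Requirements

Find the top 3 products by repurchase rate in the snack (零食) category.

Note: the repurchase rate is the share of users who buy a product again within a period of time; the higher it is, the more loyal the consumers are to the brand. It is also called the return rate.

Here we define: repurchase rate of a product = number of users who bought it at least twice in the last 90 days ÷ total number of users who bought it.

The last 90 days include the latest date (taken as the current day). Keep 3 decimal places and sort by repurchase rate descending, then by product ID ascending.

12.2 Table structure

Order table n6_order_overall

Product information table n7_product_info

Order detail table n8_order_detail

12.3 Table creation / data insertion statements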

create table if not exists

n6_order_overall

(

order_id INT COMMENT '订单号',

uid INT COMMENT '用户ID',

event_time timestamp COMMENT '下单时间',

total_amount DECIMAL COMMENT '订单总金额',

total_cnt INT COMMENT '订单商品总件数',

status TINYINT COMMENT '订单状态'

)

row format delimited fields terminated by ','

stored as textfile;

create table if not exists

n7_product_info

(

product_id INT COMMENT '商品ID',

shop_id INT COMMENT '店铺ID',

tag string COMMENT '商品类别标签',

in_price DECIMAL COMMENT '进货价格',

quantity INT COMMENT '进货数量',

release_time timestamp COMMENT '上架时间'

)

row format delimited fields terminated by ','

stored as textfile;

create table if not exists

n8_order_detail

(

order_id INT COMMENT '订单号',

product_id INT COMMENT '商品ID',

price DECIMAL COMMENT '商品单价',

cnt INT COMMENT '下单数量'

)

row format delimited fields terminated by ','

stored as textfile;

INSERT INTO n7_product_info(product_id, shop_id, tag, in_price, quantity, release_time) VALUES

(8001, 901, '零食', 60, 1000, '2020-01-01 10:00:00'),

(8002, 901, '零食', 140, 500, '2020-01-01 10:00:00'),

(8003, 901, '零食', 160, 500, '2020-01-01 10:00:00');

INSERT INTO n6_order_overall(order_id, uid, event_time, total_amount, total_cnt, `status`) VALUES

(301001, 101, '2021-09-30 10:00:00', 140, 1, 1),

(301002, 102, '2021-10-01 11:00:00', 235, 2, 1),

(301011, 102, '2021-10-31 11:00:00', 250, 2, 1),

(301003, 101, '2021-11-02 10:00:00', 300, 2, 1),

(301013, 105, '2021-11-02 10:00:00', 300, 2, 1),

(301005, 104, '2021-11-03 10:00:00', 170, 1, 1);

INSERT INTO n8_order_detail(order_id, product_id, price, cnt) VALUES

(301001, 8002, 150, 1),

(301011, 8003, 200, 1),

(301011, 8001, 80, 1),

(301002, 8001, 85, 1),

(301002, 8003, 180, 1),

(301003, 8002, 140, 1),

(301003, 8003, 180, 1),

(301013, 8002, 140, 2),

(301005, 8003, 180, 1);

12.4 Implementation

select product_id,
       round(sum(if(buy_cnt >= 2, 1, 0)) / count(*), 3) repurchase_rate
from (
    -- number of paid orders per (product, user) within the last 90 days
    select od.product_id, oo.uid, count(*) buy_cnt
    from n8_order_detail od
    join n6_order_overall oo on od.order_id = oo.order_id
    join n7_product_info p on od.product_id = p.product_id
    cross join (select max(to_date(event_time)) max_dt from n6_order_overall) m
    where p.tag = '零食'
      and oo.status = 1
      and datediff(m.max_dt, to_date(oo.event_time)) < 90
    group by od.product_id, oo.uid
) t
group by product_id
order by repurchase_rate desc, product_id
limit 3;

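Output:

8001|1.000

8002|0.500

8003|0.333

Explanation:

Products 8001, 8002, and 8003 are all snack products. Product 8001 was bought only by user 102, twice, so its repurchase rate is 1.000.

Product 8002 was bought twice by user 101 and once by user 105, so its repurchase rate is 0.500.

Product 8003 was bought twice by user 102 and once each by users 101 and 104, so its repurchase rate is 0.333.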
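
The repurchase rate counts users, not orders: a user who bought a product in two or more orders counts once in the numerator. A Python sketch of that counting, with a hypothetical helper `repurchase_rates`; the `(product_id, uid)` pairs are obtained by joining the sample order details to the sample orders, all of which fall inside the 90-day window.

```python
from collections import Counter

# (product_id, uid), one entry per order line in the sample data
purchases = [
    (8002, 101), (8003, 102), (8001, 102), (8001, 102), (8003, 102),
    (8002, 101), (8003, 101), (8002, 105), (8003, 104),
]


def repurchase_rates(purchases, top=3):
    """[(product_id, rate)] sorted by rate descending, then product_id ascending."""
    orders_per_user = Counter(purchases)
    buyers, repeat_buyers = Counter(), Counter()
    for (pid, _uid), n in orders_per_user.items():
        buyers[pid] += 1
        if n >= 2:
            repeat_buyers[pid] += 1
    rates = {pid: round(repeat_buyers[pid] / buyers[pid], 3) for pid in buyers}
    return sorted(rates.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


rates = repurchase_rates(purchases)
print(rates)
```

Order 301013 contains product 8002 with a quantity of 2, but it is still a single purchase by user 105, so it does not count as a repurchase.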

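SQL13 Statistics for drivers who accepted 3 or more orders in Beijing during the 2021 National Day holiday - Easy

13.1 Requirements

Scenario:

After a user submits a ride request, a ride record is created in the user ride record table with order_id set to null.

When a driver accepts the request, an order is created in the ride order table: order_time (acceptance time) and the fields to its left are filled in, while start_time (pickup time, when billing starts) and the fields to its right are all null. The order_id and the acceptance time (stored as end_time, the time the ride request ended) are written to the ride record table. If no driver ever accepts and the request times out, or the user cancels it, only end_time is recorded.

If the passenger or the driver cancels the order before pickup, finish_time of that order in the ride order table is set to the cancellation time and the remaining fields stay null.

When the driver picks up the passenger, start_time in the order table is filled in.

When the order is completed, the finish time, mileage, and fare are filled in; the grade is null until the user rates the driver from 1 to 5 stars.

Question:

For the 7 days of the 2021 National Day holiday, compute the average number of accepted orders and the average part-time income (ignoring platform commission; sum the fares of completed orders directly) of the drivers who accepted at least 3 orders in Beijing. Keep 3 decimal places.

13.2 Table structure

User ride record table p1_get_car_record

Ride order table p2_get_car_order

13.3 Table creation / data insertion statements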

create table if not exists

p1_get_car_record

(

uid INT COMMENT '用户ID',

city string COMMENT '城市',

event_time timestamp COMMENT '打车时间',

end_time timestamp COMMENT '打车结束时间',

order_id INT COMMENT '订单号'

)

row format delimited fields terminated by ','

stored as textfile;

create table if not exists

p2_get_car_order

(

order_id INT COMMENT '订单号',

uid INT COMMENT '用户ID',

driver_id INT COMMENT '司机ID',

order_time timestamp COMMENT '接单时间',

start_time timestamp COMMENT '开始计费的上车时间',

finish_time timestamp COMMENT '订单结束时间',

mileage DOUBLE COMMENT '行驶里程数',

fare DOUBLE COMMENT '费用',

grade TINYINT COMMENT '评分'

)

row format delimited fields terminated by ','

stored as textfile;

INSERT INTO p1_get_car_record(uid, city, event_time, end_time, order_id) VALUES

(101, '北京', '2021-10-01 07:00:00', '2021-10-01 07:02:00', null),

(102, '北京', '2021-10-01 09:00:30', '2021-10-01 09:01:00', 9001),

(101, '北京', '2021-10-01 08:28:10', '2021-10-01 08:30:00', 9002),

(103, '北京', '2021-10-02 07:59:00', '2021-10-02 08:01:00', 9003),

(104, '北京', '2021-10-03 07:59:20', '2021-10-03 08:01:00', 9004),

(105, '北京', '2021-10-01 08:00:00', '2021-10-01 08:02:10', 9005),

(106, '北京', '2021-10-01 17:58:00', '2021-10-01 18:01:00', 9006),

(107, '北京', '2021-10-02 11:00:00', '2021-10-02 11:01:00', 9007),

(108, '北京', '2021-10-02 21:00:00', '2021-10-02 21:01:00', 9008) ;

INSERT INTO p2_get_car_order(order_id, uid, driver_id, order_time, start_time, finish_time, mileage, fare, grade) VALUES

(9002, 101, 201, '2021-10-01 08:30:00', null, '2021-10-01 08:31:00', null, null, null),

(9001, 102, 202, '2021-10-01 09:01:00', '2021-10-01 09:06:00', '2021-10-01 09:31:00', 10.0, 41.5, 5),

(9003, 103, 202, '2021-10-02 08:01:00', '2021-10-02 08:15:00', '2021-10-02 08:31:00', 11.0, 41.5, 4),

(9004, 104, 202, '2021-10-03 08:01:00', '2021-10-03 08:13:00', '2021-10-03 08:31:00', 7.5, 22, 4),

(9005, 105, 203, '2021-10-01 08:02:10', '2021-10-01 08:18:00', '2021-10-01 08:31:00', 15.0, 44, 5),

(9006, 106, 203, '2021-10-01 18:01:00', '2021-10-01 18:09:00', '2021-10-01 18:31:00', 8.0, 25, 5),

(9007, 107, 203, '2021-10-02 11:01:00', '2021-10-02 11:07:00', '2021-10-02 11:31:00', 9.9, 30, 5),

(9008, 108, 203, '2021-10-02 21:01:00', '2021-10-02 21:10:00', '2021-10-02 21:31:00', 13.2, 38, 4);

13.4 Implementation

select city,
       round(avg(order_num), 3) avg_order_num,
       round(avg(income), 3) avg_income
from (
    -- one row per driver: accepted orders and total fare during the holiday
    select p1.city, p2.driver_id, count(*) order_num, sum(p2.fare) income
    from p2_get_car_order p2
    join p1_get_car_record p1 on p2.order_id = p1.order_id
    where p1.city = '北京'
      and to_date(p2.order_time) between '2021-10-01' and '2021-10-07'
    group by p1.city, p2.driver_id
    having count(*) >= 3
) t1
group by city;

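Output:

北京|3.500|121.000

Explanation:

Among the Beijing orders during the 2021 National Day holiday, driver 202 accepted 3 orders with a part-time income of 105; driver 203 accepted 4 orders with a part-time income of 137; driver 201 accepted 1 order, which was cancelled.

The drivers with at least 3 accepted orders are 202 and 203. Together they accepted 7 orders with a total income of 242, so the average number of orders is 3.500 and the average income is 121.000.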
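
The averages are taken over drivers, after filtering to those with at least 3 accepted orders. A minimal Python sketch of that two-step aggregation, with a hypothetical helper `driver_averages`; the `(driver_id, fare)` pairs are the Beijing holiday orders from the sample data, with `None` for the cancelled order.

```python
# (driver_id, fare) for the Beijing orders accepted from 2021-10-01 to 2021-10-07
orders = [
    (201, None),
    (202, 41.5), (202, 41.5), (202, 22),
    (203, 44), (203, 25), (203, 30), (203, 38),
]


def driver_averages(orders, min_orders=3):
    """(average order count, average income) over drivers with >= min_orders orders."""
    per_driver = {}
    for driver, fare in orders:
        count, income = per_driver.get(driver, (0, 0.0))
        per_driver[driver] = (count + 1, income + (fare or 0))  # cancelled orders add no fare
    kept = [stats for stats in per_driver.values() if stats[0] >= min_orders]
    avg_orders = round(sum(c for c, _ in kept) / len(kept), 3)
    avg_income = round(sum(i for _, i in kept) / len(kept), 3)
    return avg_orders, avg_income


averages = driver_averages(orders)
print(averages)
```

Grouping by driver first avoids collapsing two drivers who happen to have the same order count and income into one row.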

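SQL14 Average rating of drivers with cancelled orders - Easy

14.1 Requirements

Find the drivers who had a cancelled order in October 2021 and compute, for each of them, the average rating over all of their completed, rated orders, as well as the overall average rating, with 1 decimal place. Output the drivers ordered by driver_id ascending first, then the overall figure.

14.2 Table structure

User ride record table p3_get_car_record

Ride order table p4_get_car_order

14.3 Table creation / data insertion statements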

create table if not exists

p3_get_car_record

(

uid INT COMMENT '用户ID',

city string COMMENT '城市',

event_time timestamp COMMENT '打车时间',

end_time timestamp COMMENT '打车结束时间',

order_id INT COMMENT '订单号'

)

row format delimited fields terminated by ','

stored as textfile;

create table if not exists

p4_get_car_order

(

order_id INT COMMENT '订单号',

uid INT COMMENT '用户ID',

driver_id INT COMMENT '司机ID',

order_time timestamp COMMENT '接单时间',

start_time timestamp COMMENT '开始计费的上车时间',

finish_time timestamp COMMENT '订单结束时间',

mileage DOUBLE COMMENT '行驶里程数',

fare DOUBLE COMMENT '费用',

grade TINYINT COMMENT '评分'

)

row format delimited fields terminated by ','

stored as textfile;

INSERT INTO p3_get_car_record(uid, city, event_time, end_time, order_id) VALUES