SQL窗口函数综合案例
一、需求
统计每个用户的连续交易总额、连续登录天数、连续登录的开始和结束日期、登录间隔天数
二、数据
-- 建表语句
CREATE TABLE deal (
id INT,
`date` DATE,
amount DECIMAL(10,2)
);
-- 插入语句
INSERT INTO deal (id, `date`, amount) VALUES
(1, '2019-02-08', 6214.23),
(1, '2019-02-08', 6247.32),
(1, '2019-02-09', 85.63),
(1, '2019-02-09', 967.36),
(1, '2019-02-10', 85.69),
(1, '2019-02-12', 769.85),
(1, '2019-02-13', 943.86),
(1, '2019-02-14', 538.42),
(1, '2019-02-15', 369.76),
(1, '2019-02-16', 369.76),
(1, '2019-02-18', 795.15),
(1, '2019-02-19', 715.65),
(1, '2019-02-21', 537.71),
(2, '2019-02-08', 6214.23),
(2, '2019-02-08', 6247.32),
(2, '2019-02-09', 85.63),
(2, '2019-02-09', 967.36),
(2, '2019-02-10', 85.69),
(2, '2019-02-12', 769.85),
(2, '2019-02-13', 943.86),
(2, '2019-02-14', 943.18),
(2, '2019-02-15', 369.76),
(2, '2019-02-18', 795.15),
(2, '2019-02-19', 715.65),
(2, '2019-02-21', 537.71),
(3, '2019-02-08', 6214.23),
(3, '2019-02-08', 6247.32),
(3, '2019-02-09', 85.63),
(3, '2019-02-09', 967.36),
(3, '2019-02-10', 85.69),
(3, '2019-02-12', 769.85),
(3, '2019-02-13', 943.86),
(3, '2019-02-14', 276.81),
(3, '2019-02-15', 369.76),
(3, '2019-02-16', 369.76),
(3, '2019-02-18', 795.15),
(3, '2019-02-19', 715.65),
(3, '2019-02-21', 537.71);
二、思路
1.理解什么是连续交易?
简言之,就是一天接着一天不间断,例如2019-02-08、2019-02-09、2019-02-10是连续的,2019-02-10、2019-02-12是非连续的。
2.如何判断一段日期是连续的?(关键点)
用日期和行号的差值来判断
-
连续日期
假设某用户的交易日期为
2019-02-08
、2019-02-09
、2019-02-10
,对应的行号为1
、2
、3
。那么就有:2019-02-08 - 1 = 2019-02-07
2019-02-09 - 2 = 2019-02-07
2019-02-10 - 3 = 2019-02-07
-
非连续日期
假设某用户的交易日期为
2019-02-08
、2019-02-09
、2019-02-12
,对应的行号为1
、2
、3
。那么就有:2019-02-08 - 1 = 2019-02-07
2019-02-09 - 2 = 2019-02-07
2019-02-12 - 3 = 2019-02-09
可以看到,连续日期会产生相同差值,非连续日期会差生不同的差值。
3.理解什么是间隔天数?
用户上一次连续登录的结束日期与当前连续登录的开始日期间隔的天数
三、实现
1.按日期聚合
SELECT
id,
`date`,
SUM(amount) AS day_amount -- 合并同日多次交易
FROM deal
GROUP BY id, `date` ;
2.生成行号
用开窗函数ROW_NUMBER()生成rn列,为窗口内每一行生成一个唯一行号,且行号在窗口内连续递增
SELECT
id,
`date`,
SUM(amount) AS day_amount,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY `date`) AS rn -- 生成行号
FROM deal
GROUP BY id, `date`;
3.生成判断连续日期的标识
用日期与行号的差值作为连续性的标识
- MySQL中用
DATE_SUB(date, INTERVAL rn DAY)
- Hive中用
date_sub(date, rn)
SELECT
id,
`date`,
day_amount,
rn,
DATE_SUB(`date`, INTERVAL rn DAY) AS grp -- 连续性标识列
FROM
(
SELECT
id,
`date`,
SUM(amount) AS day_amount,
ROW_NUMBER() OVER (
PARTITION BY id
ORDER BY
`date`
) AS rn
FROM
deal
GROUP BY
id,
`date`
) t1
4.计算连续交易总额、连续登录天数、连续登录的开始和结束日期
根据上一步的结果表,按照id、grp列进行分组聚合计算:用sum、count、min、max分别求连续交易总额、连续登录天数、连续登录的开始和结束日期
SELECT
id,
MIN(`date`) AS start_date, -- 聚合计算
MAX(`date`) AS end_date, -- 聚合计算
COUNT(*) AS days, -- 聚合计算
SUM(day_amount) AS total_amount --聚合计算
FROM
(
SELECT
id,
`date`,
day_amount,
rn,
DATE_SUB(`date`, INTERVAL rn DAY) AS grp
FROM
(
SELECT
id,
`date`,
SUM(amount) AS day_amount,
ROW_NUMBER() OVER (
PARTITION BY id
ORDER BY
`date`
) AS rn
FROM
deal
GROUP BY
id,
`date`
) t1
) t2
GROUP BY -- 分组
id,
grp;
5.生成上次登录日期这一列
用开窗函数 LAG()来完成,取出end_date前一行的数据
SELECT
id,
total_amount AS `连续交易总额`,
days AS `连续登录天数`,
start_date AS `开始日期`,
end_date AS `结束日期`,
LAG(end_date, 1) OVER (
PARTITION BY id
ORDER BY start_date
) AS `上次登录日期`
FROM
(
-- 聚合每个连续交易区间的数据
SELECT
id,
MIN(`date`) AS start_date,
MAX(`date`) AS end_date,
COUNT(*) AS days,
SUM(day_amount) AS total_amount
FROM
(
-- 计算连续性标识(grp)
SELECT
id,
`date`,
day_amount,
rn,
DATE_SUB(`date`, INTERVAL rn DAY) AS grp
FROM
(
-- 为每条记录分配行号,并按用户和日期分组
SELECT
id,
`date`,
SUM(amount) AS day_amount,
ROW_NUMBER() OVER (
PARTITION BY id
ORDER BY `date`
) AS rn
FROM
deal
GROUP BY
id,
`date`
) t1
) t2
GROUP BY
id,
grp
) t3;
6.计算间隔天数
本次登录开始日期与上次登录日期中间的天数即为间隔天数
SELECT
id,
total_amount AS `连续交易总额`,
days AS `连续登录天数`,
start_date AS `开始日期`,
end_date AS `结束日期`,
LAG(end_date, 1) OVER ( --计算间隔天数
PARTITION BY id
ORDER BY start_date
) AS `上次登录日期`,
DATEDIFF( --
start_date,
LAG(end_date, 1) OVER (
PARTITION BY id
ORDER BY start_date
)
) - 1 AS `间隔天数`
FROM
(
-- 聚合每个连续交易区间的数据
SELECT
id,
MIN(`date`) AS start_date,
MAX(`date`) AS end_date,
COUNT(*) AS days,
SUM(day_amount) AS total_amount
FROM
(
-- 计算连续性标识(grp)
SELECT
id,
`date`,
day_amount,
rn,
DATE_SUB(`date`, INTERVAL rn DAY) AS grp
FROM
(
-- 为每条记录分配行号,并按用户和日期分组
SELECT
id,
`date`,
SUM(amount) AS day_amount,
ROW_NUMBER() OVER (
PARTITION BY id
ORDER BY `date`
) AS rn
FROM
deal
GROUP BY
id,
`date`
) t1
) t2
GROUP BY
id,
grp
) t3;
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 10年+ .NET Coder 心语 ── 封装的思维:从隐藏、稳定开始理解其本质意义
· 地球OL攻略 —— 某应届生求职总结
· 提示词工程——AI应用必不可少的技术
· Open-Sora 2.0 重磅开源!
· 周边上新:园子的第一款马克杯温暖上架