SQL窗口函数综合案例

一、需求

统计每个用户的连续交易总额连续登录天数连续登录的开始和结束日期登录间隔天数

二、数据

-- 建表语句
CREATE TABLE deal (
    id INT,
    `date` DATE,
    amount DECIMAL(10,2)
);

-- 插入语句
INSERT INTO deal (id, `date`, amount) VALUES
(1, '2019-02-08', 6214.23),
(1, '2019-02-08', 6247.32),
(1, '2019-02-09', 85.63),
(1, '2019-02-09', 967.36),
(1, '2019-02-10', 85.69),
(1, '2019-02-12', 769.85),
(1, '2019-02-13', 943.86),
(1, '2019-02-14', 538.42),
(1, '2019-02-15', 369.76),
(1, '2019-02-16', 369.76),
(1, '2019-02-18', 795.15),
(1, '2019-02-19', 715.65),
(1, '2019-02-21', 537.71),
(2, '2019-02-08', 6214.23),
(2, '2019-02-08', 6247.32),
(2, '2019-02-09', 85.63),
(2, '2019-02-09', 967.36),
(2, '2019-02-10', 85.69),
(2, '2019-02-12', 769.85),
(2, '2019-02-13', 943.86),
(2, '2019-02-14', 943.18),
(2, '2019-02-15', 369.76),
(2, '2019-02-18', 795.15),
(2, '2019-02-19', 715.65),
(2, '2019-02-21', 537.71),
(3, '2019-02-08', 6214.23),
(3, '2019-02-08', 6247.32),
(3, '2019-02-09', 85.63),
(3, '2019-02-09', 967.36),
(3, '2019-02-10', 85.69),
(3, '2019-02-12', 769.85),
(3, '2019-02-13', 943.86),
(3, '2019-02-14', 276.81),
(3, '2019-02-15', 369.76),
(3, '2019-02-16', 369.76),
(3, '2019-02-18', 795.15),
(3, '2019-02-19', 715.65),
(3, '2019-02-21', 537.71);

二、思路

1.理解什么是连续交易?

简言之,就是一天接着一天不间断,例如2019-02-08、2019-02-09、2019-02-10是连续的,2019-02-10、2019-02-12是非连续的。

2.如何判断一段日期是连续的?(关键点)

日期行号的差值来判断

  • 连续日期

    假设某用户的交易日期为 2019-02-082019-02-092019-02-10,对应的行号为 123。那么就有:

    • 2019-02-08 - 1 = 2019-02-07
    • 2019-02-09 - 2 = 2019-02-07
    • 2019-02-10 - 3 = 2019-02-07
  • 非连续日期

    假设某用户的交易日期为 2019-02-082019-02-092019-02-12,对应的行号为 123。那么就有:

    • 2019-02-08 - 1 = 2019-02-07
    • 2019-02-09 - 2 = 2019-02-07
    • 2019-02-12 - 3 = 2019-02-09

可以看到,连续日期会产生相同差值,非连续日期会差生不同的差值。

3.理解什么是间隔天数?

用户上一次连续登录的结束日期当前连续登录的开始日期间隔的天数

三、实现

1.按日期聚合

SELECT 
  id,
  `date`,
  SUM(amount) AS day_amount -- 合并同日多次交易
FROM deal
GROUP BY id, `date`  ;

2.生成行号

用开窗函数ROW_NUMBER()生成rn列,为窗口内每一行生成一个唯一行号,且行号在窗口内连续递增

SELECT 
  id,
  `date`,
  SUM(amount) AS day_amount, 
  ROW_NUMBER() OVER (PARTITION BY id ORDER BY `date`) AS rn -- 生成行号
FROM deal
GROUP BY id, `date`;  

3.生成判断连续日期的标识

日期与行号的差值作为连续性的标识

  • MySQL中用DATE_SUB(date, INTERVAL rn DAY)
  • Hive中用 date_sub(date, rn)
SELECT 
  id, 
  `date`, 
  day_amount, 
  rn, 
  DATE_SUB(`date`, INTERVAL rn DAY) AS grp -- 连续性标识列
FROM 
  (
    SELECT 
      id, 
      `date`, 
      SUM(amount) AS day_amount, 
      ROW_NUMBER() OVER (
        PARTITION BY id 
        ORDER BY 
          `date`
      ) AS rn 
    FROM 
      deal 
    GROUP BY 
      id, 
      `date`
  ) t1

4.计算连续交易总额连续登录天数连续登录的开始和结束日期

根据上一步的结果表,按照id、grp列进行分组聚合计算:用sum、count、min、max分别求连续交易总额连续登录天数连续登录的开始和结束日期

SELECT 
  id, 
  MIN(`date`) AS start_date, -- 聚合计算
  MAX(`date`) AS end_date, -- 聚合计算
  COUNT(*) AS days, -- 聚合计算
  SUM(day_amount) AS total_amount --聚合计算
FROM 
  (
    SELECT 
      id, 
      `date`, 
      day_amount, 
      rn, 
      DATE_SUB(`date`, INTERVAL rn DAY) AS grp 
    FROM 
      (
        SELECT 
          id, 
          `date`, 
          SUM(amount) AS day_amount, 
          ROW_NUMBER() OVER (
            PARTITION BY id 
            ORDER BY 
              `date`
          ) AS rn 
        FROM 
          deal 
        GROUP BY 
          id, 
          `date`
      ) t1
  ) t2 
GROUP BY -- 分组
  id, 
  grp;

5.生成上次登录日期这一列

用开窗函数 LAG()来完成,取出end_date前一行的数据

SELECT 
    id, 
    total_amount AS `连续交易总额`, 
    days AS `连续登录天数`, 
    start_date AS `开始日期`, 
    end_date AS `结束日期`, 
    LAG(end_date, 1) OVER ( 
        PARTITION BY id 
        ORDER BY start_date
    ) AS `上次登录日期`
FROM 
    (
        -- 聚合每个连续交易区间的数据
        SELECT 
            id, 
            MIN(`date`) AS start_date, 
            MAX(`date`) AS end_date, 
            COUNT(*) AS days, 
            SUM(day_amount) AS total_amount 
        FROM 
            (
                -- 计算连续性标识(grp)
                SELECT 
                    id, 
                    `date`, 
                    day_amount, 
                    rn, 
                    DATE_SUB(`date`, INTERVAL rn DAY) AS grp 
                FROM 
                    (
                       -- 为每条记录分配行号,并按用户和日期分组
                        SELECT 
                            id, 
                            `date`, 
                            SUM(amount) AS day_amount, 
                            ROW_NUMBER() OVER (
                                PARTITION BY id 
                                ORDER BY `date`
                            ) AS rn 
                        FROM 
                            deal 
                        GROUP BY 
                            id, 
                            `date`
                    ) t1
            ) t2 
        GROUP BY 
            id, 
            grp
    ) t3;

6.计算间隔天数

本次登录开始日期与上次登录日期中间的天数即为间隔天数

SELECT 
    id, 
    total_amount AS `连续交易总额`, 
    days AS `连续登录天数`, 
    start_date AS `开始日期`, 
    end_date AS `结束日期`, 
    LAG(end_date, 1) OVER ( --计算间隔天数
        PARTITION BY id 
        ORDER BY start_date
    ) AS `上次登录日期`, 
    DATEDIFF( -- 
        start_date, 
        LAG(end_date, 1) OVER (
            PARTITION BY id 
            ORDER BY start_date
        )
    ) - 1 AS `间隔天数` 
FROM 
    (
        -- 聚合每个连续交易区间的数据
        SELECT 
            id, 
            MIN(`date`) AS start_date, 
            MAX(`date`) AS end_date, 
            COUNT(*) AS days, 
            SUM(day_amount) AS total_amount 
        FROM 
            (
                -- 计算连续性标识(grp)
                SELECT 
                    id, 
                    `date`, 
                    day_amount, 
                    rn, 
                    DATE_SUB(`date`, INTERVAL rn DAY) AS grp 
                FROM 
                    (
                       -- 为每条记录分配行号,并按用户和日期分组
                        SELECT 
                            id, 
                            `date`, 
                            SUM(amount) AS day_amount, 
                            ROW_NUMBER() OVER (
                                PARTITION BY id 
                                ORDER BY `date`
                            ) AS rn 
                        FROM 
                            deal 
                        GROUP BY 
                            id, 
                            `date`
                    ) t1
            ) t2 
        GROUP BY 
            id, 
            grp
    ) t3;

posted @   WangYao_BigData  阅读(10)  评论(0编辑  收藏  举报
相关博文:
阅读排行:
· 10年+ .NET Coder 心语 ── 封装的思维:从隐藏、稳定开始理解其本质意义
· 地球OL攻略 —— 某应届生求职总结
· 提示词工程——AI应用必不可少的技术
· Open-Sora 2.0 重磅开源!
· 周边上新:园子的第一款马克杯温暖上架
点击右上角即可分享
微信分享提示