oracle:实现滑动平均效果


问题背景

1321. 餐馆营业额变化增长 - 力扣(LeetCode)

问题背景
实现思路:
(1) 首先找到对应的7天之末的终点,分别是2019-01-07,2019-01-08,2019-01-09,2019-01-10.
Oracle中可以通过lead函数实现上述效果.但是需要考虑的两种情况:

  • 时间序列连续与不连续
  • 时间序列的重复与不重复
    注:这里的重复是完全相同

(2) 假定在上述(1)已经完全解决,那么只需要从表customer筛选出customer_id7天终结结点序列的样例,组成新表tmp,然后将其与customer表全连接即可,达到初步想要的结果.但事实上,这样的样例并非所有的都满足要求,所以需要进一步筛选.而其对应的筛选条件就是:customervisited_ontmpvisited_on的差值在[0,6]之间即可.
(3) 然后按照tmpvisited_on分组统计即可.
注:当7天终结结点序列中的时间戳customer中存在重复,在分组统计之前,需要对关联表进行去重.然后该节点会重复num次计算(num:表示时间戳重复的次数).
这里的7天终结结点序列就是滑动平均的时间各时间节点.
期望的结果是:sum=each_sum avg=each_avg
而事实上是:sum2=num*each_sum,avg2=sum2/7
在本例中,2019-01-10重复,

  • 不去重,会出现重复计算
select cu.*,tmp.visited_on t_visited_on
        from customer cu,
            (select * from customer 
            where visited_on in (
                select lead(visited_on,6 ) over(order by visited_on ) 
                    from (select distinct visited_on from customer)) 
            and visited_on is not null ) tmp
        where tmp.visited_on-cu.visited_on between 0 and 6
order by tmp.visited_on;

r2

  • 去重后(下面的结果才是期望得到的):
select distinct tmp2.* 
    from (
        select cu.*,tmp.visited_on t_visited_on
        from customer cu,
            (select * from customer 
            where visited_on in (
                select lead(visited_on,6 ) over(order by visited_on ) 
                    from (select distinct visited_on from customer)) 
            and visited_on is not null ) tmp
        where tmp.visited_on-cu.visited_on between 0 and 6
        ) tmp2 
order by t_visited_on;

r3

时间序列连续

数据准备

customer,特点:时间序列基本连续

create table customer(
    customer_id number,
    name varchar2(10),
    visited_on date,
    amount number
);

insert into customer values(1,'John',to_date('2019-01-01','yyyy-mm-dd'),100);
insert into customer values(2,'Daniel',to_date('2019-01-02','yyyy-mm-dd'),110);
insert into customer values(3,'Jade',to_date('2019-01-03','yyyy-mm-dd'),120);
insert into customer values(4,'Khaled',to_date('2019-01-04','yyyy-mm-dd'),130);
insert into customer values(5,'Winston',to_date('2019-01-05','yyyy-mm-dd'),110);
insert into customer values(6,'Elvis',to_date('2019-01-06','yyyy-mm-dd'),140);
insert into customer values(7,'Anna',to_date('2019-01-07','yyyy-mm-dd'),150);
insert into customer values(8,'Maria',to_date('2019-01-08','yyyy-mm-dd'),80);
insert into customer values(9,'Jaze',to_date('2019-01-09','yyyy-mm-dd'),110);
insert into customer values(1,'Jhon',to_date('2019-01-10','yyyy-mm-dd'),130);
insert into customer values(3,'Jade',to_date('2019-01-10','yyyy-mm-dd'),150);

完整的查询语句
select to_char(t_visited_on) visited_on,
        sum(amount) amount,
        round(sum(amount)/7,2)  average_amount
from (
    select distinct tmp2.* 
    from (
        select cu.*,tmp.visited_on t_visited_on
        from customer cu,
            (select * from customer 
            where visited_on in (
                select lead(visited_on,6 ) over(order by visited_on ) 
                    from (select distinct visited_on from customer)) 
            and visited_on is not null ) tmp
        where tmp.visited_on-cu.visited_on between 0 and 6
        ) tmp2 
    )        
group by t_visited_on
order by t_visited_on;

结果与问题背景中的结果一致

时间序列不完全连续

数据准备
create table customer2(
    customer_id number,
    name varchar2(10),
    visited_on date,
    amount number
);

insert into customer2 values(44,'Ashley',to_date('2019-01-04','yyyy-mm-dd'),160);
insert into customer2 values(23,'Sabo',to_date('2019-01-04','yyyy-mm-dd'),70);
insert into customer2 values(38,'Moustafa',to_date('2019-01-05','yyyy-mm-dd'),90);
insert into customer2 values(30,'Halley',to_date('2019-01-06','yyyy-mm-dd'),140);
insert into customer2 values(5,'Elvis',to_date('2019-01-07','yyyy-mm-dd'),160);
insert into customer2 values(12,'Leslie',to_date('2019-01-08','yyyy-mm-dd'),100);
insert into customer2 values(23,'Sabo',to_date('2019-01-08','yyyy-mm-dd'),90);
insert into customer2 values(13,'Will',to_date('2019-01-09','yyyy-mm-dd'),170);
insert into customer2 values(20,'Brock',to_date('2019-01-10','yyyy-mm-dd'),160);
insert into customer2 values(29,'Leo',to_date('2019-01-10','yyyy-mm-dd'),90);
insert into customer2 values(33,'Isaac',to_date('2019-01-11','yyyy-mm-dd'),60);
insert into customer2 values(46,'Selena',to_date('2019-01-12','yyyy-mm-dd'),100);
insert into customer2 values(4,'Winston',to_date('2019-01-13','yyyy-mm-dd'),150);
insert into customer2 values(15,'Marti',to_date('2019-01-13','yyyy-mm-dd'),160);

表2:customer2,问题要求与问题背景描述里面一致
在这里插入图片描述
对应的正确的查询结果:

[["2019-01-10",1230,175.71],
["2019-01-11",1060,151.43],
["2019-01-12",1070,152.86],
["2019-01-13",1240,177.14]]
具体实现

将上述完整程序中的表替换成customer2,其结果是:

/* Write your PL/SQL query statement below */    
select to_char(t_visited_on) visited_on,
        sum(amount) amount,
        round(sum(amount)/7,2)  average_amount
from (
    select distinct tmp2.* 
    from (
        select cu.*,tmp.visited_on t_visited_on
        from customer2 cu,
            (select * from customer2 
            where visited_on in (
                select lead(visited_on,6 ) over(order by visited_on ) 
                    from (select distinct visited_on from customer2)) 
            and visited_on is not null ) tmp
        where tmp.visited_on-cu.visited_on between 0 and 6
        ) tmp2 
    )        
group by t_visited_on
order by t_visited_on;

在最终的结果
时间序列的连续与不连续会影响,会影响滑动平均节点范围的查找时.查找之前去重,是解决该问题的关键.以下是两者的区别对比:

两个语句的结果:

--存在异常
select lead(visited_on,6 ) over(order by visited_on ) from  customer2;

LEAD(VISITED_O
--------------
08-1月 -19
09-1月 -19
10-1月 -19
10-1月 -19
11-1月 -19
12-1月 -19
13-1月 -19
13-1月 -19
--正确
select lead(visited_on,6 ) over(order by visited_on ) 
                    from (select distinct visited_on from customer2);
                    
LEAD(VISITED_O
--------------
10-1月 -19
11-1月 -19
12-1月 -19
13-1月 -19

时间序列存在重复样例

为了方便,对表customer中插入一列,使表中存在重复列.

insert into customer values(9,'Jaze',to_date('2019-01-09','yyyy-mm-dd'),110);

使用上述程序的查询结果为:
r6
理应2019-01-10amount变成1110,平均值也应该变化,但是没有,是因为distinct出现了误伤
r6
所以需要对该约束进行修改,具体程序如下:

/* Write your PL/SQL query statement below */    
select to_char(t_visited_on) visited_on,
        sum(amount)/(count(*)/(select sum(case when t_visited_on-visited_on between 0 and 6 then 1 else 0 end) from customer) ) amount,
        round(sum(amount)/(count(*)/(select sum(case when t_visited_on-visited_on between 0 and 6 then 1 else 0 end) from customer) )/7,2)  average_amount
from (
    select tmp2.* 
    from (
        select cu.*,tmp.visited_on t_visited_on
        from customer cu,
            (select * from customer
            where visited_on in (
                select lead(visited_on,6 ) over(order by visited_on ) 
                    from (select distinct visited_on from customer)) 
            and visited_on is not null ) tmp
        where tmp.visited_on-cu.visited_on between 0 and 6
        ) tmp2 
    )        
group by t_visited_on
order by t_visited_on;

最终结果


结论

结论:在时间序列连续或不连续情况下,找到准确的时间区间间隔是关键.这个案例为Oracle滑动平均提供思路,具体如下:
(1)确定滑动窗口的时间节点,假设以2019-03月为例,时间间隔为3(num),则对应的时间节点为:2019-03-01,2019-03-04,2019-03-07......;并依据该时间点筛选原表,组成tmp1
(2)将表tmp1与原表全连接,条件:时间差在[0,(3-1)num-1],
关联表的重复需要慎重处理,然后再分组统计.


使用开窗函数的方法

1321. 餐馆营业额变化增长 题解 - 力扣(LeetCode)
实现滑动平均,也可以直接应用Oracle中窗口函数,其在查询过程中为当前记录提供一个相关记录集,相应的记录集也会随之改变.分析函数+over()是常见的搭配形式.(oracle学习笔记(六):oracle中排序函数及其应用_数据库_八千里路云和月-CSDN博客,中总结了常见的窗口函数在排序中的应用.
实际应用中,一旦使用了窗口函数,都会为其产生一个可操作的记录集合.而对该记录,可以进一步使用窗口子句,来进一步限定窗口范围.常用的窗口子句包括两类:

利用rows子句的行方式进行限定;

其原理是,在已确定的窗口中,各记录已经按照某种标准进行了排序;而当前记录处于窗口中的特定位置;rows子句以当前记录为参照物,可以向前向后推移,并形成新的结果集,作为最终的操作窗口.
案例:在Emp表中,获得员工信息的同时,获得相邻三个员工的工资综合.

SQL> select e.*,
  2     sum(sal) over(order by empno rows between 1 preceding and 1 following) e_total
  3  from emp e;

     EMPNO ENAME      JOB              MGR HIREDATE              SAL       COMM     DEPTNO    E_TOTAL
---------- ---------- --------- ---------- -------------- ---------- ---------- ---------- ----------
      7369 SMITH      CLERK           7902 17-12月-80        1856.92                    20    4885.12
      7499 ALLEN      SALESMAN        7698 20-2月 -81         3028.2        300         30    7300.89
      7521 WARD       SALESMAN        7698 22-2月 -81        2415.77        500         30    10468.5
      7566 JONES      MANAGER         7839 02-4月 -81        5024.53                    20    9856.07
      7654 MARTIN     SALESMAN        7698 28-9月 -81        2415.77       1400         30   12298.45
      7698 BLAKE      MANAGER         7839 01-5月 -81        4858.15                    30   11599.67
      7782 CLARK      MANAGER         7839 09-6月 -81        4325.75                    10    13709.3
      7788 SCOTT      ANALYST         7566 19-4月 -87         4525.4                    20   16837.15
      7839 KING       PRESIDENT            17-11月-81           7986                    10   15293.19
      7844 TURNER     SALESMAN        7698 08-9月 -81        2781.79          0         30   12963.94
      7876 ADAMS      CLERK           7788 23-5月 -87        2196.15                    20    6954.48
      7900 JAMES      CLERK           7698 03-12月-81        1976.54                    30    8698.09
      7902 FORD       ANALYST         7566 03-12月-81         4525.4                    20    8990.91
      7934 MILLER     CLERK           7782 23-1月 -82        2488.97                    10    7014.37

已选择14行。

其中rows between 1 preceding and 1 following不一定返回三条记录.若排序之后,该记录只有位于首\或者末,则是两条记录的平均值.rows子句与位置相关.因此,在窗口函数中,必须含有排序子句order by.如果布用将会报错.

利用range子句的值方式进行限定.

按照列值进行窗口的进一步限制.
案例:

SQL> select e.*,
  2  count(1) over(partition by deptno order by sal range between 500 preceding and 500 following) new
  3  from emp e;

     EMPNO ENAME      JOB              MGR HIREDATE              SAL       COMM     DEPTNO        NEW
---------- ---------- --------- ---------- -------------- ---------- ---------- ---------- ----------
      7934 MILLER     CLERK           7782 23-1月 -82        2488.97                    10          1
      7782 CLARK      MANAGER         7839 09-6月 -81        4325.75                    10          1
      7839 KING       PRESIDENT            17-11月-81           7986                    10          1
      7369 SMITH      CLERK           7902 17-12月-80        1856.92                    20          2
      7876 ADAMS      CLERK           7788 23-5月 -87        2196.15                    20          2
      7788 SCOTT      ANALYST         7566 19-4月 -87         4525.4                    20          3
      7902 FORD       ANALYST         7566 03-12月-81         4525.4                    20          3
      7566 JONES      MANAGER         7839 02-4月 -81        5024.53                    20          3
      7900 JAMES      CLERK           7698 03-12月-81        1976.54                    30          3
      7521 WARD       SALESMAN        7698 22-2月 -81        2415.77        500         30          4
      7654 MARTIN     SALESMAN        7698 28-9月 -81        2415.77       1400         30          4
      7844 TURNER     SALESMAN        7698 08-9月 -81        2781.79          0         30          4
      7499 ALLEN      SALESMAN        7698 20-2月 -81         3028.2        300         30          2
      7698 BLAKE      MANAGER         7839 01-5月 -81        4858.15                    30          1

已选择14行。
unbounded和current now

unbounded:可以直接代替数值,表示内有任何限制;
current now:直接代表当前行.
对于order by 子句,如果没有显式指定rowsrange子句,那么就相当于使用了rows between unbounded preceding and current row 或者 range between unbounded preceding and current row.但是,当排序后存在相同排名时,则情况会存在些许差异.

基于上述新的知识点,解决上述问题.

SQL> select distinct visited_on,amount,round(amount/7,2)
  2  from (
  3  select to_char(visited_on) visited_on,
  4  sum(amount) over(order by visited_on range between 6  preceding and 0 following ) amount
  5  from customer cu
  6  )
  7  where visited_on in (
  8  select lead(visited_on,6 ) over(order by visited_on )
  9  from (select distinct visited_on from customer))
 10  and visited_on is not null;

VISITED_ON         AMOUNT ROUND(AMOUNT/7,2)
-------------- ---------- -----------------
07-1月 -19            860            122.86
08-1月 -19            840               120
09-1月 -19            950            135.71
10-1月 -19           1110            158.57
SQL> select distinct visited_on,amount,round(amount/7,2)
  2  from (
  3  select to_char(visited_on) visited_on,
  4  sum(amount) over(order by visited_on range between 6  preceding and 0 following ) amount
  5  from customer2 cu
  6  )
  7  where visited_on in (
  8  select lead(visited_on,6 ) over(order by visited_on )
  9  from (select distinct visited_on from customer2))
 10  and visited_on is not null;

VISITED_ON         AMOUNT ROUND(AMOUNT/7,2)
-------------- ---------- -----------------
10-1月 -19           1230            175.71
11-1月 -19           1060            151.43
13-1月 -19           1240            177.14
12-1月 -19           1070            152.86

参考:
<21天学通Oracle,张朝明,陈丹>

posted @ 2020-04-03 17:14  LgRun  阅读(409)  评论(0编辑  收藏  举报