Hive SQL study notes: window functions

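RANK(): tied rows share the same rank and the ranks after a tie are skipped, so the highest rank still equals the row count
DENSE_RANK(): tied rows share the same rank and no ranks are skipped, so the highest rank shrinks when there are ties
ROW_NUMBER(): numbers the rows 1, 2, 3, ... in sort order, even when values tie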

Test data

cookieid,  creattime,   pv

cookie1,   2017-12-10,  1
cookie1,   2017-12-11,  5
cookie1,   2017-12-12,  7
cookie1,   2017-12-13,  3
cookie1,   2017-12-14,  2
cookie1,   2017-12-15,  4
cookie1,   2017-12-16,  4

cookie2,   2017-12-12,  7
cookie2,   2017-12-16,  6
cookie2,   2017-12-24,  1

cookie3,   2017-12-22,  5

a,         2017-12-01,  3
b,         2017-12-00,  3

 

Experiment 1:

SELECT
cookieid,
creattime,
pv,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY creattime) AS pv1, -- default frame: running total of pv from the start of the partition to the current row
SUM(pv) OVER(PARTITION BY cookieid ORDER BY creattime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- explicit frame from the start to the current row; same result as pv1 here
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY creattime) AS rn
FROM
dim.test_stu_info_study;

Experiment 1 results:

cookieid  creattime   pv  pv1  pv2  rn

cookie3   2017-12-22  5   5    5    1
cookie1   2017-12-10  1   1    1    1
cookie1   2017-12-11  5   6    6    2
cookie1   2017-12-12  7   13   13   3
cookie1   2017-12-13  3   16   16   4
cookie1   2017-12-14  2   18   18   5
cookie1   2017-12-15  4   22   22   6  (rn keeps counting even when values tie)
cookie1   2017-12-16  4   26   26   7
b         2017-12-00  3   3    3    1
cookie2   2017-12-12  7   7    7    1
cookie2   2017-12-16  6   13   13   2
cookie2   2017-12-24  1   14   14   3
a         2017-12-01  3   3    3    1
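One subtlety the test data hides: when ORDER BY is given without a frame clause, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, not ROWS. The two only agree while the ORDER BY values are unique inside a partition, as creattime is here. The following is a minimal sketch of the difference, using Python's built-in sqlite3 module (SQLite 3.25 or later assumed) as a stand-in for Hive; the table t, the key x and the duplicated date are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k TEXT, d TEXT, pv INTEGER)")
# Hypothetical rows: two of them share the same date, i.e. they tie on ORDER BY d.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("x", "2017-12-10", 1), ("x", "2017-12-11", 5), ("x", "2017-12-11", 7)],
)

rows = conn.execute("""
    SELECT pv,
           -- default frame (RANGE): tied rows are peers and are summed together
           SUM(pv) OVER (PARTITION BY k ORDER BY d) AS range_sum,
           -- explicit ROWS frame: strictly row by row
           -- (pv is added to ORDER BY only to make the tie order deterministic)
           SUM(pv) OVER (PARTITION BY k ORDER BY d, pv
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_sum
    FROM t
    ORDER BY pv
""").fetchall()

for row in rows:
    print(row)
# (1, 1, 1)
# (5, 13, 6)
# (7, 13, 13)
```

With the RANGE default, both rows dated 2017-12-11 get 13 because each one's frame includes its peer. Writing the ROWS frame explicitly is the safer choice when a strict row-by-row running total is wanted.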

Experiment 2:

SELECT
cookieid,
creattime,
pv,
AVG(pv) OVER(PARTITION BY cookieid ORDER BY creattime) AS pv1, -- default frame: running average (sum of pv from the start to the current row / number of those rows)
AVG(pv) OVER(PARTITION BY cookieid ORDER BY creattime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- explicit frame from the start to the current row; same result as pv1 here
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY creattime) AS rn
FROM
dim.test_stu_info_study;

Experiment 2 results (averages shown truncated to two decimal places):

cookieid  creattime   pv  pv1   pv2   rn

cookie3   2017-12-22  5   5.0   5.0   1
cookie1   2017-12-10  1   1.0   1.0   1
cookie1   2017-12-11  5   3.0   3.0   2
cookie1   2017-12-12  7   4.33  4.33  3
cookie1   2017-12-13  3   4.0   4.0   4
cookie1   2017-12-14  2   3.6   3.6   5
cookie1   2017-12-15  4   3.66  3.66  6
cookie1   2017-12-16  4   3.71  3.71  7
b         2017-12-00  3   3.0   3.0   1
cookie2   2017-12-12  7   7.0   7.0   1
cookie2   2017-12-16  6   6.5   6.5   2
cookie2   2017-12-24  1   4.66  4.66  3
a         2017-12-01  3   3.0   3.0   1

Experiment 3:

SELECT
cookieid,
creattime,
pv,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY creattime) AS pv1, -- default frame: running total of pv from the start of the partition to the current row
SUM(pv) OVER(PARTITION BY cookieid ORDER BY creattime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- explicit frame from the start to the current row; same result as pv1 here
SUM(pv) OVER(PARTITION BY cookieid ORDER BY creattime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv3, -- current row plus the 3 rows before it (up to 4 rows summed)
SUM(pv) OVER(PARTITION BY cookieid ORDER BY creattime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv4, -- current row plus the 3 rows before it and the 1 row after it (up to 5 rows summed)
SUM(pv) OVER(PARTITION BY cookieid ORDER BY creattime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv5, -- current row plus all rows after it (the first row holds the partition total, and the value shrinks row by row)
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY creattime) AS rn
FROM
dim.test_stu_info_study;

Experiment 3 results:

cookieid  creattime   pv  pv1  pv2  pv3  pv4  pv5  rn

cookie3   2017-12-22  5   5    5    5    5    5    1
cookie1   2017-12-10  1   1    1    1    6    26   1
cookie1   2017-12-11  5   6    6    6    13   25   2
cookie1   2017-12-12  7   13   13   13   16   20   3
cookie1   2017-12-13  3   16   16   16   18   13   4
cookie1   2017-12-14  2   18   18   17   21   10   5
cookie1   2017-12-15  4   22   22   16   20   8    6
cookie1   2017-12-16  4   26   26   13   13   4    7
b         2017-12-00  3   3    3    3    3    3    1
cookie2   2017-12-12  7   7    7    7    13   14   1
cookie2   2017-12-16  6   13   13   13   14   7    2
cookie2   2017-12-24  1   14   14   14   14   1    3
a         2017-12-01  3   3    3    3    3    3    1
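The three sliding frames (pv3, pv4, pv5) can be re-checked on the cookie1 rows alone. This is a hedged sketch that runs the same frame clauses through Python's built-in sqlite3 module (SQLite 3.25 or later assumed) instead of Hive; the in-memory table t is an assumption made for the example, the pv values are the cookie1 rows from the test data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (cookieid TEXT, creattime TEXT, pv INTEGER)")
pvs = [1, 5, 7, 3, 2, 4, 4]  # cookie1, 2017-12-10 .. 2017-12-16
conn.executemany(
    "INSERT INTO t VALUES ('cookie1', ?, ?)",
    [("2017-12-1%d" % i, pv) for i, pv in enumerate(pvs)],
)

rows = conn.execute("""
    SELECT
        SUM(pv) OVER (PARTITION BY cookieid ORDER BY creattime
                      ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)         AS pv3,
        SUM(pv) OVER (PARTITION BY cookieid ORDER BY creattime
                      ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING)         AS pv4,
        SUM(pv) OVER (PARTITION BY cookieid ORDER BY creattime
                      ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv5
    FROM t
    ORDER BY creattime
""").fetchall()

pv3 = [r[0] for r in rows]
pv4 = [r[1] for r in rows]
pv5 = [r[2] for r in rows]
print(pv3)  # [1, 6, 13, 16, 17, 16, 13]
print(pv4)  # [6, 13, 16, 18, 21, 20, 13]
print(pv5)  # [26, 25, 20, 13, 10, 8, 4]
```

Near the partition edges the frame is simply cut off: the first row of pv3 has no preceding rows, so it sums only itself, and the last row of pv4 has no following row.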

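Notes:

How window functions differ from aggregate functions:

Used as a window function, sum() returns a value for every row, computed over that row's own window, so there are as many sum values as there are rows,

whereas group by computes one sum per group, so each group yields a single value.

Here sum() is a running total accumulated row by row in the partition's sort order, so the result depends on order by.

Without order by, the partition is not sorted and sum() returns the total pv of the whole partition on every row.

Note: max() follows the same rule. Without order by it returns the maximum of the whole partition; with order by and no explicit frame it returns the running maximum from the start of the partition to the current row.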
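The effect of ORDER BY on the default frame is easy to check for both SUM and MAX. The sketch below again uses Python's built-in sqlite3 module (SQLite 3.25 or later assumed) as a stand-in for Hive, which applies the same default frame; the table t is an assumption, the pv values are the cookie1 rows from the test data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (cookieid TEXT, creattime TEXT, pv INTEGER)")
pvs = [1, 5, 7, 3, 2, 4, 4]  # cookie1, 2017-12-10 .. 2017-12-16
conn.executemany(
    "INSERT INTO t VALUES ('cookie1', ?, ?)",
    [("2017-12-1%d" % i, pv) for i, pv in enumerate(pvs)],
)

rows = conn.execute("""
    SELECT
        SUM(pv) OVER (PARTITION BY cookieid)                    AS sum_all,
        SUM(pv) OVER (PARTITION BY cookieid ORDER BY creattime) AS sum_running,
        MAX(pv) OVER (PARTITION BY cookieid)                    AS max_all,
        MAX(pv) OVER (PARTITION BY cookieid ORDER BY creattime) AS max_running
    FROM t
    ORDER BY creattime
""").fetchall()

sum_all     = [r[0] for r in rows]
sum_running = [r[1] for r in rows]
max_all     = [r[2] for r in rows]
max_running = [r[3] for r in rows]
print(sum_all)      # [26, 26, 26, 26, 26, 26, 26]
print(sum_running)  # [1, 6, 13, 16, 18, 22, 26]
print(max_all)      # [7, 7, 7, 7, 7, 7, 7]
print(max_running)  # [1, 5, 7, 7, 7, 7, 7]
```

The running maximum only looks like a partition-wide maximum once the largest value has been passed, which is why the two are easy to confuse on small data.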

Experiment 4:

SELECT
cookieid,
creattime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv) AS pv2, -- ties share a rank; the ranks after a tie are skipped
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv) AS pv3, -- ties share a rank; no ranks are skipped
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv) AS rn -- sequential numbering, even across ties
FROM
dim.test_stu_info_study;

Experiment 4 results (pv2 = RANK, pv3 = DENSE_RANK):

cookieid  creattime   pv  pv2  pv3  rn

cookie3   2017-12-22  5   1    1    1
cookie1   2017-12-10  1   1    1    1
cookie1   2017-12-14  2   2    2    2
cookie1   2017-12-13  3   3    3    3
cookie1   2017-12-15  4   4    4    4
cookie1   2017-12-16  4   4    4    5
cookie1   2017-12-11  5   6    5    6
cookie1   2017-12-12  7   7    6    7
b         2017-12-00  3   1    1    1
cookie2   2017-12-24  1   1    1    1
cookie2   2017-12-16  6   2    2    2
cookie2   2017-12-12  7   3    3    3
a         2017-12-01  3   1    1    1
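The three ranking functions differ only in how they treat the tie at pv = 4. As a hedged illustration, the sketch below reproduces the cookie1 rows with Python's built-in sqlite3 module (SQLite 3.25 or later assumed) standing in for Hive; the table t and the column aliases rnk and dense are assumptions chosen for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (cookieid TEXT, creattime TEXT, pv INTEGER)")
pvs = [1, 5, 7, 3, 2, 4, 4]  # cookie1, 2017-12-10 .. 2017-12-16
conn.executemany(
    "INSERT INTO t VALUES ('cookie1', ?, ?)",
    [("2017-12-1%d" % i, pv) for i, pv in enumerate(pvs)],
)

rows = conn.execute("""
    SELECT pv,
           RANK()       OVER (PARTITION BY cookieid ORDER BY pv) AS rnk,
           DENSE_RANK() OVER (PARTITION BY cookieid ORDER BY pv) AS dense,
           ROW_NUMBER() OVER (PARTITION BY cookieid ORDER BY pv) AS rn
    FROM t
    ORDER BY pv, rn
""").fetchall()

for row in rows:
    print(row)
# (1, 1, 1, 1)
# (2, 2, 2, 2)
# (3, 3, 3, 3)
# (4, 4, 4, 4)
# (4, 4, 4, 5)
# (5, 6, 5, 6)
# (7, 7, 6, 7)
```

Which of the two pv = 4 rows receives rn 4 and which receives rn 5 is not defined by ORDER BY pv alone; adding a second sort key such as creattime makes ROW_NUMBER() deterministic.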

 

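Difference between window functions and aggregate functions

select ename,sal,sum(sal) over (partition by ename order by sal,empno) as running_total
from emp1
order by 2;

Running total of sal within each ename


A window function with over() differs from a plain aggregate in that it returns a row for every input row in the group, whereas an aggregate function with group by returns only one row per group.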


SQL> select e.empno,e.ename,e.job,e.sal,e.deptno, sum(e.sal) over (partition by e.deptno) as total_sal
  2  from emp e;

     EMPNO ENAME      JOB              SAL     DEPTNO  TOTAL_SAL
---------- ---------- --------- ---------- ---------- ----------
      7782 CLARK      MANAGER         2450         10       8750
      7839 KING       PRESIDENT       5000         10       8750
      7934 MILLER     CLERK           1300         10       8750
      7566 JONES      MANAGER         2975         20      10875
      7902 FORD       ANALYST         3000         20      10875
      7876 ADAMS      CLERK           1100         20      10875
      7369 SMITH      CLERK            800         20      10875
      7788 SCOTT      ANALYST         3000         20      10875
      7521 WARD       SALESMAN        1250         30       9400
      7844 TURNER     SALESMAN        1500         30       9400
      7499 ALLEN      SALESMAN        1600         30       9400
      7900 JAMES      CLERK            950         30       9400
      7698 BLAKE      MANAGER         2850         30       9400
      7654 MARTIN     SALESMAN        1250         30       9400

14 rows selected.


Aggregate function:
SQL> select sum(sal) ,deptno from emp group by deptno;

  SUM(SAL)     DEPTNO
---------- ----------
      9400         30
     10875         20
      8750         10
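The row-count difference between the two queries can be reproduced on a subset of the emp rows shown in the transcript (departments 10 and 20 only). This is a rough sketch using Python's built-in sqlite3 module (SQLite 3.25 or later assumed) rather than Oracle; the trimmed-down table layout is an assumption for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (empno INTEGER, ename TEXT, sal INTEGER, deptno INTEGER)")
# Subset of the rows from the transcript: departments 10 and 20.
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?, ?)",
    [
        (7782, "CLARK", 2450, 10), (7839, "KING", 5000, 10), (7934, "MILLER", 1300, 10),
        (7566, "JONES", 2975, 20), (7902, "FORD", 3000, 20), (7876, "ADAMS", 1100, 20),
        (7369, "SMITH", 800, 20), (7788, "SCOTT", 3000, 20),
    ],
)

# Window function: every employee row is kept and carries its department total.
window_rows = conn.execute("""
    SELECT empno, deptno, SUM(sal) OVER (PARTITION BY deptno) AS total_sal
    FROM emp
    ORDER BY deptno, empno
""").fetchall()

# Aggregate with GROUP BY: the rows collapse to one per department.
group_rows = conn.execute("""
    SELECT deptno, SUM(sal) FROM emp GROUP BY deptno ORDER BY deptno
""").fetchall()

print(len(window_rows))  # 8
print(group_rows)        # [(10, 8750), (20, 10875)]
```

The window version is the one to reach for when the detail columns (empno, ename, sal) are needed next to the total, for example to compute each salary's share of the department total.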

 

posted on 2019-10-21 13:49  大鹏的鸿鹄之志