Hive开窗函数整理

分析函数用于计算基于组的某种聚合值,它和聚合函数的不同之处是:对于每个组返回多行,而聚合函数对于每个组只返回一行。

开窗函数指定了分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变化而变化!到底什么是数据窗口?后面举例会详细讲到!

基础结构:

分析函数(如:sum(),max(),row_number()...) + 窗口子句(over函数)

over函数写法: over(partition by cookieid order by createtime) 先根据cookieid字段分区,相同的cookieid分为一区,每个分区内根据createtime字段排序(默认升序)

注:不加 partition by 的话则把整个数据集当作一个分区,不加 order by的话会对某些函数统计结果产生影响,如sum()

版本:1.1.0-cdh5.13.3

测试表

create table if not exists test (
cookie_id string,
create_time string,
pv int
)row format delimited fields terminated by ',';

 

测试数据

a,2017-12-01,3
b,2017-12-02,3
cookie1,2017-12-02,4
cookie1,2017-12-03,2
cookie1,2017-12-04,3
cookie1,2017-12-05,1
cookie1,2017-12-06,6
cookie1,2017-12-07,7
cookie2,2017-12-02,1
cookie2,2017-12-04,2
cookie3,2017-12-06,7
cookie3,2017-12-03,5

SUM、AVG、MIN、MAX

用于实现分组内所有和连续累积的统计。

以SUM举例

SELECT cookie_id,create_time,pv,
SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time) AS pv1, -- 默认为从起点到当前行
SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1 
SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv3,   --当前行+往前3行
SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv4,    --当前行+往前3行+往后1行
SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv5   ---当前行+往后所有行  
FROM test;

 

结果:

+------------+--------------+-----+------+------+------+------+------+--+
| cookie_id  | create_time  | pv  | pv1  | pv2  | pv3  | pv4  | pv5  |
+------------+--------------+-----+------+------+------+------+------+--+
| a          | 2017-12-01   | 3   | 3    | 3    | 3    | 3    | 3    |
| b          | 2017-12-02   | 3   | 3    | 3    | 3    | 3    | 3    |
| cookie1    | 2017-12-02   | 4   | 4    | 4    | 4    | 6    | 23   |
| cookie1    | 2017-12-03   | 2   | 6    | 6    | 6    | 9    | 19   |
| cookie1    | 2017-12-04   | 3   | 9    | 9    | 9    | 10   | 17   |
| cookie1    | 2017-12-05   | 1   | 10   | 10   | 10   | 16   | 14   |
| cookie1    | 2017-12-06   | 6   | 16   | 16   | 12   | 19   | 13   |
| cookie1    | 2017-12-07   | 7   | 23   | 23   | 17   | 17   | 7    |
| cookie2    | 2017-12-02   | 1   | 1    | 1    | 1    | 3    | 3    |
| cookie2    | 2017-12-04   | 2   | 3    | 3    | 3    | 3    | 2    |
| cookie3    | 2017-12-03   | 5   | 5    | 5    | 5    | 12   | 12   |
| cookie3    | 2017-12-06   | 7   | 12   | 12   | 12   | 12   | 7    |
+------------+--------------+-----+------+------+------+------+------+--+

 

注:这些窗口的划分都是在分区内部!超过分区大小就无效了

可以看到如果不指定ROWS BETWEEN,默认统计窗口为从起点到当前行;

关键是理解 ROWS BETWEEN 含义,也叫做window子句:

  • PRECEDING:往前

  • FOLLOWING:往后

  • CURRENT ROW:当前行

  • UNBOUNDED:无边界,UNBOUNDED PRECEDING 表示从最前面的起点开始, UNBOUNDED FOLLOWING:表示到最后面的终点

其他AVG,MIN,MAX,和SUM用法一样

如果不加ORDER BY 会怎样

select cookie_id,create_time,pv,
sum(pv) over(PARTITION BY cookie_id) as pv1 
FROM test;

 

结果

+------------+--------------+-----+------+--+
| cookie_id  | create_time  | pv  | pv1  |
+------------+--------------+-----+------+--+
| a          | 2017-12-01   | 3   | 3    |
| b          | 2017-12-02   | 3   | 3    |
| cookie1    | 2017-12-07   | 7   | 23   |
| cookie1    | 2017-12-06   | 6   | 23   |
| cookie1    | 2017-12-05   | 1   | 23   |
| cookie1    | 2017-12-04   | 3   | 23   |
| cookie1    | 2017-12-03   | 2   | 23   |
| cookie1    | 2017-12-02   | 4   | 23   |
| cookie2    | 2017-12-04   | 2   | 3    |
| cookie2    | 2017-12-02   | 1   | 3    |
| cookie3    | 2017-12-03   | 5   | 12   |
| cookie3    | 2017-12-06   | 7   | 12   |
+------------+--------------+-----+------+--+

 

可以看到,如果没有order by,不仅分区内没有排序,sum()计算的pv也是整个分区的pv

注:max()函数无论有没有order by 都是计算整个分区的最大值

ROW_NUMBER 函数

ROW_NUMBER() 从1开始,按照顺序,生成分组内记录的序列

ROW_NUMBER() 的应用场景非常多,比如获取分组内排序Top N的记录、获取一个session中的第一条refer等。

SELECT cookie_id,create_time,pv,
ROW_NUMBER() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS rn  
FROM test;

 

结果

+------------+--------------+-----+-----+--+
| cookie_id  | create_time  | pv  | rn  |
+------------+--------------+-----+-----+--+
| a          | 2017-12-01   | 3   | 1   |
| b          | 2017-12-02   | 3   | 1   |
| cookie1    | 2017-12-07   | 7   | 1   |
| cookie1    | 2017-12-06   | 6   | 2   |
| cookie1    | 2017-12-02   | 4   | 3   |
| cookie1    | 2017-12-04   | 3   | 4   |
| cookie1    | 2017-12-03   | 2   | 5   |
| cookie1    | 2017-12-05   | 1   | 6   |
| cookie2    | 2017-12-04   | 2   | 1   |
| cookie2    | 2017-12-02   | 1   | 2   |
| cookie3    | 2017-12-06   | 7   | 1   |
| cookie3    | 2017-12-03   | 5   | 2   |
+------------+--------------+-----+-----+--+

 

RANK 和 DENSE_RANK 函数

RANK() 生成数据项在分组中的排名,排名相等会在名次中留下空位 DENSE_RANK() 生成数据项在分组中的排名,排名相等会在名次中不会留下空位

我们把 rankdense_rankrow_number三者对比,这样比较清晰:

SELECT cookie_id,create_time,pv,
RANK() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS rank_res,
DENSE_RANK() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS dense_rank_res,
ROW_NUMBER() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS row_number_res
FROM test;

 

结果:

  1. 为了突出区别,对test表做了一点调整,将数据cookie1,2017-12-02,4改为了cookie1,2017-12-02,3

  2. 主要看cookie1,2017-12-04和2017-12-02的pv都是3

+------------+--------------+-----+-----------+-----------------+-----------------+--+
| cookie_id  | create_time  | pv  | rank_res  | dense_rank_res  | row_number_res  |
+------------+--------------+-----+-----------+-----------------+-----------------+--+
| a          | 2017-12-01   | 3   | 1         | 1               | 1               |
| b          | 2017-12-02   | 3   | 1         | 1               | 1               |
| cookie1    | 2017-12-07   | 7   | 1         | 1               | 1               |
| cookie1    | 2017-12-06   | 6   | 2         | 2               | 2               |
| cookie1    | 2017-12-04   | 3   | 3         | 3               | 3               |
| cookie1    | 2017-12-02   | 3   | 3         | 3               | 4               |
| cookie1    | 2017-12-03   | 2   | 5         | 4               | 5               |
| cookie1    | 2017-12-05   | 1   | 6         | 5               | 6               |
| cookie2    | 2017-12-04   | 2   | 1         | 1               | 1               |
| cookie2    | 2017-12-02   | 1   | 2         | 2               | 2               |
| cookie3    | 2017-12-06   | 7   | 1         | 1               | 1               |
| cookie3    | 2017-12-03   | 5   | 2         | 2               | 2               |
+------------+--------------+-----+-----------+-----------------+-----------------+--+

 

LAG 和 LEAD 函数

LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值

第一个参数为列名,第二个参数为往上第n行(默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)

SELECT cookie_id,create_time,
LAG(create_time,1) OVER(PARTITION BY cookie_id ORDER BY create_time) AS lag1,
LAG(create_time,1,'1970-01-01') OVER(PARTITION BY cookie_id ORDER BY create_time) AS lag1_with_default
FROM test;

 

结果:

+------------+--------------+-------------+--------------------+--+
| cookie_id  | create_time  |    lag1     | lag1_with_default  |
+------------+--------------+-------------+--------------------+--+
| a          | 2017-12-01   | NULL        | 1970-01-01         |
| b          | 2017-12-02   | NULL        | 1970-01-01         |
| cookie1    | 2017-12-02   | NULL        | 1970-01-01         |
| cookie1    | 2017-12-03   | 2017-12-02  | 2017-12-02         |
| cookie1    | 2017-12-04   | 2017-12-03  | 2017-12-03         |
| cookie1    | 2017-12-05   | 2017-12-04  | 2017-12-04         |
| cookie1    | 2017-12-06   | 2017-12-05  | 2017-12-05         |
| cookie1    | 2017-12-07   | 2017-12-06  | 2017-12-06         |
| cookie2    | 2017-12-02   | NULL        | 1970-01-01         |
| cookie2    | 2017-12-04   | 2017-12-02  | 2017-12-02         |
| cookie3    | 2017-12-03   | NULL        | 1970-01-01         |
| cookie3    | 2017-12-06   | 2017-12-03  | 2017-12-03         |
+------------+--------------+-------------+--------------------+--+

 

LEAD 函数则与 LAG 相反: LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值

第一个参数为列名,第二个参数为往下第n行(默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)

SELECT cookie_id,create_time,
LEAD(create_time,1) OVER(PARTITION BY cookie_id ORDER BY create_time) AS lag2,
LEAD(create_time,1,'1970-01-01') OVER(PARTITION BY cookie_id ORDER BY create_time) AS lag2_with_default
FROM test;

 

结果:

+------------+--------------+-------------+--------------------+--+
| cookie_id  | create_time  |    lag2     | lag2_with_default  |
+------------+--------------+-------------+--------------------+--+
| a          | 2017-12-01   | NULL        | 1970-01-01         |
| b          | 2017-12-02   | NULL        | 1970-01-01         |
| cookie1    | 2017-12-02   | 2017-12-03  | 2017-12-03         |
| cookie1    | 2017-12-03   | 2017-12-04  | 2017-12-04         |
| cookie1    | 2017-12-04   | 2017-12-05  | 2017-12-05         |
| cookie1    | 2017-12-05   | 2017-12-06  | 2017-12-06         |
| cookie1    | 2017-12-06   | 2017-12-07  | 2017-12-07         |
| cookie1    | 2017-12-07   | NULL        | 1970-01-01         |
| cookie2    | 2017-12-02   | 2017-12-04  | 2017-12-04         |
| cookie2    | 2017-12-04   | NULL        | 1970-01-01         |
| cookie3    | 2017-12-03   | 2017-12-06  | 2017-12-06         |
| cookie3    | 2017-12-06   | NULL        | 1970-01-01         |
+------------+--------------+-------------+--------------------+--+

 

GROUPING SETS,GROUPING__ID,CUBE,ROLLUP

准备数据:

+----------+-------------+-----------+--+
|  month   |     day     | cookieid  |
+----------+-------------+-----------+--+
| 2015-03  | 2015-03-10  | cookie1   |
| 2015-03  | 2015-03-10  | cookie5   |
| 2015-03  | 2015-03-12  | cookie7   |
| 2015-04  | 2015-04-12  | cookie3   |
| 2015-04  | 2015-04-13  | cookie2   |
| 2015-04  | 2015-04-13  | cookie4   |
| 2015-04  | 2015-04-16  | cookie4   |
| 2015-03  | 2015-03-10  | cookie2   |
| 2015-03  | 2015-03-10  | cookie3   |
| 2015-04  | 2015-04-12  | cookie5   |
| 2015-04  | 2015-04-13  | cookie6   |
| 2015-04  | 2015-04-15  | cookie3   |
| 2015-04  | 2015-04-15  | cookie2   |
| 2015-04  | 2015-04-16  | cookie1   |
+----------+-------------+-----------+--+

表:

create table test2(
month STRING,
day STRING, 
cookieid STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

 

GROUPING SETS

在一个GROUP BY查询中,根据不同的维度组合进行聚合,等价于将不同维度的GROUP BY结果集进行UNION ALL

SELECT month
    ,day
    ,COUNT(DISTINCT cookieid) AS uv
FROM test2
GROUP BY month,day 
GROUPING SETS(month, day);

 

结果:

+----------+-------------+-----+--+
|  month   |     day     | uv  |
+----------+-------------+-----+--+
| NULL     | 2015-03-10  | 4   |
| NULL     | 2015-03-12  | 1   |
| NULL     | 2015-04-12  | 2   |
| NULL     | 2015-04-13  | 3   |
| NULL     | 2015-04-15  | 2   |
| NULL     | 2015-04-16  | 2   |
| 2015-03  | NULL        | 5   |
| 2015-04  | NULL        | 6   |
+----------+-------------+-----+--+

等价于

SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv FROM test2 GROUP BY month 
UNION ALL 
SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv FROM test2 GROUP BY day;

 

增强版

SELECT 
    month,
    day,
    COUNT(DISTINCT cookieid) AS uv,
    GROUPING__ID 
FROM test2 
    GROUP BY month,day 
    GROUPING SETS (month,day,(month,day)) 
    ORDER BY GROUPING__ID;

 

其中的 GROUPING__ID,表示结果属于哪一个分组集合。

结果:

+----------+-------------+-----+---------------+--+
|  month   |     day     | uv  | grouping__id  |
+----------+-------------+-----+---------------+--+
| 2015-04  | NULL        | 6   | 1             |
| 2015-03  | NULL        | 5   | 1             |
| NULL     | 2015-03-10  | 4   | 2             |
| NULL     | 2015-04-16  | 2   | 2             |
| NULL     | 2015-04-15  | 2   | 2             |
| NULL     | 2015-04-13  | 3   | 2             |
| NULL     | 2015-04-12  | 2   | 2             |
| NULL     | 2015-03-12  | 1   | 2             |
| 2015-04  | 2015-04-16  | 2   | 3             |
| 2015-04  | 2015-04-12  | 2   | 3             |
| 2015-04  | 2015-04-13  | 3   | 3             |
| 2015-03  | 2015-03-12  | 1   | 3             |
| 2015-03  | 2015-03-10  | 4   | 3             |
| 2015-04  | 2015-04-15  | 2   | 3             |
+----------+-------------+-----+---------------+--+

等价于

SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM test2 GROUP BY month 
UNION ALL 
SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM test2 GROUP BY day
UNION ALL 
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM test2 GROUP BY month,day;

 

CUBE

根据GROUP BY的维度的所有组合进行聚合。

SELECT 
    month,
    day,
    COUNT(DISTINCT cookieid) AS uv,
    GROUPING__ID 
FROM test2 
    GROUP BY month,day 
    WITH CUBE 
    ORDER BY GROUPING__ID;

 

结果

+----------+-------------+-----+---------------+--+
|  month   |     day     | uv  | grouping__id  |
+----------+-------------+-----+---------------+--+
| NULL     | NULL        | 7   | 0             |
| 2015-03  | NULL        | 5   | 1             |
| 2015-04  | NULL        | 6   | 1             |
| NULL     | 2015-04-16  | 2   | 2             |
| NULL     | 2015-04-15  | 2   | 2             |
| NULL     | 2015-04-13  | 3   | 2             |
| NULL     | 2015-04-12  | 2   | 2             |
| NULL     | 2015-03-12  | 1   | 2             |
| NULL     | 2015-03-10  | 4   | 2             |
| 2015-04  | 2015-04-12  | 2   | 3             |
| 2015-04  | 2015-04-16  | 2   | 3             |
| 2015-03  | 2015-03-12  | 1   | 3             |
| 2015-03  | 2015-03-10  | 4   | 3             |
| 2015-04  | 2015-04-15  | 2   | 3             |
| 2015-04  | 2015-04-13  | 3   | 3             |
+----------+-------------+-----+---------------+--+

等价于

SELECT NULL as month,NULL as day,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM test2
UNION ALL 
SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM test2 GROUP BY month 
UNION ALL 
SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM test2 GROUP BY day
UNION ALL 
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM test2 GROUP BY month,day;

 

ROLLUP

是CUBE的子集,以最左侧的维度为主,从该维度进行层级聚合。

比如,以month维度进行层级聚合:

SELECT 
    month,
    day,
    COUNT(DISTINCT cookieid) AS uv,
    GROUPING__ID  
FROM test2 
    GROUP BY month,day
    WITH ROLLUP 
    ORDER BY GROUPING__ID;

 

结果:

+----------+-------------+-----+---------------+--+
|  month   |     day     | uv  | grouping__id  |
+----------+-------------+-----+---------------+--+
| NULL     | NULL        | 7   | 0             |
| 2015-04  | NULL        | 6   | 1             |
| 2015-03  | NULL        | 5   | 1             |
| 2015-04  | 2015-04-16  | 2   | 3             |
| 2015-04  | 2015-04-15  | 2   | 3             |
| 2015-04  | 2015-04-13  | 3   | 3             |
| 2015-04  | 2015-04-12  | 2   | 3             |
| 2015-03  | 2015-03-12  | 1   | 3             |
| 2015-03  | 2015-03-10  | 4   | 3             |
+----------+-------------+-----+---------------+--+

 

实现的上钻过程:月天的UV->月的UV->总UV

如果把month和day调换顺序,则以day维度进行层级聚合:

SELECT 
    day,
    month,
    COUNT(DISTINCT cookieid) AS uv,
    GROUPING__ID  
FROM test2 
    GROUP BY day,month 
    WITH ROLLUP 
    ORDER BY GROUPING__ID;

 

结果:

+-------------+----------+-----+---------------+--+
|     day     |  month   | uv  | grouping__id  |
+-------------+----------+-----+---------------+--+
| NULL        | NULL     | 7   | 0             |
| 2015-04-12  | NULL     | 2   | 1             |
| 2015-04-15  | NULL     | 2   | 1             |
| 2015-03-12  | NULL     | 1   | 1             |
| 2015-04-16  | NULL     | 2   | 1             |
| 2015-03-10  | NULL     | 4   | 1             |
| 2015-04-13  | NULL     | 3   | 1             |
| 2015-04-16  | 2015-04  | 2   | 3             |
| 2015-04-15  | 2015-04  | 2   | 3             |
| 2015-04-13  | 2015-04  | 3   | 3             |
| 2015-03-12  | 2015-03  | 1   | 3             |
| 2015-03-10  | 2015-03  | 4   | 3             |
| 2015-04-12  | 2015-04  | 2   | 3             |
+-------------+----------+-----+---------------+--+

 

实现的上钻过程:天月的UV->天的UV->总UV

 


 

参考:http://lxw1234.com/archives/category/hive

参考:https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup

参考:https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

 

posted @ 2020-05-19 16:21  sw_kong  阅读(6220)  评论(0编辑  收藏  举报