Hive开窗函数整理
分析函数用于计算基于组的某种聚合值,它和聚合函数的不同之处是:对于每个组返回多行,而聚合函数对于每个组只返回一行。
开窗函数指定了分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变化而变化!到底什么是数据窗口?后面举例会详细讲到!
基础结构:
分析函数(如:sum(),max(),row_number()...) + 窗口子句(over函数)
over函数写法: over(partition by cookieid order by createtime) 先根据cookieid字段分区,相同的cookieid分为一区,每个分区内根据createtime字段排序(默认升序)
注:不加 partition by 的话则把整个数据集当作一个分区,不加 order by的话会对某些函数统计结果产生影响,如sum()
版本:1.1.0-cdh5.13.3
测试表
create table if not exists test ( cookie_id string, create_time string, pv int )row format delimited fields terminated by ',';
测试数据
a,2017-12-01,3 b,2017-12-02,3 cookie1,2017-12-02,4 cookie1,2017-12-03,2 cookie1,2017-12-04,3 cookie1,2017-12-05,1 cookie1,2017-12-06,6 cookie1,2017-12-07,7 cookie2,2017-12-02,1 cookie2,2017-12-04,2 cookie3,2017-12-06,7 cookie3,2017-12-03,5
SUM、AVG、MIN、MAX
用于实现分组内所有和连续累积的统计。
以SUM举例
SELECT cookie_id,create_time,pv, SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time) AS pv1, -- 默认为从起点到当前行 SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1 SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv3, --当前行+往前3行 SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv4, --当前行+往前3行+往后1行 SUM(pv) OVER(PARTITION BY cookie_id ORDER BY create_time ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv5 ---当前行+往后所有行 FROM test;
结果:
+------------+--------------+-----+------+------+------+------+------+--+ | cookie_id | create_time | pv | pv1 | pv2 | pv3 | pv4 | pv5 | +------------+--------------+-----+------+------+------+------+------+--+ | a | 2017-12-01 | 3 | 3 | 3 | 3 | 3 | 3 | | b | 2017-12-02 | 3 | 3 | 3 | 3 | 3 | 3 | | cookie1 | 2017-12-02 | 4 | 4 | 4 | 4 | 6 | 23 | | cookie1 | 2017-12-03 | 2 | 6 | 6 | 6 | 9 | 19 | | cookie1 | 2017-12-04 | 3 | 9 | 9 | 9 | 10 | 17 | | cookie1 | 2017-12-05 | 1 | 10 | 10 | 10 | 16 | 14 | | cookie1 | 2017-12-06 | 6 | 16 | 16 | 12 | 19 | 13 | | cookie1 | 2017-12-07 | 7 | 23 | 23 | 17 | 17 | 7 | | cookie2 | 2017-12-02 | 1 | 1 | 1 | 1 | 3 | 3 | | cookie2 | 2017-12-04 | 2 | 3 | 3 | 3 | 3 | 2 | | cookie3 | 2017-12-03 | 5 | 5 | 5 | 5 | 12 | 12 | | cookie3 | 2017-12-06 | 7 | 12 | 12 | 12 | 12 | 7 | +------------+--------------+-----+------+------+------+------+------+--+
注:这些窗口的划分都是在分区内部!超过分区大小就无效了
可以看到如果不指定ROWS BETWEEN,默认统计窗口为从起点到当前行;
关键是理解 ROWS BETWEEN 含义,也叫做window子句:
-
PRECEDING:往前
-
FOLLOWING:往后
-
CURRENT ROW:当前行
-
UNBOUNDED:无边界,UNBOUNDED PRECEDING 表示从最前面的起点开始, UNBOUNDED FOLLOWING:表示到最后面的终点
其他AVG,MIN,MAX,和SUM用法一样
如果不加ORDER BY 会怎样
select cookie_id,create_time,pv, sum(pv) over(PARTITION BY cookie_id) as pv1 FROM test;
结果
+------------+--------------+-----+------+--+ | cookie_id | create_time | pv | pv1 | +------------+--------------+-----+------+--+ | a | 2017-12-01 | 3 | 3 | | b | 2017-12-02 | 3 | 3 | | cookie1 | 2017-12-07 | 7 | 23 | | cookie1 | 2017-12-06 | 6 | 23 | | cookie1 | 2017-12-05 | 1 | 23 | | cookie1 | 2017-12-04 | 3 | 23 | | cookie1 | 2017-12-03 | 2 | 23 | | cookie1 | 2017-12-02 | 4 | 23 | | cookie2 | 2017-12-04 | 2 | 3 | | cookie2 | 2017-12-02 | 1 | 3 | | cookie3 | 2017-12-03 | 5 | 12 | | cookie3 | 2017-12-06 | 7 | 12 | +------------+--------------+-----+------+--+
可以看到,如果没有order by,不仅分区内没有排序,sum()计算的pv也是整个分区的pv
注:max()函数无论有没有order by 都是计算整个分区的最大值
ROW_NUMBER 函数
ROW_NUMBER() 从1开始,按照顺序,生成分组内记录的序列
ROW_NUMBER() 的应用场景非常多,比如获取分组内排序Top N的记录、获取一个session中的第一条refer等。
SELECT cookie_id,create_time,pv, ROW_NUMBER() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS rn FROM test;
结果
+------------+--------------+-----+-----+--+ | cookie_id | create_time | pv | rn | +------------+--------------+-----+-----+--+ | a | 2017-12-01 | 3 | 1 | | b | 2017-12-02 | 3 | 1 | | cookie1 | 2017-12-07 | 7 | 1 | | cookie1 | 2017-12-06 | 6 | 2 | | cookie1 | 2017-12-02 | 4 | 3 | | cookie1 | 2017-12-04 | 3 | 4 | | cookie1 | 2017-12-03 | 2 | 5 | | cookie1 | 2017-12-05 | 1 | 6 | | cookie2 | 2017-12-04 | 2 | 1 | | cookie2 | 2017-12-02 | 1 | 2 | | cookie3 | 2017-12-06 | 7 | 1 | | cookie3 | 2017-12-03 | 5 | 2 | +------------+--------------+-----+-----+--+
RANK 和 DENSE_RANK 函数
RANK() 生成数据项在分组中的排名,排名相等会在名次中留下空位 DENSE_RANK() 生成数据项在分组中的排名,排名相等会在名次中不会留下空位
我们把 rank、dense_rank、row_number三者对比,这样比较清晰:
SELECT cookie_id,create_time,pv, RANK() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS rank_res, DENSE_RANK() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS dense_rank_res, ROW_NUMBER() OVER(PARTITION BY cookie_id ORDER BY pv desc) AS row_number_res FROM test;
结果:
-
为了突出区别,对test表做了一点调整,将数据cookie1,2017-12-02,4改为了cookie1,2017-12-02,3
-
主要看cookie1,2017-12-04和2017-12-02的pv都是3
+------------+--------------+-----+-----------+-----------------+-----------------+--+ | cookie_id | create_time | pv | rank_res | dense_rank_res | row_number_res | +------------+--------------+-----+-----------+-----------------+-----------------+--+ | a | 2017-12-01 | 3 | 1 | 1 | 1 | | b | 2017-12-02 | 3 | 1 | 1 | 1 | | cookie1 | 2017-12-07 | 7 | 1 | 1 | 1 | | cookie1 | 2017-12-06 | 6 | 2 | 2 | 2 | | cookie1 | 2017-12-04 | 3 | 3 | 3 | 3 | | cookie1 | 2017-12-02 | 3 | 3 | 3 | 4 | | cookie1 | 2017-12-03 | 2 | 5 | 4 | 5 | | cookie1 | 2017-12-05 | 1 | 6 | 5 | 6 | | cookie2 | 2017-12-04 | 2 | 1 | 1 | 1 | | cookie2 | 2017-12-02 | 1 | 2 | 2 | 2 | | cookie3 | 2017-12-06 | 7 | 1 | 1 | 1 | | cookie3 | 2017-12-03 | 5 | 2 | 2 | 2 | +------------+--------------+-----+-----------+-----------------+-----------------+--+
LAG 和 LEAD 函数
LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值
第一个参数为列名,第二个参数为往上第n行(默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
SELECT cookie_id,create_time, LAG(create_time,1) OVER(PARTITION BY cookie_id ORDER BY create_time) AS lag1, LAG(create_time,1,'1970-01-01') OVER(PARTITION BY cookie_id ORDER BY create_time) AS lag1_with_default FROM test;
结果:
+------------+--------------+-------------+--------------------+--+ | cookie_id | create_time | lag1 | lag1_with_default | +------------+--------------+-------------+--------------------+--+ | a | 2017-12-01 | NULL | 1970-01-01 | | b | 2017-12-02 | NULL | 1970-01-01 | | cookie1 | 2017-12-02 | NULL | 1970-01-01 | | cookie1 | 2017-12-03 | 2017-12-02 | 2017-12-02 | | cookie1 | 2017-12-04 | 2017-12-03 | 2017-12-03 | | cookie1 | 2017-12-05 | 2017-12-04 | 2017-12-04 | | cookie1 | 2017-12-06 | 2017-12-05 | 2017-12-05 | | cookie1 | 2017-12-07 | 2017-12-06 | 2017-12-06 | | cookie2 | 2017-12-02 | NULL | 1970-01-01 | | cookie2 | 2017-12-04 | 2017-12-02 | 2017-12-02 | | cookie3 | 2017-12-03 | NULL | 1970-01-01 | | cookie3 | 2017-12-06 | 2017-12-03 | 2017-12-03 | +------------+--------------+-------------+--------------------+--+
LEAD 函数则与 LAG 相反: LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值
第一个参数为列名,第二个参数为往下第n行(默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)
SELECT cookie_id,create_time, LEAD(create_time,1) OVER(PARTITION BY cookie_id ORDER BY create_time) AS lag2, LEAD(create_time,1,'1970-01-01') OVER(PARTITION BY cookie_id ORDER BY create_time) AS lag2_with_default FROM test;
结果:
+------------+--------------+-------------+--------------------+--+ | cookie_id | create_time | lag2 | lag2_with_default | +------------+--------------+-------------+--------------------+--+ | a | 2017-12-01 | NULL | 1970-01-01 | | b | 2017-12-02 | NULL | 1970-01-01 | | cookie1 | 2017-12-02 | 2017-12-03 | 2017-12-03 | | cookie1 | 2017-12-03 | 2017-12-04 | 2017-12-04 | | cookie1 | 2017-12-04 | 2017-12-05 | 2017-12-05 | | cookie1 | 2017-12-05 | 2017-12-06 | 2017-12-06 | | cookie1 | 2017-12-06 | 2017-12-07 | 2017-12-07 | | cookie1 | 2017-12-07 | NULL | 1970-01-01 | | cookie2 | 2017-12-02 | 2017-12-04 | 2017-12-04 | | cookie2 | 2017-12-04 | NULL | 1970-01-01 | | cookie3 | 2017-12-03 | 2017-12-06 | 2017-12-06 | | cookie3 | 2017-12-06 | NULL | 1970-01-01 | +------------+--------------+-------------+--------------------+--+
GROUPING SETS,GROUPING__ID,CUBE,ROLLUP
准备数据:
+----------+-------------+-----------+--+ | month | day | cookieid | +----------+-------------+-----------+--+ | 2015-03 | 2015-03-10 | cookie1 | | 2015-03 | 2015-03-10 | cookie5 | | 2015-03 | 2015-03-12 | cookie7 | | 2015-04 | 2015-04-12 | cookie3 | | 2015-04 | 2015-04-13 | cookie2 | | 2015-04 | 2015-04-13 | cookie4 | | 2015-04 | 2015-04-16 | cookie4 | | 2015-03 | 2015-03-10 | cookie2 | | 2015-03 | 2015-03-10 | cookie3 | | 2015-04 | 2015-04-12 | cookie5 | | 2015-04 | 2015-04-13 | cookie6 | | 2015-04 | 2015-04-15 | cookie3 | | 2015-04 | 2015-04-15 | cookie2 | | 2015-04 | 2015-04-16 | cookie1 | +----------+-------------+-----------+--+
表:
create table test2( month STRING, day STRING, cookieid STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
GROUPING SETS
在一个GROUP BY查询中,根据不同的维度组合进行聚合,等价于将不同维度的GROUP BY结果集进行UNION ALL
SELECT month ,day ,COUNT(DISTINCT cookieid) AS uv FROM test2 GROUP BY month,day GROUPING SETS(month, day);
结果:
+----------+-------------+-----+--+ | month | day | uv | +----------+-------------+-----+--+ | NULL | 2015-03-10 | 4 | | NULL | 2015-03-12 | 1 | | NULL | 2015-04-12 | 2 | | NULL | 2015-04-13 | 3 | | NULL | 2015-04-15 | 2 | | NULL | 2015-04-16 | 2 | | 2015-03 | NULL | 5 | | 2015-04 | NULL | 6 | +----------+-------------+-----+--+
等价于
SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv FROM test2 GROUP BY month UNION ALL SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv FROM test2 GROUP BY day;
增强版
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID FROM test2 GROUP BY month,day GROUPING SETS (month,day,(month,day)) ORDER BY GROUPING__ID;
其中的 GROUPING__ID,表示结果属于哪一个分组集合。
结果:
+----------+-------------+-----+---------------+--+ | month | day | uv | grouping__id | +----------+-------------+-----+---------------+--+ | 2015-04 | NULL | 6 | 1 | | 2015-03 | NULL | 5 | 1 | | NULL | 2015-03-10 | 4 | 2 | | NULL | 2015-04-16 | 2 | 2 | | NULL | 2015-04-15 | 2 | 2 | | NULL | 2015-04-13 | 3 | 2 | | NULL | 2015-04-12 | 2 | 2 | | NULL | 2015-03-12 | 1 | 2 | | 2015-04 | 2015-04-16 | 2 | 3 | | 2015-04 | 2015-04-12 | 2 | 3 | | 2015-04 | 2015-04-13 | 3 | 3 | | 2015-03 | 2015-03-12 | 1 | 3 | | 2015-03 | 2015-03-10 | 4 | 3 | | 2015-04 | 2015-04-15 | 2 | 3 | +----------+-------------+-----+---------------+--+
等价于
SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM test2 GROUP BY month UNION ALL SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM test2 GROUP BY day UNION ALL SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM test2 GROUP BY month,day;
CUBE
根据GROUP BY的维度的所有组合进行聚合。
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID FROM test2 GROUP BY month,day WITH CUBE ORDER BY GROUPING__ID;
结果
+----------+-------------+-----+---------------+--+ | month | day | uv | grouping__id | +----------+-------------+-----+---------------+--+ | NULL | NULL | 7 | 0 | | 2015-03 | NULL | 5 | 1 | | 2015-04 | NULL | 6 | 1 | | NULL | 2015-04-16 | 2 | 2 | | NULL | 2015-04-15 | 2 | 2 | | NULL | 2015-04-13 | 3 | 2 | | NULL | 2015-04-12 | 2 | 2 | | NULL | 2015-03-12 | 1 | 2 | | NULL | 2015-03-10 | 4 | 2 | | 2015-04 | 2015-04-12 | 2 | 3 | | 2015-04 | 2015-04-16 | 2 | 3 | | 2015-03 | 2015-03-12 | 1 | 3 | | 2015-03 | 2015-03-10 | 4 | 3 | | 2015-04 | 2015-04-15 | 2 | 3 | | 2015-04 | 2015-04-13 | 3 | 3 | +----------+-------------+-----+---------------+--+
等价于
SELECT NULL as month,NULL as day,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM test2 UNION ALL SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM test2 GROUP BY month UNION ALL SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM test2 GROUP BY day UNION ALL SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM test2 GROUP BY month,day;
ROLLUP
是CUBE的子集,以最左侧的维度为主,从该维度进行层级聚合。
比如,以month维度进行层级聚合:
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID FROM test2 GROUP BY month,day WITH ROLLUP ORDER BY GROUPING__ID;
结果:
+----------+-------------+-----+---------------+--+ | month | day | uv | grouping__id | +----------+-------------+-----+---------------+--+ | NULL | NULL | 7 | 0 | | 2015-04 | NULL | 6 | 1 | | 2015-03 | NULL | 5 | 1 | | 2015-04 | 2015-04-16 | 2 | 3 | | 2015-04 | 2015-04-15 | 2 | 3 | | 2015-04 | 2015-04-13 | 3 | 3 | | 2015-04 | 2015-04-12 | 2 | 3 | | 2015-03 | 2015-03-12 | 1 | 3 | | 2015-03 | 2015-03-10 | 4 | 3 | +----------+-------------+-----+---------------+--+
实现的上钻过程:月天的UV->月的UV->总UV
如果把month和day调换顺序,则以day维度进行层级聚合:
SELECT day, month, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID FROM test2 GROUP BY day,month WITH ROLLUP ORDER BY GROUPING__ID;
结果:
+-------------+----------+-----+---------------+--+ | day | month | uv | grouping__id | +-------------+----------+-----+---------------+--+ | NULL | NULL | 7 | 0 | | 2015-04-12 | NULL | 2 | 1 | | 2015-04-15 | NULL | 2 | 1 | | 2015-03-12 | NULL | 1 | 1 | | 2015-04-16 | NULL | 2 | 1 | | 2015-03-10 | NULL | 4 | 1 | | 2015-04-13 | NULL | 3 | 1 | | 2015-04-16 | 2015-04 | 2 | 3 | | 2015-04-15 | 2015-04 | 2 | 3 | | 2015-04-13 | 2015-04 | 3 | 3 | | 2015-03-12 | 2015-03 | 1 | 3 | | 2015-03-10 | 2015-03 | 4 | 3 | | 2015-04-12 | 2015-04 | 2 | 3 | +-------------+----------+-----+---------------+--+
实现的上钻过程:天月的UV->天的UV->总UV
参考:http://lxw1234.com/archives/category/hive
参考:https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup