【Hadoop离线基础总结】hive的窗口函数
hive的窗口函数
-
概述
hive中一般求取TopN时就需要用到窗口函数
专业窗口函数一般有三个
rank() over
dense rank() over
row_number() over
-
实例
- 创建一个f_test表拿到以下数据
+------------+--------------+-------------+--+ | f_test.id | f_test.name | f_test.sal | +------------+--------------+-------------+--+ | 1 | a | 10 | | 2 | a | 12 | | 3 | b | 13 | | 4 | b | 12 | | 5 | a | 14 | | 6 | a | 15 | | 7 | a | 13 | | 8 | b | 11 | | 9 | a | 16 | | 10 | b | 17 | | 11 | a | 14 | +------------+--------------+-------------+--+
- 现在用窗口函数求取TopN
SELECT id,name,sal, RANK() over(PARTITION by name ORDER BY sal DESC) rp, DENSE_RANK() over(PARTITION by name ORDER BY sal DESC) drp, ROW_NUMBER() over(PARTITION by name ORDER BY sal DESC) rmp FROM f_test;
partition by
就相当于group by
,但是这里不能替换为group by
,而且partition by
后只能跟一个字段,group by
可以跟多个字段 - 结果
可以看出在a组的sal中有两个相同值+-----+-------+------+-----+------+------+--+ | id | name | sal | rp | drp | rmp | +-----+-------+------+-----+------+------+--+ | 9 | a | 16 | 1 | 1 | 1 | | 6 | a | 15 | 2 | 2 | 2 | | 11 | a | 14 | 3 | 3 | 3 | | 5 | a | 14 | 3 | 3 | 4 | | 7 | a | 13 | 5 | 4 | 5 | | 2 | a | 12 | 6 | 5 | 6 | | 1 | a | 10 | 7 | 6 | 7 | | 10 | b | 17 | 1 | 1 | 1 | | 3 | b | 13 | 2 | 2 | 2 | | 4 | b | 12 | 3 | 3 | 3 | | 8 | b | 11 | 4 | 4 | 4 | +-----+-------+------+-----+------+------+--+
rank() over
是给出了相同的序号3
,然后直接跳到5
,所以他最后的序号要比实际数据数量少
dense rank() over
同样给出了相同的序号3
,但是是接着顺序给出了4
,最后序号和实际数据数量相同
row_number() over
则是单纯地按照顺序进行排序,不受相同name影响
- 创建一个f_test表拿到以下数据