【Hive】HiveQL实战之分析函数&窗口函数(补充)
本篇承接【Hive】HiveQL实战之分析函数&窗口函数并对其进行补充。
一 语法结构
分析函数的使用常和Over、Partition By、Order By 和窗口说明一起,具体语法如下:
Function (arg1,..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>]
[<window_clause>])
1 Function (arg1,..., argn)
函数可以是以下类型:
- 标准聚合函数:Count()、Sum()、Min()、MAX()或AVG();
- Rank;
- Dense_Rank;
- Row_number;
- Cume_dist;
- Percent_Rank;
- NTile;
- Lead;
- Lag;
- First_Value;
- Last_Value;
2 PARTITION BY <...>
对数据行进行分组,和Group By类似。
3 ORDER BY <....>
对数据进行排序,默认为升序。
4 <window_clause>
窗口说明进一步对结果进行分区并应用分析函数,有两种类型的窗口,分别为Row类型和Range类型。
1)Row类型窗口定义
ROWS BETWEEN <start_expr> AND <end_expr>
2)窗口说明
<start_expr>只能是以下之一:
- Unbounded preceding;
- Current row;
- N Preceding 或 Following;
<end_expr>只能是以下之一:
- Unbounded following;
- Current row;
- N preceding 或Following;
3)图形展示
二 实战示例
1 数据准备
0: jdbc:hive2://localhost:10000/hive> create table ana(name string , depart int,salary int)
. . . . . . . . . . . . . . . . . . > row format delimited
. . . . . . . . . . . . . . . . . . > fields terminated by '|';
No rows affected (3.863 seconds)
0: jdbc:hive2://localhost:10000/hive> load data local inpath '/home/hadoop/tt' overwrite into table ana;
No rows affected (1.044 seconds)
0: jdbc:hive2://localhost:10000/hive> select *from ana;
+-----------+-------------+-------------+
| ana.name | ana.depart | ana.salary |
+-----------+-------------+-------------+
| Lucy | 1000 | 5500 |
| Michael | 1000 | 5000 |
| Steven | 1000 | 6400 |
| Will | 1000 | 4000 |
| Will | 1000 | 4000 |
| Jess | 1001 | 6000 |
| Lily | 1001 | 5000 |
| Mike | 1001 | 6400 |
| Richard | 1002 | 8000 |
| Wei | 1002 | 7000 |
| Yun | 1002 | 5500 |
+-----------+-------------+-------------+
11 rows selected (0.377 seconds)
2 示例一
0: jdbc:hive2://localhost:10000/hive> select row_number()over(order by depart,name) rnk,
. . . . . . . . . . . . . . . . . . > name,depart,salary,count(*)over(partition by depart) as row_cnt,
. . . . . . . . . . . . . . . . . . > sum(salary)over(partition by depart order by depart) as deptTotal,
. . . . . . . . . . . . . . . . . . > sum(salary)over(order by depart) as runningTotal,
. . . . . . . . . . . . . . . . . . > sum(salary)over(order by depart rows unbounded preceding) as runningTotal2,
. . . . . . . . . . . . . . . . . . > sum(salary)over(order by depart rows between unbounded preceding and current row) as runningTotal3
. . . . . . . . . . . . . . . . . . > from ana
. . . . . . . . . . . . . . . . . . > order by depart,name;
+------+----------+---------+---------+----------+------------+---------------+----------------+----------------+
| rnk | name | depart | salary | row_cnt | depttotal | runningtotal | runningtotal2 | runningtotal3 |
+------+----------+---------+---------+----------+------------+---------------+----------------+----------------+
| 1 | Lucy | 1000 | 5500 | 5 | 24900 | 24900 | 5500 | 5500 |
| 2 | Michael | 1000 | 5000 | 5 | 24900 | 24900 | 10500 | 10500 |
| 3 | Steven | 1000 | 6400 | 5 | 24900 | 24900 | 16900 | 16900 |
| 5 | Will | 1000 | 4000 | 5 | 24900 | 24900 | 24900 | 24900 |
| 4 | Will | 1000 | 4000 | 5 | 24900 | 24900 | 20900 | 20900 |
| 6 | Jess | 1001 | 6000 | 3 | 17400 | 42300 | 30900 | 30900 |
| 7 | Lily | 1001 | 5000 | 3 | 17400 | 42300 | 35900 | 35900 |
| 8 | Mike | 1001 | 6400 | 3 | 17400 | 42300 | 42300 | 42300 |
| 9 | Richard | 1002 | 8000 | 3 | 20500 | 62800 | 50300 | 50300 |
| 10 | Wei | 1002 | 7000 | 3 | 20500 | 62800 | 57300 | 57300 |
| 11 | Yun | 1002 | 5500 | 3 | 20500 | 62800 | 62800 | 62800 |
+------+----------+---------+---------+----------+------------+---------------+----------------+----------------+
11 rows selected (241.63 seconds)
3 示例二
0: jdbc:hive2://localhost:10000/hive> select row_number()over(order by depart,name) rnk,
. . . . . . . . . . . . . . . . . . > name,depart,salary,
. . . . . . . . . . . . . . . . . . > rank() over(partition by depart order by salary) as rank,
. . . . . . . . . . . . . . . . . . > dense_rank()over(partition by depart order by salary) as dense_rank,
. . . . . . . . . . . . . . . . . . > row_number()over() as row_num,
. . . . . . . . . . . . . . . . . . > round(cume_dist()over(partition by depart order by salary),1) as cume_dist,
. . . . . . . . . . . . . . . . . . > percent_rank()over(partition by depart order by salary) as percent_rank,
. . . . . . . . . . . . . . . . . . > ntile(4) over(partition by depart order by salary) as ntile
. . . . . . . . . . . . . . . . . . > from ana
. . . . . . . . . . . . . . . . . . > order by depart;
+------+----------+---------+---------+-------+-------------+----------+------------+---------------+--------+
| rnk | name | depart | salary | rank | dense_rank | row_num | cume_dist | percent_rank | ntile |
+------+----------+---------+---------+-------+-------------+----------+------------+---------------+--------+
| 5 | Will | 1000 | 4000 | 1 | 1 | 11 | 0.4 | 0.0 | 1 |
| 4 | Will | 1000 | 4000 | 1 | 1 | 10 | 0.4 | 0.0 | 1 |
| 2 | Michael | 1000 | 5000 | 3 | 2 | 9 | 0.6 | 0.5 | 2 |
| 1 | Lucy | 1000 | 5500 | 4 | 3 | 8 | 0.8 | 0.75 | 3 |
| 3 | Steven | 1000 | 6400 | 5 | 4 | 7 | 1.0 | 1.0 | 4 |
| 7 | Lily | 1001 | 5000 | 1 | 1 | 6 | 0.3 | 0.0 | 1 |
| 6 | Jess | 1001 | 6000 | 2 | 2 | 5 | 0.7 | 0.5 | 2 |
| 8 | Mike | 1001 | 6400 | 3 | 3 | 4 | 1.0 | 1.0 | 3 |
| 11 | Yun | 1002 | 5500 | 1 | 1 | 3 | 0.3 | 0.0 | 1 |
| 10 | Wei | 1002 | 7000 | 2 | 2 | 2 | 0.7 | 0.5 | 2 |
| 9 | Richard | 1002 | 8000 | 3 | 3 | 1 | 1.0 | 1.0 | 3 |
+------+----------+---------+---------+-------+-------------+----------+------------+---------------+--------+
11 rows selected (229.274 seconds)
4 示例三
0: jdbc:hive2://localhost:10000/hive> select name,depart,salary,
. . . . . . . . . . . . . . . . . . > row_number()over(order by depart,name) rnk,
. . . . . . . . . . . . . . . . . . > lead(salary,2)over(partition by depart order by salary) as lead,
. . . . . . . . . . . . . . . . . . > lag(salary,2)over(partition by depart order by salary) as lag,
. . . . . . . . . . . . . . . . . . > first_value(salary)over(partition by depart order by salary) as first_value,
. . . . . . . . . . . . . . . . . . > last_value(salary)over(partition by depart order by salary) as last_value_default,
. . . . . . . . . . . . . . . . . . > last_value(salary)over(partition by depart order by salary rows between unbounded preceding and unbounded following) as last_value
. . . . . . . . . . . . . . . . . . > from ana
. . . . . . . . . . . . . . . . . . > order by depart;
+----------+---------+---------+------+-------+-------+--------------+---------------------+-------------+
| name | depart | salary | rnk | lead | lag | first_value | last_value_default | last_value |
+----------+---------+---------+------+-------+-------+--------------+---------------------+-------------+
| Steven | 1000 | 6400 | 3 | NULL | 5000 | 4000 | 6400 | 6400 |
| Lucy | 1000 | 5500 | 1 | NULL | 4000 | 4000 | 5500 | 6400 |
| Michael | 1000 | 5000 | 2 | 6400 | 4000 | 4000 | 5000 | 6400 |
| Will | 1000 | 4000 | 4 | 5500 | NULL | 4000 | 4000 | 6400 |
| Will | 1000 | 4000 | 5 | 5000 | NULL | 4000 | 4000 | 6400 |
| Mike | 1001 | 6400 | 8 | NULL | 5000 | 5000 | 6400 | 6400 |
| Jess | 1001 | 6000 | 6 | NULL | NULL | 5000 | 6000 | 6400 |
| Lily | 1001 | 5000 | 7 | 6400 | NULL | 5000 | 5000 | 6400 |
| Richard | 1002 | 8000 | 9 | NULL | 5500 | 5500 | 8000 | 8000 |
| Wei | 1002 | 7000 | 10 | NULL | NULL | 5500 | 7000 | 8000 |
| Yun | 1002 | 5500 | 11 | 8000 | NULL | 5500 | 5500 | 8000 |
+----------+---------+---------+------+-------+-------+--------------+---------------------+-------------+
11 rows selected (172.483 seconds)
5 示例四
0: jdbc:hive2://localhost:10000/hive> select name,depart,salary,
. . . . . . . . . . . . . . . . . . > row_number()over(order by depart,name) rnk,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between 2 preceding and current row) win1,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between 2 preceding and unbounded following) win2,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between 1 preceding and 2 following) win3,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between 1 following and 2 following) win5,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between current row and current row) win6,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between current row and 1 following) win7,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between current row and unbounded following) win8,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between unbounded preceding and current row) win9,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between unbounded preceding and 1 following) win10,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between unbounded preceding and unbounded following) win11,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows 2 preceding) win12
. . . . . . . . . . . . . . . . . . > from ana
. . . . . . . . . . . . . . . . . . > order by depart,name;
+----------+---------+---------+------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+
| name | depart | salary | rnk | win1 | win2 | win3 | win5 | win6 | win7 | win8 | win9 | win10 | win11 | win12 |
+----------+---------+---------+------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+
| Lucy | 1000 | 5500 | 1 | 5500 | 6400 | 6400 | 6400 | 5500 | 5500 | 6400 | 5500 | 5500 | 6400 | 5500 |
| Michael | 1000 | 5000 | 2 | 5500 | 6400 | 6400 | 6400 | 5000 | 6400 | 6400 | 5500 | 6400 | 6400 | 5500 |
| Steven | 1000 | 6400 | 3 | 6400 | 6400 | 6400 | 4000 | 6400 | 6400 | 6400 | 6400 | 6400 | 6400 | 6400 |
| Will | 1000 | 4000 | 4 | 6400 | 6400 | 4000 | NULL | 4000 | 4000 | 4000 | 6400 | 6400 | 6400 | 6400 |
| Will | 1000 | 4000 | 5 | 6400 | 6400 | 6400 | 4000 | 4000 | 4000 | 4000 | 6400 | 6400 | 6400 | 6400 |
| Jess | 1001 | 6000 | 6 | 6000 | 6400 | 6400 | 6400 | 6000 | 6000 | 6400 | 6000 | 6000 | 6400 | 6000 |
| Lily | 1001 | 5000 | 7 | 6000 | 6400 | 6400 | 6400 | 5000 | 6400 | 6400 | 6000 | 6400 | 6400 | 6000 |
| Mike | 1001 | 6400 | 8 | 6400 | 6400 | 6400 | NULL | 6400 | 6400 | 6400 | 6400 | 6400 | 6400 | 6400 |
| Richard | 1002 | 8000 | 9 | 8000 | 8000 | 8000 | 7000 | 8000 | 8000 | 8000 | 8000 | 8000 | 8000 | 8000 |
| Wei | 1002 | 7000 | 10 | 8000 | 8000 | 8000 | 5500 | 7000 | 7000 | 7000 | 8000 | 8000 | 8000 | 8000 |
| Yun | 1002 | 5500 | 11 | 8000 | 8000 | 7000 | NULL | 5500 | 5500 | 5500 | 8000 | 8000 | 8000 | 8000 |
+----------+---------+---------+------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+
11 rows selected (190.488 seconds)
6 示例五
0: jdbc:hive2://localhost:10000/hive> select name,depart,salary,
. . . . . . . . . . . . . . . . . . > row_number()over(order by depart,name) rnk,
. . . . . . . . . . . . . . . . . . > max(salary)over w1 as win1,
. . . . . . . . . . . . . . . . . . > max(salary)over w1 as win2,
. . . . . . . . . . . . . . . . . . > max(salary)over w3 as win3
. . . . . . . . . . . . . . . . . . > from ana
. . . . . . . . . . . . . . . . . . > window
. . . . . . . . . . . . . . . . . . > w1 as (partition by depart order by name rows between 2 preceding and current row),
. . . . . . . . . . . . . . . . . . > w3 as (partition by depart order by name rows between 1 preceding and 2 following)
. . . . . . . . . . . . . . . . . . > order by depart,name;
+----------+---------+---------+------+-------+-------+-------+
| name | depart | salary | rnk | win1 | win2 | win3 |
+----------+---------+---------+------+-------+-------+-------+
| Lucy | 1000 | 5500 | 1 | 5500 | 5500 | 6400 |
| Michael | 1000 | 5000 | 2 | 5500 | 5500 | 6400 |
| Steven | 1000 | 6400 | 3 | 6400 | 6400 | 6400 |
| Will | 1000 | 4000 | 4 | 6400 | 6400 | 4000 |
| Will | 1000 | 4000 | 5 | 6400 | 6400 | 6400 |
| Jess | 1001 | 6000 | 6 | 6000 | 6000 | 6400 |
| Lily | 1001 | 5000 | 7 | 6000 | 6000 | 6400 |
| Mike | 1001 | 6400 | 8 | 6400 | 6400 | 6400 |
| Richard | 1002 | 8000 | 9 | 8000 | 8000 | 8000 |
| Wei | 1002 | 7000 | 10 | 8000 | 8000 | 8000 |
| Yun | 1002 | 5500 | 11 | 8000 | 8000 | 7000 |
+----------+---------+---------+------+-------+-------+-------+
11 rows selected (189.893 seconds)