(5) Hive Window Functions
Hive window functions:
Official documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
Window: the range of the data set the function operates on at run time.
Function: the function that is executed over that data set at run time.
Only the following kinds of functions are supported over a window: windowing functions, aggregate functions, and ranking/analytics functions (a combined example follows this list).
Windowing functions:
LEAD: LEAD(scalar_expression [,offset] [,default]): returns the value of the specified column from the row offset rows after the current row; if that row does not exist, the default value is used.
LAG: LAG(scalar_expression [,offset] [,default]): returns the value of the specified column from the row offset rows before the current row; if that row does not exist, the default value is used.
FIRST_VALUE: FIRST_VALUE(col [, false (default)]): returns the first value of the specified column in the current window; if the second parameter is true and that value is NULL, NULLs are skipped and the search continues.
LAST_VALUE: LAST_VALUE(col [, false (default)]): returns the last value of the specified column in the current window; if the second parameter is true, NULL values are likewise skipped.
Aggregate functions: min, max, avg, sum, count; they generally have to be combined with over().
Ranking/analytics functions: RANK, ROW_NUMBER, DENSE_RANK, CUME_DIST, PERCENT_RANK, NTILE
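A minimal combined sketch of the three categories, written against the business(name, orderdate, cost) table that is created in the hands-on section below (the column aliases are illustrative, not from the original):
select name,
       orderdate,
       cost,
       lag(orderdate, 1, '1900-01-01') over (partition by name order by orderdate) as prev_orderdate,  -- windowing function
       sum(cost) over (partition by name order by orderdate) as running_cost,                          -- aggregate + over()
       rank() over (partition by name order by cost desc) as cost_rank                                 -- ranking function
from business;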
Syntax: function over( partition by column order by column window_clause )
The size of the window can be specified with the window_clause:
(rows | range) between (unbounded | [num]) preceding and ([num] preceding | current row | (unbounded | [num]) following)
(rows | range) between current row and (current row | (unbounded | [num]) following)
(rows | range) between [num] following and (unbounded | [num]) following
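For example, a hedged sketch of an explicit window_clause on the same business table, computing a moving sum over the previous row, the current row, and the next row of each customer:
select name,
       orderdate,
       cost,
       sum(cost) over (partition by name order by orderdate
                       rows between 1 preceding and 1 following) as moving_sum
from business;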
Special cases:
① If over() contains neither a window_clause nor an order by, the window defaults to rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING.
② If over() specifies an order by but no window_clause, the window defaults to range between UNBOUNDED PRECEDING and CURRENT ROW.
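Both defaults can be written out explicitly; in the following sketch each pair of columns should produce the same result:
-- ① neither order by nor window_clause: every row sees the whole partition
select name, cost,
       sum(cost) over (partition by name) as total_default,
       sum(cost) over (partition by name
                       rows between unbounded preceding and unbounded following) as total_explicit
from business;
-- ② order by without a window_clause: cumulative window up to the current row
select name, orderdate, cost,
       sum(cost) over (partition by name order by orderdate) as cumulative_default,
       sum(cost) over (partition by name order by orderdate
                       range between unbounded preceding and current row) as cumulative_explicit
from business;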
How do window functions differ from grouping?
① With a group by, the select list may only contain the grouped columns (plus aggregates).
② A window function is evaluated once for every row, over the specified window.
③ group by collapses each group into a single row, whereas partition by keeps every row (see the comparison below).
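A side-by-side sketch of the difference on the business table: group by returns a single row per customer, while the window version keeps every order row and repeats the per-customer total on each of them.
-- group by: one row per customer; only grouped columns and aggregates may appear in the select list
select name, sum(cost) as total_cost
from business
group by name;
-- window function: every order row is kept; the total is attached to each row
select name, orderdate, cost,
       sum(cost) over (partition by name) as total_cost
from business;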
Hands-on:
Prepare the table
CREATE TABLE business(
name string,
orderdate string,
cost int
)
ROW format delimited fields terminated by ',';
load DATA LOCAL inpath "/opt/module/datas/business.txt" INTO TABLE business;
# Data
[emo@hadoop02 datas]$ vim /opt/module/datas/business.txt
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94
Execution log:
0: jdbc:hive2://hadoop02:10000> CREATE TABLE business(
0: jdbc:hive2://hadoop02:10000> name string,
0: jdbc:hive2://hadoop02:10000> orderdate string,
0: jdbc:hive2://hadoop02:10000> cost int
0: jdbc:hive2://hadoop02:10000> )
0: jdbc:hive2://hadoop02:10000> ROW format delimited fields terminated by ',';
No rows affected (0.073 seconds)
0: jdbc:hive2://hadoop02:10000> load DATA LOCAL inpath "/opt/module/datas/business.txt" INTO TABLE business;
INFO : Loading data to table default.business from file:/opt/module/datas/business.txt
INFO : Table default.business stats: [numFiles=1, totalSize=267]
No rows affected (0.214 seconds)
0: jdbc:hive2://hadoop02:10000> select * from business;
+----------------+---------------------+----------------+--+
| business.name | business.orderdate | business.cost |
+----------------+---------------------+----------------+--+
| jack | 2017-01-01 | 10 |
| tony | 2017-01-02 | 15 |
| jack | 2017-02-03 | 23 |
| tony | 2017-01-04 | 29 |
| jack | 2017-01-05 | 46 |
| jack | 2017-04-06 | 42 |
| tony | 2017-01-07 | 50 |
| jack | 2017-01-08 | 55 |
| mart | 2017-04-08 | 62 |
| mart | 2017-04-09 | 68 |
| neil | 2017-05-10 | 12 |
| mart | 2017-04-11 | 75 |
| neil | 2017-06-12 | 80 |
| mart | 2017-04-13 | 94 |
| | NULL | NULL |
+----------------+---------------------+----------------+--+
15 rows selected (0.071 seconds)
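One hedged exercise against the data just loaded (output omitted): each purchase together with the customer's total cost in the same month, using substring(orderdate, 1, 7) as the month key.
select name,
       orderdate,
       cost,
       sum(cost) over (partition by name, substring(orderdate, 1, 7)) as month_total
from business;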
To be supplemented.