(5) Hive Window Functions

Hive window functions:

Official documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
Window: the range of rows over which the function computes at runtime.
Function: the function executed against that set of rows.

Only the following kinds of functions are supported with OVER(): windowing functions, aggregate functions, and ranking/analytics functions.

Windowing functions:
	LEAD: LEAD(scalar_expression [,offset] [,default]): returns the value of the specified column from the row `offset` rows after the current row; if no such row exists, the default value is used.
	LAG: LAG(scalar_expression [,offset] [,default]): returns the value of the specified column from the row `offset` rows before the current row; if no such row exists, the default value is used.
	FIRST_VALUE: FIRST_VALUE(column [, false (default)]): returns the first value of the specified column in the current window; if the second argument is true and the first value is NULL, NULL values are skipped and the search continues.
	LAST_VALUE: LAST_VALUE(column [, false (default)]): returns the last value of the specified column in the current window; if the second argument is true, NULL values are skipped when looking for the last value.
Aggregate functions: min, max, avg, sum, count; these are generally combined with OVER().
Ranking/analytics functions: RANK, ROW_NUMBER, DENSE_RANK, CUME_DIST, PERCENT_RANK, NTILE.
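A sketch of the windowing functions (assuming the `business` table created later in this section). Note a common pitfall: with the default frame that ORDER BY implies, LAST_VALUE sees the current row as the last row of its window, so an explicit frame is needed to get the partition-wide last value:

```sql
select name, orderdate,
       -- next order date for this customer; NULL for the last order
       lead(orderdate, 1) over(partition by name order by orderdate) as next_order,
       -- first order date within the customer's partition
       first_value(orderdate) over(partition by name order by orderdate) as first_order,
       -- last order date: widen the frame to the whole partition,
       -- otherwise the "last" row is just the current row
       last_value(orderdate) over(partition by name order by orderdate
                                  rows between unbounded preceding
                                           and unbounded following) as last_order
from business;
```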

Syntax: function OVER( [PARTITION BY column] [ORDER BY column] [window_clause] )

The window size can be specified with a window_clause:

(rows | range) between (unbounded | [num]) preceding and ([num] preceding | current row | (unbounded | [num]) following)
(rows | range) between current row and (current row | (unbounded | [num]) following)
(rows | range) between [num] following and (unbounded | [num]) following
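For example, a moving sum over the previous row and the current row could be sketched as follows (assuming the `business` table defined later in this section):

```sql
-- Moving sum of cost over the previous row and the current row,
-- per customer, ordered by order date
select name, orderdate, cost,
       sum(cost) over(partition by name order by orderdate
                      rows between 1 preceding and current row) as moving_sum
from business;
```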

Special cases:

① If OVER() contains neither a window_clause nor an ORDER BY, the window defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING (the entire partition).
② If OVER() contains an ORDER BY but no window_clause, the window defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
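The two defaults can be seen side by side (a sketch against the `business` table defined later in this section):

```sql
-- ① No ORDER BY: the sum covers the whole partition,
--    so every row of a customer shows the same total
select name, cost,
       sum(cost) over(partition by name) as partition_total
from business;

-- ② ORDER BY but no window_clause: the default frame ends at the
--    current row, giving a cumulative (running) sum
select name, orderdate, cost,
       sum(cost) over(partition by name order by orderdate) as running_total
from business;
```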

How do window functions differ from GROUP BY?

① With GROUP BY, the SELECT list may only contain the grouping columns (and aggregates over them).
② A window function is evaluated once for every row, over the window specified for that row.
③ GROUP BY collapses each group into a single row (a deduplicating effect), whereas PARTITION BY keeps every row.
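To illustrate the difference (a sketch, assuming the `business` table created below):

```sql
-- GROUP BY: collapses to one row per customer
select name, sum(cost) as total
from business
group by name;

-- PARTITION BY: every original row is kept,
-- each carrying its customer's total alongside it
select name, orderdate, cost,
       sum(cost) over(partition by name) as total
from business;
```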

Hands-on practice:

Prepare the table

CREATE TABLE business(
 name string,
 orderdate string,
 cost int
)
ROW format delimited fields terminated by ',';
load DATA LOCAL inpath "/opt/module/datas/business.txt" INTO TABLE business;

# Data
[emo@hadoop02 datas]$ vim /opt/module/datas/business.txt
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

Execution log:

0: jdbc:hive2://hadoop02:10000> CREATE TABLE business(
0: jdbc:hive2://hadoop02:10000>  name string,
0: jdbc:hive2://hadoop02:10000>  orderdate string,
0: jdbc:hive2://hadoop02:10000>  cost int
0: jdbc:hive2://hadoop02:10000> )
0: jdbc:hive2://hadoop02:10000> ROW format delimited fields terminated by ',';
No rows affected (0.073 seconds)
0: jdbc:hive2://hadoop02:10000> load DATA LOCAL inpath "/opt/module/datas/business.txt" INTO TABLE business;
INFO  : Loading data to table default.business from file:/opt/module/datas/business.txt
INFO  : Table default.business stats: [numFiles=1, totalSize=267]
No rows affected (0.214 seconds)
0: jdbc:hive2://hadoop02:10000> select * from business;
+----------------+---------------------+----------------+--+
| business.name  | business.orderdate  | business.cost  |
+----------------+---------------------+----------------+--+
| jack           | 2017-01-01          | 10             |
| tony           | 2017-01-02          | 15             |
| jack           | 2017-02-03          | 23             |
| tony           | 2017-01-04          | 29             |
| jack           | 2017-01-05          | 46             |
| jack           | 2017-04-06          | 42             |
| tony           | 2017-01-07          | 50             |
| jack           | 2017-01-08          | 55             |
| mart           | 2017-04-08          | 62             |
| mart           | 2017-04-09          | 68             |
| neil           | 2017-05-10          | 12             |
| mart           | 2017-04-11          | 75             |
| neil           | 2017-06-12          | 80             |
| mart           | 2017-04-13          | 94             |
|                | NULL                | NULL           |
+----------------+---------------------+----------------+--+
15 rows selected (0.071 seconds)
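With the table loaded, a few typical window-function queries can be sketched (hypothetical examples, not from the original post):

```sql
-- Cumulative cost per customer, ordered by order date
select name, orderdate, cost,
       sum(cost) over(partition by name order by orderdate) as running_total
from business;

-- Each customer's previous order date via LAG,
-- defaulting to '1900-01-01' for the first order
select name, orderdate,
       lag(orderdate, 1, '1900-01-01')
           over(partition by name order by orderdate) as prev_order
from business;

-- Rank each customer's orders by cost, most expensive first
select name, orderdate, cost,
       rank() over(partition by name order by cost desc) as rnk
from business;
```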

To be continued.

posted @ 2021-01-19 21:56  Leo-Wong