Hive分析窗体函数之SUM,AVG,MIN和MAX
Hive中提供了非常多的分析函数,用于完毕负责的统计分析。
本文先介绍SUM、AVG、MIN、MAX这四个函数。
环境信息:
Hive版本号为apache-hive-0.14.0-bin
Hadoop版本号为hadoop-2.6.0
Tez版本号为tez-0.7.0
构造数据:
P088888888888,2016-02-10,1
P088888888888,2016-02-11,3
P088888888888,2016-02-12,1
P088888888888,2016-02-13,9
P088888888888,2016-02-14,3
P088888888888,2016-02-15,12
P088888888888,2016-02-16,3
创建表:
hive (hiveinaction)> create table windows_func
>(
> polno string,
> createtime string,
> pnum int
>)
>ROW FORMAT DELIMITED
>FIELDS TERMINATED BY ','
>stored as textfile;
导入数据到表中:
load data local inpath '/home/hadoop/testhivedata/windows_func.txt' into table windows_func;
測试:
SELECT polno,
createtime,
pnum,
SUM(pnum) OVER(PARTITION BY polno ORDERBY createtime) AS pnum1, --默觉得从起点到当前行
SUM(pnum) OVER(PARTITION BY polno ORDERBY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pnum2, --从起点到当前行
SUM(pnum) OVER(PARTITION BY polno) ASpnum3, --分组内全部行
SUM(pnum) OVER(PARTITION BY polno ORDERBY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pnum4, --当前行+往前3行(当前行的值+前面三行的值)
SUM(pnum) OVER(PARTITION BY polno ORDERBY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pnum5, --当前行+往前3行+往后1行
SUM(pnum) OVER(PARTITION BY polno ORDERBY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pnum6 ---当前行+往后全部行
FROM windows_func;
结果:
polno | createtime | pnum | pnum1 | pnum2 | pnum3 | pnum4 | pnum5 | pnum6 |
P088888888888 | 2016/2/10 | 1 | 1 | 1 | 32 | 1 | 4 | 32 |
P088888888888 | 2016/2/11 | 3 | 4 | 4 | 32 | 4 | 5 | 31 |
P088888888888 | 2016/2/12 | 1 | 5 | 5 | 32 | 5 | 14 | 28 |
P088888888888 | 2016/2/13 | 9 | 14 | 14 | 32 | 14 | 17 | 27 |
P088888888888 | 2016/2/14 | 3 | 17 | 17 | 32 | 16 | 28 | 18 |
P088888888888 | 2016/2/15 | 12 | 29 | 29 | 32 | 25 | 28 | 15 |
P088888888888 | 2016/2/16 | 3 | 32 | 32 | 32 | 27 | 27 | 3 |
凝视:
1. 假设不指定ROWS BETWEEN,默觉得从起点到当前行;
2. 假设不指定ORDER BY,则将分组内全部值累加;
理解ROWS BETWEEN含义,也叫做WINDOW子句:
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:当前行
UNBOUNDED:起点,UNBOUNDED PRECEDING表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点
其它AVG,MIN。MAX。和SUM使用方法一样。
演示AVG环境:
SELECT polno,
createtime,
pnum,
AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime) AS pnum1, --默觉得从起点到当前行
AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pnum2, --从起点到当前行
AVG(pnum) OVER(PARTITION BY polno) AS pnum3, --分组内全部行
AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pnum4, --当前行+往前3行(当前行的值+前面三行的值)
AVG(pnum) OVER(PARTITION BYpolno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pnum5, --当前行+往前3行+往后1行
AVG(pnum) OVER(PARTITION BYpolno ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pnum6 ---当前行+往后全部行
FROM windows_func;
结果:
polno | createtime | pnum | pnum1 | pnum2 | pnum3 | pnum4 | pnum5 | pnum6 |
P088888888888 | 2016/2/10 | 1 | 1 | 1 | 4.57142857 | 1 | 2 | 4.5714286 |
P088888888888 | 2016/2/11 | 3 | 2 | 2 | 4.57142857 | 2 | 1.666667 | 5.1666667 |
P088888888888 | 2016/2/12 | 1 | 1.66667 | 1.6667 | 4.57142857 | 1.666667 | 3.5 | 5.6 |
P088888888888 | 2016/2/13 | 9 | 3.5 | 3.5 | 4.57142857 | 3.5 | 3.4 | 6.75 |
P088888888888 | 2016/2/14 | 3 | 3.4 | 3.4 | 4.57142857 | 4 | 5.6 | 6 |
P088888888888 | 2016/2/15 | 12 | 4.83333 | 4.8333 | 4.57142857 | 6.25 | 5.6 | 7.5 |
P088888888888 | 2016/2/16 | 3 | 4.57143 | 4.5714 | 4.57142857 | 6.75 | 6.75 | 3 |
其它相似的函数就不举例了。