hive的窗口函数1

Hive中提供了越来越多的分析函数,用于完成负责的统计分析。抽时间将所有的分析窗口函数理一遍,将陆续发布。
今天先看几个基础的,SUM、AVG、MIN、MAX。
用于实现分组内所有和连续累积的统计。

1.数据准备

本地文件上的文件

(my_python_env)[root@hadoop26 data]# cat cookie 
cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4

 吧本地文件系统上的文件上传到hdfs上,并为此数据建立外部表

(my_python_env)[root@hadoop26 data]# hadoop fs -put cookie /cookie


hive->
create external table cookie (
cookieid string,
createtime string,
pv int
)
row format delimited
fields terminated by ','
stored as textfile
location '/cookie';

 

验证表中是否有数据:

hive (default)> select * from cookie;
OK
cookie.cookieid    cookie.createtime    cookie.pv
cookie1    2015-04-10    1
cookie1    2015-04-11    5
cookie1    2015-04-12    7
cookie1    2015-04-13    3
cookie1    2015-04-14    2
cookie1    2015-04-15    4
cookie1    2015-04-16    4
Time taken: 0.098 seconds, Fetched: 7 row(s)

 

 窗口统计:

select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime) as pv1,--默认为起点到当前行
sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2,--current row代表当前行
sum(pv) over(partition by cookieid) as pv3,--如果不指定rows between,统计所有行
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as pv4,--当前行+往前3行
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5,--当前行+往前3行+往后1行
sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as pv6
from cookie;

 

 统计结果:

OK
cookieid    createtime    pv    pv1    pv2    pv3    pv4    pv5    pv6
cookie1        2015-04-16    4    26    26    26    13    13    4
cookie1        2015-04-15    4    22    22    26    16    20    8
cookie1        2015-04-14    2    18    18    26    17    21    10
cookie1        2015-04-13    3    16    16    26    16    18    13
cookie1        2015-04-12    7    13    13    26    13    16    20
cookie1        2015-04-11    5    6    6    26    6    13    25
cookie1        2015-04-10    1    1    1    26    1    6    26
Time taken: 49.42 seconds, Fetched: 7 row(s)

注意,结果和ORDER BY相关,默认为升序

如果不指定ROWS BETWEEN,默认为从起点到当前行;
如果不指定ORDER BY,则将分组内所有值累加;
关键是理解ROWS BETWEEN含义,也叫做WINDOW子句:
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:当前行
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点

–其他AVG,MIN,MAX,和SUM用法一样。

 转自:http://lxw1234.com/archives/2015/04/176.htm

posted @ 2016-08-04 16:25  dongdone  阅读(332)  评论(0编辑  收藏  举报