Hive函数:SUM,AVG,MIN,MAX
转自:http://lxw1234.com/archives/2015/04/176.htm,Hive分析窗口函数(一) SUM,AVG,MIN,MAX
之前看到大数据田地有关于max()over(partition by)的用法,今天恰好工作中用到了它,但是使用中遇到了一个问题:在max(rsrp)over(partition by buildingid,height) as max_rsrp返回的结果不是分组中的最大值。最中找到了问题的原因:max_rsrp数据类型为string而不是double类型,导致的一个bug问题。
再处理的过程中也再次把大数据田地的中关于sum,avg,max,min的函数用法做了demo,因此有了该参考后的文章。
数据准备:
echo ''>data_file.txt vim data_file.txt cookie1,2015-04-10,1 cookie1,2015-04-11,5 cookie1,2015-04-12,7 cookie1,2015-04-13,3 cookie1,2015-04-14,2 cookie1,2015-04-15,4 cookie1,2015-04-16,4 cookie2,2015-04-10,6 cookie2,2015-04-11,5 cookie2,2015-04-12,7 cookie2,2015-04-13,4 cookie2,2015-04-14,3 cookie2,2015-04-15,5 cookie2,2015-04-16,5 hadoop fs -rm -r /user/jrf/test_data hadoop fs -mkdir /user/jrf/test_data hadoop fs -copyFromLocal data_file.txt /user/jrf/test_data/
drop table if exists test_data;
create EXTERNAL TABLE test_data ( cookieid string, createtime string, --day pv INT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' stored as textfile location '/user/jrf/test_data/';
select * from test_data; +---------------------+-----------------------+---------------+--+ | test_data.cookieid | test_data.createtime | test_data.pv | +---------------------+-----------------------+---------------+--+ | cookie1 | 2015-04-10 | 1 | | cookie1 | 2015-04-11 | 5 | | cookie1 | 2015-04-12 | 7 | | cookie1 | 2015-04-13 | 3 | | cookie1 | 2015-04-14 | 2 | | cookie1 | 2015-04-15 | 4 | | cookie1 | 2015-04-16 | 4 | | cookie2 | 2015-04-10 | 6 | | cookie2 | 2015-04-11 | 5 | | cookie2 | 2015-04-12 | 7 | | cookie2 | 2015-04-13 | 4 | | cookie2 | 2015-04-14 | 3 | | cookie2 | 2015-04-15 | 5 | | cookie2 | 2015-04-16 | 5 | +---------------------+-----------------------+---------------+--+
SUM — 注意,结果和ORDER BY相关,默认为升序
SELECT cookieid,createtime,pv, SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行 SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1 SUM(pv) OVER(PARTITION BY cookieid) AS pv3,--分组内所有行 SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,--当前行+往前3行 SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,--当前行+往前3行+往后1行 SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 --当前行+往后所有行 FROM test_data order by cookieid,createtime; +-----------+-------------+-----+------+------+------+------+------+------+--+ | cookieid | createtime | pv | pv1 | pv2 | pv3 | pv4 | pv5 | pv6 | +-----------+-------------+-----+------+------+------+------+------+------+--+ | cookie1 | 2015-04-10 | 1 | 1 | 1 | 26 | 1 | 6 | 26 | | cookie1 | 2015-04-11 | 5 | 6 | 6 | 26 | 6 | 13 | 25 | | cookie1 | 2015-04-12 | 7 | 13 | 13 | 26 | 13 | 16 | 20 | | cookie1 | 2015-04-13 | 3 | 16 | 16 | 26 | 16 | 18 | 13 | | cookie1 | 2015-04-14 | 2 | 18 | 18 | 26 | 17 | 21 | 10 | | cookie1 | 2015-04-15 | 4 | 22 | 22 | 26 | 16 | 20 | 8 | | cookie1 | 2015-04-16 | 4 | 26 | 26 | 26 | 13 | 13 | 4 | | cookie2 | 2015-04-10 | 6 | 6 | 6 | 35 | 6 | 11 | 35 | | cookie2 | 2015-04-11 | 5 | 11 | 11 | 35 | 11 | 18 | 29 | | cookie2 | 2015-04-12 | 7 | 18 | 18 | 35 | 18 | 22 | 24 | | cookie2 | 2015-04-13 | 4 | 22 | 22 | 35 | 22 | 25 | 17 | | cookie2 | 2015-04-14 | 3 | 25 | 25 | 35 | 19 | 24 | 13 | | cookie2 | 2015-04-15 | 5 | 30 | 30 | 35 | 19 | 24 | 10 | | cookie2 | 2015-04-16 | 5 | 35 | 35 | 35 | 17 | 17 | 5 | +-----------+-------------+-----+------+------+------+------+------+------+--+ pv1: 分组内从起点到当前行的pv累积,如,11号的pv1=10号的pv+11号的pv, 12号=10号+11号+12号 pv2: 同pv1 pv3: 分组内(cookie1)所有的pv累加 pv4: 分组内当前行+往前3行,如,11号=10号+11号, 12号=10号+11号+12号, 13号=10号+11号+12号+13号, 14号=11号+12号+13号+14号 pv5: 分组内当前行+往前3行+往后1行,如,14号=11号+12号+13号+14号+15号=5+7+3+2+4=21 pv6: 分组内当前行+往后所有行,如,13号=13号+14号+15号+16号=3+2+4+4=13,14号=14号+15号+16号=2+4+4=10
如果不指定ROWS BETWEEN,默认为从起点到当前行;
如果不指定ORDER BY,则将分组内所有值累加;
关键是理解ROWS BETWEEN含义,也叫做WINDOW子句:
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:当前行
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点
–其他AVG,MIN,MAX,和SUM用法一样。
--AVG SELECT cookieid,createtime,pv, AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行 AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1 AVG(pv) OVER(PARTITION BY cookieid) AS pv3,--分组内所有行 AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,--当前行+往前3行 AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,--当前行+往前3行+往后1行 AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 --当前行+往后所有行 FROM test_data order by cookieid,createtime; +-----------+-------------+-----+---------------------+---------------------+---------------------+--------------------+--------------------+---------------------+--+ | cookieid | createtime | pv | pv1 | pv2 | pv3 | pv4 | pv5 | pv6 | +-----------+-------------+-----+---------------------+---------------------+---------------------+--------------------+--------------------+---------------------+--+ | cookie1 | 2015-04-10 | 1 | 1.0 | 1.0 | 3.7142857142857144 | 1.0 | 3.0 | 3.7142857142857144 | | cookie1 | 2015-04-11 | 5 | 3.0 | 3.0 | 3.7142857142857144 | 3.0 | 4.333333333333333 | 4.166666666666667 | | cookie1 | 2015-04-12 | 7 | 4.333333333333333 | 4.333333333333333 | 3.7142857142857144 | 4.333333333333333 | 4.0 | 4.0 | | cookie1 | 2015-04-13 | 3 | 4.0 | 4.0 | 3.7142857142857144 | 4.0 | 3.6 | 3.25 | | cookie1 | 2015-04-14 | 2 | 3.6 | 3.6 | 3.7142857142857144 | 4.25 | 4.2 | 3.3333333333333335 | | cookie1 | 2015-04-15 | 4 | 3.6666666666666665 | 3.6666666666666665 | 3.7142857142857144 | 4.0 | 4.0 | 4.0 | | cookie1 | 2015-04-16 | 4 | 3.7142857142857144 | 3.7142857142857144 | 3.7142857142857144 | 3.25 | 3.25 | 4.0 | | cookie2 | 2015-04-10 | 6 | 6.0 | 6.0 | 5.0 | 6.0 | 5.5 | 5.0 | | cookie2 | 2015-04-11 | 5 | 5.5 | 5.5 | 5.0 | 5.5 | 6.0 | 4.833333333333333 | | cookie2 | 2015-04-12 | 7 | 6.0 | 6.0 | 5.0 | 6.0 | 5.5 | 4.8 | | cookie2 | 2015-04-13 | 4 | 5.5 | 5.5 | 5.0 | 5.5 | 5.0 | 4.25 | | cookie2 | 2015-04-14 | 3 | 5.0 | 5.0 | 5.0 | 4.75 | 4.8 | 4.333333333333333 | | cookie2 | 2015-04-15 | 5 | 5.0 | 5.0 | 5.0 | 4.75 | 4.8 | 5.0 | | cookie2 | 2015-04-16 | 5 | 5.0 | 5.0 | 5.0 | 4.25 | 4.25 | 5.0 | +-----------+-------------+-----+---------------------+---------------------+---------------------+--------------------+--------------------+---------------------+--+ --MIN SELECT cookieid,createtime,pv, MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行 MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,--从起点到当前行,结果同pv1 MIN(pv) OVER(PARTITION BY cookieid) AS pv3,--分组内所有行 MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,--当前行+往前3行 MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,--当前行+往前3行+往后1行 MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 --当前行+往后所有行 FROM test_data order by cookieid,createtime; +-----------+-------------+-----+------+------+------+------+------+------+--+ | cookieid | createtime | pv | pv1 | pv2 | pv3 | pv4 | pv5 | pv6 | +-----------+-------------+-----+------+------+------+------+------+------+--+ | cookie1 | 2015-04-10 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | | cookie1 | 2015-04-11 | 5 | 1 | 1 | 1 | 1 | 1 | 2 | | cookie1 | 2015-04-12 | 7 | 1 | 1 | 1 | 1 | 1 | 2 | | cookie1 | 2015-04-13 | 3 | 1 | 1 | 1 | 1 | 1 | 2 | | cookie1 | 2015-04-14 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | | cookie1 | 2015-04-15 | 4 | 1 | 1 | 1 | 2 | 2 | 4 | | cookie1 | 2015-04-16 | 4 | 1 | 1 | 1 | 2 | 2 | 4 | | cookie2 | 2015-04-10 | 6 | 6 | 6 | 3 | 6 | 5 | 3 | | cookie2 | 2015-04-11 | 5 | 5 | 5 | 3 | 5 | 5 | 3 | | cookie2 | 2015-04-12 | 7 | 5 | 5 | 3 | 5 | 4 | 3 | | cookie2 | 2015-04-13 | 4 | 4 | 4 | 3 | 4 | 3 | 3 | | cookie2 | 2015-04-14 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | | cookie2 | 2015-04-15 | 5 | 3 | 3 | 3 | 3 | 3 | 5 | | cookie2 | 2015-04-16 | 5 | 3 | 3 | 3 | 3 | 3 | 5 | +-----------+-------------+-----+------+------+------+------+------+------+--+ --MAX SELECT cookieid,createtime,pv, MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行 MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1 MAX(pv) OVER(PARTITION BY cookieid) AS pv3, --分组内所有行 MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行 MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行 MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 --当前行+往后所有行 FROM test_data order by cookieid,createtime; +-----------+-------------+-----+------+------+------+------+------+------+--+ | cookieid | createtime | pv | pv1 | pv2 | pv3 | pv4 | pv5 | pv6 | +-----------+-------------+-----+------+------+------+------+------+------+--+ | cookie1 | 2015-04-10 | 1 | 1 | 1 | 7 | 1 | 5 | 7 | | cookie1 | 2015-04-11 | 5 | 5 | 5 | 7 | 5 | 7 | 7 | | cookie1 | 2015-04-12 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | | cookie1 | 2015-04-13 | 3 | 7 | 7 | 7 | 7 | 7 | 4 | | cookie1 | 2015-04-14 | 2 | 7 | 7 | 7 | 7 | 7 | 4 | | cookie1 | 2015-04-15 | 4 | 7 | 7 | 7 | 7 | 7 | 4 | | cookie1 | 2015-04-16 | 4 | 7 | 7 | 7 | 4 | 4 | 4 | | cookie2 | 2015-04-10 | 6 | 6 | 6 | 7 | 6 | 6 | 7 | | cookie2 | 2015-04-11 | 5 | 6 | 6 | 7 | 6 | 7 | 7 | | cookie2 | 2015-04-12 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | | cookie2 | 2015-04-13 | 4 | 7 | 7 | 7 | 7 | 7 | 5 | | cookie2 | 2015-04-14 | 3 | 7 | 7 | 7 | 7 | 7 | 5 | | cookie2 | 2015-04-15 | 5 | 7 | 7 | 7 | 7 | 7 | 5 | | cookie2 | 2015-04-16 | 5 | 7 | 7 | 7 | 5 | 5 | 5 | +-----------+-------------+-----+------+------+------+------+------+------+--+ SELECT cookieid, createtime, pv, min(pv) OVER(PARTITION BY cookieid) AS min_pv, max(pv) OVER(PARTITION BY cookieid) AS max_pv FROM test_data; +-----------+-------------+-----+---------+---------+--+ | cookieid | createtime | pv | min_pv | max_pv | +-----------+-------------+-----+---------+---------+--+ | cookie1 | 2015-04-10 | 1 | 1 | 7 | | cookie1 | 2015-04-16 | 4 | 1 | 7 | | cookie1 | 2015-04-15 | 4 | 1 | 7 | | cookie1 | 2015-04-14 | 2 | 1 | 7 | | cookie1 | 2015-04-13 | 3 | 1 | 7 | | cookie1 | 2015-04-12 | 7 | 1 | 7 | | cookie1 | 2015-04-11 | 5 | 1 | 7 | | cookie2 | 2015-04-16 | 5 | 3 | 7 | | cookie2 | 2015-04-15 | 5 | 3 | 7 | | cookie2 | 2015-04-14 | 3 | 3 | 7 | | cookie2 | 2015-04-13 | 4 | 3 | 7 | | cookie2 | 2015-04-12 | 7 | 3 | 7 | | cookie2 | 2015-04-11 | 5 | 3 | 7 | | cookie2 | 2015-04-10 | 6 | 3 | 7 | +-----------+-------------+-----+---------+---------+--+
基础才是编程人员应该深入研究的问题,比如:
1)List/Set/Map内部组成原理|区别
2)mysql索引存储结构&如何调优/b-tree特点、计算复杂度及影响复杂度的因素。。。
3)JVM运行组成与原理及调优
4)Java类加载器运行原理
5)Java中GC过程原理|使用的回收算法原理
6)Redis中hash一致性实现及与hash其他区别
7)Java多线程、线程池开发、管理Lock与Synchroined区别
8)Spring IOC/AOP 原理;加载过程的。。。
【+加关注】。