Hive练习题1-累计报表
如下测试数据
A,2015-01,5 A,2015-01,15 B,2015-01,5 A,2015-01,8 B,2015-01,25 A,2015-01,5 C,2015-01,10 C,2015-01,20 A,2015-02,4 A,2015-02,6 C,2015-02,30 C,2015-02,10 B,2015-02,10 B,2015-02,5 A,2015-03,14 A,2015-03,6 B,2015-03,20 B,2015-03,25 C,2015-03,10 C,2015-03,20
需求:
求单月访问次数和总访问次数,截至当前最大访问次数
建表
create table t_access(name string, month string, time int) row format delimited fields terminated by ',';
导数
load data local inpath "/home/access_time.txt" into table t_access;
查询
select * from t_access;
传统做法
求每个用户,每月的总访问次数,数据放到一张临时表
create table t_access_tmp as select name, month, sum(time) as month_times from t_access group by name,month;
临时表自连接
select a.*, b.*
from t_access_tmp a join t_access_tmp b on a.name = b.name
where a.month<=b.month
按照b.name,b.month聚合,求最大值和求和。
select t.bname, t.bmonth, sum(t.month_times) as accu_times, max(t.month_times) as max_times from( select b.name as bname, b.month as bmonth, a.month_times as month_times from t_access_tmp a
join t_access_tmp b on a.name = b.name where a.month<=b.month) as t group by t.bname,t.bmonth;
或者简写成
select a.name, a.month, sum(b.times) as accumulate
from (select name,month,sum(time) as times from t_access group by name,month) a join (select name,month,sum(time) as times from t_access group by name,month) b on a.name = b.name where a.month >= b.month group by a.name, a.month;
用开窗函数做法,效率更高。
curent_month_times是当前用户,当前月的访问次数
accu_times是当前用户,截至当前月的累计访问次数
max_times是当前用户,截至当前月的最大的月访问次数
select t.name, t.month, t.month_times as curent_month_times, sum(t.month_times)over(partition by t.name order by t.month rows between unbounded preceding and current row ) as accu_times, max(t.month_times)over(partition by t.name order by t.month rows between unbounded preceding and current row ) as max_times from( select name, month, sum(time) as month_times from t_access group by name,month ) t;
首先对原始数据,根据name,month聚合。聚合后的结果用分析函数求累计
在这个题中,需要加rows between unbounded preceding and current row,意思是从这一组的第一行开始到当前行需要进行聚合计算。