Hive之窗口函数
普通的聚合函数聚合的行集是组,开窗函数聚合的行集是窗口。
因此,普通的聚合函数每组(Group by)只返回一个值,而开窗函数则可为窗口中的每行都返回一个值。
简单理解,就是对查询的结果多出一列,这一列可以是聚合值,也可以是排序值。 开窗函数一般分为两类,聚合开窗函数和排序开窗函数。
OVER():指定分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变而变化
CURRENT ROW:当前行
n PRECEDING:往前n行数据
n FOLLOWING:往后n行数据
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING表示到后面的终点
LAG(col,n):往前第n行数据
LEAD(col,n):往后第n行数据
NTILE(n):把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,NTILE返回此行所属的组的编号。注意:n必须为int类型。
测试数据:
name,orderdate,cost jack,2017-01-01,10 tony,2017-01-02,15 jack,2017-02-03,23 tony,2017-01-04,29 jack,2017-01-05,46 jack,2017-04-06,42 tony,2017-01-07,50 jack,2017-01-08,55 mart,2017-04-08,62 mart,2017-04-09,68 neil,2017-05-10,12 mart,2017-04-11,75 neil,2017-06-12,80 mart,2017-04-13,94
需求:
(1)查询在2017年4月份购买过的顾客及总人数
(2)查询顾客的购买明细及月购买总额
(3)上述的场景,要将cost按照日期进行累加
(4)查询顾客上次的购买时间
(5)查询前20%时间的订单信息
建表、导数:
create table business(name string, orderdate string, cost int) row format delimited fields terminated by ','; load data local inpath "/opt/data/business.txt" into table business;
(1)查询在2017年4月份购买过的顾客及总人数
select name, count(*) as cnt from business where date_format(orderdate,'yyyy-MM')='2017-04' group by name;
(2)查询顾客的购买明细及月购买总额
select name, orderdate, cost, sum(cost) over (partition by month(orderdate)) as month_sum from business order by name,orderdate;
(3)上述的场景,要将cost按照日期进行累加
select name,orderdate,cost, sum(cost) over (partition by name order by orderdate rows between unbounded preceding and current row) as accumulate_cost, --按name分组,按orderdate排序,求从组内首行到当前行之和 sum(cost) over(partition by name order by orderdate) as accumulate_cost2, --效果和上个一样 sum(cost) over () as total_cost, --所有cost求和 sum(cost) over (partition by name) as total_cost_by_person, --每个name组内之和 sum(cost) over (partition by name order by orderdate rows between 1 preceding and current row) as double_cost, --组内按orderdate排序,前一行和当前行之和 sum(cost) over (partition by name order by orderdate rows between 1 preceding and 1 following) as triple_cost, --组内按orderdate排序,前一行、当前行、下一行,三行之和 sum(cost) over (partition by name order by orderdate rows between current row and unbounded following) as total_next_cost --组内按orderdate排序,当前行至组内尾行之和 from business;
(4)查询顾客上次的购买时间
select name, orderdate, lag(orderdate,1) over(partition by name order by orderdate) as time2 from business;
(5)查询前20%时间的订单信息,
ntile(n)用于将分组数据按照顺序切分成n片,返回当前记录所在的切片值,经常用来取前30% 带有百分之多少比例的记录什么的。
select name,orderdate,cost, ntile(5) over (order by orderdate) sorted from business;
查询20%即sorted=1
select t.* from ( select name, orderdate, cost, ntile(5) over (order by orderdate) sorted from business ) t where t.sorted=1;
扩展:
求前40%时间和后60%时间内的平均cost
--<1>按照orderdate升序拆分成5分
select name, orderdate, cost, ntile(5) over (order by orderdate) as prt from business;
--<2>前2份prt标识为1,是前40%的时间;后3份prt标识为2,是后60%的时间
select t.name, t.orderdate, t.cost, case when prt in (1,2) then 1 else 2 end as new_prt from( select name, orderdate, cost, ntile(5) over (order by orderdate) as prt from business ) t;
--<3>--取每个new_prt的平均值
select new_prt, max(case when new_prt=1 then "first_40%_avg_cost" when new_prt=2 then "last_60%_avg_cost" end) as prt_cost_name, cast(avg(cost)as decimal(10,2)) as avg_cost from( select t.name, t.orderdate, t.cost, case when prt in (1,2) then 1 else 2 end as new_prt from( select name, orderdate, cost, ntile(5) over (order by orderdate) as prt from business ) t ) q group by q.new_prt;