Hive 窗口函数使用(1)

一、窗口函数
官方文档地址:https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
oracle,sqlserver都提供了窗口函数,但是在mysql5.5和5.6都没有提供窗口函数!

窗口函数: 窗口+函数
窗口: 函数运行时计算的数据集的范围
函数: 运行的函数!
仅仅支持以下函数:
Windowing functions:
LEAD:
LEAD (scalar_expression [,offset] [,default]): 返回当前行以下N行的指定列的列值!
如果找不到,就采用默认值
LAG:
LAG (scalar_expression [,offset] [,default]): 返回当前行以上N行的指定列的列值!
如果找不到,就采用默认值
FIRST_VALUE:
FIRST_VALUE(列名,[false(默认)]): 返回当前窗口指定列的第一个值,
第二个参数如果为true,代表加入第一个值为null,跳过空值,继续寻找!
LAST_VALUE:
LAST_VALUE(列名,[false(默认)]): 返回当前窗口指定列的最后一个值,
第二个参数如果为true,代表加入第一个值为null,跳过空值,继续寻找!
统计类的函数(一般都需要结合over使用): min,max,avg,sum,count
排名分析:
RANK
ROW_NUMBER
DENSE_RANK
CUME_DIST
PERCENT_RANK
NTILE

注意:不是所有的函数在运行都是可以通过改变窗口的大小,来控制计算的数据集的范围!
所有的排名函数和LAG,LEAD,支持使用over(),但是在over()中不能定义 window_clause

格式: 函数 over( partition by 字段 ,order by 字段 window_clause )


窗口的大小可以通过windows_clause来指定:
(rows | range) between (unbounded | [num]) preceding and ([num] preceding | current row | (unbounded | [num]) following)
(rows | range) between current row and (current row | (unbounded | [num]) following)
(rows | range) between [num] following and (unbounded | [num]) following

特殊情况: ①在over()中既没有出现windows_clause,也没有出现order by,窗口默认为rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING
②在over()中(没有出现windows_clause),指定了order by,窗口默认为rows between UNBOUNDED PRECEDING and CURRENT ROW

窗口函数和分组有什么区别?
①如果是分组操作,select后只能写分组后的字段
②如果是窗口函数,窗口函数是在指定的窗口内,对每条记录都执行一次函数
③如果是分组操作,有去重效果,而partition不去重!

business.name | business.orderdate | business.cost

(9) 查询前20%时间的订单信息
精确算法:
select *
from
(select name,orderdate,cost,cume_dist() over(order by orderdate ) cdnum
from business) tmp
where cdnum<=0.2

不精确计算:
select *
from
(select name,orderdate,cost,ntile(5) over(order by orderdate ) cdnum
from business) tmp
where cdnum=1



(8)查询顾客的购买明细及顾客最近三次cost花费

最近三次: 当前和之前两次 或 当前+前一次+后一次

当前和之前两次:
select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between 2 PRECEDING and CURRENT row)
from business

当前+前一次+后一次:
select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and 1 FOLLOWING)
from business


select name,orderdate,cost,cost+
lag(cost,1,0) over(partition by name order by orderdate )+
lead(cost,1,0) over(partition by name order by orderdate )
from business



(7) 查询顾客的购买明细及顾客本月最后一次购买的时间
select name,orderdate,cost,LAST_VALUE(orderdate,true) over(partition by name,substring(orderdate,1,7) order by orderdate rows between CURRENT row and UNBOUNDED FOLLOWING)
from business


(6) 查询顾客的购买明细及顾客本月第一次购买的时间
select name,orderdate,cost,FIRST_VALUE(orderdate,true) over(partition by name,substring(orderdate,1,7) order by orderdate )
from business


(5) 查询顾客的购买明细及顾客下次的购买时间
select name,orderdate,cost,lead(orderdate,1,'无数据') over(partition by name order by orderdate )
from business

(4)查询顾客的购买明细及顾客上次的购买时间
select name,orderdate,cost,lag(orderdate,1,'无数据') over(partition by name order by orderdate )
from business



(3)查询顾客的购买明细要将cost按照日期进行累加
select name,orderdate,cost,sum(cost) over(partition by name order by orderdate )
from business


(2)查询顾客的购买明细及月购买总额

select name,orderdate,cost,sum(cost) over(partition by name,substring(orderdate,1,7) )
from business






(1)查询在2017年4月份购买过的顾客及总人数

count()在分组后,统计一个组内所有的数据!

传统写法: 效率低
with tmp as (select name
from business
where year(orderdate)=2017 and month(orderdate)=4
group by name)

select tmp.name,tmp1.totalcount
from
(select count(*) totalcount
from tmp ) tmp1 join tmp;

---------------
select name,count(*) over(rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING)
from business
where substring(orderdate,1,7)='2017-04'
group by name

等价于

select name,count(*) over()
from business
where substring(orderdate,1,7)='2017-04'
group by name

 

posted @ 2020-12-02 22:03  [行者]  阅读(137)  评论(0编辑  收藏  举报