Hive 窗口函数之——cume_dist,percent_rank
说明
今天介绍下 hive 中两个常用的分布函数
数据准备
select 1 part_id, 1001 user_id, 1000 fee union all select 1, 1002, 2000 union all select 1, 1003, 3000 union all select 2, 2001, 4000 union all select 2, 2002, 5000;
part_id 部门ID,user_id 用户ID fee 费用
cume_dist :小于等于当前值的行数/分组内总行数;比如,统计小于等于当前薪水的人数,所占总人数的比例
select part_id, user_id, fee, cume_dist() over (order by fee) cd_1, cume_dist() over (partition by part_id order by fee) cd_2 from (select 1 part_id, 1001 user_id, 1000 fee union all select 1, 1002, 2000 union all select 1, 1003, 3000 union all select 2, 2001, 4000 union all select 2, 2002, 5000) t;
输出结果
cd_1: 没有partition,所有数据均为1组,总行数为5, 第1行:小于等于1000的行数为1,因此,1/5=0.2 第2行:小于等于2000的行数为2,因此,2/5=0.4 第2行:小于等于3000的行数为3,因此,3/5=0.6 第4行:小于等于4000的行数为4,因此,4/5=0.8 第5行:小于等于5000的行数为5,因此,5/5=1 cd_2: 按照部门分组,part=1的行数为3,part=2的行数为2 第1行:part=1分组内小于等于1000的行数为1,因此,1/3=0.3333333333333333 第2行:part=1分组内小于等于2000的行数为2,因此,2/3=0.6666666666666666 第3行:part=1分组内小于等于3000的行数为3,因此,3/3=1 第4行:part=2分组内小于等于4000的行数为1,因此,1/2=0.5 第2行:part=2分组内小于等于5000的行数为2,因此,2/2=1
percent_rank
分组内当前行的RANK值-1/分组内总行数-1
select part_id, user_id, fee, percent_rank() over (order by fee) as rn1, --分组内 rank() over (order by fee) as rn11, --分组内的rank值 sum(1) over () as rn12, --分组内总行数 percent_rank() over (partition by part_id order by fee) as rn2, rank() over (partition by part_id order by fee) as rn21, sum(1) over (partition by part_id) as rn22 from (select 1 part_id, 1001 user_id, 1000 fee union all select 1, 1002, 2000 union all select 1, 1003, 3000 union all select 2, 2001, 4000 union all select 2, 2002, 5000) t;
–PERCENT_RANK :分组内当前行的RANK值-1/分组内总行数-1
rn1 == (rn11-1) / (rn12-1)
rn2 == (rn21-1) / (rn22-1)
rn1: rn1 = (rn11-1) / (rn12-1) 第一行,(1-1)/(5-1)=0/4=0 第二行,(2-1)/(5-1)=1/4=0.25 第四行,(4-1)/(5-1)=3/4=0.75 rn2: 按照dept分组, dept=d1的总行数为3 第一行,(1-1)/(3-1)=0 第三行,(3-1)/(3-1)=1