Hive(三)分组、聚合、排序、窗口函数
最近看了一遍hive的文档,本文是为了记录文档中将来会可用东西,并非最全的《文档》,望谅解
一:建表语句
drop table window_test;
create external table if not exists window_test
(
name string,
score string
)
row format delimited
fields terminated by '|'
location '/hive/table/window_test';
二:排序函数
关键字 | 含义 |
sortby | 分区内有序 |
orderby | 全局有序 |
distrbuteby | 分区函数 |
clusterby |
sortby distrbuteby字段相同 只能升序排列 |
三:函数操作
1.sum与over实现定制sum
//查询当前行和下两行score总分的数据
select name, score, sum(score)over(order by name rows between current row and 2 following) from window_test;
//查询当前行和上两行score总分的数据
select name, score, sum(score)over(order by name rows between 2 preceding and current row) from window_test;
//查询当前行和以下行score总分的数据
select name, score, sum(score)over(order by name rows between current row and unbounded following) from window_test;
//查询当前行和以上行score总分的数据
select name, score, sum(score)over(order by name rows between unbounded preceding and current row) from window_test;
2.排序
//排序函数 每一个name分组后从1开始排序,相同的数据列序号相同,后一个加一
select name, score, rank()over(partition by name order by score) from window_test;
//排序函数 每一个name分组后从1开始排序,相同的数据列序号相同,后一个不加一
select name, score, dense_rank()over(partition by name order by score) from window_test;
//排序函数 每一个name分组后从1开始排序,相同的数据列序号不相同
select name, score, row_number()over(partition by name order by score) from window_test;
3.第一个值、最后一个值
//分组内排正序取第一个score
select name, score, first_value(score)over(partition by name order by score) from window_test;
//分组内排正序取最后一个score
select name, score, last_value(score)over(partition by name order by score range between unbounded preceding and unbounded following) from window_test;
//函数的结果是错误的,必须要前面一大段
//select name, score, last_value(score)over(partition by name order by score) from window_test;
4.数据动态等级划分
//根据分数将数据分为三个区域
select name, score, ntile(3)over(order by score) from window_test;
5.上一行、下一行
//下两行的score提过来,没有填充NULL
select name, score, lead(score,2)over(partition by name order by score) from window_test;
//上两行的score提过来,没有填充NULL
select name, score, lag(score,2)over(partition by name order by score) from window_test;
6.占比函数
//小于当前值SCORE的人数占总人数的占比(name分组)
select name, score, cume_dist()over(partition by name order by score) from window_test;
7.行列转
//行转列
select name,concat_ws(',',collect_list(score)) as score_value from window_test group by name;
//列转行
select name,score,score_value from window_test lateral view explode(split(score,','))num as score_value;
四:聚合函数
1.GROUPING SET、grouping__id
select grouping__id,month,day,count(1) from tranods_dev.user_detail group by month,day grouping sets((month,day),month,day,());
grouping__id:grouping sets()里面的值,0代表(month,day),1代表month,2代表day,3代表count(1)
2.with rollup
select grouping__id,month,day,count(1) from tranods_dev.user_detail group by month,day with rollup order by grouping__id ;
结果:all,group by month,group by month,day 以左边的数据为主,依次递减
3.cube
select grouping__id,month,day,count(1) from tranods_dev.user_detail group by cube(month,day) order by grouping__id ;
结果:全量维度数据group by
4.配置
set hive.new.job.grouping.set.cardinality = 30;
这条设置的意义在于告知解释器,group by之前,每条数据复制量在30份以内。2^维度数 大于5的时候就是32