Hive函数:CUME_DIST,PERCENT_RANK
参考自:大数据田地http://lxw1234.com/archives/2015/04/185.htm
数据准备:
d1,user1,1000 d1,user2,2000 d1,user3,3000 d2,user4,4000 d2,user5,5000 CREATE EXTERNAL TABLE test_data ( dept STRING, userid string, sal INT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' stored as textfile location '/user/jc_rc_ftp/test_data'; hive> select * from test_data; +-----------------+-------------------+----------------+--+ | test_data.dept | test_data.userid | test_data.sal | +-----------------+-------------------+----------------+--+ | d1 | user1 | 1000 | | d1 | user2 | 2000 | | d1 | user3 | 3000 | | d2 | user4 | 4000 | | d2 | user5 | 5000 | +-----------------+-------------------+----------------+--+
CUME_DIST
–CUME_DIST 小于等于当前值的行数/分组内总行数
–比如,统计小于等于当前薪水的人数,所占总人数的比例
SELECT dept, userid, sal, CUME_DIST() OVER(ORDER BY sal) AS rn1, CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2 FROM test_data; +-------+---------+-------+------+---------------------+--+ | dept | userid | sal | rn1 | rn2 | +-------+---------+-------+------+---------------------+--+ | d1 | user1 | 1000 | 0.2 | 0.3333333333333333 | | d1 | user2 | 2000 | 0.4 | 0.6666666666666666 | | d1 | user3 | 3000 | 0.6 | 1.0 | | d2 | user4 | 4000 | 0.8 | 0.5 | | d2 | user5 | 5000 | 1.0 | 1.0 | +-------+---------+-------+------+---------------------+--+ rn1: 没有partition,所有数据均为1组,总行数为5, 第一行:小于等于1000的行数为1,因此,1/5=0.2 第三行:小于等于3000的行数为3,因此,3/5=0.6 rn2: 按照部门分组,dpet=d1的行数为3, 第二行:小于等于2000的行数为2,因此,2/3=0.6666666666666666
PERCENT_RANK
–PERCENT_RANK 分组内当前行的RANK值-1/分组内总行数-1
应用场景不了解,可能在一些特殊算法的实现中可以用到吧。
SELECT dept, userid, sal, PERCENT_RANK() OVER(ORDER BY sal) AS rn1, --分组内 RANK() OVER(ORDER BY sal) AS rn11, --分组内RANK值 SUM(1) OVER(PARTITION BY NULL) AS rn12, --分组内总行数 PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2 FROM test_data; +-------+---------+-------+-------+-------+-------+------+--+ | dept | userid | sal | rn1 | rn11 | rn12 | rn2 | +-------+---------+-------+-------+-------+-------+------+--+ | d1 | user1 | 1000 | 0.0 | 1 | 5 | 0.0 | | d1 | user2 | 2000 | 0.25 | 2 | 5 | 0.5 | | d1 | user3 | 3000 | 0.5 | 3 | 5 | 1.0 | | d2 | user4 | 4000 | 0.75 | 4 | 5 | 0.0 | | d2 | user5 | 5000 | 1.0 | 5 | 5 | 1.0 | +-------+---------+-------+-------+-------+-------+------+--+ 这样只要排序字段为null,就会放在最后,而不会影响排序结果 rn1: rn1 = (rn11-1) / (rn12-1) 第一行,(1-1)/(5-1)=0/4=0 第二行,(2-1)/(5-1)=1/4=0.25 第四行,(4-1)/(5-1)=3/4=0.75 rn2: 按照dept分组, dept=d1的总行数为3 第一行,(1-1)/(3-1)=0 第三行,(3-1)/(3-1)=1
基础才是编程人员应该深入研究的问题,比如:
1)List/Set/Map内部组成原理|区别
2)mysql索引存储结构&如何调优/b-tree特点、计算复杂度及影响复杂度的因素。。。
3)JVM运行组成与原理及调优
4)Java类加载器运行原理
5)Java中GC过程原理|使用的回收算法原理
6)Redis中hash一致性实现及与hash其他区别
7)Java多线程、线程池开发、管理Lock与Synchroined区别
8)Spring IOC/AOP 原理;加载过程的。。。
【+加关注】。