Hive常用函数
Hive常用函数大全一览
hive中split、coalesce及collect_list函数的用法(可举例)
Split将字符串转化为数组。
split('a,b,c,d' , ',') ==> ["a","b","c","d"]
COALESCE(T v1, T v2, …) 返回参数中的第一个非空值;如果所有值都为 NULL,那么返回NULL。
collect_list列出该字段所有的值,不去重 select collect_list(id) from table;
常用时间函数:
将mongodb中时区转换过来: 由UTC时区 转换为 GMT时区,差8个小时
date_format(from_utc_timestamp( CONCAT_WS(' ',substring(updatetime,1,10),substring(updatetime,12,8) ) ,'GMT+8'),'yyyy-MM-dd HH:mm:ss') updatetime select date_format(from_utc_timestamp(create_time,"UTC"),'yyyy-MM-dd HH:mm:ss') as local_time select date_format(from_utc_timestamp(create_time,"GMT+8"),'yyyy-MM-dd HH:mm:ss') as local_time
时间戳 秒S是10位; 毫秒ms是13位;
date_format(from_unixtime(cast(h.updatetime as int)),'yyyy-MM-dd HH:mm:ss')
substring(h.id, 10, 24) 59409d1d2cdcc90b91c62be5 ObjectId(59409d1d2cdcc90b91c62be5)
今天: select date_format(current_timestamp,'yyyy-MM-dd')
前一天: select date_sub(current_date,1);
Hive中字段的合并
contact:简单合并功能;
CONCAT_WS("/",r.province,r.city,a.area) channel_address => 北京/北京市/朝阳区 ,字段必须是string; concat(payid,' ',carlogid) => 01a893092b914703b75941b713767ebf 408693
concat(substr(summary_time,9,2),'',substr(summary_time,6,2),'',substr(summary_time,1,4),'_',concat(market_id),'_' ,concat(mid)) 09022019_108_0 12022019_108_0 21022019_108_0 concat_ws("_" ,substr(summary_time,9,2),substr(summary_time,6,2),substr(summary_time,1,4),concat(market_id),concat(mid)) 09_03_2019_108_0 13_03_2019_108_0 21_03_2019_108_0
concat(sum(total_amount_pur),'&&',sum(total_amount_sig),'&&',sum(total_amount_jump))
0.0&&16665.0&&0.0
order by date desc, market_id asc;
date market_id total_fee 2019-10-08 110 23000 2019-10-08 110 13000 2019-10-08 141 1400 2019-10-08 141 2250 2019-10-08 218 4000 2019-10-08 218 1100 2019-10-08 218 2300 2019-10-08 218 4500 2019-10-08 234 0 2019-10-08 234 0
查询仅出现一次的数据
SELECT name,count(name) AS count_times FROM tb_test GROUP BY name HAVING count_times = 1;
查询语句返回某字段出现超过1次的所有记录
select * from stu where sname in (select sname from stu group by sname having count(sname)>1); 有重复的sname的记录,并计算相同sname的数量 select *,count(sname)as count from stu group by sname having (count(sname)>1);
Hive取非Group by字段数据的方法,但可加聚合函数如count、sum、avg、max等
方法①
输入数据为一个ID对应多个name,要求输出数据为ID是唯一的,name随便取一个就可以。
HIVE有这么一个函数collect_set,类似于mysql的group_concat函数,把每个分组的其他字段,按照逗号进行拼接,得到一个最终字符串:
collect_set(col)
返回类型:array
解释:返回一个去重后的对象集合
select sid,collect_set(class_id) from table1 group by sid; ===>> 1 [11,12,13] 2 [11,14] 3 [12,15] 可以针对第二列做一些计数、求和操作,分别对应到Hive的聚合函数count、sum。 对应到本文的目的,直接从数组获取第一个元素就达到目的了,这样做: select sid,collect_set(class_id)[0] from table1 group by sid;
方法② select后需要几个字段就gruop by几个
select sid,class_id from table1 group by sid, class_id;
相同的user_id,但是每个create_time不一样,现在的需求是根据create_time创建时间选取最早的,那么思路是现根据user_id进行分组,然后根据user_id,create_time进行排序,取row_number 为1的值
SELECT * FROM(SELECT *, row_number() OVER(PARTITION BY dt,id ORDER BY dt,id DESC) rk FROM test_table)t WHERE t.rk = 1
Hive查询中数值累加
1. 需求分析
现有 hive 表 record, 内容如下:
其中字段意义: channel_type(string) dt(string) num(int); 分别代表: 渠道类型 日期 该天个数,原数据模拟如下:
select * from record; channel_type dt num A 2015-01-01 8 A 2015-01-02 4 A 2015-01-02 5 C 2015-02-01 1 A 2015-01-04 5 A 2015-01-05 6 B 2015-01-03 2 B 2015-01-02 3 A 2015-01-03 2 C 2015-01-30 8 C 2015-01-30 7 B 2015-01-02 9 B 2015-01-01 1 C 2015-02-02 3
统计每个渠道截止到当天为止的最大单日人数和累计到该天的总人数:
# 先求出每个渠道每天总访问量:
create table record_nj as
select
channel_type,dt,sum(num) as new_join
from record
group by channel_type,dt;
channel_type dt new_join A 2015-01-01 8 A 2015-01-02 9 A 2015-01-03 2 A 2015-01-04 5 A 2015-01-05 6 B 2015-01-01 1 B 2015-01-02 12 B 2015-01-03 2 C 2015-01-30 15 C 2015-02-01 1 C 2015-02-02 3
方法一: 使用Hive窗口函数over max()、sum() select channel_type, dt, new_join, sum(new_join) over(partition by channel_type order by dt) as sum_count, max(new_join) over(partition by channel_type order by dt) as max_count from record_nj; 方法二:使用group by join自连接 select t1.channel_type,t1.dt,t1.new_join, sum(t2.new_join) sum_count, max(t2.new_join) max_count from record_nj t1 join record_nj t2 on t1.channel_type = t2.channel_type where t1.dt >= t2.dt group by t1.channel_type,t1.dt,t1.new_join order by t1.channel_type,t1.dt; 数据结果如下:
channel_type dt new_join sum_count max_count A 2015-01-01 8 8 8 A 2015-01-02 9 17 9 A 2015-01-03 2 19 9 A 2015-01-04 5 24 9 A 2015-01-05 6 30 9 B 2015-01-01 1 1 1 B 2015-01-02 12 13 12 B 2015-01-03 2 15 12 C 2015-01-30 15 15 15 C 2015-02-01 1 16 15 C 2015-02-02 3 19 15
累加、累乘、最大值:
select channel_type, new_join, sum(new_join) over(partition by channel_type order by dt) as sum_count,--累加 sum(new_join) over(partition by channel_type order by dt rows between unbounded preceding and current row) sum_count, --累加 round(power(10, sum(log(10, new_join))over(partition by channel_type order by dt rows between unbounded preceding and current row))) as tired,--累乘处理- max(new_join) over(partition by channel_type order by dt) as max_count --最大值 from record_nj;
行列转换
原数据如下:
select * from score; name subject score 孙悟空 语文 87 孙悟空 数学 95 孙悟空 英语 68 大海 语文 94 大海 数学 56 大海 英语 84 kris 语文 64 kris 数学 86 kris 英语 84 婷婷 语文 65 婷婷 数学 85 婷婷 英语 78
求语文成绩比数学成绩好的学生:
方法一:join
select s1.name,s1.subject, s1.score from score s1 inner join score s2 on s1.name = s2.name where s1.score > s2.score and s1.subject = '语文' and s2.subject = '数学';
s1.name s1.subject s1.score 大海 语文 94
方法二:行列转换 ① create table t1 AS select name, case subject when '语文' then score else 0 end as chinese_score, case subject when '数学' then score else 0 end as math_score from score;
name chinese_score math_score 孙悟空 87 0 孙悟空 0 95 孙悟空 0 0 大海 94 0 大海 0 56 大海 0 0 kris 64 0 kris 0 86 kris 0 0 婷婷 65 0 婷婷 0 85 婷婷 0 0
② create table t2 AS select name,max(chinese_score) chinese_score,max(math_score) math_score from t1 group by name;
name chinese_score math_score kris 64 86 大海 94 56 婷婷 65 85 孙悟空 87 95
③ select name, chinese_score, math_score from t2 where chinese_score > math_score;
或者三个hql合并为一个如下====> select name,chinese_score,math_score from(
select name,max(chinese_score) chinese_score,max(math_score) math_score from( select name, case subject when '语文' then score else 0 end as chinese_score, case subject when '数学' then score else 0 end as math_score from score)t1
group by t1.name
)t2 where chinese_score >= math_score ;
name chinese_score math_score 大海 94 56
列转行的实现:
数据如下:
id sname math computer english 1 Jed 34 58 58 2 Tony 45 87 45 3 Tom 76 34 89
select id, sname, 'math' as course, math as score from score
union
select id, sname, 'computer' as course, computer as score from score
union
select id, sname, 'english' as course, english as score from score
order by id, sname, course;
结果如下:
id sname course score 1 Jed computer 58 1 Jed english 58 1 Jed math 34 2 Tony computer 87 2 Tony english 45 2 Tony math 45 3 Tom computer 34 3 Tom english 89 3 Tom math 76
窗口函数
一、聚合函数sum、avg、max、min
create table cookie( cookie_id string, create_time string, pv int )row format delimited fields terminated by ","; a_cookie,2019-06-10,1 a_cookie,2019-06-11,9 a_cookie,2019-06-12,7 a_cookie,2019-06-13,3 a_cookie,2019-06-14,2 a_cookie,2019-06-15,4 a_cookie,2019-06-16,4 b_cookie,2019-08-17,6 b_cookie,2019-08-18,9 b_cookie,2019-08-19,5 b_cookie,2019-08-17,2 load data local inpath "/opt/module/datas/cookie.txt" into table cookie;
select
cookie_id,create_time,pv,
sum(pv) over(partition by cookie_id) as sum1, --对组内的pv值进行全部累加
sum(pv) over(partition by cookie_id order by create_time) as accu_sum --从第一行开始累加到当前行
from cookie;
数据如下:
a_cookie 2019-06-10 1 30 1 a_cookie 2019-06-11 9 30 10 a_cookie 2019-06-12 7 30 17 a_cookie 2019-06-13 3 30 20 a_cookie 2019-06-14 2 30 22 a_cookie 2019-06-15 4 30 26 a_cookie 2019-06-16 4 30 30 b_cookie 2019-08-17 2 22 8 b_cookie 2019-08-17 6 22 8 b_cookie 2019-08-18 9 22 17 b_cookie 2019-08-19 5 22 22
select cookie_id,create_time,pv, sum(pv) over(partition by cookie_id) as sum1, --对组内的pv值进行全部累加 sum(pv) over(partition by cookie_id order by create_time) as accu_sum, --从第一行开始累加到当前行 sum(pv) over(partition by cookie_id order by create_time rows between unbounded preceding and current row) as sum3, --默认就是从起点到当前行往前累加,所以between unbounded(起点) preceding(往前) and current row这个条件可以不写。 sum(pv) over(partition by cookie_id order by create_time rows between current row and unbounded following) as sum4, --从当前行累加到最前边,跟前一个是相反的累加,相当于是降序累加 sum(pv) over(partition by cookie_id order by create_time rows between 3 preceding and current row) as sum5, --取当前行-至往前数3行的数进行累加, 一共4行进行累加处理 sum(pv) over(partition by cookie_id order by create_time rows between 3 preceding and 1 following) as sum6 --取当前行,往前数3行,往后数1行的数值进行累加处理 from cookie;
数据如下:
a_cookie 2019-06-10 1 30 1 1 30 1 10 a_cookie 2019-06-11 9 30 10 10 29 10 17 a_cookie 2019-06-12 7 30 17 17 20 17 20 a_cookie 2019-06-13 3 30 20 20 13 20 22 a_cookie 2019-06-14 2 30 22 22 10 21 25 a_cookie 2019-06-15 4 30 26 26 8 16 20 a_cookie 2019-06-16 4 30 30 30 4 13 13 b_cookie 2019-08-17 2 22 8 2 22 2 8 b_cookie 2019-08-17 6 22 8 8 20 8 17 b_cookie 2019-08-18 9 22 17 17 14 17 22 b_cookie 2019-08-19 5 22 22 22 5 22 22
sum、avg、max、min这些窗口函数的语法都是一样的;同上;
二、 ntile分片
rows between; 如果切片不均匀,默认增加第一个切片的分布。
将分组数据按照顺序切分成1份、2份、3份、4份 select cookie_id,create_time,pv, ntile(1) over(partition by cookie_id order by create_time) as nt1, ntile(2) over(partition by cookie_id order by create_time) as nt2, ntile(3) over(partition by cookie_id order by create_time) as nt3, ntile(4) over(partition by cookie_id order by create_time) as nt4 from cookie;
结果: a_cookie组有7条数据, 所以默认第一个分片加1条数据 cookie_id create_time pv nt1 nt2 nt3 nt4 a_cookie 2019-06-10 1 1 1 1 1 a_cookie 2019-06-11 9 1 1 1 1 a_cookie 2019-06-12 7 1 1 1 2 a_cookie 2019-06-13 3 1 1 2 2 a_cookie 2019-06-14 2 1 2 2 3 a_cookie 2019-06-15 4 1 2 3 3 a_cookie 2019-06-16 4 1 2 3 4 b_cookie 2019-08-17 6 1 1 1 1 b_cookie 2019-08-17 2 1 1 1 2 b_cookie 2019-08-18 9 1 2 2 3 b_cookie 2019-08-19 5 1 2 3 4
应用场景:统计一个每个cookie的pv数最多的前1/3的天
select
cookie_id,create_time,pv
from(
select
cookie_id,create_time,pv,
ntile(3) over(partition by cookie_id order by pv desc) as nt1 --按pv降序排,取nt = 1份的数据就是要求的结果
from cookie
)t where nt1 = 1;
a_cookie 2019-06-11 9 a_cookie 2019-06-12 7 a_cookie 2019-06-16 4 b_cookie 2019-08-18 9 b_cookie 2019-08-17 6
三、排序 row_number()等
select
cookie_id,create_time,pv,
row_number() over(partition by cookie_id order by pv desc) as row_index,
rank() over(partition by cookie_id order by pv desc) as rank_index,
dense_rank() over(partition by cookie_id order by pv desc) as dense_index
from cookie;
cookie_id create_time pv row_index rank_index dense_index
a_cookie 2019-06-11 9 1 1 1 a_cookie 2019-06-12 7 2 2 2 a_cookie 2019-06-16 4 3 3 3 a_cookie 2019-06-15 4 4 3 3 a_cookie 2019-06-13 3 5 5 4 a_cookie 2019-06-14 2 6 6 5 a_cookie 2019-06-10 1 7 7 6
常用场景:分组取TopN, 比如求每个cookie排名前三的pv
create table tmp as select cookieid, createtime, pv, row_number() over(partition by cookieid order by pv desc) as index from cookie2; select * from tmp where index <= 3;
四、lag | lead | first_value| last_value
LAG(col,n,DEFAULT)用于统计窗口内往上第n行值
第一个参数为列名
第二个参数为往上第n行(可选,默认为1)
第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
select cookie_id,create_time,pv, lag(pv,1) over (partition by cookie_id order by create_time) as last_1_pv //可设置默认值为0即 lag(pv,1,0) over(...) from cookie;
数据如下:
结果: 没有设置默认值,没有上一行时显示为null cookie_id create_time pv last_1_pv a_cookie 2019-06-10 1 NULL a_cookie 2019-06-11 9 1 a_cookie 2019-06-12 7 9 a_cookie 2019-06-13 3 7 a_cookie 2019-06-14 2 3 a_cookie 2019-06-15 4 2 a_cookie 2019-06-16 4 4 b_cookie 2019-08-17 6 NULL b_cookie 2019-08-17 2 6 b_cookie 2019-08-18 9 2 b_cookie 2019-08-19 5 9
LEAD(col,n,DEFAULT)用于统计窗口内往下第n行值
第一个参数为列名
第二个参数为往下第n行(可选,默认为1)
第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)
select cookie_id,create_time,pv, lead(pv,1) over(partition by cookie_id order by create_time) as next_1_pv, lead(create_time,1,'1970-01-01') over(partition by cookie_id order by create_time) as next_1_time from cookie;
数据如下:
cookie_id create_time pv next_1_pv next_1_time a_cookie 2019-06-10 1 9 2019-06-11 a_cookie 2019-06-11 9 7 2019-06-12 a_cookie 2019-06-12 7 3 2019-06-13 a_cookie 2019-06-13 3 2 2019-06-14 a_cookie 2019-06-14 2 4 2019-06-15 a_cookie 2019-06-15 4 4 2019-06-16 a_cookie 2019-06-16 4 NULL 1970-01-01 b_cookie 2019-08-17 6 2 2019-08-17 b_cookie 2019-08-17 2 9 2019-08-18 b_cookie 2019-08-18 9 5 2019-08-19 b_cookie 2019-08-19 5 NULL 1970-01-01
first_value: 取分组内排序后,截止到当前行,第一个值
select cookie_id,create_time,pv, first_value(pv) over (partition by cookie_id order by pv) as first_pv from cookie; 数据如下:
cookie_id create_time pv first_pv a_cookie 2019-06-10 1 1 a_cookie 2019-06-14 2 1 a_cookie 2019-06-13 3 1 a_cookie 2019-06-16 4 1 a_cookie 2019-06-15 4 1 a_cookie 2019-06-12 7 1 a_cookie 2019-06-11 9 1 b_cookie 2019-08-17 2 2 b_cookie 2019-08-19 5 2 b_cookie 2019-08-17 6 2 b_cookie 2019-08-18 9 2
last_value 取分组内排序后,截止到当前行,最后一个值(其实就是它自己)
select cookie_id,create_time,pv, last_value(pv) over (partition by cookie_id order by pv) as last_pv from cookie;
cookie_id create_time pv last_pv a_cookie 2019-06-10 1 1 a_cookie 2019-06-14 2 2 a_cookie 2019-06-13 3 3 a_cookie 2019-06-16 4 4 a_cookie 2019-06-15 4 4 a_cookie 2019-06-12 7 7 a_cookie 2019-06-11 9 9 b_cookie 2019-08-17 2 2 b_cookie 2019-08-19 5 5 b_cookie 2019-08-17 6 6 b_cookie 2019-08-18 9 9
参考:
grouping sets 、 grouping__id 、 cube 、 rollup函数;
cume_dist 、 percent_rank等函数
https://www.jianshu.com/nb/19948302