Hive-day4
Writing HiveSQL
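1. Differences between count(*), count(1), and count(column)
In terms of results:
- count(*) counts every row, including rows whose column values are NULL.
- count(1) ignores the columns and uses the constant 1 to stand for each row; rows with NULL values are not skipped. It is usually described as the fastest.
- count(column) looks only at that column and skips rows where it is NULL.
In terms of efficiency (commonly cited rules of thumb; confirm with explain on your own Hive version):
- If the column is not a primary key, count(1) is faster than count(column).
- If the table has a primary key, count(primary key column) is the fastest.
- If the table has only one column, count(*) is the fastest.
- If the table has several columns and no primary key, count(1) is faster than count(*).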
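The difference in results can be modelled without Hive. The following is a minimal sketch in plain Java; the class and method names are made up for illustration, and a Java null stands in for SQL NULL.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Plain-Java model of the count variants; no Hive involved.
class CountSemantics {
    // count(*) and count(1): every row counts, whatever the column holds.
    static long countRows(List<String> column) {
        return column.size();
    }

    // count(column): rows whose value in that column is NULL are skipped.
    static long countColumn(List<String> column) {
        return column.stream().filter(Objects::nonNull).count();
    }

    public static void main(String[] args) {
        List<String> name = Arrays.asList("a", null, "b", null);
        System.out.println(countRows(name));   // 4
        System.out.println(countColumn(name)); // 2
    }
}
```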
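2. Execution order of a Hive statement
1. from
2. join on, or lateral view explode(column to explode) tbl as exploded_column_name
3. where
4. group by
5. Aggregate functions such as sum(), avg(), count(1)
6. having (aliases defined in select can be used from this step on)
7. select. If it contains an over() window function, the select output is the input of the window function, so the window ranges over the rows left after group by and having, not over the rows left after where. Keep in mind when window functions run in this order.
8. distinct
9. order by
10. limit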
3. The where clause does not support inequality subqueries; it does support in, not in, exists, and not exists
-- List all employees who do the same job as "SCOTT".
select t1.EMPNO
      ,t1.ENAME
      ,t1.JOB
from emp t1
where t1.ENAME != "SCOTT"
  and t1.job in (
      select job from emp where ENAME = "SCOTT");

7900,JAMES,CLERK,7698,1981-12-03,950,null,30
7902,FORD,ANALYST,7566,1981-12-03,3000,null,20

select t1.EMPNO
      ,t1.ENAME
      ,t1.JOB
from emp t1
where t1.ENAME != "SCOTT"
  and exists (
      select job from emp t2
      where ENAME = "SCOTT" and t1.job = t2.job
  );
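5. In Hive, a null in the data file can end up as NULL (not the string 'null') once it is loaded into a table; for example, a value that cannot be parsed as a number is read as NULL in an int column, and the default NULL marker in text files is \N.
To test for NULL, write column_name is null,
or use the nvl() function. Comparing directly with column_name == null does not work.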
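Why == null does not work can be sketched in plain Java. This is only a model under stated assumptions: a Java null stands in for SQL NULL, and the class and method names are hypothetical. A comparison involving NULL yields NULL rather than true, and where keeps a row only when its condition is true.

```java
// Model of SQL NULL handling; not Hive code.
class NullSemantics {
    // SQL-style equality: if either side is NULL the result is NULL (unknown).
    static Boolean sqlEquals(String a, String b) {
        if (a == null || b == null) {
            return null;
        }
        return a.equals(b);
    }

    // "col is null" always gives a definite true or false.
    static boolean isNull(String a) {
        return a == null;
    }

    // nvl(value, fallback): replace NULL with a fallback value.
    static String nvl(String value, String fallback) {
        return value != null ? value : fallback;
    }

    public static void main(String[] args) {
        System.out.println(sqlEquals(null, null)); // null, so a where clause drops the row
        System.out.println(isNull(null));          // true
        System.out.println(nvl(null, "unknown"));  // unknown
    }
}
```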
6. Use explain to view the SQL execution plan
explain
select t1.EMPNO
      ,t1.ENAME
      ,t1.JOB
from emp t1
where t1.ENAME != "SCOTT"
  and t1.job in (
      select job from emp where ENAME = "SCOTT");

-- For a more detailed plan, add extended (shows the abstract syntax tree)
explain extended
select t1.EMPNO
      ,t1.ENAME
      ,t1.JOB
from emp t1
where t1.ENAME != "SCOTT"
  and t1.job in (
      select job from emp where ENAME = "SCOTT");
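Common Hive functions
1. Relational operators
-- Equality comparison: = == <=>
-- Inequality comparison: != <>
-- Range comparison:
select * from default.students where id between 1500100001 and 1500100010;
-- NULL / non-NULL checks: is null, is not null, nvl(), isnull()
-- Pattern matching: like, rlike, regexp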
2. Numeric functions
Rounding to nearest (half up): round
Rounding up: ceil
Rounding down: floor
3. Conditional functions
select if(1>0,1,0);
select if(1>0,if(-1>0,-1,1),0);
select score
      ,if(score>120,'excellent',if(score>100,'good',if(score>90,'pass','fail'))) as pingfen
from score limit 20;
COALESCE
select COALESCE(null,'1','2'); -- 1; arguments are checked from left to right until a non-NULL one is found
select COALESCE('1',null,'2'); -- 1
case when (key point)
select score
      ,case when score>120 then 'excellent'
            when score>100 then 'good'
            when score>90 then 'pass'
       else 'fail'
       end as pingfen
from score limit 20;

select name
      ,case name when "张三" then "老张"
                 when "李四" then "老李"
                 when "王二" then "老王"
       else "哈哈哈"
       end as nickname
from students limit 10;
Mind the order of the conditions: the first when branch that matches is the one that is used.
-- Date conversion with from_unixtime and unix_timestamp (use yyyy for the calendar year; YYYY is the week-based year)
select from_unixtime(1610611142,'yyyy/MM/dd HH:mm:ss');
select from_unixtime(unix_timestamp(),'yyyy/MM/dd HH:mm:ss');
-- '2022年06月06日' -> '2022-06-06'
select from_unixtime(unix_timestamp('2022年06月06日','yyyy年MM月dd日'),'yyyy-MM-dd');
-- "06牛2022数加06强" -> "2022/06/06"
select from_unixtime(unix_timestamp("06牛2022数加06强","MM牛yyyy数加dd强"),"yyyy/MM/dd");
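The parse-then-format idea behind from_unixtime(unix_timestamp(text, pattern), pattern) can be tried outside Hive. The following is a minimal sketch using java.time as an analogue; it is not what Hive runs internally, and the class and method names are made up for illustration. In Java date patterns yyyy is the calendar year, while YYYY is the week-based year and can give the wrong year around New Year.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Parse a date with one pattern and print it with another.
class DateReformat {
    static String reformat(String text, String fromPattern, String toPattern) {
        LocalDate date = LocalDate.parse(text, DateTimeFormatter.ofPattern(fromPattern));
        return date.format(DateTimeFormatter.ofPattern(toPattern));
    }

    public static void main(String[] args) {
        System.out.println(reformat("2021/01/14", "yyyy/MM/dd", "yyyy-MM-dd")); // 2021-01-14
    }
}
```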
-- String functions
select concat('123','456'); -- 123456
select concat('123','456',null); -- NULL
select concat_ws('#','a','b','c'); -- a#b#c
select concat_ws('#','a','b','c',NULL); -- a#b#c; takes a separator and skips NULL automatically
select concat_ws("|",cast(id as string),name,cast(age as string),gender,clazz) from students limit 10;

select substring("abcdefg",1); -- abcdefg; positions in HQL are counted from 1

-- '2021/01/14' -> '2021-01-14'
select concat_ws("-",substring('2021/01/14',1,4),substring('2021/01/14',6,2),substring('2021/01/14',9,2));
-- Prefer the date functions for date handling
select from_unixtime(unix_timestamp('2021/01/14','yyyy/MM/dd'),'yyyy-MM-dd');

select split("abcde,fgh",","); -- ["abcde","fgh"]
select split("a,b,c,d,e,f",",")[2]; -- c; array indexes still start from 0

select explode(split("abcde,fgh",","));
-- abcde
-- fgh

-- Parse JSON data
select get_json_object('{"name":"zhangsan","age":18,"score":[{"course_name":"math","score":100},{"course_name":"english","score":60}]}',"$.score[1].score"); -- 60
-- Word count example
create table words(
    words string
) row format delimited fields terminated by '|';

-- Data
hello,java,hello,java,scala,python
hbase,hadoop,hadoop,hdfs,hive,hive
hbase,hadoop,hadoop,hdfs,hive,hive

select word,count(*)
from (select explode(split(words,',')) word from words) a
group by a.word;

-- Result
hadoop 4
hbase 2
hdfs 2
hello 2
hive 4
java 2
python 1
scala 1
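The word count above is split, explode, then group by with count. A minimal plain-Java sketch of the same three steps, run on the same three data lines (the class name is hypothetical, and a TreeMap is used only to get sorted output):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class WordCount {
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                // explode(split(words, ',')): one element per word
                .flatMap(line -> Arrays.stream(line.split(",")))
                // group by word, count(*)
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "hello,java,hello,java,scala,python",
                "hbase,hadoop,hadoop,hdfs,hive,hive",
                "hbase,hadoop,hadoop,hdfs,hive,hive");
        // {hadoop=4, hbase=2, hdfs=2, hello=2, hive=4, java=2, python=1, scala=1}
        System.out.println(count(lines));
    }
}
```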
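Hive window functions
-- Aggregate form
select sum(column_name) over([partition by column_name] [order by column_name]) as alias,
       max(column_name) over() as alias
from table_name;

-- Ranking form
select rank() over([partition by column_name] [order by column_name]) as alias
from table_name;
- Inside over(), the partition, the ordering, and the window frame can be combined or left out, depending on the business requirement.
- If over() specifies no partition, the window covers all rows produced by the query; if a partition is specified, the window covers the rows of each partition.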
Test data
-- Create the table
create table t_fraction(
    name string,
    subject string,
    score int
) row format delimited fields terminated by ","
lines terminated by '\n';

-- Test data: fraction.txt
孙悟空,语文,10
孙悟空,数学,73
孙悟空,英语,15
猪八戒,语文,10
猪八戒,数学,73
猪八戒,英语,11
沙悟净,语文,22
沙悟净,数学,70
沙悟净,英语,31
唐玄奘,语文,21
唐玄奘,数学,81
唐玄奘,英语,23

-- Load the data
load data local inpath '/usr/local/soft/bigdata19data/fraction.txt' into table t_fraction;
1. Aggregate window functions
sum (sum)
min (minimum)
max (maximum)
avg (average)
count (count)

Both outputs below were captured with versions of fraction.txt whose scores differ from the listing above.

select name,subject,score,sum(score) over() as sumover from t_fraction;
+-------+----------+--------+----------+
| name | subject | score | sumover |
+-------+----------+--------+----------+
| 唐玄奘 | 英语 | 23 | 321 |
| 唐玄奘 | 数学 | 81 | 321 |
| 唐玄奘 | 语文 | 21 | 321 |
| 沙悟净 | 英语 | 31 | 321 |
| 沙悟净 | 数学 | 12 | 321 |
| 沙悟净 | 语文 | 22 | 321 |
| 猪八戒 | 英语 | 11 | 321 |
| 猪八戒 | 数学 | 73 | 321 |
| 猪八戒 | 语文 | 10 | 321 |
| 孙悟空 | 英语 | 15 | 321 |
| 孙悟空 | 数学 | 12 | 321 |
| 孙悟空 | 语文 | 10 | 321 |
+-------+----------+--------+----------+

select name,subject,score,
    sum(score) over() as sum1,
    sum(score) over(partition by subject) as sum2,
    sum(score) over(partition by subject order by score) as sum3,
    -- from the start of the partition to the current row; same as sum3
    sum(score) over(partition by subject order by score rows between unbounded preceding and current row) as sum4,
    -- the current row and the row before it
    sum(score) over(partition by subject order by score rows between 1 preceding and current row) as sum5,
    -- previous row + current row + next row
    sum(score) over(partition by subject order by score rows between 1 preceding and 1 following) as sum6,
    -- the current row and all rows after it
    sum(score) over(partition by subject order by score rows between current row and unbounded following) as sum7,
    -- the current row and the next row (not included in the captured output below)
    sum(score) over(partition by subject order by score rows between current row and 1 following) as sum8
from t_fraction;

rows: the frame is counted in rows
unbounded preceding: start of the partition
unbounded following: end of the partition
n preceding: n rows before
n following: n rows after
current row: the current row

+-------+----------+--------+-------+-------+-------+-------+-------+-------+-------+
| name | subject | score | sum1 | sum2 | sum3 | sum4 | sum5 | sum6 | sum7 |
+-------+----------+--------+-------+-------+-------+-------+-------+-------+-------+
| 孙悟空 | 数学 | 12 | 359 | 185 | 12 | 12 | 12 | 31 | 185 |
| 沙悟净 | 数学 | 19 | 359 | 185 | 31 | 31 | 31 | 104 | 173 |
| 猪八戒 | 数学 | 73 | 359 | 185 | 104 | 104 | 92 | 173 | 154 |
| 唐玄奘 | 数学 | 81 | 359 | 185 | 185 | 185 | 154 | 154 | 81 |
| 猪八戒 | 英语 | 11 | 359 | 80 | 11 | 11 | 11 | 26 | 80 |
| 孙悟空 | 英语 | 15 | 359 | 80 | 26 | 26 | 26 | 49 | 69 |
| 唐玄奘 | 英语 | 23 | 359 | 80 | 49 | 49 | 38 | 69 | 54 |
| 沙悟净 | 英语 | 31 | 359 | 80 | 80 | 80 | 54 | 54 | 31 |
| 孙悟空 | 语文 | 10 | 359 | 94 | 10 | 10 | 10 | 31 | 94 |
| 唐玄奘 | 语文 | 21 | 359 | 94 | 31 | 31 | 31 | 53 | 84 |
| 沙悟净 | 语文 | 22 | 359 | 94 | 53 | 53 | 43 | 84 | 63 |
| 猪八戒 | 语文 | 41 | 359 | 94 | 94 | 94 | 63 | 63 | 41 |
+-------+----------+--------+-------+-------+-------+-------+-------+-------+-------+
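The frame sums in the table above can be checked by hand. The following is a minimal plain-Java sketch of sum(...) over(... rows between p preceding and f following) for one partition that is already sorted; the class name and the UNBOUNDED constant are made up for illustration. With the four 数学 scores from the output (12, 19, 73, 81) it reproduces the sum4, sum6, and sum7 columns.

```java
import java.util.Arrays;

class RowsFrame {
    // Stand-in for "unbounded preceding" / "unbounded following".
    static final int UNBOUNDED = Integer.MAX_VALUE;

    // ordered: the values of one partition, already sorted by the order by column.
    static int[] frameSum(int[] ordered, int preceding, int following) {
        int n = ordered.length;
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            // Clamp the frame to the partition; long arithmetic avoids overflow with UNBOUNDED.
            int from = (int) Math.max(0L, (long) i - preceding);
            int to = (int) Math.min(n - 1L, (long) i + following);
            for (int j = from; j <= to; j++) {
                out[i] += ordered[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] math = {12, 19, 73, 81};
        System.out.println(Arrays.toString(frameSum(math, UNBOUNDED, 0))); // [12, 31, 104, 185]
        System.out.println(Arrays.toString(frameSum(math, 1, 1)));         // [31, 104, 173, 154]
        System.out.println(Arrays.toString(frameSum(math, 0, UNBOUNDED))); // [185, 173, 154, 81]
    }
}
```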
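rows must follow the order by clause. It restricts the sorted result, using a fixed number of rows to limit how many rows of the partition fall inside the window.
CURRENT ROW: the current row
n PRECEDING: n rows before the current row
n FOLLOWING: n rows after the current row
UNBOUNDED: no bound. UNBOUNDED PRECEDING means starting from the first row of the partition; UNBOUNDED FOLLOWING means ending at the last row
LAG(col,n,default_val): the value n rows before the current row. col is the column name, n is the number of rows to go back, and default_val is returned when that row does not exist
LEAD(col,n,default_val): the value n rows after the current row. col is the column name, n is the number of rows to go forward, and default_val is returned when that row does not exist
NTILE(n): distributes the rows of an ordered partition into the specified number of groups. The groups are numbered from 1, and for each row NTILE returns the number of the group the row belongs to.
cume_dist(): computes the cumulative distribution of a value within a window or partition. Assuming ascending order, cume_dist = (number of rows whose value is less than or equal to the current row's value) / (total number of rows in the window or partition).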
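LAG and NTILE can be modelled on a single ordered partition. This is a minimal plain-Java sketch, not Hive's implementation; the class and method names are hypothetical, and the ntile model assumes the usual SQL convention that the leading groups take the extra rows when the row count is not evenly divisible.

```java
import java.util.Arrays;

class LagNtile {
    // lag(col, n, defaultVal): the value n rows earlier, or defaultVal if there is no such row.
    static String[] lag(String[] ordered, int n, String defaultVal) {
        String[] out = new String[ordered.length];
        for (int i = 0; i < ordered.length; i++) {
            out[i] = i - n >= 0 ? ordered[i - n] : defaultVal;
        }
        return out;
    }

    // ntile(groups): 1-based group number for each of rowCount ordered rows.
    static int[] ntile(int rowCount, int groups) {
        int[] out = new int[rowCount];
        int base = rowCount / groups;
        int extra = rowCount % groups; // the first `extra` groups get one more row
        int row = 0;
        for (int g = 1; g <= groups && row < rowCount; g++) {
            int size = base + (g <= extra ? 1 : 0);
            for (int k = 0; k < size; k++) {
                out[row++] = g;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] dates = {"2017-01-01", "2017-01-05", "2017-01-08"};
        System.out.println(Arrays.toString(lag(dates, 1, null))); // [null, 2017-01-01, 2017-01-05]
        System.out.println(Arrays.toString(ntile(14, 5)));        // [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5]
    }
}
```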
Aggregate window functions in practice:
Practice 1: analyzing Hive user purchase records
Create the table and load the data
name,orderdate,cost
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

Create the table and load the data
vim business.txt

create table business(
    name string,
    orderdate string,
    cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

load data local inpath "/usr/local/soft/bigdata19/business.txt" into table business;
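Requirement 1: find the customers who made a purchase in April 2017, and the total number of such customers
-- Analysis: filter by date, group by customer, then count the groups with a window over the grouped result
select name,count(*) over() as total_people
from business
where date_format(orderdate,'yyyy-MM')='2017-04'
group by name;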
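Requirement 2: show each customer's purchase details together with the customer's monthly purchase total
-- Analysis: partition by customer and month, then sum the purchase amount
select name,orderdate,cost,
       sum(cost) over(partition by name, date_format(orderdate,'yyyy-MM')) as total_amount
from business;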
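Requirement 3: in the scenario above, accumulate cost by date
-- Analysis: partition by customer, sort by date ascending, and for each row add up the amounts so far within the partition
select name,orderdate,cost,
       sum(cost) over(partition by name order by orderdate rows between unbounded preceding and current row) as cumulative_amount
from business;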
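Requirement 4: find each customer's previous purchase date
-- Analysis: return the detail rows together with the purchase date of the previous row (partition by customer, sort by date ascending)
select name,orderdate,cost,
       lag(orderdate,1) over(partition by name order by orderdate) as last_date
from business;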
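Requirement 5: find the orders in the earliest 20% of the time range
-- Analysis: sort by date ascending and take the first 20% of the rows
select *
from (
    select name,orderdate,cost,
           ntile(5) over(order by orderdate) as sortgroup_num
    from business
) t
where t.sortgroup_num=1;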
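2. Ranking window functions
- RANK(): equal values get the same rank and the following ranks are skipped, so the last rank still equals the row count
- DENSE_RANK(): equal values get the same rank and no rank is skipped, so the last rank is smaller than the row count
- ROW_NUMBER(): numbers the rows one by one in sort order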
select name,subject,score,
    rank() over(partition by subject order by score desc) rp,
    dense_rank() over(partition by subject order by score desc) drp,
    row_number() over(partition by subject order by score desc) rnp,
    percent_rank() over(partition by subject order by score) as percent_rank
from t_fraction;

select name,subject,score,
    rank() over(order by score) as rk,
    percent_rank() over(partition by subject order by score) as percent_rank
from t_fraction;
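How the three ranking functions treat ties can be seen on one sorted partition. This is a minimal plain-Java sketch with hypothetical class and method names; the input must already be sorted in the window's order (descending here).

```java
import java.util.Arrays;

class Ranking {
    // rank(): ties share a rank, and the ranks after a tie are skipped.
    static int[] rank(int[] sorted) {
        int[] out = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            out[i] = (i > 0 && sorted[i] == sorted[i - 1]) ? out[i - 1] : i + 1;
        }
        return out;
    }

    // dense_rank(): ties share a rank, and no rank is skipped.
    static int[] denseRank(int[] sorted) {
        int[] out = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            if (i == 0) {
                out[i] = 1;
            } else {
                out[i] = sorted[i] == sorted[i - 1] ? out[i - 1] : out[i - 1] + 1;
            }
        }
        return out;
    }

    // row_number(): plain sequence numbers, ties included.
    static int[] rowNumber(int[] sorted) {
        int[] out = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            out[i] = i + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] scores = {84, 84, 78, 68}; // sorted descending, with one tie
        System.out.println(Arrays.toString(rank(scores)));      // [1, 1, 3, 4]
        System.out.println(Arrays.toString(denseRank(scores))); // [1, 1, 2, 3]
        System.out.println(Arrays.toString(rowNumber(scores))); // [1, 2, 3, 4]
    }
}
```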
Create the table and load the data
name subject score
李四 语文 87
李四 数学 95
李四 英语 68
张三 语文 94
张三 数学 56
张三 英语 84
王二 语文 64
王二 数学 86
王二 英语 84
许五 语文 65
许五 数学 85
许五 英语 78

Create the table and load the data
vim score.txt

create table score2(
    name string,
    subject string,
    score int
) row format delimited fields terminated by "\t";

load data local inpath '/usr/local/soft/bigdata19/score.txt' into table score2;
Requirement 1: rank the students' scores within each subject (three variants: tied ranks with gaps, tied ranks without gaps, and no ties)
-- Analysis: partition by subject, sort by score descending, rank by score
select name,subject,score,
    rank() over(partition by subject order by score desc) rp,
    dense_rank() over(partition by subject order by score desc) drp,
    row_number() over(partition by subject order by score desc) rmp
from score2;
Requirement 2: the top n students in each subject
select *
from (
    select name,subject,score,
           row_number() over(partition by subject order by score desc) rmp
    from score2
) t
where t.rmp<=3;
Hive rows to columns (explode: one row becomes several rows)
lateral view explode
create table testArray2(
    name string,
    weight array<string>
) row format delimited fields terminated by '\t'
COLLECTION ITEMS terminated by ',';

小王 "150","170","180"
火火 "150","180","190"

select name,col1 from testarray2 lateral view explode(weight) t1 as col1;
小王 150
小王 170
小王 180
火火 150
火火 180
火火 190

select key from (select explode(map('key1',1,'key2',2,'key3',3)) as (key,value)) t;
key1
key2
key3

select name,col1,col2 from testarray2 lateral view explode(map('key1',1,'key2',2,'key3',3)) t1 as col1,col2;
小王 key1 1
小王 key2 2
小王 key3 3
火火 key1 1
火火 key2 2
火火 key3 3

select name,pos,col1 from testarray2 lateral view posexplode(weight) t1 as pos,col1;
小王 0 150
小王 1 170
小王 2 180
火火 0 150
火火 1 180
火火 2 190
Hive columns to rows (collect_list: several rows become one)
-- testLieToLine
name col1
小王 150
小王 170
小王 180
火火 150
火火 180
火火 190

create table testLieToLine(
    name string,
    col1 int
) row format delimited fields terminated by '\t';

select name,collect_list(col1) from testLieToLine group by name;
-- Result
小王 [150,170,180]
火火 [150,180,190]

select t1.name
      ,collect_list(t1.col1)
from (
    select name
          ,col1
    from testarray2
    lateral view explode(weight) t1 as col1
) t1
group by t1.name;
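What group by name with collect_list(col1) does can be sketched in plain Java as grouping rows by key and appending each value to that key's list. The class name is hypothetical, and the ASCII names "wang" and "huo" are stand-ins for the two names in the sample data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CollectList {
    // rows: {name, col1} pairs, one per input row.
    static Map<String, List<Integer>> collectList(String[][] rows) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (String[] row : rows) {
            // group by name; append col1 to the list of that group
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>()).add(Integer.parseInt(row[1]));
        }
        return grouped;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"wang", "150"}, {"wang", "170"}, {"wang", "180"},
            {"huo", "150"}, {"huo", "180"}, {"huo", "190"}
        };
        System.out.println(collectList(rows)); // {wang=[150, 170, 180], huo=[150, 180, 190]}
    }
}
```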
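UDF: one in, one out
Keep these points in mind when defining a UDF:
Extend org.apache.hadoop.hive.ql.exec.UDF.
Implement evaluate(). This method is not defined by an interface, because the number and the types of the arguments it accepts are not fixed. Hive inspects the UDF to see whether it can find an evaluate() method that matches the function call.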
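The idea of picking an evaluate() overload from the arguments of the call can be illustrated with plain Java reflection. This is only an analogy under stated assumptions: it is not Hive's resolver, the Wrap class is a made-up stand-in that does not extend the Hive UDF class, and the lookup matches exact argument classes only.

```java
import java.lang.reflect.Method;

class EvaluateLookup {
    // Stand-in for a UDF with two evaluate() overloads (hypothetical).
    static class Wrap {
        public String evaluate(String s) {
            return "#" + s + "$";
        }

        public String evaluate(String s, String mark) {
            return mark + s + mark;
        }
    }

    // Find the evaluate() overload whose parameter types match the arguments, then call it.
    static String call(Object udf, Object... args) {
        Class<?>[] types = new Class<?>[args.length];
        for (int i = 0; i < args.length; i++) {
            types[i] = args[i].getClass();
        }
        try {
            Method m = udf.getClass().getMethod("evaluate", types);
            return (String) m.invoke(udf, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("no matching evaluate() method", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(call(new Wrap(), "hadoop"));      // #hadoop$
        System.out.println(call(new Wrap(), "hadoop", "*")); // *hadoop*
    }
}
```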
Create a maven project and add the dependency
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
</dependency>
Solution if the build fails because the pentaho-aggdesigner-algorithm dependency cannot be resolved:
Change the hive-exec entry in the pom file
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
    <exclusions>
        <!-- Exclude the pentaho-aggdesigner-algorithm dependency so that it is not pulled in -->
        <exclusion>
            <groupId>org.pentaho</groupId>
            <artifactId>pentaho-aggdesigner-algorithm</artifactId>
        </exclusion>
    </exclusions>
</dependency>
Write the code: extend org.apache.hadoop.hive.ql.exec.UDF, implement the evaluate method, and put your own logic inside evaluate
import org.apache.hadoop.hive.ql.exec.UDF;

public class HiveUDF extends UDF {
    // hadoop => #hadoop$
    public String evaluate(String col1) {
        // Prefix the incoming value with # and append $
        String result = "#" + col1 + "$";
        return result;
    }
}
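- In the hive shell, use add jar <path> to add the jar as a resource:
add jar /usr/local/soft/bigdata19/hive-bigdata19-1.0-SNAPSHOT.jar;
- Register a temporary function from the jar resource. fxxx1 is your function name and 'MyUDF' is the main class name (give the fully qualified class name):
create temporary function fxxx1 as 'MyUDF';
- Use the function name to process data:
select fxxx1(name) as fxx_name from students limit 10;
#uoi$
#yui$
#huj$
#hdu$
#hul$
#wyy$
#wqy$
#pmq$
#wmd$
#tzq$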
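Example: convert to upper case
package com.shujia.testHiveFun.udf;

// Assumption: StringUtils comes from commons-lang 2.x, available through hive-exec
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;

public class FirstUDF extends UDF {
    public String evaluate(String str) {
        String upper = null;
        // 1. Check the input: null or empty input yields null
        if (!StringUtils.isEmpty(str)) {
            upper = str.toUpperCase();
        }
        return upper;
    }

    // Quick local check of the function
    public static void main(String[] args) {
        System.out.println(new FirstUDF().evaluate("jiajingwen"));
    }
}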
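Loading by command
A function loaded this way is valid only for the current session
-- 1. Package the project and upload the jar to the Linux server (do not bundle the dependencies)
-- Enter the hive client and run:
hive> add jar /usr/local/soft/bigdata19/hadoop-mapreduce-1.0-SNAPSHOT.jar;
-- 2. Create a temporary function; this must happen in the same hive session as the add jar above:
hive> create temporary function toUp as 'com.shujia.testHiveFun.udf.FirstUDF';
-- 3. Check that the function was created
show functions;
-- 4. Test it
select toUp('abcdef');
-- 5. Drop the function
drop temporary function if exists toUp;
Upload the jar to HDFS:
hadoop fs -put hadoop-mapreduce-1.0-SNAPSHOT.jar /jar/
Create a permanent function from the hive command line:
create function myUp as 'com.shujia.testHiveFun.udf.FirstUDF' using jar 'hdfs:/jar/hadoop-mapreduce-1.0-SNAPSHOT.jar';
create function wqy_fun as 'wqy.udfdemo.HiveTest' using jar 'hdfs:/shujia/bigdata19/jar/hive-udf.jar';
Exit hive, start it again, and test the function.
Then drop the permanent function and check that it is gone.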
UDTF: one in, many out
A UDTF takes one input and produces multiple outputs. Implementing a UDTF involves the steps below.
M1001#xiaohu#S324231212,lkd#M1002#S2543412432,S21312312412#M1003#bfy
1001 xiaohu 324231212
1002 lkd 2543412432
1003 bfy 21312312412
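Extend org.apache.hadoop.hive.ql.udf.generic.GenericUDTF and override initialize(), process(), and close(). The execution flow is:
The UDTF first calls initialize(), which returns information about the rows the UDTF produces (number of columns and their types).
After initialization, process() is called. The real work happens in process(): each call to forward() produces one row, and to produce several columns you put the column values in an array and pass that array to forward().
Finally close() is called to clean up whatever needs cleaning up.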
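The body of a process() method for the M1001#xiaohu#S324231212 sample shown earlier could look like the following. This is a minimal plain-Java sketch with no Hive classes, and its parsing rules are assumptions read off the sample: records are separated by ",", fields by "#", the field of the form M plus digits is the id, the field of the form S plus digits is the number, and the remaining field is the name. The class and method names are hypothetical; in a real UDTF each collected row would be passed to forward().

```java
import java.util.ArrayList;
import java.util.List;

class SplitRecords {
    // Turns one input string into several {id, name, number} rows.
    static List<String[]> process(String line) {
        List<String[]> rows = new ArrayList<>();
        for (String record : line.split(",")) {
            String id = null;
            String name = null;
            String number = null;
            for (String field : record.split("#")) {
                if (field.matches("M\\d+")) {
                    id = field.substring(1);      // strip the M prefix
                } else if (field.matches("S\\d+")) {
                    number = field.substring(1);  // strip the S prefix
                } else {
                    name = field;
                }
            }
            rows.add(new String[] {id, name, number}); // one forward() call per row in a real UDTF
        }
        return rows;
    }

    public static void main(String[] args) {
        String input = "M1001#xiaohu#S324231212,lkd#M1002#S2543412432,S21312312412#M1003#bfy";
        for (String[] row : process(input)) {
            System.out.println(String.join(" ", row));
        }
    }
}
```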
"key1:value1,key2:value2,key3:value3"
key1 value1
key2 value2
key3 value3
Method 1: explode + split
select split(t.col1,":")[0],split(t.col1,":")[1]
from (select explode(split("key1:value1,key2:value2,key3:value3",",")) as col1) t;
Method 2: a custom UDTF
Code
package com.shujia.testHiveFun.udtf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.ArrayList;

public class HiveUDTF extends GenericUDTF {
    // Declare the names and types of the output columns
    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        ArrayList<String> fieldNames = new ArrayList<String>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("col1");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("col2");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    // Processing logic, called as my_udtf("key1:value1,key2:value2,key3:value3")
    @Override
    public void process(Object[] objects) throws HiveException {
        // objects holds the N columns passed in; only the first one is used here
        String col = objects[0].toString();
        // key1:value1  key2:value2  key3:value3
        String[] splits = col.split(",");
        for (String str : splits) {
            String[] cols = str.split(":");
            // Emit one output row
            forward(cols);
        }
    }

    // Called when the UDTF finishes
    @Override
    public void close() throws HiveException {
    }
}
SQL
create temporary function my_udtf as 'com.shujia.testHiveFun.udtf.HiveUDTF';
select my_udtf("key1:value1,key2:value2,key3:value3");
Data:
a,1,2,3,4,5,6,7,8,9,10,11,12
b,11,12,13,14,15,16,17,18,19,20,21,22
c,21,22,23,24,25,26,27,28,29,30,31,32
Convert it into 3 columns: id, hours, value
For example:
a,1,2,3,4,5,6,7,8,9,10,11,12
a,0时,1
a,2时,2
a,4时,3
a,6时,4
create table udtfData(
    id string
    ,col1 string
    ,col2 string
    ,col3 string
    ,col4 string
    ,col5 string
    ,col6 string
    ,col7 string
    ,col8 string
    ,col9 string
    ,col10 string
    ,col11 string
    ,col12 string
) row format delimited fields terminated by ',';
Code:
package com.shujia.hivefun.udtf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.ArrayList;

/**
 * One input row: a,1,2,3,4,5,6,7,8,9,10,11,12
 * Output rows:
 * a 0时 1
 * a 2时 2
 * a 4时 3
 * ...
 */
public class HiveUDTF2 extends GenericUDTF {
    // Declare the three output columns: id, hours, value
    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        ArrayList<String> fieldNames = new ArrayList<String>();
        ArrayList<ObjectInspector> fieldObj = new ArrayList<ObjectInspector>();
        fieldNames.add("id");
        fieldObj.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("hours");
        fieldObj.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("value");
        fieldObj.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldObj);
    }

    // Input: a,1,2,3,4,5,6,7,8,9,10,11,12 (the id first, then the twelve values)
    @Override
    public void process(Object[] objects) throws HiveException {
        int hours = 0;
        Object id = objects[0];
        for (int i = 1; i < objects.length; i++) {
            // Build "id,<hours>时,value" and split it into the three output columns
            String line = id + "," + hours + "时" + "," + objects[i].toString();
            String[] cols = line.split(",");
            forward(cols);
            hours = hours + 2;
        }
    }

    @Override
    public void close() throws HiveException {
    }
}
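Add the jar resource:
add jar /usr/local/soft/HiveUDF2-1.0.jar;
Register the UDTF:
create temporary function my_udtf as 'com.shujia.hivefun.udtf.HiveUDTF2';
SQL (the UDTF reads the id as its first argument and emits three columns):
select t.id,t.hours,t.value
from udtfData
lateral view my_udtf(id,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12) t as id,hours,value;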
Hive Shell
Option 1:
hive -e "select * from test1.students limit 10"
Option 2:
hive -f <path to the HQL file>
Write the HQL in a file and pass that file with the -f option
The consecutive login problem
E-commerce, logistics, and banking often have requirements like this: compute a user's total amount over consecutive transactions, the number of consecutive login days, the start and end dates of each consecutive run, the gap in days between runs, and so on
Data:
Note: a user may have several records on the same day
id datestr amount
1,2019-02-08,6214.23
1,2019-02-08,6247.32
1,2019-02-09,85.63
1,2019-02-09,967.36
1,2019-02-10,85.69
1,2019-02-12,769.85
1,2019-02-13,943.86
1,2019-02-14,538.42
1,2019-02-15,369.76
1,2019-02-16,369.76
1,2019-02-18,795.15
1,2019-02-19,715.65
1,2019-02-21,537.71
2,2019-02-08,6214.23
2,2019-02-08,6247.32
2,2019-02-09,85.63
2,2019-02-09,967.36
2,2019-02-10,85.69
2,2019-02-12,769.85
2,2019-02-13,943.86
2,2019-02-14,943.18
2,2019-02-15,369.76
2,2019-02-18,795.15
2,2019-02-19,715.65
2,2019-02-21,537.71
3,2019-02-08,6214.23
3,2019-02-08,6247.32
3,2019-02-09,85.63
3,2019-02-09,967.36
3,2019-02-10,85.69
3,2019-02-12,769.85
3,2019-02-13,943.86
3,2019-02-14,276.81
3,2019-02-15,369.76
3,2019-02-16,369.76
3,2019-02-18,795.15
3,2019-02-19,715.65
3,2019-02-21,537.71

create table deal_tb(
    id string
    ,datestr string
    ,amount string
) row format delimited fields terminated by ',';
- First group by user and date and sum the amounts, so that each user has a single row per day
select id,datestr,sum(amount) as sum_amount from deal_tb group by id,datestr
- datediff(string end_date, string start_date): a result of 0 means the logins are consecutive (in the query below, rows whose date minus row number gives the same date belong to one consecutive run)
select ttt1.id,
       ttt1.grp,
       sum(ttt1.sum_amount) as sum_over_amount,
       count(1) as lianxu_days,
       min(ttt1.datestr) as start_date,
       max(ttt1.datestr) as end_date,
       datediff(
           ttt1.grp,
           lag(ttt1.grp, 1) over(partition by ttt1.id order by ttt1.grp)
       ) as interval_days
from (
    select tt1.id as id,
           tt1.datestr as datestr,
           tt1.sum_amount as sum_amount,
           date_sub(tt1.datestr, tt1.rn) as grp
    from (
        select t1.id as id,
               t1.datestr as datestr,
               t1.sum_amount as sum_amount,
               row_number() over(partition by t1.id order by t1.datestr) as rn
        from (
            select id,
                   datestr,
                   sum(amount) as sum_amount
            from deal_tb
            group by id, datestr
        ) t1
    ) tt1
) ttt1
group by ttt1.id, ttt1.grp;
Result:
1 2019-02-07 13600.23 3 2019-02-08 2019-02-10 NULL
1 2019-02-08 2991.650 5 2019-02-12 2019-02-16 1
1 2019-02-09 1510.8 2 2019-02-18 2019-02-19 1
1 2019-02-10 537.71 1 2019-02-21 2019-02-21 1
2 2019-02-07 13600.23 3 2019-02-08 2019-02-10 NULL
2 2019-02-08 3026.649 4 2019-02-12 2019-02-15 1
2 2019-02-10 1510.8 2 2019-02-18 2019-02-19 2
2 2019-02-11 537.71 1 2019-02-21 2019-02-21 1
3 2019-02-07 13600.23 3 2019-02-08 2019-02-10 NULL
3 2019-02-08 2730.04 5 2019-02-12 2019-02-16 1
3 2019-02-09 1510.8 2 2019-02-18 2019-02-19 1
3 2019-02-10 537.71 1 2019-02-21 2019-02-21 1
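The grouping trick in the query, date_sub(datestr, rn), can be checked in isolation: within a run of consecutive days the date and the row number grow together, so their difference stays the same. The following is a minimal plain-Java sketch for one user; the class and method names are made up for illustration, and the input is assumed to be that user's distinct dates in ascending order. Run on user 1's dates, it gives the same group keys and run lengths as the first four result rows.

```java
import java.time.LocalDate;
import java.util.LinkedHashMap;
import java.util.Map;

class LoginStreaks {
    // Input: one user's distinct dates (ISO format), sorted ascending.
    // Output: group key (date minus row number) -> number of days in that run.
    static Map<LocalDate, Integer> streaks(String... sortedDistinctDates) {
        Map<LocalDate, Integer> runs = new LinkedHashMap<>();
        int rn = 0;
        for (String text : sortedDistinctDates) {
            rn++;                                                // row_number() over(order by datestr)
            LocalDate grp = LocalDate.parse(text).minusDays(rn); // date_sub(datestr, rn)
            runs.merge(grp, 1, Integer::sum);                    // count(1) ... group by grp
        }
        return runs;
    }

    public static void main(String[] args) {
        // {2019-02-07=3, 2019-02-08=5, 2019-02-09=2, 2019-02-10=1}
        System.out.println(streaks(
                "2019-02-08", "2019-02-09", "2019-02-10",
                "2019-02-12", "2019-02-13", "2019-02-14", "2019-02-15", "2019-02-16",
                "2019-02-18", "2019-02-19",
                "2019-02-21"));
    }
}
```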