Hive Window Functions and User-Defined Functions
-- Window functions ------------------------ ranking within groups ---------------------------------
-- row_number() over(partition by ... order by ...)         rank within the group, always unique: 1 2 3 4 5 6 7
-- rank() over(partition by userid order by pv desc)        rank within the group, ties share a rank and leave gaps: 1 2 3 3 5 6 6 8
-- dense_rank() over(partition by userid order by pv desc)  rank within the group, ties share a rank with no gaps: 1 2 3 3 4 5 6 6 7
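A minimal side-by-side sketch of the three ranking functions, assuming a hypothetical table user_pv(userid, pv):
select userid, pv,
       row_number() over(partition by userid order by pv desc) as rn,  -- always unique: 1 2 3 4 5
       rank()       over(partition by userid order by pv desc) as rk,  -- ties share a rank, gaps follow: 1 2 3 3 5
       dense_rank() over(partition by userid order by pv desc) as drk  -- ties share a rank, no gaps: 1 2 3 3 4
from user_pv;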
Without partition by there is no grouping: rank() over(order by pv desc) ranks the entire result set by pv.
-- Aggregate window functions: SUM, MAX, MIN, AVG
-- SUM() over(partition by ... order by ... rows between ... and ...)
-- MAX() over(partition by ... order by ... rows between ... and ...)
-- MIN() over(partition by ... order by ... rows between ... and ...)
-- AVG() over(partition by ... order by ... rows between ... and ...)
-- where rows between ... and ... defines the window frame:
Accumulate from the first row to the current row:
Default frame when order by is present: sum(pv) over(partition by userid order by createtime)
Without an order by clause, the whole partition is aggregated: sum(pv) over(partition by userid)
The same default written out with between: sum(pv) over(partition by userid order by createtime rows between unbounded preceding and current row)
current row            the current row
n preceding            n rows before the current row
n following            n rows after the current row
unbounded:
unbounded preceding    from the first row of the partition
unbounded following    to the last row of the partition
Sum from 3 rows back through the current row: sum(pv) over(partition by userid order by createtime rows between 3 preceding and current row)
Sum from 3 rows back through 1 row ahead: sum(pv) over(partition by userid order by createtime rows between 3 preceding and 1 following)
Sum from the current row through the last row: sum(pv) over(partition by userid order by createtime rows between current row and unbounded following)
Sum from 3 rows back through the last row: sum(pv) over(partition by userid order by createtime rows between 3 preceding and unbounded following)
Maximum from 3 rows back through the last row: max(pv) over(partition by userid order by createtime rows between 3 preceding and unbounded following)
Minimum from 3 rows back through the last row: min(pv) over(partition by userid order by createtime rows between 3 preceding and unbounded following)
Average from 3 rows back through the last row: avg(pv) over(partition by userid order by createtime rows between 3 preceding and unbounded following)
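As a worked sketch, assuming a hypothetical table user_pv(userid, createtime, pv), the following query puts the default running total next to an explicit 4-row moving sum:
select userid, createtime, pv,
       -- default frame with order by: first row through current row
       sum(pv) over(partition by userid order by createtime) as running_total,
       -- explicit frame: current row plus the 3 rows before it
       sum(pv) over(partition by userid order by createtime rows between 3 preceding and current row) as moving_sum_4
from user_pv;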
-- lag (the third argument is the default returned when no such preceding row exists)
Bring the previous row's value onto the current row: LAG(createtime,1,'1970-01-01') OVER(PARTITION BY userid ORDER BY createtime)
Bring the value from two rows back onto the current row: LAG(createtime,2,'1970-01-01') OVER(PARTITION BY userid ORDER BY createtime)
-- lead
Bring the next row's value onto the current row: LEAD(createtime,1,'1970-01-01') OVER(PARTITION BY userid ORDER BY createtime)
Bring the value from two rows ahead onto the current row: LEAD(createtime,2,'1970-01-01') OVER(PARTITION BY userid ORDER BY createtime)
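A common use of LAG is measuring the gap between consecutive events; a sketch, assuming a hypothetical table user_visits(userid, createtime) with dates in 'yyyy-MM-dd' form:
select userid, createtime,
       lag(createtime, 1, '1970-01-01') over(partition by userid order by createtime) as prev_time,
       -- days since the previous visit; datediff is a Hive built-in
       -- (on the first row of a partition this is large, since prev_time falls back to '1970-01-01')
       datediff(createtime, lag(createtime, 1, '1970-01-01') over(partition by userid order by createtime)) as days_since_prev
from user_visits;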
User-defined functions
User-defined functions fall into three categories:
1. UDF (User-Defined Function)
One row in, one row out
Similar to: lower/upper/reverse
2. UDAF (User-Defined Aggregation Function)
Aggregate function: many rows in, one row out
Similar to: count/max/min
3. UDTF (User-Defined Table-Generating Function)
One row in, many rows out
e.g. lateral view explode(), sketched below
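The built-in explode() is the canonical UDTF; a quick sketch, assuming a hypothetical table movies(name, category) where category is an array<string>:
select name, cat
from movies
lateral view explode(category) tmp as cat;  -- one output row per array element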
1. Writing a UDF
Programming steps:
(1) Extend org.apache.hadoop.hive.ql.exec.UDF
(2) Implement an evaluate method; evaluate supports overloading
Notes:
(1) A UDF must have a return type; it may return null, but the return type cannot be void
(2) Hadoop writable types such as Text/LongWritable are commonly used in UDFs; plain Java types are not recommended
Code:
Step 1: create a Maven Java project and add the dependencies
<dependencies>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>3.1.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.1.4</version>
    </dependency>
</dependencies>
Step 2: write a Java class that extends UDF and overloads the evaluate method
import org.apache.hadoop.hive.ql.exec.UDF;

public class MyUDF extends UDF {
    // 17801112345 ---> 178****2345
    public String evaluate(String phoneNumStr) {
        // check that the input is a valid mobile number
        String regex = "1[35789][0-9]{9}";
        boolean flag = phoneNumStr.matches(regex);
        if (!flag) {
            return null;
        } else {
            String str1 = phoneNumStr.substring(0, 3); // first three digits
            String str2 = phoneNumStr.substring(7);    // last four digits
            return str1 + "****" + str2;
        }
    }
}
Using the function, option 1: a temporary function
1. Package the project and upload the jar to Hive's lib directory
2. Add the jar
Upload the jar to the /export/server/hive/lib directory and rename it
cd /export/server/hive/lib
mv day19_udf-1.0-SNAPSHOT.jar my_udf.jar
3. Add the jar in the Hive client
hive> add jar /export/server/hive/lib/my_udf.jar;
4. Associate a function name with the custom class (temporary function)
hive> create temporary function my_jiami as 'cn.itcast.udf.MyUDF';
5. Use the custom function
hive> select my_jiami(phone_num) from test_user;
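A quick sanity check with a literal value (the expected masking follows from the comment in MyUDF):
hive> select my_jiami('17801112345');
-- expected result: 178****2345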
Using the function, option 2: a permanent function
1. Upload the custom function's jar to HDFS.
hadoop fs -mkdir /hive_func
hadoop fs -put my_udf.jar /hive_func
2. Create the permanent function
hive> create function my_jiami2 as 'cn.itcast.udf.MyUDF'
using jar 'hdfs://node1:8020/hive_func/my_udf.jar';
3. Verify
hive> select my_jiami2(phone_num) from test_user;
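The registration can also be confirmed with Hive's built-in commands:
hive> show functions like 'my_jiami*';
hive> describe function extended my_jiami2;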
Dropping functions
-- drop the temporary function
drop temporary function if exists my_jiami;
-- drop the permanent function; this does not delete the jar on HDFS
drop function if exists my_jiami2;
2. Writing a UDTF
Write a UDTF that splits a string on an arbitrary separator into individual words, for example:
Source data: "zookeeper,hadoop,hdfs,hive,MapReduce"
Target data:
zookeeper
hadoop
hdfs
hive
MapReduce
Code:
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class MyUDTF extends GenericUDTF {
    // reusable one-column output row
    private final transient Object[] forwardListObj = new Object[1];

    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        // output column names
        List<String> fieldNames = new ArrayList<>();
        fieldNames.add("column_01");
        // inspectors describing the type of each output column
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void process(Object[] objects) throws HiveException {
        // 1: the string to split (first argument)
        String args = objects[0].toString();
        // 2: the separator (second argument); note String.split treats it as a regex
        String splitKey = objects[1].toString();
        // 3: split the input on the separator
        String[] fields = args.split(splitKey);
        // 4: emit one output row per word
        for (String field : fields) {
            forwardListObj[0] = field;
            forward(forwardListObj);
        }
    }

    @Override
    public void close() throws HiveException {
    }
}
Upload the packaged jar to the /export/server/hive/lib directory on the server and rename it
cd /export/server/hive/lib
mv original-day_10_hive_udtf-1.0-SNAPSHOT.jar my_udtf.jar
Add the jar in the Hive client so it is on Hive's classpath
hive> add jar /export/server/hive/lib/my_udtf.jar;
Create a temporary function associated with the UDTF class
hive> create temporary function my_udtf as 'cn.itcast.udf.MyUDTF';
Use the custom UDTF
hive> select my_udtf("zookeeper,hadoop,hdfs,hive,MapReduce", ",") as word;
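With the sample string above, the query should return one row per word:
word
zookeeper
hadoop
hdfs
hive
MapReduce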