Hive window functions and user-defined functions

-- Window functions ----------------------- ranking within groups -----------------------
-- row_number() over(partition by ... order by ...)        sequential rank within each group: 1 2 3 4 5 6 7
-- rank() over(partition by userid order by pv desc)       rank within each group, with gaps after ties: 1 2 3 3 5 6 6 8
-- dense_rank() over(partition by userid order by pv desc) rank within each group, no gaps after ties: 1 2 3 3 4 5 6 6 7
Without partition by there is no grouping: rank() over(order by pv desc) ranks the entire result set by pv. The three functions are contrasted in the sketch below.
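A minimal query contrasting the three ranking functions in one result set (the table user_pv and its columns userid/pv are hypothetical):

select userid,
       pv,
       row_number() over(partition by userid order by pv desc) as rn,
       rank()       over(partition by userid order by pv desc) as rk,
       dense_rank() over(partition by userid order by pv desc) as drk
from user_pv;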

-- Aggregate window functions: SUM, MAX, MIN, AVG
-- SUM() over(partition by ... order by ... rows between ... and ...)
-- MAX() over(partition by ... order by ... rows between ... and ...)
-- MIN() over(partition by ... order by ... rows between ... and ...)
-- AVG() over(partition by ... order by ... rows between ... and ...)
-- The rows between ... and ... clause defines the window frame:
Accumulate from the first row to the current row:
Default when order by is present: sum(pv) over(partition by userid order by createtime)
Without an order by clause, the whole partition is summed: sum(pv) over(partition by userid)
Explicit frame with between: sum(pv) over(partition by userid order by createtime rows between unbounded preceding and current row)
  current row           the current row
  n preceding           n rows before the current row
  n following           n rows after the current row
  unbounded:
    unbounded preceding   from the start of the partition
    unbounded following   to the end of the partition
Sum from 3 rows back to the current row:  sum(pv) over(partition by userid order by createtime rows between 3 preceding and current row)
Sum from 3 rows back to 1 row ahead:      sum(pv) over(partition by userid order by createtime rows between 3 preceding and 1 following)
Sum from the current row to the last row: sum(pv) over(partition by userid order by createtime rows between current row and unbounded following)
Sum from 3 rows back to the last row:     sum(pv) over(partition by userid order by createtime rows between 3 preceding and unbounded following)
Max from 3 rows back to the last row:     max(pv) over(partition by userid order by createtime rows between 3 preceding and unbounded following)
Min from 3 rows back to the last row:     min(pv) over(partition by userid order by createtime rows between 3 preceding and unbounded following)
Avg from 3 rows back to the last row:     avg(pv) over(partition by userid order by createtime rows between 3 preceding and unbounded following)
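
A sketch of a complete running-total query, assuming a hypothetical table user_visits(userid, createtime, pv):

-- running total of pv per user, ordered by visit time
select userid,
       createtime,
       pv,
       sum(pv) over(partition by userid
                    order by createtime
                    rows between unbounded preceding and current row) as cum_pv
from user_visits;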

-- lag
Pull the previous row's value onto the current row:     LAG(createtime,1,'1970-01-01') OVER(PARTITION BY userid ORDER BY createtime)
Pull the value from two rows back onto the current row: LAG(createtime,2,'1970-01-01') OVER(PARTITION BY userid ORDER BY createtime)

-- lead
Pull the next row's value onto the current row:          LEAD(createtime,1,'1970-01-01') OVER(PARTITION BY userid ORDER BY createtime)
Pull the value from two rows ahead onto the current row: LEAD(createtime,2,'1970-01-01') OVER(PARTITION BY userid ORDER BY createtime)
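
A common use of lag is pairing each row with the previous visit time, e.g. to compute gaps between visits; a sketch against the same hypothetical user_visits table:

-- the first row in each partition falls back to the default '1970-01-01'
select userid,
       createtime,
       lag(createtime, 1, '1970-01-01')
           over(partition by userid order by createtime) as prev_time
from user_visits;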

 

User-defined functions
User-defined functions fall into three categories:
    1. UDF (User-Defined Function)
        one row in, one row out
        e.g. lower/upper/reverse
    2. UDAF (User-Defined Aggregation Function)
        aggregate function: many rows in, one row out
        e.g. count/max/min
    3. UDTF (User-Defined Table-Generating Function)
        one row in, many rows out
        e.g. lateral view with explode(); see the sketch after this list
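
For reference, the built-in explode() is the typical UDTF: it turns one value into many rows (the table user_tags and its columns are hypothetical):

-- split a comma-separated string into an array, then emit one row per element;
-- lateral view joins the generated rows back to their source row
select userid, tag
from user_tags
lateral view explode(split(tags, ',')) t as tag;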

1. Custom UDF
    Steps:
    (1) Extend org.apache.hadoop.hive.ql.exec.UDF
    (2) Implement an evaluate method; evaluate supports overloading
    Notes:
    (1) A UDF must have a return type; it may return null, but the return type cannot be void
    (2) Hadoop writable types such as Text/LongWritable are preferred in UDFs; plain Java types are not recommended

Code:
Step 1: create a Maven Java project and import the dependencies
<dependencies>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>3.1.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.1.4</version>
    </dependency>
</dependencies>
Step 2: write a Java class that extends UDF and overloads the evaluate method
    import org.apache.hadoop.hive.ql.exec.UDF;

    public class MyUDF extends UDF {
        // 17801112345 ---> 178****2345
        public String evaluate(String phoneNumStr) {
            // check that the phone number is valid
            String regex = "1[35789][0-9]{9}";
            boolean flag = phoneNumStr.matches(regex);
            if (!flag) {
                return null;
            } else {
                // keep the first 3 and last 4 digits, mask the middle
                String str1 = phoneNumStr.substring(0, 3);
                String str2 = phoneNumStr.substring(7);
                return str1 + "****" + str2;
            }
        }
    }
Using the function, option 1: temporary function
1. Package the project and upload the jar to Hive's lib directory
2. Add the jar
Upload the jar to /export/server/hive/lib and rename it
cd /export/server/hive/lib
mv day19_udf-1.0-SNAPSHOT.jar my_udf.jar

3. Add the jar in the Hive client
hive> add jar /export/server/hive/lib/my_udf.jar;

4. Bind a function name to the custom class - temporary function
hive> create temporary function my_jiami as 'cn.itcast.udf.MyUDF';

5. Use the custom function
hive> select my_jiami(phone_num) from test_user;


Using the function, option 2: permanent function
1. Upload the custom function jar to HDFS.
hadoop fs -mkdir /hive_func
hadoop fs -put my_udf.jar /hive_func
2. Create the permanent function
hive> create function my_jiami2 as 'cn.itcast.udf.MyUDF'
    using jar 'hdfs://node1:8020/hive_func/my_udf.jar';
3. Verify
hive> select my_jiami2(phone_num) from test_user;

Dropping functions
-- drop the temporary function
drop temporary function if exists my_jiami;
-- drop the permanent function (this does not delete the jar on HDFS)
drop function if exists my_jiami2;


2. Custom UDTF
Implement a UDTF that splits a delimiter-separated string into individual words, e.g.:
Source data: "zookeeper,hadoop,hdfs,hive,MapReduce"
Target data:
zookeeper
hadoop
hdfs
hive
MapReduce

Implementation
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class MyUDTF extends GenericUDTF {
    // reusable one-element row buffer passed to forward()
    private final transient Object[] forwardListObj = new Object[1];

    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        // names of the generated output columns
        List<String> fieldNames = new ArrayList<>();
        fieldNames.add("column_01");
        // object inspectors describing each output column's value type
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void process(Object[] objects) throws HiveException {
        // 1: the raw input string
        String args = objects[0].toString();
        // 2: the second argument, used as the delimiter
        String splitKey = objects[1].toString();
        // 3: split the input on the delimiter
        String[] fields = args.split(splitKey);
        // 4: emit one output row per word
        for (String field : fields) {
            forwardListObj[0] = field;
            forward(forwardListObj);
        }
    }

    @Override
    public void close() throws HiveException {
    }
}

Upload the packaged jar to /export/server/hive/lib and rename it
cd /export/server/hive/lib
mv original-day_10_hive_udtf-1.0-SNAPSHOT.jar my_udtf.jar

Add the jar in the Hive client so it is on Hive's classpath
hive> add jar /export/server/hive/lib/my_udtf.jar;

Create a temporary function bound to the UDTF class
hive> create temporary function my_udtf as 'cn.itcast.udf.MyUDTF';

Use the custom UDTF
hive> select my_udtf("zookeeper,hadoop,hdfs,hive,MapReduce", ",") word;
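
The UDTF can also be applied per row of a table via lateral view; a sketch assuming a hypothetical table tech_stacks(id, techs) where techs is a comma-separated string:

-- one generated row per word, joined back to its source row
hive> select t.id, w.word
      from tech_stacks t
      lateral view my_udtf(t.techs, ",") w as word;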

 
