
Hive UDF for handling special characters such as [\x22 and urlencode-style encoding issues

If your function reads and returns only basic data types (the basic Hadoop & Hive writable types such as Text, IntWritable, LongWritable, DoubleWritable, and so on), then the simple API (org.apache.hadoop.hive.ql.exec.UDF) is enough.

However, if you want to write a UDF that operates on nested data structures such as Map, List, and Set, you need to get familiar with the org.apache.hadoop.hive.ql.udf.generic.GenericUDF API (a short sketch follows below).
Simple API:  org.apache.hadoop.hive.ql.exec.UDF
Complex API: org.apache.hadoop.hive.ql.udf.generic.GenericUDF
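For orientation, here is a minimal GenericUDF sketch. It is not part of the original demo: the class and function names are hypothetical, and it simply returns the length of an array argument. It shows the three methods every GenericUDF implements and repeats the null check discussed below.

package udf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Hypothetical example: list_size(array<...>) returns the number of elements
public class ListSizeUDF extends GenericUDF {

    private ListObjectInspector listOI;

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // Validate the argument count and type once, before any rows are processed
        if (arguments.length != 1 || !(arguments[0] instanceof ListObjectInspector)) {
            throw new UDFArgumentException("list_size expects a single array argument");
        }
        listOI = (ListObjectInspector) arguments[0];
        // The return type is int
        return PrimitiveObjectInspectorFactory.javaIntObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        Object list = arguments[0].get();
        if (list == null) {
            return null; // same null check as in the simple UDF below
        }
        return listOI.getListLength(list);
    }

    @Override
    public String getDisplayString(String[] children) {
        return "list_size(" + String.join(", ", children) + ")";
    }
}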
Next I will build a UDF through an example (the demo below uses the simple API) and provide the code and a test for it.
Note: a UDF does not check for null arguments on its own, and nulls are very common in large data sets, so be rigorous; a null check is added in the code below.
Reference pom.xml:
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>2.1.1</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.1</version>
        </dependency>

        <!--
        <dependency>
            <groupId>com.aliyun.odps</groupId>
            <artifactId>odps-sdk-udf</artifactId>
            <version>0.29.10-public</version>
        </dependency>
        -->
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.0.0</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <artifactSet>
                            </artifactSet>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

  

DEMO:

package udf;

import java.io.UnsupportedEncodingException;
// Use the JDK's URLDecoder so no extra third-party dependency is required
import java.net.URLDecoder;

import org.apache.hadoop.hive.ql.exec.UDF;

public class TestDecodeX extends UDF {

    // Helper for quick manual checks: rewrite "\x" escapes to "%" and URL-decode
    public static void decodeX(String s) throws UnsupportedEncodingException {
        String s1 = s.replaceAll("\\\\x", "%");
        String decode = URLDecoder.decode(s1, "utf-8");
        System.out.println(decode);
    }

    public String evaluate(String input) throws Exception {
        // The UDF framework does not check for null arguments, and nulls are very
        // common in large data sets, so add an explicit null check here.
        if (input == null) {
            return null;
        }
        String decode = null;
        try {
            // Turn "\x22"-style escapes into "%22" so they can be URL-decoded
            String s1 = input.replaceAll("\\\\x", "%");
            decode = URLDecoder.decode(s1, "utf-8");
        } catch (Exception e) {
            // Malformed input: swallow the error and return null
        }
        return decode;
    }

    public static void main(String[] args) throws Exception {
        String s1 = "G977N%7C7.1.2%7Cwifi%7C%7Cgamepubgoogle%7CGetHashed%7Ccom.gamepub.ft2.g%7Candroid%7C%7C%7C1.0.2%7Csamsung%7C1547548%7C1%7CAsia%2FSeoul%7CARM%7C%7C19d1b5cdf01341e99c670f254765148d%22%5D";
        String s = "172.31.35.210|21/04/2021:10:59:01|[\\x22TakeSample|0bb9f14b1041a8d9|32550283-4DF6-4CC5-9922-E4F9CFAFD7FD|iPhone13,1|14.2.1|wifi||gamepubappstore|GetHashed|com.gamepub.fr2|ios|BAB3A467-A4D0-4900-80F7-BCB9D53757B1||0.26.87|\\xE8\\x8B\\xB9\\xE6\\x9E\\x9C|3.63|0|Asia/Seoul|ARM64||\\x22]\n";
        TestDecodeX t = new TestDecodeX();
        System.out.println(t.evaluate(s1));
    }
}

 

Sample result:

G977N|7.1.2|wifi||gamepubgoogle|GetHashed|com.gamepub.ft2.g|android|||1.0.2|samsung|1547548|1|Asia/Seoul|ARM||19d1b5cdf01341e99c670f254765148d"]

Process finished with exit code 0
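
For completeness, here is a minimal JUnit 4 test sketch for the UDF above. It assumes a junit dependency, which the reference pom.xml above does not include.

package udf;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;

import org.junit.Test;

// Minimal test sketch for TestDecodeX (JUnit 4 dependency assumed)
public class TestDecodeXTest {

    @Test
    public void decodesPercentEncodedPipes() throws Exception {
        TestDecodeX udf = new TestDecodeX();
        assertEquals("a|b\"", udf.evaluate("a%7Cb%22"));
    }

    @Test
    public void decodesHexEscapes() throws Exception {
        TestDecodeX udf = new TestDecodeX();
        // \x22 is rewritten to %22 and then URL-decoded to a double quote
        assertEquals("[\"]", udf.evaluate("[\\x22]"));
    }

    @Test
    public void returnsNullForNullInput() throws Exception {
        assertNull(new TestDecodeX().evaluate(null));
    }
}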

 

On the Hive client:

hive> ADD JAR target/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar;  
hive> CREATE TEMPORARY FUNCTION decodeX as 'udf.TestDecodeX';  

 

References:

Hive UDF开发指南 (Hive UDF Development Guide)
