Hive UDF: Handling Special Characters such as [\x22 and urlencode-Style Encoding Issues
If your function reads and returns only primitive data types (the basic Hadoop & Hive writable types such as Text, IntWritable, LongWritable, DoubleWritable, and so on), the simple API (org.apache.hadoop.hive.ql.exec.UDF) is all you need.
However, if you want to write a UDF that operates on nested data structures such as Map, List, and Set, you will need to get familiar with the org.apache.hadoop.hive.ql.udf.generic.GenericUDF API.
Simple API: org.apache.hadoop.hive.ql.exec.UDF
Complex API: org.apache.hadoop.hive.ql.udf.generic.GenericUDF
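The complex API resolves argument types once up front (in initialize) and then works on rows through ObjectInspectors. As a rough illustration only — this sketch is not from the original post, and the class name ArraySizeUDF and its behavior are assumptions — a GenericUDF that returns the length of a LIST argument could look like this:

package udf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Hypothetical GenericUDF sketch (not part of the original post): returns
// the number of elements in a LIST argument, null-safely.
public class ArraySizeUDF extends GenericUDF {

    private ListObjectInspector listOI;

    @Override
    public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        // Called once per query: validate argument types and cache the inspector.
        if (args.length != 1 || !(args[0] instanceof ListObjectInspector)) {
            throw new UDFArgumentException("array_size expects exactly one LIST argument");
        }
        listOI = (ListObjectInspector) args[0];
        return PrimitiveObjectInspectorFactory.javaIntObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] args) throws HiveException {
        // Called once per row: a null list yields a null size.
        Object list = args[0].get();
        return list == null ? null : listOI.getListLength(list);
    }

    @Override
    public String getDisplayString(String[] children) {
        return "array_size(" + children[0] + ")";
    }
}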
Next, I will build a UDF against the simple API through a concrete example, providing both the code and a test.
Note: in practice the UDF framework does not check arguments for null, and nulls are very common in large datasets, so be defensive — the example below adds an explicit null check.
Reference pom.xml:
<dependencies> <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec --> <dependency> <groupId>org.apache.hive</groupId> <artifactId>hive-exec</artifactId> <version>2.1.1</version> </dependency> <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common --> <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-common</artifactId> <version>2.7.1</version> </dependency> <!-- <dependency>--> <!-- <groupId>com.aliyun.odps</groupId>--> <!-- <artifactId>odps-sdk-udf</artifactId>--> <!-- <version>0.29.10-public</version>--> <!-- </dependency>--> </dependencies> <build> <pluginManagement> <plugins> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-shade-plugin</artifactId> <version>3.0.0</version> <executions> <execution> <phase>package</phase> <goals> <goal>shade</goal> </goals> <configuration> <artifactSet> </artifactSet> </configuration> </execution> </executions> </plugin> </plugins> </pluginManagement> <plugins> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-compiler-plugin</artifactId> <configuration> <source>6</source> <target>6</target> </configuration> </plugin> </plugins> </build>
DEMO:
package udf;

import org.apache.hadoop.hive.ql.exec.UDF;

import java.io.UnsupportedEncodingException;
// The JDK's URLDecoder is used here so that no extra dependency is needed.
import java.net.URLDecoder;

public class TestDecodeX extends UDF {

    // Core decoding logic: rewrite hex escapes such as \x22 into %22,
    // then URL-decode the whole string as UTF-8.
    public static String decodeX(String s) throws UnsupportedEncodingException {
        // "\\\\x" is the regex \\x, i.e. a literal backslash followed by 'x'.
        String s1 = s.replaceAll("\\\\x", "%");
        return URLDecoder.decode(s1, "utf-8");
    }

    public String evaluate(String input) {
        // The simple UDF API does not check arguments for null, and nulls are
        // common in large datasets, so guard explicitly.
        if (input == null) {
            return null;
        }
        try {
            return decodeX(input);
        } catch (Exception e) {
            // Malformed escape sequences yield null instead of failing the query.
            return null;
        }
    }

    public static void main(String[] args) {
        // A percent-encoded sample line:
        String s1 = "G977N%7C7.1.2%7Cwifi%7C%7Cgamepubgoogle%7CGetHashed%7Ccom.gamepub.ft2.g%7Candroid%7C%7C%7C1.0.2%7Csamsung%7C1547548%7C1%7CAsia%2FSeoul%7CARM%7C%7C19d1b5cdf01341e99c670f254765148d%22%5D";
        // A raw log line with \xNN escapes, as it appears before decoding:
        String s = "172.31.35.210|21/04/2021:10:59:01|[\\x22TakeSample|0bb9f14b1041a8d9|32550283-4DF6-4CC5-9922-E4F9CFAFD7FD|iPhone13,1|14.2.1|wifi||gamepubappstore|GetHashed|com.gamepub.fr2|ios|BAB3A467-A4D0-4900-80F7-BCB9D53757B1||0.26.87|\\xE8\\x8B\\xB9\\xE6\\x9E\\x9C|3.63|0|Asia/Seoul|ARM64||\\x22]\n";
        TestDecodeX t = new TestDecodeX();
        System.out.println(t.evaluate(s1));
    }
}
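A note on the decoding trick: evaluate handles both input shapes seen in main — an already percent-encoded string such as s1 passes through the \x rewrite unchanged and is simply URL-decoded, while a \xNN-escaped log line such as s is first normalized into %NN form. The implicit assumption is that a raw line never contains a literal % that is not itself part of an escape sequence; such a character would be misinterpreted by the decoder.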
Sample output:
G977N|7.1.2|wifi||gamepubgoogle|GetHashed|com.gamepub.ft2.g|android|||1.0.2|samsung|1547548|1|Asia/Seoul|ARM||19d1b5cdf01341e99c670f254765148d"]
Process finished with exit code 0
In the Hive CLI:
hive> ADD JAR target/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar;
hive> CREATE TEMPORARY FUNCTION decodeX as 'udf.TestDecodeX';
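Once registered, the function can be called like any built-in; the table and column names below are hypothetical:

hive> SELECT decodeX(raw_line) FROM access_log LIMIT 10;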
Reference:
Hive UDF开发指南 (Hive UDF Development Guide)