1. Problem Description
How do we load a data file whose fields are separated by a multi-character delimiter into a Hive table? Sample data follows.
The field delimiter is "@#$":
test1@#$test1name@#$test2value
test2@#$test2name@#$test2value
test3@#$test3name@#$test4value
How do we load the sample data above into a Hive table (multi_delimiter_test) with the following structure?

字段名 (Field Name) | 字段类型 (Field Type)
---|---
s1 | string
s2 | string
s3 | string
2. Hive Multi-Delimiter Support
Hive 0.14 and later support multi-character field delimiters through MultiDelimitSerDe; see https://cwiki.apache.org/confluence/display/Hive/MultiDelimitSerDe
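The SerDe splits each line on the full delimiter string rather than on any single character. The same split can be checked locally with awk (a quick illustrative one-liner, not part of the original post; note that awk treats a multi-character FS as a regex, so the "$" must be neutralized):

```shell
# Split a sample row on the multi-character delimiter "@#$".
# awk interprets a multi-character field separator as a regex,
# so "$" is wrapped in a bracket expression to make it literal.
printf 'test1@#$test1name@#$test2value\n' | awk -F'@#[$]' '{print $2}'
# prints: test1name
```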
3. Implementation Steps
1. Prepare the multi-delimiter file and upload it to the corresponding HDFS directory
```
[ec2-user@ip-172-31-8-141 ~]$ cat multi_delimiter_test.dat
test1@#$test1name@#$test2value
test2@#$test2name@#$test2value
test3@#$test3name@#$test4value
[ec2-user@ip-172-31-8-141 ~]$ hadoop dfs -put multi_delimiter_test.dat /fayson/multi_delimiter_test
[ec2-user@ip-172-31-8-141 ~]$ hadoop dfs -ls /fayson/multi_delimiter_test
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Found 1 items
-rw-r--r--   3 user_r supergroup         93 2017-08-23 03:24 /fayson/multi_delimiter_test/multi_delimiter_test.dat
[ec2-user@ip-172-31-8-141 ~]$
```
2. Create a table over the prepared multi-delimiter file
```sql
create external table multi_delimiter_test(
  s1 string,
  s2 string,
  s3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim"="@#$")
stored as textfile
location '/fayson/multi_delimiter_test';
```
Note that this SerDe class lives in the contrib package, so the hive-contrib jar must be available on Hive's classpath.
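If MultiDelimitSerDe is not available in your environment, Hive's built-in RegexSerDe can parse the same layout. A sketch follows; the table name and the regex (which escapes the "$" and captures the three "@#$"-separated columns) are assumptions for illustration, not part of the original post:

```sql
-- Alternative sketch using Hive's built-in RegexSerDe.
-- Each capture group maps to one column; "$" is escaped in the regex.
create external table multi_delimiter_test_regex(
  s1 string,
  s2 string,
  s3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.*?)@#\\$(.*?)@#\\$(.*)")
stored as textfile
location '/fayson/multi_delimiter_test';
```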
3. Test
```
> select * from multi_delimiter_test;
+--------------------------+--------------------------+--------------------------+--+
| multi_delimiter_test.s1  | multi_delimiter_test.s2  | multi_delimiter_test.s3  |
+--------------------------+--------------------------+--------------------------+--+
| test1                    | test1name                | test2value               |
| test2                    | test2name                | test2value               |
| test3                    | test3name                | test4value               |
+--------------------------+--------------------------+--------------------------+--+
```