HIVE: Custom TextInputFormat (works with the old MapReduce API; is the new MapReduce API implementation buggy?)
Our input file, hello0, contains the following:
xiaowang 28 shanghai@_@zhangsan 38 beijing@_@someone 100 unknown
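Logically it holds three records, separated by @_@. We will implement a custom TextInputFormat twice, once with the old MapReduce API and once with the new MapReduce API, then configure each one in Hive and load the data.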
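To make the intended record boundaries concrete, here is a minimal plain-Java sketch that splits the sample content on the @_@ delimiter. It does not use Hadoop at all; the class name RecordSplitSketch and the method splitRecords are made up for this illustration, and it only mimics what a record reader with a custom delimiter is expected to hand back.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordSplitSketch {
    // Custom record delimiter used in hello0.
    static final String DELIMITER = "@_@";

    // Split raw file content into logical records on the literal delimiter.
    public static List<String> splitRecords(String content) {
        List<String> records = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = content.indexOf(DELIMITER, start)) >= 0) {
            records.add(content.substring(start, idx));
            start = idx + DELIMITER.length();
        }
        // Whatever follows the last delimiter is the final record.
        if (start < content.length()) {
            records.add(content.substring(start));
        }
        return records;
    }

    public static void main(String[] args) {
        String content =
            "xiaowang 28 shanghai@_@zhangsan 38 beijing@_@someone 100 unknown";
        for (String record : splitRecords(content)) {
            System.out.println(record);
        }
    }
}
```

Run on the sample line, this prints the three records one per line, each with three space-separated fields.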
First, the old API
1. Define Format6, which extends TextInputFormat
package MyTestPackage;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

// Old-API (org.apache.hadoop.mapred) input format that splits records on "@_@".
public class Format6 extends TextInputFormat {
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        // Use "@_@" as the record delimiter instead of the default newline.
        byte[] recordDelimiterBytes = "@_@".getBytes(StandardCharsets.UTF_8);
        return new LineRecordReader(job, (FileSplit) split, recordDelimiterBytes);
    }
}
2. Export it as MyInputFormat.jar and place it in hive/lib
3. Use it in the Hive DDL
create table hive_again2(name varchar(50), age int, city varchar(30)) stored as INPUTFORMAT 'MyTestPackage.Format6' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
The data loads successfully.
Now the new API
1. Define Format5, which extends TextInputFormat
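package MyTestPackage;

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// New-API (org.apache.hadoop.mapreduce) input format that splits records on "@_@".
public class Format5 extends TextInputFormat {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext tac) {
        // Use "@_@" as the record delimiter instead of the default newline.
        byte[] recordDelimiterBytes = "@_@".getBytes(StandardCharsets.UTF_8);
        return new LineRecordReader(recordDelimiterBytes);
    }
}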
2. Export it as MyInputFormat.jar and place it in hive/lib
3. Use it in the Hive DDL
create table hive_again1(name varchar(50), age int, city varchar(30)) stored as INPUTFORMAT 'MyTestPackage.Format5' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
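This fails with an error!
Running Format5 in a plain MapReduce job under the debugger works fine (http://www.cnblogs.com/silva/p/4490532.html), so why does it fail when Hive uses it? I don't understand. If anyone knows, please point me in the right direction. Thanks!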
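One possible explanation, offered as an assumption because the actual error message is not shown here: Hive drives table input through the old org.apache.hadoop.mapred.InputFormat interface, and a class built on the new org.apache.hadoop.mapreduce API, such as Format5, does not implement that interface. The sketch below is a hypothetical diagnostic helper (InputFormatApiCheck and hasTypeNamed are names invented for illustration) that walks a class hierarchy by reflection and reports whether a given type name appears in it. By default it checks JDK classes, so it runs without Hadoop on the classpath.

```java
public class InputFormatApiCheck {
    // Returns true if cls itself, any superclass, or any (transitively)
    // implemented interface has the given fully qualified name.
    public static boolean hasTypeNamed(Class<?> cls, String typeName) {
        if (cls == null) {
            return false;
        }
        if (cls.getName().equals(typeName)) {
            return true;
        }
        for (Class<?> iface : cls.getInterfaces()) {
            if (hasTypeNamed(iface, typeName)) {
                return true;
            }
        }
        return hasTypeNamed(cls.getSuperclass(), typeName);
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Defaults are JDK types so the sketch runs stand-alone; pass
        // the class to inspect and the type name to look for as arguments.
        String className = args.length > 0 ? args[0] : "java.util.ArrayList";
        String typeName = args.length > 1 ? args[1] : "java.util.List";
        System.out.println(hasTypeNamed(Class.forName(className), typeName));
    }
}
```

With the Hadoop jars and MyInputFormat.jar on the classpath, running it with the arguments MyTestPackage.Format5 org.apache.hadoop.mapred.InputFormat, and again with MyTestPackage.Format6, would show whether the two classes differ in this respect.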