OutputFormat in MapReduce
In the Hadoop source, OutputFormat is an abstract class, public abstract class OutputFormat<K, V>, which defines the output format of a job's reduce tasks.
https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/OutputFormat.java
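Besides getRecordWriter, the class also declares checkOutputSpecs and getOutputCommitter. The lifecycle can be sketched in plain Java with no Hadoop dependency; the Mock* names below are hypothetical stand-ins, not the real org.apache.hadoop.mapreduce types:

```java
import java.io.IOException;

// Illustrative stand-ins for Hadoop's classes; the Mock* names are
// hypothetical, not the real org.apache.hadoop.mapreduce types.
abstract class MockRecordWriter<K, V> {
    abstract void write(K key, V value) throws IOException;
    abstract void close() throws IOException;
}

abstract class MockOutputFormat<K, V> {
    // Mirrors OutputFormat.getRecordWriter: give each task a writer.
    abstract MockRecordWriter<K, V> getRecordWriter() throws IOException;
    // Mirrors OutputFormat.checkOutputSpecs: validate the output
    // specification (e.g. output dir absent) before the job runs.
    abstract void checkOutputSpecs() throws IOException;
}

public class OutputFormatSketch {
    // A trivial format that collects "key=value" lines in memory.
    static String demo() {
        StringBuilder sink = new StringBuilder();
        MockOutputFormat<String, Integer> fmt = new MockOutputFormat<String, Integer>() {
            void checkOutputSpecs() { /* nothing to validate in memory */ }
            MockRecordWriter<String, Integer> getRecordWriter() {
                return new MockRecordWriter<String, Integer>() {
                    void write(String key, Integer value) {
                        sink.append(key).append('=').append(value).append('\n');
                    }
                    void close() { /* nothing to flush */ }
                };
            }
        };
        try {
            fmt.checkOutputSpecs();                                      // 1. validate
            MockRecordWriter<String, Integer> w = fmt.getRecordWriter(); // 2. open
            w.write("a", 1);                                             // 3. write pairs
            w.write("b", 2);
            w.close();                                                   // 4. close
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

The real framework follows the same sequence per task attempt: validate the specs once, then open, write, and close a RecordWriter.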
For background, see the article: MapReduce快速入门系列(12) | MapReduce之OutputFormat
The source for the commonly used OutputFormat implementations lives under:
https://github.com/apache/hadoop/tree/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output
1. Text output: TextOutputFormat, Hadoop's default output class
https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/TextOutputFormat.java
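Because TextOutputFormat is the default, a driver gets it without any setup; selecting it explicitly looks like the following driver-side fragment (it assumes a new-API Job with Hadoop on the classpath, and the output path is a placeholder):

```java
// Driver-side fragment; needs the usual Hadoop MapReduce imports
// (Job, Configuration, Text, IntWritable, TextOutputFormat, FileOutputFormat, Path).
Job job = Job.getInstance(new Configuration(), "example");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setOutputFormatClass(TextOutputFormat.class);          // already the default
FileOutputFormat.setOutputPath(job, new Path("/tmp/out")); // placeholder path
```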
TextOutputFormat extends FileOutputFormat. FileOutputFormat is itself an abstract class, a subclass of OutputFormat; its source is at:
https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.java
(i) One important piece is RecordWriter, which has two methods: write and close. (The link below is the old mapred API, where RecordWriter is an interface; in the new mapreduce API it is an abstract class with the same two methods.)
https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/RecordWriter.java
Source:
```java
public interface RecordWriter<K, V> {
  /**
   * Writes a key/value pair.
   *
   * @param key the key to write.
   * @param value the value to write.
   * @throws IOException
   */
  void write(K key, V value) throws IOException;

  /**
   * Close this <code>RecordWriter</code> to future operations.
   *
   * @param reporter facility to report progress.
   * @throws IOException
   */
  void close(Reporter reporter) throws IOException;
}
```
The TextOutputFormat implementation defines LineRecordWriter<K, V> as follows. Source:
```java
protected static class LineRecordWriter<K, V>
    extends RecordWriter<K, V> {
  private static final String utf8 = "UTF-8";
  private static final byte[] newline;
  static {
    try {
      newline = "\n".getBytes(utf8);
    } catch (UnsupportedEncodingException uee) {
      throw new IllegalArgumentException("can't find " + utf8 + " encoding");
    }
  }

  protected DataOutputStream out;
  private final byte[] keyValueSeparator;

  public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
    this.out = out;
    try {
      this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
    } catch (UnsupportedEncodingException uee) {
      throw new IllegalArgumentException("can't find " + utf8 + " encoding");
    }
  }

  public LineRecordWriter(DataOutputStream out) {
    this(out, "\t");
  }

  /**
   * Write the object to the byte stream, handling Text as a special case.
   * @param o the object to print
   * @throws IOException if the write throws, we pass it on
   */
  private void writeObject(Object o) throws IOException {
    if (o instanceof Text) {
      Text to = (Text) o;
      out.write(to.getBytes(), 0, to.getLength());
    } else {
      out.write(o.toString().getBytes(utf8));
    }
  }

  public synchronized void write(K key, V value)
      throws IOException {
    boolean nullKey = key == null || key instanceof NullWritable;
    boolean nullValue = value == null || value instanceof NullWritable;
    if (nullKey && nullValue) {
      return;
    }
    if (!nullKey) {
      writeObject(key);
    }
    if (!(nullKey || nullValue)) {
      out.write(keyValueSeparator);
    }
    if (!nullValue) {
      writeObject(value);
    }
    out.write(newline);
  }

  public synchronized void close(TaskAttemptContext context)
      throws IOException {
    out.close();
  }
}
```
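The byte-level behavior of write above (key, then the separator, then the value, then "\n", with null or NullWritable halves skipped) can be reproduced with plain java.io; this mirrors the logic rather than calling the real Hadoop class:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class LineFormatDemo {
    // Mirrors LineRecordWriter.write for String keys and values;
    // a Java null stands in for Hadoop's NullWritable here.
    static void writeLine(DataOutputStream out, String key, String value)
            throws IOException {
        boolean nullKey = key == null;
        boolean nullValue = value == null;
        if (nullKey && nullValue) {
            return; // nothing written at all, not even a newline
        }
        if (!nullKey) {
            out.write(key.getBytes(StandardCharsets.UTF_8));
        }
        if (!(nullKey || nullValue)) {
            // "\t" is LineRecordWriter's default key/value separator
            out.write("\t".getBytes(StandardCharsets.UTF_8));
        }
        if (!nullValue) {
            out.write(value.getBytes(StandardCharsets.UTF_8));
        }
        out.write("\n".getBytes(StandardCharsets.UTF_8));
    }

    static String demo() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            writeLine(out, "hello", "3");  // key \t value \n
            writeLine(out, null, "5");     // value only, no separator
            writeLine(out, "world", null); // key only, no separator
            writeLine(out, null, null);    // skipped entirely
            out.close();
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Running this produces three lines, hello\t3, 5, and world, which is why a reducer emitting NullWritable keys gets value-only text files.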
(ii) The other important piece is the getRecordWriter abstract method: a concrete subclass of FileOutputFormat must implement it to produce a RecordWriter<K, V> from the job:
```java
public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext job
    ) throws IOException, InterruptedException;
```
The TextOutputFormat implementation provides getRecordWriter as follows, using LineRecordWriter. Source:
```java
public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job
    ) throws IOException, InterruptedException {
  Configuration conf = job.getConfiguration();
  boolean isCompressed = getCompressOutput(job);
  String keyValueSeparator = conf.get(SEPERATOR, "\t");
  CompressionCodec codec = null;
  String extension = "";
  if (isCompressed) {
    Class<? extends CompressionCodec> codecClass =
      getOutputCompressorClass(job, GzipCodec.class);
    codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
    extension = codec.getDefaultExtension();
  }
  Path file = getDefaultWorkFile(job, extension);
  FileSystem fs = file.getFileSystem(conf);
  if (!isCompressed) {
    FSDataOutputStream fileOut = fs.create(file, false);
    return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
  } else {
    FSDataOutputStream fileOut = fs.create(file, false);
    return new LineRecordWriter<K, V>(new DataOutputStream
                                      (codec.createOutputStream(fileOut)),
                                      keyValueSeparator);
  }
}
```
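The SEPERATOR constant read above is the configuration property mapreduce.output.textoutputformat.separator, and compression is controlled through FileOutputFormat, so both can be tuned from the driver. A fragment, assuming a configured Job named job:

```java
// Switch the key/value separator from the default "\t" to a comma.
job.getConfiguration().set("mapreduce.output.textoutputformat.separator", ",");
// Gzip-compress the text output (GzipCodec is the default codec above).
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
```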
2. Binary output: SequenceFileOutputFormat
https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/SequenceFileOutputFormat.java
Like TextOutputFormat, SequenceFileOutputFormat extends FileOutputFormat.
(i) One important piece is the getSequenceWriter method, which returns a Writer for the binary SequenceFile:
https://github.com/apache/hadoop/blob/master/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/SequenceFile.java
Source:
```java
protected SequenceFile.Writer getSequenceWriter(TaskAttemptContext context,
    Class<?> keyClass, Class<?> valueClass)
    throws IOException {
  Configuration conf = context.getConfiguration();

  CompressionCodec codec = null;
  CompressionType compressionType = CompressionType.NONE;
  if (getCompressOutput(context)) {
    // find the kind of compression to do
    compressionType = getOutputCompressionType(context);
    // find the right codec
    Class<?> codecClass = getOutputCompressorClass(context,
                                                   DefaultCodec.class);
    codec = (CompressionCodec)
      ReflectionUtils.newInstance(codecClass, conf);
  }
  // get the path of the temporary output file
  Path file = getDefaultWorkFile(context, "");
  FileSystem fs = file.getFileSystem(conf);
  return SequenceFile.createWriter(fs, conf, file,
           keyClass, valueClass, compressionType, codec, context);
}
```
(ii) The other important piece is its implementation of the getRecordWriter abstract method. Source:
```java
public RecordWriter<K, V> getRecordWriter(TaskAttemptContext context
    ) throws IOException, InterruptedException {
  final SequenceFile.Writer out = getSequenceWriter(context,
    context.getOutputKeyClass(), context.getOutputValueClass());

  return new RecordWriter<K, V>() {

    public void write(K key, V value)
      throws IOException {
      out.append(key, value);
    }

    public void close(TaskAttemptContext context) throws IOException {
      out.close();
    }
  };
}
```
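Using this format from a driver, including the compression type that getOutputCompressionType reads back above, might look like the following fragment (it assumes a configured Job named job and Hadoop on the classpath):

```java
job.setOutputFormatClass(SequenceFileOutputFormat.class);
// Key/value classes must match the job's output classes,
// since getRecordWriter passes them to getSequenceWriter.
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// Compress whole blocks of records rather than each value separately.
FileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressionType(job,
    SequenceFile.CompressionType.BLOCK);
```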
3. ParquetThriftOutputFormat, for writing Thrift objects to Parquet files. Reference project:
https://github.com/adobe-research/spark-parquet-thrift-example/blob/master/src/main/scala/SparkParquetThriftApp.scala
Code:
```scala
ParquetThriftOutputFormat.setThriftClass(job, classOf[SampleThriftObject])
ParquetOutputFormat.setWriteSupportClass(job, classOf[SampleThriftObject])
sc.parallelize(sampleData)
  .map(obj => (null, obj))
  .saveAsNewAPIHadoopFile(
    parquetStore,
    classOf[Void],
    classOf[SampleThriftObject],
    classOf[ParquetThriftOutputFormat[SampleThriftObject]],
    job.getConfiguration
  )
```
4. HiveIgnoreKeyTextOutputFormat, for text-format Hive tables
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/HiveIgnoreKeyTextOutputFormat.java
It implements HiveOutputFormat:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/HiveOutputFormat.java
HiveIgnoreKeyTextOutputFormat always passes null as the key, so only values reach the output file. Source:
```java
@Override
public synchronized void write(K key, V value) throws IOException {
  this.mWriter.write(null, value);
}
```
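The effect, whatever key the reducer emits, only the value is written, can be illustrated with a plain-Java stand-in (the class and method names here are hypothetical; no Hive dependency):

```java
import java.io.StringWriter;

public class IgnoreKeyDemo {
    // Stand-in for the wrapped text writer: "key<TAB>value" or bare value.
    static void write(StringWriter out, Object key, Object value) {
        if (key != null) {
            out.write(key.toString() + "\t");
        }
        out.write(value.toString() + "\n");
    }

    // Mirrors HiveIgnoreKeyTextOutputFormat's write: forward null as the
    // key, so the wrapped writer emits the value alone.
    static void writeIgnoringKey(StringWriter out, Object key, Object value) {
        write(out, null, value);
    }

    static String demo() {
        StringWriter out = new StringWriter();
        writeIgnoringKey(out, "k1", "row1");
        writeIgnoringKey(out, "k2", "row2");
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```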
This article is published only on cnblogs and tonglin0325's blog. Author: tonglin0325. When reposting, please cite the original link: https://www.cnblogs.com/tonglin0325/p/13970805.html