Understanding OutputFormat in Hadoop

I. OutputFormat

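OutputFormat describes the output specification of a MapReduce job. It has two main responsibilities:

1. Validating the job's output specification, for example by checking whether the output directory already exists.

2. Providing a RecordWriter implementation that writes the job's output to files in the file system.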
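The two responsibilities above can be sketched without Hadoop on the classpath. The following is only a simplified model that uses the local file system and the JDK: the class `SimpleTextOutput`, its nested `LineRecordWriter`, and the tab-separated line format are all hypothetical names and choices made for illustration, not Hadoop's API.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical, Hadoop-free stand-in for the two responsibilities listed above. */
class SimpleTextOutput {

    /** Responsibility 1: refuse to run if the output directory already exists. */
    static void checkOutputSpecs(Path outputDir) throws IOException {
        if (Files.exists(outputDir)) {
            throw new FileAlreadyExistsException(
                    "Output directory " + outputDir + " already exists");
        }
    }

    /** Responsibility 2: a writer that emits one "key<TAB>value" line per record. */
    static final class LineRecordWriter<K, V> implements AutoCloseable {
        private final BufferedWriter out;

        LineRecordWriter(Path file) throws IOException {
            // Create the output directory on demand, as a job would on first write.
            Path parent = file.getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            this.out = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
        }

        void write(K key, V value) throws IOException {
            out.write(key + "\t" + value);
            out.newLine();
        }

        @Override
        public void close() throws IOException {
            out.close();
        }
    }
}
```

In this sketch, calling `checkOutputSpecs` before any writer is created mirrors the idea that validation happens at job submission, so a second run against the same directory fails early instead of overwriting earlier results.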

OutputFormat consists mainly of three abstract methods. The annotated source code below describes what each method does:

package org.apache.hadoop.mapreduce;

import java.io.IOException;

public abstract class OutputFormat<K, V> {

  /**
   * Get the {@link RecordWriter} for the given task.
   * Note: the RecordWriter is the object that writes the task's output
   * key-value pairs.
   * @param context the information about the current task.
   * @return a {@link RecordWriter} to write the output for the job.
   * @throws IOException
   */
  public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
          throws IOException, InterruptedException;

  /**
   * Check for validity of the output-specification for the job.
   * <p>This is to validate the output specification for the job when a
   * job is submitted.  Typically checks that it does not already exist,
   * throwing an exception when it already exists, so that output is not
   * overwritten.</p>
   * Note: in practice this checks whether the output directory already
   * exists and throws an exception if it does, so that earlier output
   * is not overwritten.
   * @param context information about the job
   * @throws IOException when output should not be attempted
   */
  public abstract void checkOutputSpecs(JobContext context)
          throws IOException, InterruptedException;

  /**
   * Get the output committer for this output format. This is responsible
   * for ensuring the output is committed correctly.
   * Note: the returned OutputCommitter object is what ensures the output
   * is committed correctly.
   * @param context the task context
   * @return an output committer
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
          throws IOException, InterruptedException;
}
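To make the idea of "committing output correctly" more concrete, here is a small sketch of one common pattern: each task attempt writes into a private scratch directory, and its files become visible in the final output directory only when the attempt is committed. This is a Hadoop-free model built on the JDK. The class `SimpleCommitter`, its method names, and the `_temporary/<attemptId>` layout are assumptions made for illustration. Hadoop's real OutputCommitter has a richer job and task lifecycle than this.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical model of the commit step; this is not Hadoop's OutputCommitter API. */
class SimpleCommitter {
    private final Path outputDir;
    private final Path attemptDir;

    SimpleCommitter(Path outputDir, String attemptId) {
        this.outputDir = outputDir;
        // Each attempt gets its own scratch directory under the output directory.
        this.attemptDir = outputDir.resolve("_temporary").resolve(attemptId);
    }

    /** Where the attempt should write a file while it is still running. */
    Path workFile(String name) throws IOException {
        Files.createDirectories(attemptDir);
        return attemptDir.resolve(name);
    }

    /** Promote a finished file into the final output directory. */
    void commit(String name) throws IOException {
        Files.move(attemptDir.resolve(name), outputDir.resolve(name));
    }

    /** Discard everything a failed attempt wrote. */
    void abort() throws IOException {
        if (!Files.exists(attemptDir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(attemptDir)) {
            // Reverse order visits children before their parent directory.
            for (Path p : paths.sorted(Comparator.reverseOrder())
                               .collect(Collectors.toList())) {
                Files.delete(p);
            }
        }
    }
}
```

The point of the design is that a failed or duplicate attempt can be thrown away with `abort()` without ever touching files that readers of the output directory might already see.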

Posted on 2014-05-02 14:59 by 月下美妞1314