MapReduce Input and Output

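Data is stored in HDFS as files. In a MapReduce program, how does that data get from HDFS to the Mapper? And once the Reducer has processed it, how is the result written back to HDFS? 1. Moving data from HDFS to the Mapper is the job of the InputFormat class. 2. Writing data from the Reducer to HDFS is the job of the OutputFormat class.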

I. Input

InputFormat is an abstract class that declares two abstract methods:

abstract List<InputSplit> getSplits(JobContext context);
abstract RecordReader<K,V> createRecordReader(InputSplit split,TaskAttemptContext context);

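The getSplits method divides the input HDFS files into a number of splits. When the nodes of a Hadoop cluster run a MapReduce job, each map task processes exactly one split, so the split is the smallest unit of work in MapReduce processing.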

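The createRecordReader method creates a RecordReader object, which parses the contents of a split into key-value pairs. While the job runs, the Mapper repeatedly asks the RecordReader for the next key-value pair and passes it to the map function.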
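The read loop described above can be sketched in plain Java without any Hadoop dependency. This is only an illustration: SimpleRecordReader, ListRecordReader, and MapperLoopSketch are hypothetical names invented here, and the reader serves pairs from an in-memory list rather than from a real split. The three method names mirror those on Hadoop's RecordReader in the org.apache.hadoop.mapreduce API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

/** Simplified stand-in for Hadoop's RecordReader; not the real class. */
interface SimpleRecordReader<K, V> {
    boolean nextKeyValue(); // advance to the next pair; false once the split is exhausted
    K getCurrentKey();
    V getCurrentValue();
}

/** Hypothetical reader that serves pairs from an in-memory list instead of a split. */
final class ListRecordReader<K, V> implements SimpleRecordReader<K, V> {
    private final Iterator<Map.Entry<K, V>> it;
    private Map.Entry<K, V> current;

    ListRecordReader(List<Map.Entry<K, V>> pairs) {
        this.it = pairs.iterator();
    }

    @Override
    public boolean nextKeyValue() {
        if (!it.hasNext()) {
            return false;
        }
        current = it.next();
        return true;
    }

    @Override
    public K getCurrentKey() {
        return current.getKey();
    }

    @Override
    public V getCurrentValue() {
        return current.getValue();
    }
}

final class MapperLoopSketch {
    /** Applies the map function to every pair the reader yields and collects the results. */
    static <K, V, R> List<R> runMapper(SimpleRecordReader<K, V> reader, BiFunction<K, V, R> map) {
        List<R> out = new ArrayList<>();
        while (reader.nextKeyValue()) {
            out.add(map.apply(reader.getCurrentKey(), reader.getCurrentValue()));
        }
        return out;
    }

    /** Maps two (offset, line) pairs to their line lengths. */
    static List<Integer> demo() {
        List<Map.Entry<Long, String>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>(0L, "hello world"));
        pairs.add(new SimpleEntry<>(12L, "foo"));
        SimpleRecordReader<Long, String> reader = new ListRecordReader<>(pairs);
        return runMapper(reader, (offset, line) -> line.length());
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The real Mapper follows the same pull pattern: it keeps calling nextKeyValue until the reader reports that the split is exhausted.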

InputFormat is an abstract class with three subclasses: DBInputFormat, DelegatingInputFormat, and FileInputFormat. DBInputFormat reads input from a database, DelegatingInputFormat handles jobs with multiple inputs, and FileInputFormat handles file-based input.

Take FileInputFormat as an example. FileInputFormat is itself an abstract class. It builds on InputFormat by adding a number of file-related methods. It implements getSplits but not createRecordReader, leaving the latter to its subclasses.

The most important part of getSplits is this section:

// generate splits
// This list is the return value.
List<InputSplit> splits = new ArrayList<InputSplit>();
// Get the HDFS file metadata; FileStatus was used in an earlier chapter.
List<FileStatus> files = listStatus(job);
// Process every input file of the job.
for (FileStatus file : files) {
    // Get the file path.
    Path path = file.getPath();
    FileSystem fs = path.getFileSystem(job.getConfiguration());
    // Get the file length.
    long length = file.getLen();
    // Get the locations of the HDFS blocks that hold this file.
    BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
    if ((length != 0) && isSplitable(job, path)) {
        // Get the block size of the file.
        long blockSize = file.getBlockSize();
        // Compute the split size from the block size, minimum size, and maximum size.
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        // Carve off one split of splitSize bytes at a time, look up the block
        // it starts in and the hosts holding that block, and store the split.
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
            splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
                                     blkLocations[blkIndex].getHosts()));
            bytesRemaining -= splitSize;
        }
        // Whatever is left over becomes one final split.
        if (bytesRemaining != 0) {
            splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
                                     blkLocations[blkLocations.length - 1].getHosts()));
        }
    } else if (length != 0) {
        // The file is not empty but cannot be split, so the whole file becomes one split.
        splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
    } else {
        // The file is empty: create a split with an empty host array, mainly for consistency.
        splits.add(new FileSplit(path, 0, length, new String[0]));
    }
}
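To make the arithmetic in this loop concrete, here is a minimal standalone sketch that reproduces only the offset and length calculation, with no HDFS involved. SplitPlanSketch and planSplits are hypothetical names. The body of computeSplitSize (clamping the block size between minSize and maxSize) and the SPLIT_SLOP value of 1.1 reflect my understanding of Hadoop's FileInputFormat and are assumptions worth checking against the version you run.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitPlanSketch {
    // Assumed value: a tail up to 10% larger than splitSize stays in one split.
    static final double SPLIT_SLOP = 1.1;

    /** Clamps the block size into the range [minSize, maxSize]. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Returns {offset, length} pairs covering a splittable file; splitSize must be positive. */
    static List<long[]> planSplits(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        // The leftover bytes, if any, form the last split.
        if (bytesRemaining != 0) {
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1, Long.MAX_VALUE);
        for (long[] s : planSplits(300 * mb, splitSize)) {
            System.out.println("offset=" + s[0] / mb + "MB length=" + s[1] / mb + "MB");
        }
    }
}
```

With a 128 MB split size, a 300 MB file yields splits of 128 MB, 128 MB, and 44 MB, while a 140 MB file stays in a single split because 140/128 is below the slop factor.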

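FileInputFormat has five subclasses: CombineFileInputFormat, KeyValueTextInputFormat, NLineInputFormat, SequenceFileInputFormat, and TextInputFormat. Some of these are abstract and some are concrete. Take TextInputFormat as an example. It implements createRecordReader very simply: the method body is a single statement that returns a LineRecordReader. LineRecordReader extends the abstract class RecordReader, whose methods are all abstract. Each time it is called, LineRecordReader reads one line of text from an InputSplit and builds a key-value pair, with the line's offset in the file as the key and the text of the line as the value, which is then handed to the Mapper.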
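The offset-as-key convention is easy to see in a small standalone sketch. LineOffsetSketch and readLines are hypothetical names, and this version deliberately leaves out things the real LineRecordReader has to deal with, such as split boundaries, carriage returns, and compressed input. It assumes UTF-8 text with '\n' line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class LineOffsetSketch {
    /** Returns one (byte offset, line text) pair per line in the buffer. */
    static List<Map.Entry<Long, String>> readLines(byte[] data) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        int start = 0;
        while (start < data.length) {
            // Scan forward to the newline that ends this line, or to the end of the data.
            int end = start;
            while (end < data.length && data[end] != '\n') {
                end++;
            }
            String line = new String(data, start, end - start, StandardCharsets.UTF_8);
            records.add(new SimpleImmutableEntry<>((long) start, line));
            start = end + 1; // step over the newline byte
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "hello\nworld\n".getBytes(StandardCharsets.UTF_8);
        for (Map.Entry<Long, String> r : readLines(data)) {
            System.out.println(r.getKey() + "\t" + r.getValue());
        }
    }
}
```

Note that the key is a byte offset, not a line number, so the keys for "hello" and "world" here are 0 and 6.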

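II. Output

The OutputFormat class writes key-value pairs to a storage structure. Reducers use it to store the final results, and in a map-only job (one with zero reducers) the Mapper's output goes through it as well. In a job that does have reducers, the intermediate map output is handled by the framework's shuffle and does not pass through OutputFormat. OutputFormat is an abstract class that declares three abstract methods: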

public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context);
public abstract void checkOutputSpecs(JobContext context);
public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context);

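The most important of these is getRecordWriter, which returns a RecordWriter, the object responsible for writing key-value pairs to storage. The checkOutputSpecs method checks whether the output settings are valid. Typically it checks whether the output directory exists and reports an error if it already does. The getOutputCommitter method returns an OutputCommitter, which does the housekeeping: setting up temporary files, managing the job's temporary directories and files, and cleaning them up after the job completes.

OutputFormat has four subclasses: DBOutputFormat, FileOutputFormat, FilterOutputFormat, and NullOutputFormat. As the names suggest, DBOutputFormat writes key-value pairs to a database and FileOutputFormat writes them to a file system. FilterOutputFormat wraps another OutputFormat, much like the Filter classes in Java's stream library. NullOutputFormat sends key-value pairs to the equivalent of /dev/null, that is, it discards them.

Take FileOutputFormat as an example. FileOutputFormat is an abstract class with two subclasses, SequenceFileOutputFormat and TextOutputFormat. SequenceFileOutputFormat writes key-value pairs to a sequence file in HDFS. TextOutputFormat writes the data to a text file in HDFS.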
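As a rough illustration of two of these responsibilities, the sketch below mimics the output-directory check and a text-style record writer using only the Java standard library. TextOutputSketch and its methods are hypothetical names, not Hadoop classes. The tab between key and value is my understanding of TextOutputFormat's default separator, and the unchecked IllegalStateException stands in for whatever exception the real checkOutputSpecs throws. Both are assumptions.

```java
import java.io.File;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class TextOutputSketch {
    /** Mirrors the idea of checkOutputSpecs: refuse to run if the output directory exists. */
    static void checkOutputSpecs(File outputDir) {
        if (outputDir.exists()) {
            throw new IllegalStateException("Output directory " + outputDir + " already exists");
        }
    }

    /** Writes one record as key, tab, value, newline (assumed default text layout). */
    static void write(Writer out, Object key, Object value) {
        try {
            out.write(key + "\t" + value + "\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes two sample records into an in-memory buffer and returns the text. */
    static String demo() {
        StringWriter out = new StringWriter();
        write(out, "apple", 3);
        write(out, "pear", 1);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Failing fast when the output directory already exists protects the results of an earlier job from being overwritten by accident.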

posted @ 2013-12-26 17:27 hadoop在云端
