Hadoop Administration (5): Commonly Used InputFormats for MapReduce Types

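In MapReduce, the Mapper's output types are the same as the Reducer's input types, since both describe the intermediate results, while the Reducer's output is the final result we want. The InputFormat, together with the key and value types that go with it, therefore has to be set appropriately for a Job to run correctly. The relationship between keys and values through a Job can be summarized as follows: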

map: <k1,v1> -> list(k2,v2)
combine: <k2,list(v2)> -> list(k2,v2)
reduce: <k2,list(v2)> -> list(k3,v3)
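To make the type flow concrete, here is a minimal in-memory sketch in plain Java. It is not Hadoop code: the class name TypeFlowSketch and its methods are invented for illustration, and the grouping step only imitates what the framework does between map and reduce. It uses word counting, with k1 = Long (offset), v1 = String (line), k2 = k3 = String (word) and v2 = v3 = Integer (count).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration of the key/value types only; all names here are hypothetical.
class TypeFlowSketch {

    // map: <k1,v1> -> list(k2,v2), here <offset, line> -> list(word, 1)
    static List<Map.Entry<String, Integer>> map(Long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Stand-in for the shuffle: turns list(k2,v2) into <k2,list(v2)>, sorted by key.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // reduce: <k2,list(v2)> -> list(k3,v3), here <word, [1, 1, ...]> -> (word, count)
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return new SimpleEntry<>(word, sum);
    }

    // Runs map over every line, groups the intermediate pairs, then reduces each group.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            intermediate.addAll(map(offset, line));
            offset += line.length() + 1; // +1 for the line terminator
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : group(intermediate).entrySet()) {
            Map.Entry<String, Integer> r = reduce(e.getKey(), e.getValue());
            result.put(r.getKey(), r.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c")));
    }
}
```

Note that the output types of map and the input types of reduce must line up (String and Integer here), which is exactly the constraint described above.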

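It may look odd that the intermediate and final data types have to be specified explicitly. The reason is a limitation of Java generics: because of type erasure, type information is not always available at run time, so Hadoop has to be told the types explicitly.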
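Type erasure is easy to observe in plain Java. The sketch below (the class name ErasureDemo is made up for this example) shows that two lists with different type parameters share the same runtime class, which is why a framework cannot recover the parameter from the object alone and needs a Class object to be passed in.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: generic type parameters are erased at run time.
class ErasureDemo {

    // Returns true because both lists are plain ArrayList instances at run time.
    static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<>();
        List<Integer> numbers = new ArrayList<>();
        return strings.getClass() == numbers.getClass();
    }

    public static void main(String[] args) {
        System.out.println(sameRuntimeClass()); // prints "true"
    }
}
```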

Figure: the InputFormat class hierarchy.

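One more point: the default of a single reducer is easy for newcomers to work with, but in real applications the number of reducers is usually set much higher, otherwise the job runs very inefficiently. The upper bound depends on the cluster's reduce capacity, which is the number of nodes multiplied by the number of reduce slots per node. The per-node value is set by mapred.tasktracker.reduce.tasks.maximum.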

The following sections introduce some commonly used InputFormats and how to use them.

FileInputFormat

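FileInputFormat is the base class of every InputFormat that uses files as its data source. It provides two things: a way to define which files are included in a job's input, and an implementation that generates splits for the input files. Input is split automatically, and the split size depends on mapred.min.split.size and mapred.max.split.size (set in mapred-site.xml) and on the block size. The split size is given by the following formula: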

split size = max(minimumSize, min(maximumSize, blockSize))
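The formula can be checked with a few lines of plain Java. This is only a sketch of the arithmetic: the class name SplitSizeSketch and the sample sizes are assumptions chosen for illustration, not values read from a real cluster.

```java
// Hypothetical helper that evaluates the split-size formula on sample values.
class SplitSizeSketch {

    static final long MB = 1024L * 1024L;

    // split size = max(minimumSize, min(maximumSize, blockSize))
    static long computeSplitSize(long minimumSize, long maximumSize, long blockSize) {
        return Math.max(minimumSize, Math.min(maximumSize, blockSize));
    }

    public static void main(String[] args) {
        long block = 64 * MB; // assumed block size for the example

        // With a tiny minimum and an unbounded maximum, the block size wins.
        System.out.println(computeSplitSize(1, Long.MAX_VALUE, block) / MB);        // prints "64"
        // A minimum above the block size forces larger splits.
        System.out.println(computeSplitSize(128 * MB, Long.MAX_VALUE, block) / MB); // prints "128"
        // A maximum below the block size forces smaller splits.
        System.out.println(computeSplitSize(1, 32 * MB, block) / MB);               // prints "32"
    }
}
```

With the usual settings the minimum is small and the maximum is large, so the split size simply equals the block size.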

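To prevent a file from being split, use either of the following two approaches; the second is recommended.

1) Set the minimum split size larger than the size of the file.

2) Subclass FileInputFormat and override the isSplitable method to return false: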

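import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// TextInputFormat variant (old mapred API) that never splits a file,
// so each input file is processed whole by a single mapper.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}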

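CombineFileInputFormat

CombineFileInputFormat is intended for jobs whose input consists of a large number of small files: it packs several files into each split.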
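The idea of packing many small files into fewer splits can be sketched as a simple greedy grouping. This is a deliberately simplified illustration, not how CombineFileInputFormat is implemented (the real class also takes node and rack locality into account); the class name SmallFilePackingSketch, the method packFiles and the sample sizes are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified sketch: group file sizes so that each group stays within a size limit.
class SmallFilePackingSketch {

    // Greedily appends files to the current group until the next file would
    // push the group over maxSplitSize, then starts a new group.
    static List<List<Long>> packFiles(List<Long> fileSizes, long maxSplitSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentSize + size > maxSplitSize) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Five small files packed into groups of at most 60 units each.
        System.out.println(packFiles(Arrays.asList(10L, 20L, 30L, 50L, 5L), 60));
        // prints "[[10, 20, 30], [50, 5]]"
    }
}
```

In this sketch five files end up in two groups, so two mappers would be started instead of five.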

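TextInputFormat (LongWritable, Text: byte offset, contents of each line)

This is the default InputFormat. The key is the byte offset of the line within the source file, and the value is the contents of the line, excluding any line terminator (such as a newline or carriage return). For example: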

On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.

is represented as the following key-value pairs:

<0, On the top of the Crumpetty Tree>
<33, The Quangle Wangle sat,>
<57, But his face you could not see,>
<89, On account of his Beaver Hat.>
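The keys are byte offsets, not line numbers, which can be confirmed by computing them by hand. The following plain-Java sketch (the class name LineOffsetSketch is invented, and a single '\n' terminator per line is assumed) reproduces the offsets for the four lines of the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes the byte offset at which each line starts.
class LineOffsetSketch {

    // Assumes lines are terminated by a single '\n' byte.
    static List<Long> lineOffsets(String text) {
        List<Long> offsets = new ArrayList<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            offsets.add(offset);
            // Advance by the line's byte length plus one byte for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "On the top of the Crumpetty Tree\n"
                + "The Quangle Wangle sat,\n"
                + "But his face you could not see,\n"
                + "On account of his Beaver Hat.\n";
        System.out.println(lineOffsets(text)); // prints "[0, 33, 57, 89]"
    }
}
```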

KeyValueTextInputFormat

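Use this format when every line of the file is a key-value pair separated by some delimiter, such as a Tab. The output produced by Hadoop's default OutputFormat, for example, consists of one Tab-separated key-value pair per line.

The separator can be specified with the key.value.separator.in.input.line property; the default is a Tab character. (Note: → stands for a Tab character.)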

line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.

is represented as the following key-value pairs:

<line1, On the top of the Crumpetty Tree>
<line2, The Quangle Wangle sat,>
<line3, But his face you could not see,>
<line4, On account of his Beaver Hat.>
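The splitting rule can be sketched in plain Java: the line is cut at the first occurrence of the separator, and if the separator does not occur at all, the whole line becomes the key and the value is empty. The class name KeyValueSplitSketch and the method splitKeyValue are made up for this illustration and are not part of Hadoop.

```java
// Hypothetical helper mirroring how a line is divided into key and value.
class KeyValueSplitSketch {

    // Splits at the FIRST separator only; later separators stay in the value.
    // Returns a two-element array: { key, value }.
    static String[] splitKeyValue(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // No separator: the entire line is the key, the value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitKeyValue("line1\tOn the top of the Crumpetty Tree", '\t');
        System.out.println("<" + kv[0] + ", " + kv[1] + ">");
        // prints "<line1, On the top of the Crumpetty Tree>"
    }
}
```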

NLineInputFormat

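Splits the source file by number of lines. N is the number of input lines each mapper receives, and it can be set with mapred.line.input.format.linespermap. For example, given the input: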

On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.

If N is 2, one mapper receives the first two lines as key-value pairs:

<0, On the top of the Crumpetty Tree>
<33, The Quangle Wangle sat,>

Another mapper receives the last two lines:

<57, But his face you could not see,>
<89, On account of his Beaver Hat.>
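The grouping itself amounts to cutting the list of lines into chunks of N. The sketch below shows that step in plain Java; the class name NLineSketch and the method groupLines are hypothetical and only illustrate how lines are assigned to mappers, not how Hadoop reads the file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: assigns every N consecutive lines to one group (one mapper).
class NLineSketch {

    static List<List<String>> groupLines(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            // The last group may hold fewer than n lines.
            int end = Math.min(i + n, lines.size());
            groups.add(new ArrayList<>(lines.subList(i, end)));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "On the top of the Crumpetty Tree",
                "The Quangle Wangle sat,",
                "But his face you could not see,",
                "On account of his Beaver Hat.");
        System.out.println(groupLines(lines, 2).size()); // prints "2"
    }
}
```

With N set to 3 instead, the same input would yield one group of three lines and a final group containing only the last line.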

Posted 2012-07-25 20:15 by hanyuanbo
