Hadoop MapReduce Input and Output Classes

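The default input format, TextInputFormat:

1) TextInputFormat is the default InputFormat for input files.

2) Each line of the file is one record. The key is the byte offset at which the line starts in the file, and the value is the content of the line.

3) By default, a record ends at a line feed (\n) or a carriage return (\r).

4) TextInputFormat extends FileInputFormat.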
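To make the key/value layout above concrete, here is a minimal plain-Java sketch that mimics the records a mapper would see. It does not use the Hadoop API: the class name `TextRecordSketch`, its `Record` type, and the assumption that the input is UTF-8 are all made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: imitates the (offset, line)
// records described for TextInputFormat.
class TextRecordSketch {

    // One record: key = byte offset of the line start, value = line content.
    static final class Record {
        final long offset;
        final String line;

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    static List<Record> readRecords(byte[] data) {
        List<Record> records = new ArrayList<>();
        int start = 0;
        int i = 0;
        while (i < data.length) {
            byte b = data[i];
            if (b == '\n' || b == '\r') {
                records.add(new Record(start,
                        new String(data, start, i - start, StandardCharsets.UTF_8)));
                // Treat "\r\n" as a single line terminator.
                if (b == '\r' && i + 1 < data.length && data[i + 1] == '\n') {
                    i++;
                }
                i++;
                start = i;
            } else {
                i++;
            }
        }
        // Last line without a trailing terminator.
        if (start < data.length) {
            records.add(new Record(start,
                    new String(data, start, data.length - start, StandardCharsets.UTF_8)));
        }
        return records;
    }
}
```

For the input "hello\nhadoop\r\nmr" this yields three records whose keys are the byte offsets 0, 6 and 14, which is why the keys of consecutive lines are not simply 0, 1, 2.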

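Input classes that ship with Hadoop:

1) CombineFileInputFormat:

Hadoop is better suited to processing a small number of large files than a large number of small files.

CombineFileInputFormat mitigates this problem. It is designed for small files and packs several of them into each split.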

2) KeyValueTextInputFormat:

When each line of the input consists of two columns separated by a tab, KeyValueTextInputFormat is a good fit for this kind of file.

3) NLineInputFormat:

NLineInputFormat lets you control the number of lines in each split.

4) SequenceFileInputFormat:

When the input files are in SequenceFile format, use SequenceFileInputFormat as the input format.
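As an illustration of the tab-separated case handled by KeyValueTextInputFormat above, the following plain-Java sketch splits a line at the first separator. It is not Hadoop code: `KeyValueLineSketch` and `split` are hypothetical names, and the behaviour for a line without a separator (whole line as key, empty value) is written here as an assumption of the sketch.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: splits one input line into
// a key and a value at the first occurrence of the separator.
class KeyValueLineSketch {

    static Map.Entry<String, String> split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // Assumption of this sketch: no separator means the whole
            // line is the key and the value is empty.
            return new SimpleEntry<>(line, "");
        }
        // Everything after the first separator stays in the value,
        // including any further separators.
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }
}
```

Calling `KeyValueLineSketch.split("user1\t42", '\t')` gives the key `user1` and the value `42`, so the mapper receives a meaningful key instead of a byte offset.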

Writing a custom input format:

1) Extend the FileInputFormat base class.

2) Override its isSplitable(FileSystem fs, Path fileName) method.

3) Override the getRecordReader() method.

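public interface InputFormat<K, V> {
    // Logically splits the job's input into InputSplits.
    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

    // Returns the RecordReader that reads the records of the given split.
    RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;
}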
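The interplay between isSplitable and getSplits can be pictured with a simplified plain-Java model. This is only a sketch under stated assumptions: `SplitSketch`, `Range` and the `splitSize` parameter are hypothetical, and the real FileInputFormat logic takes more into account (block locations, minimum split size, and so on).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not the Hadoop implementation: cuts a file of a
// given length into byte ranges, the way getSplits conceptually does.
class SplitSketch {

    // A byte range of the file: [start, start + length).
    static final class Range {
        final long start;
        final long length;

        Range(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    static List<Range> getSplits(long fileLength, long splitSize, boolean splitable) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Range> splits = new ArrayList<>();
        if (fileLength <= 0) {
            return splits; // this sketch produces no split for an empty file
        }
        if (!splitable) {
            // A non-splittable file is processed as one split by one mapper.
            splits.add(new Range(0, fileLength));
            return splits;
        }
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new Range(start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }
}
```

With a 250-byte file and a split size of 100, the sketch returns three ranges when the file is splittable and a single 250-byte range when it is not, which is the effect of returning false from isSplitable.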

Hadoop output classes

1) TextOutputFormat:

The default output format. The key and the value are separated by a tab.

2) SequenceFileOutputFormat:

Writes the keys and values in SequenceFile format.

3) SequenceFileAsBinaryOutputFormat:

Writes the keys and values in raw binary format.

4) MapFileOutputFormat:

Writes the keys and values to a MapFile. Because the keys in a MapFile are sorted, the records must be written in key order.

5) MultipleOutputFormat:

By default a reducer produces a single output, but sometimes we want one reducer to produce several outputs. MultipleOutputFormat and MultipleOutputs make this possible.
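The idea of one reducer writing to several outputs can be sketched in plain Java as follows. This is not the MultipleOutputs API: `MultiOutputSketch`, `write` and `lines` are hypothetical names, and the outputs are kept in memory instead of in files. Each record is formatted as key, tab, value, as in the TextOutputFormat description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model, not the Hadoop API: routes records from a single
// reducer to several named outputs.
class MultiOutputSketch {

    // Named output -> lines written to it so far.
    private final Map<String, List<String>> outputs = new TreeMap<>();

    // Appends one record to the chosen output as "key<TAB>value".
    void write(String namedOutput, String key, String value) {
        outputs.computeIfAbsent(namedOutput, name -> new ArrayList<>())
               .add(key + "\t" + value);
    }

    // Returns a copy of the lines written to an output (empty if unused).
    List<String> lines(String namedOutput) {
        return new ArrayList<>(outputs.getOrDefault(namedOutput, new ArrayList<>()));
    }
}
```

A reducer using this model could call `write("errors", key, value)` for some records and `write("stats", key, value)` for others, and each named output would end up with only its own lines.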

posted on 2013-04-27 09:09 by 北京_飞狐