Hadoop: Custom Combiner

A combine operation is applied to the map output. It reduces network transfer overhead and lightens the load on the reduce tasks.

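In MapReduce, when the map phase produces a large volume of data, network bandwidth becomes the bottleneck. How can the data passed to reduce be trimmed down without affecting the final result? One approach is to use a Combiner. A Combiner is often described as a local reduce: the final input to reduce is the Combiner's output.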

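The combine operation runs on each node and affects only the local map output. Its input is the local map output (it is usually applied before the data is spilled to disk, which also reduces I/O overhead), and its output becomes the input to reduce.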
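The effect described above can be sketched in plain Java without a cluster. This is only an illustration under stated assumptions, not Hadoop code: the class `CombinerSketch`, its methods, and the two input splits are hypothetical, and each split stands in for the output of one map task in a word count.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {

    // Hypothetical map task: emits one <word, 1> pair per word in its input split.
    public static List<Map.Entry<String, Long>> map(String split) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1L));
        }
        return out;
    }

    // Sums the values of each key. Used both as the local combine step
    // and as the final reduce step.
    public static Map<String, Long> sumByKey(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> sums = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return sums;
    }

    // Collects the records that would be sent over the network to reduce,
    // optionally combining each map task's output locally first.
    public static List<Map.Entry<String, Long>> shuffle(String[] splits, boolean useCombiner) {
        List<Map.Entry<String, Long>> shuffled = new ArrayList<>();
        for (String split : splits) {
            List<Map.Entry<String, Long>> mapOutput = map(split);
            if (useCombiner) {
                shuffled.addAll(sumByKey(mapOutput).entrySet());
            } else {
                shuffled.addAll(mapOutput);
            }
        }
        return shuffled;
    }

    public static void main(String[] args) {
        String[] splits = {"a b a a", "b b a"};
        List<Map.Entry<String, Long>> plain = shuffle(splits, false);
        List<Map.Entry<String, Long>> combined = shuffle(splits, true);
        System.out.println("records shuffled without combiner: " + plain.size());
        System.out.println("records shuffled with combiner:    " + combined.size());
        System.out.println("reduce result without combiner: " + sumByKey(plain));
        System.out.println("reduce result with combiner:    " + sumByKey(combined));
    }
}
```

With these two splits, 7 records are shuffled without the combine step and 4 with it, while the reduce result is `{a=4, b=3}` in both cases.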

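In many cases the combine logic is identical to the reduce logic, so the two can share the same Reducer class. In that case, just add one line to the client (driver) code, after setting the Map class and before setting the Reduce class: job.setCombinerClass(MyReducer.class);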
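Reusing the reducer as the combiner only works when the operation gives the same answer on partial results as on the full data. Hadoop may run the combiner zero, one, or several times per key, so the final result must not depend on it. The sketch below uses plain Java rather than the Hadoop API; the class `CombinerSafety` and the sample values are hypothetical. It contrasts a sum, which is safe to combine, with an average, which is not.

```java
public class CombinerSafety {

    // Sum is associative and commutative: summing partial sums
    // gives the same result as summing all values at once.
    public static double sum(double... values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    // Mean is not: the mean of partial means generally differs
    // from the mean of all values.
    public static double mean(double... values) {
        return sum(values) / values.length;
    }

    public static void main(String[] args) {
        // Hypothetical values for one key: {1, 2, 3} from one map task, {10} from another.
        System.out.println("sum of all values:    " + sum(1, 2, 3, 10));
        System.out.println("sum of partial sums:  " + sum(sum(1, 2, 3), sum(10)));
        System.out.println("mean of all values:   " + mean(1, 2, 3, 10));
        System.out.println("mean of partial means: " + mean(mean(1, 2, 3), mean(10)));
    }
}
```

Here both sums are 16.0, but the mean of all values is 4.0 while the mean of the partial means is 6.0. For a case like the average, one common workaround is a separate combiner that emits partial sums and counts, leaving the division to the reducer.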

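import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Static nested class; declare it inside the enclosing job class.
static class MyCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text k2, Iterable<LongWritable> v2s, Context ctx) throws IOException, InterruptedException {
        // Printed once per reduce() call, i.e. once per k2 group
        System.out.println("Combiner input group <" + k2.toString() + ",...>");
        long times = 0L;
        for (LongWritable count : v2s) {
            times += count.get();
            // Printed once per input <k2,v2> pair
            System.out.println("Combiner input pair <" + k2.toString() + "," + count.get() + ">");
        }
        ctx.write(k2, new LongWritable(times));
        // Printed once per output <k2,v2> pair
        System.out.println("Combiner output pair <" + k2.toString() + "," + times + ">");
    }
}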

posted on 2014-11-12 22:24 转折点人生