The Combiner Component in MapReduce
Every map task can produce a large amount of local output. The Combiner's job is to merge the map-side output once before it leaves the node, which reduces the volume of data transferred between the map and reduce nodes and improves network I/O; it is one of MapReduce's optimization techniques.
- The Combiner is a component of an MR program in addition to the Mapper and the Reducer.
- The parent class of a Combiner is Reducer.
- The difference between a Combiner and a Reducer is where they run:
  - The Combiner runs on the node of each individual map task.
  - The Reducer receives the output of all Mappers globally.
- The point of the Combiner is to perform a local aggregation of each map task's output, shrinking the amount of data sent over the network.
- Implementation steps (see the sketch after this list):
  - Write a custom Combiner that extends Reducer and overrides the reduce method.
  - Register it on the job: job.setCombinerClass(CustomCombiner.class)
- A Combiner may only be used when it does not affect the final business logic, and its output key/value types must match the Reducer's input key/value types.
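As a concrete illustration of those two steps, here is a minimal sketch of a standalone Combiner. The class name CustomCombiner and the Text/LongWritable types are assumptions made for this sketch; the full word-count example later in this post simply reuses its Reducer as the Combiner.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A Combiner is just a Reducer that runs on the map side:
// it pre-aggregates the <word, 1> pairs produced by a single map task.
public class CustomCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        // The output key/value types must match the Reducer's input key/value types.
        context.write(key, new LongWritable(sum));
    }
}

// Registered on the job with:
// job.setCombinerClass(CustomCombiner.class);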
Summary:
Combiner:
- A local aggregation component: it performs a partial aggregation over the output of each map task as an optimization.
- After local aggregation, less data crosses the network between map and reduce, cutting network I/O.
- If a partial merge would change the final result, a Combiner must not be used.
- The Combiner is essentially a reduce whose scope is limited to the output of one map task rather than the global output.
- The Combiner is enabled by setting it on the job:
  - job.setCombinerClass(WordCountReducer.class)
- At its core, the Combiner merges keys locally, giving each map task a local, reduce-like step.
- Without a Combiner, all aggregation is left to the reduce phase and is relatively inefficient; with a Combiner, map tasks that finish first aggregate their output locally, which speeds the job up.
- Note: the Combiner's output is the Reducer's input. Because the Combiner is pluggable, adding it must never change the final result. A Combiner should therefore only be used when the Reducer's input key/value types are identical to its output key/value types and the result is unaffected, for example sums and maximums (a worked example follows this list).
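A quick worked example of that caveat: suppose the goal were an average rather than a sum. One map task sees the values 1, 2 and 3, another sees 10. The true average is (1 + 2 + 3 + 10) / 4 = 4, but if a Combiner averaged locally the Reducer would receive 2 and 10 and compute (2 + 10) / 2 = 6. Averaging therefore cannot reuse the Reducer as its Combiner, whereas sums, counts, maximums and minimums can.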
Code example
Sample data:
hello tom
hello kittty
hello jerry
hello cat
hello tom
The WordCountMapper class
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each call receives one line of input text; split it on spaces.
        String line = value.toString();
        String[] words = line.split(" ");
        // Emit <word, 1> for every word on the line.
        for (String word : words) {
            context.write(new Text(word), new LongWritable(1));
        }
    }
}
The WordCountReducer class
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all of the counts received for this word.
        long sum = 0;
        for (LongWritable count : values) {
            sum += count.get();
        }
        context.write(key, new LongWritable(sum));
    }
}
The MRClient class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MRClient {

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // The class whose jar contains this job's entry point.
        job.setJarByClass(MRClient.class);

        // Map phase.
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, "c:/data.txt");

        // Reduce phase.
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("c:/out"));

        // Reuse the Reducer as the Combiner: summation is safe to apply locally.
        job.setCombinerClass(WordCountReducer.class);

        // Submit the job; true means progress is printed to the console.
        job.waitForCompletion(true);
    }
}
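Assuming the three classes are packaged into a jar, here called wordcount.jar purely for illustration, the job could be submitted with the standard hadoop jar command:

hadoop jar wordcount.jar MRClient

With the local c:/ paths hardcoded above the job runs in Hadoop's local mode; on a real cluster the input and output would normally be HDFS paths, typically passed in through args rather than hardcoded.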
To enable the Combiner, all that is needed when assembling the job is to add the following line:
// Set the Combiner
job.setCombinerClass(WordCountReducer.class);
Adding the Combiner does not change the final counts, but its effect can be seen in the counters printed in the job log:
Job log without the Combiner: (screenshot in the original post)
Job log with the Combiner: (screenshot in the original post)
Here there is only one input file and it is smaller than 128 MB, so a single map task processes it. After the Combiner runs, the records drop from 3519 to 6 (the sample contains only 6 distinct words), so in this case the Combiner greatly reduces the amount of data that has to be transferred.
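If you want to reproduce the comparison yourself, the figures to look at are the Map-Reduce Framework counters Hadoop prints when the job finishes, in particular Combine input records, Combine output records and Reduce input records (standard Hadoop counter names; the exact values depend on your input). With the Combiner enabled and a single map task, Reduce input records drops to the number of distinct words.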