MapReduce之单词计数

最近在看google那篇经典的MapReduce论文，中文版可以参考孟岩推荐的 mapreduce 中文版中文翻译

论文中提到，MapReduce的编程模型就是：

计算利用一个输入key/value对集,来产生一个输出key/value对集.MapReduce库的用户用两个函数表达这个计算:map和reduce.

用户自定义的map函数,接受一个输入对,然后产生一个中间key/value对集.MapReduce库把所有具有相同中间key I的中间value聚合在一起,然后把它们传递给reduce函数.

用户自定义的reduce函数,接受一个中间key I和相关的一个value集.它合并这些value,形成一个比较小的value集.一般的,每次reduce调用只产生0或1个输出value.通过一个迭代器把中间value提供给用户自定义的reduce函数.这样可以使我们根据内存来控制value列表的大小.

那么研究MapReduce，一般是从hadoop开始，研究编程语言，一般从helloworld开始，那么我们研究hadoop，就先从官方实例wordcount开始。

按照上面提到的编程模型：

用户自定义的map函数,接受一个输入对,然后产生一个中间key/value对集.MapReduce库把所有具有相同中间key I的中间value聚合在一起,然后把它们传递给reduce函数.

那么对于单词计数这个程序来说：

map函数对输入的文本进行分词处理，然后输出（单词， 1）这样的结果，例如“You are a young man”，输出的就是（you， 1），（are， 1）之类的结果

代码如下：

class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

上面提到map函数的输入也是k-v堆，从模板参数中可以看出。这个map函数的输入K-V类型为 <Object, Text>

而map函数的输出类型为<Text, IntWritable>，而这恰好就是reduce函数的输入类型

reduce函数：

用户自定义的reduce函数,接受一个中间key I和相关的一个value集.它合并这些value,形成一个比较小的value集.一般的,每次reduce调用只产生0或1个输出value.通过一个迭代器把中间value提供给用户自定义的reduce函数.这样可以使我们根据内存来控制value列表的大小.

在单词计数中，我们把具有相同key的结果聚合起来：

class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values){
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

reduce函数的第二个参数类型为Iterable<IntWritable>，这是一堆value的集合，他们具有相同的key，reduce函数的意义就是将这些结果聚合起来。

例如（”hello“， 1）和（”hello“， 1）聚合为（”hello“， 2），后者可能再次和（”hello“， 3）（”hello“， 1），聚合为（”hello“， 7）

可以通过控制values的大小，防止内存溢出，合理使用内存。

reduce函数的结果存储到磁盘上，就是我们最终的结果。

完整的代码为：

package com.zhihu;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;

/**
 * Created by guochunyang on 15/9/22.
 */
public class WordCount {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("in"));
        FileOutputFormat.setOutputPath(job, new Path("out"));

        job.waitForCompletion(true);
    }
}

class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values){
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

posted on 2016-03-01 21:13 inevermore 阅读(4164) 评论(0) 收藏举报