Extending the Inverted Index: Some Explorations

Starting from an inverted index example, this article works step by step through a number of Hadoop MapReduce operations.

Inverted Index

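Introductions to the inverted index and sample programs are easy to find online, so this article does not explain it in detail. It does give the code, though, because that code shows the basic operations of Hadoop MapReduce, and the rest of the article extends it to get familiar with MapReduce program design.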

Map

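The map step reads each line of a text file and combines every word with the file name into a single key of the form word#docId. The code is as follows: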

static class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            FileSplit fileSplit = (FileSplit) context.getInputSplit();
            Path path = fileSplit.getPath();
            String fileName = path.getName(); // docId
            // Strip the extension; keep the full name when it contains no '.'
            int dot = fileName.indexOf('.');
            if (dot >= 0) {
                fileName = fileName.substring(0, dot);
            }
            String[] strs = value.toString().split(" ");
            for (String s : strs) {
                if (s.isEmpty()) {
                    continue; // consecutive spaces yield empty tokens
                }
                String newKey = s + "#" + fileName;
                context.write(new Text(newKey), one);
            }
        }
    }
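To make the key layout concrete, here is a minimal plain-Java sketch of the keys produced for one line, written without any Hadoop types so it can be run on its own. The class name CompositeKeySketch and the method keysForLine are hypothetical and exist only for illustration. The sketch skips empty tokens and tolerates file names without a dot.

```java
import java.util.ArrayList;
import java.util.List;

class CompositeKeySketch {
    // Returns the "word#docId" keys that would be emitted for one line of a file.
    static List<String> keysForLine(String line, String fileName) {
        // Strip the extension to obtain the docId; keep the whole name if there is no dot.
        int dot = fileName.indexOf('.');
        String docId = dot >= 0 ? fileName.substring(0, dot) : fileName;
        List<String> keys = new ArrayList<>();
        for (String token : line.split(" ")) {
            if (token.isEmpty()) {
                continue; // consecutive spaces produce empty tokens
            }
            keys.add(token + "#" + docId);
        }
        return keys;
    }

    public static void main(String[] args) {
        // In the real job each of these keys is written together with the count 1.
        System.out.println(keysForLine("hello world hello", "doc1.txt"));
    }
}
```

Note that the same key can appear more than once per line; merging those duplicates is left to the later stages.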

Combiner

The combiner merges the key-value pairs produced on the map side so that less data crosses the network. The code is as follows:

static class InvertedIndexCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
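The effect of the combiner can be sketched in plain Java as a local sum per key. This is only an illustration under simplifying assumptions: CombinerSketch and combine are hypothetical names, and a LinkedHashMap stands in for the map-side buffer. Summation is safe to apply in a combiner because it is associative and commutative; Hadoop may run the combiner zero or more times, so the result must not depend on how often it runs.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CombinerSketch {
    // Sums the counts of identical "word#docId" keys produced by a single map task.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String key : mapOutputKeys) {
            // Every map output record carries the value 1.
            combined.merge(key, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("hello#doc1", "world#doc1", "hello#doc1");
        // Three records shrink to two before they leave the map side.
        System.out.println(combine(keys));
    }
}
```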

Partitioner

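Because our key is made of the word and the file name, the default partitioner hashes the whole key and can send the same word to different reducers. We therefore define our own Partitioner that partitions on the word alone. The code is as follows: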

static class InvertedIndexPartitioner extends HashPartitioner<Text, IntWritable> {
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            String word = key.toString().split("#")[0];
            return super.getPartition(new Text(word), value, numReduceTasks);
        }
    }
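The following plain-Java sketch illustrates why partitioning on the word part keeps all entries of a word together. It applies the formula used by Hadoop's HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks, to String hash codes. That substitution is an assumption made so the sketch runs without Hadoop: the real job hashes Text keys, so the actual partition numbers will differ. PartitionSketch and its methods are hypothetical names.

```java
class PartitionSketch {
    // HashPartitioner's formula, applied here to a String hash code.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Partition on the word part only, so every "word#docId" key of a word lands together.
    static int partitionByWord(String compositeKey, int numReduceTasks) {
        String word = compositeKey.split("#")[0];
        return partition(word, numReduceTasks);
    }

    public static void main(String[] args) {
        int reducers = 5;
        for (String key : new String[] {"hello#doc1", "hello#doc2", "hello#doc3"}) {
            // The full-key partition may vary per document; the word-only partition cannot.
            System.out.println(key + " full-key=" + partition(key, reducers)
                    + " word-only=" + partitionByWord(key, reducers));
        }
    }
}
```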

Reducer

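In this example the inverted index also records the average number of occurrences of each word, where average = total occurrences of the word across documents / number of documents (in the code, the documents that contain the word). Each output record has the form word<TAB>average,doc1:tf;doc2:tf... The code is as follows: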

static class InvertedIndexReducer extends Reducer<Text, IntWritable, Text, Text> {
        private String currentWord = null; // word whose postings are being accumulated
        private double total = 0;          // occurrences of currentWord summed over documents
        private int docNum = 0;            // number of documents containing currentWord
        private StringBuilder posting = new StringBuilder();

        // Builds "average,doc1:tf;doc2:tf" for currentWord and resets the accumulators.
        private String help() {
            double avg = total / docNum;
            posting.deleteCharAt(posting.length() - 1); // drop the trailing ';'
            String post = avg + "," + posting.toString();
            total = 0;
            docNum = 0;
            posting = new StringBuilder();
            return post;
        }

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            String[] parts = key.toString().split("#");
            String word = parts[0];
            String docId = parts[1];
            int sum = 0;
            for (IntWritable v : values)
                sum += v.get();
            // Keys arrive sorted, so a new word means the previous word is complete.
            // Emit the previous word (currentWord), not the word that just arrived.
            if (currentWord != null && !currentWord.equals(word)) {
                context.write(new Text(currentWord), new Text(help()));
            }
            total += sum;
            ++docNum;
            posting.append(docId + ":" + sum + ";");
            currentWord = word;
        }

        @Override
        public void cleanup(Context context) throws IOException, InterruptedException {
            // Flush the last word; skip when this reducer received no keys at all.
            if (currentWord != null) {
                context.write(new Text(currentWord), new Text(help()));
            }
        }
    }
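The grouping logic of the reducer relies on keys arriving in sorted order, so all "word#docId" keys of one word are adjacent. The plain-Java sketch below shows the same streaming idea on a TreeMap, which iterates in sorted key order. It is an illustration under assumptions, not the job itself: PostingGroupingSketch and buildPostings are hypothetical names, and the counts are taken to be already summed per key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PostingGroupingSketch {
    // Input: "word#docId" -> term frequency, iterated in sorted key order.
    // Output: one "word<TAB>average,doc1:tf;doc2:tf" line per word.
    static List<String> buildPostings(TreeMap<String, Integer> sortedCounts) {
        List<String> lines = new ArrayList<>();
        String currentWord = null;
        double total = 0;
        int docNum = 0;
        StringBuilder posting = new StringBuilder();
        for (Map.Entry<String, Integer> entry : sortedCounts.entrySet()) {
            String[] parts = entry.getKey().split("#");
            String word = parts[0];
            String docId = parts[1];
            // A different word means the previous word's posting list is complete.
            if (currentWord != null && !currentWord.equals(word)) {
                lines.add(currentWord + "\t" + (total / docNum) + "," + posting);
                total = 0;
                docNum = 0;
                posting.setLength(0);
            }
            if (posting.length() > 0) {
                posting.append(';');
            }
            posting.append(docId).append(':').append(entry.getValue());
            total += entry.getValue();
            docNum++;
            currentWord = word;
        }
        // Flush the last word, which no later key will trigger.
        if (currentWord != null) {
            lines.add(currentWord + "\t" + (total / docNum) + "," + posting);
        }
        return lines;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        counts.put("apple#doc1", 2);
        counts.put("apple#doc2", 4);
        counts.put("banana#doc1", 1);
        buildPostings(counts).forEach(System.out::println);
    }
}
```

The final flush after the loop plays the role that cleanup() plays in a Hadoop reducer: without it, the last word would never be written.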

Using Helper Classes

The code above covers the basic steps of a MapReduce job. From here on we move into more advanced MapReduce program design, starting with helper classes.

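To make it easier to run jobs from the command line, Hadoop ships with some helper classes. GenericOptionsParser interprets the common Hadoop command-line options and sets the corresponding values on a Configuration object as needed. You usually do not use GenericOptionsParser directly; the more convenient approach is to implement the Tool interface and run the application through ToolRunner, which calls GenericOptionsParser internally. The code above can then be changed to: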

public class InvertedIndex extends Configured implements Tool {
    static class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    ....
    } 
    static class InvertedIndexCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    ....
    }    
    static class InvertedIndexPartitioner extends HashPartitioner<Text, IntWritable> {
    ....
    }
    static class InvertedIndexReducer extends Reducer<Text, IntWritable, Text, Text> {
    ....   
    }
    public int run(String[] var1) throws Exception {
    ....
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Propagate the Tool's return value as the process exit code
        System.exit(ToolRunner.run(conf, new InvertedIndex(), args));
    }
}

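Compared with the earlier code, InvertedIndex now extends Configured and implements the Tool interface, which means implementing run(). main() calls the static ToolRunner.run(), which in turn calls InvertedIndex's run(). The source of ToolRunner.run() is: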

public static int run(Configuration conf, Tool tool, String[] args) 
    throws Exception{
    if(conf == null) {
      conf = new Configuration();
    }
    GenericOptionsParser parser = new GenericOptionsParser(conf, args);
    //set the configuration back, so that Tool can configure itself
    tool.setConf(conf);
    
    //get the args w/o generic hadoop args
    String[] toolArgs = parser.getRemainingArgs();
    return tool.run(toolArgs);
  }

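As the source shows, ToolRunner.run() uses GenericOptionsParser to parse the command-line arguments and passes the remaining ones to Tool.run(). GenericOptionsParser sets values on the Configuration object according to the options it receives.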
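The split between generic options and remaining arguments can be pictured with a heavily simplified plain-Java model. This is not Hadoop's implementation: GenericOptionsSketch is a hypothetical class, a Map stands in for the Configuration object, and only the "-D key=value" form is handled, whereas the real parser also understands options such as -files and -libjars. The property mapreduce.job.reduces is used purely as an example value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GenericOptionsSketch {
    final Map<String, String> conf = new LinkedHashMap<>(); // stands in for Configuration
    final List<String> remainingArgs = new ArrayList<>();   // what Tool.run() would receive

    GenericOptionsSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                // "-D key=value" becomes a configuration entry and is consumed.
                String[] kv = args[++i].split("=", 2);
                conf.put(kv[0], kv.length > 1 ? kv[1] : "");
            } else {
                // Everything else is left for the application.
                remainingArgs.add(args[i]);
            }
        }
    }

    public static void main(String[] args) {
        GenericOptionsSketch parsed = new GenericOptionsSketch(
                new String[] {"-D", "mapreduce.job.reduces=5", "/input", "/output"});
        System.out.println(parsed.conf);
        System.out.println(parsed.remainingArgs);
    }
}
```

With this separation the application only ever sees its own arguments (here the input and output paths), while tuning parameters travel through the configuration.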

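Some readers may ask why we go to this trouble to pass command-line arguments. Can't I just use the arguments of main directly? In this simple example you certainly can, but in a complex project ToolRunner brings a number of benefits, because running and configuring MapReduce jobs often comes with annoyances like these: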

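Job configuration parameters are hard-coded in Java, so every change means editing the source, compiling, packaging and deploying again.
When the job depends on a configuration file, you have to write Java code by hand that uses DistributedCache to upload it to HDFS so that the map and reduce functions can read it.
When your map or reduce function depends on a third-party jar, you specify it with "-libjars" on the command line, but the option has no effect.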

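ToolRunner takes care of all of these. In a few simple steps you get code separated from configuration, files uploaded to the DistributedCache, and so on. Changing a MapReduce parameter no longer requires modifying Java code, repackaging and redeploying, which makes work more efficient.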

Combining Multiple MapReduce Programs

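Suppose we add another task on top of the inverted index result: sorting the words by their average number of occurrences. How should the program change? First we need a MapReduce job that does the sorting. Here we write only a map class and let the framework sort for us: map output is sorted by key during the shuffle, so the precondition is that the average number of occurrences is used as the key. The code is as follows: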

static class SortMapper extends Mapper<LongWritable, Text, DoubleWritable, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Input line: word<TAB>average,doc1:tf;doc2:tf...
            String record = value.toString();
            String newRecord = record.substring(0, record.indexOf(','));
            String[] fields = newRecord.split("\t");
            Text word = new Text(fields[0]);
            DoubleWritable avg = new DoubleWritable(Double.parseDouble(fields[1]));
            context.write(avg, word);
        }
    }

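SortMapper uses the average number of occurrences as the key. A natural question: if there are several reducers, does the sort still work? There are as many output files as reducers, and each file is sorted internally, but the files taken together are not necessarily in order, because by default records are shuffled to reducers by hash. With more than one reducer you need a custom Partitioner that assigns key ranges to reducers (Hadoop's TotalOrderPartitioner is one such partitioner), so that the output files are also ordered as a whole. Here we assume a single reducer. One more point: in a MapReduce job that has only a map task and no reduce task, the number of map tasks executed matches the number of reduces. Why is that? (TODO)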
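The idea behind a partitioner that preserves global order can be sketched in plain Java: each reducer owns a contiguous key range, so concatenating the output files in reducer order yields a fully sorted result. This is a minimal sketch, not Hadoop code. RangePartitionSketch is a hypothetical name and the split points are made-up values; in practice they would be chosen by sampling the data.

```java
import java.util.Arrays;

class RangePartitionSketch {
    // Returns the index of the reducer responsible for the given average.
    // splitPoints must be sorted ascending; n split points define n + 1 ranges.
    static int partition(double avg, double[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, avg);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        // A key equal to a split point goes to the upper range.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        double[] splitPoints = {2.0, 5.0}; // hypothetical boundaries for 3 reducers
        for (double avg : new double[] {1.0, 2.0, 3.0, 7.5}) {
            System.out.println(avg + " -> reducer " + partition(avg, splitPoints));
        }
    }
}
```

Because the mapping from key to reducer index is monotonic, every key in reducer 0's file is smaller than every key in reducer 1's file, and so on.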

Sequential Composition

The output of one MapReduce job becomes the input of the next, and the jobs are run one after another by hand.

Job job1 = Job.getInstance(conf, "invertedindex");
job1.setJarByClass(InvertedIndex.class);
job1.setMapperClass(InvertedIndexMapper.class);
job1.setCombinerClass(InvertedIndexCombiner.class);
job1.setPartitionerClass(InvertedIndexPartitioner.class);
job1.setReducerClass(InvertedIndexReducer.class);
job1.setMapOutputKeyClass(Text.class);
job1.setMapOutputValueClass(IntWritable.class);
job1.setNumReduceTasks(5);
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job1, new Path(var1[0]));
FileOutputFormat.setOutputPath(job1, new Path(var1[1]));
// Stop here if the indexing job fails; job2 depends on its output
if (!job1.waitForCompletion(true)) {
    return 1;
}

Job job2 = Job.getInstance(conf, "sort");
job2.setJarByClass(InvertedIndex.class);
job2.setMapperClass(SortMapper.class);
job2.setOutputKeyClass(DoubleWritable.class);
job2.setOutputValueClass(Text.class);
// job1's output directory is job2's input
FileInputFormat.addInputPath(job2, new Path(var1[1]));
FileOutputFormat.setOutputPath(job2, new Path(var1[2]));
// Submit job2 and report its result as the exit code
return job2.waitForCompletion(true) ? 0 : 1;

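In the code above, job1 builds the inverted index and job2 does the sorting. The two are linked by using job1's output as job2's input, and job2 runs only after job1 has finished.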

Chained Execution

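Hadoop's ChainMapper lets you add and use several Map subtasks within a single Map task, and ChainReducer lets a single Reduce task continue with several Map subtasks for follow-up processing after the Reduce step has run. This approach also fits our second task, the sort, because the sort does not use a reduce step. A few points need attention, though: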

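Because the steps are chained, the output of one step must match the input of the next.
There is now only a single output.
Only a map step is added after the reduce, not a complete MapReduce job, so the output key-value pairs of that map are no longer sorted. (The point here is only to introduce this design pattern, so we do not care whether the result is sorted.)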

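So we change the example above: the inverted index result is no longer written out, only the result of the sort map is, and the output key-value pairs of the inverted index must match the input key-value pairs of the sort map. There are two ways to do this: change the output pairs of the inverted index, or change the input pairs of the sort. Here we change the input pairs of the sort. The modified code is as follows: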

static class SortMapper extends Mapper<Text, Text, DoubleWritable, Text> {
        @Override
        public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
            String record = value.toString();
            String newRecord = record.substring(0, record.indexOf(','));
            DoubleWritable avg = new DoubleWritable(Double.parseDouble(newRecord));
            context.write(avg, key);
        }
    }
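The type matching that chaining requires can be illustrated with plain Java generics: the mapper placed after the reducer must accept exactly the key and value types the reducer emits. In the sketch below, String and Double stand in for Text and DoubleWritable, and ChainTypesSketch, reducerOutput and SORT_MAPPER are hypothetical names used only for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.function.Function;

class ChainTypesSketch {
    // Stand-in for one reducer output record: (word, "average,posting list").
    static Map.Entry<String, String> reducerOutput(String word, double avg, String posting) {
        return new SimpleEntry<String, String>(word, avg + "," + posting);
    }

    // Stand-in for the chained sort mapper: consumes (word, value), emits (average, word).
    // Its input types must equal the reducer's output types, or the chain does not compile.
    static final Function<Map.Entry<String, String>, Map.Entry<Double, String>> SORT_MAPPER =
            in -> {
                String value = in.getValue();
                double avg = Double.parseDouble(value.substring(0, value.indexOf(',')));
                return new SimpleEntry<Double, String>(avg, in.getKey());
            };

    public static void main(String[] args) {
        Map.Entry<Double, String> out =
                SORT_MAPPER.apply(reducerOutput("apple", 3.0, "doc1:2;doc2:4"));
        System.out.println(out.getKey() + "\t" + out.getValue());
    }
}
```

In Hadoop the same constraint is checked through the key and value classes passed when the chain is configured, rather than by the Java compiler.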

The job setup changes as follows:

Job job = Job.getInstance(conf, "invertedindex");
job.setJarByClass(InvertedIndex.class);
Configuration mapAConf = new Configuration(false);
Configuration mapBConf = new Configuration(false);
Configuration reduceConf = new Configuration(false);
ChainMapper.addMapper(job, InvertedIndexMapper.class, LongWritable.class, Text.class, Text.class, IntWritable.class, mapAConf);
ChainReducer.setReducer(job, InvertedIndexReducer.class, Text.class, IntWritable.class, Text.class, Text.class, reduceConf);
ChainReducer.addMapper(job, SortMapper.class, Text.class, Text.class, DoubleWritable.class, Text.class, mapBConf);
job.setCombinerClass(InvertedIndexCombiner.class);
job.setPartitionerClass(InvertedIndexPartitioner.class);
FileInputFormat.addInputPath(job, new Path(var1[0]));
FileOutputFormat.setOutputPath(job, new Path(var1[1]));
return job.waitForCompletion(true) ? 0 : 1;

One question remains: the last map step here is executed by many mappers, so why is there only one output file in the end? (TODO)

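Update 1: the statement above is wrong. The last map step is not executed by many mappers: the number of reducers is 1, so the last map step also has a single input. What if the number of reducers is set explicitly? Will there be several output files?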

Update 2: I tried it. If the number of reducers is set, for example to 5, the mappers after the reducer also produce 5 output files.

Changing the Input Format

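Our inverted index output has the form word<TAB>average,doc1:tf;doc2:tf..., which is exactly the layout KeyValueTextInputFormat expects (key and value separated by a tab). We can therefore switch the input format of the second job, the sort, to KeyValueTextInputFormat and read the word into the map directly as the key. The change is small: SortMapper stays the same as in the chained version, and job2 only needs to set the input format: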

job2.setInputFormatClass(KeyValueTextInputFormat.class);
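What this input format does to each line can be sketched in plain Java: the line is cut at the first tab, the part before it becomes the key and the rest becomes the value. If there is no separator, the whole line is the key and the value is empty. KeyValueLineSketch and splitKeyValue are hypothetical names; the sketch assumes the default tab separator.

```java
class KeyValueLineSketch {
    // Splits a line at the first tab into {key, value}.
    static String[] splitKeyValue(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] {line, ""}; // no separator: whole line is the key
        }
        return new String[] {line.substring(0, tab), line.substring(tab + 1)};
    }

    public static void main(String[] args) {
        String[] kv = splitKeyValue("apple\t3.0,doc1:2;doc2:4");
        System.out.println("key=" + kv[0]);
        System.out.println("value=" + kv[1]);
    }
}
```

The mapper then only has to take the text before the first comma of the value to obtain the average.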

The open questions raised in this article will be covered in later posts; I have not figured them out yet myself.

References:

http://www.infoq.com/cn/articles/MapReduce-Best-Practice-1

posted @ 2016-11-06 12:59 传奇魔法师
