MapReduce Example: Building a Simple Inverted Index

Inverted indexes are widely used in full-text search engines; engines such as Google, Baidu, and Yahoo all rely on them. In short, an inverted index maps each word to the documents that contain it, rather than mapping each document to the words it contains. For a detailed introduction, see Wikipedia.

This example builds an inverted index over the contents of several files. The contents of the files are as follows:

The result we want to produce is:

This gives us a simple inverted index: given a word, we can look up which files it appears in and how many times it occurs in each.
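As an illustration (the file names and contents here are hypothetical stand-ins, not the ones from the original screenshots), input like this:

    file1.txt: MapReduce is simple
    file2.txt: MapReduce is powerful is simple
    file3.txt: Hello MapReduce bye MapReduce

would produce an index like this, where each output line is a word, a tab, and a semicolon-terminated list of filename:count entries (this format follows directly from the code below):

    Hello       file3.txt:1;
    MapReduce   file1.txt:1;file2.txt:1;file3.txt:2;
    bye         file3.txt:1;
    is          file1.txt:1;file2.txt:2;
    powerful    file2.txt:1;
    simple      file2.txt:1;file1.txt:1;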

The job works in three steps (a concrete trace follows the list):

1. In the map stage, emit the contents of file1, file2, and file3 as pairs of the form <word:filename, 1>. Fusing the filename into the key is what carries the file information through to the later reduce stage.

2. In a combiner, convert <word:filename, 1> into <word, filename:N>, where N is the number of times the word occurs in that one file.

3. In the reduce stage, concatenate all the per-file entries for each word and write the result to the output file.
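To make the steps concrete, here is a trace of how a single word might flow through the pipeline (the file names and counts are illustrative):

    // Step 1 (map): each occurrence becomes <word:filename, 1>
    <MapReduce:file1.txt, 1>
    <MapReduce:file3.txt, 1>
    <MapReduce:file3.txt, 1>

    // Step 2 (combine): per-file occurrences are summed, and the filename
    // moves from the key into the value, giving <word, filename:N>
    <MapReduce, file1.txt:1>
    <MapReduce, file3.txt:2>

    // Step 3 (reduce): all entries for the word are concatenated
    <MapReduce, file1.txt:1;file3.txt:2;>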

The code is as follows:

package com.eric;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class InvertedIndex {

    // Step 1: for every word occurrence, emit <word:filename, 1>. Fusing the
    // filename into the key carries the file information through the shuffle.
    public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
        private Text keyInfo = new Text();
        private Text one = new Text("1");

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // The input split tells us which file this line came from.
            FileSplit split = (FileSplit) context.getInputSplit();
            String filename = split.getPath().getName();

            StringTokenizer iter = new StringTokenizer(value.toString());
            while (iter.hasMoreTokens()) {
                keyInfo.set(iter.nextToken() + ":" + filename);
                context.write(keyInfo, one);
            }
        }
    }

    // Step 2: the combiner turns <word:filename, 1> into <word, filename:N>,
    // where N is the number of occurrences of the word in that file.
    // Note: Hadoop may run a combiner zero, one, or several times; this
    // classic example assumes exactly one pass, which holds for small inputs.
    public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {
        private Text valueInfo = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Sum the occurrences of this word within a single file.
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }

            // Move the filename from the key into the value: the key becomes
            // the bare word, the value becomes "filename:N".
            int splitIndex = key.toString().indexOf(":");
            valueInfo.set(key.toString().substring(splitIndex + 1) + ":" + sum);
            key.set(key.toString().substring(0, splitIndex));
            context.write(key, valueInfo);
        }
    }

    // Step 3: concatenate all "filename:N" entries for each word.
    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
        private Text result = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text value : values) {
                sb.append(value.toString()).append(";");
            }
            result.set(sb.toString());
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Input and output directories are hardcoded here for convenience;
        // any real command-line arguments are ignored.
        String[] _args = new String[] { "eric", "eric_out" };

        String[] otherArgs = new GenericOptionsParser(conf, _args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: invertedindex <input> <output>");
            System.exit(2);
        }

        Job job = new Job(conf, "inverted index");
        job.setJarByClass(InvertedIndex.class);

        job.setMapperClass(InvertedIndexMapper.class);
        job.setCombinerClass(InvertedIndexCombiner.class);
        job.setReducerClass(InvertedIndexReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}


On a Hadoop cluster deployed on CentOS 5.5, running the job gives the following result:


In my Eclipse Hadoop development environment on Windows 7, however, running the job fails with an error along the lines of "Failed to set permissions of path: \tmp\***\.staging to 0700" (the exact path depends on your cluster configuration). I suspect it arises when the context obtains the FileSplit, since that is the only operation here that differs from WordCount. So I copied the jar to the Hadoop cluster and ran it with the jar command, roughly as sketched below.
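A minimal sketch of that cluster-side run (the jar name "invertedindex.jar" is an assumption; the input/output directories match the ones hardcoded in main):

    # upload the input files into the hardcoded input directory
    hadoop fs -mkdir eric
    hadoop fs -put file1.txt file2.txt file3.txt eric
    # run the job
    hadoop jar invertedindex.jar com.eric.InvertedIndex
    # inspect the result
    hadoop fs -cat eric_out/part-r-00000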


Alternatively, the error can be fixed at the source: edit checkReturnValue in /hadoop-1.0.3/src/core/org/apache/hadoop/fs/FileUtil.java and comment out its body (a bit crude, but on Windows the permission check can simply be skipped):

......
  private static void checkReturnValue(boolean rv, File p,
                                       FsPermission permission
                                       ) throws IOException {
    /**
    if (!rv) {
      throw new IOException("Failed to set permissions of path: " + p +
                            " to " +
                            String.format("%04o", permission.toShort()));
    }
    **/
  }
......

Then recompile and repackage hadoop-core-1.0.3.jar, and replace the hadoop-core-1.0.3.jar in the hadoop-1.0.3 root directory with the rebuilt one.
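A rough sketch of that rebuild, assuming the stock Ant build that ships with the Hadoop 1.0.3 source (target names and output paths may differ in your setup):

    # in the hadoop-1.0.3 source root, after editing FileUtil.java
    ant jar
    # the rebuilt core jar lands under build/; copy it over the original
    cp build/hadoop-core-*.jar hadoop-core-1.0.3.jar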


posted @ 2012-07-26 12:56  hanyuanbo