MapReduce Example: Building a Simple Inverted Index
Inverted indexes are widely used in full-text search engines; engines such as Google, Baidu, and Yahoo all rely on them. For an introduction to inverted indexes, see the Wikipedia article.
What this example does is build an inverted index over the contents of a few files. The file contents are as follows:
The result we want to produce is:
This gives a simple inverted index: given a word, you can look up which files it appears in and how many times it occurs in each.
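As a concrete illustration (the file names and contents below are made up, chosen only to show the shape of the data), suppose the three input files contain:

    file1.txt: MapReduce is simple
    file2.txt: MapReduce is powerful is simple
    file3.txt: Hello MapReduce bye MapReduce

Then the inverted index we are after would look roughly like this, one word per line followed by filename:count pairs (the order of the file entries is not significant):

    Hello        file3.txt:1;
    MapReduce    file1.txt:1;file2.txt:1;file3.txt:2;
    bye          file3.txt:1;
    is           file1.txt:1;file2.txt:2;
    powerful     file2.txt:1;
    simple       file1.txt:1;file2.txt:1;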
The job runs in three steps:
1. The mapper reads file1, file2, and file3 and emits each word as <word:filename, 1> (the file name is packed into the key so that the later stages can use it).
2. The combiner turns <word:filename, 1> into <word, filename:N>, where N is the number of times the word appears in that one file.
3. The reducer concatenates the filename:N entries for each word and writes them to the output file.
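To make the three steps concrete, here is a trace for the word MapReduce from the hypothetical files above (the formatting is illustrative):

    after map:      <MapReduce:file1.txt, 1>  <MapReduce:file2.txt, 1>  <MapReduce:file3.txt, 1>  <MapReduce:file3.txt, 1>
    after combine:  <MapReduce, file1.txt:1>  <MapReduce, file2.txt:1>  <MapReduce, file3.txt:2>
    after reduce:   MapReduce    file1.txt:1;file2.txt:1;file3.txt:2;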
The code is as follows:
package com.eric;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class InvertedIndex {

    // Step 1: emit <word:filename, 1> for every word in the input.
    public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
        private Text keyInfo = new Text();
        private Text one = new Text("1");
        private FileSplit split;

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // The file name comes from the input split and is packed into the key.
            split = (FileSplit) context.getInputSplit();
            String filename = split.getPath().getName();
            StringTokenizer iter = new StringTokenizer(value.toString());
            while (iter.hasMoreTokens()) {
                keyInfo.set(iter.nextToken() + ":" + filename);
                context.write(keyInfo, one);
            }
        }
    }

    // Step 2: the combiner sums the counts per <word:filename> and re-keys the
    // record as <word, filename:N> so that the reducer groups by word only.
    public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {

        private Text valueInfo = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }

            int splitIndex = key.toString().indexOf(":");
            valueInfo.set(key.toString().substring(splitIndex + 1) + ":" + sum);
            key.set(key.toString().substring(0, splitIndex));
            context.write(key, valueInfo);
        }
    }

    // Step 3: the reducer concatenates all filename:N entries for a word.
    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {

        private Text result = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuffer sb = new StringBuffer();
            for (Text value : values) {
                sb.append(value.toString() + ";");
            }
            result.set(sb.toString());
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Input and output paths are hard-coded for convenience when launching
        // from the IDE; the real command-line arguments are ignored.
        String[] _args = new String[] { "eric", "eric_out" };

        String[] otherArgs = new GenericOptionsParser(conf, _args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: invertedindex <input> <output>");
            System.exit(2);
        }

        Job job = new Job(conf, "inverted index");
        job.setJarByClass(InvertedIndex.class);

        job.setMapperClass(InvertedIndexMapper.class);
        job.setCombinerClass(InvertedIndexCombiner.class);
        job.setReducerClass(InvertedIndexReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
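Two things worth noting about this code. First, main() hard-codes the input and output paths as "eric" and "eric_out" rather than taking them from the command line; that is handy when launching from Eclipse, but it means arguments passed on the command line are ignored. For a normal cluster run you would feed the real args through instead, e.g.:

    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

Second, because the combiner rewrites the key from word:filename to word, records for the same word coming from different files may have been assigned to different partitions; with the default single reducer this does not matter, but with multiple reducers the same word could appear on more than one output line.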
On a Hadoop cluster deployed on CentOS 5.5, the job runs with the following result:
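For reference, assuming the job has been packaged into a jar named InvertedIndex.jar (the jar name and paths here are only examples), the cluster run and its output can be checked with commands along these lines:

    hadoop jar InvertedIndex.jar com.eric.InvertedIndex
    hadoop fs -cat eric_out/part-r-00000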
In the Eclipse Hadoop development environment I set up on Windows 7, running the job fails with an error along the lines of Failed to set permissions of path: \tmp\***\.staging to 0700 (the exact path depends on your cluster configuration). My guess was that it is triggered when the context obtains the FileSplit, since that is the only operation that differs from wordcount, so I chose to copy the jar to the Hadoop cluster and run it there with the hadoop jar command. Alternatively, you can do the following:
The workaround is to edit checkReturnValue in /hadoop-1.0.3/src/core/org/apache/hadoop/fs/FileUtil.java and comment out its body (a bit crude, but on Windows this check can simply be skipped):
......
private static void checkReturnValue(boolean rv, File p,
                                      FsPermission permission) throws IOException {
    /**
    if (!rv) {
        throw new IOException("Failed to set permissions of path: " + p +
                              " to " +
                              String.format("%04o", permission.toShort()));
    }
    **/
}
......
Then recompile and repackage hadoop-core-1.0.3.jar and use it to replace the hadoop-core-1.0.3.jar in the root of the hadoop-1.0.3 directory.
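Hadoop 1.0.3 is built with Ant, so assuming Ant is installed the rebuild is roughly the following (the exact target and the location and name of the generated jar may differ depending on your environment):

    cd /hadoop-1.0.3
    ant jar

The freshly built hadoop-core jar under build/ then replaces the one in the Hadoop root directory.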