Using MultipleInputs and MultipleOutputs
The task is again matrix multiplication; the expression to evaluate is:
S = F * [B + mu*(u + s + b + d)]
where the matrices B, u, s, and d are each stored in a SequenceFile named after them.
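Assuming ordinary matrix algebra, the bracketed expression distributes, so each product can be formed from its own input directory and the pieces combined afterwards:
S = F*B + mu*(F*u + F*s + F*b + F*d)
This is what motivates the setup below: one input directory per matrix (handled by MultipleInputs) and one output file per partial product (handled by MultipleOutputs).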
1) We want to read these files (each stored in its own directory) and multiply each of them by the matrix F. This calls for the MultipleInputs class, which means changing the job configuration in main(). First, recall what the map() stage of the job needed before MultipleInputs was used:
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(DoubleArrayWritable.class);
FileInputFormat.setInputPaths(job, new Path(uri));
So the map side needs five settings when the job is configured: the input format, the mapper class, the map output key type, the map output value type, and the input path.
With MultipleInputs, the map-side configuration becomes:
MultipleInputs.addInputPath(job, new Path(uri + "/b100"), SequenceFileInputFormat.class, MyMapper.class);
MultipleInputs.addInputPath(job, new Path(uri + "/u100"), SequenceFileInputFormat.class, MyMapper.class);
MultipleInputs.addInputPath(job, new Path(uri + "/s100"), SequenceFileInputFormat.class, MyMapper.class);
MultipleInputs.addInputPath(job, new Path(uri + "/d100"), SequenceFileInputFormat.class, MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(DoubleArrayWritable.class);
Each call to addInputPath() registers an input path together with its input format and mapper class, so all that remains is to set the map output key and value types.
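Although this job reuses the same MyMapper for every directory, addInputPath() also accepts a different InputFormat and Mapper per path, which is one of the main reasons to use MultipleInputs at all. A hypothetical sketch (BMapper and UMapper are illustrative names, not part of this job):
// Hypothetical: give each input directory its own mapper implementation.
MultipleInputs.addInputPath(job, new Path(uri + "/b100"), SequenceFileInputFormat.class, BMapper.class);
MultipleInputs.addInputPath(job, new Path(uri + "/u100"), SequenceFileInputFormat.class, UMapper.class);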
2) That completes the map-side configuration for MultipleInputs. However, once MultipleInputs is used, the old way of getting the input file name no longer works: the framework wraps the real split in a TaggedInputSplit, so a direct cast to FileSplit fails. The file-name lookup in map() therefore has to be rewritten as follows (see http://blog.csdn.net/cuilanbo/article/details/25722489):
InputSplit split=context.getInputSplit();
//String fileName=((FileSplit)inputSplit).getPath().getName();
Class<? extends InputSplit> splitClass = split.getClass();

FileSplit fileSplit = null;
if (splitClass.equals(FileSplit.class)) {
    fileSplit = (FileSplit) split;
} else if (splitClass.getName().equals(
        "org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit")) {
    // begin reflection hackery...
    try {
        Method getInputSplitMethod = splitClass.getDeclaredMethod("getInputSplit");
        getInputSplitMethod.setAccessible(true);
        fileSplit = (FileSplit) getInputSplitMethod.invoke(split);
    } catch (Exception e) {
        // wrap and re-throw error
        throw new IOException(e);
    }
    // end reflection hackery
}
String fileName=fileSplit.getPath().getName();
That takes care of reading from multiple input paths. But what if we also want to write the results for these different inputs to different output files? That is what MultipleOutputs is for.
3) Before switching to MultipleOutputs, recall how the reduce side used to be configured:
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(DoubleArrayWritable.class);
FileOutputFormat.setOutputPath(job, new Path(outUri));
Likewise, the reduce side has five settings: the output format, the reducer class, the reduce output key type, the reduce output value type, and the output path. With MultipleOutputs, the reduce-side configuration becomes:
MultipleOutputs.addNamedOutput(job, "Sb100", SequenceFileOutputFormat.class, IntWritable.class, DoubleArrayWritable.class);
MultipleOutputs.addNamedOutput(job, "Su100", SequenceFileOutputFormat.class, IntWritable.class, DoubleArrayWritable.class);
MultipleOutputs.addNamedOutput(job, "Ss100", SequenceFileOutputFormat.class, IntWritable.class, DoubleArrayWritable.class);
MultipleOutputs.addNamedOutput(job, "Sd100", SequenceFileOutputFormat.class, IntWritable.class, DoubleArrayWritable.class);
job.setReducerClass(MyReducer.class);
FileOutputFormat.setOutputPath(job, new Path(outUri));
The addNamedOutput() method registers a named output together with its output format, output key type, and output value type. After that, only the reducer class and the output path still need to be set.
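Note that even when every record goes through a named output, the job still creates the default part-r-xxxxx files (possibly empty) in the output directory. If that is not wanted, one option (not used in this example) is LazyOutputFormat, which only creates the default output file when something is actually written to it:
// import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
LazyOutputFormat.setOutputFormatClass(job, SequenceFileOutputFormat.class);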
4) When MultipleOutputs is used, the reduce() method no longer writes through context.write(); it writes through the write() method of a MultipleOutputs instance. The reduce() and setup() implementations therefore change as follows:
a. The setup() method
public void setup(Context context){
    mos=new MultipleOutputs(context);
    int leftMatrixColumnNum=context.getConfiguration().getInt("leftMatrixColumnNum",100);
    sum=new DoubleWritable[leftMatrixColumnNum];
    for (int i=0;i<leftMatrixColumnNum;++i){
        sum[i]=new DoubleWritable(0.0);
    }
}
b. The reduce() method
public void reduce(Text key, Iterable<DoubleArrayWritable> value, Context context) throws IOException, InterruptedException {
    int valueLength=0;
    for(DoubleArrayWritable doubleValue:value){
        obValue=doubleValue.toArray();
        valueLength=Array.getLength(obValue);
        for (int i=0;i<valueLength;++i){
            sum[i]=new DoubleWritable(Double.parseDouble(Array.get(obValue,i).toString())+sum[i].get());
        }
    }
    valueArrayWritable=new DoubleArrayWritable();
    valueArrayWritable.set(sum);
    String[] xx=key.toString().split(",");
    IntWritable intKey=new IntWritable(Integer.parseInt(xx[0]));
    if (key.toString().endsWith("b100")){
        mos.write("Sb100",intKey,valueArrayWritable);
    }
    else if (key.toString().endsWith("u100")) {
        mos.write("Su100",intKey,valueArrayWritable);
    }
    else if (key.toString().endsWith("s100")) {
        mos.write("Ss100",intKey,valueArrayWritable);
    }
    else if (key.toString().endsWith("d100")) {
        mos.write("Sd100",intKey,valueArrayWritable);
    }
    for (int i=0;i<sum.length;++i){
        sum[i].set(0.0);
    }
}
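c. One more point: MultipleOutputs manages its own record writers, so it must be closed when the reduce task finishes, otherwise records buffered for the named outputs may never be flushed to the output files. The complete listing below therefore also overrides cleanup(); a minimal sketch:
public void cleanup(Context context) throws IOException, InterruptedException {
    // close every named output's writer so buffered records are flushed
    mos.close();
}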
The name passed to mos.write("Sb100", key, value) must be one of the names registered with addNamedOutput() when the job was configured. Also note that named output names may contain only letters and digits, so characters such as "-" and "_" are rejected.
5) The complete code for using MultipleInputs and MultipleOutputs in a single job is as follows:
/**
 * Created with IntelliJ IDEA.
 * User: hadoop
 * Date: 16-3-9
 * Time: 12:47 PM
 * To change this template use File | Settings | File Templates.
 */
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import java.io.IOException;
import java.lang.reflect.Array;
import java.lang.reflect.Method;
import java.net.URI;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.util.ReflectionUtils;

public class MutiDoubleInputMatrixProduct {
    public static class MyMapper extends Mapper<IntWritable,DoubleArrayWritable,Text,DoubleArrayWritable>{
        public DoubleArrayWritable map_value=new DoubleArrayWritable();
        public double[][] leftMatrix=null;  // the left matrix F, loaded from the distributed cache
        public Object obValue=null;
        public DoubleWritable[] arraySum=null;
        public double sum=0;

        public void setup(Context context) throws IOException {
            Configuration conf=context.getConfiguration();
            leftMatrix=new double[conf.getInt("leftMatrixRowNum",10)][conf.getInt("leftMatrixColumnNum",10)];
            System.out.println("map setup() start!");
            //URI[] cacheFiles=DistributedCache.getCacheFiles(conf);
            Path[] cacheFiles=DistributedCache.getLocalCacheFiles(conf);
            String localCacheFile="file://"+cacheFiles[0].toString();
            System.out.println("local path is:"+cacheFiles[0].toString());
            FileSystem fs=FileSystem.get(URI.create(localCacheFile), conf);
            SequenceFile.Reader reader=new SequenceFile.Reader(fs,new Path(localCacheFile),conf);
            IntWritable key=(IntWritable)ReflectionUtils.newInstance(reader.getKeyClass(),conf);
            DoubleArrayWritable value=(DoubleArrayWritable)ReflectionUtils.newInstance(reader.getValueClass(),conf);
            int valueLength=0;
            int rowIndex=0;
            while (reader.next(key,value)){
                obValue=value.toArray();
                rowIndex=key.get();
                if(rowIndex<1){
                    valueLength=Array.getLength(obValue);
                }
                leftMatrix[rowIndex]=new double[conf.getInt("leftMatrixColumnNum",10)];
                //this.leftMatrix=new double[valueLength][Integer.parseInt(context.getConfiguration().get("leftMatrixColumnNum"))];
                for (int i=0;i<valueLength;++i){
                    leftMatrix[rowIndex][i]=Double.parseDouble(Array.get(obValue, i).toString());
                }
            }
        }

        public void map(IntWritable key,DoubleArrayWritable value,Context context) throws IOException, InterruptedException {
            obValue=value.toArray();
            InputSplit split=context.getInputSplit();
            //String fileName=((FileSplit)inputSplit).getPath().getName();
            Class<? extends InputSplit> splitClass = split.getClass();

            FileSplit fileSplit = null;
            if (splitClass.equals(FileSplit.class)) {
                fileSplit = (FileSplit) split;
            } else if (splitClass.getName().equals(
                    "org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit")) {
                // begin reflection hackery...
                try {
                    Method getInputSplitMethod = splitClass.getDeclaredMethod("getInputSplit");
                    getInputSplitMethod.setAccessible(true);
                    fileSplit = (FileSplit) getInputSplitMethod.invoke(split);
                } catch (Exception e) {
                    // wrap and re-throw error
                    throw new IOException(e);
                }
                // end reflection hackery
            }
            String fileName=fileSplit.getPath().getName();

            if (fileName.startsWith("FB")) {
                context.write(new Text(String.valueOf(key.get())+","+fileName),value);
            }
            else{
                arraySum=new DoubleWritable[this.leftMatrix.length];
                for (int i=0;i<this.leftMatrix.length;++i){
                    sum=0;
                    for (int j=0;j<this.leftMatrix[0].length;++j){
                        sum+=this.leftMatrix[i][j]*Double.parseDouble(Array.get(obValue,j).toString())*(double)(context.getConfiguration().getFloat("u",1f));
                    }
                    arraySum[i]=new DoubleWritable(sum);
                    //arraySum[i].set(sum);
                }
                map_value.set(arraySum);
                context.write(new Text(String.valueOf(key.get())+","+fileName),map_value);
            }
        }
    }

    public static class MyReducer extends Reducer<Text,DoubleArrayWritable,IntWritable,DoubleArrayWritable>{
        public DoubleWritable[] sum=null;
        public Object obValue=null;
        public DoubleArrayWritable valueArrayWritable=null;
        private MultipleOutputs mos=null;

        public void setup(Context context){
            mos=new MultipleOutputs(context);
            int leftMatrixColumnNum=context.getConfiguration().getInt("leftMatrixColumnNum",100);
            sum=new DoubleWritable[leftMatrixColumnNum];
            for (int i=0;i<leftMatrixColumnNum;++i){
                sum[i]=new DoubleWritable(0.0);
            }
        }

        public void reduce(Text key, Iterable<DoubleArrayWritable> value, Context context) throws IOException, InterruptedException {
            int valueLength=0;
            for(DoubleArrayWritable doubleValue:value){
                obValue=doubleValue.toArray();
                valueLength=Array.getLength(obValue);
                for (int i=0;i<valueLength;++i){
                    sum[i]=new DoubleWritable(Double.parseDouble(Array.get(obValue,i).toString())+sum[i].get());
                }
            }
            valueArrayWritable=new DoubleArrayWritable();
            valueArrayWritable.set(sum);
            String[] xx=key.toString().split(",");
            IntWritable intKey=new IntWritable(Integer.parseInt(xx[0]));
            if (key.toString().endsWith("b100")){
                mos.write("Sb100",intKey,valueArrayWritable);
            }
            else if (key.toString().endsWith("u100")) {
                mos.write("Su100",intKey,valueArrayWritable);
            }
            else if (key.toString().endsWith("s100")) {
                mos.write("Ss100",intKey,valueArrayWritable);
            }
            else if (key.toString().endsWith("d100")) {
                mos.write("Sd100",intKey,valueArrayWritable);
            }
            for (int i=0;i<sum.length;++i){
                sum[i].set(0.0);
            }
        }

        // Close MultipleOutputs when the task ends; otherwise the writers of the
        // named outputs are never flushed and the output files may be incomplete.
        public void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        String uri="data/input";
        String outUri="sOutput";
        String cachePath="data/F100";
        HDFSOperator.deleteDir(outUri);
        Configuration conf=new Configuration();
        DistributedCache.addCacheFile(URI.create(cachePath),conf);  // add the left matrix F to the distributed cache
        //FileSystem fs=FileSystem.get(URI.create(uri),conf);
        //fs.delete(new Path(outUri),true);
        conf.setInt("leftMatrixColumnNum",100);
        conf.setInt("leftMatrixRowNum",100);
        conf.setFloat("u",0.5f);
        //conf.set("mapred.jar","MutiDoubleInputMatrixProduct.jar");
        Job job=new Job(conf,"MultiMatrix2");
        job.setJarByClass(MutiDoubleInputMatrixProduct.class);
        //job.setOutputFormatClass(NullOutputFormat.class);
        job.setReducerClass(MyReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleArrayWritable.class);
        MultipleInputs.addInputPath(job, new Path(uri + "/b100"), SequenceFileInputFormat.class, MyMapper.class);
        MultipleInputs.addInputPath(job, new Path(uri + "/u100"), SequenceFileInputFormat.class, MyMapper.class);
        MultipleInputs.addInputPath(job, new Path(uri + "/s100"), SequenceFileInputFormat.class, MyMapper.class);
        MultipleInputs.addInputPath(job, new Path(uri + "/d100"), SequenceFileInputFormat.class, MyMapper.class);
        MultipleOutputs.addNamedOutput(job, "Sb100", SequenceFileOutputFormat.class, IntWritable.class, DoubleArrayWritable.class);
        MultipleOutputs.addNamedOutput(job, "Su100", SequenceFileOutputFormat.class, IntWritable.class, DoubleArrayWritable.class);
        MultipleOutputs.addNamedOutput(job, "Ss100", SequenceFileOutputFormat.class, IntWritable.class, DoubleArrayWritable.class);
        MultipleOutputs.addNamedOutput(job, "Sd100", SequenceFileOutputFormat.class, IntWritable.class, DoubleArrayWritable.class);
        FileOutputFormat.setOutputPath(job,new Path(outUri));
        System.exit(job.waitForCompletion(true)?0:1);
    }
}

class DoubleArrayWritable extends ArrayWritable {
    public DoubleArrayWritable(){
        super(DoubleWritable.class);
    }
    public String toString(){
        StringBuilder sb=new StringBuilder();
        for (Writable val:get()){
            DoubleWritable doubleWritable=(DoubleWritable)val;
            sb.append(doubleWritable.get());
            sb.append(",");
        }
        sb.deleteCharAt(sb.length()-1);
        return sb.toString();
    }
}

class HDFSOperator{
    public static boolean deleteDir(String dir) throws IOException{
        Configuration conf=new Configuration();
        FileSystem fs=FileSystem.get(conf);
        boolean result=fs.delete(new Path(dir),true);
        System.out.println("sOutput delete");
        fs.close();
        return result;
    }
}
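To spot-check one of the named outputs after the job finishes, the SequenceFile can be read back directly. A minimal sketch, assuming the default part naming (e.g. Sb100-r-00000 under sOutput) and that DoubleArrayWritable is on the classpath:
Configuration conf = new Configuration();
Path p = new Path("sOutput/Sb100-r-00000");
FileSystem fs = FileSystem.get(conf);
SequenceFile.Reader reader = new SequenceFile.Reader(fs, p, conf);
IntWritable key = new IntWritable();
DoubleArrayWritable value = new DoubleArrayWritable();
while (reader.next(key, value)) {
    // print the row index followed by the comma-separated row values
    System.out.println(key.get() + "\t" + value.toString());
}
reader.close();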
6) The run results are as follows: