Determining the number of Hadoop map tasks (splits)
While learning Hadoop I had long wanted to step through the Hadoop source code in a debugger but never found a workable way to do it. Today, while debugging a matrix multiplication job, I finally found one, so I am writing it down here.
1) It started with wanting to set the number of maps for a Job (even though the final map count is determined by the input splits). Before hadoop 1.2.1 the way to set it was:
job.setNumMapTasks()
However, hadoop 1.2.1 no longer has this method; only the method for setting the number of reduces remains. Searching further, I found suggestions to set it through the Configuration object instead, like this:
conf.set("mapred.map.tasks",5);//设置5个map
After setting it this way there was still no effect. The code that controls the split size (old API) is:
goalSize = totalSize / (numSplits == 0 ? 1 : numSplits)
// totalSize is the total size of the input files and numSplits is the map count requested by
// the user, so goalSize is the split size the user intends.

minSize = Math.max(job.getLong("mapred.min.split.size", 1), minSplitSize)
// In hadoop 1.2.1, mapred-default.xml sets mapred.min.split.size = 0, so
// job.getLong("mapred.min.split.size", 1) returns 0. minSplitSize is a field of
// FileInputFormat whose value is 1, so minSize = 1; the point is to obtain the
// configured minimum split size.

splitSize = Math.max(minSize, Math.min(goalSize, blockSize))
// The actual split size is the smaller of goalSize (derived from the requested map count)
// and the block size blockSize (so a split never spans more than one block, which helps
// data locality), but never smaller than minSize.
In fact, this is how splits were generated before hadoop 1.2.1 (the old mapred API); the new API does not use the requested map count at all, so setting it has no real effect.
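To make the old-API arithmetic concrete, here is a minimal stand-alone sketch with made-up numbers (a 160 MB input, 64 MB blocks, 5 requested maps). It only reproduces the three formulas above, not the actual Hadoop code path:

public class OldApiSplitSizeDemo {
    public static void main(String[] args) {
        long totalSize = 160L * 1024 * 1024; // total size of the input files (example value)
        long blockSize = 64L * 1024 * 1024;  // HDFS block size (example value)
        int numSplits = 5;                   // map count requested by the user
        long configuredMin = 0L;             // mapred.min.split.size from mapred-default.xml
        long minSplitSize = 1L;              // FileInputFormat's internal minimum

        long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);      // 32 MB
        long minSize = Math.max(configuredMin, minSplitSize);              // 1
        long splitSize = Math.max(minSize, Math.min(goalSize, blockSize)); // 32 MB, since goalSize < blockSize

        System.out.println("goalSize=" + goalSize + ", splitSize=" + splitSize);
    }
}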
2) In the new API (hadoop 1.2.1), the splits are computed by the following code:
 1 public List<InputSplit> getSplits(JobContext job) throws IOException {
 2     long minSize = Math.max(this.getFormatMinSplitSize(), getMinSplitSize(job));
 3     long maxSize = getMaxSplitSize(job);
 4     ArrayList splits = new ArrayList();
 5     List files = this.listStatus(job);
 6     Iterator i$ = files.iterator();
 7
 8     while(true) {
 9         while(i$.hasNext()) {
10             FileStatus file = (FileStatus)i$.next();
11             Path path = file.getPath();
12             FileSystem fs = path.getFileSystem(job.getConfiguration());
13             long length = file.getLen();
14             BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0L, length);
15             if(length != 0L && this.isSplitable(job, path)) {
16                 long blockSize = file.getBlockSize();
17                 long splitSize = this.computeSplitSize(blockSize, minSize, maxSize);
18
19                 long bytesRemaining;
20                 for(bytesRemaining = length; (double)bytesRemaining / (double)splitSize > 1.1D; bytesRemaining -= splitSize) {
21                     int blkIndex = this.getBlockIndex(blkLocations, length - bytesRemaining);
22                     splits.add(new FileSplit(path, length - bytesRemaining, splitSize, blkLocations[blkIndex].getHosts()));
23                 }
24
25                 if(bytesRemaining != 0L) {
26                     splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining, blkLocations[blkLocations.length - 1].getHosts()));
27                 }
28             } else if(length != 0L) {
29                 splits.add(new FileSplit(path, 0L, length, blkLocations[0].getHosts()));
30             } else {
31                 splits.add(new FileSplit(path, 0L, length, new String[0]));
32             }
33         }
34
35         job.getConfiguration().setLong("mapreduce.input.num.files", (long)files.size());
36         LOG.debug("Total # of splits: " + splits.size());
37         return splits;
38     }
39 }
Line 17 uses computeSplitSize(blockSize, minSize, maxSize) to compute the split size.
a. minSize is computed as:
long minSize = Math.max(this.getFormatMinSplitSize(), getMinSplitSize(job));
where getFormatMinSplitSize() is:
protected long getFormatMinSplitSize() { return 1L; }
and getMinSplitSize(job) is:
public static long getMinSplitSize(JobContext job) { return job.getConfiguration().getLong("mapred.min.split.size", 1L); }
In hadoop 1.2.1, mapred-default.xml sets "mapred.min.split.size" to 0. If the property were absent, the method default of 1 would be used; since it is configured as 0, getMinSplitSize(job) returns 0. Combined with getFormatMinSplitSize() = 1, we get minSize = Math.max(1, 0) = 1.
b. Next, look at how maxSize is computed:
long maxSize = getMaxSplitSize(job);
where getMaxSplitSize() is:
public static long getMaxSplitSize(JobContext context) { return context.getConfiguration().getLong("mapred.max.split.size", 9223372036854775807L); }
没有设置"mapred.max.split.size"的话,就使用方法的默认值 9223372036854775807,而"mapred.max.split.size"并没有默认值,所以maxSize= 9223372036854775807;
c. We now have minSize = 1 and maxSize = 9223372036854775807; the split size is then computed as:
long splitSize = this.computeSplitSize(blockSize, minSize, maxSize);
protected long computeSplitSize(long blockSize, long minSize, long maxSize) { return Math.max(minSize, Math.min(maxSize, blockSize)); }
Clearly, with minSize = 1, the split size is simply the smaller of maxSize and blockSize, so we can control the number of maps by setting "mapred.max.split.size" to a value smaller than the physical block size. Using the Configuration object:
conf.set("mapred.max.split.size",2000000)//单位是字节,物理块是16M
3) The matrix multiplication code, with the map count configurable in this way, is shown below:
/**
 * Created with IntelliJ IDEA.
 * User: hadoop
 * Date: 16-3-14
 * Time: 3:13 PM
 * To change this template use File | Settings | File Templates.
 */
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.util.ReflectionUtils;

public class MutiDoubleInputMatrixProduct {

    public static void initDoubleArrayWritable(int length, DoubleWritable[] doubleArrayWritable) {
        for (int i = 0; i < length; ++i) {
            doubleArrayWritable[i] = new DoubleWritable(0.0);
        }
    }

    public static class MyMapper extends Mapper<IntWritable, DoubleArrayWritable, IntWritable, DoubleArrayWritable> {
        public DoubleArrayWritable map_value = new DoubleArrayWritable();
        public double[][] leftMatrix = null;
        public DoubleWritable[] arraySum = null;
        public DoubleWritable[] tempColumnArrayDoubleWritable = null;
        public DoubleWritable[] tempRowArrayDoubleWritable = null;
        public double sum = 0;
        public double uValue;
        public int leftMatrixRowNum;
        public int leftMatrixColumnNum;

        public void setup(Context context) throws IOException {
            Configuration conf = context.getConfiguration();
            leftMatrixRowNum = conf.getInt("leftMatrixRowNum", 10);
            leftMatrixColumnNum = conf.getInt("leftMatrixColumnNum", 10);
            leftMatrix = new double[leftMatrixRowNum][leftMatrixColumnNum];
            uValue = (double) (context.getConfiguration().getFloat("u", 1.0f));
            tempRowArrayDoubleWritable = new DoubleWritable[leftMatrixColumnNum];
            initDoubleArrayWritable(leftMatrixColumnNum, tempRowArrayDoubleWritable);
            tempColumnArrayDoubleWritable = new DoubleWritable[leftMatrixRowNum];
            initDoubleArrayWritable(leftMatrixRowNum, tempColumnArrayDoubleWritable);
            System.out.println("map setup() start!");
            // read the left matrix from the distributed cache
            Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
            String localCacheFile = "file://" + cacheFiles[0].toString();
            System.out.println("local path is:" + cacheFiles[0].toString());
            FileSystem fs = FileSystem.get(URI.create(localCacheFile), conf);
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(localCacheFile), conf);
            IntWritable key = (IntWritable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            DoubleArrayWritable value = (DoubleArrayWritable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            int rowIndex = 0;
            int index;
            while (reader.next(key, value)) {
                index = -1;
                for (Writable val : value.get()) { // ArrayWritable.get() returns a Writable[] array
                    tempRowArrayDoubleWritable[++index].set(((DoubleWritable) val).get());
                }
                rowIndex = key.get();
                leftMatrix[rowIndex] = new double[leftMatrixColumnNum];
                for (int i = 0; i < leftMatrixColumnNum; ++i) {
                    leftMatrix[rowIndex][i] = tempRowArrayDoubleWritable[i].get();
                }
            }
            arraySum = new DoubleWritable[leftMatrix.length];
            initDoubleArrayWritable(leftMatrix.length, arraySum);
        }

        public void map(IntWritable key, DoubleArrayWritable value, Context context) throws IOException, InterruptedException {
            InputSplit inputSplit = context.getInputSplit();
            String fileName = ((FileSplit) inputSplit).getPath().getName();
            if (fileName.startsWith("FB")) {
                context.write(key, value);
            } else {
                int ii = -1;
                for (Writable val : value.get()) {
                    tempColumnArrayDoubleWritable[++ii].set(((DoubleWritable) val).get());
                }
                for (int i = 0; i < this.leftMatrix.length; ++i) {
                    sum = 0;
                    for (int j = 0; j < this.leftMatrix[0].length; ++j) {
                        sum += this.leftMatrix[i][j] * tempColumnArrayDoubleWritable[j].get() * uValue;
                    }
                    arraySum[i].set(sum);
                }
                map_value.set(arraySum);
                context.write(key, map_value);
            }
        }
    }

    public static class MyReducer extends Reducer<IntWritable, DoubleArrayWritable, IntWritable, DoubleArrayWritable> {
        public DoubleWritable[] sum = null;
        public DoubleArrayWritable valueArrayWritable = new DoubleArrayWritable();
        public DoubleWritable[] tempColumnArrayDoubleWritable = null;
        private int leftMatrixRowNum;

        public void setup(Context context) {
            leftMatrixRowNum = context.getConfiguration().getInt("leftMatrixRowNum", 100);
            sum = new DoubleWritable[leftMatrixRowNum];
            initDoubleArrayWritable(leftMatrixRowNum, sum);
            tempColumnArrayDoubleWritable = new DoubleWritable[leftMatrixRowNum];
            initDoubleArrayWritable(leftMatrixRowNum, tempColumnArrayDoubleWritable);
        }

        // Since the matrix product is already computed in map(), the reduce phase is not strictly
        // needed; without a custom Reducer the framework still runs a default reduce that does nothing.
        // However, without a reduce each map writes its own output file, so there would be as many
        // result files as maps. The reduce here only gathers the result matrix into a single file.
        public void reduce(IntWritable key, Iterable<DoubleArrayWritable> value, Context context) throws IOException, InterruptedException {
            for (DoubleArrayWritable doubleValue : value) {
                int index = -1;
                for (Writable val : doubleValue.get()) {
                    tempColumnArrayDoubleWritable[++index].set(((DoubleWritable) val).get());
                }
            }
            valueArrayWritable.set(tempColumnArrayDoubleWritable);
            context.write(key, valueArrayWritable);
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        String uri = args[3];
        String outUri = args[4];
        String cachePath = args[2];
        HDFSOperator.deleteDir(outUri);
        Configuration conf = new Configuration();
        DistributedCache.addCacheFile(URI.create(cachePath), conf); // add the left matrix to the distributed cache
        conf.setInt("leftMatrixColumnNum", Integer.parseInt(args[0]));
        conf.setInt("leftMatrixRowNum", Integer.parseInt(args[1]));
        conf.setFloat("u", 1.0f);
        //conf.set("mapred.map.tasks", args[5]); // has no effect, see the discussion above
        // hadoop 1.2.1 has no setNumMapTasks(); the map count is controlled by capping the split size
        conf.set("mapred.max.split.size", args[5]);
        conf.set("mapred.jar", "MutiDoubleInputMatrixProduct.jar");
        Job job = new Job(conf, "MatrixProdcut");
        job.setJarByClass(MutiDoubleInputMatrixProduct.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(DoubleArrayWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(DoubleArrayWritable.class);
        FileInputFormat.setInputPaths(job, new Path(uri));
        FileOutputFormat.setOutputPath(job, new Path(outUri));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

class DoubleArrayWritable extends ArrayWritable {
    public DoubleArrayWritable() {
        super(DoubleWritable.class);
    }
}

class HDFSOperator {
    public static boolean deleteDir(String dir) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        boolean result = fs.delete(new Path(dir), true);
        System.out.println("sOutput delete");
        fs.close();
        return result;
    }
}
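After packaging, the job can be launched in the usual way. The paths below are purely illustrative; the argument order follows main() above (left-matrix column count, left-matrix row count, cached left-matrix file, input path, output path, split-size cap in bytes):

hadoop jar MutiDoubleInputMatrixProduct.jar MutiDoubleInputMatrixProduct 100 100 hdfs:///cache/leftMatrix.seq hdfs:///input/rightMatrix.seq hdfs:///output/product 2000000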
4) Next, how to set breakpoints in the Hadoop source, using the split-computation code above as the example.
a. First find the FileInputFormat class; it lives in hadoop-core-1.2.1.jar, so add that jar to the project as a dependency.
Although these are compiled class files (bytecode), they can still be debugged with breakpoints just like Java source. Here we set two breakpoints, one in getSplits() and one in computeSplitSize(), and then run the MapReduce program locally in Debug mode directly from IDEA.
Both breakpoints are hit, and the relevant variable values can be inspected.
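One note on local debugging: for breakpoints inside FileInputFormat to be hit in the IDE's own JVM, the job must run with the local job runner and the local file system. This is usually already the default when no cluster configuration is on the classpath, but it can be forced explicitly; a minimal sketch, assuming the hadoop 1.x property names, added in main() before constructing the Job:

conf.set("mapred.job.tracker", "local"); // run the job in-process with the local job runner
conf.set("fs.default.name", "file:///"); // read input from the local file system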