The Hadoop Learning Path (22): MapReduce Input and Output
MapReduce Input
Anyone who has written a MapReduce program knows that the arguments passed to the map method are, by default, one line of data read by the input-reading component.
1. Who does the reading, and who calls this map method?
Looking at the source of Mapper.java, we can see that it is the run method that calls map.
/**
 * Find out who calls this run method.
 *
 * There is a component called MapTask, and it is MapTask that ends up calling mapper.run(context);
 *
 * context.nextKeyValue()  ======  lineRecordReader.nextKeyValue();
 */
public void run(Context context) throws IOException, InterruptedException {

    /**
     * Called once when each MapTask is initialized.
     */
    setup(context);
    try {

        /**
         * Every time the input-reading component reads one record, it hands it to map() once.
         *
         * context.nextKeyValue() does two things:
         *
         * 1. It reads one key-value pair into the key and value fields of this context object.
         * 2. Its return value is NOT the key-value pair that was read; it is a boolean flag
         *    indicating whether a key-value pair was read at all.
         *
         * context.getCurrentKey()   ==== key
         * context.getCurrentValue() ==== value
         *
         * All of this ultimately relies on the underlying LineRecordReader implementation.
         *
         * Your nextKeyValue() implementation must eventually return false,
         * otherwise this loop never terminates.
         */
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }

    } finally {

        /**
         * Runs after this MapTask has finished processing all the data of its split.
         */
        cleanup(context);
    }
}
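The last remark in those comments matters in practice: whenever a custom RecordReader is plugged in, its nextKeyValue() must eventually return false, or the while loop in run() never ends. Here is a minimal sketch of that contract; the class SingleRecordReader and its one-record-per-split behaviour are invented purely for illustration and are not part of Hadoop:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Hypothetical reader that emits exactly one record per split,
// just to illustrate the nextKeyValue() contract that Mapper.run() relies on.
public class SingleRecordReader extends RecordReader<LongWritable, Text> {

    private final LongWritable key = new LongWritable(0);
    private final Text value = new Text("the only record of this split");
    private boolean consumed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        // a real reader would open the file referenced by the split here
    }

    @Override
    public boolean nextKeyValue() {
        if (consumed) {
            return false;    // must eventually return false, or run() loops forever
        }
        consumed = true;     // key and value are already populated above
        return true;         // tells run(): a record is available via getCurrentKey/Value
    }

    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() { return consumed ? 1.0f : 0.0f; }

    @Override
    public void close() throws IOException {
        // nothing to release in this sketch
    }
}

To actually use such a reader you would return it from the createRecordReader() of a custom InputFormat, which is exactly the hook that runNewMapper goes through below.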
Four important context methods show up in this flow (a minimal user-side Mapper example follows the list):
1. context.nextKeyValue();    // reads the data, but its return value is not the key-value pair that was read; it is a boolean indicating whether any data was read
2. context.getCurrentKey();   // returns the key read by context.nextKeyValue()
3. context.getCurrentValue(); // returns the value read by context.nextKeyValue()
4. context.write(key, value); // writes out the output of the mapper stage
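Seen from the user side, these four methods are exactly what an ordinary Mapper interacts with: run() feeds getCurrentKey()/getCurrentValue() into your map(), and your map() hands results back through context.write(). A minimal word-count style mapper (the class name and types are an illustrative sketch, not taken from this article's source):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// For every record that context.nextKeyValue() pulls in, run() calls map() once
// with getCurrentKey()/getCurrentValue(); results are emitted through context.write().
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   = byte offset of the line (supplied by LineRecordReader)
        // value = the line itself
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // the fourth important method
            }
        }
    }
}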
2. Who calls the run method? Where does the context argument come from, and what is it?
The two questions share one answer: whoever calls run is also the one that constructs and passes in the argument called context.
To begin with, mapper.run(context) is invoked by a MapTask instance.
Look at the source of MapTask.java:
@Override
public void run(final JobConf job, final TaskUmbilicalProtocol umbilical)
        throws IOException, ClassNotFoundException, InterruptedException {
    this.umbilical = umbilical;

    if (isMapTask()) {
        // If there are no reducers then there won't be any sort. Hence the map
        // phase will govern the entire attempt's progress.
        if (conf.getNumReduceTasks() == 0) {
            mapPhase = getProgress().addPhase("map", 1.0f);
        } else {
            // If there are reducers then the entire attempt's progress will be
            // split between the map phase (67%) and the sort phase (33%).
            mapPhase = getProgress().addPhase("map", 0.667f);
            sortPhase = getProgress().addPhase("sort", 0.333f);
        }
    }
    TaskReporter reporter = startReporter(umbilical);

    boolean useNewApi = job.getUseNewMapper();
    initialize(job, getJobID(), reporter, useNewApi);

    // check if it is a cleanupJobTask
    if (jobCleanup) {
        runJobCleanupTask(umbilical, reporter);
        return;
    }
    if (jobSetup) {
        runJobSetupTask(umbilical, reporter);
        return;
    }
    if (taskCleanup) {
        runTaskCleanupTask(umbilical, reporter);
        return;
    }

    /**
     * The core of run(): dispatch to the new or the old API.
     */
    if (useNewApi) {
        /**
         * job: the JobConf object; splitMetaInfo: the split information;
         * umbilical: the communication protocol; reporter: an object holding the various counters.
         */
        runNewMapper(job, splitMetaInfo, umbilical, reporter);
    } else {
        runOldMapper(job, splitMetaInfo, umbilical, reporter);
    }

    done(umbilical, reporter);
}
This boils down to the following pseudocode for the new-API path (a job written against the new org.apache.hadoop.mapreduce API takes the runNewMapper branch):
mapTask.run() {
    runNewMapper() {
        mapper.run(mapperContext);
    }
}
3. Look at the runNewMapper method
This method also lives in MapTask.java:
/**
 * This is the core of the actual call chain:
 *
 *   mapper.run(context);
 *
 * @param job
 * @param splitIndex
 * @param umbilical
 * @param reporter
 * @throws IOException
 * @throws ClassNotFoundException
 * @throws InterruptedException
 */
@SuppressWarnings("unchecked")
private <INKEY, INVALUE, OUTKEY, OUTVALUE> void runNewMapper(final JobConf job, final TaskSplitIndex splitIndex,
        final TaskUmbilicalProtocol umbilical, TaskReporter reporter)
        throws IOException, ClassNotFoundException, InterruptedException {
    // make a task context so we can get the classes
    org.apache.hadoop.mapreduce.TaskAttemptContext taskContext = new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(
            job, getTaskID(), reporter);
    // make a mapper
    org.apache.hadoop.mapreduce.Mapper<INKEY, INVALUE, OUTKEY, OUTVALUE> mapper = (org.apache.hadoop.mapreduce.Mapper<INKEY, INVALUE, OUTKEY, OUTVALUE>) ReflectionUtils
            .newInstance(taskContext.getMapperClass(), job);

    /**
     * inputFormat.createRecordReader() === the real RecordReader
     *
     * inputFormat is an instance of TextInputFormat (the default), and
     * TextInputFormat.createRecordReader() returns a LineRecordReader instance.
     */
    // make the input format
    org.apache.hadoop.mapreduce.InputFormat<INKEY, INVALUE> inputFormat =
            (org.apache.hadoop.mapreduce.InputFormat<INKEY, INVALUE>)
            ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);

    // rebuild the input split
    org.apache.hadoop.mapreduce.InputSplit split = null;
    split = getSplitDetails(new Path(splitIndex.getSplitLocation()), splitIndex.getStartOffset());
    LOG.info("Processing split: " + split);

    /**
     * NewTrackingRecordReader must provide three methods:
     *
     *   nextKeyValue
     *   getCurrentKey
     *   getCurrentValue
     *
     * Their implementations delegate to the three methods of the RecordReader
     * returned by inputFormat.createRecordReader().
     *
     * Default InputFormat:   TextInputFormat
     * Default RecordReader:  LineRecordReader
     *
     * So ultimately NewTrackingRecordReader's three methods are backed by the
     * three methods of the same names in LineRecordReader.
     */
    org.apache.hadoop.mapreduce.RecordReader<INKEY, INVALUE> input =
            new NewTrackingRecordReader<INKEY, INVALUE>(
                    split, inputFormat, reporter, taskContext);

    job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());

    /**
     * Declare an output object used to write out the mapper's key-value pairs.
     */
    org.apache.hadoop.mapreduce.RecordWriter output = null;
    // get an output object
    if (job.getNumReduceTasks() == 0) {

        /**
         * NewDirectOutputCollector writes directly to the final output;
         * this class must have a write method.
         */
        output = new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
    } else {

        /**
         * There is a reducer stage.
         *
         * 1. The map output will definitely be sorted.
         * 2. Will a Partitioner definitely take effect? Not necessarily:
         *    logically it can be considered to have no effect in some cases.
         *
         * NewOutputCollector must also have a write method.
         */
        output = new NewOutputCollector(taskContext, job, umbilical, reporter);
    }

    /**
     * mapContext must contain the three read methods.
     *
     * This answers the question raised when we first looked at the source:
     * who calls the MapContextImpl constructor? mapContext is an instance of
     * MapContextImpl, and its reader is `input`, i.e. the NewTrackingRecordReader.
     *
     * What we can be sure of:
     *
     * 1. mapContext must have a write method.
     * 2. Yet looking at MapContextImpl itself, there is no write method in it.
     *
     * Resolution: mapContext.write() is inherited from MapContextImpl's parent class,
     * and at the very bottom it calls output.write().
     */
    org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE> mapContext =
            new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(
                    job, getTaskID(), input, output, committer, reporter, split);

    /**
     * mapperContext must also contain these three methods:
     *
     *   nextKeyValue
     *   getCurrentKey
     *   getCurrentValue
     *
     * Its concrete behaviour comes from new Context(mapContext), where the wrapped
     * context is mapContext, so these methods, as well as write(key, value),
     * all delegate to mapContext.
     */
    org.apache.hadoop.mapreduce.Mapper<INKEY, INVALUE, OUTKEY, OUTVALUE>.Context mapperContext =
            new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>()
                    .getMapContext(mapContext);

    try {

        input.initialize(split, mapperContext);

        /**
         * The entry point that drives the execution of the whole MapTask.
         *
         * Typical structure of a method like this:
         *
         * 1. The key call sits at the end, or inside the try block.
         * 2. Almost everything else either logs / completes the bookkeeping,
         *    or prepares the arguments for that key call.
         */
        mapper.run(mapperContext);

        mapPhase.complete();
        setPhase(TaskStatus.Phase.SORT);
        statusUpdate(umbilical);
        input.close();
        input = null;
        output.close(mapperContext);
        output = null;

    } finally {
        closeQuietly(input);
        closeQuietly(output, mapperContext);
    }
}
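One thing worth noticing in runNewMapper: taskContext.getMapperClass() and taskContext.getInputFormatClass() simply return whatever the driver configured on the Job; TextInputFormat and LineRecordReader only show up because they are the defaults. A typical driver sketch (paths are placeholders, and WordCountMapper is the hypothetical mapper sketched earlier) that determines what runNewMapper will instantiate:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // what taskContext.getMapperClass() will return inside runNewMapper
        job.setMapperClass(WordCountMapper.class);

        // what taskContext.getInputFormatClass() will return; TextInputFormat is
        // already the default, it is set explicitly here only to make the link visible
        job.setInputFormatClass(TextInputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Swap setInputFormatClass for a custom InputFormat and the `input` object built in runNewMapper ends up wrapping your RecordReader instead of LineRecordReader.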
What we can be sure of: mapperContext must have the four important methods listed above. Keep tracing mapperContext upwards:
/**
 * mapperContext must contain these three methods:
 *
 *   nextKeyValue
 *   getCurrentKey
 *   getCurrentValue
 *
 * Its concrete behaviour comes from new Context(mapContext), where the wrapped
 * context is mapContext. Conclusion: mapContext itself must also contain those
 * three methods, plus a write(key, value) method.
 */
org.apache.hadoop.mapreduce.Mapper<INKEY, INVALUE, OUTKEY, OUTVALUE>.Context mapperContext =
        new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>()
                .getMapContext(mapContext);
Look at WrappedMapper.java:
/**
 * A {@link Mapper} which wraps a given one to allow custom
 * {@link Mapper.Context} implementations.
 */
@InterfaceAudience.Public
@InterfaceStability.Evolving
public class WrappedMapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

    /**
     * Get a wrapped {@link Mapper.Context} for custom implementations.
     *
     * @param mapContext <code>MapContext</code> to be wrapped
     * @return a wrapped <code>Mapper.Context</code> for custom implementations
     */
    public Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context getMapContext(
            MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> mapContext) {
        return new Context(mapContext);
    }

    @InterfaceStability.Evolving
    public class Context extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context {

        protected MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> mapContext;

        public Context(MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> mapContext) {
            this.mapContext = mapContext;
        }

        /**
         * Get the input split for this map.
         */
        public InputSplit getInputSplit() {
            return mapContext.getInputSplit();
        }

        @Override
        public KEYIN getCurrentKey() throws IOException, InterruptedException {
            return mapContext.getCurrentKey();
        }

        @Override
        public VALUEIN getCurrentValue() throws IOException, InterruptedException {
            return mapContext.getCurrentValue();
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            return mapContext.nextKeyValue();
        }

        @Override
        public void write(KEYOUT key, VALUEOUT value) throws IOException, InterruptedException {
            mapContext.write(key, value);
        }

        // ... every other method of this class (getCounter, getStatus, getConfiguration,
        // getInputFormatClass, getMapperClass, getPartitionerClass, progress, and so on)
        // delegates to mapContext in exactly the same one-line fashion; the license header,
        // imports and those delegating getters are omitted here for brevity.
    }
}
Sure enough, this class has those four important methods, and each of them simply delegates to mapContext (per its javadoc, the class exists only to let a custom Mapper.Context be wrapped around a real one). So keep tracing mapContext upwards:
/**
 * mapContext must contain the three read methods.
 *
 * This also answers the question raised when we first looked at the source:
 * who calls the MapContextImpl constructor? mapContext is an instance of
 * MapContextImpl, and its reader is `input`, i.e. the NewTrackingRecordReader.
 *
 * What we can be sure of:
 *
 * 1. mapContext must have a write method.
 * 2. Yet looking at MapContextImpl itself, there is no write method in it.
 *
 * Resolution: mapContext.write() is inherited from MapContextImpl's parent class,
 * and at the very bottom it calls output.write().
 */
org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE> mapContext =
        new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(
                job, getTaskID(), input, output, committer, reporter, split);
mapContext is an instance of this class, MapContextImpl.
To pin the read path down further, the expectation is:

mapContext = new MapContextImpl(..., input, ...);   // input is the NewTrackingRecordReader

mapContext.nextKeyValue() {
    return reader.nextKeyValue();   // reader == input, which wraps the LineRecordReader
                                    // created by inputFormat.createRecordReader()
}
Look at the source of MapContextImpl.java:
public class MapContextImpl<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
        extends TaskInputOutputContextImpl<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
        implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

    private RecordReader<KEYIN, VALUEIN> reader;
    private InputSplit split;

    public MapContextImpl(Configuration conf,
                          TaskAttemptID taskid,
                          RecordReader<KEYIN, VALUEIN> reader,
                          RecordWriter<KEYOUT, VALUEOUT> writer,
                          OutputCommitter committer,
                          StatusReporter reporter,
                          InputSplit split) {

        // the writer, committer and reporter are handed to the parent class via super()
        super(conf, taskid, writer, committer, reporter);

        this.reader = reader;
        this.split = split;
    }

    /**
     * Get the input split for this map.
     */
    public InputSplit getInputSplit() {
        return split;
    }

    @Override
    public KEYIN getCurrentKey() throws IOException, InterruptedException {
        return reader.getCurrentKey();
    }

    @Override
    public VALUEIN getCurrentValue() throws IOException, InterruptedException {
        return reader.getCurrentValue();
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        return reader.nextKeyValue();
    }
}
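Putting the pieces together, the read path behind every call to map() is one chain of delegation (a simplified summary, not literal framework code):

mapper.run(mapperContext)
  -> mapperContext.nextKeyValue()               // WrappedMapper.Context
     -> mapContext.nextKeyValue()               // MapContextImpl
        -> reader.nextKeyValue()                // NewTrackingRecordReader ("input" in runNewMapper)
           -> lineRecordReader.nextKeyValue()   // LineRecordReader, created by TextInputFormat

getCurrentKey() and getCurrentValue() follow the same chain. context.write(key, value) goes the other way: through WrappedMapper.Context to MapContextImpl's parent class TaskInputOutputContextImpl, and from there down to output.write(), i.e. NewOutputCollector when there are reducers or NewDirectOutputCollector when there are none.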