[Hadoop Source Code Reading] (6) MapReduce: The MapTask Class
MapTask extends Task. Its most important method is run(), which carries out the map task.
run() first creates and starts a TaskReporter, then asks the JobConf via getUseNewMapper() whether the new API is in use (that setting was covered in [Hadoop Source Code Reading] (3) MapReduce: The Job Class). It then calls the initialize() method inherited from Task to initialize the task, and finally, depending on the task type, executes runJobCleanupTask(), runJobSetupTask(), runTaskCleanupTask(), or the actual Mapper. When running the Mapper, it picks either the new or the old MapReduce code path according to that configuration flag.
```java
@Override
public void run(final JobConf job, final TaskUmbilicalProtocol umbilical)
    throws IOException, ClassNotFoundException, InterruptedException {
  this.umbilical = umbilical;

  // start thread that will handle communication with parent
  TaskReporter reporter = new TaskReporter(getProgress(), umbilical, jvmContext);
  reporter.startCommunicationThread();

  // comes from JobConf: the new-API JobContext wraps a JobConf, and Job has a
  // setUseNewAPI() method that Job.submit() calls, so waitForCompletion() ->
  // submit() records that the new API is in use, and that flag is read back here
  boolean useNewApi = job.getUseNewMapper();

  // task initialization via Task.initialize(): jobContext, taskContext,
  // output path, and so on
  initialize(job, getJobID(), reporter, useNewApi);

  // check if it is a cleanupJobTask
  if (jobCleanup) {
    runJobCleanupTask(umbilical, reporter);
    return;
  }
  if (jobSetup) {
    runJobSetupTask(umbilical, reporter);
    return;
  }
  if (taskCleanup) {
    runTaskCleanupTask(umbilical, reporter);
    return;
  }

  // run the Mapper through whichever MapReduce API version applies
  if (useNewApi) {
    runNewMapper(job, splitMetaInfo, umbilical, reporter);
  } else {
    runOldMapper(job, splitMetaInfo, umbilical, reporter);
  }
  done(umbilical, reporter);
}
```
runNewMapper corresponds to the new MapReduce API, while runOldMapper corresponds to the old one.
runNewMapper first creates the TaskAttemptContext, the Mapper instance, the InputFormat, the InputSplit, and the RecordReader. It then builds the output collector according to whether the job has reduce tasks: NewDirectOutputCollector (no reducers) or NewOutputCollector (with reducers). Next it calls input.initialize() to get the input side ready, setting up the RecordReader, the input path, and so on. Then comes the key part: mapper.run(), which is exactly the Mapper.run() method described in [Hadoop Source Code Reading] (2) MapReduce: The Mapper Class. From there the K/V pairs are read one after another, and the two sides connect.
```java
@SuppressWarnings("unchecked")
private <INKEY,INVALUE,OUTKEY,OUTVALUE>
void runNewMapper(final JobConf job,
                  final TaskSplitIndex splitIndex,
                  final TaskUmbilicalProtocol umbilical,
                  TaskReporter reporter
                  ) throws IOException, ClassNotFoundException,
                           InterruptedException {
  // make a task context so we can get the classes
  org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
    new org.apache.hadoop.mapreduce.TaskAttemptContext(job, getTaskID());
  // make a mapper
  org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =
    (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
      ReflectionUtils.newInstance(taskContext.getMapperClass(), job);
  // make the input format
  org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
    (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
      ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
  // rebuild the input split
  org.apache.hadoop.mapreduce.InputSplit split = null;
  split = getSplitDetails(new Path(splitIndex.getSplitLocation()),
                          splitIndex.getStartOffset());

  org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
    new NewTrackingRecordReader<INKEY,INVALUE>
        (split, inputFormat, reporter, job, taskContext);

  job.setBoolean("mapred.skip.on", isSkipping());
  org.apache.hadoop.mapreduce.RecordWriter output = null;
  org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context
       mapperContext = null;
  try {
    Constructor<org.apache.hadoop.mapreduce.Mapper.Context> contextConstructor =
      org.apache.hadoop.mapreduce.Mapper.Context.class.getConstructor
        (new Class[]{org.apache.hadoop.mapreduce.Mapper.class,
                     Configuration.class,
                     org.apache.hadoop.mapreduce.TaskAttemptID.class,
                     org.apache.hadoop.mapreduce.RecordReader.class,
                     org.apache.hadoop.mapreduce.RecordWriter.class,
                     org.apache.hadoop.mapreduce.OutputCommitter.class,
                     org.apache.hadoop.mapreduce.StatusReporter.class,
                     org.apache.hadoop.mapreduce.InputSplit.class});

    // get an output object
    if (job.getNumReduceTasks() == 0) {
      output =
        new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
    } else {
      output = new NewOutputCollector(taskContext, job, umbilical, reporter);
    }

    mapperContext = contextConstructor.newInstance(mapper, job, getTaskID(),
                                                   input, output, committer,
                                                   reporter, split);

    input.initialize(split, mapperContext);
    mapper.run(mapperContext);
    input.close();
    output.close(mapperContext);
  } catch (NoSuchMethodException e) {
    throw new IOException("Can't find Context constructor", e);
  } catch (InstantiationException e) {
    throw new IOException("Can't create Context", e);
  } catch (InvocationTargetException e) {
    throw new IOException("Can't invoke Context constructor", e);
  } catch (IllegalAccessException e) {
    throw new IOException("Can't invoke Context constructor", e);
  }
}
```
As for which Mapper class gets run: it is normally the one we register with job.setMapperClass(SelectGradeMapper.class). How is that setting read back, and what is the default if nothing is set? Trace the calls below.
MapTask.runNewMapper()
=> taskContext.getMapperClass()   // taskContext is a TaskAttemptContext; used when runNewMapper creates the mapper
=> JobContext.getMapperClass()
=> JobConf.getClass(MAP_CLASS_ATTR, Mapper.class)
=> Configuration.getClass(name, defaultValue)
Following this chain of calls, the default turns out to be Mapper.class, and the lookup path is plain to see.
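Here is a quick sketch of that lookup from the user's side, assuming the stock new-API Job/Configuration classes; SelectGradeMapper is just a trivial stand-in for whatever mapper you register:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class MapperLookupDemo {
  // A trivial stand-in for the user's mapper (SelectGradeMapper in the text).
  public static class SelectGradeMapper
      extends Mapper<LongWritable, Text, Text, Text> { }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration());

    // Nothing set yet: getMapperClass() falls back to the default, Mapper.class.
    System.out.println(job.getMapperClass());   // class org.apache.hadoop.mapreduce.Mapper

    // setMapperClass() stores the class name in the job's Configuration, which is
    // what taskContext.getMapperClass() reads back inside runNewMapper().
    job.setMapperClass(SelectGradeMapper.class);
    System.out.println(job.getMapperClass());   // ...MapperLookupDemo$SelectGradeMapper
  }
}
```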
Now look more closely at Configuration.getClass():
```java
public Class<?> getClass(String name, Class<?> defaultValue) {
  String valueString = get(name);
  if (valueString == null)
    return defaultValue;
  try {
    return getClassByName(valueString);
  } catch (ClassNotFoundException e) {
    throw new RuntimeException(e);
  }
}
```
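A small usage sketch of that fallback behavior; the key "my.demo.class" is made up purely for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.io.compress.GzipCodec;

public class GetClassDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Key not set: the default is returned.
    System.out.println(conf.getClass("my.demo.class", DefaultCodec.class)); // DefaultCodec

    // Key set to a loadable class name: that class is returned.
    conf.set("my.demo.class", GzipCodec.class.getName());
    System.out.println(conf.getClass("my.demo.class", DefaultCodec.class)); // GzipCodec

    // Key set to something not on the classpath: getClassByName() fails and the
    // ClassNotFoundException is rethrown as a RuntimeException, as in the source above.
    conf.set("my.demo.class", "no.such.Clazz");
    try {
      conf.getClass("my.demo.class", DefaultCodec.class);
    } catch (RuntimeException expected) {
      System.out.println("lookup failed: " + expected.getMessage());
    }
  }
}
```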
After the Mapper finishes, closing the RecordReader, the output collector, and the other resources wraps up the task.
Now let us focus on the mapper.run(mapperContext) call in runNewMapper() above. As noted in the Mapper.class article, this mapperContext is used both to read K/V pairs from the input split and to write out the result K/V pairs. It is constructed by:
```java
mapperContext = contextConstructor.newInstance(mapper, job, getTaskID(),
                                               input, output, committer,
                                               reporter, split);
```
The map() method in Mapper keeps calling context.write(K, V) to emit output. Let's see how that call proceeds. First, the Context class hierarchy: Mapper.Context extends MapContext, which extends TaskInputOutputContext, which extends TaskAttemptContext, which extends JobContext.
The write() method comes from TaskInputOutputContext:
```java
public void write(KEYOUT key, VALUEOUT value
                  ) throws IOException, InterruptedException {
  output.write(key, value);
}
```
The output field it delegates to is a RecordWriter:

```java
public abstract class RecordWriter<K, V> {
  public abstract void write(K key, V value
                             ) throws IOException, InterruptedException;

  public abstract void close(TaskAttemptContext context
                             ) throws IOException, InterruptedException;
}
```
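To make that contract concrete, here is a minimal hand-rolled RecordWriter that writes tab-separated lines; the output stream is assumed to come from some OutputFormat, and the class is purely illustrative:

```java
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Illustrative only: a RecordWriter that formats each pair as "key<TAB>value\n".
public class TabSeparatedRecordWriter<K, V> extends RecordWriter<K, V> {
  private final DataOutputStream out;

  public TabSeparatedRecordWriter(DataOutputStream out) {
    this.out = out;
  }

  @Override
  public void write(K key, V value) throws IOException, InterruptedException {
    out.writeBytes(key + "\t" + value + "\n");   // one record per context.write() call
  }

  @Override
  public void close(TaskAttemptContext context)
      throws IOException, InterruptedException {
    out.close();                                 // release the underlying stream
  }
}
```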
When the job has reduce tasks, the RecordWriter behind context.write() is NewOutputCollector:

```java
private class NewOutputCollector<K,V>
    extends org.apache.hadoop.mapreduce.RecordWriter<K,V> {
  private final MapOutputCollector<K,V> collector;
  private final org.apache.hadoop.mapreduce.Partitioner<K,V> partitioner;
  private final int partitions;

  @SuppressWarnings("unchecked")
  NewOutputCollector(org.apache.hadoop.mapreduce.JobContext jobContext,
                     JobConf job,
                     TaskUmbilicalProtocol umbilical,
                     TaskReporter reporter
                     ) throws IOException, ClassNotFoundException {
    collector = new MapOutputBuffer<K,V>(umbilical, job, reporter);
    partitions = jobContext.getNumReduceTasks();
    if (partitions > 0) {
      partitioner = (org.apache.hadoop.mapreduce.Partitioner<K,V>)
        ReflectionUtils.newInstance(jobContext.getPartitionerClass(), job);
    } else {
      partitioner = new org.apache.hadoop.mapreduce.Partitioner<K,V>() {
        @Override
        public int getPartition(K key, V value, int numPartitions) {
          return -1;
        }
      };
    }
  }

  @Override
  public void write(K key, V value) throws IOException, InterruptedException {
    collector.collect(key, value,
                      partitioner.getPartition(key, value, partitions));
  }

  @Override
  public void close(TaskAttemptContext context
                    ) throws IOException, InterruptedException {
    try {
      collector.flush();
    } catch (ClassNotFoundException cnf) {
      throw new IOException("can't find class ", cnf);
    }
    collector.close();
  }
}
```
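As write() shows, every record's partition comes from the job's Partitioner via getPartition(key, value, partitions). A minimal custom Partitioner, just to make that call concrete (the key/value types are illustrative; you would register it with job.setPartitionerClass(...)):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative only: route keys by their first character. The return value must
// always fall in [0, numPartitions), i.e. one bucket per reduce task.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
  }
}
```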
From this code we can see that NewOutputCollector is essentially a wrapper around MapOutputBuffer. MapOutputBuffer already existed in the old API; it is complex but crucial, so we will set it aside for now; in short, it collects the output K/V pairs. It implements the MapOutputCollector interface:
```java
interface MapOutputCollector<K, V> {
  public void collect(K key, V value, int partition
                      ) throws IOException, InterruptedException;
  public void close() throws IOException, InterruptedException;
  public void flush() throws IOException, InterruptedException,
                             ClassNotFoundException;
}
```
This interface tells us that a collector must implement the collect(), close(), and flush() methods.
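The real implementations are nested inside MapTask, so they cannot be subclassed from outside; the toy standalone class below merely mirrors the collect/flush/close shape to show how the three calls fit together:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: an in-memory "collector" with the same three-method shape.
// The real MapOutputBuffer sorts by partition/key and spills to disk instead.
class ToyCollector<K, V> {
  private final List<String> buffer = new ArrayList<String>();

  public void collect(K key, V value, int partition) {
    buffer.add(partition + "\t" + key + "\t" + value);   // called once per map output record
  }

  public void flush() {
    for (String record : buffer) {
      System.out.println(record);                        // a real collector spills/merges here
    }
    buffer.clear();
  }

  public void close() {
    buffer.clear();                                      // release buffered state
  }
}
```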
Now for a simpler one: NewDirectOutputCollector. It is used when there are no reduce tasks. It mainly obtains a RecordWriter from the job's OutputFormat and then writes output with that RecordWriter's write() method, which ties the map output directly to the output format we configured.
```java
private class NewDirectOutputCollector<K,V>
    extends org.apache.hadoop.mapreduce.RecordWriter<K,V> {
  private final org.apache.hadoop.mapreduce.RecordWriter out;
  private final TaskReporter reporter;
  private final Counters.Counter mapOutputRecordCounter;
  private final Counters.Counter fileOutputByteCounter;
  private final Statistics fsStats;

  @SuppressWarnings("unchecked")
  NewDirectOutputCollector(org.apache.hadoop.mapreduce.JobContext jobContext,
      JobConf job, TaskUmbilicalProtocol umbilical, TaskReporter reporter)
      throws IOException, ClassNotFoundException, InterruptedException {
    this.reporter = reporter;
    Statistics matchedStats = null;
    // outputFormat is a field of the enclosing Task; an inner class can access
    // the outer class's members
    if (outputFormat instanceof org.apache.hadoop.mapreduce.lib.output.FileOutputFormat) {
      matchedStats = getFsStatistics(org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
          .getOutputPath(jobContext), job);
    }
    fsStats = matchedStats;
    mapOutputRecordCounter =
      reporter.getCounter(MAP_OUTPUT_RECORDS);
    fileOutputByteCounter = reporter
      .getCounter(org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.Counter.BYTES_WRITTEN);

    long bytesOutPrev = getOutputBytes(fsStats);
    // the key line: obtain the RecordWriter from the configured OutputFormat
    out = outputFormat.getRecordWriter(taskContext);
    long bytesOutCurr = getOutputBytes(fsStats);
    fileOutputByteCounter.increment(bytesOutCurr - bytesOutPrev);
  }

  @Override
  @SuppressWarnings("unchecked")
  public void write(K key, V value)
      throws IOException, InterruptedException {
    reporter.progress();                          // report progress
    long bytesOutPrev = getOutputBytes(fsStats);
    out.write(key, value);                        // emit one record through the OutputFormat's writer
    long bytesOutCurr = getOutputBytes(fsStats);
    fileOutputByteCounter.increment(bytesOutCurr - bytesOutPrev);  // update output byte count
    mapOutputRecordCounter.increment(1);          // update output record count
  }

  @Override
  public void close(TaskAttemptContext context)
      throws IOException, InterruptedException {
    reporter.progress();
    if (out != null) {
      long bytesOutPrev = getOutputBytes(fsStats);
      out.close(context);
      long bytesOutCurr = getOutputBytes(fsStats);
      fileOutputByteCounter.increment(bytesOutCurr - bytesOutPrev);
    }
  }

  private long getOutputBytes(Statistics stats) {
    return stats == null ? 0 : stats.getBytesWritten();
  }
}
```
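Putting it together: a map-only job (zero reduce tasks) is what makes MapTask pick NewDirectOutputCollector, so every context.write() in the mapper goes straight through the configured OutputFormat's RecordWriter. A minimal sketch, with made-up input/output paths and a pass-through mapper:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// A map-only job: with zero reduce tasks, MapTask uses NewDirectOutputCollector,
// so each record is written by TextOutputFormat's RecordWriter.
public class MapOnlyJob {
  public static class PassThroughMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(key, value);   // emitted directly via the OutputFormat's RecordWriter
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "map-only");
    job.setJarByClass(MapOnlyJob.class);
    job.setMapperClass(PassThroughMapper.class);
    job.setNumReduceTasks(0);                        // no reducers -> NewDirectOutputCollector
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```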
There is also the old MapReduce API path driven by runOldMapper(), which we will not cover here.