MapReduce Scheduling and Execution Internals (Article Series)
Reposted from: http://blog.csdn.net/jaytalent?viewmode=contents
Preface: This article aims to trace the entire life cycle of a MapReduce job (Job) in Hadoop after it is submitted to the framework, as a summary and a future reference; corrections are welcome. It does not cover Hadoop's architectural design; for that, please see the books and references below. Along the way I also study some of the source code of interest line by line, to strengthen the fundamentals.
Author: Jaytalent
Started: September 9, 2013
References:
[1] Hadoop Internals: In-depth Analysis of MapReduce Architecture Design and Implementation, by Dong Xicheng
[2] The Hadoop 1.0.0 source code
[3] Hadoop Internals: In-depth Analysis of Hadoop Common and HDFS Architecture Design and Implementation, by Cai Bin and Chen Xiangping
The life cycle of a MapReduce job falls roughly into five stages [1]:
1. Job submission and initialization
2. Task scheduling and monitoring
3. Task runtime environment preparation
4. Task execution
5. Job completion
This series works through them one by one. Since job submission happens on the client while initialization happens on the JobTracker, this article covers only the former; the latter is left to the next article.
I. Job Submission and Initialization
Taking the WordCount job as an example, first look at the job submission code fragment:
Job job = new Job(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
This uses the new MapReduce API. The call job.waitForCompletion(true) starts the submission process, which then proceeds through job.submit --> JobClient.submitJobInternal, where the real submission happens.
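For orientation, the entry point is thin. The following is abridged from the Hadoop 1.0.0 org.apache.hadoop.mapreduce.Job class (comments mine): waitForCompletion merely submits the job if it has not been submitted yet, then blocks while monitoring progress.

public boolean waitForCompletion(boolean verbose)
    throws IOException, InterruptedException, ClassNotFoundException {
  if (state == JobState.DEFINE) {
    submit();                                 // not yet submitted: submit now
  }
  if (verbose) {
    jobClient.monitorAndPrintJob(conf, info); // poll and print progress
  } else {
    info.waitForCompletion();                 // just block until the job ends
  }
  return isSuccessful();
}

public void submit()
    throws IOException, InterruptedException, ClassNotFoundException {
  ensureState(JobState.DEFINE);
  setUseNewAPI();
  // info is the RunningJob handle returned by the real submission:
  info = jobClient.submitJobInternal(conf);
  state = JobState.RUNNING;
}

Inside JobClient.submitJobInternal, the main preparatory work is the following.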
1. Obtaining a job ID
JobID jobId = jobSubmitClient.getNewJobId();
where jobSubmitClient is a field of JobClient, declared as:
private JobSubmissionProtocol jobSubmitClient;
Hadoop's RPC mechanism is implemented with dynamic proxies: client code invokes methods on a proxy object handed out by the RPC class, and each call is executed on the remote server. MapReduce defines a family of protocol interfaces for this RPC communication:
a. JobSubmissionProtocol
b. RefreshUserMappingsProtocol
c. RefreshAuthorizationPolicyProtocol
d. AdminOperationsProtocol
e. InterTrackerProtocol
f. TaskUmbilicalProtocol
The first four protocols serve clients; the last two are internal to MapReduce: InterTrackerProtocol is used between TaskTrackers and the JobTracker, and TaskUmbilicalProtocol between running tasks and their TaskTracker. The getNewJobId method used here is defined by JobSubmissionProtocol:
/**
 * Allocate a name for the job.
 * @return a unique job name for submitting jobs.
 * @throws IOException
 */
public JobID getNewJobId() throws IOException;
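The field jobSubmitClient is exactly such a dynamic proxy. Below is a minimal sketch of how JobClient obtains it; the real JobClient.createRPCProxy additionally passes the UserGroupInformation and a SocketFactory, which are omitted here.

// Sketch: every call on the returned proxy is marshalled, sent to the
// JobTracker over Hadoop RPC, executed there, and the result sent back.
InetSocketAddress addr = JobTracker.getAddress(conf);
JobSubmissionProtocol jobSubmitClient =
    (JobSubmissionProtocol) RPC.getProxy(JobSubmissionProtocol.class,
                                         JobSubmissionProtocol.versionID,
                                         addr, conf);
JobID jobId = jobSubmitClient.getNewJobId();  // runs on the JobTracker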
2. Uploading job files
Based on the job configuration, JobClient uploads the files the job needs to the JobTracker's file system, normally HDFS. The configuration is held by a JobConf object; in the new API the JobConf is a component of the JobContext object, and the job class Job inherits from JobContext.
Before the upload, the necessary directories must be created on HDFS. The upload itself starts from this call in JobClient.submitJobInternal:
copyAndConfigureFiles(jobCopy, submitJobDir);
Inside copyAndConfigureFiles, the file lists are first read from the configuration and the target directories derived:
// Retrieve command line arguments placed into the JobConf
// by GenericOptionsParser.
String files = job.get("tmpfiles");
String libjars = job.get("tmpjars");
String archives = job.get("tmparchives");
// Create a number of filenames in the JobTracker's fs namespace
FileSystem fs = submitJobDir.getFileSystem(job);
submitJobDir = fs.makeQualified(submitJobDir);
FsPermission mapredSysPerms = new FsPermission(JobSubmissionFiles.JOB_DIR_PERMISSION);
FileSystem.mkdirs(fs, submitJobDir, mapredSysPerms);
Path filesDir = JobSubmissionFiles.getJobDistCacheFiles(submitJobDir);
Path archivesDir = JobSubmissionFiles.getJobDistCacheArchives(submitJobDir);
Path libjarsDir = JobSubmissionFiles.getJobDistCacheLibjars(submitJobDir);
With these path names, the directories are created on HDFS and the files copied into them:
// add all the command line files/ jars and archive
// first copy them to jobtrackers filesystem
if (files != null) {
  FileSystem.mkdirs(fs, filesDir, mapredSysPerms);
  String[] fileArr = files.split(",");
  for (String tmpFile : fileArr) {
    URI tmpURI;
    tmpURI = new URI(tmpFile);
    Path tmp = new Path(tmpURI);
    Path newPath = copyRemoteFiles(fs, filesDir, tmp, job, replication);
    URI pathURI = getPathURI(newPath, tmpURI.getFragment());
    DistributedCache.addCacheFile(pathURI, job);
    DistributedCache.createSymlink(job);
  }
}
if (libjars != null) {
  FileSystem.mkdirs(fs, libjarsDir, mapredSysPerms);
  String[] libjarsArr = libjars.split(",");
  for (String tmpjars : libjarsArr) {
    Path tmp = new Path(tmpjars);
    Path newPath = copyRemoteFiles(fs, libjarsDir, tmp, job, replication);
    DistributedCache.addArchiveToClassPath
        (new Path(newPath.toUri().getPath()), job, fs);
  }
}
if (archives != null) {
  FileSystem.mkdirs(fs, archivesDir, mapredSysPerms);
  String[] archivesArr = archives.split(",");
  for (String tmpArchives : archivesArr) {
    URI tmpURI;
    tmpURI = new URI(tmpArchives);
    Path tmp = new Path(tmpURI);
    Path newPath = copyRemoteFiles(fs, archivesDir, tmp, job, replication);
    URI pathURI = getPathURI(newPath, tmpURI.getFragment());
    DistributedCache.addCacheArchive(pathURI, job);
    DistributedCache.createSymlink(job);
  }
}
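The "tmpfiles", "tmpjars" and "tmparchives" properties consumed above are not set by user code directly: GenericOptionsParser (in org.apache.hadoop.util) records them when the job is launched with the generic -files, -libjars and -archives options. A minimal driver sketch, reusing the WordCount names from above (the file names dict.txt and extra.jar are hypothetical):

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  // Consumes generic options, e.g.
  //   hadoop jar wc.jar WordCount -files dict.txt -libjars extra.jar in out
  // and stores them in conf as "tmpfiles"/"tmpjars"/"tmparchives".
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  Job job = new Job(conf, "word count");
  // ... setJarByClass / setMapperClass / setReducerClass etc., as above ...
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}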
Finally, the job's jar file is copied to HDFS:
String originalJarPath = job.getJar();
if (originalJarPath != null) {  // copy jar to JobTracker's fs
  // use jar name if job is not named.
  if ("".equals(job.getJobName())) {
    job.setJobName(new Path(originalJarPath).getName());
  }
  Path submitJarFile = JobSubmissionFiles.getJobJar(submitJobDir);
  job.setJar(submitJarFile.toString());
  fs.copyFromLocalFile(new Path(originalJarPath), submitJarFile);
  fs.setReplication(submitJarFile, replication);
  fs.setPermission(submitJarFile,
      new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION));
}
Note that every uploaded resource is registered through the DistributedCache: ordinary files via DistributedCache.addCacheFile, library jars via DistributedCache.addArchiveToClassPath, and archives via DistributedCache.addCacheArchive. This registration is what later lets each TaskTracker localize the resources before running the job's tasks.
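On the task side these resources are localized to the TaskTracker's disk before the task runs, and can be looked up through the same DistributedCache class (org.apache.hadoop.filecache). A sketch; the setup logic here is illustrative, not from the WordCount source:

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private Path[] localFiles;

  @Override
  protected void setup(Context context)
      throws IOException, InterruptedException {
    // Local paths of everything registered via DistributedCache.addCacheFile
    // at submission time, after the TaskTracker has localized it.
    localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
  }
  // map() can now open localFiles[i] with java.io as ordinary local files.
}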
3. Generating InputSplit files
JobClient calls the InputFormat's getSplits method to compute InputSplit information for the user's input files. The split data is written into the job's submission directory on HDFS (the job.split and job.splitmetainfo files), the number of splits becomes the number of map tasks, and the job configuration itself is then serialized to job.xml in the same directory:
// Create the splits for the job
FileSystem fs = submitJobDir.getFileSystem(jobCopy);
int maps = writeSplits(context, submitJobDir);
jobCopy.setNumMapTasks(maps);

// Write job file to JobTracker's fs
FSDataOutputStream out =
    FileSystem.create(fs, submitJobFile,
        new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION));
try {
  jobCopy.writeXml(out);
} finally {
  out.close();
}
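For FileInputFormat-based jobs, the size of each split, and hence the value of maps above, follows one small rule, essentially this method of the new-API FileInputFormat (minSize and maxSize come from mapred.min.split.size and mapred.max.split.size):

// A split is one HDFS block unless the configured bounds override it.
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}

So with default settings, one map task is created per HDFS block of input.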
With that, the client-side preparation is complete. Next, the job is submitted to the JobTracker; see the next article in this series.