A Brief Look at the Source Code of FairScheduler Job Initialization

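The previous article covered the submitJob() method in JobTracker, which eventually calls listener.jobAdded(job) to register the job with the TaskScheduler for scheduling. Today I'll pick up from there. Hadoop's default TaskScheduler is JobQueueTaskScheduler, which schedules jobs in FIFO (first in, first out) order. There are two other scheduler classes, FairScheduler and CapacityTaskScheduler. They are not part of Hadoop core, but Hadoop ships them with its libraries: the jars can be found under lib in the Hadoop directory and the source under src/contrib. This article mainly walks through FairScheduler.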

As mentioned above, JobTracker ends up registering the job with the job listeners, so let's look at FairScheduler's JobListener.

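1.FairScheduler.JobListener.jobAdded(): This method is fairly simple. JobSchedulable mapSched = ReflectionUtils.newInstance(conf.getClass("mapred.jobtracker.jobSchedulable", JobSchedulable.class, JobSchedulable.class), conf) creates a JobSchedulable through reflection. This is done twice, giving two objects of the default JobSchedulable class, mapSched and redSched, and each is then initialized with init(), which is also simple. infos.put(job, info) adds the job to infos (which holds all JobInProgress objects), and poolMgr.addJob(job) adds the job to the matching PoolSchedulable, looking up the pool by the configured pool name. The important part is the update() call at the end, which we look at next.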

public void jobAdded(JobInProgress job) {
      synchronized (FairScheduler.this) {
        eventLog.log("JOB_ADDED", job.getJobID());
        JobSchedulable mapSched = ReflectionUtils.newInstance(
            conf.getClass("mapred.jobtracker.jobSchedulable", JobSchedulable.class,
                JobSchedulable.class), conf);
        mapSched.init(FairScheduler.this, job, TaskType.MAP);

        JobSchedulable redSched = ReflectionUtils.newInstance(
            conf.getClass("mapred.jobtracker.jobSchedulable", JobSchedulable.class,
                JobSchedulable.class), conf);
        redSched.init(FairScheduler.this, job, TaskType.REDUCE);

        JobInfo info = new JobInfo(mapSched, redSched);
        infos.put(job, info);
        poolMgr.addJob(job); // Also adds job into the right PoolScheduable
        update();
      }
    }
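The reflection step can be tried without Hadoop on the classpath. Below is a minimal sketch in plain Java that only mimics the idea behind conf.getClass(...) plus ReflectionUtils.newInstance(...). The `SimpleSchedulable` interface, the `DefaultSimpleSchedulable` class and the `newInstance` helper are hypothetical names of my own, not Hadoop APIs.

```java
import java.lang.reflect.Constructor;

// Hypothetical stand-ins, not Hadoop classes: a pluggable type and its default.
interface SimpleSchedulable {
    String name();
}

class DefaultSimpleSchedulable implements SimpleSchedulable {
    public String name() { return "default"; }
}

final class PluggableClassSketch {
    // Resolve the configured class name (or fall back to the default class)
    // and create an instance through its no-argument constructor.
    static <T> T newInstance(String configuredName, Class<T> base,
                             Class<? extends T> defaultClass) {
        try {
            Class<? extends T> cls = (configuredName == null)
                    ? defaultClass
                    : Class.forName(configuredName).asSubclass(base);
            Constructor<? extends T> ctor = cls.getDeclaredConstructor();
            ctor.setAccessible(true);
            return ctor.newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + configuredName, e);
        }
    }

    public static void main(String[] args) {
        // Nothing configured: two separate default instances, one per task type.
        SimpleSchedulable mapSched =
                newInstance(null, SimpleSchedulable.class, DefaultSimpleSchedulable.class);
        SimpleSchedulable redSched =
                newInstance(null, SimpleSchedulable.class, DefaultSimpleSchedulable.class);
        System.out.println(mapSched.name() + ", distinct objects: " + (mapSched != redSched));
    }
}
```

The point of the pattern is that the scheduler never names a concrete class in code, so a different implementation can be swapped in through configuration alone.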

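2.FairScheduler.update(): Skipping the parts I don't understand, go straight to poolMgr.reloadAllocsIfNecessary(). This method reads FairScheduler's allocation file (fair-scheduler.xml), whose location is set by the mapred.fairscheduler.allocation.file parameter. Whether the file is reloaded is decided from its last modification time plus ALLOC_RELOAD_INTERVAL, and loading it is a straightforward read of the XML file. Back in update(), once the allocation file has been loaded the method iterates over infos (which holds every JobInProgress known to FairScheduler), collects the jobs that have succeeded, failed, or been killed, and removes them, which also removes them from their pool. Next comes updateRunnability(), which uses the per-user and per-pool maximum job counts (userMaxJobs and poolMaxJobs) to decide whether a job may start.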

 List<JobInProgress> toRemove = new ArrayList<JobInProgress>();
      for (JobInProgress job: infos.keySet()) { 
        int runState = job.getStatus().getRunState();
        if (runState == JobStatus.SUCCEEDED || runState == JobStatus.FAILED
          || runState == JobStatus.KILLED) {
            toRemove.add(job);
        }
      }
      for (JobInProgress job: toRemove) {
        jobNoLongerRunning(job);
      }
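The reload decision described above can be sketched as a small pure function. This is only an illustration under my own assumptions: the method name `shouldReload`, its parameters, and the 10 second interval are hypothetical, and the real check in PoolManager may be structured differently.

```java
final class AllocReloadSketch {
    // Hypothetical value; the article names ALLOC_RELOAD_INTERVAL but not its value.
    static final long RELOAD_INTERVAL_MS = 10_000L;

    // Reload only if the file changed after the last load and the change is at
    // least RELOAD_INTERVAL_MS old, so a file that is still being edited is skipped.
    static boolean shouldReload(long nowMs, long fileLastModifiedMs, long lastLoadedMs) {
        boolean changedSinceLoad = fileLastModifiedMs > lastLoadedMs;
        boolean settled = nowMs >= fileLastModifiedMs + RELOAD_INTERVAL_MS;
        return changedSinceLoad && settled;
    }

    public static void main(String[] args) {
        // File modified at t=5s, last loaded at t=1s.
        System.out.println(shouldReload(20_000L, 5_000L, 1_000L)); // change is old enough
        System.out.println(shouldReload(6_000L, 5_000L, 1_000L));  // change is too recent
    }
}
```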

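3.FairScheduler.updateRunnability(): The first step marks every job remaining in infos as not runnable (succeeded, failed, and killed jobs were already removed in update()). The jobs are then sorted with Collections.sort(jobs, new FifoJobComparator()), i.e. in FIFO order. (This puzzled me; presumably it means that when a limit is reached, the jobs that come first in that order are the ones admitted.) The method then iterates over the jobs and, based on the maximum number of jobs allowed for the submitting user and for the target pool, decides whether to count the job in (the counts are kept in two maps, userJobs and poolJobs). If the job's state is RUNNING, jobInfo.runnable is set to true; if it is PREP (preparing), the job is initialized. Note that only jobs in the RUNNING or PREP state are handled here. jobInitializer.initJob(jobInfo, job) performs the initialization using a JDK thread pool: the work is submitted to the pool, the pool decides when to execute it, and in the end its run() method is invoked. That run() method calls ttm.initJob(job), where ttm is the JobTracker, so we now go back to JobTracker.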

 if (userCount < poolMgr.getUserMaxJobs(user) &&
          poolCount < poolMgr.getPoolMaxJobs(pool)) {
        if (job.getStatus().getRunState() == JobStatus.RUNNING ||
            job.getStatus().getRunState() == JobStatus.PREP) {
          userJobs.put(user, userCount + 1);
          poolJobs.put(pool, poolCount + 1);
          JobInfo jobInfo = infos.get(job);
          if (job.getStatus().getRunState() == JobStatus.RUNNING) {
            jobInfo.runnable = true;
          } else {
            // The job is in the PREP state. Give it to the job initializer
            // for initialization if we have not already done it.
            if (jobInfo.needsInitializing) {
              jobInfo.needsInitializing = false;
              jobInitializer.initJob(jobInfo, job);
            }
          }
        }
      }
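To see how the FIFO ordering interacts with the per-user and per-pool limits, here is a small self-contained sketch. The `Job` record and the `admit` method are hypothetical simplifications of my own (the real code works on JobInProgress and looks the limits up per user and per pool through poolMgr), and the sort here uses start time only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RunnabilitySketch {
    // Hypothetical, simplified job: only the fields the limit check needs.
    static final class Job {
        final String id;
        final String user;
        final String pool;
        final long startTime;

        Job(String id, String user, String pool, long startTime) {
            this.id = id;
            this.user = user;
            this.pool = pool;
            this.startTime = startTime;
        }
    }

    // Returns the ids of the jobs admitted under the user and pool limits.
    static List<String> admit(List<Job> jobs, int userMaxJobs, int poolMaxJobs) {
        List<Job> sorted = new ArrayList<>(jobs);
        // FIFO: earlier start time first (simplification of FifoJobComparator).
        sorted.sort(Comparator.comparingLong((Job j) -> j.startTime));
        Map<String, Integer> userJobs = new HashMap<>();
        Map<String, Integer> poolJobs = new HashMap<>();
        List<String> admitted = new ArrayList<>();
        for (Job job : sorted) {
            int userCount = userJobs.getOrDefault(job.user, 0);
            int poolCount = poolJobs.getOrDefault(job.pool, 0);
            if (userCount < userMaxJobs && poolCount < poolMaxJobs) {
                userJobs.put(job.user, userCount + 1);
                poolJobs.put(job.pool, poolCount + 1);
                admitted.add(job.id);
            }
        }
        return admitted;
    }

    public static void main(String[] args) {
        List<Job> jobs = Arrays.asList(
                new Job("j3", "alice", "poolA", 30L),
                new Job("j1", "alice", "poolA", 10L),
                new Job("j2", "bob", "poolA", 20L));
        // With at most 1 job per user, alice's later job j3 has to wait.
        System.out.println(admit(jobs, 1, 2));
    }
}
```

Because the jobs are visited in FIFO order, a later job never takes a place under the limit ahead of an earlier one, which seems to be the reason for the sort.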

4.JobTracker.initJob(): This mainly calls job.initTasks(), so we move on to JobInProgress.initTasks().

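5.JobInProgress.initTasks(): It first sets the job's priority with setPriority(this.priority), then reads the split meta-info file to obtain the split information. SplitMetaInfoReader.readSplitMetaInfo() is the method JobInProgress uses for this; reading mirrors the way the file was written and is fairly simple. With the split information loaded, one map task (a TaskInProgress object) is created per split. Next, nonRunningMapCache = createCache(splits, maxLevel) takes each split's location information and groups the map tasks by the hosts in those locations, storing the result in a cache. Then, according to the configured number of reduces, the reduce tasks (also TaskInProgress objects) are created and added to the nonRunningReduces queue, and completedMapsForReduceSlowstart (the number of map tasks that must finish before reduce tasks start) is computed from the mapred.reduce.slowstart.completed.maps parameter (a fraction; the default is 5%). Finally, two setup tasks and two cleanup tasks are created, one map task and one reduce task of each kind. That completes initTasks(), which more or less completes JobTracker.initJob(), and with it FairScheduler.updateRunnability(). Back to FairScheduler.update().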
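As a quick worked example of the slow-start threshold, the sketch below computes how many maps must finish before reduces may start. The class and method names are my own, and rounding the product up with Math.ceil is an assumption on my part; the article only says the value is derived from the fraction and the number of map tasks.

```java
final class ReduceSlowstartSketch {
    // Threshold = fraction of maps that must be done, rounded up (assumed rounding).
    static int completedMapsForReduceSlowstart(double fraction, int numMapTasks) {
        return (int) Math.ceil(fraction * numMapTasks);
    }

    // Reduces may be scheduled once finished plus failed maps reach the threshold.
    static boolean canScheduleReduces(int finishedMaps, int failedMapTips, int threshold) {
        return finishedMaps + failedMapTips >= threshold;
    }

    public static void main(String[] args) {
        int threshold = completedMapsForReduceSlowstart(0.05, 101);
        System.out.println("threshold for 101 maps at 5%: " + threshold);
        System.out.println("reduces allowed after 3 maps: " + canScheduleReduces(3, 0, threshold));
    }
}
```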

6.FairScheduler.update():

for (Pool pool: poolMgr.getPools()) {
        pool.getMapSchedulable().updateDemand();
        pool.getReduceSchedulable().updateDemand();
      }
      
      // Compute fair shares based on updated demands
      List<PoolSchedulable> mapScheds = getPoolSchedulables(TaskType.MAP);
      List<PoolSchedulable> reduceScheds = getPoolSchedulables(TaskType.REDUCE);
      SchedulingAlgorithms.computeFairShares(
          mapScheds, clusterStatus.getMaxMapTasks());
      SchedulingAlgorithms.computeFairShares(
          reduceScheds, clusterStatus.getMaxReduceTasks());
      
      // Use the computed shares to assign shares within each pool
      for (Pool pool: poolMgr.getPools()) {
        pool.getMapSchedulable().redistributeShare();
        pool.getReduceSchedulable().redistributeShare();
      }
      
      if (preemptionEnabled)
        updatePreemptionVariables();
    }

I can't follow this part yet, so I'll stop here for today and dig into it next time. I'm worn out.

---------------------------------------------------------------------------------

Today I'll continue with the questions left over from last time.

for (Pool pool: poolMgr.getPools()) {
        pool.getMapSchedulable().updateDemand();
        pool.getReduceSchedulable().updateDemand();
      }

This updates each pool's slot demand. Let's look at pool.getMapSchedulable().updateDemand(); pool.getReduceSchedulable().updateDemand() is essentially the same.

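7.PoolSchedulable.updateDemand(): The first statement, poolMgr.getMaxSlots(pool.getName(), taskType), gets the pool's maximum number of slots from the allocation file, which was loaded earlier as described above. Each PoolSchedulable holds multiple JobSchedulable objects, added in JobListener.jobAdded(), and each JobSchedulable corresponds to one JobInProgress. The method then calls JobSchedulable.updateDemand() to update the slot demand of each JobSchedulable.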

public void updateDemand() {
    // limit the demand to maxTasks
    int maxTasks = poolMgr.getMaxSlots(pool.getName(), taskType);
    demand = 0;
    for (JobSchedulable sched: jobScheds) {
      sched.updateDemand();
      demand += sched.getDemand();
      if (demand >= maxTasks) {
        demand = maxTasks;
        break;
      }
    }
    if (LOG.isDebugEnabled()) {
      LOG.debug("The pool " + pool.getName() + " demand is " + demand
          + "; maxTasks is " + maxTasks);
    }
  }

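8.JobSchedulable.updateDemand(): The first step checks whether the JobSchedulable's job is running (RUNNING); if it is not, no slots are requested for it. It then checks whether this JobSchedulable is for map or for reduce. For reduce, the number of finished maps (finishedMapTasks) plus the number of failed maps (failedMapTIPs) must be >= completedMapsForReduceSlowstart (the value of mapred.reduce.slowstart.completed.maps * numMapTasks); if so, reduce tasks may start, otherwise they may not. For map, the slot demand is computed directly. TaskInProgress[] tips = (taskType == TaskType.MAP ? job.getTasks(TaskType.MAP) : job.getTasks(TaskType.REDUCE)) fetches the matching TaskInProgress objects (tips), boolean speculationEnabled = (taskType == TaskType.MAP ? job.getMapSpeculativeExecution() : job.getReduceSpeculativeExecution()) checks whether speculative execution is enabled, and double avgProgress = (taskType == TaskType.MAP ? job.getStatus().mapProgress() : job.getStatus().reduceProgress()) gets the progress of the map or reduce phase, i.e. how much of it has completed. The slot demand of each TaskInProgress is then computed. If a TaskInProgress is not complete and is running, demand += tip.getActiveTasks().size() adds the slots it needs. A tip's active tasks are recorded when the task is scheduled, that is, when tip.addRunningTask() is called, and that call originates from FairScheduler.assignTasks(), i.e. task scheduling. The number of active tasks is the number of slots the tip needs, and if speculative execution is enabled one more slot is added for the speculative attempt. A tip that is not complete and has nothing running yet needs one slot so that it can be launched. This gives the total slot demand of one JobSchedulable. Summing over the JobSchedulables gives the pool's total slot demand, and once the sum reaches maxTasks (the maximum number of slots the pool may have) it is capped there and the loop ends. Back to FairScheduler.update().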
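The demand calculation is easier to check with concrete numbers. The following is a minimal sketch, not the Hadoop code: `Tip` is a hypothetical, stripped-down view of a TaskInProgress, and the rule that an unstarted task asks for one slot is how I read the source rather than something quoted from it.

```java
import java.util.Arrays;
import java.util.List;

final class JobDemandSketch {
    // Hypothetical, simplified view of a TaskInProgress.
    static final class Tip {
        final boolean complete;
        final int activeAttempts;       // attempts currently running
        final boolean wantsSpeculation; // a speculative attempt would be launched

        Tip(boolean complete, int activeAttempts, boolean wantsSpeculation) {
            this.complete = complete;
            this.activeAttempts = activeAttempts;
            this.wantsSpeculation = wantsSpeculation;
        }
    }

    // Slot demand of one job for one task type.
    static int jobDemand(List<Tip> tips, boolean speculationEnabled) {
        int demand = 0;
        for (Tip tip : tips) {
            if (tip.complete) {
                continue; // finished tasks need nothing
            }
            if (tip.activeAttempts > 0) {
                demand += tip.activeAttempts;
                if (speculationEnabled && tip.wantsSpeculation) {
                    demand += 1; // one extra slot for the speculative attempt
                }
            } else {
                demand += 1; // assumption: an unstarted task needs one slot
            }
        }
        return demand;
    }

    // Pool demand: sum of job demands, capped at the pool's maxTasks.
    static int poolDemand(int[] jobDemands, int maxTasks) {
        int demand = 0;
        for (int d : jobDemands) {
            demand += d;
            if (demand >= maxTasks) {
                return maxTasks;
            }
        }
        return demand;
    }

    public static void main(String[] args) {
        List<Tip> tips = Arrays.asList(
                new Tip(true, 0, false),   // done
                new Tip(false, 1, false),  // running
                new Tip(false, 2, true),   // running, wants a speculative attempt
                new Tip(false, 0, false)); // not started
        System.out.println("job demand: " + jobDemand(tips, true));
        System.out.println("pool demand: " + poolDemand(new int[] {5, 4}, 6));
    }
}
```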

9.FairScheduler.update():

List<PoolSchedulable> mapScheds = getPoolSchedulables(TaskType.MAP);
      List<PoolSchedulable> reduceScheds = getPoolSchedulables(TaskType.REDUCE);
      SchedulingAlgorithms.computeFairShares(
          mapScheds, clusterStatus.getMaxMapTasks());
      SchedulingAlgorithms.computeFairShares(
          reduceScheds, clusterStatus.getMaxReduceTasks());
      
      // Use the computed shares to assign shares within each pool
      for (Pool pool: poolMgr.getPools()) {
        pool.getMapSchedulable().redistributeShare();
        pool.getReduceSchedulable().redistributeShare();
      }
      
      if (preemptionEnabled)
        updatePreemptionVariables();

This is the core of FairScheduler: the resource allocation algorithm. The first two statements fetch all the map PoolSchedulables and all the reduce PoolSchedulables; each pool has one of each. The next two statements do the actual allocation by calling into the SchedulingAlgorithms class.

10.SchedulingAlgorithms.computeFairShares():

private static double slotsUsedWithWeightToSlotRatio(double w2sRatio,
      Collection<? extends Schedulable> schedulables) {
    double slotsTaken = 0;
    for (Schedulable sched: schedulables) {
      double share = computeShare(sched, w2sRatio);
      slotsTaken += share;
    }
    return slotsTaken;
  }

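This helper calls computeShare() to work out, from each Schedulable's weight and w2sRatio (the weight-to-slot ratio being tried; the search starts from 1.0), how many slots each Schedulable should get, and returns the sum.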

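11.SchedulingAlgorithms.computeShare(): The first statement, double share = sched.getWeight() * w2sRatio, reads the weight. For a pool this is the weight set for it in fair-scheduler.xml, which defaults to 1.0; for a job it comes from getJobWeight(), shown below. The share starts as weight * w2sRatio, then share = Math.max(share, sched.getMinShare()) raises it to at least the min share (minShare defaults to 0), and share = Math.min(share, sched.getDemand()) caps it at the demand. That is the resulting share.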

public double getJobWeight(JobInProgress job, TaskType taskType) {
    if (!isRunnable(job)) {
      // Job won't launch tasks, but don't return 0 to avoid division errors
      return 1.0;
    } else {
      double weight = 1.0;
      if (sizeBasedWeight) {
        // Set weight based on runnable tasks
        JobInfo info = infos.get(job);
        int runnableTasks = (taskType == TaskType.MAP) ?
            info.mapSchedulable.getDemand() : 
            info.reduceSchedulable.getDemand();
        weight = Math.log1p(runnableTasks) / Math.log(2);
      }
      weight *= getPriorityFactor(job.getPriority());
      if (weightAdjuster != null) {
        // Run weight through the user-supplied weightAdjuster
        weight = weightAdjuster.adjustWeight(job, taskType, weight);
      }
      return weight;
    }
  }

private static double computeShare(Schedulable sched, double w2sRatio) {
    double share = sched.getWeight() * w2sRatio;
    share = Math.max(share, sched.getMinShare());
    share = Math.min(share, sched.getDemand());
    return share;
  }

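12.SchedulingAlgorithms.computeFairShares: Back in this method, the overall picture is an algorithm (fair enough, the whole class is an algorithm class). It looks for a suitable ratio R such that the shares derived from it, roughly weight * R for each Schedulable, add up as closely as possible to cap (Math.min(totalDemand, totalSlots)), and it finds R by binary search. The reasoning is given in the comment on SchedulingAlgorithms.computeFairShares. As far as I can tell this is what is called weighted fair sharing: "We call R the weight-to-slots ratio because it converts a Schedulable's weight to the number of slots it is assigned." In other words, multiplying this value by a pool's weight gives the number of slots allocated to that pool. I'll leave it there; my English is not good enough to follow the whole comment, so I'll wait for someone else to explain it. Finally, sched.setFairShare(computeShare(sched, right)) stores the resulting share in each PoolSchedulable.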

for (Schedulable sched: schedulables) {
      totalDemand += sched.getDemand();
    }
    double cap = Math.min(totalDemand, totalSlots);
    double rMax = 1.0;
    while (slotsUsedWithWeightToSlotRatio(rMax, schedulables) < cap) {
      rMax *= 2.0;
    }
    // Perform the binary search for up to COMPUTE_FAIR_SHARES_ITERATIONS steps
    double left = 0;
    double right = rMax;
    for (int i = 0; i < COMPUTE_FAIR_SHARES_ITERATIONS; i++) {
      double mid = (left + right) / 2.0;
      if (slotsUsedWithWeightToSlotRatio(mid, schedulables) < cap) {
        left = mid;
      } else {
        right = mid;
      }
    }
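Running the binary search on a few concrete pools makes the role of the ratio much clearer. The sketch below re-implements the excerpts above in a self-contained form so it can be run outside Hadoop. The `Sched` class is a hypothetical stand-in for Schedulable, the iteration count of 25 is an assumption (the article does not give the value of COMPUTE_FAIR_SHARES_ITERATIONS), and the code assumes every weight is positive.

```java
import java.util.Arrays;
import java.util.List;

final class FairShareSketch {
    static final int ITERATIONS = 25; // assumed number of binary search steps

    // Hypothetical stand-in for a Schedulable: weight, min share, demand.
    static final class Sched {
        final double weight;
        final double minShare;
        final double demand;
        double fairShare;

        Sched(double weight, double minShare, double demand) {
            this.weight = weight;
            this.minShare = minShare;
            this.demand = demand;
        }
    }

    // weight * ratio, raised to at least minShare, capped at demand.
    static double computeShare(Sched s, double ratio) {
        double share = s.weight * ratio;
        share = Math.max(share, s.minShare);
        share = Math.min(share, s.demand);
        return share;
    }

    // Total slots handed out if the given ratio were used.
    static double slotsUsed(double ratio, List<Sched> scheds) {
        double total = 0;
        for (Sched s : scheds) {
            total += computeShare(s, ratio);
        }
        return total;
    }

    static void computeFairShares(List<Sched> scheds, double totalSlots) {
        double totalDemand = 0;
        for (Sched s : scheds) {
            totalDemand += s.demand;
        }
        double cap = Math.min(totalDemand, totalSlots);
        // Double rMax until the slots used reach cap: an upper bound for R.
        double rMax = 1.0;
        while (slotsUsed(rMax, scheds) < cap) {
            rMax *= 2.0;
        }
        // Binary search for the smallest ratio that uses cap slots.
        double left = 0;
        double right = rMax;
        for (int i = 0; i < ITERATIONS; i++) {
            double mid = (left + right) / 2.0;
            if (slotsUsed(mid, scheds) < cap) {
                left = mid;
            } else {
                right = mid;
            }
        }
        for (Sched s : scheds) {
            s.fairShare = computeShare(s, right);
        }
    }

    public static void main(String[] args) {
        // Three pools with weights 1, 1, 2 and plenty of demand share 40 slots.
        List<Sched> pools = Arrays.asList(
                new Sched(1, 0, 100), new Sched(1, 0, 100), new Sched(2, 0, 100));
        computeFairShares(pools, 40);
        for (Sched s : pools) {
            System.out.printf("weight %.0f -> %.2f slots%n", s.weight, s.fairShare);
        }
    }
}
```

In this run the search settles on R of about 10, so the pools receive about 10, 10 and 20 slots. When one pool's demand is smaller than weight * R, its share is clipped to that demand and R grows until the other pools absorb the remaining slots, which is why a search is used instead of a single division.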

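13.FairScheduler.update(): Next, the same method is used to compute a fairShare for each JobSchedulable, i.e. the number of slots each job in the pool can be given. This also requires the job's weight, which is computed by FairScheduler. When computing it, you can choose whether to adjust the weight by job size (controlled by the mapred.fairscheduler.sizebasedweight parameter, default false). The weight is then scaled by the job's priority, with this mapping: VERY_HIGH: 4.0 / HIGH: 2.0 / NORMAL: 1.0 / LOW: 0.5. Finally the weight is passed through the weightAdjuster, which you have to implement yourself and set with the mapred.fairscheduler.weightadjuster parameter; if you define your own weightAdjuster class, you can control a job's weight by overriding adjustWeight(). In short, by default a job's weight depends only on its priority. I'll skip the rest, since I don't fully understand it.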
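A small numeric example of the weight calculation, following getJobWeight() as quoted earlier. This is a sketch only: the `Priority` enum here is my own and contains just the four levels listed above, and the optional weightAdjuster step is left out.

```java
final class JobWeightSketch {
    // Hypothetical enum with only the four priority levels listed in the article.
    enum Priority { VERY_HIGH, HIGH, NORMAL, LOW }

    static double priorityFactor(Priority priority) {
        switch (priority) {
            case VERY_HIGH: return 4.0;
            case HIGH:      return 2.0;
            case NORMAL:    return 1.0;
            default:        return 0.5; // LOW
        }
    }

    // Weight = (size-based term or 1.0) * priority factor.
    static double jobWeight(Priority priority, int runnableTasks, boolean sizeBasedWeight) {
        double weight = 1.0;
        if (sizeBasedWeight) {
            // log2(1 + runnableTasks): grows slowly with the job's demand
            weight = Math.log1p(runnableTasks) / Math.log(2);
        }
        return weight * priorityFactor(priority);
    }

    public static void main(String[] args) {
        System.out.println("HIGH, default settings: " + jobWeight(Priority.HIGH, 7, false));
        System.out.println("HIGH, size-based, 7 tasks: " + jobWeight(Priority.HIGH, 7, true));
    }
}
```

With size-based weighting on, a job with 7 runnable tasks gets a size term of log2(8), which is about 3, so a HIGH priority job ends up with a weight of about 6 instead of 2.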

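14.FairScheduler.update(): The last step checks whether preemption is enabled, i.e. whether the scheduler may take slots back from other pools for a pool that has gone too long without its share. updatePreemptionVariables() checks, for each pool, whether its running tasks are fewer than its minimum share or its demand, whichever is smaller (isStarvedForMinShare), and whether the pool is using less than half of its fair share (isStarvedForFairShare). Whenever a pool is not starved, the corresponding timestamp is set to the current time, so the two timestamps record how long the pool has been starved.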

 private void updatePreemptionVariables() {
    long now = clock.getTime();
    lastPreemptionUpdateTime = now;
    for (TaskType type: MAP_AND_REDUCE) {
      for (PoolSchedulable sched: getPoolSchedulables(type)) {
        if (!isStarvedForMinShare(sched)) {
          sched.setLastTimeAtMinShare(now);
        }
        if (!isStarvedForFairShare(sched)) {
          sched.setLastTimeAtHalfFairShare(now);
        }
        eventLog.log("PREEMPT_VARS", sched.getName(), type,
            now - sched.getLastTimeAtMinShare(),
            now - sched.getLastTimeAtHalfFairShare());
      }
    }
  }
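The two starvation checks can be written out as plain functions to see when each one fires. This is a sketch under my own assumptions: the parameter lists are simplified (the real methods take a PoolSchedulable), and rounding half of the fair share down with Math.floor is my reading rather than something stated in the article.

```java
final class StarvationSketch {
    // Starved for min share: running fewer tasks than min(minShare, demand).
    static boolean isStarvedForMinShare(int runningTasks, int minShare, int demand) {
        int desiredShare = Math.min(minShare, demand);
        return runningTasks < desiredShare;
    }

    // Starved for fair share: running fewer tasks than min(fairShare / 2, demand).
    static boolean isStarvedForFairShare(int runningTasks, double fairShare, int demand) {
        int desiredFairShare = (int) Math.floor(Math.min(fairShare / 2, demand));
        return runningTasks < desiredFairShare;
    }

    public static void main(String[] args) {
        // A pool with min share 5, fair share 10, demand 20, and 3 running tasks.
        System.out.println("min share starved:  " + isStarvedForMinShare(3, 5, 20));
        System.out.println("fair share starved: " + isStarvedForFairShare(3, 10.0, 20));
        // A pool that only wants 2 slots is not starved while running 3 tasks.
        System.out.println("low demand starved: " + isStarvedForMinShare(3, 5, 2));
    }
}
```

Taking the minimum with the demand matters: a pool that asks for fewer slots than its min share is not counted as starved just because it runs below that min share.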

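That completes FairScheduler's job initialization, or in other words the JobListener's jobAdded() method. This analysis has gaps and may contain mistakes; if you spot any, please let me know. Thanks.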

posted @ 2013-11-21 22:31 Vicky01200059
