[Yarn Source Code Analysis] ApplicationMaster Source Code Analysis

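This article walks through how the ApplicationMaster runs and analyzes the relevant source code from three angles: ApplicationMaster startup, registration/heartbeat, and Container resource requests and allocation. A large share of the article is devoted to the startup path, from application submission, through the App/Attempt state transitions, to the actual launch of the ApplicationMaster, so that readers can follow the whole process from submission to a running ApplicationMaster and gain a deeper understanding of Yarn's submission flow.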

I. Overall ApplicationMaster Workflow

ApplicationMaster lifecycle

ApplicationMaster management is built on three services: ApplicationMasterLauncher, AMLivelinessMonitor, and ApplicationMasterService. Together they manage the lifecycle of each application's ApplicationMaster. From creation to teardown, the flow is as follows:

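  1. The user submits an application to the ResourceManager. After receiving the request, the ResourceManager first asks the resource scheduler for the resources needed to start the ApplicationMaster; once they are granted, ApplicationMasterLauncher contacts the corresponding NodeManager to start the application's ApplicationMaster.
  2. Once the ApplicationMaster has been launched, ApplicationMasterLauncher uses an event to register it with AMLivelinessMonitor, which starts heartbeat monitoring.
  3. After startup, the ApplicationMaster first registers with ApplicationMasterService, reporting its host, port, and related information.
  4. While running, the ApplicationMaster periodically sends heartbeats to ApplicationMasterService (each heartbeat carries a description of the resources it wants to request).
  5. Each time ApplicationMasterService receives a heartbeat, it notifies AMLivelinessMonitor to update the application's last-heartbeat time.
  6. When the application finishes, the ApplicationMaster sends a request to ApplicationMasterService to unregister itself.
  7. On receiving the unregister request, ApplicationMasterService marks the application as finished and tells AMLivelinessMonitor to stop monitoring its heartbeat.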
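
The interplay between launch-time registration, heartbeats, and unregistration in the steps above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Hadoop's AMLivelinessMonitor: the class LivelinessMonitorSketch, the register/unregister/findExpired method names, and the explicit nowMs clock parameter are all hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a liveliness monitor; not the Hadoop class.
class LivelinessMonitorSketch {
    private final long expiryIntervalMs;
    // attempt id -> time of the most recent ping
    private final Map<String, Long> lastPing = new HashMap<>();

    LivelinessMonitorSketch(long expiryIntervalMs) {
        this.expiryIntervalMs = expiryIntervalMs;
    }

    // Step 2: start monitoring once the AM has been launched.
    synchronized void register(String attemptId, long nowMs) {
        lastPing.put(attemptId, nowMs);
    }

    // Steps 3-5: registration and every heartbeat refresh the timestamp.
    synchronized void receivedPing(String attemptId, long nowMs) {
        if (lastPing.containsKey(attemptId)) {
            lastPing.put(attemptId, nowMs);
        }
    }

    // Step 7: stop monitoring after the AM has unregistered.
    synchronized void unregister(String attemptId) {
        lastPing.remove(attemptId);
    }

    // Returns the attempts whose last ping is older than the expiry interval.
    synchronized List<String> findExpired(long nowMs) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastPing.entrySet()) {
            if (nowMs - entry.getValue() > expiryIntervalMs) {
                expired.add(entry.getKey());
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        LivelinessMonitorSketch monitor = new LivelinessMonitorSketch(1000);
        monitor.register("attempt_1", 0);
        monitor.receivedPing("attempt_1", 900);
        System.out.println("expired: " + monitor.findExpired(2500));
    }
}
```

Passing the clock in as a parameter keeps the sketch deterministic; a real monitor would read the system clock and check for expiry on its own thread.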

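With the overall lifecycle in mind, we analyze the source code from three angles: ApplicationMaster startup, registration/heartbeat, and resource requests.

II. ApplicationMaster Startup Flow

This part covers the first step of the lifecycle: starting the ApplicationMaster. To make the whole execution flow easier to follow, we do not jump straight to the ApplicationMaster launcher classes. Instead we start from application submission, continue through the APP/Attempt state transitions (the state changes an application goes through before its ApplicationMaster starts), and end with the actual ApplicationMaster launch, which gives a fuller picture of Yarn's submission flow.

2.1 Application Submission

Whatever the application type, the entry point for submitting it to Yarn is the YarnClient API (an abstract class), specifically its submitApplication() method.

// Location: org/apache/hadoop/yarn/client/api/YarnClient.java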
  public abstract ApplicationId submitApplication(
      ApplicationSubmissionContext appContext) throws YarnException,
      IOException;

 

Here is the submission entry point in its implementation class:

// Location: org/apache/hadoop/yarn/client/api/impl/YarnClientImpl.java
  @Override
  public ApplicationId
      submitApplication(ApplicationSubmissionContext appContext)
          throws YarnException, IOException {
    ApplicationId applicationId = appContext.getApplicationId();
    if (applicationId == null) {
      throw new ApplicationIdNotProvidedException(
          "ApplicationId is not provided in ApplicationSubmissionContext");
    }

    // Build the request that carries the application's submission context
    SubmitApplicationRequest request =
        Records.newRecord(SubmitApplicationRequest.class);
    request.setApplicationSubmissionContext(appContext);

    // Automatically add the timeline DT into the CLC
    // Only when the security and the timeline service are both enabled
    if (isSecurityEnabled() && timelineServiceEnabled) {
      addTimelineDelegationToken(appContext.getAMContainerSpec());
    }

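    // The client actually submits the application here
    rmClient.submitApplication(request);

    while (true) {
      // Keep polling, and retry the submission for applications that were
      // not submitted in time (loop body omitted)
    }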

    return applicationId;
  }
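
The loop elided in the snippet above follows a common "submit, then poll until the state settles" pattern. The sketch below illustrates that pattern only. It assumes the client waits while the application is still in NEW or NEW_SAVING, and every name in it (SubmitPollSketch, ClientAppStateSketch, waitForSubmission, maxPolls) is hypothetical. The real method also sleeps between polls, which is left out here.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Iterator;
import java.util.function.Supplier;

// Hypothetical subset of application states, for illustration only.
enum ClientAppStateSketch { NEW, NEW_SAVING, SUBMITTED, ACCEPTED, FAILED, KILLED }

class SubmitPollSketch {
    // Assumption: these are the states in which the client keeps waiting.
    private static final EnumSet<ClientAppStateSketch> WAITING =
        EnumSet.of(ClientAppStateSketch.NEW, ClientAppStateSketch.NEW_SAVING);

    // Polls the reported state until it leaves the waiting set or maxPolls is reached.
    static ClientAppStateSketch waitForSubmission(
            Supplier<ClientAppStateSketch> reportState, int maxPolls) {
        ClientAppStateSketch state = reportState.get();
        int polls = 1;
        while (WAITING.contains(state) && polls < maxPolls) {
            // A real client would sleep for a poll interval here.
            state = reportState.get();
            polls++;
        }
        return state;
    }

    public static void main(String[] args) {
        // Simulated sequence of states returned by successive polls.
        Iterator<ClientAppStateSketch> reports = Arrays.asList(
            ClientAppStateSketch.NEW,
            ClientAppStateSketch.NEW_SAVING,
            ClientAppStateSketch.SUBMITTED).iterator();
        System.out.println(waitForSubmission(reports::next, 10));
    }
}
```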

 

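The Yarn client talks to the RM over RPC through the ClientRMService service. Once the application reaches the server side, ClientRMService calls the corresponding method of RMAppManager to handle it.

// Location: org/apache/hadoop/yarn/server/resourcemanager/ClientRMService.java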
  @Override
  public SubmitApplicationResponse submitApplication(
      SubmitApplicationRequest request) throws YarnException {
    ApplicationSubmissionContext submissionContext = request
        .getApplicationSubmissionContext();
    ApplicationId applicationId = submissionContext.getApplicationId();

    // Validation checks omitted

    try {
      // Key point: call RMAppManager to submit the application
      rmAppManager.submitApplication(submissionContext,
          System.currentTimeMillis(), user);

      LOG.info("Application with id " + applicationId.getId() + 
          " submitted by user " + user);
      RMAuditLogger.logSuccess(user, AuditConstants.SUBMIT_APP_REQUEST,
          "ClientRMService", applicationId);
    } catch (YarnException e) {
      LOG.info("Exception in submitting application with id " +
          applicationId.getId(), e);
      RMAuditLogger.logFailure(user, AuditConstants.SUBMIT_APP_REQUEST,
          e.getMessage(), "ClientRMService",
          "Exception in submitting application", applicationId);
      throw e;
    }

    SubmitApplicationResponse response = recordFactory
        .newRecordInstance(SubmitApplicationResponse.class);
    return response;
  }

 

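2.2 APP/AppAttempt State Transitions

RMAppManager's submitApplication() method sends an RMAppEventType.START event to the dispatcher (in the snippet below, directly in the branch taken when security is disabled).

// Location: src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java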
  protected void submitApplication(
      ApplicationSubmissionContext submissionContext, long submitTime,
      String user) throws YarnException {
    ApplicationId applicationId = submissionContext.getApplicationId();

    RMAppImpl application =
        createAndPopulateNewRMApp(submissionContext, submitTime, user, false);
    ApplicationId appId = submissionContext.getApplicationId();
    Credentials credentials = null;
    try {
      credentials = parseCredentials(submissionContext);
      if (UserGroupInformation.isSecurityEnabled()) {
        this.rmContext.getDelegationTokenRenewer().addApplicationAsync(appId,
            credentials, submissionContext.getCancelTokensWhenComplete(),
            application.getUser());
      } else {
        // Key point: send the RMAppEventType.START event to the dispatcher
        this.rmContext.getDispatcher().getEventHandler()
            .handle(new RMAppEvent(applicationId, RMAppEventType.START));
      }
    } catch (Exception e) {
      LOG.warn("Unable to parse credentials.", e);
      // Sending APP_REJECTED is fine, since we assume that the
      // RMApp is in NEW state and thus we haven't yet informed the
      // scheduler about the existence of the application
      assert application.getState() == RMAppState.NEW;
      this.rmContext.getDispatcher().getEventHandler()
          .handle(new RMAppRejectedEvent(applicationId, e.getMessage()));
      throw RPCUtil.getRemoteException(e);
    }
  }
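
Most of the remaining flow is driven by events handed to a central dispatcher, which routes each event to the handler registered for its event-type enum. The sketch below illustrates only that routing idea. It is not Hadoop's dispatcher: DispatcherSketch, its Event class, and the two enums are hypothetical, and it drains its queue synchronously in drain(), whereas the RM's dispatcher is assumed to process events asynchronously on its own thread.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical event-type enums, standing in for RMAppEventType and SchedulerEventType.
enum AppEventTypeSketch { START }
enum SchedulerEventTypeSketch { APP_ADDED }

class DispatcherSketch {
    // An event carries an enum type plus the id of the object it targets.
    static final class Event {
        final Enum<?> type;
        final String targetId;
        Event(Enum<?> type, String targetId) {
            this.type = type;
            this.targetId = targetId;
        }
    }

    private final Map<Class<?>, Consumer<Event>> handlers = new HashMap<>();
    private final Queue<Event> queue = new ArrayDeque<>();

    // One handler per event-type enum class.
    void register(Class<? extends Enum<?>> eventTypeClass, Consumer<Event> handler) {
        handlers.put(eventTypeClass, handler);
    }

    // handle() only enqueues; the caller never runs the handler itself.
    void handle(Event event) {
        queue.add(event);
    }

    // Delivers queued events in order; handlers may enqueue follow-up events.
    int drain() {
        int handled = 0;
        Event event;
        while ((event = queue.poll()) != null) {
            Consumer<Event> handler = handlers.get(event.type.getDeclaringClass());
            if (handler == null) {
                throw new IllegalStateException("No handler registered for " + event.type);
            }
            handler.accept(event);
            handled++;
        }
        return handled;
    }

    public static void main(String[] args) {
        DispatcherSketch dispatcher = new DispatcherSketch();
        dispatcher.register(AppEventTypeSketch.class,
            e -> System.out.println("app handler got " + e.type + " for " + e.targetId));
        dispatcher.handle(new Event(AppEventTypeSketch.START, "application_1"));
        dispatcher.drain();
    }
}
```

Because handlers may enqueue further events, one submission can fan out into a chain of events across components, which is the shape of the flow traced in the rest of this section.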

 

The RMAppEventType.START event has a matching transition in RMAppImpl that moves the APP state from NEW to NEW_SAVING.

// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
    // Transitions from NEW state
    .addTransition(RMAppState.NEW, RMAppState.NEW_SAVING,
        RMAppEventType.START, new RMAppNewlySavingTransition())
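
An addTransition(preState, postState, eventType, hook) registration like the one above can be read as one row in a lookup table keyed by (current state, event). The sketch below shows that reading in a deliberately simplified form. It is not Hadoop's state machine implementation: TransitionTableSketch and both enums are hypothetical, and the extra states and events are the ones this article goes on to discuss.

```java
import java.util.HashMap;
import java.util.Map;

enum AppStateSketch { NEW, NEW_SAVING, SUBMITTED, ACCEPTED }

enum AppEventSketch { START, APP_NEW_SAVED, APP_ACCEPTED }

// Hypothetical, heavily simplified transition table.
class TransitionTableSketch {
    private static final class Arc {
        final AppStateSketch postState;
        final Runnable hook;
        Arc(AppStateSketch postState, Runnable hook) {
            this.postState = postState;
            this.hook = hook;
        }
    }

    private final Map<AppStateSketch, Map<AppEventSketch, Arc>> table = new HashMap<>();
    private AppStateSketch current = AppStateSketch.NEW;

    // Same argument order as the registration shown above.
    TransitionTableSketch addTransition(AppStateSketch pre, AppStateSketch post,
                                        AppEventSketch event, Runnable hook) {
        table.computeIfAbsent(pre, k -> new HashMap<>()).put(event, new Arc(post, hook));
        return this;
    }

    // Looks up the arc for (current state, event), runs its hook, then moves on.
    AppStateSketch handle(AppEventSketch event) {
        Map<AppEventSketch, Arc> arcs = table.get(current);
        Arc arc = (arcs == null) ? null : arcs.get(event);
        if (arc == null) {
            throw new IllegalStateException("Invalid event " + event + " at " + current);
        }
        arc.hook.run();
        current = arc.postState;
        return current;
    }

    AppStateSketch getState() {
        return current;
    }

    public static void main(String[] args) {
        TransitionTableSketch app = new TransitionTableSketch()
            .addTransition(AppStateSketch.NEW, AppStateSketch.NEW_SAVING,
                AppEventSketch.START, () -> System.out.println("storing application"));
        System.out.println(app.handle(AppEventSketch.START));
    }
}
```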

 

What does the registered RMAppNewlySavingTransition do?

// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
  private static final class RMAppNewlySavingTransition extends RMAppTransition {
    @Override
    public void transition(RMAppImpl app, RMAppEvent event) {

      // Save the APP's state information
      LOG.info("Storing application with id " + app.applicationId);
      app.rmContext.getStateStore().storeNewApplication(app);
    }
  }

 

The transition saves the APP's state, storing its metadata in ZK (assuming a ZooKeeper-backed RMStateStore is configured).

// Location: org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java
  public void storeNewApplication(RMApp app) {
    ApplicationSubmissionContext context = app
                                            .getApplicationSubmissionContext();
    assert context instanceof ApplicationSubmissionContextPBImpl;
    ApplicationStateData appState =
        ApplicationStateData.newInstance(
            app.getSubmitTime(), app.getStartTime(), context, app.getUser());
    // Send the RMStateStoreEventType.STORE_APP event to the dispatcher
    dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState));
  }

 

This sends an RMStateStoreEventType.STORE_APP event to the dispatcher; the StoreAppTransition is registered for that event.

// Location: org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java
    .addTransition(RMStateStoreState.ACTIVE,
          EnumSet.of(RMStateStoreState.ACTIVE, RMStateStoreState.FENCED),
          RMStateStoreEventType.STORE_APP, new StoreAppTransition())

 

StoreAppTransition sends an RMAppEventType.APP_NEW_SAVED event to the dispatcher, which triggers the APP state transition from NEW_SAVING to SUBMITTED.

// Location: org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java
  private static class StoreAppTransition
      implements MultipleArcTransition<RMStateStore, RMStateStoreEvent,
          RMStateStoreState> {
    @Override
    public RMStateStoreState transition(RMStateStore store,
        RMStateStoreEvent event) {
      if (!(event instanceof RMStateStoreAppEvent)) {
        // should never happen
        LOG.error("Illegal event type: " + event.getClass());
        return RMStateStoreState.ACTIVE;
      }
      boolean isFenced = false;
      ApplicationStateData appState =
          ((RMStateStoreAppEvent) event).getAppState();
      ApplicationId appId =
          appState.getApplicationSubmissionContext().getApplicationId();
      LOG.info("Storing info for app: " + appId);
      try {
        store.storeApplicationStateInternal(appId, appState);
        // Key point: send the RMAppEventType.APP_NEW_SAVED event to the dispatcher
        store.notifyApplication(new RMAppEvent(appId,
               RMAppEventType.APP_NEW_SAVED));
      } catch (Exception e) {
        LOG.error("Error storing app: " + appId, e);
        isFenced = store.notifyStoreOperationFailedInternal(e);
      }
      return finalState(isFenced);
    };
  }

 

The RMAppEventType.APP_NEW_SAVED event sent here moves the APP from NEW_SAVING to SUBMITTED and invokes the AddApplicationToSchedulerTransition.

// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
    .addTransition(RMAppState.NEW_SAVING, RMAppState.SUBMITTED,
        RMAppEventType.APP_NEW_SAVED, new AddApplicationToSchedulerTransition())

 

AddApplicationToSchedulerTransition fires a SchedulerEventType.APP_ADDED event.

// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
  private static final class AddApplicationToSchedulerTransition extends
      RMAppTransition {
    @Override
    public void transition(RMAppImpl app, RMAppEvent event) {
      // Send the SchedulerEventType.APP_ADDED event to the scheduler
      app.handler.handle(new AppAddedSchedulerEvent(app.applicationId,
        app.submissionContext.getQueue(), app.user,
        app.submissionContext.getReservationID()));
    }
  }

 

AppAddedSchedulerEvent extends SchedulerEvent, so the event is handled by the scheduler, here FairScheduler. Let's look at its handle() method.

// Location: org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
  @Override
  public void handle(SchedulerEvent event) {
    switch (event.getType()) {
    case NODE_ADDED: // omitted
    case NODE_REMOVED: // omitted
    case NODE_UPDATE: // omitted
    case APP_ADDED:
      if (!(event instanceof AppAddedSchedulerEvent)) {
        throw new RuntimeException("Unexpected event type: " + event);
      }
      AppAddedSchedulerEvent appAddedEvent = (AppAddedSchedulerEvent) event;
      // Handling logic for the APP_ADDED event
      addApplication(appAddedEvent.getApplicationId(),
        appAddedEvent.getQueue(), appAddedEvent.getUser(),
        appAddedEvent.getIsAppRecovering());
      break;
    case APP_REMOVED: // omitted
    case NODE_RESOURCE_UPDATE: // omitted
    case APP_ATTEMPT_ADDED: // omitted
    case APP_ATTEMPT_REMOVED: // omitted
    case CONTAINER_EXPIRED: // omitted
    case CONTAINER_RESCHEDULED: // omitted
    default:
      LOG.error("Unknown event arrived at FairScheduler: " + event.toString());
    }
  }

 

addApplication() runs some preliminary checks on the submission, such as whether the queue name is valid and whether the user has access to the queue. If the checks pass, it sends an RMAppEventType.APP_ACCEPTED event to the dispatcher.

// Location: org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
  protected synchronized void addApplication(ApplicationId applicationId,
      String queueName, String user, boolean isAppRecovering) {
    // Validate the submitted queue name
    if (queueName == null || queueName.isEmpty()) {
      String message = "Reject application " + applicationId +
              " submitted by user " + user + " with an empty queue name.";
      LOG.info(message);
      rmContext.getDispatcher().getEventHandler()
          .handle(new RMAppRejectedEvent(applicationId, message));
      return;
    }

    if (queueName.startsWith(".") || queueName.endsWith(".")) {
      String message = "Reject application " + applicationId
          + " submitted by user " + user + " with an illegal queue name "
          + queueName + ". "
          + "The queue name cannot start/end with period.";
      LOG.info(message);
      rmContext.getDispatcher().getEventHandler()
          .handle(new RMAppRejectedEvent(applicationId, message));
      return;
    }

    RMApp rmApp = rmContext.getRMApps().get(applicationId);
    FSLeafQueue queue = assignToQueue(rmApp, queueName, user);
    if (queue == null) {
      return;
    }

    // Check the queue's ACL permissions
    UserGroupInformation userUgi = UserGroupInformation.createRemoteUser(user);

    if (!queue.hasAccess(QueueACL.SUBMIT_APPLICATIONS, userUgi)
        && !queue.hasAccess(QueueACL.ADMINISTER_QUEUE, userUgi)) {
      String msg = "User " + userUgi.getUserName() +
              " cannot submit applications to queue " + queue.getName();
      LOG.info(msg);
      rmContext.getDispatcher().getEventHandler()
          .handle(new RMAppRejectedEvent(applicationId, msg));
      return;
    }

    SchedulerApplication<FSAppAttempt> application =
        new SchedulerApplication<FSAppAttempt>(queue, user);
    applications.put(applicationId, application);
    queue.getMetrics().submitApp(user);

    LOG.info("Accepted application " + applicationId + " from user: " + user
        + ", in queue: " + queue.getName()
        + ", currently num of applications: " + applications.size());
    if (isAppRecovering) {
      // Check whether the APP is being recovered (the recovery case is ignored for now)
      if (LOG.isDebugEnabled()) {
        LOG.debug(applicationId
            + " is recovering. Skip notifying APP_ACCEPTED");
      }
    } else {
      // Key point: send the RMAppEventType.APP_ACCEPTED event to the dispatcher
      rmContext.getDispatcher().getEventHandler()
        .handle(new RMAppEvent(applicationId, RMAppEventType.APP_ACCEPTED));
    }
  }
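
The three rejection checks in addApplication() above (empty queue name, leading or trailing period, missing ACL) can be isolated into a small pure function. The sketch below mirrors only that decision logic. AddApplicationChecksSketch and rejectionReason are hypothetical names, and the two ACLs are modeled as plain sets of user names, which is a simplification of the real queue ACL check.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

class AddApplicationChecksSketch {
    // Returns the reason for rejecting the submission, or empty if it is accepted.
    static Optional<String> rejectionReason(String queueName, String user,
                                            Set<String> submitAcl, Set<String> adminAcl) {
        if (queueName == null || queueName.isEmpty()) {
            return Optional.of("empty queue name");
        }
        if (queueName.startsWith(".") || queueName.endsWith(".")) {
            return Optional.of("queue name cannot start/end with period");
        }
        // Rejected only when the user holds neither the submit nor the admin ACL.
        if (!submitAcl.contains(user) && !adminAcl.contains(user)) {
            return Optional.of("user " + user
                + " cannot submit applications to queue " + queueName);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Set<String> submitters = new HashSet<>(Arrays.asList("alice"));
        Set<String> admins = Collections.emptySet();
        System.out.println(rejectionReason("root.dev", "alice", submitters, admins));
        System.out.println(rejectionReason("root.dev.", "alice", submitters, admins));
    }
}
```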

 

The RMAppEventType.APP_ACCEPTED event triggers the StartAppAttemptTransition and moves the APP state from SUBMITTED to ACCEPTED.

// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
    // Transitions from SUBMITTED state
    .addTransition(RMAppState.SUBMITTED, RMAppState.ACCEPTED,
        RMAppEventType.APP_ACCEPTED, new StartAppAttemptTransition())

 

StartAppAttemptTransition sends an RMAppAttemptEventType.START event to begin starting the AppAttempt.

// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
  private static final class StartAppAttemptTransition extends RMAppTransition {
    @Override
    public void transition(RMAppImpl app, RMAppEvent event) {
      app.createAndStartNewAttempt(false);
    };
  }

  // Begin starting the AppAttempt
  private void createAndStartNewAttempt(boolean transferStateFromPreviousAttempt) {
    createNewAttempt();
    // Send the RMAppAttemptEventType.START event to the dispatcher
    handler.handle(new RMAppStartAttemptEvent(currentAttempt.getAppAttemptId(),
      transferStateFromPreviousAttempt));
  }

 

The RMAppAttemptEventType.START event invokes the AttemptStartedTransition and moves the AppAttempt state from NEW to SUBMITTED.

// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
      // Transitions from NEW State
      .addTransition(RMAppAttemptState.NEW, RMAppAttemptState.SUBMITTED,
          RMAppAttemptEventType.START, new AttemptStartedTransition())

 

AttemptStartedTransition creates an AppAttemptAddedSchedulerEvent, which delivers a SchedulerEventType.APP_ATTEMPT_ADDED event to the scheduler.

// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
  private static final class AttemptStartedTransition extends BaseTransition {
  @Override
    public void transition(RMAppAttemptImpl appAttempt,
        RMAppAttemptEvent event) {

      // Some validation checks omitted

      // Send the SchedulerEventType.APP_ATTEMPT_ADDED event to the scheduler
      appAttempt.eventHandler.handle(new AppAttemptAddedSchedulerEvent(
        appAttempt.applicationAttemptId, transferStateFromPreviousAttempt));
    }
  }

 

AppAttemptAddedSchedulerEvent also extends SchedulerEvent. The handling logic for SchedulerEventType.APP_ATTEMPT_ADDED is again in the handle() method.

// Location: org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
  public void handle(SchedulerEvent event) {
    switch (event.getType()) {
    case NODE_ADDED: // omitted
    case NODE_REMOVED: // omitted
    case NODE_UPDATE: // omitted
    case APP_ADDED: // omitted
    case APP_REMOVED: // omitted
    case NODE_RESOURCE_UPDATE: // omitted
    case APP_ATTEMPT_ADDED:
      if (!(event instanceof AppAttemptAddedSchedulerEvent)) {
        throw new RuntimeException("Unexpected event type: " + event);
      }
      AppAttemptAddedSchedulerEvent appAttemptAddedEvent =
          (AppAttemptAddedSchedulerEvent) event;
      addApplicationAttempt(appAttemptAddedEvent.getApplicationAttemptId(),
        appAttemptAddedEvent.getTransferStateFromPreviousAttempt(),
        appAttemptAddedEvent.getIsAttemptRecovering());
      break;
    case APP_ATTEMPT_REMOVED: // omitted
    case CONTAINER_EXPIRED: // omitted
    case CONTAINER_RESCHEDULED: // omitted
    default:
      LOG.error("Unknown event arrived at FairScheduler: " + event.toString());
    }
  }

 

addApplicationAttempt() sends an RMAppAttemptEventType.ATTEMPT_ADDED event to the dispatcher.

// Location: org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
  protected synchronized void addApplicationAttempt(
      ApplicationAttemptId applicationAttemptId,
      boolean transferStateFromPreviousAttempt,
      boolean isAttemptRecovering) {
    // Preliminary checks and initialization omitted

    if (isAttemptRecovering) {
      if (LOG.isDebugEnabled()) {
        LOG.debug(applicationAttemptId
            + " is recovering. Skipping notifying ATTEMPT_ADDED");
      }
    } else {
      // Send the RMAppAttemptEventType.ATTEMPT_ADDED event to the dispatcher
      rmContext.getDispatcher().getEventHandler().handle(
        new RMAppAttemptEvent(applicationAttemptId,
            RMAppAttemptEventType.ATTEMPT_ADDED));
    }
  }

 

The RMAppAttemptEventType.ATTEMPT_ADDED event triggers the ScheduleTransition, which moves the AppAttempt state from SUBMITTED to either LAUNCHED_UNMANAGED_SAVING or SCHEDULED.

// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
      // Transitions from SUBMITTED state
      .addTransition(RMAppAttemptState.SUBMITTED, 
          EnumSet.of(RMAppAttemptState.LAUNCHED_UNMANAGED_SAVING,
                     RMAppAttemptState.SCHEDULED),
          RMAppAttemptEventType.ATTEMPT_ADDED,
          new ScheduleTransition())

 

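In ScheduleTransition, the if statement checks whether the AM is unmanaged, that is, whether its execution is managed outside the RM. If getUnmanagedAM() returns true, the RM does not allocate and launch a container for the AM. The default is false, so the state returned here is SCHEDULED.

// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java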
public static final class ScheduleTransition
      implements
      MultipleArcTransition<RMAppAttemptImpl, RMAppAttemptEvent, RMAppAttemptState> {
    @Override
    public RMAppAttemptState transition(RMAppAttemptImpl appAttempt,
        RMAppAttemptEvent event) {
      ApplicationSubmissionContext subCtx = appAttempt.submissionContext;
      if (!subCtx.getUnmanagedAM()) {
        // Some operations omitted

        // Request a Container for the AM; not explained further here
        Allocation amContainerAllocation =
            appAttempt.scheduler.allocate(
                appAttempt.applicationAttemptId,
                appAttempt.amReqs,
                EMPTY_CONTAINER_RELEASE_LIST,
                amBlacklist.getAdditions(),
                amBlacklist.getRemovals());
        if (amContainerAllocation != null
            && amContainerAllocation.getContainers() != null) {
          assert (amContainerAllocation.getContainers().size() == 0);
        }
        // The returned state becomes the post-transition state of the state machine
        return RMAppAttemptState.SCHEDULED;
      } else {
        // save state and then go to LAUNCHED state
        appAttempt.storeAttempt();
        return RMAppAttemptState.LAUNCHED_UNMANAGED_SAVING;
      }
    }
  }
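
Unlike the single-target transitions seen earlier, ScheduleTransition is a multiple-arc transition: the registration declares a set of allowed post-states, and the hook's return value picks one of them at run time. The sketch below illustrates that idea in isolation. MultipleArcSketch and AttemptStateSketch are hypothetical names, the boolean argument stands in for getUnmanagedAM(), and the check that the returned state belongs to the declared set is an assumption about how such a state machine guards itself.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.function.Function;

enum AttemptStateSketch { SUBMITTED, SCHEDULED, LAUNCHED_UNMANAGED_SAVING, ALLOCATED }

// Hypothetical multiple-arc transition: the hook decides the post-state.
class MultipleArcSketch {
    private final Set<AttemptStateSketch> declaredPostStates;
    private final Function<Boolean, AttemptStateSketch> hook;

    MultipleArcSketch(Set<AttemptStateSketch> declaredPostStates,
                      Function<Boolean, AttemptStateSketch> hook) {
        this.declaredPostStates = declaredPostStates;
        this.hook = hook;
    }

    // Runs the hook and verifies that it returned one of the declared post-states.
    AttemptStateSketch fire(boolean unmanagedAm) {
        AttemptStateSketch post = hook.apply(unmanagedAm);
        if (!declaredPostStates.contains(post)) {
            throw new IllegalStateException("Undeclared post-state " + post);
        }
        return post;
    }

    // Mirrors the branch structure of ScheduleTransition shown above.
    static MultipleArcSketch scheduleTransition() {
        return new MultipleArcSketch(
            EnumSet.of(AttemptStateSketch.LAUNCHED_UNMANAGED_SAVING,
                       AttemptStateSketch.SCHEDULED),
            unmanagedAm -> unmanagedAm
                ? AttemptStateSketch.LAUNCHED_UNMANAGED_SAVING
                : AttemptStateSketch.SCHEDULED);
    }

    public static void main(String[] args) {
        System.out.println(scheduleTransition().fire(false));
        System.out.println(scheduleTransition().fire(true));
    }
}
```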

 

While the attempt is in the SCHEDULED state, an RMAppAttemptEventType.CONTAINER_ALLOCATED event (raised once the scheduler has allocated a Container for the AM) moves the AppAttempt from SCHEDULED to ALLOCATED_SAVING. The transition that handles it is AMContainerAllocatedTransition.

// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
       // Transitions from SCHEDULED State
      .addTransition(RMAppAttemptState.SCHEDULED,
          EnumSet.of(RMAppAttemptState.ALLOCATED_SAVING,
            RMAppAttemptState.SCHEDULED),
          RMAppAttemptEventType.CONTAINER_ALLOCATED,
          new AMContainerAllocatedTransition())

 

AMContainerAllocatedTransition fetches the Container allocated for the AM and returns RMAppAttemptState.ALLOCATED_SAVING as the new state.

// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
  private static final class AMContainerAllocatedTransition
      implements
      MultipleArcTransition<RMAppAttemptImpl, RMAppAttemptEvent, RMAppAttemptState> {
    @Override
    public RMAppAttemptState transition(RMAppAttemptImpl appAttempt,
        RMAppAttemptEvent event) {
      // Fetch the Container for launching the AM from the scheduler. This allocate call
      // passes no AM request, meaning it first tries to pick up a Container directly
      Allocation amContainerAllocation =
          appAttempt.scheduler.allocate(appAttempt.applicationAttemptId,
            EMPTY_CONTAINER_REQUEST_LIST, EMPTY_CONTAINER_RELEASE_LIST, null,
            null);
      // Check whether the AM allocation is empty; if the previously allocated
      // resource could not be fetched, retry here and stay in SCHEDULED
      if (amContainerAllocation.getContainers().size() == 0) {
        appAttempt.retryFetchingAMContainer(appAttempt);
        return RMAppAttemptState.SCHEDULED;
      }

      // Set the masterContainer
      appAttempt.setMasterContainer(amContainerAllocation.getContainers()
          .get(0));
      RMContainerImpl rmMasterContainer = (RMContainerImpl)appAttempt.scheduler
          .getRMContainer(appAttempt.getMasterContainer().getId());
      rmMasterContainer.setAMContainer(true);

      appAttempt.rmContext.getNMTokenSecretManager()
        .clearNodeSetForAttempt(appAttempt.applicationAttemptId);
      appAttempt.getSubmissionContext().setResource(
        appAttempt.getMasterContainer().getResource());
      appAttempt.storeAttempt();

      // Return ALLOCATED_SAVING as the new state
      return RMAppAttemptState.ALLOCATED_SAVING;
    }
  } 

 

In the ALLOCATED_SAVING state, AttemptStoredTransition is registered for the RMAppAttemptEventType.ATTEMPT_NEW_SAVED event; when that event fires, the AppAttempt state moves from ALLOCATED_SAVING to ALLOCATED.

// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
       // Transitions from ALLOCATED_SAVING State
      .addTransition(RMAppAttemptState.ALLOCATED_SAVING, 
          RMAppAttemptState.ALLOCATED,
          RMAppAttemptEventType.ATTEMPT_NEW_SAVED, new AttemptStoredTransition())

 

Next, let's see what AttemptStoredTransition does.

// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
  private static final class AttemptStoredTransition extends BaseTransition {
    @Override
    public void transition(RMAppAttemptImpl appAttempt, RMAppAttemptEvent event) {
      // Launch the AppAttempt
      appAttempt.launchAttempt();
    }
  }

  private void launchAttempt(){
    launchAMStartTime = System.currentTimeMillis();
    // Key point: send the AMLauncherEventType.LAUNCH event to launch the AM Container
    eventHandler.handle(new AMLauncherEvent(AMLauncherEventType.LAUNCH, this));
  }

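At this point the launch of the AM Container is finally in sight. But how exactly is it launched? Let's continue.

2.3 Launching the AM

The AMLauncherEventType.LAUNCH event sent above is the key entry point for launching the AM. Who handles this event? For that we need to look at the ApplicationMasterLauncher class. Let's start with its basic fields.

// Location: org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java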
public class ApplicationMasterLauncher extends AbstractService implements
    EventHandler<AMLauncherEvent> {
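  // Thread pool whose worker threads execute the queued AM events
  private ThreadPoolExecutor launcherPool;
  // Dedicated thread that takes AM LAUNCH and CLEANUP events off the queue
  private LauncherThread launcherHandlingThread;

  // Queue that receives AM events and holds them until they are processed
  private final BlockingQueue<Runnable> masterEvents
    = new LinkedBlockingQueue<Runnable>();
  // ResourceManager context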
  protected final RMContext context;
  
  public ApplicationMasterLauncher(RMContext context) {
    super(ApplicationMasterLauncher.class.getName());
    this.context = context;
    // Create the event-handling thread
    this.launcherHandlingThread = new LauncherThread();
  }
  
  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    int threadCount = conf.getInt(
        YarnConfiguration.RM_AMLAUNCHER_THREAD_COUNT,
        YarnConfiguration.DEFAULT_RM_AMLAUNCHER_THREAD_COUNT);
    ThreadFactory tf = new ThreadFactoryBuilder()
        .setNameFormat("ApplicationMasterLauncher #%d")
        .build();
    // Initialize the thread pool
    launcherPool = new ThreadPoolExecutor(threadCount, threadCount, 1,
        TimeUnit.HOURS, new LinkedBlockingQueue<Runnable>());
    launcherPool.setThreadFactory(tf);

    // Some configuration initialization omitted
  }
}

 

This mainly sets up the execution environment: the dedicated event-handling thread launcherHandlingThread, the thread pool launcherPool, and the masterEvents queue that receives and holds AM events. ApplicationMasterLauncher handles two kinds of AM events: LAUNCH and CLEANUP.

// Location: org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
  @Override
  public synchronized void  handle(AMLauncherEvent appEvent) {
    // Get the type of the AMLauncherEvent
    AMLauncherEventType event = appEvent.getType();
    RMAppAttempt application = appEvent.getAppAttempt();
    switch (event) {
    // Handle the AM LAUNCH event
    case LAUNCH:
      launch(application);
      break;
    // Handle the AM CLEANUP event
    case CLEANUP:
      cleanup(application);
    default:
      break;
    }
  }

 

The AMLauncherEventType.LAUNCH event sent at the end of section 2.2 is handled right here. Taking LAUNCH as the example, let's look at the handling logic.

// Location: org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
  private void launch(RMAppAttempt application) {
    // Create an AMLauncher instance; AMLauncher implements Runnable
    Runnable launcher = createRunnableLauncher(application, 
        AMLauncherEventType.LAUNCH);
    // Add the event to the masterEvents queue
    masterEvents.add(launcher);
  }

  protected Runnable createRunnableLauncher(RMAppAttempt application, 
      AMLauncherEventType event) {
    Runnable launcher =
        new AMLauncher(context, application, event, getConfig());
    return launcher;
  }

 

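How is an event processed once it is in the queue? That is the job of the dedicated thread launcherHandlingThread: the queue acts as a message queue, and the thread consumes the events one by one.

// Location: org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java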
  private class LauncherThread extends Thread {
    
    public LauncherThread() {
      super("ApplicationMaster Launcher");
    }

    @Override
    public void run() {
      while (!this.isInterrupted()) {   // Loop until interrupted, handling event requests
        Runnable toLaunch;
        try {
          // Take an event from the event queue
          toLaunch = masterEvents.take();
          // Hand the event to the thread pool for execution
          launcherPool.execute(toLaunch);
        } catch (InterruptedException e) {
          LOG.warn(this.getClass().getName() + " interrupted. Returning.");
          return;
        }
      }
    }
  }   
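
The pattern used by LauncherThread, where one dedicated thread takes Runnables off a blocking queue and hands them to a worker pool, can be reproduced with the JDK alone. The sketch below is a minimal stand-alone version under stated assumptions: LauncherQueueSketch, enqueue, stop, and runDemo are hypothetical names, the pool comes from Executors.newFixedThreadPool rather than a hand-built ThreadPoolExecutor, and the tasks only increment a counter instead of contacting a NodeManager.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class LauncherQueueSketch {
    private final BlockingQueue<Runnable> masterEvents = new LinkedBlockingQueue<>();
    private final ExecutorService launcherPool;
    private final Thread launcherThread;

    LauncherQueueSketch(int threadCount) {
        launcherPool = Executors.newFixedThreadPool(threadCount);
        launcherThread = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    // Block until an event arrives, then hand it to a pool worker.
                    launcherPool.execute(masterEvents.take());
                } catch (InterruptedException e) {
                    return; // stop() interrupted the blocking take()
                }
            }
        }, "launcher-sketch");
    }

    void start() {
        launcherThread.start();
    }

    // Producers only enqueue; they never run the task themselves.
    void enqueue(Runnable task) {
        masterEvents.add(task);
    }

    void stop() {
        launcherThread.interrupt();
        try {
            launcherThread.join(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        launcherPool.shutdown();
    }

    // Enqueues taskCount counter tasks and returns how many were executed.
    static int runDemo(int taskCount) {
        LauncherQueueSketch launcher = new LauncherQueueSketch(2);
        AtomicInteger executed = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(taskCount);
        launcher.start();
        for (int i = 0; i < taskCount; i++) {
            launcher.enqueue(() -> {
                executed.incrementAndGet();
                done.countDown();
            });
        }
        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            launcher.stop();
        }
        return executed.get();
    }

    public static void main(String[] args) {
        System.out.println("executed tasks: " + runDemo(3));
    }
}
```

Separating the queue consumer from the worker pool means a slow task occupies only one pool thread while the consumer keeps dispatching, which is presumably why the launcher is structured this way.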

 

After an event is taken off the queue, the actual work is done by the AMLauncher class. AMLauncher is itself a Runnable, so we go straight to its run() method.

// Location: org/apache/hadoop/yarn/server/resourcemanager/amlauncher/AMLauncher.java
  public void run() {
    switch (eventType) {
    case LAUNCH:
      try {
        LOG.info("Launching master" + application.getAppAttemptId());
        // Call the launch() method
        launch();
        // Send the RMAppAttemptEventType.LAUNCHED event
        handler.handle(new RMAppAttemptEvent(application.getAppAttemptId(),
            RMAppAttemptEventType.LAUNCHED));
      } catch(Exception ie) {
        String message = "Error launching " + application.getAppAttemptId()
            + ". Got exception: " + StringUtils.stringifyException(ie);
        LOG.info(message);
        handler.handle(new RMAppAttemptLaunchFailedEvent(application
            .getAppAttemptId(), message));
      }
      break;
    case CLEANUP: // omitted
    default:
      LOG.warn("Received unknown event-type " + eventType + ". Ignoring.");
      break;
    }
  }

 

AMLauncher's launch() method makes an RPC call to the NodeManager to start the AM Container. Here the RM (through AMLauncher) talks to the NM over the ContainerManagementProtocol. After launch() completes, AMLauncher sends an RMAppAttemptEventType.LAUNCHED event to the dispatcher, which moves the AppAttempt state from ALLOCATED to LAUNCHED.

// Location: org/apache/hadoop/yarn/server/resourcemanager/amlauncher/AMLauncher.java
  private void launch() throws IOException, YarnException {
    connect();
    ContainerId masterContainerID = masterContainer.getId();
    ApplicationSubmissionContext applicationContext =
      application.getSubmissionContext();
    LOG.info("Setting up container " + masterContainer
        + " for AM " + application.getAppAttemptId());  
    ContainerLaunchContext launchContext =
        createAMContainerLaunchContext(applicationContext, masterContainerID);

    // Build the Container start request
    StartContainerRequest scRequest =
        StartContainerRequest.newInstance(launchContext,
          masterContainer.getContainerToken());
    List<StartContainerRequest> list = new ArrayList<StartContainerRequest>();
    list.add(scRequest);
    StartContainersRequest allRequests =
        StartContainersRequest.newInstance(list);
 
    // Key point: make the RPC call that starts the Container
    StartContainersResponse response =
        containerMgrProxy.startContainers(allRequests);
    if (response.getFailedRequests() != null
        && response.getFailedRequests().containsKey(masterContainerID)) {
      Throwable t =
          response.getFailedRequests().get(masterContainerID).deSerialize();
      parseAndThrowException(t);
    } else {
      LOG.info("Done launching container " + masterContainer + " for AM "
          + application.getAppAttemptId());
    }
  }

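At this point the Container that runs the ApplicationMaster has been started. The Container launch logic itself is not analyzed here. Once the AM Container is up on its NodeManager, it starts the ApplicationMaster process according to its launch context. This completes the first step of the ApplicationMaster lifecycle, the ApplicationMaster startup.

III. ApplicationMaster Registration/Heartbeat and Resource Request Flow

This part describes what the ApplicationMaster does when it starts, how it registers with and heartbeats to the ResourceManager, and how it requests Container resources.

3.1 ApplicationMaster Registration/Heartbeat Flow

Starting the Container triggers the start of the ApplicationMaster process, so we take the main() method of the ApplicationMaster class (here the one from the distributedshell example application) as the entry point and see what it does at startup.

// Location: org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java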
  public static void main(String[] args) {
    boolean result = false;
    try {
      ApplicationMaster appMaster = new ApplicationMaster();
      LOG.info("Initializing ApplicationMaster");
      boolean doRun = appMaster.init(args);
      if (!doRun) {
        System.exit(0);
      }
      // run() is the core of AM startup
      appMaster.run();
      result = appMaster.finish();
    } catch (Throwable t) {
      LOG.fatal("Error running ApplicationMaster", t);
      LogManager.shutdown();
      ExitUtil.terminate(1, t);
    }
    if (result) {
      LOG.info("Application Master completed successfully. exiting");
      System.exit(0);
    } else {
      LOG.info("Application Master failed. exiting");
      System.exit(2);
    }
  }

 

run() is the core entry point of AM startup. It initializes the RPC client instances and starts registering with, and heartbeating to, the RM.

// Location: org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
  public void run() throws YarnException, IOException {
    LOG.info("Starting ApplicationMaster");

    // Token checks omitted

    // Initialize the AMRMClientAsync instance used for AM-RM communication
    AMRMClientAsync.CallbackHandler allocListener = new RMCallbackHandler();
    amRMClient = AMRMClientAsync.createAMRMClientAsync(1000, allocListener);
    amRMClient.init(conf);
    amRMClient.start();

    // Initialize the NMClientAsync instance used for AM-NM communication
    containerListener = createNMCallbackHandler();
    nmClientAsync = new NMClientAsyncImpl(containerListener);
    nmClientAsync.init(conf);
    nmClientAsync.start();

    // Key point: the AM registers with the RM; this also starts sending heartbeats to the RM
    appMasterHostname = NetUtils.getHostname();
    RegisterApplicationMasterResponse response = amRMClient
        .registerApplicationMaster(appMasterHostname, appMasterRpcPort,
            appMasterTrackingUrl);
    
    // Resource limit checks and the recording of Container state omitted
  }

 

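The AM talks to the RM over RPC through the ApplicationMasterService service. Before looking at the server-side registerApplicationMaster method, let's look at the client-side registration method.

// Location: org/apache/hadoop/yarn/client/api/async/impl/AMRMClientAsyncImpl.java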
  public RegisterApplicationMasterResponse registerApplicationMaster(
      String appHostName, int appHostPort, String appTrackingUrl)
      throws YarnException, IOException {
    // Register the AM
    RegisterApplicationMasterResponse response = client
        .registerApplicationMaster(appHostName, appHostPort, appTrackingUrl);
    // Start the AM heartbeat thread
    heartbeatThread.start();
    return response;
  }

 

先来看看 AM 注册过程。注册时做了两件事,一个是更新 AM 在 AMLivelinessMonitor 中的最新事件,另一个是发送 RMAppAttemptEventType.REGISTERED 事件,触发 AMRegisteredTransition 状态机,并将 AppAttempt 状态从 LAUNCHED 转换为 RUNNING。

// Location: org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
  @Override
  public RegisterApplicationMasterResponse registerApplicationMaster(
      RegisterApplicationMasterRequest request) throws YarnException,
      IOException {

    // omitted

    // Allow only one thread in AM to do registerApp at a time.
    synchronized (lock) {
      // omitted

      // Update the AM's most recent heartbeat time in AMLivelinessMonitor
      this.amLivelinessMonitor.receivedPing(applicationAttemptId);
      RMApp app = this.rmContext.getRMApps().get(appID);
      
      // Setting the response id to 0 to identify if the
      // application master is register for the respective attemptid
      lastResponse.setResponseId(0);
      lock.setAllocateResponse(lastResponse);

      // Key registration logic: send the RMAppAttemptEventType.REGISTERED event
      LOG.info("AM registration " + applicationAttemptId);
      this.rmContext
        .getDispatcher()
        .getEventHandler()
        .handle(
          new RMAppAttemptRegistrationEvent(applicationAttemptId, request
            .getHost(), request.getRpcPort(), request.getTrackingUrl()));
      RMAuditLogger.logSuccess(app.getUser(), AuditConstants.REGISTER_AM,
        "ApplicationMasterService", appID, applicationAttemptId);

      // omitted
    }
  }
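
The two effects of registration described above can be modeled in a few lines. The following is a minimal sketch, not the Hadoop implementation: the class AmRegistrationSketch, its nested LivelinessMonitor, and the string attempt ids are all hypothetical, and the state change is applied synchronously here, whereas the RM dispatches an event to the attempt's state machine.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of AM registration on the RM side; all names are hypothetical. */
class AmRegistrationSketch {

    /** Hypothetical subset of attempt states, enough to show LAUNCHED -> RUNNING. */
    enum AttemptState { LAUNCHED, RUNNING }

    /** Remembers the last ping time per attempt, as a liveliness monitor would. */
    static class LivelinessMonitor {
        private final Map<String, Long> lastPing = new ConcurrentHashMap<>();
        private final long expiryMs;

        LivelinessMonitor(long expiryMs) { this.expiryMs = expiryMs; }

        void receivedPing(String attemptId, long nowMs) { lastPing.put(attemptId, nowMs); }

        /** An attempt counts as expired if it never pinged or its last ping is too old. */
        boolean isExpired(String attemptId, long nowMs) {
            Long last = lastPing.get(attemptId);
            return last == null || nowMs - last > expiryMs;
        }
    }

    private final LivelinessMonitor monitor;
    private final Map<String, AttemptState> states = new ConcurrentHashMap<>();

    AmRegistrationSketch(LivelinessMonitor monitor) { this.monitor = monitor; }

    void launched(String attemptId) { states.put(attemptId, AttemptState.LAUNCHED); }

    /** Step 1: refresh the ping time. Step 2: move the attempt to RUNNING. */
    synchronized void register(String attemptId, long nowMs) {
        if (states.get(attemptId) != AttemptState.LAUNCHED) {
            throw new IllegalStateException("attempt not in LAUNCHED state: " + attemptId);
        }
        monitor.receivedPing(attemptId, nowMs);
        states.put(attemptId, AttemptState.RUNNING);
    }

    AttemptState stateOf(String attemptId) { return states.get(attemptId); }

    public static void main(String[] args) {
        LivelinessMonitor monitor = new LivelinessMonitor(10_000);
        AmRegistrationSketch rm = new AmRegistrationSketch(monitor);
        rm.launched("attempt_1");
        rm.register("attempt_1", 1_000);
        System.out.println(rm.stateOf("attempt_1"));
        System.out.println(monitor.isExpired("attempt_1", 5_000));
        System.out.println(monitor.isExpired("attempt_1", 20_000));
    }
}
```

Passing the clock value in as a parameter keeps the sketch deterministic; a real monitor would read the system clock and check for expired attempts on its own thread.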

 

Next, the AM heartbeat reporting flow. heartbeatThread is a dedicated thread that sends the AM's heartbeats; it is initialized as follows.

// Location: org/apache/hadoop/yarn/client/api/async/impl/AMRMClientAsyncImpl.java
public class AMRMClientAsyncImpl<T extends ContainerRequest> 
extends AMRMClientAsync<T> {
  
  // AM heartbeat thread object
  private final HeartbeatThread heartbeatThread;
  
  @Private
  @VisibleForTesting
  public AMRMClientAsyncImpl(AMRMClient<T> client, int intervalMs,
      CallbackHandler callbackHandler) {
    super(client, intervalMs, callbackHandler);
    // Create the AM heartbeat thread instance
    heartbeatThread = new HeartbeatThread();
  }

  // ... rest of the class omitted
}

 

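After registering with the RM, the AM periodically communicates with the RM through the RPC method ApplicationMasterProtocol#allocate(), which serves three purposes:

  • request resources;
  • fetch newly allocated resources;
  • act as a periodic heartbeat that tells the RM the AM is still alive.

// Location: org/apache/hadoop/yarn/client/api/async/impl/AMRMClientAsyncImpl.java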
  private class HeartbeatThread extends Thread {
    public HeartbeatThread() {
      super("AMRM Heartbeater thread");
    }
    
    public void run() {
      while (true) {   // the heartbeat thread loops until it is told to stop
        AllocateResponse response = null;
        // synchronization ensures we don't send heartbeats after unregistering
        synchronized (unregisterHeartbeatLock) {
          if (!keepRunning) {
            return;
          }

          try {
            // Key step: the heartbeat is simply a periodic call to allocate()
            response = client.allocate(progress);
          } catch (ApplicationAttemptNotFoundException e) {
            handler.onShutdownRequest();
            LOG.info("Shutdown requested. Stopping callback.");
            return;
          } catch (Throwable ex) {
            LOG.error("Exception on heartbeat", ex);
            savedException = ex;
            // interrupt handler thread in case it waiting on the queue
            handlerThread.interrupt();
            return;
          }
          if (response != null) {
            while (true) {
              try {
                responseQueue.put(response);
                break;
              } catch (InterruptedException ex) {
                LOG.debug("Interrupted while waiting to put on response queue", ex);
              }
            }
          }
        }
        try {
          Thread.sleep(heartbeatIntervalMs.get());
        } catch (InterruptedException ex) {
          LOG.debug("Heartbeater interrupted", ex);
        }
      }
    }
  }
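
The loop above has a common producer/consumer shape: one thread makes the periodic call and queues each response for a separate handler. Below is a minimal sketch of that shape under stated assumptions: HeartbeatSketch, the Supplier standing in for client.allocate(progress), and the String responses are hypothetical, and error handling is reduced to the bare minimum.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Toy heartbeat loop: one thread calls periodically and queues each response. */
class HeartbeatSketch {
    private final Supplier<String> allocateCall;   // hypothetical stand-in for client.allocate(progress)
    private final long intervalMs;
    private final BlockingQueue<String> responseQueue = new LinkedBlockingQueue<>();
    private final Object lock = new Object();
    private volatile boolean keepRunning = true;
    private final Thread thread;

    HeartbeatSketch(Supplier<String> allocateCall, long intervalMs) {
        this.allocateCall = allocateCall;
        this.intervalMs = intervalMs;
        this.thread = new Thread(this::loop, "heartbeat-sketch");
        this.thread.setDaemon(true);
    }

    void start() { thread.start(); }

    /** Flips the flag under the same lock as the call, so no heartbeat is sent after stop() returns. */
    void stop() {
        synchronized (lock) { keepRunning = false; }
        thread.interrupt();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    boolean isAlive() { return thread.isAlive(); }

    /** Waits up to timeoutMs for the next queued response; returns null on timeout. */
    String nextResponse(long timeoutMs) {
        try {
            return responseQueue.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    private void loop() {
        while (true) {
            synchronized (lock) {
                if (!keepRunning) {
                    return;
                }
                String response = allocateCall.get();
                if (response != null) {
                    responseQueue.add(response);   // unbounded queue, never blocks
                }
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                // woken early; the next iteration re-checks keepRunning
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger();
        HeartbeatSketch hb = new HeartbeatSketch(() -> "response-" + counter.incrementAndGet(), 10);
        hb.start();
        System.out.println(hb.nextResponse(1_000));
        System.out.println(hb.nextResponse(1_000));
        hb.stop();
    }
}
```

Holding the lock around both the flag check and the call mirrors the purpose of unregisterHeartbeatLock in the excerpt: once the stopping side has flipped the flag, no further heartbeat can slip out.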

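At this point the AM has registered with the RM and started its periodic heartbeat, which is implemented by calling ApplicationMasterProtocol#allocate() periodically. Once the heartbeat starts, the AM regularly requests resources from the RM so that Container processes can be launched on the corresponding NodeManagers; the next part covers this in detail.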

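3.2 ApplicationMaster resource request and allocation flow

Resource requests and allocations for an ApplicationMaster are always expressed in terms of Containers, so the discussion below focuses on requesting and allocating Containers.

(1) Container allocation and request flow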

 

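Figure: Container allocation and request flow

As the figure shows, once the application's ApplicationMaster has started on an NM and registered with the RM, the path from requesting resources (Containers) from the RM to actually obtaining them has two phases:

Phase 1: the ApplicationMaster reports its resource needs and collects the resources already allocated to it.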

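  1. The ApplicationMaster reports its resource needs to the RM through the RPC method ApplicationMasterProtocol#allocate (because the call is periodic, it is also called the "heartbeat"). The report includes new resource requests, the list of Containers to release, the nodes to add to the blacklist, and the nodes to remove from the blacklist.
  2. The ApplicationMasterService in the RM handles requests from the ApplicationMaster. On receiving one, it sends an RMAppAttemptEventType.STATUS_UPDATE event to RMAppAttemptImpl, which then updates the application's progress and the application's last-update time recorded in AMLivelinessMonitor.
  3. ApplicationMasterService calls ResourceScheduler#allocate to pass the ApplicationMaster's resource needs to the ResourceScheduler.
  4. The ResourceScheduler first reads the list of Containers to release and sends an RMContainerEventType.RELEASED event to each corresponding RMContainerImpl to kill the running Container. It then records the new resource needs in the corresponding data structures and returns the resources already allocated to the application.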

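Phase 2: the NM reports the status of its Containers to the RM. If the RM finds free resources on that node, it runs an allocation pass and stores the result in the RM's data structures, where it waits to be collected on the ApplicationMaster's next heartbeat.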

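  1. The NM reports the status of its Containers to the RM through the RPC method ResourceTracker#nodeHeartbeat.
  2. The ResourceTrackerService in the RM handles requests from NMs. On receiving one, it sends an RMNodeEventType.STATUS_UPDATE event to RMNodeImpl, which updates the status of each Container and then sends a SchedulerEventType.NODE_UPDATE event to the ResourceScheduler.
  3. On receiving the event, the ResourceScheduler assigns any free resources on that node to applications. The assignment is only recorded in the corresponding data structures; the ApplicationMaster collects it on its next heartbeat.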
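
The key point of the two phases is that assignment and pickup are decoupled: a node heartbeat assigns Containers, and a later AM heartbeat collects them. The following toy model is a sketch of that decoupling only. TwoPhaseAllocationSketch, its method names, and the slot-based accounting are hypothetical simplifications and do not reflect how any Yarn scheduler chooses between applications.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the two-phase flow; names and accounting are illustrative only. */
class TwoPhaseAllocationSketch {
    // app -> number of containers still wanted
    private final Map<String, Integer> pendingAsks = new HashMap<>();
    // app -> containers assigned but not yet picked up by the AM
    private final Map<String, List<String>> newlyAllocated = new HashMap<>();
    private int nextContainerId = 1;

    /** Phase 1: the AM heartbeat records new asks and collects what was already assigned. */
    synchronized List<String> amHeartbeat(String app, int additionalAsk) {
        pendingAsks.merge(app, additionalAsk, Integer::sum);
        List<String> ready = newlyAllocated.remove(app);
        return ready == null ? new ArrayList<>() : ready;
    }

    /** Phase 2: a node heartbeat with free slots assigns containers but only stores them. */
    synchronized void nodeHeartbeat(String node, int freeSlots) {
        for (Map.Entry<String, Integer> entry : pendingAsks.entrySet()) {
            while (freeSlots > 0 && entry.getValue() > 0) {
                String container = "container_" + (nextContainerId++) + "@" + node;
                newlyAllocated.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(container);
                entry.setValue(entry.getValue() - 1);
                freeSlots--;
            }
        }
    }

    public static void main(String[] args) {
        TwoPhaseAllocationSketch rm = new TwoPhaseAllocationSketch();
        System.out.println(rm.amHeartbeat("app_1", 2)); // nothing assigned yet
        rm.nodeHeartbeat("node_a", 1);                  // one container assigned and stored
        System.out.println(rm.amHeartbeat("app_1", 0)); // collected on the next AM heartbeat
    }
}
```

The first AM heartbeat returns an empty list even though it asked for two Containers, which is why an AM has to keep heartbeating until its asks are satisfied.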

 

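(2) Source code analysis

On the client side, the heartbeat thread in AMRMClientAsyncImpl calls client.allocate(), which reports the resource needs to the RM over RPC. The communication interface is defined by the ApplicationMasterProtocol protocol; let's see how its server-side implementation requests resources on behalf of the client.

// Location: org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java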
  @Override
  public AllocateResponse allocate(AllocateRequest request)
      throws YarnException, IOException {

    synchronized (lock) {
      AllocateResponse lastResponse = lock.getAllocateResponse();

      // Send a STATUS_UPDATE event to update the AppAttempt status
      this.rmContext.getDispatcher().getEventHandler().handle(
          new RMAppAttemptStatusupdateEvent(appAttemptId, request
              .getProgress()));             
      // Normalize and validate the resource asks (memory and vcores) against the scheduler's maximum
      try {
        RMServerUtils.normalizeAndValidateRequests(ask,
            rScheduler.getMaximumResourceCapability(), app.getQueue(),
            rScheduler);
      } catch (InvalidResourceRequestException e) {
        LOG.warn("Invalid resource ask by application " + appAttemptId, e);
        throw e;
      }
      
      // Key step: call the scheduler's allocate() to hand over the resource asks
      Allocation allocation =
          this.rScheduler.allocate(appAttemptId, ask, release, 
              blacklistAdditions, blacklistRemovals);

      // Update the response and the AMRMToken state; details omitted

      return allocateResponse;
    }    
  }
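
The validation step in the method above rejects asks that the cluster could never satisfy before they reach the scheduler. The sketch below illustrates the idea only. AskValidationSketch and InvalidAskException are hypothetical, and the exact checks are an assumption: it simply treats an ask as invalid when its memory or vcores are negative or exceed a given per-container maximum.

```java
/** Toy validation of a resource ask against a maximum; not Hadoop's RMServerUtils. */
class AskValidationSketch {

    /** Hypothetical stand-in for an invalid-request exception. */
    static class InvalidAskException extends RuntimeException {
        InvalidAskException(String message) { super(message); }
    }

    /** Rejects asks that are negative or larger than the per-container maximum. */
    static void validate(int memoryMb, int vcores, int maxMemoryMb, int maxVcores) {
        if (memoryMb < 0 || memoryMb > maxMemoryMb) {
            throw new InvalidAskException("memory " + memoryMb + " outside [0, " + maxMemoryMb + "]");
        }
        if (vcores < 0 || vcores > maxVcores) {
            throw new InvalidAskException("vcores " + vcores + " outside [0, " + maxVcores + "]");
        }
    }

    public static void main(String[] args) {
        validate(2048, 2, 8192, 4);          // accepted
        try {
            validate(16384, 2, 8192, 4);     // rejected: exceeds the memory maximum
        } catch (InvalidAskException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing fast here means the AM sees the problem as an exception on its allocate() call instead of waiting indefinitely for a Container that can never be assigned.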

 

ApplicationMasterService#allocate() calls the allocate() method of the YarnScheduler. Since FairScheduler is the scheduler in use here, we analyze FairScheduler#allocate(). The heart of the allocation path is pullNewlyAllocatedContainersAndNMTokens(), which takes Containers out of the newlyAllocatedContainers List. Where do those Containers come from? They are the Containers assigned by the resource allocation logic that runs on each NodeManager heartbeat. They are kept in the RM's in-memory newlyAllocatedContainers structure, and the AM simply takes them from there.

// Location: org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
  @Override
  public Allocation allocate(ApplicationAttemptId appAttemptId,
      List<ResourceRequest> ask, List<ContainerId> release,
      List<String> blacklistAdditions, List<String> blacklistRemovals) {

    // Normalize the resource requests
    SchedulerUtils.normalizeRequests(ask, DOMINANT_RESOURCE_CALCULATOR,
        getClusterResource(), minimumAllocation, getMaximumResourceCapability(),
        incrAllocation);

    // Record the time at which the Container request was made
    application.recordContainerRequestTime(getClock().getTime());

    // Release containers
    releaseContainers(release, application);

    synchronized (application) {
      if (!ask.isEmpty()) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("allocate: pre-update" +
              " applicationAttemptId=" + appAttemptId +
              " application=" + application.getApplicationId());
        }
        application.showRequests();

        // Update application requests
        application.updateResourceRequests(ask);

        application.showRequests();
      }
      // ... (omitted)

      // Key step: authorize the Containers and fetch those previously assigned to the AppAttempt.
      // They are held in RM memory and were assigned by assignContainer(); see Yarn's resource allocation logic for details.
      ContainersAndNMTokensAllocation allocation =
          application.pullNewlyAllocatedContainersAndNMTokens();

      // Record container allocation time
      if (!(allocation.getContainerList().isEmpty())) {
        application.recordContainerAllocationTime(getClock().getTime());
      }

      // Return the allocated Containers to the client (the AM)
      return new Allocation(allocation.getContainerList(),
        application.getHeadroom(), preemptionContainerIds, null, null,
        allocation.getNMTokenList());
    }
  }
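
The first step of the method above normalizes each ask before it is recorded. As a rough illustration of what such normalization typically means, here is a minimal sketch for a single dimension (memory in MB). NormalizeSketch and its parameters are hypothetical, and the rule it applies (raise to the minimum, round up to a multiple of the increment, cap at the maximum) is an assumption about the general technique, not a copy of SchedulerUtils.normalizeRequests.

```java
/** Toy normalization of a single resource dimension; names are hypothetical. */
class NormalizeSketch {

    /** Raise to the minimum, round up to a multiple of the increment, cap at the maximum. */
    static int normalize(int requested, int minimum, int increment, int maximum) {
        int atLeastMin = Math.max(requested, minimum);
        int roundedUp = ((atLeastMin + increment - 1) / increment) * increment;
        return Math.min(roundedUp, maximum);
    }

    public static void main(String[] args) {
        System.out.println(normalize(1500, 1024, 512, 8192)); // 1536
        System.out.println(normalize(100, 1024, 512, 8192));  // 1024
        System.out.println(normalize(9000, 1024, 512, 8192)); // 8192
    }
}
```

Rounding up means an AM can receive a slightly larger Container than it asked for, which is worth keeping in mind when reading allocation logs.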

At this point, the AM's periodic heartbeat has obtained its Containers. How a Container is launched after that depends on the type of job, so it is not covered in detail here.

 


posted @ 2020-08-26 17:29 笨小康u