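[Yarn Source Code Analysis] ApplicationMaster Source Code Analysis
This article walks through how the ApplicationMaster runs and analyzes the relevant source code from three angles: ApplicationMaster startup, registration/heartbeat, and Container resource request and allocation. Much of the article is devoted to the startup process, covering application submission, the App/Attempt state transitions, and the actual launch of the ApplicationMaster. The goal is to help readers follow the whole path from submitting an application to launching its ApplicationMaster, and so understand Yarn's submission flow in more depth.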
1. Overall ApplicationMaster workflow
ApplicationMaster lifecycle
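ApplicationMaster management is built from three services: ApplicationMasterLauncher, AMLivelinessMonitor, and ApplicationMasterService. Together they manage the lifecycle of an application's ApplicationMaster. From creation to teardown, the flow is as follows:
- The user submits an application to the ResourceManager. On receiving the request, the ResourceManager first asks the resource scheduler for the resources needed to start the ApplicationMaster. Once they are granted, ApplicationMasterLauncher contacts the corresponding NodeManager to launch the application's ApplicationMaster.
- After the ApplicationMaster has been launched, ApplicationMasterLauncher registers it with AMLivelinessMonitor through an event, which starts heartbeat monitoring.
- Once running, the ApplicationMaster registers with ApplicationMasterService and reports its host, port, and related information.
- While running, the ApplicationMaster periodically sends a "heartbeat" to ApplicationMasterService (the heartbeat carries the description of the resources it wants to request).
- Each time ApplicationMasterService receives a heartbeat, it notifies AMLivelinessMonitor to update the time of the application's most recent heartbeat.
- When the application finishes, the ApplicationMaster sends a request to ApplicationMasterService to unregister itself.
- On receiving the unregister request, ApplicationMasterService marks the application as finished and notifies AMLivelinessMonitor to stop monitoring its heartbeat.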
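The monitoring part of the lifecycle above can be illustrated with a small sketch. It is not the Hadoop AMLivelinessMonitor class: the class name `LivenessMonitorSketch`, its methods, and the explicit `nowMs` clock parameter are all assumptions made for illustration. It only shows the idea of recording the last heartbeat time per attempt and reporting attempts that have been silent for too long.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, simplified stand-in for a liveness monitor (not the Hadoop class). */
public class LivenessMonitorSketch {
    private final long expireIntervalMs;
    // attempt id -> timestamp (ms) of the most recent heartbeat
    private final Map<String, Long> lastPing = new ConcurrentHashMap<>();

    public LivenessMonitorSketch(long expireIntervalMs) {
        this.expireIntervalMs = expireIntervalMs;
    }

    /** Called once the AM has been launched: start monitoring it. */
    public void register(String attemptId, long nowMs) {
        lastPing.put(attemptId, nowMs);
    }

    /** Called on registration and on every heartbeat; unknown attempts are ignored. */
    public void receivedPing(String attemptId, long nowMs) {
        lastPing.computeIfPresent(attemptId, (id, old) -> nowMs);
    }

    /** Called when the AM unregisters: stop monitoring it. */
    public void unregister(String attemptId) {
        lastPing.remove(attemptId);
    }

    /** Returns the attempts whose last heartbeat is older than the expiry interval. */
    public List<String> findExpired(long nowMs) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastPing.entrySet()) {
            if (nowMs - e.getValue() > expireIntervalMs) {
                expired.add(e.getKey());
            }
        }
        return expired;
    }
}
```

Passing the clock in as a parameter keeps the sketch deterministic. A real monitor would typically read the system time itself and run the expiry check on a background thread.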
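With this lifecycle in mind, we analyze the source code from three angles: ApplicationMaster startup, registration/heartbeat, and resource requests.
2. ApplicationMaster startup flow
This part covers the first step of the ApplicationMaster lifecycle, its startup. To make the whole execution flow easier to follow, we do not jump straight to the ApplicationMaster launcher class. Instead we start from application submission, move on to the APP/Attempt state transitions (the state changes an application goes through before its ApplicationMaster starts), and finish with the actual ApplicationMaster launch. This gives a fuller picture of Yarn's submission flow.
2.1 Application submission
Whatever its type, an application enters Yarn through the YarnClient API. The method that submits it is submitApplication().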
// Location: org/apache/hadoop/yarn/client/api/YarnClient.java
public abstract ApplicationId submitApplication(
    ApplicationSubmissionContext appContext) throws YarnException, IOException;
Here is the submission entry point in its implementation class:
// Location: org/apache/hadoop/yarn/client/api/impl/YarnClientImpl.java
@Override
public ApplicationId submitApplication(ApplicationSubmissionContext appContext)
    throws YarnException, IOException {
  ApplicationId applicationId = appContext.getApplicationId();
  if (applicationId == null) {
    throw new ApplicationIdNotProvidedException(
        "ApplicationId is not provided in ApplicationSubmissionContext");
  }
  // Build the submission request that carries the application context
  SubmitApplicationRequest request =
      Records.newRecord(SubmitApplicationRequest.class);
  request.setApplicationSubmissionContext(appContext);

  // Automatically add the timeline DT into the CLC
  // Only when the security and the timeline service are both enabled
  if (isSecurityEnabled() && timelineServiceEnabled) {
    addTimelineDelegationToken(appContext.getAMContainerSpec());
  }

  // The client actually submits the application here
  rmClient.submitApplication(request);

  while (true) {
    // Keep retrying for applications that were not submitted in time (body omitted)
  }

  return applicationId;
}
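The Yarn client talks to the RM over RPC through the ClientRMService service. Once the application reaches the server side, ClientRMService calls the corresponding RMAppManager method to handle it.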
// Location: org/apache/hadoop/yarn/server/resourcemanager/ClientRMService.java
@Override
public SubmitApplicationResponse submitApplication(
    SubmitApplicationRequest request) throws YarnException {
  ApplicationSubmissionContext submissionContext =
      request.getApplicationSubmissionContext();
  ApplicationId applicationId = submissionContext.getApplicationId();

  // Validation checks omitted

  try {
    // Key step: call RMAppManager to submit the application
    rmAppManager.submitApplication(submissionContext,
        System.currentTimeMillis(), user);

    LOG.info("Application with id " + applicationId.getId() +
        " submitted by user " + user);
    RMAuditLogger.logSuccess(user, AuditConstants.SUBMIT_APP_REQUEST,
        "ClientRMService", applicationId);
  } catch (YarnException e) {
    LOG.info("Exception in submitting application with id " +
        applicationId.getId(), e);
    RMAuditLogger.logFailure(user, AuditConstants.SUBMIT_APP_REQUEST,
        e.getMessage(), "ClientRMService",
        "Exception in submitting application", applicationId);
    throw e;
  }

  SubmitApplicationResponse response = recordFactory
      .newRecordInstance(SubmitApplicationResponse.class);
  return response;
}
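2.2 APP/AppAttempt state transitions
RMAppManager#submitApplication() sends an RMAppEventType.START event through the RM's central dispatcher. As the code shows, this happens directly when security is disabled; with security enabled, the application is first handed to the DelegationTokenRenewer.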
// Location: org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java
protected void submitApplication(
    ApplicationSubmissionContext submissionContext, long submitTime,
    String user) throws YarnException {
  ApplicationId applicationId = submissionContext.getApplicationId();

  RMAppImpl application =
      createAndPopulateNewRMApp(submissionContext, submitTime, user, false);
  ApplicationId appId = submissionContext.getApplicationId();
  Credentials credentials = null;
  try {
    credentials = parseCredentials(submissionContext);
    if (UserGroupInformation.isSecurityEnabled()) {
      this.rmContext.getDelegationTokenRenewer().addApplicationAsync(appId,
          credentials, submissionContext.getCancelTokensWhenComplete(),
          application.getUser());
    } else {
      // Key step: dispatch the RMAppEventType.START event
      this.rmContext.getDispatcher().getEventHandler()
          .handle(new RMAppEvent(applicationId, RMAppEventType.START));
    }
  } catch (Exception e) {
    LOG.warn("Unable to parse credentials.", e);
    // Sending APP_REJECTED is fine, since we assume that the
    // RMApp is in NEW state and thus we haven't yet informed the
    // scheduler about the existence of the application
    assert application.getState() == RMAppState.NEW;
    this.rmContext.getDispatcher().getEventHandler()
        .handle(new RMAppRejectedEvent(applicationId, e.getMessage()));
    throw RPCUtil.getRemoteException(e);
  }
}
The RMAppEventType.START event has a matching transition in RMAppImpl, which moves the APP state from NEW to NEW_SAVING.
// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
// Transitions from NEW state
.addTransition(RMAppState.NEW, RMAppState.NEW_SAVING,
    RMAppEventType.START, new RMAppNewlySavingTransition())
What does the registered RMAppNewlySavingTransition do?
// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
private static final class RMAppNewlySavingTransition extends RMAppTransition {
  @Override
  public void transition(RMAppImpl app, RMAppEvent event) {
    // Persist the APP state
    LOG.info("Storing application with id " + app.applicationId);
    app.rmContext.getStateStore().storeNewApplication(app);
  }
}
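The transition persists the APP state by writing its metadata to the RM state store (ZK, when a ZooKeeper-based state store is configured).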
// Location: org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java
public void storeNewApplication(RMApp app) {
  ApplicationSubmissionContext context =
      app.getApplicationSubmissionContext();
  assert context instanceof ApplicationSubmissionContextPBImpl;
  ApplicationStateData appState = ApplicationStateData.newInstance(
      app.getSubmitTime(), app.getStartTime(), context, app.getUser());
  // Dispatch the RMStateStoreEventType.STORE_APP event
  dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState));
}
This dispatches an RMStateStoreEventType.STORE_APP event, which the RMStateStore state machine handles with the registered StoreAppTransition.
// Location: org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java
.addTransition(RMStateStoreState.ACTIVE,
    EnumSet.of(RMStateStoreState.ACTIVE, RMStateStoreState.FENCED),
    RMStateStoreEventType.STORE_APP, new StoreAppTransition())
StoreAppTransition sends an RMAppEventType.APP_NEW_SAVED event, which triggers the APP transition from NEW_SAVING to SUBMITTED.
// Location: org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java
private static class StoreAppTransition
    implements MultipleArcTransition<RMStateStore, RMStateStoreEvent,
        RMStateStoreState> {
  @Override
  public RMStateStoreState transition(RMStateStore store,
      RMStateStoreEvent event) {
    if (!(event instanceof RMStateStoreAppEvent)) {
      // should never happen
      LOG.error("Illegal event type: " + event.getClass());
      return RMStateStoreState.ACTIVE;
    }
    boolean isFenced = false;
    ApplicationStateData appState =
        ((RMStateStoreAppEvent) event).getAppState();
    ApplicationId appId =
        appState.getApplicationSubmissionContext().getApplicationId();
    LOG.info("Storing info for app: " + appId);
    try {
      store.storeApplicationStateInternal(appId, appState);
      // Key step: send the RMAppEventType.APP_NEW_SAVED event
      store.notifyApplication(new RMAppEvent(appId,
          RMAppEventType.APP_NEW_SAVED));
    } catch (Exception e) {
      LOG.error("Error storing app: " + appId, e);
      isFenced = store.notifyStoreOperationFailedInternal(e);
    }
    return finalState(isFenced);
  };
}
In RMAppImpl, the APP_NEW_SAVED event is bound to the NEW_SAVING to SUBMITTED transition, which invokes AddApplicationToSchedulerTransition.
// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
.addTransition(RMAppState.NEW_SAVING, RMAppState.SUBMITTED,
    RMAppEventType.APP_NEW_SAVED, new AddApplicationToSchedulerTransition())
AddApplicationToSchedulerTransition sends a SchedulerEventType.APP_ADDED event to the scheduler.
// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
// (the event class itself lives in .../scheduler/event/AppAddedSchedulerEvent.java)
private static final class AddApplicationToSchedulerTransition extends
    RMAppTransition {
  @Override
  public void transition(RMAppImpl app, RMAppEvent event) {
    // Send the SchedulerEventType.APP_ADDED event to the scheduler
    app.handler.handle(new AppAddedSchedulerEvent(app.applicationId,
        app.submissionContext.getQueue(), app.user,
        app.submissionContext.getReservationID()));
  }
}
AppAddedSchedulerEvent extends SchedulerEvent, so the event is handled by the scheduler, which is FairScheduler in this analysis. Here is the relevant part of its handle() method.
// Location: org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
@Override
public void handle(SchedulerEvent event) {
  switch (event.getType()) {
    case NODE_ADDED:             // omitted
    case NODE_REMOVED:           // omitted
    case NODE_UPDATE:            // omitted
    case APP_ADDED:
      if (!(event instanceof AppAddedSchedulerEvent)) {
        throw new RuntimeException("Unexpected event type: " + event);
      }
      AppAddedSchedulerEvent appAddedEvent = (AppAddedSchedulerEvent) event;
      // APP_ADDED handling logic
      addApplication(appAddedEvent.getApplicationId(),
          appAddedEvent.getQueue(), appAddedEvent.getUser(),
          appAddedEvent.getIsAppRecovering());
      break;
    case APP_REMOVED:            // omitted
    case NODE_RESOURCE_UPDATE:   // omitted
    case APP_ATTEMPT_ADDED:      // omitted
    case APP_ATTEMPT_REMOVED:    // omitted
    case CONTAINER_EXPIRED:      // omitted
    case CONTAINER_RESCHEDULED:  // omitted
    default:
      LOG.error("Unknown event arrived at FairScheduler: " + event.toString());
  }
}
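addApplication() runs some preliminary checks on the submission, such as whether the queue name is valid and whether the user has access to the queue. If the checks pass, it sends an RMAppEventType.APP_ACCEPTED event.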
// Location: org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
protected synchronized void addApplication(ApplicationId applicationId,
    String queueName, String user, boolean isAppRecovering) {
  // Validate the target queue name
  if (queueName == null || queueName.isEmpty()) {
    String message = "Reject application " + applicationId +
        " submitted by user " + user + " with an empty queue name.";
    LOG.info(message);
    rmContext.getDispatcher().getEventHandler()
        .handle(new RMAppRejectedEvent(applicationId, message));
    return;
  }

  if (queueName.startsWith(".") || queueName.endsWith(".")) {
    String message = "Reject application " + applicationId
        + " submitted by user " + user + " with an illegal queue name "
        + queueName + ". "
        + "The queue name cannot start/end with period.";
    LOG.info(message);
    rmContext.getDispatcher().getEventHandler()
        .handle(new RMAppRejectedEvent(applicationId, message));
    return;
  }

  RMApp rmApp = rmContext.getRMApps().get(applicationId);
  FSLeafQueue queue = assignToQueue(rmApp, queueName, user);
  if (queue == null) {
    return;
  }

  // Check the queue ACLs
  UserGroupInformation userUgi = UserGroupInformation.createRemoteUser(user);
  if (!queue.hasAccess(QueueACL.SUBMIT_APPLICATIONS, userUgi)
      && !queue.hasAccess(QueueACL.ADMINISTER_QUEUE, userUgi)) {
    String msg = "User " + userUgi.getUserName() +
        " cannot submit applications to queue " + queue.getName();
    LOG.info(msg);
    rmContext.getDispatcher().getEventHandler()
        .handle(new RMAppRejectedEvent(applicationId, msg));
    return;
  }

  SchedulerApplication<FSAppAttempt> application =
      new SchedulerApplication<FSAppAttempt>(queue, user);
  applications.put(applicationId, application);
  queue.getMetrics().submitApp(user);

  LOG.info("Accepted application " + applicationId + " from user: " + user
      + ", in queue: " + queue.getName() + ", currently num of applications: "
      + applications.size());
  if (isAppRecovering) {
    // The APP is being recovered (the recovery case is not covered here)
    if (LOG.isDebugEnabled()) {
      LOG.debug(applicationId + " is recovering. Skip notifying APP_ACCEPTED");
    }
  } else {
    // Key step: send the RMAppEventType.APP_ACCEPTED event
    rmContext.getDispatcher().getEventHandler()
        .handle(new RMAppEvent(applicationId, RMAppEventType.APP_ACCEPTED));
  }
}
The RMAppEventType.APP_ACCEPTED event triggers StartAppAttemptTransition and moves the APP state from SUBMITTED to ACCEPTED.
// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
// Transitions from SUBMITTED state
.addTransition(RMAppState.SUBMITTED, RMAppState.ACCEPTED,
    RMAppEventType.APP_ACCEPTED, new StartAppAttemptTransition())
StartAppAttemptTransition sends an RMAppAttemptEventType.START event to start the AppAttempt.
// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
private static final class StartAppAttemptTransition extends RMAppTransition {
  @Override
  public void transition(RMAppImpl app, RMAppEvent event) {
    app.createAndStartNewAttempt(false);
  };
}

// Create and start the AppAttempt
private void
    createAndStartNewAttempt(boolean transferStateFromPreviousAttempt) {
  createNewAttempt();
  // Send the RMAppAttemptEventType.START event
  handler.handle(new RMAppStartAttemptEvent(currentAttempt.getAppAttemptId(),
      transferStateFromPreviousAttempt));
}
The RMAppAttemptEventType.START event invokes AttemptStartedTransition and moves the AppAttempt state from NEW to SUBMITTED.
// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
// Transitions from NEW State
.addTransition(RMAppAttemptState.NEW, RMAppAttemptState.SUBMITTED,
    RMAppAttemptEventType.START, new AttemptStartedTransition())
AttemptStartedTransition creates an AppAttemptAddedSchedulerEvent, that is, it sends a SchedulerEventType.APP_ATTEMPT_ADDED event to the scheduler.
// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
private static final class AttemptStartedTransition extends BaseTransition {
  @Override
  public void transition(RMAppAttemptImpl appAttempt,
      RMAppAttemptEvent event) {
    // Some checks omitted

    // Send the SchedulerEventType.APP_ATTEMPT_ADDED event to the scheduler
    appAttempt.eventHandler.handle(new AppAttemptAddedSchedulerEvent(
        appAttempt.applicationAttemptId, transferStateFromPreviousAttempt));
  }
}
AppAttemptAddedSchedulerEvent also extends SchedulerEvent. The handling logic for SchedulerEventType.APP_ATTEMPT_ADDED is again in the scheduler's handle() method.
// Location: org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
public void handle(SchedulerEvent event) {
  switch (event.getType()) {
    case NODE_ADDED:             // omitted
    case NODE_REMOVED:           // omitted
    case NODE_UPDATE:            // omitted
    case APP_ADDED:              // omitted
    case APP_REMOVED:            // omitted
    case NODE_RESOURCE_UPDATE:   // omitted
    case APP_ATTEMPT_ADDED:
      if (!(event instanceof AppAttemptAddedSchedulerEvent)) {
        throw new RuntimeException("Unexpected event type: " + event);
      }
      AppAttemptAddedSchedulerEvent appAttemptAddedEvent =
          (AppAttemptAddedSchedulerEvent) event;
      addApplicationAttempt(appAttemptAddedEvent.getApplicationAttemptId(),
          appAttemptAddedEvent.getTransferStateFromPreviousAttempt(),
          appAttemptAddedEvent.getIsAttemptRecovering());
      break;
    case APP_ATTEMPT_REMOVED:    // omitted
    case CONTAINER_EXPIRED:      // omitted
    case CONTAINER_RESCHEDULED:  // omitted
    default:
      LOG.error("Unknown event arrived at FairScheduler: " + event.toString());
  }
}
addApplicationAttempt() sends an RMAppAttemptEventType.ATTEMPT_ADDED event.
// Location: org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
protected synchronized void addApplicationAttempt(
    ApplicationAttemptId applicationAttemptId,
    boolean transferStateFromPreviousAttempt,
    boolean isAttemptRecovering) {
  // Preliminary checks and initialization omitted

  if (isAttemptRecovering) {
    if (LOG.isDebugEnabled()) {
      LOG.debug(applicationAttemptId
          + " is recovering. Skipping notifying ATTEMPT_ADDED");
    }
  } else {
    // Send the RMAppAttemptEventType.ATTEMPT_ADDED event
    rmContext.getDispatcher().getEventHandler().handle(
        new RMAppAttemptEvent(applicationAttemptId,
            RMAppAttemptEventType.ATTEMPT_ADDED));
  }
}
The RMAppAttemptEventType.ATTEMPT_ADDED event triggers ScheduleTransition, which moves the AppAttempt state from SUBMITTED to either LAUNCHED_UNMANAGED_SAVING or SCHEDULED.
// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
// Transitions from SUBMITTED state
.addTransition(RMAppAttemptState.SUBMITTED,
    EnumSet.of(RMAppAttemptState.LAUNCHED_UNMANAGED_SAVING,
               RMAppAttemptState.SCHEDULED),
    RMAppAttemptEventType.ATTEMPT_ADDED,
    new ScheduleTransition())
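Now look at ScheduleTransition. Its if statement checks whether the AM is unmanaged (getUnmanagedAM()). If it is, the RM does not allocate a container for the AM or launch it. The flag defaults to false, so the state returned here is SCHEDULED.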
// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
public static final class ScheduleTransition
    implements
    MultipleArcTransition<RMAppAttemptImpl, RMAppAttemptEvent, RMAppAttemptState> {
  @Override
  public RMAppAttemptState transition(RMAppAttemptImpl appAttempt,
      RMAppAttemptEvent event) {
    ApplicationSubmissionContext subCtx = appAttempt.submissionContext;
    if (!subCtx.getUnmanagedAM()) {
      // Some steps omitted

      // Request the AM Container from the scheduler (not explained further here)
      Allocation amContainerAllocation =
          appAttempt.scheduler.allocate(
              appAttempt.applicationAttemptId,
              appAttempt.amReqs,
              EMPTY_CONTAINER_RELEASE_LIST,
              amBlacklist.getAdditions(),
              amBlacklist.getRemovals());
      if (amContainerAllocation != null
          && amContainerAllocation.getContainers() != null) {
        assert (amContainerAllocation.getContainers().size() == 0);
      }
      // The returned state becomes the attempt's new state
      return RMAppAttemptState.SCHEDULED;
    } else {
      // save state and then go to LAUNCHED state
      appAttempt.storeAttempt();
      return RMAppAttemptState.LAUNCHED_UNMANAGED_SAVING;
    }
  }
}
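While the attempt is in the SCHEDULED state, an RMAppAttemptEventType.CONTAINER_ALLOCATED event (raised once a container has been allocated for the AM) moves the AppAttempt from SCHEDULED to ALLOCATED_SAVING. The transition that handles it is AMContainerAllocatedTransition.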
// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
// Transitions from SCHEDULED State
.addTransition(RMAppAttemptState.SCHEDULED,
    EnumSet.of(RMAppAttemptState.ALLOCATED_SAVING,
               RMAppAttemptState.SCHEDULED),
    RMAppAttemptEventType.CONTAINER_ALLOCATED,
    new AMContainerAllocatedTransition())
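AMContainerAllocatedTransition fetches the container allocated for the AM, stores the attempt, and returns RMAppAttemptState.ALLOCATED_SAVING as the new state (ALLOCATED_SAVING is a state, not an event).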
// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
private static final class AMContainerAllocatedTransition
    implements
    MultipleArcTransition<RMAppAttemptImpl, RMAppAttemptEvent, RMAppAttemptState> {
  @Override
  public RMAppAttemptState transition(RMAppAttemptImpl appAttempt,
      RMAppAttemptEvent event) {
    // Fetch the AM Container from the scheduler. No new request is passed to
    // allocate() here; the call only tries to pick up what was already allocated.
    Allocation amContainerAllocation =
        appAttempt.scheduler.allocate(appAttempt.applicationAttemptId,
            EMPTY_CONTAINER_REQUEST_LIST, EMPTY_CONTAINER_RELEASE_LIST, null,
            null);

    // If no container came back, retry fetching it and stay in SCHEDULED
    if (amContainerAllocation.getContainers().size() == 0) {
      appAttempt.retryFetchingAMContainer(appAttempt);
      return RMAppAttemptState.SCHEDULED;
    }

    // Set the masterContainer
    appAttempt.setMasterContainer(amContainerAllocation.getContainers()
        .get(0));
    RMContainerImpl rmMasterContainer = (RMContainerImpl) appAttempt.scheduler
        .getRMContainer(appAttempt.getMasterContainer().getId());
    rmMasterContainer.setAMContainer(true);
    appAttempt.rmContext.getNMTokenSecretManager()
        .clearNodeSetForAttempt(appAttempt.applicationAttemptId);
    appAttempt.getSubmissionContext().setResource(
        appAttempt.getMasterContainer().getResource());
    appAttempt.storeAttempt();
    // The attempt moves to the ALLOCATED_SAVING state
    return RMAppAttemptState.ALLOCATED_SAVING;
  }
}
From the ALLOCATED_SAVING state, the RMAppAttemptEventType.ATTEMPT_NEW_SAVED event (sent once the attempt has been stored) invokes AttemptStoredTransition and moves the AppAttempt from ALLOCATED_SAVING to ALLOCATED.
// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
// Transitions from ALLOCATED_SAVING State
.addTransition(RMAppAttemptState.ALLOCATED_SAVING,
    RMAppAttemptState.ALLOCATED,
    RMAppAttemptEventType.ATTEMPT_NEW_SAVED, new AttemptStoredTransition())
Next, let's see what AttemptStoredTransition does.
// Location: org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
private static final class AttemptStoredTransition extends BaseTransition {
  @Override
  public void transition(RMAppAttemptImpl appAttempt,
      RMAppAttemptEvent event) {
    // Launch the AppAttempt
    appAttempt.launchAttempt();
  }
}

private void launchAttempt() {
  launchAMStartTime = System.currentTimeMillis();
  // Key step: send AMLauncherEventType.LAUNCH to start the AM Container
  eventHandler.handle(new AMLauncherEvent(AMLauncherEventType.LAUNCH, this));
}
At this point the launch of the AM Container is finally in sight. How exactly is it started? Let's continue.
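The attempt transitions traced in this section can be summarized as a small table-driven state machine. This is a minimal sketch, not Yarn's StateMachineFactory: the class `AttemptStateMachineSketch` is hypothetical, it covers only the managed-AM happy path (including the LAUNCHED transition that the next section arrives at), and it omits the transition hooks and multi-arc transitions of the real implementation.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical sketch of the AppAttempt happy path as a transition table. */
public class AttemptStateMachineSketch {
    public enum State { NEW, SUBMITTED, SCHEDULED, ALLOCATED_SAVING, ALLOCATED, LAUNCHED }
    public enum Event { START, ATTEMPT_ADDED, CONTAINER_ALLOCATED, ATTEMPT_NEW_SAVED, LAUNCHED }

    // (current state, event) -> next state
    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.NEW;

    public AttemptStateMachineSketch() {
        addTransition(State.NEW, Event.START, State.SUBMITTED);
        addTransition(State.SUBMITTED, Event.ATTEMPT_ADDED, State.SCHEDULED);
        addTransition(State.SCHEDULED, Event.CONTAINER_ALLOCATED, State.ALLOCATED_SAVING);
        addTransition(State.ALLOCATED_SAVING, Event.ATTEMPT_NEW_SAVED, State.ALLOCATED);
        addTransition(State.ALLOCATED, Event.LAUNCHED, State.LAUNCHED);
    }

    private void addTransition(State from, Event on, State to) {
        table.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(on, to);
    }

    /** Applies an event; rejects events that have no transition from the current state. */
    public State handle(Event event) {
        Map<Event, State> row = table.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }

    public State getState() {
        return current;
    }
}
```

Feeding the events START, ATTEMPT_ADDED, CONTAINER_ALLOCATED, ATTEMPT_NEW_SAVED, and LAUNCHED in that order walks the attempt from NEW to LAUNCHED. Any event that arrives out of order is rejected.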
2.3 Launching the AM
The AMLauncherEventType.LAUNCH event sent above is the key entry point for launching the AM. But who handles it? The answer is the ApplicationMasterLauncher class. Let's first look at its basic fields.
// Location: org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
public class ApplicationMasterLauncher extends AbstractService implements
    EventHandler<AMLauncherEvent> {
  // Thread pool whose worker threads execute the AM events
  private ThreadPoolExecutor launcherPool;
  // Dedicated thread that takes AM LAUNCH and CLEANUP events off the queue
  private LauncherThread launcherHandlingThread;

  // Queue that receives the events to be processed
  private final BlockingQueue<Runnable> masterEvents
      = new LinkedBlockingQueue<Runnable>();

  // ResourceManager context
  protected final RMContext context;

  public ApplicationMasterLauncher(RMContext context) {
    super(ApplicationMasterLauncher.class.getName());
    this.context = context;
    // Create the event-handling thread
    this.launcherHandlingThread = new LauncherThread();
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    int threadCount = conf.getInt(
        YarnConfiguration.RM_AMLAUNCHER_THREAD_COUNT,
        YarnConfiguration.DEFAULT_RM_AMLAUNCHER_THREAD_COUNT);
    ThreadFactory tf = new ThreadFactoryBuilder()
        .setNameFormat("ApplicationMasterLauncher #%d")
        .build();
    // Initialize the thread pool
    launcherPool = new ThreadPoolExecutor(threadCount, threadCount, 1,
        TimeUnit.HOURS, new LinkedBlockingQueue<Runnable>());
    launcherPool.setThreadFactory(tf);

    // Some configuration initialization omitted
  }
}
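This code sets up the execution environment: the dedicated event-handling thread launcherHandlingThread, the thread pool launcherPool, and the masterEvents queue that receives AM events for processing. ApplicationMasterLauncher handles two kinds of AM events: LAUNCH and CLEANUP.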
// Location: org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
@Override
public synchronized void handle(AMLauncherEvent appEvent) {
  // Get the AMLauncherEvent type
  AMLauncherEventType event = appEvent.getType();
  RMAppAttempt application = appEvent.getAppAttempt();
  switch (event) {
  // Handle the AM LAUNCH event
  case LAUNCH:
    launch(application);
    break;
  // Handle the AM CLEANUP event
  case CLEANUP:
    cleanup(application);
    break;
  default:
    break;
  }
}
The AMLauncherEventType.LAUNCH event sent at the end of section 2.2 is handled right here. Let's take LAUNCH as the example and look at the handling logic.
// Location: org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
private void launch(RMAppAttempt application) {
  // Create an AMLauncher instance; AMLauncher implements Runnable
  Runnable launcher = createRunnableLauncher(application,
      AMLauncherEventType.LAUNCH);
  // Add the launcher to the masterEvents queue
  masterEvents.add(launcher);
}

protected Runnable createRunnableLauncher(RMAppAttempt application,
    AMLauncherEventType event) {
  Runnable launcher =
      new AMLauncher(context, application, event, getConfig());
  return launcher;
}
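How is an event processed once it is in the queue? That is the job of the dedicated thread launcherHandlingThread: the queue works as a message queue, and the thread consumes its entries one by one.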
// Location: org/apache/hadoop/yarn/server/resourcemanager/amlauncher/ApplicationMasterLauncher.java
private class LauncherThread extends Thread {

  public LauncherThread() {
    super("ApplicationMaster Launcher");
  }

  @Override
  public void run() {
    while (!this.isInterrupted()) {
      // Loop until interrupted, processing event requests
      Runnable toLaunch;
      try {
        // Take the next event from the queue (blocks while it is empty)
        toLaunch = masterEvents.take();
        // Hand it to the thread pool for execution
        launcherPool.execute(toLaunch);
      } catch (InterruptedException e) {
        LOG.warn(this.getClass().getName() + " interrupted. Returning.");
        return;
      }
    }
  }
}
Once an event has been taken off the queue, the actual work is done by the AMLauncher class. AMLauncher is itself a Runnable, so we go straight to its run() method.
// Location: org/apache/hadoop/yarn/server/resourcemanager/amlauncher/AMLauncher.java
public void run() {
  switch (eventType) {
  case LAUNCH:
    try {
      LOG.info("Launching master" + application.getAppAttemptId());
      // Call launch()
      launch();
      // Send the RMAppAttemptEventType.LAUNCHED event
      handler.handle(new RMAppAttemptEvent(application.getAppAttemptId(),
          RMAppAttemptEventType.LAUNCHED));
    } catch (Exception ie) {
      String message = "Error launching " + application.getAppAttemptId()
          + ". Got exception: " + StringUtils.stringifyException(ie);
      LOG.info(message);
      handler.handle(new RMAppAttemptLaunchFailedEvent(application
          .getAppAttemptId(), message));
    }
    break;
  case CLEANUP:
    // omitted
  default:
    LOG.warn("Received unknown event-type " + eventType + ". Ignoring.");
    break;
  }
}
AMLauncher#launch() makes an RPC call to the NodeManager to start the AM Container. Note that the caller here is the RM (through AMLauncher), and the RPC goes over the ContainerManagementProtocol. When launch() returns, AMLauncher sends an RMAppAttemptEventType.LAUNCHED event, which moves the AppAttempt state from ALLOCATED to LAUNCHED.
// Location: org/apache/hadoop/yarn/server/resourcemanager/amlauncher/AMLauncher.java
private void launch() throws IOException, YarnException {
  connect();
  ContainerId masterContainerID = masterContainer.getId();
  ApplicationSubmissionContext applicationContext =
      application.getSubmissionContext();
  LOG.info("Setting up container " + masterContainer
      + " for AM " + application.getAppAttemptId());
  ContainerLaunchContext launchContext =
      createAMContainerLaunchContext(applicationContext, masterContainerID);

  // Build the Container start request
  StartContainerRequest scRequest =
      StartContainerRequest.newInstance(launchContext,
          masterContainer.getContainerToken());
  List<StartContainerRequest> list = new ArrayList<StartContainerRequest>();
  list.add(scRequest);
  StartContainersRequest allRequests =
      StartContainersRequest.newInstance(list);

  // Key step: RPC call that starts the Container on the NodeManager
  StartContainersResponse response =
      containerMgrProxy.startContainers(allRequests);
  if (response.getFailedRequests() != null
      && response.getFailedRequests().containsKey(masterContainerID)) {
    Throwable t =
        response.getFailedRequests().get(masterContainerID).deSerialize();
    parseAndThrowException(t);
  } else {
    LOG.info("Done launching container " + masterContainer + " for AM "
        + application.getAppAttemptId());
  }
}
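At this point the Container that runs the ApplicationMaster has been started. The Container startup logic itself is not analyzed here. Once the AM Container is up on its NodeManager, it starts the ApplicationMaster process based on the launch context. This completes the first step of the ApplicationMaster lifecycle, the ApplicationMaster startup.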
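The queue-plus-dispatcher-thread pattern used by ApplicationMasterLauncher can be reduced to the sketch below. `LauncherSketch` and its `runDemo` helper are hypothetical names introduced for illustration, and the pool size of 2 is an arbitrary assumption. The structure mirrors the description above: a blocking queue of Runnables, one thread that takes from the queue, and a thread pool that executes what was taken.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the launcher pattern: queue -> dispatcher thread -> pool. */
public class LauncherSketch {
    private final BlockingQueue<Runnable> events = new LinkedBlockingQueue<>();
    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    private final Thread dispatcher = new Thread(() -> {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                // Blocks until an event is available, then hands it to the pool
                pool.execute(events.take());
            } catch (InterruptedException e) {
                return; // interrupted while waiting: shut down
            }
        }
    }, "launcher-dispatcher");

    public void start() {
        dispatcher.start();
    }

    /** Producer side: corresponds to handle() adding a launcher to the queue. */
    public void submit(Runnable event) {
        events.add(event);
    }

    public void stop() throws InterruptedException {
        dispatcher.interrupt();
        dispatcher.join();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }

    /** Submits n no-op "launch" events and returns how many completed within 5 seconds. */
    public static long runDemo(int n) {
        LauncherSketch launcher = new LauncherSketch();
        CountDownLatch done = new CountDownLatch(n);
        launcher.start();
        try {
            for (int i = 0; i < n; i++) {
                launcher.submit(done::countDown);
            }
            done.await(5, TimeUnit.SECONDS);
            launcher.stop();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return n - done.getCount();
    }
}
```

One reason to decouple the handler from the execution this way is that the event handler returns immediately after enqueueing, so a slow RPC to one NodeManager does not block the thread that delivers events.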
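3. ApplicationMaster registration/heartbeat and resource request flow
This part covers what the ApplicationMaster does when it starts, how it registers with and heartbeats to the ResourceManager, and how it requests Container resources.
3.1 ApplicationMaster registration/heartbeat flow
Starting the Container starts the ApplicationMaster process, so we use the main() method of the ApplicationMaster class (the distributedshell example application) as the entry point to see what the ApplicationMaster does at startup.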
// Location: org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
public static void main(String[] args) {
  boolean result = false;
  try {
    ApplicationMaster appMaster = new ApplicationMaster();
    LOG.info("Initializing ApplicationMaster");
    boolean doRun = appMaster.init(args);
    if (!doRun) {
      System.exit(0);
    }
    // run() is the core of AM startup
    appMaster.run();
    result = appMaster.finish();
  } catch (Throwable t) {
    LOG.fatal("Error running ApplicationMaster", t);
    LogManager.shutdown();
    ExitUtil.terminate(1, t);
  }
  if (result) {
    LOG.info("Application Master completed successfully. exiting");
    System.exit(0);
  } else {
    LOG.info("Application Master failed. exiting");
    System.exit(2);
  }
}
run() is the core entry point of AM startup. It initializes the RPC client instances and then registers with the RM and starts heartbeating.
// Location: org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
public void run() throws YarnException, IOException {
  LOG.info("Starting ApplicationMaster");

  // Token checks omitted

  // Initialize the AMRMClientAsync instance used for AM-to-RM communication
  AMRMClientAsync.CallbackHandler allocListener = new RMCallbackHandler();
  amRMClient = AMRMClientAsync.createAMRMClientAsync(1000, allocListener);
  amRMClient.init(conf);
  amRMClient.start();

  // Initialize the NMClientAsync instance used for AM-to-NM communication
  containerListener = createNMCallbackHandler();
  nmClientAsync = new NMClientAsyncImpl(containerListener);
  nmClientAsync.init(conf);
  nmClientAsync.start();

  // Key step: register the AM with the RM; this also starts the heartbeats
  appMasterHostname = NetUtils.getHostname();
  RegisterApplicationMasterResponse response = amRMClient
      .registerApplicationMaster(appMasterHostname, appMasterRpcPort,
          appMasterTrackingUrl);

  // Resource limit checks and Container state bookkeeping omitted
}
The AM talks to the RM over RPC through the ApplicationMasterService service. Before looking at the server-side registerApplicationMaster function, let's look at the client-side one.
// Location: org/apache/hadoop/yarn/client/api/async/impl/AMRMClientAsyncImpl.java
public RegisterApplicationMasterResponse registerApplicationMaster(
    String appHostName, int appHostPort, String appTrackingUrl)
    throws YarnException, IOException {
  // Register the AM
  RegisterApplicationMasterResponse response = client
      .registerApplicationMaster(appHostName, appHostPort, appTrackingUrl);
  // Start the AM heartbeat thread
  heartbeatThread.start();
  return response;
}
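First, the AM registration. On the server side, registration does two things. It updates the AM's most recent heartbeat time in AMLivelinessMonitor, and it sends an RMAppAttemptEventType.REGISTERED event, which triggers AMRegisteredTransition and moves the AppAttempt state from LAUNCHED to RUNNING.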
// Location: org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
@Override
public RegisterApplicationMasterResponse registerApplicationMaster(
    RegisterApplicationMasterRequest request) throws YarnException,
    IOException {
  // omitted

  // Allow only one thread in AM to do registerApp at a time.
  synchronized (lock) {
    // omitted

    // Update the AM's most recent heartbeat time in AMLivelinessMonitor
    this.amLivelinessMonitor.receivedPing(applicationAttemptId);
    RMApp app = this.rmContext.getRMApps().get(appID);

    // Setting the response id to 0 to identify if the
    // application master is register for the respective attemptid
    lastResponse.setResponseId(0);
    lock.setAllocateResponse(lastResponse);

    // Key registration step: send the RMAppAttemptEventType.REGISTERED event
    LOG.info("AM registration " + applicationAttemptId);
    this.rmContext
        .getDispatcher()
        .getEventHandler()
        .handle(
            new RMAppAttemptRegistrationEvent(applicationAttemptId, request
                .getHost(), request.getRpcPort(), request.getTrackingUrl()));
    RMAuditLogger.logSuccess(app.getUser(), AuditConstants.REGISTER_AM,
        "ApplicationMasterService", appID, applicationAttemptId);

    // omitted
  }
}
Next, the AM heartbeat flow. heartbeatThread is a dedicated thread that sends the AM's heartbeats. It is initialized as follows.
// Location: org/apache/hadoop/yarn/client/api/async/impl/AMRMClientAsyncImpl.java
public class AMRMClientAsyncImpl<T extends ContainerRequest>
    extends AMRMClientAsync<T> {
  // AM heartbeat thread
  private final HeartbeatThread heartbeatThread;

  @Private
  @VisibleForTesting
  public AMRMClientAsyncImpl(AMRMClient<T> client, int intervalMs,
      CallbackHandler callbackHandler) {
    super(client, intervalMs, callbackHandler);
    // Create the AM heartbeat thread instance
    heartbeatThread = new HeartbeatThread();
  }

  // remaining members omitted
}
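After registering with the RM, the AM periodically talks to the RM through the RPC function ApplicationMasterProtocol#allocate(). This method serves three purposes:
- requesting resources;
- retrieving newly allocated resources;
- acting as a periodic heartbeat that tells the RM the AM is still alive.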
// Location: org/apache/hadoop/yarn/client/api/async/impl/AMRMClientAsyncImpl.java
private class HeartbeatThread extends Thread {
  public HeartbeatThread() {
    super("AMRM Heartbeater thread");
  }

  public void run() {
    while (true) {
      // The heartbeat thread loops until it is told to stop
      AllocateResponse response = null;
      // synchronization ensures we don't send heartbeats after unregistering
      synchronized (unregisterHeartbeatLock) {
        if (!keepRunning) {
          return;
        }

        try {
          // Key step: a heartbeat is simply a periodic call to allocate()
          response = client.allocate(progress);
        } catch (ApplicationAttemptNotFoundException e) {
          handler.onShutdownRequest();
          LOG.info("Shutdown requested. Stopping callback.");
          return;
        } catch (Throwable ex) {
          LOG.error("Exception on heartbeat", ex);
          savedException = ex;
          // interrupt handler thread in case it waiting on the queue
          handlerThread.interrupt();
          return;
        }
        if (response != null) {
          while (true) {
            try {
              responseQueue.put(response);
              break;
            } catch (InterruptedException ex) {
              LOG.debug("Interrupted while waiting to put on response queue", ex);
            }
          }
        }
      }
      try {
        Thread.sleep(heartbeatIntervalMs.get());
      } catch (InterruptedException ex) {
        LOG.debug("Heartbeater interrupted", ex);
      }
    }
  }
}
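The AM has now registered with the RM and started its periodic heartbeats, which are implemented as periodic calls to ApplicationMasterProtocol#allocate(). Once the heartbeats are running, the AM regularly requests resources from the RM so that it can start Container processes on the corresponding NodeManagers. The next part covers this in detail.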
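The heartbeat loop can be reduced to the following sketch. `HeartbeatSketch` is a hypothetical class, not part of the Yarn client library: the `Supplier<String>` stands in for `client.allocate(progress)`, string responses stand in for `AllocateResponse`, and error handling is left out. It keeps the two ideas that matter in the original: the stop flag is checked under the same lock that guards the call, so no heartbeat is sent after stopping, and responses are handed to a queue for another thread to consume.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical sketch of a periodic heartbeat thread with a clean stop. */
public class HeartbeatSketch {
    private final Supplier<String> allocateCall; // stands in for client.allocate(progress)
    private final long intervalMs;
    private final BlockingQueue<String> responses = new LinkedBlockingQueue<>();
    private final Object unregisterLock = new Object();
    private volatile boolean keepRunning = true;
    private final Thread thread;

    public HeartbeatSketch(Supplier<String> allocateCall, long intervalMs) {
        this.allocateCall = allocateCall;
        this.intervalMs = intervalMs;
        this.thread = new Thread(this::loop, "heartbeat-sketch");
    }

    private void loop() {
        while (true) {
            // Holding the lock guarantees no heartbeat is sent after stop() returns
            synchronized (unregisterLock) {
                if (!keepRunning) {
                    return;
                }
                responses.add(allocateCall.get());
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                // Woken early; the keepRunning check above decides whether to exit
            }
        }
    }

    public void start() {
        thread.start();
    }

    public void stop() {
        synchronized (unregisterLock) {
            keepRunning = false;
        }
        thread.interrupt();
        try {
            thread.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public boolean isAlive() {
        return thread.isAlive();
    }

    /** Consumer side: waits up to timeoutMs for the next response, or returns null. */
    public String nextResponse(long timeoutMs) {
        try {
            return responses.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```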
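3.2 ApplicationMaster resource request and allocation flow
The unit of resource request and allocation for the ApplicationMaster is the Container, so the discussion below is framed in terms of requesting and allocating Containers.
(1) Container request and allocation flow
Figure: Container request and allocation flow
As the figure shows, after an application's ApplicationMaster has started on an NM and registered with the RM, the process from requesting resources (Containers) from the RM to actually obtaining them has two phases: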
阶段一:ApplicationMaster 汇报资源资源并领取已经分配到的资源;
- ApplicationMaster 通过 RPC 函数 ApplicationMasterProtocol#allocate 向 RM 汇报资源需求(由于是周期性调用,也叫“心跳”),包括包括新的资源需求描述、待释放的 Container 列表、请求加入黑名单的节点列表、请求移除黑名单的节点列表等;
- RM 中的 ApplicationMasterService 负责处理来自 ApplicationMaster 的请求,一旦受到请求,会向 RMAppAttemptImpl 发送一个 RMAppAttemptEventType.STATUS_UPDATE 类型事件,而 RMAppAttempImpl 收到该事件后,将更新应用程序执行进度和 AMLivelinessMonitor 中记录的应用程序最近更新事件。
- ApplicationMasterService 调用 ResourceScheduler#allocate 函数,将 ApplicationMaster 资源需求汇报给 ResourceScheduler。
- ResourceScheduler 首先读取待释放 Contianer 列表,依次向对应的 RMContainerImpl 发送 RMContainerEventType.RELEASED 类型事件,以杀死正在运行的 Container,然后将新的资源需求更新到对应数据中,并返回已经为该应用程序分配的资源。
阶段二:NM 向 RM 汇报各个 Container 运行状态,如果 RM 发现它上面又空闲的资源,则进行一次分配,并将分配的资源保存到 RM 数据结构中,等待下次 ApplicationMaster 发送心跳时获取。
- NM 通过 RPC 函数 ResourceTracker#nodeHeartbeat 向 RM 汇报各个 Container 运行状态。
- RM 中的 ResourceTrackerService 负责处理来自 NM 的请求,一旦收到请求,会向 RMNodeImpl 发送一个 RMNodeEventType.STATUS_UPDATE 事件,而 RMNodeImpl 收到事件后,将更新各个 Container 运行状态,并进一步向 ResourceScheduler 发送一个 SchedulerEventType.NODE_UPDATE 事件。
- ResourceScheduler 收到事件后,如果该节点又可分配的空闲资源,则会将这些资源分配给各个应用程序,而分配后的资源仅是记录到对应数据结构中,等待 ApplicationMaster 下次通过心跳机制来领取。
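The key consequence of the two phases is that an AM heartbeat never triggers an allocation itself; it only records demand and picks up what an earlier NM heartbeat produced. The toy model below illustrates that decoupling. It is a sketch under simplifying assumptions, not the scheduler's logic: the class `TwoPhaseAllocationSketch`, its methods, the "free slots" notion, and the container naming are all hypothetical, and there is only one application and no queues or fairness.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the two-phase flow; all names here are hypothetical.
class TwoPhaseAllocationSketch {
    private int pendingRequests = 0;                               // containers the AM still wants
    private final List<String> newlyAllocated = new ArrayList<>(); // assigned but not yet fetched
    private int nextContainerId = 1;

    // Phase 1 (AM heartbeat): record new demand and hand back
    // whatever was assigned since the previous heartbeat.
    synchronized List<String> amAllocate(int newRequests) {
        pendingRequests += newRequests;
        List<String> granted = new ArrayList<>(newlyAllocated);
        newlyAllocated.clear();
        return granted;
    }

    // Phase 2 (NM heartbeat): assign the node's free slots to pending
    // demand and park the result until the AM's next heartbeat.
    synchronized void nodeUpdate(String node, int freeSlots) {
        int toAssign = Math.min(freeSlots, pendingRequests);
        for (int i = 0; i < toAssign; i++) {
            newlyAllocated.add("container_" + (nextContainerId++) + "@" + node);
        }
        pendingRequests -= toAssign;
    }

    public static void main(String[] args) {
        TwoPhaseAllocationSketch rm = new TwoPhaseAllocationSketch();
        System.out.println(rm.amAllocate(3)); // prints [] : nothing assigned yet
        rm.nodeUpdate("nm1", 2);
        System.out.println(rm.amAllocate(0)); // prints [container_1@nm1, container_2@nm1]
        rm.nodeUpdate("nm2", 4);
        System.out.println(rm.amAllocate(0)); // prints [container_3@nm2]
    }
}
```

Note that the first heartbeat, which carries the request for three containers, returns an empty list: the containers only show up on a later heartbeat, after a node has reported in.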
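(2) Source code analysis
On the client side, the heartbeat thread in AMRMClientAsyncImpl calls allocate(), which reports the resource demands to the RM over RPC. The communication interface is defined by the ApplicationMasterProtocol protocol, and on the RM side it is implemented by ApplicationMasterService. Let's look at how that implementation handles the client's resource request.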
// Location: org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
@Override
public AllocateResponse allocate(AllocateRequest request)
    throws YarnException, IOException {
  synchronized (lock) {
    AllocateResponse lastResponse = lock.getAllocateResponse();

    // Send a STATUS_UPDATE event to update the AppAttempt state
    this.rmContext.getDispatcher().getEventHandler().handle(
        new RMAppAttemptStatusupdateEvent(appAttemptId, request.getProgress()));

    // Validate the requested memory and vcores against the scheduler's maximum resource capability
    try {
      RMServerUtils.normalizeAndValidateRequests(ask,
          rScheduler.getMaximumResourceCapability(), app.getQueue(), rScheduler);
    } catch (InvalidResourceRequestException e) {
      LOG.warn("Invalid resource ask by application " + appAttemptId, e);
      throw e;
    }

    // Key point: call the scheduler's allocate() to report the resource demands
    Allocation allocation = this.rScheduler.allocate(appAttemptId, ask, release,
        blacklistAdditions, blacklistRemovals);

    // Update the response and the AMRMToken state (details omitted)
    return allocateResponse;
  }
}
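ApplicationMasterService#allocate() calls the allocate() method of YarnScheduler. Since the FairScheduler is the scheduler in use here, we analyze FairScheduler#allocate(). The core of the allocation path is pullNewlyAllocatedContainersAndNMTokens(), which takes Containers out of a List named newlyAllocatedContainers. Where do those Containers come from? They are the Containers produced by the allocation logic that runs when a NodeManager heartbeat arrives. They are kept in the RM's in-memory data structure newlyAllocatedContainers, and the AM simply takes its Containers from there.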
// Location: org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
@Override
public Allocation allocate(ApplicationAttemptId appAttemptId,
    List<ResourceRequest> ask, List<ContainerId> release,
    List<String> blacklistAdditions, List<String> blacklistRemovals) {
  // Normalize the resource requests
  SchedulerUtils.normalizeRequests(ask, DOMINANT_RESOURCE_CALCULATOR,
      getClusterResource(), minimumAllocation, getMaximumResourceCapability(),
      incrAllocation);

  // Record the time at which the Container request started
  application.recordContainerRequestTime(getClock().getTime());

  // Release containers
  releaseContainers(release, application);

  synchronized (application) {
    if (!ask.isEmpty()) {
      if (LOG.isDebugEnabled()) {
        LOG.debug("allocate: pre-update" + " applicationAttemptId=" + appAttemptId
            + " application=" + application.getApplicationId());
      }
      application.showRequests();

      // Update application requests
      application.updateResourceRequests(ask);

      application.showRequests();
    }

    ... // omitted

    // Key point: authorize the Containers and fetch the Containers previously allocated to this AppAttempt.
    // They are held in the RM's in-memory data structure and were produced by assignContainer();
    // see Yarn's resource allocation logic for the details of that assignment.
    ContainersAndNMTokensAllocation allocation =
        application.pullNewlyAllocatedContainersAndNMTokens();

    // Record container allocation time
    if (!(allocation.getContainerList().isEmpty())) {
      application.recordContainerAllocationTime(getClock().getTime());
    }

    // Return the allocated Containers to the client (the AM)
    return new Allocation(allocation.getContainerList(),
        application.getHeadroom(), preemptionContainerIds, null, null,
        allocation.getNMTokenList());
  }
}
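The first step in the excerpt above normalizes the requests using a minimum allocation, a maximum capability, and an increment. The sketch below shows what such a normalization typically amounts to for a single dimension such as memory: raise the request to the minimum, round it up to a multiple of the increment, and cap it at the maximum. This is an illustration under stated assumptions, not the code of SchedulerUtils: the class `NormalizeSketch`, its methods, and the example sizes are hypothetical, and the real result depends on the configured resource calculator and covers more than one resource dimension.

```java
// Illustrative single-dimension normalization; names and numbers are hypothetical.
class NormalizeSketch {
    // Round value up to the nearest multiple of step (step > 0).
    static int roundUp(int value, int step) {
        return ((value + step - 1) / step) * step;
    }

    // Raise to min, round up to a multiple of increment, then cap at max.
    static int normalize(int requested, int min, int max, int increment) {
        return Math.min(roundUp(Math.max(requested, min), increment), max);
    }

    public static void main(String[] args) {
        // Assumed settings: min 1024 MB, max 8192 MB, increment 512 MB.
        System.out.println(normalize(1500, 1024, 8192, 512)); // prints 1536
        System.out.println(normalize(100, 1024, 8192, 512));  // prints 1024
        System.out.println(normalize(9000, 1024, 8192, 512)); // prints 8192
    }
}
```

Under these assumed settings, a request for 1500 MB would be accounted as 1536 MB, which is why the memory granted to a Container can be larger than what the AM asked for.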
At this point, the AM's periodic heartbeat has obtained the Containers it requested. How those Containers are then launched differs by application type and is not covered in detail here.
【References】