The retry policy decides whether a job should be retried when it fails. It applies to both kinds of job: frameworks and tasks.
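1. Two retry policies: FancyRetry Policy + NormalRetry Policy
Exits are handled according to the category of the exit reason:
FancyRetry:
Transient failure -- retry
Non-transient failure -- do not retry
Success / unknown failure -- hand over to the NormalRetry Policy
NormalRetry Policy:
Retry, and increment the retry count, when any one of the following three conditions holds:
1) maxRetryCount is set to -2. With this setting the job is retried even if it exits successfully.
2) maxRetryCount is set to -1 and the exit status is failure. With this setting a failed job is retried without limit until it succeeds.
3) retry count < maxRetryCount and the exit status is failure. With this setting a failed job is retried a limited number of times.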
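The rules above can be sketched as a single decision function. This is only an illustration of the logic as described in these notes, not the launcher's actual code: the class name RetryDecisionSketch, the ExitType enum and the method signatures are all hypothetical.

```java
// Illustrative sketch only; every name here is hypothetical.
final class RetryDecisionSketch {
    enum ExitType { SUCCEEDED, TRANSIENT_FAILURE, NON_TRANSIENT_FAILURE, UNKNOWN_FAILURE }

    /** FancyRetry decides first; success and unknown failure fall through to NormalRetry. */
    static boolean shouldRetry(ExitType exitType, int maxRetryCount, int retriedCount) {
        if (exitType == ExitType.TRANSIENT_FAILURE) {
            return true;   // FancyRetry: transient failure -- retry
        }
        if (exitType == ExitType.NON_TRANSIENT_FAILURE) {
            return false;  // FancyRetry: non-transient failure -- do not retry
        }
        return normalRetry(exitType, maxRetryCount, retriedCount);
    }

    /** NormalRetry: retry when any one of the three conditions holds. */
    static boolean normalRetry(ExitType exitType, int maxRetryCount, int retriedCount) {
        boolean failed = exitType != ExitType.SUCCEEDED;
        if (maxRetryCount == -2) {
            return true;                          // 1) retry even after a successful exit
        }
        if (maxRetryCount == -1 && failed) {
            return true;                          // 2) retry failures without limit
        }
        return failed && retriedCount < maxRetryCount;  // 3) retry failures a limited number of times
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(ExitType.UNKNOWN_FAILURE, 3, 2));
        System.out.println(shouldRetry(ExitType.UNKNOWN_FAILURE, 3, 3));
    }
}
```

The caller would be responsible for incrementing the retry count whenever NormalRetry returns true; the sketch only returns the decision.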
The link below lists the values set in different frameworks and the retry strategy each one corresponds to.
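2. ApplicationCompletionPolicy
The parameters that decide the outcome for a job (taskRole), and therefore what the retry policy sees, are minFailedTaskCount + minSuccessTaskCount.
The cases are as follows:
1. If minFailedTaskCount != null and the actual number of failed tasks > minFailedTaskCount, the AM is stopped and marked as failed. The ExitStatus is failed and is taken from the last failed task.
2. If minSuccessTaskCount != null and the actual number of succeeded tasks >= minSuccessTaskCount, the AM is stopped and marked as succeeded. The ExitStatus is succeed and is taken from the last succeeded task.
3. If conditions 1 and 2 are both satisfied, the result depends on the actual situation and may be either of the two.
4. If neither condition 1 nor condition 2 has been satisfied by the time all tasks of all taskRoles have completed, the AM is stopped and marked as succeeded. The ExitStatus is succeed and is not generated from any task.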
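The four cases above can be condensed into a small sketch. It is a simplification under stated assumptions: it looks at a single taskRole, the names CompletionPolicySketch and Decision are hypothetical, and the comparison operators are copied from the text above rather than checked against the launcher source.

```java
// Illustrative sketch only; names and structure are hypothetical.
final class CompletionPolicySketch {
    enum Decision { KEEP_RUNNING, COMPLETE_FAILED, COMPLETE_SUCCEEDED }

    static Decision decide(Integer minFailedTaskCount, Integer minSuccessTaskCount,
                           int failedTasks, int succeededTasks, boolean allTasksCompleted) {
        // Case 1: enough failed tasks (comparison as stated in the text).
        if (minFailedTaskCount != null && failedTasks > minFailedTaskCount) {
            return Decision.COMPLETE_FAILED;
        }
        // Case 2: enough succeeded tasks.
        if (minSuccessTaskCount != null && succeededTasks >= minSuccessTaskCount) {
            return Decision.COMPLETE_SUCCEEDED;
        }
        // Case 3 (both satisfied) may resolve either way in practice;
        // this sketch arbitrarily checks the failure condition first.
        // Case 4: nothing triggered before every task finished.
        if (allTasksCompleted) {
            return Decision.COMPLETE_SUCCEEDED;
        }
        return Decision.KEEP_RUNNING;
    }

    public static void main(String[] args) {
        System.out.println(decide(1, null, 2, 0, false));
        System.out.println(decide(null, null, 3, 0, true));
    }
}
```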
Some concrete examples:
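3. Framework retry
A framework can be retried in three situations. First, launchApplication() fails after submitting the AM: an exception is thrown, and the exception handler decides whether a retry is needed. Second, frameworks marked as COMPLETE are inspected, and their ExitType is combined with the retry policy to decide whether a retry is needed. Third, a restart triggers recover, which checks whether some frameworks need to be retried. The rest of this section covers the first two situations.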
3.1 After launchApplication() submits the AM, the exception handler calls retrieveApplicationExitDiagnostics(). Exceptions are classified as follows:
YarnException --> non-transient error
IOException --> transient error
Any other Exception --> unknown error
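A minimal sketch of this mapping follows. To keep it self-contained, the nested YarnException class is a local stand-in for Hadoop's org.apache.hadoop.yarn.exceptions.YarnException; the class LaunchFailureSketch and the FailureKind enum are hypothetical names introduced for illustration.

```java
import java.io.IOException;

// Illustrative sketch only; names are hypothetical.
final class LaunchFailureSketch {
    /** Local stand-in for Hadoop's YarnException, so the example compiles without Hadoop. */
    static class YarnException extends Exception {
        YarnException(String message) { super(message); }
    }

    enum FailureKind { NON_TRANSIENT, TRANSIENT, UNKNOWN }

    /** Maps the exception thrown while launching the AM to a failure category. */
    static FailureKind classify(Exception e) {
        if (e instanceof YarnException) {
            return FailureKind.NON_TRANSIENT;  // YarnException --> non-transient error
        }
        if (e instanceof IOException) {
            return FailureKind.TRANSIENT;      // IOException --> transient error
        }
        return FailureKind.UNKNOWN;            // anything else --> unknown error
    }

    public static void main(String[] args) {
        System.out.println(classify(new IOException("connection reset")));
        System.out.println(classify(new IllegalStateException("unexpected")));
    }
}
```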
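The exception is handled in the following steps:
1. Call killApplication() to stop this AM.
2. The statusManager changes the state of the Framework this AM is responsible for to APPLICATION_RETRIEVING_DIAGNOSTICS.
3. The work is handed to diagnosticsRetrieveHandler, which handles it in a separate thread.
4. That thread submits the job retrieveApplicationExitCode(applicationId, finalDiagnostics) to a thread pool. The job first obtains the ApplicationId corresponding to the AM, then calls completeApplication() to try to finish the framework.
5. completeApplication() marks the framework state as APPLICATION_COMPLETED and then calls attemptToRetry().
6. In attemptToRetry(), the framework ends up in one of two ways depending on its status:
1) completeFramework --> marks the framework state as FRAMEWORK_COMPLETED
2) retryFramework --> marks the framework state as FRAMEWORK_WAITING, then calls createApplication() to start a new AM.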
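The state transitions in steps 2, 5 and 6 can be traced with a small sketch. The state names come from the steps above; the class FrameworkRetrySketch, the method handleLaunchFailure and the shouldRetry flag (standing in for the retry policy's decision) are hypothetical, and the threading in steps 3 and 4 is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; it records state transitions and does not talk to YARN.
final class FrameworkRetrySketch {
    enum FrameworkState {
        APPLICATION_RETRIEVING_DIAGNOSTICS,
        APPLICATION_COMPLETED,
        FRAMEWORK_WAITING,
        FRAMEWORK_COMPLETED
    }

    /** Returns the sequence of states a framework passes through after a failed launch. */
    static List<FrameworkState> handleLaunchFailure(boolean shouldRetry) {
        List<FrameworkState> trace = new ArrayList<>();
        trace.add(FrameworkState.APPLICATION_RETRIEVING_DIAGNOSTICS); // step 2
        trace.add(FrameworkState.APPLICATION_COMPLETED);              // step 5
        // Step 6: attemptToRetry() picks one of the two outcomes.
        trace.add(shouldRetry
                ? FrameworkState.FRAMEWORK_WAITING      // retryFramework, a new AM is created next
                : FrameworkState.FRAMEWORK_COMPLETED);  // completeFramework
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(handleLaunchFailure(true));
        System.out.println(handleLaunchFailure(false));
    }
}
```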
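3.2 Checking frameworks in the COMPLETE state
The framework state check is asynchronous and runs in a separate thread.
The entry point is the method resyncFrameworksWithLiveApplications(Map<String, ApplicationReport> liveApplicationReports). Because the RM persists AM state in ZK, what is checked here are the live Applications pulled from ZK.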
private void resyncFrameworksWithLiveApplications(Map<String, ApplicationReport> liveApplicationReports) throws Exception {
  // Since Application is persistent in ZK by RM, so liveApplicationReports will never incomplete.
  String logScope = "resyncFrameworksWithLiveApplications";
  CHANGE_AWARE_LOGGER.initializeScope(logScope, Level.INFO, Level.DEBUG);
  CHANGE_AWARE_LOGGER.log(logScope,
      "Got %s live Applications from RM, start to resync them.",
      liveApplicationReports.size());

  for (ApplicationReport applicationReport : liveApplicationReports.values()) {
    String applicationId = applicationReport.getApplicationId().toString();
    YarnApplicationState applicationState = applicationReport.getYarnApplicationState();
    FinalApplicationStatus applicationFinalStatus = applicationReport.getFinalApplicationStatus();
    String diagnostics = applicationReport.getDiagnostics();

    if (statusManager.isApplicationIdLiveAssociated(applicationId)) {
      FrameworkStatus frameworkStatus = statusManager.getFrameworkStatusWithLiveAssociatedApplicationId(applicationId);
      String frameworkName = frameworkStatus.getFrameworkName();
      FrameworkState frameworkState = frameworkStatus.getFrameworkState();
      if (frameworkState == FrameworkState.APPLICATION_CREATED) {
        continue;
      }

      // updateApplicationStatus
      statusManager.updateApplicationStatus(frameworkName, applicationReport);

      // transitionFrameworkState
      if (applicationFinalStatus == FinalApplicationStatus.UNDEFINED) {
        if (applicationState == YarnApplicationState.NEW ||
            applicationState == YarnApplicationState.NEW_SAVING ||
            applicationState == YarnApplicationState.SUBMITTED ||
            applicationState == YarnApplicationState.ACCEPTED) {
          statusManager.transitionFrameworkState(frameworkName, FrameworkState.APPLICATION_WAITING);
        } else if (applicationState == YarnApplicationState.RUNNING) {
          statusManager.transitionFrameworkState(frameworkName, FrameworkState.APPLICATION_RUNNING);
        }
      } else if (applicationFinalStatus == FinalApplicationStatus.SUCCEEDED) {
        retrieveApplicationExitDiagnostics(
            applicationId, ExitStatusKey.SUCCEEDED.toInt(), diagnostics, false);
      } else if (applicationFinalStatus == FinalApplicationStatus.KILLED) {
        retrieveApplicationExitDiagnostics(
            applicationId, ExitStatusKey.AM_KILLED_BY_USER.toInt(), diagnostics, false);
      } else if (applicationFinalStatus == FinalApplicationStatus.FAILED) {
        retrieveApplicationExitDiagnostics(
            applicationId, null, diagnostics, false);
      }
    } else {
      // Do not kill Application due to AM_RM_RESYNC_EXCEED, since Exceed AM will kill itself.
      // In this way, we can support multiple LauncherServices to share a single RM,
      // like the sharing of HDFS and ZK.
    }
  }

  List<String> liveAssociatedApplicationIds = statusManager.getLiveAssociatedApplicationIds();
  for (String applicationId : liveAssociatedApplicationIds) {
    if (!liveApplicationReports.containsKey(applicationId)) {
      FrameworkStatus frameworkStatus = statusManager.getFrameworkStatusWithLiveAssociatedApplicationId(applicationId);
      String frameworkName = frameworkStatus.getFrameworkName();
      FrameworkState frameworkState = frameworkStatus.getFrameworkState();

      // APPLICATION_CREATED Application is not in the liveApplicationReports, but it is indeed live in RM.
      if (frameworkState == FrameworkState.APPLICATION_CREATED) {
        continue;
      }

      LOGGER.logWarning(
          "[%s]: Cannot find live associated Application %s in resynced live Applications. " +
              "Will complete it with RMResyncLost ExitStatus",
          frameworkName, applicationId);

      retrieveApplicationExitDiagnostics(
          applicationId, ExitStatusKey.AM_RM_RESYNC_LOST.toInt(), "AM lost after RMResynced", false);
    }
  }
}
As the code shows, it calls retrieveApplicationExitDiagnostics() to check every application whose final status is SUCCEEDED, KILLED, or FAILED.
private void retrieveApplicationExitDiagnostics(String applicationId, Integer exitCode, String diagnostics, boolean needToKill) throws Exception {
  if (needToKill) {
    HadoopUtils.killApplication(applicationId);
  }

  String logSuffix = String.format(
      "[%s]: retrieveApplicationExitDiagnostics: ExitCode: %s, ExitDiagnostics: %s, NeedToKill: %s",
      applicationId, exitCode, diagnostics, needToKill);

  if (!statusManager.isApplicationIdAssociated(applicationId)) {
    LOGGER.logWarning("[NotAssociated]%s", logSuffix);
    return;
  }

  FrameworkStatus frameworkStatus = statusManager.getFrameworkStatusWithAssociatedApplicationId(applicationId);
  String frameworkName = frameworkStatus.getFrameworkName();

  // Schedule to retrieveDiagnostics
  LOGGER.logDebug("[%s]%s", frameworkName, logSuffix);
  statusManager.transitionFrameworkState(frameworkName,
      FrameworkState.APPLICATION_RETRIEVING_DIAGNOSTICS,
      new FrameworkEvent().setApplicationExitCode(exitCode).setApplicationExitDiagnostics(diagnostics));
  diagnosticsRetrieveHandler.retrieveDiagnosticsAsync(applicationId, diagnostics);
}
It then uses retrieveDiagnosticsAsync() to run the check in a separate thread.
From this point on, the execution path is the same as in 3.1 from step 3 onward.
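4. Task/taskRole retry
There are two situations:
First, when a container finishes running, its "post-mortem" is done by calling attemptToRetry(). As with a framework, a task ends up in one of two ways:
1) completeTask --> marks the task state as TASK_COMPLETED, and attemptToStop() tries to finish the whole framework.
2) retryTask --> marks the task state as TASK_WAITING, and addContainerRequest() requests a container again.
Second, recover() calls attemptToRetry() to help restore the transitionTaskStateQueue, and in doing so checks every task whose state is CONTAINER_COMPLETED.
At present, however, recover() is only called when the AM starts for the first time.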
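The two task outcomes described above can be sketched as a pure state transition. The state names are taken from the text; the class TaskRetrySketch, the method signature and the shouldRetry flag (standing in for the retry policy's decision) are hypothetical, and the calls to attemptToStop() and addContainerRequest() are only noted in comments.

```java
// Illustrative sketch only; names other than the task states are hypothetical.
final class TaskRetrySketch {
    enum TaskState { CONTAINER_COMPLETED, TASK_WAITING, TASK_COMPLETED }

    /** Decides the next state of a task whose container has finished. */
    static TaskState attemptToRetry(TaskState current, boolean shouldRetry) {
        if (current != TaskState.CONTAINER_COMPLETED) {
            return current;  // only tasks with a completed container are examined
        }
        return shouldRetry
                ? TaskState.TASK_WAITING     // retryTask: a container is requested again
                : TaskState.TASK_COMPLETED;  // completeTask: the framework may then be stopped
    }

    public static void main(String[] args) {
        System.out.println(attemptToRetry(TaskState.CONTAINER_COMPLETED, true));
        System.out.println(attemptToRetry(TaskState.CONTAINER_COMPLETED, false));
    }
}
```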
The next thing to understand is how the retry policy of an ordinary job is configured.