A retry policy specifies whether a job should be retried when it fails. It applies to both kinds of job: framework and task.
 

1. Two retry policies: FancyRetry Policy + NormalRetry Policy

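Handling depends on how the exit reason is classified:
FancyRetry:
Transient failure: retry
Non-transient failure: do not retry
Success / unknown failure: defer to the NormalRetry Policy
NormalRetry Policy:
Retry and increment the retry count when any one of the following three conditions holds:
1) maxRetryCount is set to -2. With this setting, the job is retried even if it exits successfully.
2) maxRetryCount is set to -1 and the exit state is failure. With this setting, a failed job is retried without limit until it succeeds.
3) retry count < maxRetryCount and the exit state is failure. With this setting, a failed job is retried a limited number of times.
The link below lists the values used for different kinds of framework and the retry behavior each one produces.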
https://github.com/Microsoft/pai/blob/master/frameworklauncher/doc/USERMANUAL.md#RetryPolicy
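
The two policies can be read as a single decision function. The following is only a minimal sketch of the rules listed in this section, not the launcher's actual code: the class, enum, and method names are hypothetical, and it assumes that a retry triggered by a transient failure does not increment the retry count, since the increment is only mentioned for the NormalRetry Policy.

```java
/** Hypothetical sketch of the FancyRetry / NormalRetry rules; not the launcher's real classes. */
final class RetryPolicySketch {
  /** Exit categories used in this section; the enum itself is an assumption. */
  enum ExitType { SUCCEEDED, TRANSIENT_FAILURE, NON_TRANSIENT_FAILURE, UNKNOWN_FAILURE }

  /** Result of one decision: whether to retry, and the retry count afterwards. */
  static final class Decision {
    final boolean retry;
    final int retriedCount;

    Decision(boolean retry, int retriedCount) {
      this.retry = retry;
      this.retriedCount = retriedCount;
    }
  }

  static Decision decide(boolean fancyRetry, int maxRetryCount, int retriedCount, ExitType exitType) {
    if (fancyRetry) {
      // FancyRetry only settles the two clearly classified failure types.
      if (exitType == ExitType.TRANSIENT_FAILURE) {
        return new Decision(true, retriedCount); // assumption: count is not incremented
      }
      if (exitType == ExitType.NON_TRANSIENT_FAILURE) {
        return new Decision(false, retriedCount);
      }
    }
    // Everything else (success, unknown failure, or FancyRetry disabled) uses NormalRetry.
    boolean failed = exitType != ExitType.SUCCEEDED;
    boolean retry = maxRetryCount == -2                 // 1) always retry, even on success
        || (maxRetryCount == -1 && failed)              // 2) retry failures without limit
        || (failed && retriedCount < maxRetryCount);    // 3) retry failures a limited number of times
    return new Decision(retry, retry ? retriedCount + 1 : retriedCount);
  }
}
```

For example, `RetryPolicySketch.decide(true, 2, 1, ExitType.UNKNOWN_FAILURE)` falls through to the NormalRetry rules and yields a retry with the count raised to 2, while the same call with a retry count of 2 yields no retry.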

 

2. ApplicationCompletionPolicy

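For a job (taskRole), the outcome that feeds the retry decision is controlled by two parameters: minFailedTaskCount + minSuccessTaskCount.
The cases are:
1. If minFailedTaskCount != null and the actual number of failed tasks >= minFailedTaskCount, the AM ends and the application is marked as failed. The ExitStatus is failed and is taken from the last failed task.
2. If minSuccessTaskCount != null and the actual number of succeeded tasks >= minSuccessTaskCount, the AM ends and the application is marked as succeeded. The ExitStatus is succeeded and is taken from the last succeeded task.
3. If conditions 1 and 2 are satisfied at the same time, the result can be either of them, depending on what actually happens at runtime.
4. If neither condition 1 nor condition 2 has been satisfied by the time every task of every taskRole has completed, the AM ends and the application is marked as succeeded. The ExitStatus is succeeded and is not generated from any task.
Some concrete examples: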
https://github.com/Microsoft/pai/blob/master/frameworklauncher/doc/USERMANUAL.md#ApplicationCompletionPolicy
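
The four cases can be sketched as one evaluation function for a single taskRole. This is an illustration under assumptions, not the AM's implementation: the class, enum, and method names are hypothetical, a threshold is assumed to be reached as soon as the count equals the configured value, and checking the failure threshold first is an arbitrary tie-break for the case where both conditions hold at once.

```java
/** Hypothetical sketch of the completion rules for one taskRole; not the AM's real code. */
final class CompletionPolicySketch {
  enum Outcome { FAILED, SUCCEEDED, UNDECIDED }

  static Outcome evaluate(Integer minFailedTaskCount, Integer minSuccessTaskCount,
                          int failedTasks, int succeededTasks, boolean allTasksCompleted) {
    // Case 1: enough tasks failed (assumption: "enough" means count >= threshold).
    if (minFailedTaskCount != null && failedTasks >= minFailedTaskCount) {
      return Outcome.FAILED;
    }
    // Case 2: enough tasks succeeded.
    if (minSuccessTaskCount != null && succeededTasks >= minSuccessTaskCount) {
      return Outcome.SUCCEEDED;
    }
    // Case 4: neither threshold was reached, but every task has finished.
    if (allTasksCompleted) {
      return Outcome.SUCCEEDED;
    }
    // Otherwise the application keeps running.
    return Outcome.UNDECIDED;
  }
}
```

With `minFailedTaskCount = 1` and `minSuccessTaskCount = null`, a single failed task is enough for `evaluate` to return `FAILED`. With both parameters `null`, the result stays `UNDECIDED` until all tasks have completed and is then `SUCCEEDED`.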
 

3. Framework retry

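A framework can be retried in three situations. First, launchApplication() fails to submit the AM; an exception is thrown, and the exception handler decides whether a retry is needed. Second, frameworks marked as being in a COMPLETE state are checked, and their ExitType is combined with the retry policy to decide whether a retry is needed. Third, recover runs on restart and checks whether any frameworks need to be retried. The rest of this section covers the first two situations.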
 
3.1 After launchApplication() submits the AM, the exception handler calls retrieveApplicationExitDiagnostics().
YarnException --> non-transient error
IOException --> transient error
Other Exception --> unknown error
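
The mapping from exception type to error category can be sketched as follows. This is a hedged illustration only: `LaunchFailureClassifier` and `FailureKind` are hypothetical names, and the nested `YarnException` is a stand-in for `org.apache.hadoop.yarn.exceptions.YarnException` so that the example compiles without Hadoop on the classpath.

```java
import java.io.IOException;

/** Hypothetical sketch of the exception-to-category mapping; not the launcher's real code. */
final class LaunchFailureClassifier {
  /** Stand-in for Hadoop's YarnException, defined here only to keep the sketch self-contained. */
  static class YarnException extends Exception {
    YarnException(String message) {
      super(message);
    }
  }

  enum FailureKind { TRANSIENT, NON_TRANSIENT, UNKNOWN }

  static FailureKind classify(Exception e) {
    if (e instanceof YarnException) {
      return FailureKind.NON_TRANSIENT; // YarnException --> non-transient error
    }
    if (e instanceof IOException) {
      return FailureKind.TRANSIENT;     // IOException --> transient error
    }
    return FailureKind.UNKNOWN;         // any other Exception --> unknown error
  }
}
```

Combined with the retry policy in section 1, a `TRANSIENT` result would lead to a retry under FancyRetry, a `NON_TRANSIENT` result would not, and an `UNKNOWN` result would be left to the NormalRetry Policy.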
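The exception is handled in the following steps:
1. Call killApplication() to end this AM.
2. statusManager changes the state of the Framework that this AM is responsible for to APPLICATION_RETRIEVING_DIAGNOSTICS.
3. diagnosticsRetrieveHandler takes over and starts a separate thread to do the work.
4. That thread submits the task retrieveApplicationExitCode(applicationId, finalDiagnostics) to a thread pool. The task first gets the ApplicationId that corresponds to the AM, then calls completeApplication() to try to end the framework.
5. completeApplication() marks the framework state as APPLICATION_COMPLETED and then calls attemptToRetry().
6. In attemptToRetry(), the framework ends up in one of two ways, depending on its status:
    1) completeFramework --> marks the framework state as FRAMEWORK_COMPLETED
    2) retryFramework --> marks the framework state as FRAMEWORK_WAITING, then calls createApplication() to start a new AM.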
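
The state changes in steps 2, 5, and 6 can be traced with a small sketch. It models only the states named in the steps; the class and method names are hypothetical, the `shouldRetry` flag stands in for the retry-policy decision, and the real launcher performs these transitions through statusManager across several threads rather than in one method.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical trace of the framework states visited after a failed launch. */
final class FrameworkRetrySketch {
  /** Only the states mentioned in the steps above. */
  enum FrameworkState {
    APPLICATION_RETRIEVING_DIAGNOSTICS,
    APPLICATION_COMPLETED,
    FRAMEWORK_WAITING,
    FRAMEWORK_COMPLETED
  }

  /** Step 6: a completed application either retries the framework or completes it. */
  static FrameworkState attemptToRetry(FrameworkState current, boolean shouldRetry) {
    if (current != FrameworkState.APPLICATION_COMPLETED) {
      throw new IllegalStateException("attemptToRetry expects APPLICATION_COMPLETED, got " + current);
    }
    return shouldRetry ? FrameworkState.FRAMEWORK_WAITING : FrameworkState.FRAMEWORK_COMPLETED;
  }

  /** Returns the states visited, in order, for one failed launch. */
  static List<FrameworkState> handleLaunchFailure(boolean shouldRetry) {
    List<FrameworkState> visited = new ArrayList<>();
    visited.add(FrameworkState.APPLICATION_RETRIEVING_DIAGNOSTICS); // step 2
    visited.add(FrameworkState.APPLICATION_COMPLETED);              // step 5
    visited.add(attemptToRetry(FrameworkState.APPLICATION_COMPLETED, shouldRetry)); // step 6
    return visited;
  }
}
```

Calling `handleLaunchFailure(true)` ends in `FRAMEWORK_WAITING`, the state from which a new AM is created, and `handleLaunchFailure(false)` ends in `FRAMEWORK_COMPLETED`.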
 
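3.2 Checking frameworks in a COMPLETE state
This check is asynchronous and runs on a separate thread.
The entry point is the method resyncFrameworksWithLiveApplications(Map<String, ApplicationReport> liveApplicationReports). Because the RM persists AM state in ZK, what is checked here is the set of live Applications pulled from ZK.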
private void resyncFrameworksWithLiveApplications(Map<String, ApplicationReport> liveApplicationReports) throws Exception {
    // Since Application is persistent in ZK by RM, so liveApplicationReports will never incomplete.
    String logScope = "resyncFrameworksWithLiveApplications";
    CHANGE_AWARE_LOGGER.initializeScope(logScope, Level.INFO, Level.DEBUG);
    CHANGE_AWARE_LOGGER.log(logScope,
        "Got %s live Applications from RM, start to resync them.",
        liveApplicationReports.size());

    for (ApplicationReport applicationReport : liveApplicationReports.values()) {
      String applicationId = applicationReport.getApplicationId().toString();
      YarnApplicationState applicationState = applicationReport.getYarnApplicationState();
      FinalApplicationStatus applicationFinalStatus = applicationReport.getFinalApplicationStatus();
      String diagnostics = applicationReport.getDiagnostics();

      if (statusManager.isApplicationIdLiveAssociated(applicationId)) {
        FrameworkStatus frameworkStatus = statusManager.getFrameworkStatusWithLiveAssociatedApplicationId(applicationId);
        String frameworkName = frameworkStatus.getFrameworkName();
        FrameworkState frameworkState = frameworkStatus.getFrameworkState();
        if (frameworkState == FrameworkState.APPLICATION_CREATED) {
          continue;
        }

        // updateApplicationStatus
        statusManager.updateApplicationStatus(frameworkName, applicationReport);

        // transitionFrameworkState
        if (applicationFinalStatus == FinalApplicationStatus.UNDEFINED) {
          if (applicationState == YarnApplicationState.NEW ||
              applicationState == YarnApplicationState.NEW_SAVING ||
              applicationState == YarnApplicationState.SUBMITTED ||
              applicationState == YarnApplicationState.ACCEPTED) {
            statusManager.transitionFrameworkState(frameworkName, FrameworkState.APPLICATION_WAITING);
          } else if (applicationState == YarnApplicationState.RUNNING) {
            statusManager.transitionFrameworkState(frameworkName, FrameworkState.APPLICATION_RUNNING);
          }
        } else if (applicationFinalStatus == FinalApplicationStatus.SUCCEEDED) {
          retrieveApplicationExitDiagnostics(
              applicationId,
              ExitStatusKey.SUCCEEDED.toInt(),
              diagnostics,
              false);
        } else if (applicationFinalStatus == FinalApplicationStatus.KILLED) {
          retrieveApplicationExitDiagnostics(
              applicationId,
              ExitStatusKey.AM_KILLED_BY_USER.toInt(),
              diagnostics,
              false);
        } else if (applicationFinalStatus == FinalApplicationStatus.FAILED) {
          retrieveApplicationExitDiagnostics(
              applicationId,
              null,
              diagnostics,
              false);
        }
      } else {
        // Do not kill Application due to AM_RM_RESYNC_EXCEED, since Exceed AM will kill itself.
        // In this way, we can support multiple LauncherServices to share a single RM,
        // like the sharing of HDFS and ZK.
      }
    }

    List<String> liveAssociatedApplicationIds = statusManager.getLiveAssociatedApplicationIds();
    for (String applicationId : liveAssociatedApplicationIds) {
      if (!liveApplicationReports.containsKey(applicationId)) {
        FrameworkStatus frameworkStatus = statusManager.getFrameworkStatusWithLiveAssociatedApplicationId(applicationId);
        String frameworkName = frameworkStatus.getFrameworkName();
        FrameworkState frameworkState = frameworkStatus.getFrameworkState();

        // APPLICATION_CREATED Application is not in the liveApplicationReports, but it is indeed live in RM.
        if (frameworkState == FrameworkState.APPLICATION_CREATED) {
          continue;
        }

        LOGGER.logWarning(
            "[%s]: Cannot find live associated Application %s in resynced live Applications. " +
                "Will complete it with RMResyncLost ExitStatus",
            frameworkName, applicationId);

        retrieveApplicationExitDiagnostics(
            applicationId,
            ExitStatusKey.AM_RM_RESYNC_LOST.toInt(),
            "AM lost after RMResynced",
            false);
      }
    }
  }

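As the code shows, it calls retrieveApplicationExitDiagnostics() for every application whose final status is SUCCEEDED, KILLED, or FAILED, and also for any live associated application that is missing from the resynced report list.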

 private void retrieveApplicationExitDiagnostics(String applicationId, Integer exitCode, String diagnostics, boolean needToKill) throws Exception {
    if (needToKill) {
      HadoopUtils.killApplication(applicationId);
    }

    String logSuffix = String.format(
        "[%s]: retrieveApplicationExitDiagnostics: ExitCode: %s, ExitDiagnostics: %s, NeedToKill: %s",
        applicationId, exitCode, diagnostics, needToKill);

    if (!statusManager.isApplicationIdAssociated(applicationId)) {
      LOGGER.logWarning("[NotAssociated]%s", logSuffix);
      return;
    }

    FrameworkStatus frameworkStatus = statusManager.getFrameworkStatusWithAssociatedApplicationId(applicationId);
    String frameworkName = frameworkStatus.getFrameworkName();

    // Schedule to retrieveDiagnostics
    LOGGER.logDebug("[%s]%s", frameworkName, logSuffix);
    statusManager.transitionFrameworkState(frameworkName, FrameworkState.APPLICATION_RETRIEVING_DIAGNOSTICS,
        new FrameworkEvent().setApplicationExitCode(exitCode).setApplicationExitDiagnostics(diagnostics));
    diagnosticsRetrieveHandler.retrieveDiagnosticsAsync(applicationId, diagnostics);
  }

retrieveDiagnosticsAsync() then starts a separate thread to run the check.

From this point on, the execution path is the same as from step 3 of section 3.1 onward.

 

4. Task/taskRole retry

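There are two cases.
First, when a container finishes running, its "post-mortem" is carried out by calling attemptToRetry(). As with a framework, a task ends up in one of two ways:
  1) completeTask --> marks the task state as TASK_COMPLETED, then attemptToStop() tries to end the whole framework.
  2) retryTask --> marks the task state as TASK_WAITING, then addContainerRequest() requests a container again.
Second, recover() calls attemptToRetry() to help restore the transitionTaskStateQueue; it checks every task whose state is CONTAINER_COMPLETED.
At present, however, recover() is only called when the AM first starts.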
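
Both cases can be sketched with one helper. This is a simplified illustration, not the AM's code: the class and method names are hypothetical, `CONTAINER_RUNNING` is included only as an example of a state that the recovery pass leaves untouched, and the `IntPredicate` stands in for the per-task retry-policy decision.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical sketch of task retry after a container completes and during recovery. */
final class TaskRetrySketch {
  enum TaskState { TASK_WAITING, CONTAINER_RUNNING, CONTAINER_COMPLETED, TASK_COMPLETED }

  /** Case 1: only a task whose container has completed is retried or completed. */
  static TaskState attemptToRetry(TaskState state, boolean shouldRetry) {
    if (state != TaskState.CONTAINER_COMPLETED) {
      return state; // nothing to decide yet
    }
    // retryTask --> TASK_WAITING, completeTask --> TASK_COMPLETED
    return shouldRetry ? TaskState.TASK_WAITING : TaskState.TASK_COMPLETED;
  }

  /** Case 2: the recovery pass applies the same decision to every task. */
  static List<TaskState> recover(List<TaskState> states, IntPredicate shouldRetryTask) {
    List<TaskState> recovered = new ArrayList<>(states.size());
    for (int i = 0; i < states.size(); i++) {
      recovered.add(attemptToRetry(states.get(i), shouldRetryTask.test(i)));
    }
    return recovered;
  }
}
```

In the real AM, the retry branch is followed by addContainerRequest() and the complete branch by attemptToStop(); the sketch leaves both out and shows only the resulting task states.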
 
 
 
The next thing to find out is how the retry policy is normally configured for an ordinary job.
 
 
 
 
 
posted on 2018-08-20 16:04  今天天蓝蓝