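Notes on a production incident: the YARN ResourceManager kept failing over
Early on a weekend morning I was woken up by alerts: the RM kept switching between active and standby.
I hurried to investigate and found two errors in the logs.
Error message 1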
ervation <memory:0, vCores:0>
2019-12-21 11:51:57,781 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode.unreserveResource(FSSchedulerNode.java:88)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.unreserve(FSAppAttempt.java:589)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainerInternal(FairScheduler.java:899)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:564)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:846)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1479)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:117)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:804)
    at java.lang.Thread.run(Thread.java:748)
Error message 2
2019-12-21 07:37:45,533 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainerInternal(FairScheduler.java:902)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:564)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:837)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1475)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:117)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:804)
    at java.lang.Thread.run(Thread.java:748)
2019-12-21 07:37:45,534 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
Looking at the relevant source in FairScheduler:
@Override
protected void completedContainerInternal(
    RMContainer rmContainer, ContainerStatus containerStatus,
    RMContainerEventType event) {
  try {
    writeLock.lock();
    Container container = rmContainer.getContainer();

    // Get the application for the finished container
    FSAppAttempt application =
        getCurrentAttemptForContainer(container.getId());
    ApplicationId appId =
        container.getId().getApplicationAttemptId().getApplicationId();
    if (application == null) {
      LOG.info("Container " + container + " of" + " finished application "
          + appId + " completed with event " + event);
      return;
    }

    // Get the node on which the container was allocated
    FSSchedulerNode node = getFSSchedulerNode(container.getNodeId());

    if (rmContainer.getState() == RMContainerState.RESERVED) {
      // This releases the resources reserved for this container on the node
      application.unreserve(rmContainer.getReservedPriority(), node);
    } else {
      try {
        application.containerCompleted(rmContainer, containerStatus, event);
        node.releaseContainer(rmContainer.getContainerId(), false);
        updateRootQueueMetrics();
        LOG.info("Application attempt " + application.getApplicationAttemptId()
            + " released container " + container.getId()
            + " on node: " + node + " with event: " + event);
      } catch (Exception e) {
        LOG.error(e.getMessage(), e);
      }
    }
  } finally {
    writeLock.unlock();
  }
}
Stepping into unreserve:
/**
 * Remove the reservation on {@code node} at the given {@link Priority}.
 * This dispatches SchedulerNode handlers as well.
 */
public void unreserve(Priority priority, FSSchedulerNode node) {
  RMContainer rmContainer = node.getReservedContainer();
  unreserveInternal(priority, node);
  node.unreserveResource(this);
  clearReservation(node);
  getMetrics().unreserveResource(node.getPartition(), getUser(),
      rmContainer.getContainer().getResource());
}
@Override
public synchronized void unreserveResource(
    SchedulerApplicationAttempt application) {
  // Cannot unreserve for wrong application...
  // The NPE is thrown by the next statement: the attemptId of this container
  // cannot be obtained
  ApplicationAttemptId reservedApplication =
      getReservedContainer().getContainer().getId().getApplicationAttemptId();
  if (!reservedApplication.equals(
      application.getApplicationAttemptId())) {
    throw new IllegalStateException("Trying to unreserve " +
        " for application " + application.getApplicationId() +
        " when currently reserved " +
        " for application " + reservedApplication.getApplicationId() +
        " on node " + this);
  }

  setReservedContainer(null);
  this.reservedAppSchedulable = null;
}
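The second error is caused by rmContainer being null. If a call to removeApplicationAttempt and a moveApplication for the same attempt are processed in quick succession, the application attempt still holds a queue reference but has already been removed from the queue's application list. If two calls to removeApplicationAttempt arrive back to back, the application likewise still holds a queue reference but has already been removed from the queue's application list. In both cases, the second call has to come in before the removeApplication call is made.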
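In effect the container is being released twice: it has already been released on that node, so the two sides disagree about its state.
A write lock is used here. If one thread has already read the containerId and another thread releases the container, releasing it a second time throws the exception.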
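The double release described above can be reproduced in a few lines. The following is a minimal sketch, not Hadoop code: `Node`, `Reservation`, `unreserveUnchecked` and `unreserveIfPresent` are hypothetical names invented for illustration. It shows that dereferencing the reservation unconditionally fails with a NullPointerException on the second call, while a null check makes the operation safe to repeat.

```java
// Toy model, not Hadoop code: every class and method name here is hypothetical.
class DoubleUnreserveSketch {

    /** Stand-in for a reserved container; it only carries the owning attempt id. */
    static final class Reservation {
        final String attemptId;

        Reservation(String attemptId) {
            this.attemptId = attemptId;
        }
    }

    /** Stand-in for a scheduler node that holds at most one reservation. */
    static final class Node {
        private Reservation reserved;

        synchronized void reserve(String attemptId) {
            reserved = new Reservation(attemptId);
        }

        /** Mirrors the failing pattern: dereferences the reservation unconditionally. */
        synchronized void unreserveUnchecked(String attemptId) {
            String owner = reserved.attemptId; // NPE if the reservation is already gone
            if (!owner.equals(attemptId)) {
                throw new IllegalStateException("reserved for " + owner);
            }
            reserved = null;
        }

        /** Idempotent variant: a repeated call is a no-op and returns false. */
        synchronized boolean unreserveIfPresent(String attemptId) {
            if (reserved == null) {
                return false; // nothing left to release
            }
            if (!reserved.attemptId.equals(attemptId)) {
                throw new IllegalStateException("reserved for " + reserved.attemptId);
            }
            reserved = null;
            return true;
        }
    }

    /** Returns true if the second unchecked unreserve throws a NullPointerException. */
    static boolean secondUncheckedCallThrowsNpe() {
        Node node = new Node();
        node.reserve("attempt_1");
        node.unreserveUnchecked("attempt_1");
        try {
            node.unreserveUnchecked("attempt_1");
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("second unchecked call throws NPE: "
            + secondUncheckedCallThrowsNpe());
        Node node = new Node();
        node.reserve("attempt_1");
        System.out.println("first guarded call released: "
            + node.unreserveIfPresent("attempt_1"));
        System.out.println("second guarded call released: "
            + node.unreserveIfPresent("attempt_1"));
    }
}
```

A guard of this kind is one way to make the release idempotent. Whether it is appropriate inside the real scheduler is a separate question, since it also hides the fact that the same container was released twice.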
Fix 1
/**
 * Clean up a completed container.
 */
@Override
protected synchronized void completedContainerInternal(
    RMContainer rmContainer, ContainerStatus containerStatus,
    RMContainerEventType event) {
  try {
    // writeLock.lock(); // write lock commented out; use the heavyweight lock (synchronized) instead
    Container container = rmContainer.getContainer();

    // Get the application for the finished container
    FSAppAttempt application =
        getCurrentAttemptForContainer(container.getId());
    ApplicationId appId =
        container.getId().getApplicationAttemptId().getApplicationId();
    if (application == null) {
      LOG.info("Container " + container + " of" + " finished application "
          + appId + " completed with event " + event);
      return;
    }
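One possible reading of this change, stated here as an assumption because these notes do not spell it out: mutual exclusion only holds between code paths that take the same lock. If some scheduler methods are guarded by `synchronized` while this one takes `writeLock`, the two can run at the same time. The sketch below is a toy model with hypothetical names (`LockMismatchSketch` and its methods); it only demonstrates the general Java behaviour, not FairScheduler itself.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy model, not Hadoop code: the class and its methods are hypothetical.
class LockMismatchSketch {
    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

    /** Starts a second thread that tries to take the write lock without blocking. */
    private boolean tryWriteLockFromOtherThread() {
        final boolean[] acquired = new boolean[1];
        Thread other = new Thread(() -> {
            if (rwLock.writeLock().tryLock()) {
                acquired[0] = true;
                rwLock.writeLock().unlock();
            }
        });
        other.start();
        try {
            other.join(); // join makes the write to acquired[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return acquired[0];
    }

    /** This thread holds the object's monitor; the other thread wants the write lock. */
    boolean otherThreadGetsWriteLockWhileMonitorHeld() {
        synchronized (this) {
            return tryWriteLockFromOtherThread();
        }
    }

    /** This thread holds the write lock; the other thread wants the same write lock. */
    boolean otherThreadGetsWriteLockWhileWriteLockHeld() {
        rwLock.writeLock().lock();
        try {
            return tryWriteLockFromOtherThread();
        } finally {
            rwLock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        LockMismatchSketch sketch = new LockMismatchSketch();
        System.out.println("monitor held, write lock still free for others: "
            + sketch.otherThreadGetsWriteLockWhileMonitorHeld());
        System.out.println("write lock held, write lock free for others: "
            + sketch.otherThreadGetsWriteLockWhileWriteLockHeld());
    }
}
```

Holding the monitor does not stop another thread from taking the write lock, whereas holding the write lock does. Under the assumption above, putting the method on the same lock as its neighbours would close that gap.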
Fix 2
    // Get the node on which the container was allocated
    FSSchedulerNode node = getFSSchedulerNode(container.getNodeId());

    try {
      if (rmContainer.getState() == RMContainerState.RESERVED) {
        application.unreserve(rmContainer.getReservedPriority(), node);
      } else {
        // try { // the try was moved up so that it also covers the unreserve call
        application.containerCompleted(rmContainer, containerStatus, event);
        node.releaseContainer(rmContainer.getContainerId(), false);
        updateRootQueueMetrics();
        LOG.info("Application attempt " + application.getApplicationAttemptId()
            + " released container " + container.getId()
            + " on node: " + node + " with event: " + event);
      }
    } catch (Exception e) {
      LOG.error(e.getMessage(), e); // handle the exception here instead of letting it propagate
    }
  }
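The effect of catching the exception can be illustrated with a toy event loop. This is a sketch under assumptions, not the ResourceManager's dispatcher: `EventLoopSketch`, `run`, `handle` and the event strings are all hypothetical. It compares a loop that stops at the first handler failure (standing in for the fatal exit seen in the log) with one that logs the failure and keeps processing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Toy model, not Hadoop code: class, methods and event names are hypothetical.
class EventLoopSketch {

    /** Hypothetical handler: a duplicate removal fails the way the double release did. */
    static void handle(String event) {
        if (event.startsWith("duplicate")) {
            throw new NullPointerException("reservation already cleared");
        }
    }

    /**
     * Feeds events to the handler in order and returns those handled successfully.
     * With catchErrors=false the first failure ends the loop; with catchErrors=true
     * the failure is logged and the loop carries on with the next event.
     */
    static List<String> run(List<String> events, Consumer<String> handler,
                            boolean catchErrors) {
        List<String> handled = new ArrayList<>();
        for (String event : events) {
            try {
                handler.accept(event);
                handled.add(event);
            } catch (RuntimeException e) {
                System.err.println("handler failed on " + event + ": " + e);
                if (!catchErrors) {
                    break; // stands in for the process exiting
                }
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList(
            "remove:attempt_1", "duplicate-remove:attempt_1", "remove:attempt_2");
        System.out.println("fail-fast loop handled: "
            + run(events, EventLoopSketch::handle, false));
        System.out.println("catching loop handled:  "
            + run(events, EventLoopSketch::handle, true));
    }
}
```

The trade-off is that the loop survives, but the inconsistent state that caused the exception is only logged, not repaired.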