Notes on a YARN ResourceManager (RM) Failure

The symptoms match those described at the link below:

http://cache.baiducontent.com/c?m=9d78d513d99401ef05ad837e7c4d8b711925d6387d9583532e8ec40884642a071d26b4e8713510758b96383416ae394bea872173474466ecc5df893acabbe53f2ef876692c4dc101528445e9dc4755d620e74de8df59b0e2a763d5f984c4de24048004543dc6abd6061715ba38ba4566a1e0c215494b57fab33f3fb91f3568882233ab5aa8bd6d3140ddad9b175bc35d8a3c51d1f269f56352ec52b31f6c7519ff51e0550d6067bc093abe037f46cfab1bbe7a644023bc4bb5b3dce1ab08d19cbd71d8a78bb82fe33bbad2ea8f27193110a963eff1eaf22a643344838a89459225bc8cb4e908ba53914b02eb002a7e2c8e2bc3dec940f21500b2b836&p=9f7ac815d9c10ebe44be9b7c4e&newp=8e36d10a85cc43ec0cbd9b7c4253d8304a02c70e3dc3864e1290c408d23f061d4862e7bf27251200d0c7786507ac425cedf4377323454df6cc8a871d81edd17c&user=baidu&fm=sc&query=error+in+dispatcher+thread+java%2Eutil%2Econcurrent%2Erejectedexecutionexception&qid=ad6bf4940002c824&p1=1

 

Error:"Error in dispatcher thread java.util.concurrent.RejectedExecutionException" when running heavy load of job from YARN Resource Manager
SupportKB
Problem Description: 
The YARN Resource Manager (RM), with HA configured, fails under a heavy load of jobs: both the standby RM and the previously active RM crash. The following error is displayed in the Resource Manager log at the moment of shutdown:
2018-10-23 18:50:42,552 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(190)) - 
Error in dispatcher thread 
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@5407c4c8 rejected from 
java.util.concurrent.ThreadPoolExecutor@74d60fd0[Terminated, pool size = 14147, active threads = 0, queued tasks = 0, completed tasks = 32283] 
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) 
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) 
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) 
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134) 
at org.apache.hadoop.registry.server.services.RegistryAdminService.submit(RegistryAdminService.java:176) 
at org.apache.hadoop.registry.server.integration.RMRegistryOperationsService.purgeRecordsAsync(RMRegistryOperationsService.java:200) 
at org.apache.hadoop.registry.server.integration.RMRegistryOperationsService.purgeRecordsAsync(RMRegistryOperationsService.java:170) 
at org.apache.hadoop.registry.server.integration.RMRegistryOperationsService.onContainerFinished(RMRegistryOperationsService.java:146) 
at org.apache.hadoop.yarn.server.resourcemanager.registry.RMRegistryService.handleAppAttemptEvent(RMRegistryService.java:156) 
at org.apache.hadoop.yarn.server.resourcemanager.registry.RMRegistryService$AppEventHandler.handle(RMRegistryService.java:188) 
at org.apache.hadoop.yarn.server.resourcemanager.registry.RMRegistryService$AppEventHandler.handle(RMRegistryService.java:182) 
at org.apache.hadoop.yarn.event.AsyncDispatcher$MultiListenerHandler.handle(AsyncDispatcher.java:279) 
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) 
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) 
at java.lang.Thread.run(Thread.java:748) 
2018-10-23 18:50:42,552 INFO capacity.ParentQueue (ParentQueue.java:assignContainers(475)) - 
assignedContainer queue=root usedCapacity=0.78571427 absoluteUsedCapacity=0.78571427 
used=<memory:3914240, vCores:1076> cluster=<memory:4981760, vCores:2318> 
2018-10-23 18:50:42,559 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(422)) - 
container_e173_1540320252022_0085_02_002570 Container Transitioned from ALLOCATED to ACQUIRED 
2018-10-23 18:50:42,559 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(422)) - 
container_e173_1540320252022_0085_02_002571 Container Transitioned from ALLOCATED to ACQUIRED 
2018-10-23 21:49:24,484 INFO resourcemanager.ResourceManager (LogAdapter.java:info(45)) - STARTUP_MSG: 
/************************************************************ 
STARTUP_MSG: Starting ResourceManager 
STARTUP_MSG: user = yarn 
STARTUP_MSG: host = ustsmascmsp920.prod/10.86.128.54 
STARTUP_MSG: args = [] 
STARTUP_MSG: version = 2.7.3.2.6.1.0-129

  
Cause: 
The Resource Manager has to purge the records under ZooKeeper for every container that completes. While doing this, it scans almost all znodes from the root path. A large number of znodes can cause the ZooKeeper client session to drop and the AsyncDispatcher queue to become overwhelmed, and the Resource Manager may then shut itself down due to a race condition.
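The mechanics behind the FATAL line above can be reproduced in isolation. A minimal Java sketch (not the actual RegistryAdminService/AsyncDispatcher code) showing why a submit() against an executor that has already been shut down raises exactly this exception:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class RejectedSubmitSketch {
    public static void main(String[] args) {
        // Stand-in for the registry's internal executor.
        ExecutorService executor = Executors.newCachedThreadPool();

        // Once the registry service stops (e.g. after a ZooKeeper session drop),
        // its executor is shut down and eventually reaches the Terminated state.
        executor.shutdown();

        try {
            // Stand-in for RegistryAdminService.submit() being called from the
            // dispatcher thread when another container finishes.
            executor.submit(() -> System.out.println("purge znode records"));
        } catch (RejectedExecutionException e) {
            // The default AbortPolicy rejects the task; in the RM this surfaces
            // as the FATAL "Error in dispatcher thread" message seen in the log.
            System.err.println("Task rejected: " + e);
        }
    }
}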
Solution: 
This issue is resolved in HDP-2.6.5. For versions prior to HDP-2.6.5, do the following to disable the ResourceManager registry:
1. Log into the Ambari UI.
2. Click the YARN service.
3. Click Configs > Advanced tab.
4. Expand the Advanced yarn-site section.
5. Set hadoop.registry.rm.enabled to false (an equivalent yarn-site.xml snippet is sketched below).
6. Restart all affected components.
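For clusters not managed through Ambari, a sketch of the equivalent change made directly in yarn-site.xml on both ResourceManager hosts (the property name comes from step 5 above; the RMs still need a restart afterwards):

<!-- yarn-site.xml: disable the RM service registry as a workaround prior to HDP-2.6.5 -->
<property>
  <name>hadoop.registry.rm.enabled</name>
  <value>false</value>
</property>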

 

In addition, after rm1 went down, rm2 also failed to complete the transition to active. The relevant log entries, in sequence (commands for checking each RM's HA state are sketched after the timeline):

12:35:21      the RM failover controller only now reports that it has found the active RM (rm2)
12:35:56      the session with ZooKeeper is closed again
12:37:18      rm2 appears to be working normally as active and starts logging various initialization messages
12:37:18,916  rm2 transitions back to standby; the RM then stops itself in order to enforce fencing
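When investigating this kind of failover, the HA state reported by each ResourceManager can be queried from the command line. A small sketch, assuming the RM IDs are rm1 and rm2 (they must match the cluster's yarn.resourcemanager.ha.rm-ids setting):

# Query the HA state (active/standby) reported by each ResourceManager
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2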

One more note, for reference, on the RM fencing (isolation) mechanism:

https://www.cnblogs.com/shenh062326/p/3547786.html

 

posted on 2019-08-23 17:30  roger888
