YARN applications hang at 95%
Scenario:
Total resources: 18 GB memory, 9 vcores
Remaining resources: 14 GB, 5 vcores
4 running applications, each using 1 GB and 1 vcore
5 accepted applications
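A quick sanity check on these numbers: the 4 running applications hold 4 × 1 GB = 4 GB and 4 × 1 vcore = 4 vcores, so 18 − 4 = 14 GB and 9 − 4 = 5 vcores remain, which matches the reported remaining resources. The cluster therefore has ample free capacity; the 5 accepted applications are not blocked on memory or vcores.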
Configuration:
Dynamic Resource Pool Configuration:
Min Resources: 512 MB, 1 vcore
Max Resources: 18 GB, 9 vcores
Max Running Apps: 6
Max Application Master Share: 0.4 (limits the fraction of the pool's fair share that can be used to run Application Masters. For example, if set to 1.0, the AMs in a leaf pool can use up to 100% of both the memory and CPU fair share. If set to -1.0, the feature is disabled and the AM share is not checked. The default is 0.5; see the fair-scheduler.xml sketch below.)
Three nodes, each with 6 GB memory and 3 vcores
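For reference, these Dynamic Resource Pool settings are what Cloudera Manager writes into the FairScheduler allocation file (fair-scheduler.xml). A minimal sketch of the generated file, assuming a single root.default pool (the pool name and exact layout are assumptions, not taken from this cluster):

  <allocations>
    <queue name="root.default">
      <minResources>512 mb, 1 vcores</minResources>
      <maxResources>18432 mb, 9 vcores</maxResources>
      <maxRunningApps>6</maxRunningApps>
      <!-- fraction of the pool's fair share that AM containers may occupy -->
      <maxAMShare>0.4</maxAMShare>
    </queue>
  </allocations>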
yarn-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<!--Autogenerated by Cloudera Manager-->
<configuration>
  <property><name>yarn.acl.enable</name><value>true</value></property>
  <property><name>yarn.admin.acl</name><value>*</value></property>
  <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
  <property><name>yarn.log-aggregation.retain-seconds</name><value>259200</value></property>
  <property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.ha.automatic-failover.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.ha.automatic-failover.embedded</name><value>true</value></property>
  <property><name>yarn.resourcemanager.recovery.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.zk-address</name><value>sz280111:2181,sz280113:2181,sz280112:2181</value></property>
  <property><name>yarn.resourcemanager.store.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value></property>
  <property><name>yarn.client.failover-sleep-base-ms</name><value>100</value></property>
  <property><name>yarn.client.failover-sleep-max-ms</name><value>2000</value></property>
  <property><name>yarn.resourcemanager.cluster-id</name><value>yarnRM</value></property>
  <property><name>yarn.resourcemanager.work-preserving-recovery.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.address.rm49</name><value>sz280111:8032</value></property>
  <property><name>yarn.resourcemanager.scheduler.address.rm49</name><value>sz280111:8030</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.address.rm49</name><value>sz280111:8031</value></property>
  <property><name>yarn.resourcemanager.admin.address.rm49</name><value>sz280111:8033</value></property>
  <property><name>yarn.resourcemanager.webapp.address.rm49</name><value>sz280111:8088</value></property>
  <property><name>yarn.resourcemanager.webapp.https.address.rm49</name><value>sz280111:8090</value></property>
  <property><name>yarn.resourcemanager.address.rm61</name><value>sz280112:8032</value></property>
  <property><name>yarn.resourcemanager.scheduler.address.rm61</name><value>sz280112:8030</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.address.rm61</name><value>sz280112:8031</value></property>
  <property><name>yarn.resourcemanager.admin.address.rm61</name><value>sz280112:8033</value></property>
  <property><name>yarn.resourcemanager.webapp.address.rm61</name><value>sz280112:8088</value></property>
  <property><name>yarn.resourcemanager.webapp.https.address.rm61</name><value>sz280112:8090</value></property>
  <property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm49,rm61</value></property>
  <property><name>yarn.nodemanager.recovery.enabled</name><value>true</value></property>
  <property><name>yarn.nodemanager.recovery.dir</name><value>/qhapp/cdh/var/lib/hadoop-yarn/yarn-nm-recovery</value></property>
  <property><name>yarn.resourcemanager.client.thread-count</name><value>50</value></property>
  <property><name>yarn.resourcemanager.scheduler.client.thread-count</name><value>50</value></property>
  <property><name>yarn.resourcemanager.admin.client.thread-count</name><value>1</value></property>
  <property><name>yarn.scheduler.minimum-allocation-mb</name><value>512</value></property>
  <property><name>yarn.scheduler.increment-allocation-mb</name><value>256</value></property>
  <property><name>yarn.scheduler.maximum-allocation-mb</name><value>2048</value></property>
  <property><name>yarn.scheduler.minimum-allocation-vcores</name><value>1</value></property>
  <property><name>yarn.scheduler.increment-allocation-vcores</name><value>1</value></property>
  <property><name>yarn.scheduler.maximum-allocation-vcores</name><value>8</value></property>
  <property><name>yarn.resourcemanager.amliveliness-monitor.interval-ms</name><value>1000</value></property>
  <property><name>yarn.am.liveness-monitor.expiry-interval-ms</name><value>600000</value></property>
  <property><name>yarn.resourcemanager.am.max-attempts</name><value>2</value></property>
  <property><name>yarn.resourcemanager.container.liveness-monitor.interval-ms</name><value>600000</value></property>
  <property><name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name><value>1000</value></property>
  <property><name>yarn.nm.liveness-monitor.expiry-interval-ms</name><value>600000</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.client.thread-count</name><value>50</value></property>
  <property><name>yarn.application.classpath</name><value>$HADOOP_CLIENT_CONF_DIR,$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*</value></property>
  <property><name>yarn.resourcemanager.scheduler.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value></property>
  <property><name>yarn.nodemanager.container-monitor.interval-ms</name><value>3000</value></property>
  <property><name>yarn.resourcemanager.max-completed-applications</name><value>1000</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
  <property><name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name><value>org.apache.hadoop.mapred.ShuffleHandler</value></property>
  <property><name>yarn.nodemanager.local-dirs</name><value>/qhapp/cdh/var/lib/yarn/nm</value></property>
  <property><name>yarn.nodemanager.log-dirs</name><value>/qhapp/cdh/var/log/yarn/container-logs</value></property>
  <property><name>yarn.nodemanager.webapp.address</name><value>sz280108:8042</value></property>
  <property><name>yarn.nodemanager.webapp.https.address</name><value>sz280108:8044</value></property>
  <property><name>yarn.nodemanager.address</name><value>sz280108:8041</value></property>
  <property><name>yarn.nodemanager.admin-env</name><value>MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX</value></property>
  <property><name>yarn.nodemanager.env-whitelist</name><value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,HADOOP_YARN_HOME</value></property>
  <property><name>yarn.nodemanager.container-manager.thread-count</name><value>20</value></property>
  <property><name>yarn.nodemanager.delete.thread-count</name><value>4</value></property>
  <property><name>yarn.resourcemanager.nodemanagers.heartbeat-interval-ms</name><value>100</value></property>
  <property><name>yarn.nodemanager.localizer.address</name><value>sz280108:8040</value></property>
  <property><name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name><value>600000</value></property>
  <property><name>yarn.nodemanager.localizer.cache.target-size-mb</name><value>5120</value></property>
  <property><name>yarn.nodemanager.localizer.client.thread-count</name><value>5</value></property>
  <property><name>yarn.nodemanager.localizer.fetch.thread-count</name><value>4</value></property>
  <property><name>yarn.nodemanager.log.retain-seconds</name><value>10800</value></property>
  <property><name>yarn.nodemanager.remote-app-log-dir</name><value>/tmp/logs</value></property>
  <property><name>yarn.nodemanager.remote-app-log-dir-suffix</name><value>logs</value></property>
  <property><name>yarn.nodemanager.resource.memory-mb</name><value>6144</value></property>
  <property><name>yarn.nodemanager.resource.cpu-vcores</name><value>3</value></property>
  <property><name>yarn.nodemanager.delete.debug-delay-sec</name><value>0</value></property>
  <property><name>yarn.nodemanager.health-checker.script.path</name><value></value></property>
  <property><name>yarn.nodemanager.health-checker.script.opts</name><value></value></property>
  <property><name>yarn.nodemanager.disk-health-checker.interval-ms</name><value>120000</value></property>
  <property><name>yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb</name><value>0</value></property>
  <property><name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name><value>90.0</value></property>
  <property><name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name><value>0.25</value></property>
  <property><name>mapreduce.shuffle.max.threads</name><value>80</value></property>
  <property><name>yarn.log.server.url</name><value>http://sz280111:19888/jobhistory/logs/</value></property>
  <property><name>yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user</name><value>nobody</value></property>
  <property><name>yarn.nodemanager.linux-container-executor.resources-handler.class</name><value>org.apache.hadoop.yarn.server.nodemanager.util.DefaultLCEResourcesHandler</value></property>
  <property><name>yarn.nodemanager.vmem-pmem-ratio</name><value>6</value></property>
</configuration>
Solution 1:
Increase the Max Application Master Share to 0.8.
The Max Application Master Share controls the percentage of total cluster memory and CPU that may be used by AM containers. With several jobs submitted, each AM consumes the memory and CPU requested for its own container.
Once that usage exceeds the configured percentage of total cluster resources, the next AM waits until enough resources are freed before it runs.
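A worked example, under two assumptions (the pool's fair share equals the whole cluster, 18 GB / 9 vcores, and each running application's 1 GB / 1 vcore is its AM container): with Max Application Master Share = 0.4, total AM usage is capped at 0.4 × 18 GB = 7.2 GB and 0.4 × 9 = 3.6 vcores, which on the vcore axis leaves room for at most 3 such AMs. Any further AM request fails the share check, so its application sits in ACCEPTED even while 14 GB / 5 vcores are idle. Raising the share to 0.8 lifts the cap to 14.4 GB / 7.2 vcores, enough for the queued AMs to launch.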
Reference: https://community.hortonworks.com/questions/77454/tez-job-hang-waiting-for-am-container-to-be-alloca.html
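In Cloudera Manager this change is made in the Dynamic Resource Pool Configuration UI. On a hand-maintained FairScheduler setup, the equivalent edit in fair-scheduler.xml would be the single maxAMShare line (pool name assumed, as in the sketch above):

  <queue name="root.default">
    <maxAMShare>0.8</maxAMShare>
  </queue>

The FairScheduler reloads its allocation file periodically, so the new share should take effect without restarting the ResourceManager.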