12: NodeManager
Health Checker Service
Disk Checker
Configuration Name | Allowed Values | Description |
---|---|---|
yarn.nodemanager.disk-health-checker.enable | true, false | Enable or disable the disk health checker service |
yarn.nodemanager.disk-health-checker.interval-ms | Positive integer | The interval, in milliseconds, at which the disk checker should run; the default value is 2 minutes |
yarn.nodemanager.disk-health-checker.min-healthy-disks | Float between 0-1 | The minimum fraction of disks that must pass the check for the NodeManager to mark the node as healthy; the default is 0.25 |
yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage | Float between 0-100 | The maximum percentage of disk space that may be utilized before a disk is marked as unhealthy by the disk checker service. This check is run for every disk used by the NodeManager. The default value is 90 i.e. 90% of the disk can be used. |
yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb | Integer | The minimum amount of free space that must be available on the disk for the disk checker service to mark the disk as healthy. This check is run for every disk used by the NodeManager. The default value is 0 i.e. the entire disk can be used. |
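As a minimal sketch, the disk checker properties above could be written out in conf/yarn-site.xml as follows. The values shown are simply the defaults listed in the table (2 minutes expressed as 120000 ms, and so on), so this fragment changes nothing on a stock setup; it only illustrates where each setting goes.

```xml
<!-- Sketch: disk health checker settings, using the defaults from the table above -->
<property>
  <name>yarn.nodemanager.disk-health-checker.enable</name>
  <value>true</value>
</property>
<property>
  <!-- 2 minutes, in milliseconds -->
  <name>yarn.nodemanager.disk-health-checker.interval-ms</name>
  <value>120000</value>
</property>
<property>
  <!-- at least 25% of the disks must pass for the node to be healthy -->
  <name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name>
  <value>0.25</value>
</property>
<property>
  <!-- a disk above 90% utilization is marked unhealthy -->
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>90.0</value>
</property>
<property>
  <!-- 0 means no minimum free space is required -->
  <name>yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb</name>
  <value>0</value>
</property>
```

For example, with four local disks and min-healthy-disks at 0.25, the node stays healthy as long as at least one disk passes the check.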
External Health Script
Configuration Name | Allowed Values | Description |
---|---|---|
yarn.nodemanager.health-checker.interval-ms | Positive integer | The interval, in milliseconds, at which the health checker service runs; the default value is 10 minutes. |
yarn.nodemanager.health-checker.script.timeout-ms | Positive integer | The timeout, in milliseconds, for the health script that’s executed; the default value is 20 minutes. |
yarn.nodemanager.health-checker.script.path | String | Absolute path to the health check script to be run. |
yarn.nodemanager.health-checker.script.opts | String | Arguments to be passed to the script when the script is executed. |
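A possible conf/yarn-site.xml fragment for the external health script is sketched below. The interval and timeout are the defaults from the table converted to milliseconds; the script path and its arguments are hypothetical placeholders, not values from any real installation.

```xml
<!-- Sketch: external health script settings -->
<property>
  <!-- 10 minutes, in milliseconds (the default) -->
  <name>yarn.nodemanager.health-checker.interval-ms</name>
  <value>600000</value>
</property>
<property>
  <!-- 20 minutes, in milliseconds (the default) -->
  <name>yarn.nodemanager.health-checker.script.timeout-ms</name>
  <value>1200000</value>
</property>
<property>
  <!-- hypothetical path: must be an absolute path to your own script -->
  <name>yarn.nodemanager.health-checker.script.path</name>
  <value>/usr/local/bin/nm-health-check.sh</value>
</property>
<property>
  <!-- hypothetical arguments, passed to the script on each run -->
  <name>yarn.nodemanager.health-checker.script.opts</name>
  <value>--verbose</value>
</property>
```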
NodeManager Restart
Step 1. To enable NM Restart functionality, set the following property in conf/yarn-site.xml to true.
Property | Value |
---|---|
yarn.nodemanager.recovery.enabled | true (the default value is false) |
Step 2. Configure a path to the local file-system directory where the NodeManager can save its run state (the state store).
Property | Description |
---|---|
yarn.nodemanager.recovery.dir | The local filesystem directory in which the node manager will store state when recovery is enabled. The default value is set to $hadoop.tmp.dir/yarn-nm-recovery. |
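Step 3. Configure a valid RPC address for the NodeManager. After a restart, the NM may otherwise come up on a different port and break existing client connections, so replace the ephemeral port with a fixed one.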
Property | Description |
---|---|
yarn.nodemanager.address | Ephemeral ports (port 0, which is the default) cannot be used for the NodeManager’s RPC server specified via yarn.nodemanager.address, as this can make the NM use different ports before and after a restart. That breaks any previously running clients that were communicating with the NM before the restart. Explicitly setting yarn.nodemanager.address to an address with a specific port number (e.g. 0.0.0.0:45454) is a precondition for enabling NM restart. |
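Steps 1 to 3 can be combined into a single conf/yarn-site.xml fragment, sketched below. The address 0.0.0.0:45454 is the example port from the table above; the recovery directory is a hypothetical path chosen for illustration, and omitting that property would fall back to the default under hadoop.tmp.dir.

```xml
<!-- Sketch: minimal settings for NodeManager restart (steps 1 to 3) -->
<property>
  <!-- Step 1: turn on recovery (off by default) -->
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Step 2: hypothetical local directory for the NM state store -->
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value>
</property>
<property>
  <!-- Step 3: fixed RPC port instead of the ephemeral default (port 0) -->
  <name>yarn.nodemanager.address</name>
  <value>0.0.0.0:45454</value>
</property>
```

Choosing a recovery directory outside a temporary location is a design decision worth considering, since state that is cleaned up between restarts cannot be recovered.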
Step 4. Auxiliary services. Any auxiliary service that is configured should itself support restart.
NodeManagers in a YARN cluster can be configured to run auxiliary services. For a completely functional NM restart, YARN relies on any auxiliary service configured to also support recovery. This usually includes (1) avoiding usage of ephemeral ports so that previously running clients (in this case, usually containers) are not disrupted after restart and (2) having the auxiliary service itself support recoverability by reloading any previous state when NodeManager restarts and reinitializes the auxiliary service.
A simple example of the above is the auxiliary service ‘ShuffleHandler’ for MapReduce (MR). ShuffleHandler already meets both requirements, so users and admins don’t have to do anything for it to support NM restart: (1) The configuration property mapreduce.shuffle.port controls which port the ShuffleHandler on a NodeManager host binds to, and it defaults to a non-ephemeral port. (2) The ShuffleHandler service already supports recovery of previous state after the NM restarts. In short, ShuffleHandler supports NM restart out of the box.
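For reference, a typical way to register ShuffleHandler as an auxiliary service is sketched below. The service name mapreduce_shuffle and the class name are the ones commonly used in stock Hadoop setups; the port value 13562 is believed to be the upstream default for mapreduce.shuffle.port, so treat it as an assumption and verify it against your distribution before relying on it. Setting the port explicitly is optional and is shown only to make the non-ephemeral choice visible.

```xml
<!-- Sketch: ShuffleHandler as an auxiliary service (yarn-site.xml) -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<!-- Sketch: fixed shuffle port (usually set in mapred-site.xml) -->
<property>
  <!-- assumed upstream default; any fixed, non-zero port satisfies requirement (1) -->
  <name>mapreduce.shuffle.port</name>
  <value>13562</value>
</property>
```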