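Notes on a problem with Hadoop 2.8.4 RM HA backed by ZooKeeper
Background:
Our company moved the production Hadoop RM onto ZK for high availability. However, a ZK znode holds only about 1 MB by default, and once the stored data grows beyond that, production jobs may crash.
The fix has two parts:
1. Change the ZK configuration to raise the default storage limit
2. Change the path structure of the RM data stored in ZK, so that the split structure can support more data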
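Problem 1: change the ZK configuration to raise the default storage limit
This is mainly a matter of changing configuration parameters.
Change the configuration on every ZK node (raising the limit to roughly 10 MB)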
vi zkServer.sh
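Add the following setting (the original post marked its position in the script with a screenshot): ZOO_USER_CFG="-Djute.maxbuffer=10240000"
Modify zkCli.sh as well (otherwise the command-line client cannot read data larger than 1 MB)
Even then, our own client code cannot read data above the default size either. It needs the JVM system property below, set before the ZooKeeper client is created: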
System.setProperty("jute.maxbuffer",String.valueOf(10240000));
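As a minimal sketch of what this property does: ZooKeeper reads `jute.maxbuffer` as an ordinary JVM system property and falls back to 0xfffff (1,048,575 bytes) when it is unset. The class and method names below are hypothetical and the snippet does not use the ZooKeeper library; it only shows the lookup and why the property has to be set first.

```java
// Illustrative only: imitates how a JVM-side reader resolves jute.maxbuffer.
// JuteMaxBufferSketch and effectiveMaxBuffer are hypothetical names.
final class JuteMaxBufferSketch {
    // ZooKeeper's documented default: 0xfffff, just under 1 MB.
    static final int ZK_DEFAULT_MAX_BUFFER = 0xfffff;

    // Returns the limit a reader would see at the moment of the call.
    static int effectiveMaxBuffer() {
        return Integer.getInteger("jute.maxbuffer", ZK_DEFAULT_MAX_BUFFER);
    }

    public static void main(String[] args) {
        System.out.println("before: " + effectiveMaxBuffer());
        // Same call as in the post; 10240000 bytes is a little under 10 MiB.
        System.setProperty("jute.maxbuffer", String.valueOf(10240000));
        System.out.println("after:  " + effectiveMaxBuffer());
    }
}
```

Because the value is looked up from the system properties, a client that reads it before `System.setProperty` runs keeps the default, which is why the call belongs at the very start of the program (or on the command line as `-Djute.maxbuffer=...`).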
The YARN configuration has to be changed in the same way, otherwise all of the above is wasted effort
yarn-env.sh
Add one line:
YARN_RESOURCEMANAGER_OPTS="$YARN_RESOURCEMANAGER_OPTS -Djute.maxbuffer=10240000"
Problem 2: optimize the storage structure in ZK
YARN stores its state in ZK as follows:
ROOT_DIR_PATH
|--- VERSION_INFO
|--- EPOCH_NODE
|--- RM_ZK_FENCING_LOCK
|--- RM_APP_ROOT
|     |----- (#ApplicationId1)
|     |        |----- (#ApplicationAttemptIds)
|     |
|     |----- (#ApplicationId2)
|     |        |----- (#ApplicationAttemptIds)
|     ....
|
|--- RM_DT_SECRET_MANAGER_ROOT
        |----- RM_DT_SEQUENTIAL_NUMBER_ZNODE_NAME
        |----- RM_DELEGATION_TOKENS_ROOT_ZNODE_NAME
        |       |----- Token_1
        |       |----- Token_2
        |       ....
        |
        |----- RM_DT_MASTER_KEYS_ROOT_ZNODE_NAME
        |       |----- Key_1
        |       |----- Key_2
                ....
|--- AMRMTOKEN_SECRET_MANAGER_ROOT
        |----- currentMasterKey
        |----- nextMasterKey
After the change:
The znode structure is as follows:
ROOT_DIR_PATH
|--- VERSION_INFO
|--- EPOCH_NODE
|--- RM_ZK_FENCING_LOCK
|--- RM_APP_ROOT
|     |----- HIERARCHIES
|     |        |----- 1
|     |        |      |----- (#ApplicationId barring last character)
|     |        |      |       |----- (#Last character of ApplicationId)
|     |        |      |       |       |----- (#ApplicationAttemptIds)
|     |        |      ....
|     |        |
|     |        |----- 2
|     |        |      |----- (#ApplicationId barring last 2 characters)
|     |        |      |       |----- (#Last 2 characters of ApplicationId)
|     |        |      |       |       |----- (#ApplicationAttemptIds)
|     |        |      ....
|     |        |
|     |        |----- 3
|     |        |      |----- (#ApplicationId barring last 3 characters)
|     |        |      |       |----- (#Last 3 characters of ApplicationId)
|     |        |      |       |       |----- (#ApplicationAttemptIds)
|     |        |      ....
|     |        |
|     |        |----- 4
|     |        |      |----- (#ApplicationId barring last 4 characters)
|     |        |      |       |----- (#Last 4 characters of ApplicationId)
|     |        |      |       |       |----- (#ApplicationAttemptIds)
|     |        |      ....
|     |        |
|     |----- (#ApplicationId1)
|     |        |----- (#ApplicationAttemptIds)
|     |
|     |----- (#ApplicationId2)
|     |        |----- (#ApplicationAttemptIds)
|     ....
|
|--- RM_DT_SECRET_MANAGER_ROOT
        |----- RM_DT_SEQUENTIAL_NUMBER_ZNODE_NAME
        |----- RM_DELEGATION_TOKENS_ROOT_ZNODE_NAME
        |       |----- 1
        |       |      |----- (#TokenId barring last character)
        |       |      |       |----- (#Last character of TokenId)
        |       |      ....
        |       |----- 2
        |       |      |----- (#TokenId barring last 2 characters)
        |       |      |       |----- (#Last 2 characters of TokenId)
        |       |      ....
        |       |----- 3
        |       |      |----- (#TokenId barring last 3 characters)
        |       |      |       |----- (#Last 3 characters of TokenId)
        |       |      ....
        |       |----- 4
        |       |      |----- (#TokenId barring last 4 characters)
        |       |      |       |----- (#Last 4 characters of TokenId)
        |       |      ....
        |       |----- Token_1
        |       |----- Token_2
        |       ....
        |
        |----- RM_DT_MASTER_KEYS_ROOT_ZNODE_NAME
        |       |----- Key_1
        |       |----- Key_2
                ....
|--- AMRMTOKEN_SECRET_MANAGER_ROOT
        |----- currentMasterKey
        |----- nextMasterKey
|-- RESERVATION_SYSTEM_ROOT
        |------PLAN_1
        |      |------ RESERVATION_1
        |      |------ RESERVATION_2
        |      ....
        |------PLAN_2
        ....
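To make the new layout concrete, here is a small sketch that maps an application id to its znode path under the tree above. It is an illustration of the diagram, not the ZKRMStateStore code: the class and method names are hypothetical, `level` stands for the numbered node (1 to 4) under HIERARCHIES, and `RM_APP_ROOT` is used as a literal placeholder for the real root path.

```java
// Illustrative only: derives a path from the tree diagram, not from Hadoop's source.
// AppZnodePathSketch, appPath and level are hypothetical names.
final class AppZnodePathSketch {
    static final String RM_APP_ROOT = "RM_APP_ROOT"; // placeholder label from the diagram
    static final String HIERARCHIES = "HIERARCHIES";

    // level = number of trailing characters of the id stored as the child znode.
    // Levels outside 1..4 keep the flat layout directly under RM_APP_ROOT.
    static String appPath(String appId, int level) {
        if (level < 1 || level > 4 || level >= appId.length()) {
            return RM_APP_ROOT + "/" + appId;
        }
        int cut = appId.length() - level;
        String parent = appId.substring(0, cut); // id barring the last `level` characters
        String child = appId.substring(cut);     // last `level` characters
        return RM_APP_ROOT + "/" + HIERARCHIES + "/" + level + "/" + parent + "/" + child;
    }

    public static void main(String[] args) {
        String appId = "application_1352994193343_0001";
        for (int level = 0; level <= 4; level++) {
            System.out.println(level + " -> " + appPath(appId, level));
        }
    }
}
```

The application attempt znodes then hang below the last path segment, exactly as they hang below the application id in the flat layout.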
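Add a new property to yarn-site.xml. The value 0 shown below is the default and means no split; set it to a value from 1 to 4 to enable the hierarchical layout.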
<property>
  <description>Index at which last section of application id (with each section
  separated by _ in application id) will be split so that application znode
  stored in zookeeper RM state store will be stored as two different znodes
  (parent-child). Split is done from the end.
  For instance, with no split, appid znode will be of the form
  application_1352994193343_0001. If the value of this config is 1, the appid
  znode will be broken into two parts application_1352994193343_000 and 1
  respectively with former being the parent node.
  application_1352994193343_0002 will then be stored as 2 under the parent node
  application_1352994193343_000. This config can take values from 0 to 4. 0
  means there will be no split. If configuration value is outside this range,
  it will be treated as config value of 0 (i.e. no split). A value larger than 0
  (up to 4) should be configured if you are storing a large number of apps in
  ZK based RM state store and state store operations are failing due to
  LenError in Zookeeper.</description>
  <name>yarn.resourcemanager.zk-appid-node.split-index</name>
  <value>0</value>
</property>
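The effect of different split-index values can be estimated with a short simulation. The sketch below groups synthetic application ids by the parent znode they would land under and reports how many parents exist and how many children the fullest one holds. All names are hypothetical, the ids reuse the cluster timestamp from the description, and the grouping follows the rule in the description rather than Hadoop's actual code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: estimates znode fan-out for a given split index.
// SplitFanOutSketch, parentKey and fanOut are hypothetical names.
final class SplitFanOutSketch {
    static final String CLUSTER_TS = "1352994193343"; // timestamp taken from the description

    // Parent znode name for an app id; "" stands for RM_APP_ROOT itself (no split).
    static String parentKey(String appId, int splitIndex) {
        if (splitIndex < 1 || splitIndex > 4) {
            return "";
        }
        return appId.substring(0, appId.length() - splitIndex);
    }

    // Returns {number of parent znodes, largest child count under one parent}
    // for application sequence numbers 1..appCount.
    static int[] fanOut(int appCount, int splitIndex) {
        Map<String, Integer> children = new HashMap<>();
        for (int seq = 1; seq <= appCount; seq++) {
            String appId = String.format(Locale.ROOT, "application_%s_%04d", CLUSTER_TS, seq);
            children.merge(parentKey(appId, splitIndex), 1, Integer::sum);
        }
        int max = 0;
        for (int count : children.values()) {
            max = Math.max(max, count);
        }
        return new int[] {children.size(), max};
    }

    public static void main(String[] args) {
        for (int k = 0; k <= 4; k++) {
            int[] r = fanOut(9999, k);
            System.out.printf("split-index=%d parents=%d maxChildren=%d%n", k, r[0], r[1]);
        }
    }
}
```

For 9999 applications, a split index of 2 spreads the ids over 100 parents with at most 100 children each, whereas 0 leaves all 9999 under a single znode. A split index of 4 puts them all under one parent again, so a larger value is not automatically better for a given number of applications.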
Reference: https://cloud.tencent.com/developer/article/1491079
Reference: https://issues.apache.org/jira/browse/YARN-2368
Reference: https://issues.apache.org/jira/browse/YARN-2962