Hbase合并Region的过程中出现永久RIT的解决
在合并Region的过程中出现永久RIT怎么办?笔者在生产环境中就遇到过这种情况,在批量合并Region的过程中,出现了永久MERGING_NEW的情况,虽然这种情况不会影响现有集群的正常的服务能力,但是如果集群有某个节点发生重启,那么可能此时该RegionServer上的Region是没法均衡的。因为在RIT状态时,HBase是不会执行Region负载均衡的,即使手动执行balancer命令也是无效的。
如果不解决这种RIT情况,那么后续有HBase节点相继重启,这样会导致整个集群的Region验证不均衡,这是很致命的,对集群的性能将会影响很大。经过查询HBase JIRA单,发现这种MERGING_NEW永久RIT的情况是触发了HBASE-17682的BUG,需要打上该Patch来修复这个BUG,其实就是HBase源代码在判断业务逻辑时,没有对MERGING_NEW这种状态进行判断,直接进入到else流程中了。源代码如下:
for (RegionState state : regionsInTransition.values()) { HRegionInfo hri = state.getRegion(); if (assignedRegions.contains(hri)) { // Region is open on this region server, but in transition. // This region must be moving away from this server, or splitting/merging. // SSH will handle it, either skip assigning, or re-assign. LOG.info("Transitioning " + state + " will be handled by ServerCrashProcedure for " + sn); } else if (sn.equals(state.getServerName())) { // Region is in transition on this region server, and this // region is not open on this server. So the region must be // moving to this server from another one (i.e. opening or // pending open on this server, was open on another one. // Offline state is also kind of pending open if the region is in // transition. The region could be in failed_close state too if we have // tried several times to open it while this region server is not reachable) if (state.isPendingOpenOrOpening() || state.isFailedClose() || state.isOffline()) { LOG.info("Found region in " + state + " to be reassigned by ServerCrashProcedure for " + sn); rits.add(hri); } else if(state.isSplittingNew()) { regionsToCleanIfNoMetaEntry.add(state.getRegion()); } else { LOG.warn("THIS SHOULD NOT HAPPEN: unexpected " + state); } } }
修复之后代码:
for (RegionState state : regionsInTransition.values()) { HRegionInfo hri = state.getRegion(); if (assignedRegions.contains(hri)) { // Region is open on this region server, but in transition. // This region must be moving away from this server, or splitting/merging. // SSH will handle it, either skip assigning, or re-assign. LOG.info("Transitioning " + state + " will be handled by ServerCrashProcedure for " + sn); } else if (sn.equals(state.getServerName())) { // Region is in transition on this region server, and this // region is not open on this server. So the region must be // moving to this server from another one (i.e. opening or // pending open on this server, was open on another one. // Offline state is also kind of pending open if the region is in // transition. The region could be in failed_close state too if we have // tried several times to open it while this region server is not reachable) if (state.isPendingOpenOrOpening() || state.isFailedClose() || state.isOffline()) { LOG.info("Found region in " + state + " to be reassigned by ServerCrashProcedure for " + sn); rits.add(hri); } else if(state.isSplittingNew()) { regionsToCleanIfNoMetaEntry.add(state.getRegion()); } else if (isOneOfStates(state, State.SPLITTING_NEW, State.MERGING_NEW)) { regionsToCleanIfNoMetaEntry.add(state.getRegion()); }else { LOG.warn("THIS SHOULD NOT HAPPEN: unexpected " + state); } } }
但是,这里有一个问题,目前该JIRA单只是说了需要去修复BUG,打Patch。但是,实际生产情况下,面对这种RIT情况,是不可能长时间停止集群,影响应用程序读写的。那么,有没有临时的解决办法,先临时解决当前的MERGING_NEW这种永久RIT,之后在进行HBase版本升级操作。
办法是有的,在分析了MERGE合并的流程之后,发现HBase在执行Region合并时,会先生成一个初始状态的MERGING_NEW。整个Region合并流程如下:
从流程图中可以看到,MERGING_NEW是一个初始化状态,在Master的内存中,而处于Backup状态的Master内存中是没有这个新Region的MERGING_NEW状态的,那么可以通过对HBase的Master进行一个主备切换,来临时消除这个永久RIT状态。而HBase是一个高可用的集群,进行主备切换时对用户应用来说是无感操作。因此,面对MERGING_NEW状态的永久RIT可以使用对HBase进行主备切换的方式来做一个临时处理方案。之后,我们在对HBase进行修复BUG,打Patch进行版本升级。