Resolving a Permanent RIT During an HBase Region Merge

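What should you do when a Region merge leaves a Region permanently in transition (RIT)? I ran into exactly this in production: during a batch merge of Regions, a Region became stuck permanently in the MERGING_NEW state. This does not affect the cluster's ability to serve existing traffic, but if a node in the cluster restarts, the Regions from that RegionServer may never be rebalanced, because HBase does not run Region load balancing while a Region is in RIT. Even running the balancer command by hand has no effect.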

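If the RIT is left unresolved and HBase nodes go on to restart one after another, the Region distribution across the whole cluster becomes severely unbalanced. That is critical, and it has a heavy impact on cluster performance. Searching the HBase JIRA showed that this permanent MERGING_NEW RIT is the bug tracked as HBASE-17682, and the patch for it has to be applied to fix it. In short, the branching logic in the HBase source has no case for the MERGING_NEW state, so such a Region falls straight through to the final else branch. The original source code: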

for (RegionState state : regionsInTransition.values()) {
        HRegionInfo hri = state.getRegion();
        if (assignedRegions.contains(hri)) {
          // Region is open on this region server, but in transition.
          // This region must be moving away from this server, or splitting/merging.
          // SSH will handle it, either skip assigning, or re-assign.
          LOG.info("Transitioning " + state + " will be handled by ServerCrashProcedure for " + sn);
        } else if (sn.equals(state.getServerName())) {
          // Region is in transition on this region server, and this
          // region is not open on this server. So the region must be
          // moving to this server from another one (i.e. opening or
          // pending open on this server, was open on another one.
          // Offline state is also kind of pending open if the region is in
          // transition. The region could be in failed_close state too if we have
          // tried several times to open it while this region server is not reachable)
          if (state.isPendingOpenOrOpening() || state.isFailedClose() || state.isOffline()) {
            LOG.info("Found region in " + state +
              " to be reassigned by ServerCrashProcedure for " + sn);
            rits.add(hri);
          } else if(state.isSplittingNew()) {
            regionsToCleanIfNoMetaEntry.add(state.getRegion());
          } else {
            LOG.warn("THIS SHOULD NOT HAPPEN: unexpected " + state);
          }
        }
      }

The code after the fix:

for (RegionState state : regionsInTransition.values()) {
        HRegionInfo hri = state.getRegion();
        if (assignedRegions.contains(hri)) {
          // Region is open on this region server, but in transition.
          // This region must be moving away from this server, or splitting/merging.
          // SSH will handle it, either skip assigning, or re-assign.
          LOG.info("Transitioning " + state + " will be handled by ServerCrashProcedure for " + sn);
        } else if (sn.equals(state.getServerName())) {
          // Region is in transition on this region server, and this
          // region is not open on this server. So the region must be
          // moving to this server from another one (i.e. opening or
          // pending open on this server, was open on another one.
          // Offline state is also kind of pending open if the region is in
          // transition. The region could be in failed_close state too if we have
          // tried several times to open it while this region server is not reachable)
          if (state.isPendingOpenOrOpening() || state.isFailedClose() || state.isOffline()) {
            LOG.info("Found region in " + state +
              " to be reassigned by ServerCrashProcedure for " + sn);
            rits.add(hri);
          } else if (isOneOfStates(state, State.SPLITTING_NEW, State.MERGING_NEW)) {
            // The fix: MERGING_NEW is now handled together with SPLITTING_NEW.
            regionsToCleanIfNoMetaEntry.add(state.getRegion());
          } else {
            LOG.warn("THIS SHOULD NOT HAPPEN: unexpected " + state);
          }
        }
      }
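
To make the difference between the two versions concrete, here is a minimal, self-contained Java sketch of just the branch logic. It is a toy model, not HBase code: the `State` and `Action` enums and the `classifyBeforeFix` / `classifyAfterFix` methods are hypothetical names introduced only for illustration, and the model assumes nothing beyond the branch order visible in the two listings above.

```java
import java.util.EnumSet;

// Toy model of the branch logic only. State, Action and both classify methods
// are hypothetical stand-ins for illustration, not HBase classes or APIs.
class ServerOfflineSketch {
    enum State { PENDING_OPEN, OPENING, FAILED_CLOSE, OFFLINE, SPLITTING_NEW, MERGING_NEW }

    enum Action { REASSIGN, CLEAN_IF_NO_META_ENTRY, UNEXPECTED }

    // States that the first branch hands over for reassignment.
    private static final EnumSet<State> REASSIGNABLE =
        EnumSet.of(State.PENDING_OPEN, State.OPENING, State.FAILED_CLOSE, State.OFFLINE);

    // Mirrors the original code: MERGING_NEW matches no branch and reaches the else.
    static Action classifyBeforeFix(State state) {
        if (REASSIGNABLE.contains(state)) {
            return Action.REASSIGN;
        } else if (state == State.SPLITTING_NEW) {
            return Action.CLEAN_IF_NO_META_ENTRY;
        } else {
            return Action.UNEXPECTED; // only a warning is logged; nothing cleans the region up
        }
    }

    // Mirrors the patched code: MERGING_NEW is treated like SPLITTING_NEW.
    static Action classifyAfterFix(State state) {
        if (REASSIGNABLE.contains(state)) {
            return Action.REASSIGN;
        } else if (state == State.SPLITTING_NEW || state == State.MERGING_NEW) {
            return Action.CLEAN_IF_NO_META_ENTRY;
        } else {
            return Action.UNEXPECTED;
        }
    }

    public static void main(String[] args) {
        for (State s : State.values()) {
            System.out.println(s + ": before=" + classifyBeforeFix(s)
                + ", after=" + classifyAfterFix(s));
        }
    }
}
```

Running `main` prints one line per state. MERGING_NEW is the only state whose outcome differs between the two versions, which is the whole point of the patch.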

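There is a catch, though. The JIRA issue only says that the bug must be fixed by applying the patch. In a real production environment, you cannot stop the cluster for a long time and disrupt application reads and writes just to clear this RIT. So is there a temporary workaround that clears the current permanent MERGING_NEW RIT first and leaves the HBase version upgrade for later?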

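There is. Analysis of the merge flow shows that when HBase merges Regions, it first creates the new Region in an initial MERGING_NEW state. The overall Region merge flow is as follows:

[Figure: Region merge flow chart]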

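The flow chart shows that MERGING_NEW is an initial state held in the active Master's memory, while the backup Master's memory holds no MERGING_NEW state for the new Region. A Master failover (switching the active and backup Masters) can therefore clear the permanent RIT temporarily. Because HBase is a highly available cluster, the failover is transparent to user applications. So a permanent MERGING_NEW RIT can be handled for the time being with a Master failover, after which the bug is fixed properly by applying the patch and upgrading HBase.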
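
The reasoning behind the failover workaround can be sketched with a small, self-contained Java model. This is a deliberately simplified illustration, not how the HBase Master is implemented: the class name, the `rebuildTransitions` method, the region names, and the use of plain string maps for the persisted catalog and the in-memory RIT view are all assumptions made for the example. The only idea taken from the text above is that the MERGING_NEW entry lives in the active Master's memory and nowhere else.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of a Master failover. All names here are hypothetical;
// the real Master startup does far more than this sketch shows.
class MasterFailoverSketch {

    // A newly active Master can only rebuild its in-transition view from
    // state that was persisted; regions recorded as OPEN are not in transition.
    static Map<String, String> rebuildTransitions(Map<String, String> persistedStates) {
        Map<String, String> regionsInTransition = new HashMap<>();
        for (Map.Entry<String, String> entry : persistedStates.entrySet()) {
            if (!"OPEN".equals(entry.getValue())) {
                regionsInTransition.put(entry.getKey(), entry.getValue());
            }
        }
        return regionsInTransition;
    }

    public static void main(String[] args) {
        // Persisted catalog: the two regions being merged are still recorded as open.
        Map<String, String> persistedStates = new HashMap<>();
        persistedStates.put("regionA", "OPEN");
        persistedStates.put("regionB", "OPEN");

        // Old active Master: the merged region exists only in its memory.
        Map<String, String> oldActiveRit = new HashMap<>();
        oldActiveRit.put("mergedAB", "MERGING_NEW");
        System.out.println("RIT on old active Master: " + oldActiveRit);

        // After failover the new active Master starts from persisted state only,
        // so the stuck in-memory entry is simply absent.
        Map<String, String> newActiveRit = rebuildTransitions(persistedStates);
        System.out.println("RIT on new active Master: " + newActiveRit);
    }
}
```

In this model the stuck entry disappears because it was never persisted, which is why the workaround clears the symptom without fixing the underlying bug. The patch is still needed to stop the situation from recurring.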

posted @ 2019-03-30 15:47 niutao
