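Overall, three pieces are involved: heartbeat detection, a thread that removes replicas, and a thread that restores replicas. When a datanode goes down, or one storage on a datanode (such as a single disk) fails, the namenode detects it from the heartbeats the datanode sends. The namenode does not react immediately while handling the heartbeat report. It first records the heartbeat information; a separate periodic check thread then removes the corresponding block information from DatanodeManager and BlockManager and records the data that needs to be restored. To restore the data, yet another thread scans periodically and assigns the source and target datanodes needed for each replica, and during the datanode's next heartbeat this work is turned into commands returned to the datanode.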

Heartbeat detection of a failed datanode

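A datanode periodically sends heartbeat packets to the namenode to report its current state. When the namenode has received no heartbeat from a datanode for a certain time, it marks the node as stale. If the silence lasts until the node is judged dead (the isDatanodeDead check in the code further down), the blocks that node held are re-replicated on other datanodes.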

Several parameters in the HDFS configuration:

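dfs.heartbeat.interval: the interval between Hadoop heartbeats, 3 s by default.
dfs.namenode.stale.datanode.minimum.interval: the minimum number of missed heartbeats before a datanode is marked stale, 3 by default.
dfs.namenode.stale.datanode.interval: the datanode timeout; a datanode from which no heartbeat has arrived for longer than this is marked stale. The default is 30 s, and the value must be no smaller than the product of the two parameters above.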
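
To make the relationship between these values concrete, here is a small sketch of the arithmetic. It is a simplified model, not the Hadoop source: the class and method names are made up for illustration, and raising a too-small stale interval to the minimum is an assumption about how the constraint is enforced. The dead-node timeout uses the commonly cited formula of twice the recheck interval plus ten heartbeat intervals; the recheck interval is the 5-minute scan period discussed later (the parameter name dfs.namenode.heartbeat.recheck-interval does not appear in the text above).

```java
// Simplified model of how the heartbeat settings combine; not the Hadoop source.
final class HeartbeatTimeouts {

    /**
     * Stale interval in ms. Assumption: a configured value below
     * (minimum missed heartbeats * heartbeat interval) is raised to that minimum.
     */
    static long effectiveStaleIntervalMs(long heartbeatIntervalSec,
                                         long minMissedHeartbeats,
                                         long configuredStaleMs) {
        long minStaleMs = minMissedHeartbeats * heartbeatIntervalSec * 1000L;
        return Math.max(configuredStaleMs, minStaleMs);
    }

    /** Dead-node timeout in ms: 2 * recheck interval + 10 * heartbeat interval. */
    static long deadIntervalMs(long recheckIntervalMs, long heartbeatIntervalSec) {
        return 2 * recheckIntervalMs + 10 * 1000L * heartbeatIntervalSec;
    }

    public static void main(String[] args) {
        // Defaults: 3 s heartbeat, 3 missed heartbeats, 30 s stale interval.
        System.out.println(effectiveStaleIntervalMs(3, 3, 30_000)); // prints 30000
        // Defaults: 5 min recheck interval, 3 s heartbeat.
        System.out.println(deadIntervalMs(300_000, 3));             // prints 630000
    }
}
```

With the defaults, a node becomes stale after 30 s of silence but is only treated as dead after 10 min 30 s, which is why stale and dead are handled separately in the code that follows.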

Receiving heartbeat messages

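Hadoop datanode heartbeats are sent as an RPC: the arguments of the RPC function carry the datanode's statistics, and the return value carries the commands the namenode wants the datanode to run.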

When a datanode sends the message over RPC, the namenode first handles it in the RPC server and then hands it to FSNamesystem. The handling in NameNodeRpcServer:

@Override // DatanodeProtocol
public HeartbeatResponse sendHeartbeat(DatanodeRegistration nodeReg,
    StorageReport[] report, long dnCacheCapacity, long dnCacheUsed,
    int xmitsInProgress, int xceiverCount,
    int failedVolumes, VolumeFailureSummary volumeFailureSummary,
    boolean requestFullBlockReportLease,
    @Nonnull SlowPeerReports slowPeers,
    @Nonnull SlowDiskReports slowDisks) throws IOException {
  checkNNStartup();
  verifyRequest(nodeReg);
  return namesystem.handleHeartbeat(nodeReg, report,
      dnCacheCapacity, dnCacheUsed, xceiverCount, xmitsInProgress,
      failedVolumes, volumeFailureSummary, requestFullBlockReportLease,
      slowPeers, slowDisks);
}

namesystem is of type FSNamesystem, which manages the name-space state (it "is a container of both transient and persisted name-space state, and does all the book-keeping work on a NameNode") and is the container for services such as BlockManager, DatanodeManager, DelegationTokens and LeaseManager. In handleHeartbeat, the work is passed to the DatanodeManager obtained through blockManager:

DatanodeCommand[] cmds = blockManager.getDatanodeManager().handleHeartbeat(
  nodeReg, reports, getBlockPoolId(), cacheCapacity, cacheUsed,
  xceiverCount, maxTransfer, failedVolumes, volumeFailureSummary,
  slowPeers, slowDisks);

DatanodeManager then calls HeartbeatManager to handle it:

heartbeatManager.updateHeartbeat(nodeinfo, reports, cacheCapacity,
        cacheUsed, xceiverCount, failedVolumes, volumeFailureSummary);

Heartbeat handling in HeartbeatManager

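The HeartbeatManager class handles heartbeats. Failures are not handled at the moment a heartbeat message is received; a separate thread does that, scanning the state every 5 minutes by default. Possibly some of the handling needs information from several datanodes, which would explain why it is not done while replying to a single datanode's message. Datanodes and storages recorded as faulty are removed directly. This step only removes the datanode information held in the namenode (BlockManager, DatanodeManager and so on); it does not deal with restoring the lost replicas.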

An inner class, Monitor, implements the Runnable interface and runs the monitoring thread: private final Daemon heartbeatThread = new Daemon(new Monitor());. When more than heartbeatRecheckInterval has elapsed since the last check, it calls heartbeatCheck().

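In heartbeatCheck(), each pass of the loop first walks all registered datanodes and every storage on each of them, looking for failed datanodes and storages (a datanode may hold several storages, so a storage can fail while its datanode is still running normally). It then has DatanodeManager and BlockManager handle the first dead datanode and the first failed storage it found, and repeats until every faulty datanode and storage has been handled.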

  public void run() {
    while(namesystem.isRunning()) {
      restartHeartbeatStopWatch();
      try {
        final long now = Time.monotonicNow();
        if (lastHeartbeatCheck + heartbeatRecheckInterval < now) {
          heartbeatCheck();
          lastHeartbeatCheck = now;
        }
        // ....
      } catch (Exception e) {
        // exception handling omitted
      }
      // ....
    }
  }
	
	
  void heartbeatCheck() {
    final DatanodeManager dm = blockManager.getDatanodeManager();
    // It's OK to check safe mode w/o taking the lock here, we re-check
    // for safe mode after taking the lock before removing a datanode.
    if (namesystem.isInStartupSafeMode()) {
      return;
    }
    boolean allAlive = false;
    while (!allAlive) {
      // locate the first dead node.
      DatanodeDescriptor dead = null;

      // locate the first failed storage that isn't on a dead node.
      DatanodeStorageInfo failedStorage = null;

      // check the number of stale nodes
      int numOfStaleNodes = 0;
      int numOfStaleStorages = 0;
      synchronized(this) {
        for (DatanodeDescriptor d : datanodes) {
          // check if an excessive GC pause has occurred
          if (shouldAbortHeartbeatCheck(0)) {
            return;
          }
          if (dead == null && dm.isDatanodeDead(d)) {
            stats.incrExpiredHeartbeats();
            dead = d;
          }
          if (d.isStale(dm.getStaleInterval())) {
            numOfStaleNodes++;
          }
          DatanodeStorageInfo[] storageInfos = d.getStorageInfos();
          for(DatanodeStorageInfo storageInfo : storageInfos) {
            if (storageInfo.areBlockContentsStale()) {
              numOfStaleStorages++;
            }
            if (failedStorage == null &&
                storageInfo.areBlocksOnFailedStorage() &&
                d != dead) {
              failedStorage = storageInfo;
            }
          }
        }
        
        // Set the number of stale nodes in the DatanodeManager
        dm.setNumStaleNodes(numOfStaleNodes);
        dm.setNumStaleStorages(numOfStaleStorages);
      }

      allAlive = (dead == null && failedStorage == null);
      if (!allAlive && namesystem.isInStartupSafeMode()) {
        return;
      }
      if (dead != null) {
        // acquire the fsnamesystem lock, and then remove the dead node.
        namesystem.writeLock();
        try {
          dm.removeDeadDatanode(dead, !dead.isMaintenance());
        } finally {
          namesystem.writeUnlock();
        }
      }
      if (failedStorage != null) {
        // acquire the fsnamesystem lock, and remove blocks on the storage.
        namesystem.writeLock();
        try {
          blockManager.removeBlocksAssociatedTo(failedStorage);
        } finally {
          namesystem.writeUnlock();
        }
      }
    }
  }
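
The shape of this loop (scan under the manager's own lock, remove under the namesystem lock, one faulty node per pass) may be easier to see in a stripped-down model. The sketch below is a toy version, not the Hadoop source: Node, fsLock and expireMs are hypothetical stand-ins for DatanodeDescriptor, the namesystem write lock and the dead-node timeout, and failed storages are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the "remove one dead node per pass" loop; not the Hadoop source.
final class HeartbeatCheckSketch {

    /** Hypothetical stand-in for DatanodeDescriptor. */
    static final class Node {
        final String name;
        final long lastHeartbeatMs;
        Node(String name, long lastHeartbeatMs) {
            this.name = name;
            this.lastHeartbeatMs = lastHeartbeatMs;
        }
    }

    private final List<Node> nodes = new ArrayList<>();
    private final Object fsLock = new Object(); // stands in for the namesystem write lock
    private final long expireMs;                // stands in for the dead-node timeout

    HeartbeatCheckSketch(long expireMs) { this.expireMs = expireMs; }

    synchronized void register(Node n) { nodes.add(n); }

    synchronized int liveCount() { return nodes.size(); }

    /** Removes every expired node, one per pass; returns their names in removal order. */
    List<String> heartbeatCheck(long nowMs) {
        List<String> removed = new ArrayList<>();
        boolean allAlive = false;
        while (!allAlive) {
            Node dead = null;
            synchronized (this) {            // scan holds only the manager's own lock
                for (Node n : nodes) {
                    if (nowMs - n.lastHeartbeatMs > expireMs) { dead = n; break; }
                }
            }
            allAlive = (dead == null);
            if (dead != null) {
                synchronized (fsLock) {      // removal happens under the "namesystem" lock
                    synchronized (this) { nodes.remove(dead); }
                    removed.add(dead.name);
                }
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        HeartbeatCheckSketch hm = new HeartbeatCheckSketch(630_000L);
        hm.register(new Node("dn1", 0L));
        hm.register(new Node("dn2", 900_000L));
        System.out.println(hm.heartbeatCheck(1_000_000L)); // prints [dn1]
    }
}
```

One plausible reading of this design is that the expensive lock is never held during the scan: the loop takes it only for the short removal step and then rescans, at the cost of walking the node list once per removed node.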

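dm.removeDeadDatanode(dead, !dead.isMaintenance()): removeDeadDatanode in turn calls removeDatanode, which carries out the removal. It removes the datanode recorded in heartbeatManager, the related blocks in blockManager, and the datanode's entry in DatanodeManager's network topology (networktopology), then updates the software version count and unregisters the node's block report lease in blockManager.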

  private void removeDatanode(DatanodeDescriptor nodeInfo,
      boolean removeBlocksFromBlocksMap) {
    assert namesystem.hasWriteLock();
    heartbeatManager.removeDatanode(nodeInfo);
    if (removeBlocksFromBlocksMap) {
      blockManager.removeBlocksAssociatedTo(nodeInfo);
    }
    networktopology.remove(nodeInfo);
    decrementVersionCount(nodeInfo.getSoftwareVersion());
    blockManager.getBlockReportLeaseManager().unregister(nodeInfo);

    if (LOG.isDebugEnabled()) {
      LOG.debug("remove datanode " + nodeInfo);
    }
    blockManager.checkSafeMode();
  }

Restoring the data

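When a datanode is judged lost, blockManager deletes the block information for that datanode and, at the same time, adds the affected blocks to the neededReconstruction queue. Another thread in BlockManager periodically (every 3 s by default) processes the blocks queued there.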

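There are three main steps: 1. sort the blocks into erasure-coded (EC) blocks and replicated blocks; 2. choose the target nodes for each task; 3. put the task on the replicateBlocks queue of the DatanodeDescriptor class.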

  /**
   * Periodically calls computeBlockRecoveryWork().
   * By default the block recovery work runs once every 3 s.
   */
  private class RedundancyMonitor implements Runnable {

    @Override
    public void run() {
      while (namesystem.isRunning()) {
        try {
          // Process recovery work only when active NN is out of safe mode.
          if (isPopulatingReplQueues()) {
            // Scan the blocks in neededReconstruction and, for each block, choose
            // the datanodes to restore it to and the datanode to copy it from.
            computeDatanodeWork();
            processPendingReconstructions();
            rescanPostponedMisreplicatedBlocks();
          }
          TimeUnit.MILLISECONDS.sleep(redundancyRecheckIntervalMs); // 3 s by default
        } catch (Throwable t) {
          // exception handling omitted
        }
      }
    }
  }
  
  /**
   * Reconstruct a set of blocks to full strength through replication or
   * erasure coding
   *
   * @param blocksToReconstruct blocks to be reconstructed, for each priority
   * @return the number of blocks scheduled for replication
   */
  @VisibleForTesting
  int computeReconstructionWorkForBlocks(
      List<List<BlockInfo>> blocksToReconstruct) {
    int scheduledWork = 0;
    List<BlockReconstructionWork> reconWork = new LinkedList<>();

    // Step 1: categorize at-risk blocks into replication and EC tasks
    namesystem.writeLock();
    try {
      synchronized (neededReconstruction) {
        for (int priority = 0; priority < blocksToReconstruct.size(); priority++) {
          for (BlockInfo block : blocksToReconstruct.get(priority)) {
            BlockReconstructionWork rw = scheduleReconstruction(block,
                priority);
            if (rw != null) {
              reconWork.add(rw);
            }
          }
        }
      }
    } finally {
      namesystem.writeUnlock();
    }

    // Step 2: choose target nodes for each reconstruction task
    final Set<Node> excludedNodes = new HashSet<>();
    for(BlockReconstructionWork rw : reconWork){
      // Exclude all of the containing nodes from being targets.
      // This list includes decommissioning or corrupt nodes.
      excludedNodes.clear();
      for (DatanodeDescriptor dn : rw.getContainingNodes()) {
        excludedNodes.add(dn);
      }

      // choose replication targets: NOT HOLDING THE GLOBAL LOCK
      final BlockPlacementPolicy placementPolicy =
          placementPolicies.getPolicy(rw.getBlock().getBlockType());
      rw.chooseTargets(placementPolicy, storagePolicySuite, excludedNodes);
    }

    // Step 3: add tasks to the DN
    namesystem.writeLock();
    try {
      for(BlockReconstructionWork rw : reconWork){
        final DatanodeStorageInfo[] targets = rw.getTargets();
        if(targets == null || targets.length == 0){
          rw.resetTargets();
          continue;
        }

        synchronized (neededReconstruction) {
          if (validateReconstructionWork(rw)) {
            scheduledWork++;
          }
        }
      }
    } finally {
      namesystem.writeUnlock();
    }
    // debug logging omitted
    return scheduledWork;
  }
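
The locking pattern in computeReconstructionWorkForBlocks (collect under the lock, choose targets without it, then re-validate under the lock before committing) can be sketched in isolation. This is a minimal model under stated assumptions, not the Hadoop implementation: ThreePhaseScheduler, Work, and the round-robin target choice are all invented for illustration, and a plain ReentrantReadWriteLock stands in for the namesystem lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the collect / choose / commit pattern; all names are hypothetical.
final class ThreePhaseScheduler {

    static final class Work {
        final String block;
        String target;                       // filled in during phase 2
        Work(String block) { this.block = block; }
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> needed = new ArrayList<>();
    private final List<String> liveNodes = new ArrayList<>();

    void addNeeded(String block) {
        lock.writeLock().lock();
        try { needed.add(block); } finally { lock.writeLock().unlock(); }
    }

    void addNode(String node) {
        lock.writeLock().lock();
        try { liveNodes.add(node); } finally { lock.writeLock().unlock(); }
    }

    int neededCount() {
        lock.readLock().lock();
        try { return needed.size(); } finally { lock.readLock().unlock(); }
    }

    int schedule() {
        // Phase 1: under the lock, turn queued blocks into work items.
        List<Work> work = new ArrayList<>();
        List<String> candidates;
        lock.writeLock().lock();
        try {
            for (String b : needed) work.add(new Work(b));
            candidates = new ArrayList<>(liveNodes);
        } finally { lock.writeLock().unlock(); }

        // Phase 2: choose targets WITHOUT the lock, working on the snapshot.
        for (int i = 0; i < work.size(); i++) {
            if (!candidates.isEmpty()) {
                work.get(i).target = candidates.get(i % candidates.size());
            }
        }

        // Phase 3: back under the lock, re-validate each item and commit it.
        int scheduled = 0;
        lock.writeLock().lock();
        try {
            for (Work w : work) {
                if (w.target != null && liveNodes.contains(w.target)
                        && needed.remove(w.block)) {
                    scheduled++;
                }
            }
        } finally { lock.writeLock().unlock(); }
        return scheduled;
    }

    public static void main(String[] args) {
        ThreePhaseScheduler s = new ThreePhaseScheduler();
        s.addNode("dn1");
        s.addNeeded("blk_1");
        s.addNeeded("blk_2");
        System.out.println(s.schedule()); // prints 2
    }
}
```

The re-validation in phase 3 matters because the state may have changed while the lock was released; in the sketch, a work item whose target has disappeared or whose block is no longer queued is simply skipped.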

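When a new transfer task is created for a block, two choices have to be made: a datanode that already holds a replica (the source) and a datanode to copy to (the target). For the source, datanodes in the DECOMMISSION_INPROGRESS state are preferred by default, because they carry no write traffic. Otherwise a node that has not reached its replication limit is chosen at random (each node records how many blocks it is due to replicate, and dfs.namenode.replication.max-streams caps that number per node, 2 by default). If no such node exists, other eligible nodes are chosen at random. For the target, the replica placement policy is invoked, just as in the initial upload of a file.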

  /**
   * Parse the data-nodes the block belongs to and choose a certain number
   * from them to be the recovery sources.
   *
   * We prefer nodes that are in DECOMMISSION_INPROGRESS state to other nodes
   * since the former do not have write traffic and hence are less busy.
   * We do not use already decommissioned nodes as a source, unless there is
   * no other choice.
   * Otherwise we randomly choose nodes among those that did not reach their
   * replication limits. However, if the recovery work is of the highest
   * priority and all nodes have reached their replication limits, we will
   * randomly choose the desired number of nodes despite the replication limit.
   *
   * In addition form a list of all nodes containing the block
   * and calculate its replication numbers.
   *
   * @return the array of DatanodeDescriptor of the chosen nodes from which to
   *         recover the given block
   */
  @VisibleForTesting
  DatanodeDescriptor[] chooseSourceDatanodes(BlockInfo block,
      List<DatanodeDescriptor> containingNodes,
      List<DatanodeStorageInfo> nodesContainingLiveReplicas,
      NumberReplicas numReplicas,
      List<Byte> liveBlockIndices, int priority) // ...
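
The selection rules in this Javadoc can be written down as a small, deterministic toy. It is only a sketch of the stated preferences, not the body of chooseSourceDatanodes: Holder, State and chooseSources are hypothetical names, the random choice is replaced by list order so the result is predictable, priority 0 is assumed to be the highest, and the "already decommissioned only if there is no other choice" fallback is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Deterministic toy version of the source-selection rules; not the Hadoop source.
final class SourceChooserSketch {

    enum State { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

    /** Hypothetical record of one replica holder and its queued outgoing transfers. */
    static final class Holder {
        final String name;
        final State state;
        final int queuedTransfers;
        Holder(String name, State state, int queuedTransfers) {
            this.name = name;
            this.state = state;
            this.queuedTransfers = queuedTransfers;
        }
    }

    static final int HIGHEST_PRIORITY = 0; // assumption: 0 is the most urgent queue

    /** Returns candidate sources, decommissioning nodes first. */
    static List<String> chooseSources(List<Holder> holders, int priority, int maxStreams) {
        List<String> preferred = new ArrayList<>();
        List<String> others = new ArrayList<>();
        for (Holder h : holders) {
            if (h.state == State.DECOMMISSIONED) {
                continue;                      // not used as a source in this sketch
            }
            if (h.state == State.DECOMMISSION_INPROGRESS) {
                preferred.add(h.name);         // no write traffic, so preferred
                continue;
            }
            boolean atLimit = h.queuedTransfers >= maxStreams;
            if (atLimit && priority != HIGHEST_PRIORITY) {
                continue;                      // busy node, skipped unless the work is urgent
            }
            others.add(h.name);
        }
        preferred.addAll(others);
        return preferred;
    }

    public static void main(String[] args) {
        List<Holder> holders = Arrays.asList(
            new Holder("a", State.NORMAL, 2),
            new Holder("b", State.DECOMMISSION_INPROGRESS, 5),
            new Holder("c", State.NORMAL, 0),
            new Holder("d", State.DECOMMISSIONED, 0));
        System.out.println(chooseSources(holders, 2, 2)); // prints [b, c]
        System.out.println(chooseSources(holders, 0, 2)); // prints [b, a, c]
    }
}
```

In the second call the work is of the highest priority, so node a is accepted even though it has already reached the limit of 2 queued transfers.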

How the datanode carries out the recovery

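Earlier, the tasks for blocks that need new replicas were added to the DatanodeDescriptor objects obtained from blockManager. When a datanode sends its heartbeat to the namenode over RPC, the namenode, while handling that heartbeat, turns the replication tasks into the corresponding commands and returns them to the datanode.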


// RPC function through which a datanode sends its heartbeat
@Override // DatanodeProtocol
public HeartbeatResponse sendHeartbeat(DatanodeRegistration nodeReg,
    StorageReport[] report, long dnCacheCapacity, long dnCacheUsed,
    int xmitsInProgress, int xceiverCount,
    int failedVolumes, VolumeFailureSummary volumeFailureSummary,
    boolean requestFullBlockReportLease,
    @Nonnull SlowPeerReports slowPeers,
    @Nonnull SlowDiskReports slowDisks) throws IOException {
  checkNNStartup();
  verifyRequest(nodeReg);
  return namesystem.handleHeartbeat(nodeReg, report,
      dnCacheCapacity, dnCacheUsed, xceiverCount, xmitsInProgress,
      failedVolumes, volumeFailureSummary, requestFullBlockReportLease,
      slowPeers, slowDisks);
}

// handleHeartbeat in FSNamesystem: calls DatanodeManager through blockManager to handle the heartbeat
{
  // ...
  DatanodeCommand[] cmds = blockManager.getDatanodeManager().handleHeartbeat(
          nodeReg, reports, getBlockPoolId(), cacheCapacity, cacheUsed,
          xceiverCount, maxTransfer, failedVolumes, volumeFailureSummary,
          slowPeers, slowDisks);
  // ...
}

// handleHeartbeat in DatanodeManager takes out the tasks stored earlier and turns them into a BlockCommand.
{
  // ...
  List<BlockTargetPair> pendingList = nodeinfo.getReplicationCommand(
      numReplicationTasks);
  if (pendingList != null && !pendingList.isEmpty()) {
    cmds.add(new BlockCommand(DatanodeProtocol.DNA_TRANSFER, blockPoolId,
        pendingList));
  }
  // ...
}
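
The hand-off itself is a plain producer/consumer queue per datanode: the reconstruction thread enqueues tasks, and each heartbeat drains up to a limit. Below is a minimal sketch under that reading. ReplicationQueueSketch and Task are hypothetical, and unlike the real getReplicationCommand, which the caller above checks for null, this version returns an empty list when nothing is queued.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy per-datanode task queue drained at heartbeat time; not the Hadoop source.
final class ReplicationQueueSketch {

    /** Hypothetical pairing of a block with the node it should be copied to. */
    static final class Task {
        final String block;
        final String target;
        Task(String block, String target) {
            this.block = block;
            this.target = target;
        }
    }

    private final Queue<Task> replicateBlocks = new ArrayDeque<>();

    /** Called by the scheduling thread when work is assigned to this node. */
    synchronized void addBlockToBeReplicated(String block, String target) {
        replicateBlocks.add(new Task(block, target));
    }

    /** Called while answering a heartbeat: hand out at most maxTransfers tasks. */
    synchronized List<Task> getReplicationCommand(int maxTransfers) {
        List<Task> out = new ArrayList<>();
        while (out.size() < maxTransfers && !replicateBlocks.isEmpty()) {
            out.add(replicateBlocks.poll());
        }
        return out;
    }

    public static void main(String[] args) {
        ReplicationQueueSketch dn = new ReplicationQueueSketch();
        dn.addBlockToBeReplicated("blk_1", "dn2");
        dn.addBlockToBeReplicated("blk_2", "dn3");
        dn.addBlockToBeReplicated("blk_3", "dn4");
        System.out.println(dn.getReplicationCommand(2).size()); // prints 2
        System.out.println(dn.getReplicationCommand(2).size()); // prints 1
    }
}
```

Because tasks only leave the queue when the node reports in, a datanode that never sends another heartbeat never receives work, and the per-heartbeat limit keeps a single response from overloading it.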

extendswind

Reading the source: how a Hadoop cluster handles a datanode going down
