Kafka Server Side -- KafkaController (Part 2) -- Replica State Machine / Partition State Machine
A state machine is commonly used in event handling, where an entity can be in one of several states and each incoming event triggers the corresponding transition action. When the Kafka controller starts its state machines, two constraints apply (a minimal sketch of this ordering follows the list):
1. The partition state machine and the replica state machine need to know every partition and replica in the cluster, so the controller context must be initialized before either state machine is started.
2. A partition consists of multiple replicas, so the partition state machine can only be initialized after all replicas in the cluster have been initialized.
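A minimal, self-contained sketch of that ordering is shown below. The class and method names mirror the real controller, but the bodies here are placeholders, not the KafkaController source:
object ControllerStartupSketch {
  class ReplicaStateMachine   { def startup(): Unit = println("replica state machine started") }
  class PartitionStateMachine { def startup(): Unit = println("partition state machine started") }

  def initializeControllerContext(): Unit =
    println("controller context initialized: brokers, partitions and replicas loaded from ZK")

  def onControllerFailover(): Unit = {
    initializeControllerContext()          // 1. context first: both state machines read from it
    new ReplicaStateMachine().startup()    // 2. replicas before partitions...
    new PartitionStateMachine().startup()  //    ...since a partition's state depends on its replicas
  }

  def main(args: Array[String]): Unit = onControllerFailover()
}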
I. ReplicaStateMachine
ReplicaStateMachine tracks the state of every replica in the cluster: it records which state each replica is currently in and which state transitions are allowed.
A Kafka replica can be in one of the following seven states (a sketch of the allowed transitions follows the list):
1. NewReplica: the controller creates replicas in this state during partition reassignment. In this state the replica can only act as a follower; it is a transient state on the way to becoming a normal replica. Valid previous state: NonExistentReplica.
2. OnlineReplica: once a replica has been created on its assigned partition, it moves to this state. In this state it can act as either the leader or a follower of the partition. Valid previous states: NewReplica, OnlineReplica, or OfflineReplica.
3. OfflineReplica: a replica moves to this state when the broker hosting it goes down. Valid previous states: NewReplica, OnlineReplica, or OfflineReplica.
4. ReplicaDeletionStarted: the state a replica enters when its deletion begins. Valid previous state: OfflineReplica.
5. ReplicaDeletionSuccessful: a replica moves to this state if its deletion completed without errors, meaning its data has been removed from the broker. Valid previous state: ReplicaDeletionStarted.
6. ReplicaDeletionIneligible: a replica moves to this state if its deletion failed. Valid previous state: ReplicaDeletionStarted.
7. NonExistentReplica: a replica moves to this state once it has been deleted successfully. Valid previous state: ReplicaDeletionSuccessful.
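These states and their allowed predecessors can be written down in a few lines of Scala. This is only a sketch of the rules listed above, not the real Kafka types:
object ReplicaStateSketch {
  sealed trait ReplicaState
  case object NonExistentReplica        extends ReplicaState
  case object NewReplica                extends ReplicaState
  case object OnlineReplica             extends ReplicaState
  case object OfflineReplica            extends ReplicaState
  case object ReplicaDeletionStarted    extends ReplicaState
  case object ReplicaDeletionSuccessful extends ReplicaState
  case object ReplicaDeletionIneligible extends ReplicaState

  // Valid previous states for each target state, as listed above
  val validPreviousStates: Map[ReplicaState, Set[ReplicaState]] = Map(
    NewReplica                -> Set(NonExistentReplica),
    OnlineReplica             -> Set(NewReplica, OnlineReplica, OfflineReplica),
    OfflineReplica            -> Set(NewReplica, OnlineReplica, OfflineReplica),
    ReplicaDeletionStarted    -> Set(OfflineReplica),
    ReplicaDeletionSuccessful -> Set(ReplicaDeletionStarted),
    ReplicaDeletionIneligible -> Set(ReplicaDeletionStarted),
    NonExistentReplica        -> Set(ReplicaDeletionSuccessful)
  )

  def isValidTransition(from: ReplicaState, to: ReplicaState): Boolean =
    validPreviousStates(to).contains(from)
}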
ReplicaStateMachine Initialization
The replica state machine's startup entry point is shown below:
// Triggered after a controller (re-)election
def startup() {
  // Initialize the state of every replica (live replicas become OnlineReplica,
  // dead ones become ReplicaDeletionIneligible)
  initializeReplicaState()
  val (onlineReplicas, offlineReplicas) = controllerContext.onlineAndOfflineReplicas
  // Transition the live replicas to OnlineReplica
  handleStateChanges(onlineReplicas.toSeq, OnlineReplica)
  // Transition the dead replicas to OfflineReplica
  handleStateChanges(offlineReplicas.toSeq, OfflineReplica)
}
The startup method first recovers the state of every replica, then calls handleStateChanges() to move live replicas to OnlineReplica and dead ones to OfflineReplica. Let's first look at how the state of all partition replicas is initialized:
/**
 * Initialize the state of every partition's replicas
 */
private def initializeReplicaState() {
  // Iterate over all partitions
  controllerContext.allPartitions.foreach { partition =>
    val replicas = controllerContext.partitionReplicaAssignment(partition)
    replicas.foreach { replicaId =>
      val partitionAndReplica = PartitionAndReplica(partition, replicaId)
      // If the replica is alive, set its state to OnlineReplica
      if (controllerContext.isReplicaOnline(replicaId, partition)) {
        controllerContext.putReplicaState(partitionAndReplica, OnlineReplica)
      } else {
        // Dead replicas are set to ReplicaDeletionIneligible
        controllerContext.putReplicaState(partitionAndReplica, ReplicaDeletionIneligible)
      }
    }
  }
}
Startup then immediately processes the two groups of state changes, bringing the OnlineReplica set online and taking the OfflineReplica set offline.
/**
 * State-change handler of the replica state machine: state changes for multiple
 * replicas are sent to the brokers as batched requests
 */
override def handleStateChanges(replicas: Seq[PartitionAndReplica], targetState: ReplicaState): Unit = {
  if (replicas.nonEmpty) {
    try {
      controllerBrokerRequestBatch.newBatch()
      // Process the state changes, grouped by replica (broker) id
      replicas.groupBy(_.replica).foreach { case (replicaId, replicas) =>
        doHandleStateChanges(replicaId, replicas, targetState)
      }
      // Send the accumulated requests to the brokers
      controllerBrokerRequestBatch.sendRequestsToBrokers(controllerContext.epoch)
    } catch {
      case e: ControllerMovedException =>
        error(s"Controller moved to another broker when moving some replicas to $targetState state", e)
        throw e
      case e: Throwable => error(s"Error while moving some replicas to $targetState state", e)
    }
  }
}
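startup() is not the only caller: other controller events drive the same handleStateChanges() entry point. A hedged sketch of the broker-startup path follows; the context lookup shown is assumed and simplified, this is not the literal KafkaController source:
// Hedged sketch: how a controller event such as broker startup might feed the
// same batch-based handleStateChanges() entry point shown above
def onBrokerStartup(newBrokerIds: Seq[Int]): Unit = {
  // Assumed context lookup: every replica hosted on the newly started brokers
  val replicasOnNewBrokers = controllerContext.replicasOnBrokers(newBrokerIds.toSet).toSeq
  // Moving them to OnlineReplica queues LeaderAndIsr/UpdateMetadata requests in one batch
  replicaStateMachine.handleStateChanges(replicasOnNewBrokers, OnlineReplica)
}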
Replica State Transitions
Transition to NewReplica
case NewReplica =>
  validReplicas.foreach { replica =>
    val partition = replica.topicPartition
    val currentState = controllerContext.replicaState(replica)
    controllerContext.partitionLeadershipInfo.get(partition) match {
      /** The partition's leaderAndIsr information (originally loaded from ZK) exists */
      case Some(leaderIsrAndControllerEpoch) =>
        if (leaderIsrAndControllerEpoch.leaderAndIsr.leader == replicaId) {
          /** A replica in NewReplica state must not be the partition leader */
          val exception = new StateChangeFailedException(s"Replica $replicaId for partition $partition cannot be moved to NewReplica state as it is being requested to become leader")
          logFailedStateChange(replica, currentState, OfflineReplica, exception)
        } else {
          /** Send a LeaderAndIsr request to this replica, and an UpdateMetadata request to all brokers */
          controllerBrokerRequestBatch.addLeaderAndIsrRequestForBrokers(Seq(replicaId),
            replica.topicPartition,
            leaderIsrAndControllerEpoch,
            controllerContext.partitionReplicaAssignment(replica.topicPartition),
            isNew = true)
          logSuccessfulTransition(replicaId, partition, currentState, NewReplica)
          /** Update the replica's state in the ControllerContext */
          controllerContext.putReplicaState(replica, NewReplica)
        }
      case None =>
        /** The replica has no LeaderAndIsr information yet, so wait for the partition leader election to finish */
        logSuccessfulTransition(replicaId, partition, currentState, NewReplica)
        controllerContext.putReplicaState(replica, NewReplica)
    }
  }
The flow above, in brief:
1. Validate the previous state: only a replica in NonExistentReplica state may transition to NewReplica.
2. Look up the partition's LeaderIsrAndControllerEpoch information (cached in the controller context).
3. If that information does not exist, simply set the replica's state to NewReplica and finish (this is the case for a newly created partition, whose replicas have no LeaderAndIsr information yet).
4. If the information exists and the partition's leader is this replica, raise a StateChangeFailedException, because a replica in this state must not be elected leader.
5. If the information exists and the partition's leader is not this replica, add a LeaderAndIsr request for this replica's broker (adding a LeaderAndIsr request also adds an UpdateMetadata request for all brokers).
6. Finally, set the replica's state to NewReplica and finish.
Transition to OnlineReplica
OnlineReplica is the state of a normally working replica; in this state it can be either the leader or a follower. The handling for this transition is as follows:
case OnlineReplica =>
  validReplicas.foreach { replica =>
    val partition = replica.topicPartition
    val currentState = controllerContext.replicaState(replica)
    currentState match {
      case NewReplica =>
        /** NewReplica --> OnlineReplica */
        val assignment = controllerContext.partitionReplicaAssignment(partition)
        /** If the replica is not in the partition's replica assignment, add it (should not happen under normal circumstances) */
        if (!assignment.contains(replicaId)) {
          controllerContext.updatePartitionReplicaAssignment(partition, assignment :+ replicaId)
        }
      case _ =>
        /** OnlineReplica | OfflineReplica --> OnlineReplica */
        controllerContext.partitionLeadershipInfo.get(partition) match {
          case Some(leaderIsrAndControllerEpoch) =>
            /** The partition's LeaderIsrAndControllerEpoch information exists, so update the replica and send the corresponding request */
            controllerBrokerRequestBatch.addLeaderAndIsrRequestForBrokers(Seq(replicaId),
              replica.topicPartition,
              leaderIsrAndControllerEpoch,
              controllerContext.partitionReplicaAssignment(partition), isNew = false)
          case None =>
            /** The partition is not in OnlinePartition state: the broker has not started a log for it and it has no high watermark yet, so do nothing */
        }
    }
    logSuccessfulTransition(replicaId, partition, currentState, OnlineReplica)
    controllerContext.putReplicaState(replica, OnlineReplica)
  }
As the state definitions above show, a replica can move to OnlineReplica from NewReplica, OnlineReplica, or OfflineReplica. The implementation handles two cases:
A. NewReplica --> OnlineReplica
1) Get the partition's replica list from partitionReplicaAssignment in the controller context;
2) If the replica is not in that list, add it;
3) Set the replica's state to OnlineReplica.
B. OnlineReplica | OfflineReplica --> OnlineReplica
1) Get the partition's LeaderAndIsr information from partitionLeadershipInfo in the controller context;
2) If it exists, add a LeaderAndIsr request for the broker hosting this replica;
3) If it does not exist, do nothing;
4) Set the replica's state to OnlineReplica.
Transition to OfflineReplica
case OfflineReplica =>
  validReplicas.foreach { replica =>
    /** Send a StopReplica request to this replica first, to stop it fetching from the leader */
    controllerBrokerRequestBatch.addStopReplicaRequestForBrokers(Seq(replicaId), replica.topicPartition, deletePartition = false)
  }
  /** Split the replicas into two groups: those with leadership info and those without */
  val (replicasWithLeadershipInfo, replicasWithoutLeadershipInfo) = validReplicas.partition { replica =>
    controllerContext.partitionLeadershipInfo.contains(replica.topicPartition)
  }
  /** For replicas with leadership info, the controller removes the replica from the ISR */
  val updatedLeaderIsrAndControllerEpochs = removeReplicasFromIsr(replicaId, replicasWithLeadershipInfo.map(_.topicPartition))
  updatedLeaderIsrAndControllerEpochs.foreach { case (partition, leaderIsrAndControllerEpoch) =>
    if (!controllerContext.isTopicQueuedUpForDeletion(partition.topic)) {
      val recipients = controllerContext.partitionReplicaAssignment(partition).filterNot(_ == replicaId)
      /** Send a LeaderAndIsr request to the partition's other replicas */
      controllerBrokerRequestBatch.addLeaderAndIsrRequestForBrokers(recipients,
        partition,
        leaderIsrAndControllerEpoch,
        controllerContext.partitionReplicaAssignment(partition), isNew = false)
    }
    val replica = PartitionAndReplica(partition, replicaId)
    val currentState = controllerContext.replicaState(replica)
    logSuccessfulTransition(replicaId, partition, currentState, OfflineReplica)
    controllerContext.putReplicaState(replica, OfflineReplica)
  }
  /** For replicas without leadership info, send an UpdateMetadata request to all live brokers */
  replicasWithoutLeadershipInfo.foreach { replica =>
    val currentState = controllerContext.replicaState(replica)
    logSuccessfulTransition(replicaId, replica.topicPartition, currentState, OfflineReplica)
    controllerBrokerRequestBatch.addUpdateMetadataRequestForBrokers(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set(replica.topicPartition))
    controllerContext.putReplicaState(replica, OfflineReplica)
  }
1) Validate the previous state: only a replica in NewReplica, OnlineReplica, or OfflineReplica state may transition to OfflineReplica;
2) Send a StopReplica request (deletePartition = false) to the broker hosting the replica;
3) Split the replicas into two groups: those with leadership info and those without;
4) For those with leadership info, call removeReplicasFromIsr() to remove the replica from the partition's ISR, then send a LeaderAndIsr request to the partition's other replicas (a sketch of the ISR update follows this list);
5) For those without leadership info, send an UpdateMetadata request to all live brokers;
6) Set the replica's state to OfflineReplica.
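removeReplicasFromIsr() itself is not shown in this article; roughly, it rewrites each partition's leaderAndIsr znode with a version-checked update. The sketch below only models the value transformation, with LeaderAndIsrSketch as a stand-in type; the epoch and version bumps are assumptions based on typical controller behaviour, not the real method:
// Self-contained sketch of the ISR-shrinking step; NOT the real removeReplicasFromIsr(),
// whose ZK read / conditional-write loop is omitted here
case class LeaderAndIsrSketch(leader: Int, leaderEpoch: Int, isr: List[Int], zkVersion: Int)

def shrinkIsr(offlineReplicaId: Int, current: LeaderAndIsrSketch): LeaderAndIsrSketch = {
  // If the offline replica was the leader, the partition is left leaderless (-1)
  // until the partition state machine elects a new leader
  val newLeader = if (current.leader == offlineReplicaId) -1 else current.leader
  LeaderAndIsrSketch(
    leader      = newLeader,
    leaderEpoch = current.leaderEpoch + 1,                       // assumed: epoch bumped on every change
    isr         = current.isr.filterNot(_ == offlineReplicaId),  // drop the offline replica from the ISR
    zkVersion   = current.zkVersion + 1                          // assumed: conditional ZK update bumps the version
  )
}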
Transition to ReplicaDeletionStarted
case ReplicaDeletionStarted =>
  validReplicas.foreach { replica =>
    val currentState = controllerContext.replicaState(replica)
    logSuccessfulTransition(replicaId, replica.topicPartition, currentState, ReplicaDeletionStarted)
    controllerContext.putReplicaState(replica, ReplicaDeletionStarted)
    /** Send a StopReplica request to this replica with deletePartition = true */
    controllerBrokerRequestBatch.addStopReplicaRequestForBrokers(Seq(replicaId), replica.topicPartition, deletePartition = true)
  }
This is the state at which replica deletion begins. The logic above, in brief:
1) Validate the previous state: only OfflineReplica may transition to this state;
2) Set the replica's state to ReplicaDeletionStarted;
3) Send a StopReplica request (deletePartition = true) to the replica; on receiving it, the broker deletes the replica's data from physical storage.
Transition to ReplicaDeletionIneligible
case ReplicaDeletionIneligible =>
  validReplicas.foreach { replica =>
    val currentState = controllerContext.replicaState(replica)
    logSuccessfulTransition(replicaId, replica.topicPartition, currentState, ReplicaDeletionIneligible)
    controllerContext.putReplicaState(replica, ReplicaDeletionIneligible)
  }
This is the state of a replica whose deletion failed. The logic above, in brief:
1) Validate the previous state: only ReplicaDeletionStarted may transition to this state;
2) Set the replica's state to ReplicaDeletionIneligible.
Transition to ReplicaDeletionSuccessful
case ReplicaDeletionSuccessful =>
  validReplicas.foreach { replica =>
    val currentState = controllerContext.replicaState(replica)
    logSuccessfulTransition(replicaId, replica.topicPartition, currentState, ReplicaDeletionSuccessful)
    controllerContext.putReplicaState(replica, ReplicaDeletionSuccessful)
  }
This is the state of a replica whose deletion succeeded. The logic above, in brief:
1) Validate the previous state: only ReplicaDeletionStarted may transition to this state;
2) Set the replica's state to ReplicaDeletionSuccessful.
Transition to NonExistentReplica
case NonExistentReplica =>
  validReplicas.foreach { replica =>
    val currentState = controllerContext.replicaState(replica)
    val currentAssignedReplicas = controllerContext.partitionReplicaAssignment(replica.topicPartition)
    // Remove this replica from the controller context and the replica state machine
    controllerContext.updatePartitionReplicaAssignment(replica.topicPartition, currentAssignedReplicas.filterNot(_ == replica.replica))
    logSuccessfulTransition(replicaId, replica.topicPartition, currentState, NonExistentReplica)
    controllerContext.removeReplicaState(replica)
  }
This is the state of a replica that has been completely deleted and no longer exists. The logic above, in brief:
1) Validate the previous state: only ReplicaDeletionSuccessful may transition to this state;
2) Remove the replica from the partition's entry in the controller's partitionReplicaAssignment;
3) Remove the replica from the controller context and the replica state machine. A sketch tying the deletion-related states together follows.
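Putting the deletion-related states together, a replica of a topic queued for deletion is walked through them roughly in the order below. This is a hedged sketch of the handleStateChanges() call sequence; the real driver is the controller's TopicDeletionManager, which is outside the scope of this article:
// Hedged sketch of the deletion sequence for a single replica, expressed as the
// target states passed to handleStateChanges(); error handling omitted
val replica = PartitionAndReplica(topicPartition, replicaId)
replicaStateMachine.handleStateChanges(Seq(replica), OfflineReplica)             // stop fetching, shrink ISR
replicaStateMachine.handleStateChanges(Seq(replica), ReplicaDeletionStarted)     // StopReplica(deletePartition = true)
// ... the broker deletes the local log and replies; on failure the replica would go to ReplicaDeletionIneligible ...
replicaStateMachine.handleStateChanges(Seq(replica), ReplicaDeletionSuccessful)  // deletion acknowledged
replicaStateMachine.handleStateChanges(Seq(replica), NonExistentReplica)         // remove from context and assignment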
II. PartitionStateMachine
PartitionStateMachine implements the state transitions for a topic's partitions. A partition can be in one of four states: NonExistentPartition, NewPartition, OnlinePartition, and OfflinePartition.
The transitions between them are driven by PartitionStateMachine's handleStateChange function, which is the focus of this section (a sketch of the allowed transitions follows).
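This sketch (not the real Kafka types) summarises the valid previous states enforced by the assertValidPreviousStates() calls in the handler:
object PartitionStateSketch {
  sealed trait PartitionState
  case object NonExistentPartition extends PartitionState
  case object NewPartition         extends PartitionState
  case object OnlinePartition      extends PartitionState
  case object OfflinePartition     extends PartitionState

  // Valid previous states for each target state, mirroring the checks in handleStateChange
  val validPreviousStates: Map[PartitionState, Set[PartitionState]] = Map(
    NewPartition         -> Set(NonExistentPartition),
    OnlinePartition      -> Set(NewPartition, OnlinePartition, OfflinePartition),
    OfflinePartition     -> Set(NewPartition, OnlinePartition, OfflinePartition),
    NonExistentPartition -> Set(OfflinePartition)
  )
}
With those rules in mind, here is handleStateChange: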
private def handleStateChange(topic: String, partition: Int, targetState: PartitionState,
                              leaderSelector: PartitionLeaderSelector,
                              callbacks: Callbacks) {
  val topicAndPartition = TopicAndPartition(topic, partition)
  if (!hasStarted.get)
    throw new StateChangeFailedException(("Controller %d epoch %d initiated state change for partition %s to %s failed because " +
      "the partition state machine has not started")
      .format(controllerId, controller.epoch, topicAndPartition, targetState))
  val currState = partitionState.getOrElseUpdate(topicAndPartition, NonExistentPartition)
  try {
    targetState match {
      case NewPartition =>
        // Validate the previous state
        assertValidPreviousStates(topicAndPartition, List(NonExistentPartition), NewPartition)
        // Update partitionReplicaAssignment in the controllerContext
        assignReplicasToPartitions(topic, partition)
        // Update the partition's state
        partitionState.put(topicAndPartition, NewPartition)
        val assignedReplicas = controllerContext.partitionReplicaAssignment(topicAndPartition).mkString(",")
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s with assigned replicas %s"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState,
            assignedReplicas))
      case OnlinePartition =>
        // Validate the previous state
        assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OnlinePartition)
        partitionState(topicAndPartition) match {
          case NewPartition => // NewPartition -> OnlinePartition
            /* 1. Pick the first live replica in partitionReplicaAssignment as the leader; the remaining live replicas form the ISR
             * 2. Persist the leader and ISR to ZK
             * 3. Update partitionLeadershipInfo in the controllerContext
             * 4. Build LeaderAndIsrRequests for the brokers hosting these replicas and hand them to ControllerBrokerRequestBatch
             */
            initializeLeaderAndIsrForPartition(topicAndPartition)
          case OfflinePartition => // OfflinePartition -> OnlinePartition
            /* 1. Elect a new leader with the given leaderSelector, usually the OfflinePartitionLeaderSelector here
             * 2. Persist the leader and ISR to ZK
             * 3. Update partitionLeadershipInfo in the controllerContext
             * 4. Build LeaderAndIsrRequests for the brokers hosting these replicas and hand them to ControllerBrokerRequestBatch
             */
            electLeaderForPartition(topic, partition, leaderSelector)
          case OnlinePartition => // OnlinePartition -> OnlinePartition
            /* 1. Elect a new leader with the given leaderSelector, usually the ReassignedPartitionLeaderSelector here
             * 2. Persist the leader and ISR to ZK
             * 3. Update partitionLeadershipInfo in the controllerContext
             * 4. Build LeaderAndIsrRequests for the brokers hosting these replicas and hand them to ControllerBrokerRequestBatch
             */
            electLeaderForPartition(topic, partition, leaderSelector)
          case _ => // should never come here since illegal previous states are checked above
        }
        // Update the partition's state
        partitionState.put(topicAndPartition, OnlinePartition)
        val leader = controllerContext.partitionLeadershipInfo(topicAndPartition).leaderAndIsr.leader
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s from %s to %s with leader %d"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState, leader))
      case OfflinePartition =>
        // Validate the previous state
        assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OfflinePartition)
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState))
        // Update the partition's state
        partitionState.put(topicAndPartition, OfflinePartition)
      case NonExistentPartition =>
        // Validate the previous state
        assertValidPreviousStates(topicAndPartition, List(OfflinePartition), NonExistentPartition)
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState))
        // Update the partition's state
        partitionState.put(topicAndPartition, NonExistentPartition)
        // post: partition state is deleted from all brokers and zookeeper
    }
  } catch {
    case t: Throwable =>
      stateChangeLogger.error("Controller %d epoch %d initiated state change for partition %s from %s to %s failed"
        .format(controllerId, controller.epoch, topicAndPartition, currState, targetState), t)
  }
}
KafkaController PartitionLeaderSelector
When a partition's state changes, in particular on the OfflinePartition -> OnlinePartition and OnlinePartition -> OnlinePartition transitions, a PartitionLeaderSelector is needed to decide the new leader and ISR. Five selectors are currently supported: NoOpLeaderSelector, OfflinePartitionLeaderSelector, ReassignedPartitionLeaderSelector, PreferredReplicaPartitionLeaderSelector, and ControlledShutdownLeaderSelector. They all implement the common trait sketched below.
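The shape of that trait can be read off the implementations that follow: given a partition and its current LeaderAndIsr, selectLeader returns the new LeaderAndIsr plus the replicas that should receive the resulting LeaderAndIsr request.
trait PartitionLeaderSelector {
  /**
   * @param topicAndPartition   the partition whose leader needs to be (re-)elected
   * @param currentLeaderAndIsr the partition's current leader and ISR
   * @return the new LeaderAndIsr and the replicas that should receive the LeaderAndIsr request
   */
  def selectLeader(topicAndPartition: TopicAndPartition, currentLeaderAndIsr: LeaderAndIsr): (LeaderAndIsr, Seq[Int])
}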
1. NoOpLeaderSelector
/**
 * Essentially does nothing. Returns the current leader and ISR, and the current
 * set of replicas assigned to a given topic/partition.
 */
class NoOpLeaderSelector(controllerContext: ControllerContext) extends PartitionLeaderSelector with Logging {
  this.logIdent = "[NoOpLeaderSelector]: "
  def selectLeader(topicAndPartition: TopicAndPartition, currentLeaderAndIsr: LeaderAndIsr): (LeaderAndIsr, Seq[Int]) = {
    warn("I should never have been asked to perform leader election, returning the current LeaderAndIsr and replica assignment.")
    (currentLeaderAndIsr, controllerContext.partitionReplicaAssignment(topicAndPartition))
  }
}
It does essentially nothing: it simply returns the current LeaderAndIsr together with the set of replicas assigned to the given topic/partition.
2. OfflinePartitionLeaderSelector
class OfflinePartitionLeaderSelector(controllerContext: ControllerContext, config: KafkaConfig)
  extends PartitionLeaderSelector with Logging {
  this.logIdent = "[OfflinePartitionLeaderSelector]: "
  def selectLeader(topicAndPartition: TopicAndPartition, currentLeaderAndIsr: LeaderAndIsr): (LeaderAndIsr, Seq[Int]) = {
    controllerContext.partitionReplicaAssignment.get(topicAndPartition) match {
      case Some(assignedReplicas) =>
        val liveAssignedReplicas = assignedReplicas.filter(r => controllerContext.liveBrokerIds.contains(r))
        val liveBrokersInIsr = currentLeaderAndIsr.isr.filter(r => controllerContext.liveBrokerIds.contains(r))
        val currentLeaderEpoch = currentLeaderAndIsr.leaderEpoch
        val currentLeaderIsrZkPathVersion = currentLeaderAndIsr.zkVersion
        val newLeaderAndIsr = liveBrokersInIsr.isEmpty match {
          case true => // every broker in the ISR is offline, so the leader must come from the assigned replicas
            if (!LogConfig.fromProps(config.props.props, AdminUtils.fetchTopicConfig(controllerContext.zkClient,
              topicAndPartition.topic)).uncleanLeaderElectionEnable) {
              throw new NoReplicaOnlineException(("No broker in ISR for partition " +
                "%s is alive. Live brokers are: [%s],".format(topicAndPartition, controllerContext.liveBrokerIds)) +
                " ISR brokers are: [%s]".format(currentLeaderAndIsr.isr.mkString(",")))
            }
            debug("No broker in ISR is alive for %s. Pick the leader from the alive assigned replicas: %s"
              .format(topicAndPartition, liveAssignedReplicas.mkString(",")))
            liveAssignedReplicas.isEmpty match {
              case true => // every assigned replica is offline as well, so this topic/partition is unavailable
                throw new NoReplicaOnlineException(("No replica for partition " +
                  "%s is alive. Live brokers are: [%s],".format(topicAndPartition, controllerContext.liveBrokerIds)) +
                  " Assigned replicas are: [%s]".format(assignedReplicas))
              case false => // some assigned replicas are still online
                ControllerStats.uncleanLeaderElectionRate.mark()
                val newLeader = liveAssignedReplicas.head // take the first one as the leader
                warn("No broker in ISR is alive for %s. Elect leader %d from live brokers %s. There's potential data loss."
                  .format(topicAndPartition, newLeader, liveAssignedReplicas.mkString(",")))
                new LeaderAndIsr(newLeader, currentLeaderEpoch + 1, List(newLeader), currentLeaderIsrZkPathVersion + 1)
            }
          case false => // some brokers in the ISR are still online
            val liveReplicasInIsr = liveAssignedReplicas.filter(r => liveBrokersInIsr.contains(r))
            val newLeader = liveReplicasInIsr.head // pick the first live replica
            debug("Some broker in ISR is alive for %s. Select %d from ISR %s to be the leader."
              .format(topicAndPartition, newLeader, liveBrokersInIsr.mkString(",")))
            new LeaderAndIsr(newLeader, currentLeaderEpoch + 1, liveBrokersInIsr.toList, currentLeaderIsrZkPathVersion + 1)
        }
        info("Selected new leader and ISR %s for offline partition %s".format(newLeaderAndIsr.toString(), topicAndPartition))
        (newLeaderAndIsr, liveAssignedReplicas)
      case None =>
        throw new NoReplicaOnlineException("Partition %s doesn't have replicas assigned to it".format(topicAndPartition))
    }
  }
}
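In short: if any ISR member is still alive, the new leader is the first live assigned replica that is in the ISR (a clean election); if the entire ISR is dead, a leader is taken from the live assigned replicas only when unclean leader election is enabled for the topic, at the cost of potential data loss. A condensed, self-contained sketch of that decision, using plain collections instead of the controller context (not the real implementation):
// Condensed sketch of the OfflinePartitionLeaderSelector decision; inputs stand in
// for the controller context and the topic-level unclean election config
def pickOfflineLeader(assignedReplicas: Seq[Int],
                      isr: Seq[Int],
                      liveBrokers: Set[Int],
                      uncleanElectionEnabled: Boolean): Either[String, (Int, List[Int])] = {
  val liveAssigned = assignedReplicas.filter(liveBrokers.contains)
  val liveIsr      = isr.filter(liveBrokers.contains)
  if (liveIsr.nonEmpty) {
    // Clean election: first live assigned replica that is still in the ISR
    Right((liveAssigned.filter(liveIsr.contains).head, liveIsr.toList))
  } else if (!uncleanElectionEnabled) {
    Left("no live broker in the ISR and unclean leader election is disabled")
  } else if (liveAssigned.isEmpty) {
    Left("no live replica at all for this partition")
  } else {
    // Unclean election: the first live assigned replica becomes leader; potential data loss
    Right((liveAssigned.head, List(liveAssigned.head)))
  }
}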
3. ReassignedPartitionLeaderSelector
class ReassignedPartitionLeaderSelector(controllerContext: ControllerContext) extends PartitionLeaderSelector with Logging {
  this.logIdent = "[ReassignedPartitionLeaderSelector]: "
  def selectLeader(topicAndPartition: TopicAndPartition, currentLeaderAndIsr: LeaderAndIsr): (LeaderAndIsr, Seq[Int]) = {
    // The replicas the partition is being reassigned to
    val reassignedInSyncReplicas = controllerContext.partitionsBeingReassigned(topicAndPartition).newReplicas
    val currentLeaderEpoch = currentLeaderAndIsr.leaderEpoch
    val currentLeaderIsrZkPathVersion = currentLeaderAndIsr.zkVersion
    // Among the reassigned replicas, keep those whose broker is alive and that are already in the ISR
    val aliveReassignedInSyncReplicas = reassignedInSyncReplicas.filter(r => controllerContext.liveBrokerIds.contains(r) &&
      currentLeaderAndIsr.isr.contains(r))
    val newLeaderOpt = aliveReassignedInSyncReplicas.headOption
    newLeaderOpt match { // if such a replica exists, pick it as the leader
      case Some(newLeader) => (new LeaderAndIsr(newLeader, currentLeaderEpoch + 1, currentLeaderAndIsr.isr,
        currentLeaderIsrZkPathVersion + 1), reassignedInSyncReplicas)
      case None => // otherwise the reassignment fails
        reassignedInSyncReplicas.size match {
          case 0 =>
            throw new NoReplicaOnlineException("List of reassigned replicas for partition " +
              " %s is empty. Current leader and ISR: [%s]".format(topicAndPartition, currentLeaderAndIsr))
          case _ =>
            throw new NoReplicaOnlineException("None of the reassigned replicas for partition " +
              "%s are in-sync with the leader. Current leader and ISR: [%s]".format(topicAndPartition, currentLeaderAndIsr))
        }
    }
  }
}
4. PreferredReplicaPartitionLeaderSelector
class PreferredReplicaPartitionLeaderSelector(controllerContext: ControllerContext) extends PartitionLeaderSelector
  with Logging {
  this.logIdent = "[PreferredReplicaPartitionLeaderSelector]: "
  def selectLeader(topicAndPartition: TopicAndPartition, currentLeaderAndIsr: LeaderAndIsr): (LeaderAndIsr, Seq[Int]) = {
    val assignedReplicas = controllerContext.partitionReplicaAssignment(topicAndPartition)
    // The preferred replica is the first replica in the assignment
    val preferredReplica = assignedReplicas.head
    // check if preferred replica is the current leader
    val currentLeader = controllerContext.partitionLeadershipInfo(topicAndPartition).leaderAndIsr.leader
    if (currentLeader == preferredReplica) { // already the leader, nothing to do
      throw new LeaderElectionNotNeededException("Preferred replica %d is already the current leader for partition %s"
        .format(preferredReplica, topicAndPartition))
    } else {
      info("Current leader %d for partition %s is not the preferred replica.".format(currentLeader, topicAndPartition) +
        " Trigerring preferred replica leader election")
      // If the preferred replica's broker is alive and the replica is in the ISR, make it the leader again;
      // this is mainly used to rebalance leadership across brokers
      if (controllerContext.liveBrokerIds.contains(preferredReplica) && currentLeaderAndIsr.isr.contains(preferredReplica)) {
        (new LeaderAndIsr(preferredReplica, currentLeaderAndIsr.leaderEpoch + 1, currentLeaderAndIsr.isr,
          currentLeaderAndIsr.zkVersion + 1), assignedReplicas)
      } else {
        throw new StateChangeFailedException("Preferred replica %d for partition ".format(preferredReplica) +
          "%s is either not alive or not in the isr. Current leader and ISR: [%s]".format(topicAndPartition, currentLeaderAndIsr))
      }
    }
  }
}
5. ControlledShutdownLeaderSelector
class ControlledShutdownLeaderSelector(controllerContext: ControllerContext)
  extends PartitionLeaderSelector
  with Logging {
  this.logIdent = "[ControlledShutdownLeaderSelector]: "
  def selectLeader(topicAndPartition: TopicAndPartition, currentLeaderAndIsr: LeaderAndIsr): (LeaderAndIsr, Seq[Int]) = {
    val currentLeaderEpoch = currentLeaderAndIsr.leaderEpoch
    val currentLeaderIsrZkPathVersion = currentLeaderAndIsr.zkVersion
    val currentLeader = currentLeaderAndIsr.leader
    val assignedReplicas = controllerContext.partitionReplicaAssignment(topicAndPartition)
    val liveOrShuttingDownBrokerIds = controllerContext.liveOrShuttingDownBrokerIds
    // Keep the replicas whose brokers are live (or shutting down)
    val liveAssignedReplicas = assignedReplicas.filter(r => liveOrShuttingDownBrokerIds.contains(r))
    // New ISR: the current ISR minus the brokers that are shutting down
    val newIsr = currentLeaderAndIsr.isr.filter(brokerId => !controllerContext.shuttingDownBrokerIds.contains(brokerId))
    val newLeaderOpt = newIsr.headOption
    newLeaderOpt match {
      case Some(newLeader) => // a candidate exists in the trimmed ISR, so make it the leader
        debug("Partition %s : current leader = %d, new leader = %d"
          .format(topicAndPartition, currentLeader, newLeader))
        (LeaderAndIsr(newLeader, currentLeaderEpoch + 1, newIsr, currentLeaderIsrZkPathVersion + 1),
          liveAssignedReplicas)
      case None =>
        throw new StateChangeFailedException(("No other replicas in ISR %s for %s besides" +
          " shutting down brokers %s").format(currentLeaderAndIsr.isr.mkString(","), topicAndPartition, controllerContext.shuttingDownBrokerIds.mkString(",")))
    }
  }
}