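Kafka partition migration: procedure and internals
After new brokers are added to a Kafka cluster, existing data is not moved to them automatically. The partitions have to be migrated (in other words, spread out) onto the new brokers by hand.
1 Migration procedure
Kafka ships with a script for migrating data, and the whole migration can be done with it.
1.1 Generate the partition assignment plan
1.1.1 Create the JSON file topic-to-move.json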
{
  "topics": [{"topic": "testTopic"}],
  "version": 1
}
1.1.2 Generate the partition assignment plan
Run:
$ ./kafka-reassign-partitions --zookeeper ${zk_address} --topics-to-move-json-file topic-to-move.json --broker-list "140,141" --generate
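Here ${zk_address} is the ZooKeeper address that Kafka connects to, and "140,141" are the IDs of the target brokers that the topic will be migrated to.
The output looks like this: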
Current partition replica assignment
{"version":1,"partitions":[{"topic":"testTopic","partition":1,"replicas":[61,62]},{"topic":"testTopic","partition":0,"replicas":[62,61]}]}
Proposed partition reassignment configuration
{"version":1,"partitions":[{"topic":"testTopic","partition":1,"replicas":[140,141]},{"topic":"testTopic","partition":0,"replicas":[141,140]}]}
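The first block is the current partition assignment, and the second is the assignment after a successful migration, so the second JSON is the one to migrate with. You can also write a JSON file of the same shape by hand and migrate with that instead. Here we use the one the script generated: save the second JSON block as expand-cluster-reassignment.json.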
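Because a hand-written file only has to match the shape of the generated one, it can also be produced programmatically. The following is a minimal Scala sketch under that assumption. ReassignmentJson and render are hypothetical names used for illustration and are not part of Kafka; the topic and broker IDs are the ones from the example above.

```scala
object ReassignmentJson {
  // Render a plan in the same shape as the "Proposed partition reassignment
  // configuration" printed by the tool.
  // Each entry is ((topic, partition), target replica broker IDs in order).
  def render(plan: Seq[((String, Int), Seq[Int])]): String = {
    val entries = plan.map { case ((topic, partition), replicas) =>
      val replicaList = replicas.mkString(",")
      s"""{"topic":"$topic","partition":$partition,"replicas":[$replicaList]}"""
    }
    val partitions = entries.mkString(",")
    s"""{"version":1,"partitions":[$partitions]}"""
  }
}
```

Calling render with partition 1 mapped to Seq(140, 141) and partition 0 mapped to Seq(141, 140) should reproduce the proposed configuration shown above.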
1.2 Execute the migration
$ ./kafka-reassign-partitions --zookeeper ${zk_address} --reassignment-json-file expand-cluster-reassignment.json --execute
1.3 Check migration progress
$ ./kafka-reassign-partitions --zookeeper ${zk_address} --reassignment-json-file expand-cluster-reassignment.json --verify
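2 Source code analysis
2.1 Script entry point
kafka-reassign-partitions.sh invokes kafka.admin.ReassignPartitionsCommand (ReassignPartitionsCommand.scala). Any exception thrown while it runs is printed to standard output, so if the script fails, this is the code to read to locate the problem.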
def main(args: Array[String]): Unit = {
  // omitted
  try {
    if(opts.options.has(opts.verifyOpt)) // verify
      verifyAssignment(zkUtils, opts)
    else if(opts.options.has(opts.generateOpt)) // generate the JSON
      generateAssignment(zkUtils, opts)
    else if (opts.options.has(opts.executeOpt)) // execute the migration
      executeAssignment(zkUtils, opts)
  } catch {
    case e: Throwable =>
      println("Partitions reassignment failed due to " + e.getMessage)
      println(Utils.stackTrace(e))
  } finally {
    val zkClient = zkUtils.zkClient
    if (zkClient != null)
      zkClient.close()
  }
}
2.1.1 executeAssignment
executeAssignment performs the migration.
def executeAssignment(zkUtils: ZkUtils, reassignmentJsonString: String) {
  // omitted: validation, de-duplication, and so on
  // get the current partition assignment
  val currentPartitionReplicaAssignment = zkUtils.getReplicaAssignmentForTopics(partitionsToBeReassigned.map(_._1.topic))
  println("Current partition replica assignment\n\n%s\n\nSave this to use as the --reassignment-json-file option during rollback"
    .format(zkUtils.formatAsReassignmentJson(currentPartitionReplicaAssignment)))
  // Key step: start the migration by writing the JSON to ZooKeeper, specifically to "/admin/reassign_partitions"
  // start the reassignment
  if(reassignPartitionsCommand.reassignPartitions())
    println("Successfully started reassignment of partitions %s".format(zkUtils.formatAsReassignmentJson(partitionsToBeReassigned.toMap)))
  else
    println("Failed to reassign partitions %s".format(partitionsToBeReassigned))
}
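Once executeAssignment has written the JSON to ZooKeeper, the controller broker, which watches that node, sees the data change and starts the migration.
2.1.2 verifyAssignment
verifyAssignment checks the progress of the migration.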
def verifyAssignment(zkUtils: ZkUtils, opts: ReassignPartitionsCommandOptions) {
  // omitted
  println("Status of partition reassignment:")
  val reassignedPartitionsStatus = checkIfReassignmentSucceeded(zkUtils, partitionsToBeReassigned) // key call
  reassignedPartitionsStatus.foreach { partition =>
    partition._2 match {
      case ReassignmentCompleted =>
        println("Reassignment of partition %s completed successfully".format(partition._1))
      case ReassignmentFailed =>
        println("Reassignment of partition %s failed".format(partition._1))
      case ReassignmentInProgress =>
        println("Reassignment of partition %s is still in progress".format(partition._1))
    }
  }
}
private def checkIfReassignmentSucceeded(zkUtils: ZkUtils, partitionsToBeReassigned: Map[TopicAndPartition, Seq[Int]])
  :Map[TopicAndPartition, ReassignmentStatus] = {
  // read the pending reassignments from the ZooKeeper node "/admin/reassign_partitions"
  val partitionsBeingReassigned = zkUtils.getPartitionsBeingReassigned().mapValues(_.newReplicas)
  partitionsToBeReassigned.map { topicAndPartition =>
    (topicAndPartition._1, checkIfPartitionReassignmentSucceeded(zkUtils,topicAndPartition._1,
      topicAndPartition._2, partitionsToBeReassigned, partitionsBeingReassigned))
  }
}
def checkIfPartitionReassignmentSucceeded(zkUtils: ZkUtils, topicAndPartition: TopicAndPartition,
                                          reassignedReplicas: Seq[Int],
                                          partitionsToBeReassigned: Map[TopicAndPartition, Seq[Int]],
                                          partitionsBeingReassigned: Map[TopicAndPartition, Seq[Int]]): ReassignmentStatus = {
  val newReplicas = partitionsToBeReassigned(topicAndPartition)
  partitionsBeingReassigned.get(topicAndPartition) match {
    case Some(partition) => ReassignmentInProgress // the partition is still listed, so the migration is still running
    case None => // otherwise it may have succeeded
      // check if the current replica assignment matches the expected one after reassignment
      val assignedReplicas = zkUtils.getReplicasForPartition(topicAndPartition.topic, topicAndPartition.partition)
      // Key check: the entry is gone, so the replica list must now equal the expected one; otherwise report failure
      if(assignedReplicas == newReplicas)
        ReassignmentCompleted
      else { // a commonly seen error
        println(("ERROR: Assigned replicas (%s) don't match the list of replicas for reassignment (%s)" +
          " for partition %s").format(assignedReplicas.mkString(","), newReplicas.mkString(","), topicAndPartition))
        ReassignmentFailed
      }
  }
}
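As the source shows, completion is decided by whether the partition is still listed under "/admin/reassign_partitions". The migration counts as successful only when the entry is gone and the resulting AR (assigned replicas) matches the expected list.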
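The decision made by checkIfPartitionReassignmentSucceeded can be restated as a small pure function. This is a simplified model rather than Kafka code: VerifyModel, status, pending and current are hypothetical names, and the ZooKeeper reads are replaced by plain collections.

```scala
object VerifyModel {
  sealed trait Status
  case object Completed extends Status
  case object Failed extends Status
  case object InProgress extends Status

  // pending: partitions still listed under /admin/reassign_partitions (modelled as a set)
  // current: the replica list currently recorded for each partition
  def status(tp: (String, Int),
             expected: Seq[Int],
             pending: Set[(String, Int)],
             current: Map[(String, Int), Seq[Int]]): Status =
    if (pending.contains(tp)) InProgress
    // Seq equality is order-sensitive, so (0,1) does not match (1,0)
    else if (current.getOrElse(tp, Seq.empty[Int]) == expected) Completed
    else Failed
}
```

Because the comparison is between two Seq values, a replica list with the same brokers in a different order is reported as failed, not completed.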
Note: in real migrations I have several times run into an error like the following, which is the message printed by the code above:
don't match the list of replicas for reassignment
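The code shows why: the partition's entry under "/admin/reassign_partitions" no longer exists, yet the topic's current AR differs from the expected list. Usually the broker side hit an error during the migration and removed the entry without migrating anything. The controller's log on the broker has the specific cause.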
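2.2 How the broker performs the migration
2.2.1 Entry point
The controller broker is responsible for partition migration. When a broker is elected controller, it starts watching the "/admin/reassign_partitions" node for changes.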
private def registerReassignedPartitionsListener() = {
  zkUtils.zkClient.subscribeDataChanges(ZkUtils.ReassignPartitionsPath, partitionReassignedListener)
}
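So most of the migration work happens in partitionReassignedListener. When the controller sees the data under "/admin/reassign_partitions" change, it reads the content, skips partitions whose topic is being deleted, and migrates the rest.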
class PartitionsReassignedListener(controller: KafkaController) extends IZkDataListener with Logging {
  this.logIdent = "[PartitionsReassignedListener on " + controller.config.brokerId + "]: "
  val zkUtils = controller.controllerContext.zkUtils
  val controllerContext = controller.controllerContext
  @throws(classOf[Exception])
  def handleDataChange(dataPath: String, data: Object) {
    // read the data in "/admin/reassign_partitions" and wrap it as [TopicAndPartition, replicas]
    val partitionsReassignmentData = zkUtils.parsePartitionReassignmentData(data.toString)
    val partitionsToBeReassigned = inLock(controllerContext.controllerLock) {
      partitionsReassignmentData.filterNot(p => controllerContext.partitionsBeingReassigned.contains(p._1))
    }
    partitionsToBeReassigned.foreach { partitionToBeReassigned => // migrate each partition
      inLock(controllerContext.controllerLock) {
        if(controller.deleteTopicManager.isTopicQueuedUpForDeletion(partitionToBeReassigned._1.topic)) { // skip topics that are being deleted
          error("Skipping reassignment of partition %s for topic %s since it is currently being deleted"
            .format(partitionToBeReassigned._1, partitionToBeReassigned._1.topic))
          controller.removePartitionFromReassignedPartitions(partitionToBeReassigned._1)
        } else {
          // wrap the target replica list in a ReassignedPartitionsContext
          val context = new ReassignedPartitionsContext(partitionToBeReassigned._2)
          // Key step: everything above only reads; the real migration starts here
          controller.initiateReassignReplicasForTopicPartition(partitionToBeReassigned._1, context)
        }
      }
    }
  }
  // remaining members omitted
}
2.2.2 KafkaController#initiateReassignReplicasForTopicPartition()
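initiateReassignReplicasForTopicPartition starts the migration, but most of what it does is validation. It also sets a watch on the partition's ISR, that is, on changes to the "/brokers/topics/{topic}/partitions/{partition}/state" node, which matters for how the migration works.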
def initiateReassignReplicasForTopicPartition(topicAndPartition: TopicAndPartition,
                                              reassignedPartitionContext: ReassignedPartitionsContext) {
  val newReplicas = reassignedPartitionContext.newReplicas
  val topic = topicAndPartition.topic
  val partition = topicAndPartition.partition
  val aliveNewReplicas = newReplicas.filter(r => controllerContext.liveBrokerIds.contains(r)) // keep only replicas whose broker is alive
  try {
    // Key step: read the partition's AR from controllerContext
    val assignedReplicasOpt = controllerContext.partitionReplicaAssignment.get(topicAndPartition)
    assignedReplicasOpt match {
      case Some(assignedReplicas) =>
        // If the AR in controllerContext equals the target list, throw. Both are Seq, so equal means same elements in the same order.
        if(assignedReplicas == newReplicas) {
          throw new KafkaException("Partition %s to be reassigned is already assigned to replicas".format(topicAndPartition) +
            " %s. Ignoring request for partition reassignment".format(newReplicas.mkString(",")))
        } else {
          if(aliveNewReplicas == newReplicas) { // migrate only if the brokers of all target replicas are alive
            watchIsrChangesForReassignedPartition(topic, partition, reassignedPartitionContext) // key step, analyzed later
            controllerContext.partitionsBeingReassigned.put(topicAndPartition, reassignedPartitionContext)
            deleteTopicManager.markTopicIneligibleForDeletion(Set(topic))
            onPartitionReassignment(topicAndPartition, reassignedPartitionContext) // key step: this does the actual migration
          } else { // some target brokers are not alive, so throw
            // some replica in RAR is not alive. Fail partition reassignment
            throw new KafkaException("Only %s replicas out of the new set of replicas".format(aliveNewReplicas.mkString(",")) +
              " %s for partition %s to be reassigned are alive. ".format(newReplicas.mkString(","), topicAndPartition) +
              "Failing partition reassignment")
          }
        }
      case None => throw new KafkaException("Attempt to reassign partition %s that doesn't exist"
        .format(topicAndPartition))
    }
  } catch {
    case e: Throwable => error("Error completing reassignment of partition %s".format(topicAndPartition), e)
      // remove the partition from the admin path to unblock the admin client
      // Key step: once anything goes wrong and an exception is thrown, the partition's entry
      // in "/admin/reassign_partitions" is cleared, or the node itself is deleted
      removePartitionFromReassignedPartitions(topicAndPartition)
  }
}
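The checks in initiateReassignReplicasForTopicPartition boil down to three rejection cases. The sketch below restates them as a pure function. It is a simplified model under stated assumptions, not Kafka code: InitiateChecks and validate are hypothetical names, the exceptions are replaced by Left values, and the messages are paraphrased.

```scala
object InitiateChecks {
  // currentAr: the partition's AR as known to the controller, None if the partition is unknown
  // target: the requested replica list; liveBrokers: IDs of brokers that are currently alive
  // Returns Right(()) when the reassignment may start, Left(reason) when it would be rejected.
  def validate(currentAr: Option[Seq[Int]],
               target: Seq[Int],
               liveBrokers: Set[Int]): Either[String, Unit] =
    currentAr match {
      case None =>
        Left("partition does not exist")
      case Some(ar) if ar == target =>
        // Seq equality: same brokers in the same order
        Left("partition is already assigned to the target replicas")
      case Some(_) =>
        val alive = target.filter(r => liveBrokers.contains(r))
        if (alive == target) Right(())
        else {
          val aliveText = alive.mkString(",")
          val targetText = target.mkString(",")
          Left("only " + aliveText + " out of " + targetText + " are alive")
        }
    }
}
```

In the real controller every rejection ends in removePartitionFromReassignedPartitions, which is why the verify step later finds the entry gone while the AR is unchanged.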
2.2.3 KafkaController#onPartitionReassignment
The actual migration steps are carried out in onPartitionReassignment.
def onPartitionReassignment(topicAndPartition: TopicAndPartition, reassignedPartitionContext: ReassignedPartitionsContext) {
  val reassignedReplicas = reassignedPartitionContext.newReplicas
  // Check whether all target replicas are in the ISR, i.e. whether the target list is a subset of the ISR.
  // For example, with ISR [1,2,3] and target [2,3] this is true.
  areReplicasInIsr(topicAndPartition.topic, topicAndPartition.partition, reassignedReplicas) match {
    // If false, the AR is set to the union of the AR and the target list, which is like raising the partition's replica count.
    // For example, with AR [1,2,3] and target [2,3,4], the AR becomes [1,2,3,4], as if the partition had gained a replica.
    case false =>
      val newReplicasNotInOldReplicaList = reassignedReplicas.toSet -- controllerContext.partitionReplicaAssignment(topicAndPartition).toSet
      // union of the AR and the target replica list
      val newAndOldReplicas = (reassignedPartitionContext.newReplicas ++ controllerContext.partitionReplicaAssignment(topicAndPartition)).toSet
      // write the new AR to ZooKeeper and the cache
      updateAssignedReplicasForPartition(topicAndPartition, newAndOldReplicas.toSeq)
      // send LeaderAndIsr
      updateLeaderEpochAndSendRequest(topicAndPartition, controllerContext.partitionReplicaAssignment(topicAndPartition),
        newAndOldReplicas.toSeq)
      // bring the newly added replicas up so they start fetching from the leader
      startNewReplicasForReassignedPartition(topicAndPartition, reassignedPartitionContext, newReplicasNotInOldReplicaList)
      info("Waiting for new replicas %s for partition %s being ".format(reassignedReplicas.mkString(","), topicAndPartition) +
        "reassigned to catch up with the leader")
    // If true, remove the replicas in the AR that are not in the target list.
    // For example, with target [2,3] and AR [1,2,3], replica 1 is taken offline.
    case true =>
      //4. Wait until all replicas in RAR are in sync with the leader.
      val oldReplicas = controllerContext.partitionReplicaAssignment(topicAndPartition).toSet -- reassignedReplicas.toSet
      //5. replicas in RAR -> OnlineReplica
      reassignedReplicas.foreach { replica =>
        replicaStateMachine.handleStateChanges(Set(new PartitionAndReplica(topicAndPartition.topic, topicAndPartition.partition,
          replica)), OnlineReplica)
      }
      // if the current leader is not in the target list, elect a new leader
      moveReassignedPartitionLeaderIfRequired(topicAndPartition, reassignedPartitionContext)
      // The rest retires the removed replicas: update the AR, update ZooKeeper, send metadata requests, and so on.
      // It also removes the partition's data from "/admin/reassign_partitions".
      stopOldReplicasOfReassignedPartition(topicAndPartition, reassignedPartitionContext, oldReplicas)
      updateAssignedReplicasForPartition(topicAndPartition, reassignedReplicas)
      removePartitionFromReassignedPartitions(topicAndPartition)
      info("Removed partition %s from the list of reassigned partitions in zookeeper".format(topicAndPartition))
      controllerContext.partitionsBeingReassigned.remove(topicAndPartition)
      sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set(topicAndPartition))
      deleteTopicManager.resumeDeletionForTopics(Set(topicAndPartition.topic))
  }
}
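At this point a question arises: when the target list is not a subset of the ISR, replicas are only added, and there is no step that removes any. The key is that initiateReassignReplicasForTopicPartition set a watch on the partition's ISR by calling watchIsrChangesForReassignedPartition.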
2.2.4 watchIsrChangesForReassignedPartition
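This method watches the "/brokers/topics/{topic}/partitions/{partition}/state" node for changes. Once every replica in the target list has caught up with the leader, the replicas that are not in the target list are taken offline and the migration completes.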
// handleDataChange of ReassignedPartitionsIsrChangeListener; topic, partition and reassignedReplicas belong to that listener
def handleDataChange(dataPath: String, data: Object) {
  inLock(controllerContext.controllerLock) {
    debug("Reassigned partitions isr change listener fired for path %s with children %s".format(dataPath, data))
    val topicAndPartition = TopicAndPartition(topic, partition)
    try {
      controllerContext.partitionsBeingReassigned.get(topicAndPartition) match {
        case Some(reassignedPartitionContext) =>
          val newLeaderAndIsrOpt = zkUtils.getLeaderAndIsrForPartition(topic, partition)
          newLeaderAndIsrOpt match {
            case Some(leaderAndIsr) => // check if new replicas have joined ISR
              val caughtUpReplicas = reassignedReplicas & leaderAndIsr.isr.toSet // intersection of the ISR and the target list
              // All target replicas have caught up, so call KafkaController#onPartitionReassignment again.
              // This time it takes the true branch and takes the replicas outside the target list offline.
              if(caughtUpReplicas == reassignedReplicas) {
                // resume the partition reassignment process
                info("%d/%d replicas have caught up with the leader for partition %s being reassigned."
                  .format(caughtUpReplicas.size, reassignedReplicas.size, topicAndPartition) +
                  "Resuming partition reassignment")
                controller.onPartitionReassignment(topicAndPartition, reassignedPartitionContext)
              }
              else {
                info("%d/%d replicas have caught up with the leader for partition %s being reassigned."
                  .format(caughtUpReplicas.size, reassignedReplicas.size, topicAndPartition) +
                  "Replica(s) %s still need to catch up".format((reassignedReplicas -- leaderAndIsr.isr.toSet).mkString(",")))
              }
            case None => error("Error handling reassignment of partition %s to replicas %s as it was never created"
              .format(topicAndPartition, reassignedReplicas.mkString(",")))
          }
        case None =>
      }
    } catch {
      case e: Throwable => error("Error while handling partition reassignment", e)
    }
  }
}
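The catch-up test in this listener is plain set arithmetic: the intersection of the target replicas with the ISR must equal the target replicas. A minimal sketch of that check, with hypothetical names (IsrCatchUp, caughtUp, lagging, canResume) that do not appear in Kafka:

```scala
object IsrCatchUp {
  // Target replicas that are already in the ISR (set intersection).
  def caughtUp(target: Set[Int], isr: Set[Int]): Set[Int] = target & isr

  // Target replicas that still have to catch up (set difference).
  def lagging(target: Set[Int], isr: Set[Int]): Set[Int] = target -- isr

  // The reassignment can resume only when every target replica is in the ISR.
  def canResume(target: Set[Int], isr: Set[Int]): Boolean =
    caughtUp(target, isr) == target
}
```

With target Set(2, 3) and ISR Set(1, 2), replica 3 is still lagging and canResume is false; once the ISR becomes Set(1, 2, 3), canResume turns true even though replica 1 is not part of the target.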
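3 Summary of how the broker handles a migration
From the analysis above, the controller broker watches the "/admin/reassign_partitions" node. When it finds a migration task, it first expands the partition's AR. For example, if the AR was [1, 2] and the target is [2, 3], the AR is first expanded to [1, 2, 3] while the ISR is monitored.
Once replica-2 and replica-3 have both caught up, that is, both are in the ISR, the new replica-3 is in sync with the leader. Replica-1 can then be removed, leaving [2, 3] as the final result. In short, a migration adds replicas first and removes them afterwards.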
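The add-then-remove sequence can be traced with a toy model. This is an illustrative sketch under stated assumptions, not the controller's implementation: ReassignmentSim and its methods are hypothetical names, and the expanded AR keeps the old replicas first for readability, whereas the controller builds it from a Set with no guaranteed order.

```scala
object ReassignmentSim {
  // Phase 1: expand the AR to the union of the old replicas and the target replicas.
  def expand(ar: Seq[Int], target: Seq[Int]): Seq[Int] = (ar ++ target).distinct

  // Phase 2 is allowed only once every target replica is in the ISR;
  // the AR then becomes exactly the target list.
  def shrink(target: Seq[Int], isr: Set[Int]): Option[Seq[Int]] =
    if (target.forall(r => isr.contains(r))) Some(target) else None

  // Replicas that get taken offline when phase 2 runs.
  def retired(expanded: Seq[Int], target: Seq[Int]): Seq[Int] =
    expanded.filterNot(r => target.contains(r))
}
```

For AR Seq(1, 2) and target Seq(2, 3), expand gives Seq(1, 2, 3). While the ISR is Set(1, 2), shrink returns None and the migration stays in progress. Once the ISR is Set(1, 2, 3), shrink returns Some(Seq(2, 3)) and retired yields Seq(1).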
4 Problems you may run into
4.1 Error message
Assigned replicas (0,1) don't match the list of replicas for reassignment (1,0) for partition [testTopic,0]
Reassignment of partition [testTopic,0] failed
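As explained in 2.1.2, this error is reported when the partition's entry under "/admin/reassign_partitions" has already been removed but the AR differs from the target list. Check the controller's log to see whether the controller threw an exception during the migration.
4.2 The migration stays in progress and never completes
A migration completes only once every replica in the target list has caught up with the leader. If the target replicas never catch up, it never completes. Look at "/brokers/topics/{topic}/partitions/{partition}/state" in ZooKeeper and check whether the target replicas are in the isr. If they are not, the replicas being migrated have not finished fetching data from the leader. Why they have not depends on the case: the data volume may be large and fetching simply takes time, or something else may be wrong, such as brokers in the cluster being down.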
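When checking the state node by hand, it helps to compare its isr field against the target list mechanically. The sketch below assumes the node's JSON has already been fetched as a string and that it carries an "isr" array of broker IDs. IsrFromState, parseIsr and missing are hypothetical names, and the regex is a shortcut for illustration in place of a real JSON parser.

```scala
object IsrFromState {
  // Matches the isr array, e.g. "isr":[1,2], and captures the comma-separated IDs.
  private val IsrPattern = """isr"\s*:\s*\[([^\]]*)\]""".r

  // Extract the ISR broker IDs from the state JSON; empty set if no isr field is found.
  def parseIsr(stateJson: String): Set[Int] =
    IsrPattern.findFirstMatchIn(stateJson) match {
      case Some(m) =>
        m.group(1).split(",").map(_.trim).filter(_.nonEmpty).map(_.toInt).toSet
      case None => Set.empty[Int]
    }

  // Target replicas that are not yet in the ISR, in target order.
  def missing(target: Seq[Int], stateJson: String): Seq[Int] = {
    val isr = parseIsr(stateJson)
    target.filterNot(r => isr.contains(r))
  }
}
```

A non-empty result from missing names the replicas that are still catching up, which are the ones whose brokers and fetch progress are worth inspecting first.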