kafka partiton迁移方法与原理

在kafka中增加新的节点后，数据是不会自动迁移到新的节点上的，需要我们手动将数据迁移(或者成为打散)到新的节点上

1 迁移方法

kafka为我们提供了用于数据迁移的脚本。我们可以用这些脚本完成数据的迁移。

1.1 生成partiton分配表

1.1.1 创建json文件topic-to-move.json

{

  "topics": [{"topic": "testTopic"}],

  "version": 1

}

1.1.2 生成partiton分配表

运行

$ ./kafka-reassign-partitions --zookeeper ${zk_address} --topics-to-move-json-file  topic-to-move.json --broker-list "140,141" --generate

其中${zk_address}是kafka所连接zk地址,"140,141"为该topic将要迁移到的目标节点。

生成结果如下:

Current partition replica assignment

{"version":1,"partitions":[{"topic":"testTopic","partition":1,"replicas":[61,62]},{"topic":"testTopic","partition":0,"replicas":[62,61]}]}

Proposed partition reassignment configuration

{"version":1,"partitions":[{"topic":"testTopic","partition":1,"replicas":[140,141]},{"topic":"testTopic","partition":0,"replicas":[141,140]}]}

其中上半部分是当前的partiton分布情况，下半部分是迁移成功后的partion分布情况。那么我们就用下部分的json进行迁移。另外，也可以自己构造个类似的json文件，同样可以进行迁移。这里我们使用脚本为我们生成的json文件，将下半部分的json保存为expand-cluster-reassignment.json

1.2 执行迁移

$ ./kafka-reassign-partitions --zookeeper ${zk_address}  --reassignment-json-file expand-cluster-reassignment.json --execute

1.3 查看迁移进度

$ ./kafka-reassign-partitions --zookeeper ${zk_address} --reassignment-json-file expand-cluster-reassignment.json --verify

2 源码分析

2.1 脚本调用

kafka-reassign-partitions.sh会调用kafka.admin.ReassignPartitionsCommand.scala，在代码运行过程中抛出的任何异常都会通过标准输出打印出来，所以如果执行该脚本报错，可以看下这块代码来定位问题。

def main(args: Array[String]): Unit = {

    // 略

    try {

      if(opts.options.has(opts.verifyOpt)) // 校验

        verifyAssignment(zkUtils, opts)

      else if(opts.options.has(opts.generateOpt)) // 生成json

        generateAssignment(zkUtils, opts)

      else if (opts.options.has(opts.executeOpt)) // 执行迁移

        executeAssignment(zkUtils, opts)

    } catch {

      case e: Throwable =>

        println("Partitions reassignment failed due to " + e.getMessage)

        println(Utils.stackTrace(e))

    } finally {

      val zkClient = zkUtils.zkClient

      if (zkClient != null)

        zkClient.close()

    }

2.1.1 executeAssignment

executeAssignment 用于执行迁移。

  def executeAssignment(zkUtils: ZkUtils,reassignmentJsonString: String){

    // 略，做一些校验和去重等工作

    // 获取当前的partition分布情况

    zkUtils.getReplicaAssignmentForTopics(partitionsToBeReassigned.map(_._1.topic))

    println("Current partition replica assignment\n\n%s\n\nSave this to use as the --reassignment-json-file option during rollback"

      .format(zkUtils.formatAsReassignmentJson(currentPartitionReplicaAssignment)))

      // 重点，执行迁移，z即将json写到zk上，准确的说是写到"/admin/reassign_partitions"下

    // start the reassignment

    if(reassignPartitionsCommand.reassignPartitions())

      println("Successfully started reassignment of partitions %s".format(zkUtils.formatAsReassignmentJson(partitionsToBeReassigned.toMap)))

    else

      println("Failed to reassign partitions %s".format(partitionsToBeReassigned))

  }

executeAssignment将json写到zk上后，brokerwatch到节点数据变化就开始进行迁移了

2.1.2 verifyAssignment

verifyAssignment用于校验迁移进度

def verifyAssignment(zkUtils: ZkUtils, opts: ReassignPartitionsCommandOptions) {

    // 略

    println("Status of partition reassignment:")

    val reassignedPartitionsStatus = checkIfReassignmentSucceeded(zkUtils, partitionsToBeReassigned) // 重点

    reassignedPartitionsStatus.foreach { partition =>

      partition._2 match {

        case ReassignmentCompleted =>

          println("Reassignment of partition %s completed successfully".format(partition._1))

        case ReassignmentFailed =>

          println("Reassignment of partition %s failed".format(partition._1))

        case ReassignmentInProgress =>

          println("Reassignment of partition %s is still in progress".format(partition._1))

      }

    }

  }

  private def checkIfReassignmentSucceeded(zkUtils: ZkUtils, partitionsToBeReassigned: Map[TopicAndPartition, Seq[Int]])

  :Map[TopicAndPartition, ReassignmentStatus] = {

    val partitionsBeingReassigned = zkUtils.getPartitionsBeingReassigned().mapValues(_.newReplicas) // 从zk节点"/admin/reassign_partitions"读取迁移信息

    partitionsToBeReassigned.map { topicAndPartition =>

      (topicAndPartition._1, checkIfPartitionReassignmentSucceeded(zkUtils,topicAndPartition._1,

        topicAndPartition._2, partitionsToBeReassigned, partitionsBeingReassigned))

    }

  }

  def checkIfPartitionReassignmentSucceeded(zkUtils: ZkUtils, topicAndPartition: TopicAndPartition,

                                            reassignedReplicas: Seq[Int],

                                            partitionsToBeReassigned: Map[TopicAndPartition, Seq[Int]],

                                            partitionsBeingReassigned: Map[TopicAndPartition, Seq[Int]]): ReassignmentStatus = {

    val newReplicas = partitionsToBeReassigned(topicAndPartition)

    partitionsBeingReassigned.get(topicAndPartition) match {

      case Some(partition) => ReassignmentInProgress // 如果tp对应的数据存在则说明还在迁移

      case None =>  // 否则可能是成功了

        // check if the current replica assignment matches the expected one after reassignment

        val assignedReplicas = zkUtils.getReplicasForPartition(topicAndPartition.topic, topicAndPartition.partition)

        if(assignedReplicas == newReplicas) // 重点，如果节点不存在了，但是迁移后的replica列表和预期不一致，则报错

          ReassignmentCompleted

        else { // 经常遇到的报错

          println(("ERROR: Assigned replicas (%s) don't match the list of replicas for reassignment (%s)" +

            " for partition %s").format(assignedReplicas.mkString(","), newReplicas.mkString(","), topicAndPartition))

          ReassignmentFailed

        }

    }

  }

从源码中可以看出判断迁移是否完成是根据"/admin/reassign_partitions"是否存在来判断。如果节点不存在了，并且迁移后的AR和预期一致，则才算成功。

注意：在实际迁移中遇到过好几次报错类似如下，即上面代码的打印的日志

don't match the list of replicas for reassignment

从代码中可以看到出现这个错误的原因是"/admin/reassign_partitions"不存在了，但是当前topic的AR和预期的不一致。这个原因一般是由于迁移的时候broker那边报错了，然后将节点删除了，并没有进行迁移。具体原因需要看下broker的controller的日志。

2.2 broker如何进行迁移

2.2.1 入口

broker的controller节点负责partiton的迁移工作，在broker被选为controller节点的时候会watch "/admin/reassign_partitions" 节点的变化。

private def registerReassignedPartitionsListener() = {

    zkUtils.zkClient.subscribeDataChanges(ZkUtils.ReassignPartitionsPath, partitionReassignedListener)

  }

所以迁移的工作主要在partitionReassignedListener中，controller watch到"/admin/reassign_partitions"节点数据变化后，会读取该数据内容，并跳过正在删除的partiton，进行迁移工作。

class PartitionsReassignedListener(controller: KafkaController) extends IZkDataListener with Logging {

  this.logIdent = "[PartitionsReassignedListener on " + controller.config.brokerId + "]: "

  val zkUtils = controller.controllerContext.zkUtils

  val controllerContext = controller.controllerContext

  @throws(classOf[Exception])

  def handleDataChange(dataPath: String, data: Object) {

    val partitionsReassignmentData = zkUtils.parsePartitionReassignmentData(data.toString) // 读取"/admin/reassign_partitions"节点内的数据，封装成[TopicAndPartition, relipcs] 的形式

    val partitionsToBeReassigned = inLock(controllerContext.controllerLock) {

      partitionsReassignmentData.filterNot(p => controllerContext.partitionsBeingReassigned.contains(p._1))

    }

    partitionsToBeReassigned.foreach { partitionToBeReassigned => // 迁移每一个partiton

      inLock(controllerContext.controllerLock) {

        if(controller.deleteTopicManager.isTopicQueuedUpForDeletion(partitionToBeReassigned._1.topic)) { // 正在删除的则跳过

          error("Skipping reassignment of partition %s for topic %s since it is currently being deleted"

            .format(partitionToBeReassigned._1, partitionToBeReassigned._1.topic))

          controller.removePartitionFromReassignedPartitions(partitionToBeReassigned._1)

        } else {

          val context = new ReassignedPartitionsContext(partitionToBeReassigned._2) // 将目标replica列表封装成ReassignedPartitionsContext

          controller.initiateReassignReplicasForTopicPartition(partitionToBeReassigned._1, context) // 重点，以上都是读取，这里才是真正的迁移工作

        }

      }

    }

  }

2.2.2 KafkaController#initiateReassignReplicasForTopicPartition()

initiateReassignReplicasForTopicPartition进行迁移工作。但是他主要做一些校验工作，该方法中会watch该partiton的ISR变化情况，即监听“/brokers/topics/{topic}/partitions/{partiton}/state” 节点的变化, 这和迁移的原理有关系。

def initiateReassignReplicasForTopicPartition(topicAndPartition: TopicAndPartition,

                                        reassignedPartitionContext: ReassignedPartitionsContext) {

    val newReplicas = reassignedPartitionContext.newReplicas

    val topic = topicAndPartition.topic

    val partition = topicAndPartition.partition

    val aliveNewReplicas = newReplicas.filter(r => controllerContext.liveBrokerIds.contains(r)) // 根据broker的存活进行过滤

    try {

        // 重点, 从controllerContext中读取partition的AR

      val assignedReplicasOpt = controllerContext.partitionReplicaAssignment.get(topicAndPartition)

      assignedReplicasOpt match {

        case Some(assignedReplicas) =>

        // 如果ontrollerContext中AR和目标迁移列表相同，则抛异常。注意他们都是Seq类型，相同是指顺序也相同。

          if(assignedReplicas == newReplicas) {

            throw new KafkaException("Partition %s to be reassigned is already assigned to replicas".format(topicAndPartition) +

              " %s. Ignoring request for partition reassignment".format(newReplicas.mkString(",")))

          } else {

            if(aliveNewReplicas == newReplicas) { // 目标列表里的replic的broker都存活才能进行迁移

              watchIsrChangesForReassignedPartition(topic, partition, reassignedPartitionContext) // 重点，后面会分析

              controllerContext.partitionsBeingReassigned.put(topicAndPartition, reassignedPartitionContext)

              deleteTopicManager.markTopicIneligibleForDeletion(Set(topic))

              onPartitionReassignment(topicAndPartition, reassignedPartitionContext) // 重点，真正干活的，做迁移工作的

            } else { // 有不存活的，则抛出异常。

              // some replica in RAR is not alive. Fail partition reassignment

              throw new KafkaException("Only %s replicas out of the new set of replicas".format(aliveNewReplicas.mkString(",")) +

                " %s for partition %s to be reassigned are alive. ".format(newReplicas.mkString(","), topicAndPartition) +

                "Failing partition reassignment")

            }

          }

        case None => throw new KafkaException("Attempt to reassign partition %s that doesn't exist"

          .format(topicAndPartition))

      }

    } catch {

      case e: Throwable => error("Error completing reassignment of partition %s".format(topicAndPartition), e)

      // remove the partition from the admin path to unblock the admin client

      removePartitionFromReassignedPartitions(topicAndPartition) // 重点，一点迁移出问题，抛出异常，则会将"/admin/reassign_partitions"里的相应信息清空或者删除节点

    }

  }

2.2.3 KafkaController#onPartitionReassignment

真正的迁移步骤是在onPartitionReassignment完成的

def onPartitionReassignment(topicAndPartition: TopicAndPartition, reassignedPartitionContext: ReassignedPartitionsContext) {

    val reassignedReplicas = reassignedPartitionContext.newReplicas

    areReplicasInIsr(topicAndPartition.topic, topicAndPartition.partition, reassignedReplicas) match { // 判断目标迁移的replic是不是都在ISR中，都在的意思是目标迁移的replic列表是ISR的一个子集。例如ISR列表是[1，2，3]，目标迁移列表是[2,3]，则为true。

    // 如果是false，则会将AR和目标replic列表做个并集，类似于增加该partiton的副本数。例如AR是[1,2,3], 目标是[2,3,4]，则会将partiton的AR设置为[1, 2, 3, 4], 相当于partiton增加了一个新的replic。

      case false =>

        val newReplicasNotInOldReplicaList = reassignedReplicas.toSet -- controllerContext.partitionReplicaAssignment(topicAndPartition).toSet

        val newAndOldReplicas = (reassignedPartitionContext.newReplicas ++ controllerContext.partitionReplicaAssignment(topicAndPartition)).toSet // AR和目标replica列表求并集

        // 更新AR到zk和缓存

        updateAssignedReplicasForPartition(topicAndPartition, newAndOldReplicas.toSeq)

        // 发送LeaderAndIsr

        updateLeaderEpochAndSendRequest(topicAndPartition, controllerContext.partitionReplicaAssignment(topicAndPartition),

          newAndOldReplicas.toSeq)

        // 让新增加的replic上线，使其开始从leader同步数据。

        startNewReplicasForReassignedPartition(topicAndPartition, reassignedPartitionContext, newReplicasNotInOldReplicaList)

        info("Waiting for new replicas %s for partition %s being ".format(reassignedReplicas.mkString(","), topicAndPartition) +

          "reassigned to catch up with the leader")

          // 如果是true，则将不再目标列表中的AR中的replic去掉，例如目标迁移是[2,3],AR是[1,2,3], 则将1下线

      case true =>

        //4. Wait until all replicas in RAR are in sync with the leader.

        val oldReplicas = controllerContext.partitionReplicaAssignment(topicAndPartition).toSet -- reassignedReplicas.toSet

        //5. replicas in RAR -> OnlineReplica

        reassignedReplicas.foreach { replica =>

          replicaStateMachine.handleStateChanges(Set(new PartitionAndReplica(topicAndPartition.topic, topicAndPartition.partition,

            replica)), OnlineReplica)

        }

        // 目标列表中没有leader则需要重新选下leader

        moveReassignedPartitionLeaderIfRequired(topicAndPartition, reassignedPartitionContext)

       // 以下是停用掉下掉的replica的一些工作，例如更新AR，更新zk，发送meta请求等。另外删除"/admin/reassign_partitions"节点数据

        stopOldReplicasOfReassignedPartition(topicAndPartition, reassignedPartitionContext, oldReplicas)

        updateAssignedReplicasForPartition(topicAndPartition, reassignedReplicas)

        removePartitionFromReassignedPartitions(topicAndPartition)

        info("Removed partition %s from the list of reassigned partitions in zookeeper".format(topicAndPartition))

        controllerContext.partitionsBeingReassigned.remove(topicAndPartition)

        sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq, Set(topicAndPartition))

        deleteTopicManager.resumeDeletionForTopics(Set(topicAndPartition.topic))

    }

  }

可以到这里会有个疑问，目标迁移列表不是ISR的子集，就只是增加了replic，并没有去掉replic的步骤啊。这里的关键是在initiateReassignReplicasForTopicPartition中watch了partiton的ISR情况，即调用了watchIsrChangesForReassignedPartition方法。

2.2.4 watchIsrChangesForReassignedPartition

该方法监听/brokers/topics/{topic}/partitions/{partiton}/state” 节点的变化。如果目标迁移列表已经跟上leader了，那么就会将不在目标迁移列表里的replic下线，完成迁移

def handleDataChange(dataPath: String, data: Object) {

    inLock(controllerContext.controllerLock) {

      debug("Reassigned partitions isr change listener fired for path %s with children %s".format(dataPath, data))

      val topicAndPartition = TopicAndPartition(topic, partition)

      try {

        controllerContext.partitionsBeingReassigned.get(topicAndPartition) match {

          case Some(reassignedPartitionContext) =>

            val newLeaderAndIsrOpt = zkUtils.getLeaderAndIsrForPartition(topic, partition)

            newLeaderAndIsrOpt match {

              case Some(leaderAndIsr) => // check if new replicas have joined ISR

                val caughtUpReplicas = reassignedReplicas & leaderAndIsr.isr.toSet // 求并集ISR和目标迁移列表的并集

                if(caughtUpReplicas == reassignedReplicas) { // 目标迁移列表全部跟上了，则再次调用KafkaController#onPartitionReassignment，这次会走true那个判断分支了，会将不再目标replic列表中的replic下线。

                  // resume the partition reassignment process

                  info("%d/%d replicas have caught up with the leader for partition %s being reassigned."

                    .format(caughtUpReplicas.size, reassignedReplicas.size, topicAndPartition) +

                    "Resuming partition reassignment")

                  controller.onPartitionReassignment(topicAndPartition, reassignedPartitionContext)

                }

                else {

                  info("%d/%d replicas have caught up with the leader for partition %s being reassigned."

                    .format(caughtUpReplicas.size, reassignedReplicas.size, topicAndPartition) +

                    "Replica(s) %s still need to catch up".format((reassignedReplicas -- leaderAndIsr.isr.toSet).mkString(",")))

                }

              case None => error("Error handling reassignment of partition %s to replicas %s as it was never created"

                .format(topicAndPartition, reassignedReplicas.mkString(",")))

            }

          case None =>

        }

      } catch {

        case e: Throwable => error("Error while handling partition reassignment", e)

      }

    }

  }

3 broker 处理迁移的思路总结

从以上分析我们可以看出，broker会watch "/admin/reassign_partitions"节点。当发现有迁移任务的时候，会将partiton的AR进行扩展，例如原先partiton的AR是[1, 2], 现在要迁移到[2, 3],那么partiton会先将AR扩展到[1, 2, 3]，并监控ISR的变化。

当replica-2和replica-3都跟上后，即在ISR中的时候，表明新的repica-3已经和leader数据同步了。这个时候就可以将replica-1剔除了，最后得到迁移结果是[2, 3]。即迁移是一个先增加再减少的过程。

4 可能遇到的问题

4.1 报错

Assigned replicas (0,1) don't match the list of replicas for reassignment (1,0) for partition [testTopic,0] Reassignment of partition [testTopic,0] failed

在2.1.2中已经说明了，该错误是由于"/admin/reassign_partitions"节点已经被删除了，但是AR和目标迁移列表不相同报的错，一般需要看下controller的日志，看下controller在迁移过程中是不是抛出了异常。

4.2 迁移一直在进行中，不能完成

迁移需要等目标迁移列表中的replic都跟上了leader才能完成，目前迁移列表一直跟不上，那么就不会完成。可以看下zk中“/brokers/topics/{topic}/partitions/{partiton}/state”，注意下目标迁移列表是不是在isr中，如果不在说明要迁移的replic还没有完成从leader拉取数据。具体为甚么没有拉取成功，可能是数据量比较大，拉取需要一定的时间；也可能是其他原因比如集群宕机了等，需要具体分析下