The Kafka APIs reflect the services a Kafka broker server can provide.
The broker server mainly interacts with producers, consumers, and the controller, so understanding these APIs covers essentially all of the broker's behavior.

handleOffsetRequest

Serves offset lookup requests, e.g. querying the earliest or latest offset, or the offsets before a given timestamp.

    try {
      // ensure leader exists
      // Check that this broker holds the leader replica, since only the leader may answer offset requests;
      // if not, an exception is thrown
      val localReplica = if(!offsetRequest.isFromDebuggingClient)
        replicaManager.getLeaderReplicaIfLocal(topicAndPartition.topic, topicAndPartition.partition)
      else
        replicaManager.getReplicaOrException(topicAndPartition.topic, topicAndPartition.partition)
      val offsets = {
        val allOffsets = fetchOffsets(replicaManager.logManager, // fetch the list of offsets
                                      topicAndPartition,
                                      partitionOffsetRequestInfo.time,
                                      partitionOffsetRequestInfo.maxNumOffsets)
        if (!offsetRequest.isFromOrdinaryClient) {
          allOffsets
        } else {
          val hw = localReplica.highWatermark.messageOffset
          if (allOffsets.exists(_ > hw)) // filter out offsets beyond the HW, since those should not be visible to clients
            hw +: allOffsets.dropWhile(_ > hw)
          else
            allOffsets
        }
      }
      (topicAndPartition, PartitionOffsetsResponse(ErrorMapping.NoError, offsets))
    } catch {
      // NOTE: UnknownTopicOrPartitionException and NotLeaderForPartitionException are special cased since these error messages
      // are typically transient and there is no value in logging the entire stack trace for the same
      case utpe: UnknownTopicOrPartitionException =>
        warn("Offset request with correlation id %d from client %s on partition %s failed due to %s".format(
          offsetRequest.correlationId, offsetRequest.clientId, topicAndPartition, utpe.getMessage))
        (topicAndPartition, PartitionOffsetsResponse(ErrorMapping.codeFor(utpe.getClass.asInstanceOf[Class[Throwable]]), Nil) )
      case nle: NotLeaderForPartitionException =>
        warn("Offset request with correlation id %d from client %s on partition %s failed due to %s".format(
          offsetRequest.correlationId, offsetRequest.clientId, topicAndPartition, nle.getMessage))
        (topicAndPartition, PartitionOffsetsResponse(ErrorMapping.codeFor(nle.getClass.asInstanceOf[Class[Throwable]]), Nil) )
      case e: Throwable =>
        warn("Error while responding to offset request", e)
        (topicAndPartition, PartitionOffsetsResponse(ErrorMapping.codeFor(e.getClass.asInstanceOf[Class[Throwable]]), Nil) )
    }

As the code shows, when the topic/partition is not found, the partition has no local leader, or any other exception occurs, the returned offsets are Nil.
On the client side, where the latest offset is commonly used to compute spout lag, this is why the lag can come out negative. A defensive way to handle that case is sketched below.
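
A minimal sketch of a defensive client-side lag computation, assuming a hypothetical `spoutLag` helper fed with the broker's offset list (possibly empty on the error paths above) and a locally tracked consumed offset; the names are illustrative, not from the Kafka codebase:

    object LagCheck {
      // Guard against the empty-offset case before computing lag.
      def spoutLag(latestOffsets: Seq[Long], consumedOffset: Long): Option[Long] =
        latestOffsets.headOption match {
          case Some(latest) if latest >= consumedOffset => Some(latest - consumedOffset)
          case Some(_)                                  => Some(0L) // stale reading; clamp instead of going negative
          case None                                     => None     // broker returned Nil (error path above)
        }

      def main(args: Array[String]): Unit = {
        println(spoutLag(Seq(100L), 42L)) // Some(58)
        println(spoutLag(Nil, 42L))       // None, rather than a bogus negative lag
      }
    }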

fetchOffsets then delegates to fetchOffsetsBefore to do the actual offset lookup:

    def fetchOffsetsBefore(log: Log, timestamp: Long, maxNumOffsets: Int): Seq[Long] = {
      val segsArray = log.logSegments.toArray // take all segments
      var offsetTimeArray: Array[(Long, Long)] = null
      if(segsArray.last.size > 0) // check whether the newest segment, i.e. the one currently being written, has data (Segment.size is the byte size of the segment's log)
        offsetTimeArray = new Array[(Long, Long)](segsArray.length + 1)
      else
        offsetTimeArray = new Array[(Long, Long)](segsArray.length)

      for(i <- 0 until segsArray.length)
        offsetTimeArray(i) = (segsArray(i).baseOffset, segsArray(i).lastModified) // for each segment, record (baseOffset, last modified time)
      if(segsArray.last.size > 0)
        offsetTimeArray(segsArray.length) = (log.logEndOffset, SystemTime.milliseconds) // the newest segment is handled differently: log.logEndOffset is used here, which is a bit tricky, because this last entry is only hit when asking for the latest offset

      var startIndex = -1
      timestamp match {
        case OffsetRequest.LatestTime =>
          startIndex = offsetTimeArray.length - 1 // latest: effectively log.logEndOffset
        case OffsetRequest.EarliestTime =>
          startIndex = 0 // earliest: the first segment's baseOffset
        case _ => // look up the offsets for a specific timestamp
          var isFound = false
          debug("Offset time array = " + offsetTimeArray.foreach(o => "%d, %d".format(o._1, o._2)))
          startIndex = offsetTimeArray.length - 1
          while (startIndex >= 0 && !isFound) { // walk backwards from the last segment
            if (offsetTimeArray(startIndex)._2 <= timestamp) // find the first segment whose timestamp is <= the requested timestamp
              isFound = true
            else
              startIndex -= 1
          }
      }

      val retSize = maxNumOffsets.min(startIndex + 1) // how many offsets to return
      val ret = new Array[Long](retSize)
      for(j <- 0 until retSize) {
        ret(j) = offsetTimeArray(startIndex)._1 // return the baseOffsets of the found segment and of the segments before it
        startIndex -= 1
      }
      // ensure that the returned seq is in descending order of offsets
      ret.toSeq.sortBy(- _)
    }
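
To make the selection logic concrete, here is a minimal, self-contained sketch that mimics the timestamp loop above on a made-up segment list; the segment layout and timestamps are invented for illustration only:

    object OffsetsBeforeExample {
      // (baseOffset, lastModified) per segment, oldest first, mirroring offsetTimeArray above.
      val offsetTimeArray = Array((0L, 1000L), (500L, 2000L), (900L, 3000L))

      def offsetsBefore(timestamp: Long, maxNumOffsets: Int): Seq[Long] = {
        // walk backwards to the first segment whose lastModified <= timestamp
        var startIndex = offsetTimeArray.length - 1
        while (startIndex >= 0 && offsetTimeArray(startIndex)._2 > timestamp) startIndex -= 1
        val retSize = maxNumOffsets.min(startIndex + 1)
        (0 until retSize).map(j => offsetTimeArray(startIndex - j)._1) // baseOffsets, newest first
      }

      def main(args: Array[String]): Unit = {
        println(offsetsBefore(2500L, 2)) // Vector(500, 0): the segment modified at 2000 qualifies, 3000 does not
        println(offsetsBefore(500L, 2))  // Vector(): no segment is old enough for this timestamp
      }
    }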

handleProducerOrOffsetCommitRequest

This handles producer requests, i.e. writing data.
The name looks a bit odd: what does it have to do with offset commits? For Kafka's high-level consumer, consumer offsets are written into a Kafka topic, so an OffsetCommitRequest is really just a special kind of producer request.
And indeed, the handler converts it into a producer request via producerRequestFromOffsetCommit. A sketch of that mapping follows.
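
A minimal sketch of that idea, assuming a simplified encoding in which the key identifies (group, topic, partition) and the value carries the committed offset; the helper names and the string encoding here are illustrative, not the real OffsetManager wire format:

    object OffsetCommitAsProduce {
      case class Message(key: String, value: String)

      // Turn one committed offset into a message destined for the offsets topic,
      // which the broker can then append like any other produce request.
      def toOffsetsTopicMessage(group: String, topic: String, partition: Int, offset: Long): (String, Message) =
        ("__consumer_offsets", Message(s"$group/$topic/$partition", offset.toString))

      def main(args: Array[String]): Unit = {
        println(toOffsetsTopicMessage("my-group", "clicks", 0, 12345L))
      }
    }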

The main call is appendToLocalLog; its core logic:

    val partitionOpt = replicaManager.getPartition(topicAndPartition.topic, topicAndPartition.partition) // look up the partition; if not found, throw
    val info = partitionOpt match {
      case Some(partition) =>
        partition.appendMessagesToLeader(messages.asInstanceOf[ByteBufferMessageSet], producerRequest.requiredAcks) // append the data
      case None => throw new UnknownTopicOrPartitionException("Partition %s doesn't exist on %d"
        .format(topicAndPartition, brokerId))
    }

Partition.appendMessagesToLeader

    def appendMessagesToLeader(messages: ByteBufferMessageSet, requiredAcks: Int=0) = {
      inReadLock(leaderIsrUpdateLock) {
        val leaderReplicaOpt = leaderReplicaIfLocal() // is the leader replica local?
        leaderReplicaOpt match {
          case Some(leaderReplica) =>
            val log = leaderReplica.log.get // the replica's log
            val minIsr = log.config.minInSyncReplicas // configured minimum ISR size
            val inSyncSize = inSyncReplicas.size // actual current ISR size

            // Avoid writing to leader if there are not enough insync replicas to make it safe
            if (inSyncSize < minIsr && requiredAcks == -1) {
              throw new NotEnoughReplicasException("Number of insync replicas for partition [%s,%d] is [%d], below required minimum [%d]"
                .format(topic,partitionId,minIsr,inSyncSize))
            }

            val info = log.append(messages, assignOffsets = true) // append the messages to the log
            // New data has arrived, so delayed fetch requests need to be triggered: a consumer fetch blocks once it
            // has reached the log end offset, so it must be unblocked here
            // probably unblock some follower fetch requests since log end offset has been updated
            replicaManager.unblockDelayedFetchRequests(new TopicAndPartition(this.topic, this.partitionId))
            // we may need to increment high watermark since ISR could be down to 1
            maybeIncrementLeaderHW(leaderReplica) // advance the HW
            info
          case None => // no local leader, usually because the leadership has just migrated
            throw new NotLeaderForPartitionException("Leader not local for partition [%s,%d] on broker %d"
              .format(topic, partitionId, localBrokerId))
        }
      }
    }

How a produce request is completed depends on the configured acks setting (a client-side configuration sketch follows this list):

acks = 0: no failover handling at all; the producer fires and forgets.
acks = 1: the request returns as soon as the write to the leader replica succeeds; the other replicas catch up through their fetchers, so replication is asynchronous.
This carries a risk of data loss: if the leader dies before its data has been replicated, that data is gone.
acks = -1: the response is held back until all replicas have the data.
For this case a DelayedProduce request is created, which can only be answered once every follower has fetched the data,
so DelayedProduce requests are unblocked from the fetch-request path.
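
For reference, a minimal sketch of how a client might pick between these modes, assuming the 0.8-era Scala producer API (`request.required.acks`); the broker list and topic name are placeholders:

    import java.util.Properties
    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

    object AcksExample {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("metadata.broker.list", "localhost:9092") // placeholder broker list
        props.put("serializer.class", "kafka.serializer.StringEncoder")
        // 0  -> fire and forget
        // 1  -> ack once the leader has written the message
        // -1 -> ack only after the full ISR has the message (served via DelayedProduce on the broker)
        props.put("request.required.acks", "-1")

        val producer = new Producer[String, String](new ProducerConfig(props))
        producer.send(new KeyedMessage[String, String]("test-topic", "key", "value"))
        producer.close()
      }
    }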

    if(produceRequest.requiredAcks == 0) {
      // acks == 0: no ack is required, nothing special to do
    } else if (produceRequest.requiredAcks == 1 ||           // acks == 1: respond immediately
        produceRequest.numPartitions <= 0 ||                 // nothing to write, the request contains zero partitions
        numPartitionsInError == produceRequest.numPartitions) { // every partition failed
      // in all of these cases a response must be sent right away
      requestChannel.sendResponse(new RequestChannel.Response(request, new BoundedByteBufferSend(response)))
    } else { // the remaining case, i.e. acks == -1
      // create a list of (topic, partition) pairs to use as keys for this delayed request
      val producerRequestKeys = produceRequest.data.keys.toSeq
      val statuses = localProduceResults.map(r =>
        r.key -> DelayedProduceResponseStatus(r.end + 1, ProducerResponseStatus(r.errorCode, r.start))).toMap
      val delayedRequest = new DelayedProduce(
        producerRequestKeys,
        request,
        produceRequest.ackTimeoutMs.toLong,
        produceRequest,
        statuses,
        offsetCommitRequestOpt)

      // add the produce request for watch if it's not satisfied, otherwise send the response back
      val satisfiedByMe = producerRequestPurgatory.checkAndMaybeWatch(delayedRequest)
      if (satisfiedByMe)
        producerRequestPurgatory.respond(delayedRequest)
    }

handleFetchRequest

Serves read requests, which come either from consumers or from follower fetchers.

    def handleFetchRequest(request: RequestChannel.Request) {
      val fetchRequest = request.requestObj.asInstanceOf[FetchRequest]
      val dataRead = replicaManager.readMessageSets(fetchRequest) // read the data through the replicaManager

      // if the fetch request comes from the follower,
      // update its corresponding log end offset
      if(fetchRequest.isFromFollower) // for a follower fetch request, update the follower's LEO and possibly the ISR
        recordFollowerLogEndOffsets(fetchRequest.replicaId, dataRead.mapValues(_.offset))

      // check if this fetch request can be satisfied right away
      val bytesReadable = dataRead.values.map(_.data.messages.sizeInBytes).sum
      val errorReadingData = dataRead.values.foldLeft(false)((errorIncurred, dataAndOffset) =>
        errorIncurred || (dataAndOffset.data.error != ErrorMapping.NoError))
      // a fetch request may be delayed, but it must be answered immediately in the following cases:
      // send the data immediately if 1) fetch request does not want to wait
      //                              2) fetch request does not require any data
      //                              3) has enough data to respond
      //                              4) some error happens while reading data
      if(fetchRequest.maxWait <= 0 ||              // the client does not want to wait
         fetchRequest.numPartitions <= 0 ||        // no data was requested
         bytesReadable >= fetchRequest.minBytes || // enough data has already been read
         errorReadingData) {                       // an error occurred
        debug("Returning fetch response %s for fetch request with correlation id %d to client %s"
          .format(dataRead.values.map(_.data.error).mkString(","), fetchRequest.correlationId, fetchRequest.clientId))
        val response = new FetchResponse(fetchRequest.correlationId, dataRead.mapValues(_.data))
        requestChannel.sendResponse(new RequestChannel.Response(request, new FetchResponseSend(response)))
      } else { // otherwise create a delayed fetch request, e.g. when there is no new data; it is unblocked once data arrives
        debug("Putting fetch request with correlation id %d from client %s into purgatory".format(fetchRequest.correlationId,
          fetchRequest.clientId))
        // create a list of (topic, partition) pairs to use as keys for this delayed request
        val delayedFetchKeys = fetchRequest.requestInfo.keys.toSeq
        val delayedFetch = new DelayedFetch(delayedFetchKeys, request, fetchRequest.maxWait, fetchRequest,
          dataRead.mapValues(_.offset))

        // add the fetch request for watch if it's not satisfied, otherwise send the response back
        val satisfiedByMe = fetchRequestPurgatory.checkAndMaybeWatch(delayedFetch)
        if (satisfiedByMe)
          fetchRequestPurgatory.respond(delayedFetch)
      }
    }

readMessageSets simply calls readMessageSet for every TopicAndPartition:

    private def readMessageSet(topic: String,
                               partition: Int,
                               offset: Long,
                               maxSize: Int,
                               fromReplicaId: Int): (FetchDataInfo, Long) = {
      // check if the current broker is the leader for the partitions
      val localReplica = if(fromReplicaId == Request.DebuggingConsumerId)
        getReplicaOrException(topic, partition)
      else
        getLeaderReplicaIfLocal(topic, partition) // must be the leader; a non-leader cannot serve fetch requests either
      trace("Fetching log segment for topic, partition, offset, size = " + (topic, partition, offset, maxSize))
      // as I understand it, fromReplicaId is only set for fetch requests coming from a follower
      val maxOffsetOpt =
        if (Request.isValidBrokerId(fromReplicaId))
          None // a follower fetch has no upper offset bound: read whatever is available
        else // an ordinary fetch must not read beyond the HW offset
          Some(localReplica.highWatermark.messageOffset)
      val fetchInfo = localReplica.log match {
        case Some(log) =>
          log.read(offset, maxSize, maxOffsetOpt)
        case None =>
          error("Leader for partition [%s,%d] does not have a local log".format(topic, partition))
          FetchDataInfo(LogOffsetMetadata.UnknownOffsetMetadata, MessageSet.Empty)
      }
      (fetchInfo, localReplica.highWatermark.messageOffset)
    }

For a follower fetch request, recordFollowerLogEndOffsets updates the follower's LEO:

    private def recordFollowerLogEndOffsets(replicaId: Int, offsets: Map[TopicAndPartition, LogOffsetMetadata]) {
      debug("Record follower log end offsets: %s ".format(offsets))
      offsets.foreach {
        case (topicAndPartition, offset) =>
          replicaManager.updateReplicaLEOAndPartitionHW(topicAndPartition.topic, // update LEO and HW
            topicAndPartition.partition, replicaId, offset)
          // after a successful follower fetch, check whether earlier DelayedProduce requests can now be answered,
          // since with ack = -1 a response is only possible once every follower has fetched the data
          // for producer requests with ack = -1, we need to check
          // if they can be unblocked after some follower's log end offsets have moved
          replicaManager.unblockDelayedProduceRequests(topicAndPartition)
      }
    }

This ends up in ReplicaManager.updateReplicaLEOAndPartitionHW, which also adjusts the partition's ISR:

    def updateReplicaLEOAndPartitionHW(topic: String, partitionId: Int, replicaId: Int, offset: LogOffsetMetadata) = {
      getPartition(topic, partitionId) match {
        case Some(partition) =>
          partition.getReplica(replicaId) match {
            case Some(replica) =>
              replica.logEndOffset = offset // set the follower replica's LEO to the offset it just fetched
              // check if we need to update HW and expand Isr
              partition.updateLeaderHWAndMaybeExpandIsr(replicaId) // update the ISR
              debug("Recorded follower %d position %d for partition [%s,%d].".format(replicaId, offset.messageOffset, topic, partitionId))
            case None =>
              throw new NotAssignedReplicaException(("Leader %d failed to record follower %d's position %d since the replica" +
                " is not recognized to be one of the assigned replicas %s for partition [%s,%d]").format(localBrokerId, replicaId,
                offset.messageOffset, partition.assignedReplicas().map(_.brokerId).mkString(","), topic, partitionId))

          }
        case None =>
          warn("While recording the follower position, the partition [%s,%d] hasn't been created, skip updating leader HW".format(topic, partitionId))
      }
    }

which in turn calls Partition.updateLeaderHWAndMaybeExpandIsr to update the ISR:

    def updateLeaderHWAndMaybeExpandIsr(replicaId: Int) {
      inWriteLock(leaderIsrUpdateLock) {
        // check if this replica needs to be added to the ISR
        leaderReplicaIfLocal() match { // the ISR can only be updated if the local replica is the leader
          case Some(leaderReplica) =>
            val replica = getReplica(replicaId).get
            val leaderHW = leaderReplica.highWatermark
            // For a replica to get added back to ISR, it has to satisfy 3 conditions-
            // 1. It is not already in the ISR
            // 2. It is part of the assigned replica list. See KAFKA-1097
            // 3. It's log end offset >= leader's high watermark
            if (!inSyncReplicas.contains(replica) &&                     // not already in the ISR
                assignedReplicas.map(_.brokerId).contains(replicaId) &&  // is in the AR
                replica.logEndOffset.offsetDiff(leaderHW) >= 0) {        // its LEO has reached the leader's HW, i.e. it has caught up
              // expand ISR
              val newInSyncReplicas = inSyncReplicas + replica // expand the ISR
              info("Expanding ISR for partition [%s,%d] from %s to %s"
                .format(topic, partitionId, inSyncReplicas.map(_.brokerId).mkString(","), newInSyncReplicas.map(_.brokerId).mkString(",")))
              // update ISR in ZK and cache
              updateIsr(newInSyncReplicas) // write the new ISR to ZK
              replicaManager.isrExpandRate.mark()
            }
            maybeIncrementLeaderHW(leaderReplica) // advance the HW
          case None => // nothing to do if no longer leader
        }
      }
    }

maybeIncrementLeaderHW

    private def maybeIncrementLeaderHW(leaderReplica: Replica) {
      val allLogEndOffsets = inSyncReplicas.map(_.logEndOffset) // the LEOs of all replicas in the ISR
      val newHighWatermark = allLogEndOffsets.min(new LogOffsetMetadata.OffsetOrdering) // the minimum becomes the new HW, so only offsets replicated to every ISR member are ever exposed
      val oldHighWatermark = leaderReplica.highWatermark // the current HW
      if(oldHighWatermark.precedes(newHighWatermark)) { // the new HW must be larger than the old one
        leaderReplica.highWatermark = newHighWatermark // update the HW
        debug("High watermark for partition [%s,%d] updated to %s".format(topic, partitionId, newHighWatermark))
        // some delayed requests may be unblocked after HW changed
        val requestKey = new TopicAndPartition(this.topic, this.partitionId)
        replicaManager.unblockDelayedFetchRequests(requestKey) // a HW change unblocks delayed fetches: blocked readers can now see new data
        replicaManager.unblockDelayedProduceRequests(requestKey) // it also unblocks DelayedProduce: a HW move means data has been replicated to every ISR member, so produce requests can be answered
      } else {
        debug("Skipping update high watermark since Old hw %s is larger than new hw %s for partition [%s,%d]. All leo's are %s"
          .format(oldHighWatermark, newHighWatermark, topic, partitionId, allLogEndOffsets.mkString(",")))
      }
    }
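
The HW rule is simply "minimum LEO across the ISR, and it never moves backwards". A tiny self-contained sketch with made-up LEO values:

    object HighWatermarkExample {
      // newHW = min LEO of the ISR, applied only if it moves forward.
      def maybeIncrementHW(isrLeos: Seq[Long], oldHw: Long): Long = {
        val newHw = isrLeos.min
        if (newHw > oldHw) newHw else oldHw
      }

      def main(args: Array[String]): Unit = {
        println(maybeIncrementHW(Seq(120L, 100L, 110L), 95L))  // 100: the slowest ISR member bounds the HW
        println(maybeIncrementHW(Seq(120L, 100L, 110L), 105L)) // 105: the HW never moves backwards
      }
    }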

handleControlledShutdownRequest

Handles the controlled-shutdown request sent by a broker that is shutting down:

    def handleControlledShutdownRequest(request: RequestChannel.Request) {
      val controlledShutdownRequest = request.requestObj.asInstanceOf[ControlledShutdownRequest]
      val partitionsRemaining = controller.shutdownBroker(controlledShutdownRequest.brokerId)
      val controlledShutdownResponse = new ControlledShutdownResponse(controlledShutdownRequest.correlationId,
        ErrorMapping.NoError, partitionsRemaining)
      requestChannel.sendResponse(new Response(request, new BoundedByteBufferSend(controlledShutdownResponse)))
    }

It simply delegates to controller.shutdownBroker. This is the graceful shutdown path, which does a fair amount of preparation:

    def shutdownBroker(id: Int) : Set[TopicAndPartition] = {

      if (!isActive()) { // abort if the current broker is not the controller
        throw new ControllerMovedException("Controller moved to another broker. Aborting controlled shutdown")
      }

      controllerContext.brokerShutdownLock synchronized {
        info("Shutting down broker " + id)

        inLock(controllerContext.controllerLock) {
          if (!controllerContext.liveOrShuttingDownBrokerIds.contains(id)) // the broker does not exist, throw
            throw new BrokerNotAvailableException("Broker id %d does not exist.".format(id))

          controllerContext.shuttingDownBrokerIds.add(id) // add the broker to the set of brokers that are shutting down
          debug("All shutting down brokers: " + controllerContext.shuttingDownBrokerIds.mkString(","))
          debug("Live brokers: " + controllerContext.liveBrokerIds.mkString(","))
        }

        val allPartitionsAndReplicationFactorOnBroker: Set[(TopicAndPartition, Int)] = // all partitions and their replication factors on this broker
          inLock(controllerContext.controllerLock) {
            controllerContext.partitionsOnBroker(id)
              .map(topicAndPartition => (topicAndPartition, controllerContext.partitionReplicaAssignment(topicAndPartition).size))
          }

        allPartitionsAndReplicationFactorOnBroker.foreach {
          case(topicAndPartition, replicationFactor) =>
            // Move leadership serially to relinquish lock.
            inLock(controllerContext.controllerLock) {
              controllerContext.partitionLeadershipInfo.get(topicAndPartition).foreach { currLeaderIsrAndControllerEpoch =>
                if (replicationFactor > 1) { // only when replication is enabled; a factor of 1 means no extra replicas
                  if (currLeaderIsrAndControllerEpoch.leaderAndIsr.leader == id) { // this broker is the leader
                    // If the broker leads the topic partition, transition the leader and update isr. Updates zk and
                    // notifies all affected brokers
                    partitionStateMachine.handleStateChanges(Set(topicAndPartition), OnlinePartition,
                      controlledShutdownPartitionLeaderSelector) // proactively trigger leader re-election
                  } else { // the replica on this broker is not the leader, so send a StopReplica request
                    // Stop the replica first. The state change below initiates ZK changes which should take some time
                    // before which the stop replica request should be completed (in most cases)
                    brokerRequestBatch.newBatch()
                    brokerRequestBatch.addStopReplicaRequestForBrokers(Seq(id), topicAndPartition.topic,
                      topicAndPartition.partition, deletePartition = false)
                    brokerRequestBatch.sendRequestsToBrokers(epoch, controllerContext.correlationId.getAndIncrement)

                    // If the broker is a follower, updates the isr in ZK and notifies the current leader
                    replicaStateMachine.handleStateChanges(Set(PartitionAndReplica(topicAndPartition.topic,
                      topicAndPartition.partition, id)), OfflineReplica)
                  }
                }
              }
            }
        }
        def replicatedPartitionsBrokerLeads() = inLock(controllerContext.controllerLock) {
          trace("All leaders = " + controllerContext.partitionLeadershipInfo.mkString(","))
          controllerContext.partitionLeadershipInfo.filter {
            case (topicAndPartition, leaderIsrAndControllerEpoch) =>
              leaderIsrAndControllerEpoch.leaderAndIsr.leader == id && controllerContext.partitionReplicaAssignment(topicAndPartition).size > 1
          }.map(_._1)
        }
        replicatedPartitionsBrokerLeads().toSet
      }
    }

The leader re-election here uses controlledShutdownPartitionLeaderSelector.
Its strategy is very simple:
build a new ISR that excludes the shutting-down brokers, then pick its head as the new leader (a worked example follows the snippet).

    val newIsr = currentLeaderAndIsr.isr.filter(brokerId => !controllerContext.shuttingDownBrokerIds.contains(brokerId))
    val newLeaderOpt = newIsr.headOption
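
A small self-contained sketch of the same selection on made-up broker ids (the ISR and shutting-down set here are invented for illustration):

    object ControlledShutdownSelectorExample {
      def selectLeader(isr: List[Int], shuttingDown: Set[Int]): (List[Int], Option[Int]) = {
        val newIsr = isr.filter(brokerId => !shuttingDown.contains(brokerId)) // drop shutting-down brokers
        (newIsr, newIsr.headOption)                                           // head of what remains becomes the new leader
      }

      def main(args: Array[String]): Unit = {
        println(selectLeader(List(2, 3, 5), Set(2))) // (List(3, 5), Some(3))
        println(selectLeader(List(2), Set(2)))       // (List(), None): no eligible leader left
      }
    }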

handleTopicMetadataRequest, handleUpdateMetadataRequest

These handle the requests that read and update the MetadataCache.

KafkaApis.metadataCache

First, what exactly is the MetadataCache?

    /**
     * A cache for the state (e.g., current leader) of each partition. This cache is updated through
     * UpdateMetadataRequest from the controller. Every broker maintains the same cache, asynchronously.
     */
    private[server] class MetadataCache {
      private val cache: mutable.Map[String, mutable.Map[Int, PartitionStateInfo]] =
        new mutable.HashMap[String, mutable.Map[Int, PartitionStateInfo]]()
      private var aliveBrokers: Map[Int, Broker] = Map()
      private val partitionMetadataLock = new ReentrantReadWriteLock()

So the cache is a Map[String, mutable.Map[Int, PartitionStateInfo]], recording a PartitionStateInfo per topic and per partition:

    case class PartitionStateInfo(val leaderIsrAndControllerEpoch: LeaderIsrAndControllerEpoch,
                                  val allReplicas: Set[Int])

It contains leaderIsrAndControllerEpoch, which records the leader and the ISR, and
allReplicas, which records all replicas, i.e. the AR. Note that only replica ids are stored here; the details of each replica live only in the ReplicaManager.
Recording a leaderIsrAndControllerEpoch per partition does look a bit wasteful.

aliveBrokers records the id and ip:port of every live broker.

So it is fairly simple: this cache is updated asynchronously on every broker through handleUpdateMetadataRequest. A sketch of the idea is below.
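
A minimal, simplified model of such a cache and its asynchronous update path; this illustrates the structure (map guarded by a read-write lock), not the real MetadataCache API:

    import java.util.concurrent.locks.ReentrantReadWriteLock
    import scala.collection.mutable

    object MetadataCacheSketch {
      case class PartitionState(leader: Int, isr: Seq[Int], allReplicas: Set[Int])

      class SimpleMetadataCache {
        private val cache = new mutable.HashMap[String, mutable.Map[Int, PartitionState]]()
        private val lock  = new ReentrantReadWriteLock()

        // applied when an UpdateMetadataRequest-like message arrives from the controller
        def update(topic: String, partition: Int, state: PartitionState): Unit = {
          lock.writeLock().lock()
          try cache.getOrElseUpdate(topic, mutable.HashMap.empty).put(partition, state)
          finally lock.writeLock().unlock()
        }

        // served to clients asking for topic metadata
        def leaderFor(topic: String, partition: Int): Option[Int] = {
          lock.readLock().lock()
          try cache.get(topic).flatMap(_.get(partition)).map(_.leader)
          finally lock.readLock().unlock()
        }
      }

      def main(args: Array[String]): Unit = {
        val mc = new SimpleMetadataCache
        mc.update("clicks", 0, PartitionState(leader = 1, isr = Seq(1, 2), allReplicas = Set(1, 2, 3)))
        println(mc.leaderFor("clicks", 0)) // Some(1)
      }
    }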

handleStopReplicaRequest

Handles a stop-replica request, typically issued when a broker is being stopped or a replica needs to be deleted.

The handling is simple: stop the fetcher threads and, if requested, delete the partition directory.

stopReplicas

    def stopReplicas(stopReplicaRequest: StopReplicaRequest): (mutable.Map[TopicAndPartition, Short], Short) = {
      replicaStateChangeLock synchronized { // take the lock
        val responseMap = new collection.mutable.HashMap[TopicAndPartition, Short]
        if(stopReplicaRequest.controllerEpoch < controllerEpoch) { // check the epoch to reject stale requests
          (responseMap, ErrorMapping.StaleControllerEpochCode)
        } else {
          controllerEpoch = stopReplicaRequest.controllerEpoch // update the epoch
          // First stop fetchers for all partitions, then stop the corresponding replicas
          replicaFetcherManager.removeFetcherForPartitions(stopReplicaRequest.partitions.map(r => TopicAndPartition(r.topic, r.partition))) // first stop the fetcher threads of the affected partitions via the FetcherManager
          for(topicAndPartition <- stopReplicaRequest.partitions){
            val errorCode = stopReplica(topicAndPartition.topic, topicAndPartition.partition, stopReplicaRequest.deletePartitions) // then call stopReplica
            responseMap.put(topicAndPartition, errorCode)
          }
          (responseMap, ErrorMapping.NoError)
        }
      }
    }

stopReplica; note that in many cases (e.g. a broker crash) the replica does not actually need to be deleted:

    def stopReplica(topic: String, partitionId: Int, deletePartition: Boolean): Short = {
      getPartition(topic, partitionId) match {
        case Some(partition) =>
          leaderPartitionsLock synchronized {
            leaderPartitions -= partition
          }
          if(deletePartition) { // the partition is really deleted only when deletePartition = true
            val removedPartition = allPartitions.remove((topic, partitionId))
            if (removedPartition != null)
              removedPartition.delete() // this will delete the local log
          }
        case None => // do nothing if replica no longer exists. This can happen during delete topic retries
      }
    }

handleLeaderAndIsrRequest

Handles LeaderAndIsr updates. The difference from handleUpdateMetadataRequest is that this does not just update the cache; it actually performs the leader/follower switch on the replicas.
The main call is
replicaManager.becomeLeaderOrFollower(leaderAndIsrRequest, offsetManager)
The core logic follows; the first part mainly validates the request against the controllerEpoch and the leaderEpoch.

    def becomeLeaderOrFollower(leaderAndISRRequest: LeaderAndIsrRequest): (collection.Map[(String, Int), Short], Short) = {
      replicaStateChangeLock synchronized { // take the lock
        val responseMap = new collection.mutable.HashMap[(String, Int), Short]
        if(leaderAndISRRequest.controllerEpoch < controllerEpoch) { // check the request's controller epoch
          (responseMap, ErrorMapping.StaleControllerEpochCode)
        } else {
          val controllerId = leaderAndISRRequest.controllerId
          val correlationId = leaderAndISRRequest.correlationId
          controllerEpoch = leaderAndISRRequest.controllerEpoch

          // First check partition's leader epoch
          // the request-level epoch was checked above, but the leader epoch inside each partitionStateInfo must be checked too
          val partitionState = new HashMap[Partition, PartitionStateInfo]()
          leaderAndISRRequest.partitionStateInfos.foreach{ case ((topic, partitionId), partitionStateInfo) =>
            val partition = getOrCreatePartition(topic, partitionId, partitionStateInfo.replicationFactor) // get or create the Partition; a partition exists only logically, so this just creates the Partition object
            val partitionLeaderEpoch = partition.getLeaderEpoch()
            // If the leader epoch is valid record the epoch of the controller that made the leadership decision.
            // This is useful while updating the isr to maintain the decision maker controller's epoch in the zookeeper path
            if (partitionLeaderEpoch < partitionStateInfo.leaderIsrAndControllerEpoch.leaderAndIsr.leaderEpoch) { // the local leader epoch must be smaller than the one in the request, otherwise the request is stale
              if(partitionStateInfo.allReplicas.contains(config.brokerId)) // is this partition assigned to the current broker?
                partitionState.put(partition, partitionStateInfo) // only partitions assigned to this broker go into partitionState; partition holds the current state, partitionStateInfo the new state from the request
              else { }
            } else { // Received invalid LeaderAndIsr request
              // Otherwise record the error code in response
              responseMap.put((topic, partitionId), ErrorMapping.StaleLeaderEpochCode)
            }
          }

          // core logic: decide for which partitions this broker becomes leader or follower, then call makeLeaders / makeFollowers
          val partitionsTobeLeader = partitionState // partitions whose leader replica lives on this broker
            .filter{ case (partition, partitionStateInfo) => partitionStateInfo.leaderIsrAndControllerEpoch.leaderAndIsr.leader == config.brokerId}
          val partitionsToBeFollower = (partitionState -- partitionsTobeLeader.keys)

          if (!partitionsTobeLeader.isEmpty) makeLeaders(controllerId, controllerEpoch, partitionsTobeLeader, leaderAndISRRequest.correlationId, responseMap)
          if (!partitionsToBeFollower.isEmpty) makeFollowers(controllerId, controllerEpoch, partitionsToBeFollower, leaderAndISRRequest.leaders, leaderAndISRRequest.correlationId, responseMap)

          // we initialize highwatermark thread after the first leaderisrrequest. This ensures that all the partitions
          // have been completely populated before starting the checkpointing there by avoiding weird race conditions
          if (!hwThreadInitialized) {
            startHighWaterMarksCheckPointThread() // start the HW checkpoint thread: the HW matters, so it is flushed to disk periodically and reloaded on failover
            hwThreadInitialized = true
          }
          replicaFetcherManager.shutdownIdleFetcherThreads() // shut down idle fetchers; a replica that became leader no longer needs to fetch
          (responseMap, ErrorMapping.NoError)
        }
      }
    }

ReplicaManager keeps an allPartitions pool that records the state of every partition:

    private val allPartitions = new Pool[(String, Int), Partition]

Inside the Partition structure, the most important field is:

    private val assignedReplicaMap = new Pool[Int, Replica]

which maps broker ids to their Replica objects.

    def getOrCreatePartition(topic: String, partitionId: Int): Partition = {
      var partition = allPartitions.get((topic, partitionId))
      if (partition == null) {
        allPartitions.putIfNotExists((topic, partitionId), new Partition(topic, partitionId, time, this))
        partition = allPartitions.get((topic, partitionId))
      }
      partition
    }

So getOrCreatePartition just fetches (or lazily creates) the Partition object that the ReplicaManager currently holds for that partition.

replicaManager.makeLeaders

First it removes the fetchers of every replica that is becoming a leader; the key call is then:

    // Update the partition information to be the leader
    partitionState.foreach{ case (partition, partitionStateInfo) =>
      partition.makeLeader(controllerId, partitionStateInfo, correlationId)}

As noted above, in case (partition, partitionStateInfo) the partition holds the ReplicaManager's current state, while partitionStateInfo carries the new assignment from the request.

    def makeLeader(controllerId: Int,
                   partitionStateInfo: PartitionStateInfo, correlationId: Int,
                   offsetManager: OffsetManager): Boolean = {
      inWriteLock(leaderIsrUpdateLock) {
        val allReplicas = partitionStateInfo.allReplicas
        val leaderIsrAndControllerEpoch = partitionStateInfo.leaderIsrAndControllerEpoch
        val leaderAndIsr = leaderIsrAndControllerEpoch.leaderAndIsr
        // record the epoch of the controller that made the leadership decision. This is useful while updating the isr
        // to maintain the decision maker controller's epoch in the zookeeper path
        controllerEpoch = leaderIsrAndControllerEpoch.controllerEpoch
        // add replicas that are new
        allReplicas.foreach(replica => getOrCreateReplica(replica)) // allReplicas from the request
        val newInSyncReplicas = leaderAndIsr.isr.map(r => getOrCreateReplica(r)).toSet // all replicas in the request's ISR
        // remove assigned replicas that have been removed by the controller
        // assignedReplicas is the partition's current assignment; it is reconciled with allReplicas: any replica id not in allReplicas is removed from assignedReplicas
        (assignedReplicas().map(_.brokerId) -- allReplicas).foreach(removeReplica(_))
        inSyncReplicas = newInSyncReplicas // overwrite the partition's current state with the data from the request
        leaderEpoch = leaderAndIsr.leaderEpoch
        zkVersion = leaderAndIsr.zkVersion
        leaderReplicaIdOpt = Some(localBrokerId)
        // construct the high watermark metadata for the new leader replica
        val newLeaderReplica = getReplica().get
        newLeaderReplica.convertHWToLocalOffsetMetadata() // a freshly created replica only has an offset, so the metadata must be read from the log
        // reset log end offset for remote replicas
        // To understand this, note when the LEO is updated: leader.assignedReplicas.getReplica.leo is only set when a follower
        // successfully fetches from the leader. The LEOs are reset here because any existing values may be leftovers
        // from the last time this broker was the leader.
        assignedReplicas.foreach(r => if (r.brokerId != localBrokerId) r.logEndOffset = LogOffsetMetadata.UnknownOffsetMetadata)
        // All remote LEOs were just reset to UnknownOffsetMetadata (-1), and maybeIncrementLeaderHW takes the minimum LEO across
        // the replicas. If there is any replica besides the leader, that minimum is -1, which is below the current HW, so the HW
        // does not actually move here. Only when the ISR contains just the leader is the HW advanced to the leader's LEO.

        maybeIncrementLeaderHW(newLeaderReplica)
        if (topic == OffsetManager.OffsetsTopicName)
          offsetManager.loadOffsetsFromLog(partitionId)
        true
      }
    }

Two things are worth knowing about getOrCreateReplica:
a. this is where a local replica is actually created when it does not exist yet;
b. every Replica object is created through this function, so every other replica collection (e.g. inSyncReplicas) holds references to the Replica objects in assignedReplicaMap.

    def getOrCreateReplica(replicaId: Int = localBrokerId): Replica = {
      val replicaOpt = getReplica(replicaId) // assignedReplicaMap.get(replicaId)
      replicaOpt match {
        case Some(replica) => replica
        case None =>
          if (isReplicaLocal(replicaId)) { // a local replica that is missing from the AR has to be created
            val config = LogConfig.fromProps(logManager.defaultConfig.toProps, AdminUtils.fetchTopicConfig(zkClient, topic))
            val log = logManager.createLog(TopicAndPartition(topic, partitionId), config) // actually create the replica's log
            val checkpoint = replicaManager.highWatermarkCheckpoints(log.dir.getParentFile.getAbsolutePath) // read the HW checkpoint
            val offsetMap = checkpoint.read
            if (!offsetMap.contains(TopicAndPartition(topic, partitionId)))
              warn("No checkpointed highwatermark is found for partition [%s,%d]".format(topic, partitionId))
            val offset = offsetMap.getOrElse(TopicAndPartition(topic, partitionId), 0L).min(log.logEndOffset) // take min(checkpointed HW, LEO) so the HW never exceeds the LEO
            val localReplica = new Replica(replicaId, this, time, offset, Some(log))
            addReplicaIfNotExists(localReplica)
          } else { // a remote replica carries no local log
            val remoteReplica = new Replica(replicaId, this, time)
            addReplicaIfNotExists(remoteReplica)
          }
          getReplica(replicaId).get
      }
    }

replicaManager.makeFollowers

    var partitionsToMakeFollower: Set[Partition] = Set() // partitions whose leader has changed
    // call partition.makeFollower
    if (partition.makeFollower(controllerId, partitionStateInfo, correlationId, offsetManager)) // returns true only when the partition's leader actually changed; otherwise nothing needs to be done
      partitionsToMakeFollower += partition
    // since the leader has changed, the old fetchers must be removed
    replicaFetcherManager.removeFetcherForPartitions(partitionsToMakeFollower.map(new TopicAndPartition(_)))

    // since the leader has changed, data synced from the old leader may diverge from the new leader; everything below the HW
    // is consistent everywhere, so the log is truncated to the HW to avoid inconsistency
    logManager.truncateTo(partitionsToMakeFollower.map(partition => (new TopicAndPartition(partition), partition.getOrCreateReplica().highWatermark.messageOffset)).toMap)

    if (isShuttingDown.get()) {
      // the broker is really shutting down, so do not add new fetchers
    }
    else {
      // we do not need to check if the leader exists again since this has been done at the beginning of this process
      val partitionsToMakeFollowerWithLeaderAndOffset = partitionsToMakeFollower.map(partition =>
        new TopicAndPartition(partition) -> BrokerAndInitialOffset(
          leaders.find(_.id == partition.leaderReplicaIdOpt.get).get,
          partition.getReplica().get.logEndOffset.messageOffset)).toMap

      replicaFetcherManager.addFetcherForPartitions(partitionsToMakeFollowerWithLeaderAndOffset) // add the new fetchers
    }

partition.makeFollower
is fairly simple: it only updates assignedReplicas and the ISR.

    def makeFollower(controllerId: Int,
                     partitionStateInfo: PartitionStateInfo,
                     correlationId: Int, offsetManager: OffsetManager): Boolean = {
      inWriteLock(leaderIsrUpdateLock) {
        val allReplicas = partitionStateInfo.allReplicas
        val leaderIsrAndControllerEpoch = partitionStateInfo.leaderIsrAndControllerEpoch
        val leaderAndIsr = leaderIsrAndControllerEpoch.leaderAndIsr
        val newLeaderBrokerId: Int = leaderAndIsr.leader
        // record the epoch of the controller that made the leadership decision. This is useful while updating the isr
        // to maintain the decision maker controller's epoch in the zookeeper path
        controllerEpoch = leaderIsrAndControllerEpoch.controllerEpoch
        // add replicas that are new
        allReplicas.foreach(r => getOrCreateReplica(r))
        // remove assigned replicas that have been removed by the controller
        (assignedReplicas().map(_.brokerId) -- allReplicas).foreach(removeReplica(_))
        inSyncReplicas = Set.empty[Replica] // clear the ISR, unlike makeLeader
        leaderEpoch = leaderAndIsr.leaderEpoch
        zkVersion = leaderAndIsr.zkVersion

        if (leaderReplicaIdOpt.isDefined && leaderReplicaIdOpt.get == newLeaderBrokerId) { // did the replica leader actually change?
          false
        }
        else {
          leaderReplicaIdOpt = Some(newLeaderBrokerId) // it changed, so record the new leader
          true
        }
      }
    }
