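Questions left over from last time

If messages are addressed to many different topics, how does the async producer tell the topics apart while still sending in batches?
How does it use the key to choose a partition?
How does it compress messages in batches?

How does the async producer tell topics apart while sending in batches?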

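The answer: DefaultEventHandler takes the batch of messages handed to it (actually a Seq[KeyedMessage[K,V]]), breaks it apart, and determines which broker each message should go to. The messages bound for each broker are then grouped by topic and partition. In short: unpack the batch, then regroup it according to the metadata.

This is implemented by the partitionAndCollate method: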

def partitionAndCollate(messages: Seq[KeyedMessage[K,Message]]): Option[Map[Int, collection.mutable.Map[TopicAndPartition, Seq[KeyedMessage[K,Message]]]]]

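It returns an Option wrapping a Map whose key is the brokerId and whose value is the messages bound for that broker. For each message, the method first determines which partition of which topic it goes to, then finds that partition's leader broker, then looks up that broker in the Map[Int, collection.mutable.Map[TopicAndPartition, Seq[KeyedMessage[K,Message]]]], and finally appends the message to the Seq[KeyedMessage[K,Message]] for the matching topic+partition. The result describes which messages go to each broker and in what structure. At send time, the grouped messages are sent to the different brokers according to brokerId.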
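The grouping described above can be sketched in a few lines. This is a simplified model, not Kafka's code: Msg, collate, and the leaders lookup are hypothetical names, and each message is assumed to already carry its partition id (in the real handler that id comes from getPartition).

```scala
// Simplified stand-ins for Kafka's types; every name here is illustrative.
object CollateSketch {
  final case class Msg(topic: String, partition: Int, payload: String)
  final case class TopicAndPartition(topic: String, partition: Int)

  // leaders: hypothetical lookup from topic+partition to its leader broker id.
  // A partition with no known leader is filed under broker id -1.
  def collate(
      messages: Seq[Msg],
      leaders: Map[TopicAndPartition, Int]
  ): Map[Int, Map[TopicAndPartition, Seq[Msg]]] =
    messages
      // First level: one bucket per destination broker.
      .groupBy(m => leaders.getOrElse(TopicAndPartition(m.topic, m.partition), -1))
      // Second level: within a broker, one bucket per topic+partition.
      .map { case (brokerId, msgs) =>
        brokerId -> msgs.groupBy(m => TopicAndPartition(m.topic, m.partition))
      }
}
```

Each entry of the outer map corresponds to one request to one broker, which is why messages for different topics can share a batch without getting mixed up.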

First, look at how the Kafka protocol describes the structure of a Produce Request:

ProduceRequest => RequiredAcks Timeout [TopicName [Partition MessageSetSize MessageSet]]
  RequiredAcks => int16
  Timeout => int32
  Partition => int32
  MessageSetSize => int32

The messages sent to a single broker have exactly this structure.

The Kafka wiki also says the following about the Produce API:

The produce API is used to send message sets to the server. For efficiency it allows sending message sets intended for many topic partitions in a single request.

That is, a single produce request can carry messages for several topic+partition combinations. A single produce request is, of course, sent to a single broker.

The call

send(brokerid, messageSetPerBroker)

sends the message sets to the broker with the given brokerid.
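To make the nesting of the request concrete, here is a minimal sketch that folds a flat map of (topic, partition) to message-set bytes into the [TopicName [Partition MessageSetSize MessageSet]] shape. The case classes and the build helper are assumptions made for illustration; they are not Kafka's ProducerRequest class, and no wire encoding is attempted.

```scala
object ProduceRequestSketch {
  // One [Partition MessageSetSize MessageSet] entry.
  final case class PartitionData(partition: Int, messageSet: Array[Byte]) {
    def messageSetSize: Int = messageSet.length
  }
  // One [TopicName [...]] entry.
  final case class TopicData(topicName: String, partitions: Seq[PartitionData])
  final case class ProduceRequest(requiredAcks: Short, timeoutMs: Int, topics: Seq[TopicData])

  // sets: already-serialized message sets for ONE broker, keyed by (topic, partition).
  def build(requiredAcks: Short, timeoutMs: Int, sets: Map[(String, Int), Array[Byte]]): ProduceRequest = {
    val topics = sets.toSeq
      .groupBy { case ((topic, _), _) => topic }
      .toSeq
      .sortBy(_._1) // sorted only to make the sketch deterministic
      .map { case (topic, entries) =>
        val parts = entries.sortBy(_._1._2).map { case ((_, p), bytes) => PartitionData(p, bytes) }
        TopicData(topic, parts)
      }
    ProduceRequest(requiredAcks, timeoutMs, topics)
  }
}
```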

How does it use the key to choose a partition?

First, look at the definition of the KeyedMessage class:

case class KeyedMessage[K, V](val topic: String, val key: K, val partKey: Any, val message: V) {
  if(topic == null)
    throw new IllegalArgumentException("Topic cannot be null.")

  def this(topic: String, message: V) = this(topic, null.asInstanceOf[K], null, message)

  def this(topic: String, key: K, message: V) = this(topic, key, key, message)

  def partitionKey = {
    if(partKey != null)
      partKey
    else if(hasKey)
      key
    else
      null
  }

  def hasKey = key != null
}

With the three-argument constructor, partKey is equal to key. partKey is used for partitioning, but it is not stored as part of the message.
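A short example of how the three constructors affect partitionKey. The class below is a standalone copy of the definition quoted above, reproduced only so the snippet runs without Kafka on the classpath; the topic name and key values are made up.

```scala
object KeyedMessageDemo {
  // Standalone copy of the class quoted above (illustration only).
  case class KeyedMessage[K, V](topic: String, key: K, partKey: Any, message: V) {
    if (topic == null) throw new IllegalArgumentException("Topic cannot be null.")
    def this(topic: String, message: V) = this(topic, null.asInstanceOf[K], null, message)
    def this(topic: String, key: K, message: V) = this(topic, key, key, message)
    def hasKey: Boolean = key != null
    def partitionKey: Any = if (partKey != null) partKey else if (hasKey) key else null
  }

  // Two arguments: no key and no partKey, so partitionKey is null.
  val noKey = new KeyedMessage[String, String]("logs", "payload")
  // Three arguments: partKey is set to the key.
  val keyed = new KeyedMessage[String, String]("logs", "user-42", "payload")
  // Four arguments: the message is routed by partKey while key stays "user-42".
  val routed = new KeyedMessage[String, String]("logs", "user-42", "region-eu", "payload")
}
```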

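As mentioned earlier, before deciding which broker a message goes to, the handler must first decide which partition it goes to. Only then can it use the partition id to find the broker that hosts the partition's leader.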

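val topicPartitionsList = getPartitionListForTopic(message) // fetch the partition info of the topic this message is sent to
val partitionIndex = getPartition(message.topic, message.partitionKey, topicPartitionsList) // decide which partition this message goes to

Note that the value passed to getPartition is message.partitionKey, that is, partKey when it is set and key otherwise. The getPartition method is: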

  private def getPartition(topic: String, key: Any, topicPartitionList: Seq[PartitionAndLeader]): Int = {
    val numPartitions = topicPartitionList.size
    if(numPartitions <= 0)
      throw new UnknownTopicOrPartitionException("Topic " + topic + " doesn't exist")
    val partition =
      if(key == null) {
        // If the key is null, we don't really need a partitioner
        // So we look up in the send partition cache for the topic to decide the target partition
        val id = sendPartitionPerTopicCache.get(topic)
        id match {
          case Some(partitionId) =>
            // directly return the partitionId without checking availability of the leader,
            // since we want to postpone the failure until the send operation anyways
            partitionId
          case None =>
            val availablePartitions = topicPartitionList.filter(_.leaderBrokerIdOpt.isDefined)
            if (availablePartitions.isEmpty)
              throw new LeaderNotAvailableException("No leader for any partition in topic " + topic)
            val index = Utils.abs(Random.nextInt) % availablePartitions.size
            val partitionId = availablePartitions(index).partitionId
            sendPartitionPerTopicCache.put(topic, partitionId)
            partitionId
        }
      } else
        partitioner.partition(key, numPartitions)

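When partKey is null, getPartition first looks in sendPartitionPerTopicCache, a Map from topic to partition id, for a partition cached for this topic. If an earlier call stored one with sendPartitionPerTopicCache.put(topic, partitionId), that id is returned directly. Otherwise a partition is picked at random from those that currently have a leader, and it is stored in sendPartitionPerTopicCache. As a result, once the cache holds a partition id for a topic, every subsequent keyless message for that topic goes to that same partition. So if all messages have a null partKey, only one partition of the topic receives messages for a period of time. It is "a period of time" rather than forever because the handler periodically re-fetches the metadata of the topics it has sent to, at an interval set by topic.metadata.refresh.interval.ms. After re-fetching the metadata it clears several caches, including sendPartitionPerTopicCache, so the next keyless message causes a new randomly chosen partitionId to be cached.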
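The sticky behaviour for keyless messages can be reproduced with a small model. This is a sketch under assumptions: StickyPartitionSketch, pick, and onMetadataRefresh are hypothetical names, and the list of partitions that have a leader is passed in directly instead of being read from topic metadata.

```scala
import scala.collection.mutable
import scala.util.Random

object StickyPartitionSketch {
  // topic -> partition id chosen for keyless messages
  private val cache = mutable.HashMap.empty[String, Int]

  // availablePartitions: ids of partitions that currently have a leader (assumed input).
  def pick(topic: String, availablePartitions: Seq[Int], rnd: Random): Int =
    cache.getOrElseUpdate(topic, {
      // Only evaluated on a cache miss.
      require(availablePartitions.nonEmpty, s"No leader for any partition in topic $topic")
      availablePartitions(rnd.nextInt(availablePartitions.size))
    })

  // Stands in for the cache clearing that follows a metadata refresh.
  def onMetadataRefresh(): Unit = cache.clear()
}
```

Until onMetadataRefresh() is called, every pick for the same topic returns the same partition, which is the uneven distribution described above.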

  if (topicMetadataRefreshInterval >= 0 &&
      SystemTime.milliseconds - lastTopicMetadataRefreshTime > topicMetadataRefreshInterval) {  // time to refresh the topic metadata
    Utils.swallowError(brokerPartitionInfo.updateInfo(topicMetadataToRefresh.toSet, correlationId.getAndIncrement))
    sendPartitionPerTopicCache.clear()
    topicMetadataToRefresh.clear
    lastTopicMetadataRefreshTime = SystemTime.milliseconds
  }

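When partKey is not null, the partition method of the partitioner passed to the handler decides the target partition from partKey and numPartitions. Note that numPartitions comes from topicPartitionList.size, so the chosen partition may be one that has no available leader. That problem is left for send to deal with. When it happens, partitionAndCollate assigns the message to a broker with brokerId -1, and the send method checks the brokerId before sending: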

    if(brokerId < 0) {
      warn("Failed to send data since partitions %s don't have a leader".format(messagesPerTopic.map(_._1).mkString(",")))
      messagesPerTopic.keys.toSeq

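When brokerId < 0, send returns a non-empty Seq containing every topic+partition combination that has no leader. If the messages still cannot be sent after the configured number of retries, the handle method ultimately throws a FailedToSendMessageException.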
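The keyed path and its failure mode can be sketched as follows. Everything here is an assumption made for illustration: partitionFor is a generic hash-modulo partitioner rather than a specific Kafka class, leaders is a plain sequence indexed by partition id, trySend is a hypothetical callback that returns the partitions that still failed, and FailedToSend merely stands in for FailedToSendMessageException.

```scala
object LeaderlessSketch {
  // Hash partitioning over ALL partitions, whether or not they have a leader.
  // The mask keeps the hash non-negative.
  def partitionFor(key: Any, numPartitions: Int): Int =
    (key.hashCode & 0x7fffffff) % numPartitions

  // leaders(i) is the leader broker of partition i, if one is known; no leader maps to -1.
  def brokerFor(partition: Int, leaders: IndexedSeq[Option[Int]]): Int =
    leaders(partition).getOrElse(-1)

  final class FailedToSend(msg: String) extends RuntimeException(msg)

  // One initial attempt plus up to maxRetries retries of whatever is still outstanding.
  def sendWithRetries(partitions: Seq[Int], maxRetries: Int)(trySend: Seq[Int] => Seq[Int]): Unit = {
    var outstanding = trySend(partitions)
    var retriesLeft = maxRetries
    while (outstanding.nonEmpty && retriesLeft > 0) {
      outstanding = trySend(outstanding)
      retriesLeft -= 1
    }
    if (outstanding.nonEmpty)
      throw new FailedToSend(
        s"Failed to send to partitions ${outstanding.mkString(",")} after $maxRetries retries")
  }
}
```

In the real handler a metadata refresh between attempts gives a leaderless partition a chance to acquire a leader; the sketch leaves that step to the trySend callback.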

How does it compress messages in batches?

This is handled in the following method:

private def groupMessagesToSet(messagesPerTopicAndPartition: collection.mutable.Map[TopicAndPartition, Seq[KeyedMessage[K,Message]]])

Its comment reads:

/** enforce the compressed.topics config here.
* If the compression codec is anything other than NoCompressionCodec,
* Enable compression only for specified topics if any
* If the list of compressed topics is empty, then enable the specified compression codec for all topics
* If the compression codec is NoCompressionCodec, compression is disabled for all topics
*/

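That is, if no compression is configured, the message sets of all topics are left uncompressed. If compression is configured and no specific topics are listed for compression, every topic is compressed. Otherwise only the listed topics are compressed.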
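The three rules can be written as one small function. This is a sketch: the Codec hierarchy and codecFor are hypothetical names modelled on the rules in the comment above, not Kafka's own types.

```scala
object CompressionChoiceSketch {
  sealed trait Codec
  case object NoCompression extends Codec
  case object Gzip extends Codec
  case object Snappy extends Codec

  // configured: the producer-wide codec; compressedTopics: the optional topic list.
  def codecFor(topic: String, configured: Codec, compressedTopics: Set[String]): Codec =
    configured match {
      // Rule 1: compression disabled means disabled for every topic.
      case NoCompression => NoCompression
      // Rule 2: an empty list means "compress everything";
      // Rule 3: a non-empty list restricts compression to the listed topics.
      case codec if compressedTopics.isEmpty || compressedTopics.contains(topic) => codec
      case _ => NoCompression
    }
}
```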

groupMessagesToSet itself contains no compression logic. Instead, it returns ByteBufferMessageSet objects. The comment on that class reads:

/**
* A sequence of messages stored in a byte buffer
*
* There are two ways to create a ByteBufferMessageSet
*
* Option 1: From a ByteBuffer which already contains the serialized message set. Consumers will use this method.
*
* Option 2: Give it a list of messages along with instructions relating to serialization format. Producers will use this method.

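So it appears to be the utility that serializes and deserializes message sets.

Its implementation uses the CompressionFactory object, and from that object's implementation we can see that the Kafka version examined here supports only two compression codecs: GZIP and Snappy.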

compressionCodec match {
      case DefaultCompressionCodec => new GZIPOutputStream(stream)
      case GZIPCompressionCodec => new GZIPOutputStream(stream)
      case SnappyCompressionCodec =>
        import org.xerial.snappy.SnappyOutputStream
        new SnappyOutputStream(stream)
      case _ =>
        throw new kafka.common.UnknownCodecException("Unknown Codec: " + compressionCodec)
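To show what compressing a whole batch through such a stream looks like, here is a minimal round trip using only the JDK's GZIP streams. The length-prefixed framing is an assumption made for this sketch and is not Kafka's message-set format; BatchGzipSketch and its methods are hypothetical names.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object BatchGzipSketch {
  // Writes all messages through ONE GZIP stream, so the batch is compressed as a unit.
  def compress(messages: Seq[Array[Byte]]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(new GZIPOutputStream(bytes))
    try {
      out.writeInt(messages.size)
      messages.foreach { m =>
        out.writeInt(m.length) // length prefix (sketch-only framing)
        out.write(m)
      }
    } finally out.close() // closing also finishes the GZIP trailer
    bytes.toByteArray
  }

  // Reverses compress: reads the count, then each length-prefixed message.
  def decompress(blob: Array[Byte]): Seq[Array[Byte]] = {
    val in = new DataInputStream(new GZIPInputStream(new ByteArrayInputStream(blob)))
    try {
      val count = in.readInt()
      Vector.fill(count) {
        val buf = new Array[Byte](in.readInt())
        in.readFully(buf)
        buf
      }
    } finally in.close()
  }
}
```

Compressing the batch as a unit lets the codec exploit redundancy across messages, which is typically more effective than compressing each small message on its own.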
