What We Pay Attention to When Sending Messages with a Kafka Producer (Python Client, 1.01 Broker)

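Until now I had mostly used Kafka's consumer client, and even then I simply subscribed and consumed without looking closely at the individual parameters. In short, my usage was never very careful.

A recent campaign on a company project exposed a lot of problems, which is why this article exists.

Let's start by going through the parameters that kafka-python's KafkaProducer class exposes. I will describe what each one does and what effect it has, including the fairly important parameter for supplying a custom partitioner.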

1. bootstrap_servers

bootstrap_servers: 'host[:port]' string (or list of 'host[:port]'
    strings) that the producer should contact to bootstrap initial
    cluster metadata. This does not have to be the full node list.
    It just needs to have at least one broker that will respond to a
    Metadata API Request. Default port is 9092. If no servers are
    specified, will default to localhost:9092.

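bootstrap_servers accepts a single string or a list, and its default value is 'localhost'. It tells the producer where the Kafka brokers are, in host:port form. For example, we run a cluster, so the producer has to be pointed at the cluster's addresses, and we set bootstrap_servers like this: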

import kafka

kafka_conf = {
    'bootstrap_servers': ['10.171.97.1:9092', '10.163.13.219:9092', '10.170.249.122:9092']
}
# In our project this assignment sits inside a class, as self.kafka_producer.
kafka_producer = kafka.KafkaProducer(**kafka_conf)

The default is localhost (port 9092).

2. client_id

client_id (str): a name for this client. This string is passed in
    each request to servers and can be used to identify specific
    server-side log entries that correspond to this client.
    Default: 'kafka-python-producer-#' (appended with a unique number
    per instance)

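Setting a client id lets the brokers identify this producer when it talks to them, for example in their request logs.

If you leave it unset (None), kafka-python generates a name of the form 'kafka-python-producer-#'.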

3. key_serializer | value_serializer

key_serializer (callable): used to convert user-supplied keys to bytes
    If not None, called as f(key), should return bytes. Default: None.
value_serializer (callable): used to convert user-supplied message
    values to bytes. If not None, called as f(value), should return
    bytes. Default: None.

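These serialize the key and the value of each message. You can supply your own callables; each must return bytes.

Both default to None, in which case the keys and values you pass in must already be bytes.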
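As an illustration of the serializer hooks, here is a minimal sketch that sends str keys and JSON values. The function names and the client_id value are made up for this example, and JSON is only one possible encoding. The kafka import is kept inside build_producer so the serializers can be tried without a broker.

```python
import json


def serialize_key(key):
    """Encode a str key as UTF-8 bytes; pass None through (keyless message)."""
    return None if key is None else key.encode("utf-8")


def serialize_value(value):
    """Encode any JSON-serializable value as UTF-8 JSON bytes."""
    return json.dumps(value).encode("utf-8")


def build_producer(bootstrap_servers):
    """Create a producer that applies the two serializers on every send()."""
    from kafka import KafkaProducer  # requires kafka-python

    return KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        client_id="order-service-producer",  # hypothetical name
        key_serializer=serialize_key,
        value_serializer=serialize_value,
    )
```

With such a producer, a call like producer.send('some-topic', key='user-1', value={'a': 1}) would hand the broker bytes for both fields.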

4. acks

acks (0, 1, 'all'): The number of acknowledgments the producer requires
    the leader to have received before considering a request complete.
    This controls the durability of records that are sent. The
    following settings are common:

    0: Producer will not wait for any acknowledgment from the server.
        The message will immediately be added to the socket
        buffer and considered sent. No guarantee can be made that the
        server has received the record in this case, and the retries
        configuration will not take effect (as the client won't
        generally know of any failures). The offset given back for each
        record will always be set to -1.
    1: Wait for leader to write the record to its local log only.
        Broker will respond without awaiting full acknowledgement from
        all followers. In this case should the leader fail immediately
        after acknowledging the record but before the followers have
        replicated it then the record will be lost.
    all: Wait for the full set of in-sync replicas to write the record.
        This guarantees that the record will not be lost as long as at
        least one in-sync replica remains alive. This is the strongest
        available guarantee.
    If unset, defaults to acks=1.

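acks is one of the most important parameters for reliability.

With acks=0 the producer does not wait for anything: the message is handed to the socket buffer and considered sent. This gives the highest throughput, but there is no guarantee that the message arrived, no acknowledgment from the broker, and no retry. It is classic fire and forget.

The default is 1. With acks=1 the producer waits until the partition leader has written the message to its log and returned a response, but it does not wait for the other in-sync replicas (ISR) to copy it. If the leader fails and a new one is elected before the followers have replicated the message, the message can be lost. This setting balances throughput and durability reasonably well: as long as the leader stays up nothing is lost, and the message is copied to the other ISR members shortly afterwards.

With acks='all' the producer waits until all in-sync replicas have the message before the send is acknowledged. Use it when losing messages is not acceptable: as long as one in-sync replica survives, the message is not lost.

The default is 1.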
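To see what the acknowledgment gives the caller, here is a small sketch that sends one message and blocks until the broker has answered. The helper name and the 10 second timeout are choices made for this example; send() returns a future, and calling get() on it either returns the record metadata or raises the error that made the send fail.

```python
def send_and_confirm(producer, topic, value, timeout_s=10):
    """Send one message and block until it is acknowledged.

    With acks=1 or acks='all' the returned offset is the real log offset;
    with acks=0 there is no acknowledgment and the offset is always -1.
    """
    future = producer.send(topic, value)
    metadata = future.get(timeout=timeout_s)  # raises on failure or timeout
    return metadata.partition, metadata.offset
```

Blocking on every message costs throughput, so this pattern fits low-volume, must-not-lose messages better than bulk traffic.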

5. retries

retries (int): Setting a value greater than zero will cause the client
    to resend any record whose send fails with a potentially transient
    error. Note that this retry is no different than if the client
    resent the record upon receiving the error. Allowing retries
    without setting max_in_flight_requests_per_connection to 1 will
    potentially change the ordering of records because if two batches
    are sent to a single partition, and the first fails and is retried
    but the second succeeds, then the records in the second batch may
    appear first.
    Default: 0.

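This parameter only takes effect when acks is not 0. If a send fails and retries is set, the producer sends the record again. Retries are off by default; if you need to avoid losing data or want higher availability, set it to a fairly large value so that the producer keeps trying until the send succeeds. Be aware that this can introduce blocking, so evaluate it against your own environment.

By default retries are disabled.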

6. compression_type

compression_type (str): The compression type for all data generated by
    the producer. Valid values are 'gzip', 'snappy', 'lz4', or None.
    Compression is of full batches of data, so the efficacy of batching
    will also impact the compression ratio (more batching means better
    compression). Default: None.

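Selects a compression codec. Note that compression is applied on the producer side. The codec is recorded with each compressed batch, so a consumer does not configure it itself, but it must be able to decode it (for snappy and lz4 that means having the corresponding Python package installed).

By default nothing is compressed.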
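The docstring's remark that more batching means better compression can be checked with the standard library alone. The sketch below does not use Kafka's actual batch format; it simply gzips one record versus 200 concatenated records of made-up sample data and compares the ratios.

```python
import gzip
import json


def compressed_ratio(records):
    """Return compressed size / raw size for records gzipped as one blob."""
    raw = b"".join(records)
    return len(gzip.compress(raw)) / len(raw)


# 200 similar JSON records, like typical event messages (made-up sample data).
records = [
    json.dumps({"user_id": i, "event": "page_view", "path": "/home"}).encode("utf-8")
    for i in range(200)
]

single = compressed_ratio(records[:1])  # one record per "batch"
batched = compressed_ratio(records)     # 200 records per "batch"
print(f"single record: {single:.2f}, batch of 200: {batched:.2f}")
```

The batch compresses far better because the repeated field names are shared across records, which is why compression_type is usually tuned together with batch_size and linger_ms.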

7. batch_size

batch_size (int): Requests sent to brokers will contain multiple
    batches, one for each partition with data available to be sent.
    A small batch size will make batching less common and may reduce
    throughput (a batch size of zero will disable batching entirely).
    Default: 16384

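The batch size defaults to 16 KB. Batching reduces the number of requests and the load they put on the broker: messages headed for the same partition are collected and sent once the batch reaches this size. I have not tuned this parameter myself, so I am not sure of its exact impact; presumably it helps keep the network less busy when volume and load are both very high.

The default is 16 KB.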

8. linger_ms

linger_ms (int): The producer groups together any records that arrive
    in between request transmissions into a single batched request.
    Normally this occurs only under load when records arrive faster
    than they can be sent out. However in some circumstances the client
    may want to reduce the number of requests even under moderate load.
    This setting accomplishes this by adding a small amount of
    artificial delay; that is, rather than immediately sending out a
    record the producer will wait for up to the given delay to allow
    other records to be sent so that the sends can be batched together.
    This can be thought of as analogous to Nagle's algorithm in TCP.
    This setting gives the upper bound on the delay for batching: once
    we get batch_size worth of records for a partition it will be sent
    immediately regardless of this setting, however if we have fewer
    than this many bytes accumulated for this partition we will
    'linger' for the specified time waiting for more records to show
    up. This setting defaults to 0 (i.e. no delay). Setting linger_ms=5
    would have the effect of reducing the number of requests sent but
    would add up to 5ms of latency to records sent in the absence of
    load. Default: 0.

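Normally records are only batched under load, that is, when messages reach the producer faster than it can send them to the broker. linger_ms makes batching happen even under moderate load: the producer holds each record for up to linger_ms so that more records can join the batch, which reduces the number of requests reaching the broker. It does this by adding a small artificial delay, a bit like Nagle's algorithm in TCP. A batch that reaches batch_size is sent immediately regardless of linger_ms.

I have not tried this parameter either, so I cannot say much about its effect in practice. If you run into congestion under heavy traffic, it may be worth enabling.

By default it is disabled (0).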
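The interaction between batch_size and linger_ms is easier to see in a toy model. The sketch below is not how kafka-python's accumulator is implemented; it only encodes the rule from the docstring for a single partition: a batch goes out when it is full, or when linger_ms has passed since its first record. The traffic pattern is made up.

```python
def plan_batches(records, batch_size, linger_ms):
    """Group (arrival_ms, payload_bytes) records for ONE partition into batches.

    Toy rule: a batch is sent as soon as it holds batch_size bytes, or
    linger_ms after its first record arrived, whichever comes first.
    """
    batches = []
    current, current_bytes, opened_at = [], 0, None
    for arrival_ms, payload in records:
        # Linger expired before this record arrived: send what we have.
        if current and arrival_ms - opened_at >= linger_ms:
            batches.append(current)
            current, current_bytes = [], 0
        if not current:
            opened_at = arrival_ms
        current.append(payload)
        current_bytes += len(payload)
        # Batch is full: send immediately, regardless of linger_ms.
        if current_bytes >= batch_size:
            batches.append(current)
            current, current_bytes = [], 0
    if current:
        batches.append(current)  # final flush
    return batches


# Six 100-byte records arriving 1 ms apart (made-up traffic).
records = [(t, b"x" * 100) for t in range(6)]
no_linger = plan_batches(records, batch_size=16384, linger_ms=0)
with_linger = plan_batches(records, batch_size=16384, linger_ms=5)
full_batches = plan_batches(records, batch_size=200, linger_ms=5)
```

In this model linger_ms=0 produces six single-record requests, linger_ms=5 collapses them into two, and a small batch_size cuts batches short before the linger time is up.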

9. partitioner

partitioner (callable): Callable used to determine which partition
    each message is assigned to. Called (after key serialization):
    partitioner(key_bytes, all_partitions, available_partitions).
    The default partitioner implementation hashes each non-None key
    using the same murmur2 algorithm as the java client so that
    messages with the same key are assigned to the same partition.
    When a key is None, the message is delivered to a random partition
    (filtered to partitions with available leaders only, if possible).

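partitioner is the function that decides which partition each message is produced to. The Python client uses DefaultPartitioner() by default; you can read its source to see the strategy and the arguments it takes. Here it is: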

import random

# murmur2 is a pure-Python hash helper that ships with kafka-python's
# partitioner package.


class DefaultPartitioner(object):
    """Default partitioner.

    Hashes key to partition using murmur2 hashing (from java client)
    If key is None, selects partition randomly from available,
    or from all partitions if none are currently available
    """
    @classmethod
    def __call__(cls, key, all_partitions, available):
        """
        Get the partition corresponding to key
        :param key: partitioning key
        :param all_partitions: list of all partitions sorted by partition ID
        :param available: list of available partitions in no particular order
        :return: one of the values from all_partitions or available
        """
        if key is None:
            # No key: pick any partition, preferring ones with a live leader.
            if available:
                return random.choice(available)
            return random.choice(all_partitions)

        # Keyed message: hash, force non-negative, map onto the partition list.
        idx = murmur2(key)
        idx &= 0x7fffffff
        idx %= len(all_partitions)
        return all_partitions[idx]

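As you can see, when no key is given the partition is picked with Python's random.choice, and when a key is given it is hashed with murmur2.

If sending without a key leaves your data skewed across partitions, you can try supplying your own partitioning function.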
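As one possible replacement, here is a sketch of a custom partitioner with the same call signature. The class name is hypothetical. It spreads keyless messages round-robin rather than randomly, and for keyed messages it uses zlib.crc32 as a stand-in hash, which is an assumption of this example and not the murmur2 hash of the default partitioner, so keyed messages will land on different partitions than they would with the default.

```python
import itertools
import zlib


class RoundRobinPartitioner:
    """Hypothetical partitioner: round-robin for keyless messages, hash otherwise."""

    def __init__(self):
        self._counter = itertools.count()

    def __call__(self, key_bytes, all_partitions, available_partitions):
        if key_bytes is None:
            # Spread keyless messages evenly instead of picking at random.
            candidates = available_partitions or all_partitions
            return candidates[next(self._counter) % len(candidates)]
        # Same key -> same partition. crc32 is a stand-in, NOT murmur2.
        return all_partitions[zlib.crc32(key_bytes) % len(all_partitions)]


partitioner = RoundRobinPartitioner()
picks = [partitioner(None, [0, 1, 2], [0, 1, 2]) for _ in range(6)]
```

An instance would be passed to the producer as KafkaProducer(partitioner=RoundRobinPartitioner(), ...).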

10. buffer_memory

buffer_memory (int): The total bytes of memory the producer should use
    to buffer records waiting to be sent to the server. If records are
    sent faster than they can be delivered to the server the producer
    will block up to max_block_ms, raising an exception on timeout.
    In the current implementation, this setting is an approximation.
    Default: 33554432 (32MB)

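This is the producer-side send buffer, 32 MB by default. If the buffer fills up because records cannot be delivered to the broker fast enough, send() blocks for up to max_block_ms and then raises a timeout exception.

The default is 32 MB. This should only come into play at very high data volumes. My feeling is that unless the 32 MB is absorbing a temporary stall, a producer that fills it will end up hitting the timeout anyway.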

11. max_block_ms

max_block_ms (int): Number of milliseconds to block during
    :meth:`~kafka.KafkaProducer.send` and
    :meth:`~kafka.KafkaProducer.partitions_for`. These methods can be
    blocked either because the buffer is full or metadata unavailable.
    Blocking in the user-supplied serializers or partitioner will not be
    counted against this timeout. Default: 60000.

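This is how long KafkaProducer.send and KafkaProducer.partitions_for may block, either because the buffer is full or because metadata is unavailable.

For example, if my send call blocks, by default it gives up after 60 s. That feels a little long to me: if the buffer is blocked for 60 s, the messages behind it have probably piled up into a long queue already.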
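One way to keep a full buffer from stalling the caller is to lower max_block_ms and handle the timeout explicitly. The sketch below assumes that policy; the helper name and the idea of collecting dropped values in a list are made up for this example. The fallback class only exists so the snippet can run where kafka-python is not installed.

```python
try:
    from kafka.errors import KafkaTimeoutError
except ImportError:
    # Stand-in so the sketch can run without kafka-python installed.
    class KafkaTimeoutError(Exception):
        pass


def send_or_drop(producer, topic, value, dropped):
    """Try to send; if send() blocks past max_block_ms, keep the value aside."""
    try:
        producer.send(topic, value)
        return True
    except KafkaTimeoutError:
        # Buffer full or metadata unavailable for longer than max_block_ms.
        dropped.append(value)  # e.g. write to a local file or count a metric
        return False
```

Whether dropping, spooling to disk, or failing the request is the right reaction depends on how much loss the application can tolerate.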

12. max_request_size

max_request_size (int): The maximum size of a request. This is also
    effectively a cap on the maximum record size. Note that the server
    has its own cap on record size which may be different from this.
    This setting will limit the number of record batches the producer
    will send in a single request to avoid sending huge requests.
    Default: 1048576.

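This parameter caps the size of a single request so that the producer does not send oversized requests; in practice it is also a cap on the size of a single record. The broker has its own, separate limit on record size (message.max.bytes).

By default the largest message that can be sent is 1 MB.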

13. retry_backoff_ms

retry_backoff_ms (int): Milliseconds to backoff when retrying on
    errors. Default: 100.

The wait between retries defaults to 100 ms.

14. metadata_max_age_ms

metadata_max_age_ms (int): The period of time in milliseconds after
    which we force a refresh of metadata even if we haven't seen any
    partition leadership changes to proactively discover any new
    brokers or partitions. Default: 300000

The maximum age of the cached metadata.

By default the producer refreshes its metadata from the brokers every 5 minutes.

15. max_in_flight_requests_per_connection

max_in_flight_requests_per_connection (int): Requests are pipelined
    to kafka brokers up to this number of maximum requests per
    broker connection. Note that if this setting is set to be greater
    than 1 and there are failed sends, there is a risk of message
    re-ordering due to retries (i.e., if retries are enabled).
    Default: 5.

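The articles I read did not explain this parameter clearly. Its default is 5, which means that up to 5 requests can be sent on a single connection before any of them has been acknowledged. If it is set to 1, each request must be acknowledged before the next one is sent, which preserves ordering.

Put differently: the producer is pipelined and sends in order, so ordering problems only appear on retry. Say I send 1 2 3 4 5 and the acknowledgments start coming in. If 2 was lost, it has to be sent again, and now the order is broken, because 5 has already been accepted by the time 2 is resent. With an idempotent producer, when 2 fails we can have 3, 4 and 5 retransmitted as well, which restores the order.

Without idempotence, we need to set acks to all, set max_in_flight_requests_per_connection to 1, and use a single partition to fully guarantee that data is delivered in order.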
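The 1 2 3 4 5 example can be replayed in a toy model. This is not Kafka's real network layer; it assumes batches go out in windows of max_in_flight, and that a failed batch is retried only after the rest of its window has been written.

```python
def deliver(batches, fail_once, max_in_flight):
    """Return the order in which batches land in the partition log (toy model).

    Batches are sent in windows of max_in_flight. A batch in fail_once fails
    on its first attempt and is retried after the rest of its window landed.
    """
    log, failed_before = [], set()
    pending = list(batches)
    while pending:
        window, pending = pending[:max_in_flight], pending[max_in_flight:]
        retry = []
        for batch in window:
            if batch in fail_once and batch not in failed_before:
                failed_before.add(batch)
                retry.append(batch)  # first attempt lost
            else:
                log.append(batch)    # written by the broker
        pending = retry + pending    # retries go out before new batches
    return log


pipelined = deliver([1, 2, 3, 4, 5], fail_once={2}, max_in_flight=5)
one_at_a_time = deliver([1, 2, 3, 4, 5], fail_once={2}, max_in_flight=1)
```

With five requests in flight the log ends up as 1 3 4 5 2; with one in flight the retry of 2 happens before 3 is ever sent, so the order stays 1 2 3 4 5.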

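There are quite a few more parameters, including those for secure transport protocols, that I have not listed here. Most of the time you will not need them, and you can look them up when you do.

When sending messages we usually do not have to care about this many parameters. Only in specific situations do we need to adjust some of them to get the semantics we want. A common case is wanting messages never to be lost; how should we configure that?

My feeling is that Kafka in many cases relies on the consumer side being idempotent. If it is, the whole pipeline is robust and fast with at-least-once semantics, and there is no need to guarantee exactly-once.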

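1. Send with producer.send on a producer whose acks is set to all.

2. Set retries to a fairly large value so that failed sends are retried instead of lost.

3. Set unclean.leader.election.enable = false to keep replicas that are outside the ISR and have fallen too far behind from being elected leader.

4. Set replication.factor >= 3 at the topic level for redundant copies.

5. Set min.insync.replicas > 1 so that a write has to be confirmed by more than one replica before it counts as committed.

6. On the consumer side, give up autocommit and commit manually to make sure messages are handled correctly; if you do use autocommit, make sure the consumer is idempotent.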
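On the client side, the checklist above could look roughly like the sketch below. The broker address, group name and retry count are placeholders, and items 3 to 5 are broker or topic settings that cannot be set from the client, so they do not appear. The kafka import is kept inside the function so the two config dictionaries can be inspected without a broker.

```python
# Producer side (items 1-2). The retry count is an arbitrary "large" value.
producer_conf = {
    'bootstrap_servers': ['localhost:9092'],  # placeholder address
    'acks': 'all',
    'retries': 10,
    # Optional: keep ordering intact when a retry happens, at some cost
    # in throughput.
    'max_in_flight_requests_per_connection': 1,
}

# Consumer side (item 6): commit offsets only after processing succeeds.
consumer_conf = {
    'bootstrap_servers': ['localhost:9092'],  # placeholder address
    'group_id': 'example-group',              # hypothetical group name
    'enable_auto_commit': False,
}


def consume_forever(topic, handle):
    """Process each message with handle(), then commit its offset manually."""
    from kafka import KafkaConsumer  # requires kafka-python and a broker

    consumer = KafkaConsumer(topic, **consumer_conf)
    for message in consumer:
        handle(message.value)
        consumer.commit()  # a crash before this line means redelivery
```

Because a crash between handle() and commit() leads to the same message being delivered again, handle() still has to be idempotent; this is the at-least-once trade-off described above.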

Reference:

https://en.wikipedia.org/wiki/Nagle%27s_algorithm Nagle's algorithm

https://xinklabi.iteye.com/blog/2195092 The MurmurHash algorithm (fast, low collision rate; used by Hadoop, memcached and others)

https://www.iteblog.com/archives/2560.html How Kafka guarantees data reliability and consistency

https://stackoverflow.com/questions/49802686/understanding-the-max-inflight-property-of-kafka-producer Understanding the max.inflight property of kafka producer

http://matt33.com/2018/10/24/kafka-idempotent/ Kafka transactions: how idempotence is implemented
