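Kafka Tutorial (1): Getting to Know Kafka

Message Queue (MQ)

A message queue is exactly what the name says: messages plus a queue. It is a container for passing messages, and it exposes produce and consume APIs for storing and retrieving them.

There are two models of message queue: point-to-point (p2p) and publish/subscribe (pub/sub).

What they share: produced messages are stored in a queue, and consumers read messages from that queue.

How they differ: in the p2p model a message can be consumed only once, and it no longer exists after it has been consumed, like a phone call;

in the pub/sub model a message can be consumed N times, and by several consumers at the same time, like a WeChat official account.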
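The difference between the two models can be sketched in a few lines of Python. This is only an in-memory illustration, not how Kafka is implemented; the class names PointToPointQueue and PubSubTopic and the subscriber names are made up for the example.

```python
from collections import deque


class PointToPointQueue:
    """p2p: each message is handed to exactly one consumer, then removed."""

    def __init__(self):
        self._messages = deque()

    def produce(self, message):
        self._messages.append(message)

    def consume(self):
        # Taking a message removes it, so nobody can read it a second time.
        return self._messages.popleft() if self._messages else None


class PubSubTopic:
    """pub/sub: messages are retained and every subscriber reads all of them."""

    def __init__(self):
        self._messages = []
        self._positions = {}  # subscriber name -> index of the next message to read

    def produce(self, message):
        self._messages.append(message)

    def consume(self, subscriber):
        position = self._positions.get(subscriber, 0)
        if position >= len(self._messages):
            return None
        self._positions[subscriber] = position + 1
        return self._messages[position]


p2p = PointToPointQueue()
p2p.produce("hello")
first = p2p.consume()    # the message
second = p2p.consume()   # None: the message is gone

topic = PubSubTopic()
topic.produce("hello")
alice_got = topic.consume("alice")  # the message
bob_got = topic.consume("bob")      # the same message again
```

In the pub/sub sketch each subscriber keeps its own read position, which is why one subscriber reading a message does not take it away from the others.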

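Introduction to Kafka

Kafka is a publish/subscribe messaging system with the following characteristics:

High throughput: it supports producing and consuming on the order of millions of messages per second.

Durability: it has a well-developed message storage mechanism that keeps messages safe and persistent.

Distributed: scaling and fault tolerance are built on a distributed design. Kafka replicates data to other servers, and if one server goes down, it automatically fails over to another.

Kafka is also a piece of message-oriented middleware.

It is commonly used to process activity data such as logins and page views.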

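Kafka Components

The Kafka service

topic: a category of messages, for example sports or entertainment.

broker: a message broker, that is, one node in the cluster. It is responsible for storing data, and a topic can be stored in partitions.

partition: the physical subdivision of a topic. A topic is split into n partitions on the brokers.

message: a single message. Each message is assigned to one partition, which requires some mapping rule.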
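One common kind of mapping rule is "hash the message key, then take it modulo the partition count". The sketch below shows that idea only; it is not Kafka's own partitioner, and the function name choose_partition and the use of CRC32 are assumptions made for the example.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # crc32 gives the same value on every run, unlike Python's built-in hash() for bytes.
    return zlib.crc32(key) % num_partitions


# Messages that share a key always land in the same partition.
p1 = choose_partition(b"user-1", 3)
p2 = choose_partition(b"user-1", 3)
```

With a rule like this, all messages for one key stay in one partition, so they keep their relative order there.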

Services around Kafka

producer: the message producer

consumer: the message consumer

zookeeper: coordinates the Kafka cluster so that it runs correctly

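Broker Configuration

Each broker is one Kafka server instance. Its configuration file is Kafka's server.properties.

1. To reduce the number of disk writes, Kafka buffers messages first and flushes them to disk only once a certain number of messages has accumulated or a certain amount of time has passed.

The corresponding configuration: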

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync

# the OS cache lazily. The following configurations control the flush of data to disk.

# There are a few important trade-offs here:

#    . Durability: Unflushed data may be lost if you are not using replication.

#    . Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.

#    . Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.

# The settings below allow one to configure the flush policy to flush data after a period of time or

# every N messages (or both). This can be done globally and overridden on a per-topic basis.

# The number of messages to accept before forcing a flush of data to disk

#log.flush.interval.messages=10000　　<=========

# The maximum amount of time a message can sit in a log before we force a flush

#log.flush.interval.ms=1000　　<=========
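The "every N messages, or after a period of time, or both" rule from the comments above can be written as a small predicate. This is a sketch of the decision only, not Kafka's code; the function name should_flush and its parameters are hypothetical, and the thresholds 10000 and 1000 are the values from the commented-out lines above.

```python
def should_flush(unflushed_messages, ms_since_last_flush,
                 interval_messages=None, interval_ms=None):
    """Return True when either configured threshold has been reached.

    A threshold of None means that criterion is not configured.
    """
    if interval_messages is not None and unflushed_messages >= interval_messages:
        return True
    if interval_ms is not None and ms_since_last_flush >= interval_ms:
        return True
    return False


# Count threshold reached, time threshold not reached.
by_count = should_flush(10000, 200, interval_messages=10000, interval_ms=1000)
# Time threshold reached, count threshold not reached.
by_time = should_flush(5, 1000, interval_messages=10000, interval_ms=1000)
# Neither threshold reached.
not_yet = should_flush(5, 200, interval_messages=10000, interval_ms=1000)
```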

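2. Messages are deleted automatically after they have been kept for a set period. The default is 7 days (168 hours).

The corresponding configuration: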

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can

# be set to delete segments after a period of time, or after a given size has accumulated.

# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens

# from the end of the log.

# The minimum age of a log file to be eligible for deletion

log.retention.hours=168

# A size-based retention policy for logs. Segments are pruned from the log as long as the remaining

# segments don't drop below log.retention.bytes.

#log.retention.bytes=

# The maximum size of a log segment file. When this size is reached a new log segment will be created.

log.segment.bytes=

# The interval at which log segments are checked to see if they can be deleted according

# to the retention policies

log.retention.check.interval.ms=

# By default the log cleaner is disabled and the log retention policy will default to just delete segments after their retention expires.

# If log.cleaner.enable=true is set the cleaner will be enabled and individual logs can then be marked for log compaction.

log.cleaner.enable=false
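As a rough illustration of the retention comments above, the sketch below walks the segments from oldest to newest and deletes a segment when either rule applies. It is a simplified model under stated assumptions, not Kafka's implementation: the Segment class, the function segments_to_delete, and the sample sizes and ages are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    size_bytes: int
    age_hours: float


def segments_to_delete(segments, retention_hours=None, retention_bytes=None):
    """Pick segments to delete. `segments` must be ordered oldest to newest."""
    remaining_bytes = sum(s.size_bytes for s in segments)
    doomed = []
    for segment in segments:
        too_old = retention_hours is not None and segment.age_hours > retention_hours
        # Size rule: prune only while what remains stays at or above the limit.
        too_big = (retention_bytes is not None
                   and remaining_bytes - segment.size_bytes >= retention_bytes)
        if not (too_old or too_big):
            break  # deletion proceeds from the old end only; stop at the first keeper
        doomed.append(segment.name)
        remaining_bytes -= segment.size_bytes
    return doomed


log = [Segment("a", 100, 200), Segment("b", 100, 100), Segment("c", 100, 1)]
by_age = segments_to_delete(log, retention_hours=168)
by_size = segments_to_delete(log, retention_bytes=150)
```

Here only segment "a" is older than 168 hours, and removing "a" still leaves at least 150 bytes, while removing "b" as well would not.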

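Producer Configuration

The message producer. Configuration file: producer.properties

1. partitioner.class: lets you customize partitioning by naming a class that implements your own algorithm.

2. producer.type=sync: whether messages are sent synchronously or asynchronously. A synchronous producer waits for the response to each message before sending the next one; an asynchronous producer just keeps sending.

3. Asynchronous sending supports batching to improve efficiency: messages are first buffered in memory and then sent together. The relevant parameters are queue.buffering.max.ms= and queue.buffering.max.messages=, whose defaults are reportedly 5000 and 10000.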
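The "buffer in memory, then send in one go" idea from point 3 can be sketched as follows. This is a generic batching buffer with a count threshold and a time threshold, and it does not claim to reproduce the exact semantics of the two Kafka parameters; the class name BatchingBuffer and the thresholds used below are assumptions for the example.

```python
class BatchingBuffer:
    """Collects messages in memory and releases them in batches."""

    def __init__(self, max_messages, max_ms, send):
        self.max_messages = max_messages
        self.max_ms = max_ms
        self.send = send          # callable that receives a list of messages
        self._pending = []
        self._oldest_ms = None    # timestamp of the oldest buffered message

    def add(self, message, now_ms):
        if not self._pending:
            self._oldest_ms = now_ms
        self._pending.append(message)
        self.poll(now_ms)

    def poll(self, now_ms):
        """Send the batch if it is full or its oldest message has waited long enough."""
        if not self._pending:
            return
        full = len(self._pending) >= self.max_messages
        expired = now_ms - self._oldest_ms >= self.max_ms
        if full or expired:
            batch, self._pending = self._pending, []
            self.send(batch)


sent = []
buffer = BatchingBuffer(max_messages=3, max_ms=5000, send=sent.append)
buffer.add("a", now_ms=0)
buffer.add("b", now_ms=10)
buffer.add("c", now_ms=20)    # third message fills the batch
buffer.add("d", now_ms=30)
buffer.poll(now_ms=6000)      # "d" has waited longer than max_ms
```

Timestamps are passed in explicitly so the behavior is easy to follow; a real sender would read a clock and poll from a background thread.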

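Consumer Configuration

Configuration file: consumer.properties

1. group.id=test-consumer-group: every consumer belongs to a group, and this setting specifies the group id.

2. How Kafka delivers messages to consumers depends on these groups.

Between groups: different groups each consume the same data and do not affect one another.

Within a group: the members share the topic's data between them. Two consumers in the same group cannot consume the same partition of a topic at the same time, but they can consume different partitions of that topic at the same time.

// So, for a given topic, a group should not have more members than the topic has partitions; the extra consumers would be wasted.

3. One consumer can start several threads, and each thread then acts as a consumer.

(Consumer groups are how Kafka implements both broadcast (delivery to every consumer) and unicast (delivery to a single consumer) of a topic's messages.

A topic can be read by multiple consumer groups. To broadcast, give each consumer its own group.

To unicast, put all consumers in the same group. Consumer groups also let you group consumers freely without having to send a message several times to different topics.)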
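The group rules above can be illustrated with a small assignment function. Round-robin is used here purely as an assumption for the sketch; it is not presented as the strategy Kafka itself uses, and the function name assign_partitions and the consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of ONE group, round-robin.

    Each partition goes to exactly one consumer; surplus consumers get nothing.
    """
    assignment = {consumer: [] for consumer in consumers}
    if not consumers:
        return assignment
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2]

# Unicast: all consumers are in one group and split the partitions.
# With 4 consumers and 3 partitions, c4 receives nothing and sits idle.
one_group = assign_partitions(partitions, ["c1", "c2", "c3", "c4"])

# Broadcast: each consumer is alone in its own group, so each reads every partition.
broadcast = {name: assign_partitions(partitions, [name])[name] for name in ["c1", "c2"]}
```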

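partition

At the storage level each partition is an append-only log file; new messages are appended to the end of the file.

Each message has a position in the log file called its offset.

More partitions can accommodate more consumers, which effectively increases the capacity for concurrent consumption.

To separate business domains, add topics; to handle larger data volumes, add partitions.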
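A partition's append-only log and its offsets can be modeled in memory like this. The sketch ignores files, segments and indexes entirely; the class name PartitionLog and its methods are made up for the example.

```python
class PartitionLog:
    """An in-memory stand-in for one partition's append-only log."""

    def __init__(self):
        self._records = []

    def append(self, data: bytes) -> int:
        """Add a record at the tail and return its offset."""
        offset = len(self._records)
        self._records.append(data)
        return offset

    def read(self, offset: int, max_records: int = 1):
        """Return up to max_records records starting at the given offset."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        return self._records[offset:offset + max_records]


log = PartitionLog()
first_offset = log.append(b"login")
second_offset = log.append(b"page-view")
tail = log.read(1, max_records=10)
```

Because records are only ever appended, an offset keeps pointing at the same record, and a consumer can resume simply by remembering the next offset to read.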

message

A message has 3 attributes:

offset: a long; the sequence number, or id, of the message within its partition

MessageSize: an int32; the size in bytes

data: the actual content
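To make the three attributes concrete, the sketch below packs them into bytes with Python's struct module: an 8-byte offset, a 4-byte size, then the data. The big-endian byte order and this exact layout are assumptions for illustration and are not taken from the text above; the function names are hypothetical.

```python
import struct

# Big-endian header: int64 offset followed by int32 size (12 bytes in total).
HEADER = struct.Struct(">qi")


def encode_message(offset: int, data: bytes) -> bytes:
    """Serialize offset, MessageSize and data into one byte string."""
    return HEADER.pack(offset, len(data)) + data


def decode_message(buffer: bytes):
    """Return (offset, size, data) read from the start of buffer."""
    offset, size = HEADER.unpack_from(buffer, 0)
    start = HEADER.size
    data = buffer[start:start + size]
    if len(data) != size:
        raise ValueError("buffer is shorter than the declared message size")
    return offset, size, data


encoded = encode_message(7, b"hi")
decoded = decode_message(encoded)
```

Storing the size in front of the data is what lets a reader walk through a file of variable-length messages one after another.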

Broker Configuration in Detail

# Licensed to the Apache Software Foundation (ASF) under one or more

# contributor license agreements.  See the NOTICE file distributed with

# this work for additional information regarding copyright ownership.

# The ASF licenses this file to You under the Apache License, Version 2.0

# (the "License"); you may not use this file except in compliance with

# the License.  You may obtain a copy of the License at

#

#    http://www.apache.org/licenses/LICENSE-2.0

#

# Unless required by applicable law or agreed to in writing, software

# distributed under the License is distributed on an "AS IS" BASIS,

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

# See the License for the specific language governing permissions and

# limitations under the License.

# see kafka.server.KafkaConfig for additional details and defaults

############################# Server Basics #############################

##################################################################################

#  A broker is one deployed Kafka instance. Every Kafka server in a cluster must have a broker.id,

#  and that id must be unique and must be an integer.

##################################################################################

broker.id=

############################# Socket Server Settings #############################

# The address the socket server listens on. It will get the value returned from

# java.net.InetAddress.getCanonicalHostName() if not configured.

#   FORMAT:

#     listeners = security_protocol://host_name:port

#   EXAMPLE:

#     listeners = PLAINTEXT://your.host.name:9092

#listeners=PLAINTEXT://:9092

# Hostname and port the broker will advertise to producers and consumers. If not set,

# it uses the value for "listeners" if configured.  Otherwise, it will use the value

# returned from java.net.InetAddress.getCanonicalHostName().

#advertised.listeners=PLAINTEXT://your.host.name:9092

##################################################################################

#The number of threads handling network requests

# Number of threads that handle network requests; the default is 3

##################################################################################

num.network.threads=

##################################################################################

# The number of threads doing disk I/O

# Default number of threads that perform disk I/O

##################################################################################

num.io.threads=

##################################################################################

# The send buffer (SO_SNDBUF) used by the socket server

# Size of the buffer the socket server uses to send data; the default is 100 KB

##################################################################################

socket.send.buffer.bytes=

##################################################################################

# The receive buffer (SO_RCVBUF) used by the socket server

# Size of the buffer the socket server uses to receive data; the default is 100 KB

##################################################################################

socket.receive.buffer.bytes=

##################################################################################

# The maximum size of a request that the socket server will accept (protection against OOM)

# The maximum size of a single request that the socket server will accept,

# which guards against OOM (out of memory); the default is 100 MB

##################################################################################

socket.request.max.bytes=

############################# Log Basics (data settings; Kafka calls its data the log) #############################

##################################################################################

# A comma separated list of directories under which to store log files

# A comma-separated list of directories in which Kafka stores the data it receives

##################################################################################

log.dirs=/home/uplooking/data/kafka

##################################################################################

# The default number of log partitions per topic. More partitions allow greater

# parallelism for consumption, but this will also result in more files across

# the brokers.

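# Number of log partitions for each topic; the default is 1. More partitions increase consumption

# parallelism, but they also mean more files spread across the brokers of the Kafka cluster.

# (Partitioning is distributed storage: one body of data is split into several pieces, or partitions, that are stored separately.)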

##################################################################################

num.partitions=

##################################################################################

# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.

# This value is recommended to be increased for installations with data dirs located in RAID array.

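# Number of threads per data directory used to recover data when Kafka starts and to flush data when it shuts down. If Kafka's data is stored on a disk array (RAID),

# consider increasing this value.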

##################################################################################

num.recovery.threads.per.data.dir=

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync

# the OS cache lazily. The following configurations control the flush of data to disk.

# There are a few important trade-offs here:

#    . Durability: Unflushed data may be lost if you are not using replication.

#    . Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.

#    . Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.

# The settings below allow one to configure the flush policy to flush data after a period of time or

# every N messages (or both). This can be done globally and overridden on a per-topic basis.

# Kafka's flush policy can be based only on message count and on elapsed time; there is no size-based option. You can configure either one of the two,

# or both. Both settings are listed below.

# The number of messages to accept before forcing a flush of data to disk

# Message-count threshold at which data is flushed to disk

#log.flush.interval.messages=

# The maximum amount of time a message can sit in a log before we force a flush

# Time interval after which messages are flushed to disk

#log.flush.interval.ms=

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can

# be set to delete segments after a period of time, or after a given size has accumulated.

# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens

# from the end of the log.

# In short: a segment is deleted as soon as either policy (time-based or size-based) is met.

# The minimum age of a log file to be eligible for deletion

# Time-based policy: how long log data is kept before deletion; the default is 7 days

log.retention.hours=

# A size-based retention policy for logs. Segments are pruned from the log as long as the remaining

# segments don't drop below log.retention.bytes.

# Size-based policy; 1 GB

#log.retention.bytes=

# The maximum size of a log segment file. When this size is reached a new log segment will be created.

# Log segmentation policy

log.segment.bytes=

# The interval at which log segments are checked to see if they can be deleted according

# to the retention policies

# How often to check whether log data has met the deletion criteria; 5 minutes

log.retention.check.interval.ms=

############################# Zookeeper #############################

# Zookeeper connection string (see zookeeper docs for details).

# This is a comma separated host:port pairs, each corresponding to a zk

# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".

# You can also append an optional chroot string to the urls to specify the

# root directory for all kafka znodes.

zookeeper.connect=uplooking01:,uplooking02:,uplooking03:

# Timeout in ms for connecting to zookeeper

zookeeper.connection.timeout.ms=
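When experimenting with a file like the one above, it can help to load it into a dictionary and see which settings are actually active (not commented out). The sketch below handles only the simple `key=value` and `#` comment lines shown in this article, not the full Java properties syntax; the function name parse_properties is hypothetical, and the values in the sample are for illustration only.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            continue  # not a key=value line
        settings[key.strip()] = value.strip()
    return settings


# Sample content with illustrative values only.
sample = """
# comment lines and commented-out settings are ignored
broker.id=1
#log.flush.interval.ms=1000
log.retention.hours=168
"""
settings = parse_properties(sample)
```

All values come back as strings, so a caller that needs numbers has to convert them, for example with int(settings["log.retention.hours"]).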