Kafka Notes: Technical Points Summary
Table of contents
- Overview
- Manual setup
- Principle
- Operation
- API
- Performance tuning
Overview
Introduction
- Apache Kafka is a distributed streaming platform.
- Capabilities of streaming platform:
- It lets you publish and subscribe to streams of records. It is similar to a message queue or enterprise messaging system.
- It lets you store streams of records in a fault-tolerant way.
- It lets you process streams of records as they occur.
Use cases
- Messaging
- In comparison to most messaging systems (e.g. ActiveMQ, RocketMQ), Kafka has better throughput, built-in partitioning, replication, and fault tolerance, which makes it a good solution for large-scale message-processing applications.
- Website activity tracking
- The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds.
- Site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type. These feeds are available for subscription for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting.
- Metrics
- Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.
- Log aggregation
- Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages. This allows for lower-latency processing and easier support for multiple data sources and distributed data consumption.
- In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.
- Stream processing
- Many users of Kafka process data in processing pipelines consisting of multiple stages, where raw input data is consumed from Kafka topics and then aggregated, enriched, or otherwise transformed into new topics for further consumption or follow-up processing. Apart from Kafka Streams, alternative open source stream processing tools include Apache Storm and Apache Samza.
- Event sourcing
- Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records.
- Commit log
- Kafka can serve as a kind of external commit-log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data. The log compaction feature in Kafka helps support this usage. In this usage Kafka is similar to Apache BookKeeper project.
Manual setup
Assumption
We assume that the JDK and ZooKeeper have already been installed.
Configuration
- Configure on one server.
tar zxvf kafka_2.10-0.10.2.1.tgz -C /opt/app
cd /opt/app/kafka_2.10-0.10.2.1
vi config/server.properties
broker.id=1 # Unique
log.dirs=/opt/data/kafka.logs # Data directory
default.replication.factor=3
zookeeper.connect=centos1:2181,centos2:2181/kafka # Specify chroot at the end, default chroot is /
# Configurations for a production server
num.partitions=8
delete.topic.enable=false
auto.create.topics.enable=false
log.retention.hours=168
min.insync.replicas=2
queued.max.requests=500
- Copy Kafka directory to the other servers.
scp -r /opt/app/kafka_2.10-0.10.2.1 hadoop@centos2:/opt/app
scp -r /opt/app/kafka_2.10-0.10.2.1 hadoop@centos3:/opt/app
- Remember to configure broker.id on the other servers.
vi config/server.properties
Startup & test
- Start the daemon on all servers.
bin/kafka-server-start.sh -daemon config/server.properties
jps
# The jps output should include a process named "Kafka"
- Test.
bin/kafka-topics.sh --create --zookeeper centos1:2181,centos2:2181,centos3:2181/kafka --replication-factor 2 --partitions 2 --topic test-topic
bin/kafka-topics.sh --list --zookeeper centos1:2181,centos2:2181,centos3:2181/kafka
bin/kafka-topics.sh --describe --zookeeper centos1:2181,centos2:2181,centos3:2181/kafka --topic test-topic
bin/kafka-console-producer.sh --broker-list centos1:9092,centos2:9092,centos3:9092 --topic test-topic
bin/kafka-console-consumer.sh --bootstrap-server centos1:9092,centos2:9092,centos3:9092 --topic test-topic --from-beginning
- Shut down the daemon on all servers.
bin/kafka-server-stop.sh
Principle
Topic
- A topic is a category or feed name to which records are published.
- Partition
- Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
- The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism, as discussed in the Consumer section below.
- Storage
- The Kafka cluster retains all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem.
- Offset control
- In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from "now".
Distribution
Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
Producer
The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record).
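A minimal sketch of such a semantic partition function, assuming the Java kafka-clients producer used later in these notes; the class name is illustrative and not part of the original note:
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Illustrative partitioner: all records with the same key land in the same partition.
public class KeyHashPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (key == null) {
            return 0;  // no key: this sketch simply falls back to partition 0
        }
        // Mask the sign bit so the result is non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
A producer would register such a class through the partitioner.class configuration; without a custom partitioner, records with a key are hashed to a partition and records without a key are spread across partitions.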
Consumer
- Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
- If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
- If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
- This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
- Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.
Operation
Adding topics
- Command:
bin/kafka-topics.sh --zookeeper zk_host:port/chroot --create --topic my_topic_name --partitions 20 --replication-factor 3 --config x=y
- You have the option of either adding topics manually or having them be created automatically when data is first published to a non-existent topic.
- The replication factor controls how many servers will replicate each message that is written. We recommend you use a replication factor of 2 or 3 so that you can transparently bounce machines without interrupting data consumption.
- Each sharded partition log is placed into its own folder under the Kafka log directory. The name of such a folder consists of the topic name followed by a dash (-) and the partition id (e.g. my_topic_name-0).
Modifying topics
- Commands:
- Add partitions
bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --partitions 40
- Add configs
bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --config x=y
- Remove a config
bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --delete-config x
- Adding partitions doesn't change the partitioning of existing data, so this may disturb consumers if they rely on that partitioning. That is, if data is partitioned by hash(key) % number_of_partitions, then this partitioning will potentially be shuffled by adding partitions, but Kafka will not attempt to automatically redistribute data in any way (as the sketch below illustrates).
- Kafka does not currently support reducing the number of partitions for a topic.
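- A rough illustration of that reshuffling, as a sketch only (the real Java client hashes keys with murmur2 rather than hashCode, but the effect of changing the partition count is the same):
public class PartitionShuffleDemo {
    public static void main(String[] args) {
        String key = "user-42";  // hypothetical record key
        int before = (key.hashCode() & 0x7fffffff) % 8;   // topic originally created with 8 partitions
        int after  = (key.hashCode() & 0x7fffffff) % 12;  // partition count increased to 12
        // The two values generally differ, so new records for this key land in a different
        // partition than the old ones; Kafka does not move the previously written data.
        System.out.printf("partition before = %d, after = %d%n", before, after);
    }
}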
Removing a topic
- Command
bin/kafka-topics.sh --zookeeper zk_host:port/chroot --delete --topic my_topic_name
- Topic deletion is disabled by default. To enable it, set the server config:
delete.topic.enable=true
Graceful shutdown
- The Kafka cluster will automatically detect any broker shutdown or failure and elect new leaders for the partitions on that machine.
- Note that controlled shutdown will only succeed if all the partitions hosted on the broker have replicas (i.e. the replication factor is greater than 1 and at least one of these replicas is alive). This is generally what you want since shutting down the last replica would make that topic partition unavailable.
- Controlled leadership migration requires using a special setting:
controlled.shutdown.enable=true
Balancing leadership
- Whenever a broker stops or crashes leadership for that broker's partitions transfers to other replicas. To avoid this, Kafka has a notion of preferred replicas. If the list of replicas for a partition is 1,5,9 then node 1 is preferred as the leader to either node 5 or 9 because it is earlier in the replica list.
- You can have the Kafka cluster try to restore leadership to the restored replicas by running the command:
bin/kafka-preferred-replica-election.sh --zookeeper zk_host:port/chroot
- Configure Kafka to do this automatically:
auto.leader.rebalance.enable=true
Checking consumer position
- Before 0.9.0.0
bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zookeeper localhost:2181 --group test
- Since 0.9.0.0
bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 --describe --group test-consumer-group
- If you are using the old high-level consumer and storing the group metadata in ZooKeeper (i.e. offsets.storage=zookeeper):
bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --describe --group test-consumer-group
Listing consumer groups
- List all consumer groups across all topics:
bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 --list
- If you are using the old high-level consumer and storing the group metadata in ZooKeeper (i.e. offsets.storage=zookeeper):
bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --list
Infrequently used operations
For the following infrequently used operations, refer to the official documentation.
* Migrating data to new machines
* Increasing replication factor
* Limiting bandwidth usage during data migration
* Setting quotas
API
Overview
- Four core APIs:
- The Producer API allows an application to publish a stream of records to one or more Kafka topics.
- The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
- The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
- The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
- Maven dependency for producer API and consumer API.
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.10.2.0</version>
</dependency>
Producer API
- The producer is thread safe and sharing a single producer instance across threads will generally be faster than having multiple instances.
- The producer consists of a pool of buffer space that holds records that haven't yet been transmitted to the server as well as a background I/O thread that is responsible for turning these records into requests and transmitting them to the cluster. Failure to close the producer after use will leak these resources.
- The send() method is asynchronous: when called it adds the record to a buffer of pending record sends and immediately returns. This allows the producer to batch together individual records for efficiency. Errors are reported asynchronously; see the callback sketch after the example below.
- The producer provides software load balancing through an optional user-specified partitioner (kafka.producer.Partitioner in the old Scala producer; the Java client used here takes an org.apache.kafka.clients.producer.Partitioner via the partitioner.class config).
- An example.
import java.util.Properties;
import java.util.UUID;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TestProducer {
    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.put("bootstrap.servers", "centos1:9092,centos2:9092");
        conf.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        conf.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        final String topic = "test-topic";
        Producer<String, String> producer = new KafkaProducer<>(conf);
        try {
            for (int index = 0; index < 10000; index++) {
                String recordKey = UUID.randomUUID().toString();
                String recordValue = UUID.randomUUID().toString();
                ProducerRecord<String, String> record = new ProducerRecord<>(topic, recordKey, recordValue);
                producer.send(record);  // asynchronous: the record is buffered and sent by the background I/O thread
            }
        } finally {
            producer.close();  // flushes buffered records and releases resources
        }
    }
}
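- Because send() is asynchronous, failures surface through the returned Future or through a Callback rather than at the call site. A minimal callback sketch, reusing the broker list and topic of the example above (not part of the original note):
import java.util.Properties;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class TestProducerWithCallback {
    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.put("bootstrap.servers", "centos1:9092,centos2:9092");
        conf.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        conf.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<>(conf);
        try {
            ProducerRecord<String, String> record = new ProducerRecord<>("test-topic", "key-1", "value-1");
            producer.send(record, new Callback() {
                @Override
                public void onCompletion(RecordMetadata metadata, Exception exception) {
                    if (exception != null) {
                        exception.printStackTrace();  // the send failed after any retries
                    } else {
                        System.out.printf("sent to partition %d at offset %d%n", metadata.partition(), metadata.offset());
                    }
                }
            });
        } finally {
            producer.close();  // blocks until outstanding sends complete
        }
    }
}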
Consumer API
- The position of the consumer gives the offset of the next record that will be given out. It will be one larger than the highest offset the consumer has seen in that partition. It automatically advances every time the consumer receives messages in a call to poll(long).
- The committed position is the last offset that has been stored securely. Should the process fail and restart, this is the offset that the consumer will recover to. The consumer can either automatically commit offsets periodically; or it can choose to control this committed position manually by calling one of the commit APIs (e.g. commitSync and commitAsync).
- Membership in a consumer group is maintained dynamically: if a process fails, the partitions assigned to it will be reassigned to other consumers in the same group. Similarly, if a new consumer joins the group, partitions will be moved from existing consumers to the new one. This is known as rebalancing the group and is discussed in more detail below. Group rebalancing is also used when new partitions are added to one of the subscribed topics or when a new topic matching a subscribed regex is created. The group will automatically detect the new partitions through periodic metadata refreshes and assign them to members of the group. (A rebalance-listener sketch is included at the end of this section.)
- Automatic offset committing example.
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TestConsumer {
    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.put("bootstrap.servers", "centos1:9092,centos2:9092");
        conf.put("group.id", "test-group");
        conf.put("enable.auto.commit", "true");       // offsets are committed automatically
        conf.put("auto.commit.interval.ms", "1000");  // once per second
        conf.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        conf.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        final String topic = "test-topic";
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(conf);
        try {
            consumer.subscribe(Arrays.asList(topic));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
                }
            }
        } finally {
            consumer.close();
        }
    }
}
- Manual offset committing example.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TestConsumer {
    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.put("bootstrap.servers", "centos1:9092,centos2:9092");
        conf.put("group.id", "test-group");
        conf.put("enable.auto.commit", "false");  // commit offsets manually
        conf.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        conf.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        final String topic = "test-topic";
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(conf);
        try {
            consumer.subscribe(Arrays.asList(topic));
            final int minBatchSize = 200;
            List<ConsumerRecord<String, String>> buffer = new ArrayList<>();
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    buffer.add(record);
                }
                // Commit only after the batch has been persisted, so records are not lost on failure.
                if (buffer.size() >= minBatchSize) {
                    insertIntoDb(buffer);
                    consumer.commitSync();
                    buffer.clear();
                }
            }
        } finally {
            consumer.close();
        }
    }

    private static void insertIntoDb(List<ConsumerRecord<String, String>> records) {
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
        }
    }
}
- Manual partition assignment example.
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class TestConsumer {
    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.put("bootstrap.servers", "localhost:9092,localhost:9093");
        conf.put("group.id", "test-group");
        conf.put("enable.auto.commit", "true");
        conf.put("auto.commit.interval.ms", "1000");
        conf.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        conf.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        final String topic = "test-topic";
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(conf);
        try {
            // assign() takes explicit partitions instead of a group-managed subscription
            TopicPartition partition0 = new TopicPartition(topic, 0);
            // TopicPartition partition1 = new TopicPartition(topic, 1);
            consumer.assign(Arrays.asList(partition0));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition = %d, offset = %d, key = %s, value = %s%n", record.partition(), record.offset(), record.key(), record.value());
                }
            }
        } finally {
            consumer.close();
        }
    }
}
- Controlling consumer's position example.
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class TestConsumer {
    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.put("bootstrap.servers", "localhost:9092,localhost:9093");
        conf.put("group.id", "test-group");
        conf.put("enable.auto.commit", "true");
        conf.put("auto.commit.interval.ms", "1000");
        conf.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        conf.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        final String topic = "test-topic";
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(conf);
        try {
            TopicPartition partition0 = new TopicPartition(topic, 0);
            TopicPartition partition1 = new TopicPartition(topic, 1);
            List<TopicPartition> partitions = Arrays.asList(partition0, partition1);
            consumer.assign(partitions);
            // consumer.seekToBeginning(partitions);
            // Position the consumer explicitly before the first poll.
            consumer.seek(partition0, 1000);
            consumer.seek(partition1, 2000);
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition = %d, offset = %d, key = %s, value = %s%n", record.partition(), record.offset(), record.key(), record.value());
                }
            }
        } finally {
            consumer.close();
        }
    }
}
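- As referenced earlier, a minimal sketch of reacting to group rebalances with a ConsumerRebalanceListener, for example to commit offsets before partitions are revoked; class and topic names are illustrative, not from the original note:
import java.util.Arrays;
import java.util.Collection;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class TestRebalanceListener {
    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.put("bootstrap.servers", "centos1:9092,centos2:9092");
        conf.put("group.id", "test-group");
        conf.put("enable.auto.commit", "false");
        conf.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        conf.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(conf);
        try {
            consumer.subscribe(Arrays.asList("test-topic"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // Called before partitions are taken away; commit what has been processed so far.
                    consumer.commitSync();
                }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    System.out.println("assigned: " + partitions);
                }
            });
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset = %d, value = %s%n", record.offset(), record.value());
                }
                consumer.commitSync();
            }
        } finally {
            consumer.close();
        }
    }
}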
Performance tuning
- JDK: use the latest released version of JDK 1.8.
- Memory: you need sufficient memory to buffer active readers and writers. You can do a back-of-the-envelope estimate of memory needs by assuming you want to be able to buffer for 30 seconds and computing your memory need as write_throughput*30; for example, at 50 MB/s of writes that is roughly 1.5 GB of buffer memory.
- Disk:
- In general disk throughput is the performance bottleneck, and more disks is better.
- You can either RAID these drives together into a single volume or format and mount each drive as its own directory. Since Kafka has replication the redundancy provided by RAID can also be provided at the application level.
- OS:
- File descriptor limits: We recommend at least 100000 allowed file descriptors for the broker processes as a starting point.
- Max socket buffer size.
- Filesystem: EXT4 has had more usage, but recent improvements to the XFS filesystem have shown it to have better performance characteristics for Kafka's workload with no compromise in stability.