Table of contents

  • Overview
  • Manual setup
  • Principle
  • Operation
  • API
  • Performance tuning

Overview

Introduction

  1. Apache Kafka is a distributed streaming platform.
  2. Capabilities of a streaming platform:
    • It lets you publish and subscribe to streams of records. It is similar to a message queue or enterprise messaging system.
    • It lets you store streams of records in a fault-tolerant way.
    • It lets you process streams of records as they occur.

Use cases

  1. Messaging

    • In comparison to most messaging systems (e.g. ActiveMQ, RocketMQ), Kafka has better throughput, built-in partitioning, replication, and fault tolerance, which makes it a good solution for large-scale message processing applications.
  2. Website activity tracking
    • The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds.
    • Site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type. These feeds are available for subscription for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting.
  3. Metrics
    • Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.
  4. Log aggregation
    • Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages. This allows for lower-latency processing and easier support for multiple data sources and distributed data consumption.
    • In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.
  5. Stream processing
    • Many users of Kafka process data in processing pipelines consisting of multiple stages, where raw input data is consumed from Kafka topics and then aggregated, enriched, or otherwise transformed into new topics for further consumption or follow-up processing. Apart from Kafka Streams, alternative open source stream processing tools include Apache Storm and Apache Samza.
  6. Event sourcing
    • Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records.
  7. Commit log
    • Kafka can serve as a kind of external commit-log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data. The log compaction feature in Kafka helps support this usage. In this usage Kafka is similar to the Apache BookKeeper project.

Manual setup

Assumption

We assume that the JDK and ZooKeeper have already been installed.

Configuration

  1. Configure on one server.
tar zxvf kafka_2.10-0.10.2.1.tgz -C /opt/app
cd /opt/app/kafka_2.10-0.10.2.1
vi config/server.properties
broker.id=1 # Unique
log.dirs=/opt/data/kafka.logs # Data directory
default.replication.factor=3
zookeeper.connect=centos1:2181,centos2:2181/kafka # Specify chroot at the end, default chroot is /
# Recommended configuration for production servers
num.partitions=8
delete.topic.enable=false
auto.create.topics.enable=false
log.retention.hours=168
min.insync.replicas=2
queued.max.requests=500
  2. Copy the Kafka directory to the other servers.
scp -r /opt/app/kafka_2.10-0.10.2.1 hadoop@centos2:/opt/app
scp -r /opt/app/kafka_2.10-0.10.2.1 hadoop@centos3:/opt/app
  3. Remember to configure a unique broker.id on each of the other servers.
vi config/server.properties

Startup & test

  1. Start the daemon on all servers.
bin/kafka-server-start.sh -daemon config/server.properties
jps # verify that a Kafka process appears in the output
  2. Test the cluster.
bin/kafka-topics.sh --create --zookeeper centos1:2181,centos2:2181,centos3:2181/kafka --replication-factor 2 --partitions 2 --topic test-topic
bin/kafka-topics.sh --list --zookeeper centos1:2181,centos2:2181,centos3:2181/kafka
bin/kafka-topics.sh --describe --zookeeper centos1:2181,centos2:2181,centos3:2181/kafka --topic test-topic
bin/kafka-console-producer.sh --broker-list centos1:9092,centos2:9092,centos3:9092 --topic test-topic
bin/kafka-console-consumer.sh --bootstrap-server centos1:9092,centos2:9092,centos3:9092 --topic test-topic --from-beginning
  3. Shut down the daemon on all servers.
bin/kafka-server-stop.sh

Principle

Topic

  1. A topic is a category or feed name to which records are published.
  2. Partition
    • Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
    • The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism—more on that in a bit.

  3. Storage
    • The Kafka cluster retains all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is not a problem. A config sketch follows this list.
  4. Offset control
    • In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example, a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from "now".
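
A minimal server.properties sketch of the retention policy described in item 3 above (the two-day value and the disabled size limit are assumptions, not recommendations):
# keep data for two days
log.retention.hours=48
# no per-partition size limit (the default)
log.retention.bytes=-1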

Distribution

Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.

Producer

The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record).
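
A hedged sketch of such a semantic partition function for the Java client (the class name KeyHashPartitioner and its routing rule are assumptions; a custom partitioner is plugged in via the partitioner.class producer config):
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Hypothetical semantic partitioner: all records with the same key land in the same partition.
public class KeyHashPartitioner implements Partitioner {

    @Override
    public void configure(Map<String, ?> configs) {
        // no configuration needed for this sketch
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // keyless records all go to partition 0 in this simple sketch
        }
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() {
        // nothing to clean up
    }

}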

Consumer

  1. Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
  2. If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
  3. If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.

  4. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
  5. Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.
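
For the single-partition, total-order case in item 5, such a topic could be created like this (hosts and chroot as in the manual setup; the topic name is only an example):
bin/kafka-topics.sh --create --zookeeper centos1:2181,centos2:2181,centos3:2181/kafka --replication-factor 3 --partitions 1 --topic ordered-topic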

Operation

Adding topics

  1. Command:
bin/kafka-topics.sh --zookeeper zk_host:port/chroot --create --topic my_topic_name --partitions 20 --replication-factor 3 --config x=y
  2. You have the option of either adding topics manually or having them be created automatically when data is first published to a non-existent topic.
  3. The replication factor controls how many servers will replicate each message that is written. We recommend you use a replication factor of 2 or 3 so that you can transparently bounce machines without interrupting data consumption.
  4. Each sharded partition log is placed into its own folder under the Kafka log directory. The name of such folders consists of the topic name, appended by a dash (-) and the partition id.
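
For example, with the log.dirs setting from the manual setup above, a topic named my_topic_name with 20 partitions would be stored in folders like these (illustrative):
/opt/data/kafka.logs/my_topic_name-0
/opt/data/kafka.logs/my_topic_name-1
...
/opt/data/kafka.logs/my_topic_name-19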

Modifying topics

  1. Commands:

    • Add partitions
bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --partitions 40
    • Add configs
bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --config x=y
    • Remove a config
bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --delete-config x
  2. Adding partitions doesn't change the partitioning of existing data, so this may disturb consumers if they rely on that partitioning. That is, if data is partitioned by hash(key) % number_of_partitions then this partitioning will potentially be shuffled by adding partitions, but Kafka will not attempt to automatically redistribute data in any way (see the sketch after this list).
  3. Kafka does not currently support reducing the number of partitions for a topic.
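
A tiny self-contained illustration of that reshuffling (hypothetical code, not Kafka internals; the built-in default partitioner actually uses a murmur2 hash of the key bytes, but the effect of changing the partition count is the same):
import static java.lang.Math.floorMod;

// Hypothetical demo: the same key maps to a different partition once the partition count changes.
public class PartitionShuffleDemo {

    static int partitionFor(String key, int numberOfPartitions) {
        return floorMod(key.hashCode(), numberOfPartitions);
    }

    public static void main(String[] args) {
        String key = "user-42";
        System.out.println(partitionFor(key, 20)); // placement with 20 partitions
        System.out.println(partitionFor(key, 40)); // usually a different partition with 40
    }

}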

Removing a topic

  1. Command:
bin/kafka-topics.sh --zookeeper zk_host:port/chroot --delete --topic my_topic_name
  2. The topic deletion option is disabled by default. To enable it, set the server config:
delete.topic.enable=true

Graceful shutdown

  1. The Kafka cluster will automatically detect any broker shutdown or failure and elect new leaders for the partitions on that machine.
  2. Note that controlled shutdown will only succeed if all the partitions hosted on the broker have replicas (i.e. the replication factor is greater than 1 and at least one of these replicas is alive). This is generally what you want since shutting down the last replica would make that topic partition unavailable.
  3. Controlled leadership migration requires using a special setting:
controlled.shutdown.enable=true

Balancing leadership

  1. Whenever a broker stops or crashes, leadership for that broker's partitions transfers to other replicas, which can leave leadership unevenly distributed when the broker comes back. To address this, Kafka has a notion of preferred replicas. If the list of replicas for a partition is 1,5,9 then node 1 is preferred as the leader over nodes 5 and 9 because it is earlier in the replica list.
  2. You can have the Kafka cluster try to restore leadership to the restored replicas by running the command:
bin/kafka-preferred-replica-election.sh --zookeeper zk_host:port/chroot
  3. Configure Kafka to do this automatically:
auto.leader.rebalance.enable=true

Checking consumer position

  1. Before 0.9.0.0:
bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zookeeper localhost:2181 --group test
  2. Since 0.9.0.0:
bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 --describe --group test-consumer-group
  3. If you are using the old high-level consumer and storing the group metadata in ZooKeeper (i.e. offsets.storage=zookeeper):
bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --describe --group test-consumer-group

Listing consumer groups

  1. List all consumer groups across all topics:
bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 --list
  2. If you are using the old high-level consumer and storing the group metadata in ZooKeeper (i.e. offsets.storage=zookeeper):
bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --list

Infrequently used operations

For the following infrequently used operations, refer to the official documentation.

* Migrating data to new machines
* Increasing replication factor
* Limiting bandwidth usage during data migration
* Setting quotas

API

Overview

  1. Four core APIs:

    • The Producer API allows an application to publish a stream of records to one or more Kafka topics.
    • The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
    • The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
    • The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
  2. Maven dependency for the Producer API and Consumer API:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.10.2.0</version>
</dependency>

Producer API

  1. The producer is thread safe and sharing a single producer instance across threads will generally be faster than having multiple instances.
  2. The producer consists of a pool of buffer space that holds records that haven't yet been transmitted to the server as well as a background I/O thread that is responsible for turning these records into requests and transmitting them to the cluster. Failure to close the producer after use will leak these resources.
  3. The send() method is asynchronous. When called it adds the record to a buffer of pending record sends and immediately returns. This allows the producer to batch together individual records for efficiency (see the callback sketch in item 6 below).
  4. The producer provides software load balancing through an optionally user-specified partitioner, configured via the partitioner.class setting (the legacy Scala client used the kafka.producer.Partitioner interface).
  5. An example.
import java.util.Properties;
import java.util.UUID;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TestProducer {

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.put("bootstrap.servers", "centos1:9092,centos2:9092");
        conf.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        conf.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        final String topic = "test-topic";
        Producer<String, String> producer = new KafkaProducer<>(conf);
        try {
            for (int index = 0; index < 10000; index++) {
                String recordKey = UUID.randomUUID().toString();
                String recordValue = UUID.randomUUID().toString();
                ProducerRecord<String, String> record = new ProducerRecord<>(topic, recordKey, recordValue);
                producer.send(record);
            }

        } finally {
            producer.close();
        }
    }

}
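  6. Asynchronous send with a callback: a hedged sketch extending the example above to show that send() returns immediately and that blocking on the returned Future makes it effectively synchronous (the class name is illustrative).
import java.util.Properties;
import java.util.concurrent.Future;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class TestAsyncProducer {

    public static void main(String[] args) throws Exception {
        Properties conf = new Properties();
        conf.put("bootstrap.servers", "centos1:9092,centos2:9092");
        conf.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        conf.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(conf)) {
            ProducerRecord<String, String> record = new ProducerRecord<>("test-topic", "key", "value");

            // send() only appends the record to the client-side buffer and returns at once;
            // the callback fires when the broker acknowledges (or rejects) the record.
            producer.send(record, (RecordMetadata metadata, Exception exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("acked: partition = %d, offset = %d%n", metadata.partition(), metadata.offset());
                }
            });

            // Blocking on the returned Future makes the send effectively synchronous.
            Future<RecordMetadata> future = producer.send(record);
            RecordMetadata metadata = future.get();
            System.out.println("synchronous send landed at offset " + metadata.offset());
        }
    }

}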

Consumer API

  1. The position of the consumer gives the offset of the next record that will be given out. It will be one larger than the highest offset the consumer has seen in that partition. It automatically advances every time the consumer receives messages in a call to poll(long).
  2. The committed position is the last offset that has been stored securely. Should the process fail and restart, this is the offset that the consumer will recover to. The consumer can either automatically commit offsets periodically; or it can choose to control this committed position manually by calling one of the commit APIs (e.g. commitSync and commitAsync).
  3. Membership in a consumer group is maintained dynamically: if a process fails, the partitions assigned to it will be reassigned to other consumers in the same group. Similarly, if a new consumer joins the group, partitions will be moved from existing consumers to the new one. This is known as rebalancing the group and is discussed in more detail below. Group rebalancing is also used when new partitions are added to one of the subscribed topics or when a new topic matching a subscribed regex is created. The group will automatically detect the new partitions through periodic metadata refreshes and assign them to members of the group.
  4. Automatic offset committing example.
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TestConsumer {

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.put("bootstrap.servers", "centos1:9092,centos2:9092");
        conf.put("group.id", "test-group");
        conf.put("enable.auto.commit", "true");
        conf.put("auto.commit.interval.ms", "1000");
        conf.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        conf.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        final String topic = "test-topic";
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(conf);
        try {
            consumer.subscribe(Arrays.asList(topic));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
                }
            }
        } finally {
            consumer.close();
        }
    }

}
  5. Manual offset committing example.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TestConsumer {

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.put("bootstrap.servers", "centos1:9092,centos2:9092");
        conf.put("group.id", "test-group");
        conf.put("enable.auto.commit", "false");
        conf.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        conf.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        final String topic = "test-topic";
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(conf);
        try {
            consumer.subscribe(Arrays.asList(topic));
            final int minBatchSize = 200;
            List<ConsumerRecord<String, String>> buffer = new ArrayList<>();
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    buffer.add(record);
                }
                if (buffer.size() >= minBatchSize) {
                    insertIntoDb(buffer);
                    consumer.commitSync();
                    buffer.clear();
                }
            }
        } finally {
            consumer.close();
        }
    }

    private static void insertIntoDb(List<ConsumerRecord<String, String>> records) {
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
        }
    }

}
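  6. Per-partition offset committing example: a hedged variant of the example above that commits the exact offset for each partition after processing its records, instead of one commitSync() covering the whole poll (the class name is illustrative).
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class TestPartitionCommitConsumer {

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.put("bootstrap.servers", "centos1:9092,centos2:9092");
        conf.put("group.id", "test-group");
        conf.put("enable.auto.commit", "false");
        conf.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        conf.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        final String topic = "test-topic";
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(conf);
        try {
            consumer.subscribe(Arrays.asList(topic));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (TopicPartition partition : records.partitions()) {
                    List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
                    for (ConsumerRecord<String, String> record : partitionRecords) {
                        System.out.printf("partition = %d, offset = %d, key = %s, value = %s%n", record.partition(), record.offset(), record.key(), record.value());
                    }
                    // Commit the offset of the next record to be read for this partition only.
                    long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
                    consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));
                }
            }
        } finally {
            consumer.close();
        }
    }

}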
  7. Manual partition assignment example.
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class TestConsumer {

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.put("bootstrap.servers", "localhost:9092,localhost:9093");
        conf.put("group.id", "test-group");
        conf.put("enable.auto.commit", "true");
        conf.put("auto.commit.interval.ms", "1000");
        conf.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        conf.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        final String topic = "test-topic";
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(conf);
        try {
            TopicPartition partition0 = new TopicPartition(topic, 0);
//          TopicPartition partition1 = new TopicPartition(topic, 1);
            consumer.assign(Arrays.asList(partition0));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition = %d, offset = %d, key = %s, value = %s%n", record.partition(), record.offset(), record.key(), record.value());
                }
            }
        } finally {
            consumer.close();
        }
    }

}
  8. Controlling the consumer's position example.
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class TestConsumer {

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.put("bootstrap.servers", "localhost:9092,localhost:9093");
        conf.put("group.id", "test-group");
        conf.put("enable.auto.commit", "true");
        conf.put("auto.commit.interval.ms", "1000");
        conf.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        conf.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        final String topic = "test-topic";
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(conf);
        try {
            TopicPartition partition0 = new TopicPartition(topic, 0);
            TopicPartition partition1 = new TopicPartition(topic, 1);
            List<TopicPartition> partitions = Arrays.asList(partition0, partition1);
            consumer.assign(partitions);
//          consumer.seekToBeginning(partitions);
            consumer.seek(partition0, 1000);
            consumer.seek(partition1, 2000);
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition = %d, offset = %d, key = %s, value = %s%n", record.partition(), record.offset(), record.key(), record.value());
                }
            }
        } finally {
            consumer.close();
        }
    }

}

Performance tuning

  1. JDK: use the latest released version of JDK 1.8.
  2. Memory: you need sufficient memory to buffer active readers and writers. You can make a back-of-the-envelope estimate of memory needs by assuming you want to be able to buffer for 30 seconds and computing the requirement as write_throughput * 30; for example, at 100 MB/s of writes that is roughly 3 GB of buffer memory.
  3. Disk:
    • In general, disk throughput is the performance bottleneck, and more disks are better.
    • You can either RAID these drives together into a single volume or format and mount each drive as its own directory. Since Kafka has replication the redundancy provided by RAID can also be provided at the application level.
  4. OS:
    • File descriptor limits: We recommend at least 100000 allowed file descriptors for the broker processes as a starting point.
    • Max socket buffer size (see the sketch after this list).
  5. Filesystem: EXT4 has had more usage, but recent improvements to the XFS filesystem have shown it to have better performance characteristics for Kafka's workload with no compromise in stability.
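
A hedged sketch of the OS-level tuning mentioned in item 4 (exact values are assumptions; adjust for your environment):
# raise the open file descriptor limit for the shell that starts the broker
ulimit -n 100000
# server.properties: larger socket buffers can help on high-latency links
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576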
