Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and became an Apache project in July 2011. Today, Kafka is used by LinkedIn, Twitter, and Square for applications including log aggregation, queuing, and real-time monitoring and event processing.

In the upcoming 0.8 release, Kafka will support intra-cluster replication, which increases both the availability and the durability of the system. In this post, I will give an overview of Kafka's replication design.

Kafka Introduction

Kafka provides a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale web site. This kind of activity data (page views, searches, and other user actions) is a key ingredient in many of the social features of the modern web. Kafka differs from traditional messaging systems in that:

  1. It's designed as a distributed system that's easy to scale out.
  2. It persists messages on disk and thus can be used for batched consumption such as ETL, in addition to real-time applications.
  3. It offers high throughput for both publishing and subscribing.
  4. It supports multiple subscribers and automatically balances the consumers during failure.

Check out the Kafka Design Wiki for more details.
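As a quick illustration of the publish-subscribe model, the sketch below publishes a message to a topic and then reads it back as part of a consumer group. It uses the Java client API from later Kafka releases (the 0.8-era producer and consumer APIs differed), and the broker address, topic name, and group id are placeholders.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PubSubExample {
        public static void main(String[] args) {
            // Publish one message to a (placeholder) "page-views" topic.
            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092");
            producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                producer.send(new ProducerRecord<>("page-views", "user-42", "viewed /home"));
            }

            // Subscribe to the same topic as part of a (placeholder) consumer group.
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            consumerProps.put("group.id", "analytics");
            consumerProps.put("auto.offset.reset", "earliest");
            consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
                consumer.subscribe(Collections.singletonList("page-views"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }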

Replication

With replication, Kafka clients will get the following benefits:

  1. A producer can continue to publish messages during failures, and it can choose between latency and durability, depending on the application.
  2. A consumer continues to receive the correct messages in real time, even in the presence of failures.

All distributed systems must make trade-offs between guaranteeing consistency, availability, and partition tolerance (CAP Theorem). Our goal was to support replication in a Kafka cluster within a single datacenter, where network partitioning is rare, so our design focuses on maintaining highly available and strongly consistent replicas. Strong consistency means that all replicas are byte-for-byte identical, which simplifies the job of an application developer.

Strongly consistent replicas

In the literature, there are two typical approaches to maintaining strongly consistent replicas. Both require one of the replicas to be designated as the leader, to which all writes are issued. The leader is responsible for ordering all incoming writes and for propagating those writes to the other replicas (followers) in the same order.

The first approach is quorum-based. The leader waits until a majority of replicas have received the data before it is considered safe (i.e., committed). On leader failure, a new leader is elected through the coordination of a majority of the followers. This approach is used in Apache ZooKeeper and Google's Spanner.

The second approach is for the leader to wait for "all" (to be clarified later) replicas to receive the data. When the leader fails, any other replica can then take over as the new leader.

We selected the second approach for Kafka replication for two primary reasons:

  1. The second approach can tolerate more failures with the same number of replicas. That is, it can tolerate f failures with f+1 replicas, while the first approach often tolerates only f failures with 2f+1 replicas. For example, if there are only 2 replicas, the first approach can't tolerate any failures.
  2. While the first approach generally has better latency, as it hides the delay from a slow replica, our replication is designed for a cluster within the same datacenter, so variance due to network delay is small.

Terminology

To understand how replication is implemented in Kafka, we first need to introduce some basic concepts. In Kafka, a message stream is defined by a topic, which is divided into one or more partitions. Replication happens at the partition level, and each partition has one or more replicas.

The replicas are assigned evenly to different servers (called brokers) in a Kafka cluster. Each replica maintains a log on disk. Published messages are appended sequentially to the log, and each message is identified by a monotonically increasing offset within the log.

The offset is a logical concept within a partition. Given an offset, the same message can be identified in each replica of the partition. When a consumer subscribes to a topic, it keeps track of an offset in each partition for consumption and uses it to issue fetch requests to the broker.
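Because the offset is just a per-partition position, a consumer can manage it explicitly: assign itself a partition, seek to a previously saved offset, and fetch from there. The sketch below uses the Java consumer API from later Kafka releases to illustrate the idea; the topic name and the saved offset are hypothetical.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class OffsetTrackingExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("enable.auto.commit", "false"); // the application tracks offsets itself

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Consume partition 0 of a (placeholder) "page-views" topic, starting
                // from an offset the application checkpointed earlier (hypothetical value).
                TopicPartition partition = new TopicPartition("page-views", 0);
                long savedOffset = 1234L;
                consumer.assign(Collections.singletonList(partition));
                consumer.seek(partition, savedOffset);

                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // The same offset identifies the same message in every replica of the partition,
                    // so this position stays valid even if a later fetch is served by another broker.
                    savedOffset = record.offset() + 1;
                }
            }
        }
    }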

Implementation

 
Figure 1. A Kafka cluster with 4 brokers, 1 topic and 2 partitions, each with 3 replicas

When a producer publishes a message to a partition in a topic, the message is first forwarded to the leader replica of the partition and is appended to its log. The follower replicas keep pulling new messages from the leader. Once enough replicas have received the message, the leader commits it.

One subtle issue is how the leader decides what counts as enough. The leader can't always wait for writes to complete on all replicas, because any follower replica can fail and the leader can't wait indefinitely.

To address this problem, for each partition of a topic, we maintain an in-sync replica set (ISR). This is the set of replicas that are alive and have fully caught up with the leader (note that the leader is always in the ISR). When a partition is initially created, every replica is in the ISR. When a new message is published, the leader waits until it reaches all replicas in the ISR before committing the message. If a follower replica fails, it is dropped out of the ISR and the leader then continues to commit new messages with fewer replicas in the ISR. Notice that the system is now running in an under-replicated mode.
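The following sketch models that commit rule. It is a simplified illustration, not Kafka's actual implementation: a message counts as committed once every replica currently in the ISR has acknowledged it, and a failed or lagging replica is dropped from the ISR so the leader can keep committing.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Simplified model of the leader's commit decision for one partition.
    // Not Kafka's actual implementation; for illustration only.
    class PartitionLeader {
        private final Set<String> isr = new HashSet<>();               // in-sync replicas, leader included
        private final Map<String, Long> ackedOffset = new HashMap<>(); // highest offset acked by each replica
        private long lastCommittedOffset = -1L;

        PartitionLeader(Set<String> initialReplicas) {
            isr.addAll(initialReplicas);                 // initially, every replica is in the ISR
            initialReplicas.forEach(r -> ackedOffset.put(r, -1L));
        }

        // Called when a replica (the leader itself or a follower) has appended messages up to `offset`.
        void onReplicaAck(String replica, long offset) {
            ackedOffset.put(replica, offset);
            advanceCommit();
        }

        // Called when a replica fails or falls behind: drop it from the ISR so the
        // leader can keep committing with fewer replicas (under-replicated mode).
        void onReplicaFailure(String replica) {
            isr.remove(replica);
            advanceCommit();
        }

        private void advanceCommit() {
            // A message is committed once every replica still in the ISR has received it,
            // so the committed offset is the minimum acked offset across the ISR.
            long committed = isr.stream()
                    .mapToLong(r -> ackedOffset.getOrDefault(r, -1L))
                    .min()
                    .orElse(lastCommittedOffset);
            lastCommittedOffset = Math.max(lastCommittedOffset, committed);
        }

        long lastCommittedOffset() { return lastCommittedOffset; }
    }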

The leader also maintains a high watermark (HW), which is the offset of the last committed message in a partition. The HW is continuously propagated to the followers and is checkpointed to disk in each broker periodically for recovery.

When a failed replica is restarted, it first recovers the latest HW from disk and truncates its log to the HW. This is necessary since messages after the HW are not guaranteed to be committed and may need to be thrown away. Then, the replica becomes a follower and starts fetching messages after the HW from the leader. Once it has fully caught up, the replica is added back to the ISR and the system is back to the fully replicated mode.
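A minimal sketch of the recovery sequence described above, again a simplified model rather than Kafka's code: on restart, the replica truncates its local log to the last checkpointed HW and then fetches from the leader until it has caught up.

    import java.util.ArrayList;
    import java.util.List;

    // Simplified model of a replica's log recovery after a restart.
    // Not Kafka's actual implementation; for illustration only.
    class RecoveringReplica {
        private final List<String> log = new ArrayList<>(); // index == offset in this toy model

        // Step 1: truncate the log to the last checkpointed high watermark. Messages
        // after the HW are not guaranteed to be committed, so they are discarded.
        void truncateTo(long checkpointedHw) {
            while (log.size() > checkpointedHw) {
                log.remove(log.size() - 1);
            }
        }

        // Step 2: fetch messages after the HW from the leader until fully caught up.
        // Here `leaderLog` stands in for the leader's fetch responses.
        void catchUp(List<String> leaderLog) {
            for (int offset = log.size(); offset < leaderLog.size(); offset++) {
                log.add(leaderLog.get(offset));
            }
            // Once caught up, the leader adds this replica back to the ISR.
        }
    }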

Handling Failures

We rely on Zookeeper for detecting broker failures. Similar to Helix, we use a controller (embedded in one of the brokers) to receive all Zookeeper notifications about the failure and to elect new leaders. If a leader fails, the controller selects one of the replicas in the ISR as the new leader and informs the followers about the new leader.

By design, committed messages are always preserved during leadership change whereas some uncommitted data could be lost. The leader and the ISR for each partition are also stored in Zookeeper and are used during the failover of the controller. Both the leader and the ISR are expected to change infrequently since failures are rare.
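A rough sketch of the failover step, simplified for illustration: when the current leader fails, the controller can pick any live replica from the ISR as the new leader, which is safe because every ISR member holds all committed messages.

    import java.util.List;
    import java.util.Optional;

    // Simplified model of the controller's leader election for one partition.
    // Not Kafka's actual implementation; for illustration only.
    class ControllerFailover {
        Optional<String> electLeader(List<String> isr, String failedLeader, List<String> liveBrokers) {
            // Any live replica in the ISR has every committed message, so it can take over.
            // The controller would then inform the remaining followers of the new leader
            // and persist the new leader and ISR in Zookeeper.
            return isr.stream()
                    .filter(replica -> !replica.equals(failedLeader))
                    .filter(liveBrokers::contains)
                    .findFirst();
        }
    }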

On the consumer side, a broker exposes only committed messages. Since committed data is always preserved during broker failures, a consumer can automatically fetch messages from another replica, using the same offset.

A producer can choose when to receive the acknowledgement from the broker after publishing a message. For example, it can wait until the message is committed by the leader (i.e., it has been received by all replicas in the ISR). Alternatively, it may choose to receive an acknowledgement as soon as the message is appended to the log in the leader replica, even though it may not be committed yet. In the former case, the producer has to wait a bit longer, but all acknowledged messages are guaranteed to be kept by the brokers. In the latter case, the producer has lower latency, but a small number of acknowledged messages could be lost when a broker fails.
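In the Java producer API from later releases, this trade-off is exposed through the acks setting: acks=all waits until the message is committed (received by all replicas in the ISR), while acks=1 returns as soon as the leader has appended it to its log. The broker address and topic name below are placeholders.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class AckModesExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            // Durability first: the send is acknowledged only after the message is committed,
            // i.e. received by all replicas in the ISR. Higher latency, but acknowledged
            // messages are not lost on a broker failure.
            props.put("acks", "all");
            try (KafkaProducer<String, String> durable = new KafkaProducer<>(props)) {
                durable.send(new ProducerRecord<>("page-views", "key", "value")).get(); // block for the ack
            }

            // Latency first: acknowledged as soon as the leader appends the message to its log.
            // Faster, but an acknowledged message can be lost if the leader fails before commit.
            props.put("acks", "1");
            try (KafkaProducer<String, String> fast = new KafkaProducer<>(props)) {
                fast.send(new ProducerRecord<>("page-views", "key", "value")).get();
            }
        }
    }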

More info

For more details of the design and the implementation, check out the Kafka Replication Wiki and drop by my Kafka Replication Talk at ApacheCon 2013 in late February.

References:

http://engineering.linkedin.com/kafka/intra-cluster-replication-apache-kafka

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Replication
