Apache Kafka is an attractive service because it’s conceptually simple and powerful. It’s easy to understand writing messages to a log in one place, then reading messages from that log in another place. This simplicity not only allows for a nice separation of concerns, but is also easier to understand than more complex alternatives. Plus, Kafka is powerful enough to pass messages at wire speed, going as fast as your network allows.

Understanding Apache Kafka

At the same time, Apache Kafka has a surprising number of moving parts that can be monitored, and it can be a little hard to know where to even start. In broad strokes, there are two categories to monitor: infrastructure and application performance. Most of the metric space for Kafka centers on measuring the movement of messages over a network, and the underlying mechanisms that support correct delivery of those messages. Kafka groups a subset of these metrics by “topic” and “consumer group,” and those typically address application performance, providing answers to questions such as “how many messages have been sent to topic X?” or “how far behind is consumer group Y?”

The Big Picture

To make things a bit more complicated, Kafka provides node-level metrics, and doesn’t provide a coordinated rollup across the logical service. So, you may want to monitor “active.controller.count” but every node tells you “zero” except the one controller node, which tells you “one,” and that role might move from node to node as instances are cycled. Or, you may want to monitor “messages.in” but each node only tells you the number of messages it has seen, rather than the sum total across the brokers. In these cases, it is necessary to create an aggregation across the nodes to get the sum total of messages, partitions, or other kinds of counts.
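To make the rollup idea concrete, here is a minimal sketch in Python. It assumes you have already scraped each broker's metrics into a plain dictionary; the broker names and sample values are made up for illustration, and the metric names are simply the ones used in this post.

```python
from collections import defaultdict


def aggregate_across_brokers(per_broker):
    """Sum each metric over every broker that reported it.

    per_broker maps a broker name to a {metric_name: value} dict.
    """
    totals = defaultdict(float)
    for broker_metrics in per_broker.values():
        for name, value in broker_metrics.items():
            totals[name] += value
    return dict(totals)


# Hypothetical scrape results: these brokers and numbers are invented.
sample = {
    "broker-1": {"messages.in": 1200, "active.controller.count": 0},
    "broker-2": {"messages.in": 950, "active.controller.count": 1},
    "broker-3": {"messages.in": 1010, "active.controller.count": 0},
}

cluster = aggregate_across_brokers(sample)
print(cluster["messages.in"], cluster["active.controller.count"])
```

A plain sum is the right rollup for counts like these; rates or percentiles would need a different aggregation.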

Now that we have the big picture, let’s get specific. There are lots of good sources for comprehensive monitoring and laundry lists of metrics, so instead I’ll give a couple of embarrassing stories from our own experience when we were first adopting this wonderful technology.

Embarrassing story #1:

It doesn’t work when the disk is full. I think this one must happen to everyone at some point, because you sort of hope that Kafka will manage disk space holistically. Instead, Kafka has a fine-grained policy for the limits of a partition. It doesn’t have an overall limit and it cannot work backward to keep the disk from becoming full by throwing away older data. Instead, it just fills the disk and tips over, as if that’s what you asked it to do, even though you didn’t really mean it. So, we had just rolled out a small staging cluster with mostly default settings, and within a few days the disks were full and everything failed. Luckily, it failed fast.

If you tell this embarrassing story to an experienced Kafka dev, they might say “oh, you didn’t configure things properly.” And, yes, it is possible to take a guess at how many topics you need to support at most, and then work backward from a number of partitions and configure settings that might keep the disk from filling so long as you stay under that number of partitions. However, the number of partitions is a moving target: topics can be made on demand, topics with more partitions than the default can be made, etc. You can make an imaginary partition budget, but there’s no way to enforce that budget. What you really want, in the general case, is to use the disk up to a point, then hope Kafka acts like a FIFO and throws away old things to make room for new things. Until that’s available, I don’t think there’s a robust way to defend against filling the disk with Kafka configuration alone, so monitoring disk utilization is essential.
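For what it's worth, the imaginary partition budget is simple arithmetic. The sketch below assumes a size-based per-partition retention limit (Kafka's retention.bytes) and pads it by one segment, on the assumption that a partition can overshoot its limit by roughly one segment before old data is deleted. The disk size, limits, and headroom are invented numbers, not recommendations.

```python
GIB = 1024 ** 3


def max_partition_replicas(disk_bytes, retention_bytes, segment_bytes,
                           headroom_percent=80):
    """Rough upper bound on partition replicas one broker disk can hold.

    Assumes every partition grows to its retention limit plus one extra
    segment, and that we only want to fill headroom_percent of the disk.
    """
    usable = disk_bytes * headroom_percent // 100
    per_partition = retention_bytes + segment_bytes
    return usable // per_partition


# Hypothetical broker: 1 TiB disk, 10 GiB retention, 1 GiB segments.
budget = max_partition_replicas(1024 * GIB, 10 * GIB, 1 * GIB)
print(budget)
```

The catch is the one described above: nothing stops someone from creating partition number budget + 1.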

We use a nice machine learning model in our product that learns disk utilization trends and warns us if the trend predicts a full disk in the near future. We get a warning if the full disk is three days out (the long weekend), and a critical alert if eight hours out. What’s great about this approach is that it adapts to real utilization automatically, pays attention to the velocity of data rather than some arbitrary threshold, and is independent of the number of active topics, and independent of the activity level in different topics. See how I turned our embarrassment into a positive? Pretty slick, right?
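To show the shape of the idea (and only the shape; this is not the model in our product), here is a sketch that swaps in a plain least-squares line over recent disk samples and extrapolates to the point where the disk is full. The function names and sample data are hypothetical; the 72-hour and 8-hour thresholds are the ones mentioned above.

```python
def hours_until_full(samples, capacity_bytes):
    """Project hours until the disk is full from (hour, used_bytes) samples.

    Fits a least-squares line to get the growth rate, then extrapolates
    from the most recent sample. Returns None if usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    _, last_used = samples[-1]
    return max(0.0, (capacity_bytes - last_used) / slope)


def classify(hours_left, warn_hours=72, crit_hours=8):
    """Map a projection to an alert level: three days out or eight hours out."""
    if hours_left is None:
        return "ok"
    if hours_left <= crit_hours:
        return "critical"
    if hours_left <= warn_hours:
        return "warning"
    return "ok"


# Hypothetical hourly samples growing by 10 units per hour.
history = [(0, 100), (1, 110), (2, 120), (3, 130)]
print(classify(hours_until_full(history, capacity_bytes=730)))
```

A straight line is a crude stand-in for a learned trend, but it already has the property that matters here: it reacts to the velocity of the data rather than to a fixed utilization threshold.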

Embarrassing story #2:

Broker down is bad. We deployed three Kafka brokers in staging, and the story we told ourselves was that one broker could go down and the other two would pick up the slack. Like any quorum-based service, we hoped, two out of three would mean a majority would always be up and running. So, sure enough a broker went down, and we were like “cool, we’ll fix that later, but leave it for now because it might offer good clues we’ll want to investigate.” Then, much later, one of the producers wedged. Then another. Then it was an outage. It turns out we had configured those Kafka Producers to require acks from each of the replicas, and there were three replicas. So, when one of the brokers went down, there was no fourth broker to take up the missing replica and the Producers were only able to get two acks. Everything had seemed to keep working for a long while. But, while the broker was down, the Producers held on to resources for each message, hoping for the last ack, and eventually crashed the JVM when it ran out of resources. This one was a little tricky to figure out because there were no logged failures on the Producers, and the Consumers were happy to read from the replicas that were working, so it wasn’t obvious that there was a problem with Kafka.

The fix was to configure the Producers to only require one acknowledgement before considering the message delivered. Also, we had to stop thinking about Kafka as if it were another quorum-type service. A better mental image is to think of Kafka brokers as partition servers, and the number of brokers should be the number of partitions you need to serve in parallel at full speed. It won’t quite work out that way in practice, but that mental model is better than thinking about quorum. Then, once you have a number of brokers established, think about replication as providing failure tolerance, but work backward. If one broker goes down, the replication factor should be less than or equal to the number of remaining brokers, because otherwise there’s nowhere to move the replicas. Three replicas on three brokers means when a broker goes down, there’s nothing to take on the third replica and you have a policy violation / outage.
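The work-backward rule is small enough to write down as a check. This sketch only encodes the rule of thumb from this post (replication factor no greater than the brokers left standing); the function name is hypothetical and it says nothing about how Kafka itself places replicas.

```python
def survives_broker_loss(broker_count, replication_factor, brokers_down=1):
    """True if the remaining brokers can still hold every replica."""
    remaining = broker_count - brokers_down
    return replication_factor <= remaining


# Our staging setup: three brokers, three replicas, one broker lost.
print(survives_broker_loss(3, 3))
# Adding a fourth broker, or dropping to two replicas, leaves room.
print(survives_broker_loss(4, 3), survives_broker_loss(3, 2))
```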

Also, note that when the broker is down, it naturally doesn’t have any metrics to report. You can’t tell from inspecting the down node that something is wrong with the cluster. You have to go add up all the metrics from the other brokers. So, it is important to have service-wide aggregations for key performance indicators, such as under.replicated.partitions, offline.partition.count, bytes.in, bytes.out, messages.in, etc. Adding up all the active.controller.counts will tell you that there is no active controller if the sum is zero, and that some split-brain catastrophe has happened if the sum is more than one. With service aggregations, you can assess the impact of a broker going down: either the cluster recovers and is fine, or it gets into trouble pretty quickly.
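The controller check reads almost directly as code. This sketch assumes you have collected the active.controller.count value from every broker that is still reporting; the function name and status strings are made up for illustration.

```python
def controller_status(active_controller_counts):
    """Interpret the cluster-wide sum of active.controller.count values."""
    total = sum(active_controller_counts)
    if total == 0:
        return "no-controller"
    if total == 1:
        return "ok"
    return "split-brain"


# One broker is down and reports nothing; the two survivors report 0 and 1.
print(controller_status([0, 1]))
```

Note that the down broker simply contributes nothing to the sum, which is exactly why the check has to run on the aggregate rather than on any single node.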

Lastly, I’d like to point out that in some environments it may be possible to tame the large list of metrics in Kafka while still collecting everything that’s interesting. Whenever a topic or partition is created, data about it is reported essentially forever. A topic may be abandoned and all the message data expired, but all the metrics around that topic will continue to report values. In staging and testing-type environments, especially, we’ve seen large collections of abandoned topics. But even in production, given enough time, the collection of inactive topics can grow substantially. Kafka doesn’t particularly encourage cleaning up old topics, so eventually you have a long list of all the topics ever made. As an adaptive filter, one can track message counts and ignore all the metrics related to topics that had no increase in the number of messages. That way, you can reasonably “watch everything live” and avoid having to maintain a list of topics to be monitored, or otherwise manually tune your data collection.
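A minimal sketch of that adaptive filter, assuming you keep the per-topic message counts from the previous collection pass and compare them with the current one. The topic names and counts are invented.

```python
def active_topics(previous_counts, current_counts):
    """Return topics whose message count grew since the last pass.

    Topics seen for the first time count as active if they have any messages.
    """
    return {
        topic
        for topic, count in current_counts.items()
        if count > previous_counts.get(topic, 0)
    }


# Hypothetical counts from two consecutive collection passes.
previous = {"orders": 5000, "old-experiment": 120, "clicks": 90000}
current = {"orders": 5250, "old-experiment": 120, "clicks": 91500, "signups": 3}

keep = active_topics(previous, current)
print(sorted(keep))
```

Metrics for anything outside the returned set can be skipped on this pass; a topic that wakes up again is picked up automatically the next time its count moves.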

In the next part of the Monitoring Kafka series, we will discuss Kafka consumer lag and some other concerns in a little more detail.
