Kafka Monitoring

1. Metrics

Kafka has two metrics packages, which are easy to confuse when reading the source code:

package kafka.metrics

package org.apache.kafka.common.metrics

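Both package names end in metrics, but the two packages have different responsibilities, and their classes do not reference each other at all, so they can be treated as two completely independent packages. The kafka.metrics package mainly wraps the Yammer Metrics API and exposes Kafka's performance metrics to clients.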

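The org.apache.kafka.common.metrics package is the main subject of this article. It is not meant to serve clients; it is called by other Kafka components, such as ReplicaManager, PartitionManager, and QuotaManager, so that these managers can see how Kafka is currently running and make decisions accordingly.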

Metrics is first initialized in the startup() method of KafkaServer:

metrics = new Metrics(metricConfig, reporters, kafkaMetricsTime, true)

quotaManagers = QuotaFactory.instantiate(config, metrics, time)

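A Metrics instance is created and passed in when the quota managers are built. Briefly, the quota managers are what Kafka uses to limit the transfer rate of clients such as producers. For example, if the configuration says a producer may not send data faster than 5 MB/s, that limit is enforced through a quota manager.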

Back to Metrics; let's follow the code.

public class Metrics implements Closeable {

 ....

 ....

    private final ConcurrentMap<MetricName, KafkaMetric> metrics;

    private final ConcurrentMap<String, Sensor> sensors;

The two ConcurrentMaps, metrics and sensors, are the two key fields of Metrics. So what is a KafkaMetric, and what is a Sensor?

Let's look at KafkaMetric first.

KafkaMetric implements the Metric interface, whose core method value() returns the value of the quantity being monitored.

public interface Metric {

    /**

     * A name for this metric

     */

    public MetricName metricName();

    /**

     * The value of the metric

     */

    public double value();

}

How does KafkaMetric implement value()?

@Override

public double value() {

    synchronized (this.lock) {

        return value(time.milliseconds());

    }

}

double value(long timeMs) {

    return this.measurable.measure(config, timeMs);

}

So value() is delegated to another field of KafkaMetric, measurable:

public interface Measurable {

    /**

     * Measure this quantity and return the result as a double

     * @param config The configuration for this metric

     * @param now The POSIX time in milliseconds the measurement is being taken

     * @return The measured value

     */

    public double measure(MetricConfig config, long now);

}

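This part is a bit convoluted: Metrics holds KafkaMetric instances, and each KafkaMetric obtains the measured value through a Measurable. By analogy, Metrics is a car's dashboard, a KafkaMetric is one gauge on that dashboard, and a Measurable wraps the component that is actually being measured. Here is a simple Measurable implementation, from Sender.java: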

metrics.addMetric(m, new Measurable() {

    public double measure(MetricConfig config, long now) {

        return (now - metadata.lastSuccessfulUpdate()) / 1000.0;

    }

});

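As you can see, measure simply returns the value of interest. Because the anonymous class is defined inside the target class, it can reference that class's fields directly.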
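The dashboard analogy can be tried out without any Kafka dependency. The following is a minimal sketch under that assumption: GaugeRegistrySketch and Gauge are hypothetical stand-ins for Metrics and Measurable, not Kafka's classes, and metric names are plain strings rather than MetricName objects.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class GaugeRegistrySketch {
    /** Hypothetical stand-in for Measurable: computes a value on demand. */
    public interface Gauge {
        double measure(long nowMs);
    }

    // Stand-in for the Metrics registry: metric name -> gauge.
    private final Map<String, Gauge> gauges = new ConcurrentHashMap<>();

    public void addMetric(String name, Gauge gauge) {
        gauges.put(name, gauge);
    }

    /** Reads one gauge at the given time, in the spirit of KafkaMetric.value(timeMs). */
    public double value(String name, long nowMs) {
        Gauge gauge = gauges.get(name);
        if (gauge == null) {
            throw new IllegalArgumentException("unknown metric: " + name);
        }
        return gauge.measure(nowMs);
    }

    /** Registers an "age in seconds" gauge and reads it 2500 ms after the last update. */
    public static double demoAgeSeconds() {
        AtomicLong lastUpdateMs = new AtomicLong(10_000L);
        GaugeRegistrySketch registry = new GaugeRegistrySketch();
        // The lambda captures lastUpdateMs directly, like the anonymous class in Sender.
        registry.addMetric("metadata-age", now -> (now - lastUpdateMs.get()) / 1000.0);
        return registry.value("metadata-age", 12_500L);
    }

    public static void main(String[] args) {
        System.out.println(demoAgeSeconds());
    }
}
```

The registry never stores a number; it stores something that can produce the number when asked, which is why the gauge always reflects the current state of the measured component.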

Next comes Sensor, the value type of the second ConcurrentMap:

private final ConcurrentMap<String, Sensor> sensors;

Here is the source of the Sensor class:

/**

 * A sensor applies a continuous sequence of numerical values to a set of associated metrics. For example a sensor on

 * message size would record a sequence of message sizes using the {@link #record(double)} api and would maintain a set

 * of metrics about request sizes such as the average or max.

 */

public final class Sensor {

    // A Kafka process has only one Metrics instance; registry is a reference to it

    private final Metrics registry;

    private final String name;

    private final Sensor[] parents;

    private final List<Stat> stats;

    private final List<KafkaMetric> metrics;

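The class comment is informative: it shows that a Sensor plays a different role from a KafkaMetric. A KafkaMetric only returns the value of one quantity, whereas a Sensor computes statistics over the time series of a quantity, such as the average, maximum, or minimum. How are these statistics implemented? The answer is the List<Stat> stats field.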

public interface Stat {

    /**

     * Record the given value

     * @param config The configuration to use for this metric

     * @param value The value to record

     * @param timeMs The POSIX time in milliseconds this value occurred

     */

    public void record(MetricConfig config, double value, long timeMs);

}

Stat is an interface whose record method records one sampled value. As an example, here is how the max statistic is implemented as a Stat:

public final class Max extends SampledStat {

    public Max() {

        super(Double.NEGATIVE_INFINITY);

    }

    @Override

    protected void update(Sample sample, MetricConfig config, double value, long now) {

        sample.value = Math.max(sample.value, value);

    }

    @Override

    public double combine(List<Sample> samples, MetricConfig config, long now) {

        double max = Double.NEGATIVE_INFINITY;

        for (int i = 0; i < samples.size(); i++)

            max = Math.max(max, samples.get(i).value);

        return max;

    }

}

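Simple enough. update compares the incoming value with the maximum already held by the current sample and keeps the larger one, like a single bubble step; combine scans all samples to find the overall maximum, like one full bubble pass. Note that Max extends SampledStat, which in turn implements the Stat interface. Now back to the Sensor class.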
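The split between update (per sample) and combine (across samples) can be sketched in isolation. This is a simplified illustration, not Kafka's SampledStat: WindowedMaxSketch is a hypothetical class, and it rolls over to a new sample after a fixed number of events, which is an assumption made to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowedMaxSketch {
    /** One sample window; plays the role of SampledStat.Sample. */
    static final class Sample {
        double value = Double.NEGATIVE_INFINITY;
    }

    private final List<Sample> samples = new ArrayList<>();
    private final int eventsPerSample; // assumption: windows roll over by event count only
    private int eventsInCurrent = 0;

    public WindowedMaxSketch(int eventsPerSample) {
        this.eventsPerSample = eventsPerSample;
        samples.add(new Sample());
    }

    /** "update": fold the new value into the current sample only. */
    public void record(double value) {
        if (eventsInCurrent == eventsPerSample) {
            samples.add(new Sample());
            eventsInCurrent = 0;
        }
        Sample current = samples.get(samples.size() - 1);
        current.value = Math.max(current.value, value);
        eventsInCurrent++;
    }

    /** "combine": scan every sample for the overall maximum. */
    public double combine() {
        double max = Double.NEGATIVE_INFINITY;
        for (Sample sample : samples) {
            max = Math.max(max, sample.value);
        }
        return max;
    }

    /** Convenience helper: records all values with two events per sample. */
    public static double maxOf(double... values) {
        WindowedMaxSketch stat = new WindowedMaxSketch(2);
        for (double v : values) {
            stat.record(v);
        }
        return stat.combine();
    }

    public static void main(String[] args) {
        System.out.println(maxOf(3.0, 9.0, 4.0, 1.0, 7.0));
    }
}
```

Keeping one value per sample means an old sample can later be dropped as a unit, which is what lets a windowed statistic forget old data. Sensor.record, covered next, is what feeds values into such stats.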

public void record(double value, long timeMs) {

    this.lastRecordTime = timeMs;

    synchronized (this) {

        // increment all the stats

        for (int i = 0; i < this.stats.size(); i++)

            this.stats.get(i).record(config, value, timeMs);

        checkQuotas(timeMs);

    }

    for (int i = 0; i < parents.length; i++)

        parents[i].record(value, timeMs);

}

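The record method submits the value to every Stat registered on the sensor, checks the quotas, and then, if the sensor has parent sensors, records the same value on each of them.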
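The parent propagation is what lets a per-topic sensor feed an all-topics sensor automatically. Here is a minimal sketch of that idea; ParentSensorSketch is a hypothetical class that keeps only a running total instead of a list of Stat objects.

```java
public class ParentSensorSketch {
    private final String name;
    private final ParentSensorSketch[] parents;
    private double total = 0.0;

    public ParentSensorSketch(String name, ParentSensorSketch... parents) {
        this.name = name;
        this.parents = parents;
    }

    /** Updates this sensor under its own lock, then forwards the value to each parent. */
    public void record(double value) {
        synchronized (this) {
            total += value;
        }
        // Parents are updated outside the child's lock, as in Sensor.record.
        for (ParentSensorSketch parent : parents) {
            parent.record(value);
        }
    }

    public synchronized double total() {
        return total;
    }

    public String name() {
        return name;
    }

    /** Two topic sensors share one parent; the parent sees the sum of both. */
    public static double demoParentTotal() {
        ParentSensorSketch all = new ParentSensorSketch("all-topics");
        ParentSensorSketch topicA = new ParentSensorSketch("topic-a", all);
        ParentSensorSketch topicB = new ParentSensorSketch("topic-b", all);
        topicA.record(100.0);
        topicB.record(50.0);
        topicA.record(25.0);
        return all.total();
    }

    public static void main(String[] args) {
        System.out.println(demoParentTotal());
    }
}
```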

public void checkQuotas(long timeMs) {

    for (int i = 0; i < this.metrics.size(); i++) {

        KafkaMetric metric = this.metrics.get(i);

        MetricConfig config = metric.config();

        if (config != null) {

            Quota quota = config.quota();

            if (quota != null) {

                double value = metric.value(timeMs);

                if (!quota.acceptable(value)) {

                    throw new QuotaViolationException(

                        metric.metricName(),

                        value,

                        quota.bound());

                }

            }

        }

    }

}

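checkQuotas iterates over every KafkaMetric registered on the sensor and checks whether its value exceeds the quota configured for it. Note the QuotaViolationException, which should look familiar: in the quota manager, if a client's produce or fetch rate exceeds its configured quota, this exception is thrown.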

try {

  clientSensors.quotaSensor.record(value)

  // trigger the callback immediately if quota is not violated

  callback()

} catch {

  case qve: QuotaViolationException =>

    // Compute the delay

    val clientMetric = metrics.metrics().get(clientRateMetricName(clientQuotaEntity.sanitizedUser, clientQuotaEntity.clientId))

    throttleTimeMs = throttleTime(clientMetric, getQuotaMetricConfig(clientQuotaEntity.quota))

    clientSensors.throttleTimeSensor.record(throttleTimeMs)

    // If delayed, add the element to the delayQueue

    delayQueue.add(new ThrottledResponse(time, throttleTimeMs, callback))

    delayQueueSensor.record()

    logger.debug("Quota violated for sensor (%s). Delay time: (%d)".format(clientSensors.quotaSensor.name(), throttleTimeMs))

}

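This is easy to follow: the produce or fetch value is recorded on the client's quota sensor. If recording succeeds, the callback is invoked immediately; if it fails, the exception caught is QuotaViolationException, and the response is placed on the delay queue instead.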
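The record-then-throttle flow can be sketched in plain Java with java.util.concurrent.DelayQueue. This is an illustration under stated assumptions, not the quota manager's code: QuotaThrottleSketch, QuotaExceeded, and Throttled are hypothetical names, the quota is a bound on a running total rather than on a rate, and the delay is passed in by the caller instead of being computed from how far the client is over its quota.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

public class QuotaThrottleSketch {
    /** Hypothetical stand-in for QuotaViolationException. */
    static final class QuotaExceeded extends RuntimeException {
        QuotaExceeded(double value, double bound) {
            super("value " + value + " exceeds bound " + bound);
        }
    }

    /** A callback that becomes available only after its delay has elapsed. */
    static final class Throttled implements Delayed {
        private final long readyAtNanos;
        final Runnable callback;

        Throttled(long delayMs, Runnable callback) {
            this.readyAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMs);
            this.callback = callback;
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.NANOSECONDS), other.getDelay(TimeUnit.NANOSECONDS));
        }
    }

    private final double bound;
    private double total = 0.0;
    final DelayQueue<Throttled> delayQueue = new DelayQueue<>();

    QuotaThrottleSketch(double bound) {
        this.bound = bound;
    }

    /** Records first and checks afterwards, the same order as in Sensor.record. */
    void record(double value) {
        total += value;
        if (total > bound) {
            throw new QuotaExceeded(total, bound);
        }
    }

    /** Runs the callback at once if within quota, otherwise parks it. Returns true if parked. */
    boolean recordAndMaybeThrottle(double value, long delayMs, Runnable callback) {
        try {
            record(value);
        } catch (QuotaExceeded e) {
            delayQueue.add(new Throttled(delayMs, callback));
            return true;
        }
        callback.run();
        return false;
    }

    /** Returns {callbacks run immediately, callbacks parked on the delay queue}. */
    public static int[] demo() {
        QuotaThrottleSketch quota = new QuotaThrottleSketch(10.0);
        int[] calls = {0};
        quota.recordAndMaybeThrottle(6.0, 100, () -> calls[0]++); // 6 <= 10: runs at once
        quota.recordAndMaybeThrottle(6.0, 100, () -> calls[0]++); // 12 > 10: parked
        return new int[] {calls[0], quota.delayQueue.size()};
    }

    public static void main(String[] args) {
        int[] result = demo();
        System.out.println("immediate=" + result[0] + ", delayed=" + result[1]);
    }
}
```

A separate thread would normally call delayQueue.take() in a loop and run each callback once its delay has expired; that part is left out here.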

The metrics model itself is quite simple; what makes it feel convoluted is the layering of abstractions: Metrics, KafkaMetric, Sensor, and Stat.

Finally, Metrics starts a dedicated thread whose job is to remove sensors that have not been used for a long time. The thread is named "SensorExpiryThread".

class ExpireSensorTask implements Runnable {

    public void run() {

        for (Map.Entry<String, Sensor> sensorEntry : sensors.entrySet()) {

            // removeSensor also locks the sensor object. This is fine because synchronized is reentrant

            // There is however a minor race condition here. Assume we have a parent sensor P and child sensor C.

            // Calling record on C would cause a record on P as well.

            // So expiration time for P == expiration time for C. If the record on P happens via C just after P is removed,

            // that will cause C to also get removed.

            // Since the expiration time is typically high it is not expected to be a significant concern

            // and thus not necessary to optimize

            synchronized (sensorEntry.getValue()) {

                if (sensorEntry.getValue().hasExpired()) {

                    log.debug("Removing expired sensor {}", sensorEntry.getKey());

                    removeSensor(sensorEntry.getKey());

                }

            }

        }

    }

}
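The expiry pass boils down to scanning a concurrent map and removing entries that have been idle for too long. Below is a minimal sketch of that idea; ExpirySketch is a hypothetical class that tracks only a last-record timestamp per sensor name, and the clock is passed in explicitly so the behaviour is deterministic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ExpirySketch {
    // Sensor name -> time of the last record() call, in milliseconds.
    private final Map<String, Long> lastRecordMs = new ConcurrentHashMap<>();
    private final long inactiveExpiryMs;

    public ExpirySketch(long inactiveExpiryMs) {
        this.inactiveExpiryMs = inactiveExpiryMs;
    }

    /** Marks the named sensor as used at the given time. */
    public void touch(String name, long nowMs) {
        lastRecordMs.put(name, nowMs);
    }

    /** One cleanup pass: removes entries idle for longer than the expiry time. */
    public int expire(long nowMs) {
        int removed = 0;
        // ConcurrentHashMap iterators tolerate removal during iteration.
        for (Map.Entry<String, Long> entry : lastRecordMs.entrySet()) {
            boolean idle = nowMs - entry.getValue() > inactiveExpiryMs;
            // remove(key, value) only succeeds if nobody touched the entry in between.
            if (idle && lastRecordMs.remove(entry.getKey(), entry.getValue())) {
                removed++;
            }
        }
        return removed;
    }

    public int size() {
        return lastRecordMs.size();
    }

    /** "a" was last used at t=0 and "b" at t=900; at t=1500 only "a" is past the 1000 ms limit. */
    public static int demoRemoved() {
        ExpirySketch sensors = new ExpirySketch(1000L);
        sensors.touch("a", 0L);
        sensors.touch("b", 900L);
        return sensors.expire(1500L);
    }

    public static void main(String[] args) {
        System.out.println(demoRemoved());
    }
}
```

In a long-running process, a ScheduledExecutorService could call expire(System.currentTimeMillis()) at a fixed interval, which is the role the expiry thread plays.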

2. JMX

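This part reads a few of Kafka's monitoring attributes through JMX to show how Kafka can be monitored with JMX.
For general JMX usage, consult the JMX documentation.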

Before using JMX, make sure JMX monitoring is enabled on Kafka: set JMX_PORT=9999 when starting the broker, that is:

JMX_PORT=9999 bin/kafka-server-start.sh config/server.properties &
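Once JMX is enabled, attributes are read through an MBeanServerConnection. The mechanics can be tried without a broker: the sketch below reads the JVM's own java.lang:type=Memory MBean from the platform MBean server. LocalJmxReadSketch is a hypothetical class name; against a remote Kafka broker, only the way the connection is obtained would differ.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

public class LocalJmxReadSketch {
    /** Reads the "used" field of the HeapMemoryUsage attribute of the local JVM. */
    public static long heapUsedBytes() {
        try {
            // MBeanServer extends MBeanServerConnection, so the same calls work remotely.
            MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
            ObjectName memory = new ObjectName("java.lang:type=Memory");
            CompositeData usage = (CompositeData) conn.getAttribute(memory, "HeapMemoryUsage");
            return (Long) usage.get("used");
        } catch (Exception e) {
            throw new IllegalStateException("could not read heap usage over JMX", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("heap used (bytes): " + heapUsedBytes());
    }
}
```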

I set up a small Kafka cluster with just two nodes. The cluster has one topic (name=default_channel_kafka_zzh_demo), split into 5 partitions (0 1 2 3 4).

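The Kafka versions discussed here are 0.8.1.x and 0.8.2.x. The two differ under JMX monitoring, and the difference lies in the ObjectName. Kafka has the concepts of topic and partition, and a topic is split into partitions according to some strategy. Taking that as the example,
in 0.8.1.x the ObjectName (as a String) for this property is:
"kafka.log":type="Log",name="default_channel_kafka_zzh_demo-*-LogEndOffset"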

whereas in 0.8.2.x the ObjectName for the same property is:
kafka.log:type=Log,name=LogEndOffset,topic=default_channel_kafka_zzh_demo,partition=0

So a program has to handle the two cases differently.
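With the 0.8.2.x naming, the topic and partition are separate key properties, so javax.management.ObjectName can do the matching and parsing. The following is a minimal sketch of that; ObjectNameSketch and its helper methods are hypothetical names, and only the newer name format is exercised.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class ObjectNameSketch {
    /** Builds a pattern that matches the LogEndOffset MBean of every partition of a topic. */
    public static ObjectName endOffsetPattern(String topic) {
        return parse("kafka.log:type=Log,name=LogEndOffset,topic=" + topic + ",partition=*");
    }

    /** True if the concrete MBean name belongs to the given topic. */
    public static boolean matches(String topic, String concreteName) {
        return endOffsetPattern(topic).apply(parse(concreteName));
    }

    /** Extracts the partition id from a 0.8.2.x style name. */
    public static int partitionOf(String concreteName) {
        return Integer.parseInt(parse(concreteName).getKeyProperty("partition"));
    }

    private static ObjectName parse(String name) {
        try {
            return new ObjectName(name);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("bad ObjectName: " + name, e);
        }
    }

    public static void main(String[] args) {
        String name = "kafka.log:type=Log,name=LogEndOffset,topic=default_channel_kafka_zzh_demo,partition=3";
        System.out.println(matches("default_channel_kafka_zzh_demo", name));
        System.out.println(partitionOf(name));
    }
}
```

With the 0.8.1.x naming, the topic and partition are packed into the quoted name value, so the partition id has to be cut out of that string by hand.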

Three monitored items are used here to demonstrate how to monitor with JMX:

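offset, as described above (the LogEndOffset of every partition of a topic in the cluster, i.e. logSize)
sendCount (the total number of messages sent to a topic in the cluster; this is the sum of the per-broker counts for that topic)
sendTps (the TPS of a topic in the cluster; this is likewise the sum of the per-broker rates for that topic)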

First, the class that handles a single Kafka broker.

package kafka.jmx;

import org.slf4j.Logger;

import org.slf4j.LoggerFactory;

import javax.management.*;

import javax.management.remote.JMXConnector;

import javax.management.remote.JMXConnectorFactory;

import javax.management.remote.JMXServiceURL;

import java.io.IOException;

import java.net.MalformedURLException;

import java.util.HashMap;

import java.util.Map;

import java.util.Set;

/**

 * Created by hidden on 2016/12/8.

 */

public class JmxConnection {

    private static Logger log = LoggerFactory.getLogger(JmxConnection.class);

    private MBeanServerConnection conn;

    private String jmxURL;

    private String ipAndPort = "localhost:9999";

    private int port = -1; // unused; the original initializer was lost, -1 is a placeholder

    private boolean newKafkaVersion = false;

    public JmxConnection(Boolean newKafkaVersion, String ipAndPort){

        this.newKafkaVersion = newKafkaVersion;

        this.ipAndPort = ipAndPort;

    }

    public boolean init(){

        jmxURL = "service:jmx:rmi:///jndi/rmi://" +ipAndPort+ "/jmxrmi";

        log.info("init jmx, jmxUrl: {}, and begin to connect it",jmxURL);

        try {

            JMXServiceURL serviceURL = new JMXServiceURL(jmxURL);

            JMXConnector connector = JMXConnectorFactory.connect(serviceURL,null);

            conn = connector.getMBeanServerConnection();

            if(conn == null){

               log.error("get connection return null!");

                return  false;

            }

        } catch (MalformedURLException e) {

            e.printStackTrace();

            return false;

        } catch (IOException e) {

            e.printStackTrace();

            return false;

        }

        return true;

    }

    public String getTopicName(String topicName){

        String s;

        if (newKafkaVersion) {

            s = "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=" + topicName;

        } else {

            s = "\"kafka.server\":type=\"BrokerTopicMetrics\",name=\"" + topicName + "-MessagesInPerSec\"";

        }

        return s;

    }

    /**

     * @param topicName: topic name, default_channel_kafka_zzh_demo

     * @return the message count for this topic on a single broker; the topic-wide total is the sum over every broker in the cluster

     */

    public long getMsgInCountPerSec(String topicName){

        String objectName = getTopicName(topicName);

        Object val = getAttribute(objectName,"Count");

        String debugInfo = "jmxUrl:"+jmxURL+",objectName="+objectName;

        if(val !=null){

            log.info("{}, Count:{}",debugInfo,(long)val);

            return (long)val;

        }

        return 0;

    }

    /**

     * @param topicName: topic name, default_channel_kafka_zzh_demo

     * @return the send TPS for this topic on a single broker; as with the count, the topic-wide TPS is the sum of this topic's TPS on every broker in the cluster

     */

    public double getMsgInTpsPerSec(String topicName){

        String objectName = getTopicName(topicName);

        Object val = getAttribute(objectName,"OneMinuteRate");

        if(val !=null){

            double dVal = ((Double)val).doubleValue();

            return dVal;

        }

        return 0;

    }

    private Object getAttribute(String objName, String objAttr)

    {

        ObjectName objectName =null;

        try {

            objectName = new ObjectName(objName);

        } catch (MalformedObjectNameException e) {

            e.printStackTrace();

            return null;

        }

        return getAttribute(objectName,objAttr);

    }

    private Object getAttribute(ObjectName objName, String objAttr){

        if(conn== null)

        {

            log.error("jmx connection is null");

            return null;

        }

        try {

            return conn.getAttribute(objName,objAttr);

        } catch (MBeanException e) {

            e.printStackTrace();

            return null;

        } catch (AttributeNotFoundException e) {

            e.printStackTrace();

            return null;

        } catch (InstanceNotFoundException e) {

            e.printStackTrace();

            return null;

        } catch (ReflectionException e) {

            e.printStackTrace();

            return null;

        } catch (IOException e) {

            e.printStackTrace();

            return null;

        }

    }

    /**

     * @param topicName

     * @return the logSize (i.e. end offset) of each partition of topicName

     */

    public Map<Integer,Long> getTopicEndOffset(String topicName){

        Set<ObjectName> objs = getEndOffsetObjects(topicName);

        if(objs == null){

            return null;

        }

        Map<Integer, Long> map = new HashMap<>();

        for(ObjectName objName:objs){

            int partId = getParId(objName);

            Object val = getAttribute(objName,"Value");

            if(val !=null){

                map.put(partId,(Long)val);

            }

        }

        return map;

    }

    private int getParId(ObjectName objName){

        if(newKafkaVersion){

            String s = objName.getKeyProperty("partition");

            return Integer.parseInt(s);

        }else {

            String s = objName.getKeyProperty("name");

            int to = s.lastIndexOf("-LogEndOffset");

            String s1 = s.substring(0, to);

            int from = s1.lastIndexOf("-") + 1;

            String ss = s.substring(from, to);

            return Integer.parseInt(ss);

        }

    }

    private Set<ObjectName> getEndOffsetObjects(String topicName){

        String objectName;

        if (newKafkaVersion) {

            objectName = "kafka.log:type=Log,name=LogEndOffset,topic="+topicName+",partition=*";

        }else{

            objectName = "\"kafka.log\":type=\"Log\",name=\"" + topicName + "-*-LogEndOffset\"";

        }

        ObjectName objName = null;

        Set<ObjectName> objectNames = null;

        try {

            objName = new ObjectName(objectName);

            objectNames = conn.queryNames(objName,null);

        } catch (MalformedObjectNameException e) {

            e.printStackTrace();

            return  null;

        } catch (IOException e) {

            e.printStackTrace();

            return null;

        }

        return objectNames;

    }

}

Note how the code handles the two Kafka versions differently. The methods corresponding to the three monitored items above are:

public Map<Integer,Long> getTopicEndOffset(String topicName)

public long getMsgInCountPerSec(String topicName)

public double getMsgInTpsPerSec(String topicName)

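Handling the whole cluster requires another class, which combines the per-broker values: counts and rates are summed, and for end offsets the larger value per partition is kept.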

package kafka.jmx;

import org.slf4j.Logger;

import org.slf4j.LoggerFactory;

import java.util.ArrayList;

import java.util.HashMap;

import java.util.List;

import java.util.Map;

/**

 * Created by hidden on 2016/12/8.

 */

public class JmxMgr {

    private static Logger log = LoggerFactory.getLogger(JmxMgr.class);

    private static List<JmxConnection> conns = new ArrayList<>();

    public static boolean init(List<String> ipPortList, boolean newKafkaVersion){

        for(String ipPort:ipPortList){

            log.info("init jmxConnection [{}]",ipPort);

            JmxConnection conn = new JmxConnection(newKafkaVersion, ipPort);

            boolean bRet = conn.init();

            if(!bRet){

                log.error("init jmxConnection error");

                return false;

            }

            conns.add(conn);

        }

        return true;

    }

    public static long getMsgInCountPerSec(String topicName){

        long val = 0;

        for(JmxConnection conn:conns){

            long temp = conn.getMsgInCountPerSec(topicName);

            val += temp;

        }

        return val;

    }

    public static double getMsgInTpsPerSec(String topicName){

        double val = 0;

        for(JmxConnection conn:conns){

            double temp = conn.getMsgInTpsPerSec(topicName);

            val += temp;

        }

        return val;

    }

    public static Map<Integer, Long> getEndOffset(String topicName){

        Map<Integer,Long> map = new HashMap<>();

        for(JmxConnection conn:conns){

            Map<Integer,Long> tmp = conn.getTopicEndOffset(topicName);

            if(tmp == null){

                log.warn("get topic endoffset return null, topic {}", topicName);

                continue;

            }

            for(Integer parId:tmp.keySet()){//change if bigger

                if(!map.containsKey(parId) || (map.containsKey(parId) && (tmp.get(parId)>map.get(parId))) ){

                    map.put(parId, tmp.get(parId));

                }

            }

        }

        return map;

    }

    public static void main(String[] args) {

        List<String> ipPortList = new ArrayList<>();

        ipPortList.add("xx.101.130.1:9999");

        ipPortList.add("xx.101.130.2:9999");

        JmxMgr.init(ipPortList,true);

        String topicName = "default_channel_kafka_zzh_demo";

        System.out.println(getMsgInCountPerSec(topicName));

        System.out.println(getMsgInTpsPerSec(topicName));

        System.out.println(getEndOffset(topicName));

    }

}

Output:

-- :: -[INFO] - [init jmxConnection [xx.101.130.1:9999]] - [kafka.jmx.JmxMgr:]

-- :: -[INFO] - [init jmx, jmxUrl: service:jmx:rmi:///jndi/rmi://xx.101.130.1:9999/jmxrmi, and begin to connect it] - [kafka.jmx.JmxConnection:35]

-- :: -[INFO] - [init jmxConnection [xx.101.130.2:9999]] - [kafka.jmx.JmxMgr:]

-- :: -[INFO] - [init jmx, jmxUrl: service:jmx:rmi:///jndi/rmi://xx.101.130.2:9999/jmxrmi, and begin to connect it] - [kafka.jmx.JmxConnection:35]

-- :: -[INFO] - [jmxUrl:service:jmx:rmi:///jndi/rmi://xx.101.130.1:9999/jmxrmi,objectName=kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=default_channel_kafka_zzh_demo, Count:6000] - [kafka.jmx.JmxConnection:73]

-- :: -[INFO] - [jmxUrl:service:jmx:rmi:///jndi/rmi://xx.101.130.2:9999/jmxrmi,objectName=kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=default_channel_kafka_zzh_demo, Count:4384] - [kafka.jmx.JmxConnection:73]

3.915592283987704E-65

{=, =, =, =, =}
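The per-partition merge performed by getEndOffset can be written more compactly with Map.merge. The sketch below is an illustrative alternative, not the author's code: OffsetMergeSketch is a hypothetical class, and the sample offsets are made up for the demonstration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OffsetMergeSketch {
    /** Keeps, for every partition, the largest end offset reported by any broker. */
    public static Map<Integer, Long> mergeMax(List<Map<Integer, Long>> perBroker) {
        Map<Integer, Long> merged = new HashMap<>();
        for (Map<Integer, Long> offsets : perBroker) {
            if (offsets == null) {
                continue; // this broker returned nothing; skip it as getEndOffset does
            }
            // merge() inserts the value if absent, otherwise combines old and new with Long::max.
            offsets.forEach((partition, offset) -> merged.merge(partition, offset, Long::max));
        }
        return merged;
    }

    /** Made-up sample data: two brokers plus one that returned null. */
    public static Map<Integer, Long> demo() {
        Map<Integer, Long> brokerA = new HashMap<>();
        brokerA.put(0, 100L);
        brokerA.put(1, 40L);
        Map<Integer, Long> brokerB = new HashMap<>();
        brokerB.put(1, 55L);
        brokerB.put(2, 7L);
        return mergeMax(Arrays.asList(brokerA, null, brokerB));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```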

3. Kafka Manager
