Writing an Exporter from Scratch

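Introduction

The previous article gave an overall introduction to the open-source monitoring system Prometheus. In that system, an Exporter acts as the agent side and exposes the data to be monitored over an HTTP endpoint. So how do we expose application metrics through an Exporter? Metrics such as the number of online users, the number of failed requests, and the number of abnormal requests can all be exposed this way, and alerting and monitoring can then be built on top of them.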

Demo environment

$ uname -a

Darwin 18.6. Darwin Kernel Version 18.6.: Thu Apr  :: PDT ; root:xnu-4903.261.~/RELEASE_X86_64 x86_64

$ go version

go version go1.12.4 darwin/amd64

The four metric types

Prometheus defines four metric types: Counter, Gauge, Histogram, and Summary.

The sample data returned by an Exporter includes a declaration of each metric's type, for example:

# TYPE node_network_carrier_changes_total counter

node_network_carrier_changes_total{device="br-01520cb4f523"}
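To make the shape of such a sample line concrete, here is a minimal sketch that renders one sample by hand using only the Go standard library. The helper name formatSample and the sample value 3 are assumptions made for illustration; a real Exporter would let the SDK produce this output.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatSample renders one sample as: name{label="value",...} value
// Label names are sorted so the output is deterministic.
// Note: %q is close to, but not identical with, the escaping rules of the
// Prometheus text format; it is good enough for plain ASCII label values.
func formatSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	if len(pairs) == 0 {
		return fmt.Sprintf("%s %g", name, value)
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

func main() {
	fmt.Println("# TYPE node_network_carrier_changes_total counter")
	fmt.Println(formatSample(
		"node_network_carrier_changes_total",
		map[string]string{"device": "br-01520cb4f523"},
		3, // illustrative value, not taken from a real host
	))
}
```

Each line is the metric name, an optional set of labels in braces, and the numeric value.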

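The four types have the following characteristics:

Counter: a counter that only increases and never decreases (unless the system restarts or the user process fails, in which case it resets). Common metrics such as http_requests_total and node_cpu are Counters. It is recommended to end the name of a Counter metric with the _total suffix.

Gauge: a value that can go up and down. A Gauge reflects the current state of the system. A common example is node_memory_MemAvailable_bytes (available memory).

Histogram: a histogram for analyzing how data is distributed across ranges, for example how many requests took 0-10 ms and how many took 10-20 ms.

Summary: a summary for analyzing data distribution. It reports quantiles such as the median and the 90th percentile.

Hands-on

Next I will use the Golang SDK provided by Prometheus to write Exporters covering the four metric types above. The examples are adapted from the SDK's own examples; because those are fairly complex, I have trimmed them down so that you can grasp the essentials of Exporter development with as little effort as possible. The first example demonstrates Counter and Gauge, and the second demonstrates Histogram and Summary.

Counter and Gauge usage: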

package main

import (
    "flag"
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var addr = flag.String("listen-address", ":8080", "The address to listen on for HTTP requests.")

func main() {
    flag.Parse()
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(*addr, nil))
}

The code above is a bare-bones Exporter that exposes Go runtime information at 0.0.0.0:8080/metrics; it contains no user-defined metrics. Next, add a Counter and a Gauge to it:

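// Requires these additional imports:
//     "time"
//     "github.com/prometheus/client_golang/prometheus"
//     "github.com/prometheus/client_golang/prometheus/promauto"

// recordMetrics updates both metrics in a background goroutine.
func recordMetrics() {
    go func() {
        for {
            opsProcessed.Inc()
            myGauge.Add(11)             // step size is illustrative
            time.Sleep(2 * time.Second) // interval is illustrative
        }
    }()
}

var (
    // promauto registers each metric with the default registry on creation.
    opsProcessed = promauto.NewCounter(prometheus.CounterOpts{
        Name: "myapp_processed_ops_total",
        Help: "The total number of processed events",
    })
    myGauge = promauto.NewGauge(prometheus.GaugeOpts{
        Name:        "my_example_gauge_data",
        Help:        "my example gauge data",
        ConstLabels: map[string]string{"error": ""},
    })
)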

Add a call to recordMetrics() in the main function above. Running curl 127.0.0.1:8080/metrics then shows the custom Counter myapp_processed_ops_total and the custom Gauge my_example_gauge_data.

# HELP my_example_gauge_data my example gauge data
# TYPE my_example_gauge_data gauge
my_example_gauge_data{error=""} 44
# HELP myapp_processed_ops_total The total number of processed events
# TYPE myapp_processed_ops_total counter
myapp_processed_ops_total

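Here # HELP carries the Help field from the code, and # TYPE states the metric type; for example, my_example_gauge_data is a gauge. my_example_gauge_data is the metric name, error inside the braces is a label (dimension) of the metric, and 44 is its value. Note in particular that both metrics are created with the NewXXX functions of the promauto package, for example: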

func NewCounter(opts prometheus.CounterOpts) prometheus.Counter {
    c := prometheus.NewCounter(opts)
    prometheus.MustRegister(c)
    return c
}

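As you can see, this function calls MustRegister automatically. If you use NewCounter from the prometheus package instead, you must call MustRegister yourself to register the metric. The Counter type has the following built-in interface: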

type Counter interface {
    Metric
    Collector

    // Inc increments the counter by 1. Use Add to increment it by arbitrary
    // non-negative values.
    Inc()
    // Add adds the given value to the counter. It panics if the value is <
    // 0.
    Add(float64)
}

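You can increment the metric by 1 with Inc(), or add an arbitrary non-negative value with Add(float64). The interface also embeds Metric and Collector, which provide descriptive methods that are not covered here.

The built-in interface of the Gauge type is: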

type Gauge interface {
    Metric
    Collector

    // Set sets the Gauge to an arbitrary value.
    Set(float64)
    // Inc increments the Gauge by 1. Use Add to increment it by arbitrary
    // values.
    Inc()
    // Dec decrements the Gauge by 1. Use Sub to decrement it by arbitrary
    // values.
    Dec()
    // Add adds the given value to the Gauge. (The value can be negative,
    // resulting in a decrease of the Gauge.)
    Add(float64)
    // Sub subtracts the given value from the Gauge. (The value can be
    // negative, resulting in an increase of the Gauge.)
    Sub(float64)

    // SetToCurrentTime sets the Gauge to the current Unix time in seconds.
    SetToCurrentTime()
}

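Note that Gauge provides Sub(float64) (and Dec()) for decreasing the value, because a Gauge can go up and down. A Counter only ever increases, so it has only the additive methods.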
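The difference between the two contracts can be sketched with two tiny stand-in types written against the standard library only. The types counter and gauge below are hypothetical and are not the SDK's implementation: they are not safe for concurrent use, and the sketch returns an error for a negative Add where the real Counter panics.

```go
package main

import (
	"errors"
	"fmt"
)

// counter is a stand-in for a monotonically increasing metric.
type counter struct{ v float64 }

// Add rejects negative deltas so the value can never decrease.
func (c *counter) Add(delta float64) error {
	if delta < 0 {
		return errors.New("counter cannot decrease")
	}
	c.v += delta
	return nil
}

// gauge is a stand-in for a metric that can move in both directions.
type gauge struct{ v float64 }

func (g *gauge) Add(delta float64) { g.v += delta }
func (g *gauge) Sub(delta float64) { g.v -= delta }

func main() {
	c := &counter{}
	_ = c.Add(2)
	fmt.Println("counter:", c.v, "negative add error:", c.Add(-1))

	g := &gauge{}
	g.Add(5)
	g.Sub(3)
	fmt.Println("gauge:", g.v)
}
```

The counter keeps its value of 2 after the rejected call, while the gauge ends at 2 after going up to 5 and back down by 3.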

Histogram and Summary usage:

package main

import (
    "flag"
    "fmt"
    "log"
    "math"
    "math/rand"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    addr              = flag.String("listen-address", ":8080", "The address to listen on for HTTP requests.")
    uniformDomain     = flag.Float64("uniform.domain", 0.0002, "The domain for the uniform distribution.")
    normDomain        = flag.Float64("normal.domain", 0.0002, "The domain for the normal distribution.")
    normMean          = flag.Float64("normal.mean", 0.00001, "The mean for the normal distribution.")
    oscillationPeriod = flag.Duration("oscillation-period", 10*time.Minute, "The duration of the rate oscillation period.")
)

var (
    // Summary with two labels; Objectives maps each quantile to its allowed error.
    rpcDurations = prometheus.NewSummaryVec(
        prometheus.SummaryOpts{
            Name:       "rpc_durations_seconds",
            Help:       "RPC latency distributions.",
            Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
        },
        []string{"service", "error_code"},
    )
    // Histogram with 20 buckets of width 5, starting at 0.
    rpcDurationsHistogram = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "rpc_durations_histogram_seconds",
        Help:    "RPC latency distributions.",
        Buckets: prometheus.LinearBuckets(0, 5, 20),
    })
)

func init() {
    // Register the summary and the histogram with Prometheus's default registry.
    prometheus.MustRegister(rpcDurations)
    prometheus.MustRegister(rpcDurationsHistogram)
    // Add Go module build info.
    prometheus.MustRegister(prometheus.NewBuildInfoCollector())
}

func main() {
    flag.Parse()

    start := time.Now()

    // The oscillation constants and sleep multipliers below are assumed to
    // follow the SDK's random example.
    oscillationFactor := func() float64 {
        return 2 + math.Sin(math.Sin(2*math.Pi*float64(time.Since(start))/float64(*oscillationPeriod)))
    }

    go func() {
        // Observe 3, 6, ..., 99 once; the step and limit are inferred from
        // the sample values discussed in the text.
        i := 1
        for {
            time.Sleep(time.Duration(75*oscillationFactor()) * time.Millisecond)
            if (i * 3) > 100 {
                break
            }
            rpcDurations.WithLabelValues("normal", "").Observe(float64((i * 3) % 100))
            rpcDurationsHistogram.Observe(float64((i * 3) % 100))
            fmt.Println(float64((i*3)%100), " i=", i)
            i++
        }
    }()

    go func() {
        for {
            v := rand.ExpFloat64() / 1e6
            rpcDurations.WithLabelValues("exponential", "").Observe(v)
            time.Sleep(time.Duration(50*oscillationFactor()) * time.Millisecond)
        }
    }()

    // Expose the registered metrics via HTTP.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(*addr, nil))
}

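The rpcDurations variable defines a Summary metric with two labels, service and error_code. The rpcDurationsHistogram variable defines a Histogram metric with 20 buckets of width 5 starting at 0, that is, upper bounds 0, 5, 10, ..., 95. Bucket counts are cumulative: the bucket with upper bound 5 counts every observation less than or equal to 5, the bucket with upper bound 10 counts every observation less than or equal to 10, and so on.

The output for the Histogram metric looks like this: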

# HELP rpc_durations_histogram_seconds RPC latency distributions.
# TYPE rpc_durations_histogram_seconds histogram
rpc_durations_histogram_seconds_bucket{le="0"}
rpc_durations_histogram_seconds_bucket{le="5"}
rpc_durations_histogram_seconds_bucket{le="10"}
rpc_durations_histogram_seconds_bucket{le="15"}
rpc_durations_histogram_seconds_bucket{le="20"}
rpc_durations_histogram_seconds_bucket{le="25"} 8
rpc_durations_histogram_seconds_bucket{le="30"} 10
rpc_durations_histogram_seconds_bucket{le="35"}
rpc_durations_histogram_seconds_bucket{le="40"}
rpc_durations_histogram_seconds_bucket{le="45"}
rpc_durations_histogram_seconds_bucket{le="50"}
rpc_durations_histogram_seconds_bucket{le="55"}
rpc_durations_histogram_seconds_bucket{le="60"}
rpc_durations_histogram_seconds_bucket{le="65"}
rpc_durations_histogram_seconds_bucket{le="70"}
rpc_durations_histogram_seconds_bucket{le="75"}
rpc_durations_histogram_seconds_bucket{le="80"}
rpc_durations_histogram_seconds_bucket{le="85"}
rpc_durations_histogram_seconds_bucket{le="90"}
rpc_durations_histogram_seconds_bucket{le="95"}
rpc_durations_histogram_seconds_bucket{le="+Inf"}
rpc_durations_histogram_seconds_sum
rpc_durations_histogram_seconds_count

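xxx_count is the total number of observations recorded for the metric, and xxx_sum is the sum of all observed values. Each le label is the upper bound of a bucket, and the number after it is the cumulative count of samples from the start up to that bound. For example, the 10 after le="30" means that 10 samples are less than or equal to 30. To find how many samples fall in the range 26-30, subtract the le="25" count from the le="30" count, which gives 2: the samples 27 and 30.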
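The subtraction can be checked with a small standard-library sketch. It assumes the observed samples are 3, 6, ..., 99, which is consistent with the numbers quoted above; linearBuckets and cumulativeCounts are hypothetical helpers that only mimic the bucket layout and are not the SDK's code.

```go
package main

import "fmt"

// linearBuckets returns count upper bounds: start, start+width, ...
func linearBuckets(start, width float64, count int) []float64 {
	bounds := make([]float64, count)
	for i := range bounds {
		bounds[i] = start + float64(i)*width
	}
	return bounds
}

// cumulativeCounts returns, for each upper bound, how many samples are
// less than or equal to it (the meaning of the le label).
func cumulativeCounts(bounds, samples []float64) []int {
	counts := make([]int, len(bounds))
	for _, s := range samples {
		for i, b := range bounds {
			if s <= b {
				counts[i]++
			}
		}
	}
	return counts
}

func main() {
	bounds := linearBuckets(0, 5, 20) // 0, 5, 10, ..., 95

	var samples []float64
	for v := 3; v < 100; v += 3 { // assumed samples: 3, 6, ..., 99
		samples = append(samples, float64(v))
	}

	counts := cumulativeCounts(bounds, samples)
	// bounds[5] is 25 and bounds[6] is 30.
	fmt.Println("le=25:", counts[5], "le=30:", counts[6], "in (25,30]:", counts[6]-counts[5])
}
```

Because every bucket already includes all smaller buckets, the count for a single range is always the difference between two neighbouring le values.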

The Summary output looks like this:

# HELP rpc_durations_seconds RPC latency distributions.
# TYPE rpc_durations_seconds summary
rpc_durations_seconds{error_code="",service="exponential",quantile="0.5"} 7.176288428497417e-07
rpc_durations_seconds{error_code="",service="exponential",quantile="0.9"} 2.6582266087185467e-06
rpc_durations_seconds{error_code="",service="exponential",quantile="0.99"} 4.013935374172691e-06
rpc_durations_seconds_sum{error_code="",service="exponential"} 0.00015065426336339398
rpc_durations_seconds_count{error_code="",service="exponential"}
rpc_durations_seconds{error_code="",service="normal",quantile="0.5"} 51
rpc_durations_seconds{error_code="",service="normal",quantile="0.9"} 90
rpc_durations_seconds{error_code="",service="normal",quantile="0.99"}
rpc_durations_seconds_sum{error_code="",service="normal"}
rpc_durations_seconds_count{error_code="",service="normal"}

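The sum and count series have the same meaning as for the Histogram above. Take the three quantile series for service="normal": quantile="0.5" shows that the median is 51, and quantile="0.9" shows that the 90th percentile is 90.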
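These two numbers can be reproduced with an exact nearest-rank calculation, again assuming the observed samples are 3, 6, ..., 99. This is only a sketch: nearestRank is a hypothetical helper, and the SDK's Summary estimates quantiles with a streaming algorithm within the error bounds given in Objectives rather than by sorting every sample.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// nearestRank returns the q-quantile (0 < q <= 1) of a non-empty sample set
// using the nearest-rank definition: the value at position ceil(q*n) in
// sorted order.
func nearestRank(samples []float64, q float64) float64 {
	sorted := append([]float64(nil), samples...) // copy so the input is untouched
	sort.Float64s(sorted)
	rank := int(math.Ceil(q * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	var samples []float64
	for v := 3; v < 100; v += 3 { // assumed samples: 3, 6, ..., 99
		samples = append(samples, float64(v))
	}
	fmt.Println("median:", nearestRank(samples, 0.5))
	fmt.Println("90th percentile:", nearestRank(samples, 0.9))
}
```

With 33 samples, the median is the 17th value (51) and the 90th percentile is the 30th value (90).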

Custom types

If none of the four built-in types (Counter, Gauge, Histogram, Summary) meets your needs, you can define your own. Implement the methods of the Collector interface and then call MustRegister:

func MustRegister(cs ...Collector) {
    DefaultRegisterer.MustRegister(cs...)
}

type Collector interface {
    Describe(chan<- *Desc)
    Collect(chan<- Metric)
}
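The key idea behind a custom Collector is that values are read at scrape time and pushed into a channel. The sketch below imitates that pattern with the standard library only. The types sample and collector are simplified stand-ins for prometheus.Metric and prometheus.Collector (Describe is left out), and queueCollector, gather, and the metric name myapp_queue_length are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// sample is a simplified stand-in for prometheus.Metric.
type sample struct {
	name  string
	value float64
}

// collector is a simplified stand-in for prometheus.Collector.
type collector interface {
	Collect(ch chan<- sample)
}

// queueCollector reads the queue length only when it is asked to collect.
type queueCollector struct {
	length func() int // hypothetical data source
}

func (q queueCollector) Collect(ch chan<- sample) {
	ch <- sample{name: "myapp_queue_length", value: float64(q.length())}
}

// gather runs every collector and returns all samples they emit,
// roughly what a registry does on each scrape.
func gather(cs ...collector) []sample {
	ch := make(chan sample)
	go func() {
		for _, c := range cs {
			c.Collect(ch)
		}
		close(ch)
	}()
	var out []sample
	for s := range ch {
		out = append(out, s)
	}
	return out
}

func main() {
	queue := []string{"a", "b", "c"}
	c := queueCollector{length: func() int { return len(queue) }}
	for _, s := range gather(c) {
		fmt.Printf("%s %g\n", s.name, s.value)
	}
}
```

In a real Exporter the same shape applies, with the SDK's Desc and Metric types taking the place of these stand-ins and MustRegister taking the place of gather.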

Summary

This article demonstrated Exporter development with Prometheus's built-in Counter, Gauge, Histogram, and Summary types, and ended with how to implement a custom type.

References

https://prometheus.io/docs/guides/go-application/

https://yunlzheng.gitbook.io/prometheus-book/parti-prometheus-ji-chu/promql/prometheus-metrics-types

https://songjiayang.gitbooks.io/prometheus/content/concepts/metric-types.html