Prometheus的监控解决方案（含监控kubernetes）

prometheus的简介和安装

Prometheus（普罗米修斯）是一个开源系统监控和警报工具，最初是在SoundCloud建立的。自2012年成立以来，许多公司和组织都采用了普罗米修斯，该项目拥有一个非常活跃的开发者和用户社区。它现在是一个独立的开放源码项目，并且独立于任何公司。为了强调这一点，为了澄清项目的治理结构，普罗米修斯在2016年加入了云计算基金会，成为继Kubernetes之后的第二个托管项目。

特征：

Prometheus的主要特征有：

多维度数据模型
灵活的查询语言
不依赖分布式存储，单个服务器节点是自主的
以HTTP方式，通过pull模型拉去时间序列数据
也通过中间网关支持push模型
通过服务发现或者静态配置，来发现目标服务对象
支持多种多样的图表和界面展示，grafana也支持它
组件

Prometheus生态包括了很多组件，它们中的一些是可选的：

主服务Prometheus Server负责抓取和存储时间序列数据
客户库负责检测应用程序代码
支持短生命周期的PUSH网关
基于Rails/SQL仪表盘构建器的GUI
多种导出工具，可以支持Prometheus存储数据转化为HAProxy、StatsD、Graphite等工具所需要的数据存储格式
警告管理器
命令行查询工具
其他各种支撑工具
多数Prometheus组件是Go语言写的，这使得这些组件很容易编译和部署。

架构

下面这张图说明了Prometheus的整体架构，以及生态中的一些组件作用:

Prometheus服务，可以直接通过目标拉取数据，或者间接地通过中间网关拉取数据。它在本地存储抓取的所有数据，并通过一定规则进行清理和整理数据，并把得到的结果存储到新的时间序列中，PromQL和其他API可视化地展示收集的数据

适用场景

Prometheus在记录纯数字时间序列方面表现非常好。它既适用于面向服务器等硬件指标的监控，也适用于高动态的面向服务架构的监控。对于现在流行的微服务，Prometheus的多维度数据收集和数据筛选查询语言也是非常的强大。

Prometheus是为服务的可靠性而设计的，当服务出现故障时，它可以使你快速定位和诊断问题。它的搭建过程对硬件和服务没有很强的依赖关系。

不适用场景

Prometheus，它的价值在于可靠性，甚至在很恶劣的环境下，你都可以随时访问它和查看系统服务各种指标的统计信息。如果你对统计数据需要100%的精确，它并不适用，例如：它不适用于实时计费系统

Prometheus的安装

tar xvfz prometheus-*.tar.gz

cd prometheus-*

在运行Prometheus服务之前，我们需要指定一个该服务运行所需要的配置文件

Prometheus通过Http方式拉取目标机上的度量指标。Prometheus服务也暴露自己运行所产生的数据，它能够抓取和监控自己的健康状况。

实际上，Prometheus服务收集自己运行所产生的时间序列数据，是没有什么意义的。但是它是一个非常好的入门级教程。保存在Prometheus配置到文件中，并自定义命名该文件名，如：prometheus.yml

在启动普罗米修斯之前，需要进行配置

配置prometheus.yml文件，

global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
# - "first.rules"
# - "second.rules"
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
在示例配置文件中有三个模块：global, rule_files, and scrape_configs.

global普罗米修斯服务器的全局配置。我们有两种选择。第一个，scrape_interval，控制普罗米修斯的目标。您可以将其覆盖到单个目标。在这种情况下，全球设置是每15秒刷新一次。evaluation_interval控制普罗米修斯评估规则的频率。普罗米修斯使用规则创建新的时间序列并生成警报。

rule_files指定我们希望普罗米修斯服务器加载的任何规则的位置。

scrape_configs控制普罗米修斯监视的资源。由于普罗米修斯也将自身作为HTTP端点的数据公开，因此它可以对自己的健康进行刷新和监控。在默认的配置中，有一个单独的任务，叫做prometheus。这将使普罗米修斯服务器暴露的时间序列数据受到影响。该作业包含一个单独的、静态配置的目标，即端口9090端口上的localhost。这个默认作业是通过URL抓取的:http://localhost:9090 /指标。

普罗米修斯通过导航到自己的指标端点来提供关于自身的度量：

http://ip:9090/metrics.

查看http相关参数：

官方文档：https://prometheus.io/docs/prometheus/latest/getting_started/

中文翻译：https://github.com/1046102779/prometheus/blob/master/introduction/install.md

安装grafana

官网安装步骤：

http://docs.grafana.org/installation/rpm/

下载安装grafana

wgethttps://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-5.0.4-1.x86_64.rpm

yum install initscripts fontconfig

rpm -Uvh grafana-5.0.4-1.x86_64.rpm

配置prometheus数据源

常见匹配符和函数
官方文档：

https://prometheus.io/docs/prometheus/latest/querying/operators/

中文翻译：

https://github.com/1046102779/prometheus/blob/master/prometheus/querying/operators.md

常见匹配符：

+，-，*，/，%，^（加，减，乘，除，取余，幂次方）

==，!=，>，<，>=，<=（等于，不等于，大于，小于，大于等于，小于等于）
聚合操作符：

sum(求和),min(取最小),max(取最大),avg(取平均)，count (计数器)

stddev (计算偏差),stdvar (计算方差)，count_values(每个元素独立值数量)，bottomk (取倒数几个),topk(取前几位)
具体使用：

查询指标name为http_requests_total 条件为job，handler 的数据:
http_requests_total{job="prometheus", handler="query"}
取5min内其他条件同上的数据:
http_requests_total{job="prometheus", handler="query"}[5m]
匹配job名称以server结尾的数据:
http_requests_total{job=~".*eus"}
匹配status不等于4xx的数据：
http_requests_total{status!~"4.."}
查询5min内，每秒指标为http_requests_total的数据比率：
rate(http_requests_total[5m])
根据job分组，取每秒数据数量：
sum(rate(http_requests_total[5m])) by (job)
取各个实例的未使用内存量（以MB为单位）
(node_memory_CommitLimit_bytes - node_memory_NFS_Unstable_bytes) / 1024
以instance, job为分组，取未使用内存量（以MB为单位）
sum(node_memory_CommitLimit_bytes - node_memory_NFS_Unstable_bytes) by (instance, job) / 1024
假如数据如下：
http_requests_total{code="503",handler="query_range",instance="localhost:9090",job="prometheus",method="get"}
http_requests_total{code="400",handler="query_range",instance="localhost:9090",job="prometheus",method="get"}
http_requests_total{code="400",handler="query",instance="localhost:9090",job="prometheus",method="get"}
取http_requests_total前五数据
topk(5, http_requests_total)
以handler,instance为分组，取http_requests_total前三的数据：
topk(3, http_requests_total) by (handler,instance)
取数据的个数：
count(container_cpu_system_seconds_total) by (id)
函数使用方法：

1、absent()
absent(v instant-vector)，如果赋值给它的向量具有样本数据，则返回空向量；如果传递的瞬时向量参数没有样本数据，则返回不带度量指标名称且带有标签的样本值为1的结果
当监控度量指标时，如果获取到的样本数据是空的，使用absent方法对告警是非常有用的
absent(nonexistent{job="promethues"})
2、irate
irate(v range-vector)函数, 输入：范围向量，输出：key: value = 度量指标： (last值-last前一个值)/时间戳差值。它是基于最后两个数据点，自动调整单调性，如：服务实例重启，则计数器重置。
下面表达式针对范围向量中的每个时间序列数据，返回两个最新数据点过去5分钟的HTTP请求速率。
irate(http_requests_total{job="node-mysql"}[5m])
3、predict_linear
predict_linear(v range-vector, t scalar)预测函数，输入：范围向量和从现在起t秒后，输出：不带有度量指标，只有标签列表的结果值。
predict_linear(http_requests_total{code="200",instance="localhost:9090",job="prometheus",method="get"}[5m], 5)
4、rate()
rate(v range-vector)函数, 输入：范围向量，输出：key: value = 不带有度量指标，且只有标签列表：(last值-first值)/时间差s
http每秒的平均响应时间：
rate(http_request_size_bytes_sum [5m]) / rate(http_request_size_bytes_count [5m])
Prometheus监控服务

官方文档：

https://prometheus.io/docs/prometheus/latest/configuration/configuration/

中文翻译：

https://github.com/1046102779/prometheus/blob/master/operating/configuration.md

Prometheus可以通过命令行参数和配置文件来配置它的服务参数。命令行主要用于配置系统参数（例如：存储位置，保留在磁盘和内存中的数据量大小等），配置文件主要用于配置与抓取任务和任务下的实例相关的所有内容, 并且加载指定的抓取规则file。

可以通过运行prometheus -h命令, 查看Prometheus服务所有可用的命令行参数

使用-config.file命令行参数来指定Prometheus启动所需要的配置文件。

这个配置文件是YAML格式，通过下面描述的范式定义, 括号表示参数是可选的。对于非列表参数，这个值被设置了默认值。

全局配置示例。

全局配置指定的参数，在其他上下文配置中是生效的。这也默认这些全局参数在其他配置区域有效。

# my global config
global:
scrape_interval: 15s
# Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s
# Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets: ["localhost:9093"]
# - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
- /etc/prometheus/rules.yml
# - "first_rules.yml"
# - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
- job_name: 'prometheus'
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ['localhost:9090']
#监控node节点和node节点mysql
- job_name: node-mysql
static_configs:
- targets: ['192.168.81.173:9100','192.168.81.173:9104']
#monitor k8s监控kubernetes
- job_name: 'kubernetes-nodes-cadvisor'
kubernetes_sd_configs:
- api_server: 'http://localhost:8080';;
role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- source_labels: [__meta_kubernetes_role]
action: replace
target_label: kubernetes_role
- source_labels: [__address__]
regex: '(.*):10250'
replacement: '${1}:10255'
target_label: __address__
- job_name: 'kubernetes_node'
kubernetes_sd_configs:
- role: node
api_server: 'http://localhost:8080';;
relabel_configs:
- source_labels: [__address__]
regex: '(.*):10250'
replacement: '${1}:9100'
target_label: __address__
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- job_name: 'kubernetes-services'
metrics_path: /probe
params:
module: [http_2xx]
kubernetes_sd_configs:
- role: service
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
action: keep
regex: true
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: blackbox
- source_labels: [__param_target]
target_label: instance
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
target_label: kubernetes_name
- job_name: 'kubernetes-service-endpoints'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__]
action: replace
target_label: nodeIp
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: kubernetes_name
Prometheus监控服务主要是通过exporter来监控，需要客户端安装相应的exporter来转换成prometheus能识别的方式，prometheus已经维护了大多数常见服务的exporter：https://prometheus.io/docs/instrumenting/exporters/

监控MySQL

在prometheus服务端配置job和static-configs等，如上图配置，然后在客户端需安装mysql-exporter

Wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.10.0/mysqld_exporter-0.10.0.linux-amd64.tar.gz -O mysqld_exporter-0.10.0.linux-amd64.tar.gz

mysql授权：

GRANT REPLICATION CLIENT, PROCESS ON *.* TO 'prom'@'localhost' identified by 'abc123';

GRANT SELECT ON performance_schema.* TO 'prom'@'localhost';

配置mysql-exporter配置文件

vim .my.cnf

[client]

user=prom

password=abc123
启动mysql-exporter

./mysqld_exporter -config.my-cnf=".my.cnf"

然后可以看到新监听了一个9104端口，MySQL监控配置完成

监控kubernetes

prometheus获取监控端点的方式有很多，其中就包括k8s，prometheu会通过调用master的apiserver获取到节点信息，然后去调取每个节点的数据。

配置方式：在prometheus服务端配置文件中配置job等相应信息，如上配置会监控每个节点的容器信息和节点监控信息。需要在k8s中部署node-exporter pod,yaml文件如下：

apiVersion: v1
kind: Service
metadata:
annotations:
prometheus.io/scrape: 'true'
labels:
app: node-exporter
name: node-exporter
name: node-exporter
spec:
clusterIP: None
ports:
- name: scrape
port: 9100
protocol: TCP
selector:
app: node-exporter
type: ClusterIP
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
name: node-exporter
spec:
template:
metadata:
labels:
app: node-exporter
name: node-exporter
spec:
containers:
- image: prom/node-exporter
name: node-exporter
ports:
- containerPort: 9100
hostPort: 9100
name: scrape
hostNetwork: true
构建node_export的pod

kubectl create -f node_export_pod.yaml

查看prometheus监控状态

报警
官方文档：https://prometheus.io/docs/alerting/configuration/

中文翻译：

https://github.com/1046102779/prometheus/blob/master/alerting/configuration.md

Pormetheus的警告由独立的两部分组成。Prometheus服务中的警告规则发送警告到Alertmanager。然后这个Alertmanager管理这些警告。包括silencing, inhibition, aggregation，以及通过一些方法发送通知，例如：email，webhook和HipChat。

prometheus设置报警的思路:

1、./alertmanager --config.file=simple.yml加载的报警的媒介（如邮件、webhook）

2、./prometheus --config.file=prometheus.yml中指定配置通信的主机和规则文件。

3、在上述配置的规则文件中配置预警策略和模板

配置预警

1、下载安装解压alermanager

tar -zxvf alertmanager-0.15.0-rc.1.linux-amd64.tar.gz

cd alertmanager-0.15.0-rc.1.linux-amd64

配置报警媒介文件

vim aler.yml
global:
resolve_timeout: 6m
smtp_smarthost: 'mail.baiwutong.com:25'
smtp_from: 'monit@baiwutong.com'
smtp_auth_username: 'monit@baiwutong.com'
smtp_auth_password: 'xxxxxxx'
smtp_require_tls: false
templates:
- '/root/alertmanager/template/*.tmpl'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 3s
group_interval: 5m
repeat_interval: 10m
receiver: default-receiver
routes:
- match:
job: ".*"
routes:
- match:
status: yellow
receiver: default-receiver
receivers:
- name: 'default-receiver'
email_configs:
- to: 'guoyinzhao@baiwutong.com'
send_resolved: true
#headers: { Subject: "[mail] 测试技术部监控告警邮件" }
启动警告器

nohup ./alertmanager --config.file=alert.yml &

配置通信主机及路径

在prometheus配置文件中指定通信主机和报警规则文件路径

alertmanagers:
- static_configs:
- targets: ["localhost:9093"]
# - alertmanager:9093
rule_files:
- /etc/prometheus/rules.yml
配置报警规则

groups:
- name: test-rule
rules:
- alert: NodeMemoryUsage
expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 80
for: 1m
labels:
severity: warning
annotations:
summary: "{{$labels.instance}}: High Memory usage detected"
description: "{{$labels.instance}}: Memory usage is above 80% (current value is: {{ $value }}"
- alert: NodeFilesystemUsage
expr: (node_filesystem_size_bytes{device="rootfs"} - node_filesystem_free_bytes{device="rootfs"}) / node_filesystem_size_bytes{device="rootfs"} * 100 > 80
for: 2m
labels:
team: node
annotations:
summary: "{{$labels.instance}}: High Filesystem usage detected"
description: "{{$labels.instance}}: Filesystem usage is above 80% (current value is: {{ $value }}"
- alert: NodeCPUUsage
expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
for: 3m
labels:
team: node
annotations:
summary: "{{$labels.instance}}: High CPU usage detected"
description: "{{$labels.instance}}: CPU usage is above 80% (current value is: {{ $value }}"
主要参考文档：
官网：https://prometheus.io/docs/introduction/overview/

中文翻译：https://github.com/1046102779/prometheus/
---------------------
作者：guoyinzhao
来源：CSDN
原文：https://blog.csdn.net/guoyinzhao/article/details/81283761
版权声明：本文为博主原创文章，转载请附上博文链接！

Prometheus的监控解决方案（含监控kubernetes）的更多相关文章

如何用Prometheus监控十万container的Kubernetes集群
概述不久前,我们在文章<如何扩展单个Prometheus实现近万Kubernetes集群监控?>中详细介绍了TKE团队大规模Kubernetes联邦监控系统Kvass的演进过程,其中介绍 ...
Promethus+Grafana监控解决方案
[MySQL]企业级监控解决方案Promethus+Grafana Promethus用作监控数据采集与处理,而Grafana只是用作数据展示一.Promethus简介 Prometheus(普罗米 ...
基于promtheus的监控解决方案
一.前言鄙人就职于某安全公司,团队的定位是研发安全产品云汇聚平台,为用户提供弹性伸缩的云安全能力.前段时间产品组提出了一个监控需求,大致要求:平台对vm实行动态实时监控,输出相应图表界面,并提供警报 ...
Prometheus 监控K8S Node监控
Prometheus 监控K8S Node监控 Prometheus社区提供的NodeExporter项目可以对主机的关键度量指标进行监控,通过Kubernetes的DeamonSet可以在各个主机节 ...
Prometheus基于consul自动发现监控对象 https://www.iloxp.com/archive/11/
Prometheus 监控目标为什么要自动发现频繁对Prometheus配置文件进行修改,无疑给运维人员带来很大的负担,还有可能直接变成一个“配置小王子”,即使是配置小王子也会存在人为失误的情况 ...
简单4步，利用Prometheus Operator实现自定义指标监控
本文来自Rancher Labs 在过去的文章中,我们花了相当大的篇幅来聊关于监控的话题.这是因为当你正在管理Kubernetes集群时,一切都会以极快的速度发生变化.因此有一个工具来监控集群的健康状 ...
Golang 基于Prometheus Node_Exporter 开发自定义脚本监控
Golang 基于Prometheus Node_Exporter 开发自定义脚本监控公司是今年决定将一些传统应用从虚拟机上迁移到Kubernetes上的,项目多而乱,所以迁移工作进展缓慢,为了建立 ...
Greenplum数仓监控解决方案（开源版本）
Greenplum监控解决方案基于Prometheus+Grafana+greenplum_exporter+node_exporter实现关联图一.基本概念 1.Prometheus Pr ...
Probius+Prometheus通过API集成POD监控
上一篇文章Probius+Kubernetes任务系统如虎添翼讲了我们把Kubernetes集成进了任务系统Probius,上线后小伙伴反馈虽然摆脱了Kubernetes-Dashboard,但还是得 ...

随机推荐

【.NET 与树莓派】气压传感器——BMP180
BMP180 是一款数字气压计传感器,实际可读出温度和气压值.此模块使用 IIC(i2c)协议.模块体积很小,比老周的大拇指指甲还小:也很便宜,一般是长这样的.螺丝孔只开一个,也有开两个孔的. 这货基 ...
GKCTF 2021 Reverse Writeup
前言 GKCTF 2021所以题目均以开源,下面所说的一切思路可以自行通过源码对比IDA进行验证. Github项目地址:https://github.com/w4nd3r-0/GKCTF2021 出 ...
luogu3888 GDOI2014拯救莫里斯（状压dp）
题目描述莫莉斯·乔是圣域里一个叱咤风云的人物,他凭借着自身超强的经济头脑,牢牢控制了圣域的石油市场. 圣域的地图可以看成是一个n*m的矩阵.每个整数坐标点(x , y)表示一座城市\(( 1\le ...
FastAPI 学习之路（十五）响应状态码
系列文章: FastAPI 学习之路(一)fastapi--高性能web开发框架 FastAPI 学习之路(二) FastAPI 学习之路(三) FastAPI 学习之路(四) FastAPI 学习之 ...
java的加载与执行原理剖析
到目前为止,我们接触过的重点术语,总结一下: Java体系的技术被划分为三大块: JavaSE:标准版 JavaEE:企业版 JavaME:微型版安装JDK之后: JDK:java开发工具箱 JRE ...
python 工具箱
strip() 方法可以从字符串去除不想要的空白符. print() BIF的file参数控制将数据发送/保存到哪里. finally组总会执行,而不论try/except语句中出现什么异常. 会向e ...
痞子衡嵌入式：超级下载算法RT-UFL v1.0在IAR EW for Arm下的使用
痞子衡主导的"学术"项目 <RT-UFL - 一个适用全平台i.MXRT的超级下载算法设计> v1.0 版发布近 4 个月了,部分客户已经在实际项目开发调试中用上了这个 ...
【UE4】异步加载关卡 LoadingScreen ( 蓝图和C++ )
一般先跳转到一个临时的关卡,然后异步加载目标关卡,同时展示Loading界面对于含有流关卡的目标关卡,可以先载入子关卡蓝图异步加载无进度条 C++ 异步加载关卡 LoadPackageAsync ...
vue3.x相对于vue2.x生命周期改动
vue3.x已经正式发布了,部分小伙伴已经用了vue3.x开发,部分小伙伴还在观望中,下面是两个影响比较大的改动 1.beforeDestroy和destroyed不能用了. 这个应该是vue2.x项 ...
Java集合 - 集合知识点总结概述
集合概述概念:对象的容器,定义了对多个对象进项操作的的常用方法.可实现数组的功能. 和数组的区别: 数组长度固定,集合长度不固定. 数组可以存储基本类型和引用类型,集合只能存储引用类型. 位置: j ...

Prometheus的监控解决方案（含监控kubernetes）

Prometheus的监控解决方案（含监控kubernetes）的更多相关文章

随机推荐

热门专题