部署kube-prometheus,添加邮件报警
这个项目出自coreos,已经存在很久了,第一次尝试的时候还很简陋,现在完善了很多。
项目提供了一键部署脚本,跑起来并不难,不过个人感觉要真正掌握并灵活使用并不是很容易。
kube version: 1.9.1
OS version: debian stretch
1、从github上把项目拉下来。
# git clone https://github.com/coreos/prometheus-operator.git
2、准备镜像,这里用到了墙外的镜像,自行科学下载。
quay.io/prometheus/alertmanager:v0.9.1
quay.io/coreos/configmap-reload:v0.0.1
grafana/grafana:4.5.2
quay.io/coreos/grafana-watcher:v0.0.8
quay.io/coreos/kube-state-metrics:v1.0.1
gcr.io/google_containers/addon-resizer:1.0
quay.io/prometheus/node-exporter:v0.15.0
quay.io/prometheus/prometheus:v2.0.0
quay.io/coreos/prometheus-config-reloader:v0.0.2
quay.io/coreos/prometheus-operator:v0.15.0
3、执行脚本部署项目。
脚本内容如下:
#!/usr/bin/env bash
if [ -z "${KUBECONFIG}" ]; then
export KUBECONFIG=~/.kube/config
fi
# CAUTION - setting NAMESPACE will deploy most components to the given namespace
# however some are hardcoded to 'monitoring'. Only use if you have reviewed all manifests.
if [ -z "${NAMESPACE}" ]; then
NAMESPACE=monitoring
fi
kubectl create namespace "$NAMESPACE"
kctl() {
kubectl --namespace "$NAMESPACE" "$@"
}
kctl apply -f manifests/prometheus-operator
# Wait for CRDs to be ready.
printf "Waiting for Operator to register custom resource definitions..."
until kctl get customresourcedefinitions servicemonitors.monitoring.coreos.com > /dev/null 2>&1; do sleep 1; printf "."; done
until kctl get customresourcedefinitions prometheuses.monitoring.coreos.com > /dev/null 2>&1; do sleep 1; printf "."; done
until kctl get customresourcedefinitions alertmanagers.monitoring.coreos.com > /dev/null 2>&1; do sleep 1; printf "."; done
until kctl get servicemonitors.monitoring.coreos.com > /dev/null 2>&1; do sleep 1; printf "."; done
until kctl get prometheuses.monitoring.coreos.com > /dev/null 2>&1; do sleep 1; printf "."; done
until kctl get alertmanagers.monitoring.coreos.com > /dev/null 2>&1; do sleep 1; printf "."; done
echo "done!"
kctl apply -f manifests/node-exporter
kctl apply -f manifests/kube-state-metrics
kctl apply -f manifests/grafana/grafana-credentials.yaml
kctl apply -f manifests/grafana
find manifests/prometheus -type f ! -name prometheus-k8s-roles.yaml ! -name prometheus-k8s-role-bindings.yaml -exec kubectl --namespace "$NAMESPACE" apply -f {} \;
kubectl apply -f manifests/prometheus/prometheus-k8s-roles.yaml
kubectl apply -f manifests/prometheus/prometheus-k8s-role-bindings.yaml
kctl apply -f manifests/alertmanager/
从脚本上看,其实很简单,有木有..
# cd contrib/kube-prometheus/
# hack/cluster-monitoring/deploy
namespace "monitoring" created
clusterrolebinding "prometheus-operator" created
clusterrole "prometheus-operator" created
serviceaccount "prometheus-operator" created
service "prometheus-operator" created
deployment "prometheus-operator" created
Waiting for Operator to register custom resource definitions......done!
daemonset "node-exporter" created
service "node-exporter" created
clusterrolebinding "kube-state-metrics" created
clusterrole "kube-state-metrics" created
deployment "kube-state-metrics" created
rolebinding "kube-state-metrics" created
role "kube-state-metrics-resizer" created
serviceaccount "kube-state-metrics" created
service "kube-state-metrics" created
secret "grafana-credentials" created
secret "grafana-credentials" unchanged
configmap "grafana-dashboards-0" created
deployment "grafana" created
service "grafana" created
servicemonitor "alertmanager" created
servicemonitor "prometheus-operator" created
prometheus "k8s" created
servicemonitor "kubelet" created
servicemonitor "prometheus" created
service "prometheus-k8s" created
servicemonitor "node-exporter" created
servicemonitor "kube-scheduler" created
servicemonitor "kube-controller-manager" created
servicemonitor "kube-state-metrics" created
configmap "prometheus-k8s-rules" created
serviceaccount "prometheus-k8s" created
servicemonitor "kube-apiserver" created
role "prometheus-k8s" created
role "prometheus-k8s" created
role "prometheus-k8s" created
clusterrole "prometheus-k8s" created
rolebinding "prometheus-k8s" created
rolebinding "prometheus-k8s" created
rolebinding "prometheus-k8s" created
clusterrolebinding "prometheus-k8s" created
secret "alertmanager-main" created
service "alertmanager-main" created
alertmanager "main" created
4、由于事先准备好了镜像,很快就运行起来了。
# kubectl get po -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 2/2 Running 0 11h
alertmanager-main-1 2/2 Running 0 11h
alertmanager-main-2 2/2 Running 0 11h
grafana-6b67b479d5-2hhnp 2/2 Running 0 11h
kube-state-metrics-6f7b5c94f-r8hm7 2/2 Running 0 11h
node-exporter-27744 1/1 Running 0 11h
node-exporter-9vhlv 1/1 Running 0 11h
node-exporter-rhjfb 1/1 Running 0 11h
node-exporter-xpqr8 1/1 Running 0 11h
prometheus-k8s-0 2/2 Running 0 11h
prometheus-k8s-1 2/2 Running 0 11h
prometheus-operator-8697c7fff9-mm8v5 1/1 Running 0 11h
这里曝光了三个服务:
- Prometheus UI on node port
30900
- Alertmanager UI on node port
30903
- Grafana on node port
30902
通过相应地端口就能访问对应的服务。
5、添加controller-manager和scheduler的监控。
# kubectl apply -f -f manifests/k8s/ -n kube-system
这里其实是添加了2个service,注意namespace是kube-system,而不是monitoring:
# kubectl get ep -n kube-system | grep discovery
kube-controller-manager-prometheus-discovery 192.168.5.104:10252,192.168.5.105:10252,192.168.5.42:10252 3d
kube-scheduler-prometheus-discovery 192.168.5.104:10251,192.168.5.105:10251,192.168.5.42:10251 3d
6、由于我的kubernetes是高可用架构,有三个apiserver,这里有个bug有个需要处理。apiserver是无状态的,三个endpoint会自身冲突,需要是在kubernetes1.9以上,通过给apiserver传递一个参数--endpoint-reconciler-type=lease
解决该问题。
默认是这个样子:
解决完是这个样子:
7、添加邮件报警规则。这个配置是用base64加密过的。
# vim manifests/alertmanager/alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
name: alertmanager-main
data:
alertmanager.yaml: Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KICBzbXRwX3NtYXJ0aG9zdDogIm1haWwub3VwZW5nLmNvbToyNSIKICBzbXRwX2Zyb206ICJuYWdpb3NfbW9uaXRvckBvdXBlbmcuY29tIgogIHNtdHBfYXV0aF91c2VybmFtZTogIm5hZ2lvc19tb25pdG9yQG91cGVuZy5jb20iCiAgc210cF9hdXRoX3Bhc3N3b3JkOiAiZGVsbGRlbGwiCnJvdXRlOgogIGdyb3VwX2J5OiBbJ2FsZXJ0bmFtZScsICdjbHVzdGVyJywgJ3NlcnZpY2UnXQogIGdyb3VwX3dhaXQ6IDMwcwogIGdyb3VwX2ludGVydmFsOiA1bQogIHJlcGVhdF9pbnRlcnZhbDogM2gKICByZWNlaXZlcjogdGVhbS1YLW1haWxzCiAgcm91dGVzOgogIC0gbWF0Y2hfcmU6CiAgICAgIGFsZXJ0bmFtZTogXihob3N0X2NwdV91c2FnZXxub2RlX2ZpbGVzeXN0ZW1fZnJlZXxob3N0X2Rvd24pJAogICAgcmVjZWl2ZXI6IHRlYW0tWC1tYWlscwogICAgcm91dGVzOgogICAgLSBtYXRjaDoKICAgICAgICBzZXZlcml0eTogY3JpdGljYWwKICAgICAgcmVjZWl2ZXI6IHRlYW0tWC1tYWlscwpyZWNlaXZlcnM6Ci0gbmFtZTogInRlYW0tWC1tYWlscyIKICBlbWFpbF9jb25maWdzOgogIC0gdG86ICJuaG9yaXpvbi1zYUBvdXBlbmcuY29tIgo=
要拿到配置内容,用base64反解一下就可以了,默认配置是这样:
# echo "加密的内容" | base64 -d
global:
resolve_timeout: 5m
route:
group_by: ['job']
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
receiver: 'null'
routes:
- match:
alertname: DeadMansSwitch
receiver: 'null'
receivers:
- name: 'null'
添加自己的邮件设置:
global:
resolve_timeout: 5m
smtp_smarthost: "mail.xxxx.com:25"
smtp_from: "example@xxxx.com"
smtp_auth_username: "expample@xxxx.com"
smtp_auth_password: "pass"
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 15m
receiver: team-X-mails
routes:
- match_re:
alertname: ^(host_cpu_usage|node_filesystem_free|host_down)$
receiver: team-X-mails
routes:
- match:
severity: critical
receiver: team-X-mails
receivers:
- name: "team-X-mails"
email_configs:
- to: "example@xxxx.com"
定义好之后,用base64加密一下,替换掉之前的地方。然后应用配置:
# kubectl apply -f manifests/alertmanager/alertmanager-config.yaml -n monitoring
这时候就能收到报警了:
8、到这里部署就完成了,贴一下几个页面:
grafana
prometheus
alertmanager
补充
正常情况下,DeadMansSwitch
会一直存在,为了让它不报警,这里我添加了一条匹配规则和一个receivernull
,并将报警发送到null:
global:
resolve_timeout: 5m
smtp_smarthost: "mail.xxxx.com:25"
smtp_from: "example@xxxx.com"
smtp_auth_username: "expample@xxxx.com"
smtp_auth_password: "pass"
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 15m
receiver: team-X-mails
routes:
- match_re:
alertname: ^(host_cpu_usage|node_filesystem_free|host_down)$
receiver: team-X-mails
routes:
- match:
severity: critical
receiver: team-X-mails
- match:
alertname: DeadMansSwitch
receiver: 'null'
receivers:
- name: "team-X-mails"
email_configs:
- to: "example@xxxx.com"
- name: 'null'
这样就不会一直收到报警了。
部署kube-prometheus,添加邮件报警的更多相关文章
- zabbix添加邮件报警机制
zabbix添加邮件报警机制 作者:尹正杰 还记得之前跟大家聊过的一个如何监控一个目录的话题吗?我们虽然监控出来数据了,也有数据了,但是,只是监控也没有用啊~因为我们不能24小时盯着屏幕然后 出了事情 ...
- 分布式监控系统Zabbix3.2给异常添加邮件报警
在前一篇 分布式监控系统Zabbix3.2跳坑指南 中已安装好服务端和客户端,此处客户端是被监控的服务器,可能有上百台服务器.监控的目的一个是可以查看历史状态,可以对比零晨和工作区间数据的对比,以便后 ...
- nagios部署+短信和邮件报警
操作系统 CentOS6.6 服务端:10.0.0.20 客户端:10.0.0.50 一.nagios的服务端安装部署 1.nagios安装 [root@manager src]# rzrz wai ...
- Prometheus + AlertManager 邮件报警
安装 wget https://github.com/prometheus/alertmanager/releases/download/v0.13.0/alertmanager-0.13.0.lin ...
- zabbix监控MySQL,Tomcat及配置邮件报警
目录 一.思路 二.部署.配置 环境 安装zabbix 对zabbix进行初步优化 添加监控主机 部署监控Tomcat 配置邮件报警 三.总结 一.思路 首先搭建zabbixserver,本机需要安装 ...
- Grafana邮件报警
一.概述 报警是Grafana的一项革命性功能,它让Grafana从一个数据可视化工具变成一个真正的任务监控工具.报警规则可以使用现有的图表控制面板设置,阈值可以通过拖拉右边的线控制,非常简单.Gra ...
- prometheus + grafana + node_exporter + alertmanager 的安装部署与邮件报警 (一)
大家一定要先看详细的理论教程,再开始搭建,这样报错后才容易找到突破口 参考文档 https://www.cnblogs.com/afterdawn/p/9020129.html https://www ...
- 分布式监控系统Zabbix-3.0.3-完整安装记录(5)-邮件报警部署
前面几篇陆续介绍了zabbix3.0.3监控系统的部署和监控项配置,今天这里分享下zabbix3.0.3的邮件报警的配置过程~由于采用sendmail发送邮件,常常会被认为是垃圾邮件被拒,所以不推荐这 ...
- Prometheus+AlertManager实现邮件报警
AlertManager下载 https://prometheus.io/download/ 解压 添加配置文件test.yml,配置收发邮件邮箱 参考配置: global: smtp_smartho ...
随机推荐
- 20155311 Exp3 免杀原理与实践
20155311 Exp3 免杀原理与实践 •免杀 一般是对恶意软件做处理,让它不被杀毒软件所检测.也是渗透测试中需要使用到的技术. [基础问题回答] (1)杀软是如何检测出恶意代码的? 1.通过特征 ...
- tensorflow batch
这两天一直在看tensorflow中的读取数据的队列,说实话,真的是很难懂.也可能我之前没这方面的经验吧,最早我都使用的theano,什么都是自己写.经过这两天的文档以及相关资料,并且请教了国内的师弟 ...
- [SPOJ2939]Qtree5
[SPOJ2939]Qtree5 Tags:题解 题意 链接 给你\(n\)个节点的黑白树,初始全黑.每次可以翻转某点颜色,或查询距离某点最近的白点的距离.\(n\le 10^5\).强制LCT,不准 ...
- CS100.1x Introduction to Big Data with Apache Spark
CS100.1x简介 这门课主要讲数据科学,也就是data science以及怎么用Apache Spark去分析大数据. Course Software Setup 这门课主要介绍如何编写和调试Py ...
- 数位DP模板详解
// pos = 当前处理的位置(一般从高位到低位) // pre = 上一个位的数字(更高的那一位) // status = 要达到的状态,如果为1则可以认为找到了答案,到时候用来返回, // 给计 ...
- 【Android UI设计与开发】第04期:引导界面(四)仿人人网V5.9.2最新版引导界面
这一篇我将会以人人网的引导界面为实例来展开详细的讲解,人人网的引导界面比较的新颖,不同于其他应用程序千篇一律的靠滑动来引导用户,而是以一个一个比较生动形象的动画效果展示在用户们的面前,有一种给人眼前一 ...
- BLCR技术初探
BLCR技术到底是什么技术?我没空和你乱扯,自己去看该官方网站的介绍:http://crd.lbl.gov/groups-depts/ftg/projects/current-projects/BLC ...
- 阿里云配置ssl证书
一.申请证书和下载证书(阿里云申请) 二.在nginx服务器上配置ssl证书 1.检查服务器是否安装openssl 2.在nginx conf 文件夹创建 cret 文件,放置证书 [root@web ...
- POW的重力之美
定律一:每一个UTXO都保持其状不变,直到有外力迫使它改变这种状态为止--艾萨克•牛顿,原理2.0 在过去的几年里,关于比特币的工作量证明(PoW)所造成的"巨大的能源浪费"已经被 ...
- Windows下fabric sdk连接Linux上fabric网络的调试过程
上个月刚入职一家公司从事区块链研发工作,选型采用Hyperledger Fabric作为开发平台.团队的小组成员全部采用的是在VirtualBox上面安装桌面版的Ubuntu 16.04虚拟机,开发工 ...