kubernetes(k8s) Prometheus+grafana监控告警安装部署
主机数据收集
主机数据的采集是集群监控的基础;外部模块收集各个主机采集到的数据分析就能对整个集群完成监控和告警等功能。一般主机数据采集和对外提供数据使用cAdvisor 和node-exporter等工具。
cAdvisor
概述
Kubernetes的生态中,cAdvisor是作为容器监控数据采集的Agent,其部署在每个节点上,内部代码结构大致如下:代码结构很良好,collector和storage部分基本可做到增量扩展开发。
关于cAdvisor支持自定义指标方式能力,其自身是通过容器部署的时候设置lable标签项:io.cadvisor.metric.开头的lable,而value则为自定义指标的配置文件,形如下:
{
"endpoint" : {
"protocol": "https",
"port": 8000,
"path": "/nginx_status"
},
"metrics_config" : [
{ "name" : "activeConnections",
"metric_type" : "gauge",
"units" : "number of active connections",
"data_type" : "int",
"polling_frequency" : 10,
"regex" : "Active connections: ([0-9]+)"
},
{ "name" : "reading",
"metric_type" : "gauge",
"units" : "number of reading connections",
"data_type" : "int",
"polling_frequency" : 10,
"regex" : "Reading: ([0-9]+) .*"
},
{ "name" : "writing",
"metric_type" : "gauge",
"data_type" : "int",
"units" : "number of writing connections",
"polling_frequency" : 10,
"regex" : ".*Writing: ([0-9]+).*"
},
{ "name" : "waiting",
"metric_type" : "gauge",
"units" : "number of waiting connections",
"data_type" : "int",
"polling_frequency" : 10,
"regex" : ".*Waiting: ([0-9]+)"
}
]
}
但kubernetes1.6开始 cAdvisor 被集成到kubernetes中,可以在安装kubernetes通过参数激活 cAdvisor
当前cAdvisor只支持http接口方式,也就是被监控容器应用必须提供http接口,所以能力较弱,如果我们在collector这一层做扩展增强,提供数据库,mq等等标准应用的监控模式是很有价值的。在此之前的另一种方案就是如上图所示搭配promethuese(其内置有非常丰富的标准应用的插件涵盖了APM所需的采集大部分插件),但是这往往会导致系统更复杂(如果应用层并非想使用promethuse)
在Kubernetes监控生态中,一般是如下的搭配使用:
Node-exporter
概述
node-exporter 运行在节点上采集节点主机本身的cpu和内存等使用信息,并对外提供获取主机性能开销的信息。
部署
下面是node-exporter在k8s下的部署文件
apiVersion: v1
kind: Service
metadata:
annotations:
prometheus.io/scrape: 'true'
labels:
app: node-exporter
name: node-exporter
name: node-exporter
namespace: kube-system
spec:
clusterIP: None
ports:
- name: scrape
port: 9100
protocol: TCP
selector:
app: node-exporter
type: ClusterIP
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
name: node-exporter
namespace: kube-system
spec:
template:
metadata:
labels:
app: node-exporter
name: node-exporter
spec:
containers:
- image: prom/node-exporter:latest
name: node-exporter
ports:
- containerPort: 9100
hostPort: 9100
name: scrape
hostNetwork: true
hostPID: true
restartPolicy: Always
监控
完成对kubernetes的监控, 监控收集数据一般有PULL和PUSH两种方式。PULL方式是监控平台从集群中的主机上主动拉取采集到的主机信息,而PUSH方式是主机将采集到的信息推送到监控平台。常用的监控平台是Prometheus,是采用PULL的方式采集主机信息。
Prometheus
概述
Prometheus 是源于 Google Borgmon 的一个系统监控和报警工具,用 Golang 语言开发。基本原理是通过 HTTP 协议周期性地抓取被监控组件的状态(pull 方式),这样做的好处是任意组件只要提供 HTTP 接口就可以接入监控系统,不需要任何 SDK 或者其他的集成过程。
这样做非常适合虚拟化环境比如 VM 或者 Docker ,故其为为数不多的适合 Docker、Mesos 、Kubernetes 环境的监控系统之一,被很多人称为下一代监控系统。
特性
- 自定义多维度的数据模型
- 非常高效的存储 平均一个采样数据占 ~3.5 bytes左右,320万的时间序列,每30秒采样,保持60天,消耗磁盘大概228G。
- 强大的查询语句
- 轻松实现数据可视化
优点
- 非常少的外部依赖,安装使用超简单
- 已经有非常多的系统集成 例如:docker HAProxy Nginx JMX等等
- 服务自动化发现
- 直接集成到代码
- 设计思想是按照分布式、微服务架构来实现的
组件
- Prometheus server
主要负责数据采集和存储,提供 PromQL 查询语言的支持; - Push Gateway
支持临时性 Job 主动推送指标的中间网关; - Exporters
提供被监控组件信息的 HTTP 接口被叫做 exporter ,目前互联网公司常用的组件大部分都有 exporter 可以直接使用,比如 Varnish、Haproxy、Nginx、MySQL、Linux 系统信息 (包括磁盘、内存、CPU、网络等等); - PromDash
使用 rails 开发的 dashboard,用于可视化指标数据; - WebUI
9090 端口提供的图形化功能; - Alertmanager
用来进行报警; - APIclients
提供 HTTPAPI 接口
部署
- 安装Rbac
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/proxy
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- apiGroups:
- extensions
resources:
- ingresses
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: kube-system
- 安装Configmap
# cat configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: kube-system
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
- job_name: 'kubernetes-cadvisor'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
- job_name: 'kubernetes-service-endpoints'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: kubernetes_name
- job_name: 'kubernetes-services'
kubernetes_sd_configs:
- role: service
metrics_path: /probe
params:
module: [http_2xx]
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
action: keep
regex: true
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: blackbox-exporter.example.com:9115
- source_labels: [__param_target]
target_label: instance
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
target_label: kubernetes_name
- job_name: 'kubernetes-ingresses'
kubernetes_sd_configs:
- role: ingress
relabel_configs:
- source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe]
action: keep
regex: true
- source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
regex: (.+);(.+);(.+)
replacement: ${1}://${2}${3}
target_label: __param_target
- target_label: __address__
replacement: blackbox-exporter.example.com:9115
- source_labels: [__param_target]
target_label: instance
- action: labelmap
regex: __meta_kubernetes_ingress_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_ingress_name]
target_label: kubernetes_name
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- 部署 Prometheus
---
apiVersion: apps/v1beta2
kind: Deployment
metadata:
labels:
name: prometheus-deployment
name: prometheus
namespace: kube-system
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
containers:
- image: prom/prometheus:v2.0.0
name: prometheus
command:
- "/bin/prometheus"
args:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention=24h"
ports:
- containerPort: 9090
protocol: TCP
volumeMounts:
- mountPath: "/prometheus"
name: data
- mountPath: "/etc/prometheus"
name: config-volume
resources:
requests:
cpu: 100m
memory: 100Mi
limits:
cpu: 500m
memory: 2500Mi
serviceAccountName: prometheus
volumes:
- name: data
emptyDir: {}
- name: config-volume
configMap:
name: prometheus-config
---
kind: Service
apiVersion: v1
metadata:
labels:
app: prometheus
name: prometheus
namespace: kube-system
spec:
type: NodePort
ports:
- port: 9090
targetPort: 9090
nodePort: 30003
selector:
app: prometheus
测试
访问集群中 http://prometheus地址:30003后。在Graph页面
输入:
node_cpu
查询命令可以看到节点cpu的使用信息。prometheus监控节点信息成功。
访问targets页面可以看到prometheus采集的监控信息的来源。
告警
Prometheus的告警是使用AlertManger来一同完成的。Prometheus在监控信息超过设定阀值时就将告警信息发送给AlertManger模块,AlertManger模块负责告警。
AlertManger
概述
Alertmanager与Prometheus是相互分离的两个组件。Prometheus服务器根据报警规则将警报发送给Alertmanager,然后Alertmanager将silencing、inhibition、aggregation等消息通过电子邮件、PaperDuty和HipChat发送通知。
设置警报和通知的主要步骤:
- 安装配置Alertmanager
- 配置Prometheus通过-alertmanager.url标志与Alertmanager通信
- 在Prometheus中创建告警触发规则。
- 在Alertmanager中设置告警通知规则
告警通知规则
Alertmanager处理由例如Prometheus服务器等客户端发来的警报。它负责删除重复数据、分组,并将警报通过路由发送到正确的接收器,比如电子邮件、Slack等。Alertmanager还支持groups,silencing和警报抑制的机制。
分组
分组是指将同一类型的警报分类为单个通知。当许多系统同时宕机时,很有可能成百上千的警报会同时生成,这种机制特别有用。
例如,当数十或数百个服务的实例在运行,网络发生故障时,有可能一半的服务实例不能访问数据库。在prometheus告警规则中配置为每一个服务实例都发送警报的话,那么结果是数百警报被发送至Alertmanager。
但是作为用户只想看到单一的报警页面,同时仍然能够清楚的看到哪些实例受到影响,因此,可以通过配置Alertmanager将警报分组打包,并发送一个相对看起来紧凑的通知。
分组警报、警报时间,以及接收警报的receiver是在alertmanager配置文件中通过路由树配置的。
抑制(Inhibition)
抑制是指当警报发出后,停止重复发送由此警报引发其他错误的警报的机制。(比如网络不可达,导致其他服务连接相关警报)
例如,当整个集群网络不可达,此时警报被触发,可以事先配置Alertmanager忽略由该警报触发而产生的所有其他警报,这可以防止通知数百或数千与此问题不相关的其他警报。
抑制机制也是通过Alertmanager的配置文件来配置。
沉默(Silences)
Silences是一种简单的特定时间不告警的机制。silences警告是通过匹配器(matchers)来配置,就像路由树一样。传入的警报会匹配RE,如果匹配,将不会为此警报发送通知。
silences报警机制可以通过Alertmanager的Web页面进行配置。
接收
使用Receiver定义各种通知用户的途径,告警经过分组,过滤处理后选择匹配的通知渠道发送给接收用户。
部署
报警触发规则
定义报警规则
报警规则通过以下格式定义:
ALERT <alert name>
IF <expression>
[ FOR <duration> ]
[ LABELS <label set> ]
[ ANNOTATIONS <label set> ]
可选的FOR语句,使得Prometheus在表达式输出的向量元素(例如高HTTP错误率的实例)之间等待一段时间,将警报计数作为触发此元素。如果元素是active,但是没有firing的,就处于pending状态。
LABELS(标签)语句允许指定一组标签附加警报上。将覆盖现有冲突的任何标签,标签值也可以被模板化。
ANNOTATIONS(注释)它们被用于存储更长的其他信息,例如警报描述或者链接,注释值也可以被模板化。
Templating(模板) 标签和注释值可以使用控制台模板进行模板化。labels变量保存警报实例的标签键/值对,value保存警报实例的评估值。
# To insert a firing element's label values:
{{ $labels.<labelname> }}
# To insert the numeric expression value of the firing element:
{{ $value }}
例子:
prometheus.yml中配置Prometheus和AlertManager通信的通信方式以及告警触发规则。
在prometheus.yml中配置Prometheus来和AlertManager通信
alerting:
alertmanagers:
- static_configs:
- targets: ["alertManager域名:9093"]
在prometheus.yml中指定匹配报警规则的间隔
# How frequently to evaluate rules.
[ evaluation_interval: <duration> | default = 1m ]
在prometheus.yml中指定规则文件(可使用通配符,如rules/*.rules)
rule_files:
- /etc/prometheus/rules.yml
其中rule_files就是用来指定报警规则的,这里我们将rules.yml用ConfigMap的形式挂载到/etc/prometheus目录下面即可:
rules.yml: |
groups:
- name: test-rule
rules:
- alert: NodeFilesystemUsage
expr: (node_filesystem_size{device="rootfs"} - node_filesystem_free{device="rootfs"}) / node_filesystem_size{device="rootfs"} * 100 > 80
for: 2m
labels:
team: node
annotations:
summary: "{{$labels.instance}}: High Filesystem usage detected"
description: "{{$labels.instance}}: Filesystem usage is above 80% (current value is: {{ $value }}"
- alert: NodeMemoryUsage
expr: (node_memory_MemTotal - (node_memory_MemFree+node_memory_Buffers+node_memory_Cached )) / node_memory_MemTotal * 100 > 80
for: 2m
labels:
team: node
annotations:
summary: "{{$labels.instance}}: High Memory usage detected"
description: "{{$labels.instance}}: Memory usage is above 80% (current value is: {{ $value }}"
- alert: NodeCPUUsage
expr: (100 - (avg by (instance) (irate(node_cpu{job="kubernetes-node-exporter",mode="idle"}[5m])) * 100)) > 80
for: 2m
labels:
team: node
annotations:
summary: "{{$labels.instance}}: High CPU usage detected"
description: "{{$labels.instance}}: CPU usage is above 80% (current value is: {{ $value }}"
配置文件设置好后,需要让prometheus重新读取,有两种方法:
通过HTTP API向/-/reload发送POST请求,
例:curl -X POST http://localhost:9090/-/reload向prometheus进程发送SIGHUP信号
告警通知规则
全局
要指定加载的配置文件,需要使用-config.file标志。该文件使用YAML来完成,通过下面的描述来定义。带括号的参数表示是可选的,对于非列表的参数的值,将被设置为指定的缺省值。
通用占位符定义解释:
<duration> : 与正则表达式匹配的持续时间值,[0-9]+(ms|[smhdwy])
<labelname>: 与正则表达式匹配的字符串,[a-zA-Z_][a-zA-Z0-9_]*
<labelvalue>: unicode字符串
<filepath>: 有效的文件路径
<boolean>: boolean类型,true或者false
<string>: 字符串
<tmpl_string>: 模板变量字符串
global全局配置文件参数在所有配置上下文生效,作为其他配置项的默认值,可被覆盖.
global:
resolve_timeout: 30s
smtp_smarthost: "smtp.163.com:25"
smtp_from: 'xiyanxiyan10@163.com'
smtp_auth_username: "xiyanxiyan10@163.com"
smtp_auth_password: "xiyanxiyan10"
smtp_require_tls: false
路由(route)
路由块定义了路由树及其子节点。如果没有设置的话,子节点的可选配置参数从其父节点继承。
每个警报都会在配置的顶级路由中进入路由树,该路由树必须匹配所有警报(即没有任何配置的匹配器)。然后遍历子节点。如果continue的值设置为false,它在第一个匹配的子节点之后就停止;如果continue的值为true,警报将继续进行后续子节点的匹配。如果警报不匹配任何节点的任何子节点(没有匹配的子节点,或不存在),该警报基于当前节点的配置处理。
route:
receiver: mailhook
group_wait: 30s
group_interval: 1m
repeat_interval: 1m
group_by: [NodeMemoryUsage, NodeCPUUsage, NodeFilesystemUsage]
routes:
- receiver: mailhook
group_wait: 10s
match:
team: node
设置alertmanager.yml的的route与receivers
route:
# The labels by which incoming alerts are grouped together. For example,
# multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# be batched into a single group.
group_by: ['alertname']
# When a new group of alerts is created by an incoming alert, wait at
# least 'group_wait' to send the initial notification.
# This way ensures that you get multiple alerts for the same group that start
# firing shortly after another are batched together on the first
# notification.
group_wait: 5s
# When the first notification was sent, wait 'group_interval' to send a batch
# of new alerts that started firing for that group.
group_interval: 1m
# If an alert has successfully been sent, wait 'repeat_interval' to
# resend them.
repeat_interval: 3h
# A default receiver
receiver: mengyuan
receivers:
- name: 'mengyuan'
webhook_configs:
- url: http://192.168.0.53:8080
email_configs:
- to: 'xiyanxiyan10@hotmail.com'
路由配置格式
#报警接收器
[ receiver: <string> ]
#分组
[ group_by: '[' <labelname>, ... ']' ]
# Whether an alert should continue matching subsequent sibling nodes.
[ continue: <boolean> | default = false ]
# A set of equality matchers an alert has to fulfill to match the node.
#根据匹配的警报,指定接收器
match:
[ <labelname>: <labelvalue>, ... ]
# A set of regex-matchers an alert has to fulfill to match the node.
match_re:
#根据匹配正则符合的警告,指定接收器
[ <labelname>: <regex>, ... ]
# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> ]
# How long to wait before sending notification about new alerts that are
# in are added to a group of alerts for which an initial notification
# has already been sent. (Usually ~5min or more.)
[ group_interval: <duration> ]
# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> ]
# Zero or more child routes.
routes:
[ - <route> ... ]
例子:
// Match does a depth-first left-to-right search through the route tree
// and returns the matching routing nodes.
func (r *Route) Match(lset model.LabelSet) []*Route {
Alert
Alert是alertmanager接收到的报警,类型如下。
// Alert is a generic representation of an alert in the Prometheus eco-system.
type Alert struct {
// Label value pairs for purpose of aggregation, matching, and disposition
// dispatching. This must minimally include an "alertname" label.
Labels LabelSet `json:"labels"`
// Extra key/value information which does not define alert identity.
Annotations LabelSet `json:"annotations"`
// The known time range for this alert. Both ends are optional.
StartsAt time.Time `json:"startsAt,omitempty"`
EndsAt time.Time `json:"endsAt,omitempty"`
GeneratorURL string `json:"generatorURL"`
}
具有相同Lables的Alert(key和value都相同)才会被认为是同一种。在prometheus rules文件配置的一条规则可能会产生多种报警
抑制规则 inhibit_rule
抑制规则,是存在另一组匹配器匹配的情况下,使其他被引发警报的规则静音。这两个警报,必须有一组相同的标签。
抑制配置格式
# Matchers that have to be fulfilled in the alerts to be muted.
##必须在要需要静音的警报中履行的匹配者
target_match:
[ <labelname>: <labelvalue>, ... ]
target_match_re:
[ <labelname>: <regex>, ... ]
# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
#必须存在一个或多个警报以使抑制生效的匹配者。
source_match:
[ <labelname>: <labelvalue>, ... ]
source_match_re:
[ <labelname>: <regex>, ... ]
# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
#在源和目标警报中必须具有相等值的标签才能使抑制生效
[ equal: '[' <labelname>, ... ']' ]
接收器(receiver)
顾名思义,警报接收的配置。
通用配置格式
# The unique name of the receiver.
name: <string>
# Configurations for several notification integrations.
email_configs:
[ - <email_config>, ... ]
pagerduty_configs:
[ - <pagerduty_config>, ... ]
slack_config:
[ - <slack_config>, ... ]
opsgenie_configs:
[ - <opsgenie_config>, ... ]
webhook_configs:
[ - <webhook_config>, ... ]
邮件接收器email_config
# Whether or not to notify about resolved alerts.
#警报被解决之后是否通知
[ send_resolved: <boolean> | default = false ]
# The email address to send notifications to.
to: <tmpl_string>
# The sender address.
[ from: <tmpl_string> | default = global.smtp_from ]
# The SMTP host through which emails are sent.
[ smarthost: <string> | default = global.smtp_smarthost ]
# The HTML body of the email notification.
[ html: <tmpl_string> | default = '{{ template "email.default.html" . }}' ]
# Further headers email header key/value pairs. Overrides any headers
# previously set by the notification implementation.
[ headers: { <string>: <tmpl_string>, ... } ]
Slcack接收器slack_config
# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]
# The Slack webhook URL.
[ api_url: <string> | default = global.slack_api_url ]
# The channel or user to send notifications to.
channel: <tmpl_string>
# API request data as defined by the Slack webhook API.
[ color: <tmpl_string> | default = '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}' ]
[ username: <tmpl_string> | default = '{{ template "slack.default.username" . }}'
[ title: <tmpl_string> | default = '{{ template "slack.default.title" . }}' ]
[ title_link: <tmpl_string> | default = '{{ template "slack.default.titlelink" . }}' ]
[ pretext: <tmpl_string> | default = '{{ template "slack.default.pretext" . }}' ]
[ text: <tmpl_string> | default = '{{ template "slack.default.text" . }}' ]
[ fallback: <tmpl_string> | default = '{{ template "slack.default.fallback" . }}' ]
Webhook接收器webhook_config
# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]
# The endpoint to send HTTP POST requests to.
url: <string>
Alertmanager会使用以下的格式向配置端点发送HTTP POST请求:
{
"version": "3",
"groupKey": <number> // key identifying the group of alerts (e.g. to deduplicate)
"status": "<resolved|firing>",
"receiver": <string>,
"groupLabels": <object>,
"commonLabels": <object>,
"commonAnnotations": <object>,
"externalURL": <string>, // backling to the Alertmanager.
"alerts": [
{
"labels": <object>,
"annotations": <object>,
"startsAt": "<rfc3339>",
"endsAt": "<rfc3339>"
},
...
]
}
例子:
receivers:
- name: mailhook
email_configs:
- to: "xiyanxiyan10@hotmail.com"
html: '{{ template "alert.html" . }}'
headers: { Subject: "[WARN] Warn Email" }
集群信息展示
集群的整体信息已经收集汇总在Prometheus中,但Prometheus主要是对外提供数据获取接口,并不负责完成完善的图形展示,因此需要使用DashBoard工具对接Prometheus完成集群信息的图形化展示.
Grafana
概述
grafana 是一款采用 go 语言编写的开源应用,主要用于大规模指标数据的可视化展现,基于商业友好的 Apache License 2.0 开源协议。grafana有热插拔控制面板和可扩展的数据源,目前已经支持绝大部分常用的时序数据库。
目前grafana支持的数据源
支持的数据源
上文已经提到,Grafana支持很多的数据源,主要支持的有如下数据源:
部署
部署服务
# cat grafana-deploy.yaml
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: grafana-core
namespace: kube-system
labels:
app: grafana
component: core
spec:
replicas: 1
template:
metadata:
labels:
app: grafana
component: core
spec:
containers:
- image: grafana/grafana:4.2.0
name: grafana-core
imagePullPolicy: IfNotPresent
# env:
resources:
# keep request = limit to keep this container in guaranteed class
limits:
cpu: 100m
memory: 100Mi
requests:
cpu: 100m
memory: 100Mi
env:
# The following env variables set up basic auth twith the default admin user and admin password.
- name: GF_AUTH_BASIC_ENABLED
value: "true"
- name: GF_AUTH_ANONYMOUS_ENABLED
value: "false"
# - name: GF_AUTH_ANONYMOUS_ORG_ROLE
# value: Admin
# does not really work, because of template variables in exported dashboards:
# - name: GF_DASHBOARDS_JSON_ENABLED
# value: "true"
readinessProbe:
httpGet:
path: /login
port: 3000
# initialDelaySeconds: 30
# timeoutSeconds: 1
volumeMounts:
- name: grafana-persistent-storage
mountPath: /var
volumes:
- name: grafana-persistent-storage
emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
name: grafana
namespace: kube-system
labels:
app: grafana
component: core
spec:
type: NodePort
ports:
- port: 3000
targetPort: 3000
nodePort: 30009
selector:
app: grafana
配置grafana
访问Grafana部署的端口可以看到以下页面
默认用户名和密码都是admin
配置数据源为prometheus
导入面板之后就可以看到对应的监控数据了。
示例配置
kubernetes(k8s) Prometheus+grafana监控告警安装部署的更多相关文章
- [k8s]prometheus+grafana监控node和mysql(普罗/grafana均vm安装)
https://github.com/prometheus/prometheus Architecture overview Prometheus Server Prometheus Server 负 ...
- [转帖]安装prometheus+grafana监控mysql redis kubernetes等
安装prometheus+grafana监控mysql redis kubernetes等 https://www.cnblogs.com/sfnz/p/6566951.html plug 的模式进行 ...
- 部署Prometheus+Grafana监控
Prometheus 1.不是很友好,各种配置都手写 2.对docker和k8s监控有成熟解决方案 Prometheus(普罗米修斯) 是一个最初在SoudCloud上构建的监控系统,开源项目,拥有非 ...
- [转帖]Prometheus+Grafana监控Kubernetes
原博客的位置: https://blog.csdn.net/shenhonglei1234/article/details/80503353 感谢原作者 这里记录一下自己试验过程中遇到的问题: . 自 ...
- Prometheus+Grafana监控Kubernetes
涉及文件下载地址:链接:https://pan.baidu.com/s/18XHK7ex_J0rzTtfW-QA2eA 密码:0qn6 文件中需要下载的镜像需要自己提前下载好,eg:prom/node ...
- Prometheus + Grafana 监控系统搭
本文主要介绍基于Prometheus + Grafana 监控Linux服务器. 一.Prometheus 概述(略) 与其他监控系统对比 1 Prometheus vs. Zabbix Zabbix ...
- 【Springboot】用Prometheus+Grafana监控Springboot应用
1 简介 项目越做越发觉得,任何一个系统上线,运维监控都太重要了.关于Springboot微服务的监控,之前写过[Springboot]用Springboot Admin监控你的微服务应用,这个方案可 ...
- cAdvisor+Prometheus+Grafana监控docker
cAdvisor+Prometheus+Grafana监控docker 一.cAdvisor(需要监控的主机都要安装) 官方地址:https://github.com/google/cadvisor ...
- Prometheus+Grafana监控SpringBoot
Prometheus+Grafana监控SpringBoot 一.Prometheus监控SpringBoot 1.1 pom.xml添加依赖 1.2 修改application.yml配置文件 1. ...
随机推荐
- ubuntu安装vbox虚拟机
ubuntu安装vbox虚拟机 一.安装准备 1.查看主机配置 二.下载安装包 (建议将安装包下载并保存) a.下载virtualbox安装包 下载链接https://www.virtualbo ...
- JAVA web 框架集合
“框架”犹如滔滔江水连绵不绝, 知道有它就好,先掌握自己工作和主流的框架: 在研究好用和新框架. 主流框架教程分享在Java帮帮-免费资源网 其他教程需要时间制作,会陆续分享!!! 152款框架,你还 ...
- NOIP 2008 传球游戏
洛谷 P1057 传球游戏 洛谷传送门 JDOJ 1536: [NOIP2008]传球游戏 T3 JDOJ传送门 Description 上体育课的时候,小蛮的老师经常带着同学们一起做游戏.这次, ...
- Linux中的关机操作
shutdown -h now //马上停止服务进行关机 shutdown -h 12:00 .//在12点后进行关机 shutdown -h +10 //在10分钟后进行关机 shutdown ...
- 远程windows
1. 起因 因为经常用teamviewer,所以断定我是商业用户,不允许我用了.想买一个授权,结果太贵了,1700多.使用了很多其他的,向日葵卡顿,有的窗口点不到,vnc慢,效果差,卡顿,还收费,等等 ...
- Android Q Beta 6 终极测试版发布!
前言 当今手机市场可谓是百花齐放,但手机系统却屈指可数,其中Android和iOS就占据了整个手机系统市场的99%,单单Android就占据了整个手机系统市场的86%,可谓是占据绝对优势. 其 ...
- Java实现AES加密(window机器和linux机器) 注意window机器 和linux机器算法稍有不同
一)什么是AES? 高级加密标准(英语:Advanced Encryption Standard,缩写:AES),是一种区块加密标准.这个标准用来替代原先的DES,已经被多方分析且广为全世界所使用. ...
- 关于“100g文件全是数组,取最大的100个数”解决方法汇总
原题如下: 有一个100G大小的文件里存的全是数字,并且每个数字见用逗号隔开.现在在这一大堆数字中找出100个最大的数出来. 我认为,首先要摸清考官的意图.是想问你os方面的知识,还是算法,或者数据结 ...
- Javascript Asynchronous Investigation
介绍 同步任务:在主线程上排队执行的任务,只有前一个任务执行完毕,才能执行后一个任务: 异步任务:不进入主线程,而进入任务队列中的任务,只有任务队列通知主线程,某个异步任务可以执行了,这个任务才会进入 ...
- 关于微信小程序前端Canvas组件教程
关于微信小程序前端Canvas组件教程 微信小程序Canvas接口函数 上述为微信小程序Canvas的内部接口,通过熟练使用Canvas,即可画出较为美观的前端页面.下面是使用微信小程序画图的一些 ...