Reference: https://www.cnblogs.com/sanduzxcvbnm/p/16291296.html

YAML files used in this chapter: https://files.cnblogs.com/files/sanduzxcvbnm/operator_yaml.zip?t=1654593400

Background

This chapter follows the official documentation for the deployment, resolves the various problems that come up along the way, and adds a few optimizations.

Any details not spelled out here can be adjusted to fit your own environment.

Installation

git clone https://github.com/coreos/kube-prometheus.git
cd kube-prometheus/manifests

Two files need their image registry changed, otherwise the images cannot be pulled:

File 1: kubeStateMetrics-deployment.yaml => change k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.4.2 to bitnami/kube-state-metrics:2.4.2

File 2: prometheusAdapter-deployment.yaml => change k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.9.1 to selina5288/prometheus-adapter:v0.9.1
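
A quick way to make both replacements in place (a sketch, assuming GNU sed; double-check the resulting manifests before applying):

sed -i 's#k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.4.2#bitnami/kube-state-metrics:2.4.2#' kubeStateMetrics-deployment.yaml
sed -i 's#k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.9.1#selina5288/prometheus-adapter:v0.9.1#' prometheusAdapter-deployment.yaml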

Three files need their apiVersion changed (the cluster here runs Kubernetes 1.20.11, where PodDisruptionBudget is still served as v1beta1, so change it to policy/v1beta1):

File 1: alertmanager-podDisruptionBudget.yaml => change apiVersion: policy/v1 to apiVersion: policy/v1beta1

File 2: prometheus-podDisruptionBudget.yaml => change apiVersion: policy/v1 to apiVersion: policy/v1beta1

File 3: prometheusAdapter-podDisruptionBudget.yaml => change apiVersion: policy/v1 to apiVersion: policy/v1beta1
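
You can confirm which versions of the policy API group the cluster serves before editing (a sketch):

kubectl api-versions | grep policy
# on 1.20 this prints policy/v1beta1; policy/v1 only appears from 1.21 onward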

New files to create, saved in the manifests directory:

File 1: prometheus-kubeControllerManagerService.yaml

apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager
  labels:
    app.kubernetes.io/name: kube-controller-manager
spec:
  clusterIP: None
  selector:
    component: kube-controller-manager
  ports:
  - name: https-metrics
    port: 10257
    targetPort: 10257
    protocol: TCP

File 2: prometheus-kubeSchedulerService.yaml

apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    app.kubernetes.io/name: kube-scheduler
spec:
  clusterIP: None
  selector:
    component: kube-scheduler
  ports:
  - name: https-metrics
    port: 10259
    targetPort: 10259
    protocol: TCP

# Install: create the CRDs under setup/ first, then apply the rest of the manifests.
# (Running kubectl apply -f setup/ fails with: The CustomResourceDefinition "prometheuses.monitoring.coreos.com" is invalid: metadata.annotations: Too long: must have at most 262144 bytes)
# (Alternatively, run kubectl apply -f setup/ first and, once the error above appears, create the failing file separately: kubectl create -f setup/0prometheusCustomResourceDefinition.yaml)
kubectl create -f setup/
kubectl apply -f .
kubectl get pods -n monitoring
kubectl get svc -n monitoring

Access

The installation creates a ClusterIP Service for each of grafana, alertmanager and prometheus. To reach these services from outside the cluster you can either create Ingress objects or switch the Services to type NodePort. For simplicity we use NodePort here: edit the grafana, alertmanager-main and prometheus-k8s Services and change their type to NodePort:

# change type: ClusterIP to type: NodePort
$ kubectl edit svc grafana -n monitoring
$ kubectl edit svc alertmanager-main -n monitoring
$ kubectl edit svc prometheus-k8s -n monitoring
$ kubectl get svc -n monitoring

Note: at this point accessing the services from a browser returns a 504 error. The cause is the NetworkPolicy objects that were installed; deleting the corresponding network policies fixes it. The same fix applies when access through an Ingress fails.
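
For example, to inspect and remove the policies in the monitoring namespace (a sketch; the policy names shown are the ones shipped by recent kube-prometheus releases, so check the list output first, and consider loosening the policies instead of deleting them in production):

kubectl get networkpolicy -n monitoring
kubectl delete networkpolicy grafana alertmanager-main prometheus-k8s -n monitoring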

Alternatively, create the corresponding Ingress objects.

The local hosts file needs entries for the custom domains used below (see the example after the Ingress manifests):

# cat alertmanager-ingress.yaml
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: alertmanager-ingress
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
  - host: www.fff.com # custom domain, resolved via the local hosts file
    http:
      paths:
      - backend:
          service:
            name: alertmanager-main
            port:
              number: 9093
        path: /
        pathType: Prefix

# cat grafana-ingress.yaml
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana-ingress
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
  - host: www.eee.com # custom domain, resolved via the local hosts file
    http:
      paths:
      - backend:
          service:
            name: grafana
            port:
              number: 3000
        path: /
        pathType: Prefix

# cat prometheus-ingress.yaml
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ingress
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
  - host: www.ddd.com # custom domain, resolved via the local hosts file
    http:
      paths:
      - backend:
          service:
            name: prometheus-k8s
            port:
              number: 9090
        path: /
        pathType: Prefix
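
Apply the three manifests, then point the custom domains at a node where the ingress-nginx controller is reachable (the IP below is only a placeholder):

kubectl apply -f alertmanager-ingress.yaml -f grafana-ingress.yaml -f prometheus-ingress.yaml

# /etc/hosts on the local machine (replace 192.168.1.10 with your ingress node IP)
192.168.1.10 www.fff.com www.eee.com www.ddd.com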

For the first Grafana login use admin:admin. After reaching the home page you will find that Grafana already ships with a large number of pre-built dashboards.

Monitoring the kube-controller-manager and kube-scheduler system components

The installation step already added the two files prometheus-kubeControllerManagerService.yaml and prometheus-kubeSchedulerService.yaml, yet the corresponding targets in Prometheus remain unreachable. The reason is that kube-controller-manager and kube-scheduler bind their secure ports to 127.0.0.1 instead of 0.0.0.0.

Fix:

vim /etc/kubernetes/manifests/kube-controller-manager.yaml
# change --bind-address=127.0.0.1 to --bind-address=0.0.0.0

vim /etc/kubernetes/manifests/kube-scheduler.yaml
# change --bind-address=127.0.0.1 to --bind-address=0.0.0.0

Because kube-controller-manager and kube-scheduler run as static Pods, editing the corresponding YAML files in the static Pod directory is enough; the services restart automatically a short while later.
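
To confirm the change took effect, check on a control-plane node that the ports are now bound on all interfaces (a sketch):

ss -tlnp | grep -E '10257|10259'   # expect 0.0.0.0:10257 / 0.0.0.0:10259 (or *:10257 / *:10259) rather than 127.0.0.1
kubectl get pods -n kube-system | grep -E 'kube-controller-manager|kube-scheduler'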

Configuring custom monitoring rules with PrometheusRule

To define a custom alerting rule, all that is needed is a PrometheusRule object carrying the labels prometheus=k8s and role=alert-rules, for example:

Note: the labels must include at least prometheus=k8s and role=alert-rules.

# prometheus-etcdRules.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s # required
    role: alert-rules # required
  name: etcd-rules
  namespace: monitoring
spec:
  groups:
  - name: etcd # the actual alerting rules
    rules:
    - alert: EtcdClusterUnavailable
      annotations:
        summary: etcd cluster small
        description: If one more etcd peer goes down the cluster will be unavailable
      expr: |
        count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1)
      for: 3m
      labels:
        severity: critical
# kubectl apply -f prometheus-etcdRules.yaml
prometheusrule.monitoring.coreos.com/etcd-rules created
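
To confirm the rule object was created and loaded, you can check the object itself and the generated rule files inside the Prometheus Pod (a sketch; the rulefiles directory name follows the prometheus-operator default and may differ in your cluster):

kubectl get prometheusrule etcd-rules -n monitoring
kubectl exec -it prometheus-k8s-0 -c prometheus -n monitoring -- ls /etc/prometheus/rules/prometheus-k8s-rulefiles-0/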

Configuring WeChat Work (企业微信) alerting

Modify alertmanager-secret.yaml directly, add the alerting parameters, and then re-apply the resource object.

All alerts except Watchdog are sent through WeChat Work.

apiVersion: v1
kind: Secret
metadata:
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.24.0
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "5m"
    "inhibit_rules":
    - "equal":
      - "namespace"
      - "alertname"
      "source_matchers":
      - "severity = critical"
      "target_matchers":
      - "severity =~ warning|info"
    - "equal":
      - "namespace"
      - "alertname"
      "source_matchers":
      - "severity = warning"
      "target_matchers":
      - "severity = info"
    - "equal":
      - "namespace"
      "source_matchers":
      - "alertname = InfoInhibitor"
      "target_matchers":
      - "severity = info"
    "receivers":
    - "name": "Default"
      "wechat_configs":
      - corp_id: 'xxx' # fill in for your environment
        api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
        send_resolved: true
        to_party: '2' # fill in for your environment
        agent_id: 1000005 # fill in for your environment
        api_secret: 'xxx' # fill in for your environment
    - "name": "Watchdog"
    - "name": "Critical"
      "wechat_configs":
      - corp_id: 'xxx' # fill in for your environment
        api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
        send_resolved: true
        to_party: '2' # fill in for your environment
        agent_id: 1000005 # fill in for your environment
        api_secret: 'xxx' # fill in for your environment
    - "name": "null"
      "wechat_configs":
      - corp_id: 'xxx' # fill in for your environment
        api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
        send_resolved: true
        to_party: '2' # fill in for your environment
        agent_id: 1000005 # fill in for your environment
        api_secret: 'xxx' # fill in for your environment
    "route":
      "group_by":
      - "namespace"
      "group_interval": "5m"
      "group_wait": "30s"
      "receiver": "Default"
      "repeat_interval": "12h"
      "routes":
      - "matchers":
        - "alertname = Watchdog"
        "receiver": "Watchdog"
      - "matchers":
        - "alertname = InfoInhibitor"
        "receiver": "null"
      - "matchers":
        - "severity = critical"
        "receiver": "Critical"
type: Opaque
# Apply the updated file and alerts will start arriving
$ kubectl apply -f alertmanager-secret.yaml
secret/alertmanager-main configured

Note: running kubectl apply -f alertmanager-secret.yaml creates a Secret named alertmanager-main whose content is the alertmanager.yaml file.

To add a custom WeChat Work alert template, there are two approaches:

The first is to keep adding the template content inside alertmanager-secret.yaml and apply it with the same apply command:

apiVersion: v1
kind: Secret
metadata:
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.24.0
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "5m"
    "inhibit_rules":
    - "equal":
      - "namespace"
      - "alertname"
      "source_matchers":
      - "severity = critical"
      "target_matchers":
      - "severity =~ warning|info"
    - "equal":
      - "namespace"
      - "alertname"
      "source_matchers":
      - "severity = warning"
      "target_matchers":
      - "severity = info"
    - "equal":
      - "namespace"
      "source_matchers":
      - "alertname = InfoInhibitor"
      "target_matchers":
      - "severity = info"
    "receivers":
    - "name": "Default"
      "wechat_configs":
      - corp_id: 'ww0b85c21458a13b12' # adjust to your environment
        api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
        send_resolved: true
        to_party: '2' # adjust to your environment
        agent_id: 1000005 # adjust to your environment
        api_secret: 'xxx' # adjust to your environment
    - "name": "Watchdog"
    - "name": "Critical"
      "wechat_configs":
      - corp_id: 'ww0b85c21458a13b12' # adjust to your environment
        api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
        send_resolved: true
        to_party: '2' # adjust to your environment
        agent_id: 1000005 # adjust to your environment
        api_secret: 'xxx' # adjust to your environment
    - "name": "null"
      "wechat_configs":
      - corp_id: 'ww0b85c21458a13b12' # adjust to your environment
        api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
        send_resolved: true
        to_party: '2' # adjust to your environment
        agent_id: 1000005 # adjust to your environment
        api_secret: 'xxx' # adjust to your environment
    "route":
      "group_by":
      - "namespace"
      "group_interval": "5m"
      "group_wait": "30s"
      "receiver": "Default"
      "repeat_interval": "12h"
      "routes":
      - "matchers":
        - "alertname = Watchdog"
        "receiver": "Watchdog"
      - "matchers":
        - "alertname = InfoInhibitor"
        "receiver": "null"
      - "matchers":
        - "severity = critical"
        "receiver": "Critical"
    "templates":
    - 'wechat_template.tmpl'
  wechat_template.tmpl: |-
    {{ define "wechat.default.message" }}
    {{- if gt (len .Alerts.Firing) 0 -}}
    {{- range $index, $alert := .Alerts -}}
    {{- if eq $index 0 }}
    ==========异常告警==========
    告警类型: {{ $alert.Labels.alertname }}
    告警级别: {{ $alert.Labels.severity }}
    告警详情: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};{{$alert.Annotations.summary}}
    故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{- if gt (len $alert.Labels.instance) 0 }}
    实例信息: {{ $alert.Labels.instance }}
    {{- end }}
    {{- if gt (len $alert.Labels.namespace) 0 }}
    命名空间: {{ $alert.Labels.namespace }}
    {{- end }}
    {{- if gt (len $alert.Labels.node) 0 }}
    节点信息: {{ $alert.Labels.node }}
    {{- end }}
    {{- if gt (len $alert.Labels.pod) 0 }}
    实例名称: {{ $alert.Labels.pod }}
    {{- end }}
    ============END============
    {{- end }}
    {{- end }}
    {{- end }}
    {{- if gt (len .Alerts.Resolved) 0 -}}
    {{- range $index, $alert := .Alerts -}}
    {{- if eq $index 0 }}
    ==========异常恢复==========
    告警类型: {{ $alert.Labels.alertname }}
    告警级别: {{ $alert.Labels.severity }}
    告警详情: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};{{$alert.Annotations.summary}}
    故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    恢复时间: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{- if gt (len $alert.Labels.instance) 0 }}
    实例信息: {{ $alert.Labels.instance }}
    {{- end }}
    {{- if gt (len $alert.Labels.namespace) 0 }}
    命名空间: {{ $alert.Labels.namespace }}
    {{- end }}
    {{- if gt (len $alert.Labels.node) 0 }}
    节点信息: {{ $alert.Labels.node }}
    {{- end }}
    {{- if gt (len $alert.Labels.pod) 0 }}
    实例名称: {{ $alert.Labels.pod }}
    {{- end }}
    ============END============
    {{- end }}
    {{- end }}
    {{- end }}
    {{- end }}
type: Opaque

Problem: when a firing alert and its resolved notification are sent in the same message, the recovery time shown in the resolved part is wrong.

However, when the resolved message is sent on its own, the recovery time it shows is correct.

In a standalone firing message the time is also displayed correctly.

The second approach is to create separate alertmanager.yaml and wechat.tmpl files and build the Secret with the create secret command.

Contents of alertmanager.yaml:

global:
  resolve_timeout: 5m
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
templates:
- '*.tmpl'
route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5m
  receiver: 'wechat'
  routes:
  - receiver: 'wechat'
    group_wait: 10s
    match:
      severity: warning
  - receiver: 'wechat'
    group_wait: 5s
    match:
      severity: critical
receivers:
- name: 'wechat'
  wechat_configs:
  - corp_id: 'xxx' # adjust to your environment
    agent_id: '1000005' # adjust to your environment
    api_secret: 'xxx' # adjust to your environment
    to_party: '2' # adjust to your environment
    send_resolved: true

Create a wechat.tmpl file:

{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
==========异常告警==========
告警类型: {{ $alert.Labels.alertname }}
告警级别: {{ $alert.Labels.severity }}
告警详情: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};{{$alert.Annotations.summary}}
故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
实例信息: {{ $alert.Labels.instance }}
{{- end }}
{{- if gt (len $alert.Labels.namespace) 0 }}
命名空间: {{ $alert.Labels.namespace }}
{{- end }}
{{- if gt (len $alert.Labels.node) 0 }}
节点信息: {{ $alert.Labels.node }}
{{- end }}
{{- if gt (len $alert.Labels.pod) 0 }}
实例名称: {{ $alert.Labels.pod }}
{{- end }}
============END============
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
==========异常恢复==========
告警类型: {{ $alert.Labels.alertname }}
告警级别: {{ $alert.Labels.severity }}
告警详情: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};{{$alert.Annotations.summary}}
故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
恢复时间: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
实例信息: {{ $alert.Labels.instance }}
{{- end }}
{{- if gt (len $alert.Labels.namespace) 0 }}
命名空间: {{ $alert.Labels.namespace }}
{{- end }}
{{- if gt (len $alert.Labels.node) 0 }}
节点信息: {{ $alert.Labels.node }}
{{- end }}
{{- if gt (len $alert.Labels.pod) 0 }}
实例名称: {{ $alert.Labels.pod }}
{{- end }}
============END============
{{- end }}
{{- end }}
{{- end }}
{{- end }}
# Delete the old secret
$ kubectl delete secret alertmanager-main -n monitoring
secret "alertmanager-main" deleted

# Create the new secret with the following command; note that it differs from the command used in the first approach
$ kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml --from-file=wechat.tmpl -n monitoring
secret/alertmanager-main configured

With either approach, you can verify whether the configuration took effect as follows (example commands after the list):

1. Check the Secrets in the cluster and confirm the one you created exists.

2. Check the alertmanager logs for errors.

3. Check the config page of the alertmanager web UI and confirm the configuration is shown there.

4. Exec into the alertmanager Pod and confirm the files exist.
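
A minimal sketch of those checks (Pod names and the config mount path follow the kube-prometheus defaults; adjust if yours differ):

kubectl get secret alertmanager-main -n monitoring
kubectl logs alertmanager-main-0 -c alertmanager -n monitoring | grep -iE 'error|configuration'
kubectl exec -it alertmanager-main-0 -c alertmanager -n monitoring -- ls /etc/alertmanager/config/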

Auto-discovery configuration

Services that carry the annotation prometheus.io/scrape=true will be discovered automatically. Save the scrape configuration below as prometheus-additional.yaml, then create a Secret from that file:

# cat prometheus-additional.yaml
- job_name: 'kubernetes-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name

# kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
secret "additional-configs" created
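
Any Service you want this job to scrape then only needs the matching annotations, for example (a hypothetical Service exposing metrics on port 8080):

apiVersion: v1
kind: Service
metadata:
  name: my-app # hypothetical example
  namespace: default
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: my-app
  ports:
  - name: http
    port: 8080
    targetPort: 8080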

In the file that declares the prometheus resource object, reference this extra configuration through the additionalScrapeConfigs field (prometheus-prometheus.yaml):

# cat prometheus-prometheus.yaml

  ......
  version: v2.15.2
  additionalScrapeConfigs: # the following three lines are the addition
    name: additional-configs
    key: prometheus-additional.yaml

After adding this, just update the prometheus CRD resource object:

# kubectl apply -f prometheus-prometheus.yaml
prometheus.monitoring.coreos.com "k8s" configured

After a short while you can open the Prometheus dashboard and see that the configuration has taken effect.

Switching to the targets page, however, the new scrape job does not show up. The Prometheus Pod logs contain many errors of the form "xxx is forbidden", which points to an RBAC permission problem. From the prometheus resource object we know Prometheus uses a ServiceAccount named prometheus-k8s, which is bound to a ClusterRole also named prometheus-k8s (prometheus-clusterRole.yaml).

The rules in that ClusterRole clearly lack list permissions on Services and Pods, hence the errors. To fix this, just add the required permissions:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get

Update the ClusterRole and then recreate the Prometheus Pods; the kubernetes-endpoints job should then appear on the targets page.

The scrape targets discovered there show up because their Services carry the prometheus.io/scrape=true annotation.

Data persistence

Prometheus persistence: add the following to prometheus-prometheus.yaml:

  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: rook-cephfs # adjust to your environment
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 30Gi
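
After re-applying prometheus-prometheus.yaml, the operator creates one PVC per replica; a quick check (a sketch):

kubectl get pvc -n monitoring
# expect one Bound PVC per Prometheus replica, e.g. prometheus-k8s-db-prometheus-k8s-0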

Grafana persistence

1. grafana-pvc.yaml (new file)

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: grafana
  namespace: monitoring
spec:
  storageClassName: rook-cephfs # adjust to your environment
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi

2. grafana-deployment.yaml (modify this file)

      volumes:
      - name: grafana-storage # new configuration
        persistentVolumeClaim:
          claimName: grafana
      #- emptyDir: {} # the original setting, now commented out
      #  name: grafana-storage

# kubectl apply -f grafana-pvc.yaml
persistentvolumeclaim/grafana created

# kubectl apply -f grafana-deployment.yaml
deployment.apps/grafana configured

Adding a ServiceMonitor to monitor ingress-nginx

Prometheus Operator picks up scrape targets through the ServiceMonitor CRD, which associates jobs with Services via their labels.

Create kubernetes-serviceMonitorIngressNginx.yaml under manifests and apply it:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: kube-prometheus
  name: ingress-nginx
  namespace: monitoring
spec:
  endpoints:
  - interval: 15s
    port: metrics
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - ingress-nginx
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx

# kubectl apply -f kubernetes-serviceMonitorIngressNginx.yaml
servicemonitor.monitoring.coreos.com/ingress-nginx created

Create ingress-metrics.yaml under manifests and apply it:

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
  labels:
    app.kubernetes.io/name: ingress-nginx
  annotations:
    prometheus.io/port: "10254" # these two annotations are the ones documented by ingress-nginx
    prometheus.io/scrape: "true"
spec:
  type: ClusterIP
  ports:
  - name: metrics
    port: 10254
    targetPort: 10254
    protocol: TCP
  selector:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/component: controller

# kubectl apply -f ingress-metrics.yaml
service/ingress-nginx created

Prerequisite: (this was already done in the auto-discovery section above; if you skipped that section, the following change is still required)

# vim prometheus-clusterRole.yaml
# add an extra apiGroups entry
- apiGroups:
  - ""
  resources:
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch

Thanos

For how to configure Thanos with the Prometheus Operator, see the official documentation: https://github.com/coreos/prometheus-operator/blob/master/Documentation/thanos.md

$ kubectl explain prometheus.spec.thanos
KIND:     Prometheus
VERSION:  monitoring.coreos.com/v1

RESOURCE: thanos <Object>

DESCRIPTION:
     Thanos configuration allows configuring various aspects of a Prometheus
     server in a Thanos environment. This section is experimental, it may
     change significantly without deprecation notice in any release. This is
     experimental and may change significantly without backward compatibility
     in any release.

FIELDS:
   baseImage    <string>
     Thanos base image if other than default.

   grpcServerTlsConfig  <Object>
     GRPCServerTLSConfig configures the gRPC server from which Thanos Querier
     reads recorded rule data. Note: Currently only the CAFile, CertFile, and
     KeyFile fields are supported. Maps to the '--grpc-server-tls-*' CLI args.

   image        <string>
     Image if specified has precedence over baseImage, tag and sha
     combinations. Specifying the version is still necessary to ensure the
     Prometheus Operator knows what version of Thanos is being configured.

   listenLocal  <boolean>
     ListenLocal makes the Thanos sidecar listen on loopback, so that it does
     not bind against the Pod IP.

   objectStorageConfig  <Object>
     ObjectStorageConfig configures object storage in Thanos.

   resources    <Object>
     Resources defines the resource requirements for the Thanos sidecar. If
     not provided, no requests/limits will be set

   sha  <string>
     SHA of Thanos container image to be deployed. Defaults to the value of
     `version`. Similar to a tag, but the SHA explicitly deploys an immutable
     container image. Version and Tag are ignored if SHA is set.

   tag  <string>
     Tag of Thanos sidecar container image to be deployed. Defaults to the
     value of `version`. Version is ignored if Tag is set.

   tracingConfig        <Object>
     TracingConfig configures tracing in Thanos. This is an experimental
     feature, it may change in any upcoming release in a breaking way.

   version      <string>
     Version describes the version of Thanos to use.

Among the properties above there is an objectStorageConfig field, which specifies the object storage configuration. Here we reuse the object storage configuration from the earlier Thanos chapter (thanos-storage-minio.yaml):

# cat thanos-storage-minio.yaml
type: s3
config:
  bucket: promethes-operator-data # remember to create this bucket in minio first
  endpoint: minio.minio.svc.cluster.local:9000
  access_key: minio
  secret_key: minio123
  insecure: true
  signature_version2: false

Create the corresponding Secret resource object from the configuration file above:

$ kubectl -n monitoring create secret generic thanos-objectstorage --from-file=thanos.yaml=thanos-storage-minio.yaml
secret/thanos-objectstorage created

Once created, add the following configuration to the prometheus CRD object (prometheus-prometheus.yaml):

  thanos:
    objectStorageConfig:
      key: thanos.yaml
      name: thanos-objectstorage

Then simply update the prometheus CRD object:

$ kubectl apply -f prometheus-prometheus.yaml

After the update, the Prometheus Pods have 4 containers, with a new sidecar container added.
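
To verify, list the containers in one of the Prometheus Pods (a sketch):

kubectl get pod prometheus-k8s-0 -n monitoring -o jsonpath='{.spec.containers[*].name}'
# four container names should be printed, one of them the thanos sidecar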

Deploying the other Thanos components, such as Querier, Store and Compactor

Reference: https://www.cnblogs.com/sanduzxcvbnm/p/16284934.html

https://jishuin.proginn.com/p/763bfbd56ae4

YAML files used in this part: https://files.cnblogs.com/files/sanduzxcvbnm/operator_thanos.zip?t=1654661018

At the moment, Thanos support in the Prometheus CRD is an experimental feature, so take care if you use it in production; later versions may change it. Through the thanos property we can specify the image version to use as well as the matching object storage configuration. MinIO is again used as the object storage (see the earlier chapter for its deployment). First log in to MinIO and create a thanos bucket, then create an object storage configuration file:

# thanos-storage-minio.yaml
type: s3
config:
  bucket: promethes-operator-data # bucket name, must be created in advance
  endpoint: minio.default.svc.cluster.local:9000 # minio access address
  access_key: minio
  secret_key: minio123
  insecure: true
  signature_version2: false

Create a Secret object from the configuration file above:

$ kubectl create secret generic thanos-objectstorage --from-file=thanos.yaml=thanos-storage-minio.yaml -n monitoring
secret/thanos-objectstorage created

With the object storage configuration in place, we can add the corresponding Thanos configuration to the Prometheus CRD. The complete resource object looks like this (a few parameters differ from the defaults):

# cat prometheus-prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.35.0
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - apiVersion: v2
      name: alertmanager-main
      namespace: monitoring
      port: web
  enableFeatures: []
  externalLabels: {}
  image: quay.io/prometheus/prometheus:v2.35.0
  nodeSelector:
    kubernetes.io/os: linux
  podMetadata:
    labels:
      app.kubernetes.io/component: prometheus
      app.kubernetes.io/instance: k8s
      app.kubernetes.io/name: prometheus
      app.kubernetes.io/part-of: kube-prometheus
      app.kubernetes.io/version: 2.35.0
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  probeNamespaceSelector: {}
  probeSelector: {}
  replicas: 2
  retention: 6h
  resources:
    requests:
      memory: 400Mi
  ruleNamespaceSelector: {}
  ruleSelector: {}
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: 2.35.0
  additionalScrapeConfigs: # the service discovery configuration added earlier
    name: additional-configs
    key: prometheus-additional.yaml
  thanos: # the thanos configuration
    image: thanosio/thanos:v0.26.0
    resources:
      limits:
        cpu: 500m
        memory: 500Mi
      requests:
        cpu: 100m
        memory: 500Mi
    objectStorageConfig:
      key: thanos.yaml
      name: thanos-objectstorage
  #storage: # local data persistence
  #  volumeClaimTemplate:
  #    spec:
  #      storageClassName: rook-cephfs
  #      resources:
  #        requests:
  #          storage: 20Gi # at least 20Gi
  #thanos: # minimal thanos configuration
  #  objectStorageConfig:
  #    key: thanos.yaml
  #    name: thanos-objectstorage # the secret holding the object storage config

Then update it directly:

$ kubectl apply -f prometheus-prometheus.yaml
prometheus.monitoring.coreos.com/k8s configured

After the update, looking at the Prometheus Pods again shows that they now have 4 containers (there were already 3 before).

The new one is the sidecar container. It normally uploads data once every 2 hours; its logs can be inspected as follows:

# kubectl logs -f prometheus-k8s-0 -c thanos-sidecar -n monitoring
level=info ts=2022-06-08T02:23:04.21432378Z caller=options.go:27 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2022-06-08T02:23:04.215510591Z caller=factory.go:49 msg="loading bucket configuration"
level=info ts=2022-06-08T02:23:04.216213439Z caller=sidecar.go:360 msg="starting sidecar"
level=info ts=2022-06-08T02:23:04.216640996Z caller=intrumentation.go:75 msg="changing probe status" status=healthy
level=info ts=2022-06-08T02:23:04.21670998Z caller=http.go:73 service=http/server component=sidecar msg="listening for requests and metrics" address=:10902
level=info ts=2022-06-08T02:23:04.21707979Z caller=tls_config.go:195 service=http/server component=sidecar msg="TLS is disabled." http2=false
level=info ts=2022-06-08T02:23:04.218319048Z caller=reloader.go:199 component=reloader msg="nothing to be watched"
level=info ts=2022-06-08T02:23:04.218394592Z caller=intrumentation.go:56 msg="changing probe status" status=ready
level=info ts=2022-06-08T02:23:04.218450345Z caller=grpc.go:131 service=gRPC/server component=sidecar msg="listening for serving gRPC" address=:10901
level=info ts=2022-06-08T02:23:04.223323398Z caller=sidecar.go:179 msg="successfully loaded prometheus version"
level=info ts=2022-06-08T02:23:04.301263386Z caller=sidecar.go:201 msg="successfully loaded prometheus external labels" external_labels="{prometheus=\"monitoring/k8s\", prometheus_replica=\"prometheus-k8s-0\"}"
level=warn ts=2022-06-08T02:23:06.219784039Z caller=shipper.go:239 msg="reading meta file failed, will override it" err="failed to read /prometheus/thanos.shipper.json: open /prometheus/thanos.shipper.json: no such file or directory

Thanos Querier

The Thanos Querier component provides the ability to retrieve metrics from all prometheus instances at once. It is fully compatible with the original Prometheus PromQL and HTTP API, so it can likewise be used together with Grafana.

Because the Querier component has to connect to the Sidecar and Store components, its store parameters need to point at the Thanos Sidecars started above. They can be discovered through the corresponding headless Service, which is created automatically with the name prometheus-operated (visible from the StatefulSet):

# kubectl describe svc -n monitoring prometheus-operated
Name: prometheus-operated
Namespace: monitoring
Labels: operated-prometheus=true
Annotations: <none>
Selector: app.kubernetes.io/name=prometheus
Type: ClusterIP
IP Families: <none>
IP: None
IPs: None
Port: web 9090/TCP
TargetPort: web/TCP
Endpoints: 10.1.112.219:9090,10.1.112.222:9090
Port: grpc 10901/TCP
TargetPort: grpc/TCP
Endpoints: 10.1.112.219:10901,10.1.112.222:10901
Session Affinity: None
Events: <none>

The complete manifest for the Thanos Querier component is shown below. Note that for prometheus instances deployed by the Prometheus Operator, the replica external_label is prometheus_replica:

# cat querier.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-querier
  namespace: monitoring
  labels:
    app: thanos-querier
spec:
  selector:
    matchLabels:
      app: thanos-querier
  template:
    metadata:
      labels:
        app: thanos-querier
    spec:
      containers:
      - name: thanos
        image: thanosio/thanos:v0.26.0
        args:
        - query
        - --log.level=debug
        - --query.replica-label=prometheus_replica # note this line
        - --store=dnssrv+prometheus-operated:10901 # note this line
        #- --store=dnssrv+thanos-store:10901 # note this line; keep it commented out for now and uncomment it later
        ports:
        - name: http
          containerPort: 10902
        - name: grpc
          containerPort: 10901
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "2Gi"
            cpu: "1"
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: http
          initialDelaySeconds: 10
        readinessProbe:
          httpGet:
            path: /-/healthy
            port: http
          initialDelaySeconds: 15
---
apiVersion: v1
kind: Service
metadata:
  name: thanos-querier
  namespace: monitoring
  labels:
    app: thanos-querier
spec:
  ports:
  - port: 9090
    protocol: TCP
    targetPort: http
    name: http
  selector:
    app: thanos-querier
  type: NodePort # NodePort is used here for quick access; an ingress-nginx Ingress could be used instead

Create the resource objects above directly:

# kubectl apply -f querier.yaml

# kubectl get pods -n monitoring -l app=thanos-querier
NAME READY STATUS RESTARTS AGE
thanos-querier-557c7ff9dd-j2r7q 1/1 Running 0 43m

After deployment, open the Querier page in a browser and check the Stores it has discovered.

For example, query the node_load1 metric on the Graph page, and remember to tick Use Deduplication so duplicate series across replicas are merged in the query result:
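
The same query can also be issued against the Querier HTTP API, which is Prometheus-compatible and accepts a dedup parameter (a sketch; 30090 is a placeholder NodePort):

curl -s 'http://<node-ip>:30090/api/v1/query?query=node_load1&dedup=true'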

Thanos Store

Next, deploy the Thanos Store component. It works together with the Querier component to retrieve historical metric data from the configured object storage bucket, so it naturally needs the object storage configuration at deploy time. Once the Store component is running it also has to be added to the Querier component:

# cat store.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-store
  namespace: monitoring
  labels:
    app: thanos-store
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-store
  serviceName: thanos-store
  template:
    metadata:
      labels:
        app: thanos-store
        thanos-store-api: "true"
    spec:
      containers:
      - name: thanos
        image: thanosio/thanos:v0.26.0
        args:
        - "store"
        - "--log.level=debug"
        - "--data-dir=/data"
        - "--objstore.config-file=/etc/secret/thanos.yaml"
        - "--index-cache-size=500MB"
        - "--chunk-pool-size=500MB"
        ports:
        - name: http
          containerPort: 10902
        - name: grpc
          containerPort: 10901
        livenessProbe:
          httpGet:
            port: 10902
            path: /-/healthy
          initialDelaySeconds: 10
        readinessProbe:
          httpGet:
            port: 10902
            path: /-/ready
          initialDelaySeconds: 15
        volumeMounts:
        - name: object-storage-config
          mountPath: /etc/secret
          readOnly: false
      volumes:
      - name: object-storage-config
        secret:
          secretName: thanos-objectstorage
---
apiVersion: v1
kind: Service
metadata:
  name: thanos-store
  namespace: monitoring
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: grpc
    port: 10901
    targetPort: grpc
  selector:
    app: thanos-store

Deploy the resource objects above directly:

$ kubectl apply -f thanos-store.yaml
statefulset.apps/thanos-store created
service/thanos-store created

$ kubectl get pods -n monitoring -l app=thanos-store
NAME READY STATUS RESTARTS AGE
thanos-store-0 1/1 Running 0 106s

Once the Store is deployed, for the Querier component to discover it we also need to add the Store to the Querier's store flags:

      containers:
      - name: thanos
        image: thanosio/thanos:v0.26.0
        args:
        - query
        - --log.level=debug
        - --query.replica-label=prometheus_replica
        # Discover local store APIs using DNS SRV.
        - --store=dnssrv+prometheus-operated:10901
        - --store=dnssrv+thanos-store:10901

After updating, the Stores page of the Querier component should show one additional Thanos Store entry.

Thanos Compactor

The Thanos Compactor component can downsample the collected historical data, reducing the size of the files in object storage. Its deployment is not much different from the previous components; the main point is again wiring up the object storage.

# cat compactor.yaml

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-compactor
  namespace: monitoring
  labels:
    app: thanos-compactor
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-compactor
  serviceName: thanos-compactor
  template:
    metadata:
      labels:
        app: thanos-compactor
    spec:
      containers:
      - name: thanos
        image: thanosio/thanos:v0.26.0
        args:
        - "compact"
        - "--log.level=debug"
        - "--data-dir=/data"
        - "--objstore.config-file=/etc/secret/thanos.yaml"
        - "--wait"
        ports:
        - name: http
          containerPort: 10902
        livenessProbe:
          httpGet:
            port: 10902
            path: /-/healthy
          initialDelaySeconds: 10
        readinessProbe:
          httpGet:
            port: 10902
            path: /-/ready
          initialDelaySeconds: 15
        volumeMounts:
        - name: object-storage-config
          mountPath: /etc/secret
          readOnly: false
      volumes:
      - name: object-storage-config
        secret:
          secretName: thanos-objectstorage

Again, simply create the resource object above:

# kubectl apply -f thanos-compactor.yaml

Finally, if you want to configure alerting rules through the Thanos Ruler component, you can use the ThanosRuler CRD provided by the Prometheus Operator. That said, configuring alerting rules directly on the individual prometheus instances is still recommended: the call chain is shorter, which makes troubleshooting easier. The Thanos Ruler component evaluates recording and alerting rules across multiple prometheus instances; a ThanosRuler instance needs at least one queryEndpoint pointing at a Thanos Querier or a prometheus instance, as shown below:

# ThanosRuler Demo
apiVersion: monitoring.coreos.com/v1
kind: ThanosRuler
metadata:
  name: thanos-ruler-demo
  labels:
    example: thanos-ruler
  namespace: monitoring
spec:
  image: thanosio/thanos
  ruleSelector:
    matchLabels: # match rule objects by label
      role: my-thanos-rules
  queryEndpoints: # querier address
  - dnssrv+_http._tcp.my-thanos-querier.monitoring.svc.cluster.local

The recording and alerting rules used by the ThanosRuler component are the same PrometheusRule objects used to configure Prometheus. In the example above, PrometheusRule objects carrying the label role=my-thanos-rules are picked up and loaded into the Thanos Ruler Pod.
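
A minimal sketch of a matching rule object (the rule itself is only an illustrative example):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: thanos-ruler-demo-rules
  namespace: monitoring
  labels:
    role: my-thanos-rules # matches the ruleSelector above
spec:
  groups:
  - name: demo
    rules:
    - alert: DemoAlwaysFiring
      expr: vector(1)
      for: 1m
      labels:
        severity: info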

Finally, after wiring Thanos into the Prometheus Operator, all the resource objects look like this:

# kubectl get pods -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 2/2 Running 2 20h
alertmanager-main-1 2/2 Running 2 20h
alertmanager-main-2 2/2 Running 2 20h
blackbox-exporter-5cb5d7479d-nb9td 3/3 Running 3 2d
grafana-6fc6fff957-jdjn7 1/1 Running 1 19h
kube-state-metrics-d64589d79-8gs9b 3/3 Running 3 2d
node-exporter-fvnbm 2/2 Running 2 2d
node-exporter-jlqmc 2/2 Running 2 2d
node-exporter-m76cj 2/2 Running 2 2d
prometheus-adapter-785b59bccc-jrpjj 1/1 Running 2 2d
prometheus-adapter-785b59bccc-zqlkx 1/1 Running 2 2d
prometheus-k8s-0 3/3 Running 0 92m
prometheus-k8s-1 3/3 Running 0 92m
prometheus-operator-d8c5b745d-l2trp 2/2 Running 2 2d
thanos-compactor-0 1/1 Running 0 44m
thanos-querier-557c7ff9dd-j2r7q 1/1 Running 0 46m
thanos-store-0 1/1 Running 0 47m

The uploaded historical data should now also be visible in the MinIO object storage.


