一、概述

首先Prometheus整体监控结构略微复杂,一个个部署并不简单。另外监控Kubernetes就需要访问内部数据,必定需要进行认证、鉴权、准入控制,

那么这一整套下来将变得难上加难,而且还需要花费一定的时间,如果你没有特别高的要求,我还是建议选用开源比较好的一些方案。

关于Prometheus具体介绍不再多说,可以参考另外一篇博文:Kubernetes实战总结 - Prometheus部署(v0.3.0)

本篇主要针对Kubernetes部署Prometheus相关配置介绍,本人采用的是github开源的部署方案:prometheus-operator/kube-prometheus

关于这个kube-prometheus目前应该是开源最好的方案了,该存储库收集Kubernetes清单,Grafana仪表板和Prometheus规则,以及文档和脚本,

以使用Prometheus Operator 通过Prometheus提供易于操作的端到端Kubernetes集群监视。以容器的方式部署到k8s集群,而且还可以自定义配置,非常的方便。

注意:本人使用的kubernetes-1.17.5  + release-0.3,由于网络问题本人已修改全部镜像地址。


二、结构分析

kube-prometheus相关部署文件在manifests目录中,共65个yaml,其中setup文件夹中包含所有自定义资源配置CustomResourceDefinition(一般不用修改,也不要轻易修改),所以部署时必须先执行这个文件夹。

其中包括告警(Alertmanager)、监控(Prometheus)、监控项(PrometheusRule)这三类资源定义,所以如果你想直接在k8s中修改对应控制器配置是没有用的(比如kubectl edit sts prometheus-k8s -n monitoring) 。

这里yaml文件看着很多,只要我们梳理一下就会很容易理解了,首先分为7个组件prometheus-operator、prometheus-adapter、prometheus、alertmanager、grafana、kube-state-metrics、node-exporter,

然后每个组件都会定义控制器、配置文件、集群权限、访问配置等, 但是我们一般只需要进行自定义告警配置和监控项,这样一筛选发现只需要修改几个文件即可(其中红色后面重点说明,紫色可根据项目情况调整资源配置)。

[root@ymt108 manifests]# tree
.
├── alertmanager-alertmanager.yaml
├── alertmanager-secret.yaml    # 告警配置
├── alertmanager-serviceAccount.yaml
├── alertmanager-serviceMonitor.yaml
├── alertmanager-service.yaml
├── grafana-dashboardDatasources.yaml
├── grafana-dashboardDefinitions.yaml
├── grafana-dashboardSources.yaml
├── grafana-deployment.yaml
├── grafana-serviceAccount.yaml
├── grafana-serviceMonitor.yaml
├── grafana-service.yaml
├── kube-state-metrics-clusterRoleBinding.yaml
├── kube-state-metrics-clusterRole.yaml
├── kube-state-metrics-deployment.yaml
├── kube-state-metrics-roleBinding.yaml
├── kube-state-metrics-role.yaml
├── kube-state-metrics-serviceAccount.yaml
├── kube-state-metrics-serviceMonitor.yaml
├── kube-state-metrics-service.yaml
├── node-exporter-clusterRoleBinding.yaml
├── node-exporter-clusterRole.yaml
├── node-exporter-daemonset.yaml
├── node-exporter-serviceAccount.yaml
├── node-exporter-serviceMonitor.yaml
├── node-exporter-service.yaml
├── prometheus-adapter-apiService.yaml
├── prometheus-adapter-clusterRoleAggregatedMetricsReader.yaml
├── prometheus-adapter-clusterRoleBindingDelegator.yaml
├── prometheus-adapter-clusterRoleBinding.yaml
├── prometheus-adapter-clusterRoleServerResources.yaml
├── prometheus-adapter-clusterRole.yaml
├── prometheus-adapter-configMap.yaml
├── prometheus-adapter-deployment.yaml
├── prometheus-adapter-roleBindingAuthReader.yaml
├── prometheus-adapter-serviceAccount.yaml
├── prometheus-adapter-service.yaml
├── prometheus-clusterRoleBinding.yaml
├── prometheus-clusterRole.yaml
├── prometheus-operator-serviceMonitor.yaml
├── prometheus-prometheus.yaml  # 监控配置
├── prometheus-roleBindingConfig.yaml
├── prometheus-roleBindingSpecificNamespaces.yaml
├── prometheus-roleConfig.yaml
├── prometheus-roleSpecificNamespaces.yaml
├── prometheus-rules.yaml  # 默认监控项
├── prometheus-serviceAccount.yaml
├── prometheus-serviceMonitorApiserver.yaml
├── prometheus-serviceMonitorCoreDNS.yaml
├── prometheus-serviceMonitorKubeControllerManager.yaml
├── prometheus-serviceMonitorKubelet.yaml
├── prometheus-serviceMonitorKubeScheduler.yaml
├── prometheus-serviceMonitor.yaml
├── prometheus-service.yaml
└── setup
├── 0namespace-namespace.yaml
├── prometheus-operator-0alertmanagerCustomResourceDefinition.yaml
├── prometheus-operator-0podmonitorCustomResourceDefinition.yaml
├── prometheus-operator-0prometheusCustomResourceDefinition.yaml
├── prometheus-operator-0prometheusruleCustomResourceDefinition.yaml
├── prometheus-operator-0servicemonitorCustomResourceDefinition.yaml
├── prometheus-operator-clusterRoleBinding.yaml
├── prometheus-operator-clusterRole.yaml
├── prometheus-operator-deployment.yaml
├── prometheus-operator-serviceAccount.yaml
└── prometheus-operator-service.yaml 1 directories, 65 files

三、修改Prometheus配置

为了保留原始文件,我们复制一份prometheus-prometheus.yaml进行如下修改:

1)replicas:根据项目情况调整副本数

2)retention:修改Prometheus数据保留期限,默认值为“24h”,并且必须与正则表达式“ [0-9] +(ms | s | m | h | d | w | y)”匹配。

3)additionalScrapeConfigs:增加额外监控项配置,具体配置查看第五部分“添加k8s外部监控”。

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
labels:
prometheus: k8s
name: k8s
namespace: monitoring
spec:
alerting:
alertmanagers:
- name: alertmanager-main
namespace: monitoring
port: web
# baseImage: quay.io/prometheus/prometheus
baseImage: registry.cn-shanghai.aliyuncs.com/leozhanggg/prometheus/prometheus
additionalScrapeConfigs:
name: additional-scrape-configs
key: prometheus-additional.yaml
retention: 15d
nodeSelector:
kubernetes.io/os: linux
podMonitorSelector: {}
replicas:
resources:
requests:
memory: 400Mi
ruleSelector:
matchLabels:
prometheus: k8s
role: alert-rules
securityContext:
fsGroup: 2000
runAsNonRoot: true
runAsUser: 1000
serviceAccountName: prometheus-k8s
serviceMonitorNamespaceSelector: {}
serviceMonitorSelector: {}
version: v2.11.0

四、修改PrometheusRule配置

首先查看默认监控项配置prometheus-rules.yaml,其中包括76个告警项,基本覆盖了k8s常用监控点,同样为了保留源文件,我们复制一份prometheus-rules.yaml进行一些修改。

由于我使用的版本在裸机k8s集群上存在无法获取到scheduler和controller资源问题,被我注释了;另外由于general-rules规则与我自定义规则冲突也被我注释了;

最后增加了platform参数区分环境,以及进行部分提示语中译。

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
labels:
prometheus: k8s
role: alert-rules
name: prometheus-k8s-rules
namespace: monitoring
spec:
groups:
- name: node-exporter.rules
rules:
- expr: |
count without (cpu) (
count without (mode) (
node_cpu_seconds_total{job="node-exporter"}
)
)
record: instance:node_num_cpu:sum
- expr: |
1 - avg without (cpu, mode) (
rate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[1m])
)
record: instance:node_cpu_utilisation:rate1m
- expr: |
(
node_load1{job="node-exporter"}
/
instance:node_num_cpu:sum{job="node-exporter"}
)
record: instance:node_load1_per_cpu:ratio
- expr: |
1 - (
node_memory_MemAvailable_bytes{job="node-exporter"}
/
node_memory_MemTotal_bytes{job="node-exporter"}
)
record: instance:node_memory_utilisation:ratio
- expr: |
rate(node_vmstat_pgmajfault{job="node-exporter"}[1m])
record: instance:node_vmstat_pgmajfault:rate1m
- expr: |
rate(node_disk_io_time_seconds_total{job="node-exporter", device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+"}[1m])
record: instance_device:node_disk_io_time_seconds:rate1m
- expr: |
rate(node_disk_io_time_weighted_seconds_total{job="node-exporter", device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+"}[1m])
record: instance_device:node_disk_io_time_weighted_seconds:rate1m
- expr: |
sum without (device) (
rate(node_network_receive_bytes_total{job="node-exporter", device!="lo"}[1m])
)
record: instance:node_network_receive_bytes_excluding_lo:rate1m
- expr: |
sum without (device) (
rate(node_network_transmit_bytes_total{job="node-exporter", device!="lo"}[1m])
)
record: instance:node_network_transmit_bytes_excluding_lo:rate1m
- expr: |
sum without (device) (
rate(node_network_receive_drop_total{job="node-exporter", device!="lo"}[1m])
)
record: instance:node_network_receive_drop_excluding_lo:rate1m
- expr: |
sum without (device) (
rate(node_network_transmit_drop_total{job="node-exporter", device!="lo"}[1m])
)
record: instance:node_network_transmit_drop_excluding_lo:rate1m
- name: kube-apiserver.rules
rules:
- expr: |
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver"}[5m])) without(instance, pod))
labels:
quantile: "0.99"
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
- expr: |
histogram_quantile(0.9, sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver"}[5m])) without(instance, pod))
labels:
quantile: "0.9"
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
- expr: |
histogram_quantile(0.5, sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver"}[5m])) without(instance, pod))
labels:
quantile: "0.5"
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
- name: k8s.rules
rules:
- expr: |
sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container!="POD"}[5m])) by (namespace)
record: namespace:container_cpu_usage_seconds_total:sum_rate
- expr: |
sum by (namespace, pod, container) (
rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container!="POD"}[5m])
) * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate
- expr: |
container_memory_working_set_bytes{job="kubelet", image!=""}
* on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
record: node_namespace_pod_container:container_memory_working_set_bytes
- expr: |
container_memory_rss{job="kubelet", image!=""}
* on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
record: node_namespace_pod_container:container_memory_rss
- expr: |
container_memory_cache{job="kubelet", image!=""}
* on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
record: node_namespace_pod_container:container_memory_cache
- expr: |
container_memory_swap{job="kubelet", image!=""}
* on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
record: node_namespace_pod_container:container_memory_swap
- expr: |
sum(container_memory_usage_bytes{job="kubelet", image!="", container!="POD"}) by (namespace)
record: namespace:container_memory_usage_bytes:sum
- expr: |
sum by (namespace, label_name) (
sum(kube_pod_container_resource_requests_memory_bytes{job="kube-state-metrics"} * on (endpoint, instance, job, namespace, pod, service) group_left(phase) (kube_pod_status_phase{phase=~"Pending|Running"} == 1)) by (namespace, pod)
* on (namespace, pod)
group_left(label_name) kube_pod_labels{job="kube-state-metrics"}
)
record: namespace:kube_pod_container_resource_requests_memory_bytes:sum
- expr: |
sum by (namespace, label_name) (
sum(kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"} * on (endpoint, instance, job, namespace, pod, service) group_left(phase) (kube_pod_status_phase{phase=~"Pending|Running"} == 1)) by (namespace, pod)
* on (namespace, pod)
group_left(label_name) kube_pod_labels{job="kube-state-metrics"}
)
record: namespace:kube_pod_container_resource_requests_cpu_cores:sum
- expr: |
sum(
label_replace(
label_replace(
kube_pod_owner{job="kube-state-metrics", owner_kind="ReplicaSet"},
"replicaset", "$1", "owner_name", "(.*)"
) * on(replicaset, namespace) group_left(owner_name) kube_replicaset_owner{job="kube-state-metrics"},
"workload", "$1", "owner_name", "(.*)"
)
) by (namespace, workload, pod)
labels:
workload_type: deployment
record: mixin_pod_workload
- expr: |
sum(
label_replace(
kube_pod_owner{job="kube-state-metrics", owner_kind="DaemonSet"},
"workload", "$1", "owner_name", "(.*)"
)
) by (namespace, workload, pod)
labels:
workload_type: daemonset
record: mixin_pod_workload
- expr: |
sum(
label_replace(
kube_pod_owner{job="kube-state-metrics", owner_kind="StatefulSet"},
"workload", "$1", "owner_name", "(.*)"
)
) by (namespace, workload, pod)
labels:
workload_type: statefulset
record: mixin_pod_workload
- name: kube-scheduler.rules
rules:
- expr: |
histogram_quantile(0.99, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
labels:
quantile: "0.99"
record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile
- expr: |
histogram_quantile(0.99, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
labels:
quantile: "0.99"
record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile
- expr: |
histogram_quantile(0.99, sum(rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
labels:
quantile: "0.99"
record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile
- expr: |
histogram_quantile(0.9, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
labels:
quantile: "0.9"
record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile
- expr: |
histogram_quantile(0.9, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
labels:
quantile: "0.9"
record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile
- expr: |
histogram_quantile(0.9, sum(rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
labels:
quantile: "0.9"
record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile
- expr: |
histogram_quantile(0.5, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
labels:
quantile: "0.5"
record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile
- expr: |
histogram_quantile(0.5, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
labels:
quantile: "0.5"
record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile
- expr: |
histogram_quantile(0.5, sum(rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
labels:
quantile: "0.5"
record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile
- name: node.rules
rules:
- expr: sum(min(kube_pod_info) by (node))
record: ':kube_pod_info_node_count:'
- expr: |
max(label_replace(kube_pod_info{job="kube-state-metrics"}, "pod", "$1", "pod", "(.*)")) by (node, namespace, pod)
record: 'node_namespace_pod:kube_pod_info:'
- expr: |
count by (node) (sum by (node, cpu) (
node_cpu_seconds_total{job="node-exporter"}
* on (namespace, pod) group_left(node)
node_namespace_pod:kube_pod_info:
))
record: node:node_num_cpu:sum
- expr: |
sum(
node_memory_MemAvailable_bytes{job="node-exporter"} or
(
node_memory_Buffers_bytes{job="node-exporter"} +
node_memory_Cached_bytes{job="node-exporter"} +
node_memory_MemFree_bytes{job="node-exporter"} +
node_memory_Slab_bytes{job="node-exporter"}
)
)
record: :node_memory_MemAvailable_bytes:sum
- name: kube-prometheus-node-recording.rules
rules:
- expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[3m])) BY
(instance)
record: instance:node_cpu:rate:sum
- expr: sum((node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}))
BY (instance)
record: instance:node_filesystem_usage:sum
- expr: sum(rate(node_network_receive_bytes_total[3m])) BY (instance)
record: instance:node_network_receive_bytes:rate:sum
- expr: sum(rate(node_network_transmit_bytes_total[3m])) BY (instance)
record: instance:node_network_transmit_bytes:rate:sum
- expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[5m])) WITHOUT
(cpu, mode) / ON(instance) GROUP_LEFT() count(sum(node_cpu_seconds_total)
BY (instance, cpu)) BY (instance)
record: instance:node_cpu:ratio
- expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[5m]))
record: cluster:node_cpu:sum_rate5m
- expr: cluster:node_cpu_seconds_total:rate5m / count(sum(node_cpu_seconds_total)
BY (instance, cpu))
record: cluster:node_cpu:ratio
- name: node-exporter
rules:
- alert: NodeFilesystemSpaceFillingUp
annotations:
platform: "测试平台"
description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
has only {{ printf "%.2f" $value }}% available space left and is filling
up.
summary: "预计文件系统将在接下来的24小时内用完空间。"
expr: |
(
node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 40
and
predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
and
node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
)
for: 1h
labels:
severity: warning
- alert: NodeFilesystemSpaceFillingUp
annotations:
platform: "测试平台"
description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
has only {{ printf "%.2f" $value }}% available space left and is filling
up fast.
summary: "预计文件系统将在接下来的4个小时内用完空间。"
expr: |
(
node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 20
and
predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
and
node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
)
for: 1h
labels:
severity: critical
- alert: NodeFilesystemAlmostOutOfSpace
annotations:
platform: "测试平台"
description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
has only {{ printf "%.2f" $value }}% available space left.
summary: "文件系统剩余空间不到5%。"
expr: |
(
node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5
and
node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
)
for: 1h
labels:
severity: warning
- alert: NodeFilesystemAlmostOutOfSpace
annotations:
platform: "测试平台"
description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
has only {{ printf "%.2f" $value }}% available space left.
summary: "文件系统剩余空间不到3%。"
expr: |
(
node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 3
and
node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
)
for: 1h
labels:
severity: critical
- alert: NodeFilesystemFilesFillingUp
annotations:
platform: "测试平台"
description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
has only {{ printf "%.2f" $value }}% available inodes left and is filling
up.
summary: "预计文件系统将在接下来的24小时内用尽inodes。"
expr: |
(
node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 40
and
predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
and
node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
)
for: 1h
labels:
severity: warning
- alert: NodeFilesystemFilesFillingUp
annotations:
platform: "测试平台"
description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
has only {{ printf "%.2f" $value }}% available inodes left and is filling
up fast.
summary: "预计文件系统将在接下来的4小时内用尽inodes。"
expr: |
(
node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20
and
predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
and
node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
)
for: 1h
labels:
severity: critical
- alert: NodeFilesystemAlmostOutOfFiles
annotations:
platform: "测试平台"
description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
has only {{ printf "%.2f" $value }}% available inodes left.
summary: "文件系统仅剩不到5%的inodes。"
expr: |
(
node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 5
and
node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
)
for: 1h
labels:
severity: warning
- alert: NodeFilesystemAlmostOutOfFiles
annotations:
platform: "测试平台"
description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
has only {{ printf "%.2f" $value }}% available inodes left.
summary: "文件系统仅剩不到3%的inodes。"
expr: |
(
node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3
and
node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
)
for: 1h
labels:
severity: critical
- alert: NodeNetworkReceiveErrs
annotations:
platform: "测试平台"
description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered
{{ printf "%.0f" $value }} receive errors in the last two minutes.'
summary: "网络接口报告许多接收错误。"
expr: |
increase(node_network_receive_errs_total[2m]) > 10
for: 1h
labels:
severity: warning
- alert: NodeNetworkTransmitErrs
annotations:
platform: "测试平台"
description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered
{{ printf "%.0f" $value }} transmit errors in the last two minutes.'
summary: "网络接口报告许多传输错误。"
expr: |
increase(node_network_transmit_errs_total[2m]) > 10
for: 1h
labels:
severity: warning
- name: kubernetes-apps
rules:
- alert: KubePodCrashLooping
annotations:
platform: "测试平台"
message: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container
}}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.
expr: |
rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) * 60 * 5 > 0
for: 15m
labels:
severity: critical
- alert: KubePodNotReady
annotations:
platform: "测试平台"
message: "Pod {{$labels.namespace}}/{{$labels.pod}}处于未就绪状态的时间超过15分钟。"
expr: |
sum by (namespace, pod) (kube_pod_status_phase{job="kube-state-metrics", phase=~"Failed|Pending|Unknown"} * on(namespace, pod) group_left(owner_kind) kube_pod_owner{owner_kind!="Job"}) > 0
for: 15m
labels:
severity: critical
- alert: KubeDeploymentGenerationMismatch
annotations:
platform: "测试平台"
message: "Deployment {{$labels.namespace}}/{{$labels.deployment}}生成不匹配,这表明Deployment已失败但尚未回滚。"
expr: |
kube_deployment_status_observed_generation{job="kube-state-metrics"}
!=
kube_deployment_metadata_generation{job="kube-state-metrics"}
for: 15m
labels:
severity: critical
- alert: KubeDeploymentReplicasMismatch
annotations:
platform: "测试平台"
message: "Deployment{{$labels.namespace}}/{{$labels.deployment}}超过15分钟未匹配预期的副本数。"
expr: |
kube_deployment_spec_replicas{job="kube-state-metrics"}
!=
kube_deployment_status_replicas_available{job="kube-state-metrics"}
for: 15m
labels:
severity: critical
- alert: KubeStatefulSetReplicasMismatch
annotations:
platform: "测试平台"
message: "StatefulSet {{$labels.namespace}}/{{$labels.statefulset}}超过15分钟未匹配预期的副本数。"
expr: |
kube_statefulset_status_replicas_ready{job="kube-state-metrics"}
!=
kube_statefulset_status_replicas{job="kube-state-metrics"}
for: 15m
labels:
severity: critical
- alert: KubeStatefulSetGenerationMismatch
annotations:
platform: "测试平台"
message: "StatefulSet {{$labels.namespace}}/{{$labels.statefulset}}生成不匹配,这表明StatefulSet已失败但尚未回滚。"
expr: |
kube_statefulset_status_observed_generation{job="kube-state-metrics"}
!=
kube_statefulset_metadata_generation{job="kube-state-metrics"}
for: 15m
labels:
severity: critical
- alert: KubeStatefulSetUpdateNotRolledOut
annotations:
platform: "测试平台"
message: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update
has not been rolled out.
expr: |
max without (revision) (
kube_statefulset_status_current_revision{job="kube-state-metrics"}
unless
kube_statefulset_status_update_revision{job="kube-state-metrics"}
)
*
(
kube_statefulset_replicas{job="kube-state-metrics"}
!=
kube_statefulset_status_replicas_updated{job="kube-state-metrics"}
)
for: 15m
labels:
severity: critical
- alert: KubeDaemonSetRolloutStuck
annotations:
platform: "测试平台"
message: Only {{ $value | humanizePercentage }} of the desired Pods of DaemonSet
{{ $labels.namespace }}/{{ $labels.daemonset }} are scheduled and ready.
expr: |
kube_daemonset_status_number_ready{job="kube-state-metrics"}
/
kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} < 1.00
for: 15m
labels:
severity: critical
- alert: KubeContainerWaiting
annotations:
platform: "测试平台"
message: Pod {{ $labels.namespace }}/{{ $labels.pod }} container {{ $labels.container}}
has been in waiting state for longer than 1 hour.
expr: |
sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics"}) > 0
for: 1h
labels:
severity: warning
- alert: KubeDaemonSetNotScheduled
annotations:
platform: "测试平台"
message: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset
}} are not scheduled.'
expr: |
kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
-
kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"} > 0
for: 10m
labels:
severity: warning
- alert: KubeDaemonSetMisScheduled
annotations:
platform: "测试平台"
message: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset
}} are running where they are not supposed to run.'
expr: |
kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0
for: 10m
labels:
severity: warning
- alert: KubeCronJobRunning
annotations:
platform: "测试平台"
message: CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more
than 1h to complete.
expr: |
time() - kube_cronjob_next_schedule_time{job="kube-state-metrics"} > 3600
for: 1h
labels:
severity: warning
- alert: KubeJobCompletion
annotations:
platform: "测试平台"
message: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more
than one hour to complete.
expr: |
kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"} > 0
for: 1h
labels:
severity: warning
- alert: KubeJobFailed
annotations:
platform: "测试平台"
message: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.
expr: |
kube_job_failed{job="kube-state-metrics"} > 0
for: 15m
labels:
severity: warning
- alert: KubeHpaReplicasMismatch
annotations:
platform: "测试平台"
message: HPA {{ $labels.namespace }}/{{ $labels.hpa }} has not matched the
desired number of replicas for longer than 15 minutes.
expr: |
(kube_hpa_status_desired_replicas{job="kube-state-metrics"}
!=
kube_hpa_status_current_replicas{job="kube-state-metrics"})
and
changes(kube_hpa_status_current_replicas[15m]) == 0
for: 15m
labels:
severity: warning
- alert: KubeHpaMaxedOut
annotations:
platform: "测试平台"
message: HPA {{ $labels.namespace }}/{{ $labels.hpa }} has been running at
max replicas for longer than 15 minutes.
expr: |
kube_hpa_status_current_replicas{job="kube-state-metrics"}
==
kube_hpa_spec_max_replicas{job="kube-state-metrics"}
for: 15m
labels:
severity: warning
- name: kubernetes-resources
rules:
- alert: KubeCPUOvercommit
annotations:
platform: "测试平台"
message: "群集已超额使用Pod的CPU资源请求,因此无法容忍节点故障。"
expr: |
sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum)
/
sum(kube_node_status_allocatable_cpu_cores)
>
(count(kube_node_status_allocatable_cpu_cores)-1) / count(kube_node_status_allocatable_cpu_cores)
for: 5m
labels:
severity: warning
- alert: KubeMemOvercommit
annotations:
platform: "测试平台"
message: "群集已过量使用Pod的内存资源请求,因此无法容忍节点故障。"
expr: |
sum(namespace:kube_pod_container_resource_requests_memory_bytes:sum)
/
sum(kube_node_status_allocatable_memory_bytes)
>
(count(kube_node_status_allocatable_memory_bytes)-1)
/
count(kube_node_status_allocatable_memory_bytes)
for: 5m
labels:
severity: warning
- alert: KubeCPUOvercommit
annotations:
platform: "测试平台"
message: "群集已超额使用了对命名空间的CPU资源请求。"
expr: |
sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="cpu"})
/
sum(kube_node_status_allocatable_cpu_cores)
> 1.5
for: 5m
labels:
severity: warning
- alert: KubeMemOvercommit
annotations:
platform: "测试平台"
message: "群集已过量使用了对命名空间的内存资源请求。"
expr: |
sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="memory"})
/
sum(kube_node_status_allocatable_memory_bytes{job="node-exporter"})
> 1.5
for: 5m
labels:
severity: warning
- alert: KubeQuotaExceeded
annotations:
platform: "测试平台"
message: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage
}} of its {{ $labels.resource }} quota.
expr: |
kube_resourcequota{job="kube-state-metrics", type="used"}
/ ignoring(instance, job, type)
(kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
> 0.90
for: 15m
labels:
severity: warning
# - alert: CPUThrottlingHigh
# annotations:
# message: '{{ $value | humanizePercentage }} throttling of CPU in namespace
# {{ $labels.namespace }} for container {{ $labels.container }} in pod {{
# $labels.pod }}.'
# expr: |
# sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace)
# /
# sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)
# > ( 25 / 100 )
# for: 15m
# labels:
# severity: warning
- name: kubernetes-storage
rules:
- alert: KubePersistentVolumeUsageCritical
annotations:
platform: "测试平台"
message: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim
}} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage
}} free.
expr: |
kubelet_volume_stats_available_bytes{job="kubelet"}
/
kubelet_volume_stats_capacity_bytes{job="kubelet"}
< 0.03
for: 1m
labels:
severity: critical
- alert: KubePersistentVolumeFullInFourDays
annotations:
platform: "测试平台"
message: "根据最近的抽样,{{$labels.persistentvolumeclaim}}在命名空间{{$labels.namespace}}中声明的PersistentVolume预计将在四天内填满,目前{{$value | humanizePercentage}}可用。"
expr: |
(
kubelet_volume_stats_available_bytes{job="kubelet"}
/
kubelet_volume_stats_capacity_bytes{job="kubelet"}
) < 0.15
and
predict_linear(kubelet_volume_stats_available_bytes{job="kubelet"}[6h], 4 * 24 * 3600) < 0
for: 1h
labels:
severity: critical
- alert: KubePersistentVolumeErrors
annotations:
platform: "测试平台"
message: The persistent volume {{ $labels.persistentvolume }} has status {{
$labels.phase }}.
expr: |
kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0
for: 5m
labels:
severity: critical
- name: kubernetes-system
rules:
- alert: KubeVersionMismatch
annotations:
platform: "测试平台"
message: There are {{ $value }} different semantic versions of Kubernetes
components running.
expr: |
count(count by (gitVersion) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"},"gitVersion","$1","gitVersion","(v[0-9]*.[0-9]*.[0-9]*).*"))) > 1
for: 15m
labels:
severity: warning
- alert: KubeClientErrors
annotations:
platform: "测试平台"
message: Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance
}}' is experiencing {{ $value | humanizePercentage }} errors.'
expr: |
(sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job)
/
sum(rate(rest_client_requests_total[5m])) by (instance, job))
> 0.01
for: 15m
labels:
severity: warning
- name: kubernetes-system-apiserver
rules:
- alert: KubeAPILatencyHigh
annotations:
platform: "测试平台"
message: The API server has a 99th percentile latency of {{ $value }} seconds
for {{ $labels.verb }} {{ $labels.resource }}.
expr: |
cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|PROXY|CONNECT"} > 1
for: 10m
labels:
severity: warning
- alert: KubeAPILatencyHigh
annotations:
platform: "测试平台"
message: The API server has a 99th percentile latency of {{ $value }} seconds
for {{ $labels.verb }} {{ $labels.resource }}.
expr: |
cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|PROXY|CONNECT"} > 4
for: 10m
labels:
severity: critical
- alert: KubeAPIErrorsHigh
annotations:
platform: "测试平台"
message: API server is returning errors for {{ $value | humanizePercentage
}} of requests.
expr: |
sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m]))
/
sum(rate(apiserver_request_total{job="apiserver"}[5m])) > 0.03
for: 10m
labels:
severity: critical
- alert: KubeAPIErrorsHigh
annotations:
platform: "测试平台"
message: API server is returning errors for {{ $value | humanizePercentage
}} of requests.
expr: |
sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m]))
/
sum(rate(apiserver_request_total{job="apiserver"}[5m])) > 0.01
for: 10m
labels:
severity: warning
- alert: KubeAPIErrorsHigh
annotations:
platform: "测试平台"
message: API server is returning errors for {{ $value | humanizePercentage
}} of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource
}}.
expr: |
sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m])) by (resource,subresource,verb)
/
sum(rate(apiserver_request_total{job="apiserver"}[5m])) by (resource,subresource,verb) > 0.10
for: 10m
labels:
severity: critical
- alert: KubeAPIErrorsHigh
annotations:
platform: "测试平台"
message: API server is returning errors for {{ $value | humanizePercentage
}} of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource
}}.
expr: |
sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m])) by (resource,subresource,verb)
/
sum(rate(apiserver_request_total{job="apiserver"}[5m])) by (resource,subresource,verb) > 0.05
for: 10m
labels:
severity: warning
- alert: KubeClientCertificateExpiration
annotations:
platform: "测试平台"
message: "用于验证apiserver的客户端证书的有效期限少于7.0天。"
expr: |
apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800
labels:
severity: warning
- alert: KubeClientCertificateExpiration
annotations:
platform: "测试平台"
message: "用于验证apiserver的客户端证书的有效期限少于24.0小时。"
expr: |
apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400
labels:
severity: critical
- alert: KubeAPIDown
annotations:
platform: "测试平台"
message: "KubeAPI已从Prometheus目标发现中消失。"
expr: |
absent(up{job="apiserver"} == 1)
for: 15m
labels:
severity: critical
- name: kubernetes-system-kubelet
rules:
- alert: KubeNodeNotReady
annotations:
platform: "测试平台"
message: "{{$labels.node}}尚未准备就绪超过15分钟。"
expr: |
kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0
for: 15m
labels:
severity: warning
- alert: KubeNodeUnreachable
annotations:
platform: "测试平台"
message: "{{$labels.node}}无法访问,某些工作负荷可能会重新安排。"
expr: |
kube_node_spec_taint{job="kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1
labels:
severity: warning
- alert: KubeletTooManyPods
annotations:
platform: "测试平台"
message: Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage
}} of its Pod capacity.
expr: |
max(max(kubelet_running_pod_count{job="kubelet"}) by(instance) * on(instance) group_left(node) kubelet_node_name{job="kubelet"}) by(node) / max(kube_node_status_capacity_pods{job="kube-state-metrics"}) by(node) > 0.95
for: 15m
labels:
severity: warning
- alert: KubeletDown
annotations:
platform: "测试平台"
message: "Kubelet已从Prometheus目标发现中消失。"
expr: |
absent(up{job="kubelet"} == 1)
for: 15m
labels:
severity: critical
# - name: kubernetes-system-scheduler
# rules:
# - alert: KubeSchedulerDown
# annotations:
# message: KubeScheduler has disappeared from Prometheus target discovery.
# runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeschedulerdown
# expr: |
# absent(up{job="kube-scheduler"} == 1)
# for: 15m
# labels:
# severity: critical
# - name: kubernetes-system-controller-manager
# rules:
# - alert: KubeControllerManagerDown
# annotations:
# message: KubeControllerManager has disappeared from Prometheus target discovery.
# runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecontrollermanagerdown
# expr: |
# absent(up{job="kube-controller-manager"} == 1)
# for: 15m
# labels:
# severity: critical
- name: prometheus
rules:
- alert: PrometheusBadConfig
annotations:
platform: "测试平台"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to
reload its configuration.
summary: "Prometheus配置重新加载失败。"
expr: |
# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
max_over_time(prometheus_config_last_reload_successful{job="prometheus-k8s",namespace="monitoring"}[5m]) == 0
for: 10m
labels:
severity: critical
- alert: PrometheusNotificationQueueRunningFull
annotations:
platform: "测试平台"
description: Alert notification queue of Prometheus {{$labels.namespace}}/{{$labels.pod}}
is running full.
summary: "Prometheus警报通知队列预计将在30m以内用完。"
expr: |
# Without min_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
(
predict_linear(prometheus_notifications_queue_length{job="prometheus-k8s",namespace="monitoring"}[5m], 60 * 30)
>
min_over_time(prometheus_notifications_queue_capacity{job="prometheus-k8s",namespace="monitoring"}[5m])
)
for: 15m
labels:
severity: warning
- alert: PrometheusErrorSendingAlertsToSomeAlertmanagers
annotations:
platform: "测试平台"
description: '{{ printf "%.1f" $value }}% errors while sending alerts from
Prometheus {{$labels.namespace}}/{{$labels.pod}} to Alertmanager {{$labels.alertmanager}}.'
summary: "Prometheus在将警报发送到特定的Alertmanager时遇到了超过1%的错误。"
expr: |
(
rate(prometheus_notifications_errors_total{job="prometheus-k8s",namespace="monitoring"}[5m])
/
rate(prometheus_notifications_sent_total{job="prometheus-k8s",namespace="monitoring"}[5m])
)
* 100
> 1
for: 15m
labels:
severity: warning
- alert: PrometheusErrorSendingAlertsToAnyAlertmanager
annotations:
platform: "测试平台"
description: '{{ printf "%.1f" $value }}% minimum errors while sending alerts
from Prometheus {{$labels.namespace}}/{{$labels.pod}} to any Alertmanager.'
summary: "Prometheus在将警报发送到任何Alertmanager时遇到3%以上的错误。"
expr: |
min without(alertmanager) (
rate(prometheus_notifications_errors_total{job="prometheus-k8s",namespace="monitoring"}[5m])
/
rate(prometheus_notifications_sent_total{job="prometheus-k8s",namespace="monitoring"}[5m])
)
* 100
> 3
for: 15m
labels:
severity: critical
- alert: PrometheusNotConnectedToAlertmanagers
annotations:
platform: "测试平台"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not connected
to any Alertmanagers.
summary: "Prometheus未与任何Alertmanager连接。"
expr: |
# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
max_over_time(prometheus_notifications_alertmanagers_discovered{job="prometheus-k8s",namespace="monitoring"}[5m]) < 1
for: 10m
labels:
severity: warning
- alert: PrometheusTSDBReloadsFailing
annotations:
platform: "测试平台"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has detected
{{$value | humanize}} reload failures over the last 3h.
summary: "Prometheus从磁盘重新加载块时遇到问题。"
expr: |
increase(prometheus_tsdb_reloads_failures_total{job="prometheus-k8s",namespace="monitoring"}[3h]) > 0
for: 4h
labels:
severity: warning
- alert: PrometheusTSDBCompactionsFailing
annotations:
platform: "测试平台"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has detected
{{$value | humanize}} compaction failures over the last 3h.
summary: "Prometheus在压缩块时遇到问题。"
expr: |
increase(prometheus_tsdb_compactions_failed_total{job="prometheus-k8s",namespace="monitoring"}[3h]) > 0
for: 4h
labels:
severity: warning
- alert: PrometheusNotIngestingSamples
annotations:
platform: "测试平台"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not ingesting
samples.
summary: "Prometheus没有获取到样本"
expr: |
rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-k8s",namespace="monitoring"}[5m]) <= 0
for: 10m
labels:
severity: warning
- alert: PrometheusDuplicateTimestamps
annotations:
platform: "测试平台"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is dropping
{{ printf "%.4g" $value }} samples/s with different values but duplicated
timestamp.
summary: "Prometheus正在删除带有重复时间戳的样本。"
expr: |
rate(prometheus_target_scrapes_sample_duplicate_timestamp_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
for: 10m
labels:
severity: warning
- alert: PrometheusOutOfOrderTimestamps
annotations:
platform: "测试平台"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is dropping
{{ printf "%.4g" $value }} samples/s with timestamps arriving out of order.
summary: "Prometheus丢弃带有乱序时间戳的样本。"
expr: |
rate(prometheus_target_scrapes_sample_out_of_order_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
for: 10m
labels:
severity: warning
- alert: PrometheusRemoteStorageFailures
annotations:
platform: "测试平台"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} failed to send
{{ printf "%.1f" $value }}% of the samples to queue {{$labels.queue}}.
summary: "Prometheus无法将样本发送到远程存储。"
expr: |
(
rate(prometheus_remote_storage_failed_samples_total{job="prometheus-k8s",namespace="monitoring"}[5m])
/
(
rate(prometheus_remote_storage_failed_samples_total{job="prometheus-k8s",namespace="monitoring"}[5m])
+
rate(prometheus_remote_storage_succeeded_samples_total{job="prometheus-k8s",namespace="monitoring"}[5m])
)
)
* 100
> 1
for: 15m
labels:
severity: critical
- alert: PrometheusRemoteWriteBehind
annotations:
platform: "测试平台"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} remote write
is {{ printf "%.1f" $value }}s behind for queue {{$labels.queue}}.
summary: "Prometheus远程写入落后了。"
expr: |
# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
(
max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds{job="prometheus-k8s",namespace="monitoring"}[5m])
- on(job, instance) group_right
max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds{job="prometheus-k8s",namespace="monitoring"}[5m])
)
> 120
for: 15m
labels:
severity: critical
- alert: PrometheusRemoteWriteDesiredShards
annotations:
platform: "测试平台"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} remote write
desired shards calculation wants to run {{ $value }} shards, which is more
than the max of {{ printf `prometheus_remote_storage_shards_max{instance="%s",job="prometheus-k8s",namespace="monitoring"}`
$labels.instance | query | first | value }}.
summary: "Prometheus远程写入所需的分片计算要比配置的最大分片运行更多。"
expr: |
# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
(
max_over_time(prometheus_remote_storage_shards_desired{job="prometheus-k8s",namespace="monitoring"}[5m])
>
max_over_time(prometheus_remote_storage_shards_max{job="prometheus-k8s",namespace="monitoring"}[5m])
)
for: 15m
labels:
severity: warning
- alert: PrometheusRuleFailures
annotations:
platform: "测试平台"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to
evaluate {{ printf "%.0f" $value }} rules in the last 5m.
summary: "Prometheus无法通过规则评估。"
expr: |
increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
for: 15m
labels:
severity: critical
- alert: PrometheusMissingRuleEvaluations
annotations:
platform: "测试平台"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has missed {{
printf "%.0f" $value }} rule group evaluations in the last 5m.
summary: "Prometheus由于规则组评估速度慢而缺少规则评估。"
expr: |
increase(prometheus_rule_group_iterations_missed_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
for: 15m
labels:
severity: warning
- name: alertmanager.rules
rules:
- alert: AlertmanagerConfigInconsistent
annotations:
platform: "测试平台"
message: "Alertmanager {{$labels.service}}实例的配置不同步。"
expr: |
count_values("config_hash", alertmanager_config_hash{job="alertmanager-main",namespace="monitoring"}) BY (service) / ON(service) GROUP_LEFT() label_replace(max(prometheus_operator_spec_replicas{job="prometheus-operator",namespace="monitoring",controller="alertmanager"}) by (name, job, namespace, controller), "service", "alertmanager-$1", "name", "(.*)") != 1
for: 5m
labels:
severity: critical
- alert: AlertmanagerFailedReload
annotations:
platform: "测试平台"
message: "Alertmanager {{$labels.namespace}}/{{$labels.pod}}重新加载配置失败。"
expr: |
alertmanager_config_last_reload_successful{job="alertmanager-main",namespace="monitoring"} == 0
for: 10m
labels:
severity: warning
- alert: AlertmanagerMembersInconsistent
annotations:
platform: "测试平台"
message: "Alertmanager尚未找到集群的所有其他成员。"
expr: |
alertmanager_cluster_members{job="alertmanager-main",namespace="monitoring"}
!= on (service) GROUP_LEFT()
count by (service) (alertmanager_cluster_members{job="alertmanager-main",namespace="monitoring"})
for: 5m
labels:
severity: critical
# - name: general.rules
# rules:
# - alert: TargetDown
# annotations:
# platform: "测试平台"
# message: '{{ printf "%.4g" $value }}% of the {{ $labels.job }} targets in
# {{ $labels.namespace }} namespace are down.'
# expr: 100 * (count(up == 0) BY (job, namespace, service) / count(up) BY (job,
# namespace, service)) > 10
# for: 10m
# labels:
# severity: warning
# - alert: Watchdog
# annotations:
# platform: "测试平台"
# message: "此警报始终处于触发状态,旨在确保整个警报管道均正常运行。"
# runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md
# expr: vector(1)
# labels:
# severity: none
- name: node-time
rules:
- alert: ClockSkewDetected
annotations:
platform: "测试平台"
message: Clock skew detected on node-exporter {{ $labels.namespace }}/{{ $labels.pod
}}. Ensure NTP is configured correctly on this host.
expr: |
abs(node_timex_offset_seconds{job="node-exporter"}) > 0.05
for: 2m
labels:
severity: warning
- name: node-network
rules:
- alert: NodeNetworkInterfaceFlapping
annotations:
platform: "测试平台"
message: Network interface "{{ $labels.device }}" changing it's up status
often on node-exporter {{ $labels.namespace }}/{{ $labels.pod }}"
expr: |
changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2
for: 2m
labels:
severity: warning
- name: prometheus-operator
rules:
- alert: PrometheusOperatorReconcileErrors
annotations:
platform: "测试平台"
message: Errors while reconciling {{ $labels.controller }} in {{ $labels.namespace
}} Namespace.
expr: |
rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
for: 10m
labels:
severity: warning
- alert: PrometheusOperatorNodeLookupErrors
annotations:
platform: "测试平台"
message: Errors while reconciling Prometheus in {{ $labels.namespace }} Namespace.
expr: |
rate(prometheus_operator_node_address_lookup_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
for: 10m
labels:
severity: warning

prometheus-rules.yaml

接下来参考prometheus-rules.yaml,新建自定义的告警项prometheus-additional-rules.yaml

kind: PrometheusRule
metadata:
labels:
prometheus: k8s
role: alert-rules
name: prometheus-additional-rules
namespace: monitoring
spec:
groups:
- name: general.rules
rules:
- alert: InstanceDown
expr: up == 0
for: 30s
labels:
severity: critical
annotations:
platform: "育苗通测试平台"
summary: "监控采集器 {{ $labels.instance }} 停止工作!!!"
value: "{{ $value }}" - name: node.rules
rules:
- alert: NodeFilesystemUsage
expr: 100 - (node_filesystem_free_bytes{device="rootfs"} / node_filesystem_size_bytes{device="rootfs"} * 100) > 85
for: 5m
labels:
severity: warning
annotations:
platform: "育苗通测试平台"
summary: "主机 {{ $labels.instance }} : {{ $labels.mountpoint }} 分区使用率超过80%!!!"
value: "{{ $value }}" - alert: NodeMemoryUsage
expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
for: 5m
labels:
severity: warning
annotations:
platform: "育苗通测试平台"
summary: "主机 {{ $labels.instance }} 内存使用率超过80%!!!"
value: "{{ $value }}" - alert: NodeCPUUsage
expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
platform: "育苗通测试平台"
summary: "主机 {{ $labels.instance }} CPU使用率超过80%!!!"
value: "{{ $value }}"


五、添加k8s外部监控

一个项目开始可能很难实现全部容器化,比如数据库、CDH集群。但是我们依然需要监控他们,如果分成两套prometheus不利于管理,所以我们统一添加这些监控到kube-prometheus中。

那么接下来我们新建prometheus-additional.yaml文件,添加额外监控组件配置scrape_configs。

- job_name: 'node_export_others'
static_configs:
- targets:
- *.*.*.118:9591
- *.*.*.137:9591
- *.*.*.138:9591
- *.*.*.139:9591
- *.*.*.153:9591
- *.*.*.154:9591
- *.*.*.156:9591
- *.*.*.159:9591
- *.*.*.166:9591
- *.*.*.167:9591
- *.*.*.168:9591
- *.*.*.169:9591
- *.*.*.175:9591
- *.*.*.177:9591
- *.*.*.179:9591
- *.*.*.180:9591
- *.*.*.181:9591 - job_name: 'nginx'
static_configs:
- targets:
- *.*.*.159:9593 - job_name: 'mysql'
static_configs:
- targets:
- *.*.*.156:9592 - job_name: 'pushgateway-flink'
static_configs:
- targets:
- *.*.*.137:9091 - job_name: 'kafka_jmx_export'
static_configs:
- targets:
- *.*.*.118:9991
- *.*.*.138:9991
- *.*.*.139:9991 - job_name: 'zookeeper_cdh_export'
static_configs:
- targets:
- *.*.*.118:9595
- *.*.*.138:9595
- *.*.*.139:9595 - job_name: 'hdfs_namenode_cdh'
static_configs:
- targets:
- *.*.*.137:9600
- *.*.*.139:9600 - job_name: 'hdfs_datanode_cdh'
static_configs:
- targets:
- *.*.*.118:9601
- *.*.*.138:9601
- *.*.*.139:9601 - job_name: 'yarn_resourcemanager_cdh'
static_configs:
- targets:
- *.*.*.137:9602
- *.*.*.139:9602 - job_name: 'yarn_nodemanager_cdh'
static_configs:
- targets:
- *.*.*.118:9603
- *.*.*.138:9603
- *.*.*.139:9603 - job_name: 'redis_exporter'
static_configs:
- targets:
- *.*.*.166:9594 - job_name: 'elasticsearch'
metrics_path: "/_prometheus/metrics"
static_configs:
- targets:
- *.*.*.169:8970
- *.*.*.175:8970
- *.*.*.177:8970 - job_name: 'nacos'
metrics_path: '/nacos/actuator/prometheus'
static_configs:
- targets:
- *.*.*.166:8848
- *.*.*.167:8848
- *.*.*.168:8848 - job_name: 'redis_exporter_targets'
static_configs:
- targets:
- redis://*.*.*.179:6179
- redis://*.*.*.180:6179
- redis://*.*.*.181:6179
- redis://*.*.*.179:6178
- redis://*.*.*.180:6178
- redis://*.*.*.181:6178
- redis://*.*.*.179:6177
- redis://*.*.*.179:6176
metrics_path: /scrape
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: *.*.*.166:9594

prometheus-additional.yaml

然后我们需要将这些监控配置以secret资源类型存储到k8s集群中。

kubectl create secret generic additional-scrape-configs --from-file=prometheus-additional.yaml -n monitoring


六、配置Alertmanager

监控和告警项已经配置好了,那么接下来我们将进行alertmanager告警配置了。

常用的接收方式就是邮件了,但这里我们将使用企业微信号进行接收,所以开发一个连接微信的应用appalertservice,进行消息转发和处理。

当然,你也可以直接配置微信号和消息模板,可参考:第3章 Prometheus告警处理

global:
resolve_timeout: 5m
# smtp_smarthost: 'smtp.sina.com:25'
# smtp_from: '******@sina.com'
# smtp_auth_username: '******@sina.com'
# smtp_auth_password: '******'
route:
group_by: ['job']
group_wait: 20s
group_interval: 30m
repeat_interval: 12h
receiver: webhook
receivers:
- name: webhook
webhook_configs:
- url: 'http://appalertservice:20119/'
# - name: 'email'
# email_configs:
# - to: '******@163.com'

然后我们需要将alertmanager配置以secret资源类型存储到k8s集群中。

kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring


七、配置Grafana

前面我们把监控和告警已经配置好了,那接下来就剩展示了。打开grafana  ->  点击添加按钮 ->Import ->Upload .json file,导入监控仪表板。

{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"description": "This dashboard provides cluster admins with the ability to monitor nodes and identify workload bottlenecks. It can be deployed with PSPs enabled using the following helm chart - https://github.com/pivotal-cf/charts-grafana",
"editable": true,
"gnetId": 10000,
"graphTooltip": 0,
"id": 102,
"iteration": 1597137794957,
"links": [],
"panels": [
{
"collapsed": false,
"datasource": null,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 0
},
"id": 34,
"panels": [],
"repeat": null,
"title": "Summary",
"type": "row"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": true,
"error": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 5,
"w": 8,
"x": 0,
"y": 1
},
"height": "180px",
"id": 4,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum (container_memory_working_set_bytes{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"}) / sum (machine_memory_bytes{kubernetes_io_hostname=~\"^$Node$\"}) * 100",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"refId": "A",
"step": 10
}
],
"thresholds": "65, 90",
"title": "Cluster memory usage",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 5,
"w": 8,
"x": 8,
"y": 1
},
"height": "180px",
"id": 6,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum (rate (container_cpu_usage_seconds_total{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) / sum (machine_cpu_cores{kubernetes_io_hostname=~\"^$Node$\"}) * 100",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "",
"refId": "A",
"step": 10
}
],
"thresholds": "65, 90",
"title": "Cluster CPU usage",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 5,
"w": 8,
"x": 16,
"y": 1
},
"height": "180px",
"id": 7,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum (container_fs_usage_bytes{id=\"/\"}) / sum (container_fs_limit_bytes{id=\"/\"}) * 100",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "",
"metric": "",
"refId": "A",
"step": 10
}
],
"thresholds": "65, 90",
"title": "Cluster filesystem usage",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"format": "bytes",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 3,
"w": 4,
"x": 0,
"y": 6
},
"height": "1px",
"id": 9,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "20%",
"prefix": "",
"prefixFontSize": "20%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum (container_memory_working_set_bytes{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"})",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"refId": "A",
"step": 10
}
],
"thresholds": "",
"title": "Used",
"type": "singlestat",
"valueFontSize": "50%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"format": "bytes",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 3,
"w": 4,
"x": 4,
"y": 6
},
"height": "1px",
"id": 10,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum (machine_memory_bytes{kubernetes_io_hostname=~\"^$Node$\"})",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"refId": "A",
"step": 10
}
],
"thresholds": "",
"title": "Total",
"type": "singlestat",
"valueFontSize": "50%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 3,
"w": 4,
"x": 8,
"y": 6
},
"height": "1px",
"id": 11,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"postfix": " cores",
"postfixFontSize": "30%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum (rate (container_cpu_usage_seconds_total{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"}[$interval]))",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "",
"refId": "A",
"step": 10
}
],
"thresholds": "",
"title": "Used",
"type": "singlestat",
"valueFontSize": "50%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 3,
"w": 4,
"x": 12,
"y": 6
},
"height": "1px",
"id": 12,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"postfix": " cores",
"postfixFontSize": "30%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum (machine_cpu_cores{kubernetes_io_hostname=~\"^$Node$\"})",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"refId": "A",
"step": 10
}
],
"thresholds": "",
"title": "Total",
"type": "singlestat",
"valueFontSize": "50%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"format": "bytes",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 3,
"w": 4,
"x": 16,
"y": 6
},
"height": "1px",
"id": 13,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum (container_fs_usage_bytes{id=\"/\"})",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "",
"refId": "A",
"step": 10
}
],
"thresholds": "",
"title": "Used",
"type": "singlestat",
"valueFontSize": "50%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"format": "bytes",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 3,
"w": 4,
"x": 20,
"y": 6
},
"height": "1px",
"id": 14,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum (container_fs_limit_bytes{id=\"/\"})",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "",
"refId": "A",
"step": 10
}
],
"thresholds": "",
"title": "Total",
"type": "singlestat",
"valueFontSize": "50%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"collapsed": false,
"datasource": null,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 9
},
"id": 35,
"panels": [],
"repeat": null,
"title": "Memory",
"type": "row"
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"fill": 0,
"fillGradient": 0,
"grid": {},
"gridPos": {
"h": 7,
"w": 24,
"x": 0,
"y": 10
},
"id": 25,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": false,
"max": false,
"min": false,
"rightSide": true,
"show": true,
"sideWidth": 200,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "connected",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": true,
"steppedLine": false,
"targets": [
{
"expr": "sum (container_memory_working_set_bytes{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}) by (pod)",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "{{ pod }}",
"metric": "container_memory_usage:sort_desc",
"refId": "A",
"step": 10
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Pods memory usage",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 2,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"collapsed": false,
"datasource": null,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 17
},
"id": 37,
"panels": [],
"title": "CPU",
"type": "row"
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 3,
"editable": true,
"error": false,
"fill": 0,
"fillGradient": 0,
"grid": {},
"gridPos": {
"h": 7,
"w": 24,
"x": 0,
"y": 18
},
"height": "",
"id": 17,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"max": false,
"min": false,
"rightSide": true,
"show": true,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "connected",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": true,
"steppedLine": false,
"targets": [
{
"expr": "sum (rate (container_cpu_usage_seconds_total{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) by (pod)",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "{{ pod }}",
"metric": "container_cpu",
"refId": "A",
"step": 10
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Pods CPU usage",
"tooltip": {
"msResolution": true,
"shared": true,
"sort": 2,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "none",
"label": "cores",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"collapsed": false,
"datasource": null,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 25
},
"id": 33,
"panels": [],
"repeat": null,
"title": "Network I/O",
"type": "row"
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"fill": 1,
"fillGradient": 0,
"grid": {},
"gridPos": {
"h": 7,
"w": 24,
"x": 0,
"y": 26
},
"id": 16,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"max": false,
"min": false,
"rightSide": true,
"show": true,
"sideWidth": 200,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum (rate (container_network_receive_bytes_total{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) by (pod)",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "-> {{ pod }}",
"metric": "network",
"refId": "A",
"step": 10
},
{
"expr": "- sum (rate (container_network_transmit_bytes_total{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) by (pod)",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "<- {{ pod }}",
"metric": "network",
"refId": "B",
"step": 10
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Pods network I/O",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 2,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "Bps",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"fill": 1,
"fillGradient": 0,
"grid": {},
"gridPos": {
"h": 5,
"w": 24,
"x": 0,
"y": 33
},
"height": "200px",
"id": 32,
"legend": {
"alignAsTable": false,
"avg": true,
"current": true,
"max": false,
"min": false,
"rightSide": false,
"show": false,
"sideWidth": 200,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum (rate (container_network_receive_bytes_total{kubernetes_io_hostname=~\"^$Node$\"}[$interval]))",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "Received",
"metric": "network",
"refId": "A",
"step": 10
},
{
"expr": "- sum (rate (container_network_transmit_bytes_total{kubernetes_io_hostname=~\"^$Node$\"}[$interval]))",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "Sent",
"metric": "network",
"refId": "B",
"step": 10
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Network I/O pressure",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "Bps",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "Bps",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
}
],
"refresh": "10s",
"schemaVersion": 20,
"style": "dark",
"tags": [
"Prometheus",
"Kubernetes"
],
"templating": {
"list": [
{
"auto": true,
"auto_count": 20,
"auto_min": "2m",
"current": {
"text": "auto",
"value": "$__auto_interval_interval"
},
"hide": 2,
"label": null,
"name": "interval",
"options": [
{
"selected": true,
"text": "auto",
"value": "$__auto_interval_interval"
},
{
"selected": false,
"text": "1m",
"value": "1m"
},
{
"selected": false,
"text": "10m",
"value": "10m"
},
{
"selected": false,
"text": "30m",
"value": "30m"
},
{
"selected": false,
"text": "1h",
"value": "1h"
},
{
"selected": false,
"text": "6h",
"value": "6h"
},
{
"selected": false,
"text": "12h",
"value": "12h"
},
{
"selected": false,
"text": "1d",
"value": "1d"
},
{
"selected": false,
"text": "7d",
"value": "7d"
},
{
"selected": false,
"text": "14d",
"value": "14d"
},
{
"selected": false,
"text": "30d",
"value": "30d"
}
],
"query": "1m,10m,30m,1h,6h,12h,1d,7d,14d,30d",
"refresh": 2,
"skipUrlSync": false,
"type": "interval"
},
{
"current": {
"text": "prometheus",
"value": "prometheus"
},
"hide": 0,
"includeAll": false,
"label": null,
"multi": false,
"name": "datasource",
"options": [],
"query": "prometheus",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"type": "datasource"
},
{
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"datasource": "prometheus",
"definition": "",
"hide": 0,
"includeAll": true,
"label": null,
"multi": false,
"name": "Node",
"options": [],
"query": "label_values(kubernetes_io_hostname)",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 0,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
}
]
},
"time": {
"from": "now-5m",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"timezone": "browser",
"title": "育苗通K8S集群监控",
"uid": "6KoW2MIGk",
"version": 15
}

k8s-model.json

{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"description": "【中文版本】2020.06.28更新,增加整体资源展示!支持 Grafana6&7,Node Exporter v0.16及以上的版本,优化重要指标展示。包含整体资源展示与资源明细图表:CPU 内存 磁盘 IO 网络等监控指标。https://github.com/starsliao/Prometheus",
"editable": true,
"gnetId": 8919,
"graphTooltip": 0,
"id": 72,
"iteration": 1597137684806,
"links": [
{
"icon": "external link",
"tags": [],
"targetBlank": true,
"title": "更新node_exporter",
"tooltip": "",
"type": "link",
"url": "https://github.com/prometheus/node_exporter/releases"
},
{
"icon": "external link",
"tags": [],
"targetBlank": true,
"title": "更新当前仪表板",
"tooltip": "",
"type": "link",
"url": "https://grafana.com/dashboards/8919"
},
{
"icon": "external link",
"tags": [],
"targetBlank": true,
"title": "StarsL.cn",
"tooltip": "",
"type": "link",
"url": "https://starsl.cn"
},
{
"asDropdown": true,
"icon": "external link",
"tags": [],
"targetBlank": true,
"title": "",
"type": "dashboards"
}
],
"panels": [
{
"collapsed": false,
"datasource": "prometheus",
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 0
},
"id": 187,
"panels": [],
"title": "资源总览(关联JOB项)当前选中主机:【$show_hostname】实例:$node",
"type": "row"
},
{
"columns": [],
"datasource": "prometheus",
"description": "分区使用率、磁盘读取、磁盘写入、下载带宽、上传带宽,如果有多个网卡或者多个分区,是采集的使用率最高的网卡或者分区的数值。",
"fontSize": "100%",
"gridPos": {
"h": 12,
"w": 24,
"x": 0,
"y": 1
},
"id": 185,
"options": {},
"pageSize": 10,
"showHeader": true,
"sort": {
"col": 5,
"desc": false
},
"styles": [
{
"alias": "主机名",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 1,
"link": false,
"linkTooltip": "",
"linkUrl": "",
"mappingType": 1,
"pattern": "nodename",
"thresholds": [],
"type": "string",
"unit": "bytes"
},
{
"alias": "IP(链接到明细)",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": true,
"linkTargetBlank": false,
"linkTooltip": "浏览主机明细",
"linkUrl": "/d/9CWBz0bik/node-exporter?orgId=1&var-job=${job}&var-hostname=All&var-node=${__cell}&var-device=All",
"mappingType": 1,
"pattern": "instance",
"thresholds": [],
"type": "number",
"unit": "short"
},
{
"alias": "内存",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"mappingType": 1,
"pattern": "Value #B",
"thresholds": [],
"type": "number",
"unit": "bytes"
},
{
"alias": "CPU核",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": null,
"mappingType": 1,
"pattern": "Value #C",
"thresholds": [],
"type": "number",
"unit": "short"
},
{
"alias": " 运行时间",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "Value #D",
"thresholds": [],
"type": "number",
"unit": "s"
},
{
"alias": "分区使用率*",
"align": "auto",
"colorMode": "cell",
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "Value #E",
"thresholds": [
"",
""
],
"type": "number",
"unit": "percent"
},
{
"alias": "CPU使用率",
"align": "auto",
"colorMode": "cell",
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "Value #F",
"thresholds": [
"",
""
],
"type": "number",
"unit": "percent"
},
{
"alias": "内存使用率",
"align": "auto",
"colorMode": "cell",
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "Value #G",
"thresholds": [
"",
""
],
"type": "number",
"unit": "percent"
},
{
"alias": "磁盘读取*",
"align": "auto",
"colorMode": "cell",
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "Value #H",
"thresholds": [
"",
""
],
"type": "number",
"unit": "Bps"
},
{
"alias": "磁盘写入*",
"align": "auto",
"colorMode": "cell",
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "Value #I",
"thresholds": [
"",
""
],
"type": "number",
"unit": "Bps"
},
{
"alias": "下载带宽*",
"align": "auto",
"colorMode": "cell",
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "Value #J",
"thresholds": [
"",
""
],
"type": "number",
"unit": "bps"
},
{
"alias": "上传带宽*",
"align": "auto",
"colorMode": "cell",
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "Value #K",
"thresholds": [
"",
""
],
"type": "number",
"unit": "bps"
},
{
"alias": "5m负载",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "Value #L",
"thresholds": [],
"type": "number",
"unit": "short"
},
{
"alias": "",
"align": "right",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"decimals": 2,
"pattern": "/.*/",
"thresholds": [],
"type": "hidden",
"unit": "short"
}
],
"targets": [
{
"expr": "node_uname_info{job=~\"$job\"} - 0",
"format": "table",
"instant": true,
"interval": "",
"legendFormat": "主机名",
"refId": "A"
},
{
"expr": "sum(time() - node_boot_time_seconds{job=~\"$job\"})by(instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "运行时间",
"refId": "D"
},
{
"expr": "node_memory_MemTotal_bytes{job=~\"$job\"} - 0",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "总内存",
"refId": "B"
},
{
"expr": "count(node_cpu_seconds_total{job=~\"$job\",mode='system'}) by (instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "总核数",
"refId": "C"
},
{
"expr": "node_load5{job=~\"$job\"}",
"format": "table",
"instant": true,
"interval": "",
"legendFormat": "5分钟负载",
"refId": "L"
},
{
"expr": "(1 - avg(irate(node_cpu_seconds_total{job=~\"$job\",mode=\"idle\"}[5m])) by (instance)) * 100",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "CPU使用率",
"refId": "F"
},
{
"expr": "(1 - (node_memory_MemAvailable_bytes{job=~\"$job\"} / (node_memory_MemTotal_bytes{job=~\"$job\"})))* 100",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "内存使用率",
"refId": "G"
},
{
"expr": "max((node_filesystem_size_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"}-node_filesystem_free_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"}) *100/(node_filesystem_avail_bytes {job=~\"$job\",fstype=~\"ext.?|xfs\"}+(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"}-node_filesystem_free_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"})))by(instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "分区使用率",
"refId": "E"
},
{
"expr": "max(irate(node_disk_read_bytes_total{job=~\"$job\"}[5m])) by (instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "最大读取",
"refId": "H"
},
{
"expr": "max(irate(node_disk_written_bytes_total{job=~\"$job\"}[5m])) by (instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "最大写入",
"refId": "I"
},
{
"expr": "max(irate(node_network_receive_bytes_total{job=~\"$job\"}[5m])*8) by (instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "下载带宽",
"refId": "J"
},
{
"expr": "max(irate(node_network_transmit_bytes_total{job=~\"$job\"}[5m])*8) by (instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "上传带宽",
"refId": "K"
}
],
"timeFrom": null,
"timeShift": null,
"title": "服务器资源总览表(每页10行)",
"transform": "table",
"type": "table"
},
{
"aliasColors": {
"192.168.200.241:9100_Total": "dark-red",
"Idle - Waiting for something to happen": "#052B51",
"guest": "#9AC48A",
"idle": "#052B51",
"iowait": "#EAB839",
"irq": "#BF1B00",
"nice": "#C15C17",
"sdb_每秒I/O操作%": "#d683ce",
"softirq": "#E24D42",
"steal": "#FCE2DE",
"system": "#508642",
"user": "#5195CE",
"磁盘花费在I/O操作占比": "#ba43a9"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": null,
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 0,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 8,
"x": 0,
"y": 13
},
"hiddenSeries": false,
"id": 191,
"legend": {
"alignAsTable": false,
"avg": false,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"maxPerRow": 6,
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"repeat": null,
"seriesOverrides": [
{
"alias": "总平均使用率",
"lines": false,
"pointradius": 1,
"points": true,
"yaxis": 2
},
{
"alias": "总核数",
"color": "#C4162A"
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "count(node_cpu_seconds_total{job=~\"$job\", mode='system'})",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "总核数",
"refId": "B",
"step": 240
},
{
"expr": "sum(node_load5{job=~\"$job\"})",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "总5分钟负载",
"refId": "A",
"step": 240
},
{
"expr": "avg(1 - avg(irate(node_cpu_seconds_total{job=~\"$job\",mode=\"idle\"}[5m])) by (instance)) * 100",
"format": "time_series",
"hide": false,
"interval": "30m",
"intervalFactor": 1,
"legendFormat": "总平均使用率",
"refId": "F",
"step": 240
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "$job:整体总负载与整体平均CPU使用率",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"decimals": null,
"format": "short",
"label": "总负载",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"decimals": 0,
"format": "percent",
"label": "平均使用率",
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"192.168.200.241:9100_总内存": "dark-red",
"内存_Avaliable": "#6ED0E0",
"内存_Cached": "#EF843C",
"内存_Free": "#629E51",
"内存_Total": "#6d1f62",
"内存_Used": "#eab839",
"可用": "#9ac48a",
"总内存": "#bf1b00"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 1,
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 0,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 8,
"x": 8,
"y": 13
},
"height": "",
"hiddenSeries": false,
"id": 195,
"legend": {
"alignAsTable": false,
"avg": false,
"current": true,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"sort": "current",
"sortDesc": false,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "总内存",
"color": "#C4162A",
"fill": 0
},
{
"alias": "总平均使用率",
"lines": false,
"pointradius": 1,
"points": true,
"yaxis": 2
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(node_memory_MemTotal_bytes{job=~\"$job\"})",
"format": "time_series",
"hide": false,
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "总内存",
"refId": "A",
"step": 4
},
{
"expr": "sum(node_memory_MemTotal_bytes{job=~\"$job\"} - node_memory_MemAvailable_bytes{job=~\"$job\"})",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "总已用",
"refId": "B",
"step": 4
},
{
"expr": "(sum(node_memory_MemTotal_bytes{job=~\"$job\"} - node_memory_MemAvailable_bytes{job=~\"$job\"}) / sum(node_memory_MemTotal_bytes{job=~\"$job\"}))*100",
"format": "time_series",
"hide": false,
"interval": "30m",
"intervalFactor": 1,
"legendFormat": "总平均使用率",
"refId": "H"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "$job:整体总内存与整体平均内存使用率",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"decimals": null,
"format": "bytes",
"label": "总内存量",
"logBase": 1,
"max": null,
"min": "",
"show": true
},
{
"decimals": null,
"format": "percent",
"label": "平均使用率",
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 1,
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 0,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 8,
"x": 16,
"y": 13
},
"hiddenSeries": false,
"id": 197,
"legend": {
"alignAsTable": false,
"avg": false,
"current": true,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "总平均使用率",
"lines": false,
"pointradius": 1,
"points": true,
"yaxis": 2
},
{
"alias": "总磁盘量",
"color": "#C4162A"
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))",
"format": "time_series",
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "总磁盘量",
"refId": "E"
},
{
"expr": "sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))",
"format": "time_series",
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "总使用量",
"refId": "C"
},
{
"expr": "(sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))) *100/(sum(avg(node_filesystem_avail_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))+(sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))))",
"format": "time_series",
"instant": false,
"interval": "30m",
"intervalFactor": 1,
"legendFormat": "总平均使用率",
"refId": "A"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "$job:整体总磁盘与整体平均磁盘使用率",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"decimals": 1,
"format": "bytes",
"label": "总磁盘量",
"logBase": 1,
"max": null,
"min": "",
"show": true
},
{
"decimals": null,
"format": "percent",
"label": "平均使用率",
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"collapsed": false,
"datasource": "prometheus",
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 21
},
"id": 189,
"panels": [],
"title": "资源明细:【$show_hostname】",
"type": "row"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorPostfix": false,
"colorPrefix": false,
"colorValue": true,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"decimals": 0,
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"format": "s",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"threshcisLabels": false,
"threshcisMarkers": true
},
"gridPos": {
"h": 2,
"w": 2,
"x": 0,
"y": 22
},
"hideTimeOverride": true,
"id": 15,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "null",
"nullText": null,
"options": {},
"pluginVersion": "6.4.2",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "avg(time() - node_boot_time_seconds{instance=~\"$node\"})",
"format": "time_series",
"hide": false,
"instant": true,
"interval": "",
"intervalFactor": 1,
"legendFormat": "",
"refId": "A",
"step": 40
}
],
"threshciss": "1,2",
"thresholds": "1,3",
"title": "运行时间",
"type": "singlestat",
"valueFontSize": "70%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"datasource": "prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"custom": {},
"decimals": 2,
"displayName": "",
"mappings": [
{
"from": "",
"id": 1,
"operator": "",
"text": "N/A",
"to": "",
"type": 1,
"value": ""
}
],
"max": 100,
"min": 0,
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 70
},
{
"color": "#EAB839",
"value": 90
}
]
},
"unit": "percent"
},
"overrides": []
},
"gridPos": {
"h": 6,
"w": 3,
"x": 2,
"y": 22
},
"id": 177,
"options": {
"displayMode": "lcd",
"fieldOptions": {
"calcs": [
"last"
],
"defaults": {
"decimals": 1,
"mappings": [
{
"from": "",
"id": 1,
"operator": "",
"text": "N/A",
"to": "",
"type": 1,
"value": ""
}
],
"max": 100,
"min": 0.1,
"thresholds": {
"": {
"color": "green",
"value": null
},
"": {
"color": "red",
"value": 80
},
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "#EAB839",
"value": 70
},
{
"color": "red",
"value": 90
}
]
},
"unit": "percent"
},
"override": {},
"overrides": [],
"values": false
},
"orientation": "horizontal",
"reduceOptions": {
"calcs": [
"mean"
],
"values": false
},
"showUnfilled": true
},
"pluginVersion": "6.4.3",
"targets": [
{
"expr": "100 - (avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"idle\"}[5m])) * 100)",
"instant": true,
"interval": "",
"legendFormat": "总CPU使用率",
"refId": "A"
},
{
"expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"iowait\"}[5m])) * 100",
"hide": true,
"instant": true,
"interval": "",
"legendFormat": "IOwait使用率",
"refId": "C"
},
{
"expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$node\"} / (node_memory_MemTotal_bytes{instance=~\"$node\"})))* 100",
"instant": true,
"interval": "",
"legendFormat": "内存使用率",
"refId": "B"
},
{
"expr": "(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"})*100 /(node_filesystem_avail_bytes {instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}+(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}))",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "最大分区({{mountpoint}})使用率",
"refId": "D"
},
{
"expr": "(1 - ((node_memory_SwapFree_bytes{instance=~\"$node\"} + 1)/ (node_memory_SwapTotal_bytes{instance=~\"$node\"} + 1))) * 100",
"instant": true,
"legendFormat": "交换分区使用率",
"refId": "F"
}
],
"timeFrom": null,
"timeShift": null,
"title": "",
"type": "bargauge"
},
{
"columns": [],
"datasource": "prometheus",
"description": "本看板中的:磁盘总量、使用量、可用量、使用率保持和df命令的Size、Used、Avail、Use% 列的值一致,并且Use%的值会四舍五入保留一位小数,会更加准确。\n\n注:df中Use%算法为:(size - free) * 100 / (avail + (size - free)),结果是整除则为该值,非整除则为该值+1,结果的单位是%。\n参考df命令源码:",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fontSize": "100%",
"gridPos": {
"h": 6,
"w": 10,
"x": 5,
"y": 22
},
"id": 181,
"links": [
{
"targetBlank": true,
"title": "https://github.com/coreutils/coreutils/blob/master/src/df.c",
"url": "https://github.com/coreutils/coreutils/blob/master/src/df.c"
}
],
"options": {},
"pageSize": null,
"scroll": true,
"showHeader": true,
"sort": {
"col": 6,
"desc": false
},
"styles": [
{
"alias": "分区",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "mountpoint",
"thresholds": [
""
],
"type": "string",
"unit": "bytes"
},
{
"alias": "可用空间",
"align": "auto",
"colorMode": "value",
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 1,
"mappingType": 1,
"pattern": "Value #A",
"thresholds": [
"",
""
],
"type": "number",
"unit": "bytes"
},
{
"alias": "使用率",
"align": "auto",
"colorMode": "cell",
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 1,
"mappingType": 1,
"pattern": "Value #B",
"thresholds": [
"",
""
],
"type": "number",
"unit": "percent"
},
{
"alias": "总空间",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 0,
"link": false,
"mappingType": 1,
"pattern": "Value #C",
"thresholds": [],
"type": "number",
"unit": "bytes"
},
{
"alias": "文件系统",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"mappingType": 1,
"pattern": "fstype",
"thresholds": [],
"type": "string",
"unit": "short"
},
{
"alias": "设备名",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"mappingType": 1,
"pattern": "device",
"preserveFormat": false,
"sanitize": false,
"thresholds": [],
"type": "string",
"unit": "short"
},
{
"alias": "",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"decimals": 2,
"pattern": "/.*/",
"preserveFormat": true,
"sanitize": false,
"thresholds": [],
"type": "hidden",
"unit": "short"
}
],
"targets": [
{
"expr": "node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-0",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"intervalFactor": 1,
"legendFormat": "总量",
"refId": "C"
},
{
"expr": "node_filesystem_avail_bytes {instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-0",
"format": "table",
"hide": false,
"instant": true,
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "",
"refId": "A"
},
{
"expr": "(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}) *100/(node_filesystem_avail_bytes {instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}+(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}))",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"intervalFactor": 1,
"legendFormat": "",
"refId": "B"
}
],
"title": "【$show_hostname】:各分区可用空间(EXT.*/XFS)",
"transform": "table",
"type": "table"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"#d44a3a"
],
"datasource": "prometheus",
"decimals": 2,
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 2,
"w": 2,
"x": 15,
"y": 22
},
"id": 20,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"pluginVersion": "6.4.2",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": true,
"lineColor": "#3274D9",
"show": true,
"ymax": null,
"ymin": null
},
"tableColumn": "",
"targets": [
{
"expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"iowait\"}[5m])) * 100",
"format": "time_series",
"hide": false,
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "",
"refId": "A",
"step": 20
}
],
"thresholds": "20,50",
"timeFrom": null,
"timeShift": null,
"title": "CPU iowait",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"aliasColors": {
"cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_cni0_in": "light-red",
"cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_cni0_in下载": "green",
"cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_cni0_out上传": "yellow",
"cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_eth0_in下载": "purple",
"cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_eth0_out": "purple",
"cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_eth0_out上传": "blue"
},
"bars": true,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": true,
"error": false,
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"grid": {},
"gridPos": {
"h": 6,
"w": 7,
"x": 17,
"y": 22
},
"hiddenSeries": false,
"id": 183,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": false,
"show": false,
"sort": "current",
"sortDesc": true,
"total": true,
"values": true
},
"lines": false,
"linewidth": 2,
"links": [],
"nullPointMode": "null as zero",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 1,
"points": false,
"renderer": "flot",
"repeat": null,
"seriesOverrides": [
{
"alias": "/.*_out上传$/",
"transform": "negative-Y"
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "increase(node_network_receive_bytes_total{instance=~\"$node\",device=~\"$device\"}[60m])",
"interval": "60m",
"intervalFactor": 1,
"legendFormat": "{{device}}_in下载",
"metric": "",
"refId": "A",
"step": 600,
"target": ""
},
{
"expr": "increase(node_network_transmit_bytes_total{instance=~\"$node\",device=~\"$device\"}[60m])",
"hide": false,
"interval": "60m",
"intervalFactor": 1,
"legendFormat": "{{device}}_out上传",
"refId": "B",
"step": 600
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "每小时流量$device",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"label": "上传(-)/下载(+)",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorPostfix": false,
"colorValue": true,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"format": "short",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 2,
"w": 2,
"x": 0,
"y": 24
},
"id": 14,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"maxPerRow": 6,
"nullPointMode": "null",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "count(node_cpu_seconds_total{instance=~\"$node\", mode='system'})",
"format": "time_series",
"instant": true,
"interval": "",
"intervalFactor": 1,
"legendFormat": "",
"refId": "A",
"step": 20
}
],
"thresholds": "1,2",
"title": "CPU 核数",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorPostfix": false,
"colorValue": true,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"decimals": null,
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"format": "short",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 2,
"w": 2,
"x": 15,
"y": 24
},
"id": 179,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"maxPerRow": 6,
"nullPointMode": "null",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "avg(node_filesystem_files_free{instance=~\"$node\",mountpoint=\"$maxmount\",fstype=~\"ext.?|xfs\"})",
"format": "time_series",
"instant": true,
"interval": "",
"intervalFactor": 1,
"legendFormat": "",
"refId": "A",
"step": 20
}
],
"thresholds": "100000,1000000",
"title": "剩余节点数:$maxmount ",
"type": "singlestat",
"valueFontSize": "70%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"decimals": 0,
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"format": "bytes",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 2,
"w": 2,
"x": 0,
"y": 26
},
"id": 75,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"maxPerRow": 6,
"nullPointMode": "null",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "70%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum(node_memory_MemTotal_bytes{instance=~\"$node\"})",
"format": "time_series",
"instant": true,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{instance}}",
"refId": "A",
"step": 20
}
],
"thresholds": "2,3",
"title": "总内存",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorPostfix": false,
"colorValue": true,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"decimals": null,
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"format": "locale",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 2,
"w": 2,
"x": 15,
"y": 26
},
"id": 178,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"maxPerRow": 6,
"nullPointMode": "null",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "avg(node_filefd_maximum{instance=~\"$node\"})",
"format": "time_series",
"instant": true,
"intervalFactor": 1,
"legendFormat": "",
"refId": "A",
"step": 20
}
],
"thresholds": "1024,10000",
"title": "总文件描述符",
"type": "singlestat",
"valueFontSize": "70%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"aliasColors": {
"192.168.200.241:9100_Total": "dark-red",
"Idle - Waiting for something to happen": "#052B51",
"guest": "#9AC48A",
"idle": "#052B51",
"iowait": "#EAB839",
"irq": "#BF1B00",
"nice": "#C15C17",
"sdb_每秒I/O操作%": "#d683ce",
"softirq": "#E24D42",
"steal": "#FCE2DE",
"system": "#508642",
"user": "#5195CE",
"磁盘花费在I/O操作占比": "#ba43a9"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 8,
"x": 0,
"y": 28
},
"hiddenSeries": false,
"id": 7,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": true,
"rightSide": false,
"show": true,
"sideWidth": null,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"maxPerRow": 6,
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"repeat": null,
"seriesOverrides": [
{
"alias": "/.*总使用率/",
"color": "#C4162A",
"fill": 0
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"system\"}[5m])) by (instance) *100",
"format": "time_series",
"hide": false,
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "系统使用率",
"refId": "A",
"step": 20
},
{
"expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"user\"}[5m])) by (instance) *100",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "用户使用率",
"refId": "B",
"step": 240
},
{
"expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"iowait\"}[5m])) by (instance) *100",
"format": "time_series",
"hide": false,
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "磁盘IO使用率",
"refId": "D",
"step": 240
},
{
"expr": "(1 - avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"idle\"}[5m])) by (instance))*100",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "总使用率",
"refId": "F",
"step": 240
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "CPU使用率",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"decimals": 0,
"format": "percent",
"label": "",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"192.168.200.241:9100_总内存": "dark-red",
"使用率": "yellow",
"内存_Avaliable": "#6ED0E0",
"内存_Cached": "#EF843C",
"内存_Free": "#629E51",
"内存_Total": "#6d1f62",
"内存_Used": "#eab839",
"可用": "#9ac48a",
"总内存": "#bf1b00"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 8,
"x": 8,
"y": 28
},
"height": "",
"hiddenSeries": false,
"id": 156,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": true,
"rightSide": false,
"show": true,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "总内存",
"color": "#C4162A",
"fill": 0
},
{
"alias": "使用率",
"color": "rgb(0, 209, 255)",
"lines": false,
"pointradius": 1,
"points": true,
"yaxis": 2
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "node_memory_MemTotal_bytes{instance=~\"$node\"}",
"format": "time_series",
"hide": false,
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "总内存",
"refId": "A",
"step": 4
},
{
"expr": "node_memory_MemTotal_bytes{instance=~\"$node\"} - node_memory_MemAvailable_bytes{instance=~\"$node\"}",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "已用",
"refId": "B",
"step": 4
},
{
"expr": "node_memory_MemAvailable_bytes{instance=~\"$node\"}",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "可用",
"refId": "F",
"step": 4
},
{
"expr": "node_memory_Buffers_bytes{instance=~\"$node\"}",
"format": "time_series",
"hide": true,
"intervalFactor": 1,
"legendFormat": "内存_Buffers",
"refId": "D",
"step": 4
},
{
"expr": "node_memory_MemFree_bytes{instance=~\"$node\"}",
"format": "time_series",
"hide": true,
"intervalFactor": 1,
"legendFormat": "内存_Free",
"refId": "C",
"step": 4
},
{
"expr": "node_memory_Cached_bytes{instance=~\"$node\"}",
"format": "time_series",
"hide": true,
"intervalFactor": 1,
"legendFormat": "内存_Cached",
"refId": "E",
"step": 4
},
{
"expr": "node_memory_MemTotal_bytes{instance=~\"$node\"} - (node_memory_Cached_bytes{instance=~\"$node\"} + node_memory_Buffers_bytes{instance=~\"$node\"} + node_memory_MemFree_bytes{instance=~\"$node\"})",
"format": "time_series",
"hide": true,
"intervalFactor": 1,
"refId": "G"
},
{
"expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$node\"} / (node_memory_MemTotal_bytes{instance=~\"$node\"})))* 100",
"format": "time_series",
"hide": false,
"interval": "30m",
"intervalFactor": 10,
"legendFormat": "使用率",
"refId": "H"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "内存信息",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"label": null,
"logBase": 1,
"max": null,
"min": "",
"show": true
},
{
"format": "percent",
"label": "内存使用率",
"logBase": 1,
"max": "",
"min": "",
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"192.168.10.227:9100_em1_in下载": "super-light-green",
"192.168.10.227:9100_em1_out上传": "dark-blue"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 8,
"x": 16,
"y": 28
},
"height": "",
"hiddenSeries": false,
"id": 157,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": true,
"rightSide": false,
"show": true,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "/.*_out上传$/",
"transform": "negative-Y"
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "irate(node_network_receive_bytes_total{instance=~'$node',device=~\"$device\"}[5m])*8",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_in下载",
"refId": "A",
"step": 4
},
{
"expr": "irate(node_network_transmit_bytes_total{instance=~'$node',device=~\"$device\"}[5m])*8",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_out上传",
"refId": "B",
"step": 4
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "每秒网络带宽使用$device",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "bps",
"label": "上传(-)/下载(+)",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"15分钟": "#6ED0E0",
"1分钟": "#BF1B00",
"5分钟": "#CCA300"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 1,
"grid": {},
"gridPos": {
"h": 8,
"w": 8,
"x": 0,
"y": 36
},
"height": "",
"hiddenSeries": false,
"id": 13,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": true,
"rightSide": false,
"show": true,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"maxPerRow": 6,
"nullPointMode": "null as zero",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"repeat": null,
"seriesOverrides": [
{
"alias": "/.*总核数/",
"color": "#C4162A"
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "node_load1{instance=~\"$node\"}",
"format": "time_series",
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "1分钟负载",
"metric": "",
"refId": "A",
"step": 20,
"target": ""
},
{
"expr": "node_load5{instance=~\"$node\"}",
"format": "time_series",
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "5分钟负载",
"refId": "B",
"step": 20
},
{
"expr": "node_load15{instance=~\"$node\"}",
"format": "time_series",
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "15分钟负载",
"refId": "C",
"step": 20
},
{
"expr": " sum(count(node_cpu_seconds_total{instance=~\"$node\", mode='system'}) by (cpu,instance)) by(instance)",
"format": "time_series",
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "CPU总核数",
"refId": "D",
"step": 20
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "系统平均负载",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 2,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"vda_write": "#6ED0E0"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"description": "Read bytes 每个磁盘分区每秒读取的比特数\nWritten bytes 每个磁盘分区每秒写入的比特数",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 1,
"gridPos": {
"h": 8,
"w": 8,
"x": 8,
"y": 36
},
"height": "",
"hiddenSeries": false,
"id": 168,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": true,
"show": true,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "/.*_读取$/",
"transform": "negative-Y"
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "irate(node_disk_read_bytes_total{instance=~\"$node\"}[5m])",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_读取",
"refId": "A",
"step": 10
},
{
"expr": "irate(node_disk_written_bytes_total{instance=~\"$node\"}[5m])",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_写入",
"refId": "B",
"step": 10
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "每秒磁盘读写容量",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"decimals": null,
"format": "Bps",
"label": "读取(-)/写入(+)",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 1,
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 0,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 8,
"x": 16,
"y": 36
},
"hiddenSeries": false,
"id": 174,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": true,
"rightSide": false,
"show": true,
"sideWidth": null,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "/Inodes.*/",
"yaxis": 2
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}) *100/(node_filesystem_avail_bytes {instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}+(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}))",
"format": "time_series",
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{mountpoint}}",
"refId": "A"
},
{
"expr": "node_filesystem_files_free{instance=~'$node',fstype=~\"ext.?|xfs\"} / node_filesystem_files{instance=~'$node',fstype=~\"ext.?|xfs\"}",
"hide": true,
"interval": "",
"legendFormat": "Inodes:{{instance}}:{{mountpoint}}",
"refId": "B"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "磁盘使用率",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"decimals": null,
"format": "percent",
"label": "",
"logBase": 1,
"max": "",
"min": "",
"show": true
},
{
"decimals": 2,
"format": "percentunit",
"label": null,
"logBase": 1,
"max": "",
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"vda_write": "#6ED0E0"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"description": "Reads completed: 每个磁盘分区每秒读完成次数\n\nWrites completed: 每个磁盘分区每秒写完成次数\n\nIO now 每个磁盘分区每秒正在处理的输入/输出请求数",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 0,
"fillGradient": 0,
"gridPos": {
"h": 9,
"w": 8,
"x": 0,
"y": 44
},
"height": "",
"hiddenSeries": false,
"id": 161,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": true,
"show": true,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "/.*_读取$/",
"transform": "negative-Y"
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "irate(node_disk_reads_completed_total{instance=~\"$node\"}[5m])",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_读取",
"refId": "A",
"step": 10
},
{
"expr": "irate(node_disk_writes_completed_total{instance=~\"$node\"}[5m])",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_写入",
"refId": "B",
"step": 10
},
{
"expr": "node_disk_io_now{instance=~\"$node\"}",
"format": "time_series",
"hide": true,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}",
"refId": "C"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "磁盘读写速率(IOPS)",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"decimals": null,
"format": "iops",
"label": "读取(-)/写入(+)I/O ops/sec",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"Idle - Waiting for something to happen": "#052B51",
"guest": "#9AC48A",
"idle": "#052B51",
"iowait": "#EAB839",
"irq": "#BF1B00",
"nice": "#C15C17",
"sdb_每秒I/O操作%": "#d683ce",
"softirq": "#E24D42",
"steal": "#FCE2DE",
"system": "#508642",
"user": "#5195CE",
"磁盘花费在I/O操作占比": "#ba43a9"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": null,
"description": "每一秒钟的自然时间内,花费在I/O上的耗时。(wall-clock time)\n\nnode_disk_io_time_seconds_total:\n磁盘花费在输入/输出操作上的秒数。该值为累加值。(Milliseconds Spent Doing I/Os)\n\nirate(node_disk_io_time_seconds_total[1m]):\n计算每秒的速率:(last值-last前一个值)/时间戳差值,即:1秒钟内磁盘花费在I/O操作的时间占比。",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 9,
"w": 8,
"x": 8,
"y": 44
},
"hiddenSeries": false,
"id": 175,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"sort": null,
"sortDesc": null,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"maxPerRow": 6,
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "irate(node_disk_io_time_seconds_total{instance=~\"$node\"}[5m])",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_每秒I/O操作%",
"refId": "C"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "每1秒内I/O操作耗时占比",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"decimals": null,
"format": "percentunit",
"label": "",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"vda": "#6ED0E0"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"description": "Read time seconds 每个磁盘分区读操作花费的秒数\n\nWrite time seconds 每个磁盘分区写操作花费的秒数\n\nIO time seconds 每个磁盘分区输入/输出操作花费的秒数\n\nIO time weighted seconds每个磁盘分区输入/输出操作花费的加权秒数",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 1,
"gridPos": {
"h": 9,
"w": 8,
"x": 16,
"y": 44
},
"height": "",
"hiddenSeries": false,
"id": 160,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": true,
"show": true,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "null as zero",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "/,*_读取$/",
"transform": "negative-Y"
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "irate(node_disk_read_time_seconds_total{instance=~\"$node\"}[5m]) / irate(node_disk_reads_completed_total{instance=~\"$node\"}[5m])",
"format": "time_series",
"hide": false,
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_读取",
"refId": "B"
},
{
"expr": "irate(node_disk_write_time_seconds_total{instance=~\"$node\"}[5m]) / irate(node_disk_writes_completed_total{instance=~\"$node\"}[5m])",
"format": "time_series",
"hide": false,
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_写入",
"refId": "C"
},
{
"expr": "irate(node_disk_io_time_seconds_total{instance=~\"$node\"}[5m])",
"format": "time_series",
"hide": true,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}",
"refId": "A",
"step": 10
},
{
"expr": "irate(node_disk_io_time_weighted_seconds_total{instance=~\"$node\"}[5m])",
"format": "time_series",
"hide": true,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_加权",
"refId": "D"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "每次IO读写的耗时(参考:小于100ms)(beta)",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "s",
"label": "读取(-)/写入(+)",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"192.168.200.241:9100_TCP_alloc": "semi-dark-blue",
"TCP": "#6ED0E0",
"TCP_alloc": "blue"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"description": "Sockets_used - 已使用的所有协议套接字总量\n\nCurrEstab - 当前状态为 ESTABLISHED 或 CLOSE-WAIT 的 TCP 连接数\n\nTCP_alloc - 已分配(已建立、已申请到sk_buff)的TCP套接字数量\n\nTCP_tw - 等待关闭的TCP连接数\n\nUDP_inuse - 正在使用的 UDP 套接字数量\n\nRetransSegs - TCP 重传报文数\n\nOutSegs - TCP 发送的报文数\n\nInSegs - TCP 接收的报文数",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 0,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 16,
"x": 0,
"y": 53
},
"height": "",
"hiddenSeries": false,
"id": 158,
"interval": "",
"legend": {
"alignAsTable": true,
"avg": false,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "/.*Sockets_used/",
"color": "#E02F44",
"lines": false,
"pointradius": 1,
"points": true,
"yaxis": 2
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "node_netstat_Tcp_CurrEstab{instance=~'$node'}",
"format": "time_series",
"hide": false,
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "CurrEstab",
"refId": "A",
"step": 20
},
{
"expr": "node_sockstat_TCP_tw{instance=~'$node'}",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "TCP_tw",
"refId": "D"
},
{
"expr": "node_sockstat_sockets_used{instance=~'$node'}",
"hide": false,
"interval": "30m",
"intervalFactor": 1,
"legendFormat": "Sockets_used",
"refId": "B"
},
{
"expr": "node_sockstat_UDP_inuse{instance=~'$node'}",
"interval": "",
"legendFormat": "UDP_inuse",
"refId": "C"
},
{
"expr": "node_sockstat_TCP_alloc{instance=~'$node'}",
"interval": "",
"legendFormat": "TCP_alloc",
"refId": "E"
},
{
"expr": "irate(node_netstat_Tcp_PassiveOpens{instance=~'$node'}[5m])",
"hide": true,
"interval": "",
"legendFormat": "{{instance}}_Tcp_PassiveOpens",
"refId": "G"
},
{
"expr": "irate(node_netstat_Tcp_ActiveOpens{instance=~'$node'}[5m])",
"hide": true,
"interval": "",
"legendFormat": "{{instance}}_Tcp_ActiveOpens",
"refId": "F"
},
{
"expr": "irate(node_netstat_Tcp_InSegs{instance=~'$node'}[5m])",
"interval": "",
"legendFormat": "Tcp_InSegs",
"refId": "H"
},
{
"expr": "irate(node_netstat_Tcp_OutSegs{instance=~'$node'}[5m])",
"interval": "",
"legendFormat": "Tcp_OutSegs",
"refId": "I"
},
{
"expr": "irate(node_netstat_Tcp_RetransSegs{instance=~'$node'}[5m])",
"hide": false,
"interval": "",
"legendFormat": "Tcp_RetransSegs",
"refId": "J"
},
{
"expr": "irate(node_netstat_TcpExt_ListenDrops{instance=~'$node'}[5m])",
"hide": true,
"interval": "",
"legendFormat": "",
"refId": "K"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "网络Socket连接信息",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"transformations": [],
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": "已使用的所有协议套接字总量",
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"filefd_192.168.200.241:9100": "super-light-green",
"switches_192.168.200.241:9100": "semi-dark-red",
"使用的文件描述符_10.118.72.128:9100": "red",
"每秒上下文切换次数_10.118.71.245:9100": "yellow",
"每秒上下文切换次数_10.118.72.128:9100": "yellow"
},
"bars": false,
"cacheTimeout": null,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 0,
"fillGradient": 1,
"gridPos": {
"h": 8,
"w": 8,
"x": 16,
"y": 53
},
"hiddenSeries": false,
"hideTimeOverride": false,
"id": 16,
"legend": {
"alignAsTable": false,
"avg": false,
"current": true,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pluginVersion": "6.4.2",
"pointradius": 1,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "/每秒上下文切换次数.*/",
"color": "#FADE2A",
"lines": false,
"pointradius": 1,
"points": true,
"yaxis": 2
},
{
"alias": "/使用的文件描述符.*/",
"color": "#F2495C"
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "node_filefd_allocated{instance=~\"$node\"}",
"format": "time_series",
"instant": false,
"interval": "",
"intervalFactor": 5,
"legendFormat": "使用的文件描述符",
"refId": "B"
},
{
"expr": "irate(node_context_switches_total{instance=~\"$node\"}[5m])",
"interval": "",
"intervalFactor": 5,
"legendFormat": "每秒上下文切换次数",
"refId": "A"
},
{
"expr": " (node_filefd_allocated{instance=~\"$node\"}/node_filefd_maximum{instance=~\"$node\"}) *100",
"format": "time_series",
"hide": true,
"instant": false,
"interval": "",
"intervalFactor": 5,
"legendFormat": "使用的文件描述符占比_{{instance}}",
"refId": "C"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "打开的文件描述符(左 )/每秒上下文切换次数(右)",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"label": "使用的文件描述符",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": "context_switches",
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
}
],
"refresh": "",
"schemaVersion": 20,
"style": "dark",
"tags": [
"Prometheus",
"node_exporter"
],
"templating": {
"list": [
{
"allValue": null,
"current": {
"tags": [],
"text": "node-exporter",
"value": "node-exporter"
},
"datasource": "prometheus",
"definition": "label_values(node_uname_info, job)",
"hide": 0,
"includeAll": false,
"index": -1,
"label": "JOB",
"multi": false,
"name": "job",
"options": [],
"query": "label_values(node_uname_info, job)",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 5,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allValue": null,
"current": {
"text": "All",
"value": "$__all"
},
"datasource": "prometheus",
"definition": "label_values(node_uname_info{job=~\"$job\"}, nodename)",
"hide": 0,
"includeAll": true,
"index": -1,
"label": "主机名",
"multi": false,
"name": "hostname",
"options": [],
"query": "label_values(node_uname_info{job=~\"$job\"}, nodename)",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 5,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allFormat": "glob",
"allValue": null,
"current": {
"text": "ymt108",
"value": "ymt108"
},
"datasource": "prometheus",
"definition": "label_values(node_uname_info{job=~\"$job\",nodename=~\"$hostname\"},instance)",
"hide": 0,
"includeAll": false,
"index": -1,
"label": "Instance",
"multi": true,
"multiFormat": "regex values",
"name": "node",
"options": [],
"query": "label_values(node_uname_info{job=~\"$job\",nodename=~\"$hostname\"},instance)",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 5,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allFormat": "glob",
"allValue": null,
"current": {
"text": "All",
"value": "$__all"
},
"datasource": "prometheus",
"definition": "label_values(node_network_info{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*|cni.*'},device)",
"hide": 0,
"includeAll": true,
"index": -1,
"label": "网卡",
"multi": true,
"multiFormat": "regex values",
"name": "device",
"options": [],
"query": "label_values(node_network_info{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*|cni.*'},device)",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 1,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allValue": null,
"current": {
"text": "/",
"value": "/"
},
"datasource": "prometheus",
"definition": "query_result(topk(1,sort_desc (max(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.?|xfs\",mountpoint!~\".*pods.*\"}) by (mountpoint))))",
"hide": 2,
"includeAll": false,
"index": -1,
"label": "最大挂载目录",
"multi": false,
"name": "maxmount",
"options": [],
"query": "query_result(topk(1,sort_desc (max(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.?|xfs\",mountpoint!~\".*pods.*\"}) by (mountpoint))))",
"refresh": 2,
"regex": "/.*\\\"(.*)\\\".*/",
"skipUrlSync": false,
"sort": 5,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allValue": null,
"current": {
"text": "ymt108",
"value": "ymt108"
},
"datasource": "prometheus",
"definition": "label_values(node_uname_info{job=~\"$job\",instance=~\"$node\"}, nodename)",
"hide": 2,
"includeAll": false,
"index": -1,
"label": "展示使用的主机名",
"multi": false,
"name": "show_hostname",
"options": [],
"query": "label_values(node_uname_info{job=~\"$job\",instance=~\"$node\"}, nodename)",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 5,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
}
]
},
"time": {
"from": "now-12h",
"to": "now"
},
"timepicker": {
"hidden": false,
"now": true,
"refresh_intervals": [
"15s",
"30s",
"1m",
"5m",
"15m",
"30m"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"timezone": "browser",
"title": "育苗通Node资源监控",
"uid": "hb7fSE0Zz",
"version": 11
}

node-model.json

当然,默认还内置了很多k8s相关的资源监控模板。


八、汇总

当我们完成了所有配置, 那接下来还需要整理一下,编写升级脚本upgrade.sh,方便之后部署,以及修改更新。

#!/bin/sh

# upgrade alertmanager configuration
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring# upgrade scrape configs
kubectl create secret generic additional-scrape-configs --from-file=prometheus-additional.yaml -n monitoring# upgrade prometheus rules
kubectl apply -f prometheus-additional-rules.yaml
kubectl apply -f prometheus-rules.yaml # upgrade prometheus configuration
kubectl apply -f prometheus-prometheus.yaml

作者:Leozhanggg

出处:https://www.cnblogs.com/leozhanggg/p/13502983.html

本文版权归作者和博客园共有,欢迎转载,但未经作者同意必须保留此段声明,且在文章页面明显位置给出原文连接,否则保留追究法律责任的权利。

Kubernetes实战总结 - 自定义Prometheus的更多相关文章

  1. Kubernetes实战总结 - 阿里云ECS自建K8S集群

    一.概述 详情参考阿里云说明:https://help.aliyun.com/document_detail/98886.html?spm=a2c4g.11186623.6.1078.323b1c9b ...

  2. Kubernetes实战(二):k8s v1.11.1 prometheus traefik组件安装及集群测试

    1.traefik traefik:HTTP层路由,官网:http://traefik.cn/,文档:https://docs.traefik.io/user-guide/kubernetes/ 功能 ...

  3. 新书推荐《再也不踩坑的Kubernetes实战指南》

      <再也不踩坑的Kubernetes实战指南>终于出版啦.目前可以在京东.天猫购买,京东自营和当当网预计一个星期左右上架. 本书贴合生产环境经验,解决在初次使用或者是构建集群中的痛点,帮 ...

  4. 2020 最新 Kubernetes实战指南

    1.Kubernetes带来的变革   对于开发人员 由于公司业务多,开发环境.测试环境.预生产环境和生产环境都是隔离的,而且除了生产环境,为了节省成本,其他环境可能是没有日志收集的,在没有用k8s的 ...

  5. Flink Native Kubernetes实战

    欢迎访问我的GitHub https://github.com/zq2599/blog_demos 内容:所有原创文章分类汇总及配套源码,涉及Java.Docker.Kubernetes.DevOPS ...

  6. 【SpringBoot】单元测试进阶实战、自定义异常处理、t部署war项目到tomcat9和启动原理讲解

    ========================4.Springboot2.0单元测试进阶实战和自定义异常处理 ============================== 1.@SpringBoot ...

  7. kubernetes实战(二十六):kubeadm 安装 高可用 k8s v1.16.x dashboard 2.x

    1.基本配置 基本配置.内核升级.基本服务安装参考https://www.cnblogs.com/dukuan/p/10278637.html,或者参考<再也不踩坑的Kubernetes实战指南 ...

  8. kubernetes实战(二十七):CentOS 8 二进制 高可用 安装 k8s 1.16.x

    1. 基本说明 本文章将演示CentOS 8二进制方式安装高可用k8s 1.16.x,相对于其他版本,二进制安装方式并无太大区别.CentOS 8相对于CentOS 7操作更加方便,比如一些服务的关闭 ...

  9. kubernetes实战(二十八):Kubernetes一键式资源管理平台Ratel安装及使用

    1. Ratel是什么? Ratel是一个Kubernetes资源平台,基于管理Kubernetes的资源开发,可以管理Kubernetes的Deployment.DaemonSet.Stateful ...

随机推荐

  1. 解决nginx在Linux中已经正常启动,Windows端的浏览器却无法访问的问题

    一:查看Linux中nginx已经正常启动 二:查看80端口,未被占用 三:检查防火墙的问题 关闭防火墙:chkconfig iptables off //失败 暂时关闭防火墙:service ipt ...

  2. Java中hashCode方法的理解以及此小结的总结练习(代码)

    笔记: “散列码”就是用来把一堆对象散到各自的队列里去的一种标识码. 举个形象一点的例子,一年有 365 天,从 1 编号到 365,下面我定义一种编码方法,每个人按照他生日那天的编号作为他的标识码, ...

  3. 回文树(回文自动机)(PAM)

    第一个能看懂的论文:国家集训队2017论文集 这是我第一个自己理解的自动机(AC自动机不懂KMP硬背,SAM看不懂一堆引理定理硬背) 参考文献:2017国家集训队论文集 回文树及其应用 翁文涛 参考博 ...

  4. 小白在使用ISE编写verilog代码综合时犯得错误及我自己的解决办法

    一:错误原因,顶层信号声明类别错误 错误前 更改后 二:综合时警告 更改前: 错误原因:调用子模块时 输出端口只能用wire类型变量进行映射 这是verilog语法规定的 tx_done在uart_t ...

  5. Oracle创建自动增长列

    前言: Oracle中不像SQL Server在创建表的时候使用identity(1001,1)来创建自动增长列,而是需要结合序列(Sequences)和触发器(Triggers)来实现 创建测试表 ...

  6. 水题----根据O出现次数判断分数

    There is an objective test result such as \OOXXOXXOOO". An `O' means a correct answer of a prob ...

  7. 读懂操作系统之快表(TLB)原理(七)

    前言 前不久.我们详细分析了TLB基本原理,本节我们通过一个简单的示例再次叙述TLB的算法和原理,希望借此示例能加深我们对TLB(又称之为快表,深入理解计算机系统(第三版)又称之为翻译后备缓冲区)的理 ...

  8. Django学习路17_聚合函数(Avg平均值,Count数量,Max最大,Min最小,Sum求和)基本使用

    使用方法: 类名.objects.aggregate(聚合函数名('表的列名')) 聚合函数名: Avg 平均值 Count数量 Max 最大 Min 最小 Sum 求和 示例: Student.ob ...

  9. java开发-flyway

    数据库版本管理工具 什么是数据库版本管理? 做过开发的小伙伴们都知道,实现一个需求时,一般情况下都需要设计到数据库表结构的修改.那么我们怎么能保证项目多人开发时,多个数据库环境(测试,生产环境)能够保 ...

  10. [转]为什么阿里巴巴要禁用Executors创建线程池?

    作者:何甜甜在吗 链接:https://juejin.im/post/5dc41c165188257bad4d9e69 来源:掘金 看阿里巴巴开发手册并发编程这块有一条:线程池不允许使用Executo ...