https://www.kancloud.cn/huyipow/prometheus/527092

https://songjiayang.gitbooks.io/prometheus/content/demo/target.html

 

Create the monitoring namespace

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
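
Assuming the manifest above is saved as namespace.yaml (the file name is just an example), it can be applied and checked like this:

kubectl apply -f namespace.yaml        # create the namespace
kubectl get namespace monitoring       # confirm it exists and is Active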

Prometheus RBAC permission management

Create the prometheus-k8s ServiceAccount

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus-k8s
  namespace: monitoring
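
A quick sketch of applying and verifying the account (file name is an assumption):

kubectl apply -f prometheus-serviceaccount.yaml
kubectl -n monitoring get serviceaccount prometheus-k8s   # should list the new account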

Create Roles in the monitoring, kube-system and default namespaces granting the prometheus-k8s account read access to services, endpoints and pods (plus nodes and configmaps in monitoring), and a ClusterRole that allows GET on the non-resource URL /metrics.

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
  name: prometheus-k8s
  namespace: monitoring
rules:
- apiGroups: [""]
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
  name: prometheus-k8s
  namespace: kube-system
rules:
- apiGroups: [""]
  resources:
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
  name: prometheus-k8s
  namespace: default
rules:
- apiGroups: [""]
  resources:
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
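
A sketch of applying the Role/ClusterRole manifest and listing what was created (the file name is an assumption):

kubectl apply -f prometheus-rbac-roles.yaml
# one Role per namespace, plus a ClusterRole for /metrics
for ns in monitoring kube-system default; do kubectl -n "$ns" get role prometheus-k8s; done
kubectl get clusterrole prometheus-k8s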

Bind the Roles and ClusterRole to the ServiceAccount

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: prometheus-k8s
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: prometheus-k8s
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: prometheus-k8s
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus-k8s
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: monitoring
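
After the bindings are applied, kubectl auth can-i can impersonate the ServiceAccount to confirm the grants actually took effect; a minimal check (file name is an assumption):

kubectl apply -f prometheus-rbac-bindings.yaml
# impersonate the service account and test a few of the granted verbs
kubectl auth can-i list pods -n kube-system \
  --as=system:serviceaccount:monitoring:prometheus-k8s   # expect "yes" via the kube-system Role
kubectl auth can-i get /metrics \
  --as=system:serviceaccount:monitoring:prometheus-k8s   # non-resource URL via the ClusterRole, expect "yes"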

Deploy Prometheus as a StatefulSet. The Prometheus Operator turns the Prometheus custom resource below into a StatefulSet; a Ceph Secret and StorageClass are created first for the data volumes.

  1. kind: Secret
  2. apiVersion: v1
  3. data:
  4. key: QVFCZU54dFlkMVNvRUJBQUlMTUVXMldSS29mdWhlamNKaC8yRXc9PQ==
  5. metadata:
  6. name: ceph-secret
  7. namespace: monitoring
  8. type: kubernetes.io/rbd
  9. ---
  10. kind: StorageClass
  11. apiVersion: storage.k8s.io/v1
  12. metadata:
  13. name: prometheus-ceph-fast
  14. namespace: monitoring
  15. provisioner: ceph.com/rbd
  16. parameters:
  17. monitors: 10.18.19.91:6789
  18. adminId: admin
  19. adminSecretName: ceph-secret
  20. adminSecretNamespace: monitoring
  21. userSecretName: ceph-secret
  22. pool: prometheus-dev
  23. userId: admin
  24. ---
  25. apiVersion: monitoring.coreos.com/v1
  26. kind: Prometheus
  27. metadata:
  28. name: k8s
  29. namespace: monitoring
  30. labels:
  31. prometheus: k8s
  32. spec:
  33. replicas: 2
  34. version: v2.0.0
  35. serviceAccountName: prometheus-k8s
  36. serviceMonitorSelector:
  37. matchExpressions:
  38. - {key: k8s-app, operator: Exists}
  39. ruleSelector:
  40. matchLabels:
  41. role: prometheus-rulefiles
  42. prometheus: k8s
  43. resources:
  44. requests:
  45. memory: 4G
  46. storage:
  47. volumeClaimTemplate:
  48. metadata:
  49. name: prometheus-data
  50. annotations:
  51. volume.beta.kubernetes.io/storage-class: prometheus-ceph-fast
  52. spec:
  53. accessModes: [ "ReadWriteOnce" ]
  54. resources:
  55. requests:
  56. storage: 50Gi
  57. alerting:
  58. alertmanagers:
  59. - namespace: monitoring
  60. name: alertmanager-main
  61. port: web
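
The Secret's key field is the base64-encoded Ceph client key and is environment-specific. A sketch of producing that value (assuming access to the Ceph admin keyring) and of checking that the Operator rendered the resource (file name is an assumption; the Operator names the StatefulSet prometheus-<name>):

# value to paste into the Secret's data.key
ceph auth get-key client.admin | base64

kubectl apply -f prometheus-k8s.yaml
kubectl -n monitoring get statefulset prometheus-k8s   # created by the Operator
kubectl -n monitoring get pods -l prometheus=k8s
kubectl -n monitoring get pvc                          # one claim per replica, bound via prometheus-ceph-fast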

Ceph RBD is used as the persistent storage for Prometheus. (There is a pitfall here.)

In this environment the api-server is deployed as a container. Because the default api-server image does not ship the Ceph RBD client, the pod fails to start at the volume-mount step.
An out-of-tree rbd-provisioner is therefore deployed to provide the Ceph driver on the api-server's behalf.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: rbd-provisioner
  namespace: monitoring
spec:
  replicas: 1
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: rbd-provisioner
    spec:
      containers:
      - name: rbd-provisioner
        image: "quay.io/external_storage/rbd-provisioner:latest"
        env:
        - name: PROVISIONER_NAME
          value: ceph.com/rbd
        args: ["-master=http://10.18.19.143:8080", "-id=rbd-provisioner"]
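
A quick way to confirm the provisioner is running and watching for claims (label and name taken from the Deployment above):

kubectl -n monitoring get pods -l app=rbd-provisioner
kubectl -n monitoring logs deployment/rbd-provisioner | tail   # should show it registered as ceph.com/rbd

Note that this sketch relies on the insecure -master address; on an RBAC-enforced cluster the provisioner would normally also need its own ServiceAccount and ClusterRole, which are omitted here.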

Create the Prometheus Service (NodePort)

apiVersion: v1
kind: Service
metadata:
  labels:
    prometheus: k8s
  name: prometheus-k8s
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30900
    port: 9090
    protocol: TCP
    targetPort: web
  selector:
    prometheus: k8s
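
The Prometheus resource above only selects ServiceMonitors that carry a k8s-app label, so nothing is scraped until at least one such ServiceMonitor exists. A minimal, hypothetical ServiceMonitor that lets Prometheus scrape itself through this Service (the name and label values are assumptions), followed by a NodePort check (replace <node-ip> with any node address):

cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-self
  namespace: monitoring
  labels:
    k8s-app: prometheus          # satisfies serviceMonitorSelector (key k8s-app Exists)
spec:
  selector:
    matchLabels:
      prometheus: k8s            # matches the Service defined above
  namespaceSelector:
    matchNames:
    - monitoring
  endpoints:
  - port: web                    # named port on the Service
    interval: 30s
EOF

# the UI and API should now be reachable on any node at the NodePort
curl -s http://<node-ip>:30900/api/v1/targets | head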

Create the Prometheus alerting rule files with a ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-k8s-rules
  namespace: monitoring
  labels:
    role: prometheus-rulefiles
    prometheus: k8s
data:
  alertmanager.rules.yaml: |+
    groups:
    - name: alertmanager.rules
      rules:
      - alert: AlertmanagerConfigInconsistent
        expr: count_values("config_hash", alertmanager_config_hash) BY (service) / ON(service)
          GROUP_LEFT() label_replace(prometheus_operator_alertmanager_spec_replicas, "service",
          "alertmanager-$1", "alertmanager", "(.*)") != 1
        for: 5m
        labels:
          severity: critical
        annotations:
          description: The configuration of the instances of the Alertmanager cluster
            `{{$labels.service}}` are out of sync.
      - alert: AlertmanagerDownOrMissing
        expr: label_replace(prometheus_operator_alertmanager_spec_replicas, "job", "alertmanager-$1",
          "alertmanager", "(.*)") / ON(job) GROUP_RIGHT() sum(up) BY (job) != 1
        for: 5m
        labels:
          severity: warning
        annotations:
          description: An unexpected number of Alertmanagers are scraped or Alertmanagers
            disappeared from discovery.
      - alert: AlertmanagerFailedReload
        expr: alertmanager_config_last_reload_successful == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Reloading Alertmanager's configuration has failed for {{ $labels.namespace
            }}/{{ $labels.pod}}.
  etcd3.rules.yaml: |+
    groups:
    - name: ./etcd3.rules
      rules:
      - alert: InsufficientMembers
        expr: count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1)
        for: 3m
        labels:
          severity: critical
        annotations:
          description: If one more etcd member goes down the cluster will be unavailable
          summary: etcd cluster insufficient members
      - alert: NoLeader
        expr: etcd_server_has_leader{job="etcd"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          description: etcd member {{ $labels.instance }} has no leader
          summary: etcd member has no leader
      - alert: HighNumberOfLeaderChanges
        expr: increase(etcd_server_leader_changes_seen_total{job="etcd"}[1h]) > 3
        labels:
          severity: warning
        annotations:
          description: etcd instance {{ $labels.instance }} has seen {{ $value }} leader
            changes within the last hour
          summary: a high number of leader changes within the etcd cluster are happening
      - alert: HighNumberOfFailedGRPCRequests
        expr: sum(rate(etcd_grpc_requests_failed_total{job="etcd"}[5m])) BY (grpc_method)
          / sum(rate(etcd_grpc_total{job="etcd"}[5m])) BY (grpc_method) > 0.01
        for: 10m
        labels:
          severity: warning
        annotations:
          description: '{{ $value }}% of requests for {{ $labels.grpc_method }} failed
            on etcd instance {{ $labels.instance }}'
          summary: a high number of gRPC requests are failing
      - alert: HighNumberOfFailedGRPCRequests
        expr: sum(rate(etcd_grpc_requests_failed_total{job="etcd"}[5m])) BY (grpc_method)
          / sum(rate(etcd_grpc_total{job="etcd"}[5m])) BY (grpc_method) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          description: '{{ $value }}% of requests for {{ $labels.grpc_method }} failed
            on etcd instance {{ $labels.instance }}'
          summary: a high number of gRPC requests are failing
      - alert: GRPCRequestsSlow
        expr: histogram_quantile(0.99, rate(etcd_grpc_unary_requests_duration_seconds_bucket[5m]))
          > 0.15
        for: 10m
        labels:
          severity: critical
        annotations:
          description: on etcd instance {{ $labels.instance }} gRPC requests to {{ $labels.grpc_method
            }} are slow
          summary: slow gRPC requests
      - alert: HighNumberOfFailedHTTPRequests
        expr: sum(rate(etcd_http_failed_total{job="etcd"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job="etcd"}[5m]))
          BY (method) > 0.01
        for: 10m
        labels:
          severity: warning
        annotations:
          description: '{{ $value }}% of requests for {{ $labels.method }} failed on etcd
            instance {{ $labels.instance }}'
          summary: a high number of HTTP requests are failing
      - alert: HighNumberOfFailedHTTPRequests
        expr: sum(rate(etcd_http_failed_total{job="etcd"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job="etcd"}[5m]))
          BY (method) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          description: '{{ $value }}% of requests for {{ $labels.method }} failed on etcd
            instance {{ $labels.instance }}'
          summary: a high number of HTTP requests are failing
      - alert: HTTPRequestsSlow
        expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m]))
          > 0.15
        for: 10m
        labels:
          severity: warning
        annotations:
          description: on etcd instance {{ $labels.instance }} HTTP requests to {{ $labels.method
            }} are slow
          summary: slow HTTP requests
      - alert: EtcdMemberCommunicationSlow
        expr: histogram_quantile(0.99, rate(etcd_network_member_round_trip_time_seconds_bucket[5m]))
          > 0.15
        for: 10m
        labels:
          severity: warning
        annotations:
          description: etcd instance {{ $labels.instance }} member communication with
            {{ $labels.To }} is slow
          summary: etcd member communication is slow
      - alert: HighNumberOfFailedProposals
        expr: increase(etcd_server_proposals_failed_total{job="etcd"}[1h]) > 5
        labels:
          severity: warning
        annotations:
          description: etcd instance {{ $labels.instance }} has seen {{ $value }} proposal
            failures within the last hour
          summary: a high number of proposals within the etcd cluster are failing
      - alert: HighFsyncDurations
        expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
          > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          description: etcd instance {{ $labels.instance }} fync durations are high
          summary: high fsync durations
      - alert: HighCommitDurations
        expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
          > 0.25
        for: 10m
        labels:
          severity: warning
        annotations:
          description: etcd instance {{ $labels.instance }} commit durations are high
          summary: high commit durations
  general.rules.yaml: |+
    groups:
    - name: general.rules
      rules:
      - alert: TargetDown
        expr: 100 * (count(up == 0) BY (job) / count(up) BY (job)) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          description: '{{ $value }}% of {{ $labels.job }} targets are down.'
          summary: Targets are down
      - alert: DeadMansSwitch
        expr: vector(1)
        labels:
          severity: none
        annotations:
          description: This is a DeadMansSwitch meant to ensure that the entire Alerting
            pipeline is functional.
          summary: Alerting DeadMansSwitch
      - record: fd_utilization
        expr: process_open_fds / process_max_fds
      - alert: FdExhaustionClose
        expr: predict_linear(fd_utilization[1h], 3600 * 4) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          description: '{{ $labels.job }}: {{ $labels.namespace }}/{{ $labels.pod }} instance
            will exhaust in file/socket descriptors within the next 4 hours'
          summary: file descriptors soon exhausted
      - alert: FdExhaustionClose
        expr: predict_linear(fd_utilization[10m], 3600) > 1
        for: 10m
        labels:
          severity: critical
        annotations:
          description: '{{ $labels.job }}: {{ $labels.namespace }}/{{ $labels.pod }} instance
            will exhaust in file/socket descriptors within the next hour'
          summary: file descriptors soon exhausted
  kube-controller-manager.rules.yaml: |+
    groups:
    - name: kube-controller-manager.rules
      rules:
      - alert: K8SControllerManagerDown
        expr: absent(up{job="kube-controller-manager"} == 1)
        for: 5m
        labels:
          severity: critical
        annotations:
          description: There is no running K8S controller manager. Deployments and replication
            controllers are not making progress.
          runbook: https://coreos.com/tectonic/docs/latest/troubleshooting/controller-recovery.html#recovering-a-controller-manager
          summary: Controller manager is down
  kube-scheduler.rules.yaml: |+
    groups:
    - name: kube-scheduler.rules
      rules:
      - record: cluster:scheduler_e2e_scheduling_latency_seconds:quantile
        expr: histogram_quantile(0.99, sum(scheduler_e2e_scheduling_latency_microseconds_bucket)
          BY (le, cluster)) / 1e+06
        labels:
          quantile: "0.99"
      - record: cluster:scheduler_e2e_scheduling_latency_seconds:quantile
        expr: histogram_quantile(0.9, sum(scheduler_e2e_scheduling_latency_microseconds_bucket)
          BY (le, cluster)) / 1e+06
        labels:
          quantile: "0.9"
      - record: cluster:scheduler_e2e_scheduling_latency_seconds:quantile
        expr: histogram_quantile(0.5, sum(scheduler_e2e_scheduling_latency_microseconds_bucket)
          BY (le, cluster)) / 1e+06
        labels:
          quantile: "0.5"
      - record: cluster:scheduler_scheduling_algorithm_latency_seconds:quantile
        expr: histogram_quantile(0.99, sum(scheduler_scheduling_algorithm_latency_microseconds_bucket)
          BY (le, cluster)) / 1e+06
        labels:
          quantile: "0.99"
      - record: cluster:scheduler_scheduling_algorithm_latency_seconds:quantile
        expr: histogram_quantile(0.9, sum(scheduler_scheduling_algorithm_latency_microseconds_bucket)
          BY (le, cluster)) / 1e+06
        labels:
          quantile: "0.9"
      - record: cluster:scheduler_scheduling_algorithm_latency_seconds:quantile
        expr: histogram_quantile(0.5, sum(scheduler_scheduling_algorithm_latency_microseconds_bucket)
          BY (le, cluster)) / 1e+06
        labels:
          quantile: "0.5"
      - record: cluster:scheduler_binding_latency_seconds:quantile
        expr: histogram_quantile(0.99, sum(scheduler_binding_latency_microseconds_bucket)
          BY (le, cluster)) / 1e+06
        labels:
          quantile: "0.99"
      - record: cluster:scheduler_binding_latency_seconds:quantile
        expr: histogram_quantile(0.9, sum(scheduler_binding_latency_microseconds_bucket)
          BY (le, cluster)) / 1e+06
        labels:
          quantile: "0.9"
      - record: cluster:scheduler_binding_latency_seconds:quantile
        expr: histogram_quantile(0.5, sum(scheduler_binding_latency_microseconds_bucket)
          BY (le, cluster)) / 1e+06
        labels:
          quantile: "0.5"
      - alert: K8SSchedulerDown
        expr: absent(up{job="kube-scheduler"} == 1)
        for: 5m
        labels:
          severity: critical
        annotations:
          description: There is no running K8S scheduler. New pods are not being assigned
            to nodes.
          runbook: https://coreos.com/tectonic/docs/latest/troubleshooting/controller-recovery.html#recovering-a-scheduler
          summary: Scheduler is down
  kube-state-metrics.rules.yaml: |+
    groups:
    - name: kube-state-metrics.rules
      rules:
      - alert: DeploymentGenerationMismatch
        expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
        for: 15m
        labels:
          severity: warning
        annotations:
          description: Observed deployment generation does not match expected one for
            deployment {{$labels.namespaces}}{{$labels.deployment}}
      - alert: DeploymentReplicasNotUpdated
        expr: ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas)
          or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
          unless (kube_deployment_spec_paused == 1)
        for: 15m
        labels:
          severity: warning
        annotations:
          description: Replicas are not updated and available for deployment {{$labels.namespaces}}/{{$labels.deployment}}
      - alert: DaemonSetRolloutStuck
        expr: kube_daemonset_status_current_number_ready / kube_daemonset_status_desired_number_scheduled
          * 100 < 100
        for: 15m
        labels:
          severity: warning
        annotations:
          description: Only {{$value}}% of desired pods scheduled and ready for daemon
            set {{$labels.namespaces}}/{{$labels.daemonset}}
      - alert: K8SDaemonSetsNotScheduled
        expr: kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled
          > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          description: A number of daemonsets are not scheduled.
          summary: Daemonsets are not scheduled correctly
      - alert: DaemonSetsMissScheduled
        expr: kube_daemonset_status_number_misscheduled > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          description: A number of daemonsets are running where they are not supposed
            to run.
          summary: Daemonsets are not scheduled correctly
      - alert: PodFrequentlyRestarting
        expr: increase(kube_pod_container_status_restarts[1h]) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Pod {{$labels.namespaces}}/{{$labels.pod}} is was restarted {{$value}}
            times within the last hour
  kubelet.rules.yaml: |+
    groups:
    - name: kubelet.rules
      rules:
      - alert: K8SNodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 1h
        labels:
          severity: warning
        annotations:
          description: The Kubelet on {{ $labels.node }} has not checked in with the API,
            or has set itself to NotReady, for more than an hour
          summary: Node status is NotReady
      - alert: K8SManyNodesNotReady
        expr: count(kube_node_status_condition{condition="Ready",status="true"} == 0)
          > 1 and (count(kube_node_status_condition{condition="Ready",status="true"} ==
          0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2
        for: 1m
        labels:
          severity: critical
        annotations:
          description: '{{ $value }}% of Kubernetes nodes are not ready'
      - alert: K8SKubeletDown
        expr: count(up{job="kubelet"} == 0) / count(up{job="kubelet"}) * 100 > 3
        for: 1h
        labels:
          severity: warning
        annotations:
          description: Prometheus failed to scrape {{ $value }}% of kubelets.
      - alert: K8SKubeletDown
        expr: (absent(up{job="kubelet"} == 1) or count(up{job="kubelet"} == 0) / count(up{job="kubelet"}))
          * 100 > 1
        for: 1h
        labels:
          severity: critical
        annotations:
          description: Prometheus failed to scrape {{ $value }}% of kubelets, or all Kubelets
            have disappeared from service discovery.
          summary: Many Kubelets cannot be scraped
      - alert: K8SKubeletTooManyPods
        expr: kubelet_running_pod_count > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Kubelet {{$labels.instance}} is running {{$value}} pods, close
            to the limit of 110
          summary: Kubelet is close to pod limit
  kubernetes.rules.yaml: |+
    groups:
    - name: kubernetes.rules
      rules:
      - record: pod_name:container_memory_usage_bytes:sum
        expr: sum(container_memory_usage_bytes{container_name!="POD",pod_name!=""}) BY
          (pod_name)
      - record: pod_name:container_spec_cpu_shares:sum
        expr: sum(container_spec_cpu_shares{container_name!="POD",pod_name!=""}) BY (pod_name)
      - record: pod_name:container_cpu_usage:sum
        expr: sum(rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m]))
          BY (pod_name)
      - record: pod_name:container_fs_usage_bytes:sum
        expr: sum(container_fs_usage_bytes{container_name!="POD",pod_name!=""}) BY (pod_name)
      - record: namespace:container_memory_usage_bytes:sum
        expr: sum(container_memory_usage_bytes{container_name!=""}) BY (namespace)
      - record: namespace:container_spec_cpu_shares:sum
        expr: sum(container_spec_cpu_shares{container_name!=""}) BY (namespace)
      - record: namespace:container_cpu_usage:sum
        expr: sum(rate(container_cpu_usage_seconds_total{container_name!="POD"}[5m]))
          BY (namespace)
      - record: cluster:memory_usage:ratio
        expr: sum(container_memory_usage_bytes{container_name!="POD",pod_name!=""}) BY
          (cluster) / sum(machine_memory_bytes) BY (cluster)
      - record: cluster:container_spec_cpu_shares:ratio
        expr: sum(container_spec_cpu_shares{container_name!="POD",pod_name!=""}) / 1000
          / sum(machine_cpu_cores)
      - record: cluster:container_cpu_usage:ratio
        expr: rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m])
          / sum(machine_cpu_cores)
      - record: apiserver_latency_seconds:quantile
        expr: histogram_quantile(0.99, rate(apiserver_request_latencies_bucket[5m])) /
          1e+06
        labels:
          quantile: "0.99"
      - record: apiserver_latency:quantile_seconds
        expr: histogram_quantile(0.9, rate(apiserver_request_latencies_bucket[5m])) /
          1e+06
        labels:
          quantile: "0.9"
      - record: apiserver_latency_seconds:quantile
        expr: histogram_quantile(0.5, rate(apiserver_request_latencies_bucket[5m])) /
          1e+06
        labels:
          quantile: "0.5"
      - alert: APIServerLatencyHigh
        expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"}
          > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          description: the API server has a 99th percentile latency of {{ $value }} seconds
            for {{$labels.verb}} {{$labels.resource}}
      - alert: APIServerLatencyHigh
        expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"}
          > 4
        for: 10m
        labels:
          severity: critical
        annotations:
          description: the API server has a 99th percentile latency of {{ $value }} seconds
            for {{$labels.verb}} {{$labels.resource}}
      - alert: APIServerErrorsHigh
        expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m])
          * 100 > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          description: API server returns errors for {{ $value }}% of requests
      - alert: APIServerErrorsHigh
        expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m])
          * 100 > 5
        for: 10m
        labels:
          severity: critical
        annotations:
          description: API server returns errors for {{ $value }}% of requests
      - alert: K8SApiserverDown
        expr: absent(up{job="apiserver"} == 1)
        for: 20m
        labels:
          severity: critical
        annotations:
          description: No API servers are reachable or all have disappeared from service
            discovery
  node.rules.yaml: |+
    groups:
    - name: node.rules
      rules:
      - record: instance:node_cpu:rate:sum
        expr: sum(rate(node_cpu{mode!="idle",mode!="iowait",mode!~"^(?:guest.*)$"}[3m]))
          BY (instance)
      - record: instance:node_filesystem_usage:sum
        expr: sum((node_filesystem_size{mountpoint="/"} - node_filesystem_free{mountpoint="/"}))
          BY (instance)
      - record: instance:node_network_receive_bytes:rate:sum
        expr: sum(rate(node_network_receive_bytes[3m])) BY (instance)
      - record: instance:node_network_transmit_bytes:rate:sum
        expr: sum(rate(node_network_transmit_bytes[3m])) BY (instance)
      - record: instance:node_cpu:ratio
        expr: sum(rate(node_cpu{mode!="idle"}[5m])) WITHOUT (cpu, mode) / ON(instance)
          GROUP_LEFT() count(sum(node_cpu) BY (instance, cpu)) BY (instance)
      - record: cluster:node_cpu:sum_rate5m
        expr: sum(rate(node_cpu{mode!="idle"}[5m]))
      - record: cluster:node_cpu:ratio
        expr: cluster:node_cpu:rate5m / count(sum(node_cpu) BY (instance, cpu))
      - alert: NodeExporterDown
        expr: absent(up{job="node-exporter"} == 1)
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Prometheus could not scrape a node-exporter for more than 10m,
            or node-exporters have disappeared from discovery
      - alert: NodeDiskRunningFull
        expr: predict_linear(node_filesystem_free[6h], 3600 * 24) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          description: device {{$labels.device}} on node {{$labels.instance}} is running
            full within the next 24 hours (mounted at {{$labels.mountpoint}})
      - alert: NodeDiskRunningFull
        expr: predict_linear(node_filesystem_free[30m], 3600 * 2) < 0
        for: 10m
        labels:
          severity: critical
        annotations:
          description: device {{$labels.device}} on node {{$labels.instance}} is running
            full within the next 2 hours (mounted at {{$labels.mountpoint}})
      - alert: NodeCPUUsage
        expr: (100 - (avg by (instance) (irate(node_cpu{job="node-exporter",mode="idle"}[5m])) * 100)) > 5
        for: 10m
        labels:
          severity: critical
        annotations:
          # description: {{$labels.instance}} CPU usage is above 75% (current value is {{ $value }})
  prometheus.rules.yaml: |+
    groups:
    - name: prometheus.rules
      rules:
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Reloading Prometheus' configuration has failed for {{$labels.namespace}}/{{$labels.pod}}
      - alert: PrometheusNotificationQueueRunningFull
        expr: predict_linear(prometheus_notifications_queue_length[5m], 60 * 30) > prometheus_notifications_queue_capacity
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Prometheus' alert notification queue is running full for {{$labels.namespace}}/{{
            $labels.pod}}
      - alert: PrometheusErrorSendingAlerts
        expr: rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m])
          > 0.01
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Errors while sending alerts from Prometheus {{$labels.namespace}}/{{
            $labels.pod}} to Alertmanager {{$labels.Alertmanager}}
      - alert: PrometheusErrorSendingAlerts
        expr: rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m])
          > 0.03
        for: 10m
        labels:
          severity: critical
        annotations:
          description: Errors while sending alerts from Prometheus {{$labels.namespace}}/{{
            $labels.pod}} to Alertmanager {{$labels.Alertmanager}}
      - alert: PrometheusNotConnectedToAlertmanagers
        expr: prometheus_notifications_alertmanagers_discovered < 1
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Prometheus {{ $labels.namespace }}/{{ $labels.pod}} is not connected
            to any Alertmanagers
  noah_pod.rules.yaml: |+
    groups:
    - name: noah_pod.rules
      rules:
      - alert: Pod_all_cpu_usage
        expr: (sum by(name)(rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 10
        for: 5m
        labels:
          severity: critical
          service: pods
        annotations:
          description: Container {{ $labels.name }} CPU usage is above 75% (current value is {{ $value }})
          summary: Dev CPU load alert
      - alert: Pod_all_memory_usage
        expr: sort_desc(avg by(name)(irate(container_memory_usage_bytes{name!=""}[5m]))*100) > 1024*10^3*2
        for: 10m
        labels:
          severity: critical
        annotations:
          description: Container {{ $labels.name }} memory usage is above 2G (current value is {{ $value }})
          summary: Dev memory load alert
      - alert: Pod_all_network_receive_usage
        expr: sum by (name)(irate(container_network_receive_bytes_total{container_name="POD"}[1m])) > 1024*1024*50
        for: 10m
        labels:
          severity: critical
        annotations:
          description: Container {{ $labels.name }} network_receive usage is above 50M (current value is {{ $value }})
          summary: network_receive load alert
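
Once this ConfigMap is applied, the Operator mounts every ConfigMap matched by ruleSelector into the Prometheus pods. A sketch for applying the rules and spot-checking one rule file locally (assumes promtool from a Prometheus 2.x release is installed; the file name is an example):

kubectl apply -f prometheus-k8s-rules.yaml
# pull one rule file back out of the ConfigMap and validate it
kubectl -n monitoring get configmap prometheus-k8s-rules \
  -o "jsonpath={.data['general\.rules\.yaml']}" > /tmp/general.rules.yaml
promtool check rules /tmp/general.rules.yaml
# the loaded groups are also visible in the Prometheus UI under Status -> Rules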

