The plan this time is to set up service monitoring with alerting. The overall pipeline is shown in the diagram below.

1. Run Prometheus (via Docker)

    docker run -itd \
      -p 9090:9090 \
      -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
      prom/prometheus

2. The initial prometheus.yml configuration file is as follows:

    global:
      scrape_interval: 15s # By default, scrape targets every 15 seconds (the global default).

      # Attach these labels to any time series or alerts when communicating with
      # external systems (federation, remote storage, Alertmanager).
      external_labels:
        monitor: 'codelab-monitor'

    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'

        # Override the global default and scrape targets from this job every 5 seconds.
        scrape_interval: 5s

        static_configs:
          - targets: ['localhost:9090']

3. By default Prometheus exposes its own metrics at http://192.168.246.2:9090/metrics. A partial excerpt:

    # HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
    # TYPE go_gc_duration_seconds summary
    go_gc_duration_seconds{quantile="0"} 2.6636e-05
    go_gc_duration_seconds{quantile="0.25"} 0.000123346
    go_gc_duration_seconds{quantile="0.5"} 0.000159706
    go_gc_duration_seconds{quantile="0.75"} 0.000190857
    go_gc_duration_seconds{quantile="1"} 0.001369042

4. Open the Prometheus web UI on port 9090, where you can execute PromQL (Prometheus Query Language) expressions and view the results.

For example: prometheus_target_interval_length_seconds{quantile="0.99"}

For PromQL syntax and more examples, see the official docs: https://prometheus.io/docs/prometheus/latest/querying/basics/
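The same query engine is also available over HTTP at /api/v1/query, which is handy for scripting checks. A minimal sketch in Python, assuming the `requests` package is installed and Prometheus is at the 192.168.246.2:9090 address used throughout this post:

    # query_prometheus.py -- run an ad-hoc PromQL query via the HTTP API
    import requests

    PROM = "http://192.168.246.2:9090"  # adjust to your Prometheus address
    QUERY = 'prometheus_target_interval_length_seconds{quantile="0.99"}'

    resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY}, timeout=5)
    resp.raise_for_status()

    # Each result carries the metric's labels plus the latest [timestamp, value] sample.
    for series in resp.json()["data"]["result"]:
        print(series["metric"], series["value"])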

5. The data above comes from Prometheus itself; next we produce our own data for it. There are many public exporters available, such as node_exporter, which exposes basic machine-level metrics.

You can also write custom Python code that collects metrics and acts as an exporter itself.
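As a sketch of that idea, the official Python client library can turn a script into an exporter in a few lines. Everything here is illustrative: it assumes `pip install prometheus_client`, and the metric name, port 8000, and the random value are made up for the example:

    # my_exporter.py -- a minimal custom exporter (assumes the prometheus_client package)
    import random
    import time

    from prometheus_client import Gauge, start_http_server

    # A hypothetical business metric; Prometheus will scrape it from /metrics.
    QUEUE_DEPTH = Gauge("demo_queue_depth", "Number of jobs waiting in the demo queue")

    if __name__ == "__main__":
        start_http_server(8000)  # exposes http://<host>:8000/metrics
        while True:
            QUEUE_DEPTH.set(random.randint(0, 100))  # replace with a real measurement
            time.sleep(5)

Point Prometheus at it by adding the host's address and port 8000 as another target in scrape_configs, exactly as done for node_exporter below.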

Install node_exporter following the official example below, but do not listen on 127.0.0.1: my Prometheus runs inside Docker and cannot reach the host's 127.0.0.1, so it would never scrape any data. Use the actual host address instead (192.168.246.2 here); a quick reachability check follows the commands.

PS: for other exporters, see https://prometheus.io/docs/instrumenting/exporters/

    tar -xzvf node_exporter-*.*.tar.gz
    cd node_exporter-*.*

    # Start 3 example targets in separate terminals:
    ./node_exporter --web.listen-address 192.168.246.2:8080
    ./node_exporter --web.listen-address 192.168.246.2:8081
    ./node_exporter --web.listen-address 192.168.246.2:8082
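Before pointing Prometheus at them, it is worth confirming the three endpoints answer on the host address rather than only on loopback, since that is what the Docker-hosted Prometheus will use. A quick sketch, assuming `requests` and the 192.168.246.2 address from above:

    # check_exporters.py -- verify the node_exporter endpoints respond before scraping them
    import requests

    HOST = "192.168.246.2"  # the actual host address, not 127.0.0.1

    for port in (8080, 8081, 8082):
        url = f"http://{HOST}:{port}/metrics"
        try:
            r = requests.get(url, timeout=3)
            print(url, r.status_code, len(r.text), "bytes")
        except requests.RequestException as exc:
            print(url, "unreachable:", exc)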

6. Modify prometheus.yml to add a job that scrapes the exporters. The updated file below adds one job containing the three exporter targets; the labels are arbitrary.

    global:
      scrape_interval: 15s # By default, scrape targets every 15 seconds.

      # Attach these labels to any time series or alerts when communicating with
      # external systems (federation, remote storage, Alertmanager).
      external_labels:
        monitor: 'codelab-monitor'

    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'

        # Override the global default and scrape targets from this job every 5 seconds.
        scrape_interval: 5s

        static_configs:
          - targets: ['localhost:9090']

      - job_name: 'node'

        # Override the global default and scrape targets from this job every 5 seconds.
        scrape_interval: 5s

        static_configs:
          - targets: ['192.168.246.2:8080', '192.168.246.2:8081']
            labels:
              group: 'production'

          - targets: ['192.168.246.2:8082']
            labels:
              group: 'canary'

7. Check the web UI (Status → Targets) to confirm the new targets are up.

8. Take a look at the metrics node_exporter exposes, for example:

    node_cpu_seconds_total{cpu="0",mode="idle"} 2963.27
    node_cpu_seconds_total{cpu="0",mode="iowait"} 0.38
    node_cpu_seconds_total{cpu="0",mode="irq"} 0
    node_cpu_seconds_total{cpu="0",mode="nice"} 0
    node_cpu_seconds_total{cpu="0",mode="softirq"} 0.35
    node_cpu_seconds_total{cpu="0",mode="steal"} 0
    node_cpu_seconds_total{cpu="0",mode="system"} 19.19
    node_cpu_seconds_total{cpu="0",mode="user"} 16.96
    node_cpu_seconds_total{cpu="1",mode="idle"} 2965.47
    node_cpu_seconds_total{cpu="1",mode="iowait"} 0.37
    node_cpu_seconds_total{cpu="1",mode="irq"} 0
    node_cpu_seconds_total{cpu="1",mode="nice"} 0.03
    node_cpu_seconds_total{cpu="1",mode="softirq"} 0.28
    node_cpu_seconds_total{cpu="1",mode="steal"} 0
    node_cpu_seconds_total{cpu="1",mode="system"} 18.42
    node_cpu_seconds_total{cpu="1",mode="user"} 17.95

9. To see the per-second rate of CPU time over the last 5 minutes, averaged over all CPUs of each instance, write:

    avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))

The result can be viewed as a graph in the UI's Graph tab.

10. Next, set up a recording rule by creating a file named prometheus.rules.yml:

    groups:
      - name: cpu-node
        rules:
          - record: job_instance_mode:node_cpu_seconds:avg_rate5m
            expr: avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))

11. There are now quite a few configuration files, so restart Docker a different way to tidy things up: put all local config files under /opt/prometheus/ and mount the whole directory. The old container can be removed.

The --web.enable-lifecycle flag enables hot reloading of the configuration via curl -X POST http://192.168.246.2:9090/-/reload.

    docker run -itd -p 9090:9090 -v /opt/prometheus/:/etc/prometheus/ prom/prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle

12. Check the rules in the web UI (Status → Rules).

13. The above is only a recording rule; it does not alert. As a made-up test, assume a CPU warning should fire whenever avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m])) > 0.5.

The rules we need are as follows:

    groups:
      - name: example
        rules:
          - alert: HighCpuLatency
            expr: avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m])) > 0.5
            for: 10s
            labels:
              severity: page
            annotations:
              summary: High request latency

Write this into prometheus.rules.yml, but note that the groups: key must not appear twice; if you paste the whole snippet into the existing file, the duplicated key is invalid YAML. You can verify the rules file with promtool:

    [root@test prometheus]# ./prometheus-2.26.0.linux-amd64/promtool check rules prometheus.rules.yml
    Checking prometheus.rules.yml
      FAILED:
    prometheus.rules.yml: yaml: unmarshal errors:
      line 6: mapping key "groups" already defined at line 1
    prometheus.rules.yml: yaml: unmarshal errors:
      line 6: mapping key "groups" already defined at line 1

    [root@test prometheus]# ./prometheus-2.26.0.linux-amd64/promtool check rules prometheus.rules.yml
    Checking prometheus.rules.yml
      SUCCESS: 2 rules found

    [root@test prometheus]#

14. Hot-reload the configuration:

    curl -X POST http://192.168.246.2:9090/-/reload

No need to restart the container this time.

Check the web UI: a new rule is listed, and the alert first shows the pending state; once the condition has held for the configured duration it fires.
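The same transition can be watched without the UI: Prometheus lists active alerts at /api/v1/alerts. A small polling sketch, under the same assumptions as the earlier query example (`requests`, Prometheus at 192.168.246.2:9090):

    # watch_alerts.py -- poll Prometheus for the state of active alerts
    import time

    import requests

    PROM = "http://192.168.246.2:9090"

    while True:
        alerts = requests.get(f"{PROM}/api/v1/alerts", timeout=5).json()["data"]["alerts"]
        for a in alerts:
            # state stays "pending" until the rule's `for:` duration has elapsed, then becomes "firing"
            print(a["labels"].get("alertname"), a["state"], a["labels"].get("instance"))
        time.sleep(10)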

15. Next, start Alertmanager.

Two pieces of configuration are needed before starting it. First, add the Alertmanager IP and port to prometheus.yml.

The alerting section appended at the very end is what connects Prometheus to Alertmanager; note that this version also carries the rule_files entry pointing at prometheus.rules.yml from the earlier steps:

    global:
      scrape_interval: 15s # By default, scrape targets every 15 seconds.

      # Attach these labels to any time series or alerts when communicating with
      # external systems (federation, remote storage, Alertmanager).
      external_labels:
        monitor: 'codelab-monitor'

    rule_files:
      - 'prometheus.rules.yml'

    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'

        # Override the global default and scrape targets from this job every 5 seconds.
        scrape_interval: 5s

        static_configs:
          - targets: ['localhost:9090']

      - job_name: 'node'

        # Override the global default and scrape targets from this job every 5 seconds.
        scrape_interval: 5s

        static_configs:
          - targets: ['192.168.246.2:8080', '192.168.246.2:8081']
            labels:
              group: 'production'

          - targets: ['192.168.246.2:8082']
            labels:
              group: 'canary'

    alerting:
      alertmanagers:
        - static_configs:
            - targets: ["192.168.246.2:9093"]

The second piece of configuration: we start by testing email alerting, with the alertmanager.yml shown below.

Note the parts that need modifying (search online for how to obtain a QQ mail authorization code): the SMTP settings, and the receiver that by default catches everything with no filtering.

    smtp_smarthost: 'smtp.qq.com:465'
    smtp_from: '6171391@qq.com'
    smtp_auth_username: '6171391@qq.com'
    smtp_auth_password: '<QQ authorization code>'
    smtp_require_tls: false

    - to: 'dfwl@163.com'

The full file:
    global:
      # The smarthost and SMTP sender used for mail notifications.
      smtp_smarthost: 'smtp.qq.com:465'
      smtp_from: '6171391@qq.com'
      smtp_auth_username: '6171391@qq.com'
      smtp_auth_password: '<QQ authorization code>'
      smtp_require_tls: false

    # The directory from which notification templates are read.
    templates:
      - '/etc/alertmanager/template/*.tmpl'

    # The root route on which each incoming alert enters.
    route:

      group_by: ['alertname', 'cluster', 'service']

      group_wait: 30s

      # When the first notification was sent, wait 'group_interval' to send a batch
      # of new alerts that started firing for that group.
      group_interval: 1m

      # If an alert has successfully been sent, wait 'repeat_interval' to
      # resend them.
      repeat_interval: 7h

      # A default receiver
      receiver: team-X-mails

      # The child route trees.
      routes:
        # This routes performs a regular expression match on alert labels to
        # catch alerts that are related to a list of services.
        - match_re:
            service: ^(foo1|foo2|baz)$
          receiver: team-X-mails
          # The service has a sub-route for critical alerts, any alerts
          # that do not match, i.e. severity != critical, fall-back to the
          # parent node and are sent to 'team-X-mails'
          routes:
            - match:
                severity: critical
              receiver: team-X-pager

        - match:
            service: files
          receiver: team-Y-mails

          routes:
            - match:
                severity: critical
              receiver: team-Y-pager

        # This route handles all alerts coming from a database service. If there's
        # no team to handle it, it defaults to the DB team.
        - match:
            service: database
          receiver: team-DB-pager
          # Also group alerts by affected database.
          group_by: [alertname, cluster, database]
          routes:
            - match:
                owner: team-X
              receiver: team-X-pager
              continue: true
            - match:
                owner: team-Y
              receiver: team-Y-pager

    # Inhibition rules allow to mute a set of alerts given that another alert is
    # firing.
    # We use this to mute any warning-level notifications if the same alert is
    # already critical.
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        # Apply inhibition if the alertname is the same.
        # CAUTION:
        #   If all label names listed in `equal` are missing
        #   from both the source and target alerts,
        #   the inhibition rule will apply!
        equal: ['alertname', 'cluster', 'service']

    receivers:
      - name: 'team-X-mails'
        email_configs:
          - to: 'dfwl@163.com'

      - name: 'team-X-pager'
        email_configs:
          - to: 'team-X+alerts-critical@example.org'
        pagerduty_configs:
          - service_key: <team-X-key>

      - name: 'team-Y-mails'
        email_configs:
          - to: 'team-Y+alerts@example.org'

      - name: 'team-Y-pager'
        pagerduty_configs:
          - service_key: <team-Y-key>

      - name: 'team-DB-pager'
        pagerduty_configs:
          - service_key: <team-DB-key>

16. Start Alertmanager manually for a test; in production it can be run via Docker, Kubernetes, etc.

    ./alertmanager --config.file=alertmanager.yml
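Independently of Prometheus, you can also push a hand-crafted alert straight into Alertmanager to confirm the SMTP settings and routing before the real rule fires. This is only a sketch: it assumes Alertmanager's v2 API at 192.168.246.2:9093, the `requests` package, and a made-up alert name:

    # send_test_alert.py -- post a synthetic alert to Alertmanager to exercise email routing
    from datetime import datetime, timedelta, timezone

    import requests

    AM = "http://192.168.246.2:9093"
    now = datetime.now(timezone.utc)

    test_alert = [{
        "labels": {"alertname": "ManualTestAlert", "severity": "page"},
        "annotations": {"summary": "Manual test alert to verify email delivery"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=5)).isoformat(),
    }]

    resp = requests.post(f"{AM}/api/v2/alerts", json=test_alert, timeout=5)
    print(resp.status_code)  # 200 means Alertmanager accepted the alert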

17. The Alertmanager web UI picks up the alert from Prometheus.

A short while later the email notification is sent.

The group_interval: 1m setting means that, once the first notification for a group has been sent, Alertmanager waits 1 minute before notifying about new alerts that join that group (repeat_interval controls how often a still-firing alert is re-sent).


