A Complete Prometheus Monitoring Setup for Small and Medium-Sized Enterprises

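Plenty of blog posts already explain in detail what Prometheus does and how it works, so I won't repeat that here. This post covers only deployment and configuration: it records how I built a Prometheus alerting platform for our company's product team, along with the pitfalls I hit on the way. Without further ado, let's start. One server hosts prometheus + grafana + alertmanager + pushgateway; every other monitored node runs node_exporter, which can also be deployed on the Prometheus server itself.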

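1.1 Deploy Prometheus and manage it with systemctl

Version installed: prometheus-2.6.1

Baidu Cloud download: https://pan.baidu.com/s/1w16lQZKw8PCHqlRuSK2i7A

Extraction code: lw1q

Then unpack the package into /usr/local/prometheus. Deploying with an Ansible script is recommended.

The systemd unit file used to manage the service is shown here; it lives at /usr/lib/systemd/system/prometheus.service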

[Unit]
Description=https://prometheus.io

[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml

[Install]
WantedBy=multi-user.target
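Once the unit is started, it helps to confirm that Prometheus is really serving requests and not just marked active by systemd. The following is a minimal sketch, not part of the original setup: it polls Prometheus' built-in `/-/healthy` endpoint. The helper names `health_url` and `is_healthy` are my own, and the default port 9090 is assumed.

```python
import urllib.request


def health_url(base_url: str) -> str:
    """Build the URL of Prometheus' built-in health endpoint."""
    return base_url.rstrip("/") + "/-/healthy"


def is_healthy(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if the server answers the health endpoint with HTTP 200.

    Connection errors, timeouts and HTTP error codes all surface as OSError
    subclasses in urllib, so any of them is treated as "not healthy".
    """
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```

For example, `is_healthy("http://localhost:9090")` should return True on the monitoring server once `systemctl start prometheus` has succeeded.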

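1.2 The cleaned-up Prometheus configuration file. To monitor new machines, add a job_name and the machines' targets; each target node needs the corresponding node_exporter installed.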

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 172.16.5.3:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/first_rules.yml"
  - "rules/second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090','172.16.5.3:9100']

  - job_name: 'pushgateway'
    scrape_interval: 5s
    static_configs:
    - targets: ['172.16.5.3:9091']
      labels:
        instance: pushgateway
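The pushgateway job only has something to scrape after a script has pushed metrics to it. As a rough illustration (not taken from the original deployment), the sketch below pushes a single gauge using the Pushgateway HTTP API, where a PUT to `/metrics/job/<job>/instance/<instance>` replaces the metrics of that group. The function names, the metric name and the job/instance values are all hypothetical, and the job and instance names are assumed to contain no slashes.

```python
import urllib.parse
import urllib.request


def pushgateway_url(gateway: str, job: str, instance: str) -> str:
    """Build the grouping-key URL for a push (gateway is "host:port")."""
    job_part = urllib.parse.quote(job, safe="")
    instance_part = urllib.parse.quote(instance, safe="")
    return f"http://{gateway}/metrics/job/{job_part}/instance/{instance_part}"


def render_metric(name: str, value: float, help_text: str) -> bytes:
    """Render one gauge in the Prometheus text exposition format."""
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} gauge",
        f"{name} {value}",
        "",  # the payload must end with a newline
    ]
    return "\n".join(lines).encode("utf-8")


def push_metric(gateway: str, job: str, instance: str,
                name: str, value: float, help_text: str) -> int:
    """PUT the metric to the Pushgateway and return the HTTP status code."""
    request = urllib.request.Request(
        pushgateway_url(gateway, job, instance),
        data=render_metric(name, value, help_text),
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

A call such as `push_metric("172.16.5.3:9091", "backup", "db01", "backup_duration_seconds", 12.5, "Duration of the last backup")` would then appear under the pushgateway job. Note that with the scrape config as written (no `honor_labels: true`), Prometheus keeps its own `job` and `instance` labels and exposes the pushed ones as `exported_job` and `exported_instance`.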

1.3 The basic monitoring rules for servers are shown here

# cat second_rules.yml
groups:
- name: Instance liveness alert rules
  rules:
  - alert: InstanceDown
    expr: up{job="prometheus"} == 0 or up{job="Linux-host"} == 0
    for: 1m
    labels:
      user: prometheus
      severity: emergency
      team: HTY
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
      description: "Instance {{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
      value: "{{ $value }}"

- name: Memory alert rules
  rules:
  - alert: MemoryUsageWarning
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 30
    for: 1m
    labels:
      team: C3
      user: prometheus
      severity: warning
    annotations:
      summary: "Server: {{ $labels.instance }} memory alert"
      description: "{{ $labels.instance }} memory utilization is above 30%! (current value: {{ $value }}%)"
      value: "{{ $value }}"

- name: Memory alert rules 2
  rules:
  - alert: MemoryUsageCritical
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 50
    for: 1m
    labels:
      team: C3
      user: prometheus
      severity: critical
    annotations:
      summary: "Server: {{ $labels.instance }} memory alert"
      description: "{{ $labels.instance }} memory utilization is above 50%! (current value: {{ $value }}%)"
      value: "{{ $value }}"

- name: CPU alert rules
  rules:
  - alert: CpuUsageWarning
    expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]))) * 100 > 70
    for: 1m
    labels:
      user: prometheus
      severity: warning
    annotations:
      summary: "Server: {{ $labels.instance }} CPU alert"
      description: "Server: CPU usage is above 70%! (current value: {{ $value }}%)"
      value: "{{ $value }}"

- name: Disk alert rules
  rules:
  - alert: DiskUsageWarning
    expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 80
    for: 1m
    labels:
      user: prometheus
      severity: warning
    annotations:
      summary: "Server: {{ $labels.instance }} disk alert"
      description: "Server: {{ $labels.instance }}, disk usage is above 80%! (mount point: {{ $labels.mountpoint }}, current value: {{ $value }}%)"
      value: "{{ $value }}"
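To make the memory expression less abstract, here is a small worked example of the same arithmetic in Python. It is only an illustration under my own assumptions: the function names are hypothetical, and the inputs stand in for the node_exporter gauges MemTotal, MemFree, Buffers and Cached (all in bytes). The 30% and 50% thresholds are the ones used in the two memory rules.

```python
from typing import Optional


def memory_used_percent(total: float, free: float,
                        buffers: float, cached: float) -> float:
    """Same formula as the rule: (total - (free + buffers + cached)) / total * 100."""
    return (total - (free + buffers + cached)) / total * 100


def memory_severity(used_percent: float) -> Optional[str]:
    """Map a usage percentage to the highest severity whose threshold it exceeds."""
    if used_percent > 50:
        return "critical"
    if used_percent > 30:
        return "warning"
    return None


# A host with 8 GiB of RAM, 1 GiB free, 0.5 GiB buffers and 2.5 GiB page cache
# counts as 50% used: exactly on the critical threshold, so only the warning applies.
gib = 1024 ** 3
example = memory_used_percent(8 * gib, 1 * gib, 0.5 * gib, 2.5 * gib)
print(example, memory_severity(example))
```

In Prometheus itself both rules fire once usage goes above 50%, because a value above 50 is also above 30; the helper above reports only the higher level.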

2 Install and configure alertmanager

global:
  # WeCom (Enterprise WeChat) alert settings
  resolve_timeout: 5m
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_corp_id: 'ww41a2b13ef47aac58'
  wechat_api_secret: 'xxxxx'
  # QQ Mail alert settings
  smtp_from: xxx@qq.com
  smtp_auth_username: xx@qq.com
  smtp_auth_password: xxxx # must be obtained from QQ Mail
  smtp_require_tls: false
  smtp_smarthost: 'smtp.qq.com:465'

templates:
  - "/usr/local/alertmanager/template/*.tmpl"

route:
  receiver: 'default-receiver'
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 1m
  group_by: ['team']
  routes:
  - group_by: ['test']
    group_wait: 10s
    group_interval: 30s
    repeat_interval: 1m
    receiver: 'wechat'
    match:
      team: test1

receivers:
- name: 'wechat'
  wechat_configs:
  - send_resolved: true
    message: '{{ template "wechat.default.message" .}}'
    to_party: 'xxxx'
    agent_id: "xxx" # must be obtained from WeCom (Enterprise WeChat)
    api_secret: 'xxxxxxxx'

- name: 'default-receiver'
  email_configs:
  - to: 'xxxxxx@qq.com'
    send_resolved: true
    # html: '{{ template "wechat.default.message" .}}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['env','team','instance','type','group','job','alertname']
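The inhibit rule is the least obvious part of this file, so here is a simplified Python model of how I understand it to behave: a warning alert is muted while some critical alert is firing whose labels listed in `equal` all have the same values (a missing label counts the same as an empty one). This is only a sketch of the matching logic, not Alertmanager's actual code, and the alert names `MemWarn` and `MemCrit` are hypothetical.

```python
from typing import Dict, Iterable, List

Labels = Dict[str, str]

# The label names from the `equal` list in the inhibit rule.
INHIBIT_EQUAL: List[str] = [
    "env", "team", "instance", "type", "group", "job", "alertname",
]


def is_inhibited(target: Labels, sources: Iterable[Labels],
                 equal: Iterable[str] = tuple(INHIBIT_EQUAL)) -> bool:
    """Return True if `target` (a warning) is muted by any firing critical alert."""
    equal = list(equal)
    if target.get("severity") != "warning":
        return False  # target_match does not apply
    for source in sources:
        if source.get("severity") != "critical":
            continue  # source_match does not apply
        # Every label in `equal` must agree; missing labels compare as "".
        if all(source.get(name, "") == target.get(name, "") for name in equal):
            return True
    return False
```

One consequence worth knowing: because `alertname` is in the `equal` list, a critical alert only mutes a warning that carries the same alert name. With two hypothetical alerts `MemWarn` (warning) and `MemCrit` (critical) on the same instance, `is_inhibited` returns False with the full list and True once `alertname` is left out of `equal`.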

For how to obtain the WeCom (Enterprise WeChat) credentials, see: https://www.cnblogs.com/miaocbin/p/13706164.html

For how to obtain the QQ Mail credentials, see: https://blog.csdn.net/knight_zhou/article/details/105137581

3 The message template

{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
{{- if eq $index 0 }}
========= Monitoring alert =========
Alert status: {{ .Status }}
Alert severity: {{ .Labels.severity }}
Alert type: {{ $alert.Labels.alertname }}
Affected host: {{ $alert.Labels.instance }}
Alert summary: {{ $alert.Annotations.summary }}
Alert details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description }};
Trigger value: {{ .Annotations.value }}
Start time: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
========= = end =  =========
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
{{- if eq $index 0 }}
========= Recovered =========
Alert type: {{ .Labels.alertname }}
Alert status: {{ .Status }}
Alert summary: {{ $alert.Annotations.summary }}
Alert details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description }};
Start time: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Recovery time: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
Instance: {{ $alert.Labels.instance }}
{{- end }}
========= = end =  =========
{{- end }}
{{- end }}
{{- end }}
{{- end }}
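The `.Add 28800e9` in the template looks like a magic number, so a quick check in Python may help. Alertmanager reports timestamps in UTC, Go durations are counted in nanoseconds, and 28800e9 ns is 8 hours, which shifts the time to UTC+8 (Beijing time). The sketch below reproduces that shift and the Go layout "2006-01-02 15:04:05"; the function name and the sample timestamp are my own.

```python
from datetime import datetime, timedelta, timezone

OFFSET_NS = 28800e9  # the constant passed to .Add in the template, in nanoseconds


def to_utc8_string(starts_at_utc: datetime) -> str:
    """Shift a UTC timestamp by 8 hours and format it like the template does."""
    shifted = starts_at_utc + timedelta(seconds=OFFSET_NS / 1e9)
    # Go's reference layout "2006-01-02 15:04:05" corresponds to this strftime pattern.
    return shifted.strftime("%Y-%m-%d %H:%M:%S")


print(OFFSET_NS / 1e9 / 3600)  # hours added by the template
print(to_utc8_string(datetime(2024, 1, 1, 20, 30, 0, tzinfo=timezone.utc)))
```

If the people reading the alerts are in another time zone, the constant is the only thing to change: hours times 3600e9.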

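4. Install and deploy Grafana. Installing the latest version of Prometheus is recommended, and then using plugins. I'm also attaching a fairly clean Grafana dashboard.

Import the template directly; for the import steps, see this blog post: https://www.cnblogs.com/wukc/p/14231042.html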
