【Best Practice】Prometheus + Grafana + Alertmanager: Basic Practice
Source: bilibili
【0】Prerequisites
Learning objectives
Requirements
Lab environment (required reading)
Prometheus server: 192.168.175.131:9090
Grafana / Alertmanager server: 192.168.175.130:3000 and 192.168.175.130:9093
Monitored server: 192.168.175.129
Prometheus default port: 9090
node_exporter default port: 9100
Grafana default port: 3000
mysqld_exporter default port: 9104
Alertmanager default port: 9093
【1】Downloading and installing Prometheus
【1.1】Download
Official site: https://prometheus.io/download/#mysqld_exporter
Linux:
mkdir -p /soft
cd /soft
wget https://github.com/prometheus/prometheus/releases/download/v2.20.1/prometheus-2.20.1.linux-amd64.tar.gz
【1.2】Install
#【1】Install Go (optional: the release tarball already contains prebuilt binaries)
yum -y install go
#【2】Install the Prometheus server
cd /soft
tar -zxf prometheus-2.20.1.linux-amd64.tar.gz
ln -s prometheus-2.20.1.linux-amd64 prometheus
cd prometheus
#【4】Start
# Start Prometheus with --web.enable-lifecycle so that configuration changes do not require a restart.
# With that flag set, curl -X POST http://localhost:9090/-/reload reloads the configuration file online.
nohup ./prometheus --config.file=./prometheus.yml --web.enable-lifecycle &
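The online reload described above is just an HTTP POST to the `/-/reload` path. As a minimal sketch (the helper names `build_reload_request` and `reload_config` are made up for this example, and the base URL is assumed to be the local Prometheus from this setup), the same call could be scripted like this:

```python
import urllib.request


def build_reload_request(base_url: str) -> urllib.request.Request:
    """Build (without sending) the POST request for the lifecycle reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    # The reload endpoint is triggered with POST, so the method is set explicitly.
    return urllib.request.Request(url, data=b"", method="POST")


def reload_config(base_url: str = "http://localhost:9090") -> int:
    """Send the reload request; needs Prometheus started with --web.enable-lifecycle."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=5) as resp:
        return resp.status
```

Calling `reload_config()` only succeeds against a running server; `build_reload_request` can be inspected offline.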
Configuring and validating prometheus.yml
Everything below the "#### entries below were added ####" marker is new. To match the plan above, the scrape targets for mysqld_exporter and node_exporter are configured in advance.
vim prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']
  #### entries below were added ####
  - job_name: 'agent_linux'
    static_configs:
    - targets: ['192.168.175.129:9100']
      labels:
        name: linux_db1
  - job_name: 'agent_mysql'
    static_configs:
    - targets: ['192.168.175.129:9104']
      labels:
        name: mysql_db1
Other forms of scrape job also work:
  - job_name: 测试机 # "test machine"
    static_configs:
    - labels:
        name: etcd-1
      targets:
      - 192.168.148.39:2379
  - job_name: 'elasticsearch'
    scrape_interval: 60s
    scrape_timeout: 30s
    metrics_path: "/metrics"
    static_configs:
    - targets: ['192.168.75.21:9308']
      labels:
        service: elasticsearch
  - job_name: 'redis_exporter_targets'
    static_configs:
    - targets:
      - redis://136.127.102.112:7000
      - redis://136.127.102.112:7001
      - redis://136.127.102.112:7002
      - redis://136.127.102.113:7000
      - redis://136.127.102.113:7001
      - redis://136.127.102.113:7002
    params:
      check-keys: ["metrics:*"]
    metrics_path: /scrape
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 136.127.102.112:9122
  - job_name: 'rabbitmq'
    scrape_interval: 60s
    scrape_timeout: 60s
    static_configs:
    - targets: ['yourIP:15692','yourIP:15692']
Check the configuration file for syntax and reference errors: ./promtool check config prometheus.yml
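The three relabel steps in the redis_exporter job can be hard to follow at first. The sketch below imitates them for a single target using a plain dictionary; it is an illustration of the default "replace" action only, not Prometheus's actual relabeling code, and the function name is invented for this example.

```python
def relabel_redis_target(address: str, exporter_address: str) -> dict:
    """Mimic the three relabel steps of the redis_exporter job for one target."""
    labels = {"__address__": address}
    # Step 1: copy the configured target address into the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2: reuse that value as the instance label, so each Redis node stays distinguishable.
    labels["instance"] = labels["__param_target"]
    # Step 3: point the actual scrape at the exporter process instead of Redis itself.
    labels["__address__"] = exporter_address
    return labels
```

The effect is that Prometheus scrapes the exporter's `/scrape` path once per Redis node, passing the node as `target`, while the stored series keep the Redis address as `instance`.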
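Extension: wrapping Prometheus as a systemd service (optional)
vi /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network.target
[Service]
# With Type=notify the service keeps restarting
#User=prometheus
Type=simple
ExecStart=/soft/prometheus/prometheus --config.file=/soft/prometheus/prometheus.yml --web.enable-lifecycle
Restart=on-failure
[Install]
WantedBy=multi-user.target
After writing the unit file, reload systemd and start Prometheus as a service:
systemctl daemon-reload
pkill prometheus
systemctl start prometheus
systemctl status prometheus
systemctl enable prometheus
【1.3】Verify
ps -ef|grep prome
netstat -anp|grep 9090
Open the server's IP and port (192.168.175.131:9090) in a browser to reach the web UI.
【1.4】Collector information
http://192.168.175.131:9090/metrics
When browsing from another machine, use the server's IP rather than localhost.
Prometheus scrapes its own metrics by default, as the 'prometheus' job in the configuration shows. The data is visible at http://192.168.175.131:9090/metrics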
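The `/metrics` page is plain text, with one sample per line and `#` lines for HELP and TYPE metadata. A rough sketch of reading that format follows; it is deliberately simplified (labels are kept as a raw string, timestamps are ignored) and is not a full parser for the exposition format.

```python
import re

# metric name, optional {labels} block, then the sample value
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')


def parse_metrics(text: str) -> list:
    """Return (name, raw_labels, value) tuples from Prometheus text-format output."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = _SAMPLE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append((name, labels or "", float(value)))
    return samples
```

Feeding it the body returned by `curl http://192.168.175.131:9090/metrics` would give a list of samples that can be filtered by metric name.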
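【1.5】Viewing Prometheus monitoring data and graphs
The buttons on the Graph page switch between two display modes: a table of values and a graph.
【2】Monitoring a Linux host with node_exporter
Install on 192.168.175.129. The node_exporter default port is 9100.
【2.1】What is node_exporter
Suppose you have a server and want its runtime metrics: current CPU load, system load, memory consumption, disk usage, network I/O and so on.
You can run node_exporter on that server. It collects these metrics and exposes them over an HTTP endpoint that you can query. Let's try it.
【2.2】Downloading node_exporter
Official site: https://github.com/prometheus/node_exporter/releases
Linux:
mkdir -p /soft
cd /soft
wget https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz
【2.3】Install and start
sudo tar -zxf node_exporter-1.0.1.linux-amd64.tar.gz
ln -s node_exporter-1.0.1.linux-amd64 node_exporter
cd node_exporter-1.0.1.linux-amd64
nohup ./node_exporter &
On a successful start it logs its startup messages. Check the lines that follow for errors.
【2.4】Verify
(1)Verify with curl
curl 192.168.175.129:9100/metrics
If metrics come back, it is working.
(2)Verify the process and port
ps -ef|grep node_exporter
netstat -anp|grep 9100
(3)Verify in the Prometheus web UI
192.168.175.131:9090
Status => Targets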
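The Status => Targets page is backed by the HTTP API path `/api/v1/targets`, so the same check can be scripted. The sketch below only processes an already-decoded JSON response; the function name is invented for this example and the sample payload in the usage note is hand-written, not captured output.

```python
def unhealthy_targets(payload: dict) -> list:
    """List (job, scrapeUrl, lastError) for active targets whose health is not 'up'."""
    result = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") != "up":
            job = target.get("labels", {}).get("job", "")
            result.append((job, target.get("scrapeUrl", ""), target.get("lastError", "")))
    return result
```

For example, a payload in which the `agent_linux` target has `"health": "down"` would yield a single tuple for `http://192.168.175.129:9100/metrics`, while an all-healthy payload yields an empty list.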
【2.5】Wrapping node_exporter as a systemd service (optional)
vi /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
#User=prometheus
ExecStart=/soft/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
# Then apply it with the following commands
systemctl daemon-reload
pkill node_exporter
systemctl start node_exporter
systemctl status node_exporter
systemctl enable node_exporter
【2.6】Reloading the Prometheus configuration online
curl -X POST http://localhost:9090/-/reload
【3】Collecting MySQL metrics with mysqld_exporter
Install on 192.168.175.129. The mysqld_exporter default port is 9104.
【3.1】Download
Official site: https://prometheus.io/download/#mysqld_exporter
Linux:
mkdir -p /soft
cd /soft
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.12.1/mysqld_exporter-0.12.1.linux-amd64.tar.gz
【3.2】Extract, configure, start, verify
tar -zxf mysqld_exporter-0.12.1.linux-amd64.tar.gz
cd mysqld_exporter-0.12.1.linux-amd64/
vim mysqld_exporter.cnf
# The account in this file must be a MySQL account whose host is localhost; here I log in as root@localhost
# To use a different user, it needs these privileges: grant select,replication client,process on *.* to mysql_monitor@'localhost' identified by '123456';
[client]
user=root
password=123456
nohup ./mysqld_exporter --config.my-cnf=./mysqld_exporter.cnf &
Starting without this flag fails, because the default configuration file location is ~/.my.cnf (/root/.my.cnf for root) and a binary install does not create that file. So create the configuration file as shown above and pass its location at startup.
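If the exporter is deployed to several hosts, the small `[client]` file can be generated instead of typed by hand. This is only a sketch using Python's standard `configparser`; the function name is hypothetical and the credentials are the placeholder values from this tutorial.

```python
import configparser
import io


def render_client_cnf(user: str, password: str) -> str:
    """Render a [client] section suitable for mysqld_exporter's --config.my-cnf file."""
    # interpolation=None keeps characters such as '%' in passwords literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser["client"] = {"user": user, "password": password}
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()
```

Writing the returned string to `mysqld_exporter.cnf` gives the same three lines as the hand-edited file.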
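【3.3】Verify
(1)Verify by URL
http://192.168.175.129:9104/metrics
(2)Verify in Prometheus (Status => Targets)
【4】Grafana
Install on 192.168.175.130. The default port is 3000.
【4.1】Download and install
Official site: https://grafana.com/grafana/download
Installation guide: https://grafana.com/docs/grafana/latest/installation/rpm/
(1)Download and extract
cd /soft
wget https://dl.grafana.com/oss/release/grafana-7.1.3.linux-amd64.tar.gz
tar -zxf grafana-7.1.3.linux-amd64.tar.gz
cd grafana-7.1.3
(2)Start
nohup ./bin/grafana-server web &
(3)Verify
netstat -anp|grep -E "3000"
(4)Related information
- Default port: 3000
- Default log: /var/log/grafana/grafana.log
- Default persistence file: /var/lib/grafana/grafana.db
- Default web username/password: admin/admin
【4.2】Logging in to Grafana
Open 192.168.175.130:3000 in a browser.
The default username and password are both admin. On first login Grafana asks you to change the password for security reasons.
After that the home page appears.
【4.3】Adding Prometheus as a Grafana data source
Click Add data source, then choose Prometheus.
Enter the address and port of the Prometheus server.
Then scroll to the bottom of the page and click Save & Test.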
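The same data source can also be created through Grafana's HTTP API (`POST /api/datasources`) instead of the UI. The sketch below only builds the JSON body; the function name is made up, and the choice of `access: "proxy"` and `isDefault: true` is an assumption for this lab setup rather than something the steps above prescribe.

```python
import json


def prometheus_datasource_payload(name: str, url: str) -> dict:
    """Build the JSON body for creating a Prometheus data source in Grafana."""
    return {
        "name": name,          # the name dashboards refer to
        "type": "prometheus",
        "url": url,            # address of the Prometheus server
        "access": "proxy",     # Grafana's backend performs the queries (assumed choice)
        "isDefault": True,
    }


def payload_as_json(name: str, url: str) -> str:
    """Serialize the payload for use as an HTTP request body."""
    return json.dumps(prometheus_datasource_payload(name, url))
```

Posting this body with admin credentials to `http://192.168.175.130:3000/api/datasources` should be equivalent to the Save & Test steps, although that request is not shown here.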
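【4.4】Adding a dashboard manually
Select the data source, then enter the metric to display; the panel then shows data.
Pay most attention to the options panel on the right and the query fields at the bottom.
When the configuration is done, click Save in the top-right corner. The saved dashboard is then listed the next time you enter Grafana.
Click the dashboard to see the graph created earlier.
Series can also be filtered by instance or job.
More graphs can be added to the same dashboard.
【4.5】Importing an official template
Official site: https://grafana.com/grafana/dashboards
Import template 9777.
The dashboard then shows data.
The imported dashboard can also be customized.
In the dashboard settings, Variables are the filter drop-downs; here they filter by the instance values defined under our labels.
Individual panels can be edited directly as well.
Click Apply in the top-right corner and the change also appears on the main dashboard page.
【4.6】Recommended Linux template: 9276
【4.7】Why a panel shows No data instead of a graph
Several panels in the dashboard from 【4.6】 have no data.
Possible causes:
(1)There really is no data
(2)The server time does not match the client/browser time
(3)The PromQL expression is wrong
How to verify and fix:
(1)Time mismatch
Even when the clocks disagree, the difference is usually a few seconds or a few hours. Set the dashboard time range to 2 days, 7 days or more; if there is still nothing, time is not the cause.
(2)No data: check whether the PromQL expression is correct
Take the expression from the dashboard panel.
Run it directly in Prometheus => Graph, substituting the variable values and the network device name (mine is ens33). It returns data, so the problem is that the device name in the panel does not match the actual network interface name on the monitored machine.
Go back and edit the expression in the panel, changing the device name to the interface name of the monitored machine.
Then click Apply in the top-right corner.
The panel now shows data. Done.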
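To test a panel expression outside Grafana, the query can be built with the real device name and sent to the Prometheus query API. The sketch below is illustrative only: the actual expression in the imported panel is not shown in this tutorial, so `node_network_receive_bytes_total` with a 5-minute rate is an assumed stand-in, and both function names are hypothetical.

```python
from urllib.parse import urlencode


def network_receive_query(instance: str, device: str, window: str = "5m") -> str:
    """Build a PromQL expression for receive throughput on one network device."""
    return 'rate(node_network_receive_bytes_total{instance="%s",device="%s"}[%s])' % (
        instance, device, window)


def query_url(base_url: str, promql: str) -> str:
    """Build the instant-query URL for the Prometheus HTTP API."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})
```

If the URL built for `device="ens33"` returns samples while the panel's original device name returns none, the interface name is the cause.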
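【4.8】Dashboard title and dashboard variables
【5】MySQL dashboard
【5.1】Downloading the JSON template
GitHub download: https://github.com/percona/grafana-dashboards/tree/master/dashboards
Official dashboards: https://grafana.com/grafana/dashboards?dataSource=prometheus&search=mysql
The dashboards on the official site often do not match the exporter, though, so many panels show No data, as in 【4.7】.
Here we download MySQL_Overview.json
https://github.com/percona/grafana-dashboards/blob/master/dashboards/MySQL_Overview.json
Copy its contents directly.
【5.2】Applying the JSON template
Paste the JSON found in 【5.1】 into the import dialog.
【Final result】
One pitfall: the panels in this template expect a data source named Prometheus, and the name is case-sensitive. If your data source has a different name, either change the data source in each panel or rename the data source.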
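When the data source name differs, the dashboard JSON can be patched before import instead of editing panels one by one. This is a sketch under the assumption that the dashboard stores data sources as plain string names in `datasource` fields (as dashboards of this Grafana generation typically do); the function name is invented.

```python
def set_datasource(node, new_name: str):
    """Return a copy of a dashboard structure with every string 'datasource' field replaced."""
    if isinstance(node, dict):
        return {
            key: (new_name if key == "datasource" and isinstance(value, str)
                  else set_datasource(value, new_name))
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [set_datasource(item, new_name) for item in node]
    return node  # numbers, strings, None: unchanged
```

Load the file with `json.load`, pass the result through `set_datasource(dashboard, "your-name")`, and dump it again before pasting it into the import dialog.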
【6】Prometheus alerting configuration
【6.1】Checking the current state of each job
http://192.168.175.131:9090/targets
All targets are up.
【6.2】Configuring the rule file and prometheus.yml
(1)Go back to the Prometheus server (192.168.175.131)
(2)Enter the configuration directory /soft/prometheus
(3)Review and edit prometheus.yml
cd /soft/prometheus
vim prometheus.yml, then uncomment the entry under rule_files so that the configuration references a rule file named first_rules.yml
(4)Create and edit first_rules.yml
cd /soft/prometheus
vim first_rules.yml
groups:
- name: simulator-alert-rule # group name
  rules:
  - alert: HttpSimulatorDown # alert name, must be unique
    expr: sum(up{job="agent_linux"}) == 0 # alert expression; checks whether the agent_linux target is reachable
    for: 1m # the expression must hold continuously for 1 minute before the alert fires, which filters out momentary failures
    labels:
      severity: critical
    annotations:
      summary: Linux node status is {{ humanize $value}}% for 1m # alert description
The key expression is
sum(up{job="agent_linux"}) == 0, which checks whether the target under agent_linux, that is 192.168.175.129:9100, can be scraped.
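To make the semantics of this expression concrete, the sketch below evaluates it over a hand-written list of `up` samples. It is a toy model, not PromQL; the function name and sample layout are assumptions for illustration. One detail worth seeing in code: when no series match the selector, `sum()` returns an empty result, so the comparison yields nothing and the alert does not fire.

```python
def rule_fires(samples: list, job: str) -> bool:
    """Toy evaluation of sum(up{job="<job>"}) == 0 over a list of up samples."""
    values = [s["value"] for s in samples if s["labels"].get("job") == job]
    if not values:
        # No matching series: the sum is empty, so the rule produces no alert.
        return False
    return sum(values) == 0
```

With a single target per job, as in this setup, the rule fires exactly when that target's `up` value is 0. With several targets it fires only when all of them are down.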
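The promtool utility shipped with Prometheus can check the rule syntax:
./promtool check rules first_rules.yml
(5)Reload the configuration file
curl -X POST http://localhost:9090/-/reload
【6.3】Reviewing config, rules and alerts
(1)Rules
Open the URL directly or click Status => Rules. The rule is loaded and its state is OK, meaning the expression has not triggered.
(2)Config
Open the URL directly or click Status => Configuration. The page confirms that the online reload took effect.
【6.4】Verifying the alert
(1)Open 192.168.175.131:9090/alerts
The /alerts page of the Prometheus server shows the current alert status.
The three states mean:
Inactive: the alert has not triggered
Pending: the alert expression is already true but has not yet held for the duration given in for. In our case, the agent_linux target 192.168.175.129:9100 is already unreachable, but not yet for a full minute
Firing: the alert is active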
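The relationship between the three states and the `for` duration can be summarized as a small function. This is a simplified model written for this tutorial (the function name and the use of plain seconds are assumptions); it ignores evaluation intervals and only shows the ordering of the states.

```python
def alert_state(condition_true_since, now: float, for_seconds: float) -> str:
    """Return 'inactive', 'pending' or 'firing' for one alert.

    condition_true_since: timestamp (seconds) when the expression first became true,
    or None if the expression is currently false.
    """
    if condition_true_since is None:
        return "inactive"
    if now - condition_true_since >= for_seconds:
        return "firing"
    return "pending"
```

With `for: 1m`, an expression that became true 30 seconds ago is pending, and it becomes firing once 60 seconds have passed.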
(2)Stop the target under agent_linux
That is, stop the node_exporter process listening on 192.168.175.129:9100.
Check the Alerts page again: the alert is now Pending, and after one minute it changes to Firing.
【7】Alertmanager
Install on 192.168.175.130. The default port is 9093.
【7.1】Downloading and installing Alertmanager
(1)Download
Download page: https://prometheus.io/download/#alertmanager
Linux:
# Download
cd /soft
wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
# Extract
tar -zxf alertmanager-0.21.0.linux-amd64.tar.gz
ln -s alertmanager-0.21.0.linux-amd64 alertmanager
cd alertmanager
【7.2】Building the configuration file (email)
(Remember to replace the SMTP settings with your own.)
vi alertmanager.yml
global:
  # Time after which an alert is declared resolved if it is no longer reported
  resolve_timeout: 5m
  # Email delivery settings
  smtp_smarthost: 'smtp.qq.com:25'
  smtp_from: '815202984@qq.com'
  smtp_auth_username: '815202984@qq.com'
  smtp_auth_password: 'xxxxxx'
  smtp_require_tls: false # disable TLS
# Root route that every alert enters; it defines the dispatch policy
route:
  # Labels by which incoming alerts are regrouped. For example, many alerts carrying cluster=A and alertname=LatencyHigh are batched into a single group
  group_by: ['alertname', 'cluster']
  # After a new alert group is created, wait at least group_wait before sending the first notification, so that several alerts of the same group can be collected and sent together
  group_wait: 30s
  # After the first notification has been sent, wait group_interval before notifying about new alerts in the group
  group_interval: 5m
  # If a notification has already been sent successfully, wait repeat_interval before sending it again
  repeat_interval: 5m
  # Default receiver: an alert that matches no child route is sent here
  receiver: default
receivers:
- name: 'default' # custom name, referenced by receiver: default
  email_configs: # email notification settings
  - to: '815202984@qq.com,123456' # recipients
    send_resolved: true
# A single receiver entry can contain several notification targets
receivers:
- name: slack_and_email
  # Slack
  slack_configs:
  - api_url: '<THE_WEBHOOK_URL>'
    channel: '#general'
  - api_url: '<ANOTHER_WEBHOOK_URL>'
    channel: '#alerts'
  # Email
  email_configs:
  - to: 'k4nz@example.com'
route:
  receiver: slack_and_email
The configuration file can be checked for errors with:
./amtool check-config alertmanager.yml
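The effect of `group_by` is easier to see on sample data. The sketch below groups alert label sets by the configured labels, in the way the comments in the configuration describe; it is a simplified illustration (no timing, no routing), and the function name and sample alerts are made up for this example.

```python
from collections import defaultdict


def group_alerts(alerts: list, group_by: list) -> dict:
    """Group alert label dictionaries by the values of the group_by labels they carry."""
    groups = defaultdict(list)
    for labels in alerts:
        # Only labels that are present on the alert take part in the group key.
        key = tuple((name, labels[name]) for name in group_by if name in labels)
        groups[key].append(labels)
    return dict(groups)
```

With `group_by: ['alertname', 'cluster']`, two `LatencyHigh` alerts from cluster A land in one group and therefore in one notification, while the same alert from cluster B forms a separate group.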
Production reference configuration
Note that alert template paths should be absolute:
templates:
- '/data/prometheus/alertmanager/wechat.tmpl'
- '/data/prometheus/alertmanager/email.tmpl'
global:
  resolve_timeout: 5m
  smtp_smarthost: '1.1.1.1:25' # mail server address
  smtp_from: '123@qq.com' # sender address
  smtp_auth_username: '123' # mailbox login account; should match the sender above
  smtp_auth_password: '123456'
  smtp_require_tls: false
templates:
- '/data/prometheus/alertmanager/wechat.tmpl'
- '/data/prometheus/alertmanager/email.tmpl'
route:
  group_by: ['instance'] # merge alerts of a similar nature into a single notification
  group_wait: 10s # on receiving an alert, wait 10s for further alerts and send them together
  group_interval: 10s # when alerts in an already-notified group change, wait this long before the next notification; group_wait no longer applies
  repeat_interval: 10m # interval between repeated notifications; 10m or 30m is recommended
  receiver: 'wechat'
  routes:
  - receiver: 'happy'
    group_wait: 10s
    group_interval: 10s
    repeat_interval: 10m
    match_re:
      job: ^快乐子公司.*$ # alerts from jobs whose name starts with 快乐子公司
  - receiver: 'wechat'
    continue: true
  - receiver: 'default-receiver'
    continue: true
#  - receiver: 'test_dba'
#    group_wait: 10s
#    group_interval: 10s
#    repeat_interval: 10m
#    match:
#      job: 大连娱网_mssql
receivers:
- name: 'default-receiver'
  email_configs:
  - to: '123@qq.com,456@qq.com'
    send_resolved: true
    html: '{{ template "email.html" .}}'
    headers: { Subject: 'Prometheus alert' }
- name: 'wechat'
  wechat_configs: # WeChat Work notification settings
  - send_resolved: true
    to_party: '2' # ID of the receiving department
    # to_user: 'abc|efg|xyz' # IDs of the receiving users
    # to_user: '@all' # all users
    agent_id: '1000003' # (WeChat Work --> custom app --> AgentId)
    corp_id: 'qwer' # company ID (My Company --> CorpId, at the bottom)
    api_secret: 'xx--qq' # app secret (WeChat Work --> custom app --> Secret)
    api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
    message: '{{ template "wechat.default.message" . }}'
- name: 'happy'
  email_configs:
  - to: 'hayyp@company.com'
    send_resolved: true
    html: '{{ template "email.html" .}}'
    headers: { Subject: 'Prometheus alert' }
inhibit_rules:
- source_match: # while this alert is firing, other alerts are inhibited
    severity: 'critical'
  equal: ['id', 'instance']
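How the child routes and `continue: true` interact can be sketched as follows. This is a simplified model of the routing tree above (one level of children, `match_re` only), written for illustration; the function name is hypothetical. It relies on the fact that Alertmanager's regex matchers are anchored, which is why `re.fullmatch` is used.

```python
import re


def matching_receivers(labels: dict, routes: list, default_receiver: str) -> list:
    """Return the receivers an alert reaches in a one-level routing tree."""
    receivers = []
    for route in routes:
        patterns = route.get("match_re", {})
        # A route without matchers matches every alert.
        if all(re.fullmatch(p, labels.get(name, "")) for name, p in patterns.items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break  # without continue, routing stops at the first match
    return receivers or [default_receiver]


ROUTES = [
    {"receiver": "happy", "match_re": {"job": "^快乐子公司.*$"}},
    {"receiver": "wechat", "continue": True},
    {"receiver": "default-receiver", "continue": True},
]
```

In this model an alert from a job starting with 快乐子公司 goes only to `happy`, while every other alert reaches both `wechat` and `default-receiver`.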
(3)Start
Configure it as a systemd service
vim /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
#User=prometheus
ExecStart=/soft/alertmanager/alertmanager --config.file=/soft/alertmanager/alertmanager.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
Start it through systemd
systemctl daemon-reload
systemctl start alertmanager
systemctl enable alertmanager
systemctl status alertmanager
The status output shows that it started successfully and is listening on port 9093.
【7.3】Verifying that it is running and reachable
The decisive check is the web UI: if http://192.168.175.130:9093 opens in a browser, Alertmanager is working.
【8】Integrating Prometheus with Alertmanager
【8.1】Editing prometheus.yml
The main change is in the alerting section: fill in targets with the address and port of the Alertmanager server (here 192.168.175.130:9093).
Then run the online reload command on the Prometheus server:
curl -X POST http://localhost:9090/-/reload
【8.2】Email alerts
Because the agent_linux target 192.168.175.129:9100 was stopped in 【6.4】, the alert email arrives as soon as Prometheus and Alertmanager are connected.
Now start the node again.
The alert then returns to normal.
The resolved notification, however, does not arrive promptly. The cause is two parameters in our alertmanager configuration; together they resulted in one notification roughly every 10 minutes:
# After the first notification has been sent, wait group_interval before notifying about new alerts in the group
group_interval: 5m
# If a notification has already been sent successfully, wait repeat_interval before sending it again
repeat_interval: 5m
The resolved notification then arrives as well. That completes this part.
【8.3】Email template reference
【email】
{{ define "email.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}{{ range $i, $alert := .Alerts.Firing }}
{{- /* "紧急" is the "urgent" severity label value */ -}}
{{- if eq $alert.Labels.severity "紧急" }}
========Urgent alert==========<br>
{{- else }}
========Monitoring alert==========<br>
{{- end }}
Host: {{ $alert.Labels.host_label }}<br>
Severity: {{ $alert.Labels.severity }}<br>
Status: {{ .Status }}<br>
Alert name: {{ $alert.Labels.alertname }}<br>
Summary: {{ $alert.Annotations.summary }}<br>
Value: {{ $alert.Annotations.value }}<br>
Alert time: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
Operations team: {{ $alert.Labels.team }}<br>
Details: {{ $alert.Annotations.description }}<br>
========end=============
{{ end }}
{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}{{ range $i, $alert := .Alerts.Resolved }}
====Alert resolved=====<br>
Host: {{ $alert.Labels.host_label }}<br>
Status: {{ .Status }}<br>
Alert name: {{ $alert.Labels.alertname }}<br>
Summary: {{ $alert.Annotations.summary }}-->resolved<br>
Alert time: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
Resolved time: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
========end=============
{{ end }}
{{ end -}}
{{ end }}
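The `.Add 28800e9` in the template shifts the UTC timestamp by 28800e9 nanoseconds, which is 8 hours, so the email shows UTC+8 local time. The Python sketch below reproduces that arithmetic and the `2006-01-02 15:04:05` layout; the function name and the sample timestamp are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

OFFSET_NS = 28800e9  # the value passed to .Add in the template, in nanoseconds


def format_shifted(starts_at_utc: datetime) -> str:
    """Shift a UTC timestamp by OFFSET_NS and format it like Go's 2006-01-02 15:04:05."""
    shifted = starts_at_utc + timedelta(seconds=OFFSET_NS / 1e9)
    return shifted.strftime("%Y-%m-%d %H:%M:%S")
```

For a different time zone, only the offset changes, for example `3600e9` for UTC+1.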
【9】WeChat Work alerts
【9.1】WeChat Work admin setup
WeChat Work registration: https://work.weixin.qq.com/
Create an application and save its IDs; they are needed later.
Save the company ID (CorpId) as well.
【9.2】Rewriting alertmanager.yml to add WeChat Work receivers
Official reference:
# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]
# The API key to use when talking to the WeChat API.
[ api_secret: <secret> | default = global.wechat_api_secret ]
# The WeChat API URL.
[ api_url: <string> | default = global.wechat_api_url ]
# The corp id for authentication.
[ corp_id: <string> | default = global.wechat_api_corp_id ]
# API request data as defined by the WeChat API.
[ message: <tmpl_string> | default = '{{ template "wechat.default.message" . }}' ]
[ agent_id: <string> | default = '{{ template "wechat.default.agent_id" . }}' ]
[ to_user: <string> | default = '{{ template "wechat.default.to_user" . }}' ]
[ to_party: <string> | default = '{{ template "wechat.default.to_party" . }}' ]
[ to_tag: <string> | default = '{{ template "wechat.default.to_tag" . }}' ]
Use to_party to address a department, then simply add the relevant people to that department in WeChat Work.
My actual configuration, for reference:
global:
  # Time after which an alert is declared resolved if it is no longer reported
  resolve_timeout: 5m
  # Email delivery settings
  smtp_smarthost: 'smtp.qq.com:25'
  smtp_from: '815202984@qq.com'
  smtp_auth_username: '815202984@qq.com'
  smtp_auth_password: 'a123456!'
  smtp_require_tls: false # disable TLS
templates:
- 'test.tmpl'
# Root route that every alert enters; it defines the dispatch policy
route:
  # Labels by which incoming alerts are regrouped. For example, many alerts carrying cluster=A and alertname=LatencyHigh are batched into a single group
  group_by: ['alertname', 'cluster']
  # After a new alert group is created, wait at least group_wait before sending the first notification, so that several alerts of the same group can be collected and sent together
  group_wait: 30s
  # After the first notification has been sent, wait group_interval before notifying about new alerts in the group
  group_interval: 10s
  # If a notification has already been sent successfully, wait repeat_interval before sending it again
  repeat_interval: 10s
  # Default receiver: an alert that matches no child route is sent here
  #receiver: "default"
  receiver: "wechat"
receivers:
- name: 'default' # custom name, referenced by receiver: default
  email_configs: # email notification settings
  - to: '815202984@qq.com'
    send_resolved: true
- name: 'wechat'
  wechat_configs:
  - send_resolved: true
    agent_id: '1000002' # application ID
    to_user: 'GuoChaoQun|Zhangsan' # receiving member accounts
    api_secret: 'xxx' # application secret
    corp_id: 'xxx' # WeChat Work company ID
The member accounts in to_user must be the account IDs shown for each member in the WeChat Work admin console.
./amtool check-config alertmanager.yml
【9.3】Building the alert template test.tmpl
cd /soft/alertmanager
vim test.tmpl
{{ define "wechat.default.message" }}
{{ range .Alerts }}
========Monitoring alert==========
Status: {{ .Status }}
Severity: {{ .Labels.severity }}
Alert name: {{ .Labels.alertname }}
Application: {{ .Annotations.summary }}
Host: {{ .Labels.instance }}
Details: {{ .Annotations.description }}
Trigger value: {{ .Annotations.value }}
Alert time: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
========end=============
{{ end }}
{{ end }}
Check once more with the following command. The configuration is correct only if the output lists the template.
./amtool check-config alertmanager.yml
【9.4】Building the alert rule file on the Prometheus server
vim first_rules.yml
groups:
- name: node-alert-rule
  rules:
  - alert: "Monitored node down"
    expr: sum(up{job="agent_linux"}) == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Service: {{$labels.alertname}} monitored node down"
      description: "{{ $labels.alertname }} monitored node is down"
      value: "{{ $value }}"
Validate it: ./promtool check rules first_rules.yml
【9.5】Test
After changing the configuration, remember to restart the Alertmanager service.
【Startup parameter best practices】
(1)prometheus
/data/prometheus/prometheus/prometheus --config.file=/data/prometheus/prometheus/prometheus.yml \
--web.read-timeout=5m --web.max-connections=512 --storage.tsdb.retention.time=30d \
--storage.tsdb.path=/data/prometheus/prometheus/data/ --query.timeout=2m --query.max-concurrency=20 \
--web.listen-address=0.0.0.0:9090 --web.enable-lifecycle --web.enable-admin-api
(2)alertmanager
/data/prometheus/alertmanager/alertmanager --log.level=debug --config.file=/data/prometheus/alertmanager/alertmanager.yml >> /data/prometheus/alertmanager/alertmanager.log 2>&1
(3)mysqld_exporter
nohup mysqld_exporter --config.my-cnf=/etc/my.cnf --collect.info_schema.tables --collect.info_schema.innodb_metrics --collect.auto_increment.columns >>/var/log/messages 2>&1 &
(4)node_exporter
nohup node_exporter --collector.meminfo_numa --collector.processes >>/var/log/messages 2>&1 &
(5)redis_exporter
nohup redis_exporter -web.listen-address :9121 -redis.addr 127.0.0.1:6379 -redis.password asdGLvQYeW >>/var/log/messages 2>&1 &
(6)grafana
/usr/sbin/grafana-server --config=/etc/grafana/grafana.ini --pidfile=/var/run/grafana/grafana-server.pid \
--packaging=rpm cfg:default.paths.logs=/var/log/grafana \
cfg:default.paths.data=/var/lib/grafana cfg:default.paths.plugins=/var/lib/grafana/plugins cfg:default.paths.provisioning=/etc/grafana/provisioning
(7)windows_exporter
.\windows_exporter.exe --collectors.enabled "cpu,cs,logical_disk,net,os,service,system,textfile,mssql" --collector.service.services-where "Name='windows_exporter'"
【Prometheus + Alertmanager】Reference alert rules for the monitoring stack itself
groups:
- name: Prometheus alert rules
  rules:
  - alert: Scrape target down
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus target missing (instance {{ $labels.instance }})"
      #description: "A Prometheus target has disappeared. An exporter might be crashed.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: Entire scrape job down
    expr: sum by (job) (up) == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus all targets missing (instance {{ $labels.instance }})"
      description: "A Prometheus job does not have living target anymore.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: Prometheus configuration reload failed
    expr: prometheus_config_last_reload_successful != 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Prometheus configuration reload failure (instance {{ $labels.instance }})"
      description: "Prometheus configuration reload error\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: Alertmanager configuration reload failed
    expr: alertmanager_config_last_reload_successful != 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }})"
      description: "AlertManager configuration reload error\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: Prometheus not connected to Alertmanager
    expr: prometheus_notifications_alertmanagers_discovered < 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus not connected to alertmanager (instance {{ $labels.instance }})"
      description: "Prometheus cannot connect the alertmanager\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: PrometheusRuleEvaluationFailures
    expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus rule evaluation failures (instance {{ $labels.instance }})"
      description: "Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: PrometheusTemplateTextExpansionFailures
    expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus template text expansion failures (instance {{ $labels.instance }})"
      description: "Prometheus encountered {{ $value }} template text expansion failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: PrometheusRuleEvaluationSlow
    expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Prometheus rule evaluation slow (instance {{ $labels.instance }})"
      description: "Prometheus rule evaluation took more time than the scheduled interval. It indicates a slower storage backend access or too complex query.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: PrometheusNotificationsBacklog
    expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Prometheus notifications backlog (instance {{ $labels.instance }})"
      description: "The Prometheus notification queue has not been empty for 10 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: PrometheusAlertmanagerNotificationFailing
    expr: rate(alertmanager_notifications_failed_total[1m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus AlertManager notification failing (instance {{ $labels.instance }})"
      description: "Alertmanager is failing sending notifications\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: PrometheusTargetEmpty
    expr: prometheus_sd_discovered_targets == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus target empty (instance {{ $labels.instance }})"
      description: "Prometheus has no target in service discovery\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: PrometheusTargetScrapingSlow
    expr: prometheus_target_interval_length_seconds{quantile="0.9"} > 60
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Prometheus target scraping slow (instance {{ $labels.instance }})"
      description: "Prometheus is scraping exporters slowly\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: PrometheusTargetScrapeDuplicate
    expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Prometheus target scrape duplicate (instance {{ $labels.instance }})"
      description: "Prometheus has many samples rejected due to duplicate timestamps but different values\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: PrometheusTsdbCheckpointCreationFailures
    expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[3m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }})"
      description: "Prometheus encountered {{ $value }} checkpoint creation failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: PrometheusTsdbCheckpointDeletionFailures
    expr: increase(prometheus_tsdb_checkpoint_deletions_failed_total[3m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }})"
      description: "Prometheus encountered {{ $value }} checkpoint deletion failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: PrometheusTsdbCompactionsFailed
    expr: increase(prometheus_tsdb_compactions_failed_total[3m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus TSDB compactions failed (instance {{ $labels.instance }})"
      description: "Prometheus encountered {{ $value }} TSDB compactions failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: PrometheusTsdbHeadTruncationsFailed
    expr: increase(prometheus_tsdb_head_truncations_failed_total[3m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus TSDB head truncations failed (instance {{ $labels.instance }})"
      description: "Prometheus encountered {{ $value }} TSDB head truncation failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: PrometheusTsdbReloadFailures
    expr: increase(prometheus_tsdb_reloads_failures_total[3m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus TSDB reload failures (instance {{ $labels.instance }})"
      description: "Prometheus encountered {{ $value }} TSDB reload failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: PrometheusTsdbWalCorruptions
    expr: increase(prometheus_tsdb_wal_corruptions_total[3m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})"
      description: "Prometheus encountered {{ $value }} TSDB WAL corruptions\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: PrometheusTsdbWalTruncationsFailed
    expr: increase(prometheus_tsdb_wal_truncations_failed_total[3m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }})"
      description: "Prometheus encountered {{ $value }} TSDB WAL truncation failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
【References】
WeChat Work alerts: https://www.cnblogs.com/guoxiangyue/p/11958522.html
Histograms and custom collectors: https://mp.weixin.qq.com/s/KcIVIE30X5IlB4JM9P1ucg
【Troubleshooting】
(1)promhttp.go:38] Error gathering metrics: [from Gatherer #1] context deadline exceeded
Check that the port is open and the network path is reachable.
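A quick way to test whether a port is reachable from the Prometheus server is a plain TCP connection attempt. The helper below is a minimal sketch (its name is invented here); it only tells you whether a TCP handshake succeeds within the timeout, not whether the exporter behind the port is healthy.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, DNS failure
        return False
```

For this setup, `port_open("192.168.175.129", 9100)` run on the Prometheus server would show whether node_exporter is reachable at the network level.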
(2)Grafana web UI error: If you're seeing this Grafana has failed to load its application files
The following message appears when opening the Grafana server:
If you’re seeing this Grafana has failed to load its application files This could be caused by your reverse proxy settings. If you host grafana under subpath make sure your grafana.ini root_url setting includes subpath. If not using a reverse proxy make sure to set serve_from_sub_path to true. If you have a local dev build make sure you build frontend using: yarn start, yarn start:hot, or yarn build Sometimes restarting grafana-server can help Check if you are using a non-supported browser. For more information, refer to the list of supported browsers.
Cause: the browser version is not compatible.
Fix: switch to another browser such as Firefox, or upgrade Chrome to the latest version.
Related workaround: disabling the browser cache is enough to restore access (F12 => Settings => Network => Disable cache).