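Prometheus Alert Delivery Mechanism
Prometheus continuously evaluates the configured rules to decide whether an alert needs to be sent.
When a rule's condition is met, it sends the alert to the addresses configured under alertmanagers.
The alert is delivered as an HTTP POST to the configured address. For example, with targets: ['192.168.1.104:8090'], Prometheus posts the alert to http://192.168.1.104:8090/api/v2/alerts.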
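1. Goal
Alerts are normally handled by the Alertmanager component, but Alertmanager groups and inhibits them, so the original alert payloads are never visible.
Here we write our own stand-in alertmanager program that receives the alerts Prometheus sends and prints them, in order to study the alert payload, the sending frequency, and how alert resolution is handled.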
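Terminology
evaluation_interval: the interval at which Prometheus evaluates rules, as set in the Prometheus configuration
for: the waiting time configured in an alerting rule before the alert fires; its value equals fireAt - activeAt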
2. Writing the alertmanager program
alertmanager.go
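package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"time"
)

// MyHandler prints every request body it receives.
type MyHandler struct{}

func (mh *MyHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	defer r.Body.Close()
	// io.ReadAll replaces the deprecated ioutil.ReadAll (Go 1.16+).
	body, err := io.ReadAll(r.Body)
	if err != nil {
		fmt.Printf("read body err, %v\n", err)
		http.Error(w, "cannot read body", http.StatusBadRequest)
		return
	}
	// Print the receive time, then the raw alert payload.
	fmt.Println(time.Now())
	fmt.Printf("%s\n\n", string(body))
}

func main() {
	// Prometheus posts to /api/v2/alerts, so the handler is registered on that path.
	http.Handle("/api/v2/alerts", &MyHandler{})
	// ListenAndServe only returns on failure, e.g. when port 8090 is already in use.
	log.Fatal(http.ListenAndServe(":8090", nil))
}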
Start the program: go run alertmanager.go
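To check the receiver without waiting for Prometheus to fire a rule, an alert can also be posted by hand. The following is a minimal sketch, not part of the original setup: the helper names alertsURL and postAlert and the TestAlert payload are hypothetical, and the URL is built assuming the default http scheme and no path prefix. So that it runs on its own, it starts an in-process stand-in receiver with net/http/httptest instead of contacting 192.168.1.104:8090.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// alertsURL builds the URL Prometheus would POST to for a given target,
// assuming the default scheme (http) and an empty path prefix.
func alertsURL(target string) string {
	return "http://" + target + "/api/v2/alerts"
}

// postAlert sends a JSON payload to url and returns the HTTP status code.
func postAlert(url string, payload []byte) (int, error) {
	resp, err := http.Post(url, "application/json", bytes.NewReader(payload))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
	return resp.StatusCode, nil
}

// samplePayload is a hand-written, hypothetical alert (not taken from the logs).
const samplePayload = `[{"labels":{"alertname":"TestAlert","instance":"example:9100"},"annotations":{"summary":"manual test"}}]`

func main() {
	// In-process stand-in for the receiver so the sketch is self-contained.
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v2/alerts", func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)
		fmt.Printf("received: %s\n", body)
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()

	// srv.URL looks like http://127.0.0.1:PORT; strip the scheme to get a target.
	target := strings.TrimPrefix(srv.URL, "http://")
	status, err := postAlert(alertsURL(target), []byte(samplePayload))
	if err != nil {
		fmt.Println("post failed:", err)
		return
	}
	fmt.Println("status:", status)
}
```

Against the real receiver, the same postAlert call would be pointed at alertsURL("192.168.1.104:8090").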
3. Configuration files
prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.1.104:8090']
rule_files:
  - "/etc/prometheus/rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['192.168.1.200:9100']
The alertmanager address configured here is 192.168.1.104:8090, which is where our own program listens. It receives the alerts from Prometheus and prints them.
rules.yml
groups:
  - name: example
    rules:
      # Alert for any instance that is unreachable for >1 minutes.
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          serverity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minutes."
4. Log analysis
After stopping the node_exporter service on instance 192.168.1.200:9100, the program prints the following:
2022-11-01 16:04:01.4538613 +0800 CST m=+23803.323087701
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:08:01.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]
2022-11-01 16:05:16.4596299 +0800 CST m=+23878.328856301
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:09:16.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]
2022-11-01 16:06:31.4571604 +0800 CST m=+23953.326386801
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:10:31.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]
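From the log we can see:
Prometheus resends the alert every 1 minute 15 seconds, which here equals evaluation_interval + for (15s + 1m). Note that the resend spacing is governed by the --rules.alert.resend-delay flag (default 1m) and sends only happen on evaluation ticks, so with for: 1m both explanations give the same number.
endsAt lies in the future: it is the time the alert was sent plus 4 minutes. This is consistent with Prometheus setting endsAt to the send time plus four times the larger of the resend delay and the evaluation interval (4 x 1m here).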
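The 4-minute gap can be checked directly from the logged values. The sketch below decodes an abbreviated copy of the first logged payload (only endsAt, startsAt and two labels are kept) and subtracts the receive time printed by the program. The Alert struct and the validityLeft helper are names introduced here for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Alert holds the fields of one logged alert that matter here.
type Alert struct {
	Labels   map[string]string `json:"labels"`
	StartsAt time.Time         `json:"startsAt"`
	EndsAt   time.Time         `json:"endsAt"`
}

// Abbreviated copy of the first logged payload and its receive time
// (the log prints the receive time in +0800; it is rewritten as RFC 3339).
const (
	firstPayload    = `[{"endsAt":"2022-11-01T08:08:01.431Z","startsAt":"2022-11-01T02:01:31.431Z","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100"}}]`
	firstReceivedAt = "2022-11-01T16:04:01.4538613+08:00"
)

// validityLeft returns endsAt minus the receive time for the first alert
// in payload, rounded to the nearest second.
func validityLeft(payload []byte, receivedAt string) (time.Duration, error) {
	var alerts []Alert
	if err := json.Unmarshal(payload, &alerts); err != nil {
		return 0, err
	}
	if len(alerts) == 0 {
		return 0, fmt.Errorf("payload contains no alerts")
	}
	received, err := time.Parse(time.RFC3339Nano, receivedAt)
	if err != nil {
		return 0, err
	}
	return alerts[0].EndsAt.Sub(received).Round(time.Second), nil
}

func main() {
	d, err := validityLeft([]byte(firstPayload), firstReceivedAt)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("endsAt is", d, "after the receive time") // prints: endsAt is 4m0s after the receive time
}
```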
node_exporter was then started again at 16:11:16.
The log shows the following:
2022-11-01 16:11:31.4596592 +0800 CST m=+24253.328885601
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:11:31.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]
2022-11-01 16:12:46.4868947 +0800 CST m=+24328.356121101
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:11:31.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]
...............
...............
2022-11-01 16:25:16.4505975 +0800 CST m=+25078.319823901
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:11:31.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]
2022-11-01 16:26:31.4480177 +0800 CST m=+25153.317244101
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:11:31.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]
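From the output above we can also see that after node_exporter was started, Prometheus did not send the "resolved" alert just once. It kept resending it, 13 times in total, between 16:11:31 and 16:26:31.
The resend spacing is again 1 minute 15 seconds, the same as for the firing alert (evaluation_interval + for in this setup).
The first alert after the restart (16:11:31) is in fact the "resolved" notification. Why? Because its endsAt equals the time it was sent, so by the time Alertmanager receives it, endsAt is already in the past, which means the alert has ended.
We might read this as endsAt = inactiveAt + for, but the log shows endsAt (16:11:31) only 15 seconds after the restart (16:11:16), so it is closer to the time of the first evaluation that saw the instance up again.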
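The firing-versus-resolved distinction described above comes down to comparing endsAt with the time the alert is received. The sketch below applies that comparison to three endsAt and receive-time pairs copied from the logs. The isResolved helper is a name introduced here, and the rule it encodes (endsAt is set and not in the future) is taken from the description in this article, not from Alertmanager's source.

```go
package main

import (
	"fmt"
	"time"
)

// isResolved reports whether an alert with the given endsAt counts as
// resolved at receive time: endsAt is set and is not in the future.
func isResolved(endsAt, receivedAt string) (bool, error) {
	end, err := time.Parse(time.RFC3339Nano, endsAt)
	if err != nil {
		return false, err
	}
	now, err := time.Parse(time.RFC3339Nano, receivedAt)
	if err != nil {
		return false, err
	}
	return !end.IsZero() && !end.After(now), nil
}

func main() {
	// endsAt and receive times copied from the logs (receive times rewritten as RFC 3339).
	samples := []struct{ endsAt, receivedAt string }{
		{"2022-11-01T08:10:31.431Z", "2022-11-01T16:06:31.4571604+08:00"}, // before the restart
		{"2022-11-01T08:11:31.431Z", "2022-11-01T16:11:31.4596592+08:00"}, // first alert after the restart
		{"2022-11-01T08:11:31.431Z", "2022-11-01T16:26:31.4480177+08:00"}, // last logged alert
	}
	for _, s := range samples {
		resolved, err := isResolved(s.endsAt, s.receivedAt)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("received %s endsAt %s resolved=%t\n", s.receivedAt, s.endsAt, resolved)
	}
}
```

The first pair is reported as still firing and the other two as resolved, which matches the reading of the logs given above.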
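5. Summary
Prometheus evaluates at every evaluation_interval whether an alert should be sent.
While the alert condition holds, Prometheus resends the alert every evaluation_interval + for (1 minute 15 seconds in this test, where for happens to equal the default resend delay of 1m).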
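The 1 minute 15 second spacing can be reproduced with a small model. This is a simplified sketch under stated assumptions, not Prometheus's own code: it assumes an alert is resent at the first evaluation tick that falls strictly later than the previous send plus a resend delay, and it assumes that delay is 1m (the default of --rules.alert.resend-delay). The function name resendGap is introduced here for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// resendGap models the spacing between two sends of a firing alert:
// the alert is resent at the first evaluation tick that is strictly
// later than lastSent + resendDelay.
func resendGap(evalInterval, resendDelay time.Duration) time.Duration {
	if evalInterval <= 0 {
		return 0 // no evaluation ticks, nothing to model
	}
	var elapsed time.Duration
	for {
		elapsed += evalInterval // time of the next evaluation tick after the last send
		if elapsed > resendDelay {
			return elapsed
		}
	}
}

func main() {
	// evaluation_interval is 15s in this article; the 1m resend delay is an assumed default.
	fmt.Println(resendGap(15*time.Second, time.Minute)) // prints 1m15s
	// A hypothetical 30s evaluation interval for comparison.
	fmt.Println(resendGap(30*time.Second, time.Minute)) // prints 1m30s
}
```

Under this model the spacing does not depend on the rule's for value, so changing for to something other than 1m would be one way to tell the two explanations apart.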
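Keys in the logged payload:
startsAt: the time the alert started firing, activeAt + for
endsAt: the alert's end time. Once the alert is resolved it is a past time (inactiveAt + for in the terms used above; 15 seconds after the restart in this test). While the alert is still firing it is a future time (the send time plus 4 minutes in this test).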