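AlertManager Alerting Mechanism
In the previous [article](https://www.cnblogs.com/zydev/p/16848444.html) we covered how Prometheus sends out alerts.
Those raw alerts, however, come without grouping, inhibition, or handling of repeated alerts; that is the job of the AlertManager component.
Here we use AlertManager to receive alerts and study the effect of three configuration parameters: group_wait, group_interval, and repeat_interval.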
1. Start a custom program
Here we use a webhook to receive the alerts that AlertManager passes on.
webhook-receiver.go
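package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"time"
)

// AlertHandler prints the receive time and the raw notification body sent by AlertManager.
func AlertHandler(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		fmt.Printf("read body err, %v\n", err)
		return
	}
	fmt.Println(time.Now())
	fmt.Printf("%s\n\n", string(body))
	fmt.Fprintf(w, "Hi there, I love %s!", r.URL.Path[1:])
}

func main() {
	// Listen on :8090; AlertManager posts notifications to /alert/webhook.
	http.HandleFunc("/alert/webhook", AlertHandler)
	log.Fatal(http.ListenAndServe(":8090", nil))
}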
2. Configure AlertManager
Configure AlertManager to deliver alerts to the program above.
alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'webhook'
receivers:
  - name: webhook
    webhook_configs:
      - url: http://192.168.1.104:8090/alert/webhook
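3. Send alerts
AlertManager's API is described at https://github.com/prometheus/alertmanager/blob/main/api/v2/openapi.yaml;
paste that file into https://editor.swagger.io to browse all endpoints.
Here we use Postman to send a POST request to http://192.168.1.200:9093/api/v2/alerts
to create an alert.
EndsAt must be later than the time the request is sent; otherwise AlertManager automatically ignores the alert.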
[
{
"Labels": {
"alertname": "NodeCpuPressure",
"IP": "192.168.2.101"
},
"Annotations": {
"summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
},
"StartsAt": "2022-11-02T06:33:44.662Z",
"EndsAt": "2022-11-02T08:33:44.662Z"
}
]
Next, we call the AlertManager API to query alerts (GET /api/v2/alerts) and groups (GET /api/v2/alerts/groups), either directly in a browser or with curl on the command line. The alerts query returns:
[
{
"annotations": {
"summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
},
"endsAt": "2022-11-02T08:33:44.662Z",
"fingerprint": "27e1a08813b1ec3b",
"receivers": [
{
"name": "webhook"
}
],
"startsAt": "2022-11-02T06:33:44.662Z",
"status": {
"inhibitedBy": [],
"silencedBy": [],
"state": "active"
},
"updatedAt": "2022-11-02T06:37:02.089Z",
"labels": {
"IP": "192.168.2.101",
"alertname": "NodeCpuPressure"
}
}
]
The groups query returns:
[
{
"alerts": [
{
"annotations": {
"summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
},
"endsAt": "2022-11-02T08:33:44.662Z",
"fingerprint": "27e1a08813b1ec3b",
"receivers": [
{
"name": "webhook"
}
],
"startsAt": "2022-11-02T06:33:44.662Z",
"status": {
"inhibitedBy": [],
"silencedBy": [],
"state": "active"
},
"updatedAt": "2022-11-02T06:37:02.089Z",
"labels": {
"IP": "192.168.2.101",
"alertname": "NodeCpuPressure"
}
}
],
"labels": {
"alertname": "NodeCpuPressure"
},
"receiver": {
"name": "webhook"
}
}
]
AlertManager automatically created a group with labels {alertname=NodeCpuPressure}, and it contains the alert we just sent.
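The groups query can also be scripted. The sketch below is only illustrative: `alertGroup`, `decodeGroups`, and `fetchGroups` are hypothetical names, and the struct keeps just the response fields it prints.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// alertGroup keeps only the fields of the groups response that this sketch uses.
type alertGroup struct {
	Labels map[string]string `json:"labels"`
	Alerts []struct {
		Labels      map[string]string `json:"labels"`
		Fingerprint string            `json:"fingerprint"`
	} `json:"alerts"`
}

// decodeGroups parses the JSON array returned by GET /api/v2/alerts/groups.
func decodeGroups(data []byte) ([]alertGroup, error) {
	var groups []alertGroup
	if err := json.Unmarshal(data, &groups); err != nil {
		return nil, err
	}
	return groups, nil
}

// fetchGroups queries the groups endpoint on baseURL and decodes the response.
func fetchGroups(client *http.Client, baseURL string) ([]alertGroup, error) {
	resp, err := client.Get(baseURL + "/api/v2/alerts/groups")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return decodeGroups(data)
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	// Address of the AlertManager instance used in this article.
	groups, err := fetchGroups(client, "http://192.168.1.200:9093")
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	for _, g := range groups {
		fmt.Printf("group %v: %d alert(s)\n", g.Labels, len(g.Alerts))
	}
}
```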
The program's log shows what AlertManager sent to it:
{
"receiver": "webhook",
"status": "firing",
"alerts": [
{
"status": "firing",
"labels": {
"IP": "192.168.2.101",
"alertname": "NodeCpuPressure"
},
"annotations": {
"summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
},
"startsAt": "2022-11-02T06:33:44.662Z",
"endsAt": "0001-01-01T00:00:00Z",
"generatorURL": "",
"fingerprint": "27e1a08813b1ec3b"
}
],
"groupLabels": {
"alertname": "NodeCpuPressure"
},
"commonLabels": {
"IP": "192.168.2.101",
"alertname": "NodeCpuPressure"
},
"commonAnnotations": {
"summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
},
"externalURL": "http://629720e6c34d:9093",
"version": "4",
"groupKey": "{}:{alertname=\"NodeCpuPressure\"}",
"truncatedAlerts": 0
}
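Rather than printing the raw body, the receiver could decode the notification into a struct. This is a sketch under assumptions: the Go type names and the `summarize` helper are invented here, while the JSON keys are taken from the payload shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// webhookMessage covers the notification fields this sketch needs.
// The Go names are this sketch's own; the JSON keys come from the payload above.
type webhookMessage struct {
	Receiver     string            `json:"receiver"`
	Status       string            `json:"status"`
	Alerts       []webhookAlert    `json:"alerts"`
	GroupLabels  map[string]string `json:"groupLabels"`
	CommonLabels map[string]string `json:"commonLabels"`
	GroupKey     string            `json:"groupKey"`
}

// webhookAlert is one entry of the "alerts" array.
type webhookAlert struct {
	Status      string            `json:"status"`
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	Fingerprint string            `json:"fingerprint"`
}

// summarize decodes a notification body and returns a one-line description.
func summarize(body []byte) (string, error) {
	var msg webhookMessage
	if err := json.Unmarshal(body, &msg); err != nil {
		return "", err
	}
	return fmt.Sprintf("%s %s: %d alert(s)", msg.Status, msg.GroupKey, len(msg.Alerts)), nil
}

func main() {
	// Abbreviated copy of the notification shown above.
	body := []byte(`{"receiver":"webhook","status":"firing","alerts":[{"status":"firing","labels":{"IP":"192.168.2.101","alertname":"NodeCpuPressure"},"fingerprint":"27e1a08813b1ec3b"}],"groupLabels":{"alertname":"NodeCpuPressure"},"groupKey":"{}:{alertname=\"NodeCpuPressure\"}"}`)
	line, err := summarize(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(line) // firing {}:{alertname="NodeCpuPressure"}: 1 alert(s)
}
```

Inside AlertHandler, the bytes read from the request body could be passed to `summarize` in the same way.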
Next we send another alert with the same alertname as above:
[
{
"Labels": {
"alertname": "NodeCpuPressure",
"instance": "192.168.1.200:9100"
},
"Annotations": {
"summary": "NodeCpuPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"
},
"StartsAt": "2022-11-02T06:33:44.662Z",
"EndsAt": "2022-11-02T08:33:44.662Z"
}
]
Then check the notification the program received:
{
"receiver": "webhook",
"status": "firing",
"alerts": [
{
"status": "firing",
"labels": {
"alertname": "NodeCpuPressure",
"instance": "192.168.1.200:9100"
},
"annotations": {
"summary": "NodeCpuPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"
},
"startsAt": "2022-11-02T06:33:44.662Z",
"endsAt": "0001-01-01T00:00:00Z",
"generatorURL": "",
"fingerprint": "e729dc1c8ec3316e"
},
{
"status": "firing",
"labels": {
"IP": "192.168.2.101",
"alertname": "NodeCpuPressure"
},
"annotations": {
"summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
},
"startsAt": "2022-11-02T06:33:44.662Z",
"endsAt": "0001-01-01T00:00:00Z",
"generatorURL": "",
"fingerprint": "27e1a08813b1ec3b"
}
],
"groupLabels": {
"alertname": "NodeCpuPressure"
},
"commonLabels": {
"alertname": "NodeCpuPressure"
},
"commonAnnotations": {},
"externalURL": "http://629720e6c34d:9093",
"version": "4",
"groupKey": "{}:{alertname=\"NodeCpuPressure\"}",
"truncatedAlerts": 0
}
The two alerts were bundled into a single notification to our program. This is alert grouping.
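The effect of group_by can be illustrated with a few lines of Go. This is a simplified model for illustration, not AlertManager's implementation; `groupKey` and `groupAlerts` are hypothetical helpers that bucket alerts by the values of the group_by labels.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// groupKey renders the subset of labels named in groupBy, e.g. {alertname="NodeCpuPressure"}.
func groupKey(groupBy []string, labels map[string]string) string {
	names := append([]string(nil), groupBy...)
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, n := range names {
		if v, ok := labels[n]; ok {
			parts = append(parts, fmt.Sprintf("%s=%q", n, v))
		}
	}
	return "{" + strings.Join(parts, ", ") + "}"
}

// groupAlerts buckets alerts (given as label sets) by their group key.
func groupAlerts(groupBy []string, alerts []map[string]string) map[string][]map[string]string {
	groups := make(map[string][]map[string]string)
	for _, labels := range alerts {
		key := groupKey(groupBy, labels)
		groups[key] = append(groups[key], labels)
	}
	return groups
}

func main() {
	alerts := []map[string]string{
		{"alertname": "NodeCpuPressure", "IP": "192.168.2.101"},
		{"alertname": "NodeCpuPressure", "instance": "192.168.1.200:9100"},
	}
	// Both alerts share the same alertname, so they fall into one group.
	for key, members := range groupAlerts([]string{"alertname"}, alerts) {
		fmt.Printf("%s: %d alert(s)\n", key, len(members))
	}
}
```

Labels that are not listed in group_by (IP and instance here) do not affect which group an alert lands in.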
Next we send an alert with a different alertname:
[
{
"Labels": {
"alertname": "NodeMemPressure",
"instance": "192.168.1.200:9100"
},
"Annotations": {
"summary": "NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"
},
"StartsAt": "2022-11-02T06:33:44.662Z",
"EndsAt": "2022-11-02T08:33:44.662Z"
}
]
The program received a notification shortly afterwards:
2022-11-02 15:03:27.3966965 +0800 CST m=+13583.346347101
{"receiver":"webhook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.200:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T06:33:44.662Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"543ee1360ede8267"}],"groupLabels":{"alertname":"NodeMemPressure"},"commonLabels":{"alertname":"NodeMemPressure","instance":"192.168.1.200:9100"},"commonAnnotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"externalURL":"http://629720e6c34d:9093","version":"4","groupKey":"{}:{alertname=\"NodeMemPressure\"}","truncatedAlerts":0}
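It contains only the alert whose alertname is NodeMemPressure.
4. Resolve the alert
Now we send the following "resolve" request (that is, with EndsAt set to a time in the past):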
[
{
"Labels": {
"alertname": "NodeMemPressure",
"instance": "192.168.1.200:9100"
},
"Annotations": {
"summary": "NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"
},
"StartsAt": "2022-11-02T06:33:44.662Z",
"EndsAt": "2022-11-02T06:50:44.662Z"
}
]
The program shortly received a resolved notification:
2022-11-02 15:28:17.5246315 +0800 CST m=+15073.474282101
{"receiver":"webhook","status":"resolved","alerts":[{"status":"resolved","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.200:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T06:33:44.662Z","endsAt":"2022-11-02T06:50:44.662Z","generatorURL":"","fingerprint":"543ee1360ede8267"}],"groupLabels":{"alertname":"NodeMemPressure"},"commonLabels":{"alertname":"NodeMemPressure","instance":"192.168.1.200:9100"},"commonAnnotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"externalURL":"http://629720e6c34d:9093","version":"4","groupKey":"{}:{alertname=\"NodeMemPressure\"}","truncatedAlerts":0}
The notification status has changed to resolved.
5. Parameter explanation
group_wait (default: 30s)
How long to initially wait to send a notification for a group of alerts. Allows to wait for an inhibiting alert to arrive or collect more initial alerts for the same group. (Usually ~0s to few minutes.)
group_interval (default: 5m)
How long to wait before sending a notification about new alerts that are added to a group of alerts for which an initial notification has already been sent. (Usually ~5m or more.)
repeat_interval (default: 4h)
How long to wait before sending a notification again if it has already been sent successfully for an alert. (Usually ~3h or more).
6. Experiment
Parameters:
group_wait: 10s
group_interval: 5m
repeat_interval: 20m
At 16:24:55, create the first alert, instance=192.168.1.100:9100
At 16:25:00, create the second alert, instance=192.168.1.101:9100
At 16:26:14, create the third alert, instance=192.168.1.102:9100
Notifications received by the program:
2022-11-02 16:25:25.1830502 +0800 CST m=+18501.132700801
{"receiver":"webhook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.100:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T08:24:44.662Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"80a1a8cfbe10362e"},{"status":"firing","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.101:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T08:24:44.662Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"8a4e47f851b7fa01"}],"groupLabels":{"alertname":"NodeMemPressure"},"commonLabels":{"alertname":"NodeMemPressure"},"commonAnnotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"externalURL":"http://629720e6c34d:9093","version":"4","groupKey":"{}:{alertname=\"NodeMemPressure\"}","truncatedAlerts":0}
2022-11-02 16:30:25.1791246 +0800 CST m=+18801.128775201
{"receiver":"webhook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.100:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T08:24:44.662Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"80a1a8cfbe10362e"},{"status":"firing","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.101:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T08:24:44.662Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"8a4e47f851b7fa01"},{"status":"firing","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.102:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T08:24:44.662Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"45fb02806009ff64"}],"groupLabels":{"alertname":"NodeMemPressure"},"commonLabels":{"alertname":"NodeMemPressure"},"commonAnnotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"externalURL":"http://629720e6c34d:9093","version":"4","groupKey":"{}:{alertname=\"NodeMemPressure\"}","truncatedAlerts":0}
2022-11-02 16:55:25.1825506 +0800 CST m=+20301.132201201
{"receiver":"webhook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.100:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T08:24:44.662Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"80a1a8cfbe10362e"},{"status":"firing","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.101:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T08:24:44.662Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"8a4e47f851b7fa01"},{"status":"firing","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.102:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T08:24:44.662Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"45fb02806009ff64"}],"groupLabels":{"alertname":"NodeMemPressure"},"commonLabels":{"alertname":"NodeMemPressure"},"commonAnnotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"externalURL":"http://629720e6c34d:9093","version":"4","groupKey":"{}:{alertname=\"NodeMemPressure\"}","truncatedAlerts":0}
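The first and second alerts arrived within group_wait of each other, so they were bundled into the first notification.
The third alert arrived more than group_wait after the first, so it missed the first notification and went out with the second one, a group_interval later (group_wait + group_interval after the group was created).
The third notification (identical in content to the second) came 25 minutes after the second: repeats are only checked once per group_interval, and that was the first check at which more than repeat_interval (20m) had passed since the last notification.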
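Lifecycle of a single alert
AlertManager receives the alert, waits group_wait (10s), and sends the first notification.
Until the next group_interval check (at 5m 10s), it sleeps.
At the group_interval check (5m 10s), repeat_interval (due at 20m 10s) has not yet elapsed, so it sleeps again.
At a later group_interval check, once repeat_interval (20m 10s) has been exceeded, it sends the second notification.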