
How Prometheus Sends Alerts

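Prometheus continuously evaluates the rules in its rule files to decide whether an alert needs to be sent.
When a rule's condition is met, it sends the alert to the addresses configured under alertmanagers.
Alerts are delivered by HTTP POST to the configured address: for example, with targets: ['192.168.1.104:8090'], the alert is posted to http://192.168.1.104:8090/api/v2/alerts.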
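
As a rough illustration of how the POST address is derived from a target, the sketch below joins a scheme, a target, a path prefix and an API version into the alerts endpoint. The helper name alertsURL is hypothetical, and the parameter names only mirror the scheme, path_prefix and api_version settings of an alertmanagers entry; this is not Prometheus's own code.

```go
package main

import (
	"fmt"
	"strings"
)

// alertsURL is a hypothetical helper that builds the alerts endpoint
// for one alertmanagers target.
func alertsURL(scheme, target, pathPrefix, apiVersion string) string {
	// Drop a trailing slash so "/" and "/am/" do not produce "//".
	prefix := strings.TrimSuffix(pathPrefix, "/")
	return fmt.Sprintf("%s://%s%s/api/%s/alerts", scheme, target, prefix, apiVersion)
}

func main() {
	// Values taken from the example in the text.
	fmt.Println(alertsURL("http", "192.168.1.104:8090", "/", "v2"))
	// prints http://192.168.1.104:8090/api/v2/alerts
}
```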

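1. Goal

Alerts are normally handled by the Alertmanager component, but by then they have been grouped or inhibited, so the original alert payload is no longer visible.
Here we write our own minimal alertmanager program that receives the alerts Prometheus sends and prints them, so that we can study the alert payload, the send frequency, and how alert resolution is handled.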

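Terminology
evaluation_interval: the interval at which Prometheus evaluates rules, as set in its configuration
for: the wait time configured in an alert rule before the alert fires; it equals fireAt - activeAt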

2. Writing the alertmanager program

alertmanager.go

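package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
    "time"
)

// MyHandler prints every alert payload it receives.
type MyHandler struct{}

func (mh *MyHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    body, err := io.ReadAll(r.Body)
    if err != nil {
        fmt.Printf("read body err, %v\n", err)
        http.Error(w, "cannot read body", http.StatusBadRequest)
        return
    }
    // Print the receive time, then the raw JSON payload.
    fmt.Println(time.Now())
    fmt.Printf("%s\n\n", string(body))
}

func main() {
    // The path must match the one Prometheus posts to (/api/v2/alerts in the
    // description above). If api_version is set to v1 in the Prometheus
    // config, register /api/v1/alerts instead.
    http.Handle("/api/v2/alerts", &MyHandler{})
    // ListenAndServe only returns on error, so log it and exit.
    log.Fatal(http.ListenAndServe(":8090", nil))
}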

Start the program: go run alertmanager.go
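
The program prints the request body as raw JSON. If typed access to the fields is more convenient, the body can be decoded into structs. The sketch below is one possible way to do that. The Alert type and parseAlerts function are hypothetical names, the field set is limited to the keys visible in the payloads logged in section 4, and the sample body is a shortened copy of one of those payloads.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"time"
)

// Alert holds the keys seen in the payload Prometheus posts.
type Alert struct {
	Labels       map[string]string `json:"labels"`
	Annotations  map[string]string `json:"annotations"`
	StartsAt     time.Time         `json:"startsAt"`
	EndsAt       time.Time         `json:"endsAt"`
	GeneratorURL string            `json:"generatorURL"`
}

// parseAlerts decodes a request body that contains a JSON array of alerts.
func parseAlerts(body []byte) ([]Alert, error) {
	var alerts []Alert
	if err := json.Unmarshal(body, &alerts); err != nil {
		return nil, err
	}
	return alerts, nil
}

func main() {
	// Shortened sample payload in the shape the program receives.
	body := []byte(`[{"annotations":{"summary":"Instance 192.168.1.200:9100 down"},` +
		`"endsAt":"2022-11-01T08:08:01.431Z","startsAt":"2022-11-01T02:01:31.431Z",` +
		`"labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node"}}]`)

	alerts, err := parseAlerts(body)
	if err != nil {
		log.Fatal(err)
	}
	for _, a := range alerts {
		fmt.Printf("%s on %s: startsAt=%s endsAt=%s\n",
			a.Labels["alertname"], a.Labels["instance"],
			a.StartsAt.Format(time.RFC3339), a.EndsAt.Format(time.RFC3339))
	}
}
```

Inside ServeHTTP, parseAlerts could be called on the bytes read from r.Body in place of printing them verbatim.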

3. Configuration files

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.1.104:8090']
rule_files:
  - "/etc/prometheus/rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['192.168.1.200:9100']

The alertmanager address is set to 192.168.1.104:8090. This is the address of our own program, which receives the alerts from Prometheus and prints them.

rules.yml

groups:
- name: example
  rules:
  # Alert for any instance that is unreachable for more than 1 minute.
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      serverity: page
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minutes."

4. Log analysis

After stopping the node_exporter service on instance 192.168.1.200:9100, the program prints the following:

2022-11-01 16:04:01.4538613 +0800 CST m=+23803.323087701
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:08:01.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]

2022-11-01 16:05:16.4596299 +0800 CST m=+23878.328856301
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:09:16.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]

2022-11-01 16:06:31.4571604 +0800 CST m=+23953.326386801
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:10:31.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]

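The log shows that:
Prometheus resends the alert every 1 minute 15 seconds, which here equals evaluation_interval + for (15s + 1m). The same gap would also result from Prometheus's resend delay (--rules.alert.resend-delay, 1m by default): an alert is resent at the first evaluation that falls strictly more than one resend delay after the previous send, which is 75s with a 15s evaluation interval, so the match with for may be a coincidence of this configuration.
endsAt lies in the future: it is the send time plus 4 minutes. This is consistent with Prometheus setting endsAt to the send time plus 4 × max(resend delay, evaluation_interval), which gives 4 × 1m here.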

node_exporter was then started again at 16:11:16.
The log shows:

2022-11-01 16:11:31.4596592 +0800 CST m=+24253.328885601
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:11:31.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]

2022-11-01 16:12:46.4868947 +0800 CST m=+24328.356121101
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:11:31.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]

...............
...............

2022-11-01 16:25:16.4505975 +0800 CST m=+25078.319823901
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:11:31.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]

2022-11-01 16:26:31.4480177 +0800 CST m=+25153.317244101
[{"annotations":{"description":"192.168.1.200:9100 of job node has been down for more than 1 minutes.","summary":"Instance 192.168.1.200:9100 down"},"endsAt":"2022-11-01T08:11:31.431Z","startsAt":"2022-11-01T02:01:31.431Z","generatorURL":"http://192.168.1.200:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","labels":{"alertname":"InstanceDown","instance":"192.168.1.200:9100","job":"node","project":"cmdb","serverity":"page"}}]

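After node_exporter was started, the output above also shows that once the "resolved" alert had been sent, Prometheus kept sending it many more times: 13 notifications in total, from 16:11:31 through 16:26:31.
The send interval is again 1 minute 15 seconds, the same as evaluation_interval + for.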

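Once the alert is no longer active, Prometheus sends one more alert to signal that it has been resolved. In the log this happens at 16:11:31, 15 seconds after node_exporter came back at 16:11:16, so the wait is one evaluation_interval and not the 1m of for. In other words, the first alert above (received at 16:11:31) is really a "resolved" notification. Why? Because its endsAt equals the time at which that alert was sent: when Alertmanager receives it, endsAt is already in the past, which means the alert has ended.

For the resolved alert, endsAt (08:11:31.431Z, i.e. 16:11:31 +0800) is the time at which Prometheus saw the alert become inactive, and it stays at that value in every later resend.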

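5. Summary

Prometheus evaluates at every evaluation_interval whether an alert should be sent.
While the alert condition holds, Prometheus resends the alert every 1 minute 15 seconds, which in this configuration equals evaluation_interval + for.
Keys in the logged payload:
startsAt: the time the alert started firing, activeAt + for
endsAt: the time the alert ends. While the alert is firing, it is a time in the future (the send time plus 4 minutes in this log). Once the alert is resolved, it is the time at which Prometheus saw the alert become inactive (08:11:31.431Z here, 15 seconds after node_exporter was started again).
Alerts are delivered by HTTP POST to the configured alertmanager address: for example, with targets: ['192.168.1.104:8090'], they are posted to http://192.168.1.104:8090/api/v2/alerts.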

posted @ 2022-11-01 17:16  头痛不头痛