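Spring Cloud: Integrating Prometheus with Alertmanager for Microservice Alerting (Part 23)

The previous two posts covered microservice monitoring with Prometheus + Grafana + Eureka, where Grafana provides the monitoring dashboards. One problem remains: when something goes wrong, nobody can watch the dashboards around the clock. People miss important alerts, manual alerting is slow, and above all, staffing a watch over thousands of servers is not feasible. Automatic, accurate alerting therefore becomes essential.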

Spring Cloud: Monitoring Microservices with Prometheus + Grafana (Part 21)

Spring Cloud: Dynamic Microservice Monitoring with Prometheus + Grafana + Eureka (Part 22)

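In the official architecture diagram, the blue box in the upper right corner of the figure below marks the Prometheus alerting module. This post implements the following flow: when a machine goes down, Prometheus sends the alert to Alertmanager, and Alertmanager forwards it to our own alerting application. That application can then notify the relevant business and development staff by email, SMS, or WeCom (企业微信).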

 

 

 

 

Figure: Architecture of the Prometheus and Alertmanager alerting integration

 

1. Configuring Alertmanager in Prometheus

Configure the Alertmanager address in prometheus.yml. 9093 is Alertmanager's default port.

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]  

 

2. Configuring alerting rules in Prometheus

2.1 In the same directory as prometheus.yml, create a rule file named first_rules.yml.

The contents of first_rules.yml are as follows:

groups:
- name: example
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Instance has been down for more than 5 minutes

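The rule above raises an alert when an instance goes offline: the expression up == 0 must hold for 1 minute (for: 1m) before the alert fires. Note that the summary text says "more than 5 minutes", which does not match the 1m value; adjust one of the two so that they agree.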
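To make the effect of the for clause concrete, here is a minimal Java sketch of the state transitions as I understand them: an alert is pending from the first evaluation at which the expression matches, and becomes firing once it has matched continuously for the configured duration. The class ForClauseSketch and its method names are hypothetical and exist only for illustration; this is a simplified model, not Prometheus code.

```java
import java.time.Duration;

/** Illustrative model (hypothetical class) of how a rule with a `for` clause changes state. */
class ForClauseSketch {

    enum State { INACTIVE, PENDING, FIRING }

    private final Duration holdFor;      // the rule's `for` value, e.g. 1m
    private Duration activeSince = null; // evaluation time at which the expression first matched

    ForClauseSketch(Duration holdFor) {
        this.holdFor = holdFor;
    }

    /**
     * Feeds one rule evaluation into the model.
     *
     * @param now         time of this evaluation, measured from an arbitrary start
     * @param exprMatches whether the expression (here: up == 0) matched this instance
     */
    State evaluate(Duration now, boolean exprMatches) {
        if (!exprMatches) {
            activeSince = null;          // a healthy evaluation resets the timer
            return State.INACTIVE;
        }
        if (activeSince == null) {
            activeSince = now;           // first matching evaluation starts the timer
        }
        Duration held = now.minus(activeSince);
        return held.compareTo(holdFor) >= 0 ? State.FIRING : State.PENDING;
    }

    public static void main(String[] args) {
        ForClauseSketch rule = new ForClauseSketch(Duration.ofMinutes(1));
        // Simulate an evaluation every 15 seconds while the instance stays down.
        for (int s = 0; s <= 60; s += 15) {
            System.out.println(s + "s -> " + rule.evaluate(Duration.ofSeconds(s), true));
        }
    }
}
```

In this model the alert stays pending for the evaluations at 0s to 45s and is firing at 60s. Only firing alerts are forwarded to Alertmanager, which is why a short outage that recovers within the for window produces no notification.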

 

2.2 Reference first_rules.yml under rule_files in prometheus.yml. The entry is commented out by default; uncomment it.

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "first_rules.yml"
  # - "second_rules.yml"

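3. Installing and configuring Alertmanager

3.1 Downloading Alertmanager

Download: https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.windows-amd64.tar.gz

3.2 Editing alertmanager.yml

After downloading, extract the archive and change receivers.webhook_configs.url so that it points to our own alerting application (the spring-cloud-alertmanager address).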

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/alertMessage/receive'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
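The inhibit_rules section is easy to misread, so the following Java sketch spells out the matching logic as I understand it: a firing alert with severity critical mutes an alert with severity warning when both carry the same values for every label listed under equal. The class InhibitRuleSketch and its helpers are hypothetical names used for illustration; this is a simplified model of the rule in the configuration above, not Alertmanager's implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified model (hypothetical class) of the inhibit rule in alertmanager.yml. */
class InhibitRuleSketch {

    // Labels listed under `equal` in the configuration.
    static final List<String> EQUAL = Arrays.asList("alertname", "dev", "instance");

    /** Returns true when the firing alert `source` would mute the alert `target`. */
    static boolean inhibits(Map<String, String> source, Map<String, String> target) {
        if (!"critical".equals(source.get("severity"))) {
            return false; // source_match: severity must be critical
        }
        if (!"warning".equals(target.get("severity"))) {
            return false; // target_match: severity must be warning
        }
        for (String label : EQUAL) {
            // A missing label is treated like an empty value, so two alerts
            // that both lack `dev` still count as equal on that label.
            String a = source.getOrDefault(label, "");
            String b = target.getOrDefault(label, "");
            if (!a.equals(b)) {
                return false;
            }
        }
        return true;
    }

    /** Builds a label map from alternating key/value arguments. */
    static Map<String, String> labels(String... keyValues) {
        Map<String, String> result = new HashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            result.put(keyValues[i], keyValues[i + 1]);
        }
        return result;
    }
}
```

For example, a critical InstanceDown alert for one instance mutes a warning InstanceDown alert for the same instance, but not a warning for a different instance. The rule in this post only defines critical alerts, so the inhibit rule has no visible effect here.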

 

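4. Developing the spring-cloud-alertmanager alerting module

4.1 Create a controller that receives the alert messages. Note that Alertmanager submits them via POST, and the parameter is the raw request body.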

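import java.nio.charset.StandardCharsets;

import lombok.extern.slf4j.Slf4j;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

/**
 * Receives the webhook notifications pushed by Alertmanager.
 *
 * @author Leo
 */
@RestController
@RequestMapping("alertMessage")
@Slf4j
public class ReceiveAlertMessageController {

    @PostMapping("receive")
    public String receiveMsg(@RequestBody byte[] data) {
        // Alertmanager posts a UTF-8 encoded JSON document in the request body.
        String msg = new String(data, StandardCharsets.UTF_8);
        // The log prefix means "Received Alertmanager alert message:".
        log.info("接收AlertManager预警消息:" + msg);
        return "success";
    }
}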

 

4.2 Create the application entry class

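import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.netflix.eureka.EnableEurekaClient;

/**
 * Entry point of the spring-cloud-alertmanager application.
 *
 * @author Leo
 */
@SpringBootApplication
@EnableEurekaClient // registers this application with Eureka
public class AlertManagerApplication {

    public static void main(String[] args) {
        SpringApplication.run(AlertManagerApplication.class, args);
    }

}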

 

5. Verifying the alerting flow

5.1 Startup

Start Eureka

Start Prometheus: D:\soft\springcloud\prometheus-2.25.1\prometheus.exe

Start Alertmanager: D:\soft\springcloud\alertmanager-0.21.0\alertmanager.exe

Start spring-cloud-alertmanager

5.2 Viewing the alerting rules

Open http://localhost:9090/classic/rules in a browser to see the rule configured earlier in first_rules.yml.

 

 

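Click the Alerts menu. Three instances are shown as down. (They are not actually offline: Prometheus can pull the application list from Eureka, but we have not configured Prometheus support in these applications, so Prometheus cannot scrape metrics from them and reports up as 0.)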

 

 

5.3 Viewing the Alertmanager console

Open http://localhost:9093/ in a browser and click the Alerts menu. Three alerts are listed, which shows that Prometheus has pushed the alerts to Alertmanager.

 

 

5.4 Checking the spring-cloud-alertmanager log

2021-03-17 10:28:05.621  INFO 49924 --- [nio-5001-exec-5] c.x.a.c.ReceiveAlertMessageController    : 接收AlertManager预警消息:{"receiver":"web\\.hook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"InstanceDown","instance":"172.16.43.41:5001","job":"eureka","severity":"critical"},"annotations":{"summary":"Instance has been down for more than 5 minutes"},"startsAt":"2021-03-17T00:27:55.050285364Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://DESKTOP-TK67BLR:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","fingerprint":"69addef300b8a5b1"},{"status":"firing","labels":{"alertname":"InstanceDown","instance":"windows10.microdone.cn:apollo-adminservice:8090","job":"eureka","severity":"critical"},"annotations":{"summary":"Instance has been down for more than 5 minutes"},"startsAt":"2021-03-17T00:27:55.050285364Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://DESKTOP-TK67BLR:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","fingerprint":"bfde9dc4159405b2"},{"status":"firing","labels":{"alertname":"InstanceDown","instance":"windows10.microdone.cn:apollo-configservice:8080","job":"eureka","severity":"critical"},"annotations":{"summary":"Instance has been down for more than 5 minutes"},"startsAt":"2021-03-17T00:27:55.050285364Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://DESKTOP-TK67BLR:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1","fingerprint":"80adda6540e0cfba"}],"groupLabels":{"alertname":"InstanceDown"},"commonLabels":{"alertname":"InstanceDown","job":"eureka","severity":"critical"},"commonAnnotations":{"summary":"Instance has been down for more than 5 minutes"},"externalURL":"http://DESKTOP-TK67BLR:9093","version":"4","groupKey":"{}:{alertname=\"InstanceDown\"}","truncatedAlerts":0}

The log shows that the http://127.0.0.1:5001/alertMessage/receive endpoint received the message pushed by Alertmanager. Formatted with a JSON tool, the payload looks like this:

{
    "receiver":"web\\.hook",
    "status":"firing",
    "alerts":[
        {
            "status":"firing",
            "labels":{
                "alertname":"InstanceDown",
                "instance":"172.16.43.41:5001",
                "job":"eureka",
                "severity":"critical"
            },
            "annotations":{
                "summary":"Instance has been down for more than 5 minutes"
            },
            "startsAt":"2021-03-17T00:27:55.050285364Z",
            "endsAt":"0001-01-01T00:00:00Z",
            "generatorURL":"http://DESKTOP-TK67BLR:9090/graph?g0.expr=up+%3D%3D+0&g0.tab=1",
            "fingerprint":"69addef300b8a5b1"
        },
        {
            "status":"firing",
            "labels":{
                "alertname":"InstanceDown",
                "instance":"windows10.microdone.cn:apollo-adminservice:8090",
                "job":"eureka",
                "severity":"critical"
            },
            "annotations":{
                "summary":"Instance has been down for more than 5 minutes"
            },
            "startsAt":"2021-03-17T00:27:55.050285364Z",
            "endsAt":"0001-01-01T00:00:00Z",
            "generatorURL":"http://DESKTOP-TK67BLR:9090/graph?g0.expr=up+%3D%3D+0&g0.tab=1",
            "fingerprint":"bfde9dc4159405b2"
        },
        {
            "status":"firing",
            "labels":{
                "alertname":"InstanceDown",
                "instance":"windows10.microdone.cn:apollo-configservice:8080",
                "job":"eureka",
                "severity":"critical"
            },
            "annotations":{
                "summary":"Instance has been down for more than 5 minutes"
            },
            "startsAt":"2021-03-17T00:27:55.050285364Z",
            "endsAt":"0001-01-01T00:00:00Z",
            "generatorURL":"http://DESKTOP-TK67BLR:9090/graph?g0.expr=up+%3D%3D+0&g0.tab=1",
            "fingerprint":"80adda6540e0cfba"
        }
    ],
    "groupLabels":{
        "alertname":"InstanceDown"
    },
    "commonLabels":{
        "alertname":"InstanceDown",
        "job":"eureka",
        "severity":"critical"
    },
    "commonAnnotations":{
        "summary":"Instance has been down for more than 5 minutes"
    },
    "externalURL":"http://DESKTOP-TK67BLR:9093",
    "version":"4",
    "groupKey":"{}:{alertname=\"InstanceDown\"}",
    "truncatedAlerts":0
}
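Once the payload arrives, the alerting application usually needs a few fields from it, such as the affected instances. The sketch below pulls the instance labels out of the raw JSON text with a regular expression so that it runs on the JDK alone. The class AlertPayloadSketch and its methods are hypothetical, and the regex approach is an assumption made for brevity: it does not handle escaped quotes inside values, and a real application would more likely bind the payload with a JSON library such as Jackson.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that extracts instance labels from a webhook payload. */
class AlertPayloadSketch {

    // Matches "instance":"<value>" and captures the value (no escaped quotes supported).
    private static final Pattern INSTANCE =
            Pattern.compile("\"instance\"\\s*:\\s*\"([^\"]*)\"");

    /** Returns the value of every "instance" label found in the payload, in order. */
    static List<String> instances(String json) {
        List<String> result = new ArrayList<>();
        Matcher matcher = INSTANCE.matcher(json);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    /** Builds a one-line text that could be used as an email or SMS body. */
    static String describe(String json) {
        List<String> found = instances(json);
        return found.size() + " instance(s) affected: " + String.join(", ", found);
    }

    public static void main(String[] args) {
        // Shortened payload containing two of the alerts shown above.
        String json = "{\"status\":\"firing\",\"alerts\":["
                + "{\"labels\":{\"instance\":\"windows10.microdone.cn:apollo-adminservice:8090\"}},"
                + "{\"labels\":{\"instance\":\"windows10.microdone.cn:apollo-configservice:8080\"}}]}";
        System.out.println(describe(json));
    }
}
```

The controller could pass its msg string to describe and hand the result to whatever notification channel is in use.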

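This completes the integration of Prometheus with Alertmanager.

Note: we do not have Alertmanager send email directly because the alert volume in production is large. Instead, spring-cloud-alertmanager can store the incoming alerts in a message queue or a database and then call the email or SMS service. This also makes the alerting channels more flexible.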
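The buffering idea can be sketched in plain Java. The example below uses an in-memory BlockingQueue purely as a stand-in for the message queue or database mentioned above; the class AlertBufferSketch, its capacity of 1000, and its method names are assumptions for illustration, and a production setup would need a durable store so that alerts survive a restart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical in-memory buffer between the webhook controller and the senders. */
class AlertBufferSketch {

    // Bounded so that a burst of alerts cannot exhaust memory (capacity is an assumption).
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(1000);

    /** Called by the webhook controller; returns false when the buffer is full. */
    boolean accept(String payload) {
        return queue.offer(payload);
    }

    /** Removes and returns up to `max` buffered payloads for the email/SMS sender. */
    List<String> nextBatch(int max) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, max);
        return batch;
    }

    public static void main(String[] args) {
        AlertBufferSketch buffer = new AlertBufferSketch();
        buffer.accept("{\"status\":\"firing\"}");
        buffer.accept("{\"status\":\"resolved\"}");
        // A sender thread or scheduled task would poll batches like this.
        System.out.println(buffer.nextBatch(10).size());
    }
}
```

Decoupling receipt from delivery this way lets the controller return quickly to Alertmanager while the senders work through the backlog at their own pace.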

 

posted @ 2021-03-17 10:51  shileishmily