「Alertmanager」- Prometheus alert notification tool @20210522

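Problem description

In the Prometheus monitoring system, when a metric becomes abnormal and triggers one of Prometheus's alerting rules, Prometheus sends the alert to the Alertmanager service. Alertmanager then processes these alerts and decides whether to send a notification, whether to merge notifications, when to send them, and through which channel. In other words, Alertmanager manages the delivery of alert notifications, but it does not check whether a metric has crossed an alert threshold (that is Prometheus's job).

This note records how to use Alertmanager. It is mainly a study, summary, and reorganization of the official documentation, plus some commonly used configuration examples. Sub-chapters record how to use Alertmanager for specific tasks, such as configuring Slack alerts or customizing the notification format.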

Solution

The content here comes mainly from the official documentation. Extracting its main and key points gives us an overall picture of Alertmanager.

Installation

See the alertmanager/README.md at master page. To run it as a container, see the prom/alertmanager page.

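Core concepts

Alertmanager | Prometheus

Grouping: merges alert notifications. For example, when the data center network fails and every service becomes unreachable, a large number of alerts fire. We want a single summary notification rather than one notification per service. This is what Grouping does: the routing tree in the configuration file controls which alerts are grouped together, when the grouped notification is sent, and who receives it.

Inhibition: for example, if a cluster is unreachable, we want to suppress the other alerts related to that cluster. This is what Inhibition does: it suppresses notifications for certain alerts while certain other alerts are already firing, so that only the most important alert is sent. It is configured in the configuration file.

Silences: mute alerts for a given period of time. A silence is defined by matchers on alert labels (exact or regular-expression); an alert that matches all matchers of an active silence produces no notification. Silences are configured in the Alertmanager web interface.

Client Behavior: besides working with Prometheus, Alertmanager can be used on its own. We can send our own HTTP requests carrying alert data to ask Alertmanager to send notifications.

High Availability: Alertmanager supports running as a highly available cluster so that alert notifications are not lost. Note: the official documentation says not to put a load balancer between Prometheus and Alertmanager; instead, start Alertmanager with the --cluster.* flags and configure the addresses of all Alertmanager nodes in Prometheus.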
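To make the Silences concept above more concrete, here is a minimal Python sketch, assuming the v2 HTTP API (POST /api/v2/silences) as the way to create a silence outside the web interface. The helper names build_silence and matches_silence are hypothetical, and matches_silence only approximates how matchers are evaluated; it is not Alertmanager's own code.

```python
import json
import re
from datetime import datetime, timedelta, timezone


def build_silence(matchers, hours, created_by, comment, now=None):
    """Build a JSON-serializable body for POST /api/v2/silences.

    matchers is a list of (label_name, value, is_regex) tuples.
    """
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": is_regex}
            for name, value, is_regex in matchers
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


def matches_silence(alert_labels, silence):
    """Approximation: an alert is silenced when every matcher matches."""
    for m in silence["matchers"]:
        actual = alert_labels.get(m["name"], "")
        if m["isRegex"]:
            # Regular expressions are treated as fully anchored here.
            if re.fullmatch(m["value"], actual) is None:
                return False
        elif actual != m["value"]:
            return False
    return True


if __name__ == "__main__":
    silence = build_silence(
        [("cluster", "prod-.*", True)], 2, "k4nz", "planned maintenance"
    )
    print(json.dumps(silence, indent=2))
```

The printed JSON is what would be sent as the request body; the cluster label and the two-hour window are example values only.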

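Configuration parameters and the configuration file

Configuration | Prometheus

Runtime parameters can be specified on the command line and in a configuration file. The command line covers only simple settings, while the configuration file defines complex settings such as inhibition rules, notification routing, and alert receivers.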

Command-line options

Run alertmanager -h to list all command-line options.

Specify the configuration file: alertmanager --config.file /path/to/alertmanager.yml

Increase log verbosity: alertmanager --log.level=debug

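Configuration file structure and syntax

Here we focus on the structure of the configuration file rather than the meaning of every attribute and field. Once we have an overall picture of the structure, it becomes much clearer how to approach a specific problem.

On reading the official documentation:
1) [] means the parameter is optional.
2) Placeholders, such as <duration> or <string>, describe each field's structure and value type. They make up most of the documentation, and we need to know how to read them.

The configuration file uses the YAML format. The official example configuration file: alertmanager/simple.yml at master · prometheus/alertmanager

Based on what we learned, we put together the following example configuration file: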

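# Global configuration; some values here act as defaults for the sections below
global:
  resolve_timeout: 5m
  # SMTP settings
  # smtp_smarthost must include the port; 587 is only an example value
  smtp_smarthost: 'smtp.example.com:587'
  smtp_auth_username: 'username'
  smtp_auth_password: 'password'
  smtp_from: 'k4nz@example.com'
  
# Paths of the template files used when sending notifications
templates:
  - '/etc/alertmanager/template/*.tmpl'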

# Define the notification channels and their parameters
# Note: receivers only define how notifications are delivered; the sending rules are defined under route
receivers:
  # Alert to Slack
  # (a Slack webhook URL is also required, via slack_api_url under global or api_url here; omitted in this example)
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        text: 'https://internal.example.com/wiki/alerts/{{ .GroupLabels.app }}/{{ .GroupLabels.alertname }}'
  # Email alerts to the OP team
  - name: 'mail-to-op'
    email_configs:
      - to: 'team-op@example.org'
        # Override the settings from global:
        auth_username: 'username2'
        auth_password: 'password2'
  # Email alerts to the DB team
  - name: 'mail-to-db'
    email_configs:
      - to: 'team-db@example.org'
  # Email alerts to the DB-X team
  - name: 'mail-to-db-x'
    email_configs:
      - to: 'team-db-x@example.org'
  # Email alerts to the DB-Z team
  - name: 'mail-to-db-z'
    email_configs:
      - to: 'team-db-z@example.org'
      
# Define who each alert is sent to
# (Alertmanager 0.22 and later also accept the newer matchers: syntax in place of match / match_re)
route:
  group_by: ['alertname', 'cluster', 'service']
  # Default receiver
  receiver: slack-notifications
  routes:
    - receiver: slack-notifications
      match_re:
        service: ^(foo1|foo2|baz)$
    - receiver: mail-to-op
      match:
        severity: critical
    # A more complex rule: send different kinds of database alerts to different groups
    - match:
        service: database
      receiver: mail-to-db
      routes:
        - match:
            owner: team-op
          receiver: mail-to-db-x
        - match:
            owner: team-db
          receiver: mail-to-db-z
  
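# Inhibition rules
# Alerts that share the same values for the labels listed under equal: are treated as related; while a severity: 'critical' alert is firing, the related severity: 'warning' alerts are inhibited.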
inhibit_rules:
- equal: ['alertname', 'cluster', 'service']
  source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
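The inhibition rule above can be read as a small predicate. The following Python sketch is only an approximation written for this note, not Alertmanager's implementation; the function name is_inhibited and the dictionary layout of the rule are assumptions that mirror the YAML keys.

```python
def is_inhibited(target, firing_alerts, rule):
    """Return True if `target` (a dict of labels) is inhibited under `rule`.

    rule mirrors the YAML keys: source_match, target_match, equal.
    """
    # The alert must match the target side of the rule to be a candidate.
    if any(target.get(k) != v for k, v in rule["target_match"].items()):
        return False
    for source in firing_alerts:
        if source is target:
            continue
        # The inhibiting alert must match the source side of the rule.
        if any(source.get(k) != v for k, v in rule["source_match"].items()):
            continue
        # Both alerts must agree on every label listed under equal
        # (a missing label is treated like an empty value).
        if all(source.get(n, "") == target.get(n, "") for n in rule["equal"]):
            return True
    return False


RULE = {
    "equal": ["alertname", "cluster", "service"],
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
}

if __name__ == "__main__":
    critical = {"alertname": "Down", "cluster": "c1", "service": "db", "severity": "critical"}
    warning = {"alertname": "Down", "cluster": "c1", "service": "db", "severity": "warning"}
    print(is_inhibited(warning, [critical, warning], RULE))
```

With the two example alerts, the warning is suppressed because a critical alert with the same alertname, cluster, and service is firing.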

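group_wait, group_interval, repeat_interval

These parameters are all time-related, but their meanings are easy to confuse, so this section records what each one means.

When the first alert arrives, Alertmanager creates an alert group according to the group_by attribute; alerts that have the same values for those labels are "merged" into the same group.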
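As a rough illustration of what "merged into the same group" means, the Python sketch below groups alerts by the values of the group_by labels. The function name group_alerts and the sample alerts are made up for this note; Alertmanager's real grouping also depends on the route the alert matched.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Group alerts (dicts of labels) by their values for the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # The group key is the tuple of (label name, label value) pairs;
        # a missing label counts as an empty value.
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


if __name__ == "__main__":
    alerts = [
        {"alertname": "Down", "cluster": "c1", "service": "db", "instance": "a"},
        {"alertname": "Down", "cluster": "c1", "service": "db", "instance": "b"},
        {"alertname": "Down", "cluster": "c2", "service": "db", "instance": "c"},
    ]
    for key, members in group_alerts(alerts, ["alertname", "cluster", "service"]).items():
        print(key, len(members))
```

The first two alerts differ only in instance, which is not listed in group_by, so they land in one group and would be delivered in one notification.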

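During the group_wait period after the group is created, any alert with the same label values is merged into that group. When group_wait elapses, the alerts accumulated in the group are sent.

After the first notification for the **group** has been sent, if new alerts arrive, the group waits for group_interval. When group_interval elapses, the alerts accumulated in the group are sent.

So group_wait and group_interval control how alerts are aggregated.

After a notification has been sent, it is re-sent every repeat_interval until the alert is resolved. An alert counts as resolved once the EndsAt time carried in the request has passed; if the request carries no EndsAt, the alert is considered resolved after it has not been updated for resolve_timeout. Prometheus sets EndsAt on the alerts it sends.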
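The interplay of the three parameters is easier to see on a timeline. The following Python sketch is a simplified model under stated assumptions: alerts never resolve, the group is checked once at group_wait and then every group_interval, and a notification goes out at a check only if new alerts joined or repeat_interval has passed since the last notification. The function name notification_times is hypothetical, and the model is not Alertmanager's actual scheduler.

```python
def notification_times(arrivals, group_wait, group_interval, repeat_interval, horizon):
    """Return the times (seconds) at which one group would send notifications.

    arrivals: times at which new alerts join the group.
    horizon: stop simulating after this time.
    """
    if not arrivals:
        return []
    pending = sorted(arrivals)
    sent = []
    last_sent = None
    # The first check happens group_wait after the first alert arrives.
    tick = pending[0] + group_wait
    while tick <= horizon:
        new = [t for t in pending if t <= tick]
        pending = [t for t in pending if t > tick]
        due_repeat = last_sent is not None and tick - last_sent >= repeat_interval
        if new or due_repeat:
            sent.append(tick)
            last_sent = tick
        # Later checks happen every group_interval.
        tick += group_interval
    return sent


if __name__ == "__main__":
    print(notification_times([0, 10, 100], 30, 300, 3600, 4000))
```

With alerts arriving at 0 s, 10 s, and 100 s, group_wait=30s, group_interval=5m, and repeat_interval=1h, this model sends at 30 s (the first two alerts together), at 330 s (the third alert), and at 3930 s (the repeat). In this model the repeat lands on a group_interval check rather than exactly one hour after the previous notification.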

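In practice, however, the time a notification is actually received still differs from the time it was expected. The difference is a fixed value, so it does not look like delay introduced by the notification channel. The reference "Prometheus: understanding the delays on alerting" discusses where such delays come from.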

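Customizing notification templates

Notification template reference | Prometheus
Notification template examples | Prometheus

Notifications sent by Alertmanager are rendered through templates (Go templating) before being delivered to the recipients.

Explaining Go templating would take this note off topic, so it is not covered here; see the Go templating note.

Most templates work out of the box and usually need no customization. To customize a template, study the following:
1) how to use Go templating;
2) the Notification template reference document, to learn which variables are available in templates.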

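Common configuration scenarios

The official Notification template examples document provides simple configuration examples that cover several common scenarios, enough to get started.

Here we record our own use cases and the related configuration:
1) Sending alerts to multiple channels, such as Email and Slack. See the Send messages to multiple destinations note.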

Common tools and operations

Reloading the configuration file (reload)

Send a SIGHUP signal to the process, or send an HTTP POST request to /-/reload:

# curl -XPOST http://127.0.0.1:9093/-/reload

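// The Alertmanager log then shows output similar to the following:
level=info ts=2021-04-27T08:35:31.553347364Z caller=main.go:322 msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yml

If the configuration file is malformed, the new configuration is not applied and an error is logged. However, the log did not seem to include the cause, so we could not tell where the error was; amtool check-config (see the amtool section) can validate the file before reloading.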
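The same reload request can be issued from a script. This is a minimal Python sketch using only the standard library; the function names are hypothetical, and the default address is simply the one from the curl example above.

```python
import urllib.request


def build_reload_request(base_url="http://127.0.0.1:9093"):
    """Build the POST request for the /-/reload endpoint (no body needed)."""
    return urllib.request.Request(base_url.rstrip("/") + "/-/reload", method="POST")


def reload_alertmanager(base_url="http://127.0.0.1:9093", timeout=5):
    """Send the reload request and return the HTTP status code.

    urllib raises urllib.error.HTTPError for non-2xx responses, so a failed
    reload surfaces as an exception rather than a return value.
    """
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    print(reload_alertmanager())
```

Running the script requires a reachable Alertmanager instance; building the request alone does not touch the network.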

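Sending custom alerts (testing the Alertmanager configuration)

After startup, Alertmanager listens for HTTP on port 9093, and Prometheus delivers alerts by calling Alertmanager's HTTP API.

Beyond that, we can build our own alert payload and send it to Alertmanager over HTTP to trigger a notification. For example:
1) Testing the Alertmanager configuration: send-alerts.sh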
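As an alternative to a shell script, the sketch below builds a test alert in Python and posts it to Alertmanager. It assumes the v2 API endpoint /api/v2/alerts and the default address 127.0.0.1:9093; the helper names and the label values are made up for illustration and are not the contents of send-alerts.sh.

```python
import json
import urllib.request


def build_alert(labels, annotations=None):
    """Build one alert object; labels identify the alert and drive routing."""
    return {"labels": dict(labels), "annotations": dict(annotations or {})}


def build_alerts_request(alerts, base_url="http://127.0.0.1:9093"):
    """Build the POST request carrying a JSON list of alerts."""
    body = json.dumps(alerts).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alerts(alerts, base_url="http://127.0.0.1:9093", timeout=5):
    """Send the alerts and return the HTTP status code."""
    with urllib.request.urlopen(build_alerts_request(alerts, base_url), timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    test_alert = build_alert(
        {"alertname": "TestAlert", "service": "database", "severity": "critical"},
        {"summary": "Manually injected alert for testing the routing configuration"},
    )
    print(send_alerts([test_alert]))
```

Choosing labels that match a specific route (for example service: database) is a quick way to check that the intended receiver actually gets the notification.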

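amtool

The command-line tool that ships with Alertmanager. It can:
1) check whether a configuration file is valid: amtool check-config alertmanager.yml
2) list the currently firing alerts and operate on them;
3) test the route configuration (similar to the Routing tree editor, but on the command line).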

Routing tree editor

The official route editor (Routing tree editor) can be used to test alert routing: given a route configuration and the labels present on an alert, it shows which receivers would get the notification.
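The question the editor answers, "which receiver gets this alert?", can also be approximated in a few lines of Python. The sketch below is a simplified model written for this note: it supports only match and match_re, assumes continue is false (the first matching child route wins), and uses a hypothetical ROUTE dictionary that mirrors the example route tree from the configuration section.

```python
import re


def route_matches(route, labels):
    """A route matches when all of its match / match_re conditions hold."""
    for name, value in route.get("match", {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in route.get("match_re", {}).items():
        # Regular expressions are treated as fully anchored.
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def resolve_receiver(route, labels, inherited=None):
    """Walk the tree depth-first; the deepest matching route decides."""
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        if route_matches(child, labels):
            return resolve_receiver(child, labels, receiver)
    return receiver


ROUTE = {
    "receiver": "slack-notifications",
    "routes": [
        {"receiver": "slack-notifications", "match_re": {"service": "^(foo1|foo2|baz)$"}},
        {"receiver": "mail-to-op", "match": {"severity": "critical"}},
        {
            "receiver": "mail-to-db",
            "match": {"service": "database"},
            "routes": [
                {"receiver": "mail-to-db-x", "match": {"owner": "team-op"}},
                {"receiver": "mail-to-db-z", "match": {"owner": "team-db"}},
            ],
        },
    ],
}

if __name__ == "__main__":
    print(resolve_receiver(ROUTE, {"service": "database", "owner": "team-db"}))
```

Because sibling routes are tried in order, an alert with service: database and severity: critical goes to mail-to-op in this model, not to the database receivers, which is the kind of surprise the editor helps to catch.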

References

Alertmanager | Prometheus
Configuration | Prometheus
Clients | Prometheus
Notification template reference | Prometheus
Notification template examples | Prometheus
Management API | Prometheus
How to check your prometheus.yml is valid – Robust Perception | Prometheus Monitoring Experts
How do I configure the log level of Prometheus' Alertmanager? - Server Fault
alertmanager/simple.yml at master · prometheus/alertmanager
What’s the difference between group_interval, group_wait, and repeat_interval? – Robust Perception | Prometheus Monitoring Experts
Prometheus: understanding the delays on alerting

posted @ 2021-05-22 21:27 研究林纳斯写的
