Cloud-Native Monitoring with Prometheus: Deploying and Configuring Alertmanager
Deploying and Configuring Alertmanager
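In the Prometheus ecosystem, alerting is made up of two independent parts; the diagram below illustrates the Prometheus alerting mechanism.
Prometheus server and Alertmanager are two separate components.
We use Prometheus server to collect metrics, then define threshold-based alerting rules on those metrics with PromQL. Prometheus server evaluates the alerting rules periodically; when a rule's condition is met, it generates an alert and pushes it to Alertmanager. Alertmanager then processes the alert: it groups alerts (grouping) and routes them (routing) to the correct receiver, such as email, PagerDuty or HipChat, which finally delivers the notification about the abnormal event to the recipient.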
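The push from Prometheus server to Alertmanager is configured on the Prometheus side. The fragment below is a minimal sketch of the relevant part of prometheus.yml; the evaluation interval, the rule file path and the Alertmanager address are assumptions chosen for illustration and are not taken from this article.

```yaml
# prometheus.yml (fragment); all values are illustrative assumptions
global:
  evaluation_interval: 30s        # how often alerting rules are evaluated

rule_files:
  - "rules/*.yml"                 # hypothetical location of the alerting rule files

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "127.0.0.1:9093"    # assumed address of the Alertmanager instance
```

With this in place, every alert that fires during rule evaluation is sent to the listed Alertmanager target.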
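In Prometheus, we can name a single alerting rule and define it with PromQL, but more often related rules are grouped and defined together. Before going further, here is an introduction to three Alertmanager concepts: grouping, inhibition and silences.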
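As a sketch of what a group of related alerting rules can look like, the rule file below defines two rules in one group. The group name, alert names, thresholds and durations are hypothetical; the second rule additionally assumes that node_exporter metrics are being scraped.

```yaml
# rules/node-alerts.yml (hypothetical file name)
groups:
  - name: node-alerts             # hypothetical group name
    rules:
      - alert: InstanceDown
        expr: up == 0             # the target failed its last scrape
        for: 5m                   # condition must hold for 5 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
      - alert: HighCpuUsage
        # assumes node_exporter is installed on the monitored hosts
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage on {{ $labels.instance }} is above 90%"
```

The labels attached here (for example severity) are what Alertmanager later uses for grouping, routing and inhibition.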
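1. Concepts
1.1) Alert grouping
Grouping means that Alertmanager collects alerts of the same kind into a group and merges them into a single notification. In real environments, especially cloud environments where services are tightly coupled, an outage of several machines can trigger hundreds or even thousands of alerts. With grouping, those alerts are merged into one notification, so administrators are not flooded with a sudden burst of notifications that makes it hard to locate the problem quickly.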
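A minimal sketch of grouping in alertmanager.yml could look like the following. The receiver name and webhook URL are placeholders; the grouping labels assume that alerts carry a cluster label.

```yaml
route:
  receiver: ops-team                      # hypothetical receiver name
  group_by: [cluster, alertname]          # alerts sharing these label values form one group
  group_wait: 30s                         # wait before the first notification of a new group
  group_interval: 5m                      # wait before notifying about new alerts in the group

receivers:
  - name: ops-team
    webhook_configs:
      - url: "http://127.0.0.1:5001/"     # placeholder endpoint
```

Under these assumptions, a hundred InstanceDown alerts from the same cluster arrive as one notification instead of a hundred.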
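1.2) Alert inhibition
Inhibition means that once a given alert is firing, Alertmanager stops sending notifications for other alerts that are caused by it. In production, for example in a colocated IDC rack whose access layer is a single switch, a failure of that switch makes every server in the rack report a non-UP state, and the applications deployed on those servers become unreachable and trigger alerts as well. In this case Alertmanager can be configured to suppress the alerts for the servers and applications in the rack that are merely a consequence of the switch failure.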
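The rack scenario above might be expressed roughly as follows. This is only a sketch: the alert names and the rack label are hypothetical and would have to match the labels produced by your own alerting rules.

```yaml
inhibit_rules:
  - source_matchers:
      - alertname="RackSwitchDown"                 # hypothetical alert for the access switch
    target_matchers:
      - alertname=~"ServerDown|AppUnreachable"     # hypothetical alerts caused by the outage
    equal: [rack]                                  # only inhibit alerts from the same rack
```

While RackSwitchDown is firing for a rack, notifications for the listed target alerts carrying the same rack value are suppressed.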
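1.3) Alert silences
Silences provide a simple mechanism for quickly muting alerts based on their labels. Incoming alerts are checked against the active silences; if an alert matches a silence, Alertmanager sends no notification for it. Administrators can set up such temporary mutes directly in the Alertmanager web UI.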
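2. Deployment and Installation
2.1) Installing from the binary release
GitHub repository: https://github.com/prometheus/alertmanager
Official download page: https://prometheus.io/download/
Alertmanager is also written in Go, so it can be installed quickly by downloading the official binary package and extracting it.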
Prometheus Alertmanager (prometheus/alertmanager) 0.24.0 / 2022-03-24

OS | Arch | Size | SHA256 Checksum
---|---|---|---
darwin | amd64 | 24.34 MiB | a726fa727b6d8e422665a93f90c85b8bca15e2821594288944e21cbe340c68f1
linux | amd64 | 24.68 MiB | 6fa982232ae7be5524f5b3cd4e90b7c6a75086368200c917644eaa3e806555dc
windows | amd64 | 25.13 MiB | 1a29cdd112bbc3e76a95f951ce6679a5fd9f424ed2964d1065ab9706f70e9ec0
2.1.1) After downloading Alertmanager, compute the package's hash and compare it with the checksum published on the download page
[root@iZj6cdj8nbj0zla2lnq2g9Z zy]# wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
--2022-08-01 20:14:49--  http://wget/
Resolving wget (wget)... failed: Name or service not known.
wget: unable to resolve host address ‘wget’
--2022-08-01 20:14:49--  https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/11452538/d3143d23-5eab-4248-ae7b-2d04eb2e6084?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20220801%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220801T121430Z&X-Amz-Expires=300&X-Amz-Signature=7d06247b82eb679c427703cda2f31b6ca64b76806dcbb09fd297af880e842f83&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=11452538&response-content-disposition=attachment%3B%20filename%3Dalertmanager-0.24.0.linux-amd64.tar.gz&response-content-type=application%2Foctet-stream [following]
--2022-08-01 20:14:50--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/11452538/d3143d23-5eab-4248-ae7b-2d04eb2e6084?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20220801%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220801T121430Z&X-Amz-Expires=300&X-Amz-Signature=7d06247b82eb679c427703cda2f31b6ca64b76806dcbb09fd297af880e842f83&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=11452538&response-content-disposition=attachment%3B%20filename%3Dalertmanager-0.24.0.linux-amd64.tar.gz&response-content-type=application%2Foctet-stream
Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25880024 (25M) [application/octet-stream]
Saving to: ‘alertmanager-0.24.0.linux-amd64.tar.gz’

100%[============================================================================>] 25,880,024  4.02MB/s   in 6.5s

2022-08-01 20:14:56 (3.82 MB/s) - ‘alertmanager-0.24.0.linux-amd64.tar.gz’ saved [25880024/25880024]

FINISHED --2022-08-01 20:14:56--
Total wall clock time: 7.1s
Downloaded: 1 files, 25M in 6.5s (3.82 MB/s)
[root@iZj6cdj8nbj0zla2lnq2g9Z zy]# sha256sum alertmanager-0.24.0.linux-amd64.tar.gz
6fa982232ae7be5524f5b3cd4e90b7c6a75086368200c917644eaa3e806555dc  alertmanager-0.24.0.linux-amd64.tar.gz
The computed hash matches the SHA256 checksum published for the linux amd64 package.
2.1.2) Extract the package into the chosen installation directory
[root@iZj6cdj8nbj0zla2lnq2g9Z zy]# tar -zxvf alertmanager-0.24.0.linux-amd64.tar.gz
alertmanager-0.24.0.linux-amd64/
alertmanager-0.24.0.linux-amd64/alertmanager.yml
alertmanager-0.24.0.linux-amd64/LICENSE
alertmanager-0.24.0.linux-amd64/NOTICE
alertmanager-0.24.0.linux-amd64/alertmanager
alertmanager-0.24.0.linux-amd64/amtool
[root@iZj6cdj8nbj0zla2lnq2g9Z zy]# mkdir -pv /usr/local/alertmanager/release/
mkdir: created directory ‘/usr/local/alertmanager’
mkdir: created directory ‘/usr/local/alertmanager/release/’
[root@iZj6cdj8nbj0zla2lnq2g9Z zy]# mv alertmanager-0.24.0.linux-amd64 /usr/local/alertmanager/release/
[root@iZj6cdj8nbj0zla2lnq2g9Z zy]# cd /usr/local/alertmanager/
[root@iZj6cdj8nbj0zla2lnq2g9Z alertmanager]# ls -l
total 4
drwxr-xr-x 3 root root 4096 Aug  1 20:19 release
[root@iZj6cdj8nbj0zla2lnq2g9Z alertmanager]# ln -nsvf release/alertmanager-0.24.0.linux-amd64 current
‘current’ -> ‘release/alertmanager-0.24.0.linux-amd64’
[root@iZj6cdj8nbj0zla2lnq2g9Z alertmanager]# tree .
.
├── current -> release/alertmanager-0.24.0.linux-amd64
└── release
    └── alertmanager-0.24.0.linux-amd64
        ├── alertmanager
        ├── alertmanager.yml
        ├── amtool
        ├── LICENSE
        └── NOTICE

3 directories, 5 files
2.1.3) Register alertmanager as a system service that starts at boot
Save the following unit file as /usr/lib/systemd/system/alertmanager.service, replacing the address in --web.external-url with the address of your own host:
[Unit]
Description=AlertManager Server Service daemon
Wants=network-online.target
After=network-online.target

[Service]
User=root
Group=root
Type=simple
Restart=on-failure
ExecStart=/usr/local/alertmanager/current/alertmanager \
    --config.file "/usr/local/alertmanager/current/alertmanager.yml" \
    --storage.path="/usr/local/alertmanager/current/data" \
    --data.retention=120h \
    --alerts.gc-interval=30m \
    --web.external-url "http://192.168.2.121:9093" \
    --web.listen-address=":9093"

[Install]
WantedBy=multi-user.target
2.1.4) Start the service
[root@iZj6cdj8nbj0zla2lnq2g9Z current]# systemctl daemon-reload
[root@iZj6cdj8nbj0zla2lnq2g9Z current]# systemctl enable alertmanager.service
Created symlink from /etc/systemd/system/multi-user.target.wants/alertmanager.service to /usr/lib/systemd/system/alertmanager.service.
[root@iZj6cdj8nbj0zla2lnq2g9Z current]# systemctl start alertmanager.service
[root@iZj6cdj8nbj0zla2lnq2g9Z current]# systemctl status alertmanager.service
● alertmanager.service - AlertManager Server Service daemon
   Loaded: loaded (/usr/lib/systemd/system/alertmanager.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2022-08-01 20:30:17 CST; 6s ago
 Main PID: 24760 (alertmanager)
    Tasks: 7
   Memory: 13.4M
   CGroup: /system.slice/alertmanager.service
           └─24760 /usr/local/alertmanager/current/alertmanager --config.file /usr/lo...

Aug 01 20:30:17 iZj6cdj8nbj0zla2lnq2g9Z systemd[1]: Started AlertManager Server Serv....
Aug 01 20:30:17 iZj6cdj8nbj0zla2lnq2g9Z alertmanager[24760]: ts=2022-08-01T12:30:17.9...
Aug 01 20:30:17 iZj6cdj8nbj0zla2lnq2g9Z alertmanager[24760]: ts=2022-08-01T12:30:17.9...
Aug 01 20:30:17 iZj6cdj8nbj0zla2lnq2g9Z alertmanager[24760]: ts=2022-08-01T12:30:17.9...
Aug 01 20:30:17 iZj6cdj8nbj0zla2lnq2g9Z alertmanager[24760]: ts=2022-08-01T12:30:17.9...
Aug 01 20:30:17 iZj6cdj8nbj0zla2lnq2g9Z alertmanager[24760]: ts=2022-08-01T12:30:17.9...
Aug 01 20:30:17 iZj6cdj8nbj0zla2lnq2g9Z alertmanager[24760]: ts=2022-08-01T12:30:17.9...
Aug 01 20:30:17 iZj6cdj8nbj0zla2lnq2g9Z alertmanager[24760]: ts=2022-08-01T12:30:17.9...
Aug 01 20:30:17 iZj6cdj8nbj0zla2lnq2g9Z alertmanager[24760]: ts=2022-08-01T12:30:17.9...
Aug 01 20:30:19 iZj6cdj8nbj0zla2lnq2g9Z alertmanager[24760]: ts=2022-08-01T12:30:19.9...
Hint: Some lines were ellipsized, use -l to show in full.
2.2) Installing with Docker
2.2.1) Pull the Alertmanager image
docker pull prom/alertmanager:v0.24.0
2.2.2) Run the Alertmanager image
Official example:
$ docker run --name alertmanager -d -p 127.0.0.1:9093:9093 quay.io/prometheus/alertmanager
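If you prefer a declarative setup over a one-off docker run, the same container can be described in a Compose file. The sketch below is not part of the official example: it assumes Docker Compose is available, that a local ./alertmanager.yml exists next to the Compose file, and that the container reads its configuration from /etc/alertmanager/alertmanager.yml.

```yaml
# docker-compose.yml (illustrative sketch)
services:
  alertmanager:
    image: prom/alertmanager:v0.24.0
    container_name: alertmanager
    ports:
      - "127.0.0.1:9093:9093"          # same port mapping as the docker run example
    volumes:
      # assumed local config file, mounted read-only into the container
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"
    restart: unless-stopped
```

Mounting the configuration from the host keeps it editable without rebuilding or re-creating the image.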
2.3) Installing on Kubernetes
2.3.1) The Kubernetes manifest is as follows:
kind: Pod
apiVersion: v1
metadata:
  name: alertmanager-main-0
  generateName: alertmanager-main-
  namespace: kubesphere-monitoring-system
  labels:
    alertmanager: main
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/managed-by: prometheus-operator
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.23.0
    controller-revision-hash: alertmanager-main-cd5bc8fdc
    statefulset.kubernetes.io/pod-name: alertmanager-main-0
  annotations:
    cni.projectcalico.org/containerID: bcc41ec96a5168bb0d4847fccba5329b0e0af5302f84cbb6eaa3f1696b1c16a1
    cni.projectcalico.org/podIP: 10.233.122.54/32
    cni.projectcalico.org/podIPs: 10.233.122.54/32
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "k8s-pod-network",
          "ips": [
              "10.233.122.54"
          ],
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "k8s-pod-network",
          "ips": [
              "10.233.122.54"
          ],
          "default": true,
          "dns": {}
      }]
    kubectl.kubernetes.io/default-container: alertmanager
spec:
  volumes:
    - name: config-volume
      secret:
        secretName: alertmanager-main-generated
        defaultMode: 420
    - name: tls-assets
      projected:
        sources:
          - secret:
              name: alertmanager-main-tls-assets-0
        defaultMode: 420
    - name: alertmanager-main-db
      emptyDir: {}
    - name: kube-api-access-l48d9
      projected:
        sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              name: kube-root-ca.crt
              items:
                - key: ca.crt
                  path: ca.crt
          - downwardAPI:
              items:
                - path: namespace
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
        defaultMode: 420
  containers:
    - name: alertmanager
      image: 'registry.cn-beijing.aliyuncs.com/kubesphereio/alertmanager:v0.24.0'
      args:
        - '--config.file=/etc/alertmanager/config/alertmanager.yaml'
        - '--storage.path=/alertmanager'
        - '--data.retention=120h'
        - '--cluster.listen-address='
        - '--web.listen-address=:9093'
        - '--web.route-prefix=/'
        - '--cluster.peer=alertmanager-main-0.alertmanager-operated:9094'
        - '--cluster.reconnect-timeout=5m'
      ports:
        - name: web
          containerPort: 9093
          protocol: TCP
        - name: mesh-tcp
          containerPort: 9094
          protocol: TCP
        - name: mesh-udp
          containerPort: 9094
          protocol: UDP
      env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
      resources:
        limits:
          cpu: 200m
          memory: 200Mi
        requests:
          cpu: 20m
          memory: 30Mi
      volumeMounts:
        - name: config-volume
          mountPath: /etc/alertmanager/config
        - name: tls-assets
          readOnly: true
          mountPath: /etc/alertmanager/certs
        - name: alertmanager-main-db
          mountPath: /alertmanager
        - name: kube-api-access-l48d9
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      livenessProbe:
        httpGet:
          path: /-/healthy
          port: web
          scheme: HTTP
        timeoutSeconds: 3
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 10
      readinessProbe:
        httpGet:
          path: /-/ready
          port: web
          scheme: HTTP
        initialDelaySeconds: 3
        timeoutSeconds: 3
        periodSeconds: 5
        successThreshold: 1
        failureThreshold: 10
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: FallbackToLogsOnError
      imagePullPolicy: IfNotPresent
      securityContext:
        capabilities:
          drop:
            - ALL
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
    - name: config-reloader
      image: >-
        registry.cn-beijing.aliyuncs.com/kubesphereio/prometheus-config-reloader:v0.55.1
      command:
        - /bin/prometheus-config-reloader
      args:
        - '--listen-address=:8080'
        - '--reload-url=http://localhost:9093/-/reload'
        - '--watched-dir=/etc/alertmanager/config'
      ports:
        - name: reloader-web
          containerPort: 8080
          protocol: TCP
      env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: SHARD
          value: '-1'
      resources:
        limits:
          cpu: 100m
          memory: 50Mi
        requests:
          cpu: 100m
          memory: 50Mi
      volumeMounts:
        - name: config-volume
          readOnly: true
          mountPath: /etc/alertmanager/config
        - name: kube-api-access-l48d9
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: FallbackToLogsOnError
      imagePullPolicy: IfNotPresent
      securityContext:
        capabilities:
          drop:
            - ALL
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
  restartPolicy: Always
  terminationGracePeriodSeconds: 120
  dnsPolicy: ClusterFirst
  nodeSelector:
    kubernetes.io/os: linux
  serviceAccountName: alertmanager-main
  serviceAccount: alertmanager-main
  nodeName: k8s-workernode-2-129
  securityContext:
    runAsUser: 1000
    runAsNonRoot: true
    fsGroup: 2000
  hostname: alertmanager-main-0
  subdomain: alertmanager-operated
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: alertmanager
                  operator: In
                  values:
                    - main
            namespaces:
              - kubesphere-monitoring-system
            topologyKey: kubernetes.io/hostname
  schedulerName: default-scheduler
  tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
  priority: 0
  enableServiceLinks: true
  preemptionPolicy: PreemptLowerPriority
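3. Alertmanager Configuration
Above, we installed Alertmanager quickly and used the alertmanager.yml file that ships with the package, which is written in YAML. As the listing below shows, an Alertmanager configuration file usually consists of these main sections: global (global settings), templates (notification templates), route (alert routing), receivers and inhibit_rules (inhibition rules). Reference: https://prometheus.io/docs/alerting/latest/configuration/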
global:
  # The default SMTP From header field.
  [ smtp_from: <tmpl_string> ]
  # The default SMTP smarthost used for sending emails, including port number.
  # Port number usually is 25, or 587 for SMTP over TLS (sometimes referred to as STARTTLS).
  # Example: smtp.example.org:587
  [ smtp_smarthost: <string> ]
  # The default hostname to identify to the SMTP server.
  [ smtp_hello: <string> | default = "localhost" ]
  # SMTP Auth using CRAM-MD5, LOGIN and PLAIN. If empty, Alertmanager doesn't authenticate to the SMTP server.
  [ smtp_auth_username: <string> ]
  # SMTP Auth using LOGIN and PLAIN.
  [ smtp_auth_password: <secret> ]
  # SMTP Auth using PLAIN.
  [ smtp_auth_identity: <string> ]
  # SMTP Auth using CRAM-MD5.
  [ smtp_auth_secret: <secret> ]
  # The default SMTP TLS requirement.
  # Note that Go does not support unencrypted connections to remote SMTP endpoints.
  [ smtp_require_tls: <bool> | default = true ]

  # The API URL to use for Slack notifications.
  [ slack_api_url: <secret> ]
  [ slack_api_url_file: <filepath> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_key_file: <filepath> ]
  [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
  [ wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]
  [ telegram_api_url: <string> | default = "https://api.telegram.org" ]
  # The default HTTP client configuration
  [ http_config: <http_config> ]

  # ResolveTimeout is the default value used by alertmanager if the alert does
  # not include EndsAt, after this time passes it can declare the alert as resolved if it has not been updated.
  # This has no impact on alerts from Prometheus, as they always include EndsAt.
  [ resolve_timeout: <duration> | default = 5m ]

# Files from which custom notification template definitions are read.
# The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'.
templates:
  [ - <filepath> ... ]

# The root node of the routing tree.
route: <route>

# A list of notification receivers.
receivers:
  - <receiver> ...

# A list of inhibition rules.
inhibit_rules:
  [ - <inhibit_rule> ... ]

# DEPRECATED: use time_intervals below.
# A list of mute time intervals for muting routes.
mute_time_intervals:
  [ - <mute_time_interval> ... ]

# A list of time intervals for muting/activating routes.
time_intervals:
  [ - <time_interval> ... ]
3.1) global
This is the global configuration. Options set in the global section are shared: they serve as defaults for the other sections and can be overridden there. resolve_timeout is the time after which Alertmanager declares an alert resolved if it has not been updated and carries no end time; its value can affect when the resolved notification arrives, and it defaults to 5 minutes. If alerts are to be delivered by email, configure the SMTP server and related settings used for sending mail, as follows:
3.1.1) Configure the SMTP service
global:
  # The default SMTP From header field.
  [ smtp_from: <tmpl_string> ]                      # sender address used for outgoing mail
  # The default SMTP smarthost used for sending emails, including port number.
  # Port number usually is 25, or 587 for SMTP over TLS (sometimes referred to as STARTTLS).
  # Example: smtp.example.org:587
  [ smtp_smarthost: <string> ]                      # address of the SMTP server (smarthost)
  # The default hostname to identify to the SMTP server.
  [ smtp_hello: <string> | default = "localhost" ]
  # SMTP Auth using CRAM-MD5, LOGIN and PLAIN. If empty, Alertmanager doesn't authenticate to the SMTP server.
  [ smtp_auth_username: <string> ]                  # mailbox user name
  # SMTP Auth using LOGIN and PLAIN.
  [ smtp_auth_password: <secret> ]                  # mailbox password
  # SMTP Auth using PLAIN.
  [ smtp_auth_identity: <string> ]
  # SMTP Auth using CRAM-MD5.
  [ smtp_auth_secret: <secret> ]
  # The default SMTP TLS requirement.
  # Note that Go does not support unencrypted connections to remote SMTP endpoints.
  [ smtp_require_tls: <bool> | default = true ]     # set to true to require TLS
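Filled in with concrete values, the SMTP part of the global section might look like the sketch below. The host name, addresses and password are placeholders under the example.org domain, not values from this article.

```yaml
global:
  resolve_timeout: 5m
  smtp_from: "alertmanager@example.org"            # placeholder sender address
  smtp_smarthost: "smtp.example.org:587"           # placeholder SMTP server and port
  smtp_auth_username: "alertmanager@example.org"   # placeholder account
  smtp_auth_password: "change-me"                  # placeholder; keep real secrets out of version control
  smtp_require_tls: true
```

Any email_configs entry in a receiver inherits these values unless it overrides them.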
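3.2) templates
Alert templates let you customize the appearance of notifications and the alert data they contain. The templates section holds a list of paths to existing template files, for example: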
templates:
- '/usr/local/alertmanager/current/templates/*.tmpl'
Alertmanager loads the template files under this path at startup. You can define your own alert templates.
3.3) route
The route section describes the rules by which alerts generated by Prometheus server are delivered to the destination specified by a receiver. Alertmanager matches each incoming alert against the defined rules. Together the routes form a tree: the first route is the root node and the routes nested under it are child nodes. Every alert enters the tree at the root and is matched by a depth-first, left-to-right traversal, stopping at the node it matches. If an alert matches none of a node's children, it is handled according to that node's own settings.
3.3.1) Common route configuration options
[ receiver: <string> ]
# The labels by which incoming alerts are grouped together. For example,
# multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# be batched into a single group.
#
# To aggregate by all possible labels use the special value '...' as the sole label name, for example:
# group_by: ['...']
# This effectively disables aggregation entirely, passing through all
# alerts as-is. This is unlikely to be what you want, unless you have
# a very low alert volume or your upstream notification system performs
# its own grouping.
[ group_by: '[' <labelname>, ... ']' ]
# Whether an alert should continue matching subsequent sibling nodes.
[ continue: <boolean> | default = false ]

# DEPRECATED: Use matchers below.
# A set of equality matchers an alert has to fulfill to match the node.
match:
  [ <labelname>: <labelvalue>, ... ]

# DEPRECATED: Use matchers below.
# A set of regex-matchers an alert has to fulfill to match the node.
match_re:
  [ <labelname>: <regex>, ... ]

# A list of matchers that an alert has to fulfill to match the node.
matchers:
  [ - <matcher> ... ]

# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> | default = 30s ]

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ]

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> | default = 4h ]

# Times when the route should be muted. These must match the name of a
# mute time interval defined in the mute_time_intervals section.
# Additionally, the root node cannot have any mute times.
# When a route is muted it will not send any notifications, but
# otherwise acts normally (including ending the route-matching process
# if the `continue` option is not set.)
mute_time_intervals:
  [ - <string> ...]

# Times when the route should be active. These must match the name of a
# time interval defined in the time_intervals section. An empty value
# means that the route is always active.
# Additionally, the root node cannot have any active times.
# The route will send notifications only when active, but otherwise
# acts normally (including ending the route-matching process
# if the `continue` option is not set).
active_time_intervals:
  [ - <string> ...]

# Zero or more child routes.
routes:
  [ - <route> ... ]
3.3.2) A route configuration example
# The root route with all parameters, which are inherited by the child
# routes if they are not overwritten.
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  # All alerts that do not match the following child routes
  # will remain at the root node and be dispatched to 'default-receiver'.
  routes:
  # All alerts with service=mysql or service=cassandra
  # are dispatched to the database pager.
  - receiver: 'database-pager'
    group_wait: 10s
    matchers:
    - service=~"mysql|cassandra"
  # All alerts with the team=frontend label match this sub-route.
  # They are grouped by product and environment rather than cluster
  # and alertname.
  - receiver: 'frontend-pager'
    group_by: [product, environment]
    matchers:
    - team="frontend"
  # All alerts with the service=inhouse-service label match this sub-route.
  # the route will be muted during offhours and holidays time intervals.
  # even if it matches, it will continue to the next sub-route
  - receiver: 'dev-pager'
    matchers:
    - service="inhouse-service"
    mute_time_intervals:
    - offhours
    - holidays
    continue: true
  # All alerts with the service=inhouse-service label match this sub-route
  # the route will be active only during offhours and holidays time intervals.
  - receiver: 'on-call-pager'
    matchers:
    - service="inhouse-service"
    active_time_intervals:
    - offhours
    - holidays
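The route example refers to two named intervals, offhours and holidays, which have to be defined at the top level of the configuration. The definition below is only a sketch: the working hours and the single holiday date are assumptions made for illustration, and the intervals are declared under time_intervals, the key the reference listing marks as the non-deprecated one.

```yaml
time_intervals:
  - name: offhours
    time_intervals:
      # weekdays outside an assumed 09:00-17:00 working day
      - weekdays: ["monday:friday"]
        times:
          - start_time: "00:00"
            end_time: "09:00"
          - start_time: "17:00"
            end_time: "24:00"
      # the whole weekend
      - weekdays: ["saturday", "sunday"]
  - name: holidays
    time_intervals:
      # a single assumed holiday: 25 December
      - months: ["december"]
        days_of_month: ["25"]
```

Check which time zone your Alertmanager version uses when evaluating these intervals before relying on the boundaries.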
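3.3.3) Route options
Option and description:
[ receiver: <string> ]
    Name of the receiver that alert notifications are sent to.
[ group_by: '[' <labelname>, ... ']' ]
    Labels to group by. Alerts that have the same values for the labels listed in group_by are merged into a single notification to the receiver; this is how alert grouping is implemented.
[ continue: <boolean> | default = false ]
    If false, matching stops at the first node the alert matches and subsequent sibling nodes are ignored; if true, the alert goes on to be matched against subsequent sibling nodes.
match: [ <labelname>: <labelvalue>, ... ]
    Equality matching on strings: checks whether the alert has the label labelname with a value equal to labelvalue. Deprecated in favor of matchers.
match_re: [ <labelname>: <regex>, ... ]
    Regular-expression matching: checks whether the alert's label value matches the regular expression. Deprecated in favor of matchers.
matchers: [ - <matcher> ... ]
    List of matchers that an alert has to fulfill to match the node.
[ group_wait: <duration> | default = 30s ]
    How long to wait between receiving the first alert of a group and sending the notification. Alerts that arrive for the same group during this time are merged into the same notification. Defaults to 30 seconds.
[ group_interval: <duration> | default = 5m ]
    How long to wait before sending a notification about new alerts added to a group for which a notification has already been sent. Defaults to 5 minutes.
[ repeat_interval: <duration> | default = 4h ]
    How long to wait before re-sending a notification that has already been sent successfully for the same alert. Defaults to 4 hours.
routes: [ - <route> ... ]
    Child routes used for further matching.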
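3.3.4) Viewing the routing tree of an Alertmanager configuration in a web browser
- 3.3.4.1) Open https://prometheus.io/webtools/alerting/routing-tree-editor/ in a browser.
- 3.3.4.2) Paste the contents of the Alertmanager configuration file into the editor and click the "Draw Routing Tree" button to see the routing structure, here using the Alertmanager example above.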
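3.4) receivers
Receiver is a general term: each receiver needs a globally unique name and corresponds to one or more notification methods, including email, WeChat, PagerDuty, Slack and webhook. The receiver options currently provided officially are as follows (https://prometheus.io/docs/alerting/latest/configuration/#receiver):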
# The unique name of the receiver.
name: <string>          # name of the receiver

# Configurations for several notification integrations.
email_configs:          # the commonly used email notifications
  [ - <email_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
pushover_configs:
  [ - <pushover_config>, ... ]
slack_configs:
  [ - <slack_config>, ... ]
sns_configs:
  [ - <sns_config>, ... ]
victorops_configs:
  [ - <victorops_config>, ... ]
webhook_configs:        # the recommended way to build custom notification integrations
  [ - <webhook_config>, ... ]
wechat_configs:         # WeChat notifications are supported
  [ - <wechat_config>, ... ]
telegram_configs:
  [ - <telegram_config>, ... ]
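As a sketch of how these options are used, the fragment below defines one email receiver and one webhook receiver. The receiver names, the mail address and the webhook URL are placeholders; the email receiver additionally assumes that the SMTP settings are present in the global section.

```yaml
receivers:
  - name: ops-email                       # hypothetical receiver name
    email_configs:
      - to: "ops@example.org"             # placeholder recipient
        send_resolved: true               # also notify when the alert is resolved
  - name: ops-webhook                     # hypothetical receiver name
    webhook_configs:
      - url: "http://127.0.0.1:5001/"     # placeholder endpoint of a custom integration
        send_resolved: true
```

A route selects one of these receivers by referring to its name in the receiver option.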
3.5) inhibit_rules
The inhibit_rules section implements alert inhibition: it specifies the conditions under which certain alerts are ignored. This can be used to express priorities, for example so that when several alerts in the same group fire together, one is handled first and the others are ignored. Well-chosen inhibition rules reduce the number of "junk" alerts.
3.5.1) Common inhibit_rules configuration options
# DEPRECATED: Use target_matchers below.
# Matchers that have to be fulfilled in the alerts to be muted.
target_match:
  [ <labelname>: <labelvalue>, ... ]
# DEPRECATED: Use target_matchers below.
target_match_re:
  [ <labelname>: <regex>, ... ]

# A list of matchers that have to be fulfilled by the target
# alerts to be muted.
target_matchers:
  [ - <matcher> ... ]

# DEPRECATED: Use source_matchers below.
# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
source_match:
  [ <labelname>: <labelvalue>, ... ]
# DEPRECATED: Use source_matchers below.
source_match_re:
  [ <labelname>: <regex>, ... ]

# A list of matchers for which one or more alerts have
# to exist for the inhibition to take effect.
source_matchers:
  [ - <matcher> ... ]

# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]
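3.5.2) An inhibit_rules configuration example
# An inhibition rule mutes alerts matching one set of matchers (the target)
# while an alert matching another set of matchers (the source) is firing.
# Both alerts must have the same values for a given set of labels.
inhibit_rules:                       # inhibition rules
  - source_match:                    # while an alert with these source labels fires, alerts with the target labels are inhibited
      alertname: 'TORouterDown'
    target_match_re:
      alertname: '.*Unreachable'     # regex match on the target label value, e.g. RedisUnreachable
    equal: ['dc', 'rack']            # inhibit only when these labels have equal values in both alerts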
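While an alert with alertname: 'TORouterDown' is firing, target alerts whose alertname matches the regular expression '.*Unreachable' are inhibited and not sent, provided the source and target alerts have the same values for the labels listed in equal.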
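Since the option listing marks source_match and target_match_re as deprecated, the same TORouterDown rule can also be sketched with source_matchers and target_matchers. This is a rewrite for illustration under the matcher syntax shown in the route example, not a fragment from the official documentation.

```yaml
inhibit_rules:
  - source_matchers:
      - alertname="TORouterDown"          # the alert that causes the inhibition
    target_matchers:
      - alertname=~".*Unreachable"        # regex match, e.g. RedisUnreachable
    equal: [dc, rack]                     # both alerts must agree on these labels
```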
4. Contents of the Default Alertmanager Configuration File
/alertmanager $ cat /etc/alertmanager/alertmanager.yml
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
/alertmanager $