Prometheus Series [3]: Alerting
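Alerting is handled mainly by Alertmanager: alerting rules configured in Prometheus fire the alerts, and Alertmanager then delivers the notifications.
Alertmanager supports many notification channels, such as email, SMS, WeChat, and DingTalk.
This post covers WeChat and email alerts.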
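For context, the Prometheus side of this setup might look roughly like the sketch below. It is not taken from the original setup: the file name `rules/instance.yml`, the `InstanceDown` rule, and the Alertmanager address `localhost:9093` are all assumptions chosen for illustration. The first document is a rule file, the second is the matching fragment of `prometheus.yml`.

```yaml
# rules/instance.yml (hypothetical file name)
groups:
  - name: instance-alerts
    rules:
      - alert: InstanceDown            # becomes the "alertname" label
        expr: up == 0                  # target failed its last scrape
        for: 1m                        # must hold for 1 minute before firing
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 1 minute."
---
# prometheus.yml (fragment; the Alertmanager address is an assumption)
rule_files:
  - "rules/*.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]
```

The `severity` label and the `summary` and `description` annotations set here are the values that the notification templates later in this post read.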
Alerting configuration: alertmanager.yml
global:
  resolve_timeout: 1m
  # WeChat alerting
  wechat_api_corp_id: 'xxxxxx'
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_secret: 'xxxxxxx'
  # Email alerting
  smtp_smarthost: 'mail.xxx.com:25'
  smtp_from: '123456@qq.com'
  smtp_auth_username: '123456@qq.com'
  smtp_auth_password: '123456'
  smtp_require_tls: false
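route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 30m
  receiver: 'email'
  routes:
    # Routes are tried in order and the first match wins. With the identical
    # placeholder patterns below, only 'wechat' is notified unless the first
    # route also sets `continue: true`.
    - match_re:
        # Match specific jobs here to send them to different receivers
        job: .*
      receiver: 'wechat'
      repeat_interval: 2h
    - match_re:
        job: .*
      receiver: 'email'
      repeat_interval: 2h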
receivers:
  - name: 'wechat'
    wechat_configs:
      - send_resolved: true
        to_party: 'Prometheus'
        # Send to everyone in the application group
        to_user: '@all'
        agent_id: xxx
        corp_id: 'xxx'
        api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
        api_secret: 'secret of the corresponding WeCom (Enterprise WeChat) application'
        # Use the default template name; see alert-template.tmpl for reference
        message: '{{ template "wechat.default.message" . }}'
  - name: 'email'
    email_configs:
      - to: 'xxx@xxx.com'
        headers: { Subject: "[WARN] Alert email" }
        # The HTML body can be customized freely and can read any value from
        # the alert. The fields below are examples; $alert must be bound by a
        # range over .Alerts before it can be used.
        html: '{{ range $i, $alert := .Alerts }}
          {{ index $alert.Labels "group" }}
          {{ index $alert.Labels "alertname" }}
          {{ index $alert.Labels "status" }}
          {{ index $alert.Labels "instance" }}
          {{ index $alert.Labels "job" }}
          {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
          {{ index $alert.Annotations "summary" }}
          {{ index $alert.Annotations "description" }}
          {{ end }}'
#inhibit_rules:
#  - source_match:
#      severity: 'critical'
#    target_match:
#      severity: 'warning'
#    equal: ['alertname', 'dev', 'instance']
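The `job: .*` patterns in the routes above are placeholders. As a sketch of what per-job routing could look like, the fragment below sends alerts from two different jobs to different receivers. The job name patterns `node.*` and `mysql.*` are hypothetical and do not come from the original setup.

```yaml
route:
  group_by: ['alertname']
  receiver: 'email'            # fallback when no child route matches
  routes:
    # Hypothetical job names; replace with the job labels of your scrape configs.
    - match_re:
        job: node.*
      receiver: 'wechat'
      repeat_interval: 2h
    - match_re:
        job: mysql.*
      receiver: 'email'
      repeat_interval: 2h
```

Because the patterns no longer overlap, each alert reaches exactly one child route. Newer Alertmanager releases also accept a `matchers` list in place of `match_re`, so the exact key may depend on the version in use.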
Alert template: alert-template.tmpl
{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
{{ define "__description" }}{{ end }}
{{ define "wechat.default.message" }}
{{ if gt (len .Alerts.Firing) 0 -}}
Alerts Firing:
{{ range .Alerts.Firing }}
Severity: {{ .Labels.severity }}
Alert: {{ .Labels.alertname }}
Host: {{ .Labels.pod_name }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
================
{{- end }}
{{- end }}
{{ if gt (len .Alerts.Resolved) 0 -}}
Alerts Resolved:
{{ range .Alerts.Resolved }}
Severity: {{ .Labels.severity }}
Alert: {{ .Labels.alertname }}
Host: {{ .Labels.pod_name }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
Resolved at: {{ .EndsAt.Format "2006-01-02 15:04:05" }}
================
{{- end }}
{{- end }}
{{- end }}
{{ define "email.html" }}
<table>
<tr><td>Alert name</td><td>Start time</td></tr>
{{ range $i, $alert := .Alerts }}
<tr><td>{{ index $alert.Labels "alertname" }}</td><td>{{ $alert.StartsAt }}</td></tr>
{{ end }}
</table>
{{ end }}
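For these definitions to take effect, Alertmanager has to load the template file, and the email receiver has to call the `email.html` template by name. The fragment below is a minimal sketch of the corresponding additions to alertmanager.yml. The directory `/etc/alertmanager/templates/` is an assumption; use whatever path actually holds alert-template.tmpl.

```yaml
# Top-level key in alertmanager.yml; the path is an assumption.
templates:
  - '/etc/alertmanager/templates/*.tmpl'

receivers:
  - name: 'email'
    email_configs:
      - to: 'xxx@xxx.com'
        # Render the "email.html" template defined in alert-template.tmpl
        html: '{{ template "email.html" . }}'
```

Referencing the template by name keeps the HTML out of the YAML file, which avoids quoting problems with multi-line template strings.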