Prometheus + Alertmanager + DingTalk Alerting
1. Install and configure Alertmanager
1.1 Download the release package
wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
tar -xf alertmanager-0.24.0.linux-amd64.tar.gz -C /opt/
cd /opt/
mv alertmanager-0.24.0.linux-amd64/ alertmanager
[root@monitoring alertmanager]# vim /etc/systemd/system/alertmanager.service
[root@monitoring alertmanager]# cat /etc/systemd/system/alertmanager.service
[Unit]
Description=Prometheus alertmanager
After=network.target

[Service]
ExecStart=/opt/alertmanager/alertmanager --config.file="/opt/alertmanager/alertmanager.yml"

[Install]
WantedBy=multi-user.target
[root@monitoring alertmanager]#
systemctl enable --now alertmanager
1.2 Configure email as the alert receiver
[root@monitoring alertmanager]# vim alertmanager.yml
[root@monitoring alertmanager]# cat alertmanager.yml
global:
  resolve_timeout: 1m
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '2xxxxx5@qq.com'
  smtp_auth_username: '2xxxxx5@qq.com'
  smtp_auth_password: 'xxxxxxxxxx'  # fill in the mailbox authorization code
  smtp_hello: '@qq.com'
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 2m
  repeat_interval: 5m
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    email_configs:
      - to: '1xxxxxxxx3@139.com'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
[root@monitoring alertmanager]# ./amtool check-config alertmanager.yml
Checking 'alertmanager.yml'  SUCCESS
Found:
 - global config
 - route
 - 1 inhibit rules
 - 1 receivers
 - 0 templates

[root@monitoring alertmanager]#
[root@monitoring alertmanager]# systemctl enable --now alertmanager.service
Created symlink /etc/systemd/system/multi-user.target.wants/alertmanager.service → /etc/systemd/system/alertmanager.service.
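The inhibit rule in this configuration mutes a warning alert while a critical alert with the same alertname, dev and instance labels is firing. The following Python sketch models that matching logic in a simplified way; it is an illustration, not Alertmanager's actual implementation, and the function name is_inhibited and the sample alerts are made up for this example.

```python
# Simplified model of the inhibit rule from alertmanager.yml:
#   source_match: severity=critical
#   target_match: severity=warning
#   equal: [alertname, dev, instance]
# Illustration only; is_inhibited is a hypothetical helper.

EQUAL_LABELS = ["alertname", "dev", "instance"]


def is_inhibited(target, firing_alerts, equal=EQUAL_LABELS):
    """Return True if a warning alert is muted by a firing critical alert."""
    if target.get("severity") != "warning":
        return False
    for source in firing_alerts:
        if source.get("severity") != "critical":
            continue
        # A missing label is treated like an empty value, so two alerts
        # that both lack "dev" still count as equal on that label.
        if all(source.get(name, "") == target.get(name, "") for name in equal):
            return True
    return False


critical = {"alertname": "Pod_all_cpu_usage",
            "instance": "172.16.88.20:51234", "severity": "critical"}
warning_same = {"alertname": "Pod_all_cpu_usage",
                "instance": "172.16.88.20:51234", "severity": "warning"}
warning_other = {"alertname": "Pod_all_cpu_usage",
                 "instance": "10.0.0.1:9100", "severity": "warning"}

print(is_inhibited(warning_same, [critical]))   # same instance: muted
print(is_inhibited(warning_other, [critical]))  # other instance: still notifies
```

Since every rule in this article uses severity: critical, the inhibit rule has nothing to mute here; it only matters once warning-level rules are added.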
[root@monitoring alertmanager]# systemctl status alertmanager.service
● alertmanager.service - Prometheus alertmanager
   Loaded: loaded (/etc/systemd/system/alertmanager.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2022-09-27 22:33:06 CST; 1min 22s ago
 Main PID: 31820 (alertmanager)
    Tasks: 9 (limit: 49440)
   Memory: 14.9M
   CGroup: /system.slice/alertmanager.service
           └─31820 /opt/alertmanager/alertmanager --config.file=/opt/alertmanager/alertmanager.yml

Sep 27 22:33:06 monitoring alertmanager[31820]: ts=2022-09-27T14:33:06.960Z caller=main.go:231 level=info msg="Starting Alertmanager" version="(version=0.24>
Sep 27 22:33:06 monitoring alertmanager[31820]: ts=2022-09-27T14:33:06.960Z caller=main.go:232 level=info build_context="(go=go1.17.8, user=root@265f14f5c6f>
Sep 27 22:33:06 monitoring alertmanager[31820]: ts=2022-09-27T14:33:06.972Z caller=cluster.go:185 level=info component=cluster msg="setting advertise addres>
Sep 27 22:33:06 monitoring alertmanager[31820]: ts=2022-09-27T14:33:06.985Z caller=cluster.go:680 level=info component=cluster msg="Waiting for gossip to se>
Sep 27 22:33:07 monitoring alertmanager[31820]: ts=2022-09-27T14:33:07.135Z caller=coordinator.go:113 level=info component=configuration msg="Loading config>
Sep 27 22:33:07 monitoring alertmanager[31820]: ts=2022-09-27T14:33:07.137Z caller=coordinator.go:126 level=info component=configuration msg="Completed load>
Sep 27 22:33:07 monitoring alertmanager[31820]: ts=2022-09-27T14:33:07.154Z caller=main.go:535 level=info msg=Listening address=:9093
Sep 27 22:33:07 monitoring alertmanager[31820]: ts=2022-09-27T14:33:07.155Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
Sep 27 22:33:08 monitoring alertmanager[31820]: ts=2022-09-27T14:33:08.986Z caller=cluster.go:705 level=info component=cluster msg="gossip not settled" poll>
Sep 27 22:33:16 monitoring alertmanager[31820]: ts=2022-09-27T14:33:16.989Z caller=cluster.go:697 level=info component=cluster msg="gossip settled; proceedi>
[root@monitoring alertmanager]# netstat -tnlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address   Foreign Address   State    PID/Program name
tcp        0      0 0.0.0.0:22      0.0.0.0:*         LISTEN   835/sshd
tcp6       0      0 :::22           :::*              LISTEN   835/sshd
tcp6       0      0 :::3000         :::*              LISTEN   24712/grafana-serve
tcp6       0      0 :::9115         :::*              LISTEN   29832/blackbox_expo
tcp6       0      0 :::9090         :::*              LISTEN   30218/prometheus
tcp6       0      0 :::51234        :::*              LISTEN   24847/node_exporter
tcp6       0      0 :::9093         :::*              LISTEN   31820/alertmanager
tcp6       0      0 :::9094         :::*              LISTEN   31820/alertmanager
tcp6       0      0 :::9256         :::*              LISTEN   24879/process-expor
[root@monitoring alertmanager]#
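Besides checking the listening port, Alertmanager can be probed over HTTP through its /-/healthy endpoint. A minimal Python sketch, assuming the host address 172.16.88.20 used elsewhere in this article; the helper names health_url and is_healthy are chosen for this example only.

```python
import urllib.request


def health_url(base_url):
    """Build the URL of Alertmanager's health endpoint."""
    return base_url.rstrip("/") + "/-/healthy"


def is_healthy(base_url, timeout=3.0):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and socket timeouts are subclasses of OSError.
        return False


if __name__ == "__main__":
    # Assumed address; replace with your own Alertmanager host.
    print(is_healthy("http://172.16.88.20:9093"))
```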
Configure the Prometheus alerting rules
mkdir /opt/prometheus/rules
cd /opt/prometheus/rules/
vim server_rules.yaml
[root@monitoring prometheus]# vim rules/server_rules.yaml
[root@monitoring prometheus]# cat rules/server_rules.yaml
groups:
  - name: alertmanager_pod.rules
    rules:
    - alert: Pod_all_cpu_usage
      expr: (sum by(name)(rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 1
      for: 2m
      labels:
        severity: critical
        service: pods
      annotations:
        description: Container {{ $labels.name }} CPU utilization is above 10% , (current value is {{ $value }})
        summary: Dev CPU load alert
    - alert: Pod_all_memory_usage
      #expr: sort_desc(avg by(name)(irate(container_memory_usage_bytes{name!=""}[5m]))*100) > 10 # memory above 10%
      expr: sort_desc(avg by(name)(irate(node_memory_MemFree_bytes {name!=""}[5m]))) > 2 # memory above 2G
      for: 2m
      labels:
        severity: critical
      annotations:
        description: Container {{ $labels.name }} Memory utilization is above 2G , (current value is {{ $value }})
        summary: Dev Memory load alert
    - alert: Pod_all_network_receive_usage
      expr: sum by (name)(irate(container_network_receive_bytes_total{container_name="POD"}[1m])) > 1
      for: 2m
      labels:
        severity: critical
      annotations:
        description: Container {{ $labels.name }} network_receive utilization is above 50M , (current value is {{ $value }})
    # the alert name below means "pod available memory size"
    - alert: pod 内存可用大小
      expr: node_memory_MemFree_bytes > 1 # deliberately wrong so that the alert fires; for reference, 1 GiB = 1024 x 1024 x 1024 = 1073741824 bytes
      for: 2m
      labels:
        severity: critical
      annotations:
        description: Container available memory is below 100k
[root@monitoring prometheus]# ./promtool check rules rules/server_rules.yaml
Checking rules/server_rules.yaml
  SUCCESS: 4 rules found

[root@monitoring prometheus]#
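The node_memory_* metrics are reported in bytes, so a realistic threshold in a rule expression has to be written in bytes as well. A small Python sketch of the conversion; the helper names are hypothetical and only serve this example.

```python
def gib_to_bytes(gib):
    """Convert GiB to bytes (1 GiB = 1024^3 bytes)."""
    return int(gib * 1024 ** 3)


def mib_to_bytes(mib):
    """Convert MiB to bytes (1 MiB = 1024^2 bytes)."""
    return int(mib * 1024 ** 2)


print(gib_to_bytes(1))   # 1073741824
print(gib_to_bytes(2))   # 2147483648
print(mib_to_bytes(50))  # 52428800
```

With these values, an expression such as node_memory_MemFree_bytes < 1073741824 would fire when free memory drops below 1 GiB, instead of the demo threshold that is always true.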
Configure Prometheus to load the rules and send alerts to Alertmanager
[root@monitoring prometheus]# vim prometheus.yml
[root@monitoring prometheus]# cat prometheus.yml |head -20
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 172.16.88.20:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "/opt/prometheus/rules/*"

# A scrape configuration containing exactly one endpoint to scrape:
[root@monitoring prometheus]# ./promtool check config prometheus.yml
Checking prometheus.yml
  SUCCESS: 1 rule files found
 SUCCESS: prometheus.yml is valid prometheus config file syntax

Checking /opt/prometheus/rules/server_rules.yaml
  SUCCESS: 4 rules found

[root@monitoring prometheus]# systemctl restart prometheus.service
[root@monitoring prometheus]#
Check the Prometheus web UI to see whether the alerting rules have fired
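The same information is available from the Prometheus HTTP API at /api/v1/rules. The Python sketch below lists the alerting rules that are currently firing. The address http://172.16.88.20:9090 is an assumption based on the ports shown in this article, and the sample response is hand-written in the shape of the API output, not captured from a real server.

```python
import json
import urllib.request


def fetch_rules(base_url, timeout=5.0):
    """Fetch the rule groups from the Prometheus HTTP API."""
    url = base_url.rstrip("/") + "/api/v1/rules"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def firing_rules(rules_response):
    """Return (group, rule name) pairs for alerting rules in state 'firing'."""
    result = []
    for group in rules_response["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("type") == "alerting" and rule.get("state") == "firing":
                result.append((group["name"], rule["name"]))
    return result


# Hand-written sample in the shape of the API response (illustrative only).
sample = {
    "status": "success",
    "data": {"groups": [{
        "name": "alertmanager_pod.rules",
        "file": "/opt/prometheus/rules/server_rules.yaml",
        "rules": [
            {"name": "Pod_all_cpu_usage", "type": "alerting", "state": "firing"},
            {"name": "Pod_all_memory_usage", "type": "alerting", "state": "inactive"},
        ],
    }]},
}

print(firing_rules(sample))
# Against a live server (assumed address):
# print(firing_rules(fetch_rules("http://172.16.88.20:9090")))
```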
At this point Alertmanager also shows many alerts
The current alerts can also be listed from the command line with ./amtool alert --alertmanager.url=http://172.16.88.20:9093
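The alerts can be read programmatically as well, through Alertmanager's v2 API at /api/v2/alerts. A minimal Python sketch that counts alerts per alertname; the helper names and the sample alerts are made up for illustration.

```python
import json
import urllib.request
from collections import Counter


def fetch_alerts(base_url, timeout=5.0):
    """Fetch the current alerts from Alertmanager's v2 API."""
    url = base_url.rstrip("/") + "/api/v2/alerts"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def count_by_alertname(alerts):
    """Count alerts per alertname label."""
    return Counter(alert["labels"].get("alertname", "") for alert in alerts)


# Hand-written sample in the shape of the API response (illustrative only).
sample = [
    {"labels": {"alertname": "Pod_all_cpu_usage", "name": "app1"},
     "status": {"state": "active"}},
    {"labels": {"alertname": "Pod_all_cpu_usage", "name": "app2"},
     "status": {"state": "active"}},
    {"labels": {"alertname": "Pod_all_memory_usage", "name": "app1"},
     "status": {"state": "active"}},
]

for name, count in sorted(count_by_alertname(sample).items()):
    print(name, count)
# Against a live server:
# print(count_by_alertname(fetch_alerts("http://172.16.88.20:9093")))
```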
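The alert emails also arrive in the mailbox
1.3 Configure DingTalk as the alert receiver
Configure the DingTalk alert robot
Write test scripts for a robot that uses keyword verification
DingTalk keyword verification: shell script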
[root@monitoring prometheus]# cat /opt/scripts/dingding-keywords.sh
#!/bin/bash
source /etc/profile
#PHONE=$1
#SUBJECT=$2
MESSAGE=$1

/usr/bin/curl -X "POST" 'https://oapi.dingtalk.com/robot/send?access_token=axxxxxxxxxxxxxxxxxxxxxxxxxxxx5d' \
-H 'Content-Type: application/json' \
-d '{"msgtype": "text", "text": { "content": "'${MESSAGE}'" } }'
DingTalk keyword verification: Python script
[root@monitoring prometheus]# cat /opt/scripts/dingding-keywords.py
#!/usr/bin/python3
import sys
import requests
import json

# DingTalk alert:
def info(msg):
    url = 'https://oapi.dingtalk.com/robot/send?access_token=axxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx5d'
    headers = {
        'Content-Type': 'application/json;charset=utf-8'
    }
    formdata = {
        "msgtype": "text",
        "text": {"content":str(msg)}
    }
    #print(formdata)
    requests.post(url=url, data=json.dumps(formdata),headers=headers)

info(sys.argv[1])
Test whether a message can be sent
bash dingding-keywords.sh "node=172.16.88.20:51234,alertname=node内存可用大小"
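The shell script does not handle spaces in the message well: ${MESSAGE} is expanded unquoted, so the shell splits it on whitespace and the JSON body is cut off at the first space. The fields in "node=172.16.88.20:51234,alertname=node内存可用大小" therefore must not contain spaces
The Python script does not have this problem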
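The Python version is unaffected because the whole message is passed as a single string to json.dumps, which also escapes quotes. A short sketch of just the payload-building step, with a made-up message containing spaces and quotes; the function name build_text_payload is hypothetical.

```python
import json


def build_text_payload(message):
    """Serialize a DingTalk text message; spaces and quotes are kept intact."""
    body = {"msgtype": "text", "text": {"content": str(message)}}
    return json.dumps(body, ensure_ascii=False)


# Example message with spaces and embedded double quotes (illustrative).
message = 'node=172.16.88.20:51234, alertname="node memory" is low'
payload = build_text_payload(message)

print(payload)
# Decoding the payload gives back the original message unchanged.
print(json.loads(payload)["text"]["content"] == message)
```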
The Python environment has to be installed beforehand
yum install python38 -y
pip3 install requests
Deploy prometheus-webhook-dingtalk
wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.1.0/prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
tar -xf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz -C /opt/
cd /opt/
mv prometheus-webhook-dingtalk-2.1.0.linux-amd64/ prometheus-webhook-dingtalk
vim /opt/prometheus-webhook-dingtalk/config.yml
## Request timeout
# timeout: 5s

## Uncomment following line in order to write template from scratch (be careful!)
#no_builtin_template: true

## Customizable templates path
templates:
  - contrib/templates/dingding.yml

## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
#default_message:
#  title: '{{ template "legacy.title" . }}'
#  text: '{{ template "legacy.content" . }}'

## Targets, previously was known as "profiles"
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=axxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx5d
    # secret for signature
    secret: SECd1557e7bd1b609a7be1ac1407316caea32fa5ab34a4a529dea67c6684d7ebaf8
  webhook2:
    url: https://oapi.dingtalk.com/robot/send?access_token=axxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx5d
  webhook_legacy:
    url: https://oapi.dingtalk.com/robot/send?access_token=axxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx5d
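The secret under webhook1 is for robots that use DingTalk's signature verification instead of keywords. prometheus-webhook-dingtalk handles the signing itself once secret is set. For reference, the Python sketch below shows the signing scheme as I understand it from DingTalk's robot documentation: HMAC-SHA256 over "timestamp\nsecret", keyed with the secret, then Base64 and URL encoding. The function names and the placeholder secret and token are assumptions for this example.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def dingtalk_sign(secret, timestamp_ms):
    """Compute the URL-encoded signature for a timestamp in milliseconds."""
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(secret.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha256).digest()
    return urllib.parse.quote_plus(base64.b64encode(digest))


def signed_url(webhook_url, secret, timestamp_ms=None):
    """Append the timestamp and sign query parameters to the webhook URL."""
    if timestamp_ms is None:
        timestamp_ms = int(round(time.time() * 1000))
    sign = dingtalk_sign(secret, timestamp_ms)
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={sign}"


# Placeholder token and secret, not real credentials.
print(signed_url("https://oapi.dingtalk.com/robot/send?access_token=xxx",
                 "SECexample", 1700000000000))
```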
vim contrib/templates/dingding.yml
{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}{{ if eq .Status "resolved" }}:{{ .Alerts.Resolved | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
{{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}

{{ define "__text_alert_list" }}{{ range . }}
**Labels**
{{ range .Labels.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**Annotations**
{{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
{{ end }}{{ end }}

{{ define "ding.link.title" }}{{ template "__subject" . }}{{ end }}
{{ define "ding.link.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
{{ template "__text_alert_list" .Alerts.Firing }}
{{ template "__text_alert_list" .Alerts.Resolved }}
{{ end }}
Run the DingTalk webhook service under systemd and enable it at boot
[root@monitoring prometheus-webhook-dingtalk]# vi /etc/systemd/system/dingtalk.service
[root@monitoring prometheus-webhook-dingtalk]# cat /etc/systemd/system/dingtalk.service
[Unit]
Description=Prometheus webhook-dingtalk
After=network.target

[Service]
ExecStart=/opt/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --config.file=/opt/prometheus-webhook-dingtalk/config.yml --web.listen-address="0.0.0.0:8060"

[Install]
WantedBy=multi-user.target
[root@monitoring prometheus-webhook-dingtalk]# systemctl enable --now dingtalk.service
[root@monitoring prometheus-webhook-dingtalk]# systemctl status dingtalk.service
● dingtalk.service - Prometheus webhook-dingtalk
   Loaded: loaded (/etc/systemd/system/dingtalk.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2022-09-28 01:05:55 CST; 8s ago
 Main PID: 34202 (prometheus-webh)
    Tasks: 9 (limit: 49440)
   Memory: 3.6M
   CGroup: /system.slice/dingtalk.service
           └─34202 /opt/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --config.file=/opt/prometheus-webhook-dingtalk/config.yml --web.listen-addre>

Sep 28 01:05:55 monitoring systemd[1]: Started Prometheus webhook-dingtalk.
Sep 28 01:05:55 monitoring prometheus-webhook-dingtalk[34202]: ts=2022-09-27T17:05:55.583Z caller=main.go:59 level=info msg="Starting prometheus-webhook-din>
Sep 28 01:05:55 monitoring prometheus-webhook-dingtalk[34202]: ts=2022-09-27T17:05:55.583Z caller=main.go:60 level=info msg="Build context" (gogo1.18.1,user>
Sep 28 01:05:55 monitoring prometheus-webhook-dingtalk[34202]: ts=2022-09-27T17:05:55.583Z caller=coordinator.go:83 level=info component=configuration file=>
Sep 28 01:05:55 monitoring prometheus-webhook-dingtalk[34202]: ts=2022-09-27T17:05:55.585Z caller=coordinator.go:91 level=info component=configuration file=>
Sep 28 01:05:55 monitoring prometheus-webhook-dingtalk[34202]: ts=2022-09-27T17:05:55.585Z caller=main.go:97 level=info component=configuration msg="Loading>
Sep 28 01:05:55 monitoring prometheus-webhook-dingtalk[34202]: ts=2022-09-27T17:05:55.588Z caller=main.go:113 component=configuration msg="Webhook urls for >
Sep 28 01:05:55 monitoring prometheus-webhook-dingtalk[34202]: ts=2022-09-27T17:05:55.589Z caller=web.go:208 level=info component=web msg="Start listening f>
[root@monitoring prometheus-webhook-dingtalk]# netstat -tnlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address   Foreign Address   State    PID/Program name
tcp        0      0 0.0.0.0:22      0.0.0.0:*         LISTEN   835/sshd
tcp6       0      0 :::22           :::*              LISTEN   835/sshd
tcp6       0      0 :::3000         :::*              LISTEN   24712/grafana-serve
tcp6       0      0 :::9115         :::*              LISTEN   29832/blackbox_expo
tcp6       0      0 :::8060         :::*              LISTEN   34202/prometheus-we
tcp6       0      0 :::9090         :::*              LISTEN   33195/prometheus
tcp6       0      0 :::51234        :::*              LISTEN   24847/node_exporter
tcp6       0      0 :::9093         :::*              LISTEN   31820/alertmanager
tcp6       0      0 :::9094         :::*              LISTEN   31820/alertmanager
tcp6       0      0 :::9256         :::*              LISTEN   24879/process-expor
[root@monitoring prometheus-webhook-dingtalk]#
Configure Alertmanager to send alerts to the webhook
vim /opt/alertmanager/alertmanager.yml
global:
  resolve_timeout: 1m
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '2xxxxxx5@qq.com'
  smtp_auth_username: '2xxxxxx5@qq.com'
  smtp_auth_password: 'yxxxxxxxh'
  smtp_hello: '@qq.com'
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 2m
  repeat_interval: 5m
  receiver: 'dingding-webhook'
receivers:
  - name: 'dingding-webhook'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/webhook1/send'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
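To test the webhook path without waiting for a real alert, a hand-built notification can be posted to the same URL that Alertmanager uses. The Python sketch below builds a minimal body in the shape of Alertmanager's webhook format (version 4). The field values, the localhost URLs for externalURL and generatorURL, and the helper names are assumptions for illustration, and whether the message reaches the group still depends on the robot's security settings.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_webhook_payload(alertname, instance, status="firing"):
    """Build a minimal Alertmanager-style (version 4) webhook body."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    labels = {"alertname": alertname, "instance": instance, "severity": "critical"}
    annotations = {"description": "test alert sent by hand"}
    return {
        "version": "4",
        "groupKey": '{}:{alertname="%s"}' % alertname,
        "status": status,
        "receiver": "dingding-webhook",
        "groupLabels": {"alertname": alertname},
        "commonLabels": labels,
        "commonAnnotations": annotations,
        "externalURL": "http://localhost:9093",  # assumed Alertmanager URL
        "alerts": [{
            "status": status,
            "labels": labels,
            "annotations": annotations,
            "startsAt": now,
            "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": "http://localhost:9090/graph",  # assumed
        }],
    }


def send(payload, url="http://localhost:8060/dingtalk/webhook1/send", timeout=5.0):
    """POST the payload to prometheus-webhook-dingtalk and return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    print(send(build_webhook_payload("TestAlert", "172.16.88.20:51234")))
```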
Modify the Prometheus rules file so that the alerts fire
Restart the Prometheus service
After the fault is resolved