Prometheus + Alertmanager + DingTalk alerting

1. Install and configure Alertmanager

1.1 Download the release package

wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz

tar -xf alertmanager-0.24.0.linux-amd64.tar.gz -C /opt/

cd /opt/

mv alertmanager-0.24.0.linux-amd64/ alertmanager

[root@monitoring alertmanager]# vim /etc/systemd/system/alertmanager.service
[root@monitoring alertmanager]# cat /etc/systemd/system/alertmanager.service
[Unit]
Description=Prometheus alertmanager
After=network.target

[Service]
ExecStart=/opt/alertmanager/alertmanager --config.file="/opt/alertmanager/alertmanager.yml"

[Install]
WantedBy=multi-user.target
[root@monitoring alertmanager]#

systemctl  enable --now alertmanager

1.2 Configure email as the alert receiver

[root@monitoring alertmanager]# vim alertmanager.yml
[root@monitoring alertmanager]# cat alertmanager.yml
global:
    resolve_timeout: 1m
    smtp_smarthost: 'smtp.qq.com:465'
    smtp_from: '2xxxxx5@qq.com'
    smtp_auth_username: '2xxxxx5@qq.com'
    smtp_auth_password: 'xxxxxxxxxx' # fill in the mailbox authorization code
    smtp_hello: '@qq.com'
    smtp_require_tls: false

route:
    group_by: ['alertname']
    group_wait: 10s
    group_interval: 2m
    repeat_interval: 5m
    receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    email_configs:
     - to: '1xxxxxxxx3@139.com'

inhibit_rules: 
  - source_match: 
     severity: 'critical' 
    target_match: 
     severity: 'warning' 
    equal: ['alertname', 'dev', 'instance']
[root@monitoring alertmanager]# ./amtool check-config alertmanager.yml
Checking 'alertmanager.yml'  SUCCESS
Found:
 - global config
 - route
 - 1 inhibit rules
 - 1 receivers
 - 0 templates

[root@monitoring alertmanager]# 
[root@monitoring alertmanager]# systemctl enable --now alertmanager.service 
Created symlink /etc/systemd/system/multi-user.target.wants/alertmanager.service → /etc/systemd/system/alertmanager.service.
[root@monitoring alertmanager]# systemctl status alertmanager.service 
● alertmanager.service - Prometheus alertmanager
   Loaded: loaded (/etc/systemd/system/alertmanager.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2022-09-27 22:33:06 CST; 1min 22s ago
 Main PID: 31820 (alertmanager)
    Tasks: 9 (limit: 49440)
   Memory: 14.9M
   CGroup: /system.slice/alertmanager.service
           └─31820 /opt/alertmanager/alertmanager --config.file=/opt/alertmanager/alertmanager.yml

Sep 27 22:33:06 monitoring alertmanager[31820]: ts=2022-09-27T14:33:06.960Z caller=main.go:231 level=info msg="Starting Alertmanager" version="(version=0.24>
Sep 27 22:33:06 monitoring alertmanager[31820]: ts=2022-09-27T14:33:06.960Z caller=main.go:232 level=info build_context="(go=go1.17.8, user=root@265f14f5c6f>
Sep 27 22:33:06 monitoring alertmanager[31820]: ts=2022-09-27T14:33:06.972Z caller=cluster.go:185 level=info component=cluster msg="setting advertise addres>
Sep 27 22:33:06 monitoring alertmanager[31820]: ts=2022-09-27T14:33:06.985Z caller=cluster.go:680 level=info component=cluster msg="Waiting for gossip to se>
Sep 27 22:33:07 monitoring alertmanager[31820]: ts=2022-09-27T14:33:07.135Z caller=coordinator.go:113 level=info component=configuration msg="Loading config>
Sep 27 22:33:07 monitoring alertmanager[31820]: ts=2022-09-27T14:33:07.137Z caller=coordinator.go:126 level=info component=configuration msg="Completed load>
Sep 27 22:33:07 monitoring alertmanager[31820]: ts=2022-09-27T14:33:07.154Z caller=main.go:535 level=info msg=Listening address=:9093
Sep 27 22:33:07 monitoring alertmanager[31820]: ts=2022-09-27T14:33:07.155Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
Sep 27 22:33:08 monitoring alertmanager[31820]: ts=2022-09-27T14:33:08.986Z caller=cluster.go:705 level=info component=cluster msg="gossip not settled" poll>
Sep 27 22:33:16 monitoring alertmanager[31820]: ts=2022-09-27T14:33:16.989Z caller=cluster.go:697 level=info component=cluster msg="gossip settled; proceedi>
[root@monitoring alertmanager]# netstat -tnlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      835/sshd            
tcp6       0      0 :::22                   :::*                    LISTEN      835/sshd            
tcp6       0      0 :::3000                 :::*                    LISTEN      24712/grafana-serve 
tcp6       0      0 :::9115                 :::*                    LISTEN      29832/blackbox_expo 
tcp6       0      0 :::9090                 :::*                    LISTEN      30218/prometheus    
tcp6       0      0 :::51234                :::*                    LISTEN      24847/node_exporter 
tcp6       0      0 :::9093                 :::*                    LISTEN      31820/alertmanager  
tcp6       0      0 :::9094                 :::*                    LISTEN      31820/alertmanager  
tcp6       0      0 :::9256                 :::*                    LISTEN      24879/process-expor 
[root@monitoring alertmanager]# 
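
To make the inhibit_rules section of alertmanager.yml more concrete, here is a minimal Python sketch of how such a rule decides whether an alert is muted. It assumes, based on Alertmanager's documented behaviour, that a label missing on both alerts counts as equal (which is why the probably unused 'dev' label does no harm). The function names and the second instance address are hypothetical; this is an illustration, not Alertmanager's implementation.

```python
# Sketch of the inhibit rule from alertmanager.yml: a firing 'critical' alert
# mutes 'warning' alerts that carry the same values for the `equal` labels.
# All names are illustrative; this is not Alertmanager's own code.

SOURCE_MATCH = {"severity": "critical"}
TARGET_MATCH = {"severity": "warning"}
EQUAL = ["alertname", "dev", "instance"]


def matches(labels, matchers):
    """Return True if every matcher label has exactly the given value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing_alerts):
    """Return True if `target` would be muted by any alert in `firing_alerts`."""
    if not matches(target, TARGET_MATCH):
        return False
    for source in firing_alerts:
        if not matches(source, SOURCE_MATCH):
            continue
        # A label absent on both sides compares as equal (both empty).
        if all(source.get(name, "") == target.get(name, "") for name in EQUAL):
            return True
    return False


critical = {"alertname": "Pod_all_cpu_usage", "instance": "172.16.88.20:51234", "severity": "critical"}
warning_same = {"alertname": "Pod_all_cpu_usage", "instance": "172.16.88.20:51234", "severity": "warning"}
# Hypothetical second instance, used only to show a non-matching case.
warning_other = {"alertname": "Pod_all_cpu_usage", "instance": "10.0.0.1:9100", "severity": "warning"}

print(is_inhibited(warning_same, [critical]))   # True
print(is_inhibited(warning_other, [critical]))  # False
```

The warning on the same instance is muted because alertname and instance agree with the critical alert; the warning on the other instance still goes out.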

Configure the Prometheus alerting rules

mkdir /opt/prometheus/rules

cd /opt/prometheus/rules/

vim server_rules.yaml

[root@monitoring prometheus]# vim rules/server_rules.yaml 
[root@monitoring prometheus]# cat rules/server_rules.yaml 
groups:
  - name: alertmanager_pod.rules
    rules:
    - alert: Pod_all_cpu_usage
      expr: (sum by(name)(rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 1
      for: 2m
      labels:
        severity: critical
        service: pods
      annotations:
        description: Container {{ $labels.name }} CPU utilization is above 10% (current value is {{ $value }})
        summary: Dev CPU load alert

    - alert: Pod_all_memory_usage
      #expr: sort_desc(avg by(name)(irate(container_memory_usage_bytes{name!=""}[5m]))*100) > 10
      # memory above 10%
      expr: sort_desc(avg by(name)(irate(node_memory_MemFree_bytes {name!=""}[5m]))) > 2 # memory above 2G
      for: 2m
      labels:
        severity: critical
      annotations:
        description: Container {{ $labels.name }} memory usage is above 2G (current value is {{ $value }})
        summary: Dev memory load alert
    
    - alert: Pod_all_network_receive_usage
      expr: sum by (name)(irate(container_network_receive_bytes_total{container_name="POD"}[1m])) > 1
      for: 2m
      labels:
        severity: critical
      annotations:
        description: Container {{ $labels.name }} network_receive usage is above 50M (current value is {{ $value }})
    
    - alert: pod 内存可用大小
      expr: node_memory_MemFree_bytes > 1 # deliberately wrong so that the alert fires; the metric is in bytes, and 1 GiB = 1024*1024*1024 = 1073741824 bytes
      for: 2m
      labels:
        severity: critical
      annotations:
        description: Container available memory is below 100k
[root@monitoring prometheus]# ./promtool check rules rules/server_rules.yaml 
Checking rules/server_rules.yaml
  SUCCESS: 4 rules found

[root@monitoring prometheus]# 
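
The thresholds in the rules above are kept tiny so that the alerts fire during testing. For a realistic rule the threshold has to be expressed in bytes, because node_memory_MemFree_bytes is reported in bytes. The short Python sketch below works out the numbers; the helper name free_memory_expr and the use of "<" for a low-memory condition are my own assumptions, not part of the original rules file.

```python
# node_memory_MemFree_bytes is expressed in bytes, so thresholds written in
# KiB/MiB/GiB have to be converted before they go into a rule expression.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def free_memory_expr(threshold_bytes):
    """Build a PromQL expression that is true while free memory is below the threshold."""
    return f"node_memory_MemFree_bytes < {threshold_bytes}"


print(GIB)                    # 1073741824
print(free_memory_expr(GIB))  # node_memory_MemFree_bytes < 1073741824
```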

Configure Prometheus to load the rules and to send alerts to Alertmanager

[root@monitoring prometheus]# vim prometheus.yml
[root@monitoring prometheus]# cat prometheus.yml |head -20
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
        - 172.16.88.20:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "/opt/prometheus/rules/*"

# A scrape configuration containing exactly one endpoint to scrape:
[root@monitoring prometheus]# ./promtool check config prometheus.yml
Checking prometheus.yml
  SUCCESS: 1 rule files found
 SUCCESS: prometheus.yml is valid prometheus config file syntax

Checking /opt/prometheus/rules/server_rules.yaml
  SUCCESS: 4 rules found

[root@monitoring prometheus]# systemctl restart prometheus.service 
[root@monitoring prometheus]# 

Open the Prometheus web UI and check whether the alerting rules have fired.
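
Besides the web UI, Prometheus exposes its active alerts over the HTTP API at /api/v1/alerts. The following is a rough Python sketch that extracts the firing alert names from such a response. The base URL assumes Prometheus listens on port 9090 of the same host as Alertmanager, and the sample payload is hand-written for illustration, not captured from the real server.

```python
import json
import urllib.request

# Assumption: Prometheus runs on the same host as Alertmanager, on port 9090.
PROMETHEUS_URL = "http://172.16.88.20:9090"


def firing_alert_names(payload):
    """Return the sorted, de-duplicated alert names whose state is 'firing'."""
    alerts = payload.get("data", {}).get("alerts", [])
    return sorted({a["labels"]["alertname"] for a in alerts if a.get("state") == "firing"})


def fetch_alerts(base_url=PROMETHEUS_URL):
    """Fetch the active alerts from the Prometheus HTTP API (needs network access)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/alerts", timeout=5) as resp:
        return json.load(resp)


# Hand-written sample in the shape of an /api/v1/alerts response.
sample = {
    "status": "success",
    "data": {"alerts": [
        {"labels": {"alertname": "Pod_all_cpu_usage", "severity": "critical"}, "state": "firing"},
        {"labels": {"alertname": "Pod_all_memory_usage", "severity": "critical"}, "state": "pending"},
    ]},
}
print(firing_alert_names(sample))  # ['Pod_all_cpu_usage']
```

Against a live server, firing_alert_names(fetch_alerts()) does the same with real data. An alert stays in the "pending" state until its "for: 2m" duration has elapsed.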

Alertmanager now shows a number of alerts as well.

The current alerts can also be listed from the command line with ./amtool alert --alertmanager.url=http://172.16.88.20:9093
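
The same information is available from Alertmanager's HTTP API (GET /api/v2/alerts), which is handy in scripts. Below is a small Python sketch that builds the query URL and condenses a response into one line per alert. The helper names are made up for this example and the sample response is hand-written.

```python
import urllib.parse

ALERTMANAGER_URL = "http://172.16.88.20:9093"


def alerts_url(base_url, active=True, silenced=False, inhibited=False):
    """Build the /api/v2/alerts URL with the boolean filters the API accepts."""
    query = urllib.parse.urlencode({
        "active": str(active).lower(),
        "silenced": str(silenced).lower(),
        "inhibited": str(inhibited).lower(),
    })
    return f"{base_url}/api/v2/alerts?{query}"


def summarize(alerts):
    """Return one 'alertname state' string per alert of an /api/v2/alerts response."""
    return [f'{a["labels"].get("alertname", "?")} {a["status"]["state"]}' for a in alerts]


# Hand-written sample in the shape of an /api/v2/alerts response.
sample = [{"labels": {"alertname": "Pod_all_cpu_usage"}, "status": {"state": "active"}}]

print(alerts_url(ALERTMANAGER_URL))
print(summarize(sample))  # ['Pod_all_cpu_usage active']
```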

The mailbox has received the corresponding alert emails as well.

1.3 Configure DingTalk as the alert receiver

Set up the DingTalk alert robot

Write test scripts for a DingTalk robot that uses keyword authentication

DingTalk keyword authentication: shell script
[root@monitoring prometheus]# cat /opt/scripts/dingding-keywords.sh
#!/bin/bash
source /etc/profile
#PHONE=$1
#SUBJECT=$2
MESSAGE=$1
/usr/bin/curl -X "POST" 'https://oapi.dingtalk.com/robot/send?access_token=axxxxxxxxxxxxxxxxxxxxxxxxxxxx5d' \
-H 'Content-Type: application/json' \
-d '{"msgtype": "text", "text": { "content": "'${MESSAGE}'" } }'

DingTalk keyword authentication: Python script
[root@monitoring prometheus]# cat /opt/scripts/dingding-keywords.py
#!/usr/bin/python3
import sys
import requests
import json

# DingTalk alert:
def info(msg):
    url = 'https://oapi.dingtalk.com/robot/send?access_token=axxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx5d'
    headers = {
        'Content-Type': 'application/json;charset=utf-8'
    }
    formdata = {
        "msgtype": "text",
        "text": {"content": str(msg)}
    }
    #print(formdata)
    requests.post(url=url, data=json.dumps(formdata), headers=headers)

info(sys.argv[1])

Test whether messages can be sent

bash dingding-keywords.sh "node=172.16.88.20:51234,alertname=node内存可用大小"

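The shell script does not cope well with spaces in the message, because ${MESSAGE} is expanded outside of quotes and the shell splits it on whitespace. The string "node=172.16.88.20:51234,alertname=node内存可用大小" therefore must not contain any spaces.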

The Python script does not have this problem.

A Python environment has to be installed first:

yum install python38 -y

pip3 install requests
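
The reason the Python version handles spaces is that the message is serialized with json.dumps instead of being spliced into a shell string. Here is a minimal standard-library sketch of the same idea (no requests needed). The token placeholder and the function names are hypothetical, and with keyword authentication the message still has to contain one of the keywords configured on the robot.

```python
import json
import urllib.request

# Placeholder: replace <your-token> with the access_token of your own robot.
WEBHOOK_URL = "https://oapi.dingtalk.com/robot/send?access_token=<your-token>"


def build_text_payload(message):
    """Serialize a DingTalk text message; spaces and quotes survive intact."""
    body = {"msgtype": "text", "text": {"content": str(message)}}
    return json.dumps(body, ensure_ascii=False).encode("utf-8")


def send_text(message, url=WEBHOOK_URL):
    """POST the message to the robot webhook (needs network access and a valid token)."""
    request = urllib.request.Request(
        url,
        data=build_text_payload(message),
        headers={"Content-Type": "application/json;charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as resp:
        return json.load(resp)


# A message with spaces and double quotes round-trips without damage.
payload = build_text_payload('node=172.16.88.20:51234, alertname="free memory"')
print(json.loads(payload)["text"]["content"])
```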

Deploy prometheus-webhook-dingtalk

wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.1.0/prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz

tar -xf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz -C /opt/

cd /opt/

mv prometheus-webhook-dingtalk-2.1.0.linux-amd64/ prometheus-webhook-dingtalk

vim /opt/prometheus-webhook-dingtalk/config.yml

## Request timeout
# timeout: 5s

## Uncomment following line in order to write template from scratch (be careful!)
#no_builtin_template: true

## Customizable templates path
templates:
  - contrib/templates/dingding.yml

## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
#default_message:
#  title: '{{ template "legacy.title" . }}'
#  text: '{{ template "legacy.content" . }}'

## Targets, previously was known as "profiles"
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=axxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx5d
    # secret for signature
    secret: SECd1557e7bd1b609a7be1ac1407316caea32fa5ab34a4a529dea67c6684d7ebaf8
  webhook2:
    url: https://oapi.dingtalk.com/robot/send?access_token=axxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx5d
  webhook_legacy:
    url: https://oapi.dingtalk.com/robot/send?access_token=axxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx5d
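
The webhook1 target sets a secret, which corresponds to the "signature" security setting of the DingTalk robot. prometheus-webhook-dingtalk handles the signing itself when secret is present. Purely as an illustration, and to the best of my understanding of DingTalk's documented scheme, the signature is an HMAC-SHA256 over "timestamp\nsecret" that is Base64-encoded, URL-encoded, and appended to the webhook URL. The secret and token below are placeholders, and the function names are my own.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def sign(secret, timestamp_ms):
    """Compute the URL-encoded signature for a millisecond timestamp."""
    string_to_sign = f"{timestamp_ms}\n{secret}".encode("utf-8")
    digest = hmac.new(secret.encode("utf-8"), string_to_sign, hashlib.sha256).digest()
    return urllib.parse.quote_plus(base64.b64encode(digest))


def signed_url(webhook_url, secret, timestamp_ms=None):
    """Append the timestamp and sign query parameters to a robot webhook URL."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={sign(secret, timestamp_ms)}"


# Placeholder values, for illustration only.
url = "https://oapi.dingtalk.com/robot/send?access_token=<your-token>"
print(signed_url(url, "SEC<your-secret>", timestamp_ms=1664298355000))
```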

vim contrib/templates/dingding.yml

{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}{{ if eq .Status "resolved" }}:{{ .Alerts.Resolved | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
{{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}

{{ define "__text_alert_list" }}{{ range . }}
**Labels**
{{ range .Labels.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**Annotations**
{{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}

{{ end }}{{ end }}

{{ define "ding.link.title" }}{{ template "__subject" . }}{{ end }}
{{ define "ding.link.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
{{ template "__text_alert_list" .Alerts.Firing }}
{{ template "__text_alert_list" .Alerts.Resolved }}
{{ end }}

Run the DingTalk webhook service at boot

[root@monitoring prometheus-webhook-dingtalk]# vi /etc/systemd/system/dingtalk.service 
[root@monitoring prometheus-webhook-dingtalk]# cat /etc/systemd/system/dingtalk.service 
[Unit]
Description=Prometheus webhook-dingtalk 
After=network.target

[Service]
ExecStart=/opt/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --config.file=/opt/prometheus-webhook-dingtalk/config.yml --web.listen-address="0.0.0.0:8060"

[Install]
WantedBy=multi-user.target

[root@monitoring prometheus-webhook-dingtalk]# systemctl enable --now dingtalk.service 
[root@monitoring prometheus-webhook-dingtalk]# systemctl status dingtalk.service 
● dingtalk.service - Prometheus webhook-dingtalk
   Loaded: loaded (/etc/systemd/system/dingtalk.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2022-09-28 01:05:55 CST; 8s ago
 Main PID: 34202 (prometheus-webh)
    Tasks: 9 (limit: 49440)
   Memory: 3.6M
   CGroup: /system.slice/dingtalk.service
           └─34202 /opt/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --config.file=/opt/prometheus-webhook-dingtalk/config.yml --web.listen-addre>

Sep 28 01:05:55 monitoring systemd[1]: Started Prometheus webhook-dingtalk.
Sep 28 01:05:55 monitoring prometheus-webhook-dingtalk[34202]: ts=2022-09-27T17:05:55.583Z caller=main.go:59 level=info msg="Starting prometheus-webhook-din>
Sep 28 01:05:55 monitoring prometheus-webhook-dingtalk[34202]: ts=2022-09-27T17:05:55.583Z caller=main.go:60 level=info msg="Build context" (gogo1.18.1,user>
Sep 28 01:05:55 monitoring prometheus-webhook-dingtalk[34202]: ts=2022-09-27T17:05:55.583Z caller=coordinator.go:83 level=info component=configuration file=>
Sep 28 01:05:55 monitoring prometheus-webhook-dingtalk[34202]: ts=2022-09-27T17:05:55.585Z caller=coordinator.go:91 level=info component=configuration file=>
Sep 28 01:05:55 monitoring prometheus-webhook-dingtalk[34202]: ts=2022-09-27T17:05:55.585Z caller=main.go:97 level=info component=configuration msg="Loading>
Sep 28 01:05:55 monitoring prometheus-webhook-dingtalk[34202]: ts=2022-09-27T17:05:55.588Z caller=main.go:113 component=configuration msg="Webhook urls for >
Sep 28 01:05:55 monitoring prometheus-webhook-dingtalk[34202]: ts=2022-09-27T17:05:55.589Z caller=web.go:208 level=info component=web msg="Start listening f>
[root@monitoring prometheus-webhook-dingtalk]# netstat -tnlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      835/sshd            
tcp6       0      0 :::22                   :::*                    LISTEN      835/sshd            
tcp6       0      0 :::3000                 :::*                    LISTEN      24712/grafana-serve 
tcp6       0      0 :::9115                 :::*                    LISTEN      29832/blackbox_expo 
tcp6       0      0 :::8060                 :::*                    LISTEN      34202/prometheus-we 
tcp6       0      0 :::9090                 :::*                    LISTEN      33195/prometheus    
tcp6       0      0 :::51234                :::*                    LISTEN      24847/node_exporter 
tcp6       0      0 :::9093                 :::*                    LISTEN      31820/alertmanager  
tcp6       0      0 :::9094                 :::*                    LISTEN      31820/alertmanager  
tcp6       0      0 :::9256                 :::*                    LISTEN      24879/process-expor 
[root@monitoring prometheus-webhook-dingtalk]# 

Configure Alertmanager to send alerts to the DingTalk webhook

vim /opt/alertmanager/alertmanager.yml

global:
    resolve_timeout: 1m
    smtp_smarthost: 'smtp.qq.com:465'
    smtp_from: '2xxxxxx5@qq.com'
    smtp_auth_username: '2xxxxxx5@qq.com'
    smtp_auth_password: 'yxxxxxxxh'
    smtp_hello: '@qq.com'
    smtp_require_tls: false

route:
    group_by: ['alertname']
    group_wait: 10s
    group_interval: 2m
    repeat_interval: 5m
    receiver: 'dingding-webhook'

receivers:
  - name: 'dingding-webhook'
    webhook_configs:
    - url: 'http://localhost:8060/dingtalk/webhook1/send'
      send_resolved: true

inhibit_rules: 
  - source_match: 
     severity: 'critical' 
    target_match: 
     severity: 'warning' 
    equal: ['alertname', 'dev', 'instance']

Modify the Prometheus rules file so that it fires an alert

Restart the Prometheus service
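
If editing the rules file just to test the DingTalk route feels heavy, a synthetic alert can also be pushed straight into Alertmanager through POST /api/v2/alerts. The sketch below is one way to do that in Python. The alert name, the "instance" value and the annotation text are invented for the test, and the helper names are hypothetical.

```python
import datetime
import json
import urllib.request

ALERTMANAGER_URL = "http://172.16.88.20:9093"


def build_test_alert(name, severity="critical", minutes=5, now=None):
    """Build one alert object that stays active for `minutes` minutes."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return {
        # Label and annotation values are made up for this manual test.
        "labels": {"alertname": name, "severity": severity, "instance": "test"},
        "annotations": {"description": "manual test alert"},
        "startsAt": now.isoformat(),
        "endsAt": (now + datetime.timedelta(minutes=minutes)).isoformat(),
    }


def post_alerts(alerts, base_url=ALERTMANAGER_URL):
    """Send a list of alerts to Alertmanager (needs network access)."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as resp:
        return resp.status


fixed = datetime.datetime(2022, 9, 28, tzinfo=datetime.timezone.utc)
print(json.dumps(build_test_alert("DingTalkTest", now=fixed), indent=2))
```

Calling post_alerts([build_test_alert("DingTalkTest")]) on the monitoring host should make the alert travel through the route to the dingding-webhook receiver; once endsAt has passed, Alertmanager treats it as resolved.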

 

After the fault has been resolved

 

posted @ 2022-09-27 22:44  cyh00001