Prometheus Usage (3)
A useful link:
60. Prometheus Alertmanager and email alert configuration: https://www.cnblogs.com/ygbh/p/17306539.html
Service discovery
File-based service discovery
Current configuration:
[root@mcw03 ~]# cat /etc/prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/node_rules.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'agent1'
    static_configs:
    - targets: ['10.0.0.14:9100','10.0.0.12:9100']
  - job_name: 'promserver'
    static_configs:
    - targets: ['10.0.0.13:9100']
  - job_name: 'server_mariadb'
    static_configs:
    - targets: ['10.0.0.13:9104']
  - job_name: 'docker'
    static_configs:
    - targets: ['10.0.0.12:8080']
    metric_relabel_configs:
    - regex: 'kernelVersion'
      action: labeldrop
[root@mcw03 ~]#
Replace static_configs with file_sd_configs.
refresh_interval controls how often the target files are re-read, so Prometheus does not need a manual reload when targets change.
Create the directories and modify the configuration to point at the target files.
The file_sd_configs blocks below (shown in red in the original post) are wrong: the file paths should be listed directly under files; no targets key is needed.
[root@mcw03 ~]# ls /etc/prometheus.yml
/etc/prometheus.yml
[root@mcw03 ~]# mkdir -p /etc/targets/{nodes,docker}
[root@mcw03 ~]# vim /etc/prometheus.yml
[root@mcw03 ~]# cat /etc/prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/node_rules.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'agent1'
    file_sd_configs:
    - files:
      - targets: targets/nodes/*.json
      refresh_interval: 5m
  - job_name: 'promserver'
    static_configs:
    - targets: ['10.0.0.13:9100']
  - job_name: 'server_mariadb'
    static_configs:
    - targets: ['10.0.0.13:9104']
  - job_name: 'docker'
    file_sd_configs:
    - files:
      - targets: targets/docker/*.json
      refresh_interval: 5m
#    metric_relabel_configs:
#    - regex: 'kernelVersion'
#      action: labeldrop
[root@mcw03 ~]#
Create the target files:
[root@mcw03 ~]# touch /etc/targets/nodes/nodes.json
[root@mcw03 ~]# touch /etc/targets/docker/daemons.json
[root@mcw03 ~]#
Put the targets into the JSON files:
[root@mcw03 ~]# vim /etc/targets/nodes/nodes.json
[root@mcw03 ~]# vim /etc/targets/docker/daemons.json
[root@mcw03 ~]# cat /etc/targets/nodes/nodes.json
[{
  "targets": [
    "10.0.0.14:9100",
    "10.0.0.12:9100"
  ]
}]
[root@mcw03 ~]# cat /etc/targets/docker/daemons.json
[{
  "targets": [
    "10.0.0.12:8080"
  ]
}]
[root@mcw03 ~]#
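Before pointing Prometheus at them, the JSON files can be sanity-checked for syntax; a quick sketch, assuming jq is installed on the host:

jq . /etc/targets/nodes/nodes.json
jq . /etc/targets/docker/daemons.json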
The reload then fails with an error:
[root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload
failed to reload config: couldn't load configuration (--config.file="/etc/prometheus.yml"): parsing YAML file /etc/prometheus.yml: yaml: unmarshal errors:
  line 34: cannot unmarshal !!map into string
  line 45: cannot unmarshal !!map into string
[root@mcw03 ~]#
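Errors like this can be caught before reloading by validating the file with promtool, which ships in the Prometheus release tarball (its path depends on where you installed it):

promtool check config /etc/prometheus.yml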
The configuration above was written incorrectly; here is the corrected version:
[root@mcw03 ~]# vim /etc/prometheus.yml
[root@mcw03 ~]# cat /etc/prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/node_rules.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'agent1'
    file_sd_configs:
    - files:
      - targets/nodes/*.json
      refresh_interval: 5m
  - job_name: 'promserver'
    static_configs:
    - targets: ['10.0.0.13:9100']
  - job_name: 'server_mariadb'
    static_configs:
    - targets: ['10.0.0.13:9104']
  - job_name: 'docker'
    file_sd_configs:
    - files:
      - targets/docker/*.json
      refresh_interval: 5m
#    metric_relabel_configs:
#    - regex: 'kernelVersion'
#      action: labeldrop
[root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload
[root@mcw03 ~]#
Now the discovered targets can be seen on the service-discovery page:
http://10.0.0.13:9090/service-discovery
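The same information is available from the command line through the Prometheus targets API; the jq filter here is only for readability and can be dropped:

curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health}'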
Switching the Docker job's target file to YAML format:
[root@mcw03 ~]# vim /etc/prometheus.yml
[root@mcw03 ~]# cat /etc/prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/node_rules.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'agent1'
    file_sd_configs:
    - files:
      - targets/nodes/*.json
      refresh_interval: 5m
  - job_name: 'promserver'
    static_configs:
    - targets: ['10.0.0.13:9100']
  - job_name: 'server_mariadb'
    static_configs:
    - targets: ['10.0.0.13:9104']
  - job_name: 'docker'
    file_sd_configs:
    - files:
      - targets/docker/*.yml
      refresh_interval: 5m
#    metric_relabel_configs:
#    - regex: 'kernelVersion'
#      action: labeldrop
[root@mcw03 ~]# cp /etc/targets/docker/daemons.json /etc/targets/docker/daemons.yml
[root@mcw03 ~]# vim /etc/targets/docker/daemons.yml
[root@mcw03 ~]# cat /etc/targets/docker/daemons.yml
- targets:
  - "10.0.0.12:8080"
[root@mcw03 ~]#
[root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload
[root@mcw03 ~]#
After reloading, everything works.
The labels show where each automatically discovered target came from.
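File-based discovery attaches the source file path as the __meta_filepath label. Meta labels are dropped after relabeling, so if you want to keep that information on the scraped series you can copy it into a regular label, roughly like this (the label name discovered_from is arbitrary):

  - job_name: 'agent1'
    file_sd_configs:
    - files:
      - targets/nodes/*.json
      refresh_interval: 5m
    relabel_configs:
    # keep the source target file as a visible label on every series from this job
    - source_labels: [__meta_filepath]
      target_label: discovered_from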
Because the targets are plain YAML or JSON data, they can be generated and centrally managed with Salt, a CMDB, or any similar tool, and the monitoring configuration follows automatically; a small generator sketch is shown below.
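As a minimal, hypothetical sketch of that idea: the script below rebuilds nodes.json from a plain host list (the hosts.txt path and the datacenter label are assumptions, not part of the original setup).

#!/usr/bin/env bash
# Hypothetical sketch: rebuild nodes.json from a plain host list.
# Assumes /etc/targets/nodes/hosts.txt contains one "ip:port" per line.
set -euo pipefail

out=/etc/targets/nodes/nodes.json
tmp=$(mktemp "${out}.XXXXXX")   # temp name does not match *.json, so Prometheus never reads a half-written file

{
  echo '[{'
  echo '  "targets": ['
  # quote each host:port and join them with commas
  awk 'NF {printf "%s    \"%s\"", (n++ ? ",\n" : ""), $1} END {print ""}' /etc/targets/nodes/hosts.txt
  echo '  ],'
  echo '  "labels": { "datacenter": "mcwhome" }'
  echo '}]'
} > "$tmp"

mv "$tmp" "$out"   # atomic replace; picked up within refresh_interval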
Adding labels with file-based discovery
Modify the target files:
[root@mcw03 ~]# vim /etc/targets/nodes/nodes.json
[root@mcw03 ~]# cat /etc/targets/nodes/nodes.json
[{
  "targets": [
    "10.0.0.14:9100",
    "10.0.0.12:9100"
  ],
  "labels": {
    "datacenter": "mcwhome"
  }
}]
[root@mcw03 ~]# vim /etc/targets/docker/daemons.yml
[root@mcw03 ~]# cat /etc/targets/docker/daemons.yml
- targets:
  - "10.0.0.12:8080"
- labels:
    "datacenter": "mcwymlhome"
[root@mcw03 ~]#
No service restart is needed; the new label appears automatically. However, the label added in the YAML-format file did not take effect, and at the time it was unclear how to add it there (a probable fix is sketched below).
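A likely reason the YAML label did not take effect is the structure of daemons.yml above: labels was written as a second list element instead of as a sibling key of targets inside the same element. Each element of a file_sd file is one target group containing both keys, so the working form should look like this:

- targets:
  - "10.0.0.12:8080"
  labels:
    datacenter: "mcwymlhome"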
API-based service discovery
DNS-based service discovery
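These two mechanisms are only named here and not configured. For reference, a DNS-based job is a small sketch along these lines (the SRV record name is hypothetical):

  - job_name: 'dns_nodes'
    dns_sd_configs:
    - names:
      - '_node._tcp.mcw.example.com'
      type: 'SRV'
      refresh_interval: 30s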
Alert management: Alertmanager
Installing Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
Download from:
https://prometheus.io/download/
After downloading, upload the tarball to the server.
Step 1:
Extract the tarball: tar -xzf alertmanager-0.26.0.linux-amd64.tar.gz
After extraction, cd into the alertmanager directory.
Step 2:
Create the directories:
mkdir /etc/alertmanager
mkdir /usr/lib/alertmanager
(Note: the unit file below stores data under /var/lib/alertmanager, so that directory also has to be created and chowned, as the session further down shows.)
Step 3:
Copy the files and set ownership:
cp alertmanager.yml /etc/alertmanager/
chown prometheus /var/lib/alertmanager/
cp alertmanager /usr/local/bin/
Step 4:
Write the systemd unit file: vi /etc/systemd/system/alertmanager.service
[Unit]
Description=Prometheus Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Restart=always
Type=simple
ExecStart=/usr/local/bin/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/var/lib/alertmanager/
[Install]
WantedBy=multi-user.target
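After saving the unit file, reload systemd and start the service (enabling it at boot is optional); these are standard systemd commands and match what the session further down does:

systemctl daemon-reload
systemctl start alertmanager.service
systemctl enable alertmanager.service
systemctl status alertmanager.service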
Access:
Open http://IP:9093/ in a browser.
Step 5:
Add the following to the Prometheus configuration file:
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093
[root@mcw04 tmp]# ls
alertmanager-0.26.0.linux-amd64.tar.gz                                   systemd-private-a0e7e3d7293d454882c643c5f1a8ce7c-vmtoolsd.service-4BK6V6
systemd-private-3cf99c02a7114f738c3140f943aa9417-httpd.service-BpHja5    systemd-private-b04829df8fdd485f9add302ef649283a-chronyd.service-oxOzvx
systemd-private-a0e7e3d7293d454882c643c5f1a8ce7c-chronyd.service-uChRN0  systemd-private-b04829df8fdd485f9add302ef649283a-httpd.service-zRmTsv
systemd-private-a0e7e3d7293d454882c643c5f1a8ce7c-httpd.service-8kD7xq    systemd-private-b04829df8fdd485f9add302ef649283a-mariadb.service-4yFwtp
systemd-private-a0e7e3d7293d454882c643c5f1a8ce7c-mariadb.service-VXa7sc  systemd-private-b04829df8fdd485f9add302ef649283a-vgauthd.service-IRzCTg
systemd-private-a0e7e3d7293d454882c643c5f1a8ce7c-vgauthd.service-uF9wkU  systemd-private-b04829df8fdd485f9add302ef649283a-vmtoolsd.service-UM1nFT
[root@mcw04 tmp]# tar xf alertmanager-0.26.0.linux-amd64.tar.gz
[root@mcw04 tmp]# ls
alertmanager-0.26.0.linux-amd64                                          systemd-private-a0e7e3d7293d454882c643c5f1a8ce7c-vmtoolsd.service-4BK6V6
alertmanager-0.26.0.linux-amd64.tar.gz                                   systemd-private-b04829df8fdd485f9add302ef649283a-chronyd.service-oxOzvx
systemd-private-3cf99c02a7114f738c3140f943aa9417-httpd.service-BpHja5    systemd-private-b04829df8fdd485f9add302ef649283a-httpd.service-zRmTsv
systemd-private-a0e7e3d7293d454882c643c5f1a8ce7c-chronyd.service-uChRN0  systemd-private-b04829df8fdd485f9add302ef649283a-mariadb.service-4yFwtp
systemd-private-a0e7e3d7293d454882c643c5f1a8ce7c-httpd.service-8kD7xq    systemd-private-b04829df8fdd485f9add302ef649283a-vgauthd.service-IRzCTg
systemd-private-a0e7e3d7293d454882c643c5f1a8ce7c-mariadb.service-VXa7sc  systemd-private-b04829df8fdd485f9add302ef649283a-vmtoolsd.service-UM1nFT
systemd-private-a0e7e3d7293d454882c643c5f1a8ce7c-vgauthd.service-uF9wkU
[root@mcw04 tmp]# cd alertmanager-0.26.0.linux-amd64/
[root@mcw04 alertmanager-0.26.0.linux-amd64]# ls
alertmanager  alertmanager.yml  amtool  LICENSE  NOTICE
[root@mcw04 alertmanager-0.26.0.linux-amd64]# mkdir /etc/alertmanager
[root@mcw04 alertmanager-0.26.0.linux-amd64]# mkdir /usr/lib/alertmanager
[root@mcw04 alertmanager-0.26.0.linux-amd64]# cp alertmanager.yml /etc/alertmanager/
[root@mcw04 alertmanager-0.26.0.linux-amd64]# chown prometheus /usr/lib/alertmanager/
chown: invalid user: ‘prometheus’
[root@mcw04 alertmanager-0.26.0.linux-amd64]# useradd prometheus
[root@mcw04 alertmanager-0.26.0.linux-amd64]# chown prometheus /usr/lib/alertmanager/
[root@mcw04 alertmanager-0.26.0.linux-amd64]# cp alertmanager /usr/local/bin/
[root@mcw04 alertmanager-0.26.0.linux-amd64]# vim /etc/systemd/system/alertmanager.service
[root@mcw04 alertmanager-0.26.0.linux-amd64]# mkdir /var/lib/alertmanager
[root@mcw04 alertmanager-0.26.0.linux-amd64]# chown prometheus /var/lib/alertmanager/
[root@mcw04 alertmanager-0.26.0.linux-amd64]# systemctl daemon-reload
[root@mcw04 alertmanager-0.26.0.linux-amd64]# systemctl status alertmanager.service
● alertmanager.service - Prometheus Alertmanager
   Loaded: loaded (/etc/systemd/system/alertmanager.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
[root@mcw04 alertmanager-0.26.0.linux-amd64]# systemctl start alertmanager.service
[root@mcw04 alertmanager-0.26.0.linux-amd64]# ps -ef|grep alertman
prometh+  15558      1  3 21:26 ?        00:00:00 /usr/local/bin/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/var/lib/alertmanager/
root      15574   2038  0 21:26 pts/0    00:00:00 grep --color=auto alertman
[root@mcw04 alertmanager-0.26.0.linux-amd64]#
http://10.0.0.14:9093/
Visiting it, the Status page shows the configuration that was loaded:
global:
  resolve_timeout: 5m
  http_config:
    follow_redirects: true
    enable_http2: true
  smtp_hello: localhost
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
  telegram_api_url: https://api.telegram.org
  webex_api_url: https://webexapis.com/v1/messages
route:
  receiver: web.hook
  group_by:
  - alertname
  continue: false
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
inhibit_rules:
- source_match:
    severity: critical
  target_match:
    severity: warning
  equal:
  - alertname
  - dev
  - instance
receivers:
- name: web.hook
  webhook_configs:
  - send_resolved: true
    http_config:
      follow_redirects: true
      enable_http2: true
    url: <secret>
    url_file: ""
    max_alerts: 0
templates: []
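The email-alert link at the top of this page covers receivers in detail; as a minimal sketch, an email receiver in /etc/alertmanager/alertmanager.yml might look like the following (all addresses, the smarthost, and the password are placeholders, not values from this setup):

route:
  receiver: 'email-ops'
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h

receivers:
- name: 'email-ops'
  email_configs:
  - to: 'ops@example.com'            # placeholder recipient
    from: 'alertmanager@example.com' # placeholder sender
    smarthost: 'smtp.example.com:25' # placeholder SMTP server
    auth_username: 'alertmanager@example.com'
    auth_password: 'change-me'
    require_tls: false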
Now add the Alertmanager to the Prometheus configuration. Before the change it looks like this:
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093
After the change it looks like this. A hostname can be used instead of the IP, but the Prometheus host must be able to resolve it (see the sketch after the config):
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 10.0.0.14:9093
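If the hostname form is preferred, one simple option is an /etc/hosts entry on the Prometheus server; the hostname alertmanager here is just an example:

# /etc/hosts on mcw03
10.0.0.14   alertmanager

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093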
After reloading, check whether it took effect:
http://10.0.0.13:9090/status
The Status page now shows the link to our Alertmanager endpoint.
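The active Alertmanagers can also be checked from the command line via the Prometheus HTTP API (jq is optional, only used for pretty-printing):

curl -s http://localhost:9090/api/v1/alertmanagers | jq .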
Monitoring Alertmanager itself
[root@mcw03 ~]# vim /etc/prometheus.yml
  - job_name: 'alertmanager'
    static_configs:
    - targets: ['10.0.0.14:9093']
[root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload
[root@mcw03 ~]#
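The dump below comes from Alertmanager's own /metrics endpoint, which can also be spot-checked directly with curl before relying on the scrape job:

curl -s http://10.0.0.14:9093/metrics | head -n 20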
It returns a large set of metrics prefixed with alertmanager_, including alert counts and counts of successful and failed notifications broken down by integration, and so on. (A couple of example queries over these metrics follow the raw output below.)

# HELP alertmanager_alerts How many alerts by state. # TYPE alertmanager_alerts gauge alertmanager_alerts{state="active"} 0 alertmanager_alerts{state="suppressed"} 0 alertmanager_alerts{state="unprocessed"} 0 # HELP alertmanager_alerts_invalid_total The total number of received alerts that were invalid. # TYPE alertmanager_alerts_invalid_total counter alertmanager_alerts_invalid_total{version="v1"} 0 alertmanager_alerts_invalid_total{version="v2"} 0 # HELP alertmanager_alerts_received_total The total number of received alerts. # TYPE alertmanager_alerts_received_total counter alertmanager_alerts_received_total{status="firing",version="v1"} 0 alertmanager_alerts_received_total{status="firing",version="v2"} 0 alertmanager_alerts_received_total{status="resolved",version="v1"} 0 alertmanager_alerts_received_total{status="resolved",version="v2"} 0 # HELP alertmanager_build_info A metric with a constant '1' value labeled by version, revision, branch, goversion from which alertmanager was built, and the goos and goarch for the build. # TYPE alertmanager_build_info gauge alertmanager_build_info{branch="HEAD",goarch="amd64",goos="linux",goversion="go1.20.7",revision="d7b4f0c7322e7151d6e3b1e31cbc15361e295d8d",tags="netgo",version="0.26.0"} 1 # HELP alertmanager_cluster_alive_messages_total Total number of received alive messages. # TYPE alertmanager_cluster_alive_messages_total counter alertmanager_cluster_alive_messages_total{peer="01HPC5HJFBDP3C8WFXKE165XXV"} 1 # HELP alertmanager_cluster_enabled Indicates whether the clustering is enabled or not. # TYPE alertmanager_cluster_enabled gauge alertmanager_cluster_enabled 1 # HELP alertmanager_cluster_failed_peers Number indicating the current number of failed peers in the cluster. # TYPE alertmanager_cluster_failed_peers gauge alertmanager_cluster_failed_peers 0 # HELP alertmanager_cluster_health_score Health score of the cluster. Lower values are better and zero means 'totally healthy'. # TYPE alertmanager_cluster_health_score gauge alertmanager_cluster_health_score 0 # HELP alertmanager_cluster_members Number indicating current number of members in cluster. # TYPE alertmanager_cluster_members gauge alertmanager_cluster_members 1 # HELP alertmanager_cluster_messages_pruned_total Total number of cluster messages pruned. # TYPE alertmanager_cluster_messages_pruned_total counter alertmanager_cluster_messages_pruned_total 0 # HELP alertmanager_cluster_messages_queued Number of cluster messages which are queued. # TYPE alertmanager_cluster_messages_queued gauge alertmanager_cluster_messages_queued 0 # HELP alertmanager_cluster_messages_received_size_total Total size of cluster messages received. # TYPE alertmanager_cluster_messages_received_size_total counter alertmanager_cluster_messages_received_size_total{msg_type="full_state"} 0 alertmanager_cluster_messages_received_size_total{msg_type="update"} 0 # HELP alertmanager_cluster_messages_received_total Total number of cluster messages received. # TYPE alertmanager_cluster_messages_received_total counter alertmanager_cluster_messages_received_total{msg_type="full_state"} 0 alertmanager_cluster_messages_received_total{msg_type="update"} 0 # HELP alertmanager_cluster_messages_sent_size_total Total size of cluster messages sent. 
# TYPE alertmanager_cluster_messages_sent_size_total counter alertmanager_cluster_messages_sent_size_total{msg_type="full_state"} 0 alertmanager_cluster_messages_sent_size_total{msg_type="update"} 0 # HELP alertmanager_cluster_messages_sent_total Total number of cluster messages sent. # TYPE alertmanager_cluster_messages_sent_total counter alertmanager_cluster_messages_sent_total{msg_type="full_state"} 0 alertmanager_cluster_messages_sent_total{msg_type="update"} 0 # HELP alertmanager_cluster_peer_info A metric with a constant '1' value labeled by peer name. # TYPE alertmanager_cluster_peer_info gauge alertmanager_cluster_peer_info{peer="01HPC5HJFBDP3C8WFXKE165XXV"} 1 # HELP alertmanager_cluster_peers_joined_total A counter of the number of peers that have joined. # TYPE alertmanager_cluster_peers_joined_total counter alertmanager_cluster_peers_joined_total 1 # HELP alertmanager_cluster_peers_left_total A counter of the number of peers that have left. # TYPE alertmanager_cluster_peers_left_total counter alertmanager_cluster_peers_left_total 0 # HELP alertmanager_cluster_peers_update_total A counter of the number of peers that have updated metadata. # TYPE alertmanager_cluster_peers_update_total counter alertmanager_cluster_peers_update_total 0 # HELP alertmanager_cluster_reconnections_failed_total A counter of the number of failed cluster peer reconnection attempts. # TYPE alertmanager_cluster_reconnections_failed_total counter alertmanager_cluster_reconnections_failed_total 0 # HELP alertmanager_cluster_reconnections_total A counter of the number of cluster peer reconnections. # TYPE alertmanager_cluster_reconnections_total counter alertmanager_cluster_reconnections_total 0 # HELP alertmanager_cluster_refresh_join_failed_total A counter of the number of failed cluster peer joined attempts via refresh. # TYPE alertmanager_cluster_refresh_join_failed_total counter alertmanager_cluster_refresh_join_failed_total 0 # HELP alertmanager_cluster_refresh_join_total A counter of the number of cluster peer joined via refresh. # TYPE alertmanager_cluster_refresh_join_total counter alertmanager_cluster_refresh_join_total 0 # HELP alertmanager_config_hash Hash of the currently loaded alertmanager configuration. # TYPE alertmanager_config_hash gauge alertmanager_config_hash 2.6913785254066e+14 # HELP alertmanager_config_last_reload_success_timestamp_seconds Timestamp of the last successful configuration reload. # TYPE alertmanager_config_last_reload_success_timestamp_seconds gauge alertmanager_config_last_reload_success_timestamp_seconds 1.7076579723241663e+09 # HELP alertmanager_config_last_reload_successful Whether the last configuration reload attempt was successful. # TYPE alertmanager_config_last_reload_successful gauge alertmanager_config_last_reload_successful 1 # HELP alertmanager_dispatcher_aggregation_groups Number of active aggregation groups # TYPE alertmanager_dispatcher_aggregation_groups gauge alertmanager_dispatcher_aggregation_groups 0 # HELP alertmanager_dispatcher_alert_processing_duration_seconds Summary of latencies for the processing of alerts. # TYPE alertmanager_dispatcher_alert_processing_duration_seconds summary alertmanager_dispatcher_alert_processing_duration_seconds_sum 0 alertmanager_dispatcher_alert_processing_duration_seconds_count 0 # HELP alertmanager_http_concurrency_limit_exceeded_total Total number of times an HTTP request failed because the concurrency limit was reached. 
# TYPE alertmanager_http_concurrency_limit_exceeded_total counter alertmanager_http_concurrency_limit_exceeded_total{method="get"} 0 # HELP alertmanager_http_request_duration_seconds Histogram of latencies for HTTP requests. # TYPE alertmanager_http_request_duration_seconds histogram alertmanager_http_request_duration_seconds_bucket{handler="/",method="get",le="0.05"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/",method="get",le="0.1"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/",method="get",le="0.25"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/",method="get",le="0.5"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/",method="get",le="0.75"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/",method="get",le="1"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/",method="get",le="2"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/",method="get",le="5"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/",method="get",le="20"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/",method="get",le="60"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/",method="get",le="+Inf"} 5 alertmanager_http_request_duration_seconds_sum{handler="/",method="get"} 0.04409479 alertmanager_http_request_duration_seconds_count{handler="/",method="get"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/alerts",method="post",le="0.05"} 2 alertmanager_http_request_duration_seconds_bucket{handler="/alerts",method="post",le="0.1"} 2 alertmanager_http_request_duration_seconds_bucket{handler="/alerts",method="post",le="0.25"} 2 alertmanager_http_request_duration_seconds_bucket{handler="/alerts",method="post",le="0.5"} 2 alertmanager_http_request_duration_seconds_bucket{handler="/alerts",method="post",le="0.75"} 2 alertmanager_http_request_duration_seconds_bucket{handler="/alerts",method="post",le="1"} 2 alertmanager_http_request_duration_seconds_bucket{handler="/alerts",method="post",le="2"} 2 alertmanager_http_request_duration_seconds_bucket{handler="/alerts",method="post",le="5"} 2 alertmanager_http_request_duration_seconds_bucket{handler="/alerts",method="post",le="20"} 2 alertmanager_http_request_duration_seconds_bucket{handler="/alerts",method="post",le="60"} 2 alertmanager_http_request_duration_seconds_bucket{handler="/alerts",method="post",le="+Inf"} 2 alertmanager_http_request_duration_seconds_sum{handler="/alerts",method="post"} 0.000438549 alertmanager_http_request_duration_seconds_count{handler="/alerts",method="post"} 2 alertmanager_http_request_duration_seconds_bucket{handler="/favicon.ico",method="get",le="0.05"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/favicon.ico",method="get",le="0.1"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/favicon.ico",method="get",le="0.25"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/favicon.ico",method="get",le="0.5"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/favicon.ico",method="get",le="0.75"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/favicon.ico",method="get",le="1"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/favicon.ico",method="get",le="2"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/favicon.ico",method="get",le="5"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/favicon.ico",method="get",le="20"} 3 
alertmanager_http_request_duration_seconds_bucket{handler="/favicon.ico",method="get",le="60"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/favicon.ico",method="get",le="+Inf"} 3 alertmanager_http_request_duration_seconds_sum{handler="/favicon.ico",method="get"} 0.0018690550000000001 alertmanager_http_request_duration_seconds_count{handler="/favicon.ico",method="get"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/lib/*path",method="get",le="0.05"} 20 alertmanager_http_request_duration_seconds_bucket{handler="/lib/*path",method="get",le="0.1"} 20 alertmanager_http_request_duration_seconds_bucket{handler="/lib/*path",method="get",le="0.25"} 20 alertmanager_http_request_duration_seconds_bucket{handler="/lib/*path",method="get",le="0.5"} 20 alertmanager_http_request_duration_seconds_bucket{handler="/lib/*path",method="get",le="0.75"} 20 alertmanager_http_request_duration_seconds_bucket{handler="/lib/*path",method="get",le="1"} 20 alertmanager_http_request_duration_seconds_bucket{handler="/lib/*path",method="get",le="2"} 20 alertmanager_http_request_duration_seconds_bucket{handler="/lib/*path",method="get",le="5"} 20 alertmanager_http_request_duration_seconds_bucket{handler="/lib/*path",method="get",le="20"} 20 alertmanager_http_request_duration_seconds_bucket{handler="/lib/*path",method="get",le="60"} 20 alertmanager_http_request_duration_seconds_bucket{handler="/lib/*path",method="get",le="+Inf"} 20 alertmanager_http_request_duration_seconds_sum{handler="/lib/*path",method="get"} 0.029757111999999995 alertmanager_http_request_duration_seconds_count{handler="/lib/*path",method="get"} 20 alertmanager_http_request_duration_seconds_bucket{handler="/metrics",method="get",le="0.05"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/metrics",method="get",le="0.1"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/metrics",method="get",le="0.25"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/metrics",method="get",le="0.5"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/metrics",method="get",le="0.75"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/metrics",method="get",le="1"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/metrics",method="get",le="2"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/metrics",method="get",le="5"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/metrics",method="get",le="20"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/metrics",method="get",le="60"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/metrics",method="get",le="+Inf"} 3 alertmanager_http_request_duration_seconds_sum{handler="/metrics",method="get"} 0.006149267 alertmanager_http_request_duration_seconds_count{handler="/metrics",method="get"} 3 alertmanager_http_request_duration_seconds_bucket{handler="/script.js",method="get",le="0.05"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/script.js",method="get",le="0.1"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/script.js",method="get",le="0.25"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/script.js",method="get",le="0.5"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/script.js",method="get",le="0.75"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/script.js",method="get",le="1"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/script.js",method="get",le="2"} 5 
alertmanager_http_request_duration_seconds_bucket{handler="/script.js",method="get",le="5"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/script.js",method="get",le="20"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/script.js",method="get",le="60"} 5 alertmanager_http_request_duration_seconds_bucket{handler="/script.js",method="get",le="+Inf"} 5 alertmanager_http_request_duration_seconds_sum{handler="/script.js",method="get"} 0.01638322 alertmanager_http_request_duration_seconds_count{handler="/script.js",method="get"} 5 # HELP alertmanager_http_requests_in_flight Current number of HTTP requests being processed. # TYPE alertmanager_http_requests_in_flight gauge alertmanager_http_requests_in_flight{method="get"} 1 # HELP alertmanager_http_response_size_bytes Histogram of response size for HTTP requests. # TYPE alertmanager_http_response_size_bytes histogram alertmanager_http_response_size_bytes_bucket{handler="/",method="get",le="100"} 0 alertmanager_http_response_size_bytes_bucket{handler="/",method="get",le="1000"} 0 alertmanager_http_response_size_bytes_bucket{handler="/",method="get",le="10000"} 5 alertmanager_http_response_size_bytes_bucket{handler="/",method="get",le="100000"} 5 alertmanager_http_response_size_bytes_bucket{handler="/",method="get",le="1e+06"} 5 alertmanager_http_response_size_bytes_bucket{handler="/",method="get",le="1e+07"} 5 alertmanager_http_response_size_bytes_bucket{handler="/",method="get",le="1e+08"} 5 alertmanager_http_response_size_bytes_bucket{handler="/",method="get",le="+Inf"} 5 alertmanager_http_response_size_bytes_sum{handler="/",method="get"} 8270 alertmanager_http_response_size_bytes_count{handler="/",method="get"} 5 alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="100"} 2 alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="1000"} 2 alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="10000"} 2 alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="100000"} 2 alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="1e+06"} 2 alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="1e+07"} 2 alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="1e+08"} 2 alertmanager_http_response_size_bytes_bucket{handler="/alerts",method="post",le="+Inf"} 2 alertmanager_http_response_size_bytes_sum{handler="/alerts",method="post"} 40 alertmanager_http_response_size_bytes_count{handler="/alerts",method="post"} 2 alertmanager_http_response_size_bytes_bucket{handler="/favicon.ico",method="get",le="100"} 0 alertmanager_http_response_size_bytes_bucket{handler="/favicon.ico",method="get",le="1000"} 0 alertmanager_http_response_size_bytes_bucket{handler="/favicon.ico",method="get",le="10000"} 0 alertmanager_http_response_size_bytes_bucket{handler="/favicon.ico",method="get",le="100000"} 3 alertmanager_http_response_size_bytes_bucket{handler="/favicon.ico",method="get",le="1e+06"} 3 alertmanager_http_response_size_bytes_bucket{handler="/favicon.ico",method="get",le="1e+07"} 3 alertmanager_http_response_size_bytes_bucket{handler="/favicon.ico",method="get",le="1e+08"} 3 alertmanager_http_response_size_bytes_bucket{handler="/favicon.ico",method="get",le="+Inf"} 3 alertmanager_http_response_size_bytes_sum{handler="/favicon.ico",method="get"} 45258 alertmanager_http_response_size_bytes_count{handler="/favicon.ico",method="get"} 3 
alertmanager_http_response_size_bytes_bucket{handler="/lib/*path",method="get",le="100"} 0 alertmanager_http_response_size_bytes_bucket{handler="/lib/*path",method="get",le="1000"} 0 alertmanager_http_response_size_bytes_bucket{handler="/lib/*path",method="get",le="10000"} 5 alertmanager_http_response_size_bytes_bucket{handler="/lib/*path",method="get",le="100000"} 15 alertmanager_http_response_size_bytes_bucket{handler="/lib/*path",method="get",le="1e+06"} 20 alertmanager_http_response_size_bytes_bucket{handler="/lib/*path",method="get",le="1e+07"} 20 alertmanager_http_response_size_bytes_bucket{handler="/lib/*path",method="get",le="1e+08"} 20 alertmanager_http_response_size_bytes_bucket{handler="/lib/*path",method="get",le="+Inf"} 20 alertmanager_http_response_size_bytes_sum{handler="/lib/*path",method="get"} 1.306205e+06 alertmanager_http_response_size_bytes_count{handler="/lib/*path",method="get"} 20 alertmanager_http_response_size_bytes_bucket{handler="/metrics",method="get",le="100"} 0 alertmanager_http_response_size_bytes_bucket{handler="/metrics",method="get",le="1000"} 0 alertmanager_http_response_size_bytes_bucket{handler="/metrics",method="get",le="10000"} 3 alertmanager_http_response_size_bytes_bucket{handler="/metrics",method="get",le="100000"} 3 alertmanager_http_response_size_bytes_bucket{handler="/metrics",method="get",le="1e+06"} 3 alertmanager_http_response_size_bytes_bucket{handler="/metrics",method="get",le="1e+07"} 3 alertmanager_http_response_size_bytes_bucket{handler="/metrics",method="get",le="1e+08"} 3 alertmanager_http_response_size_bytes_bucket{handler="/metrics",method="get",le="+Inf"} 3 alertmanager_http_response_size_bytes_sum{handler="/metrics",method="get"} 16537 alertmanager_http_response_size_bytes_count{handler="/metrics",method="get"} 3 alertmanager_http_response_size_bytes_bucket{handler="/script.js",method="get",le="100"} 0 alertmanager_http_response_size_bytes_bucket{handler="/script.js",method="get",le="1000"} 0 alertmanager_http_response_size_bytes_bucket{handler="/script.js",method="get",le="10000"} 0 alertmanager_http_response_size_bytes_bucket{handler="/script.js",method="get",le="100000"} 0 alertmanager_http_response_size_bytes_bucket{handler="/script.js",method="get",le="1e+06"} 5 alertmanager_http_response_size_bytes_bucket{handler="/script.js",method="get",le="1e+07"} 5 alertmanager_http_response_size_bytes_bucket{handler="/script.js",method="get",le="1e+08"} 5 alertmanager_http_response_size_bytes_bucket{handler="/script.js",method="get",le="+Inf"} 5 alertmanager_http_response_size_bytes_sum{handler="/script.js",method="get"} 551050 alertmanager_http_response_size_bytes_count{handler="/script.js",method="get"} 5 # HELP alertmanager_integrations Number of configured integrations. # TYPE alertmanager_integrations gauge alertmanager_integrations 1 # HELP alertmanager_marked_alerts How many alerts by state are currently marked in the Alertmanager regardless of their expiry. # TYPE alertmanager_marked_alerts gauge alertmanager_marked_alerts{state="active"} 0 alertmanager_marked_alerts{state="suppressed"} 0 alertmanager_marked_alerts{state="unprocessed"} 0 # HELP alertmanager_nflog_gc_duration_seconds Duration of the last notification log garbage collection cycle. # TYPE alertmanager_nflog_gc_duration_seconds summary alertmanager_nflog_gc_duration_seconds_sum 5.37e-07 alertmanager_nflog_gc_duration_seconds_count 1 # HELP alertmanager_nflog_gossip_messages_propagated_total Number of received gossip messages that have been further gossiped. 
# TYPE alertmanager_nflog_gossip_messages_propagated_total counter alertmanager_nflog_gossip_messages_propagated_total 0 # HELP alertmanager_nflog_maintenance_errors_total How many maintenances were executed for the notification log that failed. # TYPE alertmanager_nflog_maintenance_errors_total counter alertmanager_nflog_maintenance_errors_total 0 # HELP alertmanager_nflog_maintenance_total How many maintenances were executed for the notification log. # TYPE alertmanager_nflog_maintenance_total counter alertmanager_nflog_maintenance_total 1 # HELP alertmanager_nflog_queries_total Number of notification log queries were received. # TYPE alertmanager_nflog_queries_total counter alertmanager_nflog_queries_total 0 # HELP alertmanager_nflog_query_duration_seconds Duration of notification log query evaluation. # TYPE alertmanager_nflog_query_duration_seconds histogram alertmanager_nflog_query_duration_seconds_bucket{le="0.005"} 0 alertmanager_nflog_query_duration_seconds_bucket{le="0.01"} 0 alertmanager_nflog_query_duration_seconds_bucket{le="0.025"} 0 alertmanager_nflog_query_duration_seconds_bucket{le="0.05"} 0 alertmanager_nflog_query_duration_seconds_bucket{le="0.1"} 0 alertmanager_nflog_query_duration_seconds_bucket{le="0.25"} 0 alertmanager_nflog_query_duration_seconds_bucket{le="0.5"} 0 alertmanager_nflog_query_duration_seconds_bucket{le="1"} 0 alertmanager_nflog_query_duration_seconds_bucket{le="2.5"} 0 alertmanager_nflog_query_duration_seconds_bucket{le="5"} 0 alertmanager_nflog_query_duration_seconds_bucket{le="10"} 0 alertmanager_nflog_query_duration_seconds_bucket{le="+Inf"} 0 alertmanager_nflog_query_duration_seconds_sum 0 alertmanager_nflog_query_duration_seconds_count 0 # HELP alertmanager_nflog_query_errors_total Number notification log received queries that failed. # TYPE alertmanager_nflog_query_errors_total counter alertmanager_nflog_query_errors_total 0 # HELP alertmanager_nflog_snapshot_duration_seconds Duration of the last notification log snapshot. # TYPE alertmanager_nflog_snapshot_duration_seconds summary alertmanager_nflog_snapshot_duration_seconds_sum 1.8017e-05 alertmanager_nflog_snapshot_duration_seconds_count 1 # HELP alertmanager_nflog_snapshot_size_bytes Size of the last notification log snapshot in bytes. # TYPE alertmanager_nflog_snapshot_size_bytes gauge alertmanager_nflog_snapshot_size_bytes 0 # HELP alertmanager_notification_latency_seconds The latency of notifications in seconds. 
# TYPE alertmanager_notification_latency_seconds histogram alertmanager_notification_latency_seconds_bucket{integration="email",le="1"} 0 alertmanager_notification_latency_seconds_bucket{integration="email",le="5"} 0 alertmanager_notification_latency_seconds_bucket{integration="email",le="10"} 0 alertmanager_notification_latency_seconds_bucket{integration="email",le="15"} 0 alertmanager_notification_latency_seconds_bucket{integration="email",le="20"} 0 alertmanager_notification_latency_seconds_bucket{integration="email",le="+Inf"} 0 alertmanager_notification_latency_seconds_sum{integration="email"} 0 alertmanager_notification_latency_seconds_count{integration="email"} 0 alertmanager_notification_latency_seconds_bucket{integration="msteams",le="1"} 0 alertmanager_notification_latency_seconds_bucket{integration="msteams",le="5"} 0 alertmanager_notification_latency_seconds_bucket{integration="msteams",le="10"} 0 alertmanager_notification_latency_seconds_bucket{integration="msteams",le="15"} 0 alertmanager_notification_latency_seconds_bucket{integration="msteams",le="20"} 0 alertmanager_notification_latency_seconds_bucket{integration="msteams",le="+Inf"} 0 alertmanager_notification_latency_seconds_sum{integration="msteams"} 0 alertmanager_notification_latency_seconds_count{integration="msteams"} 0 alertmanager_notification_latency_seconds_bucket{integration="opsgenie",le="1"} 0 alertmanager_notification_latency_seconds_bucket{integration="opsgenie",le="5"} 0 alertmanager_notification_latency_seconds_bucket{integration="opsgenie",le="10"} 0 alertmanager_notification_latency_seconds_bucket{integration="opsgenie",le="15"} 0 alertmanager_notification_latency_seconds_bucket{integration="opsgenie",le="20"} 0 alertmanager_notification_latency_seconds_bucket{integration="opsgenie",le="+Inf"} 0 alertmanager_notification_latency_seconds_sum{integration="opsgenie"} 0 alertmanager_notification_latency_seconds_count{integration="opsgenie"} 0 alertmanager_notification_latency_seconds_bucket{integration="pagerduty",le="1"} 0 alertmanager_notification_latency_seconds_bucket{integration="pagerduty",le="5"} 0 alertmanager_notification_latency_seconds_bucket{integration="pagerduty",le="10"} 0 alertmanager_notification_latency_seconds_bucket{integration="pagerduty",le="15"} 0 alertmanager_notification_latency_seconds_bucket{integration="pagerduty",le="20"} 0 alertmanager_notification_latency_seconds_bucket{integration="pagerduty",le="+Inf"} 0 alertmanager_notification_latency_seconds_sum{integration="pagerduty"} 0 alertmanager_notification_latency_seconds_count{integration="pagerduty"} 0 alertmanager_notification_latency_seconds_bucket{integration="pushover",le="1"} 0 alertmanager_notification_latency_seconds_bucket{integration="pushover",le="5"} 0 alertmanager_notification_latency_seconds_bucket{integration="pushover",le="10"} 0 alertmanager_notification_latency_seconds_bucket{integration="pushover",le="15"} 0 alertmanager_notification_latency_seconds_bucket{integration="pushover",le="20"} 0 alertmanager_notification_latency_seconds_bucket{integration="pushover",le="+Inf"} 0 alertmanager_notification_latency_seconds_sum{integration="pushover"} 0 alertmanager_notification_latency_seconds_count{integration="pushover"} 0 alertmanager_notification_latency_seconds_bucket{integration="slack",le="1"} 0 alertmanager_notification_latency_seconds_bucket{integration="slack",le="5"} 0 alertmanager_notification_latency_seconds_bucket{integration="slack",le="10"} 0 
alertmanager_notification_latency_seconds_bucket{integration="slack",le="15"} 0 alertmanager_notification_latency_seconds_bucket{integration="slack",le="20"} 0 alertmanager_notification_latency_seconds_bucket{integration="slack",le="+Inf"} 0 alertmanager_notification_latency_seconds_sum{integration="slack"} 0 alertmanager_notification_latency_seconds_count{integration="slack"} 0 alertmanager_notification_latency_seconds_bucket{integration="sns",le="1"} 0 alertmanager_notification_latency_seconds_bucket{integration="sns",le="5"} 0 alertmanager_notification_latency_seconds_bucket{integration="sns",le="10"} 0 alertmanager_notification_latency_seconds_bucket{integration="sns",le="15"} 0 alertmanager_notification_latency_seconds_bucket{integration="sns",le="20"} 0 alertmanager_notification_latency_seconds_bucket{integration="sns",le="+Inf"} 0 alertmanager_notification_latency_seconds_sum{integration="sns"} 0 alertmanager_notification_latency_seconds_count{integration="sns"} 0 alertmanager_notification_latency_seconds_bucket{integration="telegram",le="1"} 0 alertmanager_notification_latency_seconds_bucket{integration="telegram",le="5"} 0 alertmanager_notification_latency_seconds_bucket{integration="telegram",le="10"} 0 alertmanager_notification_latency_seconds_bucket{integration="telegram",le="15"} 0 alertmanager_notification_latency_seconds_bucket{integration="telegram",le="20"} 0 alertmanager_notification_latency_seconds_bucket{integration="telegram",le="+Inf"} 0 alertmanager_notification_latency_seconds_sum{integration="telegram"} 0 alertmanager_notification_latency_seconds_count{integration="telegram"} 0 alertmanager_notification_latency_seconds_bucket{integration="victorops",le="1"} 0 alertmanager_notification_latency_seconds_bucket{integration="victorops",le="5"} 0 alertmanager_notification_latency_seconds_bucket{integration="victorops",le="10"} 0 alertmanager_notification_latency_seconds_bucket{integration="victorops",le="15"} 0 alertmanager_notification_latency_seconds_bucket{integration="victorops",le="20"} 0 alertmanager_notification_latency_seconds_bucket{integration="victorops",le="+Inf"} 0 alertmanager_notification_latency_seconds_sum{integration="victorops"} 0 alertmanager_notification_latency_seconds_count{integration="victorops"} 0 alertmanager_notification_latency_seconds_bucket{integration="webhook",le="1"} 0 alertmanager_notification_latency_seconds_bucket{integration="webhook",le="5"} 0 alertmanager_notification_latency_seconds_bucket{integration="webhook",le="10"} 0 alertmanager_notification_latency_seconds_bucket{integration="webhook",le="15"} 0 alertmanager_notification_latency_seconds_bucket{integration="webhook",le="20"} 0 alertmanager_notification_latency_seconds_bucket{integration="webhook",le="+Inf"} 0 alertmanager_notification_latency_seconds_sum{integration="webhook"} 0 alertmanager_notification_latency_seconds_count{integration="webhook"} 0 alertmanager_notification_latency_seconds_bucket{integration="wechat",le="1"} 0 alertmanager_notification_latency_seconds_bucket{integration="wechat",le="5"} 0 alertmanager_notification_latency_seconds_bucket{integration="wechat",le="10"} 0 alertmanager_notification_latency_seconds_bucket{integration="wechat",le="15"} 0 alertmanager_notification_latency_seconds_bucket{integration="wechat",le="20"} 0 alertmanager_notification_latency_seconds_bucket{integration="wechat",le="+Inf"} 0 alertmanager_notification_latency_seconds_sum{integration="wechat"} 0 alertmanager_notification_latency_seconds_count{integration="wechat"} 0 # HELP 
alertmanager_notification_requests_failed_total The total number of failed notification requests. # TYPE alertmanager_notification_requests_failed_total counter alertmanager_notification_requests_failed_total{integration="email"} 0 alertmanager_notification_requests_failed_total{integration="msteams"} 0 alertmanager_notification_requests_failed_total{integration="opsgenie"} 0 alertmanager_notification_requests_failed_total{integration="pagerduty"} 0 alertmanager_notification_requests_failed_total{integration="pushover"} 0 alertmanager_notification_requests_failed_total{integration="slack"} 0 alertmanager_notification_requests_failed_total{integration="sns"} 0 alertmanager_notification_requests_failed_total{integration="telegram"} 0 alertmanager_notification_requests_failed_total{integration="victorops"} 0 alertmanager_notification_requests_failed_total{integration="webhook"} 0 alertmanager_notification_requests_failed_total{integration="wechat"} 0 # HELP alertmanager_notification_requests_total The total number of attempted notification requests. # TYPE alertmanager_notification_requests_total counter alertmanager_notification_requests_total{integration="email"} 0 alertmanager_notification_requests_total{integration="msteams"} 0 alertmanager_notification_requests_total{integration="opsgenie"} 0 alertmanager_notification_requests_total{integration="pagerduty"} 0 alertmanager_notification_requests_total{integration="pushover"} 0 alertmanager_notification_requests_total{integration="slack"} 0 alertmanager_notification_requests_total{integration="sns"} 0 alertmanager_notification_requests_total{integration="telegram"} 0 alertmanager_notification_requests_total{integration="victorops"} 0 alertmanager_notification_requests_total{integration="webhook"} 0 alertmanager_notification_requests_total{integration="wechat"} 0 # HELP alertmanager_notifications_failed_total The total number of failed notifications. 
# TYPE alertmanager_notifications_failed_total counter alertmanager_notifications_failed_total{integration="email",reason="clientError"} 0 alertmanager_notifications_failed_total{integration="email",reason="other"} 0 alertmanager_notifications_failed_total{integration="email",reason="serverError"} 0 alertmanager_notifications_failed_total{integration="msteams",reason="clientError"} 0 alertmanager_notifications_failed_total{integration="msteams",reason="other"} 0 alertmanager_notifications_failed_total{integration="msteams",reason="serverError"} 0 alertmanager_notifications_failed_total{integration="opsgenie",reason="clientError"} 0 alertmanager_notifications_failed_total{integration="opsgenie",reason="other"} 0 alertmanager_notifications_failed_total{integration="opsgenie",reason="serverError"} 0 alertmanager_notifications_failed_total{integration="pagerduty",reason="clientError"} 0 alertmanager_notifications_failed_total{integration="pagerduty",reason="other"} 0 alertmanager_notifications_failed_total{integration="pagerduty",reason="serverError"} 0 alertmanager_notifications_failed_total{integration="pushover",reason="clientError"} 0 alertmanager_notifications_failed_total{integration="pushover",reason="other"} 0 alertmanager_notifications_failed_total{integration="pushover",reason="serverError"} 0 alertmanager_notifications_failed_total{integration="slack",reason="clientError"} 0 alertmanager_notifications_failed_total{integration="slack",reason="other"} 0 alertmanager_notifications_failed_total{integration="slack",reason="serverError"} 0 alertmanager_notifications_failed_total{integration="sns",reason="clientError"} 0 alertmanager_notifications_failed_total{integration="sns",reason="other"} 0 alertmanager_notifications_failed_total{integration="sns",reason="serverError"} 0 alertmanager_notifications_failed_total{integration="telegram",reason="clientError"} 0 alertmanager_notifications_failed_total{integration="telegram",reason="other"} 0 alertmanager_notifications_failed_total{integration="telegram",reason="serverError"} 0 alertmanager_notifications_failed_total{integration="victorops",reason="clientError"} 0 alertmanager_notifications_failed_total{integration="victorops",reason="other"} 0 alertmanager_notifications_failed_total{integration="victorops",reason="serverError"} 0 alertmanager_notifications_failed_total{integration="webhook",reason="clientError"} 0 alertmanager_notifications_failed_total{integration="webhook",reason="other"} 0 alertmanager_notifications_failed_total{integration="webhook",reason="serverError"} 0 alertmanager_notifications_failed_total{integration="wechat",reason="clientError"} 0 alertmanager_notifications_failed_total{integration="wechat",reason="other"} 0 alertmanager_notifications_failed_total{integration="wechat",reason="serverError"} 0 # HELP alertmanager_notifications_total The total number of attempted notifications. 
# TYPE alertmanager_notifications_total counter alertmanager_notifications_total{integration="email"} 0 alertmanager_notifications_total{integration="msteams"} 0 alertmanager_notifications_total{integration="opsgenie"} 0 alertmanager_notifications_total{integration="pagerduty"} 0 alertmanager_notifications_total{integration="pushover"} 0 alertmanager_notifications_total{integration="slack"} 0 alertmanager_notifications_total{integration="sns"} 0 alertmanager_notifications_total{integration="telegram"} 0 alertmanager_notifications_total{integration="victorops"} 0 alertmanager_notifications_total{integration="webhook"} 0 alertmanager_notifications_total{integration="wechat"} 0 # HELP alertmanager_oversize_gossip_message_duration_seconds Duration of oversized gossip message requests. # TYPE alertmanager_oversize_gossip_message_duration_seconds histogram alertmanager_oversize_gossip_message_duration_seconds_bucket{key="nfl",le="0.005"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="nfl",le="0.01"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="nfl",le="0.025"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="nfl",le="0.05"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="nfl",le="0.1"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="nfl",le="0.25"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="nfl",le="0.5"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="nfl",le="1"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="nfl",le="2.5"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="nfl",le="5"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="nfl",le="10"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="nfl",le="+Inf"} 0 alertmanager_oversize_gossip_message_duration_seconds_sum{key="nfl"} 0 alertmanager_oversize_gossip_message_duration_seconds_count{key="nfl"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="sil",le="0.005"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="sil",le="0.01"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="sil",le="0.025"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="sil",le="0.05"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="sil",le="0.1"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="sil",le="0.25"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="sil",le="0.5"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="sil",le="1"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="sil",le="2.5"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="sil",le="5"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="sil",le="10"} 0 alertmanager_oversize_gossip_message_duration_seconds_bucket{key="sil",le="+Inf"} 0 alertmanager_oversize_gossip_message_duration_seconds_sum{key="sil"} 0 alertmanager_oversize_gossip_message_duration_seconds_count{key="sil"} 0 # HELP alertmanager_oversized_gossip_message_dropped_total Number of oversized gossip messages that were dropped due to a full message queue. 
# TYPE alertmanager_oversized_gossip_message_dropped_total counter alertmanager_oversized_gossip_message_dropped_total{key="nfl"} 0 alertmanager_oversized_gossip_message_dropped_total{key="sil"} 0 # HELP alertmanager_oversized_gossip_message_failure_total Number of oversized gossip message sends that failed. # TYPE alertmanager_oversized_gossip_message_failure_total counter alertmanager_oversized_gossip_message_failure_total{key="nfl"} 0 alertmanager_oversized_gossip_message_failure_total{key="sil"} 0 # HELP alertmanager_oversized_gossip_message_sent_total Number of oversized gossip message sent. # TYPE alertmanager_oversized_gossip_message_sent_total counter alertmanager_oversized_gossip_message_sent_total{key="nfl"} 0 alertmanager_oversized_gossip_message_sent_total{key="sil"} 0 # HELP alertmanager_peer_position Position the Alertmanager instance believes it's in. The position determines a peer's behavior in the cluster. # TYPE alertmanager_peer_position gauge alertmanager_peer_position 0 # HELP alertmanager_receivers Number of configured receivers. # TYPE alertmanager_receivers gauge alertmanager_receivers 1 # HELP alertmanager_silences How many silences by state. # TYPE alertmanager_silences gauge alertmanager_silences{state="active"} 0 alertmanager_silences{state="expired"} 0 alertmanager_silences{state="pending"} 0 # HELP alertmanager_silences_gc_duration_seconds Duration of the last silence garbage collection cycle. # TYPE alertmanager_silences_gc_duration_seconds summary alertmanager_silences_gc_duration_seconds_sum 1.421e-06 alertmanager_silences_gc_duration_seconds_count 1 # HELP alertmanager_silences_gossip_messages_propagated_total Number of received gossip messages that have been further gossiped. # TYPE alertmanager_silences_gossip_messages_propagated_total counter alertmanager_silences_gossip_messages_propagated_total 0 # HELP alertmanager_silences_maintenance_errors_total How many maintenances were executed for silences that failed. # TYPE alertmanager_silences_maintenance_errors_total counter alertmanager_silences_maintenance_errors_total 0 # HELP alertmanager_silences_maintenance_total How many maintenances were executed for silences. # TYPE alertmanager_silences_maintenance_total counter alertmanager_silences_maintenance_total 1 # HELP alertmanager_silences_queries_total How many silence queries were received. # TYPE alertmanager_silences_queries_total counter alertmanager_silences_queries_total 16 # HELP alertmanager_silences_query_duration_seconds Duration of silence query evaluation. 
# TYPE alertmanager_silences_query_duration_seconds histogram alertmanager_silences_query_duration_seconds_bucket{le="0.005"} 13 alertmanager_silences_query_duration_seconds_bucket{le="0.01"} 13 alertmanager_silences_query_duration_seconds_bucket{le="0.025"} 13 alertmanager_silences_query_duration_seconds_bucket{le="0.05"} 13 alertmanager_silences_query_duration_seconds_bucket{le="0.1"} 13 alertmanager_silences_query_duration_seconds_bucket{le="0.25"} 13 alertmanager_silences_query_duration_seconds_bucket{le="0.5"} 13 alertmanager_silences_query_duration_seconds_bucket{le="1"} 13 alertmanager_silences_query_duration_seconds_bucket{le="2.5"} 13 alertmanager_silences_query_duration_seconds_bucket{le="5"} 13 alertmanager_silences_query_duration_seconds_bucket{le="10"} 13 alertmanager_silences_query_duration_seconds_bucket{le="+Inf"} 13 alertmanager_silences_query_duration_seconds_sum 3.3388e-05 alertmanager_silences_query_duration_seconds_count 13 # HELP alertmanager_silences_query_errors_total How many silence received queries did not succeed. # TYPE alertmanager_silences_query_errors_total counter alertmanager_silences_query_errors_total 0 # HELP alertmanager_silences_snapshot_duration_seconds Duration of the last silence snapshot. # TYPE alertmanager_silences_snapshot_duration_seconds summary alertmanager_silences_snapshot_duration_seconds_sum 4.817e-06 alertmanager_silences_snapshot_duration_seconds_count 1 # HELP alertmanager_silences_snapshot_size_bytes Size of the last silence snapshot in bytes. # TYPE alertmanager_silences_snapshot_size_bytes gauge alertmanager_silences_snapshot_size_bytes 0 # HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles. # TYPE go_gc_duration_seconds summary go_gc_duration_seconds{quantile="0"} 6.6062e-05 go_gc_duration_seconds{quantile="0.25"} 8.594e-05 go_gc_duration_seconds{quantile="0.5"} 0.000157875 go_gc_duration_seconds{quantile="0.75"} 0.00022753 go_gc_duration_seconds{quantile="1"} 0.000495779 go_gc_duration_seconds_sum 0.002599715 go_gc_duration_seconds_count 14 # HELP go_goroutines Number of goroutines that currently exist. # TYPE go_goroutines gauge go_goroutines 33 # HELP go_info Information about the Go environment. # TYPE go_info gauge go_info{version="go1.20.7"} 1 # HELP go_memstats_alloc_bytes Number of bytes allocated and still in use. # TYPE go_memstats_alloc_bytes gauge go_memstats_alloc_bytes 8.579632e+06 # HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed. # TYPE go_memstats_alloc_bytes_total counter go_memstats_alloc_bytes_total 2.3776552e+07 # HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table. # TYPE go_memstats_buck_hash_sys_bytes gauge go_memstats_buck_hash_sys_bytes 1.459904e+06 # HELP go_memstats_frees_total Total number of frees. # TYPE go_memstats_frees_total counter go_memstats_frees_total 144509 # HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata. # TYPE go_memstats_gc_sys_bytes gauge go_memstats_gc_sys_bytes 8.607616e+06 # HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use. # TYPE go_memstats_heap_alloc_bytes gauge go_memstats_heap_alloc_bytes 8.579632e+06 # HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used. # TYPE go_memstats_heap_idle_bytes gauge go_memstats_heap_idle_bytes 4.407296e+06 # HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use. 
# TYPE go_memstats_heap_inuse_bytes gauge go_memstats_heap_inuse_bytes 1.1845632e+07 # HELP go_memstats_heap_objects Number of allocated objects. # TYPE go_memstats_heap_objects gauge go_memstats_heap_objects 50067 # HELP go_memstats_heap_released_bytes Number of heap bytes released to OS. # TYPE go_memstats_heap_released_bytes gauge go_memstats_heap_released_bytes 4.112384e+06 # HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system. # TYPE go_memstats_heap_sys_bytes gauge go_memstats_heap_sys_bytes 1.6252928e+07 # HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection. # TYPE go_memstats_last_gc_time_seconds gauge go_memstats_last_gc_time_seconds 1.7076591024133735e+09 # HELP go_memstats_lookups_total Total number of pointer lookups. # TYPE go_memstats_lookups_total counter go_memstats_lookups_total 0 # HELP go_memstats_mallocs_total Total number of mallocs. # TYPE go_memstats_mallocs_total counter go_memstats_mallocs_total 194576 # HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures. # TYPE go_memstats_mcache_inuse_bytes gauge go_memstats_mcache_inuse_bytes 2400 # HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system. # TYPE go_memstats_mcache_sys_bytes gauge go_memstats_mcache_sys_bytes 15600 # HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures. # TYPE go_memstats_mspan_inuse_bytes gauge go_memstats_mspan_inuse_bytes 185120 # HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system. # TYPE go_memstats_mspan_sys_bytes gauge go_memstats_mspan_sys_bytes 195840 # HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place. # TYPE go_memstats_next_gc_bytes gauge go_memstats_next_gc_bytes 1.4392264e+07 # HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations. # TYPE go_memstats_other_sys_bytes gauge go_memstats_other_sys_bytes 597208 # HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator. # TYPE go_memstats_stack_inuse_bytes gauge go_memstats_stack_inuse_bytes 524288 # HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator. # TYPE go_memstats_stack_sys_bytes gauge go_memstats_stack_sys_bytes 524288 # HELP go_memstats_sys_bytes Number of bytes obtained from system. # TYPE go_memstats_sys_bytes gauge go_memstats_sys_bytes 2.7653384e+07 # HELP go_threads Number of OS threads created. # TYPE go_threads gauge go_threads 7 # HELP net_conntrack_dialer_conn_attempted_total Total number of connections attempted by the given dialer a given name. # TYPE net_conntrack_dialer_conn_attempted_total counter net_conntrack_dialer_conn_attempted_total{dialer_name="webhook"} 0 # HELP net_conntrack_dialer_conn_closed_total Total number of connections closed which originated from the dialer of a given name. # TYPE net_conntrack_dialer_conn_closed_total counter net_conntrack_dialer_conn_closed_total{dialer_name="webhook"} 0 # HELP net_conntrack_dialer_conn_established_total Total number of connections successfully established by the given dialer a given name. # TYPE net_conntrack_dialer_conn_established_total counter net_conntrack_dialer_conn_established_total{dialer_name="webhook"} 0 # HELP net_conntrack_dialer_conn_failed_total Total number of connections failed to dial by the dialer a given name. 
# TYPE net_conntrack_dialer_conn_failed_total counter net_conntrack_dialer_conn_failed_total{dialer_name="webhook",reason="refused"} 0 net_conntrack_dialer_conn_failed_total{dialer_name="webhook",reason="resolution"} 0 net_conntrack_dialer_conn_failed_total{dialer_name="webhook",reason="timeout"} 0 net_conntrack_dialer_conn_failed_total{dialer_name="webhook",reason="unknown"} 0 # HELP process_cpu_seconds_total Total user and system CPU time spent in seconds. # TYPE process_cpu_seconds_total counter process_cpu_seconds_total 1.46 # HELP process_max_fds Maximum number of open file descriptors. # TYPE process_max_fds gauge process_max_fds 4096 # HELP process_open_fds Number of open file descriptors. # TYPE process_open_fds gauge process_open_fds 13 # HELP process_resident_memory_bytes Resident memory size in bytes. # TYPE process_resident_memory_bytes gauge process_resident_memory_bytes 3.2780288e+07 # HELP process_start_time_seconds Start time of the process since unix epoch in seconds. # TYPE process_start_time_seconds gauge process_start_time_seconds 1.70765797104e+09 # HELP process_virtual_memory_bytes Virtual memory size in bytes. # TYPE process_virtual_memory_bytes gauge process_virtual_memory_bytes 7.55372032e+08 # HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes. # TYPE process_virtual_memory_max_bytes gauge process_virtual_memory_max_bytes 1.8446744073709552e+19 # HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served. # TYPE promhttp_metric_handler_requests_in_flight gauge promhttp_metric_handler_requests_in_flight 1 # HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code. # TYPE promhttp_metric_handler_requests_total counter promhttp_metric_handler_requests_total{code="200"} 3 promhttp_metric_handler_requests_total{code="500"} 0 promhttp_metric_handler_requests_total{code="503"} 0
Configure Alertmanager
Default configuration:
[root@mcw04 ~]# cat /etc/alertmanager/alertmanager.yml route: group_by: ['alertname'] group_wait: 30s group_interval: 5m repeat_interval: 1h receiver: 'web.hook' receivers: - name: 'web.hook' webhook_configs: - url: 'http://127.0.0.1:5001/' inhibit_rules: - source_match: severity: 'critical' target_match: severity: 'warning' equal: ['alertname', 'dev', 'instance'] [root@mcw04 ~]# ss -lntup|grep 5001 [root@mcw04 ~]#
Modified configuration:
[root@mcw04 ~]# cat /etc/alertmanager/alertmanager.yml global: smtp_smarthost: 'smtp.163.com:25' smtp_from: '13xx2@163.com' smtp_auth_username: '13xx32' smtp_auth_password: 'xxx3456' smtp_require_tls: false templates: - '/etc/alertmanager/template/*.tmpl' route: receiver: email receivers: - name: 'email' email_configs: - to: '8xx5@qq.com' [root@mcw04 ~]#
Create the template directory:
[root@mcw04 ~]# sudo mkdir -p /etc/alertmanager/template
Restart the service:
[root@mcw04 ~]# systemctl restart alertmanager.service
View the effective configuration.
It has been updated as shown below. Fields that never appeared in our configuration file are also listed here; judging by this output, they can presumably be overridden the same way if needed:
global: resolve_timeout: 5m http_config: follow_redirects: true enable_http2: true smtp_from: 135xx632@163.com smtp_hello: localhost smtp_smarthost: smtp.163.com:25 smtp_auth_username: "13xxx32" smtp_auth_password: <secret> smtp_require_tls: false pagerduty_url: https://events.pagerduty.com/v2/enqueue opsgenie_api_url: https://api.opsgenie.com/ wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/ victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/ telegram_api_url: https://api.telegram.org webex_api_url: https://webexapis.com/v1/messages route: receiver: email continue: false receivers: - name: email email_configs: - send_resolved: false to: 89xx15@qq.com from: 13xx32@163.com hello: localhost smarthost: smtp.163.com:25 auth_username: "13xx32" auth_password: <secret> headers: From: 13xx32@163.com Subject: '{{ template "email.default.subject" . }}' To: 89xx15@qq.com html: '{{ template "email.default.html" . }}' require_tls: false templates: - /etc/alertmanager/template/*.tmpl
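For reference, the effective configuration can also be pulled from the Alertmanager HTTP API instead of the web UI; a minimal check, assuming the v2 API of this Alertmanager 0.26 instance listening on 10.0.0.14:9093:

[root@mcw04 ~]# curl -s http://10.0.0.14:9093/api/v2/status
# the JSON reply contains the original configuration under "config" -> "original",
# along with version and cluster information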
Add alerting rules
Add the first alerting rule
Before the change:
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'. rule_files: - "rules/node_rules.yml" # - "first_rules.yml" # - "second_rules.yml"
After the change:
[root@mcw03 ~]# vim /etc/prometheus.yml # Load rules once and periodically evaluate them according to the global 'evaluation_interval'. rule_files: - "rules/*_rules.yml" - "rules/*_alerts.yml"
These are the recording rules added earlier:
[root@mcw03 ~]# cat /etc/rules/node_rules.yml groups: - name: node_rules interval: 10s rules: - record: instance:node_cpu:avg_rate5m expr: 100 - avg(irate(node_cpu_seconds_total{job='agent1',mode='idle'}[5m])) by (instance)*100 - record: instace:node_memory_usage:percentage expr: (node_memory_MemTotal_bytes-(node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes))/node_memory_MemTotal_bytes*100 labels: metric_type: aggregation name: machangwei - name: xiaoma_rules rules: - record: mcw:diskusage expr: (node_filesystem_size_bytes{mountpoint="/"}-node_filesystem_free_bytes{mountpoint="/"})/node_filesystem_size_bytes{mountpoint="/"}*100 [root@mcw03 ~]#
After this change and a reload, the recording rules keep working as before.
The alerting rule below relies on the first recording rule.
Edit the alert rule file. HighNodeCPU is the alert name; under it, expr can reference a raw metric or a recording rule, combined with a comparison operator to set the firing threshold:
[root@mcw03 ~]# ls /etc/rules/ node_rules.yml [root@mcw03 ~]# vim /etc/rules/node_alerts.yml [root@mcw03 ~]# cat /etc/rules/node_alerts.yml groups: - name: node_alerts rules: - alert: HighNodeCPU expr: instance:node_cpu:avg_rete5m > 80 for: 60m labels: servrity: warning annotations: summary: High Node CPU for 1 hour console: You might want to check the Node Dashboard at http://grafana.example.com/dashboard/db/node-dashboard [root@mcw03 ~]# ls /etc/rules/ node_alerts.yml node_rules.yml [root@mcw03 ~]#
Reload:
[root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload
After reloading, refresh the Alerts page and click the green entry.
You can see the alerting rule we defined.
Triggering the alert and configuring email notification
To trigger the alert, lower the threshold and set for to 10s. The recording-rule name above was misspelled, so fix rete to rate; with the threshold set to greater than 1, the rule fires. Reload the configuration:
[root@mcw03 ~]# vim /etc/rules/node_alerts.yml [root@mcw03 ~]# cat /etc/rules/node_alerts.yml groups: - name: node_alerts rules: - alert: HighNodeCPU expr: instance:node_cpu:avg_rate5m > 1 for: 10s labels: servrity: warning annotations: summary: High Node CPU for 1 hour console: You might want to check the Node Dashboard at http://grafana.example.com/dashboard/db/node-dashboard [root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload [root@mcw03 ~]#
The expression browser shows that one machine exceeds the alerting threshold.
The Alerts page now shows one active alert; previously it showed 0 active in green.
Expanding it shows the details of the firing alert.
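As a side note, pending and firing alerts are also exposed by Prometheus as the built-in ALERTS time series, so the same information can be queried in the expression browser; a small sketch using the rule name from this setup:

# every alert produced by the rule, with alertstate="pending" or "firing"
ALERTS{alertname="HighNodeCPU"}
# only the alerts that are actually firing
ALERTS{alertname="HighNodeCPU",alertstate="firing"}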
The alert also shows up on the Alertmanager page.
Clicking Info shows the annotations we registered in the alerting rule.
Clicking Source
jumps to the Prometheus expression-browser URL, so add a resolution record for that hostname on the laptop.
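A sketch of that resolution record, assuming the Prometheus server is the host mcw03 at 10.0.0.13 as in this setup (use whatever hostname the Source link actually points at):

# /etc/hosts on Linux/macOS, C:\Windows\System32\drivers\etc\hosts on Windows
10.0.0.13  mcw03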
After adding the resolution record, refreshing the page shows the expected result.
A while later, the state has changed.
No email was sent; the logs show a DNS resolution problem:
[root@mcw04 ~]# tail /var/log/messages Feb 11 23:56:14 mcw04 alertmanager: ts=2024-02-11T15:56:14.706Z caller=notify.go:745 level=warn component=dispatcher receiver=email integration=email[0] aggrGroup={}:{} msg="Notify attempt failed, will retry later" attempts=3 err="establish connection to server: dial tcp: lookup smtp.163.com on 223.5.5.5:53: read udp 192.168.80.4:34027->223.5.5.5:53: i/o timeout" F
After restarting the network, DNS resolves again, but the notification still fails:
[root@mcw04 ~]# systemctl restart network [root@mcw04 ~]# [root@mcw04 ~]# [root@mcw04 ~]# ping www.baidu.com PING www.a.shifen.com (220.181.38.149) 56(84) bytes of data. 64 bytes from 220.181.38.149 (220.181.38.149): icmp_seq=1 ttl=128 time=18.2 ms 64 bytes from 220.181.38.149 (220.181.38.149): icmp_seq=2 ttl=128 time=16.1 ms ^C --- www.a.shifen.com ping statistics --- 2 packets transmitted, 2 received, 0% packet loss, time 1002ms rtt min/avg/max/mdev = 16.161/17.199/18.237/1.038 ms [root@mcw04 ~]# [root@mcw04 ~]# [root@mcw04 ~]# [root@mcw04 ~]# tail /var/log/messages Feb 11 23:59:42 mcw04 network: [ OK ] Feb 11 23:59:42 mcw04 systemd: Started LSB: Bring up/down networking. Feb 11 23:59:43 mcw04 kernel: IPv6: ens33: IPv6 duplicate address fe80::495b:ff7:d185:f95d detected! Feb 11 23:59:43 mcw04 NetworkManager[865]: <info> [1707667183.2015] device (ens33): ipv6: duplicate address check failed for the fe80::495b:ff7:d185:f95d/64 lft forever pref forever lifetime 90305-0[4294967295,4294967295] dev 2 flags tentative,permanent,0x8 src kernel address Feb 11 23:59:43 mcw04 kernel: IPv6: ens33: IPv6 duplicate address fe80::f32c:166d:40de:8f2e detected! Feb 11 23:59:43 mcw04 NetworkManager[865]: <info> [1707667183.7803] device (ens33): ipv6: duplicate address check failed for the fe80::f32c:166d:40de:8f2e/64 lft forever pref forever lifetime 90305-0[4294967295,4294967295] dev 2 flags tentative,permanent,0x8 src kernel address Feb 11 23:59:43 mcw04 NetworkManager[865]: <warn> [1707667183.7803] device (ens33): linklocal6: failed to generate an address: Too many DAD collisions Feb 11 23:59:52 mcw04 alertmanager: ts=2024-02-11T15:59:52.266Z caller=notify.go:745 level=warn component=dispatcher receiver=email integration=email[0] aggrGroup={}:{} msg="Notify attempt failed, will retry later" attempts=14 err="*email.loginAuth auth: 550 User has no permission" Feb 12 00:00:44 mcw04 alertmanager: ts=2024-02-11T16:00:44.697Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="email/email[0]: notify retry canceled after 15 attempts: *email.loginAuth auth: 550 User has no permission" Feb 12 00:00:45 mcw04 alertmanager: ts=2024-02-11T16:00:45.028Z caller=notify.go:745 level=warn component=dispatcher receiver=email integration=email[0] aggrGroup={}:{} msg="Notify attempt failed, will retry later" attempts=1 err="*email.loginAuth auth: 550 User has no permission" [root@mcw04 ~]#
SMTP/POP3 access was not enabled for the mailbox; after enabling it, the error changes to authentication failed:
[root@mcw04 ~]# tail /var/log/messages
Feb 12 00:15:44 mcw04 alertmanager: ts=2024-02-11T16:15:44.700Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="email/email[0]: notify retry canceled after 16 attempts: *email.loginAuth auth: 550 User has no permission"
Feb 12 00:15:45 mcw04 alertmanager: ts=2024-02-11T16:15:45.048Z caller=notify.go:745 level=warn component=dispatcher receiver=email integration=email[0] aggrGroup={}:{} msg="Notify attempt failed, will retry later" attempts=1 err="*email.loginAuth auth: 550 User has no permission"
Feb 12 00:20:44 mcw04 alertmanager: ts=2024-02-11T16:20:44.700Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="email/email[0]: notify retry canceled after 16 attempts: *email.loginAuth auth: 550 User has no permission"
Feb 12 00:20:45 mcw04 alertmanager: ts=2024-02-11T16:20:45.055Z caller=notify.go:745 level=warn component=dispatcher receiver=email integration=email[0] aggrGroup={}:{} msg="Notify attempt failed, will retry later" attempts=1 err="*email.loginAuth auth: 550 User has no permission"
Feb 12 00:25:33 mcw04 grafana-server: logger=cleanup t=2024-02-12T00:25:33.606714112+08:00 level=info msg="Completed cleanup jobs" duration=37.876505ms
Feb 12 00:25:44 mcw04 alertmanager: ts=2024-02-11T16:25:44.701Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="email/email[0]: notify retry canceled after 16 attempts: *email.loginAuth auth: 550 User has no permission"
Feb 12 00:25:45 mcw04 alertmanager: ts=2024-02-11T16:25:45.032Z caller=notify.go:745 level=warn component=dispatcher receiver=email integration=email[0] aggrGroup={}:{} msg="Notify attempt failed, will retry later" attempts=1 err="*email.loginAuth auth: 550 User has no permission"
Feb 12 00:28:10 mcw04 alertmanager: ts=2024-02-11T16:28:10.588Z caller=notify.go:745 level=warn component=dispatcher receiver=email integration=email[0] aggrGroup={}:{} msg="Notify attempt failed, will retry later" attempts=13 err="*email.loginAuth auth: 535 Error: authentication failed"
Feb 12 00:30:44 mcw04 alertmanager: ts=2024-02-11T16:30:44.703Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="email/email[0]: notify retry canceled after 16 attempts: *email.loginAuth auth: 535 Error: authentication failed"
Feb 12 00:30:45 mcw04 alertmanager: ts=2024-02-11T16:30:45.389Z caller=notify.go:745 level=warn component=dispatcher receiver=email integration=email[0] aggrGroup={}:{} msg="Notify attempt failed, will retry later" attempts=1 err="*email.loginAuth auth: 535 Error: authentication failed"
[root@mcw04 ~]#
Change the configuration as follows:
[root@mcw04 ~]# cat /etc/alertmanager/alertmanager.yml global: smtp_smarthost: 'smtp.163.com:465' smtp_from: '自己邮箱@163.com' smtp_auth_username: '自己邮箱32@163.com' smtp_auth_password: '自己的smtp授权密码' smtp_require_tls: false templates: - '/etc/alertmanager/template/*.tmpl' route: receiver: email receivers: - name: 'email' email_configs: - to: '8发送给那个邮箱5@qq.com' [root@mcw04 ~]#
Only after restarting Alertmanager was the email finally sent successfully.
The alert email looks like this.
Comparing them: the annotations registered in the rule and the labels attached when it fired are all included in the email, and the severity label we defined ourselves is there too.
Reference Alertmanager email configuration
Reference: https://blog.csdn.net/qq_42527269/article/details/128914049
global: resolve_timeout: 5m smtp_smarthost: 'smtp.163.com:465' smtp_from: '自己邮箱@163.com' smtp_auth_username: '自己邮箱@163.com' smtp_auth_password: 'PLAPPSJXJCQABYAF' smtp_require_tls: false templates: - 'template/*.tmpl' route: group_by: ['alertname'] group_wait: 30s group_interval: 5m repeat_interval: 20m receiver: 'email' receivers: - name: 'email' email_configs: - to: '接收人邮箱@qq.com' html: '{{ template "test.html" . }}' send_resolved: true
Add a new alert and template entries that pull in label values and the metric value:
annotations: summary: Host {{ $labels.instance }} of {{ $labels.job }} is up! myname: xiaoma {{ humanize $value }}
Rename the original alert rule file to a "2" suffix and reload:
[root@mcw03 ~]# mv /etc/rules/node_alerts.yml /etc/rules/node_alerts2.yml [root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload [root@mcw03 ~]#
The file no longer matches the glob (rules/*_alerts.yml).
Rename it again:
[root@mcw03 ~]# ls /etc/rules/ node_alerts2.yml node_rules.yml [root@mcw03 ~]# mv /etc/rules/node_alerts2.yml /etc/rules/node2_alerts.yml [root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload [root@mcw03 ~]#
After a refresh, the data that had disappeared is back, the alert fires, and an email notification is sent.
Recreate a file with the original name as well, so there are two alert rule files.
To use labels inside annotations, reference them as variables from $labels:
[root@mcw03 ~]# vim /etc/rules/node_alerts.yml [root@mcw03 ~]# cat /etc/rules/node_alerts.yml groups: - name: node_alerts rules: - alert: DiskWillFillIn4Hours expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h],4*3600) < 0 for: 5m labels: severity: critical annotations: summary: Disk on {{ $labels.instance }} will fill in approximately 4 hours - alert: InstanceDown expr: up{job="node"} == 0 for: 10m labels: severity: critical annotations: summary: Host {{ $labels.instance }} of {{ $labels.job }} is down! [root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload [root@mcw03 ~]#
The Rules page now shows the two alerting rules.
Change the disk-fill prediction threshold from 0 to 102400000000 and for to 10s to trigger the alert:
[root@mcw03 ~]# vim /etc/rules/node_alerts.yml [root@mcw03 ~]# cat /etc/rules/node_alerts.yml groups: - name: node_alerts rules: - alert: DiskWillFillIn4Hours expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h],4*3600) < 102400000000 for: 10s labels: severity: critical annotations: summary: Disk on {{ $labels.instance }} will fill in approximately 4 hours - alert: InstanceDown expr: up{job="node"} == 0 for: 10m labels: severity: critical annotations: summary: Host {{ $labels.instance }} of {{ $labels.job }} is down! [root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload [root@mcw03 ~]#
This rule fired four alerts, and an email was sent.
All four alerts went out together in one email, with their labels and annotations as the body; every label variable was rendered with the corresponding machine's own label value.
After reverting the change, the alerts clear.
Filter by label.
It errors out.
Change the job to docker and change the expression so that a value of 1 triggers the alert; set for to 10s and add an annotation that pulls in the expression value, which is 1.
In the resulting email, all the alerts are aggregated into one message, and the expression value shows up in the annotation.
The annotation picks up the expression value of 1.
Getting the expression value:
[root@mcw03 ~]# cat /etc/rules/node_alerts.yml groups: - name: node_alerts rules: - alert: DiskWillFillIn4Hours expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h],4*3600) < 102400000000 for: 10s labels: severity: critical annotations: summary: Disk on {{ $labels.instance }} will fill in approximately 4 hours - alert: InstanceDown expr: up{job="docker"} == 1 for: 10s labels: severity: critical annotations: summary: Host {{ $labels.instance }} of {{ $labels.job }} is up! myname: xiaoma {{ humanize $value }} [root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload [root@mcw03 ~]#
Prometheus self-monitoring alerts
[root@mcw03 ~]# touch /etc/rules/prometheus_alerts.yml [root@mcw03 ~]# vim /etc/rules/prometheus_alerts.yml [root@mcw03 ~]# cat /etc/rules/prometheus_alerts.yml groups: - name: prometheus_alerts rules: - alert: PrometheusConfigReloadFailed expr: prometheus_config_last_reload_successful == 0 for: 10m labels: severity: warning annotations: description: Reloading Prometheus configuration has failed on {{ $labels.instance }} . - alert: PrometheusNotConnectedToAlertmanagers expr: prometheus_notifications_alertmanagers_discovered < 1 for: 10m labels: severity: warning annotations: description: Prometheus {{ $labels.instance }} is not connected to any Alertmanagers [root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload [root@mcw03 ~]#
Break the configuration so the reload fails. Since for is 10 minutes, the notification presumably only goes out if the condition still holds 10 minutes later:
[root@mcw03 ~]# vim /etc/prometheus.yml [root@mcw03 ~]# tail -2 /etc/prometheus.yml # action: labeldrop xxxxx [root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload failed to reload config: couldn't load configuration (--config.file="/etc/prometheus.yml"): parsing YAML file /etc/prometheus.yml: yaml: line 53: did not find expected key [root@mcw03 ~]#
Fix the configuration, then change for in the alerting rule to 10s, i.e. notify if the state persists for 10 seconds. Then break the configuration again so the reload fails and the alert triggers:
[root@mcw03 ~]# vim /etc/rules/prometheus_alerts.yml [root@mcw03 ~]# grep 10 /etc/rules/prometheus_alerts.yml for: 10s for: 10m [root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload [root@mcw03 ~]# echo xxx >>/etc/prometheus.yml [root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload failed to reload config: couldn't load configuration (--config.file="/etc/prometheus.yml"): parsing YAML file /etc/prometheus.yml: yaml: line 55: could not find expected ':' [root@mcw03 ~]#
Previously the alert was yellow (pending); now it is red, which should mean the notification was sent.
The reload-failure alert email arrived, just with quite a delay. The email subject appears to be the labels joined together.
Availability alerts (services, up hosts, missing metrics)
Service availability
Earlier we enabled the systemd collector, collecting only three services.
Find services whose active state is not 1, i.e. services that are unhealthy, and alert on them.
Write the alerting rule file:
[root@mcw03 ~]# vim /etc/rules/keyongxing_alerts.yml [root@mcw03 ~]# cat /etc/rules/keyongxing_alerts.yml groups: - name: keyongxing_alerts rules: - alert: NodeServiceDown expr: node_systemd_unit_state{state="active"} == 0 for: 60s labels: severity: critical annotations: summary: Service {{ $labels.name }} on {{ $labels.instance }} is nolonger active! description: Werner Heisenberg says - "OMG Where's my service?" [root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload [root@mcw03 ~]#
[root@mcw03 ~]# vim /etc/rules/keyongxing_alerts.yml [root@mcw03 ~]# cat /etc/rules/keyongxing_alerts.yml groups: - name: Keyongxing_alerts rules: - alert: NodeServiceDown expr: node_systemd_unit_state{state="active"} != 1 for: 60s labels: severity: critical annotations: summary: Service {{ $labels.name }} on {{ $labels.instance }} is nolonger active! description: Werner Heisenberg says - "OMG Where's my service?" [root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload
The alerting rule does not show up, and there is an error; clear the error first:
Feb 12 13:32:32 mcw03 prometheus: level=error ts=2024-02-12T05:32:32.909623139Z caller=file.go:321 component="discovery manager scrape" discovery=file msg="Error reading file" path=/etc/targets/docker/daemons.yml err="yaml: unmarshal errors:\n line 4: field datacenter not found in type struct { Targets []string \"yaml:\\\"targets\\\"\"; Labels model.LabelSet \"yaml:\\\"labels\\\"\" }"
After the fix, the alerting rule still does not appear.
[root@mcw03 ~]# cat /etc/targets/docker/daemons.yml - targets: - "10.0.0.12:8080" - labels: "datacenter": "mcwymlhome" [root@mcw03 ~]# vim /etc/targets/docker/daemons.yml [root@mcw03 ~]# cat /etc/targets/docker/daemons.yml - targets: - "10.0.0.12:8080" [root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload [root@mcw03 ~]#
Put a copy of it somewhere else and reload again:
[root@mcw03 ~]# cat /etc/rules/keyongxing_alerts.yml groups: - name: Keyongxing_alerts rules: - alert: NodeServiceDown expr: node_systemd_unit_state{state="active"} != 1 for: 60s labels: severity: critical annotations: summary: Service {{ $labels.name }} on {{ $labels.instance }} is nolonger active! description: Werner Heisenberg says - "OMG Where's my service?" [root@mcw03 ~]# vim /etc/rules/node_rules.yml [root@mcw03 ~]# cat /etc/rules/node_rules.yml groups: - name: node_rules interval: 10s rules: - record: instance:node_cpu:avg_rate5m expr: 100 - avg(irate(node_cpu_seconds_total{job='agent1',mode='idle'}[5m])) by (instance)*100 - record: instace:node_memory_usage:percentage expr: (node_memory_MemTotal_bytes-(node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes))/node_memory_MemTotal_bytes*100 labels: metric_type: aggregation name: machangwei - name: xiaoma_rules rules: - record: mcw:diskusage expr: (node_filesystem_size_bytes{mountpoint="/"}-node_filesystem_free_bytes{mountpoint="/"})/node_filesystem_size_bytes{mountpoint="/"}*100 - alert: NodeServiceDown expr: node_systemd_unit_state{state="active"} != 1 for: 60s labels: severity: critical annotations: summary: Service {{ $labels.name }} on {{ $labels.instance }} is nolonger active! description: Werner Heisenberg says - "OMG Where's my service?" [root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload [root@mcw03 ~]#
As you can see, duplicate rules with the same name are allowed; there are now two of them. What we should look for is the alert name, not the group's -name field; we had been searching for the wrong thing.
Stop the service to trigger the alert.
Stopping the service:
[root@mcw02 ~]# systemctl status rsyslog.service ● rsyslog.service - System Logging Service Loaded: loaded (/usr/lib/systemd/system/rsyslog.service; enabled; vendor preset: enabled) Active: active (running) since Sat 2024-02-10 22:40:47 CST; 1 day 15h ago Docs: man:rsyslogd(8) http://www.rsyslog.com/doc/ Main PID: 1053 (rsyslogd) Memory: 68.0K CGroup: /system.slice/rsyslog.service └─1053 /usr/sbin/rsyslogd -n Feb 10 22:40:42 mcw02 systemd[1]: Starting System Logging Service... Feb 10 22:40:44 mcw02 rsyslogd[1053]: [origin software="rsyslogd" swVersion="8.24.0" x-pid="1053" x-info="http://www.rsyslog.com"] start Feb 10 22:40:44 mcw02 rsyslogd[1053]: imjournal: fscanf on state file `/var/lib/rsyslog/imjournal.state' failed [v8.24.0 try http://www.rsyslog.com/e/2027 ] Feb 10 22:40:44 mcw02 rsyslogd[1053]: imjournal: ignoring invalid state file [v8.24.0] Feb 10 22:40:47 mcw02 systemd[1]: Started System Logging Service. Feb 11 03:48:04 mcw02 rsyslogd[1053]: [origin software="rsyslogd" swVersion="8.24.0" x-pid="1053" x-info="http://www.rsyslog.com"] rsyslogd was HUPed [root@mcw02 ~]# [root@mcw02 ~]# systemctl stop rsyslog.service [root@mcw02 ~]# systemctl status rsyslog.service ● rsyslog.service - System Logging Service Loaded: loaded (/usr/lib/systemd/system/rsyslog.service; enabled; vendor preset: enabled) Active: inactive (dead) since Mon 2024-02-12 13:43:13 CST; 2s ago Docs: man:rsyslogd(8) http://www.rsyslog.com/doc/ Process: 1053 ExecStart=/usr/sbin/rsyslogd -n $SYSLOGD_OPTIONS (code=exited, status=0/SUCCESS) Main PID: 1053 (code=exited, status=0/SUCCESS) Feb 10 22:40:42 mcw02 systemd[1]: Starting System Logging Service... Feb 10 22:40:44 mcw02 rsyslogd[1053]: [origin software="rsyslogd" swVersion="8.24.0" x-pid="1053" x-info="http://www.rsyslog.com"] start Feb 10 22:40:44 mcw02 rsyslogd[1053]: imjournal: fscanf on state file `/var/lib/rsyslog/imjournal.state' failed [v8.24.0 try http://www.rsyslog.com/e/2027 ] Feb 10 22:40:44 mcw02 rsyslogd[1053]: imjournal: ignoring invalid state file [v8.24.0] Feb 10 22:40:47 mcw02 systemd[1]: Started System Logging Service. Feb 11 03:48:04 mcw02 rsyslogd[1053]: [origin software="rsyslogd" swVersion="8.24.0" x-pid="1053" x-info="http://www.rsyslog.com"] rsyslogd was HUPed Feb 12 13:43:13 mcw02 systemd[1]: Stopping System Logging Service... Feb 12 13:43:13 mcw02 rsyslogd[1053]: [origin software="rsyslogd" swVersion="8.24.0" x-pid="1053" x-info="http://www.rsyslog.com"] exiting on signal 15. Feb 12 13:43:13 mcw02 systemd[1]: Stopped System Logging Service. [root@mcw02 ~]#
The state-not-equal-to-1 condition has now been triggered.
The alert notification is sent. Although the same rule is written twice and both copies are firing, only one notification went out here, which is reasonable.
After the service is started again, the alert clears.
Machine availability
Take the average.
Group and aggregate: compute the average of up per job.
If a job's average of up drops below one half, i.e. 50% of its instances cannot be scraped, that can be used to trigger an alert.
Sum up per job to count the instances that are up.
Count the number of up series per job.
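As a sketch, the expressions described above can be written like this (the 0.5 threshold is just an example):

# average of up per job; below 0.5 means more than half of the instances fail scraping
avg by (job) (up) < 0.5
# number of instances per job that are currently up
sum by (job) (up)
# total number of scraped instances per job
count by (job) (up)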
Missing-metric alerts
The idea: with absent(), if the metric exists, no data is returned; if it does not exist, the value 1 is returned. This is used to check whether a metric (or an expression) is missing, i.e. to detect absent metrics.
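A quick way to see this behaviour in the expression browser (agent1 is a real job in this setup; no_such_job is an assumed, nonexistent one):

# the job exists and is scraped: absent() returns no data
absent(up{job="agent1"})
# the job does not exist: absent() returns a single series with value 1
absent(up{job="no_such_job"})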
[root@mcw03 ~]# cat /etc/rules/node_rules.yml groups: - name: node_rules interval: 10s rules: - record: instance:node_cpu:avg_rate5m expr: 100 - avg(irate(node_cpu_seconds_total{job='agent1',mode='idle'}[5m])) by (instance)*100 - record: instace:node_memory_usage:percentage expr: (node_memory_MemTotal_bytes-(node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes))/node_memory_MemTotal_bytes*100 labels: metric_type: aggregation name: machangwei - name: xiaoma_rules rules: - record: mcw:diskusage expr: (node_filesystem_size_bytes{mountpoint="/"}-node_filesystem_free_bytes{mountpoint="/"})/node_filesystem_size_bytes{mountpoint="/"}*100 - alert: InstanceGone expr: absent(up{job="agent1"}) for: 10s labels: severity: critical annotations: summary: Host {{ $labels.name }} is nolonger reporting! description: ‘Werner Heisenberg says - "OMG Where are my instances?" [root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload [root@mcw03 ~]#
Change the job to one that does not exist, so the value is missing, and the alert fires:
[root@mcw04 ~]# cat /etc/alertmanager/alertmanager.yml global: smtp_smarthost: 'smtp.163.com:465' smtp_from: '13xx2@163.com' smtp_auth_username: '13xx2@163.com' smtp_auth_password: 'ExxxxNW' smtp_require_tls: false templates: - '/etc/alertmanager/template/*.tmpl' route: receiver: email receivers: - name: 'email' email_configs: - to: '8xx5@qq.com' [root@mcw04 ~]#
After the change: under route there are child routes, and receivers can be matched with label matchers (match) or regular expressions (match_re). Multiple receivers can be defined under receivers; a route refers to them by the name defined for each receiver:
[root@mcw04 ~]# cat /etc/alertmanager/alertmanager.yml global: smtp_smarthost: 'smtp.163.com:465' smtp_from: '13xx32@163.com' smtp_auth_username: '13xxx32@163.com' smtp_auth_password: 'EHxxNW' smtp_require_tls: false templates: - '/etc/alertmanager/template/*.tmpl' route: group_by: ['instance','cluster'] group_wait: 30s group_interval: 5m repeat_interval: 3h receiver: email routes: - match: severity: critical receiver: pager - match_re: serverity: ^(warning|critical)$ receiver: support_team receivers: - name: 'email' email_configs: - to: '8xx15@qq.com' - name: 'support_team' email_configs: - to: '89xx5@qq.com' - name: 'pager' email_configs: - to: '13xx32@163.com' [root@mcw04 ~]# systemctl restart alertmanager.service [root@mcw04 ~]#
At first, none of what I set up triggered any alerts.
Then I noticed that I had written the alerting rule underneath the recording rules. The difference between the two is just record versus alert, and here they can be mixed in the same group.
Pick three alerts, two labeled critical and one labeled warning, and trigger them manually.
As shown below, alerts carrying the critical label were all sent to the 163 mailbox, while the warning one went to the QQ mailbox: based on label match or regex, the alerts were routed to different receivers and therefore delivered to different destinations.
Routing table (matching on multiple conditions)
The current configuration is:
[root@mcw04 ~]# cat /etc/alertmanager/alertmanager.yml global: smtp_smarthost: 'smtp.163.com:465' smtp_from: '13xx32@163.com' smtp_auth_username: '13xx2@163.com' smtp_auth_password: 'ExxSRNW' smtp_require_tls: false templates: - '/etc/alertmanager/template/*.tmpl' route: group_by: ['instance','cluster'] group_wait: 30s group_interval: 5m repeat_interval: 3h receiver: email routes: - match: severity: critical receiver: pager - match_re: serverity: ^(warning|critical)$ receiver: support_team receivers: - name: 'email' email_configs: - to: '8xx5@qq.com' - name: 'support_team' email_configs: - to: '8xx5@qq.com' - name: 'pager' email_configs: - to: '13xx2@163.com' [root@mcw04 ~]#
critical matches the pager receiver, and pager is the 163 mailbox.
Stop a service to trigger an alert carrying the critical label:
[root@mcw02 ~]# systemctl stop rsyslog.service
[root@mcw02 ~]#
The 163 mailbox received the alert,
but the QQ mailbox did not.
Now restart the service so the alert resolves, then change the routing configuration.
First, add a label to this alerting rule:
[root@mcw03 ~]# cat /etc/rules/keyongxing_alerts.yml groups: - name: Keyongxing_alerts rules: - alert: NodeServiceDown expr: node_systemd_unit_state{state="active"} != 1 for: 60s labels: severity: critical service: machangweiapp annotations: summary: Service {{ $labels.name }} on {{ $labels.instance }} is nolonger active! description: Werner Heisenberg says - "OMG Where's my service?" [root@mcw03 ~]# curl -X POST http://localhost:9090/-/reload [root@mcw03 ~]#
This is the label that was added.
Looking only at the route section of the configuration: under the first match we nest another route. With this multi-level matching, alerts with the critical label go to the 163 mailbox, and those that additionally match service: machangweiapp are sent to the QQ mailbox.
[root@mcw04 ~]# cat /etc/alertmanager/alertmanager.yml ..... route: group_by: ['instance','cluster'] group_wait: 30s group_interval: 5m repeat_interval: 3h receiver: email routes: - match: severity: critical receiver: pager routes: - match: service: machangweiapp receiver: support_team - match_re: serverity: ^(warning|critical)$ receiver: support_team receivers: - name: 'email' email_configs: - to: '8x5@qq.com' - name: 'support_team' email_configs: - to: '8x15@qq.com' - name: 'pager' email_configs: - to: '13x2@163.com' [root@mcw04 ~]#
[root@mcw04 ~]# systemctl restart alertmanager.service
[root@mcw04 ~]#
Now trigger an alert to test it.
Something is wrong with the Alertmanager service:
[root@mcw04 ~]# systemctl status alertmanager.service ● alertmanager.service - Prometheus Alertmanager Loaded: loaded (/etc/systemd/system/alertmanager.service; disabled; vendor preset: disabled) Active: failed (Result: start-limit) since Mon 2024-02-12 16:41:26 CST; 7min ago Process: 29042 ExecStart=/usr/local/bin/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/var/lib/alertmanager/ (code=exited, status=1/FAILURE) Main PID: 29042 (code=exited, status=1/FAILURE) Feb 12 16:41:26 mcw04 systemd[1]: alertmanager.service: main process exited, code=exited, status=1/FAILURE Feb 12 16:41:26 mcw04 systemd[1]: Unit alertmanager.service entered failed state. Feb 12 16:41:26 mcw04 systemd[1]: alertmanager.service failed. Feb 12 16:41:26 mcw04 systemd[1]: alertmanager.service holdoff time over, scheduling restart. Feb 12 16:41:26 mcw04 systemd[1]: start request repeated too quickly for alertmanager.service Feb 12 16:41:26 mcw04 systemd[1]: Failed to start Prometheus Alertmanager. Feb 12 16:41:26 mcw04 systemd[1]: Unit alertmanager.service entered failed state. Feb 12 16:41:26 mcw04 systemd[1]: alertmanager.service failed. [root@mcw04 ~]# less /var/log/messages [root@mcw04 ~]# tail -6 /var/log/messages Feb 12 16:41:26 mcw04 systemd: alertmanager.service holdoff time over, scheduling restart. Feb 12 16:41:26 mcw04 systemd: start request repeated too quickly for alertmanager.service Feb 12 16:41:26 mcw04 systemd: Failed to start Prometheus Alertmanager. Feb 12 16:41:26 mcw04 systemd: Unit alertmanager.service entered failed state. Feb 12 16:41:26 mcw04 systemd: alertmanager.service failed. Feb 12 16:45:33 mcw04 grafana-server: logger=cleanup t=2024-02-12T16:45:33.590972694+08:00 level=info msg="Completed cleanup jobs" duration=22.429387ms [root@mcw04 ~]#
It failed because the configuration file is wrong:
[root@mcw04 ~]# journalctl -u alertmanager Feb 12 16:50:08 mcw04 alertmanager[29189]: ts=2024-02-12T08:50:08.744Z caller=cluster.go:186 level=info component=cluster msg="setting advertise address explicitly" addr=10.0.0.14 Feb 12 16:50:08 mcw04 alertmanager[29189]: ts=2024-02-12T08:50:08.751Z caller=cluster.go:683 level=info component=cluster msg="Waiting for gossip to settle..." interval=2s Feb 12 16:50:08 mcw04 alertmanager[29189]: ts=2024-02-12T08:50:08.782Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alert Feb 12 16:50:08 mcw04 alertmanager[29189]: ts=2024-02-12T08:50:08.782Z caller=coordinator.go:118 level=error component=configuration msg="Loading configuration file failed" file=/e Feb 12 16:50:08 mcw04 alertmanager[29189]: ts=2024-02-12T08:50:08.783Z caller=cluster.go:692 level=info component=cluster msg="gossip not settled but continuing anyway" polls=0 ela Feb 12 16:50:08 mcw04 systemd[1]: alertmanager.service: main process exited, code=exited, status=1/FAILURE Feb 12 16:50:08 mcw04 systemd[1]: Unit alertmanager.service entered failed state. Feb 12 16:50:08 mcw04 systemd[1]: alertmanager.service failed. Feb 12 16:50:08 mcw04 systemd[1]: alertmanager.service holdoff time over, scheduling restart. Feb 12 16:50:08 mcw04 systemd[1]: start request repeated too quickly for alertmanager.service Feb 12 16:50:08 mcw04 systemd[1]: Failed to start Prometheus Alertmanager. Feb 12 16:50:08 mcw04 systemd[1]: Unit alertmanager.service entered failed state. Feb 12 16:50:08 mcw04 systemd[1]: alertmanager.service failed.
These lines should be aligned at the same indentation level.
Change it as follows; the two route match blocks had extra leading spaces:
route: group_by: ['instance','cluster'] group_wait: 30s group_interval: 5m repeat_interval: 3h receiver: email routes: - match: severity: critical receiver: pager routes: - match: service: machangweiapp receiver: support_team - match_re: serverity: ^(warning|critical)$ receiver: support_team
After restarting again, it is healthy.
As expected, the alert was sent to the QQ mailbox,
and the 163 mailbox did not receive it.
With continue enabled, a matching route apparently keeps evaluating the following routes as well; use it when an alert should be delivered to multiple destinations. The default is false.
routes: - match: severity: critical receiver: pager continue: true
Receivers and notification templates
Receivers
Add a slack_configs section under the pager receiver:
- name: 'pager' email_configs: - to: '13x32@163.com' slack_configs: - api_url: https://hooks.slack.com/services/ABC123/ABC123/EXAMPLE channel: #monitoring text: '{{ .CommonAnnotations.summary }}'
[root@mcw04 ~]# cat /etc/alertmanager/alertmanager.yml global: smtp_smarthost: 'smtp.163.com:465' smtp_from: '13xx32@163.com' smtp_auth_username: '13x2@163.com' smtp_auth_password: 'ExSRNW' smtp_require_tls: false templates: - '/etc/alertmanager/template/*.tmpl' route: group_by: ['instance','cluster'] group_wait: 30s group_interval: 5m repeat_interval: 3h receiver: email routes: - match: severity: critical receiver: pager #routes: #- match: # service: machangweiapp # receiver: support_team - match_re: serverity: ^(warning|critical)$ receiver: support_team receivers: - name: 'email' email_configs: - to: '89x5@qq.com' - name: 'support_team' email_configs: - to: '89x15@qq.com' - name: 'pager' email_configs: - to: '13x32@163.com' slack_configs: - api_url: https://hooks.slack.com/services/ABC123/ABC123/EXAMPLE channel: #monitoring text: '{{ .CommonAnnotations.summary }}' [root@mcw04 ~]# [root@mcw04 ~]# systemctl restart alertmanager.service [root@mcw04 ~]#
The result looks like this.
Send alerts to a DingTalk group
Creating the DingTalk robot:
https://www.cnblogs.com/machangwei-8/p/18013311
- Download: https://github.com/timonwong/prometheus-webhook-dingtalk/releases
- The version installed here is 2.1.0.
- Choose an install directory appropriate for the server and upload the package.
- Once the package is downloaded, install it:
cd /prometheus
tar -xvzf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
mv prometheus-webhook-dingtalk-2.1.0.linux-amd64 webhook_dingtalk
cd webhook_dingtalk
- Write the configuration file (after copying, be sure to delete all the # comments, otherwise the service will fail to start) and fill in the DingTalk webhook URL obtained above:
vim dingtalk.yml
timeout: 5s
targets:
  webhook_robot:
    # webhook URL of the DingTalk robot created above
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
  webhook_mention_all:
    # webhook URL of the DingTalk robot created above
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    # mention everyone
    mention:
      all: true
- Write the systemd service: create the webhook_dingtalk unit file:
cd /usr/lib/systemd/system
vim webhook_dingtalk.service
- Put the following content into webhook_dingtalk.service and save it (:wq):
[Unit]
Description=https://prometheus.io
[Service]
Restart=on-failure
ExecStart=/prometheus/webhook_dingtalk/prometheus-webhook-dingtalk --config.file=/prometheus/webhook_dingtalk/dingtalk.yml --web.listen-address=:8060
[Install]
WantedBy=multi-user.target
- Check the unit file:
cat webhook_dingtalk.service
- Reload systemd and start the service:
systemctl daemon-reload
systemctl start webhook_dingtalk.service
- Check the service status:
systemctl status webhook_dingtalk.service
- Enable it at boot:
systemctl enable webhook_dingtalk.service
- Note down the URL http://localhost:8060/dingtalk/webhook_robot/send; it will be used in the configuration that follows.
Configure Alertmanager
Open /prometheus/alertmanager/alertmanager.yml and change it to the following:
global:
  # how long to wait with no further alerts before declaring an alert resolved
  resolve_timeout: 5m
route:
  # custom grouping of incoming alerts
  group_by: ["alertname"]
  # initial wait after a group is created
  group_wait: 10s
  # wait before sending notifications for new alerts added to a group
  group_interval: 30s
  # interval between repeated notifications
  repeat_interval: 5m
  # default receiver
  receiver: "dingtalk"
receivers:
  # DingTalk
  - name: 'dingtalk'
    webhook_configs:
      # address of the prometheus-webhook-dingtalk service
      - url: http://1xx.xx.xx.7:8060/dingtalk/webhook_robot/send
        send_resolved: true
Add an alert_rules.yml file in the root of the Prometheus installation directory with the following content:
groups:
- name: alert_rules
rules:
- alert: CpuUsageAlertWarning
expr: sum(avg(irate(node_cpu_seconds_total{mode!='idle'}[5m])) without (cpu)) by (instance) > 0.60
for: 2m
labels:
level: warning
annotations:
summary: "Instance {{ $labels.instance }} CPU usage high"
description: "{{ $labels.instance }} CPU usage above 60% (current value: {{ $value }})"
- alert: CpuUsageAlertSerious
#expr: sum(avg(irate(node_cpu_seconds_total{mode!='idle'}[5m])) without (cpu)) by (instance) > 0.85
expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{job=~".*",mode="idle"}[5m])) * 100)) > 85
for: 3m
labels:
level: serious
annotations:
summary: "Instance {{ $labels.instance }} CPU usage high"
description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
- alert: MemUsageAlertWarning
expr: avg by(instance) ((1 - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes) * 100) > 70
for: 2m
labels:
level: warning
annotations:
summary: "Instance {{ $labels.instance }} MEM usage high"
description: "{{$labels.instance}}: MEM usage is above 70% (current value is: {{ $value }})"
- alert: MemUsageAlertSerious
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)/node_memory_MemTotal_bytes > 0.90
for: 3m
labels:
level: serious
annotations:
summary: "Instance {{ $labels.instance }} MEM usage high"
description: "{{ $labels.instance }} MEM usage above 90% (current value: {{ $value }})"
- alert: DiskUsageAlertWarning
expr: (1 - node_filesystem_free_bytes{fstype!="rootfs",mountpoint!="",mountpoint!~"/(run|var|sys|dev).*"} / node_filesystem_size_bytes) * 100 > 80
for: 2m
labels:
level: warning
annotations:
summary: "Instance {{ $labels.instance }} Disk usage high"
description: "{{$labels.instance}}: Disk usage is above 80% (current value is: {{ $value }})"
- alert: DiskUsageAlertSerious
expr: (1 - node_filesystem_free_bytes{fstype!="rootfs",mountpoint!="",mountpoint!~"/(run|var|sys|dev).*"} / node_filesystem_size_bytes) * 100 > 90
for: 3m
labels:
level: serious
annotations:
summary: "Instance {{ $labels.instance }} Disk usage high"
description: "{{$labels.instance}}: Disk usage is above 90% (current value is: {{ $value }})"
- alert: NodeFileDescriptorUsage
expr: avg by (instance) (node_filefd_allocated{} / node_filefd_maximum{}) * 100 > 60
for: 2m
labels:
level: warning
annotations:
summary: "Instance {{ $labels.instance }} File Descriptor usage high"
description: "{{$labels.instance}}: File Descriptor usage is above 60% (current value is: {{ $value }})"
- alert: NodeLoad15
expr: avg by (instance) (node_load15{}) > 80
for: 2m
labels:
level: warning
annotations:
summary: "Instance {{ $labels.instance }} Load15 usage high"
description: "{{$labels.instance}}: Load15 is above 80 (current value is: {{ $value }})"
- alert: NodeAgentStatus
expr: avg by (instance) (up{}) == 0
for: 2m
labels:
level: warning
annotations:
summary: "{{$labels.instance}}: has been down"
description: "{{$labels.instance}}: Node_Exporter Agent is down (current value is: {{ $value }})"
- alert: NodeProcsBlocked
expr: avg by (instance) (node_procs_blocked{}) > 10
for: 2m
labels:
level: warning
annotations:
summary: "Instance {{ $labels.instance }} Process Blocked usage high"
description: "{{$labels.instance}}: Node Blocked Procs detected! above 10 (current value is: {{ $value }})"
- alert: NetworkTransmitRate
#expr: avg by (instance) (floor(irate(node_network_transmit_bytes_total{device="ens192"}[2m]) / 1024 / 1024)) > 50
expr: avg by (instance) (floor(irate(node_network_transmit_bytes_total{}[2m]) / 1024 / 1024 * 8 )) > 40
for: 1m
labels:
level: warning
annotations:
summary: "Instance {{ $labels.instance }} Network Transmit Rate usage high"
description: "{{$labels.instance}}: Node Transmit Rate (Upload) is above 40Mbps/s (current value is: {{ $value }}Mbps/s)"
- alert: NetworkReceiveRate
#expr: avg by (instance) (floor(irate(node_network_receive_bytes_total{device="ens192"}[2m]) / 1024 / 1024)) > 50
expr: avg by (instance) (floor(irate(node_network_receive_bytes_total{}[2m]) / 1024 / 1024 * 8 )) > 40
for: 1m
labels:
level: warning
annotations:
summary: "Instance {{ $labels.instance }} Network Receive Rate usage high"
description: "{{$labels.instance}}: Node Receive Rate (Download) is above 40Mbps/s (current value is: {{ $value }}Mbps/s)"
- alert: DiskReadRate
expr: avg by (instance) (floor(irate(node_disk_read_bytes_total{}[2m]) / 1024 )) > 200
for: 2m
labels:
level: warning
annotations:
summary: "Instance {{ $labels.instance }} Disk Read Rate usage high"
description: "{{$labels.instance}}: Node Disk Read Rate is above 200KB/s (current value is: {{ $value }}KB/s)"
- alert: DiskWriteRate
expr: avg by (instance) (floor(irate(node_disk_written_bytes_total{}[2m]) / 1024 / 1024 )) > 20
for: 2m
labels:
level: warning
annotations:
summary: "Instance {{ $labels.instance }} Disk Write Rate usage high"
description: "{{$labels.instance}}: Node Disk Write Rate is above 20MB/s (current value is: {{ $value }}MB/s)"
- Modify prometheus.yml: change the top three sections to the following:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        # Alertmanager service address
        - targets: ['11x.xx.x.7:9093']
rule_files:
  - "alert_rules.yml"
- Run curl -XPOST localhost:9090/-/reload to reload the Prometheus configuration.
- Run systemctl restart alertmanager.service (or docker restart alertmanager) to restart Alertmanager.
Verify the configuration
@@@ My own steps
Download and extract the package
[root@mcw04 ~]# mv prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz /prometheus/ [root@mcw04 ~]# cd /prometheus/ [root@mcw04 prometheus]# ls prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz [root@mcw04 prometheus]# tar xf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz [root@mcw04 prometheus]# ls prometheus-webhook-dingtalk-2.1.0.linux-amd64 prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz [root@mcw04 prometheus]# cd prometheus-webhook-dingtalk-2.1.0.linux-amd64/ [root@mcw04 prometheus-webhook-dingtalk-2.1.0.linux-amd64]# ls config.example.yml contrib LICENSE prometheus-webhook-dingtalk [root@mcw04 prometheus-webhook-dingtalk-2.1.0.linux-amd64]# cd .. [root@mcw04 prometheus]# ls prometheus-webhook-dingtalk-2.1.0.linux-amd64 prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz [root@mcw04 prometheus]# mv prometheus-webhook-dingtalk-2.1.0.linux-amd64 webhook_dingtalk [root@mcw04 prometheus]# cd webhook_dingtalk [root@mcw04 webhook_dingtalk]# ls config.example.yml contrib LICENSE prometheus-webhook-dingtalk [root@mcw04 webhook_dingtalk]#
Configure and start the service
A DingTalk group robot has to be created in advance, so apply for one following the link above.
In the Alertmanager configuration below, the receiver uses webhook1, and the dingtalk bridge configuration needs a secret, so adjust the robot accordingly:
drop the previous keyword-based security setting and switch to signed requests; the secret below is that signing secret.
[root@mcw04 webhook_dingtalk]# ls config.example.yml contrib LICENSE prometheus-webhook-dingtalk [root@mcw04 webhook_dingtalk]# cp config.example.yml dingtalk.yml [root@mcw04 webhook_dingtalk]# vim dingtalk.yml [root@mcw04 webhook_dingtalk]# cat dingtalk.yml ## Request timeout # timeout: 5s ## Uncomment following line in order to write template from scratch (be careful!) #no_builtin_template: true ## Customizable templates path #templates: # - contrib/templates/legacy/template.tmpl ## You can also override default template using `default_message` ## The following example to use the 'legacy' template from v0.3.0 #default_message: # title: '{{ template "legacy.title" . }}' # text: '{{ template "legacy.content" . }}' ## Targets, previously was known as "profiles" targets: webhook1: url: https://oapi.dingtalk.com/robot/send?access_token=2f15xxxxa0c # secret for signature secret: SEC07946bssxxxxx7ac1e3 webhook2: url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx webhook_legacy: url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx # Customize template content message: # Use legacy template title: '{{ template "legacy.title" . }}' text: '{{ template "legacy.content" . }}' webhook_mention_all: url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx mention: all: true webhook_mention_users: url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx mention: mobiles: ['156xxxx8827', '189xxxx8325'] [root@mcw04 webhook_dingtalk]# cd /usr/lib/systemd/system [root@mcw04 system]# vim webhook_dingtalk.service [root@mcw04 system]# cat webhook_dingtalk.service [Unit] Description=https://prometheus.io [Service] Restart=on-failure ExecStart=/prometheus/webhook_dingtalk/prometheus-webhook-dingtalk --config.file=/prometheus/webhook_dingtalk/dingtalk.yml --web.listen-address=:8060 [Install] WantedBy=multi-user.target [root@mcw04 system]# systemctl daemon-reload [root@mcw04 system]# systemctl start webhook_dingtalk.service [root@mcw04 system]# systemctl status webhook_dingtalk.service ● webhook_dingtalk.service - https://prometheus.io Loaded: loaded (/usr/lib/systemd/system/webhook_dingtalk.service; disabled; vendor preset: disabled) Active: active (running) since Mon 2024-02-12 22:27:00 CST; 7s ago Main PID: 32796 (prometheus-webh) CGroup: /system.slice/webhook_dingtalk.service └─32796 /prometheus/webhook_dingtalk/prometheus-webhook-dingtalk --config.file=/prometheus/webhook_dingtalk/dingtalk.yml --web.listen-address=:8060 Feb 12 22:27:00 mcw04 systemd[1]: Started https://prometheus.io. Feb 12 22:27:00 mcw04 systemd[1]: Starting https://prometheus.io... 
Feb 12 22:27:00 mcw04 prometheus-webhook-dingtalk[32796]: ts=2024-02-12T14:27:00.264Z caller=main.go:59 level=info msg="Starting prometheus-webhook-dingtalk" version="...b3005ab4)" Feb 12 22:27:00 mcw04 prometheus-webhook-dingtalk[32796]: ts=2024-02-12T14:27:00.264Z caller=main.go:60 level=info msg="Build context" (gogo1.18.1,userroot@177bd003ba4...=(MISSING) Feb 12 22:27:00 mcw04 prometheus-webhook-dingtalk[32796]: ts=2024-02-12T14:27:00.264Z caller=coordinator.go:83 level=info component=configuration file=/prometheus/webh...tion file" Feb 12 22:27:00 mcw04 prometheus-webhook-dingtalk[32796]: ts=2024-02-12T14:27:00.265Z caller=coordinator.go:91 level=info component=configuration file=/prometheus/webh...tion file" Feb 12 22:27:00 mcw04 prometheus-webhook-dingtalk[32796]: ts=2024-02-12T14:27:00.265Z caller=main.go:97 level=info component=configuration msg="Loading templates" templates= Feb 12 22:27:00 mcw04 prometheus-webhook-dingtalk[32796]: ts=2024-02-12T14:27:00.266Z caller=main.go:113 component=configuration msg="Webhook urls for prometheus alertmanager" u... Feb 12 22:27:00 mcw04 prometheus-webhook-dingtalk[32796]: ts=2024-02-12T14:27:00.267Z caller=web.go:208 level=info component=web msg="Start listening for connections" address=:8060 Hint: Some lines were ellipsized, use -l to show in full. [root@mcw04 system]# systemctl enable webhook_dingtalk.service Created symlink from /etc/systemd/system/multi-user.target.wants/webhook_dingtalk.service to /usr/lib/systemd/system/webhook_dingtalk.service. [root@mcw04 system]#
Note down the URL pattern urls=http://localhost:8060/dingtalk/webhook_robot/send; this value is used in the next configuration step. For our setup it is:
http://10.0.0.14:8060/dingtalk/webhook1/send
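Before wiring it into Alertmanager, the bridge can be smoke-tested by POSTing an Alertmanager-style webhook payload to that URL. This is only a rough sketch: the payload below is hand-written rather than taken from this setup, and the exact fields the bridge requires may differ by version.

[root@mcw04 ~]# curl -s -H 'Content-Type: application/json' -d '{"version":"4","status":"firing","receiver":"dingtalk","groupLabels":{},"commonLabels":{"alertname":"TestAlert","severity":"critical"},"commonAnnotations":{"summary":"manual test"},"externalURL":"http://10.0.0.14:9093","alerts":[{"status":"firing","labels":{"alertname":"TestAlert","severity":"critical"},"annotations":{"summary":"manual test"},"startsAt":"2024-02-12T14:30:00Z","endsAt":"0001-01-01T00:00:00Z"}]}' http://10.0.0.14:8060/dingtalk/webhook1/send
# if the bridge and the robot secret are set up correctly, the robot should post a test message in the group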
The configuration before the change:
[root@mcw04 system]# ls /etc/alertmanager/alertmanager.yml /etc/alertmanager/alertmanager.yml [root@mcw04 system]# cat /etc/alertmanager/alertmanager.yml global: smtp_smarthost: 'smtp.163.com:465' smtp_from: '13xx2@163.com' smtp_auth_username: '13xx2@163.com' smtp_auth_password: 'EHxxNW' smtp_require_tls: false templates: - '/etc/alertmanager/template/*.tmpl' route: group_by: ['instance','cluster'] group_wait: 30s group_interval: 5m repeat_interval: 3h receiver: email routes: - match: severity: critical receiver: pager #routes: #- match: # service: machangweiapp # receiver: support_team - match_re: serverity: ^(warning|critical)$ receiver: support_team receivers: - name: 'email' email_configs: - to: '8xx5@qq.com' - name: 'support_team' email_configs: - to: '8xx5@qq.com' - name: 'pager' email_configs: - to: '13xx2@163.com' slack_configs: - api_url: https://oapi.dingtalk.com/robot/send?access_token=2f153x1a0c #channel: #monitoring text: 'mcw {{ .CommonAnnotations.summary }}' [root@mcw04 system]#
After the change:
The changes are:
A route was added so that alerts matching the label below go to the dingtalk receiver:
- match:
severity: critical
receiver: dingtalk
A dingtalk receiver was added under receivers. The url points at the host and port where the dingtalk bridge runs; the part that has to match is the webhook name defined in the dingtalk configuration, webhook1 in our case:
- name: 'dingtalk' webhook_configs: - url: http://10.0.0.14:8060/dingtalk/webhook1/send send_resolved: true
[root@mcw04 system]# cat /etc/alertmanager/alertmanager.yml global: smtp_smarthost: 'smtp.163.com:465' smtp_from: '13xx2@163.com' smtp_auth_username: '13xx2@163.com' smtp_auth_password: 'ExxNW' smtp_require_tls: false templates: - '/etc/alertmanager/template/*.tmpl' route: group_by: ['instance','cluster'] group_wait: 30s group_interval: 5m repeat_interval: 3h receiver: email routes: - match: severity: critical receiver: dingtalk - match: severity: critical receiver: pager #routes: #- match: # service: machangweiapp # receiver: support_team - match_re: serverity: ^(warning|critical)$ receiver: support_team receivers: - name: 'email' email_configs: - to: '8xx5@qq.com' - name: 'support_team' email_configs: - to: '8xx5@qq.com' - name: 'pager' email_configs: - to: '13xxx32@163.com' - name: 'dingtalk' webhook_configs: - url: http://10.0.0.14:8060/dingtalk/webhook1/send send_resolved: true [root@mcw04 system]# systemctl restart alertmanager.service [root@mcw04 system]#
Start and then stop the service whose alerting rule carries the critical label, so the alert fires:
[root@mcw02 ~]# systemctl start rsyslog.service [root@mcw02 ~]# systemctl stop rsyslog.service [root@mcw02 ~]#
The alerting rule that fires is this one.
Here it shows the receiver is dingtalk.
And the robot has posted the alert message in the DingTalk group.
Email notification template
5.1.1 Requirement
The default alert message is rather plain. We can enrich it with a notification template, using Alertmanager's template feature.
5.1.2 Workflow
1. Analyze the key information 2. Write the template content 3. Load the template file in Alertmanager 4. Use the template attributes in the alert notification
5.2 Customizing the email template
5.2.1 Write the email template
mkdir /data/server/alertmanager/email_template && cd /data/server/alertmanager/email_template cat >email.tmpl<<'EOF' {{ define "test.html" }} <table border="1"> <thead> <th>告警级别</th> <th>告警类型</th> <th>故障主机</th> <th>告警主题</th> <th>告警详情</th> <th>触发时间</th> </thead> <tbody> {{ range $i, $alert := .Alerts }} <tr> <td>{{ index $alert.Labels.severity }}</td> <td>{{ index $alert.Labels.alertname }}</td> <td>{{ index $alert.Labels.instance }}</td> <td>{{ index $alert.Annotations.summary }}</td> <td>{{ index $alert.Annotations.description }}</td> <td>{{ $alert.StartsAt }}</td> </tr> {{ end }} </tbody> </table> {{ end }} EOF 属性解析: {{ define "test.html" }} 表示定义了一个 test.html 模板文件,通过该名称在配置文件中应用。 此模板文件就是使用了大量的ajax模板语言。 $alert.xxx 其实是从默认的告警信息中提取出来的重要信息。
@@@
[root@mcw04 system]# cat /etc/alertmanager/alertmanager.yml global: smtp_smarthost: 'smtp.163.com:465' smtp_from: '13582215632@163.com' smtp_auth_username: '13582215632@163.com' smtp_auth_password: 'EHUKIEHDQJCSSRNW' smtp_require_tls: false templates: - '/etc/alertmanager/template/*.tmpl' route: group_by: ['instance','cluster']
[root@mcw04 system]# ls /etc/alertmanager/template/ [root@mcw04 system]# cd /etc/alertmanager/template/ [root@mcw04 template]# cat >email.tmpl<<'EOF' > {{ define "test.html" }} > <table border="1"> > <thead> > <th>告警级别</th> > <th>告警类型</th> > <th>故障主机</th> > <th>告警主题</th> > <th>告警详情</th> > <th>触发时间</th> > </thead> > <tbody> > {{ range $i, $alert := .Alerts }} > <tr> > <td>{{ index $alert.Labels.severity }}</td> > <td>{{ index $alert.Labels.alertname }}</td> > <td>{{ index $alert.Labels.instance }}</td> > <td>{{ index $alert.Annotations.summary }}</td> > <td>{{ index $alert.Annotations.description }}</td> > <td>{{ $alert.StartsAt }}</td> > </tr> > {{ end }} > </tbody> > </table> > {{ end }} > EOF [root@mcw04 template]# ls email.tmpl [root@mcw04 template]# cat email.tmpl {{ define "test.html" }} <table border="1"> <thead> <th>告警级别</th> <th>告警类型</th> <th>故障主机</th> <th>告警主题</th> <th>告警详情</th> <th>触发时间</th> </thead> <tbody> {{ range $i, $alert := .Alerts }} <tr> <td>{{ index $alert.Labels.severity }}</td> <td>{{ index $alert.Labels.alertname }}</td> <td>{{ index $alert.Labels.instance }}</td> <td>{{ index $alert.Annotations.summary }}</td> <td>{{ index $alert.Annotations.description }}</td> <td>{{ $alert.StartsAt }}</td> </tr> {{ end }} </tbody> </table> {{ end }} [root@mcw04 template]#
5.2.2 Modify alertmanager.yml (i.e. apply the email template)
]# vi /data/server/alertmanager/etc/alertmanager.yml # 全局配置【配置告警邮件地址】 global: resolve_timeout: 5m smtp_smarthost: 'smtp.126.com:25' smtp_from: '**ygbh@126.com' smtp_auth_username: 'pyygbh@126.com' smtp_auth_password: 'BXDVLEAJEH******' smtp_hello: '126.com' smtp_require_tls: false # 模板配置 templates: - '../email_template/*.tmpl' # 路由配置 route: group_by: ['alertname', 'cluster'] group_wait: 10s group_interval: 10s repeat_interval: 120s receiver: 'email' # 收信人员 receivers: - name: 'email' email_configs: - to: '277667028@qq.com' send_resolved: true html: '{{ template "test.html" . }}' headers: { Subject: "[WARN] 报警邮件" } # 规则主动失效措施,如果不想用的话可以取消掉 inhibit_rules: - source_match: severity: 'critical' target_match: severity: 'warning' equal: ['alertname', 'dev', 'instance'] 属性解析: {{}} 属性用于加载其它信息,所以应该使用单引号括住 {} 不需要使用单引号,否则服务启动不成功
@@@
Comment out the route that sends critical alerts to DingTalk so they are routed to pager, which delivers to the 163 mailbox, and add the three settings below (send_resolved, html, headers) so the receiver uses the template we created; see the sketch after this paragraph.
How test.html is found: the Alertmanager configuration defines the template path, so the newly added template file matches that glob and is loaded; inside the file the template is named test.html, so when a notification is sent, that template is used to render the email body.
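A sketch of what the pager receiver ends up looking like (the full file is shown further below):

- name: 'pager'
  email_configs:
    - to: '1xx2@163.com'
      send_resolved: true
      html: '{{ template "test.html" . }}'
      headers: { Subject: "[WARN] 报警邮件" }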
5.2.3 Check that the syntax is valid
The amtool command ships in the extracted Alertmanager tarball.
]# amtool check-config /data/server/alertmanager/etc/alertmanager.yml Checking '/data/server/alertmanager/etc/alertmanager.yml' SUCCESS Found: - global config - route - 1 inhibit rules - 1 receivers - 1 templates SUCCESS
[root@mcw04 template]# /tmp/alertmanager-0.26.0.linux-amd64/amtool check-config /etc/alertmanager/alertmanager.yml Checking '/etc/alertmanager/alertmanager.yml' SUCCESS Found: - global config - route - 0 inhibit rules - 4 receivers - 1 templates SUCCESS [root@mcw04 template]#
5.2.4 Restart the Alertmanager service
systemctl restart alertmanager
After the restart, the alert fires and the notification is sent.
The alert details are visible in the email.
The rendered data comes from the alert variables: the corresponding content is taken from the labels and annotations under each alert.
DingTalk notification template
Create the template file:
[root@mcw04 template]# cat /etc/alertmanager/template/default.tmpl {{ define "default.tmpl" }} {{- if gt (len .Alerts.Firing) 0 -}} {{- range $index, $alert := .Alerts -}} ============ = **<font color='#FF0000'>告警</font>** = ============= #红色字体 **告警名称:** {{ $alert.Labels.alertname }} **告警级别:** {{ $alert.Labels.severity }} 级 **告警状态:** {{ .Status }} **告警实例:** {{ $alert.Labels.instance }} {{ $alert.Labels.device }} **告警概要:** {{ .Annotations.summary }} **告警详情:** {{ $alert.Annotations.message }}{{ $alert.Annotations.description}} **故障时间:** {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} ============ = end = ============= {{- end }} {{- end }} {{- if gt (len .Alerts.Resolved) 0 -}} {{- range $index, $alert := .Alerts -}} ============ = <font color='#00FF00'>恢复</font> = ============= #绿色字体 **告警实例:** {{ .Labels.instance }} **告警名称:** {{ .Labels.alertname }} **告警级别:** {{ $alert.Labels.severity }} 级 **告警状态:** {{ .Status }} **告警概要:** {{ $alert.Annotations.summary }} **告警详情:** {{ $alert.Annotations.message }}{{ $alert.Annotations.description}} **故障时间:** {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} **恢复时间:** {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} ============ = **end** = ============= {{- end }} {{- end }} {{- end }} [root@mcw04 template]#
Add configuration pointing at the template, and reference it in the webhook message.
[root@mcw04 template]# ps -ef|grep ding
root      34609      1  0 00:26 ?        00:00:00 /prometheus/webhook_dingtalk/prometheus-webhook-dingtalk --config.file=/prometheus/webhook_dingtalk/dingtalk.yml --web.listen-address=:8060
root      34747   2038  0 00:34 pts/0    00:00:00 grep --color=auto ding
[root@mcw04 template]# cat /prometheus/webhook_dingtalk/dingtalk.yml
## Request timeout
# timeout: 5s

## Uncomment following line in order to write template from scratch (be careful!)
#no_builtin_template: true

## Customizable templates path
#templates:
#  - contrib/templates/legacy/template.tmpl

## You can also override default template using `default_message`
## The following example to use the 'legacy' template from v0.3.0
#default_message:
#  title: '{{ template "legacy.title" . }}'
#  text: '{{ template "legacy.content" . }}'

## Targets, previously was known as "profiles"
templates:
  - /etc/alertmanager/template/default.tmpl
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=2f1532xx1a0c
    # secret for signature
    secret: SEC079xxac1e3
    message:
      text: '{{ template "default.tmpl" . }}'
Restart the service
[root@mcw04 template]# systemctl restart webhook_dingtalk.service
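To confirm the webhook process came back up and is listening on port 8060 again, a quick sanity check (not part of the original walkthrough):

ss -tlnp | grep 8060
journalctl -u webhook_dingtalk.service -n 20 --no-pager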
Restore the alertmanager configuration
routes:
- match:
    severity: critical
  receiver: dingtalk

- name: 'dingtalk'
  webhook_configs:
  - url: http://10.0.0.14:8060/dingtalk/webhook1/send
    send_resolved: true
The full configuration is as follows
[root@mcw04 template]# cat /etc/alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'smtp.163.com:465'
  smtp_from: '135xx32@163.com'
  smtp_auth_username: '13xx2@163.com'
  smtp_auth_password: 'EHUKxxSRNW'
  smtp_require_tls: false

templates:
  - '/etc/alertmanager/template/*.tmpl'

route:
  group_by: ['instance','cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: email
  routes:
  - match:
      severity: critical
    receiver: dingtalk
  - match:
      severity: critical
    receiver: pager
    #routes:
    #- match:
    #    service: machangweiapp
    #  receiver: support_team
  - match_re:
      serverity: ^(warning|critical)$
    receiver: support_team

receivers:
- name: 'email'
  email_configs:
  - to: '8xx5@qq.com'
- name: 'support_team'
  email_configs:
  - to: '8xxx5@qq.com'
- name: 'pager'
  email_configs:
  - to: '1xx2@163.com'
    send_resolved: true
    html: '{{ template "test.html" . }}'
    headers: { Subject: "[WARN] 报警邮件" }
- name: 'dingtalk'
  webhook_configs:
  - url: http://10.0.0.14:8060/dingtalk/webhook1/send
    send_resolved: true
[root@mcw04 template]#
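Note that the last match_re entry above uses the key serverity, which looks like a typo for severity; as written that route will never match, so the support_team receiver cannot be reached through it. A corrected fragment would presumably be:

  - match_re:
      severity: ^(warning|critical)$
    receiver: support_team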
Stop and start this service to trigger the alert rule.
[root@mcw02 ~]# systemctl start rsyslog.service
[root@mcw02 ~]# systemctl stop rsyslog.service
[root@mcw02 ~]#
The alert rule that fires is as follows.
The result: when the alert fires, the notification is sent with this template, and when it recovers, a recovery message is also sent. The only issue is that the messages arrive with a noticeable delay; both the alert and the recovery notification take a long time to show up in the group chat, and it is not clear whether some timing setting is responsible or whether it is simply this slow.
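The delay is largely explained by the timing settings involved: the rule's `for` duration plus Prometheus's evaluation interval delays firing, and on the Alertmanager side group_wait (30s here) delays the first notification of a new group while group_interval (5m here) controls how often an existing group is flushed again, which also applies to resolved notifications, so a recovery can take up to roughly one group_interval to go out. For testing you could temporarily shorten these values in the route, for example:

route:
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 5m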
Silences and maintenance
Managing silences through the Alertmanager web UI
Comment out the route that matches severity=critical and sends to DingTalk, so those alerts go to the pager receiver below instead, i.e. the 163 mailbox; a sketch of the change follows.
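A minimal sketch of the edited routes block, based on the full configuration shown earlier:

  routes:
  #- match:
  #    severity: critical
  #  receiver: dingtalk
  - match:
      severity: critical
    receiver: pager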
Restart the service
[root@mcw04 template]# systemctl restart alertmanager.service
Stop the service to trigger the alert rule below.
[root@mcw02 ~]# systemctl stop rsyslog.service
[root@mcw02 ~]#
The notification has been sent.
Here you can see that the alert notification has already been sent.
Now add a silence.
Add the matching labels.
Clicking here reports an error.
Press Enter to add it.
Click Create.
Created successfully.
The silence can be viewed.
It can be edited or expired.
Stop the service to trigger the alert.
[root@mcw02 ~]# systemctl stop rsyslog.service
[root@mcw02 ~]#
The alert time is displayed in UTC, which is 8 hours off from local time.
The status turned red.
No new alert notification was sent.
No alert notification was produced at all.
Expire the silence manually.
We can see above that the expiry time is 1:31, and the alert notification arrived at 9:31; subtracting the 8-hour time difference, the notification went out exactly when the silence expired. In other words, after the silence was added, Prometheus still showed the alert rule firing, but Alertmanager sent no notification. Once the silence expired, and since the service had not recovered, the alert notification was sent out immediately.
Managing silences with amtool
[root@mcw04 ~]# /tmp/alertmanager-0.26.0.linux-amd64/amtool --alertmanager.url=http://10.0.0.14:9093 silence add alertname=InstancesGone service=machangweiapp
amtool: error: comment required by config
[root@mcw04 ~]# /tmp/alertmanager-0.26.0.linux-amd64/amtool --alertmanager.url=http://10.0.0.14:9093 silence add --comment "xiaoma test" alertname=InstancesGone service=machangweiapp
836bb0d7-4501-4d6a-bd0d-a03e659eec13
[root@mcw04 ~]#
The newly added silence can be seen.
This one will not actually match anything, though: the alert rule with this name does not carry a service label, and a silence only suppresses alerts whose labels satisfy all of its matchers. The silence can still be created regardless.
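amtool matchers accept the same operators as Alertmanager itself (=, !=, =~, !~), so a silence can also target alerts by regular expression rather than an exact label value. A hedged example (the label values here are made up for illustration):

/tmp/alertmanager-0.26.0.linux-amd64/amtool --alertmanager.url=http://10.0.0.14:9093 silence add --comment "regex example" 'alertname=~Instance.*' 'instance=~10\.0\.0\..*:9100'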
Query silences
[root@mcw04 ~]# /tmp/alertmanager-0.26.0.linux-amd64/amtool --alertmanager.url=http://10.0.0.14:9093 silence query
ID                                    Matchers                                            Ends At                  Created By  Comment
836bb0d7-4501-4d6a-bd0d-a03e659eec13  alertname="InstancesGone" service="machangweiapp"  2024-02-13 03:14:26 UTC  root        xiaoma test
[root@mcw04 ~]#
Expire a silence
[root@mcw04 ~]# /tmp/alertmanager-0.26.0.linux-amd64/amtool --alertmanager.url=http://10.0.0.14:9093 silence query
ID                                    Matchers                                            Ends At                  Created By  Comment
836bb0d7-4501-4d6a-bd0d-a03e659eec13  alertname="InstancesGone" service="machangweiapp"  2024-02-13 03:14:26 UTC  root        xiaoma test
[root@mcw04 ~]# /tmp/alertmanager-0.26.0.linux-amd64/amtool --alertmanager.url=http://10.0.0.14:9093 silence expire 836bb0d7-4501-4d6a-bd0d-a03e659eec13
[root@mcw04 ~]# /tmp/alertmanager-0.26.0.linux-amd64/amtool --alertmanager.url=http://10.0.0.14:9093 silence query
ID  Matchers  Ends At  Created By  Comment
[root@mcw04 ~]#
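To expire every active silence in one go, the query output can be fed back into expire; a sketch, assuming this amtool version supports the -q/--quiet flag that prints only silence IDs:

/tmp/alertmanager-0.26.0.linux-amd64/amtool --alertmanager.url=http://10.0.0.14:9093 silence expire $(/tmp/alertmanager-0.26.0.linux-amd64/amtool --alertmanager.url=http://10.0.0.14:9093 silence query -q)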
Add a configuration file (by default amtool looks for one under the home directory) and put the parameters in it, so they can be omitted from the command line.
[root@mcw04 ~]# mkdir -p .config/amtool
[root@mcw04 ~]# vim .config/amtool/config.yml
[root@mcw04 ~]# cat .config/amtool/config.yml
alertmanager.url: "http://10.0.0.14:9093"
[root@mcw04 ~]# /tmp/alertmanager-0.26.0.linux-amd64/amtool silence add --comment "xiaoma test1" alertname=InstancesGone service=machangwei01
709516e6-2725-4c15-9280-8871c28dc890
[root@mcw04 ~]# /tmp/alertmanager-0.26.0.linux-amd64/amtool silence query
ID                                    Matchers                                           Ends At                  Created By  Comment
709516e6-2725-4c15-9280-8871c28dc890  alertname="InstancesGone" service="machangwei01"  2024-02-13 03:30:40 UTC  root        xiaoma test1
[root@mcw04 ~]#
Specify the author and a 24-hour expiry. In the second entry below, the end time is now a day later; by default the command line uses the current system user as the creator.
[root@mcw04 ~]# /tmp/alertmanager-0.26.0.linux-amd64/amtool silence add --comment "xiaoma test2" alertname=InstancesGone service=machangwei02 --author "马昌伟" --duration "24h"
90ad0a5d-5fe4-4da4-996e-fc8a70a87552
[root@mcw04 ~]# /tmp/alertmanager-0.26.0.linux-amd64/amtool silence query
ID                                    Matchers                                           Ends At                  Created By  Comment
709516e6-2725-4c15-9280-8871c28dc890  alertname="InstancesGone" service="machangwei01"  2024-02-13 03:30:40 UTC  root        xiaoma test1
90ad0a5d-5fe4-4da4-996e-fc8a70a87552  alertname="InstancesGone" service="machangwei02"  2024-02-14 02:45:23 UTC  马昌伟      xiaoma test2
[root@mcw04 ~]#
The author can also be set in the configuration file.
[root@mcw04 ~]# vim .config/amtool/config.yml
[root@mcw04 ~]# cat .config/amtool/config.yml
alertmanager.url: "http://10.0.0.14:9093"
author: machangwei@qq.com
comment_required: true
[root@mcw04 ~]# /tmp/alertmanager-0.26.0.linux-amd64/amtool silence add --comment "xiaoma test3" alertname=InstancesGone service=machangwei03 --duration "24h"
3742a548-5978-4cd1-9433-9561c5bf6566
[root@mcw04 ~]# /tmp/alertmanager-0.26.0.linux-amd64/amtool silence query
ID                                    Matchers                                           Ends At                  Created By         Comment
709516e6-2725-4c15-9280-8871c28dc890  alertname="InstancesGone" service="machangwei01"  2024-02-13 03:30:40 UTC  root               xiaoma test1
90ad0a5d-5fe4-4da4-996e-fc8a70a87552  alertname="InstancesGone" service="machangwei02"  2024-02-14 02:45:23 UTC  马昌伟             xiaoma test2
3742a548-5978-4cd1-9433-9561c5bf6566  alertname="InstancesGone" service="machangwei03"  2024-02-14 02:49:58 UTC  machangwei@qq.com  xiaoma test3
[root@mcw04 ~]#
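Besides the web UI and amtool, silences can also be created through Alertmanager's v2 HTTP API, which is convenient for maintenance scripts. A sketch against the same Alertmanager instance (the matcher and the time window below are illustrative; on success the API returns the new silence ID as JSON):

curl -s -X POST http://10.0.0.14:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{
        "matchers": [{"name": "alertname", "value": "InstancesGone", "isRegex": false}],
        "startsAt": "2024-02-12T03:00:00Z",
        "endsAt":   "2024-02-12T05:00:00Z",
        "createdBy": "root",
        "comment":  "maintenance window"
      }'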