Building a Visual Monitoring Service with Prometheus + Grafana (Part 2): AlertManager Alerting

Previous post: Building a Visual Monitoring Service with Prometheus + Grafana (Part 1): Prometheus

1. Overview

In a Prometheus + Grafana stack there are two ways to configure alerting: one based on Prometheus's AlertManager, the other based on Grafana's built-in Alert feature. This article covers the AlertManager approach.

AlertManager official documentation: https://prometheus.io/docs/alerting/latest/alertmanager/

A Prometheus alert can be in one of three states: Inactive, Pending, and Firing.
Inactive: the rule is being evaluated, but its condition is not currently met and no alert has been triggered.
Pending: the alert condition is met, but the alert is held back until it has stayed true for the configured `for` duration; once that duration elapses, it transitions to Firing.
Firing: the alert is sent to AlertManager, which dispatches it to all configured receivers (subject to grouping, inhibition and silencing). Once the alert condition clears, the state returns to Inactive, and the cycle repeats.
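To make these transitions concrete, here is a minimal rule fragment (a sketch only, with a placeholder metric and threshold; a full rule file is built in section 4.2): while expr evaluates to true the alert is Pending, and once it has stayed true for the whole for duration it becomes Firing and is pushed to AlertManager.

- alert: ExampleHighLoad
  expr: node_load5 > 5   # while this holds, the alert is Pending
  for: 1m                # after 1 minute of continuously holding, it becomes Firing
  labels:
    severity: warning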

2. Installing AlertManager

2.1 Download and install

[root@server ~]# cd /usr/local/src
[root@server src]# wget https://github.com/prometheus/alertmanager/releases/download/v0.22.2/alertmanager-0.22.2.linux-amd64.tar.gz
[root@server src]# mkdir -p /usr/local/prometheus/
[root@server src]# tar xvf alertmanager-0.22.2.linux-amd64.tar.gz -C /usr/local/prometheus/
[root@server src]# mv /usr/local/prometheus/alertmanager-0.22.2.linux-amd64/ /usr/local/prometheus/alertmanager
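Before wiring the binary into systemd, a quick sanity check that the files landed where expected:

[root@server src]# ls /usr/local/prometheus/alertmanager/
[root@server src]# /usr/local/prometheus/alertmanager/alertmanager --version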

2.2 Configure AlertManager as a systemd service

[root@server src]# vi /etc/systemd/system/alertmanager.service
[Unit]
Description=AlertManager Server
Documentation=https://prometheus.io/docs/alerting/latest/alertmanager/
After=network.target

[Service]
ExecStart=/usr/local/prometheus/alertmanager/alertmanager \
  --config.file=/usr/local/prometheus/alertmanager/alertmanager.yml
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -SIGINT $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target

2.3 Start alertmanager with systemctl

[root@server src]# systemctl daemon-reload
[root@server src]# systemctl start alertmanager
[root@server src]# systemctl status alertmanager 
[root@server src]# systemctl enable alertmanager 
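AlertManager listens on port 9093 by default; its standard health endpoints give a quick way to confirm the service is actually serving:

[root@server src]# curl -s http://127.0.0.1:9093/-/healthy
[root@server src]# curl -s http://127.0.0.1:9093/-/ready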

3. Configure alert notification channels

Official documentation: https://prometheus.io/docs/alerting/configuration/
alertmanager.yml serves three main purposes:
(1) define the notification channels (media);
(2) define the notification templates;
(3) route alerts to the right receivers.

[root@server src]# mkdir -p /usr/local/prometheus/alertmanager/templates/
[root@server src]# vi /usr/local/prometheus/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.domain.com'
  smtp_from: 'devops@domain.com'
  smtp_auth_username: 'devops@domain.com'
  smtp_auth_password: '123456'
  smtp_require_tls: false
templates:
- '/usr/local/prometheus/alertmanager/templates/*.tmpl'
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
- name: 'email'
  email_configs:
  - to: 'admin@domain.com'
    send_resolved: true # also notify when the alert is resolved
    #html: '{{ template "email.to.html" . }}' # email template (see section 6)
    #headers: { Subject: " {{ .CommonLabels.instance }} {{ .CommonAnnotations.summary }}" }   # email subject
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'job', 'instance']

A brief overview of the main configuration blocks:
global: global settings, including the resolve timeout, SMTP settings, API URLs for the various notification channels, and so on.
route: the alert routing (dispatch) policy. It is a tree structure; incoming alerts are matched depth-first, left to right.
receivers: the alert recipients, e.g. email, wechat, slack, webhook and other common notification channels.
inhibit_rules: inhibition rules; when an alert matching the source set exists, alerts matching the target set are suppressed.

AlertManager's main processing flow:
On receiving an alert, its labels determine which routes it belongs to (there can be multiple routes; a route contains multiple groups, and a group contains multiple alerts).
The alert is assigned to a group; if no matching group exists, a new one is created.
A new group waits for the group_wait interval (more alerts for the same group may arrive while it waits), checks against resolve_timeout whether its alerts have resolved, and then sends a notification.
An existing group waits for the group_interval interval, checks whether its alerts have resolved, and sends a notification when the time since the last notification exceeds repeat_interval or the group has changed.

resolve_timeout: 5m # how long an alert with no further updates waits before it is declared resolved; the default is 5 minutes
route: the root of the routing tree; every incoming alert enters here
group_by: ['alertname'] # which labels alerts are grouped by
group_wait: 10s # how long a brand-new group waits before its first notification is sent
group_interval: 10s # how long to wait before notifying about alerts added to an existing group
repeat_interval: 1h # how long to wait before re-sending a notification that has already been delivered successfully
See https://www.kancloud.cn/huyipow/prometheus/527563 for a more detailed description.

equal: ['alertname', 'job', 'instance'] # the inhibition only applies when the source and target alerts carry identical values for all of these labels; in other words, an alert must have all three matching label values to be inhibited

Check the configuration file:

[root@server src]# /usr/local/prometheus/alertmanager/amtool check-config /usr/local/prometheus/alertmanager/alertmanager.yml
Checking 'alertmanager.yml'  SUCCESS
Found:

- global config
- route
- 1 inhibit rules
- 2 receivers
- 0 templates
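amtool is useful beyond config validation; for example it can create and list silences against the running AlertManager (a sketch; the matcher, duration and comment are placeholders):

[root@server src]# /usr/local/prometheus/alertmanager/amtool silence add alertname="node-up" \
    --alertmanager.url=http://127.0.0.1:9093 --duration=2h --comment="planned maintenance"
[root@server src]# /usr/local/prometheus/alertmanager/amtool silence query --alertmanager.url=http://127.0.0.1:9093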

Example of routing to multiple notification channels:

# routing
route:
  group_by: ['alertname'] # grouping key
  group_wait: 20s # how long a new group waits before its first notification
  group_interval: 20s # wait before notifying about additions to an existing group
  repeat_interval: 12h # repeat period for already-sent notifications
  receiver: 'email' # default receiver
  # sub-routes
  routes:
  - receiver: 'wechat'
    match:
      severity: test  # when the severity label equals test, this sub-route matches and the wechat receiver is used (see the receiver sketch below)
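For the sub-route above to take effect, a receiver named wechat must exist. A minimal sketch of a WeChat Work receiver, with placeholder credentials (corp_id, agent_id, api_secret and to_party all depend on your own WeChat Work account):

receivers:
- name: 'wechat'
  wechat_configs:
  - send_resolved: true
    corp_id: 'wwxxxxxxxxxxxxxxxx'    # WeChat Work corp ID (placeholder)
    agent_id: '1000002'              # application agent ID (placeholder)
    api_secret: 'xxxxxxxxxxxxxxxx'   # application secret (placeholder)
    to_party: '1'                    # department ID to notify (placeholder)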

4. Configure alerting rules on the Prometheus side

4.1 Point Prometheus at AlertManager and the rules directory

[root@server src]# vi /usr/local/prometheus/prometheus/prometheus.yml
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
       - 127.0.0.1:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
   - "rules/*.yml"

Validate the configuration:

[root@server src]# /usr/local/prometheus/prometheus/promtool check config /usr/local/prometheus/prometheus/prometheus.yml

#Note: run this check every time the configuration or a rule file changes
#then run systemctl reload prometheus to make the rules take effect
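If Prometheus was started with the --web.enable-lifecycle flag (an assumption; it has not been enabled earlier in this series), the reload can also be triggered over HTTP instead of systemctl:

[root@server src]# curl -X POST http://127.0.0.1:9090/-/reload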

4.2 Create alerting rules

[root@server src]# mkdir -p /usr/local/prometheus/prometheus/rules/
[root@server src]# vi /usr/local/prometheus/prometheus/rules/general.yml
groups:
- name: node-up
  rules:
  - alert: node-up
    expr: up{job="node_exporter"} == 0
    for: 15s
    labels:
      severity: critical
      team: node
    annotations:
      summary: "{{ $labels.instance }} has been down for more than 15s!"
      description: "{{ $labels.instance }} has stopped unexpectedly! Please investigate!!!"

Parameter notes:
name: node-up # the rule group / alert name
expr: up{job="node_exporter"} == 0 # the alert condition
for: 15s # how long the condition must stay true before the alert fires (Pending -> Firing)
severity: critical # alert severity label
annotations: # additional information attached to the alert
summary: "{{ $labels.instance }} has been down for more than 15s!" # the message body sent with the notification
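Individual rule files can also be checked on their own with promtool, which is handy when iterating on a single file:

[root@server src]# /usr/local/prometheus/prometheus/promtool check rules /usr/local/prometheus/prometheus/rules/general.yml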

5. Viewing rules and alerts in the browser

Viewing rules and alerts in Prometheus:

http://ip:9090/rules  # view the loaded alerting rules
http://ip:9090/alerts # view the current alert states

Viewing alerts in AlertManager's built-in UI:

http://ip:9093/#/alerts 
Stop the node_exporter service on a target server; an alert email should arrive shortly afterwards.
[root@server src]# systemctl stop node_exporter

Then start node_exporter again; a recovery (resolved) email should arrive shortly afterwards.
[root@server src]# systemctl start node_exporter
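While the alert is firing it can also be inspected from the command line, using amtool or AlertManager's v2 HTTP API (assuming AlertManager runs locally on the default port):

[root@server src]# /usr/local/prometheus/alertmanager/amtool alert query --alertmanager.url=http://127.0.0.1:9093
[root@server src]# curl -s http://127.0.0.1:9093/api/v2/alerts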

6. Alert notification templates

[root@server src]# vi /usr/local/prometheus/alertmanager/templates/email.tmpl
{{ define "email.to.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{ range .Alerts.Firing }}
Alerting program: prometheus_alert <br>
Severity: {{ .Labels.severity }} <br>
Alert name: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Fired at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
{{ end }}{{ end -}}

{{- if gt (len .Alerts.Resolved) 0 -}}
{{ range .Alerts.Resolved }}
Alerting program: prometheus_alert <br>
Severity: {{ .Labels.severity }} <br>
Alert name: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Fired at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
Resolved at: {{ .EndsAt.Format "2006-01-02 15:04:05" }} <br>
{{ end }}{{ end -}}

{{- end }}

If the timestamps in the alert emails are 8 hours off (they are rendered in UTC), add ".Add 28800e9" or use ".Local", for example:

change {{ .StartsAt.Format "2006-01-02 15:04:05" }} to {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}

or change {{ .StartsAt.Format "2006-01-02 15:04:05" }} to {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}
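To actually wire this template into email notifications, enable the html and headers lines that were left commented out in the email receiver in section 3, roughly like this:

receivers:
- name: 'email'
  email_configs:
  - to: 'admin@domain.com'
    send_resolved: true
    html: '{{ template "email.to.html" . }}'
    headers: { Subject: "{{ .CommonLabels.instance }} {{ .CommonAnnotations.summary }}" }

Then reload AlertManager (systemctl reload alertmanager) so the new template and receiver settings take effect.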

7. More alerting rule examples

7.1 Linux alerting rules

[root@server src]# vi /usr/local/prometheus/prometheus/rules/linux_rule.yml
groups:
  - name: linux_alert
    rules:
      - alert: "linux load5 over 5"
        for: 5s
        expr: node_load5 > 5
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} load5 over 5, current value: {{ $value }}"
          summary: "linux load5 over 5"

      - alert: "node-up"
        for: 5s
        expr: up{job="node_exporter"}==0
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been down!"
          description: "{{ $labels.instance }} has stopped unexpectedly! Please investigate!!!"

      - alert: "cpu used percent over 80% per 1 min"
        for: 5s
        expr: 100 * (1 - avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[1m])))  * on(instance) group_left(hostname) node_uname_info > 80
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
          summary: "cpu used percent over 80% per 1 min"

      - alert: "memory used percent over 85%"
        for: 5m
        expr: ((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes{instance=~"172.*"})) * 100 > 85
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
          summary: "memory used percent over 85%"

      - alert: "eth0 input traffic network over 10M"
        for: 3m
        expr: sum by(instance) (irate(node_network_receive_bytes_total{device="eth0",instance=~"172.*"}[1m]) / 128/1024) * on(instance) group_left(hostname) node_uname_info > 10
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
          summary: "eth0 input traffic network over 10M"

      - alert: "eth0 output traffic network over 10M"
        for: 3m
        expr: sum by(instance) (irate(node_network_transmit_bytes_total{device="eth0",instance=~"172.*|175.*"}[1m]) / 128/1024) * on(instance) group_left(hostname) node_uname_info > 10
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
          summary: "eth0 output traffic network over 10M"

      - alert: "disk usage over 80%"
        for: 10m
        expr: (node_filesystem_size_bytes{device=~"/dev/.+"} - node_filesystem_free_bytes{device=~"/dev/.+"} )/ node_filesystem_size_bytes{device=~"/dev/.+"} * 100 > 80
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.mountpoint }} partition usage over 80%, current value: {{ $value }}"
          summary: "disk usage over 80%"

7.2 ICMP alerting rules

Mainly used to detect whether a target is reachable, or whether there is network jitter; a sketch of the matching blackbox_exporter scrape job is shown after the rule file.

[root@server src]# vi /usr/local/prometheus/prometheus/rules/check_icmp_rule.yml
groups:
  - name: icmp check
    rules:
      - alert: icmp_check failed
        for: 5s
        expr: probe_success{job="icmp_check"} == 0
        labels:
          severity: critical
        annotations:
          description: "ICMP check for {{ $labels.hostname }} in group {{ $labels.group }} failed, current probe_success value: {{ $value }}"
          summary: "Host {{ $labels.hostname }} in group {{ $labels.group }} is unreachable"
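The rule assumes a blackbox_exporter probe job named icmp_check that attaches group and hostname labels to each target. A sketch of such a scrape job (the blackbox_exporter address, target IP and label values are placeholders for your own environment):

scrape_configs:
  - job_name: 'icmp_check'
    metrics_path: /probe
    params:
      module: [icmp]            # icmp module defined in blackbox.yml
    static_configs:
      - targets:
          - 192.168.1.10        # host to ping (placeholder)
        labels:
          group: 'app-servers'  # used by the alert annotations (placeholder)
          hostname: 'app-node-01'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115   # where blackbox_exporter listens (placeholder)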

7.3 TCP port alerting rules

[root@server src]# vi /usr/local/prometheus/prometheus/rules/port_check_rule.yml
groups:
  - name: tcp port check
    rules:
      - alert: tcp_port_check failed
        for: 5s
        expr: probe_success{job="port_status"} == 0
        labels:
          severity: critical
        annotations:
          description: "TCP check for {{ $labels.service }} in group {{ $labels.group }} failed, current probe_success value: {{ $value }}"
          summary: "Port of service {{ $labels.service }} in group {{ $labels.group }} is unreachable"

7.4 URL alerting rules

[root@server src]# vi /usr/local/prometheus/prometheus/rules/http_url_check_rule.yml
groups:
  - name: httpd url check
    rules:
      - alert: http_url_check failed
        for: 5s
        expr: probe_success{job="http_status"} == 0
        labels:
          severity: critical
        annotations:
          description: "URL check for {{ $labels.service }} in group {{ $labels.group }} failed, current probe_success value: {{ $value }}"
          summary: "URL of service {{ $labels.service }} in group {{ $labels.group }} is unreachable"

7.5 HTTP status code alerting rules

[root@server src]# vi /usr/local/prometheus/prometheus/rules/http_status_code_check_rule.yml
groups:
  - name: http_status_code check
    rules:
      - alert: http_status_code check failed
        for: 1m
        expr: probe_http_status_code{job="http_status"} >= 400 and probe_success{job="http_status"} == 0
        labels:
          severity: critical
        annotations:
          summary: 'Business alert: website is not reachable'
          description: 'Application {{$labels.instance}} in group {{ $labels.service }} is not reachable, please check; current status code: {{$value}}'

7.6 MySQL alerting rules

[root@server src]# vi /usr/local/prometheus/prometheus/rules/mysql_rule.yml
groups:
  - name: MySQL-rules
    rules:
      - alert: MySQL Status
        expr: up{job="mysql_exporter"} == 0
        for: 5s
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: MySQL has stopped !!!"
          description: "Checks whether the MySQL instance is running"

      - alert: MySQL Slave IO Thread Status
        expr: mysql_slave_status_slave_io_running == 0
        for: 5s
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: MySQL Slave IO Thread has stopped !!!"
          description: "Checks the replication IO thread status"

      - alert: MySQL Slave SQL Thread Status
        expr: mysql_slave_status_slave_sql_running == 0
        for: 5s
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: MySQL Slave SQL Thread has stopped !!!"
          description: "Checks the replication SQL thread status"

      - alert: MySQL Slave Delay Status
        expr: mysql_slave_status_sql_delay > 30
        for: 5s
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: MySQL Slave Delay is more than 30s !!!"
          description: "Checks the replication delay"

      - alert: Mysql_Too_Many_Connections
        expr: rate(mysql_global_status_threads_connected[5m]) > 200
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: too many connections"
          description: "{{$labels.instance}}: too many connections, please investigate (current value is: {{ $value }})"

      - alert: Mysql_Too_Many_slow_queries
        expr: rate(mysql_global_status_slow_queries[5m]) > 3
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: too many slow queries, please investigate"
          description: "{{$labels.instance}}: Mysql slow_queries is more than 3 per second (current value is: {{ $value }})"

7.7 Redis alerting rules

[root@server src]# vi /usr/local/prometheus/prometheus/rules/redis_rule.yml
groups:
- name:  Redis
  rules: 
    - alert: redis-up
      expr: redis_up == 0
      for: 5m
      labels:
        severity: error
      annotations:
        summary: "Redis down (instance {{ $labels.instance }})"
        description: "Redis检测到异常停止!请重点关注!!!,mmp\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
    - alert: MissingBackup
      expr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24
      for: 5m
      labels:
        severity: error
      annotations:
        summary: "Missing backup (instance {{ $labels.instance }})"
        description: "Redis has not been backuped for 24 hours\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"       
    - alert: OutOfMemory
      expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Out of memory (instance {{ $labels.instance }})"
        description: "Redis is running out of memory (> 90%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
    - alert: ReplicationBroken
      expr: delta(redis_connected_slaves[1m]) < 0
      for: 5m
      labels:
        severity: error
      annotations:
        summary: "Replication broken (instance {{ $labels.instance }})"
        description: "Redis instance lost a slave\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
    - alert: TooManyConnections
      expr: redis_connected_clients > 1000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Too many connections (instance {{ $labels.instance }})"
        description: "Redis instance has too many connections\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"       
    - alert: NotEnoughConnections
      expr: redis_connected_clients < 5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Not enough connections (instance {{ $labels.instance }})"
        description: "Redis instance should have more connections (> 5)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
    - alert: RejectedConnections
      expr: increase(redis_rejected_connections_total[1m]) > 0
      for: 5m
      labels:
        severity: error
      annotations:
        summary: "Rejected connections (instance {{ $labels.instance }})"
        description: "Some connections to Redis has been rejected\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

7.8 Elasticsearch alerting rules

[root@server src]# vi /usr/local/prometheus/prometheus/rules/elasticsearch_rule.yml
groups:
  - name: elasticsearch
    rules:
      - record: elasticsearch_filesystem_data_used_percent
        expr: 100 * (elasticsearch_filesystem_data_size_bytes - elasticsearch_filesystem_data_free_bytes)
          / elasticsearch_filesystem_data_size_bytes
      - record: elasticsearch_filesystem_data_free_percent
        expr: 100 - elasticsearch_filesystem_data_used_percent
      - alert: ElasticsearchTooFewNodesRunning
        expr: elasticsearch_cluster_health_number_of_nodes < 3
        for: 5m
        labels:
          severity: critical
        annotations:
          description: There are only {{$value}} < 3 ElasticSearch nodes running
          summary: ElasticSearch running on less than 3 nodes
      - alert: ElasticsearchHeapTooHigh
        expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}
          > 0.9
        for: 15m
        labels:
          severity: critical
        annotations:
          description: The heap usage is over 90% for 15m
          summary: ElasticSearch node {{$labels.node}} heap usage is high

7.9 RabbitMQ alerting rules

[root@server src]# vi /usr/local/prometheus/prometheus/rules/rabbitmq_rule.yml
groups:
- name: rabbitmq-up
  rules:
  - alert: rabbitmq-up
    expr: up{job="rabbitmq_exporter"} == 0 
    for: 15s
    annotations:
      summary: "{{ $labels.name }} has been down for more than 15s!"
      description: "RabbitMQ,{{$labels.name}} has been down"

7.10 Predictive alerting with predict_linear

The predict_linear function can be used for forecasting, for example to predict when a disk will fill up. The rule below fires when the linear trend over the last hour of free space predicts the filesystem will be full within 4 hours.

- name: disk_alerts
  rules:
  - alert: DiskWillFillin4Hours
    expr: predict_linear(node_filesystem_free_bytes[1h], 4*3600) < 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Disk on {{ $labels.instance }} will fill in approximately 4 hours.
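In practice the raw expression also matches pseudo filesystems (tmpfs, overlay and so on); a common refinement, assuming ext4/xfs are the filesystems you care about, is to restrict it by fstype:

    expr: predict_linear(node_filesystem_free_bytes{fstype=~"ext4|xfs"}[1h], 4*3600) < 0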

8. References
