Monitoring URL Availability with Prometheus and Alerting with Alertmanager🍕


 

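The previous article deployed Prometheus and its components. For Prometheus to monitor whether the URLs served by the nodes are alive, the blackbox_exporter component has to be deployed on the Prometheus server node.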

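blackbox_exporter is an exporter in the Prometheus monitoring ecosystem that checks the availability and performance of network services. It probes network endpoints over protocols such as HTTP, HTTPS, DNS, TCP, and ICMP, and collects metrics about each probe.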

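The main features and uses of blackbox_exporter are:

Main features

  • Multi-protocol support: blackbox_exporter supports HTTP, HTTPS, DNS, TCP, and ICMP, so it can monitor different kinds of services.
  • Custom probes: users can define probes for specific checks, such as the HTTP response status code, response time, or SSL certificate expiry.
  • Modular configuration: the configuration file is organized into modules, so different probe targets can use different probe settings.
  • Metrics exposure: blackbox_exporter exposes probe results as Prometheus metrics, which the Prometheus server scrapes and stores.
  • Security: probes can run over TLS-encrypted connections, protecting the data in transit.

Uses

  • Website availability monitoring: check that a site answers requests and that its response time stays within a reasonable range.
  • SSL certificate monitoring: track certificate expiry dates so that certificates are renewed before they expire.
  • Network latency monitoring: measure latency with ICMP probes (ping).
  • Port monitoring: use TCP probes to check that a service port is open and accepts connections.
  • DNS monitoring: check that a DNS server resolves domain names correctly.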

Configuring blackbox_exporter usually involves two parts: the configuration of the exporter itself and the scrape configuration in Prometheus.

I. Deploying blackbox_exporter

① Download and install

[root@localhost ~]# wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.25.0/blackbox_exporter-0.25.0.linux-amd64.tar.gz
[root@localhost ~]# tar -xvf blackbox_exporter-0.25.0.linux-amd64.tar.gz -C /usr/local/
[root@localhost ~]# cd /usr/local/
[root@localhost local]# mv blackbox_exporter-0.25.0.linux-amd64   blackbox_exporter
[root@localhost local]# cd blackbox_exporter

② Edit the configuration file

[root@localhost blackbox_exporter]# vim blackbox.yml 
modules:
  http_2xx:
    prober: http
    http:
      preferred_ip_protocol: "ip4"
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  grpc:
    prober: grpc
    grpc:
      tls: true
      preferred_ip_protocol: "ip4"
  grpc_plain:
    prober: grpc
    grpc:
      tls: false
      service: "service1"
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
      - send: "SSH-2.0-blackbox-ssh-check"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
  icmp_ttl5:
    prober: icmp
    timeout: 5s
    icmp:
      ttl: 5

③ Start blackbox_exporter

[root@localhost ~]# vim /usr/lib/systemd/system/blackbox_exporter.service 
[Unit]
Description=Prometheus Blackbox Exporter
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/blackbox_exporter/blackbox_exporter --config.file=/usr/local/blackbox_exporter/blackbox.yml

[Install]
WantedBy=multi-user.target

[root@localhost ~]# systemctl daemon-reload
[root@localhost ~]# systemctl enable --now blackbox_exporter
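
Once the service is up, a manual probe confirms that the exporter works before Prometheus is involved. The following is a minimal sketch, assuming the exporter listens on its default port 9115 on localhost; `probe_url` and `probe_success` are helper names invented for this example.

```shell
# Build the /probe URL for a target and module.
# Assumption: blackbox_exporter listens on localhost:9115 (its default port).
probe_url() {
  target="$1"
  module="${2:-http_2xx}"
  exporter="${3:-localhost:9115}"
  printf 'http://%s/probe?module=%s&target=%s\n' "$exporter" "$module" "$target"
}

# Read a probe response on stdin and print only the probe_success value.
# Lines starting with "# HELP" or "# TYPE" do not match and are skipped.
probe_success() {
  awk '$1 == "probe_success" { print $2 }'
}
```

With the exporter running, `curl -s "$(probe_url http://60.60.60.60:50)" | probe_success` prints 1 when the probe succeeds and 0 when it fails.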

A matching Grafana dashboard for blackbox_exporter can be imported by ID 9965.

II. Configuring Prometheus to scrape blackbox_exporter

Prometheus has to be told to scrape blackbox_exporter in its configuration file:

Add an "http_status" job to the scrape_configs block

[root@localhost prometheus]# vim /usr/local/prometheus/prometheus.yml 
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]

  - job_name: "http_status"
    file_sd_configs: 
      - files:
        - /usr/local/prometheus/file_sd_config/*.yml
    metrics_path: /probe
    params:
      module: [http_2xx]

    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115  # replace with the address of blackbox_exporter


  - job_name: "alertmanager"
    static_configs:
      - targets: ["localhost:9093"]
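
The three relabel steps of the http_status job are easier to follow when traced by hand. The sketch below only imitates what Prometheus does internally for one discovered target; the `relabel` function is illustrative and not part of any tool.

```shell
# Illustration only: imitate the relabel_configs of the http_status job
# for a single target read from file_sd.
relabel() {
  address="$1"                 # __address__ as discovered (the URL to probe)
  param_target="$address"      # step 1: __address__ -> __param_target
  instance="$param_target"     # step 2: __param_target -> instance
  address="localhost:9115"     # step 3: __address__ := blackbox_exporter
  printf 'scrape=http://%s/probe?module=http_2xx&target=%s instance=%s\n' \
    "$address" "$param_target" "$instance"
}

relabel "http://60.60.60.60:50"
```

Prometheus therefore never contacts the URL itself. It asks the exporter to probe it, while the `instance` label still shows the original URL.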

Configure the URL targets to monitor

[root@localhost prometheus]# vim /usr/local/prometheus/file_sd_config/slb_jilin.yml 
#####################################Jilin business############################################
- targets:
  - "http://60.60.60.60:50"       # URL to monitor
  labels:
    environment: "Prod"
    region: "阿里云-华北2-北京"
    job: "http_status"
    AlertReceivers: "吉林"
    project: "吉林project"
    service: "考生端"
    ecs: "192.168.1.1"

- targets:
  - "http://23.23.23.23:88"      # URL to monitor
  labels:
    environment: "Prod"
    region: "阿里云-华北2-北京"
    job: "http_status"
    AlertReceivers: "吉林"
    project: "吉林project"
    service: "管理端"
    ecs: "192.168.1.2"

Configure the alerting rule

[root@localhost prometheus]# vim /usr/local/prometheus/rules/alert_http_status_code.yml 
groups:
- name: http_status_code_rules
  rules:
  - alert: HTTP_Status_Not_200
    expr: probe_http_status_code{job="http_status"} != 200
    for: 1m
    labels:
      severity: critical
      component: web-service
      # Labels set here overwrite same-named labels coming from the target,
      # so environment and service below replace the file_sd values.
      environment: Prod          # environment label
      service: web-service       # service name label
      team: devops               # owning team label
    annotations:
      summary: "HTTP Status Code Not 200"
      description: "{{ $labels.instance }} is unreachable!"
      details: "HTTP status code is {{ $value }}; please check the service."  # extra detail
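
`promtool`, which ships in the Prometheus release tarball, can check the rule file and the main configuration before Prometheus is reloaded. This is a sketch that assumes promtool sits in /usr/local/prometheus next to the prometheus binary, which may differ on your install; `check_prometheus` is a made-up helper name.

```shell
# Assumption: promtool was unpacked together with Prometheus in this directory.
PROM_DIR="${PROM_DIR:-/usr/local/prometheus}"

check_prometheus() {
  promtool="$PROM_DIR/promtool"
  if [ ! -x "$promtool" ]; then
    echo "promtool not found at $promtool" >&2
    return 2
  fi
  # Validate every rule file first, then the main configuration.
  "$promtool" check rules "$PROM_DIR"/rules/*.yml &&
    "$promtool" check config "$PROM_DIR/prometheus.yml"
}
```

If `check_prometheus` succeeds, restart or reload Prometheus so that the new job and rule take effect.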

On the Prometheus web UI, the targets of the http_status job are now listed.
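
The same check can be made without a browser through the Prometheus HTTP API. This is a sketch, assuming Prometheus listens on localhost:9090 as in the configuration above; `prom_query` is a hypothetical wrapper around `curl`.

```shell
# Hypothetical wrapper: run an instant PromQL query against Prometheus.
# Assumption: Prometheus is reachable at localhost:9090.
prom_query() {
  expr="$1"
  prom="${2:-localhost:9090}"
  # -G sends the URL-encoded expression as a query string on a GET request.
  curl -sG --data-urlencode "query=$expr" "http://$prom/api/v1/query"
}
```

For example, `prom_query 'probe_success{job="http_status"}'` returns a JSON document with one series per monitored URL.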

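III. Configuring alerting in Alertmanager

Edit the alertmanager.yml configuration file.

Main components

  1. routes: a list of routing entries that defines how alerts are sent to different receivers based on their labels.

  2. match: each route has a match condition that is compared against the alert's labels. In this example every match condition is based on the AlertReceivers label.

  3. receiver: names the receiver that gets the alert when the route matches. A receiver can be email, Slack, a webhook, and so on.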
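
Conceptually, the routing tree used in this article behaves like a lookup on the AlertReceivers label with a fallback to the default receiver. The shell function below is only an analogy for that behavior, using the receiver names from the configuration that follows; Alertmanager itself evaluates its own route tree and does not run shell code.

```shell
# Analogy only: map an AlertReceivers label value to a receiver name,
# falling back to the default receiver when no route matches.
route_receiver() {
  case "$1" in
    广东) echo "guangdong-receivers" ;;
    贵州) echo "guizhou-receivers" ;;
    湖北) echo "hubei-receivers" ;;
    吉林) echo "jilin-receivers" ;;
    *)    echo "email" ;;
  esac
}

route_receiver "吉林"
```

An alert without an AlertReceivers label, or with a value that no route lists, ends up at the default `email` receiver.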

[root@localhost alertmanager]# vim alertmanager.yml
global:
  resolve_timeout: 5m
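  ## SMTP server address. This example uses 126 Mail; for QQ Mail the official address is smtp.qq.com on port 465 or 587. POP3/SMTP service must also be enabled for the mailbox.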
  smtp_smarthost: 'smtp.126.com:465'
# smtp_from: ' "告警机器人"  <Noleaf@126.com>'
  smtp_from: '=?UTF-8?B?5ZGK6K2m5py65Zmo5Lq6?= <Noleaf@126.com>'
# smtp_from: 'Noleaf@126.com'
  smtp_auth_username: 'Noleaf@126.com'
  # Authorization code, not the account password; the mail provider shows it when POP3/SMTP service is enabled
  smtp_auth_password: 'LLLLLLLLLLLLLLJ'
  smtp_require_tls: false

#1. Templates
templates:
  - '/usr/local/alertmanager/templates/*.tmpl'

####################################Routing##################################################
#2. Routes
route:
  group_by: ['job', 'project', 'service']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 4h
  # Default receiver (required)
  receiver: 'email'

  routes:

    - match:
        AlertReceivers: '广东'
      receiver: 'guangdong-receivers'

    - match:
        AlertReceivers: '贵州'
      receiver: 'guizhou-receivers'

    - match:
        AlertReceivers: '湖北'
      receiver: 'hubei-receivers'  

    - match:
        AlertReceivers: '吉林'
      receiver: 'jilin-receivers'                

####################################Receivers##########################################
#3. Receivers
receivers:
- name: 'email'
  email_configs:
  - to: '1515151515@qq.com'
    send_resolved: true
    html: '{{ template "email.alert.recovery.html" . }}'
    headers: { Subject: "Prometheus [Warning] alert email" }

# Separate receiver for recovery notifications
#- name: 'restore-email'
#  email_configs:
#  - to: '15151515@qq.com'
#    html: '{{ template "email.recovery.html" . }}'
#    headers:
#         Subject: "Alert recovery notification"

# Guangdong business
- name: 'guangdong-receivers'
  email_configs:
  - to: '1515151515@qq.com'
    send_resolved: true
    html: '{{ template "email.alert.recovery.html" .  }}'
    headers: { Subject: "Guangdong business [Warning] alert email" }

# Guizhou business
- name: 'guizhou-receivers'
  email_configs:
  - to: '151151551@qq.com'
    send_resolved: true
    html: '{{ template "email.alert.recovery.html" .  }}'
    headers: { Subject: "Guizhou business [Warning] alert email" }

# Hubei business
- name: 'hubei-receivers'
  email_configs:
  - to: '1515151511@qq.com'
    send_resolved: true
    html: '{{ template "email.alert.recovery.html" .  }}'
    headers: { Subject: "Hubei business [Warning] alert email" }

# Jilin business
- name: 'jilin-receivers'
  email_configs:
  - to: '1515151515@qq.com'
    send_resolved: true
    html: '{{ template "email.alert.recovery.html" .  }}'
    headers: { Subject: "Jilin business [Warning] alert email" }

#######################################Inhibition####################################################
# Inhibition rules
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    # Inhibition applies only when the source and target alerts carry equal values for all three labels listed below.
    equal: ['alertname', 'job', 'instance']
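
`amtool`, shipped with Alertmanager, can validate the file and show which receiver a given label set would reach. This is a sketch assuming amtool and alertmanager.yml both live in /usr/local/alertmanager (the directory used for the templates above); `check_alertmanager` is a made-up helper name.

```shell
# Assumption: amtool was unpacked together with Alertmanager in this directory.
AM_DIR="${AM_DIR:-/usr/local/alertmanager}"

check_alertmanager() {
  amtool="$AM_DIR/amtool"
  if [ ! -x "$amtool" ]; then
    echo "amtool not found at $amtool" >&2
    return 2
  fi
  # Validate the configuration, then test routing for one label value.
  "$amtool" check-config "$AM_DIR/alertmanager.yml" &&
    "$amtool" config routes test --config.file="$AM_DIR/alertmanager.yml" \
      "AlertReceivers=${1:-吉林}"
}
```

With the routes above, `check_alertmanager 吉林` should report `jilin-receivers`.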

Alert email template

[root@localhost alertmanager]# cat templates/email_alert_recovery_html.tmpl 
{{ define "email.alert.recovery.html" }}
{{ if gt (len .Alerts.Firing) 0 }}
{{ range $index, $alert := .Alerts.Firing }}

========= <span style="color:red;font-size:36px;font-weight:bold;"> Alert Notification </span>=========
<br>
<span style="font-size:20px;font-weight:bold;"> Alert source:</span> Alertmanager <br>
<span style="font-size:20px;font-weight:bold;"> Alert type:</span> {{ $alert.Labels.alertname }} <br>
<span style="font-size:20px;font-weight:bold;"> Severity:</span> {{ $alert.Labels.severity }}  <br>
<span style="font-size:20px;font-weight:bold;"> Status:</span> {{ .Status }} <br>
<span style="font-size:20px;font-weight:bold;"> Affected host:</span> {{ $alert.Labels.instance }} {{ $alert.Labels.device }} <br>
<span style="font-size:20px;font-weight:bold;"> Summary:</span> {{ .Annotations.summary }} <br>
<span style="font-size:20px;font-weight:bold;"> Details:</span> {{ $alert.Annotations.message }}{{ $alert.Annotations.description }} <br>
{{/* Note: the range statement iterates over each label pair in .Labels.SortedPairs */}}
{{/*<span style="font-size:20px;font-weight:bold;"> Host labels:</span> {{ range .Labels.SortedPairs }} <br> [{{ .Name }}: {{ .Value | html }} ]{{ end }}<br>*/}}
<span style="font-size:20px;font-weight:bold;"> Host labels:</span><br>
<ul>
  <li>environment: {{ with $alert.Labels.environment }}{{ . }}{{ else }}N/A{{ end }}</li>
  <li>region: {{ with $alert.Labels.region }}{{ . }}{{ else }}N/A{{ end }}</li>
  <li>project: {{ with $alert.Labels.project }}{{ . }}{{ else }}N/A{{ end }}</li>
  <li>service: {{ with $alert.Labels.service }}{{ . }}{{ else }}N/A{{ end }}</li>
  <li>ecs: {{ with $alert.Labels.ecs }}{{ . }}{{ else }}N/A{{ end }}</li>
</ul>
{{/* 28800e9 ns = 8 hours: shifts the UTC timestamp to UTC+8 */}}
<span style="font-size:20px;font-weight:bold;"> Started at:</span> {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>

============== = end =  ==============<br>
<br>
<br>
<br>
<span style="font-size:18px;font-weight:normal;">If the server is offline for a planned shutdown or maintenance, please ignore this email.</span>
<br>

{{ end }}
{{ end }}

{{ if gt (len .Alerts.Resolved) 0 }}
{{ range $index, $alert := .Alerts.Resolved }}

========= <span style="color:#00FF00;font-size:36px;font-weight:bold;"> Alert Resolved </span>=========
<br>
<span style="font-size:20px;font-weight:bold;"> Alert source:</span> Alertmanager <br>
<span style="font-size:20px;font-weight:bold;"> Summary:</span> {{ $alert.Annotations.summary }}<br>
<span style="font-size:20px;font-weight:bold;"> Affected host:</span> {{ .Labels.instance }} <br>
<span style="font-size:20px;font-weight:bold;"> Alert type:</span> {{ .Labels.alertname }}<br>
<span style="font-size:20px;font-weight:bold;"> Severity:</span> {{ $alert.Labels.severity }}  <br>
<span style="font-size:20px;font-weight:bold;"> Status:</span> {{ .Status }}<br>
<span style="font-size:20px;font-weight:bold;"> Details:</span> {{ if eq .Status "resolved" }} The service is reachable again. {{ else }} {{ $alert.Annotations.message }}{{ $alert.Annotations.description }} {{ end }}<br>
{{/*<span style="font-size:20px;font-weight:bold;"> Host labels:</span> {{ range .Labels.SortedPairs }} <br> [{{ .Name }}: {{ .Value | html }} ]{{ end }}<br>*/}}
<span style="font-size:20px;font-weight:bold;"> Host labels:</span><br>
<ul>
  <li>environment: {{ with $alert.Labels.environment }}{{ . }}{{ else }}N/A{{ end }}</li>
  <li>region: {{ with $alert.Labels.region }}{{ . }}{{ else }}N/A{{ end }}</li>
  <li>project: {{ with $alert.Labels.project }}{{ . }}{{ else }}N/A{{ end }}</li>
  <li>service: {{ with $alert.Labels.service }}{{ . }}{{ else }}N/A{{ end }}</li>
  <li>ecs: {{ with $alert.Labels.ecs }}{{ . }}{{ else }}N/A{{ end }}</li>
</ul>
<span style="font-size:20px;font-weight:bold;"> Started at:</span> {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
<span style="font-size:20px;font-weight:bold;"> Resolved at:</span> {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>

============= = end = ==============
<br>
<br>
<br>
<br>
{{ end }}
{{ end }}

{{ end }}
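
To exercise the routing and the template without waiting for a real outage, a hand-made alert can be posted to Alertmanager's v2 API. This is a sketch, assuming Alertmanager listens on localhost:9093 as configured earlier; `test_alert_payload` is a hypothetical helper and the label values are sample data taken from this article.

```shell
# Hypothetical helper: print a JSON array holding one test alert.
# The AlertReceivers value decides which route (and mailbox) receives it.
test_alert_payload() {
  receivers="${1:-吉林}"
  cat <<EOF
[{"labels":{"alertname":"HTTP_Status_Not_200","severity":"critical","job":"http_status","instance":"http://60.60.60.60:50","AlertReceivers":"$receivers"},"annotations":{"summary":"HTTP Status Code Not 200","description":"test alert, please ignore"}}]
EOF
}
```

Sending it looks like `test_alert_payload | curl -s -XPOST -H 'Content-Type: application/json' -d @- http://localhost:9093/api/v2/alerts`; the email should arrive once `group_wait` has elapsed.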

[Screenshot: the resulting alert email]

posted @ 2024-10-29 15:06  Noleaf