Monitoring URL Availability with Prometheus and Alerting with Alertmanager 🍕
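The previous article deployed Prometheus and its components. To monitor whether the URLs served by the node hosts are alive, Prometheus needs the blackbox_exporter component, which is deployed here on the Prometheus control node.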
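blackbox_exporter is an exporter in the Prometheus monitoring system that monitors the availability and performance of network services. It probes network endpoints over protocols such as HTTP, HTTPS, DNS, TCP, and ICMP, and collects metrics from the results.
The main features and uses of blackbox_exporter are listed below: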
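Main features
- Multi-protocol support: blackbox_exporter supports HTTP, HTTPS, DNS, TCP, and ICMP, so it can monitor different kinds of services.
- Custom probes: users can define probes that run specific checks, such as the HTTP response status code, the response time, or the SSL certificate expiry date.
- Modular configuration: the configuration file is organized into modules, so different probe targets can use different probe settings.
- Metrics exposure: blackbox_exporter exposes probe results as Prometheus metrics, which the Prometheus server scrapes and stores.
- Security: probes can be run over TLS-encrypted connections, which protects the data in transit.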
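Uses
- Website availability monitoring: check whether a site responds to requests successfully and whether the response time is within a reasonable range.
- SSL certificate monitoring: track certificate validity so that certificates do not expire unnoticed.
- Network latency monitoring: measure latency with ICMP probes (for example, ping).
- Port monitoring: use TCP probes to check whether a service port is open and accepting connections.
- DNS monitoring: check whether a DNS server resolves domain names correctly.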
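As a small illustration of the SSL certificate use case: blackbox_exporter reports certificate expiry as a Unix timestamp (the probe_ssl_earliest_cert_expiry metric), so the remaining validity is plain arithmetic. The sketch below is only an assumption-laden example; the function names and the 30-day threshold are hypothetical and not taken from this article.

```python
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(expiry_unix_ts, now_unix_ts=None):
    """Return the days left before a certificate expires (negative if expired).

    expiry_unix_ts is a Unix timestamp, e.g. the value of
    probe_ssl_earliest_cert_expiry for one target.
    """
    if now_unix_ts is None:
        now_unix_ts = time.time()
    return (expiry_unix_ts - now_unix_ts) / SECONDS_PER_DAY


def needs_renewal(expiry_unix_ts, now_unix_ts=None, threshold_days=30):
    """True when fewer than threshold_days remain (30 is an assumed threshold)."""
    return days_until_expiry(expiry_unix_ts, now_unix_ts) < threshold_days
```

The same comparison could be written as an alert rule expression; it is shown here in Python only to make the arithmetic explicit.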
Configuring blackbox_exporter usually involves two parts: the configuration of the exporter itself and the Prometheus scrape configuration.
1. Deploy blackbox_exporter
① Download and install
[root@localhost ~]# wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.25.0/blackbox_exporter-0.25.0.linux-amd64.tar.gz
[root@localhost ~]# tar -xvf blackbox_exporter-0.25.0.linux-amd64.tar.gz -C /usr/local/
[root@localhost ~]# cd /usr/local/
[root@localhost local]# mv blackbox_exporter-0.25.0.linux-amd64 blackbox_exporter
[root@localhost local]# cd blackbox_exporter
② Edit the configuration file
[root@localhost blackbox_exporter]# vim blackbox.yml
modules:
  http_2xx:
    prober: http
    http:
      preferred_ip_protocol: "ip4"
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  grpc:
    prober: grpc
    grpc:
      tls: true
      preferred_ip_protocol: "ip4"
  grpc_plain:
    prober: grpc
    grpc:
      tls: false
      service: "service1"
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
      - send: "SSH-2.0-blackbox-ssh-check"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
  icmp_ttl5:
    prober: icmp
    timeout: 5s
    icmp:
      ttl: 5
③ Start blackbox_exporter
[root@localhost ~]# vim /usr/lib/systemd/system/blackbox_exporter.service
[Unit]
Description=Prometheus Blackbox Exporter
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/blackbox_exporter/blackbox_exporter --config.file=/usr/local/blackbox_exporter/blackbox.yml

[Install]
WantedBy=multi-user.target
[root@localhost ~]# systemctl daemon-reload
[root@localhost ~]# systemctl enable --now blackbox_exporter
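Once the service is running, a probe can be requested by hand to confirm that the exporter works. The exporter serves probes on its /probe endpoint with target and module query parameters, on port 9115 by default. The following is a minimal sketch of building that URL and reading the probe_success result; the helper names are hypothetical, and the example target is the one used later in this article.

```python
from urllib.parse import urlencode


def build_probe_url(target, module="http_2xx", exporter="localhost:9115"):
    """Build the /probe URL that asks blackbox_exporter to check one target."""
    query = urlencode({"target": target, "module": module})
    return f"http://{exporter}/probe?{query}"


def parse_probe_success(metrics_text):
    """Return True/False from the probe_success sample, or None if it is absent."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.partition(" ")
        if name == "probe_success":
            return float(value) == 1.0
    return None
```

With the exporter running, the text returned by urllib.request.urlopen(build_probe_url("http://60.60.60.60:50")) could be decoded and passed to parse_probe_success.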
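A matching Grafana dashboard for blackbox_exporter can be imported with ID 9965.
2. Configure Prometheus to scrape blackbox_exporter
The Prometheus configuration file needs a scrape job for blackbox_exporter.
Add an "http_status" job to the scrape_configs block: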
[root@localhost prometheus]# vim /usr/local/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]

  - job_name: "http_status"
    file_sd_configs:
      - files:
          - /usr/local/prometheus/file_sd_config/*.yml
    metrics_path: /probe
    params:
      module: [http_2xx]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115 # address of blackbox_exporter

  - job_name: "alertmanager"
    static_configs:
      - targets: ["localhost:9093"]
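The three relabel_configs entries in the http_status job are what redirect the scrape to the exporter while keeping the monitored URL as the instance label. The sketch below replays those three steps on a plain dictionary of labels. It is a simplified model for illustration, not Prometheus code, and the function name is hypothetical.

```python
def apply_blackbox_relabeling(labels, exporter_address="localhost:9115"):
    """Replay the three relabel steps of the http_status job on one target."""
    out = dict(labels)
    # Step 1: copy the discovered address (the URL) into the ?target= parameter.
    out["__param_target"] = out["__address__"]
    # Step 2: use that URL as the instance label, so series are named after the URL.
    out["instance"] = out["__param_target"]
    # Step 3: point the actual scrape at blackbox_exporter instead of the URL.
    out["__address__"] = exporter_address
    return out


discovered = {"__address__": "http://60.60.60.60:50", "service": "考生端"}
relabeled = apply_blackbox_relabeling(discovered)
```

After these steps Prometheus scrapes localhost:9115/probe with target set to the URL, and every resulting series carries instance="http://60.60.60.60:50".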
Configure the target URLs to monitor
[root@localhost prometheus]# vim /usr/local/prometheus/file_sd_config/slb_jilin.yml
##################################### Jilin services ############################################
- targets:
    - "http://60.60.60.60:50" # URL to monitor
  labels:
    environment: "Prod"
    region: "阿里云-华北2-北京" # Alibaba Cloud, North China 2 (Beijing)
    job: "http_status"
    AlertReceivers: "吉林" # Jilin; Alertmanager routes on this label
    project: "吉林project"
    service: "考生端" # candidate-facing frontend
    ecs: "192.168.1.1"
- targets:
    - "http://23.23.23.23:88" # URL to monitor
  labels:
    environment: "Prod"
    region: "阿里云-华北2-北京" # Alibaba Cloud, North China 2 (Beijing)
    job: "http_status"
    AlertReceivers: "吉林" # Jilin; Alertmanager routes on this label
    project: "吉林project"
    service: "管理端" # admin console
    ecs: "192.168.1.2"
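Because Alertmanager later routes on the AlertReceivers label, a target group that omits it, or misspells its value, falls through to the default receiver without any error. A pre-deployment check along these lines may help. It is only a sketch: the target groups are written as Python dictionaries rather than parsed from the YAML file, and the function name and the set of known values are assumptions based on the routes shown in this article.

```python
# Values that have a dedicated route in alertmanager.yml in this article.
KNOWN_RECEIVER_LABELS = {"广东", "贵州", "湖北", "吉林"}


def groups_missing_routing_label(target_groups, known=KNOWN_RECEIVER_LABELS):
    """Return (index, value) for each group whose AlertReceivers label is
    missing or does not correspond to a known route."""
    problems = []
    for index, group in enumerate(target_groups):
        value = group.get("labels", {}).get("AlertReceivers")
        if value not in known:
            problems.append((index, value))
    return problems


groups = [
    {"targets": ["http://60.60.60.60:50"], "labels": {"AlertReceivers": "吉林"}},
    {"targets": ["http://23.23.23.23:88"], "labels": {"service": "管理端"}},
]
```

Here the second group would be reported because it carries no AlertReceivers label.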
Configure the alert rule
[root@localhost prometheus]# vim /usr/local/prometheus/rules/alert_http_status_code.yml
groups:
  - name: http_status_code_rules
    rules:
      - alert: HTTP_Status_Not_200
        expr: probe_http_status_code{job="http_status"} != 200
        for: 1m
        # Labels set here overwrite same-named labels coming from the target (environment, service).
        labels:
          severity: critical
          component: web-service
          environment: Prod # environment label
          service: web-service # service name label
          team: devops # owning team label
        annotations:
          summary: "HTTP Status Code Not 200"
          description: "{{ $labels.instance }} is unreachable!"
          details: "HTTP status code is {{ $value }}; please check the service." # extra detail
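The for: 1m clause means the alert does not fire on the first bad sample: it stays pending until the expression has held at every evaluation for at least one minute. The sketch below is a simplified model of that behavior, assuming one status code per rule evaluation (every 15 seconds with the evaluation_interval used here). The function name is hypothetical, and a status code of 0 stands for a probe that received no HTTP response.

```python
def first_firing_time(samples, hold_seconds=60):
    """samples: list of (timestamp_seconds, status_code), one per rule evaluation.

    Returns the timestamp at which the alert would move from pending to
    firing, or None if the condition never holds long enough.
    """
    active_since = None
    for ts, code in samples:
        if code != 200:
            if active_since is None:
                active_since = ts  # alert becomes pending here
            if ts - active_since >= hold_seconds:
                return ts  # condition held for the full "for" duration
        else:
            active_since = None  # a healthy sample resets the pending state
    return None


outage = [(0, 200), (15, 0), (30, 0), (45, 0), (60, 0), (75, 0)]
blip = [(0, 0), (15, 0), (30, 200), (45, 0)]
```

In this model the outage becomes pending at 15 s and fires at 75 s, while the short blip never fires.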
On the Prometheus web page, the targets of the http_status job now appear as discovered.
3. Configure alerting in Alertmanager
Edit the alertmanager.yml configuration file.
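Main components
- routes: a list of routing rules that decides, based on an alert's labels, which receiver the alert is sent to.
- match: each route has a match condition that is compared against the alert's labels. In this example, every match condition is based on the AlertReceivers label.
- receiver: the receiver an alert is sent to when it matches the route. A receiver can be email, Slack, a webhook, and so on.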
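To make the routing behavior concrete, here is a minimal sketch of how the routes in this article's alertmanager.yml resolve to a receiver: child routes are checked in order, the first one whose match labels all equal the alert's labels wins, and anything else goes to the default receiver. This is a simplified model, not Alertmanager's implementation (it ignores options such as continue and nested routes), and the function name is hypothetical.

```python
# (match labels, receiver) pairs mirroring the routes in alertmanager.yml.
ROUTES = [
    ({"AlertReceivers": "广东"}, "guangdong-receivers"),
    ({"AlertReceivers": "贵州"}, "guizhou-receivers"),
    ({"AlertReceivers": "湖北"}, "hubei-receivers"),
    ({"AlertReceivers": "吉林"}, "jilin-receivers"),
]
DEFAULT_RECEIVER = "email"


def pick_receiver(alert_labels, routes=ROUTES, default=DEFAULT_RECEIVER):
    """Return the receiver of the first route whose labels all match exactly."""
    for match, receiver in routes:
        if all(alert_labels.get(key) == value for key, value in match.items()):
            return receiver
    return default
```

An alert from the Jilin targets carries AlertReceivers="吉林" and therefore reaches jilin-receivers; an alert without that label reaches the default email receiver.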
[root@localhost alertmanager]# vim alertmanager.yml
global:
  resolve_timeout: 5m
  # SMTP server address. This example uses 126 Mail; for QQ Mail the official address is smtp.qq.com on port 465 or 587. The POP3/SMTP service must be enabled on the mailbox.
  smtp_smarthost: 'smtp.126.com:465'
  # smtp_from: ' "告警机器人" <Noleaf@126.com>'
  smtp_from: '=?UTF-8?B?5ZGK6K2m5py65Zmo5Lq6?= <Noleaf@126.com>'
  # smtp_from: 'Noleaf@126.com'
  smtp_auth_username: 'Noleaf@126.com'
  # Authorization code, not the account password; the mail provider shows it when the POP3/SMTP service is enabled.
  smtp_auth_password: 'LLLLLLLLLLLLLLJ'
  smtp_require_tls: false

# 1. Templates
templates:
  - '/usr/local/alertmanager/templates/*.tmpl'

#################################### Routing ##################################################
# 2. Routes
route:
  group_by: ['job', 'project', 'service']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 4h
  # Default receiver (required)
  receiver: 'email'
  routes:
    - match:
        AlertReceivers: '广东' # Guangdong
      receiver: 'guangdong-receivers'
    - match:
        AlertReceivers: '贵州' # Guizhou
      receiver: 'guizhou-receivers'
    - match:
        AlertReceivers: '湖北' # Hubei
      receiver: 'hubei-receivers'
    - match:
        AlertReceivers: '吉林' # Jilin
      receiver: 'jilin-receivers'

#################################### Receivers ##########################################
# 3. Receivers
receivers:
  - name: 'email'
    email_configs:
      - to: '1515151515@qq.com'
        send_resolved: true
        html: '{{ template "email.alert.recovery.html" . }}'
        headers: { Subject: "Prometheus [Warning] alert email" }
  # Separate receiver for recovery notifications
  #- name: 'restore-email'
  #  email_configs:
  #    - to: '15151515@qq.com'
  #      html: '{{ template "email.recovery.html" . }}'
  #      headers:
  #        Subject: "Alert recovery notification"
  # Guangdong services
  - name: 'guangdong-receivers'
    email_configs:
      - to: '1515151515@qq.com'
        send_resolved: true
        html: '{{ template "email.alert.recovery.html" . }}'
        headers: { Subject: "Guangdong services [Warning] alert email" }
  # Guizhou services
  - name: 'guizhou-receivers'
    email_configs:
      - to: '151151551@qq.com'
        send_resolved: true
        html: '{{ template "email.alert.recovery.html" . }}'
        headers: { Subject: "Guizhou services [Warning] alert email" }
  # Hubei services
  - name: 'hubei-receivers'
    email_configs:
      - to: '1515151511@qq.com'
        send_resolved: true
        html: '{{ template "email.alert.recovery.html" . }}'
        headers: { Subject: "Hubei services [Warning] alert email" }
  # Jilin services
  - name: 'jilin-receivers'
    email_configs:
      - to: '1515151515@qq.com'
        send_resolved: true
        html: '{{ template "email.alert.recovery.html" . }}'
        headers: { Subject: "Jilin services [Warning] alert email" }

####################################### Inhibition ####################################################
# Inhibit rules
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    # A warning alert is inhibited only when it and the critical alert have the same values for these three labels.
    equal: ['alertname', 'job', 'instance']
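The smtp_from value uses a MIME encoded-word so that a non-ASCII display name survives in the From header: the part between =?UTF-8?B? and ?= is the Base64 of the UTF-8 name, which here decodes to "告警机器人" ("alert robot"), the same name as in the commented-out line. The sketch below shows how such a value can be produced and checked with the Python standard library; the helper names are hypothetical.

```python
import base64
from email.header import decode_header


def encode_display_name(name):
    """Wrap a display name as a UTF-8, Base64 MIME encoded-word."""
    token = base64.b64encode(name.encode("utf-8")).decode("ascii")
    return f"=?UTF-8?B?{token}?="


def decode_display_name(encoded_word):
    """Decode an encoded-word back to text using the standard library parser."""
    parts = decode_header(encoded_word)
    return "".join(
        chunk.decode(charset or "ascii") if isinstance(chunk, bytes) else chunk
        for chunk, charset in parts
    )
```

To use a different sender name, encode it with encode_display_name and place the result before the address in angle brackets, as in the configuration above.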
Alert email template
[root@localhost alertmanager]# cat templates/email_alert_recovery_html.tmpl
{{ define "email.alert.recovery.html" }}
{{ if gt (len .Alerts.Firing) 0 }}
{{ range $index, $alert := .Alerts.Firing }}
========= <span style="color:red;font-size:36px;font-weight:bold;"> Alert Notification </span>========= <br>
<span style="font-size:20px;font-weight:bold;"> Alerting program:</span> Alertmanager <br>
<span style="font-size:20px;font-weight:bold;"> Alert type:</span> {{ $alert.Labels.alertname }} <br>
<span style="font-size:20px;font-weight:bold;"> Severity:</span> {{ $alert.Labels.severity }} <br>
<span style="font-size:20px;font-weight:bold;"> Status:</span> {{ .Status }} <br>
<span style="font-size:20px;font-weight:bold;"> Affected host:</span> {{ $alert.Labels.instance }} {{ $alert.Labels.device }} <br>
<span style="font-size:20px;font-weight:bold;"> Summary:</span> {{ .Annotations.summary }} <br>
<span style="font-size:20px;font-weight:bold;"> Details:</span> {{ $alert.Annotations.message }}{{ $alert.Annotations.description }} <br>
{{/* Note: the range statement below iterates over every label pair in .Labels.SortedPairs */}}
{{/*<span style="font-size:20px;font-weight:bold;"> Host labels:</span> {{ range .Labels.SortedPairs }} <br> [{{ .Name }}: {{ .Value | html }} ]{{ end }}<br>*/}}
<span style="font-size:20px;font-weight:bold;"> Host labels:</span><br>
<ul>
<li>environment: {{ with $alert.Labels.environment }}{{ . }}{{ else }}N/A{{ end }}</li>
<li>region: {{ with $alert.Labels.region }}{{ . }}{{ else }}N/A{{ end }}</li>
<li>project: {{ with $alert.Labels.project }}{{ . }}{{ else }}N/A{{ end }}</li>
<li>service: {{ with $alert.Labels.service }}{{ . }}{{ else }}N/A{{ end }}</li>
<li>ecs: {{ with $alert.Labels.ecs }}{{ . }}{{ else }}N/A{{ end }}</li>
</ul>
<span style="font-size:20px;font-weight:bold;"> Start time:</span> {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
============== = end = ==============<br>
<br>
<br>
<br>
<span style="font-size:18px;font-weight:normal;">If the server is offline for a planned shutdown or maintenance, please ignore this email.</span> <br>
{{ end }}
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
{{ range $index, $alert := .Alerts.Resolved }}
========= <span style="color:#00FF00;font-size:36px;font-weight:bold;"> Alert Resolved </span>========= <br>
<span style="font-size:20px;font-weight:bold;"> Alerting program:</span> Alertmanager <br>
<span style="font-size:20px;font-weight:bold;"> Summary:</span> {{ $alert.Annotations.summary }}<br>
<span style="font-size:20px;font-weight:bold;"> Host:</span> {{ .Labels.instance }} <br>
<span style="font-size:20px;font-weight:bold;"> Alert type:</span> {{ .Labels.alertname }}<br>
<span style="font-size:20px;font-weight:bold;"> Severity:</span> {{ $alert.Labels.severity }} <br>
<span style="font-size:20px;font-weight:bold;"> Status:</span> {{ .Status }}<br>
<span style="font-size:20px;font-weight:bold;"> Details:</span> {{ if eq .Status "resolved" }} The service is reachable again.{{ else }} {{ $alert.Annotations.message }}{{ $alert.Annotations.description }} {{ end }}<br>
{{/*<span style="font-size:20px;font-weight:bold;"> Host labels:</span> {{ range .Labels.SortedPairs }} <br> [{{ .Name }}: {{ .Value | html }} ]{{ end }}<br>*/}}
<span style="font-size:20px;font-weight:bold;"> Host labels:</span><br>
<ul>
<li>environment: {{ with $alert.Labels.environment }}{{ . }}{{ else }}N/A{{ end }}</li>
<li>region: {{ with $alert.Labels.region }}{{ . }}{{ else }}N/A{{ end }}</li>
<li>project: {{ with $alert.Labels.project }}{{ . }}{{ else }}N/A{{ end }}</li>
<li>service: {{ with $alert.Labels.service }}{{ . }}{{ else }}N/A{{ end }}</li>
<li>ecs: {{ with $alert.Labels.ecs }}{{ . }}{{ else }}N/A{{ end }}</li>
</ul>
<span style="font-size:20px;font-weight:bold;"> Start time:</span> {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
<span style="font-size:20px;font-weight:bold;"> Resolved at:</span> {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
============= = end = ============== <br>
<br>
<br>
<br>
{{ end }}
{{ end }}
{{ end }}
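The expression .Add 28800e9 in the template shifts the alert timestamps, which Alertmanager keeps in UTC, by 28800e9 nanoseconds (28,800 seconds, or 8 hours) so that the email shows UTC+8 local time, and the layout "2006-01-02 15:04:05" is Go's way of writing year-month-day hour:minute:second. The sketch below reproduces the same arithmetic in Python as a sanity check; the function name is hypothetical and the sample timestamp is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

UTC8_OFFSET_NS = 28800e9  # the constant used in the template, in nanoseconds


def format_utc8(utc_time):
    """Shift a UTC time by 8 hours and format it like the email template."""
    shifted = utc_time + timedelta(seconds=UTC8_OFFSET_NS / 1e9)
    # "%Y-%m-%d %H:%M:%S" corresponds to Go's "2006-01-02 15:04:05" layout.
    return shifted.strftime("%Y-%m-%d %H:%M:%S")


starts_at = datetime(2024, 1, 1, 20, 30, 0, tzinfo=timezone.utc)
```

For this sample, format_utc8(starts_at) gives "2024-01-02 04:30:00", so an alert that starts late in the evening UTC is shown on the next calendar day in UTC+8.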
Resulting email: