Some problems I ran into while playing with Prometheus
I. Pushgateway (pgw) metrics have no default value
1. The Prometheus configuration file
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 1m # Evaluate rules every 1 minute. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/opt/soft/alertmanager-0.25.0.linux-amd64/local_rules.yml"

scrape_configs:
  - job_name: "pushGateway"
    static_configs:
      - targets: ["localhost:9099"]
    #relabel_configs:
    #  - source_labels: [srcIp]
    #    separator: ':'
    #    regex: '(.*)'
    #    replacement: '${1}'
    #    target_label: instance

  - job_name: "windows_prometheus"
    static_configs:
      - targets:
          - 192.168.61.153:9182
          - 192.168.62.238:9182
          - 192.168.62.157:9182
          - 192.168.62.17:9182
          - 192.168.66.175:9182
          - 192.168.62.54:9182
          - 192.168.62.94:9182
          - 192.168.62.87:9182
          - 192.168.62.52:9182
          - 192.168.62.224:9182
          - 192.168.62.222:9182
          - 192.168.62.185:9182
          - 192.168.62.71:9182
          - 192.168.62.110:9182
          - 192.168.62.88:9182
          - 192.168.62.28:9182
          - 192.168.62.175:9182
          - 192.168.62.247:9182
    relabel_configs:
      - source_labels: [__address__]
        separator: ':'
        regex: '(.*):(.*)'
        replacement: '${1}'
        target_label: src_instance
At first my configuration was just two plain static_configs, and I tried to do the vector matching with the labels that come built in.
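To see what the relabel rule on the windows_prometheus job produces, its regex can be mimicked in Python. This is only a rough sketch: the function name is made up for illustration, and it assumes that Prometheus anchors relabel regexes at both ends, which is why `re.fullmatch` is used.

```python
import re
from typing import Optional


def relabel_src_instance(address: str) -> Optional[str]:
    """Mimic regex '(.*):(.*)' with replacement '${1}' on __address__."""
    # fullmatch stands in for the anchoring Prometheus applies to the regex.
    match = re.fullmatch(r"(.*):(.*)", address)
    if match is None:
        # No match: the rule would leave the target's labels untouched.
        return None
    # '${1}' keeps the first capture group, i.e. the part before the port.
    return match.group(1)


print(relabel_src_instance("192.168.61.153:9182"))  # 192.168.61.153
```

So every exporter target ends up with a src_instance label that holds its bare IP address, without the port.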
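2. On the data side, Pushgateway only holds what has been pushed to it: jobs push their metrics to pgw, and Prometheus then scrapes those metrics from pgw. This is where pgw differs from an exporter: pgw has no default value, so a metric exists there only after something has pushed it.
3. So when the short-lived jobs in the figure above have not pushed any data to pgw, pgw has no way of knowing whether that metric vector exists at all. This makes it somewhat awkward to use such a metric to judge whether an application is alive.
4. The alerting rules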
groups:
  - name: local_rules
    rules:
      - alert: InstanceDown
        expr: up{job="windows_prometheus"} == 0
        for: 2m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes, current value: {{ $value }}"

      - alert: auto_wx_friend
        # Match the two sides on src_instance only; a bare `or` compares every label,
        # and job/instance differ between the pgw series and the exporter series.
        expr: (job_last_success_unixtime{exported_job="auto_wx_friend_from_pgw"} or on (src_instance) (up{job="windows_prometheus"} * 0)) == 0
        for: 2m
        labels:
          severity: error
        annotations:
          summary: "auto_wx_friend down {{ $labels.src_instance }} {{ $labels.job }}"
          description: "{{ $labels.src_instance }} {{ $labels.job }} (current value: {{ $value }})"
The push script, which can be used as is:
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
import socket


class PushGateWayPrometheus:
    """Push metrics to a Pushgateway."""

    def __init__(self):
        self.registry = CollectorRegistry()
        self.gateway = '192.168.60.203:9099'
        # label names and label values must correspond one to one
        self.label_name = ['src_instance', ]
        self.src_ip_label_value = socket.gethostbyname(socket.gethostname())
        # no need to change this
        self.job = 'auto_wx_friend_from_pgw'

    def gauge_process_alive(self, metric_name: str, describe: str) -> None:
        """
        A value of 1 means the application is still alive.
        :param metric_name: name of the metric
        :param describe: help text of the metric
        :return:
        """
        g = Gauge(metric_name, describe, registry=self.registry, labelnames=self.label_name)
        g.labels(self.src_ip_label_value).set(1)

    def push(self, metric_name: str, describe: str) -> None:
        """
        Push the metric; to report new metrics, just add them here.
        :param metric_name: name of the metric
        :param describe: help text of the metric
        :return:
        """
        self.gauge_process_alive(metric_name, describe)
        push_to_gateway(self.gateway, job=self.job, registry=self.registry)


if __name__ == '__main__':
    PushGateWayPrometheus().push('job_last_success_unixtime', 'Last time a batch job successfully finished')
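In the alerting rules, InstanceDown covers the exporter and auto_wx_friend covers pgw. The src_instance label in the Prometheus configuration is there to make the vector matching possible: the series scraped from the exporters and the series pushed through pgw need to share a label with the same value.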
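The fallback idea behind the auto_wx_friend rule can be mimicked in plain Python: use the pushed value where one exists, otherwise fall back to a 0 derived from `up`, with both sides keyed by src_instance. This is a simplified model of the intended matching, not how PromQL is implemented; the function name and the sample data are made up for illustration.

```python
from typing import Dict


def liveness_with_fallback(pushed: Dict[str, float],
                           exporters_up: Dict[str, float]) -> Dict[str, float]:
    """Per src_instance: the pushed value if present, else up * 0."""
    result = dict(pushed)
    for src_instance, up_value in exporters_up.items():
        if src_instance not in result:
            # Nothing was pushed for this host, so supply a default of 0.
            result[src_instance] = up_value * 0
    return result


# Hypothetical sample: two hosts run an exporter, only one has pushed to pgw.
pushed = {"192.168.62.238": 1.0}
exporters_up = {"192.168.61.153": 1.0, "192.168.62.238": 1.0}

merged = liveness_with_fallback(pushed, exporters_up)
firing = sorted(ip for ip, value in merged.items() if value == 0)
print(firing)  # ['192.168.61.153']
```

The host that never pushed gets a 0 and therefore satisfies the `== 0` condition, which is how the missing default value is worked around.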
It is crude, but it does the job. If anyone has a better approach, I would be glad to hear it.
II. Alertmanager alert inhibition
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
The version below was found online; its comments are a useful explanation.
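# Inhibitor configuration
inhibit_rules: # inhibition rules
  - source_match: # while an alert matching the source labels is firing, alerts matching the target labels are inhibited; here the source alert matches status: 'High'
      status: 'High'
    target_match:
      status: 'Warning'
    equal: ['alertname','operations', 'instance'] # inhibition only applies when the source and target alerts have the same values for these three labels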
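The decision such a rule encodes can be sketched in a few lines of Python. This is a simplified model, not Alertmanager's code: it assumes the source alert is currently firing, it only handles exact-value matches, and the function name and sample alerts are made up for illustration. A label missing on both sides is treated as equal, since a missing label and an empty value count as the same thing.

```python
from typing import Dict, List

Labels = Dict[str, str]


def is_inhibited(target: Labels, source: Labels,
                 source_match: Labels, target_match: Labels,
                 equal: List[str]) -> bool:
    """Return True if a firing `source` alert would mute `target`."""
    source_ok = all(source.get(k) == v for k, v in source_match.items())
    target_ok = all(target.get(k) == v for k, v in target_match.items())
    # Missing labels are read as empty strings on both sides.
    same = all(source.get(name, "") == target.get(name, "") for name in equal)
    return source_ok and target_ok and same


# Hypothetical alerts for the critical/warning rule shown earlier.
critical = {"alertname": "DiskFull", "instance": "host-a", "severity": "critical"}
warning_same_host = {"alertname": "DiskFull", "instance": "host-a", "severity": "warning"}
warning_other_host = {"alertname": "DiskFull", "instance": "host-b", "severity": "warning"}

rule = dict(source_match={"severity": "critical"},
            target_match={"severity": "warning"},
            equal=["alertname", "instance"])

print(is_inhibited(warning_same_host, critical, **rule))   # True
print(is_inhibited(warning_other_host, critical, **rule))  # False
```

The warning on host-a is muted because a critical alert with the same alertname and instance exists; the warning on host-b still goes out.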