Some problems encountered with Pushgateway
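The two problems
1. After I stopped pushing monitoring data through the Pushgateway (pgw), Prometheus still kept updating the metric.
2. Metrics pushed through the pgw overwrote each other.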
Solution to problem 1
1) Check the documentation: https://prometheus.io/docs/practices/pushing/
- When monitoring multiple instances through a single Pushgateway, the Pushgateway becomes both a single point of failure and a potential bottleneck.
- You lose Prometheus's automatic instance health monitoring via the up metric (generated on every scrape).
- The Pushgateway never forgets series pushed to it and will expose them to Prometheus forever unless those series are manually deleted via the Pushgateway's API.
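2) Solution
The metrics can only be deleted through the Pushgateway API. Deleting them through the Prometheus API does not help: the Pushgateway still exposes the series, so Prometheus scrapes them again.
2.1) Wipe everything or delete a single group. For the path rules see https://github.com/prometheus/pushgateway. Note that the wipe endpoint belongs to the admin API, which the Pushgateway only serves when it is started with --web.enable-admin-api.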
curl -X PUT http://127.0.0.1:9099/api/v1/admin/wipe
curl -X DELETE http://127.0.0.1:9099/metrics/job/auto_wx_friend_from_pgw/process_name/5ENDU19620000906/grouping_src_instance/192.168.61.153
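The same per-group delete can be sketched in Python. This is a minimal sketch using only the standard library; `group_url` and `delete_group` are hypothetical helper names, and it assumes that no label value is empty or contains a `/` (those cases need the base64 path encoding described in the Pushgateway README). The prometheus_client package also ships a `delete_from_gateway` function that can be used instead.

```python
import urllib.request
from urllib.parse import quote


def group_url(base_url: str, job: str, grouping_key: dict) -> str:
    """Build <base>/metrics/job/<job>/<label>/<value>/... for one group.

    Assumption: no label value is empty or contains '/'.
    """
    parts = ["metrics", "job", quote(job, safe="")]
    for name, value in grouping_key.items():
        parts += [quote(name, safe=""), quote(str(value), safe="")]
    return base_url.rstrip("/") + "/" + "/".join(parts)


def delete_group(base_url: str, job: str, grouping_key: dict,
                 timeout: float = 5.0) -> int:
    """Send the DELETE request for one group and return the HTTP status."""
    req = urllib.request.Request(group_url(base_url, job, grouping_key),
                                 method="DELETE")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


# Only build and print the URL here; delete_group() needs a reachable gateway.
url = group_url(
    "http://127.0.0.1:9099",
    "auto_wx_friend_from_pgw",
    {"process_name": "5ENDU19620000906",
     "grouping_src_instance": "192.168.61.153"},
)
print(url)
```

Calling `delete_group` with the same arguments should remove only that one group and leave the other groups of the job untouched.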
Solution to problem 2
Reference: https://blog.csdn.net/wulitaotao96/article/details/125368734
1) Push code
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
import socket


class PushGateWayPrometheus:
    """Push metrics to a Pushgateway."""

    def __init__(self):
        self.registry = CollectorRegistry()
        self.gateway = '192.168.60.203:9099'
        # Label names; the values passed to labels() must follow this order
        self.label_name = ['src_instance', 'process_name']
        self.src_ip_label_value = socket.gethostbyname(socket.gethostname())
        # No need to change
        self.job = 'auto_wx_friend_from_pgw'
        self.request_timeout = 5

    def gauge_process_alive(self, metric_name: str, describe: str, process_name: str) -> None:
        """
        Set the gauge to 1, which means the application is still alive.
        :param metric_name: name of the metric
        :param describe: help text of the metric
        :param process_name: value of the process_name label
        """
        g = Gauge(metric_name, describe, registry=self.registry, labelnames=self.label_name)
        g.labels(self.src_ip_label_value, process_name).set(1)

    def push(self, metric_name: str, describe: str, process_name: str) -> None:
        """
        Push the given metric; for a new metric, just add another call below.
        :param metric_name: name of the metric
        :param describe: help text of the metric
        :param process_name: value of the process_name label and grouping key
        """
        self.gauge_process_alive(metric_name, describe, process_name)
        push_to_gateway(self.gateway, job=self.job, registry=self.registry, timeout=self.request_timeout,
                        grouping_key={"process_name": process_name,
                                      "grouping_src_instance": self.src_ip_label_value})  # no need to change


# A new instance (and therefore a new registry) is created for every push
PushGateWayPrometheus().push('job_last_success_unixtime', 'Last time a batch job successfully finished', 'ce0717179055de32027e')
PushGateWayPrometheus().push('job_last_success_unixtime', 'Last time a batch job successfully finished', '5ENDU19620000906')
PushGateWayPrometheus().push('job_last_success_unixtime', 'Last time a batch job successfully finished', 'ce071717fdf178a20c7e')
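2) Specify grouping_key. The Pushgateway groups metrics by the labels given in grouping_key; by default it groups by job only. push_to_gateway replaces all metrics in the group with the same grouping key, so pushes that share nothing but the job overwrite each other, while a distinct grouping_key per process keeps them apart.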
grouping_key={"process_name": process_name, "grouping_src_instance": self.src_ip_label_value}
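The overwrite behaviour can be illustrated with a toy in-memory model. This is a sketch of the grouping semantics only, not the Pushgateway's actual implementation: the `push` function and the `store` dictionary are hypothetical, and a push simply replaces whatever was stored under the same job plus grouping key.

```python
from typing import Optional


def push(store: dict, job: str, metrics: dict,
         grouping_key: Optional[dict] = None) -> None:
    """Toy model of a replacing push: the group identified by
    job + grouping_key is overwritten as a whole."""
    group = (job, tuple(sorted((grouping_key or {}).items())))
    store[group] = dict(metrics)


processes = ["ce0717179055de32027e", "5ENDU19620000906", "ce071717fdf178a20c7e"]
job = "auto_wx_friend_from_pgw"

# Without grouping_key every push lands in the same group,
# so only the last pushed process survives.
by_job = {}
for name in processes:
    push(by_job, job, {("job_last_success_unixtime", name): 1})

# With a per-process grouping_key each push gets its own group.
by_key = {}
for name in processes:
    push(by_key, job, {("job_last_success_unixtime", name): 1},
         grouping_key={"process_name": name})

print(len(by_job), len(by_key))
```

In this model the first store ends up with one group holding only the last process, and the second store keeps three separate groups, which matches the behaviour observed before and after adding grouping_key.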
3) Now check how many series exist after the metrics have been pushed
[root@www pushgateway-1.5.1.linux-amd64]# curl -s "http://192.168.60.203:9099/metrics"|grep "auto_wx_friend_from_pgw"
job_last_success_unixtime{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="5ENDU19620000906",src_instance="192.168.61.153"} 1
job_last_success_unixtime{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="ce0717179055de32027e",src_instance="192.168.61.153"} 1
job_last_success_unixtime{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="ce071717fdf178a20c7e",src_instance="192.168.61.153"} 1
push_failure_time_seconds{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="5ENDU19620000906"} 0
push_failure_time_seconds{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="ce0717179055de32027e"} 0
push_failure_time_seconds{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="ce071717fdf178a20c7e"} 0
push_time_seconds{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="5ENDU19620000906"} 1.6793950591862314e+09
push_time_seconds{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="ce0717179055de32027e"} 1.6793950501798096e+09
push_time_seconds{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="ce071717fdf178a20c7e"} 1.6793950681916375e+09
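I only need to know whether each application is alive, so it is enough to check that the last push is recent, that is, that the time elapsed since push_time_seconds{} (time() - push_time_seconds) stays below a threshold.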
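The liveness check can be sketched as a small script over the /metrics text. This is a minimal sketch: `last_push_times` and `stale_processes` are hypothetical helpers, the 60-second threshold and the fixed `now` value are arbitrary choices for illustration, and the parser assumes label values contain no `}` or escaped quotes. The sample lines are taken from the output above.

```python
import re
import time

LINE_RE = re.compile(r'^push_time_seconds\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def last_push_times(metrics_text: str) -> dict:
    """Map process_name -> push_time_seconds parsed from /metrics text."""
    result = {}
    for line in metrics_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # other metrics, e.g. push_failure_time_seconds
        labels = dict(LABEL_RE.findall(match.group("labels")))
        result[labels.get("process_name", "")] = float(match.group("value"))
    return result


def stale_processes(metrics_text: str, max_age_seconds: float, now=None) -> list:
    """Return the processes whose last push is older than max_age_seconds."""
    now = time.time() if now is None else now
    times = last_push_times(metrics_text)
    return sorted(name for name, ts in times.items() if now - ts > max_age_seconds)


SAMPLE = '''
push_time_seconds{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="5ENDU19620000906"} 1.6793950591862314e+09
push_time_seconds{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="ce0717179055de32027e"} 1.6793950501798096e+09
push_time_seconds{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="ce071717fdf178a20c7e"} 1.6793950681916375e+09
'''

# A fixed "now" keeps the example deterministic; omit it to use the real clock.
stale = stale_processes(SAMPLE, max_age_seconds=60, now=1679395115.0)
print(stale)
```

With the fixed clock above, only the process whose last push is more than 60 seconds old is reported as stale. In production the text would come from the Pushgateway's /metrics endpoint, or the same comparison could be written as an alert rule in Prometheus.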