Some problems encountered with Pushgateway

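The two problems

1. After I stopped pushing monitoring data, Prometheus kept recording new samples for the metrics that had been pushed through the Pushgateway (pgw).

2. Metrics pushed through pgw kept overwriting one another.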

 

Solution to problem 1

1) Check the documentation: https://prometheus.io/docs/practices/pushing/

  • When monitoring multiple instances through a single Pushgateway, the Pushgateway becomes both a single point of failure and a potential bottleneck.
  • You lose Prometheus's automatic instance health monitoring via the up metric (generated on every scrape).
  • The Pushgateway never forgets series pushed to it and will expose them to Prometheus forever unless those series are manually deleted via the Pushgateway's API.
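The third point is what causes problem 1. As a rough illustration, here is a toy model (the class ToyGateway is hypothetical and is not Pushgateway code): the gateway keeps the last pushed value, and every scrape returns it again with the scrape time, so new samples keep appearing even though nothing is pushed any more.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGateway:
    """Hypothetical stand-in: keeps the last pushed value until it is deleted."""
    series: dict = field(default_factory=dict)

    def push(self, name, value):
        self.series[name] = value

    def delete(self, name):
        self.series.pop(name, None)

    def scrape(self, now):
        # Every scrape returns all stored series, stamped with the scrape time.
        return [(name, value, now) for name, value in self.series.items()]


gateway = ToyGateway()
gateway.push("job_last_success_unixtime", 1)  # the only push

# Three later scrapes still produce three fresh samples.
samples = []
for scrape_time in (15, 30, 45):
    samples.extend(gateway.scrape(scrape_time))

# Only an explicit delete makes the series disappear.
gateway.delete("job_last_success_unixtime")
after_delete = gateway.scrape(60)
```

In this model the series only stops producing samples once it is deleted on the gateway side, which matches the behaviour described in the documentation quoted above.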

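2) Solution

  The metrics can only be deleted through the Pushgateway API. Deleting them through the Prometheus API does not help, because the Pushgateway keeps exposing the series and Prometheus scrapes them again.

  2.1) Delete everything or delete a single group. For the path rules see: https://github.com/prometheus/pushgateway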
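# Delete everything (the admin API must be enabled with --web.enable-admin-api)
curl -X PUT http://127.0.0.1:9099/api/v1/admin/wipe

# Delete a single group, identified by the job plus all of its grouping labels
curl -X DELETE http://127.0.0.1:9099/metrics/job/auto_wx_friend_from_pgw/process_name/5ENDU19620000906/grouping_src_instance/192.168.61.153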
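The same group delete could also be done from Python. The sketch below uses only the standard library; the helper names group_url and delete_group are made up for this example, and it assumes label values that are non-empty and contain no "/" (such values need the base64 path syntax described in the Pushgateway README). prometheus_client also provides delete_from_gateway for the same purpose.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def group_url(gateway, job, grouping_key):
    """Build the /metrics/job/... URL that identifies one Pushgateway group."""
    parts = ["job", job]
    for label_name, label_value in grouping_key.items():
        parts += [label_name, label_value]
    for part in parts:
        # Assumption: plain values only; empty values or "/" are rejected
        # here instead of being base64-encoded.
        if not part or "/" in part:
            raise ValueError(f"unsupported path segment: {part!r}")
    return f"http://{gateway}/metrics/" + "/".join(quote(p, safe="") for p in parts)


def delete_group(gateway, job, grouping_key, timeout=5):
    """Send the DELETE request and return the HTTP status code."""
    request = Request(group_url(gateway, job, grouping_key), method="DELETE")
    with urlopen(request, timeout=timeout) as response:
        return response.status


# Rebuild the URL used in the curl command above (no request is sent here).
url = group_url(
    "127.0.0.1:9099",
    "auto_wx_friend_from_pgw",
    {"process_name": "5ENDU19620000906", "grouping_src_instance": "192.168.61.153"},
)
```

Calling delete_group(...) with the same arguments would send the request; it is left out of the example so that the block runs without a Pushgateway.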

 

Solution to problem 2

For reference: https://blog.csdn.net/wulitaotao96/article/details/125368734

1) Push code

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
import socket


class PushGateWayPrometheus:
    """
    Pushes gauge metrics to a Pushgateway.
    """

    def __init__(self):
        self.registry = CollectorRegistry()
        self.gateway = '192.168.60.203:9099'
        # Label names; label values are passed in the same order in gauge_process_alive()
        self.label_name = ['src_instance', 'process_name']
        self.src_ip_label_value = socket.gethostbyname(socket.gethostname())
        # No need to change
        self.job = 'auto_wx_friend_from_pgw'
        self.request_timeout = 5

    def gauge_process_alive(self, metric_name: str, describe: str, process_name: str) -> None:
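        """
        Set the gauge to 1 to indicate that the application is still alive.
        :param metric_name: name of the metric
        :param describe: help text of the metric
        :param process_name: value of the process_name label
        :return:
        """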
        g = Gauge(metric_name, describe, registry=self.registry,
                  labelnames=self.label_name)
        g.labels(self.src_ip_label_value, process_name).set(1)

    def push(self, metric_name: str, describe: str, process_name: str) -> None:
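        """
        Push the metric; if new metrics are needed, just add them here.
        :param metric_name: name of the metric
        :param describe: help text of the metric
        :param process_name: value of the process_name label and grouping key
        :return:
        """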
        self.gauge_process_alive(metric_name, describe, process_name)
        push_to_gateway(self.gateway, job=self.job, registry=self.registry, timeout=self.request_timeout,
                        grouping_key={"process_name": process_name, "grouping_src_instance": self.src_ip_label_value})


# No changes needed below
PushGateWayPrometheus().push('job_last_success_unixtime', 'Last time a batch job successfully finished',
                             'ce0717179055de32027e')
PushGateWayPrometheus().push('job_last_success_unixtime', 'Last time a batch job successfully finished',
                             '5ENDU19620000906')
PushGateWayPrometheus().push('job_last_success_unixtime', 'Last time a batch job successfully finished',
                             'ce071717fdf178a20c7e')

 

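2) Specify grouping_key. The Pushgateway groups metrics by the job name plus the labels in grouping_key; by default the group is identified by the job alone. push_to_gateway replaces all metrics in the target group, so pushes that land in the same group overwrite each other.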

grouping_key={"process_name": process_name, "grouping_src_instance": self.src_ip_label_value}
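To make the overwrite behaviour concrete, here is a toy model of the grouping rule. ToyGroupStore is a hypothetical class written for this illustration, not the Pushgateway implementation: a put replaces everything stored under the same (job, grouping labels) group.

```python
class ToyGroupStore:
    """Hypothetical model: one entry per (job, grouping labels) group."""

    def __init__(self):
        self.groups = {}

    def put(self, job, metrics, grouping_key=None):
        # The group identity is the job plus the sorted grouping labels.
        group = (job, tuple(sorted((grouping_key or {}).items())))
        # Replace semantics: whatever was stored for this group is dropped.
        self.groups[group] = dict(metrics)


JOB = "auto_wx_friend_from_pgw"
processes = ["ce0717179055de32027e", "5ENDU19620000906", "ce071717fdf178a20c7e"]

# Without grouping_key every push lands in the same group.
without_key = ToyGroupStore()
for name in processes:
    without_key.put(JOB, {("job_last_success_unixtime", name): 1})

# With a grouping_key per process every push gets its own group.
with_key = ToyGroupStore()
for name in processes:
    with_key.put(
        JOB,
        {("job_last_success_unixtime", name): 1},
        grouping_key={"process_name": name, "grouping_src_instance": "192.168.61.153"},
    )
```

In this model without_key ends up with a single group that holds only the last process, while with_key keeps one group per process, which is the effect the grouping_key above is meant to achieve.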

 

3) Now check how many series exist after the metrics have been pushed

[root@www pushgateway-1.5.1.linux-amd64]# curl -s "http://192.168.60.203:9099/metrics"|grep "auto_wx_friend_from_pgw"
job_last_success_unixtime{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="5ENDU19620000906",src_instance="192.168.61.153"} 1
job_last_success_unixtime{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="ce0717179055de32027e",src_instance="192.168.61.153"} 1
job_last_success_unixtime{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="ce071717fdf178a20c7e",src_instance="192.168.61.153"} 1
push_failure_time_seconds{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="5ENDU19620000906"} 0
push_failure_time_seconds{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="ce0717179055de32027e"} 0
push_failure_time_seconds{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="ce071717fdf178a20c7e"} 0
push_time_seconds{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="5ENDU19620000906"} 1.6793950591862314e+09
push_time_seconds{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="ce0717179055de32027e"} 1.6793950501798096e+09
push_time_seconds{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="ce071717fdf178a20c7e"} 1.6793950681916375e+09

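I only need to know whether each application is alive, so it is enough to check that the last push time reported by push_time_seconds{} is recent enough, that is, that the time elapsed since the last push is small.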
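One possible way to run that check is sketched below. It parses the text returned by the /metrics endpoint and reports the processes whose last push is older than a threshold. The function name stale_processes and the thresholds are assumptions made for this example, and the simple regular expressions assume label values without escaped quotes.

```python
import re

SAMPLE_RE = re.compile(r'^push_time_seconds\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def stale_processes(metrics_text, job, now, max_age_seconds):
    """Return process_name values whose last push is older than max_age_seconds."""
    stale = []
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # not a push_time_seconds sample
        labels = dict(LABEL_RE.findall(match["labels"]))
        if labels.get("job") != job:
            continue
        age = now - float(match["value"])
        if age > max_age_seconds:
            stale.append(labels.get("process_name", ""))
    return stale


# The three push_time_seconds lines from the output above.
SAMPLE = '''
push_time_seconds{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="5ENDU19620000906"} 1.6793950591862314e+09
push_time_seconds{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="ce0717179055de32027e"} 1.6793950501798096e+09
push_time_seconds{grouping_src_instance="192.168.61.153",instance="",job="auto_wx_friend_from_pgw",process_name="ce071717fdf178a20c7e"} 1.6793950681916375e+09
'''

# A fixed "now" keeps the example deterministic; time.time() would be used in practice.
stale = stale_processes(SAMPLE, "auto_wx_friend_from_pgw", now=1679395110.0, max_age_seconds=55)
```

With this fixed time and a 55 second threshold, only ce0717179055de32027e is reported, because its last push is about 60 seconds old while the other two are more recent.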





posted @ 2023-03-21 19:10  JvvYou