Monitoring Cloud-Native Environments with Prometheus

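Preface

An operations engineer has three core responsibilities: server resource management, change management, and incident management.

I currently maintain several cloud-native projects deployed on K8s. Compared with traditional monitoring, monitoring Kubernetes raises the following thorny problems:

Containers are closed and isolated
Containers are scheduled dynamically
Container networks are virtualized and software-defined

I want a single monitoring system that covers all of the following layers, with no blind spots:

Physical server layer
Operating system layer
Network layer
K8s cluster layer
Infrastructure applications and business applications running on the K8s cluster (application layer)

A monitoring system should provide the following core functions:

Data collection: gather data by pull or push
Data storage: SQL or NoSQL (K/V, document, column store, TSDB)
Visualization: Grafana
Alerting: notify the alert receivers through various channels (e-mail, SMS, WeChat, DingTalk)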

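I. Overview of the Prometheus Monitoring System

Prometheus is an open-source monitoring system built by former Google engineers and modeled on Google's Borgmon monitoring system.

Kubernetes is a container orchestrator modeled on Google's Borg system, so Prometheus is a natural fit for monitoring Kubernetes clusters.

Advantages of Prometheus

Its own metric format
Multi-dimensional labels: every distinct combination of label values is a separate time series
Built-in aggregation, slicing, and dicing of time-series data
Samples are double-precision floating-point values (logs cannot be stored)
Automatic discovery of monitored targets
Time-series storage: Prometheus embeds a TSDB, which is queried with PromQL

Disadvantages of Prometheus

It cannot store logs; use Loki or ELK/EFK to collect logs.
The built-in time-series database is not designed for long-term storage: local retention defaults to 15 days (configurable with --storage.tsdb.retention.time).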

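II. Components of the Prometheus Monitoring System

Prometheus Server

Once service discovery is configured, Prometheus Server can discover the targets to monitor automatically, without each one being listed by hand.

The Retrieval component (the scraper) of Prometheus Server collects data only by pull: it fetches metrics from the monitored targets.

Instrumentation

Instrumentation lets an application expose its own metrics.

Any application that Prometheus can scrape must first have a measurement system of its own.

In Prometheus terminology, instrumentation refers to the client libraries that are embedded in an application to expose its metrics.

from prometheus_client import Gauge, Counter, Histogram, Summary  # 1. Prometheus metric types
from prometheus_client.core import CollectorRegistry

Developers use these client libraries to write code that produces metrics ready to be exposed.

Prometheus scrapes metrics from targets over HTTP, through three main kinds of sources:

Exporter: provided by the Prometheus community for applications that have no built-in instrumentation
Instrumentation: the application was instrumented in its own code during development
Pushgateway: short-lived jobs push their metrics to it

In practice:

Monitoring infrastructure applications: use an Exporter. For MySQL, Redis, Nginx, and so on, the corresponding Exporter is available from the Prometheus community and is installed on the target.

Monitoring business applications (instrumentation built into the application): integrate a Prometheus client library into the application code.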

import prometheus_client
from prometheus_client import Gauge, Counter, Histogram, Summary  # 1. Prometheus metric types
from prometheus_client.core import CollectorRegistry
from flask import Response, Flask

app = Flask(__name__)
REGISTRY = CollectorRegistry(auto_describe=False)

# 2. Define the data model of the Prometheus metric
cpu_gauge = Gauge(
    "cpu_usage",             # metric name (key)
    "CPU usage percentage",  # help text
    # Labels: one metric can match many targets or devices, so labels add
    # multi-dimensional metadata that is used to filter and aggregate.
    ["core_number", "machine_ip"],
    registry=REGISTRY)

# Sample readings: machine IP -> usage of cores 1 to 4
SAMPLES = {
    "192.168.56.18": [25, 25, 25, 25],
    "192.168.56.19": [23, 27, 28, 22],
    "192.168.56.20": [21, 29, 32, 28],
}


@app.route('/metrics')
def metrics():
    # 3. Fill the metric with data, one time series per label combination
    for machine_ip, usages in SAMPLES.items():
        for core_number, usage in enumerate(usages, start=1):
            cpu_gauge.labels(str(core_number), machine_ip).set(usage)

    return Response(prometheus_client.generate_latest(REGISTRY),
                    mimetype="text/plain")


@app.route('/')
def index():
    return "<h1>Customized Exporter</h1><br> <a href='metrics'>Metrics</a>"


if __name__ == "__main__":
    app.run(host='0.0.0.0', port=9100, debug=True)

"""
Output of /metrics:

# HELP cpu_usage CPU usage percentage
# TYPE cpu_usage gauge
cpu_usage{core_number="1",machine_ip="192.168.56.18"} 25.0
cpu_usage{core_number="2",machine_ip="192.168.56.18"} 25.0
cpu_usage{core_number="3",machine_ip="192.168.56.18"} 25.0
cpu_usage{core_number="4",machine_ip="192.168.56.18"} 25.0
cpu_usage{core_number="1",machine_ip="192.168.56.19"} 23.0
cpu_usage{core_number="2",machine_ip="192.168.56.19"} 27.0
cpu_usage{core_number="3",machine_ip="192.168.56.19"} 28.0
cpu_usage{core_number="4",machine_ip="192.168.56.19"} 22.0
cpu_usage{core_number="1",machine_ip="192.168.56.20"} 21.0
cpu_usage{core_number="2",machine_ip="192.168.56.20"} 29.0
cpu_usage{core_number="3",machine_ip="192.168.56.20"} 32.0
cpu_usage{core_number="4",machine_ip="192.168.56.20"} 28.0
"""

A Python program with built-in Prometheus instrumentation

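Exporters

For applications that have no built-in instrumentation and do not expose metrics in a format Prometheus understands, the usual approach is to deploy a separate metrics-exposing program next to the application. Such programs are collectively called Exporters.

In other words, an Exporter actively collects raw data from the monitored application and converts and aggregates it into the Prometheus metric format.

The Prometheus community provides a large number of Exporters, such as Node Exporter, MongoDB Exporter, and MySQL Exporter.

Pushgateway

If the monitored side is a short-lived program that can only push its metrics, use the Pushgateway component.

The monitored side pushes its metrics to the Pushgateway, and the Retrieval component of Prometheus Server then pulls them from the Pushgateway.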
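The push side can be sketched with only the Python standard library. The Pushgateway accepts metrics in the text exposition format over HTTP under /metrics/job/<job>. The gateway address pushgateway.example:9091, the job name nightly_backup, and the metric name are placeholders chosen for illustration; the prometheus_client library also ships a push_to_gateway helper for the same purpose.

```python
import urllib.request


def build_push_request(gateway, job, metrics):
    """Build a PUT request that replaces all metrics of `job` on the Pushgateway.

    `metrics` maps a metric name to a float value; each entry is sent as one
    untyped sample line in the text exposition format.
    """
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    url = f"http://{gateway}/metrics/job/{job}"
    return urllib.request.Request(url, data=body.encode("utf-8"), method="PUT")


def push(gateway, job, metrics, timeout=5):
    """Send the metrics; meant to be the last step of a short-lived job."""
    request = build_push_request(gateway, job, metrics)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status


# Hypothetical gateway address and job name, for illustration only.
request = build_push_request(
    "pushgateway.example:9091",
    "nightly_backup",
    {"backup_last_success_timestamp_seconds": 1700000000.0},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"), end="")
```

Only build_push_request runs here, so the sketch works without a gateway; calling push(...) performs the actual HTTP request.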

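Alertmanager

Prometheus Server evaluates alerting rules and detects alert conditions, but it does not deliver notifications itself. When an abnormal value is scraped, the alerting mechanism notifies users so that they can respond in time.

Prometheus Server pushes alerts to the Alertmanager component.

Alertmanager then routes each alert according to the user-configured routing rules and notifies the receivers through the configured channels.

1. Alert states

An alert in Prometheus passes through three states during its life cycle: Inactive, Pending, and Firing.

1.1. Inactive

The rule is being evaluated, but its condition is not currently met, so no alert is active.

1.2. Pending

The alert condition is met, but it has not yet held for the duration set by the rule's "for" clause. If it stays true for that whole duration, the alert moves to Firing and is sent to Alertmanager, which then applies grouping, inhibition, and silencing.

1.3. Firing

Alertmanager delivers the alert to its receivers according to the configured routing rules. Once the condition clears, the alert returns to Inactive, and the cycle repeats.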
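The life cycle can be illustrated with a toy state machine. This is a simplified model, not Prometheus code: the class AlertState and its fields are invented for the example, and it assumes a rule with a "for" duration of 5 minutes that is evaluated once a minute.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Toy model of one alerting rule with a `for` duration (in seconds)."""
    for_seconds: float
    active_since: Optional[float] = None
    state: str = "inactive"

    def evaluate(self, condition_true: bool, now: float) -> str:
        """Update the state for one rule evaluation at time `now`."""
        if not condition_true:
            # The condition cleared: the alert goes back to inactive.
            self.active_since = None
            self.state = "inactive"
        else:
            if self.active_since is None:
                self.active_since = now
            held = now - self.active_since
            # Pending until the condition has held for the whole duration.
            self.state = "firing" if held >= self.for_seconds else "pending"
        return self.state


rule = AlertState(for_seconds=300)  # corresponds to `for: 5m`
# One evaluation every 60 s; the condition is true from t=60 to t=420.
timeline = [(t, 60 <= t <= 420) for t in range(0, 541, 60)]
states = [rule.evaluate(condition, t) for t, condition in timeline]
print(states)
```

The alert stays pending for five evaluations, fires once the condition has held for 300 seconds, and drops back to inactive as soon as the condition is false.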

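Remote TSDB

Data is a gold mine, but the TSDB built into Prometheus keeps only 15 days of history by default and is not meant for long-term storage.

To keep monitoring data for longer, write it to a remote TSDB such as InfluxDB.

With a large body of historical monitoring data and logs, machine learning can be used to predict alert trends, moving toward the goal of intelligent operations.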

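III. Monitoring Coverage in a Cloud-Native Architecture

Monitoring cloud-native applications usually spans several layers:

Infrastructure
Application performance
Call chains between applications

A cloud-native architecture (Kubernetes in particular) is highly dynamic and distributed, so dedicated tools are needed to manage and monitor its resources.

The common monitoring methods and tools are listed below.

1. Infrastructure-layer monitoring

Infrastructure monitoring usually covers operating system resources (CPU, memory, disk, network) and the state of the container runtime and orchestrator (Docker or Kubernetes).

Node Exporter: monitors host-level performance metrics (CPU, memory, disk, network).
cAdvisor: monitors container-level performance metrics, including each container's CPU, memory, disk, and network usage.
Kube-state-metrics: monitors the state of Kubernetes cluster resources, such as the health and configuration of Pods, Nodes, Deployments, and ReplicaSets.

Node Exporter

Node Exporter normally runs as a DaemonSet, so that one instance runs on every node of the K8s cluster.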

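kube-state-metrics

kube-state-metrics focuses on exposing the state of Kubernetes resources.

It obtains the metadata of the cluster's resources from the Kubernetes API server.

This data covers the state and configuration of Kubernetes resource objects such as Pods, Nodes, Deployments, and ReplicaSets.

It does not provide performance metrics (such as CPU or memory usage); it reports only the state and metadata of Kubernetes resources (Pods, Deployments, Nodes, Namespaces, and so on).

For example, it reports whether Pods are running, how many replicas a Deployment has, and whether a Node is healthy.

kube-state-metrics is not one of the default Kubernetes components, so it must be deployed and configured in the cluster separately.

cAdvisor (Container Advisor)

cAdvisor is an open-source tool that analyzes container performance and monitors container resource usage.

The resource usage shown by kubectl top pods and kubectl top nodes is served by metrics-server, which collects it from the kubelet on each node; the container-level figures traditionally originate from the cAdvisor embedded in the kubelet.

cAdvisor was developed by Google and is widely used in containerized environments (especially Docker and Kubernetes) to help developers and operators monitor and analyze container resource usage.

In a Kubernetes cluster cAdvisor is part of the kubelet: by default the kubelet on every node runs cAdvisor to monitor container resource usage.

Therefore cAdvisor does not need to be installed separately in a Kubernetes cluster.

By default, the cAdvisor metrics are served by the kubelet at the following URL (the kubelet's port 10250 uses HTTPS and normally requires authentication):

https://<node-ip>:10250/metrics/cadvisor

When Prometheus scrapes these metrics, Prometheus Server fetches the container resource usage data from this port and path.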

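2. Application-layer monitoring

Application-layer monitoring focuses on the performance, behavior, and health of the application. Collecting custom metrics from the application (request volume, response time, error rate, and so on) gives a clearer picture of how it is running.

Prometheus & Client Libraries: Prometheus provides a powerful query language (PromQL), and its client libraries (Go, Java, Python, and others) collect custom application metrics such as API request counts and response times.
Application Performance Monitoring (APM): tools such as OpenTelemetry, Jaeger, and Zipkin trace an application's request flow (distributed tracing) and monitor performance, which helps identify bottlenecks and performance problems.

3. Monitoring call chains between applications (Trace)

Distributed tracing follows a request as it crosses multiple services or microservices, helping developers locate performance bottlenecks and faults in the system.

Jaeger and Zipkin: usually combined with Prometheus or other monitoring tools to collect, store, and analyze distributed trace data, so that developers can understand the request flow between services.
OpenTelemetry: an open-source framework that provides a unified API and SDK for collecting distributed traces, metrics, and logs. It can be used together with Prometheus, Jaeger, and similar tools.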

IV. Deploying Prometheus

1. Download the components

2. Start Prometheus Server

Prometheus Server is written in Go; simply run the binary.

Prometheus Server is itself instrumented and exposes its own metrics at http://<external-IP-of-the-host>:9090/metrics.

[root@localhost prometheus-2.40.2]# ls
console_libraries  consoles  data  LICENSE  NOTICE  prometheus  prometheus.yml  promtool
[root@localhost prometheus-2.40.2]# ./prometheus

Hot-reloading the configuration file

curl -X POST http://localhost:9090/-/reload

To reload this way, Prometheus must be started with --web.enable-lifecycle; add the flag to the Prometheus start-up (service) file.
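The same reload can be triggered from a script, for example after a deployment tool has rewritten the configuration. The sketch below uses only the standard library and assumes Prometheus listens on localhost:9090 with the lifecycle API enabled; the function names are chosen for this example.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build the POST request for the Prometheus lifecycle reload endpoint."""
    return urllib.request.Request(f"{base_url}/-/reload", data=b"", method="POST")


def reload_prometheus(base_url="http://localhost:9090", timeout=10):
    """Trigger a reload and return the HTTP status code.

    urlopen raises urllib.error.HTTPError when the server rejects the request,
    for instance when the lifecycle API is disabled.
    """
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


req = build_reload_request()
print(req.get_method(), req.full_url)
```

Only the request object is built here; calling reload_prometheus() sends it.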

3. Start node_exporter

Deploy node_exporter on each monitored host. node_exporter targets *NIX systems; for Windows hosts, use the community windows_exporter or write a custom Exporter.

# --web.listen-address: address and port to listen on
# --collector.textfile.directory: read custom metrics from files under /data/textfile
/usr/bin/node_exporter \
  --web.listen-address=10.65.0.46:49100 \
  --collector.meminfo_numa \
  --no-collector.bcache \
  --no-collector.arp --no-collector.edac \
  --no-collector.entropy --no-collector.hwmon \
  --no-collector.ipvs --no-collector.timex \
  --no-collector.nfs --no-collector.nfsd \
  --no-collector.wifi \
  --no-collector.zfs \
  --no-collector.xfs \
  --collector.textfile.directory=/data/textfile

By default node_exporter listens at http://<monitored-host-IP>:9100/metrics (the command above overrides this with port 49100); Prometheus Server scrapes it over HTTP by pull.

[root@localhost node_exporter-1.4.0.linux-amd64]# ls
LICENSE  node_exporter  NOTICE
[root@localhost node_exporter-1.4.0.linux-amd64]# ./node_exporter 
ts=2022-11-18T10:22:19.410Z caller=node_exporter.go:182 level=info msg="Starting node_exporter" version="(version=1.4.0, branch=HEAD, revision=7da1321761b3b8dfc9e496e1a60e6a476fec6018)"
ts=2022-11-18T10:22:19.410Z caller=node_exporter.go:183 level=info msg="Build context" build_context="(go=go1.19.1, user=root@83d90983e89c, date=20220926-12:32:56)"
ts=2022-11-18T10:22:19.410Z caller=node_exporter.go:185 level=warn msg="Node Exporter is running as root user. This exporter is designed to run as unprivileged user, root is not required."
ts=2022-11-18T10:22:19.410Z caller=filesystem_common.go:111 level=info collector=filesystem msg="Parsed flag --collector.filesystem.mount-points-exclude" flag=^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+|var/lib/containers/storage/.+)($|/)
ts=2022-11-18T10:22:19.410Z caller=filesystem_common.go:113 level=info collector=filesystem msg="Parsed flag --collector.filesystem.fs-types-exclude" flag=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
ts=2022-11-18T10:22:19.411Z caller=diskstats_common.go:100 level=info collector=diskstats msg="Parsed flag --collector.diskstats.device-exclude" flag=^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$
ts=2022-11-18T10:22:19.411Z caller=node_exporter.go:108 level=info msg="Enabled collectors"
ts=2022-11-18T10:22:19.411Z caller=node_exporter.go:115 level=info collector=arp

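3.1. Adding custom metrics

Use node_exporter's --collector.textfile.directory=/data/textfile start-up option to add custom metrics.

The example monitors Salt minions whose status is Not connected or No response.

3.1.1. crontab

Every 5 minutes, cron runs the minion check (salt \* test.ping > file), writes the output to a file, and derives the metrics from that file.

*/5 * * * * /home/scripts/sa_utils/salt-check-minion-new.sh

3.1.2. Shell script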

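#!/bin/bash
# Run test.ping against all minions and turn the result into textfile-collector metrics.
logfile="/tmp/salt-check.txt"
textfile_directory="/data/textfile"
[ -d "$textfile_directory" ] || mkdir -p "$textfile_directory"
salt \* test.ping > "${logfile}" 2>&1
# Use an absolute path: cron does not run the job from the script's directory.
/bin/python3 /home/scripts/sa_utils/salt_check_minion.py "$logfile" "$textfile_directory"

/home/scripts/sa_utils/salt-check-minion-new.sh

3.1.3. Python script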

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Author: Zhanglijia
# Created: 2023-05-15

import sys


class MinionStatusChecking(object):
    def __init__(self, logfile, textfile_directory):
        self.logfile = logfile
        # Despite its name, this holds the full path of the .prom output file.
        self.textfile_directory = textfile_directory
        self.no_response_position = []
        self.no_response_hosts = []
        self.no_connected_position = []
        self.no_connected_hosts = []

    def get_log_data(self):
        # Read the saved output of "salt \* test.ping".
        with open(self.logfile) as f:
            line_list = f.readlines()
        self.log_data = line_list

    def mark_error_minion_position(self):
        # The host name is printed on the line before its status line,
        # so remember the (1-based) number of the preceding line.
        line_number = 0
        for line in self.log_data:
            line = line.strip()
            line_number += 1
            if line.endswith("[No response]"):
                self.no_response_position.append(line_number - 1)
            elif line.endswith("[Not connected]"):
                self.no_connected_position.append(line_number - 1)

    def mark_error_minion_hosts(self):
        # Collect the host names found at the remembered line numbers.
        line_number = 0
        for line in self.log_data:
            line = line.strip()
            line_number += 1
            if line_number in self.no_response_position:
                self.no_response_hosts.append(line)
            elif line_number in self.no_connected_position:
                self.no_connected_hosts.append(line)

    def mark_error_metrics(self):
        # Write one sample per status in the Prometheus text format.
        with open(self.textfile_directory, "w") as f:
            line = 'salt_minion_errors{status="host_not_connected",hosts="%s"} %d\n' % (
                ",".join(self.no_connected_hosts), len(self.no_connected_hosts))
            f.write(line)
            line = 'salt_minion_errors{status="host_not_response",hosts="%s"} %d\n' % (
                ",".join(self.no_response_hosts), len(self.no_response_hosts))
            f.write(line)

    def __call__(self, *args, **kwargs):
        print("####### Starting #######")
        self.get_log_data()
        self.mark_error_minion_position()
        self.mark_error_minion_hosts()
        self.mark_error_metrics()
        print("####### Done, please check %s #######" % (self.textfile_directory))


if __name__ == '__main__':
    try:
        log_file = sys.argv[1]
        textfile_directory_file = sys.argv[2] + "/salt_minion_errors.prom"
    except IndexError:
        print("Usage: salt_check_minion.py <log_file> <textfile_directory>")
        sys.exit(1)
    else:
        MinionStatusChecking(log_file, textfile_directory_file)()

/home/scripts/sa_utils/salt_check_minion.py

3.1.4. Format of the metric data

salt_minion_errors{status="host_not_connected",hosts="datanode065.eq-sg-3.zhanggen.com:,kafka010.eq-sg-3.apus.com:,vm-cleanapi003.eq-sg-2.zhanggen.com:"} 72
salt_minion_errors{status="host_not_response",hosts="datanode065.eq-sg-2.zhanggen.com:,kafka010.eq-sg-2.apus.com:,vm-cleanapi003.eq-sg-2.zhanggen.com:"} 19

/data/textfile/salt_minion_errors.prom
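One practical caveat with the textfile collector: node_exporter may read the .prom file while it is still being written. A common remedy is to write to a temporary file in the same directory and rename it into place. The sketch below is an assumption-labelled illustration, not part of the original script: the helper name write_prom_file_atomically is invented, and the demo writes to a throwaway directory instead of /data/textfile.

```python
import os
import tempfile


def write_prom_file_atomically(path, lines):
    """Write metric lines to `path` via a temporary file and an atomic rename."""
    directory = os.path.dirname(path) or "."
    # The temporary name does not end in .prom, so the collector ignores it.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.writelines(line + "\n" for line in lines)
        # mkstemp creates the file with mode 0600; make it readable for the
        # (usually unprivileged) node_exporter user.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise


# Demo against a throwaway directory.
demo_dir = tempfile.mkdtemp()
target = os.path.join(demo_dir, "salt_minion_errors.prom")
write_prom_file_atomically(target, [
    'salt_minion_errors{status="host_not_connected"} 2',
    'salt_minion_errors{status="host_not_response"} 0',
])
with open(target) as f:
    written = f.read()
print(written, end="")
```

Because the rename replaces the old file in one step, a scrape sees either the previous complete file or the new complete file, never a half-written one.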

3.1.5. View the collected metrics

salt_minion_errors{status=~"host_not_connected|host_not_response"}

3.1.6. Alerting rule configuration (Prometheus rule file)

groups:
- name: default
  rules:
  - alert: "SaltMinionNodeFailure"
    expr: salt_minion_errors{status=~"host_not_connected|host_not_response"} >= 1
    for: 5m
    labels:
      level: P4
    annotations:
      summary: "salt-minion node failure"
      description: "salt master:{{$labels.idc}}|{{$labels.instance}}|{{$labels.hosts}}|{{$value}} minion(s) are failing; the salt-minion process may be faulty, please investigate as soon as possible."

/data/prometheus/conf/rules/alert/default.yml

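4. prometheus.yml configuration

Start with static configuration to tell Prometheus which targets to pull from.

Prometheus reads prometheus.yml at start-up and re-reads it when asked to reload (SIGHUP or the /-/reload endpoint); it does not poll the file on a timer. Target files referenced by file_sd_configs, however, are re-read automatically when they change.

With a service registry such as Eureka, Consul, or ZooKeeper, new targets can be discovered automatically without restarting Prometheus Server.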

global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "zhanggen_nodes"
    metrics_path: "/metrics"
    static_configs:
      - targets:
        - "192.168.56.18:9100"

Additional configuration files can be pulled in from the main Prometheus configuration file.

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 10.65.9.102:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "/data/prometheus/conf/rules/alert/*.yml"
  - "/data/prometheus/conf/rules/record/*.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "federate"
    scrape_interval: 45s
    scrape_timeout: 45s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"..*"}'
    static_configs:
      - targets: ['10.65.9.102:32717','10.65.9.102:31860']
        labels:
          cluster: 'sg2'
      - targets: ['10.65.9.112:32434','10.65.9.112:31866']
        labels:
          cluster: 'sg1'

  - job_name: "node"
    scrape_interval: 30s
    scrape_timeout: 30s
    file_sd_configs:
    - files:
      - /data/prometheus/conf/file-sd/node.yaml

  - job_name: "tc-sg-xallog"
    scrape_interval: 30s
    scrape_timeout: 30s
    file_sd_configs:
    - files:
      - /data/prometheus/conf/file-sd/tc-xallog.yaml

  - job_name: "kafka"
    scrape_interval: 45s
    scrape_timeout: 45s
    file_sd_configs:
    - files:
      - /data/prometheus/conf/file-sd/kafka.yaml

  - job_name: "rabbitmq"
    scrape_interval: 45s
    scrape_timeout: 45s
    file_sd_configs:
    - files:
      - /data/prometheus/conf/file-sd/rabbitmq.yaml

  - job_name: "elasticsearch"
    scrape_interval: 60s
    scrape_timeout: 60s
    file_sd_configs:
    - files:
      - /data/prometheus/conf/file-sd/elasticsearch.yaml

  - job_name: zookeeper
    scrape_interval: 60s
    scrape_timeout: 60s
    file_sd_configs:
    - files:
      - /data/prometheus/conf/file-sd/zookeeper.yaml

  - job_name: 'blackbox'
    metrics_path: /probe
    scrape_interval: 1m
    params:
      module: [http_2xx]  # Look for a HTTP 200 response.
    file_sd_configs:
    - files:
      - /data/prometheus/conf/file-sd/http.yaml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # The blackbox exporter's real hostname:port.

  - job_name: 'https'
    scrape_interval: 4m
    metrics_path: /probe
    params:
      module: [https_tls]  # Module name as defined in the blackbox exporter configuration.
    file_sd_configs:
    - files:
      - /data/prometheus/conf/file-sd/https.yaml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # The blackbox exporter's real hostname:port.

  - job_name: 'tcp-connect'
    metrics_path: /probe
    params:
      module: [tcp_connect]  # Check that a TCP connection can be established.
    file_sd_configs:
    - files:
      - /data/prometheus/conf/file-sd/tcp.yaml
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # The blackbox exporter's real hostname:port.

prometheus.yml
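The files referenced by file_sd_configs hold lists of target groups, each with a targets list and optional labels, in YAML or JSON. Such a file is often generated from an inventory. The sketch below renders the JSON form with the standard library; the host addresses and the cluster label values are reused from this article purely as sample data, and the function name build_file_sd is invented for the example.

```python
import json


def build_file_sd(groups):
    """Render target groups in the JSON format accepted by file_sd_configs.

    `groups` is a list of (targets, labels) pairs.
    """
    payload = [
        {"targets": list(targets), "labels": dict(labels)}
        for targets, labels in groups
    ]
    return json.dumps(payload, indent=2)


# Sample hosts and labels, for illustration only.
document = build_file_sd([
    (["192.168.56.18:9100", "192.168.56.19:9100"], {"cluster": "sg1"}),
    (["192.168.56.20:9100"], {"cluster": "sg2"}),
])
print(document)
```

Saving the output to a .json file listed under files lets Prometheus pick up target changes without a restart or reload.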

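5. View the monitored targets

6. Querying alerts of the same Job with PromQL

Standardized operations are a precondition for automated operations.

If the system layer, the K8s layer, and the business-application layer all carry the same label when configured, for example by grouping them under one Job, the related alerts from every layer can be tied together and sent as one.

V. Prometheus Metrics

Prometheus Server periodically pulls metrics over HTTP from the endpoints that targets expose, such as Exporters.

Prometheus Server stores the collected metrics (time series) in its time-series database.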

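1. Metric types

Prometheus describes monitored metrics with four types.

Counter

A counter holds a monotonically increasing value, such as the number of site visits. It cannot be negative and cannot decrease, but it can be reset to 0 (for example when the process restarts).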
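Because a counter only grows or resets, the growth over a window can be recovered even across a restart. The sketch below shows the idea behind reset handling in a simplified form; it is not the PromQL implementation (the real increase() function also extrapolates to the window boundaries), and the sample values are made up.

```python
def counter_increase(samples):
    """Total increase of a counter over a list of raw sample values.

    A drop between two consecutive samples is treated as a reset to zero,
    so the value after the drop counts entirely as new increase.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        total += (current - previous) if current >= previous else current
    return total


# The process restarts between the 3rd and 4th sample (120 -> 5).
samples = [100, 110, 120, 5, 15]
print(counter_increase(samples))
```

A plain "last minus first" would give a negative number for these samples; treating the drop as a reset yields the 35 events that were actually counted.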

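Gauge

A gauge stores a value that goes up and down, such as the amount of free memory.

A gauge can represent anything a counter can, but it is more exposed to data loss: only the value at scrape time is recorded, so any change between two scrapes is invisible.

A counter is cumulative, so the total change between two scrapes can always be recovered. A gauge only gives point-in-time snapshots, and the picture it gives gets coarser as the scrape interval grows.

Histogram

A histogram divides the data into ranges (also called buckets) and counts how many observations fall into each one.

For example, when measuring request latency, the response time can be split into buckets such as 0-100 ms, 100-200 ms, and 200-500 ms, and the number of requests in each range is counted. In Prometheus the buckets are cumulative: each bucket counts the observations less than or equal to its upper bound (the le label).

Summary

A summary is configured in advance with a fixed set of quantiles.

For example, we may want to know the response time that 90% of requests stay below (P90), or 99% of requests (P99).

A summary computes these quantiles in the client and reports them directly, so no buckets need to be defined.

Choosing between Histogram and Summary

Histogram: suited to analyzing the overall distribution, especially when different quantiles need to be queried on demand. For example, the P90 or P99 response time can be computed at query time in Prometheus.
Summary: suited to cases where the client (the application) already knows which fixed quantiles it needs, for example only the P90 or P99 response time. The summary computes and reports those quantiles directly.

An example

Suppose we monitor the response time of a service:

With a Histogram, we define several ranges (0-100 ms, 100-200 ms, 200-500 ms, and so on), and after each request the response time is recorded in the matching bucket. From this data any quantile (P50, P90, P99) can be estimated at query time, and the distribution of requests across the ranges can be analyzed.
With a Summary, the application reports the P90 or P99 it has computed itself. It does not keep per-range data, so new quantiles cannot be derived from what was reported. This simplifies processing, but a quantile that was not configured (P80, for example) cannot be obtained afterwards.

Conclusion

Histogram: use it when flexible querying and aggregation are needed, especially when data from several service instances must be combined.
Summary: use it for fixed, simple requirements, where only specific quantiles are observed and do not change.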
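How a quantile is estimated from histogram buckets can be sketched in a few lines. The code below counts observations into cumulative buckets and then interpolates linearly inside the bucket that contains the requested rank, which is the idea behind PromQL's histogram_quantile. The bucket bounds and latencies are invented sample data, and the sketch assumes at least one observation and 0 < q <= 1.

```python
import bisect

# Upper bounds of the buckets in milliseconds; the last bucket is +Inf.
BOUNDS = [100.0, 200.0, 500.0, float("inf")]


def cumulative_buckets(observations):
    """Count observations per bucket, cumulatively (like the `le` label)."""
    counts = [0] * len(BOUNDS)
    for value in observations:
        # bisect_left picks the first bound that is >= value.
        counts[bisect.bisect_left(BOUNDS, value)] += 1
    for i in range(1, len(counts)):
        counts[i] += counts[i - 1]
    return counts


def estimate_quantile(q, counts):
    """Estimate the q-quantile by linear interpolation inside one bucket."""
    rank = q * counts[-1]
    index = next(i for i, c in enumerate(counts) if c >= rank)
    if BOUNDS[index] == float("inf"):
        # No interpolation is possible in the open-ended bucket.
        return BOUNDS[index - 1]
    lower = BOUNDS[index - 1] if index > 0 else 0.0
    below = counts[index - 1] if index > 0 else 0
    in_bucket = counts[index] - below
    return lower + (BOUNDS[index] - lower) * (rank - below) / in_bucket


latencies_ms = [20, 40, 60, 80, 120, 150, 180, 250, 300, 700]
counts = cumulative_buckets(latencies_ms)
print(counts)
print(estimate_quantile(0.5, counts), estimate_quantile(0.9, counts))
```

The estimate depends on the bucket layout: the true median of the sample is between 120 and 150 ms, while interpolation inside the 100-200 ms bucket reports about 133 ms. This is the accuracy trade-off that comes with the histogram's flexibility.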

2. Metric data format

Prometheus stores only numeric time-series data in key-value form; it does not store text.

# HELP cpu_usage CPU usage percentage
# TYPE cpu_usage gauge
cpu_usage{core_number="1",machine_ip="192.168.56.18"} 25.0
cpu_usage{core_number="2",machine_ip="192.168.56.18"} 25.0
cpu_usage{core_number="3",machine_ip="192.168.56.18"} 25.0
cpu_usage{core_number="4",machine_ip="192.168.56.18"} 25.0
cpu_usage{core_number="1",machine_ip="192.168.56.19"} 23.0
cpu_usage{core_number="2",machine_ip="192.168.56.19"} 27.0
cpu_usage{core_number="3",machine_ip="192.168.56.19"} 28.0
cpu_usage{core_number="4",machine_ip="192.168.56.19"} 22.0
cpu_usage{core_number="1",machine_ip="192.168.56.20"} 21.0
cpu_usage{core_number="2",machine_ip="192.168.56.20"} 29.0
cpu_usage{core_number="3",machine_ip="192.168.56.20"} 32.0
cpu_usage{core_number="4",machine_ip="192.168.56.20"} 28.0

Key = metric name + labels

Metric name: names what is being measured, such as CPU usage, memory usage, or partition usage. Together with the labels it forms the key.
Labels: a metric name alone cannot pinpoint one monitored target or device. The name plus labels identifies each individual item precisely (a multi-dimensional description).

Labels also act as filters for selecting and aggregating metrics.

Value

A floating-point number
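The key/value structure becomes concrete when one sample line is taken apart. The parser below is a deliberately small sketch: it handles the common name{labels} value form shown in this article, but not optional timestamps or escape sequences inside label values. The regular expressions and the function name parse_sample are written for this example.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one sample line into (metric name, label dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


name, labels, value = parse_sample(
    'cpu_usage{core_number="1",machine_ip="192.168.56.18"} 25.0'
)
print(name, labels, value)
```

The metric name and the label dictionary together form the key of the time series; the float is the value of one sample.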

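VI. PromQL

Prometheus models monitoring data as vectors of time series.

It provides a built-in query language, PromQL (Prometheus Query Language), whose expressions are also called vector expressions, for real-time vector queries and aggregation.

1. PromQL syntax

A PromQL query uses a metric name plus labels as its condition and returns the matching time series (vectors) from the Prometheus time-series database.

2. PromQL query results

A PromQL expression evaluates to one of four types; the two vector types, instant vector and range vector, are the ones met most often.

Instant vector: a set of time series containing a single sample for each time series, all sharing the same timestamp
Range vector: a set of time series containing a range of data points over time for each time series
Scalar: a simple numeric floating point value
String: a simple string value; currently unused

The most common are:

Instant vector: the latest value of each series (real-time data), typically used for alerting and live monitoring (a 1-dimensional array).
Range vector: all samples within a time range, mostly used for analysis and prediction (a 2-dimensional array).

2.1. Instant vector

Each time series in the result contains exactly one timestamp and one value.

curl 'http://localhost:9090/api/v1/query' \
  --data 'query=http_requests_total{code="200"}' \
  --data time=1608481001
{
  "metric": {"__name__": "http_requests_total", "code": "200"},
  "value": [1608481001, "881"]
}

In value, the first element is the timestamp and the second is the sample value. The JSON shown is one element of the result, not the full response.

An instant vector is a set of time series in which each series carries only its most recent sample: a point, not a line.

2.2. Range vector

Each time series in the result contains N timestamps and N values (a 2-dimensional array).

curl 'http://localhost:9090/api/v1/query' \
  --data 'query=http_requests_total{code="200"}[30s]' \
  --data time=1608481001
{
  "metric": {"__name__": "http_requests_total", "code": "200"},
  "values":
         [  [1608480978, "863"], [1608480986, "874"], [1608480994, "881"] ]
}

A range vector holds the samples of each time series over a time range, so each series can contain several points.

Range vectors are usually passed to functions. For example, the following expression matches only the containers that were ready at every sample in the last 4 hours:

min_over_time(kube_pod_container_status_ready{namespace=~"aimaster-user-namespace.*",pod=~"app-workspace.*"}[4h]) == 1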
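What a function such as min_over_time does with a range vector can be reproduced on the client side. The sketch below parses a response shaped like the output of /api/v1/query for a range selector (result type matrix) and takes the minimum of each series. The embedded JSON is sample data modeled on the example in this section, and the function name min_over_window is invented.

```python
import json

# A response shaped like the output of /api/v1/query for a range selector.
RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {"__name__": "http_requests_total", "code": "200"},
        "values": [[1608480978, "863"], [1608480986, "874"], [1608480994, "881"]]
      }
    ]
  }
}
""")


def min_over_window(response):
    """Return {label set: minimum sample value} for a matrix result."""
    data = response["data"]
    if data["resultType"] != "matrix":
        raise ValueError("expected a range vector (resultType 'matrix')")
    result = {}
    for series in data["result"]:
        # Sample values arrive as strings and must be converted to float.
        key = tuple(sorted(series["metric"].items()))
        result[key] = min(float(value) for _, value in series["values"])
    return result


minimums = min_over_window(RESPONSE)
print(minimums)
```

Each series in the matrix collapses to a single number, which is why applying such a function to a range vector yields an instant vector again.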

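VII. Alertmanager Alerting Policy

As mentioned above, Prometheus Server and Alertmanager are two separate components.

Prometheus Server pulls metrics from Exporters, stores them, evaluates the alerting rules, and fires alerts; Alertmanager is responsible for delivering the notifications.

The main configuration file of Prometheus Server (prometheus.yml) specifies which rule files define the alert conditions and which Alertmanager instances receive the alerts.

Prometheus fires alerts according to its alerting rules and sends them to Alertmanager.

The Alertmanager component then processes the alerts fired by Prometheus.

Besides basic notification, Alertmanager provides alert features such as grouping, inhibition, and silencing.

1. Configure the Alertmanager address in Prometheus

In the main Prometheus configuration file (prometheus.yml), configure which Alertmanager instances the alerts are pushed to once they fire.

/data/prometheus/conf/prometheus.yml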

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 10.65.9.102:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "/data/prometheus/conf/rules/alert/*.yml"
  - "/data/prometheus/conf/rules/record/*.yml"

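2. Configure the alert conditions in Prometheus

The alert conditions are defined in the rule files that the main configuration file (prometheus.yml) loads through rule_files.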