Full-spectrum k8s monitoring with Prometheus: configuration file overview and file-based service discovery
1. Introduction to the scrape_configs parameters
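    # Global defaults
    global:
      scrape_interval: 15s      # scrape every 15s (Prometheus default: 1m)
      evaluation_interval: 15s  # evaluate rules every 15s (Prometheus default: 1m)
      scrape_timeout: 10s       # scrape timeout (Prometheus default: 10s)
      external_labels:          # labels attached when talking to external systems, e.g. remote storage or federation
        prometheus: monitoring/k8s            # e.g. the values set by prometheus-operator
        prometheus_replica: prometheus-k8s-1

    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - 127.0.0.1:9093   # Alertmanager address, e.g. 127.0.0.1:9093
      alert_relabel_configs:       # rewrite alert labels before alerts are sent to Alertmanager
        - separator: ;
          regex: prometheus_replica
          replacement: $1
          action: labeldrop

    # Once rule files are loaded, they are evaluated every evaluation_interval (15s here).
    # More than one rule file may be listed.
    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"

    # scrape_configs holds the scrape configuration and contains at least one job
    scrape_configs:
      # Prometheus monitoring itself; scraped time series get the label job=<job_name>
      - job_name: 'prometheus'
        # The default metrics path is /metrics, e.g. localhost:9090/metrics
        # The default scheme is http
        static_configs:
          - targets: ['localhost:9090']

    # Remote write (optional): send samples to remote storage such as InfluxDB.
    # Without it, data is only written and read locally.
    remote_write:
      - url: http://127.0.0.1:8090
    # Remote read (optional)
    remote_read:
      - url: http://127.0.0.1:8090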
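Prometheus rejects a scrape configuration whose scrape_timeout is longer than its scrape_interval, so the two values above must be kept consistent when either is changed. The following is a minimal Python sketch of that check. The helper names `to_seconds` and `timeout_fits` are hypothetical, and the parser only handles a single number-plus-unit duration such as `15s` or `1m`, not compound forms like `1h30m`.

```python
import re

# Seconds per unit for the duration suffixes handled by this sketch.
_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(duration):
    """Convert a simple duration string such as '15s' or '1m' to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d|w|y)", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def timeout_fits(scrape_interval="1m", scrape_timeout="10s"):
    """Return True when the timeout does not exceed the interval."""
    return to_seconds(scrape_timeout) <= to_seconds(scrape_interval)


if __name__ == "__main__":
    # Values taken from the global section shown above.
    print(timeout_fits(scrape_interval="15s", scrape_timeout="10s"))
```

The defaults of the hypothetical `timeout_fits` mirror the Prometheus defaults quoted in the comments above (1m interval, 10s timeout).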
2. scrape_configs configuration examples
scrape_configs is the most frequently edited part of the Prometheus configuration, for example when adding a new monitoring target or changing the address or scrape frequency of an existing one.

The simplest configuration is:

    scrape_configs:
      - job_name: prometheus
        metrics_path: /metrics
        scheme: http
        static_configs:
          - targets:
              - localhost:9090

A full configuration for a single job looks like this (the relabeling rules follow the configuration recommended by prometheus-operator). The authentication options are listed together for illustration only; a real job uses one method and one source for each secret.

    # The job name appears as a label on the scraped metrics,
    # e.g. data scraped from node-exporter carries job=node-exporter
    job_name: node-exporter
    # Scrape frequency: 30s
    scrape_interval: 30s
    # Scrape timeout: 10s
    scrape_timeout: 10s
    # Metrics path on the target
    metrics_path: /metrics
    # Scrape scheme: http or https
    scheme: https
    # Optional URL parameters for the scrape request
    params:
      name: demo
    # How to handle a conflict between labels attached by Prometheus and labels
    # already present in the scraped data. With false (the default), the
    # conflicting scraped label is renamed to exported_<label>.
    honor_labels: false
    # Credentials, when the target requires authentication
    basic_auth:
      username: admin
      password: admin
      # password_file: /etc/pwd   # alternative to password; the two are mutually exclusive
    # Bearer token, given inline or read from a file
    bearer_token: kferkhjktdgjwkgkrwg
    # bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token   # alternative to bearer_token
    # HTTPS settings, e.g. skipping verification or pointing at certificate files
    tls_config:
      # insecure_skip_verify: true
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      server_name: kubernetes
      insecure_skip_verify: false
    # Proxy address
    proxy_url: 127.9.9.0:9999
    # Azure service discovery
    azure_sd_configs:
    # Consul service discovery
    consul_sd_configs:
    # DNS service discovery
    dns_sd_configs:
    # EC2 service discovery
    ec2_sd_configs:
    # OpenStack service discovery
    openstack_sd_configs:
    # File-based service discovery
    file_sd_configs:
    # GCE service discovery
    gce_sd_configs:
    # Marathon service discovery
    marathon_sd_configs:
    # AirBnB Nerve service discovery
    nerve_sd_configs:
    # Zookeeper serverset service discovery
    serverset_sd_configs:
    # Triton service discovery
    triton_sd_configs:
    # Kubernetes service discovery
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - monitoring
    # Static targets, optionally with fixed labels
    static_configs:
      - targets: ['localhost:9090', 'localhost:9191']
        labels:
          my: label
          your: label
    # Before scraping, rewrite target labels dynamically from the target's
    # metadata, e.g. copy the raw __meta_kubernetes_namespace into the
    # shorter, clearer label namespace
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        separator: ;
        regex: (.*)
        target_label: namespace
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: service
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_pod_name]
        separator: ;
        regex: (.*)
        target_label: pod
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: job
        replacement: ${1}
        action: replace
      - separator: ;
        regex: (.*)
        target_label: endpoint
        replacement: web
        action: replace
    # Relabeling of scraped metrics, e.g. dropping metrics that are not needed
    metric_relabel_configs:
      - source_labels: [__name__]
        separator: ;
        regex: etcd_(debugging|disk|request|server).*
        replacement: $1
        action: drop
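To make the relabeling rules above easier to reason about, here is a rough Python sketch of how the `replace` and `drop` actions behave. It is an approximation under stated assumptions, not Prometheus's implementation: Prometheus uses RE2 regular expressions while this sketch uses Python's `re` module, edge cases such as empty replacement values are ignored, and the function names `relabel_replace` and `keep_metric` and the sample label values are hypothetical. The one behavior worth noticing is that the regex must match the whole joined value, not just a substring.

```python
import re


def relabel_replace(labels, source_labels, target_label,
                    regex="(.*)", replacement="$1", separator=";"):
    """Approximate the `replace` relabel action on a dict of labels."""
    # Join the values of the source labels with the separator.
    value = separator.join(labels.get(name, "") for name in source_labels)
    # The regex is anchored at both ends, hence fullmatch.
    match = re.fullmatch(regex, value)
    if match is None:
        return dict(labels)  # no match: labels are left unchanged
    # Translate $1 / ${1} group references into Python's \g<1> syntax.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


def keep_metric(name, drop_regex):
    """Approximate `action: drop` on __name__: False means the series is dropped."""
    return re.fullmatch(drop_regex, name) is None


if __name__ == "__main__":
    # Sample discovery metadata (illustrative values only).
    target = {
        "__meta_kubernetes_namespace": "monitoring",
        "__meta_kubernetes_service_name": "node-exporter",
    }
    target = relabel_replace(target, ["__meta_kubernetes_namespace"], "namespace")
    target = relabel_replace(target, ["__meta_kubernetes_service_name"], "job",
                             replacement="${1}")
    target = relabel_replace(target, [], "endpoint", replacement="web")
    print(target["namespace"], target["job"], target["endpoint"])
    print(keep_metric("etcd_debugging_example_total",
                      r"etcd_(debugging|disk|request|server).*"))
```

The rule with no `source_labels` works because the joined value is then the empty string, which `(.*)` matches, so the fixed replacement `web` is always written.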
3. Common use cases
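1. Collect information about each node in the cluster and group it by availability zone or region

When node data is collected with the Kubernetes role: node, the zone of each node is available through the label "__meta_kubernetes_node_label_failure_domain_beta_kubernetes_io_zone". It is derived from the node label failure-domain.beta.kubernetes.io/zone, which is attached to the node when the cluster is created and can be seen with kubectl describe node.

A new label can then be defined with relabel_configs:

    relabel_configs:
      - source_labels: ["__meta_kubernetes_node_label_failure_domain_beta_kubernetes_io_zone"]
        regex: "(.*)"
        replacement: $1
        action: replace
        target_label: "zone"

Afterwards you can filter by zone directly with a selector such as node{zone="XX"}.

2. Filter data, or organize monitoring by role (developers, operations)

People in different roles (development, testing, operations) may only care about part of the monitoring data, and may each run their own Prometheus server for the metrics they care about. Data they do not need should be filtered out to avoid wasting resources. A configuration along these lines can be used:

    metric_relabel_configs:
      - source_labels: [__name__]
        separator: ;
        regex: etcd_(debugging|disk|request|server).*
        replacement: $1
        action: drop

action: drop discards every series whose name matches the regex. Because the rule is in metric_relabel_configs, the series are still fetched from the target but are dropped before they are stored.

3. Build a Prometheus federation to manage monitoring instances per IDC (region)

If there are several regions, each with many nodes or clusters, you can use Prometheus federation: each region runs its own Prometheus server that scrapes the data of that region, and a central server then scrapes all regional servers so the data can be displayed in one place and grouped by region.

Configuration:

    scrape_configs:
      - job_name: 'federate'
        scrape_interval: 15s
        honor_labels: true
        metrics_path: '/federate'
        params:
          'match[]':
            - '{job="prometheus"}'
            - '{__name__=~"job:.*"}'
            - '{__name__=~"node.*"}'
        static_configs:
          - targets:
              - '192.168.77.11:9090'
              - '192.168.77.12:9090'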
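For debugging a federation setup it can help to see the request the central server effectively issues: a GET on /federate with one match[] query parameter per selector. The Python sketch below builds such a URL from the values in the configuration above. The helper name `federate_url` is hypothetical, and the sketch assumes plain http with no authentication.

```python
from urllib.parse import urlencode


def federate_url(target, matchers, scheme="http"):
    """Build a /federate URL with one match[] parameter per selector."""
    # A list of pairs lets the same key (match[]) repeat in the query string.
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{scheme}://{target}/federate?{query}"


if __name__ == "__main__":
    # Target and selectors taken from the federate job shown above.
    selectors = ['{job="prometheus"}', '{__name__=~"job:.*"}', '{__name__=~"node.*"}']
    print(federate_url("192.168.77.11:9090", selectors))
```

Fetching the printed URL with a tool such as curl against a regional server shows exactly which series the selectors expose, which is a quick way to check them before adding the job.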
4. Service discovery
    [root@VM_0_14_centos prometheus]# cat prometheus.yml
    # my global config
    global:
      scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).

    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - 172.18.0.1:9093

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"
      - "alert_rules/rules.yml"

    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.
        static_configs:
          - targets: ['localhost:9090']
            labels:
              idc: bj

      - job_name: 'harbor_server'
        file_sd_configs:   # file-based service discovery
          - files:
              - /opt/prometheus/file_sd_configs/harbor_monitor.json
            refresh_interval: 10s

      - job_name: 'container'
        static_configs:
          - targets: ['172.18.0.1:8080']

Target file:

    [root@VM_0_14_centos prometheus]# cat /opt/prometheus/file_sd_configs/harbor_monitor.json
    [
      {
        "targets": ["172.19.0.14:9100", "124.156.173.164:9100"]
      }
    ]
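With file-based discovery, targets are changed by rewriting the JSON file; Prometheus re-reads it when the file changes and at every refresh_interval. If the file is generated by a script, writing it in place risks Prometheus reading a half-written file. One common precaution, sketched below in Python, is to write a temporary file in the same directory and rename it over the target. The function name `write_targets` is hypothetical, and the file mode 0644 is an assumption about how the Prometheus process reads the file.

```python
import json
import os
import tempfile


def write_targets(path, targets, labels=None):
    """Atomically write a file_sd target file holding a single target group."""
    group = {"targets": list(targets)}
    if labels:
        group["labels"] = dict(labels)
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the destination so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump([group], handle, indent=2)
        os.chmod(tmp_path, 0o644)    # assumption: Prometheus needs read access
        os.replace(tmp_path, path)   # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    # Demo in a throwaway directory; a real run would point at the path
    # listed under file_sd_configs.
    with tempfile.TemporaryDirectory() as demo_dir:
        demo_path = os.path.join(demo_dir, "harbor_monitor.json")
        write_targets(demo_path, ["172.19.0.14:9100", "124.156.173.164:9100"])
        with open(demo_path) as handle:
            print(handle.read())
```

The optional `labels` argument maps to the "labels" key of a target group, which attaches the same labels to every target in that group.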