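Prometheus (2): Monitoring Kubernetes
Prometheus service discovery
- 1. Automatic service discovery based on Service annotations, using the endpoints role.
- 2. Automatic service discovery based on Pod annotations
- 3. Automatic service discovery based on Consul registration
- 4. Manually configured (static) targets
- 5. Metrics pushed manually through Pushgateway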
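Monitoring Kubernetes with Prometheus
In Kubernetes, all resources can be grouped into a few categories:
- Infrastructure layer (Node): the cluster nodes, which provide runtime resources for the whole cluster and its applications
- Container infrastructure (Container): provides the runtime environment for applications
- User applications (Pod): a Pod contains a group of containers that work together and expose one function (or a set of functions)
- Internal service load balancing (Service): inside the cluster, a Service exposes an application's functionality and load-balances traffic between applications
- External entry point (Ingress): an Ingress provides access from outside the cluster, so that external clients can reach services deployed inside the Kubernetes cluster
A complete monitoring system should therefore cover the following five aspects:
- Cluster node status: obtain the basic running state of each node from its kubelet service;
- Cluster node resource usage: deploy Node Exporter on every node as a DaemonSet to collect the node's resource usage;
- Containers running on the nodes: obtain the running state and resource usage of all containers on each node from the cAdvisor built into the kubelet;
- Applications: if an application deployed in the cluster has built-in Prometheus support, find its Pod instances and scrape their internal metrics directly;
- Kubernetes components themselves: apiserver, scheduler, controller-manager, kubelet, kube-proxy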
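1. What is node-exporter?
node-exporter collects monitoring metrics from machines (physical machines, virtual machines, cloud hosts, and so on). The metrics it can collect include CPU, memory, disk, network, and open file counts.
Installing node-exporter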
[root@xianchaomaster1 ~]# kubectl create ns monitor-sa
Upload the node-exporter.tar.gz image archive to every node of the cluster and load it manually:
[root@xianchaomaster1 ~]# docker load -i node-exporter.tar.gz
[root@xianchaonode1 ~]# docker load -i node-exporter.tar.gz
Better still, pull the image locally and push it to a private registry:
[root@node-1-172 tomcat]# docker tag prom/node-exporter:v0.16.0 172.17.166.217/kubenetes/node-exporter:v0.16.0
[root@node-1-172 tomcat]# docker push 172.17.166.217/kubenetes/node-exporter:v0.16.0
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitor-sa
  labels:
    name: node-exporter
spec:
  selector:
    matchLabels:
      name: node-exporter
  template:
    metadata:
      labels:
        name: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      # With hostNetwork, hostIPC and hostPID all set to true, every container in this Pod
      # uses the host's network directly, shares the host's IPC namespace, and can see
      # all processes running on the host.
      # Because of hostNetwork: true, port 9100 is opened directly on the host,
      # so no Service is needed to reach node-exporter.
      containers:
      - name: node-exporter
        image: 172.17.166.217/kubenetes/node-exporter:v0.16.0
        ports:
        - containerPort: 9100
        resources:
          requests:
            cpu: 0.15
        securityContext:
          privileged: true   # run in privileged mode
        args:
        - --path.procfs
        - /host/proc
        - --path.sysfs
        - /host/sys
        - --collector.filesystem.ignored-mount-points
        - '^/(sys|proc|dev|host|etc)($|/)'   # regex of filesystem mount points to skip
        volumeMounts:
        - name: dev
          mountPath: /host/dev
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
        - name: rootfs
          mountPath: /rootfs
      tolerations:
      - key: "node-role.kubernetes.io/master"   # tolerate the master taint so master nodes are covered too
        operator: "Exists"
        effect: "NoSchedule"
      # Mount the host's /dev, /proc and /sys into the container, because much of the
      # node data is read from files under these directories.
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: dev
        hostPath:
          path: /dev
      - name: sys
        hostPath:
          path: /sys
      - name: rootfs
        hostPath:
          path: /
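How node-exporter works: the host's resource directories are shared with the container, which reads files under them (cpuinfo, for example) to obtain system information.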
# Apply the node-exporter manifest with kubectl apply
[root@xianchaomaster1]# kubectl apply -f node-export.yaml
# Check whether node-exporter was deployed successfully
[root@xianchaomaster1]# kubectl get pods -n monitor-sa
Output like the following, with every pod in the Running state, means the deployment succeeded:
NAME                  READY   STATUS    RESTARTS   AGE
node-exporter-9qpkd   1/1     Running   0          89s
node-exporter-zqmnk   1/1     Running   0          89s
Collecting data through node-exporter
curl http://<host-ip>:9100/metrics
# node-exporter listens on port 9100 by default; this returns all metrics collected for the host
curl http://192.168.40.180:9100/metrics | grep node_cpu_seconds
This shows the CPU usage of host 192.168.40.180:
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 72963.37
node_cpu_seconds_total{cpu="0",mode="iowait"} 9.35
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 0
node_cpu_seconds_total{cpu="0",mode="softirq"} 151.4
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 656.12
node_cpu_seconds_total{cpu="0",mode="user"} 267.1
# HELP: explains what the metric means; here, the time the node's CPUs spent in each mode, in seconds
# TYPE: states the metric's data type; here it is counter
node_cpu_seconds_total{cpu="0",mode="idle"}: the total time CPU 0 has spent in idle mode. CPU time only ever increases, and the TYPE line confirms that this metric is a counter.
counter: a metric whose value only increases
curl http://192.168.40.180:9100/metrics | grep node_load
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.1
node_load1 reflects the host's load over the last minute. System load changes with resource usage, so node_load1 describes the current state and its value may go up or down. The TYPE line shows that this metric is a gauge.
gauge: a metric whose value can both increase and decrease
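The difference between the two metric types shows up as soon as the output is processed. The following is a minimal Python sketch, not part of node-exporter or Prometheus: the `parse` and `per_second_rate` helpers are hypothetical names, the parser handles only the simple lines shown above (no escaped quotes, no timestamps), and the sample text is a shortened copy of the curl output.

```python
import re

# Shortened copy of the node-exporter output shown above.
SAMPLE = """\
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 72963.37
node_cpu_seconds_total{cpu="0",mode="system"} 656.12
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.1
"""

# metric_name{optional labels} value
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse(text):
    """Return ({metric: type}, [(metric, labels, value), ...])."""
    types, samples = {}, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# TYPE"):
            _, _, name, kind = line.split(None, 3)
            types[name] = kind
        elif line and not line.startswith("#"):
            match = LINE.match(line)
            if match:
                name, labels, value = match.groups()
                samples.append((name, dict(LABEL.findall(labels or "")), float(value)))
    return types, samples


def per_second_rate(earlier, later, seconds):
    """A counter is only meaningful as a rate of change between two scrapes."""
    return (later - earlier) / seconds


types, samples = parse(SAMPLE)
for name, labels, value in samples:
    print(types[name], name, labels, value)
```

A gauge such as node_load1 can be read directly, while a counter such as node_cpu_seconds_total has to be compared across two scrapes, for example with `per_second_rate(idle_before, idle_after, 60)`.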
Installing and configuring the Prometheus server
10.1 Create a ServiceAccount and grant it RBAC permissions
Create a ServiceAccount named monitor:
kubectl create serviceaccount monitor -n monitor-sa
# Bind the monitor ServiceAccount to a ClusterRole through a ClusterRoleBinding
kubectl create clusterrolebinding monitor-clusterrolebinding -n monitor-sa --clusterrole=cluster-admin --serviceaccount=monitor-sa:monitor
10.2 Create the Prometheus data directory
# Create the data directory on node xianchaonode1 of the cluster
mkdir /data
chmod 777 /data/
Create a ConfigMap volume to hold the Prometheus configuration
---
kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-config
  namespace: monitor-sa
data:
  prometheus.yml: |
    global:                      # global settings
      scrape_interval: 15s       # how often targets are scraped
      scrape_timeout: 10s        # timeout for a single scrape
      evaluation_interval: 1m    # how often alerting rules are evaluated
    scrape_configs:              # the data sources to scrape
    - job_name: 'kubernetes-node'            # job name
      kubernetes_sd_configs:                 # Kubernetes service discovery
      - role: node                           # the node role discovers each node at the kubelet's default port
      relabel_configs:                       # rewrite labels before scraping
      - source_labels: [__address__]         # take the discovered address as input
        regex: '(.*):10250'                  # capture the IP of an address ending in :10250
        replacement: '${1}:9100'             # keep the IP and replace port 10250 with 9100
        target_label: __address__            # write ip:9100 back as the scrape address
        action: replace                      # the replace action
      - action: labelmap                     # copy every label whose name matches the regex
        regex: __meta_kubernetes_node_label_(.+)   # node labels, kept under the captured name
    - job_name: 'kubernetes-node-cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https                          # protocol
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt   # CA certificate
        #key_file: /etc/kubernetes/ssl/ca-key.pem
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token   # ServiceAccount token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)      # keep the node labels
      - target_label: __address__
        replacement: kubernetes.default.svc:443       # replace the discovered address with the API server address
      - source_labels: [__meta_kubernetes_node_name]  # input: the node name
        regex: (.+)                                   # match any non-empty value
        target_label: __metrics_path__                # output: the metrics path
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor   # path with the node name substituted
    - job_name: 'kubernetes-apiserver'
      kubernetes_sd_configs:
      - role: endpoints                      # discover targets from Endpoints objects
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        #key_file: /etc/kubernetes/ssl/ca-key.pem
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep                         # keep only matching targets
        regex: default;kubernetes;https      # namespace;service name;port name
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep                         # keep only Services annotated for scraping
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)                      # an annotation value of http or https is written to __scheme__
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
kubectl apply -f prometheus-cfg.yaml
kubectl get configmap -n monitor-sa
Installing Prometheus
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitor-sa
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: server
    #matchExpressions:
    #- {key: app, operator: In, values: [prometheus]}
    #- {key: component, operator: In, values: [server]}
  template:
    metadata:
      labels:
        app: prometheus
        component: server
      annotations:
        prometheus.io/scrape: 'false'   # annotation telling Prometheus not to scrape this Pod
    spec:
      #nodeName: node1                  # pins the Pod to a specific node
      serviceAccountName: monitor
      containers:
      - name: prometheus
        image: 172.17.166.217/kubenetes/prometheus:v2.2.1
        imagePullPolicy: IfNotPresent   # use the local image if present, otherwise pull it
        command:
        - prometheus
        - --config.file=/etc/prometheus/prometheus.yml   # configuration file, projected from the ConfigMap
        - --storage.tsdb.path=/prometheus                # data directory
        - --storage.tsdb.retention=720h                  # how long data is kept before deletion
        - --web.enable-lifecycle                         # enable hot reload
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/prometheus/prometheus.yml
          name: prometheus-config
          subPath: prometheus.yml
        - mountPath: /prometheus/
          name: prometheus-storage-volume
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
          items:
          - key: prometheus.yml
            path: prometheus.yml
            mode: 0644
      - name: prometheus-storage-volume
        hostPath:
          path: /data
          type: Directory
kubectl apply -f prometheus-deploy.yaml
kubectl get pods -n monitor-sa
Create the Prometheus Service (to provide access)
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitor-sa
spec:
  ports:
  - port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: prometheus
    component: server
  type: ClusterIP
---
# Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus
  namespace: monitor-sa
spec:
  rules:
  - host: csk8s.mingcloud.net
    http:
      paths:
      - pathType: Prefix
        path: /
        backend:
          service:
            name: prometheus
            port:
              number: 9090
kubectl get svc -n monitor-sa
The Prometheus configuration file in detail
Rewriting labels with relabel_configs
job_name: kubernetes-node
kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-config
  namespace: monitor-sa
data:
  prometheus.yml: |
    rule_files:
    - /etc/prometheus/rules.yml
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["localhost:9093"]
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 1m
    scrape_configs:
    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: 'kubernetes-node-cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-apiserver'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
    - job_name: 'kubernetes-schedule'
      scrape_interval: 5s
      static_configs:
      - targets: ['172.17.166.217:10251','172.17.166.218:10251','172.17.166.219:10251']
    - job_name: 'kubernetes-controller-manager'
      scrape_interval: 5s
      static_configs:
      - targets: ['172.17.166.217:10252','172.17.166.218:10252','172.17.166.219:10252']
    - job_name: 'kubernetes-kube-proxy'
      scrape_interval: 5s
      static_configs:
      - targets: ['172.17.166.219:10249','172.17.27.255:10249','172.17.27.248:10249','172.17.4.79:10249']
    - job_name: 'pushgateway'
      scrape_interval: 5s
      static_configs:
      - targets: ['172.17.166.217:9091']
      honor_labels: true
    - job_name: 'kubernetes-etcd'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/ca.pem
        cert_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/kubernetes.pem
        key_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/kubernetes-key.pem
      scrape_interval: 5s
      static_configs:
      - targets: ['172.17.166.219:2379','172.17.4.79:2379','172.17.27.255:2379','172.17.27.248:2379']
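# scrape_configs: defines the data sources to scrape, called targets. Targets are grouped into jobs, each named by job_name, and are defined either statically or through service discovery.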
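- job_name: 'kubernetes-node'
  kubernetes_sd_configs:             # use Kubernetes service discovery
  - role: node                       # the node role discovers every node in the cluster at the kubelet's default port
  relabel_configs:                   # relabeling
  - source_labels: [__address__]     # the original label to read: the discovered address
    regex: '(.*):10250'              # match addresses ending in port 10250 and capture the IP
    replacement: '${1}:9100'         # ${1} is the captured IP; the new address is ip:9100
    target_label: __address__        # write the new address back to __address__
    action: replace
  - action: labelmap                 # copy labels whose names match the regex below; without this rule only the instance label would be shown
    regex: __meta_kubernetes_node_label_(.+)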
Note: "Before relabeling" lists all labels discovered for a target before the relabel rules are applied. For instance="xianchaomaster1":
Before relabeling:
__address__="192.168.40.180:10250"
__meta_kubernetes_node_address_Hostname="xianchaomaster1"
__meta_kubernetes_node_address_InternalIP="192.168.40.180"
__meta_kubernetes_node_annotation_kubeadm_alpha_kubernetes_io_cri_socket="/var/run/dockershim.sock"
__meta_kubernetes_node_annotation_node_alpha_kubernetes_io_ttl="0"
__meta_kubernetes_node_annotation_projectcalico_org_IPv4Address="192.168.40.180/24"
__meta_kubernetes_node_annotation_projectcalico_org_IPv4IPIPTunnelAddr="10.244.123.64"
__meta_kubernetes_node_annotation_volumes_kubernetes_io_controller_managed_attach_detach="true"
__meta_kubernetes_node_label_beta_kubernetes_io_arch="amd64"
__meta_kubernetes_node_label_beta_kubernetes_io_os="linux"
__meta_kubernetes_node_label_kubernetes_io_arch="amd64"
__meta_kubernetes_node_label_kubernetes_io_hostname="xianchaomaster1"
__meta_kubernetes_node_label_kubernetes_io_os="linux"
__meta_kubernetes_node_label_node_role_kubernetes_io_control_plane=""
__meta_kubernetes_node_label_node_role_kubernetes_io_master=""
__meta_kubernetes_node_name="xianchaomaster1"
__metrics_path__="/metrics"
__scheme__="http"
instance="xianchaomaster1"
job="kubernetes-node"
The node role's default target address is <node-ip>:10250, the kubelet port. Because node-exporter listens on port 9100, the first rule splits the original address and reassembles it with the new port. The labelmap rule then keeps the node's Kubernetes labels, which are discovered with the __meta_kubernetes_node_label_ prefix.
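The two rules can be traced by hand with a small simulation. This is a minimal Python sketch of the replace and labelmap actions, not Prometheus's own implementation: the function names are hypothetical, and the only behaviors assumed are the ones Prometheus documents (source label values are joined with ";", and the regex must match the whole value).

```python
import re


def to_python_template(replacement):
    # Prometheus writes capture groups as $1 or ${1}; Python's re module wants \g<1>.
    return re.sub(r"\$\{?(\w+)\}?", lambda m: "\\g<" + m.group(1) + ">", replacement)


def relabel_replace(labels, source_labels, regex, replacement, target_label, separator=";"):
    """Simulate action: replace. The regex is anchored at both ends, hence fullmatch."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)
    result = dict(labels)
    if match:
        result[target_label] = match.expand(to_python_template(replacement))
    return result


def relabel_labelmap(labels, regex, replacement="$1"):
    """Simulate action: labelmap. Matching label names are copied under a new name."""
    result = dict(labels)
    for name, value in labels.items():
        match = re.fullmatch(regex, name)
        if match:
            result[match.expand(to_python_template(replacement))] = value
    return result


discovered = {
    "__address__": "192.168.40.180:10250",
    "__meta_kubernetes_node_label_kubernetes_io_hostname": "xianchaomaster1",
    "__meta_kubernetes_node_label_kubernetes_io_os": "linux",
}

step1 = relabel_replace(discovered, ["__address__"], r"(.*):10250", "${1}:9100", "__address__")
step2 = relabel_labelmap(step1, r"__meta_kubernetes_node_label_(.+)")
print(step2["__address__"])
print(step2["kubernetes_io_hostname"], step2["kubernetes_io_os"])
```

After relabeling, Prometheus drops every label that still starts with a double underscore, which is why only the copied names such as kubernetes_io_hostname remain visible on the target.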
job_name: kubernetes-node-cadvisor
- job_name: 'kubernetes-node-cadvisor'
  # Scrape cAdvisor data: container resource usage served by the kubelet's /metrics/cadvisor endpoint
  kubernetes_sd_configs:
  - role: node
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - action: labelmap                            # keep the labels that match
    regex: __meta_kubernetes_node_label_(.+)    # labels carrying the __meta_kubernetes_node_label_ prefix
  - target_label: __address__                   # the discovered address, e.g. __address__="192.168.40.180:10250"
    replacement: kubernetes.default.svc:443     # is replaced with kubernetes.default.svc:443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)                                 # capture the value of __meta_kubernetes_node_name
    target_label: __metrics_path__              # and write the result to __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    # ${1} is the captured node name, so the path becomes
    # /api/v1/nodes/xianchaomaster1/proxy/metrics/cadvisor and the full URL is
    # https://kubernetes.default.svc:443/api/v1/nodes/xianchaomaster1/proxy/metrics/cadvisor
cAdvisor collects container resource metrics and is built into the kubelet. The relabel rules make Prometheus reach it through the API server: the target address becomes kubernetes.default.svc:443, which resolves to the API server's ClusterIP (*.*.0.1), and the path becomes /api/v1/nodes/<node-name>/proxy/metrics/cadvisor, so that the API server proxies each request to the corresponding node.
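To make the resulting scrape URL concrete, here is a minimal Python sketch that applies the same three steps to a label set. It is an illustration only: `apply_cadvisor_relabeling` and `scrape_url` are hypothetical helper names, and the input labels are taken from the "Before relabeling" example for xianchaomaster1, with the scheme set to https as in the job.

```python
import re


def apply_cadvisor_relabeling(labels):
    """Mimic the address and path rewrites of the kubernetes-node-cadvisor job."""
    result = dict(labels)
    # Rule: replace the discovered address with the in-cluster API server address.
    result["__address__"] = "kubernetes.default.svc:443"
    # Rule: build the proxy path from the node name (regex (.+) must match the whole value).
    match = re.fullmatch(r"(.+)", labels.get("__meta_kubernetes_node_name", ""))
    if match:
        result["__metrics_path__"] = f"/api/v1/nodes/{match.group(1)}/proxy/metrics/cadvisor"
    return result


def scrape_url(labels):
    """Prometheus assembles the scrape URL from these three special labels."""
    return f'{labels["__scheme__"]}://{labels["__address__"]}{labels["__metrics_path__"]}'


discovered = {
    "__scheme__": "https",
    "__address__": "192.168.40.180:10250",
    "__metrics_path__": "/metrics",
    "__meta_kubernetes_node_name": "xianchaomaster1",
}
print(scrape_url(apply_cadvisor_relabeling(discovered)))
```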
job_name: kubernetes-apiserver
- job_name: 'kubernetes-apiserver'
  kubernetes_sd_configs:
  - role: endpoints   # use Kubernetes endpoints discovery to scrape the apiserver on port 6443
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels:
    - __meta_kubernetes_namespace            # namespace of the Endpoints object
    - __meta_kubernetes_service_name         # name of the Service behind the Endpoints object
    - __meta_kubernetes_endpoint_port_name   # name of the endpoint port
    action: keep                             # scrape only targets that match; drop all others
    regex: default;kubernetes;https
# The regex keeps only endpoints in the default namespace whose Service is named kubernetes and whose port is named https
The endpoints role discovers targets from Endpoints objects; for the API server these are its IP addresses with port 6443.
Keeping only the entries that match regex: default;kubernetes;https leaves exactly the API server's IPs and ports.
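The keep action can be pictured as a filter over the discovered targets. The sketch below is a hypothetical Python simulation, not Prometheus code; the three sample targets are invented for illustration, and the only assumed behavior is that the source label values are joined with ";" and matched in full against the regex.

```python
import re


def keep(targets, source_labels, regex, separator=";"):
    """Simulate action: keep. Targets whose joined label values do not match are dropped."""
    kept = []
    for labels in targets:
        value = separator.join(labels.get(name, "") for name in source_labels)
        if re.fullmatch(regex, value):
            kept.append(labels)
    return kept


# Invented sample targets, as the endpoints role might discover them.
discovered = [
    {"__address__": "192.168.40.180:6443",
     "__meta_kubernetes_namespace": "default",
     "__meta_kubernetes_service_name": "kubernetes",
     "__meta_kubernetes_endpoint_port_name": "https"},
    {"__address__": "10.244.0.5:9153",
     "__meta_kubernetes_namespace": "kube-system",
     "__meta_kubernetes_service_name": "kube-dns",
     "__meta_kubernetes_endpoint_port_name": "metrics"},
    {"__address__": "10.244.1.7:8080",
     "__meta_kubernetes_namespace": "default",
     "__meta_kubernetes_service_name": "my-app",
     "__meta_kubernetes_endpoint_port_name": "https"},
]

SOURCE = ["__meta_kubernetes_namespace",
          "__meta_kubernetes_service_name",
          "__meta_kubernetes_endpoint_port_name"]

for target in keep(discovered, SOURCE, r"default;kubernetes;https"):
    print(target["__address__"])
```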
job_name: kubernetes-service-endpoints
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # Scrape only endpoints whose Service carries the annotation "prometheus.io/scrape: true".
  # An annotation is a key/value pair, so the source label names the key and regex gives the
  # value true. When the value matches the regex, the keep action retains the target;
  # everything else is dropped.
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  # Set the scheme. The source label corresponds to the prometheus.io/scheme annotation;
  # if its value matches the regex (http or https), that value is written to __scheme__.
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  # Custom metrics path. If the application does not expose its metrics at /metrics, annotate
  # the Service of that Pod with "prometheus.io/path = /mymetrics". This rule assigns the
  # declared path to __metrics_path__, so Prometheus scrapes the path the application really
  # exposes. The annotation name must agree with the source label used here: if the Service
  # declares prometheus.io/app-metrics-path: '/metrics', the source label must be written as
  # __meta_kubernetes_service_annotation_prometheus_io_app_metrics_path.
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  # Custom port. The address is joined with the "prometheus.io/port = <port>" annotation of
  # the Service and written back to __address__, so Prometheus scrapes the application on the
  # declared port, combined with __metrics_path__. If the path is not the default /metrics,
  # the previous rule supplies the real one.
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap   # keep the labels matched below
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace    # copy __meta_kubernetes_namespace into kubernetes_namespace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
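Targets are scraped through their Endpoints, which means a Service must carry the corresponding annotations when it is created. The annotations drive the address rewriting above and make the service discoverable automatically.
annotations:
  prometheus.io/scrape: 'true'
  prometheus.io/port: '9121'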
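The address rule is the least obvious part, so here is a minimal Python sketch of what its regex does with a port annotation such as '9121'. The `rewrite_address` helper is a hypothetical name and the pod IPs are invented; the regex and the "$1:$2" result are the ones from the job above.

```python
import re

# Same pattern as the relabel rule: host, optional ":port", then ";" and the annotated port.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address, port_annotation):
    """Join __address__ and the prometheus.io/port value with ';' and apply $1:$2."""
    joined = f"{address};{port_annotation}"
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        # No usable annotation: the replace action leaves __address__ untouched.
        return address
    return f"{match.group(1)}:{match.group(2)}"


print(rewrite_address("10.244.1.5:6379", "9121"))  # discovered port swapped for the annotated one
print(rewrite_address("10.244.1.5", "9121"))       # address without a port gets one
print(rewrite_address("10.244.1.5:6379", ""))      # annotation missing, address kept
```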
job_name: kubernetes-pods
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep       # scrape only Pods annotated with prometheus.io/scrape: 'true'
    regex: true
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_scrape
  - action: replace    # take the metrics path from the prometheus.io/path annotation
    regex: (.+)
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_path
    target_label: __metrics_path__
  - action: replace    # join the address with the port from the prometheus.io/port annotation, e.g. '9121'
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    source_labels:
    - __address__
    - __meta_kubernetes_pod_annotation_prometheus_io_port
    target_label: __address__
  - action: labelmap   # keep the Pod's labels
    regex: __meta_kubernetes_pod_label_(.+)
  - action: replace    # expose the namespace as kubernetes_namespace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: kubernetes_namespace
  - action: replace    # expose the Pod name as kubernetes_pod_name
    source_labels:
    - __meta_kubernetes_pod_name
    target_label: kubernetes_pod_name
The principle is the same as for Service-based discovery: the pod role discovers Pods, and they are scraped dynamically according to their annotations.
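Both annotation-based jobs depend on how an annotation key turns into a label name: Prometheus replaces every character that is not valid in a label name with an underscore. The following Python sketch of that mapping uses a hypothetical helper name, `meta_label`, and is meant only to help derive the source label for a custom annotation.

```python
import re


def meta_label(role, annotation):
    """Derive the discovery label name for an annotation on a pod or service."""
    # Characters outside [a-zA-Z0-9_] (such as '.', '/' and '-') become underscores.
    sanitized = re.sub(r"[^a-zA-Z0-9_]", "_", annotation)
    return f"__meta_kubernetes_{role}_annotation_{sanitized}"


print(meta_label("pod", "prometheus.io/scrape"))
print(meta_label("service", "prometheus.io/app-metrics-path"))
```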
Static service discovery
# 10251 and 10252 are the legacy insecure ports of the scheduler and controller-manager;
# recent Kubernetes releases serve metrics only on the secure ports 10259 and 10257 over HTTPS.
- job_name: 'kubernetes-schedule'
  scrape_interval: 5s
  static_configs:
  - targets: ['172.17.166.217:10251','172.17.166.218:10251','172.17.166.219:10251']
- job_name: 'kubernetes-controller-manager'
  scrape_interval: 5s
  static_configs:
  - targets: ['172.17.166.217:10252','172.17.166.218:10252','172.17.166.219:10252']
- job_name: 'kubernetes-kube-proxy'
  scrape_interval: 5s
  static_configs:
  - targets: ['172.17.166.219:10249','172.17.27.255:10249','172.17.27.248:10249','172.17.4.79:10249']
- job_name: 'pushgateway'
  scrape_interval: 5s
  static_configs:
  - targets: ['172.17.166.217:9091']
  honor_labels: true
- job_name: 'kubernetes-etcd'
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/ca.pem
    cert_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/kubernetes.pem
    key_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/kubernetes-key.pem
  scrape_interval: 5s
  static_configs:
  - targets: ['172.17.166.219:2379','172.17.4.79:2379','172.17.27.255:2379','172.17.27.248:2379']
Prometheus hot reload
# Hot reload makes a changed configuration file take effect without stopping Prometheus.
# First look up the IP address of the Prometheus pod:
[root@xianchaomaster1 prometheus]# kubectl get pods -n monitor-sa -o wide -l app=prometheus
# 10.244.121.4 is the Prometheus pod IP in this example. Trigger the reload with:
[root@xianchaomaster1]# curl -X POST http://10.244.121.4:9090/-/reload
# Hot reload can be slow. Alternatively, Prometheus can be restarted by force.
# After editing prometheus-cfg.yaml, delete the resources:
kubectl delete -f prometheus-cfg.yaml
kubectl delete -f prometheus-deploy.yaml
# Then recreate them with apply:
kubectl apply -f prometheus-cfg.yaml
kubectl apply -f prometheus-deploy.yaml
Note: in production, prefer hot reload; deleting and recreating the resources may lose monitoring data.
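The same reload request can be sent from a script. This is a minimal Python sketch using only the standard library; the helper names are hypothetical, the pod IP is the example value from above, and the reload endpoint only responds when Prometheus was started with --web.enable-lifecycle, as in the Deployment shown earlier.

```python
import urllib.request


def build_reload_request(host, port=9090):
    """Build the POST request for the /-/reload lifecycle endpoint."""
    url = f"http://{host}:{port}/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(host, port=9090, timeout=10):
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(host, port), timeout=timeout) as response:
        return response.status


request = build_reload_request("10.244.121.4")
print(request.get_method(), request.full_url)
```

Calling `reload_prometheus("10.244.121.4")` from a machine that can reach the pod network performs the actual reload; an HTTP error is raised if the lifecycle API is disabled or the new configuration is rejected.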
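Installing the kube-state-metrics component
What is kube-state-metrics?
kube-state-metrics watches the API Server and generates metrics about the state of resource objects such as Deployments, Nodes, and Pods. Note that kube-state-metrics only exposes these metrics and does not store them, so Prometheus is used to scrape and store the data. Its focus is business-related metadata such as Deployment, Pod, and replica status: how many replicas were scheduled? How many are available now? How many Pods are in the running/stopped/terminated state? How many times has a Pod restarted? How many jobs are currently running?
Installing the kube-state-metrics component
(1) Create a ServiceAccount and grant it permissions
On the Kubernetes control-plane node, create a file named kube-state-metrics-rbac.yaml
and apply the manifest with kubectl apply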
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources: ["nodes", "pods", "services", "resourcequotas", "replicationcontrollers", "limitranges", "persistentvolumeclaims", "persistentvolumes", "namespaces", "endpoints"]
  verbs: ["list", "watch"]
# Workload resources are served from the apps group on current clusters;
# extensions is kept for older ones.
- apiGroups: ["extensions", "apps"]
  resources: ["daemonsets", "deployments", "replicasets"]
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources: ["cronjobs", "jobs"]
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources: ["horizontalpodautoscalers"]
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
(2) Install the kube-state-metrics component
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: 172.17.166.217/kubenetes/kube-state-metrics:v1.9.0
        ports:
        - containerPort: 8080
(3) Create the Service
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  name: kube-state-metrics
  namespace: kube-system
  labels:
    app: kube-state-metrics
spec:
  ports:
  - name: kube-state-metrics
    port: 8080
    protocol: TCP
  selector:
    app: kube-state-metrics
The data is scraped because of the annotation prometheus.io/scrape: 'true', which the kubernetes-service-endpoints job matches.