Lao Yang's k8s Prometheus Monitoring - Component Monitoring, Part 2
A Comprehensive Prometheus-Based Monitoring Platform -- Kubernetes Cluster-Level Monitoring
1. Introduction to KubeStateMetrics
kube-state-metrics is a Kubernetes component that queries the Kubernetes API server, collects state information about the various resources in the cluster (nodes, pods, services, and so on), and converts it into metrics that Prometheus can consume.
Main capabilities of kube-state-metrics:
- Node state information, such as node CPU and memory capacity, node conditions, and node labels.
- Pod state information, such as pod phase, container state, container image, and pod labels and annotations.
- Controller state information for Deployments, DaemonSets, StatefulSets, and ReplicaSets, such as replica counts, replica status, and creation time.
- Service state information, such as the service type, IP, and ports.
- Volume state information, such as volume type and capacity.
- Kubernetes API server state information, such as server status, request counts, and response times.
With kube-state-metrics it is easy to monitor a Kubernetes cluster, spot problems, and raise alerts early.
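kube-state-metrics exposes these states as plain-text Prometheus exposition-format samples, one metric per line. A minimal sketch of what such a line looks like and how it splits into name, labels, and value (the sample line below is a made-up example, not captured from a real cluster):

```python
import re

# One sample in Prometheus exposition format (hypothetical example values)
line = 'kube_pod_status_phase{namespace="default",pod="web-0",phase="Running"} 1'

def parse_sample(sample):
    """Split a simple exposition-format line into (name, labels, value).
    Handles the common case only: no escaping inside label values."""
    m = re.fullmatch(r'(\w+)\{(.*)\}\s+(\S+)', sample)
    name, raw_labels, value = m.group(1), m.group(2), float(m.group(3))
    labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels))
    return name, labels, value

name, labels, value = parse_sample(line)
print(name, labels["phase"], value)  # kube_pod_status_phase Running 1.0
```

Real scrapes return thousands of such lines; Prometheus does this parsing internally, so the sketch is only to show the data shape.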
2. Deploying KubeStateMetrics
The manifest below contains ServiceAccount, ClusterRole/Role, ClusterRoleBinding/RoleBinding, Deployment, ConfigMap, and Service objects:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: monitor
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
  - apiGroups: [""]
    resources:
      - configmaps
      - secrets
      - nodes
      - pods
      - services
      - resourcequotas
      - replicationcontrollers
      - limitranges
      - persistentvolumeclaims
      - persistentvolumes
      - namespaces
      - endpoints
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources:
      - statefulsets
      - daemonsets
      - deployments
      - replicasets
    verbs: ["list", "watch"]
  - apiGroups: ["batch"]
    resources:
      - cronjobs
      - jobs
    verbs: ["list", "watch"]
  - apiGroups: ["autoscaling"]
    resources:
      - horizontalpodautoscalers
    verbs: ["list", "watch"]
  - apiGroups: ["networking.k8s.io", "extensions"]
    resources:
      - ingresses
    verbs: ["list", "watch"]
  - apiGroups: ["storage.k8s.io"]
    resources:
      - storageclasses
    verbs: ["list", "watch"]
  - apiGroups: ["certificates.k8s.io"]
    resources:
      - certificatesigningrequests
    verbs: ["list", "watch"]
  - apiGroups: ["policy"]
    resources:
      - poddisruptionbudgets
    verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kube-state-metrics-resizer
  namespace: monitor
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
  - apiGroups: [""]
    resources:
      - pods
    verbs: ["get"]
  - apiGroups: ["extensions", "apps"]
    resources:
      - deployments
    resourceNames: ["kube-state-metrics"]
    verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
  - kind: ServiceAccount
    name: kube-state-metrics
    namespace: monitor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kube-state-metrics
  namespace: monitor
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kube-state-metrics-resizer
subjects:
  - kind: ServiceAccount
    name: kube-state-metrics
    namespace: monitor
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitor
  labels:
    k8s-app: kube-state-metrics
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v1.3.0
spec:
  selector:
    matchLabels:
      k8s-app: kube-state-metrics
      version: v1.3.0
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
        version: v1.3.0
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: kube-state-metrics
      containers:
        - name: kube-state-metrics
          image: k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.4.2
          ports:
            - name: http-metrics   # exposes the Kubernetes object metrics; scraped by Prometheus
              containerPort: 8080
            - name: telemetry      # exposes kube-state-metrics' own metrics; scraped by Prometheus
              containerPort: 8081
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            timeoutSeconds: 5
        - name: addon-resizer      # addon-resizer scales in-cluster monitoring components such as metrics-server and kube-state-metrics
          image: mirrorgooglecontainers/addon-resizer:1.8.6
          resources:
            limits:
              cpu: 200m
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 30Mi
          env:
            - name: MY_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: MY_POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
          command:
            - /pod_nanny
            - --config-dir=/etc/config
            - --container=kube-state-metrics
            - --cpu=100m
            - --extra-cpu=1m
            - --memory=100Mi
            - --extra-memory=2Mi
            - --threshold=5
            - --deployment=kube-state-metrics
      volumes:
        - name: config-volume
          configMap:
            name: kube-state-metrics-config
---
# ConfigMap for resource configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-state-metrics-config
  namespace: monitor
  labels:
    k8s-app: kube-state-metrics
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
---
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: monitor
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "kube-state-metrics"
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
    - name: http-metrics
      port: 8080
      targetPort: http-metrics
      protocol: TCP
    - name: telemetry
      port: 8081
      targetPort: telemetry
      protocol: TCP
  selector:
    k8s-app: kube-state-metrics
```
Verify the deployment:
```shell
kubectl get all -n monitor | grep kube-state-metrics
curl -kL $(kubectl get service -n monitor | grep kube-state-metrics | awk '{ print $3 }'):8080/metrics
```
2.1 Configuring Prometheus for kube-state-metrics
Write the Prometheus scrape config. Note that service discovery will by default find both the 8080 and 8081 ports, so we manually pin the target to 8080: port 8080 serves the cluster state metrics, while 8081 serves kube-state-metrics' own metrics.
```yaml
- job_name: kube-state-metrics
  kubernetes_sd_configs:
    - role: endpoints
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_name]
      regex: kube-state-metrics
      action: keep
    - source_labels: [__meta_kubernetes_pod_ip]
      regex: (.+)
      target_label: __address__
      replacement: ${1}:8080
    - source_labels: [__meta_kubernetes_endpoints_name]
      action: replace
      target_label: endpoint
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: pod
    - source_labels: [__meta_kubernetes_service_name]
      action: replace
      target_label: service
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: namespace
```
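To see what the `keep` and `replace` rules in this job actually do, here is a small Python sketch that emulates the two key relabel steps on a hypothetical discovered target (simplified: Prometheus anchors regexes to the full label value, which `re.fullmatch` mimics):

```python
import re

# Hypothetical labels for one discovered endpoints target
target = {
    "__address__": "10.244.1.23:8081",
    "__meta_kubernetes_service_name": "kube-state-metrics",
    "__meta_kubernetes_pod_ip": "10.244.1.23",
}

# Step 1: action 'keep' -- drop targets whose service name does not match
if not re.fullmatch("kube-state-metrics", target["__meta_kubernetes_service_name"]):
    target = None  # this target would be dropped by Prometheus

# Step 2: action 'replace' -- pin __address__ to the pod IP on port 8080
if target is not None:
    m = re.fullmatch("(.+)", target["__meta_kubernetes_pod_ip"])
    if m:
        target["__address__"] = m.expand(r"\1:8080")

print(target["__address__"])  # 10.244.1.23:8080
```

This is why the job ends up scraping only port 8080 even though discovery initially surfaces both 8080 and 8081.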
Self-monitoring (kube-state-metrics' own metrics on port 8081):
```yaml
- job_name: kube-state-metrics-self
  kubernetes_sd_configs:
    - role: endpoints
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_name]
      regex: kube-state-metrics
      action: keep
    - source_labels: [__meta_kubernetes_pod_ip]
      regex: (.+)
      target_label: __address__
      replacement: ${1}:8081
    - source_labels: [__meta_kubernetes_endpoints_name]
      action: replace
      target_label: endpoint
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: pod
    - source_labels: [__meta_kubernetes_service_name]
      action: replace
      target_label: service
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: namespace
```
2.2 Kubernetes Cluster Component Monitoring
Add the following scrape jobs to prometheus-config.yaml one at a time.
2.2.1 kube-apiserver
Port: 6443.
Note that when scraping over HTTPS you need TLS settings: either point ca_file at the CA certificate, or set insecure_skip_verify: true to skip certificate verification. You must also set bearer_token_file, otherwise the scrape fails with `server returned HTTP status 400 Bad Request`.
```yaml
- job_name: kube-apiserver
  kubernetes_sd_configs:
    - role: endpoints
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
    - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
      action: keep
      regex: default;kubernetes
    - source_labels: [__meta_kubernetes_endpoints_name]
      action: replace
      target_label: endpoint
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: pod
    - source_labels: [__meta_kubernetes_service_name]
      action: replace
      target_label: service
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: namespace
```
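The bearer_token_file setting simply makes Prometheus read the mounted ServiceAccount token and attach it as an HTTP Authorization header on every scrape request. A minimal illustrative sketch (the helper below is hypothetical, not part of Prometheus):

```python
def bearer_headers(token_path):
    """Build the Authorization header a scraper would send when
    configured with a bearer_token_file (illustrative sketch only)."""
    with open(token_path) as f:
        token = f.read().strip()
    return {"Authorization": f"Bearer {token}"}

# Example with a stand-in token file (real tokens are mounted at
# /var/run/secrets/kubernetes.io/serviceaccount/token inside the pod):
import tempfile, os
with tempfile.NamedTemporaryFile("w", suffix=".token", delete=False) as f:
    f.write("example-token\n")
    path = f.name
print(bearer_headers(path))  # {'Authorization': 'Bearer example-token'}
os.remove(path)
```

Without this header the apiserver rejects the anonymous request, which is the error mentioned above.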
Scrape config for a binary (non-kubeadm) deployment
The kubelet needs to be monitored first.
Two questions worth thinking through:
1. How do we filter out the apiserver nodes?
When monitoring CoreDNS, the following rule filters out the CoreDNS pods:
```yaml
- source_labels:
    - __meta_kubernetes_service_label_k8s_app
  regex: kube-dns
  action: keep
```
For node-role targets, we filter by node labels instead. Suppose the cluster has 10 machines, 3 masters and 7 workers: with the config below we would end up with 10 apiserver targets, which is not what we want.
```yaml
- job_name: kube-apiserver
  metrics_path: /metrics
  scheme: https
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
    - role: node
  relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - source_labels: [__address__]
      regex: '(.*):10250'
      replacement: '${1}:6443'
      target_label: __address__
      action: replace
```
We therefore label the master nodes and keep only targets that match that label. First check the current node labels:
```shell
[root@master-1 prometheus-k8s]# kubectl get nodes --show-labels
NAME       STATUS   ROLES    AGE   VERSION    LABELS
master-1   Ready    <none>   15d   v1.20.15   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master-1,kubernetes.io/os=linux
node-1     Ready    <none>   15d   v1.20.15   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node-1,kubernetes.io/os=linux
node-2     Ready    <none>   15d   v1.20.15   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node-2,kubernetes.io/os=linux
```
These show up in Prometheus service discovery as meta labels such as:
```
__meta_kubernetes_node_label_kubernetes_io_arch="amd64"
__meta_kubernetes_node_label_kubernetes_io_os="linux"
```
Label the master node:
```shell
kubectl label node master-1 kubernetes.io/master=true
```
Check the labels again:
```shell
[root@master-1 prometheus-k8s]# kubectl get nodes --show-labels
NAME       STATUS   ROLES    AGE   VERSION    LABELS
master-1   Ready    <none>   15d   v1.20.15   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master-1,kubernetes.io/master=true,kubernetes.io/os=linux
node-1     Ready    <none>   15d   v1.20.15   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node-1,kubernetes.io/os=linux
node-2     Ready    <none>   15d   v1.20.15   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node-2,kubernetes.io/os=linux
```
In the Prometheus Service Discovery page, the new label appears as:
```
__meta_kubernetes_node_labelpresent_kubernetes_io_master="true"
```
```yaml
- job_name: kube-apiserver
  metrics_path: /metrics
  scheme: https
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
    - role: node
  relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - source_labels: [__meta_kubernetes_node_labelpresent_kubernetes_io_master]
      regex: true
      action: keep
    - source_labels: [__address__]
      regex: '(.*):10250'
      replacement: '${1}:6443'
      target_label: __address__
      action: replace
```
2. Should we also add this block?
```yaml
- source_labels:
    - __meta_kubernetes_service_label_k8s_app  # source label
  regex: kube-dns                              # match keyword
  action: keep                                 # keep only instances whose Service label is kube-dns
```
Answer: no.
That rule filters on the Service label and then rewrites the pod labels: it takes the pod IP, writes it into __address__, and Prometheus scrapes that __address__ for metrics. For node-level metrics there is no Service to filter; we scrape the node IP and port directly, and those come straight from __address__.
Suppose a node is initially discovered at 192.168.1.1:10250. After relabeling:
- The initially discovered address label is 192.168.1.1:10250.
- The second relabel rule matches it with the regex '(.*):10250' and replaces it with ${1}:6443, producing 192.168.1.1:6443.
- The result is written back: __address__=192.168.1.1:6443.
Prometheus will then scrape https://192.168.1.1:6443/metrics for the kube-apiserver metrics.
In short: __address__ is the internal label Prometheus uses for the target address. Relabeling converts the discovered node address into the right form and stores it in __address__, and Prometheus scrapes whatever __address__ points at.
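The address rewrite walked through above is an anchored regex replacement, and can be sketched in a few lines of Python (illustrative only; Prometheus implements this internally with RE2 and full-string anchoring):

```python
import re

def relabel_address(address, regex=r"(.*):10250", replacement=r"\1:6443"):
    """Mimic a Prometheus relabel 'replace' action on __address__.
    Prometheus anchors the regex to the full label value, hence fullmatch."""
    m = re.fullmatch(regex, address)
    if m is None:
        return address  # no match: the label is left unchanged
    return m.expand(replacement)

print(relabel_address("192.168.1.1:10250"))  # 192.168.1.1:6443
print(relabel_address("192.168.1.1:9100"))   # unchanged: 192.168.1.1:9100
```

The same pattern (with a different replacement port) drives every "replace :10250 with :PORT" rule in the jobs below.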
2.2.2 kube-controller-manager
Port: 10257.
- Inspect the controller-manager pod (the name may differ in your cluster):
```shell
# kubectl describe pod -n kube-system kube-controller-manager-master1
Name:       kube-controller-manager-k8s-master
Namespace:  kube-system
……
Labels:     component=kube-controller-manager
            tier=control-plane
……
Containers:
  kube-controller-manager:
    ……
    Command:
      kube-controller-manager
      --allocate-node-cidrs=true
      --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
      --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
      --bind-address=127.0.0.1
      ……
```
As shown above, matching Pod objects with the label component=kube-controller-manager is enough. Note, however, that the controller-manager by default only allows access via 127.0.0.1, so its configuration must be changed first.
- Edit /etc/kubernetes/manifests/kube-controller-manager.yaml:
```shell
# cat /etc/kubernetes/manifests/kube-controller-manager.yaml
……
  - command:
    - --bind-address=0.0.0.0   # change the bind address to 0.0.0.0
    #- --port=0                # comment out the --port=0 line
……
```
- Write the Prometheus scrape config. Note that the discovered target defaults to port 80, so we manually set port 10252:
```yaml
- job_name: kube-controller-manager
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_component]
      regex: kube-controller-manager
      action: keep
    - source_labels: [__meta_kubernetes_pod_ip]
      regex: (.+)
      target_label: __address__
      replacement: ${1}:10252
    - source_labels: [__meta_kubernetes_endpoints_name]
      action: replace
      target_label: endpoint
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: pod
    - source_labels: [__meta_kubernetes_service_name]
      action: replace
      target_label: service
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: namespace
```
Scrape config for a binary deployment:
```yaml
- job_name: kube-controller-manager
  metrics_path: /metrics
  scheme: https
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token  # token mounted into the Prometheus pod
  kubernetes_sd_configs:       # service discovery
    - role: node               # discover node targets
  relabel_configs:
    - action: labelmap         # promote each node label (the (.+) capture) to a target label, keeping its value
      regex: __meta_kubernetes_node_label_(.+)
    - source_labels: [__meta_kubernetes_node_label_kubernetes_io_role]
      regex: master            # keep only master nodes
      action: keep
    - source_labels: [__address__]
      regex: '(.*):10250'
      replacement: '${1}:10257'
      target_label: __address__  # address Prometheus scrapes
      action: replace
```
Troubleshooting
Checking the listening ports shows that these components are bound to localhost and need to be reconfigured.
etcd:
```shell
#[Member]
ETCD_NAME="etcd-1"
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="https://192.168.0.120:2380"
ETCD_LISTEN_CLIENT_URLS="https://192.168.0.120:2379,http://192.168.0.120:2379"

#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.0.120:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.0.120:2379"
ETCD_INITIAL_CLUSTER="etcd-1=https://192.168.0.120:2380,etcd-2=https://192.168.0.61:2380,etcd-3=https://192.168.0.192:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"
```
kube-scheduler:
```shell
KUBE_SCHEDULER_OPTS="--logtostderr=false \
  --v=2 \
  --log-dir=/opt/kubernetes/logs \
  --leader-elect \
  --kubeconfig=/opt/kubernetes/cfg/kube-scheduler.kubeconfig \
  --bind-address=192.168.0.120"
```
kube-controller-manager:
```shell
KUBE_CONTROLLER_MANAGER_OPTS="--logtostderr=false \
  --v=2 \
  --log-dir=/opt/kubernetes/logs \
  --leader-elect=true \
  --kubeconfig=/opt/kubernetes/cfg/kube-controller-manager.kubeconfig \
  --bind-address=192.168.0.120 \
  --allocate-node-cidrs=true \
  --cluster-cidr=10.244.0.0/16 \
  --service-cluster-ip-range=10.0.0.0/16 \
  --cluster-signing-cert-file=/opt/kubernetes/ssl/ca.pem \
  --cluster-signing-key-file=/opt/kubernetes/ssl/ca-key.pem \
  --root-ca-file=/opt/kubernetes/ssl/ca.pem \
  --service-account-private-key-file=/opt/kubernetes/ssl/ca-key.pem \
  --cluster-signing-duration=876000h0m0s"
```
2.2.3 kube-scheduler
Port: 10259.
```shell
[root@tiaoban prometheus]# kubectl describe pod -n kube-system kube-scheduler-master1
Name:       kube-scheduler-k8s-master
Namespace:  kube-system
……
Labels:     component=kube-scheduler
            tier=control-plane
……
```
As shown above, matching Pod objects with the label component=kube-scheduler is enough. Like the controller-manager, the scheduler needs its bind address changed and the --port=0 line commented out.
Edit /etc/kubernetes/manifests/kube-scheduler.yaml:
```shell
# cat /etc/kubernetes/manifests/kube-scheduler.yaml
……
  - command:
    - --bind-address=0.0.0.0   # change the bind address to 0.0.0.0
    #- --port=0                # comment out the --port=0 line
……
```
- Write the Prometheus scrape config. The discovered target defaults to port 80, so we manually set port 10251, and we must also supply the token, otherwise the scrape fails with `server returned HTTP status 400 Bad Request`:
```yaml
- job_name: kube-scheduler
  kubernetes_sd_configs:
    - role: pod
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_component]
      regex: kube-scheduler
      action: keep
    - source_labels: [__meta_kubernetes_pod_ip]
      regex: (.+)
      target_label: __address__
      replacement: ${1}:10251
    - source_labels: [__meta_kubernetes_endpoints_name]
      action: replace
      target_label: endpoint
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: pod
    - source_labels: [__meta_kubernetes_service_name]
      action: replace
      target_label: service
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: namespace
```
Scrape config for a binary deployment:
```yaml
- job_name: kube-scheduler
  metrics_path: /metrics
  scheme: https
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token  # token mounted into the Prometheus pod
  kubernetes_sd_configs:       # service discovery
    - role: node               # discover node targets
  relabel_configs:
    - action: labelmap         # promote node labels to target labels
      regex: __meta_kubernetes_node_label_(.+)
    - source_labels: [__meta_kubernetes_node_label_kubernetes_io_role]
      regex: master            # keep only master nodes
      action: keep
    - source_labels: [__address__]
      regex: '(.*):10250'
      replacement: '${1}:10259'
      target_label: __address__  # address Prometheus scrapes
      action: replace
```
2.2.4 CoreDNS
- Write the Prometheus scrape config. The discovered target defaults to port 53, so we manually set port 9153.
- Why use service discovery at all?
  - 1. If we pointed Prometheus at the Service domain name, it is load-balanced: with several pods behind it, Prometheus would not scrape all of them.
  - 2. Pod IPs won't work either, because a pod's IP changes when it restarts.
```yaml
- job_name: coredns
  kubernetes_sd_configs:            # service discovery
    - role: endpoints               # fetch Endpoints from the apiserver to dynamically resolve the pod IPs behind the Service
  relabel_configs:
    - source_labels:
        - __meta_kubernetes_service_label_k8s_app  # source label
      regex: kube-dns               # match keyword
      action: keep                  # keep matching targets, drop everything else
    - source_labels: [__meta_kubernetes_pod_ip]  # pod IP
      regex: (.+)                   # '.' any character, '+' one or more times: matches everything
      target_label: __address__     # write the captured value into the __address__ label
      replacement: ${1}:9153        # append :9153 to the captured IP
    - source_labels: [__meta_kubernetes_endpoints_name]
      action: replace               # label replace, i.e. rename
      target_label: endpoint
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: pod
    - source_labels: [__meta_kubernetes_service_name]
      action: replace
      target_label: service
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: namespace
```
2.2.5 etcd
Port: 2381 (metrics endpoint).
```shell
# kubectl describe pod -n kube-system etcd-master1
Name:                 etcd-master1
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 master1/192.10.192.158
Start Time:           Mon, 30 Jan 2023 15:06:35 +0800
Labels:               component=etcd
                      tier=control-plane
···
    Command:
      etcd
      --advertise-client-urls=https://192.10.192.158:2379
      --cert-file=/etc/kubernetes/pki/etcd/server.crt
      --client-cert-auth=true
      --data-dir=/var/lib/etcd
      --initial-advertise-peer-urls=https://192.10.192.158:2380
      --initial-cluster=master1=https://192.10.192.158:2380
      --key-file=/etc/kubernetes/pki/etcd/server.key
      --listen-client-urls=https://127.0.0.1:2379,https://192.10.192.158:2379
      --listen-metrics-urls=http://127.0.0.1:2381
      --listen-peer-urls=https://192.10.192.158:2380
      --name=master1
      --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
      --peer-client-cert-auth=true
      --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
      --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
      --snapshot-count=10000
      --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
···
```
As shown above, the startup flags include --listen-metrics-urls=http://127.0.0.1:2381, which serves the metrics endpoint on port 2381 over plain HTTP, so no certificate configuration is needed. This is much simpler than older versions, which exposed metrics over HTTPS and required the corresponding certificates. The listen address still has to be changed from 127.0.0.1 to 0.0.0.0 so Prometheus can reach it.
- Write the Prometheus scrape config. The discovered target defaults to port 2379, so we manually set port 2381:
```yaml
- job_name: etcd
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels:
        - __meta_kubernetes_pod_label_component
      regex: etcd
      action: keep
    - source_labels: [__meta_kubernetes_pod_ip]
      regex: (.+)
      target_label: __address__
      replacement: ${1}:2381
    - source_labels: [__meta_kubernetes_endpoints_name]
      action: replace
      target_label: endpoint
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: pod
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: namespace
```
A brief explanation of the parameters above:
- kubernetes_sd_configs: enables Kubernetes dynamic service discovery.
- kubernetes_sd_configs.role: selects the discovery mode. In endpoints mode, Prometheus calls the kube-apiserver to fetch Endpoints objects, here limited to the namespace where the target pods live (kube-system).
- kubernetes_sd_configs.namespaces: restricts endpoints discovery to the configured namespaces.
- relabel_configs: rewrites the labels of the discovered targets.
Binary-deployment monitoring
To fetch the metrics manually, etcd's HTTP metrics listener must first be bound to a real IP and port rather than localhost; otherwise the metrics can only be fetched from the local machine.
```yaml
- job_name: etcd-cluster
  metrics_path: /metrics
  kubernetes_sd_configs:       # service discovery
    - role: node               # discover node targets
  relabel_configs:
    - action: labelmap         # promote node labels to target labels
      regex: __meta_kubernetes_node_label_(.+)
    - source_labels: [__meta_kubernetes_node_label_kubernetes_io_role]
      regex: master            # keep only master nodes
      action: keep
    - source_labels: [__address__]
      regex: '(.*):10250'
      replacement: '${1}:2379'
      target_label: __address__  # address Prometheus scrapes
      action: replace
```
Hot-reload Prometheus so that the ConfigMap changes take effect (or wait for Prometheus's automatic reload):
```shell
curl -XPOST http://prometheus.kubernets.cn/-/reload
```
3. cAdvisor
Main capabilities of cAdvisor:
- Monitors the resource usage and performance of containers. It runs as a daemon that collects, aggregates, processes, and exports information about running containers.
- cAdvisor supports Docker containers natively and aims to support and adapt to other container types as far as possible.
- Kubernetes integrates it into the kubelet by default, so there is no need to deploy a separate cAdvisor component to expose container metrics from the nodes.
3.1 Adding the kubelet cAdvisor scrape config to Prometheus
Since cAdvisor is already built into the kubelet, no extra component is needed. Note that its metrics path is /metrics/cadvisor and it must be scraped over HTTPS; insecure_skip_verify: true can be used to skip certificate verification.
Test fetching the kubelet metrics with curl, using the ServiceAccount token. Because Prometheus has to access the component to pull metrics, Prometheus must run under a ServiceAccount that is authorized to read them. Check the token mounted into the Prometheus pod, then test access with that token.
```yaml
- job_name: kubelet
  metrics_path: /metrics/cadvisor
  scheme: https
  tls_config:
    insecure_skip_verify: true  # skip certificate verification
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token  # token mounted into the Prometheus pod
  kubernetes_sd_configs:        # service discovery
    - role: node                # discover node targets
  relabel_configs:
    - action: labelmap          # promote each node label (the (.+) capture) to a target label, keeping its value
      regex: __meta_kubernetes_node_label_(.+)
    - source_labels: [__meta_kubernetes_endpoints_name]  # the rules below are optional for kubelet; they don't apply since this is a binary deployment
      action: replace
      target_label: endpoint
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: pod
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: namespace
```
Hot-reload Prometheus so that the ConfigMap changes take effect:
```shell
curl -XPOST http://prometheus.kubernets.cn/-/reload
```
3.2 Adding the ingress-controller scrape config to Prometheus
The metrics port is 10254; the controller is deployed as a DaemonSet.
```yaml
- job_name: kube-ingress-controller
  metrics_path: /metrics
  kubernetes_sd_configs:       # service discovery
    - role: node               # discover node targets
  relabel_configs:
    - action: labelmap         # promote node labels to target labels
      regex: __meta_kubernetes_node_label_(.+)
    - source_labels: [__address__]
      regex: '(.*):10250'
      replacement: '${1}:10254'
      target_label: __address__
      action: replace
```
3.3 Adding the kube-proxy scrape config to Prometheus
Port: 10249 (binary deployment).
```yaml
- job_name: kube-proxy
  metrics_path: /metrics
  scheme: https
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token  # token mounted into the Prometheus pod
  kubernetes_sd_configs:       # service discovery
    - role: node               # discover node targets
  relabel_configs:
    - action: labelmap         # promote node labels to target labels
      regex: __meta_kubernetes_node_label_(.+)
    - source_labels: [__address__]
      regex: '(.*):10250'
      replacement: '${1}:10249'
      target_label: __address__
      action: replace
```
4. node-exporter
Node Exporter is the official Prometheus collector for node-level resources; it gathers server metrics such as CPU frequency information, disk I/O statistics, and available memory.
Deployment:
Since it must run on every Kubernetes node, we deploy it as a DaemonSet.
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitor
  labels:
    name: node-exporter
spec:
  selector:
    matchLabels:
      name: node-exporter
  template:
    metadata:
      labels:
        name: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true          # the pod uses the host network
      containers:
        - name: node-exporter
          image: prom/node-exporter:latest
          ports:
            - containerPort: 9100
          resources:
            requests:
              cpu: 0.15
          securityContext:
            privileged: true
          args:
            - --path.procfs
            - /host/proc
            - --path.sysfs
            - /host/sys
            - --collector.filesystem.ignored-mount-points
            - '"^/(sys|proc|dev|host|etc)($|/)"'
          volumeMounts:
            - name: dev
              mountPath: /host/dev
            - name: proc
              mountPath: /host/proc
            - name: sys
              mountPath: /host/sys
            - name: rootfs
              mountPath: /rootfs
      tolerations:
        - key: "node-role.kubernetes.io/master"
          operator: "Exists"
          effect: "NoSchedule"
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: dev
          hostPath:
            path: /dev
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /
```
Notes on node_exporter.yaml:
- hostPID: whether the Node Exporter process joins the host's PID namespace; if true, it can see the host's process information.
- hostIPC: whether it joins the host's IPC namespace; if true, it can access the host's IPC resources.
- hostNetwork: whether it joins the host's network namespace; if true, it can access the host's network stack directly.
Verify:
```shell
[root@master1 /]# curl localhost:9100/metrics | grep cpu
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 2.7853625597758774e-06
# HELP node_cpu_guest_seconds_total Seconds the CPUs spent in guests (VMs) for each mode.
# TYPE node_cpu_guest_seconds_total counter
node_cpu_guest_seconds_total{cpu="0",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="0",mode="user"} 0
node_cpu_guest_seconds_total{cpu="1",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="1",mode="user"} 0
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1.90640354e+06
node_cpu_seconds_total{cpu="0",mode="iowait"} 6915.35
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 1426.82
node_cpu_seconds_total{cpu="0",mode="softirq"} 14446.63
```
4.1 Monitoring the Kubernetes nodes
Add a new scrape job, k8s-nodes, to prometheus-config.yaml.
node_exporter also runs on every node, so role: node is sufficient; the default discovered address uses port 10250, which we replace with 9100. Because the pod uses the host network, node discovery reaches it directly.
```yaml
- job_name: k8s-node-export
  kubernetes_sd_configs:
    - role: node
  relabel_configs:
    - source_labels: [__address__]
      regex: '(.*):10250'
      replacement: '${1}:9100'
      target_label: __address__
      action: replace
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - source_labels: [__meta_kubernetes_endpoints_name]
      action: replace
      target_label: endpoint
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: pod
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: namespace
```
Hot-reload Prometheus so that the ConfigMap changes take effect:
```shell
curl -XPOST http://prometheus.kubernets.cn/-/reload
```
5. Summary
- kube-state-metrics: converts the state of Kubernetes API objects into metrics that Prometheus can consume.
- cAdvisor: monitors container resource usage and performance, collecting CPU, memory, disk, network, and filesystem metrics.
- node-exporter: collects host-level metrics such as CPU load, memory usage, disk space, and network traffic.
Together these three tools form a comprehensive Kubernetes monitoring solution, giving users better insight into how their clusters and containerized applications are running.