Daily K8S Basics -- Monitoring a K8S Cluster with Prometheus
1. Overview of the monitoring tasks
Following the Prometheus architecture diagram, monitoring a K8S cluster with Prometheus mainly covers the following aspects:
1. Monitoring the state of the cluster nodes; this can be done by running an exporter (node-exporter) on every node in the cluster;
2. Monitoring the pods on each node; the pods on every node are managed by the kubelet, and the cAdvisor built into the kubelet can report each pod's running state and resource usage;
3. Monitoring the K8S core components: API server, etcd, kubelet, controller-manager, kube-scheduler, kube-proxy, etc.
2. Deploying node-exporter to monitor node resources
# YAML file
Link: https://pan.baidu.com/s/12kwzyK5f3FjDNK4HV-ADJA  extraction code: cjmo
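# The exact manifest is behind the link above. For reference, a minimal node-exporter DaemonSet is sketched below (the image tag, labels, and mounted host paths are assumptions, not the linked file). hostNetwork: true matches the pod listing further below, where each pod's IP equals its node's IP, and is what makes the metrics reachable at <node-ip>:9100.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitor-sa
  labels:
    name: node-exporter
spec:
  selector:
    matchLabels:
      name: node-exporter
  template:
    metadata:
      labels:
        name: node-exporter
    spec:
      hostNetwork: true                     # expose port 9100 directly on the node IP
      containers:
      - name: node-exporter
        image: prom/node-exporter:v1.5.0    # assumed image tag
        args:
        - --path.procfs=/host/proc          # read host metrics from the mounted paths
        - --path.sysfs=/host/sys
        ports:
        - containerPort: 9100
        volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys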
# Create the namespace
[ master-worker-node-1 root ~] # kubectl create namespace monitor-sa
namespace/monitor-sa created
# Create the DaemonSet
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # kubectl apply -f node-exporter.yaml
daemonset.apps/node-exporter created
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # kubectl get daemonset -n monitor-sa
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
node-exporter 3 3 3 3 3 <none> 3m42s
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # kubectl get pods -n monitor-sa -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
node-exporter-95p4m 1/1 Running 0 40s 192.168.100.92 master-worker-node-2 <none> <none>
node-exporter-hp5zb 1/1 Running 0 39s 192.168.100.93 master-worker-node-3 <none> <none>
node-exporter-hvf47 1/1 Running 0 40s 192.168.100.91 master-worker-node-1 <none> <none>
# Verify the metrics endpoint on the node port
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # curl http://192.168.100.91:9100/metrics | grep node_cpu_seconds | head -5
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 97k 0 97k 0 0 3906k 0 --:--:-- --:--:-- --:--:-- 3906k
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter    # a counter-type metric
node_cpu_seconds_total{cpu="0",mode="idle"} 694084.85
node_cpu_seconds_total{cpu="0",mode="iowait"} 2498.56
node_cpu_seconds_total{cpu="0",mode="irq"} 14901.21
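# Since node_cpu_seconds_total is a counter that only ever increases, it is usually consumed through rate() rather than read raw. A standard PromQL example (not from the original config):
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))   # per-node CPU utilization (%) over the last 5 minutes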
3. Configuring the Prometheus server
3.1 Create a service account for the Prometheus server and configure RBAC authorization
# Create the service account
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # kubectl create serviceaccount monitor -n monitor-sa
serviceaccount/monitor created
# Create the clusterrolebinding
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # kubectl create clusterrolebinding monitor --serviceaccount monitor-sa:monitor --clusterrole cluster-admin --namespace monitor-sa
clusterrolebinding.rbac.authorization.k8s.io/monitor created
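# For reference, the same binding expressed declaratively (equivalent to the imperative command above):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: monitor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin        # as in the command above; very broad, a narrower ClusterRole would also work
subjects:
- kind: ServiceAccount
  name: monitor
  namespace: monitor-sa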
3.2 Create the local data directory for Prometheus
# Here an NFS export is mounted onto each node first; all three nodes have the same directory, so the prometheus server pod keeps running even if it drifts to another node.
[ master-worker-node-1 root ~] # df -Th | grep -E "Type|192.168.100.94"
Filesystem Type Size Used Avail Use% Mounted on
192.168.100.94:/prometheus/node-1 nfs4 51G 2.5G 49G 5% /prometheus
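# For reference, the mount in the df output can be created on each node roughly like this (each node mounts its own export; the node-1 path follows the output above):
mount -t nfs 192.168.100.94:/prometheus/node-1 /prometheus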
# Create the directory and open up its permissions (the Prometheus process inside the container does not run as root, so it needs write access)
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # mkdir /prometheus/data
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # ls -ld /prometheus/data
drwxr-xr-x. 2 nobody nobody 6 Feb 19 12:54 /prometheus/data
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # chmod 777 /prometheus/data
3.3 Create the Prometheus server configuration file
# The Prometheus server configuration is added as a ConfigMap.
# Link: https://pan.baidu.com/s/1RuyLOoH_FM90mG8CT1hKtQ  extraction code: lpvg
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # kubectl apply -f prometheus-cfg.yaml
configmap/prometheus-config created
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # kubectl get configmap -n monitor-sa
NAME DATA AGE
kube-root-ca.crt 1 15h
prometheus-config 1 7s
# Because the scrape section of the Prometheus server's configuration defines 4 jobs, 4 targets can be seen after logging into the web UI
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # cat prometheus-cfg.yaml | grep job
- job_name: 'kubernetes-node'
- job_name: 'kubernetes-node-cadvisor'
- job_name: 'kubernetes-apiserver'
- job_name: 'kubernetes-service-endpoints'
1. kubernetes-node: discovered via port 10250, which the kubelet listens on by default
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # ss -ntupal | grep 10250
tcp LISTEN 0 128 *:10250 *:* users:(("kubelet",pid=9468,fd=27))
2. kubernetes-node-cadvisor: discovered via the kubelet's /metrics/cadvisor endpoint
3. kubernetes-apiserver: discovered by scraping the apiserver's port 6443
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # ss -naltup | grep 6443 -w
tcp LISTEN 0 128 *:6443 *:* users:(("kube-apiserver",pid=7520,fd=7))
4. kubernetes-service-endpoints: driven by the matching annotations on each Service, as the sketch below shows.
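# The full prometheus-cfg.yaml is behind the link above. As a sketch (assuming it follows the standard Prometheus example configuration, which the prometheus.io/* annotations suggest), the kubernetes-service-endpoints job looks roughly like:
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # keep only endpoints whose Service carries the annotation prometheus.io/scrape: "true"
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  # let the Service override the metrics path via prometheus.io/path
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  # let the Service override the scrape port via prometheus.io/port
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2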
3.4 Deploy the Prometheus server via a Deployment
# Create the prometheus pod and mount the ConfigMap as a volume
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # cat prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitor-sa
  labels:
    app: prometheus
spec:
  selector:
    matchLabels:
      app: prometheus
      component: server
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus
        component: server
    spec:
      serviceAccount: monitor                           # the RBAC-bound service account from 3.1
      containers:
      - name: prometheus-server
        image: bitnami/prometheus:2.42.0
        imagePullPolicy: IfNotPresent
        command:
        - prometheus
        - --config.file=/etc/prometheus/prometheus.yml
        - --storage.tsdb.path=/prometheus
        - --storage.tsdb.retention=720h
        - --web.enable-lifecycle                        # enables the /-/reload hot-reload endpoint
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/prometheus
          name: prometheus-config
        - mountPath: /prometheus/
          name: prometheus-storage-volume
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
      - name: prometheus-storage-volume
        hostPath:
          path: /prometheus/data                        # NFS-backed directory created in 3.2
          type: Directory
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # kubectl apply -f prometheus-deployment.yaml
deployment.apps/prometheus created
# The pod starts normally
[ master-worker-node-1 root /prometheus] # kubectl get deployments.apps -o wide -n monitor-sa prometheus
NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
prometheus 1/1 1 1 4m1s prometheus-server bitnami/prometheus:2.42.0 app=prometheus,component=server
[ master-worker-node-1 root /prometheus] # kubectl get pods -o wide -n monitor-sa prometheus-666897fc49-2jpnx
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
prometheus-666897fc49-2jpnx 1/1 Running 0 4m12s 10.244.132.197 master-worker-node-2 <none> <none>
3.5 Expose the Prometheus server pod's port
[ master-worker-node-1 root ~] # kubectl expose deployment -n monitor-sa prometheus --type NodePort --name prometheus --port 9090 --target-port 9090 --protocol TCP
service/prometheus exposed
[ master-worker-node-1 root ~] # kubectl get service -o wide -n monitor-sa
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
prometheus NodePort 10.103.206.76 <none> 9090:31346/TCP 26s app=prometheus,component=server
# The web UI can now be opened in a browser, and the scrape targets show up under the Targets tab.
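# The same list can also be checked from the command line via the Prometheus HTTP API (node IP and NodePort taken from the outputs above):
curl -s http://192.168.100.91:31346/api/v1/targets    # returns the active and dropped scrape targets as JSON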
3.6 Manually add a Service to Prometheus
# At this point the only service-endpoints targets are the two endpoints behind the kube-dns service
# They are picked up as targets because the kube-dns Service's annotations match the kubernetes-service-endpoints job
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # kubectl describe service -n kube-system kube-dns
Name: kube-dns
Namespace: kube-system
Labels: k8s-app=kube-dns
kubernetes.io/cluster-service=true
kubernetes.io/name=CoreDNS
Annotations: prometheus.io/port: 9153
prometheus.io/scrape: true
# Create a new deployment, nginx
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # kubectl get deployments.apps nginx -owide
NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
nginx 1/1 1 1 95s nginx nginx app=nginx
# Configure a Service for the nginx deployment, setting the corresponding annotations on the Service (a sketch follows the output below)
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # kubectl get service -o wide nginx-deployment-service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
nginx-deployment-service NodePort 10.99.22.192 <none> 30080:31942/TCP 11s app=nginx
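# A sketch of what such an annotated Service could look like (name, ports, and selector follow the output above; the annotation values are assumptions mirroring the kube-dns example):
apiVersion: v1
kind: Service
metadata:
  name: nginx-deployment-service
  annotations:
    prometheus.io/scrape: "true"    # opt this Service in to the kubernetes-service-endpoints job
    prometheus.io/port: "30080"     # the port Prometheus should scrape
spec:
  type: NodePort
  selector:
    app: nginx
  ports:
  - port: 30080
    targetPort: 80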
# The scrape fails at this point, mainly because nginx does not ship with a metrics endpoint by default; one has to be constructed, e.g. via a service monitor/exporter. A separate walkthrough of that procedure will follow later.
4. Hot-reloading Prometheus
# When the Prometheus configuration file has been modified, deleting the prometheus pod to apply it can cause data loss; if the hot-reload flag (--web.enable-lifecycle) was specified at startup, the configuration can instead be reloaded with a single command
# The hot-reload flag was added to the prometheus container's startup command
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # cat prometheus-deployment.yaml | grep lifecycle
- --web.enable-lifecycle
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # kubectl get pods -n monitor-sa prometheus-6bb56888bc-k2t6h -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
prometheus-6bb56888bc-k2t6h 1/1 Running 0 7h51m 10.244.123.136 master-worker-node-1 <none> <none>
# Trigger the hot reload (the reload endpoint returns no output on success)
[ master-worker-node-1 root ~/k8s/yaml/prometheus] # curl -X POST 10.244.123.136:9090/-/reload
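# A typical end-to-end flow is sketched below (pod IP from the output above). When the config is mounted from a ConfigMap, the kubelet can take up to about a minute to sync the updated file into the pod, so reload after the sync:
kubectl apply -f prometheus-cfg.yaml          # push the new configuration into the ConfigMap
sleep 60                                      # give the kubelet time to sync the mounted file
curl -X POST 10.244.123.136:9090/-/reload     # ask Prometheus to re-read its config file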
5. Summary
1. Monitoring a K8S cluster with Prometheus mainly covers monitoring node state, node resource usage, and the K8S core services;
2. The scrape jobs involve a large amount of special-purpose configuration syntax (service discovery, relabeling rules, and so on);
3. How to expose a metrics endpoint, and how to construct one when a workload does not provide its own, is the key part of configuring Prometheus;
4. If the hot-reload flag was specified when Prometheus was created, a running Prometheus can be hot-reloaded, which avoids the data loss that restarting it could cause.