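The Road to Linux Operations and Architecture: K8s Troubleshooting
I. Kubernetes troubleshooting
1. Application troubleshooting
① This mainly applies at the Pod level.
When a Pod is not in the Running state, use describe to view the Pod's events and troubleshoot from there. describe can also show the events of other resource objects, such as Deployments and Services.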
kubectl describe TYPE/NAME
[root@k8s-master ~]# kubectl describe pod web
Name:         web
Namespace:    default
Priority:     0
Node:         k8s-node1/192.168.56.62
Start Time:   Wed, 16 Dec 2020 14:43:55 +0800
Labels:       <none>
Annotations:  cni.projectcalico.org/podIP: 10.244.36.81/32
              cni.projectcalico.org/podIPs: 10.244.36.81/32
Status:       Pending
IP:
IPs:          <none>
Containers:
  nginx:
    Container ID:
    Image:          nginx
    Image ID:
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-c87dr (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-c87dr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-c87dr
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age        From                Message
  ----    ------     ----       ----                -------
  Normal  Scheduled  <unknown>  default-scheduler   Successfully assigned default/web to k8s-node1
  Normal  Pulling    11s        kubelet, k8s-node1  Pulling image "nginx"
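Most of the describe output is static metadata, and the part that matters for troubleshooting is usually the Events section at the end. A minimal sketch of a filter that keeps only that section; the helper name events_only is made up for this example and is not part of kubectl:

```shell
#!/bin/sh
# events_only: read `kubectl describe` output on stdin and print everything
# from the "Events:" line onward, dropping the metadata above it.
events_only() {
  awk '/^Events:/ { found = 1 } found { print }'
}

# Usage (assumes kubectl access to the cluster and a Pod named "web"):
#   kubectl describe pod web | events_only
```

The filter only relies on the section header starting at column 0, as in the output above.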
kubectl logs TYPE/NAME [-c CONTAINER]: the apiserver fetches the logs by calling the kubelet's API.
[root@k8s-master ~]# kubectl logs web
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
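kubectl exec POD [-c CONTAINER] -- COMMAND [args...]: when a Pod has several containers, use -c to specify the container name.
② Possible reasons a Pod is in the Pending state
- The image is still being pulled
- The nodes may not have enough resources
- No node matches the Pod's node label selector
- The nodes have taints that the Pod does not tolerate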
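The causes listed above usually show up in the message of the Pod's most recent event. As a rough sketch, the function below maps one event message to one of those causes. The name pending_hint and the match patterns are assumptions for this example: event wording differs between Kubernetes versions, and only the first matching cause is reported.

```shell
#!/bin/sh
# pending_hint: read one event message on stdin and print a likely cause.
# The patterns are assumptions about event wording, not a stable interface.
pending_hint() {
  msg=$(cat)
  case "$msg" in
    *Insufficient*)     echo "insufficient node resources" ;;
    *"didn't match"*)   echo "no node matches the node selector/affinity" ;;
    *taint*)            echo "node taint not tolerated" ;;
    *"Pulling image"*)  echo "image is still being pulled" ;;
    *)                  echo "unknown, read the full event" ;;
  esac
}

# Usage: paste the Message column of the last event, for example
#   echo 'Pulling image "nginx"' | pending_hint
```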
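2. Control-plane node troubleshooting
(Figure: cluster architecture diagram)
① kubeadm deployment
Apart from the kubelet, which runs as a systemd service, the control-plane components are started as static Pods.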
[root@k8s-master ~]# kubectl get pods -n kube-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-59877c7fb4-z2bms   1/1     Running   2          105d
calico-node-pnjxq                          1/1     Running   1          105d
calico-node-v48jq                          1/1     Running   1          105d
coredns-7ff77c879f-dqk8t                   1/1     Running   1          105d
coredns-7ff77c879f-j8zsp                   1/1     Running   1          105d
etcd-k8s-master                            1/1     Running   1          105d
kube-apiserver-k8s-master                  1/1     Running   1          105d
kube-controller-manager-k8s-master         1/1     Running   6          105d
kube-proxy-ck88h                           1/1     Running   1          105d
kube-proxy-hkb9f                           1/1     Running   1          105d
kube-scheduler-k8s-master                  1/1     Running   6          105d
metrics-server-8fcfb55ff-wlw5s             1/1     Running   3          104d
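With a dozen system Pods it is easy to overlook the one that is unhealthy. A small sketch of a filter for this listing; the name unhealthy_pods is made up for this example, and it assumes the default column layout (NAME, READY, STATUS first), so it does not apply to output produced with -A:

```shell
#!/bin/sh
# unhealthy_pods: read `kubectl get pods` output on stdin and print the pods
# whose STATUS is not Running or whose READY count is incomplete (e.g. 0/1).
unhealthy_pods() {
  awk 'NR > 1 {
         split($2, ready, "/")
         if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
       }'
}

# Usage (assumes kubectl access to the cluster):
#   kubectl get pods -n kube-system | unhealthy_pods
```

Pods of finished Jobs (STATUS Completed) are reported as well, which may or may not be wanted.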
Manifest path for the other components: /etc/kubernetes/manifests
[root@k8s-master ~]# ll /etc/kubernetes/manifests/
total 16
-rw------- 1 root root 1887 Sep  1 17:04 etcd.yaml
-rw------- 1 root root 2738 Sep  1 17:04 kube-apiserver.yaml
-rw------- 1 root root 2594 Sep  1 17:04 kube-controller-manager.yaml
-rw------- 1 root root 1149 Sep  1 17:04 kube-scheduler.yaml
How a k8s cluster was deployed can be told from its component services, processes, and certificate paths.
[root@k8s-master ~]# systemctl status kube-apiserver.service
Unit kube-apiserver.service could not be found.    # so this is not a binary deployment
[root@k8s-master ~]# ps aux|grep apiserver    # kubeadm deployments keep certificates under fixed paths
root 1696 6.1 19.0 635004 386360 ? Ssl 10:01 30:04 kube-apiserver --advertise-address=192.168.56.61 --allow-privileged=true --authorization-mode=Node,RBAC --client-ca-file=/etc/kubernetes/pki/ca.crt --enable-admission-plugins=NodeRestriction --enable-bootstrap-token-auth=true --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key --etcd-servers=https://127.0.0.1:2379 --insecure-port=0 --kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt --kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key --requestheader-allowed-names=front-proxy-client --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6443 --service-account-key-file=/etc/kubernetes/pki/sa.pub --service-cluster-ip-range=10.96.0.0/12 --tls-cert-file=/etc/kubernetes/pki/apiserver.crt --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
1001 3837 0.0 1.2 138732 26048 ? Ssl 10:04 0:17 /dashboard --insecure-bind-address=0.0.0.0 --bind-address=0.0.0.0 --auto-generate-certificates --namespace=kubernetes-dashboard --tls-key-file=apiserver.key --tls-cert-file=apiserver.crt
root 87035 0.0 0.0 112724 980 pts/1 S+ 18:09 0:00 grep --color=auto apiserver
Changing the static Pod manifest path (the staticPodPath setting in the kubelet configuration)
[root@k8s-master ~]# tail /var/lib/kubelet/config.yaml
imageMinimumGCAge: 0s
kind: KubeletConfiguration
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
rotateCertificates: true
runtimeRequestTimeout: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s
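Rather than scanning the tail of the file by eye, the value can be pulled out directly. A minimal sketch, assuming the key sits at the top level of the YAML file as in the output above; the function name static_pod_path is made up for this example:

```shell
#!/bin/sh
# static_pod_path: read a kubelet configuration on stdin and print the value
# of its top-level staticPodPath key.
static_pod_path() {
  awk '$1 == "staticPodPath:" { print $2 }'
}

# Usage on a node (path as shown in the tail output above):
#   static_pod_path < /var/lib/kubelet/config.yaml
```

After editing staticPodPath, the kubelet has to be restarted (systemctl restart kubelet) before it watches the new directory.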
② Binary deployment
All components are managed by systemd.
[root@k8s-node1 ~]# systemctl status kube-apiserver.service
● kube-apiserver.service - Kubernetes API Server
   Loaded: loaded (/usr/lib/systemd/system/kube-apiserver.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-04-20 15:26:41 CST; 7 months 27 days ago
     Docs: https://github.com/kubernetes/kubernetes
 Main PID: 17587 (kube-apiserver)
    Tasks: 36
   Memory: 356.5M
   CGroup: /system.slice/kube-apiserver.service
           └─17587 /app/kubernetes/bin/kube-apiserver --logtostderr=false --v=2 --log-dir=/app/kubernetes/logs --etcd-...

Dec 16 16:22:11 k8s-node1 kube-apiserver[17587]: E1216 16:22:11.216916 17587 watcher.go:214] watch chan error: ...acted
Dec 16 16:38:14 k8s-node1 kube-apiserver[17587]: E1216 16:38:14.231035 17587 watcher.go:214] watch chan error: ...acted
Dec 16 16:51:27 k8s-node1 kube-apiserver[17587]: E1216 16:51:27.296324 17587 watcher.go:214] watch chan error: ...acted
Dec 16 17:04:51 k8s-node1 kube-apiserver[17587]: E1216 17:04:51.356825 17587 watcher.go:214] watch chan error: ...acted
Dec 16 17:20:04 k8s-node1 kube-apiserver[17587]: E1216 17:20:04.464772 17587 watcher.go:214] watch chan error: ...acted
Dec 16 17:28:03 k8s-node1 kube-apiserver[17587]: E1216 17:28:03.551942 17587 watcher.go:214] watch chan error: ...acted
Dec 16 17:38:01 k8s-node1 kube-apiserver[17587]: E1216 17:38:01.568538 17587 watcher.go:214] watch chan error: ...acted
Dec 16 17:52:41 k8s-node1 kube-apiserver[17587]: E1216 17:52:41.593466 17587 watcher.go:214] watch chan error: ...acted
Dec 16 18:01:48 k8s-node1 kube-apiserver[17587]: E1216 18:01:48.620521 17587 watcher.go:214] watch chan error: ...acted
Dec 16 18:16:43 k8s-node1 kube-apiserver[17587]: E1216 18:16:43.655648 17587 watcher.go:214] watch chan error: ...acted
Hint: Some lines were ellipsized, use -l to show in full.
Service unit file path: /usr/lib/systemd/system
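The two deployment styles leave different files behind: a kubeadm install has a kube-apiserver.yaml static Pod manifest, a binary install has a kube-apiserver.service unit. A rough sketch of a check built on that difference follows. The function name detect_deploy is made up for this example, and the default paths are the ones from this article; other installers may place unit files elsewhere.

```shell
#!/bin/sh
# detect_deploy [MANIFEST_DIR] [UNIT_DIR]
# Guess the deployment method from which kube-apiserver file exists.
# Defaults are the paths used in this article and may differ on other setups.
detect_deploy() {
  manifest_dir=${1:-/etc/kubernetes/manifests}
  unit_dir=${2:-/usr/lib/systemd/system}
  if [ -f "$manifest_dir/kube-apiserver.yaml" ]; then
    echo "kubeadm (static Pod)"
  elif [ -f "$unit_dir/kube-apiserver.service" ]; then
    echo "binary (systemd unit)"
  else
    echo "unknown"
  fi
}

# Usage, on the node that runs the apiserver:
#   detect_deploy
```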
③ Control-plane components
- kube-apiserver
- kube-controller-manager
- kube-scheduler
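3. Worker node troubleshooting
① Worker node components
- kubelet # calls the container runtime's interface to manage containers and reports container status to the apiserver.
- kube-proxy # implements load balancing and service discovery for Pods, forwarding incoming requests to the set of Pods behind a Service.
② Possible reasons a node is in the NotReady state
- The kubelet service fails to start
- There is no network connectivity between the kubelet and the apiserver
- The certificate presented by the kubelet is invalid, for example expired
- The node's disk is full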
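Two of the checks above lend themselves to simple filters: finding which nodes are NotReady, and finding file systems that are nearly full on such a node. A sketch under stated assumptions: the function names are made up for this example, and the 85 percent default is an arbitrary example value, not a kubelet threshold.

```shell
#!/bin/sh
# notready_nodes: read `kubectl get nodes` output on stdin and print the
# nodes whose STATUS does not start with "Ready".
notready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1, $2 }'
}

# full_mounts [LIMIT]: read `df -P` output on stdin and print mount points
# whose usage is at or above LIMIT percent (default 85, an example value).
full_mounts() {
  awk -v limit="${1:-85}" 'NR > 1 {
         use = $5
         sub(/%/, "", use)
         if (use + 0 >= limit + 0) print $6, $5
       }'
}

# Usage:
#   kubectl get nodes | notready_nodes     # on a machine with kubectl access
#   df -P | full_mounts 90                 # on the affected node
```

full_mounts assumes mount points without spaces, since it splits the df output on whitespace.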
If the kubelet service has not been started:
systemctl start kubelet && systemctl enable kubelet
If the kubelet service cannot start:
journalctl -u kubelet    # read the logs to find the cause
journalctl -u kubelet.service > kubelet.log    # write the logs to a file for analysis
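The kubelet journal is verbose, so it can help to keep only error lines. The sketch below assumes the kubelet writes klog-style text lines, where the severity letter (I, W, E, F) leads the message, as in the kube-apiserver journal shown earlier. The function name klog_errors is made up for this example, and the filter does not apply if the kubelet is configured for another log format.

```shell
#!/bin/sh
# klog_errors: read journal lines on stdin and keep only those whose message
# starts with a klog error (E) or fatal (F) prefix such as "E1216 16:22:11".
klog_errors() {
  grep -E '\]: [EF][0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}'
}

# Usage on the node:
#   journalctl -u kubelet --no-pager | klog_errors
#   klog_errors < kubelet.log
```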
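4. Service access troubleshooting
① Flow when a user accesses a Service through a NodePort
client -> port opened by kube-proxy on the node; incoming traffic is processed by iptables/ipvs rules -> a group of Pods (spread across the nodes)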
[root@k8s-node1 ~]# kubectl get svc -n kube-system
NAME      TYPE       CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
grafana   NodePort   10.0.0.202   <none>        3000:9006/TCP   258d
[root@k8s-node1 ~]# iptables-save |grep 9006
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-system/grafana:" -m tcp --dport 9006 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-system/grafana:" -m tcp --dport 9006 -j KUBE-SVC-3QDDWNGGGXWDZXKH
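The node port is the number after the colon in the PORT(S) column. A minimal sketch that extracts it for every NodePort Service, so the result can be fed to the iptables-save check above; the function name node_ports is made up for this example:

```shell
#!/bin/sh
# node_ports: read `kubectl get svc` output on stdin and print "NAME NODEPORT"
# for each port of every NodePort Service (PORT(S) looks like 3000:9006/TCP).
node_ports() {
  awk 'NR > 1 && $2 == "NodePort" {
         n = split($5, ports, ",")
         for (i = 1; i <= n; i++) {
           split(ports[i], p, "[:/]")
           print $1, p[2]
         }
       }'
}

# Usage (assumes kubectl access; run the iptables check on a node):
#   kubectl get svc -n kube-system | node_ports
#   iptables-save | grep -- "--dport 9006"
```

When kube-proxy runs in ipvs mode the NodePort rules are not visible in this form in iptables-save; ipvsadm -Ln is the usual place to look instead.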
② Check that the Pods and the Service are running normally
[root@k8s-master ~]# kubectl get pods -o wide
NAME                   READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
web-5dcb957ccc-96nbn   1/1     Running   0          10m   10.244.36.93   k8s-node1   <none>           <none>
web-5dcb957ccc-j5sz7   1/1     Running   0          10m   10.244.36.66   k8s-node1   <none>           <none>
[root@k8s-master ~]# kubectl get svc
NAME          TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
kubernetes    ClusterIP   10.96.0.1      <none>        443/TCP        106d
web-service   NodePort    10.99.239.53   <none>        80:31100/TCP   10m
③ Check that the Service is correctly associated with its Pods
[root@k8s-master ~]# kubectl get ep
NAME          ENDPOINTS                         AGE
kubernetes    192.168.56.61:6443                106d
web-service   10.244.36.66:80,10.244.36.93:80   9m43s
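A Service that is not associated with any Pod shows <none> in the ENDPOINTS column, which typically means its selector matches no Pod labels or the matching Pods are not ready. A short sketch that lists such Services; the function name no_endpoints is made up for this example:

```shell
#!/bin/sh
# no_endpoints: read `kubectl get ep` output on stdin and print the names of
# Services whose ENDPOINTS column is "<none>".
no_endpoints() {
  awk 'NR > 1 && $2 == "<none>" { print $1 }'
}

# Usage (assumes kubectl access to the cluster):
#   kubectl get ep | no_endpoints
# For a Service reported here, compare its selector with the Pod labels:
#   kubectl get svc web-service -o jsonpath='{.spec.selector}'
#   kubectl get pods --show-labels
```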
④ Check that the targetPort specified by the Service is correct
[root@k8s-master ~]# kubectl exec -it web-5dcb957ccc-96nbn -- netstat -lntp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      1/nginx: master pro
tcp6       0      0 :::80                   :::*                    LISTEN      1/nginx: master pro
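The point of this step is that the port the container actually listens on must equal the Service's targetPort. A minimal sketch that turns the netstat listing into a yes/no answer; the function name listens_on is made up for this example, and it assumes the container image ships netstat:

```shell
#!/bin/sh
# listens_on PORT: read `netstat -lnt` (or -lntp) output on stdin and exit 0
# if some LISTEN socket's local address ends in :PORT, otherwise exit 1.
listens_on() {
  awk -v port="$1" '
    $6 == "LISTEN" {
      n = split($4, parts, ":")
      if (parts[n] == port) found = 1
    }
    END { exit (found ? 0 : 1) }'
}

# Usage (assumes kubectl access and the Pod and Service names from this article):
#   kubectl get svc web-service -o jsonpath='{.spec.ports[0].targetPort}'
#   kubectl exec web-5dcb957ccc-96nbn -- netstat -lnt | listens_on 80 && echo "port 80 is served"
```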
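⑤ Other reasons a Service may be unreachable
- Does the Service resolve through DNS?
- Is kube-proxy working properly?
- Is kube-proxy writing its iptables rules correctly?
- Is the CNI network plugin working properly?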
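For the DNS question, the name to test is the Service's full cluster DNS name. The sketch below only builds that name. The function name svc_fqdn is made up for this example, and cluster.local is an assumption: it is the usual default cluster domain, but a cluster can be configured with a different one.

```shell
#!/bin/sh
# svc_fqdn NAME [NAMESPACE] [CLUSTER_DOMAIN]
# Build the cluster DNS name of a Service. The namespace defaults to
# "default" and the domain to "cluster.local" (an assumed default).
svc_fqdn() {
  echo "$1.${2:-default}.svc.${3:-cluster.local}"
}

# Usage (assumes kubectl access and a Pod image that ships nslookup, which
# many minimal images do not; POD is a placeholder):
#   kubectl exec POD -- nslookup "$(svc_fqdn web-service)"
```

If the short name fails but the full name resolves, the Pod's DNS search path is a likely suspect; if neither resolves, the cluster DNS Pods in kube-system are the next thing to check.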