Linux Operations and Architecture: K8s Troubleshooting

I. Kubernetes Troubleshooting

1. Troubleshooting application failures

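This mainly concerns the Pod level. When a Pod is not in the Running state, use describe to view the Pod's events and track down the problem. describe can also show the events of other resource objects, such as Deployments and Services.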

 kubectl describe TYPE/NAME

[root@k8s-master ~]# kubectl describe pod web 
Name:         web
Namespace:    default
Priority:     0
Node:         k8s-node1/192.168.56.62
Start Time:   Wed, 16 Dec 2020 14:43:55 +0800
Labels:       <none>
Annotations:  cni.projectcalico.org/podIP: 10.244.36.81/32
              cni.projectcalico.org/podIPs: 10.244.36.81/32
Status:       Pending
IP:           
IPs:          <none>
Containers:
  nginx:
    Container ID:   
    Image:          nginx
    Image ID:       
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-c87dr (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-c87dr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-c87dr
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age        From                Message
  ----    ------     ----       ----                -------
  Normal  Scheduled  <unknown>  default-scheduler   Successfully assigned default/web to k8s-node1
  Normal  Pulling    11s        kubelet, k8s-node1  Pulling image "nginx"
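
To find out which Pods need a describe in the first place, a small helper can list every Pod whose phase is not Running and describe each one. This is a minimal sketch: the function name describe_not_running is hypothetical, and it relies on the status.phase field selector of kubectl get. Note that Pods that completed successfully also match, since their phase is Succeeded.

```shell
# Describe every Pod in a namespace whose phase is not Running.
# describe_not_running is a hypothetical helper name, not a kubectl feature.
describe_not_running() {
  # $1: namespace (defaults to "default")
  kubectl get pods -n "${1:-default}" \
    --field-selector='status.phase!=Running' -o name |
  while read -r pod; do
    echo "===== $pod ====="
    kubectl describe -n "${1:-default}" "$pod"
  done
}

# Usage: describe_not_running default
```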

kubectl logs TYPE/NAME [-c CONTAINER]: the apiserver calls the kubelet's API to fetch the container's logs.

[root@k8s-master ~]# kubectl logs web 
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
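
When a container keeps restarting, the log of the previous instance is usually the interesting one. The sketch below prints the tail of both; pod_logs is a hypothetical helper name and the 50-line tail is an arbitrary choice.

```shell
# Print the tail of a container's current log and of its previous instance.
# pod_logs is a hypothetical helper name; 50 lines is an arbitrary limit.
pod_logs() {
  # $1: Pod name; $2: container name (only needed for multi-container Pods)
  echo "--- current instance ---"
  kubectl logs "$1" ${2:+-c "$2"} --tail=50
  echo "--- previous instance ---"
  kubectl logs "$1" ${2:+-c "$2"} --tail=50 --previous 2>/dev/null ||
    echo "(no previous instance)"
}

# Usage: pod_logs web nginx
```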

kubectl exec POD [-c CONTAINER] -- COMMAND [args...]: when a Pod contains several containers, use -c to specify the container name.
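
To know which name to pass to -c, the container names can be read from the Pod spec. A minimal sketch follows; pod_containers is a hypothetical helper name, and the exec line in the usage comment assumes the image ships a shell.

```shell
# Print the container names of a Pod, one per line, so that the right one
# can be passed to -c. pod_containers is a hypothetical helper name.
pod_containers() {
  kubectl get pod "$1" -o jsonpath='{.spec.containers[*].name}' | tr ' ' '\n'
}

# Usage:
#   pod_containers web
#   kubectl exec -it web -c nginx -- sh
```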

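② Possible reasons a Pod is stuck in Pending

  • the image is still being downloaded
  • the nodes may not have enough free resources
  • no node matches the node label selector
  • the nodes are tainted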
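
The last three causes show up as a FailedScheduling event, while an image download shows up as a Pulling event like the one in the describe output earlier. The sketch below maps a scheduler message to one of those causes. pending_cause is a hypothetical helper, and the phrases it matches are assumptions about the scheduler's wording, which differs between Kubernetes versions. If a message names several causes, only the first match is reported.

```shell
# Map a FailedScheduling event message to one of the causes listed above.
# pending_cause is a hypothetical helper; the matched phrases are assumptions
# about the scheduler's wording and may need adjusting for your version.
pending_cause() {
  case "$1" in
    *Insufficient*)   echo "insufficient node resources" ;;
    *"didn't match"*) echo "no node matches the node selector" ;;
    *taint*)          echo "untolerated taint" ;;
    *)                echo "unknown, read the full event" ;;
  esac
}

# Usage (the first command prints the events recorded for the Pod "web"):
#   kubectl get events --field-selector involvedObject.name=web
#   pending_cause "0/2 nodes are available: 2 Insufficient cpu."
```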

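2. Troubleshooting control-plane (master) node failures

(Figure: cluster architecture diagram)

① kubeadm deployment

Apart from the kubelet, which runs as a systemd service, the control-plane components (etcd, kube-apiserver, kube-controller-manager, kube-scheduler) are started as static Pods.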

[root@k8s-master ~]# kubectl get pods -n kube-system 
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-59877c7fb4-z2bms   1/1     Running   2          105d
calico-node-pnjxq                          1/1     Running   1          105d
calico-node-v48jq                          1/1     Running   1          105d
coredns-7ff77c879f-dqk8t                   1/1     Running   1          105d
coredns-7ff77c879f-j8zsp                   1/1     Running   1          105d
etcd-k8s-master                            1/1     Running   1          105d
kube-apiserver-k8s-master                  1/1     Running   1          105d
kube-controller-manager-k8s-master         1/1     Running   6          105d
kube-proxy-ck88h                           1/1     Running   1          105d
kube-proxy-hkb9f                           1/1     Running   1          105d
kube-scheduler-k8s-master                  1/1     Running   6          105d
metrics-server-8fcfb55ff-wlw5s             1/1     Running   3          104d

Path of the static Pod manifests for these components: /etc/kubernetes/manifests

[root@k8s-master ~]# ll /etc/kubernetes/manifests/
total 16
-rw------- 1 root root 1887 Sep  1 17:04 etcd.yaml
-rw------- 1 root root 2738 Sep  1 17:04 kube-apiserver.yaml
-rw------- 1 root root 2594 Sep  1 17:04 kube-controller-manager.yaml
-rw------- 1 root root 1149 Sep  1 17:04 kube-scheduler.yaml

The component services, processes and certificate paths show how a k8s cluster was deployed

[root@k8s-master ~]# systemctl status kube-apiserver.service
Unit kube-apiserver.service could not be found.    # so this is not a binary deployment
[root@k8s-master ~]# ps aux|grep apiserver         # kubeadm keeps its certificates under a fixed path, /etc/kubernetes/pki
root       1696  6.1 19.0 635004 386360 ?       Ssl  10:01  30:04 kube-apiserver --advertise-address=192.168.56.61 --allow-privileged=true --authorization-mode=Node,RBAC --client-ca-file=/etc/kubernetes/pki/ca.crt --enable-admission-plugins=NodeRestriction --enable-bootstrap-token-auth=true --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key --etcd-servers=https://127.0.0.1:2379 --insecure-port=0 --kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt --kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key --requestheader-allowed-names=front-proxy-client --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6443 --service-account-key-file=/etc/kubernetes/pki/sa.pub --service-cluster-ip-range=10.96.0.0/12 --tls-cert-file=/etc/kubernetes/pki/apiserver.crt --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
1001       3837  0.0  1.2 138732 26048 ?        Ssl  10:04   0:17 /dashboard --insecure-bind-address=0.0.0.0 --bind-address=0.0.0.0 --auto-generate-certificates --namespace=kubernetes-dashboard --tls-key-file=apiserver.key --tls-cert-file=apiserver.crt
root      87035  0.0  0.0 112724   980 pts/1    S+   18:09   0:00 grep --color=auto apiserver
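
The two checks above can be combined into one function. This is only a sketch under the assumptions of this article: a binary deployment registers kube-apiserver.service with systemd, and a kubeadm deployment keeps kube-apiserver.yaml in the manifest directory. detect_deploy_mode is a hypothetical helper name.

```shell
# Guess how the control plane was deployed. detect_deploy_mode is a
# hypothetical helper; the manifest directory defaults to the kubeadm path.
detect_deploy_mode() {
  # $1: manifest directory (defaults to /etc/kubernetes/manifests)
  if systemctl cat kube-apiserver.service >/dev/null 2>&1; then
    echo "binary (systemd unit found)"
  elif [ -f "${1:-/etc/kubernetes/manifests}/kube-apiserver.yaml" ]; then
    echo "kubeadm (static Pod manifest found)"
  else
    echo "unknown"
  fi
}

# Usage: detect_deploy_mode
```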

The static Pod manifest path is set by staticPodPath in the kubelet configuration file and can be changed there

[root@k8s-master ~]# tail /var/lib/kubelet/config.yaml 
imageMinimumGCAge: 0s
kind: KubeletConfiguration
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
rotateCertificates: true
runtimeRequestTimeout: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s
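
Rather than reading the tail of the file, the value can be extracted directly. A minimal sketch; static_pod_path is a hypothetical helper name, and the default file is the kubeadm location shown above.

```shell
# Print the staticPodPath configured for the kubelet.
# static_pod_path is a hypothetical helper name.
static_pod_path() {
  # $1: kubelet config file (defaults to the kubeadm location)
  awk '$1 == "staticPodPath:" { print $2 }' "${1:-/var/lib/kubelet/config.yaml}"
}

# Usage: ls -l "$(static_pod_path)"
```

After changing staticPodPath, the kubelet has to be restarted (systemctl restart kubelet) before it reads the new directory.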

② Binary deployment

All components are managed by systemd.

[root@k8s-node1 ~]# systemctl status kube-apiserver.service 
● kube-apiserver.service - Kubernetes API Server
   Loaded: loaded (/usr/lib/systemd/system/kube-apiserver.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-04-20 15:26:41 CST; 7 months 27 days ago
     Docs: https://github.com/kubernetes/kubernetes
 Main PID: 17587 (kube-apiserver)
    Tasks: 36
   Memory: 356.5M
   CGroup: /system.slice/kube-apiserver.service
           └─17587 /app/kubernetes/bin/kube-apiserver --logtostderr=false --v=2 --log-dir=/app/kubernetes/logs --etcd-...

Dec 16 16:22:11 k8s-node1 kube-apiserver[17587]: E1216 16:22:11.216916   17587 watcher.go:214] watch chan error: ...acted
Dec 16 16:38:14 k8s-node1 kube-apiserver[17587]: E1216 16:38:14.231035   17587 watcher.go:214] watch chan error: ...acted
Dec 16 16:51:27 k8s-node1 kube-apiserver[17587]: E1216 16:51:27.296324   17587 watcher.go:214] watch chan error: ...acted
Dec 16 17:04:51 k8s-node1 kube-apiserver[17587]: E1216 17:04:51.356825   17587 watcher.go:214] watch chan error: ...acted
Dec 16 17:20:04 k8s-node1 kube-apiserver[17587]: E1216 17:20:04.464772   17587 watcher.go:214] watch chan error: ...acted
Dec 16 17:28:03 k8s-node1 kube-apiserver[17587]: E1216 17:28:03.551942   17587 watcher.go:214] watch chan error: ...acted
Dec 16 17:38:01 k8s-node1 kube-apiserver[17587]: E1216 17:38:01.568538   17587 watcher.go:214] watch chan error: ...acted
Dec 16 17:52:41 k8s-node1 kube-apiserver[17587]: E1216 17:52:41.593466   17587 watcher.go:214] watch chan error: ...acted
Dec 16 18:01:48 k8s-node1 kube-apiserver[17587]: E1216 18:01:48.620521   17587 watcher.go:214] watch chan error: ...acted
Dec 16 18:16:43 k8s-node1 kube-apiserver[17587]: E1216 18:16:43.655648   17587 watcher.go:214] watch chan error: ...acted
Hint: Some lines were ellipsized, use -l to show in full.

Path of the service unit files: /usr/lib/systemd/system

③ Control-plane node components

  • kube-apiserver
  • kube-controller-manager
  • kube-scheduler
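
On a binary deployment, the state of the three components can be checked in one pass. This is a sketch: control_plane_status is a hypothetical helper, and the unit names for kube-controller-manager and kube-scheduler are assumed to follow the kube-apiserver.service pattern shown above. On a kubeadm deployment, look at the static Pods with kubectl get pods -n kube-system instead.

```shell
# Report the systemd state of each control-plane service (binary deployment).
# control_plane_status is a hypothetical helper; the unit names are assumed
# to match the component names.
control_plane_status() {
  for unit in kube-apiserver kube-controller-manager kube-scheduler; do
    if systemctl is-active --quiet "$unit"; then
      echo "$unit: active"
    else
      echo "$unit: NOT active (see: journalctl -u $unit)"
    fi
  done
}

# Usage: control_plane_status
```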

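3. Troubleshooting worker node failures

① Worker node components

  • kubelet       # calls the container runtime's API to manage containers and reports container status to the apiserver.
  • kube-proxy    # implements load balancing and service discovery for Pods: forwards each incoming request to the group of Pods behind a Service.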

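② Possible reasons a node is NotReady

  • the kubelet service failed to start
  • there is no network connectivity between the kubelet and the apiserver
  • the certificate the kubelet presents has a problem, for example it has expired
  • the node's disk is full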
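
The four causes above can be walked through on the affected node with a few commands. The sketch below is built on assumptions: node_notready_checks and disk_usage_pct are hypothetical helpers, the default apiserver address is the one used by this article's cluster, and the certificate path is where kubeadm normally keeps the kubelet's rotated client certificate. Any HTTP answer from the apiserver, even a 401 or 403, shows that the network path works.

```shell
# Disk usage (percent, without the % sign) of the filesystem holding a path.
disk_usage_pct() {
  df -P "${1:-/var/lib/kubelet}" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# One check per cause listed above. Hypothetical helper; the address and the
# certificate path are assumptions, adjust them to your cluster.
node_notready_checks() {
  # $1: apiserver host:port
  systemctl is-active kubelet
  curl -sk --max-time 5 "https://${1:-192.168.56.61:6443}/healthz"; echo
  openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem
  echo "disk used: $(disk_usage_pct /var/lib/kubelet)%"
}

# Usage: node_notready_checks 192.168.56.61:6443
```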

If the kubelet service has not been started:

systemctl start kubelet && systemctl enable kubelet

If the kubelet service fails to start:

journalctl -u kubelet  # read the log to find the cause
journalctl -u kubelet.service >kubelet.log  # or write it to a file and analyse it there
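
The full kubelet log is long, so it helps to keep only the error lines. Kubernetes components prefix error and fatal lines with E or F followed by the date, as in the kube-apiserver log shown earlier (E1216 ...). A minimal sketch; kubelet_errors is a hypothetical helper name and the 30-minute window is an arbitrary default.

```shell
# Show only error (E) and fatal (F) lines from the recent kubelet log.
# kubelet_errors is a hypothetical helper name.
kubelet_errors() {
  # $1: time window understood by journalctl --since (default: 30 min ago)
  journalctl -u kubelet --since "${1:-30 min ago}" --no-pager |
    grep -E '\]: [EF][0-9]{4} '
}

# Usage: kubelet_errors "2 hours ago"
```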

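4. Troubleshooting Service access failures

① How a user's request reaches a Service through a NodePort

client -> port that kube-proxy listens on (traffic received there is handled by iptables/ipvs rules) -> a group of Pods (spread across the nodes)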

[root@k8s-node1 ~]# kubectl get svc -n kube-system 
NAME                      TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE
grafana                   NodePort    10.0.0.202   <none>        3000:9006/TCP                258d

[root@k8s-node1 ~]# iptables-save |grep 9006
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-system/grafana:" -m tcp --dport 9006 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-system/grafana:" -m tcp --dport 9006 -j KUBE-SVC-3QDDWNGGGXWDZXKH
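
To follow the rules further, the KUBE-SVC chain that a node port jumps to can be extracted from the dump and then inspected on its own. A sketch based on the rule layout shown above, which may differ on other Kubernetes versions or in ipvs mode; svc_chain_for_nodeport is a hypothetical helper name.

```shell
# Read an iptables-save dump on stdin and print the KUBE-SVC-* chain that the
# given node port jumps to. svc_chain_for_nodeport is a hypothetical helper.
svc_chain_for_nodeport() {
  # $1: node port, for example 9006
  awk -v port="$1" '
    $1 == "-A" && $2 == "KUBE-NODEPORTS" {
      hit = 0; target = ""
      for (i = 3; i <= NF; i++) {
        if ($i == "--dport" && $(i + 1) == port) hit = 1
        if ($i == "-j") target = $(i + 1)
      }
      if (hit && target ~ /^KUBE-SVC-/) print target
    }'
}

# Usage:
#   chain=$(iptables-save -t nat | svc_chain_for_nodeport 9006)
#   iptables-save -t nat | grep -- "-A $chain"
```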

② Check that the Pods and the Service are running normally

[root@k8s-master ~]# kubectl get pods -o wide
NAME                   READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
web-5dcb957ccc-96nbn   1/1     Running   0          10m   10.244.36.93   k8s-node1   <none>           <none>
web-5dcb957ccc-j5sz7   1/1     Running   0          10m   10.244.36.66   k8s-node1   <none>           <none>
[root@k8s-master ~]# kubectl get svc
NAME          TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
kubernetes    ClusterIP   10.96.0.1      <none>        443/TCP        106d
web-service   NodePort    10.99.239.53   <none>        80:31100/TCP   10m

③ Check that the Service is correctly associated with the Pods

[root@k8s-master ~]# kubectl get ep
NAME          ENDPOINTS                         AGE
kubernetes    192.168.56.61:6443                106d
web-service   10.244.36.66:80,10.244.36.93:80   9m43s
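
An empty ENDPOINTS column usually means the Service's selector does not match the Pods' labels, or that no matching Pod is ready. The check can be scripted as below; service_has_endpoints is a hypothetical helper name.

```shell
# Succeed only if the Service has at least one ready endpoint address.
# service_has_endpoints is a hypothetical helper name.
service_has_endpoints() {
  [ -n "$(kubectl get endpoints "$1" -o jsonpath='{.subsets[*].addresses[*].ip}')" ]
}

# Usage: when there are no endpoints, compare the selector with the Pod labels.
#   service_has_endpoints web-service ||
#     kubectl get svc web-service -o jsonpath='{.spec.selector}'
#   kubectl get pods --show-labels
```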

④ Check that the targetPort specified by the Service is correct

[root@k8s-master ~]# kubectl exec  -it web-5dcb957ccc-96nbn -- netstat -lntp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      1/nginx: master pro 
tcp6       0      0 :::80                   :::*                    LISTEN      1/nginx: master pro
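
The comparison between the Service's targetPort and the ports the container really listens on can be scripted. A minimal sketch: is_listening is a hypothetical helper that parses netstat -lnt output, and it assumes targetPort is a number rather than a named port.

```shell
# Read `netstat -lnt` output on stdin and succeed if something listens on the
# given TCP port. is_listening is a hypothetical helper name.
is_listening() {
  # $1: numeric port (a named targetPort would have to be resolved first)
  awk -v port="$1" '
    $6 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
    END { exit found ? 0 : 1 }'
}

# Usage:
#   tp=$(kubectl get svc web-service -o jsonpath='{.spec.ports[0].targetPort}')
#   kubectl exec web-5dcb957ccc-96nbn -- netstat -lnt | is_listening "$tp" ||
#     echo "nothing listens on targetPort $tp"
```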

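⑤ Other reasons a Service may be unreachable

  • Does the Service work through DNS?
  • Is kube-proxy running normally?
  • Is kube-proxy writing its iptables rules correctly?
  • Is the CNI network plugin working normally?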
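
Each question above can be answered with one command. The sketch below rests on several assumptions: service_plumbing_checks is a hypothetical helper, the k8s-app labels are the ones kubeadm and the Calico manifests normally set, and busybox:1.28 is just a small image that ships nslookup.

```shell
# One quick check per question above. Hypothetical helper; the labels and the
# test image are assumptions, adjust them to your cluster.
service_plumbing_checks() {
  # $1: name of the Service to resolve
  kubectl run dns-test --rm -i --restart=Never --image=busybox:1.28 -- nslookup "$1"
  kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
  kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=20
  kubectl get pods -n kube-system -l k8s-app=calico-node -o wide
}

# Usage: service_plumbing_checks web-service
# On a node, the rules written by kube-proxy can be counted with:
#   iptables-save -t nat | grep -c KUBE-SVC
```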
posted @ 2020-12-16 15:09  闫新江