Fixing the kube-apiserver Restart Loop in a Kubernetes Cluster After a Machine-Room Power-Up
1. Background
Last year I helped a department deploy a Kubernetes cluster (1 master, 4 workers) in their test environment, configured at the time to use the company's internal image registry. In the first half of this year, a change to the company's machine-room network left that department's servers unable to reach the company registry, so I stood up a temporary Harbor registry in their test environment to receive the container images their applications build, and also pushed the images the Kubernetes cluster itself depends on into the new Harbor registry. Because the original deployment included a number of add-on components (logging, monitoring, microservice governance, and so on), the cluster carried a large number of workloads, and I did not go through them one by one to update their image addresses; some workloads were still configured to pull from the company registry. Since the test environment saw little use and the add-ons were stable (their Pods had not been rescheduled in the last six months), the cluster and its components kept running smoothly.
Last weekend the department's machine room was powered down for electrical-line reinforcement. On Monday morning, after power was restored, the Kubernetes cluster did not recover on its own, and their ops staff contacted me to help bring it back.
2. Troubleshooting and Resolution
Logging in to the cluster, I found kube-apiserver restarting continuously, which in turn kept the rest of the control-plane components restarting. The kube-apiserver log showed it repeatedly failing to connect to etcd on port 2379:
W0420 06:27:37.750969 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.111.1.134:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.111.1.134:2379: connect: connection refused". Reconnecting...
The etcd service log, in turn, showed the following error:
embed: rejected connection from (error "tls: oversized record received with length 64774", ServerName "")
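This "oversized record received" rejection generally means some client sent non-TLS (or otherwise mismatched) bytes to etcd's TLS listener, so it points at a stray client rather than at etcd itself. One quick way to confirm the TLS listener is healthy is to hit etcd's /health endpoint with the same client certificates; the helper below is a sketch (the function name is mine) reusing the certificate paths and endpoint from this cluster:

```shell
# Sketch: probe etcd's HTTPS listener with the client certs this
# cluster already uses; a small JSON health document in response
# means the TLS listener itself is fine.
probe_etcd_tls() {
  local certdir=/etc/ssl/etcd/ssl
  curl -sS \
    --cacert "$certdir/ca.pem" \
    --cert   "$certdir/member-master-sg-134.pem" \
    --key    "$certdir/member-master-sg-134-key.pem" \
    https://192.111.1.134:2379/health
}
# probe_etcd_tls   # prints a JSON health document on success
```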
Checking etcd's status showed the service itself running normally:
[root@master-sg-134 ~]# ETCDCTL_API=3 etcdctl --cacert=/etc/ssl/etcd/ssl/ca.pem --cert=/etc/ssl/etcd/ssl/member-master-sg-134.pem --key=/etc/ssl/etcd/ssl/member-master-sg-134-key.pem --endpoints="https://192.111.1.134:2379" endpoint status --write-out=table
+----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://192.111.1.134:2379 | 98184cb9ad9cec26 |  3.3.12 |   41 MB |      true |        21 |  195592703 |
+----------------------------+------------------+---------+---------+-----------+-----------+------------+
[root@master-sg-134 ~]# ETCDCTL_API=3 etcdctl --cacert=/etc/ssl/etcd/ssl/ca.pem --cert=/etc/ssl/etcd/ssl/member-master-sg-134.pem --key=/etc/ssl/etcd/ssl/member-master-sg-134-key.pem --endpoints="https://192.111.1.134:2379" endpoint health --write-out=table
https://192.111.1.134:2379 is healthy: successfully committed proposal: took = 844.331µs
[root@master-sg-134 ~]# ETCDCTL_API=3 etcdctl --cacert=/etc/ssl/etcd/ssl/ca.pem --cert=/etc/ssl/etcd/ssl/member-master-sg-134.pem --key=/etc/ssl/etcd/ssl/member-master-sg-134-key.pem --endpoints="https://192.111.1.134:2379" member list --write-out=table
+------------------+---------+-------+----------------------------+----------------------------+
|        ID        | STATUS  | NAME  |         PEER ADDRS         |        CLIENT ADDRS        |
+------------------+---------+-------+----------------------------+----------------------------+
| 98184cb9ad9cec26 | started | etcd1 | https://192.111.1.134:2380 | https://192.111.1.134:2379 |
+------------------+---------+-------+----------------------------+----------------------------+
[root@master-sg-134 ~]#
I then exported the etcd data to a file, etcd_data.json, and inspecting the export confirmed the cluster data was intact:
ETCDCTL_API=3 etcdctl --cacert=/etc/ssl/etcd/ssl/ca.pem --cert=/etc/ssl/etcd/ssl/member-master-sg-134.pem --key=/etc/ssl/etcd/ssl/member-master-sg-134-key.pem --endpoints="https://192.111.1.134:2379" get / --prefix > etcd_data.json
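Since `get / --prefix` dumps alternating key and value lines, one quick way to sanity-check the export is to count keys under the /registry/ prefix, where Kubernetes stores all of its objects. The helper below is a sketch (the function name is mine, not from the original session):

```shell
# Count Kubernetes object keys in an etcdctl dump: every Kubernetes
# object lives under the /registry/ prefix, so a healthy cluster dump
# should show many such keys.
count_registry_keys() {
  grep -c '^/registry/' "$1"
}
# count_registry_keys etcd_data.json
```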
Next, back up the etcd data directory:
tar -zcvf etcd-20221205.tar.gz /var/lib/etcd
The checks above showed that the etcd service was running normally and reachable via the etcdctl client, so etcd itself was fine. Yet the kube-apiserver Pod kept restarting with errors about being unable to reach etcd, even though the etcd-related settings in /etc/kubernetes/manifests/kube-apiserver.yaml were all correct. The next step, then, was to restart the kube-apiserver static Pod.
    - --etcd-cafile=/etc/ssl/etcd/ssl/ca.pem
    - --etcd-certfile=/etc/ssl/etcd/ssl/node-master-sg-134.pem
    - --etcd-keyfile=/etc/ssl/etcd/ssl/node-master-sg-134-key.pem
    - --etcd-servers=https://192.111.1.134:2379
Because kube-apiserver runs as a static Pod, stopping and removing its containers with docker (both the pause sandbox container and the container running the kube-apiserver process) should cause kubelet to recreate the static Pod automatically. However, after I deleted the kube-apiserver containers, kubelet never recreated them.
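The removal step can be sketched as follows, assuming a Docker runtime and kubeadm's container naming convention (`k8s_<container>_<pod>_...` for app containers, `k8s_POD_...` for the pause sandbox); the function name is illustrative:

```shell
# Remove a static Pod's containers so that kubelet recreates them from
# the manifest in /etc/kubernetes/manifests. Assumes the Docker runtime
# and kubeadm's k8s_* container naming.
remove_static_pod_containers() {
  local pod="$1"
  # remove the app container(s) first, then the pause sandbox
  docker ps -a --filter "name=k8s_${pod}" --format '{{.ID}}' | xargs -r docker rm -f
  docker ps -a --filter "name=k8s_POD_${pod}" --format '{{.ID}}' | xargs -r docker rm -f
}
# remove_static_pod_containers kube-apiserver
```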
So I turned to the kubelet logs and the Docker daemon logs. The Docker logs showed that the image registry had no library/pause image:
Dec 05 16:58:20 master-sg-134 dockerd[1403]: time="2022-12-05T16:58:20.610370733+08:00" level=warning msg="Error getting v2 registry: Get https://192.111.1.137:80/v2/: http: server gave HTTP response to HTTPS client"
Dec 05 16:58:20 master-sg-134 dockerd[1403]: time="2022-12-05T16:58:20.650481522+08:00" level=error msg="Not continuing with pull after error: unknown: repository library/pause not found
A healthy node already has this image locally and never needs to pull it; most likely someone cleaned up the server's disks after the power-up. Further checking showed that kubelet was still configured to use the company registry, so I changed the kubelet configuration to point at the Harbor registry deployed in their test environment and restarted the kubelet service.
[root@master-sg-134 ~]# cat /var/lib/kubelet/kubeadm-flags.env
KUBELET_KUBEADM_ARGS="--cgroup-driver=cgroupfs --network-plugin=cni --pod-infra-container-image=<new-test-harbor-registry>/xxxxx/pause:3.2"
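The change can also be scripted. The sketch below rewrites the `--pod-infra-container-image` value in kubeadm-flags.env (the function name is mine, and the image in the usage comment is a placeholder; the real registry address and repository depend on the environment):

```shell
# Rewrite kubelet's pause-image flag in kubeadm-flags.env; kubelet must
# be restarted afterwards for the new value to take effect.
set_pause_image() {
  local image="$1" env_file="${2:-/var/lib/kubelet/kubeadm-flags.env}"
  sed -i "s|--pod-infra-container-image=[^ \"]*|--pod-infra-container-image=${image}|" "$env_file"
}
# set_pause_image <new-test-harbor-registry>/xxxxx/pause:3.2
# systemctl restart kubelet
```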
After restarting kubelet, the kube-apiserver static Pod came up successfully, and the remaining Kubernetes control-plane components and installed add-ons followed. I then went through every component still failing on image pulls and changed its image address to the Harbor registry deployed in their test environment, after which the entire Kubernetes cluster was back in service.
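Finding which workloads still point at the old registry can be done in one pass. The sketch below lists every container image in the cluster (the registry name in the grep is a placeholder) and includes a small helper, with a name of my choosing, for rewriting the registry prefix:

```shell
# Helper: swap the registry prefix of an image reference, e.g.
# swap_registry old.reg new.reg old.reg/ns/app:v1 -> new.reg/ns/app:v1
swap_registry() {
  local old="$1" new="$2" image="$3"
  printf '%s\n' "$image" | sed "s|^${old}/|${new}/|"
}
# Listing images still on the old registry (run against the cluster;
# "old.company.registry" is a placeholder):
# kubectl get pods -A \
#   -o jsonpath='{range .items[*].spec.containers[*]}{.image}{"\n"}{end}' \
#   | sort -u | grep '^old.company.registry/'
```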
3. Summary
When components fail to start, troubleshoot along their dependency chain: focus on each component's logs, and analyze and resolve the problem based on what they say.