Troubleshooting a control-plane node that fails to join a Kubernetes cluster
The following command was used to join kube-master1 to the k8s cluster as a control-plane node:
kubeadm join k8s-api:6443 \
--token ****** \
--discovery-token-ca-cert-hash ****** \
--control-plane \
--certificate-key *****
The join hangs while the node is being added to the etcd cluster:
[etcd] Announced new etcd member joining to the existing etcd cluster
[etcd] Creating static Pod manifest for "etcd"
[etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s
[kubelet-check] Initial timeout of 40s passed.
An etcd error log turned up under /var/log/containers:
{
"level": "warn",
"ts": "2022-05-20T23:25:34.108Z",
"caller": "etcdserver/cluster_util.go:79",
"msg": "failed to get cluster response",
"address": "https://10.0.9.171:2380/members",
"error": "Get \"https://10.0.9.171:2380/members\": x509: certificate is valid for 10.0.1.81, 127.0.0.1, ::1, not 10.0.9.171"
}
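According to the log, the request to https://10.0.9.171:2380/members fails because 10.0.9.171 presents the wrong certificate: it is valid for 10.0.1.81, 127.0.0.1 and ::1, but not for 10.0.9.171. 10.0.9.171 is the existing control-plane in the cluster, with hostname kube-master0. 10.0.1.81 is the former control-plane, with hostname k8s-master0.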
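To make the mismatch in that log line easier to spot, here is a minimal sketch that splits the x509 error string into the addresses the certificate covers and the address that was actually dialed. The function name parse_x509_mismatch is hypothetical, and the sed patterns assume the error text has exactly the "certificate is valid for ..., not ..." shape shown in the log above.

```shell
# Hypothetical helper (not part of kubeadm or etcd): split an x509
# address-mismatch error into its two halves.
# Input:  the "error" string from the etcd log, as a single argument.
# Output: two lines, valid_for=<SAN list> and requested=<dialed address>.
parse_x509_mismatch() {
  local msg="$1"
  local valid requested
  # Everything between "certificate is valid for " and ", not ".
  valid=$(printf '%s\n' "$msg" \
    | sed -n 's/.*certificate is valid for \(.*\), not .*/\1/p')
  # The token that follows ", not " (stops at a space or a quote).
  requested=$(printf '%s\n' "$msg" \
    | sed -n 's/.*, not \([^ "]*\).*/\1/p')
  printf 'valid_for=%s\nrequested=%s\n' "$valid" "$requested"
}
```

Fed the error string from the log above, it prints valid_for=10.0.1.81, 127.0.0.1, ::1 and requested=10.0.9.171, which shows at a glance that the address of kube-master0 is missing from the certificate.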
Check the certificate with openssl:
openssl s_client -showcerts -servername 10.0.9.171 -connect 10.0.9.171:2380
It is indeed a certificate problem: the server is presenting the old k8s-master0 certificate.
---
Certificate chain
0 s:CN = k8s-master0
i:CN = etcd-ca
-----BEGIN CERTIFICATE-----
******
-----END CERTIFICATE-----
---
Server certificate
subject=CN = k8s-master0
issuer=CN = etcd-ca
---
Acceptable client certificate CA names
CN = etcd-ca
On the kube-master0 server, inspect the certificates in /etc/kubernetes/pki/etcd:
openssl x509 -in server.crt -text -noout
openssl x509 -in peer.crt -text -noout
They are indeed still the certificates that k8s-master0 used.
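Reading the full -text dump by eye works, but the check can also be scripted. The sketch below is a hypothetical helper, san_has_ip, that reads the output of openssl x509 -text on stdin and succeeds only if the given IP appears in the Subject Alternative Name list. It assumes the usual OpenSSL text layout, where the SAN entries sit on the line right after the "X509v3 Subject Alternative Name" header as comma-separated "IP Address:..." items.

```shell
# Hypothetical helper: succeed (exit 0) if the certificate text on stdin
# lists the given IP in its Subject Alternative Name extension.
# Assumes the layout printed by `openssl x509 -text`.
san_has_ip() {
  local ip="$1"
  # Take the SAN header plus the entry line after it, put one entry per
  # line, trim whitespace, then look for an exact "IP Address:<ip>" item.
  grep -A1 'Subject Alternative Name' \
    | tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -Fx "IP Address:$ip" >/dev/null
}

# Example use, run inside /etc/kubernetes/pki/etcd on kube-master0:
#   openssl x509 -in server.crt -noout -text | san_has_ip 10.0.9.171 \
#     || echo "server.crt does not cover 10.0.9.171"
#   openssl x509 -in peer.crt -noout -text | san_has_ip 10.0.9.171 \
#     || echo "peer.crt does not cover 10.0.9.171"
```

The exact-line match (-Fx) is deliberate, so that an address such as 10.0.1.8 is not mistaken for 10.0.1.81.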
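With the cause identified, the fix is straightforward: regenerate the certificates that etcd uses.
Delete every certificate file in /etc/kubernetes/pki/etcd
except ca.crt and ca.key, then regenerate the certificates with the commands below: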
kubeadm init phase certs etcd-server
kubeadm init phase certs etcd-peer
kubeadm init phase certs etcd-healthcheck-client
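The cleanup that precedes these commands can be sketched as a small function. Note that this variant moves the old files into a backup directory instead of deleting them, which is a deliberate change from the steps described above so that the old certificates can be restored if needed. The name stash_etcd_leaf_certs and the backup path in the usage comment are hypothetical. As far as I know kubeadm does not overwrite certificate files that already exist, which would explain why the old files have to be out of the way first.

```shell
# Hypothetical helper: move every regular file except ca.crt and ca.key
# out of an etcd PKI directory into a backup directory.
# Moving rather than deleting is a choice made for this sketch.
stash_etcd_leaf_certs() {
  local pki_dir="$1" backup_dir="$2"
  local f name
  mkdir -p "$backup_dir" || return 1
  for f in "$pki_dir"/*; do
    [ -f "$f" ] || continue            # skip subdirectories and empty globs
    name=$(basename "$f")
    case "$name" in
      ca.crt|ca.key) ;;                # keep the etcd CA untouched
      *) mv -- "$f" "$backup_dir/" || return 1 ;;
    esac
  done
}

# Example use on kube-master0 (backup path is an assumption):
#   stash_etcd_leaf_certs /etc/kubernetes/pki/etcd /root/etcd-pki-backup
#   kubeadm init phase certs etcd-server
#   kubeadm init phase certs etcd-peer
#   kubeadm init phase certs etcd-healthcheck-client
```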
On kube-master0, remove kube-master1, which failed to join, from the cluster:
kubectl delete node kube-master1
On kube-master1, reset the node and join the cluster again:
kubeadm reset
kubeadm join k8s-api:6443 ...
The join succeeded and the problem is finally solved!
[etcd] Announced new etcd member joining to the existing etcd cluster
[etcd] Creating static Pod manifest for "etcd"
[etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s
The 'update-status' phase is deprecated and will be removed in a future release. Currently it performs no operation
[mark-control-plane] Marking the node kube-master1 as control-plane by adding the labels: [node-role.kubernetes.io/control-plane node.kubernetes.io/exclude-from-external-load-balancers]
[mark-control-plane] Marking the node kube-master1 as control-plane by adding the taints [node-role.kubernetes.io/master:NoSchedule node-role.kubernetes.io/control-plane:NoSchedule]
This node has joined the cluster and a new control plane instance was created: