排查 Kubernetes 集群无法加入 control-plane 的问题

使用下面的命令将 kube-master1 作为 control-plane 加入 k8s 集群

kubeadm join k8s-api:6443 \
  --token ****** \
  --discovery-token-ca-cert-hash ****** \
  --control-plane \
  --certificate-key *****

加入 etcd 集群时卡住

[etcd] Announced new etcd member joining to the existing etcd cluster
[etcd] Creating static Pod manifest for "etcd"
[etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s
[kubelet-check] Initial timeout of 40s passed.

在 /var/log/containers 中发现 etcd 的错误日志

{
  "level": "warn",
  "ts": "2022-05-20T23:25:34.108Z",
  "caller": "etcdserver/cluster_util.go:79",
  "msg": "failed to get cluster response",
  "address": "https://10.0.9.171:2380/members",
  "error": "Get \"https://10.0.9.171:2380/members\": x509: certificate is valid for 10.0.1.81, 127.0.0.1, ::1, not 10.0.9.171"
}

从日志看是请求 https://10.0.9.171:2380/members 时,10.0.9.171 返回的证书不对。10.0.9.171 是集群中现有的 control-plane,主机名是 kube-master0。10.0.1.81 是以前的 control-plane,主机名是 k8s-master0。

用 openssl 命令检查证书

openssl s_client -showcerts -servername 10.0.9.171 -connect 10.0.9.171:2380

的确是证书问题,用的是以前的 k8s-master0 证书

---
Certificate chain
 0 s:CN = k8s-master0
   i:CN = etcd-ca
-----BEGIN CERTIFICATE-----
******
-----END CERTIFICATE-----
---
Server certificate
subject=CN = k8s-master0

issuer=CN = etcd-ca

---
Acceptable client certificate CA names
CN = etcd-ca

到 kube-master0 服务上检查 /etc/kubernetes/pki/etcd 中的证书

openssl x509 -in server.crt -text -noout
openssl x509 -in peer.crt -text -noout

的确还是以前 k8s-master0 使用的证书。

知道了问题原因,就很好解决了,重新生成 etcd 用到的证书。

删除 /etc/kubernetes/pki/etcd 中除了 ca.crt 与 ca.key 之外的证书文件,用下面的命令重新生成证书

kubeadm init phase certs etcd-server
kubeadm init phase certs etcd-peer
kubeadm init phase certs etcd-healthcheck-client

在 kube-master0 上从集群中删除没成功加入集群的 kube-master1

kubectl delete node kube-master1

在 kube-master1 退出集群并重新加入

kubeadm reset
kubeadm join k8s-api:6443 ...

加入成功!问题终于解决!

[etcd] Announced new etcd member joining to the existing etcd cluster
[etcd] Creating static Pod manifest for "etcd"
[etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s
The 'update-status' phase is deprecated and will be removed in a future release. Currently it performs no operation
[mark-control-plane] Marking the node kube-master1 as control-plane by adding the labels: [node-role.kubernetes.io/control-plane node.kubernetes.io/exclude-from-external-load-balancers]
[mark-control-plane] Marking the node kube-master1 as control-plane by adding the taints [node-role.kubernetes.io/master:NoSchedule node-role.kubernetes.io/control-plane:NoSchedule]

This node has joined the cluster and a new control plane instance was created:
posted @ 2022-05-21 08:31  dudu  阅读(2645)  评论(1编辑  收藏  举报