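Handling an etcd Node Failure
Problem: a routine inspection showed that the etcd cluster backing the k8s cluster was in an abnormal state, with one member unhealthy. The symptom was as follows: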
[root@k8s-master1 ~]# kubectl get cs
NAME                 STATUS      MESSAGE                                  ERROR
controller-manager   Healthy     ok
scheduler            Healthy     ok
etcd-1               Healthy     {"health":"true"}
etcd-0               Healthy     {"health":"true"}
etcd-2               Unhealthy   HTTP probe failed with statuscode: 503
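The etcd logs contained few error messages, the system time and the certificates were fine, and no firewall was in the way, so I proceeded as follows.
1. Remove the faulty etcd node from the cluster: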
[root@k8s-master1 ~]# /opt/etcd/bin/etcdctl \
    --cacert=/opt/etcd/ssl/ca.pem \
    --cert=/opt/etcd/ssl/server.pem \
    --key=/opt/etcd/ssl/server-key.pem \
    --endpoints="https://172.16.23.120:2379,https://172.16.23.121:2379,https://172.16.23.122:2379" \
    member list
20fd79755169a89, started, etcd-3, https://172.16.23.122:2380, https://172.16.23.122:2379, false
39356a19c9b19f6d, started, etcd-1, https://172.16.23.120:2380, https://172.16.23.120:2379, false
506e9a48a5c19ec3, started, etcd-2, https://172.16.23.121:2380, https://172.16.23.121:2379, false
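From the output above, the faulty endpoint that kubectl reports as etcd-2 is the member named etcd-3, that is, the machine at 172.16.23.122. The two names differ because kubectl get cs numbers the etcd endpoints from 0 (etcd-0 to etcd-2), while the member names here start at 1 (etcd-1 to etcd-3).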
[root@k8s-master1 ~]# /opt/etcd/bin/etcdctl \
    --cacert=/opt/etcd/ssl/ca.pem \
    --cert=/opt/etcd/ssl/server.pem \
    --key=/opt/etcd/ssl/server-key.pem \
    --endpoints="https://172.16.23.120:2379,https://172.16.23.121:2379,https://172.16.23.122:2379" \
    member remove 20fd79755169a89
Member 20fd79755169a89 removed from cluster ad1f122f981ee2bf
[root@k8s-master1 ~]# /opt/etcd/bin/etcdctl \
    --cacert=/opt/etcd/ssl/ca.pem \
    --cert=/opt/etcd/ssl/server.pem \
    --key=/opt/etcd/ssl/server-key.pem \
    --endpoints="https://172.16.23.120:2379,https://172.16.23.121:2379,https://172.16.23.122:2379" \
    member list
39356a19c9b19f6d, started, etcd-1, https://172.16.23.120:2380, https://172.16.23.120:2379, false
506e9a48a5c19ec3, started, etcd-2, https://172.16.23.121:2380, https://172.16.23.121:2379, false
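The lookup-and-remove step above repeats the same TLS and endpoint flags on every call and relies on copying the member ID by hand. A minimal sketch of how this could be wrapped is shown below. The function names `ectl` and `member_id_for_host` and the `ETCDCTL_BIN` variable are hypothetical helpers introduced for illustration; the paths and endpoints are the ones used in this post, and the parsing assumes the default comma-separated `member list` output shown above.

```shell
# Hypothetical wrapper: run etcdctl with the TLS and endpoint flags used in this post.
# ETCDCTL_BIN can be overridden; it defaults to the binary path from the post.
ectl() {
  "${ETCDCTL_BIN:-/opt/etcd/bin/etcdctl}" \
    --cacert=/opt/etcd/ssl/ca.pem \
    --cert=/opt/etcd/ssl/server.pem \
    --key=/opt/etcd/ssl/server-key.pem \
    --endpoints="https://172.16.23.120:2379,https://172.16.23.121:2379,https://172.16.23.122:2379" \
    "$@"
}

# Hypothetical helper: read `member list` output on stdin and print the ID
# (first field) of the member whose peer URL (fourth field) is on the given host.
member_id_for_host() {
  awk -F', ' -v h="$1" 'index($4, "//" h ":") { print $1 }'
}
```

With these defined, `ectl member list | member_id_for_host 172.16.23.122` would print the ID to pass to `ectl member remove`. Checking the printed ID against the listing before removing anything is still advisable.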
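2. With the faulty member (etcd-2 in the kubectl output, named etcd-3 in the member list) removed from the cluster, switch to the etcd-3 node (172.16.23.122), delete its etcd data, and then change the cluster state in the etcd configuration file from new to existing: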
# rm -rf /var/lib/etcd/default.etcd/member/
Edit the etcd configuration file and change new to existing:
Before:
ETCD_INITIAL_CLUSTER_STATE="new"
After:
ETCD_INITIAL_CLUSTER_STATE="existing"
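The before/after edit above can also be made non-interactively. The sketch below is one way to do it with `sed`, keeping a backup copy. The function name `set_cluster_state_existing` is hypothetical, and the configuration file path is not given in this post, so it is passed in as an argument rather than assumed.

```shell
# Hypothetical helper: switch ETCD_INITIAL_CLUSTER_STATE from "new" to "existing"
# in the given etcd configuration file, leaving the original as <file>.bak.
set_cluster_state_existing() {
  conf="$1"
  sed -i.bak \
    's/^ETCD_INITIAL_CLUSTER_STATE="new"/ETCD_INITIAL_CLUSTER_STATE="existing"/' \
    "$conf"
}
```

Only a line that starts with the exact `ETCD_INITIAL_CLUSTER_STATE="new"` setting is changed; every other line in the file is left as it was.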
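3. Then add the etcd-3 node (172.16.23.122) back to the cluster. Note that the recorded command passes etcd-2 as the member name even though the peer URL belongs to the etcd-3 node; the peer URL is what identifies the machine being added: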
[root@k8s-master1 ~]# /opt/etcd/bin/etcdctl \
    --cacert=/opt/etcd/ssl/ca.pem \
    --cert=/opt/etcd/ssl/server.pem \
    --key=/opt/etcd/ssl/server-key.pem \
    --endpoints="https://172.16.23.120:2379,https://172.16.23.121:2379,https://172.16.23.122:2379" \
    member add etcd-2 --peer-urls=https://172.16.23.122:2380
Member a98137c10970d43c added to cluster ad1f122f981ee2bf
Then check the member list:
[root@k8s-master1 ~]# /opt/etcd/bin/etcdctl \
    --cacert=/opt/etcd/ssl/ca.pem \
    --cert=/opt/etcd/ssl/server.pem \
    --key=/opt/etcd/ssl/server-key.pem \
    --endpoints="https://172.16.23.120:2379,https://172.16.23.121:2379,https://172.16.23.122:2379" \
    member list
39356a19c9b19f6d, started, etcd-1, https://172.16.23.120:2380, https://172.16.23.120:2379, false
506e9a48a5c19ec3, started, etcd-2, https://172.16.23.121:2380, https://172.16.23.121:2379, false
a98137c10970d43c, unstarted, , https://172.16.23.122:2380, , false
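The unstarted status of the new member in the listing above is expected until etcd is started on that node. For re-checking the same listing afterwards, a small sketch such as the following could be used. The function name `all_members_started` is hypothetical, and it assumes the default comma-separated `member list` format shown above, where the second field is the member status.

```shell
# Hypothetical helper: read `member list` output on stdin and exit 0 only if
# every member's status (second field) is "started"; exit 1 otherwise.
all_members_started() {
  awk -F', ' '$2 != "started" { bad = 1 } END { exit (bad ? 1 : 0) }'
}
```

Piping the member list output into `all_members_started` returns a non-zero exit status while any member is still unstarted, which makes it usable as a condition in a retry loop.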
4. Start etcd on the failed node:
[root@k8s-master3 ~]# systemctl start etcd
[root@k8s-master3 ~]# systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since 日 2021-02-28 22:04:34 CST; 4s ago
Finally, check the etcd status as seen by the k8s cluster:
[root@k8s-master1 ~]# kubectl get cs
NAME                 STATUS    MESSAGE             ERROR
scheduler            Healthy   ok
controller-manager   Healthy   ok
etcd-2               Healthy   {"health":"true"}
etcd-0               Healthy   {"health":"true"}
etcd-1               Healthy   {"health":"true"}
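For routine inspections, the health check above can be reduced to a list of failing components. The sketch below is one way to do that. The function name `unhealthy_components` is hypothetical, and it assumes the whitespace-separated table layout that `kubectl get cs` prints in this post, with a header row and the status in the second column. Note that ComponentStatus has been deprecated since Kubernetes v1.19, so this check may not be available on newer clusters.

```shell
# Hypothetical helper: read `kubectl get cs` output on stdin and print the name
# of every component whose STATUS column is not "Healthy" (header row skipped).
unhealthy_components() {
  awk 'NR > 1 && $2 != "Healthy" { print $1 }'
}
```

Run as `kubectl get cs | unhealthy_components`, it prints nothing when all components are healthy, so an empty result can be treated as a pass.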