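k8s etcd Failure Recovery
I. Remove the failed node and rejoin it to the cluster
1. Check cluster status
Get the member IDs of the leader and of the failed node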
ETCDCTL_API=3 ./etcdctl --cacert=/etc/kubernetes/ssl/new-ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem --endpoints="https://192.168.7.132:2379,https://192.168.7.134:2379,https://192.168.7.135:2379" endpoint status --write-out=table
Identify the failed node
ETCDCTL_API=3 ./etcdctl --cacert=/etc/kubernetes/ssl/new-ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem --endpoints="https://192.168.7.132:2379,https://192.168.7.134:2379,https://192.168.7.135:2379" endpoint health --write-out=table
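The member ID needed for removal can also be pulled out of `member list` by name instead of being copied by hand. The following is only a sketch: it assumes etcdctl's default (non-table) `member list` output, where each line is a comma-separated record with the ID first and the member name third. The helper name member_id_by_name is hypothetical.

```shell
# Print the ID of the member whose name equals $1.
# Reads `etcdctl member list` default (simple) output on stdin,
# assumed to look like: ID, status, name, peer URLs, client URLs, is-learner
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}
```

Possible usage, with the same TLS flags and endpoints as the commands above: pipe the output of `member list` (without --write-out=table) into `member_id_by_name etcd3`.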
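2. Remove the failed member
Run this against a healthy endpoint, using the failed member's ID obtained in step 1.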
ETCDCTL_API=3 etcdctl --endpoints="https://172.16.169.82:2379" --cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem member remove 971a0fee3d275c5
Verify that the member was removed
ETCDCTL_API=3 ./etcdctl --cacert=/etc/kubernetes/ssl/new-ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem --endpoints="https://192.168.7.132:2379,https://192.168.7.134:2379,https://192.168.7.135:2379" member list --write-out=table
3. Re-add the failed node
3.1 Check for old etcd data on the failed node and clear it
rm -rf etcd3.etcd/
3.2 Edit the startup parameters of etcd3, the node rejoining the cluster
vi /etc/etcd/etcd.conf
Change etcd's --initial-cluster-state startup parameter to --initial-cluster-state=existing, then save the file.
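If etcd.conf is an environment-variable style file (an assumption, suggested by the ETCD_NAME setting this guide refers to), the same change is expressed through ETCD_INITIAL_CLUSTER_STATE, the environment form of the --initial-cluster-state flag. A minimal fragment might look like this; all other settings stay as they are:

```shell
# /etc/etcd/etcd.conf (fragment only; layout of the file is an assumption)
ETCD_NAME="etcd3"
# "existing" tells etcd to join the running cluster instead of bootstrapping a new one
ETCD_INITIAL_CLUSTER_STATE="existing"
```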
3.3 Copy the certificates from the etcd master node
scp -rp etcd1:/etc/etcd/ssl/ /etc/etcd/
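After copying, check the owner of the etcd certificate files; they must be readable by the user the etcd service runs as.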
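One way to make the ownership check concrete is to list every file that does not belong to the expected user. This is a sketch only: the helper name files_not_owned_by is hypothetical, and the service account name (for example etcd) depends on how etcd was installed.

```shell
# List files under directory $1 that are NOT owned by user $2.
# Prints nothing when every file has the expected owner.
files_not_owned_by() {
  dir=$1
  user=$2
  find "$dir" ! -user "$user" -print
}
```

For example, `files_not_owned_by /etc/etcd/ssl etcd` should print nothing if the copied certificates already belong to an etcd user; any path it prints would need a chown.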
3.4 Re-add the failed node etcd3
Note: etcd3 is the ETCD_NAME value in etcd.conf
ETCDCTL_API=3 etcdctl --endpoints="https://172.16.169.82:2379" --cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem member add etcd3 --peer-urls=https://172.16.169.83:2380
3.5 Restart the service and verify that it rejoined
systemctl daemon-reload
systemctl restart etcd
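II. Rebuild the cluster from a backup
Note: this method can lose data (I have experienced this first-hand), so prefer the first method whenever possible.
- When backing up the etcd cluster, only one member's data needs to be backed up; the snapshot is then copied to the other nodes
- When restoring etcd data, the restore command must be run on every node
Restore order: stop kube-apiserver -> stop etcd -> restore data -> start etcd -> start kube-apiserver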
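etcd's data files are under /var/lib/etcd, in two directories
- snap: holds snapshot data. etcd takes snapshots so that WAL files do not pile up; a snapshot stores the state of the etcd data
- wal: holds the write-ahead log, whose main role is to record the full history of data changes. In etcd, every modification is written to the WAL before it is committed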
1. Find the healthy nodes
ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/etcd-ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.15.215:2379,https://192.168.15.216:2379,https://192.168.15.217:2379" endpoint health --write-out=table
2. Back up the etcd data on a healthy node
ETCDCTL_API=3 etcdctl snapshot save backup.db \
  --cacert=/opt/etcd/ssl/etcd-ca.pem \
  --cert=/opt/etcd/ssl/server.pem \
  --key=/opt/etcd/ssl/server-key.pem \
  --endpoints="https://192.168.15.215:2379"
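Before distributing the snapshot, it may be worth confirming that the file is non-empty and recording a checksum that can be compared on the receiving nodes. The sketch below uses only POSIX tools; the helper name snapshot_fingerprint is hypothetical and it does not validate the snapshot's internal structure.

```shell
# Print a checksum and byte count for snapshot file $1.
# Fails (non-zero exit) if the file is missing or empty.
snapshot_fingerprint() {
  if [ ! -s "$1" ]; then
    echo "snapshot $1 is missing or empty" >&2
    return 1
  fi
  cksum < "$1"
}
```

Running `snapshot_fingerprint backup.db` on the source node and again on each node after copying should print identical values. For a check of the snapshot contents themselves, `etcdctl snapshot status backup.db --write-out=table` is also available in etcd v3.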
3. Copy the backup file to the other etcd nodes
for i in {etcd2,etcd3}; do scp backup.db root@$i:/root; done
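4. Restore the etcd data
Before restoring, stop the apiserver and etcd, then delete the etcd data files
systemctl stop kube-apiserver   # stop the apiserver
systemctl stop etcd             # stop etcd
rm -rf /opt/etcd/data/*         # delete the data directory contents; adjust the path to your environment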
Run the restore command on node 1:
ETCDCTL_API=3 etcdctl snapshot restore backup.db \
  --name etcd-1 \
  --initial-cluster="etcd-1=https://192.168.15.215:2380,etcd-2=https://192.168.15.216:2380,etcd-3=https://192.168.15.217:2380" \
  --initial-cluster-token=etcd-cluster \
  --initial-advertise-peer-urls=https://192.168.15.215:2380 \
  --data-dir=/opt/etcd/data
Run the restore command on node 2:
ETCDCTL_API=3 etcdctl snapshot restore backup.db \
  --name etcd-2 \
  --initial-cluster="etcd-1=https://192.168.15.215:2380,etcd-2=https://192.168.15.216:2380,etcd-3=https://192.168.15.217:2380" \
  --initial-cluster-token=etcd-cluster \
  --initial-advertise-peer-urls=https://192.168.15.216:2380 \
  --data-dir=/opt/etcd/data
Run the restore command on node 3:
ETCDCTL_API=3 etcdctl snapshot restore backup.db \
  --name etcd-3 \
  --initial-cluster="etcd-1=https://192.168.15.215:2380,etcd-2=https://192.168.15.216:2380,etcd-3=https://192.168.15.217:2380" \
  --initial-cluster-token=etcd-cluster \
  --initial-advertise-peer-urls=https://192.168.15.217:2380 \
  --data-dir=/opt/etcd/data
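The three restore commands above differ only in --name and --initial-advertise-peer-urls. As a sketch, a small helper can print the command for a given node so that the per-node differences are easy to review before anything is run. The helper name restore_cmd and the INITIAL_CLUSTER variable are hypothetical; the values are taken from the commands above.

```shell
# Cluster layout copied from the restore commands in this guide.
INITIAL_CLUSTER="etcd-1=https://192.168.15.215:2380,etcd-2=https://192.168.15.216:2380,etcd-3=https://192.168.15.217:2380"

# Print (do not execute) the restore command for one node.
# $1 = member name, $2 = that member's IP address
restore_cmd() {
  printf 'ETCDCTL_API=3 etcdctl snapshot restore backup.db --name %s --initial-cluster="%s" --initial-cluster-token=etcd-cluster --initial-advertise-peer-urls=https://%s:2380 --data-dir=/opt/etcd/data\n' \
    "$1" "$INITIAL_CLUSTER" "$2"
}
```

For instance, `restore_cmd etcd-2 192.168.15.216` prints the command intended for node 2. Printing rather than executing keeps a mistyped name or address from touching the data directory.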
After the restore completes, start etcd on every node and then start kube-apiserver, following the restore order given earlier:
systemctl start etcd
systemctl start kube-apiserver
5. Check cluster status
ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/etcd-ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.15.215:2379,https://192.168.15.216:2379,https://192.168.15.217:2379" endpoint health --write-out=table