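Notes on a K8S cluster on Alibaba Cloud: an etcd member that would not start after a disk snapshot restore, and how to remove and add etcd members
- Environment
System environment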
# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
# uname -a
Linux CentOS7K8SMaster01005101 3.10.0-1160.114.2.el7.x86_64 #1 SMP Wed Mar 20 15:54:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Software environment
# kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:37:52Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
# /opt/etcd/bin/etcd --version
etcd Version: 3.3.10
Git SHA: 27fc7e2
Go Version: go1.10.4
Go OS/Arch: linux/amd64
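- Symptoms
One Alibaba Cloud server was hit by a cryptomining infection. After its disk was restored from a snapshot, etcd would not start.
/var/log/messages showed the following entries: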
etcd: member fa1b88064457e060 has already been bootstrapped
systemd: etcd.service: main process exited, code=exited, status=1/FAILURE
systemd: Unit etcd.service entered failed state.
systemd: etcd.service failed.
systemd: etcd.service holdoff time over, scheduling restart.
etcd: recognized environment variable ETCD_NAME, but unused: shadowed by corresponding flag
etcd: recognized environment variable ETCD_DATA_DIR, but unused: shadowed by corresponding flag
etcd: recognized environment variable ETCD_LISTEN_PEER_URLS, but unused: shadowed by corresponding flag
etcd: recognized environment variable ETCD_LISTEN_CLIENT_URLS, but unused: shadowed by corresponding flag
etcd: recognized environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS, but unused: shadowed by corresponding flag
etcd: recognized environment variable ETCD_ADVERTISE_CLIENT_URLS, but unused: shadowed by corresponding flag
etcd: recognized environment variable ETCD_INITIAL_CLUSTER, but unused: shadowed by corresponding flag
etcd: recognized environment variable ETCD_INITIAL_CLUSTER_TOKEN, but unused: shadowed by corresponding flag
etcd: recognized environment variable ETCD_INITIAL_CLUSTER_STATE, but unused: shadowed by corresponding flag
etcd: etcd Version: 3.3.10
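To find the relevant entry quickly in a busy log, the member ID can be pulled out of the "already been bootstrapped" message. This is a minimal sketch; the function name bootstrapped_members is made up for illustration, and it assumes the log lines look like the ones above.

```shell
# Hypothetical helper: print each member ID that etcd reported as
# "already been bootstrapped" in the given log file (one per line, deduplicated).
bootstrapped_members() {
    log_file="$1"
    sed -n 's/.*etcd: member \([0-9a-f]*\) has already been bootstrapped.*/\1/p' "$log_file" | sort -u
}

# Example (run on the affected host):
#   bootstrapped_members /var/log/messages
```

The ID it prints can then be compared with the output of member list on a healthy member.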
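- Root cause
Restoring the snapshot rolled this member's data back, so it was out of sync with the rest of the cluster.
- Fix
Deleting /var/lib/etcd/ and restarting etcd did not fix it: with an empty data directory the member tries to bootstrap again while its peers still have its ID registered, which matches the "has already been bootstrapped" message above.
The etcd unit file has to be changed so that the member starts with --initial-cluster-state=existing.
The complete unit file: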
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
EnvironmentFile=/opt/etcd/cfg/etcd
ExecStart=/opt/etcd/bin/etcd \
--name=${ETCD_NAME} \
--data-dir=${ETCD_DATA_DIR} \
--listen-peer-urls=${ETCD_LISTEN_PEER_URLS} \
--listen-client-urls=${ETCD_LISTEN_CLIENT_URLS},http://127.0.0.1:2379 \
--advertise-client-urls=${ETCD_ADVERTISE_CLIENT_URLS} \
--initial-advertise-peer-urls=${ETCD_INITIAL_ADVERTISE_PEER_URLS} \
--initial-cluster=${ETCD_INITIAL_CLUSTER} \
--initial-cluster-token=${ETCD_INITIAL_CLUSTER_TOKEN} \
--initial-cluster-state=existing \
--cert-file=/opt/etcd/ssl/server.pem \
--key-file=/opt/etcd/ssl/server-key.pem \
--peer-cert-file=/opt/etcd/ssl/server.pem \
--peer-key-file=/opt/etcd/ssl/server-key.pem \
--trusted-ca-file=/opt/etcd/ssl/ca.pem \
--peer-trusted-ca-file=/opt/etcd/ssl/ca.pem
Restart=on-failure
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
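Before restarting, it can help to confirm which cluster state the unit file actually passes to etcd. The sketch below only reads the file; the function name cluster_state_of is hypothetical, and the unit file path is an argument because these notes do not say where the file lives.

```shell
# Hypothetical helper: print the value given to --initial-cluster-state
# in a systemd unit file. If it prints a ${VARIABLE} reference, the real
# value comes from the EnvironmentFile instead.
cluster_state_of() {
    unit_file="$1"
    sed -n 's/.*--initial-cluster-state=\([^ ]*\).*/\1/p' "$unit_file"
}

# Example (the path is an assumption, adjust to your installation):
#   cluster_state_of /usr/lib/systemd/system/etcd.service
```

For a member rejoining a running cluster, the expected value is existing.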
Reload systemd and restart etcd
# systemctl daemon-reload
# systemctl restart etcd
Check whether the problem is fixed
# kubectl get cs
NAME STATUS MESSAGE ERROR
scheduler Healthy ok
controller-manager Healthy ok
etcd-1 Healthy {"health":"true"}
etcd-0 Healthy {"health":"true"}
etcd-2 Healthy {"health":"true"}
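When this check is scripted rather than read by eye, the table can be filtered for anything that is not Healthy. A small sketch, assuming the column layout shown above; unhealthy_components is a name chosen here for illustration.

```shell
# Hypothetical helper: read `kubectl get cs` output on stdin, print the
# name of every component whose STATUS column is not "Healthy", and
# return a non-zero exit status if any such component was found.
unhealthy_components() {
    awk 'NR > 1 && $2 != "Healthy" { print $1; bad = 1 } END { exit bad }'
}

# Example:
#   kubectl get cs | unhealthy_components && echo "all components healthy"
```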
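If a member still cannot start, it can be brought back by first removing it from the cluster and then adding it again.
- Removing a member
# Change to the certificate directory
# List the current members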
# /opt/etcd/bin/etcdctl --ca-file=ca.pem --cert-file=server.pem --key-file=server-key.pem member list
18203a217590459c: name=etcd05 peerURLs=https://172.16.5.105:2380 clientURLs=https://172.16.5.105:2379 isLeader=false
23ac2b06d0008226: name=etcd03 peerURLs=https://172.16.5.103:2380 clientURLs=https://172.16.5.103:2379 isLeader=false
a7a182aafca1a7e9: name=etcd02 peerURLs=https://172.16.5.102:2380 clientURLs=https://172.16.5.102:2379 isLeader=false
fa1b88064457e060: name=etcd01 peerURLs=https://172.16.5.101:2380 clientURLs=https://172.16.5.101:2379 isLeader=true
# Remove the member by ID (use the ID that member list reports for the failed member)
# /opt/etcd/bin/etcdctl --ca-file=ca.pem --cert-file=server.pem --key-file=server-key.pem member remove 2dd45e0dcd9289f
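Copying the ID by hand is easy to get wrong, so the lookup can be done from the member list output instead. This sketch assumes the output format shown above; member_id_by_name is a hypothetical helper, not part of etcdctl.

```shell
# Hypothetical helper: read `etcdctl member list` output on stdin and
# print the ID of the member whose name matches the first argument.
member_id_by_name() {
    name="$1"
    awk -v want="name=$name" '$2 == want { sub(/:$/, "", $1); print $1 }'
}

# Example (run from the certificate directory):
#   id=$(/opt/etcd/bin/etcdctl --ca-file=ca.pem --cert-file=server.pem \
#        --key-file=server-key.pem member list | member_id_by_name etcd03)
#   echo "$id"
```

An empty result means no member with that name is registered, in which case there is nothing to remove.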
- Adding a member
# Add the member with the following command
# /opt/etcd/bin/etcdctl --ca-file=ca.pem --cert-file=server.pem --key-file=server-key.pem member add etcd03 https://172.16.5.103:2380
Note: the peer URL must use https, otherwise etcd will not start.
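Since a wrong scheme only shows up later as a member that will not start, a script could refuse the URL before calling member add. A minimal sketch; require_https_peer_url is a hypothetical name.

```shell
# Hypothetical guard: succeed only if the peer URL starts with https://,
# otherwise print a message to stderr and return 1.
require_https_peer_url() {
    case "$1" in
        https://*) return 0 ;;
        *) echo "peer URL must use https: $1" >&2; return 1 ;;
    esac
}

# Example:
#   require_https_peer_url https://172.16.5.103:2380 && echo "ok to run member add"
```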
- Starting the new member
The new member uses the same unit file as above, and it needs the following parameter:
# --initial-cluster-state=existing \
If the data directory contains files, clean it out first
# rm -rf /var/lib/etcd
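A more cautious variant, as a sketch only: move the old data directory aside instead of deleting it, so the files can still be inspected later. The function name and the .bak suffix are choices made here, not something the original steps use.

```shell
# Hypothetical helper: rename the data directory to <dir>.bak.<timestamp>.
# Does nothing if the directory does not exist.
set_aside_data_dir() {
    data_dir="${1%/}"              # drop a trailing slash, if any
    [ -d "$data_dir" ] || return 0
    mv "$data_dir" "${data_dir}.bak.$(date +%Y%m%d%H%M%S)"
}

# Example (with etcd stopped):
#   set_aside_data_dir /var/lib/etcd
```

Either way, etcd needs an empty data directory when it rejoins with --initial-cluster-state=existing.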
Restart; this only needs to be done on the newly added member
# systemctl daemon-reload
# systemctl restart etcd
After the member has been added, the kube-apiserver configuration file can be updated to include the new member.
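kube-apiserver takes its etcd endpoints from the --etcd-servers flag, so the value can be generated from the current member list rather than typed by hand. This sketch assumes the member list output format shown earlier; etcd_servers_flag is a hypothetical helper, and where the flag is stored depends on how kube-apiserver was installed.

```shell
# Hypothetical helper: read `etcdctl member list` output on stdin and
# print a --etcd-servers flag built from every non-empty clientURLs field.
etcd_servers_flag() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^clientURLs=/) {
                sub(/^clientURLs=/, "", $i)
                if ($i != "") urls = urls (urls == "" ? "" : ",") $i
            }
        }
    } END { print "--etcd-servers=" urls }'
}

# Example (run from the certificate directory):
#   /opt/etcd/bin/etcdctl --ca-file=ca.pem --cert-file=server.pem \
#       --key-file=server-key.pem member list | etcd_servers_flag
```

kube-apiserver has to be restarted after the flag is changed.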