Backing Up and Restoring Etcd in Kubernetes
etcd is a distributed, consistent key-value store used for shared configuration and service discovery. It is an open-source project started by CoreOS and licensed under Apache.
etcd stores all of a Kubernetes cluster's data.
etcd is a critically important component of a Kubernetes cluster: it holds all of the cluster's state. If a disaster strikes or the etcd data is lost, recovering the cluster depends on being able to restore that data.
etcd use cases
etcd has many use cases, including but not limited to:
- Configuration management
- Service registration and discovery
- Leader election
- Application scheduling
- Distributed queues
- Distributed locks
First, install etcd (which also provides the etcdctl client) on the master node:
[root@master ~]# yum install etcd -y
Check the etcd version:
[root@master ~]# etcdctl -version
etcdctl version: 3.3.11
API version: 2
Access to etcd is equivalent to root permission in the cluster, so ideally only the API server should be able to reach it. Given the sensitivity of the data, it is recommended to grant access only to the nodes that actually need it. To configure etcd for secure client communication, specify the flags --key-file=k8sclient.key and --cert-file=k8sclient.cert, and use HTTPS as the URL scheme. An example of a client command that uses secure communication:
[root@master ~]# ETCDCTL_API=3 etcdctl --endpoints 192.168.248.128:2379 \
> --cert=/etc/kubernetes/pki/etcd/server.crt \
> --key=/etc/kubernetes/pki/etcd/server.key \
> --cacert=/etc/kubernetes/pki/etcd/ca.crt \
> member list
6b96994e8b1eabe5, started, master, https://192.168.248.128:2380, https://192.168.248.128:2379
Note: the latest etcd API version is v3, which is more efficient and cleaner than v2. Kubernetes uses the etcd v3 API by default, while etcdctl defaults to the v2 API. To use v3, set the environment variable with export ETCDCTL_API=3 for a temporary change, or edit /etc/profile (vim /etc/profile), add export ETCDCTL_API=3, and run source /etc/profile to make the change permanent.
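The two approaches look like this (a minimal sketch of the commands described above):

# Temporary: only affects the current shell session
export ETCDCTL_API=3

# Permanent: append to /etc/profile and reload it
echo 'export ETCDCTL_API=3' >> /etc/profile
source /etc/profile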
2379 and 2380 are etcd's IANA-registered ports and are the defaults (a quick way to verify them is shown after the list):
- 2379: client communication
- 2380: server-to-server (peer) communication
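To confirm which ports etcd is actually listening on for a running instance, one way is to check with ss (this assumes the iproute2 tools are installed, which they are on most modern distributions):

[root@master ~]# ss -tlnp | grep etcd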
Single-node etcd backup and restore example:
First, run a Deployment on the cluster that starts 5 pods:
[root@master ~]# kubectl get pod
NAME                               READY   STATUS    RESTARTS   AGE
deployment-nginx-8c459867c-4fw9q   1/1     Running   1          4d19h
deployment-nginx-8c459867c-ccbkz   1/1     Running   1          4d19h
deployment-nginx-8c459867c-cksqd   1/1     Running   1          4d19h
deployment-nginx-8c459867c-mrdz4   1/1     Running   1          4d19h
deployment-nginx-8c459867c-zt8n8   1/1     Running   1          4d19h
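The manifest used here (deployment2.yaml, deleted later in the walkthrough) is not reproduced in this article. If you want an equivalent test workload without it, one possible way is to create the Deployment imperatively; the nginx image and the name deployment-nginx are assumptions chosen to match the pod names above:

# Hypothetical equivalent of deployment2.yaml: 5 nginx replicas
kubectl create deployment deployment-nginx --image=nginx --replicas=5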
Create a directory to hold the etcd backup, then save the current etcd data as a snapshot.
[root@master ~]# mkdir -p etcd/backup/
[root@master ~]# ETCDCTL_API=3 etcdctl --endpoints 192.168.248.128:2379 \
> --cert=/etc/kubernetes/pki/etcd/server.crt \
> --key=/etc/kubernetes/pki/etcd/server.key \
> --cacert=/etc/kubernetes/pki/etcd/ca.crt \
> snapshot save ~/etcd/backup/snap.db
Snapshot saved at /root/etcd/backup/snap.db
[root@master ~]# cd etcd/backup/
[root@master backup]# ls
snap.db
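Before relying on the snapshot, you can optionally sanity-check it; etcdctl snapshot status reads the local file, so it does not need the TLS flags (-w table is just optional output formatting):

[root@master backup]# ETCDCTL_API=3 etcdctl snapshot status snap.db -w table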
Now delete the pods above, then restore the etcd data to verify that the pods come back.
[root@master ~]# kubectl delete -f deployment2.yaml
deployment.apps "deployment-nginx" deleted
[root@master ~]# kubectl get pods
No resources found in default namespace    # no pods are running at this point
To restore etcd, first stop kube-apiserver and etcd so that no new data is written to etcd during the restore. Because kube-apiserver and etcd are static pods created by the kubelet, moving their YAML files out of /etc/kubernetes/manifests/ makes them unavailable and stops them. The old etcd data directory is also moved aside so the restore can write a fresh one.
[root@master ~]# mv /etc/kubernetes/manifests/ /etc/kubernetes/manifests.bak
[root@master ~]# mv /var/lib/etcd/ /var/lib/etcd.bak
[root@master ~]# kubectl get pod -A
The connection to the server 192.168.248.128:6443 was refused - did you specify the right host or port?
At this point kubectl is no longer available. Now restore the etcd data:
[root@master lib]# ETCDCTL_API=3 etcdctl snapshot restore /root/etcd/backup/snap.db --data-dir=/var/lib/etcd
2022-08-11 22:09:39.297931 I | mvcc: restore compact to 306417
2022-08-11 22:09:39.304403 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32
Move the YAML files back into /etc/kubernetes/manifests/; the kubelet will automatically recreate the kube-apiserver and etcd static pods.
[root@master lib]# mv /etc/kubernetes/manifests.bak/ /etc/kubernetes/manifests
[root@master lib]# kubectl get pod
NAME                               READY   STATUS    RESTARTS   AGE
deployment-nginx-8c459867c-4fw9q   1/1     Running   1          5d
deployment-nginx-8c459867c-ccbkz   1/1     Running   1          5d
deployment-nginx-8c459867c-cksqd   1/1     Running   1          5d
deployment-nginx-8c459867c-mrdz4   1/1     Running   1          5d
deployment-nginx-8c459867c-zt8n8   1/1     Running   1          5d
With the etcd data restored, the original pods are back.
An automated etcd backup script:
[root@master etcd]# cat etcd_backup.sh
#!/bin/bash
CACERT="/etc/kubernetes/pki/etcd/ca.crt"
CERT="/etc/kubernetes/pki/etcd/server.crt"
KEY="/etc/kubernetes/pki/etcd/server.key"
ENDPOINTS="192.168.248.128:2379"

ETCDCTL_API=3 etcdctl \
    --cacert="${CACERT}" \
    --cert="${CERT}" \
    --key="${KEY}" \
    --endpoints=${ENDPOINTS} \
    snapshot save /data/etcd_backup_dir/etcd-snapshot-$(date +%Y%m%d).db

# Keep backups for 30 days
find /data/etcd_backup_dir/ -name "*.db" -mtime +30 -exec rm -f {} \;
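To run the script on a schedule, one option is to make it executable and add a cron entry; the 02:00 daily schedule, the script path, and the log path below are assumptions, and /data/etcd_backup_dir/ must already exist:

chmod +x /root/etcd/etcd_backup.sh
# crontab entry (added via crontab -e): run the backup every day at 02:00
0 2 * * * /root/etcd/etcd_backup.sh >> /var/log/etcd_backup.log 2>&1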
Restoring etcd in a multi-master cluster
Only one master node's etcd data needs to be backed up. After taking the backup on that master, package it and copy it to the other master nodes; a possible packaging and transfer step is sketched below, followed by the commands to unpack it on each receiving node.
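A possible packaging and transfer step on the source master (the $BACKUP_FILE_NAME directory layout and the target host are placeholders, assumed only to match the paths used later):

# The snapshot is assumed to sit in a directory named after the backup,
# e.g. $BACKUP_FILE_NAME/etcd_snapshot.db
tar -czvf $BACKUP_FILE_NAME.tar $BACKUP_FILE_NAME/
scp $BACKUP_FILE_NAME.tar root@<other-master>:/root/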
mkdir /var/lib/etcd_backup                      # create etcd_backup if it does not already exist
cp $BACKUP_FILE_NAME.tar /var/lib/etcd_backup   # copy the tar archive into etcd_backup
cd /var/lib/etcd_backup                         # change into /var/lib/etcd_backup
tar -xzvf $BACKUP_FILE_NAME.tar                 # extract the backup archive
Stop kube-apiserver and etcd on all nodes:
systemctl stop kube-apiserver
systemctl stop etcd
Move the data directory of every etcd instance out of the way:
mv /var/lib/etcd/default.etcd /var/lib/etcd/default.etcd_bak
Run the restore command on each etcd node in turn:
# etcd_snapshot.db                    -- the backup file
# $ETCD_TRUSTED_CA_FILE               -- path to ca.crt
# $ETCD_CERT_FILE                     -- path to server.crt
# $ETCD_KEY_FILE                      -- path to server.key
# $ETCD_NAME                          -- this member's name (host name)
# $ETCD_INITIAL_CLUSTER               -- description of the cluster members
# $ETCD_INITIAL_ADVERTISE_PEER_URLS   -- this node's peer URL (etcd host IP)
etcdctl snapshot restore /var/lib/etcd_backup/$BACKUP_FILE_NAME/etcd_snapshot.db \
    --cacert=$ETCD_TRUSTED_CA_FILE \
    --cert=$ETCD_CERT_FILE \
    --key=$ETCD_KEY_FILE \
    --name $ETCD_NAME \
    --initial-cluster $ETCD_INITIAL_CLUSTER \
    --initial-advertise-peer-urls $ETCD_INITIAL_ADVERTISE_PEER_URLS \
    --data-dir=/var/lib/etcd/default.etcd
Note: check /etc/etcd/etcd.conf (vim /etc/etcd/etcd.conf) for the configuration and the fields needed above: ETCD_TRUSTED_CA_FILE, ETCD_CERT_FILE, ETCD_KEY_FILE, ETCD_NAME, ETCD_INITIAL_CLUSTER, ETCD_INITIAL_ADVERTISE_PEER_URLS.
Example:
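Since /etc/etcd/etcd.conf typically consists of plain KEY=VALUE assignments, one convenient (assumed) shortcut is to source it so the variables used in the restore command are populated in the current shell:

source /etc/etcd/etcd.conf   # load ETCD_NAME, ETCD_INITIAL_CLUSTER, etc.
echo $ETCD_NAME              # quick check that the variables are set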
# On k8s-master1
$ ETCDCTL_API=3 etcdctl snapshot restore /tmp/backup/etcd/etcd-snapshot-20210610.db \
    --name k8s-m1 \
    --initial-cluster "k8s-m1=https://172.16.2.91:2380,k8s-m2=https://172.16.2.92:2380,k8s-m3=https://172.16.2.93:2380" \
    --initial-cluster-token etcd-cluster \
    --initial-advertise-peer-urls https://172.16.2.91:2380 \
    --data-dir=/var/lib/etcd/default.etcd

# On k8s-master2
$ ETCDCTL_API=3 etcdctl snapshot restore /tmp/backup/etcd/etcd-snapshot-20210610.db \
    --name k8s-m2 \
    --initial-cluster "k8s-m1=https://172.16.2.91:2380,k8s-m2=https://172.16.2.92:2380,k8s-m3=https://172.16.2.93:2380" \
    --initial-cluster-token etcd-cluster \
    --initial-advertise-peer-urls https://172.16.2.92:2380 \
    --data-dir=/var/lib/etcd/default.etcd

# On k8s-master3
$ ETCDCTL_API=3 etcdctl snapshot restore /tmp/backup/etcd/etcd-snapshot-20210610.db \
    --name k8s-m3 \
    --initial-cluster "k8s-m1=https://172.16.2.91:2380,k8s-m2=https://172.16.2.92:2380,k8s-m3=https://172.16.2.93:2380" \
    --initial-cluster-token etcd-cluster \
    --initial-advertise-peer-urls https://172.16.2.93:2380 \
    --data-dir=/var/lib/etcd/default.etcd
Fix the ownership and permissions on the etcd data directory:
chown -R etcd:etcd /var/lib/etcd/default.etcd
chmod -R 700 /var/lib/etcd/default.etcd
Start every etcd instance in the cluster:
systemctl start etcd
Check the health of all etcd members:
ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--endpoints="https://172.16.2.91:2379,https://172.16.2.92:2379,https://172.16.2.93:2379" \
endpoint health -w table
On each master, start kube-apiserver:
systemctl start kube-apiserver
systemctl status kube-apiserver
Check whether the Kubernetes cluster is back to normal:
[root@master ~]# kubectl get cs
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS      MESSAGE                                                                                       ERROR
controller-manager   Unhealthy   Get "http://127.0.0.1:10252/healthz": dial tcp 127.0.0.1:10252: connect: connection refused
scheduler            Unhealthy   Get "http://127.0.0.1:10251/healthz": dial tcp 127.0.0.1:10251: connect: connection refused
etcd-0               Healthy     {"health":"true"}
If you see the output above, comment out the - --port=0 line in /etc/kubernetes/manifests/kube-scheduler.yaml and /etc/kubernetes/manifests/kube-controller-manager.yaml, then restart the kubelet; the components return to Healthy. The resulting manifests are shown below.
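One way to make the change without opening an editor is sed plus a kubelet restart (a sketch; double-check the files afterwards, since the kubelet recreates the static pods as soon as the manifests change):

sed -i 's/- --port=0/# - --port=0/' /etc/kubernetes/manifests/kube-scheduler.yaml
sed -i 's/- --port=0/# - --port=0/' /etc/kubernetes/manifests/kube-controller-manager.yaml
systemctl restart kubelet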

# /etc/kubernetes/manifests/kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    # - --port=0
    image: registry.aliyuncs.com/google_containers/kube-scheduler:v1.21.0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
  hostNetwork: true
  priorityClassName: system-node-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
status: {}

# /etc/kubernetes/manifests/kube-controller-manager.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-controller-manager
    tier: control-plane
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-controller-manager
    - --allocate-node-cidrs=true
    - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --bind-address=127.0.0.1
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --cluster-cidr=10.244.0.0/16
    - --cluster-name=kubernetes
    - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
    - --controllers=*,bootstrapsigner,tokencleaner
    - --kubeconfig=/etc/kubernetes/controller-manager.conf
    - --leader-elect=true
    # - --port=0
    - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
    - --root-ca-file=/etc/kubernetes/pki/ca.crt
    - --service-account-private-key-file=/etc/kubernetes/pki/sa.key
    - --service-cluster-ip-range=10.96.0.0/12
    - --use-service-account-credentials=true
    image: registry.aliyuncs.com/google_containers/kube-controller-manager:v1.21.0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10257
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-controller-manager
    resources:
      requests:
        cpu: 200m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10257
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: ca-certs
      readOnly: true
    - mountPath: /etc/pki
      name: etc-pki
      readOnly: true
    - mountPath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
      name: flexvolume-dir
    - mountPath: /etc/kubernetes/pki
      name: k8s-certs
      readOnly: true
    - mountPath: /etc/kubernetes/controller-manager.conf
      name: kubeconfig
      readOnly: true
  hostNetwork: true
  priorityClassName: system-node-critical
  volumes:
  - hostPath:
      path: /etc/ssl/certs
      type: DirectoryOrCreate
    name: ca-certs
  - hostPath:
      path: /etc/pki
      type: DirectoryOrCreate
    name: etc-pki
  - hostPath:
      path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
      type: DirectoryOrCreate
    name: flexvolume-dir
  - hostPath:
      path: /etc/kubernetes/pki
      type: DirectoryOrCreate
    name: k8s-certs
  - hostPath:
      path: /etc/kubernetes/controller-manager.conf
      type: FileOrCreate
    name: kubeconfig
status: {}
[root@master ~]# kubectl get cs
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS    MESSAGE             ERROR
scheduler            Healthy   ok
controller-manager   Healthy   ok
etcd-0               Healthy   {"health":"true"}