View the etcd service file
vi /etc/systemd/system/etcd.service
View the etcd data directory
root@etcd-2:~# ll /var/lib/etcd/
total 4
drwx------  3 root root   20 Apr 19 17:44 ./
drwxr-xr-x 43 root root 4096 Apr 20 06:11 ../
drwx------  4 root root   29 Apr 19 17:44 member/
root@etcd-2:~# ll /var/lib/etcd/member/
total 0
drwx------ 4 root root  29 Apr 19 17:44 ./
drwx------ 3 root root  20 Apr 19 17:44 ../
drwx------ 2 root root 246 Apr 21 17:36 snap/
drwx------ 2 root root 199 Apr 21 10:13 wal/
snap stores the snapshot data.
wal stores the write-ahead log: when data is written, the log entry is written first and the data afterwards, so if the log write fails the write as a whole fails; the log can later be used to recover the data.
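To get a rough idea of how much space each of these directories is using, a simple check is (assuming the default data directory /var/lib/etcd):

du -sh /var/lib/etcd/member/snap /var/lib/etcd/member/wal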
List the members of the etcd cluster
root@etcd-1:~# etcdctl member list
71745e1fe53ea3d2, started, etcd-192.168.10.107, https://192.168.10.107:2380, https://192.168.10.107:2379, false
b3497c3662525c94, started, etcd-192.168.10.108, https://192.168.10.108:2380, https://192.168.10.108:2379, false
cff05c5d2e5d7019, started, etcd-192.168.10.109, https://192.168.10.109:2380, https://192.168.10.109:2379, false
The fields are: ID, status, name, peer URL (cluster port), client URL (client port), and whether the member is a learner (i.e. still catching up on data).
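For easier reading, member list can also print a table; this is standard etcdctl v3 behaviour rather than anything specific to this cluster:

etcdctl member list --write-out=table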
etcd health check
This method only checks the local node:
root@etcd-1:~# etcdctl endpoint health
127.0.0.1:2379 is healthy: successfully committed proposal: took = 3.52306ms
To monitor the whole cluster, a for loop can be used:
root@etcd-1:~# export NODE_IPS="192.168.10.107 192.168.10.108 192.168.10.109"
root@etcd-1:~# for ip in ${NODE_IPS}; do ETCDCTL_API=3 /usr/local/bin/etcdctl --endpoints=https://${ip}:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem endpoint health; done
https://192.168.10.107:2379 is healthy: successfully committed proposal: took = 15.070727ms
https://192.168.10.108:2379 is healthy: successfully committed proposal: took = 9.874537ms
https://192.168.10.109:2379 is healthy: successfully committed proposal: took = 8.872484ms
Output in table format:
root@etcd-1:~# for ip in ${NODE_IPS}; do ETCDCTL_API=3 /usr/local/bin/etcdctl --write-out=table endpoint status --endpoints=https://${ip}:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem; done
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.10.107:2379 | 71745e1fe53ea3d2 |  3.4.13 |  2.7 MB |     false |      false |         4 |     521493 |             521493 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.10.108:2379 | b3497c3662525c94 |  3.4.13 |  2.7 MB |     false |      false |         4 |     521493 |             521493 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.10.109:2379 | cff05c5d2e5d7019 |  3.4.13 |  2.7 MB |      true |      false |         4 |     521493 |             521493 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
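Because --endpoints accepts a comma-separated list, the same status check can be done with a single call and a single combined table instead of a loop (a sketch reusing the certificate paths above):

ETCDCTL_API=3 /usr/local/bin/etcdctl --write-out=table \
  --endpoints=https://192.168.10.107:2379,https://192.168.10.108:2379,https://192.168.10.109:2379 \
  --cacert=/etc/kubernetes/ssl/ca.pem \
  --cert=/etc/kubernetes/ssl/etcd.pem \
  --key=/etc/kubernetes/ssl/etcd-key.pem \
  endpoint status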
Basic etcd operations
View all keys in etcd:
root@etcd-1:~# etcdctl get / --prefix --keys-only
/calico/ipam/v2/assignment/ipv4/block/10.200.205.192-26
/calico/ipam/v2/assignment/ipv4/block/10.200.247.0-26
/calico/ipam/v2/assignment/ipv4/block/10.200.39.0-26
/calico/ipam/v2/assignment/ipv4/block/10.200.84.128-26
/calico/ipam/v2/handle/ipip-tunnel-addr-master-1
/calico/ipam/v2/handle/ipip-tunnel-addr-master-2
/calico/ipam/v2/handle/ipip-tunnel-addr-node-1
/calico/ipam/v2/handle/ipip-tunnel-addr-node-2
/calico/ipam/v2/handle/k8s-pod-network.3844a5799fbfdd20ab3ee16c6b176626d04c635ed8aa57a36d9e43a11b028713
/calico/ipam/v2/handle/k8s-pod-network.52d7e2ca8546bf0739c79c425ea421c63be1653fe74811c2d4b6c9242111fb22
/calico/ipam/v2/handle/k8s-pod-network.ab0b92bfc89fef7eb4486080bff1aa4e6f28109a105a70aceafb325d1d514d23
/calico/ipam/v2/handle/k8s-pod-network.c8dc5605cd5ed0a43a6169cf74d2f1738ddde1d5e72f2b4bd0cbfffe14a1232e
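The --prefix/--keys-only query combines well with ordinary shell tools, for example to count how many pod objects are currently stored (the /registry/pods/ prefix is the standard Kubernetes layout; the pipeline itself is just an illustration):

etcdctl get /registry/pods --prefix --keys-only | grep -c '^/registry/pods/'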
View the key of a specific pod.
net-test1 is the pod name.
root@master-1:~# kubectl get pod
NAME        READY   STATUS    RESTARTS   AGE
net-test1   1/1     Running   0          2d21h
net-test2   1/1     Running   0          2d21h
root@etcd-1:~# etcdctl get / --prefix --keys-only | grep net-test1
/registry/pods/default/net-test1
View the contents of the key /registry/pods/default/net-test1:
root@etcd-1:~# etcdctl get /registry/pods/default/net-test1
/registry/pods/default/net-test1
k8s v1Podې net-test1default"*$d9e53134-3638-4e51-bb43-b944037bd5652¯䏚 run net-test1z kubectl-runUpdatev¯FieldsV1: {"f:metadata":{"f:labels":{".":{},"f:run":{}}},"f:spec":{"f:containers":{"k:{\"name\":\"net-test1\"}":{".":{},"f:args":{},"f:image":{},"f:imagePullPolicy":{},"f:name":{},"f:resources":{},"f:terminationMessagePath":{},"f:terminationMessagePolicy":{}}},"f:dnsPolicy":{},"f:enableServiceLinks":{},"f:restartPolicy":{},"f:schedulerName":{},"f:securityContext":{},"f:terminationGracePeriodSeconds":{}}}· kubeletUpdatev®¯FieldsV1: {"f:status":{"f:conditions":{"k:{\"type\":\"ContainersReady\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}},"k:{\"type\":\"Initialized\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}},"k:{\"type\":\"Ready\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}}},"f:containerStatuses":{},"f:hostIP":{},"f:phase":{},"f:podIP":{},"f:podIPs":{".":{},"k:{\"ip\":\"10.200.84.129\"}":{".":{},"f:ip":{}}},"f:startTime":{}}}« kube-api-access-tc2lvkЁh " token (& kube-root-ca.crt ca.crtca.crt )' % namespace v1metadata.namespace¤± net-test1centos:7.9.2009"sleep"300000*BJL kube-api-access-tc2lv-/var/run/secrets/kubernetes.io/serviceaccount"2j/dev/termination-logr IfNotPresent¢FileAlways 2 ClusterFirstBdefaultJdefaultR192.168.10.104X`hrdefault-scheduler²6 node.kubernetes.io/not-readyExists" NoExecute(¬²8 node.kubernetes.io/unreachableExists" NoExecute(¬ƁPreemptLowerPriorityȃ Running# InitializedTru¯䎪2 ReadyTru®¯䎪2' ContainersReadyTru®¯䎪2$ 10.200.84.12¯䏂݁u¯䎪2"*192.168.10.1042 net-test1 ®¯䎚 (2centos:7.9.2009:`docker-pullable://centos@sha256:9d4bcbbb213dfd745b58be38b13b996ebb5ac315fe75711bd618426a630e0987BIdocker://d261d1933b0740fb2d478d4248371b89cbb95422fc75b05b88e3b7f032e6c818HJ BestEffortZb 10.200.84.129"
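The value is stored by kube-apiserver in binary protobuf format, which is why most of it is not human-readable. A quick way to pull out the readable fragments is to pipe the raw value through strings (just a rough inspection trick; dedicated tools such as auger can fully decode these objects, but that is outside etcdctl itself):

etcdctl get /registry/pods/default/net-test1 --print-value-only | strings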
Deleting a pod amounts to deleting its corresponding key:
root@etcd-1:~# etcdctl del /registry/pods/default/net-test1
1
root@master-1:~# kubectl get pod
NAME        READY   STATUS    RESTARTS   AGE
net-test2   1/1     Running   0          2d21h
The net-test1 pod is now gone. This operation is extremely dangerous; use it with great care.
Write data:
root@etcd-1:~# etcdctl put /qijia "0324"
OK
root@etcd-1:~# etcdctl get /qijia
/qijia
0324
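A couple of other basic operations on the same key (both standard etcdctl v3 subcommands): watch blocks and prints every change to the key, and del removes it:

etcdctl watch /qijia   # prints each put/del on /qijia until interrupted with Ctrl-C
etcdctl del /qijia     # delete the key created above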
Single-node etcd backup and restore
Backup:
root@etcd-1:~# etcdctl snapshot save /data/backup/etcd-backup-`date +%F%H%M`
{"level":"info","ts":1650619578.2251773,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"/data/backup/etcd-backup-2022-04-221726.part"}
{"level":"info","ts":"2022-04-22T17:26:18.225+0800","caller":"clientv3/maintenance.go:200","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1650619578.2258816,"caller":"snapshot/v3_snapshot.go:127","msg":"fetching snapshot","endpoint":"127.0.0.1:2379"}
{"level":"info","ts":"2022-04-22T17:26:18.245+0800","caller":"clientv3/maintenance.go:208","msg":"completed snapshot read; closing"}
{"level":"info","ts":1650619578.2582552,"caller":"snapshot/v3_snapshot.go:142","msg":"fetched snapshot","endpoint":"127.0.0.1:2379","size":"2.7 MB","took":0.032995421}
{"level":"info","ts":1650619578.258354,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"/data/backup/etcd-backup-2022-04-221726"}
Snapshot saved at /data/backup/etcd-backup-2022-04-221726
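For recurring backups the same command can be wrapped in a small script driven by cron. The sketch below is an assumption (the script path, the 7-day retention and the schedule are illustrative, not part of the original setup):

#!/bin/bash
# /usr/local/bin/etcd-backup.sh - hypothetical wrapper around "etcdctl snapshot save"
BACKUP_DIR=/data/backup
mkdir -p "${BACKUP_DIR}"
ETCDCTL_API=3 etcdctl snapshot save "${BACKUP_DIR}/etcd-backup-$(date +%F%H%M)"
# keep only the snapshots from the last 7 days
find "${BACKUP_DIR}" -name 'etcd-backup-*' -mtime +7 -delete

# example crontab entry: run the script every day at 02:00
# 0 2 * * * /usr/local/bin/etcd-backup.sh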
Restore the data. /data/etcd must be an empty (or non-existent) directory; etcdctl creates it automatically, so there is no need to create it yourself.
root@etcd-1:~# etcdctl snapshot restore /data/backup/etcd-backup-2022-04-221726 --data-dir=/data/etcd
{"level":"info","ts":1650619683.6522489,"caller":"snapshot/v3_snapshot.go:296","msg":"restoring snapshot","path":"/data/backup/etcd-backup-2022-04-221726","wal-dir":"/data/etcd/member/wal","data-dir":"/data/etcd","snap-dir":"/data/etcd/member/snap"}
{"level":"info","ts":1650619683.6754777,"caller":"mvcc/kvstore.go:380","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":421636}
{"level":"info","ts":1650619683.682951,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"0","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":1650619683.6885314,"caller":"snapshot/v3_snapshot.go:309","msg":"restored snapshot","path":"/data/backup/etcd-backup-2022-04-221726","wal-dir":"/data/etcd/member/wal","data-dir":"/data/etcd","snap-dir":"/data/etcd/member/snap"}
root@etcd-1:~# ll /data/etcd/
total 0
drwx------ 3 root root 20 Apr 22 17:28 ./
drwxr-xr-x 4 root root 32 Apr 22 17:28 ../
drwx------ 4 root root 29 Apr 22 17:28 member/
root@etcd-1:~# ll /data/etcd/member/
total 0
drwx------ 4 root root 29 Apr 22 17:28 ./
drwx------ 3 root root 20 Apr 22 17:28 ../
drwx------ 2 root root 62 Apr 22 17:28 snap/
drwx------ 2 root root 51 Apr 22 17:28 wal/
The data has now been restored into /data/etcd. All that remains is to change WorkingDirectory and --data-dir in etcd.service to the restored directory and restart etcd.
root@etcd-1:~# vi /etc/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
Documentation=https://github.com/coreos

[Service]
Type=notify
WorkingDirectory=/var/lib/etcd/
ExecStart=/usr/local/bin//etcd \
  --name=etcd-192.168.10.107 \
  --cert-file=/etc/kubernetes/ssl/etcd.pem \
  --key-file=/etc/kubernetes/ssl/etcd-key.pem \
  --peer-cert-file=/etc/kubernetes/ssl/etcd.pem \
  --peer-key-file=/etc/kubernetes/ssl/etcd-key.pem \
  --trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
  --peer-trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
  --initial-advertise-peer-urls=https://192.168.10.107:2380 \
  --listen-peer-urls=https://192.168.10.107:2380 \
  --listen-client-urls=https://192.168.10.107:2379,http://127.0.0.1:2379 \
  --advertise-client-urls=https://192.168.10.107:2379 \
  --initial-cluster-token=etcd-cluster-0 \
  --initial-cluster=etcd-192.168.10.107=https://192.168.10.107:2380,etcd-192.168.10.108=https://192.168.10.108:2380,etcd-192.168.10.109=https://192.168.10.109:2380 \
  --initial-cluster-state=new \
  --data-dir=/var/lib/etcd \
  --wal-dir= \
  --snapshot-count=50000 \
  --auto-compaction-retention=1 \
  --auto-compaction-mode=periodic \
  --max-request-bytes=10485760 \
  --quota-backend-bytes=8589934592
Restart=always
RestartSec=15
LimitNOFILE=65536
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target
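One way to apply the change is shown below (a sketch; the sed expressions assume the exact strings from the unit file above, so double-check the result before restarting):

sed -i 's#WorkingDirectory=/var/lib/etcd/#WorkingDirectory=/data/etcd/#' /etc/systemd/system/etcd.service
sed -i 's#--data-dir=/var/lib/etcd#--data-dir=/data/etcd#' /etc/systemd/system/etcd.service
systemctl daemon-reload
systemctl restart etcd
etcdctl endpoint health   # confirm the member is healthy again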
etcd cluster backup and restore
Because our k8s cluster was installed with the kubeasz tool, we can use the playbooks that ship with kubeasz to back up and restore the etcd data.
The playbooks required are:
94.backup.yml — the backup playbook
95.restore.yml — the restore playbook
root@master-1:~# ll /etc/kubeasz/playbooks/
total 92
-rw-rw-r-- 1 root root 1786 Apr 26  2021 94.backup.yml
-rw-rw-r-- 1 root root  999 Apr 26  2021 95.restore.yml
Backup: ezctl backup <k8s cluster name>
root@master-1:~# ezctl --help
    backup   <cluster>            to backup the cluster state (etcd snapshot)
    restore  <cluster>            to restore the cluster state from backups
Our cluster is named qijia01:
root@master-1:~# ll /etc/kubeasz/clusters/
total 0
drwxr-xr-x  3 root root  21 Feb 23 16:22 ./
drwxrwxr-x 12 root root 225 Feb 23 16:22 ../
drwxr-xr-x  5 root root 203 Apr 20 18:44 qijia01/
Start the backup:
root@master-1:~# ezctl backup qijia01
ansible-playbook -i clusters/qijia01/hosts -e @clusters/qijia01/config.yml playbooks/94.backup.yml
2022-04-22 18:02:54 INFO cluster:qijia01 backup begins in 5s, press any key to abort:

PLAY [localhost] ***************************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [set NODE_IPS of the etcd cluster] ****************************************
ok: [localhost]

TASK [get etcd cluster status] *************************************************
changed: [localhost]

TASK [debug] *******************************************************************
ok: [localhost] => {
    "ETCD_CLUSTER_STATUS": {
        "changed": true,
        "cmd": "for ip in 192.168.10.107 192.168.10.108 192.168.10.109 ;do ETCDCTL_API=3 /etc/kubeasz/bin/etcdctl --endpoints=https://\"$ip\":2379 --cacert=/etc/kubeasz/clusters/qijia01/ssl/ca.pem --cert=/etc/kubeasz/clusters/qijia01/ssl/etcd.pem --key=/etc/kubeasz/clusters/qijia01/ssl/etcd-key.pem endpoint health; done",
        "delta": "0:00:00.526961",
        "end": "2022-04-22 18:03:04.644297",
        "failed": false,
        "msg": "",
        "rc": 0,
        "start": "2022-04-22 18:03:04.117336",
        "stderr": "https://192.168.10.107:2379 is healthy: successfully committed proposal: took = 42.136716ms\nhttps://192.168.10.108:2379 is healthy: successfully committed proposal: took = 12.285904ms\nhttps://192.168.10.109:2379 is healthy: successfully committed proposal: took = 11.06195ms",
        "stderr_lines": [
            "https://192.168.10.107:2379 is healthy: successfully committed proposal: took = 42.136716ms",
            "https://192.168.10.108:2379 is healthy: successfully committed proposal: took = 12.285904ms",
            "https://192.168.10.109:2379 is healthy: successfully committed proposal: took = 11.06195ms"
        ],
        "stdout": "",
        "stdout_lines": []
    }
}

TASK [get a running ectd node] *************************************************
changed: [localhost]

TASK [debug] *******************************************************************
ok: [localhost] => {
    "RUNNING_NODE.stdout": "192.168.10.107"
}

TASK [get current time] ********************************************************
changed: [localhost]

TASK [make a backup on the etcd node] ******************************************
changed: [localhost -> 192.168.10.107]

TASK [fetch the backup data] ***************************************************
changed: [localhost -> 192.168.10.107]

TASK [update the latest backup] ************************************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=10   changed=6    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
Check that the backup data exists.
snapshot.db is a copy of the most recent backup. The restore playbook hard-codes the file name snapshot.db, while the backup files themselves are named with a timestamp, so after the backup completes the playbook runs cp snapshot_202204221803.db snapshot.db.
root@master-1:~# ll /etc/kubeasz/clusters/qijia01/backup/
total 5248
drwxr-xr-x 2 root root      57 Apr 22 18:03 ./
drwxr-xr-x 5 root root     203 Apr 20 18:44 ../
-rw------- 1 root root 2682912 Apr 22 18:03 snapshot.db
-rw------- 1 root root 2682912 Apr 22 18:03 snapshot_202204221803.db
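Since the restore playbook only ever reads snapshot.db, restoring from a specific, older backup is just a matter of copying that timestamped file over snapshot.db before running the restore (a sketch using the file names shown above):

cp /etc/kubeasz/clusters/qijia01/backup/snapshot_202204221803.db /etc/kubeasz/clusters/qijia01/backup/snapshot.db
ezctl restore qijia01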
Delete a pod to check whether the restore brings it back:
root@master-1:~# kubectl get pod -A
NAMESPACE              NAME                                        READY   STATUS    RESTARTS   AGE
default                net-test1                                   1/1     Running   0          66m
default                net-test2                                   1/1     Running   0          2d22h
kube-system            calico-kube-controllers-647f956d86-zrjq9    1/1     Running   0          2d23h
kube-system            calico-node-47phc                           1/1     Running   0          2d23h
kube-system            calico-node-9ghhw                           1/1     Running   0          2d23h
kube-system            calico-node-c7stp                           1/1     Running   0          2d23h
kube-system            calico-node-lcjsx                           1/1     Running   0          2d23h
kube-system            coredns-74c56d8f8d-d2jbp                    1/1     Running   0          2d
kube-system            coredns-74c56d8f8d-vds9h                    1/1     Running   0          2d
kubernetes-dashboard   dashboard-metrics-scraper-c45b7869d-5h8t7   1/1     Running   0          47h
kubernetes-dashboard   kubernetes-dashboard-576cb95f94-mzwpz       1/1     Running   0          47h
root@master-1:~# kubectl delete pod net-test1
pod "net-test1" deleted
root@master-1:~# kubectl get pod -A
NAMESPACE              NAME                                        READY   STATUS    RESTARTS   AGE
default                net-test2                                   1/1     Running   0          2d22h
kube-system            calico-kube-controllers-647f956d86-zrjq9    1/1     Running   0          2d23h
kube-system            calico-node-47phc                           1/1     Running   0          2d23h
kube-system            calico-node-9ghhw                           1/1     Running   0          2d23h
kube-system            calico-node-c7stp                           1/1     Running   0          2d23h
kube-system            calico-node-lcjsx                           1/1     Running   0          2d23h
kube-system            coredns-74c56d8f8d-d2jbp                    1/1     Running   0          2d
kube-system            coredns-74c56d8f8d-vds9h                    1/1     Running   0          2d
kubernetes-dashboard   dashboard-metrics-scraper-c45b7869d-5h8t7   1/1     Running   0          47h
kubernetes-dashboard   kubernetes-dashboard-576cb95f94-mzwpz       1/1     Running   0          47h
Restore the data and verify that the deleted pod is back:
root@master-1:~# ezctl restore qijia01
ansible-playbook -i clusters/qijia01/hosts -e @clusters/qijia01/config.yml playbooks/95.restore.yml
2022-04-22 18:06:01 INFO cluster:qijia01 restore begins in 5s, press any key to abort:

PLAY [kube_master] *************************************************************

TASK [Gathering Facts] *********************************************************
ok: [192.168.10.101]
ok: [192.168.10.102]

TASK [stopping kube_master services] *******************************************
changed: [192.168.10.102] => (item=kube-apiserver)
changed: [192.168.10.101] => (item=kube-apiserver)
changed: [192.168.10.102] => (item=kube-controller-manager)
changed: [192.168.10.102] => (item=kube-scheduler)
changed: [192.168.10.101] => (item=kube-controller-manager)
changed: [192.168.10.101] => (item=kube-scheduler)

PLAY [kube_master,kube_node] ***************************************************

TASK [Gathering Facts] *********************************************************
ok: [192.168.10.104]
ok: [192.168.10.105]

TASK [stopping kube_node services] *********************************************
changed: [192.168.10.105] => (item=kubelet)
changed: [192.168.10.101] => (item=kubelet)
changed: [192.168.10.102] => (item=kubelet)
changed: [192.168.10.104] => (item=kubelet)
changed: [192.168.10.105] => (item=kube-proxy)
changed: [192.168.10.101] => (item=kube-proxy)
changed: [192.168.10.102] => (item=kube-proxy)
changed: [192.168.10.104] => (item=kube-proxy)

PLAY [etcd] ********************************************************************

TASK [Gathering Facts] *********************************************************
ok: [192.168.10.107]
ok: [192.168.10.109]
ok: [192.168.10.108]

TASK [cluster-restore : 停止ectd 服务] *********************************************
changed: [192.168.10.109]
changed: [192.168.10.108]
changed: [192.168.10.107]

TASK [cluster-restore : 清除etcd 数据目录] *******************************************
changed: [192.168.10.108]
changed: [192.168.10.109]
changed: [192.168.10.107]

TASK [cluster-restore : 生成备份目录] ************************************************
ok: [192.168.10.107]
changed: [192.168.10.109]
changed: [192.168.10.108]

TASK [cluster-restore : 准备指定的备份etcd 数据] ****************************************
changed: [192.168.10.109]
changed: [192.168.10.108]
changed: [192.168.10.107]

TASK [cluster-restore : 清理上次备份恢复数据] ********************************************
ok: [192.168.10.107]
ok: [192.168.10.108]
ok: [192.168.10.109]

TASK [cluster-restore : etcd 数据恢复] *********************************************
changed: [192.168.10.107]
changed: [192.168.10.108]
changed: [192.168.10.109]

TASK [cluster-restore : 恢复数据至etcd 数据目录] ****************************************
changed: [192.168.10.108]
changed: [192.168.10.107]
changed: [192.168.10.109]

TASK [cluster-restore : 重启etcd 服务] *********************************************
changed: [192.168.10.107]
changed: [192.168.10.109]
changed: [192.168.10.108]

TASK [cluster-restore : 以轮询的方式等待服务同步完成] ****************************************
changed: [192.168.10.107]
changed: [192.168.10.108]
changed: [192.168.10.109]

PLAY [kube_master] *************************************************************

TASK [starting kube_master services] *******************************************
changed: [192.168.10.102] => (item=kube-apiserver)
changed: [192.168.10.102] => (item=kube-controller-manager)
changed: [192.168.10.101] => (item=kube-apiserver)
changed: [192.168.10.102] => (item=kube-scheduler)
changed: [192.168.10.101] => (item=kube-controller-manager)
changed: [192.168.10.101] => (item=kube-scheduler)

PLAY [kube_master,kube_node] ***************************************************

TASK [starting kube_node services] *********************************************
changed: [192.168.10.104] => (item=kubelet)
changed: [192.168.10.102] => (item=kubelet)
changed: [192.168.10.105] => (item=kubelet)
changed: [192.168.10.101] => (item=kubelet)
changed: [192.168.10.105] => (item=kube-proxy)
changed: [192.168.10.102] => (item=kube-proxy)
changed: [192.168.10.104] => (item=kube-proxy)
changed: [192.168.10.101] => (item=kube-proxy)

PLAY RECAP *********************************************************************
192.168.10.101             : ok=5    changed=4    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
192.168.10.102             : ok=5    changed=4    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
192.168.10.104             : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
192.168.10.105             : ok=3    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
192.168.10.107             : ok=10   changed=7    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
192.168.10.108             : ok=10   changed=8    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
192.168.10.109             : ok=10   changed=8    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

net-test1 has now been restored:

root@master-1:~# kubectl get pod -A
NAMESPACE              NAME                                        READY   STATUS    RESTARTS   AGE
default                net-test1                                   1/1     Running   0          73m
default                net-test2                                   1/1     Running   0          2d23h
kube-system            calico-kube-controllers-647f956d86-zrjq9    1/1     Running   0          2d23h
kube-system            calico-node-47phc                           1/1     Running   0          2d23h
kube-system            calico-node-9ghhw                           1/1     Running   0          2d23h
kube-system            calico-node-c7stp                           1/1     Running   0          2d23h
kube-system            calico-node-lcjsx                           1/1     Running   0          2d23h
kube-system            coredns-74c56d8f8d-d2jbp                    1/1     Running   0          2d
kube-system            coredns-74c56d8f8d-vds9h                    1/1     Running   0          2d1h
kubernetes-dashboard   dashboard-metrics-scraper-c45b7869d-5h8t7   1/1     Running   0          2d
kubernetes-dashboard   kubernetes-dashboard-576cb95f94-mzwpz       1/1     Running   1          2d
Unfortunately this backup-and-restore method is full backup and full restore; if you only want to restore the pods of a single namespace, it cannot do that.
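If namespace-level recovery matters, a complementary (and much coarser) approach is to export the resource manifests of that namespace with kubectl and re-apply them later. This does not restore etcd state, it merely recreates the objects (a sketch; the namespace and file name are illustrative):

kubectl get deploy,sts,ds,svc,cm,secret -n default -o yaml > default-ns-backup.yaml
# later, to recreate the objects:
kubectl apply -f default-ns-backup.yaml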
The workflow for recovering etcd
When more than half of the etcd nodes in the cluster are down, the cluster itself goes down. The workflow for recovering the data afterwards is:
- Recover the server operating systems
- Redeploy the etcd cluster
- Stop kube-apiserver/controller-manager/scheduler/kubelet/kube-proxy
- Stop the etcd cluster
- Restore the same backup data on every etcd node (see the sketch after this list)
- Start the etcd cluster and verify the health of its members
- Start kube-apiserver/controller-manager/scheduler/kubelet/kube-proxy
- Verify the k8s master status and the pod data
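For the "restore the same backup data on every etcd node" step, the per-node commands look roughly like this (a sketch based on the parameters in the etcd.service file shown earlier; --name and --initial-advertise-peer-urls must match the node the command runs on, and the snapshot is assumed to have been copied to /data/backup/snapshot.db on each node):

# the target --data-dir must be empty, so clear the old data directory first
rm -rf /var/lib/etcd
ETCDCTL_API=3 etcdctl snapshot restore /data/backup/snapshot.db \
  --name=etcd-192.168.10.107 \
  --initial-cluster=etcd-192.168.10.107=https://192.168.10.107:2380,etcd-192.168.10.108=https://192.168.10.108:2380,etcd-192.168.10.109=https://192.168.10.109:2380 \
  --initial-cluster-token=etcd-cluster-0 \
  --initial-advertise-peer-urls=https://192.168.10.107:2380 \
  --data-dir=/var/lib/etcd
systemctl restart etcd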