Etcd Backup and Restore

14.1 Etcd概述

etcd is a distributed key-value store built for high availability. Internally it uses the Raft protocol as its consensus algorithm, and it is implemented in Go.

14.2 Etcd属性

  1. Fully replicated
Every node in the cluster holds the complete data set
  2. Highly available
etcd can be deployed so that single hardware failures or network problems do not take it down
  3. Consistent
Every read returns the most recent write, across all hosts
  4. Simple
Provides a well-defined, user-facing API (gRPC)
  5. Fast
Benchmarked at 10,000 writes per second
  6. Reliable
Uses the Raft algorithm to keep the stored data consistently replicated across the cluster
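A concrete consequence of the full-replication and high-availability properties above is Raft's majority rule: a write commits only after a quorum of members accepts it, so an N-member cluster tolerates the loss of (N-1)/2 members. A quick sketch of the arithmetic (the function names are illustrative, not part of etcd):

```shell
# Raft quorum arithmetic: writes need a majority, so an N-member cluster
# has quorum N/2+1 (integer division) and tolerates (N-1)/2 failures.
quorum()          { echo $(( $1 / 2 + 1 )); }
fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

# A 3-member cluster needs 2 members for quorum and survives 1 failure;
# a 5-member cluster needs 3 and survives 2.
```

This is also why the recovery workflow in 14.11 is needed once two of three nodes are lost: the surviving node alone can no longer reach quorum.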

14.3 Etcd服务配置

root@k8s-etcd1:~# cat /etc/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
Documentation=https://github.com/coreos

[Service]
Type=notify
#data directory
WorkingDirectory=/var/lib/etcd
#path to the etcd binary
ExecStart=/usr/local/bin/etcd \
#name of this node (here derived from its IP address)
  --name=etcd-192.168.1.71 \
  --cert-file=/etc/kubernetes/ssl/etcd.pem \
  --key-file=/etc/kubernetes/ssl/etcd-key.pem \
  --peer-cert-file=/etc/kubernetes/ssl/etcd.pem \
  --peer-key-file=/etc/kubernetes/ssl/etcd-key.pem \
  --trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
  --peer-trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
#peer URL advertised to the rest of the cluster
  --initial-advertise-peer-urls=https://192.168.1.71:2380 \
#URL this member listens on for peer (cluster-internal) traffic
  --listen-peer-urls=https://192.168.1.71:2380 \
#URLs this member listens on for client traffic
  --listen-client-urls=https://192.168.1.71:2379,http://127.0.0.1:2379 \
#client URL advertised to clients
  --advertise-client-urls=https://192.168.1.71:2379 \
#token used when bootstrapping the cluster; must be identical on every node
  --initial-cluster-token=etcd-cluster-0 \
#all members of the cluster
  --initial-cluster=etcd-192.168.1.71=https://192.168.1.71:2380,etcd-192.168.1.72=https://192.168.1.72:2380,etcd-192.168.1.73=https://192.168.1.73:2380 \
#"new" when bootstrapping a new cluster, "existing" when joining one
  --initial-cluster-state=new \
#data directory path
  --data-dir=/var/lib/etcd \
  --wal-dir= \
  --snapshot-count=50000 \
#compact once per hour; after the first hour, compact every 1/10 of that period (every 6 minutes)
  --auto-compaction-retention=1 \
#periodic compaction mode
  --auto-compaction-mode=periodic \
#maximum request size in bytes (by default a key is limited to about 1.5 MB; 10 MB is the recommended upper bound)
  --max-request-bytes=10485760 \
#backend storage quota; the default is 2 GB, and etcd warns at startup when this exceeds 8 GB
  --quota-backend-bytes=8589934592
Restart=always
RestartSec=15
LimitNOFILE=65536
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target

14.4 Checking Etcd Cluster Status

  • Defragment the cluster members
#defrag rewrites the backend database so it is stored with sequential I/O; a successful run also confirms every member is responding
#ETCDCTL_API=3 selects the v3 API

ETCDCTL_API=3 /usr/local/bin/etcdctl defrag --cluster --endpoints=https://192.168.1.71:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem
#result: all three members completed
Finished defragmenting etcd member[https://192.168.1.71:2379]
Finished defragmenting etcd member[https://192.168.1.72:2379]
Finished defragmenting etcd member[https://192.168.1.73:2379]
  • Check cluster health over HTTP (localhost)
#etcd1
root@k8s-etcd1:~# etcdctl endpoint health
127.0.0.1:2379 is healthy: successfully committed proposal: took = 3.131033ms

#etcd2
root@k8s-etcd2:~# etcdctl endpoint health
127.0.0.1:2379 is healthy: successfully committed proposal: took = 9.114311ms

#etcd3
root@k8s-etcd3:~# etcdctl endpoint health
127.0.0.1:2379 is healthy: successfully committed proposal: took = 13.232431ms

  • Check cluster health over HTTPS
export NODE_IPS="192.168.1.71 192.168.1.72 192.168.1.73"
for ip in ${NODE_IPS}; do ETCDCTL_API=3 /usr/local/bin/etcdctl --endpoints=https://${ip}:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem endpoint health;
done

#result
https://192.168.1.71:2379 is healthy: successfully committed proposal: took = 8.57508ms
https://192.168.1.72:2379 is healthy: successfully committed proposal: took = 10.019689ms
https://192.168.1.73:2379 is healthy: successfully committed proposal: took = 8.723699ms
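For monitoring, the loop's output can be reduced to a single pass/fail result. A minimal sketch that counts healthy endpoints in the `endpoint health` output (`check_health` is an illustrative name, not part of etcdctl):

```shell
# Reads `etcdctl endpoint health` output on stdin and succeeds only if the
# expected number of endpoints report healthy.
check_health() {
  local expected=$1 healthy
  healthy=$(grep -c 'is healthy' || true)
  [ "${healthy}" -eq "${expected}" ]
}

# Usage: pipe the health-check loop above into: check_health 3
```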

14.5 Etcd CRUD Operations

  • Output in table form
#IS LEARNER shows whether the member is a non-voting learner that is still catching up on data
root@k8s-etcd1:~# /usr/local/bin/etcdctl --write-out=table member list --endpoints=https://192.168.1.71:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem
+------------------+---------+-------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |       NAME        |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-------------------+---------------------------+---------------------------+------------+
| 10aef13bef63cde1 | started | etcd-192.168.1.71 | https://192.168.1.71:2380 | https://192.168.1.71:2379 |      false |
| bb7f841bd6053e72 | started | etcd-192.168.1.72 | https://192.168.1.72:2380 | https://192.168.1.72:2379 |      false |
| ff250544e12286da | started | etcd-192.168.1.73 | https://192.168.1.73:2380 | https://192.168.1.73:2379 |      false |
+------------------+---------+-------------------+---------------------------+---------------------------+------------+
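The TLS flags make every one of these commands long. A small wrapper function keeps them short (a sketch; `ETCDCTL_BIN` is an assumed override variable, defaulting to the binary path used throughout this chapter):

```shell
# Prepends the API version and the TLS flags used throughout this chapter.
etcdctl_tls() {
  ETCDCTL_API=3 "${ETCDCTL_BIN:-/usr/local/bin/etcdctl}" \
    --endpoints=https://192.168.1.71:2379 \
    --cacert=/etc/kubernetes/ssl/ca.pem \
    --cert=/etc/kubernetes/ssl/etcd.pem \
    --key=/etc/kubernetes/ssl/etcd-key.pem \
    "$@"
}

# Example: etcdctl_tls --write-out=table member list
```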

  • Show detailed status of each cluster node
export NODE_IPS="192.168.1.71 192.168.1.72 192.168.1.73"
for ip in ${NODE_IPS}; do ETCDCTL_API=3 /usr/local/bin/etcdctl --write-out=table endpoint status --endpoints=https://${ip}:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem;
done

  • List all keys in the cluster
#avoid running this casually: it traverses the entire keyspace
root@k8s-etcd1:~# etcdctl get / --keys-only --prefix

#find nginx-related keys in the cluster
root@k8s-etcd1:~# etcdctl get / --keys-only --prefix | grep nginx

#find all namespace keys in the cluster
root@k8s-etcd1:~# etcdctl get / --keys-only --prefix | grep namespace

  • Create, read, and delete data
root@k8s-etcd1:~# etcdctl put /node "192.168.1.100"
OK
root@k8s-etcd1:~# etcdctl get /node
/node
192.168.1.100
root@k8s-etcd1:~# etcdctl del /node
1

14.6 Etcd Watch Mechanism

Function: etcd continuously monitors data and proactively notifies clients when it changes. The etcd v3 watch mechanism can watch a single fixed key or a whole key range.

Overview: watch a key on etcd1; the key does not have to exist yet when the watch starts and can be created later
root@k8s-etcd1:~# etcdctl put /node "192.168.1.100"
OK
#on etcd1, start watching (this blocks)
root@k8s-etcd1:~# etcdctl watch /node

#on etcd2, change the key
root@k8s-etcd2:~# etcdctl put /node "192.168.1.101"
OK

#the watch on etcd1 immediately prints the event
PUT
/node
192.168.1.101
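Each watch event arrives as three lines: the operation, the key, and the value. To script against the stream, the lines can be folded into one line per event (a sketch; `handle_events` is an illustrative name and needs a live cluster to be fed real events):

```shell
# Folds the three-line watch events (operation, key, value) into one line each,
# ready for further processing.
handle_events() {
  while read -r op && read -r key && read -r value; do
    echo "${op} ${key}=${value}"
  done
}

# Live usage: etcdctl watch /node | handle_events
```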

14.7 Deleting Data Directly in Etcd

Deleting data this way bypasses the Kubernetes API server and writes to etcd directly, which makes it very dangerous.

#goal: delete net-test1
root@deploy-harbor:~# kubectl get pods -A
NAMESPACE     NAME                                              READY   STATUS             RESTARTS          AGE
kube-system   calico-kube-controllers-68555f5f97-p255g          1/1     Running            0                 140m
kube-system   calico-node-gdc8m                                 0/1     CrashLoopBackOff   267 (4m34s ago)   47h
kube-system   calico-node-h5drr                                 0/1     CrashLoopBackOff   267 (3m28s ago)   47h
linux60       linux60-tomcat-app1-deployment-595f7ff67c-2h8vv   1/1     Running            0                 140m
myserver      linux70-nginx-deployment-55dc5fdcf9-g7lkt         1/1     Running            0                 140m
myserver      linux70-nginx-deployment-55dc5fdcf9-mrxlp         1/1     Running            0                 140m
myserver      linux70-nginx-deployment-55dc5fdcf9-q6x59         1/1     Running            0                 140m
myserver      linux70-nginx-deployment-55dc5fdcf9-s5h42         1/1     Running            0                 140m
myserver      net-test1                                         1/1     Running            0                 18s
myserver      net-test2                                         1/1     Running            0                 11s
myserver      net-test3                                         1/1     Running            0                 7s

#find it in etcd
root@k8s-etcd1:~# etcdctl get / --keys-only --prefix | grep 'net-test'
/registry/events/myserver/net-test1.17298ad770ac777d
/registry/events/myserver/net-test1.17298ad798f19b5c
/registry/events/myserver/net-test1.17298ad79a391fd2
/registry/events/myserver/net-test1.17298ad79f5cee1f
/registry/events/myserver/net-test2.17298ad8f4d8457e
/registry/events/myserver/net-test2.17298ad91bb1f309
/registry/events/myserver/net-test2.17298ad91d15fb7e
/registry/events/myserver/net-test2.17298ad921246080
/registry/events/myserver/net-test3.17298ad9fa9cf679
/registry/events/myserver/net-test3.17298ada1eeb346b
/registry/events/myserver/net-test3.17298ada20011319
/registry/events/myserver/net-test3.17298ada243b48d8
/registry/pods/myserver/net-test1
/registry/pods/myserver/net-test2
/registry/pods/myserver/net-test3

#delete it
root@k8s-etcd1:~# etcdctl del /registry/pods/myserver/net-test1
1

#the pod is now gone from the cluster
root@deploy-harbor:~# kubectl get pods -A
NAMESPACE     NAME                                              READY   STATUS             RESTARTS          AGE
kube-system   calico-kube-controllers-68555f5f97-p255g          1/1     Running            0                 141m
kube-system   calico-node-gdc8m                                 0/1     Running            268 (5m44s ago)   47h
kube-system   calico-node-h5drr                                 0/1     CrashLoopBackOff   267 (4m38s ago)   47h
linux60       linux60-tomcat-app1-deployment-595f7ff67c-2h8vv   1/1     Running            0                 141m
myserver      linux70-nginx-deployment-55dc5fdcf9-g7lkt         1/1     Running            0                 141m
myserver      linux70-nginx-deployment-55dc5fdcf9-mrxlp         1/1     Running            0                 141m
myserver      linux70-nginx-deployment-55dc5fdcf9-q6x59         1/1     Running            0                 141m
myserver      linux70-nginx-deployment-55dc5fdcf9-s5h42         1/1     Running            0                 141m
myserver      net-test2                                         1/1     Running            0                 81s
myserver      net-test3                                         1/1     Running            0                 77s
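The key listings above follow a fixed layout: pods live under /registry/pods/&lt;namespace&gt;/&lt;pod-name&gt;. That makes the dump easy to mine with standard tools, for example counting pods per namespace (`count_pods` is an illustrative helper, fed from the `etcdctl get / --keys-only --prefix` output):

```shell
# Counts pods per namespace from a key dump piped in on stdin.
# Key layout: /registry/pods/<namespace>/<pod-name>, so field 4 is the namespace.
count_pods() {
  grep '^/registry/pods/' | awk -F/ '{print $4}' | sort | uniq -c
}

# Live usage: etcdctl get / --keys-only --prefix | count_pods
```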

14.8 Etcd v3 API Backup and Restore

WAL is short for write-ahead log: before any real write is executed, a log entry describing it is written first.
The WAL records the entire history of data changes; in etcd, every modification must be written to the WAL before it is committed.
The backend database file under the member directory is where the data actually lives:
root@k8s-etcd1:~# ll /var/lib/etcd/member/snap/db
-rw------- 1 root root 2445312 Nov 21 08:15 /var/lib/etcd/member/snap/db

  • Backup
root@k8s-etcd1:~# etcdctl snapshot  save /tmp/test.sb
{"level":"info","ts":"2022-11-21T08:17:54.386Z","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/tmp/test.sb.part"}
{"level":"info","ts":"2022-11-21T08:17:54.387Z","logger":"client","caller":"v3/maintenance.go:211","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2022-11-21T08:17:54.388Z","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"127.0.0.1:2379"}
{"level":"info","ts":"2022-11-21T08:17:54.424Z","logger":"client","caller":"v3/maintenance.go:219","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2022-11-21T08:17:54.436Z","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"127.0.0.1:2379","size":"2.4 MB","took":"now"}
{"level":"info","ts":"2022-11-21T08:17:54.436Z","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/tmp/test.sb"}
Snapshot saved at /tmp/test.sb
  • Restore
  • --data-dir="/opt/etcd" must not already contain data; it may be a nonexistent or empty directory
root@k8s-etcd1:~# etcdctl snapshot restore /tmp/test.sb --data-dir="/opt/etcd"
  • Restore result
#the newly created directory is now populated

root@k8s-etcd1:~# ll /opt/etcd/
total 12
drwxr-xr-x 3 root root 4096 Nov 21 08:22 ./
drwxr-xr-x 3 root root 4096 Nov 21 08:21 ../
drwx------ 4 root root 4096 Nov 21 08:22 member/
root@k8s-etcd1:~# ll /opt/etcd/member/
total 16
drwx------ 4 root root 4096 Nov 21 08:22 ./
drwxr-xr-x 3 root root 4096 Nov 21 08:22 ../
drwx------ 2 root root 4096 Nov 21 08:22 snap/
drwx------ 2 root root 4096 Nov 21 08:22 wal/

#there are two ways to bring the cluster back up on the restored data
Method 1: after restoring into the new directory, edit the etcd service unit so that both paths point at the new directory
root@k8s-etcd1:~# vim /etc/systemd/system/etcd.service
WorkingDirectory=/var/lib/etcd
--data-dir=/var/lib/etcd
Method 2: after restoring into the new directory, delete the data under the path the etcd service unit already uses, then copy the restored data into it
For example, if the configured data directory is the one below, empty it first:
--data-dir=/var/lib/etcd
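The second method (restore, then swap the data directory into place) can be sketched as a small script. This is illustrative: `restore_etcd` is not part of etcdctl, it assumes the systemd unit and data directory from 14.3, and it must be run on every etcd member with the same snapshot file:

```shell
# Stops etcd, moves the live data directory aside, restores the snapshot into
# its place, and starts etcd again.
restore_etcd() {
  local snapshot=$1 data_dir=${2:-/var/lib/etcd}
  systemctl stop etcd
  mv "${data_dir}" "${data_dir}.bak-$(date +%s)"
  ETCDCTL_API=3 etcdctl snapshot restore "${snapshot}" --data-dir="${data_dir}"
  systemctl start etcd
}

# Usage: restore_etcd /tmp/test.sb
```

Keeping the old directory as a `.bak-<timestamp>` copy means a failed restore can be rolled back by moving it back into place.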

14.9 Automated Etcd Backup Script

root@k8s-etcd1:~# mkdir -p /data/etcd-backup-dir

#backup script
root@k8s-etcd1:~# cat scripts.sh
#!/bin/bash

# Etcd scheduled backup script, author: quyi
DATE=$(date +%Y-%m-%d_%H-%M-%S)
ETCDCTL_API=3 /usr/local/bin/etcdctl snapshot save /data/etcd-backup-dir/etcd-snapshot-${DATE}.db &>/dev/null

#cron job: run the backup every day at midnight
root@k8s-etcd1:~# crontab -l
00 00 * * * /bin/bash /root/scripts.sh &>/dev/null
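The cron job above grows /data/etcd-backup-dir without bound. A variant that also prunes old snapshots (a sketch; `KEEP_DAYS`, the `run` argument, and the function names are illustrative):

```shell
#!/bin/bash
# Take an etcd snapshot, then delete snapshots older than KEEP_DAYS days.
BACKUP_DIR=/data/etcd-backup-dir
KEEP_DAYS=7

backup() {
  local date
  date=$(date +%Y-%m-%d_%H-%M-%S)
  ETCDCTL_API=3 /usr/local/bin/etcdctl snapshot save "${BACKUP_DIR}/etcd-snapshot-${date}.db"
}

prune() {
  # -mtime +N matches files last modified more than N days ago
  find "${BACKUP_DIR}" -name 'etcd-snapshot-*.db' -mtime +"${KEEP_DAYS}" -delete
}

# Run both steps when invoked with "run"
# (cron: 00 00 * * * /bin/bash /root/scripts.sh run)
if [ "${1:-}" = "run" ]; then
  backup
  prune
fi
```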

14.10 Etcd Backup and Restore with Ansible (kubeasz)

  • First create a pod
root@deploy-harbor:/etc/kubeasz# kubectl run net-test3 --image=centos:7.9.2009 sleep 10000000 -n myserver
  • List all pods
root@deploy-harbor:/etc/kubeasz# kubectl get pods -A
NAMESPACE     NAME                                              READY   STATUS    RESTARTS          AGE
kube-system   calico-kube-controllers-68555f5f97-p255g          1/1     Running   2 (3m44s ago)     3h21m
kube-system   calico-node-gdc8m                                 1/1     Running   280 (44s ago)     2d
kube-system   calico-node-h5drr                                 0/1     Running   282 (2m27s ago)   2d
linux60       linux60-tomcat-app1-deployment-595f7ff67c-2h8vv   1/1     Running   0                 3h21m
myserver      linux70-nginx-deployment-55dc5fdcf9-g7lkt         1/1     Running   0                 3h21m
myserver      linux70-nginx-deployment-55dc5fdcf9-mrxlp         1/1     Running   0                 3h21m
myserver      linux70-nginx-deployment-55dc5fdcf9-q6x59         1/1     Running   0                 3h21m
myserver      linux70-nginx-deployment-55dc5fdcf9-s5h42         1/1     Running   0                 3h21m
myserver      net-test1                                         1/1     Running   0                 22m
myserver      net-test2                                         1/1     Running   0                 61m
myserver      net-test3                                         1/1     Running   0                 6s

  • Back up with Ansible
root@deploy-harbor:/etc/kubeasz# ./ezctl backup k8s-cluster1
  • Backup location and contents
#a successful backup produces these two files with identical contents; a restore reads snapshot.db, so think of it as the source file
root@deploy-harbor:/etc/kubeasz/clusters/k8s-cluster1/backup# ls
snapshot_202211210908.db  snapshot.db
  • Delete net-test3
root@k8s-etcd1:~# etcdctl get / --keys-only --prefix | grep net-test3
/registry/events/myserver/net-test3.17298ad9fa9cf679
/registry/events/myserver/net-test3.17298ada1eeb346b
/registry/events/myserver/net-test3.17298ada20011319
/registry/events/myserver/net-test3.17298ada243b48d8
/registry/events/myserver/net-test3.17298c517a5b6447
/registry/events/myserver/net-test3.17298e3220c07e42
/registry/events/myserver/net-test3.17298e3249c37734
/registry/events/myserver/net-test3.17298e324ae2165e
/registry/events/myserver/net-test3.17298e324f55fadf
/registry/pods/myserver/net-test3
root@k8s-etcd1:~# etcdctl del /registry/pods/myserver/net-test3
1
  • Restore the data with Ansible
#first copy the timestamped backup over the source file (copy and rename)
#for this demo the two files happen to be identical anyway
root@deploy-harbor:/etc/kubeasz/clusters/k8s-cluster1/backup# cp snapshot_202211210908.db snapshot.db

#start the restore
root@deploy-harbor:/etc/kubeasz# ./ezctl restore k8s-cluster1

#check the data: net-test3 is back
root@deploy-harbor:/etc/kubeasz# kubectl get pods -A
NAMESPACE     NAME                                              READY   STATUS    RESTARTS          AGE
kube-system   calico-kube-controllers-68555f5f97-p255g          1/1     Running   3 (39s ago)       3h33m
kube-system   calico-node-gdc8m                                 0/1     Running   283 (5m32s ago)   2d
kube-system   calico-node-h5drr                                 1/1     Running   286 (5m36s ago)   2d
linux60       linux60-tomcat-app1-deployment-595f7ff67c-2h8vv   1/1     Running   0                 3h33m
myserver      linux70-nginx-deployment-55dc5fdcf9-g7lkt         1/1     Running   0                 3h33m
myserver      linux70-nginx-deployment-55dc5fdcf9-mrxlp         1/1     Running   0                 3h33m
myserver      linux70-nginx-deployment-55dc5fdcf9-q6x59         1/1     Running   0                 3h33m
myserver      linux70-nginx-deployment-55dc5fdcf9-s5h42         1/1     Running   0                 3h33m
myserver      net-test1                                         1/1     Running   0                 34m
myserver      net-test2                                         1/1     Running   0                 73m
myserver      net-test3                                         1/1     Running   0                 12m

14.11 Etcd Data Recovery Workflow

When more than half of the etcd cluster's nodes are down (e.g. two out of three), the whole cluster is down and its data must be recovered. The recovery workflow is:

  1. Restore the server operating systems
  2. Redeploy the etcd cluster
  3. Stop kube-apiserver/controller-manager/scheduler/kubelet/kube-proxy
  4. Stop the etcd cluster
  5. Restore the same backup snapshot on every etcd node
  6. Start the etcd cluster
  7. Start kube-apiserver/controller-manager/scheduler/kubelet/kube-proxy
  8. Verify the Kubernetes master state and pod data
posted @ 2022-12-05 21:40 YIDADA-SRE