14.1 Etcd Overview
etcd is a distributed key-value database built for high availability. Internally, etcd uses the Raft protocol as its consensus algorithm, and it is implemented in Go.
14.2 Etcd Properties
Fully replicated
Highly available
Consistent
Simple
Fast
Reliable
Uses the Raft algorithm to keep the replicated store consistent across all nodes
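Raft elects a single leader that serializes all writes while the followers replicate its log, so a three-node cluster keeps working as long as two nodes (a quorum) survive. A quick way to see the current role division, assuming etcdctl is already configured with the certificate flags used later in this chapter, is:

ETCDCTL_API=3 etcdctl endpoint status --cluster -w table   # the IS LEADER column marks the elected node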
14.3 Etcd Service Configuration
root@k8s-etcd1:~
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
Documentation=https://github.com/coreos
[Service]
Type=notify
WorkingDirectory=/var/lib/etcd
ExecStart=/usr/local/bin/etcd \
--name=etcd-192.168.1.71 \
--cert-file=/etc/kubernetes/ssl/etcd.pem \
--key-file=/etc/kubernetes/ssl/etcd-key.pem \
--peer-cert-file=/etc/kubernetes/ssl/etcd.pem \
--peer-key-file=/etc/kubernetes/ssl/etcd-key.pem \
--trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
--peer-trusted-ca-file=/etc/kubernetes/ssl/ca.pem \
--initial-advertise-peer-urls=https://192.168.1.71:2380 \
--listen-peer-urls=https://192.168.1.71:2380 \
--listen-client-urls=https://192.168.1.71:2379,http://127.0.0.1:2379 \
--advertise-client-urls=https://192.168.1.71:2379 \
--initial-cluster-token=etcd-cluster-0 \
--initial-cluster=etcd-192.168.1.71=https://192.168.1.71:2380,etcd-192.168.1.72=https://192.168.1.72:2380,etcd-192.168.1.73=https://192.168.1.73:2380 \
--initial-cluster-state=new \
--data-dir=/var/lib/etcd \
--wal-dir= \
--snapshot-count=50000 \
--auto-compaction-retention=1 \
--auto-compaction-mode=periodic \
--max-request-bytes=10485760 \
--quota-backend-bytes=8589934592
Restart=always
RestartSec=15
LimitNOFILE=65536
OOMScoreAdjust=-999
[Install]
WantedBy=multi-user.target
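A few of these flags are worth decoding: --quota-backend-bytes=8589934592 caps the backend database at 8 GiB, --max-request-bytes=10485760 allows client requests up to 10 MiB, --snapshot-count=50000 takes a snapshot every 50,000 writes, and the two --auto-compaction-* flags compact historical revisions every hour. After editing the unit file, apply it with the usual systemd workflow (the unit is assumed to be named etcd.service):

systemctl daemon-reload        # re-read the changed unit file
systemctl restart etcd.service # restart etcd with the new flags
systemctl status etcd.service  # confirm the service came back up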
14.4 Checking Etcd Cluster Status
Defragment all cluster members to reclaim space in the backend database:
ETCDCTL_API=3 /usr/local/bin/etcdctl defrag --cluster --endpoints=https://192.168.1.71:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem
Finished defragmenting etcd member[https://192.168.1.71:2379]
Finished defragmenting etcd member[https://192.168.1.72:2379]
Finished defragmenting etcd member[https://192.168.1.73:2379]
root@k8s-etcd1:~
127.0.0.1:2379 is healthy: successfully committed proposal: took = 3.131033ms
root@k8s-etcd2:~
127.0.0.1:2379 is healthy: successfully committed proposal: took = 9.114311ms
root@k8s-etcd3:~
127.0.0.1:2379 is healthy: successfully committed proposal: took = 13.232431ms
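The per-node check can be run locally without any flags: because --listen-client-urls also includes http://127.0.0.1:2379, a minimal command that produces output like the above is (a hedged reconstruction, not verbatim from the original):

ETCDCTL_API=3 /usr/local/bin/etcdctl endpoint health   # defaults to 127.0.0.1:2379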
export NODE_IPS="192.168.1.71 192.168.1.72 192.168.1.73"
for ip in ${NODE_IPS}; do ETCDCTL_API=3 /opt/kube/bin/etcdctl --endpoints=https://${ip}:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem endpoint health;
done
https://192.168.1.76:2379 is healthy: successfully committed proposal: took = 8.57508ms
https://192.168.1.77:2379 is healthy: successfully committed proposal: took = 10.019689ms
https://192.168.1.78:2379 is healthy: successfully committed proposal: took = 8.723699ms
14.5 Etcd CRUD Operations
root@k8s-etcd1:~
+------------------+---------+-------------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+-------------------+---------------------------+---------------------------+------------+
| 10aef13bef63cde1 | started | etcd-192.168.1.71 | https://192.168.1.71:2380 | https://192.168.1.71:2379 | false |
| bb7f841bd6053e72 | started | etcd-192.168.1.72 | https://192.168.1.72:2380 | https://192.168.1.72:2379 | false |
| ff250544e12286da | started | etcd-192.168.1.73 | https://192.168.1.73:2380 | https://192.168.1.73:2379 | false |
+------------------+---------+-------------------+---------------------------+---------------------------+------------+
export NODE_IPS="192.168.1.71 192.168.1.72 192.168.1.73"
for ip in ${NODE_IPS}; do ETCDCTL_API=3 /usr/local/bin/etcdctl --write-out=table endpoint status --endpoints=https://${ip}:2379 --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem;
done
root@k8s-etcd1:~
OK
root@k8s-etcd1:~
/node
192.168.1.100
root@k8s-etcd1:~
1
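The outputs above (OK, the key /node with its value, and the deletion count 1) correspond to the three basic etcd v3 operations; a hedged reconstruction of the commands, with the key name and value taken from the output shown:

# create or update a key; prints OK on success
ETCDCTL_API=3 /usr/local/bin/etcdctl put /node 192.168.1.100
# read the key back; prints the key name and its value on separate lines
ETCDCTL_API=3 /usr/local/bin/etcdctl get /node
# delete the key; prints the number of keys removed (here 1)
ETCDCTL_API=3 /usr/local/bin/etcdctl del /node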
14.6 Etcd Watch Mechanism
Function: etcd continuously watches data and proactively notifies the client whenever a value changes. The etcd v3 watch mechanism can watch a single fixed key as well as a range of keys.
Overview: watch a key on etcd1; the key does not have to exist yet, so you can start the watch first and create the key later.
root@k8s-etcd1:~
OK
root@k8s-etcd1:~
PUT
root@k8s-etcd2:~
OK
root@k8s-etcd1:~
PUT
/node
192.168.1.101
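A hedged reconstruction of the session above, with host roles following the prompts shown: terminal 1 on etcd1 starts the watch, terminal 2 on etcd2 writes the key, and the watch prints the PUT event with the key and its new value:

# terminal 1 (k8s-etcd1): blocks and prints an event for every change to /node
ETCDCTL_API=3 /usr/local/bin/etcdctl watch /node
# terminal 2 (k8s-etcd2): triggers the event seen in terminal 1
ETCDCTL_API=3 /usr/local/bin/etcdctl put /node 192.168.1.101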
14.7 Deleting Etcd Data
Deleting data this way talks to etcd directly and completely bypasses the Kubernetes API server, so it is a dangerous operation.
root@deploy-harbor:~
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-68555f5f97-p255g 1/1 Running 0 140m
kube-system calico-node-gdc8m 0/1 CrashLoopBackOff 267 (4m34s ago) 47h
kube-system calico-node-h5drr 0/1 CrashLoopBackOff 267 (3m28s ago) 47h
linux60 linux60-tomcat-app1-deployment-595f7ff67c-2h8vv 1/1 Running 0 140m
myserver linux70-nginx-deployment-55dc5fdcf9-g7lkt 1/1 Running 0 140m
myserver linux70-nginx-deployment-55dc5fdcf9-mrxlp 1/1 Running 0 140m
myserver linux70-nginx-deployment-55dc5fdcf9-q6x59 1/1 Running 0 140m
myserver linux70-nginx-deployment-55dc5fdcf9-s5h42 1/1 Running 0 140m
myserver net-test1 1/1 Running 0 18s
myserver net-test2 1/1 Running 0 11s
myserver net-test3 1/1 Running 0 7s
root@k8s-etcd1:~
/registry/events/myserver/net-test1.17298ad770ac777d
/registry/events/myserver/net-test1.17298ad798f19b5c
/registry/events/myserver/net-test1.17298ad79a391fd2
/registry/events/myserver/net-test1.17298ad79f5cee1f
/registry/events/myserver/net-test2.17298ad8f4d8457e
/registry/events/myserver/net-test2.17298ad91bb1f309
/registry/events/myserver/net-test2.17298ad91d15fb7e
/registry/events/myserver/net-test2.17298ad921246080
/registry/events/myserver/net-test3.17298ad9fa9cf679
/registry/events/myserver/net-test3.17298ada1eeb346b
/registry/events/myserver/net-test3.17298ada20011319
/registry/events/myserver/net-test3.17298ada243b48d8
/registry/pods/myserver/net-test1
/registry/pods/myserver/net-test2
/registry/pods/myserver/net-test3
root@k8s-etcd1:~
1
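A hedged reconstruction of the listing and the direct deletion above (the /registry prefix is where Kubernetes stores its objects; the grep filter is an assumption):

# list all keys and filter for the test pods
ETCDCTL_API=3 /usr/local/bin/etcdctl get / --prefix --keys-only | grep net-test
# delete the pod object behind Kubernetes' back; prints 1, and the pod disappears
ETCDCTL_API=3 /usr/local/bin/etcdctl del /registry/pods/myserver/net-test1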
root@deploy-harbor:~
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-68555f5f97-p255g 1/1 Running 0 141m
kube-system calico-node-gdc8m 0/1 Running 268 (5m44s ago) 47h
kube-system calico-node-h5drr 0/1 CrashLoopBackOff 267 (4m38s ago) 47h
linux60 linux60-tomcat-app1-deployment-595f7ff67c-2h8vv 1/1 Running 0 141m
myserver linux70-nginx-deployment-55dc5fdcf9-g7lkt 1/1 Running 0 141m
myserver linux70-nginx-deployment-55dc5fdcf9-mrxlp 1/1 Running 0 141m
myserver linux70-nginx-deployment-55dc5fdcf9-q6x59 1/1 Running 0 141m
myserver linux70-nginx-deployment-55dc5fdcf9-s5h42 1/1 Running 0 141m
myserver net-test2 1/1 Running 0 81s
myserver net-test3 1/1 Running 0 77s
14.8 Etcd V3 API Data Backup and Restore
WAL stands for write-ahead log: before the real write operation is executed, a log entry is written first, hence the name.
WAL: stores the write-ahead log, whose most important role is recording the complete history of data changes. In etcd, every modification must be written to the WAL before it is committed.
Inside the data directory, the WAL (member/wal) sits alongside the snapshot directory (member/snap), which is where the data itself is stored.
root@k8s-etcd1:~
-rw------- 1 root root 2445312 Nov 21 08:15 /var/lib/etcd/member/snap/db
root@k8s-etcd1:~
{"level" :"info" ,"ts" :"2022-11-21T08:17:54.386Z" ,"caller" :"snapshot/v3_snapshot.go:65" ,"msg" :"created temporary db file" ,"path" :"/tmp/test.sb.part" }
{"level" :"info" ,"ts" :"2022-11-21T08:17:54.387Z" ,"logger" :"client" ,"caller" :"v3/maintenance.go:211" ,"msg" :"opened snapshot stream; downloading" }
{"level" :"info" ,"ts" :"2022-11-21T08:17:54.388Z" ,"caller" :"snapshot/v3_snapshot.go:73" ,"msg" :"fetching snapshot" ,"endpoint" :"127.0.0.1:2379" }
{"level" :"info" ,"ts" :"2022-11-21T08:17:54.424Z" ,"logger" :"client" ,"caller" :"v3/maintenance.go:219" ,"msg" :"completed snapshot read; closing" }
{"level" :"info" ,"ts" :"2022-11-21T08:17:54.436Z" ,"caller" :"snapshot/v3_snapshot.go:88" ,"msg" :"fetched snapshot" ,"endpoint" :"127.0.0.1:2379" ,"size" :"2.4 MB" ,"took" :"now" }
{"level" :"info" ,"ts" :"2022-11-21T08:17:54.436Z" ,"caller" :"snapshot/v3_snapshot.go:97" ,"msg" :"saved" ,"path" :"/tmp/test.sb" }
Snapshot saved at /tmp/test.sb
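Given the saved path reported in the log above, the snapshot command was presumably:

ETCDCTL_API=3 /usr/local/bin/etcdctl snapshot save /tmp/test.sb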
Restore
--data-dir=/opt/etcd: the target directory must not already contain any data; it may be an empty or not-yet-existing directory.
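A minimal restore sketch matching the directory listings below (all flags beyond --data-dir left at their defaults):

ETCDCTL_API=3 /usr/local/bin/etcdctl snapshot restore /tmp/test.sb --data-dir=/opt/etcd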
root@k8s-etcd1:~
total 12
drwxr-xr-x 3 root root 4096 Nov 21 08:22 ./
drwxr-xr-x 3 root root 4096 Nov 21 08:21 ../
drwx------ 4 root root 4096 Nov 21 08:22 member/
root@k8s-etcd1:~
total 16
drwx------ 4 root root 4096 Nov 21 08:22 ./
drwxr-xr-x 3 root root 4096 Nov 21 08:22 ../
drwx------ 2 root root 4096 Nov 21 08:22 snap/
drwx------ 2 root root 4096 Nov 21 08:22 wal/
Method 1: the new directory now contains the restored data, so edit the etcd systemd unit file and point the following paths at the new directory:
root@k8s-etcd1:~
WorkingDirectory=/var/lib/etcd
--data-dir=/var/lib/etcd
Method 2: the new directory contains the restored data; delete the data under the path configured in the etcd unit file, then copy the restored data back there, as sketched below.
For example, if the configured data directory is the one below, clear out its contents first:
--data-dir=/var/lib/etcd
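A hedged command sequence for method 2, with paths taken from the unit file above; etcd must be stopped while its data directory is replaced:

systemctl stop etcd
rm -rf /var/lib/etcd/*                 # clear the configured data directory
cp -r /opt/etcd/member /var/lib/etcd/  # copy in the restored data
systemctl start etcd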
14.9 Automated Etcd Backup Script
root@k8s-etcd1:~
root@k8s-etcd1:~
#!/bin/bash
DATE=`date +%Y-%m-%d_%H-%M-%S`
mkdir -p /data/etcd-backup-dir
ETCDCTL_API=3 /usr/local/bin/etcdctl snapshot save /data/etcd-backup-dir/etcd-snapshot-${DATE}.db &>/dev/null
root@k8s-etcd1:~
00 00 * * * /bin/bash /root/scripts.sh &>/dev/null
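As written, the script keeps snapshots forever. A hedged extension that also prunes backups older than seven days (the retention period is an assumption) could be appended to the script:

# remove snapshots older than 7 days to keep the backup directory bounded
find /data/etcd-backup-dir/ -name "etcd-snapshot-*.db" -mtime +7 -delete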
14.10 Etcd Backup and Restore with Ansible (kubeasz)
root@deploy-harbor:/etc/kubeasz
root@deploy-harbor:/etc/kubeasz
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-68555f5f97-p255g 1/1 Running 2 (3m44s ago) 3h21m
kube-system calico-node-gdc8m 1/1 Running 280 (44s ago) 2d
kube-system calico-node-h5drr 0/1 Running 282 (2m27s ago) 2d
linux60 linux60-tomcat-app1-deployment-595f7ff67c-2h8vv 1/1 Running 0 3h21m
myserver linux70-nginx-deployment-55dc5fdcf9-g7lkt 1/1 Running 0 3h21m
myserver linux70-nginx-deployment-55dc5fdcf9-mrxlp 1/1 Running 0 3h21m
myserver linux70-nginx-deployment-55dc5fdcf9-q6x59 1/1 Running 0 3h21m
myserver linux70-nginx-deployment-55dc5fdcf9-s5h42 1/1 Running 0 3h21m
myserver net-test1 1/1 Running 0 22m
myserver net-test2 1/1 Running 0 61m
myserver net-test3 1/1 Running 0 6s
root@deploy-harbor:/etc/kubeasz
root@deploy-harbor:/etc/kubeasz/clusters/k8s-cluster1/backup
snapshot_202211210908.db snapshot.db
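kubeasz wraps etcd snapshots in its ezctl tool; given the backup path above, the dated snapshot file was presumably produced with something like:

cd /etc/kubeasz
./ezctl backup k8s-cluster1   # writes clusters/k8s-cluster1/backup/snapshot_<timestamp>.db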
root@k8s-etcd1:~
/registry/events/myserver/net-test3.17298ad9fa9cf679
/registry/events/myserver/net-test3.17298ada1eeb346b
/registry/events/myserver/net-test3.17298ada20011319
/registry/events/myserver/net-test3.17298ada243b48d8
/registry/events/myserver/net-test3.17298c517a5b6447
/registry/events/myserver/net-test3.17298e3220c07e42
/registry/events/myserver/net-test3.17298e3249c37734
/registry/events/myserver/net-test3.17298e324ae2165e
/registry/events/myserver/net-test3.17298e324f55fadf
/registry/pods/myserver/net-test3
root@k8s-etcd1:~
1
root@deploy-harbor:/etc/kubeasz/clusters/k8s-cluster1/backup
root@deploy-harbor:/etc/kubeasz
root@deploy-harbor:/etc/kubeasz
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-68555f5f97-p255g 1/1 Running 3 (39s ago) 3h33m
kube-system calico-node-gdc8m 0/1 Running 283 (5m32s ago) 2d
kube-system calico-node-h5drr 1/1 Running 286 (5m36s ago) 2d
linux60 linux60-tomcat-app1-deployment-595f7ff67c-2h8vv 1/1 Running 0 3h33m
myserver linux70-nginx-deployment-55dc5fdcf9-g7lkt 1/1 Running 0 3h33m
myserver linux70-nginx-deployment-55dc5fdcf9-mrxlp 1/1 Running 0 3h33m
myserver linux70-nginx-deployment-55dc5fdcf9-q6x59 1/1 Running 0 3h33m
myserver linux70-nginx-deployment-55dc5fdcf9-s5h42 1/1 Running 0 3h33m
myserver net-test1 1/1 Running 0 34m
myserver net-test2 1/1 Running 0 73m
myserver net-test3 1/1 Running 0 12m
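A hedged reconstruction of the restore shown above: kubeasz restores from the fixed file name snapshot.db in the cluster's backup directory, so the desired dated snapshot is first copied over it and then the restore playbook is run:

cd /etc/kubeasz/clusters/k8s-cluster1/backup
cp snapshot_202211210908.db snapshot.db   # select which backup to restore
cd /etc/kubeasz
./ezctl restore k8s-cluster1              # restores etcd from snapshot.db and restarts services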
14.11 Etcd Data Recovery Procedure
When more than half of the etcd cluster's nodes are down (for example, two out of three), the whole cluster becomes unavailable and the data has to be restored. The recovery procedure is as follows (a per-node command sketch follows the list):
Restore the server operating systems
Redeploy the etcd cluster
Stop kube-apiserver / kube-controller-manager / kube-scheduler / kubelet / kube-proxy
Stop the etcd cluster
Restore the same backup file onto every etcd node
Start the etcd cluster on every node
Start kube-apiserver / kube-controller-manager / kube-scheduler / kubelet / kube-proxy
Verify the Kubernetes master status and pod data
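A per-node command sketch for the stop/restore/start steps, assuming the node names, addresses, and data directory from the unit file in 14.3 and a backup at /tmp/test.sb; shown for etcd-192.168.1.71, so --name and the peer URL must be adjusted on each node:

systemctl stop etcd
rm -rf /var/lib/etcd                   # snapshot restore requires an empty data directory
ETCDCTL_API=3 /usr/local/bin/etcdctl snapshot restore /tmp/test.sb \
  --name=etcd-192.168.1.71 \
  --initial-advertise-peer-urls=https://192.168.1.71:2380 \
  --initial-cluster-token=etcd-cluster-0 \
  --initial-cluster=etcd-192.168.1.71=https://192.168.1.71:2380,etcd-192.168.1.72=https://192.168.1.72:2380,etcd-192.168.1.73=https://192.168.1.73:2380 \
  --data-dir=/var/lib/etcd
systemctl start etcd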