11. K8S High Availability: Master Cluster, etcd Cluster, and Master Failure Handling
1. Preparation Before Configuration
1.1. Differences from a Single-Master Setup
For a single-master cluster, the API server is advertised directly on that master's own address:
kubeadm init --kubernetes-version=1.25.7 \
--apiserver-advertise-address=192.168.10.26 \
--service-cidr=10.96.0.0/12 \
--pod-network-cidr=10.244.0.0/16 \
--image-repository=registry.aliyuncs.com/google_containers \
--ignore-preflight-errors=Swap
For a multi-master cluster, however, clients reach the API server through the VIP exposed by keepalived and its corresponding port. kubeadm has dedicated options for specifying this address:
--control-plane-endpoint=VIP
--apiserver-bind-port=6443
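A quick way to confirm which endpoint a cluster is actually using (a minimal check, assuming the admin kubeconfig has been copied to ~/.kube/config):
# The server field should point at the VIP, not at an individual master
grep server ~/.kube/config
kubectl cluster-info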
2. Resetting the Cluster
If cluster creation fails for some reason, the environment can be quickly restored with the commands below.
2.1. Resetting and Cleaning a Master Node
# Reset the master node
kubeadm reset
rm -rf /etc/kubernetes
rm -rf ~/.kube
rm -rf /etc/cni/   # clear the container network interface configuration
systemctl restart containerd.service
2.2. Resetting and Cleaning a Worker Node
rm -rf /etc/cni/net.d
kubeadm reset
# Restart containerd so the network plugin does not misbehave afterwards
systemctl restart containerd.service
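kubeadm reset does not clean up iptables or IPVS rules by itself; if the failed attempt left stale rules behind, they can be cleared as well (an optional sketch, run with care since it flushes all rules on the host):
# Flush iptables rules left over from kube-proxy / the CNI plugin
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
# Clear IPVS virtual servers if kube-proxy ran in ipvs mode and ipvsadm is installed
ipvsadm --clear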
3. Add the hosts Entry on Every Node in the Cluster
echo "192.168.10.200 vip.k8test.com" >> /etc/hosts   # use >> so existing hosts entries are not overwritten
4. Configuring Highly Available Masters
4.1. Notes
1. If nginx is installed on the same host as a master, port 6443 will conflict. Initialize one master first, then change the nginx listen port to 7443.
2. The port through which the first initialized master fetches data must also be changed, in /etc/kubernetes/kubelet.conf; restart the kubelet service after editing the file (a small sketch follows below).
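A minimal sketch of the two notes above (the 7443 value is the example port chosen here; adjust as needed):
# Check what is already listening on 6443 before starting nginx on a master host
ss -lntp | grep 6443
# After moving nginx to 7443, point the first master's kubelet at the new port and restart it
sed -i 's#vip.k8test.com:6443#vip.k8test.com:7443#' /etc/kubernetes/kubelet.conf
systemctl restart kubelet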
4.2. Initialize the Cluster on master1
kubeadm init --kubernetes-version=1.25.7 \
--apiserver-bind-port=6443 \
--control-plane-endpoint=vip.k8test.com \
--service-cidr=10.96.0.0/12 \
--pod-network-cidr=10.244.0.0/16 \
--image-repository=registry.aliyuncs.com/google_containers \
--ignore-preflight-errors=Swap
4.3. Output Shown After Successful Initialization
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Alternatively, if you are the root user, you can run:
export KUBECONFIG=/etc/kubernetes/admin.conf
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
You can now join any number of control-plane nodes by copying certificate authorities
and service account keys on each node and then running the following as root:
# Join command for control-plane (master) nodes
kubeadm join vip.k8test.com:6443 --token ci73xv.kp35on7vpn75adh4 \
--discovery-token-ca-cert-hash sha256:7f70fccc24d51c0f7a845258204c7fb0a25ed4a029d419d4535728dc2346632a \
--control-plane
Then you can join any number of worker nodes by running the following on each as root:
# Join command for worker nodes
kubeadm join vip.k8test.com:6443 --token ci73xv.kp35on7vpn75adh4 \
--discovery-token-ca-cert-hash sha256:7f70fccc24d51c0f7a845258204c7fb0a25ed4a029d419d4535728dc2346632a
4.4. Join the Other Masters to the Cluster
# Join master2 and master3 to form the highly available control plane
kubeadm join vip.k8test.com:6443 --token ci73xv.kp35on7vpn75adh4 \
--discovery-token-ca-cert-hash sha256:7f70fccc24d51c0f7a845258204c7fb0a25ed4a029d419d4535728dc2346632a \
--control-plane
4.5. Errors When Joining the Cluster
4.5.1. front-proxy-ca.crt: no such file or directory
Error message:
[failure loading certificate for CA: couldn't load the certificate file /etc/kubernetes/pki/ca.crt: open /etc/kubernetes/pki/ca.crt:
no such file or directory, failure loading key for service account: couldn't load the private key file /etc/kubernetes/pki/sa.key:
open /etc/kubernetes/pki/sa.key: no such file or directory, failure loading certificate for front-proxy CA: couldn't load the
certificate file /etc/kubernetes/pki/front-proxy-ca.crt: open /etc/kubernetes/pki/front-proxy-ca.crt: no such file or directory,
failure loading certificate for etcd CA: couldn't load the certificate file /etc/kubernetes/pki/etcd/ca.crt: open /etc/kubernetes/pki/etcd/ca.crt: no such file or directory]
Solution:
# Run on the host that reported the error
mkdir -p /etc/kubernetes/pki/etcd
# On the first master, copy the corresponding files over to the host that reported the error
scp -rp /etc/kubernetes/pki/ca.* master2:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/sa.* master2:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/front-proxy-ca.* master2:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/etcd/ca.* master2:/etc/kubernetes/pki/etcd
scp -rp /etc/kubernetes/admin.conf master2:/etc/kubernetes
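As an alternative to copying the certificates by hand, kubeadm can distribute them through the cluster itself; a hedged sketch using the --upload-certs mechanism (token, hash, and key below are placeholders):
# On an existing master: upload the control-plane certs as an encrypted Secret and print the certificate key
kubeadm init phase upload-certs --upload-certs
# On the joining master: pass the printed key so the certificates are downloaded automatically
kubeadm join vip.k8test.com:6443 --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane --certificate-key <key-printed-above>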
4.6. Check Whether the Cluster Is Up
[root@master1 ~]# kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-c676cc86f-hn247 0/1 Pending 0 16m
kube-system coredns-c676cc86f-z5vrd 0/1 Pending 0 16m
kube-system etcd-master1 1/1 Running 10 16m
kube-system etcd-master2 1/1 Running 0 10m
kube-system etcd-master3 1/1 Running 0 71s
kube-system kube-apiserver-master1 1/1 Running 18 16m
kube-system kube-apiserver-master2 1/1 Running 0 10m
kube-system kube-apiserver-master3 1/1 Running 1 (79s ago) 55s
kube-system kube-controller-manager-master1 1/1 Running 11 (9m55s ago) 16m
kube-system kube-controller-manager-master2 1/1 Running 0 10m
kube-system kube-controller-manager-master3 1/1 Running 0 14s
kube-system kube-proxy-cd9rj 1/1 Running 0 10m
kube-system kube-proxy-k4kbh 1/1 Running 0 71s
kube-system kube-proxy-rnswk 1/1 Running 0 16m
kube-system kube-scheduler-master1 1/1 Running 11 (9m55s ago) 16m
kube-system kube-scheduler-master2 1/1 Running 0 10m
kube-system kube-scheduler-master3 1/1 Running 0 67s
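Besides checking the pods, it is worth probing the API server through the VIP itself (anonymous access to /healthz is normally allowed on kubeadm clusters):
curl -k https://vip.k8test.com:6443/healthz
# Expected output: ok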
5. Install the flannel Network Plugin
Reference: Installing the CNI flannel plugin: https://www.cnblogs.com/ygbh/p/17221380.html#_lab2_2_1
6. Adding Nodes to the Cluster
6.1. Join Worker Nodes to the Cluster
kubeadm join vip.k8test.com:6443 --token ci73xv.kp35on7vpn75adh4 \
--discovery-token-ca-cert-hash sha256:7f70fccc24d51c0f7a845258204c7fb0a25ed4a029d419d4535728dc2346632a
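If the token printed by kubeadm init has already expired (tokens are valid for 24 hours by default), a fresh join command can be generated on any master:
kubeadm token create --print-join-command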
6.2. Check the Node Status
[root@master1 ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master1 Ready control-plane 33m v1.25.7
master2 Ready control-plane 27m v1.26.2
master3 Ready control-plane 18m v1.26.2
node1 Ready <none> 2m6s v1.25.7
node2 Ready <none> 78s v1.25.7
7. Install the Dashboard
Reference: Deploying the Dashboard: https://www.cnblogs.com/ygbh/p/17221496.html
8. Cluster State So Far
8.1. Background
By default, our current three-master control plane can only afford to lose one master; if two masters go down, the whole cluster collapses.
- After the first master is shut down, it takes roughly one minute before the node shows as offline.
- As soon as a second master fails, the cluster immediately becomes unusable.
- For a production cluster, the rule is that the number of surviving nodes must stay greater than n/2:
  a 3-node cluster tolerates the loss of 1 node;
  a 5-node cluster tolerates the loss of 2 nodes.
The root cause is that all of the cluster's state is stored in etcd, a distributed, consistent key-value store, which is therefore subject to the usual distributed-consensus requirements on member counts.
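The tolerance figures above follow directly from etcd's quorum rule, a simple calculation:
# quorum = floor(n/2) + 1, tolerated failures = n - quorum
# n = 3  ->  quorum = 2  ->  at most 1 member may be lost
# n = 5  ->  quorum = 3  ->  at most 2 members may be lost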
8.2. Make Sure nginx Exposes All Backend Master Nodes
cat /usr/local/nginx-1.20.0/conf/conf.d/apiserver.conf
stream {
    upstream kube-apiserver {
        server 192.168.10.26:6443 max_fails=3 fail_timeout=30s;
        server 192.168.10.27:6443 max_fails=3 fail_timeout=30s;
        server 192.168.10.28:6443 max_fails=3 fail_timeout=30s;
    }
    server {
        listen 6443;
        proxy_connect_timeout 2s;
        proxy_timeout 900s;
        proxy_pass kube-apiserver;
    }
}
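After editing the file, nginx can be validated and reloaded in place (assuming the binary sits under the same prefix as the config above):
/usr/local/nginx-1.20.0/sbin/nginx -t
/usr/local/nginx-1.20.0/sbin/nginx -s reload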
8.3. Shut Down master3
systemctl stop kubelet docker
# or simply power the machine off
8.4. Check the Node Status
[root@master1 ~]# kubectl get node
NAME      STATUS     ROLES           AGE    VERSION
master1   Ready      control-plane   107m   v1.25.7
master2   Ready      control-plane   104m   v1.25.7
master3   NotReady   control-plane   104m   v1.25.7
node1     Ready      <none>          53m    v1.25.7
node2     Ready      <none>          53m    v1.25.7
9. The etcd Cluster
9.1. Current State
At the moment each master runs its own etcd member, and once the master cluster is assembled the etcd certificates are kept in sync as well. The etcd configuration, however, can end up inconsistent and cause etcd to fail; in that case it is enough to edit the manifest /etc/kubernetes/manifests/etcd.yaml.
The underlying principles are described in the official documentation: https://kubernetes.io/zh-cn/docs/setup/production-environment/tools/kubeadm/setup-ha-etcd-with-kubeadm/
9.2. Edit the etcd Configuration File
9.2.1. Note
/etc/kubernetes/manifests holds static pod manifests: after editing a YAML file there is no need to restart the pod manually, the kubelet recreates it automatically.
9.2.2. Edit etcd.yaml (do this on all three master nodes)
]# cat /etc/kubernetes/manifests/etcd.yaml
...
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://192.168.10.27:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --experimental-initial-corrupt-check=true
    - --experimental-watch-progress-notify-interval=5s
    - --initial-advertise-peer-urls=https://192.168.10.27:2380
    # The two lines below are the part that is changed
    - --initial-cluster=master1=https://192.168.10.26:2380,master2=https://192.168.10.27:2380,master3=https://192.168.10.28:2380
    - --initial-cluster-state=existing
...
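Since these are static pods, the kubelet recreates them as soon as the manifest is saved; the restart can be watched with, for example:
kubectl -n kube-system get pods -l component=etcd -o wide
crictl ps | grep etcd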
9.3. Check the Cluster Status
9.3.1. Query From Inside an etcd Pod
# List the members
sh-5.1# etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key member list
49a8f24b174de1fa, started, master2, https://192.168.10.27:2380, https://192.168.10.27:2379, false
4a182a6f514944cc, started, master1, https://192.168.10.26:2380, https://192.168.10.26:2379, false
fe67d236c75cf789, started, master3, https://192.168.10.28:2380, https://192.168.10.28:2379, false

# Check endpoint status
sh-5.1# etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --endpoints=https://192.168.10.26:2379,https://192.168.10.27:2379,https://192.168.10.28:2379 endpoint status
https://192.168.10.26:2379, 4a182a6f514944cc, 3.5.6, 4.7 MB, true, false, 89, 1485767, 1485767,
https://192.168.10.27:2379, 49a8f24b174de1fa, 3.5.6, 5.2 MB, false, false, 89, 1485768, 1485768,
https://192.168.10.28:2379, fe67d236c75cf789, 3.5.6, 4.9 MB, false, false, 89, 1485768, 1485768,

# Health check
sh-5.1# etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --endpoints=https://192.168.10.26:2379,https://192.168.10.27:2379,https://192.168.10.28:2379 endpoint health
https://192.168.10.26:2379 is healthy: successfully committed proposal: took = 11.676406ms
https://192.168.10.28:2379 is healthy: successfully committed proposal: took = 12.374045ms
https://192.168.10.27:2379 is healthy: successfully committed proposal: took = 14.856308ms
10. Handling an Abnormal Master Node
10.1. Cause
Sometimes a virtual machine that has been suspended for too long leaves its master node unable to rejoin the cluster after it is powered back on. This rarely happens in production, since masters are not normally left down for long periods.
10.2. Operations on the Cluster
10.2.1. Delete the Abnormal Master Node
kubectl delete node master3
10.2.2. Exec Into an etcd Pod in the Cluster and Remove the Problem Member
master1 ~]# kubectl -n kube-system exec -it etcd-master1 -- /bin/sh
# List the etcd cluster members
etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
336bb7385f7571c, started, master2, https://192.168.10.27:2380, https://192.168.10.27:2379, false
4a182a6f514944cc, started, master1, https://192.168.10.26:2380, https://192.168.10.26:2379, false
602e8852aad940e0, started, master3, https://192.168.10.28:2380, https://192.168.10.28:2379, false
# Remove the problem member master3
etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 602e8852aad940e0

Note: if this step is skipped, the node will fail to join the cluster with an error like:
Failed to get etcd status for https://192.168.10.28:2379: failed to dial endpoint https://192.168.10.28:2379 with maintenance client: context deadline exceeded
10.2.3. Print the Join Command
kubeadm token create --print-join-command
# Output:
# kubeadm join vip.k8test.com:6443 --token hpezbw.amy350duzlzmi9gb --discovery-token-ca-cert-hash sha256:92e95dc47dcca7c8977004e2b321b09fe138ac223d95086951f600751d82d69a --control-plane
# Note: --control-plane has to be appended manually; the printed command does not include it
10.3. Operations on the Failed Master Node
10.3.1. Reset the Node Data
# The node data must be reset first
kubeadm reset
10.3.2. Create the kubernetes Directories
mkdir -p /etc/kubernetes/pki/etcd
10.3.3. From Any Master in the Cluster, Copy the CA Certificates to This Node
scp -rp /etc/kubernetes/pki/ca.* master3:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/sa.* master3:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/front-proxy-ca.* master3:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/etcd/ca.* master3:/etc/kubernetes/pki/etcd
scp -rp /etc/kubernetes/admin.conf master3:/etc/kubernetes
10.3.4. Join the Cluster
kubeadm join vip.k8test.com:6443 --token hpezbw.amy350duzlzmi9gb --discovery-token-ca-cert-hash sha256:92e95dc47dcca7c8977004e2b321b09fe138ac223d95086951f600751d82d69a --control-plane
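A quick verification that master3 is back in the cluster and its etcd member was re-added (reusing the etcdctl invocation shown earlier):
kubectl get nodes
kubectl -n kube-system exec -it etcd-master1 -- etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list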