11、K8S High Availability: Master Cluster, etcd Cluster, and Master Failure Handling


1、Preparation Before Configuration

1.1、Differences from a Single-Master Setup

For a single-master cluster, kubeadm init is typically run like this:

kubeadm init --kubernetes-version=1.25.7 \
--apiserver-advertise-address=192.168.10.26 \
--service-cidr=10.96.0.0/12 \
--pod-network-cidr=10.244.0.0/16 \
--image-repository=registry.aliyuncs.com/google_containers \
--ignore-preflight-errors=Swap

For a multi-master Kubernetes cluster, however, the externally visible access point is the VIP exposed by keepalived (and its port). kubeadm has dedicated options for specifying this address:
--control-plane-endpoint=VIP
--apiserver-bind-port=6443
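
As a hedged sketch (this config-file form is not used in this article; the values simply mirror the flags used in section 4.2, and vip.k8test.com is the VIP name set up in section 3), the same endpoint can also be expressed in a kubeadm configuration file passed via --config:

# Sketch only: equivalent kubeadm configuration file
cat > kubeadm-config.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: 1.25.7
controlPlaneEndpoint: "vip.k8test.com:6443"     # VIP plus the load-balanced port
imageRepository: registry.aliyuncs.com/google_containers
networking:
  serviceSubnet: 10.96.0.0/12
  podSubnet: 10.244.0.0/16
EOF
kubeadm init --config kubeadm-config.yaml --ignore-preflight-errors=Swap
# The API server itself still binds 6443 locally (the --apiserver-bind-port default).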

2、Resetting the Cluster

If cluster creation fails for some reason, we can quickly restore the environment with the following commands.

2.1、Resetting and Cleaning Up a Master Node

# Reset the master node
kubeadm reset;
rm -rf /etc/kubernetes;
rm -rf ~/.kube ;
rm -rf /etc/cni/; # remove the container network interface configuration
systemctl restart containerd.service

2.2、Resetting and Cleaning Up a Worker Node

rm -rf /etc/cni/net.d;
kubeadm reset;

# Restart this service to avoid problems with the network plugin
systemctl restart containerd.service

3、Add the hosts Entry on All Cluster Nodes

echo "192.168.10.200 vip.k8test.com" > /etc/hosts

4、Configuring a Highly Available Master Cluster

4.1、Notes

Notes:
1、If nginx is installed on the same host as a master, it will conflict with the API server on port 6443. Initialize one master first, then change the nginx listen port to 7443.
2、The port the first initialized master uses to reach the API server must also be changed, in /etc/kubernetes/kubelet.conf; after editing it, restart the kubelet service (see the sketch below).
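
A minimal sketch of the edit described in note 2, assuming the server: line in kubelet.conf currently points at vip.k8test.com:6443 and nginx has been moved to port 7443:

# Sketch only: point the first master's kubelet at the new nginx port, then restart kubelet
sed -i 's#server: https://vip.k8test.com:6443#server: https://vip.k8test.com:7443#' /etc/kubernetes/kubelet.conf
systemctl restart kubelet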

4.2、Initialize the Cluster on master1

kubeadm init --kubernetes-version=1.25.7 \
--apiserver-bind-port=6443 \
--control-plane-endpoint=vip.k8test.com \
--service-cidr=10.96.0.0/12 \
--pod-network-cidr=10.244.0.0/16 \
--image-repository=registry.aliyuncs.com/google_containers \
--ignore-preflight-errors=Swap
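
A hedged alternative (not used in this article): adding --upload-certs lets kubeadm distribute the control-plane certificates itself, so the manual scp workaround shown in section 4.5 is not needed.

kubeadm init --kubernetes-version=1.25.7 \
--apiserver-bind-port=6443 \
--control-plane-endpoint=vip.k8test.com \
--service-cidr=10.96.0.0/12 \
--pod-network-cidr=10.244.0.0/16 \
--image-repository=registry.aliyuncs.com/google_containers \
--ignore-preflight-errors=Swap \
--upload-certs
# The output then also prints a --certificate-key value to append to the control-plane join command.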

4.3、Output After a Successful Initialization

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

You can now join any number of control-plane nodes by copying certificate authorities
and service account keys on each node and then running the following as root:
# Join command for additional control-plane (master) nodes
  kubeadm join vip.k8test.com:6443 --token ci73xv.kp35on7vpn75adh4 \
        --discovery-token-ca-cert-hash sha256:7f70fccc24d51c0f7a845258204c7fb0a25ed4a029d419d4535728dc2346632a \
        --control-plane 

Then you can join any number of worker nodes by running the following on each as root:

# Join command for worker nodes
kubeadm join vip.k8test.com:6443 --token ci73xv.kp35on7vpn75adh4 \
        --discovery-token-ca-cert-hash sha256:7f70fccc24d51c0f7a845258204c7fb0a25ed4a029d419d4535728dc2346632a

4.4、Join Additional Masters to the Cluster

# Run on master2 and master3 to add them to the control plane for high availability

kubeadm join vip.k8test.com:6443 --token ci73xv.kp35on7vpn75adh4 \
      --discovery-token-ca-cert-hash sha256:7f70fccc24d51c0f7a845258204c7fb0a25ed4a029d419d4535728dc2346632a \
      --control-plane 
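
As the initialization output above suggests, set up kubectl on each newly joined master as well (or, as root, simply export KUBECONFIG=/etc/kubernetes/admin.conf):

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config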

4.5、Errors When Joining the Cluster

4.5.1、front-proxy-ca.crt: no such file or directory

Error message:
[failure loading certificate for CA: couldn't load the certificate file /etc/kubernetes/pki/ca.crt: open /etc/kubernetes/pki/ca.crt: 
no such file or directory, failure loading key for service account: couldn't load the private key file /etc/kubernetes/pki/sa.key:
open /etc/kubernetes/pki/sa.key: no such file or directory, failure loading certificate for front-proxy CA: couldn't load the
certificate file /etc/kubernetes/pki/front-proxy-ca.crt: open /etc/kubernetes/pki/front-proxy-ca.crt: no such file or directory,
failure loading certificate for etcd CA: couldn't load the certificate file /etc/kubernetes/pki/etcd/ca.crt: open /etc/kubernetes/pki/etcd/ca.crt: no such file or directory]

Solution:
# Run on the host that reports the error
mkdir -p /etc/kubernetes/pki/etcd

# Copy the certificates from the first master to the host that reports the error
scp -rp /etc/kubernetes/pki/ca.* master2:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/sa.* master2:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/front-proxy-ca.* master2:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/etcd/ca.* master2:/etc/kubernetes/pki/etcd
scp -rp /etc/kubernetes/admin.conf master2:/etc/kubernetes

4.6、Verify That the Cluster Came Up Correctly

[root@master1 ~]# kubectl get pods -A
NAMESPACE     NAME                              READY   STATUS    RESTARTS         AGE
kube-system   coredns-c676cc86f-hn247           0/1     Pending   0                16m
kube-system   coredns-c676cc86f-z5vrd           0/1     Pending   0                16m
kube-system   etcd-master1                      1/1     Running   10               16m
kube-system   etcd-master2                      1/1     Running   0                10m
kube-system   etcd-master3                      1/1     Running   0                71s
kube-system   kube-apiserver-master1            1/1     Running   18               16m
kube-system   kube-apiserver-master2            1/1     Running   0                10m
kube-system   kube-apiserver-master3            1/1     Running   1 (79s ago)      55s
kube-system   kube-controller-manager-master1   1/1     Running   11 (9m55s ago)   16m
kube-system   kube-controller-manager-master2   1/1     Running   0                10m
kube-system   kube-controller-manager-master3   1/1     Running   0                14s
kube-system   kube-proxy-cd9rj                  1/1     Running   0                10m
kube-system   kube-proxy-k4kbh                  1/1     Running   0                71s
kube-system   kube-proxy-rnswk                  1/1     Running   0                16m
kube-system   kube-scheduler-master1            1/1     Running   11 (9m55s ago)   16m
kube-system   kube-scheduler-master2            1/1     Running   0                10m
kube-system   kube-scheduler-master3            1/1     Running   0                67s

5、Install the flannel Network Plugin

Reference chapter: Installing the CNI flannel plugin: https://www.cnblogs.com/ygbh/p/17221380.html#_lab2_2_1
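
A hedged quick reference (the chapter linked above may use a different manifest or version) is to apply the upstream flannel manifest directly:

kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml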

6、Add Nodes to the Cluster

6.1、Join Worker Nodes to the Cluster

kubeadm join vip.k8test.com:6443 --token ci73xv.kp35on7vpn75adh4 \
        --discovery-token-ca-cert-hash sha256:7f70fccc24d51c0f7a845258204c7fb0a25ed4a029d419d4535728dc2346632a 

6.2、Check the Node Status

[root@master1 ~]# kubectl get nodes
NAME      STATUS   ROLES           AGE    VERSION
master1   Ready    control-plane   33m    v1.25.7
master2   Ready    control-plane   27m    v1.26.2
master3   Ready    control-plane   18m    v1.26.2
node1     Ready    <none>          2m6s   v1.25.7
node2     Ready    <none>          78s    v1.25.7

7、Install the Dashboard

Reference chapter: Dashboard deployment: https://www.cnblogs.com/ygbh/p/17221496.html

8、Cluster State So Far

8.1、Background

By default, our current three-master cluster tolerates the loss of only one master; if two masters go down, the whole cluster becomes unusable.
 -- In general, after the first master is shut down it takes about a minute before it shows up as offline; as soon as a second master fails, the cluster immediately stops working.
 -- For a production cluster, the rule of thumb is that the surviving members must still form a majority, i.e. more than n/2 nodes must remain; a cluster of n members therefore tolerates floor((n-1)/2) failures.
A 3-node cluster tolerates only 1 failure.
A 5-node cluster tolerates only 2 failures.
The root cause is that all cluster state is stored in etcd, a distributed, consistent key-value store that follows the quorum requirements of distributed consensus.

8.2、Make Sure nginx Lists All Backend Master Nodes

cat /usr/local/nginx-1.20.0/conf/conf.d/apiserver.conf 
stream {
    upstream kube-apiserver {
        server 192.168.10.26:6443     max_fails=3 fail_timeout=30s;
        server 192.168.10.27:6443     max_fails=3 fail_timeout=30s;
        server 192.168.10.28:6443     max_fails=3 fail_timeout=30s;
    }
    server {
        listen 6443;
        proxy_connect_timeout 2s;
        proxy_timeout 900s;
        proxy_pass kube-apiserver;
    }
}
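
After editing the configuration, a hedged check-and-reload (the sbin path is an assumption based on the install prefix shown above):

/usr/local/nginx-1.20.0/sbin/nginx -t           # validate the configuration
/usr/local/nginx-1.20.0/sbin/nginx -s reload    # reload without dropping connections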

8.3、Shut Down master3

systemctl stop kubelet containerd    # or simply power the machine off

8.4、Check Whether the Node Status Is as Expected

[root@master1 ~]# kubectl get node
NAME      STATUS     ROLES           AGE    VERSION
master1   Ready      control-plane   107m   v1.25.7
master2   Ready      control-plane   104m   v1.25.7
master3   NotReady   control-plane   104m   v1.25.7
node1     Ready      <none>          53m    v1.25.7
node2     Ready      <none>          53m    v1.25.7
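
Because two of the three etcd members (and API servers) are still up, the control plane keeps serving requests through the remaining nginx backends. A quick hedged sanity check:

# Sketch: the API server static pods on master1/master2 should still be Running
kubectl -n kube-system get pods -l component=kube-apiserver -o wide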

9、The etcd Cluster

9.1、Current State

At the moment each master runs its own etcd member (one etcd per master). After the master cluster is built, the etcd certificates are identical across masters, but the etcd configuration may end up inconsistent, which can make etcd fail. In that case it is enough to fix the configuration file /etc/kubernetes/manifests/etcd.yaml.

The underlying principles are explained in the official documentation: https://kubernetes.io/zh-cn/docs/setup/production-environment/tools/kubeadm/setup-ha-etcd-with-kubeadm/

9.2、Edit the etcd Configuration File

9.2.1、Notes

/etc/kubernetes/manifests contains static Pod manifests. After editing a YAML file there, the Pod does not need to be restarted by hand; kubelet recreates it automatically (a quick way to watch this is sketched below).
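
A hedged way to watch kubelet recreate the static etcd Pods after the manifest is saved (component=etcd is the label kubeadm puts on them):

kubectl -n kube-system get pods -l component=etcd -o wide -w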

9.2.2、Edit the etcd.yaml Configuration File (must be modified on all three master nodes)

]# cat /etc/kubernetes/manifests/etcd.yaml 
...
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://192.168.10.27:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --experimental-initial-corrupt-check=true
    - --experimental-watch-progress-notify-interval=5s
    - --initial-advertise-peer-urls=https://192.168.10.27:2380
    - --initial-cluster=master1=https://192.168.10.26:2380,master2=https://192.168.10.27:2380,master3=https://192.168.10.28:2380 # this is the part to modify
    - --initial-cluster-state=existing
 ...

9.3、Check the etcd Cluster Status

9.3.1、Query from Inside the etcd Pod

# Enter the etcd pod on master1 first
master1 ~]# kubectl -n kube-system exec -it etcd-master1 -- /bin/sh

# List the members
sh-5.1# etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key member list
49a8f24b174de1fa, started, master2, https://192.168.10.27:2380, https://192.168.10.27:2379, false
4a182a6f514944cc, started, master1, https://192.168.10.26:2380, https://192.168.10.26:2379, false
fe67d236c75cf789, started, master3, https://192.168.10.28:2380, https://192.168.10.28:2379, false

# Endpoint status
sh-5.1# etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --endpoints=https://192.168.10.26:2379,https://192.168.10.27:2379,https://192.168.10.28:2379 endpoint status
https://192.168.10.26:2379, 4a182a6f514944cc, 3.5.6, 4.7 MB, true, false, 89, 1485767, 1485767,
https://192.168.10.27:2379, 49a8f24b174de1fa, 3.5.6, 5.2 MB, false, false, 89, 1485768, 1485768,
https://192.168.10.28:2379, fe67d236c75cf789, 3.5.6, 4.9 MB, false, false, 89, 1485768, 1485768,

# Health check
sh-5.1# etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --endpoints=https://192.168.10.26:2379,https://192.168.10.27:2379,https://192.168.10.28:2379 endpoint health
https://192.168.10.26:2379 is healthy: successfully committed proposal: took = 11.676406ms
https://192.168.10.28:2379 is healthy: successfully committed proposal: took = 12.374045ms
https://192.168.10.27:2379 is healthy: successfully committed proposal: took = 14.856308ms
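
A hedged readability tip: etcdctl accepts -w table (short for --write-out=table), which prints the same endpoint information with column headers:

sh-5.1# etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --endpoints=https://192.168.10.26:2379,https://192.168.10.27:2379,https://192.168.10.28:2379 \
  endpoint status -w table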

10、Handling an Abnormal Master Node

10.1、Cause

Sometimes a virtual machine is suspended for so long that, after it is started again, the master node can no longer rejoin the cluster. This rarely happens in production, since masters are not normally shut down for long periods.

10.2、Operations on the Cluster

10.2.1、Delete the Faulty Master Node

kubectl delete node master3

 

10.2.2、Enter the Cluster's etcd Pod and Remove the Faulty Member

master1 ~]# kubectl -n kube-system exec -it etcd-master1 -- /bin/sh

# List the etcd cluster members
etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
336bb7385f7571c, started, master2, https://192.168.10.27:2380, https://192.168.10.27:2379, false
4a182a6f514944cc, started, master1, https://192.168.10.26:2380, https://192.168.10.26:2379, false
602e8852aad940e0, started, master3, https://192.168.10.28:2380, https://192.168.10.28:2379, false

# Remove the faulty member master3
etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 602e8852aad940e0

Note: if this step is skipped, joining the cluster again will fail with an error like the following:
Failed to get etcd status for https://192.168.10.28:2379: failed to dial endpoint https://192.168.10.28:2379 with maintenance client: context deadline exceeded

10.2.3、Print the Join Command

kubeadm token create --print-join-command
#   kubeadm join vip.k8test.com:6443 --token hpezbw.amy350duzlzmi9gb --discovery-token-ca-cert-hash sha256:92e95dc47dcca7c8977004e2b321b09fe138ac223d95086951f600751d82d69a --control-plane


# --control-plane must be appended manually (the printed command does not include it)

10.3、Operations on the Faulty Master Node

10.3.1、Reset the Node Data

# The node data must be reset first
kubeadm reset

10.3.2、Create the kubernetes Directories

mkdir -p /etc/kubernetes/pki/etcd

10.3.3、Copy the CA Certificates from Any Existing Master to This Node

scp -rp /etc/kubernetes/pki/ca.* master3:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/sa.* master3:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/front-proxy-ca.* master3:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/etcd/ca.* master3:/etc/kubernetes/pki/etcd
scp -rp /etc/kubernetes/admin.conf master3:/etc/kubernetes

10.3.4、Rejoin the Cluster

kubeadm join vip.k8test.com:6443 --token hpezbw.amy350duzlzmi9gb --discovery-token-ca-cert-hash sha256:92e95dc47dcca7c8977004e2b321b09fe138ac223d95086951f600751d82d69a --control-plane 
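
A hedged final check back on master1: the node should reappear and become Ready, and the etcd member list should again show three started members.

kubectl get nodes
kubectl -n kube-system exec etcd-master1 -- etcdctl \
  --endpoints 127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key member list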

 
