When k8s components and the network plugin go down: a demonstration of whether existing pods keep running

Environment

This environment was first set up in the post linked above; the k8s resource configuration is in that post.

mcwk8s03 is the master; mcwk8s05 and mcwk8s06 are the nodes.

[root@mcwk8s03 mcwtest]# kubectl get nodes -o wide
NAME       STATUS   ROLES    AGE    VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION          CONTAINER-RUNTIME
mcwk8s05   Ready    <none>   581d   v1.15.12   10.0.0.35     <none>        CentOS Linux 7 (Core)   3.10.0-693.el7.x86_64   docker://20.10.21
mcwk8s06   Ready    <none>   581d   v1.15.12   10.0.0.36     <none>        CentOS Linux 7 (Core)   3.10.0-693.el7.x86_64   docker://20.10.21
[root@mcwk8s03 mcwtest]#

[root@mcwk8s03 mcwtest]# kubectl get svc
NAME          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
kubernetes    ClusterIP   10.2.0.1     <none>        443/TCP          584d
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   26h
nginx         ClusterIP   None         <none>        80/TCP           414d
[root@mcwk8s03 mcwtest]# kubectl get svc|grep mcwtest
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   26h
[root@mcwk8s03 mcwtest]# kubectl get deploy|grep mcwtest
mcwtest-deploy     1/1     1            1           26h
[root@mcwk8s03 mcwtest]# kubectl get pod|grep mcwtest
mcwtest-deploy-6465665557-g9zjd     1/1     Running            1          25h
[root@mcwk8s03 mcwtest]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-g9zjd     1/1     Running            1          25h    172.17.89.10   mcwk8s05   <none>           <none>
[root@mcwk8s03 mcwtest]#

Before any service is stopped, the pod is reachable through nodeip:nodeport.

[root@mcwk8s03 mcwtest]# kubectl get svc
NAME          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
kubernetes    ClusterIP   10.2.0.1     <none>        443/TCP          584d
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   26h
nginx         ClusterIP   None         <none>        80/TCP           414d
[root@mcwk8s03 mcwtest]# date 
Thu Jun  6 00:50:34 CST 2024
[root@mcwk8s03 mcwtest]# curl  -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 16:50:42 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s03 mcwtest]# curl  -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 16:50:46 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s03 mcwtest]#
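
The PORT(S) column is what makes the nodeip:nodeport requests above work: 2024:33958/TCP means that service port 2024 is exposed as node port 33958 on every node. As a small illustration, the sketch below splits that column in Python. The function name parse_service_ports is made up for this example and is not part of kubectl.

```python
def parse_service_ports(ports_column):
    """Split a kubectl PORT(S) value such as '2024:33958/TCP,80/TCP'.

    Hypothetical helper for illustration only.
    """
    result = []
    for item in ports_column.split(","):
        mapping, _, protocol = item.partition("/")
        port, _, node_port = mapping.partition(":")
        result.append({
            "port": int(port),
            # ClusterIP-only services have no node port part
            "node_port": int(node_port) if node_port else None,
            "protocol": protocol,
        })
    return result


# Values taken from the `kubectl get svc` output above.
for node_ip in ("10.0.0.35", "10.0.0.36"):
    for entry in parse_service_ports("2024:33958/TCP"):
        print(f"{node_ip}:{entry['node_port']} -> service port {entry['port']}/{entry['protocol']}")
```

Both node IPs are listed because a NodePort is opened on every node, not only on the node that runs the pod.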

【2】Stop the components on the master

[root@mcwk8s03 /]# systemctl stop kube-apiserver.service 
[root@mcwk8s03 /]# systemctl status kube-apiserver.service 
● kube-apiserver.service - Kubernetes API Server
   Loaded: loaded (/usr/lib/systemd/system/kube-apiserver.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Thu 2024-06-06 00:54:51 CST; 14s ago
     Docs: https://github.com/kubernetes/kubernetes
  Process: 19837 ExecStart=/opt/kubernetes/bin/kube-apiserver $KUBE_APISERVER_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 19837 (code=exited, status=0/SUCCESS)

Jun 06 00:54:51 mcwk8s03 kube-apiserver[19837]: I0606 00:54:51.395965   19837 wrap.go:47] GET /apis/rbac.authorization.k8s.io/v1/clus...3:5887]
Jun 06 00:54:51 mcwk8s03 kube-apiserver[19837]: I0606 00:54:51.396002   19837 wrap.go:47] GET /apis/storage.k8s.io/v1/volumeattachmen...3:5887]
Jun 06 00:54:51 mcwk8s03 kube-apiserver[19837]: I0606 00:54:51.396033   19837 wrap.go:47] GET /apis/admissionregistration.k8s.io/v1be...3:5887]
Jun 06 00:54:51 mcwk8s03 kube-apiserver[19837]: I0606 00:54:51.396047   19837 wrap.go:47] GET /apis/rbac.authorization.k8s.io/v1/role...3:5887]
Jun 06 00:54:51 mcwk8s03 kube-apiserver[19837]: I0606 00:54:51.396068   19837 wrap.go:47] GET /api/v1/nodes?resourceVersion=4830599&t...3:5887]
Jun 06 00:54:51 mcwk8s03 kube-apiserver[19837]: I0606 00:54:51.396083   19837 wrap.go:47] GET /api/v1/secrets?resourceVersion=4710803...3:5887]
Jun 06 00:54:51 mcwk8s03 kube-apiserver[19837]: I0606 00:54:51.396108   19837 wrap.go:47] GET /api/v1/namespaces?resourceVersion=4710...3:5887]
Jun 06 00:54:51 mcwk8s03 kube-apiserver[19837]: I0606 00:54:51.405097   19837 wrap.go:47] GET /api/v1/namespaces/default/endpoints/ku...3:5887]
Jun 06 00:54:51 mcwk8s03 kube-apiserver[19837]: E0606 00:54:51.408262   19837 controller.go:179] no master IPs were listed in storage...service
Jun 06 00:54:51 mcwk8s03 systemd[1]: Stopped Kubernetes API Server.
Hint: Some lines were ellipsized, use -l to show in full.
[root@mcwk8s03 /]#

kubectl commands now fail:

[root@mcwk8s03 /]# kubectl get svc
The connection to the server localhost:8080 was refused - did you specify the right host or port?
[root@mcwk8s03 /]# kubectl get nodes
The connection to the server localhost:8080 was refused - did you specify the right host or port?
[root@mcwk8s03 /]#

Errors in /var/log/messages:

Jun  6 00:58:11 mcwk8s03 kube-scheduler: E0606 00:58:11.720321  123920 reflector.go:125] k8s.io/client-go/informers/factory.go:133: 
Failed to list *v1.ReplicaSet: Get http://127.0.0.1:8080/apis/apps/v1/replicasets?limit=500&resourceVersion=0: dial tcp 127.0.0.1:8080: connect: connection refused

The container behind nodeip:nodeport is unaffected and still running.

[root@mcwk8s03 mcwtest]# curl  -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:01:02 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s03 mcwtest]# date
Thu Jun  6 01:01:12 CST 2024
[root@mcwk8s03 mcwtest]#

It is still reachable.

Stop kube-scheduler and kube-controller-manager as well; the pod keeps serving normally.

[root@mcwk8s03 mcwtest]# systemctl stop kube-scheduler.service 
[root@mcwk8s03 mcwtest]# systemctl stop kube-controller-manager.service 
[root@mcwk8s03 mcwtest]# date
Thu Jun  6 01:03:23 CST 2024
[root@mcwk8s03 mcwtest]# curl  -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:03:26 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s03 mcwtest]#

The master runs neither kubelet nor kube-proxy.

[root@mcwk8s03 mcwtest]# ps -ef|grep proxy
root     125098   1429  0 01:05 pts/0    00:00:00 grep --color=auto proxy
[root@mcwk8s03 mcwtest]# ps -ef|grep let
root     125106   1429  0 01:05 pts/0    00:00:00 grep --color=auto let
[root@mcwk8s03 mcwtest]#

【3】Stop the components on the nodes

[root@mcwk8s03 mcwtest]# curl  -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:06:41 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s03 mcwtest]# curl  -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:06:44 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s03 mcwtest]#

Since stopping the master components does not affect existing pods, start the apiserver again first so that the environment is easier to inspect.

Only the apiserver is started on the master. Once it is up, pods and other resources are listed normally again.

[root@mcwk8s03 mcwtest]# systemctl start apiserver
Failed to start apiserver.service: Unit not found.
[root@mcwk8s03 mcwtest]# systemctl start kube-apiserver.service 
[root@mcwk8s03 mcwtest]# kubectl get svc
NAME          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
kubernetes    ClusterIP   10.2.0.1     <none>        443/TCP          584d
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   26h
nginx         ClusterIP   None         <none>        80/TCP           414d
[root@mcwk8s03 mcwtest]# kubectl get nodes
NAME       STATUS   ROLES    AGE    VERSION
mcwk8s05   Ready    <none>   581d   v1.15.12
mcwk8s06   Ready    <none>   581d   v1.15.12
[root@mcwk8s03 mcwtest]#

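Check the component status. The three master components stopped earlier (kube-apiserver, kube-scheduler, kube-controller-manager) do not affect nodeip:nodeport access to existing pods.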

[root@mcwk8s03 mcwtest]# kubectl get cs
NAME                 STATUS      MESSAGE                                                                                     ERROR
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused   
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused   
etcd-1               Healthy     {"health":"true"}                                                                           
etcd-2               Healthy     {"health":"true"}                                                                           
etcd-0               Healthy     {"health":"true"}                                                                           
[root@mcwk8s03 mcwtest]#
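
When this check is repeated after each step, it can help to reduce the `kubectl get cs` table to just the components that are not healthy. The following is a minimal sketch that parses the text shown above; unhealthy_components is a hypothetical helper name, and the parsing assumes the column layout of this v1.15 output.

```python
# Output copied from the transcript above (header shortened).
CS_OUTPUT = """\
NAME                 STATUS      MESSAGE                                                                                     ERROR
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused
etcd-1               Healthy     {"health":"true"}
etcd-2               Healthy     {"health":"true"}
etcd-0               Healthy     {"health":"true"}
"""


def unhealthy_components(cs_output):
    """Return the names of components whose STATUS column is not 'Healthy'."""
    bad = []
    # Skip the header line; NAME and STATUS are the first two columns.
    for line in cs_output.strip().splitlines()[1:]:
        fields = line.split(None, 2)
        if len(fields) >= 2 and fields[1] != "Healthy":
            bad.append(fields[0])
    return bad


print(unhealthy_components(CS_OUTPUT))
```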

On the master, look up a ClusterIP to test with.

[root@mcwk8s03 mcwtest]# kubectl get svc
NAME          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
kubernetes    ClusterIP   10.2.0.1     <none>        443/TCP          584d
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   26h
nginx         ClusterIP   None         <none>        80/TCP           414d
[root@mcwk8s03 mcwtest]#

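Then access it from a node; that works as well. Why not request clusterIP:port from the master? Presumably because the master has no IPVS rules: kube-proxy and the other node services are not deployed there.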

[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:12:34 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]#

Stop the kubelet service on mcwk8s05 and mcwk8s06.

[root@mcwk8s05 ~]# systemctl stop kubelet.service 
[root@mcwk8s05 ~]# 

[root@mcwk8s06 ~]# systemctl stop kubelet.service 
[root@mcwk8s06 ~]#

Access to the container through nodeip:nodeport still works, and so does access through clusterIP:port.

[root@mcwk8s05 ~]# curl  -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:19:13 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl  -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:19:17 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:19:21 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]#

Next, stop the kube-proxy service, starting with node 06.

[root@mcwk8s06 ~]# systemctl stop kube-proxy.service 
[root@mcwk8s06 ~]#

The service is still reachable through node 06's nodeip:nodeport.

[root@mcwk8s05 ~]# curl  -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:23:41 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:23:44 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]#

As shown below, the routes and the IPVS rules on 06 are all still in place.

[root@mcwk8s06 ~]# ipvsadm -Ln|grep -C 2 10.0.0.36
  -> 172.17.9.2:9100              Masq    1      0          0         
  -> 172.17.89.2:9100             Masq    1      0          0         
TCP  10.0.0.36:31672 rr
  -> 172.17.9.2:9100              Masq    1      0          0         
  -> 172.17.89.2:9100             Masq    1      0          0         
TCP  10.0.0.36:33958 rr
  -> 172.17.89.10:20000           Masq    1      0          1         
TCP  10.0.0.36:46735 rr
  -> 172.17.89.13:3000            Masq    1      0          0         
TCP  10.2.0.1:443 rr
--
TCP  172.17.9.1:46735 rr
  -> 172.17.89.13:3000            Masq    1      0          0         
TCP  10.0.0.36:30001 rr
  -> 172.17.89.5:8443             Masq    1      0          0         
TCP  10.0.0.36:30003 rr
  -> 172.17.89.4:9090             Masq    1      0          0         
TCP  10.2.0.155:2024 rr
[root@mcwk8s06 ~]# 
[root@mcwk8s06 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.0.254      0.0.0.0         UG    100    0        0 eth0
10.0.0.0        0.0.0.0         255.255.255.0   U     100    0        0 eth0
172.17.9.0      0.0.0.0         255.255.255.0   U     0      0        0 docker0
172.17.83.0     172.17.83.0     255.255.255.0   UG    0      0        0 flannel.1
172.17.89.0     172.17.89.0     255.255.255.0   UG    0      0        0 flannel.1
[root@mcwk8s06 ~]#
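
The IPVS table is what maps 10.0.0.36:33958 to the pod at 172.17.89.10:20000, and it lives in the kernel, which is why it survives stopping kube-proxy. To make that mapping explicit, here is a rough sketch that turns `ipvsadm -Ln` text into a dictionary. parse_ipvs is a name invented for this example, and the parser only handles the TCP/UDP and `->` lines seen in this output.

```python
# A few lines copied from the ipvsadm output above.
IPVS_OUTPUT = """\
TCP  10.0.0.36:33958 rr
  -> 172.17.89.10:20000           Masq    1      0          1
TCP  10.0.0.36:46735 rr
  -> 172.17.89.13:3000            Masq    1      0          0
--
TCP  10.2.0.155:2024 rr
"""


def parse_ipvs(text):
    """Map (protocol, virtual address) to the list of real-server addresses."""
    table = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if fields == ["--"]:
            # grep context separator: the next lines belong to another match
            current = None
        elif len(fields) >= 2 and fields[0] in ("TCP", "UDP"):
            current = (fields[0], fields[1])
            table[current] = []
        elif len(fields) >= 2 and fields[0] == "->" and current is not None:
            table[current].append(fields[1])
    return table


rules = parse_ipvs(IPVS_OUTPUT)
print(rules[("TCP", "10.0.0.36:33958")])
```

The entry for 10.2.0.155:2024 comes back with an empty backend list only because `grep -C 2` cut the output off right after that line.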

The pod runs on mcwk8s05. Stopping kube-proxy on 05 as well does not affect the existing pod either. kube-proxy is now stopped on both nodes.

[root@mcwk8s05 /]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:26:06 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 /]# curl  -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:26:34 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 /]# curl  -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:26:37 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 /]#

【4】Stop the flannel service. This section includes an important walkthrough of how container traffic is routed.

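Before stopping flannel, check the current state on the master. It has routes to the container subnets of the other two hosts, and nodeip:nodeport is reachable. The master has no IPVS rules, so a request to clusterIP:port made there cannot be forwarded to containerIP:containerPort. A request to nodeip:nodeport needs no IPVS rules on the client side: it goes straight to the node IP, and once it arrives at nodeip:nodeport, the IPVS rules on that node forward it to containerIP:containerPort. If the container runs on the node that received the request, the packet reaches it directly through docker0. Otherwise it travels over the cross-host container network to the host that runs the container, and from there to the container IP.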
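
The last step of that path, docker0 versus cross-host forwarding, is decided by the routing table of the node that received the request. The sketch below replays that decision for the pod IP 172.17.89.10 using the `route -n` tables of mcwk8s06 and mcwk8s05 from this post. It is a simplified model: the helper names are made up, and it does a plain longest-prefix match without considering metrics or policy routing.

```python
import ipaddress

# Container-related routes as shown by `route -n` on each node in this post.
ROUTES_06 = """\
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.0.254      0.0.0.0         UG    100    0        0 eth0
10.0.0.0        0.0.0.0         255.255.255.0   U     100    0        0 eth0
172.17.9.0      0.0.0.0         255.255.255.0   U     0      0        0 docker0
172.17.83.0     172.17.83.0     255.255.255.0   UG    0      0        0 flannel.1
172.17.89.0     172.17.89.0     255.255.255.0   UG    0      0        0 flannel.1
"""
ROUTES_05 = """\
0.0.0.0         10.0.0.254      0.0.0.0         UG    100    0        0 eth0
10.0.0.0        0.0.0.0         255.255.255.0   U     100    0        0 eth0
172.17.9.0      172.17.9.0      255.255.255.0   UG    0      0        0 flannel.1
172.17.83.0     172.17.83.0     255.255.255.0   UG    0      0        0 flannel.1
172.17.89.0     0.0.0.0         255.255.255.0   U     0      0        0 docker0
"""


def parse_routes(route_n_output):
    """Return (network, gateway, interface) tuples; non-route lines are skipped."""
    routes = []
    for line in route_n_output.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue
        try:
            # Convert the dotted netmask to a prefix length by counting set bits.
            prefix = bin(int(ipaddress.IPv4Address(fields[2]))).count("1")
            network = ipaddress.ip_network(f"{fields[0]}/{prefix}")
        except ValueError:
            continue  # header line or malformed entry
        routes.append((network, fields[1], fields[7]))
    return routes


def outgoing_interface(routes, destination):
    """Pick the most specific matching route and return its interface name."""
    address = ipaddress.ip_address(destination)
    matches = [route for route in routes if address in route[0]]
    if not matches:
        return None
    return max(matches, key=lambda route: route[0].prefixlen)[2]


POD_IP = "172.17.89.10"
print("from mcwk8s06:", outgoing_interface(parse_routes(ROUTES_06), POD_IP))
print("from mcwk8s05:", outgoing_interface(parse_routes(ROUTES_05), POD_IP))
```

On mcwk8s06 the pod IP matches the flannel.1 route (cross-host), while on mcwk8s05, where the pod runs, it matches the docker0 route (local).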

[root@mcwk8s03 mcwtest]# systemctl status flanneld.service 
● flanneld.service - Flanneld overlay address etcd agent
   Loaded: loaded (/usr/lib/systemd/system/flanneld.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2024-06-04 23:28:50 CST; 1 day 2h ago
 Main PID: 11892 (flanneld)
   Memory: 13.9M
   CGroup: /system.slice/flanneld.service
           └─11892 /opt/kubernetes/bin/flanneld --ip-masq --etcd-endpoints=https://10.0.0.33:2379,https://10.0.0.35:2379,https://10.0.0.36:2...

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
[root@mcwk8s03 mcwtest]# 
[root@mcwk8s03 mcwtest]# 
[root@mcwk8s03 mcwtest]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.0.254      0.0.0.0         UG    100    0        0 eth0
10.0.0.0        0.0.0.0         255.255.255.0   U     100    0        0 eth0
172.17.9.0      172.17.9.0      255.255.255.0   UG    0      0        0 flannel.1
172.17.83.0     0.0.0.0         255.255.255.0   U     0      0        0 docker0
172.17.89.0     172.17.89.0     255.255.255.0   UG    0      0        0 flannel.1
[root@mcwk8s03 mcwtest]# ifconfig flannel.1
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 172.17.83.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::b083:33ff:fe7b:fd37  prefixlen 64  scopeid 0x20<link>
        ether b2:83:33:7b:fd:37  txqueuelen 0  (Ethernet)
        RX packets 11  bytes 924 (924.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 11  bytes 924 (924.0 B)
        TX errors 0  dropped 8 overruns 0  carrier 0  collisions 0

[root@mcwk8s03 mcwtest]# ifconfig docker
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.83.1  netmask 255.255.255.0  broadcast 172.17.83.255
        ether 02:42:e9:a4:51:4f  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@mcwk8s03 mcwtest]# curl 10.2.0.155:2024
^C
[root@mcwk8s03 mcwtest]# curl  -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:30:23 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s03 mcwtest]#

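After flannel is stopped on the master (mcwk8s03), the existing container service is still reachable through nodeip:nodeport. The routes are still there, and the interface subnets have not changed.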

[root@mcwk8s03 mcwtest]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.0.254      0.0.0.0         UG    100    0        0 eth0
10.0.0.0        0.0.0.0         255.255.255.0   U     100    0        0 eth0
172.17.9.0      172.17.9.0      255.255.255.0   UG    0      0        0 flannel.1
172.17.83.0     0.0.0.0         255.255.255.0   U     0      0        0 docker0
172.17.89.0     172.17.89.0     255.255.255.0   UG    0      0        0 flannel.1
[root@mcwk8s03 mcwtest]# ifconfig flannel.1
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 172.17.83.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::b083:33ff:fe7b:fd37  prefixlen 64  scopeid 0x20<link>
        ether b2:83:33:7b:fd:37  txqueuelen 0  (Ethernet)
        RX packets 11  bytes 924 (924.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 11  bytes 924 (924.0 B)
        TX errors 0  dropped 8 overruns 0  carrier 0  collisions 0

[root@mcwk8s03 mcwtest]# ifconfig docker
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.83.1  netmask 255.255.255.0  broadcast 172.17.83.255
        ether 02:42:e9:a4:51:4f  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@mcwk8s03 mcwtest]#

Next, look at node 05 before stopping its flannel service. At this point the pod service is reachable as usual.

[root@mcwk8s05 ~]# systemctl status flanneld.service 
● flanneld.service - Flanneld overlay address etcd agent
   Loaded: loaded (/usr/lib/systemd/system/flanneld.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2024-06-05 00:50:25 CST; 24h ago
  Process: 4201 ExecStartPost=/opt/kubernetes/bin/mk-docker-opts.sh -k DOCKER_NETWORK_OPTIONS -d /run/flannel/subnet.env (code=exited, status=0/SUCCESS)
 Main PID: 4184 (flanneld)
   Memory: 16.2M
   CGroup: /system.slice/flanneld.service
           └─4184 /opt/kubernetes/bin/flanneld --ip-masq --etcd-endpoints=https://10.0.0.33:2379,https://10.0.0.35:2379,https://10.0.0.36:2379 -etcd-cafile=/opt/etcd/ssl/ca.pem -etcd-cert...

Jun 05 23:50:25 mcwk8s05 flanneld[4184]: I0605 23:50:25.459783    4184 main.go:388] Lease renewed, new expiration: 2024-06-06 15:50:25.409920414 +0000 UTC
Jun 05 23:50:25 mcwk8s05 flanneld[4184]: I0605 23:50:25.459933    4184 main.go:396] Waiting for 22h59m59.949994764s to renew lease
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
[root@mcwk8s05 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.0.254      0.0.0.0         UG    100    0        0 eth0
10.0.0.0        0.0.0.0         255.255.255.0   U     100    0        0 eth0
172.17.9.0      172.17.9.0      255.255.255.0   UG    0      0        0 flannel.1
172.17.83.0     172.17.83.0     255.255.255.0   UG    0      0        0 flannel.1
172.17.89.0     0.0.0.0         255.255.255.0   U     0      0        0 docker0
[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:46:55 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl  -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:47:00 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl  -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:47:04 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# ifconfig flannel.1
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 172.17.89.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::3470:76ff:feea:39b8  prefixlen 64  scopeid 0x20<link>
        ether 36:70:76:ea:39:b8  txqueuelen 0  (Ethernet)
        RX packets 78  bytes 5468 (5.3 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 86  bytes 7144 (6.9 KiB)
        TX errors 0  dropped 8 overruns 0  carrier 0  collisions 0

[root@mcwk8s05 ~]# ifconfig docker
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 172.17.89.1  netmask 255.255.255.0  broadcast 172.17.89.255
        inet6 fe80::42:18ff:fee1:e8fc  prefixlen 64  scopeid 0x20<link>
        ether 02:42:18:e1:e8:fc  txqueuelen 0  (Ethernet)
        RX packets 1212277  bytes 505391601 (481.9 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1374361  bytes 1966293845 (1.8 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@mcwk8s05 ~]#

Stop the flannel service on both nodes.

[root@mcwk8s05 ~]# systemctl stop flanneld.service 
[root@mcwk8s05 ~]# 


[root@mcwk8s06 ~]# systemctl stop flanneld.service
[root@mcwk8s06 ~]#

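With flannel stopped everywhere, check node 05 again. The container service is still reachable, through both nodeip:nodeport and clusterip:port. So far, stopping the k8s components and the network plugin has had no effect on the existing pod. In other words, as far as this test shows, a pod that has already been created does not depend on the k8s components or the network plugin to keep serving. They are needed when the pod is created, to set up its containers; once the containers exist, they keep working without these components.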

[root@mcwk8s05 ~]# curl  -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:49:28 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl  -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:49:33 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:49:50 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# ifconfig flannel.1
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 172.17.89.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::3470:76ff:feea:39b8  prefixlen 64  scopeid 0x20<link>
        ether 36:70:76:ea:39:b8  txqueuelen 0  (Ethernet)
        RX packets 83  bytes 5816 (5.6 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 91  bytes 7576 (7.3 KiB)
        TX errors 0  dropped 8 overruns 0  carrier 0  collisions 0

[root@mcwk8s05 ~]# ifconfig docker
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 172.17.89.1  netmask 255.255.255.0  broadcast 172.17.89.255
        inet6 fe80::42:18ff:fee1:e8fc  prefixlen 64  scopeid 0x20<link>
        ether 02:42:18:e1:e8:fc  txqueuelen 0  (Ethernet)
        RX packets 1213264  bytes 505536182 (482.1 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1375186  bytes 1967017624 (1.8 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@mcwk8s05 ~]#

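Now look at the network information stored in etcd, from the master. The question is whether starting the flannel service again will change the host's container subnet, and so affect the existing containers and cause network problems.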

[root@mcwk8s03 mcwtest]# etcdctl ls /
/coreos.com
[root@mcwk8s03 mcwtest]# etcdctl ls /coreos.com
/coreos.com/network
[root@mcwk8s03 mcwtest]# etcdctl ls /coreos.com/network
/coreos.com/network/config
/coreos.com/network/subnets
[root@mcwk8s03 mcwtest]# etcdctl ls /coreos.com/network/subnets/
/coreos.com/network/subnets/172.17.83.0-24
/coreos.com/network/subnets/172.17.9.0-24
/coreos.com/network/subnets/172.17.89.0-24
[root@mcwk8s03 mcwtest]# etcdctl ls /coreos.com/network/config
/coreos.com/network/config
[root@mcwk8s03 mcwtest]# etcdctl get /coreos.com/network/config
{ "Network": "172.17.0.0/16", "Backend": {"Type": "vxlan"}}
[root@mcwk8s03 mcwtest]#
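
The keys under /coreos.com/network/subnets/ encode one /24 lease per host, with the prefix length written after a dash instead of a slash. As a quick sanity check, the sketch below converts those keys into networks and confirms that each one falls inside the "Network" value from the config key. The helper name subnet_from_key is an assumption for illustration only.

```python
import ipaddress
import json

# Values copied from the etcdctl output above.
CONFIG = '{ "Network": "172.17.0.0/16", "Backend": {"Type": "vxlan"}}'
SUBNET_KEYS = [
    "/coreos.com/network/subnets/172.17.83.0-24",
    "/coreos.com/network/subnets/172.17.9.0-24",
    "/coreos.com/network/subnets/172.17.89.0-24",
]


def subnet_from_key(key):
    """Turn '.../subnets/172.17.83.0-24' into the network 172.17.83.0/24."""
    name = key.rsplit("/", 1)[-1]
    address, _, prefix = name.rpartition("-")
    return ipaddress.ip_network(f"{address}/{prefix}")


overlay = ipaddress.ip_network(json.loads(CONFIG)["Network"])
leases = [subnet_from_key(key) for key in SUBNET_KEYS]

for lease in leases:
    # subnet_of requires Python 3.7 or newer
    print(lease, "inside overlay:", lease.subnet_of(overlay))
```

These three leases match the docker0 subnets seen on mcwk8s03 (172.17.83.0/24), mcwk8s06 (172.17.9.0/24) and mcwk8s05 (172.17.89.0/24).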

After the flannel service is started again, the host's container subnet has not changed.

[root@mcwk8s05 ~]# ifconfig flannel.1
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 172.17.89.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::3470:76ff:feea:39b8  prefixlen 64  scopeid 0x20<link>
        ether 36:70:76:ea:39:b8  txqueuelen 0  (Ethernet)
        RX packets 83  bytes 5816 (5.6 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 91  bytes 7576 (7.3 KiB)
        TX errors 0  dropped 8 overruns 0  carrier 0  collisions 0

[root@mcwk8s05 ~]# ifconfig docker
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 172.17.89.1  netmask 255.255.255.0  broadcast 172.17.89.255
        inet6 fe80::42:18ff:fee1:e8fc  prefixlen 64  scopeid 0x20<link>
        ether 02:42:18:e1:e8:fc  txqueuelen 0  (Ethernet)
        RX packets 1213264  bytes 505536182 (482.1 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1375186  bytes 1967017624 (1.8 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@mcwk8s05 ~]# 
[root@mcwk8s05 ~]# systemctl status flanneld.service 
● flanneld.service - Flanneld overlay address etcd agent
   Loaded: loaded (/usr/lib/systemd/system/flanneld.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

Jun 05 23:50:25 mcwk8s05 flanneld[4184]: I0605 23:50:25.459783    4184 main.go:388] Lease renewed, new expiration: 2024-06-06 15:50:25.409920414 +0000 UTC
Jun 05 23:50:25 mcwk8s05 flanneld[4184]: I0605 23:50:25.459933    4184 main.go:396] Waiting for 22h59m59.949994764s to renew lease
Jun 06 01:48:25 mcwk8s05 systemd[1]: Stopping Flanneld overlay address etcd agent...
Jun 06 01:48:25 mcwk8s05 flanneld[4184]: I0606 01:48:25.715882    4184 main.go:404] Stopped monitoring lease
Jun 06 01:48:25 mcwk8s05 flanneld[4184]: I0606 01:48:25.716035    4184 main.go:322] Waiting for all goroutines to exit
Jun 06 01:48:25 mcwk8s05 flanneld[4184]: I0606 01:48:25.741648    4184 main.go:337] shutdownHandler sent cancel signal...
Jun 06 01:48:25 mcwk8s05 flanneld[4184]: I0606 01:48:25.759081    4184 main.go:325] Exiting cleanly...
Jun 06 01:48:25 mcwk8s05 systemd[1]: Stopped Flanneld overlay address etcd agent.
[root@mcwk8s05 ~]# systemctl start flanneld.service 
[root@mcwk8s05 ~]# ifconfig flannel.1
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 172.17.89.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::3470:76ff:feea:39b8  prefixlen 64  scopeid 0x20<link>
        ether 36:70:76:ea:39:b8  txqueuelen 0  (Ethernet)
        RX packets 83  bytes 5816 (5.6 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 91  bytes 7576 (7.3 KiB)
        TX errors 0  dropped 8 overruns 0  carrier 0  collisions 0

[root@mcwk8s05 ~]# ifconfig docker
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 172.17.89.1  netmask 255.255.255.0  broadcast 172.17.89.255
        inet6 fe80::42:18ff:fee1:e8fc  prefixlen 64  scopeid 0x20<link>
        ether 02:42:18:e1:e8:fc  txqueuelen 0  (Ethernet)
        RX packets 1216321  bytes 505986272 (482.5 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1377747  bytes 1969464945 (1.8 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@mcwk8s05 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.0.254      0.0.0.0         UG    100    0        0 eth0
10.0.0.0        0.0.0.0         255.255.255.0   U     100    0        0 eth0
172.17.9.0      172.17.9.0      255.255.255.0   UG    0      0        0 flannel.1
172.17.83.0     172.17.83.0     255.255.255.0   UG    0      0        0 flannel.1
172.17.89.0     0.0.0.0         255.255.255.0   U     0      0        0 docker0
[root@mcwk8s05 ~]#
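
Comparing the ifconfig output by eye works here, but the same check can be written down. The sketch below extracts the IPv4 address and netmask from the ifconfig text of flannel.1 and docker0 on mcwk8s05 and verifies that docker0 still sits in the subnet whose network address is assigned to flannel.1. inet_interface is a hypothetical helper, and the regular expression assumes the net-tools output format shown in this post.

```python
import ipaddress
import re

# First lines of the ifconfig output on mcwk8s05 after flanneld was started again.
FLANNEL_1 = """\
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 172.17.89.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::3470:76ff:feea:39b8  prefixlen 64  scopeid 0x20<link>
"""
DOCKER0 = """\
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 172.17.89.1  netmask 255.255.255.0  broadcast 172.17.89.255
        inet6 fe80::42:18ff:fee1:e8fc  prefixlen 64  scopeid 0x20<link>
"""


def inet_interface(ifconfig_text):
    """Return the first IPv4 address/netmask pair as an ip_interface, or None."""
    match = re.search(r"inet (\S+)\s+netmask (\S+)", ifconfig_text)
    if match is None:
        return None
    return ipaddress.ip_interface(f"{match.group(1)}/{match.group(2)}")


flannel = inet_interface(FLANNEL_1)
docker0 = inet_interface(DOCKER0)

print("docker0 subnet:", docker0.network)
print("flannel.1 holds its network address:", flannel.ip == docker0.network.network_address)
```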

The container service is also still reachable from the other machines.

[root@mcwk8s03 mcwtest]# curl  -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 18:01:00 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s03 mcwtest]#

[root@mcwk8s06 ~]# systemctl stop flanneld.service
[root@mcwk8s06 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 18:02:02 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s06 ~]#

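Try once more by restarting the flannel service: the container subnet still does not change. So what caused the subnet changes seen on earlier occasions?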

[root@mcwk8s03 mcwtest]# etcdctl ls /coreos.com/network/subnets/
/coreos.com/network/subnets/172.17.89.0-24
/coreos.com/network/subnets/172.17.83.0-24
/coreos.com/network/subnets/172.17.9.0-24
[root@mcwk8s03 mcwtest]# ifconfig flannel.1
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 172.17.83.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::b083:33ff:fe7b:fd37  prefixlen 64  scopeid 0x20<link>
        ether b2:83:33:7b:fd:37  txqueuelen 0  (Ethernet)
        RX packets 11  bytes 924 (924.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 11  bytes 924 (924.0 B)
        TX errors 0  dropped 8 overruns 0  carrier 0  collisions 0

[root@mcwk8s03 mcwtest]# ifconfig docker
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.83.1  netmask 255.255.255.0  broadcast 172.17.83.255
        ether 02:42:e9:a4:51:4f  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@mcwk8s03 mcwtest]# systemctl restart flanneld.service 
[root@mcwk8s03 mcwtest]# etcdctl ls /coreos.com/network/subnets/
/coreos.com/network/subnets/172.17.9.0-24
/coreos.com/network/subnets/172.17.89.0-24
/coreos.com/network/subnets/172.17.83.0-24
[root@mcwk8s03 mcwtest]# ifconfig flannel.1
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 172.17.83.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::b083:33ff:fe7b:fd37  prefixlen 64  scopeid 0x20<link>
        ether b2:83:33:7b:fd:37  txqueuelen 0  (Ethernet)
        RX packets 11  bytes 924 (924.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 11  bytes 924 (924.0 B)
        TX errors 0  dropped 8 overruns 0  carrier 0  collisions 0

[root@mcwk8s03 mcwtest]# ifconfig docker
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.83.1  netmask 255.255.255.0  broadcast 172.17.83.255
        ether 02:42:e9:a4:51:4f  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@mcwk8s03 mcwtest]#
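
One possible explanation, which this transcript does not confirm, is the subnet lease that flanneld keeps in etcd. The flanneld log on mcwk8s05 shown earlier reports "Lease renewed, new expiration: 2024-06-06 15:50:25 ... UTC" and "Waiting for 22h59m59.949994764s to renew lease". The arithmetic below, a sketch using only those log values, suggests that the lease was still valid for many hours when flanneld was stopped and started, so the same subnet could simply be picked up again. The year 2024 for the journal timestamps and CST = UTC+8 are assumptions inferred from the `date` and HTTP Date lines in this post.

```python
import re
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # assumption: host clock is UTC+8


def parse_go_duration(text):
    """Parse a Go-style duration such as '22h59m59.949994764s' (h, m, s only)."""
    seconds_per_unit = {"h": 3600.0, "m": 60.0, "s": 1.0}
    total = 0.0
    # m(?!s) avoids reading the 'm' of a millisecond suffix as minutes
    for value, unit in re.findall(r"(\d+(?:\.\d+)?)(h|m(?!s)|s)", text):
        total += float(value) * seconds_per_unit[unit]
    return timedelta(seconds=total)


# Timestamps copied from the flanneld journal on mcwk8s05 (fractions dropped).
renewed_at = datetime(2024, 6, 5, 23, 50, 25, tzinfo=CST)
expiration = datetime(2024, 6, 6, 15, 50, 25, tzinfo=timezone.utc)
stopped_at = datetime(2024, 6, 6, 1, 48, 25, tzinfo=CST)
wait_before_renew = parse_go_duration("22h59m59.949994764s")

lease_length = expiration - renewed_at
renew_margin = lease_length - wait_before_renew
remaining_at_stop = expiration - stopped_at

print("lease length:         ", lease_length)
print("renewal margin:       ", renew_margin)
print("lease left at stop:   ", remaining_at_stop)
```

If a host stayed down past the expiration time, its lease could lapse, which would be one scenario to test when looking for the cause of a changed subnet.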

One day later, the network on node 05 turned out to be broken: it was unreachable. Starting kube-proxy fixed it.

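Nothing looks abnormal in the kubectl output. Requests through mcwk8s05 fail, but the pod can be reached through mcwk8s06. The pod runs on mcwk8s05.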

[root@mcwk8s03 mcwtest]# kubectl get svc
NAME          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
kubernetes    ClusterIP   10.2.0.1     <none>        443/TCP          584d
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   2d
nginx         ClusterIP   None         <none>        80/TCP           415d
[root@mcwk8s03 mcwtest]# kubectl get pod
NAME                                READY   STATUS             RESTARTS   AGE
mcwnginx-depoyment-8shfd            1/1     Running            20         415d
mcwnginx-depoyment-ngthk            1/1     Running            7          415d
mcwtest-deploy-6465665557-kbpdr     1/1     Running            0          115m
nginx-deployment-68c7f5464c-n8t2b   1/1     Running            0          115m
nginx-statefulset-0                 1/1     Running            0          19m
nginx-statefulset-1                 1/1     Running            1          46h
nginx-statefulset-2                 1/1     Running            1          46h
prometheus-0                        0/1     CrashLoopBackOff   8          19m
prometheus-1                        0/1     CrashLoopBackOff   8          19m
prometheus-2                        0/1     CrashLoopBackOff   536        46h
[root@mcwk8s03 mcwtest]# curl -I 10.0.0.35:33958
curl: (7) Failed connect to 10.0.0.35:33958; Connection timed out
[root@mcwk8s03 mcwtest]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 14:57:09 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s03 mcwtest]#

The Service and the pod are still present.

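From node 05, the pod cannot be reached through node 05's IP, but it can be reached through node 06's IP. Checking the kube-proxy service on node 05 shows that it has stopped. After starting it, access through node 05's IP works again.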

[root@mcwk8s05 /]# curl  -I 10.0.0.35:33958
curl: (7) Failed connect to 10.0.0.35:33958; Connection timed out
[root@mcwk8s05 /]# curl  -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 15:01:22 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 /]# systemctl status kube-proxy
● kube-proxy.service - Kubernetes Proxy
   Loaded: loaded (/usr/lib/systemd/system/kube-proxy.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Thu 2024-06-06 21:07:32 CST; 1h 54min ago
  Process: 50648 ExecStart=/opt/kubernetes/bin/kube-proxy $KUBE_PROXY_OPTS (code=killed, signal=PIPE)
 Main PID: 50648 (code=killed, signal=PIPE)

Jun 06 20:59:58 mcwk8s05 kube-proxy[50648]: I0606 20:59:58.068662   50648 config.go:132] Calling handler.OnEndpointsUpdate
Jun 06 20:59:58 mcwk8s05 kube-proxy[50648]: I0606 20:59:58.068668   50648 config.go:132] Calling handler.OnEndpointsUpdate
Jun 06 20:59:58 mcwk8s05 kube-proxy[50648]: I0606 20:59:58.068673   50648 config.go:132] Calling handler.OnEndpointsUpdate
Jun 06 20:59:58 mcwk8s05 kube-proxy[50648]: I0606 20:59:58.068679   50648 config.go:132] Calling handler.OnEndpointsUpdate
Jun 06 20:59:58 mcwk8s05 kube-proxy[50648]: I0606 20:59:58.068683   50648 config.go:132] Calling handler.OnEndpointsUpdate
Jun 06 20:59:58 mcwk8s05 kube-proxy[50648]: I0606 20:59:58.068688   50648 config.go:132] Calling handler.OnEndpointsUpdate
Jun 06 21:00:24 mcwk8s05 kube-proxy[50648]: I0606 21:00:02.190985   50648 config.go:132] Calling handler.OnEndpointsUpdate
Jun 06 21:00:30 mcwk8s05 kube-proxy[50648]: I0606 21:00:03.493815   50648 config.go:132] Calling handler.OnEndpointsUpdate
Jun 06 21:00:45 mcwk8s05 kube-proxy[50648]: I0606 21:00:07.350882   50648 config.go:132] Calling handler.OnEndpointsUpdate
Jun 06 21:00:49 mcwk8s05 kube-proxy[50648]: I0606 21:00:16.466427   50648 config.go:132] Calling handler.OnEndpointsUpdate
[root@mcwk8s05 /]# systemctl start kube-proxy
[root@mcwk8s05 /]# curl  -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 15:01:54 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 /]#

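The log in /var/log/messages shows that this rule had 0 active connections and 1 inactive connection, which suggests the IPVS rule was in a bad state. After kube-proxy started, it deleted the rule (perhaps deleting and re-adding it), and only then did connections become active. That raises a question: how can one check that a forwarding real server (RS) is actually valid?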
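
One way to start answering that question is to read the IPVS table programmatically instead of by eye. The sketch below parses text in the format printed by `ipvsadm -Ln` and lists each virtual service with its real servers and connection counters. The sample text is copied from a later step of this test; the function and class names are my own, hypothetical ones. Note the assumption built into the interpretation: ActiveConn and InActConn are only connection counters, so a real server with zeros is not necessarily broken, while a virtual service with no real server at all cannot forward anything.

```python
from typing import Dict, List, NamedTuple


class RealServer(NamedTuple):
    address: str        # "ip:port" of the backend (the pod)
    forward: str        # forwarding method, e.g. "Masq"
    weight: int
    active_conn: int    # ActiveConn column
    inactive_conn: int  # InActConn column


def parse_ipvsadm(text: str) -> Dict[str, List[RealServer]]:
    """Map each virtual service ("TCP ip:port") to its real servers."""
    services: Dict[str, List[RealServer]] = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] in ("TCP", "UDP"):
            # A virtual-service line such as "TCP  10.0.0.35:33958 rr".
            current = f"{fields[0]} {fields[1]}"
            services[current] = []
        elif (len(fields) >= 6 and fields[0] == "->"
              and fields[3].isdigit() and current is not None):
            # A real-server line under the current virtual service.
            services[current].append(
                RealServer(fields[1], fields[2], int(fields[3]),
                           int(fields[4]), int(fields[5])))
    return services


# Output of `ipvsadm -Ln | grep -2 10.0.0.35:33958` from this test.
# The first two "->" lines belong to a service cut off by grep and are skipped.
SAMPLE = """\
  -> 172.17.9.2:9100              Masq    1      0          0
  -> 172.17.89.2:9100             Masq    1      0          0
TCP  10.0.0.35:33958 rr
  -> 172.17.89.8:20000            Masq    1      0          1
TCP  10.0.0.35:46735 rr
"""

for vip, backends in parse_ipvsadm(SAMPLE).items():
    print(vip, backends if backends else "NO REAL SERVER")
```

Comparing the real-server address against the pod IP from `kubectl get pod -o wide` then shows whether the rule points at the current pod or at a stale one.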

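Stopping etcd: effect on access to the containers

The next question is whether stopping etcd affects access to running containers. My guess was that it would: for cross-host container traffic, etcd holds the mapping between each container subnet and its host IP, which is effectively a routing table. That mapping is what tells a node which host the target container lives on, so the packet can be sent there, decapsulated, and handled from there.

Check the environment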

[root@mcwk8s03 mcwtest]# kubectl get svc|grep mcwtest
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   2d1h
[root@mcwk8s03 mcwtest]# kubectl get pod|grep mcwtest
mcwtest-deploy-6465665557-kbpdr     1/1     Running            0          152m
[root@mcwk8s03 mcwtest]# kubectl get csd
error: the server doesn't have a resource type "csd"
[root@mcwk8s03 mcwtest]# kubectl get cs
NAME                 STATUS    MESSAGE             ERROR
scheduler            Healthy   ok                  
controller-manager   Healthy   ok                  
etcd-0               Healthy   {"health":"true"}   
etcd-2               Healthy   {"health":"true"}   
etcd-1               Healthy   {"health":"true"}   
[root@mcwk8s03 mcwtest]#

Confirm that access works before the test

[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 15:29:32 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl  -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 15:29:53 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl  -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 15:29:56 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# date
Thu Jun  6 23:30:06 CST 2024
[root@mcwk8s05 ~]#

Stop etcd

[root@mcwk8s03 mcwtest]# kubectl get cs
NAME                 STATUS    MESSAGE             ERROR
scheduler            Healthy   ok                  
controller-manager   Healthy   ok                  
etcd-0               Healthy   {"health":"true"}   
etcd-2               Healthy   {"health":"true"}   
etcd-1               Healthy   {"health":"true"}   
[root@mcwk8s03 mcwtest]# 
[root@mcwk8s03 mcwtest]# 
[root@mcwk8s03 mcwtest]# systemctl stop etcd
[root@mcwk8s03 mcwtest]# kubectl get cs
NAME                 STATUS      MESSAGE                                                                                   ERROR
etcd-1               Unhealthy   Get https://10.0.0.33:2379/health: dial tcp 10.0.0.33:2379: connect: connection refused   
controller-manager   Healthy     ok                                                                                        
scheduler            Healthy     ok                                                                                        
etcd-2               Healthy     {"health":"true"}                                                                         
etcd-0               Healthy     {"health":"true"}                                                                         
[root@mcwk8s03 mcwtest]# etcdctl ls /
/coreos.com
[root@mcwk8s03 mcwtest]# kubectl get cs
NAME                 STATUS      MESSAGE                                                                                   ERROR
etcd-1               Unhealthy   Get https://10.0.0.33:2379/health: dial tcp 10.0.0.33:2379: connect: connection refused   
etcd-2               Unhealthy   Get https://10.0.0.35:2379/health: dial tcp 10.0.0.35:2379: connect: connection refused   
controller-manager   Healthy     ok                                                                                        
scheduler            Healthy     ok                                                                                        
etcd-0               Unhealthy   HTTP probe failed with statuscode: 503                                                    
[root@mcwk8s03 mcwtest]# etcdctl ls /
/coreos.com
[root@mcwk8s03 mcwtest]#

With two etcd members stopped, the pod is still reachable

[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:15:10 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl  -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:15:11 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl  -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:15:15 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]#

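Now all etcd members are down and the data can no longer be queried. In my cluster, though, etcd appears to hold only a small amount of network data, whereas other people's deployments show many kinds of data. I am not sure where the difference comes from; a likely reason is that `etcdctl ls` uses the etcd v2 API and therefore only shows v2 keys such as flannel's, while Kubernetes stores its objects through the v3 API (under /registry by default). Either way, stopping etcd here seems to have little effect on the running pod.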

[root@mcwk8s03 mcwtest]# kubectl get cs
NAME                 STATUS      MESSAGE                                                                                   ERROR
etcd-0               Unhealthy   Get https://10.0.0.33:2379/health: dial tcp 10.0.0.33:2379: connect: connection refused   
etcd-2               Unhealthy   Get https://10.0.0.36:2379/health: dial tcp 10.0.0.36:2379: connect: connection refused   
controller-manager   Healthy     ok                                                                                        
scheduler            Healthy     ok                                                                                        
etcd-1               Unhealthy   Get https://10.0.0.35:2379/health: dial tcp 10.0.0.35:2379: connect: connection refused   
[root@mcwk8s03 mcwtest]# etcdctl ls /
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 10.0.0.36:2379: connect: connection refused
; error #1: dial tcp 10.0.0.33:2379: connect: connection refused
; error #2: dial tcp 10.0.0.35:2379: connect: connection refused

error #0: dial tcp 10.0.0.36:2379: connect: connection refused
error #1: dial tcp 10.0.0.33:2379: connect: connection refused
error #2: dial tcp 10.0.0.35:2379: connect: connection refused

[root@mcwk8s03 mcwtest]#

After etcd is stopped, the pod is still reachable (I still believed it was on node 05 at this point)

[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:19:38 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl  -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:19:41 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl  -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:19:46 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]#

However, Kubernetes resources can no longer be queried from the master

[root@mcwk8s03 mcwtest]# kubectl get svc
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get services)
[root@mcwk8s03 mcwtest]#

In /var/log/messages, flanneld reports that the etcd cluster is unavailable; connections to the etcd members are refused.

Jun  7 00:22:09 mcwk8s03 flanneld: E0607 00:22:09.538405  130015 watch.go:43] Watch subnets: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 10.0.0.35:2379: getsockopt: connection refused
Jun  7 00:22:09 mcwk8s03 flanneld: ; error #1: dial tcp 10.0.0.36:2379: getsockopt: connection refused

Jun 7 00:22:14 mcwk8s03 flanneld: E0607 00:22:14.547963 130015 watch.go:43] Watch subnets: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 10.0.0.35:2379: getsockopt: connection refused
Jun 7 00:22:14 mcwk8s03 flanneld: ; error #1: dial tcp 10.0.0.36:2379: getsockopt: connection refused
Jun 7 00:22:14 mcwk8s03 flanneld: ; error #2: dial tcp 10.0.0.33:2379: getsockopt: connection refused

kube-apiserver also logs errors connecting to etcd

Jun  7 00:26:08 mcwk8s03 kube-apiserver: W0607 00:26:08.919244  125396 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {10.0.0.36:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.36:2379: connect: connection refused". Reconnecting...
Jun  7 00:26:08 mcwk8s03 kube-apiserver: W0607 00:26:08.998513  125396 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {10.0.0.33:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.33:2379: connect: connection refused". Reconnecting...

A while later, the pod is still reachable

[root@mcwk8s05 ~]# curl  -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:27:41 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl  -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:27:44 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:27:47 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# date
Fri Jun  7 00:27:53 CST 2024
[root@mcwk8s05 ~]#

With flanneld stopped on both node 05 and node 06, the pod is still reachable

[root@mcwk8s05 ~]# systemctl stop flanneld
[root@mcwk8s05 ~]# date
Fri Jun  7 00:29:08 CST 2024
[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:29:07 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl  -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:29:11 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl  -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:29:16 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]#

flanneld has already been stopped above. Trying to start a single etcd member at this point fails.

[root@mcwk8s06 ~]# systemctl stop flanneld
[root@mcwk8s06 ~]# systemctl start etcd

Job for etcd.service failed because a timeout was exceeded. See "systemctl status etcd.service" and "journalctl -xe" for details.
[root@mcwk8s06 ~]# 
[root@mcwk8s06 ~]# 
[root@mcwk8s06 ~]# systemctl status etcd.service
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; disabled; vendor preset: disabled)
   Active: activating (start) since Fri 2024-06-07 00:31:48 CST; 15s ago
 Main PID: 33888 (etcd)
   Memory: 71.8M
   CGroup: /system.slice/etcd.service
           └─33888 /opt/etcd/bin/etcd --name=etcd03 --data-dir=/var/lib/etcd/default.etcd --listen-peer-urls=https://10.0.0.36:2380 --listen-client-urls=https://10.0.0.36:2379,http://127....

Jun 07 00:32:02 mcwk8s06 etcd[33888]: publish error: etcdserver: request timed out
Jun 07 00:32:03 mcwk8s06 etcd[33888]: 1f77d024d243f007 is starting a new election at term 910111
Jun 07 00:32:03 mcwk8s06 etcd[33888]: 1f77d024d243f007 became candidate at term 910112
Jun 07 00:32:03 mcwk8s06 etcd[33888]: 1f77d024d243f007 received MsgVoteResp from 1f77d024d243f007 at term 910112
Jun 07 00:32:03 mcwk8s06 etcd[33888]: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to 64051d53b5971b69 at term 910112
Jun 07 00:32:03 mcwk8s06 etcd[33888]: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to f1ec1f6015c9d4a4 at term 910112
Jun 07 00:32:03 mcwk8s06 etcd[33888]: health check for peer 64051d53b5971b69 could not connect: dial tcp 10.0.0.35:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
Jun 07 00:32:03 mcwk8s06 etcd[33888]: health check for peer f1ec1f6015c9d4a4 could not connect: dial tcp 10.0.0.33:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
Jun 07 00:32:03 mcwk8s06 etcd[33888]: health check for peer 64051d53b5971b69 could not connect: dial tcp 10.0.0.35:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
Jun 07 00:32:03 mcwk8s06 etcd[33888]: health check for peer f1ec1f6015c9d4a4 could not connect: dial tcp 10.0.0.33:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")

Errors in /var/log/messages

Jun  7 00:34:13 mcwk8s06 etcd: 1f77d024d243f007 is starting a new election at term 910195
Jun  7 00:34:13 mcwk8s06 etcd: 1f77d024d243f007 became candidate at term 910196
Jun  7 00:34:13 mcwk8s06 etcd: 1f77d024d243f007 received MsgVoteResp from 1f77d024d243f007 at term 910196
Jun  7 00:34:13 mcwk8s06 etcd: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to 64051d53b5971b69 at term 910196
Jun  7 00:34:13 mcwk8s06 etcd: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to f1ec1f6015c9d4a4 at term 910196
Jun  7 00:34:15 mcwk8s06 etcd: health check for peer f1ec1f6015c9d4a4 could not connect: dial tcp 10.0.0.33:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
Jun  7 00:34:15 mcwk8s06 etcd: health check for peer 64051d53b5971b69 could not connect: dial tcp 10.0.0.35:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
Jun  7 00:34:15 mcwk8s06 etcd: health check for peer f1ec1f6015c9d4a4 could not connect: dial tcp 10.0.0.33:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
Jun  7 00:34:15 mcwk8s06 etcd: health check for peer 64051d53b5971b69 could not connect: dial tcp 10.0.0.35:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
Jun  7 00:34:15 mcwk8s06 etcd: publish error: etcdserver: request timed out
Jun  7 00:34:15 mcwk8s06 etcd: 1f77d024d243f007 is starting a new election at term 910196
Jun  7 00:34:15 mcwk8s06 etcd: 1f77d024d243f007 became candidate at term 910197
Jun  7 00:34:15 mcwk8s06 etcd: 1f77d024d243f007 received MsgVoteResp from 1f77d024d243f007 at term 910197
Jun  7 00:34:15 mcwk8s06 etcd: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to 64051d53b5971b69 at term 910197
Jun  7 00:34:15 mcwk8s06 etcd: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to f1ec1f6015c9d4a4 at term 910197
Jun  7 00:34:16 mcwk8s06 etcd: 1f77d024d243f007 is starting a new election at term 910197
Jun  7 00:34:16 mcwk8s06 etcd: 1f77d024d243f007 became candidate at term 910198
Jun  7 00:34:16 mcwk8s06 etcd: 1f77d024d243f007 received MsgVoteResp from 1f77d024d243f007 at term 910198
Jun  7 00:34:16 mcwk8s06 etcd: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to 64051d53b5971b69 at term 910198
Jun  7 00:34:16 mcwk8s06 etcd: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to f1ec1f6015c9d4a4 at term 910198
Jun  7 00:34:17 mcwk8s06 etcd: 1f77d024d243f007 is starting a new election at term 910198
Jun  7 00:34:17 mcwk8s06 etcd: 1f77d024d243f007 became candidate at term 910199
Jun  7 00:34:17 mcwk8s06 etcd: 1f77d024d243f007 received MsgVoteResp from 1f77d024d243f007 at term 910199
Jun  7 00:34:17 mcwk8s06 etcd: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to 64051d53b5971b69 at term 910199
Jun  7 00:34:17 mcwk8s06 etcd: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to f1ec1f6015c9d4a4 at term 910199
Jun  7 00:34:18 mcwk8s06 etcd: 1f77d024d243f007 is starting a new election at term 910199
Jun  7 00:34:18 mcwk8s06 etcd: 1f77d024d243f007 became candidate at term 910200
Jun  7 00:34:18 mcwk8s06 etcd: 1f77d024d243f007 received MsgVoteResp from 1f77d024d243f007 at term 910200

Next I tried to start flanneld on node 06 first, but that fails as well

[root@mcwk8s06 ~]# systemctl start flanneld
Job for flanneld.service failed because a timeout was exceeded. See "systemctl status flanneld.service" and "journalctl -xe" for details.
[root@mcwk8s06 ~]# systemctl status flanneld
● flanneld.service - Flanneld overlay address etcd agent
   Loaded: loaded (/usr/lib/systemd/system/flanneld.service; disabled; vendor preset: disabled)
   Active: activating (start) since Fri 2024-06-07 00:42:11 CST; 1min 12s ago
 Main PID: 36542 (flanneld)
   Memory: 8.5M
   CGroup: /system.slice/flanneld.service
           └─36542 /opt/kubernetes/bin/flanneld --ip-masq --etcd-endpoints=https://10.0.0.33:2379,https://10.0.0.35:2379,https://10.0.0.36:2379 -etcd-cafile=/opt/etcd/ssl/ca.pem -etcd-cer...

Jun 07 00:42:54 mcwk8s06 flanneld[36542]: ; error #2: net/http: TLS handshake timeout
Jun 07 00:42:55 mcwk8s06 flanneld[36542]: timed out
Jun 07 00:43:05 mcwk8s06 flanneld[36542]: E0607 00:43:05.841656   36542 main.go:349] Couldn't fetch network config: client: etcd cluster is unavailable or misconfigured; error...tion refused
Jun 07 00:43:05 mcwk8s06 flanneld[36542]: ; error #1: dial tcp 10.0.0.33:2379: getsockopt: connection refused
Jun 07 00:43:05 mcwk8s06 flanneld[36542]: ; error #2: net/http: TLS handshake timeout
Jun 07 00:43:06 mcwk8s06 flanneld[36542]: timed out
Jun 07 00:43:16 mcwk8s06 flanneld[36542]: E0607 00:43:16.846177   36542 main.go:349] Couldn't fetch network config: client: etcd cluster is unavailable or misconfigured; error...tion refused
Jun 07 00:43:16 mcwk8s06 flanneld[36542]: ; error #1: dial tcp 10.0.0.33:2379: getsockopt: connection refused
Jun 07 00:43:16 mcwk8s06 flanneld[36542]: ; error #2: net/http: TLS handshake timeout
Jun 07 00:43:17 mcwk8s06 flanneld[36542]: timed out
Hint: Some lines were ellipsized, use -l to show in full.
[root@mcwk8s06 ~]#

/var/log/messages shows that flanneld tries to connect to etcd to fetch the network config but cannot connect.

Jun  7 00:42:11 mcwk8s06 systemd: flanneld.service start operation timed out. Terminating.
Jun  7 00:42:11 mcwk8s06 flanneld: E0607 00:42:11.520247   36223 main.go:349] Couldn't fetch network config: context canceled
Jun  7 00:42:11 mcwk8s06 flanneld: I0607 00:42:11.520736   36223 main.go:337] shutdownHandler sent cancel signal...
Jun  7 00:42:11 mcwk8s06 systemd: Failed to start Flanneld overlay address etcd agent.
Jun  7 00:42:11 mcwk8s06 systemd: Unit flanneld.service entered failed state.
Jun  7 00:42:11 mcwk8s06 systemd: flanneld.service failed.
Jun  7 00:42:11 mcwk8s06 kubelet: I0607 00:42:11.587113   61577 prober.go:173] HTTP-Probe Host: http://172.17.9.10, Port: 3000, Path: /login
Jun  7 00:42:11 mcwk8s06 kubelet: I0607 00:42:11.587152   61577 prober.go:176] HTTP-Probe Headers: map[]
Jun  7 00:42:11 mcwk8s06 kubelet: I0607 00:42:11.589166   61577 http.go:120] Probe succeeded for http://172.17.9.10:3000/login, Response: {200 OK 200 HTTP/1.1 1 1 map[Content-Type:[text/html; charset=UTF-8] Date:[Thu, 06 Jun 2024 16:42:11 GMT]] 0xc0014ee7e0 -1 [chunked] true false map[] 0xc00140c100 <nil>}
Jun  7 00:42:11 mcwk8s06 kubelet: I0607 00:42:11.589213   61577 prober.go:125] Readiness probe for "grafana-core-5cc8dff58b-t97pb_kube-system(d15fbb7f-6d16-4471-9e58-e64ccbfd89da):grafana-core" succeeded
Jun  7 00:42:11 mcwk8s06 systemd: flanneld.service holdoff time over, scheduling restart.
Jun  7 00:42:11 mcwk8s06 systemd: Starting Flanneld overlay address etcd agent...
Jun  7 00:42:11 mcwk8s06 flanneld: I0607 00:42:11.823038   36542 main.go:475] Determining IP address of default interface
Jun  7 00:42:11 mcwk8s06 flanneld: I0607 00:42:11.824305   36542 main.go:488] Using interface with name eth0 and address 10.0.0.36
Jun  7 00:42:11 mcwk8s06 flanneld: I0607 00:42:11.824322   36542 main.go:505] Defaulting external address to interface address (10.0.0.36)
Jun  7 00:42:11 mcwk8s06 flanneld: warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
Jun  7 00:42:11 mcwk8s06 flanneld: I0607 00:42:11.825141   36542 main.go:235] Created subnet manager: Etcd Local Manager with Previous Subnet: None
Jun  7 00:42:11 mcwk8s06 flanneld: I0607 00:42:11.825148   36542 main.go:238] Installing signal handlers

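The etcd member on node 06 will not start, so try starting etcd on node 03 first; it is, after all, the first member in the startup command.

Starting the first etcd member works. Once it is up, the member on node 06 that had failed to start also comes up. Presumably, although the start command I issued reported a failure, node 06 kept retrying in the background, so its member came up as soon as the first one did. So if etcd will not start, try starting the first member first. In the earlier log, the etcd on node 06 was probing both other members, and it seemed to care more about the first one. A simpler reading of the same logs is quorum: a 3-member cluster needs 2 members to elect a leader, so a lone member keeps calling elections until any second member becomes reachable.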

[root@mcwk8s03 /]# kubectl get cs
NAME                 STATUS      MESSAGE                                                                                   ERROR
etcd-0               Unhealthy   Get https://10.0.0.33:2379/health: dial tcp 10.0.0.33:2379: connect: connection refused   
etcd-1               Unhealthy   Get https://10.0.0.35:2379/health: dial tcp 10.0.0.35:2379: connect: connection refused   
scheduler            Healthy     ok                                                                                        
controller-manager   Healthy     ok                                                                                        
etcd-2               Unhealthy   Get https://10.0.0.36:2379/health: net/http: TLS handshake timeout                        
[root@mcwk8s03 /]# systemctl start etcd
[root@mcwk8s03 /]# kubectl get cs

NAME                 STATUS      MESSAGE                                                                                   ERROR
etcd-1               Unhealthy   Get https://10.0.0.35:2379/health: dial tcp 10.0.0.35:2379: connect: connection refused   
controller-manager   Healthy     ok                                                                                        
etcd-0               Healthy     {"health":"true"}                                                                         
etcd-2               Healthy     {"health":"true"}                                                                         
scheduler            Healthy     ok                                                                                        
[root@mcwk8s03 /]#
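
The recovery sequence above fits the majority rule that etcd's Raft consensus relies on: a cluster of N voting members can elect a leader and serve requests only while more than half of them are up. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the member counts are read off the `kubectl get cs` outputs in this test.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if enough members are up to elect a leader."""
    return healthy >= quorum(members)


# The stages of this test, for the 3-member cluster on nodes 03, 05 and 06.
stages = [
    ("all members running", 3),
    ("etcd stopped on node 03", 2),
    ("only one member left", 1),
    ("node 06 started alone", 1),
    ("node 03 and node 06 running", 2),
]
for label, healthy in stages:
    state = "quorum" if has_quorum(3, healthy) else "NO quorum"
    print(f"{label}: {healthy}/3 up -> {state}")
```

Under this reading, the lone member on node 06 could not finish starting no matter which node it was, and starting any second member (not specifically the first one in the list) would have unblocked it.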

Looking at the data: kubectl can query Kubernetes resources again, and the flannel subnets are unchanged so far.

[root@mcwk8s03 /]# kubectl get svc
NAME          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
kubernetes    ClusterIP   10.2.0.1     <none>        443/TCP          585d
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   2d2h
nginx         ClusterIP   None         <none>        80/TCP           415d
[root@mcwk8s03 /]# kubectl get nodes
NAME       STATUS   ROLES    AGE    VERSION
mcwk8s05   Ready    <none>   582d   v1.15.12
mcwk8s06   Ready    <none>   582d   v1.15.12
[root@mcwk8s03 /]# etcdctl ls /
/coreos.com
[root@mcwk8s03 /]# etcdctl ls /coreos.com/
/coreos.com/network
[root@mcwk8s03 /]# etcdctl ls /coreos.com/network/
/coreos.com/network/config
/coreos.com/network/subnets
[root@mcwk8s03 /]# etcdctl ls /coreos.com/network/subnets
/coreos.com/network/subnets/172.17.83.0-24
/coreos.com/network/subnets/172.17.89.0-24
/coreos.com/network/subnets/172.17.9.0-24
[root@mcwk8s03 /]#
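
The subnet keys listed above are what makes etcd look like a routing table for flannel: each key names one /24, and each /24 belongs to one host. The sketch below turns a key into a network and looks up which host owns a given pod IP. The key-to-host pairing in `LEASES` is an assumption inferred from the transcripts in this post (docker0 on mcwk8s03 is 172.17.83.1, pods on mcwk8s05 are 172.17.89.x, pods on mcwk8s06 are 172.17.9.x); it is not read from etcd, and the helper names are hypothetical.

```python
import ipaddress
from typing import Dict, Optional

SUBNET_PREFIX = "/coreos.com/network/subnets/"


def key_to_network(key: str):
    """'/coreos.com/network/subnets/172.17.89.0-24' -> 172.17.89.0/24."""
    name = key[len(SUBNET_PREFIX):] if key.startswith(SUBNET_PREFIX) else key
    address, _, prefix_len = name.rpartition("-")
    return ipaddress.ip_network(f"{address}/{prefix_len}")


def host_for_pod(pod_ip: str, leases: Dict[str, str]) -> Optional[str]:
    """Return the host IP whose flannel subnet contains pod_ip, if any."""
    ip = ipaddress.ip_address(pod_ip)
    for key, host in leases.items():
        if ip in key_to_network(key):
            return host
    return None


# Assumed pairing of subnet key -> node IP, inferred from this post.
LEASES = {
    "/coreos.com/network/subnets/172.17.83.0-24": "10.0.0.33",
    "/coreos.com/network/subnets/172.17.89.0-24": "10.0.0.35",
    "/coreos.com/network/subnets/172.17.9.0-24": "10.0.0.36",
}

print(host_for_pod("172.17.9.12", LEASES))   # pod IP seen on mcwk8s06
print(host_for_pod("172.17.89.8", LEASES))   # pod IP seen on mcwk8s05
```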

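The pod stayed reachable throughout. In other words, if Kubernetes cluster components fail, existing pods should keep serving traffic as long as their state does not change (for example, they are not recreated). This is only what I verified here; the test may not cover everything, so the conclusion may be one-sided.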

[root@mcwk8s05 ~]# curl  -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:54:20 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl  -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:54:23 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:54:27 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]#

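The pod is now on node 06. One etcd member is still down, but the other two are running, so that should not cause a problem. Until now I had assumed the pod was still on node 05 and had not noticed that it had already moved to node 06.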

[root@mcwk8s03 /]# kubectl get svc |grep mcwtest
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   2d2h
[root@mcwk8s03 /]# kubectl get deploy -o wide|grep mcwtest
mcwtest-deploy     1/1     1            1           2d2h   mcwtest      centos         app=mcwpython
[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-kbpdr     1/1     Running            0          4h     172.17.9.12   mcwk8s06   <none>           <none>
[root@mcwk8s03 /]#

Stop flanneld on node 06, then restart the pod's container on that node

[root@mcwk8s06 ~]# systemctl stop flanneld
[root@mcwk8s06 ~]# docker ps|grep mcwtest
04f4ddf6c5ef   5d0da3dc9764                                                          "sh -c 'echo 123 >>/…"   4 hours ago   Up 4 hours             k8s_mcwtest_mcwtest-deploy-6465665557-kbpdr_default_569facd2-279d-4d87-b8c6-1edbea1015c4_0
02aae29ff065   registry.cn-hangzhou.aliyuncs.com/google-containers/pause-amd64:3.0   "/pause"                 4 hours ago   Up 4 hours             k8s_POD_mcwtest-deploy-6465665557-kbpdr_default_569facd2-279d-4d87-b8c6-1edbea1015c4_0
[root@mcwk8s06 ~]# docker restart 04f4
04f4
[root@mcwk8s06 ~]#

The pod still serves requests

[root@mcwk8s05 ~]# curl  -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:06:19 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl  -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:06:21 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:06:24 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# date
Fri Jun  7 01:07:33 CST 2024
[root@mcwk8s05 ~]#

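Now restore all of etcd, that is, bring the member on node 05 back as well. flanneld is stopped on the master and on every node; kube-proxy and kubelet are running normally on the nodes.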

[root@mcwk8s03 /]# kubectl get cs
NAME                 STATUS    MESSAGE             ERROR
scheduler            Healthy   ok                  
controller-manager   Healthy   ok                  
etcd-0               Healthy   {"health":"true"}   
etcd-1               Healthy   {"health":"true"}   
etcd-2               Healthy   {"health":"true"}   
[root@mcwk8s03 /]#

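Deleting the pod now does not prevent it from being recreated: the new pod lands on node 05 instead of node 06. Access to the service fails, however. My assumption is that this is related to flanneld being stopped everywhere: existing pods are unaffected, but a recreated pod is affected because flanneld has to take part. What exactly flanneld does here is something to look up in the source code later.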

[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-kbpdr     1/1     Running            0          4h15m   172.17.9.12   mcwk8s06   <none>           <none>
[root@mcwk8s03 /]# kubectl delete pod mcwtest-deploy-6465665557-kbpdr
pod "mcwtest-deploy-6465665557-kbpdr" deleted
[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-nwbnn     1/1     Running            0          44s     172.17.89.7   mcwk8s05   <none>           <none>
[root@mcwk8s03 /]# kubectl get svc -o wide|grep mcwtest
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   2d2h   app=mcwpython
[root@mcwk8s03 /]# curl -I 10.0.0.36:33958
curl: (7) Failed connect to 10.0.0.36:33958; Connection refused
[root@mcwk8s03 /]# curl -I 10.0.0.35:33958
curl: (7) Failed connect to 10.0.0.35:33958; Connection refused
[root@mcwk8s03 /]#

The pod is on node 05, though, and access from node 05 works, so the problem may be in cross-host traffic. Let's see what is wrong.

[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:20:52 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl  -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:20:59 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]# curl  -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:21:07 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]#

From node 06 it really did fail at first, but now it works again

[root@mcwk8s06 ~]# curl  -I 10.0.0.36:33958
curl: (7) Failed connect to 10.0.0.36:33958; Connection timed out
[root@mcwk8s06 ~]# curl  -I 10.0.0.35:33958
curl: (7) Failed connect to 10.0.0.35:33958; Connection timed out
[root@mcwk8s06 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:23:08 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s06 ~]# curl  -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:23:13 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s06 ~]#

Access from node 03 works now too. So was the failure caused by the IPVS rules not being updated in time, that is, by a delay?

[root@mcwk8s03 /]# curl -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:24:11 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s03 /]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:24:17 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s03 /]#

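Next, delete the pod again so that it is recreated, and check whether IPVS is still stale while access is failing. The current pod IP is shown below, and access works.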

[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-nwbnn     1/1     Running            0          12m     172.17.89.7   mcwk8s05   <none>           <none>
[root@mcwk8s03 /]# kubectl get svc |grep mcwtest
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   2d3h
[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-nwbnn     1/1     Running            0          12m     172.17.89.7   mcwk8s05   <none>           <none>
[root@mcwk8s03 /]#

[root@mcwk8s05 ~]# ipvsadm -Ln|grep -2  10.0.0.35:33958
  -> 172.17.9.2:9100              Masq    1      0          0         
  -> 172.17.89.2:9100             Masq    1      0          0         
TCP  10.0.0.35:33958 rr
  -> 172.17.89.7:20000            Masq    1      0          0         
TCP  10.0.0.35:46735 rr
[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:29:26 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]#

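After the pod is recreated, the IPVS rules on both nodes already point to the new real server, yet access still fails. By my estimate it took about a minute before it worked.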

[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-nwbnn     1/1     Running            0          12m     172.17.89.7   mcwk8s05   <none>           <none>
[root@mcwk8s03 /]# 
[root@mcwk8s03 /]# kubectl delete pod mcwtest-deploy-6465665557-nwbnn
pod "mcwtest-deploy-6465665557-nwbnn" deleted
[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-zpfvp     1/1     Running            0          37s     172.17.89.8   mcwk8s05   <none>           <none>
[root@mcwk8s03 /]#

[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
curl: (7) Failed connect to 10.0.0.35:33958; Connection refused
[root@mcwk8s05 ~]# ipvsadm -Ln|grep -2  10.0.0.35:33958
  -> 172.17.9.2:9100              Masq    1      0          0         
  -> 172.17.89.2:9100             Masq    1      0          0         
TCP  10.0.0.35:33958 rr
  -> 172.17.89.8:20000            Masq    1      0          1         
TCP  10.0.0.35:46735 rr
[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
curl: (7) Failed connect to 10.0.0.35:33958; Connection refused
[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
curl: (7) Failed connect to 10.0.0.35:33958; Connection refused
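
The "about one minute" figure above is an estimate from repeated manual curl calls. A small polling helper could time the gap more precisely and also record whether each attempt was refused or timed out. This is only a sketch that was not part of the original experiment: the function names probe and wait_until_open, the polling interval, and the 180-second cap are all assumptions chosen for illustration.

```python
import socket
import time


def probe(host, port, timeout=2.0):
    """Try one TCP connect and classify it as 'ok', 'refused', 'timeout' or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"


def wait_until_open(host, port, interval=1.0, max_wait=180.0):
    """Poll until the port accepts a connection.

    Returns (seconds_waited, states_seen); seconds_waited is None
    if the port never opened within max_wait seconds.
    """
    start = time.monotonic()
    states = []
    while True:
        state = probe(host, port)
        states.append(state)
        elapsed = time.monotonic() - start
        if state == "ok":
            return elapsed, states
        if elapsed >= max_wait:
            return None, states
        time.sleep(interval)
```

Run right after kubectl delete pod, a call such as wait_until_open("10.0.0.35", 33958) would return how many seconds passed before the NodePort answered, plus the sequence of failure types seen along the way.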

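Now stop kube-proxy on both nodes. In theory, when the pod is recreated the kubelet still creates the new container, but the ipvs rules on the network side should not be refreshed.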

[root@mcwk8s05 ~]# systemctl stop kube-proxy
[root@mcwk8s05 ~]# 


[root@mcwk8s06 ~]# systemctl stop kube-proxy
[root@mcwk8s06 ~]#

Checks before deleting the pod

[root@mcwk8s03 /]# kubectl get svc|grep mcwtest
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   2d3h
[root@mcwk8s03 /]# kubectl get pod|grep mcwtest
mcwtest-deploy-6465665557-zpfvp     1/1     Running            0          9m25s
[root@mcwk8s03 /]#

[root@mcwk8s05 ~]# ipvsadm -Ln|grep -2  10.0.0.35:33958
  -> 172.17.9.2:9100              Masq    1      0          0         
  -> 172.17.89.2:9100             Masq    1      0          0         
TCP  10.0.0.35:33958 rr
  -> 172.17.89.8:20000            Masq    1      0          0         
TCP  10.0.0.35:46735 rr
[root@mcwk8s05 ~]#

Delete the pod and let it be recreated

[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-zpfvp     1/1     Running            0          37s     172.17.89.8   mcwk8s05   <none>           <none>
[root@mcwk8s03 /]# kubectl delete pod mcwtest-deploy-6465665557-zpfvp
pod "mcwtest-deploy-6465665557-zpfvp" deleted
[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-qc9t2     1/1     Running            0          112s    172.17.89.7   mcwk8s05   <none>           <none>
[root@mcwk8s03 /]#

A few minutes later, the rule on both nodes still has not been changed to the new RS:

[root@mcwk8s05 ~]# ipvsadm -Ln|grep -2  10.0.0.35:33958
  -> 172.17.9.2:9100              Masq    1      0          0         
  -> 172.17.89.2:9100             Masq    1      0          0         
TCP  10.0.0.35:33958 rr
  -> 172.17.89.8:20000            Masq    1      0          0         
TCP  10.0.0.35:46735 rr
[root@mcwk8s05 ~]#

As expected, the service is unreachable at this point

[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
curl: (7) Failed connect to 10.0.0.35:33958; Connection timed out
[root@mcwk8s05 ~]# curl -I 10.0.0.36:33958
curl: (7) Failed connect to 10.0.0.36:33958; Connection timed out
[root@mcwk8s05 ~]#

Now look at the Service. Its Endpoints match the new pod and were updated to the new IP right away.

[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-qc9t2     1/1     Running            0          112s    172.17.89.7   mcwk8s05   <none>           <none>
[root@mcwk8s03 /]# kubectl get svc|grep mcwtest
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   2d3h
[root@mcwk8s03 /]# kubectl describe svc mcwtest-svc
Name:                     mcwtest-svc
Namespace:                default
Labels:                   <none>
Annotations:              kubectl.kubernetes.io/last-applied-configuration:
                            {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"name":"mcwtest-svc","namespace":"default"},"spec":{"ports":[{"name":"mcw...
Selector:                 app=mcwpython
Type:                     NodePort
IP:                       10.2.0.155
Port:                     mcwport  2024/TCP
TargetPort:               20000/TCP
NodePort:                 mcwport  33958/TCP
Endpoints:                172.17.89.7:20000
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>
[root@mcwk8s03 /]#

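In the kubectl describe svc output above, the following two fields are the cluster IP and the port on the cluster IP:

IP: 10.2.0.155
Port: mcwport 2024/TCP

TargetPort is the port inside the container, and Endpoints is the pod IP plus that container port. NodePort is the port opened on each node. The port name (mcwport) is the one we defined earlier in the Service; it is unrelated to the names of the Deployment, pod, or other resources.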
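
To make the relationship between these fields concrete, the sketch below builds the three kinds of address from the values in the describe output. The function name access_paths is hypothetical and exists only for illustration; the IPs and ports are the ones shown in this post.

```python
def access_paths(cluster_ip, port, target_port, node_port, node_ips, pod_ips):
    """Return the addresses through which a NodePort Service can be reached."""
    return {
        # ClusterIP plus the Service port (IP and Port in the describe output)
        "cluster": f"{cluster_ip}:{port}",
        # Every node IP plus the NodePort
        "nodeport": [f"{ip}:{node_port}" for ip in node_ips],
        # Pod IP plus TargetPort, which is what the Endpoints field lists
        "endpoints": [f"{ip}:{target_port}" for ip in pod_ips],
    }


paths = access_paths(
    cluster_ip="10.2.0.155",
    port=2024,
    target_port=20000,
    node_port=33958,
    node_ips=["10.0.0.35", "10.0.0.36"],
    pod_ips=["172.17.89.7"],
)
print(paths["cluster"])    # 10.2.0.155:2024
print(paths["nodeport"])   # ['10.0.0.35:33958', '10.0.0.36:33958']
print(paths["endpoints"])  # ['172.17.89.7:20000']
```

The endpoints entry is the value that the ipvs real server on each node has to match for the first two addresses to work.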

Once kube-proxy is started, the rule is immediately updated to the new RS.

[root@mcwk8s06 ~]# ipvsadm -Ln|grep -2  10.0.0.36:33958
  -> 172.17.9.2:9100              Masq    1      0          0         
  -> 172.17.89.2:9100             Masq    1      0          0         
TCP  10.0.0.36:33958 rr
  -> 172.17.89.8:20000            Masq    1      0          0         
TCP  10.0.0.36:46735 rr
[root@mcwk8s06 ~]# systemctl start kube-proxy
[root@mcwk8s06 ~]# ipvsadm -Ln|grep -2  10.0.0.36:33958
  -> 172.17.9.2:9100              Masq    1      0          0         
  -> 172.17.89.2:9100             Masq    1      0          0         
TCP  10.0.0.36:33958 rr
  -> 172.17.89.7:20000            Masq    1      0          0         
TCP  10.0.0.36:46735 rr
[root@mcwk8s06 ~]#

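At this point kube-proxy on 05 is still stopped while the one on 06 is running, so the ipvs RS on 05 has not been updated and the one on 06 has. The pod cannot be reached through 05, but it can be reached through 06.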

[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
curl: (7) Failed connect to 10.0.0.35:33958; Connection timed out
[root@mcwk8s05 ~]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:58:40 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816

[root@mcwk8s05 ~]#

The experiment is done; restore the environment to normal.

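Section summary:

If the network is unreachable, check the following:

1. Check whether the Service's Endpoints match the pod IP.

2. Check whether the ipvs rules match the pod. If they do not, check whether kube-proxy is running normally and whether it needs a restart. Restarting it does not seem to affect existing pods.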
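
The two checks above can be sketched as a comparison between the real servers that ipvsadm -Ln lists for the NodePort and the Service's Endpoints. This is an illustrative helper, not a tool used in the experiment: real_servers and diagnose are hypothetical names, and the parser assumes the plain ipvsadm -Ln layout seen in the transcripts (a TCP/UDP header line followed by "->" real-server lines).

```python
def real_servers(ipvsadm_output, virtual_service):
    """Collect the '-> ip:port' real servers listed under one virtual service."""
    servers = set()
    in_block = False
    for line in ipvsadm_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("TCP", "UDP", "SCTP", "FWM"):
            # A new virtual-service header starts; check whether it is ours.
            in_block = len(fields) > 1 and fields[1] == virtual_service
        elif fields[0] == "->" and in_block and len(fields) > 1:
            servers.add(fields[1])
    return servers


def diagnose(ipvsadm_output, virtual_service, endpoints):
    """Compare ipvs real servers with the Service endpoints."""
    actual = real_servers(ipvsadm_output, virtual_service)
    expected = set(endpoints)
    return {
        "in_sync": actual == expected,
        "stale": sorted(actual - expected),    # in ipvs but no longer an endpoint
        "missing": sorted(expected - actual),  # endpoint not yet programmed in ipvs
    }


# Output captured on mcwk8s05 while kube-proxy was stopped (from this post).
SAMPLE = """\
TCP  10.0.0.35:33958 rr
  -> 172.17.89.8:20000            Masq    1      0          0
TCP  10.0.0.35:46735 rr
"""

result = diagnose(SAMPLE, "10.0.0.35:33958", ["172.17.89.7:20000"])
print(result)
# {'in_sync': False, 'stale': ['172.17.89.8:20000'], 'missing': ['172.17.89.7:20000']}
```

A non-empty stale or missing list is the situation seen earlier with kube-proxy stopped, and it points at kube-proxy as the next thing to inspect.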

In conclusion:

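Creating and destroying containers requires the k8s components, the network plugin, and so on. For containers that already exist, however, the network is already set up on the host. Stopping the k8s components and the network plugin does not affect communication between containers on the same host, nor does it affect cross-host communication for those containers. In other words, existing container services can still be reached via clusterip:port, nodeip:nodeport, and similar paths, and they keep serving normally.

In addition, when I stopped flannel here and then started it again, the host's container subnet stayed the same, the container gateway was unchanged, and the subnet stored in etcd was unchanged as well. So restarting the network plugin did not change the subnet of the existing containers. That leaves a question: why did restarting the flannel service previously change the host's container subnet, so that the containers also had to be restarted to match the host's new subnet?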

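Note: as I understand it, cross-host container communication may require flanneld to query etcd for the IP of the host that owns the destination container subnet, then encapsulate the packet and send it through the physical NIC. flannel.1 acts as the tunnel device, the gateway is docker0, and docker0 in turn follows the default route to eth0; packets are then transmitted between hosts, which is how routing and forwarding between hosts is achieved. But once flanneld is stopped, who looks up the container subnet, and how is the host IP for the destination subnet found? How is that mapping resolved? I will look into this and add details when I have time. Some of these questions are covered in the link below.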

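A later guess: could the host be caching routing table information? Would that explain why containers can still communicate across hosts even when etcd and flannel are down?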

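A follow-up question to verify later: when a tunnel network is created and carries traffic, is the traffic translated to the physical NIC's IP for transmission?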

posted @ 2024-06-06 02:12 马昌伟
