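Do existing pods keep running when the k8s components and the network plugin go down? A demonstration
Environment
The link above is where this environment was first set up; the k8s resource manifests are in that post.
mcwk8s03 is the master; mcwk8s05 and mcwk8s06 are the nodes.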
[root@mcwk8s03 mcwtest]# kubectl get nodes -o wide
NAME       STATUS   ROLES    AGE    VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION          CONTAINER-RUNTIME
mcwk8s05   Ready    <none>   581d   v1.15.12   10.0.0.35     <none>        CentOS Linux 7 (Core)   3.10.0-693.el7.x86_64   docker://20.10.21
mcwk8s06   Ready    <none>   581d   v1.15.12   10.0.0.36     <none>        CentOS Linux 7 (Core)   3.10.0-693.el7.x86_64   docker://20.10.21
[root@mcwk8s03 mcwtest]#
[root@mcwk8s03 mcwtest]# kubectl get svc
NAME          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
kubernetes    ClusterIP   10.2.0.1     <none>        443/TCP          584d
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   26h
nginx         ClusterIP   None         <none>        80/TCP           414d
[root@mcwk8s03 mcwtest]# kubectl get svc|grep mcwtest
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   26h
[root@mcwk8s03 mcwtest]# kubectl get deploy|grep mcwtest
mcwtest-deploy   1/1   1   1   26h
[root@mcwk8s03 mcwtest]# kubectl get pod|grep mcwtest
mcwtest-deploy-6465665557-g9zjd   1/1   Running   1   25h
[root@mcwk8s03 mcwtest]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-g9zjd   1/1   Running   1   25h   172.17.89.10   mcwk8s05   <none>   <none>
[root@mcwk8s03 mcwtest]#
Before any service is stopped, the pod is reachable at nodeip:nodeport
[root@mcwk8s03 mcwtest]# kubectl get svc
NAME          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
kubernetes    ClusterIP   10.2.0.1     <none>        443/TCP          584d
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   26h
nginx         ClusterIP   None         <none>        80/TCP           414d
[root@mcwk8s03 mcwtest]# date
Thu Jun 6 00:50:34 CST 2024
[root@mcwk8s03 mcwtest]# curl -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 16:50:42 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s03 mcwtest]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 16:50:46 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s03 mcwtest]#
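The same curl check is repeated after every step of this experiment, so it may be convenient to wrap it in a small function. The sketch below is only an illustration: the function name probe_endpoints is made up here, and the 5 second timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<endpoint> <http-status>" for each ip:port given.
probe_endpoints() {
  local ep code
  for ep in "$@"; do
    # -I sends a HEAD request, -m 5 caps the whole request at 5 seconds,
    # -w '%{http_code}' prints only the numeric status code.
    if code=$(curl -s -o /dev/null -I -m 5 -w '%{http_code}' "http://${ep}"); then
      echo "${ep} ${code}"
    else
      echo "${ep} unreachable"
    fi
  done
}

# Usage on this cluster (NodePort on both nodes, then the ClusterIP):
#   probe_endpoints 10.0.0.35:33958 10.0.0.36:33958 10.2.0.155:2024
```

Running it from the master and from a node after each stop gives a quick before/after comparison without reading full header dumps.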
【2】Stop the components on the master
[root@mcwk8s03 /]# systemctl stop kube-apiserver.service
[root@mcwk8s03 /]# systemctl status kube-apiserver.service
● kube-apiserver.service - Kubernetes API Server
   Loaded: loaded (/usr/lib/systemd/system/kube-apiserver.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Thu 2024-06-06 00:54:51 CST; 14s ago
     Docs: https://github.com/kubernetes/kubernetes
  Process: 19837 ExecStart=/opt/kubernetes/bin/kube-apiserver $KUBE_APISERVER_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 19837 (code=exited, status=0/SUCCESS)
Jun 06 00:54:51 mcwk8s03 kube-apiserver[19837]: I0606 00:54:51.395965 19837 wrap.go:47] GET /apis/rbac.authorization.k8s.io/v1/clus...3:5887]
Jun 06 00:54:51 mcwk8s03 kube-apiserver[19837]: I0606 00:54:51.396002 19837 wrap.go:47] GET /apis/storage.k8s.io/v1/volumeattachmen...3:5887]
Jun 06 00:54:51 mcwk8s03 kube-apiserver[19837]: I0606 00:54:51.396033 19837 wrap.go:47] GET /apis/admissionregistration.k8s.io/v1be...3:5887]
Jun 06 00:54:51 mcwk8s03 kube-apiserver[19837]: I0606 00:54:51.396047 19837 wrap.go:47] GET /apis/rbac.authorization.k8s.io/v1/role...3:5887]
Jun 06 00:54:51 mcwk8s03 kube-apiserver[19837]: I0606 00:54:51.396068 19837 wrap.go:47] GET /api/v1/nodes?resourceVersion=4830599&t...3:5887]
Jun 06 00:54:51 mcwk8s03 kube-apiserver[19837]: I0606 00:54:51.396083 19837 wrap.go:47] GET /api/v1/secrets?resourceVersion=4710803...3:5887]
Jun 06 00:54:51 mcwk8s03 kube-apiserver[19837]: I0606 00:54:51.396108 19837 wrap.go:47] GET /api/v1/namespaces?resourceVersion=4710...3:5887]
Jun 06 00:54:51 mcwk8s03 kube-apiserver[19837]: I0606 00:54:51.405097 19837 wrap.go:47] GET /api/v1/namespaces/default/endpoints/ku...3:5887]
Jun 06 00:54:51 mcwk8s03 kube-apiserver[19837]: E0606 00:54:51.408262 19837 controller.go:179] no master IPs were listed in storage...service
Jun 06 00:54:51 mcwk8s03 systemd[1]: Stopped Kubernetes API Server.
Hint: Some lines were ellipsized, use -l to show in full.
[root@mcwk8s03 /]#
kubectl commands already fail
[root@mcwk8s03 /]# kubectl get svc
The connection to the server localhost:8080 was refused - did you specify the right host or port?
[root@mcwk8s03 /]# kubectl get nodes
The connection to the server localhost:8080 was refused - did you specify the right host or port?
[root@mcwk8s03 /]#
Errors in /var/log/messages
Jun 6 00:58:11 mcwk8s03 kube-scheduler: E0606 00:58:11.720321 123920 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.ReplicaSet: Get http://127.0.0.1:8080/apis/apps/v1/replicasets?limit=500&resourceVersion=0: dial tcp 127.0.0.1:8080: connect: connection refused
The container behind nodeip:nodeport is unaffected and still running
[root@mcwk8s03 mcwtest]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:01:02 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s03 mcwtest]# date
Thu Jun 6 01:01:12 CST 2024
[root@mcwk8s03 mcwtest]#
Access still works.
Stop kube-scheduler and kube-controller-manager as well; the pod keeps serving requests
[root@mcwk8s03 mcwtest]# systemctl stop kube-scheduler.service
[root@mcwk8s03 mcwtest]# systemctl stop kube-controller-manager.service
[root@mcwk8s03 mcwtest]# date
Thu Jun 6 01:03:23 CST 2024
[root@mcwk8s03 mcwtest]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:03:26 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s03 mcwtest]#
The master runs neither kubelet nor kube-proxy
[root@mcwk8s03 mcwtest]# ps -ef|grep proxy
root     125098   1429  0 01:05 pts/0    00:00:00 grep --color=auto proxy
[root@mcwk8s03 mcwtest]# ps -ef|grep let
root     125106   1429  0 01:05 pts/0    00:00:00 grep --color=auto let
[root@mcwk8s03 mcwtest]#
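Instead of grepping ps for each process, the state of all relevant systemd units can be printed in one pass. This is a minimal sketch; the function name unit_states is made up for illustration, and the unit names are the ones used elsewhere in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report the systemd state of each unit name given.
unit_states() {
  local unit state
  for unit in "$@"; do
    # `systemctl is-active` prints the state and exits non-zero unless active.
    state=$(systemctl is-active "$unit" 2>/dev/null)
    echo "${unit}: ${state:-unknown}"
  done
}

# Usage on the master:
#   unit_states kube-apiserver kube-scheduler kube-controller-manager kubelet kube-proxy
```

The exact word printed for a unit that is not installed depends on the systemd version, so treat anything other than "active" as "not running".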
【3】Stop the components on the nodes
[root@mcwk8s03 mcwtest]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:06:41 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s03 mcwtest]# curl -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:06:44 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s03 mcwtest]#
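Since the master components do not affect the existing pod, start the apiserver again first so the environment is easier to inspect.
Only the apiserver is started on the master; once it is up, pods and other resources are listed normally again.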
[root@mcwk8s03 mcwtest]# systemctl start apiserver
Failed to start apiserver.service: Unit not found.
[root@mcwk8s03 mcwtest]# systemctl start kube-apiserver.service
[root@mcwk8s03 mcwtest]# kubectl get svc
NAME          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
kubernetes    ClusterIP   10.2.0.1     <none>        443/TCP          584d
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   26h
nginx         ClusterIP   None         <none>        80/TCP           414d
[root@mcwk8s03 mcwtest]# kubectl get nodes
NAME       STATUS   ROLES    AGE    VERSION
mcwk8s05   Ready    <none>   581d   v1.15.12
mcwk8s06   Ready    <none>   581d   v1.15.12
[root@mcwk8s03 mcwtest]#
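Check the component status. Stopping these three master components (apiserver, scheduler, controller-manager) did not affect nodeip:nodeport access to the existing pod.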
[root@mcwk8s03 mcwtest]# kubectl get cs
NAME                 STATUS      MESSAGE                                                                                     ERROR
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused
etcd-1               Healthy     {"health":"true"}
etcd-2               Healthy     {"health":"true"}
etcd-0               Healthy     {"health":"true"}
[root@mcwk8s03 mcwtest]#
On the master, look up a clusterIP to test with
[root@mcwk8s03 mcwtest]# kubectl get svc
NAME          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
kubernetes    ClusterIP   10.2.0.1     <none>        443/TCP          584d
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   26h
nginx         ClusterIP   None         <none>        80/TCP           414d
[root@mcwk8s03 mcwtest]#
Then request it from a node; that works too. Why not request clusterIP:port from the master? Because the master has no IPVS rules: node-side services such as kube-proxy are not deployed on it.
[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:12:34 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]#
Stop the kubelet service on 05 and 06
[root@mcwk8s05 ~]# systemctl stop kubelet.service
[root@mcwk8s05 ~]#
[root@mcwk8s06 ~]# systemctl stop kubelet.service
[root@mcwk8s06 ~]#
nodeip:nodeport access to the container is still normal, and so is clusterIP:port
[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:19:13 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:19:17 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:19:21 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]#
Next, stop the kube-proxy service, starting with node 06
[root@mcwk8s06 ~]# systemctl stop kube-proxy.service
[root@mcwk8s06 ~]#
The service is still reachable through node 06's nodeip:nodeport
[root@mcwk8s05 ~]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:23:41 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:23:44 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]#
As shown below, the routes and IPVS rules on 06 are all still in place. kube-proxy only programs these rules; they live in the kernel and stay there after the process exits.
[root@mcwk8s06 ~]# ipvsadm -Ln|grep -C 2 10.0.0.36
  -> 172.17.9.2:9100              Masq    1      0          0
  -> 172.17.89.2:9100             Masq    1      0          0
TCP  10.0.0.36:31672 rr
  -> 172.17.9.2:9100              Masq    1      0          0
  -> 172.17.89.2:9100             Masq    1      0          0
TCP  10.0.0.36:33958 rr
  -> 172.17.89.10:20000           Masq    1      0          1
TCP  10.0.0.36:46735 rr
  -> 172.17.89.13:3000            Masq    1      0          0
TCP  10.2.0.1:443 rr
--
TCP  172.17.9.1:46735 rr
  -> 172.17.89.13:3000            Masq    1      0          0
TCP  10.0.0.36:30001 rr
  -> 172.17.89.5:8443             Masq    1      0          0
TCP  10.0.0.36:30003 rr
  -> 172.17.89.4:9090             Masq    1      0          0
TCP  10.2.0.155:2024 rr
[root@mcwk8s06 ~]#
[root@mcwk8s06 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.0.254      0.0.0.0         UG    100    0        0 eth0
10.0.0.0        0.0.0.0         255.255.255.0   U     100    0        0 eth0
172.17.9.0      0.0.0.0         255.255.255.0   U     0      0        0 docker0
172.17.83.0     172.17.83.0     255.255.255.0   UG    0      0        0 flannel.1
172.17.89.0     172.17.89.0     255.255.255.0   UG    0      0        0 flannel.1
[root@mcwk8s06 ~]#
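Picking one virtual service out of the ipvsadm listing with grep -C also pulls in neighbouring entries. A small awk filter can print only the real servers of the service of interest. This is a sketch; the function name ipvs_backends is invented for illustration, and it assumes the usual two-level layout of ipvsadm -Ln output (a TCP/UDP line followed by "->" lines).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ipvsadm -Ln` output on stdin and print the
# real servers (backends) configured for one virtual service.
ipvs_backends() {
  local vip="$1"
  awk -v vip="$vip" '
    $1 == "TCP" || $1 == "UDP" { current = $2; next }   # virtual service line
    $1 == "->" && current == vip { print $2 }           # real server line
  '
}

# Usage on a node:
#   ipvsadm -Ln | ipvs_backends 10.0.0.36:33958
```

An empty result for a NodePort or ClusterIP means that node has no forwarding target for it, which is the situation on the master in this setup.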
The pod runs on 05. Stopping kube-proxy on 05 as well does not affect the existing pod either; kube-proxy is now stopped on both nodes.
[root@mcwk8s05 /]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:26:06 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 /]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:26:34 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 /]# curl -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 17:26:37 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 /]#
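【4】Stop the flannel service. An important walkthrough of how container traffic flows follows
Check the state before stopping: the master has routes to the container subnets of the other two hosts, and nodeip:nodeport is reachable from it. The master has no IPVS rules, so a request to clusterIP:port made there cannot be mapped to a container IP:container port. A request to nodeip:nodeport needs no IPVS rule on the client side: it travels to the node over the ordinary host network, and only once it arrives at nodeip:nodeport do the IPVS rules on that node forward it to the matching container IP:container port. If the container runs on the node that received the request, it is reached directly through docker0; if not, the packet goes through cross-host container networking to the host that owns the container and is delivered to the container IP there.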
[root@mcwk8s03 mcwtest]# systemctl status flanneld.service ● flanneld.service - Flanneld overlay address etcd agent Loaded: loaded (/usr/lib/systemd/system/flanneld.service; enabled; vendor preset: disabled) Active: active (running) since Tue 2024-06-04 23:28:50 CST; 1 day 2h ago Main PID: 11892 (flanneld) Memory: 13.9M CGroup: /system.slice/flanneld.service └─11892 /opt/kubernetes/bin/flanneld --ip-masq --etcd-endpoints=https://10.0.0.33:2379,https://10.0.0.35:2379,https://10.0.0.36:2... Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable. [root@mcwk8s03 mcwtest]# [root@mcwk8s03 mcwtest]# [root@mcwk8s03 mcwtest]# route -n Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface 0.0.0.0 10.0.0.254 0.0.0.0 UG 100 0 0 eth0 10.0.0.0 0.0.0.0 255.255.255.0 U 100 0 0 eth0 172.17.9.0 172.17.9.0 255.255.255.0 UG 0 0 0 flannel.1 172.17.83.0 0.0.0.0 255.255.255.0 U 0 0 0 docker0 172.17.89.0 172.17.89.0 255.255.255.0 UG 0 0 0 flannel.1 [root@mcwk8s03 mcwtest]# ifconfig flannel.1 flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450 inet 172.17.83.0 netmask 255.255.255.255 broadcast 0.0.0.0 inet6 fe80::b083:33ff:fe7b:fd37 prefixlen 64 scopeid 0x20<link> ether b2:83:33:7b:fd:37 txqueuelen 0 (Ethernet) RX packets 11 bytes 924 (924.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 11 bytes 924 (924.0 B) TX errors 0 dropped 8 overruns 0 carrier 0 collisions 0 [root@mcwk8s03 mcwtest]# ifconfig docker docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 inet 172.17.83.1 netmask 255.255.255.0 broadcast 172.17.83.255 ether 02:42:e9:a4:51:4f txqueuelen 0 (Ethernet) RX packets 0 bytes 0 (0.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 0 bytes 0 (0.0 B) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 [root@mcwk8s03 mcwtest]# curl 10.2.0.155:2024 ^C [root@mcwk8s03 mcwtest]# curl -I 10.0.0.36:33958 HTTP/1.0 200 OK Server: SimpleHTTP/0.6 Python/2.7.18 Date: Wed, 05 Jun 2024 17:30:23 GMT 
Content-type: text/html; charset=ANSI_X3.4-1968 Content-Length: 816 [root@mcwk8s03 mcwtest]#
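The local-versus-remote decision described above can be checked on any host by asking the kernel which interface it would use for a given pod IP. The sketch below is an illustration only: egress_dev and pod_path are made-up names, and the interface names docker0 and flannel.1 are the ones that appear in the route tables of this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the interface the kernel would use to reach an IP.
egress_dev() {
  # `ip route get` prints a line containing "dev <interface>"; pick that word.
  ip route get "$1" |
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Classify a pod IP as local (docker0 bridge) or remote (flannel.1 overlay).
pod_path() {
  case "$(egress_dev "$1")" in
    docker0)   echo "local via docker0" ;;
    flannel.1) echo "remote via flannel.1" ;;
    *)         echo "not on the pod network" ;;
  esac
}

# Usage, with the pod IP from `kubectl get pod -o wide`:
#   pod_path 172.17.89.10
```

Because this only reads the routing table, it keeps working while flanneld itself is stopped, as long as the routes it installed are still present.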
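Stop flannel on master 03. Existing container services are still reachable through nodeip:nodeport; the routes are still there and the interface subnets have not changed.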
[root@mcwk8s03 mcwtest]# route -n Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface 0.0.0.0 10.0.0.254 0.0.0.0 UG 100 0 0 eth0 10.0.0.0 0.0.0.0 255.255.255.0 U 100 0 0 eth0 172.17.9.0 172.17.9.0 255.255.255.0 UG 0 0 0 flannel.1 172.17.83.0 0.0.0.0 255.255.255.0 U 0 0 0 docker0 172.17.89.0 172.17.89.0 255.255.255.0 UG 0 0 0 flannel.1 [root@mcwk8s03 mcwtest]# ifconfig flannel.1 flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450 inet 172.17.83.0 netmask 255.255.255.255 broadcast 0.0.0.0 inet6 fe80::b083:33ff:fe7b:fd37 prefixlen 64 scopeid 0x20<link> ether b2:83:33:7b:fd:37 txqueuelen 0 (Ethernet) RX packets 11 bytes 924 (924.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 11 bytes 924 (924.0 B) TX errors 0 dropped 8 overruns 0 carrier 0 collisions 0 [root@mcwk8s03 mcwtest]# ifconfig docker docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 inet 172.17.83.1 netmask 255.255.255.0 broadcast 172.17.83.255 ether 02:42:e9:a4:51:4f txqueuelen 0 (Ethernet) RX packets 0 bytes 0 (0.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 0 bytes 0 (0.0 B) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 [root@mcwk8s03 mcwtest]#
Next, look at node 05 before stopping the flannel service on it; the pod service is reachable as usual
[root@mcwk8s05 ~]# systemctl status flanneld.service ● flanneld.service - Flanneld overlay address etcd agent Loaded: loaded (/usr/lib/systemd/system/flanneld.service; disabled; vendor preset: disabled) Active: active (running) since Wed 2024-06-05 00:50:25 CST; 24h ago Process: 4201 ExecStartPost=/opt/kubernetes/bin/mk-docker-opts.sh -k DOCKER_NETWORK_OPTIONS -d /run/flannel/subnet.env (code=exited, status=0/SUCCESS) Main PID: 4184 (flanneld) Memory: 16.2M CGroup: /system.slice/flanneld.service └─4184 /opt/kubernetes/bin/flanneld --ip-masq --etcd-endpoints=https://10.0.0.33:2379,https://10.0.0.35:2379,https://10.0.0.36:2379 -etcd-cafile=/opt/etcd/ssl/ca.pem -etcd-cert... Jun 05 23:50:25 mcwk8s05 flanneld[4184]: I0605 23:50:25.459783 4184 main.go:388] Lease renewed, new expiration: 2024-06-06 15:50:25.409920414 +0000 UTC Jun 05 23:50:25 mcwk8s05 flanneld[4184]: I0605 23:50:25.459933 4184 main.go:396] Waiting for 22h59m59.949994764s to renew lease Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable. 
[root@mcwk8s05 ~]# route -n Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface 0.0.0.0 10.0.0.254 0.0.0.0 UG 100 0 0 eth0 10.0.0.0 0.0.0.0 255.255.255.0 U 100 0 0 eth0 172.17.9.0 172.17.9.0 255.255.255.0 UG 0 0 0 flannel.1 172.17.83.0 172.17.83.0 255.255.255.0 UG 0 0 0 flannel.1 172.17.89.0 0.0.0.0 255.255.255.0 U 0 0 0 docker0 [root@mcwk8s05 ~]# curl -I 10.2.0.155:2024 HTTP/1.0 200 OK Server: SimpleHTTP/0.6 Python/2.7.18 Date: Wed, 05 Jun 2024 17:46:55 GMT Content-type: text/html; charset=ANSI_X3.4-1968 Content-Length: 816 [root@mcwk8s05 ~]# curl -I 10.0.0.36:33958 HTTP/1.0 200 OK Server: SimpleHTTP/0.6 Python/2.7.18 Date: Wed, 05 Jun 2024 17:47:00 GMT Content-type: text/html; charset=ANSI_X3.4-1968 Content-Length: 816 [root@mcwk8s05 ~]# curl -I 10.0.0.35:33958 HTTP/1.0 200 OK Server: SimpleHTTP/0.6 Python/2.7.18 Date: Wed, 05 Jun 2024 17:47:04 GMT Content-type: text/html; charset=ANSI_X3.4-1968 Content-Length: 816 [root@mcwk8s05 ~]# ifconfig flannel.1 flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450 inet 172.17.89.0 netmask 255.255.255.255 broadcast 0.0.0.0 inet6 fe80::3470:76ff:feea:39b8 prefixlen 64 scopeid 0x20<link> ether 36:70:76:ea:39:b8 txqueuelen 0 (Ethernet) RX packets 78 bytes 5468 (5.3 KiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 86 bytes 7144 (6.9 KiB) TX errors 0 dropped 8 overruns 0 carrier 0 collisions 0 [root@mcwk8s05 ~]# ifconfig docker docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450 inet 172.17.89.1 netmask 255.255.255.0 broadcast 172.17.89.255 inet6 fe80::42:18ff:fee1:e8fc prefixlen 64 scopeid 0x20<link> ether 02:42:18:e1:e8:fc txqueuelen 0 (Ethernet) RX packets 1212277 bytes 505391601 (481.9 MiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 1374361 bytes 1966293845 (1.8 GiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 [root@mcwk8s05 ~]#
Stop the flannel service on both nodes
[root@mcwk8s05 ~]# systemctl stop flanneld.service [root@mcwk8s05 ~]# [root@mcwk8s06 ~]# systemctl stop flanneld.service [root@mcwk8s06 ~]#
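Check node 05 after flannel has been stopped everywhere. The container service is still reachable through both nodeip:nodeport and clusterIP:port. So far, stopping the k8s components and the network plugin has had no effect on the existing pod. In other words, these components are needed when a pod is created; once its container exists, it keeps serving without them, at least over the short period observed here.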
[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958 HTTP/1.0 200 OK Server: SimpleHTTP/0.6 Python/2.7.18 Date: Wed, 05 Jun 2024 17:49:28 GMT Content-type: text/html; charset=ANSI_X3.4-1968 Content-Length: 816 [root@mcwk8s05 ~]# curl -I 10.0.0.36:33958 HTTP/1.0 200 OK Server: SimpleHTTP/0.6 Python/2.7.18 Date: Wed, 05 Jun 2024 17:49:33 GMT Content-type: text/html; charset=ANSI_X3.4-1968 Content-Length: 816 [root@mcwk8s05 ~]# curl -I 10.2.0.155:2024 HTTP/1.0 200 OK Server: SimpleHTTP/0.6 Python/2.7.18 Date: Wed, 05 Jun 2024 17:49:50 GMT Content-type: text/html; charset=ANSI_X3.4-1968 Content-Length: 816 [root@mcwk8s05 ~]# ifconfig flannel.1 flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450 inet 172.17.89.0 netmask 255.255.255.255 broadcast 0.0.0.0 inet6 fe80::3470:76ff:feea:39b8 prefixlen 64 scopeid 0x20<link> ether 36:70:76:ea:39:b8 txqueuelen 0 (Ethernet) RX packets 83 bytes 5816 (5.6 KiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 91 bytes 7576 (7.3 KiB) TX errors 0 dropped 8 overruns 0 carrier 0 collisions 0 [root@mcwk8s05 ~]# ifconfig docker docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450 inet 172.17.89.1 netmask 255.255.255.0 broadcast 172.17.89.255 inet6 fe80::42:18ff:fee1:e8fc prefixlen 64 scopeid 0x20<link> ether 02:42:18:e1:e8:fc txqueuelen 0 (Ethernet) RX packets 1213264 bytes 505536182 (482.1 MiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 1375186 bytes 1967017624 (1.8 GiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 [root@mcwk8s05 ~]#
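Now look at the network data in etcd from the master. If the flannel service is started again, will it change this host's container subnet and break networking for the existing containers?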
[root@mcwk8s03 mcwtest]# etcdctl ls /
/coreos.com
[root@mcwk8s03 mcwtest]# etcdctl ls /coreos.com
/coreos.com/network
[root@mcwk8s03 mcwtest]# etcdctl ls /coreos.com/network
/coreos.com/network/config
/coreos.com/network/subnets
[root@mcwk8s03 mcwtest]# etcdctl ls /coreos.com/network/subnets/
/coreos.com/network/subnets/172.17.83.0-24
/coreos.com/network/subnets/172.17.9.0-24
/coreos.com/network/subnets/172.17.89.0-24
[root@mcwk8s03 mcwtest]# etcdctl ls /coreos.com/network/config
/coreos.com/network/config
[root@mcwk8s03 mcwtest]# etcdctl get /coreos.com/network/config
{ "Network": "172.17.0.0/16", "Backend": {"Type": "vxlan"}}
[root@mcwk8s03 mcwtest]#
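Each key under /coreos.com/network/subnets is one host's subnet lease, so reading the values shows which host owns which container subnet. The following is a rough sketch under two assumptions that are not shown in this post: that etcdctl is already configured to reach the cluster as in the commands above, and that each lease value is JSON containing a "PublicIP" field. The function name flannel_leases is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map each flannel subnet lease to the host that owns it.
# Assumes the etcd v2 API (the one `etcdctl ls` uses above) and that every
# lease value is JSON with a "PublicIP" field.
flannel_leases() {
  local prefix="${1:-/coreos.com/network/subnets}" key host
  for key in $(etcdctl ls "$prefix"); do
    host=$(etcdctl get "$key" | sed -n 's/.*"PublicIP" *: *"\([^"]*\)".*/\1/p')
    # Keys look like .../172.17.89.0-24; turn the trailing "-24" back into "/24".
    echo "$(basename "$key" | sed 's/-\([0-9]*\)$/\/\1/') ${host:-unknown}"
  done
}

# Usage on the master:
#   flannel_leases
```

Comparing this list before and after a flanneld restart is a direct way to see whether a host kept its subnet.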
After the network service is started again, the host's container subnet is unchanged
[root@mcwk8s05 ~]# ifconfig flannel.1 flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450 inet 172.17.89.0 netmask 255.255.255.255 broadcast 0.0.0.0 inet6 fe80::3470:76ff:feea:39b8 prefixlen 64 scopeid 0x20<link> ether 36:70:76:ea:39:b8 txqueuelen 0 (Ethernet) RX packets 83 bytes 5816 (5.6 KiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 91 bytes 7576 (7.3 KiB) TX errors 0 dropped 8 overruns 0 carrier 0 collisions 0 [root@mcwk8s05 ~]# ifconfig docker docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450 inet 172.17.89.1 netmask 255.255.255.0 broadcast 172.17.89.255 inet6 fe80::42:18ff:fee1:e8fc prefixlen 64 scopeid 0x20<link> ether 02:42:18:e1:e8:fc txqueuelen 0 (Ethernet) RX packets 1213264 bytes 505536182 (482.1 MiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 1375186 bytes 1967017624 (1.8 GiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 [root@mcwk8s05 ~]# [root@mcwk8s05 ~]# systemctl status flanneld.service ● flanneld.service - Flanneld overlay address etcd agent Loaded: loaded (/usr/lib/systemd/system/flanneld.service; disabled; vendor preset: disabled) Active: inactive (dead) Jun 05 23:50:25 mcwk8s05 flanneld[4184]: I0605 23:50:25.459783 4184 main.go:388] Lease renewed, new expiration: 2024-06-06 15:50:25.409920414 +0000 UTC Jun 05 23:50:25 mcwk8s05 flanneld[4184]: I0605 23:50:25.459933 4184 main.go:396] Waiting for 22h59m59.949994764s to renew lease Jun 06 01:48:25 mcwk8s05 systemd[1]: Stopping Flanneld overlay address etcd agent... Jun 06 01:48:25 mcwk8s05 flanneld[4184]: I0606 01:48:25.715882 4184 main.go:404] Stopped monitoring lease Jun 06 01:48:25 mcwk8s05 flanneld[4184]: I0606 01:48:25.716035 4184 main.go:322] Waiting for all goroutines to exit Jun 06 01:48:25 mcwk8s05 flanneld[4184]: I0606 01:48:25.741648 4184 main.go:337] shutdownHandler sent cancel signal... Jun 06 01:48:25 mcwk8s05 flanneld[4184]: I0606 01:48:25.759081 4184 main.go:325] Exiting cleanly... 
Jun 06 01:48:25 mcwk8s05 systemd[1]: Stopped Flanneld overlay address etcd agent. [root@mcwk8s05 ~]# systemctl start flanneld.service [root@mcwk8s05 ~]# ifconfig flannel.1 flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450 inet 172.17.89.0 netmask 255.255.255.255 broadcast 0.0.0.0 inet6 fe80::3470:76ff:feea:39b8 prefixlen 64 scopeid 0x20<link> ether 36:70:76:ea:39:b8 txqueuelen 0 (Ethernet) RX packets 83 bytes 5816 (5.6 KiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 91 bytes 7576 (7.3 KiB) TX errors 0 dropped 8 overruns 0 carrier 0 collisions 0 [root@mcwk8s05 ~]# ifconfig docker docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450 inet 172.17.89.1 netmask 255.255.255.0 broadcast 172.17.89.255 inet6 fe80::42:18ff:fee1:e8fc prefixlen 64 scopeid 0x20<link> ether 02:42:18:e1:e8:fc txqueuelen 0 (Ethernet) RX packets 1216321 bytes 505986272 (482.5 MiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 1377747 bytes 1969464945 (1.8 GiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 [root@mcwk8s05 ~]# route -n Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface 0.0.0.0 10.0.0.254 0.0.0.0 UG 100 0 0 eth0 10.0.0.0 0.0.0.0 255.255.255.0 U 100 0 0 eth0 172.17.9.0 172.17.9.0 255.255.255.0 UG 0 0 0 flannel.1 172.17.83.0 172.17.83.0 255.255.255.0 UG 0 0 0 flannel.1 172.17.89.0 0.0.0.0 255.255.255.0 U 0 0 0 docker0 [root@mcwk8s05 ~]#
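Rather than comparing two ifconfig dumps by eye, the check "is docker0 still inside the subnet that flannel.1 advertises" can be scripted. This is a sketch: the function names are invented, and comparing the first three octets is only valid because the leases in this cluster are /24 subnets.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first IPv4 address (without prefix length)
# configured on an interface.
iface_ipv4() {
  # `ip -4 -o addr show dev X` prints one line per address; field 4 is "addr/prefix".
  ip -4 -o addr show dev "$1" | awk 'NR == 1 { sub(/\/.*/, "", $4); print $4 }'
}

# Succeed if flannel.1 and docker0 share the same first three octets.
# Only meaningful for /24 leases, which is an assumption taken from the
# "-24" lease names in etcd.
same_pod_subnet() {
  local f d
  f=$(iface_ipv4 flannel.1)
  d=$(iface_ipv4 docker0)
  [ -n "$f" ] && [ "${f%.*}" = "${d%.*}" ]
}

# Usage on a node, before and after restarting flanneld:
#   same_pod_subnet && echo "subnet unchanged" || echo "docker0 and flannel.1 disagree"
```

If the two ever disagree, containers started by docker would get addresses outside the subnet that the other hosts route to this node.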
The container service is also still reachable from the other machines
[root@mcwk8s03 mcwtest]# curl -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 18:01:00 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s03 mcwtest]#
[root@mcwk8s06 ~]# systemctl stop flanneld.service
[root@mcwk8s06 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Wed, 05 Jun 2024 18:02:02 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s06 ~]#
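Try once more by restarting the flannel service: the container subnet still does not change. So what caused the changes seen on earlier occasions? One possible explanation is lease expiry: the flanneld log above shows the subnet lease being renewed with an expiration about a day ahead, and flanneld keeps its subnet while that lease is still valid, so a host whose flanneld stays down past the expiry may be handed a different subnet.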
[root@mcwk8s03 mcwtest]# etcdctl ls /coreos.com/network/subnets/ /coreos.com/network/subnets/172.17.89.0-24 /coreos.com/network/subnets/172.17.83.0-24 /coreos.com/network/subnets/172.17.9.0-24 [root@mcwk8s03 mcwtest]# ifconfig flannel.1 flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450 inet 172.17.83.0 netmask 255.255.255.255 broadcast 0.0.0.0 inet6 fe80::b083:33ff:fe7b:fd37 prefixlen 64 scopeid 0x20<link> ether b2:83:33:7b:fd:37 txqueuelen 0 (Ethernet) RX packets 11 bytes 924 (924.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 11 bytes 924 (924.0 B) TX errors 0 dropped 8 overruns 0 carrier 0 collisions 0 [root@mcwk8s03 mcwtest]# ifconfig docker docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 inet 172.17.83.1 netmask 255.255.255.0 broadcast 172.17.83.255 ether 02:42:e9:a4:51:4f txqueuelen 0 (Ethernet) RX packets 0 bytes 0 (0.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 0 bytes 0 (0.0 B) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 [root@mcwk8s03 mcwtest]# systemctl restart flanneld.service [root@mcwk8s03 mcwtest]# etcdctl ls /coreos.com/network/subnets/ /coreos.com/network/subnets/172.17.9.0-24 /coreos.com/network/subnets/172.17.89.0-24 /coreos.com/network/subnets/172.17.83.0-24 [root@mcwk8s03 mcwtest]# ifconfig flannel.1 flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450 inet 172.17.83.0 netmask 255.255.255.255 broadcast 0.0.0.0 inet6 fe80::b083:33ff:fe7b:fd37 prefixlen 64 scopeid 0x20<link> ether b2:83:33:7b:fd:37 txqueuelen 0 (Ethernet) RX packets 11 bytes 924 (924.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 11 bytes 924 (924.0 B) TX errors 0 dropped 8 overruns 0 carrier 0 collisions 0 [root@mcwk8s03 mcwtest]# ifconfig docker docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 inet 172.17.83.1 netmask 255.255.255.0 broadcast 172.17.83.255 ether 02:42:e9:a4:51:4f txqueuelen 0 (Ethernet) RX packets 0 bytes 0 (0.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 0 bytes 0 (0.0 B) TX 
errors 0 dropped 0 overruns 0 carrier 0 collisions 0 [root@mcwk8s03 mcwtest]#
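A day later, networking on node 05 was broken and requests failed; starting kube-proxy fixed it
kubectl shows nothing abnormal. Requests through 05 fail, but through 06 the pod is reachable; the pod runs on 05.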
[root@mcwk8s03 mcwtest]# kubectl get svc
NAME          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
kubernetes    ClusterIP   10.2.0.1     <none>        443/TCP          584d
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   2d
nginx         ClusterIP   None         <none>        80/TCP           415d
[root@mcwk8s03 mcwtest]# kubectl get pod
NAME                                READY   STATUS             RESTARTS   AGE
mcwnginx-depoyment-8shfd            1/1     Running            20         415d
mcwnginx-depoyment-ngthk            1/1     Running            7          415d
mcwtest-deploy-6465665557-kbpdr     1/1     Running            0          115m
nginx-deployment-68c7f5464c-n8t2b   1/1     Running            0          115m
nginx-statefulset-0                 1/1     Running            0          19m
nginx-statefulset-1                 1/1     Running            1          46h
nginx-statefulset-2                 1/1     Running            1          46h
prometheus-0                        0/1     CrashLoopBackOff   8          19m
prometheus-1                        0/1     CrashLoopBackOff   8          19m
prometheus-2                        0/1     CrashLoopBackOff   536        46h
[root@mcwk8s03 mcwtest]# curl -I 10.0.0.35:33958
curl: (7) Failed connect to 10.0.0.35:33958; Connection timed out
[root@mcwk8s03 mcwtest]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 14:57:09 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s03 mcwtest]#
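Checked: still present.
On 05 itself, the pod cannot be reached through 05's nodeip but can be reached through 06's. The kube-proxy service on 05 turns out to be stopped. After starting it, the pod is reachable through 05's nodeip again.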
[root@mcwk8s05 /]# curl -I 10.0.0.35:33958
curl: (7) Failed connect to 10.0.0.35:33958; Connection timed out
[root@mcwk8s05 /]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 15:01:22 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 /]# systemctl status kube-proxy
● kube-proxy.service - Kubernetes Proxy
   Loaded: loaded (/usr/lib/systemd/system/kube-proxy.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Thu 2024-06-06 21:07:32 CST; 1h 54min ago
  Process: 50648 ExecStart=/opt/kubernetes/bin/kube-proxy $KUBE_PROXY_OPTS (code=killed, signal=PIPE)
 Main PID: 50648 (code=killed, signal=PIPE)
Jun 06 20:59:58 mcwk8s05 kube-proxy[50648]: I0606 20:59:58.068662 50648 config.go:132] Calling handler.OnEndpointsUpdate
Jun 06 20:59:58 mcwk8s05 kube-proxy[50648]: I0606 20:59:58.068668 50648 config.go:132] Calling handler.OnEndpointsUpdate
Jun 06 20:59:58 mcwk8s05 kube-proxy[50648]: I0606 20:59:58.068673 50648 config.go:132] Calling handler.OnEndpointsUpdate
Jun 06 20:59:58 mcwk8s05 kube-proxy[50648]: I0606 20:59:58.068679 50648 config.go:132] Calling handler.OnEndpointsUpdate
Jun 06 20:59:58 mcwk8s05 kube-proxy[50648]: I0606 20:59:58.068683 50648 config.go:132] Calling handler.OnEndpointsUpdate
Jun 06 20:59:58 mcwk8s05 kube-proxy[50648]: I0606 20:59:58.068688 50648 config.go:132] Calling handler.OnEndpointsUpdate
Jun 06 21:00:24 mcwk8s05 kube-proxy[50648]: I0606 21:00:02.190985 50648 config.go:132] Calling handler.OnEndpointsUpdate
Jun 06 21:00:30 mcwk8s05 kube-proxy[50648]: I0606 21:00:03.493815 50648 config.go:132] Calling handler.OnEndpointsUpdate
Jun 06 21:00:45 mcwk8s05 kube-proxy[50648]: I0606 21:00:07.350882 50648 config.go:132] Calling handler.OnEndpointsUpdate
Jun 06 21:00:49 mcwk8s05 kube-proxy[50648]: I0606 21:00:16.466427 50648 config.go:132] Calling handler.OnEndpointsUpdate
[root@mcwk8s05 /]# systemctl start kube-proxy
[root@mcwk8s05 /]# curl -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 15:01:54 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 /]#
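One possible cause, which the output above suggests but does not prove, is that the pod was recreated (its name and age changed) while kube-proxy on 05 was not running, so the IPVS rules on 05 kept pointing at the old pod IP. The sketch below compares the real servers IPVS knows about with the Service's current endpoints. The function name stale_backends is made up, and it must run on a host that has both kubectl access and ipvsadm.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print IPVS real-server IPs for one virtual service
# that are NOT among the Service's current endpoint IPs (possibly stale rules).
stale_backends() {
  local svc="$1" vip="$2" want ip
  # Pod IPs currently behind the Service, space-separated.
  want=$(kubectl get endpoints "$svc" -o jsonpath='{.subsets[*].addresses[*].ip}')
  # Real-server IPs that IPVS on this node forwards to for the same service.
  for ip in $(ipvsadm -Ln -t "$vip" |
      awk '$1 == "->" && $2 ~ /^[0-9]/ { sub(/:.*/, "", $2); print $2 }'); do
    case " $want " in
      *" $ip "*) ;;          # still a live endpoint
      *) echo "$ip" ;;       # IPVS points at an IP the Service no longer has
    esac
  done
}

# Usage on node 05:
#   stale_backends mcwtest-svc 10.0.0.35:33958
```

Any IP it prints is a forwarding target that only a running kube-proxy would have cleaned up.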
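The log in /var/log/messages shows that this rule had 0 active connections and 1 inactive one, so something was wrong with the IPVS rule. After kube-proxy started, it removed the rule, or perhaps removed and re-added it, and only then did connections become active. That raises the question: how can one check whether a forwarding real server (RS) is actually valid?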
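One rough way to answer that question is to try a TCP connection to every real server of the virtual service and print the result next to the connection counters. This is only a sketch: the function names are invented, the 3 second timeout is arbitrary, and a successful connect shows that the backend accepts connections from this node, not that the application behind it is healthy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a TCP connection to host $1 port $2 opens
# within 3 seconds (uses bash's /dev/tcp pseudo-device).
tcp_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" </dev/null 2>/dev/null
}

# For every real server of one IPVS virtual service, print
# "<rs> ok|dead active=<ActiveConn> inactive=<InActConn>".
check_real_servers() {
  local vip="$1" rs active inactive
  ipvsadm -Ln -t "$vip" |
    awk '$1 == "->" && $2 ~ /^[0-9]/ { print $2, $5, $6 }' |
    while read -r rs active inactive; do
      if tcp_open "${rs%:*}" "${rs##*:}"; then
        echo "$rs ok active=$active inactive=$inactive"
      else
        echo "$rs dead active=$active inactive=$inactive"
      fi
    done
}

# Usage on a node:
#   check_real_servers 10.0.0.35:33958
```

A real server reported as dead while the pod is Running is a hint that the rule points at an address that no longer exists.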
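Stopping etcd: effect on access to containers
When time permits, look into whether stopping etcd affects normal access to containers. It probably should: for cross-host container traffic, etcd holds the mapping between container subnets and host IPs. It works like a routing table; only with it can the target container's host be identified, the packet sent there, decapsulated, and delivered.
Check the environment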
[root@mcwk8s03 mcwtest]# kubectl get svc|grep mcwtest
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   2d1h
[root@mcwk8s03 mcwtest]# kubectl get pod|grep mcwtest
mcwtest-deploy-6465665557-kbpdr   1/1   Running   0   152m
[root@mcwk8s03 mcwtest]# kubectl get csd
error: the server doesn't have a resource type "csd"
[root@mcwk8s03 mcwtest]# kubectl get cs
NAME                 STATUS    MESSAGE             ERROR
scheduler            Healthy   ok
controller-manager   Healthy   ok
etcd-0               Healthy   {"health":"true"}
etcd-2               Healthy   {"health":"true"}
etcd-1               Healthy   {"health":"true"}
[root@mcwk8s03 mcwtest]#
Confirm that access is normal
[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 15:29:32 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 15:29:53 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 15:29:56 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# date
Thu Jun 6 23:30:06 CST 2024
[root@mcwk8s05 ~]#
Stop etcd
[root@mcwk8s03 mcwtest]# kubectl get cs NAME STATUS MESSAGE ERROR scheduler Healthy ok controller-manager Healthy ok etcd-0 Healthy {"health":"true"} etcd-2 Healthy {"health":"true"} etcd-1 Healthy {"health":"true"} [root@mcwk8s03 mcwtest]# [root@mcwk8s03 mcwtest]# [root@mcwk8s03 mcwtest]# systemctl stop etcd [root@mcwk8s03 mcwtest]# kubectl get cs NAME STATUS MESSAGE ERROR etcd-1 Unhealthy Get https://10.0.0.33:2379/health: dial tcp 10.0.0.33:2379: connect: connection refused controller-manager Healthy ok scheduler Healthy ok etcd-2 Healthy {"health":"true"} etcd-0 Healthy {"health":"true"} [root@mcwk8s03 mcwtest]# etcdctl ls / /coreos.com [root@mcwk8s03 mcwtest]# kubectl get cs NAME STATUS MESSAGE ERROR etcd-1 Unhealthy Get https://10.0.0.33:2379/health: dial tcp 10.0.0.33:2379: connect: connection refused etcd-2 Unhealthy Get https://10.0.0.35:2379/health: dial tcp 10.0.0.35:2379: connect: connection refused controller-manager Healthy ok scheduler Healthy ok etcd-0 Unhealthy HTTP probe failed with statuscode: 503 [root@mcwk8s03 mcwtest]# etcdctl ls / /coreos.com [root@mcwk8s03 mcwtest]#
Two etcd members are now stopped, and the pod is still reachable
[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:15:10 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:15:11 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:15:15 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]#
Now every etcd member is down and no data can be queried. When I list what my etcd stores, though, I only see a small amount of network data, whereas other people's deployments show many kinds of data. I did not know where the difference came from, and stopping etcd here seemed to have little impact. One likely reason is that `etcdctl ls` talks to the etcd v2 API, which only shows v2 keys such as flannel's /coreos.com tree; Kubernetes stores its objects through the v3 API (under /registry by default), which this command does not list.
[root@mcwk8s03 mcwtest]# kubectl get cs
NAME                 STATUS      MESSAGE   ERROR
etcd-0               Unhealthy   Get https://10.0.0.33:2379/health: dial tcp 10.0.0.33:2379: connect: connection refused
etcd-2               Unhealthy   Get https://10.0.0.36:2379/health: dial tcp 10.0.0.36:2379: connect: connection refused
controller-manager   Healthy     ok
scheduler            Healthy     ok
etcd-1               Unhealthy   Get https://10.0.0.35:2379/health: dial tcp 10.0.0.35:2379: connect: connection refused
[root@mcwk8s03 mcwtest]# etcdctl ls /
Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 10.0.0.36:2379: connect: connection refused
; error #1: dial tcp 10.0.0.33:2379: connect: connection refused
; error #2: dial tcp 10.0.0.35:2379: connect: connection refused
error #0: dial tcp 10.0.0.36:2379: connect: connection refused
error #1: dial tcp 10.0.0.33:2379: connect: connection refused
error #2: dial tcp 10.0.0.35:2379: connect: connection refused
[root@mcwk8s03 mcwtest]#
With etcd stopped, the pod on node 05 is still reachable (as it turns out later, the pod was actually running on 06 by this point)
[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:19:38 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:19:41 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:19:46 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]#
However, the master can no longer query Kubernetes resources
[root@mcwk8s03 mcwtest]# kubectl get svc
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get services)
[root@mcwk8s03 mcwtest]#
Looking at /var/log/messages, flanneld reports that the etcd cluster is unavailable: connections to all of the etcd members are refused
Jun 7 00:22:09 mcwk8s03 flanneld: E0607 00:22:09.538405 130015 watch.go:43] Watch subnets: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 10.0.0.35:2379: getsockopt: connection refused
Jun 7 00:22:09 mcwk8s03 flanneld: ; error #1: dial tcp 10.0.0.36:2379: getsockopt: connection refused
Jun 7 00:22:14 mcwk8s03 flanneld: E0607 00:22:14.547963 130015 watch.go:43] Watch subnets: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 10.0.0.35:2379: getsockopt: connection refused
Jun 7 00:22:14 mcwk8s03 flanneld: ; error #1: dial tcp 10.0.0.36:2379: getsockopt: connection refused
Jun 7 00:22:14 mcwk8s03 flanneld: ; error #2: dial tcp 10.0.0.33:2379: getsockopt: connection refused
kube-apiserver also reports errors connecting to etcd
Jun 7 00:26:08 mcwk8s03 kube-apiserver: W0607 00:26:08.919244 125396 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {10.0.0.36:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.36:2379: connect: connection refused". Reconnecting...
Jun 7 00:26:08 mcwk8s03 kube-apiserver: W0607 00:26:08.998513 125396 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {10.0.0.33:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.33:2379: connect: connection refused". Reconnecting...
After a while, the pod is still reachable
[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:27:41 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:27:44 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:27:47 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# date
Fri Jun 7 00:27:53 CST 2024
[root@mcwk8s05 ~]#
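Typing the same three curl commands after every step gets tedious. Below is a minimal sketch of a probe helper, assuming curl is installed on the node. The function name probe_once and the PROBE_TIMEOUT variable are my own inventions for illustration and are not part of the original setup.

```shell
# Hypothetical helper: probe one address and print a timestamped one-line result.
# $1 = address, for example 10.0.0.35:33958
# PROBE_TIMEOUT = assumed knob for the overall timeout in seconds (default 5)
probe_once() {
  # -s: silent, -I: HEAD request, -o /dev/null: discard the headers,
  # -m: overall timeout, -w: print only the HTTP status code (000 on failure)
  code=$(curl -s -I -o /dev/null -m "${PROBE_TIMEOUT:-5}" -w '%{http_code}' "$1")
  rc=$?
  printf '%s %s http=%s curl_exit=%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$1" "$code" "$rc"
}

# Example loop (not executed here):
#   while sleep 5; do
#     for u in 10.2.0.155:2024 10.0.0.35:33958 10.0.0.36:33958; do probe_once "$u"; done
#   done
```

Leaving such a loop running in a second terminal gives a timeline of when each address stops or starts answering. Note that with -m set, a hung connection ends with curl exit code 28 (timeout) instead of waiting for the kernel's own connect timeout.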
Stop flanneld on both nodes, 05 and 06; the pod is still reachable
[root@mcwk8s05 ~]# systemctl stop flanneld
[root@mcwk8s05 ~]# date
Fri Jun 7 00:29:08 CST 2024
[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:29:07 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:29:11 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:29:16 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]#
flanneld has already been stopped above. Here I try to start a single etcd member and it fails: with the other two members down, it keeps requesting votes that nobody answers, so it never becomes ready and the systemd start job times out
[root@mcwk8s06 ~]# systemctl stop flanneld
[root@mcwk8s06 ~]# systemctl start etcd
Job for etcd.service failed because a timeout was exceeded. See "systemctl status etcd.service" and "journalctl -xe" for details.
[root@mcwk8s06 ~]# systemctl status etcd.service
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; disabled; vendor preset: disabled)
   Active: activating (start) since Fri 2024-06-07 00:31:48 CST; 15s ago
 Main PID: 33888 (etcd)
   Memory: 71.8M
   CGroup: /system.slice/etcd.service
           └─33888 /opt/etcd/bin/etcd --name=etcd03 --data-dir=/var/lib/etcd/default.etcd --listen-peer-urls=https://10.0.0.36:2380 --listen-client-urls=https://10.0.0.36:2379,http://127....
Jun 07 00:32:02 mcwk8s06 etcd[33888]: publish error: etcdserver: request timed out
Jun 07 00:32:03 mcwk8s06 etcd[33888]: 1f77d024d243f007 is starting a new election at term 910111
Jun 07 00:32:03 mcwk8s06 etcd[33888]: 1f77d024d243f007 became candidate at term 910112
Jun 07 00:32:03 mcwk8s06 etcd[33888]: 1f77d024d243f007 received MsgVoteResp from 1f77d024d243f007 at term 910112
Jun 07 00:32:03 mcwk8s06 etcd[33888]: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to 64051d53b5971b69 at term 910112
Jun 07 00:32:03 mcwk8s06 etcd[33888]: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to f1ec1f6015c9d4a4 at term 910112
Jun 07 00:32:03 mcwk8s06 etcd[33888]: health check for peer 64051d53b5971b69 could not connect: dial tcp 10.0.0.35:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
Jun 07 00:32:03 mcwk8s06 etcd[33888]: health check for peer f1ec1f6015c9d4a4 could not connect: dial tcp 10.0.0.33:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
Jun 07 00:32:03 mcwk8s06 etcd[33888]: health check for peer 64051d53b5971b69 could not connect: dial tcp 10.0.0.35:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
Jun 07 00:32:03 mcwk8s06 etcd[33888]: health check for peer f1ec1f6015c9d4a4 could not connect: dial tcp 10.0.0.33:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
/var/log/messages shows these errors
Jun 7 00:34:13 mcwk8s06 etcd: 1f77d024d243f007 is starting a new election at term 910195
Jun 7 00:34:13 mcwk8s06 etcd: 1f77d024d243f007 became candidate at term 910196
Jun 7 00:34:13 mcwk8s06 etcd: 1f77d024d243f007 received MsgVoteResp from 1f77d024d243f007 at term 910196
Jun 7 00:34:13 mcwk8s06 etcd: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to 64051d53b5971b69 at term 910196
Jun 7 00:34:13 mcwk8s06 etcd: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to f1ec1f6015c9d4a4 at term 910196
Jun 7 00:34:15 mcwk8s06 etcd: health check for peer f1ec1f6015c9d4a4 could not connect: dial tcp 10.0.0.33:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
Jun 7 00:34:15 mcwk8s06 etcd: health check for peer 64051d53b5971b69 could not connect: dial tcp 10.0.0.35:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
Jun 7 00:34:15 mcwk8s06 etcd: health check for peer f1ec1f6015c9d4a4 could not connect: dial tcp 10.0.0.33:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
Jun 7 00:34:15 mcwk8s06 etcd: health check for peer 64051d53b5971b69 could not connect: dial tcp 10.0.0.35:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
Jun 7 00:34:15 mcwk8s06 etcd: publish error: etcdserver: request timed out
Jun 7 00:34:15 mcwk8s06 etcd: 1f77d024d243f007 is starting a new election at term 910196
Jun 7 00:34:15 mcwk8s06 etcd: 1f77d024d243f007 became candidate at term 910197
Jun 7 00:34:15 mcwk8s06 etcd: 1f77d024d243f007 received MsgVoteResp from 1f77d024d243f007 at term 910197
Jun 7 00:34:15 mcwk8s06 etcd: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to 64051d53b5971b69 at term 910197
Jun 7 00:34:15 mcwk8s06 etcd: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to f1ec1f6015c9d4a4 at term 910197
Jun 7 00:34:16 mcwk8s06 etcd: 1f77d024d243f007 is starting a new election at term 910197
Jun 7 00:34:16 mcwk8s06 etcd: 1f77d024d243f007 became candidate at term 910198
Jun 7 00:34:16 mcwk8s06 etcd: 1f77d024d243f007 received MsgVoteResp from 1f77d024d243f007 at term 910198
Jun 7 00:34:16 mcwk8s06 etcd: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to 64051d53b5971b69 at term 910198
Jun 7 00:34:16 mcwk8s06 etcd: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to f1ec1f6015c9d4a4 at term 910198
Jun 7 00:34:17 mcwk8s06 etcd: 1f77d024d243f007 is starting a new election at term 910198
Jun 7 00:34:17 mcwk8s06 etcd: 1f77d024d243f007 became candidate at term 910199
Jun 7 00:34:17 mcwk8s06 etcd: 1f77d024d243f007 received MsgVoteResp from 1f77d024d243f007 at term 910199
Jun 7 00:34:17 mcwk8s06 etcd: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to 64051d53b5971b69 at term 910199
Jun 7 00:34:17 mcwk8s06 etcd: 1f77d024d243f007 [logterm: 909955, index: 10893385] sent MsgVote request to f1ec1f6015c9d4a4 at term 910199
Jun 7 00:34:18 mcwk8s06 etcd: 1f77d024d243f007 is starting a new election at term 910199
Jun 7 00:34:18 mcwk8s06 etcd: 1f77d024d243f007 became candidate at term 910200
Jun 7 00:34:18 mcwk8s06 etcd: 1f77d024d243f007 received MsgVoteResp from 1f77d024d243f007 at term 910200
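The log above is one member campaigning over and over, with the term climbing each round and only its own vote coming back. A small sketch that condenses such a log into one line, assuming the message wording shown above ("became candidate at term N") and assuming that a successful election is logged as "became leader at term N". The name election_summary is made up for this example.

```shell
# Hypothetical helper: summarize etcd election attempts from log lines on stdin.
election_summary() {
  awk '
    /became candidate at term/ {
      term = $NF                    # the term number is the last field
      if (count == 0) first = term
      last = term
      count++
    }
    /became leader at term/ { leader = 1 }
    END {
      if (count == 0) { print "no election attempts found"; exit }
      printf "campaigns=%d first_term=%s last_term=%s leader_elected=%s\n",
             count, first, last, (leader ? "yes" : "no")
    }'
}

# Usage (not executed here): grep ' etcd: ' /var/log/messages | election_summary
```

Many campaigns with leader_elected=no is the signature of a member that cannot reach a majority of its peers.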
Next I tried to bring flanneld on node 06 back up first, but it also failed to start
[root@mcwk8s06 ~]# systemctl start flanneld
Job for flanneld.service failed because a timeout was exceeded. See "systemctl status flanneld.service" and "journalctl -xe" for details.
[root@mcwk8s06 ~]# systemctl status flanneld
● flanneld.service - Flanneld overlay address etcd agent
   Loaded: loaded (/usr/lib/systemd/system/flanneld.service; disabled; vendor preset: disabled)
   Active: activating (start) since Fri 2024-06-07 00:42:11 CST; 1min 12s ago
 Main PID: 36542 (flanneld)
   Memory: 8.5M
   CGroup: /system.slice/flanneld.service
           └─36542 /opt/kubernetes/bin/flanneld --ip-masq --etcd-endpoints=https://10.0.0.33:2379,https://10.0.0.35:2379,https://10.0.0.36:2379 -etcd-cafile=/opt/etcd/ssl/ca.pem -etcd-cer...
Jun 07 00:42:54 mcwk8s06 flanneld[36542]: ; error #2: net/http: TLS handshake timeout
Jun 07 00:42:55 mcwk8s06 flanneld[36542]: timed out
Jun 07 00:43:05 mcwk8s06 flanneld[36542]: E0607 00:43:05.841656 36542 main.go:349] Couldn't fetch network config: client: etcd cluster is unavailable or misconfigured; error...tion refused
Jun 07 00:43:05 mcwk8s06 flanneld[36542]: ; error #1: dial tcp 10.0.0.33:2379: getsockopt: connection refused
Jun 07 00:43:05 mcwk8s06 flanneld[36542]: ; error #2: net/http: TLS handshake timeout
Jun 07 00:43:06 mcwk8s06 flanneld[36542]: timed out
Jun 07 00:43:16 mcwk8s06 flanneld[36542]: E0607 00:43:16.846177 36542 main.go:349] Couldn't fetch network config: client: etcd cluster is unavailable or misconfigured; error...tion refused
Jun 07 00:43:16 mcwk8s06 flanneld[36542]: ; error #1: dial tcp 10.0.0.33:2379: getsockopt: connection refused
Jun 07 00:43:16 mcwk8s06 flanneld[36542]: ; error #2: net/http: TLS handshake timeout
Jun 07 00:43:17 mcwk8s06 flanneld[36542]: timed out
Hint: Some lines were ellipsized, use -l to show in full.
[root@mcwk8s06 ~]#
According to /var/log/messages, flanneld went to etcd to fetch its network configuration but could not connect
Jun 7 00:42:11 mcwk8s06 systemd: flanneld.service start operation timed out. Terminating.
Jun 7 00:42:11 mcwk8s06 flanneld: E0607 00:42:11.520247 36223 main.go:349] Couldn't fetch network config: context canceled
Jun 7 00:42:11 mcwk8s06 flanneld: I0607 00:42:11.520736 36223 main.go:337] shutdownHandler sent cancel signal...
Jun 7 00:42:11 mcwk8s06 systemd: Failed to start Flanneld overlay address etcd agent.
Jun 7 00:42:11 mcwk8s06 systemd: Unit flanneld.service entered failed state.
Jun 7 00:42:11 mcwk8s06 systemd: flanneld.service failed.
Jun 7 00:42:11 mcwk8s06 kubelet: I0607 00:42:11.587113 61577 prober.go:173] HTTP-Probe Host: http://172.17.9.10, Port: 3000, Path: /login
Jun 7 00:42:11 mcwk8s06 kubelet: I0607 00:42:11.587152 61577 prober.go:176] HTTP-Probe Headers: map[]
Jun 7 00:42:11 mcwk8s06 kubelet: I0607 00:42:11.589166 61577 http.go:120] Probe succeeded for http://172.17.9.10:3000/login, Response: {200 OK 200 HTTP/1.1 1 1 map[Content-Type:[text/html; charset=UTF-8] Date:[Thu, 06 Jun 2024 16:42:11 GMT]] 0xc0014ee7e0 -1 [chunked] true false map[] 0xc00140c100 <nil>}
Jun 7 00:42:11 mcwk8s06 kubelet: I0607 00:42:11.589213 61577 prober.go:125] Readiness probe for "grafana-core-5cc8dff58b-t97pb_kube-system(d15fbb7f-6d16-4471-9e58-e64ccbfd89da):grafana-core" succeeded
Jun 7 00:42:11 mcwk8s06 systemd: flanneld.service holdoff time over, scheduling restart.
Jun 7 00:42:11 mcwk8s06 systemd: Starting Flanneld overlay address etcd agent...
Jun 7 00:42:11 mcwk8s06 flanneld: I0607 00:42:11.823038 36542 main.go:475] Determining IP address of default interface
Jun 7 00:42:11 mcwk8s06 flanneld: I0607 00:42:11.824305 36542 main.go:488] Using interface with name eth0 and address 10.0.0.36
Jun 7 00:42:11 mcwk8s06 flanneld: I0607 00:42:11.824322 36542 main.go:505] Defaulting external address to interface address (10.0.0.36)
Jun 7 00:42:11 mcwk8s06 flanneld: warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
Jun 7 00:42:11 mcwk8s06 flanneld: I0607 00:42:11.825141 36542 main.go:235] Created subnet manager: Etcd Local Manager with Previous Subnet: None
Jun 7 00:42:11 mcwk8s06 flanneld: I0607 00:42:11.825148 36542 main.go:238] Installing signal handlers
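The member on 06 will not come up, so let's try starting etcd on 03 first, since it is the first member listed in the startup command.
The first etcd member starts normally, and once it is up, the member on 06 that had failed to start earlier is up as well. My guess is that although the start I issued on 06 reported a failure, systemd kept retrying in the background, so 06 joined as soon as the first member came up. At the time I concluded that when etcd will not start you should try the first member first, because the 06 logs seemed to pay more attention to it. The more likely explanation is quorum: a three-member cluster needs two members up to elect a leader, so a single member keeps campaigning until any second member joins, and starting the member on 05 instead should have had the same effect.
[root@mcwk8s03 /]# kubectl get cs
NAME                 STATUS      MESSAGE   ERROR
etcd-0               Unhealthy   Get https://10.0.0.33:2379/health: dial tcp 10.0.0.33:2379: connect: connection refused
etcd-1               Unhealthy   Get https://10.0.0.35:2379/health: dial tcp 10.0.0.35:2379: connect: connection refused
scheduler            Healthy     ok
controller-manager   Healthy     ok
etcd-2               Unhealthy   Get https://10.0.0.36:2379/health: net/http: TLS handshake timeout
[root@mcwk8s03 /]# systemctl start etcd
[root@mcwk8s03 /]# kubectl get cs
NAME                 STATUS      MESSAGE             ERROR
etcd-1               Unhealthy   Get https://10.0.0.35:2379/health: dial tcp 10.0.0.35:2379: connect: connection refused
controller-manager   Healthy     ok
etcd-0               Healthy     {"health":"true"}
etcd-2               Healthy     {"health":"true"}
scheduler            Healthy     ok
[root@mcwk8s03 /]#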
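The recovery above fits the usual majority rule: a cluster of N members needs floor(N/2)+1 of them, so 2 out of 3 here. A rough sketch that applies this arithmetic to `kubectl get cs` output, assuming the column layout shown above (component name first, status second). The helper name etcd_quorum is hypothetical.

```shell
# Hypothetical helper: count etcd members in `kubectl get cs` output (stdin)
# and report whether the healthy ones form a majority.
etcd_quorum() {
  awk '
    $1 ~ /^etcd-/ { total++; if ($2 == "Healthy") healthy++ }
    END {
      needed = int(total / 2) + 1   # majority of the members that are listed
      printf "etcd healthy=%d total=%d needed=%d quorum=%s\n",
             healthy, total, needed, (healthy >= needed ? "yes" : "no")
    }'
}

# Usage on the master (not executed here): kubectl get cs 2>/dev/null | etcd_quorum
```

The status column is only the apiserver's view of each member's /health endpoint, so treat the result as a hint. A running member without a leader also shows up as Unhealthy, as the 503 earlier in this walkthrough illustrates.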
Looking at the data, kubectl can query Kubernetes resources again, and the subnets have not changed so far
[root@mcwk8s03 /]# kubectl get svc
NAME          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
kubernetes    ClusterIP   10.2.0.1     <none>        443/TCP          585d
mcwtest-svc   NodePort    10.2.0.155   <none>        2024:33958/TCP   2d2h
nginx         ClusterIP   None         <none>        80/TCP           415d
[root@mcwk8s03 /]# kubectl get nodes
NAME       STATUS   ROLES    AGE    VERSION
mcwk8s05   Ready    <none>   582d   v1.15.12
mcwk8s06   Ready    <none>   582d   v1.15.12
[root@mcwk8s03 /]# etcdctl ls /
/coreos.com
[root@mcwk8s03 /]# etcdctl ls /coreos.com/
/coreos.com/network
[root@mcwk8s03 /]# etcdctl ls /coreos.com/network/
/coreos.com/network/config
/coreos.com/network/subnets
[root@mcwk8s03 /]# etcdctl ls /coreos.com/network/subnets
/coreos.com/network/subnets/172.17.83.0-24
/coreos.com/network/subnets/172.17.89.0-24
/coreos.com/network/subnets/172.17.9.0-24
[root@mcwk8s03 /]#
The pod has been reachable the whole time. In other words, if the cluster components have problems but an existing pod does not change state (for example, it is not recreated), the pod should keep serving normally. This is only what I verified here; my test may be incomplete and the picture one-sided.
[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:54:20 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:54:23 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 16:54:27 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]#
The pod is now on node 06. One etcd member is still not started, but since the other two are up this should not cause any problem. I had assumed all along that the pod was still on 05 and had not noticed that it had already moved to 06, as the following output shows
[root@mcwk8s03 /]# kubectl get svc |grep mcwtest
mcwtest-svc   NodePort   10.2.0.155   <none>   2024:33958/TCP   2d2h
[root@mcwk8s03 /]# kubectl get deploy -o wide|grep mcwtest
mcwtest-deploy   1/1   1   1   2d2h   mcwtest   centos   app=mcwpython
[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-kbpdr   1/1   Running   0   4h   172.17.9.12   mcwk8s06   <none>   <none>
[root@mcwk8s03 /]#
Stop flanneld on 06, then restart the container running on it
[root@mcwk8s06 ~]# systemctl stop flanneld
[root@mcwk8s06 ~]# docker ps|grep mcwtest
04f4ddf6c5ef   5d0da3dc9764   "sh -c 'echo 123 >>/…"   4 hours ago   Up 4 hours   k8s_mcwtest_mcwtest-deploy-6465665557-kbpdr_default_569facd2-279d-4d87-b8c6-1edbea1015c4_0
02aae29ff065   registry.cn-hangzhou.aliyuncs.com/google-containers/pause-amd64:3.0   "/pause"   4 hours ago   Up 4 hours   k8s_POD_mcwtest-deploy-6465665557-kbpdr_default_569facd2-279d-4d87-b8c6-1edbea1015c4_0
[root@mcwk8s06 ~]# docker restart 04f4
04f4
[root@mcwk8s06 ~]#
The pod still serves requests normally. Note that only the application container was restarted; the pause container, which holds the pod's network namespace and IP, was left running.
[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:06:19 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:06:21 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:06:24 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# date
Fri Jun 7 01:07:33 CST 2024
[root@mcwk8s05 ~]#
Now restore etcd completely, that is, start the member on node 05 as well. flanneld is stopped on the master and on every node; kube-proxy and kubelet are running normally on the nodes
[root@mcwk8s03 /]# kubectl get cs
NAME                 STATUS    MESSAGE             ERROR
scheduler            Healthy   ok
controller-manager   Healthy   ok
etcd-0               Healthy   {"health":"true"}
etcd-1               Healthy   {"health":"true"}
etcd-2               Healthy   {"health":"true"}
[root@mcwk8s03 /]#
Deleting the pod at this point does not stop it from being recreated: the new pod moves from node 06 to node 05. Right afterwards, however, the service cannot be reached. My first guess was that this is because every flanneld is stopped: existing pods are not affected, but a recreated pod is, and needs flanneld to take part. What exactly flanneld does here I will study in the source code later
[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-kbpdr   1/1   Running   0   4h15m   172.17.9.12   mcwk8s06   <none>   <none>
[root@mcwk8s03 /]# kubectl delete pod mcwtest-deploy-6465665557-kbpdr
pod "mcwtest-deploy-6465665557-kbpdr" deleted
[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-nwbnn   1/1   Running   0   44s   172.17.89.7   mcwk8s05   <none>   <none>
[root@mcwk8s03 /]# kubectl get svc -o wide|grep mcwtest
mcwtest-svc   NodePort   10.2.0.155   <none>   2024:33958/TCP   2d2h   app=mcwpython
[root@mcwk8s03 /]# curl -I 10.0.0.36:33958
curl: (7) Failed connect to 10.0.0.36:33958; Connection refused
[root@mcwk8s03 /]# curl -I 10.0.0.35:33958
curl: (7) Failed connect to 10.0.0.35:33958; Connection refused
[root@mcwk8s03 /]#
But the pod is on 05, and access from 05 still works, so the problem may be in cross-host traffic. Let's see what is going on
[root@mcwk8s05 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:20:52 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:20:59 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:21:07 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]#
It really was unreachable at first, but now it works from 06 as well
[root@mcwk8s06 ~]# curl -I 10.0.0.36:33958
curl: (7) Failed connect to 10.0.0.36:33958; Connection timed out
[root@mcwk8s06 ~]# curl -I 10.0.0.35:33958
curl: (7) Failed connect to 10.0.0.35:33958; Connection timed out
[root@mcwk8s06 ~]# curl -I 10.2.0.155:2024
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:23:08 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s06 ~]# curl -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:23:13 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s06 ~]#
It now works from 03 as well. So was the cause that the ipvs rules were not updated in time, that is, an update delay?
[root@mcwk8s03 /]# curl -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:24:11 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s03 /]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:24:17 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s03 /]#
Let's delete the pod again so that it is recreated, and check whether ipvs has not yet been updated while access is failing. The current pod IP is shown below, and access works
[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-nwbnn   1/1   Running   0   12m   172.17.89.7   mcwk8s05   <none>   <none>
[root@mcwk8s03 /]# kubectl get svc |grep mcwtest
mcwtest-svc   NodePort   10.2.0.155   <none>   2024:33958/TCP   2d3h
[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-nwbnn   1/1   Running   0   12m   172.17.89.7   mcwk8s05   <none>   <none>
[root@mcwk8s03 /]#
[root@mcwk8s05 ~]# ipvsadm -Ln|grep -2 10.0.0.35:33958
  -> 172.17.9.2:9100              Masq    1      0          0
  -> 172.17.89.2:9100             Masq    1      0          0
TCP  10.0.0.35:33958 rr
  -> 172.17.89.7:20000            Masq    1      0          0
TCP  10.0.0.35:46735 rr
[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:29:26 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]#
After the recreate, the ipvs rules on both nodes already point to the new real server (rs), yet access still fails. By my estimate it took about a minute before it worked. Since the rule already targets the new pod and the error is "Connection refused" rather than a timeout, one possibility is that the application inside the new container was not yet listening on port 20000
[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-nwbnn   1/1   Running   0   12m   172.17.89.7   mcwk8s05   <none>   <none>
[root@mcwk8s03 /]# kubectl delete pod mcwtest-deploy-6465665557-nwbnn
pod "mcwtest-deploy-6465665557-nwbnn" deleted
[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-zpfvp   1/1   Running   0   37s   172.17.89.8   mcwk8s05   <none>   <none>
[root@mcwk8s03 /]#
[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
curl: (7) Failed connect to 10.0.0.35:33958; Connection refused
[root@mcwk8s05 ~]# ipvsadm -Ln|grep -2 10.0.0.35:33958
  -> 172.17.9.2:9100              Masq    1      0          0
  -> 172.17.89.2:9100             Masq    1      0          0
TCP  10.0.0.35:33958 rr
  -> 172.17.89.8:20000            Masq    1      0          1
TCP  10.0.0.35:46735 rr
[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
curl: (7) Failed connect to 10.0.0.35:33958; Connection refused
[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
curl: (7) Failed connect to 10.0.0.35:33958; Connection refused
Now stop kube-proxy on both nodes. In theory, when the pod is recreated kubelet still creates the new container, but the ipvs rules should not be refreshed
[root@mcwk8s05 ~]# systemctl stop kube-proxy
[root@mcwk8s05 ~]#
[root@mcwk8s06 ~]# systemctl stop kube-proxy
[root@mcwk8s06 ~]#
Checks before deleting the pod
[root@mcwk8s03 /]# kubectl get svc|grep mcwtest
mcwtest-svc   NodePort   10.2.0.155   <none>   2024:33958/TCP   2d3h
[root@mcwk8s03 /]# kubectl get pod|grep mcwtest
mcwtest-deploy-6465665557-zpfvp   1/1   Running   0   9m25s
[root@mcwk8s03 /]#
[root@mcwk8s05 ~]# ipvsadm -Ln|grep -2 10.0.0.35:33958
  -> 172.17.9.2:9100              Masq    1      0          0
  -> 172.17.89.2:9100             Masq    1      0          0
TCP  10.0.0.35:33958 rr
  -> 172.17.89.8:20000            Masq    1      0          0
TCP  10.0.0.35:46735 rr
[root@mcwk8s05 ~]#
Delete the pod so that it is recreated
[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-zpfvp   1/1   Running   0   37s   172.17.89.8   mcwk8s05   <none>   <none>
[root@mcwk8s03 /]# kubectl delete pod mcwtest-deploy-6465665557-zpfvp
pod "mcwtest-deploy-6465665557-zpfvp" deleted
[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-qc9t2   1/1   Running   0   112s   172.17.89.7   mcwk8s05   <none>   <none>
[root@mcwk8s03 /]#
A few minutes later, the rules on both nodes still have not been changed to the new rs
[root@mcwk8s05 ~]# ipvsadm -Ln|grep -2 10.0.0.35:33958
  -> 172.17.9.2:9100              Masq    1      0          0
  -> 172.17.89.2:9100             Masq    1      0          0
TCP  10.0.0.35:33958 rr
  -> 172.17.89.8:20000            Masq    1      0          0
TCP  10.0.0.35:46735 rr
[root@mcwk8s05 ~]#
Access certainly fails at this point, because the rule still forwards to the old pod IP 172.17.89.8
[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
curl: (7) Failed connect to 10.0.0.35:33958; Connection timed out
[root@mcwk8s05 ~]# curl -I 10.0.0.36:33958
curl: (7) Failed connect to 10.0.0.36:33958; Connection timed out
[root@mcwk8s05 ~]#
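Two different failures show up in this walkthrough, and curl reports both with the same exit code 7, so the message text is what tells them apart. "Connection refused" means something actively rejected the connection, while "Connection timed out" means nothing answered at all, which fits a rule that still forwards to an address no pod owns any more. A tiny sketch that labels a captured result; classify_probe is a made-up name and the labels are my own.

```shell
# Hypothetical helper: label one captured curl result (headers or error text).
# $1 = the text curl printed
classify_probe() {
  case "$1" in
    *"Connection refused"*)   echo refused ;;   # actively rejected
    *"Connection timed out"*) echo timeout ;;   # no answer at all
    *"200 OK"*)               echo ok ;;
    *)                        echo unknown ;;
  esac
}

# Usage (not executed here): classify_probe "$(curl -I 10.0.0.35:33958 2>&1)"
```

Seeing refused right after a recreate and timeout with kube-proxy stopped suggests these are two separate problems rather than one.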
Look at the Service: its endpoint matches the new pod and was updated to the new IP promptly
[root@mcwk8s03 /]# kubectl get pod -o wide|grep mcwtest
mcwtest-deploy-6465665557-qc9t2   1/1   Running   0   112s   172.17.89.7   mcwk8s05   <none>   <none>
[root@mcwk8s03 /]# kubectl get svc|grep mcwtest
mcwtest-svc   NodePort   10.2.0.155   <none>   2024:33958/TCP   2d3h
[root@mcwk8s03 /]# kubectl describe svc mcwtest-svc
Name:                     mcwtest-svc
Namespace:                default
Labels:                   <none>
Annotations:              kubectl.kubernetes.io/last-applied-configuration:
                            {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"name":"mcwtest-svc","namespace":"default"},"spec":{"ports":[{"name":"mcw...
Selector:                 app=mcwpython
Type:                     NodePort
IP:                       10.2.0.155
Port:                     mcwport  2024/TCP
TargetPort:               20000/TCP
NodePort:                 mcwport  33958/TCP
Endpoints:                172.17.89.7:20000
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>
[root@mcwk8s03 /]#
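In the output above, IP and Port are the cluster IP and the port on the cluster IP. TargetPort is the port inside the container, and Endpoints is the container (pod) IP plus that container port. NodePort is the port opened on each node. The port name (mcwport) is one we defined ourselves and has nothing to do with the names of the Deployment, pod or other resources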
IP: 10.2.0.155
Port: mcwport 2024/TCP
Once kube-proxy is started, the rule is updated to the new rs immediately.
[root@mcwk8s06 ~]# ipvsadm -Ln|grep -2 10.0.0.36:33958
  -> 172.17.9.2:9100              Masq    1      0          0
  -> 172.17.89.2:9100             Masq    1      0          0
TCP  10.0.0.36:33958 rr
  -> 172.17.89.8:20000            Masq    1      0          0
TCP  10.0.0.36:46735 rr
[root@mcwk8s06 ~]# systemctl start kube-proxy
[root@mcwk8s06 ~]# ipvsadm -Ln|grep -2 10.0.0.36:33958
  -> 172.17.9.2:9100              Masq    1      0          0
  -> 172.17.89.2:9100             Masq    1      0          0
TCP  10.0.0.36:33958 rr
  -> 172.17.89.7:20000            Masq    1      0          0
TCP  10.0.0.36:46735 rr
[root@mcwk8s06 ~]#
At this point kube-proxy on 05 has not been started yet, while the one on 06 has. So the ipvs rs on 05 is still stale and the one on 06 is updated: the pod cannot be reached through 05 but can be reached through 06
[root@mcwk8s05 ~]# curl -I 10.0.0.35:33958
curl: (7) Failed connect to 10.0.0.35:33958; Connection timed out
[root@mcwk8s05 ~]# curl -I 10.0.0.36:33958
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.18
Date: Thu, 06 Jun 2024 17:58:40 GMT
Content-type: text/html; charset=ANSI_X3.4-1968
Content-Length: 816
[root@mcwk8s05 ~]#
The experiment is finished; restore the environment to normal.
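Summary of this section:
If the network is unreachable, check the following
1. Check whether the Service's endpoints match the pod IP.
2. Check whether the ipvs rules match the pod. If they do not, check whether kube-proxy is running normally and whether it needs a restart; restarting it does not seem to affect existing pods.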
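The two checks above can be combined into one comparison. The following is a minimal sketch, assuming the `ipvsadm -Ln` and `kubectl describe svc` output formats shown earlier in this post. All three function names are hypothetical, and `kubectl describe` shortens long endpoint lists, so this only suits services with a few endpoints.

```shell
# Hypothetical helpers: compare the real servers held by ipvs with the
# endpoints reported by the Service.

# stdin: `ipvsadm -Ln` output; $1: virtual service as IP:port
ipvs_backends() {
  awk -v vs="$1" '
    $1 == "TCP" || $1 == "UDP" { in_vs = ($2 == vs); next }
    in_vs && $1 == "->"        { print $2 }
  ' | sort | paste -sd, -
}

# stdin: `kubectl describe svc NAME` output
svc_endpoints() {
  awk '$1 == "Endpoints:" { print $2 }' | tr ',' '\n' | sort | paste -sd, -
}

# $1: ipvsadm text, $2: describe text, $3: virtual service as IP:port
check_service() {
  have=$(printf '%s\n' "$1" | ipvs_backends "$3")
  want=$(printf '%s\n' "$2" | svc_endpoints)
  if [ "$have" = "$want" ]; then
    echo "in sync: $have"
  else
    echo "STALE: ipvs=[$have] endpoints=[$want]"
    return 1
  fi
}

# Usage on a node that also has kubectl configured (not executed here):
#   check_service "$(ipvsadm -Ln)" "$(kubectl describe svc mcwtest-svc)" 10.0.0.35:33958
```

A STALE result while kube-proxy is stopped matches what was observed above; if it persists with kube-proxy running, the kube-proxy logs are the next place to look.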
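In conclusion:
Creating and destroying containers needs the Kubernetes components, the network plugin and so on. For containers that already exist, however, the network is already set up on the host. Stopping the Kubernetes components and the network plugin does not affect communication between containers on the same host, nor cross-host communication for those containers. In other words, existing container services can still be reached through clusterip:port, nodeip:nodeport and similar paths. Also, after I stopped flannel here and started it again, the host's container subnet was still the original one, the container gateway was unchanged, and the subnets stored in etcd were unchanged. So restarting the network plugin did not change the subnet of existing containers. That leaves a question: why did restarting the flannel service on an earlier occasion change the host's container subnet, so that the containers had to be restarted as well to match the host's new subnet?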
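Note: as I understand it, for cross-host container traffic flanneld may need to query etcd to find the IP of the host that owns the destination container subnet, then encapsulate the packet and send it over the physical NIC. flannel.1 acts as the tunnel device, the gateway is docker0, docker0 in turn follows the default route to eth0, and the packets are then carried between hosts, which is how routing and forwarding between hosts is achieved. But once flanneld is stopped, who looks up the container subnet, or how is the host IP for the destination subnet found? How is that mapping resolved? I will research this and add to it when I have time. Some of these questions are covered in the link below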
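A guess from some later day: might the host cache the routing information? Is that why containers can still communicate across hosts even when etcd and flannel are down?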
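One way to test that guess is to look at what the kernel itself holds while flanneld is stopped. As far as I understand flannel's vxlan backend (an assumption here, based on the flannel.1 device mentioned above), flanneld only programs routes and neighbor entries into the kernel, and the kernel does the forwarding, so entries that are already installed would keep working until something removes them. A small sketch that filters `ip route` output for routes going through the tunnel device; flannel_routes is a hypothetical name.

```shell
# Hypothetical helper: list routes that leave through the flannel device.
# stdin: `ip route` output; $1: device name (assumed default flannel.1)
flannel_routes() {
  awk -v dev="${1:-flannel.1}" '
    {
      via = "-"; d = ""
      for (i = 2; i < NF; i++) {
        if ($i == "via") via = $(i + 1)
        if ($i == "dev") d = $(i + 1)
      }
      if (d == dev) print $1, "via", via
    }'
}

# Usage with flanneld stopped (not executed here):
#   ip route | flannel_routes
#   bridge fdb show dev flannel.1    # remote host IP per tunnel MAC
#   ip neigh show dev flannel.1      # remote tunnel IP to MAC mapping
```

If the routes for the other nodes' subnets are still listed after flanneld has been stopped, that would support the idea that existing cross-host traffic relies on kernel state rather than on a running flanneld or a reachable etcd.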
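A question that follows from this, to be verified later: when a tunnel network is created and used for transport, is the traffic actually sent using the physical NIC's IP?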