What a calico-kube-controllers Pod stuck in creating led to

Background:

The course code is all written for the amd64 architecture, which means my main arm64 machine cannot run the labs, so I had to do them on an x64 machine. That meant standing up a K8S environment first (setup steps omitted here). After I ran kubeadm init, image pulls kept failing. I had left the image repository pointing at the official K8S registry rather than a domestic mirror (domestic mirrors are updated too slowly, and some images simply cannot be found on them). I then turned on a proxy on my host in global mode, but the cluster still could not pull images and kept timing out.

So I put the proxy (traffic-forwarding) configuration into the containerd.service file, as shown below:

root@Y76-Master01-16-181:~# cat /usr/lib/systemd/system/containerd.service 
# Copyright The containerd Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target

[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/containerd

Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5

# Add the following three lines
Environment="HTTPS_PROXY=http://172.164.17.103:9999"
Environment="HTTP_PROXY=http://172.164.17.103:9999"
Environment="ALL_PROXY=socks5://172.164.17.103:9999"

# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
LimitNOFILE=infinity
# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target
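After editing the unit file, containerd needs a daemon-reload and a restart before the new environment takes effect; on a single node that is simply (the ansible variant further below does the same across all nodes):

root@Y76-Master01-16-181:~# systemctl daemon-reload && systemctl restart containerd.service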

After adding the proxy and running kubeadm init again, the K8S images were pulled to the local node successfully, which can be verified with crictl:

root@Y76-Master01-16-181:~# crictl -r unix:///var/run/containerd/containerd.sock images
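Incidentally, the runtime endpoint can be written to /etc/crictl.yaml once so that -r does not have to be passed every time; a sketch matching this setup's socket path:

root@Y76-Master01-16-181:~# cat > /etc/crictl.yaml <<'EOF'
runtime-endpoint: unix:///var/run/containerd/containerd.sock
image-endpoint: unix:///var/run/containerd/containerd.sock
EOF
root@Y76-Master01-16-181:~# crictl images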

At this point every cluster component was in the Running state, so I went on to deploy calico. All the calico-node Pods reached Running, but the calico-kube-controllers Pod stayed stuck in creating, so I looked at its details:

root@Y76-Master01-16-181:~# kubectl describe pod -n kube-system calico-kube-controllers-9449c44c5-v8ssv 


Normal   Scheduled               72s    default-scheduler  Successfully assigned kube-system/calico-kube-controllers-57b57c56f-wz4wm to y76-node01-16-182
Warning  FailedCreatePodSandBox  52s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "c0805304ad1009d138d00cad8b5a4d9ddfdd27b8d6a8a886d4df4690cace4452": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": net/http: TLS handshake timeout
Normal   SandboxChanged          5s (x3 over 51s)  kubelet            Pod sandbox changed, it will be killed and re-created.

At that point I went through a series of checks but could not resolve the calico-kube-controllers problem; even Pods I created afresh failed with exactly the error shown above. Completely stuck, I restored the VMs to their earlier snapshot, filled in a domestic mirror registry instead, ran kubeadm init again, and this time every component Pod came up Running.
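In hindsight, the clue was already in the event above: 10.96.0.1 is the cluster-internal kubernetes Service IP, and a request to it that gets routed through the host proxy can only time out. A quick way to check this from a node (a diagnostic sketch using this cluster's proxy address; the apiserver may answer /version with a 401/403, which is enough to show the handshake works):

root@Y76-Master01-16-181:~# HTTPS_PROXY=http://172.164.17.103:9999 curl -k --max-time 5 https://10.96.0.1/version
root@Y76-Master01-16-181:~# curl -k --max-time 5 https://10.96.0.1/version

The first call hangs and times out because the proxy has no route to the Service CIDR; the second completes the TLS handshake directly via kube-proxy on the node.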

Reflection:

A different registry address alone should not produce two different outcomes. That brought me back to the proxy I had set up at the very beginning (the one configured in containerd.service). So I restored the VM snapshot once more, pointed kubeadm back at the official K8S registry, hit the same problem during kubeadm init, and then changed the containerd.service configuration to the following:

root@Y76-Master01-16-181:~# cat /usr/lib/systemd/system/containerd.service 
# Copyright The containerd Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target

[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/containerd

Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5

# Adjusted to the following
Environment="HTTPS_PROXY=http://172.164.17.103:9999"
Environment="HTTP_PROXY=http://172.164.17.103:9999"
Environment="NO_PROXY=localhost,127.0.0.1,172.16.0.0/12,10.96.0.0/12,10.244.0.0/16"


# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
LimitNOFILE=infinity
# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target
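Incidentally, the same settings could be kept in a systemd drop-in instead of editing the vendor unit file, so they survive a containerd package upgrade; a sketch (the drop-in file name is arbitrary):

root@Y76-Master01-16-181:~# mkdir -p /etc/systemd/system/containerd.service.d
root@Y76-Master01-16-181:~# cat > /etc/systemd/system/containerd.service.d/http-proxy.conf <<'EOF'
[Service]
Environment="HTTPS_PROXY=http://172.164.17.103:9999"
Environment="HTTP_PROXY=http://172.164.17.103:9999"
Environment="NO_PROXY=localhost,127.0.0.1,172.16.0.0/12,10.96.0.0/12,10.244.0.0/16"
EOF
root@Y76-Master01-16-181:~# systemctl daemon-reload && systemctl restart containerd.service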


Then sync the configuration to the other nodes and restart the containerd service:

root@Y76-Master01-16-181:~# ansible all -m copy -a "src=/usr/lib/systemd/system/containerd.service  dest=/usr/lib/systemd/system/containerd.service"

root@Y76-Master01-16-181:~# ansible all -m shell -a "systemctl daemon-reload && systemctl restart containerd.service "
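Whether the new environment actually reached every node can be verified with the same ansible pattern (a sketch):

root@Y76-Master01-16-181:~# ansible all -m shell -a "systemctl show containerd.service --property=Environment"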

Sure enough, every Pod was now healthy:

root@Y76-Master01-16-181:~# kubectl get pod -n kube-system 
NAME                                          READY   STATUS    RESTARTS       AGE
calico-kube-controllers-9449c44c5-v8ssv       1/1     Running   0              92m
calico-node-97qbc                             1/1     Running   3 (38m ago)    6h1m
calico-node-bl59h                             1/1     Running   2 (178m ago)   6h1m
calico-node-rzzq7                             1/1     Running   2 (178m ago)   6h1m
coredns-567c556887-8knp9                      1/1     Running   3 (51m ago)    8h
coredns-567c556887-dwg6d                      1/1     Running   2 (178m ago)   8h
etcd-y76-master01-16-181                      1/1     Running   3 (178m ago)   8h
kube-apiserver-y76-master01-16-181            1/1     Running   3 (178m ago)   8h
kube-controller-manager-y76-master01-16-181   1/1     Running   6 (46m ago)    5h46m
kube-proxy-88nd6                              1/1     Running   2 (178m ago)   5h47m
kube-proxy-vrgtp                              1/1     Running   2 (178m ago)   5h47m
kube-proxy-z5jmc                              1/1     Running   2 (178m ago)   5h47m
kube-scheduler-y76-master01-16-181            1/1     Running   6 (46m ago)    8h

Summary:

When troubleshooting, retrace the operations you performed and think through what side effects each one could have. In this case I had handed the proxy role to my host, so the processes containerd launches (including the calico CNI plugin that produced the error above) forwarded their traffic through that host proxy. But the addresses they need to reach are exactly the cluster-internal ranges I defined (the Service CIDR 10.96.0.0/12 and the Pod CIDR 10.244.0.0/16), which the host proxy has no way to reach, so those requests could only time out. Excluding those ranges with NO_PROXY is what made everything work.

 

posted @ 2024-07-07 22:18  Ky150