What a calico-kube-controllers Pod stuck in creating led to

Background:

The course code is all written for the amd64 architecture, which means my main arm64 machine cannot run the labs, so I had to do them on an x64 machine. That meant standing up a K8S environment first (setup steps omitted here). After I ran kubeadm init, image pulls kept failing. I had left the image repository pointing at the official K8S registry rather than a domestic mirror (domestic mirrors are updated too slowly, and some images simply cannot be found on them). I then turned on a proxy on my host in global mode, but the cluster still could not pull images and kept timing out.

So I put the proxy (traffic-forwarding) configuration into the containerd.service file, as shown below:

root@Y76-Master01-16-181:~# cat /usr/lib/systemd/system/containerd.service 
# Copyright The containerd Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target

[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/containerd

Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5

# Add the following three lines
Environment="HTTPS_PROXY=http://172.164.17.103:9999"
Environment="HTTP_PROXY=http://172.164.17.103:9999"
Environment="ALL_PROXY=socks5://172.164.17.103:9999"

# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
LimitNOFILE=infinity
# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target
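After editing the unit file, containerd needs a daemon-reload and a restart before the new environment takes effect; on a single node that is simply (the ansible variant further below does the same across all nodes):

root@Y76-Master01-16-181:~# systemctl daemon-reload && systemctl restart containerd.service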

After adding the proxy and running kubeadm init again, the K8S images were pulled to the local node successfully, which can be verified with crictl:

root@Y76-Master01-16-181:~# crictl -r unix:///var/run/containerd/containerd.sock images
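Incidentally, the runtime endpoint can be written to /etc/crictl.yaml once so that -r does not have to be passed every time; a sketch matching this setup's socket path:

root@Y76-Master01-16-181:~# cat > /etc/crictl.yaml <<'EOF'
runtime-endpoint: unix:///var/run/containerd/containerd.sock
image-endpoint: unix:///var/run/containerd/containerd.sock
EOF
root@Y76-Master01-16-181:~# crictl images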

At this point every cluster component was in the Running state, so I went on to deploy calico. All the calico-node Pods reached Running, but the calico-kube-controllers Pod stayed stuck in creating, so I looked at its details:

root@Y76-Master01-16-181:~# kubectl describe pod -n kube-system calico-kube-controllers-9449c44c5-v8ssv 


Normal   Scheduled               72s    default-scheduler  Successfully assigned kube-system/calico-kube-controllers-57b57c56f-wz4wm to y76-node01-16-182
Warning  FailedCreatePodSandBox  52s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "c0805304ad1009d138d00cad8b5a4d9ddfdd27b8d6a8a886d4df4690cace4452": plugin type="calico" failed (add): error getting ClusterInformation: Get "https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": net/http: TLS handshake timeout
Normal   SandboxChanged          5s (x3 over 51s)  kubelet            Pod sandbox changed, it will be killed and re-created.

At that point I went through a series of checks but could not resolve the calico-kube-controllers problem; even Pods I created afresh failed with exactly the error shown above. Completely stuck, I restored the VMs to their earlier snapshot, filled in a domestic mirror registry instead, ran kubeadm init again, and this time every component Pod came up Running.
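In hindsight, the clue was already in the event above: 10.96.0.1 is the cluster-internal kubernetes Service IP, and a request to it that gets routed through the host proxy can only time out. A quick way to check this from a node (a diagnostic sketch using this cluster's proxy address; the apiserver may answer /version with a 401/403, which is enough to show the handshake works):

root@Y76-Master01-16-181:~# HTTPS_PROXY=http://172.164.17.103:9999 curl -k --max-time 5 https://10.96.0.1/version
root@Y76-Master01-16-181:~# curl -k --max-time 5 https://10.96.0.1/version

The first call hangs and times out because the proxy has no route to the Service CIDR; the second completes the TLS handshake directly via kube-proxy on the node.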

Reflection:

A different registry address alone should not produce two different outcomes. That brought me back to the proxy I had set up at the very beginning (the one configured in containerd.service). So I restored the VM snapshot once more, pointed kubeadm back at the official K8S registry, hit the same problem during kubeadm init, and then changed the containerd.service configuration to the following:

root@Y76-Master01-16-181:~# cat /usr/lib/systemd/system/containerd.service 
# Copyright The containerd Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target

[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/containerd

Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5

# Adjusted to the following
Environment="HTTPS_PROXY=http://172.164.17.103:9999"
Environment="HTTP_PROXY=http://172.164.17.103:9999"
Environment="NO_PROXY=localhost,127.0.0.1,172.16.0.0/12,10.96.0.0/12,10.244.0.0/16"


# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
LimitNOFILE=infinity
# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target
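Incidentally, the same settings could be kept in a systemd drop-in instead of editing the vendor unit file, so they survive a containerd package upgrade; a sketch (the drop-in file name is arbitrary):

root@Y76-Master01-16-181:~# mkdir -p /etc/systemd/system/containerd.service.d
root@Y76-Master01-16-181:~# cat > /etc/systemd/system/containerd.service.d/http-proxy.conf <<'EOF'
[Service]
Environment="HTTPS_PROXY=http://172.164.17.103:9999"
Environment="HTTP_PROXY=http://172.164.17.103:9999"
Environment="NO_PROXY=localhost,127.0.0.1,172.16.0.0/12,10.96.0.0/12,10.244.0.0/16"
EOF
root@Y76-Master01-16-181:~# systemctl daemon-reload && systemctl restart containerd.service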


Then sync the configuration to the other nodes and restart the containerd service:

root@Y76-Master01-16-181:~# ansible all -m copy -a "src=/usr/lib/systemd/system/containerd.service  dest=/usr/lib/systemd/system/containerd.service"

root@Y76-Master01-16-181:~# ansible all -m shell -a "systemctl daemon-reload && systemctl restart containerd.service "
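Whether the new environment actually reached every node can be verified with the same ansible pattern (a sketch):

root@Y76-Master01-16-181:~# ansible all -m shell -a "systemctl show containerd.service --property=Environment"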

Sure enough, every Pod was now healthy:

root@Y76-Master01-16-181:~# kubectl get pod -n kube-system 
NAME                                          READY   STATUS    RESTARTS       AGE
calico-kube-controllers-9449c44c5-v8ssv       1/1     Running   0              92m
calico-node-97qbc                             1/1     Running   3 (38m ago)    6h1m
calico-node-bl59h                             1/1     Running   2 (178m ago)   6h1m
calico-node-rzzq7                             1/1     Running   2 (178m ago)   6h1m
coredns-567c556887-8knp9                      1/1     Running   3 (51m ago)    8h
coredns-567c556887-dwg6d                      1/1     Running   2 (178m ago)   8h
etcd-y76-master01-16-181                      1/1     Running   3 (178m ago)   8h
kube-apiserver-y76-master01-16-181            1/1     Running   3 (178m ago)   8h
kube-controller-manager-y76-master01-16-181   1/1     Running   6 (46m ago)    5h46m
kube-proxy-88nd6                              1/1     Running   2 (178m ago)   5h47m
kube-proxy-vrgtp                              1/1     Running   2 (178m ago)   5h47m
kube-proxy-z5jmc                              1/1     Running   2 (178m ago)   5h47m
kube-scheduler-y76-master01-16-181            1/1     Running   6 (46m ago)    8h

Summary:

When troubleshooting, retrace the operations you performed and think through what side effects each one could have. In this case I had handed the proxy role to my host, so the processes containerd launches (including the calico CNI plugin that produced the error above) forwarded their traffic through that host proxy. But the addresses they need to reach are exactly the cluster-internal ranges I defined (the Service CIDR 10.96.0.0/12 and the Pod CIDR 10.244.0.0/16), which the host proxy has no way to reach, so those requests could only time out. Excluding those ranges with NO_PROXY is what made everything work.

 

posted @ 2024-07-07 22:18  Ky150