A bug in Calico 3.16.0 causes Pods to fail to start

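I was recently setting up a new version of Kubernetes for validation. After installing the Calico network plugin I deployed an application, but the Pod failed to start with a CNI error. The details are as follows: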

Mar 11 10:23:40 SZX-xxxxxx kubelet: E0311 10:23:40:849284 32339 cni.go:366] Error adding kube-system_coredns -xxxx-xxxx/xxxxxxxxxxxxxxxxxxx to network calico/k8s-pod-network: resource does not exist: Node (szx-xxxxxx) with error: nodes (szx-xxxxxx) not found

 

 

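The error is fairly clear: the datastore has no node named szx-xxxxxx. That is because our kubelet runs with the --hostname-override flag, which changes the node name to the node's IP, so the node can be reached directly without DNS resolution of its name.

kubectl get node confirms that the node name is the IP rather than the hostname.

So I checked /etc/cni/net.d/10-calico.conflist, and its nodename field was indeed szx-xxxxxx. This file is generated by the install-cni container, so something must have gone wrong during that initialization.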
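The check above can also be done programmatically. The following is a minimal sketch, not part of Calico: it parses a conflist and returns the nodename of the calico plugin entry. The function name calicoNodename and the inline sample are hypothetical; on a real node you would read the bytes from /etc/cni/net.d/10-calico.conflist instead.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// confList mirrors only the conflist fields this check needs.
type confList struct {
	Plugins []struct {
		Type     string `json:"type"`
		Nodename string `json:"nodename"`
	} `json:"plugins"`
}

// calicoNodename (hypothetical helper) returns the nodename configured
// for the plugin entry whose type is "calico".
func calicoNodename(data []byte) (string, error) {
	var c confList
	if err := json.Unmarshal(data, &c); err != nil {
		return "", err
	}
	for _, p := range c.Plugins {
		if p.Type == "calico" {
			return p.Nodename, nil
		}
	}
	return "", fmt.Errorf("no calico plugin entry found")
}

func main() {
	// Sample content standing in for /etc/cni/net.d/10-calico.conflist.
	sample := []byte(`{"name":"k8s-pod-network","plugins":[` +
		`{"type":"calico","nodename":"szx-xxxxxx"},{"type":"portmap"}]}`)

	name, err := calicoNodename(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("calico nodename:", name)
}
```

If the printed value differs from the name shown by kubectl get node, the CNI plugin will look up a node that does not exist.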

 

One thing worth mentioning: I installed Calico directly from the official YAML manifest and had not customized any parameters yet.

curl https://docs.projectcalico.org/manifests/calico.yaml -O

kubectl apply -f calico.yaml

 

Looking at the YAML for the install-cni container, the config file 10-calico.conflist appears to be generated from the cni_network_config key in the calico-config ConfigMap.

          env:
            # Name of the CNI config file to create.
            - name: CNI_CONF_NAME
              value: "10-calico.conflist"
            # The CNI network config to install on each node.
            - name: CNI_NETWORK_CONFIG
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: cni_network_config

 

Now look at the ConfigMap: the value of nodename is __KUBERNETES_NODE_NAME__.

 

kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config
  namespace: kube-system
data:
  # Typha is disabled.
  typha_service_name: "none"
  # Configure the backend to use.
  calico_backend: "bird"

  # Configure the MTU to use for workload interfaces and tunnels.
  # By default, MTU is auto-detected, and explicitly setting this field should not be required.
  # You can override auto-detection by providing a non-zero value.
  veth_mtu: "0"

  # The CNI network configuration to install on each node. The special
  # values in this config will be automatically populated.
  cni_network_config: |-
    {
      "name": "k8s-pod-network",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "calico",
          "log_level": "info",
          "log_file_path": "/var/log/calico/cni/cni.log",
          "datastore_type": "kubernetes",
          "nodename": "__KUBERNETES_NODE_NAME__",
          "mtu": __CNI_MTU__,
          "ipam": {
              "type": "calico-ipam"
          },
          "policy": {
              "type": "k8s"
          },
          "kubernetes": {
              "kubeconfig": "__KUBECONFIG_FILEPATH__"
          }
        },
        {
          "type": "portmap",
          "snat": true,
          "capabilities": {"portMappings": true}
        },
        {
          "type": "bandwidth",
          "capabilities": {"bandwidth": true}
        }
      ]
    }

 

So how does this value turn into szx-xxxxxx? To find out, we need to see how the Calico source code handles it.

 

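Viewing https://github.com/projectcalico/cni-plugin/blob/v3.16.0/pkg/install/install.go directly on GitHub, line 7 of the snippet below shows that __KUBERNETES_NODE_NAME__ is replaced with the value of the NODENAME environment variable. If that environment variable is not set, the variable nodename is used instead.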

1     netconf = strings.Replace(netconf, "__LOG_LEVEL__", getEnv("LOG_LEVEL", "info"), -1)
2     netconf = strings.Replace(netconf, "__LOG_FILE_PATH__", getEnv("LOG_FILE_PATH", "/var/log/calico/cni/cni.log"), -1)
3     netconf = strings.Replace(netconf, "__LOG_FILE_MAX_SIZE__", getEnv("LOG_FILE_MAX_SIZE", "100"), -1)
4     netconf = strings.Replace(netconf, "__LOG_FILE_MAX_AGE__", getEnv("LOG_FILE_MAX_AGE", "30"), -1)
5     netconf = strings.Replace(netconf, "__LOG_FILE_MAX_COUNT__", getEnv("LOG_FILE_MAX_COUNT", "10"), -1)
6     netconf = strings.Replace(netconf, "__DATASTORE_TYPE__", getEnv("DATASTORE_TYPE", "kubernetes"), -1)
7     netconf = strings.Replace(netconf, "__KUBERNETES_NODE_NAME__", getEnv("NODENAME", nodename), -1)
8     netconf = strings.Replace(netconf, "__KUBECONFIG_FILEPATH__", kubeconfigPath, -1)
9     netconf = strings.Replace(netconf, "__CNI_MTU__", getEnv("CNI_MTU", "1500"), -1)

 

In the definition of the install-cni container, only the KUBERNETES_NODE_NAME environment variable is set. There is no NODENAME.

            # Set the hostname based on the k8s node name.
            - name: KUBERNETES_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName

 

The nodename variable, in turn, is obtained from the hostname, which explains why nodename in our 10-calico.conflist is szx-xxxxxx.

    // Perform replacements of variables.
    nodename, err := names.Hostname()
    if err != nil {
        log.Fatal(err)
    }
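To make the fallback concrete, here is a small sketch that mimics the environment of the install-cni container. The getEnv function is an assumed re-implementation (the excerpt does not show the real one), os.Hostname is only a stand-in for names.Hostname, which may normalise the name differently, and 10.0.0.1 is a hypothetical node IP.

```go
package main

import (
	"fmt"
	"os"
)

// getEnv is an assumed re-implementation of the helper used in install.go:
// it returns the variable's value when it is set, otherwise the fallback.
func getEnv(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return fallback
}

func main() {
	// Mimic the install-cni container: only KUBERNETES_NODE_NAME is set.
	os.Setenv("KUBERNETES_NODE_NAME", "10.0.0.1") // hypothetical node IP
	os.Unsetenv("NODENAME")

	hostname, err := os.Hostname() // stand-in for names.Hostname()
	if err != nil {
		fmt.Println("error:", err)
		return
	}

	// v3.16.0 looks up NODENAME, which is unset, so the hostname wins
	// and the value of KUBERNETES_NODE_NAME is never consulted.
	resolved := getEnv("NODENAME", hostname)
	fmt.Println("resolved to hostname:", resolved == hostname)
}
```

Because NODENAME is unset, the program reports that the resolved value is the hostname, regardless of what KUBERNETES_NODE_NAME contains.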

 

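I looked back at the ConfigMaps of older Calico versions, and nodename has always been set to __KUBERNETES_NODE_NAME__. So why did it stop working in 3.16.0?

Comparing the install-cni container configuration between versions, v3.14.2 performs the installation with a shell script, whereas v3.16.0 uses a program written in Go. My guess was that the new version had a bug.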

https://github.com/projectcalico/cni-plugin/blob/v3.14.2/k8s-install/scripts/install-cni.sh

 

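That made me wonder whether a later 3.16 patch release had fixed it. I checked the source of the latest release, 3.16.9, and the code has indeed been changed: the substitution now reads KUBERNETES_NODE_NAME instead of NODENAME.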

https://github.com/projectcalico/cni-plugin/blob/v3.16.9/pkg/install/install.go

 

    netconf = strings.Replace(netconf, "__LOG_LEVEL__", getEnv("LOG_LEVEL", "info"), -1)
    netconf = strings.Replace(netconf, "__LOG_FILE_PATH__", getEnv("LOG_FILE_PATH", "/var/log/calico/cni/cni.log"), -1)
    netconf = strings.Replace(netconf, "__LOG_FILE_MAX_SIZE__", getEnv("LOG_FILE_MAX_SIZE", "100"), -1)
    netconf = strings.Replace(netconf, "__LOG_FILE_MAX_AGE__", getEnv("LOG_FILE_MAX_AGE", "30"), -1)
    netconf = strings.Replace(netconf, "__LOG_FILE_MAX_COUNT__", getEnv("LOG_FILE_MAX_COUNT", "10"), -1)
    netconf = strings.Replace(netconf, "__DATASTORE_TYPE__", getEnv("DATASTORE_TYPE", "kubernetes"), -1)
    netconf = strings.Replace(netconf, "__KUBERNETES_NODE_NAME__", getEnv("KUBERNETES_NODE_NAME", nodename), -1)
    netconf = strings.Replace(netconf, "__KUBECONFIG_FILEPATH__", kubeconfigPath, -1)
    netconf = strings.Replace(netconf, "__CNI_MTU__", getEnv("CNI_MTU", "1500"), -1)
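The difference between the two releases comes down to which environment key is consulted before falling back to the hostname. The sketch below compares them side by side; nodeNameFor, the env map, and the IP 10.0.0.1 are hypothetical and only model the lookup, they are not Calico code.

```go
package main

import "fmt"

// nodeNameFor (hypothetical) resolves the placeholder value using envKey,
// falling back to hostname when that key is absent from env.
func nodeNameFor(envKey string, env map[string]string, hostname string) string {
	if v, ok := env[envKey]; ok {
		return v
	}
	return hostname
}

func main() {
	// The manifest sets only KUBERNETES_NODE_NAME (from spec.nodeName).
	env := map[string]string{"KUBERNETES_NODE_NAME": "10.0.0.1"}
	hostname := "szx-xxxxxx"

	// v3.16.0 consults NODENAME; v3.16.9 consults KUBERNETES_NODE_NAME.
	fmt.Println("v3.16.0:", nodeNameFor("NODENAME", env, hostname))
	fmt.Println("v3.16.9:", nodeNameFor("KUBERNETES_NODE_NAME", env, hostname))
}
```

With this input the v3.16.0 lookup yields szx-xxxxxx while the v3.16.9 lookup yields 10.0.0.1, which matches the node name registered via --hostname-override.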

 

So I simply changed the image version to the latest, v3.16.9, and the problem was solved.

posted @ 2021-03-14 09:32 by 雨后彩虹,如此绚烂