Notes on pitfalls: deploying Calico in an existing cluster

The existing cluster uses Docker's default bridge network model, which does not support cross-node communication, so we deploy the Calico network plugin. The kubelet network model also needs to be switched to CNI (--network-plugin=cni). The Calico website (https://docs.projectcalico.org/getting-started/kubernetes/self-managed-onprem/onpremises) gives the following installation steps:

  1. Download the Calico networking manifest for the Kubernetes API datastore.
curl https://docs.projectcalico.org/manifests/calico.yaml -O
  2. Change the CALICO_IPV4POOL_CIDR field to the pod network CIDR you want to use.
  3. Customize the manifest as needed:
    • CALICO_DISABLE_FILE_LOGGING defaults to true, meaning all logs except the CNI's go through kubectl logs. To read logs from files under /var/log/calico/ instead, set it to false, and also mount the host directory /var/log/calico into the container.
    • BGP_LOGSEVERITYSCREEN sets the BGP log level. The default is info; debug, error, etc. are also accepted.
    • FELIX_LOGSEVERITYSCREEN sets Felix's log level.
  4. Apply the manifest using the following command.
kubectl apply -f calico.yaml
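As a side note, the kubelet switch to CNI mentioned at the top looks roughly like the following. The config file path is an assumption (this cluster keeps its configs under /opt/kubernetes/cfg/, so a kubelet.conf there is my guess), and the two *-dir flags shown are the conventional CNI defaults:

```shell
# /opt/kubernetes/cfg/kubelet.conf (assumed path) -- kubelet flags to add:
--network-plugin=cni \
--cni-conf-dir=/etc/cni/net.d \
--cni-bin-dir=/opt/cni/bin
```

Restart the kubelet afterwards (systemctl restart kubelet) on every node.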

But at the last step, the calico-kube-controllers container failed to start, and the calico-node containers kept restarting. The calico-kube-controllers logs looked like this:

2020-09-29 09:39:55.356 [INFO][1] main.go 88: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W0929 09:39:55.359900       1 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2020-09-29 09:39:55.362 [INFO][1] main.go 109: Ensuring Calico datastore is initialized
2020-09-29 09:39:55.372 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://10.0.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": x509: certificate is valid for 127.0.0.1, 172.171.19.210, not 10.0.0.1
2020-09-29 09:39:55.373 [FATAL][1] main.go 114: Failed to initialize Calico datastore error=Get "https://10.0.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": x509: certificate is valid for 127.0.0.1, 172.171.19.210, not 10.0.0.1

This looked like a kubeconfig problem, so I edited the calico-kube-controllers section of the YAML directly: I mounted the host's /root/.kube/ directory to read the config file, mounted the host's /opt/kubernetes/ssl directory to read the etcd credential files (make sure that directory actually contains the files), and set KUBECONFIG explicitly as shown below. (After a few more rounds of testing it turned out the kubeconfig was not the problem at all: simply restarting all the calico-node pods fixed it. The default kubeconfig is /etc/cni/net.d/calico-kubeconfig, a file generated automatically by calico-node's install program.)

      containers:
        - name: calico-kube-controllers
          image: calico/kube-controllers:v3.16.1
          volumeMounts:
            - mountPath: /test-pd
              name: test-volume
            - mountPath: /opt/kubernetes/ssl
              name: test-etcd
          env:
            # Choose which controllers to run.
            - name: ENABLED_CONTROLLERS
              value: node
            - name: DATASTORE_TYPE
              value: kubernetes
            - name: KUBECONFIG
              value: /test-pd/config
          readinessProbe:
            exec:
              command:
                - /usr/bin/check-status
                - -r
      volumes:
        - name: test-volume
          hostPath:
            # directory location on host
            path: /root/.kube/
        - name: test-etcd
          hostPath:
            path: /opt/kubernetes/ssl/

After recreating, calico-kube-controllers started correctly, but at this point calico-node was still restarting constantly. Its logs showed:

2020-09-30 01:43:32.539 [INFO][8] startup/startup.go 361: Early log level set to info
2020-09-30 01:43:32.539 [INFO][8] startup/startup.go 377: Using NODENAME environment for node name
2020-09-30 01:43:32.540 [INFO][8] startup/startup.go 389: Determined node name: k8s-node1
2020-09-30 01:43:32.543 [INFO][8] startup/startup.go 421: Checking datastore connection
2020-09-30 01:43:32.552 [INFO][8] startup/startup.go 436: Hit error connecting to datastore - retry error=Get "https://10.0.0.1:443/api/v1/nodes/foo": x509: certificate is valid for 127.0.0.1, 172.171.19.210, not 10.0.0.1

The calico-node pods on the worker nodes could not reach the apiserver. Check the Calico configuration and set the apiserver's IP and port explicitly; if they are not set, calico-node falls back to the default in-cluster service address (10.0.0.1 here) on port 443, which our apiserver certificate does not cover. The relevant fields are KUBERNETES_SERVICE_HOST, KUBERNETES_SERVICE_PORT, and KUBERNETES_SERVICE_PORT_HTTPS.
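A minimal sketch of those env entries (the address below matches the apiserver used later in this post; substitute your own):

```yaml
# Add to the env: list of the calico-node (and calico-kube-controllers) containers.
- name: KUBERNETES_SERVICE_HOST
  value: "192.168.1.130"     # your apiserver IP -- must be covered by its certificate
- name: KUBERNETES_SERVICE_PORT
  value: "6443"
- name: KUBERNETES_SERVICE_PORT_HTTPS
  value: "6443"
```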

After recreating again, the logs showed everything running normally.

Changing the datastore type from kubernetes to etcdv3

The calico.yaml downloaded from the official site defines the datastore type for calico-node and calico-kube-controllers as shown below; if this value is removed, the default etcdv3 is used. Data stored in Kubernetes is inconvenient to inspect, so I switched to etcdv3 (note that the official recommendation is the Kubernetes datastore):

          env:
            # Use Kubernetes API as the backing datastore.
            - name: DATASTORE_TYPE
              value: "kubernetes"

But switching the datastore via that env variable alone requires configuring certificates and so on, which I never got working. Calico actually provides a convenient way to use an etcd database, and the official site has a template YAML for it. The steps are as follows:

Download the etcd-datastore flavor of the Calico YAML

curl https://docs.projectcalico.org/v3.16/manifests/calico-etcd.yaml -o calico-etcd.yaml

Generate the base64-encoded credentials

  • mkdir /opt/calico
  • cp -fr /opt/etcd/ssl /opt/calico/
  • cd /opt/calico/ssl
  • cat server.pem | base64 -w 0 > etcd-cert
  • cat server-key.pem | base64 -w 0 > etcd-key
  • cat ca.pem | base64 -w 0 > etcd-ca
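The manual steps above can be wrapped in a small POSIX-shell helper; this is just a sketch (the function name and its argument are mine, not part of Calico):

```shell
#!/bin/sh
# Sketch: base64-encode the etcd TLS files for the calico-etcd.yaml Secret,
# mirroring the manual "cat <file> | base64 -w 0" steps above.
encode_etcd_secrets() {
  dir="$1"   # directory containing server.pem, server-key.pem and ca.pem
  base64 -w 0 "$dir/server.pem"     > "$dir/etcd-cert"
  base64 -w 0 "$dir/server-key.pem" > "$dir/etcd-key"
  base64 -w 0 "$dir/ca.pem"         > "$dir/etcd-ca"
}
```

e.g. `encode_etcd_secrets /opt/calico/ssl`, after which the three output files hold the strings to paste into the Secret shown below.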

Fill the credentials into calico-etcd.yaml

Paste the base64 strings generated above into the file: ca.pem goes to etcd-ca, server-key.pem to etcd-key, and server.pem to etcd-cert. Adjust the etcd certificate locations if needed (I left the defaults and it still worked; I'm not sure why), and set the etcd endpoints to the same addresses configured for the apiserver in /opt/kubernetes/cfg/kube-apiserver.conf:

# vim calico-etcd.yaml
...
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: calico-etcd-secrets
  namespace: kube-system
data:
  # Populate the following with etcd TLS configuration if desired, but leave blank if
  # not using TLS for etcd.
  # The keys below should be uncommented and the values populated with the base64
  # encoded contents of each file that would be associated with the TLS data.
  # Example command for encoding a file contents: cat <file> | base64 -w 0
  etcd-key: <base64 string generated above>
  etcd-cert: <base64 string generated above>
  etcd-ca: <base64 string generated above>
...
kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config
  namespace: kube-system
data:
  # Configure this with the location of your etcd cluster.
  etcd_endpoints: "https://192.168.1.2:2379"
  # If you're using TLS enabled etcd uncomment the following.
  # You must also populate the Secret below with these files.
  etcd_ca: "/calico-secrets/etcd-ca"       # leave these three values unchanged
  etcd_cert: "/calico-secrets/etcd-cert"   # leave these three values unchanged
  etcd_key: "/calico-secrets/etcd-key"     # leave these three values unchanged

Recreate the Calico resources

kubectl delete -f calico.yaml
kubectl create -f calico-etcd.yaml

Add the following line to /root/.bashrc

alias etcdctl='ETCDCTL_API=3 etcdctl --endpoints https://192.168.1.2:2379 --cacert /opt/etcd/ssl/ca.pem --key /opt/etcd/ssl/server-key.pem --cert /opt/etcd/ssl/server.pem'
Then run source ~/.bashrc.

Verify that the Calico data is in etcd

Create a pod and look it up in etcd; you should see the pod's information under keys starting with /calico/resources/v3/.
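For example, using the etcdctl alias defined above (the exact key layout can vary between Calico versions):

```shell
# List Calico's keys in etcd; pod (WorkloadEndpoint) data lives under this prefix.
etcdctl get /calico/resources/v3/ --prefix --keys-only | head -20
```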

Install the calicoctl tool

(see https://docs.projectcalico.org/getting-started/clis/calicoctl/install)
(1) curl -O -L https://github.com/projectcalico/calicoctl/releases/download/v3.16.1/calicoctl, move it to /usr/local/bin/, and chmod +x it.
(2) Configure the datastore: cat << EOF > /etc/calico/calicoctl.cfg
apiVersion: projectcalico.org/v3
kind: CalicoAPIConfig
metadata:
spec:
  datastoreType: "kubernetes"
  kubeconfig: "/root/.kube/config"
EOF
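Note that this calicoctl.cfg uses the kubernetes datastore even though the cluster was just switched to etcdv3. For the etcdv3 datastore, calicoctl.cfg would instead point at etcd; a sketch, reusing the endpoint and cert paths from earlier (CalicoAPIConfig supports etcdEndpoints and the etcd*File TLS fields):

```yaml
apiVersion: projectcalico.org/v3
kind: CalicoAPIConfig
metadata:
spec:
  datastoreType: "etcdv3"
  etcdEndpoints: "https://192.168.1.2:2379"
  etcdCACertFile: "/opt/etcd/ssl/ca.pem"
  etcdCertFile: "/opt/etcd/ssl/server.pem"
  etcdKeyFile: "/opt/etcd/ssl/server-key.pem"
```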

Running calico-node as a container on a non-cluster node

  • Create the /etc/calico/calico.env configuration file
# cat /etc/calico/calico.env 
CALICO_NODENAME=""
CALICO_K8S_NODE_REF="192-168-1-210"
CALICO_IPV4POOL_IPIP="Always" 
CALICO_IP="" 
CALICO_IP6=""
CALICO_NETWORKING_BACKEND="bird"
DATASTORE_TYPE="etcdv3"
ETCD_ENDPOINTS="https://xxx1:2379,https://xxx2:2379,https://xxx3:2379"
ETCD_CA_CERT_FILE="/etc/calico/pki/etcd-ca"
ETCD_CERT_FILE="/etc/calico/pki/etcd-cert"
ETCD_KEY_FILE="/etc/calico/pki/etcd-key"
KUBERNETES_SERVICE_HOST="192.168.1.130"
KUBERNETES_SERVICE_PORT="6443"
KUBECONFIG="/etc/calico/config"
WAIT_FOR_DATASTORE="true"
BGP_LOGSEVERITYSCREEN="info"
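A side note on CALICO_K8S_NODE_REF above: its value looks like the node IP with dots replaced by dashes (the node names in this cluster apparently follow that convention; this is my assumption). If so, the value can be derived mechanically:

```shell
#!/bin/sh
# Assumption: node names are the node IP with '.' replaced by '-'.
node_ref() { printf '%s' "$1" | tr '.' '-'; }
```

e.g. `node_ref 192.168.1.210` prints `192-168-1-210`, matching the value in the file above.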

  • Create the calico-node systemd unit file
# cat /lib/systemd/system/calico-node.service
[Unit]
Description=calico-node
After=docker.service
Requires=docker.service

[Service]
EnvironmentFile=/etc/calico/calico.env
ExecStartPre=-/usr/bin/docker rm -f calico-node
ExecStart=/usr/bin/docker run --net=host --privileged \
 --name=calico-node \
 -e NODENAME=${CALICO_NODENAME} \
 -e IP=${CALICO_IP} \
 -e IP6=${CALICO_IP6} \
 -e CALICO_NETWORKING_BACKEND=${CALICO_NETWORKING_BACKEND} \
 -e AS=${CALICO_AS} \
 -e CALICO_IPV4POOL_IPIP=${CALICO_IPV4POOL_IPIP} \
 -e DATASTORE_TYPE=${DATASTORE_TYPE} \
 -e ETCD_ENDPOINTS=${ETCD_ENDPOINTS} \
 -e ETCD_CA_CERT_FILE=${ETCD_CA_CERT_FILE} \
 -e ETCD_CERT_FILE=${ETCD_CERT_FILE} \
 -e ETCD_KEY_FILE=${ETCD_KEY_FILE} \
 -e KUBERNETES_SERVICE_HOST=${KUBERNETES_SERVICE_HOST} \
 -e KUBERNETES_SERVICE_PORT=${KUBERNETES_SERVICE_PORT} \
 -e KUBECONFIG=${KUBECONFIG} \
 -e WAIT_FOR_DATASTORE=${WAIT_FOR_DATASTORE} \
 -e BGP_LOGSEVERITYSCREEN=${BGP_LOGSEVERITYSCREEN} \
 -v /var/log/calico:/var/log/calico \
 -v /run/docker/plugins:/run/docker/plugins \
 -v /lib/modules:/lib/modules \
 -v /var/run/calico:/var/run/calico \
 -v /etc/calico:/etc/calico \
 -v /var/lib/calico:/var/lib/calico \
 calico/node:v3.16.5

ExecStop=-/usr/bin/docker stop calico-node

Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s

[Install]
WantedBy=multi-user.target

Then run systemctl daemon-reload, systemctl enable calico-node, and systemctl start calico-node.

Other common problems:

  1. Does a newly added node need a separate Calico deployment?
    A: No. Calico is deployed here as a DaemonSet, so when a new node joins, a calico-node pod is started on it automatically.
  2. calico-node is stuck in the Init state. What's wrong?
    A: Most likely the node has no outbound connectivity and cannot pull the images. If the images pull fine, the kubelet may be failing to create the sandbox, e.g. because /run/systemd/resolve/resolv.conf is missing; copy it over from a healthy node.
  3. In a multi-node cluster, most calico-node pods are Ready but a few show "0/1". Run calicoctl node status to check whether the BGP sessions between nodes are established. If some are not, check whether the interfaces used to establish the sessions have the same name on every node; if not, rename them to match. Their IP addresses must of course be in the same subnet. If a node has multiple NICs, you can pin the interface explicitly:
            # IP automatic detection
            - name: IP_AUTODETECTION_METHOD
              value: "interface=eth2"

Alternatively, selecting by IP range is even more convenient, since a cluster's business NICs are usually all in the same subnet:

IP_AUTODETECTION_METHOD=cidr=10.0.1.0/24,10.0.2.0/24
IP6_AUTODETECTION_METHOD=cidr=2001:4860::0/64

Note that when selecting the BGP subnet this way, you also need to set the IP environment variable as shown below, meaning the BGP IP address is auto-detected:

- name: IP
  value: autodetect
  4. If a calico-node pod shows Init:CrashLoopBackOff, initialization failed; run kubectl describe on the pod to see which init container step went wrong. Below is the Init Containers portion of kubectl describe output for such a pod:
Init Containers:
  upgrade-ipam:
    Container ID:  docker://caaa485d0880c1cb022873c5017ec60ba1970ed8dc897a0b458fa6bb4b6b4179
    Image:         192.168.3.224:5000/library/calico/cni:v3.18.1
    Image ID:      docker-pullable://192.168.3.224:5000/library/calico/cni@sha256:bc6507d4c122b69609fed5839d899569a80b709358836dd9cd1052470dfdd47a
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/calico-ipam
      -upgrade
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 27 Oct 2021 16:16:15 +0800
      Finished:     Wed, 27 Oct 2021 16:16:15 +0800
    Ready:          True
    Restart Count:  0
    Environment Variables from:
      kubernetes-services-endpoint  ConfigMap  Optional: true
    Environment:
      KUBERNETES_NODE_NAME:        (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:  <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/lib/cni/networks from host-local-net-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-vz89c (ro)
  install-cni:
    Container ID:  docker://87bfc3cea39247d10796148bf88f94a552d327bf3038f87f4e981feb02393cb8
    Image:         192.168.3.224:5000/library/calico/cni:v3.18.1
    Image ID:      docker-pullable://192.168.3.224:5000/library/calico/cni@sha256:bc6507d4c122b69609fed5839d899569a80b709358836dd9cd1052470dfdd47a
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/install
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 27 Oct 2021 16:16:16 +0800
      Finished:     Wed, 27 Oct 2021 16:16:17 +0800
    Ready:          True
    Restart Count:  0
    Environment Variables from:
      kubernetes-services-endpoint  ConfigMap  Optional: true
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      CNI_NETWORK_CONFIG:    <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
      KUBERNETES_NODE_NAME:   (v1:spec.nodeName)
      CNI_MTU:               <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      SLEEP:                 false
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-vz89c (ro)
  flexvol-driver:
    Container ID:   docker://d095ec7e2ca15c90e234b207890caec380e0ae1491556e4b61f58e0db0e0df00
    Image:          192.168.3.224:5000/library/calico/pod2daemon-flexvol:v3.18.1
    Image ID:       docker-pullable://192.168.3.224:5000/library/calico/pod2daemon-flexvol@sha256:4ac1844531e0592b2c609a0b0d2e8f740f4c66c7e27c7e5dda994dec98d7fb28
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 27 Oct 2021 16:16:18 +0800
      Finished:     Wed, 27 Oct 2021 16:16:18 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-vz89c (ro)
Containers:

As you can see, initialization uses three init containers: upgrade-ipam, install-cni, and flexvol-driver. upgrade-ipam checks for data under /var/lib/cni/networks/k8s-pod-network and, if present, migrates the local host-local IPAM data to calico-ipam. install-cni is a binary built from the cni-plugin project; it copies the CNI binaries into each host's /opt/cni/bin and writes the generated Calico config file into /etc/cni/net.d. flexvol-driver uses the pod2daemon-flexvol image; it adds a Flex Volume driver that creates a per-pod Unix domain socket to allow Dikastes to communicate with Felix over the Policy Sync API. When initialization fails, the calico-node logs won't show the problem, but the logs of these three init containers will; for example, to check the second one, install-cni:

kubectl logs -n kube-system   calico-node-123xx -c install-cni

Occasionally one of these three steps hits a system-level error that cannot be worked around. If the step does not affect the functionality you need, you can delete that init container's section from the YAML, or perform its work manually, and the pod will then come up.

  5. Everything looks normal, but the local Calico containers can't even be pinged. What now?
    I ran into this odd one: the host side had no default route configured, so the ARP entries on the Calico container side showed incomplete. Adding a default route in the host network namespace fixed it.
    Also, without a default route on the host, calico-kube-controllers won't start either, even when it runs on the same node as kube-apiserver. It reports: client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://10.244.64.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.244.64.1:443: connect: no route to host
    2024-06-26 02:20:58.320 [FATAL][1] main.go 118: Failed to initialize Calico datastore error=Get "https://10.244.64.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.244.64.1:443: connect: no route to host
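The fix is a one-liner in the host namespace; the gateway and interface below are placeholders for your own:

```shell
# Placeholder gateway/NIC -- substitute your real ones.
ip route add default via 192.168.1.1 dev eth0
ip route show default    # verify the route exists
```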

  6. Are there scenarios that require IPIP mode rather than BGP mode?
    IPIP suits relatively simple environments (e.g. when the underlying network will not route pod CIDRs between nodes), while BGP works in almost every scenario. BGP does have a higher barrier to entry, though; it is typically used in large networks, interconnection across multiple ASes, cases that need dynamic route exchange, and so on.
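For reference, the encapsulation mode is controlled by the CALICO_IPV4POOL_IPIP env var on calico-node (note it only applies when the default IP pool is first created):

```yaml
# calico-node env: "Always" = always encapsulate in IPIP; "CrossSubnet" = IPIP
# only between nodes in different subnets; "Never" = no encapsulation (pure BGP).
- name: CALICO_IPV4POOL_IPIP
  value: "Always"
```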

  7. An occasional error: failed to look up reserved IPs: connection is unauthorized: ipreservations.crd.projectcalico.org is forbidden: User "system:serviceaccount:kube-system:calico-node" cannot list resource "ipreservations" in API group "crd.projectcalico.org" at the cluster scope
    This means the calico-node service account lacks permission to list ipreservations resources, which points to a misconfigured RBAC setup. Normally Calico ships with the right permissions, so this error most likely hides a different root cause, such as an unusable IP range configured for Calico. If the interface used for the BGP sessions is correct and the assigned addresses are correct, you can grant the permission as follows.

1. Define a new ClusterRole that allows listing ipreservations resources.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: calico-ipreservations-role
rules:
  - apiGroups: ["crd.projectcalico.org"]
    resources: ["ipreservations"]
    verbs: ["list", "get", "watch"]

2. Create a ClusterRoleBinding that binds this ClusterRole to the calico-node service account.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: calico-ipreservations-binding
subjects:
  - kind: ServiceAccount
    name: calico-node
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: calico-ipreservations-role
  apiGroup: rbac.authorization.k8s.io

Then kubectl apply -f both of the YAML files above.
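The grant can be verified with kubectl's impersonation check; it should print "yes" once the binding is applied:

```shell
kubectl auth can-i list ipreservations.crd.projectcalico.org \
  --as=system:serviceaccount:kube-system:calico-node
```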


posted @ JaneySJ