Notes on pitfalls when deploying Calico in an existing cluster
The existing cluster uses Docker's default bridge network model, which does not support cross-node communication, so the Calico network plugin has to be deployed, and the kubelet network model has to be switched to CNI (--network-plugin=cni). The Calico website (https://docs.projectcalico.org/getting-started/kubernetes/self-managed-onprem/onpremises) gives the following installation steps:
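The kubelet change mentioned above can be sketched as follows; the config file path and variable name are assumptions for a binary-installed cluster (this post later references /opt/kubernetes/cfg/), so adjust them to your deployment:

```shell
# Hypothetical kubelet options file for a binary-installed cluster.
# Append the CNI flags to the kubelet options, e.g. in /opt/kubernetes/cfg/kubelet.conf:
#   KUBELET_OPTS="... --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
# Then reload and restart the kubelet:
systemctl daemon-reload
systemctl restart kubelet
```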
- Download the Calico networking manifest for the Kubernetes API datastore.
curl https://docs.projectcalico.org/manifests/calico.yaml -O
- Set the CALICO_IPV4POOL_CIDR field to the subnet you want to use
- Customize the manifest as needed
- CALICO_DISABLE_FILE_LOGGING defaults to true, which means all logs except the CNI logs go through kubectl logs; to read logs from files under /var/log/calico/ instead, set it to false and also mount the host directory /var/log/calico into the container
- BGP_LOGSEVERITYSCREEN sets the BGP log level; the default is info, and it can also be set to debug, error, etc.
- FELIX_LOGSEVERITYSCREEN sets Felix's log level
- Apply the manifest using the following command.
kubectl apply -f calico.yaml
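For the customization items above, the corresponding env entries of the calico-node container in calico.yaml look roughly like this (a sketch; the CIDR is an example value, use your own pod subnet):

```yaml
# Excerpt of the calico-node container env in calico.yaml (sketch)
- name: CALICO_IPV4POOL_CIDR
  value: "10.244.0.0/16"          # example pod subnet; use your own
- name: CALICO_DISABLE_FILE_LOGGING
  value: "false"                  # false = also write log files under /var/log/calico
- name: BGP_LOGSEVERITYSCREEN
  value: "info"                   # or debug, error, ...
- name: FELIX_LOGSEVERITYSCREEN
  value: "info"
```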
At the last step, however, the calico-kube-controllers container would not come up, and the calico-node container kept restarting as well. The calico-kube-controllers logs looked like this:
2020-09-29 09:39:55.356 [INFO][1] main.go 88: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W0929 09:39:55.359900 1 client_config.go:543] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2020-09-29 09:39:55.362 [INFO][1] main.go 109: Ensuring Calico datastore is initialized
2020-09-29 09:39:55.372 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://10.0.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": x509: certificate is valid for 127.0.0.1, 172.171.19.210, not 10.0.0.1
2020-09-29 09:39:55.373 [FATAL][1] main.go 114: Failed to initialize Calico datastore error=Get "https://10.0.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": x509: certificate is valid for 127.0.0.1, 172.171.19.210, not 10.0.0.1
Guessing that the kubeconfig was not set up properly, I edited the calico-kube-controllers container spec in the YAML directly, using host volumes to read the config file from /root/.kube/ and the etcd certificate files from /opt/kubernetes/ssl on the host (make sure the files actually exist there), and setting KUBECONFIG explicitly as shown below. (After several rounds of testing I learned that the kubeconfig needs no special care: restarting all the calico-node pods is enough, and the default kubeconfig, /etc/cni/net.d/calico-kubeconfig, is generated automatically by calico-node's install program.)
      containers:
        - name: calico-kube-controllers
          image: calico/kube-controllers:v3.16.1
          volumeMounts:
            - mountPath: /test-pd
              name: test-volume
            - mountPath: /opt/kubernetes/ssl
              name: test-etcd
          env:
            # Choose which controllers to run.
            - name: ENABLED_CONTROLLERS
              value: node
            - name: DATASTORE_TYPE
              value: kubernetes
            - name: KUBECONFIG
              value: /test-pd/config
          readinessProbe:
            exec:
              command:
                - /usr/bin/check-status
                - -r
      volumes:
        - name: test-volume
          hostPath:
            # directory location on host
            path: /root/.kube/
        - name: test-etcd
          hostPath:
            path: /opt/kubernetes/ssl/
After recreating it, calico-kube-controllers started correctly, but calico-node was still restarting over and over. Its log looked like this:
2020-09-30 01:43:32.539 [INFO][8] startup/startup.go 361: Early log level set to info
2020-09-30 01:43:32.539 [INFO][8] startup/startup.go 377: Using NODENAME environment for node name
2020-09-30 01:43:32.540 [INFO][8] startup/startup.go 389: Determined node name: k8s-node1
2020-09-30 01:43:32.543 [INFO][8] startup/startup.go 421: Checking datastore connection
2020-09-30 01:43:32.552 [INFO][8] startup/startup.go 436: Hit error connecting to datastore - retry error=Get "https://10.0.0.1:443/api/v1/nodes/foo": x509: certificate is valid for 127.0.0.1, 172.171.19.210, not 10.0.0.1
The calico-node pods on the worker nodes could not find the apiserver address. Check the Calico configuration and fill in the apiserver IP and port; if they are not set, Calico falls back to defaults (the in-cluster service address and port 443). The field names are KUBERNETES_SERVICE_HOST, KUBERNETES_SERVICE_PORT and KUBERNETES_SERVICE_PORT_HTTPS.
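A sketch of those fields in the calico-node env section; the address and port are examples (they match the apiserver used later in this post), substitute your own:

```yaml
- name: KUBERNETES_SERVICE_HOST
  value: "192.168.1.130"   # your apiserver address
- name: KUBERNETES_SERVICE_PORT
  value: "6443"
- name: KUBERNETES_SERVICE_PORT_HTTPS
  value: "6443"
```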
After recreating again and checking the logs, everything ran normally.
Switching the Kubernetes datastore type to etcdv3
The calico.yaml downloaded from the official site defines the datastore type for calico-node and calico-kube-controllers as shown below; if the value is commented out, the default, etcdv3, is used. Data stored through the Kubernetes API is inconvenient to inspect, though, so I switched to etcdv3 (the official recommendation is the Kubernetes datastore).
env:
  # Use Kubernetes API as the backing datastore.
  - name: DATASTORE_TYPE
    value: "kubernetes"
Using that method requires configuring certificates and so on, which I never managed to get working. Calico actually provides a convenient way to use an etcd database, and the website has a template YAML file for the etcd datastore. The steps are as follows:
Download the Calico YAML for the etcd datastore type
curl https://docs.projectcalico.org/v3.16/manifests/calico-etcd.yaml -o calico-etcd.yaml
Generate the secret values (base64-encode the certificates)
- mkdir /opt/calico
- cp -fr /opt/etcd/ssl /opt/calico/
- cd /opt/calico/ssl
- cat server.pem | base64 -w 0 > etcd-cert
- cat server-key.pem | base64 -w 0 > etcd-key
- cat ca.pem | base64 -w 0 > etcd-ca
Fill the encoded values into calico-etcd.yaml
Put the base64-encoded strings from above into the file as declared: ca.pem corresponds to etcd-ca, server-key.pem to etcd-key, and server.pem to etcd-cert. Also check the etcd certificate paths (I left the defaults and it still worked, though I'm not sure why) and the etcd connection address (the same one configured for the apiserver in /opt/kubernetes/cfg/kube-apiserver.conf).
# vim calico-etcd.yaml
...
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: calico-etcd-secrets
  namespace: kube-system
data:
  # Populate the following with etcd TLS configuration if desired, but leave blank if
  # not using TLS for etcd.
  # The keys below should be uncommented and the values populated with the base64
  # encoded contents of each file that would be associated with the TLS data.
  # Example command for encoding a file contents: cat <file> | base64 -w 0
  etcd-key: (paste the base64-encoded string from above)
  etcd-cert: (paste the base64-encoded string from above)
  etcd-ca: (paste the base64-encoded string from above)
...
kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config
  namespace: kube-system
data:
  # Configure this with the location of your etcd cluster.
  etcd_endpoints: "https://192.168.1.2:2379"
  # If you're using TLS enabled etcd uncomment the following.
  # You must also populate the Secret below with these files.
  etcd_ca: "/calico-secrets/etcd-ca"      # these three values need no change
  etcd_cert: "/calico-secrets/etcd-cert"  # these three values need no change
  etcd_key: "/calico-secrets/etcd-key"    # these three values need no change
Recreate the Calico resources
kubectl delete -f calico.yaml
kubectl create -f calico-etcd.yaml
Edit /root/.bashrc and add the following line:
alias etcdctl='ETCDCTL_API=3 etcdctl --endpoints https://192.168.1.2:2379 --cacert /opt/etcd/ssl/ca.pem --key /opt/etcd/ssl/server-key.pem --cert /opt/etcd/ssl/server.pem'
Then run source ~/.bashrc.
Verify that Calico's data is in etcd
Create a pod, then look that pod up in etcd; you will see its information under keys starting with /calico/resources/v3/.
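With the etcdctl alias defined above, the check can be sketched like this (run on a node that can reach etcd):

```shell
# List Calico's keys in etcd; pod/workload data lives under /calico/resources/v3/
etcdctl get /calico/ --prefix --keys-only | head -n 20

# Narrow it down to a specific pod (name is a placeholder)
etcdctl get /calico/resources/v3/ --prefix --keys-only | grep <pod-name>
```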
Install the calicoctl tool
(see https://docs.projectcalico.org/getting-started/clis/calicoctl/install)
(1) curl -O -L https://github.com/projectcalico/calicoctl/releases/download/v3.16.1/calicoctl, put it in /usr/local/bin/, and chmod +x it
(2) Configure the datastore: cat << EOF > /etc/calico/calicoctl.cfg
apiVersion: projectcalico.org/v3
kind: CalicoAPIConfig
metadata:
spec:
  datastoreType: "kubernetes"
  kubeconfig: "/root/.kube/config"
EOF
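With the config in place, a couple of useful calicoctl commands (a sketch; run on a cluster node):

```shell
# Show the BGP peer/session status of the local node
calicoctl node status

# List node and IP pool resources from the configured datastore
calicoctl get nodes
calicoctl get ippool -o wide
```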
Installing calico-node as a container on a node outside the cluster
- Create the configuration file /etc/calico/calico.env
# cat /etc/calico/calico.env
CALICO_NODENAME=""
CALICO_K8S_NODE_REF="192-168-1-210"
CALICO_IPV4POOL_IPIP="Always"
CALICO_IP=""
CALICO_IP6=""
CALICO_NETWORKING_BACKEND="bird"
DATASTORE_TYPE="etcdv3"
ETCD_ENDPOINTS="https://xxx1:2379,https://xxx2:2379,https://xxx3:2379"
ETCD_CA_CERT_FILE="/etc/calico/pki/etcd-ca"
ETCD_CERT_FILE="/etc/calico/pki/etcd-cert"
ETCD_KEY_FILE="/etc/calico/pki/etcd-key"
KUBERNETES_SERVICE_HOST="192.168.1.130"
KUBERNETES_SERVICE_PORT="6443"
KUBECONFIG="/etc/calico/config"
WAIT_FOR_DATASTORE="true"
BGP_LOGSEVERITYSCREEN="info"
- Create the calico-node systemd unit file
# cat /lib/systemd/system/calico-node.service
[Unit]
Description=calico-node
After=docker.service
Requires=docker.service
[Service]
EnvironmentFile=/etc/calico/calico.env
ExecStartPre=-/usr/bin/docker rm -f calico-node
ExecStart=/usr/bin/docker run --net=host --privileged \
--name=calico-node \
-e NODENAME=${CALICO_NODENAME} \
-e IP=${CALICO_IP} \
-e IP6=${CALICO_IP6} \
-e CALICO_NETWORKING_BACKEND=${CALICO_NETWORKING_BACKEND} \
-e AS=${CALICO_AS} \
-e CALICO_IPV4POOL_IPIP=${CALICO_IPV4POOL_IPIP} \
-e DATASTORE_TYPE=${DATASTORE_TYPE} \
-e ETCD_ENDPOINTS=${ETCD_ENDPOINTS} \
-e ETCD_CA_CERT_FILE=${ETCD_CA_CERT_FILE} \
-e ETCD_CERT_FILE=${ETCD_CERT_FILE} \
-e ETCD_KEY_FILE=${ETCD_KEY_FILE} \
-e KUBERNETES_SERVICE_HOST=${KUBERNETES_SERVICE_HOST} \
-e KUBERNETES_SERVICE_PORT=${KUBERNETES_SERVICE_PORT} \
-e KUBECONFIG=${KUBECONFIG} \
-e WAIT_FOR_DATASTORE=${WAIT_FOR_DATASTORE} \
-e BGP_LOGSEVERITYSCREEN=${BGP_LOGSEVERITYSCREEN} \
-v /var/log/calico:/var/log/calico \
-v /run/docker/plugins:/run/docker/plugins \
-v /lib/modules:/lib/modules \
-v /var/run/calico:/var/run/calico \
-v /etc/calico:/etc/calico \
-v /var/lib/calico:/var/lib/calico \
calico/node:v3.16.5
ExecStop=-/usr/bin/docker stop calico-node
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s
[Install]
WantedBy=multi-user.target
Then run systemctl daemon-reload, systemctl enable calico-node, and systemctl start calico-node.
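The steps above can be sketched as follows, with a couple of sanity checks added:

```shell
systemctl daemon-reload
systemctl enable calico-node
systemctl start calico-node

# Sanity checks: the container should be running and the log free of datastore errors
docker ps --filter name=calico-node
docker logs --tail 20 calico-node
```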
Other common questions:
- Does a newly added node need an extra Calico deployment?
  A: No. Calico is deployed here as a DaemonSet, so when a new node joins the cluster a calico-node pod is started on it automatically.
- calico-node stays stuck in the Init state. What is going on?
  A: Most likely the node cannot reach the internet, so the images cannot be pulled. If the images can be pulled, the kubelet may be failing to create the sandbox, for example because the file /run/systemd/resolve/resolv.conf is missing; copying it over from a healthy node fixes that.
- In a multi-node cluster, most calico-node pods are Ready but a few show "0/1". Check with calicoctl node status whether the BGP sessions between nodes are established. If they are not, check whether the interfaces used to establish them have the same name on every node, and rename them if not; their IP addresses must of course be in the same subnet. With multiple NICs you can pin the interface, as shown below:
# IP automatic detection
- name: IP_AUTODETECTION_METHOD
  value: "interface=eth2"
Alternatively, there is an even more convenient method that specifies IP address ranges, and the business NICs in a cluster are usually in the same subnet anyway. Set it like this:
IP_AUTODETECTION_METHOD=cidr=10.0.1.0/24,10.0.2.0/24
IP6_AUTODETECTION_METHOD=cidr=2001:4860::0/64
Note that when pinning the subnets used to establish BGP sessions this way, you also need to set the IP environment variable, as shown below, which means the BGP IP address is auto-detected:
- name: IP
  value: autodetect
- If the calico-node pod shows Init:CrashLoopBackOff, initialization failed; run kubectl describe on the pod to find out which initContainer step went wrong. Below is the Init Containers portion of the kubectl describe output for a pod whose initialization failed:
Init Containers:
  upgrade-ipam:
    Container ID:  docker://caaa485d0880c1cb022873c5017ec60ba1970ed8dc897a0b458fa6bb4b6b4179
    Image:         192.168.3.224:5000/library/calico/cni:v3.18.1
    Image ID:      docker-pullable://192.168.3.224:5000/library/calico/cni@sha256:bc6507d4c122b69609fed5839d899569a80b709358836dd9cd1052470dfdd47a
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/calico-ipam
      -upgrade
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 27 Oct 2021 16:16:15 +0800
      Finished:     Wed, 27 Oct 2021 16:16:15 +0800
    Ready:          True
    Restart Count:  0
    Environment Variables from:
      kubernetes-services-endpoint  ConfigMap  Optional: true
    Environment:
      KUBERNETES_NODE_NAME:       (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:  <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/lib/cni/networks from host-local-net-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-vz89c (ro)
  install-cni:
    Container ID:  docker://87bfc3cea39247d10796148bf88f94a552d327bf3038f87f4e981feb02393cb8
    Image:         192.168.3.224:5000/library/calico/cni:v3.18.1
    Image ID:      docker-pullable://192.168.3.224:5000/library/calico/cni@sha256:bc6507d4c122b69609fed5839d899569a80b709358836dd9cd1052470dfdd47a
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/install
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 27 Oct 2021 16:16:16 +0800
      Finished:     Wed, 27 Oct 2021 16:16:17 +0800
    Ready:          True
    Restart Count:  0
    Environment Variables from:
      kubernetes-services-endpoint  ConfigMap  Optional: true
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      CNI_NETWORK_CONFIG:    <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
      KUBERNETES_NODE_NAME:  (v1:spec.nodeName)
      CNI_MTU:               <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      SLEEP:                 false
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-vz89c (ro)
  flexvol-driver:
    Container ID:  docker://d095ec7e2ca15c90e234b207890caec380e0ae1491556e4b61f58e0db0e0df00
    Image:         192.168.3.224:5000/library/calico/pod2daemon-flexvol:v3.18.1
    Image ID:      docker-pullable://192.168.3.224:5000/library/calico/pod2daemon-flexvol@sha256:4ac1844531e0592b2c609a0b0d2e8f740f4c66c7e27c7e5dda994dec98d7fb28
    Port:          <none>
    Host Port:     <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 27 Oct 2021 16:16:18 +0800
      Finished:     Wed, 27 Oct 2021 16:16:18 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-vz89c (ro)
Containers:
As you can see, initialization uses three containers: upgrade-ipam, install-cni and flexvol-driver. upgrade-ipam checks whether /var/lib/cni/networks/k8s-pod-network contains data and, if it does, migrates the local host-local IPAM data to calico-ipam. install-cni is a binary built from the cni-plugin project; it copies the CNI binaries into each host's /opt/cni/bin and writes the generated Calico configuration into /etc/cni/net.d. flexvol-driver uses the pod2daemon-flexvol image; it adds a FlexVolume driver that creates a per-pod Unix domain socket to allow Dikastes to communicate with Felix over the Policy Sync API. If initialization fails and the calico-node log itself shows nothing useful, you can analyze the logs of these three containers instead, e.g. for the second init container, install-cni:
kubectl logs -n kube-system calico-node-123xx -c install-cni
Sometimes one of these three steps fails with a system-level error that cannot be resolved; if it does not affect functionality, you can delete that init container's section from the YAML, or perform its work manually, and the pod will then come up.
- Everything looks fine, but none of the local Calico containers respond to ping. What now?
  A: I once ran into this odd problem: the host side had no default route, so the ARP entries on the Calico container side showed up as incomplete. Adding a default route in the host's network namespace fixed it.
  Also, without a default route on the host, calico-kube-controllers will not start either, even when it runs on the same node as kube-apiserver; it reports:
  client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://10.244.64.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.244.64.1:443: connect: no route to host
  2024-06-26 02:20:58.320 [FATAL][1] main.go 118: Failed to initialize Calico datastore error=Get "https://10.244.64.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.244.64.1:443: connect: no route to host
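A minimal sketch of the check and fix; the gateway and interface below are placeholders for your environment:

```shell
# Is there a default route in the host namespace?
ip route show default

# If not, ARP entries on the Calico side may show "incomplete".
# Add one via your real gateway (placeholder values below):
ip route add default via 192.168.1.1 dev eth0
```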
- Are there scenarios where only IPIP mode works and BGP mode cannot be used?
  A: IPIP mode suits relatively simple scenarios, while BGP can be used in almost all of them. BGP does have a higher barrier to entry, though; it is commonly used for large-scale networks, interconnection between multiple ASes, cases requiring dynamic route negotiation, and so on.
- In some cases this error appears: failed to look up reserved IPs: connection is unauthorized: ipreservations.crd.projectcalico.org is forbidden: User "system:serviceaccount:kube-system:calico-node" cannot list resource "ipreservations" in API group "crd.projectcalico.org" at the cluster scope
  A: This says the calico-node service account lacks permission to list the ipreservations resource, which points to an incorrect RBAC (role-based access control) configuration. Normally Calico does have this permission; when the error appears, the most likely cause is that an IP range unusable by Calico was configured. If the interface used to establish sessions is correct, and so is the NIC the addresses are assigned from, you can grant the permission as follows.
1. Define a new ClusterRole that allows listing the ipreservations resource.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: calico-ipreservations-role
rules:
  - apiGroups: ["crd.projectcalico.org"]
    resources: ["ipreservations"]
    verbs: ["list", "get", "watch"]
2. Create a ClusterRoleBinding that binds this ClusterRole to the calico-node service account.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: calico-ipreservations-binding
subjects:
  - kind: ServiceAccount
    name: calico-node
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: calico-ipreservations-role
  apiGroup: rbac.authorization.k8s.io
Then kubectl apply -f the two YAML files above.
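Assuming the two manifests above were saved as role.yaml and binding.yaml (the file names are placeholders), you can apply them and confirm the permission with kubectl's built-in access check:

```shell
kubectl apply -f role.yaml
kubectl apply -f binding.yaml

# Should print "yes" once the binding is active
kubectl auth can-i list ipreservations.crd.projectcalico.org \
  --as=system:serviceaccount:kube-system:calico-node
```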