K8s Study Notes: Environment Cleanup After kubeadm reset
0x00 Overview
This post records the problems I ran into after running kubeadm reset, while reinstalling the cluster and joining and managing nodes.
0x01 Cleanup After kubeadm reset
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
ipvsadm --clear
systemctl stop kubelet
systemctl stop docker
rm -rf /var/lib/cni/*
rm -rf /var/lib/kubelet/*
rm -rf /etc/cni/*
rm -rf $HOME/.kube/config
systemctl start docker
Flush iptables or IPVS depending on which proxy mode your cluster uses. If a later kubeadm join complains that certain directories still contain files, you can simply rm -rf the directories it names, as in the example below.
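As a concrete illustration (the paths and placeholders below are typical kubeadm leftovers, not an exhaustive list): a failed join reports preflight errors naming the offending files, which you can remove before retrying.

# Hypothetical example: a preflight error such as
#   [ERROR FileAvailable--etc-kubernetes-kubelet.conf]: /etc/kubernetes/kubelet.conf already exists
# points at leftover files; remove them and retry the join.
rm -rf /etc/kubernetes/kubelet.conf /etc/kubernetes/pki/ca.crt

# <endpoint>, <token>, and <hash> are placeholders for your cluster's values.
kubeadm join <endpoint> --token <token> --discovery-token-ca-cert-hash sha256:<hash>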
0x02 Problems Reinstalling the Cluster After kubeadm reset
2.1 Calico errors
After running kubeadm reset and starting the cluster reinstall, you may hit a lot of Calico-related errors, including but not limited to the following logs:
The logs below appeared after running kubectl apply -f calico.yaml.
Calico BGP error
Warning Unhealthy pod/calico-node-k6tz5 Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/bird/bird.ctl: connect: no such file or directory
Warning Unhealthy pod/calico-node-k6tz5 Liveness probe failed: calico/node is not ready: bird/confd is not live: exit status 1
Warning BackOff pod/calico-node-k6tz5 Back-off restarting failed container
One workaround is to point Calico at the correct network interface (see the linked reference), but the errors persisted even after changing the NIC; follow the steps in 2.2 instead (the root cause is kube-proxy). A sketch of the NIC workaround follows.
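For reference, a minimal sketch of the NIC workaround, assuming the stock calico-node DaemonSet in kube-system and that eth0 is the interface carrying node traffic (adjust both to your environment):

# Tell Calico which interface to use for IP autodetection; the value
# "interface=eth0" is an assumption about the node's NIC name.
kubectl -n kube-system set env daemonset/calico-node IP_AUTODETECTION_METHOD=interface=eth0

# The DaemonSet rolls its pods; wait for them to come back.
kubectl -n kube-system rollout status daemonset/calico-node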
dial tcp 10.96.0.1:443: connect: connection refused error
Hit error connecting to datastore - retry error=Get "https://10.96.0.1:443/api/v1/nodes/foo": dial tcp 10.96.0.1:443: connect: connection refused
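10.96.0.1 is the cluster-internal VIP of the default kubernetes Service, and it only works if kube-proxy has programmed the forwarding rules for it, which is why this error also traces back to 2.2. A quick check, assuming kube-proxy runs in iptables mode (chain layout differs under IPVS):

# The "kubernetes" Service in the default namespace should show 10.96.0.1.
kubectl get svc kubernetes

# In iptables mode, kube-proxy programs the VIP into the KUBE-SERVICES chain.
iptables -t nat -L KUBE-SERVICES -n | grep 10.96.0.1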
Calico liveness and readiness probe errors
Warning Unhealthy 69m (x2936 over 12d) kubelet Readiness probe errored: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Warning Unhealthy 57m (x2938 over 12d) kubelet Liveness probe errored: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Warning Unhealthy 12m (x6 over 13m) kubelet Liveness probe failed: container is not running
Normal SandboxChanged 11m (x2 over 13m) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Killing 11m (x2 over 13m) kubelet Stopping container calico-node
Warning Unhealthy 8m3s (x32 over 13m) kubelet Readiness probe failed: container is not running
Warning Unhealthy 4m45s (x6 over 5m35s) kubelet Liveness probe failed: container is not running
Normal SandboxChanged 3m42s (x2 over 5m42s) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Killing 3m42s (x2 over 5m42s) kubelet Stopping container calico-node
Warning Unhealthy 42s (x31 over 5m42s) kubelet Readiness probe failed: container is not running
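Events like these are visible via kubectl describe. A minimal sketch for collecting evidence (the pod name calico-node-k6tz5 is from my environment, and k8s-app=calico-node is the label used by the stock Calico manifest):

# Show the probe events for all calico-node pods.
kubectl -n kube-system describe pod -l k8s-app=calico-node

# Inspect the logs of the previously crashed container instance.
kubectl -n kube-system logs calico-node-k6tz5 -c calico-node --previous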
2.2 kube-proxy errors
The kube-proxy errors include Failed to retrieve node info: Unauthorized and Failed to list *v1.Endpoints: Unauthorized, among others. In detail:
W0430 12:33:28.887260 1 server_others.go:267] Flag proxy-mode="" unknown, assuming iptables proxy
W0430 12:33:28.913671 1 node.go:113] Failed to retrieve node info: Unauthorized
I0430 12:33:28.915780 1 server_others.go:147] Using iptables Proxier.
W0430 12:33:28.916065 1 proxier.go:314] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
W0430 12:33:28.916089 1 proxier.go:319] clusterCIDR not specified, unable to distinguish between internal and external traffic
I0430 12:33:28.917555 1 server.go:555] Version: v1.14.1
I0430 12:33:28.959345 1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0430 12:33:28.960392 1 config.go:202] Starting service config controller
I0430 12:33:28.960444 1 controller_utils.go:1027] Waiting for caches to sync for service config controller
I0430 12:33:28.960572 1 config.go:102] Starting endpoints config controller
I0430 12:33:28.960609 1 controller_utils.go:1027] Waiting for caches to sync for endpoints config controller
E0430 12:33:28.970720 1 event.go:191] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"fh-ubuntu01.159a40901fa85264", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"fh-ubuntu01", UID:"fh-ubuntu01", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kube-proxy.", Source:v1.EventSource{Component:"kube-proxy", Host:"fh-ubuntu01"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbf2a2e0639406264, ext:334442672, loc:(*time.Location)(0x2703080)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf2a2e0639406264, ext:334442672, loc:(*time.Location)(0x2703080)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Unauthorized' (will not retry!)
E0430 12:33:28.970939 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Endpoints: Unauthorized
E0430 12:33:28.971106 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: Unauthorized
E0430 12:33:29.977038 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Endpoints: Unauthorized
E0430 12:33:29.979890 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: Unauthorized
E0430 12:33:30.980098 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Endpoints: Unauthorized
Fix: delete kube-proxy's stale secret (see the linked reference).
After kubeadm reset, the kube-proxy pods are still using the secret generated by the previous installation; the root cause is that re-running kubeadm generates new certificates that no longer match the credentials stored in the cluster. Delete the stale secret (a new one is generated automatically), then delete the kube-proxy containers so they restart with it, as sketched below.
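A minimal sketch of those steps, assuming a kubeadm cluster where the token secret is named kube-proxy-token-<suffix> in kube-system (the suffix varies per cluster; list first, then delete):

# Find the stale kube-proxy token secret.
kubectl -n kube-system get secrets | grep kube-proxy

# Delete it; the token controller recreates it with credentials
# signed by the new CA. <suffix> is a placeholder.
kubectl -n kube-system delete secret kube-proxy-token-<suffix>

# Delete the kube-proxy pods so they restart and mount the new secret.
kubectl -n kube-system delete pod -l k8s-app=kube-proxy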
Once the kube-proxy problem is fixed, you will find that the Calico problems from 2.1 have resolved themselves; you can verify as below.
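To confirm the recovery (again assuming the stock k8s-app=calico-node label):

# All calico-node pods should eventually report READY 1/1.
kubectl -n kube-system get pods -l k8s-app=calico-node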
0x03 kube-apiserver Errors During kubeadm join
When joining the cluster, kube-apiserver reports an x509 certificate problem:
kube-apiserver[16692]: E0211 14:34:11.507411 16692 authentication.go:63] "Unable to authenticate the request" err="[x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kubernetes\")]"
Fix: delete the stale $HOME/.kube directory (see the linked reference), then regenerate it, as below.
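A sketch of the standard kubeadm steps for refreshing the admin kubeconfig on a control-plane node (these are the same commands kubeadm init prints when it finishes):

# Remove the kubeconfig left over from the previous cluster.
rm -rf $HOME/.kube

# Copy the freshly generated admin config and fix ownership.
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config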
0x04 Summary
The Calico problems in 2.1 are essentially caused by 2.2; deal with the cluster's stale secret (2.2) first, and the Calico errors in 2.1 will resolve themselves.
These fixes cover only one facet of the problem space; the notes above are for reference only.