K8s Study Notes: Environment Cleanup After kubeadm reset
0x00 Overview
This post records the problems I ran into after running kubeadm reset, while reinstalling the cluster and joining and managing nodes.
0x01 Cleanup After kubeadm reset
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
ipvsadm --clear
systemctl stop kubelet
systemctl stop docker
rm -rf /var/lib/cni/*
rm -rf /var/lib/kubelet/*
rm -rf /etc/cni/*
rm -rf $HOME/.kube/config
systemctl start docker
Flush iptables or IPVS depending on which proxy mode your cluster uses. If a later kubeadm join complains that certain directories still contain files, you can simply rm -rf the directories it names, as in the example below.
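As a concrete illustration (the paths and placeholders below are typical kubeadm leftovers, not an exhaustive list): a failed join reports preflight errors naming the offending files, which you can remove before retrying.

# Hypothetical example: a preflight error such as
#   [ERROR FileAvailable--etc-kubernetes-kubelet.conf]: /etc/kubernetes/kubelet.conf already exists
# points at leftover files; remove them and retry the join.
rm -rf /etc/kubernetes/kubelet.conf /etc/kubernetes/pki/ca.crt

# <endpoint>, <token>, and <hash> are placeholders for your cluster's values.
kubeadm join <endpoint> --token <token> --discovery-token-ca-cert-hash sha256:<hash>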
0x02 Problems Reinstalling the Cluster After kubeadm reset
2.1 Calico errors
After running kubeadm reset and starting the cluster reinstall, you may hit a lot of Calico-related errors, including but not limited to the following logs:
The logs below appeared after running kubectl apply -f calico.yaml.
Calico BGP error
Warning Unhealthy pod/calico-node-k6tz5 Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/bird/bird.ctl: connect: no such file or directory
Warning Unhealthy pod/calico-node-k6tz5 Liveness probe failed: calico/node is not ready: bird/confd is not live: exit status 1
Warning BackOff pod/calico-node-k6tz5 Back-off restarting failed container
One workaround is to point Calico at the correct network interface (see the linked reference), but the errors persisted even after changing the NIC; follow the steps in 2.2 instead (the root cause is kube-proxy). A sketch of the NIC workaround follows.
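For reference, a minimal sketch of the NIC workaround, assuming the stock calico-node DaemonSet in kube-system and that eth0 is the interface carrying node traffic (adjust both to your environment):

# Tell Calico which interface to use for IP autodetection; the value
# "interface=eth0" is an assumption about the node's NIC name.
kubectl -n kube-system set env daemonset/calico-node IP_AUTODETECTION_METHOD=interface=eth0

# The DaemonSet rolls its pods; wait for them to come back.
kubectl -n kube-system rollout status daemonset/calico-node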
dial tcp 10.96.0.1:443: connect: connection refused error
Hit error connecting to datastore - retry error=Get "https://10.96.0.1:443/api/v1/nodes/foo": dial tcp 10.96.0.1:443: connect: connection refused
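10.96.0.1 is the cluster-internal VIP of the default kubernetes Service, and it only works if kube-proxy has programmed the forwarding rules for it, which is why this error also traces back to 2.2. A quick check, assuming kube-proxy runs in iptables mode (chain layout differs under IPVS):

# The "kubernetes" Service in the default namespace should show 10.96.0.1.
kubectl get svc kubernetes

# In iptables mode, kube-proxy programs the VIP into the KUBE-SERVICES chain.
iptables -t nat -L KUBE-SERVICES -n | grep 10.96.0.1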
Calico liveness and readiness probe errors
Warning Unhealthy 69m (x2936 over 12d) kubelet Readiness probe errored: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Warning Unhealthy 57m (x2938 over 12d) kubelet Liveness probe errored: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Warning Unhealthy 12m (x6 over 13m) kubelet Liveness probe failed: container is not running
Normal SandboxChanged 11m (x2 over 13m) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Killing 11m (x2 over 13m) kubelet Stopping container calico-node
Warning Unhealthy 8m3s (x32 over 13m) kubelet Readiness probe failed: container is not running
Warning Unhealthy 4m45s (x6 over 5m35s) kubelet Liveness probe failed: container is not running
Normal SandboxChanged 3m42s (x2 over 5m42s) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Killing 3m42s (x2 over 5m42s) kubelet Stopping container calico-node
Warning Unhealthy 42s (x31 over 5m42s) kubelet Readiness probe failed: container is not running
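Events like these are visible via kubectl describe. A minimal sketch for collecting evidence (the pod name calico-node-k6tz5 is from my environment, and k8s-app=calico-node is the label used by the stock Calico manifest):

# Show the probe events for all calico-node pods.
kubectl -n kube-system describe pod -l k8s-app=calico-node

# Inspect the logs of the previously crashed container instance.
kubectl -n kube-system logs calico-node-k6tz5 -c calico-node --previous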
2.2 kube-proxy errors
The kube-proxy errors include Failed to retrieve node info: Unauthorized and Failed to list *v1.Endpoints: Unauthorized, among others. In detail:
W0430 12:33:28.887260 1 server_others.go:267] Flag proxy-mode="" unknown, assuming iptables proxy
W0430 12:33:28.913671 1 node.go:113] Failed to retrieve node info: Unauthorized
I0430 12:33:28.915780 1 server_others.go:147] Using iptables Proxier.
W0430 12:33:28.916065 1 proxier.go:314] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
W0430 12:33:28.916089 1 proxier.go:319] clusterCIDR not specified, unable to distinguish between internal and external traffic
I0430 12:33:28.917555 1 server.go:555] Version: v1.14.1
I0430 12:33:28.959345 1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0430 12:33:28.960392 1 config.go:202] Starting service config controller
I0430 12:33:28.960444 1 controller_utils.go:1027] Waiting for caches to sync for service config controller
I0430 12:33:28.960572 1 config.go:102] Starting endpoints config controller
I0430 12:33:28.960609 1 controller_utils.go:1027] Waiting for caches to sync for endpoints config controller
E0430 12:33:28.970720 1 event.go:191] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"fh-ubuntu01.159a40901fa85264", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"fh-ubuntu01", UID:"fh-ubuntu01", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kube-proxy.", Source:v1.EventSource{Component:"kube-proxy", Host:"fh-ubuntu01"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbf2a2e0639406264, ext:334442672, loc:(*time.Location)(0x2703080)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf2a2e0639406264, ext:334442672, loc:(*time.Location)(0x2703080)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Unauthorized' (will not retry!)
E0430 12:33:28.970939 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Endpoints: Unauthorized
E0430 12:33:28.971106 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: Unauthorized
E0430 12:33:29.977038 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Endpoints: Unauthorized
E0430 12:33:29.979890 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: Unauthorized
E0430 12:33:30.980098 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Endpoints: Unauthorized
Fix: delete kube-proxy's stale secret (see the linked reference).
After kubeadm reset, the kube-proxy pods are still using the secret generated by the previous installation; the root cause is that re-running kubeadm generates new certificates that no longer match the credentials stored in the cluster. Delete the stale secret (a new one is generated automatically), then delete the kube-proxy containers so they restart with it, as sketched below.
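A minimal sketch of those steps, assuming a kubeadm cluster where the token secret is named kube-proxy-token-<suffix> in kube-system (the suffix varies per cluster; list first, then delete):

# Find the stale kube-proxy token secret.
kubectl -n kube-system get secrets | grep kube-proxy

# Delete it; the token controller recreates it with credentials
# signed by the new CA. <suffix> is a placeholder.
kubectl -n kube-system delete secret kube-proxy-token-<suffix>

# Delete the kube-proxy pods so they restart and mount the new secret.
kubectl -n kube-system delete pod -l k8s-app=kube-proxy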
Once the kube-proxy problem is fixed, you will find that the Calico problems from 2.1 have resolved themselves; you can verify as below.
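To confirm the recovery (again assuming the stock k8s-app=calico-node label):

# All calico-node pods should eventually report READY 1/1.
kubectl -n kube-system get pods -l k8s-app=calico-node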
0x03 kube-apiserver Errors During kubeadm join
When joining the cluster, kube-apiserver reports an x509 certificate problem:
kube-apiserver[16692]: E0211 14:34:11.507411 16692 authentication.go:63] "Unable to authenticate the request" err="[x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kubernetes\")]"
Fix: delete the stale $HOME/.kube directory (see the linked reference), then regenerate it, as below.
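A sketch of the standard kubeadm steps for refreshing the admin kubeconfig on a control-plane node (these are the same commands kubeadm init prints when it finishes):

# Remove the kubeconfig left over from the previous cluster.
rm -rf $HOME/.kube

# Copy the freshly generated admin config and fix ownership.
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config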
0x04 Summary
The Calico problems in 2.1 are essentially caused by 2.2; deal with the cluster's stale secret (2.2) first, and the Calico errors in 2.1 will resolve themselves.
These fixes cover only one facet of the problem space; the notes above are for reference only.