Summary of Common k8s Problems
Problem 1: Pods on a node's container network segment are isolated
Symptoms:
1) After production node node_A rebooted, the init containers of several pod_1 pods (in the `pod` namespace) on it failed to initialize, reporting that the connection to svc/kubernetes (the apiserver service) timed out.
2) On that node: pods in the container network namespace cannot reach their own gateway (GW1), while the node network namespace can reach its gateway (GW2).
3) Pods in the container network namespace can ping their own node_A, but cannot ping other nodes.
4) The host has two NICs, ens192 and ens224; ens192 holds the node IP, and ens224 is attached to the br0 virtual bridge.
5) Network model: pod(eth0) -> br0 bridge [port1, port2, port3, ens224, ...] -> ens192
6) First verify the pod-to-host leg of the path:
nsenter -t $Pid -n
($Pid is the PID of the pause process of the failing pod's pause container.) Enter pod_1's network namespace and ping node_A.
Capture results: no relevant packets were seen on ens224; on br0 only outgoing packets were captured, with no replies coming back. Conclusion: something is wrong between br0 and ens224.
7) Tried: brctl delif br0 ens224 / brctl addif br0 ens224
Did not help.
8) Restarted the interfaces: ip link set dev [br0|ens224] [down|up]
Bouncing br0 first and then ens224 did not help either.
9) echo 0 > /proc/sys/net/bridge/bridge-nf-call-iptables
Connectivity recovered once iptables filtering of bridged traffic was bypassed, so this was judged to be a firewall-rule problem; the setting was then restored with echo 1 > ...
10) iptables -t [raw|filter|nat] -S
The filter table had the policy -P FORWARD DROP; changing it with iptables -P FORWARD ACCEPT resolved the issue.
11) kube-proxy here runs in iptables mode; the simplest way to confirm the proxy mode is to inspect with the iptables or ipvsadm commands.
12) Common troubleshooting commands:
iptables -t [filter|raw|nat] -S | grep br0          # list firewall rules
ifconfig [br0|ens224|ens192]                        # interface information
tc qdisc show                                       # traffic-control (qdisc) rules
ip rule                                             # policy routing rules maintained by the system
sysctl -a | grep rp_filter                          # reverse-path validation of received packets: 0 off, 1 strict (source and destination must use the best route), 2 loose (source merely reachable)
sysctl -a | grep ip_forward                         # confirm whether IP forwarding is enabled
ip rule && ip route show table 10000                # inspect policy routing tables
cat /proc/sys/net/bridge/bridge-nf-call-iptables    # whether iptables processes bridged traffic
brctl show br0                                      # show the virtual bridge and its ports
ebtables -t broute -L                               # list ebtables BROUTING rules
13) Root cause: Docker automatically sets the default policy of the iptables FORWARD chain to DROP (https://docs.docker.com/network/iptables/).
Fix: add ExecStartPost=/usr/bin/iptables -P FORWARD ACCEPT to the dockerd systemd unit, as sketched below.
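A minimal sketch of persisting that fix via a systemd drop-in (the drop-in file name is arbitrary; adjust if your docker unit name differs):

```bash
# Drop-in that re-applies the ACCEPT policy every time dockerd starts.
mkdir -p /etc/systemd/system/docker.service.d
cat > /etc/systemd/system/docker.service.d/10-forward-accept.conf <<'EOF'
[Service]
ExecStartPost=/usr/bin/iptables -P FORWARD ACCEPT
EOF
systemctl daemon-reload
systemctl restart docker
iptables -t filter -S | grep -- '-P FORWARD'   # should now print "-P FORWARD ACCEPT"
```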
Problem 2: kubectl get nodes shows all nodes NotReady
Being able to run kubectl at all means the certificate kubectl uses to reach the apiserver has not expired.
grep data ~/.kube/config | cut -d':' -f2 | head -1 | sed 's/ //g' | base64 -d | openssl x509 -text -noout | grep Validity -A5
Check the expiry of the kubectl certificate.
Back up /etc/kubernetes/kubelet.kubeconfig
Replace it: cp ~/.kube/config /etc/kubernetes/kubelet.kubeconfig
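The original notes stop at replacing the file; as an assumption not stated above, kubelet usually needs a restart to pick up the new kubeconfig:

```bash
systemctl restart kubelet    # assumption: kubelet re-reads the kubeconfig on restart
kubectl get nodes            # nodes should return to Ready shortly afterwards
```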
Problem 3: In flannel mode, the cni0 and flannel.1 subnets do not match
The node was not cleaned up properly and has stale state left over. Fix: ip link delete cni0 (see the diagnostic sketch below).
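A hedged diagnostic sketch for confirming the mismatch before deleting the bridge (assumes flannel's default subnet file path):

```bash
ip -4 addr show cni0            # current cni0 address/subnet
cat /run/flannel/subnet.env     # FLANNEL_SUBNET assigned to this node (default flannel path)
# If the two subnets differ, remove the stale bridge so the CNI plugin recreates it:
ip link delete cni0
```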
Problem 4: kube-scheduler / kube-controller-manager pods fail with CreateContainerError
kubelet, master1 Error: Error response from daemon: Conflict. The container name "/k8s_kube-scheduler_kube-scheduler-master1_kube-system_ca2aa1b3224c37fa1791ef6c7d883bbe_3" is already in use by container "7a70a1ef544ff30884d54b1b1ca13ea9f648891d740bd16d9e2b3671f8ccc580". You have to remove (or rename) that container to be able to reuse that name.
docker ps -a | grep Exit | cut -d' ' -f1 | xargs docker rm   # remove the failed (exited) containers
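A slightly safer variant of the same cleanup, using docker's status filter instead of grepping the table output (a sketch, not from the original notes):

```bash
docker ps -aq -f status=exited | xargs -r docker rm   # remove only containers in the Exited state
```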
Problem 5: Checking certificate expiration
Certificate renewal:
Simulate certificate expiry: simply change the system time.
Check certificate expiry: kubeadm alpha certs check-expiration
Renew: kubeadm alpha phase kubeconfig all --config cluster.yaml; kubeadm alpha certs renew all --config=/root/cluster.yaml
Check certificate expiry manually:
```bash
for i in `ls /etc/kubernetes/pki/*.crt`; do echo $i && openssl x509 -in $i -noout -text | grep "Not After"; done
for i in `ls /etc/kubernetes/pki/*.pem`; do echo $i && openssl x509 -in $i -noout -text | grep "Not After"; done
```
Problem 6: kubectl get nodes shows all nodes NotReady — kubelet certificate expired
Being able to run kubectl at all means the certificate kubectl uses to reach the apiserver has not expired (same check as Problem 2):
grep data ~/.kube/config | cut -d':' -f2 | head -1 | sed 's/ //g' | base64 -d | openssl x509 -text -noout | grep -i validity -A10
Check the expiry of the kubectl certificate.
Back up /etc/kubernetes/kubelet.kubeconfig
Replace it: cp ~/.kube/config /etc/kubernetes/kubelet.kubeconfig
Problem 7: A node is abnormal and cannot be logged into
1. Symptoms: 1) the node is NotReady; 2) the pods on the node still serve traffic normally and can be logged into; 3) ssh login to the node fails with a pam failure.
2. According to man pam_systemd, if systemd is configured in PAM (/etc/pam.d/password-auth), the login process depends on the systemd-logind service for user session management.
The /var/log/secure log shows that the logind process was exiting because of an error:
Error 1: Failed to release session: Interrupted system call
Error 2: pam_env(crond:setcred): Unable to open config file /etc/security/pam_env.conf
Error 3: login: Couldn't initialize PAM: Critical error - immediate abort ;
PAM _pam_init_handlers: could not open /etc/pam.conf
Error 4:
systemd-logind[2210]: Failed to abandon session scope: No buffer space available
systemd-logind[2210]: Assertion 's->user->slice' failed at src/login/logind-session.c:510, function session_start_scope(). Aborting.
3. A kernel core file generated later shows that the logind process had been a zombie for a long time and had not been reaped by the systemd process:
crash> ps | grep login
2210 1 10 ffff883f1f02d710 ZO 0.0 0 0 systemd-logind
Analyzing the timestamps, the last time the logind process ran is close to the time it exited (i.e. when the node went down). Therefore, the key to fixing the login problem is to make the login path independent of the logind process.
4. It is worth noting that cron jobs were also blocked here (kernel core analysis showed 3068 leftover crond processes), because cron jobs also rely on logind to manage their sessions. The /var/log/secure entry "crond[115446]: pam_systemd(crond:session): Failed to create session: Message did not receive a reply (timeout by message bus)" confirms this as well.
5. Fix: the systemd version shipped with this OS is exactly the release that first introduced logind and has various problems; in many of our environments we already disable logind with the following commands.
# sudo sed -i -e '/^[^#]*pam_systemd.so/ s/^/#&/g' /etc/pam.d/* --follow-symlinks
# sudo systemctl mask systemd-logind
# sudo systemctl stop systemd-logind
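A quick verification sketch after applying the three commands above (my assumption about what to check, not part of the original notes):

```bash
grep -rn pam_systemd /etc/pam.d/        # any remaining pam_systemd lines should be commented out
systemctl is-enabled systemd-logind     # expected: masked
systemctl is-active systemd-logind      # expected: inactive
```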
Problem 8: namespace stuck in Terminating
Causes: 1) resources in the namespace have not been deleted yet; 2) there is an APIService that is not Available.
kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get --show-kind --ignore-not-found -n {namespace}
Lists the leftover resources in the namespace.
kubectl get APIService
Check for APIServices that are not Available.
If the problem persists after the leftover resources have been cleaned up, proceed as follows:
kubectl get namespace $ns -o json > tmp.json
Then remove the spec.finalizers entries from tmp.json.
Start a local proxy with kubectl proxy so the apiserver API can be reached without authentication.
curl -k -H "Content-Type: application/json" -X PUT --data-binary @tmp.json http://127.0.0.1:8001/api/v1/namespaces/$ns/finalize
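An end-to-end sketch of the finalize workaround above (assumes jq is installed and $ns is set; the proxy port matches the URL in the curl command):

```bash
kubectl get namespace "$ns" -o json | jq '.spec.finalizers = []' > tmp.json   # strip spec.finalizers
kubectl proxy --port=8001 &                                                   # unauthenticated local access to the apiserver
curl -k -H "Content-Type: application/json" -X PUT --data-binary @tmp.json \
  "http://127.0.0.1:8001/api/v1/namespaces/$ns/finalize"
```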
Problem 9: One of the three etcd nodes is down
etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --endpoints=https://$ip:2379,https://$ip:2379,https://$ip:2379 endpoint health -w table
One node reports: context deadline exceeded
and its HEALTH column shows false.
1) Stop the failed etcd container
2) Back up the data and the config file
3) Edit the config: sed -i s/new/existing/g /etc/kubernetes/manifests/etcd.yaml (changes --initial-cluster-state from new to existing)
4) Restart etcd
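After the restart, a hedged check that the member rejoined and the cluster is healthy again (same certificate flags as the command above):

```bash
etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/server.crt \
        --key=/etc/kubernetes/pki/etcd/server.key \
        --endpoints=https://$ip:2379 member list -w table       # all members should be "started"
etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/server.crt \
        --key=/etc/kubernetes/pki/etcd/server.key \
        --endpoints=https://$ip:2379 endpoint health -w table    # HEALTH should be true on every endpoint
```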
Problem 10: kubelet reports container runtime is down PLEG is not healthy
1) Clean up containers that are not in a running state (see the sketch below)
2) Restart kubelet
3) Restart docker
4) If that is not enough, delete /var/lib/kubelet/pods/ and /var/lib/docker, then restart docker
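A hedged sketch of steps 1–3 (verify in a test environment first; `docker container prune` removes every stopped container on the node):

```bash
docker container prune -f     # step 1: remove all non-running containers
systemctl restart kubelet     # step 2
systemctl restart docker      # step 3; re-check `kubectl get nodes` and the kubelet logs afterwards
```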
Problem 11: local PV
https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner
https://github.com/rancher/local-path-provisioner#deployment
Notes on expanding a PVC:
- 1) Approach 1: delete the StatefulSet and re-apply it; however, the volume data is lost this way.
- 2) Approach 2: directly edit `sts.spec.volumeClaimTemplate.spec.resources.requests.storage` [not viable; the field is immutable].
- 3) Approach 3: modify `pvc.spec.resources.requests.storage` on the PVC bound to the pod; after the change, `kubectl describe` the PVC and check its events for a success message (see the sketch after this list).
- Note: verify inside the pod with `df -h`. If the PVC resize succeeded but `df` inside the pod still shows the old size, try editing the PVC again to grow it further and watch the events in `describe`. Also check that the storageclass backend has enough capacity; in an LVM scenario, run `pvs` to confirm.
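A minimal sketch of approach 3; the PVC/namespace names and the target size are placeholders:

```bash
kubectl patch pvc data-my-sts-0 -n my-ns \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'   # grow the bound PVC
kubectl describe pvc data-my-sts-0 -n my-ns                     # watch the events for resize success
kubectl exec -it my-sts-0 -n my-ns -- df -h                     # verify the new size inside the pod
```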
Problem 12: The master fails to start after the cluster is rebooted
1. After the master rebooted, the flannel pod would not start and the other pods had already been rescheduled; errors:
I1220 03:59:32.958363 1 main.go:550] Defaulting external address to interface address (172.19.20.215)
W1220 03:59:33.064856 1 client_config.go:608] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
W1220 03:59:33.162450 1 client_config.go:613] error creating inClusterConfig, falling back to default config: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
E1220 03:59:33.247127 1 main.go:251] Failed to create SubnetManager: fail to create kubernetes config: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable
I1220 03:59:43.154375 1 main.go:533] Using interface with name eth0 and address 172.19.20.215
I1220 03:59:43.154441 1 main.go:550] Defaulting external address to interface address (172.19.20.215)
W1220 03:59:43.154470 1 client_config.go:608] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
W1220 03:59:43.154504 1 client_config.go:613] error creating inClusterConfig, falling back to default config: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
E1220 03:59:43.154710 1 main.go:251] Failed to create SubnetManager: fail to create kubernetes config: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable
Resolved after running kubectl apply -f on all the yaml files under https://github.com/flannel-io/flannel/tree/master/Documentation/k8s-manifests/.
2. The kubelet service is stopped
[root@master1 790948983fd73a7372746b3d3fa3c1c2608c077900c9bffe9dff1aae22761dcc]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: activating (auto-restart) (Result: exit-code) since Mon 2021-12-20 12:24:44 CST; 5s ago
Docs: https://kubernetes.io/docs/
Process: 8348 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS $KUBE_ALLOW_PRIV (code=exited, status=255)
Main PID: 8348 (code=exited, status=255)
Dec 20 12:24:44 master1 kubelet[8348]: --tls-cipher-suites strings Comma-separated list of...ECDSA_WITH_A
Dec 20 12:24:44 master1 kubelet[8348]: --tls-min-version string Minimum TLS version supported. Poss...
Dec 20 12:24:44 master1 kubelet[8348]: --tls-private-key-file string File containing x509 private key ma...
Dec 20 12:24:44 master1 kubelet[8348]: --topology-manager-policy string Topology Manager policy to use. Pos...
Dec 20 12:24:44 master1 kubelet[8348]: -v, --v Level number for the log level verbosity
Dec 20 12:24:44 master1 kubelet[8348]: --version version[=true] Print version information and quit
Dec 20 12:24:44 master1 kubelet[8348]: --vmodule moduleSpec comma-separated list of...ered logging
Dec 20 12:24:44 master1 kubelet[8348]: --volume-plugin-dir string The full path of the di...lume/exec/")
Dec 20 12:24:44 master1 kubelet[8348]: --volume-stats-agg-period duration Specifies interval for kubelet to c...
Dec 20 12:24:44 master1 kubelet[8348]: F1220 12:24:44.378643 8348 server.go:157] unknown flag: --allow-privileged
Hint: Some lines were ellipsized, use -l to show in full.
cat /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf
Environment="KUBE_ALLOW_PRIV=--allow-privileged=false"   # comment out this line to fix it; this kubelet no longer recognizes --allow-privileged (see the "unknown flag" error above)
Problem 13: Viewing a container's docker run command and an image's Dockerfile
runlike   # pip install runlike
rekcod    # npm i -g rekcod
whaler: reconstructs a Dockerfile from an image
Or:
docker pull nexdrew/rekcod
alias rekcod="docker run --rm -v /var/run/docker.sock:/var/run/docker.sock nexdrew/rekcod"
alias whaler="docker run -t --rm -v /var/run/docker.sock:/var/run/docker.sock:ro pegleg/whaler"
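Usage sketch for the two aliases above (the container and image names are placeholders):

```bash
rekcod my-container       # prints the `docker run ...` command that would recreate my-container
whaler my-image:latest    # reconstructs an approximate Dockerfile from the image layers
```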