A Roundup of Problems Encountered in Kubernetes Operations
1. Volume permission problems cause a pod to run abnormally
# Debug: add a command field so the container stays up, then exec into it and check which uid the application runs as
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - sleep 500000
# Use an initContainer to change the directory permissions
spec:
  initContainers:
  - command:
    - /bin/sh
    - -c
    - chmod 777 /prometheus
    image: busybox
    imagePullPolicy: IfNotPresent
    name: volume-permissions
    securityContext:
      runAsUser: 0
    volumeMounts:
    - mountPath: /prometheus
      name: prometheus-data
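2. A lost+found directory created by default in the mounted volume makes database initialization fail (initialization requires an empty data directory)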
Initializing database
2023-04-12T08:11:26.631401Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2023-04-12T08:11:26.636640Z 0 [ERROR] --initialize specified but the data directory has files in it. Aborting.
2023-04-12T08:11:26.636700Z 0 [ERROR] Aborting
# Debug: add a command field so the container stays up, then exec into it and delete the lost+found directory
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - sleep 500000
# Inside the container, delete lost+found/
mysql@flashcatcloud-nightingale-database-0:/$ cd /var/lib/mysql
mysql@flashcatcloud-nightingale-database-0:/var/lib/mysql$ ls
lost+found
mysql@flashcatcloud-nightingale-database-0:/var/lib/mysql$ rm -r lost+found/
mysql@flashcatcloud-nightingale-database-0:/var/lib/mysql$ ls
mysql@flashcatcloud-nightingale-database-0:/var/lib/mysql$
# Alternatively, delete the lost+found directory with an initContainer.
# Init containers run on every pod start, so remove only lost+found;
# "rm -rf /var/lib/mysql/*" would wipe the database on each restart.
spec:
  initContainers:
  - command:
    - /bin/sh
    - -c
    - rm -rf /var/lib/mysql/lost+found
    image: busybox
    imagePullPolicy: IfNotPresent
    name: volume-permissions
    resources: {}
    securityContext:
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/mysql/
      name: database-data
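The cleanup step can be limited to the lost+found entry so that existing database files are never touched. The following is a minimal shell sketch of that idea; the function name remove_lost_found is hypothetical, and the data directory is passed in as an argument rather than hard-coded.

```shell
#!/bin/sh
# Remove only the filesystem-created lost+found directory from a data
# directory, leaving every other file in place.
# The function name is illustrative, not part of any tool.
remove_lost_found() {
    data_dir="$1"
    if [ -d "$data_dir/lost+found" ]; then
        rm -rf -- "$data_dir/lost+found"
    fi
}
```

In an initContainer this would be called as remove_lost_found /var/lib/mysql; calling it again when lost+found is already gone does nothing.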
3. A pod stays stuck in the Terminating state
# Check the kubelet log on the node where the pod runs:
failed to "KillPodSandbox" for "a594f4a1-c67b-42c5-84ea-62f7fb1e386d" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to check network namespace closed: remove netns: unlinkat /var/run/netns/cni-b70f6268-4fed-8c40-73f4-2e0ad0d325f4: device or resource busy"
# Fix (a value written under /proc does not persist across reboots)
echo 1 > /proc/sys/fs/may_detach_mounts
# sysctl configuration for a production Kubernetes cluster, in pure shell
https://www.boysec.cn/boy/f0530e00.html
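A value written to /proc/sys is lost at reboot, so the setting is usually persisted as a sysctl drop-in file as well. The sketch below shows one way to do that; the function name and the file name 99-may-detach-mounts.conf are assumptions chosen for illustration, and on kernels that do not expose fs.may_detach_mounts the setting does not apply.

```shell
#!/bin/sh
# Write a sysctl drop-in that keeps fs.may_detach_mounts enabled after
# a reboot. The target directory defaults to /etc/sysctl.d; both the
# function name and the file name are illustrative choices.
persist_may_detach_mounts() {
    conf_dir="${1:-/etc/sysctl.d}"
    printf 'fs.may_detach_mounts = 1\n' > "$conf_dir/99-may-detach-mounts.conf"
}
# After writing the file as root, reload all sysctl settings with:
#   sysctl --system
```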
4. Force-delete a namespace stuck in the Terminating state
# Method 1
kubectl get namespace <Terminating_Namespace> -o json | tr -d "\n" | sed "s/\"finalizers\": \[[^]]\+\]/\"finalizers\": []/" | kubectl replace --raw /api/v1/namespaces/<Terminating_Namespace>/finalize -f -
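# Method 2:
kubectl get namespace <Terminating_Namespace> -o json > Terminating_Namespace.json
# Edit Terminating_Namespace.json and remove the entries under the finalizers field
# Start the API proxy (it stays in the foreground, so run curl from a second terminal)
kubectl proxy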
curl -k -H "Content-Type: application/json" -X PUT --data-binary @Terminating_Namespace.json http://127.0.0.1:8001/api/v1/namespaces/<Terminating_Namespace>/finalize
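Both methods come down to emptying the finalizers array in the namespace manifest before sending it to the finalize endpoint. The text transformation can be tried on its own, without a cluster, using the sketch below; the function name strip_finalizers is hypothetical, and it uses sed -E in place of the GNU-specific \+ form.

```shell
#!/bin/sh
# Read a namespace manifest (JSON) on stdin and print it with every
# "finalizers" array emptied. Newlines are removed first so the array
# still matches when kubectl pretty-prints it across several lines.
# The function name is illustrative.
strip_finalizers() {
    tr -d '\n' | sed -E 's/"finalizers": *\[[^]]*\]/"finalizers": []/g'
}
```

It slots into the Method 1 pipeline as: kubectl get namespace <Terminating_Namespace> -o json | strip_finalizers | kubectl replace --raw /api/v1/namespaces/<Terminating_Namespace>/finalize -f -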
5. Force-delete a PV/PVC
[root@master01 ~]# kubectl patch pvc xxxxxx -p '{"metadata":{"finalizers":null}}' -n yyyyyy
[root@master01 ~]# kubectl patch pv xxxxxxxxxxxxxxxx -p '{"metadata":{"finalizers":null}}'
6. Certificate trust errors when pulling images from a private registry
x509: certificate signed by unknown authority
# 1. Container runtime is Docker
# Note: "data-root" replaces the deprecated "graph" key (removed in Docker 23.0),
# and "insecure-registries" entries are host[:port] values without a scheme.
cat >/etc/docker/daemon.json <<EOF
{
  "data-root": "/var/lib/docker",
  "registry-mirrors": ["https://registry.cn-hangzhou.aliyuncs.com", "https://harbor.example.com"],
  "insecure-registries": ["harbor.example.com"],
  "live-restore": true,
  "exec-opts": ["native.cgroupdriver=systemd"],
  "storage-driver": "overlay2",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "500m",
    "max-file": "3"
  }
}
EOF
systemctl restart docker.service
systemctl status docker.service
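# 2. Container runtime is containerd
# Note: hosts.toml is only read when config_path = "/etc/containerd/certs.d" is set
# under [plugins."io.containerd.grpc.v1.cri".registry] in /etc/containerd/config.toml.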
mkdir -p /etc/containerd/certs.d/harbor.example.com/
cat >/etc/containerd/certs.d/harbor.example.com/hosts.toml <<EOF
[host."https://harbor.example.com"]
capabilities = ["pull", "resolve", "push"]
skip_verify = true
EOF
cat >>/etc/containerd/config.toml <<EOF
[plugins."io.containerd.grpc.v1.cri".registry.configs."harbor.example.com".auth]
username = "admin"
password = "Harbor12345"
EOF
systemctl restart containerd.service
systemctl status containerd.service
7. Docker daemon proxy configuration
Method 1: edit /etc/docker/daemon.json
{
  "proxies": {
    "http-proxy": "http://proxy.example.com:3128",
    "https-proxy": "https://proxy.example.com:3129",
    "no-proxy": "*.test.example.com,.example.org,127.0.0.0/8"
  }
}
systemctl restart docker
Method 2: edit /usr/lib/systemd/system/docker.service
[Service]
Environment="HTTP_PROXY=http://proxy.example.com:3128"
Environment="HTTPS_PROXY=https://proxy.example.com:3129"
Environment="NO_PROXY=localhost,127.0.0.1,docker-registry.example.com,.corp"
systemctl daemon-reload
systemctl restart docker
systemctl show --property=Environment docker
Docker client proxy: applies to the docker run and docker build stages
Docker daemon proxy: applies to the docker push and docker pull stages
8. Podman configuration
Alias:
echo "alias docker=podman" >> ~/.bashrc
source ~/.bashrc
Global registry configuration: /etc/containers/registries.conf
Per-user registry configuration: ~/.config/containers/registries.conf
unqualified-search-registries = ["docker.io", "registry.access.redhat.com"]
[[registry]]
prefix = "docker.io"
location = "docker.io"
[[registry.mirror]]
location = "docker.mirrors.ustc.edu.cn"
[[registry.mirror]]
location = "registry.docker-cn.com"
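9. Certificate trust errors with a private Helm chart repository
insecureSkipTLSVerify is an option that makes Helm skip TLS verification when installing a chart (the equivalent Helm CLI flag is --insecure-skip-tls-verify). TLS verification can fail if the chart repository uses a self-signed certificate, or if a proxy or similar security appliance sits in the network path. In that case, setting insecureSkipTLSVerify: true skips TLS verification so that Helm can install the chart.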
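10. The "etcd took too long" error
etcd performance depends on cluster size, hardware, network latency and other factors. When etcd cannot respond in time, requests time out or are cancelled, and an "etcd took too long" message is reported. The best practice is to move the etcd data onto high-performance storage, such as SSDs.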
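Before migrating, it can help to get a rough idea of how quickly the current disk completes synchronous writes, since etcd syncs its write-ahead log to disk. The sketch below is one crude way to check, assuming GNU dd is available; the function name sync_write_check, the 512-byte block size and the default count of 1000 are illustrative choices, and tools such as fio give finer-grained latency figures.

```shell
#!/bin/sh
# Write small blocks with a data sync after each one and print dd's
# summary line. A low throughput here suggests slow synchronous writes
# on the disk that holds the given directory.
# Arguments: target directory, optional block count (default 1000).
# Requires GNU dd for oflag=dsync; the function name is illustrative.
sync_write_check() {
    target_dir="$1"
    count="${2:-1000}"
    test_file="$target_dir/.sync-write-test.$$"
    dd if=/dev/zero of="$test_file" bs=512 count="$count" oflag=dsync 2>&1
    status=$?
    rm -f -- "$test_file"
    return "$status"
}
```

Run it against the etcd data directory and, for comparison, against a directory on the candidate disk; the temporary file is removed afterwards.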
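11. Prerequisites for expanding an already-mounted PV online in Kubernetes
- The Kubernetes version must be 1.11 or later.
- The PV's StorageClass must support expansion, i.e. its allowVolumeExpansion field is true.
- The PV's reclaim policy (reclaimPolicy) must be Retain or Delete, not Recycle.
- The PV's access mode (accessMode) must be ReadWriteOnce or ReadWriteMany, not ReadOnlyMany.
- The underlying storage used by the PV and PVC (e.g. Ceph, NFS) must support online expansion, i.e. without unmounting or restarting.
- The container runtime (e.g. docker or containerd) must support online expansion.
- The PV's file system must support online expansion, e.g. ext4 or xfs.
- If the underlying storage is Ceph, the Ceph cluster must be Nautilus or later.
- If the storage driver cannot expand a volume while it is in use, the pods bound to the PV/PVC must first be scaled down to 0 before the PVC's requested storage (resources.requests.storage) is changed, and then scaled back up.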
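Once the prerequisites hold, the expansion itself is requested by raising resources.requests.storage on the PVC. The sketch below builds the patch body for that change; the helper name pvc_resize_patch is hypothetical, and the PVC name, namespace and size in the usage comments are placeholders.

```shell
#!/bin/sh
# Print the JSON patch body that sets a PVC's requested storage size.
# The function name is illustrative.
pvc_resize_patch() {
    printf '{"spec":{"resources":{"requests":{"storage":"%s"}}}}\n' "$1"
}
# Usage (placeholders in angle brackets):
#   kubectl get storageclass <storageclass-name> -o jsonpath='{.allowVolumeExpansion}'
#   kubectl patch pvc <pvc-name> -n <namespace> -p "$(pvc_resize_patch 20Gi)"
```

The first command checks that the StorageClass allows expansion; the second applies the new size, which can only be larger than the current one.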
12. library initialization failed - unable to allocate file descriptor table - out of memory
Fix: add the default-ulimits parameter to the Docker configuration file
"default-ulimits": {
  "nofile": {
    "Name": "nofile",
    "Hard": 65535,
    "Soft": 65535
  }
}
13. Error response from daemon: Get "https://registry-1.docker.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers).
cat >/etc/docker/daemon.json<<EOF
{
  "registry-mirrors": [
    "https://hub-mirror.c.163.com",
    "https://mirror.baidubce.com",
    "https://mirror.ccs.tencentyun.com",
    "https://docker.mirrors.jdcloud.com",
    "https://kscgcr.m.daocloud.io",
    "https://docker.mirrors.sohu.com",
    "https://docker.mirrors.ustc.edu.cn",
    "https://docker.mirrors.tuna.tsinghua.edu.cn",
    "https://reg-mirror.qiniu.com",
    "https://docker.mirrors.tenxcloud.com",
    "https://hub-mirror.alauda.cn"
  ]
}
EOF
systemctl daemon-reload
systemctl restart docker
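14. Calico interfaces losing their IP addresses
When NetworkManager and Calico are both enabled, the two compete for control of the interfaces, and cali* interfaces can lose their IP addresses, although not every time.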
cat > /etc/NetworkManager/conf.d/calico.conf <<EOF
[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:tunl*;interface-name:vxlan.calico
EOF
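The configuration above marks the cali*, tunl* and vxlan.calico interfaces as unmanaged by NetworkManager, leaving them to Calico alone (as they should be).
From the official documentation: NetworkManager manipulates the routing table for interfaces in the default network namespace, where Calico veth pairs are anchored to connect to containers. This can interfere with the Calico agent's ability to route correctly.
Reference: https://docs.tigera.io/archive/v3.7/maintenance/troubleshooting#configure-networkmanager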