A collection of problems encountered while operating Kubernetes
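1. Volume permission problems cause the pod to run abnormally
# Debugging: override the container command so the pod stays up, then exec into the container and check which UID the application runs as
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - sleep 500000
# Use an initContainer to change the directory permissions
spec:
  initContainers:
  - command:
    - /bin/sh
    - -c
    - chmod 777 /prometheus
    image: busybox
    imagePullPolicy: IfNotPresent
    name: volume-permissions
    securityContext:
      runAsUser: 0
    volumeMounts:
    - mountPath: /prometheus
      name: prometheus-data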
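2. A lost+found directory created by default inside the mounted volume makes database initialization fail (initialization requires an empty data directory)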
Initializing database
2023-04-12T08:11:26.631401Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2023-04-12T08:11:26.636640Z 0 [ERROR] --initialize specified but the data directory has files in it. Aborting.
2023-04-12T08:11:26.636700Z 0 [ERROR] Aborting
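# Debugging: override the container command so the pod stays up, then exec into the container and delete the lost+found directory
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - sleep 500000
# Exec into the container and delete lost+found/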
mysql@flashcatcloud-nightingale-database-0:/$ cd /var/lib/mysql
mysql@flashcatcloud-nightingale-database-0:/var/lib/mysql$ ls
lost+found
mysql@flashcatcloud-nightingale-database-0:/var/lib/mysql$ rm -r lost+found/
mysql@flashcatcloud-nightingale-database-0:/var/lib/mysql$ ls
mysql@flashcatcloud-nightingale-database-0:/var/lib/mysql$
# Alternatively, delete the lost+found directory with an initContainer
# Remove only lost+found: an initContainer runs on every pod start, so "rm -rf /var/lib/mysql/*" would wipe the database on each restart
spec:
  initContainers:
  - command:
    - /bin/sh
    - -c
    - rm -rf /var/lib/mysql/lost+found
    image: busybox
    imagePullPolicy: IfNotPresent
    name: volume-permissions
    resources: {}
    securityContext:
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/mysql/
      name: database-data
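Because an initContainer runs again on every pod start, the cleanup step is safest when it touches nothing except lost+found. The following is a minimal sketch of such a step as a shell function; the function name clean_lost_found is hypothetical, and the data directory is passed in by the caller.

```shell
# Remove only the lost+found directory from a data directory.
# Every other file is left untouched, so the step is safe to repeat on each pod start.
clean_lost_found() {
  data_dir="$1"
  if [ -d "$data_dir/lost+found" ]; then
    rm -rf -- "$data_dir/lost+found"
  fi
}

# Example (inside the initContainer): clean_lost_found /var/lib/mysql
```

On a volume that already holds an initialized database the function does nothing, which is the behaviour wanted on restarts.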
3. A pod stays in the Terminating state
# Check the kubelet log on the node where the pod is scheduled:
failed to "KillPodSandbox" for "a594f4a1-c67b-42c5-84ea-62f7fb1e386d" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to check network namespace closed: remove netns: unlinkat /var/run/netns/cni-b70f6268-4fed-8c40-73f4-2e0ad0d325f4: device or resource busy"
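# Fix: takes effect immediately but does not survive a reboot; applies to kernels that expose this sysctl (for example RHEL/CentOS 7)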
echo 1 > /proc/sys/fs/may_detach_mounts
# Reference: sysctl configuration for a production Kubernetes cluster built with pure shell scripts
https://www.boysec.cn/boy/f0530e00.html
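Writing to /proc only changes the running kernel, so the setting is lost after a reboot. A minimal sketch of persisting it through a sysctl drop-in file follows; the function name persist_sysctl and the file name 99-kubernetes.conf are hypothetical choices, and the function overwrites that file with the single key it is given.

```shell
# Persist one sysctl key in a drop-in file so it is applied again at boot.
# Arguments: key, value, optional config directory (defaults to /etc/sysctl.d).
# Assumption: 99-kubernetes.conf is an illustrative file name; the file is overwritten.
persist_sysctl() {
  key="$1"
  value="$2"
  conf_dir="${3:-/etc/sysctl.d}"
  mkdir -p "$conf_dir"
  printf '%s = %s\n' "$key" "$value" > "$conf_dir/99-kubernetes.conf"
}

# Example: persist_sysctl fs.may_detach_mounts 1 && sysctl --system
```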
4. Force-deleting a namespace stuck in the Terminating state
# Method 1
kubectl get namespace <Terminating_Namespace> -o json | tr -d "\n" | sed "s/\"finalizers\": \[[^]]\+\]/\"finalizers\": []/" | kubectl replace --raw /api/v1/namespaces/<Terminating_Namespace>/finalize -f -
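# Method 2
kubectl get namespace <Terminating_Namespace> -o json > Terminating_Namespace.json
# Edit Terminating_Namespace.json and empty the list under the finalizers field
# Start the API proxy in a separate terminal (it runs in the foreground and listens on 127.0.0.1:8001 by default)
kubectl proxy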
curl -k -H "Content-Type: application/json" -X PUT --data-binary @Terminating_Namespace.json http://127.0.0.1:8001/api/v1/namespaces/<Terminating_Namespace>/finalize
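Both methods come down to submitting the namespace object with an empty finalizers list. The text transformation used in method 1 can be tried on its own before it is pointed at a cluster; the sketch below wraps it in a hypothetical function named strip_finalizers that reads JSON on stdin. It is a plain text substitution, not a JSON parser, so it assumes the `"finalizers": [...]` layout that kubectl prints.

```shell
# Join the JSON onto one line, then replace the contents of the
# "finalizers" array with an empty array.
# Assumption: input uses the '"finalizers": [' spacing produced by kubectl -o json.
strip_finalizers() {
  tr -d '\n' | sed 's/"finalizers": \[[^]]*]/"finalizers": []/'
}

# Example:
# kubectl get namespace <Terminating_Namespace> -o json | strip_finalizers \
#   | kubectl replace --raw /api/v1/namespaces/<Terminating_Namespace>/finalize -f -
```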
5. Force-deleting a PV/PVC
[root@master01 ~]# kubectl patch pvc xxxxxx -p '{"metadata":{"finalizers":null}}' -n yyyyyy
[root@master01 ~]# kubectl patch pv xxxxxxxxxxxxxxxx -p '{"metadata":{"finalizers":null}}'
6. Certificate trust problems when pulling images from a private registry
x509: certificate signed by unknown authority
# 1. The container runtime is Docker
# Note: insecure-registries entries are host[:port] without a scheme; "data-root" replaces the deprecated "graph" key
cat >/etc/docker/daemon.json <<EOF
{
  "data-root": "/var/lib/docker",
  "registry-mirrors": ["https://registry.cn-hangzhou.aliyuncs.com", "https://harbor.example.com"],
  "insecure-registries": ["harbor.example.com"],
  "live-restore": true,
  "exec-opts": ["native.cgroupdriver=systemd"],
  "storage-driver": "overlay2",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "500m",
    "max-file": "3"
  }
}
EOF
systemctl restart docker.service
systemctl status docker.service
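# 2. The container runtime is containerd
# Note: hosts.toml files under certs.d are only read when config_path = "/etc/containerd/certs.d" is set under [plugins."io.containerd.grpc.v1.cri".registry] in /etc/containerd/config.toml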
mkdir -p /etc/containerd/certs.d/harbor.example.com/
cat >/etc/containerd/certs.d/harbor.example.com/hosts.toml <<EOF
[host."https://harbor.example.com"]
capabilities = ["pull", "resolve", "push"]
skip_verify = true
EOF
cat >>/etc/containerd/config.toml <<EOF
[plugins."io.containerd.grpc.v1.cri".registry.configs."harbor.example.com".auth]
username = "admin"
password = "Harbor12345"
EOF
systemctl restart containerd.service
systemctl status containerd.service
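Skipping verification makes the error go away but also disables the protection TLS provides. When the registry's CA certificate is available, containerd's hosts.toml can point at it through the ca field instead. The sketch below generates such a file; the function name write_hosts_toml is hypothetical, and the directory, registry host, and CA path are all supplied by the caller.

```shell
# Write a containerd hosts.toml that trusts a private CA instead of using skip_verify.
# Arguments: certs.d directory, registry host name, path to the CA certificate.
write_hosts_toml() {
  certs_dir="$1"
  registry="$2"
  ca_file="$3"
  mkdir -p "$certs_dir/$registry"
  cat > "$certs_dir/$registry/hosts.toml" <<EOF
server = "https://$registry"

[host."https://$registry"]
  capabilities = ["pull", "resolve"]
  ca = "$ca_file"
EOF
}

# Example (the CA path is illustrative):
# write_hosts_toml /etc/containerd/certs.d harbor.example.com /etc/containerd/certs.d/harbor.example.com/ca.crt
```

As with the skip_verify variant, containerd has to be restarted afterwards.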
7. Pulling Docker images through a proxy server
Edit /etc/systemd/system/multi-user.target.wants/docker.service and add the proxy settings under the [Service] section, for example:
Environment=HTTP_PROXY=http://admin:admin123@192.168.56.1:1080
Environment=HTTPS_PROXY=http://admin:admin123@192.168.56.1:1080
Environment=NO_PROXY=localhost,127.0.0.1
Reload systemd and restart the docker service:
systemctl daemon-reload
systemctl restart docker
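Changes made directly in the unit file can be lost when the Docker package is upgraded. A systemd drop-in file under /etc/systemd/system/docker.service.d/ carries the same settings separately from the unit. The sketch below writes one; the function name write_proxy_dropin is hypothetical, and the proxy URL and NO_PROXY list are caller-supplied.

```shell
# Write a systemd drop-in that sets the proxy environment for the docker service.
# Arguments: proxy URL, NO_PROXY list, optional drop-in directory
# (defaults to /etc/systemd/system/docker.service.d).
write_proxy_dropin() {
  proxy_url="$1"
  no_proxy="$2"
  dropin_dir="${3:-/etc/systemd/system/docker.service.d}"
  mkdir -p "$dropin_dir"
  cat > "$dropin_dir/http-proxy.conf" <<EOF
[Service]
Environment="HTTP_PROXY=$proxy_url"
Environment="HTTPS_PROXY=$proxy_url"
Environment="NO_PROXY=$no_proxy"
EOF
}

# Example:
# write_proxy_dropin http://admin:admin123@192.168.56.1:1080 localhost,127.0.0.1
# systemctl daemon-reload && systemctl restart docker
```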
8. Podman configuration
Alias:
echo "alias docker=podman" >> ~/.bashrc
source ~/.bashrc
Global registry configuration: /etc/containers/registries.conf
Per-user registry configuration: ~/.config/containers/registries.conf
unqualified-search-registries = ["docker.io", "registry.access.redhat.com"]
[[registry]]
prefix = "docker.io"
location = "docker.io"
[[registry.mirror]]
location = "docker.mirrors.ustc.edu.cn"
[[registry.mirror]]
location = "registry.docker-cn.com"
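9. Certificate trust problems when Helm uses a private chart repository
insecureSkipTLSVerify is an option that makes Helm skip TLS verification when it installs a chart. Verification can fail when the chart repository uses a self-signed certificate, or when a proxy or similar security appliance sits in the network path. In that case, setting insecureSkipTLSVerify: true skips TLS verification so that Helm can install the chart. On the helm command line the corresponding flag is --insecure-skip-tls-verify.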
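10. The etcd "took too long" error
etcd performance depends on cluster size, hardware, network latency, and other factors. When etcd cannot answer a request in time, the request times out or is cancelled and etcd logs a took too long message. The best practice is to move the etcd data onto high-performance storage such as SSDs.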
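Before migrating, it can help to get a rough idea of how the current disk handles small synchronous writes, since that is the pattern etcd's write-ahead log produces. The sketch below is only a coarse probe built on GNU dd with oflag=dsync; a dedicated benchmark such as fio gives more reliable numbers. The function name probe_sync_writes and the probe file name are hypothetical.

```shell
# Write <count> 512-byte blocks with synchronous I/O and print dd's timing summary.
# Assumption: GNU dd is available (oflag=dsync is a GNU extension).
# Arguments: directory to test, optional number of writes (defaults to 1000).
probe_sync_writes() {
  dir="$1"
  count="${2:-1000}"
  probe="$dir/.etcd-disk-probe"
  # dd reports throughput on stderr; redirect it to stdout for the caller.
  dd if=/dev/zero of="$probe" bs=512 count="$count" oflag=dsync 2>&1
  status=$?
  rm -f "$probe"
  return "$status"
}

# Example: probe_sync_writes /var/lib/etcd
```

Running it against the directory that holds the etcd data and against the candidate disk gives a like-for-like comparison.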
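11. Prerequisites for expanding a PV that is already mounted and in use
- The Kubernetes version must be at least 1.11.
- The PV's storage class (StorageClass) must support expansion, i.e. its allowVolumeExpansion field is true.
- The PV's reclaim policy (reclaimPolicy) must be Retain or Delete, not Recycle.
- The PV's access mode (accessMode) must be ReadWriteOnce or ReadWriteMany, not ReadOnlyMany.
- The underlying storage used by the PV and PVC (such as Ceph or NFS) must support online expansion, i.e. without unmounting or restarting.
- The container runtime must support online expansion, for example docker or containerd.
- The PV's file system must support online growth, for example ext4 or xfs.
- If the underlying storage is Ceph, the Ceph cluster must be Nautilus or later.
- If the storage backend cannot expand a volume while it is in use, first scale the pods bound to the PV and PVC down to 0, then change the PVC's requested storage (resources.requests.storage), and finally scale the pods back up.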
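Once the prerequisites hold, the expansion itself is a change to resources.requests.storage on the PVC. The sketch below builds the JSON patch for that field; the function name pvc_resize_patch is hypothetical, and the PVC name, namespace, and size in the usage line are placeholders.

```shell
# Print a JSON patch that sets the PVC's requested storage to the given size.
# Argument: new size as a Kubernetes quantity, e.g. 20Gi.
pvc_resize_patch() {
  size="$1"
  printf '{"spec":{"resources":{"requests":{"storage":"%s"}}}}' "$size"
}

# Example:
# kubectl patch pvc <pvc-name> -n <namespace> -p "$(pvc_resize_patch 20Gi)"
```

The requested size can only grow; Kubernetes rejects a patch that shrinks a PVC.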
12. library initialization failed - unable to allocate file descriptor table - out of memory
Fix: add the default-ulimits parameter to the Docker configuration file
"default-ulimits": {
  "nofile": {
    "Name": "nofile",
    "Hard": 65535,
    "Soft": 65535
  }
}
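Author: wanghongwei
Copyright notice: this work is licensed under <CC BY-NC-ND 4.0>. For commercial reuse, contact the author for authorization; for non-commercial reuse, include a link to the original source and this notice.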