A Roundup of Problems Encountered in Kubernetes Operations

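1. Volume permission problems cause the pod to run abnormally

# Debugging: add a command field so the container stays up, then exec into it and check the UID the application runs as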
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - sleep 500000

# Use an initContainer to fix the directory permissions
spec:
  initContainers:
  - command:
    - /bin/sh
    - -c
    - chmod 777 /prometheus
    image: busybox
    imagePullPolicy: IfNotPresent
    name: volume-permissions
    securityContext:
      runAsUser: 0
    volumeMounts:
    - mountPath: /prometheus
      name: prometheus-data
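
Before loosening permissions it helps to confirm which UID the application runs as and who owns the mount point. A minimal sketch, assuming kubectl access to the cluster; the helper name show_volume_owner and the namespace and pod names in the usage line are hypothetical:

```shell
# Hypothetical helper: print the UID/GID of the container process and the
# numeric owner of the mounted directory, so the two can be compared.
show_volume_owner() {
    ns="$1"; pod="$2"; path="$3"
    kubectl exec -n "$ns" "$pod" -- sh -c "id; ls -ldn '$path'"
}

# Usage (names are placeholders):
# show_volume_owner monitoring prometheus-0 /prometheus
```

If the UID reported by id does not match the owner of the directory, the volume permissions are the likely cause.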

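2. A lost+found directory created by default in the mounted volume causes database initialization to fail (the data directory must be empty at initialization)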

Initializing database
2023-04-12T08:11:26.631401Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2023-04-12T08:11:26.636640Z 0 [ERROR] --initialize specified but the data directory has files in it. Aborting.
2023-04-12T08:11:26.636700Z 0 [ERROR] Aborting

# Debugging: add a command field, exec into the container, and delete the lost+found directory
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - sleep 500000

# Inside the container, delete lost+found/
mysql@flashcatcloud-nightingale-database-0:/$ cd /var/lib/mysql
mysql@flashcatcloud-nightingale-database-0:/var/lib/mysql$ ls
lost+found
mysql@flashcatcloud-nightingale-database-0:/var/lib/mysql$ rm -r lost+found/
mysql@flashcatcloud-nightingale-database-0:/var/lib/mysql$ ls 
mysql@flashcatcloud-nightingale-database-0:/var/lib/mysql$ 

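# Or delete the lost+found directory with an initContainer
# (remove only lost+found: an initContainer runs on every pod start, so "rm -rf /var/lib/mysql/*" would wipe the database on each restart)
spec:
  initContainers:
  - command:
    - /bin/sh
    - -c
    - rm -rf /var/lib/mysql/lost+found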
    image: busybox
    imagePullPolicy: IfNotPresent
    name: volume-permissions
    resources: {}
    securityContext:
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/mysql/
      name: database-data

3. Pod stuck in the Terminating state

# Check the kubelet log on the node where the pod runs:
failed to "KillPodSandbox" for "a594f4a1-c67b-42c5-84ea-62f7fb1e386d" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to check network namespace closed: remove netns: unlinkat /var/run/netns/cni-b70f6268-4fed-8c40-73f4-2e0ad0d325f4: device or resource busy"

# Fix (run on the affected node; the fs.may_detach_mounts parameter is exposed by RHEL/CentOS 7 kernels)
echo 1 > /proc/sys/fs/may_detach_mounts
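
Writing to /proc only lasts until the next reboot. A minimal sketch of persisting the setting through a sysctl drop-in, assuming the node's kernel exposes fs.may_detach_mounts; the helper name and the file name 99-may-detach-mounts.conf are hypothetical:

```shell
# Hypothetical helper: write a sysctl drop-in so the setting survives reboots.
# The target directory can be overridden, which also makes a dry run possible.
persist_may_detach_mounts() {
    conf_dir="${1:-/etc/sysctl.d}"
    printf 'fs.may_detach_mounts = 1\n' > "$conf_dir/99-may-detach-mounts.conf"
}

# Usage on the node, as root:
# persist_may_detach_mounts && sysctl --system
```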

# Reference: sysctl configuration for a production Kubernetes cluster, pure shell
https://www.boysec.cn/boy/f0530e00.html

4. Force-delete a namespace stuck in the Terminating state

# Method 1
kubectl get namespace <Terminating_Namespace> -o json | tr -d "\n" | sed "s/\"finalizers\": \[[^]]\+\]/\"finalizers\": []/" | kubectl replace --raw /api/v1/namespaces/<Terminating_Namespace>/finalize -f -
 
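# Method 2:
kubectl get namespace <Terminating_Namespace> -o json > Terminating_Namespace.json
# Edit Terminating_Namespace.json and empty the finalizers list under spec

# kubectl proxy blocks, so run it in a separate terminal or in the background
kubectl proxy &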
curl -k -H "Content-Type: application/json" -X PUT --data-binary @Terminating_Namespace.json http://127.0.0.1:8001/api/v1/namespaces/<Terminating_Namespace>/finalize
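
The sed expression in method 1 uses `\+`, which is a GNU sed extension. A minimal sketch of the same substitution written with basic regular expressions only, wrapped in a hypothetical helper named strip_finalizers (it is not part of kubectl):

```shell
# Hypothetical helper: read namespace JSON on stdin, join it into one line,
# and replace the contents of the finalizers list with an empty list.
strip_finalizers() {
    tr -d '\n' | sed 's/"finalizers": *\[[^]]*\]/"finalizers": []/'
}

# Usage (ns is a placeholder for the stuck namespace):
# kubectl get namespace "$ns" -o json | strip_finalizers \
#   | kubectl replace --raw "/api/v1/namespaces/$ns/finalize" -f -
```

Clearing finalizers skips whatever cleanup they were waiting for, so it is worth checking first which resources are still left in the namespace.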

5. Force-delete a PV/PVC

[root@master01 ~]# kubectl patch pvc xxxxxx -p '{"metadata":{"finalizers":null}}' -n yyyyyy
[root@master01 ~]# kubectl patch pv xxxxxxxxxxxxxxxx -p '{"metadata":{"finalizers":null}}'

6. Certificate trust problems when pulling images from a private registry

x509: certificate signed by unknown authority

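# 1. Container runtime is Docker
# (insecure-registries entries are host[:port] without a scheme; data-root replaces the deprecated graph key)
cat >/etc/docker/daemon.json <<EOF
{
	"data-root": "/var/lib/docker",
	"registry-mirrors": ["https://registry.cn-hangzhou.aliyuncs.com", "https://harbor.example.com"],
	"insecure-registries": ["harbor.example.com"],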
	"live-restore": true,
	"exec-opts": ["native.cgroupdriver=systemd"],
	"storage-driver": "overlay2",
	"log-driver": "json-file",
	"log-opts": {
		"max-size": "500m",
		"max-file": "3"
	}
}
EOF
systemctl restart docker.service
systemctl status docker.service
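
An alternative to marking the registry as insecure is to make Docker trust the CA that signed the registry certificate. Docker looks for it at /etc/docker/certs.d/<registry host>/ca.crt. A minimal sketch, assuming the CA certificate is already on the node; the helper name and the path in the usage line are hypothetical:

```shell
# Hypothetical helper: copy a CA certificate to the per-registry directory
# that Docker checks when it verifies the registry's TLS certificate.
install_registry_ca() {
    registry="$1"; ca_file="$2"; certs_root="${3:-/etc/docker/certs.d}"
    mkdir -p "$certs_root/$registry" &&
        cp "$ca_file" "$certs_root/$registry/ca.crt"
}

# Usage, as root (the source path is a placeholder):
# install_registry_ca harbor.example.com /root/harbor-ca.crt
```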

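# 2. Container runtime is containerd
# hosts.toml is only read when config_path = "/etc/containerd/certs.d" is set under
# [plugins."io.containerd.grpc.v1.cri".registry] in /etc/containerd/config.toml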
mkdir -p /etc/containerd/certs.d/harbor.example.com/
cat >/etc/containerd/certs.d/harbor.example.com/hosts.toml <<EOF
[host."https://harbor.example.com"]
  capabilities = ["pull", "resolve", "push"]
  skip_verify = true
EOF

cat >>/etc/containerd/config.toml <<EOF
          [plugins."io.containerd.grpc.v1.cri".registry.configs."harbor.example.com".auth]
            username = "admin"
            password = "Harbor12345"
EOF

systemctl restart containerd.service
systemctl status containerd.service
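
If pulls still fail with the x509 error after the restart, one thing to check is whether containerd is reading the certs.d directory at all. A minimal sketch, assuming the default config location; the helper name certs_dir_enabled is hypothetical, and it only tests for a non-empty config_path line rather than parsing the TOML:

```shell
# Hypothetical helper: succeed if the config file contains a non-empty
# config_path setting, fail otherwise.
certs_dir_enabled() {
    config="${1:-/etc/containerd/config.toml}"
    grep -Eq '^[[:space:]]*config_path[[:space:]]*=[[:space:]]*"[^"]+"' "$config"
}

# Usage:
# certs_dir_enabled || echo "config_path is not set; hosts.toml will be ignored"
```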

7. Pull Docker images through a proxy server

Edit /etc/systemd/system/multi-user.target.wants/docker.service and add the proxy settings under the [Service] section, for example:

Environment=HTTP_PROXY=http://admin:admin123@192.168.56.1:1080
Environment=HTTPS_PROXY=http://admin:admin123@192.168.56.1:1080
Environment=NO_PROXY=localhost,127.0.0.1

Restart the docker service:

systemctl daemon-reload
systemctl restart docker
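
Edits made directly to the unit file can be lost when the docker package is upgraded. A minimal sketch of the same proxy settings as a systemd drop-in instead; the helper name and the file name http-proxy.conf are assumptions:

```shell
# Hypothetical helper: write the proxy settings to a drop-in file for
# docker.service instead of editing the unit file itself.
write_docker_proxy_dropin() {
    proxy_url="$1"; dropin_dir="${2:-/etc/systemd/system/docker.service.d}"
    mkdir -p "$dropin_dir" &&
        cat > "$dropin_dir/http-proxy.conf" <<EOF
[Service]
Environment="HTTP_PROXY=$proxy_url"
Environment="HTTPS_PROXY=$proxy_url"
Environment="NO_PROXY=localhost,127.0.0.1"
EOF
}

# Usage, as root:
# write_docker_proxy_dropin http://admin:admin123@192.168.56.1:1080
# systemctl daemon-reload && systemctl restart docker
# systemctl show --property=Environment docker
```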

8. Podman configuration

Alias:

echo "alias docker=podman" >> ~/.bashrc
source ~/.bashrc

Global registry configuration: /etc/containers/registries.conf
Per-user registry configuration: ~/.config/containers/registries.conf

unqualified-search-registries = ["docker.io", "registry.access.redhat.com"]

[[registry]]
prefix = "docker.io"
location = "docker.io"

[[registry.mirror]]
location = "docker.mirrors.ustc.edu.cn"
[[registry.mirror]]
location = "registry.docker-cn.com"

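9. Certificate trust problems when Helm uses a private chart repository

insecureSkipTLSVerify is a Helm option that skips TLS verification when Helm installs a chart (the helm command line spells it --insecure-skip-tls-verify). If your chart repository uses a self-signed certificate, or a proxy or other security appliance in your network intercepts TLS, verification may fail. In that case you can set insecureSkipTLSVerify: true to skip TLS verification so that Helm can install the chart.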
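
On the command line this might look like the sketch below. It shows two variants: skipping verification, and trusting the repository's CA with --ca-file. The helper names, the repository name harbor, and the chart URL are hypothetical:

```shell
# Hypothetical helper: add a chart repository without verifying its certificate.
add_repo_insecure() {
    helm repo add "$1" "$2" --insecure-skip-tls-verify
}

# Hypothetical helper: add a chart repository and verify it against a given CA bundle.
add_repo_with_ca() {
    helm repo add "$1" "$2" --ca-file "$3"
}

# Usage (repository name, URL and CA path are placeholders):
# add_repo_insecure harbor https://harbor.example.com/chartrepo/library
# add_repo_with_ca  harbor https://harbor.example.com/chartrepo/library /root/harbor-ca.crt
```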

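10. etcd "took too long" errors

etcd performance depends on cluster size, hardware, network latency, and other factors. When etcd is slow to process a request, it may report a "took too long" error. This usually means etcd could not respond in time, so the request timed out or was cancelled. The best practice is to move the etcd data onto high-performance disk media.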
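
To judge whether the disk is the bottleneck before migrating, the fdatasync latency of the candidate disk can be measured with fio. A minimal sketch, assuming fio is installed; the helper name and the test directory are hypothetical, and the directory should sit on the disk that holds (or will hold) the etcd data:

```shell
# Hypothetical helper: run a small sequential write test that calls
# fdatasync after every write, which is similar to how etcd writes its WAL.
check_etcd_disk() {
    dir="${1:-/var/lib/etcd/fio-test}"
    mkdir -p "$dir" &&
        fio --rw=write --ioengine=sync --fdatasync=1 \
            --directory="$dir" --size=22m --bs=2300 --name=etcd-fsync-test
}

# Usage:
# check_etcd_disk /var/lib/etcd/fio-test
```

In the fio output, look at the fsync/fdatasync percentiles; as a rough guide, a 99th percentile well above about 10 ms suggests the disk is too slow for etcd.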

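11. Prerequisites for expanding an already-mounted PV online in Kubernetes

  • The Kubernetes version must be 1.11 or later (expanding in-use volumes was alpha in 1.11 and has been enabled by default since 1.15).
  • The StorageClass must support expansion, i.e. its allowVolumeExpansion field is true.
  • The PV's reclaim policy (reclaimPolicy) must be Retain or Delete, not Recycle.
  • The PV's access mode (accessMode) must be ReadWriteOnce or ReadWriteMany, not ReadOnlyMany.
  • The underlying storage used by the PV and PVC (Ceph, NFS, etc.) must support online expansion, i.e. without unmounting or restarting.
  • The container runtime must support online expansion, for example docker or containerd.
  • The PV's file system must support online expansion, for example ext4 or xfs.
  • If the underlying storage is Ceph, the Ceph cluster version must be Nautilus or later.
  • If the storage backend supports only offline expansion, first scale the pods bound to the PV and PVC down to 0, then change the PVC's requested storage (resources.requests.storage), then scale the pods back up.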
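
Once the prerequisites are met, the expansion itself is a change to the PVC's requested size. A minimal sketch using kubectl patch; the helper name resize_pvc and the namespace, PVC name and size in the usage lines are hypothetical:

```shell
# Hypothetical helper: raise spec.resources.requests.storage on a PVC.
# The new size must be larger than the current one.
resize_pvc() {
    ns="$1"; pvc="$2"; size="$3"
    kubectl patch pvc "$pvc" -n "$ns" \
        -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"$size\"}}}}"
}

# Usage (names and size are placeholders):
# kubectl get storageclass -o custom-columns=NAME:.metadata.name,EXPANSION:.allowVolumeExpansion
# resize_pvc monitoring prometheus-data 20Gi
# kubectl get pvc prometheus-data -n monitoring -w   # watch until the new capacity shows up
```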

12、library initialization failed - unable to allocate file descriptor table - out of memory

Fix: add the default-ulimits parameter to the Docker configuration file (/etc/docker/daemon.json), then restart Docker

    "default-ulimits": {
        "nofile": {
            "Name": "nofile",
            "Hard": 65535,
            "Soft": 65535
        }
    }
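
After Docker is restarted with this setting, the limit that new containers actually receive can be checked from inside a throwaway container. A minimal sketch, assuming the busybox image is available; the helper name is hypothetical:

```shell
# Hypothetical helper: start a short-lived container and print its
# open-file limit (the soft nofile ulimit).
show_container_nofile() {
    image="${1:-busybox}"
    docker run --rm "$image" sh -c 'ulimit -n'
}

# Usage:
# show_container_nofile
```

With the configuration above applied, the printed value should match the configured soft limit.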
posted @ 2023-04-17 09:33  wanghongwei-dev