k8s debug notes: kubelet user.slice container monitor failure
In Kubernetes, if you bring up a local cluster with the bundled single-machine script ./hack/local-up-cluster.sh, you will see log lines like the following in the kubelet log:
Failed to get system container stats for "/user.slice/user-1000.slice/session-c2.scope": failed to get cgroup stats for "/user.slice/user-1000.slice/session-c2.scope": failed to get container info for "/user.slice/user-1000.slice/session-c2.scope": unknown container "/user.slice/user-1000.slice/session-c2.scope"
After some investigation, the root cause turned out to be this block in the kubelet's container manager:
if cm.KubeletCgroupsName != "" {
    cont := newSystemCgroups(cm.KubeletCgroupsName)
    allowAllDevices := true
    manager := fs.Manager{
        Cgroups: &configs.Cgroup{
            Parent: "/",
            Name:   cm.KubeletCgroupsName,
            Resources: &configs.Resources{
                AllowAllDevices: &allowAllDevices,
            },
        },
    }
    cont.ensureStateFunc = func(_ *fs.Manager) error {
        return ensureProcessInContainerWithOOMScore(os.Getpid(), qos.KubeletOOMScoreAdj, &manager)
    }
    systemContainers = append(systemContainers, cont)
} else {
    cm.periodicTasks = append(cm.periodicTasks, func() {
        if err := ensureProcessInContainerWithOOMScore(os.Getpid(), qos.KubeletOOMScoreAdj, nil); err != nil {
            klog.Error(err)
            return
        }
        klog.V(1).Infof("jay the pid is %#v\n", os.Getpid())
        cont, err := getContainer(os.Getpid())
        if err != nil {
            klog.Errorf("failed to find cgroups of kubelet - %v", err)
            return
        }
        cm.Lock()
        defer cm.Unlock()
        cm.KubeletCgroupsName = cont
    })
}
The key line here is: cont, err := getContainer(os.Getpid())
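For context, getContainer essentially resolves the cgroup path a pid belongs to by parsing /proc/<pid>/cgroup. The real implementation in the kubelet goes through libcontainer's cgroup parsing helpers, but the idea boils down to something like this minimal, self-contained sketch (assuming a cgroup-v1 layout; getCgroupOfPid is a made-up name for illustration, not the kubelet's function):

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// getCgroupOfPid is a simplified illustration (not kubelet code): it reads
// /proc/<pid>/cgroup and returns the cgroup path the process belongs to,
// taken here from the cpu hierarchy of a cgroup-v1 host.
func getCgroupOfPid(pid int) (string, error) {
    f, err := os.Open(fmt.Sprintf("/proc/%d/cgroup", pid))
    if err != nil {
        return "", err
    }
    defer f.Close()

    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        // A cgroup-v1 line looks like:
        // 4:cpu,cpuacct:/user.slice/user-1000.slice/session-c2.scope
        parts := strings.SplitN(scanner.Text(), ":", 3)
        if len(parts) != 3 {
            continue
        }
        for _, ctrl := range strings.Split(parts[1], ",") {
            if ctrl == "cpu" {
                return parts[2], nil
            }
        }
    }
    return "", fmt.Errorf("no cpu cgroup found for pid %d", pid)
}

func main() {
    // The kubelet does this with its own pid, i.e. os.Getpid().
    path, err := getCgroupOfPid(os.Getpid())
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    // Started from a login shell, this prints something like
    // /user.slice/user-1000.slice/session-c2.scope.
    fmt.Println(path)
}

Run from an interactive terminal, this prints a session scope under /user.slice, which matches the path showing up in the error messages above.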
In theory, the process executing this code is the kubelet process itself, so it looks up the cgroup it belongs to by its own pid and then builds a manager to monitor the resources the kubelet consumes. The problem is that a local cluster started via ./hack/local-up-cluster.sh starts a container with docker and runs the kubelet process inside that container, so the cgroup path the kubelet resolves for itself does not line up with what the stats machinery can actually find, which is where the "unknown container" errors above come from.
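One way to see the mismatch directly is to check whether the path the kubelet resolves for itself is actually present in the cgroup filesystem visible from its own mount namespace. The helper below is hypothetical debugging code, not kubelet code; it reuses the getCgroupOfPid sketch above (add "path/filepath" to its imports) and assumes a cgroup-v1 host with the cpu controller mounted at the conventional /sys/fs/cgroup/cpu:

// checkKubeletCgroupVisible reports whether the cgroup path resolved for the
// current process exists where stats would be collected from. If it does not,
// lookups against it fail in exactly the way the log lines above show.
func checkKubeletCgroupVisible() error {
    path, err := getCgroupOfPid(os.Getpid())
    if err != nil {
        return err
    }
    full := filepath.Join("/sys/fs/cgroup/cpu", path)
    if _, err := os.Stat(full); os.IsNotExist(err) {
        return fmt.Errorf("cgroup %q is not visible at %q", path, full)
    }
    return nil
}

Note also that the code above only calls getContainer when cm.KubeletCgroupsName is empty; setting the kubelet's --kubelet-cgroups flag makes it take the first branch with a fixed, pre-created cgroup instead of auto-detecting one, which is one way to sidestep this class of problem.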