What happens before kubelet creates a Pod?
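When the kubelet observes, through its watch, that a new Pod has been added, it mainly needs to do the following:
- Manage the Pod status: update the local cache and also sync it to the API server
- Check whether the node has enough resources to create the Pod
- Create the cgroups and update their resource configuration
- Create the root directories for the Pod, its volumes, and its plugins
- Mount the volumes and wait until all of them are mounted successfully
- Fetch the imagePullSecrets
- Call CRI to create the PodSandbox
- Call CRI to create the containers (ephemeral containers, init containers, regular containers)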
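Pod events are the channel through which Kubernetes components surface key information to users. For a given Pod, the events described below appear under normal circumstances.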
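The Pod is scheduled
When a Pod is created, it is first submitted to the API server (this also holds for Pods managed by a controller). The scheduler then notices the unscheduled Pod and schedules it. After its algorithms have found a suitable node, the scheduler binds the Pod to that node. For the default scheduler, scheduling is complete once the following event is recorded on the Pod: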
Successfully assigned default/dp-test-5986bbcc8f-4vlmd to 192.168.0.9
The corresponding code in the scheduler is:
func (sched *Scheduler) finishBinding(fwk framework.Framework, assumed *v1.Pod, targetNode string, status *framework.Status) {
if finErr := sched.Cache.FinishBinding(assumed); finErr != nil {
klog.ErrorS(finErr, "Scheduler cache FinishBinding failed")
}
if !status.IsSuccess() {
klog.V(1).InfoS("Failed to bind pod", "pod", klog.KObj(assumed))
return
}
fwk.EventRecorder().Eventf(assumed, nil, v1.EventTypeNormal, "Scheduled", "Binding", "Successfully assigned %v/%v to %v", assumed.Namespace, assumed.Name, targetNode)
}
Kubelet discovers a Pod it needs to handle
The kubelet continuously watches the API server for the Pods scheduled to the current node:
Note: the Pods are filtered with the field selector spec.nodeName.
// NewSourceApiserver creates a config source that watches and pulls from the apiserver.
func NewSourceApiserver(c clientset.Interface, nodeName types.NodeName, nodeHasSynced func() bool, updates chan<- interface{}) {
lw := cache.NewListWatchFromClient(c.CoreV1().RESTClient(), "pods", metav1.NamespaceAll, fields.OneTermEqualSelector("spec.nodeName", string(nodeName)))
// The Reflector responsible for watching pods at the apiserver should be run only after
// the node sync with the apiserver has completed.
klog.InfoS("Waiting for node sync before watching apiserver pods")
go func() {
for {
if nodeHasSynced() {
klog.V(4).InfoS("node sync completed")
break
}
time.Sleep(WaitForAPIServerSyncPeriod)
klog.V(4).InfoS("node sync has not completed yet")
}
klog.InfoS("Watching apiserver")
newSourceApiserverFromLW(lw, updates)
}()
}
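The effect of the spec.nodeName field selector can be pictured with a minimal, self-contained sketch. The `pod` struct and the `filterByNodeName` helper are hypothetical names used only for illustration. In the code above the selector is passed to the list/watch client, so this loop merely models the result the kubelet ends up seeing.

```go
package main

import "fmt"

// pod is a hypothetical, stripped-down stand-in for v1.Pod.
type pod struct {
	Namespace string
	Name      string
	NodeName  string // corresponds to spec.nodeName
}

// filterByNodeName keeps only the pods bound to the given node, which is the
// outcome the field selector spec.nodeName=<node> is meant to produce.
func filterByNodeName(pods []pod, nodeName string) []pod {
	var out []pod
	for _, p := range pods {
		if p.NodeName == nodeName {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []pod{
		{"default", "a", "192.168.0.9"},
		{"default", "b", "192.168.0.10"},
		{"default", "c", ""}, // not scheduled yet, so no node name
	}
	for _, p := range filterByNodeName(pods, "192.168.0.9") {
		fmt.Printf("%s/%s\n", p.Namespace, p.Name)
	}
}
```

A Pod that has not been scheduled yet has an empty node name, so no kubelet picks it up until the scheduler binds it.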
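When the kubelet finds that a new Pod has been scheduled to the current node, it prints the following log:
Note: the kubelet log verbosity has to be raised to 5 to see this log line; the same applies to the log lines below.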
Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269157 3390574 config.go:398] "Receiving a new pod" pod="default/dp-test-5986bbcc8f-gcrmm"
The corresponding code:
// recordFirstSeenTime records the first seen time of this pod.
func recordFirstSeenTime(pod *v1.Pod) {
klog.V(4).InfoS("Receiving a new pod", "pod", klog.KObj(pod))
pod.Annotations[kubetypes.ConfigFirstSeenAnnotationKey] = kubetypes.NewTimestamp().GetString()
}
Here the kubelet adds an annotation with the key kubernetes.io/config.seen to the Pod, recording the time at which the kubelet first saw it. This annotation is only used inside the kubelet and is not synced to the API server.
The kubelet classifies the Pods it receives and sends them to the corresponding handler through the s.updates channel. Following this channel leads to the function that processes them:
// syncLoop is the main loop for processing changes. It watches for changes from
// three channels (file, apiserver, and http) and creates a union of them. For
// any new change seen, will run a sync against desired state and running state. If
// no changes are seen to the configuration, will synchronize the last known desired
// state every sync-frequency seconds. Never returns.
func (kl *Kubelet) syncLoop(ctx context.Context, updates <-chan kubetypes.PodUpdate, handler SyncHandler) {
klog.InfoS("Starting kubelet main sync loop")
// The syncTicker wakes up kubelet to checks if there are any pod workers
// that need to be sync'd. A one-second period is sufficient because the
// sync interval is defaulted to 10s.
syncTicker := time.NewTicker(time.Second)
defer syncTicker.Stop()
housekeepingTicker := time.NewTicker(housekeepingPeriod)
defer housekeepingTicker.Stop()
plegCh := kl.pleg.Watch()
const (
base = 100 * time.Millisecond
max = 5 * time.Second
factor = 2
)
duration := base
// Responsible for checking limits in resolv.conf
// The limits do not have anything to do with individual pods
// Since this is called in syncLoop, we don't need to call it anywhere else
if kl.dnsConfigurer != nil && kl.dnsConfigurer.ResolverConfig != "" {
kl.dnsConfigurer.CheckLimitsForResolvConf()
}
for {
if err := kl.runtimeState.runtimeErrors(); err != nil {
klog.ErrorS(err, "Skipping pod synchronization")
// exponential backoff
time.Sleep(duration)
duration = time.Duration(math.Min(float64(max), factor*float64(duration)))
continue
}
// reset backoff if we have a success
duration = base
kl.syncLoopMonitor.Store(kl.clock.Now())
if !kl.syncLoopIteration(ctx, updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
break
}
kl.syncLoopMonitor.Store(kl.clock.Now())
}
}
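syncLoop is the kubelet's main processing loop. The syncLoopIteration method reads events from the different channels and dispatches them to workers. It is fairly long, so the comment below is a good way to get an overview of the logic: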
// syncLoopIteration reads from various channels and dispatches pods to the
// given handler.
//
// Arguments:
// 1. configCh: a channel to read config events from
// 2. handler: the SyncHandler to dispatch pods to
// 3. syncCh: a channel to read periodic sync events from
// 4. housekeepingCh: a channel to read housekeeping events from
// 5. plegCh: a channel to read PLEG updates from
//
// Events are also read from the kubelet liveness manager's update channel.
//
// The workflow is to read from one of the channels, handle that event, and
// update the timestamp in the sync loop monitor.
//
// Here is an appropriate place to note that despite the syntactical
// similarity to the switch statement, the case statements in a select are
// evaluated in a pseudorandom order if there are multiple channels ready to
// read from when the select is evaluated. In other words, case statements
// are evaluated in random order, and you can not assume that the case
// statements evaluate in order if multiple channels have events.
//
// With that in mind, in truly no particular order, the different channels
// are handled as follows:
//
// - configCh: dispatch the pods for the config change to the appropriate
// handler callback for the event type
// - plegCh: update the runtime cache; sync pod
// - syncCh: sync all pods waiting for sync
// - housekeepingCh: trigger cleanup of pods
// - health manager: sync pods that have failed or in which one or more
// containers have failed health checks
At this point the kubelet prints the following log, which shows that it has received the request to add a Pod and is calling the corresponding handler:
Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269224 3390574 kubelet.go:2206] "SyncLoop ADD" source="api" pods="[default/dp-test-5986bbcc8f-gcrmm]"
The corresponding code:
func (kl *Kubelet) syncLoopIteration(ctx context.Context, configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
select {
case u, open := <-configCh:
// Update from a config source; dispatch it to the right handler
// callback.
if !open {
klog.ErrorS(nil, "Update channel is closed, exiting the sync loop")
return false
}
switch u.Op {
case kubetypes.ADD:
klog.V(2).InfoS("SyncLoop ADD", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
// After restarting, kubelet will get all existing pods through
// ADD as if they are new pods. These pods will then go through the
// admission process and *may* be rejected. This can be resolved
// once we have checkpointing.
handler.HandlePodAdditions(u.Pods)
case kubetypes.UPDATE:
klog.V(2).InfoS("SyncLoop UPDATE", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
handler.HandlePodUpdates(u.Pods)
case kubetypes.REMOVE:
klog.V(2).InfoS("SyncLoop REMOVE", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
handler.HandlePodRemoves(u.Pods)
case kubetypes.RECONCILE:
klog.V(4).InfoS("SyncLoop RECONCILE", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
handler.HandlePodReconcile(u.Pods)
case kubetypes.DELETE:
klog.V(2).InfoS("SyncLoop DELETE", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
// DELETE is treated as a UPDATE because of graceful deletion.
handler.HandlePodUpdates(u.Pods)
case kubetypes.SET:
// TODO: Do we want to support this?
klog.ErrorS(nil, "Kubelet does not support snapshot update")
default:
klog.ErrorS(nil, "Invalid operation type received", "operation", u.Op)
}
// (part of the code omitted)
}
return true
}
Each event type has its own handler; they are not examined in depth here.
- Add Pod: HandlePodAdditions
- Update Pod, Delete Pod: HandlePodUpdates
- Remove Pod: HandlePodRemoves
- Reconcile Pod: HandlePodReconcile
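The mapping from operation to handler can be condensed into a small lookup. This is only a sketch: the `op` type and `handlerFor` function are hypothetical stand-ins, and the function returns the handler name as a string instead of invoking a real SyncHandler.

```go
package main

import "fmt"

// op is a hypothetical stand-in for the operation carried by a PodUpdate.
type op string

const (
	opAdd       op = "ADD"
	opUpdate    op = "UPDATE"
	opDelete    op = "DELETE"
	opRemove    op = "REMOVE"
	opReconcile op = "RECONCILE"
	opSet       op = "SET"
)

// handlerFor returns the name of the SyncHandler callback that the switch in
// syncLoopIteration selects for an operation. It reports false for
// operations that have no handler (SET and anything unknown).
func handlerFor(o op) (string, bool) {
	switch o {
	case opAdd:
		return "HandlePodAdditions", true
	case opUpdate, opDelete:
		// DELETE is treated as an UPDATE because of graceful deletion.
		return "HandlePodUpdates", true
	case opRemove:
		return "HandlePodRemoves", true
	case opReconcile:
		return "HandlePodReconcile", true
	default:
		return "", false
	}
}

func main() {
	for _, o := range []op{opAdd, opUpdate, opDelete, opRemove, opReconcile, opSet} {
		name, ok := handlerFor(o)
		fmt.Printf("%-9s -> %q (handled=%v)\n", o, name, ok)
	}
}
```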
For kubetypes.ADD events, the handler is HandlePodAdditions:
// HandlePodAdditions is the callback in SyncHandler for pods being added from
// a config source.
func (kl *Kubelet) HandlePodAdditions(pods []*v1.Pod) {
start := kl.clock.Now()
sort.Sort(sliceutils.PodsByCreationTime(pods))
for _, pod := range pods {
existingPods := kl.podManager.GetPods()
// Always add the pod to the pod manager. Kubelet relies on the pod
// manager as the source of truth for the desired state. If a pod does
// not exist in the pod manager, it means that it has been deleted in
// the apiserver and no action (other than cleanup) is required.
kl.podManager.AddPod(pod)
// Mirror Pods (the API server representation of static Pods) are handled by handleMirrorPod
if kubetypes.IsMirrorPod(pod) {
kl.handleMirrorPod(pod, start)
continue
}
// Only go through the admission process if the pod is not requested
// for termination by another part of the kubelet. If the pod is already
// using resources (previously admitted), the pod worker is going to be
// shutting it down. If the pod hasn't started yet, we know that when
// the pod worker is invoked it will also avoid setting up the pod, so
// we simply avoid doing any work.
if !kl.podWorkers.IsPodTerminationRequested(pod.UID) {
// We failed pods that we rejected, so activePods include all admitted
// pods that are alive.
activePods := kl.filterOutInactivePods(existingPods)
// Check if we can admit the pod; if not, reject it.
if ok, reason, message := kl.canAdmitPod(activePods, pod); !ok {
kl.rejectPod(pod, reason, message)
continue
}
}
mirrorPod, _ := kl.podManager.GetMirrorPodByPod(pod)
kl.dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)
}
}
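Only Pods that have not been requested to terminate go through the admission logic, which lives in canAdmitPod. That method runs the Admit logic of each of the registered admitHandlers, and the following log is printed: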
Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269248 3390574 topology_manager.go:210] "Topology Admit Handler" podUID=dd780355-f908-4bf2-9193-68dabbbca69e podNamespace="default" podName="dp-test-5986bbcc8f-gcrmm"
The corresponding code:
func (m *manager) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
klog.InfoS("Topology Admit Handler", "podUID", attrs.Pod.UID, "podNamespace", attrs.Pod.Namespace, "podName", attrs.Pod.Name)
pod := attrs.Pod
return m.scope.Admit(pod)
}
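The idea of running a Pod through a chain of admit handlers can be sketched as follows. All names here (`admitResult`, `admitHandler`, `canAdmit`) are hypothetical, and the "first rejection wins" behavior is this sketch's assumption about how such a chain is evaluated, since canAdmitPod itself is not shown in this article.

```go
package main

import "fmt"

// admitResult is a hypothetical, simplified admission verdict.
type admitResult struct {
	Admit   bool
	Reason  string
	Message string
}

// admitHandler is a hypothetical handler that inspects a pod by name.
type admitHandler func(podName string) admitResult

// canAdmit runs the handlers in order and stops at the first rejection.
func canAdmit(handlers []admitHandler, podName string) (bool, string, string) {
	for _, h := range handlers {
		if r := h(podName); !r.Admit {
			return false, r.Reason, r.Message
		}
	}
	return true, "", ""
}

func main() {
	allow := func(string) admitResult { return admitResult{Admit: true} }
	deny := func(name string) admitResult {
		return admitResult{Reason: "OutOfExampleResource", Message: "no capacity for " + name}
	}

	ok, reason, msg := canAdmit([]admitHandler{allow, deny, allow}, "dp-test")
	fmt.Println(ok, reason, msg)
}
```

In HandlePodAdditions above, a rejection leads to rejectPod and the Pod is never dispatched to a worker.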
The Admit method first works out which resources need to be allocated (for example, network interfaces) from container.Resources.Limits. The log is:
Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269261 3390574 manager.go:854] "Looking for needed resources" needed=1 resourceName="xxx/eni-ip"
The corresponding code:
// allocateContainerResources attempts to allocate all of required device
// plugin resources for the input container, issues an Allocate rpc request
// for each new device resource requirement, processes their AllocateResponses,
// and updates the cached containerDevices on success.
func (m *ManagerImpl) allocateContainerResources(pod *v1.Pod, container *v1.Container, devicesToReuse map[string]sets.String) error {
podUID := string(pod.UID)
contName := container.Name
allocatedDevicesUpdated := false
needsUpdateCheckpoint := false
// Extended resources are not allowed to be overcommitted.
// Since device plugin advertises extended resources,
// therefore Requests must be equal to Limits and iterating
// over the Limits should be sufficient.
for k, v := range container.Resources.Limits {
resource := string(k)
needed := int(v.Value())
klog.V(3).InfoS("Looking for needed resources", "needed", needed, "resourceName", resource)
if !m.isDevicePluginResource(resource) {
continue
}
// Updates allocatedDevices to garbage collect any stranded resources
// before doing the device plugin allocation.
if !allocatedDevicesUpdated {
m.UpdateAllocatedDevices()
allocatedDevicesUpdated = true
}
var allocDevices sets.String
var err error
contDevicesScheduledKey := fmt.Sprintf("%s_container-%s", resource, contName)
contDevicesScheduled := pod.Annotations[contDevicesScheduledKey]
if contDevicesScheduled != "" {
allocDevices, err = m.devicesToAllocateFromScheduler(podUID, contName, resource, contDevicesScheduled, needed, devicesToReuse[resource])
} else {
allocDevices, err = m.devicesToAllocate(podUID, contName, resource, needed, devicesToReuse[resource])
}
if err != nil {
return err
}
if allocDevices == nil || len(allocDevices) <= 0 {
continue
}
// (part of the code omitted)
m.podDevices.insert(podUID, contName, resource, allocDevicesWithNUMA, resp.ContainerResponses[0])
}
if needsUpdateCheckpoint {
return m.writeCheckpoint()
}
return nil
}
Next, devicesToAllocate is called to pick the devices to allocate. The next log line is:
Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269321 3390574 manager.go:621] "Need devices to allocate for pod" deviceNumber=1 resourceName="xxx/eni-ip" podUID="dd780355-f908-4bf2-9193-68dabbbca69e" containerName="nginx"
The corresponding code:
// Returns list of device Ids we need to allocate with Allocate rpc call.
// Returns empty list in case we don't need to issue the Allocate rpc call.
func (m *ManagerImpl) devicesToAllocate(podUID, contName, resource string, required int, reusableDevices sets.String) (sets.String, error) {
m.mutex.Lock()
defer m.mutex.Unlock()
needed := required
// Gets list of devices that have already been allocated.
// This can happen if a container restarts for example.
devices := m.podDevices.containerDevices(podUID, contName, resource)
if devices != nil {
klog.V(3).InfoS("Found pre-allocated devices for resource on pod", "resourceName", resource, "containerName", contName, "podUID", string(podUID), "devices", devices.List())
needed = needed - devices.Len()
// A pod's resource is not expected to change once admitted by the API server,
// so just fail loudly here. We can revisit this part if this no longer holds.
if needed != 0 {
return nil, fmt.Errorf("pod %q container %q changed request for resource %q from %d to %d", string(podUID), contName, resource, devices.Len(), required)
}
}
// We have 3 major flows to handle:
// 1. kubelet running, normal allocation (needed > 0, container being [re]created). Steady state and most common case by far and large.
// 2. kubelet restart. In this scenario every other component of the stack (device plugins, app container, runtime) is still running.
// 3. node reboot. In this scenario device plugins may not be running yet when we try to allocate devices.
// note: if we get this far the runtime is surely running. This is usually enforced at OS level by startup system services dependencies.
// First we take care of the exceptional flow (scenarios 2 and 3). In both flows, kubelet is reinitializing, and while kubelet is initializing, sources are NOT all ready.
// Is this a simple kubelet restart (scenario 2)? To distinguish, we use the informations we got for runtime. If we are asked to allocate devices for containers reported
// running, then it can only be a kubelet restart. On node reboot the runtime and the containers were also shut down. Then, if the container was running, it can only be
// because it already has access to all the required devices, so we got nothing to do and we can bail out.
if !m.sourcesReady.AllReady() && m.isContainerAlreadyRunning(podUID, contName) {
klog.V(3).InfoS("container detected running, nothing to do", "deviceNumber", needed, "resourceName", resource, "podUID", string(podUID), "containerName", contName)
return nil, nil
}
// We dealt with scenario 2. If we got this far it's either scenario 3 (node reboot) or scenario 1 (steady state, normal flow).
klog.V(3).InfoS("Need devices to allocate for pod", "deviceNumber", needed, "resourceName", resource, "podUID", string(podUID), "containerName", contName)
healthyDevices, hasRegistered := m.healthyDevices[resource]
// (part of the code omitted)
// Filters available Devices based on NUMA affinity.
aligned, unaligned, noAffinity := m.filterByAffinity(podUID, contName, resource, available)
// If we can allocate all remaining devices from the set of aligned ones, then
// give the plugin the chance to influence which ones to allocate from that set.
if needed < aligned.Len() {
// First allocate from the preferred devices list (if available).
preferred, err := m.callGetPreferredAllocationIfAvailable(podUID, contName, resource, aligned.Union(allocated), allocated, required)
if err != nil {
return nil, err
}
if allocateRemainingFrom(preferred.Intersection(aligned)) {
return allocated, nil
}
// Then fallback to allocate from the aligned set if no preferred list
// is returned (or not enough devices are returned in that list).
if allocateRemainingFrom(aligned) {
return allocated, nil
}
return nil, fmt.Errorf("unexpectedly allocated less resources than required. Requested: %d, Got: %d", required, required-needed)
}
// If we can't allocate all remaining devices from the set of aligned ones,
// then start by first allocating all of the aligned devices (to ensure
// that the alignment guaranteed by the TopologyManager is honored).
if allocateRemainingFrom(aligned) {
return allocated, nil
}
// Then give the plugin the chance to influence the decision on any
// remaining devices to allocate.
preferred, err := m.callGetPreferredAllocationIfAvailable(podUID, contName, resource, available.Union(allocated), allocated, required)
if err != nil {
return nil, err
}
if allocateRemainingFrom(preferred.Intersection(available)) {
return allocated, nil
}
// Finally, if the plugin did not return a preferred allocation (or didn't
// return a large enough one), then fall back to allocating the remaining
// devices from the 'unaligned' and 'noAffinity' sets.
if allocateRemainingFrom(unaligned) {
return allocated, nil
}
if allocateRemainingFrom(noAffinity) {
return allocated, nil
}
return nil, fmt.Errorf("unexpectedly allocated less resources than required. Requested: %d, Got: %d", required, required-needed)
}
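The priority order in the tail of devicesToAllocate (aligned devices first, then unaligned, then devices without affinity) can be illustrated with a much simplified sketch. The `allocateInOrder` helper is hypothetical: it uses plain slices instead of sets, ignores the plugin's preferred allocation, and does not track already allocated devices, so it only models the ordering idea.

```go
package main

import "fmt"

// allocateInOrder picks `needed` distinct device IDs, exhausting each group
// before moving on to the next one. Groups are given in priority order.
func allocateInOrder(needed int, groups ...[]string) ([]string, error) {
	if needed <= 0 {
		return nil, nil
	}
	allocated := make([]string, 0, needed)
	seen := map[string]bool{}
	for _, group := range groups {
		for _, id := range group {
			if seen[id] {
				continue // already picked from a higher-priority group
			}
			seen[id] = true
			allocated = append(allocated, id)
			if len(allocated) == needed {
				return allocated, nil
			}
		}
	}
	return nil, fmt.Errorf("allocated less than required: requested %d, got %d", needed, len(allocated))
}

func main() {
	aligned := []string{"dev-0"}
	unaligned := []string{"dev-1", "dev-2"}
	noAffinity := []string{"dev-3"}

	devs, err := allocateInOrder(2, aligned, unaligned, noAffinity)
	fmt.Println(devs, err)
}
```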
If the device plugin does not enable GetPreferredAllocation, the following log is printed:
Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269375 3390574 manager.go:1041] "Plugin options indicate to skip GetPreferredAllocation for resource" resourceName="xxx/eni-ip"
The corresponding code:
// callGetPreferredAllocationIfAvailable issues GetPreferredAllocation grpc
// call for device plugin resource with GetPreferredAllocationAvailable option set.
func (m *ManagerImpl) callGetPreferredAllocationIfAvailable(podUID, contName, resource string, available, mustInclude sets.String, size int) (sets.String, error) {
eI, ok := m.endpoints[resource]
if !ok {
return nil, fmt.Errorf("endpoint not found in cache for a registered resource: %s", resource)
}
if eI.opts == nil || !eI.opts.GetPreferredAllocationAvailable {
klog.V(4).InfoS("Plugin options indicate to skip GetPreferredAllocation for resource", "resourceName", resource)
return nil, nil
}
m.mutex.Unlock()
klog.V(4).InfoS("Issuing a GetPreferredAllocation call for container", "containerName", contName, "podUID", string(podUID))
resp, err := eI.e.getPreferredAllocation(available.UnsortedList(), mustInclude.UnsortedList(), size)
m.mutex.Lock()
if err != nil {
return nil, fmt.Errorf("device plugin GetPreferredAllocation rpc failed with err: %v", err)
}
if resp != nil && len(resp.ContainerResponses) > 0 {
return sets.NewString(resp.ContainerResponses[0].DeviceIDs...), nil
}
return sets.NewString(), nil
}
Next comes the logic that actually allocates the devices, which prints:
Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269402 3390574 manager.go:909] "Making allocation request for device plugin" devices=[eni-ip-121] resourceName="xxx/eni-ip"
The corresponding code is in the allocateContainerResources method:
func (m *ManagerImpl) allocateContainerResources(pod *v1.Pod, container *v1.Container, devicesToReuse map[string]sets.String) error {
// (part of the code omitted)
devs := allocDevices.UnsortedList()
// TODO: refactor this part of code to just append a ContainerAllocationRequest
// in a passed in AllocateRequest pointer, and issues a single Allocate call per pod.
klog.V(3).InfoS("Making allocation request for device plugin", "devices", devs, "resourceName", resource)
resp, err := eI.e.allocate(devs)
metrics.DevicePluginAllocationDuration.WithLabelValues(resource).Observe(metrics.SinceInSeconds(startRPCTime))
if err != nil {
// In case of allocation failure, we want to restore m.allocatedDevices
// to the actual allocated state from m.podDevices.
m.mutex.Lock()
m.allocatedDevices = m.podDevices.devices()
m.mutex.Unlock()
return err
}
if len(resp.ContainerResponses) == 0 {
return fmt.Errorf("no containers return in allocation response %v", resp)
}
// (part of the code omitted)
}
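⚠️ Note: the metric device_plugin_alloc_duration_seconds shows how long device allocation took. Unfortunately it is labeled only with the resource name and carries no Pod or container information.
An RPC is made here to allocate the required devices: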
func (c *devicePluginClient) Allocate(ctx context.Context, in *AllocateRequest, opts ...grpc.CallOption) (*AllocateResponse, error) {
out := new(AllocateResponse)
err := c.cc.Invoke(ctx, "/v1beta1.DevicePlugin/Allocate", in, out, opts...)
if err != nil {
return nil, err
}
return out, nil
}
Once resource allocation is done, the Pod is handed to the corresponding worker through dispatchWork:
kl.dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)
At this point a log line like the following appears:
Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273776 3390574 pod_workers.go:585] "Pod is being synced for the first time" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e
The corresponding code:
func (p *podWorkers) UpdatePod(options UpdatePodOptions) {
// (part of the code omitted)
// decide what to do with this pod - we are either setting it up, tearing it down, or ignoring it
now := time.Now()
status, ok := p.podSyncStatuses[uid]
if !ok {
klog.V(4).InfoS("Pod is being synced for the first time", "pod", klog.KObj(pod), "podUID", pod.UID)
status = &podSyncStatus{
syncedAt: now,
fullname: kubecontainer.GetPodFullName(pod),
}
// if this pod is being synced for the first time, we need to make sure it is an active pod
if !isRuntimePod && (pod.Status.Phase == v1.PodFailed || pod.Status.Phase == v1.PodSucceeded) {
// check to see if the pod is not running and the pod is terminal.
// If this succeeds then record in the podWorker that it is terminated.
if statusCache, err := p.podCache.Get(pod.UID); err == nil {
if isPodStatusCacheTerminal(statusCache) {
status = &podSyncStatus{
terminatedAt: now,
terminatingAt: now,
syncedAt: now,
startedTerminating: true,
finished: true,
fullname: kubecontainer.GetPodFullName(pod),
}
}
}
}
p.podSyncStatuses[uid] = status
}
// (part of the code omitted)
}
For a newly created Pod, a goroutine (the Pod worker) is started to handle it. Its main logic is managePodLoop, which receives requests through outCh:
// Creating a new pod worker either means this is a new pod, or that the
// kubelet just restarted. In either case the kubelet is willing to believe
// the status of the pod for the first pod worker sync. See corresponding
// comment in syncPod.
go func() {
defer runtime.HandleCrash()
p.managePodLoop(outCh)
}()
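The "one goroutine per Pod, fed through a channel" pattern can be reduced to the sketch below. It is a simplified model and not the kubelet's implementation: the `workers` type, its methods, and the `handled` slice are hypothetical, and the real pod workers carry far more state.

```go
package main

import (
	"fmt"
	"sync"
)

// podWork is a hypothetical, simplified unit of work for one Pod.
type podWork struct {
	UID        string
	UpdateType string
}

// workers starts one goroutine per Pod UID and feeds it through a channel.
type workers struct {
	mu      sync.Mutex
	updates map[string]chan podWork
	handled []string // record of processed work, guarded by mu
	wg      sync.WaitGroup
}

func newWorkers() *workers {
	return &workers{updates: map[string]chan podWork{}}
}

// updatePod lazily creates the worker goroutine for the Pod, then hands the
// work over through that Pod's channel. It must not be called after stop.
func (w *workers) updatePod(work podWork) {
	w.mu.Lock()
	ch, ok := w.updates[work.UID]
	if !ok {
		ch = make(chan podWork, 1)
		w.updates[work.UID] = ch
		w.wg.Add(1)
		go func() {
			defer w.wg.Done()
			w.managePodLoop(ch)
		}()
	}
	w.mu.Unlock()
	ch <- work
}

// managePodLoop processes the work items of a single Pod in arrival order.
func (w *workers) managePodLoop(ch <-chan podWork) {
	for work := range ch {
		w.mu.Lock()
		w.handled = append(w.handled, work.UID+":"+work.UpdateType)
		w.mu.Unlock()
	}
}

// stop closes every channel and waits for all workers to drain and exit.
func (w *workers) stop() {
	w.mu.Lock()
	for _, ch := range w.updates {
		close(ch)
	}
	w.mu.Unlock()
	w.wg.Wait()
}

func main() {
	w := newWorkers()
	w.updatePod(podWork{UID: "pod-a", UpdateType: "create"})
	w.updatePod(podWork{UID: "pod-b", UpdateType: "create"})
	w.updatePod(podWork{UID: "pod-a", UpdateType: "update"})
	w.stop()
	fmt.Println(len(w.updates), "workers handled", len(w.handled), "updates")
}
```

Because each Pod has its own channel and goroutine, updates for the same Pod are processed in order while different Pods proceed independently.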
When a request arrives on outCh, the following log is printed:
Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273812 3390574 pod_workers.go:902] "Processing pod event" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e updateType=0
Code location:
func (p *podWorkers) managePodLoop(podUpdates <-chan podWork) {
// (part of the code omitted)
klog.V(4).InfoS("Processing pod event", "pod", klog.KObj(pod), "podUID", pod.UID, "updateType", update.WorkType)
var isTerminal bool
err := func() error {
// The worker is responsible for ensuring the sync method sees the appropriate
// status updates on resyncs (the result of the last sync), transitions to
// terminating (no wait), or on terminated (whatever the most recent state is).
// Only syncing and terminating can generate pod status changes, while terminated
// pods ensure the most recent status makes it to the api server.
var status *kubecontainer.PodStatus
var err error
switch {
case update.Options.RunningPod != nil:
// when we receive a running pod, we don't need status at all
default:
// wait until we see the next refresh from the PLEG via the cache (max 2s)
// TODO: this adds ~1s of latency on all transitions from sync to terminating
// to terminated, and on all termination retries (including evictions). We should
// improve latency by making the pleg continuous and by allowing pod status
// changes to be refreshed when key events happen (killPod, sync->terminating).
// Improving this latency also reduces the possibility that a terminated
// container's status is garbage collected before we have a chance to update the
// API server (thus losing the exit code).
status, err = p.podCache.GetNewerThan(pod.UID, lastSyncTime)
}
if err != nil {
// This is the legacy event thrown by manage pod loop all other events are now dispatched
// from syncPodFn
p.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedSync, "error determining status: %v", err)
return err
}
ctx := p.contextForWorker(pod.UID)
// Take the appropriate action (illegal phases are prevented by UpdatePod)
switch {
case update.WorkType == TerminatedPodWork:
err = p.syncTerminatedPodFn(ctx, pod, status)
case update.WorkType == TerminatingPodWork:
var gracePeriod *int64
if opt := update.Options.KillPodOptions; opt != nil {
gracePeriod = opt.PodTerminationGracePeriodSecondsOverride
}
podStatusFn := p.acknowledgeTerminating(pod)
err = p.syncTerminatingPodFn(ctx, pod, status, update.Options.RunningPod, gracePeriod, podStatusFn)
default:
isTerminal, err = p.syncPodFn(ctx, update.Options.UpdateType, pod, update.Options.MirrorPod, status)
}
lastSyncTime = time.Now()
return err
}()
// (part of the code omitted)
}
Every work type except the following two is handled by syncPodFn:
- TerminatedPodWork
- TerminatingPodWork
syncPodFn first prints the following log:
Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273831 3390574 kubelet.go:1620] "syncPod enter" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e
The corresponding code is shown below (syncPodFn holds the main logic for syncing a Pod's state; see the comment for details):
// syncPod is the transaction script for the sync of a single pod (setting up)
// a pod. This method is reentrant and expected to converge a pod towards the
// desired state of the spec. The reverse (teardown) is handled in
// syncTerminatingPod and syncTerminatedPod. If syncPod exits without error,
// then the pod runtime state is in sync with the desired configuration state
// (pod is running). If syncPod exits with a transient error, the next
// invocation of syncPod is expected to make progress towards reaching the
// runtime state. syncPod exits with isTerminal when the pod was detected to
// have reached a terminal lifecycle phase due to container exits (for
// RestartNever or RestartOnFailure) and the next method invoked will by
// syncTerminatingPod.
//
// Arguments:
//
// updateType - whether this is a create (first time) or an update, should
//
// only be used for metrics since this method must be reentrant
//
// pod - the pod that is being set up
// mirrorPod - the mirror pod known to the kubelet for this pod, if any
// podStatus - the most recent pod status observed for this pod which can
//
// be used to determine the set of actions that should be taken during
// this loop of syncPod
//
// The workflow is:
// - If the pod is being created, record pod worker start latency
// - Call generateAPIPodStatus to prepare an v1.PodStatus for the pod
// - If the pod is being seen as running for the first time, record pod
// start latency
// - Update the status of the pod in the status manager
// - Stop the pod's containers if it should not be running due to soft
// admission
// - Ensure any background tracking for a runnable pod is started
// - Create a mirror pod if the pod is a static pod, and does not
// already have a mirror pod
// - Create the data directories for the pod if they do not exist
// - Wait for volumes to attach/mount
// - Fetch the pull secrets for the pod
// - Call the container runtime's SyncPod callback
// - Update the traffic shaping for the pod's ingress and egress limits
//
// If any step of this workflow errors, the error is returned, and is repeated
// on the next syncPod call.
//
// This operation writes all events that are dispatched in order to provide
// the most accurate information possible about an error situation to aid debugging.
// Callers should not write an event if this operation returns an error.
func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
// TODO(#113606): connect this with the incoming context parameter, which comes from the pod worker.
// Currently, using that context causes test failures.
ctx := context.TODO()
klog.V(4).InfoS("syncPod enter", "pod", klog.KObj(pod), "podUID", pod.UID)
defer func() {
klog.V(4).InfoS("syncPod exit", "pod", klog.KObj(pod), "podUID", pod.UID, "isTerminal", isTerminal)
}()
// Latency measurements for the main workflow are relative to the
// first time the pod was seen by kubelet.
var firstSeenTime time.Time
if firstSeenTimeStr, ok := pod.Annotations[kubetypes.ConfigFirstSeenAnnotationKey]; ok {
firstSeenTime = kubetypes.ConvertToTimestamp(firstSeenTimeStr).Get()
}
// Record pod worker start latency if being created
// TODO: make pod workers record their own latencies
if updateType == kubetypes.SyncPodCreate {
if !firstSeenTime.IsZero() {
// This is the first time we are syncing the pod. Record the latency
// since kubelet first saw the pod if firstSeenTime is set.
metrics.PodWorkerStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
} else {
klog.V(3).InfoS("First seen time not recorded for pod",
"podUID", pod.UID,
"pod", klog.KObj(pod))
}
}
// (part of the code omitted)
}
The metric pod_worker_start_duration_seconds recorded here measures the time from when the kubelet first saw the Pod to its first syncPod call.
In the next log line, the kubelet generates the Pod status:
Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273841 3390574 kubelet_pods.go:1461] "Generating pod status" pod="default/dp-test-5986bbcc8f-gcrmm"
The corresponding code:
// generateAPIPodStatus creates the final API pod status for a pod, given the
// internal pod status. This method should only be called from within sync*Pod methods.
func (kl *Kubelet) generateAPIPodStatus(pod *v1.Pod, podStatus *kubecontainer.PodStatus) v1.PodStatus {
klog.V(3).InfoS("Generating pod status", "pod", klog.KObj(pod))
// (part of the code omitted)
}
A newly created Pod always starts in the Pending phase. The corresponding log:
Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273867 3390574 kubelet_pods.go:1426] "Pod waiting > 0, pending"
Computing the Pod's phase
The Pod's phase is computed from the states of all containers in the Pod:
// getPhase returns the phase of a pod given its container info.
func getPhase(spec *v1.PodSpec, info []v1.ContainerStatus) v1.PodPhase {
pendingInitialization := 0
failedInitialization := 0
for _, container := range spec.InitContainers {
containerStatus, ok := podutil.GetContainerStatus(info, container.Name)
if !ok {
pendingInitialization++
continue
}
switch {
case containerStatus.State.Running != nil:
pendingInitialization++
case containerStatus.State.Terminated != nil:
if containerStatus.State.Terminated.ExitCode != 0 {
failedInitialization++
}
case containerStatus.State.Waiting != nil:
if containerStatus.LastTerminationState.Terminated != nil {
if containerStatus.LastTerminationState.Terminated.ExitCode != 0 {
failedInitialization++
}
} else {
pendingInitialization++
}
default:
pendingInitialization++
}
}
unknown := 0
running := 0
waiting := 0
stopped := 0
succeeded := 0
for _, container := range spec.Containers {
containerStatus, ok := podutil.GetContainerStatus(info, container.Name)
if !ok {
unknown++
continue
}
switch {
case containerStatus.State.Running != nil:
running++
case containerStatus.State.Terminated != nil:
stopped++
if containerStatus.State.Terminated.ExitCode == 0 {
succeeded++
}
case containerStatus.State.Waiting != nil:
if containerStatus.LastTerminationState.Terminated != nil {
stopped++
} else {
waiting++
}
default:
unknown++
}
}
if failedInitialization > 0 && spec.RestartPolicy == v1.RestartPolicyNever {
return v1.PodFailed
}
switch {
case pendingInitialization > 0:
fallthrough
case waiting > 0:
klog.V(5).InfoS("Pod waiting > 0, pending")
// One or more containers has not been started
return v1.PodPending
case running > 0 && unknown == 0:
// All containers have been started, and at least
// one container is running
return v1.PodRunning
case running == 0 && stopped > 0 && unknown == 0:
// All containers are terminated
if spec.RestartPolicy == v1.RestartPolicyAlways {
// All containers are in the process of restarting
return v1.PodRunning
}
if stopped == succeeded {
// RestartPolicy is not Always, and all
// containers are terminated in success
return v1.PodSucceeded
}
if spec.RestartPolicy == v1.RestartPolicyNever {
// RestartPolicy is Never, and all containers are
// terminated with at least one in failure
return v1.PodFailed
}
// RestartPolicy is OnFailure, and at least one in failure
// and in the process of restarting
return v1.PodRunning
default:
klog.V(5).InfoS("Pod default case, pending")
return v1.PodPending
}
}
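In summary, the rules are evaluated in this order:
-
failedInitialization is greater than 0 and RestartPolicy is Never: the Pod phase is Failed
-
pendingInitialization is greater than 0, or the number of waiting containers is greater than 0: the Pod phase is Pending
-
At least one container is running and no container is in the unknown state: the Pod phase is Running
-
No container is running, at least one container has stopped, and no container is in the unknown state. In this case:
-
If RestartPolicy is Always: the Pod phase is Running
-
Otherwise, if the number of stopped containers equals the number of succeeded containers: the Pod phase is Succeeded
-
Otherwise, if RestartPolicy is Never: the Pod phase is Failed
-
Otherwise (RestartPolicy is OnFailure and at least one container failed): the Pod phase is Running
-
In all other cases: the Pod phase is Pending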
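To make the decision order easier to check, here is a minimal sketch that reproduces only the final switch of getPhase on pre-computed counters. The types `Phase`, `RestartPolicy`, `counts` and the function `phaseFromCounts` are hypothetical stand-ins invented for this illustration; they are not part of the Kubernetes API.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes API types (hypothetical, illustration only).
type Phase string
type RestartPolicy string

const (
	Pending   Phase = "Pending"
	Running   Phase = "Running"
	Succeeded Phase = "Succeeded"
	Failed    Phase = "Failed"

	Always    RestartPolicy = "Always"
	OnFailure RestartPolicy = "OnFailure"
	Never     RestartPolicy = "Never"
)

// counts holds the tallies that getPhase builds while walking the container statuses.
type counts struct {
	pendingInit, failedInit int
	unknown, running        int
	waiting, stopped        int
	succeeded               int
}

// phaseFromCounts follows the same decision order as the tail of getPhase.
func phaseFromCounts(c counts, policy RestartPolicy) Phase {
	if c.failedInit > 0 && policy == Never {
		return Failed
	}
	switch {
	case c.pendingInit > 0, c.waiting > 0:
		// At least one container has not been started yet.
		return Pending
	case c.running > 0 && c.unknown == 0:
		return Running
	case c.running == 0 && c.stopped > 0 && c.unknown == 0:
		if policy == Always {
			return Running // containers are in the process of restarting
		}
		if c.stopped == c.succeeded {
			return Succeeded
		}
		if policy == Never {
			return Failed
		}
		return Running // OnFailure: the failed containers will be restarted
	default:
		return Pending
	}
}

func main() {
	fmt.Println(phaseFromCounts(counts{waiting: 1}, Always))
	fmt.Println(phaseFromCounts(counts{stopped: 2, succeeded: 2}, OnFailure))
	fmt.Println(phaseFromCounts(counts{stopped: 2, succeeded: 1}, Never))
}
```

Keeping the counters separate from the decision makes it simple to exercise each branch in isolation, which is handy when reasoning about why a Pod is stuck in Pending.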
The metric pod_start_duration_seconds records the time from when kubelet first sees the Pod until the Pod is running.
Once the Pod phase has been computed, the following log line is printed:
Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273874 3390574 kubelet_pods.go:1471] "Got phase for pod" pod="default/dp-test-5986bbcc8f-gcrmm" oldPhase=Pending phase=Pending
With the Pod status computed, the next step is to store it. The corresponding code:
func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
// (other parts omitted)
kl.statusManager.SetPodStatus(pod, apiPodStatus)
// (other parts omitted)
}
SetPodStatus calls updateStatusInternal to update the Pod status:
func (m *manager) SetPodStatus(pod *v1.Pod, status v1.PodStatus) {
m.podStatusesLock.Lock()
defer m.podStatusesLock.Unlock()
// Make sure we're caching a deep copy.
status = *status.DeepCopy()
// Force a status update if deletion timestamp is set. This is necessary
// because if the pod is in the non-running state, the pod worker still
// needs to be able to trigger an update and/or deletion.
m.updateStatusInternal(pod, status, pod.DeletionTimestamp != nil)
}
Besides updating the Pod status in the local cache, updateStatusInternal also enqueues a request to be sent to the API server. The method prints the following log line:
Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273910 3390574 status_manager.go:552] "updateStatusInternal" version=1 pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e containers="(nginx state=waiting previous=<none>)"
The corresponding code:
func (m *manager) updateStatusInternal(pod *v1.Pod, status v1.PodStatus, forceUpdate bool) bool {
// (parts omitted)
// Perform some more extensive logging of container termination state to assist in
// debugging production races (generally not needed).
if klogV := klog.V(5); klogV.Enabled() {
var containers []string
for _, s := range append(append([]v1.ContainerStatus(nil), status.InitContainerStatuses...), status.ContainerStatuses...) {
var current, previous string
switch {
case s.State.Running != nil:
current = "running"
case s.State.Waiting != nil:
current = "waiting"
case s.State.Terminated != nil:
current = fmt.Sprintf("terminated=%d", s.State.Terminated.ExitCode)
default:
current = "unknown"
}
switch {
case s.LastTerminationState.Running != nil:
previous = "running"
case s.LastTerminationState.Waiting != nil:
previous = "waiting"
case s.LastTerminationState.Terminated != nil:
previous = fmt.Sprintf("terminated=%d", s.LastTerminationState.Terminated.ExitCode)
default:
previous = "<none>"
}
containers = append(containers, fmt.Sprintf("(%s state=%s previous=%s)", s.Name, current, previous))
}
sort.Strings(containers)
klogV.InfoS("updateStatusInternal", "version", cachedStatus.version+1, "pod", klog.KObj(pod), "podUID", pod.UID, "containers", strings.Join(containers, " "))
}
// (parts omitted)
}
The request to update the Pod status in the API server is then added to a channel, and the following log line is printed:
Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273990 3390574 status_manager.go:582] "Status Manager: adding pod with new status to podStatusChannel" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e statusVersion=1 status={Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:192.168.0.9 PodIP: PodIPs:[] StartTime:2024-06-15 15:18:59 +0800 CST InitContainerStatuses:[] ContainerStatuses:[{Name:nginx State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:nginx:latest ImageID: ContainerID: Started:0xc001ed178c}] QOSClass:BestEffort EphemeralContainerStatuses:[]}
The corresponding code:
func (m *manager) updateStatusInternal(pod *v1.Pod, status v1.PodStatus, forceUpdate bool) bool {
// (parts omitted)
m.podStatuses[pod.UID] = newStatus
select {
case m.podStatusChannel <- podStatusSyncRequest{pod.UID, newStatus}:
klog.V(5).InfoS("Status Manager: adding pod with new status to podStatusChannel",
"pod", klog.KObj(pod),
"podUID", pod.UID,
"statusVersion", newStatus.version,
"status", newStatus.status)
return true
default:
// Let the periodic syncBatch handle the update if the channel is full.
// We can't block, since we hold the mutex lock.
klog.V(4).InfoS("Skipping the status update for pod for now because the channel is full",
"pod", klog.KObj(pod),
"status", status)
return false
}
// (parts omitted)
}
As shown, the request carries only two things, the UID and the new status:
-
pod.UID
-
newStatus
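The select with a default branch used above is Go's idiom for a non-blocking send: the caller holds a mutex, so it must not wait for room in the channel. A minimal sketch of the same pattern follows; `syncRequest` and `tryEnqueue` are hypothetical names chosen for this example.

```go
package main

import "fmt"

// syncRequest is a hypothetical stand-in for podStatusSyncRequest.
type syncRequest struct {
	podUID  string
	version uint64
}

// tryEnqueue attempts a non-blocking send and reports whether the request was queued.
func tryEnqueue(ch chan<- syncRequest, req syncRequest) bool {
	select {
	case ch <- req:
		return true
	default:
		// The channel is full: give up immediately instead of blocking the caller.
		return false
	}
}

func main() {
	ch := make(chan syncRequest, 1) // room for a single pending request
	fmt.Println(tryEnqueue(ch, syncRequest{podUID: "pod-a", version: 1}))
	fmt.Println(tryEnqueue(ch, syncRequest{podUID: "pod-b", version: 1}))
}
```

A dropped request is not lost in the kubelet case, because the status has already been written to the local cache and the periodic syncBatch picks it up later.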
Updating the Pod status in the API server
Accordingly, the next log line comes from the channel's handler (the Status Manager) receiving this request:
Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.274047 3390574 status_manager.go:186] "Status Manager: syncing pod with status from podStatusChannel" podUID=dd780355-f908-4bf2-9193-68dabbbca69e statusVersion=1 status={Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:192.168.0.9 PodIP: PodIPs:[] StartTime:2024-06-15 15:18:59 +0800 CST InitContainerStatuses:[] ContainerStatuses:[{Name:nginx State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:nginx:latest ImageID: ContainerID: Started:0xc001ed178c}] QOSClass:BestEffort EphemeralContainerStatuses:[]}
The corresponding code:
func (m *manager) Start() {
// Don't start the status manager if we don't have a client. This will happen
// on the master, where the kubelet is responsible for bootstrapping the pods
// of the master components.
if m.kubeClient == nil {
klog.InfoS("Kubernetes client is nil, not starting status manager")
return
}
klog.InfoS("Starting to sync pod status with apiserver")
//nolint:staticcheck // SA1015 Ticker can leak since this is only called once and doesn't handle termination.
syncTicker := time.NewTicker(syncPeriod).C
// syncPod and syncBatch share the same go routine to avoid sync races.
go wait.Forever(func() {
for {
select {
case syncRequest := <-m.podStatusChannel:
klog.V(5).InfoS("Status Manager: syncing pod with status from podStatusChannel",
"podUID", syncRequest.podUID,
"statusVersion", syncRequest.status.version,
"status", syncRequest.status.status)
m.syncPod(syncRequest.podUID, syncRequest.status)
case <-syncTicker:
klog.V(5).InfoS("Status Manager: syncing batch")
// remove any entries in the status channel since the batch will handle them
for i := len(m.podStatusChannel); i > 0; i-- {
<-m.podStatusChannel
}
m.syncBatch()
}
}
}, 0)
}
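The ticker branch above drains the channel before calling syncBatch, since the batch sync covers every Pod anyway. A rough sketch of that drain step is shown below; `drain` and `onTick` are hypothetical helpers written for this illustration, and the channel carries plain strings instead of real sync requests.

```go
package main

import "fmt"

// drain discards every request currently buffered in ch and returns the count.
// It is only safe when the caller is the sole consumer of ch; otherwise a
// receive could block after another goroutine has emptied the channel.
func drain(ch chan string) int {
	n := 0
	for i := len(ch); i > 0; i-- {
		<-ch
		n++
	}
	return n
}

// onTick models the ticker branch: drop queued single-pod requests, then let
// the batch sync (a hypothetical callback here) reconcile every pod.
func onTick(ch chan string, syncBatch func()) int {
	dropped := drain(ch)
	syncBatch()
	return dropped
}

func main() {
	ch := make(chan string, 8)
	ch <- "pod-a"
	ch <- "pod-b"
	dropped := onTick(ch, func() { fmt.Println("syncing batch") })
	fmt.Println("dropped:", dropped, "remaining:", len(ch))
}
```

Running both branches in the same goroutine, as the source comment notes, is what makes the drain safe: nothing else is receiving from the channel at the same time.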
The Status Manager runs in a goroutine and updates the status in the API server mainly through m.syncPod:
Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.285142 3390574 status_manager.go:719] "Patch status for pod" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e patch="{\"metadata\":{\"uid\":\"dd780355-f908-4bf2-9193-68dabbbca69e\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"lastProbeTime\":null,\"lastTransitionTime\":\"2024-06-15T07:18:59Z\",\"status\":\"True\",\"type\":\"Initialized\"},{\"lastProbeTime\":null,\"lastTransitionTime\":\"2024-06-15T07:18:59Z\",\"message\":\"containers with unready status: [nginx]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"Ready\"},{\"lastProbeTime\":null,\"lastTransitionTime\":\"2024-06-15T07:18:59Z\",\"message\":\"containers with unready status: [nginx]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"image\":\"nginx:latest\",\"imageID\":\"\",\"lastState\":{},\"name\":\"nginx\",\"ready\":false,\"restartCount\":0,\"started\":false,\"state\":{\"waiting\":{\"reason\":\"ContainerCreating\"}}}],\"hostIP\":\"192.168.0.9\",\"startTime\":\"2024-06-15T07:18:59Z\"}}"
The corresponding code:
// syncPod syncs the given status with the API server. The caller must not hold the lock.
func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
// (parts omitted)
newPod, patchBytes, unchanged, err := statusutil.PatchPodStatus(context.TODO(), m.kubeClient, pod.Namespace, pod.Name, pod.UID, pod.Status, mergedStatus)
klog.V(3).InfoS("Patch status for pod", "pod", klog.KObj(pod), "podUID", uid, "patch", string(patchBytes))
// (parts omitted)
}
Here, Kubelet uses the Patch API:
updatedPod, err := c.CoreV1().Pods(namespace).Patch(ctx, name, types.StrategicMergePatchType, patchBytes, metav1.PatchOptions{}, "status")
if err != nil {
return nil, nil, false, fmt.Errorf("failed to patch status %q for pod %q/%q: %v", patchBytes, namespace, name, err)
}
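The idea behind sending a patch is that only the fields that changed go over the wire, and an empty diff means nothing needs to be sent. The sketch below illustrates that idea with a plain top-level JSON diff. It is a deliberate simplification and an assumption of this example: kubelet computes a strategic merge patch through library helpers, which this code does not reproduce, and `topLevelDiff` is a hypothetical function.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// topLevelDiff returns a JSON object holding only the top-level fields whose
// values differ between oldJSON and newJSON. Fields that disappeared are set
// to null. The boolean result is true when nothing changed.
func topLevelDiff(oldJSON, newJSON []byte) ([]byte, bool, error) {
	var oldM, newM map[string]interface{}
	if err := json.Unmarshal(oldJSON, &oldM); err != nil {
		return nil, false, err
	}
	if err := json.Unmarshal(newJSON, &newM); err != nil {
		return nil, false, err
	}
	diff := map[string]interface{}{}
	for k, v := range newM {
		if old, ok := oldM[k]; !ok || !reflect.DeepEqual(old, v) {
			diff[k] = v
		}
	}
	for k := range oldM {
		if _, ok := newM[k]; !ok {
			diff[k] = nil
		}
	}
	if len(diff) == 0 {
		return []byte("{}"), true, nil
	}
	patch, err := json.Marshal(diff)
	return patch, false, err
}

func main() {
	oldStatus := []byte(`{"phase":"Pending","hostIP":"192.168.0.9"}`)
	newStatus := []byte(`{"phase":"Running","hostIP":"192.168.0.9"}`)
	patch, unchanged, err := topLevelDiff(oldStatus, newStatus)
	fmt.Println(string(patch), unchanged, err)
}
```

The unchanged flag plays the same role as the value returned by PatchPodStatus in the source: when it is true, the caller can log that the status is up to date and skip further work.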
After the update completes, the following log line is printed:
Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.285211 3390574 status_manager.go:728] "Status for pod updated successfully" pod="default/dp-test-5986bbcc8f-gcrmm" statusVersion=1 status={Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:192.168.0.9 PodIP: PodIPs:[] StartTime:2024-06-15 15:18:59 +0800 CST InitContainerStatuses:[] ContainerStatuses:[{Name:nginx State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:nginx:latest ImageID: ContainerID: Started:0xc001ed178c}] QOSClass:BestEffort EphemeralContainerStatuses:[]}
The corresponding code:
func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
// (parts omitted)
if unchanged {
klog.V(3).InfoS("Status for pod is up-to-date", "pod", klog.KObj(pod), "statusVersion", status.version)
} else {
klog.V(3).InfoS("Status for pod updated successfully", "pod", klog.KObj(pod), "statusVersion", status.version, "status", mergedStatus)
pod = newPod
// We pass a new object (result of API call which contains updated ResourceVersion)
m.podStartupLatencyHelper.RecordStatusUpdated(pod)
}
// (parts omitted)
}
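Creating the Cgroup
Next, Kubelet needs to create a cgroup for the Pod and update its resource configuration. This happens in Kubelet's own syncPod method:
func (kl *Kubelet) syncPod(ctx context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
// (parts omitted)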
// Create Cgroups for the pod and apply resource parameters
// to them if cgroups-per-qos flag is enabled.
pcm := kl.containerManager.NewPodContainerManager()
// If pod has already been terminated then we need not create
// or update the pod's cgroup
// TODO: once context cancellation is added this check can be removed
if !kl.podWorkers.IsPodTerminationRequested(pod.UID) {
// When the kubelet is restarted with the cgroups-per-qos
// flag enabled, all the pod's running containers
// should be killed intermittently and brought back up
// under the qos cgroup hierarchy.
// Check if this is the pod's first sync
firstSync := true
for _, containerStatus := range apiPodStatus.ContainerStatuses {
if containerStatus.State.Running != nil {
firstSync = false
break
}
}
// Don't kill containers in pod if pod's cgroups already
// exists or the pod is running for the first time
podKilled := false
if !pcm.Exists(pod) && !firstSync {
p := kubecontainer.ConvertPodStatusToRunningPod(kl.getRuntime().Type(), podStatus)
if err := kl.killPod(ctx, pod, p, nil); err == nil {
podKilled = true
} else {
klog.ErrorS(err, "KillPod failed", "pod", klog.KObj(pod), "podStatus", podStatus)
}
}
// Create and Update pod's Cgroups
// Don't create cgroups for run once pod if it was killed above
// The current policy is not to restart the run once pods when
// the kubelet is restarted with the new flag as run once pods are
// expected to run only once and if the kubelet is restarted then
// they are not expected to run again.
// We don't create and apply updates to cgroup if its a run once pod and was killed above
if !(podKilled && pod.Spec.RestartPolicy == v1.RestartPolicyNever) {
if !pcm.Exists(pod) {
if err := kl.containerManager.UpdateQOSCgroups(); err != nil {
klog.V(2).InfoS("Failed to update QoS cgroups while syncing pod", "pod", klog.KObj(pod), "err", err)
}
if err := pcm.EnsureExists(pod); err != nil {
kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToCreatePodContainer, "unable to ensure pod container exists: %v", err)
return false, fmt.Errorf("failed to ensure that the pod: %v cgroups exist and are correctly applied: %v", pod.UID, err)
}
}
}
}
// (parts omitted)
}
UpdateQOSCgroups updates the resource configuration. If the Pod's cgroup does not exist yet, pcm.EnsureExists creates it:
// EnsureExists takes a pod as argument and makes sure that
// pod cgroup exists if qos cgroup hierarchy flag is enabled.
// If the pod level container doesn't already exist it is created.
func (m *podContainerManagerImpl) EnsureExists(pod *v1.Pod) error {
podContainerName, _ := m.GetPodContainerName(pod)
// check if container already exist
alreadyExists := m.Exists(pod)
if !alreadyExists {
enforceMemoryQoS := false
if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.MemoryQoS) &&
libcontainercgroups.IsCgroup2UnifiedMode() {
enforceMemoryQoS = true
}
// Create the pod container
containerConfig := &CgroupConfig{
Name: podContainerName,
ResourceParameters: ResourceConfigForPod(pod, m.enforceCPULimits, m.cpuCFSQuotaPeriod, enforceMemoryQoS),
}
if m.podPidsLimit > 0 {
containerConfig.ResourceParameters.PidsLimit = &m.podPidsLimit
}
if enforceMemoryQoS {
klog.V(4).InfoS("MemoryQoS config for pod", "pod", klog.KObj(pod), "unified", containerConfig.ResourceParameters.Unified)
}
if pod.Annotations[annotationHostResourceView] == "false" {
hostResourceView := false
containerConfig.ResourceParameters.HostResourceView = &hostResourceView
}
if err := m.cgroupManager.Create(containerConfig); err != nil {
return fmt.Errorf("failed to create container for %v : %v", podContainerName, err)
}
}
return nil
}
The metric cgroup_manager_duration_seconds records the time taken to create a cgroup (other operation types are distinguished by a label).
Updating a cgroup relies on a package from runc:
// Update updates the cgroup with the specified Cgroup Configuration
func (m *cgroupManagerImpl) Update(cgroupConfig *CgroupConfig) error {
start := time.Now()
defer func() {
metrics.CgroupManagerDuration.WithLabelValues("update").Observe(metrics.SinceInSeconds(start))
}()
libcontainerCgroupConfig := m.libctCgroupConfig(cgroupConfig, true)
manager, err := manager.New(libcontainerCgroupConfig)
if err != nil {
return fmt.Errorf("failed to create cgroup manager: %v", err)
}
return manager.Set(libcontainerCgroupConfig.Resources)
}
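The Update method times itself with a deferred closure, so the duration is recorded on every return path. A minimal sketch of this pattern is given below; `opTimer` is a hypothetical in-memory stand-in for a labelled histogram and has nothing to do with the real metrics package.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// opTimer is a hypothetical in-memory stand-in for a histogram with one label.
type opTimer struct {
	mu      sync.Mutex
	samples map[string][]float64
}

func newOpTimer() *opTimer {
	return &opTimer{samples: make(map[string][]float64)}
}

// observe stores one duration sample under the given operation label.
func (t *opTimer) observe(op string, seconds float64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.samples[op] = append(t.samples[op], seconds)
}

// count returns how many samples were recorded for the operation.
func (t *opTimer) count(op string) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.samples[op])
}

// timed runs fn and records its duration, whether fn succeeds or fails.
func (t *opTimer) timed(op string, fn func() error) error {
	start := time.Now()
	defer func() { t.observe(op, time.Since(start).Seconds()) }()
	return fn()
}

var errCreate = errors.New("create failed")

func main() {
	t := newOpTimer()
	_ = t.timed("update", func() error { return nil })
	err := t.timed("create", func() error { return errCreate })
	fmt.Println(t.count("update"), t.count("create"), err)
}
```

Because the observation happens in a defer, failed operations are measured too, which matters when slow failures are exactly what needs investigating.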
Creating the Pod directories
Next, Kubelet creates the directories for each Pod:
// makePodDataDirs creates the dirs for the pod datas.
func (kl *Kubelet) makePodDataDirs(pod *v1.Pod) error {
uid := pod.UID
if err := os.MkdirAll(kl.getPodDir(uid), 0750); err != nil && !os.IsExist(err) {
return err
}
if err := os.MkdirAll(kl.getPodVolumesDir(uid), 0750); err != nil && !os.IsExist(err) {
return err
}
if err := os.MkdirAll(kl.getPodPluginsDir(uid), 0750); err != nil && !os.IsExist(err) {
return err
}
return nil
}
This covers the following three directories:
-
The Pod root directory
-
The Volume root directory
-
The Plugins root directory
Mounting the volumes
Next, kubelet waits until the volumes required by the Pod have been mounted successfully (it does not wait for Pods that are terminating):
Jun 15 15:19:06 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:19:06.599897 3390574 volume_manager.go:408] "Waiting for volumes to attach and mount for pod" pod="default/dp-test-5986bbcc8f-gcrmm"
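The corresponding code:
func (kl *Kubelet) syncPod(ctx context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
// (parts omitted)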
// Volume manager will not mount volumes for terminating pods
// TODO: once context cancellation is added this check can be removed
if !kl.podWorkers.IsPodTerminationRequested(pod.UID) {
// Wait for volumes to attach/mount
if err := kl.volumeManager.WaitForAttachAndMount(pod); err != nil {
kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedMountVolume, "Unable to attach or mount volumes: %v", err)
klog.ErrorS(err, "Unable to attach or mount volumes for pod; skipping pod", "pod", klog.KObj(pod))
return false, err
}
}
// (parts omitted)
}
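Kubelet tracks the state of a Pod's volumes through desiredStateOfWorld and actualStateOfWorld. Here, WaitForAttachAndMount only compares these two caches and waits until every volume has been attached and mounted:
Note: the handling differs by volume type, and third-party volumes also involve a CSI plugin. That will be analyzed separately in a later article.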
func (vm *volumeManager) WaitForAttachAndMount(pod *v1.Pod) error {
// (parts omitted)
err := wait.PollImmediate(
podAttachAndMountRetryInterval,
podAttachAndMountTimeout,
vm.verifyVolumesMountedFunc(uniquePodName, expectedVolumes))
if err != nil {
unmountedVolumes :=
vm.getUnmountedVolumes(uniquePodName, expectedVolumes)
// Also get unattached volumes for error message
unattachedVolumes :=
vm.getUnattachedVolumes(expectedVolumes)
if len(unmountedVolumes) == 0 {
return nil
}
return fmt.Errorf(
"unmounted volumes=%v, unattached volumes=%v: %s",
unmountedVolumes,
unattachedVolumes,
err)
}
// (parts omitted)
}
If this step fails, the error is surfaced in the Pod's events. This is an event we run into frequently:
Unable to attach or mount volumes xxx
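The values of unmounted volumes and unattached volumes in this message are derived from desiredStateOfWorld and actualStateOfWorld.
Once all volumes have been mounted successfully, the following log line is printed: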
Jun 15 15:19:06 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:19:06.599946 3390574 volume_manager.go:439] "All volumes are attached and mounted for pod" pod="default/dp-test-5986bbcc8f-gcrmm"
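Fetching imagePullSecrets
Next, kubelet fetches the imagePullSecrets referenced by the Pod from the API server:
func (kl *Kubelet) syncPod(ctx context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
// (parts omitted)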
// Fetch the pull secrets for the pod
pullSecrets := kl.getPullSecretsForPod(pod)
// (parts omitted)
}
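The imagePullSecrets are then passed to the SyncPod method of containerRuntime to start the containers:
func (kl *Kubelet) syncPod(ctx context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
// (parts omitted)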
// Ensure the pod is being probed
kl.probeManager.AddPod(pod)
// Call the container runtime's SyncPod callback
result := kl.containerRuntime.SyncPod(ctx, pod, podStatus, pullSecrets, kl.backOff)
kl.reasonCache.Update(pod.UID, result)
if err := result.Error(); err != nil {
// Do not return error if the only failures were pods in backoff
for _, r := range result.SyncResults {
if r.Error != kubecontainer.ErrCrashLoopBackOff && r.Error != images.ErrImagePullBackOff {
// Do not record an event here, as we keep all event logging for sync pod failures
// local to container runtime, so we get better errors.
return false, err
}
}
return false, nil
}
// (parts omitted)
}
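Creating the Sandbox and Containers
At this point, Kubelet has finished its preparation for creating the Pod. The SyncPod method calls the runtime's RPCs through the CRI interface to create the Sandbox and the Containers:
Note: the detailed logic will be analyzed separately in a later article.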
// SyncPod syncs the running pod into the desired pod by executing following steps:
//
// 1. Compute sandbox and container changes.
// 2. Kill pod sandbox if necessary.
// 3. Kill any containers that should not be running.
// 4. Create sandbox if necessary.
// 5. Create ephemeral containers.
// 6. Create init containers.
// 7. Create normal containers.