What happens before the kubelet creates a Pod?

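When the kubelet watches a newly added Pod, it mainly has to do the following:

Manage the Pod status: update the local cache and also sync it to the API server
Check whether the node has enough resources to create the Pod
Create the cgroup and update its resource configuration
Create the root directories for the Pod, its volumes, and its plugins
Mount the volumes and wait until all of them are mounted successfully
Fetch the imagePullSecrets
Call the CRI to create the PodSandbox
Call the CRI to create the containers (ephemeral containers, init containers, regular containers)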

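Pod events are the channel through which Kubernetes components surface key information to users. For a given Pod, the following events normally appear:

The Pod is scheduled

When a Pod is created, it is first submitted to the API server (this is also true for Pods managed by a controller). The scheduler then notices the unscheduled Pod and schedules it. Once its algorithms have found a suitable node, the scheduler binds the Pod to that node. For the default scheduler, scheduling is complete when the following message is written to the Pod's events: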

Successfully assigned default/dp-test-5986bbcc8f-4vlmd to 192.168.0.9

The corresponding code in the scheduler:

func (sched *Scheduler) finishBinding(fwk framework.Framework, assumed *v1.Pod, targetNode string, status *framework.Status) {
    if finErr := sched.Cache.FinishBinding(assumed); finErr != nil {
        klog.ErrorS(finErr, "Scheduler cache FinishBinding failed")
    }
    if !status.IsSuccess() {
        klog.V(1).InfoS("Failed to bind pod", "pod", klog.KObj(assumed))
        return
    }

    fwk.EventRecorder().Eventf(assumed, nil, v1.EventTypeNormal, "Scheduled", "Binding", "Successfully assigned %v/%v to %v", assumed.Namespace, assumed.Name, targetNode)
}

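The kubelet discovers a Pod that needs to be handled

The kubelet continuously watches the API server for the list of Pods scheduled to the current node:

Note: the list is filtered with the field selector spec.nodeName.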

// NewSourceApiserver creates a config source that watches and pulls from the apiserver.
func NewSourceApiserver(c clientset.Interface, nodeName types.NodeName, nodeHasSynced func() bool, updates chan<- interface{}) {
    lw := cache.NewListWatchFromClient(c.CoreV1().RESTClient(), "pods", metav1.NamespaceAll, fields.OneTermEqualSelector("spec.nodeName", string(nodeName)))

    // The Reflector responsible for watching pods at the apiserver should be run only after
    // the node sync with the apiserver has completed.
    klog.InfoS("Waiting for node sync before watching apiserver pods")
    go func() {
        for {
            if nodeHasSynced() {
                klog.V(4).InfoS("node sync completed")
                break
            }
            time.Sleep(WaitForAPIServerSyncPeriod)
            klog.V(4).InfoS("node sync has not completed yet")
        }
        klog.InfoS("Watching apiserver")
        newSourceApiserverFromLW(lw, updates)
    }()
}
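To make the effect of the spec.nodeName selector concrete, here is a minimal sketch using only the standard library. The pod type and the podsForNode helper are hypothetical stand-ins, not kubelet code; in a real cluster the API server applies the selector, and the loop below only imitates the result.

```go
package main

import "fmt"

// pod is a hypothetical, stripped-down stand-in for v1.Pod.
type pod struct {
	Name     string
	NodeName string // mirrors spec.nodeName
}

// fieldSelector renders the "field=value" string that a one-term
// equality selector on spec.nodeName corresponds to.
func fieldSelector(nodeName string) string {
	return "spec.nodeName=" + nodeName
}

// podsForNode keeps only the pods bound to nodeName.
func podsForNode(all []pod, nodeName string) []pod {
	var out []pod
	for _, p := range all {
		if p.NodeName == nodeName {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	all := []pod{
		{Name: "a", NodeName: "192.168.0.9"},
		{Name: "b", NodeName: "192.168.0.10"},
		{Name: "c", NodeName: ""}, // not scheduled yet, so no kubelet sees it
	}
	fmt.Println(fieldSelector("192.168.0.9"))
	fmt.Println(podsForNode(all, "192.168.0.9"))
}
```

An unscheduled Pod has an empty spec.nodeName, so it matches no node's selector until the scheduler binds it.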

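When the kubelet discovers that a new Pod has been scheduled to the current node, it prints the following log:

Note: the log verbosity must be raised to 5 to see this; the same applies to the logs below.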

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269157 3390574 config.go:398] "Receiving a new pod" pod="default/dp-test-5986bbcc8f-gcrmm"

The corresponding code:

// recordFirstSeenTime records the first seen time of this pod.
func recordFirstSeenTime(pod *v1.Pod) {
    klog.V(4).InfoS("Receiving a new pod", "pod", klog.KObj(pod))
    pod.Annotations[kubetypes.ConfigFirstSeenAnnotationKey] = kubetypes.NewTimestamp().GetString()
}

Here the kubelet adds an annotation with the key kubernetes.io/config.seen to the Pod, recording when the kubelet first saw it. The annotation is used only inside the kubelet and is not synced to the API server.
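The record-then-read-back pattern behind this annotation can be sketched as follows. This is an illustration under assumptions, not the kubelet's code: the helper names are made up, and the RFC 3339 (nanosecond) timestamp layout is assumed for the example.

```go
package main

import (
	"fmt"
	"time"
)

// configSeenKey is the annotation key named in the text.
const configSeenKey = "kubernetes.io/config.seen"

// exampleTime is a fixed instant so the demo is deterministic.
var exampleTime = time.Date(2023, 6, 15, 15, 18, 59, 269157000, time.UTC)

// recordFirstSeen stores the timestamp under configSeenKey. The map is
// created on demand, because writing to a nil map panics in Go.
func recordFirstSeen(annotations map[string]string, now time.Time) map[string]string {
	if annotations == nil {
		annotations = make(map[string]string)
	}
	annotations[configSeenKey] = now.Format(time.RFC3339Nano)
	return annotations
}

// firstSeen parses the annotation back. The bool is false when the
// annotation is missing or malformed.
func firstSeen(annotations map[string]string) (time.Time, bool) {
	s, found := annotations[configSeenKey]
	if !found {
		return time.Time{}, false
	}
	parsed, err := time.Parse(time.RFC3339Nano, s)
	if err != nil {
		return time.Time{}, false
	}
	return parsed, true
}

func main() {
	ann := recordFirstSeen(nil, exampleTime)
	fmt.Println(ann[configSeenKey])
	seen, ok := firstSeen(ann)
	fmt.Println(seen.Equal(exampleTime), ok)
}
```

The nil check matters because the quoted recordFirstSeenTime writes to pod.Annotations directly, so the map has to exist before it is called.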

The kubelet classifies the Pods it receives and sends them through the s.updates channel to the corresponding handler. Following this channel leads to the function that consumes it:

// syncLoop is the main loop for processing changes. It watches for changes from
// three channels (file, apiserver, and http) and creates a union of them. For
// any new change seen, will run a sync against desired state and running state. If
// no changes are seen to the configuration, will synchronize the last known desired
// state every sync-frequency seconds. Never returns.
func (kl *Kubelet) syncLoop(ctx context.Context, updates <-chan kubetypes.PodUpdate, handler SyncHandler) {
    klog.InfoS("Starting kubelet main sync loop")
    // The syncTicker wakes up kubelet to checks if there are any pod workers
    // that need to be sync'd. A one-second period is sufficient because the
    // sync interval is defaulted to 10s.
    syncTicker := time.NewTicker(time.Second)
    defer syncTicker.Stop()
    housekeepingTicker := time.NewTicker(housekeepingPeriod)
    defer housekeepingTicker.Stop()
    plegCh := kl.pleg.Watch()
    const (
        base   = 100 * time.Millisecond
        max    = 5 * time.Second
        factor = 2
    )
    duration := base
    // Responsible for checking limits in resolv.conf
    // The limits do not have anything to do with individual pods
    // Since this is called in syncLoop, we don't need to call it anywhere else
    if kl.dnsConfigurer != nil && kl.dnsConfigurer.ResolverConfig != "" {
        kl.dnsConfigurer.CheckLimitsForResolvConf()
    }

    for {
        if err := kl.runtimeState.runtimeErrors(); err != nil {
            klog.ErrorS(err, "Skipping pod synchronization")
            // exponential backoff
            time.Sleep(duration)
            duration = time.Duration(math.Min(float64(max), factor*float64(duration)))
            continue
        }
        // reset backoff if we have a success
        duration = base

        kl.syncLoopMonitor.Store(kl.clock.Now())
        if !kl.syncLoopIteration(ctx, updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
            break
        }
        kl.syncLoopMonitor.Store(kl.clock.Now())
    }
}
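The backoff in syncLoop starts at 100 ms, doubles after every runtime error, and is capped at 5 s. The sketch below reproduces only that arithmetic with the same constants; nextBackoff and backoffSchedule are illustrative helper names, not kubelet functions.

```go
package main

import (
	"fmt"
	"time"
)

const (
	base     = 100 * time.Millisecond
	maxDelay = 5 * time.Second
	factor   = 2
)

// nextBackoff multiplies d by factor and caps the result at maxDelay.
func nextBackoff(d time.Duration) time.Duration {
	next := d * factor
	if next > maxDelay {
		return maxDelay
	}
	return next
}

// backoffSchedule returns the sleep durations for n consecutive failures.
func backoffSchedule(n int) []time.Duration {
	out := make([]time.Duration, 0, n)
	d := base
	for i := 0; i < n; i++ {
		out = append(out, d)
		d = nextBackoff(d)
	}
	return out
}

func main() {
	fmt.Println(backoffSchedule(8))
	// prints [100ms 200ms 400ms 800ms 1.6s 3.2s 5s 5s]
}
```

After the first successful check, syncLoop resets the delay to the base value, so the schedule starts over on the next failure.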

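syncLoop is the kubelet's main processing loop. Its syncLoopIteration method reads events from several channels and dispatches them to workers. There is a lot going on, but the comments give the overall logic: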

// syncLoopIteration reads from various channels and dispatches pods to the
// given handler.
//
// Arguments:
// 1.  configCh:       a channel to read config events from
// 2.  handler:        the SyncHandler to dispatch pods to
// 3.  syncCh:         a channel to read periodic sync events from
// 4.  housekeepingCh: a channel to read housekeeping events from
// 5.  plegCh:         a channel to read PLEG updates from
//
// Events are also read from the kubelet liveness manager's update channel.
//
// The workflow is to read from one of the channels, handle that event, and
// update the timestamp in the sync loop monitor.
//
// Here is an appropriate place to note that despite the syntactical
// similarity to the switch statement, the case statements in a select are
// evaluated in a pseudorandom order if there are multiple channels ready to
// read from when the select is evaluated.  In other words, case statements
// are evaluated in random order, and you can not assume that the case
// statements evaluate in order if multiple channels have events.
//
// With that in mind, in truly no particular order, the different channels
// are handled as follows:
//
//   - configCh: dispatch the pods for the config change to the appropriate
//     handler callback for the event type
//   - plegCh: update the runtime cache; sync pod
//   - syncCh: sync all pods waiting for sync
//   - housekeepingCh: trigger cleanup of pods
//   - health manager: sync pods that have failed or in which one or more
//     containers have failed health checks

At this point the kubelet prints the following log, indicating that it has received the request for the new Pod and is calling the corresponding handler:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269224 3390574 kubelet.go:2206] "SyncLoop ADD" source="api" pods="[default/dp-test-5986bbcc8f-gcrmm]"

The corresponding code:

func (kl *Kubelet) syncLoopIteration(ctx context.Context, configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
    syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
    select {
    case u, open := <-configCh:
        // Update from a config source; dispatch it to the right handler
        // callback.
        if !open {
            klog.ErrorS(nil, "Update channel is closed, exiting the sync loop")
            return false
        }

        switch u.Op {
        case kubetypes.ADD:
            klog.V(2).InfoS("SyncLoop ADD", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
            // After restarting, kubelet will get all existing pods through
            // ADD as if they are new pods. These pods will then go through the
            // admission process and *may* be rejected. This can be resolved
            // once we have checkpointing.
            handler.HandlePodAdditions(u.Pods)
        case kubetypes.UPDATE:
            klog.V(2).InfoS("SyncLoop UPDATE", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
            handler.HandlePodUpdates(u.Pods)
        case kubetypes.REMOVE:
            klog.V(2).InfoS("SyncLoop REMOVE", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
            handler.HandlePodRemoves(u.Pods)
        case kubetypes.RECONCILE:
            klog.V(4).InfoS("SyncLoop RECONCILE", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
            handler.HandlePodReconcile(u.Pods)
        case kubetypes.DELETE:
            klog.V(2).InfoS("SyncLoop DELETE", "source", u.Source, "pods", klog.KObjSlice(u.Pods))
            // DELETE is treated as a UPDATE because of graceful deletion.
            handler.HandlePodUpdates(u.Pods)
        case kubetypes.SET:
            // TODO: Do we want to support this?
            klog.ErrorS(nil, "Kubelet does not support snapshot update")
        default:
            klog.ErrorS(nil, "Invalid operation type received", "operation", u.Op)
        }

        // (some content omitted)
    }
    return true
}

Each event type has its own handler; we will not go deeper into them here.

Add Pod: HandlePodAdditions
Update Pod, Delete Pod: HandlePodUpdates
Remove Pod: HandlePodRemoves
Reconcile Pod: HandlePodReconcile
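The mapping above can be condensed into a small lookup. This is a sketch for orientation only: podOperation and its constants are hypothetical stand-ins for the kubetypes values, and their numeric order is not meant to match the kubelet's.

```go
package main

import "fmt"

// podOperation is a hypothetical stand-in for the kubelet's operation type.
type podOperation int

const (
	opSet podOperation = iota
	opAdd
	opDelete
	opRemove
	opUpdate
	opReconcile
)

// handlerFor returns the name of the SyncHandler callback used for op.
// The bool is false for operations that syncLoopIteration does not dispatch.
func handlerFor(op podOperation) (string, bool) {
	switch op {
	case opAdd:
		return "HandlePodAdditions", true
	case opUpdate, opDelete:
		// DELETE is treated as an UPDATE because of graceful deletion.
		return "HandlePodUpdates", true
	case opRemove:
		return "HandlePodRemoves", true
	case opReconcile:
		return "HandlePodReconcile", true
	default:
		// SET (snapshot update) is not supported; unknown values are invalid.
		return "", false
	}
}

func main() {
	for _, op := range []podOperation{opAdd, opUpdate, opDelete, opRemove, opReconcile, opSet} {
		name, ok := handlerFor(op)
		fmt.Println(op, name, ok)
	}
}
```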

For the kubetypes.ADD event, the handler is HandlePodAdditions:

// HandlePodAdditions is the callback in SyncHandler for pods being added from
// a config source.
func (kl *Kubelet) HandlePodAdditions(pods []*v1.Pod) {
    start := kl.clock.Now()
    sort.Sort(sliceutils.PodsByCreationTime(pods))
    for _, pod := range pods {
        existingPods := kl.podManager.GetPods()
        // Always add the pod to the pod manager. Kubelet relies on the pod
        // manager as the source of truth for the desired state. If a pod does
        // not exist in the pod manager, it means that it has been deleted in
        // the apiserver and no action (other than cleanup) is required.
        kl.podManager.AddPod(pod)

        // Mirror Pod (the API server's representation of a static Pod): handled by handleMirrorPod
        if kubetypes.IsMirrorPod(pod) {
            kl.handleMirrorPod(pod, start)
            continue
        }

        // Only go through the admission process if the pod is not requested
        // for termination by another part of the kubelet. If the pod is already
        // using resources (previously admitted), the pod worker is going to be
        // shutting it down. If the pod hasn't started yet, we know that when
        // the pod worker is invoked it will also avoid setting up the pod, so
        // we simply avoid doing any work.
        if !kl.podWorkers.IsPodTerminationRequested(pod.UID) {
            // We failed pods that we rejected, so activePods include all admitted
            // pods that are alive.
            activePods := kl.filterOutInactivePods(existingPods)

            // Check if we can admit the pod; if not, reject it.
            if ok, reason, message := kl.canAdmitPod(activePods, pod); !ok {
                kl.rejectPod(pod, reason, message)
                continue
            }
        }
        mirrorPod, _ := kl.podManager.GetMirrorPodByPod(pod)
        kl.dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)
    }
}

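Only Pods for which termination has not been requested go through the admission logic, which lives in canAdmitPod. That method runs the Pod through each of the admitHandlers, and the following log is printed: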

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269248 3390574 topology_manager.go:210] "Topology Admit Handler" podUID=dd780355-f908-4bf2-9193-68dabbbca69e podNamespace="default" podName="dp-test-5986bbcc8f-gcrmm"

The corresponding code:

func (m *manager) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
    klog.InfoS("Topology Admit Handler", "podUID", attrs.Pod.UID, "podNamespace", attrs.Pod.Namespace, "podName", attrs.Pod.Name)
    pod := attrs.Pod

    return m.scope.Admit(pod)
}
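The admission step is a chain of handlers in which the first rejection wins. The following is a minimal sketch of that pattern, with simplified hypothetical types (admitResult, admitHandler, handlerFunc) that take a Pod name instead of the kubelet's real attribute struct.

```go
package main

import "fmt"

// admitResult is a simplified stand-in for the admission result.
type admitResult struct {
	Admit   bool
	Reason  string
	Message string
}

// admitHandler is a simplified stand-in for a pod admit handler.
type admitHandler interface {
	Admit(podName string) admitResult
}

// handlerFunc adapts a plain function to admitHandler.
type handlerFunc func(podName string) admitResult

func (f handlerFunc) Admit(podName string) admitResult { return f(podName) }

// canAdmit runs the handlers in order and stops at the first rejection.
func canAdmit(handlers []admitHandler, podName string) (bool, string, string) {
	for _, h := range handlers {
		if r := h.Admit(podName); !r.Admit {
			return false, r.Reason, r.Message
		}
	}
	return true, "", ""
}

func main() {
	topology := handlerFunc(func(string) admitResult {
		return admitResult{Admit: true}
	})
	resources := handlerFunc(func(name string) admitResult {
		return admitResult{Reason: "OutOfDevices", Message: "no free device for " + name}
	})
	ok, reason, msg := canAdmit([]admitHandler{topology, resources}, "dp-test")
	fmt.Println(ok, reason, msg)
}
```

The reason string in the example is invented for illustration; a rejected Pod ends up in rejectPod with whatever reason and message the failing handler returned.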

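The Admit method first inspects container.Resources.Limits to work out which resources need to be allocated (for example, network interfaces). The log looks like this: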

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269261 3390574 manager.go:854] "Looking for needed resources" needed=1 resourceName="xxx/eni-ip"

The corresponding code:

// allocateContainerResources attempts to allocate all of required device
// plugin resources for the input container, issues an Allocate rpc request
// for each new device resource requirement, processes their AllocateResponses,
// and updates the cached containerDevices on success.
func (m *ManagerImpl) allocateContainerResources(pod *v1.Pod, container *v1.Container, devicesToReuse map[string]sets.String) error {
    podUID := string(pod.UID)
    contName := container.Name
    allocatedDevicesUpdated := false
    needsUpdateCheckpoint := false
    // Extended resources are not allowed to be overcommitted.
    // Since device plugin advertises extended resources,
    // therefore Requests must be equal to Limits and iterating
    // over the Limits should be sufficient.
    for k, v := range container.Resources.Limits {
        resource := string(k)
        needed := int(v.Value())
        klog.V(3).InfoS("Looking for needed resources", "needed", needed, "resourceName", resource)
        if !m.isDevicePluginResource(resource) {
            continue
        }
        // Updates allocatedDevices to garbage collect any stranded resources
        // before doing the device plugin allocation.
        if !allocatedDevicesUpdated {
            m.UpdateAllocatedDevices()
            allocatedDevicesUpdated = true
        }

        var allocDevices sets.String
        var err error
        contDevicesScheduledKey := fmt.Sprintf("%s_container-%s", resource, contName)
        contDevicesScheduled := pod.Annotations[contDevicesScheduledKey]
        if contDevicesScheduled != "" {
            allocDevices, err = m.devicesToAllocateFromScheduler(podUID, contName, resource, contDevicesScheduled, needed, devicesToReuse[resource])
        } else {
            allocDevices, err = m.devicesToAllocate(podUID, contName, resource, needed, devicesToReuse[resource])
        }
        if err != nil {
            return err
        }
        if allocDevices == nil || len(allocDevices) <= 0 {
            continue
        }

        // (some content omitted)
        m.podDevices.insert(podUID, contName, resource, allocDevicesWithNUMA, resp.ContainerResponses[0])
    }

    if needsUpdateCheckpoint {
        return m.writeCheckpoint()
    }

    return nil
}
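The loop over Limits boils down to "skip everything that is not a device plugin resource". A minimal sketch of that filter follows, with plain maps instead of the Kubernetes resource types; neededDevices and the isDevicePlugin callback are hypothetical names.

```go
package main

import "fmt"

// neededDevices returns, for each limit whose resource name is served by a
// device plugin, the number of devices requested. Other resources such as
// cpu and memory are skipped.
func neededDevices(limits map[string]int64, isDevicePlugin func(string) bool) map[string]int {
	needed := make(map[string]int)
	for name, qty := range limits {
		if !isDevicePlugin(name) {
			continue
		}
		needed[name] = int(qty)
	}
	return needed
}

func main() {
	limits := map[string]int64{"cpu": 2, "memory": 4 << 30, "xxx/eni-ip": 1}
	registered := map[string]bool{"xxx/eni-ip": true}
	needed := neededDevices(limits, func(n string) bool { return registered[n] })
	fmt.Println(needed)
}
```

In the quoted code the "Looking for needed resources" message is logged before the isDevicePluginResource check, so it is logged for every entry in Limits, including ones that are then skipped.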

Next, the flow reaches devicesToAllocate to pick the devices that are needed. The next log is:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269321 3390574 manager.go:621] "Need devices to allocate for pod" deviceNumber=1 resourceName="xxx/eni-ip" podUID="dd780355-f908-4bf2-9193-68dabbbca69e" containerName="nginx"

The corresponding code:

// Returns list of device Ids we need to allocate with Allocate rpc call.
// Returns empty list in case we don't need to issue the Allocate rpc call.
func (m *ManagerImpl) devicesToAllocate(podUID, contName, resource string, required int, reusableDevices sets.String) (sets.String, error) {
    m.mutex.Lock()
    defer m.mutex.Unlock()
    needed := required
    // Gets list of devices that have already been allocated.
    // This can happen if a container restarts for example.
    devices := m.podDevices.containerDevices(podUID, contName, resource)
    if devices != nil {
        klog.V(3).InfoS("Found pre-allocated devices for resource on pod", "resourceName", resource, "containerName", contName, "podUID", string(podUID), "devices", devices.List())
        needed = needed - devices.Len()
        // A pod's resource is not expected to change once admitted by the API server,
        // so just fail loudly here. We can revisit this part if this no longer holds.
        if needed != 0 {
            return nil, fmt.Errorf("pod %q container %q changed request for resource %q from %d to %d", string(podUID), contName, resource, devices.Len(), required)
        }
    }

    // We have 3 major flows to handle:
    // 1. kubelet running, normal allocation (needed > 0, container being  [re]created). Steady state and most common case by far and large.
    // 2. kubelet restart. In this scenario every other component of the stack (device plugins, app container, runtime) is still running.
    // 3. node reboot. In this scenario device plugins may not be running yet when we try to allocate devices.
    //    note: if we get this far the runtime is surely running. This is usually enforced at OS level by startup system services dependencies.

    // First we take care of the exceptional flow (scenarios 2 and 3). In both flows, kubelet is reinitializing, and while kubelet is initializing, sources are NOT all ready.
    // Is this a simple kubelet restart (scenario 2)? To distinguish, we use the informations we got for runtime. If we are asked to allocate devices for containers reported
    // running, then it can only be a kubelet restart. On node reboot the runtime and the containers were also shut down. Then, if the container was running, it can only be
    // because it already has access to all the required devices, so we got nothing to do and we can bail out.
    if !m.sourcesReady.AllReady() && m.isContainerAlreadyRunning(podUID, contName) {
        klog.V(3).InfoS("container detected running, nothing to do", "deviceNumber", needed, "resourceName", resource, "podUID", string(podUID), "containerName", contName)
        return nil, nil
    }

    // We dealt with scenario 2. If we got this far it's either scenario 3 (node reboot) or scenario 1 (steady state, normal flow).
    klog.V(3).InfoS("Need devices to allocate for pod", "deviceNumber", needed, "resourceName", resource, "podUID", string(podUID), "containerName", contName)
    healthyDevices, hasRegistered := m.healthyDevices[resource]
    // (some content omitted)

    // Filters available Devices based on NUMA affinity.
    aligned, unaligned, noAffinity := m.filterByAffinity(podUID, contName, resource, available)

    // If we can allocate all remaining devices from the set of aligned ones, then
    // give the plugin the chance to influence which ones to allocate from that set.
    if needed < aligned.Len() {
        // First allocate from the preferred devices list (if available).
        preferred, err := m.callGetPreferredAllocationIfAvailable(podUID, contName, resource, aligned.Union(allocated), allocated, required)
        if err != nil {
            return nil, err
        }
        if allocateRemainingFrom(preferred.Intersection(aligned)) {
            return allocated, nil
        }
        // Then fallback to allocate from the aligned set if no preferred list
        // is returned (or not enough devices are returned in that list).
        if allocateRemainingFrom(aligned) {
            return allocated, nil
        }

        return nil, fmt.Errorf("unexpectedly allocated less resources than required. Requested: %d, Got: %d", required, required-needed)
    }

    // If we can't allocate all remaining devices from the set of aligned ones,
    // then start by first allocating all of the  aligned devices (to ensure
    // that the alignment guaranteed by the TopologyManager is honored).
    if allocateRemainingFrom(aligned) {
        return allocated, nil
    }

    // Then give the plugin the chance to influence the decision on any
    // remaining devices to allocate.
    preferred, err := m.callGetPreferredAllocationIfAvailable(podUID, contName, resource, available.Union(allocated), allocated, required)
    if err != nil {
        return nil, err
    }
    if allocateRemainingFrom(preferred.Intersection(available)) {
        return allocated, nil
    }
    // Finally, if the plugin did not return a preferred allocation (or didn't
    // return a large enough one), then fall back to allocating the remaining
    // devices from the 'unaligned' and 'noAffinity' sets.
    if allocateRemainingFrom(unaligned) {
        return allocated, nil
    }
    if allocateRemainingFrom(noAffinity) {
        return allocated, nil
    }

    return nil, fmt.Errorf("unexpectedly allocated less resources than required. Requested: %d, Got: %d", required, required-needed)
}
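The tail of devicesToAllocate drains candidate sets in a fixed priority order until enough devices are collected. The sketch below models only that fallback order (aligned, then preferred, then unaligned, then noAffinity) with string slices. pickDevices is a hypothetical helper, and the sorting is added here purely to make the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// pickDevices collects `needed` device IDs by draining the candidate groups
// in the order given. It fails if the groups run out first.
func pickDevices(needed int, groups ...[]string) ([]string, error) {
	allocated := make([]string, 0, needed)
	seen := make(map[string]bool)
	for _, g := range groups {
		sorted := append([]string(nil), g...) // copy, so the caller's slice is untouched
		sort.Strings(sorted)
		for _, id := range sorted {
			if len(allocated) == needed {
				return allocated, nil
			}
			if seen[id] {
				continue // already taken from an earlier group
			}
			seen[id] = true
			allocated = append(allocated, id)
		}
	}
	if len(allocated) == needed {
		return allocated, nil
	}
	return nil, fmt.Errorf("requested %d devices, got %d", needed, len(allocated))
}

func main() {
	aligned := []string{"eni-ip-121"}
	var preferred []string // the plugin returned no preference
	unaligned := []string{"eni-ip-300", "eni-ip-200"}
	noAffinity := []string{"eni-ip-001"}

	devs, err := pickDevices(2, aligned, preferred, unaligned, noAffinity)
	fmt.Println(devs, err)
}
```

Aligned devices come first so that the NUMA alignment promised by the Topology Manager is honored before anything else is considered.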

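If the device plugin does not advertise GetPreferredAllocation support (the GetPreferredAllocationAvailable option), the following log is printed: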

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269375 3390574 manager.go:1041] "Plugin options indicate to skip GetPreferredAllocation for resource" resourceName="xxx/eni-ip"

The corresponding code:

// callGetPreferredAllocationIfAvailable issues GetPreferredAllocation grpc
// call for device plugin resource with GetPreferredAllocationAvailable option set.
func (m *ManagerImpl) callGetPreferredAllocationIfAvailable(podUID, contName, resource string, available, mustInclude sets.String, size int) (sets.String, error) {
    eI, ok := m.endpoints[resource]
    if !ok {
        return nil, fmt.Errorf("endpoint not found in cache for a registered resource: %s", resource)
    }

    if eI.opts == nil || !eI.opts.GetPreferredAllocationAvailable {
        klog.V(4).InfoS("Plugin options indicate to skip GetPreferredAllocation for resource", "resourceName", resource)
        return nil, nil
    }

    m.mutex.Unlock()
    klog.V(4).InfoS("Issuing a GetPreferredAllocation call for container", "containerName", contName, "podUID", string(podUID))
    resp, err := eI.e.getPreferredAllocation(available.UnsortedList(), mustInclude.UnsortedList(), size)
    m.mutex.Lock()
    if err != nil {
        return nil, fmt.Errorf("device plugin GetPreferredAllocation rpc failed with err: %v", err)
    }
    if resp != nil && len(resp.ContainerResponses) > 0 {
        return sets.NewString(resp.ContainerResponses[0].DeviceIDs...), nil
    }
    return sets.NewString(), nil
}

Next comes the device allocation logic, which prints:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.269402 3390574 manager.go:909] "Making allocation request for device plugin" devices=[eni-ip-121] resourceName="xxx/eni-ip"

The corresponding code is in the allocateContainerResources method:

func (m *ManagerImpl) allocateContainerResources(pod *v1.Pod, container *v1.Container, devicesToReuse map[string]sets.String) error {
    // (some content omitted)
    devs := allocDevices.UnsortedList()
    // TODO: refactor this part of code to just append a ContainerAllocationRequest
    // in a passed in AllocateRequest pointer, and issues a single Allocate call per pod.
    klog.V(3).InfoS("Making allocation request for device plugin", "devices", devs, "resourceName", resource)
    resp, err := eI.e.allocate(devs)
    metrics.DevicePluginAllocationDuration.WithLabelValues(resource).Observe(metrics.SinceInSeconds(startRPCTime))
    if err != nil {
        // In case of allocation failure, we want to restore m.allocatedDevices
        // to the actual allocated state from m.podDevices.
        m.mutex.Lock()
        m.allocatedDevices = m.podDevices.devices()
        m.mutex.Unlock()
        return err
    }

    if len(resp.ContainerResponses) == 0 {
        return fmt.Errorf("no containers return in allocation response %v", resp)
    }
    // (some content omitted)
}

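⚠️ Note: the metric device_plugin_alloc_duration_seconds (exposed with the kubelet_ prefix) shows how long device allocation takes. Unfortunately it carries only the resource name, with no Pod or container information.

Here an RPC is issued to allocate the required devices: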

func (c *devicePluginClient) Allocate(ctx context.Context, in *AllocateRequest, opts ...grpc.CallOption) (*AllocateResponse, error) {
    out := new(AllocateResponse)
    err := c.cc.Invoke(ctx, "/v1beta1.DevicePlugin/Allocate", in, out, opts...)
    if err != nil {
        return nil, err
    }
    return out, nil
}

Once resource allocation is complete, the Pod is handed to its worker through dispatchWork:

kl.dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)

At this point a log similar to the following appears:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273776 3390574 pod_workers.go:585] "Pod is being synced for the first time" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e

The corresponding code:

func (p *podWorkers) UpdatePod(options UpdatePodOptions) {
    // (some content omitted)
    // decide what to do with this pod - we are either setting it up, tearing it down, or ignoring it
    now := time.Now()
    status, ok := p.podSyncStatuses[uid]
    if !ok {
        klog.V(4).InfoS("Pod is being synced for the first time", "pod", klog.KObj(pod), "podUID", pod.UID)
        status = &podSyncStatus{
            syncedAt: now,
            fullname: kubecontainer.GetPodFullName(pod),
        }
        // if this pod is being synced for the first time, we need to make sure it is an active pod
        if !isRuntimePod && (pod.Status.Phase == v1.PodFailed || pod.Status.Phase == v1.PodSucceeded) {
            // check to see if the pod is not running and the pod is terminal.
            // If this succeeds then record in the podWorker that it is terminated.
            if statusCache, err := p.podCache.Get(pod.UID); err == nil {
                if isPodStatusCacheTerminal(statusCache) {
                    status = &podSyncStatus{
                        terminatedAt:       now,
                        terminatingAt:      now,
                        syncedAt:           now,
                        startedTerminating: true,
                        finished:           true,
                        fullname:           kubecontainer.GetPodFullName(pod),
                    }
                }
            }
        }
        p.podSyncStatuses[uid] = status
    }
    // (some content omitted)
}

For a newly created Pod, a goroutine (the pod worker) is started to handle it. Its main logic is managePodLoop, which receives requests through outCh:

// Creating a new pod worker either means this is a new pod, or that the
// kubelet just restarted. In either case the kubelet is willing to believe
// the status of the pod for the first pod worker sync. See corresponding
// comment in syncPod.
go func() {
    defer runtime.HandleCrash()
    p.managePodLoop(outCh)
}()
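The "one goroutine per Pod, fed by a channel" pattern can be sketched with the standard library alone. This is a simplified model, not the kubelet's podWorkers: the workers type, its fields, and the Stop and Processed methods are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// podWork is a hypothetical unit of work for one pod.
type podWork struct {
	UID  string
	Kind string
}

// workers keeps one goroutine and one channel per pod UID.
type workers struct {
	mu      sync.Mutex
	updates map[string]chan podWork
	wg      sync.WaitGroup

	logMu sync.Mutex
	log   []string
}

func newWorkers() *workers {
	return &workers{updates: make(map[string]chan podWork)}
}

// UpdatePod starts a worker the first time a UID is seen, then hands the
// work over that worker's channel. It reports whether a worker was started.
func (w *workers) UpdatePod(work podWork) bool {
	w.mu.Lock()
	ch, exists := w.updates[work.UID]
	if !exists {
		ch = make(chan podWork, 1)
		w.updates[work.UID] = ch
		w.wg.Add(1)
		go w.managePodLoop(ch)
	}
	w.mu.Unlock()
	ch <- work
	return !exists
}

// managePodLoop processes the work items of a single pod, in order.
func (w *workers) managePodLoop(ch <-chan podWork) {
	defer w.wg.Done()
	for work := range ch {
		w.logMu.Lock()
		w.log = append(w.log, work.UID+":"+work.Kind)
		w.logMu.Unlock()
	}
}

// Stop closes every channel and waits for the workers to drain them.
// UpdatePod must not be called after Stop.
func (w *workers) Stop() {
	w.mu.Lock()
	for _, ch := range w.updates {
		close(ch)
	}
	w.mu.Unlock()
	w.wg.Wait()
}

// Processed returns a copy of the processed work log.
func (w *workers) Processed() []string {
	w.logMu.Lock()
	defer w.logMu.Unlock()
	return append([]string(nil), w.log...)
}

func main() {
	w := newWorkers()
	fmt.Println(w.UpdatePod(podWork{UID: "dd780355", Kind: "create"})) // first sync starts a worker
	fmt.Println(w.UpdatePod(podWork{UID: "dd780355", Kind: "update"})) // reuses the same worker
	w.Stop()
	fmt.Println(w.Processed())
}
```

Because each Pod has exactly one consumer goroutine, work for the same Pod is processed in the order it was submitted, while different Pods proceed independently.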

When a request arrives through outCh, the following log is printed:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273812 3390574 pod_workers.go:902] "Processing pod event" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e updateType=0

Code location:

func (p *podWorkers) managePodLoop(podUpdates <-chan podWork) {
    // (some content omitted)
    klog.V(4).InfoS("Processing pod event", "pod", klog.KObj(pod), "podUID", pod.UID, "updateType", update.WorkType)
    var isTerminal bool
    err := func() error {
        // The worker is responsible for ensuring the sync method sees the appropriate
        // status updates on resyncs (the result of the last sync), transitions to
        // terminating (no wait), or on terminated (whatever the most recent state is).
        // Only syncing and terminating can generate pod status changes, while terminated
        // pods ensure the most recent status makes it to the api server.
        var status *kubecontainer.PodStatus
        var err error
        switch {
        case update.Options.RunningPod != nil:
            // when we receive a running pod, we don't need status at all
        default:
            // wait until we see the next refresh from the PLEG via the cache (max 2s)
            // TODO: this adds ~1s of latency on all transitions from sync to terminating
            //  to terminated, and on all termination retries (including evictions). We should
            //  improve latency by making the pleg continuous and by allowing pod status
            //  changes to be refreshed when key events happen (killPod, sync->terminating).
            //  Improving this latency also reduces the possibility that a terminated
            //  container's status is garbage collected before we have a chance to update the
            //  API server (thus losing the exit code).
            status, err = p.podCache.GetNewerThan(pod.UID, lastSyncTime)
        }
        if err != nil {
            // This is the legacy event thrown by manage pod loop all other events are now dispatched
            // from syncPodFn
            p.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedSync, "error determining status: %v", err)
            return err
        }

        ctx := p.contextForWorker(pod.UID)

        // Take the appropriate action (illegal phases are prevented by UpdatePod)
        switch {
        case update.WorkType == TerminatedPodWork:
            err = p.syncTerminatedPodFn(ctx, pod, status)

        case update.WorkType == TerminatingPodWork:
            var gracePeriod *int64
            if opt := update.Options.KillPodOptions; opt != nil {
                gracePeriod = opt.PodTerminationGracePeriodSecondsOverride
            }
            podStatusFn := p.acknowledgeTerminating(pod)

            err = p.syncTerminatingPodFn(ctx, pod, status, update.Options.RunningPod, gracePeriod, podStatusFn)

        default:
            isTerminal, err = p.syncPodFn(ctx, update.Options.UpdateType, pod, update.Options.MirrorPod, status)
        }

        lastSyncTime = time.Now()
        return err
    }()
    // (some content omitted)
}

Every work type except the following two is handled by syncPodFn:

TerminatedPodWork
TerminatingPodWork

syncPodFn starts by printing the following log:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273831 3390574 kubelet.go:1620] "syncPod enter" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e

The corresponding code is below (syncPodFn holds the main logic for syncing Pod state; see the comments for details):

// syncPod is the transaction script for the sync of a single pod (setting up)
// a pod. This method is reentrant and expected to converge a pod towards the
// desired state of the spec. The reverse (teardown) is handled in
// syncTerminatingPod and syncTerminatedPod. If syncPod exits without error,
// then the pod runtime state is in sync with the desired configuration state
// (pod is running). If syncPod exits with a transient error, the next
// invocation of syncPod is expected to make progress towards reaching the
// runtime state. syncPod exits with isTerminal when the pod was detected to
// have reached a terminal lifecycle phase due to container exits (for
// RestartNever or RestartOnFailure) and the next method invoked will by
// syncTerminatingPod.
//
// Arguments:
//
// updateType - whether this is a create (first time) or an update, should
//
//  only be used for metrics since this method must be reentrant
//
// pod - the pod that is being set up
// mirrorPod - the mirror pod known to the kubelet for this pod, if any
// podStatus - the most recent pod status observed for this pod which can
//
//  be used to determine the set of actions that should be taken during
//  this loop of syncPod
//
// The workflow is:
//   - If the pod is being created, record pod worker start latency
//   - Call generateAPIPodStatus to prepare an v1.PodStatus for the pod
//   - If the pod is being seen as running for the first time, record pod
//     start latency
//   - Update the status of the pod in the status manager
//   - Stop the pod's containers if it should not be running due to soft
//     admission
//   - Ensure any background tracking for a runnable pod is started
//   - Create a mirror pod if the pod is a static pod, and does not
//     already have a mirror pod
//   - Create the data directories for the pod if they do not exist
//   - Wait for volumes to attach/mount
//   - Fetch the pull secrets for the pod
//   - Call the container runtime's SyncPod callback
//   - Update the traffic shaping for the pod's ingress and egress limits
//
// If any step of this workflow errors, the error is returned, and is repeated
// on the next syncPod call.
//
// This operation writes all events that are dispatched in order to provide
// the most accurate information possible about an error situation to aid debugging.
// Callers should not write an event if this operation returns an error.
func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
    // TODO(#113606): connect this with the incoming context parameter, which comes from the pod worker.
    // Currently, using that context causes test failures.
    ctx := context.TODO()
    klog.V(4).InfoS("syncPod enter", "pod", klog.KObj(pod), "podUID", pod.UID)
    defer func() {
        klog.V(4).InfoS("syncPod exit", "pod", klog.KObj(pod), "podUID", pod.UID, "isTerminal", isTerminal)
    }()

    // Latency measurements for the main workflow are relative to the
    // first time the pod was seen by kubelet.
    var firstSeenTime time.Time
    if firstSeenTimeStr, ok := pod.Annotations[kubetypes.ConfigFirstSeenAnnotationKey]; ok {
        firstSeenTime = kubetypes.ConvertToTimestamp(firstSeenTimeStr).Get()
    }

    // Record pod worker start latency if being created
    // TODO: make pod workers record their own latencies
    if updateType == kubetypes.SyncPodCreate {
        if !firstSeenTime.IsZero() {
            // This is the first time we are syncing the pod. Record the latency
            // since kubelet first saw the pod if firstSeenTime is set.
            metrics.PodWorkerStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
        } else {
            klog.V(3).InfoS("First seen time not recorded for pod",
                "podUID", pod.UID,
                "pod", klog.KObj(pod))
        }
    }

    // Parts omitted
}

The pod_worker_start_duration_seconds metric records the time from when kubelet first saw the Pod to the first syncPod call for it.
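The latency measurement described above can be sketched as follows. This is a minimal illustration, not kubelet's code: the annotation key value, the RFC 3339 timestamp format, and the helper name latencyMillis are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// firstSeenKey is the annotation key assumed here for the first-seen timestamp.
const firstSeenKey = "kubernetes.io/config.seen"

// latencyMillis returns the milliseconds between the first-seen annotation and
// now. Both timestamps are RFC 3339 strings (the format is an assumption).
// ok is false when the annotation is missing or either timestamp is malformed.
func latencyMillis(annotations map[string]string, now string) (ms int64, ok bool) {
	raw, found := annotations[firstSeenKey]
	if !found {
		return 0, false
	}
	seen, err := time.Parse(time.RFC3339Nano, raw)
	if err != nil {
		return 0, false
	}
	current, err := time.Parse(time.RFC3339Nano, now)
	if err != nil {
		return 0, false
	}
	return current.Sub(seen).Milliseconds(), true
}

func main() {
	annotations := map[string]string{firstSeenKey: "2024-06-15T15:18:59.123+08:00"}
	if ms, ok := latencyMillis(annotations, "2024-06-15T15:18:59.273+08:00"); ok {
		fmt.Printf("pod worker start latency: %dms\n", ms)
	} else {
		fmt.Println("first seen time not recorded for pod")
	}
}
```

As in syncPod, a missing timestamp is not an error: the sample is simply skipped and a message is logged instead.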

In the next log line, kubelet generates the Pod status:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273841 3390574 kubelet_pods.go:1461] "Generating pod status" pod="default/dp-test-5986bbcc8f-gcrmm"

Corresponding code:

// generateAPIPodStatus creates the final API pod status for a pod, given the
// internal pod status. This method should only be called from within sync*Pod methods.
func (kl *Kubelet) generateAPIPodStatus(pod *v1.Pod, podStatus *kubecontainer.PodStatus) v1.PodStatus {
    klog.V(3).InfoS("Generating pod status", "pod", klog.KObj(pod))
    // Parts omitted
}

A newly created Pod always starts in the Pending phase, as this log line shows:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273867 3390574 kubelet_pods.go:1426] "Pod waiting > 0, pending"

Computing the Pod's phase

The Pod's Phase is computed from the states of all containers in the Pod:

// getPhase returns the phase of a pod given its container info.
func getPhase(spec *v1.PodSpec, info []v1.ContainerStatus) v1.PodPhase {
    pendingInitialization := 0
    failedInitialization := 0
    for _, container := range spec.InitContainers {
        containerStatus, ok := podutil.GetContainerStatus(info, container.Name)
        if !ok {
            pendingInitialization++
            continue
        }

        switch {
        case containerStatus.State.Running != nil:
            pendingInitialization++
        case containerStatus.State.Terminated != nil:
            if containerStatus.State.Terminated.ExitCode != 0 {
                failedInitialization++
            }
        case containerStatus.State.Waiting != nil:
            if containerStatus.LastTerminationState.Terminated != nil {
                if containerStatus.LastTerminationState.Terminated.ExitCode != 0 {
                    failedInitialization++
                }
            } else {
                pendingInitialization++
            }
        default:
            pendingInitialization++
        }
    }

    unknown := 0
    running := 0
    waiting := 0
    stopped := 0
    succeeded := 0
    for _, container := range spec.Containers {
        containerStatus, ok := podutil.GetContainerStatus(info, container.Name)
        if !ok {
            unknown++
            continue
        }

        switch {
        case containerStatus.State.Running != nil:
            running++
        case containerStatus.State.Terminated != nil:
            stopped++
            if containerStatus.State.Terminated.ExitCode == 0 {
                succeeded++
            }
        case containerStatus.State.Waiting != nil:
            if containerStatus.LastTerminationState.Terminated != nil {
                stopped++
            } else {
                waiting++
            }
        default:
            unknown++
        }
    }

    if failedInitialization > 0 && spec.RestartPolicy == v1.RestartPolicyNever {
        return v1.PodFailed
    }

    switch {
    case pendingInitialization > 0:
        fallthrough
    case waiting > 0:
        klog.V(5).InfoS("Pod waiting > 0, pending")
        // One or more containers has not been started
        return v1.PodPending
    case running > 0 && unknown == 0:
        // All containers have been started, and at least
        // one container is running
        return v1.PodRunning
    case running == 0 && stopped > 0 && unknown == 0:
        // All containers are terminated
        if spec.RestartPolicy == v1.RestartPolicyAlways {
            // All containers are in the process of restarting
            return v1.PodRunning
        }
        if stopped == succeeded {
            // RestartPolicy is not Always, and all
            // containers are terminated in success
            return v1.PodSucceeded
        }
        if spec.RestartPolicy == v1.RestartPolicyNever {
            // RestartPolicy is Never, and all containers are
            // terminated with at least one in failure
            return v1.PodFailed
        }
        // RestartPolicy is OnFailure, and at least one in failure
        // and in the process of restarting
        return v1.PodRunning
    default:
        klog.V(5).InfoS("Pod default case, pending")
        return v1.PodPending
    }
}

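failedInitialization is greater than 0 and RestartPolicy is Never: the Pod phase is Failed
pendingInitialization is greater than 0, or the number of waiting containers is greater than 0: the Pod phase is Pending
The number of running containers is greater than 0 and the number of unknown containers is 0: the Pod phase is Running
The number of running containers is 0, the number of stopped containers is greater than 0, and the number of unknown containers is 0:
    RestartPolicy is Always: the Pod phase is Running
    The number of stopped containers equals the number of succeeded containers: the Pod phase is Succeeded
    RestartPolicy is Never: the Pod phase is Failed
    Otherwise (RestartPolicy is OnFailure): the Pod phase is Running
Any other case: the Pod phase is Pending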
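The decision table above can be restated as a small, self-contained function. This is a simplified sketch that takes the already-computed counters as input and uses plain strings instead of the v1 types; the counts struct and the phase function are hypothetical names for illustration only.

```go
package main

import "fmt"

// counts summarizes container states; the field names follow the counters
// used in getPhase.
type counts struct {
	pendingInit, failedInit                       int
	running, waiting, stopped, succeeded, unknown int
}

// phase restates the decision table. restartPolicy is "Always", "OnFailure"
// or "Never".
func phase(c counts, restartPolicy string) string {
	if c.failedInit > 0 && restartPolicy == "Never" {
		return "Failed"
	}
	switch {
	case c.pendingInit > 0, c.waiting > 0:
		// At least one container has not been started yet.
		return "Pending"
	case c.running > 0 && c.unknown == 0:
		return "Running"
	case c.running == 0 && c.stopped > 0 && c.unknown == 0:
		if restartPolicy == "Always" {
			return "Running" // containers are about to be restarted
		}
		if c.stopped == c.succeeded {
			return "Succeeded"
		}
		if restartPolicy == "Never" {
			return "Failed"
		}
		return "Running" // OnFailure: failed containers will be restarted
	default:
		return "Pending"
	}
}

func main() {
	fmt.Println(phase(counts{waiting: 1}, "Always"))
	fmt.Println(phase(counts{running: 1}, "Always"))
	fmt.Println(phase(counts{stopped: 2, succeeded: 1}, "Never"))
}
```

A freshly scheduled Pod such as the one in the logs has one waiting container and nothing else, so it falls into the Pending branch.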

The pod_start_duration_seconds metric records the time from when kubelet first saw the Pod until the Pod is running.

Once the Pod phase has been computed, the following log line is printed:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273874 3390574 kubelet_pods.go:1471] "Got phase for pod" pod="default/dp-test-5986bbcc8f-gcrmm" oldPhase=Pending phase=Pending

With the Pod status computed, the next step is to update it. Corresponding code:

func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
    // Other content omitted
    kl.statusManager.SetPodStatus(pod, apiPodStatus)
    // Other content omitted
}

SetPodStatus calls updateStatusInternal to update the Pod status:

func (m *manager) SetPodStatus(pod *v1.Pod, status v1.PodStatus) {
    m.podStatusesLock.Lock()
    defer m.podStatusesLock.Unlock()

    // Make sure we're caching a deep copy.
    status = *status.DeepCopy()

    // Force a status update if deletion timestamp is set. This is necessary
    // because if the pod is in the non-running state, the pod worker still
    // needs to be able to trigger an update and/or deletion.
    m.updateStatusInternal(pod, status, pod.DeletionTimestamp != nil)
}

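Besides updating the Pod status in the local cache, updateStatusInternal also adds the status to the queue of requests sent to the API server. Inside updateStatusInternal, the following log line is printed: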

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273910 3390574 status_manager.go:552] "updateStatusInternal" version=1 pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e containers="(nginx state=waiting previous=<none>)"

Corresponding code:

func (m *manager) updateStatusInternal(pod *v1.Pod, status v1.PodStatus, forceUpdate bool) bool {
    // Parts omitted
    // Perform some more extensive logging of container termination state to assist in
    // debugging production races (generally not needed).
    if klogV := klog.V(5); klogV.Enabled() {
        var containers []string
        for _, s := range append(append([]v1.ContainerStatus(nil), status.InitContainerStatuses...), status.ContainerStatuses...) {
            var current, previous string
            switch {
            case s.State.Running != nil:
                current = "running"
            case s.State.Waiting != nil:
                current = "waiting"
            case s.State.Terminated != nil:
                current = fmt.Sprintf("terminated=%d", s.State.Terminated.ExitCode)
            default:
                current = "unknown"
            }
            switch {
            case s.LastTerminationState.Running != nil:
                previous = "running"
            case s.LastTerminationState.Waiting != nil:
                previous = "waiting"
            case s.LastTerminationState.Terminated != nil:
                previous = fmt.Sprintf("terminated=%d", s.LastTerminationState.Terminated.ExitCode)
            default:
                previous = "<none>"
            }
            containers = append(containers, fmt.Sprintf("(%s state=%s previous=%s)", s.Name, current, previous))
        }
        sort.Strings(containers)
        klogV.InfoS("updateStatusInternal", "version", cachedStatus.version+1, "pod", klog.KObj(pod), "podUID", pod.UID, "containers", strings.Join(containers, " "))
    }
    // Parts omitted
}

The request to update the Pod status on the API server is added to a channel, and the following log line is printed:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.273990 3390574 status_manager.go:582] "Status Manager: adding pod with new status to podStatusChannel" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e statusVersion=1 status={Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:192.168.0.9 PodIP: PodIPs:[] StartTime:2024-06-15 15:18:59 +0800 CST InitContainerStatuses:[] ContainerStatuses:[{Name:nginx State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:nginx:latest ImageID: ContainerID: Started:0xc001ed178c}] QOSClass:BestEffort EphemeralContainerStatuses:[]}

Corresponding code:

func (m *manager) updateStatusInternal(pod *v1.Pod, status v1.PodStatus, forceUpdate bool) bool {
    // Parts omitted
    m.podStatuses[pod.UID] = newStatus

    select {
    case m.podStatusChannel <- podStatusSyncRequest{pod.UID, newStatus}:
        klog.V(5).InfoS("Status Manager: adding pod with new status to podStatusChannel",
            "pod", klog.KObj(pod),
            "podUID", pod.UID,
            "statusVersion", newStatus.version,
            "status", newStatus.status)
        return true
    default:
        // Let the periodic syncBatch handle the update if the channel is full.
        // We can't block, since we hold the mutex lock.
        klog.V(4).InfoS("Skipping the status update for pod for now because the channel is full",
            "pod", klog.KObj(pod),
            "status", status)
        return false
    }
    // Parts omitted
}

As you can see, the request contains only two things, the UID and the new status:

pod.UID
newStatus
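The cache-then-enqueue behavior of updateStatusInternal can be sketched with a mutex-protected map and a non-blocking channel send. This is a trimmed-down illustration under assumptions: statusCache, syncRequest and set are hypothetical names, and the status is reduced to a single phase string.

```go
package main

import (
	"fmt"
	"sync"
)

// syncRequest carries only the pod UID and the new versioned status.
type syncRequest struct {
	uid     string
	version uint64
	phase   string
}

// statusCache is a hypothetical stand-in for the status manager.
type statusCache struct {
	mu       sync.Mutex
	statuses map[string]syncRequest
	queue    chan syncRequest
}

func newStatusCache(queueSize int) *statusCache {
	return &statusCache{
		statuses: make(map[string]syncRequest),
		queue:    make(chan syncRequest, queueSize),
	}
}

// set always updates the local cache, then tries to enqueue the request.
// It never blocks because the mutex is held; when the queue is full it
// returns false and a later periodic sync is expected to pick the status up.
func (c *statusCache) set(uid, phase string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()

	next := syncRequest{uid: uid, version: c.statuses[uid].version + 1, phase: phase}
	c.statuses[uid] = next

	select {
	case c.queue <- next:
		return true
	default:
		return false
	}
}

func main() {
	cache := newStatusCache(1)
	fmt.Println("enqueued:", cache.set("dd780355", "Pending"))
	fmt.Println("enqueued:", cache.set("dd780355", "Running"))
	fmt.Println("cached version:", cache.statuses["dd780355"].version)
}
```

The cache is the source of truth here: a dropped request loses nothing, because the newest status is still in the map when the periodic sync runs.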

Updating the Pod status on the API server

The next log line therefore shows the channel's handler (the Status Manager) receiving this request:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.274047 3390574 status_manager.go:186] "Status Manager: syncing pod with status from podStatusChannel" podUID=dd780355-f908-4bf2-9193-68dabbbca69e statusVersion=1 status={Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:192.168.0.9 PodIP: PodIPs:[] StartTime:2024-06-15 15:18:59 +0800 CST InitContainerStatuses:[] ContainerStatuses:[{Name:nginx State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:nginx:latest ImageID: ContainerID: Started:0xc001ed178c}] QOSClass:BestEffort EphemeralContainerStatuses:[]}

Corresponding code:

func (m *manager) Start() {
    // Don't start the status manager if we don't have a client. This will happen
    // on the master, where the kubelet is responsible for bootstrapping the pods
    // of the master components.
    if m.kubeClient == nil {
        klog.InfoS("Kubernetes client is nil, not starting status manager")
        return
    }

    klog.InfoS("Starting to sync pod status with apiserver")

    //nolint:staticcheck // SA1015 Ticker can leak since this is only called once and doesn't handle termination.
    syncTicker := time.NewTicker(syncPeriod).C

    // syncPod and syncBatch share the same go routine to avoid sync races.
    go wait.Forever(func() {
        for {
            select {
            case syncRequest := <-m.podStatusChannel:
                klog.V(5).InfoS("Status Manager: syncing pod with status from podStatusChannel",
                    "podUID", syncRequest.podUID,
                    "statusVersion", syncRequest.status.version,
                    "status", syncRequest.status.status)
                m.syncPod(syncRequest.podUID, syncRequest.status)
            case <-syncTicker:
                klog.V(5).InfoS("Status Manager: syncing batch")
                // remove any entries in the status channel since the batch will handle them
                for i := len(m.podStatusChannel); i > 0; i-- {
                    <-m.podStatusChannel
                }
                m.syncBatch()
            }
        }
    }, 0)
}
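The loop in Start handles two kinds of events in a single goroutine: per-pod requests and a periodic tick that drains the queue before a full batch sync. A minimal sketch of that pattern follows; the step and drain helpers and the string-typed requests are assumptions made for illustration.

```go
package main

import "fmt"

// drain discards everything currently buffered in ch and reports how many
// items were dropped. The batch sync that follows covers them anyway.
func drain(ch chan string) int {
	dropped := 0
	for i := len(ch); i > 0; i-- {
		<-ch
		dropped++
	}
	return dropped
}

// step handles exactly one event: either a per-pod request or a tick.
// Running both cases in the same goroutine avoids races between them.
func step(requests chan string, tick <-chan struct{}, syncOne func(string), syncBatch func()) {
	select {
	case uid := <-requests:
		syncOne(uid)
	case <-tick:
		drain(requests)
		syncBatch()
	}
}

func main() {
	requests := make(chan string, 4)
	tick := make(chan struct{}, 1)
	syncOne := func(uid string) { fmt.Println("sync pod", uid) }
	syncBatch := func() { fmt.Println("sync batch") }

	requests <- "pod-a"
	step(requests, tick, syncOne, syncBatch) // only a request is ready

	requests <- "pod-b"
	requests <- "pod-c"
	fmt.Println("dropped before batch:", drain(requests))

	tick <- struct{}{}
	step(requests, tick, syncOne, syncBatch) // only the tick is ready
}
```

When both a request and a tick are ready, Go's select picks one at random, so the sketch keeps only one of them ready at a time to stay deterministic.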

The Status Manager runs in a goroutine and pushes status to the API server mainly through m.syncPod, which logs the patch it sends:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.285142 3390574 status_manager.go:719] "Patch status for pod" pod="default/dp-test-5986bbcc8f-gcrmm" podUID=dd780355-f908-4bf2-9193-68dabbbca69e patch="{\"metadata\":{\"uid\":\"dd780355-f908-4bf2-9193-68dabbbca69e\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"lastProbeTime\":null,\"lastTransitionTime\":\"2024-06-15T07:18:59Z\",\"status\":\"True\",\"type\":\"Initialized\"},{\"lastProbeTime\":null,\"lastTransitionTime\":\"2024-06-15T07:18:59Z\",\"message\":\"containers with unready status: [nginx]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"Ready\"},{\"lastProbeTime\":null,\"lastTransitionTime\":\"2024-06-15T07:18:59Z\",\"message\":\"containers with unready status: [nginx]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"image\":\"nginx:latest\",\"imageID\":\"\",\"lastState\":{},\"name\":\"nginx\",\"ready\":false,\"restartCount\":0,\"started\":false,\"state\":{\"waiting\":{\"reason\":\"ContainerCreating\"}}}],\"hostIP\":\"192.168.0.9\",\"startTime\":\"2024-06-15T07:18:59Z\"}}"

Corresponding code:

// syncPod syncs the given status with the API server. The caller must not hold the lock.
func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
    // Parts omitted
    newPod, patchBytes, unchanged, err := statusutil.PatchPodStatus(context.TODO(), m.kubeClient, pod.Namespace, pod.Name, pod.UID, pod.Status, mergedStatus)
    klog.V(3).InfoS("Patch status for pod", "pod", klog.KObj(pod), "podUID", uid, "patch", string(patchBytes))
    // Parts omitted
}

Here kubelet uses the Patch API against the Pod's status subresource:

updatedPod, err := c.CoreV1().Pods(namespace).Patch(ctx, name, types.StrategicMergePatchType, patchBytes, metav1.PatchOptions{}, "status")
if err != nil {
    return nil, nil, false, fmt.Errorf("failed to patch status %q for pod %q/%q: %v", patchBytes, namespace, name, err)
}
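The patch in the log above contains only the fields that changed, plus the Pod UID under metadata. The idea can be sketched with a plain field-level diff. Note the assumption: this is a simplified top-level comparison and not the strategic merge patch algorithm that kubelet actually uses; statusPatch is a hypothetical helper, and removed fields are not handled.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// statusPatch returns a JSON document holding only the top-level status
// fields of newStatus that differ from oldStatus, plus the pod UID.
// unchanged is true when there is nothing to send.
func statusPatch(uid string, oldStatus, newStatus map[string]any) (patch []byte, unchanged bool, err error) {
	changed := map[string]any{}
	for key, value := range newStatus {
		if !reflect.DeepEqual(oldStatus[key], value) {
			changed[key] = value
		}
	}
	if len(changed) == 0 {
		return nil, true, nil
	}
	body := map[string]any{
		"metadata": map[string]any{"uid": uid},
		"status":   changed,
	}
	patch, err = json.Marshal(body)
	return patch, false, err
}

func main() {
	oldStatus := map[string]any{"phase": "Pending"}
	newStatus := map[string]any{"phase": "Pending", "hostIP": "192.168.0.9"}

	patch, unchanged, err := statusPatch("dd780355", oldStatus, newStatus)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println("unchanged:", unchanged)
	fmt.Println(string(patch))
}
```

Sending only the difference keeps the request small and avoids overwriting fields the kubelet did not touch.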

After the update completes, the following log line is printed:

Jun 15 15:18:59 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:18:59.285211 3390574 status_manager.go:728] "Status for pod updated successfully" pod="default/dp-test-5986bbcc8f-gcrmm" statusVersion=1 status={Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason:ContainersNotReady Message:containers with unready status: [nginx]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2024-06-15 15:18:59 +0800 CST Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:192.168.0.9 PodIP: PodIPs:[] StartTime:2024-06-15 15:18:59 +0800 CST InitContainerStatuses:[] ContainerStatuses:[{Name:nginx State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:nginx:latest ImageID: ContainerID: Started:0xc001ed178c}] QOSClass:BestEffort EphemeralContainerStatuses:[]}

Corresponding code:

func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
    // Parts omitted
    if unchanged {
        klog.V(3).InfoS("Status for pod is up-to-date", "pod", klog.KObj(pod), "statusVersion", status.version)
    } else {
        klog.V(3).InfoS("Status for pod updated successfully", "pod", klog.KObj(pod), "statusVersion", status.version, "status", mergedStatus)
        pod = newPod
        // We pass a new object (result of API call which contains updated ResourceVersion)
        m.podStartupLatencyHelper.RecordStatusUpdated(pod)
    }
    // Parts omitted
}

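Creating the Cgroup

Next, kubelet creates a cgroup for the Pod and updates its resource configuration:

func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
    // Parts omitted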
    // Create Cgroups for the pod and apply resource parameters
    // to them if cgroups-per-qos flag is enabled.
    pcm := kl.containerManager.NewPodContainerManager()
    // If pod has already been terminated then we need not create
    // or update the pod's cgroup
    // TODO: once context cancellation is added this check can be removed
    if !kl.podWorkers.IsPodTerminationRequested(pod.UID) {
        // When the kubelet is restarted with the cgroups-per-qos
        // flag enabled, all the pod's running containers
        // should be killed intermittently and brought back up
        // under the qos cgroup hierarchy.
        // Check if this is the pod's first sync
        firstSync := true
        for _, containerStatus := range apiPodStatus.ContainerStatuses {
            if containerStatus.State.Running != nil {
                firstSync = false
                break
            }
        }
        // Don't kill containers in pod if pod's cgroups already
        // exists or the pod is running for the first time
        podKilled := false
        if !pcm.Exists(pod) && !firstSync {
            p := kubecontainer.ConvertPodStatusToRunningPod(kl.getRuntime().Type(), podStatus)
            if err := kl.killPod(ctx, pod, p, nil); err == nil {
                podKilled = true
            } else {
                klog.ErrorS(err, "KillPod failed", "pod", klog.KObj(pod), "podStatus", podStatus)
            }
        }
        // Create and Update pod's Cgroups
        // Don't create cgroups for run once pod if it was killed above
        // The current policy is not to restart the run once pods when
        // the kubelet is restarted with the new flag as run once pods are
        // expected to run only once and if the kubelet is restarted then
        // they are not expected to run again.
        // We don't create and apply updates to cgroup if its a run once pod and was killed above
        if !(podKilled && pod.Spec.RestartPolicy == v1.RestartPolicyNever) {
            if !pcm.Exists(pod) {
                if err := kl.containerManager.UpdateQOSCgroups(); err != nil {
                    klog.V(2).InfoS("Failed to update QoS cgroups while syncing pod", "pod", klog.KObj(pod), "err", err)
                }
                if err := pcm.EnsureExists(pod); err != nil {
                    kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToCreatePodContainer, "unable to ensure pod container exists: %v", err)
                    return false, fmt.Errorf("failed to ensure that the pod: %v cgroups exist and are correctly applied: %v", pod.UID, err)
                }
            }
        }
    }
    // Parts omitted
}

UpdateQOSCgroups updates the resource configuration; if the cgroup does not exist, pcm.EnsureExists creates it:

// EnsureExists takes a pod as argument and makes sure that
// pod cgroup exists if qos cgroup hierarchy flag is enabled.
// If the pod level container doesn't already exist it is created.
func (m *podContainerManagerImpl) EnsureExists(pod *v1.Pod) error {
    podContainerName, _ := m.GetPodContainerName(pod)
    // check if container already exist
    alreadyExists := m.Exists(pod)
    if !alreadyExists {
        enforceMemoryQoS := false
        if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.MemoryQoS) &&
            libcontainercgroups.IsCgroup2UnifiedMode() {
            enforceMemoryQoS = true
        }
        // Create the pod container
        containerConfig := &CgroupConfig{
            Name:               podContainerName,
            ResourceParameters: ResourceConfigForPod(pod, m.enforceCPULimits, m.cpuCFSQuotaPeriod, enforceMemoryQoS),
        }
        if m.podPidsLimit > 0 {
            containerConfig.ResourceParameters.PidsLimit = &m.podPidsLimit
        }
        if enforceMemoryQoS {
            klog.V(4).InfoS("MemoryQoS config for pod", "pod", klog.KObj(pod), "unified", containerConfig.ResourceParameters.Unified)
        }
        if pod.Annotations[annotationHostResourceView] == "false" {
            hostResourceView := false
            containerConfig.ResourceParameters.HostResourceView = &hostResourceView
        }
        if err := m.cgroupManager.Create(containerConfig); err != nil {
            return fmt.Errorf("failed to create container for %v : %v", podContainerName, err)
        }
    }
    return nil
}

The cgroup_manager_duration_seconds metric records how long cgroup creation takes (other operation types are distinguished by a label).

Updating a cgroup relies on a package from runc:

// Update updates the cgroup with the specified Cgroup Configuration
func (m *cgroupManagerImpl) Update(cgroupConfig *CgroupConfig) error {
    start := time.Now()
    defer func() {
        metrics.CgroupManagerDuration.WithLabelValues("update").Observe(metrics.SinceInSeconds(start))
    }()

    libcontainerCgroupConfig := m.libctCgroupConfig(cgroupConfig, true)
    manager, err := manager.New(libcontainerCgroupConfig)
    if err != nil {
        return fmt.Errorf("failed to create cgroup manager: %v", err)
    }
    return manager.Set(libcontainerCgroupConfig.Resources)
}
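The deferred timing used in Update, where one metric is shared by several operations and separated by a label, can be sketched without any metrics library. The opTimer type below is a hypothetical stand-in for a labeled histogram and simply keeps the raw samples in memory.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// opTimer is a hypothetical stand-in for a histogram with an operation label.
type opTimer struct {
	mu      sync.Mutex
	samples map[string][]time.Duration
}

func newOpTimer() *opTimer {
	return &opTimer{samples: make(map[string][]time.Duration)}
}

// observe records the time elapsed since start under the given label.
func (t *opTimer) observe(op string, start time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.samples[op] = append(t.samples[op], time.Since(start))
}

// timed runs fn and records its duration even when fn returns an error,
// because the deferred call runs on every return path.
func (t *opTimer) timed(op string, fn func() error) error {
	start := time.Now()
	defer t.observe(op, start)
	return fn()
}

func main() {
	timer := newOpTimer()
	_ = timer.timed("create", func() error { time.Sleep(2 * time.Millisecond); return nil })
	_ = timer.timed("update", func() error { return nil })
	for op, samples := range timer.samples {
		fmt.Printf("%s: %d sample(s)\n", op, len(samples))
	}
}
```

Using defer means failed operations are measured too, which matters when slow failures are exactly what you want to see in the metric.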

Creating the Pod directories

Next, kubelet creates the directories for each Pod:

// makePodDataDirs creates the dirs for the pod datas.
func (kl *Kubelet) makePodDataDirs(pod *v1.Pod) error {
    uid := pod.UID
    if err := os.MkdirAll(kl.getPodDir(uid), 0750); err != nil && !os.IsExist(err) {
        return err
    }
    if err := os.MkdirAll(kl.getPodVolumesDir(uid), 0750); err != nil && !os.IsExist(err) {
        return err
    }
    if err := os.MkdirAll(kl.getPodPluginsDir(uid), 0750); err != nil && !os.IsExist(err) {
        return err
    }
    return nil
}

Three directories are created: the Pod's root directory, its volumes directory, and its plugins directory.
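The following sketch creates the same three directories under a temporary root. The pods/<uid>/volumes and pods/<uid>/plugins layout and the helper names are assumptions made for illustration. One detail worth noting: os.MkdirAll already returns nil when the directory exists, so the sketch does not need an extra os.IsExist check.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// podDirs returns the pod root, volumes and plugins directories under root.
// The layout is an assumption for this example.
func podDirs(root, uid string) []string {
	podDir := filepath.Join(root, "pods", uid)
	return []string{
		podDir,
		filepath.Join(podDir, "volumes"),
		filepath.Join(podDir, "plugins"),
	}
}

// makePodDataDirs creates every directory and is safe to call repeatedly.
func makePodDataDirs(root, uid string) error {
	for _, dir := range podDirs(root, uid) {
		if err := os.MkdirAll(dir, 0o750); err != nil {
			return err
		}
	}
	return nil
}

// createInTemp runs makePodDataDirs twice in a throwaway root and returns
// how many of the expected directories exist afterwards.
func createInTemp(uid string) (int, error) {
	root, err := os.MkdirTemp("", "kubelet-root-")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(root)

	for i := 0; i < 2; i++ { // the second call shows the operation is idempotent
		if err := makePodDataDirs(root, uid); err != nil {
			return 0, err
		}
	}
	existing := 0
	for _, dir := range podDirs(root, uid) {
		if info, err := os.Stat(dir); err == nil && info.IsDir() {
			existing++
		}
	}
	return existing, nil
}

func main() {
	n, err := createInTemp("dd780355-f908-4bf2-9193-68dabbbca69e")
	if err != nil {
		fmt.Println("failed:", err)
		return
	}
	fmt.Println("directories present:", n)
}
```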

Mounting volumes

Next, kubelet waits until the volumes the Pod needs are mounted (it does not wait for Pods that are terminating):

Jun 15 15:19:06 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:19:06.599897 3390574 volume_manager.go:408] "Waiting for volumes to attach and mount for pod" pod="default/dp-test-5986bbcc8f-gcrmm"

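Corresponding code:

func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
    // Parts omitted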
    // Volume manager will not mount volumes for terminating pods
    // TODO: once context cancellation is added this check can be removed
    if !kl.podWorkers.IsPodTerminationRequested(pod.UID) {
        // Wait for volumes to attach/mount
        if err := kl.volumeManager.WaitForAttachAndMount(pod); err != nil {
            kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedMountVolume, "Unable to attach or mount volumes: %v", err)
            klog.ErrorS(err, "Unable to attach or mount volumes for pod; skipping pod", "pod", klog.KObj(pod))
            return false, err
        }
    }
    // Parts omitted
}

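Kubelet tracks the state of a Pod's volumes through desiredStateOfWorld and actualStateOfWorld. Here, WaitForAttachAndMount relies only on these two structures to verify that every volume has been attached and mounted:

Different volume types are handled differently, and third-party volumes also involve a CSI plugin. That will be analyzed separately in a later article.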

func (vm *volumeManager) WaitForAttachAndMount(pod *v1.Pod) error {
    // Parts omitted
    err := wait.PollImmediate(
        podAttachAndMountRetryInterval,
        podAttachAndMountTimeout,
        vm.verifyVolumesMountedFunc(uniquePodName, expectedVolumes))

    if err != nil {
        unmountedVolumes :=
            vm.getUnmountedVolumes(uniquePodName, expectedVolumes)
        // Also get unattached volumes for error message
        unattachedVolumes :=
            vm.getUnattachedVolumes(expectedVolumes)

        if len(unmountedVolumes) == 0 {
            return nil
        }

        return fmt.Errorf(
            "unmounted volumes=%v, unattached volumes=%v: %s",
            unmountedVolumes,
            unattachedVolumes,
            err)
    }
    // Parts omitted
}

If this step fails, the error appears in the Pod's events. This is an event we run into often:

Unable to attach or mount volumes xxx

The unmounted volumes and unattached volumes values here are derived from desiredStateOfWorld and actualStateOfWorld.

If all volumes have been mounted successfully, the following log line is printed:

Jun 15 15:19:06 iv-yd6ldax3wgr9cxxjx9iv kubelet[3390574]: I0615 15:19:06.599946 3390574 volume_manager.go:439] "All volumes are attached and mounted for pod" pod="default/dp-test-5986bbcc8f-gcrmm"
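The wait described in this section boils down to polling until the expected volumes all appear in the observed state, and reporting the missing ones on timeout. A minimal sketch follows, assuming a fixed number of attempts instead of a wall-clock timeout; waitForMount, unmounted and errTimeout are hypothetical names, and volumes are reduced to plain strings.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTimeout = errors.New("timed out waiting for the condition")

// unmounted returns the expected volumes that are missing from mounted.
func unmounted(expected []string, mounted map[string]bool) []string {
	var missing []string
	for _, name := range expected {
		if !mounted[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

// waitForMount checks immediately and then once per interval, up to attempts
// times. actual returns the currently observed set of mounted volumes.
func waitForMount(expected []string, actual func() map[string]bool, interval time.Duration, attempts int) error {
	for i := 0; i < attempts; i++ {
		if len(unmounted(expected, actual())) == 0 {
			return nil
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("unmounted volumes=%v: %w", unmounted(expected, actual()), errTimeout)
}

func main() {
	polls := 0
	// The fake state reports the volume as mounted from the third poll on.
	actual := func() map[string]bool {
		polls++
		return map[string]bool{"data": polls >= 3}
	}
	err := waitForMount([]string{"data"}, actual, 10*time.Millisecond, 5)
	fmt.Println("error:", err, "polls:", polls)
}
```

The function only reads state. Something else, such as a reconciler loop, is responsible for actually mounting the volumes and updating what actual returns.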

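Fetching imagePullSecrets

Next, kubelet fetches the imagePullSecrets from the API server:

func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
    // Parts omitted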
    // Fetch the pull secrets for the pod
    pullSecrets := kl.getPullSecretsForPod(pod)
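    // Parts omitted
}

The imagePullSecrets are passed to the containerRuntime's SyncPod method, which starts the containers:

func (kl *Kubelet) syncPod(_ context.Context, updateType kubetypes.SyncPodType, pod, mirrorPod *v1.Pod, podStatus *kubecontainer.PodStatus) (isTerminal bool, err error) {
    // Parts omitted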
    // Ensure the pod is being probed
    kl.probeManager.AddPod(pod)

    // Call the container runtime's SyncPod callback
    result := kl.containerRuntime.SyncPod(ctx, pod, podStatus, pullSecrets, kl.backOff)
    kl.reasonCache.Update(pod.UID, result)
    if err := result.Error(); err != nil {
        // Do not return error if the only failures were pods in backoff
        for _, r := range result.SyncResults {
            if r.Error != kubecontainer.ErrCrashLoopBackOff && r.Error != images.ErrImagePullBackOff {
                // Do not record an event here, as we keep all event logging for sync pod failures
                // local to container runtime, so we get better errors.
                return false, err
            }
        }

        return false, nil
    }
    // Parts omitted
}
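The error handling after SyncPod follows the intent stated in its comment: back-off failures are expected and should not surface as sync errors. A sketch of that intent with sentinel errors is shown below. It differs from the snippet above in two labeled ways: it skips nil results, and it returns the first hard failure rather than the aggregated error. All names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel errors standing in for the runtime's back-off errors.
var (
	errCrashLoopBackOff = errors.New("CrashLoopBackOff")
	errImagePullBackOff = errors.New("ImagePullBackOff")
	errSandbox          = errors.New("sandbox creation failed") // example hard failure
)

// firstHardFailure returns the first error that is not a back-off, or nil
// when every failure is a back-off. Nil entries stand for successful steps
// and are skipped (an assumption of this sketch).
func firstHardFailure(results []error) error {
	for _, err := range results {
		if err == nil {
			continue
		}
		if !errors.Is(err, errCrashLoopBackOff) && !errors.Is(err, errImagePullBackOff) {
			return err
		}
	}
	return nil
}

func main() {
	onlyBackOff := []error{nil, fmt.Errorf("container nginx: %w", errImagePullBackOff)}
	fmt.Println("only back-off:", firstHardFailure(onlyBackOff))

	mixed := []error{errCrashLoopBackOff, errSandbox}
	fmt.Println("mixed:", firstHardFailure(mixed))
}
```

Using errors.Is instead of a direct comparison lets the check see through wrapped errors, which a plain != comparison would treat as hard failures.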

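Creating the Sandbox and Containers

At this point kubelet has finished its preparation for creating the Pod. The SyncPod method calls the runtime's RPCs through the CRI interface to create the Sandbox and the Containers:

The detailed logic will be analyzed separately in a later article.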

// SyncPod syncs the running pod into the desired pod by executing following steps:
//
//  1. Compute sandbox and container changes.
//  2. Kill pod sandbox if necessary.
//  3. Kill any containers that should not be running.
//  4. Create sandbox if necessary.
//  5. Create ephemeral containers.
//  6. Create init containers.
//  7. Create normal containers.

posted @ 2024-06-17 00:06 abin在路上
