
kubernetes source code - kubelet principles and source code analysis (Part 5)

The kubelet eviction manager: evicting pods to reclaim resources. The source code is from the release-1.26 branch of kubernetes.

Preface

  • In day-to-day Kubernetes operations you will sooner or later run into pods being evicted. Why does a pod get evicted? If we configure eviction thresholds on a node, then once the node's resource usage crosses a threshold, the kubelet evicts lower-priority pods from that node to free up resources for higher-priority pods.

  • There are two kinds of eviction, soft and hard. As the names suggest, a soft eviction defines a grace period: the threshold must stay crossed for that long before eviction is triggered, whereas a hard eviction evicts immediately (see the sketch after this list).

  • Eviction by the kubelet also differs from eviction requested through the api-server. An eviction triggered via the api-server is effectively a normal delete operation; see API-initiated Eviction on the official website for details.
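
To make the soft/hard distinction concrete, here is a minimal sketch (assuming the evictionapi package from pkg/kubelet/eviction/api): internally both kinds of threshold are the same evictionapi.Threshold struct, and the eviction manager treats a threshold as hard exactly when its GracePeriod is zero.

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/api/resource"
	evictionapi "k8s.io/kubernetes/pkg/kubelet/eviction/api"
)

func main() {
	q := resource.MustParse("1Gi")
	// hard threshold: evict as soon as memory.available < 1Gi
	hard := evictionapi.Threshold{
		Signal:   evictionapi.SignalMemoryAvailable,
		Operator: evictionapi.OpLessThan,
		Value:    evictionapi.ThresholdValue{Quantity: &q},
		// GracePeriod == 0 marks a hard threshold
	}
	// soft threshold: the same condition must hold for 90s before eviction
	soft := hard
	soft.GracePeriod = 90 * time.Second

	// mirrors isHardEvictionThreshold() in pkg/kubelet/eviction/helpers.go
	isHard := func(t evictionapi.Threshold) bool { return t.GracePeriod == time.Duration(0) }
	fmt.Println(isHard(hard), isHard(soft)) // true false
}

This is also why the eviction loop below only branches on isHardEvictionThreshold() when deciding the grace period override for the kill.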

With that background, let's dig into the kubelet eviction source code and see how the logic works.

Entry Point

In pkg/kubelet/kubelet.go, the evictionManager is created in the kubelet constructor, and it is started when the runtime-dependent modules are initialized.

    // in the NewMainKubelet constructor
    // setup eviction manager
    evictionManager, evictionAdmitHandler := eviction.NewManager(klet.resourceAnalyzer, evictionConfig,
        killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.podManager.GetMirrorPodByPod, klet.imageManager, klet.containerGC, kubeDeps.Recorder, nodeRef, klet.clock, kubeCfg.LocalStorageCapacityIsolation)

    klet.evictionManager = evictionManager
    klet.admitHandlers.AddPodAdmitHandler(evictionAdmitHandler)

    // in initializeRuntimeDependentModules: start the eviction manager
    // eviction manager must start after cadvisor because it needs to know if the container runtime has a dedicated imagefs
    kl.evictionManager.Start(kl.StatsProvider, kl.GetActivePods, kl.podResourcesAreReclaimed, evictionMonitoringPeriod)

The Start Function

  • The most important method reached from the start function is synchronize().

  • We can see there are two places that trigger eviction: the eviction manager's monitoring loop, which runs once every 10s (evictionMonitoringPeriod), and notifier.Start(). Both ultimately call synchronize().

// Start starts the control loop to observe and response to low compute resources.
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, podCleanedUpFunc PodCleanedUpFunc, monitoringInterval time.Duration) {
    thresholdHandler := func(message string) {
        klog.InfoS(message)
        m.synchronize(diskInfoProvider, podFunc)
    }
    if m.config.KernelMemcgNotification {
        for _, threshold := range m.config.Thresholds {
            if threshold.Signal == evictionapi.SignalMemoryAvailable || threshold.Signal == evictionapi.SignalAllocatableMemoryAvailable {
                notifier, err := NewMemoryThresholdNotifier(threshold, m.config.PodCgroupRoot, &CgroupNotifierFactory{}, thresholdHandler)
                if err != nil {
                    klog.InfoS("Eviction manager: failed to create memory threshold notifier", "err", err)
                } else {
                    go notifier.Start()
                    m.thresholdNotifiers = append(m.thresholdNotifiers, notifier)
                }
            }
        }
    }
    // start the eviction manager monitoring
    go func() {
        for {
            if evictedPods := m.synchronize(diskInfoProvider, podFunc); evictedPods != nil {
                klog.InfoS("Eviction manager: pods evicted, waiting for pod to be cleaned up", "pods", klog.KObjSlice(evictedPods))
                m.waitForPodsCleanup(podCleanedUpFunc, evictedPods)
            } else {
                time.Sleep(monitoringInterval)
            }
        }
    }()
}

synchronize()

Let's first walk through the logic of synchronize(), and then look at what notifier.Start() does.

  1. Build the ranking functions based on filesystem partitions:

    • Check which partition holds the container runtime's image storage.

    • Check which partition holds the kubelet root-dir.

    • If the runtime's image storage and the kubelet's directory live on different partitions, pods are ranked for nodefs by their log and local-storage usage, and for imagefs by their containers' writable-layer usage. Otherwise the two signals share a common ranking function.

  2. Build the node-level resource reclaim functions; the approach mirrors the one above: determine whether the partitions are shared, then build accordingly.

  3. Based on that partition check, the build functions decide whether the kubelet's and the runtime's storage directories are on the same partition, and therefore which data should be cleaned up.

  4. Get all active pods, filtering out pods that have exited or are in the process of exiting.

  5. Fetch the kubelet's summary of node statistics; since updateStats is true, some statistics are also refreshed during the fetch.

  6. Make signal observations and obtain a function for deriving pod usage statistics relative to those observations; the function takes a pod and returns that pod's resource usage.

  7. notifier.UpdateThreshold() updates the memory cgroup threshold from the metrics just collected. Calling UpdateThreshold with the latest metrics lets the ThresholdNotifier trigger evictions more accurately.

  8. thresholdsMet() computes and returns a slice recording which resources have a threshold plus minimum eviction reclaim greater than their available capacity.

    • For example, with evictionHard: nodefs.available: "1Gi" and evictionMinimumReclaim: nodefs.available: "500Mi", the kubelet reclaims resources until the signal reaches 1GiB, then continues reclaiming at least 500MiB more until the signal reaches 1.5GiB. The returned slice records which resources have not yet been reclaimed back above their eviction-threshold watermark.
  9. Record the last time each eviction threshold was observed.

  10. Set the node conditions associated with the thresholds, such as DiskPressure, MemoryPressure, and PIDPressure, and record the last time each condition was observed.

  11. In some cases a node oscillates above and below a soft eviction condition without holding it for the defined grace period, which makes the reported node condition flap between true and false and leads to bad eviction decisions. To prevent this oscillation, the eviction-pressure-transition-period flag controls how long the kubelet must wait before transitioning a node condition to a different state. The transition period defaults to 5m.

  12. localStorageEviction() checks each pod's EmptyDir volume usage and decides whether it exceeds the configured limit so the pod must be evicted. It also checks every container in the pod; if a container's writable-layer usage exceeds its limit, the pod is evicted as well.

  13. Sort the resource types: memory takes precedence over all other resource types and is reclaimed first.

  14. reclaimNodeLevelResources() checks, before finally evicting user pods, whether node-level resources can be reclaimed to relieve the pressure; it calls containerGC.DeleteAllUnusedContainers and imageGC.DeleteUnusedImages to reclaim unused containers and images. If this step frees enough resources, no pod needs to be evicted; otherwise the logic continues.

  15. Next, fetch the ranking function for the resource type being reclaimed and call rank(activePods, statsFunc), which sorts the pods by eviction priority.

    • Taking memory as an example, orderedBy(exceedMemoryRequests(stats), priority, memory(stats)).Sort(pods) first checks whether each pod's memory usage exceeds its requests, then compares pod priority, and finally compares memory usage (see the sketch after this list).
  16. A for loop walks activePods and kills the best candidate; at most one pod is evicted per eviction cycle.
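
Before the full synchronize() listing, here is a self-contained sketch of the orderedBy tie-breaking pattern from step 15 (simplified: a toy podStat struct stands in for *v1.Pod and the stats-backed comparators, and the real memory comparator actually ranks by usage above requests). The idea: the first comparator that disagrees decides the order, and later comparators only break ties.

package main

import (
	"fmt"
	"sort"
)

// podStat is a stand-in for a pod plus its usage stats.
type podStat struct {
	name            string
	exceedsRequests bool  // memory usage above requests?
	priority        int32 // pod priority
	memUsage        int64 // bytes of memory in use
}

// cmpFunc compares two pods; negative means p1 should be evicted first.
type cmpFunc func(p1, p2 podStat) int

// orderedBy chains comparators the way pkg/kubelet/eviction/helpers.go does:
// the first comparator that returns non-zero decides the order.
func orderedBy(cmps ...cmpFunc) func(pods []podStat) {
	return func(pods []podStat) {
		sort.SliceStable(pods, func(i, j int) bool {
			for _, cmp := range cmps {
				if r := cmp(pods[i], pods[j]); r != 0 {
					return r < 0
				}
			}
			return false
		})
	}
}

// pods above their requests sort first (evicted first)
func exceedMemoryRequests(p1, p2 podStat) int {
	if p1.exceedsRequests == p2.exceedsRequests {
		return 0
	}
	if p1.exceedsRequests {
		return -1
	}
	return 1
}

// lower priority sorts first
func priority(p1, p2 podStat) int {
	return int(p1.priority - p2.priority)
}

// higher memory usage sorts first
func memory(p1, p2 podStat) int {
	switch {
	case p1.memUsage > p2.memUsage:
		return -1
	case p1.memUsage < p2.memUsage:
		return 1
	}
	return 0
}

func main() {
	pods := []podStat{
		{"a", false, 1000, 500},
		{"b", true, 2000, 100},
		{"c", true, 2000, 300},
	}
	orderedBy(exceedMemoryRequests, priority, memory)(pods)
	// b and c (over requests) rank ahead of a; c before b (more memory)
	fmt.Println(pods) // [{c ...} {b ...} {a ...}]
}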

func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    ctx := context.Background()
    // if we have nothing to do, just return
    thresholds := m.config.Thresholds
    if len(thresholds) == 0 && !m.localStorageCapacityIsolation {
        return nil
    }

    klog.V(3).InfoS("Eviction manager: synchronize housekeeping")
    // build the ranking functions (if not yet known)
    // TODO: have a function in cadvisor that lets us know if global housekeeping has completed
    if m.dedicatedImageFs == nil {
        hasImageFs, ok := diskInfoProvider.HasDedicatedImageFs(ctx)
        if ok != nil {
            return nil
        }
        m.dedicatedImageFs = &hasImageFs
        m.signalToRankFunc = buildSignalToRankFunc(hasImageFs)
        m.signalToNodeReclaimFuncs = buildSignalToNodeReclaimFuncs(m.imageGC, m.containerGC, hasImageFs)
    }

    activePods := podFunc()
    updateStats := true
    summary, err := m.summaryProvider.Get(ctx, updateStats)
    if err != nil {
        klog.ErrorS(err, "Eviction manager: failed to get summary stats")
        return nil
    }

    if m.clock.Since(m.thresholdsLastUpdated) > notifierRefreshInterval {
        m.thresholdsLastUpdated = m.clock.Now()
        for _, notifier := range m.thresholdNotifiers {
            if err := notifier.UpdateThreshold(summary); err != nil {
                klog.InfoS("Eviction manager: failed to update notifier", "notifier", notifier.Description(), "err", err)
            }
        }
    }

    // make observations and get a function to derive pod usage stats relative to those observations.
    observations, statsFunc := makeSignalObservations(summary)
    debugLogObservations("observations", observations)

    // determine the set of thresholds met independent of grace period
    thresholds = thresholdsMet(thresholds, observations, false)
    debugLogThresholdsWithObservation("thresholds - ignoring grace period", thresholds, observations)

    // determine the set of thresholds previously met that have not yet satisfied the associated min-reclaim
    if len(m.thresholdsMet) > 0 {
        thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)
        thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)
    }
    debugLogThresholdsWithObservation("thresholds - reclaim not satisfied", thresholds, observations)

    // track when a threshold was first observed
    now := m.clock.Now()
    thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)

    // the set of node conditions that are triggered by currently observed thresholds
    nodeConditions := nodeConditions(thresholds)
    if len(nodeConditions) > 0 {
        klog.V(3).InfoS("Eviction manager: node conditions - observed", "nodeCondition", nodeConditions)
    }

    // track when a node condition was last observed
    nodeConditionsLastObservedAt := nodeConditionsLastObservedAt(nodeConditions, m.nodeConditionsLastObservedAt, now)

    // node conditions report true if it has been observed within the transition period window
    nodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)
    if len(nodeConditions) > 0 {
        klog.V(3).InfoS("Eviction manager: node conditions - transition period not met", "nodeCondition", nodeConditions)
    }

    // determine the set of thresholds we need to drive eviction behavior (i.e. all grace periods are met)
    thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)
    debugLogThresholdsWithObservation("thresholds - grace periods satisfied", thresholds, observations)

    // update internal state
    m.Lock()
    m.nodeConditions = nodeConditions
    m.thresholdsFirstObservedAt = thresholdsFirstObservedAt
    m.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt
    m.thresholdsMet = thresholds

    // determine the set of thresholds whose stats have been updated since the last sync
    thresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)
    debugLogThresholdsWithObservation("thresholds - updated stats", thresholds, observations)

    m.lastObservations = observations
    m.Unlock()

    // evict pods if there is a resource usage violation from local volume temporary storage
    // If eviction happens in localStorageEviction function, skip the rest of eviction action
    if m.localStorageCapacityIsolation {
        if evictedPods := m.localStorageEviction(activePods, statsFunc); len(evictedPods) > 0 {
            return evictedPods
        }
    }

    if len(thresholds) == 0 {
        klog.V(3).InfoS("Eviction manager: no resources are starved")
        return nil
    }

    // rank the thresholds by eviction priority
    sort.Sort(byEvictionPriority(thresholds))
    thresholdToReclaim, resourceToReclaim, foundAny := getReclaimableThreshold(thresholds)
    if !foundAny {
        return nil
    }
    klog.InfoS("Eviction manager: attempting to reclaim", "resourceName", resourceToReclaim)

    // record an event about the resources we are now attempting to reclaim via eviction
    m.recorder.Eventf(m.nodeRef, v1.EventTypeWarning, "EvictionThresholdMet", "Attempting to reclaim %s", resourceToReclaim)

    // check if there are node-level resources we can reclaim to reduce pressure before evicting end-user pods.
    if m.reclaimNodeLevelResources(ctx, thresholdToReclaim.Signal, resourceToReclaim) {
        klog.InfoS("Eviction manager: able to reduce resource pressure without evicting pods.", "resourceName", resourceToReclaim)
        return nil
    }

    klog.InfoS("Eviction manager: must evict pod(s) to reclaim", "resourceName", resourceToReclaim)

    // rank the pods for eviction
    rank, ok := m.signalToRankFunc[thresholdToReclaim.Signal]
    if !ok {
        klog.ErrorS(nil, "Eviction manager: no ranking function for signal", "threshold", thresholdToReclaim.Signal)
        return nil
    }

    // the only candidates viable for eviction are those pods that had anything running.
    if len(activePods) == 0 {
        klog.ErrorS(nil, "Eviction manager: eviction thresholds have been met, but no pods are active to evict")
        return nil
    }

    // rank the running pods for eviction for the specified resource
    rank(activePods, statsFunc)

    klog.InfoS("Eviction manager: pods ranked for eviction", "pods", klog.KObjSlice(activePods))

    //record age of metrics for met thresholds that we are using for evictions.
    for _, t := range thresholds {
        timeObserved := observations[t.Signal].time
        if !timeObserved.IsZero() {
            metrics.EvictionStatsAge.WithLabelValues(string(t.Signal)).Observe(metrics.SinceInSeconds(timeObserved.Time))
        }
    }

    // we kill at most a single pod during each eviction interval
    for i := range activePods {
        pod := activePods[i]
        gracePeriodOverride := int64(0)
        if !isHardEvictionThreshold(thresholdToReclaim) {
            gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
        }
        message, annotations := evictionMessage(resourceToReclaim, pod, statsFunc, thresholds, observations)
        var condition *v1.PodCondition
        if utilfeature.DefaultFeatureGate.Enabled(features.PodDisruptionConditions) {
            condition = &v1.PodCondition{
                Type:    v1.DisruptionTarget,
                Status:  v1.ConditionTrue,
                Reason:  v1.PodReasonTerminationByKubelet,
                Message: message,
            }
        }
        if m.evictPod(pod, gracePeriodOverride, message, annotations, condition) {
            metrics.Evictions.WithLabelValues(string(thresholdToReclaim.Signal)).Inc()
            return []*v1.Pod{pod}
        }
    }
    klog.InfoS("Eviction manager: unable to evict any pods from the node")
    return nil
}

evictPod()

To evict a pod, the kubelet calls the evictPod() method. Let's look at its logic.

Its implementation actually delegates to the killPodFunc() function, which we will look at next.

func (m *managerImpl) evictPod(pod *v1.Pod, gracePeriodOverride int64, evictMsg string, annotations map[string]string, condition *v1.PodCondition) bool {
    // If the pod is marked as critical and static, and support for critical pod annotations is enabled,
    // do not evict such pods. Static pods are not re-admitted after evictions.
    // https://github.com/kubernetes/kubernetes/issues/40573 has more details.
    if kubelettypes.IsCriticalPod(pod) {
        klog.ErrorS(nil, "Eviction manager: cannot evict a critical pod", "pod", klog.KObj(pod))
        return false
    }
    // record that we are evicting the pod
    m.recorder.AnnotatedEventf(pod, annotations, v1.EventTypeWarning, Reason, evictMsg)
    // this is a blocking call and should only return when the pod and its containers are killed.
    klog.V(3).InfoS("Evicting pod", "pod", klog.KObj(pod), "podUID", pod.UID, "message", evictMsg)
    err := m.killPodFunc(pod, true, &gracePeriodOverride, func(status *v1.PodStatus) {
        status.Phase = v1.PodFailed
        status.Reason = Reason
        status.Message = evictMsg
        if condition != nil {
            podutil.UpdatePodCondition(status, condition)
        }
    })
    if err != nil {
        klog.ErrorS(err, "Eviction manager: pod failed to evict", "pod", klog.KObj(pod))
    } else {
        klog.InfoS("Eviction manager: pod is evicted successfully", "pod", klog.KObj(pod))
    }
    return true
}

killPodFunc()

This function is handed to the eviction manager by the kubelet when the manager is constructed.

It lives in pkg/kubelet/pod_workers.go.

The actual logic is a call to podWorkers.UpdatePod(), which issues a kill update for the pod. That triggers syncTerminatingPod() to kill the pod, the statusManager sets the pod's status, and the pod then moves into the next phase of processing, where its resources are released.

// killPodNow returns a KillPodFunc that can be used to kill a pod.
// It is intended to be injected into other modules that need to kill a pod.
func killPodNow(podWorkers PodWorkers, recorder record.EventRecorder) eviction.KillPodFunc {
    return func(pod *v1.Pod, isEvicted bool, gracePeriodOverride *int64, statusFn func(*v1.PodStatus)) error {
        // determine the grace period to use when killing the pod
        gracePeriod := int64(0)
        if gracePeriodOverride != nil {
            gracePeriod = *gracePeriodOverride
        } else if pod.Spec.TerminationGracePeriodSeconds != nil {
            gracePeriod = *pod.Spec.TerminationGracePeriodSeconds
        }

        // we timeout and return an error if we don't get a callback within a reasonable time.
        // the default timeout is relative to the grace period (we settle on 10s to wait for kubelet->runtime traffic to complete in sigkill)
        timeout := int64(gracePeriod + (gracePeriod / 2))
        minTimeout := int64(10)
        if timeout < minTimeout {
            timeout = minTimeout
        }
        timeoutDuration := time.Duration(timeout) * time.Second

        // open a channel we block against until we get a result
        ch := make(chan struct{}, 1)
        podWorkers.UpdatePod(UpdatePodOptions{
            Pod:        pod,
            UpdateType: kubetypes.SyncPodKill,
            KillPodOptions: &KillPodOptions{
                CompletedCh:                              ch,
                Evict:                                    isEvicted,
                PodStatusFunc:                            statusFn,
                PodTerminationGracePeriodSecondsOverride: gracePeriodOverride,
            },
        })

        // wait for either a response, or a timeout
        select {
        case <-ch:
            return nil
        case <-time.After(timeoutDuration):
            recorder.Eventf(pod, v1.EventTypeWarning, events.ExceededGracePeriod, "Container runtime did not kill the pod within specified grace period.")
            return fmt.Errorf("timeout waiting to kill pod")
        }
    }
}

notifier.Start()

  1. notifier.Start() is the least intuitive part here; the call chain takes some untangling.

  2. At the end of the UpdateThreshold() logic mentioned earlier, there is a call to m.notifier.Start(m.events). That starts the loop that listens for kernel events: it is the producer for the consumer below, and the two sides communicate over a channel.

  3. notifier.Start() then consumes those kernel events, and runs synchronize() whenever one is detected.

  4. It checks for new events every 10s, using epoll to wait on the eventfd mentioned above; when a kernel event arrives, it means memory usage has crossed the threshold (see the sketch after this list).
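
Item 2 relies on the cgroup v1 memcg threshold mechanism. Here is a minimal sketch of the producer-side registration, which is essentially what NewCgroupNotifier in threshold_notifier_linux.go boils down to (the cgroup path here is illustrative): write "<eventfd> <watchfd> <threshold>" into cgroup.event_control, and the kernel signals the eventfd whenever memory.usage_in_bytes crosses the threshold.

//go:build linux

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// registerMemcgThreshold wires up a cgroup v1 memcg usage threshold.
// cgroupPath is e.g. "/sys/fs/cgroup/memory" (illustrative); threshold is in
// bytes. It returns the eventfd that epoll can then wait on, exactly as
// linuxCgroupNotifier.Start() does below.
func registerMemcgThreshold(cgroupPath string, threshold int64) (int, error) {
	// fd of the attribute we want the kernel to watch
	watchfd, err := unix.Open(cgroupPath+"/memory.usage_in_bytes", unix.O_RDONLY|unix.O_CLOEXEC, 0)
	if err != nil {
		return -1, err
	}
	defer unix.Close(watchfd)

	// eventfd the kernel will signal when the threshold is crossed
	eventfd, err := unix.Eventfd(0, unix.EFD_CLOEXEC)
	if err != nil {
		return -1, err
	}

	// register "<eventfd> <watchfd> <threshold>" with the cgroup
	controlfd, err := unix.Open(cgroupPath+"/cgroup.event_control", unix.O_WRONLY|unix.O_CLOEXEC, 0)
	if err != nil {
		unix.Close(eventfd)
		return -1, err
	}
	defer unix.Close(controlfd)

	config := fmt.Sprintf("%d %d %d", eventfd, watchfd, threshold)
	if _, err := unix.Write(controlfd, []byte(config)); err != nil {
		unix.Close(eventfd)
		return -1, err
	}
	return eventfd, nil
}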

The source of notifier.Start() is in pkg/kubelet/eviction/memory_threshold_notifier.go.

// Start here consumes the kernel events produced by the cgroup notifier
func (m *memoryThresholdNotifier) Start() {
    klog.InfoS("Eviction manager: created memoryThresholdNotifier", "notifier", m.Description())
    for range m.events {
        m.handler(fmt.Sprintf("eviction manager: %s crossed", m.Description()))
    }
}
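
The m.handler call above is the thresholdHandler closure from Start(), i.e. a synchronize() run. The events themselves are armed by memoryThresholdNotifier.UpdateThreshold(), which was referenced earlier but not shown; here is its core, condensed from memory_threshold_notifier.go (error handling and the allocatable-memory branch omitted). The arithmetic exists because memcg thresholds fire on usage, while the eviction threshold is defined on the working set; since usage = working_set + inactive_file, the kernel threshold is set to capacity - eviction_threshold + inactive_file.

// condensed: how UpdateThreshold (re-)arms the cgroup notifier
func (m *memoryThresholdNotifier) UpdateThreshold(summary *statsapi.Summary) error {
    memoryStats := summary.Node.Memory
    // usage = working_set + inactive_file, so to get an event when
    // working_set = capacity - eviction_threshold, watch usage at
    // capacity - eviction_threshold + inactive_file
    inactiveFile := resource.NewQuantity(int64(*memoryStats.UsageBytes-*memoryStats.WorkingSetBytes), resource.BinarySI)
    capacity := resource.NewQuantity(int64(*memoryStats.AvailableBytes+*memoryStats.WorkingSetBytes), resource.BinarySI)
    evictionThresholdQuantity := evictionapi.GetThresholdQuantity(m.threshold.Value, capacity)
    memcgThreshold := capacity.DeepCopy()
    memcgThreshold.Sub(*evictionThresholdQuantity)
    memcgThreshold.Add(*inactiveFile)

    // replace the old notifier and start producing into m.events:
    // this is the m.notifier.Start(m.events) call mentioned in item 2 above
    if m.notifier != nil {
        m.notifier.Stop()
    }
    newNotifier, err := NewCgroupNotifier(m.cgroupPath, memoryUsageAttribute, memcgThreshold.Value())
    if err != nil {
        return err
    }
    m.notifier = newNotifier
    go m.notifier.Start(m.events)
    return nil
}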

The source of m.notifier.Start(m.events) is in pkg/kubelet/eviction/threshold_notifier_linux.go.

func (n *linuxCgroupNotifier) Start(eventCh chan<- struct{}) {
    err := unix.EpollCtl(n.epfd, unix.EPOLL_CTL_ADD, n.eventfd, &unix.EpollEvent{
        Fd:     int32(n.eventfd),
        Events: unix.EPOLLIN,
    })
    if err != nil {
        klog.InfoS("Eviction manager: error adding epoll eventfd", "err", err)
        return
    }
    buf := make([]byte, eventSize)
    for {
        select {
        case <-n.stop:
            return
        default:
        }
        event, err := wait(n.epfd, n.eventfd, notifierRefreshInterval)
        if err != nil {
            klog.InfoS("Eviction manager: error while waiting for memcg events", "err", err)
            return
        } else if !event {
            // Timeout on wait.  This is expected if the threshold was not crossed
            continue
        }
        // Consume the event from the eventfd
        _, err = unix.Read(n.eventfd, buf)
        if err != nil {
            klog.InfoS("Eviction manager: error reading memcg events", "err", err)
            return
        }
        eventCh <- struct{}{}
    }
}

waitForPodsCleanup()

After synchronize() returns evicted pods, the manager waits for them to be cleaned up. Let's see what waitForPodsCleanup() actually checks.

  1. It calls podCleanedUpFunc() to check each pod.

  2. podCleanedUpFunc (kl.podResourcesAreReclaimed) fetches the pod's status from the kubelet's statusManager, falling back to the pod object's own status if none is cached, and then calls PodResourcesAreReclaimed() for the actual checks.

  3. The following checks are made on the pod:

    • whether any containers are still running;

    • whether its volumes still exist;

    • whether the pod's cgroup sandbox has been cleaned up.

  4. If podCleanedUpFunc() returns false, cleanup has not finished, so the inner loop breaks and waits for the next tick (every 1s by default), until either every pod passes the check or the 30s timeout fires, at which point waitForPodsCleanup() returns.

func (m *managerImpl) waitForPodsCleanup(podCleanedUpFunc PodCleanedUpFunc, pods []*v1.Pod) {
    timeout := m.clock.NewTimer(podCleanupTimeout)
    defer timeout.Stop()
    ticker := m.clock.NewTicker(podCleanupPollFreq)
    defer ticker.Stop()
    for {
        select {
        case <-timeout.C():
            klog.InfoS("Eviction manager: timed out waiting for pods to be cleaned up", "pods", klog.KObjSlice(pods))
            return
        case <-ticker.C():
            for i, pod := range pods {
                if !podCleanedUpFunc(pod) {
                    break
                }
                if i == len(pods)-1 {
                    klog.InfoS("Eviction manager: pods successfully cleaned up", "pods", klog.KObjSlice(pods))
                    return
                }
            }
        }
    }
}


func (kl *Kubelet) PodResourcesAreReclaimed(pod *v1.Pod, status v1.PodStatus) bool {
    if kl.podWorkers.CouldHaveRunningContainers(pod.UID) {
        // We shouldn't delete pods that still have running containers
        klog.V(3).InfoS("Pod is terminated, but some containers are still running", "pod", klog.KObj(pod))
        return false
    }
    if count := countRunningContainerStatus(status); count > 0 {
        // We shouldn't delete pods until the reported pod status contains no more running containers (the previous
        // check ensures no more status can be generated, this check verifies we have seen enough of the status)
        klog.V(3).InfoS("Pod is terminated, but some container status has not yet been reported", "pod", klog.KObj(pod), "running", count)
        return false
    }
    if kl.podVolumesExist(pod.UID) && !kl.keepTerminatedPodVolumes {
        // We shouldn't delete pods whose volumes have not been cleaned up if we are not keeping terminated pod volumes
        klog.V(3).InfoS("Pod is terminated, but some volumes have not been cleaned up", "pod", klog.KObj(pod))
        return false
    }
    if kl.kubeletConfiguration.CgroupsPerQOS {
        pcm := kl.containerManager.NewPodContainerManager()
        if pcm.Exists(pod) {
            klog.V(3).InfoS("Pod is terminated, but pod cgroup sandbox has not been cleaned up", "pod", klog.KObj(pod))
            return false
        }
    }

    // Note: we leave pod containers to be reclaimed in the background since dockershim requires the
    // container for retrieving logs and we want to make sure logs are available until the pod is
    // physically deleted.

    klog.V(3).InfoS("Pod is terminated and all resources are reclaimed", "pod", klog.KObj(pod))
    return true
}

// podResourcesAreReclaimed simply calls PodResourcesAreReclaimed with the most up-to-date status.
func (kl *Kubelet) podResourcesAreReclaimed(pod *v1.Pod) bool {
    status, ok := kl.statusManager.GetPodStatus(pod.UID)
    if !ok {
        status = pod.Status
    }
    return kl.PodResourcesAreReclaimed(pod, status)
}

Summary

  1. The kubelet eviction source is quite hard to read: it weaves through function calls in many places, the functions take lots of parameters, and most of them are custom struct types that come with their own methods, so following the flow is brain-bending and awkward to take notes on.

  2. It also needs to be read together with the podworker code, because the actual kill is carried out through the pod workers.

  3. There is also logic checking whether a pod IsStaticPod (a static pod), IsMirrorPod (a mirror pod), or IsCriticalPodBasedOnPriority (a critical pod). If you don't want a pod to be evicted, you can make it a critical pod so the eviction manager skips it (see the sketch below).
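
For reference, the critical-pod check is condensed below (from pkg/kubelet/types/pod_update.go; the priority cutoff comes from pkg/apis/scheduling). In practice, giving a pod priorityClassName: system-node-critical or system-cluster-critical is enough to clear the SystemCriticalPriority bar, and evictPod() above then refuses to evict it.

// condensed from pkg/kubelet/types/pod_update.go (release-1.26)
func IsCriticalPod(pod *v1.Pod) bool {
    if IsStaticPod(pod) {
        return true
    }
    if IsMirrorPod(pod) {
        return true
    }
    if pod.Spec.Priority != nil && IsCriticalPodBasedOnPriority(*pod.Spec.Priority) {
        return true
    }
    return false
}

// SystemCriticalPriority is 2000000000; the built-in system-node-critical
// and system-cluster-critical PriorityClasses are at or above it.
func IsCriticalPodBasedOnPriority(priority int32) bool {
    return priority >= scheduling.SystemCriticalPriority
}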