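Load Balance - active load balance

The previous article in this series analyzed periodic load balance, mainly tracing its code framework and flow. Inside load_balance, several task-migration attempts are made; if they all fail, the code decides whether a more aggressive form of balancing is needed.

Active load balance is one of those more aggressive forms. As briefly introduced earlier, it works by pushing a task from a heavily loaded CPU to a lightly loaded one.

Let's look at the actual code, which is based on kernel-5.4 (the excerpts include the CONFIG_SCHED_WALT blocks). My knowledge is limited, so mistakes are inevitable; corrections are welcome.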

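1. Trigger conditions

As covered in the previous article, one call path is load_balance calling need_active_balance to decide and trigger:

(1-1) Decide whether active balance is needed (return 1 means active load balance is needed)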

static int need_active_balance(struct lb_env *env)
{
    struct sched_domain *sd = env->sd;

    if (voluntary_active_balance(env))        //(1-1-1) check the conditions for voluntarily triggering active balance
        return 1;

    if ((env->idle != CPU_NOT_IDLE) &&                                //dst cpu (the cpu running load_balance) is idle
        (capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) &&    //src cpu_capacity < dst cpu_capacity
        ((capacity_orig_of(env->src_cpu) <                            //and src cpu_capacity_orig < dst cpu_capacity_orig
                capacity_orig_of(env->dst_cpu))) &&
                env->src_rq->cfs.h_nr_running == 1 &&                //and src cpu has exactly one cfs task
                cpu_overutilized(env->src_cpu) &&                    //and src cpu is overutilized
                !cpu_overutilized(env->dst_cpu)) {                    //and dst cpu is not overutilized
        return 1;
    }

    if (env->src_grp_type == group_overloaded && env->src_rq->misfit_task_load)    //src cpu's group is overloaded and src rq has a misfit task load
        return 1;

    return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);    //this sd's balance failure count > sd->cache_nice_tries+2
}

(1-1-1) Check the conditions for voluntarily triggering active balance

static inline bool
voluntary_active_balance(struct lb_env *env)
{
    struct sched_domain *sd = env->sd;

    if (asym_active_balance(env))    //the author's platform has no SMT level and hence no SD_ASYM_PACKING flag, so this is always false there
        return 1;

    /*
     * The dst_cpu is idle and the src_cpu CPU has only 1 CFS task.
     * It's worth migrating the task if the src_cpu's capacity is reduced
     * because of other sched_class or IRQs if more capacity stays
     * available on dst_cpu.
     */
    if ((env->idle != CPU_NOT_IDLE) &&                //dst cpu is idle
        (env->src_rq->cfs.h_nr_running == 1)) {        //src cpu has exactly one cfs task
        if ((check_cpu_capacity(env->src_rq, sd)) &&    //src cpu_capacity * sd->imbalance_pct < src cpu_capacity_orig * 100
            (capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100))    //src cpu_capacity * sd->imbalance_pct < dst cpu_capacity * 100
            return 1;
    }

    if (env->idle != CPU_NOT_IDLE &&                //dst cpu is idle
            env->src_grp_type == group_misfit_task)    //and the src cpu's group type is group_misfit_task
        return 1;
        return 1;

    return 0;
}
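
The two capacity inequalities in the middle branch are easy to try out with plain numbers. The following is a minimal userspace sketch, not kernel code: the toy_* helper names are made up for illustration, and the only thing taken from the source is the arithmetic described in the comments above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: the inequality the comments attribute to
 * check_cpu_capacity(): cap * imbalance_pct < cap_orig * 100. */
bool toy_capacity_reduced(unsigned long cap, unsigned long cap_orig,
                          unsigned int imbalance_pct)
{
    assert(imbalance_pct >= 100);   /* toy precondition, not a kernel rule */
    return cap * imbalance_pct < cap_orig * 100;
}

/* Hypothetical helper: src_cap * imbalance_pct < dst_cap * 100. */
bool toy_dst_has_headroom(unsigned long src_cap, unsigned long dst_cap,
                          unsigned int imbalance_pct)
{
    return src_cap * imbalance_pct < dst_cap * 100;
}

/* Simplified model of the middle branch of voluntary_active_balance():
 * dst is idle, src runs exactly one cfs task, and both inequalities hold. */
bool toy_worth_pushing(bool dst_idle, unsigned int src_cfs_tasks,
                       unsigned long src_cap, unsigned long src_cap_orig,
                       unsigned long dst_cap, unsigned int imbalance_pct)
{
    if (!dst_idle || src_cfs_tasks != 1)
        return false;
    return toy_capacity_reduced(src_cap, src_cap_orig, imbalance_pct) &&
           toy_dst_has_headroom(src_cap, dst_cap, imbalance_pct);
}
```

Assuming an imbalance_pct of 117 (an illustrative value), a src cpu whose capacity has dropped to 800 out of 1024 passes both tests against an idle dst cpu of capacity 1024, because 800 * 117 = 93600 is below 102400. At a src capacity of 900 the product is 105300, so the first test fails and no push is requested by this branch.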

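With the trigger conditions in mind, let's take another close look at load_balance:

At more_balance, if the busiest rq has at most 1 task, or is already doing active balance and has at most 2 tasks, no migration is attempted and the code jumps to no_move
At no_move, if no task has been migrated after several rounds of attempts, active balance may be needed
need_active_balance decides whether active balance is needed and whether its conditions are met
After a few more filter conditions, the work is queued on the cpu stopper (stop class) and active balance runs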

/*
 * Check this_cpu to ensure it is balanced within domain. Attempt to move
 * tasks if there is an imbalance.
 */
static int load_balance(int this_cpu, struct rq *this_rq,
            struct sched_domain *sd, enum cpu_idle_type idle,
            int *continue_balancing)
{
    int ld_moved = 0, cur_ld_moved, active_balance = 0;
。。。
more_balance:
。。。
        /*
         * The world might have changed. Validate assumptions.
         * And also, if the busiest cpu is undergoing active_balance,
         * it doesn't need help if it has less than 2 tasks on it.
         */

        if (busiest->nr_running <= 1 ||                                    //busiest rq has at most 1 task
            (busiest->active_balance && busiest->nr_running <= 2)) {       //or (busiest rq is already in active balance and has <= 2 tasks)
            rq_unlock_irqrestore(busiest, &rf);
            env.flags &= ~LBF_ALL_PINNED;                //clear the LBF_ALL_PINNED flag
            goto no_move;                                //skip migration and jump to no_move
        }
。。。

no_move:
    if (!ld_moved) {                //after several rounds of attempts ld_moved (tasks migrated) is still 0: balance failed
。。。
        if (need_active_balance(&env)) {                //(1-1) decide whether active balance is needed
。。。
            /*
             * ->active_balance synchronizes accesses to
             * ->active_balance_work.  Once set, it's cleared
             * only after active load balance is finished.
             */
            if (!busiest->active_balance &&                //busiest is not already in active balance
                !cpu_isolated(cpu_of(busiest))) {    //and busiest rq's cpu is not isolated
                busiest->active_balance = 1;            //mark busiest rq as being in active balance
                busiest->push_cpu = this_cpu;            //the push target is the current cpu (this_cpu)
                active_balance = 1;                        //also set active_balance = 1: the active balance work is about to be queued
                mark_reserved(this_cpu);                //mark this cpu as reserved
            }
            raw_spin_unlock_irqrestore(&busiest->lock, flags);

            if (active_balance) {
                stop_one_cpu_nowait(cpu_of(busiest),                        //(1-3) queue the work on busiest's cpu stopper (its thread runs in the stop class)
                    active_load_balance_cpu_stop, busiest,                    //(2-1) the work function is active_load_balance_cpu_stop
                    &busiest->active_balance_work);
                *continue_balancing = 0;              //once active balance is kicked, stop balancing further: clear continue_balancing
            }

            /* We've kicked active balancing, force task migration. */
            sd->nr_balance_failed = sd->cache_nice_tries +                //set the failure count to cache_nice_tries + 10 - 1
                    NEED_ACTIVE_BALANCE_THRESHOLD - 1;
        }
    } else
        sd->nr_balance_failed = 0;        //load balance migrated something, so clear the failure count

    if (likely(!active_balance)) {            //active balance was not kicked in this call
        /* We were unbalanced, so reset the balancing interval */
        sd->balance_interval = sd->min_interval;        //reset the balance interval to min_interval
    } else {
        /*
         * If we've begun active balancing, start to back off. This
         * case may not be covered by the all_pinned logic if there
         * is only 1 task on the busy runqueue (because we don't call
         * detach_tasks).
         */
        if (sd->balance_interval < sd->max_interval)    //balance interval < max_interval
            sd->balance_interval *= 2;                    //double the balance interval
    }
。。。
out:
    trace_sched_load_balance(this_cpu, idle, *continue_balancing,
                 group ? group->cpumask[0] : 0,
                 busiest ? busiest->nr_running : 0,
                 env.imbalance, env.flags, ld_moved,
                 sd->balance_interval, active_balance,
                 sd_overutilized(sd), env.prefer_spread);
    return ld_moved;
}
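
The failure-count fallback at the end of need_active_balance and the interval back-off at the end of load_balance can be modelled in a few lines. This is a hedged userspace sketch: struct toy_sd and both functions are hypothetical stand-ins that copy only the two expressions shown above, and the field values in the usage note are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few sched_domain fields used above. */
struct toy_sd {
    unsigned int nr_balance_failed;
    unsigned int cache_nice_tries;
    unsigned long balance_interval;
    unsigned long min_interval;
    unsigned long max_interval;
};

/* Mirrors the final return of need_active_balance(). */
bool toy_failure_threshold_hit(const struct toy_sd *sd)
{
    return sd->nr_balance_failed > sd->cache_nice_tries + 2;
}

/* Mirrors the interval update at the end of load_balance():
 * reset when active balance was not kicked, otherwise back off. */
void toy_update_interval(struct toy_sd *sd, bool active_balance)
{
    assert(sd->min_interval <= sd->max_interval);   /* toy precondition */
    if (!active_balance)
        sd->balance_interval = sd->min_interval;
    else if (sd->balance_interval < sd->max_interval)
        sd->balance_interval *= 2;
}
```

With cache_nice_tries = 1, the threshold is crossed only once nr_balance_failed reaches 4. Starting from an interval of 8 with max_interval = 32, repeated kicks give 16, then 32, and then stay at 32; a call without a kick resets the interval to min_interval.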

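The above is the first path, in which active balance is triggered from load_balance. The other path is triggered from the periodic scheduler tick:

scheduler_tick
  -> check_for_migration
    -> stop_one_cpu_nowait

(1-2) scheduler_tick calls check_for_migration when curr belongs to the fair class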

/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 */
void scheduler_tick(void)
{
。。。
    if (curr->sched_class == &fair_sched_class)
        check_for_migration(rq, curr);

#ifdef CONFIG_SMP
    rq_lock(rq, &rf);
    if (idle_cpu(cpu) && is_reserved(cpu) && !rq->active_balance)
        clear_reserved(cpu);
    rq_unlock(rq, &rf);
#endif
。。。
}

(1-2-1) For the misfit-task case, decide specifically whether active balance is needed

void check_for_migration(struct rq *rq, struct task_struct *p)
{
    int active_balance;
    int new_cpu = -1;
    int prev_cpu = task_cpu(p);
    int ret;

    if (rq->misfit_task_load) {                    //the rq has a misfit task
        if (rq->curr->state != TASK_RUNNING ||    //curr on this rq is not in the running state
            rq->curr->nr_cpus_allowed == 1)        //or curr's affinity (e.g. via cpuset) allows only 1 cpu
            return;

        if (walt_rotation_enabled) {            //WALT rotation is enabled; per the author this requires sched_boost to be inactive, depends on the WALT rotation sysctl, and needs more big tasks than there are cpus
            raw_spin_lock(&migration_lock);
            walt_check_for_rotation(rq);        //tries to find a dst cpu and swap this cpu's curr with the dst cpu's curr; analysis of this code is left for a later article
            raw_spin_unlock(&migration_lock);
            return;
        }

        raw_spin_lock(&migration_lock);
        rcu_read_lock();
        new_cpu = find_energy_efficient_cpu(p, prev_cpu, 0, 1);        //use EAS to try to pick a new cpu to place curr on
        rcu_read_unlock();
        if ((new_cpu >= 0) && (new_cpu != prev_cpu) &&                        //a new cpu was found and it is not prev_cpu
            (capacity_orig_of(new_cpu) > capacity_orig_of(prev_cpu))) {        //and new_cpu's capacity_orig > prev_cpu's
            active_balance = kick_active_balance(rq, p, new_cpu);            //(1-2-1-1) check the condition and claim active balance
            if (active_balance) {
                mark_reserved(new_cpu);                            //mark new_cpu as CPU_RESERVED
                raw_spin_unlock(&migration_lock);
                ret = stop_one_cpu_nowait(prev_cpu,                //(1-3) queue the work on prev_cpu's cpu stopper (its thread runs in the stop class)
                    active_load_balance_cpu_stop, rq,            //(2-1) the work function is active_load_balance_cpu_stop
                    &rq->active_balance_work);
                if (!ret)
                    clear_reserved(new_cpu);    //queueing the work failed: clear new_cpu's CPU_RESERVED flag
                else
                    wake_up_if_idle(new_cpu);    //the work was queued: if new_cpu is running the idle task, wake it with an IPI
                return;
            }
        }
        raw_spin_unlock(&migration_lock);
    }
}
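
The guard in front of kick_active_balance boils down to three tests on the cpu chosen by find_energy_efficient_cpu. A minimal sketch follows, assuming a hypothetical per-cpu capacity table in place of capacity_orig_of; the function name and the table are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the guard in check_for_migration(): kick only if a
 * different cpu was found and its original capacity is strictly larger. */
bool toy_should_kick(int new_cpu, int prev_cpu,
                     const unsigned long *cap_orig, size_t nr_cpus)
{
    assert(prev_cpu >= 0 && (size_t)prev_cpu < nr_cpus);

    if (new_cpu < 0 || (size_t)new_cpu >= nr_cpus)
        return false;               /* no candidate was found */
    if (new_cpu == prev_cpu)
        return false;               /* already on the chosen cpu */
    return cap_orig[new_cpu] > cap_orig[prev_cpu];
}
```

On an assumed 4+4 layout with capacities 512 and 1024, a misfit task on cpu 1 is pushed to cpu 5, but not to cpu 2 (same capacity), and a task on cpu 5 is never pushed down to cpu 1 by this path.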

(1-2-1-1) Check the condition and claim active balance

int
kick_active_balance(struct rq *rq, struct task_struct *p, int new_cpu)
{
    unsigned long flags;
    int rc = 0;

    /* Invoke active balance to force migrate currently running task */
    raw_spin_lock_irqsave(&rq->lock, flags);
    if (!rq->active_balance) {        //the rq is not already in active balance
        rq->active_balance = 1;        //set the flag
        rq->push_cpu = new_cpu;        //the push target cpu is new_cpu
        get_task_struct(p);
        rq->wrq.push_task = p;        //the task to push is p (in practice curr)
        rc = 1;
    }
    raw_spin_unlock_irqrestore(&rq->lock, flags);

    return rc;
}
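
kick_active_balance is a claim operation: only the caller that flips active_balance from 0 to 1 may record push_cpu and queue the stopper work. The sketch below shows the same idea in userspace, with one deliberate substitution: a C11 atomic exchange takes the place of the rq->lock critical section used by the kernel. All names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two rq fields involved in the claim. */
struct toy_rq {
    atomic_int active_balance;
    int push_cpu;
};

/* Returns true only for the caller that wins the claim. The kernel does
 * this under rq->lock; an atomic exchange is swapped in for this sketch. */
bool toy_kick_active_balance(struct toy_rq *rq, int new_cpu)
{
    assert(new_cpu >= 0);
    if (atomic_exchange(&rq->active_balance, 1) != 0)
        return false;               /* someone else already claimed it */
    rq->push_cpu = new_cpu;
    return true;
}

/* Models the stopper callback clearing the flag when it is done. */
void toy_finish_active_balance(struct toy_rq *rq)
{
    atomic_store(&rq->active_balance, 0);
}
```

A second kick while the first is pending is rejected and leaves push_cpu untouched; after toy_finish_active_balance the rq can be claimed again.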

(1-3) Queue the active_load_balance work on the cpu stopper

/**
 * stop_one_cpu_nowait - stop a cpu but don't wait for completion
 * @cpu: cpu to stop
 * @fn: function to execute
 * @arg: argument to @fn
 * @work_buf: pointer to cpu_stop_work structure
 *
 * Similar to stop_one_cpu() but doesn't wait for completion.  The
 * caller is responsible for ensuring @work_buf is currently unused
 * and will remain untouched until stopper starts executing @fn.
 *
 * CONTEXT:
 * Don't care.
 *
 * RETURNS:
 * true if cpu_stop_work was queued successfully and @fn will be called,
 * false otherwise.
 */
bool stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
            struct cpu_stop_work *work_buf)
{
    *work_buf = (struct cpu_stop_work){ .fn = fn, .arg = arg, };    //fill in the work's fn (active_load_balance_cpu_stop) and arg (busiest)
    return cpu_stop_queue_work(cpu, work_buf);   //(1-3-1) queue the work if the stopper is ready
}

(1-3-1) Check whether the stopper kernel thread has finished initializing and is ready to run; if so, queue the work

/* queue @work to @stopper.  if offline, @work is completed immediately */
static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
{
    struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
    DEFINE_WAKE_Q(wakeq);
    unsigned long flags;
    bool enabled;

    preempt_disable();
    raw_spin_lock_irqsave(&stopper->lock, flags);
    enabled = stopper->enabled;                //set to true once the stopper has been initialized and unparked
    if (enabled)
        __cpu_stop_queue_work(stopper, work, &wakeq);  //(2-1) queue the work and wake the stopper thread; work->fn is active_load_balance_cpu_stop, so that function eventually runs with arg (busiest)
    else if (work->done)
        cpu_stop_signal_done(work->done);
    raw_spin_unlock_irqrestore(&stopper->lock, flags);

    wake_up_q(&wakeq);
    preempt_enable();

    return enabled;
}
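
The contract of cpu_stop_queue_work, as used by the callers above, is that the return value says whether the callback will ever run. Here is a small single-threaded sketch of that contract, assuming a hypothetical toy_stopper with a plain linked list; locking, wake-ups and the work->done completion are left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_work {
    int (*fn)(void *arg);
    void *arg;
    struct toy_work *next;
};

/* Hypothetical stand-in for the per-cpu cpu_stopper. */
struct toy_stopper {
    bool enabled;               /* set once the stopper thread is unparked */
    struct toy_work *head;
    struct toy_work *tail;
};

/* Returns true if the work was queued and fn will be called later. */
bool toy_queue_work(struct toy_stopper *s, struct toy_work *w,
                    int (*fn)(void *), void *arg)
{
    assert(fn != NULL);
    w->fn = fn;
    w->arg = arg;
    w->next = NULL;
    if (!s->enabled)
        return false;           /* stopper not ready: nothing will run */
    if (s->tail)
        s->tail->next = w;
    else
        s->head = w;
    s->tail = w;
    return true;
}

/* Models the stopper thread draining its list; returns how many ran. */
int toy_run_pending(struct toy_stopper *s)
{
    int ran = 0;
    while (s->head) {
        struct toy_work *w = s->head;
        s->head = w->next;
        if (!s->head)
            s->tail = NULL;
        w->fn(w->arg);
        ran++;
    }
    return ran;
}

/* Sample callback for experiments: counts its invocations. */
int toy_count_call(void *arg)
{
    ++*(int *)arg;
    return 0;
}
```

This is why check_for_migration clears the reservation on new_cpu when the call returns false: with the stopper disabled, the callback that would normally clear it never runs.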

Here is a brief look at how the cpu stop work is initialized and registered:

static struct smp_hotplug_thread cpu_stop_threads = {
    .store            = &cpu_stopper.thread,
    .thread_should_run    = cpu_stop_should_run,
    .thread_fn        = cpu_stopper_thread,
    .thread_comm        = "migration/%u",
    .create            = cpu_stop_create,
    .park            = cpu_stop_park,
    .selfparking        = true,
};

static int __init cpu_stop_init(void)
{
    unsigned int cpu;

    for_each_possible_cpu(cpu) {                        //initialize the per-cpu cpu_stopper structure for every possible cpu
        struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);

        raw_spin_lock_init(&stopper->lock);
        INIT_LIST_HEAD(&stopper->works);
    }

    BUG_ON(smpboot_register_percpu_thread(&cpu_stop_threads));    //create the cpu stop thread for every online cpu (threads stay parked because selfparking = true)
    stop_machine_unpark(raw_smp_processor_id());            //unpark the stop thread of the current cpu
    stop_machine_initialized = true;
    return 0;
}
early_initcall(cpu_stop_init);

Create the cpu stop thread for every online cpu (the threads stay parked until stop_machine_unpark runs)

/**
 * smpboot_register_percpu_thread - Register a per_cpu thread related
 *                         to hotplug
 * @plug_thread:    Hotplug thread descriptor
 *
 * Creates and starts the threads on all online cpus.
 */
int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
{
    unsigned int cpu;
    int ret = 0;

    get_online_cpus();
    mutex_lock(&smpboot_threads_lock);
    for_each_online_cpu(cpu) {                    //iterate over every online cpu
        ret = __smpboot_create_thread(plug_thread, cpu);     //create the thread for this online cpu
        if (ret) {
            smpboot_destroy_threads(plug_thread);        //on failure, destroy the threads created so far
            goto out;
        }
        smpboot_unpark_thread(plug_thread, cpu);        //selfparking = true, so this does not unpark the kernel thread
    }
    list_add(&plug_thread->list, &hotplug_threads);    //add this smp_hotplug_thread descriptor (which covers all migration/%u threads) to the hotplug_threads list
out:
    mutex_unlock(&smpboot_threads_lock);
    put_online_cpus();
    return ret;
}

The kernel thread created is named migration/%u, where %u is replaced by 0, 1, 2, ..., the number of the cpu the thread is bound to

static int
__smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
{
    struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);        //fetch cpu_stopper.thread
    struct smpboot_thread_data *td;

    if (tsk)
        return 0;

    td = kzalloc_node(sizeof(*td), GFP_KERNEL, cpu_to_node(cpu));    //allocate smpboot_thread_data (NUMA-aware allocation; the author's platform is UMA)
    if (!td)
        return -ENOMEM;
    td->cpu = cpu;            //record the cpu
    td->ht = ht;              //link the smp_hotplug_thread

    tsk = kthread_create_on_cpu(smpboot_thread_fn, td, cpu,        //create a kernel thread bound to the given cpu, named "migration/%u" with %u replaced by the cpu number
                    ht->thread_comm);
    if (IS_ERR(tsk)) {
        kfree(td);
        return PTR_ERR(tsk);
    }
    /*
     * Park the thread so that it could start right on the CPU
     * when it is available.
     */
    kthread_park(tsk);                //park the thread first so that it does not run its fn yet
    get_task_struct(tsk);             //take a reference on the task_struct
    *per_cpu_ptr(ht->store, cpu) = tsk;    //store the thread in the per-cpu cpu_stopper.thread
    if (ht->create) {
        /*
         * Make sure that the task has actually scheduled out
         * into park position, before calling the create
         * callback. At least the migration thread callback
         * requires that the task is off the runqueue.
         */
        if (!wait_task_inactive(tsk, TASK_PARKED))    //wait for the task to reach the parked state
            WARN_ON(1);
        else
            ht->create(cpu);        //calls cpu_stop_create, which sets the task priority, stop class and FIFO policy
    }
    return 0;
}

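cpu_stop_create is called to set the kernel thread's priority to 99 (100 - 1), its scheduling class to the stop class, and its policy to FIFO.

If there is an old stop task, it is switched back to the rt class. (This shows that a CPU may have only one stop-class task: when a new stop task is installed, the old one is demoted to the rt class.)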

static void cpu_stop_create(unsigned int cpu)
{
    sched_set_stop_task(cpu, per_cpu(cpu_stopper.thread, cpu));    //pass this cpu's cpu_stopper.thread, i.e. the migration/%u kernel thread above
}

void sched_set_stop_task(int cpu, struct task_struct *stop)
{
    struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };    //priority = 100 - 1 = 99
    struct task_struct *old_stop = cpu_rq(cpu)->stop;        //fetch the previous stop task

    if (stop) {
        /*
         * Make it appear like a SCHED_FIFO task, its something
         * userspace knows about and won't get confused about.
         *
         * Also, it will make PI more or less work without too
         * much confusion -- but then, stop work should not
         * rely on PI working anyway.
         */
        sched_setscheduler_nocheck(stop, SCHED_FIFO, &param);    //make the kernel thread SCHED_FIFO with the priority above

        stop->sched_class = &stop_sched_class;            //switch its scheduling class to the stop class
    }

    cpu_rq(cpu)->stop = stop;        //install the new stop task

    if (old_stop) {
        /*
         * Reset it back to a normal scheduling class so that
         * it can die in pieces.
         */
        old_stop->sched_class = &rt_sched_class;        //put the old stop task back in the rt class
    }
}

Unpark the cpu_stopper thread of the given cpu id.

void stop_machine_unpark(int cpu)
{
    struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);

    stopper->enabled = true;           //set the enabled flag to true
    kthread_unpark(stopper->thread);   //unpark the thread
}

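On a multi-core system, cpu_stop_init unparks the kernel thread of only one cpu, the one it runs on. The threads of the other cpus are unparked as each cpu comes up, along the call path below, ending in cpuhp_online_idle.

Call path:

secondary_startup
  -> __secondary_switched
    -> secondary_start_kernel
      -> cpu_startup_entry
        -> cpuhp_online_idle, which calls stop_machine_unpark to unpark the thread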

2. The active balance operation

(2-1) Once the work runs, its fn is active_load_balance_cpu_stop, so that is the function finally executed; the argument passed in is busiest (or the rq that holds the misfit task)

/*
 * active_load_balance_cpu_stop is run by the CPU stopper. It pushes
 * running tasks off the busiest CPU onto idle CPUs. It requires at
 * least 1 task to be running on each physical CPU where possible, and
 * avoids physical / logical imbalances.
 */
int active_load_balance_cpu_stop(void *data)
{
    struct rq *busiest_rq = data;
    int busiest_cpu = cpu_of(busiest_rq);
    int target_cpu = busiest_rq->push_cpu;            //target cpu to push to; on the load_balance path this is the cpu that ran load_balance (this_cpu)
    struct rq *target_rq = cpu_rq(target_cpu);        //rq of the push cpu
    struct sched_domain *sd = NULL;
    struct task_struct *p = NULL;
    struct rq_flags rf;
#ifdef CONFIG_SCHED_WALT
    struct task_struct *push_task;
    int push_task_detached = 0;
    struct lb_env env = {
        .sd                     = sd,                //sd is NULL at this point
        .dst_cpu                = target_cpu,        //dst is the push cpu
        .dst_rq                 = target_rq,
        .src_cpu                = busiest_rq->cpu,    //src is the busiest cpu
        .src_rq                 = busiest_rq,
        .idle                   = CPU_IDLE,
        .flags                  = 0,
        .loop                   = 0,
    };
#endif

    rq_lock_irq(busiest_rq, &rf);
    /*
     * Between queueing the stop-work and running it is a hole in which
     * CPUs can become inactive. We should not move tasks from or to
     * inactive CPUs.
     */
    if (!cpu_active(busiest_cpu) || !cpu_active(target_cpu))        //if either the src or the target cpu is inactive, do not migrate
        goto out_unlock;

    /* Make sure the requested CPU hasn't gone down in the meantime: */        //we must be running on busiest_cpu and its active_balance flag must still be set
    if (unlikely(busiest_cpu != smp_processor_id() ||
             !busiest_rq->active_balance))
        goto out_unlock;

    /* Is there any task to move? */
    if (busiest_rq->nr_running <= 1)        //nothing spare to migrate when busiest has <= 1 task
        goto out_unlock;

    /*
     * This condition is "impossible", if it occurs
     * we need to fix it. Originally reported by
     * Bjorn Helgaas on a 128-CPU setup.
     */
    BUG_ON(busiest_rq == target_rq);        //abnormal case: src and target are the same rq

#ifdef CONFIG_SCHED_WALT
    push_task = busiest_rq->wrq.push_task;        //the push task recorded in the WALT rq
    target_cpu = busiest_rq->push_cpu;            //target_cpu is read once more (same field as above)
    if (push_task) {                                  //a push task was recorded in the WALT rq
        if (task_on_rq_queued(push_task) &&                //the push task is queued (on_rq == 1)
            push_task->state == TASK_RUNNING &&            //and is in the running state
            task_cpu(push_task) == busiest_cpu &&        //and is still on the busiest cpu
                    cpu_online(target_cpu)) {            //and target_cpu is online
            update_rq_clock(busiest_rq);            //update busiest rq's clock
            detach_task(push_task, &env);            //detach the push task from busiest rq (flow covered in the previous article, Load Balance - periodic load balance)
            push_task_detached = 1;                    //remember that the push task has been detached
        }
        goto out_unlock;
    }
#endif
                                                            //reaching here means no push task was recorded in the WALT rq
    /* Search for an sd spanning us and the target CPU. */        //look for a sched domain that contains both the busiest and the target cpu
    rcu_read_lock();
    for_each_domain(target_cpu, sd) {
        if ((sd->flags & SD_LOAD_BALANCE) &&
            cpumask_test_cpu(busiest_cpu, sched_domain_span(sd)))
                break;
    }

    if (likely(sd)) {                            //a suitable sd was found above
        struct lb_env env = {
            .sd        = sd,
            .dst_cpu    = target_cpu,
            .dst_rq        = target_rq,
            .src_cpu    = busiest_rq->cpu,
            .src_rq        = busiest_rq,
            .idle        = CPU_IDLE,
            /*
             * can_migrate_task() doesn't need to compute new_dst_cpu
             * for active balancing. Since we have CPU_IDLE, but no
             * @dst_grpmask we need to make that test go away with lying
             * about DST_PINNED.
             */
            .flags        = LBF_DST_PINNED,
        };

        schedstat_inc(sd->alb_count);            //active load balance attempt count +1
        update_rq_clock(busiest_rq);            //update busiest rq's clock

        p = detach_one_task(&env);                //(2-1-1) find a suitable task p on busiest rq and detach it
        if (p) {
            schedstat_inc(sd->alb_pushed);            //active load balance pushed-task count +1
            /* Active balancing done, reset the failure counter. */
            sd->nr_balance_failed = 0;                //reset the load balance failure count
        } else {
            schedstat_inc(sd->alb_failed);        //nothing was detached: active load balance failure count +1
        }
        }
    }
    rcu_read_unlock();
out_unlock:
    busiest_rq->active_balance = 0;                //clear the active balance flag: detaching is the step the flag protects; attaching to the dst rq needs no such exclusion
#ifdef CONFIG_SCHED_WALT
    push_task = busiest_rq->wrq.push_task;    //fetch the push task recorded in the WALT rq
#endif
    target_cpu = busiest_rq->push_cpu;        //fetch the push cpu as target cpu
    clear_reserved(target_cpu);                //clear the target cpu's CPU_RESERVED flag

#ifdef CONFIG_SCHED_WALT
    if (push_task)
        busiest_rq->wrq.push_task = NULL;    //clear the WALT rq's push task
#endif

    rq_unlock(busiest_rq, &rf);

#ifdef CONFIG_SCHED_WALT
    if (push_task) {                                //a WALT push task was handled above
        if (push_task_detached)                        //and it was detached
            attach_one_task(target_rq, push_task);    //(2-1-2) attach the push task to the target cpu's rq
        put_task_struct(push_task);
    }
#endif

    if (p)                                //a task was picked by the generic (non-WALT) path
        attach_one_task(target_rq, p);    //attach it to the target cpu's rq as well

    local_irq_enable();

    return 0;
}
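
Before touching any task, active_load_balance_cpu_stop re-validates the request, because the world may have changed between queueing the work and running it. The order of those early exits can be captured in a small pure function. This is an illustrative sketch only: toy_state and the enum are hypothetical, and each field stands in for the kernel query named in its comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_abort {
    TOY_PROCEED = 0,
    TOY_CPU_INACTIVE,       /* src or target cpu went inactive */
    TOY_STALE_REQUEST,      /* wrong cpu, or the flag was already cleared */
    TOY_NOTHING_TO_MOVE     /* busiest has at most one task left */
};

/* Hypothetical snapshot of what the stopper callback looks at. */
struct toy_state {
    bool busiest_active;        /* cpu_active(busiest_cpu) */
    bool target_active;         /* cpu_active(target_cpu) */
    int busiest_cpu;
    int running_cpu;            /* smp_processor_id() */
    bool active_balance;        /* busiest_rq->active_balance */
    unsigned int nr_running;    /* busiest_rq->nr_running */
};

/* Same checks, in the same order, as the early exits shown above. */
enum toy_abort toy_precheck(const struct toy_state *s)
{
    assert(s != NULL);
    if (!s->busiest_active || !s->target_active)
        return TOY_CPU_INACTIVE;
    if (s->busiest_cpu != s->running_cpu || !s->active_balance)
        return TOY_STALE_REQUEST;
    if (s->nr_running <= 1)
        return TOY_NOTHING_TO_MOVE;
    return TOY_PROCEED;
}
```

In the kernel function every one of these exits jumps to out_unlock, so the active_balance flag and the reservation on the target cpu are cleared even when nothing is moved.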

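(2-1-1) Find a suitable task p on the busiest rq and detach it.

For the flow of can_migrate_task, see the previous article, Load Balance - periodic load balance. Because env sets LBF_DST_PINNED in the function above, can_migrate_task returns immediately when an affinity restriction (for example cpuset) prevents task p from moving to the dst cpu. Compared with periodic load balance, this skips the step of picking a new_dst_cpu after such a restriction (which explains why env.flags = LBF_DST_PINNED is set above).

The detach_task flow is also covered in the previous article and is not repeated here.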

/*
 * detach_one_task() -- tries to dequeue exactly one task from env->src_rq, as
 * part of active balancing operations within "domain".
 *
 * Returns a task if successful and NULL otherwise.
 */
static struct task_struct *detach_one_task(struct lb_env *env)
{
    struct task_struct *p;

    lockdep_assert_held(&env->src_rq->lock);

    list_for_each_entry_reverse(p,                            //walk src rq's cfs_tasks list from the tail looking for a task p
            &env->src_rq->cfs_tasks, se.group_node) {
        if (!can_migrate_task(p, env))                        //check whether p may be migrated (flow covered in the previous article, Load Balance - periodic load balance)
            continue;

        detach_task(p, env);        //a suitable task p was found: detach it

        /*
         * Right now, this is only the second place where
         * lb_gained[env->idle] is updated (other is detach_tasks)
         * so we can safely collect stats here rather than
         * inside detach_tasks().
         */
        schedstat_inc(env->sd->lb_gained[env->idle]);
        return p;
    }
    return NULL;
}
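
The selection loop in detach_one_task is a reverse scan that stops at the first migratable task. A simplified userspace sketch follows, assuming an array in place of the cfs_tasks list and a reduced migratability test (affinity mask and not currently running). The real can_migrate_task checks more than this, and every name here is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical task record: only what the reduced check needs. */
struct toy_task {
    int pid;
    unsigned long cpus_mask;    /* bit n set: allowed to run on cpu n */
    bool running;               /* currently executing on the src cpu */
};

/* Reduced stand-in for can_migrate_task(): affinity and running only. */
bool toy_can_migrate(const struct toy_task *t, int dst_cpu)
{
    if (!(t->cpus_mask & (1UL << dst_cpu)))
        return false;           /* dst cpu not allowed; no fallback cpu is searched */
    return !t->running;
}

/* Scan from the tail, as list_for_each_entry_reverse() does, and return
 * the index of the first migratable task, or -1 if there is none. */
int toy_pick_one(const struct toy_task *tasks, int nr, int dst_cpu)
{
    assert(dst_cpu >= 0 && dst_cpu < (int)(sizeof(unsigned long) * 8));
    for (int i = nr - 1; i >= 0; i--) {
        if (toy_can_migrate(&tasks[i], dst_cpu))
            return i;
    }
    return -1;
}
```

Returning -1 corresponds to detach_one_task returning NULL, which the caller counts in alb_failed.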

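(2-1-2) Attach the push task to the target cpu's rq.

attach_one_task differs from attach_task only in that it takes the rq lock, updates the rq clock before attaching, and updates the rq's overutilized status afterwards.

The flows of update_overutilized_status and attach_task are covered in the previous article and are not repeated here.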

/*
 * attach_one_task() -- attaches the task returned from detach_one_task() to
 * its new rq.
 */
static void attach_one_task(struct rq *rq, struct task_struct *p)
{
    struct rq_flags rf;

    rq_lock(rq, &rf);
    update_rq_clock(rq);            //update the rq clock
    attach_task(rq, p);             //attach task p to rq
    update_overutilized_status(rq); //after the migration, refresh the rq's overutilized status
    rq_unlock(rq, &rf);
}

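3. Summary

Active balance is reached through only two paths: one is triggered from load_balance when no task could be moved and need_active_balance returns true, typically after repeated failures (path A); the other is detected and triggered from the scheduler tick on an rq that has a misfit task (path B). Active balance itself relies on the per-cpu stopper thread, which runs in the stop scheduling class (kernel thread name: migration/%u), rather than on a regular workqueue. Once triggered, the push task is chosen as follows:

Path A picks a suitable cfs task from the busiest rq.

Path B directly picks the curr task of the rq that has the misfit task.

The push task is then detached and re-attached, which completes the stopper work.

In addition, with the WALT rotation mechanism, in the misfit-task case the curr tasks of two rqs can be swapped directly (without going through active balance). This speeds up load balancing and is also more conservative about balancing misfit tasks. (My guess is that pushing a misfit task directly might overwhelm the target cpu that receives it, hence the swap of curr tasks. It would be best to collect data and see whether pushing the misfit task directly or WALT rotation fits the workload better.)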

posted @ 2023-01-03 17:15 Sugars_DJ
