Linux kernel load balancing (2): sched_balance_rq in detail

There is a good article on this topic: http://www.wowotech.net/process_management/load_balance_function.html

The analysis below is based on kernel 6.13-rc2.

The previous article covered the three ways load balancing is triggered: newidle balance, nohz idle balance and periodic balance. sched_balance_rq is the core of that code.

sched_balance_rq performs load balancing within one sched domain. The basic idea is to treat the current CPU as the destination, pick a relatively busy scheduling group within the domain, and migrate tasks from that group to the destination CPU, thereby evening out the load. The sched domain and sched group are essential concepts, so before going through the code we first introduce them.

Scheduling domains exist because the CPUs in a system form a hierarchy: SMT, MC, NUMA, socket and so on. Ignoring this topology during load balancing hurts task and system performance. For example, when a task benefits from affinity we prefer its previous CPU or a CPU sharing the same LLC; when affinity does not matter, we can prefer a set of CPUs with more idle ones, which reduces contention for memory bandwidth and improves performance.

Scheduling domains are per-CPU: every CPU has its own sched_domain structures and does not share them with other CPUs. Scheduling groups, on the other hand, are shared between CPUs.

Scheduling domains form a hierarchy. The lowest level is typically an SMT or MC domain, which has no child domain; following the parent pointers upwards, the topmost domain spans all CPUs in the system.

Each scheduling domain contains several scheduling groups. You will often notice that a domain's child seems to cover the same CPUs as one of the domain's groups; that is frequently the case, but the two must not be confused.
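To make the hierarchy concrete, here is a small illustrative sketch (print_domains_of is a made-up helper, not in-tree code) that walks all scheduling domains of one CPU from the lowest level upwards using the kernel's for_each_domain() iterator:

/*
 * Illustrative sketch, not in-tree code: walk the sched_domain hierarchy
 * of one CPU from the bottom level upwards. for_each_domain() follows the
 * sd->parent pointers and must run under rcu_read_lock().
 */
static void print_domains_of(int cpu)
{
    struct sched_domain *sd;

    rcu_read_lock();
    for_each_domain(cpu, sd) {
        pr_info("cpu%d: level=%d span_weight=%u span=%*pbl\n",
            cpu, sd->level, sd->span_weight,
            cpumask_pr_args(sched_domain_span(sd)));
    }
    rcu_read_unlock();
}

On a typical SMT machine this prints one line per level (for example SMT, MC and possibly a NUMA level), each spanning more CPUs than the one below it.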

The sched_domain structure:

Member   Meaning
struct sched_domain __rcu *parent   Parent domain; scheduling domains form a hierarchy
struct sched_domain __rcu *child   Child domain
struct sched_group *groups   The scheduling groups of this domain; a domain consists of several groups linked into a circular list, and groups points to the head
unsigned long min_interval   Minimum interval between balance runs
unsigned long max_interval   Maximum interval between balance runs
unsigned int busy_factor   Scales the balance interval when the CPU is busy; a busy CPU balances every busy_factor * balance_interval (see the sketch after this table)
unsigned int imbalance_pct   Imbalance threshold, in percent
unsigned int cache_nice_tries   Used together with nr_balance_failed; once nr_balance_failed exceeds cache_nice_tries, balancing becomes more aggressive
unsigned int imb_numa_nr   Amount of imbalance tolerated on NUMA systems; see build_sched_domains for how it is computed
int nohz_idle
int flags   Domain flags (SD_*)
int level   Level of this domain in the hierarchy; the base domain is 0, incremented going up
unsigned long last_balance   Time (in jiffies) of the last balance run in this domain
unsigned int balance_interval   Base balance interval, used together with busy_factor
unsigned int nr_balance_failed   Number of failed balance attempts in this domain
u64 max_newidle_lb_cost   Maximum observed cost (time spent) of a newidle balance in this domain; decayed over time so it does not stay too large
unsigned long last_decay_max_lb_cost   Jiffies when the cost above was last updated, used to drive its decay
char *name   Name of the domain
struct sched_domain_shared *shared   sched_domain is per-CPU, but some state must be shared, e.g. the number of busy CPUs in the domain and whether an idle CPU exists
unsigned int span_weight   Number of CPUs the domain spans
unsigned long span[]   cpumask of the CPUs the domain spans
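To see how busy_factor and balance_interval combine, here is the interval computation, essentially get_sd_balance_interval() from kernel/sched/fair.c (quoted from memory and slightly abridged, so treat the details as indicative):

static inline unsigned long
get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
{
    unsigned long interval = sd->balance_interval;

    /* a busy CPU balances less often */
    if (cpu_busy)
        interval *= sd->busy_factor;

    /* scale ms to jiffies */
    interval = msecs_to_jiffies(interval);

    /*
     * Make busy balancing at different domain levels less likely to race
     * by keeping their periods from being exact multiples of each other.
     */
    if (cpu_busy)
        interval -= 1;

    return clamp(interval, 1UL, max_load_balance_interval);
}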

The sched_group structure:

Member   Meaning
struct sched_group *next   The groups of a domain form a circular singly-linked list; next points to the next group in that list (sd->groups itself points to the local group, the one containing the current CPU)
atomic_t ref   Scheduling groups are shared between CPUs; this is the reference count
unsigned int group_weight   Number of CPUs in this group
unsigned int cores
struct sched_group_capacity *sgc   Compute capacity of the group
int asym_prefer_cpu   The highest-priority CPU of this group, used for asymmetric packing (e.g. big.LITTLE/hybrid CPUs)
int flags   Group flags
unsigned long cpumask[]   cpumask of the CPUs in this group

 

The kernel contains a long comment explaining how scheduling domains and groups are built; see the top of kernel/sched/topology.c (for example via Bootlin's Elixir cross-referencer). It is very helpful for understanding the build_balance_mask function.

sched_balance_rq is the core function that performs load balancing.

Its signature is as follows:

static int sched_balance_rq(int this_cpu, struct rq *this_rq,
            struct sched_domain *sd, enum cpu_idle_type idle,
            int *continue_balancing)

Look at the parameters first. this_cpu is the destination CPU (typically idle or about to become idle) that wants to pull tasks over from busier CPUs; this_rq is the runqueue tasks will be pulled onto; sd is the scheduling domain within which we look for a busy CPU; idle is the current CPU's idle type; continue_balancing is an output parameter that tells the caller whether it should keep balancing at higher domain levels.

static int sched_balance_rq(int this_cpu, struct rq *this_rq,
            struct sched_domain *sd, enum cpu_idle_type idle,
            int *continue_balancing)
{
    int ld_moved, cur_ld_moved, active_balance = 0;
    struct sched_domain *sd_parent = sd->parent;
    struct sched_group *group;
    struct rq *busiest;
    struct rq_flags rf;
    struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
    struct lb_env env = {
        .sd        = sd,
        .dst_cpu    = this_cpu,
        .dst_rq        = this_rq,
        .dst_grpmask    = group_balance_mask(sd->groups),
        .idle        = idle,
        .loop_break    = SCHED_NR_MIGRATE_BREAK,
        .cpus        = cpus,
        .fbq_type    = all,
        .tasks        = LIST_HEAD_INIT(env.tasks),
    };

    cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);

    schedstat_inc(sd->lb_count[idle]);

redo:
    if (!should_we_balance(&env)) {
        *continue_balancing = 0;
        goto out_balanced;
    }

    group = sched_balance_find_src_group(&env);
    if (!group) {
        schedstat_inc(sd->lb_nobusyg[idle]);
        goto out_balanced;
    }

    busiest = sched_balance_find_src_rq(&env, group);
    if (!busiest) {
        schedstat_inc(sd->lb_nobusyq[idle]);
        goto out_balanced;
    }

    WARN_ON_ONCE(busiest == env.dst_rq);

    schedstat_add(sd->lb_imbalance[idle], env.imbalance);

    env.src_cpu = busiest->cpu;
    env.src_rq = busiest;

sched_balance_rq uses an lb_env structure as the load-balancing context, carrying information between the various stages.

should_we_balance decides whether the current CPU should perform the balancing at all.

sched_balance_find_src_group (formerly find_busiest_group) finds the busiest scheduling group in this domain, which becomes the source group for migration.

sched_balance_find_src_rq (formerly find_busiest_queue) finds the busiest rq within the source group selected above.

Now that the busiest rq has been found, migration can begin.

    ld_moved = 0;
    /* Clear this flag as soon as we find a pullable task */
    env.flags |= LBF_ALL_PINNED;
    if (busiest->nr_running > 1) {
        /*
         * Attempt to move tasks. If sched_balance_find_src_group has found
         * an imbalance but busiest->nr_running <= 1, the group is
         * still unbalanced. ld_moved simply stays zero, so it is
         * correctly treated as an imbalance.
         */
        env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);

more_balance:
        rq_lock_irqsave(busiest, &rf);
        update_rq_clock(busiest);

        /*
         * cur_ld_moved - load moved in current iteration
         * ld_moved     - cumulative load moved across iterations
         */
        cur_ld_moved = detach_tasks(&env);

        /*
         * We've detached some tasks from busiest_rq. Every
         * task is masked "TASK_ON_RQ_MIGRATING", so we can safely
         * unlock busiest->lock, and we are able to be sure
         * that nobody can manipulate the tasks in parallel.
         * See task_rq_lock() family for the details.
         */

        rq_unlock(busiest, &rf);

        if (cur_ld_moved) {
            attach_tasks(&env);
            ld_moved += cur_ld_moved;
        }

ld_moved records the total amount migrated; migration may take several rounds, and cur_ld_moved records how much was moved in each round. Migration runs with the rq lock held, so it must not take too long; loop_max caps the number of iterations.

detach_tasks picks tasks off the chosen rq, up to the amount given by env->imbalance, and collects them on env's task list.

attach_tasks then takes the detached tasks from env and enqueues them on the destination rq (env->dst_rq); the caller adds cur_ld_moved to ld_moved.

        if (env.flags & LBF_NEED_BREAK) {
            env.flags &= ~LBF_NEED_BREAK;
            goto more_balance;
        }

        /*
         * Revisit (affine) tasks on src_cpu that couldn't be moved to
         * us and move them to an alternate dst_cpu in our sched_group
         * where they can run. The upper limit on how many times we
         * iterate on same src_cpu is dependent on number of CPUs in our
         * sched_group.
         *
         * This changes load balance semantics a bit on who can move
         * load to a given_cpu. In addition to the given_cpu itself
         * (or a ilb_cpu acting on its behalf where given_cpu is
         * nohz-idle), we now have balance_cpu in a position to move
         * load to given_cpu. In rare situations, this may cause
         * conflicts (balance_cpu and given_cpu/ilb_cpu deciding
         * _independently_ and at _same_ time to move some load to
         * given_cpu) causing excess load to be moved to given_cpu.
         * This however should not happen so much in practice and
         * moreover subsequent load balance cycles should correct the
         * excess load moved.
         */
        if ((env.flags & LBF_DST_PINNED) && env.imbalance > 0) {

            /* Prevent to re-select dst_cpu via env's CPUs */
            __cpumask_clear_cpu(env.dst_cpu, env.cpus);

            env.dst_rq     = cpu_rq(env.new_dst_cpu);
            env.dst_cpu     = env.new_dst_cpu;
            env.flags    &= ~LBF_DST_PINNED;
            env.loop     = 0;
            env.loop_break     = SCHED_NR_MIGRATE_BREAK;

            /*
             * Go back to "more_balance" rather than "redo" since we
             * need to continue with same src_cpu.
             */
            goto more_balance;
        }

        /*
         * We failed to reach balance because of affinity.
         */
        if (sd_parent) {
            int *group_imbalance = &sd_parent->groups->sgc->imbalance;

            if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
                *group_imbalance = 1;
        }

        /* All tasks on this runqueue were pinned by CPU affinity */
        if (unlikely(env.flags & LBF_ALL_PINNED)) {
            __cpumask_clear_cpu(cpu_of(busiest), cpus);
            /*
             * Attempting to continue load balancing at the current
             * sched_domain level only makes sense if there are
             * active CPUs remaining as possible busiest CPUs to
             * pull load from which are not contained within the
             * destination group that is receiving any migrated
             * load.
             */
            if (!cpumask_subset(cpus, env.dst_grpmask)) {
                env.loop = 0;
                env.loop_break = SCHED_NR_MIGRATE_BREAK;
                goto redo;
            }
            goto out_all_pinned;
        }
    }

LBF_NEED_BREAK is a "half-time break": migration is not finished yet, so the rq lock is released for a moment and then we jump back to continue migrating.

LBF_DST_PINNED means a task could not be moved to dst_cpu because of its CPU affinity; env.dst_cpu is swapped for env.new_dst_cpu and we try again.

LBF_SOME_PINNED means some tasks could not be migrated because of affinity; the problem is flagged to the parent domain (via sgc->imbalance) so it can be dealt with there.

LBF_ALL_PINNED means every task was pinned and nothing could be migrated; the busiest CPU is removed from the candidate mask and, if suitable candidate CPUs remain, we go back to redo and pick a new busiest group from scratch.
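For reference, these flags are simple bits carried in env->flags and are defined near the top of kernel/sched/fair.c; the values below are quoted from memory, so double-check them against your kernel version:

#define LBF_ALL_PINNED  0x01    /* every candidate task was pinned by affinity */
#define LBF_NEED_BREAK  0x02    /* detach loop hit loop_break, take a breather */
#define LBF_DST_PINNED  0x04    /* a task could run on another CPU of dst group */
#define LBF_SOME_PINNED 0x08    /* at least one task was skipped due to affinity */
#define LBF_ACTIVE_LB   0x10    /* this is an active (cpu-stopper driven) balance */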

    if (!ld_moved) {
        schedstat_inc(sd->lb_failed[idle]);
        /*
         * Increment the failure counter only on periodic balance.
         * We do not want newidle balance, which can be very
         * frequent, pollute the failure counter causing
         * excessive cache_hot migrations and active balances.
         *
         * Similarly for migration_misfit which is not related to
         * load/util migration, don't pollute nr_balance_failed.
         */
        if (idle != CPU_NEWLY_IDLE &&
            env.migration_type != migrate_misfit)
            sd->nr_balance_failed++;

        if (need_active_balance(&env)) {
            unsigned long flags;

            raw_spin_rq_lock_irqsave(busiest, flags);

            /*
             * Don't kick the active_load_balance_cpu_stop,
             * if the curr task on busiest CPU can't be
             * moved to this_cpu:
             */
            if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
                raw_spin_rq_unlock_irqrestore(busiest, flags);
                goto out_one_pinned;
            }

            /* Record that we found at least one task that could run on this_cpu */
            env.flags &= ~LBF_ALL_PINNED;

            /*
             * ->active_balance synchronizes accesses to
             * ->active_balance_work.  Once set, it's cleared
             * only after active load balance is finished.
             */
            if (!busiest->active_balance) {
                busiest->active_balance = 1;
                busiest->push_cpu = this_cpu;
                active_balance = 1;
            }

            preempt_disable();
            raw_spin_rq_unlock_irqrestore(busiest, flags);
            if (active_balance) {
                stop_one_cpu_nowait(cpu_of(busiest),
                    active_load_balance_cpu_stop, busiest,
                    &busiest->active_balance_work);
            }
            preempt_enable();
        }
    } else {
        sd->nr_balance_failed = 0;
    }

If ld_moved is 0, the migration failed, and sd->nr_balance_failed is incremented (except for newidle balance and misfit migration, so that those do not pollute the counter).

need_active_balance then decides whether the currently running task should be migrated (active balance). If so, stop_one_cpu_nowait is called to kick the CPU stopper (migration) thread on the busiest CPU, which performs the migration.

If ld_moved is non-zero, at least one task was migrated successfully and we are done.

The key functions are analyzed below.

First, the lb_env data structure, which the function uses to pass control information between stages.

    struct lb_env env = {
        .sd        = sd,
        .dst_cpu    = this_cpu,      // the destination CPU, i.e. where remote tasks are pulled to
        .dst_rq        = this_rq,    // the destination rq
        .dst_grpmask    = group_balance_mask(sd->groups),   // CPUs of the local group allowed to do the balancing
        .idle        = idle,                                // idle type of this CPU (0 means not idle)
        .loop_break    = SCHED_NR_MIGRATE_BREAK,            // detaching must not hold the lock too long; take a break after this many loops
        .cpus        = cpus,
        .fbq_type    = all,
        .tasks        = LIST_HEAD_INIT(env.tasks),          // list holding the tasks to be migrated
    };
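For completeness, here is the lb_env structure itself, reproduced from memory from kernel/sched/fair.c (a few statistics-related details may differ in your tree):

struct lb_env {
    struct sched_domain     *sd;

    struct rq               *src_rq;        /* rq we pull tasks from */
    int                     src_cpu;

    int                     dst_cpu;        /* CPU doing the balancing */
    struct rq               *dst_rq;

    struct cpumask          *dst_grpmask;   /* balance mask of the local group */
    int                     new_dst_cpu;    /* fallback dst when dst is pinned */
    enum cpu_idle_type      idle;
    long                    imbalance;      /* load/util/tasks left to move */
    struct cpumask          *cpus;          /* CPUs considered for balancing */

    unsigned int            flags;          /* LBF_* bits */

    unsigned int            loop;
    unsigned int            loop_break;
    unsigned int            loop_max;

    enum fbq_type           fbq_type;
    enum migration_type     migration_type;
    struct list_head        tasks;          /* detached tasks */
};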

should_we_balance decides whether this CPU should perform the balancing.

static int should_we_balance(struct lb_env *env)
{
    struct cpumask *swb_cpus = this_cpu_cpumask_var_ptr(should_we_balance_tmpmask);
    struct sched_group *sg = env->sd->groups;
    int cpu, idle_smt = -1;
    // Check that the destination CPU is still usable; it may have been hot-unplugged.
    if (!cpumask_test_cpu(env->dst_cpu, env->cpus))
        return 0;

    // For a newly idle CPU, balance as long as its rq has no runnable tasks
    // and no wakeups pending.
    if (env->idle == CPU_NEWLY_IDLE) {
        if (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending)
            return 0;
        return 1;
    }

    cpumask_copy(swb_cpus, group_balance_mask(sg));
    /* Try to find first idle CPU */
    // Look for the first idle CPU in the scheduling group.
    for_each_cpu_and(cpu, swb_cpus, env->cpus) {
        if (!idle_cpu(cpu))
            continue;

        // Don't balance on behalf of an idle SMT sibling of a busy core yet,
        // but remember it.
        if (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && !is_core_idle(cpu)) {
            if (idle_smt == -1)
                idle_smt = cpu;
            /*
             * If the core is not idle, and first SMT sibling which is
             * idle has been found, then its not needed to check other
             * SMT siblings for idleness:
             */
#ifdef CONFIG_SCHED_SMT
            cpumask_andnot(swb_cpus, swb_cpus, cpu_smt_mask(cpu));
#endif
            continue;
        }

        // We found an idle CPU (on an idle core); balance only if it is the
        // destination CPU.
        return cpu == env->dst_cpu;
    }

    /* Are we the first idle CPU with busy siblings? */
    // No fully idle core was found, but there may be an idle SMT sibling;
    // check whether that sibling is this CPU.
    if (idle_smt != -1)
        return idle_smt == env->dst_cpu;

    /* Are we the first CPU of this group ? */
    // No idle CPU at all; check whether this CPU is the first
    // balance-eligible CPU of the group.
    return group_balance_cpu(sg) == env->dst_cpu;
}

Roughly: look for an idle CPU in the current scheduling group; if there is none, fall back to the first balance-eligible CPU of the group. Either way, balancing only goes ahead if that CPU is the current CPU, because the balancing work is done on the current CPU; in effect the function just validates that the current CPU is the right one to do it.
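The "first balance-eligible CPU of the group" is group_balance_cpu(), which is simply the first CPU in the group's balance mask. Both helpers are small; quoted from memory from kernel/sched/sched.h and kernel/sched/topology.c:

/* CPUs of this group that are allowed to act as the balancing CPU */
static inline struct cpumask *group_balance_mask(struct sched_group *sg)
{
    return to_cpumask(sg->sgc->cpumask);
}

/* The designated balancing CPU is simply the first CPU in that mask */
int group_balance_cpu(struct sched_group *sg)
{
    return cpumask_first(group_balance_mask(sg));
}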

sched_balance_find_src_group finds the busiest scheduling group in the current domain. Here is its decision table:

busiest \ local   has_spare  fully_busy  misfit  asym  imbalanced  overloaded
has_spare         nr_idle    balanced    N/A     N/A   balanced    balanced
fully_busy        nr_idle    nr_idle     N/A     N/A   balanced    balanced
misfit_task       force      N/A         N/A     N/A   N/A         N/A
asym_packing      force      force       N/A     N/A   force       force
imbalanced        force      force       N/A     N/A   force       force
overloaded        force      force       N/A     N/A   force       avg_load

This table comes from the comment above the function. The entries mean the following:

busiest is the candidate source group being compared, local is the local group; the first row and column give the group types;

N/A: not applicable, the case has already been filtered out earlier;

balanced: the two groups are considered balanced;

force: load must be migrated; compute the imbalance;

avg_load: compare average loads and migrate only if the imbalance is significant enough;

nr_idle: dst_cpu is not busy and the number of idle CPUs differs significantly between the two groups;

As you can see, if the local group is relatively idle and the other group is busy, migration will definitely take place; the other cases can be read straight off the table.

static struct sched_group *sched_balance_find_src_group(struct lb_env *env)
{
    struct sg_lb_stats *local, *busiest;
    struct sd_lb_stats sds;

    init_sd_lb_stats(&sds);

    /*
     * Compute the various statistics relevant for load balancing at
     * this level.
     */
    update_sd_lb_stats(env, &sds);    // find the busiest group in this domain

    /* There is no busy sibling group to pull tasks from */
    if (!sds.busiest)
        goto out_balanced;

    busiest = &sds.busiest_stat;
// now apply the decision table above

The main logic is to first call update_sd_lb_stats to find the busiest group in this domain, and then apply the decision table above. update_sd_lb_stats is a key function and will be analyzed in detail later.

    /* Misfit tasks should be dealt with regardless of the avg load */
...

    if (busiest->group_type == group_imbalanced)
        goto force_balance;

    local = &sds.local_stat;

    if (local->group_type > busiest->group_type)
        goto out_balanced;

    if (local->group_type == group_overloaded) {
...
    }
...

force_balance:
    /* Looks like there is an imbalance. Compute it */
    calculate_imbalance(env, &sds);
    return env->imbalance ? sds.busiest : NULL;

out_balanced:
    env->imbalance = 0;
    return NULL;
}

The checks omitted here mostly compare the load difference between the local group and the busiest group and try to balance the two. Since tasks can only be pulled from busiest to local, if local turns out to be busier than busiest the situation is treated as balanced even though it is not. If balancing is needed, control jumps to force_balance, where calculate_imbalance computes env->imbalance, i.e. how much load has to be migrated.

This relies on the group_type enum, which describes how busy a scheduling group currently is.

enum group_type   Meaning
group_has_spare = 0   The group has spare capacity that can run more tasks
group_fully_busy   The group's CPUs have no spare capacity left
group_misfit_task   A task is running on a CPU whose capacity is too small and needs to be moved; only relevant on asymmetric-capacity systems
group_smt_balance   SMT balancing: a task running on a hyperthread whose core is fully busy can be moved to an idle core
group_asym_packing   Asymmetric packing: a higher-priority local CPU is available and should take the task
group_imbalanced   Task affinity constraints prevented earlier load balancing
group_overloaded   The group's CPUs are overloaded

All of these are computed in update_sd_lb_stats, and the numeric values are ordered by how busy the group is: the larger the value, the busier the group. update_sd_pick_busiest relies on this ordering to pick the busiest group.
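The ordering is used very directly. The first comparisons in update_sd_pick_busiest() boil down to the sketch below (pick_busier_sketch is a made-up name; the real function continues with finer-grained comparisons):

/* Sketch of the ordering check at the top of update_sd_pick_busiest():
 * sgs is the candidate group's stats, busiest the best candidate so far. */
static bool pick_busier_sketch(struct sg_lb_stats *sgs, struct sg_lb_stats *busiest)
{
    if (sgs->group_type > busiest->group_type)
        return true;            /* candidate is in a busier state: pick it */

    if (sgs->group_type < busiest->group_type)
        return false;           /* candidate is less busy: keep the old one */

    /*
     * Same group_type: the in-tree code goes on to compare avg_load,
     * idle CPU counts, misfit load and so on; treat it as a tie here.
     */
    return false;
}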

sched_balance_find_src_rq then looks for the busiest rq inside the group selected above.
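Its full logic depends on fbq_type and migration_type; a minimal sketch of just the migrate_load case (find_src_rq_sketch is a hypothetical helper, not the in-tree function) looks like this:

/*
 * Minimal sketch: for migration_type == migrate_load, the busiest rq is
 * essentially the CPU of the chosen group (still present in env->cpus)
 * with the highest load relative to its capacity.
 */
static struct rq *find_src_rq_sketch(struct lb_env *env, struct sched_group *group)
{
    struct rq *busiest = NULL;
    unsigned long busiest_load = 0, busiest_capacity = 1;
    int i;

    for_each_cpu_and(i, sched_group_span(group), env->cpus) {
        struct rq *rq = cpu_rq(i);
        unsigned long load = cpu_load(rq);
        unsigned long capacity = capacity_of(i);

        /* compare load/capacity ratios by cross-multiplying */
        if (load * busiest_capacity > busiest_load * capacity) {
            busiest_load = load;
            busiest_capacity = capacity;
            busiest = rq;
        }
    }

    return busiest;
}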

 detach_tasks

static int detach_tasks(struct lb_env *env)
{
    struct list_head *tasks = &env->src_rq->cfs_tasks;
    unsigned long util, load;
    struct task_struct *p;
    int detached = 0;

    lockdep_assert_rq_held(env->src_rq);

    while (!list_empty(tasks)) {
        env->loop++;
        /* We've more or less seen every task there is, call it quits */
        if (env->loop > env->loop_max)
            break;

        /* take a breather every nr_migrate tasks */
        if (env->loop > env->loop_break) {
            env->loop_break += SCHED_NR_MIGRATE_BREAK;
            env->flags |= LBF_NEED_BREAK;
            break;
        }

        p = list_last_entry(tasks, struct task_struct, se.group_node);

        if (!can_migrate_task(p, env))
            goto next;

        switch (env->migration_type) {
        case migrate_load:
            load = max_t(unsigned long, task_h_load(p), 1);

            if (sched_feat(LB_MIN) &&
                load < 16 && !env->sd->nr_balance_failed)
                goto next;

            /*
             * Make sure that we don't migrate too much load.
             * Nevertheless, let relax the constraint if
             * scheduler fails to find a good waiting task to
             * migrate.
             */
            if (shr_bound(load, env->sd->nr_balance_failed) > env->imbalance)
                goto next;

            env->imbalance -= load;
            break;

        case migrate_util:
            util = task_util_est(p);

            if (shr_bound(util, env->sd->nr_balance_failed) > env->imbalance)
                goto next;

            env->imbalance -= util;
            break;

        case migrate_task:
            env->imbalance--;
            break;

        case migrate_misfit:
            /* This is not a misfit task */
            if (task_fits_cpu(p, env->src_cpu))
                goto next;

            env->imbalance = 0;
            break;
        }

        detach_task(p, env);
        list_add(&p->se.group_node, &env->tasks);

        detached++;

#ifdef CONFIG_PREEMPTION
        /*
         * NEWIDLE balancing is a source of latency, so preemptible
         * kernels will stop after the first task is detached to minimize
         * the critical section.
         */
        if (env->idle == CPU_NEWLY_IDLE)
            break;
#endif

        /*
         * We only want to steal up to the prescribed amount of
         * load/util/tasks.
         */
        if (env->imbalance <= 0)
            break;

        continue;
next:
        list_move(&p->se.group_node, tasks);
    }

    return detached;
}

We take the cfs_tasks list from the source rq and walk it, picking tasks from the tail, which tends to avoid tasks that are still cache hot. For each task, can_migrate_task decides whether it may be migrated; if so, the accounting depends on env->migration_type, and at the end of every iteration we check whether to keep going. Whether the migration type is task, load or util, the migration is ultimately carried out by moving whole tasks. A task that cannot be migrated is moved back to the head of the cfs_tasks list.

detach_task takes the task off the source rq, and the detached task is put on env->tasks. In the newly idle case (on preemptible kernels) a single detached task is enough. Once the remaining imbalance drops to 0 or below we are done.

detach_task

static void detach_task(struct task_struct *p, struct lb_env *env)
{
    lockdep_assert_rq_held(env->src_rq);

    deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
    set_task_cpu(p, env->dst_cpu);
}

void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
    SCHED_WARN_ON(flags & DEQUEUE_SLEEP);

    WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
    ASSERT_EXCLUSIVE_WRITER(p->on_rq);

    /*
     * Code explicitly relies on TASK_ON_RQ_MIGRATING begin set *before*
     * dequeue_task() and cleared *after* enqueue_task().
     */

    dequeue_task(rq, p, flags);
}

detach_task calls deactivate_task to take the task off the source rq and sets the task's CPU to env->dst_cpu, but at this point the task is not queued on any CPU yet.

deactivate_task sets the task's on_rq field to TASK_ON_RQ_MIGRATING and calls dequeue_task to remove it from the rq.

attach_tasks

static void attach_tasks(struct lb_env *env)
{
    struct list_head *tasks = &env->tasks;
    struct task_struct *p;
    struct rq_flags rf;

    rq_lock(env->dst_rq, &rf);
    update_rq_clock(env->dst_rq);

    while (!list_empty(tasks)) {
        p = list_first_entry(tasks, struct task_struct, se.group_node);
        list_del_init(&p->se.group_node);

        attach_task(env->dst_rq, p);
    }

    rq_unlock(env->dst_rq, &rf);
}

attach_tasks takes the tasks off env's task list one by one and calls attach_task to add each of them to env->dst_rq.

static void attach_task(struct rq *rq, struct task_struct *p)
{
    lockdep_assert_rq_held(rq);

    WARN_ON_ONCE(task_rq(p) != rq);
    activate_task(rq, p, ENQUEUE_NOCLOCK);
    wakeup_preempt(rq, p, 0);
}
void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
    if (task_on_rq_migrating(p))
        flags |= ENQUEUE_MIGRATED;
    if (flags & ENQUEUE_MIGRATED)
        sched_mm_cid_migrate_to(rq, p);

    enqueue_task(rq, p, flags);

    WRITE_ONCE(p->on_rq, TASK_ON_RQ_QUEUED);
    ASSERT_EXCLUSIVE_WRITER(p->on_rq);
}

attach_task finally calls enqueue_task to put the task on the destination rq, and the task's on_rq field is set to TASK_ON_RQ_QUEUED.

need_active_balance

static int need_active_balance(struct lb_env *env)
{
    struct sched_domain *sd = env->sd;

    if (asym_active_balance(env))
        return 1;

    if (imbalanced_active_balance(env))
        return 1;

    /*
     * The dst_cpu is idle and the src_cpu CPU has only 1 CFS task.
     * It's worth migrating the task if the src_cpu's capacity is reduced
     * because of other sched_class or IRQs if more capacity stays
     * available on dst_cpu.
     */
    if (env->idle &&
        (env->src_rq->cfs.h_nr_running == 1)) {
        if ((check_cpu_capacity(env->src_rq, sd)) &&
            (capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100))
            return 1;
    }

    if (env->migration_type == migrate_misfit)
        return 1;

    return 0;
}

It decides whether active balance is needed in the asymmetric and imbalanced cases.

If the dst CPU is idle and the src rq has only one CFS task, active balance is also worthwhile when the src CPU's available capacity has been noticeably reduced (by other scheduling classes or IRQs) and is clearly lower than the dst CPU's capacity.

If the migration type is migrate_misfit it returns 1; in all other cases it returns 0.
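The "capacity is reduced" test is check_cpu_capacity(), which in recent kernels looks roughly as follows (quoted from memory):

/*
 * True when the rq's remaining CFS capacity is noticeably lower than the
 * CPU's full capacity, with imbalance_pct as the threshold (for example
 * imbalance_pct = 117 triggers once more than roughly 15% of the capacity
 * is consumed by RT/DL/IRQ pressure).
 */
static inline int check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
{
    return ((rq->cpu_capacity * sd->imbalance_pct) <
            (arch_scale_cpu_capacity(cpu_of(rq)) * 100));
}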

imbalanced_active_balance

static inline bool
imbalanced_active_balance(struct lb_env *env)
{
    struct sched_domain *sd = env->sd;

    /*
     * The imbalanced case includes the case of pinned tasks preventing a fair
     * distribution of the load on the system but also the even distribution of the
     * threads on a system with spare capacity
     */
    if ((env->migration_type == migrate_task) &&
        (sd->nr_balance_failed > sd->cache_nice_tries+2))
        return 1;

    return 0;
}

If the migration type is migrate_task and the number of failed balance attempts exceeds sd->cache_nice_tries + 2, active balance is deemed necessary.

can_migrate_task

static
int can_migrate_task(struct task_struct *p, struct lb_env *env)
{
    int tsk_cache_hot;

    lockdep_assert_rq_held(env->src_rq);

    /*
     * We do not migrate tasks that are:
     * 1) throttled_lb_pair, or
     * 2) cannot be migrated to this CPU due to cpus_ptr, or
     * 3) running (obviously), or
     * 4) are cache-hot on their current CPU.
     */
    if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
        return 0;

    /* Disregard percpu kthreads; they are where they need to be. */
    if (kthread_is_per_cpu(p))
        return 0;

    if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
        int cpu;

        schedstat_inc(p->stats.nr_failed_migrations_affine);

        env->flags |= LBF_SOME_PINNED;

        /*
         * Remember if this task can be migrated to any other CPU in
         * our sched_group. We may want to revisit it if we couldn't
         * meet load balance goals by pulling other tasks on src_cpu.
         *
         * Avoid computing new_dst_cpu
         * - for NEWLY_IDLE
         * - if we have already computed one in current iteration
         * - if it's an active balance
         */
        if (env->idle == CPU_NEWLY_IDLE ||
            env->flags & (LBF_DST_PINNED | LBF_ACTIVE_LB))
            return 0;

        /* Prevent to re-select dst_cpu via env's CPUs: */
        for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
            if (cpumask_test_cpu(cpu, p->cpus_ptr)) {
                env->flags |= LBF_DST_PINNED;
                env->new_dst_cpu = cpu;
                break;
            }
        }

        return 0;
    }

    /* Record that we found at least one task that could run on dst_cpu */
    env->flags &= ~LBF_ALL_PINNED;

    if (task_on_cpu(env->src_rq, p)) {
        schedstat_inc(p->stats.nr_failed_migrations_running);
        return 0;
    }

    /*
     * Aggressive migration if:
     * 1) active balance
     * 2) destination numa is preferred
     * 3) task is cache cold, or
     * 4) too many balance attempts have failed.
     */
    if (env->flags & LBF_ACTIVE_LB)
        return 1;

    tsk_cache_hot = migrate_degrades_locality(p, env);
    if (tsk_cache_hot == -1)
        tsk_cache_hot = task_hot(p, env);

    if (tsk_cache_hot <= 0 ||
        env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
        if (tsk_cache_hot == 1) {
            schedstat_inc(env->sd->lb_hot_gained[env->idle]);
            schedstat_inc(p->stats.nr_forced_migrations);
        }
        return 1;
    }

    schedstat_inc(p->stats.nr_failed_migrations_hot);
    return 0;
}

The comments in the code are already fairly clear. A task will not be migrated in the following cases:

1. its task group is throttled (throttled_lb_pair);

2. it is a per-CPU kthread;

3. dst_cpu is not in the task's cpus_ptr;

4. the task is currently running.

Migration is done more aggressively in the following cases:

1. the env flags contain LBF_ACTIVE_LB;

2. moving the task does not degrade NUMA locality (or even improves it);

3. the task's cache is cold (see the sketch after this list);

4. balancing has already failed many times.
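The cache-cold test is task_hot(). A heavily abridged sketch of its core heuristic (task_hot_sketch is a made-up name; the real function has several more early-exit checks) is:

/*
 * Abridged sketch of task_hot(): a task is treated as cache hot if it ran
 * on the source CPU within the last sysctl_sched_migration_cost ns.
 */
static int task_hot_sketch(struct task_struct *p, struct lb_env *env)
{
    s64 delta;

    /* SMT siblings share cache, so moving between them is never "hot" */
    if (env->sd->flags & SD_SHARE_CPUCAPACITY)
        return 0;

    delta = rq_clock_task(env->src_rq) - p->se.exec_start;

    return delta < (s64)sysctl_sched_migration_cost;
}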

Let's look at how NUMA locality affects task migration.

#ifdef CONFIG_NUMA_BALANCING
/*
 * Returns 1, if task migration degrades locality
 * Returns 0, if task migration improves locality i.e migration preferred.
 * Returns -1, if task migration is not affected by locality.
 */
static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
{
    struct numa_group *numa_group = rcu_dereference(p->numa_group);
    unsigned long src_weight, dst_weight;
    int src_nid, dst_nid, dist;

    if (!static_branch_likely(&sched_numa_balancing))
        return -1;

    if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
        return -1;

    src_nid = cpu_to_node(env->src_cpu);
    dst_nid = cpu_to_node(env->dst_cpu);

    if (src_nid == dst_nid)
        return -1;

    /* Migrating away from the preferred node is always bad. */
    if (src_nid == p->numa_preferred_nid) {
        if (env->src_rq->nr_running > env->src_rq->nr_preferred_running)
            return 1;
        else
            return -1;
    }

    /* Encourage migration to the preferred node. */
    if (dst_nid == p->numa_preferred_nid)
        return 0;

    /* Leaving a core idle is often worse than degrading locality. */
    if (env->idle == CPU_IDLE)
        return -1;

    dist = node_distance(src_nid, dst_nid);
    if (numa_group) {
        src_weight = group_weight(p, src_nid, dist);
        dst_weight = group_weight(p, dst_nid, dist);
    } else {
        src_weight = task_weight(p, src_nid, dist);
        dst_weight = task_weight(p, dst_nid, dist);
    }

    return dst_weight < src_weight;
}

As you can see, this function only matters when NUMA balancing is enabled. The comment at the top is explicit: a return value of 1 means the migration degrades NUMA locality, 0 means it improves locality (migration preferred), and -1 means locality is unaffected.

It first checks whether sched_numa_balancing is enabled; if not, the migration is treated as having no effect on locality. Strictly speaking this only means NUMA balancing is not active; the migration may in fact still hurt NUMA affinity.

If the task has no numa_faults statistics, or the current scheduling domain does not span NUMA nodes (no SD_NUMA), it is likewise treated as having no effect on NUMA locality, and in these cases that is essentially true.

If the dst CPU and the src CPU are on the same node, the migration is also considered locality-neutral.

If the src node is the task's numa_preferred_nid and the number of runnable tasks on the src rq exceeds nr_preferred_running, migrating away degrades NUMA locality (return 1); otherwise it is treated as neutral.

If the dst node is the task's preferred node, the migration is considered beneficial (return 0).

The influence of the task's numa_group is also taken into account, by comparing per-group (or per-task) fault weights on the two nodes.

Overall, can_migrate_task does consider NUMA, but only as a soft constraint: once balancing has failed enough times, tasks will still be migrated even when this hurts NUMA affinity.

Even if you modified can_migrate_task so that certain tasks are never migrated, the resulting growth of nr_balance_failed would eventually trigger active balance, waking the migration thread to move tasks even more aggressively. This shows how persistent the kernel is about load balancing: it will try in every situation it possibly can. However, the active balance path run by the migration thread also calls can_migrate_task to decide whether a task may be moved, so can_migrate_task remains a point where migration can be vetoed.

One loose end remains: active balance has not been covered in depth here. The next article will analyze the code of the migration thread.

sched_balance_rq is a very complex function; we close with a flow chart of it.

[Figure: sched_balance_rq function flow chart]

posted on 2024-12-18 20:07  半山随笔
