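Load Balance - active load balance
The previous article in this series analyzed periodic load balance, mainly tracing its code framework and flow. Inside load_balance, the scheduler makes several attempts to migrate tasks; if they all fail, it checks whether a more aggressive form of balancing is needed.
Active load balance is one of those aggressive forms. As briefly described in the previous article, it works by pushing a task from a heavily loaded cpu to a lightly loaded cpu.
Below we walk through the actual code, based on kernel-5.4. My understanding is limited, so mistakes are likely; corrections are welcome.
1. Trigger conditions
As already touched on in the previous article, one call path is load_balance calling need_active_balance to decide and trigger:
(1-1) Decide whether active balance is needed (returning 1 means active load balance is required)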
static int need_active_balance(struct lb_env *env)
{
    struct sched_domain *sd = env->sd;

    if (voluntary_active_balance(env))    // (1-1-1) conditions for voluntarily triggering active balance
        return 1;

    if ((env->idle != CPU_NOT_IDLE) &&                              // dst cpu (this_cpu) is idle or newly idle
        (capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) &&  // src cpu_capacity < dst cpu_capacity
        ((capacity_orig_of(env->src_cpu) <                          // and src cpu_capacity_orig < dst cpu_capacity_orig
          capacity_orig_of(env->dst_cpu))) &&
        env->src_rq->cfs.h_nr_running == 1 &&                       // and src cpu has exactly one cfs task
        cpu_overutilized(env->src_cpu) &&                           // and src cpu is overutilized
        !cpu_overutilized(env->dst_cpu)) {                          // and dst cpu is not overutilized
        return 1;
    }

    if (env->src_grp_type == group_overloaded && env->src_rq->misfit_task_load)    // src cpu's group is overloaded and src rq has misfit task load
        return 1;

    return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);    // balance failure count of this sd > sd->cache_nice_tries + 2
}
(1-1-1) Check whether the conditions for voluntarily triggering active balance are met
static inline bool voluntary_active_balance(struct lb_env *env)
{
    struct sched_domain *sd = env->sd;

    if (asym_active_balance(env))    // the author's platform has no SMT level and therefore no SD_ASYM_PACKING flag, so this is always false there
        return 1;

    /*
     * The dst_cpu is idle and the src_cpu CPU has only 1 CFS task.
     * It's worth migrating the task if the src_cpu's capacity is reduced
     * because of other sched_class or IRQs if more capacity stays
     * available on dst_cpu.
     */
    if ((env->idle != CPU_NOT_IDLE) &&                // dst cpu is idle or newly idle
        (env->src_rq->cfs.h_nr_running == 1)) {       // src cpu has exactly one cfs task
        if ((check_cpu_capacity(env->src_rq, sd)) &&  // src cpu_capacity * sd->imbalance_pct < src cpu_capacity_orig * 100
            (capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100))    // src cpu_capacity * sd->imbalance_pct < dst cpu_capacity * 100
            return 1;
    }

    if (env->idle != CPU_NOT_IDLE &&                  // dst cpu is idle or newly idle
        env->src_grp_type == group_misfit_task)       // src cpu's group type is group_misfit_task
        return 1;

    return 0;
}
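The two capacity comparisons in voluntary_active_balance are easier to follow with numbers. The following is a minimal user-space sketch, not kernel code: `struct cpu_cap`, `capacity_reduced`, `dst_clearly_bigger` and `worth_pushing_single_task` are hypothetical names, and the value 117 used for `imbalance_pct` in the note below is only an assumed example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU capacity values the kernel tracks. */
struct cpu_cap {
    unsigned long capacity;      /* capacity left for CFS after RT/IRQ pressure */
    unsigned long capacity_orig; /* full capacity of the CPU */
};

/* Same comparison as check_cpu_capacity(): has the CPU lost a noticeable
 * share of its original capacity? */
static bool capacity_reduced(const struct cpu_cap *c, unsigned int imbalance_pct)
{
    return c->capacity * imbalance_pct < c->capacity_orig * 100;
}

/* Same comparison as the second test: does dst offer clearly more capacity? */
static bool dst_clearly_bigger(const struct cpu_cap *src,
                               const struct cpu_cap *dst,
                               unsigned int imbalance_pct)
{
    return src->capacity * imbalance_pct < dst->capacity * 100;
}

/* Simplified model of the "single CFS task on a squeezed CPU" branch. */
static bool worth_pushing_single_task(bool dst_idle,
                                      unsigned int src_h_nr_running,
                                      const struct cpu_cap *src,
                                      const struct cpu_cap *dst,
                                      unsigned int imbalance_pct)
{
    assert(src && dst);
    return dst_idle &&
           src_h_nr_running == 1 &&
           capacity_reduced(src, imbalance_pct) &&
           dst_clearly_bigger(src, dst, imbalance_pct);
}
```

With imbalance_pct = 117, a src cpu whose capacity has dropped to 800 out of 1024 passes both tests against a dst cpu with capacity 1024 (800 * 117 = 93600 < 102400). At a src capacity of 900 the product is 105300, so neither test passes and no push is triggered by this branch.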
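With the trigger conditions in mind, let us read load_balance once more:
- In more_balance, if the busiest rq is already doing active balance (and has at most 2 tasks), no migration is attempted and the code jumps to no_move
- In no_move, if no task has been migrated after several rounds, the code considers whether active balance is needed
- need_active_balance decides whether active balance is needed and whether its conditions are met
- After a few more filters, a stop work is queued on the busiest cpu's stop-class thread to perform the active balance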
/*
 * Check this_cpu to ensure it is balanced within domain. Attempt to move
 * tasks if there is an imbalance.
 */
static int load_balance(int this_cpu, struct rq *this_rq,
                        struct sched_domain *sd, enum cpu_idle_type idle,
                        int *continue_balancing)
{
    int ld_moved = 0, cur_ld_moved, active_balance = 0;
    ...
more_balance:
    ...
        /*
         * The world might have changed. Validate assumptions.
         * And also, if the busiest cpu is undergoing active_balance,
         * it doesn't need help if it has less than 2 tasks on it.
         */
        if (busiest->nr_running <= 1 ||                                // busiest rq has one task or none
            (busiest->active_balance && busiest->nr_running <= 2)) {   // or it is already in active balance and has <= 2 tasks
            rq_unlock_irqrestore(busiest, &rf);
            env.flags &= ~LBF_ALL_PINNED;    // clear LBF_ALL_PINNED
            goto no_move;                    // skip migration
        }
    ...
no_move:
    if (!ld_moved) {    // ld_moved is still 0 after all attempts: this balance failed
        ...
        if (need_active_balance(&env)) {    // (1-1) is active balance needed?
            ...
            /*
             * ->active_balance synchronizes accesses to
             * ->active_balance_work. Once set, it's cleared
             * only after active load balance is finished.
             */
            if (!busiest->active_balance &&           // busiest is not already in active balance
                !cpu_isolated(cpu_of(busiest))) {     // and the busiest cpu is not isolated
                busiest->active_balance = 1;          // mark the busiest rq as being in active balance
                busiest->push_cpu = this_cpu;         // the push target is the current cpu (this_cpu)
                active_balance = 1;                   // remember that this pass kicked active balance
                mark_reserved(this_cpu);              // mark this cpu as reserved
            }
            raw_spin_unlock_irqrestore(&busiest->lock, flags);

            if (active_balance) {
                stop_one_cpu_nowait(cpu_of(busiest),               // (1-3) queue the active balance work on the busiest cpu's stop-class thread
                    active_load_balance_cpu_stop, busiest,         // (2-1) the work function is active_load_balance_cpu_stop
                    &busiest->active_balance_work);
                *continue_balancing = 0;    // once active balance is kicked, stop balancing further
            }

            /* We've kicked active balancing, force task migration. */
            sd->nr_balance_failed = sd->cache_nice_tries +    // set the failure count to cache_nice_tries + 10 - 1
                    NEED_ACTIVE_BALANCE_THRESHOLD - 1;
        }
    } else
        sd->nr_balance_failed = 0;    // tasks were migrated, so clear the failure count

    if (likely(!active_balance)) {    // active balance was not kicked in this pass
        /* We were unbalanced, so reset the balancing interval */
        sd->balance_interval = sd->min_interval;    // reset the balance interval to min_interval
    } else {
        /*
         * If we've begun active balancing, start to back off. This
         * case may not be covered by the all_pinned logic if there
         * is only 1 task on the busy runqueue (because we don't call
         * detach_tasks).
         */
        if (sd->balance_interval < sd->max_interval)    // interval is still below max_interval
            sd->balance_interval *= 2;                  // so double it
    }
    ...
out:
    trace_sched_load_balance(this_cpu, idle, *continue_balancing,
                             group ? group->cpumask[0] : 0,
                             busiest ? busiest->nr_running : 0,
                             env.imbalance, env.flags, ld_moved,
                             sd->balance_interval, active_balance,
                             sd_overutilized(sd), env.prefer_spread);
    return ld_moved;
}
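The interval handling at the end of load_balance can be modelled in a few lines. This is a minimal user-space sketch under stated assumptions: `struct sd_interval` and `update_balance_interval` are hypothetical names, and the interval values in the note below are made up for illustration.

```c
#include <assert.h>

/* Hypothetical subset of struct sched_domain: only the interval fields. */
struct sd_interval {
    unsigned int balance_interval;
    unsigned int min_interval;
    unsigned int max_interval;
};

/*
 * Model of the tail of load_balance(): if this pass did not kick active
 * balance, reset the interval; otherwise back off by doubling while the
 * interval is still below the maximum.
 */
static void update_balance_interval(struct sd_interval *sd, int active_balance_kicked)
{
    assert(sd);
    if (!active_balance_kicked)
        sd->balance_interval = sd->min_interval;
    else if (sd->balance_interval < sd->max_interval)
        sd->balance_interval *= 2;
}
```

Starting from an interval of 8 with a maximum of 64, three consecutive kicks give 16, 32 and 64; a fourth kick leaves the interval at 64, and the first pass that does not kick active balance drops it back to the minimum.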
The above is the first path, which triggers active balance from load_balance. The other path is triggered from the scheduler tick, as follows:
scheduler_tick
-> check_for_migration
-> stop_one_cpu_nowait
(1-2)
/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 */
void scheduler_tick(void)
{
    ...
    if (curr->sched_class == &fair_sched_class)
        check_for_migration(rq, curr);

#ifdef CONFIG_SMP
    rq_lock(rq, &rf);
    if (idle_cpu(cpu) && is_reserved(cpu) && !rq->active_balance)
        clear_reserved(cpu);
    rq_unlock(rq, &rf);
#endif
    ...
}
(1-2-1) For the misfit task case, check whether a dedicated active balance is needed
void check_for_migration(struct rq *rq, struct task_struct *p)
{
    int active_balance;
    int new_cpu = -1;
    int prev_cpu = task_cpu(p);
    int ret;

    if (rq->misfit_task_load) {                    // the rq has a misfit task
        if (rq->curr->state != TASK_RUNNING ||     // curr is not in the running state
            rq->curr->nr_cpus_allowed == 1)        // or curr's affinity (e.g. via cpuset) allows only one cpu
            return;

        if (walt_rotation_enabled) {    // WALT rotation is enabled. Enabling condition: with sched_boost and the sysctl (WALT rotation) not turned on, the number of big tasks exceeds the total number of cpus
            raw_spin_lock(&migration_lock);
            walt_check_for_rotation(rq);    // tries to find a dst cpu and swap this cpu's curr with the dst cpu's curr; analysis of this code is left for a later update
            raw_spin_unlock(&migration_lock);
            return;
        }

        raw_spin_lock(&migration_lock);
        rcu_read_lock();
        new_cpu = find_energy_efficient_cpu(p, prev_cpu, 0, 1);    // use EAS to pick a new cpu on which to place curr
        rcu_read_unlock();
        if ((new_cpu >= 0) && (new_cpu != prev_cpu) &&                        // a new cpu was found and it is not prev_cpu
            (capacity_orig_of(new_cpu) > capacity_orig_of(prev_cpu))) {       // and its capacity_orig is larger than prev_cpu's
            active_balance = kick_active_balance(rq, p, new_cpu);             // (1-2-1-1) check the conditions and claim active balance
            if (active_balance) {
                mark_reserved(new_cpu);    // mark new_cpu as CPU_RESERVED
                raw_spin_unlock(&migration_lock);
                ret = stop_one_cpu_nowait(prev_cpu,                  // (1-3) queue the active balance work on prev_cpu's stop-class thread
                        active_load_balance_cpu_stop, rq,            // (2-1) the work function is active_load_balance_cpu_stop
                        &rq->active_balance_work);
                if (!ret)
                    clear_reserved(new_cpu);    // queueing failed: clear the CPU_RESERVED mark on new_cpu
                else
                    wake_up_if_idle(new_cpu);   // queueing succeeded: if new_cpu is running the idle task, wake it with an IPI
                return;
            }
        }
        raw_spin_unlock(&migration_lock);
    }
}
(1-2-1-1) Check the conditions and claim active balance
int kick_active_balance(struct rq *rq, struct task_struct *p, int new_cpu)
{
    unsigned long flags;
    int rc = 0;

    /* Invoke active balance to force migrate currently running task */
    raw_spin_lock_irqsave(&rq->lock, flags);
    if (!rq->active_balance) {      // the rq is not already in active balance
        rq->active_balance = 1;     // set the flag
        rq->push_cpu = new_cpu;     // the push target is new_cpu
        get_task_struct(p);         // take a reference on p
        rq->wrq.push_task = p;      // the task to push is p (in practice, curr)
        rc = 1;
    }
    raw_spin_unlock_irqrestore(&rq->lock, flags);

    return rc;
}
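The active_balance flag acts as a claim: only one active balance per rq can be in flight, and the claim is released when the stopper function finishes. The sketch below models that lifecycle in user space. It is deliberately simplified: locking and task reference counting are omitted (the kernel holds rq->lock), and `struct kick_rq`, `struct kick_task`, `kick_model` and `stopper_done_model` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct kick_task { int pid; };

/* Hypothetical subset of struct rq used by the claim logic. */
struct kick_rq {
    int active_balance;          /* 1 while an active balance is in flight */
    int push_cpu;                /* destination cpu of the push */
    struct kick_task *push_task; /* task chosen to be pushed */
};

/* Model of kick_active_balance(): claim the rq if nobody else has. */
static int kick_model(struct kick_rq *rq, struct kick_task *p, int new_cpu)
{
    assert(rq && p);
    if (rq->active_balance)
        return 0;                /* someone already claimed this rq */
    rq->active_balance = 1;
    rq->push_cpu = new_cpu;
    rq->push_task = p;
    return 1;
}

/* Model of the out_unlock part of the stopper function: release the claim. */
static void stopper_done_model(struct kick_rq *rq)
{
    assert(rq);
    rq->active_balance = 0;
    rq->push_task = NULL;
}
```

A second kick on the same rq before the stopper has run returns 0 and leaves push_cpu and push_task untouched, which is why check_for_migration only queues the stop work when the return value is non-zero.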
(1-3) Queue the active_load_balance work on the stopper
/**
 * stop_one_cpu_nowait - stop a cpu but don't wait for completion
 * @cpu: cpu to stop
 * @fn: function to execute
 * @arg: argument to @fn
 * @work_buf: pointer to cpu_stop_work structure
 *
 * Similar to stop_one_cpu() but doesn't wait for completion. The
 * caller is responsible for ensuring @work_buf is currently unused
 * and will remain untouched until stopper starts executing @fn.
 *
 * CONTEXT:
 * Don't care.
 *
 * RETURNS:
 * true if cpu_stop_work was queued successfully and @fn will be called,
 * false otherwise.
 */
bool stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
                         struct cpu_stop_work *work_buf)
{
    *work_buf = (struct cpu_stop_work){ .fn = fn, .arg = arg, };    // store the work's fn (active_load_balance_cpu_stop) and arg (busiest)
    return cpu_stop_queue_work(cpu, work_buf);                      // (1-3-1) queue the work if the stopper is ready
}
(1-3-1) Check whether the stopper kernel thread has been initialized and is ready to run; if so, queue the work
/* queue @work to @stopper.  if offline, @work is completed immediately */
static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
{
    struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
    DEFINE_WAKE_Q(wakeq);
    unsigned long flags;
    bool enabled;

    preempt_disable();
    raw_spin_lock_irqsave(&stopper->lock, flags);
    enabled = stopper->enabled;    // set to true once the stopper has been initialized and unparked
    if (enabled)
        __cpu_stop_queue_work(stopper, work, &wakeq);    // (2-1) queue the work; its fn is active_load_balance_cpu_stop, so that function is eventually run with arg (busiest)
    else if (work->done)
        cpu_stop_signal_done(work->done);
    raw_spin_unlock_irqrestore(&stopper->lock, flags);

    wake_up_q(&wakeq);
    preempt_enable();

    return enabled;
}
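The queueing contract (the work is accepted only if the stopper is enabled, and the caller learns the outcome from the return value) can be sketched in user space as below. This is an illustrative model, not the kernel implementation: there is no locking or thread wake-up, and `struct stop_work_model`, `struct stopper_model`, `queue_stop_work`, `run_stopper_model` and `count_call` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*stop_fn_t)(void *arg);

struct stop_work_model {
    stop_fn_t fn;
    void *arg;
    struct stop_work_model *next;
};

/* Hypothetical per-CPU stopper: an enabled flag and a FIFO list of works. */
struct stopper_model {
    bool enabled;
    struct stop_work_model *head, *tail;
};

/* Model of stop_one_cpu_nowait(): fill in the work, queue it if possible. */
static bool queue_stop_work(struct stopper_model *s, struct stop_work_model *w,
                            stop_fn_t fn, void *arg)
{
    assert(s && w && fn);
    w->fn = fn;
    w->arg = arg;
    w->next = NULL;
    if (!s->enabled)
        return false;            /* stopper not ready: nothing will run */
    if (s->tail)
        s->tail->next = w;
    else
        s->head = w;
    s->tail = w;
    return true;
}

/* Model of the stopper thread body: run every queued work in order.
 * Returns the number of works executed. */
static int run_stopper_model(struct stopper_model *s)
{
    int ran = 0;

    assert(s);
    while (s->head) {
        struct stop_work_model *w = s->head;

        s->head = w->next;
        if (!s->head)
            s->tail = NULL;
        w->fn(w->arg);
        ran++;
    }
    return ran;
}

/* Example work function: counts how many times it was invoked. */
static int count_call(void *arg)
{
    (*(int *)arg)++;
    return 0;
}
```

This is the reason check_for_migration inspects the return value: when queueing is refused the work function never runs, so the caller has to undo the reservation on new_cpu itself.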
First, a brief look at how the cpu stop work is initialized and registered:
static struct smp_hotplug_thread cpu_stop_threads = {
    .store              = &cpu_stopper.thread,
    .thread_should_run  = cpu_stop_should_run,
    .thread_fn          = cpu_stopper_thread,
    .thread_comm        = "migration/%u",
    .create             = cpu_stop_create,
    .park               = cpu_stop_park,
    .selfparking        = true,
};

static int __init cpu_stop_init(void)
{
    unsigned int cpu;

    for_each_possible_cpu(cpu) {    // initialize the per_cpu cpu_stopper structure of every cpu
        struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);

        raw_spin_lock_init(&stopper->lock);
        INIT_LIST_HEAD(&stopper->works);
    }

    BUG_ON(smpboot_register_percpu_thread(&cpu_stop_threads));    // create the cpu stop thread for every online cpu
    stop_machine_unpark(raw_smp_processor_id());                  // unpark the stop thread of the current cpu
    stop_machine_initialized = true;
    return 0;
}
early_initcall(cpu_stop_init);
Create the cpu stop thread for every online cpu:
/**
 * smpboot_register_percpu_thread - Register a per_cpu thread related
 *                                  to hotplug
 * @plug_thread: Hotplug thread descriptor
 *
 * Creates and starts the threads on all online cpus.
 */
int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
{
    unsigned int cpu;
    int ret = 0;

    get_online_cpus();
    mutex_lock(&smpboot_threads_lock);
    for_each_online_cpu(cpu) {                              // walk every online cpu
        ret = __smpboot_create_thread(plug_thread, cpu);    // create the thread for this online cpu
        if (ret) {
            smpboot_destroy_threads(plug_thread);           // on error, destroy what was created
            goto out;
        }
        smpboot_unpark_thread(plug_thread, cpu);            // selfparking = true, so the kernel thread is not unparked here
    }
    list_add(&plug_thread->list, &hotplug_threads);         // add the descriptor to the hotplug_threads list; it covers all migration/%u kernel threads
out:
    mutex_unlock(&smpboot_threads_lock);
    put_online_cpus();
    return ret;
}
Create the kernel thread migration/%u, where %u is replaced by 0, 1, 2..., the number of the cpu the thread is bound to
static int
__smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
{
    struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);    // fetch cpu_stopper.thread
    struct smpboot_thread_data *td;

    if (tsk)
        return 0;

    td = kzalloc_node(sizeof(*td), GFP_KERNEL, cpu_to_node(cpu));    // allocate smpboot_thread_data (NUMA-aware allocation; the author's platform is UMA)
    if (!td)
        return -ENOMEM;
    td->cpu = cpu;    // record the cpu
    td->ht = ht;      // link to the smp_hotplug_thread descriptor

    tsk = kthread_create_on_cpu(smpboot_thread_fn, td, cpu,    // create the kernel thread on the given cpu, named "migration/%u" with %u replaced by the cpu number
                                ht->thread_comm);
    if (IS_ERR(tsk)) {
        kfree(td);
        return PTR_ERR(tsk);
    }
    /*
     * Park the thread so that it could start right on the CPU
     * when it is available.
     */
    kthread_park(tsk);                      // park the thread first; its fn is not run yet
    get_task_struct(tsk);                   // take a reference on the task_struct
    *per_cpu_ptr(ht->store, cpu) = tsk;     // store the thread in the per_cpu cpu_stopper.thread
    if (ht->create) {
        /*
         * Make sure that the task has actually scheduled out
         * into park position, before calling the create
         * callback. At least the migration thread callback
         * requires that the task is off the runqueue.
         */
        if (!wait_task_inactive(tsk, TASK_PARKED))    // wait until the task is parked
            WARN_ON(1);
        else
            ht->create(cpu);    // calls cpu_stop_create to set the priority, the stop class and the FIFO policy
    }
    return 0;
}
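cpu_stop_create is called to set the kernel thread's priority to 99 (100 - 1), its scheduling class to the stop class, and its policy to FIFO.
If there is an old stop task, it is switched to the rt class. (In other words, a cpu may have only one stop-class task; when a new stop task is installed, the old one is demoted to the rt class.)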
static void cpu_stop_create(unsigned int cpu)
{
    sched_set_stop_task(cpu, per_cpu(cpu_stopper.thread, cpu));    // the cpu_stopper.thread of this cpu, i.e. the migration/%u kernel thread above
}

void sched_set_stop_task(int cpu, struct task_struct *stop)
{
    struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };    // priority 100 - 1 = 99
    struct task_struct *old_stop = cpu_rq(cpu)->stop;                    // the old stop task

    if (stop) {
        /*
         * Make it appear like a SCHED_FIFO task, its something
         * userspace knows about and won't get confused about.
         *
         * Also, it will make PI more or less work without too
         * much confusion -- but then, stop work should not
         * rely on PI working anyway.
         */
        sched_setscheduler_nocheck(stop, SCHED_FIFO, &param);    // make the kernel thread SCHED_FIFO and set its priority

        stop->sched_class = &stop_sched_class;    // set the scheduling class to the stop class
    }

    cpu_rq(cpu)->stop = stop;    // install the new stop task

    if (old_stop) {
        /*
         * Reset it back to a normal scheduling class so that
         * it can die in pieces.
         */
        old_stop->sched_class = &rt_sched_class;    // demote the old stop task to the rt class
    }
}
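The "one stop task per cpu" rule described above can be modelled as follows. This is a simplified user-space sketch: `enum class_model`, `struct stop_task_model`, `struct stop_rq_model`, `MAX_RT_PRIO_MODEL` and `set_stop_task_model` are hypothetical names, and the real function additionally goes through sched_setscheduler_nocheck.

```c
#include <assert.h>
#include <stddef.h>

enum class_model { CLASS_FAIR, CLASS_RT, CLASS_STOP };

struct stop_task_model {
    enum class_model sched_class;
    int rt_priority;
};

/* Hypothetical subset of struct rq: only the stop task pointer. */
struct stop_rq_model {
    struct stop_task_model *stop;
};

#define MAX_RT_PRIO_MODEL 100

/*
 * Model of sched_set_stop_task(): install the new stop task (if any) with
 * priority MAX_RT_PRIO - 1 and demote the previous one to the rt class.
 */
static void set_stop_task_model(struct stop_rq_model *rq, struct stop_task_model *stop)
{
    struct stop_task_model *old;

    assert(rq);
    old = rq->stop;
    if (stop) {
        stop->rt_priority = MAX_RT_PRIO_MODEL - 1;
        stop->sched_class = CLASS_STOP;
    }
    rq->stop = stop;
    if (old)
        old->sched_class = CLASS_RT;
}
```

Installing a second task as the stop task leaves exactly one task in the stop class: the previous holder ends up in the rt class, and passing NULL removes the stop task altogether.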
Unpark the cpu_stopper thread that belongs to the given cpu id.
void stop_machine_unpark(int cpu)
{
    struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);

    stopper->enabled = true;             // set the enabled flag
    kthread_unpark(stopper->thread);     // unpark the thread
}
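On a multi-core system, cpu_stop_init should unpark the kernel thread of only one cpu (the cpu it runs on). The threads of the other cpus are unparked when those cpus are brought up, through the call path below, ending in cpuhp_online_idle.
Call path: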
secondary_startup
-> __secondary_switched
-> secondary_start_kernel
-> cpu_startup_entry
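-> cpuhp_online_idle, which calls stop_machine_unpark to unpark the thread
2. The active balance operation
(2-1) The stopper runs the queued work, whose fn is active_load_balance_cpu_stop, so that function is what finally executes. The argument passed in is busiest (or the rq that has the misfit task)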
/*
 * active_load_balance_cpu_stop is run by the CPU stopper. It pushes
 * running tasks off the busiest CPU onto idle CPUs. It requires at
 * least 1 task to be running on each physical CPU where possible, and
 * avoids physical / logical imbalances.
 */
int active_load_balance_cpu_stop(void *data)
{
    struct rq *busiest_rq = data;
    int busiest_cpu = cpu_of(busiest_rq);
    int target_cpu = busiest_rq->push_cpu;        // destination cpu of the push; usually the cpu that ran load_balance (this_cpu)
    struct rq *target_rq = cpu_rq(target_cpu);    // rq of the push cpu
    struct sched_domain *sd = NULL;
    struct task_struct *p = NULL;
    struct rq_flags rf;
#ifdef CONFIG_SCHED_WALT
    struct task_struct *push_task;
    int push_task_detached = 0;
    struct lb_env env = {
        .sd         = sd,                 // sd is still NULL here
        .dst_cpu    = target_cpu,         // target_cpu is the push cpu
        .dst_rq     = target_rq,
        .src_cpu    = busiest_rq->cpu,    // src_cpu is the busiest cpu
        .src_rq     = busiest_rq,
        .idle       = CPU_IDLE,
        .flags      = 0,
        .loop       = 0,
    };
#endif

    rq_lock_irq(busiest_rq, &rf);
    /*
     * Between queueing the stop-work and running it is a hole in which
     * CPUs can become inactive. We should not move tasks from or to
     * inactive CPUs.
     */
    if (!cpu_active(busiest_cpu) || !cpu_active(target_cpu))    // if either the src or the target cpu is inactive, do not migrate
        goto out_unlock;

    /* Make sure the requested CPU hasn't gone down in the meantime: */
    if (unlikely(busiest_cpu != smp_processor_id() ||    // the stopper must be running on the busiest cpu, and the active_balance flag must still be set
                 !busiest_rq->active_balance))
        goto out_unlock;

    /* Is there any task to move? */
    if (busiest_rq->nr_running <= 1)    // no spare task left on busiest (count <= 1)
        goto out_unlock;

    /*
     * This condition is "impossible", if it occurs
     * we need to fix it. Originally reported by
     * Bjorn Helgaas on a 128-CPU setup.
     */
    BUG_ON(busiest_rq == target_rq);    // abnormal case: target and src are the same rq

#ifdef CONFIG_SCHED_WALT
    push_task = busiest_rq->wrq.push_task;    // use the push task recorded in the WALT rq
    target_cpu = busiest_rq->push_cpu;        // target_cpu is read again; same value as above
    if (push_task) {                                  // the WALT rq has a push task recorded
        if (task_on_rq_queued(push_task) &&           // push task is queued on an rq (on_rq == 1)
            push_task->state == TASK_RUNNING &&       // and is in the running state
            task_cpu(push_task) == busiest_cpu &&     // and is on the busiest cpu
            cpu_online(target_cpu)) {                 // and target_cpu is online
            update_rq_clock(busiest_rq);              // update the busiest rq clock
            detach_task(push_task, &env);             // detach the push task from the busiest rq; see the previous article, Load Balance - periodic load balance
            push_task_detached = 1;                   // remember that the push task has been detached
        }
        goto out_unlock;
    }
#endif
    // reaching this point means the WALT rq has no push task recorded

    /* Search for an sd spanning us and the target CPU. */
    rcu_read_lock();
    for_each_domain(target_cpu, sd) {
        if ((sd->flags & SD_LOAD_BALANCE) &&
            cpumask_test_cpu(busiest_cpu, sched_domain_span(sd)))
            break;
    }

    if (likely(sd)) {    // a sched domain spanning both cpus was found
        struct lb_env env = {
            .sd         = sd,
            .dst_cpu    = target_cpu,
            .dst_rq     = target_rq,
            .src_cpu    = busiest_rq->cpu,
            .src_rq     = busiest_rq,
            .idle       = CPU_IDLE,
            /*
             * can_migrate_task() doesn't need to compute new_dst_cpu
             * for active balancing. Since we have CPU_IDLE, but no
             * @dst_grpmask we need to make that test go away with lying
             * about DST_PINNED.
             */
            .flags      = LBF_DST_PINNED,
        };

        schedstat_inc(sd->alb_count);    // count one more active load balance
        update_rq_clock(busiest_rq);     // update the busiest rq clock

        p = detach_one_task(&env);       // (2-1-1) find a suitable task p on the busiest rq and detach it
        if (p) {
            schedstat_inc(sd->alb_pushed);    // count one more pushed task
            /* Active balancing done, reset the failure counter. */
            sd->nr_balance_failed = 0;        // clear the load balance failure count
        } else {
            schedstat_inc(sd->alb_failed);    // nothing was detached: count one more active load balance failure
        }
    }
    rcu_read_unlock();
out_unlock:
    busiest_rq->active_balance = 0;    // clear the active balance flag here: the detach step is what the flag protects, while attaching to the dst rq needs no such exclusion
#ifdef CONFIG_SCHED_WALT
    push_task = busiest_rq->wrq.push_task;    // fetch the push task recorded in the WALT rq
#endif
    target_cpu = busiest_rq->push_cpu;        // fetch the push cpu as target cpu
    clear_reserved(target_cpu);               // clear the CPU_RESERVED mark on the target cpu

#ifdef CONFIG_SCHED_WALT
    if (push_task)
        busiest_rq->wrq.push_task = NULL;     // clear the WALT rq's push task
#endif
    rq_unlock(busiest_rq, &rf);

#ifdef CONFIG_SCHED_WALT
    if (push_task) {                                  // the WALT push task path was taken
        if (push_task_detached)                       // and the task was detached above
            attach_one_task(target_rq, push_task);    // (2-1-2) attach the push task to the target cpu's rq
        put_task_struct(push_task);
    }
#endif
    if (p)                                // the mainline path picked a task
        attach_one_task(target_rq, p);    // attach it to the target cpu's rq as well

    local_irq_enable();

    return 0;
}
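Because the stop work runs some time after it was queued, the function revalidates its assumptions before touching any task. The sketch below condenses those early exits into one predicate. It is a user-space illustration only: `struct alb_state` and `alb_should_proceed` are hypothetical names, and the BUG_ON case is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the state the stopper function inspects. */
struct alb_state {
    bool busiest_cpu_active;   /* cpu_active(busiest_cpu) */
    bool target_cpu_active;    /* cpu_active(target_cpu) */
    int busiest_cpu;           /* cpu that owns the busiest rq */
    int running_cpu;           /* cpu the stopper is executing on */
    bool active_balance;       /* busiest_rq->active_balance */
    unsigned int nr_running;   /* busiest_rq->nr_running */
};

/* True if none of the early "goto out_unlock" exits would be taken. */
static bool alb_should_proceed(const struct alb_state *s)
{
    assert(s);
    if (!s->busiest_cpu_active || !s->target_cpu_active)
        return false;          /* never move tasks from or to inactive cpus */
    if (s->busiest_cpu != s->running_cpu || !s->active_balance)
        return false;          /* wrong cpu, or the request is no longer valid */
    if (s->nr_running <= 1)
        return false;          /* nothing to spare on the busiest rq */
    return true;
}
```

Every one of these exits still falls through to out_unlock in the real function, so the active_balance flag and the reservation on the target cpu are released even when nothing is migrated.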
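(2-1-1) Find a suitable task p on the busiest rq and detach it.
For the flow of can_migrate_task, see the previous article: Load Balance - periodic load balance. Because env is set up with LBF_DST_PINNED in the function above, can_migrate_task returns immediately when affinity (for example a cpuset restriction) forbids moving task p to the dst cpu. Compared with periodic load balance, this skips the step of picking a new_dst_cpu after an affinity failure (which explains why env.flags = LBF_DST_PINNED is set above).
The flow of detach_task is also covered in the previous article and is not repeated here.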
/*
 * detach_one_task() -- tries to dequeue exactly one task from env->src_rq, as
 * part of active balancing operations within "domain".
 *
 * Returns a task if successful and NULL otherwise.
 */
static struct task_struct *detach_one_task(struct lb_env *env)
{
    struct task_struct *p;

    lockdep_assert_held(&env->src_rq->lock);

    list_for_each_entry_reverse(p,    // walk the src rq's cfs_tasks list from the tail looking for task p
            &env->src_rq->cfs_tasks, se.group_node) {
        if (!can_migrate_task(p, env))    // can task p be migrated? see the previous article: Load Balance - periodic load balance
            continue;

        detach_task(p, env);    // a suitable task p was found, so detach it

        /*
         * Right now, this is only the second place where
         * lb_gained[env->idle] is updated (other is detach_tasks)
         * so we can safely collect stats here rather than
         * inside detach_tasks().
         */
        schedstat_inc(env->sd->lb_gained[env->idle]);
        return p;
    }
    return NULL;
}
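The selection loop in detach_one_task boils down to "scan from the tail and take the first task that may move". Below is a user-space sketch of that idea under simplifying assumptions: the list is an array, affinity is a plain bitmask, and only two of the can_migrate_task checks (affinity and "currently running") are modelled. `struct cand_task` and `pick_one_task` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task on the source runqueue. */
struct cand_task {
    int pid;
    unsigned long cpus_allowed; /* bit n set: task may run on cpu n */
    int running;                /* non-zero if it is executing right now */
};

/*
 * Scan the candidates from the tail, as list_for_each_entry_reverse does,
 * and return the first one that is allowed on dst_cpu and not running.
 * Returns NULL if no task qualifies.
 */
static struct cand_task *pick_one_task(struct cand_task *tasks, size_t n, int dst_cpu)
{
    assert(dst_cpu >= 0 && dst_cpu < (int)(sizeof(unsigned long) * 8));

    for (size_t i = n; i-- > 0; ) {
        struct cand_task *p = &tasks[i];

        if (!(p->cpus_allowed & (1UL << dst_cpu)))
            continue;   /* affinity forbids the destination cpu */
        if (p->running)
            continue;   /* a task that is on the cpu cannot be detached */
        return p;
    }
    return NULL;
}
```

Only one task is ever returned, which matches the purpose of active balance: push a single task to the target cpu rather than rebalance a whole batch.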
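(2-1-2) Attach the push task to the rq of the target cpu.
attach_one_task differs from attach_task only in that it takes the rq lock, updates the rq clock before the attach, and updates the rq's overutilized status afterwards.
The flows of update_overutilized_status and attach_task are covered in the previous article and are not repeated here.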
/*
 * attach_one_task() -- attaches the task returned from detach_one_task() to
 * its new rq.
 */
static void attach_one_task(struct rq *rq, struct task_struct *p)
{
    struct rq_flags rf;

    rq_lock(rq, &rf);
    update_rq_clock(rq);               // update the rq clock
    attach_task(rq, p);                // attach task p to the rq
    update_overutilized_status(rq);    // after the migration, refresh the rq's overutilized status
    rq_unlock(rq, &rf);
}
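3. Summary
Active balance is reached through only two paths: one is triggered after load balance has failed several times (path A); the other is checked and triggered from the scheduler tick on an rq that has a misfit task (path B). Active balance itself relies on stop work executed by each cpu's stop-class kernel thread (named migration/%u). Once triggered, a task to push is selected:
Path A picks a suitable cfs task from the busiest rq and migrates it.
Path B directly migrates the curr task of the rq that has the misfit task.
The push task is then detached and attached to the target rq, which completes the stop work.
In addition, with the WALT rotation mechanism, a misfit task can be handled by directly swapping the curr tasks of two rqs (without going through active balance). This speeds up load balancing and is also a more conservative way to balance misfit tasks. (My guess is that migrating a misfit task outright might overwhelm the target cpu that receives it, hence the swap of curr tasks. It would be best to collect data and compare whether pushing the misfit task directly or WALT rotation fits the workload better.)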