Starting from fork(...)

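The process is the basic unit of a running program, and on a Unix-like system every process, however exotic, descends from a fork. After Adam and Eve, humans multiplied and society appeared; after fork multiplied, we got Windows, or Linux, or the iPhone 5 in your hand: dual SIM, big screen, extra-long standby, and the stock flashy ringtone, "Love for Sale" (《爱情买卖》).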

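Strictly speaking, fork is not an ordinary C function but a system call; the fork() you call from C is a thin library wrapper around it. C code normally lives in user space and handles things like arithmetic on its own. Harder jobs, such as obtaining memory from the system or starting another process, are beyond what user-level C can do by itself, and you probably would not know how to write such a function anyway. Every operating system has its own standards and its own API, so forking a process is never the same code everywhere. When C cannot do something on its own, it knows its limits and asks the system (the kernel) to handle it; the kernel does the work and hands the result back to the C program. That is the essence of the cooperation.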

Creating a process

#include <unistd.h>
pid_t fork(void);
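
As a minimal user-space sketch of the prototype above: fork() returns twice, 0 in the child and the child's pid in the parent. The helper name run_child_and_wait is made up for illustration.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `code`; return the exit status the parent
 * collects, or -1 if anything fails. */
int run_child_and_wait(int code)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;          /* fork failed, no child was created */
    if (pid == 0)
        _exit(code);        /* child: fork() returned 0 */

    /* parent: fork() returned the child's pid */
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_child_and_wait(7) from a main() should hand back 7, the value the child passed to _exit().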

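How a system call proceeds

--> The application calls the function, here pid_t fork(void) from above.

--> The wrapper routine in libc passes the system call number to the kernel.

--> The system call handler receives the number and looks up the address of the matching service routine in sys_call_table.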

/* 0 */     CALL(sys_restart_syscall)
                CALL(sys_exit)
                CALL(sys_fork_wrapper)    //-->
                CALL(sys_read)
                CALL(sys_write)

/*-------------------------------------------------------------*/


sys_fork_wrapper:
        add r0, sp, #S_OFF
        b   sys_fork            @ branch to sys_fork

--> The service routine is the function that does the real work of the system call, in this case sys_fork().

asmlinkage int sys_fork(struct pt_regs *regs)
{
#ifdef CONFIG_MMU
    return do_fork(SIGCHLD, regs->ARM_sp, regs, 0, NULL, NULL);
#else
    /* can not support in nommu mode */
    return(-EINVAL);
#endif
}
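
The table lookup in the steps above can be pictured with a toy dispatcher. This is only a sketch: every toy_ name is hypothetical, the "service routines" are stubs, and the real kernel does this lookup in assembly.

```c
#include <errno.h>

/* Hypothetical stand-ins for kernel service routines. */
typedef long (*toy_syscall_fn)(long arg);

static long toy_restart_syscall(long arg) { (void)arg; return 0; }
static long toy_exit(long arg)            { return arg; }
static long toy_fork(long arg)            { (void)arg; return 1234; /* pretend child pid */ }

/* Indexed by system call number, in the spirit of sys_call_table. */
static const toy_syscall_fn toy_call_table[] = {
    toy_restart_syscall,   /* 0 */
    toy_exit,              /* 1 */
    toy_fork,              /* 2 */
};

#define TOY_NR_SYSCALLS (sizeof(toy_call_table) / sizeof(toy_call_table[0]))

/* Look up the service routine by number and call it. */
long toy_dispatch(unsigned long nr, long arg)
{
    if (nr >= TOY_NR_SYSCALLS)
        return -ENOSYS;     /* unknown system call number */
    return toy_call_table[nr](arg);
}
```

Here toy_dispatch(2, 0) reaches toy_fork, just as number 2 reaches sys_fork_wrapper in the excerpt.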

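"Kernel mode" and "user mode"

The system call path introduces two concepts, "kernel mode" and "user mode". It is the same CPU and the same memory, so where does the split into two modes come from?
It comes from the processor hardware. Taking ARM as the concrete case, the processor itself has several modes: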

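Six privileged modes

- abort mode
- interrupt request (IRQ) mode
- fast interrupt request (FIQ) mode
- supervisor mode
- system mode
- undefined mode

One unprivileged mode

- user mode

What the modes mean

- A failed memory access puts the processor in abort mode.
- When the processor takes an interrupt, it enters IRQ mode or FIQ mode.
- A reset, or a software interrupt instruction, puts the processor in supervisor mode; this is the mode the kernel usually runs in.
- Ordinary code outside the kernel runs in user mode. System mode shares its registers with user mode but is privileged.
- An illegal or unrecognized instruction puts the processor in undefined mode.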

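Registers per mode

With this many modes there has to be a register that records the current one.
The ARM processor has 37 registers. In each mode some registers are visible and others are hidden, so every mode has a few registers of its own, while the rest are shared.

The cpsr register, the current program status register, deserves a formal introduction. It is 32 bits wide and its low five bits encode the mode. For example, 10011 means supervisor mode.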
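
A small sketch of reading the mode out of a cpsr value. The EX_ constants copy the 0x1f mask and the user/supervisor encodings; the function names are made up for illustration.

```c
#include <stdint.h>

#define EX_MODE_MASK 0x1fu   /* low five bits of cpsr */
#define EX_USR_MODE  0x10u   /* 10000: user mode */
#define EX_SVC_MODE  0x13u   /* 10011: supervisor mode */

/* Extract the mode field from a cpsr value. */
uint32_t arm_mode(uint32_t cpsr)
{
    return cpsr & EX_MODE_MASK;
}

/* Every mode except user mode is privileged. */
int arm_is_privileged(uint32_t cpsr)
{
    return arm_mode(cpsr) != EX_USR_MODE;
}
```

For a value such as 0x60000013 the flag bits at the top are masked off and what remains is 0x13, supervisor mode.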

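To see why the modes are separated, take user mode and supervisor mode as the example.
In user mode, that is, outside the kernel, we can access our own address space but are never allowed to touch kernel code. So what happens if we point a pointer into the 3G to 4G range?

The processor receives the fetch request and checks the current mode: this mode has no permission to access that address range. Kernel code is thus protected by a ban enforced in hardware, which keeps kernel space safe. However formidable the hacker, wrecking the kernel fortress from user space is futile; one can only wait for the fortress to collapse from within.
For a user process to enter kernel mode, the processor has to switch from user mode to supervisor mode. User-mode code cannot write the mode bits of cpsr itself. Instead it executes a software interrupt instruction (SWI, called SVC on newer cores); the hardware saves the old cpsr in the supervisor-mode spsr, sets the low five bits of cpsr to 10011, and jumps to the kernel's exception vector. From then on the code running is kernel code, which may access the 3G to 4G range. (System mode, for the record, shares its registers with user mode but is privileged.)

That is enough background to carry on. After the system call path above we are finally in kernel mode, that is, the low five bits of the ARM cpsr are 10011, and do_fork starts to run.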

At the processor level, then, a system call is how a program gets into kernel mode.

Creating the child process

As its comment says, do_fork is the function where the details of process creation play out.

/*
 *  Ok, this is the main fork-routine.
 *
 * It copies the process, and if successful kick-starts
 * it and waits for it to finish using the VM if required.
 */

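Parameters

The parameters are as follows.

clone_flags:
The low eight bits hold the number of the signal sent to the parent when the child exits; the remaining bits are the clone flags listed here.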

#define CSIGNAL                 0x000000ff  /* signal mask to be sent at exit */
#define CLONE_VM                0x00000100  /* set if VM shared between processes */
#define CLONE_FS                0x00000200  /* set if fs info shared between processes */
#define CLONE_FILES             0x00000400  /* set if open files shared between processes */
#define CLONE_SIGHAND           0x00000800  /* set if signal handlers and blocked signals shared */
#define CLONE_PTRACE            0x00002000  /* set if we want to let tracing continue on the child too */
#define CLONE_VFORK             0x00004000  /* set if the parent wants the child to wake it up on mm_release */
#define CLONE_PARENT            0x00008000  /* set if we want to have the same parent as the cloner */
#define CLONE_THREAD            0x00010000  /* Same thread group? */
#define CLONE_NEWNS             0x00020000  /* New namespace group? */
#define CLONE_SYSVSEM           0x00040000  /* share system V SEM_UNDO semantics */
#define CLONE_SETTLS            0x00080000  /* create a new TLS for the child */
#define CLONE_PARENT_SETTID     0x00100000  /* set the TID in the parent */
#define CLONE_CHILD_CLEARTID    0x00200000  /* clear the TID in the child */
#define CLONE_DETACHED          0x00400000  /* Unused, ignored */
#define CLONE_UNTRACED          0x00800000  /* set if the tracing process can't force CLONE_PTRACE on this clone */
#define CLONE_CHILD_SETTID      0x01000000  /* set the TID in the child */
#define CLONE_NEWUTS            0x04000000  /* New utsname group? */
#define CLONE_NEWIPC            0x08000000  /* New ipcs */
#define CLONE_NEWUSER           0x10000000  /* New user namespace */
#define CLONE_NEWPID            0x20000000  /* New pid namespace */
#define CLONE_NEWNET            0x40000000  /* New network namespace */
#define CLONE_IO                0x80000000  /* Clone io context */
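
To make the split between the low eight bits and the flag bits concrete, here is a small sketch. The EX_ constants copy values from the list above under different names (so they cannot clash with system headers), and the function names are made up.

```c
#include <signal.h>

#define EX_CSIGNAL     0x000000ffUL  /* same value as CSIGNAL above */
#define EX_CLONE_VM    0x00000100UL  /* same value as CLONE_VM above */
#define EX_CLONE_FILES 0x00000400UL  /* same value as CLONE_FILES above */

/* Signal the child will send to its parent on exit (low eight bits). */
int clone_exit_signal(unsigned long clone_flags)
{
    return (int)(clone_flags & EX_CSIGNAL);
}

/* Whether parent and child share one address space. */
int clone_shares_vm(unsigned long clone_flags)
{
    return (clone_flags & EX_CLONE_VM) != 0;
}
```

sys_fork passes plain SIGCHLD, so nothing is shared and the parent gets SIGCHLD when the child exits; kernel_thread ORs in CLONE_VM, so the address space is shared.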

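stack_start:
The user-mode stack pointer to give the child.

stack_size:
Unused.

parent_tidptr:
Address of a user-space variable in the parent (used with CLONE_PARENT_SETTID).

child_tidptr:
Address of a user-space variable in the child (used with CLONE_CHILD_SETTID and CLONE_CHILD_CLEARTID).

do_fork is not hard to follow: it builds the child, inserts it into the scheduler's run queue, and the child then waits to be scheduled, gets its share of CPU time, and finally runs.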

long do_fork(unsigned long clone_flags,
             unsigned long stack_start,
             struct pt_regs *regs,
             unsigned long stack_size,
             int __user *parent_tidptr,  //
             int __user *child_tidptr)  //
{
    struct task_struct *p;
    int trace = 0; 
    long nr;

    if (clone_flags & CLONE_NEWUSER) {
        if (clone_flags & CLONE_THREAD)
            return -EINVAL;

        if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SETUID) ||
                !capable(CAP_SETGID))
            return -EPERM;
    }

    if (likely(user_mode(regs)))
        trace = tracehook_prepare_clone(clone_flags);

    p = copy_process(clone_flags, stack_start, regs, stack_size,
                     child_tidptr, NULL, trace);


    if (!IS_ERR(p))
    {
        ... ...

        wake_up_new_task(p);    //-->

        ... ...
    }

    ... ...

    return nr;
}

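There are two main steps:

1. copy_process builds the child and gives it flesh and blood;
2. the child's address is handed to wake_up_new_task, which gets it ready to run.

wake_up_new_task: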

void wake_up_new_task(struct task_struct *p)
{
    ... ...

    rq = __task_rq_lock(p);
    activate_task(rq, p, 0);    //-->
    p->on_rq = 1;

    ... ...
}

activate_task:

static void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
    if (task_contributes_to_load(p))
        rq->nr_uninterruptible--;

    enqueue_task(rq, p, flags); // add the task to the run queue
    inc_nr_running(rq);
}
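
The "enqueue, then count it as runnable" step can be sketched with a toy queue. This is a deliberate simplification with made-up toy_ names: a plain FIFO list stands in for the real per-class run queue, which CFS keeps ordered rather than first-in first-out.

```c
#include <stddef.h>

struct toy_task {
    int pid;
    struct toy_task *next;
};

struct toy_rq {
    struct toy_task *head, *tail;
    unsigned long nr_running;   /* number of runnable tasks queued */
};

void toy_rq_init(struct toy_rq *rq)
{
    rq->head = rq->tail = NULL;
    rq->nr_running = 0;
}

/* Append the task and bump the runnable count, like enqueue + inc_nr_running. */
void toy_activate_task(struct toy_rq *rq, struct toy_task *p)
{
    p->next = NULL;
    if (rq->tail)
        rq->tail->next = p;
    else
        rq->head = p;
    rq->tail = p;
    rq->nr_running++;
}

/* Take the task at the front of the queue, or NULL if the queue is empty. */
struct toy_task *toy_pick_next(struct toy_rq *rq)
{
    struct toy_task *p = rq->head;

    if (!p)
        return NULL;
    rq->head = p->next;
    if (!rq->head)
        rq->tail = NULL;
    rq->nr_running--;
    return p;
}
```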

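Managing the children

The Linux world has no family planning, which leads to countless child processes. They have to be managed, and how do you manage them? Make them queue up.

rq: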

/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct rq rq;

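That is the rough outline of fork. Now let us dig a little deeper.

To create a process you need to know how a process is put together. Bones first, flesh second: once the skeleton of the process is standing, the rest is just filling in organs at the right places.

Here are the important parts of copy_process.

(1) Set up the key structures of the process: the process descriptor (task_struct) and thread_info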

    p = dup_task_struct(current);
    if (!p)
        goto fork_out;

static struct task_struct *dup_task_struct(struct task_struct *orig)
{

    ... ...

    int node = tsk_fork_get_node(orig);
    int err;

    prepare_to_copy(orig);

    tsk = alloc_task_struct_node(node);  //struct task_struct
    if (!tsk)
        return NULL;

    ti = alloc_thread_info_node(tsk, node); //struct thread_info
    if (!ti) {
        free_task_struct(tsk);
        return NULL;
    }

    err = arch_dup_task_struct(tsk, orig);
    if (err)
        goto out;

    tsk->stack = ti;

    ... ...

    setup_thread_stack(tsk, orig);

    ... ...

}

(2) Initialize the scheduling-related fields.

    sched_fork(p);

void sched_fork(struct task_struct *p)
{
    unsigned long flags;
    int cpu = get_cpu();

    __sched_fork(p);    // initialize the task's scheduling entity, struct sched_entity

    /*
     * We mark the process as running here. This guarantees that
     * nobody will actually run it, and a signal or other external
     * event cannot wake it up and insert it on the runqueue either.
     */
    p->state = TASK_RUNNING;

    /*
     * Revert to default priority/policy on fork if requested.
     */
    if (unlikely(p->sched_reset_on_fork)) {
        if (p->policy == SCHED_FIFO || p->policy == SCHED_RR) {
            p->policy = SCHED_NORMAL;
            p->normal_prio = p->static_prio;
        }

        if (PRIO_TO_NICE(p->static_prio) < 0) {
            p->static_prio = NICE_TO_PRIO(0);
            p->normal_prio = p->static_prio;
            set_load_weight(p);
        }

        /*
         * We don't need the reset flag anymore after the fork. It has
         * fulfilled its duty:
         */
        p->sched_reset_on_fork = 0;
    }

    /*
     * Make sure we do not leak PI boosting priority to the child.
     */
    p->prio = current->normal_prio;

    if (!rt_prio(p->prio))
        p->sched_class = &fair_sched_class;    // scheduling class: CFS, the Completely Fair Scheduler

    if (p->sched_class->task_fork)
        p->sched_class->task_fork(p);        

    /*
     * The child is not yet in the pid-hash so no cgroup attach races,
     * and the cgroup is pinned to this child due to cgroup_fork()
     * is ran before sched_fork().
     *
     * Silence PROVE_RCU.
     */
    raw_spin_lock_irqsave(&p->pi_lock, flags);
    set_task_cpu(p, cpu);
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);

#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
    if (likely(sched_info_on()))
        memset(&p->sched_info, 0, sizeof(p->sched_info));
#endif
#if defined(CONFIG_SMP)
    p->on_cpu = 0;
#endif
#ifdef CONFIG_PREEMPT
    /* Want to start with kernel preemption disabled. */
    task_thread_info(p)->preempt_count = 1;
#endif
#ifdef CONFIG_SMP
    plist_node_init(&p->pushable_tasks, MAX_PRIO);
#endif

    put_cpu();
}
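
sched_fork leans on a few priority helpers (PRIO_TO_NICE, NICE_TO_PRIO, rt_prio). The sketch below restates them as plain functions, assuming the usual layout in which priorities 0 to 99 are real-time and 100 to 139 map to nice -20 to 19; the EX_ and ex_ names are illustrative.

```c
#define EX_MAX_RT_PRIO 100                    /* priorities below this are real-time */
#define EX_MAX_PRIO    (EX_MAX_RT_PRIO + 40)  /* 140 priority levels in total */

/* nice -20..19 maps to static priority 100..139. */
int ex_nice_to_prio(int nice)
{
    return EX_MAX_RT_PRIO + nice + 20;
}

int ex_prio_to_nice(int prio)
{
    return prio - EX_MAX_RT_PRIO - 20;
}

/* True for real-time priorities. */
int ex_rt_prio(int prio)
{
    return prio < EX_MAX_RT_PRIO;
}
```

Under that assumption nice 0 is priority 120, the same value as MAX_PRIO-20, which is what INIT_TASK gives process 0.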

(3) Set the initial register values of the child, including the location of its kernel stack.

    retval = copy_thread(clone_flags, stack_start, stack_size, p, regs);
    if (retval)
        goto bad_fork_cleanup_io;

int copy_thread(unsigned long clone_flags, unsigned long stack_start,
                unsigned long stk_sz,      struct task_struct *p,     struct pt_regs *regs)
{
    struct thread_info *thread = task_thread_info(p);
    struct pt_regs *childregs = task_pt_regs(p);

/*
 struct pt_regs {
     unsigned long uregs[18];
 };
 
 #define ARM_cpsr    uregs[16]
 #define ARM_pc      uregs[15]
 #define ARM_lr      uregs[14]
 #define ARM_sp      uregs[13]
 #define ARM_ip      uregs[12]
 #define ARM_fp      uregs[11]
 #define ARM_r10     uregs[10]
 #define ARM_r9      uregs[9]
 #define ARM_r8      uregs[8]
 #define ARM_r7      uregs[7]
 #define ARM_r6      uregs[6]
 #define ARM_r5      uregs[5]
 #define ARM_r4      uregs[4]
 #define ARM_r3      uregs[3]
 #define ARM_r2      uregs[2]
 #define ARM_r1      uregs[1]
 #define ARM_r0      uregs[0]
 #define ARM_ORIG_r0 uregs[17]
*/

    *childregs = *regs;
    childregs->ARM_r0 = 0;
    childregs->ARM_sp = stack_start;


/*
 struct cpu_context_save {       
     __u32   r4;     
     __u32   r5;
     __u32   r6;     
     __u32   r7;     
     __u32   r8;
     __u32   r9;
     __u32   sl;
     __u32   fp;
     __u32   sp;
     __u32   pc;
     __u32   extra[2];       // Xscale 'acc' register, etc 
 };
*/

    memset(&thread->cpu_context, 0, sizeof(struct cpu_context_save));
    thread->cpu_context.sp = (unsigned long)childregs;
    thread->cpu_context.pc = (unsigned long)ret_from_fork;

    clear_ptrace_hw_breakpoint(p);

    if (clone_flags & CLONE_SETTLS)
        thread->tp_value = regs->ARM_r3;

    thread_notify(THREAD_NOTIFY_COPY, thread);

    return 0;
}
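
The two assignments after *childregs = *regs are why fork() returns 0 in the child: the child starts from a copy of the parent's saved registers with r0, the return-value register, cleared. A toy version, with a made-up three-register struct in place of pt_regs:

```c
/* Toy register frame; the real pt_regs holds 18 words. */
struct toy_regs {
    unsigned long r0;   /* return value of the system call */
    unsigned long sp;   /* user stack pointer */
    unsigned long pc;   /* where user code resumes */
};

/* Give the child a copy of the parent's frame, then patch r0 and sp. */
void toy_copy_thread(struct toy_regs *child, const struct toy_regs *parent,
                     unsigned long stack_start)
{
    *child = *parent;       /* resume at the same pc as the parent */
    child->r0 = 0;          /* the child sees fork() return 0 */
    child->sp = stack_start;
}
```

The parent's own frame is untouched; its r0 is filled in later with the child's pid.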

(4) Insert the pid into the hlist.

    attach_pid(p, PIDTYPE_PID, pid);
    nr_threads++;

void attach_pid(struct task_struct *task, enum pid_type type,
                struct pid *pid)
{
    struct pid_link *link;
    
    link = &task->pids[type];
    link->pid = pid;
    hlist_add_head_rcu(&link->node, &pid->tasks[type]);
}




struct pid_link
{   
    struct hlist_node node;
    struct pid *pid;
};




struct pid
{
    atomic_t count;
    unsigned int level;

    struct hlist_head tasks[PIDTYPE_MAX];
    struct rcu_head rcu;
    struct upid numbers[1];
};

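That is roughly how a (user-mode) process is created. It also shows the copy-on-write idea at work: at creation the child inherits most of the parent's resources rather than copying them, which keeps process creation lightweight.

When you first study operating systems you meet only "process" and "thread", two plain and harmless words. Then a real operating system turns up "kernel thread", "light-weight process", "user thread" and "LWP".

Everyone knows that first impressions of people matter; first impressions of concepts matter just as much. It is not simply that a process is the big one and a thread is the small one, as if big and small were enough to classify every kind of execution unit in the world. Some things need a closer look before the reason they exist becomes clear.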

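1. Kinds of threads

Kernel threads

First, on the question of "kernel processes", here is an answer quoted from the CSDN forum:

There is no "kernel process". A "kernel thread" is itself a special kind of process: it runs only in kernel space, so it has no user "virtual address space" associated with it and is never switched to user space to run. Like ordinary processes, though, kernel threads are schedulable and preemptible, which sets them apart from interrupt handlers.
Linux generally uses kernel threads for special jobs, for example the pdflush kernel thread that writes back the page cache.
Also, everything schedulable in the Linux kernel has a thread_info and a task_struct. Threads within one process differ from a process only in that they share some resources, such as the address space (their mm_struct members point to the same place). So if you insist that kernel threads ought to be called "kernel processes", there is nothing wrong with that, but it turns into a word game. The official name, after all, is "kernel thread".

User-level "light-weight process" (LWP)

A light-weight process (LWP) is a user thread supported by the kernel. It is a higher-level abstraction built on kernel threads, so a system can have LWPs only if it supports kernel threads first.

Each process has one or more LWPs, and each LWP is backed by one kernel thread. This is the one-to-one threading model described in the "dinosaur book". In an operating system implemented this way, the LWP is the user thread.
Because each LWP is tied to a particular kernel thread, each LWP is an independent scheduling unit. Even if one LWP blocks in a system call, the rest of the process keeps running.
LWPs have limitations. First, most LWP operations, such as creation, destruction and synchronization, require system calls, and system calls are relatively expensive because they switch between user mode and kernel mode. Second, each LWP needs a kernel thread behind it, so LWPs consume kernel resources (the kernel thread's stack space), and a system cannot support a very large number of them. So although an LWP is in essence a user thread, the LWP library sits on top of the kernel and its efficiency is limited.

User threads

A user thread in the strict sense is built entirely in user space and implemented by a library. The kernel does not schedule user threads directly and does not know they exist.
The drawback is that when one user thread blocks in a system call, the whole process blocks.

Combined model: user threads + LWP

Given that LWPs can interact with the kernel and user threads are cheap, combining the two yields an enhanced user thread, the "user threads + LWP" model.
The user thread library still lives entirely in user space, so user-thread operations remain cheap and a program can create as many user threads as it needs. The operating system provides LWPs as the bridge between user threads and kernel threads.
As before, an LWP is backed by a kernel thread and is a unit of kernel scheduling, and user threads make their system calls through an LWP, so one blocked user thread does not stall the whole process. The thread library maps the user threads it creates onto LWPs, and the number of LWPs need not equal the number of user threads. When the kernel schedules an LWP, the user thread attached to that LWP runs.

2. Creating a simple "kernel thread"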

/*
 * Create a kernel thread.
 */
pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
{
    struct pt_regs regs;

    memset(&regs, 0, sizeof(regs));

    regs.ARM_r4 = (unsigned long)arg;
    regs.ARM_r5 = (unsigned long)fn;
    regs.ARM_r6 = (unsigned long)kernel_thread_exit;
    regs.ARM_r7 = SVC_MODE | PSR_ENDSTATE | PSR_ISETSTATE;
    regs.ARM_pc = (unsigned long)kernel_thread_helper;
    regs.ARM_cpsr = regs.ARM_r7 | PSR_I_BIT;

    return do_fork(flags|CLONE_VM|CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
}

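With the earlier background on the ARM registers, the macros above should look familiar.

We mentioned the cpsr register earlier. Not all of its 32 bits carry meaning. The low bits encode the processor mode together with the interrupt-mask and state bits; the high bits hold the condition flags used by assembly arithmetic, for example whether a comparison found two values equal or whether a result was zero. With even a little ARM assembly behind you, the macros below will be thoroughly familiar.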

/*
 * PSR bits
 */
#define USR26_MODE  0x00000000
#define FIQ26_MODE  0x00000001
#define IRQ26_MODE  0x00000002
#define SVC26_MODE  0x00000003
#define USR_MODE    0x00000010
#define FIQ_MODE    0x00000011
#define IRQ_MODE    0x00000012
#define SVC_MODE    0x00000013
#define ABT_MODE    0x00000017
#define UND_MODE    0x0000001b
#define SYSTEM_MODE 0x0000001f
#define MODE32_BIT  0x00000010
#define MODE_MASK   0x0000001f
#define PSR_T_BIT   0x00000020
#define PSR_F_BIT   0x00000040
#define PSR_I_BIT   0x00000080
#define PSR_A_BIT   0x00000100
#define PSR_E_BIT   0x00000200
#define PSR_J_BIT   0x01000000
#define PSR_Q_BIT   0x08000000
#define PSR_V_BIT   0x10000000
#define PSR_C_BIT   0x20000000
#define PSR_Z_BIT   0x40000000
#define PSR_N_BIT   0x80000000
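
As a sketch of how these bits combine, here is the cpsr that kernel_thread builds, leaving out PSR_ENDSTATE and PSR_ISETSTATE (whose values depend on the configuration). The PSX_ constants copy values from the list above; the function names are made up.

```c
#include <stdint.h>

#define PSX_SVC_MODE  0x00000013u  /* same value as SVC_MODE */
#define PSX_MODE_MASK 0x0000001fu  /* same value as MODE_MASK */
#define PSX_I_BIT     0x00000080u  /* same value as PSR_I_BIT: IRQs masked */
#define PSX_Z_BIT     0x40000000u  /* same value as PSR_Z_BIT: zero flag */

/* Supervisor mode with IRQs masked, as kernel_thread sets it up. */
uint32_t psx_kthread_cpsr(void)
{
    return PSX_SVC_MODE | PSX_I_BIT;
}

int psx_irq_masked(uint32_t cpsr)
{
    return (cpsr & PSX_I_BIT) != 0;
}

int psx_zero_flag(uint32_t cpsr)
{
    return (cpsr & PSX_Z_BIT) != 0;
}
```

The mode sits in the low five bits, the interrupt mask just above it, and the condition flags at the very top, so each can be tested with a single AND.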

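"A good programmer is one who understands the hardware."

The ancestor processes: INIT_TASK(), kernel_init()

The most familiar kernel threads are process 0 and process 1. People have an instinct to trace their roots, so who is the common ancestor of all these processes? Process 0, of course.

"When the universe began, all was void." At the instant you power on there are no processes, let alone any routine to serve you. Everything has to be done by hand, and the data structures can only be statically allocated.

1. Process 0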

/*
 * Initial task structure.
 *
 * All other task structs will be allocated on slabs in fork.c
 */
struct task_struct init_task = INIT_TASK(init_task);


/*
 *  INIT_TASK is used to set up the first task table, touch at
 * your own risk!. Base=0, limit=0x1fffff (=2MB)
 */
#define INIT_TASK(tsk)  \
{                                   \
    .state      = 0,                        \
    .stack      = &init_thread_info,                \
    .usage      = ATOMIC_INIT(2),               \
    .flags      = PF_KTHREAD,                   \
    .prio       = MAX_PRIO-20,                  \
    .static_prio    = MAX_PRIO-20,                  \
    .normal_prio    = MAX_PRIO-20,                  \
    .policy     = SCHED_NORMAL,                 \
    .cpus_allowed   = CPU_MASK_ALL,                 \
    .mm     = NULL,                     \
    .active_mm  = &init_mm,                 \
    .se     = {                     \
        .group_node     = LIST_HEAD_INIT(tsk.se.group_node),    \
    },                              \
    .rt     = {                     \
        .run_list   = LIST_HEAD_INIT(tsk.rt.run_list),  \
        .time_slice = HZ,                   \
        .nr_cpus_allowed = NR_CPUS,             \
    },                              \
    .tasks      = LIST_HEAD_INIT(tsk.tasks),            \

    ... ...

}

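Process 0 runs start_kernel, which initializes all the data structures the kernel needs, enables interrupts, and then creates process 1 (the init process).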

asmlinkage void __init start_kernel(void)
{
    ... ...

    /* Do the rest non-__init'ed, we're now alive */
    rest_init();
}

Finally it plays dead in the idle loop; whenever no child is runnable, it springs back to life and holds the fort.

static noinline void __init_refok rest_init(void)
{
    int pid;

    rcu_scheduler_starting();
    /*
     * We need to spawn init first so that it obtains pid 1, however
     * the init task will end up wanting to create kthreads, which, if
     * we schedule it before we create kthreadd, will OOPS.
     */
    kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND);  //create init task...

    numa_default_policy();
    pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES); //
    rcu_read_lock();
    kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
    rcu_read_unlock();
    complete(&kthreadd_done); // signal the kthreadd_done completion

    /*
     * The boot idle thread must execute schedule()
     * at least once to get things moving:
     */
    init_idle_bootup_task(current);
    preempt_enable_no_resched();
    schedule();
    preempt_disable();

    /* Call into cpu_idle with preempt disabled */
    cpu_idle();  // process 0 goes idle; the scheduler picks it only when no other task is in TASK_RUNNING state
}

2. Process 1

static int __init kernel_init(void * unused)
{
    /*
     * Wait until kthreadd is all set-up.
     */
    wait_for_completion(&kthreadd_done); // wait until kthreadd_done is completed
    /*
     * init can allocate pages on any node
     */
    set_mems_allowed(node_states[N_HIGH_MEMORY]);
    /*
     * init can run on any cpu.
     */
    set_cpus_allowed_ptr(current, cpu_all_mask);

    cad_pid = task_pid(current);

    smp_prepare_cpus(setup_max_cpus);

    do_pre_smp_initcalls();
    lockup_detector_init();

    smp_init();
    sched_init_smp();

    do_basic_setup();

    /* Open the /dev/console on the rootfs, this should never fail */
    if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
        printk(KERN_WARNING "Warning: unable to open an initial console.\n");

    (void) sys_dup(0);
    (void) sys_dup(0);
    /*
     * check if there is an early userspace init.  If yes, let it do all
     * the work
     */

    if (!ramdisk_execute_command)
        ramdisk_execute_command = "/init"; // usually unset at this point, so default to "/init"

    if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
        ramdisk_execute_command = NULL;
        prepare_namespace();
    }

    /*
     * Ok, we have completed the initial bootup, and
     * we're essentially up and running. Get rid of the
     * initmem segments and start the user-mode stuff..
     */

    init_post();    //-->
    return 0;
}

static noinline int init_post(void)
{
    /* need to finish all async __init code before freeing the memory */
    async_synchronize_full();
    free_initmem();
    mark_rodata_ro();
    system_state = SYSTEM_RUNNING;
    numa_default_policy();


    current->signal->flags |= SIGNAL_UNKILLABLE;

    if (ramdisk_execute_command) {
        run_init_process(ramdisk_execute_command); 

/*
 * Load the executable init_filename; the init kernel thread becomes an ordinary process
 * 
 * static void run_init_process(const char *init_filename)
 * { 
 *     argv_init[0] = init_filename;
 *     kernel_execve(init_filename, argv_init, envp_init);
 * }
 * 
 */
        printk(KERN_WARNING "Failed to execute %s\n",
                ramdisk_execute_command);
    }

    /*
     * We try each of these until one succeeds.
     *
     * The Bourne shell can be used instead of init if we are
     * trying to recover a really broken machine.
     */
    if (execute_command) {
        run_init_process(execute_command);
        printk(KERN_WARNING "Failed to execute %s.  Attempting "
                    "defaults...\n", execute_command);
    }
    run_init_process("/sbin/init");
    run_init_process("/etc/init");
    run_init_process("/bin/init");
    run_init_process("/bin/sh");

    panic("No init found.  Try passing init= option to kernel. "
          "See Linux Documentation/init.txt for guidance.");
}
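
The chain of run_init_process calls works because a successful exec never returns: reaching the next line means the previous candidate failed. A user-space sketch of the same "first one that works" idea follows. It probes with access() instead of exec'ing, which is a substitution made for illustration, and first_runnable is a made-up name.

```c
#include <stddef.h>
#include <unistd.h>

/* Return the first path in a NULL-terminated list that exists and is
 * executable by the caller, or NULL if none qualifies. */
const char *first_runnable(const char *const *candidates)
{
    for (size_t i = 0; candidates[i] != NULL; i++) {
        if (access(candidates[i], X_OK) == 0)
            return candidates[i];
    }
    return NULL;
}
```

Given the list "/sbin/init", "/etc/init", "/bin/init", "/bin/sh", it reports whichever comes first on the machine at hand; a NULL result corresponds to the panic at the end of init_post.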

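Oh yeah, we have finally created a process. Now, how does the Linux scheduler schedule the children?

"Having children is easy; raising them is hard, all the more so in this age of high housing prices, high living costs and high blood pressure."

We have forked a pile of child processes. How are they managed, and whose turn is it to run? That is a big problem!

Speaking of choices, operations research comes to mind: nobody is so important that the rest of the team can be ignored, and only a sensible allocation gets the most out of everyone.

It is well known that the Linux kernel is preemptive: "everyone gets the CPU for a while, and VIPs get it a while longer." The question is how long "a while" should be, and who counts as a VIP.

If "a while" is too long, those waiting behind grow impatient and the system feels sluggish.
If "a while" is too short, a task is swapped out before it gets anything done.

Looking after everyone so that nobody complains is the mission of the scheduler.

The Linux scheduler

For the ideas behind the Linux scheduling algorithms, see:

http://hi.baidu.com/kebey2004/blog/item/3f96250803662a3de8248841.html

The kernel's current scheduling algorithm is CFS, which blurs the traditional notions of time slice and priority.

Through the modular scheduler framework, other scheduling algorithms can be plugged in, and different processes can use different algorithms.

The first question in scheduling is when to schedule. Picture an alarm clock on a timer: each time it rings, the kernel considers whether to switch processes.

Time ticks away, so who keeps the kernel's body clock?

The timer interrupt is an I/O interrupt, and each occurrence is one tick. The tick is derived by dividing down pclk.

1. The timer interrupt on a single processor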

-- arch/arm/plat-samsung/time.c --

/*
 * IRQ handler for the timer
 */
static irqreturn_t
s3c2410_timer_interrupt(int irq, void *dev_id)
{
    timer_tick();
    return IRQ_HANDLED;
}


static struct irqaction s3c2410_timer_irq = { 
    .name       = "S3C2410 Timer Tick",
    .flags      = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
    .handler    = s3c2410_timer_interrupt,
};

-- arch/arm/kernel/time.c --

/*
 * Kernel system timer support.
 */
void timer_tick(void)
{   
    profile_tick(CPU_PROFILING);
    do_leds();
    xtime_update(1); // update the wall-clock time --> b
#ifndef CONFIG_SMP
    update_process_times(user_mode(get_irq_regs())); // update per-process accounting --> d
#endif
}

void xtime_update(unsigned long ticks)
{
    write_seqlock(&xtime_lock);
    do_timer(ticks); //-->bb
    write_sequnlock(&xtime_lock);
}

bb:

void do_timer(unsigned long ticks)
{
    jiffies_64 += ticks;
    update_wall_time();
    calc_global_load(ticks); 
}

/* Update per-process accounting */

void update_process_times(int user_tick)
{
    struct task_struct *p = current;
    int cpu = smp_processor_id();

    /* Note: this timer irq context must be accounted for as well. */
    account_process_tick(p, user_tick); // charge this tick to the current process -->e

    run_local_timers();   // raise the local TIMER_SOFTIRQ -->f

    rcu_check_callbacks(cpu, user_tick);
    printk_tick();
#ifdef CONFIG_IRQ_WORK
    if (in_irq())
        irq_work_run();
#endif
    scheduler_tick(); //-->g
    run_posix_cpu_timers(p);
}

/*
 * Account a single tick of cpu time.
 * @p: the process that the cpu time gets accounted to
 * @user_tick: indicates if the tick is a user or a system tick
 */

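The comment is clear enough. Combined with the actual argument it means: when a timer interrupt occurs, the tick is accounted differently depending on whether the current process was running in user mode or in system (kernel) mode.

The expression user_mode(get_irq_regs()) checks which mode was interrupted. On ARM that means looking at the cpsr register.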


void account_process_tick(struct task_struct *p, int user_tick)
{
    cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
    struct rq *rq = this_rq();

    if (sched_clock_irqtime) {
        irqtime_account_process_tick(p, user_tick, rq);
        return;
    }

    if (user_tick)
        account_user_time(p, cputime_one_jiffy, one_jiffy_scaled); //-->ee
    else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
        account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,
                            one_jiffy_scaled);
    else
        account_idle_time(cputime_one_jiffy);
}
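
Stripped of the kernel types, the three-way branch above amounts to picking one of three counters per tick. A toy sketch with made-up names; the real test for the idle case also looks at the interrupt nesting depth, which is left out here.

```c
/* Per-CPU tick counters, a toy stand-in for cpu_usage_stat. */
struct toy_cpustat {
    unsigned long user;
    unsigned long system;
    unsigned long idle;
};

/* Charge one tick according to what the CPU was doing when it fired. */
void toy_account_tick(struct toy_cpustat *st, int user_tick, int is_idle_task)
{
    if (user_tick)
        st->user++;         /* interrupted user-mode code */
    else if (!is_idle_task)
        st->system++;       /* interrupted kernel code run for a task */
    else
        st->idle++;         /* the idle task was running */
}
```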

ee:

/*
 * Account user cpu time to a process.
 * @p: the process that the cpu time gets accounted to
 * @cputime: the cpu time spent in user space since the last update
 * @cputime_scaled: cputime scaled by cpu frequency
 */


#define cputime_one_jiffy       jiffies_to_cputime(1)

#define jiffies_to_cputime(__hz)    (__hz)


/* Mainly updates the time accounting for the process; time used by niced tasks goes into a separate bucket */
void account_user_time(struct task_struct *p, cputime_t cputime,
                       cputime_t cputime_scaled)
{
    struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
    cputime64_t tmp;
    
    /* Add user time to process. */
    p->utime = cputime_add(p->utime, cputime);
    p->utimescaled = cputime_add(p->utimescaled, cputime_scaled);
    account_group_user_time(p, cputime);

    /* Add user time to cpustat. */
    tmp = cputime_to_cputime64(cputime);
    if (TASK_NICE(p) > 0)
        cpustat->nice = cputime64_add(cpustat->nice, tmp); // task has nice > 0: count the time in the "nice" bucket
    else
        cpustat->user = cputime64_add(cpustat->user, tmp);

    cpuacct_update_stats(p, CPUACCT_STAT_USER, cputime);
    /* Account for user time used */
    acct_update_integrals(p);
}

2. Hard interrupts

This is where something new comes in: the softirq.

/*
 * Called by the local, per-CPU timer interrupt on SMP.
 */
void run_local_timers(void)
{
    hrtimer_run_queues();
    raise_softirq(TIMER_SOFTIRQ); // raise the softirq; its index is TIMER_SOFTIRQ
}

Linux 3.0 uses the following softirqs.

enum
{
    HI_SOFTIRQ=0,       // high-priority tasklets
    TIMER_SOFTIRQ,      // timer work raised from the timer interrupt
    NET_TX_SOFTIRQ,     // send packets to the network card
    NET_RX_SOFTIRQ,     // receive packets from the network card
    BLOCK_SOFTIRQ,
    BLOCK_IOPOLL_SOFTIRQ,
    TASKLET_SOFTIRQ,    // regular tasklets
    SCHED_SOFTIRQ,
    HRTIMER_SOFTIRQ,
    RCU_SOFTIRQ,    /* Preferable RCU should always be the last softirq */

    NR_SOFTIRQS
};

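A passer-by asks: "If there are soft interrupts there must be hard ones too, so what exactly is the difference?"

That passer-by used to be me, dizzy from all the soft and hard. Hard interrupts are the easy ones to picture. Imagine a taut rope between you and the CPU. You press a key, the rope (the signal level) is pulled low, and a pulse travels down the rope. The CPU feels a tug on its hand: the pulse has arrived. It knows you sent a signal, its attention wanders, and its train of thought is interrupted.

A hard interrupt naturally needs a real piece of hardware, the interrupt controller.

Consider the register organization in ARM state (there is also Thumb state). In the register diagram, small triangles mark the registers private to each mode. R8 to R14 are private to FIQ mode, seven in all, more than any other mode has. As a result, much FIQ handling code does not have to save many registers (pushing them onto the stack in memory), which saves time and lets the handler respond quickly.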

stmdb   sp!, {r0-r12, lr}       @ save registers
... ... 
ldmia   sp!, {r0-r12, pc} ^     @ restore registers

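The handling sequence:

1. The interrupt controller gathers the interrupt signals raised by the peripherals and notifies the CPU.
2. The CPU saves its registers (the current context) and calls the Interrupt Service Routine.
3. Identify the interrupt.
4. Clear the interrupt.
5. Restore the context.

What if many interrupts arrive at once? And why does the fast interrupt respond faster?
As processors evolve, the number of interrupt sources keeps growing. If every interrupt had a bit of its own, the mere 32 bits of the interrupt register would have run out long ago.
The conclusion: some interrupt bits stand for several interrupts, and those interrupts necessarily have something in common.
Take the serial port: INT_TXD0 signals that a transmit has finished and INT_RXD0 that data has been received. When either occurs, the INT_UART0 bit in the SRCPND register is set to 1.
In other words, interrupt handling uses a tree of depth three, and the root is the interrupt the CPU is currently handling.
Every interrupt has a priority. In hardware, interrupts pass through an arbiter before reaching the CPU and are handled in the order it decides, that is, they reach the root one after another.
When choosing a GPIO as an interrupt pin, people usually grab whichever one is free and treat the number after EINT (EINT0, EINT1 and so on) as a mere label. It is more than that: the number also reflects the line's priority in the arbiter. EINT0 is evidently a high-priority GPIO.

Finally, a fast interrupt must be handled promptly. How can it be so quick? It has an express lane: no arbitration, straight to the root.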
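
The "several sub-interrupts share one source bit" idea can be sketched as a fold from a sub-source register into a source register. The bit positions below are illustrative assumptions, not datasheet values, and all EX_ and ex_ names are made up.

```c
#include <stdint.h>

/* Illustrative bit positions only; consult the SoC datasheet for real ones. */
#define EX_SUB_RXD0  (1u << 0)    /* leaf: data received on UART0 */
#define EX_SUB_TXD0  (1u << 1)    /* leaf: transmit finished on UART0 */
#define EX_SRC_UART0 (1u << 28)   /* parent: some UART0 interrupt is pending */

/* Fold the sub-source pending bits into the source pending register:
 * the parent bit is set as soon as any of its children is pending. */
uint32_t ex_srcpnd_from_sub(uint32_t subsrcpnd)
{
    uint32_t srcpnd = 0;

    if (subsrcpnd & (EX_SUB_RXD0 | EX_SUB_TXD0))
        srcpnd |= EX_SRC_UART0;
    return srcpnd;
}
```

A handler walking the tree would read the parent bit first and then the leaf bits to find out which of the two events actually happened.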

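3. Soft interrupts

System calls belong to the software side: on ARM they are entered through a software interrupt instruction. Everything runs in the same memory and eats from the same bowl, yet the kernel guards its own 1G of address space, sealed like a fortress. The user program shouts from outside the wall for the kernel to open the gate, and the busy kernel is interrupted by the racket. The softirqs discussed here are a separate kernel mechanism for deferring work out of interrupt handlers.

raise_softirq(TIMER_SOFTIRQ);
/* mark the softirq as pending */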

void raise_softirq(unsigned int nr)
{   
    unsigned long flags;

    local_irq_save(flags); // save the interrupt state and disable local interrupts
    raise_softirq_irqoff(nr); //-->i
    local_irq_restore(flags);
}

ii:

iii:

Each CPU has a 32-bit bitmask describing its pending softirqs.

typedef struct {

    unsigned int __softirq_pending;

#ifdef CONFIG_LOCAL_TIMERS
    unsigned int local_timer_irqs;
#endif

#ifdef CONFIG_SMP
    unsigned int ipi_irqs[NR_IPI];
#endif

} ____cacheline_aligned irq_cpustat_t;



#define or_softirq_pending(x)  (local_softirq_pending() |= (x)) // set bits in the pending-softirq mask

#define local_softirq_pending() \
    __IRQ_STAT(smp_processor_id(), __softirq_pending)

#define __IRQ_STAT(cpu, member) (irq_stat[cpu].member)

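A softirq is marked pending by raise_softirq and executed later at a suitable moment.
Here, then, a timer-related softirq has been left pending. At certain points the kernel checks the pending mask, notices that something is waiting, and calls do_softirq to deal with it.
Which raises a question: what exactly are those "certain points"? Typically they are the return from a hardware interrupt (irq_exit), the moment local_bh_enable() re-enables bottom halves, and the runs of the ksoftirqd kernel thread.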

asmlinkage void do_softirq(void)
{
    __u32 pending;
    unsigned long flags;

    if (in_interrupt()) //-->in
        return;
    
    local_irq_save(flags);

    pending = local_softirq_pending();

    if (pending)
        __do_softirq(); // something is pending, run it -->pend

    local_irq_restore(flags);
}

in:

/* Check whether we are currently in interrupt context */

#define in_interrupt()      (irq_count())

#define irq_count() ( preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK) )
/*
 * PREEMPT_MASK: 0x000000ff
 * SOFTIRQ_MASK: 0x0000ff00
 * HARDIRQ_MASK: 0x03ff0000
 *     NMI_MASK: 0x04000000
 */

#define preempt_count() (current_thread_info()->preempt_count)

static inline struct thread_info *current_thread_info(void)
{
    register unsigned long sp asm ("sp");
    return (struct thread_info *)(sp & ~(THREAD_SIZE - 1)); //8k-1: 0001 1111 1111 1111
}
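
The masking trick in current_thread_info works because the kernel stack is THREAD_SIZE bytes and aligned to THREAD_SIZE, with thread_info at its lowest address. A sketch of the arithmetic alone, assuming the 8 KB size mentioned in the comment (the EX_ and ex_ names are made up):

```c
#include <stdint.h>

#define EX_THREAD_SIZE 8192u   /* assumed: 8 KB kernel stacks */

/* Round a stack pointer down to the start of its THREAD_SIZE-aligned block. */
uintptr_t ex_stack_base(uintptr_t sp)
{
    return sp & ~((uintptr_t)EX_THREAD_SIZE - 1);
}
```

Any stack pointer between 0xc0122000 and 0xc0123fff maps to 0xc0122000, so the kernel can find the current task's thread_info from sp alone, without a global variable.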

pend:

Each array element holds the handler for one softirq.

static struct softirq_action softirq_vec[NR_SOFTIRQS];

struct softirq_action
{   
    void    (*action)(struct softirq_action *);
};

asmlinkage void __do_softirq(void)
{
    struct softirq_action *h;
    __u32 pending;
    int max_restart = MAX_SOFTIRQ_RESTART;
    int cpu;

    pending = local_softirq_pending(); // fetch the pending mask
    account_system_vtime(current);

    __local_bh_disable((unsigned long)__builtin_return_address(0),
                       SOFTIRQ_OFFSET);
    lockdep_softirq_enter();

    cpu = smp_processor_id();

restart:
    set_softirq_pending(0); // clear the pending bitmap
    local_irq_enable();  // then re-enable local interrupts

    h = softirq_vec;

    do {
        if (pending & 1) { // walk the mask from bit 0 up, running handlers in priority order

            unsigned int vec_nr = h - softirq_vec;
            int prev_count = preempt_count();

            kstat_incr_softirqs_this_cpu(vec_nr);

            trace_softirq_entry(vec_nr);
            h->action(h); // run the handler -->action
            trace_softirq_exit(vec_nr);

            if (unlikely(prev_count != preempt_count())) {
                printk(KERN_ERR "huh, entered softirq %u %s %p"
                       "with preempt_count %08x,"
                       " exited with %08x?\n", vec_nr,
                       softirq_to_name[vec_nr], h->action,
                       prev_count, preempt_count());
                preempt_count() = prev_count;
            }

            rcu_bh_qs(cpu);
        }
        h++;
        pending >>= 1;
    } while (pending);

    local_irq_disable();

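    pending = local_softirq_pending(); // fetch the mask again
    if (pending && --max_restart) // new softirqs may have been raised while handlers were running
        goto restart;

    if (pending)   // still pending after the restart budget: hand over to the ksoftirqd kernel thread
        wakeup_softirqd();  //-->softirqd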

    lockdep_softirq_exit();

    account_system_vtime(current);
    __local_bh_enable(SOFTIRQ_OFFSET);
}
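
The core of the loop above, stripped of locking, tracing and the restart logic, is a walk over a bitmask from bit 0 upward. A toy sketch with made-up toy_ names; handlers here simply record their number so the order can be observed.

```c
#include <stdint.h>

#define TOY_NR_SOFTIRQS 4
#define TOY_LOG_MAX     16

typedef void (*toy_action_fn)(unsigned int nr);

uint32_t toy_pending;                           /* one bit per softirq */
toy_action_fn toy_softirq_vec[TOY_NR_SOFTIRQS]; /* handler per softirq */
unsigned int toy_log[TOY_LOG_MAX];              /* order in which handlers ran */
int toy_log_len;

/* Example handler: remember which softirq ran. */
void toy_record(unsigned int nr)
{
    if (toy_log_len < TOY_LOG_MAX)
        toy_log[toy_log_len++] = nr;
}

void toy_open_softirq(unsigned int nr, toy_action_fn action)
{
    if (nr < TOY_NR_SOFTIRQS)
        toy_softirq_vec[nr] = action;
}

void toy_raise_softirq(unsigned int nr)
{
    if (nr < TOY_NR_SOFTIRQS)
        toy_pending |= 1u << nr;
}

/* Snapshot and clear the mask, then run handlers from bit 0 upward. */
void toy_do_softirq(void)
{
    uint32_t pending = toy_pending;
    unsigned int nr = 0;

    toy_pending = 0;
    while (pending) {
        if ((pending & 1) && toy_softirq_vec[nr])
            toy_softirq_vec[nr](nr);
        nr++;
        pending >>= 1;
    }
}
```

Raising softirq 3 and then softirq 1 still runs 1 before 3: the order comes from the bit position, not from the order of the raise calls.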

softirqd:

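Each CPU has its own ksoftirqd kernel thread.

It exists to resolve an awkward question: what should happen when another softirq is raised while softirqs are already being processed? Picture a bus driver who sees the last passenger board and shuts the door. Just as he steps on the accelerator, the mirror shows someone sprinting for the bus. Does he open the door?

In the kernel, if softirqs keep being raised again while they are being handled, __do_softirq() wakes the kernel thread and returns. The remaining work is left to that lower-priority kernel thread, so user programs get a chance to run.

The design is there to keep user processes from starving. The kernel outranks user code, after all, and if a flood of softirqs hit within a short time, user programs would otherwise freeze.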

static int run_ksoftirqd(void * __bind_cpu)
{
    set_current_state(TASK_INTERRUPTIBLE);

    while (!kthread_should_stop()) {
        preempt_disable();
        if (!local_softirq_pending()) {
            preempt_enable_no_resched();
            schedule();
            preempt_disable();
        }   

        __set_current_state(TASK_RUNNING);

        while (local_softirq_pending()) {
            /* Preempt disable stops cpu going offline.
               If already offline, we'll be on wrong CPU:
               don't process */
            if (cpu_is_offline((long)__bind_cpu))
                goto wait_to_die;

            local_irq_disable();

            if (local_softirq_pending())
                __do_softirq();  // the woken kernel thread runs the pending softirqs when needed

            local_irq_enable();
            preempt_enable_no_resched();
            cond_resched();  // yield the CPU if a reschedule is due
            preempt_disable();
            rcu_note_context_switch((long)__bind_cpu);
        }   
        preempt_enable();
        set_current_state(TASK_INTERRUPTIBLE); // no softirq pending: go back to sleep
    }   
    __set_current_state(TASK_RUNNING);
    return 0;

wait_to_die:
    preempt_enable();
    /* Wait for kthread_stop */
    set_current_state(TASK_INTERRUPTIBLE);
    while (!kthread_should_stop()) {
        schedule();
        set_current_state(TASK_INTERRUPTIBLE);
    }
    __set_current_state(TASK_RUNNING);
    return 0;
}

action:

So what does the handler for TIMER_SOFTIRQ actually look like?

-- kernel/softirq.c --

void open_softirq(int nr, void (*action)(struct softirq_action *)) 
{
    softirq_vec[nr].action = action;
}

Reading on, it turns out the handler was registered early on, in init_timers.

-- kernel/timer.c --

void __init init_timers(void)
{
    int err = timer_cpu_notify(&timers_nb, (unsigned long)CPU_UP_PREPARE,
                (void *)(long)smp_processor_id());

    init_timer_stats();

    BUG_ON(err != NOTIFY_OK);
    register_cpu_notifier(&timers_nb);
    open_softirq(TIMER_SOFTIRQ, run_timer_softirq);    // register the softirq handler
}

The TIMER_SOFTIRQ handler:

static void run_timer_softirq(struct softirq_action *h)
{
    struct tvec_base *base = __this_cpu_read(tvec_bases);

    hrtimer_run_pending();

    if (time_after_eq(jiffies, base->timer_jiffies))
        __run_timers(base); //-->run
}
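
time_after_eq compares tick counts in a way that survives the counter wrapping around. A sketch of the idea as a plain function (ex_time_after_eq is a made-up name; converting the unsigned difference to a signed long relies on the two's-complement behaviour of common compilers):

```c
/* True if tick count a is at or after b, even across a wraparound.
 * The unsigned subtraction wraps; reading the result as signed tells
 * us which side of b the value a lies on. */
int ex_time_after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}
```

A plain a >= b would claim that 5 is before ULONG_MAX - 2, although the counter has merely wrapped and 5 is eight ticks later.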

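At this point a few words about timers are in order.

Everyone knows what a timer does. The one thing to keep in mind is that a timer is not necessarily punctual.

The kernel is a busy fellow with many jobs more important than timers, so do not be surprised when a timer's time is up and nothing happens for a while.

1. A timer is an alarm clock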

struct timer_list {

    struct list_head entry;
    unsigned long expires; 
    struct tvec_base *base;
    
    void (*function)(unsigned long);
    unsigned long data;
        
    int slack;

#ifdef CONFIG_TIMER_STATS
    int start_pid;
    void *start_site;
    char start_comm[16];
#endif

#ifdef CONFIG_LOCKDEP
    struct lockdep_map lockdep_map;
#endif
};

The alarm clocks all hang on chains:

struct tvec_base {
    spinlock_t lock;
    struct timer_list *running_timer;
    unsigned long timer_jiffies;
    unsigned long next_timer;
    struct tvec_root tv1;
    struct tvec tv2;
    struct tvec tv3;
    struct tvec tv4;
    struct tvec tv5;
} ____cacheline_aligned;

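So there are five sets of chains (tv1 to tv5). Each chain holds the alarm clocks whose remaining time falls in the same range: tv1 holds the ones about to go off, tv5 the ones furthest away.
Someone will ask: time never stops moving forward, so shouldn't the alarm clocks keep moving from one list to another?
Of course, and that job is handed to the cascade function.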
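
How a timer ends up on one chain or another can be sketched as follows. This is a simplified model, not the kernel's code: `struct wheel_slot` and `pick_slot` are invented names, and the sizes TVR_BITS = 8 and TVN_BITS = 6 are assumed from the usual configuration without CONFIG_BASE_SMALL.

```c
#include <assert.h>

/* Sizes assumed from the usual (non-CONFIG_BASE_SMALL) configuration. */
#define TVR_BITS 8
#define TVN_BITS 6
#define TVR_MASK ((1UL << TVR_BITS) - 1)
#define TVN_MASK ((1UL << TVN_BITS) - 1)

struct wheel_slot {
    int level;           /* 1..5, standing for tv1..tv5 */
    unsigned long index; /* bucket inside that level */
};

/* Model of how a timer's bucket is chosen from its remaining time. */
static struct wheel_slot pick_slot(unsigned long expires,
                                   unsigned long timer_jiffies)
{
    unsigned long delta = expires - timer_jiffies;
    struct wheel_slot s;

    if ((long)delta < 0) {
        /* already in the past: queue it so the next tick runs it */
        s.level = 1;
        s.index = timer_jiffies & TVR_MASK;
    } else if (delta < (1UL << TVR_BITS)) {
        /* due within the next 256 jiffies: one bucket per jiffy */
        s.level = 1;
        s.index = expires & TVR_MASK;
    } else {
        /* each further level covers 2^TVN_BITS times more jiffies */
        int shift = TVR_BITS;

        s.level = 2;
        while (s.level < 5 && delta >= (1UL << (shift + TVN_BITS))) {
            shift += TVN_BITS;
            s.level++;
        }
        s.index = (expires >> shift) & TVN_MASK;
    }
    assert(s.level >= 1 && s.level <= 5);
    return s;
}
```

With `timer_jiffies` at 1000, a timer expiring at 1005 lands in tv1 and one expiring at 1300 lands in tv2. Cascading is the reverse trip: when tv1 wraps around, one bucket of tv2 is emptied and its timers are re-inserted, which drops them into tv1.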

run:

static inline void __run_timers(struct tvec_base *base)
{
    struct timer_list *timer;

    spin_lock_irq(&base->lock);

    while (time_after_eq(jiffies, base->timer_jiffies)) {

        struct list_head work_list;
        struct list_head *head = &work_list;
        int index = base->timer_jiffies & TVR_MASK;

        /*
         * Cascade timers:
         */
        if (!index &&
            (!cascade(base, &base->tv2, INDEX(0))) &&
                (!cascade(base, &base->tv3, INDEX(1))) &&
                    !cascade(base, &base->tv4, INDEX(2)))
            cascade(base, &base->tv5, INDEX(3)); // cascade timers down from the outer vectors

        ++base->timer_jiffies;
        list_replace_init(base->tv1.vec + index, &work_list);

        while (!list_empty(head)) {

            void (*fn)(unsigned long);
            unsigned long data;

            timer = list_first_entry(head, struct timer_list,entry);
            fn = timer->function;
            data = timer->data;

            timer_stats_account_timer(timer);

            base->running_timer = timer;
            detach_timer(timer, 1);

            spin_unlock_irq(&base->lock);
            call_timer_fn(timer, fn, data); // run the timer's callback
            spin_lock_irq(&base->lock);
        }
    }
    base->running_timer = NULL;
    spin_unlock_irq(&base->lock);
}

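I used to wonder: if the CPU keeps updating jiffies, how can it have enough time left for anything else? On reflection, the worry was groundless. Within a single jiffy the CPU still goes through well over ten thousand clock cycles, which leaves room for thousands of instructions. Merely bumping a counter costs only a handful of them.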
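
To put a number on it, here is a tiny worked calculation. The figures are assumptions for illustration only (a tick rate of HZ = 100 and a 1 GHz clock), and `cycles_per_jiffy` is a made-up helper name.

```c
#include <assert.h>

/* Clock cycles available in one jiffy, given the CPU clock in Hz
 * and the tick rate hz (ticks per second). */
static unsigned long long cycles_per_jiffy(unsigned long long cpu_hz,
                                           unsigned int hz)
{
    assert(hz > 0);
    return cpu_hz / hz;
}
```

Under those assumptions, `cycles_per_jiffy(1000000000ULL, 100)` gives ten million cycles per jiffy, far more than the author's cautious lower bound.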

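Of course, there are times when there is more work than one jiffy can handle. What then? The work is deferred and processed later, in the spare time of a following jiffy. This is the idea behind the so-called interrupt bottom half, and the softirqs discussed here, as well as tasklets, work the same way.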

2. Scheduling algorithm modules

Where were we... ah yes, what remains is scheduler_tick, the last function called by update_process_times.

void scheduler_tick(void)
{   
    int cpu = smp_processor_id();
    struct rq *rq = cpu_rq(cpu);
    struct task_struct *curr = rq->curr;
        
    sched_clock_tick();
    
    raw_spin_lock(&rq->lock);

    update_rq_clock(rq);
    update_cpu_load_active(rq);
    curr->sched_class->task_tick(rq, curr, 0);  //-->

    raw_spin_unlock(&rq->lock);

    perf_event_task_tick();

#ifdef CONFIG_SMP
    rq->idle_at_tick = idle_cpu(cpu);
    trigger_load_balance(rq, cpu);
#endif
}

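The sched_class structure that appears here is how the scheduling algorithms are modularized: the kernel handles scheduling decisions through this structure.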
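
The modular idea can be shown with a reduced sketch: the core tick code only follows a function pointer, and each policy supplies its own implementation. Everything here is hypothetical (`sched_class_model`, the two policies, and the "reschedule after 4 ticks" rule are invented for illustration); the real sched_class has many more hooks.

```c
#include <assert.h>
#include <stddef.h>

struct rq;
struct task;

/* Hypothetical, heavily reduced stand-in for the kernel's sched_class. */
struct sched_class_model {
    const char *name;
    void (*task_tick)(struct rq *rq, struct task *curr);
};

struct task {
    const struct sched_class_model *sched_class;
    unsigned long ticks_seen;
    int need_resched;
};

struct rq {
    struct task *curr;
};

/* "Fair" policy in this sketch: ask for a reschedule after 4 ticks (made-up rule). */
static void tick_fair_model(struct rq *rq, struct task *curr)
{
    (void)rq;
    if (++curr->ticks_seen >= 4)
        curr->need_resched = 1;
}

/* "FIFO" policy in this sketch: a tick never forces a reschedule. */
static void tick_fifo_model(struct rq *rq, struct task *curr)
{
    (void)rq;
    curr->ticks_seen++;
}

static const struct sched_class_model fair_model = { "fair", tick_fair_model };
static const struct sched_class_model fifo_model = { "fifo", tick_fifo_model };

/* The core tick code knows nothing about the policy; it follows the pointer. */
static void scheduler_tick_model(struct rq *rq)
{
    struct task *curr = rq->curr;

    assert(curr && curr->sched_class && curr->sched_class->task_tick);
    curr->sched_class->task_tick(rq, curr);
}
```

Adding a new policy in this model means adding one more `sched_class_model` instance; `scheduler_tick_model` itself never changes.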

About the Completely Fair Scheduler (CFS):

-->

static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &curr->se;

    for_each_sched_entity(se) {  // macro: walks the sched_entity hierarchy upward, from child to parent
        cfs_rq = cfs_rq_of(se);
        entity_tick(cfs_rq, se, queued); //-->
    }
}

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);

    /*
     * Update share accounting for long-running entities.
     */
    update_entity_shares_tick(cfs_rq);

#ifdef CONFIG_SCHED_HRTICK
    /*
     * queued ticks are scheduled to match the slice, so don't bother
     * validating it and just reschedule.
     */
    if (queued) {
        resched_task(rq_of(cfs_rq)->curr); //
        return;
    }
    /*
     * don't let the period tick interfere with the hrtick preemption
     */
    if (!sched_feat(DOUBLE_TICK) &&
            hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
        return;
#endif

    if (cfs_rq->nr_running > 1 || !sched_feat(WAKEUP_PREEMPT))
        check_preempt_tick(cfs_rq, curr);
}

sched_entity

struct sched_entity {
    struct load_weight  load;       /* for load-balancing */
    struct rb_node      run_node;
    struct list_head    group_node;
    unsigned int        on_rq;

    u64         exec_start;
    u64         sum_exec_runtime;
    u64         vruntime;
    u64         prev_sum_exec_runtime;

    u64         nr_migrations;

#ifdef CONFIG_SCHEDSTATS
    struct sched_statistics statistics;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
    struct sched_entity *parent;
    /* rq on which this entity is (to be) queued: */
    struct cfs_rq       *cfs_rq;
    /* rq "owned" by this entity/group: */
    struct cfs_rq       *my_q;
#endif
};

Process priority and the red-black tree

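CFS introduces a red-black tree in which each node represents a process. The tree is ordered by vruntime: the smaller the vruntime, the further left the node sits, and a higher-priority process accumulates vruntime more slowly, so it tends to stay toward the left. When a new process joins, the tree has to be rebalanced. Once that is done, what remains is the heart of process scheduling, the schedule function.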
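
Why a higher priority keeps a process toward the left can be sketched with a small model. The assumptions are loud ones: a plain array with a linear scan stands in for the red-black tree, the weight 2048 is an arbitrary illustrative value, and `charge` and `pick_leftmost` are invented names. The scaling by NICE_0_LOAD / weight reflects my understanding of how CFS charges runtime.

```c
#include <assert.h>
#include <stddef.h>

#define NICE_0_LOAD 1024ULL   /* assumed weight of a nice-0 task */

struct entity {
    unsigned long long vruntime; /* virtual runtime, in ns */
    unsigned long long weight;   /* load weight; larger means higher priority */
};

/* Charge delta_exec ns of real CPU time, scaled down by the entity's weight. */
static void charge(struct entity *se, unsigned long long delta_exec)
{
    assert(se->weight > 0);
    se->vruntime += delta_exec * NICE_0_LOAD / se->weight;
}

/* Stand-in for "leftmost node of the tree": scan for the smallest vruntime. */
static size_t pick_leftmost(const struct entity *v, size_t n)
{
    size_t best = 0;

    assert(n > 0);
    for (size_t i = 1; i < n; i++)
        if (v[i].vruntime < v[best].vruntime)
            best = i;
    return best;
}
```

If two entities with weights 1024 and 2048 each run for 1 ms of real time, the heavier one is charged only half as much vruntime, so it is the one `pick_leftmost` returns next.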

Goto: Linux内核CFS进程调度策略 (the CFS process scheduling policy in the Linux kernel)

1. Pick one

Find a process on the run queue and hand the CPU to it.

/*
 * schedule() is the main scheduler function.
 */
asmlinkage void __sched schedule(void)
{
    struct task_struct *prev, *next; // the key job is assigning next, i.e. finding the next process to run
    unsigned long *switch_count;
    struct rq *rq;
    int cpu;

need_resched:
    preempt_disable();
    cpu = smp_processor_id(); // ID of the CPU we are running on
    rq = cpu_rq(cpu);  // the run queue that belongs to this CPU
    rcu_note_context_switch(cpu);
    prev = rq->curr;

    schedule_debug(prev);

    if (sched_feat(HRTICK))
        hrtick_clear(rq);

    raw_spin_lock_irq(&rq->lock); // local interrupts must be off before we look for a runnable process

    switch_count = &prev->nivcsw;

    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {

        if (unlikely(signal_pending_state(prev->state, prev))) { // a signal is pending and prev's sleep is interruptible: give prev another chance to run

            prev->state = TASK_RUNNING;

        } else {

            deactivate_task(rq, prev, DEQUEUE_SLEEP); // put the current process to sleep
            prev->on_rq = 0;

            /*
             * If a worker went to sleep, notify and ask workqueue
             * whether it wants to wake up a task to maintain
             * concurrency.
             */
            if (prev->flags & PF_WQ_WORKER) {
                struct task_struct *to_wakeup;

                to_wakeup = wq_worker_sleeping(prev, cpu);
                if (to_wakeup)
                    try_to_wake_up_local(to_wakeup);
            }

            /*
             * If we are going to sleep and we have plugged IO
             * queued, make sure to submit it to avoid deadlocks.
             */
            if (blk_needs_flush_plug(prev)) {
                raw_spin_unlock(&rq->lock);
                blk_schedule_flush_plug(prev);
                raw_spin_lock(&rq->lock);
            }
        }
        switch_count = &prev->nvcsw;
    }

    pre_schedule(rq, prev);

    if (unlikely(!rq->nr_running)) // what if this CPU's run queue has no runnable process left?
        idle_balance(cpu, rq);  // borrow a few from another CPU; this is load balancing between CPUs

    put_prev_task(rq, prev);  // put prev back in its place
    next = pick_next_task(rq);  // find a new process
    clear_tsk_need_resched(prev);
    rq->skip_clock_update = 0;

    if (likely(prev != next)) {  // prepare for the process switch
        rq->nr_switches++;
        rq->curr = next;
        ++*switch_count;

        context_switch(rq, prev, next); /* unlocks the rq */
        /*
         * The context switch have flipped the stack from under us
         * and restored the local variables which were saved when
         * this task called schedule() in the past. prev == current
         * is still correct, but it can be moved to another cpu/rq.
         */
        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
    } else
        raw_spin_unlock_irq(&rq->lock);

    post_schedule(rq);

    preempt_enable_no_resched();

    if (need_resched()) // check whether some other process has set TIF_NEED_RESCHED on the current one
        goto need_resched;
}

2. Finding a new process

static inline struct task_struct *
pick_next_task(struct rq *rq)
{
    const struct sched_class *class;
    struct task_struct *p;

    /*
     * Optimization: we know that if all tasks are in
     * the fair class we can call that function directly:
     */
    if (likely(rq->nr_running == rq->cfs.nr_running)) {
        p = fair_sched_class.pick_next_task(rq); // under CFS the next task is simply the leftmost node of the red-black tree, very convenient
        if (likely(p))
            return p;
    }

    for_each_class(class) {
        p = class->pick_next_task(rq);
        if (p)
            return p;
    }

    BUG(); /* the idle class will always have a runnable task */
}
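
The walk over scheduling classes can be modelled in a few lines. This is a sketch under assumptions: three classes only (the real kernel has more), run queues reduced to two counters, and all names (`rq_model`, `class_model`, `pick_next_model`, the `PICK_*` values) invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum picked { PICK_NONE, PICK_RT, PICK_FAIR, PICK_IDLE };

/* Run queue reduced to two counters of runnable tasks. */
struct rq_model {
    int nr_rt;
    int nr_fair;
};

struct class_model {
    const struct class_model *next;   /* next lower-priority class */
    enum picked (*pick_next_task)(struct rq_model *rq);
};

static enum picked pick_rt(struct rq_model *rq)
{
    return rq->nr_rt ? PICK_RT : PICK_NONE;
}

static enum picked pick_fair(struct rq_model *rq)
{
    return rq->nr_fair ? PICK_FAIR : PICK_NONE;
}

static enum picked pick_idle(struct rq_model *rq)
{
    (void)rq;
    return PICK_IDLE;   /* the idle class always has something to run */
}

static const struct class_model idle_class = { NULL, pick_idle };
static const struct class_model fair_class = { &idle_class, pick_fair };
static const struct class_model rt_class   = { &fair_class, pick_rt };

static enum picked pick_next_model(struct rq_model *rq)
{
    const struct class_model *class;
    enum picked p;

    /* Fast path: every runnable task belongs to the fair class. */
    if (rq->nr_rt == 0) {
        p = fair_class.pick_next_task(rq);
        if (p != PICK_NONE)
            return p;
    }

    /* Slow path: ask each class in turn, highest priority first. */
    for (class = &rt_class; class; class = class->next) {
        p = class->pick_next_task(rq);
        if (p != PICK_NONE)
            return p;
    }

    assert(!"the idle class always has a runnable task");
    return PICK_NONE;
}
```

A run queue with only fair tasks takes the fast path, one with a realtime task returns it first, and an empty one falls through to the idle class.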

3. Letting the child process fly

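Once the new process has been identified, the kernel sets up the essentials, such as its memory (address space). When everything is in place, the child has grown up and its wings are strong, so we let it fly on its own.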

/*
 * context_switch - switch to the new MM and the new
 * thread's register state.
 */
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next)
{
    struct mm_struct *mm, *oldmm;

    prepare_task_switch(rq, prev, next);

    mm = next->mm;
    oldmm = prev->active_mm;
    /*
     * For paravirt, this is coupled with an exit in switch_to to
     * combine the page table reload and the switch backend into
     * one hypercall.
     */
    arch_start_context_switch(prev);

    if (!mm) { // a kernel thread has no mm of its own, so it borrows prev's address space
        next->active_mm = oldmm;
        atomic_inc(&oldmm->mm_count);
        enter_lazy_tlb(oldmm, next);
    } else
        switch_mm(oldmm, mm, next);

    if (!prev->mm) {
        prev->active_mm = NULL;
        rq->prev_mm = oldmm;
    }

    /*
     * Since the runqueue lock will be released by the next
     * task (which is an invalid locking op but in the case
     * of the scheduler it's an obvious special-case), so we
     * do an early lockdep release here:
     */
#ifndef __ARCH_WANT_UNLOCKED_CTXSW
    spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
#endif

    /* Here we just switch the register state and the stack. */
    switch_to(prev, next, prev);

    barrier(); // compiler barrier: the compiler may not move memory accesses across this point
    /*
     * this_rq must be evaluated again because prev may have moved
     * CPUs since it called schedule(), thus the 'rq' on its stack
     * frame will be invalid.
     */
    finish_task_switch(this_rq(), prev);
}
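
The address-space borrowing in context_switch can be modelled in isolation. This is a sketch, not the kernel's logic verbatim: `mm_model`, `task_model`, `loaded_mm` and `switch_mm_model` are invented names, and the reference is dropped immediately here, whereas the listing above parks it in rq->prev_mm for later cleanup.

```c
#include <assert.h>
#include <stddef.h>

struct mm_model {
    int mm_count;               /* number of users of this address space */
};

struct task_model {
    struct mm_model *mm;        /* NULL for a kernel thread */
    struct mm_model *active_mm; /* address space actually in use */
};

static struct mm_model *loaded_mm;  /* models "page tables currently loaded" */

static void switch_mm_model(struct task_model *prev, struct task_model *next)
{
    struct mm_model *oldmm = prev->active_mm;

    assert(oldmm != NULL);
    if (!next->mm) {
        /* kernel thread: borrow prev's address space, no page-table switch */
        next->active_mm = oldmm;
        oldmm->mm_count++;
    } else {
        /* user process: load its own address space */
        loaded_mm = next->mm;
    }

    if (!prev->mm) {
        /* prev was a kernel thread: hand back what it borrowed */
        prev->active_mm = NULL;
        oldmm->mm_count--;
    }
}
```

Switching from a user process to a kernel thread leaves `loaded_mm` untouched, which is the point of the trick: the kernel thread never touches user memory, so reloading page tables for it would be wasted work.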

It is this process's turn: let it fly

1. Takeoff

#define switch_to(prev,next,last)                   \
do {                                    \
    last = __switch_to(prev,task_thread_info(prev), task_thread_info(next));    \
} while (0)

2. The assembly details

-- arch/arm/kernel/entry-armv.S --


/*
 * Register switch for ARMv3 and ARMv4 processors
 * r0 = previous task_struct, r1 = previous thread_info, r2 = next thread_info
 * previous and next are guaranteed not to be the same.
 */
ENTRY(__switch_to)
 UNWIND(.fnstart    )    
 UNWIND(.cantunwind )
    add ip, r1, #TI_CPU_SAVE
    ldr r3, [r2, #TI_TP_VALUE]
 ARM(   stmia   ip!, {r4 - sl, fp, sp, lr} )    @ Store most regs on stack
 THUMB( stmia   ip!, {r4 - sl, fp}     )    @ Store most regs on stack
 THUMB( str sp, [ip], #4           )    
 THUMB( str lr, [ip], #4           )    
#ifdef CONFIG_CPU_USE_DOMAINS
    ldr r6, [r2, #TI_CPU_DOMAIN]
#endif
    set_tls r3, r4, r5
#if defined(CONFIG_CC_STACKPROTECTOR) && !defined(CONFIG_SMP)
    ldr r7, [r2, #TI_TASK]
    ldr r8, =__stack_chk_guard
    ldr r7, [r7, #TSK_STACK_CANARY]
#endif
#ifdef CONFIG_CPU_USE_DOMAINS
    mcr p15, 0, r6, c3, c0, 0       @ Set domain register
#endif
    mov r5, r0
    add r4, r2, #TI_CPU_SAVE
    ldr r0, =thread_notify_head
    mov r1, #THREAD_NOTIFY_SWITCH
    bl  atomic_notifier_call_chain
#if defined(CONFIG_CC_STACKPROTECTOR) && !defined(CONFIG_SMP)
    str r7, [r8] 
#endif

 THUMB( mov ip, r4             )
    mov r0, r5
 ARM(   ldmia   r4, {r4 - sl, fp, sp, pc}  )    @ Load all regs saved previously
 THUMB( ldmia   ip!, {r4 - sl, fp}     )    @ Load all regs saved previously
 THUMB( ldr sp, [ip], #4           )
 THUMB( ldr pc, [ip]           )
 UNWIND(.fnend      )
ENDPROC(__switch_to)
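
The assembly above saves the outgoing task's registers and stack pointer, then loads the incoming task's, with the final load of pc doing the actual jump. A user-space analogy, and only an analogy, is swapcontext from glibc's <ucontext.h> (assumed to be available on the build machine). The names `task_body`, `run_switch_demo` and `trace` are made up for this sketch.

```c
#include <assert.h>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;
static char task_stack[64 * 1024];   /* private stack for the second context */
static int trace[4];                 /* records the order in which code runs */
static int trace_len;

static void task_body(void)
{
    trace[trace_len++] = 2;
    swapcontext(&task_ctx, &main_ctx);   /* save our registers, resume main */
    trace[trace_len++] = 4;
    /* returning ends this context; uc_link sends control back to main_ctx */
}

/* Bounce between two contexts; returns the number of trace entries written. */
static int run_switch_demo(void)
{
    int rc;

    trace_len = 0;
    rc = getcontext(&task_ctx);
    assert(rc == 0);
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link = &main_ctx;
    makecontext(&task_ctx, task_body, 0);

    trace[trace_len++] = 1;
    swapcontext(&main_ctx, &task_ctx);   /* runs task_body until it switches back */
    trace[trace_len++] = 3;
    swapcontext(&main_ctx, &task_ctx);   /* resumes task_body where it left off */
    return trace_len;
}
```

Each swapcontext call plays the role of __switch_to here: the caller stops in the middle of a function and later continues from exactly that point, on its own stack, as if nothing had happened.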

At this point, the process switch is complete.

posted @ 2011-10-13 11:06 郝壹贰叁
