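Linux Kernel Mechanisms: mutex

I. Introduction to the Mutex

1. A mutex is a simple, straightforward mutual-exclusion lock with strict semantics:

(1) Only one task can hold the lock at a time
(2) Only the owner can unlock the lock
(3) Multiple unlocks are not permitted
(4) Recursive locking is not permitted
(5) The lock must be initialized through the API
(6) The lock must not be initialized via memset or copying
(7) A task must not exit while holding the lock
(8) The memory region where a held lock resides must not be freed
(9) A held lock must not be reinitialized
(10) The lock must not be used in hard or soft interrupt context, such as tasklets and timers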

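When DEBUG_MUTEXES is enabled, these semantics are fully enforced. In addition to enforcing the rules above, the mutex debugging code implements a number of extra features that make lock debugging easier and faster:

(1) Uses the symbolic name of the mutex whenever it is printed in debug output
(2) Point-of-acquire tracking, with symbolic lookup of function names
(3) A list of all locks held in the system, with a printout of them
(4) Owner tracking
(5) Detects self-recursing locks and prints out all relevant information
(6) Detects multi-task circular deadlocks and prints out all affected locks and tasks (and only those tasks)

Note: the above comes from the comment on struct mutex in include/linux/mutex.h.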

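2. How a simple mutex works:

A traditional mutex needs only a state flag and a wait queue. The wait queue holds the blocked threads, and the owner thread currently holds the mutex. When the owner leaves the critical section and releases the lock, it wakes the first thread in the wait queue (the top waiter). The top waiter then competes for the lock: if it succeeds, it is removed from the wait queue and becomes the owner; if it fails, it stays blocked and waits for the owner to wake it on the next release. While the owner holds the lock, any new task that competes for the mutex blocks and is appended to the tail of the wait queue.

Compared with a traditional mutex, the Linux kernel adds an optimistic spinning optimization: when a thread fails to take the lock, it can either spin on the mutex state word and wait for the owner to release the lock, or block and join the wait queue. The choice balances the cost of spinning against the cost of a context switch. To avoid the performance problems caused by many threads spinning at once, the optimistic spinning mechanism also uses an MCS lock, which later sections describe in detail.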

II. Related structures

1. A mutex object is described by struct mutex

struct mutex {
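    /*
     * 1. Records which task (struct task_struct *) holds the mutex. NULL means the
     * mutex is not held by any task.
     * 2. Because task_struct is at least 8-byte aligned, its lowest 3 bits can be used
     * as flag bits:
     * (1) MUTEX_FLAG_WAITERS: the wait_list is non-empty, i.e. some task is blocked on
     * this mutex, so the owner must issue a wakeup on unlock.
     * (2) MUTEX_FLAG_HANDOFF: set to keep tasks in the wait queue from starving. A woken
     * top waiter may still lose the lock to optimistic spinners that keep cutting in,
     * so it sets this flag after being woken. Once the flag is set, the owner hands the
     * lock directly to the top waiter on unlock instead of letting the woken top waiter
     * compete for it.
     * (3) MUTEX_FLAG_PICKUP: the handoff has been done and the mutex is only waiting for
     * the top waiter to pick it up.
     */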
    atomic_long_t    owner;
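    //spinlock protecting the wait_list
    spinlock_t        wait_lock;
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER //enabled by default
    /*
     * When MUTEX_SPIN_ON_OWNER is configured, the mutex supports optimistic spinning.
     * osq is the MCS lock queue that a task must hold while spinning optimistically.
     * It has a single member, tail: 0 means the queue is empty and nobody is spinning;
     * otherwise tail identifies the end of the queue. Note that tail is not a pointer
     * but a CPU number. optimistic_spin_node objects are per-CPU, so the CPU number is
     * enough to locate the optimistic_spin_node object.
     */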
    struct optimistic_spin_queue osq; /* Spinner MCS lock */
#endif
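    /*
     * A mutex is a sleeping lock. A task that cannot take the lock and does not qualify
     * for optimistic spinning is put on this wait queue until the owner releases the lock.
     */
    struct list_head    wait_list;
#ifdef CONFIG_DEBUG_MUTEXES //disabled by default
    //debug-related member
    void            *magic;
#endif
#ifdef CONFIG_DEBUG_LOCK_ALLOC //disabled by default
    //debug-related member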
    struct lockdep_map    dep_map;
#endif
    ANDROID_OEM_DATA_ARRAY(1, 2);
};

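Note: on a 64-bit machine, a structure that contains a member of 8 bytes or more is itself aligned to at least 8 bytes, so the lowest 3 bits of its address are always zero and can be used as flag bits.

Most members are easy to understand, except osq, whose working principle is illustrated as follows:

Note: the diagram chains struct optimistic_spin_node structures.

osq (optimistic spin queue) is, as the name says, a queue of tasks that are spinning. Unlike the wait queue, every task in this queue is currently running. osq does not link the task_struct of these tasks into a queue; instead it chains per-CPU MCS lock objects. Each MCS lock records a CPU number, which identifies the thread currently running on that CPU, and therefore the spinning task.

All of them spin, but not in the same way. The node at the head of the osq queue holds the osq lock, and only that task spins on the mutex owner (we call this mutex optimistic spinning). All other nodes in the osq queue spin on their own MCS lock (we call this MCS optimistic spinning). When the head releases its MCS lock (it has finished mutex optimistic spinning and taken the mutex), it passes the MCS lock to the next node, so the tasks in the spinner queue enter mutex optimistic spinning one by one, in order, which avoids the cost of cache-line bouncing.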

2. The waiter object

Because a mutex is a sleeping lock, waiting tasks have to be queued. In the kernel, struct mutex_waiter represents a task waiting for a mutex. Its members are:

/*
 * This is the control structure for tasks blocked on mutex,
 * which resides on the blocked task's kernel stack:
 */
struct mutex_waiter {
    //links this waiter into the mutex's wait_list
    struct list_head    list;
    //the task waiting for the mutex
    struct task_struct    *task;
    //wait/wound mutex context; not covered in depth in this article
    struct ww_acquire_ctx    *ww_ctx;
#ifdef CONFIG_DEBUG_MUTEXES //disabled by default
    //debug-related member
    void            *magic;
#endif
};

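3. The MCS lock object

The Linux kernel applies the optimistic spinning optimization to sleeping locks (such as mutex and rwsem), and this involves the MCS lock. struct optimistic_spin_node represents the MCS lock used for optimistic spinning. Its members are: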

/*
 * An MCS like lock especially tailored for optimistic spinning for sleeping
 * lock implementations (mutex, rwsem, etc).
 */
struct optimistic_spin_node {
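    /*
     * These two members chain the MCS lock objects into a doubly linked list, and the
     * sleeping lock has a member that refers to this list. For struct mutex it is the
     * osq member, which refers to the tail of the MCS lock list (the most recent task
     * that tried to take the lock, see the diagram above). The node at the head of the
     * list is the one entitled to spin optimistically on the mutex.
     */
    struct optimistic_spin_node *next, *prev;
    /* state of the MCS lock: 1 means held, 0 means not held */
    int locked; /* 1 if lock acquired */
    /* MCS locks are per-CPU; this identifies the CPU the node belongs to */
    int cpu; /* encoded CPU # + 1 value */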
};
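
The "encoded CPU # + 1" convention above can be illustrated with a small user-space sketch. It is loosely modeled on the idea described in the comments, not copied from the kernel: the array `nodes`, the `NR_CPUS` value of 4, and the struct name `spin_node` are assumptions made for illustration. The point is that a tail value of 0 can mean "empty queue" because CPU 0 is stored as 1.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4          /* hypothetical CPU count for this sketch */
#define OSQ_UNLOCKED_VAL 0 /* tail == 0 means the queue is empty */

/* Simplified stand-in for struct optimistic_spin_node. */
struct spin_node {
    struct spin_node *next, *prev;
    int locked; /* 1 if lock acquired */
    int cpu;    /* encoded CPU # + 1 value */
};

/* One node per CPU, standing in for the kernel's per-CPU variable. */
static struct spin_node nodes[NR_CPUS];

/* Encode a CPU number so that it can never collide with "empty" (0). */
static int encode_cpu(int cpu_nr)
{
    assert(cpu_nr >= 0 && cpu_nr < NR_CPUS);
    return cpu_nr + 1;
}

/* Map an encoded tail value back to that CPU's node. */
static struct spin_node *decode_cpu(int encoded)
{
    assert(encoded > OSQ_UNLOCKED_VAL && encoded <= NR_CPUS);
    return &nodes[encoded - 1];
}
```

Storing a small integer instead of a pointer keeps the tail field compact, and the per-CPU layout means the integer is enough to find the node.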

4. Flags and comments of the owner member

/*
 * @owner: contains: 'struct task_struct *' to the current lock owner, NULL means not owned. 
 * Since task_struct pointers are aligned at at least L1_CACHE_BYTES, we have low bits to 
 * store extra state.
 *
 * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
 * Bit1 indicates unlock needs to hand the lock to the top-waiter
 * Bit2 indicates handoff has been done and we're waiting for pickup.
 */
#define MUTEX_FLAG_WAITERS    0x01
#define MUTEX_FLAG_HANDOFF    0x02
#define MUTEX_FLAG_PICKUP    0x04

#define MUTEX_FLAGS        0x07
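
As a minimal user-space sketch of how a pointer and three flag bits can share one word, the helpers below pack and unpack an owner value. `struct fake_task`, `owner_pack`, `owner_task` and `owner_flags` are hypothetical names for this illustration; only the flag values come from the definitions above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MUTEX_FLAG_WAITERS ((uintptr_t)0x01)
#define MUTEX_FLAG_HANDOFF ((uintptr_t)0x02)
#define MUTEX_FLAG_PICKUP  ((uintptr_t)0x04)
#define MUTEX_FLAGS        ((uintptr_t)0x07)

/* Hypothetical stand-in for struct task_struct, aligned so the low 3 bits are zero. */
struct fake_task {
    _Alignas(8) int pid;
};

/* Pack a task pointer and flag bits into a single owner word. */
static uintptr_t owner_pack(const struct fake_task *task, uintptr_t flags)
{
    uintptr_t word = (uintptr_t)task;

    assert((word & MUTEX_FLAGS) == 0);   /* the pointer must leave the low bits free */
    assert((flags & ~MUTEX_FLAGS) == 0); /* only the three flag bits are allowed */
    return word | flags;
}

/* Extract the task pointer by masking off the flag bits. */
static struct fake_task *owner_task(uintptr_t owner)
{
    return (struct fake_task *)(owner & ~MUTEX_FLAGS);
}

/* Extract only the flag bits. */
static uintptr_t owner_flags(uintptr_t owner)
{
    return owner & MUTEX_FLAGS;
}
```

Keeping the task and the flags in one word is what lets the lock and unlock paths update both with a single atomic compare-and-exchange.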

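III. External interfaces

1. APIs exported by the mutex module

(1) Initialization

mutex_init //initialize an already defined mutex object
__mutex_init //like mutex_init, but allows more flexible setup of debug information
DEFINE_MUTEX //define and initialize a mutex object
__MUTEX_INITIALIZER //static initializer for a mutex that is embedded in another object's data structure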

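(2) Acquiring the mutex

mutex_lock //acquire the mutex; if it is not available, sleep in D state (TASK_UNINTERRUPTIBLE)
mutex_lock_interruptible //acquire the mutex; if it is not available, sleep in S state (TASK_INTERRUPTIBLE)
mutex_lock_killable //acquire the mutex; if it is not available, sleep in the killable state (TASK_KILLABLE)
mutex_lock_io //like mutex_lock, but marks the task as in iowait, so a failed attempt sleeps in D state as io wait
mutex_trylock //try to acquire the mutex without blocking; returns 1 on success and 0 on failure
mutex_trylock_recursive //like mutex_trylock, but additionally checks whether the current thread already holds the lock
atomic_dec_and_mutex_lock //combines an atomic decrement with locking: returns true, holding the lock, if the decrement brought the atomic variable to 0; otherwise returns false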

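(3) Releasing the mutex

mutex_unlock //release the mutex and leave the critical section

(4) Querying the mutex state

mutex_is_locked //returns true if the mutex is currently held by any task (including the caller), false otherwise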

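IV. Trying to acquire the lock

Unlike mutex_lock, mutex_trylock only attempts to take the lock. On success it returns true; on failure it does not block and simply returns false. The main logic is in __mutex_trylock_or_owner():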

/*
 * Trylock variant that returns the owning task on failure.
 */
static inline struct task_struct *__mutex_trylock_or_owner(struct mutex *lock)
{
    unsigned long owner, curr = (unsigned long)current;

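    /* snapshot the owner word before trying to modify it */
    owner = atomic_long_read(&lock->owner);
    /*
     * The owner member is an atomic variable and is read and updated with atomic
     * operations. Deciding whether the lock can be taken takes several steps, though,
     * and no synchronization mechanism (such as a spinlock) protects that sequence.
     * The steps are therefore placed in a for loop: at the end, the code checks
     * whether another thread modified owner in the meantime and, if so, starts over.
     */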
    for (;;) { /* must loop, can race against a flag */
        unsigned long old, flags = __owner_flags(owner);
        unsigned long task = owner & ~MUTEX_FLAGS;

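        /* a non-zero task means the lock is still held by some thread */
        if (task) {
            /*
             * task is non-zero and is not the current thread: the mutex is held by
             * another thread that has not released it (or the lock was handed
             * directly to another thread on release). Break out of the loop; the
             * attempt fails.
             */
            if (likely(task != curr))
                break;

            /*
             * The holder is the current task. If MUTEX_FLAG_PICKUP is set, the
             * previous owner has handed the mutex to this thread and is waiting
             * for it to pick the lock up. Without MUTEX_FLAG_PICKUP this is a
             * recursive lock attempt, so break out of the loop and fail.
             */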
            if (likely(!(flags & MUTEX_FLAG_PICKUP)))
                break;

            flags &= ~MUTEX_FLAG_PICKUP;
        } else {
#ifdef CONFIG_DEBUG_MUTEXES
            DEBUG_LOCKS_WARN_ON(flags & MUTEX_FLAG_PICKUP);
#endif
        }

        /*
         * We set the HANDOFF bit, we must make sure it doesn't live
         * past the point where we acquire it. This would be possible
         * if we (accidentally) set the bit on an unlocked mutex.
         */
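        /*
         * Two cases reach this point:
         * (1) task is zero: the mutex is unlocked.
         * (2) task is non-zero but equals the current thread, and a handoff has
         * happened: the lock was handed to the thread now trying to take it.
         * In both cases the lock can now be taken.
         */
        flags &= ~MUTEX_FLAG_HANDOFF;

        /*
         * Atomic compare-and-exchange: if lock->owner equals owner, set lock->owner
         * to curr|flags; otherwise leave it unchanged. The return value is the
         * previous value of lock->owner.
         */
        old = atomic_long_cmpxchg_acquire(&lock->owner, owner, curr | flags);
        /*
         * If the lock was taken (no other thread modified owner in between), return
         * NULL; the outer function negates the result, so mutex_trylock() returns
         * true. If owner changed, another thread got in between, so start over.
         */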
        if (old == owner)
            return NULL;

        owner = old;
    }

    /* failed to take the lock; return the task that holds it */
    return __owner_task(owner);
}
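
The retry loop above can be reproduced in user space with C11 atomics. The following is a sketch under stated assumptions, not the kernel implementation: tasks are represented by plain integer ids whose low 3 bits are zero, and `struct sketch_mutex`, `trylock_or_owner` and the `FLAG_*` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define FLAG_WAITERS ((uintptr_t)0x01)
#define FLAG_HANDOFF ((uintptr_t)0x02)
#define FLAG_PICKUP  ((uintptr_t)0x04)
#define FLAGS_MASK   ((uintptr_t)0x07)

struct sketch_mutex {
    _Atomic uintptr_t owner; /* task id in the high bits, flags in the low 3 */
};

/*
 * Try to take the lock for task id `curr` (its low 3 bits must be clear).
 * Returns 0 on success, or the id of the task recorded as owner on failure.
 */
static uintptr_t trylock_or_owner(struct sketch_mutex *lock, uintptr_t curr)
{
    uintptr_t owner = atomic_load_explicit(&lock->owner, memory_order_relaxed);

    assert(curr != 0 && (curr & FLAGS_MASK) == 0);

    for (;;) {
        uintptr_t flags = owner & FLAGS_MASK;
        uintptr_t task = owner & ~FLAGS_MASK;

        if (task) {
            /* Held by someone else, or by us without a pending handoff. */
            if (task != curr || !(flags & FLAG_PICKUP))
                break;
            flags &= ~FLAG_PICKUP; /* picking up a handed-off lock */
        }

        flags &= ~FLAG_HANDOFF; /* HANDOFF must not outlive the acquisition */

        /* On failure `owner` is refreshed with the observed value; retry. */
        if (atomic_compare_exchange_strong_explicit(&lock->owner, &owner,
                                                    curr | flags,
                                                    memory_order_acquire,
                                                    memory_order_relaxed))
            return 0;
    }

    return owner & ~FLAGS_MASK;
}
```

Note that WAITERS survives the acquisition while PICKUP and HANDOFF are dropped, which matches the flag handling described in the comments above.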

V. Acquiring the mutex

1. mutex_lock()

void __sched mutex_lock(struct mutex *lock)
{
    might_sleep();

    //fast path
    if (!__mutex_trylock_fast(lock))
        //slow path
        __mutex_lock_slowpath(lock);
}

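The might_sleep call here documents that mutex_lock may block if the mutex cannot be acquired. Functions that can block must not be called from atomic context (hard interrupt context, softirq context, while holding a spinlock, with preemption disabled, and so on), so might_sleep embeds a check for this: when mutex_lock is called from atomic context, the kernel prints a stack trace so the offending caller can be located. The check only takes effect when CONFIG_DEBUG_ATOMIC_SLEEP is set; without that option, might_sleep degenerates to might_resched.
With a preemptible kernel (CONFIG_PREEMPT) or a non-preemptible kernel (CONFIG_PREEMPT_NONE), might_resched is an empty function. With a voluntary-preemption kernel (CONFIG_PREEMPT_VOLUNTARY), might_resched calls _cond_resched to offer a voluntary rescheduling point. By adding these potential scheduling points inside might_sleep, the voluntary-preemption kernel achieves better latency than the non-preemptible kernel while keeping the task-switch overhead caused by preemption lower than that of the fully preemptible kernel.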

2. Fast path for locking

static __always_inline bool __mutex_trylock_fast(struct mutex *lock)
{
    unsigned long curr = (unsigned long)current;
    unsigned long zero = 0UL;

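    /*
     * If lock->owner equals 0 (the task_struct address is 0 and every flag is clear),
     * store the current thread's task_struct pointer into lock->owner, which marks the
     * mutex as held by the current thread.
     * If lock->owner is not 0, the mutex is held by another thread, is being handed to
     * the top waiter, or still has flag bits set, so the caller must fall back to the
     * slow path.
     * The compare and the store described above are one atomic operation; nothing can
     * slip in between them.
     */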
    if (atomic_long_try_cmpxchg_acquire(&lock->owner, &zero, curr))
        return true;

    return false;
}

/*
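 * Compares the value of *v with *old. If they are equal, stores new into *v and returns true.
 * Otherwise it does not store; it updates *old with the current value of *v and returns false.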
 */
bool atomic_long_try_cmpxchg_acquire(atomic_long_t *v, long *old, long new)

If the mutex is still free, __mutex_trylock_fast can take it directly. In other words, when no other thread holds the mutex, taking it is very cheap: a single compare-and-exchange.
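
The difference between the two compare-and-exchange styles used in this article can be sketched with C11 atomics in user space. The `sketch_*` functions are hypothetical wrappers written for this illustration; they mimic the calling conventions only and ignore the kernel's memory-ordering variants.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* cmpxchg style: returns the value that was in *v before the call. */
static long sketch_cmpxchg(_Atomic long *v, long old, long new_val)
{
    atomic_compare_exchange_strong(v, &old, new_val);
    return old; /* unchanged on success, refreshed with *v on failure */
}

/* try_cmpxchg style: returns success, and updates *old on failure. */
static bool sketch_try_cmpxchg(_Atomic long *v, long *old, long new_val)
{
    return atomic_compare_exchange_strong(v, old, new_val);
}

/* Fast path: succeed only if the owner word is exactly 0 (no task, no flags). */
static bool sketch_trylock_fast(_Atomic long *owner, long curr)
{
    long zero = 0;

    assert(curr != 0);
    return sketch_try_cmpxchg(owner, &zero, curr);
}
```

With the try_cmpxchg style the caller gets the observed value for free on failure, which is convenient in retry loops; the fast path does not retry, so it only needs the boolean.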

3. Slow path for locking

static noinline void __sched __mutex_lock_slowpath(struct mutex *lock)
{
    __mutex_lock(lock, TASK_UNINTERRUPTIBLE, 0, NULL, _RET_IP_);
}

static int __sched __mutex_lock(struct mutex *lock, long state, unsigned int subclass, struct lockdep_map *nest_lock, unsigned long ip)
{
    return __mutex_lock_common(lock, state, subclass, nest_lock, ip, NULL, false);
}

/*
 * Lock a mutex (possibly interruptible), slowpath:
 * arguments passed by mutex_lock: (lock, TASK_UNINTERRUPTIBLE, 0, NULL, _RET_IP_, NULL, false)
 * returns 0 when the lock is acquired
 */
static __always_inline int __sched __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
            struct lockdep_map *nest_lock, unsigned long ip, struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
{
    struct mutex_waiter waiter;
    struct ww_mutex *ww;
    int ret;

    //not used with the arguments passed by mutex_lock
    if (!use_ww_ctx)
        ww_ctx = NULL; //set to NULL; ww_ctx can be ignored from here on

    might_sleep();

#ifdef CONFIG_DEBUG_MUTEXES
    DEBUG_LOCKS_WARN_ON(lock->magic != lock);
#endif

    ww = container_of(lock, struct ww_mutex, base);
    if (ww_ctx) {
        if (unlikely(ww_ctx == READ_ONCE(ww->ctx)))
            return -EALREADY;

        /*
         * Reset the wounded flag after a kill. No other process can
         * race and wound us here since they can't have a valid owner
         * pointer if we don't have any locks held.
         */
        if (ww_ctx->acquired == 0)
            ww_ctx->wounded = 0;
    }

    /* note: preemption is disabled here */
    preempt_disable();

    mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);

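    /*
     * __mutex_trylock tries once more to take the lock (covered in the previous section);
     * mutex_optimistic_spin is the optimistic spinning part of the mutex.
     * If either of them takes the mutex, the function returns immediately. Because the
     * task never blocks on this path, it is also called the mid path.
     */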
    if (__mutex_trylock(lock) || mutex_optimistic_spin(lock, ww_ctx, NULL)) {
        /* got the lock, yay! */
        lock_acquired(&lock->dep_map, ip);
        if (ww_ctx)
            ww_mutex_set_context_fastpath(ww, ww_ctx);
        preempt_enable();
        return 0;
    }

    /* take lock->wait_lock */
    spin_lock(&lock->wait_lock);
    /*
     * After waiting to acquire the wait_lock, try again.
     */
    if (__mutex_trylock(lock)) {
        if (ww_ctx)
            __ww_mutex_check_waiters(lock, ww_ctx);

        goto skip_wait;
    }

    debug_mutex_lock_common(lock, &waiter);

    lock_contended(&lock->dep_map, ip);

    if (!use_ww_ctx) {
        /*
         * add waiting tasks to the end of the waitqueue (FIFO):
         *
         * Append the newly blocked thread to the tail of lock->wait_list. If it is the
         * first blocked thread, also set MUTEX_FLAG_WAITERS in lock->owner.
         */
        __mutex_add_waiter(lock, &waiter, &lock->wait_list);


#ifdef CONFIG_DEBUG_MUTEXES //disabled by default
        waiter.ww_ctx = MUTEX_POISON_WW_CTX;
#endif
    } else {
        /*
         * Add in stamp order, waking up waiters that must kill themselves.
         */
        ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx);
        if (ret)
            goto err_early_kill;

        waiter.ww_ctx = ww_ctx;
    }

    waiter.task = current;

    trace_android_vh_mutex_wait_start(lock);

    /* set the current thread to the state passed in; for mutex_lock this is TASK_UNINTERRUPTIBLE, i.e. D state */
    set_current_state(state);
    for (;;) {
        bool first;

        /*
         * Once we hold wait_lock, we're serialized against
         * mutex_unlock() handing the lock off to us, do a trylock
         * before testing the error conditions to make sure we pick up
         * the handoff.
         */
        if (__mutex_trylock(lock))
            goto acquired;

        /*
         * Check for signals and kill conditions while holding
         * wait_lock. This ensures the lock cancellation is ordered
         * against mutex_unlock() and wake-ups do not go missing.
         */
        /* with state = TASK_UNINTERRUPTIBLE this check always evaluates to 0 */
        if (signal_pending_state(state, current)) {
            ret = -EINTR;
            goto err;
        }

        if (ww_ctx) {
            ret = __ww_mutex_check_kill(lock, &waiter, ww_ctx);
            if (ret)
                goto err;
        }

        /* release lock->wait_lock */
        spin_unlock(&lock->wait_lock);

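        /*
         * Block: trigger a schedule and give up the CPU. The current context has
         * preemption disabled, so the preempt-disabled variant of schedule is used.
         * It disables preemption again before returning.
         */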
        schedule_preempt_disabled();

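        /*
         * After being woken, if this task is the first one in the wait queue (the top
         * waiter), it sets MUTEX_FLAG_HANDOFF in the mutex owner word. Even if it
         * fails to take the mutex this time (tasks spinning optimistically on the
         * mutex may grab it first), the owner will see the handoff flag on the next
         * unlock and hand the lock over directly instead of letting everyone race.
         * This keeps tasks in the spinner queue from repeatedly taking the lock and
         * starving the tasks in the wait queue.
         *
         * The task was woken by an unlock but is still on the list, so it has not won
         * the lock yet. If it has become the top waiter, setting the flag makes the
         * next unlock give the lock to the thread sleeping on lock->wait_list rather
         * than to an optimistic spinner.
         */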
        first = __mutex_waiter_is_first(lock, &waiter);
        if (first)
            __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);

        /* woken by an unlock: reset the task state, then try to take the lock again */
        set_current_state(state);
        /*
         * Here we order against unlock; we must either see it change
         * state back to RUNNING and fall through the next schedule(),
         * or we must see its unlock and acquire.
         */
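        /*
         * If the trylock succeeds, leave the loop. If this task is the top waiter of
         * the wait queue, it also spins optimistically, which gives it a better chance
         * of taking the mutex.
         * Otherwise it goes back to sleep and keeps waiting.
         */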
        if (__mutex_trylock(lock) || (first && mutex_optimistic_spin(lock, ww_ctx, &waiter)))
            break;

        spin_lock(&lock->wait_lock);
    }

    spin_lock(&lock->wait_lock);
acquired:
    /* the lock is held: set the task back to TASK_RUNNING */
    __set_current_state(TASK_RUNNING);
    trace_android_vh_mutex_wait_finish(lock);

    if (ww_ctx) {
        /*
         * Wound-Wait; we stole the lock (!first_waiter), check the
         * waiters as anyone might want to wound us.
         */
        if (!ww_ctx->is_wait_die && !__mutex_waiter_is_first(lock, &waiter))
            __ww_mutex_check_waiters(lock, ww_ctx);
    }

    /* remove the current thread from lock->wait_list */
    __mutex_remove_waiter(lock, &waiter);

    debug_mutex_free_waiter(&waiter);

skip_wait:
    /* got the lock - cleanup and rejoice! */
    lock_acquired(&lock->dep_map, ip);

    if (ww_ctx)
        ww_mutex_lock_acquired(ww, ww_ctx);

    spin_unlock(&lock->wait_lock);

    /* preemption is only re-enabled here */
    preempt_enable();

    return 0;

err:
    __set_current_state(TASK_RUNNING);
    trace_android_vh_mutex_wait_finish(lock);
    __mutex_remove_waiter(lock, &waiter);
err_early_kill:
    spin_unlock(&lock->wait_lock);
    debug_mutex_free_waiter(&waiter);
    mutex_release(&lock->dep_map, ip);
    preempt_enable();
    return ret;
}

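The slow path essentially blocks the current thread: the current task is appended to the tail of the mutex wait queue. This orders all tasks waiting for the mutex by arrival time, and when the mutex is released the task at the head of the queue, i.e. the one that has waited longest, is woken first. In addition, when the first task is inserted into an empty queue, MUTEX_FLAG_WAITERS is set in the mutex flags to indicate that a task is waiting for this mutex.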

3.1 __mutex_add_waiter()

/*
 * Add @waiter to a given location in the lock wait_list and set the FLAG_WAITERS flag if it's the first waiter.
 */
static void __mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter, struct list_head *list)
{
    bool already_on_list = false;
    debug_mutex_add_waiter(lock, waiter, current);

    trace_android_vh_alter_mutex_list_add(lock, waiter, list, &already_on_list);
    if (!already_on_list)
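        /* append the newly blocked task to the tail of the list */
        list_add_tail(&waiter->list, list);
    /*
     * list_first_entry(&lock->wait_list, struct mutex_waiter, list) == waiter
     * When the new waiter ends up first on lock->wait_list, i.e. the list has just
     * gained its first waiting task, set MUTEX_FLAG_WAITERS.
     */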
    if (__mutex_waiter_is_first(lock, waiter))
        __mutex_set_flag(lock, MUTEX_FLAG_WAITERS);
}
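
A simplified user-space sketch of the same bookkeeping: a FIFO wait list where the waiter that lands at the head sets the WAITERS flag. The singly linked list, `struct waiter`, `struct sketch_mutex` and `add_waiter` are assumptions made for illustration. The kernel uses a `list_head`, performs this under `wait_lock`, and sets the flag with an atomic operation, none of which is modeled here.

```c
#include <assert.h>
#include <stddef.h>

#define FLAG_WAITERS 0x01UL

/* Hypothetical waiter record; the kernel keeps this on the blocked task's stack. */
struct waiter {
    struct waiter *next;
    int task_id;
};

struct sketch_mutex {
    unsigned long owner;        /* only the flag bits matter in this sketch */
    struct waiter *head, *tail; /* FIFO wait list */
};

/* Append a waiter to the tail; the first waiter also sets FLAG_WAITERS. */
static void add_waiter(struct sketch_mutex *lock, struct waiter *w)
{
    assert(lock != NULL && w != NULL);

    w->next = NULL;
    if (lock->tail)
        lock->tail->next = w;
    else
        lock->head = w;
    lock->tail = w;

    /* Only the waiter that lands at the head needs to set the flag. */
    if (lock->head == w)
        lock->owner |= FLAG_WAITERS;
}
```

Setting the flag only for the first waiter is enough, because the flag stays set for as long as the list is non-empty.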

3.2 mutex_optimistic_spin()

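The idea behind optimistic spinning is that the mutex may be held by a thread that is currently running on another CPU. If the critical section is short, the mutex is likely to be released soon, and spinning is then cheaper than a context switch, whose cost is not small. Optimistic spinning is built on an MCS lock.

The function is analyzed in the next section.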

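VI. Optimistic spinning

1. The optimistic spinning code lives in mutex_optimistic_spin. A thread that enters it ends in one of the following ways:
(1) It takes the osq lock and spins on the mutex. When the owner releases the mutex, the thread stops spinning, takes the mutex, and the function returns true.
(2) It fails to take the osq lock immediately and spins on its own MCS lock. Once it gets the osq lock, it continues as in (1).
(3) While spinning on the mutex or on the MCS lock, it can no longer continue for some reason (for example, the owner blocked). mutex_optimistic_spin then returns false to tell the caller that spinning failed and that it should join the wait queue.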

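/*
 * Optimistic spinning.
 *
 * We try to spin for acquisition when we find that the lock owner is currently running on a (different) CPU and
 * while we don't need to reschedule. The rationale is that if the lock owner is running, it is likely to release
 * the lock soon.
 *
 * The mutex spinners are queued up using an MCS lock so that only one spinner can compete for the mutex. However,
 * if mutex spinning isn't going to happen, there is no point in going through the lock/unlock overhead.
 *
 * Returns true when the lock was taken, otherwise false, indicating that we need to jump to the slowpath and sleep.
 *
 * The waiter flag is set to true if the spinner is a waiter in the wait queue. The waiter-spinner will spin on the
 * lock directly and concurrently with the spinner at the head of the OSQ, if present, until the owner is changed
 * to itself.
 */
/* called twice from __mutex_lock_common(): once with (lock, NULL, NULL) and once with (lock, NULL, &waiter) */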
static __always_inline bool mutex_optimistic_spin(struct mutex *lock, struct ww_acquire_ctx *ww_ctx, struct mutex_waiter *waiter)
{
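    /*
     * When waiter is NULL this is a normal lock request, so the osq lock must be taken
     * before spinning. Only the thread holding the osq lock may spin on the mutex;
     * the others spin on their own MCS locks.
     */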
    if (!waiter) {
        /*
         * The purpose of the mutex_can_spin_on_owner() function is to eliminate the overhead of osq_lock() and
         * osq_unlock() in case spinning isn't possible. As a waiter-spinner is not going to take OSQ lock anyway, 
         * there is no need to call mutex_can_spin_on_owner().
         */
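        /*
         * Whether to spin can be looked at from two angles:
         * (1) If need_resched is already set on this CPU, some other task wants to
         * preempt the task that is trying to take the lock. The current task should
         * not spin; it should go to sleep and make way for the other task.
         * (2) The owner's behavior. If the owner is running on another CPU, spinning
         * is worth considering.
         */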
        if (!mutex_can_spin_on_owner(lock))
            goto fail;

        /*
         * In order to avoid a stampede of mutex spinners trying to acquire the mutex all at once, the spinners need
         * to take a MCS (queued) lock first before spinning on the owner field.
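         */
        /*
         * On a shared-memory multicore system, the mutex is implemented with one shared
         * variable (the owner member) and a queue. If threads on many CPUs spin on
         * that shared variable at the same time, the cache line bounces between CPUs.
         * To avoid this, the number of threads polling owner must be limited: only the
         * thread holding the osq lock may spin on the mutex. Threads that fail to take
         * the osq lock spin on their MCS locks and wait for the osq lock holder (which
         * is currently spinning on the mutex) to release it.
         * The details of the osq lock are covered in other articles.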
         */
        if (!osq_lock(&lock->osq))
            goto fail;
    }

    /*
     * Once the osq lock is held (or the caller is the woken top waiter, which passes a
     * non-NULL waiter and skips the osq lock), the thread can spin on the mutex:
     */
    for (;;) {
        struct task_struct *owner;

        /* Try to acquire the mutex... */
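        /*
         * First call __mutex_trylock_or_owner to try to take the mutex. A non-NULL
         * return value (note that owner here excludes the mutex flag bits) means the
         * mutex is still in the hands of the owner task. A NULL return value means the
         * previous owner released the lock and the current thread took it, so leave
         * the spin loop.
         * After leaving the loop the osq lock is released, which lets the next task
         * spinning on its MCS lock in the spinner queue start spinning on the mutex.
         */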
        owner = __mutex_trylock_or_owner(lock);
        if (!owner)
            break;

        /* There's an owner, wait for it to either release the lock or go to sleep. */
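        /*
         * A non-NULL owner means the current thread failed to take the lock, so it
         * spins on the mutex. This is not spinning on a spinlock: it repeatedly checks
         * whether the owner task has changed and whether it is still running. If the
         * owner blocks, or the current CPU needs to reschedule (a higher-priority task
         * may have been woken), mutex_spin_on_owner stops spinning and returns false,
         * and the code goes to fail_unlock.
         * If the owner task changes (for example to NULL), mutex_spin_on_owner returns
         * true, and the for loop tries to take the lock again and, if needed, keeps
         * spinning.
         */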
        if (!mutex_spin_on_owner(lock, owner, ww_ctx, waiter))
            goto fail_unlock;

        /*
         * The cpu_relax() call is a compiler barrier which forces
         * everything in this loop to be re-loaded. We don't need
         * memory barriers as we'll eventually observe the right
         * values at the cost of a few extra spins.
         */
        cpu_relax(); //acts as a compiler barrier (see the comment above), not a memory barrier
    }

    if (!waiter)
        osq_unlock(&lock->osq);

    /* took the lock while spinning on the lock owner: return true */
    return true;


fail_unlock:
    if (!waiter)
        osq_unlock(&lock->osq);

fail:
    /*
     * If we fell out of the spin path because of need_resched(), reschedule now, before we try-lock the mutex. 
     * This avoids getting scheduled out right after we obtained the mutex.
     */
    if (need_resched()) {
        /*
         * We _should_ have TASK_RUNNING here, but just in case we do not, make it so, otherwise we might get
         * stuck.
         * On this path the current thread may not be on any wait queue. If it were scheduled out in
         * TASK_UNINTERRUPTIBLE state, nothing would ever wake it up again.
         */
        __set_current_state(TASK_RUNNING);
        schedule_preempt_disabled();
    }

    /* failed to take the lock: return false. fail_unlock is also reached when the lock owner has blocked */
    return false;
}

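__mutex_lock_common() calls mutex_optimistic_spin() in two places. In the first, waiter is NULL: this happens early in mutex_lock, before the thread trying to take the lock has joined the wait queue. The second is the spin after the thread has failed to take the lock, joined the wait queue, and been woken; at that point it is on the wait queue, so waiter is non-NULL. In this case the freshly woken top waiter gets preferential treatment: it can go straight into optimistic spinning without taking the osq lock.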

2. mutex_can_spin_on_owner()

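/* returns 0 if the current task needs to reschedule or the lock owner is not running on a CPU; returns 1 if the owner is running or the lock has no owner */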
static inline int mutex_can_spin_on_owner(struct mutex *lock)
{
    struct task_struct *owner;
    int retval = 1;

    if (need_resched())
        return 0;

    rcu_read_lock();
    //__mutex_owner() strips the lowest 3 flag bits from lock->owner
    owner = __mutex_owner(lock);

    /*
     * As lock holder preemption issue, we both skip spinning if task is not
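     * on cpu or its cpu is preempted.
     *
     * vcpu_is_preempted() evaluates to false unless the architecture overrides it
     * (for example for paravirtualized guests), so in that case this only checks
     * whether lock->owner is currently running on a CPU.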
     */
    if (owner)
        retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
    rcu_read_unlock();

    /*
     * If lock->owner is not set, the mutex has been released. Return true
     * such that we'll trylock in the spin path, which is a faster option
     * than the blocking slow path.
     */
    return retval;
}

3. mutex_spin_on_owner()

/*
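 * Called with (lock, owner, NULL, NULL) or (lock, owner, NULL, waiter); owner is the current owner of the mutex.
 * Return value: true if the mutex owner differs from the owner argument. Otherwise spin on mutex->owner until the
 * mutex owner differs from the owner argument (i.e. the owner changed) and then return true. Spinning stops with
 * false if the owner argument is no longer running on a CPU or the current thread is about to be scheduled out.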
 */
static noinline bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner, struct ww_acquire_ctx *ww_ctx, struct mutex_waiter *waiter)
{
    bool ret = true;

    rcu_read_lock();
    while (__mutex_owner(lock) == owner) {
        /*
         * Ensure we emit the owner->on_cpu, dereference _after_
         * checking lock->owner still matches owner. If that fails,
         * owner might point to freed memory. If it still matches,
         * the rcu_read_lock() ensures the memory stays valid.
         */
        barrier();

        /* Use vcpu_is_preempted to detect lock holder preemption issue.*/
        /* stop spinning if the mutex owner is not running on a CPU or the spinning thread is about to be scheduled out */
        if (!owner->on_cpu || need_resched() || vcpu_is_preempted(task_cpu(owner))) {
            ret = false;
            break;
        }

        if (ww_ctx && !ww_mutex_spin_on_owner(lock, ww_ctx, waiter)) {
            ret = false;
            break;
        }

        cpu_relax();
    }
    rcu_read_unlock();

    return ret;
}

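If the lock owner goes to sleep, mutex_spin_on_owner() returns false. __mutex_lock_common() then adds the current task to the wait list and calls schedule_preempt_disabled(), so the task sleeps in D state.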
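
The decision made on each iteration of the spin loop can be written as a pure function. This is an illustrative sketch only: `struct spin_view`, `enum spin_step` and `spin_decide` are hypothetical, the observed state is passed in as plain fields instead of being read from `lock->owner` and `owner->on_cpu`, and the vcpu and ww_mutex checks are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One observation taken by the spinner (hypothetical structure). */
struct spin_view {
    int lock_owner;    /* task currently recorded as the lock owner */
    bool owner_on_cpu; /* is that task running on a CPU right now? */
    bool need_resched; /* does the spinning CPU need to reschedule? */
};

enum spin_step {
    SPIN_AGAIN,         /* owner unchanged and running: keep spinning */
    SPIN_OWNER_CHANGED, /* owner changed: go back and retry the trylock */
    SPIN_STOP           /* spinning is pointless: fall back to sleeping */
};

/* Decide what the spinner does next, given who it expected to own the lock. */
static enum spin_step spin_decide(const struct spin_view *v, int expected_owner)
{
    assert(v != NULL);

    if (v->lock_owner != expected_owner)
        return SPIN_OWNER_CHANGED;
    if (!v->owner_on_cpu || v->need_resched)
        return SPIN_STOP;
    return SPIN_AGAIN;
}
```

The owner check comes first on purpose: the owner's run state is only meaningful while the lock still records the same owner.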

VII. Releasing the mutex

1. mutex_unlock()

void __sched mutex_unlock(struct mutex *lock)
{
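#ifndef CONFIG_DEBUG_LOCK_ALLOC //CONFIG_DEBUG_LOCK_ALLOC is disabled by default
    /*
     * If a thread took a mutex and no other thread tried to enter the critical section
     * afterwards, the owner member is just that thread's task_struct address and
     * all mutex flags are clear. In that case clearing owner is all that is needed.
     * This is the unlock fast path:
     */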
    if (__mutex_unlock_fast(lock))
        return;
#endif
    /*
     * If other threads are competing for the mutex, things are more involved and the
     * slow path is taken. Its logic has two parts: releasing the mutex, and waking the
     * top waiter.
     */
    __mutex_unlock_slowpath(lock, _RET_IP_);
}

static __always_inline bool __mutex_unlock_fast(struct mutex *lock)
{
    unsigned long curr = (unsigned long)current;

    //the low 3 bits of lock->owner are flags; if they are all 0 and current holds the lock, store 0 and return true
    if (atomic_long_cmpxchg_release(&lock->owner, curr, 0UL) == curr)
        return true;

    return false;
}

/*
 * Release the lock, slowpath:
 */
static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned long ip)
{
    struct task_struct *next = NULL;
    DEFINE_WAKE_Q(wake_q);
    unsigned long owner;

    mutex_release(&lock->dep_map, ip);

    /*
     * Release the lock before (potentially) taking the spinlock such that
     * other contenders can get on with things ASAP.
     *
     * Except when HANDOFF, in that case we must not clear the owner field,
     * but instead set it to the top waiter.
     */
    owner = atomic_long_read(&lock->owner);
    for (;;) {
        unsigned long old;

#ifdef CONFIG_DEBUG_MUTEXES //disabled by default
        DEBUG_LOCKS_WARN_ON(__owner_task(owner) != current);
        DEBUG_LOCKS_WARN_ON(owner & MUTEX_FLAG_PICKUP);
#endif

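        /*
         * If the handoff flag is set in the mutex flags, the owner must actively pass
         * ownership to the top waiter on unlock, so that optimistic spinners arriving
         * later cannot starve the top waiter. The lock therefore must not be released
         * here; it is handed to the top waiter in __mutex_handoff.
         */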
        if (owner & MUTEX_FLAG_HANDOFF)
            break;

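        /*
         * Without the handoff flag, continue: clear the task_struct address part of
         * owner, which means the owner task gives up the lock. If a task is spinning
         * and polling the mutex owner at this point, it notices the release at once
         * and can take the mutex immediately. In that case the top waiter woken
         * below arrives too late.
         */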
        old = atomic_long_cmpxchg_release(&lock->owner, owner, __owner_flags(owner));
        if (old == owner) {
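            /*
             * The lock was released. If tasks are blocked on this mutex in the wait
             * queue, leave the loop and run the second part of the slow path, the
             * wakeup logic; otherwise return, since nobody needs to be woken.
             */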
            if (owner & MUTEX_FLAG_WAITERS)
                break;

            return;
        }

        /*
         * If another thread modified owner in the meantime (no synchronization
         * mechanism serializes concurrent updates of owner), reload owner and check
         * again.
         */
        owner = old;
    }
    
    //the second part, waking the top waiter, follows:
    
    spin_lock(&lock->wait_lock);
    debug_mutex_unlock(lock);
    /*
     * At this point the top waiter must be woken, or the lock must be handed to it.
     * Either way, find the top waiter in the wait queue and add it to the wake queue.
     */
    if (!list_empty(&lock->wait_list)) {
        /* get the first entry from the wait-list: */
        struct mutex_waiter *waiter = list_first_entry(&lock->wait_list, struct mutex_waiter, list);
        next = waiter->task;
        debug_mutex_wake_waiter(lock, waiter);
        //append the task to the tail of the wake queue (wake_q), not the wait queue
        wake_q_add(&wake_q, next);
    }

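    //in the handoff case the owner only releases the lock here, handing it directly to the top waiter
    /*
     * If a task (normally the top waiter, see the code it runs after wakeup) has requested
     * a handoff, this call sets owner directly to the top waiter task, which then only has
     * to pick the lock up after it wakes. This gives the top waiter a privilege that keeps
     * it from being starved of the lock by optimistic spinners that keep cutting in.
     */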
    if (owner & MUTEX_FLAG_HANDOFF)
        __mutex_handoff(lock, next);

    trace_android_vh_mutex_unlock_slowpath(lock);
    spin_unlock(&lock->wait_lock);

    /* wake the top waiter; wake_q holds a single task, so only one task is woken */
    wake_up_q(&wake_q);
}
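
The release decision made by the fast path and the first part of the slow path can be condensed into one user-space function. This is a sketch under stated assumptions: tasks are integer ids with the low 3 bits clear, and `unlock_release` and `enum unlock_action` are hypothetical names. It only reports what the caller would have to do next (nothing, wake the first waiter, or hand the lock off); the wait list and the wakeup itself are not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define FLAG_WAITERS ((uintptr_t)0x01)
#define FLAG_HANDOFF ((uintptr_t)0x02)
#define FLAGS_MASK   ((uintptr_t)0x07)

enum unlock_action { UNLOCK_DONE, UNLOCK_WAKE_FIRST, UNLOCK_HANDOFF };

/* Release the owner word held by task id `curr` and report the follow-up step. */
static enum unlock_action unlock_release(_Atomic uintptr_t *owner_word, uintptr_t curr)
{
    uintptr_t owner = curr;

    /* Fast path: no flag is set, so the word is exactly `curr`. */
    if (atomic_compare_exchange_strong_explicit(owner_word, &owner, 0,
                                                memory_order_release,
                                                memory_order_relaxed))
        return UNLOCK_DONE;

    /* Slow path: `owner` now holds the value that was actually observed. */
    for (;;) {
        uintptr_t released = owner & FLAGS_MASK; /* drop the task, keep the flags */

        assert((owner & ~FLAGS_MASK) == curr); /* only the holder may unlock */

        if (owner & FLAG_HANDOFF)
            return UNLOCK_HANDOFF; /* keep ownership until it is handed over */

        if (atomic_compare_exchange_strong_explicit(owner_word, &owner, released,
                                                    memory_order_release,
                                                    memory_order_relaxed))
            return (owner & FLAG_WAITERS) ? UNLOCK_WAKE_FIRST : UNLOCK_DONE;
    }
}
```

In the handoff case the owner word is deliberately left untouched, which is what prevents a spinner from taking the lock before it is passed to the top waiter.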

2. __mutex_handoff()

/*
 * Give up ownership to a specific task, when @task = NULL, this is equivalent
 * to a regular unlock. Sets PICKUP on a handoff, clears HANDOF, preserves
 * WAITERS. Provides RELEASE semantics like a regular unlock, the
 * __mutex_trylock() provides a matching ACQUIRE semantics for the handoff.
 */
/*
 * Purpose: hand the lock over to the thread task
 */
static void __mutex_handoff(struct mutex *lock, struct task_struct *task)
{
    unsigned long owner = atomic_long_read(&lock->owner);

    for (;;) {
        unsigned long old, new;

#ifdef CONFIG_DEBUG_MUTEXES
        DEBUG_LOCKS_WARN_ON(__owner_task(owner) != current);
        DEBUG_LOCKS_WARN_ON(owner & MUTEX_FLAG_PICKUP);
#endif

        new = (owner & MUTEX_FLAG_WAITERS); //keep only the MUTEX_FLAG_WAITERS bit of owner
        new |= (unsigned long)task;
        if (task)
            new |= MUTEX_FLAG_PICKUP;

        /* the handoff to task succeeded: leave the loop */
        old = atomic_long_cmpxchg_release(&lock->owner, owner, new);
        if (old == owner)
            break;

        /* lock->owner changed in the meantime: update owner and retry. old is the new value of lock->owner */
        owner = old;
    }
}
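
The flag arithmetic of the handoff can be tried out in user space as well. The sketch below is an assumption-laden illustration rather than the kernel code: tasks are integer ids with the low 3 bits clear, and `sketch_handoff` is a hypothetical name. It shows that WAITERS is preserved, HANDOFF is dropped, and PICKUP is set only when there is a task to hand the lock to.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define FLAG_WAITERS ((uintptr_t)0x01)
#define FLAG_HANDOFF ((uintptr_t)0x02)
#define FLAG_PICKUP  ((uintptr_t)0x04)
#define FLAGS_MASK   ((uintptr_t)0x07)

/* Hand the lock to task id `next`; next == 0 behaves like a plain unlock. */
static void sketch_handoff(_Atomic uintptr_t *owner_word, uintptr_t next)
{
    uintptr_t owner = atomic_load_explicit(owner_word, memory_order_relaxed);
    uintptr_t new_word;

    assert((next & FLAGS_MASK) == 0);

    do {
        new_word = (owner & FLAG_WAITERS) | next; /* keep WAITERS, drop HANDOFF */
        if (next)
            new_word |= FLAG_PICKUP; /* the new owner still has to pick it up */
        /* On failure `owner` is refreshed and new_word is recomputed. */
    } while (!atomic_compare_exchange_weak_explicit(owner_word, &owner, new_word,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}
```

After the handoff, a trylock by any other task fails because the task part is non-zero, while a trylock by the new owner succeeds because PICKUP is set.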

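VIII. Summary

1. When the fast path fails to take the mutex, __mutex_lock_slowpath() is called to enter the slow path. If the lock still cannot be taken there, the current thread sleeps in D state. The main logic is in __mutex_lock_common().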

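IX. Supplementary notes

1. static __always_inline bool atomic_long_try_cmpxchg_acquire(atomic_long_t *v, long *old, long new) takes three arguments, from left to right: the value pointer, the old pointer, and new. It compares *v with *old; if they are equal, it stores new into *v and returns true. Otherwise it does not store, updates *old with the current value of *v, and returns false. Functions such as static __always_inline long atomic_long_cmpxchg_acquire(atomic_long_t *v, long old, long new) atomically compare *v with old; if they are equal, they store new into *v, otherwise they do nothing. The return value is the old value of *v.

2. osq_lock is used by the mutex in mutex_optimistic_spin(). It is also used by rwsem (read-write semaphores).

3. int atomic_xchg(atomic_t *v, int new): stores new into v->counter and returns the old value of v->counter. The operation is fully ordered, which includes load-acquire and store-release semantics.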

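4. The mutex handoff flag

In the Linux kernel mutex, a handoff takes place as follows:

(1) The first task in the wait queue is woken but may fail to take the lock
After the first task on the mutex wait list (wait_list) is woken, the lock may already have been taken by a spinning task (for example, the owner releases the lock and a spinner grabs it immediately), so the woken task cannot take the lock. The handoff mechanism exists for this case.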

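(2) The woken top waiter requests the handoff
In the __mutex_lock_common() code shown above, once the woken task finds that it is the first waiter on wait_list, it sets the MUTEX_FLAG_HANDOFF flag (bit1 of owner).

That code sets the flag without first checking whether the current owner is running.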
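(3) Forced handoff on the next unlock
Once the HANDOFF flag is set, the next time the owner releases the lock it must hand the lock directly to that waiter (even if other tasks are spinning), bypassing the normal competition and guaranteeing fairness.

Root cause: handoff addresses the fairness problem introduced by the spinning optimization. It prevents tasks that have waited a long time (especially those already asleep) from being starved by later spinners that keep jumping the queue. It is a trade-off: some performance is sacrificed (spinners may be delayed) to guarantee basic fairness.

(4) When the flag is cleared: (1) the mutex is unlocked; (2) a handoff has happened and the lock was handed to the thread now trying to take it. In both cases the lock can be taken, and the flag is cleared as part of the acquisition.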

Reference:

https://mp.weixin.qq.com/s/ftb6fYP26ZBNS-_CNkFyTg

posted on 2022-05-09 23:23 Hello-World3
