Linux Source Code Reading (16): Red-Black Trees in the Kernel Applied to Virtual Memory Management

  1. Another very important place where the Linux kernel exploits the fast, stable insert/delete/lookup of red-black trees is virtual memory management! Earlier articles covered the buddy and slab allocators, which manage physical pages. Because physical memory used to be far smaller than the virtual address space, and physical pages only need to be allocated, freed, and coalesced, no tree structure was needed there; a plain linked list was enough. Virtual memory is different: on a 32-bit system, for example, the virtual address space is 4GB and can be carved into a great many regions, and after carving we need fast insertion, deletion, and lookup of those regions (code is fetched, data is read and written, shared libraries are loaded, all constantly), so a red-black tree is a perfect fit! As usual, let's start with the structures:

  The task_struct embeds a pointer to an mm_struct, and this structure is where the real action is:

struct task_struct {
.......
    struct mm_struct *mm;
.......
}

  Going one level deeper, we find a red-black tree root! But what are the two vm_area_struct pointers for?

struct mm_struct {
          struct vm_area_struct * mmap;       /* list of VMAs */
          struct rb_root mm_rb;               /* root of the VMA red-black tree, again */
          struct vm_area_struct * mmap_cache;      /* last find_vma result */
.......
}

  Digging further:

  • See the rb_node structure? That is clearly a red-black tree node, and together with the rb_root above it forms a red-black tree! (A minimal sketch of how the embedded node maps back to its VMA follows the struct below.)
  • A process's virtual address space is divided into a number of regions, each with its own attributes and purpose; any valid address falls inside exactly one region, and regions never overlap. In the Linux kernel such a region is called a virtual memory area, or VMA for short. A vma is the abstraction of one contiguous range of linear addresses, and it carries its own permissions (readable, writable, executable, and so on).
struct vm_area_struct {
    struct mm_struct * vm_mm;    /* the mm_struct this VMA belongs to */
    unsigned long vm_start;    /* start address of the VMA */
    unsigned long vm_end;        /* end address of the VMA */
 
    /* previous and next VMA in the per-process VMA list; the list is sorted by address */
    struct vm_area_struct *vm_next, *vm_prev;
 
    pgprot_t vm_page_prot;        /* access permissions of this VMA */
    unsigned long vm_flags;    /* flags */
 
    struct rb_node vm_rb;      /* this VMA's node in the red-black tree */
        ...............
}
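
  To see concretely how the tree gets from an rb_node back to the VMA that contains it, here is a minimal user-space sketch of the container_of() arithmetic behind the kernel's rb_entry() macro. Everything with a _demo suffix is an illustrative stand-in, not real kernel code:

#include <stdio.h>
#include <stddef.h>

struct rb_node_demo {
    struct rb_node_demo *rb_left;
    struct rb_node_demo *rb_right;
};

struct vma_demo {
    unsigned long vm_start;
    unsigned long vm_end;
    struct rb_node_demo vm_rb;    /* tree node embedded inside the VMA */
};

/* Same idea as the kernel's rb_entry(ptr, type, member): subtract the
 * member's offset from the node pointer to recover the enclosing struct. */
#define rb_entry_demo(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

int main(void)
{
    struct vma_demo vma = { .vm_start = 0x400000, .vm_end = 0x401000 };
    struct rb_node_demo *node = &vma.vm_rb;    /* what the tree stores */

    /* Walk back from the embedded node to the VMA, as find_vma() does. */
    struct vma_demo *v = rb_entry_demo(node, struct vma_demo, vm_rb);
    printf("vma: [%#lx, %#lx)\n", v->vm_start, v->vm_end);
    return 0;
}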

   To show how these structures relate, I drew a diagram for reference; the key points are:

  • vm_area_struct has the vm_start and vm_end fields, which hold the start and end addresses of the virtual memory area;
  • the vm_rb nodes of the vm_area_struct instances form a red-black tree, so a target area can be located quickly by some criterion;
  • the vm_area_struct instances are also chained into a linked list, used mainly for traversal: there is no pre-order/in-order/post-order machinery as with a tree, so it is a bit faster, and a tree traversal needs recursion or an auxiliary stack/queue, costing O(N) extra space (see the sketch right after this list).
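
  As a quick illustration, here is a self-contained sketch of that linear list walk, the same pattern the kernel follows whenever it needs every VMA in address order (for example when producing /proc/<pid>/maps). The simplified vma_demo type is illustrative only:

#include <stdio.h>

struct vma_demo {
    unsigned long vm_start, vm_end;
    struct vma_demo *vm_next;    /* address-ordered singly linked list */
};

static void walk_vmas(struct vma_demo *mmap_head)
{
    /* Plain linear walk: no recursion, no auxiliary stack or queue. */
    for (struct vma_demo *vma = mmap_head; vma; vma = vma->vm_next)
        printf("[%#lx, %#lx)\n", vma->vm_start, vma->vm_end);
}

int main(void)
{
    struct vma_demo c = { 0x7ffff000, 0x80000000, NULL };
    struct vma_demo b = { 0x601000, 0x602000, &c };
    struct vma_demo a = { 0x400000, 0x401000, &b };

    walk_vmas(&a);
    return 0;
}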

  

   2. (1) With the structures defined, the next step is to operate on them. Since a red-black tree is used to manage the VMAs, the first job is of course building the tree and the list (all of these operations live in mm/mmap.c). The most direct API is the __vma_link function:

/* hang a newly created vm_area_struct on the red-black tree managed by mm_struct */
static void
__vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
    struct vm_area_struct *prev, struct rb_node **rb_link,
    struct rb_node *rb_parent)
{
    __vma_link_list(mm, vma, prev, rb_parent);
    __vma_link_rb(mm, vma, rb_link, rb_parent);
}

  It calls two functions; the names alone tell you one builds the list and the other builds the red-black tree. First __vma_link_list, below: it hooks the vma into the list headed by mm->mmap (either as the new head or right after prev), then points the vma's next and prev pointers at its neighbors, and the list insertion is done!

void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
        struct vm_area_struct *prev, struct rb_node *rb_parent)
{
    struct vm_area_struct *next;

    vma->vm_prev = prev;
    if (prev) {
        /* insert after prev: our successor is prev's old successor */
        next = prev->vm_next;
        prev->vm_next = vma;
    } else {
        /* no predecessor: vma becomes the new list head */
        mm->mmap = vma;
        if (rb_parent)
            /* in that case the rbtree insertion parent, if any, is the
             * lowest existing VMA, i.e. our successor in the list */
            next = rb_entry(rb_parent,
                    struct vm_area_struct, vm_rb);
        else
            next = NULL;
    }
    vma->vm_next = next;
    if (next)
        next->vm_prev = vma;
}

  The other function, __vma_link_rb, inserts the node into the red-black tree:

void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
        struct rb_node **rb_link, struct rb_node *rb_parent)
{
    /* Update tracking information for the gap following the new vma. */
    if (vma->vm_next)
        vma_gap_update(vma->vm_next);
    else
        mm->highest_vm_end = vma->vm_end;

    /*
     * vma->vm_prev wasn't known when we followed the rbtree to find the
     * correct insertion point for that vma. As a result, we could not
     * update the vma vm_rb parents rb_subtree_gap values on the way down.
     * So, we first insert the vma with a zero rb_subtree_gap value
     * (to be consistent with what we did on the way down), and then
     * immediately update the gap to the correct value. Finally we
     * rebalance the rbtree after all augmented values have been set.
     */
    rb_link_node(&vma->vm_rb, rb_parent, rb_link);
    vma->rb_subtree_gap = 0;
    vma_gap_update(vma);
    /* link the vma into the red-black tree */
    vma_rb_insert(vma, &mm->mm_rb);
}
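
  Notice rb_subtree_gap and vma_gap_update() above: the VMA tree is an augmented red-black tree in which every node caches the largest free gap found anywhere in its subtree, which is what later lets vm_unmapped_area() locate a free range of a given size in O(log n). Below is a hedged, simplified sketch of the invariant being maintained; the type and field names are illustrative, not the kernel's exact code:

struct vma_node_demo {
    unsigned long vm_start, vm_end;
    unsigned long cached_gap;            /* plays the role of rb_subtree_gap */
    struct vma_node_demo *left, *right;  /* tree children */
    struct vma_node_demo *prev;          /* previous VMA by address */
};

/* Invariant recomputed on insert/erase/rotation: a node's cached gap is
 * the largest of (a) the hole just below its own VMA and (b) the cached
 * gaps of its two children. */
static unsigned long subtree_gap_demo(const struct vma_node_demo *n)
{
    unsigned long max = n->vm_start - (n->prev ? n->prev->vm_end : 0);

    if (n->left && n->left->cached_gap > max)
        max = n->left->cached_gap;
    if (n->right && n->right->cached_gap > max)
        max = n->right->cached_gap;
    return max;
}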

  (2) Once the code above has run, the red-black tree is built! Next up is lookup. Because the tree manages vmas, the key used when building it is the vma's linear address: in any node's left subtree every vma has a lower address, and in the right subtree every vma has a higher address. The code:

/* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
    struct rb_node *rb_node;
    struct vm_area_struct *vma;

    /* Check the cache first. */
    vma = vmacache_find(mm, addr);
    if (likely(vma))
        return vma;
    /* root node of the red-black tree */
    rb_node = mm->mm_rb.rb_node;

    while (rb_node) {
        struct vm_area_struct *tmp;

        tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);

        if (tmp->vm_end > addr) {
            vma = tmp;
            /* addr falls between this vma's vm_start and vm_end: found it */
            if (tmp->vm_start <= addr)
                break;
            /* addr is below vm_start: keep looking in the left subtree */
            rb_node = rb_node->rb_left;
        } else /* addr is at or above vm_end: keep looking in the right subtree */
            rb_node = rb_node->rb_right;
    }

    if (vma)
        vmacache_update(addr, vma);
    return vma;
}
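
  One caveat: find_vma() only guarantees addr < vm_end, so the caller must still check vm_start itself. As an illustration, this is (paraphrased, not quoted verbatim) the classic pattern the x86 page-fault handler uses to validate a faulting address:

    vma = find_vma(mm, address);
    if (!vma)                      /* no VMA ends above address: bad access */
        goto bad_area;
    if (vma->vm_start <= address)  /* address lies inside this VMA: valid */
        goto good_area;
    if (!(vma->vm_flags & VM_GROWSDOWN))  /* not a stack VMA: bad access */
        goto bad_area;
    if (expand_stack(vma, address))       /* try growing the stack down */
        goto bad_area;
    /* on success, fall through to good_area and handle the fault */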

  (3) The code above finds the first vma matching a user-supplied linear address. In practice, callers also need the opposite: finding a free, unused block of virtual memory to place new data in. How is that implemented? Linux does it in arch_get_unmapped_area, below, with the core lines commented. The idea: look up the vma for addr; if no vma covers it, and the size and boundary checks also pass, that block of memory can be handed out directly.

unsigned long
arch_get_unmapped_area(struct file *filp, unsigned long addr,
        unsigned long len, unsigned long pgoff, unsigned long flags)
{
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;
    struct vm_unmapped_area_info info;

    if (len > TASK_SIZE - mmap_min_addr)
        return -ENOMEM;

    if (flags & MAP_FIXED)
        return addr;

    if (addr) {
        addr = PAGE_ALIGN(addr); /* round addr up to a page boundary */
        /* look up the vma for addr: if there is none, or the requested range
           ends before that vma starts, the region is unused and, provided the
           other checks pass, can be returned directly */
        vma = find_vma(mm, addr);
        if (TASK_SIZE - len >= addr && addr >= mmap_min_addr &&
            (!vma || addr + len <= vma->vm_start))
            return addr;
    }

    info.flags = 0;
    info.length = len;
    info.low_limit = mm->mmap_base;
    info.high_limit = TASK_SIZE;
    info.align_mask = 0;
    return vm_unmapped_area(&info);
}
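
  You can watch this path from user space: pass mmap() an address hint, and if the kernel finds [hint, hint+len) unused, the checks above return exactly your hint; otherwise it falls through to vm_unmapped_area() and picks some other free range. A small runnable illustration for a 64-bit system (the hint value is arbitrary):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    void *hint = (void *)0x200000000UL;  /* arbitrary page-aligned hint */
    void *p = mmap(hint, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* p == hint when the hinted range was free; otherwise the kernel
     * chose a different unmapped range for us */
    printf("requested %p, got %p\n", hint, p);
    munmap(p, 4096);
    return 0;
}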

  (4) Once memory is no longer needed it is released, and to avoid fragmentation the freed range should of course be merged with adjacent regions. Since free virtual memory is not organized in the red-black tree, this step involves no tree operations. The approach: first check whether the end address of the preceding prev region coincides with the start address of the released region, or whether the end address of the released region coincides with the start of the following next region; then check that the regions to be merged carry the same flags; and if the merged regions map files, also check that they map the same file with contiguous offsets within it. The idea is straightforward; the code:

/*
 * Given a mapping request (addr,end,vm_flags,file,pgoff), figure out
 * whether that can be merged with its predecessor or its successor.
 * Or both (it neatly fills a hole).
 *
 * In most cases - when called for mmap, brk or mremap - [addr,end) is
 * certain not to be mapped by the time vma_merge is called; but when
 * called for mprotect, it is certain to be already mapped (either at
 * an offset within prev, or at the start of next), and the flags of
 * this area are about to be changed to vm_flags - and the no-change
 * case has already been eliminated.
 *
 * The following mprotect cases have to be considered, where AAAA is
 * the area passed down from mprotect_fixup, never extending beyond one
 * vma, PPPPPP is the prev vma specified, and NNNNNN the next vma after:
 *
 *     AAAA             AAAA                AAAA          AAAA
 *    PPPPPPNNNNNN    PPPPPPNNNNNN    PPPPPPNNNNNN    PPPPNNNNXXXX
 *    cannot merge    might become    might become    might become
 *                    PPNNNNNNNNNN    PPPPPPPPPPNN    PPPPPPPPPPPP 6 or
 *    mmap, brk or    case 4 below    case 5 below    PPPPPPPPXXXX 7 or
 *    mremap move:                                    PPPPXXXXXXXX 8
 *        AAAA
 *    PPPP    NNNN    PPPPPPPPPPPP    PPPPPPPPNNNN    PPPPNNNNNNNN
 *    might become    case 1 below    case 2 below    case 3 below
 *
 * It is important for case 8 that the vma NNNN overlapping the
 * region AAAA is never going to be extended over XXXX. Instead XXXX must
 * be extended in region AAAA and NNNN must be removed. This way in
 * all cases where vma_merge succeeds, the moment vma_adjust drops the
 * rmap_locks, the properties of the merged vma will be already
 * correct for the whole merged range. Some of those properties like
 * vm_page_prot/vm_flags may be accessed by rmap_walks and they must
 * be correct for the whole merged range immediately after the
 * rmap_locks are released. Otherwise if XXXX would be removed and
 * NNNN would be extended over the XXXX range, remove_migration_ptes
 * or other rmap walkers (if working on addresses beyond the "end"
 * parameter) may establish ptes with the wrong permissions of NNNN
 * instead of the right permissions of XXXX.
 */
struct vm_area_struct *vma_merge(struct mm_struct *mm,
            struct vm_area_struct *prev, unsigned long addr,
            unsigned long end, unsigned long vm_flags,
            struct anon_vma *anon_vma, struct file *file,
            pgoff_t pgoff, struct mempolicy *policy,
            struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
{
    pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
    struct vm_area_struct *area, *next;
    int err;

    /*
     * We later require that vma->vm_flags == vm_flags,
     * so this tests vma->vm_flags & VM_SPECIAL, too.
     */
    if (vm_flags & VM_SPECIAL)
        return NULL;

    if (prev)
        next = prev->vm_next;
    else
        next = mm->mmap;
    area = next;
    if (area && area->vm_end == end)        /* cases 6, 7, 8 */
        next = next->vm_next;

    /* verify some invariant that must be enforced by the caller */
    VM_WARN_ON(prev && addr <= prev->vm_start);
    VM_WARN_ON(area && end > area->vm_end);
    VM_WARN_ON(addr >= end);

    /*
     * Can it merge with the predecessor?
     */
    if (prev && prev->vm_end == addr &&
            mpol_equal(vma_policy(prev), policy) &&
            can_vma_merge_after(prev, vm_flags,
                        anon_vma, file, pgoff,
                        vm_userfaultfd_ctx)) {
        /*
         * OK, it can.  Can we now merge in the successor as well?
         */
        if (next && end == next->vm_start &&
                mpol_equal(policy, vma_policy(next)) &&
                can_vma_merge_before(next, vm_flags,
                             anon_vma, file,
                             pgoff+pglen,
                             vm_userfaultfd_ctx) &&
                is_mergeable_anon_vma(prev->anon_vma,
                              next->anon_vma, NULL)) {
                            /* cases 1, 6 */
            err = __vma_adjust(prev, prev->vm_start,
                     next->vm_end, prev->vm_pgoff, NULL,
                     prev);
        } else                    /* cases 2, 5, 7 */
            err = __vma_adjust(prev, prev->vm_start,
                     end, prev->vm_pgoff, NULL, prev);
        if (err)
            return NULL;
        khugepaged_enter_vma_merge(prev, vm_flags);
        return prev;
    }

    /*
     * Can this new request be merged in front of next?
     */
    if (next && end == next->vm_start &&
            mpol_equal(policy, vma_policy(next)) &&
            can_vma_merge_before(next, vm_flags,
                         anon_vma, file, pgoff+pglen,
                         vm_userfaultfd_ctx)) {
        if (prev && addr < prev->vm_end)    /* case 4 */
            err = __vma_adjust(prev, prev->vm_start,
                     addr, prev->vm_pgoff, NULL, next);
        else {                    /* cases 3, 8 */
            err = __vma_adjust(area, addr, next->vm_end,
                     next->vm_pgoff - pglen, NULL, next);
            /*
             * In case 3 area is already equal to next and
             * this is a noop, but in case 8 "area" has
             * been removed and next was expanded over it.
             */
            area = next;
        }
        if (err)
            return NULL;
        khugepaged_enter_vma_merge(area, vm_flags);
        return area;
    }

    return NULL;
}

  (5) Since the memory is being released, the corresponding vma naturally has to be removed from the red-black tree as well; that is implemented in detach_vmas_to_be_unmapped:

/*
 * Create a list of vma's touched by the unmap, removing them from the mm's
 * vma list as we go. Remove the vma's, and unmap the actual pages.
 */
static void
detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
    struct vm_area_struct *prev, unsigned long end)
{
    struct vm_area_struct **insertion_point;
    struct vm_area_struct *tail_vma = NULL;

    insertion_point = (prev ? &prev->vm_next : &mm->mmap);
    vma->vm_prev = NULL;
    do {
        vma_rb_erase(vma, &mm->mm_rb); /* remove the vma from the red-black tree */
        mm->map_count--;
        tail_vma = vma;
        vma = vma->vm_next;
    } while (vma && vma->vm_start < end);
    *insertion_point = vma;
    if (vma) {
        vma->vm_prev = prev;
        vma_gap_update(vma);
    } else
        mm->highest_vm_end = prev ? prev->vm_end : 0;
    tail_vma->vm_next = NULL;

    /* Kill the cache */
    vmacache_invalidate(mm);
}
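
  One detail worth pausing on: insertion_point is a pointer to a pointer, so the final *insertion_point = vma; splices the surviving tail back in whether the cut began at the list head (patching mm->mmap) or mid-list (patching prev->vm_next), with no special case. A minimal self-contained sketch of this classic linked-list idiom (the node type and remove_all() are illustrative only):

#include <stdio.h>

struct node { int v; struct node *next; };

/* Remove from *headp every node whose value is v; headp may point at
 * the list head or at any node's ->next field, so there is no
 * head-of-list special case. */
static void remove_all(struct node **headp, int v)
{
    while (*headp) {
        if ((*headp)->v == v)
            *headp = (*headp)->next;  /* splice out through the pointer */
        else
            headp = &(*headp)->next;  /* advance to the next link field */
    }
}

int main(void)
{
    struct node c = { 3, NULL }, b = { 2, &c }, a = { 2, &b };
    struct node *head = &a;

    remove_all(&head, 2);
    for (struct node *n = head; n; n = n->next)
        printf("%d\n", n->v);         /* prints: 3 */
    return 0;
}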

 

Summary:

1. Here is an illustration of the mapping between virtual and physical memory, and between process memory and operating-system memory, to help you visualize it.

 

 2. Searching the source for the keyword rb_node, I found more than 3,000 hits: that shows just how widely red-black trees are used in the Linux kernel!

 

 3. AVL trees are quite similar to red-black trees, but AVL insertions and deletions may require many rotations and repeated backtracking toward the root, so under heavy insert/delete workloads an AVL tree is less efficient. A red-black tree is a not-fully-balanced binary search tree whose lookup efficiency is second only to the AVL tree's; by discarding AVL's strict balance constraint it can still insert and delete in O(log n), and restoring balance after an insertion or deletion takes at most two or three rotations. Although the two have the same asymptotic complexity, the red-black tree offers faster worst-case insertion and deletion of a node, which clearly suits the heavy adding, removing, and looking up of VMAs in the Linux kernel!

 

