Linux Source Code Walkthrough (16): Red-Black Trees in the Kernel as Applied to Virtual Memory Management
1. Another very important kernel facility that relies on the red-black tree's fast, stable insert/delete/lookup operations is virtual memory management. Earlier articles covered the buddy and slab allocators, which manage physical pages. Physical memory is far smaller than the virtual address space, and managing it only involves allocating pages and merging them back when they are freed, so no tree structure is needed there: it is managed bluntly with linked lists. Virtual memory is different. Take a 32-bit system: the virtual address space is 4GB and can be carved into a great many regions, and those regions have to be inserted, removed and looked up quickly (code is fetched, data is read and written, shared libraries are loaded, all the time), which is exactly where a red-black tree fits. As usual, the structs first:
task_struct embeds a pointer to an mm_struct, and that structure is where the interesting parts live:
```c
struct task_struct {
    .......
    struct mm_struct *mm;
    .......
};
```
Going one level deeper, we find the root of a red-black tree again. And what are the two vm_area_struct pointers for?
```c
struct mm_struct {
    struct vm_area_struct *mmap;        /* list of VMAs */
    struct rb_root mm_rb;               /* root of the red-black tree, again */
    struct vm_area_struct *mmap_cache;  /* last find_vma result */
    .......
};
```
Digging further:
- Notice the rb_node member in vm_area_struct below? That is clearly a red-black tree node, and together with the rb_root above it forms exactly a red-black tree.
A process's virtual address space is divided into a number of regions, each with its own attributes and purpose; a valid address always falls inside some region, and the regions never overlap. In the Linux kernel such a region is called a virtual memory area, or VMA for short. A VMA is the abstraction of a contiguous range of linear addresses and carries its own permissions (readable, writable, executable, and so on).
```c
struct vm_area_struct {
    struct mm_struct *vm_mm;     /* the mm_struct this area belongs to */
    unsigned long vm_start;      /* start address of the VMA */
    unsigned long vm_end;        /* end address of the VMA */
    /* Predecessor and successor VMAs in the process's VMA list;
     * the list is kept sorted by address. */
    struct vm_area_struct *vm_next, *vm_prev;
    pgprot_t vm_page_prot;       /* access permissions of the VMA */
    unsigned long vm_flags;      /* flag set */
    struct rb_node vm_rb;        /* this VMA's node in the red-black tree */
    ...............
};
```
To make the relationships between these structures easier to see, I drew a diagram for reference; the key points are:
- vm_area_struct has two fields, vm_start and vm_end, holding the start and end addresses of the virtual memory area;
- the vm_rb members of the vm_area_struct instances are linked into a red-black tree, which makes it fast to find a target area matching a given condition;
- the vm_area_struct instances are also chained into a linked list, used mainly for walking all the areas in address order: there is no need for pre-order/in-order/post-order traversal as with a tree, so it is a bit faster, and a tree traversal needs recursion or an auxiliary stack/queue, i.e. extra space proportional to the tree height (a toy sketch of this access path follows the list).
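To make that trade-off concrete, here is a minimal user-space toy model. It is not kernel code, and the struct and function names (toy_vma, dump_all) are invented for illustration; it only shows why a sorted list is the natural structure for a full in-order walk (O(n) time, O(1) extra space), while point lookups are left to the red-black tree.

```c
/* Toy model (NOT kernel code): mimics the address-ordered VMA list. */
#include <stdio.h>

struct toy_vma {
    unsigned long start, end;   /* like vm_start / vm_end: [start, end) */
    struct toy_vma *next;       /* like vm_next: address-ordered list */
};

/* Walk every area in address order: O(n) time, O(1) extra space,
 * no recursion or auxiliary stack needed. */
static void dump_all(const struct toy_vma *head)
{
    for (const struct toy_vma *v = head; v; v = v->next)
        printf("area [%#lx, %#lx)\n", v->start, v->end);
}

int main(void)
{
    struct toy_vma heap = { 0x800000, 0x900000, NULL };
    struct toy_vma text = { 0x400000, 0x401000, &heap };

    dump_all(&text);
    return 0;
}
```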
2. (1) With the structures defined, the next step is to operate on them. Since a red-black tree is used to manage VMAs, the first job is obviously building the tree and the list (all of these operations live in mm/mmap.c). The most direct API is the __vma_link function:
```c
/* Hook a newly created vm_area_struct into the list and the
 * red-black tree managed by the mm_struct. */
static void
__vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
           struct vm_area_struct *prev, struct rb_node **rb_link,
           struct rb_node *rb_parent)
{
    __vma_link_list(mm, vma, prev, rb_parent);
    __vma_link_rb(mm, vma, rb_link, rb_parent);
}
```
It calls two helpers, and the names say it all: one builds the list, the other builds the red-black tree. Look at __vma_link_list first: if there is no predecessor the new VMA becomes the list head mm->mmap, otherwise it is spliced in right after prev, and the vm_next/vm_prev pointers of its neighbours are fixed up to complete the doubly linked list.
```c
void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
                     struct vm_area_struct *prev, struct rb_node *rb_parent)
{
    struct vm_area_struct *next;

    vma->vm_prev = prev;
    if (prev) {
        next = prev->vm_next;
        prev->vm_next = vma;
    } else {
        mm->mmap = vma;
        if (rb_parent)
            next = rb_entry(rb_parent,
                            struct vm_area_struct, vm_rb);
        else
            next = NULL;
    }
    vma->vm_next = next;
    if (next)
        next->vm_prev = vma;
}
```
The other helper, __vma_link_rb, inserts the node into the red-black tree:
```c
void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
                   struct rb_node **rb_link, struct rb_node *rb_parent)
{
    /* Update tracking information for the gap following the new vma. */
    if (vma->vm_next)
        vma_gap_update(vma->vm_next);
    else
        mm->highest_vm_end = vma->vm_end;

    /*
     * vma->vm_prev wasn't known when we followed the rbtree to find the
     * correct insertion point for that vma. As a result, we could not
     * update the vma vm_rb parents rb_subtree_gap values on the way down.
     * So, we first insert the vma with a zero rb_subtree_gap value
     * (to be consistent with what we did on the way down), and then
     * immediately update the gap to the correct value. Finally we
     * rebalance the rbtree after all augmented values have been set.
     */
    rb_link_node(&vma->vm_rb, rb_parent, rb_link);
    vma->rb_subtree_gap = 0;
    vma_gap_update(vma);
    /* insert the vma into the (augmented) red-black tree */
    vma_rb_insert(vma, &mm->mm_rb);
}
```
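One detail the snippet above takes for granted is where rb_link and rb_parent come from: before calling __vma_link, the caller walks the tree down from the root, comparing the new range against each node's addresses until it reaches the empty child slot where the new node belongs. The sketch below illustrates that descent; it is a simplified version of the logic in the kernel's find_vma_links in mm/mmap.c (overlap detection and prev bookkeeping are omitted, and find_insert_point is an invented name), so treat it as illustrative rather than the exact source.

```c
/* Simplified sketch of finding the insertion point for a new, non-overlapping
 * range starting at addr; loosely modeled on find_vma_links(). */
static void find_insert_point(struct mm_struct *mm, unsigned long addr,
                              struct rb_node ***rb_link,
                              struct rb_node **rb_parent)
{
    struct rb_node **link = &mm->mm_rb.rb_node;
    struct rb_node *parent = NULL;

    while (*link) {
        struct vm_area_struct *tmp;

        parent = *link;
        tmp = rb_entry(parent, struct vm_area_struct, vm_rb);
        if (addr < tmp->vm_start)
            link = &parent->rb_left;    /* new range lies below this vma */
        else
            link = &parent->rb_right;   /* new range lies above this vma */
    }
    *rb_link = link;       /* empty slot that rb_link_node() will fill */
    *rb_parent = parent;   /* parent of that slot */
}
```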
(2) Once the code above has run, the red-black tree is in place; next comes lookup. Since the tree manages VMAs and is keyed by the VMA's linear address, for any node the VMAs in its left subtree all lie at lower addresses and the VMAs in its right subtree all lie at higher addresses. The code:
```c
/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
    struct rb_node *rb_node;
    struct vm_area_struct *vma;

    /* Check the cache first. */
    vma = vmacache_find(mm, addr);
    if (likely(vma))
        return vma;

    /* root of the red-black tree */
    rb_node = mm->mm_rb.rb_node;

    while (rb_node) {
        struct vm_area_struct *tmp;

        tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);

        if (tmp->vm_end > addr) {
            vma = tmp;
            /* addr lies between this vma's vm_start and vm_end: found it */
            if (tmp->vm_start <= addr)
                break;
            /* addr is below vm_start: keep searching the left subtree */
            rb_node = rb_node->rb_left;
        } else
            /* addr is at or above vm_end: keep searching the right subtree */
            rb_node = rb_node->rb_right;
    }

    if (vma)
        vmacache_update(addr, vma);
    return vma;
}
```
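Note that find_vma only guarantees addr < vm_end for the VMA it returns; the address may still lie in the gap before that VMA's vm_start. Callers therefore re-check the start address themselves, which is roughly what the page-fault path does. A hedged sketch of that pattern (addr_is_mapped is an invented helper, not a kernel function):

```c
/* Illustrative only: how a caller typically validates find_vma()'s result. */
static bool addr_is_mapped(struct mm_struct *mm, unsigned long addr)
{
    struct vm_area_struct *vma = find_vma(mm, addr);

    /* find_vma() returns the first vma with addr < vm_end, so addr may
     * still sit in the hole below it; check vm_start explicitly. */
    return vma && vma->vm_start <= addr;
}
```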
(3) The code above finds the first matching VMA for a caller-supplied linear address. In practice the kernel also needs to find an unused stretch of virtual address space into which new data can be mapped; how is that done? Linux implements it in arch_get_unmapped_area, shown below with the key lines commented. The idea: look up the VMA covering addr; if there is none (and the size and boundary checks pass), that address range is free and can be handed out.
```c
unsigned long
arch_get_unmapped_area(struct file *filp, unsigned long addr,
                       unsigned long len, unsigned long pgoff,
                       unsigned long flags)
{
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;
    struct vm_unmapped_area_info info;

    if (len > TASK_SIZE - mmap_min_addr)
        return -ENOMEM;

    if (flags & MAP_FIXED)
        return addr;

    if (addr) {
        addr = PAGE_ALIGN(addr);  /* round addr up to a page boundary */
        /* Look up the VMA covering addr; if there is none (or the new
         * range ends before that VMA starts), the region is unused and,
         * provided the other checks pass, we can hand it out directly. */
        vma = find_vma(mm, addr);
        if (TASK_SIZE - len >= addr &&
            addr >= mmap_min_addr &&
            (!vma || addr + len <= vma->vm_start))
            return addr;
    }

    info.flags = 0;
    info.length = len;
    info.low_limit = mm->mmap_base;
    info.high_limit = TASK_SIZE;
    info.align_mask = 0;
    return vm_unmapped_area(&info);
}
```
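From user space this path is reached through mmap(): the addr argument of mmap() is exactly the hint checked above. A small demo of the two cases follows (output addresses will vary from run to run; error handling is omitted for brevity):

```c
/* User-space view of the hint logic above: mmap() with addr = NULL lets
 * the kernel pick a free range (via vm_unmapped_area), while a non-NULL
 * addr is only a hint and is honoured when that range happens to be free. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;
    void *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    void *b = mmap((void *)0x70000000, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    printf("kernel-chosen address: %p, hinted address: %p\n", a, b);
    return 0;
}
```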
(4) To avoid fragmenting the address space into a multitude of tiny VMAs, the kernel tries to merge a new or changed mapping with the adjacent existing regions; this step itself does not involve the red-black tree. The idea: first check whether the end address of the prev region just before the area coincides with the area's start address, or whether the area's end address coincides with the start address of the next region just after it; then check that the regions to be merged carry the same flags. If the regions map a file on disk, also check that they map the same file and that the offsets within the file are contiguous. The idea is simple enough; the code:
```c
/*
 * Given a mapping request (addr,end,vm_flags,file,pgoff), figure out
 * whether that can be merged with its predecessor or its successor.
 * Or both (it neatly fills a hole).
 *
 * In most cases - when called for mmap, brk or mremap - [addr,end) is
 * certain not to be mapped by the time vma_merge is called; but when
 * called for mprotect, it is certain to be already mapped (either at
 * an offset within prev, or at the start of next), and the flags of
 * this area are about to be changed to vm_flags - and the no-change
 * case has already been eliminated.
 *
 * The following mprotect cases have to be considered, where AAAA is
 * the area passed down from mprotect_fixup, never extending beyond one
 * vma, PPPPPP is the prev vma specified, and NNNNNN the next vma after:
 *
 *     AAAA             AAAA                AAAA             AAAA
 *    PPPPPPNNNNNN    PPPPPPNNNNNN    PPPPPPNNNNNN    PPPPNNNNXXXX
 *    cannot merge    might become    might become    might become
 *                    PPNNNNNNNNNN    PPPPPPPPPPNN    PPPPPPPPPPPP 6 or
 *    mmap, brk or    case 4 below    case 5 below    PPPPPPPPXXXX 7 or
 *    mremap move:                                    PPPPXXXXXXXX 8
 *        AAAA
 *    PPPP    NNNN    PPPPPPPPPPPP    PPPPPPPPNNNN    PPPPNNNNNNNN
 *    might become    case 1 below    case 2 below    case 3 below
 *
 * It is important for case 8 that the vma NNNN overlapping the
 * region AAAA is never going to be extended over XXXX. Instead XXXX must
 * be extended in region AAAA and NNNN must be removed. This way in
 * all cases where vma_merge succeeds, the moment vma_adjust drops the
 * rmap_locks, the properties of the merged vma will be already
 * correct for the whole merged range. Some of those properties like
 * vm_page_prot/vm_flags may be accessed by rmap_walks and they must
 * be correct for the whole merged range immediately after the
 * rmap_locks are released. Otherwise if XXXX would be removed and
 * NNNN would be extended over the XXXX range, remove_migration_ptes
 * or other rmap walkers (if working on addresses beyond the "end"
 * parameter) may establish ptes with the wrong permissions of NNNN
 * instead of the right permissions of XXXX.
 */
struct vm_area_struct *vma_merge(struct mm_struct *mm,
            struct vm_area_struct *prev, unsigned long addr,
            unsigned long end, unsigned long vm_flags,
            struct anon_vma *anon_vma, struct file *file,
            pgoff_t pgoff, struct mempolicy *policy,
            struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
{
    pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
    struct vm_area_struct *area, *next;
    int err;

    /*
     * We later require that vma->vm_flags == vm_flags,
     * so this tests vma->vm_flags & VM_SPECIAL, too.
     */
    if (vm_flags & VM_SPECIAL)
        return NULL;

    if (prev)
        next = prev->vm_next;
    else
        next = mm->mmap;
    area = next;
    if (area && area->vm_end == end)        /* cases 6, 7, 8 */
        next = next->vm_next;

    /* verify some invariant that must be enforced by the caller */
    VM_WARN_ON(prev && addr <= prev->vm_start);
    VM_WARN_ON(area && end > area->vm_end);
    VM_WARN_ON(addr >= end);

    /*
     * Can it merge with the predecessor?
     */
    if (prev && prev->vm_end == addr &&
            mpol_equal(vma_policy(prev), policy) &&
            can_vma_merge_after(prev, vm_flags,
                                anon_vma, file, pgoff,
                                vm_userfaultfd_ctx)) {
        /*
         * OK, it can. Can we now merge in the successor as well?
         */
        if (next && end == next->vm_start &&
                mpol_equal(policy, vma_policy(next)) &&
                can_vma_merge_before(next, vm_flags,
                                     anon_vma, file,
                                     pgoff + pglen,
                                     vm_userfaultfd_ctx) &&
                is_mergeable_anon_vma(prev->anon_vma,
                                      next->anon_vma, NULL)) {
                                        /* cases 1, 6 */
            err = __vma_adjust(prev, prev->vm_start,
                               next->vm_end, prev->vm_pgoff, NULL,
                               prev);
        } else                          /* cases 2, 5, 7 */
            err = __vma_adjust(prev, prev->vm_start,
                               end, prev->vm_pgoff, NULL, prev);
        if (err)
            return NULL;
        khugepaged_enter_vma_merge(prev, vm_flags);
        return prev;
    }

    /*
     * Can this new request be merged in front of next?
     */
    if (next && end == next->vm_start &&
            mpol_equal(policy, vma_policy(next)) &&
            can_vma_merge_before(next, vm_flags,
                                 anon_vma, file, pgoff + pglen,
                                 vm_userfaultfd_ctx)) {
        if (prev && addr < prev->vm_end)        /* case 4 */
            err = __vma_adjust(prev, prev->vm_start,
                               addr, prev->vm_pgoff, NULL, next);
        else {                                  /* cases 3, 8 */
            err = __vma_adjust(area, addr, next->vm_end,
                               next->vm_pgoff - pglen, NULL, next);
            /*
             * In case 3 area is already equal to next and
             * this is a noop, but in case 8 "area" has
             * been removed and next was expanded over it.
             */
            area = next;
        }
        if (err)
            return NULL;
        khugepaged_enter_vma_merge(area, vm_flags);
        return area;
    }

    return NULL;
}
```
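The effect of vma_merge is easy to observe from user space: when two anonymous mappings with identical protection and flags end up adjacent, /proc/self/maps shows them as a single VMA. A small demo is sketched below; the second mmap only uses a hint, so adjacency is likely but not guaranteed on every system, and MAP_FAILED checks are omitted for brevity.

```c
/* Demo of vma_merge's visible behaviour, not a kernel API: two adjacent
 * anonymous mappings with the same flags appear as one entry in
 * /proc/self/maps. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;
    char *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* hint: try to place the second mapping right after the first */
    char *b = mmap(a + len, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    printf("a=%p b=%p adjacent=%s\n", (void *)a, (void *)b,
           (b == a + len) ? "yes, expect one merged VMA" : "no");
    if (b == a + len)
        system("cat /proc/self/maps");  /* the two pages show up as one line */
    return 0;
}
```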
(5) When a region is unmapped, the corresponding VMAs obviously have to be removed from the red-black tree as well; this is implemented in detach_vmas_to_be_unmapped:
```c
/*
 * Create a list of vma's touched by the unmap, removing them from the mm's
 * vma list as we go.. Remove the vma's, and unmap the actual pages
 */
static void detach_vmas_to_be_unmapped(struct mm_struct *mm,
                                       struct vm_area_struct *vma,
                                       struct vm_area_struct *prev,
                                       unsigned long end)
{
    struct vm_area_struct **insertion_point;
    struct vm_area_struct *tail_vma = NULL;

    insertion_point = (prev ? &prev->vm_next : &mm->mmap);
    vma->vm_prev = NULL;
    do {
        vma_rb_erase(vma, &mm->mm_rb);  /* remove the vma from the red-black tree */
        mm->map_count--;
        tail_vma = vma;
        vma = vma->vm_next;
    } while (vma && vma->vm_start < end);
    *insertion_point = vma;
    if (vma) {
        vma->vm_prev = prev;
        vma_gap_update(vma);
    } else
        mm->highest_vm_end = prev ? prev->vm_end : 0;
    tail_vma->vm_next = NULL;

    /* Kill the cache */
    vmacache_invalidate(mm);
}
```
Summary:
1. A diagram of the mapping between virtual and physical memory, and between process memory and kernel memory, is included here to aid understanding.
2. Searching the kernel source for the keyword rb_node turns up more than 3,000 hits, which shows how widely red-black trees are used in the Linux kernel.
3. AVL trees are similar to red-black trees, but AVL insertion and deletion may require multiple rotations and repeated rebalancing back up toward the root, so under heavy insert/delete workloads an AVL tree is less efficient. The red-black tree is a less strictly balanced binary search tree whose lookups are only slightly slower than an AVL tree's; by dropping AVL's strict balance requirement it still performs insertion and deletion in O(log n) time, and a single insertion or deletion needs at most two or three rotations to restore balance. Although the two have the same asymptotic complexity, the red-black tree does less rebalancing work per update in the worst case, which makes it the better fit for the heavy add/remove/lookup traffic on VMAs in the Linux kernel.
References:
1. https://stephenzhou.blog.csdn.net/article/details/89501437 A brief look at virtual memory and VMAs
2. http://edsionte.com/techblog/archives/3564 Virtual memory operations
3. http://edsionte.com/techblog/archives/3586 Virtual memory operations