[os] Kernel是如何管理你的内存 How The Kernel Manages Your Memory

转自:Melody_lu123 CSDN 博客 ,很赞的技术文章.

原文作者:Gustavo Duarte

转自:http://duartes.org/gustavo/blog/post/how-the-kernel-manages-your-memory

How The Kernel Manages Your Memory

内核如何管理你的内存

After examining the virtual address layout of a process, we turn to the kernel and its mechanisms for managing user memory. Here is gonzo again:

在上一篇关于一个进程的虚拟地址布局之后,是时候来看看内核和它管理用户内存的机制了。下面还是以上一篇提到到的gonzo为例:

Linux kernel mm_struct

Linux processes are implemented in the kernel as instances of task_struct, the process descriptor. The mm field in task_struct points to the memory descriptormm_struct, which is an executive summary of a program’s memory. It stores the start and end of memory segments as shown above, the number of physical memory pages used by the process (rss stands for Resident Set Size), the amount of virtual address space used, and other tidbits. Within the memory descriptor we also find the two work horses for managing program memory: the set of virtual memory areas and thepage tables. Gonzo’s memory areas are shown below:

Linux的进程在kernel中是由task_struct来实现的。其中有一个mm域指向这个进程所使用的内存的描述符,mm_struct。它包含了如上所示的各个段的开始和结束的地址(使用的是unsigned long类型来表示),被这个进程所使用的物理内存页的个数,各种被使用到的虚拟地址空间(total_vm, shared_vm,...)和其它一些内容。从这个结构体中我们能看到两块主要的用来管理程序内存的成员:虚拟内存空间(vm_area_struct)和页表(pgd,这是一个平台相关的结构)。下图是Gonzo的虚拟内存的布局。

Kernel memory descriptor and memory areas

Each virtual memory area (VMA) is a contiguous range of virtual addresses; these areas never overlap. An instance of vm_area_struct fully describes a memory area, including its start and end addresses, flags to determine access rights and behaviors, and the vm_file field to specify which file is being mapped by the area, if any. A VMA that does not map a file is anonymous. Each memory segment above (e.g., heap, stack) corresponds to a single VMA, with the exception of the memory mapping segment. This is not a requirement, though it is usual in x86 machines. VMAs do not care which segment they are in.

每个虚拟内存区域都是一个连续的虚拟地址空间;这些区域是不会重叠的。一个vm_area_struct表示一个这样的内存区域,包括该区域的起始和结束地址,flags描述了它的访问权限和行为,vm_file表示哪一个文件被映射到这块区域。一个VMA没有映射一个文件被称为是匿名的。上面的每个内存段(堆,栈...)都对应着一个VMA,唯一的例外是memory mapping segment,它会有几个VMA来描述。这虽然不是必须的,但是在x86下一般都是这样。任何一个VMA是不关心它是属于哪个段的。

A program’s VMAs are stored in its memory descriptor both as a linked list in the mmap field, ordered by starting virtual address, and as a red-black tree rooted at the mm_rb field. The red-black tree allows the kernel to search quickly for the memory area covering a given virtual address. When you read file /proc/pid_of_process/maps, the kernel is simply going through the linked list of VMAs for the process and printing each one.

一个程序的所有VMA都会保存在它自己的内存描述符中(mm_struct),会同时被一个mmap指向的链表所表示和用每一个vma的其实虚拟地址作为key的一个红黑树中。红黑树提供了在给定一个虚拟地址而能快速的找到它所在的VMA的效率。而链表的表示则被用来依次列出多有该进程所使用VMA的空间时被使用,比如当你从/proc/pid_of_process/maps读取数据的时候,内核就使用链表的结构来得到所有的VMA信息。

In Windows, the EPROCESS block is roughly a mix of task_struct and mm_struct. The Windows analog to a VMA is the Virtual Address Descriptor, or VAD; they are stored in an AVL tree. You know what the funniest thing about Windows and Linux is? It’s the little differences.

在windows中,使用EPROCESS结构来描述linux中的task_struct和mm_struct。windows中与VMA相似的概念是Virtual Address Descriptor,或则叫做VAD;它们被AVL树所管理。你知道关于windows和linux最搞笑的事情吗?那就是它们在虚拟内存方面只有很小的区别。

The 4GB virtual address space is divided into pages. x86 processors in 32-bit mode support page sizes of 4KB, 2MB, and 4MB. Both Linux and Windows map the user portion of the virtual address space using 4KB pages. Bytes 0-4095 fall in page 0, bytes 4096-8191 fall in page 1, and so on. The size of a VMA must be a multiple of page size. Here’s 3GB of user space in 4KB pages:

所有4GB的虚拟地址空间被划分为许多页。x86处理器在32位模式下支持4kb,2MB和4MB的页大小。linux和windows都默认采用4kb的页来映射用户的虚拟地址空间。0-4095属于0页,4096-8191属于1页,等等。所有的VMA的大小都必须是页大小的整数倍。下面是3GB的用户空间在页大小为4kb时的描述:

4KB Pages Virtual User Space

The processor consults page tables to translate a virtual address into a physical memory address. Each process has its own set of page tables; whenever a process switch occurs, page tables for user space are switched as well. Linux stores a pointer to a process’ page tables in the pgd field of the memory descriptor. To each virtual page there corresponds one page table entry (PTE) in the page tables, which in regular x86 paging is a simple 4-byte record shown below:

处理器通过页表来把一个虚拟地址转化为实际的物理内存地址。每个进程有属于它自己的一组页表;无论何时发生了进程切换,相应的会发生用户空间的页表切换。mm_struct中有一个pdg域,就指向该进程所使用的页表集。对每一个虚拟页在页表中都有对应的一个页表项(PTE),通常x86下的一页十一个4字节的如下记录:

x86 Page Table Entry (PTE) for 4KB page

Linux has functions to read and set each flag in a PTE. Bit P tells the processor whether the virtual page is present in physical memory. If clear (equal to 0), accessing the page triggers a page fault. Keep in mind that when this bit is zero, the kernel can do whatever it pleases with the remaining fields. The R/W flag stands for read/write; if clear, the page is read-only. Flag U/S stands for user/supervisor; if clear, then the page can only be accessed by the kernel. These flags are used to implement the read-only memory and protected kernel space we saw before.

Linux提供一些列用来读写PTE的函数。P位告诉处理器该虚拟页目前是否在物理内存中。0表示不在,这是访问该页的动作会触发一个缺页异常。请记住,当P位为0时,内核是不检测其它所有的位的,内核可以用这些位另做它用。R/W位表示读/写;0表示只读。U/S位代表用户和系统;0表示该页只能被kernel访问。

Bits D and A are for dirty and accessed. A dirty page has had a write, while an accessed page has had a write or read. Both flags are sticky: the processor only sets them, they must be cleared by the kernel. Finally, the PTE stores the starting physical address that corresponds to this page, aligned to 4KB. This naive-looking field is the source of some pain, for it limits addressable physical memory to 4 GB. The other PTE fields are for another day, as is Physical Address Extension.

D和A位分别对应表示脏数据和已被访问过。一个脏页表示被写过,而被访问包括被读和被写。这两个标志都有些特别:都只有处理器来设置它们,但是由内核来复位它们。最后,PTE中储存的是对应的page的开始物理地址,并且会与4KB对其。这些控制标志常会带来一些不便,它们限制了物理内存只能达到4GB.其它的一些符号位被用来支持物理地址扩展(PAE)。

A virtual page is the unit of memory protection because all of its bytes share the U/S and R/W flags. However, the same physical memory could be mapped by different pages, possibly with different protection flags. Notice that execute permissions are nowhere to be seen in the PTE. This is why classic x86 paging allows code on the stack to be executed, making it easier to exploit stack buffer overflows (it’s still possible to exploit non-executable stacks using return-to-libc and other techniques). This lack of a PTE no-execute flag illustrates a broader fact: permission flags in a VMA may or may not translate cleanly into hardware protection. The kernel does what it can, but ultimately the architecture limits what is possible.

一个虚拟页是内存保护机制的一个单元,因为它们都会共享U/S和R/W标志。尽管如此,相同的物理地址可能会被映射到不同的页,有可能这些页会有不同的保护标志。特别的,执行权限的标志不能够被PTE所支持(但是新的64位cpu及其对应的linux内核,还有开启了PAE支持的内核会支持no-exec的特性。详细请大家自己去看http://en.wikipedia.org/wiki/NX_bit。解释的相当清楚了)。这就是为什么典型的x86页机制允许code能够被在栈中执行,从而容易被利用进行栈溢出的漏洞攻击。(当然,即使有了不能执行的栈,也依然会有return-to-libc和其它一些的攻击方法)。这是因为PTE没有no-exec的标志:一个vma中的权限标志可能也可能不会清楚的从虚拟传递到硬件保护层上。内核做了它所能做的,但是体系结构限制了它。

Virtual memory doesn’t store anything, it simply maps a program’s address space onto the underlying physical memory, which is accessed by the processor as a large block called the physical address space. While memory operations on the bus are somewhat involved, we can ignore that here and assume that physical addresses range from zero to the top of available memory in one-byte increments. This physical address space is broken down by the kernel into page frames. The processor doesn’t know or care about frames, yet they are crucial to the kernel because the page frame is the unit of physical memory management. Both Linux and Windows use 4KB page frames in 32-bit mode; here is an example of a machine with 2GB of RAM:

虚拟内存不储存任何东西,它只是简单的映射一个程序的地址到底层的物理内存,该物理内存会被处理器访问,并被叫做物理内存地址空间。如果内存操作发生在总线上,我们可以忽略虚拟内存,而可以假设物理地址是从0到最大可用的内存并且是1个字节1个字节递增的。这物理地址空间现在却会被内核分为一系列的页帧。处理器不关心这些帧,它们对于内核很重要,因为内核利用页帧来作为管理物理内存的最小单位。Linux和windows在32位模式下都使用4kb的页帧;下面是一个有2G物理内存的机器的例子:

Physical Address Space

In Linux each page frame is tracked by a descriptor and several flags. Together these descriptors track the entire physical memory in the computer; the precise state of each page frame is always known. Physical memory is managed with the buddy memory allocation technique, hence a page frame is free if it’s available for allocation via the buddy system. An allocated page frame might be anonymous, holding program data, or it might be in the page cache, holding data stored in a file or block device. There are other exotic page frame uses, but leave them alone for now. Windows has an analogous Page Frame Number (PFN) database to track physical memory.

linux中每个页帧是通过一个描述符(即page struct)和一些标志来追踪。而通过把这些描述符结合起来来追踪计算机内的所有物理内存。每个页帧的具体状态是完全可知的。物理内存是通过伙伴内存分配系统来管理的,因此如果一个页帧对于伙伴系统来说是可用的那么这个页帧就是空闲的。一个被分配出来的页帧可能是匿名的,可以用来保存程序数据或者可能是一个页cache,或者保存一个文件中的数据或是一个块设备。还有一些其它的特别的页帧的使用,但是我们先不考虑它们。Windows有一个类似的Page Frame Number(PFN)的数据库来帮助跟踪物理内存。

Let’s put together virtual memory areas, page table entries and page frames to understand how this all works. Below is an example of a user heap:

让我把所有这些与虚拟内存区域结合起来帮助理解内存管理是如何工作的。下面十一个用户的堆:

Physical Address Space

Blue rectangles represent pages in the VMA range, while arrows represent page table entries mapping pages onto page frames. Some virtual pages lack arrows; this means their corresponding PTEs have the Present flag clear. This could be because the pages have never been touched or because their contents have been swapped out. In either case access to these pages will lead to page faults, even though they are within the VMA. It may seem strange for the VMA and the page tables to disagree, yet this often happens.

蓝色的长方形表示页在VMA中的范围,而箭头表示PTE所映射到的页帧。有一些虚拟页没有箭头;这意味着它们所对应的PTE的P标志位是0。这可能是因为该页从来没有被访问过或者是因为它们的内容已经被换出了。无论哪种情况,试图访问该页的操作都会引发页错误,即使它们确实是在VMA中。这看起来对于VMA和页表来说很奇怪,但是这经常发生。

A VMA is like a contract between your program and the kernel. You ask for something to be done (memory allocated, a file mapped, etc.), the kernel says “sure”, and it creates or updates the appropriate VMA. But it does not actually honor the request right away, it waits until a page fault happens to do real work. The kernel is a lazy, deceitful sack of scum; this is the fundamental principle of virtual memory. It applies in most situations, some familiar and some surprising, but the rule is that VMAs record what has been agreed upon, while PTEs reflect what has actually been done by the lazy kernel. These two data structures together manage a program’s memory; both play a role in resolving page faults, freeing memory, swapping memory out, and so on. Let’s take the simple case of memory allocation:

一个VMA就像是你的程序和内核之间的一份契约。你请求一些事情要被完成(内存分配,文件映射等等),kernel会告诉你OK,并且帮你建立或者更新适当的VMA。但是它不会真正的马上响应你的要求,它会等到一个页错误发生才会去做相应的事情。kernel是懒惰的,有点虚伪的:)这确实虚拟内存之所以存在的意义。它应用在一些常见的场景,一些熟悉,一些可能看起来奇怪的场景中,但是事实是VMA记录了什么是被kernel所认知同意的,而PTEs则反映了什么是被懒惰的内核真正做的。这两个关键数据结构合起来管理了一个程序的内存;同时来配合实现了一个决定是否发生页错误/释放内存/交换内存等等的逻辑。让我们再看一个内存分配的简单例子:

Example of demand paging and memory allocation

When the program asks for more memory via the brk() system call, the kernel simply updates the heap VMA and calls it good. No page frames are actually allocated at this point and the new pages are not present in physical memory. Once the program tries to access the pages, the processor page faults and do_page_fault() is called. It searches for the VMA covering the faulted virtual address using find_vma(). If found, the permissions on the VMA are also checked against the attempted access (read or write). If there’s no suitable VMA, no contract covers the attempted memory access and the process is punished by Segmentation Fault.

当一个程序通过brk系统调用要求更多的内存时,内核只是简单的更新了它的堆所在的VMA。没有页帧被真正的分配出来,并且新的页是没有在物理内存上建立出来。一旦程序尝试访问这些页,则处理器出发一个页错误,即执行do_page_fault()函数。它通过find_vma()来找到导致页错误的VMA中的虚拟地址。如果找到了,且各种VMA中的权限检查没有通过,则表示没有合适的VMA,即没有建立好适合的契约,那么这次的内存访问就是非法的,会导致内核发出Segmentation Fault。

When a VMA is found the kernel must handle the fault by looking at the PTE contents and the type of VMA. In our case, the PTE shows the page is not present. In fact, our PTE is completely blank (all zeros), which in Linux means the virtual page has never been mapped. Since this is an anonymous VMA, we have a purely RAM affair that must be handled by do_anonymous_page(), which allocates a page frame and makes a PTE to map the faulted virtual page onto the freshly allocated frame.

当VMA被找到,并且各种权限检查通过,则PTE会表示该页不在现有页帧当中。在我们的例子中,我们的PTE是完全空的,这意味着linux kernel认为虚拟页还没有被映射。因为这是一个匿名的VMA,我们就必须使用do_anonymous_page()处理这种情况,它会把之前PTE的错误的虚拟页映射到新分配出来的页帧上。

Things could have been different. The PTE for a swapped out page, for example, has 0 in the Present flag but is not blank. Instead, it stores the swap location holding the page contents, which must be read from disk and loaded into a page frame by do_swap_page() in what is called a major fault.

当然,事情也有可能有所区别。比如PTE中是一个换出的页,此时Present标示是0但是该PTE却不是空的。相反,它存储的交换区的地址里面保存的是页内容,它必须由do_swap_page()从硬盘里读取到页帧中。这叫做major fault

This concludes the first half of our tour through the kernel’s user memory management. In the next post, we’ll throw files into the mix to build a complete picture of memory fundamentals, including consequences for performance.

记下来作者会再写一片关于文件操作的,从而构成一个完整的内存模型,包括一些效率的讨论。

posted @ 2012-04-13 16:25  zsounder  阅读(1232)  评论(0编辑  收藏  举报