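[os] Anatomy of a Program in Memory
Reposted from Melody_lu123's CSDN blog; an excellent technical article.
This is the first article in another series, on memory management, by the author of several articles I translated earlier. I translated it to consolidate my own understanding, and I hope it helps other readers too.
Original: Gustavo Duarte, http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory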
Anatomy of a Program in Memory
Memory management is the heart of operating systems; it is crucial for both programming and system administration. In the next few posts I’ll cover memory with an eye towards practical aspects, but without shying away from internals. While the concepts are generic, examples are mostly from Linux and Windows on 32-bit x86. This first post describes how programs are laid out in memory.
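Memory management is the heart of the operating system; it is crucial both to running programs and to administering systems. Over the next few articles the author surveys memory management from a practical angle. Because the concepts are general, most examples come from Linux and Windows on 32-bit x86. This article describes how a program is laid out in memory.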
Each process in a multi-tasking OS runs in its own memory sandbox. This sandbox is the virtual address space, which in 32-bit mode is always a 4GB block of memory addresses. These virtual addresses are mapped to physical memory by page tables, which are maintained by the operating system kernel and consulted by the processor. Each process has its own set of page tables, but there is a catch. Once virtual addresses are enabled, they apply to all software running in the machine, including the kernel itself. Thus a portion of the virtual address space must be reserved to the kernel:
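In a multitasking operating system every process runs in its own memory sandbox. The sandbox is the virtual address space, which in 32-bit mode is always 4GB. Page tables map these virtual addresses to physical memory; the kernel maintains them and the processor consults them. Each process has its own page tables, with one catch: once virtual addressing is enabled it applies to all software on the machine, the kernel included. Part of the virtual address space must therefore be reserved for the kernel. (Translator's note: as the figure shows, the default Linux configuration is a 1:3 kernel-to-user split, but this is configurable; some servers, such as large relational databases or routers, may use a different ratio.)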
This does not mean the kernel uses that much physical memory, only that it has that portion of address space available to map whatever physical memory it wishes. Kernel space is flagged in the page tables as exclusive to privileged code (ring 2 or lower), hence a page fault is triggered if user-mode programs try to touch it. In Linux, kernel space is constantly present and maps the same physical memory in all processes. Kernel code and data are always addressable, ready to handle interrupts or system calls at any time. By contrast, the mapping for the user-mode portion of the address space changes whenever a process switch happens:
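This does not mean the kernel uses that much physical memory, only that it can use that range of addresses to map whatever physical memory it needs. Kernel space is flagged in the page tables as privileged (rings 0 through 2), so a user-mode program that tries to access such a page triggers a page fault. In Linux, kernel space maps the same physical memory in every process. Kernel code and data are always addressable, ready to handle an interrupt or system call at any time. The user-mode part of the address space, by contrast, is remapped on every process switch: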
Blue regions represent virtual addresses that are mapped to physical memory, whereas white regions are unmapped. In the example above, Firefox has used far more of its virtual address space due to its legendary memory hunger. The distinct bands in the address space correspond to memory segments like the heap, stack, and so on. Keep in mind these segments are simply a range of memory addresses and have nothing to do with Intel-style segments. Anyway, here is the standard segment layout in a Linux process:
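Blue regions are virtual addresses mapped to physical memory; white regions are unmapped. In the example above Firefox occupies far more of its virtual address space. The bands correspond to memory segments such as the heap and the stack. Note that "segment" here just means a range of memory addresses and has nothing to do with the segments and segment registers of the Intel manuals. The standard segment layout of a Linux process follows. (Translator's note: the figure shows the virtual address space layout introduced in kernel 2.6.7, in which the mmap region grows from the top down; the classic layout is the opposite. For the reasons and trade-offs, see section 4.3.2, "Process Address Space Layout", of Professional Linux Kernel Architecture.)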
When computing was happy and safe and cuddly, the starting virtual addresses for the segments shown above were exactly the same for nearly every process in a machine. This made it easy to exploit security vulnerabilities remotely. An exploit often needs to reference absolute memory locations: an address on the stack, the address for a library function, etc. Remote attackers must choose this location blindly, counting on the fact that address spaces are all the same. When they are, people get pwned. Thus address space randomization has become popular. Linux randomizes the stack, memory mapping segment, and heap by adding offsets to their starting addresses. Unfortunately the 32-bit address space is pretty tight, leaving little room for randomization and hampering its effectiveness.
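In computing's more innocent days, nearly every process on a machine started its segments at exactly the same virtual addresses. That made security vulnerabilities easy to exploit remotely. An exploit often needs to reference an absolute memory location: an address on the stack, the address of a library function, and so on. A remote attacker must pick these addresses blindly, counting on all address spaces being alike. Operating systems therefore introduced address space randomization. Linux adds a random offset to the starting addresses of the stack, the memory mapping segment, and the heap. Unfortunately the 32-bit address space is tight, leaving little room for randomization and limiting its effectiveness.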
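A minimal sketch of how the randomization can be observed from user space, assuming Linux and a C compiler. The names `layout_sample`, `sample_layout`, and `print_layout` are hypothetical helpers invented for this illustration; they take one address from each of the three segments the paragraph says are randomized.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Hypothetical helper type: one address from each randomized segment. */
struct layout_sample {
    uintptr_t stack;   /* address of a local variable */
    uintptr_t mapping; /* address of a fresh anonymous mapping */
    uintptr_t heap;    /* address of a small malloc() block */
};

/* Fills *out with sample addresses; returns 0 on success, -1 on failure. */
int sample_layout(struct layout_sample *out)
{
    int local = 0;
    void *map, *blk;

    assert(out != NULL);
    map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    blk = malloc(16);
    if (map == MAP_FAILED || blk == NULL) {
        if (map != MAP_FAILED)
            munmap(map, 4096);
        free(blk);
        return -1;
    }
    out->stack = (uintptr_t)&local;
    out->mapping = (uintptr_t)map;
    out->heap = (uintptr_t)blk;
    munmap(map, 4096);
    free(blk);
    return 0;
}

/* Prints one sample; compare the output of two separate runs. */
void print_layout(void)
{
    struct layout_sample s;

    if (sample_layout(&s) != 0) {
        perror("sample_layout");
        return;
    }
    printf("stack   0x%" PRIxPTR "\n", s.stack);
    printf("mapping 0x%" PRIxPTR "\n", s.mapping);
    printf("heap    0x%" PRIxPTR "\n", s.heap);
}
```

Calling `print_layout()` from `main` and running the program twice will typically print different addresses each time when randomization is enabled. On Linux the setting is exposed in /proc/sys/kernel/randomize_va_space.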
The topmost segment in the process address space is the stack, which stores local variables and function parameters in most programming languages. Calling a method or function pushes a new stack frame onto the stack. The stack frame is destroyed when the function returns. This simple design, possible because the data obeys strict LIFO order, means that no complex data structure is needed to track stack contents – a simple pointer to the top of the stack will do. Pushing and popping are thus very fast and deterministic. Also, the constant reuse of stack regions tends to keep active stack memory in the cpu caches, speeding up access. Each thread in a process gets its own stack.
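The topmost segment of the process address space is the stack, which in most programming languages holds local variables and function parameters. Calling a method or function pushes a new stack frame; the frame is destroyed when the function returns. The design is this simple because the data obeys strict LIFO order, so no complex data structure is needed to track the stack's contents: a pointer to the top of the stack is enough. Pushing and popping are therefore fast and deterministic. Constant reuse of the same stack regions also tends to keep active stack memory in the CPU caches, which speeds up access. Each thread in a process has its own stack.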
It is possible to exhaust the area mapping the stack by pushing more data than it can fit. This triggers a page fault that is handled in Linux by expand_stack(), which in turn calls acct_stack_growth() to check whether it’s appropriate to grow the stack. If the stack size is below RLIMIT_STACK (usually 8MB), then normally the stack grows and the program continues merrily, unaware of what just happened. This is the normal mechanism whereby stack size adjusts to demand. However, if the maximum stack size has been reached, we have a stack overflow and the program receives a Segmentation Fault. While the mapped stack area expands to meet demand, it does not shrink back when the stack gets smaller. Like the federal budget, it only expands.
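Pushing more data than fits can exhaust the area mapped for the stack. That triggers a page fault, which Linux handles in expand_stack(); it in turn calls acct_stack_growth() to decide whether the stack may grow. If the stack is smaller than RLIMIT_STACK (usually 8MB), it grows and the program carries on, unaware of the work Linux just did for it. This is the normal mechanism by which stack size adjusts to demand. If the maximum stack size has been reached, however, the stack overflows and the program receives a segmentation fault. A mapped stack area that has expanded does not shrink again when the stack gets smaller.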
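The limit mentioned above can be read from inside a process. The following is a small sketch, assuming a POSIX system; `stack_soft_limit` and `print_stack_limit` are hypothetical names, while `getrlimit()` and `RLIMIT_STACK` are the standard interface.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Stores the soft stack limit in *out (RLIM_INFINITY means unlimited).
 * Returns 0 on success, -1 if getrlimit() fails. */
int stack_soft_limit(rlim_t *out)
{
    struct rlimit rl;

    assert(out != NULL);
    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    *out = rl.rlim_cur;
    return 0;
}

/* Prints the soft limit in KiB, or "unlimited". */
void print_stack_limit(void)
{
    rlim_t lim;

    if (stack_soft_limit(&lim) != 0) {
        perror("getrlimit");
        return;
    }
    if (lim == RLIM_INFINITY)
        puts("stack limit: unlimited");
    else
        printf("stack limit: %llu KiB\n", (unsigned long long)(lim / 1024));
}
```

The value printed by `print_stack_limit()` should match what the shell reports with `ulimit -s`, which uses the same unit on common shells.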
Dynamic stack growth is the only situation in which access to an unmapped memory region, shown in white above, might be valid. Any other access to unmapped memory triggers a page fault that results in a Segmentation Fault. Some mapped areas are read-only, hence write attempts to these areas also lead to segfaults.
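Dynamic stack growth is the only case in which accessing an unmapped memory region can be valid. Any other access to unmapped memory triggers a page fault that ends in a segmentation fault. Some mapped areas are read-only, so writing to them also causes a segmentation fault.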
Below the stack, we have the memory mapping segment. Here the kernel maps contents of files directly to memory. Any application can ask for such a mapping via the Linux mmap() system call or CreateFileMapping() / MapViewOfFile() in Windows. Memory mapping is a convenient and high-performance way to do file I/O, so it is used for loading dynamic libraries. It is also possible to create an anonymous memory mapping that does not correspond to any files, being used instead for program data. In Linux, if you request a large block of memory via malloc(), the C library will create such an anonymous mapping instead of using heap memory. ‘Large’ means larger than MMAP_THRESHOLD bytes, 128 kB by default and adjustable via mallopt().
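Below the stack is the memory mapping segment, where the kernel maps file contents directly into memory. Any application can request such a mapping with the mmap() system call on Linux, or CreateFileMapping() / MapViewOfFile() on Windows. Memory mapping is an efficient way to do file I/O, so it is used to load dynamic libraries. An anonymous memory mapping, which holds program data and corresponds to no file, is also possible. In Linux, if you request a large block from malloc(), the C library creates an anonymous mapping in the memory mapping segment instead of using heap space. "Large" means more than MMAP_THRESHOLD bytes, 128 kB by default and adjustable with mallopt().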
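An anonymous mapping of the kind described above can be requested directly. This is a sketch under the assumption of a Linux-like system where `MAP_ANONYMOUS` is available; `anon_map` and `anon_map_demo` are hypothetical helper names, and the code illustrates the idea rather than what the C library does internally.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Maps len bytes of private anonymous memory; returns NULL on failure. */
void *anon_map(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Maps 1 MiB, checks that it starts zero-filled and is writable,
 * then unmaps it. Returns 0 if both checks pass, -1 otherwise. */
int anon_map_demo(void)
{
    size_t len = (size_t)1 << 20;
    unsigned char *buf = anon_map(len);
    int zero_filled, writable;

    if (buf == NULL)
        return -1;
    zero_filled = (buf[0] == 0 && buf[len - 1] == 0);
    memset(buf, 0xAB, len);
    writable = (buf[len / 2] == 0xAB);
    munmap(buf, len);
    return (zero_filled && writable) ? 0 : -1;
}
```

No file descriptor is involved: the `-1` passed as the descriptor and the `MAP_ANONYMOUS` flag are what make the mapping anonymous.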
Speaking of the heap, it comes next in our plunge into address space. The heap provides runtime memory allocation, like the stack, meant for data that must outlive the function doing the allocation, unlike the stack. Most languages provide heap management to programs. Satisfying memory requests is thus a joint affair between the language runtime and the kernel. In C, the interface to heap allocation is malloc() and friends, whereas in a garbage-collected language like C# the interface is the new keyword.
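The heap comes next, below the memory mapping segment. It provides runtime memory allocation. Most languages provide an interface for managing the heap, so satisfying a memory request is a matter between the language runtime and the kernel. In C that interface is malloc() and friends; in a garbage-collected language such as C# it is the new keyword.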
If there is enough space in the heap to satisfy a memory request, it can be handled by the language runtime without kernel involvement. Otherwise the heap is enlarged via the brk() system call to make room for the requested block. Heap management is complex, requiring sophisticated algorithms that strive for speed and efficient memory usage in the face of our programs’ chaotic allocation patterns. The time needed to service a heap request can vary substantially. Real-time systems have special-purpose allocators to deal with this problem. Heaps also become fragmented, shown below:
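If the heap has enough space to satisfy a request, the language runtime can handle it without involving the kernel. Otherwise the heap is enlarged with the brk() system call to make room. Heap management is complex and needs sophisticated algorithms that balance speed against efficient memory use. The time needed to service a heap request depends on the situation. Real-time systems have particular requirements here, which has led to special-purpose allocators. Heaps also tend to become fragmented: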
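The effect of enlarging the heap can be seen by watching the program break, the top of the heap that brk() moves. The sketch below assumes Linux with glibc and a single-threaded program; `grow_break` is a hypothetical helper, and it uses `sbrk()`, the library wrapper around the same mechanism, purely for illustration. Real programs should leave the break to the allocator.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Raises the program break by `bytes`, measures how far it moved,
 * then lowers it again. Returns the distance moved, or -1 on failure. */
long grow_break(intptr_t bytes)
{
    char *before, *after;

    assert(bytes > 0);
    before = sbrk(0);                 /* current top of the heap */
    if (sbrk(bytes) == (void *)-1)    /* ask the kernel for more */
        return -1;
    after = sbrk(0);                  /* new top of the heap */
    sbrk(-bytes);                     /* hand the space back */
    return (long)(after - before);
}
```

With no other thread allocating in between, `grow_break(4096)` reports that the break moved up by the requested amount.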
Finally, we get to the lowest segments of memory: BSS, data, and program text. Both BSS and data store contents for static (global) variables in C. The difference is that BSS stores the contents of uninitialized static variables, whose values are not set by the programmer in source code. The BSS memory area is anonymous: it does not map any file. If you say static int cntActiveUsers, the contents of cntActiveUsers live in the BSS.
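Finally, the lowest segments: BSS, data, and program text. BSS and data both store static (global) variables. The difference is that BSS holds uninitialized static variables, whose values the program's source code does not set. The BSS memory area is anonymous: it maps no file. For example, if you write static int cntActiveUsers, that variable lives in the BSS segment.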
The data segment, on the other hand, holds the contents for static variables initialized in source code. This memory area is not anonymous. It maps the part of the program’s binary image that contains the initial static values given in source code. So if you say static int cntWorkerBees = 10, the contents of cntWorkerBees live in the data segment and start out as 10. Even though the data segment maps a file, it is a private memory mapping, which means that updates to memory are not reflected in the underlying file. This must be the case, otherwise assignments to global variables would change your on-disk binary image. Inconceivable!
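The data segment, on the other hand, holds static variables that are initialized in the source code. This memory area is not anonymous: it maps the part of the program's binary that contains the initial values. So if you write static int cntWorkerBees = 10, the variable lives in the data segment with the value 10. Although the data segment maps a file, the mapping is private, so updates in memory are not reflected in the underlying file. It has to be this way; otherwise changing a global variable in a running program would write the new value to the binary on disk. (Translator's note: unless updating a file is what you intend, but that has nothing to do with global variables.)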
The data example in the diagram is trickier because it uses a pointer. In that case, the contents of pointer gonzo – a 4-byte memory address – live in the data segment. The actual string it points to does not, however. The string lives in the text segment, which is read-only and stores all of your code in addition to tidbits like string literals. The text segment also maps your binary file in memory, but writes to this area earn your program a Segmentation Fault. This helps prevent pointer bugs, though not as effectively as avoiding C in the first place. Here’s a diagram showing these segments and our example variables:
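The data example in the diagram is trickier because it uses a pointer. The contents of the pointer gonzo, a 4-byte memory address, live in the data segment. The string it points to does not; it lives in the text segment, which is read-only and holds all of your code. The text segment also maps your binary file into memory, but an attempt to write to it results in a Segmentation Fault. This helps guard against pointer bugs. The diagram below shows these segments: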
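The three example variables can be put into one small program to see where they land. This is a sketch: the variable names come from the text, but the string assigned to gonzo is a placeholder chosen here, and `print_static_addresses` is a hypothetical helper. Exact placement depends on the compiler and its options, and on a 64-bit build the pointer occupies 8 bytes rather than 4.

```c
#include <assert.h>
#include <stdio.h>

static int cntActiveUsers;          /* uninitialized: typically BSS */
static int cntWorkerBees = 10;      /* initialized: typically data */
/* The pointer itself is initialized data; the literal it points to
 * is normally stored in read-only memory. Placeholder string. */
static const char *gonzo = "example string literal";

/* Prints the address of each object so they can be compared with
 * the address ranges of the process's memory areas. */
void print_static_addresses(void)
{
    printf("cntActiveUsers at %p, value %d\n",
           (void *)&cntActiveUsers, cntActiveUsers);
    printf("cntWorkerBees  at %p, value %d\n",
           (void *)&cntWorkerBees, cntWorkerBees);
    printf("gonzo pointer  at %p\n", (void *)&gonzo);
    printf("gonzo literal  at %p: %s\n", (void *)gonzo, gonzo);
}
```

On an unoptimized build, running `nm` on the binary typically lists cntActiveUsers with type `b` (BSS) and cntWorkerBees with type `d` (data).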
You can examine the memory areas in a Linux process by reading the file /proc/pid_of_process/maps. Keep in mind that a segment may contain many areas. For example, each memory mapped file normally has its own area in the mmap segment, and dynamic libraries have extra areas similar to BSS and data. The next post will clarify what ‘area’ really means. Also, sometimes people say “data segment” meaning all of data + bss + heap.
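On Linux you can inspect a process's memory areas by reading /proc/pid_of_process/maps. Note that a segment may contain many areas. For example, each memory-mapped file normally has its own area in the mmap segment, and dynamic libraries have extra areas of their own, similar to BSS and data. The author explains what "area" really means in the next article. People also sometimes say "data segment" to mean data + bss + heap together.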
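A process can also read its own list of areas through /proc/self/maps, which refers to whichever process opens it. The following sketch assumes Linux with /proc mounted; `dump_maps` is a hypothetical helper that copies the file to a stream and counts the lines labeled `[stack]`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copies /proc/self/maps to `out`. Returns the number of lines that
 * carry the [stack] label, or -1 if the file cannot be opened. */
int dump_maps(FILE *out)
{
    FILE *f;
    char line[512];
    int stack_areas = 0;

    assert(out != NULL);
    f = fopen("/proc/self/maps", "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        fputs(line, out);
        if (strstr(line, "[stack]") != NULL)
            stack_areas++;
    }
    fclose(f);
    return stack_areas;
}
```

Each line of the output gives an address range, its permissions, and the file it maps, if any; labels such as `[heap]` and `[stack]` mark the segments discussed in this article.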
You can examine binary images using the nm and objdump commands to display symbols, their addresses, segments, and so on. Finally, the virtual address layout described above is the “flexible” layout in Linux, which has been the default for a few years. It assumes that we have a value for RLIMIT_STACK. When that’s not the case, Linux reverts to the “classic” layout shown below:
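You can use nm and objdump to list the symbols, addresses, segments, and other details of a binary. Finally, the virtual address layout described above is the "flexible" layout in Linux, which has been the default for some years. It assumes that a value is set for RLIMIT_STACK. When none is set, Linux falls back to the classic layout shown below. (Translator's note: as mentioned earlier, the kernel offers two virtual memory layouts. In the classic layout the stack and the memory mapping region grow toward each other; in the newer one the memory mapping region grows toward the heap.)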
That’s it for virtual address space layout. The next post discusses how the kernel keeps track of these memory areas. Coming up we’ll look at memory mapping, how file reading and writing ties into all this and what memory usage figures mean.
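That covers the layout of the virtual address space. The next article (How The Kernel Manages Your Memory) discusses how the kernel manages and tracks these memory areas. After that we look at memory mapping, how files are read and written, and what memory usage figures mean.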