A Summary of Windows Memory Management

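At work I ran into a 32-bit Windows program running out of virtual memory, so I dug into how Windows manages memory. This article summarizes material from Windows Internals and the MSDN documentation; the specific links are listed at the end of the article for reference.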

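Background

Before looking at Windows memory in detail, we need to be clear about the relationship between virtual memory and physical memory.

The relationship between virtual memory and physical memory

First, some concepts involved in memory allocation: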

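A process always allocates virtual memory; it cannot use physical memory directly.
A virtual address is translated by the MMU (Memory Management Unit) into a physical address, which locates the corresponding physical page.
A contiguous range of virtual memory does not have to be backed by contiguous physical memory. The benefit is that a process rarely has to worry about physical memory fragmentation.
Page hit: the corresponding physical page is present in physical memory.
Page fault: the corresponding physical page is not found in physical memory.
Swapping (or paging): a physical page that is not currently needed (the victim page) is written to disk, and the virtual page that is needed is mapped to the freed physical page.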

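In short, our programs only ever work with virtual memory. The operating system and the hardware translate each virtual address into a real physical address, and only then can the program access the data in memory.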

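For example, in the figure, physical memory has only 4 pages. At first, "Process A" allocates 4 pages, which fills physical memory. If "Process B" then allocates 2 pages, "VP3" and "VP4", a page fault is raised. The operating system follows its replacement policy and swaps data that is unlikely to be needed soon out to disk, for example Process A's "VP3" and "VP4". Only then can Process B's "VP3" and "VP4" be used.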
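The scenario above can be sketched as a toy simulation. This is only an illustration: the class name ToyPhysicalMemory and the plain LRU policy are assumptions made for the example, and the real Windows memory manager uses a more elaborate, working-set-based policy.

```python
from collections import OrderedDict


class ToyPhysicalMemory:
    """A toy model of physical memory with LRU replacement (illustrative only)."""

    def __init__(self, frame_count):
        self.frame_count = frame_count
        self.resident = OrderedDict()  # (process, virtual page) keys, oldest first
        self.swapped_out = set()       # pages currently written to "disk"
        self.page_faults = 0

    def access(self, process, virtual_page):
        key = (process, virtual_page)
        if key in self.resident:
            # Page hit: mark the page as most recently used.
            self.resident.move_to_end(key)
            return "hit"
        # Page fault: the page is not in physical memory.
        self.page_faults += 1
        if len(self.resident) >= self.frame_count:
            # Evict the least recently used page (the victim page).
            victim, _ = self.resident.popitem(last=False)
            self.swapped_out.add(victim)
        self.swapped_out.discard(key)
        self.resident[key] = True
        return "fault"


memory = ToyPhysicalMemory(frame_count=4)
# Process A touches VP3 and VP4 first, so they become the least recently used pages.
for vp in ("VP3", "VP4", "VP1", "VP2"):
    memory.access("A", vp)
# Process B now touches two pages while physical memory is full.
results = [memory.access("B", vp) for vp in ("VP3", "VP4")]
print(results)
print(sorted(memory.swapped_out))
```

With 4 frames, both of Process B's accesses fault and Process A's VP3 and VP4 end up swapped out, matching the figure's description.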

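The example above is only meant to give a rough picture of the allocation flow. The real process is more complicated and involves caching and space optimizations that this article does not cover.

We can also see that the virtual memory in the figure is in different states, "Reserved" and "Committed". What do these two states mean? See the next section.

The two states of virtual memory in Windows: reserved and committed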

Reserving and Committing Pages

Pages in a process virtual address space are free, reserved, committed, or shareable. Committed and shareable pages are pages that, when accessed, ultimately translate to valid pages in physical memory.

Committed pages are also referred to as private pages. This reflects the fact that committed pages cannot be shared with other processes, whereas shareable pages can be (but, of course, might be in use by only one process).

Private pages are allocated through the Windows VirtualAlloc, VirtualAllocEx, and VirtualAllocExNuma functions. These functions allow a thread to reserve address space and then commit portions of the reserved space. The intermediate "reserved" state allows the thread to set aside a range of contiguous virtual addresses for possible future use (such as an array), while consuming negligible system resources, and then commit portions of the reserved space as needed as the application runs. Or, if the size requirements are known in advance, a thread can reserve and commit in the same function call. In either case, the resulting committed pages can then be accessed by the thread. Attempting to access free or reserved memory results in an exception because the page isn't mapped to any storage that can resolve the reference.

If committed (private) pages have never been accessed before, they are created at the time of first access as zero-initialized pages (or demand zero). Private committed pages may later be automatically written to the paging file by the operating system if required by demand for physical memory. "Private" refers to the fact that these pages are normally inaccessible to any other process.

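reserved: a range of virtual addresses has been set aside, but nothing backs it yet. It must be committed before it can be accessed.
committed: the virtual memory is backed by storage, either a physical page or space in the paging file.
Committed pages are also private pages, meaning they cannot be shared with other processes.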
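The reserve-then-commit sequence can be tried from Python through ctypes. This is a minimal sketch, not production code: the function name reserve_then_commit_demo is made up for the example, the 4 KB page size is an assumption, and the function simply returns None on any platform other than Windows.

```python
import ctypes
import sys

# Values from the Windows SDK headers.
MEM_COMMIT = 0x1000
MEM_RESERVE = 0x2000
MEM_RELEASE = 0x8000
PAGE_NOACCESS = 0x01
PAGE_READWRITE = 0x04

PAGE_SIZE = 4096               # assumed page size (x86/x64 small pages)
REGION_SIZE = 256 * PAGE_SIZE  # 1 MB of address space


def reserve_then_commit_demo():
    """Reserve 1 MB, commit only its first page, and read that page back.

    Returns the first 4 bytes of the committed page, or None when not on Windows.
    """
    if sys.platform != "win32":
        return None

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.VirtualAlloc.restype = ctypes.c_void_p
    kernel32.VirtualAlloc.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                                      ctypes.c_uint32, ctypes.c_uint32]
    kernel32.VirtualFree.restype = ctypes.c_int
    kernel32.VirtualFree.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                                     ctypes.c_uint32]

    # Step 1: reserve address space only. Touching it now would fault.
    base = kernel32.VirtualAlloc(None, REGION_SIZE, MEM_RESERVE, PAGE_NOACCESS)
    if not base:
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        # Step 2: commit just the first page of the reserved range.
        page = kernel32.VirtualAlloc(base, PAGE_SIZE, MEM_COMMIT, PAGE_READWRITE)
        if not page:
            raise ctypes.WinError(ctypes.get_last_error())
        # A freshly committed private page is demand-zero, so this reads zeros.
        return ctypes.string_at(page, 4)
    finally:
        # MEM_RELEASE requires a size of 0 and frees the whole reservation.
        kernel32.VirtualFree(base, 0, MEM_RELEASE)


print(reserve_then_commit_demo())
```

Only the committed page counts against the commit charge; the remaining 255 pages of the region cost address space but almost nothing else.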

Why does virtual memory need the reserved state instead of committing directly?

This is the answer I found on Stack Overflow that I agree with most:

Why would I want to reserve? Why not just get committed memory? There are several reasons I have in mind:

Some application needs a specific address range, say from 0x400000 to 0x600000, but does not need the memory for storing anything. It is used to trap memory access. E.g., if some code accesses such area, it will be caught. (Useful for some reason.)

Some thread needs to store progressively expanding data. And the data needs to be in one contiguous chunk of memory. It is preferred not to commit large physical memory at one go because it is not needed and would be such a waste. The memory can be utilized by some other threads first. The physical memory is committed only on demand.

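In other words:

Some applications need a specific address range purely to trap memory access: as soon as any code touches that range, the access is caught.
A contiguous range can be reserved now and committed later. For example, when a thread is created, 1 MB of address space is reserved for its stack rather than committed up front.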

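Q&A on "32-bit programs" and "32-bit CPUs"

Q1. Why does a Win32 program still hit OOM on a laptop with 8 GB or even 16 GB of physical memory?

A: The memory bottleneck of a Win32 program is virtual address space, not physical memory.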

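Here is an analogy for the relationship between a 32-bit program's virtual memory and physical memory.

Think of virtual memory as a school and physical memory as its dormitory.

The bigger the school, the more students it can enroll: the more virtual memory the program can allocate.
If the school is small and the dormitory is large, the dormitory will always have empty beds, because even a fully enrolled school cannot fill it. (This is the single-process case where virtual memory is smaller than physical memory, ignoring PAE.)
If the school is large and the dormitory is small, the dormitory fills up. A policy is then needed so that the students who need a bed most get one, while students who need it less move out to make room. (This is the case where virtual memory exceeds physical memory: once physical memory is full, data that is not needed is written to disk.)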

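Q2. Why is virtual memory the bottleneck for 32-bit programs?

A: A 32-bit process has a 4 GB virtual address space. On Windows, kernel space takes 2 GB, leaving only 2 GB for user space.

A pointer in a 32-bit program or operating system can only represent addresses within a range of 2^32 bytes = 4 GB, so the virtual memory we allocate must also fit within 4 GB.

What a process's address space looks like, and why only 2 GB is available to us, is covered in the section on the Windows process memory layout.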

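Q3. What is the relationship between a 32-bit CPU and a 32-bit operating system?

A: "32-bit" refers to the width of the CPU's general-purpose registers and native pointers, not to the length of an instruction. A 32-bit operating system is one built for that 32-bit instruction set.

A 32-bit CPU cannot run a 64-bit operating system, because a 64-bit operating system relies on 64-bit registers and instructions that a 32-bit CPU does not have.
Conversely, a 64-bit x64 CPU can run a 32-bit operating system, but it cannot use the CPU's full capability, a bit like a big horse pulling a small cart.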

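Q4. Can a 32-bit CPU only use 4 GB of physical memory? Is a CPU's addressing capability tied to its bit width?

A: No. The physical addressing range is not determined by the CPU's bit width.

The addressing range depends on the width of the address bus, not on the CPU's bit width. Intel's 32-bit CPUs have supported 36 address lines since 1995, which means a 32-bit CPU can use 64 GB of physical memory.
How can it address more memory? See PAE (Physical Address Extension) for details.
PAE lets multiple 32-bit processes together use more than 4 GB of physical memory; each process is still limited to a 4 GB virtual address space.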
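The figures in Q2 and Q4 follow directly from the number of address bits. A small sketch of the arithmetic (the helper name addressable_gib is made up for the example):

```python
GIB = 1024 ** 3


def addressable_gib(address_bits):
    """Bytes reachable with `address_bits` address bits, expressed in GiB."""
    return (2 ** address_bits) // GIB


print(addressable_gib(32))  # 32-bit virtual addresses
print(addressable_gib(36))  # 36 physical address lines with PAE
```

32 bits give 4 GB of virtual address space per process, while 36 address lines give 64 GB of physical memory.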

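Windows memory layout (Windows Process Virtual Space)

User address space (User Address Space Layout)

We focus on the address space we can actually use. Readers interested in kernel space can consult other material.

The figure below is from Windows Internals, 6th edition.

As we know, a program must be loaded into memory before it can run.

The figure above shows the memory layout of an x86 (32-bit) process:

It is split into 3 GB of user space and 1 GB of kernel space. This is not the normal layout of a Win32 program, but that of a program with large address space awareness enabled (LARGE_ADDRESS_AWARE) on a system booted to give user space 3 GB.
A normal Win32 program has only 2 GB of user space, and kernel space also takes 2 GB.
In that normal layout, user space occupies the low addresses (0x00000000 to 0x7FFFFFFF) and kernel space the high addresses (0x80000000 to 0xFFFFFFFF).
User space holds code, global variables, thread stacks, DLLs, and so on.
The figure labels the contents of kernel space in detail; this article does not go into them.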

How ASLR protects Linux systems from buffer overflow attacks - Zhihu (zhihu.com)

ASLR - Jianshu (jianshu.com)

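The figure above shows the layout of user space in detail:

The lowest addresses hold the .exe
Then the .dll files
Then the heap, which holds memory allocated through APIs such as HeapAlloc
Then the thread stacks; each new thread gets its own block of stack memory

The figure also mentions ASLR, which is explained later.

Now look at one more figure, this one from the book 《程序员的自我修养》.

The user space in this figure is very fragmented, which may also be related to ASLR. If you are analyzing your application's virtual memory layout, do not treat the layout in the figure as definitive; go by what your program actually looks like at run time.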

User Address Space Layout

Just as address space in the kernel is dynamic, the user address space is also built dynamically - the addresses of the thread stacks, process heaps, and loaded images (such as DLLs and an application's executable) are dynamically computed (if the application and its images support it) through a mechanism known as Address Space Layout Randomization, or ASLR.

At the operating system level, user address space is divided into a few well-defined regions of memory, shown in Figure 10-14. The executable and DLLs themselves are present as memory mapped image files, followed by the heap(s) of the process and the stack(s) of its thread(s). Apart from these regions (and some reserved system structures such as the TEBs and PEB), all other memory allocations are run-time dependent and generated. ASLR is involved with the location of all these run-time-dependent regions and, combined with DEP, provides a mechanism for making remote exploitation of a system through memory manipulation harder to achieve. Since Windows code and data are placed at dynamic locations, an attacker cannot typically hardcode a meaningful offset into either a program or a system-supplied DLL.

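To summarize the book's description of how the address space is laid out:

The addresses of thread stacks, process heaps, and loaded images (exe, dll) are computed dynamically.
For exe and dll images, the application must support ASLR (randomly chosen addresses).

How to configure DEP (Data Execution Prevention) - Baidu Jingyan (baidu.com)

Data Execution Prevention - Baidu Baike (baidu.com)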

What is ASLR?

Let's look at what ASLR actually is.

The difference between the Windows XP result and the Windows 7 result is caused by the more random nature of address space layout introduced in Windows Vista, Address Space Load Randomization (ASLR), that leads to some fragmentation. Randomization of DLL loading, thread stack and heap placement, helps defend against malware code injection. As you can see from this VMMap output, there's 357MB of address space still available, but the largest free block is only 128K in size, which is smaller than the 1MB required for a 32-bit stack:

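ASLR stands for Address Space Layout Randomization.
Its purpose is to defend against injection attacks by malware, because fixed addresses are easier for an attacker to exploit.
The accompanying drawback is that it makes address space fragmentation more likely.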
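The VMMap observation quoted above (357 MB free in total, yet no room for a 1 MB stack) comes down to contiguity. A minimal sketch, assuming a hypothetical free list made entirely of 128 KB holes:

```python
KB = 1024
MB = 1024 * KB


def can_reserve(free_blocks, size):
    """True if any single free block is large enough for a contiguous reservation."""
    return any(block >= size for block in free_blocks)


# Hypothetical free list: many 128 KB holes that add up to 357 MB in total.
free_blocks = [128 * KB] * (357 * MB // (128 * KB))

total_free = sum(free_blocks)
largest_free = max(free_blocks)
print(total_free // MB, largest_free // KB, can_reserve(free_blocks, 1 * MB))
```

The total free space is irrelevant to a reservation; only the largest free block matters, which is why a fragmented 32-bit process can fail to create a thread long before its address space is nominally full.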

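How to disable ASLR?

Change the linker's advanced settings to turn off the randomized base address (/DYNAMICBASE:NO).

I have not tried this myself; readers who need it can experiment on their own.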
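One way to check whether the linker setting took effect is to read the flags back from the image's PE header. The sketch below is an illustration based on the documented PE layout; the function name read_pe_flags is made up for the example, and it performs only minimal validation.

```python
import struct

IMAGE_FILE_LARGE_ADDRESS_AWARE = 0x0020         # COFF Characteristics bit
IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE = 0x0040  # DllCharacteristics bit (ASLR)


def read_pe_flags(data):
    """Return (large_address_aware, dynamic_base) for the bytes of a PE image."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE image: missing MZ header")
    # The DOS header stores the offset of the PE signature at 0x3C.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("not a PE image: missing PE signature")
    # The 20-byte COFF header follows the signature; Characteristics is its last field.
    (characteristics,) = struct.unpack_from("<H", data, e_lfanew + 4 + 18)
    # DllCharacteristics sits 70 bytes into the optional header (PE32 and PE32+).
    (dll_characteristics,) = struct.unpack_from("<H", data, e_lfanew + 24 + 70)
    return (bool(characteristics & IMAGE_FILE_LARGE_ADDRESS_AWARE),
            bool(dll_characteristics & IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE))


# Usage (the file name is hypothetical):
# with open("app.exe", "rb") as f:
#     print(read_pe_flags(f.read()))
```

An image linked with /DYNAMICBASE:NO should report False for the second value. The first value shows whether the image is marked large address aware, which matters for the 3 GB and 4 GB layouts discussed elsewhere in this article.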

Stacks

In Windows, the memory manager provides two stacks for each thread: a user stack and a kernel stack.

As before, we only cover the user stack.

User Stacks

When a thread is created, the memory manager automatically reserves a predetermined amount of virtual memory, which by default is 1 MB. This amount can be configured in the call to the CreateThread or CreateRemoteThread function or when compiling the application by using the /STACK:reserve switch in the Microsoft C/C++ compiler, which will store the information in the image header. Although 1 MB is reserved, only the first page of the stack will be committed (unless the PE header of the image specifies otherwise), along with a guard page. When a thread's stack grows large enough to touch the guard page, an exception will occur, causing an attempt to allocate another guard. Through this mechanism, a user stack doesn't immediately consume all 1 MB of committed memory but instead grows with demand. (However, it will never shrink back.)

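When a thread is created, 1 MB of virtual memory is reserved by default.
The /STACK:reserve option writes the reserve size into the PE header (changing the stack size).
Although 1 MB of virtual memory is reserved, only the first page is committed (actually allocated).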

EXPERIMENT: Creating the Maximum Number of Threads

With only 2 GB of user address space available to each 32-bit process, the relatively large memory that is reserved for each thread's stack allows for an easy calculation of the maximum number of threads that a process can support: a little less than 2,048, for a total of nearly 2 GB of memory (unless the increaseuserva BCD option is used and the image is large address space aware). By forcing each new thread to use the smallest possible stack reservation size, 64 KB, the limit can grow to about 30,400 threads, which you can test for yourself by using the TestLimit utility from Sysinternals.

Here is some sample output:

If you attempt this experiment on a 64-bit Windows installation (with 8 TB of user address space available), you would expect to see potentially hundreds of thousands of threads created (as long as sufficient memory were available). Interestingly, however, TestLimit will actually create fewer threads than on a 32-bit machine, which has to do with the fact that Testlimit.exe is a 32-bit application and thus runs under the Wow64 environment. (See Chapter 3 in Part 1 for more information on Wow64.) Each thread will therefore have not only its 32-bit Wow64 stack but also its 64-bit stack, thus consuming more than twice the memory, while still keeping only 2 GB of address space. To properly test the thread-creation limit on 64-bit Windows, use the Testlimit64.exe binary instead.

Note that you will need to terminate TestLimit with Process Explorer or Task Manager; using Ctrl+C to break the application will not function because this operation itself creates a new thread, which will not be possible once memory is exhausted.

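On a 64-bit system, a 32-bit program can create fewer threads than the same program on a 32-bit machine.
The reason is that when a 64-bit machine runs a 32-bit program, an extra 64-bit stack is created for every thread. The process still has only 2 GB of virtual address space, but each thread now consumes two stack reservations.
In my own measurement, the 64-bit stack takes 256 KB, so each thread's stacks take 1.25 MB in total.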
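The thread limits discussed above are simple division. The sketch below computes theoretical upper bounds only; it assumes the whole 2 GB of user space is available for stacks, which is never quite true in practice (hence the book's "a little less than 2,048" and "about 30,400"). The helper name max_threads is made up for the example.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def max_threads(address_space, reserve_per_thread):
    """Upper bound on threads if the address space held nothing but stacks."""
    return address_space // reserve_per_thread


print(max_threads(2 * GB, 1 * MB))             # default 1 MB reservation
print(max_threads(2 * GB, 64 * KB))            # smallest reservation, 64 KB
print(max_threads(2 * GB, 1 * MB + 256 * KB))  # WoW64: 1 MB + 256 KB 64-bit stack
```

The three bounds are 2048, 32768, and 1638 threads, which shows why the same 32-bit program runs out of address space sooner under WoW64.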

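In summary, running a 32-bit program on a 64-bit system carries extra overhead in theory. A 32-bit program has only 2 GB of usable virtual memory to begin with, and running on a 64-bit system exposes that weakness sooner. To learn more, look up WoW64 (Windows on Windows 64).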

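VMMap, a powerful tool for analyzing Windows virtual memory

After all that theory, how do we actually analyze an application's virtual memory?

Microsoft provides a tool for this: VMMap.

Meaning of the memory regions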

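Total: total virtual memory that has been allocated
Free: virtual address space that is still available
Image: virtual memory occupied by mapped exe and dll files
Private Data: private memory allocated directly (for example through VirtualAlloc) rather than through the heap manager; VMMap lists heaps under a separate Heap type
Stack: virtual memory occupied by thread stacks

You can also open VMMap and click Help to see exactly what each region means.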

CLI

Besides the GUI, VMMap also provides a command-line interface for use in scripts.

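How to solve the virtual memory bottleneck of a Win32 program?

With the theory and the tools covered, how do we solve the actual problem?

Upgrade the 32-bit program to 64-bit

Virtual memory is no longer a bottleneck for a 64-bit program, but converting an existing program to 64-bit is not easy. What that involves is beyond the scope of this article.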

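Shrink redundant reserved space (Reserved)

Reduce the space reserved for thread stacks. As concluded above, by default each thread of a 32-bit program running on a 64-bit system reserves 1.25 MB, so we can reduce the stack size where appropriate. For a Java program, the JVM startup option -Xss reduces the stack size.
Reduce large reserved heap space. For example, a Java program reserves -Xmx worth of space when the JVM starts; if that is 1 GB, it takes up half of the address space.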
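To see how much these two measures can give back, here is a rough budget calculation. All numbers are hypothetical: a process with 2 GB of user space and 400 threads, a heap reservation cut from 1 GB to 512 MB, and the 32-bit stack reservation cut from 1 MB to 256 KB while the 256 KB 64-bit stack is assumed to stay unchanged.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def remaining_address_space(user_space, heap_reserve, thread_count, stack_reserve):
    """Address space left after one big heap reservation and all thread stacks."""
    return user_space - heap_reserve - thread_count * stack_reserve


# Before tuning: 1 GB heap reservation, 1 MB + 256 KB of stack per thread.
before = remaining_address_space(2 * GB, 1 * GB, 400, 1 * MB + 256 * KB)
# After tuning: 512 MB heap reservation, 256 KB + 256 KB of stack per thread.
after = remaining_address_space(2 * GB, 512 * MB, 400, 256 * KB + 256 * KB)
print(before // MB, after // MB)
```

Under these assumptions the space left for everything else grows from 524 MB to 1336 MB. The right values for a real program depend on its actual stack depth and heap usage.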

Enlarge the process's virtual address space

By default, the virtual size of a process on 32-bit Windows is 2 GB. If the image is marked specifically as large address space aware, and the system is booted with a special option (described later in this chapter), a 32-bit process can grow to be 3 GB on 32-bit Windows and to 4 GB on 64-bit Windows. The process virtual address space size on 64-bit Windows is 7,152 GB on IA64 systems and 8,192 GB on x64 systems. (This value could be increased in future releases.)