浅析 x86 架构中 cache 的组织结构-每个程序员都应该了解的 CPU 高速缓存

cache通常被翻译为高速缓冲存储器（以下简称“高速缓存”），虽然现在cache的含义已经不单单指CPU和主存储器（也就是通常所谓的内存）之间的高速缓存了，但在本文中所谓的cache依旧特指CPU和主存储器之间的高速缓存。

这篇文章诞生的源头是我之前在stackoverflow看到的一个问题：

Why is transposing a matrix of 512×512 much slower than transposing a matrix of 513×513 ?

这个问题虽然国外的大神给出了完美的解释，但是我当时看过之后还是一头雾水。想必对x86架构上的cache没有较深入了解过的童鞋看过之后也是一样的感受吧。于是趁着寒假回家第一天还没有过多外界干扰的时候，我们就来详细的研究下x86架构下cache的组织方式吧。

我们就由这个问题开始讨论吧。这个问题说为什么转置一个512×512的矩阵反倒比513×513的矩阵要慢？（不知道什么是矩阵转置的童鞋补习线性代数去）提问者给出了测试的代码以及执行的时间。

不过我们不知道提问者测试机器的硬件架构，不过我的测试环境就是我这台笔记本了，x86架构，处理器是Intel Core i3-2310M 2.10GHz。顺便啰嗦一句，在linux下，直接用cat命令查看/proc/cpuinfo这个虚拟文件就可以查看到当前CPU的很多信息。

首先，我们将提问者给出的代码修改为C语言版，然后编译运行进行测试。提问者所给出的这段代码有逻辑问题，但是这和我们的讨论主题无关，所以请无视这些细节吧 :），代码如下：

#include <stdio.h>
#include <time.h>

#define SAMPLES 1000
#define MATSIZE 513

int mat[MATSIZE][MATSIZE];

void transpose()
{
    int i, j, aux;

    for (i = 0; i < MATSIZE; i++) {
        for (j = 0; j < MATSIZE; j++) {
            aux = mat[i][j];
            mat[i][j] = mat[j][i];
            mat[j][i] = aux;
        }
    }
}

int main(void)
{
    int i, j;

    for (i = 0; i < MATSIZE; i++) {
        for (j = 0; j < MATSIZE; j++) {
            mat[i][j] = i + j;
        }
    }

    clock_t begin = clock();
    for (i = 0; i < SAMPLES; i++) {
        transpose();
    }
    clock_t elapsed = clock() - begin;

    printf("Average for a matrix of %d : %f s\n",
        MATSIZE, ((double)elapsed / CLOCKS_PER_SEC) / MATSIZE);

    return 0;
}

我的机器上得出了如下的测试结果：

Average for a matrix of 513 : 0.003879 s
Average for a matrix of 512 : 0.004570 s

512×512的矩阵转置确实慢于513×513的矩阵，但是有意思的是我并没有提问者那么悬殊的执行结果。不过在编译命令行加上参数 -O2 优化后差异很明显了：

Average for a matrix of 513 : 0.001442 s
Average for a matrix of 512 : 0.005469 s

也就是说512×512的矩阵居然比513×513的矩阵转置平均慢了近4倍！

那么，是什么原因导致这个神奇的结果呢？

如果真是cache的缘故，那么cache又是如何影响代码执行的效率呢？
如果是因为cache具体的组织方式带来的特殊现象，那cache究竟是怎么组织的呢？
除此之外，仅仅是512×512的矩阵转置慢吗？其它的数字又会怎样呢？
搞明白了cache的组织方式之后，能给我们平时写代码定义变量有怎样的启示呢？

好了，我们提出的问题足够多了，现在我们来尝试在探索中逐一解答这些问题，并尝试分析一些现代CPU的特性对代码执行造成的影响。

我们从cache的原理说起，cache存在的目的是在高速的CPU和较低速的主存储器之间建立一个数据存储的缓冲地带，通常由SRAM制造，访问速度略慢于CPU的寄存器，但是却高于DRAM制造的主存储器。因为制造成本过高，所以cache的容量一般都很小，一般只有几MB甚至几十到几百KB而已。那你可能会说，这么小的cache怎么可能有大作用。有趣的是还真有大作用，由于程序的局部性原理的存在，小容量的cache在工作时能轻易达到90%以上的读写命中率。局部性原理分为时间局部性和空间局部性，这里不再详述，有兴趣的童鞋请参阅其他资料。

顺便插一句嘴，不光金字塔型的存储器体系结构和制造成本相关，甚至我觉得计算机体系结构很大程度受制于成本等因素的考量。假设主存储器的存储速率能和CPU寄存器比肩的话，cache肯定就会退出历史的舞台了。如果磁盘的读写速度能达到寄存器级别并且随机存取，那恐怕内存也就没有存在的必要的……

言归正传，我们如何查看自己机器上CPU的cache信息呢？/proc/cpuinfo这里是没有的，我们需要使用lscpu命令查看，这条命令在我的机器上得到了如下的输出结果：

可以看到，我的机器拥有L1d（L1数据cache）和L1i（L1指令cache）各32KB、L2 cache 256KB、L3 cache 3072KB（3MB）。

L1缓存居然分为数据缓存和指令缓存，这不是哈佛架构么？x86不是冯·诺伊曼架构么，怎么会在存储区域区分指令和数据？其实，教科书中讲述的都是完全理想化的模型，在实际的工程中，很难找到这种理想化的设计。就拿操作系统内核而言，尽管所谓的微内核组织结构更好，但是在目前所有知名的操作系统中是找不到完全符合学术意义上的微内核的例子。工程上某些时候就是一种折衷，选择更“接地气”的做法，而不是一味的契合理论模型。

既然cache容量很有限，那么如何组织数据便是重点了。接下来，我们谈谈cache和内存数据的映射方式。一般而言，有所谓的全相联映射，直接相联映射和组相联映射三种方式。

CPU和cache是以字为单位进行数据交换的，而cache却是以行（块）（即Cache Block，或Cache Line）为单位进行数据交换的。在cache中划分若干个字为一行，在内存中划分若干个字为一块，这里的行和块是大小相等的。CPU要获取某内存地址的数据时会先检查该地址所在的块是否在cache中，如果在称之为cache命中，CPU很快就可以读取到所需数据；反之称为cache未命中，此时需要从内存读取数据，同时会将该地址所在的整个内存块复制到cache里存储以备再次使用。

我们依次来看这三种映射方式，首先是全相联映射，这种映射方式很简单，内存中的任意一块都可以放置到cache中的任意一行去。为了便于说明，我们给出以下的简单模型来理解这个设计。

我们假设有一个4行的cache，每行4个字，每个字占4个字节，即64字节的容量。另外还有256字节（16块，每块4字，每字4字节）的一个RAM存储器来和这个cache进行映射。映射结构如图所示：

那么如何判断cache是否命中呢？由于内存和cache是多对一的映射，所以必须在cache存储一行数据的同时标示出这些数据在内存中的确切位置。简单的说，在cache每一行中都有一个Index，这个Index记录着该行数据来自内存的哪一块（其实还有若干标志位，包括有效位（valid bit）、脏位（dirty bit）、使用位（use bit）等。这些位在保证正确性、排除冲突、优化性能等方面起着重要作用）。那么在进行一个地址的判断时，采用全相联方式的话，因为任意一行都有可能存在所需数据，所以需要比较每一行的索引值才能确定cache中是否存在所需数据。这样的电路延迟较长，设计困难且复杂性高，所以一般只有在特殊场合，比如cache很小时才会使用。

然后是第二种方法：直接相连映射。这个方法固定了行和块的对应关系，例如内存第0块必须放在cache第0行，第一块必须放在第一行，第二块必须放在第二行……循环放置，即满足公式：

内存块放置行号 = 内存块号 % cache总行数

映射如图所示：

这样做解决了比较起来困难的问题，由于每一块固定到了某一行，只需要计算出目标内存所在的行号进行检查即可判断出cache是否命中。但是这么做的话因为一旦发生冲突就必须换出cache中的指定行，频繁的更换缓存内容造成了大量延迟，而且未能有效利用程序运行期所具有的时间局部性。

综上，最终的解决方案是最后的组相联映射方式（Set Associativity），这个方案结合了以上两种映射方式的优点。具体的方法是先将cache的行进行分组，然后内存块按照组号求模来决定该内存块放置到cache的哪一个组。但是具体放置在组内哪一行都可以，具体由cache替换算法决定。

我们依旧以上面的例子来说明，将cache里的4行分为两组，然后采用内存里的块号对组号求模的方式进行组号判断，即内存0号块第一组里，2号块放置在第二组里，3号块又放置在第一组，以此类推。这么做的话，在组内发生冲突的话，可以选择换出组内一个不经常读写的内存块，从而减少冲突，更好的利用了资源（具体的cache替换策略不在讨论范围内，有兴趣的童鞋请自行研究）。同时因为组内行数不会很多，也避免了电路延迟和设计的复杂性。

x86中cache的组织方式采用的便是组相联映射方式。

上面的阐述可能过于简单，不过大家应该理解了组相联映射方式是怎么回事了。那么我们接下来结合我的机器上具体的cache映射计算方法继续分析。

我们刚说过组相联映射方式的行号可以通过块号 % 分组个数的公式来计算，那么直接给出一个内存地址的话如何计算呢？其实内存地址所在的块号就是内存地址值 / 分块字节数，那么直接由一个内存地址计算出所在cache中的行分组的组号计算公式就是：

内存地址所在cache组号 = （内存地址值 / 分块字节数） % 分组个数

很简单吧？假定一个cache行（内存块）有4个字，我们画出一个32位地址拆分后的样子：

因为字长32的话，每个字有4个字节，所以需要内存地址最低2位时字节偏移，同理每行（块）有4个字，块内偏移也是2位。这里的索引位数取决于cache里的行数，这个图里我画了8位，那就表示cache一共有256个分组（0～255）存在，每个分组有多少行呢？这个随意了，这里的行数是N，cache就是N路组相联映射。具体的判断自然是取tag进行组内逐一匹配测试了，如果不幸没有命中，那就需要按照cache替换算法换出组内的一行了。顺带画出这个地址对应的cache结构图：

标志位是有效位（valid bit）、脏位（dirty bit）、使用位（use bit）等，用于该cache行的写回算法，替换算法使用。这里简单期间我就画了一个2路组相联映射的例子出来。现在大家应该大致明白cache工作的流程了吧？首先由给出的内存地址计算出所在cache的组号（索引），再由判断电路逐一比较标签（tag）值来判断是否命中，若命中则通过行（块）内偏移返回所在字数据，否则由cache替换算法决定换出某一行（块），同时由内存调出该行（块）数据进行替换。

其实工作的流程就是这样，至于cache写回的策略（写回法，写一次法，全写法）不在本文的讨论范围之内，就不细说了。

有了以上铺垫，我们终于可以来解释那个512×512的矩阵转置问题了。很艰难的铺垫，不是吗？但我们距离胜利越来越近了。

512×512的矩阵，或者用C语言的说法称之为512×512的整型二维数组，在内存中是按顺序存储的。

那么以我的机器为例，在上面的lscpu命令输出的结果中，L1d（一级数据缓存）拥有32KB的容量。但是，有没有更详细的行大小和分组数量的信息？当然有，而且不需要多余的命令。在/sys/devices/system/cpu目录下就可以看到各个CPU核的所有详细信息，当然也包括cache的详细信息，我们主要关注L1d缓存的信息，以核0为例，在/sys/devices/system/cpu/cpu0/cache目录下有index0～index4这四个目录，分别对应L1d，L1i，L2，L3的信息。我们以L1d（index0）为例查看详细参数。

从图中我们可以知道，这是L1数据缓存的相关信息：共有64个组，每组8行，每行16字（64字节），共有32KB的总容量。按照我们之前的分析，相信你很容易就能说出这个机器上L1d缓存的组织方式。没错，就是8路组相联映射。

顺带贴出Intel的官方文档证明我不是在信口开河：

此时32位内存地址的拆分如下：

对应的cache图想必也难不倒大家吧？和上边的cache结构不同的就是改变了分组数量、每组行数和每行大小。

我们继续分析转置问题。每个cache行（块）拥有64个字节，正好是16个int变量的大小。一个n阶矩阵的一个行正好填充n / 16个cache行。512阶矩阵的话，每个矩阵的行就填充了32个组中的行，2个矩阵的行就覆盖了64个组。之后的行若要使用，就必然牵扯到cache的替换了。如果此时二维数组的array[0][0]开始从cache第一行开始放置。那么当进入第二重for循环之后，由于内存地址计算出的cache组号相同，导致每一个组中的正在使用的cache行发生了替换，不断发生的组内替换使得cache完全没有发挥出效果，所以造成了512×512的矩阵在转置的时候耗时较大的原因。具体的替换点大家可以自行去计算，另外513×513矩阵大家也可以试着去分析没有过多cache失效的原因。不过这个问题是和CPU架构有关的，所以假如你的机器没有产生同样的效果，不妨自己研究研究自己机器的cache结构。

另外网上针对这个问题也有诸多大牛给出的解释，大家不妨参照着理解吧。别人说过的我就不说了，大家可以参考着分析。

原本想把这篇作为上篇，再去写一个下篇讲述一些编程中要注意的问题。不过偶然间看到了微软大牛Igor Ostrovsky的博文《Gallery of Processor Cache Effects》，瞬间感觉自己不可能写的更好了。所以推荐大家去读这篇文章。如果感觉英文吃力的话，耗子叔这里有@我的上铺叫路遥做的翻译解释《7个示例科普CPU Cache》。

另外，开源中国这里的一篇译文也有参考价值：《每个程序员都应该了解的 CPU 高速缓存》。

[ Editor's note: This is the second installment in Ulrich Drepper's "What every programmer should know about memory" document. Those who have not read the first part will likely want to start there. This is good stuff, and we once again thank Ulrich for allowing us to publish it.

One quick request: in a document of this length there are bound to be a few typographical errors remaining. If you find one, and wish to see it corrected, please let us know via mail to lwn@lwn.net rather than by posting a comment. That way we will be sure to incorporate the fix and get it back into Ulrich's copy of the document and other readers will not have to plow through uninteresting comments.]

[编者按：这是Ulrich Drepper写“程序员都该知道存储器”的第二部。那些没有读过第一部 的读者可能希望从这一部开始。这本书写的非常好，并且感谢Ulrich授权我们出版。

一点说明：书籍出版时可能会有一些印刷错误，如果你发现，并且想让它在后续的出版中更正，请将意见发邮件到lwn@lwn.net ，我们一定会更正，并反馈给Ulrich的文档副本，别的读者就不会受到这些困扰。]

CPUs are today much more sophisticated than they were only 25 years ago. In those days, the frequency of the CPU core was at a level equivalent to that of the memory bus. Memory access was only a bit slower than register access. But this changed dramatically in the early 90s, when CPU designers increased the frequency of the CPU core but the frequency of the memory bus and the performance of RAM chips did not increase proportionally. This is not due to the fact that faster RAM could not be built, as explained in the previous section. It is possible but it is not economical. RAM as fast as current CPU cores is orders of magnitude more expensive than any dynamic RAM.

现在的CPU比25年前要精密得多了。在那个年代，CPU的频率与内存总线的频率基本在同一层面上。内存的访问速度仅比寄存器慢那么一点点。但是，这一局面在上世纪90年代被打破了。CPU的频率大大提升，但内存总线的频率与内存芯片的性能却没有得到成比例的提升。并不是因为造不出更快的内存，只是因为太贵了。内存如果要达到目前CPU那样的速度，那么它的造价恐怕要贵上好几个数量级。

If the choice is between a machine with very little, very fast RAM and a machine with a lot of relatively fast RAM, the second will always win given a working set size which exceeds the small RAM size and the cost of accessing secondary storage media such as hard drives. The problem here is the speed of secondary storage, usually hard disks, which must be used to hold the swapped out part of the working set. Accessing those disks is orders of magnitude slower than even DRAM access.

Fortunately it does not have to be an all-or-nothing decision. A computer can have a small amount of high-speed SRAM in addition to the large amount of DRAM. One possible implementation would be to dedicate a certain area of the address space of the processor as containing the SRAM and the rest the DRAM. The task of the operating system would then be to optimally distribute data to make use of the SRAM. Basically, the SRAM serves in this situation as an extension of the register set of the processor.

如果有两个选项让你选择，一个是速度非常快、但容量很小的内存，一个是速度还算快、但容量很多的内存，如果你的工作集比较大，超过了前一种情况，那么人们总是会选择第二个选项。原因在于辅存(一般为磁盘)的速度。由于工作集超过主存，那么必须用辅存来保存交换出去的那部分数据，而辅存的速度往往要比主存慢上好几个数量级。

好在这问题也并不全然是非甲即乙的选择。在配置大量DRAM的同时，我们还可以配置少量SRAM。将地址空间的某个部分划给SRAM，剩下的部分划给DRAM。一般来说，SRAM可以当作扩展的寄存器来使用。

While this is a possible implementation, it is not viable. Ignoring the problem of mapping the physical resources of such SRAM-backed memory to the virtual address spaces of the processes (which by itself is terribly hard) this approach would require each process to administer in software the allocation of this memory region. The size of the memory region can vary from processor to processor (i.e., processors have different amounts of the expensive SRAM-backed memory). Each module which makes up part of a program will claim its share of the fast memory, which introduces additional costs through synchronization requirements. In short, the gains of having fast memory would be eaten up completely by the overhead of administering the resources.

上面的做法看起来似乎可以，但实际上并不可行。首先，将SRAM内存映射到进程的虚拟地址空间就是个非常复杂的工作，而且，在这种做法中，每个进程都需要管理这个SRAM区内存的分配。每个进程可能有大小完全不同的SRAM区，而组成程序的每个模块也需要索取属于自身的SRAM，更引入了额外的同步需求。简而言之，快速内存带来的好处完全被额外的管理开销给抵消了。

So, instead of putting the SRAM under the control of the OS or user, it becomes a resource which is transparently used and administered by the processors. In this mode, SRAM is used to make temporary copies of (to cache, in other words) data in main memory which is likely to be used soon by the processor. This is possible because program code and data has temporal and spatial locality. This means that, over short periods of time, there is a good chance that the same code or data gets reused. For code this means that there are most likely loops in the code so that the same code gets executed over and over again (the perfect case for spatial locality). Data accesses are also ideally limited to small regions. Even if the memory used over short time periods is not close together there is a high chance that the same data will be reused before long (temporal locality). For code this means, for instance, that in a loop a function call is made and that function is located elsewhere in the address space. The function may be distant in memory, but calls to that function will be close in time. For data it means that the total amount of memory used at one time (the working set size) is ideally limited but the memory used, as a result of the random access nature of RAM, is not close together. Realizing that locality exists is key to the concept of CPU caches as we use them today.

因此，SRAM是作为CPU自动使用和管理的一个资源，而不是由OS或者用户管理的。在这种模式下，SRAM用来复制保存（或者叫缓存）主内存中有可能即将被CPU使用的数据。这意味着，在较短时间内，CPU很有可能重复运行某一段代码，或者重复使用某部分数据。从代码上看，这意味着CPU执行了一个循环，所以相同的代码一次又一次地执行（空间局部性的绝佳例子）。数据访问也相对局限在一个小的区间内。即使程序使用的物理内存不是相连的，在短期内程序仍然很有可能使用同样的数据（时间局部性）。这个在代码上表现为，程序在一个循环体内调用了入口一个位于另外的物理地址的函数。这个函数可能与当前指令的物理位置相距甚远，但是调用的时间差不大。在数据上表现为，程序使用的内存是有限的（相当于工作集的大小）。但是实际上由于RAM的随机访问特性，程序使用的物理内存并不是连续的。正是由于空间局部性和时间局部性的存在，我们才提炼出今天的CPU缓存概念。

A simple computation can show how effective caches can theoretically be. Assume access to main memory takes 200 cycles and access to the cache memory take 15 cycles. Then code using 100 data elements 100 times each will spend 2,000,000 cycles on memory operations if there is no cache and only 168,500 cycles if all data can be cached. That is an improvement of 91.5%.

The size of the SRAM used for caches is many times smaller than the main memory. In the author's experience with workstations with CPU caches the cache size has always been around 1/1000th of the size of the main memory (today: 4MB cache and 4GB main memory). This alone does not constitute a problem. If the size of the working set (the set of data currently worked on) is smaller than the cache size it does not matter. But computers do not have large main memories for no reason. The working set is bound to be larger than the cache. This is especially true for systems running multiple processes where the size of the working set is the sum of the sizes of all the individual processes and the kernel.

我们先用一个简单的计算来展示一下高速缓存的效率。假设，访问主存需要200个周期，而访问高速缓存需要15个周期。如果使用100个数据元素100次，那么在没有高速缓存的情况下，需要2000000个周期，而在有高速缓存、而且所有数据都已被缓存的情况下，只需要168500个周期。节约了91.5%的时间。

用作高速缓存的SRAM容量比主存小得多。以我的经验来说，高速缓存的大小一般是主存的千分之一左右(目前一般是4GB主存、4MB缓存)。这一点本身并不是什么问题。只是，计算机一般都会有比较大的主存，因此工作集的大小总是会大于缓存。特别是那些运行多进程的系统，它的工作集大小是所有进程加上内核的总和。

What is needed to deal with the limited size of the cache is a set of good strategies to determine what should be cached at any given time. Since not all data of the working set is used at exactly the same time we can use techniques to temporarily replace some data in the cache with other data. And maybe this can be done before the data is actually needed. This prefetching would remove some of the costs of accessing main memory since it happens asynchronously with respect to the execution of the program. All these techniques and more can be used to make the cache appear bigger than it actually is. We will discuss them in Section 3.3. Once all these techniques are exploited it is up to the programmer to help the processor. How this can be done will be discussed in Section 6.

处理高速缓存大小的限制需要制定一套很好的策略来决定在给定的时间内什么数据应该被缓存。由于不是所有数据的工作集都是在完全相同的时间段内被使用的，我们可以用一些技术手段将需要用到的数据临时替换那些当前并未使用的缓存数据。这种预取将会减少部分访问主存的成本，因为它与程序的执行是异步的。所有的这些技术将会使高速缓存在使用的时候看起来比实际更大。我们将在3.3节讨论这些问题。我们将在第6章讨论如何让这些技术能很好地帮助程序员，让处理器更高效地工作。

3.1 CPU Caches in the Big Picture

Before diving into technical details of the implementation of CPU caches some readers might find it useful to first see in some more details how caches fit into the “big picture” of a modern computer system.

Figure 3.1: Minimum Cache Configuration

Figure 3.1 shows the minimum cache configuration. It corresponds to the architecture which could be found in early systems which deployed CPU caches. The CPU core is no longer directly connected to the main memory. {In even earlier systems the cache was attached to the system bus just like the CPU and the main memory. This was more a hack than a real solution.} All loads and stores have to go through the cache. The connection between the CPU core and the cache is a special, fast connection. In a simplified representation, the main memory and the cache are connected to the system bus which can also be used for communication with other components of the system. We introduced the system bus as “FSB” which is the name in use today; see Section 2.2. In this section we ignore the Northbridge; it is assumed to be present to facilitate the communication of the CPU(s) with the main memory.

3.1 高速缓存的位置

在深入介绍高速缓存的技术细节之前，有必要说明一下它在现代计算机系统中所处的位置。

图3.1: 最简单的高速缓存配置图

图3.1展示了最简单的高速缓存配置。早期的一些系统就是类似的架构。在这种架构中，CPU核心不再直连到主存。{在一些更早的系统中，高速缓存像CPU与主存一样连到系统总线上。那种做法更像是一种hack，而不是真正的解决方案。}数据的读取和存储都经过高速缓存。CPU核心与高速缓存之间是一条特殊的快速通道。在简化的表示法中，主存与高速缓存都连到系统总线上，这条总线同时还用于与其它组件通信。我们管这条总线叫“FSB”——就是现在称呼它的术语，参见第2.2节。在这一节里，我们将忽略北桥。

Even though computers for the last several decades have used the von Neumann architecture, experience has shown that it is of advantage to separate the caches used for code and for data. Intel has used separate code and data caches since 1993 and never looked back. The memory regions needed for code and data are pretty much independent of each other, which is why independent caches work better. In recent years another advantage emerged: the instruction decoding step for the most common processors is slow; caching decoded instructions can speed up the execution, especially when the pipeline is empty due to incorrectly predicted or impossible-to-predict branches.

在过去的几十年，经验表明使用了冯诺伊曼结构的计算机，将用于代码和数据的高速缓存分开是存在巨大优势的。自1993年以来，Intel 并且一直坚持使用独立的代码和数据高速缓存。由于所需的代码和数据的内存区域是几乎相互独立的，这就是为什么独立缓存工作得更完美的原因。近年来，独立缓存的另一个优势慢慢显现出来：常见处理器解码指令的步骤是缓慢的，尤其当管线为空的时候，往往会伴随着错误的预测或无法预测的分支的出现，将高速缓存技术用于指令解码可以加快其执行速度。

Soon after the introduction of the cache, the system got more complicated. The speed difference between the cache and the main memory increased again, to a point that another level of cache was added, bigger and slower than the first-level cache. Only increasing the size of the first-level cache was not an option for economical reasons. Today, there are even machines with three levels of cache in regular use. A system with such a processor looks like Figure 3.2. With the increase on the number of cores in a single CPU the number of cache levels might increase in the future even more.

Figure 3.2: Processor with Level 3 Cache

Figure 3.2 shows three levels of cache and introduces the nomenclature we will use in the remainder of the document. L1d is the level 1 data cache, L1i the level 1 instruction cache, etc. Note that this is a schematic; the data flow in reality need not pass through any of the higher-level caches on the way from the core to the main memory. CPU designers have a lot of freedom designing the interfaces of the caches. For programmers these design choices are invisible.

在高速缓存出现后不久，系统变得更加复杂。高速缓存与主存之间的速度差异进一步拉大，直到加入了另一级缓存。新加入的这一级缓存比第一级缓存更大，但是更慢。由于加大一级缓存的做法从经济上考虑是行不通的，所以有了二级缓存，甚至现在的有些系统拥有三级缓存，如图3.2所示。随着单个CPU中核数的增加，未来甚至可能会出现更多层级的缓存。

图3.2: 三级缓存的处理器

图3.2展示了三级缓存，并介绍了本文将使用的一些术语。L1d是一级数据缓存，L1i是一级指令缓存，等等。请注意，这只是示意图，真正的数据流并不需要流经上级缓存。CPU的设计者们在设计高速缓存的接口时拥有很大的自由。而程序员是看不到这些设计选项的。

In addition we have processors which have multiple cores and each core can have multiple “threads”. The difference between a core and a thread is that separate cores have separate copies of (almost {Early multi-core processors even had separate 2nd level caches and no 3rd level cache.}) all the hardware resources. The cores can run completely independently unless they are using the same resources—e.g., the connections to the outside—at the same time. Threads, on the other hand, share almost all of the processor's resources. Intel's implementation of threads has only separate registers for the threads and even that is limited, some registers are shared. The complete picture for a modern CPU therefore looks like Figure 3.3.

Figure 3.3: Multi processor, multi-core, multi-thread

In this figure we have two processors, each with two cores, each of which has two threads. The threads share the Level 1 caches. The cores (shaded in the darker gray) have individual Level 1 caches. All cores of the CPU share the higher-level caches. The two processors (the two big boxes shaded in the lighter gray) of course do not share any caches. All this will be important, especially when we are discussing the cache effects on multi-process and multi-thread applications.

另外，我们有多核CPU，每个核心可以有多个“线程”。核心与线程的不同之处在于，核心拥有独立的硬件资源({早期的多核CPU甚至有独立的二级缓存。})。在不同时使用相同资源(比如，通往外界的连接)的情况下，核心可以完全独立地运行。而线程只是共享资源。Intel的线程只有独立的寄存器，而且还有限制——不是所有寄存器都独立，有些是共享的。综上，现代CPU的结构就像图3.3所示。

图3.3 多处理器、多核心、多线程

在上图中，有两个处理器，每个处理器有两个核心，每个核心有两个线程。线程们共享一级缓存。核心(以深灰色表示)有独立的一级缓存，同时共享二级缓存。处理器(淡灰色)之间不共享任何缓存。这些信息很重要，特别是在讨论多进程和多线程情况下缓存的影响时尤为重要。

3.2 Cache Operation at High Level

To understand the costs and savings of using a cache we have to combine the knowledge about the machine architecture and RAM technology from Section 2 with the structure of caches described in the previous section.

By default all data read or written by the CPU cores is stored in the cache. There are memory regions which cannot be cached but this is something only the OS implementers have to be concerned about; it is not visible to the application programmer. There are also instructions which allow the programmer to deliberately bypass certain caches. This will be discussed in Section 6.

3.2 高级的缓存操作

了解成本和节约使用缓存，我们必须结合在第二节中讲到的关于计算机体系结构和RAM技术，以及前一节讲到的缓存描述来探讨。

默认情况下，CPU核心所有的数据的读或写都存储在缓存中。当然，也有内存区域不能被缓存的，但是这种情况只发生在操作系统的实现者对数据考虑的前提下；对程序实现者来说，这是不可见的。这也说明，程序设计者可以故意绕过某些缓存，不过这将是第六节中讨论的内容了。

If the CPU needs a data word the caches are searched first. Obviously, the cache cannot contain the content of the entire main memory (otherwise we would need no cache), but since all memory addresses are cacheable, each cache entry is tagged using the address of the data word in the main memory. This way a request to read or write to an address can search the caches for a matching tag. The address in this context can be either the virtual or physical address, varying based on the cache implementation.

Since the tag requires space in addition to the actual memory, it is inefficient to chose a word as the granularity of the cache. For a 32-bit word on an x86 machine the tag itself might need 32 bits or more. Furthermore, since spatial locality is one of the principles on which caches are based, it would be bad to not take this into account. Since neighboring memory is likely to be used together it should also be loaded into the cache together. Remember also what we learned in Section 2.2.1: RAM modules are much more effective if they can transport many data words in a row without a new CAS or even RAS signal. So the entries stored in the caches are not single words but, instead, “lines” of several contiguous words. In early caches these lines were 32 bytes long; now the norm is 64 bytes. If the memory bus is 64 bits wide this means 8 transfers per cache line. DDR supports this transport mode efficiently.

如果CPU需要访问某个字(word)，先检索缓存。很显然，缓存不可能容纳主存所有内容(否则还需要主存干嘛)。系统用字的内存地址来对缓存条目进行标记。如果需要读写某个地址的字，那么根据标签来检索缓存即可。这里用到的地址可以是虚拟地址，也可以是物理地址，取决于缓存的具体实现。

标签是需要额外空间的，用字作为缓存的粒度显然毫无效率。比如，在x86机器上，32位字的标签可能需要32位，甚至更长。另一方面，由于空间局部性的存在，与当前地址相邻的地址有很大可能会被一起访问。再回忆下2.2.1节——内存模块在传输位于同一行上的多份数据时，由于不需要发送新CAS信号，甚至不需要发送RAS信号，因此可以实现很高的效率。基于以上的原因，缓存条目并不存储单个字，而是存储若干连续字组成的“线”。在早期的缓存中，线长是32字节，现在一般是64字节。对于64位宽的内存总线，每条线需要8次传输。而DDR对于这种传输模式的支持更为高效。

When memory content is needed by the processor the entire cache line is loaded into the L1d. The memory address for each cache line is computed by masking the address value according to the cache line size. For a 64 byte cache line this means the low 6 bits are zeroed. The discarded bits are used as the offset into the cache line. The remaining bits are in some cases used to locate the line in the cache and as the tag. In practice an address value is split into three parts. For a 32-bit address it might look as follows:

With a cache line size of 2O the low O bits are used as the offset into the cache line. The next S bits select the “cache set”. We will go into more detail soon on why sets, and not single slots, are used for cache lines. For now it is sufficient to understand there are 2S sets of cache lines. This leaves the top 32 - S - O = T bits which form the tag. These T bits are the value associated with each cache line to distinguish all the aliases {All cache lines with the same S part of the address are known by the same alias.} which are cached in the same cache set. The S bits used to address the cache set do not have to be stored since they are the same for all cache lines in the same set.

当处理器需要内存中的某块数据时，整条缓存线被装入L1d。缓存线的地址通过对内存地址进行掩码操作生成。对于64字节的缓存线，是将低6位置0。这些被丢弃的位作为线内偏移量。其它的位作为标签，并用于在缓存内定位。在实践中，我们将地址分为三个部分。32位地址的情况如下:

如果缓存线长度为2O，那么地址的低O位用作线内偏移量。上面的S位选择“缓存集”。后面我们会说明使用缓存集的原因。现在只需要明白一共有2S个缓存集就够了。剩下的32 - S - O = T位组成标签。它们用来区分别名相同的各条线{有相同S部分的缓存线被称为有相同的别名。}用于定位缓存集的S部分不需要存储，因为属于同一缓存集的所有线的S部分都是相同的。

When an instruction modifies memory the processor still has to load a cache line first because no instruction modifies an entire cache line at once (exception to the rule: write-combining as explained in Section 6.1). The content of the cache line before the write operation therefore has to be loaded. It is not possible for a cache to hold partial cache lines. A cache line which has been written to and which has not been written back to main memory is said to be “dirty”. Once it is written the dirty flag is cleared.

To be able to load new data in a cache it is almost always first necessary to make room in the cache. An eviction from L1d pushes the cache line down into L2 (which uses the same cache line size). This of course means room has to be made in L2. This in turn might push the content into L3 and ultimately into main memory. Each eviction is progressively more expensive. What is described here is the model for an exclusive cache as is preferred by modern AMD and VIA processors. Intel implements inclusive caches {This generalization is not completely correct. A few caches are exclusive and some inclusive caches have exclusive cache properties.} where each cache line in L1d is also present in L2. Therefore evicting from L1d is much faster. With enough L2 cache the disadvantage of wasting memory for content held in two places is minimal and it pays off when evicting. A possible advantage of an exclusive cache is that loading a new cache line only has to touch the L1d and not the L2, which could be faster.

当某条指令修改内存时，仍然要先装入缓存线，因为任何指令都不可能同时修改整条线(只有一个例外——第6.1节中将会介绍的写合并(write-combine))。因此需要在写操作前先把缓存线装载进来。如果缓存线被写入，但还没有写回主存，那就是所谓的“脏了”。脏了的线一旦写回主存，脏标记即被清除。

为了装入新数据，基本上总是要先在缓存中清理出位置。L1d将内容逐出L1d，推入L2(线长相同)。当然，L2也需要清理位置。于是L2将内容推入L3，最后L3将它推入主存。这种逐出操作一级比一级昂贵。这里所说的是现代AMD和VIA处理器所采用的独占型缓存(exclusive cache)。而Intel采用的是包容型缓存(inclusive cache)，{并不完全正确，Intel有些缓存是独占型的，还有一些缓存具有独占型缓存的特点。}L1d的每条线同时存在于L2里。对这种缓存，逐出操作就很快了。如果有足够L2，对于相同内容存在不同地方造成内存浪费的缺点可以降到最低，而且在逐出时非常有利。而独占型缓存在装载新数据时只需要操作L1d，不需要碰L2，因此会比较快。

The CPUs are allowed to manage the caches as they like as long as the memory model defined for the processor architecture is not changed. It is, for instance, perfectly fine for a processor to take advantage of little or no memory bus activity and proactively write dirty cache lines back to main memory. The wide variety of cache architectures among the processors for the x86 and x86-64, between manufacturers and even within the models of the same manufacturer, are testament to the power of the memory model abstraction.

In symmetric multi-processor (SMP) systems the caches of the CPUs cannot work independently from each other. All processors are supposed to see the same memory content at all times. The maintenance of this uniform view of memory is called “cache coherency”. If a processor were to look simply at its own caches and main memory it would not see the content of dirty cache lines in other processors. Providing direct access to the caches of one processor from another processor would be terribly expensive and a huge bottleneck. Instead, processors detect when another processor wants to read or write to a certain cache line.

处理器体系结构中定义的作为存储器的模型只要还没有改变，那就允许多CPU按照自己的方式来管理高速缓存。这表示，例如，设计优良的处理器，利用很少或根本没有内存总线活动，并主动写回主内存脏高速缓存行。这种高速缓存架构在如x86和x86-64各种各样的处理器间存在。制造商之间，即使在同一制造商生产的产品中，证明了的内存模型抽象的力量。

在对称多处理器（SMP）架构的系统中，CPU的高速缓存不能独立的工作。在任何时候，所有的处理器都应该拥有相同的内存内容。保证这样的统一的内存视图被称为“高速缓存一致性”。如果在其自己的高速缓存和主内存间，处理器设计简单，它将不会看到在其他处理器上的脏高速缓存行的内容。从一个处理器直接访问另一个处理器的高速缓存这种模型设计代价将是非常昂贵的，它是一个相当大的瓶颈。相反，当另一个处理器要读取或写入到高速缓存线上时，处理器会去检测。

If a write access is detected and the processor has a clean copy of the cache line in its cache, this cache line is marked invalid. Future references will require the cache line to be reloaded. Note that a read access on another CPU does not necessitate an invalidation, multiple clean copies can very well be kept around.

More sophisticated cache implementations allow another possibility to happen. If the cache line which another processor wants to read from or write to is currently marked dirty in the first processor's cache a different course of action is needed. In this case the main memory is out-of-date and the requesting processor must, instead, get the cache line content from the first processor. Through snooping, the first processor notices this situation and automatically sends the requesting processor the data. This action bypasses main memory, though in some implementations the memory controller is supposed to notice this direct transfer and store the updated cache line content in main memory. If the access is for writing the first processor then invalidates its copy of the local cache line.

如果CPU检测到一个写访问，而且该CPU的cache中已经缓存了一个cache line的原始副本，那么这个cache line将被标记为无效的cache line。接下来在引用这个cache line之前，需要重新加载该cache line。需要注意的是读访问并不会导致cache line被标记为无效的。

更精确的cache实现需要考虑到其他更多的可能性，比如第二个CPU在读或者写他的cache line时，发现该cache line在第一个CPU的cache中被标记为脏数据了，此时我们就需要做进一步的处理。在这种情况下，主存储器已经失效，第二个CPU需要读取第一个CPU的cache line。通过测试，我们知道在这种情况下第一个CPU会将自己的cache line数据自动发送给第二个CPU。这种操作是绕过主存储器的，但是有时候存储控制器是可以直接将第一个CPU中的cache line数据存储到主存储器中。对第一个CPU的cache的写访问会导致本地cache line的所有拷贝被标记为无效。

Over time a number of cache coherency protocols have been developed. The most important is MESI, which we will introduce in Section 3.3.4. The outcome of all this can be summarized in a few simple rules:

A dirty cache line is not present in any other processor's cache.
Clean copies of the same cache line can reside in arbitrarily many caches.

If these rules can be maintained, processors can use their caches efficiently even in multi-processor systems. All the processors need to do is to monitor each others' write accesses and compare the addresses with those in their local caches. In the next section we will go into a few more details about the implementation and especially the costs.

随着时间的推移，一大批缓存一致性协议已经建立。其中，最重要的是MESI,我们将在第3.3.4节进行介绍。以上结论可以概括为几个简单的规则:

一个脏缓存线不存在于任何其他处理器的缓存之中。
同一缓存线中的干净拷贝可以驻留在任意多个其他缓存之中。

如果遵守这些规则,处理器甚至可以在多处理器系统中更加有效的使用它们的缓存。所有的处理器需要做的就是监控其他每一个写访问和比较本地缓存中的地址。在下一节中,我们将介绍更多细节方面的实现,尤其是存储开销方面的细节。

Finally, we should at least give an impression of the costs associated with cache hits and misses. These are the numbers Intel lists for a Pentium M:

To Where Cycles

Register <= 1

L1d ~3

L2 ~14

Main Memory ~240

These are the actual access times measured in CPU cycles. It is interesting to note that for the on-die L2 cache a large part (probably even the majority) of the access time is caused by wire delays. This is a physical limitation which can only get worse with increasing cache sizes. Only process shrinking (for instance, going from 60nm for Merom to 45nm for Penryn in Intel's lineup) can improve those numbers.

最后，我们至少应该关注高速缓存命中或未命中带来的消耗。下面是英特尔奔腾 M 的数据：

To Where	Cycles
Register	<= 1
L1d	~3
L2	~14
Main Memory	~240

这是在CPU周期中的实际访问时间。有趣的是，对于L2高速缓存的访问时间很大一部分（甚至是大部分）是由线路的延迟引起的。这是一个限制，增加高速缓存的大小变得更糟。只有当减小时（例如，从60纳米的Merom到45纳米Penryn处理器），可以提高这些数据。

The numbers in the table look high but, fortunately, the entire cost does not have to be paid for each occurrence of the cache load and miss. Some parts of the cost can be hidden. Today's processors all use internal pipelines of different lengths where the instructions are decoded and prepared for execution. Part of the preparation is loading values from memory (or cache) if they are transferred to a register. If the memory load operation can be started early enough in the pipeline, it may happen in parallel with other operations and the entire cost of the load might be hidden. This is often possible for L1d; for some processors with long pipelines for L2 as well.

There are many obstacles to starting the memory read early. It might be as simple as not having sufficient resources for the memory access or it might be that the final address of the load becomes available late as the result of another instruction. In these cases the load costs cannot be hidden (completely).

表格中的数字看起来很高，但是，幸运的是，整个成本不必须负担每次出现的缓存加载和缓存失效。某些部分的成本可以被隐藏。现在的处理器都使用不同长度的内部管道，在管道内指令被解码，并为准备执行。如果数据要传送到一个寄存器，那么部分的准备工作是从存储器（或高速缓存）加载数据。如果内存加载操作在管道中足够早的进行，它可以与其他操作并行发生，那么加载的全部发销可能会被隐藏。对L1D常常可能如此；某些有长管道的处理器的L2也可以。

提早启动内存的读取有许多障碍。它可能只是简单的不具有足够资源供内存访问，或者地址从另一个指令获取，然后加载的最终地址才变得可用。在这种情况下，加载成本是不能隐藏的（完全的）。

For write operations the CPU does not necessarily have to wait until the value is safely stored in memory. As long as the execution of the following instructions appears to have the same effect as if the value were stored in memory there is nothing which prevents the CPU from taking shortcuts. It can start executing the next instruction early. With the help of shadow registers which can hold values no longer available in a regular register it is even possible to change the value which is to be stored in the incomplete write operation.

Figure 3.4: Access Times for Random Writes

For an illustration of the effects of cache behavior see Figure 3.4. We will talk about the program which generated the data later; it is a simple simulation of a program which accesses a configurable amount of memory repeatedly in a random fashion. Each data item has a fixed size. The number of elements depends on the selected working set size. The Y–axis shows the average number of CPU cycles it takes to process one element; note that the scale for the Y–axis is logarithmic. The same applies in all the diagrams of this kind to the X–axis. The size of the working set is always shown in powers of two.

对于写操作，CPU并不需要等待数据被安全地放入内存。只要指令具有类似的效果，就没有什么东西可以阻止CPU走捷径了。它可以早早地执行下一条指令，甚至可以在影子寄存器(shadow register)的帮助下，更改这个写操作将要存储的数据。

图3.4: 随机写操作的访问时间

图3.4展示了缓存的效果。关于产生图中数据的程序，我们会在稍后讨论。这里大致说下，这个程序是连续随机地访问某块大小可配的内存区域。每个数据项的大小是固定的。数据项的多少取决于选择的工作集大小。Y轴表示处理每个元素平均需要多少个CPU周期，注意它是对数刻度。X轴也是同样，工作集的大小都以2的n次方表示。

The graph shows three distinct plateaus. This is not surprising: the specific processor has L1d and L2 caches, but no L3. With some experience we can deduce that the L1d is 213 bytes in size and that the L2 is 220 bytes in size. If the entire working set fits into the L1d the cycles per operation on each element is below 10. Once the L1d size is exceeded the processor has to load data from L2 and the average time springs up to around 28. Once the L2 is not sufficient anymore the times jump to 480 cycles and more. This is when many or most operations have to load data from main memory. And worse: since data is being modified dirty cache lines have to be written back, too.

This graph should give sufficient motivation to look into coding improvements which help improve cache usage. We are not talking about a few measly percent here; we are talking about orders-of-magnitude improvements which are sometimes possible. In Section 6 we will discuss techniques which allow writing more efficient code. The next section goes into more details of CPU cache designs. The knowledge is good to have but not necessary for the rest of the paper. So this section could be skipped.

图中有三个比较明显的不同阶段。很正常，这个处理器有L1d和L2，没有L3。根据经验可以推测出，L1d有213字节，而L2有220字节。因为，如果整个工作集都可以放入L1d，那么只需不到10个周期就可以完成操作。如果工作集超过L1d，处理器不得不从L2获取数据，于是时间飘升到28个周期左右。如果工作集更大，超过了L2，那么时间进一步暴涨到480个周期以上。这时候，许多操作将不得不从主存中获取数据。更糟糕的是，如果修改了数据，还需要将这些脏了的缓存线写回内存。

看了这个图，大家应该会有足够的动力去检查代码、改进缓存的利用方式了吧？这里的性能改善可不只是微不足道的几个百分点，而是几个数量级呀。在第6节中，我们将介绍一些编写高效代码的技巧。而下一节将进一步深入缓存的设计。虽然精彩，但并不是必修课，大家可以选择性地跳过。

3.3 CPU Cache Implementation Details

Cache implementers have the problem that each cell in the huge main memory potentially has to be cached. If the working set of a program is large enough this means there are many main memory locations which fight for each place in the cache. Previously it was noted that a ratio of 1-to-1000 for cache versus main memory size is not uncommon.

3.3.1 Associativity

It would be possible to implement a cache where each cache line can hold a copy of any memory location. This is called a fully associative cache. To access a cache line the processor core would have to compare the tags of each and every cache line with the tag for the requested address. The tag would be comprised of the entire part of the address which is not the offset into the cache line (that means, S in the figure on Section 3.2 is zero).

3.3 CPU缓存实现的细节

缓存的实现者们都要面对一个问题——主存中每一个单元都可能需被缓存。如果程序的工作集很大，就会有许多内存位置为了缓存而打架。前面我们曾经提过缓存与主存的容量比，1:1000也十分常见。

3.3.1 关联性

我们可以让缓存的每条线能存放任何内存地址的数据。这就是所谓的全关联缓存(fully associative cache)。对于这种缓存，处理器为了访问某条线，将不得不检索所有线的标签。而标签则包含了整个地址，而不仅仅只是线内偏移量(也就意味着，图3.2中的S为0)。

There are caches which are implemented like this but, by looking at the numbers for an L2 in use today, will show that this is impractical. Given a 4MB cache with 64B cache lines the cache would have 65,536 entries. To achieve adequate performance the cache logic would have to be able to pick from all these entries the one matching a given tag in just a few cycles. The effort to implement this would be enormous.

Figure 3.5: Fully Associative Cache Schematics

For each cache line a comparator is needed to compare the large tag (note, S is zero). The letter next to each connection indicates the width in bits. If none is given it is a single bit line. Each comparator has to compare two T-bit-wide values. Then, based on the result, the appropriate cache line content is selected and made available. This requires merging as many sets of O data lines as there are cache buckets. The number of transistors needed to implement a single comparator is large especially since it must work very fast. No iterative comparator is usable. The only way to save on the number of comparators is to reduce the number of them by iteratively comparing the tags. This is not suitable for the same reason that iterative comparators are not: it takes too long.

高速缓存有类似这样的实现，但是，看看在今天使用的L2的数目，表明这是不切实际的。给定4MB的高速缓存和64B的高速缓存段，高速缓存将有65,536个项。为了达到足够的性能，缓存逻辑必须能够在短短的几个时钟周期内，从所有这些项中，挑一个匹配给定的标签。实现这一点的工作将是巨大的。

Figure 3.5: 全关联高速缓存原理图

对于每个高速缓存行，比较器是需要比较大标签（注意，S是零）。每个连接旁边的字母表示位的宽度。如果没有给出，它是一个单比特线。每个比较器都要比较两个T-位宽的值。然后，基于该结果，适当的高速缓存行的内容被选中，并使其可用。这需要合并多套O数据线，因为他们是缓存桶（译注：这里类似把O输出接入多选器，所以需要合并）。实现仅仅一个比较器，需要晶体管的数量就非常大，特别是因为它必须非常快。没有迭代比较器是可用的。节省比较器的数目的唯一途径是通过反复比较标签，以减少它们的数目。这是不适合的，出于同样的原因，迭代比较器不可用：它的时间太长。

Fully associative caches are practical for small caches (for instance, the TLB caches on some Intel processors are fully associative) but those caches are small, really small. We are talking about a few dozen entries at most.

For L1i, L1d, and higher level caches a different approach is needed. What can be done is to restrict the search. In the most extreme restriction each tag maps to exactly one cache entry. The computation is simple: given the 4MB/64B cache with 65,536 entries we can directly address each entry by using bits 6 to 21 of the address (16 bits). The low 6 bits are the index into the cache line.

全关联高速缓存对小缓存是实用的（例如，在某些Intel处理器的TLB缓存是全关联的），但这些缓存都很小，非常小的。我们正在谈论的最多几十项。

对于L1i，L1d和更高级别的缓存，需要采用不同的方法。可以做的就是是限制搜索。最极端的限制是，每个标签映射到一个明确的缓存条目。计算很简单：给定的4MB/64B缓存有65536项，我们可以使用地址的bit6到bit21（16位）来直接寻址高速缓存的每一个项。地址的低6位作为高速缓存段的索引。

Figure 3.6: Direct-Mapped Cache Schematics

Such a direct-mapped cache is fast and relatively easy to implement as can be seen in Figure 3.6. It requires exactly one comparator, one multiplexer (two in this diagram where tag and data are separated, but this is not a hard requirement on the design), and some logic to select only valid cache line content. The comparator is complex due to the speed requirements but there is only one of them now; as a result more effort can be spent on making it fast. The real complexity in this approach lies in the multiplexers. The number of transistors in a simple multiplexer grows with O(log N), where N is the number of cache lines. This is tolerable but might get slow, in which case speed can be increased by spending more real estate on transistors in the multiplexers to parallelize some of the work and to increase the speed. The total number of transistors can grow slowly with a growing cache size which makes this solution very attractive. But it has a drawback: it only works well if the addresses used by the program are evenly distributed with respect to the bits used for the direct mapping. If they are not, and this is usually the case, some cache entries are heavily used and therefore repeatedly evicted while others are hardly used at all or remain empty.

Figure 3.6: Direct-Mapped Cache Schematics

在图3.6中可以看出，这种直接映射的高速缓存，速度快，比较容易实现。它只是需要一个比较器，一个多路复用器（在这个图中有两个，标记和数据是分离的，但是对于设计这不是一个硬性要求），和一些逻辑来选择只是有效的高速缓存行的内容。由于速度的要求，比较器是复杂的，但是现在只需要一个，结果是可以花更多的精力，让其变得快速。这种方法的复杂性在于在多路复用器。一个简单的多路转换器中的晶体管的数量增速是O（log N）的，其中N是高速缓存段的数目。这是可以容忍的，但可能会很慢，在某种情况下，速度可提升，通过增加多路复用器晶体管数量，来并行化的一些工作和自身增速。晶体管的总数只是随着快速增长的高速缓存缓慢的增加，这使得这种解决方案非常有吸引力。但它有一个缺点：只有用于直接映射地址的相关的地址位均匀分布，程序才能很好工作。如果分布的不均匀，而且这是常态，一些缓存项频繁的使用，并因此多次被换出，而另一些则几乎不被使用或一直是空的。

Figure 3.7: Set-Associative Cache Schematics

This problem can be solved by making the cache set associative. A set-associative cache combines the features of the full associative and direct-mapped caches to largely avoid the weaknesses of those designs. Figure 3.7 shows the design of a set-associative cache. The tag and data storage are divided into sets which are selected by the address. This is similar to the direct-mapped cache. But instead of only having one element for each set value in the cache a small number of values is cached for the same set value. The tags for all the set members are compared in parallel, which is similar to the functioning of the fully associative cache.

Figure 3.7: 组关联高速缓存原理图

可以通过使高速缓存的组关联来解决此问题。组关联结合高速缓存的全关联和直接映射高速缓存特点，在很大程度上避免那些设计的弱点。图3.7显示了一个组关联高速缓存的设计。标签和数据存储分成不同的组并可以通过地址选择。这类似直接映射高速缓存。但是，小数目的值可以在同一个高速缓存组缓存，而不是一个缓存组只有一个元素，用于在高速缓存中的每个设定值是相同的一组值的缓存。所有组的成员的标签可以并行比较，这类似全关联缓存的功能。

The result is a cache which is not easily defeated by unfortunate or deliberate selection of addresses with the same set numbers and at the same time the size of the cache is not limited by the number of comparators which can be implemented in parallel. If the cache grows it is (in this figure) only the number of columns which increases, not the number of rows. The number of rows only increases if the associativity of the cache is increased. Today processors are using associativity levels of up to 16 for L2 caches or higher. L1 caches usually get by with 8.

L2 Cache Size	Associativity
Direct	2	4	8
CL=32	CL=64	CL=32	CL=64	CL=32	CL=64	CL=32	CL=64
512k	27,794,595	20,422,527	25,222,611	18,303,581	24,096,510	17,356,121	23,666,929	17,029,334
1M	19,007,315	13,903,854	16,566,738	12,127,174	15,537,500	11,436,705	15,162,895	11,233,896
2M	12,230,962	8,801,403	9,081,881	6,491,011	7,878,601	5,675,181	7,391,389	5,382,064
4M	7,749,986	5,427,836	4,736,187	3,159,507	3,788,122	2,418,898	3,430,713	2,125,103
8M	4,731,904	3,209,693	2,690,498	1,602,957	2,207,655	1,228,190	2,111,075	1,155,847
16M	2,620,587	1,528,592	1,958,293	1,089,580	1,704,878	883,530	1,671,541	862,324

Table 3.1: Effects of Cache Size, Associativity, and Line Size

Given our 4MB/64B cache and 8-way set associativity the cache we are left with has 8,192 sets and only 13 bits of the tag are used in addressing the cache set. To determine which (if any) of the entries in the cache set contains the addressed cache line 8 tags have to be compared. That is feasible to do in very short time. With an experiment we can see that this makes sense.

其结果是高速缓存，不容易被不幸或故意选择同属同一组编号的地址所击败，同时高速缓存的大小并不限于由比较器的数目，可以以并行的方式实现。如果高速缓存增长，只（在该图中）增加列的数目，而不增加行数。只有高速缓存之间的关联性增加，行数才会增加。今天，处理器的L2高速缓存或更高的高速缓存，使用的关联性高达16。 L1高速缓存通常使用8。

L2 Cache Size	Associativity
Direct	2	4	8
CL=32	CL=64	CL=32	CL=64	CL=32	CL=64	CL=32	CL=64
512k	27,794,595	20,422,527	25,222,611	18,303,581	24,096,510	17,356,121	23,666,929	17,029,334
1M	19,007,315	13,903,854	16,566,738	12,127,174	15,537,500	11,436,705	15,162,895	11,233,896
2M	12,230,962	8,801,403	9,081,881	6,491,011	7,878,601	5,675,181	7,391,389	5,382,064
4M	7,749,986	5,427,836	4,736,187	3,159,507	3,788,122	2,418,898	3,430,713	2,125,103
8M	4,731,904	3,209,693	2,690,498	1,602,957	2,207,655	1,228,190	2,111,075	1,155,847
16M	2,620,587	1,528,592	1,958,293	1,089,580	1,704,878	883,530	1,671,541	862,324

Table 3.1: 高速缓存大小，关联行，段大小的影响

给定我们4MB/64B高速缓存，8路组关联，相关的缓存留给我们的有8192组，只用标签的13位，就可以寻址缓集。要确定哪些（如果有的话）的缓存组设置中的条目包含寻址的高速缓存行，8个标签都要进行比较。在很短的时间内做出来是可行的。通过一个实验，我们可以看到，这是有意义的。

Table 3.1 shows the number of L2 cache misses for a program (gcc in this case, the most important benchmark of them all, according to the Linux kernel people) for changing cache size, cache line size, and associativity set size. In Section 7.2 we will introduce the tool to simulate the caches as required for this test.

Just in case this is not yet obvious, the relationship of all these values is that the cache size is

cache line size × associativity × number of sets

The addresses are mapped into the cache by using

O = log 2 cache line size
S = log 2 number of sets

in the way the figure in Section 3.2 shows.

Figure 3.8: Cache Size vs Associativity (CL=32)

Figure 3.8 makes the data of the table more comprehensible. It shows the data for a fixed cache line size of 32 bytes. Looking at the numbers for a given cache size we can see that associativity can indeed help to reduce the number of cache misses significantly. For an 8MB cache going from direct mapping to 2-way set associative cache saves almost 44% of the cache misses. The processor can keep more of the working set in the cache with a set associative cache compared with a direct mapped cache.

表3.1显示一个程序在改变缓存大小，缓存段大小和关联集大小，L2高速缓存的缓存失效数量（根据Linux内核相关的方面人的说法，GCC在这种情况下，是他们所有中最重要的标尺）。在7.2节中，我们将介绍工具来模拟此测试要求的高速缓存。

万一这还不是很明显，所有这些值之间的关系是高速缓存的大小为：

cache line size × associativity × number of sets

地址被映射到高速缓存使用

O = log 2 cache line size
S = log 2 number of sets

在第3.2节中的图显示的方式。

Figure 3.8: 缓存段大小 vs 关联行 (CL=32)

图3.8表中的数据更易于理解。它显示一个固定的32个字节大小的高速缓存行的数据。对于一个给定的高速缓存大小，我们可以看出，关联性，的确可以帮助明显减少高速缓存未命中的数量。对于8MB的缓存，从直接映射到2路组相联，可以减少近44％的高速缓存未命中。组相联高速缓存和直接映射缓存相比，该处理器可以把更多的工作集保持在缓存中。

In the literature one can occasionally read that introducing associativity has the same effect as doubling cache size. This is true in some extreme cases as can be seen in the jump from the 4MB to the 8MB cache. But it certainly is not true for further doubling of the associativity. As we can see in the data, the successive gains are much smaller. We should not completely discount the effects, though. In the example program the peak memory use is 5.6M. So with a 8MB cache there are unlikely to be many (more than two) uses for the same cache set. With a larger working set the savings can be higher as we can see from the larger benefits of associativity for the smaller cache sizes.

In general, increasing the associativity of a cache above 8 seems to have little effects for a single-thread workload. With the introduction of multi-core processors which use a shared L2 the situation changes. Now you basically have two programs hitting on the same cache which causes the associativity in practice to be halved (or quartered for quad-core processors). So it can be expected that, with increasing numbers of cores, the associativity of the shared caches should grow. Once this is not possible anymore (16-way set associativity is already hard) processor designers have to start using shared L3 caches and beyond, while L2 caches are potentially shared by a subset of the cores.

在文献中，偶尔可以读到，引入关联性，和加倍高速缓存的大小具有相同的效果。在从4M缓存跃升到8MB缓存的极端的情况下，这是正确的。关联性再提高一倍那就肯定不正确啦。正如我们所看到的数据，后面的收益要小得多。我们不应该完全低估它的效果，虽然。在示例程序中的内存使用的峰值是5.6M。因此，具有8MB缓存不太可能有很多（两个以上）使用相同的高速缓存的组。从较小的缓存的关联性的巨大收益可以看出，较大工作集可以节省更多。

在一般情况下，增加8以上的高速缓存之间的关联性似乎对只有一个单线程工作量影响不大。随着介绍一个使用共享L2的多核处理器，形势发生了变化。现在你基本上有两个程序命中相同的缓存，实际上导致高速缓存减半（对于四核处理器是1/4）。因此，可以预期，随着核的数目的增加，共享高速缓存的相关性也应增长。一旦这种方法不再可行（16 路组关联性已经很难）处理器设计者不得不开始使用共享的三级高速缓存和更高级别的，而L2高速缓存只被核的一个子集共享。

Another effect we can study in Figure 3.8 is how the increase in cache size helps with performance. This data cannot be interpreted without knowing about the working set size. Obviously, a cache as large as the main memory would lead to better results than a smaller cache, so there is in general no limit to the largest cache size with measurable benefits.

As already mentioned above, the size of the working set at its peak is 5.6M. This does not give us any absolute number of the maximum beneficial cache size but it allows us to estimate the number. The problem is that not all the memory used is contiguous and, therefore, we have, even with a 16M cache and a 5.6M working set, conflicts (see the benefit of the 2-way set associative 16MB cache over the direct mapped version). But it is a safe bet that with the same workload the benefits of a 32MB cache would be negligible. But who says the working set has to stay the same? Workloads are growing over time and so should the cache size. When buying machines, and one has to choose the cache size one is willing to pay for, it is worthwhile to measure the working set size. Why this is important can be seen in the figures on Figure 3.10.

Figure 3.9: Test Memory Layouts

Two types of tests are run. In the first test the elements are processed sequentially. The test program follows the pointernbut the array elements are chained so that they are traversed in the order in which they are found in memory. This can be seen in the lower part of Figure 3.9. There is one back reference from the last element. In the second test (upper part of the figure) the array elements are traversed in a random order. In both cases the array elements form a circular single-linked list.

从图3.8中，我们还可以研究缓存大小对性能的影响。这一数据需要了解工作集的大小才能进行解读。很显然，与主存相同的缓存比小缓存能产生更好的结果，因此，缓存通常是越大越好。

上文已经说过，示例中最大的工作集为5.6M。它并没有给出最佳缓存大小值，但我们可以估算出来。问题主要在于内存的使用并不连续，因此，即使是16M的缓存，在处理5.6M的工作集时也会出现冲突(参见2路集合关联式16MB缓存vs直接映射式缓存的优点)。不管怎样，我们可以有把握地说，在同样5.6M的负载下，缓存从16MB升到32MB基本已没有多少提高的余地。但是，工作集是会变的。如果工作集不断增大，缓存也需要随之增大。在购买计算机时，如果需要选择缓存大小，一定要先衡量工作集的大小。原因可以参见图3.10。

图3.9: 测试的内存分布情况

我们执行两项测试。第一项测试是按顺序地访问所有元素。测试程序循着指针n进行访问，而所有元素是链接在一起的，从而使它们的被访问顺序与在内存中排布的顺序一致，如图3.9的下半部分所示，末尾的元素有一个指向首元素的引用。而第二项测试(见图3.9的上半部分)则是按随机顺序访问所有元素。在上述两个测试中，所有元素都构成一个单向循环链表。

3.3.2 Measurements of Cache Effects

All the figures are created by measuring a program which can simulate working sets of arbitrary size, read and write access, and sequential or random access. We have already seen some results in Figure 3.4. The program creates an array corresponding to the working set size of elements of this type:

  struct l {
    struct l *n;
    long int pad[NPAD];
  };

All entries are chained in a circular list using thenelement, either in sequential or random order. Advancing from one entry to the next always uses the pointer, even if the elements are laid out sequentially. Thepadelement is the payload and it can grow arbitrarily large. In some tests the data is modified, in others the program only performs read operations.

In the performance measurements we are talking about working set sizes. The working set is made up of an array ofstruct lelements. A working set of 2N bytes contains

2 N/sizeof(struct l)

elements. Obviouslysizeof(struct l)depends on the value ofNPAD. For 32-bit systems,NPAD=7 means the size of each array element is 32 bytes, for 64-bit systems the size is 64 bytes.

3.3.2 Cache的性能测试

用于测试程序的数据可以模拟一个任意大小的工作集：包括读、写访问，随机、连续访问。在图3.4中我们可以看到，程序为工作集创建了一个与其大小和元素类型相同的数组：

  struct l {
    struct l *n;
    long int pad[NPAD];
  };

n字段将所有节点随机得或者顺序的加入到环形链表中，用指针从当前节点进入到下一个节点。pad字段用来存储数据，其可以是任意大小。在一些测试程序中，pad字段是可以修改的, 在其他程序中，pad字段只可以进行读操作。

在性能测试中，我们谈到工作集大小的问题，工作集使用结构体l定义的元素表示的。2N 字节的工作集包含

2 N/sizeof(struct l)

个元素. 显然sizeof(struct l) 的值取决于NPAD的大小。在32位系统上，NPAD=7意味着数组的每个元素的大小为32字节，在64位系统上，NPAD=7意味着数组的每个元素的大小为64字节。

Single Threaded Sequential Access

The simplest case is a simple walk over all the entries in the list. The list elements are laid out sequentially, densely packed. Whether the order of processing is forward or backward does not matter, the processor can deal with both directions equally well. What we measure here—and in all the following tests—is how long it takes to handle a single list element. The time unit is a processor cycle. Figure 3.10 shows the result. Unless otherwise specified, all measurements are made on a Pentium 4 machine in 64-bit mode which means the structurelwithNPAD=0is eight bytes in size.

Figure 3.10: Sequential Read Access, NPAD=0

Figure 3.11: Sequential Read for Several Sizes

单线程顺序访问

最简单的情况就是遍历链表中顺序存储的节点。无论是从前向后处理，还是从后向前，对于处理器来说没有什么区别。下面的测试中，我们需要得到处理链表中一个元素所需要的时间，以CPU时钟周期最为计时单元。图3.10显示了测试结构。除非有特殊说明, 所有的测试都是在Pentium 4 64-bit 平台上进行的，因此结构体l中NPAD=0，大小为8字节。

图 3.10: 顺序读访问, NPAD=0

图 3.11: 顺序读多个字节

The first two measurements are polluted by noise. The measured workload is simply too small to filter the effects of the rest of the system out. We can safely assume that the values are all at the 4 cycles level. With this in mind we can see three distinct levels:

Up to a working set size of 214 bytes.
From 215 bytes to 220 bytes.
From 221 bytes and up.

These steps can be easily explained: the processor has a 16kB L1d and 1MB L2. We do not see sharp edges in the transition from one level to the other because the caches are used by other parts of the system as well and so the cache is not exclusively available for the program data. Specifically the L2 cache is a unified cache and also used for the instructions (NB: Intel uses inclusive caches).

一开始的两个测试数据收到了噪音的污染。由于它们的工作负荷太小，无法过滤掉系统内其它进程对它们的影响。我们可以认为它们都是4个周期以内的。这样一来，整个图可以划分为比较明显的三个部分:

工作集小于214字节的。
工作集从215字节到220字节的。
工作集大于221字节的。

这样的结果很容易解释——是因为处理器有16KB的L1d和1MB的L2。而在这三个部分之间，并没有非常锐利的边缘，这是因为系统的其它部分也在使用缓存，我们的测试程序并不能独占缓存的使用。尤其是L2，它是统一式的缓存，处理器的指令也会使用它(注: Intel使用的是包容式缓存)。

What is perhaps not quite expected are the actual times for the different working set sizes. The times for the L1d hits are expected: load times after an L1d hit are around 4 cycles on the P4. But what about the L2 accesses? Once the L1d is not sufficient to hold the data one might expect it would take 14 cycles or more per element since this is the access time for the L2. But the results show that only about 9 cycles are required. This discrepancy can be explained by the advanced logic in the processors. In anticipation of using consecutive memory regions, the processor prefetches the next cache line. This means that when the next line is actually used it is already halfway loaded. The delay required to wait for the next cache line to be loaded is therefore much less than the L2 access time.

测试的实际耗时可能会出乎大家的意料。L1d的部分跟我们预想的差不多，在一台P4上耗时为4个周期左右。但L2的结果则出乎意料。大家可能觉得需要14个周期以上，但实际只用了9个周期。这要归功于处理器先进的处理逻辑，当它使用连续的内存区时，会 预先读取下一条缓存线的数据。这样一来，当真正使用下一条线的时候，其实已经早已读完一半了，于是真正的等待耗时会比L2的访问时间少很多。

The effect of prefetching is even more visible once the working set size grows beyond the L2 size. Before we said that a main memory access takes 200+ cycles. Only with effective prefetching is it possible for the processor to keep the access times as low as 9 cycles. As we can see from the difference between 200 and 9, this works out nicely.

We can observe the processor while prefetching, at least indirectly. In Figure 3.11 we see the times for the same working set sizes but this time we see the graphs for different sizes of the structurel. This means we have fewer but larger elements in the list. The different sizes have the effect that the distance between thenelements in the (still consecutive) list grows. In the four cases of the graph the distance is 0, 56, 120, and 248 bytes respectively.

在工作集超过L2的大小之后，预取的效果更明显了。前面我们说过，主存的访问需要耗时200个周期以上。但在预取的帮助下，实际耗时保持在9个周期左右。200 vs 9，效果非常不错。

我们可以观察到预取的行为，至少可以间接地观察到。图3.11中有4条线，它们表示处理不同大小结构时的耗时情况。随着结构的变大，元素间的距离变大了。图中4条线对应的元素距离分别是0、56、120和248字节。

At the bottom we can see the line from the previous graph, but this time it appears more or less as a flat line. The times for the other cases are simply so much worse. We can see in this graph, too, the three different levels and we see the large errors in the tests with the small working set sizes (ignore them again). The lines more or less all match each other as long as only the L1d is involved. There is no prefetching necessary so all element sizes just hit the L1d for each access.

For the L2 cache hits we see that the three new lines all pretty much match each other but that they are at a higher level (about 28). This is the level of the access time for the L2. This means prefetching from L2 into L1d is basically disabled. Even withNPAD=7 we need a new cache line for each iteration of the loop; forNPAD=0, instead, the loop has to iterate eight times before the next cache line is needed. The prefetch logic cannot load a new cache line every cycle. Therefore we see a stall to load from L2 in every iteration.

图中最下面的这一条线来自前一个图，但在这里更像是一条直线。其它三条线的耗时情况比较差。图中这些线也有比较明显的三个阶段，同时，在小工作集的情况下也有比较大的错误(请再次忽略这些错误)。在只使用L1d的阶段，这些线条基本重合。因为这时候还不需要预取，只需要访问L1d就行。

在L2阶段，三条新加的线基本重合，而且耗时比老的那条线高很多，大约在28个周期左右，差不多就是L2的访问时间。这表明，从L2到L1d的预取并没有生效。这是因为，对于最下面的线(NPAD=0)，由于结构小，8次循环后才需要访问一条新缓存线，而上面三条线对应的结构比较大，拿相对最小的NPAD=7来说，光是一次循环就需要访问一条新线，更不用说更大的NPAD=15和31了。而预取逻辑是无法在每个周期装载新线的，因此每次循环都需要从L2读取，我们看到的就是从L2读取的时延。

It gets even more interesting once the working set size exceeds the L2 capacity. Now all four lines vary widely. The different element sizes play obviously a big role in the difference in performance. The processor should recognize the size of the strides and not fetch unnecessary cache lines forNPAD=15 and 31 since the element size is smaller than the prefetch window (see Section 6.3.1). Where the element size is hampering the prefetching efforts is a result of a limitation of hardware prefetching: it cannot cross page boundaries. We are reducing the effectiveness of the hardware scheduler by 50% for each size increase. If the hardware prefetcher were allowed to cross page boundaries and the next page is not resident or valid the OS would have to get involved in locating the page. That means the program would experience a page fault it did not initiate itself. This is completely unacceptable since the processor does not know whether a page is not present or does not exist. In the latter case the OS would have to abort the process. In any case, given that, forNPAD=7 and higher, we need one cache line per list element the hardware prefetcher cannot do much. There simply is no time to load the data from memory since all the processor does is read one word and then load the next element.

更有趣的是工作集超过L2容量后的阶段。快看，4条线远远地拉开了。元素的大小变成了主角，左右了性能。处理器应能识别每一步(stride)的大小，不去为NPAD=15和31获取那些实际并不需要的缓存线(参见6.3.1)。元素大小对预取的约束是根源于硬件预取的限制——它无法跨越页边界。如果允许预取器跨越页边界，而下一页不存在或无效，那么OS还得去寻找它。这意味着，程序需要遭遇一次并非由它自己产生的页错误，这是完全不能接受的。在NPAD=7或者更大的时候，由于每个元素都至少需要一条缓存线，预取器已经帮不上忙了，它没有足够的时间去从内存装载数据。

Another big reason for the slowdown are the misses of the TLB cache. This is a cache where the results of the translation of a virtual address to a physical address are stored, as is explained in more detail in Section 4. The TLB cache is quite small since it has to be extremely fast. If more pages are accessed repeatedly than the TLB cache has entries for the translation from virtual to physical address has to be constantly repeated. This is a very costly operation. With larger element sizes the cost of a TLB lookup is amortized over fewer elements. That means the total number of TLB entries which have to be computed per list element is higher.

另一个导致慢下来的原因是TLB缓存的未命中。TLB是存储虚实地址映射的缓存，参见第4节。为了保持快速，TLB只有很小的容量。如果有大量页被反复访问，超出了TLB缓存容量，就会导致反复地进行地址翻译，这会耗费大量时间。TLB查找的代价分摊到所有元素上，如果元素越大，那么元素的数量越少，每个元素承担的那一份就越多。

To observe the TLB effects we can run a different test. For one measurement we lay out the elements sequentially as usual. We useNPAD=7 for elements which occupy one entire cache line. For the second measurement we place each list element on a separate page. The rest of each page is left untouched and we do not count it in the total for the working set size. {Yes, this is a bit inconsistent because in the other tests we count the unused part of the struct in the element size and we could defineNPADso that each element fills a page. In that case the working set sizes would be very different. This is not the point of this test, though, and since prefetching is ineffective anyway this makes little difference.} The consequence is that, for the first measurement, each list iteration requires a new cache line and, for every 64 elements, a new page. For the second measurement each iteration requires loading a new cache line which is on a new page.

Figure 3.12: TLB Influence for Sequential Read

为了观察TLB的性能，我们可以进行另两项测试。第一项：我们还是顺序存储列表中的元素，使NPAD=7，让每个元素占满整个cache line，第二项：我们将列表的每个元素存储在一个单独的页上，忽略每个页没有使用的部分以用来计算工作集的大小。（这样做可能不太一致，因为在前面的测试中，我计算了结构体中每个元素没有使用的部分，从而用来定义NPAD的大小，因此每个元素占满了整个页，这样以来工作集的大小将会有所不同。但是这不是这项测试的重点，预取的低效率多少使其有点不同）。结果表明，第一项测试中，每次列表的迭代都需要一个新的cache line，而且每64个元素就需要一个新的页。第二项测试中，每次迭代都会在一个新的页中加载一个新的cache line。

图 3.12: TLB 对顺序读的影响

The result can be seen in Figure 3.12. The measurements were performed on the same machine as Figure 3.11. Due to limitations of the available RAM the working set size had to be restricted to 2 24 bytes which requires 1GB to place the objects on separate pages. The lower, red curve corresponds exactly to theNPAD=7 curve in Figure 3.11. We see the distinct steps showing the sizes of the L1d and L2 caches. The second curve looks radically different. The important feature is the huge spike starting when the working set size reaches 2 13 bytes. This is when the TLB cache overflows. With an element size of 64 bytes we can compute that the TLB cache has 64 entries. There are no page faults affecting the cost since the program locks the memory to prevent it from being swapped out.

结果见图3.12。该测试与图3.11是在同一台机器上进行的。基于可用RAM空间的有限性，测试设置容量空间大小为2的24次方字节，这就需要1GB的容量将对象放置在分页上。图3.12中下方的红色曲线正好对应了图3.11中NPAD等于7的曲线。我们看到不同的步长显示了高速缓存L1d和L2的大小。第二条曲线看上去完全不同，其最重要的特点是当工作容量到达2的13次方字节时开始大幅度增长。这就是TLB缓存溢出的时候。我们能计算出一个64字节大小的元素的TLB缓存有64个输入。成本不会受页面错误影响，因为程序锁定了存储器以防止内存被换出。

As can be seen the number of cycles it takes to compute the physical address and store it in the TLB is very high. The graph in Figure 3.12 shows the extreme case, but it should now be clear that a significant factor in the slowdown for largerNPADvalues is the reduced efficiency of the TLB cache. Since the physical address has to be computed before a cache line can be read for either L2 or main memory the address translation penalties are additive to the memory access times. This in part explains why the total cost per list element forNPAD=31 is higher than the theoretical access time for the RAM.

Figure 3.13: Sequential Read and Write, NPAD=1

We can glimpse a few more details of the prefetch implementation by looking at the data of test runs where the list elements are modified. Figure 3.13 shows three lines. The element width is in all cases 16 bytes. The first line is the now familiar list walk which serves as a baseline. The second line, labeled “Inc”, simply increments thepad[0]member of the current element before going on to the next. The third line, labeled “Addnext0”, takes thepad[0]list element of the next element and adds it to thepad[0]member of the current list element.

可以看出，计算物理地址并把它存储在TLB中所花费的周期数量级是非常高的。图3.12的表格显示了一个极端的例子，但从中可以清楚的得到：TLB缓存效率降低的一个重要因素是大型NPAD值的减缓。由于物理地址必须在缓存行能被L2或主存读取之前计算出来，地址转换这个不利因素就增加了内存访问时间。这一点部分解释了为什么NPAD等于31时每个列表元素的总花费比理论上的RAM访问时间要高。

图3.13 NPAD等于1时的顺序读和写

通过查看链表元素被修改时测试数据的运行情况，我们可以窥见一些更详细的预取实现细节。图3.13显示了三条曲线。所有情况下元素宽度都为16个字节。第一条曲线“Follow”是熟悉的链表走线在这里作为基线。第二条曲线，标记为“Inc”，仅仅在当前元素进入下一个前给其增加thepad[0]成员。第三条曲线，标记为"Addnext0"，取出下一个元素的thepad[0]链表元素并把它添加为当前链表元素的thepad[0]成员。

The naïve assumption would be that the “Addnext0” test runs slower because it has more work to do. Before advancing to the next list element a value from that element has to be loaded. This is why it is surprising to see that this test actually runs, for some working set sizes, faster than the “Inc” test. The explanation for this is that the load from the next list element is basically a forced prefetch. Whenever the program advances to the next list element we know for sure that element is already in the L1d cache. As a result we see that the “Addnext0” performs as well as the simple “Follow” test as long as the working set size fits into the L2 cache.

The “Addnext0” test runs out of L2 faster than the “Inc” test, though. It needs more data loaded from main memory. This is why the “Addnext0” test reaches the 28 cycles level for a working set size of 221 bytes. The 28 cycles level is twice as high as the 14 cycles level the “Follow” test reaches. This is easy to explain, too. Since the other two tests modify memory an L2 cache eviction to make room for new cache lines cannot simply discard the data. Instead it has to be written to memory. This means the available bandwidth on the FSB is cut in half, hence doubling the time it takes to transfer the data from main memory to L2.

在没运行时，大家可能会以为"Addnext0"更慢，因为它要做的事情更多——在没进到下个元素之前就需要装载它的值。但实际的运行结果令人惊讶——在某些小工作集下，"Addnext0"比"Inc"更快。这是为什么呢？原因在于，系统一般会对下一个元素进行强制性预取。当程序前进到下个元素时，这个元素其实早已被预取在L1d里。因此，只要工作集比L2小，"Addnext0"的性能基本就能与"Follow"测试媲美。

但是，"Addnext0"比"Inc"更快离开L2，这是因为它需要从主存装载更多的数据。而在工作集达到2 21字节时，"Addnext0"的耗时达到了28个周期，是同期"Follow"14周期的两倍。这个两倍也很好解释。"Addnext0"和"Inc"涉及对内存的修改，因此L2的逐出操作不能简单地把数据一扔了事，而必须将它们写入内存。因此FSB的可用带宽变成了一半，传输等量数据的耗时也就变成了原来的两倍。

Figure 3.14: Advantage of Larger L2/L3 Caches

One last aspect of the sequential, efficient cache handling is the size of the cache. This should be obvious but it still should be pointed out. Figure 3.14 shows the timing for the Increment benchmark with 128-byte elements (NPAD=15 on 64-bit machines). This time we see the measurement from three different machines. The first two machines are P4s, the last one a Core2 processor. The first two differentiate themselves by having different cache sizes. The first processor has a 32k L1d and an 1M L2. The second one has 16k L1d, 512k L2, and 2M L3. The Core2 processor has 32k L1d and 4M L2.

图3.14: 更大L2/L3缓存的优势

决定顺序式缓存处理性能的另一个重要因素是缓存容量。虽然这一点比较明显，但还是值得一说。图3.14展示了128字节长元素的测试结果(64位机，NPAD=15)。这次我们比较三台不同计算机的曲线，两台P4，一台Core 2。两台P4的区别是缓存容量不同，一台是32k的L1d和1M的L2，一台是16K的L1d、512k的L2和2M的L3。Core 2那台则是32k的L1d和4M的L2。

The interesting part of the graph is not necessarily how well the Core2 processor performs relative to the other two (although it is impressive). The main point of interest here is the region where the working set size is too large for the respective last level cache and the main memory gets heavily involved.

Set Size	Sequential	Random
L2 Hit	L2 Miss	#Iter	Ratio Miss/Hit	L2 Accesses Per Iter	L2 Hit	L2 Miss	#Iter	Ratio Miss/Hit	L2 Accesses Per Iter
220	88,636	843	16,384	0.94%	5.5	30,462	4721	1,024	13.42%	34.4
221	88,105	1,584	8,192	1.77%	10.9	21,817	15,151	512	40.98%	72.2
222	88,106	1,600	4,096	1.78%	21.9	22,258	22,285	256	50.03%	174.0
223	88,104	1,614	2,048	1.80%	43.8	27,521	26,274	128	48.84%	420.3
224	88,114	1,655	1,024	1.84%	87.7	33,166	29,115	64	46.75%	973.1
225	88,112	1,730	512	1.93%	175.5	39,858	32,360	32	44.81%	2,256.8
226	88,112	1,906	256	2.12%	351.6	48,539	38,151	16	44.01%	5,418.1
227	88,114	2,244	128	2.48%	705.9	62,423	52,049	8	45.47%	14,309.0
228	88,120	2,939	64	3.23%	1,422.8	81,906	87,167	4	51.56%	42,268.3
229	88,137	4,318	32	4.67%	2,889.2	119,079	163,398	2	57.84%	141,238.5

Table 3.2: L2 Hits and Misses for Sequential and Random Walks, NPAD=0

As expected, the larger the last level cache is the longer the curve stays at the low level corresponding to the L2 access costs. The important part to notice is the performance advantage this provides. The second processor (which is slightly older) can perform the work on the working set of 220 bytes twice as fast as the first processor. All thanks to the increased last level cache size. The Core2 processor with its 4M L2 performs even better.

图中最有趣的地方，并不是Core 2如何大胜两台P4，而是工作集开始增长到连末级缓存也放不下、需要主存热情参与之后的部分。

Set Size	Sequential	Random
L2 Hit	L2 Miss	#Iter	Ratio Miss/Hit	L2 Accesses Per Iter	L2 Hit	L2 Miss	#Iter	Ratio Miss/Hit	L2 Accesses Per Iter
220	88,636	843	16,384	0.94%	5.5	30,462	4721	1,024	13.42%	34.4
221	88,105	1,584	8,192	1.77%	10.9	21,817	15,151	512	40.98%	72.2
222	88,106	1,600	4,096	1.78%	21.9	22,258	22,285	256	50.03%	174.0
223	88,104	1,614	2,048	1.80%	43.8	27,521	26,274	128	48.84%	420.3
224	88,114	1,655	1,024	1.84%	87.7	33,166	29,115	64	46.75%	973.1
225	88,112	1,730	512	1.93%	175.5	39,858	32,360	32	44.81%	2,256.8
226	88,112	1,906	256	2.12%	351.6	48,539	38,151	16	44.01%	5,418.1
227	88,114	2,244	128	2.48%	705.9	62,423	52,049	8	45.47%	14,309.0
228	88,120	2,939	64	3.23%	1,422.8	81,906	87,167	4	51.56%	42,268.3
229	88,137	4,318	32	4.67%	2,889.2	119,079	163,398	2	57.84%	141,238.5

表3.2: 顺序访问与随机访问时L2命中与未命中的情况，NPAD=0

与我们预计的相似，最末级缓存越大，曲线停留在L2访问耗时区的时间越长。在220字节的工作集时，第二台P4(更老一些)比第一台P4要快上一倍，这要完全归功于更大的末级缓存。而Core 2拜它巨大的4M L2所赐，表现更为卓越。

For a random workload this might not mean that much. But if the workload can be tailored to the size of the last level cache the program performance can be increased quite dramatically. This is why it sometimes is worthwhile to spend the extra money for a processor with a larger cache.

Single Threaded Random Access Measurements

We have seen that the processor is able to hide most of the main memory and even L2 access latency by prefetching cache lines into L2 and L1d. This can work well only when the memory access is predictable, though.

Figure 3.15: Sequential vs Random Read, NPAD=0

If the access is unpredictable or random the situation is quite different. Figure 3.15 compares the per-list-element times for the sequential access (same as in Figure 3.10) with the times when the list elements are randomly distributed in the working set. The order is determined by the linked list which is randomized. There is no way for the processor to reliably prefetch data. This can only work by chance if elements which are used shortly after one another are also close to each other in memory.

对于随机的工作负荷而言，可能没有这么惊人的效果，但是，如果我们能将工作负荷进行一些裁剪，让它匹配末级缓存的容量，就完全可以得到非常大的性能提升。也是由于这个原因，有时候我们需要多花一些钱，买一个拥有更大缓存的处理器。

单线程随机访问模式的测量

前面我们已经看到，处理器能够利用L1d到L2之间的预取消除访问主存、甚至是访问L2的时延。

图3.15: 顺序读取vs随机读取，NPAD=0

但是，如果换成随机访问或者不可预测的访问，情况就大不相同了。图3.15比较了顺序读取与随机读取的耗时情况。

换成随机之后，处理器无法再有效地预取数据，只有少数情况下靠运气刚好碰到先后访问的两个元素挨在一起的情形。

There are two important points to note in Figure 3.15. First, the large number is cycles needed for growing working set sizes. The machine makes it possible to access the main memory in 200-300 cycles but here we reach 450 cycles and more. We have seen this phenomenon before (compare Figure 3.11). The automatic prefetching is actually working to a disadvantage here.

The second interesting point is that the curve is not flattening at various plateaus as it has been for the sequential access cases. The curve keeps on rising. To explain this we can measure the L2 access of the program for the various working set sizes. The result can be seen in Figure 3.16 and Table 3.2.

图3.15中有两个需要关注的地方。首先，在大的工作集下需要非常多的周期。这台机器访问主存的时间大约为200-300个周期，但图中的耗时甚至超过了450个周期。我们前面已经观察到过类似现象(对比图3.11)。这说明，处理器的自动预取在这里起到了反效果。

其次，代表随机访问的曲线在各个阶段不像顺序访问那样保持平坦，而是不断攀升。为了解释这个问题，我们测量了程序在不同工作集下对L2的访问情况。结果如图3.16和表3.2。

The figure shows that, when the working set size is larger than the L2 size, the cache miss ratio (L2 misses / L2 access) starts to grow. The curve has a similar form to the one in Figure 3.15: it rises quickly, declines slightly, and starts to rise again. There is a strong correlation with the cycles per list element graph. The L2 miss rate will grow until it eventually reaches close to 100%. Given a large enough working set (and RAM) the probability that any of the randomly picked cache lines is in L2 or is in the process of being loaded can be reduced arbitrarily.

The increasing cache miss rate alone explains some of the costs. But there is another factor. Looking at Table 3.2 we can see in the L2/#Iter columns that the total number of L2 uses per iteration of the program is growing. Each working set is twice as large as the one before. So, without caching we would expect double the main memory accesses. With caches and (almost) perfect predictability we see the modest increase in the L2 use shown in the data for sequential access. The increase is due to the increase of the working set size and nothing else.

Figure 3.16: L2d Miss Ratio

Figure 3.17: Page-Wise Randomization, NPAD=7

从图中可以看出，当工作集大小超过L2时，未命中率(L2未命中次数/L2访问次数)开始上升。整条曲线的走向与图3.15有些相似: 先急速爬升，随后缓缓下滑，最后再度爬升。它与耗时图有紧密的关联。L2未命中率会一直爬升到100%为止。只要工作集足够大(并且内存也足够大)，就可以将缓存线位于L2内或处于装载过程中的可能性降到非常低。

缓存未命中率的攀升已经可以解释一部分的开销。除此以外，还有一个因素。观察表3.2的L2/#Iter列，可以看到每个循环对L2的使用次数在增长。由于工作集每次为上一次的两倍，如果没有缓存的话，内存的访问次数也将是上一次的两倍。在按顺序访问时，由于缓存的帮助及完美的预见性，对L2使用的增长比较平缓，完全取决于工作集的增长速度。

图3.16: L2d未命中率

图3.17: 页意义上(Page-Wise)的随机化，NPAD=7

For random access the per-element time increases by more than 100% for each doubling of the working set size. This means the average access time per list element increases since the working set size only doubles. The reason behind this is a rising rate of TLB misses. In Figure 3.17 we see the cost for random accesses forNPAD=7. Only this time the randomization is modified. While in the normal case the entire list of randomized as one block (indicated by the label ∞) the other 11 curves show randomizations which are performed in smaller blocks. For the curve labeled '60' each set of 60 pages (245.760 bytes) is randomized individually. That means all list elements in the block are traversed before going over to an element in the next block. This has the effect that number of TLB entries which are used at any one time is limited.

而换成随机访问后，单位耗时的增长超过了工作集的增长，根源是TLB未命中率的上升。图3.17描绘的是NPAD=7时随机访问的耗时情况。这一次，我们修改了随机访问的方式。正常情况下是把整个列表作为一个块进行随机(以∞表示)，而其它11条线则是在小一些的块里进行随机。例如，标签为'60'的线表示以60页(245760字节)为单位进行随机。先遍历完这个块里的所有元素，再访问另一个块。这样一来，可以保证任意时刻使用的TLB条目数都是有限的。

The element size forNPAD=7 is 64 bytes, which corresponds to the cache line size. Due to the randomized order of the list elements it is unlikely that the hardware prefetcher has any effect, most certainly not for more than a handful of elements. This means the L2 cache miss rate does not differ significantly from the randomization of the entire list in one block. The performance of the test with increasing block size approaches asymptotically the curve for the one-block randomization. This means the performance of this latter test case is significantly influenced by the TLB misses. If the TLB misses can be lowered the performance increases significantly (in one test we will see later up to 38%).

NPAD=7对应于64字节，正好等于缓存线的长度。由于元素顺序随机，硬件预取不可能有任何效果，特别是在元素较多的情况下。这意味着，分块随机时的L2未命中率与整个列表随机时的未命中率没有本质的差别。随着块的增大，曲线逐渐逼近整个列表随机对应的曲线。这说明，在这个测试里，性能受到TLB命中率的影响很大，如果我们能提高TLB命中率，就能大幅度地提升性能(在后面的一个例子里，性能提升了38%之多)。

3.3.3 Write Behavior

Before we start looking at the cache behavior when multiple execution contexts (threads or processes) use the same memory we have to explore a detail of cache implementations. Caches are supposed to be coherent and this coherency is supposed to be completely transparent for the userlevel code. Kernel code is a different story; it occasionally requires cache flushes.

This specifically means that, if a cache line is modified, the result for the system after this point in time is the same as if there were no cache at all and the main memory location itself had been modified. This can be implemented in two ways or policies:

write-through cache implementation;
write-back cache implementation.

3.3.3 写入时的行为

在我们开始研究多个线程或进程同时使用相同内存之前，先来看一下缓存实现的一些细节。我们要求缓存是一致的，而且这种一致性必须对用户级代码完全透明。而内核代码则有所不同，它有时候需要对缓存进行转储(flush)。

这意味着，如果对缓存线进行了修改，那么在这个时间点之后，系统的结果应该是与没有缓存的情况下是相同的，即主存的对应位置也已经被修改的状态。这种要求可以通过两种方式或策略实现：

写通(write-through)
写回(write-back)

The write-through cache is the simplest way to implement cache coherency. If the cache line is written to, the processor immediately also writes the cache line into main memory. This ensures that, at all times, the main memory and cache are in sync. The cache content could simply be discarded whenever a cache line is replaced. This cache policy is simple but not very fast. A program which, for instance, modifies a local variable over and over again would create a lot of traffic on the FSB even though the data is likely not used anywhere else and might be short-lived.

The write-back policy is more sophisticated. Here the processor does not immediately write the modified cache line back to main memory. Instead, the cache line is only marked as dirty. When the cache line is dropped from the cache at some point in the future the dirty bit will instruct the processor to write the data back at that time instead of just discarding the content.

写通比较简单。当修改缓存线时，处理器立即将它写入主存。这样可以保证主存与缓存的内容永远保持一致。当缓存线被替代时，只需要简单地将它丢弃即可。这种策略很简单，但是速度比较慢。如果某个程序反复修改一个本地变量，可能导致FSB上产生大量数据流，而不管这个变量是不是有人在用，或者是不是短期变量。

写回比较复杂。当修改缓存线时，处理器不再马上将它写入主存，而是打上已弄脏(dirty)的标记。当以后某个时间点缓存线被丢弃时，这个已弄脏标记会通知处理器把数据写回到主存中，而不是简单地扔掉。

Write-back caches have the chance to be significantly better performing, which is why most memory in a system with a decent processor is cached this way. The processor can even take advantage of free capacity on the FSB to store the content of a cache line before the line has to be evacuated. This allows the dirty bit to be cleared and the processor can just drop the cache line when the room in the cache is needed.

But there is a significant problem with the write-back implementation. When more than one processor (or core or hyper-thread) is available and accessing the same memory it must still be assured that both processors see the same memory content at all times. If a cache line is dirty on one processor (i.e., it has not been written back yet) and a second processor tries to read the same memory location, the read operation cannot just go out to the main memory. Instead the content of the first processor's cache line is needed. In the next section we will see how this is currently implemented.

写回有时候会有非常不错的性能，因此较好的系统大多采用这种方式。采用写回时，处理器们甚至可以利用FSB的空闲容量来存储缓存线。这样一来，当需要缓存空间时，处理器只需清除脏标记，丢弃缓存线即可。

但写回也有一个很大的问题。当有多个处理器(或核心、超线程)访问同一块内存时，必须确保它们在任何时候看到的都是相同的内容。如果缓存线在其中一个处理器上弄脏了(修改了，但还没写回主存)，而第二个处理器刚好要读取同一个内存地址，那么这个读操作不能去读主存，而需要读第一个处理器的缓存线。在下一节中，我们将研究如何实现这种需求。

Before we get to this there are two more cache policies to mention:

write-combining; and
uncacheable.

Both these policies are used for special regions of the address space which are not backed by real RAM. The kernel sets up these policies for the address ranges (on x86 processors using the Memory Type Range Registers, MTRRs) and the rest happens automatically. The MTRRs are also usable to select between write-through and write-back policies.

Write-combining is a limited caching optimization more often used for RAM on devices such as graphics cards. Since the transfer costs to the devices are much higher than the local RAM access it is even more important to avoid doing too many transfers. Transferring an entire cache line just because a word in the line has been written is wasteful if the next operation modifies the next word. One can easily imagine that this is a common occurrence, the memory for horizontal neighboring pixels on a screen are in most cases neighbors, too. As the name suggests, write-combining combines multiple write accesses before the cache line is written out. In ideal cases the entire cache line is modified word by word and, only after the last word is written, the cache line is written to the device. This can speed up access to RAM on devices significantly.

在此之前，还有其它两种缓存策略需要提一下:

写入合并
不可缓存

这两种策略用于真实内存不支持的特殊地址区，内核为地址区设置这些策略(x86处理器利用内存类型范围寄存器MTRR)，余下的部分自动进行。MTRR还可用于写通和写回策略的选择。

写入合并是一种有限的缓存优化策略，更多地用于显卡等设备之上的内存。由于设备的传输开销比本地内存要高的多，因此避免进行过多的传输显得尤为重要。如果仅仅因为修改了缓存线上的一个字，就传输整条线，而下个操作刚好是修改线上的下一个字，那么这次传输就过于浪费了。而这恰恰对于显卡来说是比较常见的情形——屏幕上水平邻接的像素往往在内存中也是靠在一起的。顾名思义，写入合并是在写出缓存线前，先将多个写入访问合并起来。在理想的情况下，缓存线被逐字逐字地修改，只有当写入最后一个字时，才将整条线写入内存，从而极大地加速内存的访问。

Finally there is uncacheable memory. This usually means the memory location is not backed by RAM at all. It might be a special address which is hardcoded to have some functionality outside the CPU. For commodity hardware this most often is the case for memory mapped address ranges which translate to accesses to cards and devices attached to a bus (PCIe etc). On embedded boards one sometimes finds such a memory address which can be used to turn an LED on and off. Caching such an address would obviously be a bad idea. LEDs in this context are used for debugging or status reports and one wants to see this as soon as possible. The memory on PCIe cards can change without the CPU's interaction, so this memory should not be cached.

最后来讲一下不可缓存的内存。一般指的是不被RAM支持的内存位置，它可以是硬编码的特殊地址，承担CPU以外的某些功能。对于商用硬件来说，比较常见的是映射到外部卡或设备的地址。在嵌入式主板上，有时也有类似的地址，用来开关LED。对这些地址进行缓存显然没有什么意义。比如上述的LED，一般是用来调试或报告状态，显然应该尽快点亮或关闭。而对于那些PCI卡上的内存，由于不需要CPU的干涉即可更改，也不该缓存。

3.3.4 Multi-Processor Support

In the previous section we have already pointed out the problem we have when multiple processors come into play. Even multi-core processors have the problem for those cache levels which are not shared (at least the L1d).

It is completely impractical to provide direct access from one processor to the cache of another processor. The connection is simply not fast enough, for a start. The practical alternative is to transfer the cache content over to the other processor in case it is needed. Note that this also applies to caches which are not shared on the same processor.

3.3.4 多处理器支持

在上节中我们已经指出当多处理器开始发挥作用的时候所遇到的问题。甚至对于那些不共享的高速级别的缓存（至少在L1d级别）的多核处理器也有问题。

直接提供从一个处理器到另一处理器的高速访问，这是完全不切实际的。从一开始，连接速度根本就不够快。实际的选择是，在其需要的情况下，转移到其他处理器。需要注意的是，这同样应用在相同处理器上无需共享的高速缓存。

The question now is when does this cache line transfer have to happen? This question is pretty easy to answer: when one processor needs a cache line which is dirty in another processor's cache for reading or writing. But how can a processor determine whether a cache line is dirty in another processor's cache? Assuming it is just because a cache line is loaded by another processor would be suboptimal (at best). Usually the majority of memory accesses are read accesses and the resulting cache lines are not dirty. Processor operations on cache lines are frequent (of course, why else would we have this paper?) which means broadcasting information about changed cache lines after each write access would be impractical.

现在的问题是，当该高速缓存线转移的时候会发生什么？这个问题回答起来相当容易：当一个处理器需要在另一个处理器的高速缓存中读或者写的脏的高速缓存线的时候。但怎样处理器怎样确定在另一个处理器的缓存中的高速缓存线是脏的？假设它仅仅是因为一个高速缓存线被另一个处理器加载将是次优的（最好的）。通常情况下，大多数的内存访问是只读的访问和产生高速缓存线，并不脏。在高速缓存线上处理器频繁的操作（当然，否则为什么我们有这样的文件呢？），也就意味着每一次写访问后，都要广播关于高速缓存线的改变将变得不切实际。

What developed over the years is the MESI cache coherency protocol (Modified, Exclusive, Shared, Invalid). The protocol is named after the four states a cache line can be in when using the MESI protocol:

Modified: The local processor has modified the cache line. This also implies it is the only copy in any cache.
Exclusive: The cache line is not modified but known to not be loaded into any other processor's cache.
Shared: The cache line is not modified and might exist in another processor's cache.
Invalid: The cache line is invalid, i.e., unused.

This protocol developed over the years from simpler versions which were less complicated but also less efficient. With these four states it is possible to efficiently implement write-back caches while also supporting concurrent use of read-only data on different processors.

多年来，人们开发除了MESI缓存一致性协议(MESI=Modified, Exclusive, Shared, Invalid，变更的、独占的、共享的、无效的)。协议的名称来自协议中缓存线可以进入的四种状态:

变更的: 本地处理器修改了缓存线。同时暗示，它是所有缓存中唯一的拷贝。
独占的: 缓存线没有被修改，而且没有被装入其它处理器缓存。
共享的: 缓存线没有被修改，但可能已被装入其它处理器缓存。
无效的: 缓存线无效，即，未被使用。

MESI协议开发了很多年，最初的版本比较简单，但是效率也比较差。现在的版本通过以上4个状态可以有效地实现写回式缓存，同时支持不同处理器对只读数据的并发访问。

Figure 3.18: MESI Protocol Transitions

The state changes are accomplished without too much effort by the processors listening, or snooping, on the other processors' work. Certain operations a processor performs are announced on external pins and thus make the processor's cache handling visible to the outside. The address of the cache line in question is visible on the address bus. In the following description of the states and their transitions (shown in Figure 3.18) we will point out when the bus is involved.

Initially all cache lines are empty and hence also Invalid. If data is loaded into the cache for writing the cache changes to Modified. If the data is loaded for reading the new state depends on whether another processor has the cache line loaded as well. If this is the case then the new state is Shared, otherwise Exclusive.

图3.18: MESI协议的状态跃迁图

在协议中，通过处理器监听其它处理器的活动，不需太多努力即可实现状态变更。处理器将操作发布在外部引脚上，使外部可以了解到处理过程。目标的缓存线地址则可以在地址总线上看到。在下文讲述状态时，我们将介绍总线参与的时机。

一开始，所有缓存线都是空的，缓存为无效(Invalid)状态。当有数据装进缓存供写入时，缓存变为变更(Modified)状态。如果有数据装进缓存供读取，那么新状态取决于其它处理器是否已经状态了同一条缓存线。如果是，那么新状态变成共享(Shared)状态，否则变成独占(Exclusive)状态。

If a Modified cache line is read from or written to on the local processor, the instruction can use the current cache content and the state does not change. If a second processor wants to read from the cache line the first processor has to send the content of its cache to the second processor and then it can change the state to Shared. The data sent to the second processor is also received and processed by the memory controller which stores the content in memory. If this did not happen the cache line could not be marked as Shared. If the second processor wants to write to the cache line the first processor sends the cache line content and marks the cache line locally as Invalid. This is the infamous “Request For Ownership” (RFO) operation. Performing this operation in the last level cache, just like the I→M transition is comparatively expensive. For write-through caches we also have to add the time it takes to write the new cache line content to the next higher-level cache or the main memory, further increasing the cost.

如果本地处理器对某条Modified缓存线进行读写，那么直接使用缓存内容，状态保持不变。如果另一个处理器希望读它，那么第一个处理器将内容发给第一个处理器，然后可以将缓存状态置为Shared。而发给第二个处理器的数据由内存控制器接收，并放入内存中。如果这一步没有发生，就不能将这条线置为Shared。如果第二个处理器希望的是写，那么第一个处理器将内容发给它后，将缓存置为Invalid。这就是臭名昭著的"请求所有权(Request For Ownership,RFO)"操作。在末级缓存执行RFO操作的代价比较高。如果是写通式缓存，还要加上将内容写入上一层缓存或主存的时间，进一步提升了代价。

If a cache line is in the Shared state and the local processor reads from it no state change is necessary and the read request can be fulfilled from the cache. If the cache line is locally written to the cache line can be used as well but the state changes to Modified. It also requires that all other possible copies of the cache line in other processors are marked as Invalid. Therefore the write operation has to be announced to the other processors via an RFO message. If the cache line is requested for reading by a second processor nothing has to happen. The main memory contains the current data and the local state is already Shared. In case a second processor wants to write to the cache line (RFO) the cache line is simply marked Invalid. No bus operation is needed.

对于Shared缓存线，本地处理器的读取操作并不需要修改状态，而且可以直接从缓存满足。而本地处理器的写入操作则需要将状态置为Modified，而且需要将缓存线在其它处理器的所有拷贝置为Invalid。因此，这个写入操作需要通过RFO消息发通知其它处理器。如果第二个处理器请求读取，无事发生。因为主存已经包含了当前数据，而且状态已经为Shared。如果第二个处理器需要写入，则将缓存线置为Invalid。不需要总线操作。

The Exclusive state is mostly identical to the Shared state with one crucial difference: a local write operation does not have to be announced on the bus. The local cache copy is known to be the only one. This can be a huge advantage so the processor will try to keep as many cache lines as possible in the Exclusive state instead of the Shared state. The latter is the fallback in case the information is not available at that moment. The Exclusive state can also be left out completely without causing functional problems. It is only the performance that will suffer since the E→M transition is much faster than the S→M transition.

From this description of the state transitions it should be clear where the costs specific to multi-processor operations are. Yes, filling caches is still expensive but now we also have to look out for RFO messages. Whenever such a message has to be sent things are going to be slow.

Exclusive状态与Shared状态很像，只有一个不同之处: 在Exclusive状态时，本地写入操作不需要在总线上声明，因为本地的缓存是系统中唯一的拷贝。这是一个巨大的优势，所以处理器会尽量将缓存线保留在Exclusive状态，而不是Shared状态。只有在信息不可用时，才退而求其次选择shared。放弃Exclusive不会引起任何功能缺失，但会导致性能下降，因为E→M要远远快于S→M。

从以上的说明中应该已经可以看出，在多处理器环境下，哪一步的代价比较大了。填充缓存的代价当然还是很高，但我们还需要留意RFO消息。一旦涉及RFO，操作就快不起来了。

There are two situations when RFO messages are necessary:

A thread is migrated from one processor to another and all the cache lines have to be moved over to the new processor once.
A cache line is truly needed in two different processors. {At a smaller level the same is true for two cores on the same processor. The costs are just a bit smaller. The RFO message is likely to be sent many times.}

In multi-thread or multi-process programs there is always some need for synchronization; this synchronization is implemented using memory. So there are some valid RFO messages. They still have to be kept as infrequent as possible. There are other sources of RFO messages, though. In Section 6 we will explain these scenarios. The Cache coherency protocol messages must be distributed among the processors of the system. A MESI transition cannot happen until it is clear that all the processors in the system have had a chance to reply to the message. That means that the longest possible time a reply can take determines the speed of the coherency protocol. {Which is why we see nowadays, for instance, AMD Opteron systems with three sockets. Each processor is exactly one hop away given that the processors only have three hyperlinks and one is needed for the Southbridge connection.} Collisions on the bus are possible, latency can be high in NUMA systems, and of course sheer traffic volume can slow things down. All good reasons to focus on avoiding unnecessary traffic.

RFO在两种情况下是必需的:

线程从一个处理器迁移到另一个处理器，需要将所有缓存线移到新处理器。
某条缓存线确实需要被两个处理器使用。{对于同一处理器的两个核心，也有同样的情况，只是代价稍低。RFO消息可能会被发送多次。}

多线程或多进程的程序总是需要同步，而这种同步依赖内存来实现。因此，有些RFO消息是合理的，但仍然需要尽量降低发送频率。除此以外，还有其它来源的RFO。在第6节中，我们将解释这些场景。缓存一致性协议的消息必须发给系统中所有处理器。只有当协议确定已经给过所有处理器响应机会之后，才能进行状态跃迁。也就是说，协议的速度取决于最长响应时间。{这也是现在能看到三插槽AMD Opteron系统的原因。这类系统只有三个超级链路(hyperlink)，其中一个连接南桥，每个处理器之间都只有一跳的距离。}总线上可能会发生冲突，NUMA系统的延时很大，突发的流量会拖慢通信。这些都是让我们避免无谓流量的充足理由。

There is one more problem related to having more than one processor in play. The effects are highly machine specific but in principle the problem always exists: the FSB is a shared resource. In most machines all processors are connected via one single bus to the memory controller (see Figure 2.1). If a single processor can saturate the bus (as is usually the case) then two or four processors sharing the same bus will restrict the bandwidth available to each processor even more.

Even if each processor has its own bus to the memory controller as in Figure 2.2 there is still the bus to the memory modules. Usually this is one bus but, even in the extended model in Figure 2.2, concurrent accesses to the same memory module will limit the bandwidth.

The same is true with the AMD model where each processor can have local memory. Yes, all processors can concurrently access their local memory quickly. But multi-thread and multi-process programs--at least from time to time--have to access the same memory regions to synchronize.

此外，关于多处理器还有一个问题。虽然它的影响与具体机器密切相关，但根源是唯一的——FSB是共享的。在大多数情况下，所有处理器通过唯一的总线连接到内存控制器(参见图2.1)。如果一个处理器就能占满总线(十分常见)，那么共享总线的两个或四个处理器显然只会得到更有限的带宽。

即使每个处理器有自己连接内存控制器的总线，如图2.2，但还需要通往内存模块的总线。一般情况下，这种总线只有一条。退一步说，即使像图2.2那样不止一条，对同一个内存模块的并发访问也会限制它的带宽。

对于每个处理器拥有本地内存的AMD模型来说，也是同样的问题。的确，所有处理器可以非常快速地同时访问它们自己的内存。但是，多线程呢？多进程呢？它们仍然需要通过访问同一块内存来进行同步。

Concurrency is severely limited by the finite bandwidth available for the implementation of the necessary synchronization. Programs need to be carefully designed to minimize accesses from different processors and cores to the same memory locations. The following measurements will show this and the other cache effects related to multi-threaded code.

Multi Threaded Measurements

To ensure that the gravity of the problems introduced by concurrently using the same cache lines on different processors is understood, we will look here at some more performance graphs for the same program we used before. This time, though, more than one thread is running at the same time. What is measured is the fastest runtime of any of the threads. This means the time for a complete run when all threads are done is even higher. The machine used has four processors; the tests use up to four threads. All processors share one bus to the memory controller and there is only one bus to the memory modules.

对同步来说，有限的带宽严重地制约着并发度。程序需要更加谨慎的设计，将不同处理器访问同一块内存的机会降到最低。以下的测试展示了这一点，还展示了与多线程代码相关的其它效果。

多线程测量

为了帮助大家理解问题的严重性，我们来看一些曲线图，主角也是前文的那个程序。只不过这一次，我们运行多个线程，并测量这些线程中最快那个的运行时间。也就是说，等它们全部运行完是需要更长时间的。我们用的机器有4个处理器，而测试是做多跑4个线程。所有处理器共享同一条通往内存控制器的总线，另外，通往内存模块的总线也只有一条。

Figure 3.19: Sequential Read Access, Multiple Threads

Figure 3.19 shows the performance for sequential read-only access for 128 bytes entries (NPAD=15 on 64-bit machines). For the curve for one thread we can expect a curve similar to Figure 3.11. The measurements are for a different machine so the actual numbers vary.

The important part in this figure is of course the behavior when running multiple threads. Note that no memory is modified and no attempts are made to keep the threads in sync when walking the linked list. Even though no RFO messages are necessary and all the cache lines can be shared, we see up to an 18% performance decrease for the fastest thread when two threads are used and up to 34% when four threads are used. Since no cache lines have to be transported between the processors this slowdown is solely caused by one or both of the two bottlenecks: the shared bus from the processor to the memory controller and bus from the memory controller to the memory modules. Once the working set size is larger than the L3 cache in this machine all three threads will be prefetching new list elements. Even with two threads the available bandwidth is not sufficient to scale linearly (i.e., have no penalty from running multiple threads).

图3.19: 顺序读操作，多线程

图3.19展示了顺序读访问时的性能，元素为128字节长(64位计算机，NPAD=15)。对于单线程的曲线，我们预计是与图3.11相似，只不过是换了一台机器，所以实际的数字会有些小差别。

更重要的部分当然是多线程的环节。由于是只读，不会去修改内存，不会尝试同步。但即使不需要RFO，而且所有缓存线都可共享，性能仍然分别下降了18%(双线程)和34%(四线程)。由于不需要在处理器之间传输缓存，因此这里的性能下降完全由以下两个瓶颈之一或同时引起: 一是从处理器到内存控制器的共享总线，二是从内存控制器到内存模块的共享总线。当工作集超过L3后，三种情况下都要预取新元素，而即使是双线程，可用的带宽也无法满足线性扩展(无惩罚)。

When we modify memory things get even uglier. Figure 3.20 shows the results for the sequential Increment test.

Figure 3.20: Sequential Increment, Multiple Threads

This graph is using a logarithmic scale for the Y axis. So, do not be fooled by the apparently small differences. We still have about a 18% penalty for running two threads and now an amazing 93% penalty for running four threads. This means the prefetch traffic together with the write-back traffic is pretty much saturating the bus when four threads are used.

We use the logarithmic scale to show the results for the L1d range. What can be seen is that, as soon as more than one thread is running, the L1d is basically ineffective. The single-thread access times exceed 20 cycles only when the L1d is not sufficient to hold the working set. When multiple threads are running, those access times are hit immediately, even with the smallest working set sizes.

当加入修改之后，场面更加难看了。图3.20展示了顺序递增测试的结果。

图3.20: 顺序递增，多线程

图中Y轴采用的是对数刻度，不要被看起来很小的差值欺骗了。现在，双线程的性能惩罚仍然是18%，但四线程的惩罚飙升到了93%！原因在于，采用四线程时，预取的流量与写回的流量加在一起，占满了整个总线。

我们用对数刻度来展示L1d范围的结果。可以发现，当超过一个线程后，L1d就无力了。单线程时，仅当工作集超过L1d时访问时间才会超过20个周期，而多线程时，即使在很小的工作集情况下，访问时间也达到了那个水平。

One aspect of the problem is not shown here. It is hard to measure with this specific test program. Even though the test modifies memory and we therefore must expect RFO messages we do not see higher costs for the L2 range when more than one thread is used. The program would have to use a large amount of memory and all threads must access the same memory in parallel. This is hard to achieve without a lot of synchronization which would then dominate the execution time.

Figure 3.21: Random Addnextlast, Multiple Threads

Finally in Figure 3.21 we have the numbers for the Addnextlast test with random access of memory. This figure is provided mainly to show to the appallingly high numbers. It now take around 1,500 cycles to process a single list element in the extreme case. The use of more threads is even more questionable. We can summarize the efficiency of multiple thread use in a table.

#Threads Seq Read Seq Inc Rand Add

2 1.69 1.69 1.54

4 2.98 2.07 1.65

Table 3.3: Efficiency for Multiple Threads

#Threads	Seq Read	Seq Inc	Rand Add
2	1.69	1.69	1.54
4	2.98	2.07	1.65

The table shows the efficiency for the multi-thread run with the largest working set size in the three figures on Figure 3.21. The number shows the best possible speed-up the test program incurs for the largest working set size by using two or four threads. For two threads the theoretical limits for the speed-up are 2 and, for four threads, 4. The numbers for two threads are not that bad. But for four threads the numbers for the last test show that it is almost not worth it to scale beyond two threads. The additional benefit is minuscule. We can see this more easily if we represent the data in Figure 3.21 a bit differently.

这里并没有揭示问题的另一方面，主要是用这个程序很难进行测量。问题是这样的，我们的测试程序修改了内存，所以本应看到RFO的影响，但在结果中，我们并没有在L2阶段看到更大的开销。原因在于，要看到RFO的影响，程序必须使用大量内存，而且所有线程必须同时访问同一块内存。如果没有大量的同步，这是很难实现的，而如果加入同步，则会占满执行时间。

图3.21: 随机的Addnextlast，多线程

最后，在图3.21中，我们展示了随机访问的Addnextlast测试的结果。这里主要是为了让大家感受一下这些巨大到爆的数字。极端情况下，甚至用了1500个周期才处理完一个元素。如果加入更多线程，真是不可想象哪。我们把多线程的效能总结了一下:

#Threads Seq Read Seq Inc Rand Add

2 1.69 1.69 1.54

4 2.98 2.07 1.65

表3.3: 多线程的效能

#Threads	Seq Read	Seq Inc	Rand Add
2	1.69	1.69	1.54
4	2.98	2.07	1.65

这个表展示了图3.21中多线程运行大工作集时的效能。表中的数字表示测试程序在使用多线程处理大工作集时可能达到的最大加速因子。双线程和四线程的理论最大加速因子分别是2和4。从表中数据来看，双线程的结果还能接受，但四线程的结果表明，扩展到双线程以上是没有什么意义的，带来的收益可以忽略不计。只要我们把图3.21换个方式呈现，就可以很容易看清这一点。

Figure 3.22: Speed-Up Through Parallelism

The curves in Figure 3.22 show the speed-up factors, i.e., relative performance compared to the code executed by a single thread. We have to ignore the smallest sizes, the measurements are not accurate enough. For the range of the L2 and L3 cache we can see that we indeed achieve almost linear acceleration. We almost reach factors of 2 and 4 respectively. But as soon as the L3 cache is not sufficient to hold the working set the numbers crash. They crash to the point that the speed-up of two and four threads is identical (see the fourth column in Table 3.3). This is one of the reasons why one can hardly find motherboard with sockets for more than four CPUs all using the same memory controller. Machines with more processors have to be built differently (see Section 5).

图3.22: 通过并行化实现的加速因子

图3.22中的曲线展示了加速因子，即多线程相对于单线程所能获取的性能加成值。测量值的精确度有限，因此我们需要忽略比较小的那些数字。可以看到，在L2与L3范围内，多线程基本可以做到线性加速，双线程和四线程分别达到了2和4的加速因子。但是，一旦工作集的大小超出L3，曲线就崩塌了，双线程和四线程降到了基本相同的数值(参见表3.3中第4列)。也是部分由于这个原因，我们很少看到4CPU以上的主板共享同一个内存控制器。如果需要配置更多处理器，我们只能选择其它的实现方式(参见第5节)。

These numbers are not universal. In some cases even working sets which fit into the last level cache will not allow linear speed-ups. In fact, this is the norm since threads are usually not as decoupled as is the case in this test program. On the other hand it is possible to work with large working sets and still take advantage of more than two threads. Doing this requires thought, though. We will talk about some approaches in Section 6.

Special Case: Hyper-Threads

Hyper-Threads (sometimes called Symmetric Multi-Threading, SMT) are implemented by the CPU and are a special case since the individual threads cannot really run concurrently. They all share almost all the processing resources except for the register set. Individual cores and CPUs still work in parallel but the threads implemented on each core are limited by this restriction. In theory there can be many threads per core but, so far, Intel's CPUs at most have two threads per core. The CPU is responsible for time-multiplexing the threads. This alone would not make much sense, though. The real advantage is that the CPU can schedule another hyper-thread when the currently running hyper-thread is delayed. In most cases this is a delay caused by memory accesses.

可惜，上图中的数据并不是普遍情况。在某些情况下，即使工作集能够放入末级缓存，也无法实现线性加速。实际上，这反而是正常的，因为普通的线程都有一定的耦合关系，不会像我们的测试程序这样完全独立。而反过来说，即使是很大的工作集，即使是两个以上的线程，也是可以通过并行化受益的，但是需要程序员的聪明才智。我们会在第6节进行一些介绍。

特例: 超线程

由CPU实现的超线程(有时又叫对称多线程，SMT)是一种比较特殊的情况，每个线程并不能真正并发地运行。它们共享着除寄存器外的绝大多数处理资源。每个核心和CPU仍然是并行工作的，但核心上的线程则受到这个限制。理论上，每个核心可以有大量线程，不过到目前为止，Intel的CPU最多只有两个线程。CPU负责对各线程进行时分复用，但这种复用本身并没有多少厉害。它真正的优势在于，CPU可以在当前运行的超线程发生延迟时，调度另一个线程。这种延迟一般由内存访问引起。

posted @ 2021-10-31 18:51 CharyGao 阅读(54) 评论(0) 编辑收藏举报

刷新页面返回顶部

N	=	Number of instructions.
Fmem	=	Fraction of N that access memory.
Ghit	=	Fraction of loads that hit the cache.
Tproc	=	Number of cycles per instruction.
Tcache	=	Number of cycles for cache hit.
Tmiss	=	Number of cycles for cache miss.
Texe	=	Execution time for program.

N	=	指令数
Fmem	=	N中访问内存的比例
Ghit	=	命中缓存的比例
Tproc	=	每条指令所用的周期数
Tcache	=	缓存命中所用的周期数
Tmiss	=	缓冲未命中所用的周期数
Texe	=	程序的执行时间

硅基文明

代码改变不了世界，但是改变世界需要代码。

浅析 x86 架构中 cache 的组织结构-每个程序员都应该了解的 CPU 高速缓存

3.1 CPU Caches in the Big Picture

3.1 高速缓存的位置

3.2 Cache Operation at High Level

3.2 高级的缓存操作

3.3 CPU Cache Implementation Details

3.3 CPU缓存实现的细节

3.3.1 关联性

Single Threaded Sequential Access

单线程顺序访问

Single Threaded Random Access Measurements

3.3.3 写入时的行为

Multi Threaded Measurements

Special Case: Hyper-Threads

3.3.5 其它细节

3.4 Instruction Cache

3.4 指令缓存

3.4.1 自修改的代码

3.5 Cache Miss Factors

3.5 缓存未命中的因素

3.5.1 缓存与内存带宽

3.5.4 FSB的影响