http://www.infoq.com/cn/articles/cache-coherency-primer

http://www.cnblogs.com/xybaby/p/6641928.html

 english:

http://www.tuicool.com/articles/BVRNZbV

=========================================

yxr注:

 1) 由于曾研究IBM的CPU加速(CAPI),其提到内存一致性,为了弄清楚其和通过加速卡进行加速的区别,所以与必要弄清楚缓存一致性的概念和好处,才能理解

IBM所说的体系的优势

 2) 本文内容浅显易懂,精简而且循序论述,中文翻译到位,非常适合primer学习。为便于学习将中文和英文同时转载

 3) summary:

     文中先介绍cache的来源,cache line(cache line应该是指一段内存内容,并不一定在cache中)的含义和cache访问所遵循的定理。指出一致性问题(一致性协议(Coherency protocols))来源于多核所带的多cache在write-back模式访问memory造成。在解决一致性问题中提到了最普遍的方法,snoop协议,其物理基础是

所有cache均能够同时检测memory interface的access。为达到cache conherency,引入了MESI状态(梅西,四个字母代表四种状态),各个core之间需要相互通信,完成各个状态的装换,保证cache的一致性

 ===========================================

本文是RAD Game Tools程序员Fabian “ryg” Giesen在其博客上发表的《Cache coherency primer》一文的翻译,经作者许可分享至InfoQ中文站。该系列共有两篇,本文系第一篇。

我计划写一些关于多核场景下数据组织的文章。写了第一篇,但我很快意识到有大量的基础知识我首先需要讲一下。在本文中,我就尝试阐述这些知识。

缓存(Cache)

本文是关于CPU缓存的快速入门。我假设你已经有了基本概念,但你可能不熟悉其中的一些细节。(如果你已经熟悉了,你可以忽略这部分。)

 在现代的CPU(大多数)上,所有的内存访问都需要通过层层的缓存来进行。也有些例外,比如,对映射成内存地址的I/O口、写合并(Write-combined)内存,这些访问至少会绕开这个流程的一部分。但这两者都是罕见的场景(意味着绝大多数的用户态代码都不会遇到这两种情况),所以在本文中,我将忽略这两者。

CPU的读/写(以及取指令)单元正常情况下甚至都不能直接访问内存——这是物理结构决定的;CPU都没有管脚直接连到内存。相反,CPU和一级缓存(L1 Cache)通讯,而一级缓存才能和内存通讯。大约二十年前,一级缓存可以直接和内存传输数据。如今,更多级别的缓存加入到设计中,一级缓存已经不能直接和内存通讯了,它和二级缓存通讯——而二级缓存才能和内存通讯。或者还可能有三级缓存。你明白这个意思就行。

缓存是分“段”(line)的,一个段对应一块存储空间,大小是32(较早的ARM、90年代/2000年代早期的x86和PowerPC)、64(较新的ARM和x86)或128(较新的Power ISA机器)字节。每个缓存段知道自己对应什么范围的物理内存地址,并且在本文中,我不打算区分物理上的缓存段和它所代表的内存,这听起来有点草率,但是为了方便起见,还是请熟悉这种提法。具体地说,当我提到“缓存段”的时候,我就是指一段和缓存大小对齐的内存,不关心里面的内容是否真正被缓存进去(就是说保存在任何级别的缓存中)了。

 

当CPU看到一条读内存的指令时,它会把内存地址传递给一级数据缓存(或可戏称为L1D$,因为英语中“缓存(cache)”和“现金(cash)”的发音相同)。一级数据缓存会检查它是否有这个内存地址对应的缓存段。如果没有,它会把整个缓存段从内存(或者从更高一级的缓存,如果有的话)中加载进来。是的,一次加载整个缓存段,这是基于这样一个假设:内存访问倾向于本地化(localized),如果我们当前需要某个地址的数据,那么很可能我们马上要访问它的邻近地址。一旦缓存段被加载到缓存中,读指令就可以正常进行读取。

如果我们只处理读操作,那么事情会很简单,因为所有级别的缓存都遵守以下规律,我称之为:

基本定律:在任意时刻,任意级别缓存中的缓存段的内容,等同于它对应的内存中的内容。

一旦我们允许写操作,事情就变得复杂一点了。这里有两种基本的写模式:直写(write-through)和回写(write-back)。直写更简单一点:我们透过本级缓存,直接把数据写到下一级缓存(或直接到内存)中,如果对应的段被缓存了,我们同时更新缓存中的内容(甚至直接丢弃),就这么简单。这也遵守前面的定律:缓存中的段永远和它对应的内存内容匹配。

回写模式就有点复杂了。缓存不会立即把写操作传递到下一级,而是仅修改本级缓存中的数据,并且把对应的缓存段标记为“脏”段。脏段会触发回写,也就是把里面的内容写到对应的内存或下一级缓存中。回写后,脏段又变“干净”了。当一个脏段被丢弃的时候,总是先要进行一次回写。回写所遵循的规律有点不同。

回写定律:当所有的脏段被回写后,任意级别缓存中的缓存段的内容,等同于它对应的内存中的内容。

换句话说,回写模式的定律中,我们去掉了“在任意时刻”这个修饰语,代之以弱化一点的条件:要么缓存段的内容和内存一致(如果缓存段是干净的话),要么缓存段中的内容最终要回写到内存中(对于脏缓存段来说)。

直接模式更简单,但是回写模式有它的优势:它能过滤掉对同一地址的反复写操作,并且,如果大多数缓存段都在回写模式下工作,那么系统经常可以一下子写一大片内存,而不是分成小块来写,前者的效率更高。

有些(大多数是比较老的)CPU只使用直写模式,有些只使用回写模式,还有一些,一级缓存使用直写而二级缓存使用回写。这样做虽然在一级和二级缓存之间产生了不必要的数据流量,但二级缓存和更低级缓存或内存之间依然保留了回写的优势。我想说的是,这里涉及到一系列的取舍问题,且不同的设计有不同的解决方案。没有人规定各级缓存的大小必须一致。举个例子,我们会看到有CPU的一级缓存是32字节,而二级缓存却有128字节。

为了简化问题,我省略了一些内容:缓存关联性(cache associativity),缓存组(cache sets),使用分配写(write-allocate)还是非分配写(上面我描述的直写是和分配写相结合的,而回写是和非分配写相结合的),非对齐的访问(unaligned access),基于虚拟地址的缓存。如果你感兴趣,所有这些内容都可以去查查资料,但我不准备在这里讲了。

一致性协议(Coherency protocols)

只要系统只有一个CPU核在工作,一切都没问题。如果有多个核,每个核又都有自己的缓存,那么我们就遇到问题了:如果某个CPU缓存段中对应的内存内容被另外一个CPU偷偷改了,会发生什么?

好吧,答案很简单:什么也不会发生。这很糟糕。因为如果一个CPU缓存了某块内存,那么在其他CPU修改这块内存的时候,我们希望得到通知。我们拥有多组缓存的时候,真的需要它们保持同步。或者说,系统的内存在各个CPU之间无法做到与生俱来的同步,我们实际上是需要一个大家都能遵守的方法来达到同步的目的。

注意,这个问题的根源是我们拥有多组缓存,而不是多个CPU核。我们也可以这样解决问题,让多个CPU核共用一组缓存:也就是说只有一块一级缓存,所有处理器都必须共用它。在每一个指令周期,只有一个幸运的CPU能通过一级缓存做内存操作,运行它的指令。

这本身没问题。唯一的问题就是太慢了,因为这下处理器的时间都花在排队等待使用一级缓存了(并且处理器会做大量的这种操作,至少每个读写指令都要做一次)。我指出这一点是因为它表明了问题不是由多核引起的,而是由多缓存引起的。我们知道了只有一组缓存也能工作,只是太慢了,接下来最好就是能做到:使用多组缓存,但使它们的行为看起来就像只有一组缓存那样。缓存一致性协议就是为了做到这一点而设计的。就像名称所暗示的那样,这类协议就是要使多组缓存的内容保持一致。

缓存一致性协议有多种,但是你日常处理的大多数计算机设备使用的都属于“窥探(snooping)”协议,这也是我这里要讲的。(还有一种叫“基于目录的(directory-based)”协议,这种协议的延迟性较大,但是在拥有很多个处理器的系统中,它有更好的可扩展性。)

“窥探”背后的基本思想是,所有内存传输都发生在一条共享的总线上,而所有的处理器都能看到这条总线:缓存本身是独立的,但是内存是共享资源,所有的内存访问都要经过仲裁(arbitrate):同一个指令周期中,只有一个缓存可以读写内存。窥探协议的思想是,缓存不仅仅在做内存传输的时候才和总线打交道,而是不停地在窥探总线上发生的数据交换,跟踪其他缓存在做什么。所以当一个缓存代表它所属的处理器去读写内存时,其他处理器都会得到通知,它们以此来使自己的缓存保持同步。只要某个处理器一写内存,其他处理器马上就知道这块内存在它们自己的缓存中对应的段已经失效。

在直写模式下,这是很直接的,因为写操作一旦发生,它的效果马上会被“公布”出去。但是如果混着回写模式,就有问题了。因为有可能在写指令执行过后很久,数据才会被真正回写到物理内存中——在这段时间内,其他处理器的缓存也可能会傻乎乎地去写同一块内存地址,导致冲突。在回写模型中,简单把内存写操作的信息广播给其他处理器是不够的,我们需要做的是,在修改本地缓存之前,就要告知其他处理器。搞懂了细节,就找到了处理回写模式这个问题的最简单方案,我们通常叫做MESI协议(译者注:MESI是Modified、Exclusive、Shared、Invalid的首字母缩写,代表四种缓存状态,下面的译文中可能会以单个字母指代相应的状态)。

MESI以及衍生协议

本节叫做“MESI以及衍生协议”,是因为MESI衍生了一系列紧密相关的一致性协议。我们先从原生的MESI协议开始:MESI是四种缓存段状态的首字母缩写,任何多核系统中的缓存段都处于这四种状态之一。我将以相反的顺序逐个讲解,因为这个顺序更合理:

  • 失效(Invalid)缓存段,要么已经不在缓存中,要么它的内容已经过时。为了达到缓存的目的,这种状态的段将会被忽略。一旦缓存段被标记为失效,那效果就等同于它从来没被加载到缓存中。
  • 共享(Shared)缓存段,它是和主内存内容保持一致的一份拷贝,在这种状态下的缓存段只能被读取,不能被写入。多组缓存可以同时拥有针对同一内存地址的共享缓存段,这就是名称的由来。
  • 独占(Exclusive)缓存段,和S状态一样,也是和主内存内容保持一致的一份拷贝。区别在于,如果一个处理器持有了某个E状态的缓存段,那其他处理器就不能同时持有它,所以叫“独占”。这意味着,如果其他处理器原本也持有同一缓存段,那么它会马上变成“失效”状态。
  • 已修改(Modified)缓存段,属于脏段,它们已经被所属的处理器修改了。如果一个段处于已修改状态,那么它在其他处理器缓存中的拷贝马上会变成失效状态,这个规律和E状态一样。此外,已修改缓存段如果被丢弃或标记为失效,那么先要把它的内容回写到内存中——这和回写模式下常规的脏段处理方式一样。

如果把以上这些状态和单核系统中回写模式的缓存做对比,你会发现I、S和M状态已经有对应的概念:失效/未载入、干净以及脏的缓存段。所以这里的新知识只有E状态,代表独占式访问。这个状态解决了“在我们开始修改某块内存之前,我们需要告诉其他处理器”这一问题:只有当缓存段处于E或M状态时,处理器才能去写它,也就是说只有这两种状态下,处理器是独占这个缓存段的。当处理器想写某个缓存段时,如果它没有独占权,它必须先发送一条“我要独占权”的请求给总线,这会通知其他处理器,把它们拥有的同一缓存段的拷贝失效(如果它们有的话)。只有在获得独占权后,处理器才能开始修改数据——并且此时,这个处理器知道,这个缓存段只有一份拷贝,在我自己的缓存里,所以不会有任何冲突。

反之,如果有其他处理器想读取这个缓存段(我们马上能知道,因为我们一直在窥探总线),独占或已修改的缓存段必须先回到“共享”状态。如果是已修改的缓存段,那么还要先把内容回写到内存中。

MESI协议是一个合适的状态机,既能处理来自本地处理器的请求,也能把信息广播到总线上。我不打算讲更多关于状态图的细节以及不同的状态转换类型。如果你感兴趣的话,可以在关于硬件架构的书中找到更多的深度内容,但对于本文来说,讲这些东西有点过了。作为一个软件开发者,你只要理解以下两点,就大有可为:

第一,在多核系统中,读取某个缓存段,实际上会牵涉到和其他处理器的通讯,并且可能导致它们发生内存传输。写某个缓存段需要多个步骤:在你写任何东西之前,你首先要获得独占权,以及所请求的缓存段的当前内容的拷贝(所谓的“带权限获取的读(Read For Ownership)”请求)。

第二,尽管我们为了一致性问题做了额外的工作,但是最终结果还是非常有保证的。即它遵守以下定理,我称之为:

MESI定律:在所有的脏缓存段(M状态)被回写后,任意缓存级别的所有缓存段中的内容,和它们对应的内存中的内容一致。此外,在任意时刻,当某个位置的内存被一个处理器加载入独占缓存段时(E状态),那它就不会再出现在其他任何处理器的缓存中。

注意,这其实就是我们已经讲过的回写定律加上独占规则而已。我认为MESI协议或多核系统的存在根本没有弱化我们现有的内存模型。

好了,至此我们(粗略)讲了原生MESI协议(以及使用它的CPU,比如ARM)。其他处理器使用MESI扩展后的变种。常见的扩展包括“O”(Owned)状态,它和E状态类似,也是保证缓存间一致性的手段,但它直接共享脏段的内容,而不需要先把它们回写到内存中(“脏段共享”),由此产生了MOSEI协议。还有MERSI和MESIF,这两个名字代表同一种思想,即指定某个处理器专门处理针对某个缓存段的读操作。当多个处理器同时拥有某个S状态的缓存段的时候,只有被指定的那个处理器(对应的缓存段为R或F状态)才能对读操作做出回应,而不是每个处理器都能这么做。这种设计可以降低总线的数据流量。当然你可以同时加入R/F状态和O状态,或者更多的状态。这些都属于优化,没有一种会改变基本定律,也没有一种会改变MESI协议所确保的结果。

我不是这方面的专家,很有可能有系统在使用其他协议,这些协议并不能完全保证一致性,不过如果有,我没有注意到它们,或者没有看到有什么流行的处理器在使用它们。所以为了达到我们的目的,我们真的就可以假设一致性协议能保证缓存的一致性。不是基本一致,不是“写入一会儿后才能保持一致”——而是完全的一致。从这个层面上说,除非硬件有问题,内存的状态总是一致的。用技术术语来说,MESI以及它的衍生协议,至少在原理上,提供了完整的顺序一致性(sequential consistency),在C++ 11的内存模型中,这是最强的一种确保内存顺序的模型。这也引出了问题,为什么我们需要弱一点的内存模型,以及“什么时候会用到它们”?

内存模型

不同的体系结构提供不同的内存模型。到本文写作的时候为止,ARM和POWER体系结构的机器拥有相对较弱的内存模型:这类CPU在读写指令重排序(reordering)方面有相当大的自由度,这种重排序有可能会改变程序在多核环境下的语义。通过“内存屏障(memory barrier)”,程序可以对此加以限制:“重排序操作不允许越过这条边界”。相反,x86则拥有较强的内存模型。

我不打算在这里深入到内存模型的细节中,这很容易陷入堆砌技术术语中,而且也超出了本文的范围。但是我想说一点关于“他们如何发生”的内容——也就是,弱内存模型如何保证正确性(相比较于MESI协议给缓存带来的顺序一致性),以及为什么。当然,一切都归结于性能。

规则是这样的:如果满足下面的条件,你就可以得到完全的顺序一致性:第一,缓存一收到总线事件,就可以在当前指令周期中迅速做出响应。第二,处理器如实地按程序的顺序,把内存操作指令送到缓存,并且等前一条执行完后才能发送下一条。当然,实际上现代处理器一般都无法满足以上条件:

  • 缓存不会及时响应总线事件。如果总线上发来一条消息,要使某个缓存段失效,但是如果此时缓存正在处理其他事情(比如和CPU传输数据),那这个消息可能无法在当前的指令周期中得到处理,而会进入所谓的“失效队列(invalidation queue)”,这个消息等在队列中直到缓存有空为止。
  • 处理器一般不会严格按照程序的顺序向缓存发送内存操作指令。当然,有乱序执行(Out-of-Order execution)功能的处理器肯定是这样的。顺序执行(in-order execution)的处理器有时候也无法完全保证内存操作的顺序(比如想要的内存不在缓存中时,CPU就不能为了载入缓存而停止工作)。
  • 写操作尤其特殊,因为它分为两阶段操作:在写之前我们先要得到缓存段的独占权。如果我们当前没有独占权,我们先要和其他处理器协商,这也需要一些时间。同理,在这种场景下让处理器闲着无所事事是一种资源浪费。实际上,写操作首先发起获得独占权的请求,然后就进入所谓的由“写缓冲(store buffer)”组成的队列(有些地方使用“写缓冲”指代整个队列,我这里使用它指代队列的一条入口)。写操作在队列中等待,直到缓存准备好处理它,此时写缓冲就被“清空(drained)”了,缓冲区被回收用于处理新的写操作。

这些特性意味着,默认情况下,读操作有可能会读到过时的数据(如果对应失效请求还等在队列中没执行),写操作真正完成的时间有可能比它们在代码中的位置晚,一旦牵涉到乱序执行,一切都变得模棱两可。回到内存模型,本质上只有两大阵营:

在弱内存模型的体系结构中,处理器为了开发者能写出正确的代码而做的工作是最小化的,指令重排序和各种缓冲的步骤都是被正式允许的,也就是说没有任何保证。如果你需要确保某种结果,你需要自己插入合适的内存屏障——它能防止重排序,并且等待队列中的操作全部完成。

使用强一点的内存模型的体系结构则会在内部做很多记录工作。比如,x86会跟踪所有在等待中的内存操作,这些操作都还没有完全完成(称为“退休(retired)”)。它会把它们的信息保存在芯片内部的MOB(“memory ordering buffer”,内存排序缓冲)。x86作为部分支持乱序执行的体系结构,在出问题的时候能把尚未“退休”的指令撤销掉——比如发生页错误(page fault),或者分支预测失败(branch mispredict)的时候。我已经在我以前的文章“好奇地说”中提到过一些细节,以及和内存子系统的一些交互。主旨是x86处理器会主动地监控外部事件(比如缓存失效),有些已经执行完的操作会因为这些事件而被撤销,但不算“退休”。这就是说,x86知道自己的内存模型应该是什么样子的,当发生了一件和这个模型冲突的事,处理器会回退到上一个与内存模型兼容的状态。这就是我在以前另一篇文章中提到的“清除内存排序机(memory ordering machine clear)”。最后的结果是,x86处理器为内存操作提供了很强的一致性保证——虽然没有达到完美的顺序一致性。

无论如何,一篇文章讲这么多已经够了。我把它放在我的博客上。我的想法是将来的文章只要引用它就行了。我们看效果吧。感谢阅读!

查看参考原文:http://fgiesen.wordpress.com/2014/07/07/cache-coherency/

 

I’m planning to write a bit about data organization for multi-core scenarios. I started writing a first post but quickly realized that there’s a bunch of basics I need to cover first. In this post, I’ll try just that.

Caches

This is a whirlwhind primer on CPU caches. I’m assuming you know the basic concept, but you might not be familiar with some of the details. (If you are, feel free to skip this section.)

In modern CPUs (almost) all memory accesses go through the cache hierarchy; there’s some exceptions for memory-mapped IO and write-combined memory that bypass at least parts of this process, but both of these are corner cases (in the sense that the vast majority of user-mode code will never see either), so I’ll ignore them in this post.

The CPU core’s load/store (and instruction fetch) units normally can’t even access memory directly – it’s physically impossible; the necessary wires don’t exist! Instead, they talk to their L1 caches which are supposed to handle it. And about 20 years ago, the L1 caches would indeed talk to memory directly. At this point, there’s generally more cache levels involved; this means the L1 cache doesn’t talk to memory directly anymore, it talks to a L2 cache – which in turns talks to memory. Or maybe to a L3 cache. You get the idea.

Caches are organized into “lines”, corresponding to aligned blocks of either 32 (older ARMs, 90s/early 2000s x86s/PowerPCs), 64 (newer ARMs and x86s) or 128 (newer Power ISA machines) bytes of memory. Each cache line knows what physical memory address range it corresponds to, and in this article I’m not going to differentiate between the physical cache line and the memory it represents – this is sloppy, but conventional usage, so better get used to it. In particular, I’m going to say “cache line” to mean a suitably aligned group of bytes in memory, no matter whether these bytes are currently cached (i.e. present in any of the cache levels) or not.

When the CPU core sees a memory load instruction, it passes the address to the L1 data cache (or “L1D$”, playing on the “cache” being pronounced the same way as “cash”). The L1D$ checks whether it contains the corresponding cache line. If not, the whole cache line is brought in from memory (or the next-deeper cache level, if present) – yes, the whole cache line; the assumption being that memory accesses are localized, so if we’re looking at some byte in memory we’re likely to access its neighbors soon. Once the cache line is present in the L1D$, the load instruction can go ahead and perform its memory read.

And as long as we’re dealing with read-only access, it’s all really simple, since all cache levels obey what I’ll call the

Basic invariant : the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses, at all times.

Things gets a bit more complicated once we allow stores, i.e. memory writes. There’s two basic approaches here: write-through and write-back. Write-through is the easier one: we just pass stores through to the next-level cache (or memory). If we have the corresponding line cached, we update our copy (or maybe even just discard it), but that’s it. This preserves the same invariant as before: if a cache line is present in the cache, its contents match memory, always.

Write-back is a bit trickier. The cache doesn’t pass writes on immediately. Instead, such modifications are applied locally to the cached data, and the corresponding cache lines are flagged “dirty”. Dirty cache lines can trigger a write-back, at which points their contents are written back to memory or the next cache level. After a write-back, dirty cache lines are “clean” again. When a dirty cache line is evicted (usually to make space for something else in the cache), it always needs to perform a write-back first. The invariant for write-back caches is slightly different.

Write-back invariant : after writing back all dirty cache lines , the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses.

In other words, in write-back caches we lose the “at all times” qualifier and replace it with a weaker condition: either the cache contents match memory (this is true for all clean cache lines), or they contain values that eventually need to get written back to memory (for dirty cache lines).

Write-through caches are simpler, but write-back has some advantages: it can filter repeated writes to the same location, and if most of the cache line changes on a write-back, it can issue one large memory transaction instead of several small ones, which is more efficient.

Some (mostly older) CPUs use write-through caches everywhere; some use write-back caches everywhere; some have a simpler write-through L1$ backed by a write-back L2$. This may generate redundant traffic between L1$ and L2$ but gets the write-back benefits for transfers to lower cache levels or memory. My point being that there’s a whole set of trade-offs here, and different designs use different solutions. Nor is there a requirement that cache line sizes be the same at all levels – it’s not unheard-of for CPUs to have 32-byte lines in L1$ but 128-byte lines in L2$ for example.

Omitted for simplicity in this section: cache associativity/sets; write-allocate or not (I described write-through without write-allocate and write-back with, which is the most common usage); unaligned accesses; virtually-addressed caches. These are all things you can look up if you’re interested, but I’m not going to go that deep here.

Coherency protocols

As long as that single CPU core is alone in the system, this all works just fine. Add more cores, each with their own caches, and we have a problem: what happens if some other core modifies data that’s in one of our caches?

Well, the answer is quite simple: nothing happens. And that’s bad, because we wantsomething to happen when someone else modifies memory that we have a cached copy of. Once we have multiple caches, we really need to keep them synchronized, or we don’t really have a “shared memory” system, more like a “shared general idea of what’s in memory” system.

Note that the problem really is that we have multiple caches, not that we have multiple cores. We could solve the entire problem by sharing all caches between all cores: there’s only one L1$, and all processors have to share it. Each cycle, the L1$ picks one lucky core that gets to do a memory operation this cycle, and runs it.

This works just fine. The only problem is that it’s also slow, because cores now spend most of their time waiting in line for their next turn at a L1$ request (and processors do a lot of those, at least one for every load/store instruction). I’m pointing this out because it shows that the problem really isn’t so much a multi- core problem as it is a multi- cache problem. We know that one set of caches works, but when that’s too slow, the next best thing is to have multiple caches and then make them behave as if there was only one cache. This is what cache coherency protocols are for: as the name suggests, they ensure that the contents of multiple caches stay coherent.

There are multiple types of coherency protocols, but most computing devices you deal with daily fall into the category of “snooping” protocols, and that’s what I’ll cover here. (The primary alternative, directory-based systems, has higher latency but scales better to systems with lots of cores).

The basic idea behind snooping is that all memory transactions take place on a shared bus that’s visible to all cores: the caches themselves are independent, but memory itself is a shared resource, and memory access needs to be arbitrated: only one cache gets to read data from, or write back to, memory in any given cycle. Now the idea in a snooping protocol is that the caches don’t just interact with the bus when they want to do a memory transaction themselves; instead, each cache continuously snoops on bus traffic to keep track of what the other caches are doing. So if one cache wants to read from or write to memory on behalf of its core, all the other cores notice, and that allows them to keep their caches synchronized. As soon as one core writes to a memory location, the other cores know that their copies of the corresponding cache line are now stale and hence invalid.

With write-through caches, this is fairly straightforward, since writes get “published” as soon as they happen. But if there are write-back caches in the mix, this doesn’t work, since the physical write-back to memory can happen a long time after the core executed the corresponding store – and for the intervening time, the other cores and their caches are none the wiser, and might themselves try to write to the same location, causing a conflict. So with a write-back model, it’s not enough to broadcast just the writes to memory when they happen; if we want to avoid conflicts, we need to tell other cores about our intention to write before we start changing anything in our local copy. Working out the details, the easiest solution that fits the bill and works for write-back caches is what’s commonly called the MESI protocol .

MESI and friends

This section is called “MESI and friends” because MESI spawned a whole host of closely related coherency protocols. Let’s start with the original though: MESI are the initials for the four states a cache line can be in for any of the multiple cores in a multi-core system. I’m gonna cover them in reverse order, because that’s the better order to explain them in:

  • I nvalid lines are cache lines that are either not present in the cache, or whose contents are known to be stale. For the purposes of caching, these are ignored. Once a cache line is invalidated, it’s as if it wasn’t in the cache in the first place.
  • S hared lines are clean copies of the contents of main memory. Cache lines in the shared state can be used to serve reads but they can’t be written to. Multiple caches are allowed to have a copy of the same memory location in “shared” state at the same time, hence the name.
  • E xclusive lines are also clean copies of the contents of main memory, just like the S state. The difference is that when one core holds a line in E state, no other core may hold it at the same time, hence “exclusive”. That is, the same line must be in the I state in the caches of all other cores.
  • M odified lines are dirty; they have been locally modified. If a line is in the M state, it must be in the I state for all other cores, same as E. In addition, modified cache lines need to be written back to memory when they get evicted or invalidated – same as the regular dirty state in a write-back cache.

If you compare this to the presentation of write-back caches in the single-core case above, you’ll see that the I, S and M states already had their equivalents: invalid/not present, clean, and dirty cache lines, respectively. So what’s new is the E state denoting exclusive access. This state solves the “we need to tell other cores before we start modifying memory” problem: each core may only write to cache lines if their caches hold them in the E or M states, i.e. they’re exclusively owned. If a core does not have exclusive access to a cache line when it wants to write, it first needs to send an “I want exclusive access” request to the bus. This tells all other cores to invalidate their copies of that cache line, if they have any. Only once that exclusive access is granted may the core start modifying data – and at that point, the core knows that the only copies of that cache line are in its own caches, so there can’t be any conflicts.

Conversely, once some other core wants to read from that cache line (which we learn immediately because we’re snooping the bus), exclusive and modified cache lines have to revert back to the “shared” (S) state. In the case of modified cache lines, this also involves writing their data back to memory first.

The MESI protocol is a proper state machine that responds both to requests coming from the local core, and to messages on the bus. I’m not going to go into detail about the full state diagram and what the different transition types are; you can find more in-depth information in books on hardware architecture if you care, but for our purposes this is overkill. As a software developer, you’ll get pretty far knowing only two things:

Firstly, in a multi-core system, getting read access to a cache line involves talking to the other cores, and might cause them to perform memory transactions.Writing to a cache line is a multi-step process: before you can write anything, you first need to acquire both exclusive ownership of the cache line and a copy of its existing contents (a so-called “Read For Ownership” request).

And secondly, while we have to do some extra gymnastics, the end result actually does provide some pretty strong guarantees. Namely, it obeys what I’ll call the

MESI invariant : after writing back all dirty ( M-state ) cache lines, the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses. In addition, at all times, when a memory location is exclusively cached (in E or M state) by one core, it is not present in any of the other core’s caches. .

Note that this is really just the write-back invariant we already saw with the additional exclusivity rule thrown in. My point being that the presence of MESI or multiple cores does not necessarily weaken our memory model at all.

Okay, so that (very roughly) covers vanilla MESI (and hence also CPUs that use it, ARMs for example). Other processors use extended variants. Popular extensions include an “O” (Owned) state similar to “E” that allows sharing of dirty cache lines without having to write them back to memory first (“dirty sharing”), yielding MOESI, and MERSI/MESIF, which are different names for the same idea, namely making one core the designated responder for read requests to a given cache line. When multiple cores hold a cache line in Shared state, only the designated responder (which holds the cache line in “R” or “F” state) replies to read requests, rather than everyone who holds the cache line in S state. This reduces bus traffic. And of course you can add both the R/F states and the O state, or get even fancier. All these are optimizations, but none of them change the basic invariants provided or guarantees made by the protocol.

I’m no expert on the topic, and it’s quite possible that there are other protocols in use that only provide substantially weaker guarantees, but if so I’m not aware of them, or any popular CPU core that uses them. So for our purposes, we really can assume that coherency protocols keep caches coherent, period. Not mostly-coherent, not “coherent except for a short window after a change” – properly coherent. At that level, barring hardware malfunction, there is always agreement on what the current state of memory should be. In technical terms, MESI and all its variants can, in principle anyway, provide full sequential consistency , the strongest memory ordering guarantee specified in the C++11 memory model. Which begs the question, why do we have weaker memory models, and “where do they happen”?

Memory models

Different architectures provide different memory models. As of this writing, ARM and POWER architecture machines have comparatively “weak” memory models: the CPU core has considerable leeway in reordering load and store operations in ways that might change the semantics of programs in a multi-core context, along with “memory barrier” instructions that can be used by the program to specify constraints: “do not reorder memory operations across this line”. By contrast, x86 comes with a quite strong memory model.

I won’t go into the details of memory models here; it quickly gets really technical, and is outside the scope of this article. But I do want to talk a bit about “how they happen” – that is, where the weakened guarantees (compared to the full sequential consistency we can get from MESI etc.) come from, and why. And as usual, it all boils down to performance.

So here’s the deal: you will indeed get full sequential consistency if a) the cache immediately responds to bus events on the very cycle it receives them, and b) the core dutifully sends each memory operation to the cache, in program order, and wait for it to complete before you send the next one. And of course, in practice modern CPUs normally do none of these things:

  • Caches do not respond to bus events immediately. If a bus message triggering a cache line invalidation arrives while the cache is busy doing other things (sending data to the core for example), it might not get processed that cycle. Instead, it will enter a so-called “invalidation queue”, where it sits for a while until the cache has time to process it.
  • Cores do not, in general, send memory operations to the cache in strict program order; this is certainly the case for cores with Out-of-Order execution, but even otherwise in-order cores may have somewhat weaker ordering guarantees for memory operations (for example, to ensure that a single cache miss doesn’t immediately make the entire core grind to a halt).
  • In particular, stores are special, because they’re a two-phase operation: we first need to acquire exclusive ownership of a cache line before a store can go through. And if we don’t already have exclusive ownership, we need to talk to the other cores, which takes a while. Again, having the core idle and twiddling thumbs while this is happening is not a good use of execution resources. Instead, what happens is that stores start the process of getting exclusive ownership, then get entered into a queue of so-called “store buffers” (some refer to the entire queue as “store buffer”, but I’m going to use the term to refer to the entries). They stay around in this queue for a while until the cache is ready to actually perform the store operation, at which point the corresponding store buffer is “drained” and can be recycled to hold a new pending store.

The implication of all these things is that, by default, loads can fetch stale data (if a corresponding invalidation request was sitting in the invalidation queue), stores actually finish later than their position in the code would suggest, and everything gets even more vague when Out of Order execution is involved. So going back to memory models, there are essentially two camps:

Architectures with a weak memory model do the minimum amount of work necessary in the core that allows software developers to write correct code. Instruction reordering and the various buffering stages are officially permitted; there are no guarantees. If you need guarantees, you need to insert the appropriate memory barriers – which will prevent reordering and drain queues of pending operations where required.

Architectures with stronger memory models do a lot more bookkeeping on the inside. For example, x86 processors keep track of all pending memory operations that are not fully finished (“retired”) yet, in a chip-internal data structure that’s called the MOB (“memory ordering buffer”). As part of the Out of Order infrastructure, x86 cores can roll back non-retired operations if there’s a problem – say an exception like a page fault, or a branch mispredict. I covered some of the details, as well as some of the interactions with the memory subsystem, in my earlier article “Speculatively speaking “. The gist of it is that x86 processors actively watch out for external events (such as cache invalidations) that would retroactively invalidate the results of some of the operations that have already executed, but not been retired yet. That is, x86 processors know what their memory model is, and when an event happens that’s inconsistent within that model, the machine state is rolled back to the last time when it was still consistent with the rules of the memory model. This is the “memory ordering machine clear” I covered in yet another earlier post . The end result is that x86 processors provide very strong guarantees for all memory operations – not quite sequential consistency, though.

So, weaker memory models make for simpler (and potentially lower-power) cores. Stronger memory models make the design of cores (and their memory subsystems) more complex, but are easier to write code for. In theory, the weaker models allow for more scheduling freedom and can be potentially faster; in practice, x86s seem to be doing fine on the performance of memory operations, for the time being at least. So it’s hard for me to call a definite winner so far. Certainly, as a software developer I’m happy to take the stronger x86 memory model when I can get it.

Anyway. That’s plenty for one post. And now that I have all this written up on my blog, the idea is that future posts can just reference it. We’ll see how that goes. Thanks for reading!

posted on 2017-04-07 12:48  e_shannon  阅读(1258)  评论(0编辑  收藏  举报