
zt:缓存一致性(Cache Coherency)入门


http://www.infoq.com/cn/articles/cache-coherency-primer

http://www.cnblogs.com/xybaby/p/6641928.html

English:

http://www.tuicool.com/articles/BVRNZbV

=========================================

yxr's notes:

1) I had previously looked into IBM's CPU acceleration technology (CAPI), which talks about memory coherency. To understand how that differs from acceleration through an ordinary accelerator card, it was necessary to get clear on what cache coherency means and what it buys you; only then do the advantages IBM claims for that architecture make sense.

2) The article is easy to follow, concise, and builds up step by step, and the Chinese translation is faithful, so it works well as a primer. Both the Chinese and the English versions are reposted below for study.

3) Summary:

The article first explains where caches come from, what a cache line is (a cache line denotes an aligned block of memory, which is not necessarily resident in any cache), and the invariants that cache accesses obey. The coherency problem (and the coherency protocols that address it) arises once multiple cores have their own caches and access memory in write-back mode. The most common solution is a snooping protocol, whose physical basis is that every cache can observe the accesses on the shared memory interface. To achieve cache coherency, the MESI states are introduced (four letters, four states per cache line); the cores exchange messages to drive the state transitions and keep their caches coherent.

?===========================================

本文是RAD Game Tools程序員Fabian “ryg” Giesen在其博客上發(fā)表的《Cache coherency primer》一文的翻譯,經(jīng)作者許可分享至InfoQ中文站。該系列共有兩篇,本文系第一篇。

我計(jì)劃寫一些關(guān)于多核場(chǎng)景下數(shù)據(jù)組織的文章。寫了第一篇,但我很快意識(shí)到有大量的基礎(chǔ)知識(shí)我首先需要講一下。在本文中,我就嘗試闡述這些知識(shí)。

緩存(Cache)

本文是關(guān)于CPU緩存的快速入門。我假設(shè)你已經(jīng)有了基本概念,但你可能不熟悉其中的一些細(xì)節(jié)。(如果你已經(jīng)熟悉了,你可以忽略這部分。)

在現(xiàn)代的CPU(大多數(shù))上,所有的內(nèi)存訪問(wèn)都需要通過(guò)層層的緩存來(lái)進(jìn)行。也有些例外,比如,對(duì)映射成內(nèi)存地址的I/O口、寫合并(Write-combined)內(nèi)存,這些訪問(wèn)至少會(huì)繞開(kāi)這個(gè)流程的一部分。但這兩者都是罕見(jiàn)的場(chǎng)景(意味著絕大多數(shù)的用戶態(tài)代碼都不會(huì)遇到這兩種情況),所以在本文中,我將忽略這兩者。

CPU的讀/寫(以及取指令)單元正常情況下甚至都不能直接訪問(wèn)內(nèi)存——這是物理結(jié)構(gòu)決定的;CPU都沒(méi)有管腳直接連到內(nèi)存。相反,CPU和一級(jí)緩存(L1 Cache)通訊,而一級(jí)緩存才能和內(nèi)存通訊。大約二十年前,一級(jí)緩存可以直接和內(nèi)存?zhèn)鬏敂?shù)據(jù)。如今,更多級(jí)別的緩存加入到設(shè)計(jì)中,一級(jí)緩存已經(jīng)不能直接和內(nèi)存通訊了,它和二級(jí)緩存通訊——而二級(jí)緩存才能和內(nèi)存通訊。或者還可能有三級(jí)緩存。你明白這個(gè)意思就行。

緩存是分“段”(line)的,一個(gè)段對(duì)應(yīng)一塊存儲(chǔ)空間,大小是32(較早的ARM、90年代/2000年代早期的x86和PowerPC)、64(較新的ARM和x86)或128(較新的Power ISA機(jī)器)字節(jié)。每個(gè)緩存段知道自己對(duì)應(yīng)什么范圍的物理內(nèi)存地址,并且在本文中,我不打算區(qū)分物理上的緩存段和它所代表的內(nèi)存,這聽(tīng)起來(lái)有點(diǎn)草率,但是為了方便起見(jiàn),還是請(qǐng)熟悉這種提法。具體地說(shuō),當(dāng)我提到“緩存段”的時(shí)候,我就是指一段和緩存大小對(duì)齊的內(nèi)存,不關(guān)心里面的內(nèi)容是否真正被緩存進(jìn)去(就是說(shuō)保存在任何級(jí)別的緩存中)了。

當(dāng)CPU看到一條讀內(nèi)存的指令時(shí),它會(huì)把內(nèi)存地址傳遞給一級(jí)數(shù)據(jù)緩存(或可戲稱為L(zhǎng)1D$,因?yàn)橛⒄Z(yǔ)中“緩存(cache)”和“現(xiàn)金(cash)”的發(fā)音相同)。一級(jí)數(shù)據(jù)緩存會(huì)檢查它是否有這個(gè)內(nèi)存地址對(duì)應(yīng)的緩存段。如果沒(méi)有,它會(huì)把整個(gè)緩存段從內(nèi)存(或者從更高一級(jí)的緩存,如果有的話)中加載進(jìn)來(lái)。是的,一次加載整個(gè)緩存段,這是基于這樣一個(gè)假設(shè):內(nèi)存訪問(wèn)傾向于本地化(localized),如果我們當(dāng)前需要某個(gè)地址的數(shù)據(jù),那么很可能我們馬上要訪問(wèn)它的鄰近地址。一旦緩存段被加載到緩存中,讀指令就可以正常進(jìn)行讀取。

如果我們只處理讀操作,那么事情會(huì)很簡(jiǎn)單,因?yàn)樗屑?jí)別的緩存都遵守以下規(guī)律,我稱之為:

基本定律:在任意時(shí)刻,任意級(jí)別緩存中的緩存段的內(nèi)容,等同于它對(duì)應(yīng)的內(nèi)存中的內(nèi)容。

一旦我們?cè)试S寫操作,事情就變得復(fù)雜一點(diǎn)了。這里有兩種基本的寫模式:直寫(write-through)和回寫(write-back)。直寫更簡(jiǎn)單一點(diǎn):我們透過(guò)本級(jí)緩存,直接把數(shù)據(jù)寫到下一級(jí)緩存(或直接到內(nèi)存)中,如果對(duì)應(yīng)的段被緩存了,我們同時(shí)更新緩存中的內(nèi)容(甚至直接丟棄),就這么簡(jiǎn)單。這也遵守前面的定律:緩存中的段永遠(yuǎn)和它對(duì)應(yīng)的內(nèi)存內(nèi)容匹配。

回寫模式就有點(diǎn)復(fù)雜了。緩存不會(huì)立即把寫操作傳遞到下一級(jí),而是僅修改本級(jí)緩存中的數(shù)據(jù),并且把對(duì)應(yīng)的緩存段標(biāo)記為“臟”段。臟段會(huì)觸發(fā)回寫,也就是把里面的內(nèi)容寫到對(duì)應(yīng)的內(nèi)存或下一級(jí)緩存中。回寫后,臟段又變“干凈”了。當(dāng)一個(gè)臟段被丟棄的時(shí)候,總是先要進(jìn)行一次回寫。回寫所遵循的規(guī)律有點(diǎn)不同。

回寫定律:當(dāng)所有的臟段被回寫后,任意級(jí)別緩存中的緩存段的內(nèi)容,等同于它對(duì)應(yīng)的內(nèi)存中的內(nèi)容。

換句話說(shuō),回寫模式的定律中,我們?nèi)サ袅恕霸谌我鈺r(shí)刻”這個(gè)修飾語(yǔ),代之以弱化一點(diǎn)的條件:要么緩存段的內(nèi)容和內(nèi)存一致(如果緩存段是干凈的話),要么緩存段中的內(nèi)容最終要回寫到內(nèi)存中(對(duì)于臟緩存段來(lái)說(shuō))。

直接模式更簡(jiǎn)單,但是回寫模式有它的優(yōu)勢(shì):它能過(guò)濾掉對(duì)同一地址的反復(fù)寫操作,并且,如果大多數(shù)緩存段都在回寫模式下工作,那么系統(tǒng)經(jīng)常可以一下子寫一大片內(nèi)存,而不是分成小塊來(lái)寫,前者的效率更高。

有些(大多數(shù)是比較老的)CPU只使用直寫模式,有些只使用回寫模式,還有一些,一級(jí)緩存使用直寫而二級(jí)緩存使用回寫。這樣做雖然在一級(jí)和二級(jí)緩存之間產(chǎn)生了不必要的數(shù)據(jù)流量,但二級(jí)緩存和更低級(jí)緩存或內(nèi)存之間依然保留了回寫的優(yōu)勢(shì)。我想說(shuō)的是,這里涉及到一系列的取舍問(wèn)題,且不同的設(shè)計(jì)有不同的解決方案。沒(méi)有人規(guī)定各級(jí)緩存的大小必須一致。舉個(gè)例子,我們會(huì)看到有CPU的一級(jí)緩存是32字節(jié),而二級(jí)緩存卻有128字節(jié)。

為了簡(jiǎn)化問(wèn)題,我省略了一些內(nèi)容:緩存關(guān)聯(lián)性(cache associativity),緩存組(cache sets),使用分配寫(write-allocate)還是非分配寫(上面我描述的直寫是和分配寫相結(jié)合的,而回寫是和非分配寫相結(jié)合的),非對(duì)齊的訪問(wèn)(unaligned access),基于虛擬地址的緩存。如果你感興趣,所有這些內(nèi)容都可以去查查資料,但我不準(zhǔn)備在這里講了。

一致性協(xié)議(Coherency protocols)

只要系統(tǒng)只有一個(gè)CPU核在工作,一切都沒(méi)問(wèn)題。如果有多個(gè)核,每個(gè)核又都有自己的緩存,那么我們就遇到問(wèn)題了:如果某個(gè)CPU緩存段中對(duì)應(yīng)的內(nèi)存內(nèi)容被另外一個(gè)CPU偷偷改了,會(huì)發(fā)生什么?

好吧,答案很簡(jiǎn)單:什么也不會(huì)發(fā)生。這很糟糕。因?yàn)槿绻粋€(gè)CPU緩存了某塊內(nèi)存,那么在其他CPU修改這塊內(nèi)存的時(shí)候,我們希望得到通知。我們擁有多組緩存的時(shí)候,真的需要它們保持同步。或者說(shuō),系統(tǒng)的內(nèi)存在各個(gè)CPU之間無(wú)法做到與生俱來(lái)的同步,我們實(shí)際上是需要一個(gè)大家都能遵守的方法來(lái)達(dá)到同步的目的。

注意,這個(gè)問(wèn)題的根源是我們擁有多組緩存,而不是多個(gè)CPU核。我們也可以這樣解決問(wèn)題,讓多個(gè)CPU核共用一組緩存:也就是說(shuō)只有一塊一級(jí)緩存,所有處理器都必須共用它。在每一個(gè)指令周期,只有一個(gè)幸運(yùn)的CPU能通過(guò)一級(jí)緩存做內(nèi)存操作,運(yùn)行它的指令。

這本身沒(méi)問(wèn)題。唯一的問(wèn)題就是太慢了,因?yàn)檫@下處理器的時(shí)間都花在排隊(duì)等待使用一級(jí)緩存了(并且處理器會(huì)做大量的這種操作,至少每個(gè)讀寫指令都要做一次)。我指出這一點(diǎn)是因?yàn)樗砻髁藛?wèn)題不是由多核引起的,而是由多緩存引起的。我們知道了只有一組緩存也能工作,只是太慢了,接下來(lái)最好就是能做到:使用多組緩存,但使它們的行為看起來(lái)就像只有一組緩存那樣。緩存一致性協(xié)議就是為了做到這一點(diǎn)而設(shè)計(jì)的。就像名稱所暗示的那樣,這類協(xié)議就是要使多組緩存的內(nèi)容保持一致。

緩存一致性協(xié)議有多種,但是你日常處理的大多數(shù)計(jì)算機(jī)設(shè)備使用的都屬于“窺探(snooping)”協(xié)議,這也是我這里要講的。(還有一種叫“基于目錄的(directory-based)”協(xié)議,這種協(xié)議的延遲性較大,但是在擁有很多個(gè)處理器的系統(tǒng)中,它有更好的可擴(kuò)展性。)

“窺探”背后的基本思想是,所有內(nèi)存?zhèn)鬏敹及l(fā)生在一條共享的總線上,而所有的處理器都能看到這條總線:緩存本身是獨(dú)立的,但是內(nèi)存是共享資源,所有的內(nèi)存訪問(wèn)都要經(jīng)過(guò)仲裁(arbitrate):同一個(gè)指令周期中,只有一個(gè)緩存可以讀寫內(nèi)存。窺探協(xié)議的思想是,緩存不僅僅在做內(nèi)存?zhèn)鬏數(shù)臅r(shí)候才和總線打交道,而是不停地在窺探總線上發(fā)生的數(shù)據(jù)交換,跟蹤其他緩存在做什么。所以當(dāng)一個(gè)緩存代表它所屬的處理器去讀寫內(nèi)存時(shí),其他處理器都會(huì)得到通知,它們以此來(lái)使自己的緩存保持同步。只要某個(gè)處理器一寫內(nèi)存,其他處理器馬上就知道這塊內(nèi)存在它們自己的緩存中對(duì)應(yīng)的段已經(jīng)失效。

在直寫模式下,這是很直接的,因?yàn)閷懖僮饕坏┌l(fā)生,它的效果馬上會(huì)被“公布”出去。但是如果混著回寫模式,就有問(wèn)題了。因?yàn)橛锌赡茉趯懼噶顖?zhí)行過(guò)后很久,數(shù)據(jù)才會(huì)被真正回寫到物理內(nèi)存中——在這段時(shí)間內(nèi),其他處理器的緩存也可能會(huì)傻乎乎地去寫同一塊內(nèi)存地址,導(dǎo)致沖突。在回寫模型中,簡(jiǎn)單把內(nèi)存寫操作的信息廣播給其他處理器是不夠的,我們需要做的是,在修改本地緩存之前,就要告知其他處理器。搞懂了細(xì)節(jié),就找到了處理回寫模式這個(gè)問(wèn)題的最簡(jiǎn)單方案,我們通常叫做MESI協(xié)議(譯者注:MESI是Modified、Exclusive、Shared、Invalid的首字母縮寫,代表四種緩存狀態(tài),下面的譯文中可能會(huì)以單個(gè)字母指代相應(yīng)的狀態(tài))。

MESI以及衍生協(xié)議

本節(jié)叫做“MESI以及衍生協(xié)議”,是因?yàn)镸ESI衍生了一系列緊密相關(guān)的一致性協(xié)議。我們先從原生的MESI協(xié)議開(kāi)始:MESI是四種緩存段狀態(tài)的首字母縮寫,任何多核系統(tǒng)中的緩存段都處于這四種狀態(tài)之一。我將以相反的順序逐個(gè)講解,因?yàn)檫@個(gè)順序更合理:

  • 失效(Invalid)緩存段,要么已經(jīng)不在緩存中,要么它的內(nèi)容已經(jīng)過(guò)時(shí)。為了達(dá)到緩存的目的,這種狀態(tài)的段將會(huì)被忽略。一旦緩存段被標(biāo)記為失效,那效果就等同于它從來(lái)沒(méi)被加載到緩存中。
  • 共享(Shared)緩存段,它是和主內(nèi)存內(nèi)容保持一致的一份拷貝,在這種狀態(tài)下的緩存段只能被讀取,不能被寫入。多組緩存可以同時(shí)擁有針對(duì)同一內(nèi)存地址的共享緩存段,這就是名稱的由來(lái)。
  • 獨(dú)占(Exclusive)緩存段,和S狀態(tài)一樣,也是和主內(nèi)存內(nèi)容保持一致的一份拷貝。區(qū)別在于,如果一個(gè)處理器持有了某個(gè)E狀態(tài)的緩存段,那其他處理器就不能同時(shí)持有它,所以叫“獨(dú)占”。這意味著,如果其他處理器原本也持有同一緩存段,那么它會(huì)馬上變成“失效”狀態(tài)。
  • 已修改(Modified)緩存段,屬于臟段,它們已經(jīng)被所屬的處理器修改了。如果一個(gè)段處于已修改狀態(tài),那么它在其他處理器緩存中的拷貝馬上會(huì)變成失效狀態(tài),這個(gè)規(guī)律和E狀態(tài)一樣。此外,已修改緩存段如果被丟棄或標(biāo)記為失效,那么先要把它的內(nèi)容回寫到內(nèi)存中——這和回寫模式下常規(guī)的臟段處理方式一樣。

如果把以上這些狀態(tài)和單核系統(tǒng)中回寫模式的緩存做對(duì)比,你會(huì)發(fā)現(xiàn)I、S和M狀態(tài)已經(jīng)有對(duì)應(yīng)的概念:失效/未載入、干凈以及臟的緩存段。所以這里的新知識(shí)只有E狀態(tài),代表獨(dú)占式訪問(wèn)。這個(gè)狀態(tài)解決了“在我們開(kāi)始修改某塊內(nèi)存之前,我們需要告訴其他處理器”這一問(wèn)題:只有當(dāng)緩存段處于E或M狀態(tài)時(shí),處理器才能去寫它,也就是說(shuō)只有這兩種狀態(tài)下,處理器是獨(dú)占這個(gè)緩存段的。當(dāng)處理器想寫某個(gè)緩存段時(shí),如果它沒(méi)有獨(dú)占權(quán),它必須先發(fā)送一條“我要獨(dú)占權(quán)”的請(qǐng)求給總線,這會(huì)通知其他處理器,把它們擁有的同一緩存段的拷貝失效(如果它們有的話)。只有在獲得獨(dú)占權(quán)后,處理器才能開(kāi)始修改數(shù)據(jù)——并且此時(shí),這個(gè)處理器知道,這個(gè)緩存段只有一份拷貝,在我自己的緩存里,所以不會(huì)有任何沖突。

反之,如果有其他處理器想讀取這個(gè)緩存段(我們馬上能知道,因?yàn)槲覀円恢痹诟Q探總線),獨(dú)占或已修改的緩存段必須先回到“共享”狀態(tài)。如果是已修改的緩存段,那么還要先把內(nèi)容回寫到內(nèi)存中。

MESI協(xié)議是一個(gè)合適的狀態(tài)機(jī),既能處理來(lái)自本地處理器的請(qǐng)求,也能把信息廣播到總線上。我不打算講更多關(guān)于狀態(tài)圖的細(xì)節(jié)以及不同的狀態(tài)轉(zhuǎn)換類型。如果你感興趣的話,可以在關(guān)于硬件架構(gòu)的書(shū)中找到更多的深度內(nèi)容,但對(duì)于本文來(lái)說(shuō),講這些東西有點(diǎn)過(guò)了。作為一個(gè)軟件開(kāi)發(fā)者,你只要理解以下兩點(diǎn),就大有可為:

第一,在多核系統(tǒng)中,讀取某個(gè)緩存段,實(shí)際上會(huì)牽涉到和其他處理器的通訊,并且可能導(dǎo)致它們發(fā)生內(nèi)存?zhèn)鬏敗懩硞€(gè)緩存段需要多個(gè)步驟:在你寫任何東西之前,你首先要獲得獨(dú)占權(quán),以及所請(qǐng)求的緩存段的當(dāng)前內(nèi)容的拷貝(所謂的“帶權(quán)限獲取的讀(Read For Ownership)”請(qǐng)求)。

第二,盡管我們?yōu)榱艘恢滦詥?wèn)題做了額外的工作,但是最終結(jié)果還是非常有保證的。即它遵守以下定理,我稱之為:

MESI定律:在所有的臟緩存段(M狀態(tài))被回寫后,任意緩存級(jí)別的所有緩存段中的內(nèi)容,和它們對(duì)應(yīng)的內(nèi)存中的內(nèi)容一致。此外,在任意時(shí)刻,當(dāng)某個(gè)位置的內(nèi)存被一個(gè)處理器加載入獨(dú)占緩存段時(shí)(E狀態(tài)),那它就不會(huì)再出現(xiàn)在其他任何處理器的緩存中。

注意,這其實(shí)就是我們已經(jīng)講過(guò)的回寫定律加上獨(dú)占規(guī)則而已。我認(rèn)為MESI協(xié)議或多核系統(tǒng)的存在根本沒(méi)有弱化我們現(xiàn)有的內(nèi)存模型。

好了，至此我們（粗略）講了原生MESI協議（以及使用它的CPU，比如ARM）。其他處理器使用MESI擴展后的變種。常見的擴展包括“O”（Owned）狀態，它和E狀態類似，也是保證緩存間一致性的手段，但它直接共享臟段的內容，而不需要先把它們回寫到內存中（“臟段共享”），由此產生了MOESI協議。還有MERSI和MESIF，這兩個名字代表同一種思想，即指定某個處理器專門處理針對某個緩存段的讀操作。當多個處理器同時擁有某個S狀態的緩存段的時候，只有被指定的那個處理器（對應的緩存段為R或F狀態）才能對讀操作做出回應，而不是每個處理器都能這么做。這種設計可以降低總線的數據流量。當然你可以同時加入R/F狀態和O狀態，或者更多的狀態。這些都屬于優化，沒有一種會改變基本定律，也沒有一種會改變MESI協議所確保的結果。

我不是這方面的專家,很有可能有系統(tǒng)在使用其他協(xié)議,這些協(xié)議并不能完全保證一致性,不過(guò)如果有,我沒(méi)有注意到它們,或者沒(méi)有看到有什么流行的處理器在使用它們。所以為了達(dá)到我們的目的,我們真的就可以假設(shè)一致性協(xié)議能保證緩存的一致性。不是基本一致,不是“寫入一會(huì)兒后才能保持一致”——而是完全的一致。從這個(gè)層面上說(shuō),除非硬件有問(wèn)題,內(nèi)存的狀態(tài)總是一致的。用技術(shù)術(shù)語(yǔ)來(lái)說(shuō),MESI以及它的衍生協(xié)議,至少在原理上,提供了完整的順序一致性(sequential consistency),在C++ 11的內(nèi)存模型中,這是最強(qiáng)的一種確保內(nèi)存順序的模型。這也引出了問(wèn)題,為什么我們需要弱一點(diǎn)的內(nèi)存模型,以及“什么時(shí)候會(huì)用到它們”?

內(nèi)存模型

不同的體系結構提供不同的內存模型。到本文寫作的時候為止，ARM和POWER體系結構的機器擁有相對較弱的內存模型：這類CPU在讀寫指令重排序（reordering）方面有相當大的自由度，這種重排序有可能會改變程序在多核環境下的語義。通過“內存屏障（memory barrier）”，程序可以對此加以限制：“重排序操作不允許越過這條邊界”。相反，x86則擁有較強的內存模型。

我不打算在這里深入到內(nèi)存模型的細(xì)節(jié)中,這很容易陷入堆砌技術(shù)術(shù)語(yǔ)中,而且也超出了本文的范圍。但是我想說(shuō)一點(diǎn)關(guān)于“他們?nèi)绾伟l(fā)生”的內(nèi)容——也就是,弱內(nèi)存模型如何保證正確性(相比較于MESI協(xié)議給緩存帶來(lái)的順序一致性),以及為什么。當(dāng)然,一切都?xì)w結(jié)于性能。

規(guī)則是這樣的:如果滿足下面的條件,你就可以得到完全的順序一致性:第一,緩存一收到總線事件,就可以在當(dāng)前指令周期中迅速做出響應(yīng)。第二,處理器如實(shí)地按程序的順序,把內(nèi)存操作指令送到緩存,并且等前一條執(zhí)行完后才能發(fā)送下一條。當(dāng)然,實(shí)際上現(xiàn)代處理器一般都無(wú)法滿足以上條件:

  • 緩存不會(huì)及時(shí)響應(yīng)總線事件。如果總線上發(fā)來(lái)一條消息,要使某個(gè)緩存段失效,但是如果此時(shí)緩存正在處理其他事情(比如和CPU傳輸數(shù)據(jù)),那這個(gè)消息可能無(wú)法在當(dāng)前的指令周期中得到處理,而會(huì)進(jìn)入所謂的“失效隊(duì)列(invalidation queue)”,這個(gè)消息等在隊(duì)列中直到緩存有空為止。
  • 處理器一般不會(huì)嚴(yán)格按照程序的順序向緩存發(fā)送內(nèi)存操作指令。當(dāng)然,有亂序執(zhí)行(Out-of-Order execution)功能的處理器肯定是這樣的。順序執(zhí)行(in-order execution)的處理器有時(shí)候也無(wú)法完全保證內(nèi)存操作的順序(比如想要的內(nèi)存不在緩存中時(shí),CPU就不能為了載入緩存而停止工作)。
  • 寫操作尤其特殊,因?yàn)樗譃閮呻A段操作:在寫之前我們先要得到緩存段的獨(dú)占權(quán)。如果我們當(dāng)前沒(méi)有獨(dú)占權(quán),我們先要和其他處理器協(xié)商,這也需要一些時(shí)間。同理,在這種場(chǎng)景下讓處理器閑著無(wú)所事事是一種資源浪費(fèi)。實(shí)際上,寫操作首先發(fā)起獲得獨(dú)占權(quán)的請(qǐng)求,然后就進(jìn)入所謂的由“寫緩沖(store buffer)”組成的隊(duì)列(有些地方使用“寫緩沖”指代整個(gè)隊(duì)列,我這里使用它指代隊(duì)列的一條入口)。寫操作在隊(duì)列中等待,直到緩存準(zhǔn)備好處理它,此時(shí)寫緩沖就被“清空(drained)”了,緩沖區(qū)被回收用于處理新的寫操作。

這些特性意味著,默認(rèn)情況下,讀操作有可能會(huì)讀到過(guò)時(shí)的數(shù)據(jù)(如果對(duì)應(yīng)失效請(qǐng)求還等在隊(duì)列中沒(méi)執(zhí)行),寫操作真正完成的時(shí)間有可能比它們?cè)诖a中的位置晚,一旦牽涉到亂序執(zhí)行,一切都變得模棱兩可。回到內(nèi)存模型,本質(zhì)上只有兩大陣營(yíng):

在弱內(nèi)存模型的體系結(jié)構(gòu)中,處理器為了開(kāi)發(fā)者能寫出正確的代碼而做的工作是最小化的,指令重排序和各種緩沖的步驟都是被正式允許的,也就是說(shuō)沒(méi)有任何保證。如果你需要確保某種結(jié)果,你需要自己插入合適的內(nèi)存屏障——它能防止重排序,并且等待隊(duì)列中的操作全部完成。

使用強(qiáng)一點(diǎn)的內(nèi)存模型的體系結(jié)構(gòu)則會(huì)在內(nèi)部做很多記錄工作。比如,x86會(huì)跟蹤所有在等待中的內(nèi)存操作,這些操作都還沒(méi)有完全完成(稱為“退休(retired)”)。它會(huì)把它們的信息保存在芯片內(nèi)部的MOB(“memory ordering buffer”,內(nèi)存排序緩沖)。x86作為部分支持亂序執(zhí)行的體系結(jié)構(gòu),在出問(wèn)題的時(shí)候能把尚未“退休”的指令撤銷掉——比如發(fā)生頁(yè)錯(cuò)誤(page fault),或者分支預(yù)測(cè)失敗(branch mispredict)的時(shí)候。我已經(jīng)在我以前的文章“好奇地說(shuō)”中提到過(guò)一些細(xì)節(jié),以及和內(nèi)存子系統(tǒng)的一些交互。主旨是x86處理器會(huì)主動(dòng)地監(jiān)控外部事件(比如緩存失效),有些已經(jīng)執(zhí)行完的操作會(huì)因?yàn)檫@些事件而被撤銷,但不算“退休”。這就是說(shuō),x86知道自己的內(nèi)存模型應(yīng)該是什么樣子的,當(dāng)發(fā)生了一件和這個(gè)模型沖突的事,處理器會(huì)回退到上一個(gè)與內(nèi)存模型兼容的狀態(tài)。這就是我在以前另一篇文章中提到的“清除內(nèi)存排序機(jī)(memory ordering machine clear)”。最后的結(jié)果是,x86處理器為內(nèi)存操作提供了很強(qiáng)的一致性保證——雖然沒(méi)有達(dá)到完美的順序一致性。

無(wú)論如何,一篇文章講這么多已經(jīng)夠了。我把它放在我的博客上。我的想法是將來(lái)的文章只要引用它就行了。我們看效果吧。感謝閱讀!

查看參考原文:http://fgiesen.wordpress.com/2014/07/07/cache-coherency/


I’m planning to write a bit about data organization for multi-core scenarios. I started writing a first post but quickly realized that there’s a bunch of basics I need to cover first. In this post, I’ll try just that.

Caches

This is a whirlwind primer on CPU caches. I'm assuming you know the basic concept, but you might not be familiar with some of the details. (If you are, feel free to skip this section.)

In modern CPUs (almost) all memory accesses go through the cache hierarchy; there’s some exceptions for memory-mapped IO and write-combined memory that bypass at least parts of this process, but both of these are corner cases (in the sense that the vast majority of user-mode code will never see either), so I’ll ignore them in this post.

The CPU core's load/store (and instruction fetch) units normally can't even access memory directly – it's physically impossible; the necessary wires don't exist! Instead, they talk to their L1 caches which are supposed to handle it. And about 20 years ago, the L1 caches would indeed talk to memory directly. At this point, there's generally more cache levels involved; this means the L1 cache doesn't talk to memory directly anymore, it talks to an L2 cache – which in turn talks to memory. Or maybe to an L3 cache. You get the idea.

Caches are organized into "lines", corresponding to aligned blocks of either 32 (older ARMs, 90s/early 2000s x86s/PowerPCs), 64 (newer ARMs and x86s) or 128 (newer Power ISA machines) bytes of memory. Each cache line knows what physical memory address range it corresponds to, and in this article I'm not going to differentiate between the physical cache line and the memory it represents – this is sloppy, but conventional usage, so better get used to it. In particular, I'm going to say "cache line" to mean a suitably aligned group of bytes in memory, no matter whether these bytes are currently cached (i.e. present in any of the cache levels) or not.
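
To make "suitably aligned" concrete, here is a minimal sketch, assuming a 64-byte line size (typical for current x86 and ARM parts; the constant and helper name are just illustrative), of how an arbitrary address maps to the cache line that holds it:

```cpp
#include <cstdint>
#include <cstdio>

// Assumed line size; as noted above, real hardware uses 32, 64 or 128 bytes.
constexpr std::uintptr_t kLineSize = 64;

// Round an address down to the start of the cache line that contains it.
std::uintptr_t line_base(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) & ~(kLineSize - 1);
}

int main() {
    int buffer[32];  // 128 bytes, so it spans at least two 64-byte lines
    std::printf("buffer[0]  -> line %#llx\n", (unsigned long long)line_base(&buffer[0]));
    std::printf("buffer[20] -> line %#llx\n", (unsigned long long)line_base(&buffer[20]));
    // Two addresses share a cache line exactly when their line_base() values match.
    return 0;
}
```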

When the CPU core sees a memory load instruction, it passes the address to the L1 data cache (or “L1D$”, playing on the “cache” being pronounced the same way as “cash”). The L1D$ checks whether it contains the corresponding cache line. If not, the whole cache line is brought in from memory (or the next-deeper cache level, if present) – yes, the whole cache line; the assumption being that memory accesses are localized, so if we’re looking at some byte in memory we’re likely to access its neighbors soon. Once the cache line is present in the L1D$, the load instruction can go ahead and perform its memory read.
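
Because a miss always fetches the whole line, access patterns that respect this locality assumption behave very differently from ones that defeat it. The following sketch (array size and line size are arbitrary assumptions, and the effect varies by machine) reads the same 64 MiB twice: sequentially, so one line fill serves 64 consecutive byte reads, and then with a line-sized stride, so almost every access starts a new fill and typically runs far slower:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    constexpr std::size_t kLine  = 64;                 // assumed line size in bytes
    constexpr std::size_t kBytes = 64u * 1024 * 1024;  // 64 MiB, much larger than the caches
    std::vector<char> data(kBytes, 1);

    auto time_sum = [&](std::size_t stride) {
        long long sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t start = 0; start < stride; ++start)   // every byte is touched exactly once overall
            for (std::size_t i = start; i < kBytes; i += stride)
                sum += data[i];
        auto t1 = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("stride %3zu: sum=%lld, %lld ms\n", stride, sum, (long long)ms);
    };

    time_sum(1);      // sequential: each line fill is reused for 64 consecutive accesses
    time_sum(kLine);  // line-sized stride: essentially every access triggers a new line fill
    return 0;
}
```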

And as long as we’re dealing with read-only access, it’s all really simple, since all cache levels obey what I’ll call the

Basic invariant: the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses, at all times.

Things get a bit more complicated once we allow stores, i.e. memory writes. There's two basic approaches here: write-through and write-back. Write-through is the easier one: we just pass stores through to the next-level cache (or memory). If we have the corresponding line cached, we update our copy (or maybe even just discard it), but that's it. This preserves the same invariant as before: if a cache line is present in the cache, its contents match memory, always.

Write-back is a bit trickier. The cache doesn't pass writes on immediately. Instead, such modifications are applied locally to the cached data, and the corresponding cache lines are flagged "dirty". Dirty cache lines can trigger a write-back, at which point their contents are written back to memory or the next cache level. After a write-back, dirty cache lines are "clean" again. When a dirty cache line is evicted (usually to make space for something else in the cache), it always needs to perform a write-back first. The invariant for write-back caches is slightly different.
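
As a toy illustration of that dirty-bit bookkeeping (a sketch of the idea only, not of any real cache's implementation), a single write-back line can be modeled like this:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kLineSize = 64;  // assumed line size

struct WriteBackLine {
    std::uint64_t tag = 0;   // which aligned block of memory this line currently holds
    bool valid = false;
    bool dirty = false;
    std::array<std::uint8_t, kLineSize> data{};

    // A store only touches the local copy and marks the line dirty.
    void store(std::size_t offset, std::uint8_t value) {
        data[offset] = value;
        dirty = true;
    }

    // Eviction (or an explicit flush) must write a dirty line back first.
    void evict(std::uint8_t* memory_block) {
        if (valid && dirty)
            std::memcpy(memory_block, data.data(), kLineSize);  // the actual "write-back"
        valid = dirty = false;
    }
};
```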

Write-back invariant: after writing back all dirty cache lines, the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses.

In other words, in write-back caches we lose the “at all times” qualifier and replace it with a weaker condition: either the cache contents match memory (this is true for all clean cache lines), or they contain values that eventually need to get written back to memory (for dirty cache lines).

Write-through caches are simpler, but write-back has some advantages: it can filter repeated writes to the same location, and if most of the cache line changes on a write-back, it can issue one large memory transaction instead of several small ones, which is more efficient.

Some (mostly older) CPUs use write-through caches everywhere; some use write-back caches everywhere; some have a simpler write-through L1$ backed by a write-back L2$. This may generate redundant traffic between L1$ and L2$ but gets the write-back benefits for transfers to lower cache levels or memory. My point being that there’s a whole set of trade-offs here, and different designs use different solutions. Nor is there a requirement that cache line sizes be the same at all levels – it’s not unheard-of for CPUs to have 32-byte lines in L1$ but 128-byte lines in L2$ for example.

Omitted for simplicity in this section: cache associativity/sets; write-allocate or not (I described write-through without write-allocate and write-back with, which is the most common usage); unaligned accesses; virtually-addressed caches. These are all things you can look up if you’re interested, but I’m not going to go that deep here.

Coherency protocols

As long as that single CPU core is alone in the system, this all works just fine. Add more cores, each with their own caches, and we have a problem: what happens if some other core modifies data that’s in one of our caches?

Well, the answer is quite simple: nothing happens. And that's bad, because we want something to happen when someone else modifies memory that we have a cached copy of. Once we have multiple caches, we really need to keep them synchronized, or we don't really have a "shared memory" system, more like a "shared general idea of what's in memory" system.

Note that the problem really is that we have multiple caches, not that we have multiple cores. We could solve the entire problem by sharing all caches between all cores: there’s only one L1$, and all processors have to share it. Each cycle, the L1$ picks one lucky core that gets to do a memory operation this cycle, and runs it.

This works just fine. The only problem is that it's also slow, because cores now spend most of their time waiting in line for their next turn at a L1$ request (and processors do a lot of those, at least one for every load/store instruction). I'm pointing this out because it shows that the problem really isn't so much a multi-core problem as it is a multi-cache problem. We know that one set of caches works, but when that's too slow, the next best thing is to have multiple caches and then make them behave as if there was only one cache. This is what cache coherency protocols are for: as the name suggests, they ensure that the contents of multiple caches stay coherent.

There are multiple types of coherency protocols, but most computing devices you deal with daily fall into the category of “snooping” protocols, and that’s what I’ll cover here. (The primary alternative, directory-based systems, has higher latency but scales better to systems with lots of cores).

The basic idea behind snooping is that all memory transactions take place on a shared bus that’s visible to all cores: the caches themselves are independent, but memory itself is a shared resource, and memory access needs to be arbitrated: only one cache gets to read data from, or write back to, memory in any given cycle. Now the idea in a snooping protocol is that the caches don’t just interact with the bus when they want to do a memory transaction themselves; instead, each cache continuously snoops on bus traffic to keep track of what the other caches are doing. So if one cache wants to read from or write to memory on behalf of its core, all the other cores notice, and that allows them to keep their caches synchronized. As soon as one core writes to a memory location, the other cores know that their copies of the corresponding cache line are now stale and hence invalid.
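
The following is a deliberately simplified software analogy of that snooping behavior (a sketch, not how the hardware is actually built): every cache registers with a shared "bus" object, and whenever one cache writes a line, all the others are told to drop their copies of it:

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

struct SnoopyCache;

// Toy bus: every write is broadcast so that all other caches can snoop it.
struct Bus {
    std::vector<SnoopyCache*> caches;
    void broadcast_write(std::uint64_t addr, SnoopyCache* writer);
};

struct SnoopyCache {
    int id;
    std::unordered_map<std::uint64_t, int> lines;  // line address -> cached value

    void snoop_write(std::uint64_t addr) {         // called for every bus write we didn't issue
        if (lines.erase(addr))
            std::printf("cache %d: line %#llx invalidated by snoop\n", id, (unsigned long long)addr);
    }
    void write(Bus& bus, std::uint64_t addr, int value) {
        bus.broadcast_write(addr, this);           // everyone else finds out because they snoop
        lines[addr] = value;
    }
};

void Bus::broadcast_write(std::uint64_t addr, SnoopyCache* writer) {
    for (SnoopyCache* c : caches)
        if (c != writer) c->snoop_write(addr);
}

int main() {
    Bus bus;
    SnoopyCache c0{0}, c1{1};
    bus.caches = {&c0, &c1};
    c0.lines[0x1000] = 42;     // core 0 has the line cached
    c1.write(bus, 0x1000, 7);  // core 1 writes it -> core 0's stale copy is invalidated
    return 0;
}
```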

With write-through caches, this is fairly straightforward, since writes get "published" as soon as they happen. But if there are write-back caches in the mix, this doesn't work, since the physical write-back to memory can happen a long time after the core executed the corresponding store – and for the intervening time, the other cores and their caches are none the wiser, and might themselves try to write to the same location, causing a conflict. So with a write-back model, it's not enough to broadcast just the writes to memory when they happen; if we want to avoid conflicts, we need to tell other cores about our intention to write before we start changing anything in our local copy. Working out the details, the easiest solution that fits the bill and works for write-back caches is what's commonly called the MESI protocol.

MESI and friends

This section is called “MESI and friends” because MESI spawned a whole host of closely related coherency protocols. Let’s start with the original though: MESI are the initials for the four states a cache line can be in for any of the multiple cores in a multi-core system. I’m gonna cover them in reverse order, because that’s the better order to explain them in:

  • Invalid lines are cache lines that are either not present in the cache, or whose contents are known to be stale. For the purposes of caching, these are ignored. Once a cache line is invalidated, it's as if it wasn't in the cache in the first place.
  • Shared lines are clean copies of the contents of main memory. Cache lines in the shared state can be used to serve reads but they can't be written to. Multiple caches are allowed to have a copy of the same memory location in "shared" state at the same time, hence the name.
  • Exclusive lines are also clean copies of the contents of main memory, just like the S state. The difference is that when one core holds a line in E state, no other core may hold it at the same time, hence "exclusive". That is, the same line must be in the I state in the caches of all other cores.
  • Modified lines are dirty; they have been locally modified. If a line is in the M state, it must be in the I state for all other cores, same as E. In addition, modified cache lines need to be written back to memory when they get evicted or invalidated – same as the regular dirty state in a write-back cache.

If you compare this to the presentation of write-back caches in the single-core case above, you’ll see that the I, S and M states already had their equivalents: invalid/not present, clean, and dirty cache lines, respectively. So what’s new is the E state denoting exclusive access. This state solves the “we need to tell other cores before we start modifying memory” problem: each core may only write to cache lines if their caches hold them in the E or M states, i.e. they’re exclusively owned. If a core does not have exclusive access to a cache line when it wants to write, it first needs to send an “I want exclusive access” request to the bus. This tells all other cores to invalidate their copies of that cache line, if they have any. Only once that exclusive access is granted may the core start modifying data – and at that point, the core knows that the only copies of that cache line are in its own caches, so there can’t be any conflicts.

Conversely, once some other core wants to read from that cache line (which we learn immediately because we’re snooping the bus), exclusive and modified cache lines have to revert back to the “shared” (S) state. In the case of modified cache lines, this also involves writing their data back to memory first.
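
Putting the last few paragraphs together, here is a compact sketch of the per-line transitions for just the cases discussed so far (a local read or write, and snooping a remote read or a remote write-intent); the function names are invented for illustration, and a real cache controller handles many more situations:

```cpp
#include <cassert>

enum class Mesi { Modified, Exclusive, Shared, Invalid };

// Transition taken when the line's own core reads it.
// 'others_have_copy' stands in for what the bus snoop would tell us.
Mesi on_local_read(Mesi s, bool others_have_copy) {
    if (s == Mesi::Invalid)                           // miss: fetch the line
        return others_have_copy ? Mesi::Shared : Mesi::Exclusive;
    return s;                                         // M, E and S can all serve local reads
}

// Transition taken when the line's own core writes it. From S or I we must first
// put a "request for ownership" on the bus, which invalidates every other copy.
Mesi on_local_write(Mesi) {
    return Mesi::Modified;
}

// Transitions taken when we snoop another core's traffic for the same line.
Mesi on_remote_read(Mesi s) {
    // A Modified line is written back first so the reader sees the current data.
    return (s == Mesi::Invalid) ? Mesi::Invalid : Mesi::Shared;   // M and E drop to S
}

Mesi on_remote_write_intent(Mesi s) {
    // Again, a Modified line's contents are written back before we give it up.
    (void)s;
    return Mesi::Invalid;                             // our copy is stale either way
}

int main() {
    Mesi line = Mesi::Invalid;
    line = on_local_read(line, /*others_have_copy=*/false);  // I -> E
    assert(line == Mesi::Exclusive);
    line = on_local_write(line);                             // E -> M, no bus request needed
    assert(line == Mesi::Modified);
    line = on_remote_read(line);                             // M -> S (after a write-back)
    assert(line == Mesi::Shared);
    return 0;
}
```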

The MESI protocol is a proper state machine that responds both to requests coming from the local core, and to messages on the bus. I’m not going to go into detail about the full state diagram and what the different transition types are; you can find more in-depth information in books on hardware architecture if you care, but for our purposes this is overkill. As a software developer, you’ll get pretty far knowing only two things:

Firstly, in a multi-core system, getting read access to a cache line involves talking to the other cores, and might cause them to perform memory transactions. Writing to a cache line is a multi-step process: before you can write anything, you first need to acquire both exclusive ownership of the cache line and a copy of its existing contents (a so-called "Read For Ownership" request).
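
One practical consequence of this write path is false sharing: two cores writing unrelated variables still trade ownership of a line back and forth if those variables happen to sit on the same cache line. The sketch below assumes a 64-byte line and shows the usual mitigation, padding hot per-thread data onto separate lines; the exact effect depends on the hardware:

```cpp
#include <atomic>
#include <thread>

// Both counters on one line: every increment forces a Read-For-Ownership
// that steals the line from the other core, even though the data is unrelated.
struct SharedLine {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Padded so that each counter gets its own (assumed 64-byte) cache line.
struct PaddedLines {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <class Counters>
void hammer(Counters& c) {
    std::thread t1([&] { for (int i = 0; i < 10000000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < 10000000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
}

int main() {
    SharedLine shared;
    PaddedLines padded;
    hammer(shared);  // typically noticeably slower: constant cache-line ping-pong
    hammer(padded);  // same work, but each core keeps its own line in M state
    return 0;
}
```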

And secondly, while we have to do some extra gymnastics, the end result actually does provide some pretty strong guarantees. Namely, it obeys what I’ll call the

MESI invariant: after writing back all dirty (M-state) cache lines, the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses. In addition, at all times, when a memory location is exclusively cached (in E or M state) by one core, it is not present in any of the other cores' caches.

Note that this is really just the write-back invariant we already saw with the additional exclusivity rule thrown in. My point being that the presence of MESI or multiple cores does not necessarily weaken our memory model at all.

Okay, so that (very roughly) covers vanilla MESI (and hence also CPUs that use it, ARMs for example). Other processors use extended variants. Popular extensions include an “O” (Owned) state similar to “E” that allows sharing of dirty cache lines without having to write them back to memory first (“dirty sharing”), yielding MOESI, and MERSI/MESIF, which are different names for the same idea, namely making one core the designated responder for read requests to a given cache line. When multiple cores hold a cache line in Shared state, only the designated responder (which holds the cache line in “R” or “F” state) replies to read requests, rather than everyone who holds the cache line in S state. This reduces bus traffic. And of course you can add both the R/F states and the O state, or get even fancier. All these are optimizations, but none of them change the basic invariants provided or guarantees made by the protocol.

I'm no expert on the topic, and it's quite possible that there are other protocols in use that only provide substantially weaker guarantees, but if so I'm not aware of them, or any popular CPU core that uses them. So for our purposes, we really can assume that coherency protocols keep caches coherent, period. Not mostly-coherent, not "coherent except for a short window after a change" – properly coherent. At that level, barring hardware malfunction, there is always agreement on what the current state of memory should be. In technical terms, MESI and all its variants can, in principle anyway, provide full sequential consistency, the strongest memory ordering guarantee specified in the C++11 memory model. Which begs the question, why do we have weaker memory models, and "where do they happen"?

Memory models

Different architectures provide different memory models. As of this writing, ARM and POWER architecture machines have comparatively “weak” memory models: the CPU core has considerable leeway in reordering load and store operations in ways that might change the semantics of programs in a multi-core context, along with “memory barrier” instructions that can be used by the program to specify constraints: “do not reorder memory operations across this line”. By contrast, x86 comes with a quite strong memory model.
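
In portable code these architectural differences show up through the language's memory model rather than through raw barrier instructions. The sketch below uses standard C++11 atomics (none of this is specific to the article) to show the usual message-passing pattern: on a weakly ordered CPU, the plain store to payload may become visible after the flag unless the flag is written with release ordering and read with acquire ordering:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // ordinary store
    ready.store(true, std::memory_order_release);  // "everything above must be visible first"
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) // pairs with the release store
        ;                                          // spin
    assert(payload == 42);  // guaranteed with acquire/release; with relaxed ordering a
                            // weakly ordered CPU (ARM/POWER) could legally see payload == 0
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}
```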

I won’t go into the details of memory models here; it quickly gets really technical, and is outside the scope of this article. But I do want to talk a bit about “how they happen” – that is, where the weakened guarantees (compared to the full sequential consistency we can get from MESI etc.) come from, and why. And as usual, it all boils down to performance.

So here's the deal: you will indeed get full sequential consistency if a) the cache immediately responds to bus events on the very cycle it receives them, and b) the core dutifully sends each memory operation to the cache, in program order, and waits for it to complete before sending the next one. And of course, in practice modern CPUs normally do none of these things:

  • Caches do not respond to bus events immediately. If a bus message triggering a cache line invalidation arrives while the cache is busy doing other things (sending data to the core for example), it might not get processed that cycle. Instead, it will enter a so-called "invalidation queue", where it sits for a while until the cache has time to process it.
  • Cores do not, in general, send memory operations to the cache in strict program order; this is certainly the case for cores with Out-of-Order execution, but even otherwise in-order cores may have somewhat weaker ordering guarantees for memory operations (for example, to ensure that a single cache miss doesn’t immediately make the entire core grind to a halt).
  • In particular, stores are special, because they’re a two-phase operation: we first need to acquire exclusive ownership of a cache line before a store can go through. And if we don’t already have exclusive ownership, we need to talk to the other cores, which takes a while. Again, having the core idle and twiddling thumbs while this is happening is not a good use of execution resources. Instead, what happens is that stores start the process of getting exclusive ownership, then get entered into a queue of so-called “store buffers” (some refer to the entire queue as “store buffer”, but I’m going to use the term to refer to the entries). They stay around in this queue for a while until the cache is ready to actually perform the store operation, at which point the corresponding store buffer is “drained” and can be recycled to hold a new pending store.

The implication of all these things is that, by default, loads can fetch stale data (if a corresponding invalidation request was sitting in the invalidation queue), stores actually finish later than their position in the code would suggest, and everything gets even more vague when Out of Order execution is involved. So going back to memory models, there are essentially two camps:
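
The classic way to observe the store buffer from software is the store-buffering litmus test: each thread stores to one variable and then loads the other. With relaxed atomics both threads can read 0, because each store may still be sitting in its core's store buffer when the other core's load executes; making all four accesses sequentially consistent rules that outcome out. This is a sketch to illustrate the reordering, not a benchmark:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread1() {
    x.store(1, std::memory_order_relaxed);   // may linger in this core's store buffer...
    r1 = y.load(std::memory_order_relaxed);  // ...while this load already executes
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    for (int i = 0; i < 100000; ++i) {
        x = 0;
        y = 0;
        std::thread a(thread1), b(thread2);
        a.join();
        b.join();
        if (r1 == 0 && r2 == 0)  // possible with relaxed ordering, even on x86, due to store buffering
            std::printf("both loads saw 0 on iteration %d\n", i);
    }
    // Using memory_order_seq_cst for all four accesses forbids the r1 == 0 && r2 == 0 outcome.
    return 0;
}
```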

Architectures with a weak memory model do the minimum amount of work necessary in the core that allows software developers to write correct code. Instruction reordering and the various buffering stages are officially permitted; there are no guarantees. If you need guarantees, you need to insert the appropriate memory barriers – which will prevent reordering and drain queues of pending operations where required.
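
In C++ the barrier the previous paragraph describes can be written as std::atomic_thread_fence. Placed between the store and the load of the litmus test above, a full fence forces the store to become globally visible (that is, the store buffer to drain) before the load may execute. This is a sketch of just the two thread bodies, with the globals redeclared so the snippet stands alone:

```cpp
#include <atomic>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread1_fenced() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier: the store must be
    r1 = y.load(std::memory_order_relaxed);               // globally visible before the load
}

void thread2_fenced() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // typically an MFENCE on x86, a full
    r2 = x.load(std::memory_order_relaxed);               // barrier (dmb/sync) on ARM/POWER
}
// With both fences in place, the "r1 == 0 && r2 == 0" outcome is no longer allowed.
```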

Architectures with stronger memory models do a lot more bookkeeping on the inside. For example, x86 processors keep track of all pending memory operations that are not fully finished ("retired") yet, in a chip-internal data structure that's called the MOB ("memory ordering buffer"). As part of the Out of Order infrastructure, x86 cores can roll back non-retired operations if there's a problem – say an exception like a page fault, or a branch mispredict. I covered some of the details, as well as some of the interactions with the memory subsystem, in my earlier article "Speculatively speaking". The gist of it is that x86 processors actively watch out for external events (such as cache invalidations) that would retroactively invalidate the results of some of the operations that have already executed, but not been retired yet. That is, x86 processors know what their memory model is, and when an event happens that's inconsistent within that model, the machine state is rolled back to the last time when it was still consistent with the rules of the memory model. This is the "memory ordering machine clear" I covered in yet another earlier post. The end result is that x86 processors provide very strong guarantees for all memory operations – not quite sequential consistency, though.

So, weaker memory models make for simpler (and potentially lower-power) cores. Stronger memory models make the design of cores (and their memory subsystems) more complex, but are easier to write code for. In theory, the weaker models allow for more scheduling freedom and can be potentially faster; in practice, x86s seem to be doing fine on the performance of memory operations, for the time being at least. So it’s hard for me to call a definite winner so far. Certainly, as a software developer I’m happy to take the stronger x86 memory model when I can get it.

Anyway. That’s plenty for one post. And now that I have all this written up on my blog, the idea is that future posts can just reference it. We’ll see how that goes. Thanks for reading!

Reposted from: https://www.cnblogs.com/e-shannon/p/6677627.html
