
Introduction to Persistent Memory Programming

Published: 2024/2/28 · Programming Q&A · 豆豆



- Programming
  - libpmem
    - Persistence functions
  - libpmemobj
    - Root object
    - Example
    - Transaction support
    - Type safety
    - Thread safety
- Management tools
  - ipmctl
  - ndctl
    - create-namespace
    - Example
- Testing tools
  - fio
  - pmembench
  - ipmwatch
  - emon
  - pcm
- References

This article introduces the basics of persistent memory (PM) programming, the management tools, and the monitoring tools.

Programming

Persistent Memory Development Kit (PMDK) - pmem.io: PMDK

PMDK based Persistent Memory Programming

libpmem

Introduction to libpmem

libpmem is the low-level pmem library. It does not support transactions. It is used as follows:

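#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <libpmem.h>

/* using 4k of pmem for this example */
#define PMEM_LEN 4096
#define BUF_LEN 1024

int main(int argc, char *argv[])
{
    char *pmemaddr, *dst;
    char buf[BUF_LEN];
    const char msg[] = "hello, persistent memory";
    size_t mapped_len, left;
    ssize_t cc = 0;
    int srcfd, is_pmem;

    if (argc != 2 || (srcfd = open(argv[1], O_RDONLY)) < 0) {
        fprintf(stderr, "usage: %s src-file\n", argv[0]);
        exit(1);
    }

    /*
     * Steps 1-4 of the manual sequence (open the pm file, fix its size at
     * 4k with posix_fallocate, mmap it, close the fd) are all done by
     * pmem_map_file(). The pmem_map(fd) call used in the original listing
     * is not available in current libpmem. Plain mmap() also works, but
     * pmem_is_pmem() is only meaningful for pmem_map_file() mappings.
     */
    pmemaddr = pmem_map_file("/pmem-fs/myfile", PMEM_LEN, PMEM_FILE_CREATE,
            0666, &mapped_len, NULL);
    if (pmemaddr == NULL) {
        perror("pmem_map_file");
        exit(1);
    }

    /* determine if range is true pmem */
    is_pmem = pmem_is_pmem(pmemaddr, mapped_len);

    /*
     * Plain libc access works, but there is no way to tell when the data
     * reaches PM, and the CPU gives no ordering guarantee for cache line
     * write-back.
     */
    strcpy(pmemaddr, msg);

    /* The correct way to write to PM. */
    if (is_pmem) {
        /* copies, then persists before returning */
        pmem_memcpy_persist(pmemaddr, msg, sizeof(msg));
    } else {
        memcpy(pmemaddr, msg, sizeof(msg));
        pmem_msync(pmemaddr, sizeof(msg));
    }

    /* copy the file, saving the last flush step to the end */
    dst = pmemaddr;
    left = mapped_len;
    while (left > 0 &&
            (cc = read(srcfd, buf, left < BUF_LEN ? left : BUF_LEN)) > 0) {
        /* copy only, no drain yet */
        pmem_memcpy_nodrain(dst, buf, cc);
        dst += cc;
        left -= cc;
    }
    if (cc < 0) {
        perror("read");
        exit(1);
    }

    /* pairs with the nodrain copies above: make the data persistent */
    pmem_drain();

    /* persist whatever is still in the cache lines */
    if (is_pmem)
        /* user-space CLWB / CLFLUSHOPT, which makes flushing efficient */
        pmem_persist(pmemaddr, mapped_len);
    else
        /* in practice this is the msync() system call */
        pmem_msync(pmemaddr, mapped_len);

    close(srcfd);
    pmem_unmap(pmemaddr, mapped_len);
    return 0;
}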

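Note: mmap is normally used to map a regular file, and changes are made durable by flushing with the msync() system call, which is relatively slow on pmem. When the mapping really is pmem (this can be confirmed with pmem_is_pmem), use the pm persist function pmem_persist instead. The environment variable PMEM_IS_PMEM_FORCE=1 forces the mapping to be treated as pmem, so msync() is not used.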
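For comparison, the regular-file path that pmem_msync falls back to can be sketched with POSIX calls only. This is a minimal sketch, not libpmem code: store_with_msync is a hypothetical helper name, and the rounding step shows the kind of page alignment that msync() requires and that pmem_msync takes care of on the caller's behalf.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of libpmem): store len bytes at offset off
 * of the file at path through a shared mapping, then make them durable
 * with msync(). Returns 0 on success, -1 on error.
 */
int store_with_msync(const char *path, size_t file_len, size_t off,
		const void *src, size_t len)
{
	assert(off + len <= file_len);

	int fd = open(path, O_CREAT | O_RDWR, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)file_len) != 0) {
		close(fd);
		return -1;
	}

	char *base = mmap(NULL, file_len, PROT_READ | PROT_WRITE,
			MAP_SHARED, fd, 0);
	close(fd); /* the mapping stays valid after the fd is closed */
	if (base == MAP_FAILED)
		return -1;

	memcpy(base + off, src, len);

	/* msync() needs a page-aligned start: round down, widen the length */
	uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
	uintptr_t first = (uintptr_t)(base + off);
	uintptr_t start = first & ~(page - 1);
	int rc = msync((void *)start, (size_t)(first - start) + len, MS_SYNC);

	munmap(base, file_len);
	return rc;
}
```

Calling store_with_msync("/tmp/demo.bin", 8192, 5000, "hello", 6) writes at an offset that is not page aligned; the helper still hands msync() an aligned range. On real pmem, pmem_persist avoids this system call entirely.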

Persistence functions

These are all the persistence-related functions currently provided:

#include <libpmem.h>

/* Force the range to persistence; the effect matches msync(), but the work is done in user space. No alignment is required: an unaligned range is widened to the alignment boundary internally. */
void pmem_persist(const void *addr, size_t len);

/* Calls msync(); serves the same purpose as pmem_persist. Since it calls msync(), this function works on either persistent memory or a memory mapped file on traditional storage. pmem_msync() takes steps to ensure the alignment of addresses and lengths passed to msync() meet the requirements of that system call. */
int pmem_msync(const void *addr, size_t len);

/* the granularity is presumably one cache line */
void pmem_flush(const void *addr, size_t len);

/* EXPERIMENTAL: ignores the PMEM_NO_FLUSH variable and always flushes the CPU caches */
void pmem_deep_flush(const void *addr, size_t len);
int pmem_deep_drain(const void *addr, size_t len);   /* EXPERIMENTAL */
int pmem_deep_persist(const void *addr, size_t len); /* EXPERIMENTAL */
void pmem_drain(void);

/* EXPERIMENTAL: checks whether the CPU caches are flushed automatically on power failure */
int pmem_has_auto_flush(void);
int pmem_has_hw_drain(void);

Calling pmem_persist is equivalent to calling flush followed by drain:

void pmem_persist(const void *addr, size_t len)
{
    /* flush the processor caches */
    pmem_flush(addr, len);

    /* wait for any pmem stores to drain from HW buffers */
    pmem_drain();
}

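The discussion below assumes an x86-64 environment.

pmem_flush flushes the given range out of the CPU caches. It wraps the flush family of instructions (CLFLUSH, CLFLUSHOPT, CLWB); at load time libpmem queries the CPU and automatically selects the best instruction available.
- CLFLUSH tells the CPU to evict the given cache line and forces it to be written back to the medium. This solves the problem to some extent, but it is a synchronous instruction that stalls the pipeline and costs some speed, so Intel added two new asynchronous instructions, CLFLUSHOPT and CLWB. Both write the line back to the medium; the difference is that the former invalidates the cache line while the latter may keep it, which lets CLWB perform better in most scenarios.
- pmem_memmove(), pmem_memcpy() and pmem_memset() normally flush once the operation has been issued, unless PMEM_F_MEM_NOFLUSH is specified.
pmem_drain issues SFENCE and returns only after all earlier store instructions have completed, that is, after everything still in flight has been written down to PM.
- The price of asynchronous flushes is that the order in which cache lines are written back remains unpredictable. Some operations need an ordering guarantee, so SFENCE is used to provide it: SFENCE forces writes issued before it to complete before writes issued after it.
Because pmem_drain may block, a better approach is to flush the unrelated fields of a data structure one by one and call pmem_drain once at the end, which keeps the cost of blocking to a minimum.
Programs using pmem_flush() to flush ranges of memory should still follow up by calling pmem_drain() once to ensure the flushes are complete.
There is also the flag PMEM_F_MEM_NONTEMPORAL: data written with this flag bypasses the CPU cache and goes straight to PM.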
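The "flush each field, drain once" pattern can be sketched with compiler intrinsics. This is a rough illustration under stated assumptions, not libpmem's implementation: flush_range, drain, struct record and update_record are hypothetical names, the cache line size is assumed to be 64 bytes, and plain CLFLUSH is used only because it needs no extra compiler flags (libpmem itself would pick CLWB or CLFLUSHOPT when available). On non-x86-64 targets the flush is compiled out so the sketch still builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <immintrin.h>
#endif

#define CACHELINE 64u /* assumed cache line size */

/*
 * Hypothetical stand-in for pmem_flush(): flush every cache line that
 * overlaps [addr, addr+len). Returns the number of lines touched.
 */
static size_t flush_range(const void *addr, size_t len)
{
	size_t lines = 0;
	if (len == 0)
		return 0;
	/* round the start down to a cache line boundary */
	uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
	uintptr_t end = (uintptr_t)addr + len;
	for (; p < end; p += CACHELINE) {
#if defined(__x86_64__)
		_mm_clflush((const void *)p);
#endif
		lines++;
	}
	return lines;
}

/* Hypothetical stand-in for pmem_drain(): one store fence. */
static void drain(void)
{
#if defined(__x86_64__)
	_mm_sfence();
#else
	__sync_synchronize();
#endif
}

/* id and checksum are unrelated fields that sit in different cache lines */
struct record {
	uint64_t id;
	char pad[120];
	uint64_t checksum;
} __attribute__((aligned(64)));

/* Update both fields, flush each one, and pay for a single drain. */
size_t update_record(struct record *r, uint64_t id, uint64_t checksum)
{
	assert(r != NULL);
	r->id = id;
	size_t n = flush_range(&r->id, sizeof(r->id));
	r->checksum = checksum;
	n += flush_range(&r->checksum, sizeof(r->checksum));
	drain(); /* one fence covers both flushes */
	return n;
}
```

The return value is the number of cache lines flushed, which makes the rounding visible: an 8-byte range that straddles a line boundary costs two flushes.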

The main job of libpmem is to provide a way to flush dirty data to persistent memory. The commonly used functions are pmem_flush, pmem_drain, and pmem_memcpy_nodrain. Because the user does not control when, or in what order, CPU cache contents are written back to PM, specific instructions are needed to force the write-back. pmem_flush calls CLWB, CLFLUSHOPT, or CLFLUSH to force the contents of the CPU cache (in units of a cache line) to be flushed to PM. After these instructions are issued, the order in which cached data reaches PM is still not fixed, because the CPU is multi-core, so pmem_drain is also needed: it issues SFENCE to ensure that all CLWBs have completed. If the instruction used by pmem_flush is CLFLUSH, which already includes the effect of a fence, there is in theory no need to call pmem_drain; with that instruction pmem_drain in fact does nothing.

The above covers flushing the contents of the CPU cache to PM. Memory copy, meaning copying data from memory to PM, is done by pmem_memcpy_nodrain, which uses the MOVNT instructions (MOV or MOVNTDQ). A non-temporal copy does not go through the CPU cache, so this function needs no flush. An SFENCE is still required at the end to ensure that all data has reached PM.
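A non-temporal copy followed by a single fence can be sketched as below. The names copy_nt_nodrain and copy_nt_persist are hypothetical and only mimic the roles of pmem_memcpy_nodrain and pmem_drain; the sketch assumes GCC or Clang, an 8-byte aligned destination, and a length that is a multiple of 8. On targets other than x86-64 it falls back to ordinary stores so that it still compiles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__)
#include <immintrin.h>
#endif

/*
 * Hypothetical stand-in for pmem_memcpy_nodrain(): copy 8-byte words with
 * non-temporal stores (MOVNTI), which bypass the CPU cache, so no flush is
 * needed afterwards. No fence is issued here.
 */
void copy_nt_nodrain(void *dst, const void *src, size_t len)
{
	assert(((uintptr_t)dst & 7) == 0 && (len & 7) == 0);
	long long *d = dst;
	const char *s = src;
	for (size_t i = 0; i < len / 8; i++) {
		long long w;
		memcpy(&w, s + i * 8, sizeof(w)); /* safe unaligned load */
#if defined(__x86_64__)
		_mm_stream_si64(&d[i], w);
#else
		d[i] = w; /* fallback: ordinary store */
#endif
	}
}

/* Copy, then fence once so that all streamed stores have completed. */
void copy_nt_persist(void *dst, const void *src, size_t len)
{
	copy_nt_nodrain(dst, src, len);
#if defined(__x86_64__)
	_mm_sfence();
#else
	__sync_synchronize();
#endif
}
```

Several copy_nt_nodrain calls can be batched and followed by one fence, which is the same shape as the pmem_memcpy_nodrain loop followed by pmem_drain in the libpmem example.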

libpmemobj

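Introduction to libpmemobj
libpmemobj API documentation

libpmemobj is a higher-level layer on top of libpmem; every operation on pmem is expressed in terms of an object pool.

pmemobj_create creates an object pool

pmemobj_open opens an object pool that has already been created

pmemobj_close closes the pool

pmemobj_check verifies the pool metadata

A libpmemobj pointer is twice the size of a normal pointer: it records which object pool the pointer refers to and the offset within that pool.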

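typedef struct pmemoid {
    uint64_t pool_uuid_lo; /* identifies the pool; a cuckoo hash table (two hash functions) maps it to the pool's actual address */
    uint64_t off;          /* offset inside that pool */
} PMEMoid; /* we call this a persistent pointer */

To get from this pointer structure to an actual address, a conversion such as (void *)((uint64_t)pool + oid.off) is needed; this is what pmemobj_direct does.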
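The reason for storing an offset rather than a raw address can be shown with a small simulation in ordinary heap memory. This is only a sketch: toy_oid, toy_direct and oid_survives_remap are hypothetical names, not libpmemobj symbols, and "remapping" is imitated by copying the pool bytes to a second buffer at a different address.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of PMEMoid: a pool id plus an offset. */
struct toy_oid {
	uint64_t pool_id;
	uint64_t off;
};

/* Same arithmetic as the conversion in the text; offset 0 means null. */
static void *toy_direct(void *pool_base, struct toy_oid oid)
{
	if (oid.off == 0)
		return NULL;
	return (void *)((uintptr_t)pool_base + oid.off);
}

/*
 * Store msg at a fixed offset in pool a, "remap" the pool by copying its
 * bytes to a new buffer b, and resolve the same oid against b.
 * Returns 1 if the oid still finds the string, 0 otherwise.
 */
int oid_survives_remap(const char *msg)
{
	enum { POOL = 4096, OFF = 256 };
	assert(strlen(msg) < POOL - OFF);

	char *a = calloc(1, POOL);
	char *b = malloc(POOL);
	if (a == NULL || b == NULL) {
		free(a);
		free(b);
		return 0;
	}

	struct toy_oid oid = { .pool_id = 1, .off = OFF };
	strcpy(toy_direct(a, oid), msg);

	memcpy(b, a, POOL); /* same bytes, different virtual address */
	memset(a, 0, POOL); /* the old mapping is gone */

	int ok = strcmp(toy_direct(b, oid), msg) == 0;
	free(a);
	free(b);
	return ok;
}
```

A raw pointer into the first buffer would dangle after the copy, while the (pool, offset) pair resolves correctly against whatever base address the pool has in the current run.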

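Root object

According to the official documentation, the root object is the entry point for reaching objects in persistent memory; it acts as an anchor. It is used as follows:

pmemobj_root(PMEMobjpool* pop, size_t size): the untyped raw API. It creates or resizes the root object. According to the official documentation, the first time this function is called, if size is greater than 0 and no root object exists, space is allocated and a root object is created. When size is larger than the current root object, the root object is reallocated and resized.
POBJ_ROOT(PMEMobjpool* pop, TYPE): a macro. TYPE is the type of the root object, and the result is a typed persistent pointer, TOID(TYPE), rather than a raw PMEMoid.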

Example

#include <stdio.h>
#include <string.h>
#include <libpmemobj.h>

// layout
#define LAYOUT_NAME "intro_0" /* will use this in create and open */
#define MAX_BUF_LEN 10 /* maximum length of our buffer */

struct my_root {
    size_t len; /* = strlen(buf) */
    char buf[MAX_BUF_LEN];
};

int main(int argc, char *argv[])
{
    if (argc != 2) {
        printf("usage: %s file-name\n", argv[0]);
        return 1;
    }

    // create the pool
    PMEMobjpool *pop = pmemobj_create(argv[1], LAYOUT_NAME, PMEMOBJ_MIN_POOL, 0666);
    if (pop == NULL) {
        perror("pmemobj_create");
        return 1;
    }

    // create the pm root object (already zeroed) and turn it into a direct pointer with pmemobj_direct
    PMEMoid root = pmemobj_root(pop, sizeof (struct my_root));
    struct my_root *rootp = pmemobj_direct(root);

    // read the string to store (the original listing used buf uninitialized)
    char buf[MAX_BUF_LEN] = {0};
    if (scanf("%9s", buf) != 1) {
        pmemobj_close(pop);
        return 1;
    }

    // first assign the field of the pm object
    rootp->len = strlen(buf);
    // then persist it; remember that an 8-byte store is atomic
    pmemobj_persist(pop, &rootp->len, sizeof (rootp->len));
    // write the data and persist it in the same call
    pmemobj_memcpy_persist(pop, rootp->buf, buf, rootp->len);

    // once persisted, it can be read and written like normal memory
    if (rootp->len == strlen(rootp->buf))
        printf("%s\n", rootp->buf);

    pmemobj_close(pop);
    return 0;
}

Transaction support

/* TX_STAGE_NONE */
TX_BEGIN(pop) {
    /* TX_STAGE_WORK */
} TX_ONCOMMIT {
    /* TX_STAGE_ONCOMMIT */
} TX_ONABORT {
    /* TX_STAGE_ONABORT */
} TX_FINALLY {
    /* TX_STAGE_FINALLY */
} TX_END
/* TX_STAGE_NONE */

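The whole flow of a transaction is defined by these macros and their code blocks, which split the transaction into several stages. The three middle stages (TX_ONCOMMIT, TX_ONABORT, TX_FINALLY) are optional. The most basic transaction is just TX_BEGIN ... TX_END, which is also the most commonly used form; the other parts are used more often in nested transactions.

Besides the basic transaction blocks, libpmemobj also provides transactional API functions.

One group is the transactional data-write API: pmemobj_tx_add_range and pmemobj_tx_add_range_direct. pmemobj_tx_add_range takes three main parameters: the object (for example the root object), an offset, and a size. It declares that the memory range [offset, offset+size) of that object is about to be modified. PMDK automatically allocates a new entry in the undo log and records the current contents of the range there, so the range can then be modified freely while consistency is preserved. The function with the direct suffix is used the same way; the difference is that it operates directly on a range of virtual address space.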
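The snapshot-then-modify idea behind add_range can be illustrated with a toy undo log kept in ordinary memory. This is a sketch of the general technique only: toy_tx and its functions are hypothetical names, the log lives in DRAM with fixed-size entries, and nothing here reflects how PMDK actually lays out or persists its undo log.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define UNDO_MAX 8  /* max ranges per toy transaction (assumption) */
#define SNAP_MAX 64 /* max bytes per snapshot (assumption) */

struct undo_entry {
	void *addr;                  /* where the range lives */
	size_t len;                  /* how many bytes were saved */
	unsigned char old[SNAP_MAX]; /* contents before modification */
};

struct toy_tx {
	struct undo_entry log[UNDO_MAX];
	int n;
};

void toy_tx_begin(struct toy_tx *tx)
{
	assert(tx != NULL);
	tx->n = 0;
}

/* Snapshot [addr, addr+len) before the caller modifies it. */
int toy_tx_add_range_direct(struct toy_tx *tx, void *addr, size_t len)
{
	assert(tx != NULL && addr != NULL);
	if (tx->n == UNDO_MAX || len > SNAP_MAX)
		return -1;
	struct undo_entry *e = &tx->log[tx->n++];
	e->addr = addr;
	e->len = len;
	memcpy(e->old, addr, len);
	return 0;
}

/* Commit: the new contents stay, the snapshots are discarded. */
void toy_tx_commit(struct toy_tx *tx)
{
	tx->n = 0;
}

/* Abort: restore the snapshots in reverse order. */
void toy_tx_abort(struct toy_tx *tx)
{
	while (tx->n > 0) {
		struct undo_entry *e = &tx->log[--tx->n];
		memcpy(e->addr, e->old, e->len);
	}
}
```

After toy_tx_add_range_direct has recorded a range, the caller can overwrite it in any order; toy_tx_abort puts the old bytes back, while toy_tx_commit simply drops the log. In the real library the log itself is stored in persistent memory so that an interrupted transaction can be rolled back on the next pool open.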

Type safety

An introduction to pmemobj (part 3) - types
Type safety macros in libpmemobj

libpmemobj uses a set of macros to tie a persistent pointer to a specific type.

Feature	Anonymous unions	Named unions
Declaration	+	-
Assignment	-	+
Function parameter	-	+
Type numbers	-	+
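The union trick behind these macros can be sketched in a few lines. This is a simplified illustration of the anonymous-union variant, not the libpmemobj definitions: TOY_TOID, TOY_DIRECT, toy_oid and rect_area_at are hypothetical names, and the sketch relies on the GCC/Clang __typeof__ extension.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical untyped persistent pointer. */
struct toy_oid {
	uint64_t pool_id;
	uint64_t off;
};

/*
 * Typed wrapper: the union adds a pointer member that is never stored to.
 * It exists only so the compiler can recover the object type from it.
 */
#define TOY_TOID(t) union { struct toy_oid oid; t *_type; }

/* Resolve to a direct pointer of the declared type. */
#define TOY_DIRECT(base, o) \
	((__typeof__((o)._type))((char *)(base) + (o).oid.off))

struct rect {
	int x;
	int y;
};

/* Compute the area of the struct rect stored at pool_base + off. */
int rect_area_at(void *pool_base, uint64_t off)
{
	TOY_TOID(struct rect) r = { .oid = { .pool_id = 1, .off = off } };

	/* the type-carrying member does not make the pointer any larger */
	assert(sizeof(r) == sizeof(struct toy_oid));

	/* the result is a struct rect *, so wrong field names fail to compile */
	return TOY_DIRECT(pool_base, r)->x * TOY_DIRECT(pool_base, r)->y;
}
```

Because each use of TOY_TOID(struct rect) declares a fresh anonymous union type, two such variables cannot be assigned to each other or passed as function parameters, which is the limitation the table records for anonymous unions; naming the union type removes it.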

The example pmdk/src/examples/libpmemobj/string_store_tx_type/writer.c is as follows:

#include <stdio.h>
#include <string.h>
#include <libpmemobj.h>

#include "layout.h"

int main(int argc, char *argv[])
{
    if (argc != 2) {
        printf("usage: %s file-name\n", argv[0]);
        return 1;
    }

    PMEMobjpool *pop = pmemobj_create(argv[1],
            POBJ_LAYOUT_NAME(string_store), PMEMOBJ_MIN_POOL, 0666);
    if (pop == NULL) {
        perror("pmemobj_create");
        return 1;
    }

    char buf[MAX_BUF_LEN] = {0};
    int num = scanf("%9s", buf);
    if (num == EOF) {
        fprintf(stderr, "EOF\n");
        return 1;
    }

    TOID(struct my_root) root = POBJ_ROOT(pop, struct my_root);

    // D_RW: access for writing
    TX_BEGIN(pop) {
        TX_MEMCPY(D_RW(root)->buf, buf, strlen(buf));
    } TX_END

    // D_RO: access for reading
    printf("%s\n", D_RO(root)->buf);

    pmemobj_close(pop);
    return 0;
}

TOID_VALID checks whether the declared type matches the object:

if (TOID_VALID(D_RO(root)->data)) {
    /* can use the data ptr safely */
} else {
    /* declared type doesn't match the object */
}

Inside a transaction, TX_NEW can be used to create a new object:

TOID(struct my_root) root = POBJ_ROOT(pop, struct my_root);
TX_BEGIN(pop) {
    TX_ADD(root); /* we are going to operate on the root object */
    TOID(struct rectangle) rect = TX_NEW(struct rectangle);
    D_RW(rect)->x = 5;
    D_RW(rect)->y = 10;
    D_RW(root)->rect = rect;
} TX_END

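Thread safety

All libpmemobj functions are thread-safe, except the functions that manage the object pool, such as open, close, and pmemobj_root. Among the macros, only the FOREACH ones are not thread-safe.

A lock analogous to pthread_mutex_t can be placed in pm; it is called a pmem-aware lock. Here is a simple example:

struct foo {
    PMEMmutex lock;
    int bar;
};

/* pop is passed in explicitly; the original snippet used it without declaring it */
int fetch_and_add(PMEMobjpool *pop, TOID(struct foo) foo, int val)
{
    pmemobj_mutex_lock(pop, &D_RW(foo)->lock);
    int ret = D_RO(foo)->bar;
    D_RW(foo)->bar += val;
    pmemobj_mutex_unlock(pop, &D_RW(foo)->lock);
    return ret;
}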

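Management tools

ipmctl

The management tool for PM.

ipmctl create -goal PersistentMemoryType=AppDirect: create an AppDirect goal

ipmctl show -firmware: show the DIMM firmware version

ipmctl show -dimm: list the DIMMs

ipmctl show -sensor: show more detailed information, similar to SMART

ipmctl show -topology: locate where each device is installed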

ndctl

Manages the system devices handled by "libnvdimm" (Non-volatile Memory). A commonly used command:

ndctl list -u

create-namespace

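A PM namespace is managed in one of four modes: fsdax, devdax, sector, and raw.

fsdax: the default mode. Creating the namespace produces a block device /dev/pmemX[.Y], on which an xfs or ext4 file system can be created. **DAX(direct access) removes the page cache from the I/O path and allows mmap to establish direct mappings to persistent memory media.** The benefit of this mode is that multiple processes can share the same PM.
devdax: creating the namespace produces a character device /dev/daxX.Y; no block device is exposed. The device can still be mapped with mmap. (Only open(), close(), and mmap() can be used.)

A typical create-namespace command looks like this:

ndctl create-namespace --type=pmem --mode=fsdax --region=X [--align=4k]
# --region selects a specific pmem device; if omitted the default is all, meaning every device
# --align sets the internal alignment page size; the default is 2M, so each page fault maps a 2M page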

Example

Initialize pmem with FSDAX:

ndctl create-namespace
mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0
mkdir /pmem0
mount -o dax /dev/pmem0 /pmem0
xfs_io -c "extsize 2m" /pmem0

Testing tools

fio

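First choose an ioengine. The options are:

libpmem: for pmem namespaces configured in fsdax mode, which is also the most commonly used mode. The original post links to a small example here.

dev-dax: for pmem devices in devdax mode

pmemblk: reads and writes pm through the libpmemblk library

mmap: not PM-specific; runs I/O with POSIX system calls (mmap, fdatasync, ...)

By default, a read copies data from PM into memory.
By default, a write copies data from memory into PM. --sync=sync, --sync=dsync, or --sync=1 means a drain after every write. The default, or --sync=0, means pmem_drain() is called only when needed (pmem_memcpy is then called with the PMEM_F_MEM_NODRAIN flag). --direct=1 adds the PMEM_F_MEM_NONTEMPORAL flag.
- The fio options fsync=int or fdatasync=int make fio issue a sync, that is pmem_drain(), after every int write commands.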

pmembench

ipmwatch

Shows throughput, including the reads and writes that actually happen inside the PM. It has been included in Intel VTune Amplifier 2019 since Update 5, and an installation of vtune_profiler_2020 contains it. Some of the metric names are listed below:

bytes_read (derived)
bytes_written (derived)
read_hit_ratio (derived)
write_hit_ratio (derived)
wdb_merge_percent (derived)
media_read_ops (derived)
media_write_ops (derived)
read_64B_ops_received
write_64B_ops_received
ddrt_read_ops
ddrt_write_ops

emon

Shows where time is spent.

pcm

Intel's pcm tool set contains a series of tools that report performance metrics for the CPU and its memory accesses. For example, pcm-memory.x shows the current PM performance metrics:

|---------------------------------------|
|--             Socket 0              --|
|---------------------------------------|
|--     Memory Channel Monitoring     --|
|---------------------------------------|
|-- Mem Ch 0: Reads (MB/s):    227.67 --|
|--           Writes(MB/s):     43.34 --|
|--       PMM Reads(MB/s) :      0.00 --|
|--      PMM Writes(MB/s) :      0.00 --|
|-- Mem Ch 1: Reads (MB/s):      0.00 --|
|--           Writes(MB/s):      0.00 --|
|--       PMM Reads(MB/s) :    355.99 --|
|--      PMM Writes(MB/s) :    355.99 --|
|-- Mem Ch 2: Reads (MB/s):    209.37 --|
|--           Writes(MB/s):     42.72 --|
|--       PMM Reads(MB/s) :      0.00 --|
|--      PMM Writes(MB/s) :      0.00 --|
|-- Mem Ch 3: Reads (MB/s):    211.65 --|
|--           Writes(MB/s):     42.81 --|
|--       PMM Reads(MB/s) :      0.00 --|
|--      PMM Writes(MB/s) :      0.00 --|
|-- Mem Ch 4: Reads (MB/s):      0.00 --|
|--           Writes(MB/s):      0.00 --|
|--       PMM Reads(MB/s) :    356.08 --|
|--      PMM Writes(MB/s) :    356.08 --|
|-- Mem Ch 5: Reads (MB/s):    205.36 --|
|--           Writes(MB/s):     42.57 --|
|--       PMM Reads(MB/s) :      0.00 --|
|--      PMM Writes(MB/s) :      0.00 --|
|-- NODE 0 Mem Read (MB/s) :   854.05 --|
|-- NODE 0 Mem Write(MB/s) :   171.44 --|
|-- NODE 0 PMM Read (MB/s):    712.08 --|
|-- NODE 0 PMM Write(MB/s):    712.08 --|
|-- NODE 0.0 NM read hit rate :  1.00 --|
|-- NODE 0.1 NM read hit rate :  1.00 --|
|-- NODE 0.2 NM read hit rate :  0.00 --|
|-- NODE 0.3 NM read hit rate :  0.00 --|
|-- NODE 0 Memory (MB/s):     2449.64 --|
|---------------------------------------|
|---------------------------------------||---------------------------------------|
|--             System DRAM Read Throughput(MB/s):        854.05               --|
|--             System DRAM Write Throughput(MB/s):       171.44               --|
|--             System PMM Read Throughput(MB/s):         712.08               --|
|--             System PMM Write Throughput(MB/s):        712.08               --|
|--             System Read Throughput(MB/s):            1566.12               --|
|--             System Write Throughput(MB/s):            883.52               --|
|--             System Memory Throughput(MB/s):          2449.64               --|
|---------------------------------------||---------------------------------------|

References

Direct Write to PMem how to disable DDIO

Correct, Fast Remote Persistence (DDIO is enabled at the CPU level)

基于RDMA和NVM的大數據系統一致性協議研究 (Research on consistency protocols for big-data systems based on RDMA and NVM)

pmem/valgrind

PMDK based Persistent Memory Programming

Running FIO with pmem engines

Documentation for ndctl and daxctl

AEPWatch

CHAPTER 5. USING NVDIMM PERSISTENT MEMORY STORAGE

I/O Alignment Considerations (contains some commonly used commands)

Persistent memory programming: the remote access perspective

pmem_flush

Create Memory Allocation Goal - IPMCTL User Guide

Disk I/O performance metrics and how to benchmark NVMe SSD, Optane SSD, and pmem with fio

2MB FSDAX: a PM FSDAX namespace that uses a 2M page size
