Metadata Synchronization Between Alluxio and the Underlying Storage System

Published: 2025/3/12

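About the author: Yiqun Lin (林意群) is an Apache Hadoop PMC member and an Apache Ozone PMC member with many years of experience in open source communities. He focuses mainly on the storage field and currently works as a big data engineer on the eBay Hadoop team.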

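Preface

As a middle layer built on top of underlying storage systems, Alluxio inevitably has to keep its metadata in sync with the underlying system. An external client sends a request to Alluxio, Alluxio queries the underlying system (referred to below as the Underlying FileSystem, or UFS) for the real metadata, and then returns it to the client. To reduce the load on the UFS, Alluxio does not query the UFS on every request. This article looks at how Alluxio designs and implements this metadata synchronization so that metadata requests are handled as efficiently as possible while the returned data stays as accurate as possible.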

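Metadata synchronization inside Alluxio

First, we need to think through a basic question: as a cache layer built on top of an underlying storage system, in which situations does Alluxio need to synchronize metadata?

Classified by the source and the target of the synchronization, there are two cases:

1) Metadata is modified in Alluxio first and in the UFS afterwards. This is metadata synchronization from Alluxio to the UFS.
2) Metadata is modified in the UFS first, and Alluxio picks up the change afterwards. This is metadata synchronization from the UFS to Alluxio.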

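Of the two cases, 1) is easier to control than 2). Alluxio is the entry point for external requests, so it knows about a change the moment it happens and can decide for itself how to propagate the metadata to the UFS. Once Alluxio has updated its own metadata, the metadata is already up to date from the outside world's point of view. Alluxio can then choose a flexible strategy for updating the lagging metadata in the UFS, for example an asynchronous update or a forced synchronous update. In short, in case 1) Alluxio has full control over when and how the metadata is synchronized.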
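The two propagation strategies mentioned above can be illustrated with a small sketch. This is not Alluxio code: the class `WritePropagationSketch`, its `Mode` enum, and the in-memory maps standing in for Alluxio's and the UFS's metadata are all hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch: the cache layer updates itself first, then the UFS. */
class WritePropagationSketch {
  enum Mode { SYNC, ASYNC }

  // In-memory stand-ins for the two metadata stores (assumptions for illustration).
  private final Map<String, String> alluxioMeta = new HashMap<>();
  private final Map<String, String> ufsMeta = new HashMap<>();
  // Paths whose UFS metadata still lags behind.
  private final Queue<String> pending = new ArrayDeque<>();

  void update(String path, String meta, Mode mode) {
    // Clients see the new metadata immediately, whichever mode is used.
    alluxioMeta.put(path, meta);
    if (mode == Mode.SYNC) {
      ufsMeta.put(path, meta); // forced synchronous update
    } else {
      pending.add(path);       // deferred: the UFS is updated later
    }
  }

  /** Applies all deferred updates; a real system would do this in the background. */
  void flushPending() {
    String path;
    while ((path = pending.poll()) != null) {
      ufsMeta.put(path, alluxioMeta.get(path));
    }
  }

  String readFromAlluxio(String path) {
    return alluxioMeta.get(path);
  }

  String readFromUfs(String path) {
    return ufsMeta.get(path);
  }

  public static void main(String[] args) {
    WritePropagationSketch s = new WritePropagationSketch();
    s.update("/f", "owner=alice", Mode.ASYNC);
    System.out.println(s.readFromAlluxio("/f") + " / " + s.readFromUfs("/f"));
    s.flushPending();
    System.out.println(s.readFromAlluxio("/f") + " / " + s.readFromUfs("/f"));
  }
}
```

In both modes the caller reads fresh metadata from the cache layer right away; the only difference is how long the UFS copy is allowed to lag.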

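By comparison, the second case is the harder one: the metadata in the underlying system changes (an external program accesses the UFS directly and modifies it), there is no channel for notifying Alluxio, and yet Alluxio is the service that external requests go through.

Case 2) is shown in the right half of the figure below, and case 1) in the left half:

The right half of the figure above shows the underlying storage system, HDFS, receiving out-of-band updates: Alluxio needs to pick up the updates that Hive makes to HDFS directly.

Let's look at how Alluxio handles this tricky situation.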

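A time-based, path-granularity UFS Status Cache

Since the UFS metadata can be updated out of band, an obvious but extreme way to keep the data Alluxio serves accurate is to synchronize the metadata from HDFS in near real time.

Near-real-time synchronization of UFS metadata raises two core questions:

How often to synchronize, that is, what time interval to use. Too short an interval produces a large number of RPC queries against the UFS; too long an interval leaves the data stale.
How much metadata to synchronize: a directory? A single file?

To address these two questions, Alluxio implements a time-based, path-granularity UFS Status Cache. Its architecture is shown below: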

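Some readers may find the figure above puzzling: Alluxio is itself a cache layer, so why does it maintain yet another cache internally? Note that what is cached here is different: the cache in the figure holds the metadata queried from the UFS.

The steps in the figure are as follows:

(1) The client issues a request for file information.

(2) Alluxio receives the request and checks whether its internal UFS Status Cache holds an unexpired UFS status for the path (one that is still within the cache refresh interval). If so, it returns that status to the client.

(3) If not, Alluxio sends a request to the UFS to query the latest file information, adds the result to the UFS Status Cache, and updates the sync time of the status for this path.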
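Steps (1) to (3) can be sketched as a small time-based cache. This is a simplified illustration, not Alluxio's implementation: the class `TimedStatusCacheSketch`, the `ufsLookup` function standing in for an RPC to the UFS, and the injectable `clock` are all assumptions made for the example, and a status is reduced to a plain `String`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a time-based, per-path status cache. */
class TimedStatusCacheSketch {
  /** A cached status plus the time it was last fetched from the UFS. */
  private static final class Entry {
    final String status;
    final long syncTimeMs;

    Entry(String status, long syncTimeMs) {
      this.status = status;
      this.syncTimeMs = syncTimeMs;
    }
  }

  private final Map<String, Entry> cache = new ConcurrentHashMap<>();
  private final Function<String, String> ufsLookup; // stands in for a UFS query
  private final LongSupplier clock;                 // injectable so it can be faked
  private final long intervalMs;

  TimedStatusCacheSketch(Function<String, String> ufsLookup, LongSupplier clock,
      long intervalMs) {
    this.ufsLookup = ufsLookup;
    this.clock = clock;
    this.intervalMs = intervalMs;
  }

  String getStatus(String path) {
    long now = clock.getAsLong();
    Entry cached = cache.get(path);
    // Step (2): serve from the cache while the entry is within the interval.
    if (cached != null && now - cached.syncTimeMs < intervalMs) {
      return cached.status;
    }
    // Step (3): query the UFS, store the result, and record the new sync time.
    String fresh = ufsLookup.apply(path);
    cache.put(path, new Entry(fresh, now));
    return fresh;
  }

  public static void main(String[] args) {
    long[] fakeNow = {0};
    TimedStatusCacheSketch cache = new TimedStatusCacheSketch(
        path -> "status of " + path + " fetched at t=" + fakeNow[0],
        () -> fakeNow[0], 1000);
    System.out.println(cache.getStatus("/data/a")); // fetched from the UFS stand-in
    fakeNow[0] = 500;
    System.out.println(cache.getStatus("/data/a")); // still within the interval
    fakeNow[0] = 1500;
    System.out.println(cache.getStatus("/data/a")); // interval elapsed, fetched again
  }
}
```

The interval is the knob behind the first question raised earlier: a smaller value means more UFS queries, a larger value means staler results.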

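The Alluxio internal components in the figure above are:

UfsSyncPathCache: records the paths whose status has been cached, together with the time of the most recent metadata sync for each path.
UfsStatusCache: caches the metadata (UFS status) of each path, as well as the mapping from a path to the statuses of its child files. The child status cache exists to speed up directory-level list queries.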

The definitions of these two classes are as follows:

/**
 * This cache maintains the Alluxio paths which have been synced with UFS.
 */
@ThreadSafe
public final class UfsSyncPathCache {
  private static final Logger LOG = LoggerFactory.getLogger(UfsSyncPathCache.class);

  /** Number of paths to cache. */
  private static final int MAX_PATHS =
      ServerConfiguration.getInt(PropertyKey.MASTER_UFS_PATH_CACHE_CAPACITY);

  /** Cache of paths which have been synced. */
  private final Cache<String, SyncTime> mCache;
  ...
}

/**
 * This class is a cache from an Alluxio namespace URI ({@link AlluxioURI}, i.e. /path/to/inode) to
 * UFS statuses.
 *
 * It also allows associating a path with child inodes, so that the statuses for a specific path can
 * be searched for later.
 */
@ThreadSafe
public class UfsStatusCache {
  private static final Logger LOG = LoggerFactory.getLogger(UfsStatusCache.class);

  private final ConcurrentHashMap<AlluxioURI, UfsStatus> mStatuses;
  private final ConcurrentHashMap<AlluxioURI, Future<Collection<UfsStatus>>> mActivePrefetchJobs;
  // UFS status cache of the children list for each path
  private final ConcurrentHashMap<AlluxioURI, Collection<UfsStatus>> mChildren;
  private final ExecutorService mPrefetchExecutor;
  ...
}

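Listing a large directory is expensive for a storage system, so caching the children file list as shown above can improve response times to some extent.

Here we focus on how Alluxio implements its time-based metadata cache. The relevant logic is as follows:

UfsSyncPathCache.java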

  /**
   * The logic of shouldSyncPath need to consider the difference between file and directory,
   * with the variable isGetFileInfo we just process getFileInfo specially.
   *
   * There are three cases needed to address:
   * 1. the ancestor directories
   * 2. the direct parent directory
   * 3. the difference with file and directory
   *
   * @param path the path to check
   * @param intervalMs the sync interval, in ms
   * @param isGetFileInfo the operate is from getFileInfo or not
   * @return true if a sync should occur for the path and interval setting, false otherwise
   */
  public boolean shouldSyncPath(String path, long intervalMs, boolean isGetFileInfo) {
    if (intervalMs < 0) {
      // Never sync.
      return false;
    }
    if (intervalMs == 0) {
      // Always sync.
      return true;
    }
    // 1) Look up the most recent sync time of the given path in the cache.
    SyncTime lastSync = mCache.getIfPresent(path);
    // 2) Check whether the time since that sync exceeds the expiry interval.
    if (!shouldSyncInternal(lastSync, intervalMs, false)) {
      // Sync is not necessary for this path.
      return false;
    }
    int parentLevel = 0;
    String currPath = path;
    while (!currPath.equals(AlluxioURI.SEPARATOR)) {
      try {
        // 3) The path itself has expired, so walk up the parent directories and
        //    check whether each of them is also due for a sync.
        currPath = PathUtils.getParent(currPath);
        parentLevel++;
        lastSync = mCache.getIfPresent(currPath);
        if (!shouldSyncInternal(lastSync, intervalMs, parentLevel > 1 || !isGetFileInfo)) {
          // Sync is not necessary because an ancestor was already recursively synced
          return false;
        }
      } catch (InvalidPathException e) {
        // this is not expected, but the sync should be triggered just in case.
        LOG.debug("Failed to get parent of ({}), for checking sync for ({})", currPath, path);
        return true;
      }
    }
    // trigger a sync, because no sync on the path (or an ancestor) was performed recently
    return true;
  }
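To make the ancestor walk easier to follow, here is a self-contained sketch of the same decision. It is a simplification written for this article, not the Alluxio class: `SyncDecisionSketch`, `recordSync`, `isStale`, and `parentOf` are hypothetical names, the current time is passed in explicitly, and paths are assumed to be normalized absolute paths without a trailing slash.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the "should this path be synced?" decision. */
class SyncDecisionSketch {
  static final long UNSYNCED = -1;

  /** Last sync times for one path: any sync, and the last recursive sync. */
  static final class SyncTime {
    final long lastSyncMs;
    final long lastRecursiveSyncMs;

    SyncTime(long lastSyncMs, long lastRecursiveSyncMs) {
      this.lastSyncMs = lastSyncMs;
      this.lastRecursiveSyncMs = lastRecursiveSyncMs;
    }
  }

  private final Map<String, SyncTime> syncTimes = new HashMap<>();

  void recordSync(String path, long nowMs, boolean recursive) {
    SyncTime old = syncTimes.get(path);
    long lastRecursive =
        recursive ? nowMs : (old == null ? UNSYNCED : old.lastRecursiveSyncMs);
    syncTimes.put(path, new SyncTime(nowMs, lastRecursive));
  }

  private static boolean isStale(SyncTime t, long intervalMs, boolean needRecursive,
      long nowMs) {
    if (t == null) {
      return true; // never synced
    }
    long last = needRecursive ? t.lastRecursiveSyncMs : t.lastSyncMs;
    return last == UNSYNCED || nowMs - last >= intervalMs;
  }

  boolean shouldSync(String path, long intervalMs, boolean isGetFileInfo, long nowMs) {
    if (intervalMs < 0) {
      return false; // never sync
    }
    if (intervalMs == 0) {
      return true; // always sync
    }
    if (!isStale(syncTimes.get(path), intervalMs, false, nowMs)) {
      return false; // the path itself was synced recently
    }
    int parentLevel = 0;
    String curr = path;
    while (!curr.equals("/")) {
      curr = parentOf(curr);
      parentLevel++;
      // A direct parent covers a getFileInfo on its child with any sync;
      // more distant ancestors only count if they were synced recursively.
      boolean needRecursive = parentLevel > 1 || !isGetFileInfo;
      if (!isStale(syncTimes.get(curr), intervalMs, needRecursive, nowMs)) {
        return false;
      }
    }
    return true;
  }

  static String parentOf(String path) {
    int idx = path.lastIndexOf('/');
    return idx <= 0 ? "/" : path.substring(0, idx);
  }

  public static void main(String[] args) {
    SyncDecisionSketch s = new SyncDecisionSketch();
    System.out.println(s.shouldSync("/a/b/c", 1000, true, 5000)); // nothing synced yet
    s.recordSync("/a", 4500, true);                               // recursive sync of /a
    System.out.println(s.shouldSync("/a/b/c", 1000, true, 5000)); // covered by /a
  }
}
```

The design point worth noting is that one recent recursive sync of a directory suppresses syncs for everything beneath it, which keeps the number of UFS queries low for deep trees.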

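If a metadata sync is needed, Alluxio then queries the UFS status and adds it to UfsStatusCache. When the query involves the files under a directory, the number of children may be large and the query slow, so Alluxio handles it in an asynchronous thread.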
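The asynchronous pattern can be sketched in isolation with a plain `ExecutorService`. The class `PrefetchSketch`, the method `awaitChildren`, and the `ufsList` function standing in for a UFS directory listing are hypothetical names for illustration; children are reduced to path strings.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch of prefetching a directory listing on a background thread. */
class PrefetchSketch {
  private final ConcurrentHashMap<String, Future<Collection<String>>> activeJobs =
      new ConcurrentHashMap<>();
  private final ExecutorService executor;
  private final Function<String, Collection<String>> ufsList; // stands in for a UFS listing

  PrefetchSketch(ExecutorService executor, Function<String, Collection<String>> ufsList) {
    this.executor = executor;
    this.ufsList = ufsList;
  }

  /** Starts listing the directory in the background and returns immediately. */
  Future<Collection<String>> prefetchChildren(String dir) {
    Callable<Collection<String>> task = () -> ufsList.apply(dir);
    Future<Collection<String>> job = executor.submit(task);
    // Only the newest job per directory is kept; an older one is cancelled.
    Future<Collection<String>> prev = activeJobs.put(dir, job);
    if (prev != null) {
      prev.cancel(true);
    }
    return job;
  }

  /** Returns the prefetched listing, or lists synchronously if none was started. */
  Collection<String> awaitChildren(String dir) {
    Future<Collection<String>> job = activeJobs.remove(dir);
    if (job == null) {
      return ufsList.apply(dir);
    }
    try {
      return job.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for " + dir, e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Listing failed for " + dir, e.getCause());
    }
  }

  public static void main(String[] args) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      PrefetchSketch sketch =
          new PrefetchSketch(executor, dir -> Arrays.asList(dir + "/x", dir + "/y"));
      sketch.prefetchChildren("/d");                 // kicks off the listing
      System.out.println(sketch.awaitChildren("/d")); // collects the result later
    } finally {
      executor.shutdown();
    }
  }
}
```

The caller can continue with other work between `prefetchChildren` and `awaitChildren`, so a slow listing of a large directory overlaps with the rest of the request.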

UfsStatusCache.java

  /**
   * Submit a request to asynchronously fetch the statuses corresponding to a given directory.
   *
   * Retrieve any fetched statuses by calling {@link #fetchChildrenIfAbsent(AlluxioURI, MountTable)}
   * with the same Alluxio path.
   *
   * If no {@link ExecutorService} was provided to this object before instantiation, this method is
   * a no-op.
   *
   * @param path the path to prefetch
   * @param mountTable the Alluxio mount table
   * @return the future corresponding to the fetch task
   */
  @Nullable
  public Future<Collection<UfsStatus>> prefetchChildren(AlluxioURI path, MountTable mountTable) {
    if (mPrefetchExecutor == null) {
      return null;
    }
    try {
      Future<Collection<UfsStatus>> job =
          mPrefetchExecutor.submit(() -> getChildrenIfAbsent(path, mountTable));
      Future<Collection<UfsStatus>> prev = mActivePrefetchJobs.put(path, job);
      if (prev != null) {
        prev.cancel(true);
      }
      return job;
    } catch (RejectedExecutionException e) {
      LOG.debug("Failed to submit prefetch job for path {}", path, e);
      return null;
    }
  }

For a query on a single file, Alluxio takes a simple and direct approach: it attempts a sync on every request. If the cache is still within its validity period, no actual metadata sync takes place, and the result is returned by loading metadata from the UFS cache.

  @Override
  public FileInfo getFileInfo(AlluxioURI path, GetStatusContext context)
      throws FileDoesNotExistException, InvalidPathException, AccessControlException,
      IOException {
    Metrics.GET_FILE_INFO_OPS.inc();
    long opTimeMs = System.currentTimeMillis();
    try (RpcContext rpcContext = createRpcContext();
        FileSystemMasterAuditContext auditContext =
            createAuditContext("getFileInfo", path, null, null)) {
      // Run the metadata sync; whether it actually happens is governed by the cache interval.
      if (syncMetadata(rpcContext, path, context.getOptions().getCommonOptions(),
          DescendantType.ONE, auditContext, LockedInodePath::getInodeOrNull,
          (inodePath, permChecker) -> permChecker.checkPermission(Mode.Bits.READ, inodePath),
          true)) {
        // If synced, do not load metadata.
        context.getOptions().setLoadMetadataType(LoadMetadataPType.NEVER);
      }
      LoadMetadataContext lmCtx = LoadMetadataContext.mergeFrom(
          LoadMetadataPOptions.newBuilder().setCreateAncestors(true).setCommonOptions(
              FileSystemMasterCommonPOptions.newBuilder()
                  .setTtl(context.getOptions().getCommonOptions().getTtl())
                  .setTtlAction(context.getOptions().getCommonOptions().getTtlAction())));
      ...
  }

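Another typical scenario that requires loading metadata is when a file or directory does not yet exist in Alluxio.

That covers what this article set out to describe: how metadata is synchronized between Alluxio and the underlying storage system. Although Alluxio is itself a cache layer over the underlying storage, it maintains an additional internal cache of UFS statuses to stay in sync with the UFS, and users can set the sync interval of this cache to suit their workload. The UFS status cache also lowers the cost of list queries, which makes listing a large directory considerably more efficient than having the client access the underlying storage system directly.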

References

https://dzone.com/articles/two-ways-to-keep-files-in-sync-between-alluxio-and
