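Dremio cloud cache: a brief overview (part 2)
I previously covered the cache's CacheFileSystemWrapper; this post looks at how cached data is stored and loaded.
Reference configuration
- Mainly set on the executor nodes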
services: {
coordinator.enabled: false,
coordinator.master.enabled: false,
executor.enabled: true
executor.cache.path.db : "/mnt/cachemanagerdisk/db",
executor.cache.path.fs : [ "/mnt/cachemanagerdisk/dir1","/mnt/cachemanagerdisk/dir2","/mnt/cachemanagerdisk/dir3","/mnt/cachemanagerdisk/dir4"]
}
- Cache on disk
The db directory mainly holds metadata, while disk1 holds part of the actual cached data
- System tables
Reference SQL queries
SELECT * from sys.cache.storage_plugins
SELECT * from sys.cache.datasets
SELECT * from sys.cache.mount_points
SELECT * from sys.cache.objects
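Result (cache information for an Iceberg table)
The output shows the executor node that holds each cached entry, the storage plugin, the dataset, and the path, version, time, and offset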
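Internal implementation in the ce package
The ce package contains a CacheManager service implementation; its start method brings the service up
- CacheFileSystemWrapper.wrap: starting the cache manager and attaching the cache file system
This is lazy initialization: the cache manager is started when wrap is called, and the call is tied to a concrete storage plugin. In practice the folders are created after a file-based storage plugin has been created and a file request is issued; caching proceeds from there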
public FileSystem wrap(FileSystem fs, String storageId, AsyncStreamConf conf, OperatorContext context, boolean enableAsync, boolean isMetadataRefresh) throws IOException {
LOGGER.debug("cache-file-system-wrapper-creator for plugin-id {}, global-cm {}, local-cm {}, plugin-cm {}, operator-cm {}, isMetadataRefresh {}", new Object[]{storageId, this.cmo.getCacheManagerEnabled(), this.dremioConfig.getBoolean("services.executor.cache.enabled"), conf.getCacheProperties().isCachingEnabled(this.cmo.getOptionManager()), "true", isMetadataRefresh});
boolean cachingEnabled = this.cmo.getCacheManagerEnabled() && this.dremioConfig.getBoolean("services.executor.cache.enabled") && conf.getCacheProperties().isCachingEnabled(this.cmo.getOptionManager());
boolean invalidPluginId = storageId.contains(":::");
if (cachingEnabled && !invalidPluginId) {
boolean isExecutor = this.dremioConfig.getBoolean("services.executor.enabled");
if (isMetadataRefresh) {
return new CacheFileSystemWrapper.CacheFileSystem(fs, storageId, conf.getCacheProperties());
} else {
if (isExecutor && enableAsync) {
// start the cache manager first
if (this.cm == null) {
this.startCacheManager();
}
// create the CacheFileSystem
if (this.cm != null && !this.cm.isInError() && !this.cm.isClosed()) {
return new CacheFileSystemWrapper.CacheFileSystem(fs, storageId, conf.getCacheProperties());
}
}
return fs;
}
} else {
return fs;
}
}
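The branching in wrap can be condensed into a small decision function. This is only a sketch of the control flow shown above, not Dremio code: the class name WrapDecisionSketch, the Result enum, and the single cacheManagerHealthy flag (standing in for "cm is non-null, not in error, and not closed after the lazy start") are my own assumptions for illustration.

```java
/** Hypothetical condensation of the wrap() decision; none of these names exist in Dremio. */
final class WrapDecisionSketch {
    enum Result { CACHE_FS, PLAIN_FS }

    static Result decide(boolean globalEnabled, boolean localEnabled, boolean pluginEnabled,
                         String storageId, boolean isExecutor, boolean enableAsync,
                         boolean isMetadataRefresh, boolean cacheManagerHealthy) {
        // All three switches (global, node-local, per-plugin) must be on.
        boolean cachingEnabled = globalEnabled && localEnabled && pluginEnabled;
        // Plugin ids containing ":::" are treated as invalid for caching.
        boolean invalidPluginId = storageId.contains(":::");
        if (!cachingEnabled || invalidPluginId) {
            return Result.PLAIN_FS;
        }
        if (isMetadataRefresh) {
            // This branch wraps without checking the cache manager state.
            return Result.CACHE_FS;
        }
        if (isExecutor && enableAsync && cacheManagerHealthy) {
            return Result.CACHE_FS;
        }
        return Result.PLAIN_FS;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, true, true, "s3source", true, true, false, true));
        System.out.println(decide(true, true, true, "a:::b", true, true, false, true));
    }
}
```

The point to notice is that every fallback path returns the unwrapped file system, so a disabled or unhealthy cache degrades to direct reads rather than failing.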
- startCacheManager
Creates the folders based on the configured cache file system paths
private synchronized void startCacheManager() {
if (this.cm == null) {
String dbDirectory = this.dremioConfig.getString("services.executor.cache.path.db");
long dbQuota = this.dremioConfig.getLong("services.executor.cache.pctquota.db");
ArrayList fsDirectories = new ArrayList(this.dremioConfig.getStringList("services.executor.cache.path.fs"));
ArrayList fsQuotas = new ArrayList(this.dremioConfig.getLongList("services.executor.cache.pctquota.fs"));
ArrayList fsEnsureFreeSpaceList = new ArrayList(this.dremioConfig.getLongList("services.executor.cache.ensurefreespace.fs"));
this.fixFsPathsAndParams(fsDirectories, fsQuotas);
this.fixFsPathsAndParams(fsDirectories, fsEnsureFreeSpaceList);
LOGGER.info("starting cm, dbDir {}, dbQuota {}, fsDirs {}, fsQuota {}, fsEnsureFreeSpaceList {}", new Object[]{dbDirectory, dbQuota, fsDirectories, fsQuotas, fsEnsureFreeSpaceList});
try {
// create the CacheManager
this.cm = new CacheManager(this.cmo, dbDirectory, dbQuota, fsDirectories, fsQuotas, fsEnsureFreeSpaceList, this.allocator, this.dremioConfig.getThisNode(), this.executorPort, this.isYarnDeployment());
if (!this.cm.isInError()) {
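// start() brings the service up: RocksDB is initialized here, along with CacheDBController and CacheFSController; mount point information is saved to the db (visible in sys.cache.mount_points); the cache eviction task and the checkpoint task are also created here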
this.cm.start();
} else {
LOGGER.error("cache-manager could not be initialized, disabling caching");
}
} catch (Exception var8) {
LOGGER.error("cache-manager initialisation hit exception, disabling caching", var8);
}
}
}
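The lazy start follows a familiar pattern: check for null at the call site, then check again inside a synchronized method so that only one caller creates the manager. A minimal sketch of that pattern, with a plain Object standing in for CacheManager; the class name and the volatile modifier are my additions, since the decompiled snippet does not show how the cm field is declared.

```java
/** Hypothetical illustration of check, lock, re-check lazy start; not Dremio code. */
final class LazyStartSketch {
    // volatile is an assumption added here so the unsynchronized read is safe.
    private volatile Object manager;
    private int startCalls;

    Object getOrStart() {
        if (manager == null) {   // cheap check, as in wrap()
            start();
        }
        return manager;
    }

    private synchronized void start() {
        if (manager == null) {   // re-check under the lock, as in startCacheManager()
            startCalls++;
            manager = new Object();
        }
    }

    synchronized int startCalls() {
        return startCalls;
    }
}
```

Calling getOrStart repeatedly returns the same instance and runs the creation step once.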
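- Number of sub-folders created (these are the folders that actually hold the cached chunk data)
In CacheFSController the batch size is at least 16 and depends on the number of mounted disks: by default it is 128 divided by the number of disks (integer division), as in the formula below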
int subDirCreateBatchSize = Math.max(16, 128 / this.mountPointConfigList.size());
Name format
private static String makeSubDirNameFromId(int subDirId) {
return String.format("%06d", subDirId);
}
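To make the formula concrete, here is a small worked example that reuses the two expressions above. The class and method names are hypothetical; only Math.max(16, 128 / n) and the "%06d" format come from the decompiled code.

```java
/** Worked example of the sub-folder batch size and name format; names are illustrative. */
final class SubDirSketch {
    static int batchSize(int mountPoints) {
        return Math.max(16, 128 / mountPoints);
    }

    static String subDirName(int subDirId) {
        return String.format("%06d", subDirId);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 3, 4, 16}) {
            System.out.println(n + " mount point(s) -> batch size " + batchSize(n));
        }
        System.out.println("sub-folder 42 -> " + subDirName(42));
    }
}
```

With the four directories from the reference configuration the batch size comes out as 32; from eight mount points upward the floor of 16 applies.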
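- File reads and writes
CacheFSMountPoint implements the actual read and write operations on files under a given mount point. Its direct user is CacheFSController,
whose ReadWriteFSHandle performs reads and writes through the methods CacheFSMountPoint exposes. The actual cache file system
implementation, CacheFileSystem, uses CacheAsyncByteReader
Read path in CacheAsyncByteReader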
public CompletableFuture readFully(long offset, ByteBuf dstBuf, int dstOff, int len) {
if (len == 0) {
throw new IllegalArgumentException("empty reads not allowed.");
} else {
LOGGER.debug("[{}] Reading fully from cache for path {} at source offset {} for length {}", new Object[]{this.threadName, this.path, offset, len});
long startTime = System.currentTimeMillis();
int cacheHits = 0;
int cacheMisses = 0;
CompletableFuture combinedFuture;
try {
Stopwatch sw = Stopwatch.createStarted();
long objOffsetStepper = offset / this.fileChunkSize * this.fileChunkSize;
long offsetInObj = offset;
int remaining = len;
ArrayList combinedFutureList = new ArrayList();
if (!this.cmStateSupplier.getAsBoolean()) {
return this.primarySrcAsyncByteReader.versionedReadFully(this.version, offset, dstBuf, dstOff, len);
}
this.numOutstandingReads.getAndIncrement();
while(remaining > 0) {
long objHeadBytes = remaining == len ? offset % this.fileChunkSize : 0L;
long objCopyBytes = Math.min(this.fileChunkSize - objHeadBytes, (long)remaining);
int currDstOffset = dstOff + len - remaining;
Preconditions.checkState(currDstOffset >= 0);
// key for the requested file chunk
CacheTranslationKey key = new CacheTranslationKey(this.pluginUID, this.dataSetID, this.path, this.version, objOffsetStepper);
// cache lookup
CacheTranslationValue lookupValue = this.tdb.lookup(key);
CompletableFuture future;
if (lookupValue == null) {
// cache miss: fetch from the source and cache it
++cacheMisses;
CacheMemoryLockController.CacheMemoryBufferRequest bufferRequest = this.mlc.getBuffer(key, (int)this.fileChunkSize);
if (bufferRequest.isBlocked()) {
future = bufferRequest.getBlockingFuture();
if (!bufferRequest.isChunkBufferAvailable()) {
future = future.thenCompose((v) -> {
return this.loadDataIntoDstBufferRequestAsync(key, objOffsetStepper, bufferRequest, startTime);
});
future = future.thenAccept((v) -> {
this.copyDataFromSrcBufferRequestToDstBuf(bufferRequest, (int)objHeadBytes, dstBuf, currDstOffset, (int)objCopyBytes);
});
} else {
future = future.thenAccept((v) -> {
this.copyDataFromSrcBufferRequestToDstBuf(bufferRequest, (int)objHeadBytes, dstBuf, currDstOffset, (int)objCopyBytes);
});
}
} else {
if (bufferRequest.isChunkBufferValid()) {
future = bufferRequest.getBlockingFuture();
} else {
future = this.fetchDataFromPrimarySrcAsync(objOffsetStepper, bufferRequest);
future = future.thenCompose((v) -> {
// writeDataToCacheAsync also stores the cache mapping
return this.writeDataToCacheAsync(bufferRequest, key, startTime);
});
}
future = future.thenAccept((v) -> {
this.copyDataFromSrcBufferRequestToDstBuf(bufferRequest, (int)objHeadBytes, dstBuf, currDstOffset, (int)objCopyBytes);
});
}
future = future.whenComplete((v, e) -> {
this.mlc.releaseBuffer(bufferRequest);
});
} else {
// cache hit: read directly from the local folder
++cacheHits;
future = this.readDataFromCacheAsync(key, lookupValue, (int)objHeadBytes, offsetInObj, dstBuf, currDstOffset, (int)objCopyBytes, startTime);
LOGGER.debug("[{}] Submitted request to read from cache for path {} at offset {} for length {}", new Object[]{this.threadName, this.path, offsetInObj, objCopyBytes});
}
combinedFutureList.add(future);
objOffsetStepper += this.fileChunkSize;
remaining = (int)((long)remaining - (this.fileChunkSize - objHeadBytes));
offsetInObj += objCopyBytes;
}
int nScheduled = combinedFutureList.size();
combinedFuture = CompletableFuture.allOf((CompletableFuture[])combinedFutureList.toArray(new CompletableFuture[nScheduled]));
combinedFuture = combinedFuture.whenComplete((v, e) -> {
this.numOutstandingReads.getAndDecrement();
int nFailures = 0;
int nCancellations = 0;
Iterator var11 = combinedFutureList.iterator();
while(var11.hasNext()) {
CompletableFuture f = (CompletableFuture)var11.next();
if (f.isCompletedExceptionally()) {
++nFailures;
}
if (f.isCancelled()) {
++nCancellations;
}
}
long elapsedTime = sw.elapsed(TimeUnit.NANOSECONDS);
this.updateStats(elapsedTime, nScheduled, nFailures, nCancellations);
if (e == null && nFailures == 0) {
if (nCancellations != 0) {
LOGGER.info(" [{}] cache-async-byte-reader cancellations, skipping error handler, nCancellations {}", this.threadName, nCancellations);
} else {
LOGGER.trace(" [{}] cache-async-byte-reader completed, path {}, offset {}, len {}, elapsed {}, nScheduled {}, nMiss {}, nHit {}, nFailed {}, nCancelled {}", new Object[]{this.threadName, this.path, offset, len, elapsedTime, this.nFutures, this.nCacheMisses, this.nCacheHits, this.nFailedFutures, this.nCancelledFutures});
}
} else {
LOGGER.error(" [{}] cache-async-byte-reader failures, nFailures {}", new Object[]{this.threadName, nFailures, e});
throw new CompletionException(e);
}
});
this.nCacheHits += cacheHits;
this.nCacheMisses += cacheMisses;
} catch (Exception var31) {
LOGGER.error("cache-async-byte-reader exception setting up futures", var31);
combinedFuture = new CompletableFuture();
combinedFuture.completeExceptionally(new IOException("cache-async-byte-reader exception, while setting up", var31));
}
return combinedFuture;
}
}
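The core of readFully is the arithmetic that splits one (offset, len) request into chunk-aligned pieces, each of which becomes a cache lookup. The sketch below isolates just that loop so the numbers can be followed by hand. The Slice holder and the class name are hypothetical; the formulas mirror objOffsetStepper, objHeadBytes, objCopyBytes, and currDstOffset from the method above.

```java
import java.util.ArrayList;
import java.util.List;

/** Isolated version of the chunk-splitting loop in readFully; names are illustrative. */
final class ChunkSliceSketch {
    static final class Slice {
        final long chunkStart;  // file offset of the chunk (part of the cache key)
        final long headBytes;   // bytes skipped at the start of the chunk
        final long copyBytes;   // bytes copied from this chunk
        final int dstOffset;    // where they land in the destination buffer

        Slice(long chunkStart, long headBytes, long copyBytes, int dstOffset) {
            this.chunkStart = chunkStart;
            this.headBytes = headBytes;
            this.copyBytes = copyBytes;
            this.dstOffset = dstOffset;
        }
    }

    static List<Slice> slices(long offset, int len, long chunkSize, int dstOff) {
        if (len == 0) {
            throw new IllegalArgumentException("empty reads not allowed.");
        }
        List<Slice> out = new ArrayList<>();
        long chunkStart = offset / chunkSize * chunkSize; // round down to a chunk boundary
        int remaining = len;
        while (remaining > 0) {
            // Only the first chunk can start mid-chunk.
            long head = remaining == len ? offset % chunkSize : 0L;
            long copy = Math.min(chunkSize - head, (long) remaining);
            int dst = dstOff + len - remaining;
            out.add(new Slice(chunkStart, head, copy, dst));
            chunkStart += chunkSize;
            remaining = (int) (remaining - (chunkSize - head));
        }
        return out;
    }
}
```

For example, reading 200 bytes at offset 150 with a chunk size of 100 touches three chunks: (start 100, skip 50, copy 50), (start 200, copy 100), and (start 300, copy 50). A read that straddles chunk boundaries can therefore be a mix of hits and misses.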
Write path: writeDataToCacheAsync
private CompletableFuture writeDataToCacheAsync(CacheMemoryLockController.CacheMemoryBufferRequest srcBufferRequest, CacheTranslationKey key, long writeTime) {
return CompletableFuture.runAsync(() -> {
Preconditions.checkState(srcBufferRequest.isChunkBufferAvailable(), "buffer must be available.");
Preconditions.checkState(!srcBufferRequest.isChunkBufferInValid(), "buffer must be valid.");
if (srcBufferRequest.isChunkBufferInError()) {
throw new CompletionException(new IOException("chunk buffer is in error"));
} else {
CacheMemoryLockController.GenerationMapEntry generationMapEntry = this.mlc.getRefOnGenerationNumber();
try {
long generationNumber = generationMapEntry.getGenerationNumber();
// the actual write
CacheFSController.PathInfo pathInfo = this.rwh.writeChunkFile(generationNumber, key, srcBufferRequest.getChunkBuffer().nioBuffer(0, (int)this.fileChunkSize), (int)this.fileChunkSize);
CacheTranslationValue insertValue = new CacheTranslationValue(pathInfo, generationNumber, writeTime, 1);
this.tdb.insert(key, insertValue);
this.numNewBlocks.incrementAndGet();
if (this.mlc.triggerForceOneDataWriteError()) {
throw new IOException("force one data write error for test");
}
} catch (FSState.IllegalFSStateException var15) {
long currTimeMillis = System.currentTimeMillis();
long diff = currTimeMillis - lastLoggedFSStateExceptionTime;
if (diff > 1800000L) {
LOGGER.info("[{}] Unable to write cache chunk to disk due to {}", this.threadName, var15.getMessage());
lastLoggedFSStateExceptionTime = currTimeMillis;
}
this.nFailedFutures.incrementAndGet();
} catch (Exception var16) {
if (!this.loggedCacheWriteFailure) {
LOGGER.info("[{}] Unable to write cache chunk to disk due to {}", this.threadName, var16.getMessage());
this.loggedCacheWriteFailure = true;
}
this.nFailedFutures.incrementAndGet();
} finally {
this.mlc.releaseRefOnGenerationNumber(generationMapEntry);
}
}
}, CACHE_THREAD_POOL);
}
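Put together, a cache miss in readFully is a chain of futures: fetch the chunk from the primary source, write it to the cache, then copy the requested bytes to the caller. The following is a rough sketch of that chain using an in-memory map in place of the chunk files and the translation db. Everything here (class name, map, fetchFromSource) is a stand-in I made up for illustration, and it leaves out the buffer locking done by CacheMemoryLockController.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical miss/hit flow with CompletableFuture; an in-memory map replaces disk and db. */
final class MissPathSketch {
    private final Map<Long, byte[]> cache = new ConcurrentHashMap<>();
    final AtomicInteger sourceReads = new AtomicInteger();

    // Stand-in for the primary source: byte i of the file has value (byte) i.
    private byte[] fetchFromSource(long chunkStart, int chunkSize) {
        sourceReads.incrementAndGet();
        byte[] chunk = new byte[chunkSize];
        for (int i = 0; i < chunkSize; i++) {
            chunk[i] = (byte) (chunkStart + i);
        }
        return chunk;
    }

    CompletableFuture<Void> readChunk(long chunkStart, int chunkSize, int head,
                                      byte[] dst, int dstOff, int n) {
        byte[] cached = cache.get(chunkStart);
        if (cached != null) {
            // Hit: serve straight from the cache.
            return CompletableFuture.runAsync(() -> System.arraycopy(cached, head, dst, dstOff, n));
        }
        // Miss: fetch, then store in the cache, then copy to the caller's buffer.
        return CompletableFuture.supplyAsync(() -> fetchFromSource(chunkStart, chunkSize))
                .thenCompose(buf -> CompletableFuture.supplyAsync(() -> {
                    cache.put(chunkStart, buf);
                    return buf;
                }))
                .thenAccept(buf -> System.arraycopy(buf, head, dst, dstOff, n));
    }
}
```

One difference worth keeping in mind: in the decompiled writeDataToCacheAsync, write failures are logged and counted in the catch blocks rather than rethrown, which this sketch does not model.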
Handling in CacheFSController's ReadWriteFSHandle
CacheFSController.PathInfo writeChunkFile(long generationNumber, CacheTranslationKey key, ByteBuffer srcBuf, int nBytes) throws Exception {
// derived from the key's hash code
int keyHash = key.hashCodeUnsignedInt();
if (CacheFSController.this.fsState == FSState.ERROR) {
throw new FSState.IllegalFSStateException("write-chunk-file exception, fs-controller is in ERROR");
} else if (CacheFSController.this.fsState == FSState.READ_ONLY) {
throw new FSState.IllegalFSStateException("write-chunk-file exception, fs-controller is READ_ONLY");
} else {
CacheFSController.PathInfo sdPathInfo = this.getSubDirPath(-1, false, keyHash);
String relativeFilePath = ((CacheFSMountPoint)CacheFSController.this.mountPointMap.get(sdPathInfo.getMountPointId())).writeChunkFile(sdPathInfo.getRelativePath(), generationNumber, keyHash, srcBuf, nBytes);
this.updateFileStats(key, (long)nBytes);
return new CacheFSController.PathInfo(sdPathInfo.getMountPointId(), relativeFilePath, (long)nBytes);
}
}
getSubDirPath picks the folder in which the key is stored
private CacheFSController.PathInfo getSubDirPath(int mpID, boolean matchMountPoint, int keyHash) {
CacheFSController.this.inFlightSubDirLock.writeLock().lock();
CacheFSController.PathInfo sdPathInfo;
try {
sdPathInfo = (CacheFSController.PathInfo)CacheFSController.this.inFlightSubDirs.get(keyHash % CacheFSController.this.inFLightSubDirsListSize);
if (mpID != -1 && mpID != sdPathInfo.getMountPointId()) {
sdPathInfo = null;
for(int i = 0; i < CacheFSController.this.inFLightSubDirsListSize; ++i) {
if (((CacheFSController.PathInfo)CacheFSController.this.inFlightSubDirs.get(i)).getMountPointId() == mpID) {
sdPathInfo = (CacheFSController.PathInfo)CacheFSController.this.inFlightSubDirs.get(i);
break;
}
}
}
// inFLightSubDirsListSize defaults to 17
if (sdPathInfo == null && !matchMountPoint) {
sdPathInfo = (CacheFSController.PathInfo)CacheFSController.this.inFlightSubDirs.get(keyHash % CacheFSController.this.inFLightSubDirsListSize);
}
} catch (RuntimeException var13) {
long currTimeMillis = System.currentTimeMillis();
long diff = currTimeMillis - CacheFSController.lastLoggedFSStateExceptionTime;
if (diff > CacheFSController.REPEAT_LOG_THRESHOLD) {
CacheFSController.LOGGER.error("Caught exception in getSubDirPath", var13);
CacheFSController.lastLoggedFSStateExceptionTime = currTimeMillis;
}
throw var13;
} finally {
CacheFSController.this.inFlightSubDirLock.writeLock().unlock();
}
if (!matchMountPoint) {
Preconditions.checkNotNull(sdPathInfo, "found null sub-dir in in-flight-sub-dir list.");
}
return sdPathInfo;
}
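Stripped of locking and logging, the selection logic is small: take keyHash modulo the in-flight list size, and only scan for a different entry when the caller asks for a specific mount point. Here is a sketch under a few assumptions: the SubDir holder and class name are invented, and keyHash is assumed non-negative (which the name hashCodeUnsignedInt suggests, though I have not verified it).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical re-statement of getSubDirPath without locks; not Dremio code. */
final class InFlightSubDirSketch {
    static final class SubDir {
        final int mountPointId;
        final String name;

        SubDir(int mountPointId, String name) {
            this.mountPointId = mountPointId;
            this.name = name;
        }
    }

    /** mpId == -1 means "any mount point"; keyHash is assumed to be non-negative. */
    static SubDir pick(List<SubDir> inFlight, int mpId, boolean matchMountPoint, int keyHash) {
        SubDir chosen = inFlight.get(keyHash % inFlight.size());
        if (mpId != -1 && mpId != chosen.mountPointId) {
            chosen = null;
            for (SubDir candidate : inFlight) {   // first in-flight folder on that mount point
                if (candidate.mountPointId == mpId) {
                    chosen = candidate;
                    break;
                }
            }
        }
        if (chosen == null && !matchMountPoint) {
            chosen = inFlight.get(keyHash % inFlight.size()); // fall back to the hashed slot
        }
        return chosen;
    }

    /** Builds a sample list of 17 folders spread round-robin over the given mount points. */
    static List<SubDir> sample(int mountPoints) {
        List<SubDir> list = new ArrayList<>();
        for (int i = 0; i < 17; i++) {
            list.add(new SubDir(i % mountPoints, String.format("%06d", i)));
        }
        return list;
    }
}
```

writeChunkFile calls it with mpID = -1 and matchMountPoint = false, so a write always lands in the slot chosen by the hash.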
Building the actual file path to write (takes the sub-folder path from above)
private Path toPath(String subDirName, long generationNumber, int keyHash, int retryCount) {
String paddedHexHash = toPaddedHexString(keyHash, retryCount);
return Paths.get(subDirName, generationNumber + "-" + paddedHexHash);
}
@VisibleForTesting
static String toPaddedHexString(int keyHash, int retryCount) {
String paddedHexHash = Strings.padStart(Integer.toHexString(keyHash), 8, '0');
if (retryCount != 0) {
paddedHexHash = paddedHexHash + "_" + retryCount;
}
return paddedHexHash;
}
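A quick example of the names this produces. The sketch swaps Guava's Strings.padStart for String.format("%08x", ...) so it runs on the JDK alone; for an int both give the same 8-digit lowercase hex string. The class name and the sample values are made up.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** JDK-only rendition of the chunk file naming; sample values are illustrative. */
final class ChunkFileNameSketch {
    static String paddedHex(int keyHash, int retryCount) {
        // %08x prints the int as unsigned hex, zero-padded to 8 digits.
        String hex = String.format("%08x", keyHash);
        return retryCount != 0 ? hex + "_" + retryCount : hex;
    }

    static Path chunkPath(String subDirName, long generationNumber, int keyHash, int retryCount) {
        return Paths.get(subDirName, generationNumber + "-" + paddedHex(keyHash, retryCount));
    }

    public static void main(String[] args) {
        System.out.println(chunkPath("000003", 5L, 0x1a2b, 0));
        System.out.println(chunkPath("000003", 5L, 0x1a2b, 2));
    }
}
```

So a chunk file name is the generation number, a dash, the hashed key, and a _N suffix only when a retry count is passed in.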
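The files actually written look like those in the screenshot above
Notes
This is only a brief walkthrough; the details can be found by decompiling the source. Understanding some of the internals of cloud cache is useful, for example for performance tuning and disk space planning
One more detail: when a mount point accumulates too many folders, some rotation and cleanup takes place
CacheAsyncByteReader is loaded through a Dremio configuration key; the concrete implementation is also in the ce package, in ce-kernel, as shown below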
public SeekableInputStream newStream() {
try {
// dremio.plugins.iceberg.manifests.input_stream_factory
//
SeekableInputStreamFactory factory = io.getContext() == null || io.getDataset() == null ?
SeekableInputStreamFactory.DEFAULT :
io.getContext().getConfig().getInstance(SeekableInputStreamFactory.KEY, SeekableInputStreamFactory.class,
SeekableInputStreamFactory.DEFAULT);
return factory.getStream(io.getFs(), io.getContext(),
path, fileSize, mtime, io.getDataset(), io.getDatasourcePluginUID());
} catch (FileNotFoundException e) {
throw new NotFoundException(e, "Path %s not found.", path);
} catch (IOException e) {
throw new UncheckedIOException(String.format("Failed to create new input stream for file: %s", path), e);
}
}
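The lookup in newStream is a pluggable-factory pattern: resolve a class name from configuration, and fall back to a default when nothing is configured. The sketch below shows the general idea with plain reflection. It does not reproduce Dremio's getInstance, whose exact behavior I have not checked; the interface, the map-based config, and the class names are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical config-driven factory lookup with a default; not Dremio's implementation. */
final class FactoryLookupSketch {
    interface StreamFactory {
        String name();
    }

    static final StreamFactory DEFAULT = () -> "default";

    /** Stands in for a caching factory shipped in a separate package. */
    public static final class CachingFactory implements StreamFactory {
        @Override
        public String name() {
            return "caching";
        }
    }

    static StreamFactory resolve(Map<String, String> config, String key) {
        String className = config.get(key);
        if (className == null) {
            return DEFAULT;   // nothing configured: use the built-in factory
        }
        try {
            return Class.forName(className)
                    .asSubclass(StreamFactory.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot create factory " + className, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        System.out.println(resolve(config, "factory.key").name());
        config.put("factory.key", CachingFactory.class.getName());
        System.out.println(resolve(config, "factory.key").name());
    }
}
```

The design keeps the open-source reader path unaware of the cache: only the configured class name decides whether reads go through a caching stream.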
References
sabot/kernel/src/main/java/com/dremio/exec/store/cache/BlockLocationsCacheManager.java
sabot/kernel/src/main/java/com/dremio/exec/store/cache/RecordingCacheReaderWriter.java
sabot/kernel/src/main/java/com/dremio/exec/store/dfs/SplitAssignmentTableFunction.java
sabot/kernel/src/main/java/com/dremio/exec/store/RecordReader.java
sabot/kernel/src/main/java/com/dremio/exec/store/iceberg/DremioInputFile.java
sabot/kernel/src/main/java/com/dremio/exec/store/iceberg/SeekableInputStreamFactory.java
sabot/kernel/src/main/java/com/dremio/exec/work/CacheManagerStoragePluginInfo.java
sabot/kernel/src/main/java/com/dremio/sabot/op/tablefunction/InternalTableFunctionFactory.java
sabot/kernel/src/main/java/com/dremio/sabot/op/tablefunction/TableFunctionOperator.java
sabot/kernel/src/main/java/com/dremio/exec/store/schedule/CompleteWork.java
sabot/kernel/src/main/java/com/dremio/exec/planner/physical/TableFunctionPrel.java
common/legacy/src/main/java/com/dremio/config/DremioConfig.java
sabot/kernel/src/main/java/com/dremio/exec/store/cache/RocksDbBroker.java
com/dremio/service/cachemanager/CacheFileSystemWrapper.java
common/legacy/src/main/java/com/dremio/io/AsyncByteReader.java
com/dremio/service/cachemanager/CacheManager.java
com/dremio/service/cachemanager/CacheDBController.java
com/dremio/service/cachemanager/CacheFSController.java
com/dremio/service/cachemanager/CacheFSMountPoint.java
com/dremio/service/cachemanager/CacheMemoryLockController.java
com/dremio/service/cachemanager/CacheAsyncByteReader.java
https://en.wikipedia.org/wiki/Rendezvous_hashing