dremio cloud cache brief notes (part 2)

Previously I introduced the cache-related CacheFileSystemWrapper; the following explains how the cache stores and loads data.

Reference configuration

  • Mainly configured on the executor nodes:
services: {
  coordinator.enabled: false,
  coordinator.master.enabled: false,
  executor.enabled: true,
  executor.cache.path.db : "/mnt/cachemanagerdisk/db",
  executor.cache.path.fs : [ "/mnt/cachemanagerdisk/dir1","/mnt/cachemanagerdisk/dir2","/mnt/cachemanagerdisk/dir3","/mnt/cachemanagerdisk/dir4"]
}
  • Cache effect on disk

The db directory mainly holds cache metadata; the dir directories (dir1, etc.) hold the actual cached chunk data.
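
The original screenshot does not carry over, so here is a hypothetical layout consistent with the sample config above and with the naming code shown later (six-digit sub-directories holding generation-hash chunk files; the exact db file names are an assumption, since the db is a RocksDB instance):

/mnt/cachemanagerdisk/
├── db/                        # RocksDB files mapping translation keys to chunk locations
├── dir1/
│   └── 000000/
│       └── 42-0000a1b2        # cached chunk: <generationNumber>-<paddedHexHash> (made-up example)
├── dir2/
│   └── 000001/
...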

  • System tables

Reference query SQL:

SELECT * from sys.cache.storage_plugins
SELECT * from sys.cache.datasets
SELECT * from sys.cache.mount_points
SELECT * from sys.cache.objects

Result (cache info for an Iceberg table):
each row shows the executor node holding the cache entry, the storage plugin, the dataset, plus path, version, timestamp, and offset
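
An illustrative row, reconstructed only from the description above (all column names and values here are hypothetical; the actual schema of sys.cache.objects may differ):

hostname         plugin   dataset             path                                version         offset
executor0:4447   s3       demo.sales.orders   /bucket/orders/data/00001.parquet   1711123200000   1048576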

Internal implementation in the ce package

The ce package contains a CacheManager service implementation; its start method performs the service startup.

  • CacheFileSystemWrapper wrap handles starting the cache manager and associating the cache filesystem
    Startup is lazy: the cache manager is started on the first wrap call, and each wrapped filesystem is tied to an actual storage plugin. In practice, the cache directories are created after a file-based storage plugin has been created and the first file request comes in; caching proceeds from there.
public FileSystem wrap(FileSystem fs, String storageId, AsyncStreamConf conf, OperatorContext context, boolean enableAsync, boolean isMetadataRefresh) throws IOException {
      LOGGER.debug("cache-file-system-wrapper-creator for plugin-id {}, global-cm {}, local-cm {}, plugin-cm {}, operator-cm {}, isMetadataRefresh {}", new Object[]{storageId, this.cmo.getCacheManagerEnabled(), this.dremioConfig.getBoolean("services.executor.cache.enabled"), conf.getCacheProperties().isCachingEnabled(this.cmo.getOptionManager()), "true", isMetadataRefresh});
      boolean cachingEnabled = this.cmo.getCacheManagerEnabled() && this.dremioConfig.getBoolean("services.executor.cache.enabled") && conf.getCacheProperties().isCachingEnabled(this.cmo.getOptionManager());
      boolean invalidPluginId = storageId.contains(":::");
      if (cachingEnabled && !invalidPluginId) {
         boolean isExecutor = this.dremioConfig.getBoolean("services.executor.enabled");
         if (isMetadataRefresh) {
            return new CacheFileSystemWrapper.CacheFileSystem(fs, storageId, conf.getCacheProperties());
         } else {
            if (isExecutor && enableAsync) {
              // lazily start the cache manager on first use
               if (this.cm == null) {
                  this.startCacheManager();
               }
               // wrap the source filesystem with a CacheFileSystem
               if (this.cm != null && !this.cm.isInError() && !this.cm.isClosed()) {
                  return new CacheFileSystemWrapper.CacheFileSystem(fs, storageId, conf.getCacheProperties());
               }
            }
 
            return fs;
         }
      } else {
         return fs;
      }
}
  • startCacheManager
    creates the directories on the configured cache filesystem paths
private synchronized void startCacheManager() {
      if (this.cm == null) {
         String dbDirectory = this.dremioConfig.getString("services.executor.cache.path.db");
         long dbQuota = this.dremioConfig.getLong("services.executor.cache.pctquota.db");
         ArrayList fsDirectories = new ArrayList(this.dremioConfig.getStringList("services.executor.cache.path.fs"));
         ArrayList fsQuotas = new ArrayList(this.dremioConfig.getLongList("services.executor.cache.pctquota.fs"));
         ArrayList fsEnsureFreeSpaceList = new ArrayList(this.dremioConfig.getLongList("services.executor.cache.ensurefreespace.fs"));
         this.fixFsPathsAndParams(fsDirectories, fsQuotas);
         this.fixFsPathsAndParams(fsDirectories, fsEnsureFreeSpaceList);
         LOGGER.info("starting cm, dbDir {}, dbQuota {}, fsDirs {}, fsQuota {}, fsEnsureFreeSpaceList {}", new Object[]{dbDirectory, dbQuota, fsDirectories, fsQuotas, fsEnsureFreeSpaceList});
 
         try {
            // create the CacheManager
            this.cm = new CacheManager(this.cmo, dbDirectory, dbQuota, fsDirectories, fsQuotas, fsEnsureFreeSpaceList, this.allocator, this.dremioConfig.getThisNode(), this.executorPort, this.isYarnDeployment());
            if (!this.cm.isInError()) {
               // start() initializes RocksDB; CacheDBController and CacheFSController are set up here, mount-point info is persisted to the db (visible via sys.cache.mount_points), and the cache eviction and checkpoint tasks are created as well
               this.cm.start();
            } else {
               LOGGER.error("cache-manager could not be initialized, disabling caching");
            }
         } catch (Exception var8) {
            LOGGER.error("cache-manager initialisation hit exception, disabling caching", var8);
         }
 
      }
}
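
Besides the path keys in the sample config at the top, startCacheManager also reads quota keys (services.executor.cache.pctquota.db, services.executor.cache.pctquota.fs, services.executor.cache.ensurefreespace.fs). A hedged extension of that sample config; the values below are illustrative, not verified defaults:

services: {
  executor.cache.pctquota.db: 100,
  executor.cache.pctquota.fs: [100, 100, 100, 100],
  executor.cache.ensurefreespace.fs: [10, 10, 10, 10]
}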
  • Number of sub-directories created (these are the directories that actually hold the cached chunk data)
    In CacheFSController the creation batch size is at least 16 and depends on the number of mounted disks: by default 128 divided by the disk count (integer division), per the formula below; a worked example follows it.
int subDirCreateBatchSize = Math.max(16, 128 / this.mountPointConfigList.size());
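
To make the formula concrete, a minimal standalone sketch evaluating it for a few mount-point counts, including the four-disk sample config above (the disk counts are illustrative):

public class SubDirBatchSizeDemo {
   public static void main(String[] args) {
      int[] diskCounts = {1, 4, 8, 16};
      for (int disks : diskCounts) {
         // same formula as CacheFSController above
         int subDirCreateBatchSize = Math.max(16, 128 / disks);
         System.out.printf("disks=%d -> batch size %d%n", disks, subDirCreateBatchSize);
      }
      // disks=1 -> 128, disks=4 -> 32, disks=8 -> 16, disks=16 -> 16 (128/16 = 8, clamped to the minimum of 16)
   }
}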

Sub-directory name formatting:

private static String makeSubDirNameFromId(int subDirId) {
    return String.format("%06d", subDirId);
}
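For example, makeSubDirNameFromId(5) returns "000005", which is why the chunk sub-directories on disk appear as six-digit, zero-padded names.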
  • File read and write handling
    CacheFSMountPoint implements the actual reads and writes against a specific mount point and is used directly by CacheFSController:
    CacheFSController's ReadWriteFSHandle performs read and write operations through the methods CacheFSMountPoint exposes, while the actual
    cache filesystem implementation, CacheFileSystem, uses CacheAsyncByteReader.
    CacheAsyncByteReader read handling:
public CompletableFuture readFully(long offset, ByteBuf dstBuf, int dstOff, int len) {
      if (len == 0) {
         throw new IllegalArgumentException("empty reads not allowed.");
      } else {
         LOGGER.debug("[{}] Reading fully from cache for path {} at source offset {} for length {}", new Object[]{this.threadName, this.path, offset, len});
         long startTime = System.currentTimeMillis();
         int cacheHits = 0;
         int cacheMisses = 0;
 
         CompletableFuture combinedFuture;
         try {
            Stopwatch sw = Stopwatch.createStarted();
            long objOffsetStepper = offset / this.fileChunkSize * this.fileChunkSize;
            long offsetInObj = offset;
            int remaining = len;
            ArrayList combinedFutureList = new ArrayList();
            if (!this.cmStateSupplier.getAsBoolean()) {
               return this.primarySrcAsyncByteReader.versionedReadFully(this.version, offset, dstBuf, dstOff, len);
            }
 
            this.numOutstandingReads.getAndIncrement();
 
            while(remaining > 0) {
               long objHeadBytes = remaining == len ? offset % this.fileChunkSize : 0L;
               long objCopyBytes = Math.min(this.fileChunkSize - objHeadBytes, (long)remaining);
               int currDstOffset = dstOff + len - remaining;
               Preconditions.checkState(currDstOffset >= 0);
               // key tying this request to a specific file chunk
               CacheTranslationKey key = new CacheTranslationKey(this.pluginUID, this.dataSetID, this.path, this.version, objOffsetStepper);
               // cache lookup
               CacheTranslationValue lookupValue = this.tdb.lookup(key);
               CompletableFuture future;
               if (lookupValue == null) {
                  // cache miss: fetch from the primary source and cache it
                  ++cacheMisses;
                  CacheMemoryLockController.CacheMemoryBufferRequest bufferRequest = this.mlc.getBuffer(key, (int)this.fileChunkSize);
                  if (bufferRequest.isBlocked()) {
                     future = bufferRequest.getBlockingFuture();
                     if (!bufferRequest.isChunkBufferAvailable()) {
                        future = future.thenCompose((v) -> {
                           return this.loadDataIntoDstBufferRequestAsync(key, objOffsetStepper, bufferRequest, startTime);
                        });
                        future = future.thenAccept((v) -> {
                           this.copyDataFromSrcBufferRequestToDstBuf(bufferRequest, (int)objHeadBytes, dstBuf, currDstOffset, (int)objCopyBytes);
                        });
                     } else {
                        future = future.thenAccept((v) -> {
                           this.copyDataFromSrcBufferRequestToDstBuf(bufferRequest, (int)objHeadBytes, dstBuf, currDstOffset, (int)objCopyBytes);
                        });
                     }
                  } else {
                     if (bufferRequest.isChunkBufferValid()) {
                        future = bufferRequest.getBlockingFuture();
                     } else {
                        future = this.fetchDataFromPrimarySrcAsync(objOffsetStepper, bufferRequest);
                        future = future.thenCompose((v) -> {
                          // writeDataToCacheAsync also persists the cache mapping
                           return this.writeDataToCacheAsync(bufferRequest, key, startTime);
                        });
                     }
 
                     future = future.thenAccept((v) -> {
                        this.copyDataFromSrcBufferRequestToDstBuf(bufferRequest, (int)objHeadBytes, dstBuf, currDstOffset, (int)objCopyBytes);
                     });
                  }
 
                  future = future.whenComplete((v, e) -> {
                     this.mlc.releaseBuffer(bufferRequest);
                  });
               } else {
                // cache hit: read directly from the local cache directory
                  ++cacheHits;
                  future = this.readDataFromCacheAsync(key, lookupValue, (int)objHeadBytes, offsetInObj, dstBuf, currDstOffset, (int)objCopyBytes, startTime);
                  LOGGER.debug("[{}] Submitted request to read from cache for path {} at offset {} for length {}", new Object[]{this.threadName, this.path, offsetInObj, objCopyBytes});
               }
 
               combinedFutureList.add(future);
               objOffsetStepper += this.fileChunkSize;
               remaining = (int)((long)remaining - (this.fileChunkSize - objHeadBytes));
               offsetInObj += objCopyBytes;
            }
 
            int nScheduled = combinedFutureList.size();
            combinedFuture = CompletableFuture.allOf((CompletableFuture[])combinedFutureList.toArray(new CompletableFuture[nScheduled]));
            combinedFuture = combinedFuture.whenComplete((v, e) -> {
               this.numOutstandingReads.getAndDecrement();
               int nFailures = 0;
               int nCancellations = 0;
               Iterator var11 = combinedFutureList.iterator();
 
               while(var11.hasNext()) {
                  CompletableFuture f = (CompletableFuture)var11.next();
                  if (f.isCompletedExceptionally()) {
                     ++nFailures;
                  }
 
                  if (f.isCancelled()) {
                     ++nCancellations;
                  }
               }
 
               long elapsedTime = sw.elapsed(TimeUnit.NANOSECONDS);
               this.updateStats(elapsedTime, nScheduled, nFailures, nCancellations);
               if (e == null && nFailures == 0) {
                  if (nCancellations != 0) {
                     LOGGER.info(" [{}] cache-async-byte-reader cancellations, skipping error handler, nCancellations {}", this.threadName, nCancellations);
                  } else {
                     LOGGER.trace(" [{}] cache-async-byte-reader completed, path {}, offset {}, len {}, elapsed {}, nScheduled {}, nMiss {}, nHit {}, nFailed {}, nCancelled {}", new Object[]{this.threadName, this.path, offset, len, elapsedTime, this.nFutures, this.nCacheMisses, this.nCacheHits, this.nFailedFutures, this.nCancelledFutures});
                  }
 
               } else {
                  LOGGER.error(" [{}] cache-async-byte-reader failures, nFailures {}", new Object[]{this.threadName, nFailures, e});
                  throw new CompletionException(e);
               }
            });
            this.nCacheHits += cacheHits;
            this.nCacheMisses += cacheMisses;
         } catch (Exception var31) {
            LOGGER.error("cache-async-byte-reader exception setting up futures", var31);
            combinedFuture = new CompletableFuture();
            combinedFuture.completeExceptionally(new IOException("cache-async-byte-reader exception, while setting up", var31));
         }
 
         return combinedFuture;
      }
   }
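
The chunk-splitting arithmetic in the while loop above is easiest to follow with concrete numbers. A minimal standalone sketch of just that loop (the 1 MiB fileChunkSize is an assumption for illustration; the real chunk size comes from the cache manager options):

public class ChunkSplitDemo {
   public static void main(String[] args) {
      long fileChunkSize = 1L << 20;      // assumed 1 MiB chunk size, illustration only
      long offset = 1_572_864L;           // read starts 1.5 MiB into the source file
      int len = 2 * (1 << 20);            // read length: 2 MiB

      long objOffsetStepper = offset / fileChunkSize * fileChunkSize; // first chunk boundary: 1 MiB
      int remaining = len;
      while (remaining > 0) {
         long objHeadBytes = remaining == len ? offset % fileChunkSize : 0L;
         long objCopyBytes = Math.min(fileChunkSize - objHeadBytes, (long) remaining);
         // one CacheTranslationKey (and one future) per chunk-aligned object
         System.out.printf("chunk@%d skip=%d copy=%d%n", objOffsetStepper, objHeadBytes, objCopyBytes);
         objOffsetStepper += fileChunkSize;
         remaining = (int) ((long) remaining - (fileChunkSize - objHeadBytes));
      }
      // prints three chunks: 0.5 MiB (tail of chunk 1), 1 MiB (chunk 2), 0.5 MiB (head of chunk 3),
      // so a single readFully fans out into three cache lookups/loads
   }
}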

writeDataToCacheAsync write handling:

private CompletableFuture writeDataToCacheAsync(CacheMemoryLockController.CacheMemoryBufferRequest srcBufferRequest, CacheTranslationKey key, long writeTime) {
      return CompletableFuture.runAsync(() -> {
         Preconditions.checkState(srcBufferRequest.isChunkBufferAvailable(), "buffer must be available.");
         Preconditions.checkState(!srcBufferRequest.isChunkBufferInValid(), "buffer must be valid.");
         if (srcBufferRequest.isChunkBufferInError()) {
            throw new CompletionException(new IOException("chunk buffer is in error"));
         } else { 
            CacheMemoryLockController.GenerationMapEntry generationMapEntry = this.mlc.getRefOnGenerationNumber();
 
            try {
               long generationNumber = generationMapEntry.getGenerationNumber();
               // the actual write to the cache filesystem
               CacheFSController.PathInfo pathInfo = this.rwh.writeChunkFile(generationNumber, key, srcBufferRequest.getChunkBuffer().nioBuffer(0, (int)this.fileChunkSize), (int)this.fileChunkSize);
               CacheTranslationValue insertValue = new CacheTranslationValue(pathInfo, generationNumber, writeTime, 1);
               this.tdb.insert(key, insertValue);
               this.numNewBlocks.incrementAndGet();
               if (this.mlc.triggerForceOneDataWriteError()) {
                  throw new IOException("force one data write error for test");
               }
            } catch (FSState.IllegalFSStateException var15) {
               long currTimeMillis = System.currentTimeMillis();
               long diff = currTimeMillis - lastLoggedFSStateExceptionTime;
               if (diff > 1800000L) {
                  LOGGER.info("[{}] Unable to write cache chunk to disk due to {}", this.threadName, var15.getMessage());
                  lastLoggedFSStateExceptionTime = currTimeMillis;
               }
 
               this.nFailedFutures.incrementAndGet();
            } catch (Exception var16) {
               if (!this.loggedCacheWriteFailure) {
                  LOGGER.info("[{}] Unable to write cache chunk to disk due to {}", this.threadName, var16.getMessage());
                  this.loggedCacheWriteFailure = true;
               }
 
               this.nFailedFutures.incrementAndGet();
            } finally {
               this.mlc.releaseRefOnGenerationNumber(generationMapEntry);
            }
 
         }
      }, CACHE_THREAD_POOL);
}

CacheFSController's ReadWriteFSHandle handling:

 CacheFSController.PathInfo writeChunkFile(long generationNumber, CacheTranslationKey key, ByteBuffer srcBuf, int nBytes) throws Exception {
         // the target location is derived from the key's hash code
         int keyHash = key.hashCodeUnsignedInt();
         if (CacheFSController.this.fsState == FSState.ERROR) {
            throw new FSState.IllegalFSStateException("write-chunk-file exception, fs-controller is in ERROR");
         } else if (CacheFSController.this.fsState == FSState.READ_ONLY) {
            throw new FSState.IllegalFSStateException("write-chunk-file exception, fs-controller is READ_ONLY");
         } else {
            CacheFSController.PathInfo sdPathInfo = this.getSubDirPath(-1, false, keyHash);
            String relativeFilePath = ((CacheFSMountPoint)CacheFSController.this.mountPointMap.get(sdPathInfo.getMountPointId())).writeChunkFile(sdPathInfo.getRelativePath(), generationNumber, keyHash, srcBuf, nBytes);
            this.updateFileStats(key, (long)nBytes);
            return new CacheFSController.PathInfo(sdPathInfo.getMountPointId(), relativeFilePath, (long)nBytes);
         }
      }

The getSubDirPath method resolves the sub-directory where a key is stored:

private CacheFSController.PathInfo getSubDirPath(int mpID, boolean matchMountPoint, int keyHash) {
         CacheFSController.this.inFlightSubDirLock.writeLock().lock();
 
         CacheFSController.PathInfo sdPathInfo;
         try {
            sdPathInfo = (CacheFSController.PathInfo)CacheFSController.this.inFlightSubDirs.get(keyHash % CacheFSController.this.inFLightSubDirsListSize);
            if (mpID != -1 && mpID != sdPathInfo.getMountPointId()) {
               sdPathInfo = null;
 
               for(int i = 0; i < CacheFSController.this.inFLightSubDirsListSize; ++i) {
                  if (((CacheFSController.PathInfo)CacheFSController.this.inFlightSubDirs.get(i)).getMountPointId() == mpID) {
                     sdPathInfo = (CacheFSController.PathInfo)CacheFSController.this.inFlightSubDirs.get(i);
                     break;
                  }
               }
            }
            // inFLightSubDirsListSize defaults to 17
            if (sdPathInfo == null && !matchMountPoint) {
               sdPathInfo = (CacheFSController.PathInfo)CacheFSController.this.inFlightSubDirs.get(keyHash % CacheFSController.this.inFLightSubDirsListSize);
            }
         } catch (RuntimeException var13) {
            long currTimeMillis = System.currentTimeMillis();
            long diff = currTimeMillis - CacheFSController.lastLoggedFSStateExceptionTime;
            if (diff > CacheFSController.REPEAT_LOG_THRESHOLD) {
               CacheFSController.LOGGER.error("Caught exception in getSubDirPath", var13);
               CacheFSController.lastLoggedFSStateExceptionTime = currTimeMillis;
            }
 
            throw var13;
         } finally {
            CacheFSController.this.inFlightSubDirLock.writeLock().unlock();
         }
 
         if (!matchMountPoint) {
            Preconditions.checkNotNull(sdPathInfo, "found null sub-dir in in-flight-sub-dir list.");
         }
 
         return sdPathInfo;
}
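
As a concrete illustration (the hash value is made up): with the default in-flight list size of 17, a key whose unsigned hash is 1294751361 lands in slot 1294751361 % 17 = 13, so the write goes to whichever sub-directory currently occupies that slot.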

Building the actual file path for the write (the sub-directory path from above is passed in):

private Path toPath(String subDirName, long generationNumber, int keyHash, int retryCount) {
      String paddedHexHash = toPaddedHexString(keyHash, retryCount);
      return Paths.get(subDirName, generationNumber + "-" + paddedHexHash);
}
 
@VisibleForTesting
static String toPaddedHexString(int keyHash, int retryCount) {
      String paddedHexHash = Strings.padStart(Integer.toHexString(keyHash), 8, '0');
      if (retryCount != 0) {
         paddedHexHash = paddedHexHash + "_" + retryCount;
      }
 
      return paddedHexHash;
}
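
For example (values made up): with sub-directory "000012", generation number 7, key hash 0x1a2b3c4d and retryCount 0, toPath produces 000012/7-1a2b3c4d; a first retry would yield 000012/7-1a2b3c4d_1.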

The names of the files actually written follow the same format as in the screenshot above.

Notes

The above is only a brief walkthrough; for full details, decompile and read the source. Understanding the internals of cloud cache is useful in practice, for example for performance tuning and disk-space planning.
One more detail: when a mount point's directories accumulate too much data, a rotating cleanup is applied.
CacheAsyncByteReader is wired in through a Dremio config key; the concrete implementation also lives in the ce package (ce-kernel), as shown below:

public SeekableInputStream newStream() {
    try {
     // config key: dremio.plugins.iceberg.manifests.input_stream_factory
      SeekableInputStreamFactory factory = io.getContext() == null || io.getDataset() == null ?
          SeekableInputStreamFactory.DEFAULT :
          io.getContext().getConfig().getInstance(SeekableInputStreamFactory.KEY, SeekableInputStreamFactory.class,
              SeekableInputStreamFactory.DEFAULT);
      return factory.getStream(io.getFs(), io.getContext(),
          path, fileSize, mtime, io.getDataset(), io.getDatasourcePluginUID());
    } catch (FileNotFoundException e) {
      throw new NotFoundException(e, "Path %s not found.", path);
    } catch (IOException e) {
      throw new UncheckedIOException(String.format("Failed to create new input stream for file: %s", path), e);
    }
}
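
Since the factory is resolved through getInstance with the config key above, it can in principle be replaced by pointing that key at a custom SeekableInputStreamFactory implementation. A hedged sketch of such an override (the class name is hypothetical, and whether a given deployment supports overriding this key has not been verified):

dremio.plugins.iceberg.manifests.input_stream_factory: "com.example.CustomSeekableInputStreamFactory"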

References

sabot/kernel/src/main/java/com/dremio/exec/store/cache/BlockLocationsCacheManager.java
sabot/kernel/src/main/java/com/dremio/exec/store/cache/RecordingCacheReaderWriter.java
sabot/kernel/src/main/java/com/dremio/exec/store/dfs/SplitAssignmentTableFunction.java
sabot/kernel/src/main/java/com/dremio/exec/store/RecordReader.java
sabot/kernel/src/main/java/com/dremio/exec/store/iceberg/DremioInputFile.java
sabot/kernel/src/main/java/com/dremio/exec/store/iceberg/SeekableInputStreamFactory.java
sabot/kernel/src/main/java/com/dremio/exec/work/CacheManagerStoragePluginInfo.java
sabot/kernel/src/main/java/com/dremio/sabot/op/tablefunction/InternalTableFunctionFactory.java
sabot/kernel/src/main/java/com/dremio/sabot/op/tablefunction/TableFunctionOperator.java
sabot/kernel/src/main/java/com/dremio/exec/store/schedule/CompleteWork.java
sabot/kernel/src/main/java/com/dremio/exec/planner/physical/TableFunctionPrel.java
common/legacy/src/main/java/com/dremio/config/DremioConfig.java
sabot/kernel/src/main/java/com/dremio/exec/store/cache/RocksDbBroker.java
com/dremio/service/cachemanager/CacheFileSystemWrapper.java
common/legacy/src/main/java/com/dremio/io/AsyncByteReader.java
com/dremio/service/cachemanager/CacheManager.java
com/dremio/service/cachemanager/CacheDBController.java
com/dremio/service/cachemanager/CacheFSController.java
com/dremio/service/cachemanager/CacheFSMountPoint.java
com/dremio/service/cachemanager/CacheMemoryLockController.java
com/dremio/service/cachemanager/CacheAsyncByteReader.java
https://en.wikipedia.org/wiki/Rendezvous_hashing
