A brief look at Dremio's acceleration file system plugin

One of Dremio's most powerful features is reflection-based acceleration. To store reflection data, Dremio implements its own file system plugin.

Plugin class diagram

[Figure: class diagram of the acceleration storage plugin]

Parquet and Iceberg support

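As the start entry point shows, the plugin sets up a separate format plugin for each of the two formats. In fact, since version 21 Dremio stores reflection data in Iceberg format by default (which also requires distributed storage).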

  @Override
  public void start() throws IOException {
    super.start();
    // Also initialize the MaterializationStore, which mainly stores reflection metadata
    materializationStore = new MaterializationStore(DirectProvider.<LegacyKVStoreProvider>wrap(getContext().getKVStoreProvider()));
    formatPlugin = (ParquetFormatPlugin) formatCreator.getFormatPluginByConfig(new ParquetFormatConfig());
    icebergFormatPlugin = (IcebergFormatPlugin)formatCreator.getFormatPluginByConfig(new IcebergFormatConfig());
  }
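
The lookup pattern used in start (asking a creator for the plugin that matches a given format config) can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in written for this post: FormatRegistrySketch, ParquetConfig, IcebergConfig and the FormatPlugin interface are not Dremio classes, and the real formatCreator may resolve plugins differently.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: stand-in types only, not Dremio's format creator or format plugins. */
final class FormatRegistrySketch {

  /** Marker interface for a format configuration (assumed for illustration). */
  interface FormatConfig {}

  static final class ParquetConfig implements FormatConfig {}

  static final class IcebergConfig implements FormatConfig {}

  /** Minimal plugin abstraction; a real format plugin exposes far more than a name. */
  interface FormatPlugin {
    String name();
  }

  // One plugin instance per config type, registered up front.
  private final Map<Class<? extends FormatConfig>, FormatPlugin> plugins = new HashMap<>();

  void register(Class<? extends FormatConfig> configType, FormatPlugin plugin) {
    plugins.put(configType, plugin);
  }

  /** Returns the plugin registered for the runtime type of the given config. */
  FormatPlugin getByConfig(FormatConfig config) {
    FormatPlugin plugin = plugins.get(config.getClass());
    if (plugin == null) {
      throw new IllegalArgumentException(
          "No format plugin for " + config.getClass().getSimpleName());
    }
    return plugin;
  }
}
```

With a Parquet and an Iceberg plugin registered, getByConfig(new IcebergConfig()) returns the Iceberg one, much as start keeps one plugin reference per format.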

How the plugin handles the different data formats (mainly getDatasetHandle)

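  @Override
  public Optional<DatasetHandle> getDatasetHandle(EntityPath datasetPath, GetDatasetOption... options) throws ConnectorException {
    // Reflection data follows its own path convention, which is checked here; of the three components, only two are used below: the reflection id and the materialization id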
    List<String> components = normalizeComponents(datasetPath.getComponents());
    if (components == null) {
      return Optional.empty();
    }
    Preconditions.checkState(components.size() == 3, "Unexpected number of components in path");
 
    ReflectionId reflectionId = new ReflectionId(components.get(1));
    MaterializationId materializationId = new MaterializationId(components.get(2));
    Materialization materialization = getMaterialization(materializationId);
    if (materialization == null) {
      return Optional.empty();
    }
 
    FluentIterable<Refresh> refreshes = getSlices(materialization, reflectionId);
    if (refreshes == null) {
      return Optional.empty();
    }
 
    final String selectionRoot = getConfig().getPath().resolve(refreshes.first().get().getReflectionId().getId()).toString();
 
    BatchSchema currentSchema = CurrentSchemaOption.getSchema(options);
    FileConfig fileConfig = FileConfigOption.getFileConfig(options);
    List<String> sortColumns = SortColumnsOption.getSortColumns(options);
    Integer fieldCount = MaxLeafFieldCount.getCount(options);
 
    boolean icebergDataset = isUsingIcebergDataset(materialization);
    final FileSelection selection = getFileSelection(refreshes, selectionRoot, icebergDataset);
 
    final PreviousDatasetInfo pdi = new PreviousDatasetInfo(fileConfig, currentSchema, sortColumns, null, null, true);
    if (!icebergDataset) {
      FileDatasetHandle.checkMaxFiles(datasetPath.getName(), selection.getFileAttributesList().size(), getContext(), getConfig().isInternal());
    }
    // This overload does the actual work (see the source for details): it checks whether the dataset is Parquet or Iceberg and uses the matching accessor
    return getDatasetHandle(datasetPath, fieldCount, icebergDataset, selection, pdi);
  }
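
The two decisions made in the method above (extracting the ids from the path, then choosing an accessor by format) can be condensed into a small standalone sketch. It only assumes what the snippet shows: a three-component path whose second and third entries are the reflection id and the materialization id, and a boolean that selects Iceberg over Parquet. ReflectionPathSketch, ReflectionPath and AccessorKind are hypothetical names introduced here, not Dremio types.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of the path convention and format dispatch; not Dremio code. */
final class ReflectionPathSketch {

  /** Which kind of accessor would read the dataset (illustrative only). */
  enum AccessorKind { PARQUET, ICEBERG }

  /** The two identifiers taken from a reflection dataset path. */
  static final class ReflectionPath {
    final String reflectionId;
    final String materializationId;

    ReflectionPath(String reflectionId, String materializationId) {
      this.reflectionId = reflectionId;
      this.materializationId = materializationId;
    }
  }

  /**
   * Mirrors the checks in the snippet: a null component list yields an empty result,
   * and any other list must have exactly three components.
   */
  static Optional<ReflectionPath> parse(List<String> components) {
    if (components == null) {
      return Optional.empty();
    }
    if (components.size() != 3) {
      throw new IllegalStateException("Unexpected number of components in path");
    }
    // Index 0 is not read, matching the original method, which only uses 1 and 2.
    return Optional.of(new ReflectionPath(components.get(1), components.get(2)));
  }

  /** Chooses the accessor kind from the same boolean the plugin computes. */
  static AccessorKind chooseAccessor(boolean icebergDataset) {
    return icebergDataset ? AccessorKind.ICEBERG : AccessorKind.PARQUET;
  }
}
```

For example, parse(Arrays.asList("accel", "r1", "m1")) yields reflection id r1 and materialization id m1, where "accel" is just a placeholder for the first component.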

References

services/accelerator/src/main/java/com/dremio/service/reflection/materialization/AccelerationStoragePlugin.java
services/accelerator/src/main/java/com/dremio/service/reflection/materialization/AccelerationFileSystem.java
services/accelerator/src/main/java/com/dremio/service/reflection/materialization/AccelerationStoragePluginConfig.java
sabot/kernel/src/main/java/com/dremio/exec/store/iceberg/IcebergExecutionDatasetAccessor.java
sabot/kernel/src/main/java/com/dremio/exec/store/parquet/ParquetFormatDatasetAccessor.java

posted by 荣锋亮
