A brief look at Dremio's FormatCreator

The core responsibility of FormatCreator is to create format plugins (FormatPlugin instances).

Core capabilities

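As the figure shows, FormatCreator currently exists mainly to give FileSystemPlugin its format-handling capabilities: format detection, data preview, format plugin lookup, and format plugin creation.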

  • Constructor initialization

By default, the system FormatPlugin instances are initialized through dynamic class discovery and loading, as follows:

FormatCreator(
      SabotContext context,
      FileSystemConf<?, ?> storageConfig,
      ScanResult classpathScan,
      FileSystemPlugin fsPlugin) {
    this.context = context;
    this.storageConfig = storageConfig;
    this.fsPlugin = fsPlugin;
    // Dremio's classpath scan: collect every FormatPlugin implementation
    this.pluginClasses = classpathScan.getImplementations(FormatPlugin.class);
    this.configConstructors = initConfigConstructors(pluginClasses);
    Map<String, FormatPlugin> pluginsByName = Maps.newHashMap();
    Map<FormatPluginConfig, FormatPlugin> pluginsByConfig = Maps.newHashMap();
    List<FormatMatcher> formatMatchers = Lists.newArrayList();
    List<FormatMatcher> layeredFormatMatchers = Lists.newArrayList();
    // First, handle the default format configurations
    final Map<String, FormatPluginConfig> formats = getDefaultFormats();
    // Instantiate a format plugin for each default configuration
    if (formats != null && !formats.isEmpty()) {
      for (Map.Entry<String, FormatPluginConfig> e : formats.entrySet()) {
        Constructor<?> c = configConstructors.get(e.getValue().getClass());
        if (c == null) {
          logger.warn(
              "Unable to find constructor for storage config named '{}' of type '{}'.",
              e.getKey(),
              e.getValue().getClass().getName());
          continue;
        }
        try {
          FormatPlugin formatPlugin =
              (FormatPlugin) c.newInstance(e.getKey(), context, e.getValue(), fsPlugin);
          pluginsByName.put(e.getKey(), formatPlugin);
          pluginsByConfig.put(formatPlugin.getConfig(), formatPlugin);
 
          if (formatPlugin.isLayered()) {
            layeredFormatMatchers.add(formatPlugin.getMatcher());
            // add the layer ones at the top, so that they get checked first.
            formatMatchers.add(0, formatPlugin.getMatcher());
          } else {
            formatMatchers.add(formatPlugin.getMatcher());
          }
        } catch (InstantiationException
            | IllegalAccessException
            | IllegalArgumentException
            | InvocationTargetException e1) {
          logger.warn(
              "Failure initializing storage config named '{}' of type '{}'.",
              e.getKey(),
              e.getValue().getClass().getName(),
              e1);
        }
      }
    }
    // Fall back to default constructor based initialization.
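    // This is effectively the extension point: once the defaults are handled, any other
    // format plugin classes found by the classpath scan are loaded here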
    for (Class<? extends FormatPlugin> pluginClass : pluginClasses) {
      for (Constructor<?> c : pluginClass.getConstructors()) {
        try {
          if (!DEFAULT_BASED.check(c)) {
            continue;
          }
          FormatPlugin plugin = (FormatPlugin) c.newInstance(null, context, fsPlugin);
          // Skip duplicates: a plugin with this name may already have been registered above
          if (pluginsByName.containsKey(plugin.getName())) {
            continue;
          }
          pluginsByName.put(plugin.getName(), plugin);
          pluginsByConfig.put(plugin.getConfig(), plugin);
 
          if (plugin.isLayered()) {
            layeredFormatMatchers.add(plugin.getMatcher());
            // add the layer ones at the top, so that they get checked first.
            formatMatchers.add(0, plugin.getMatcher());
          } else {
            formatMatchers.add(plugin.getMatcher());
          }
        } catch (Exception e) {
          logger.warn(
              String.format(
                  "Failure while trying instantiate FormatPlugin %s.", pluginClass.getName()),
              e);
        }
      }
    }
    this.pluginsByName = Collections.unmodifiableMap(pluginsByName);
    this.pluginsByConfig = Collections.unmodifiableMap(pluginsByConfig);
    this.formatMatchers = Collections.unmodifiableList(formatMatchers);
    this.layeredFormatMatchers = Collections.unmodifiableList(layeredFormatMatchers);
}
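The registration logic in the constructor can be illustrated with a small standalone sketch. It does not use any Dremio classes: DemoPlugin, MatcherOrderDemo and matcherOrder are hypothetical names, and the "layered" flags are set purely for illustration. It is also simplified in that it de-duplicates by name on every pass, whereas the constructor above only does so in the fallback loop. What it shows is the ordering rule: matchers of layered plugins are inserted at index 0, so they are consulted before the others.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MatcherOrderDemo {
  // Hypothetical stand-in for a format plugin: just a name and a layered flag.
  static class DemoPlugin {
    final String name;
    final boolean layered;

    DemoPlugin(String name, boolean layered) {
      this.name = name;
      this.layered = layered;
    }
  }

  // Returns plugin names in the order their matchers would be consulted.
  static List<String> matcherOrder(List<DemoPlugin> candidates) {
    Set<String> seenNames = new HashSet<>();
    List<String> matchers = new ArrayList<>();
    for (DemoPlugin plugin : candidates) {
      // Skip a plugin whose name has already been registered.
      if (!seenNames.add(plugin.name)) {
        continue;
      }
      if (plugin.layered) {
        // Layered formats go to the front so that they are checked first.
        matchers.add(0, plugin.name);
      } else {
        matchers.add(plugin.name);
      }
    }
    return matchers;
  }

  public static void main(String[] args) {
    List<DemoPlugin> candidates = Arrays.asList(
        new DemoPlugin("csv", false),
        new DemoPlugin("iceberg", true), // layered flag assumed for the demo
        new DemoPlugin("json", false),
        new DemoPlugin("csv", false));   // duplicate name, ignored
    System.out.println(matcherOrder(candidates)); // prints [iceberg, csv, json]
  }
}
```

One consequence of inserting at index 0 is that when several layered plugins are registered, the one registered last ends up first in the list.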
  • Default format configuration

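getDefaultFormats supplies the configurations for the formats Dremio supports out of the box; this is how the built-in format plugins get discovered and registered.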

public static Map<String, FormatPluginConfig> getDefaultFormats() {
  Map<String, FormatPluginConfig> defaultFormats = new TreeMap<>();
  defaultFormats.put("csv", createTextFormatPlugin(false, ",", Lists.newArrayList("csv")));
  defaultFormats.put("csvh", createTextFormatPlugin(true, ",", Lists.newArrayList("csvh")));
  defaultFormats.put("tsv", createTextFormatPlugin(false, "\t", Lists.newArrayList("tsv")));
  defaultFormats.put("psv", createTextFormatPlugin(false, "|", Lists.newArrayList("psv", "tbl")));
  defaultFormats.put("txt", createTextFormatPlugin(false, "\u0000", Lists.newArrayList("txt")));
  TextFormatConfig psva = createTextFormatPlugin(false, "|", Lists.newArrayList("psva", "tbla"));
  psva.autoGenerateColumnNames = true;
  defaultFormats.put("psva", psva);
 
  defaultFormats.put("parquet", new ParquetFormatConfig());
  defaultFormats.put("json", new JSONFormatPlugin.JSONFormatConfig());
  defaultFormats.put("dremarrow1", new ArrowFormatPluginConfig());
  defaultFormats.put("iceberg", new IcebergFormatConfig());
  defaultFormats.put("delta", new DeltaLakeFormatConfig());
  defaultFormats.put("xls", new ExcelFormatPluginConfig(true));
  defaultFormats.put("excel", new ExcelFormatPluginConfig(false));
  return defaultFormats;
}
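To make the shape of this data concrete, here is a minimal standalone sketch that copies only the text-format names and extension lists from the listing above into a TreeMap. DefaultFormatLookupDemo and formatForFile are hypothetical names. In Dremio itself, format resolution goes through each plugin's FormatMatcher, so this is only an illustration of the name-to-extensions mapping and of the fact that a TreeMap iterates its keys in sorted order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DefaultFormatLookupDemo {
  // Format name -> file extensions, copied from the text formats listed above.
  static Map<String, List<String>> textFormatExtensions() {
    Map<String, List<String>> formats = new TreeMap<>();
    formats.put("csv", Arrays.asList("csv"));
    formats.put("csvh", Arrays.asList("csvh"));
    formats.put("tsv", Arrays.asList("tsv"));
    formats.put("psv", Arrays.asList("psv", "tbl"));
    formats.put("txt", Arrays.asList("txt"));
    formats.put("psva", Arrays.asList("psva", "tbla"));
    return formats;
  }

  // Hypothetical helper: first format whose extension list contains the
  // file's extension, or null when nothing matches.
  static String formatForFile(String fileName) {
    int dot = fileName.lastIndexOf('.');
    if (dot < 0) {
      return null;
    }
    String extension = fileName.substring(dot + 1);
    for (Map.Entry<String, List<String>> entry : textFormatExtensions().entrySet()) {
      if (entry.getValue().contains(extension)) {
        return entry.getKey();
      }
    }
    return null;
  }

  public static void main(String[] args) {
    // TreeMap keys come back sorted, regardless of insertion order.
    System.out.println(textFormatExtensions().keySet()); // prints [csv, csvh, psv, psva, tsv, txt]
    System.out.println(formatForFile("lineitem.tbl"));   // prints psv
    System.out.println(formatForFile("data.parquet"));   // prints null
  }
}
```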
  • Usage

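As the figure shows, the current callers are the file system implementation classes, mainly for table creation and for associating a dataset with its format plugin.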

FormatPlugin implementations

  • FormatPlugin interface definition

Note that getWriter is mainly used for write operations. Actual data reads, however, do not go through getRecordReader; that method exists only for previews.

public interface FormatPlugin {
  public boolean supportsRead();
 
  public boolean supportsWrite();
 
  // Is a layer on top of other single/multiple file formats.
  default boolean isLayered() {
    return false;
  }
 
  public SabotContext getContext();
 
  /**
   * Indicates whether this FormatPlugin supports auto-partitioning for CTAS statements
   *
   * @return true if auto-partitioning is supported
   */
  public boolean supportsAutoPartitioning();
 
  public FormatMatcher getMatcher();
 
  public AbstractWriter getWriter(
      PhysicalOperator child,
      String location,
      FileSystemPlugin<?> plugin,
      WriterOptions options,
      OpProps props)
      throws IOException;
 
  public FormatPluginConfig getConfig();
 
  public String getName();
 
  FileDatasetHandle getDatasetAccessor(
      DatasetType type,
      PreviousDatasetInfo previousInfo,
      FileSystem fs,
      FileSelection fileSelection,
      FileSystemPlugin<?> fsPlugin,
      NamespaceKey tableSchemaPath,
      FileUpdateKey updateKey,
      int maxLeafColumns,
      TimeTravelOption.TimeTravelRequest timeTravelRequest);
 
  /** Get a record reader specifically for the purposes of previews. */
  public RecordReader getRecordReader(
      final OperatorContext context, final FileSystem dfs, final FileAttributes attributes)
      throws ExecutionSetupException;
 
  default FileSelectionProcessor getFileSelectionProcessor(
      FileSystem fs, FileSelection fileSelection) {
    return new DefaultFileSelectionProcessor(fs, fileSelection, getMaxFilesLimit());
  }
 
  /** Get the files under a path for sample data purpose */
  DirectoryStream<FileAttributes> getFilesForSamples(
      FileSystem fs, FileSystemPlugin<?> fsPlugin, Path path)
      throws IOException, FileCountTooLargeException;
 
  /**
   * @return Returns the max number of files supported by this format plugin
   */
  default int getMaxFilesLimit() {
    return Math.toIntExact(
        getContext().getOptionManager().getOption(FileDatasetHandle.DFS_MAX_FILES));
  }
}
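The interface relies on default methods so that an implementation only overrides what differs from the common case. The sketch below is a heavily reduced, hypothetical stand-in (MiniFormatPlugin, TextLikePlugin, TableLikePlugin and maxFilesOption are invented for the example and are not Dremio types). It mirrors two details of the listing: isLayered() defaults to false, and the max-files limit is narrowed from a long with Math.toIntExact, which throws ArithmeticException rather than silently overflowing.

```java
class MiniFormatPluginDemo {
  // Hypothetical, reduced stand-in for the FormatPlugin interface.
  interface MiniFormatPlugin {
    String getName();

    boolean supportsRead();

    boolean supportsWrite();

    // Stand-in for reading the max-files option, which is a long value.
    long maxFilesOption();

    // Most formats are not layered, so the default is false.
    default boolean isLayered() {
      return false;
    }

    // Narrow the long option to int; fails loudly on overflow.
    default int getMaxFilesLimit() {
      return Math.toIntExact(maxFilesOption());
    }
  }

  // A plain file format: keeps every default.
  static class TextLikePlugin implements MiniFormatPlugin {
    public String getName() { return "text"; }
    public boolean supportsRead() { return true; }
    public boolean supportsWrite() { return true; }
    public long maxFilesOption() { return 1000L; }
  }

  // A format layered on top of other files: overrides isLayered().
  static class TableLikePlugin implements MiniFormatPlugin {
    public String getName() { return "table"; }
    public boolean supportsRead() { return true; }
    public boolean supportsWrite() { return false; }
    public long maxFilesOption() { return 500L; }

    @Override
    public boolean isLayered() { return true; }
  }

  static String describe(MiniFormatPlugin plugin) {
    return plugin.getName()
        + " read=" + plugin.supportsRead()
        + " write=" + plugin.supportsWrite()
        + " layered=" + plugin.isLayered()
        + " maxFiles=" + plugin.getMaxFilesLimit();
  }

  public static void main(String[] args) {
    System.out.println(describe(new TextLikePlugin()));
    // prints text read=true write=true layered=false maxFiles=1000
    System.out.println(describe(new TableLikePlugin()));
    // prints table read=true write=false layered=true maxFiles=500
  }
}
```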
  • Current implementations

  • Usage

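FormatPlugin is used in many places, including dataset promotion, format preview, the various file-system-based storage plugin implementations, and FormatMatcher. See the actual source code for the details.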

Notes

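The above is a brief introduction to FormatCreator and FormatPlugin. Understanding the internal mechanism makes it easier to develop our own format plugins and so simplifies data processing.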

References

sabot/kernel/src/main/java/com/dremio/exec/store/dfs/FormatCreator.java
sabot/kernel/src/main/java/com/dremio/exec/store/dfs/FormatPlugin.java
sabot/kernel/src/main/java/com/dremio/exec/store/dfs/EasyFileSystemCreateTableEntry.java
sabot/kernel/src/main/java/com/dremio/exec/planner/physical/WriterPrel.java
common/legacy/src/main/java/com/dremio/common/logical/FormatPluginConfig.java

posted on 2024-04-17 19:49  荣锋亮
