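A brief look at dremio FormatCreator
The core responsibility of FormatCreator is to create Format plugins
Core capabilities
As shown in the figure below, its main job at present is to give FileSystemPlugin its format handling capabilities (format detection, data preview, format plugin lookup, and format plugin creation)
- Constructor initialization
By default, the system FormatPlugin instances are initialized through dynamic class discovery and loading, as follows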
FormatCreator(
    SabotContext context,
    FileSystemConf<?, ?> storageConfig,
    ScanResult classpathScan,
    FileSystemPlugin fsPlugin) {
  this.context = context;
  this.storageConfig = storageConfig;
  this.fsPlugin = fsPlugin;
  // dremio's classpath scan: collect every FormatPlugin implementation
  this.pluginClasses = classpathScan.getImplementations(FormatPlugin.class);
  this.configConstructors = initConfigConstructors(pluginClasses);
  Map<String, FormatPlugin> pluginsByName = Maps.newHashMap();
  Map<FormatPluginConfig, FormatPlugin> pluginsByConfig = Maps.newHashMap();
  List<FormatMatcher> formatMatchers = Lists.newArrayList();
  List<FormatMatcher> layeredFormatMatchers = Lists.newArrayList();
  // The default format configurations are handled first
  final Map<String, FormatPluginConfig> formats = getDefaultFormats();
  // Create the format plugin instances
  if (formats != null && !formats.isEmpty()) {
    for (Map.Entry<String, FormatPluginConfig> e : formats.entrySet()) {
      Constructor<?> c = configConstructors.get(e.getValue().getClass());
      if (c == null) {
        logger.warn(
            "Unable to find constructor for storage config named '{}' of type '{}'.",
            e.getKey(),
            e.getValue().getClass().getName());
        continue;
      }
      try {
        FormatPlugin formatPlugin =
            (FormatPlugin) c.newInstance(e.getKey(), context, e.getValue(), fsPlugin);
        pluginsByName.put(e.getKey(), formatPlugin);
        pluginsByConfig.put(formatPlugin.getConfig(), formatPlugin);
        if (formatPlugin.isLayered()) {
          layeredFormatMatchers.add(formatPlugin.getMatcher());
          // add the layer ones at the top, so that they get checked first.
          formatMatchers.add(0, formatPlugin.getMatcher());
        } else {
          formatMatchers.add(formatPlugin.getMatcher());
        }
      } catch (InstantiationException
          | IllegalAccessException
          | IllegalArgumentException
          | InvocationTargetException e1) {
        logger.warn(
            "Failure initializing storage config named '{}' of type '{}'.",
            e.getKey(),
            e.getValue().getClass().getName(),
            e1);
      }
    }
  }
  // Fall back to default constructor based initialization.
  // This is effectively the extension point: once the defaults are handled,
  // the other format plugin classes found by the scan are loaded
  for (Class<? extends FormatPlugin> pluginClass : pluginClasses) {
    for (Constructor<?> c : pluginClass.getConstructors()) {
      try {
        if (!DEFAULT_BASED.check(c)) {
          continue;
        }
        FormatPlugin plugin = (FormatPlugin) c.newInstance(null, context, fsPlugin);
        // De-duplicate first, because the format plugin may already have been registered above
        if (pluginsByName.containsKey(plugin.getName())) {
          continue;
        }
        pluginsByName.put(plugin.getName(), plugin);
        pluginsByConfig.put(plugin.getConfig(), plugin);
        if (plugin.isLayered()) {
          layeredFormatMatchers.add(plugin.getMatcher());
          // add the layer ones at the top, so that they get checked first.
          formatMatchers.add(0, plugin.getMatcher());
        } else {
          formatMatchers.add(plugin.getMatcher());
        }
      } catch (Exception e) {
        logger.warn(
            String.format(
                "Failure while trying instantiate FormatPlugin %s.", pluginClass.getName()),
            e);
      }
    }
  }
  this.pluginsByName = Collections.unmodifiableMap(pluginsByName);
  this.pluginsByConfig = Collections.unmodifiableMap(pluginsByConfig);
  this.formatMatchers = Collections.unmodifiableList(formatMatchers);
  this.layeredFormatMatchers = Collections.unmodifiableList(layeredFormatMatchers);
}
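The registration rules in the constructor (defaults first, scanned plugins second, skip names that are already taken, put layered matchers at the front of the list) can be tried out in isolation. The following is only a minimal sketch: MatcherOrderSketch and SketchPlugin are hypothetical stand-ins rather than dremio classes, matchers are represented by plain plugin names, and the layered flags are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the registration logic; not dremio code.
final class MatcherOrderSketch {

  // Stand-in for a format plugin: only the name and the layered flag matter here.
  static final class SketchPlugin {
    final String name;
    final boolean layered;

    SketchPlugin(String name, boolean layered) {
      this.name = name;
      this.layered = layered;
    }
  }

  final Map<String, SketchPlugin> pluginsByName = new LinkedHashMap<>();
  final List<String> formatMatchers = new ArrayList<>();
  final List<String> layeredFormatMatchers = new ArrayList<>();

  // Registers a plugin unless its name is already taken; returns true if it was added.
  boolean register(SketchPlugin plugin) {
    if (pluginsByName.containsKey(plugin.name)) {
      return false;
    }
    pluginsByName.put(plugin.name, plugin);
    if (plugin.layered) {
      layeredFormatMatchers.add(plugin.name);
      // Layered matchers go to the front so that they are checked first.
      formatMatchers.add(0, plugin.name);
    } else {
      formatMatchers.add(plugin.name);
    }
    return true;
  }

  static MatcherOrderSketch demo() {
    MatcherOrderSketch sketch = new MatcherOrderSketch();
    // Phase 1: plugins built from the default configurations.
    sketch.register(new SketchPlugin("csv", false));
    sketch.register(new SketchPlugin("iceberg", true)); // assumed to be layered
    sketch.register(new SketchPlugin("json", false));
    // Phase 2: plugins found by the classpath scan.
    sketch.register(new SketchPlugin("json", false)); // skipped, name already registered
    sketch.register(new SketchPlugin("custom", false));
    return sketch;
  }

  public static void main(String[] args) {
    MatcherOrderSketch sketch = demo();
    System.out.println(sketch.formatMatchers); // [iceberg, csv, json, custom]
    System.out.println(sketch.layeredFormatMatchers); // [iceberg]
  }
}
```

One difference from the real constructor: there only the second loop checks for an existing name, since the first loop iterates over map keys, which are unique anyway.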
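- Default format configuration
getDefaultFormats supplies the formats that dremio supports out of the box, so that the system's default format plugins can be discovered and registered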
public static Map<String, FormatPluginConfig> getDefaultFormats() {
  Map<String, FormatPluginConfig> defaultFormats = new TreeMap<>();
  defaultFormats.put("csv", createTextFormatPlugin(false, ",", Lists.newArrayList("csv")));
  defaultFormats.put("csvh", createTextFormatPlugin(true, ",", Lists.newArrayList("csvh")));
  defaultFormats.put("tsv", createTextFormatPlugin(false, "\t", Lists.newArrayList("tsv")));
  defaultFormats.put("psv", createTextFormatPlugin(false, "|", Lists.newArrayList("psv", "tbl")));
  defaultFormats.put("txt", createTextFormatPlugin(false, "\u0000", Lists.newArrayList("txt")));
  TextFormatConfig psva = createTextFormatPlugin(false, "|", Lists.newArrayList("psva", "tbla"));
  psva.autoGenerateColumnNames = true;
  defaultFormats.put("psva", psva);
  defaultFormats.put("parquet", new ParquetFormatConfig());
  defaultFormats.put("json", new JSONFormatPlugin.JSONFormatConfig());
  defaultFormats.put("dremarrow1", new ArrowFormatPluginConfig());
  defaultFormats.put("iceberg", new IcebergFormatConfig());
  defaultFormats.put("delta", new DeltaLakeFormatConfig());
  defaultFormats.put("xls", new ExcelFormatPluginConfig(true));
  defaultFormats.put("excel", new ExcelFormatPluginConfig(false));
  return defaultFormats;
}
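Because getDefaultFormats returns a TreeMap, the constructor iterates the defaults in sorted key order rather than in the order of the put calls, and that iteration order is what decides where the non-layered matchers land in the list. A small sketch of the resulting order follows; DefaultFormatOrderSketch is a hypothetical class, and the values are placeholder objects, not real format configs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper that only demonstrates TreeMap key ordering.
final class DefaultFormatOrderSketch {

  // Returns the default format names in the order a TreeMap iterates them.
  static List<String> orderedNames() {
    Map<String, Object> defaults = new TreeMap<>();
    // Names in the same order as the put calls in getDefaultFormats().
    List<String> insertionOrder =
        List.of(
            "csv", "csvh", "tsv", "psv", "txt", "psva", "parquet",
            "json", "dremarrow1", "iceberg", "delta", "xls", "excel");
    for (String name : insertionOrder) {
      defaults.put(name, new Object()); // placeholder value
    }
    return new ArrayList<>(defaults.keySet());
  }

  public static void main(String[] args) {
    // prints [csv, csvh, delta, dremarrow1, excel, iceberg, json, parquet, psv, psva, tsv, txt, xls]
    System.out.println(orderedNames());
  }
}
```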
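- Usage
As shown in the figure below, the current callers are the file system implementation classes, mainly for table creation and for associating a dataset with its format plugin
FormatPlugin implementations
- FormatPlugin interface definition
Note that getWriter is mainly for write operations, and that regular data reads do not go through getRecordReader, which is only used for previews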
public interface FormatPlugin {
  public boolean supportsRead();
  public boolean supportsWrite();
  // Is a layer on top of other single/multiple file formats.
  default boolean isLayered() {
    return false;
  }
  public SabotContext getContext();
  /**
   * Indicates whether this FormatPlugin supports auto-partitioning for CTAS statements
   *
   * @return true if auto-partitioning is supported
   */
  public boolean supportsAutoPartitioning();
  public FormatMatcher getMatcher();
  public AbstractWriter getWriter(
      PhysicalOperator child,
      String location,
      FileSystemPlugin<?> plugin,
      WriterOptions options,
      OpProps props)
      throws IOException;
  public FormatPluginConfig getConfig();
  public String getName();
  FileDatasetHandle getDatasetAccessor(
      DatasetType type,
      PreviousDatasetInfo previousInfo,
      FileSystem fs,
      FileSelection fileSelection,
      FileSystemPlugin<?> fsPlugin,
      NamespaceKey tableSchemaPath,
      FileUpdateKey updateKey,
      int maxLeafColumns,
      TimeTravelOption.TimeTravelRequest timeTravelRequest);
  /** Get a record reader specifically for the purposes of previews. */
  public RecordReader getRecordReader(
      final OperatorContext context, final FileSystem dfs, final FileAttributes attributes)
      throws ExecutionSetupException;
  default FileSelectionProcessor getFileSelectionProcessor(
      FileSystem fs, FileSelection fileSelection) {
    return new DefaultFileSelectionProcessor(fs, fileSelection, getMaxFilesLimit());
  }
  /** Get the files under a path for sample data purpose */
  DirectoryStream<FileAttributes> getFilesForSamples(
      FileSystem fs, FileSystemPlugin<?> fsPlugin, Path path)
      throws IOException, FileCountTooLargeException;
  /**
   * @return Returns the max number of files supported by this format plugin
   */
  default int getMaxFilesLimit() {
    return Math.toIntExact(
        getContext().getOptionManager().getOption(FileDatasetHandle.DFS_MAX_FILES));
  }
}
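One detail of the interface worth a closer look is the default getMaxFilesLimit, which narrows a long option value with Math.toIntExact. That method throws ArithmeticException instead of silently truncating when the value does not fit in an int. The sketch below shows the behavior with a hypothetical LimitSource interface standing in for the option lookup; the value 300000 is an arbitrary example, not dremio's actual setting.

```java
// Hypothetical model of a default method that narrows a long option to an int.
final class MaxFilesLimitSketch {

  // Stand-in for "read a long option from the option manager".
  interface LimitSource {
    long maxFilesOption();

    // Same pattern as FormatPlugin.getMaxFilesLimit(): fail loudly on overflow.
    default int getMaxFilesLimit() {
      return Math.toIntExact(maxFilesOption());
    }
  }

  // Returns the limit for the given option value, or throws ArithmeticException.
  static int limitFor(long optionValue) {
    LimitSource source = () -> optionValue;
    return source.getMaxFilesLimit();
  }

  // Returns true if the option value cannot be represented as an int.
  static boolean overflows(long optionValue) {
    try {
      limitFor(optionValue);
      return false;
    } catch (ArithmeticException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(limitFor(300_000L)); // 300000
    System.out.println(overflows(Integer.MAX_VALUE + 1L)); // true
  }
}
```

An implementation that needs a different limit can simply override the default method, as with isLayered.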
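- Current implementations
- Usage
FormatPlugin is used in many places, including data promotion, format preview, the various file-system style storage plugin implementations, and FormatMatcher; see the actual source code for details
Summary
The above is a brief introduction to FormatCreator and FormatPlugin. Understanding the internal mechanism helps us develop our own format plugins and makes data processing easier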
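For a custom plugin to be picked up by the fallback loop in the FormatCreator constructor, it apparently needs a public constructor that passes DEFAULT_BASED.check and accepts the arguments (null, context, fsPlugin). The article does not show what DEFAULT_BASED actually checks, so the sketch below is only an assumption: it compares parameter types exactly, and FakeContext, FakeFsPlugin and MyFormatPlugin are hypothetical stand-ins for SabotContext, FileSystemPlugin and a real FormatPlugin implementation.

```java
import java.lang.reflect.Constructor;
import java.util.Arrays;

// Hypothetical sketch of a constructor-signature check followed by reflective creation.
final class ConstructorCheckSketch {

  // Stand-ins for SabotContext and FileSystemPlugin.
  static final class FakeContext {}

  static final class FakeFsPlugin {}

  // A made-up plugin exposing the three-argument constructor.
  public static final class MyFormatPlugin {
    private final String name;

    public MyFormatPlugin(String name, FakeContext context, FakeFsPlugin fsPlugin) {
      // The fallback path passes null as the name, so the plugin supplies its own.
      this.name = name == null ? "myformat" : name;
    }

    public String getName() {
      return name;
    }
  }

  // Assumed check: the constructor's parameter types equal the expected ones.
  static boolean matches(Constructor<?> c, Class<?>... expected) {
    return Arrays.equals(c.getParameterTypes(), expected);
  }

  // Finds the matching public constructor, instantiates the plugin, and returns its name.
  static String instantiateName() {
    for (Constructor<?> c : MyFormatPlugin.class.getConstructors()) {
      if (!matches(c, String.class, FakeContext.class, FakeFsPlugin.class)) {
        continue;
      }
      try {
        Object plugin = c.newInstance(null, new FakeContext(), new FakeFsPlugin());
        return ((MyFormatPlugin) plugin).getName();
      } catch (ReflectiveOperationException e) {
        throw new IllegalStateException("could not instantiate plugin", e);
      }
    }
    return null; // no suitable constructor found
  }

  public static void main(String[] args) {
    System.out.println(instantiateName()); // myformat
  }
}
```

A real plugin would additionally have to implement the FormatPlugin interface and be visible to dremio's classpath scan.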
References
sabot/kernel/src/main/java/com/dremio/exec/store/dfs/FormatCreator.java
sabot/kernel/src/main/java/com/dremio/exec/store/dfs/FormatPlugin.java
sabot/kernel/src/main/java/com/dremio/exec/store/dfs/EasyFileSystemCreateTableEntry.java
sabot/kernel/src/main/java/com/dremio/exec/planner/physical/WriterPrel.java
common/legacy/src/main/java/com/dremio/common/logical/FormatPluginConfig.java