A quick note on the execution order of the dremio FragmentExecutor
After dremio converts the plan into a physical plan, the execution plan contains a number of fragments. These fragments form a tree, described by PlanFragmentMajor and PlanFragmentMinor.
Once the tree has been assembled, dremio needs to schedule it for execution (this involves resource allocation, priorities, and the actual operators; the overall flow is roughly the one described in Drill's query-execution documentation).
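To make the tree structure concrete, here is a minimal, purely hypothetical sketch (not dremio code): each major fragment is one phase of the plan and is parallelized into a number of minor fragments, the per-node instances that a FragmentExecutor actually runs; leaf fragments (scans) sit at the bottom of the tree. Every name in the sketch is made up for illustration.

import java.util.ArrayList;
import java.util.List;

public class FragmentTreeSketch {
  // hypothetical, simplified node: one major fragment (phase) plus its degree of parallelism
  static final class MajorFragment {
    final int majorId;                                        // phase number within the query
    final int minorCount;                                     // number of minor fragments (instances)
    final List<MajorFragment> children = new ArrayList<>();   // fragments feeding this one

    MajorFragment(int majorId, int minorCount) {
      this.majorId = majorId;
      this.minorCount = minorCount;
    }
  }

  public static void main(String[] args) {
    // root (major 0) <- intermediate (major 1) <- leaf scans (major 2)
    MajorFragment leaf = new MajorFragment(2, 8);
    MajorFragment mid = new MajorFragment(1, 4);
    MajorFragment root = new MajorFragment(0, 1);
    mid.children.add(leaf);
    root.children.add(mid);
    print(root, "");
  }

  static void print(MajorFragment f, String indent) {
    System.out.println(indent + "major=" + f.majorId + " minors=" + f.minorCount);
    for (MajorFragment child : f.children) {
      print(child, indent + "  ");
    }
  }
}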
Starting fragments in FragmentExecutors
As shown below, the work is wrapped in a QueryStarterImpl:
public void startFragments(final InitializeFragments fragments, final FragmentExecutorBuilder builder,
    final StreamObserver<Empty> sender, final NodeEndpoint identity) {
  final SchedulingInfo schedulingInfo = fragments.hasSchedulingInfo() ? fragments.getSchedulingInfo() : null;
  QueryStarterImpl queryStarter = new QueryStarterImpl(fragments, builder, sender, identity, schedulingInfo);
  // FragmentExecutorBuilder builds the FragmentExecutors; internally it uses QueriesClerk to handle
  // workload- and query-level allocations. Once that setup is done it calls back into
  // QueryStarterImpl.buildAndStartQuery.
  builder.buildAndStartQuery(queryStarter.getFirstFragment(), schedulingInfo, queryStarter);
}
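The outcome of startFragments flows back over gRPC through the sender: onNext/onCompleted on success, onError with a UserRpcException on failure (see the finally block in buildAndStartQuery below). As a hedged illustration only, not dremio code, a minimal observer playing the sender role could look like this:

import com.google.protobuf.Empty;
import io.grpc.stub.StreamObserver;

// Minimal sketch of the sender side, for illustration only.
public class SenderSketch {
  static StreamObserver<Empty> newSender() {
    return new StreamObserver<Empty>() {
      @Override
      public void onNext(Empty value) {
        // all fragments were built and their start was attempted
      }

      @Override
      public void onError(Throwable t) {
        // a UserRpcException raised while building the fragments
        t.printStackTrace();
      }

      @Override
      public void onCompleted() {
        System.out.println("startFragments acknowledged");
      }
    };
  }
}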
Handling the execution order
buildAndStartQuery in QueryStarterImpl:
public void buildAndStartQuery(final QueryTicket queryTicket) {
  QueryId queryId = queryTicket.getQueryId();
  /**
   * To avoid race conditions between creation and deletion of phase/fragment tickets,
   * build all the fragments first (creates the tickets) and then, start the fragments (can
   * delete tickets).
   */
  List<FragmentExecutor> fragmentExecutors = new ArrayList<>();
  UserRpcException userRpcException = null;
  Set<FragmentHandle> fragmentHandlesForQuery = Sets.newHashSet();
  try {
    if (!maestroProxy.tryStartQuery(queryId, queryTicket, initializeFragments.getQuerySentTime())) {
      boolean isDuplicateStart = maestroProxy.isQueryStarted(queryId);
      if (isDuplicateStart) {
        // duplicate op, do nothing.
        return;
      } else {
        throw new IllegalStateException("query already cancelled");
      }
    }
    Map<Integer, Integer> priorityToWeightMap = buildPriorityToWeightMap();
    for (PlanFragmentFull fragment : fullFragments) {
      // build the FragmentExecutor
      FragmentExecutor fe = buildFragment(queryTicket, fragment, priorityToWeightMap.getOrDefault(fragment.getMajor().getFragmentExecWeight(), 1), schedulingInfo);
      fragmentHandlesForQuery.add(fe.getHandle());
      fragmentExecutors.add(fe);
    }
  } catch (UserRpcException e) {
    userRpcException = e;
  } catch (Exception e) {
    userRpcException = new UserRpcException(NodeEndpoint.getDefaultInstance(), "Remote message leaked.", e);
  } finally {
    if (fragmentHandlesForQuery.size() > 0) {
      maestroProxy.initFragmentHandlesForQuery(queryId, fragmentHandlesForQuery);
    }
    // highest weight first with leaf fragments first bottom up
    // sort the executors, then start each FragmentExecutor; the actual processing matches what
    // Drill's query-execution documentation describes
    fragmentExecutors.sort(weightBasedComparator());
    for (FragmentExecutor fe : fragmentExecutors) {
      startFragment(fe);
    }
    queryTicket.release();
    if (userRpcException == null) {
      sender.onNext(Empty.getDefaultInstance());
      sender.onCompleted();
    } else {
      sender.onError(userRpcException);
    }
  }
  // if there was a cancel while the start was in-progress, clean up.
  if (maestroProxy.isQueryCancelled(queryId)) {
    cancelFragments(queryId, builder.getClerk());
  }
}
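The javadoc at the top of the method explains the build-all-then-start-all pattern: building a fragment creates phase/fragment tickets, while starting (and finishing) a fragment can delete them. A hypothetical reference-counting sketch of the idea follows; dremio's actual ticket classes are more involved, so treat this purely as an illustration.

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch (not dremio's ticket implementation): a phase ticket that is created by the
// first fragment of a phase and destroyed when the last fragment of that phase releases it.
final class PhaseTicketSketch {
  private final AtomicInteger refs = new AtomicInteger();

  void reserve() {                                     // taken while building a fragment of this phase
    refs.incrementAndGet();
  }

  void release() {                                     // dropped when a fragment of this phase finishes
    if (refs.decrementAndGet() == 0) {
      System.out.println("phase ticket destroyed");    // deletion can race with a concurrent creation
    }
  }
}

If fragments were started as soon as each one was built, a fast fragment could finish and drop the last reference to its phase ticket while a later fragment of the same phase was still about to reserve it; building all fragments first means every ticket exists before any start can delete one.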
buildFragment: building the FragmentExecutor
private FragmentExecutor buildFragment(final QueryTicket queryTicket, final PlanFragmentFull fragment,
    final int schedulingWeight, final SchedulingInfo schedulingInfo) throws UserRpcException {
  if (fragment.getMajor().getFragmentExecWeight() <= 0) {
    logger.info("Received remote fragment start instruction for {}", QueryIdHelper.getQueryIdentifier(fragment.getHandle()));
  } else {
    logger.info("Received remote fragment start instruction for {} with assigned weight {} and scheduling weight {}",
        QueryIdHelper.getQueryIdentifier(fragment.getHandle()), fragment.getMajor().getFragmentExecWeight(), schedulingWeight);
  }
  try {
    final EventProvider eventProvider = getEventProvider(fragment.getHandle());
    return builder.build(queryTicket, fragment, schedulingWeight, useMemoryArbiter ? memoryArbiter : null, eventProvider, schedulingInfo, fragmentReader);
  } catch (final Exception e) {
    throw new UserRpcException(identity, "Failure while trying to start remote fragment", e);
  } catch (final OutOfMemoryError t) {
    if (t.getMessage().startsWith("Direct buffer")) {
      throw new UserRpcException(identity, "Out of direct memory while trying to start remote fragment", t);
    } else {
      throw t;
    }
  }
}
Weight comparison
// The execution weight is read from PlanFragmentFull; it is assigned during dremio's parallelization
// (SimpleParallelizer, which derives it from resource estimates; more on that later)
private Map<Integer, Integer> buildPriorityToWeightMap() {
  Map<Integer, Integer> priorityToWeightMap = new HashMap<>();
  if (fullFragments.get(0).getMajor().getFragmentExecWeight() <= 0) {
    return priorityToWeightMap;
  }
  for (PlanFragmentFull fragment : fullFragments) {
    int fragmentWeight = fragment.getMajor().getFragmentExecWeight();
    priorityToWeightMap.put(fragmentWeight, Math.min(fragmentWeight, QueryTicket.MAX_EXPECTED_SIZE));
  }
  return priorityToWeightMap;
}
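The map simply caps each assigned weight at QueryTicket.MAX_EXPECTED_SIZE, and when no weights were assigned (<= 0) it stays empty, so buildAndStartQuery falls back to a scheduling weight of 1 via getOrDefault. A standalone sketch of that behavior (the cap value 10 is a made-up placeholder, not the real constant):

import java.util.HashMap;
import java.util.Map;

public class WeightMapSketch {
  // placeholder for QueryTicket.MAX_EXPECTED_SIZE; the real value lives in dremio
  static final int MAX_EXPECTED_SIZE = 10;

  public static void main(String[] args) {
    int[] assignedWeights = {2, 7, 25};               // hypothetical weights from SimpleParallelizer
    Map<Integer, Integer> priorityToWeight = new HashMap<>();
    for (int w : assignedWeights) {
      priorityToWeight.put(w, Math.min(w, MAX_EXPECTED_SIZE));
    }
    System.out.println(priorityToWeight);             // {2=2, 7=7, 25=10}: 25 is capped to 10
    System.out.println(priorityToWeight.getOrDefault(99, 1)); // unknown weight falls back to 1
  }
}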
// The comparator: order by weight first, then leaf fragments before root/intermediate ones, so that
// the FragmentExecutors are started bottom-up in dependency order
private Comparator<FragmentExecutor> weightBasedComparator() {
  return (e1, e2) -> {
    if (e2.getFragmentWeight() < e1.getFragmentWeight()) {
      // higher priority first
      return -1;
    }
    if (e2.getFragmentWeight() == e1.getFragmentWeight()) {
      // when priorities are equal, order the fragments based on leaf first and then descending order
      // of major fragment number. This ensures fragments are started in order of dependency
      if (e2.isLeafFragment() == e1.isLeafFragment()) {
        return Integer.compare(e2.getHandle().getMajorFragmentId(), e1.getHandle().getMajorFragmentId());
      } else {
        return e1.isLeafFragment() ? -1 : 1;
      }
    }
    return 1;
  };
}
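To see the resulting order, here is a self-contained sketch that reproduces the comparator's rules on a simplified, hypothetical fragment type (FragmentStub is made up for illustration; the real comparator works on FragmentExecutor):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ComparatorSketch {
  // stand-in for FragmentExecutor, illustration only
  static final class FragmentStub {
    final int majorId;
    final int weight;
    final boolean leaf;
    FragmentStub(int majorId, int weight, boolean leaf) {
      this.majorId = majorId; this.weight = weight; this.leaf = leaf;
    }
    @Override public String toString() {
      return "major=" + majorId + " weight=" + weight + (leaf ? " (leaf)" : "");
    }
  }

  // same ordering rules as weightBasedComparator: highest weight first,
  // then leaf fragments first, then descending major fragment id
  static final Comparator<FragmentStub> ORDER = (e1, e2) -> {
    if (e2.weight < e1.weight) {
      return -1;
    }
    if (e2.weight == e1.weight) {
      if (e2.leaf == e1.leaf) {
        return Integer.compare(e2.majorId, e1.majorId);
      }
      return e1.leaf ? -1 : 1;
    }
    return 1;
  };

  public static void main(String[] args) {
    List<FragmentStub> fragments = new ArrayList<>(List.of(
        new FragmentStub(0, 5, false),   // root
        new FragmentStub(1, 5, false),   // intermediate
        new FragmentStub(2, 5, true),    // leaf scan
        new FragmentStub(3, 9, true)));  // leaf of a heavier phase
    fragments.sort(ORDER);
    // prints major=3 (heaviest) first, then major=2 (leaf), then major=1, then major=0
    fragments.forEach(System.out::println);
  }
}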
Notes
The ordering can in fact be worked out just from Drill's query-execution documentation, but reading the source gives a better understanding of the internal handling.
References
sabot/kernel/src/main/java/com/dremio/sabot/exec/fragment/FragmentExecutor.java
sabot/kernel/src/main/java/com/dremio/sabot/exec/FragmentExecutors.java
sabot/kernel/src/main/java/com/dremio/exec/planner/fragment/PlanFragmentFull.java
sabot/kernel/src/main/java/com/dremio/sabot/exec/WorkloadTicket.java
sabot/kernel/src/main/java/com/dremio/sabot/exec/fragment/FragmentExecutorBuilder.java
sabot/kernel/src/main/java/com/dremio/exec/planner/fragment/SimpleParallelizer.java
https://drill.apache.org/docs/drill-query-execution/
https://www.cnblogs.com/rongfengliang/p/17008637.html
https://www.cnblogs.com/rongfengliang/p/17008654.html