A brief note on the execution order of Dremio's FragmentExecutor
After Dremio converts the logical plan into a physical plan, the execution plan is split into fragments. The fragments form a tree made up of PlanFragmentMajor and PlanFragmentMinor parts.
Once the tree is built, Dremio schedules it for execution (this involves resource allocation, priorities, and the actual operators; for the overall flow, see the Drill query execution introduction).
Starting fragments in FragmentExecutors
As shown below, the work is wrapped in a QueryStarterImpl:
```java
public void startFragments(final InitializeFragments fragments, final FragmentExecutorBuilder builder,
    final StreamObserver<Empty> sender, final NodeEndpoint identity) {
  final SchedulingInfo schedulingInfo = fragments.hasSchedulingInfo() ? fragments.getSchedulingInfo() : null;
  QueryStarterImpl queryStarter = new QueryStarterImpl(fragments, builder, sender, identity, schedulingInfo);
  // FragmentExecutorBuilder builds the FragmentExecutor; internally it uses QueriesClerk to support
  // workload- and query-level allocation. Once everything is wrapped, this calls back into
  // QueryStarterImpl.buildAndStartQuery.
  builder.buildAndStartQuery(queryStarter.getFirstFragment(), schedulingInfo, queryStarter);
}
```
Execution ordering
buildAndStartQuery in QueryStarterImpl:
```java
public void buildAndStartQuery(final QueryTicket queryTicket) {
  QueryId queryId = queryTicket.getQueryId();
  /**
   * To avoid race conditions between creation and deletion of phase/fragment tickets,
   * build all the fragments first (creates the tickets) and then, start the fragments (can
   * delete tickets).
   */
  List<FragmentExecutor> fragmentExecutors = new ArrayList<>();
  UserRpcException userRpcException = null;
  Set<FragmentHandle> fragmentHandlesForQuery = Sets.newHashSet();
  try {
    if (!maestroProxy.tryStartQuery(queryId, queryTicket, initializeFragments.getQuerySentTime())) {
      boolean isDuplicateStart = maestroProxy.isQueryStarted(queryId);
      if (isDuplicateStart) {
        // duplicate op, do nothing.
        return;
      } else {
        throw new IllegalStateException("query already cancelled");
      }
    }
    Map<Integer, Integer> priorityToWeightMap = buildPriorityToWeightMap();
    for (PlanFragmentFull fragment : fullFragments) {
      // build the FragmentExecutor
      FragmentExecutor fe = buildFragment(queryTicket, fragment,
          priorityToWeightMap.getOrDefault(fragment.getMajor().getFragmentExecWeight(), 1), schedulingInfo);
      fragmentHandlesForQuery.add(fe.getHandle());
      fragmentExecutors.add(fe);
    }
  } catch (UserRpcException e) {
    userRpcException = e;
  } catch (Exception e) {
    userRpcException = new UserRpcException(NodeEndpoint.getDefaultInstance(), "Remote message leaked.", e);
  } finally {
    if (fragmentHandlesForQuery.size() > 0) {
      maestroProxy.initFragmentHandlesForQuery(queryId, fragmentHandlesForQuery);
    }
    // highest weight first with leaf fragments first bottom up
    // sort, then start each FragmentExecutor; for what actually happens at runtime,
    // see the Drill query execution introduction
    fragmentExecutors.sort(weightBasedComparator());
    for (FragmentExecutor fe : fragmentExecutors) {
      startFragment(fe);
    }
    queryTicket.release();
    if (userRpcException == null) {
      sender.onNext(Empty.getDefaultInstance());
      sender.onCompleted();
    } else {
      sender.onError(userRpcException);
    }
  }
  // if there was a cancel while the start was in-progress, clean up.
  if (maestroProxy.isQueryCancelled(queryId)) {
    cancelFragments(queryId, builder.getClerk());
  }
}
```
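The guard at the top of buildAndStartQuery makes starting a query idempotent: a duplicate start is silently ignored, while starting an already-cancelled query fails. The following is a minimal sketch of that decision, using a plain ConcurrentHashMap to stand in for maestroProxy's bookkeeping; the class and method names here are illustrative, not Dremio's API.

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the start/duplicate/cancel decision in buildAndStartQuery().
// QueryStartGuard and its state map are illustrative stand-ins, not Dremio code.
public class QueryStartGuard {
    enum State { STARTED, CANCELLED }

    private final ConcurrentHashMap<String, State> queries = new ConcurrentHashMap<>();

    // Returns true only for the first successful start of a query.
    boolean tryStartQuery(String queryId) {
        return queries.putIfAbsent(queryId, State.STARTED) == null;
    }

    boolean isQueryStarted(String queryId) {
        return queries.get(queryId) == State.STARTED;
    }

    void cancel(String queryId) {
        queries.put(queryId, State.CANCELLED);
    }

    // Mirrors the guard in buildAndStartQuery(): a duplicate start is a
    // no-op, while starting an already-cancelled query is an error.
    String startOrExplain(String queryId) {
        if (!tryStartQuery(queryId)) {
            if (isQueryStarted(queryId)) {
                return "duplicate start, ignored";
            }
            throw new IllegalStateException("query already cancelled");
        }
        return "started";
    }
}
```

Using putIfAbsent makes the check-and-set atomic, so two racing start requests for the same query cannot both win.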
buildFragment builds the FragmentExecutor:
```java
private FragmentExecutor buildFragment(final QueryTicket queryTicket, final PlanFragmentFull fragment,
    final int schedulingWeight, final SchedulingInfo schedulingInfo) throws UserRpcException {
  if (fragment.getMajor().getFragmentExecWeight() <= 0) {
    logger.info("Received remote fragment start instruction for {}", QueryIdHelper.getQueryIdentifier(fragment.getHandle()));
  } else {
    logger.info("Received remote fragment start instruction for {} with assigned weight {} and scheduling weight {}",
        QueryIdHelper.getQueryIdentifier(fragment.getHandle()), fragment.getMajor().getFragmentExecWeight(), schedulingWeight);
  }
  try {
    final EventProvider eventProvider = getEventProvider(fragment.getHandle());
    return builder.build(queryTicket, fragment, schedulingWeight, useMemoryArbiter ? memoryArbiter : null, eventProvider, schedulingInfo, fragmentReader);
  } catch (final Exception e) {
    throw new UserRpcException(identity, "Failure while trying to start remote fragment", e);
  } catch (final OutOfMemoryError t) {
    if (t.getMessage().startsWith("Direct buffer")) {
      throw new UserRpcException(identity, "Out of direct memory while trying to start remote fragment", t);
    } else {
      throw t;
    }
  }
}
```
Weight comparison
The execution weight is taken from PlanFragmentFull; it is assigned by Dremio's simple parallelization (SimpleParallelizer), which estimates it based on the available resources (to be covered in detail later).
```java
private Map<Integer, Integer> buildPriorityToWeightMap() {
  Map<Integer, Integer> priorityToWeightMap = new HashMap<>();
  if (fullFragments.get(0).getMajor().getFragmentExecWeight() <= 0) {
    return priorityToWeightMap;
  }
  for (PlanFragmentFull fragment : fullFragments) {
    int fragmentWeight = fragment.getMajor().getFragmentExecWeight();
    priorityToWeightMap.put(fragmentWeight, Math.min(fragmentWeight, QueryTicket.MAX_EXPECTED_SIZE));
  }
  return priorityToWeightMap;
}
```
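The essence of buildPriorityToWeightMap is a clamp: each distinct fragment weight maps to itself, capped at a maximum. A minimal sketch, assuming a placeholder MAX_EXPECTED_SIZE of 10 (the real value lives in QueryTicket) and a plain list of weights instead of fullFragments:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the weight-clamping idea in buildPriorityToWeightMap().
// MAX_EXPECTED_SIZE is an assumed placeholder value, not Dremio's constant.
public class WeightMapDemo {
    static final int MAX_EXPECTED_SIZE = 10;

    static Map<Integer, Integer> buildPriorityToWeightMap(List<Integer> fragmentWeights) {
        Map<Integer, Integer> map = new HashMap<>();
        // If weights were never assigned (<= 0), return an empty map and let
        // callers fall back to the default weight of 1 (via getOrDefault).
        if (fragmentWeights.isEmpty() || fragmentWeights.get(0) <= 0) {
            return map;
        }
        for (int w : fragmentWeights) {
            // Each weight maps to itself, capped at MAX_EXPECTED_SIZE.
            map.put(w, Math.min(w, MAX_EXPECTED_SIZE));
        }
        return map;
    }

    public static void main(String[] args) {
        // Weight 25 is clamped to 10; weight 3 passes through unchanged.
        System.out.println(buildPriorityToWeightMap(List.of(3, 25, 3)));
    }
}
```

The empty-map-on-unassigned-weights case matters: in buildAndStartQuery the lookup uses getOrDefault(..., 1), so fragments without assigned weights all run at the same default scheduling weight.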
The comparator decides the order in which root, leaf, and intermediate FragmentExecutors are started:
```java
private Comparator<FragmentExecutor> weightBasedComparator() {
  return (e1, e2) -> {
    if (e2.getFragmentWeight() < e1.getFragmentWeight()) {
      // higher priority first
      return -1;
    }
    if (e2.getFragmentWeight() == e1.getFragmentWeight()) {
      // when priorities are equal, order the fragments based on leaf first and then descending order
      // of major fragment number. This ensures fragments are started in order of dependency
      if (e2.isLeafFragment() == e1.isLeafFragment()) {
        return Integer.compare(e2.getHandle().getMajorFragmentId(), e1.getHandle().getMajorFragmentId());
      } else {
        return e1.isLeafFragment() ? -1 : 1;
      }
    }
    return 1;
  };
}
```
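The three ordering rules can be seen in isolation with a self-contained sketch. The Frag class below is an illustrative stand-in for FragmentExecutor (its fields mirror getFragmentWeight, isLeafFragment, and the major fragment id), not Dremio's actual API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Stand-alone demo of the ordering rules in weightBasedComparator().
// Frag is an illustrative stand-in for FragmentExecutor, not Dremio code.
public class FragmentOrderDemo {
    static final class Frag {
        final int weight;    // scheduling weight (higher runs first)
        final boolean leaf;  // leaf fragments feed data upward
        final int majorId;   // major fragment id (higher ids sit deeper in the tree)

        Frag(int weight, boolean leaf, int majorId) {
            this.weight = weight;
            this.leaf = leaf;
            this.majorId = majorId;
        }
    }

    // Same three rules as the source:
    // 1. higher weight first
    // 2. equal weight: leaf fragments first
    // 3. still tied: descending major fragment id (bottom-up start)
    static Comparator<Frag> weightBasedComparator() {
        return (e1, e2) -> {
            if (e2.weight < e1.weight) {
                return -1;
            }
            if (e2.weight == e1.weight) {
                if (e2.leaf == e1.leaf) {
                    return Integer.compare(e2.majorId, e1.majorId);
                }
                return e1.leaf ? -1 : 1;
            }
            return 1;
        };
    }

    public static void main(String[] args) {
        List<Frag> frags = new ArrayList<>();
        frags.add(new Frag(1, false, 0)); // root
        frags.add(new Frag(1, true, 2));  // leaf scan
        frags.add(new Frag(5, false, 1)); // heavier intermediate fragment
        frags.sort(weightBasedComparator());
        for (Frag f : frags) {
            System.out.println(f.majorId);
        }
        // prints 1, 2, 0: heaviest fragment first, then the leaf, then the root
    }
}
```

Starting leaves before their parents means that by the time an upstream fragment runs, its data sources are already producing, which is the bottom-up dependency order the source comment describes.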
Notes
The ordering can already be inferred from the Drill query execution introduction, but reading the source gives a much better understanding of the internal processing.
References
sabot/kernel/src/main/java/com/dremio/sabot/exec/fragment/FragmentExecutor.java
sabot/kernel/src/main/java/com/dremio/sabot/exec/FragmentExecutors.java
sabot/kernel/src/main/java/com/dremio/exec/planner/fragment/PlanFragmentFull.java
sabot/kernel/src/main/java/com/dremio/sabot/exec/WorkloadTicket.java
sabot/kernel/src/main/java/com/dremio/sabot/exec/fragment/FragmentExecutorBuilder.java
sabot/kernel/src/main/java/com/dremio/exec/planner/fragment/SimpleParallelizer.java
https://drill.apache.org/docs/drill-query-execution/
https://www.cnblogs.com/rongfengliang/p/17008637.html
https://www.cnblogs.com/rongfengliang/p/17008654.html