Liantong's Road to a Big Data Scheduling Platform

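Hello, Liantong IT folks! The Conductor (our column editor) recently received an original contribution from Zhang Yongqing (张永清) of the big data team. Zhang, the author of several best-selling books, shares how Liantong built its big data scheduling platform from 0 to 1 on top of incubator-dolphinscheduler.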

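Liantong (联童) is an intelligent industry platform for the maternal, infant and children's sector. It has worked in this industry and in internet technology for many years and has extensive experience in running maternal and baby stores and in developing systems. In member operations and merchandise operations, it builds around members' needs, goes deep into real-world scenarios, stays close to partners and consumers, and delivers the best service products. The company is committed to using technology to drive the development of the maternal, infant and children's industry, and hopes to use big data to give customers more intelligent data analysis and decision support. Big data is one of the company's key areas: a big data team was set up in the company's earliest days, and once that team existed, building a big data scheduling platform was naturally the most fundamental and most important step.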

Why we chose incubator-dolphinscheduler

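1. incubator-dolphinscheduler is an open-source project initiated by a Chinese company, and its local community members are very active, which makes communicating with the community easier. Liantong also hopes to join this community and help make a community founded mainly by local members even better.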

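2. incubator-dolphinscheduler supports a wide range of application scenarios:

· Tasks are linked into a DAG according to their dependencies, and their running state can be monitored visually in real time

· Rich task types: Shell, MR, Spark, SQL (mysql, postgresql, hive, sparksql), Python, Sub_Process, Procedure, flink, datax, sqoop, http, and more

· Scheduled, dependency-based and manual workflow runs; manual pause/stop/resume; failure retry and alerting, recovery of a failed run from a specified node, killing tasks, and other operations

· Workflow priority, task priority, task failover, and alerting or failure on task timeout

· Workflow-level global parameters and custom per-node parameters

· Online upload, download and management of resource files; online file creation and editing

· Online viewing and scrolling of task logs, and online log download

· Cluster HA: ZooKeeper keeps both the Master cluster and the Worker cluster decentralized

· Online view of Master/Worker CPU load, memory and CPU usage

· Tree and Gantt chart views of workflow run history; task status and process status statistics

· Data backfill

· Multi-tenancy

· Internationalization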

As for the DAG: in dolphinscheduler one workflow can contain multiple tasks, and each task corresponds to one node in the DAG.
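To make the DAG model concrete, here is a minimal sketch, not DolphinScheduler's implementation, that stores a workflow's tasks as DAG nodes and derives a valid execution order with Kahn's topological sort, rejecting cyclic definitions. The class `WorkflowDag` and the task names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WorkflowDag {
    // task -> downstream tasks; LinkedHashMap keeps insertion order so output is deterministic
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    // task -> number of upstream tasks it still depends on
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    void addTask(String task) {
        downstream.putIfAbsent(task, new ArrayList<>());
        inDegree.putIfAbsent(task, 0);
    }

    /** Declares that {@code to} may only run after {@code from} has finished. */
    void addDependency(String from, String to) {
        addTask(from);
        addTask(to);
        downstream.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
    }

    /** Kahn's algorithm: returns an order that respects every dependency, or throws on a cycle. */
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        remaining.forEach((task, deg) -> {
            if (deg == 0) {
                ready.add(task); // no upstream tasks: can start right away
            }
        });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String task = ready.poll();
            order.add(task);
            for (String next : downstream.get(task)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next); // all upstream tasks of `next` are done
                }
            }
        }
        if (order.size() != inDegree.size()) {
            throw new IllegalStateException("workflow definition contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        WorkflowDag dag = new WorkflowDag();
        dag.addDependency("extract", "clean");
        dag.addDependency("extract", "dedupe");
        dag.addDependency("clean", "report");
        dag.addDependency("dedupe", "report");
        System.out.println(dag.executionOrder()); // [extract, clean, dedupe, report]
    }
}
```

A real scheduler would dispatch every task whose upstream tasks have all succeeded, so independent nodes such as `clean` and `dedupe` can run in parallel; the linear order only shows one sequence that respects the dependencies.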

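3. While being designed for high concurrency and high availability, incubator-dolphinscheduler keeps its architecture relatively simple and does not bring in many complex technical components, which lowers the cost of learning and maintenance.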

Note: this architecture diagram is taken from the community's official website.

Apart from ZooKeeper, incubator-dolphinscheduler introduces few complex technical components. The whole architecture uses ZooKeeper for cluster management and follows a decentralized design.

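Shortcomings of incubator-dolphinscheduler

1. No support for a serial scheduling policy

incubator-dolphinscheduler was originally designed to support only parallel scheduling, not serial scheduling. At Liantong, however, most scenarios require serial execution: each workflow may have only one running instance at a time, and the next instance of a workflow must wait until the previous one has finished. Such scenarios mostly arise in near-real-time jobs.

2. Task dependencies are not powerful enough: a task can only wait passively for its dependencies to succeed and cannot actively trigger downstream workflow instances

As shown in the figure below, a dependency can only be configured when a task is created, after which the task waits passively for that dependency to succeed; once the current task succeeds, it cannot actively trigger other workflows to run.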
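The difference between passive waiting and active triggering can be sketched as follows. This is an illustrative model only: `DownstreamTrigger` is a hypothetical class, the mapping from a workflow to the workflows it should start is an assumed configuration, and `startWorkflow` stands in for whatever actually creates a new workflow instance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class DownstreamTrigger {
    // hypothetical configuration: upstream workflow -> workflows to start when it succeeds
    private final Map<String, List<String>> downstream = new HashMap<>();
    // stand-in for whatever actually creates a new workflow instance
    private final Consumer<String> startWorkflow;

    DownstreamTrigger(Consumer<String> startWorkflow) {
        this.startWorkflow = startWorkflow;
    }

    void register(String upstream, String... targets) {
        downstream.computeIfAbsent(upstream, k -> new ArrayList<>()).addAll(Arrays.asList(targets));
    }

    /** Active push: called once, at the moment an upstream instance ends. */
    void onFinished(String workflow, boolean succeeded) {
        if (!succeeded) {
            return; // a failed run starts nothing downstream
        }
        for (String target : downstream.getOrDefault(workflow, List.of())) {
            startWorkflow.accept(target);
        }
    }

    public static void main(String[] args) {
        List<String> started = new ArrayList<>();
        DownstreamTrigger trigger = new DownstreamTrigger(started::add);
        trigger.register("ods_sync", "dwd_orders", "dwd_members");
        trigger.onFinished("ods_sync", false); // failure: nothing is started
        trigger.onFinished("ods_sync", true);  // success: both downstream workflows start
        System.out.println(started); // [dwd_orders, dwd_members]
    }
}
```

Compared with a dependency node that keeps checking until its upstream has succeeded, the push happens exactly once, when the upstream instance ends, and one upstream can fan out to several downstream workflows.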

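3. Some modules offer a poor user experience, and some modules query data slowly when the data volume is large

4. No comprehensive monitoring system

incubator-dolphinscheduler provides only some simple monitoring. With as many as several thousand tasks running, comprehensive monitoring is hard to achieve, and there is no performance analysis of individual task runs at all.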

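Our feature enhancements to incubator-dolphinscheduler

1. Added support for serial scheduling

As shown in the figure below, we added a serial execution mode alongside the original parallel mode.

For serial execution we also added a queue, and the queue length can be specified for each workflow.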
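A minimal sketch of serial execution with a bounded waiting queue, under stated assumptions: `SerialWorkflowQueue` is a hypothetical class, one object is kept per workflow, and a trigger that arrives while the queue is full is rejected (the text does not say what Liantong's implementation does in that case).

```java
import java.util.ArrayDeque;
import java.util.Deque;

class SerialWorkflowQueue {
    enum Outcome { RUNNING, QUEUED, REJECTED }

    private final int capacity;                         // hypothetical per-workflow queue length
    private final Deque<String> waiting = new ArrayDeque<>();
    private String running;                             // instance currently running, or null

    SerialWorkflowQueue(int capacity) {
        this.capacity = capacity;
    }

    /** Decides what happens to a newly triggered instance of this workflow. */
    synchronized Outcome submit(String instanceId) {
        if (running == null) {
            running = instanceId;                       // nothing running: start immediately
            return Outcome.RUNNING;
        }
        if (waiting.size() < capacity) {
            waiting.addLast(instanceId);                // wait behind the running instance
            return Outcome.QUEUED;
        }
        return Outcome.REJECTED;                        // assumption: a full queue drops the trigger
    }

    /** Called when the running instance ends; returns the next instance to start, or null. */
    synchronized String finish(String instanceId) {
        if (!instanceId.equals(running)) {
            throw new IllegalStateException(instanceId + " is not the running instance");
        }
        running = waiting.pollFirst();                  // oldest waiting instance, or null
        return running;
    }

    public static void main(String[] args) {
        SerialWorkflowQueue queue = new SerialWorkflowQueue(2);
        System.out.println(queue.submit("run-1")); // RUNNING
        System.out.println(queue.submit("run-2")); // QUEUED
        System.out.println(queue.submit("run-3")); // QUEUED
        System.out.println(queue.submit("run-4")); // REJECTED
        System.out.println(queue.finish("run-1")); // run-2
    }
}
```

FIFO order guarantees that instances of the same workflow run one after another in the order they were triggered, while the capacity caps how much backlog a slow near-real-time job can accumulate.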

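2. Added triggering of downstream workflow instances

As shown in the figure below, building on the original execution options, we added the ability to actively trigger one or more downstream workflow instances.

The result after running is shown below:

3. Fixes for some major bugs

Liantong ran into quite a few pitfalls while using incubator-dolphinscheduler. Take one example: the problem colleagues reported most often internally was that the logs of scheduled tasks did not refresh in time, and sometimes took a very long time to appear. Analysis of the source code showed that the cause was some insufficiently robust handling in the code.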

Part of the source of AbstractCommandExecutor.java in incubator-dolphinscheduler:

```java
/**
 * abstract command executor
 */
public abstract class AbstractCommandExecutor {

    // ...

    /**
     * build process
     *
     * @param commandFile command file
     * @throws IOException IO Exception
     */
    private void buildProcess(String commandFile) throws IOException {
        // setting up user to run commands
        List<String> command = new LinkedList<>();

        // init process builder
        ProcessBuilder processBuilder = new ProcessBuilder();
        // setting up a working directory
        processBuilder.directory(new File(taskExecutionContext.getExecutePath()));
        // merge error information to standard output stream
        processBuilder.redirectErrorStream(true);

        // setting up user to run commands
        command.add("sudo");
        command.add("-u");
        command.add(taskExecutionContext.getTenantCode());
        command.add(commandInterpreter());
        command.addAll(commandOptions());
        command.add(commandFile);

        // setting commands
        processBuilder.command(command);
        process = processBuilder.start();

        // print command
        printCommand(command);
    }

    // ...

    /**
     * get the standard output of the process
     *
     * @param process process
     */
    private void parseProcessOutput(Process process) {
        String threadLoggerInfoName = String.format(LoggerUtils.TASK_LOGGER_THREAD_NAME + "-%s", taskExecutionContext.getTaskAppId());
        ExecutorService parseProcessOutputExecutorService = ThreadUtils.newDaemonSingleThreadExecutor(threadLoggerInfoName);
        parseProcessOutputExecutorService.submit(new Runnable() {
            @Override
            public void run() {
                BufferedReader inReader = null;
                try {
                    inReader = new BufferedReader(new InputStreamReader(process.getInputStream()));
                    String line;
                    long lastFlushTime = System.currentTimeMillis();
                    // readLine() blocks until the process writes a new line or exits,
                    // so flush() below is only evaluated when a new line arrives
                    while ((line = inReader.readLine()) != null) {
                        if (line.startsWith("${setValue(")) {
                            varPool.append(line.substring("${setValue(".length(), line.length() - 2));
                            varPool.append("$VarPool$");
                        } else {
                            logBuffer.add(line);
                            lastFlushTime = flush(lastFlushTime);
                        }
                    }
                } catch (Exception e) {
                    logger.error(e.getMessage(), e);
                } finally {
                    clear();
                    close(inReader);
                }
            }
        });
        parseProcessOutputExecutorService.shutdown();
    }

    // ...

    /**
     * when log buffer size or flush time reach condition, then flush
     *
     * @param lastFlushTime last flush time
     * @return last flush time
     */
    private long flush(long lastFlushTime) {
        long now = System.currentTimeMillis();

        /*
         * when log buffer size or flush time reach condition, then flush
         */
        if (logBuffer.size() >= Constants.DEFAULT_LOG_ROWS_NUM || now - lastFlushTime > Constants.DEFAULT_LOG_FLUSH_INTERVAL) {
            lastFlushTime = now;
            /* log handle */
            logHandler.accept(logBuffer);
            logBuffer.clear();
        }
        return lastFlushTime;
    }

    /**
     * close buffer reader
     *
     * @param inReader in reader
     */
    private void close(BufferedReader inReader) {
        if (inReader != null) {
            try {
                inReader.close();
            } catch (IOException e) {
                logger.error(e.getMessage(), e);
            }
        }
    }

    protected List<String> commandOptions() {
        return Collections.emptyList();
    }

    protected abstract String buildCommandFilePath();

    protected abstract String commandInterpreter();

    protected abstract void createCommandFileIfNotExists(String execCommand, String commandFile) throws IOException;
}
```

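In this code, the parseProcessOutput(Process process) method is responsible for reading the task log and flushing it. It reads the task process's output, process.getInputStream(), with BufferedReader's readLine() method, and readLine() is a blocking method: it does not return until a new line is available or the stream ends.

The flush(long lastFlushTime) method only flushes when the condition if (logBuffer.size() >= Constants.DEFAULT_LOG_ROWS_NUM || now - lastFlushTime > Constants.DEFAULT_LOG_FLUSH_INTERVAL) holds, that is, when 64 log lines have accumulated or more than 1 s has passed since the last flush. The intent is clearly to flush the log at least once per second. But because readLine() blocks, the loop does not keep running: flush() is only called after readLine() has returned a new line. So if a task produces fewer than 64 lines within one second and then produces no new output for a long time, this bug is triggered. For example, in the following shell script task each step prints very little and the steps are far apart in time, so the log lines already printed are not refreshed for a very long time.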

```bash
#!/bin/bash

echo "hello world"
sleep 100000s

echo "hello world2"
sleep 100000s

echo "hello world3"
sleep 100000s
```
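The timing problem can be reproduced without a real process. The sketch below replays the original read-then-check logic against a list of line arrival times. It assumes the thresholds stated above (64 rows, 1000 ms), assumes the script's three echo lines appear at 0 s, 100,000 s and 200,000 s with output ending at 300,000 s, and assumes that whatever is still buffered at end of stream is flushed at that moment. `FlushOnReadSimulation` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

class FlushOnReadSimulation {
    static final int LOG_ROWS_NUM = 64;            // row threshold stated in the text
    static final long FLUSH_INTERVAL_MS = 1000;    // time threshold stated in the text

    /**
     * Replays the original loop: the flush condition is evaluated only right after readLine()
     * returns a line. Returns, for each line, the time (ms) at which it reaches the log handler.
     */
    static long[] flushTimes(long[] arrivalMs, long startMs, long endOfStreamMs) {
        long[] flushedAt = new long[arrivalMs.length];
        List<Integer> buffer = new ArrayList<>();  // indexes of lines waiting to be flushed
        long lastFlushTime = startMs;
        for (int i = 0; i < arrivalMs.length; i++) {
            long now = arrivalMs[i];               // the loop only wakes up when a line arrives
            buffer.add(i);
            if (buffer.size() >= LOG_ROWS_NUM || now - lastFlushTime > FLUSH_INTERVAL_MS) {
                for (int idx : buffer) {
                    flushedAt[idx] = now;
                }
                buffer.clear();
                lastFlushTime = now;
            }
        }
        // assumption: lines still buffered at end of stream are flushed at that moment
        for (int idx : buffer) {
            flushedAt[idx] = endOfStreamMs;
        }
        return flushedAt;
    }

    public static void main(String[] args) {
        long[] arrivals = {0L, 100_000_000L, 200_000_000L}; // assumed echo times of the script
        long[] flushed = flushTimes(arrivals, 0L, 300_000_000L);
        for (int i = 0; i < arrivals.length; i++) {
            System.out.printf("line %d: printed at %d ms, visible at %d ms%n", i + 1, arrivals[i], flushed[i]);
        }
    }
}
```

With these inputs the first line stays in the buffer until the second line arrives 100,000 s later. Lines printed after a long quiet gap are flushed as soon as they are read, so the visible symptom is that output printed shortly after a flush only shows up when the next line is printed.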

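We then rewrote this code to use two threads: one thread calls readLine() and the other flushes, so that while the readLine() thread is blocked, the flush thread keeps working. We also contributed the modified code to the community, and it has been merged into the dev branch.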

```java
public abstract class AbstractCommandExecutor {

    /**
     * rules for extracting application ID
     */
    protected static final Pattern APPLICATION_REGEX = Pattern.compile(Constants.APPLICATION_REGEX);

    /**
     * process
     */
    private Process process;

    /**
     * log handler
     */
    protected Consumer<List<String>> logHandler;

    /**
     * logger
     */
    protected Logger logger;

    /**
     * log list (shared by the reader thread and the flush thread, so it must be a thread-safe list)
     */
    protected final List<String> logBuffer;

    // set by the reader thread and polled by the flush thread; declaring it volatile would guarantee visibility
    protected boolean logOutputIsScuccess = false;

    /**
     * taskExecutionContext
     */
    protected TaskExecutionContext taskExecutionContext;

    /**
     * taskExecutionContextCacheManager
     */
    private TaskExecutionContextCacheManager taskExecutionContextCacheManager;

    // ...

    /**
     * get the standard output of the process
     *
     * @param process process
     */
    private void parseProcessOutput(Process process) {
        String threadLoggerInfoName = String.format(LoggerUtils.TASK_LOGGER_THREAD_NAME + "-%s", taskExecutionContext.getTaskAppId());

        // thread 1: only reads the process output; blocking in readLine() no longer delays flushing
        ExecutorService getOutputLogService = ThreadUtils.newDaemonSingleThreadExecutor(threadLoggerInfoName + "-" + "getOutputLogService");
        getOutputLogService.submit(() -> {
            BufferedReader inReader = null;
            try {
                inReader = new BufferedReader(new InputStreamReader(process.getInputStream()));
                String line;
                while ((line = inReader.readLine()) != null) {
                    logBuffer.add(line);
                }
            } catch (Exception e) {
                logger.error(e.getMessage(), e);
            } finally {
                logOutputIsScuccess = true;
                close(inReader);
            }
        });
        getOutputLogService.shutdown();

        // thread 2: flushes until the reader has finished and the buffer is empty
        ExecutorService parseProcessOutputExecutorService = ThreadUtils.newDaemonSingleThreadExecutor(threadLoggerInfoName);
        parseProcessOutputExecutorService.submit(() -> {
            try {
                long lastFlushTime = System.currentTimeMillis();
                while (logBuffer.size() > 0 || !logOutputIsScuccess) {
                    if (logBuffer.size() > 0) {
                        lastFlushTime = flush(lastFlushTime);
                    } else {
                        Thread.sleep(Constants.DEFAULT_LOG_FLUSH_INTERVAL);
                    }
                }
            } catch (Exception e) {
                logger.error(e.getMessage(), e);
            } finally {
                clear();
            }
        });
        parseProcessOutputExecutorService.shutdown();
    }

    // ...

    /**
     * when log buffer size or flush time reach condition, then flush
     *
     * @param lastFlushTime last flush time
     * @return last flush time
     */
    private long flush(long lastFlushTime) throws InterruptedException {
        long now = System.currentTimeMillis();

        /*
         * when log buffer size or flush time reach condition, then flush
         */
        if (logBuffer.size() >= Constants.DEFAULT_LOG_ROWS_NUM || now - lastFlushTime > Constants.DEFAULT_LOG_FLUSH_INTERVAL) {
            lastFlushTime = now;
            /* log handle */
            logHandler.accept(logBuffer);
            logBuffer.clear();
        }
        return lastFlushTime;
    }

    // ...
}
```
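The same two-thread idea can be tried in isolation. The sketch below is not the merged DolphinScheduler code: `TwoThreadLogPump` is a hypothetical class, and it swaps in a `LinkedBlockingQueue` plus a `volatile` completion flag so the buffer and the flag are safely shared between the reader and the flusher. Because the flusher waits in `poll()` with a timeout instead of on `readLine()`, buffered lines reach the log handler within roughly one to two flush intervals even while the reader is blocked, and an idle flusher sleeps in `poll()` rather than re-checking in a tight loop.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class TwoThreadLogPump {
    private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>(); // thread-safe hand-off
    private volatile boolean readerDone = false;                              // visible to the flusher
    private final int maxRows;
    private final long flushIntervalMs;
    private final Consumer<List<String>> logHandler;

    TwoThreadLogPump(int maxRows, long flushIntervalMs, Consumer<List<String>> logHandler) {
        this.maxRows = maxRows;
        this.flushIntervalMs = flushIntervalMs;
        this.logHandler = logHandler;
    }

    /** Starts reader and flusher; the returned latch opens once every line has been flushed. */
    CountDownLatch start(InputStream output) {
        CountDownLatch finished = new CountDownLatch(1);

        // Reader: may block in readLine() for as long as the process stays silent.
        ExecutorService reader = Executors.newSingleThreadExecutor();
        reader.submit(() -> {
            try (BufferedReader in = new BufferedReader(new InputStreamReader(output, StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    buffer.add(line);
                }
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                readerDone = true;
            }
        });
        reader.shutdown();

        // Flusher: wakes up at least once per interval, independently of the reader.
        ExecutorService flusher = Executors.newSingleThreadExecutor();
        flusher.submit(() -> {
            List<String> batch = new ArrayList<>();
            long lastFlush = System.currentTimeMillis();
            try {
                while (!readerDone || !buffer.isEmpty()) {
                    String line = buffer.poll(flushIntervalMs, TimeUnit.MILLISECONDS);
                    if (line != null) {
                        batch.add(line);
                    }
                    long now = System.currentTimeMillis();
                    if (!batch.isEmpty() && (batch.size() >= maxRows || now - lastFlush >= flushIntervalMs)) {
                        logHandler.accept(new ArrayList<>(batch));
                        batch.clear();
                        lastFlush = now;
                    }
                }
                if (!batch.isEmpty()) {
                    logHandler.accept(new ArrayList<>(batch)); // final flush after end of stream
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                finished.countDown();
            }
        });
        flusher.shutdown();
        return finished;
    }

    public static void main(String[] args) throws InterruptedException {
        InputStream fakeOutput = new ByteArrayInputStream("hello world\nhello world2\n".getBytes(StandardCharsets.UTF_8));
        new TwoThreadLogPump(64, 1000, batch -> System.out.println("flush " + batch)).start(fakeOutput).await();
    }
}
```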

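4. Integrating the scheduler's monitoring with Prometheus and Grafana

incubator-dolphinscheduler provides only simple real-time monitoring like that shown below, and monitoring of individual tasks is especially lacking.

Building on this, Liantong introduced Prometheus and Grafana.

With Prometheus and Grafana we can monitor not only the overall running of scheduled tasks but also details such as the run-time curve of an individual task.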
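As an illustration of the kind of data that can feed such a per-task run-time curve, here is a sketch that renders task durations in the Prometheus text exposition format. The metric name `scheduler_task_duration_seconds`, the `task` label and the class `TaskMetricsExporter` are hypothetical, the HTTP endpoint Prometheus would scrape is not shown, and in practice a Prometheus client library would normally generate this text.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class TaskMetricsExporter {
    // hypothetical metric name; a real deployment picks its own naming
    static final String METRIC = "scheduler_task_duration_seconds";

    /** Escapes a label value: backslash, double quote and newline must be escaped. */
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    /** Renders the most recent duration (seconds) of each task as a gauge. */
    static String render(Map<String, Double> lastDurationSeconds) {
        StringBuilder sb = new StringBuilder();
        sb.append("# HELP ").append(METRIC).append(" Duration of the most recent run of each task.\n");
        sb.append("# TYPE ").append(METRIC).append(" gauge\n");
        // TreeMap gives a stable, sorted order of series
        new TreeMap<>(lastDurationSeconds).forEach((task, seconds) ->
                sb.append(METRIC).append("{task=\"").append(escape(task)).append("\"} ")
                  .append(seconds).append('\n'));
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> durations = new HashMap<>();
        durations.put("dwd_orders", 42.5);
        durations.put("ods_sync", 3.0);
        System.out.print(render(durations));
    }
}
```

Exposing the latest duration as a gauge lets Grafana plot each task's run time across successive scrapes; a histogram is the usual alternative when a latency distribution matters more than one curve per task.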

5. Performance optimization of incubator-dolphinscheduler

To be continued

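First of all, the Conductor sincerely thanks the big data team for sharing and applauds this spirit of co-creation and sharing. We also hope every team will build up experience in its work, review and summarize it, and ultimately turn it into something of value for others.

Submissions are very welcome; come and share your insights!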

Hello World→Change Our World

posted @ 2021-02-26 08:57 by 海豚调度
