Optimizing DataX's mysqlreader Plugin
1. Background: Why DataX
The new statistics database needs to sync data from the business tables, but the RDS instance does not allow binlog replication into our own database, so syncing by querying the tables is the only option.
After a quick comparison, DataX fit our scenario best and is easy to customize by modifying its plugins, so we settled on a scheme where XXL-JOB schedules DataX to perform full and incremental synchronization.
2. Problems Encountered During Synchronization and Their Solutions
2.1 Fixing the incremental query SQL
DataX's MySQL plugins can sync data in two write modes (mysqlwriter's writeMode); both support full and incremental synchronization, but they behave differently:
- insert: rows with duplicate primary keys are flagged as dirty data, and the task fails once the dirty-data count reaches the configured limit; this suits a one-off full sync.
- replace: rows with duplicate primary keys are replaced (updated) directly.
Incremental sync is driven by the primary key ID and the update time, but the job configuration file is static: out of the box, incremental sync is only possible by passing in call-time parameters that substitute variables in the configuration, which merely shifts the problem to obtaining those parameter values.
The final solution was to modify the mysqlreader plugin, adding SQL configuration options: when a sync starts, these SQL statements are executed to fetch the largest already-synced primary key ID and the previous update time, and the retrieval condition of the sync SQL is amended accordingly, so exactly the incremental data is fetched.
Add handling for the following options in MysqlReader.Task.init():
- readerParamSql: a query that returns parameter key/value pairs used to substitute the variables in querySql.
- readerWhereSql: a query that returns the incremental condition, which is appended to the task's SQL condition so the task syncs incrementally under the real condition.
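Since the reader obtains these options through its peer (writer) configuration — the state of what has already been synced lives in the target database — they are placed in the writer's parameter block. A hypothetical sketch of such a block (the table, column, and placeholder names such as t_hello_world and @table are illustrative assumptions, not fixed conventions):

```json
{
    "writer": {
        "name": "mysqlwriter",
        "parameter": {
            "writeMode": "replace",
            "username": "stat",
            "password": "123",
            "table": "t_hello_world",
            "jdbcUrl": "jdbc:mysql://localhost:3306/stat",
            "readerParamSql": "select date_format(max(update_time), '%Y-%m-%d %H:%i:%s') as last_update_time from @table",
            "readerWhereSql": "select concat(' and id > ', ifnull(max(id), 0)) from @table"
        }
    }
}
```

Here readerParamSql would fill a ${last_update_time} variable in the reader's querySql, while readerWhereSql yields a string that is appended verbatim to the query.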
private final static Map<String, String> TASK_TABLE_MAP = new ConcurrentHashMap<>();

public void init() {
    this.readerSliceConfig = super.getPluginJobConf();
    this.commonRdbmsReaderTask = new CommonRdbmsReader.Task(DATABASE_TYPE, super.getTaskGroupId(), super.getTaskId());
    this.commonRdbmsReaderTask.init(this.readerSliceConfig);
    fixReaderWhereSql();
    fixReaderParamSql();
}
/*
 * Amend the where condition of the reader's query SQL.
 */
private void fixReaderWhereSql() {
    Configuration peerPluginJobConf = super.getPeerPluginJobConf();
    String readerWhereSql = peerPluginJobConf.getString("readerWhereSql");
    if (readerWhereSql == null) {
        return;
    }
    // Substitute the current table name
    readerWhereSql = readerWhereSql.replace(Constant.TABLE_NAME_PLACEHOLDER,
            peerPluginJobConf.getString(Key.TABLE));
    LOG.info("readerWhereSql={}", readerWhereSql);
    String jdbcUrl = peerPluginJobConf.getString(Key.JDBC_URL);
    String username = peerPluginJobConf.getString(Key.USERNAME);
    String password = peerPluginJobConf.getString(Key.PASSWORD);
    // Cache key per data source + table
    String cacheKey = jdbcUrl + peerPluginJobConf.getString(Key.TABLE);
    synchronized (TASK_TABLE_MAP) {
        // Avoid querying the same condition repeatedly
        if (!TASK_TABLE_MAP.containsKey(cacheKey)) {
            String whereSql = getWhereSql(readerWhereSql, jdbcUrl, username, password);
            TASK_TABLE_MAP.putIfAbsent(cacheKey, whereSql);
        }
    }
    String whereSql = TASK_TABLE_MAP.get(cacheKey);
    if (!whereSql.isEmpty()) {
        String querySql = this.readerSliceConfig.getString(Key.QUERY_SQL);
        // Rewrite the query SQL with the extra condition appended
        this.readerSliceConfig.set(Key.QUERY_SQL, querySql + whereSql);
        LOG.info("querySql={}", querySql + whereSql);
    }
}
/*
 * Query the where condition for the given data source and table.
 */
private String getWhereSql(String readerWhereSql, String jdbcUrl, String username, String password) {
    Connection conn = DBUtil.getConnectionWithoutRetry(DATABASE_TYPE, jdbcUrl, username, password);
    String whereSql = "";
    ResultSet rs = null;
    try {
        DBUtil.sqlValid(readerWhereSql, DATABASE_TYPE);
        rs = DBUtil.query(conn, readerWhereSql, 1);
        ResultSetMetaData metaData = rs.getMetaData();
        if (DBUtil.asyncResultSetNext(rs)) {
            for (int i = 1, len = metaData.getColumnCount(); i <= len; i++) {
                // Separators between fragments are controlled in the configuration file
                String val = rs.getString(i);
                if (val == null) {
                    LOG.warn("`readerWhereSql` returned a null condition value: i={}, name={}", i, metaData.getColumnName(i));
                } else {
                    whereSql += val;
                }
            }
            LOG.info("whereSql={}", whereSql);
        }
    } catch (ParserException e) {
        throw RdbmsException.asSqlParserException(DATABASE_TYPE, e, readerWhereSql);
    } catch (Exception e) {
        throw DataXException.asDataXException(RdbmsException.mySqlQueryErrorAna(e.getMessage()),
                "SQL executed: " + readerWhereSql + ", error: " + e);
    } finally {
        DBUtil.closeDBResources(rs, null, conn);
    }
    return whereSql;
}
// Resolve the query variables via SQL
private void fixReaderParamSql() {
    Configuration peerPluginJobConf = super.getPeerPluginJobConf();
    String readerParamSql = peerPluginJobConf.getString("readerParamSql");
    if (readerParamSql == null) {
        return;
    }
    String jdbcUrl = peerPluginJobConf.getString(Key.JDBC_URL);
    String username = peerPluginJobConf.getString(Key.USERNAME);
    String password = peerPluginJobConf.getString(Key.PASSWORD);
    Connection conn = DBUtil.getConnectionWithoutRetry(
            MysqlReader.DATABASE_TYPE, jdbcUrl, username, password);
    ResultSet rs = null;
    try {
        rs = DBUtil.query(conn, readerParamSql, 1);
        ResultSetMetaData metaData = rs.getMetaData();
        if (DBUtil.asyncResultSetNext(rs)) {
            Map<String, String> params = new HashMap<>();
            for (int i = 1, len = metaData.getColumnCount(); i <= len; i++) {
                String val = rs.getString(i);
                if (val == null) {
                    LOG.warn("`readerParamSql` returned a null variable value: i={}, name={}",
                            i, metaData.getColumnName(i));
                } else {
                    params.put(String.format("${%s}", metaData.getColumnName(i)), val);
                }
            }
            if (!params.isEmpty()) {
                String querySql = this.readerSliceConfig.getString(Key.QUERY_SQL);
                for (Map.Entry<String, String> p : params.entrySet()) {
                    querySql = querySql.replace(p.getKey(), p.getValue());
                }
                this.readerSliceConfig.set(Key.QUERY_SQL, querySql);
                LOG.info("querySql after variable replacement={}", querySql);
            }
        }
    } catch (ParserException e) {
        throw RdbmsException.asSqlParserException(
                MysqlReader.DATABASE_TYPE, e, readerParamSql);
    } catch (Exception e) {
        throw DataXException.asDataXException(
                RdbmsException.mySqlQueryErrorAna(e.getMessage()),
                "SQL executed: " + readerParamSql + ", error: " + e);
    } finally {
        DBUtil.closeDBResources(rs, null, conn);
    }
}
The difficulty here is that peerPluginJobConf is null by default: a Task has no way to obtain the configuration of its paired mysqlwriter, and therefore cannot learn the state of the already-synced table data. The runner setup logic has to be modified to supply the missing peer configuration.
In com.alibaba.datax.core.taskgroup.TaskGroupContainer.generateRunner, set the peer configuration on each plugin: give the reader the writer's configuration, and the writer the reader's.
private AbstractRunner generateRunner(PluginType pluginType, List<TransformerExecution> transformerInfoExecs) {
    AbstractRunner newRunner = null;
    TaskPluginCollector pluginCollector;
    switch (pluginType) {
        case READER:
            newRunner = LoadUtil.loadPluginRunner(pluginType,
                    this.taskConfig.getString(CoreConstant.JOB_READER_NAME));
            newRunner.setJobConf(this.taskConfig.getConfiguration(
                    CoreConstant.JOB_READER_PARAMETER));
            // Set the peer's (writer's) configuration so the reader Task can access it
            newRunner.getPlugin().setPeerPluginJobConf(this.taskConfig.getConfiguration(
                    CoreConstant.JOB_WRITER_PARAMETER));
            pluginCollector = ClassUtil.instantiate(
                    taskCollectorClass, AbstractTaskPluginCollector.class,
                    configuration, this.taskCommunication,
                    PluginType.READER);
            RecordSender recordSender;
            if (transformerInfoExecs != null && transformerInfoExecs.size() > 0) {
                recordSender = new BufferedRecordTransformerExchanger(taskGroupId, this.taskId, this.channel,
                        this.taskCommunication, pluginCollector, transformerInfoExecs);
            } else {
                recordSender = new BufferedRecordExchanger(this.channel, pluginCollector);
            }
            ((ReaderRunner) newRunner).setRecordSender(recordSender);
            /**
             * Set the taskPlugin's collector, used for dirty data and job/task communication
             */
            newRunner.setTaskPluginCollector(pluginCollector);
            break;
        case WRITER:
            newRunner = LoadUtil.loadPluginRunner(pluginType,
                    this.taskConfig.getString(CoreConstant.JOB_WRITER_NAME));
            newRunner.setJobConf(this.taskConfig
                    .getConfiguration(CoreConstant.JOB_WRITER_PARAMETER));
            // Set the peer's (reader's) configuration so the writer Task can access it
            newRunner.getPlugin().setPeerPluginJobConf(this.taskConfig.getConfiguration(
                    CoreConstant.JOB_READER_PARAMETER));
            pluginCollector = ClassUtil.instantiate(
                    taskCollectorClass, AbstractTaskPluginCollector.class,
                    configuration, this.taskCommunication,
                    PluginType.WRITER);
            ((WriterRunner) newRunner).setRecordReceiver(new BufferedRecordExchanger(
                    this.channel, pluginCollector));
            /**
             * Set the taskPlugin's collector, used for dirty data and job/task communication
             */
            newRunner.setTaskPluginCollector(pluginCollector);
            break;
        default:
            throw DataXException.asDataXException(FrameworkErrorCode.ARGUMENT_ERROR, "Cant generateRunner for:" + pluginType);
    }
    newRunner.setTaskGroupId(taskGroupId);
    newRunner.setTaskId(this.taskId);
    newRunner.setRunnerCommunication(this.taskCommunication);
    return newRunner;
}
Verified against real table synchronization, this now works as intended: with readerWhereSql configured, the query SQL is amended as designed.
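For illustration (all values assumed), if the task's querySql is `select id, name, update_time from t_hello_world where 1=1` and the readerWhereSql query returns ` and id > 10086`, the SQL the task actually executes becomes:

```sql
select id, name, update_time from t_hello_world where 1=1 and id > 10086
```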
2.2 Dynamic lookup and replacement of environment-specific properties
The job configuration's username, password and IP differ per environment; maintaining a copy of every job per environment is both tedious and error-prone, so the values should be looked up and substituted automatically.
The current approach is to override MysqlReader.Job.preHandler() and replace the configuration variables there.
private final static String READER_PATH = "job.content[0].reader";
private final static String WRITER_PATH = "job.content[0].writer";
@Override
public void preHandler(Configuration jobConfiguration) {
    /**
     * Process variables according to the configured environment
     */
    String envConfName = System.getProperty("env.conf.name", "");
    if (!envConfName.isEmpty()) {
        String envConfPath = System.getProperty("env.conf.path", "");
        if (envConfPath.isEmpty()) {
            envConfPath = StringUtils.join(
                    new String[] { System.getProperty("datax.home"), "conf", "env_conf.json" }, File.separator);
        }
        Configuration conf = Configuration.from(new File(envConfPath));
        String querySql = conf.getString(Key.QUERY_SQL);
        Configuration envConf = conf.getConfiguration(envConfName);
        String jdbcUrl = envConf.getString(Key.JDBC_URL);
        String username = envConf.getString(Key.USERNAME);
        String password = envConf.getString(Key.PASSWORD);
        Connection conn = DBUtil.getConnectionWithoutRetry(DATABASE_TYPE, jdbcUrl, username, password);
        ResultSet rs = null;
        try {
            Map<String, Object> tables = conf.getMap("tables");
            /*
             * Table name and project name configured for the reader
             */
            String table = tables.get(System.getProperty("env.conf.rtable", "")).toString();
            String projectName = jobConfiguration.getString("job.content[0].reader.parameter.projectName");
            rs = setReplaceConfStr(jobConfiguration, READER_PATH, conn, querySql, table, projectName);
            /*
             * Table name and project name configured for the writer
             */
            table = tables.get(System.getProperty("env.conf.wtable", "")).toString();
            projectName = jobConfiguration.getString("job.content[0].writer.parameter.projectName");
            rs = setReplaceConfStr(jobConfiguration, WRITER_PATH, conn, querySql, table, projectName);
            LOG.debug("jobConfiguration={}", jobConfiguration);
        } finally {
            DBUtil.closeDBResources(rs, null, conn);
        }
    }
}
// Replace variables in the configuration according to the given conditions
private ResultSet setReplaceConfStr(Configuration jobConfiguration, String paramPath, Connection conn,
        String querySql, String table, String projectName) {
    ResultSet rs = null;
    try {
        querySql = querySql.replace(Constant.TABLE_NAME_PLACEHOLDER, table)
                .replace(Constant.PROJECT_NAME_PLACEHOLDER, projectName);
        rs = DBUtil.query(conn, querySql, 100);
        boolean hasProp = false;
        while (DBUtil.asyncResultSetNext(rs)) {
            String key = rs.getString(1);
            String val = rs.getString(2);
            if (StringUtils.isNotBlank(key)) {
                hasProp = true;
                System.setProperty(key, val);
            }
        }
        if (hasProp) {
            String replaceConf = StrUtil.replaceVariable(jobConfiguration.getString(paramPath));
            // Re-set the plugin configuration with variables resolved
            jobConfiguration.set(paramPath, Configuration.from(replaceConf));
        }
    } catch (ParserException e) {
        throw RdbmsException.asSqlParserException(DATABASE_TYPE, e, querySql);
    } catch (Exception e) {
        throw DataXException.asDataXException(RdbmsException.mySqlQueryErrorAna(e.getMessage()),
                "SQL executed: " + querySql + ", error: " + e);
    } finally {
        DBUtil.closeDBResources(rs, null, conn);
    }
    return rs;
}
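The replacement itself relies on StrUtil.replaceVariable, which resolves ${name} placeholders against system properties. A minimal self-contained sketch of that mechanism (my approximation for illustration, not DataX's actual implementation):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceVariableSketch {

    private static final Pattern VAR = Pattern.compile("\\$\\{(\\w+)\\}");

    // Resolve ${name} placeholders against System properties;
    // placeholders without a matching property are left untouched.
    static String replaceVariable(String text) {
        Matcher m = VAR.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String val = System.getProperty(m.group(1));
            m.appendReplacement(sb, Matcher.quoteReplacement(val != null ? val : m.group(0)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Simulate what setReplaceConfStr does: the properties come from the config table
        System.setProperty("stat_db_user", "config");
        String conf = "{\"username\": \"${stat_db_user}\", \"password\": \"${unset}\"}";
        System.out.println(replaceVariable(conf));
        // prints {"username": "config", "password": "${unset}"}
    }
}
```

Because unresolved placeholders survive untouched, a missing row in the configuration table fails loudly later (the literal ${name} ends up in the JDBC config) rather than silently substituting an empty string.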
Note that for this logic to run at all, the plugin's preHandler must be configured in job_xxx.json:
{
    "job": {
        "preHandler": {
            "pluginType": "reader",
            "pluginName": "mysqlreader"
        }
    }
}
The structure of env_conf.json, the per-environment property lookup configuration:
{
    "querySql": "select prop_key, prop_value from @table where project_name = '@project'",
    "tables": {
        "": "t_sys1_config",
        "sys1": "t_sys1_config",
        "sys2": "t_sys2_config"
    },
    "dev": {
        "jdbcUrl": "jdbc:mysql://localhost:3306/config?characterEncoding=UTF-8&autoReconnect=true&failOverReadOnly=false",
        "username": "config",
        "password": "123"
    }
}
An example of launching DataX directly with java:
java -server -Xms1g -Xmx1g -Dfile.encoding=UTF-8 -Dlogback.statusListenerClass=ch.qos.logback.core.status.NopStatusListener \
-Ddatax.home=/opt/jar/datax -Dlogback.configurationFile=/opt/jar/datax/conf/logback.xml -Denv.conf.name=test \
-classpath /opt/jar/datax/lib/* com.alibaba.datax.core.Engine -mode standalone -jobid -1 -job /opt/jar/datax/job/test.t_hello_world.json
3. Summary
Quite a few problems came up while pushing DataX toward production use; in brief:
- For incremental sync, mysqlreader needs to know the sync cursor, but by default variable values can only be passed in as call-time parameters, relying heavily on the caller. The plugin was therefore extended with the ability to fetch variables and adjust the SQL before a job/task runs.
- Environment-dependent data in the job configuration, such as usernames and passwords, must be resolved automatically per environment, so a conf/env_conf.json configuration was added that queries a remote configuration table for the current environment and substitutes the variables.
- DataX is currently scheduled by XXL-JOB, and each table's sync log can be inspected clearly. Some features still need tuning, e.g. integrating mail alerts with the company's systems, and verifying runtime stability once more executors are managed.