Daily incremental data sync from Oracle to ClickHouse with DataX
I. Prerequisites:
Install and configure DataX and ClickHouse.
DataX directory: /home/ad/datax/datax/
Job script directory: /home/ad/datax/datax/job
Log directory: /home/hadoop/workshell/tmp/
II. Script design and implementation:
1. Write a time-based driver script (Python)
1) To do incremental extraction with DataX, rows have to be filtered by some timestamp, so that timestamp must be passed into the actual datax.py invocation. DataX supports job variables for exactly this purpose.
2) With a timestamp-based increment, the timestamp is a parameter that changes on every run. The idea is to persist it in an X.record file: each run reads the last execution time (lastExecuteTime) from that file and takes the current time (currentTime); the rows to extract are exactly those whose timestamp falls between the two.
3) Below is the incremental Python driver script. It takes three arguments: the path of the DataX JSON job file, the path of the output log, and the path of the .record file that stores the timestamp (the .record file does not need to be created beforehand; it is generated automatically, and on the first run all historical data is extracted):
#encoding="utf-8"
# three args: datax job config path, log file path, record file path
import time
import sys
import os

print("going to execute")
configFilePath = sys.argv[1]
logFilePath = sys.argv[2]
lastTimeExecuteRecord = sys.argv[3]
print("==============================================")
print("configFilePath : " + configFilePath)
print("logFilePath : " + logFilePath)
print("lastTimeExecute File : " + lastTimeExecuteRecord)
print("==============================================")

# read the last execution time; if the record file does not exist yet
# (first run), fall back to 1970 so that all history is extracted
lastExecuteTime = ""
try:
    fo = open(lastTimeExecuteRecord.strip(), "r")
    lastExecuteTime = fo.read()
    fo.close()
except IOError:
    lastExecuteTime = '1970-01-01 00:00:00'
print("last execute time: " + lastExecuteTime)

currentTime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
print("current time is: " + currentTime)

# pass both timestamps to datax.py as job variables (-D...); DataX
# substitutes them into ${lastTime} / ${currentTime} in the job JSON
script2execute = "python /home/hadoop/datax/bin/datax.py %s -p \"-DlastTime='%s' -DcurrentTime='%s'\" >> %s" % (
    configFilePath, lastExecuteTime, currentTime, logFilePath)
print("script to execute: " + script2execute)
os.system(script2execute)
print("script execute ending")

# persist the current timestamp for the next run
fo = open(lastTimeExecuteRecord.strip(), "w+")
fo.write(currentTime)
fo.close()
print("ending --- " + lastTimeExecuteRecord)
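One caveat: the script above updates the record file unconditionally, so if the DataX run fails, the window between lastExecuteTime and currentTime is silently skipped on the next run. A minimal hardening sketch (same variable names as above; it would replace everything from the os.system call onward) checks the exit status before advancing the timestamp:

ret = os.system(script2execute)
print("script execute ending, exit status: %d" % ret)
if ret != 0:
    # keep the old timestamp so the failed window is retried next run
    sys.exit(1)

fo = open(lastTimeExecuteRecord.strip(), "w+")
fo.write(currentTime)
fo.close()
print("ending --- " + lastTimeExecuteRecord)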
4) Once the script is in place, edit the DataX job JSON file accordingly.
{ "setting": {}, "job": { "setting": { "speed": { "channel": 2 } }, "content": [{ "reader": { "name": "oraclereader", "parameter": { "username": "xx_dev1", "password": "xx_dev1", "connection": [{ "querySql": ["select PARTY_ID as P_PEM,CREATED_STAMP as P_CREATED_STAMP from PARTY where CREATED_STAMP >= to_date('${lastTime}', 'yyyy-MM-dd HH24:mi:ss') and CREATED_STAMP< to_date('${currentTime}', 'yyyy-MM-dd HH24:mi:ss')"], "jdbcUrl": ["jdbc:oracle:thin:@x.x.x.x:1521/test"] }] } }, "writer": { "name": "clickhousewriter", "parameter": { "username": "xx_dev", "password": "admin123!", "column": ["*"], "connection": [{ "jdbcUrl": "jdbc:clickhouse://x.x.x.x:xxxx/xx_prod", "table": ["dwd_xxx_allattr"] }] } } }] } }
5) Wrapper shell script:

#!/bin/sh
source ~/.bash_profile

echo "start DI dwd_xxx_allattr"
date

# incremental sync from Oracle to ClickHouse
python /home/hadoop/workshell/datax_scheduler_dwd_xxx_allattr_di.py \
    '/home/hadoop/datax/job/dwd_xxx_allattr_di.json' \
    '/home/hadoop/workshell/tmp/dwd_xxx_allattr_di.log' \
    '/home/hadoop/datax/jobTimer/record/dwd_xxx_allattr.record'

# compact the whole table after loading (merges parts; with a
# ReplacingMergeTree engine this also deduplicates)
sql="optimize table xx_prod.dwd_xxx_allattr FINAL;"
echo $sql
clickhouse client -q "${sql}"

date
echo "end DI dwd_xxx_allattr"
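To run this as a daily incremental job, the wrapper script can be scheduled with cron. A hypothetical crontab entry (the script name run_dwd_xxx_allattr_di.sh and the 01:00 schedule are assumptions, not part of the setup above):

# hypothetical crontab entry: run the incremental sync daily at 01:00
0 1 * * * /bin/sh /home/hadoop/workshell/run_dwd_xxx_allattr_di.sh >> /home/hadoop/workshell/tmp/cron_dwd_xxx_allattr_di.log 2>&1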
Contents of the record file (a single timestamp in %Y-%m-%d %H:%M:%S format, matching both the strftime call in the driver script and the to_date mask in querySql):

2022-01-21 16:13:01