Daily incremental data sync from Oracle to ClickHouse with DataX
I. Prerequisites:
Install and configure DataX and ClickHouse.
DataX directory: /home/ad/datax/datax/
Job script directory: /home/ad/datax/datax/job
Log directory: /home/hadoop/workshell/tmp/
II. Script design and implementation:
1. Write a time-based driver script (Python)
1) To do incremental extraction with DataX, rows have to be filtered by some timestamp, so that timestamp must be passed into the actual datax.py invocation. DataX supports job variables for exactly this purpose.
2) With a timestamp-based increment, the timestamp is a parameter that changes on every run. The idea is to persist it in an X.record file: each run reads the last execution time (lastExecuteTime) from that file and takes the current time (currentTime); the rows to extract are exactly those whose timestamp falls between the two.
3) Below is the incremental Python driver script. It takes three arguments: the path of the DataX JSON job file, the path of the output log, and the path of the .record file that stores the timestamp (the .record file does not need to be created beforehand; it is generated automatically, and on the first run all historical data is extracted):
#encoding="utf-8"
# three args: datax job config path, log file path, record file path
import time
import sys
import os

print("going to execute")
configFilePath = sys.argv[1]
logFilePath = sys.argv[2]
lastTimeExecuteRecord = sys.argv[3]
print("==============================================")
print("configFilePath : " + configFilePath)
print("logFilePath : " + logFilePath)
print("lastTimeExecute File : " + lastTimeExecuteRecord)
print("==============================================")

# read the last execution time; if the record file does not exist yet
# (first run), fall back to 1970 so that all history is extracted
lastExecuteTime = ""
try:
    fo = open(lastTimeExecuteRecord.strip(), "r")
    lastExecuteTime = fo.read()
    fo.close()
except IOError:
    lastExecuteTime = '1970-01-01 00:00:00'
print("last execute time: " + lastExecuteTime)

currentTime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
print("current time is: " + currentTime)

# pass both timestamps to datax.py as job variables (-D...); DataX
# substitutes them into ${lastTime} / ${currentTime} in the job JSON
script2execute = "python /home/hadoop/datax/bin/datax.py %s -p \"-DlastTime='%s' -DcurrentTime='%s'\" >> %s" % (
    configFilePath, lastExecuteTime, currentTime, logFilePath)
print("script to execute: " + script2execute)
os.system(script2execute)
print("script execute ending")

# persist the current timestamp for the next run
fo = open(lastTimeExecuteRecord.strip(), "w+")
fo.write(currentTime)
fo.close()
print("ending --- " + lastTimeExecuteRecord)
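One caveat: the script above updates the record file unconditionally, so if the DataX run fails, the window between lastExecuteTime and currentTime is silently skipped on the next run. A minimal hardening sketch (same variable names as above; it would replace everything from the os.system call onward) checks the exit status before advancing the timestamp:

ret = os.system(script2execute)
print("script execute ending, exit status: %d" % ret)
if ret != 0:
    # keep the old timestamp so the failed window is retried next run
    sys.exit(1)

fo = open(lastTimeExecuteRecord.strip(), "w+")
fo.write(currentTime)
fo.close()
print("ending --- " + lastTimeExecuteRecord)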
4) Once the script is in place, edit the DataX job JSON file accordingly.
{ "setting": {}, "job": { "setting": { "speed": { "channel": 2 } }, "content": [{ "reader": { "name": "oraclereader", "parameter": { "username": "xx_dev1", "password": "xx_dev1", "connection": [{ "querySql": ["select PARTY_ID as P_PEM,CREATED_STAMP as P_CREATED_STAMP from PARTY where CREATED_STAMP >= to_date('${lastTime}', 'yyyy-MM-dd HH24:mi:ss') and CREATED_STAMP< to_date('${currentTime}', 'yyyy-MM-dd HH24:mi:ss')"], "jdbcUrl": ["jdbc:oracle:thin:@x.x.x.x:1521/test"] }] } }, "writer": { "name": "clickhousewriter", "parameter": { "username": "xx_dev", "password": "admin123!", "column": ["*"], "connection": [{ "jdbcUrl": "jdbc:clickhouse://x.x.x.x:xxxx/xx_prod", "table": ["dwd_xxx_allattr"] }] } } }] } }
5) Wrapper shell script:

#!/bin/sh
source ~/.bash_profile

echo "start DI dwd_xxx_allattr"
date

# incremental sync from Oracle to ClickHouse
python /home/hadoop/workshell/datax_scheduler_dwd_xxx_allattr_di.py \
    '/home/hadoop/datax/job/dwd_xxx_allattr_di.json' \
    '/home/hadoop/workshell/tmp/dwd_xxx_allattr_di.log' \
    '/home/hadoop/datax/jobTimer/record/dwd_xxx_allattr.record'

# compact the whole table after loading (merges parts; with a
# ReplacingMergeTree engine this also deduplicates)
sql="optimize table xx_prod.dwd_xxx_allattr FINAL;"
echo $sql
clickhouse client -q "${sql}"

date
echo "end DI dwd_xxx_allattr"
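To run this as a daily incremental job, the wrapper script can be scheduled with cron. A hypothetical crontab entry (the script name run_dwd_xxx_allattr_di.sh and the 01:00 schedule are assumptions, not part of the setup above):

# hypothetical crontab entry: run the incremental sync daily at 01:00
0 1 * * * /bin/sh /home/hadoop/workshell/run_dwd_xxx_allattr_di.sh >> /home/hadoop/workshell/tmp/cron_dwd_xxx_allattr_di.log 2>&1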
Contents of the record file (a single timestamp in %Y-%m-%d %H:%M:%S format, matching both the strftime call in the driver script and the to_date mask in querySql):

2022-01-21 16:13:01