DataX
I. Resources
1. Git repository: https://github.com/alibaba/DataX
2. Detailed introduction: https://github.com/alibaba/DataX/blob/master/introduction.md
3. Build and download guide: https://github.com/alibaba/DataX/blob/master/userGuid.md
4. Data source reference guide: https://github.com/alibaba/DataX/wiki/DataX-all-data-channels
5. Plugin development tutorial: https://github.com/alibaba/DataX/blob/master/dataxPluginDev.md
II. Deployment
1. Download the source code
https://github.com/alibaba/DataX/archive/refs/heads/master.zip
2. Run the build
Unpack the archive, enter the source directory, and run:
mvn -U clean package assembly:assembly -Dmaven.test.skip=true
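The build assumes the environment from the official user guide (Linux, JDK 1.8+, Maven 3.x; Python is needed later to run the launcher). A quick sanity check before compiling:
java -version   # expect 1.8 or newer
mvn -version    # expect Maven 3.x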
3. Check the build output
After a successful build, the packaged DataX distribution is located at:
./target/datax/datax/
4. Upload the built datax directory to /opt/module
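A minimal way to ship the package, assuming the build host and the target host differ (user@target-host is a placeholder):
tar -czf datax.tar.gz -C target/datax datax
scp datax.tar.gz user@target-host:/opt/module/
ssh user@target-host 'cd /opt/module && tar -xzf datax.tar.gz'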
5. View a config template
The general form is:
python datax.py -r {readerName} -w {writerName}
For example:
cd /opt/module/datax/bin
python datax.py -r streamreader -w streamwriter
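Before moving on, you can also run the self-test job that ships with the distribution to confirm the deployment works end to end:
python datax.py /opt/module/datax/job/job.json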
6. Official quick-start example
(1) Create stream2stream.json
mkdir -p /opt/module/datax/data
cd /opt/module/datax/data
vim stream2stream.json
Add the following content:
{ "job": { "content": [ { "reader": { "name": "streamreader", "parameter": { "sliceRecordCount": 10, "column": [ { "type": "long", "value": "10" }, { "type": "string", "value": "hello,你好,世界-DataX" } ] } }, "writer": { "name": "streamwriter", "parameter": { "encoding": "UTF-8", "print": true } } } ], "setting": { "speed": { "channel": 5 } } } }
(2) Launch DataX
cd /opt/module/datax/bin/
python datax.py /opt/module/datax/data/stream2stream.json
(3) Console output
......
2022-10-17 11:21:26.003 [job-0] INFO  JobContainer - PerfTrace not enable!
2022-10-17 11:21:26.004 [job-0] INFO  StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.002s | Percentage 100.00%
2022-10-17 11:21:26.004 [job-0] INFO  JobContainer -
任务启动时刻 : 2022-10-17 11:21:15
任务结束时刻 : 2022-10-17 11:21:26
任务总计耗时 : 10s
任务平均流量 : 95B/s
记录写入速度 : 5rec/s
读出记录总数 : 50
读写失败总数 : 0
III. Data Synchronization
1. MySQL to MySQL
(1) Print the JSON config template
cd /opt/module/datax/bin/
python datax.py -r mysqlreader -w mysqlwriter
(2) Write the JSON file
mkdir -p /opt/module/datax/data/mysqlTomysql
vim /opt/module/datax/data/mysqlTomysql/mysqlTomysql.json
Add the following content:
{ "job": { "content": [ { "reader": { "name": "mysqlreader", "parameter": { "column": [ "name", "sourceLabel", "targetLabel", "properties", "nullableKeys" ], "connection": [ { "jdbcUrl": [ "jdbc:mysql://192.168.xxx.xxx:3306/库名?autoReconnect=true&useSSL=false" ], "table": [ "表名" ] } ], "password": "密码", "username": "用户", "where": "" } }, "writer": { "name": "mysqlwriter", "parameter": { "column": [ "name", "sourceLabel", "targetLabel", "properties", "nullableKeys" ], "connection": [ { "jdbcUrl": "jdbc:mysql://192.168.xxx.xxx:3306/库名?autoReconnect=true&useSSL=false", "table": [ "表名" ] } ], "password": "密码", "preSql": [], "session": [], "username": "用户", "writeMode": "insert" } } } ], "setting": { "speed": { "channel": 3 }, "errorLimit": { "record": 0, "percentage": 0.02 } } } }
(3) Run the sync
cd /opt/module/datax/bin/
python datax.py /opt/module/datax/data/mysqlTomysql/mysqlTomysql.json
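mysqlwriter's writeMode controls the statement used for loading: insert (INSERT INTO; rows hitting a duplicate key are counted as dirty data against errorLimit), replace (REPLACE INTO), and update (INSERT ... ON DUPLICATE KEY UPDATE). To make a job safely re-runnable against a target with a primary key, switch the writer to, for example:
"writeMode": "replace"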
2. SQL Server to SQL Server
(1) Print the JSON config template
cd /opt/module/datax/bin/
python datax.py -r sqlserverreader -w sqlserverwriter
(2) Write the JSON file
mkdir -p /opt/module/datax/data/sqlserverTosqlserver
vim /opt/module/datax/data/sqlserverTosqlserver/sqlserverTosqlserver.json
Add the following content:
{ "job": { "content": [ { "reader": { "name": "sqlserverreader", "parameter": { "column": [ "CompanyCode", "adate", "gid", "code", "total", "bcktotal", "memo", "outmemo" ], "connection": [ { "jdbcUrl": [ "jdbc:sqlserver://192.168.xxx.xxx:1433;DatabaseName=库名" ], "table": [ "表名" ] } ], "password": "密码", "username": "用户" } }, "writer": { "name": "sqlserverwriter", "parameter": { "column": [ "CompanyCode", "adate", "gid", "code", "total", "bcktotal", "memo", "outmemo" ], "connection": [ { "jdbcUrl": "jdbc:sqlserver://192.168.xxx.xxx:1433;DatabaseName=库名", "table": [ "表名" ] } ], "password": "密码", "postSql": [], "preSql": [], "username": "用户" } } } ], "setting": { "speed": { "channel": 3 }, "errorLimit": { "record": 0, "percentage": 0.02 } } } }
(3) Run the sync
cd /opt/module/datax/bin/
python datax.py /opt/module/datax/data/sqlserverTosqlserver/sqlserverTosqlserver.json
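For full refreshes, the relational writers can run cleanup SQL on the target before loading via preSql; a common pattern (table_name is the placeholder from above):
"preSql": ["TRUNCATE TABLE table_name"]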
3. MySQL to SQL Server
(1) Print the JSON config template
cd /opt/module/datax/bin/
python datax.py -r mysqlreader -w sqlserverwriter
(2) Write the JSON file
mkdir -p /opt/module/datax/data/mysqlTosqlserver
vim /opt/module/datax/data/mysqlTosqlserver/mysqlTosqlserver.json
Add the following content:
{ "job": { "content": [ { "reader": { "name": "mysqlreader", "parameter": { "column": [ "CompanyCode1", "adate1", "gid1", "code1", "total1", "bcktotal1", "outmemo1", "memo1" ], "connection": [ { "jdbcUrl": [ "jdbc:mysql://192.168.xxx.xxx:3306/库名?autoReconnect=true&useSSL=false" ], "table": [ "表名" ] } ], "password": "密码", "username": "用户", "where": "" } }, "writer": { "name": "sqlserverwriter", "parameter": { "column": [ "CompanyCode", "adate", "gid", "code", "total", "bcktotal", "memo", "outmemo" ], "connection": [ { "jdbcUrl": "jdbc:sqlserver://192.168.xxx.xxx:1433;DatabaseName=库名", "table": [ "表名" ] } ], "password": "密码", "postSql": [], "preSql": [], "username": "用户" } } } ], "setting": { "speed": { "channel": 3 }, "errorLimit": { "record": 0, "percentage": 0.02 } } } }
(3) Run the sync
cd /opt/module/datax/bin/
python datax.py /opt/module/datax/data/mysqlTosqlserver/mysqlTosqlserver.json
4. SQL Server to MySQL
(1) Print the JSON config template
cd /opt/module/datax/bin/
python datax.py -r sqlserverreader -w mysqlwriter
(2) Write the JSON file
mkdir -p /opt/module/datax/data/sqlserverTomysql
vim /opt/module/datax/data/sqlserverTomysql/sqlserverTomysql.json
Add the following content:
{ "job": { "content": [ { "reader": { "name": "sqlserverreader", "parameter": { "column": [ "CompanyCode", "adate", "gid", "code", "total", "bcktotal", "memo", "outmemo" ], "connection": [ { "jdbcUrl": [ "jdbc:sqlserver://192.168.xxx.xxx:1433;DatabaseName=库名" ], "table": [ "表名" ] } ], "password": "密码", "username": "用户" } }, "writer": { "name": "mysqlwriter", "parameter": { "column": [ "CompanyCode1", "adate1", "gid1", "code1", "total1", "bcktotal1", "memo1", "outmemo1" ], "connection": [ { "jdbcUrl": "jdbc:mysql://192.168.xxx.xxx:3306/库名?autoReconnect=true&useSSL=false", "table": [ "表名" ] } ], "password": "密码", "preSql": [], "session": [], "username": "用户", "writeMode": "insert" } } } ], "setting": { "speed": { "channel": 3 }, "errorLimit": { "record": 0, "percentage": 0.02 } } } }
(3) Run the sync
cd /opt/module/datax/bin/
python datax.py /opt/module/datax/data/sqlserverTomysql/sqlserverTomysql.json
5. MySQL to HDFS
(1) Print the JSON config template
cd /opt/module/datax/bin/
python datax.py -r mysqlreader -w hdfswriter
(2) Write the JSON file
mkdir -p /opt/module/datax/data/mysqlTohdfs
vim /opt/module/datax/data/mysqlTohdfs/mysqlTohdfs.json
Add the following content:
{ "job": { "content": [ { "reader": { "name": "mysqlreader", "parameter": { "column": [ "CompanyCode1", "adate1", "gid1", "code1", "total1", "bcktotal1", "outmemo1", "memo1" ], "connection": [ { "jdbcUrl": [ "jdbc:mysql://192.168.xxx.xxx:3306/库名?autoReconnect=true&useSSL=false" ], "table": [ "表名" ] } ], "password": "密码", "username": "用户", "where": "" } }, "writer": { "name": "hdfswriter", "parameter": { "column": [ { "name": "CompanyCode1", "type": "string" }, { "name": "adate1", "type": "string" }, { "name": "gid1", "type": "int" }, { "name": "code1", "type": "string" }, { "name": "total1", "type": "double" }, { "name": "bcktotal1", "type": "double" }, { "name": "outmemo1", "type": "string" }, { "name": "memo1", "type": "string" } ], "compress": "NONE", "defaultFS": "hdfs://nametest", "hadoopConfig": { "dfs.nameservices": "nametest", "dfs.client.failover.proxy.provider.nametest": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider", "dfs.ha.automatic-failover.enabled.nametest": "true", "ha.zookeeper.quorum": "192.168.xxx.xxx:2181,192.168.xxx.xxx:2181,192.168.xxx.xxx:2181", "dfs.ha.namenodes.nametest": "namenode1,namenode2", "dfs.namenode.rpc-address.nametest.namenode1": "192.168.xxx.xxx:8020", "dfs.namenode.servicerpc-address.nametest.namenode1": "192.168.xxx.xxx:8022", "dfs.namenode.http-address.nametest.namenode1": "192.168.xxx.xxx:9870", "dfs.namenode.https-address.nametest.namenode1": "192.168.xxx.xxx:9871", "dfs.namenode.rpc-address.nametest.namenode2": "192.168.xxx.xxx:8020", "dfs.namenode.servicerpc-address.nametest.namenode2": "192.168.xxx.xxx:8022", "dfs.namenode.http-address.nametest.namenode2": "192.168.xxx.xxx:9870", "dfs.namenode.https-address.nametest.namenode2": "192.168.xxx.xxx:9871" }, "fieldDelimiter": "\t", "fileName": "文件名", "fileType": "ORC", "path": "/", "writeMode": "append" } } } ], "setting": { "speed": { "channel": 3 }, "errorLimit": { "record": 0, "percentage": 0.02 } } } }
(3) Run the sync
cd /opt/module/datax/bin/
python datax.py /opt/module/datax/data/mysqlTohdfs/mysqlTohdfs.json
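Note that hdfswriter requires path to point at an existing HDFS directory, and writing to "/" as above is rarely what you want; create a dedicated target directory first (the path here is only an example):
hdfs dfs -mkdir -p /warehouse/datax/table_name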
6. SQL Server to HDFS
(1) Print the JSON config template
cd /opt/module/datax/bin/
python datax.py -r sqlserverreader -w hdfswriter
(2) Write the JSON file
mkdir -p /opt/module/datax/data/sqlserverTohdfs
vim /opt/module/datax/data/sqlserverTohdfs/sqlserverTohdfs.json
Add the following content:
{ "job": { "content": [ { "reader": { "name": "sqlserverreader", "parameter": { "column": [ "CompanyCode", "adate", "gid", "code", "total", "bcktotal", "memo", "outmemo" ], "connection": [ { "jdbcUrl": [ "jdbc:sqlserver://192.168.xxx.xxx:1433;DatabaseName=库名" ], "table": [ "表名" ] } ], "password": "密码", "username": "用户" } }, "writer": { "name": "hdfswriter", "parameter": { "column": [ { "name": "CompanyCode1", "type": "string" }, { "name": "adate1", "type": "string" }, { "name": "gid1", "type": "int" }, { "name": "code1", "type": "string" }, { "name": "total1", "type": "double" }, { "name": "bcktotal1", "type": "double" }, { "name": "outmemo1", "type": "string" }, { "name": "memo1", "type": "string" } ], "compress": "snappy", "defaultFS": "hdfs://nametest", "hadoopConfig": { "dfs.nameservices": "nametest", "dfs.client.failover.proxy.provider.nametest": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider", "dfs.ha.automatic-failover.enabled.nametest": "true", "ha.zookeeper.quorum": "192.168.xxx.xxx:2181,192.168.xxx.xxx:2181,192.168.xxx.xxx:2181", "dfs.ha.namenodes.nametest": "namenode1,namenode2", "dfs.namenode.rpc-address.nametest.namenode1": "192.168.xxx.xxx:8020", "dfs.namenode.servicerpc-address.nametest.namenode1": "192.168.xxx.xxx:8022", "dfs.namenode.http-address.nametest.namenode1": "192.168.xxx.xxx:9870", "dfs.namenode.https-address.nametest.namenode1": "192.168.xxx.xxx:9871", "dfs.namenode.rpc-address.nametest.namenode2": "192.168.xxx.xxx:8020", "dfs.namenode.servicerpc-address.nametest.namenode2": "192.168.xxx.xxx:8022", "dfs.namenode.http-address.nametest.namenode2": "192.168.xxx.xxx:9870", "dfs.namenode.https-address.nametest.namenode2": "192.168.xxx.xxx:9871" }, "fieldDelimiter": "\t", "fileName": "文件名", "fileType": "parquet", "path": "/", "writeMode": "append" } } } ], "setting": { "speed": { "channel": 3 }, "errorLimit": { "record": 0, "percentage": 0.02 } } } }
(3) Run the sync
cd /opt/module/datax/bin/
python datax.py /opt/module/datax/data/sqlserverTohdfs/sqlserverTohdfs.json
7. HDFS to SQL Server
(1) Print the JSON config template
cd /opt/module/datax/bin/
python datax.py -r hdfsreader -w sqlserverwriter
(2) Write the JSON file
mkdir -p /opt/module/datax/data/hdfsTosqlserver
vim /opt/module/datax/data/hdfsTosqlserver/hdfsTosqlserver.json
Add the following content:
{ "job": { "content": [ { "reader": { "name": "hdfsreader", "parameter": { "column": [ { "name": "CompanyCode1", "type": "string" }, { "name": "adate1", "type": "string" }, { "name": "gid1", "type": "int" }, { "name": "code1", "type": "string" }, { "name": "total1", "type": "double" }, { "name": "bcktotal1", "type": "double" }, { "name": "outmemo1", "type": "string" }, { "name": "memo1", "type": "string" } ], "defaultFS": "hdfs://nametest", "hadoopConfig": { "dfs.nameservices": "nametest", "dfs.client.failover.proxy.provider.nametest": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider", "dfs.ha.automatic-failover.enabled.nametest": "true", "ha.zookeeper.quorum": "192.168.xxx.xxx:2181,192.168.xxx.xxx:2181,192.168.xxx.xxx:2181", "dfs.ha.namenodes.nametest": "namenode1,namenode2", "dfs.namenode.rpc-address.nametest.namenode1": "192.168.xxx.xxx:8020", "dfs.namenode.servicerpc-address.nametest.namenode1": "192.168.xxx.xxx:8022", "dfs.namenode.http-address.nametest.namenode1": "192.168.xxx.xxx:9870", "dfs.namenode.https-address.nametest.namenode1": "192.168.xxx.xxx:9871", "dfs.namenode.rpc-address.nametest.namenode2": "192.168.xxx.xxx:8020", "dfs.namenode.servicerpc-address.nametest.namenode2": "192.168.xxx.xxx:8022", "dfs.namenode.http-address.nametest.namenode2": "192.168.xxx.xxx:9870", "dfs.namenode.https-address.nametest.namenode2": "192.168.xxx.xxx:9871" }, "encoding": "UTF-8", "fieldDelimiter": ",", "fileType": "orc", "path": "/*" } }, "writer": { "name": "sqlserverwriter", "parameter": { "column": [ "CompanyCode", "adate", "gid", "code", "total", "bcktotal", "memo", "outmemo" ], "connection": [ { "jdbcUrl": "jdbc:sqlserver://192.168.xxx.xxx:1433;DatabaseName=库名", "table": [ "表名" ] } ], "password": "密码", "postSql": [], "preSql": [], "username": "用户" } } } ], "setting": { "speed": { "channel": 3 }, "errorLimit": { "record": 0, "percentage": 0.02 } } } }
(3) Run the sync
cd /opt/module/datax/bin/
python datax.py /opt/module/datax/data/hdfsTosqlserver/hdfsTosqlserver.json
8. Hive to MySQL (via hdfsreader over the Hive table's HDFS files)
(1) Print the JSON config template
cd /opt/module/datax/bin/
python datax.py -r hdfsreader -w mysqlwriter
(2) Write the JSON file
mkdir -p /opt/module/datax/data/hdfsTomysql
vim /opt/module/datax/data/hdfsTomysql/hdfsTomysql.json
Add the following content:
{ "job": { "content": [ { "reader": { "name": "hdfsreader", "parameter": { "column": [ { "name": "CompanyCode1", "type": "string" }, { "name": "adate1", "type": "string" }, { "name": "gid1", "type": "int" }, { "name": "code1", "type": "string" }, { "name": "total1", "type": "double" }, { "name": "bcktotal1", "type": "double" }, { "name": "outmemo1", "type": "string" }, { "name": "dt", "type": "string", "value": "${dt}" } ], "defaultFS": "hdfs://nametest", "hadoopConfig": { "dfs.nameservices": "nametest", "dfs.client.failover.proxy.provider.nametest": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider", "dfs.ha.automatic-failover.enabled.nametest": "true", "ha.zookeeper.quorum": "192.168.xxx.xxx:2181,192.168.xxx.xxx:2181,192.168.xxx.xxx:2181", "dfs.ha.namenodes.nametest": "namenode1,namenode2", "dfs.namenode.rpc-address.nametest.namenode1": "192.168.xxx.xxx:8020", "dfs.namenode.servicerpc-address.nametest.namenode1": "192.168.xxx.xxx:8022", "dfs.namenode.http-address.nametest.namenode1": "192.168.xxx.xxx:9870", "dfs.namenode.https-address.nametest.namenode1": "192.168.xxx.xxx:9871", "dfs.namenode.rpc-address.nametest.namenode2": "192.168.xxx.xxx:8020", "dfs.namenode.servicerpc-address.nametest.namenode2": "192.168.xxx.xxx:8022", "dfs.namenode.http-address.nametest.namenode2": "192.168.xxx.xxx:9870", "dfs.namenode.https-address.nametest.namenode2": "192.168.xxx.xxx:9871" }, "encoding": "UTF-8", "fieldDelimiter": ",", "fileType": "orc", "path": "/apps/hive/warehouse/ods.db/student/dt=${dt}/*" } }, "writer": { "name": "mysqlwriter", "parameter": { "column": [ "CompanyCode1", "adate1", "gid1", "code1", "total1", "bcktotal1", "memo1", "dt" ], "connection": [ { "jdbcUrl": "jdbc:mysql://192.168.xxx.xxx:3306/库名?autoReconnect=true&useSSL=false", "table": [ "表名" ] } ], "password": "密码", "preSql": [], "session": [], "username": "用户", "writeMode": "insert" } } } ], "setting": { "speed": { "channel": 3 }, "errorLimit": { "record": 0, "percentage": 0.02 } } } }
(3) Run the sync
cd /opt/module/datax/bin/
python datax.py /opt/module/datax/data/hdfsTomysql/hdfsTomysql.json
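Because the job file references the ${dt} variable, pass a value at launch through datax.py's -p option (the date below is only an example):
python datax.py -p "-Ddt=2022-10-17" /opt/module/datax/data/hdfsTomysql/hdfsTomysql.json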
9. MySQL to Doris
(1) Print the JSON config template
cd /opt/module/datax/bin/
python datax.py -r mysqlreader -w doriswriter
(2) Write the JSON file
mkdir -p /opt/module/datax/data/mysqlTodoris
vim /opt/module/datax/data/mysqlTodoris/mysqlTodoris.json
Add the following content:
{ "job": { "content": [ { "reader": { "name": "mysqlreader", "parameter": { "column": [ "CompanyCode1", "adate1", "gid1", "code1", "total1", "bcktotal1", "outmemo1", "memo1" ], "connection": [ { "jdbcUrl": [ "jdbc:mysql://192.168.xxx.xxx:3306/库名?autoReconnect=true&useSSL=false" ], "table": [ "表名" ] } ], "password": "密码", "username": "用户", "where": "" } }, "writer": { "name": "doriswriter", "parameter": { "beLoadUrl": [ "192.168.xxx.xxx:8040", "192.168.xxx.xxx:8040", "192.168.xxx.xxx:8040", "192.168.xxx.xxx:8040" ], "column": [ "CompanyCode", "adate", "gid", "code", "total", "bcktotal", "outmemo", "memo" ], "connection": [ { "jdbcUrl": "jdbc:mysql://192.168.xxx.xxx:9030/", "selectedDatabase": "库名", "table": [ "表名" ] } ], "loadProps": {}, "loadUrl": [ "192.168.xxx.xxx:8030", "192.168.xxx.xxx:8030" ], "password": "密码", "postSql": [], "preSql": [], "maxBatchRows" : 10000, "maxBatchByteSize" : 104857600, "labelPrefix": "datax_doris_writer_demo_", "lineDelimiter": "\n", "username": "用户" } } } ], "setting": { "speed": { "channel": 3 }, "errorLimit": { "record": 0, "percentage": 0.02 } } } }
(3) Run the sync
cd /opt/module/datax/bin/
python datax.py /opt/module/datax/data/mysqlTodoris/mysqlTodoris.json
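As written, the writer lists both beLoadUrl (BE webserver, port 8040) and loadUrl (FE HTTP, port 8030) as Stream Load endpoints; doriswriter versions generally expect one or the other, and pointing at the FE is the more common choice, e.g. keeping only:
"loadUrl": ["192.168.xxx.xxx:8030", "192.168.xxx.xxx:8030"]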