Using DataX
Questions addressed:
What is DataX?
How do you use DataX?
What does a DataX configuration file look like?
1. What is DataX?
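DataX is an offline data synchronization tool/platform widely used within Alibaba Group. It provides efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS.
What can you do with DataX? For example, you can import a table from a MySQL database into an Oracle database or an HBase database. You can also write a Transformer plugin to convert data along the way, for instance to mask parts of a string.
For details, see: https://github.com/alibaba/DataX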
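To make the string-masking idea concrete, here is a minimal sketch in plain Python. It only illustrates the kind of transformation such a plugin might apply to a field. It does not use any DataX API, and the function name `mask_string` and its parameters are hypothetical.

```python
def mask_string(value, start, length, mask_char="*"):
    """Replace up to `length` characters of `value`, beginning at index `start`.

    Hypothetical helper for illustration only; not part of DataX.
    """
    if start < 0 or length < 0:
        raise ValueError("start and length must be non-negative")
    if start >= len(value):
        # Nothing to mask: the start index lies beyond the end of the string.
        return value
    end = min(start + length, len(value))
    return value[:start] + mask_char * (end - start) + value[end:]


if __name__ == "__main__":
    # Mask the middle four digits of a phone-number-like string.
    print(mask_string("13812345678", 3, 4))  # prints 138****5678
```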
2. Using DataX
Runtime environment:
- Linux
- JDK (1.8 or later; 1.8 recommended)
- Python (Python 2.6.x recommended)
- Apache Maven 3.x (only needed to compile DataX from source)
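Before installing, it can help to confirm that the required tools are on the PATH. The sketch below uses only the Python standard library. The executable names `java`, `python`, and `mvn` are assumptions about how the tools are installed on your machine, and the script does not check versions.

```python
import shutil

# Assumed executable names; adjust if your system uses e.g. "python2" or "python3".
REQUIRED_TOOLS = ("java", "python", "mvn")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools were found on the PATH.")
```

The `which` parameter exists only so the lookup can be swapped out in tests; normal use calls `find_missing_tools()` with no arguments.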
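2.1 Direct installation (method 1)
Download the DataX package directly (DataX download link), extract it to a local directory, and change into the bin directory. It is then ready to use: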
$ cd {YOUR_DATAX_HOME}/bin
$ python datax.py {YOUR_JOB.json}
YOUR_DATAX_HOME: the directory where DataX was extracted.
YOUR_JOB.json: the configuration file that describes the job DataX should run.
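If you want to launch jobs from a script rather than by hand, the two commands above can be wrapped in a small Python helper. This is a sketch under assumptions: the function names are hypothetical, and it invokes `datax.py` by its full path instead of changing into the bin directory first.

```python
import os
import subprocess


def build_datax_command(datax_home, job_file, python_exe="python"):
    """Build the argument list equivalent to: python {datax_home}/bin/datax.py {job_file}."""
    launcher = os.path.join(datax_home, "bin", "datax.py")
    return [python_exe, launcher, job_file]


def run_job(datax_home, job_file):
    """Run the job and return the exit code of the launcher process."""
    completed = subprocess.run(build_datax_command(datax_home, job_file))
    return completed.returncode
```

A call might look like `run_job("/opt/datax", "./stream2stream.json")`, where `/opt/datax` stands in for your own YOUR_DATAX_HOME.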
2.2 Building from source (method 2)
2.2.1 Download the DataX source code:
$ git clone git@github.com:alibaba/DataX.git
2.2.2 Package it with Maven:
$ cd {DataX_source_code_home}
$ mvn -U clean package assembly:assembly -Dmaven.test.skip=true
2.2.3 When packaging succeeds, the log ends like this:
[INFO] BUILD SUCCESS
[INFO] -----------------------------------------------------------------
[INFO] Total time: 08:12 min
[INFO] Finished at: 2015-12-13T16:26:48+08:00
[INFO] Final Memory: 133M/960M
[INFO] -----------------------------------------------------------------
2.2.4 After a successful build, the directory structure looks like this:
$ cd {DataX_source_code_home}
$ ls ./target/datax/datax/
bin  conf  job  lib  log  log_perf  plugin  script  tmp
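The directory listing above can also be checked from a script. The following sketch reports which expected subdirectories are absent under `target/datax/datax`. The choice of which directories to treat as required is an assumption made for illustration, and the function name is hypothetical.

```python
import os

# Subset of the directories shown in the listing; treating these as
# required is an assumption, not something DataX documents.
EXPECTED_DIRS = ("bin", "conf", "job", "lib", "plugin")


def missing_build_dirs(source_home, expected=EXPECTED_DIRS):
    """Return the expected subdirectories missing from the build output."""
    root = os.path.join(source_home, "target", "datax", "datax")
    return [name for name in expected
            if not os.path.isdir(os.path.join(root, name))]
```

An empty list means every expected directory is present.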
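3. DataX usage example
3.1 Create the job configuration file (JSON format)
To print a configuration template for a given reader and writer, run: python datax.py -r {YOUR_READER} -w {YOUR_WRITER}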
$ cd {YOUR_DATAX_HOME}/bin
$ python datax.py -r streamreader -w streamwriter
DataX (UNKNOWN_DATAX_VERSION), From Alibaba !
Copyright (C) 2010-2015, Alibaba Group. All Rights Reserved.
Please refer to the streamreader document:
    https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md
Please refer to the streamwriter document:
    https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md

Please save the following configuration as a json file and use
    python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [],
                        "sliceRecordCount": ""
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
Example, with the template filled in:
#stream2stream.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": 10,
                        "column": [
                            {
                                "type": "long",
                                "value": "10"
                            },
                            {
                                "type": "string",
                                "value": "hello,你好,世界-DataX"
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "UTF-8",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 5
            }
        }
    }
}
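If you would rather generate this file than edit it by hand, a short Python script can build the same structure and write it out. This is a sketch: the helper name `make_stream_job` is hypothetical, and it covers only the fields that appear in the example above.

```python
import json


def make_stream_job(record_count, columns, channel):
    """Build a streamreader-to-streamwriter job with the fields used in the example."""
    return {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "streamreader",
                        "parameter": {
                            "sliceRecordCount": record_count,
                            "column": columns,
                        },
                    },
                    "writer": {
                        "name": "streamwriter",
                        "parameter": {"encoding": "UTF-8", "print": True},
                    },
                }
            ],
            "setting": {"speed": {"channel": channel}},
        }
    }


if __name__ == "__main__":
    columns = [
        {"type": "long", "value": "10"},
        {"type": "string", "value": "hello,你好,世界-DataX"},
    ]
    job = make_stream_job(10, columns, 5)
    # ensure_ascii=False keeps the Chinese characters readable in the file.
    with open("stream2stream.json", "w", encoding="utf-8") as handle:
        json.dump(job, handle, ensure_ascii=False, indent=4)
```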
3.2 Start DataX
$ cd {YOUR_DATAX_DIR_BIN}
$ python datax.py ./stream2stream.json
For more on the steps above, see: https://github.com/alibaba/DataX/blob/master/userGuid.md
4. Plugin details (configuration parameters)
See the project page: https://github.com/alibaba/DataX
Please credit the source when reposting.