Using DataX

What problem does it solve?

What is DataX?

How is DataX used?

What do DataX configuration files look like?

 

1. Introduction to DataX

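DataX is an offline data synchronization tool/platform widely used within Alibaba Group. It provides efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS.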

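What can DataX do? For example, it can import a table from a MySQL database into an Oracle or HBase database. It can also transform data in transit: by writing a transformer plugin you can implement functions such as string masking.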
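
To make the masking idea concrete, here is a minimal Python sketch of what a string-masking transformation does to a single field value. It only illustrates the concept: the function `mask_string` and its parameters are hypothetical, and this is not DataX's plugin API (DataX itself is a Java project, and its plugin interfaces are not shown here).

```python
def mask_string(value, keep_head=3, keep_tail=4, mask_char="*"):
    """Hide the middle of a string, keeping a few leading and trailing characters.

    Hypothetical helper for illustration only; not part of DataX.
    """
    if value is None:
        return None
    # Strings too short to keep both ends are masked completely.
    if len(value) <= keep_head + keep_tail:
        return mask_char * len(value)
    hidden = len(value) - keep_head - keep_tail
    # Slice from an explicit index so that keep_tail=0 yields an empty tail.
    return value[:keep_head] + mask_char * hidden + value[len(value) - keep_tail:]
```

With the default settings, `mask_string("13812345678")` returns `"138****5678"`.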

For details, see: https://github.com/alibaba/DataX

 

2. Using DataX

Runtime environment:

2.1 Direct installation (installation method 1)

Download the DataX toolkit (DataX download link), extract it to a local directory, then enter the bin directory to run it:

$ cd {YOUR_DATAX_HOME}/bin
$ python datax.py {YOUR_JOB.json}

YOUR_DATAX_HOME: the directory where DataX was extracted.
YOUR_JOB.json: the configuration file that DataX needs to run the job.
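
If the job needs to be launched from a script rather than typed by hand, the two commands above can be wrapped roughly as follows. This is a sketch under assumptions: the helper names `build_datax_command` and `run_datax` are made up for illustration, and which Python interpreter `datax.py` requires depends on the DataX release, so the interpreter is left as a parameter.

```python
import os
import subprocess
import sys


def build_datax_command(datax_home, job_file, python_exe=None):
    """Return the argument list equivalent to `python datax.py {YOUR_JOB.json}`.

    Hypothetical helper; datax_home corresponds to {YOUR_DATAX_HOME}.
    """
    launcher = os.path.join(datax_home, "bin", "datax.py")
    # Fall back to the interpreter running this script if none is given.
    return [python_exe or sys.executable, launcher, job_file]


def run_datax(datax_home, job_file, python_exe=None):
    """Run the job and return the exit code of the datax.py process."""
    return subprocess.call(build_datax_command(datax_home, job_file, python_exe))
```

Passing the arguments as a list avoids shell quoting problems when the DataX directory or the job file path contains spaces.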


2.2 Building from source (installation method 2)

2.2.1 Download the DataX source code:

$ git clone git@github.com:alibaba/DataX.git

2.2.2 Package with Maven:

$ cd  {DataX_source_code_home}
$ mvn -U clean package assembly:assembly -Dmaven.test.skip=true

2.2.3 When packaging succeeds, the log ends like this:

[INFO] BUILD SUCCESS
[INFO] -----------------------------------------------------------------
[INFO] Total time: 08:12 min
[INFO] Finished at: 2015-12-13T16:26:48+08:00
[INFO] Final Memory: 133M/960M
[INFO] -----------------------------------------------------------------

2.2.4 After a successful build, the output directory looks like this:

$ cd  {DataX_source_code_home}
$ ls ./target/datax/datax/
bin   conf   job   lib   log   log_perf   plugin   script   tmp
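
A quick way to confirm that a build produced this layout is to compare the output directory against the names in the listing. The sketch below assumes the sub-directory names are exactly those shown above; other DataX versions may differ, and the helper `missing_datax_dirs` is hypothetical.

```python
import os

# Sub-directory names copied from the `ls` output above (an assumption for other versions).
EXPECTED_DIRS = ["bin", "conf", "job", "lib", "log", "log_perf", "plugin", "script", "tmp"]


def missing_datax_dirs(datax_dir, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that are absent from datax_dir."""
    return [name for name in expected
            if not os.path.isdir(os.path.join(datax_dir, name))]
```

An empty result for `missing_datax_dirs("./target/datax/datax")` means every directory from the listing is present.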

 

3. DataX usage example

3.1 Create the job configuration file (JSON format)

To print a configuration template for a given reader/writer pair, run: python datax.py -r {YOUR_READER} -w {YOUR_WRITER}

$ cd  {YOUR_DATAX_HOME}/bin
$ python datax.py -r streamreader -w streamwriter
DataX (UNKNOWN_DATAX_VERSION), From Alibaba !
Copyright (C) 2010-2015, Alibaba Group. All Rights Reserved.
Please refer to the streamreader document:
    https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md 

Please refer to the streamwriter document:
     https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md 
 
Please save the following configuration as a json file and  use
     python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json 
to run the job.

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader", 
                    "parameter": {
                        "column": [], 
                        "sliceRecordCount": ""
                    }
                }, 
                "writer": {
                    "name": "streamwriter", 
                    "parameter": {
                        "encoding": "", 
                        "print": true
                    }
                }
            }
        ], 
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}

Example:

#stream2stream.json
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
            "sliceRecordCount": 10,
            "column": [
              {
                "type": "long",
                "value": "10"
              },
              {
                "type": "string",
                "value": "hello,你好,世界-DataX"
              }
            ]
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {
            "encoding": "UTF-8",
            "print": true
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": 5
      }
    }
  }
}
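
When many similar jobs are needed, the configuration can be generated instead of written by hand. The following sketch builds a dictionary with the same shape as the example above and serializes it to JSON. The function names `make_stream_job` and `dump_job` are hypothetical, and only the keys that appear in the example are used.

```python
import json


def make_stream_job(record_count, columns, channel=1, encoding="UTF-8"):
    """Build a streamreader -> streamwriter job shaped like the example above."""
    return {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "streamreader",
                        "parameter": {
                            "sliceRecordCount": record_count,
                            "column": columns,
                        },
                    },
                    "writer": {
                        "name": "streamwriter",
                        "parameter": {"encoding": encoding, "print": True},
                    },
                }
            ],
            "setting": {"speed": {"channel": channel}},
        }
    }


def dump_job(job):
    """Serialize a job to JSON text; ensure_ascii=False keeps non-ASCII values readable."""
    return json.dumps(job, ensure_ascii=False, indent=2)
```

Calling `make_stream_job(10, columns, channel=5)` with the two column definitions from the example and writing the result of `dump_job` to a UTF-8 file should give a configuration equivalent to stream2stream.json.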

3.2 Start DataX

$ cd {YOUR_DATAX_DIR_BIN}
$ python datax.py ./stream2stream.json
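
Before launching, it can save time to check that the job file parses as JSON and has the reader/writer structure used in this post. The sketch below is an assumption-laden pre-flight check, not DataX's own validation: `check_job_config` is a hypothetical helper, and it only looks at the keys that appear in the example configuration.

```python
import json


def check_job_config(text):
    """Return a list of problems found in a job JSON string (empty if none found)."""
    try:
        config = json.loads(text)
    except ValueError as exc:
        # json.JSONDecodeError is a subclass of ValueError.
        return ["not valid JSON: %s" % exc]
    job = config.get("job") if isinstance(config, dict) else None
    if not isinstance(job, dict):
        return ["missing top-level 'job' object"]
    content = job.get("content")
    if not isinstance(content, list) or not content:
        return ["'job.content' must be a non-empty list"]
    problems = []
    for index, entry in enumerate(content):
        for side in ("reader", "writer"):
            plugin = entry.get(side) if isinstance(entry, dict) else None
            if not isinstance(plugin, dict) or not plugin.get("name"):
                problems.append("content[%d].%s.name is missing" % (index, side))
    return problems
```

Note that Python's `json` module follows standard JSON, which has no comment syntax, so a leading label line such as `#stream2stream.json` would be reported as invalid if it were saved into the file.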

 

For more detail on the steps above, see: https://github.com/alibaba/DataX/blob/master/userGuid.md

 

4. Plugin details (configuration parameters)

See the official repository: https://github.com/alibaba/DataX

 

Please credit the source when reposting.

 

posted @ 2019-03-04 20:30  mungerz