datax的安装与使用

1、官网下载地址:https://github.com/alibaba/DataX

  DataX 是阿里巴巴集团内被广泛使用的离线数据同步工具/平台,实现包括 MySQL、Oracle、SqlServer、Postgre、HDFS、Hive、ADS、HBase、TableStore(OTS)、MaxCompute(ODPS)、DRDS 等各种异构数据源之间高效的数据同步功能。

  DataX本身作为数据同步框架,将不同数据源的同步抽象为从源头数据源读取数据的Reader插件,以及向目标端写入数据的Writer插件,理论上DataX框架可以支持任意数据源类型的数据同步工作。同时DataX插件体系作为一套生态系统, 每接入一套新数据源该新加入的数据源即可实现和现有的数据源互通。

2、在页面中【Quick Start】===》【Download DataX下载地址】进行下载。下载后的包名:datax.tar.gz。这个压缩包还是有点大的,我下载的有790.95MB大小。

3、DataX目前已经有了比较全面的插件体系,主流的RDBMS数据库、NOSQL、大数据计算系统都已经接入,目前支持数据如下图:

类型数据源Reader(读)Writer(写)文档
RDBMS 关系型数据库 MySQL
            Oracle         √         √    
  SQLServer
  PostgreSQL
  DRDS
  通用RDBMS(支持所有关系型数据库)
阿里云数仓数据存储 ODPS
  ADS  
  OSS
  OCS
NoSQL数据存储 OTS
  Hbase0.94
  Hbase1.1
  Phoenix4.x
  Phoenix5.x
  MongoDB
  Hive
  Cassandra
无结构化数据存储 TxtFile
  FTP
  HDFS
  Elasticsearch  
时间序列数据库 OpenTSDB  
  TSDB

4、将下载好的datax.tar.gz包上传到服务器,并进行解压缩操作,如下所示:

官网安装地址:https://github.com/alibaba/DataX/blob/master/userGuid.md

 1 [root@slaver1 package]# ll datax.tar.gz 
 2 -rw-r--r--. 1 root root 829372407 7月  17 14:50 datax.tar.gz
 3 [root@slaver1 package]# tar -zxvf datax.tar.gz -C  /home/hadoop/soft/
 4 
 5 [root@slaver1 package]# cd ../soft/
 6 [root@slaver1 soft]# ls
 7 cerebro-0.7.2  datax  elasticsearch-6.7.0  filebeat-6.7.0-linux-x86_64  kibana-6.7.0-linux-x86_64  logstash-6.7.0
 8 [root@slaver1 soft]# cd datax/
 9 [root@slaver1 datax]# ls
10 bin  conf  job  lib  plugin  script  tmp
11 [root@slaver1 datax]# ll
12 总用量 4
13 drwxr-xr-x. 2 62265 users   59 10月 12 2019 bin
14 drwxr-xr-x. 2 62265 users   68 10月 12 2019 conf
15 drwxr-xr-x. 2 62265 users   22 10月 12 2019 job
16 drwxr-xr-x. 2 62265 users 4096 10月 12 2019 lib
17 drwxr-xr-x. 4 62265 users   34 10月 12 2019 plugin
18 drwxr-xr-x. 2 62265 users   23 10月 12 2019 script
19 drwxr-xr-x. 2 62265 users   24 10月 12 2019 tmp
20 [root@slaver1 datax]# 

5、将下载后的datax.tar.gz压缩包直接解压后就可以使用了,但是前提是要安装好java、python的环境。

1)、由于此服务器之前已经安装过Jdk1.8了,所以这里就忽略了,jdk通常需要自己安装的哦,如下所示:

1 [root@slaver1 datax]# java -version
2 openjdk version "1.8.0_232"
3 OpenJDK Runtime Environment (build 1.8.0_232-b09)
4 OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
5 [root@slaver1 datax]# 

2)、Python(推荐Python2.7.X)一定要为Python2,因为后面执行datax.py的时候,里面的Python的print会执行不了,导致运行不成功,会提示你print语法要加括号,Python2中加不加都行,Python3中必须要加,否则报语法错。python版本查看(通常系统自带2.x版本)。

1 [root@slaver1 datax]# python -V
2 Python 2.7.5
3 [root@slaver1 datax]# 

3)、Apache Maven 3.x (Compile DataX),通常需要自己安装的哦。

下载Maven安装包,然后进行解压缩操作,可以自己改下名称方便操作。

 1 [root@slaver1 package]# wget http://mirrors.tuna.tsinghua.edu.cn/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
 2 --2020-07-17 15:10:29--  http://mirrors.tuna.tsinghua.edu.cn/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
 3 正在解析主机 mirrors.tuna.tsinghua.edu.cn (mirrors.tuna.tsinghua.edu.cn)... 101.6.8.193, 2402:f000:1:408:8100::1
 4 正在连接 mirrors.tuna.tsinghua.edu.cn (mirrors.tuna.tsinghua.edu.cn)|101.6.8.193|:80... 已连接。
 5 已发出 HTTP 请求,正在等待回应... 200 OK
 6 长度:8491533 (8.1M) [application/x-gzip]
 7 正在保存至: “apache-maven-3.3.9-bin.tar.gz”
 8 
 9 100%[======================================================================================================================================================================================>] 8,491,533   6.32MB/s 用时 1.3s   
10 
11 2020-07-17 15:10:31 (6.32 MB/s) - 已保存 “apache-maven-3.3.9-bin.tar.gz” [8491533/8491533])
12 
13 [root@slaver1 package]# tar -zxvf apache-maven-3.3.9-bin.tar.gz -C /home/hadoop/soft/

配置Maven的环境变量[root@slaver1 soft]# vim /etc/profile,如下所示:

1 M2_HOME=/home/hadoop/soft/apache-maven-3.3.9
2 export PATH=${M2_HOME}/bin:${PATH}

重新加载此配置文件,如下所示:

1 [root@slaver1 soft]# source /etc/profile
2 [root@slaver1 soft]# 

最后检查Maven安装是否成功,如下所示:

 1 [root@slaver1 soft]# mvn -v
 2 which: no javac in (/home/hadoop/soft/apache-maven-3.3.9/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin)
 3 Warning: JAVA_HOME environment variable is not set.
 4 Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-11T00:41:47+08:00)
 5 Maven home: /home/hadoop/soft/apache-maven-3.3.9
 6 Java version: 1.8.0_232, vendor: Oracle Corporation
 7 Java home: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.232.b09-0.el7_7.x86_64/jre
 8 Default locale: zh_CN, platform encoding: UTF-8
 9 OS name: "linux", version: "3.10.0-957.el7.x86_64", arch: "amd64", family: "unix"
10 [root@slaver1 soft]# 

6、Jdk、Python、Maven都安装成功了,datax解压缩成功了,开始自检,进入bin目录,开始自检,如下所示:

  1 [root@slaver1 bin]# python datax.py ../job/job.json 
  2 
  3 DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
  4 Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
  5 
  6 
  7 2020-07-17 15:33:06.296 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
  8 2020-07-17 15:33:06.322 [main] INFO  Engine - the machine info  => 
  9 
 10     osInfo:    Oracle Corporation 1.8 25.232-b09
 11     jvmInfo:    Linux amd64 3.10.0-957.el7.x86_64
 12     cpu num:    2
 13 
 14     totalPhysicalMemory:    -0.00G
 15     freePhysicalMemory:    -0.00G
 16     maxFileDescriptorCount:    -1
 17     currentOpenFileDescriptorCount:    -1
 18 
 19     GC Names    [PS MarkSweep, PS Scavenge]
 20 
 21     MEMORY_NAME                    | allocation_size                | init_size                      
 22     PS Eden Space                  | 256.00MB                       | 256.00MB                       
 23     Code Cache                     | 240.00MB                       | 2.44MB                         
 24     Compressed Class Space         | 1,024.00MB                     | 0.00MB                         
 25     PS Survivor Space              | 42.50MB                        | 42.50MB                        
 26     PS Old Gen                     | 683.00MB                       | 683.00MB                       
 27     Metaspace                      | -0.00MB                        | 0.00MB                         
 28 
 29 
 30 2020-07-17 15:33:06.379 [main] INFO  Engine - 
 31 {
 32     "content":[
 33         {
 34             "reader":{
 35                 "name":"streamreader",
 36                 "parameter":{
 37                     "column":[
 38                         {
 39                             "type":"string",
 40                             "value":"DataX"
 41                         },
 42                         {
 43                             "type":"long",
 44                             "value":19890604
 45                         },
 46                         {
 47                             "type":"date",
 48                             "value":"1989-06-04 00:00:00"
 49                         },
 50                         {
 51                             "type":"bool",
 52                             "value":true
 53                         },
 54                         {
 55                             "type":"bytes",
 56                             "value":"test"
 57                         }
 58                     ],
 59                     "sliceRecordCount":100000
 60                 }
 61             },
 62             "writer":{
 63                 "name":"streamwriter",
 64                 "parameter":{
 65                     "encoding":"UTF-8",
 66                     "print":false
 67                 }
 68             }
 69         }
 70     ],
 71     "setting":{
 72         "errorLimit":{
 73             "percentage":0.02,
 74             "record":0
 75         },
 76         "speed":{
 77             "byte":10485760
 78         }
 79     }
 80 }
 81 
 82 2020-07-17 15:33:06.475 [main] WARN  Engine - prioriy set to 0, because NumberFormatException, the value is: null
 83 2020-07-17 15:33:06.480 [main] INFO  PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
 84 2020-07-17 15:33:06.481 [main] INFO  JobContainer - DataX jobContainer starts job.
 85 2020-07-17 15:33:06.495 [main] INFO  JobContainer - Set jobId = 0
 86 2020-07-17 15:33:06.591 [job-0] INFO  JobContainer - jobContainer starts to do prepare ...
 87 2020-07-17 15:33:06.593 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] do prepare work .
 88 2020-07-17 15:33:06.594 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] do prepare work .
 89 2020-07-17 15:33:06.594 [job-0] INFO  JobContainer - jobContainer starts to do split ...
 90 2020-07-17 15:33:06.598 [job-0] INFO  JobContainer - Job set Max-Byte-Speed to 10485760 bytes.
 91 2020-07-17 15:33:06.604 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] splits to [1] tasks.
 92 2020-07-17 15:33:06.606 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] splits to [1] tasks.
 93 2020-07-17 15:33:06.678 [job-0] INFO  JobContainer - jobContainer starts to do schedule ...
 94 2020-07-17 15:33:06.709 [job-0] INFO  JobContainer - Scheduler starts [1] taskGroups.
 95 2020-07-17 15:33:06.720 [job-0] INFO  JobContainer - Running by standalone Mode.
 96 2020-07-17 15:33:06.806 [taskGroup-0] INFO  TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
 97 2020-07-17 15:33:06.830 [taskGroup-0] INFO  Channel - Channel set byte_speed_limit to -1, No bps activated.
 98 2020-07-17 15:33:06.831 [taskGroup-0] INFO  Channel - Channel set record_speed_limit to -1, No tps activated.
 99 2020-07-17 15:33:06.882 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
100 2020-07-17 15:33:07.087 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[215]ms
101 2020-07-17 15:33:07.088 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed it's tasks.
102 2020-07-17 15:33:16.841 [job-0] INFO  StandAloneJobContainerCommunicator - Total 100000 records, 2600000 bytes | Speed 253.91KB/s, 10000 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.148s |  All Task WaitReaderTime 0.161s | Percentage 100.00%
103 2020-07-17 15:33:16.842 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
104 2020-07-17 15:33:16.843 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] do post work.
105 2020-07-17 15:33:16.844 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] do post work.
106 2020-07-17 15:33:16.845 [job-0] INFO  JobContainer - DataX jobId [0] completed successfully.
107 2020-07-17 15:33:16.846 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: /home/hadoop/soft/datax/hook
108 2020-07-17 15:33:16.851 [job-0] INFO  JobContainer - 
109      [total cpu info] => 
110         averageCpu                     | maxDeltaCpu                    | minDeltaCpu                    
111         -1.00%                         | -1.00%                         | -1.00%
112                         
113 
114      [total gc info] => 
115          NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime     
116          PS MarkSweep         | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             
117          PS Scavenge          | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             
118 
119 2020-07-17 15:33:16.852 [job-0] INFO  JobContainer - PerfTrace not enable!
120 2020-07-17 15:33:16.854 [job-0] INFO  StandAloneJobContainerCommunicator - Total 100000 records, 2600000 bytes | Speed 253.91KB/s, 10000 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.148s |  All Task WaitReaderTime 0.161s | Percentage 100.00%
121 2020-07-17 15:33:16.861 [job-0] INFO  JobContainer - 
122 任务启动时刻                    : 2020-07-17 15:33:06
123 任务结束时刻                    : 2020-07-17 15:33:16
124 任务总计耗时                    :                 10s
125 任务平均流量                    :          253.91KB/s
126 记录写入速度                    :          10000rec/s
127 读出记录总数                    :              100000
128 读写失败总数                    :                   0
129 
130 [root@slaver1 bin]# 

7、如何实现mysql数据库的数据表向mysql的数据库的数据表传输数据呢,如下所示:

 1 [root@slaver1 job]# cat mysql2mysql.json 
 2 {
 3     "job": {
 4         "setting": {
 5              "speed": {
 6                 "byte":1048576,
 7                 "channel":"4"
 8             }
 9         },
10         "content": [
11             {
12                 "reader": {
13                     "name": "mysqlreader",
14                     "parameter": {
15                         "username": "root",
16                         "password": "123456",
17                         "connection": [
18                             {
19                                 "querySql": [
20                                     "SELECT id,table_name,part,source,start_time,next_time,target,start_batch,next_batch FROM data_exchange_table_time "
21                                 ],
22                                 "jdbcUrl": ["jdbc:mysql://192.168.110.133:3306/book"]
23                             }
24                         ]
25                     }
26                 },
27                 "writer": {
28                     "name": "mysqlwriter",
29                     "parameter": {
30                         "writeMode": "insert",
31                         "column": [
32                             "id",
33                             "table_name",
34                             "part",
35                             "source",
36                             "start_time",
37                             "next_time",
38                             "target",
39                             "start_batch",
40                             "next_batch"
41                         ],
42                         "connection": [
43                             {
44                                 
45                                 "jdbcUrl": "jdbc:mysql://192.168.110.133:3306/biehl",
46                                 "table": ["data_exchange_table_time"]
47                             }
48                         ],
49                         "username": "root",
50                         "password": "123456",
51                         "postSql": [],
52                         "preSql": [],
53                     }
54                 }
55             }
56         ],
57     }
58 }
59 [root@slaver1 job]# 

执行命令,切记首先数据源数据表必须存在,然后目标数据库的数据表结构一定要提前创建好,如下所示:

  1 [root@slaver1 job]# python /home/hadoop/soft/datax/bin/datax.py /home/hadoop/soft/datax/job/mysql2mysql.json 
  2 
  3 DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
  4 Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
  5 
  6 
  7 2020-07-17 17:16:50.886 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
  8 2020-07-17 17:16:50.903 [main] INFO  Engine - the machine info  => 
  9 
 10     osInfo:    Oracle Corporation 1.8 25.232-b09
 11     jvmInfo:    Linux amd64 3.10.0-957.el7.x86_64
 12     cpu num:    2
 13 
 14     totalPhysicalMemory:    -0.00G
 15     freePhysicalMemory:    -0.00G
 16     maxFileDescriptorCount:    -1
 17     currentOpenFileDescriptorCount:    -1
 18 
 19     GC Names    [PS MarkSweep, PS Scavenge]
 20 
 21     MEMORY_NAME                    | allocation_size                | init_size                      
 22     PS Eden Space                  | 256.00MB                       | 256.00MB                       
 23     Code Cache                     | 240.00MB                       | 2.44MB                         
 24     Compressed Class Space         | 1,024.00MB                     | 0.00MB                         
 25     PS Survivor Space              | 42.50MB                        | 42.50MB                        
 26     PS Old Gen                     | 683.00MB                       | 683.00MB                       
 27     Metaspace                      | -0.00MB                        | 0.00MB                         
 28 
 29 
 30 2020-07-17 17:16:50.954 [main] INFO  Engine - 
 31 {
 32     "content":[
 33         {
 34             "reader":{
 35                 "name":"mysqlreader",
 36                 "parameter":{
 37                     "connection":[
 38                         {
 39                             "jdbcUrl":[
 40                                 "jdbc:mysql://192.168.110.133:3306/book"
 41                             ],
 42                             "querySql":[
 43                                 "SELECT id,table_name,part,source,start_time,next_time,target,start_batch,next_batch FROM data_exchange_table_time "
 44                             ]
 45                         }
 46                     ],
 47                     "password":"******",
 48                     "username":"root"
 49                 }
 50             },
 51             "writer":{
 52                 "name":"mysqlwriter",
 53                 "parameter":{
 54                     "column":[
 55                         "id",
 56                         "table_name",
 57                         "part",
 58                         "source",
 59                         "start_time",
 60                         "next_time",
 61                         "target",
 62                         "start_batch",
 63                         "next_batch"
 64                     ],
 65                     "connection":[
 66                         {
 67                             "jdbcUrl":"jdbc:mysql://192.168.110.133:3306/biehl",
 68                             "table":[
 69                                 "data_exchange_table_time"
 70                             ]
 71                         }
 72                     ],
 73                     "password":"******",
 74                     "postSql":[],
 75                     "preSql":[],
 76                     "username":"root",
 77                     "writeMode":"insert"
 78                 }
 79             }
 80         }
 81     ],
 82     "setting":{
 83         "speed":{
 84             "byte":1048576,
 85             "channel":"4"
 86         }
 87     }
 88 }
 89 
 90 2020-07-17 17:16:51.011 [main] WARN  Engine - prioriy set to 0, because NumberFormatException, the value is: null
 91 2020-07-17 17:16:51.014 [main] INFO  PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
 92 2020-07-17 17:16:51.015 [main] INFO  JobContainer - DataX jobContainer starts job.
 93 2020-07-17 17:16:51.018 [main] INFO  JobContainer - Set jobId = 0
 94 2020-07-17 17:16:51.678 [job-0] INFO  OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:mysql://192.168.110.133:3306/book?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true.
 95 2020-07-17 17:16:52.118 [job-0] INFO  OriginalConfPretreatmentUtil - table:[data_exchange_table_time] all columns:[
 96 id,table_name,part,source,start_time,next_time,target,start_batch,next_batch
 97 ].
 98 2020-07-17 17:16:52.142 [job-0] INFO  OriginalConfPretreatmentUtil - Write data [
 99 insert INTO %s (id,table_name,part,source,start_time,next_time,target,start_batch,next_batch) VALUES(?,?,?,?,?,?,?,?,?)
100 ], which jdbcUrl like:[jdbc:mysql://192.168.110.133:3306/biehl?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true]
101 2020-07-17 17:16:52.144 [job-0] INFO  JobContainer - jobContainer starts to do prepare ...
102 2020-07-17 17:16:52.146 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] do prepare work .
103 2020-07-17 17:16:52.150 [job-0] INFO  JobContainer - DataX Writer.Job [mysqlwriter] do prepare work .
104 2020-07-17 17:16:52.152 [job-0] INFO  JobContainer - jobContainer starts to do split ...
105 2020-07-17 17:16:52.157 [job-0] INFO  JobContainer - Job set Max-Byte-Speed to 1048576 bytes.
106 2020-07-17 17:16:52.169 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] splits to [1] tasks.
107 2020-07-17 17:16:52.171 [job-0] INFO  JobContainer - DataX Writer.Job [mysqlwriter] splits to [1] tasks.
108 2020-07-17 17:16:52.221 [job-0] INFO  JobContainer - jobContainer starts to do schedule ...
109 2020-07-17 17:16:52.232 [job-0] INFO  JobContainer - Scheduler starts [1] taskGroups.
110 2020-07-17 17:16:52.238 [job-0] INFO  JobContainer - Running by standalone Mode.
111 2020-07-17 17:16:52.270 [taskGroup-0] INFO  TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
112 2020-07-17 17:16:52.294 [taskGroup-0] INFO  Channel - Channel set byte_speed_limit to -1, No bps activated.
113 2020-07-17 17:16:52.297 [taskGroup-0] INFO  Channel - Channel set record_speed_limit to -1, No tps activated.
114 2020-07-17 17:16:52.332 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
115 2020-07-17 17:16:52.340 [0-0-0-reader] INFO  CommonRdbmsReader$Task - Begin to read record by Sql: [SELECT id,table_name,part,source,start_time,next_time,target,start_batch,next_batch FROM data_exchange_table_time 
116 ] jdbcUrl:[jdbc:mysql://192.168.110.133:3306/book?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
117 2020-07-17 17:16:56.814 [0-0-0-reader] INFO  CommonRdbmsReader$Task - Finished read record by Sql: [SELECT id,table_name,part,source,start_time,next_time,target,start_batch,next_batch FROM data_exchange_table_time 
118 ] jdbcUrl:[jdbc:mysql://192.168.110.133:3306/book?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
119 2020-07-17 17:16:56.971 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[4651]ms
120 2020-07-17 17:16:56.972 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed it's tasks.
121 2020-07-17 17:17:02.330 [job-0] INFO  StandAloneJobContainerCommunicator - Total 61086 records, 4112902 bytes | Speed 401.65KB/s, 6108 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 3.495s |  All Task WaitReaderTime 0.792s | Percentage 100.00%
122 2020-07-17 17:17:02.332 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
123 2020-07-17 17:17:02.333 [job-0] INFO  JobContainer - DataX Writer.Job [mysqlwriter] do post work.
124 2020-07-17 17:17:02.343 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] do post work.
125 2020-07-17 17:17:02.345 [job-0] INFO  JobContainer - DataX jobId [0] completed successfully.
126 2020-07-17 17:17:02.348 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: /home/hadoop/soft/datax/hook
127 2020-07-17 17:17:02.379 [job-0] INFO  JobContainer - 
128      [total cpu info] => 
129         averageCpu                     | maxDeltaCpu                    | minDeltaCpu                    
130         -1.00%                         | -1.00%                         | -1.00%
131                         
132 
133      [total gc info] => 
134          NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime     
135          PS MarkSweep         | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             
136          PS Scavenge          | 1                  | 1                  | 1                  | 0.044s             | 0.044s             | 0.044s             
137 
138 2020-07-17 17:17:02.379 [job-0] INFO  JobContainer - PerfTrace not enable!
139 2020-07-17 17:17:02.408 [job-0] INFO  StandAloneJobContainerCommunicator - Total 61086 records, 4112902 bytes | Speed 401.65KB/s, 6108 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 3.495s |  All Task WaitReaderTime 0.792s | Percentage 100.00%
140 2020-07-17 17:17:02.454 [job-0] INFO  JobContainer - 
141 任务启动时刻                    : 2020-07-17 17:16:51
142 任务结束时刻                    : 2020-07-17 17:17:02
143 任务总计耗时                    :                 11s
144 任务平均流量                    :          401.65KB/s
145 记录写入速度                    :           6108rec/s
146 读出记录总数                    :               61086
147 读写失败总数                    :                   0
148 
149 [root@slaver1 job]# 

2020-08-1913:08:53 

使用datax将mysql的数据导入到mysql显然是很简单的,但是当我使用datax将postgresql的数据导入到postgresql的时候,发生了一系列的问题,这里进行了简单的记录。

错误一:如果报ip、账号、密码、端口错误,很大可能是你写错了或者写错位置了(我的就是将密码写到账号位置了,将账号写到密码位置了)。

错误二:如果报某个字段不存在,很大概率是这样的,官方案例是小写的,但是postgresql大小写敏感的,如果字段是大写的,需要进行转义,不论是表输入还是表输出都需要进行转义的。

错误三:如果是报某个表不存在,此时需要将数据库的schema写上,写到数据表的前面,因为postgresql多了一个schema的概念。

 1 {
 2     "job": {
 3         "setting": {
 4              "speed": {
 5                 "byte":1048576,
 6                 "channel":"4"
 7             }
 8         },
 9         "content": [
10             {
11                 "reader": {
12                     "name": "postgresqlreader",
13                     "parameter": {
14                         "username": "账号",
15                         "password": "密码",
16                         "where": "",
17                         "connection": [
18                             {
19                                 "querySql": [
20                                     "select \"ID\",\"Name\",\"Age\" from public.person where \"ID\" > 1;"
21                                 ],
22                                 "jdbcUrl": [
23                                     "jdbc:postgresql://192.168.109.106:5432/postgres"
24                                 ]
25                             }
26                         ]
27                     }
28                 },
29                 "writer": {
30                     "name": "postgresqlwriter",
31                     "parameter": {
32                         "print": true,
33                         "encoding": "UTF-8",
34                         "username": "账号",
35                         "password": "密码",
36                         "column": [
37                             "\"ID\"",
38                             "\"Name\"",
39                             "\"Age\""
40                     ],
41                     "connection": [
42                         {
43                             "jdbcUrl": "jdbc:postgresql://192.168.109.106:5432/postgres",
44                             "table": ["public.person_copy"]
45                         }
46                     ]
47                     }
48                 }
49             }
50         ]
51     }
52 }

注意事项:1)、reader > name的值是postgresqlreader,writer > name的值是postgresqlwriter。

2)、reader/writer > parameter > username和password的值不要写反了。

3)、reader > parameter > connection的querySql的数据表字段如果是大写的需要进行转义的。writer > parameter > column的数据表字段如果是大写的需要进行转义的。

4)、reader > parameter > connection的jdbcUrl的jdbc:postgresql://127.0.0.1:5432/postgres后面这个是数据库的名称哦,如果数据库的名称和schema的名称不一致记得区分,别搞混了。

5)、reader > parameter > connection的querySql里面写的sql语句,数据表前面跟的是schema的名称,别搞混了。writer > connection > table里面的数据表前面跟的是schema的名称,别搞混了。

posted on 2020-07-17 20:47  别先生  阅读(11625)  评论(1编辑  收藏  举报