DataX Research
DataX
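DataX is an offline data synchronization tool/platform widely used within Alibaba Group. It provides efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, SqlServer, Postgre, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS.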
Features
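As a data synchronization framework, DataX abstracts synchronization between different data sources into Reader plugins, which read data from the source, and Writer plugins, which write data to the target. In principle, the framework can therefore support synchronization for any type of data source. The plugin system also works as an ecosystem: each newly integrated data source can immediately exchange data with every data source that is already supported.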
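To make the Reader/Writer abstraction concrete, here is a minimal sketch of the idea in Python. It is not DataX's actual plugin API (DataX plugins are written in Java); the `Reader`, `Writer`, `ListReader`, `ListWriter`, and `sync` names are all hypothetical and exist only for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator, List

# One row of data, keyed by column name (illustrative type, not a DataX class).
Record = Dict[str, object]


class Reader(ABC):
    """Source-side plugin: knows how to pull records out of one data source."""

    @abstractmethod
    def read(self) -> Iterator[Record]:
        ...


class Writer(ABC):
    """Target-side plugin: knows how to push records into one data source."""

    @abstractmethod
    def write(self, records: Iterable[Record]) -> int:
        ...


class ListReader(Reader):
    """Toy reader that serves rows from an in-memory list."""

    def __init__(self, rows: List[Record]) -> None:
        self.rows = rows

    def read(self) -> Iterator[Record]:
        yield from self.rows


class ListWriter(Writer):
    """Toy writer that collects rows into an in-memory list."""

    def __init__(self) -> None:
        self.sink: List[Record] = []

    def write(self, records: Iterable[Record]) -> int:
        count = 0
        for record in records:
            self.sink.append(dict(record))
            count += 1
        return count


def sync(reader: Reader, writer: Writer) -> int:
    """The framework only sees the two interfaces, so any Reader pairs with any Writer."""
    return writer.write(reader.read())
```

Because `sync` depends only on the two interfaces, adding one new `Reader` or `Writer` implementation makes that data source usable with every existing plugin on the other side, which is the interoperability property described above.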
Support Data Channels
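DataX already has a fairly comprehensive plugin system: mainstream RDBMS databases, NoSQL stores, and big data computing systems have all been integrated. The currently supported data sources are listed in the table below; for details, see the DataX Data Source Reference Guide.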
| Type | Data source | Reader | Writer |
| --- | --- | --- | --- |
| RDBMS (relational databases) | MySQL | √ | √ |
| | Oracle | √ | √ |
| | SQLServer | √ | √ |
| | PostgreSQL | √ | √ |
| | DRDS | √ | √ |
| | Generic RDBMS (supports all relational databases) | √ | √ |
| Alibaba Cloud data warehouse storage | ODPS | √ | √ |
| | ADS | | √ |
| | OSS | √ | √ |
| | OCS | √ | √ |
| NoSQL data storage | OTS | √ | √ |
| | Hbase0.94 | √ | √ |
| | Hbase1.1 | √ | √ |
| | Phoenix4.x | √ | √ |
| | Phoenix5.x | √ | √ |
| | MongoDB | √ | √ |
| | Hive | √ | √ |
| | Cassandra | √ | √ |
| Unstructured data storage | TxtFile | √ | √ |
| | FTP | √ | √ |
| | HDFS | √ | √ |
| | Elasticsearch | | √ |
| Time series databases | OpenTSDB | √ | |
| | TSDB | √ | √ |
- Configure a job that syncs the result of a custom SQL query from MySQL to a local Elasticsearch instance:
{
    "job": {
        "setting": {
            "speed": {
                "channel": 32
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "Aa123456",
                        "connection": [
                            {
                                "querySql": [
                                    "select * from test"
                                ],
                                "jdbcUrl": [
                                    "jdbc:mysql://127.0.0.1:3306/test?characterEncoding=utf8&useSSL=false&serverTimezone=UTC&rewriteBatchedStatements=true"
                                ]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "elasticsearchwriter",
                    "parameter": {
                        "endpoint": "http://127.0.0.1:9200",
                        "index": "test-1",
                        "type": "_doc",
                        "cleanup": true,
                        "accessId": "root",
                        "accessKey": "root",
                        "settings": {"index": {"number_of_shards": 1, "number_of_replicas": 0}},
                        "discovery": false,
                        "batchSize": 1000,
                        "splitter": ",",
                        "column": [
                            {"name": "id", "type": "Integer"},
                            {"name": "name", "type": "keyword"}
                        ]
                    }
                }
            }
        ]
    }
}
Note: if Elasticsearch has no credentials configured, accessId and accessKey can be set to any value. The note is kept outside the JSON because JSON itself does not allow comments.
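Hand-editing a job file like this is error-prone (a stray comment or trailing comma makes it invalid JSON), so one option is to generate it. The sketch below builds the same mysqlreader to elasticsearchwriter job as a Python dict and serializes it; `build_job` is a hypothetical helper written for this note, not part of DataX, and it only reuses the parameter names that appear in the job above.

```python
import json
from typing import Dict, List


def build_job(username: str, password: str, jdbc_url: str, query_sql: str,
              endpoint: str, index: str, columns: List[Dict[str, str]],
              channel: int = 32) -> dict:
    """Return a job dict mirroring the mysqlreader -> elasticsearchwriter config above."""
    reader = {
        "name": "mysqlreader",
        "parameter": {
            "username": username,
            "password": password,
            "connection": [{"querySql": [query_sql], "jdbcUrl": [jdbc_url]}],
        },
    }
    writer = {
        "name": "elasticsearchwriter",
        "parameter": {
            "endpoint": endpoint,
            "index": index,
            "type": "_doc",
            "cleanup": True,
            # Placeholder credentials, same values as in the job above.
            "accessId": "root",
            "accessKey": "root",
            "settings": {"index": {"number_of_shards": 1, "number_of_replicas": 0}},
            "discovery": False,
            "batchSize": 1000,
            "splitter": ",",
            "column": columns,
        },
    }
    return {
        "job": {
            "setting": {"speed": {"channel": channel}},
            "content": [{"reader": reader, "writer": writer}],
        }
    }


job = build_job(
    username="root",
    password="Aa123456",
    jdbc_url="jdbc:mysql://127.0.0.1:3306/test?characterEncoding=utf8&useSSL=false",
    query_sql="select * from test",
    endpoint="http://127.0.0.1:9200",
    index="test-1",
    columns=[{"name": "id", "type": "Integer"}, {"name": "name", "type": "keyword"}],
)
# json.dumps guarantees syntactically valid JSON (no comments, no trailing commas).
job_json = json.dumps(job, indent=4)
```

The resulting string can be written to a file such as `job.json` (a name assumed here) and passed to the DataX launcher, typically `python {DATAX_HOME}/bin/datax.py job.json`.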
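One more pitfall: the prebuilt DataX release does not include elasticsearchwriter. You have to build it from source, take the ES plugin from the target directory, and upload it to the writer folder under the plugin directory of the prebuilt DataX installation. The root pom.xml used for the build follows; every plugin module except elasticsearchwriter is commented out.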
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.alibaba.datax</groupId>
    <artifactId>datax-all</artifactId>
    <version>0.0.1-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.hamcrest</groupId>
            <artifactId>hamcrest-core</artifactId>
            <version>1.3</version>
        </dependency>
    </dependencies>

    <name>datax-all</name>
    <packaging>pom</packaging>

    <properties>
        <jdk-version>1.8</jdk-version>
        <datax-project-version>0.0.1-SNAPSHOT</datax-project-version>
        <commons-lang3-version>3.3.2</commons-lang3-version>
        <commons-configuration-version>1.10</commons-configuration-version>
        <commons-cli-version>1.2</commons-cli-version>
        <fastjson-version>1.1.46.sec10</fastjson-version>
        <guava-version>16.0.1</guava-version>
        <diamond.version>3.7.2.1-SNAPSHOT</diamond.version>

        <!-- slf4j 1.7.10 and logback-classic 1.0.13 work well together -->
        <slf4j-api-version>1.7.10</slf4j-api-version>
        <logback-classic-version>1.0.13</logback-classic-version>
        <commons-io-version>2.4</commons-io-version>
        <junit-version>4.11</junit-version>
        <tddl.version>5.1.22-1</tddl.version>
        <swift-version>1.0.0</swift-version>

        <project-sourceEncoding>UTF-8</project-sourceEncoding>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <maven.compiler.encoding>UTF-8</maven.compiler.encoding>
    </properties>

    <modules>
        <module>common</module>
        <module>core</module>
        <module>transformer</module>

        <!-- reader -->
        <!-- <module>mysqlreader</module>-->
        <!-- <module>drdsreader</module>-->
        <!-- <module>sqlserverreader</module>-->
        <!-- <module>postgresqlreader</module>-->
        <!-- <module>oraclereader</module>-->
        <!-- <module>odpsreader</module>-->
        <!-- <module>otsreader</module>-->
        <!-- <module>otsstreamreader</module>-->
        <!-- <module>txtfilereader</module>-->
        <!-- <module>hdfsreader</module>-->
        <!-- <module>streamreader</module>-->
        <!-- <module>ossreader</module>-->
        <!-- <module>ftpreader</module>-->
        <!-- <module>mongodbreader</module>-->
        <!-- <module>rdbmsreader</module>-->
        <!-- <module>hbase11xreader</module>-->
        <!-- <module>hbase094xreader</module>-->
        <!-- <module>tsdbreader</module>-->
        <!-- <module>opentsdbreader</module>-->
        <!-- <module>cassandrareader</module>-->
        <!-- <module>gdbreader</module>-->

        <!-- writer -->
        <!-- <module>mysqlwriter</module>-->
        <!-- <module>drdswriter</module>-->
        <!-- <module>odpswriter</module>-->
        <!-- <module>txtfilewriter</module>-->
        <!-- <module>ftpwriter</module>-->
        <!-- <module>hdfswriter</module>-->
        <!-- <module>streamwriter</module>-->
        <!-- <module>otswriter</module>-->
        <!-- <module>oraclewriter</module>-->
        <!-- <module>sqlserverwriter</module>-->
        <!-- <module>postgresqlwriter</module>-->
        <!-- <module>osswriter</module>-->
        <!-- <module>mongodbwriter</module>-->
        <!-- <module>adswriter</module>-->
        <!-- <module>ocswriter</module>-->
        <!-- <module>rdbmswriter</module>-->
        <!-- <module>hbase11xwriter</module>-->
        <!-- <module>hbase094xwriter</module>-->
        <!-- <module>hbase11xsqlwriter</module>-->
        <!-- <module>hbase11xsqlreader</module>-->
        <module>elasticsearchwriter</module>
        <!-- <module>tsdbwriter</module>-->
        <!-- <module>adbpgwriter</module>-->
        <!-- <module>gdbwriter</module>-->
        <!-- <module>cassandrawriter</module>-->
        <!-- <module>clickhousewriter</module>-->

        <!-- common support module -->
        <!-- <module>plugin-rdbms-util</module>-->
        <!-- <module>plugin-unstructured-storage-util</module>-->
        <!-- <module>hbase20xsqlreader</module>-->
        <!-- <module>hbase20xsqlwriter</module>-->
    </modules>

    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.apache.commons</groupId>
                <artifactId>commons-lang3</artifactId>
                <version>${commons-lang3-version}</version>
            </dependency>
            <dependency>
                <groupId>com.alibaba</groupId>
                <artifactId>fastjson</artifactId>
                <version>${fastjson-version}</version>
            </dependency>
            <!--<dependency>
                <groupId>com.google.guava</groupId>
                <artifactId>guava</artifactId>
                <version>${guava-version}</version>
            </dependency>-->
            <dependency>
                <groupId>commons-io</groupId>
                <artifactId>commons-io</artifactId>
                <version>${commons-io-version}</version>
            </dependency>
            <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-api</artifactId>
                <version>${slf4j-api-version}</version>
            </dependency>
            <dependency>
                <groupId>ch.qos.logback</groupId>
                <artifactId>logback-classic</artifactId>
                <version>${logback-classic-version}</version>
            </dependency>
            <dependency>
                <groupId>com.taobao.tddl</groupId>
                <artifactId>tddl-client</artifactId>
                <version>${tddl.version}</version>
                <exclusions>
                    <exclusion>
                        <groupId>com.google.guava</groupId>
                        <artifactId>guava</artifactId>
                    </exclusion>
                    <exclusion>
                        <groupId>com.taobao.diamond</groupId>
                        <artifactId>diamond-client</artifactId>
                    </exclusion>
                </exclusions>
            </dependency>
            <dependency>
                <groupId>com.taobao.diamond</groupId>
                <artifactId>diamond-client</artifactId>
                <version>${diamond.version}</version>
            </dependency>
            <dependency>
                <groupId>com.alibaba.search.swift</groupId>
                <artifactId>swift_client</artifactId>
                <version>${swift-version}</version>
            </dependency>
            <dependency>
                <groupId>junit</groupId>
                <artifactId>junit</artifactId>
                <version>${junit-version}</version>
            </dependency>
            <dependency>
                <groupId>org.mockito</groupId>
                <artifactId>mockito-all</artifactId>
                <version>1.9.5</version>
                <scope>test</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>

    <repositories>
        <repository>
            <id>central</id>
            <name>Nexus aliyun</name>
            <url>https://maven.aliyun.com/repository/central</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>central</id>
            <name>Nexus aliyun</name>
            <url>https://maven.aliyun.com/repository/central</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </pluginRepository>
    </pluginRepositories>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <finalName>datax</finalName>
                    <descriptors>
                        <descriptor>package.xml</descriptor>
                    </descriptors>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>${jdk-version}</source>
                    <target>${jdk-version}</target>
                    <encoding>${project-sourceEncoding}</encoding>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
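The copy step after the build can be scripted. The sketch below assumes the source tree has already been built with the command from the DataX README (`mvn -U clean package assembly:assembly -Dmaven.test.skip=true`) and that the plugin lands under `elasticsearchwriter/target/datax/plugin/writer/elasticsearchwriter`; that output path is an assumption about the module's assembly layout and should be checked against your own `target` directory. `install_plugin` is a hypothetical helper, not part of DataX.

```python
import shutil
from pathlib import Path


def install_plugin(source_root: Path, datax_home: Path,
                   plugin: str = "elasticsearchwriter") -> Path:
    """Copy a freshly built writer plugin into an existing DataX installation.

    source_root: checkout of the DataX source tree (already built with Maven).
    datax_home:  directory of the prebuilt DataX release (contains plugin/writer).
    """
    # Assumed build output location; verify it in your own target directory.
    built = source_root / plugin / "target" / "datax" / "plugin" / "writer" / plugin
    if not built.is_dir():
        raise FileNotFoundError(f"built plugin not found at {built}")

    dest = datax_home / "plugin" / "writer" / plugin
    if dest.exists():
        # Replace any previous copy so stale jars do not linger.
        shutil.rmtree(dest)
    shutil.copytree(built, dest)
    return dest
```

A call such as `install_plugin(Path("/opt/src/DataX"), Path("/opt/datax"))` (both paths are placeholders) would then leave the plugin in `/opt/datax/plugin/writer/elasticsearchwriter`.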
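Incremental update approach:
DataX Web provides incremental updates keyed on either a time column or an ID column.
The same incremental sync can also be driven by shell scripts.
DataX Web
Time-based increment:
ID-based increment: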
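The two incremental modes can be sketched as query builders that produce the `querySql` for each run. This is only an illustration of the idea, not DataX Web's implementation; the function names and the `update_time` and `id` column names used in the example are hypothetical, and the table and column names are assumed to come from trusted configuration rather than user input.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def incremental_by_time(table: str, column: str,
                        start: datetime, end: datetime) -> str:
    """Select rows whose timestamp falls in [start, end).

    Each run passes the previous run's end as the new start, so no row is
    read twice and none is skipped at the boundary.
    """
    return (
        f"select * from {table} "
        f"where {column} >= '{start.strftime(TIME_FORMAT)}' "
        f"and {column} < '{end.strftime(TIME_FORMAT)}'"
    )


def incremental_by_id(table: str, column: str, last_id: int) -> str:
    """Select rows with an ID greater than the highest ID already synced."""
    return f"select * from {table} where {column} > {int(last_id)} order by {column}"


# Example: the window for one daily run, and a run resuming after id 100.
time_sql = incremental_by_time("test", "update_time",
                               datetime(2024, 1, 1), datetime(2024, 1, 2))
id_sql = incremental_by_id("test", "id", 100)
```

Time-based increments also pick up rows that were modified, provided the column is updated on every write, while ID-based increments only see newly inserted rows. For incremental runs into Elasticsearch, a writer setting such as `cleanup: true` would presumably need to be turned off so that earlier data is not dropped; check the elasticsearchwriter documentation before relying on this.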
Other tests:
In testing, importing 1 million rows took about 41 s (roughly 24,000 rows per second).