StreamSets Data Collector: Real-Time Sync of PostgreSQL/MySQL Data to ClickHouse
StreamSets is a real-time big data ingestion and ETL tool that lets you build data collection and data movement flows without writing a single line of code. Through a drag-and-drop visual interface you design pipelines (Pipelines) and schedule them as recurring jobs. Its main strengths are:
- Visual, code-free operation: design pipelines for streaming, batch processing, and change data capture (CDC) in minutes
- Built-in monitoring: view basic transfer metrics and data quality of a running flow in real time
- Broad integration: full support for the commonly used components, including 50 origins, 44 processors, and 46 destinations.
For StreamSets, the most important concepts are origins (Origins), processors (Processors), and destinations (Destinations); configuring a pipeline is essentially a matter of these three parts. Common origins include Kafka, HTTP, UDP, JDBC, and HDFS; processors can filter, transform, encode, and aggregate individual fields; destinations are much like origins and can write to Kafka, Flume, JDBC, HDFS, Redis, and so on.
Origins (source systems to read from) (docs: https://docs.streamsets.com/portal/#datacollector/3.22.x/help/datacollector/UserGuide/Origins/Origins_title.html)
- An origin stage represents the source for the pipeline. You can use a single origin stage in a pipeline.
- Amazon S3
- Amazon SQS Consumer
- Azure Data Lake Storage Gen1
- Azure Data Lake Storage Gen2
- Azure IoT/Event Hub Consumer
- CoAP Server
- Cron Scheduler
- Directory
- Elasticsearch
- File Tail
- Google BigQuery
- Google Cloud Storage
- Google Pub/Sub Subscriber
- Groovy Scripting
- gRPC Client
- Hadoop FS
- Hadoop FS Standalone
- HTTP Client
- HTTP Server
- HTTP to Kafka (Deprecated)
- JavaScript Scripting
- JDBC Multitable Consumer
- JDBC Query Consumer
- JMS Consumer
- Jython Scripting
- Kafka Consumer
- Kafka Multitopic Consumer
- Kinesis Consumer
- MapR DB CDC
- MapR DB JSON
- MapR FS
- MapR FS Standalone
- MapR Multitopic Streams Consumer
- MapR Streams Consumer
- MongoDB
- MongoDB Oplog
- MQTT Subscriber
- MySQL Binary Log
- NiFi HTTP Server
- Omniture
- OPC UA Client
- Oracle Bulkload
- Oracle CDC Client
- PostgreSQL CDC Client
- Pulsar Consumer
- RabbitMQ Consumer
- Redis Consumer
- REST Service
- Salesforce
- SAP HANA Query Consumer
- SDC RPC
- SDC RPC to Kafka (Deprecated)
- SFTP/FTP/FTPS Client
- SQL Server 2019 BDC Multitable Consumer
- SQL Server CDC Client
- SQL Server Change Tracking
- Start Jobs
- Start Pipelines
- System Metrics
- TCP Server
- Teradata Consumer
- UDP Multithreaded Source
- UDP Source
- UDP to Kafka (Deprecated)
- WebSocket Client
- WebSocket Server
- Windows Event Log
Destinations (target systems to write to)
- Aerospike
- Amazon S3
- Azure Data Lake Storage (Legacy) (Deprecated)
- Azure Data Lake Storage Gen1
- Azure Data Lake Storage Gen2
- Azure Event Hub Producer
- Azure IoT Hub Producer
- Azure Synapse SQL
- Cassandra
- CoAP Client
- Couchbase
- Databricks Delta Lake
- Einstein Analytics
- Elasticsearch
- Flume
- Google BigQuery
- Google Bigtable
- Google Cloud Storage
- Google Pub/Sub Publisher
- GPSS Producer
- Hadoop FS
- HBase
- Hive Metastore
- Hive Streaming
- HTTP Client
- InfluxDB
- JDBC Producer
- JMS Producer
- Kafka Producer
- Kinesis Firehose
- Kinesis Producer
- KineticaDB
- Kudu
- Local FS
- MapR DB
- MapR DB JSON
- MapR FS
- MapR Streams Producer
- MemSQL Fast Loader
- MongoDB
- MQTT Publisher
- Named Pipe
- Pulsar Producer
- RabbitMQ Producer
- Redis
- Salesforce
- SDC RPC
- Send Response to Origin
- SFTP/FTP/FTPS Client
- Snowflake
- Solr
- Splunk
- SQL Server 2019 BDC Bulk Loader
- Syslog
- To Error
- Trash
- WebSocket Client
StreamSets installation
# Turn off the firewall
systemctl stop firewalld.service

# Disable SELinux
setenforce 0                     # takes effect immediately (until reboot)
# set SELINUX=disabled in /etc/selinux/config for a permanent change (effective after reboot)

# Raise the open-file limit
ulimit -n
vi /etc/security/limits.conf
sdc soft nofile 32768
sdc hard nofile 32768
vi /etc/profile
ulimit -HSn 32768
source /etc/profile

# On macOS, adjust the ulimit parameters like this
sudo sysctl -w kern.maxfilesperproc=65535
ulimit -n 65535

# Install axel (brew on macOS, rpm on CentOS)
brew install axel
wget http://download-ib01.fedoraproject.org/pub/epel/7/x86_64/Packages/a/axel-2.4-9.el7.x86_64.rpm
rpm -ivh axel-2.4-9.el7.x86_64.rpm

# Download streamsets-datacollector
axel -n 20 -a -v https://archives.streamsets.com/datacollector/3.22.3/tarball/activation/streamsets-datacollector-common-3.22.3.tgz
axel -n 20 -a -v https://archives.streamsets.com/datacollector/3.22.3/tarball/activation/streamsets-datacollector-all-3.22.3.tgz   # not recommended: full bundle with every stage library

# Install JDK 11
sudo yum install java-11-openjdk-devel
scp /Users/irving/Downloads/streamsets-datacollector-common-3.22.3.tgz root@10.34.12.255:/opt

# Start the service
tar xvzf streamsets-datacollector-common-3.22.3.tgz
nohup bin/streamsets dc &

# Manage it with pm2
pm2 start "bin/streamsets dc" --name streamsets-datacollector
pm2 startup
pm2 save

# Edit the configuration (enable form authentication instead of the default OAUTH2 authentication)
vi sdc.properties
# The authentication for the HTTP endpoint of Data Collector
# Valid values are: 'none', 'basic', 'digest', 'form' or 'aster'
http.authentication=form

pm2 logs streamsets-datacollector
0|streamse | Java 11 detected; adding $SDC_JAVA11_OPTS of "-Djdk.nio.maxCachedBufferSize=262144" to $SDC_JAVA_OPTS
0|streamse | Activation enabled, activation is valid and it does not expire
0|streamse | Logging initialized @3119ms to org.eclipse.jetty.util.log.Slf4jLog
0|streamsets-datacollector | Running on URI : 'http://:18630'

# Upload the JDBC drivers (copy them into /opt/streamsets-datacollector-3.22.3/streamsets-libs-extras/streamsets-datacollector-jdbc-lib/lib)
mvn dependency:copy-dependencies -DoutputDirectory=lib -DincludeScope=compile
scp -r /Users/irving/Desktop/git/microsrv/clickhouse/lib root@10.34.12.255:/opt/streamsets-datacollector-3.22.3/streamsets-libs-extras/streamsets-datacollector-jdbc-lib/lib
pm2 restart streamsets-datacollector
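Before moving on, it helps to confirm that Data Collector actually came up. A minimal sanity check, assuming it runs on the local host with the default web port 18630:

```
# The UI/REST endpoint should answer on port 18630 once startup finishes
curl -s -o /dev/null -w "HTTP %{http_code}\n" http://127.0.0.1:18630

# If the port does not respond, tail the pm2-managed log
pm2 logs streamsets-datacollector --lines 50
```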
Package and upload the JDBC driver jars with Maven (the MSSQL and PostgreSQL drivers are supported out of the box)
<dependencies>
    <dependency>
        <groupId>ru.yandex.clickhouse</groupId>
        <artifactId>clickhouse-jdbc</artifactId>
        <version>0.3.1</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.25</version>
    </dependency>
    <!--
    <dependency>
        <groupId>org.postgresql</groupId>
        <artifactId>postgresql</artifactId>
        <version>42.2.23</version>
    </dependency>
    -->
</dependencies>
# tree -h
.
├── [ 352K]  clickhouse-jdbc-0.3.1.jar
├── [ 327K]  commons-codec-1.11.jar
├── [  60K]  commons-logging-1.2.jar
├── [ 762K]  httpclient-4.5.13.jar
├── [ 321K]  httpcore-4.4.13.jar
├── [  41K]  httpmime-4.5.13.jar
├── [  65K]  jackson-annotations-2.9.10.jar
├── [ 318K]  jackson-core-2.9.10.jar
├── [ 1.3M]  jackson-databind-2.9.10.8.jar
├── [ 635K]  lz4-java-1.7.1.jar
├── [ 2.3M]  mysql-connector-java-8.0.25.jar
├── [ 1.6M]  protobuf-java-3.11.4.jar
└── [  40K]  slf4j-api-1.7.30.jar
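With the 0.3.x clickhouse-jdbc driver, the JDBC Producer destination is pointed at the ClickHouse HTTP port, i.e. a URL of the form jdbc:clickhouse://<host>:8123/<database>. Before wiring up the pipeline, it can save time to verify that the HTTP endpoint is reachable from the Data Collector host. A small sketch — clickhouse-host and <password> are placeholders to replace with your own values:

```
# Replace clickhouse-host with the actual server address
echo 'SELECT version()' | curl -s 'http://clickhouse-host:8123/' --data-binary @-

# If the default user has a password, pass credentials explicitly
echo 'SELECT 1' | curl -s 'http://clickhouse-host:8123/?user=default&password=<password>' --data-binary @-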
ClickHouse installation
# Check the environment: "SSE 4.2 supported" means ClickHouse can be installed.
# ClickHouse relies on the SSE hardware instruction set, which greatly speeds up CPU register-level computation.
grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"

# Install ClickHouse
sudo yum install yum-utils
sudo rpm --import https://repo.clickhouse.tech/CLICKHOUSE-KEY.GPG
sudo yum-config-manager --add-repo https://repo.clickhouse.tech/rpm/clickhouse.repo
sudo yum install clickhouse-server clickhouse-client

[root@ ~]# sudo yum install clickhouse-server clickhouse-client
Repository epel is listed more than once in the configuration
上次元数据过期检查:0:01:02 前,执行于 2021年07月15日 星期四 15时49分34秒。
依赖关系解决。
============================================================================================
 软件包                      架构      版本           仓库                 大小
============================================================================================
安装:
 clickhouse-client           noarch    21.7.3.14-2    clickhouse-stable     76 k
 clickhouse-server           noarch    21.7.3.14-2    clickhouse-stable    100 k
安装依赖关系:
 clickhouse-common-static    x86_64    21.7.3.14-2    clickhouse-stable    166 M

事务概要
============================================================================================
安装  3 软件包

总下载:166 M
安装大小:539 M
确定吗?[y/N]: y
下载软件包:
[MIRROR] clickhouse-client-21.7.3.14-2.noarch.rpm: Curl error (56): Failure when receiving data from the peer for http://repo.clickhouse.tech/rpm/stable/x86_64/clickhouse-client-21.7.3.14-2.noarch.rpm [Recv failure: 连接被对方重设]
(1/3): clickhouse-server-21.7.3.14-2.noarch.rpm           142 kB/s | 100 kB  00:00
(2/3): clickhouse-client-21.7.3.14-2.noarch.rpm            99 kB/s |  76 kB  00:00
(3/3): clickhouse-common-static-21.7.3.14-2.x86_64.rpm    5.3 MB/s | 166 MB  00:31
--------------------------------------------------------------------------------------------
总计                                                      5.3 MB/s | 166 MB  00:31
运行事务检查
事务检查成功。
运行事务测试
事务测试成功。
运行事务
  准备中  :                                                          1/1
  安装    : clickhouse-common-static-21.7.3.14-2.x86_64              1/3
  安装    : clickhouse-client-21.7.3.14-2.noarch                     2/3
  安装    : clickhouse-server-21.7.3.14-2.noarch                     3/3
  运行脚本: clickhouse-server-21.7.3.14-2.noarch                     3/3
ClickHouse binary is already located at /usr/bin/clickhouse
Symlink /usr/bin/clickhouse-server already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-server to /usr/bin/clickhouse.
Symlink /usr/bin/clickhouse-client already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-client to /usr/bin/clickhouse.
Symlink /usr/bin/clickhouse-local already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-local to /usr/bin/clickhouse.
Symlink /usr/bin/clickhouse-benchmark already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-benchmark to /usr/bin/clickhouse.
Symlink /usr/bin/clickhouse-copier already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-copier to /usr/bin/clickhouse.
Symlink /usr/bin/clickhouse-obfuscator already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-obfuscator to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-git-import to /usr/bin/clickhouse.
Symlink /usr/bin/clickhouse-compressor already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-compressor to /usr/bin/clickhouse.
Symlink /usr/bin/clickhouse-format already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-format to /usr/bin/clickhouse.
Symlink /usr/bin/clickhouse-extract-from-config already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-extract-from-config to /usr/bin/clickhouse.
Creating clickhouse group if it does not exist.
 groupadd -r clickhouse
Creating clickhouse user if it does not exist.
 useradd -r --shell /bin/false --home-dir /nonexistent -g clickhouse clickhouse
Will set ulimits for clickhouse user in /etc/security/limits.d/clickhouse.conf.
Creating config directory /etc/clickhouse-server/config.d that is used for tweaks of main server configuration.
Creating config directory /etc/clickhouse-server/users.d that is used for tweaks of users configuration.
Config file /etc/clickhouse-server/config.xml already exists, will keep it and extract path info from it.
/etc/clickhouse-server/config.xml has /var/lib/clickhouse/ as data path.
/etc/clickhouse-server/config.xml has /var/log/clickhouse-server/ as log path.
Users config file /etc/clickhouse-server/users.xml already exists, will keep it and extract users info from it.
 chown --recursive clickhouse:clickhouse '/etc/clickhouse-server'
Creating log directory /var/log/clickhouse-server/.
Creating data directory /var/lib/clickhouse/.
Creating pid directory /var/run/clickhouse-server.
 chown --recursive clickhouse:clickhouse '/var/log/clickhouse-server/'
 chown --recursive clickhouse:clickhouse '/var/run/clickhouse-server'
 chown clickhouse:clickhouse '/var/lib/clickhouse/'
 groupadd -r clickhouse-bridge
 useradd -r --shell /bin/false --home-dir /nonexistent -g clickhouse-bridge clickhouse-bridge
 chown --recursive clickhouse-bridge:clickhouse-bridge '/usr/bin/clickhouse-odbc-bridge'
 chown --recursive clickhouse-bridge:clickhouse-bridge '/usr/bin/clickhouse-library-bridge'
Password for default user is empty string. See /etc/clickhouse-server/users.xml and /etc/clickhouse-server/users.d to change it.
Setting capabilities for clickhouse binary. This is optional.
ClickHouse has been successfully installed.
Start clickhouse-server with:
 sudo clickhouse start
Start clickhouse-client with:
 clickhouse-client
Synchronizing state of clickhouse-server.service with SysV service script with /usr/lib/systemd/systemd-sysv-install.
Executing: /usr/lib/systemd/systemd-sysv-install enable clickhouse-server
Created symlink /etc/systemd/system/multi-user.target.wants/clickhouse-server.service → /etc/systemd/system/clickhouse-server.service.
  验证    : clickhouse-client-21.7.3.14-2.noarch                     1/3
  验证    : clickhouse-common-static-21.7.3.14-2.x86_64              2/3
  验证    : clickhouse-server-21.7.3.14-2.noarch                     3/3

已安装:
  clickhouse-client-21.7.3.14-2.noarch  clickhouse-common-static-21.7.3.14-2.x86_64  clickhouse-server-21.7.3.14-2.noarch

完毕!

# Check the version
clickhouse-server --version
ClickHouse server version 21.7.3.14 (official build).
# Start the service
systemctl start clickhouse-server.service
# Restart the service
systemctl restart clickhouse-server.service
# Stop the service
systemctl stop clickhouse-server.service
# Check the status
systemctl status clickhouse-server.service
```
[root@ clickhouse-server]# systemctl status clickhouse-server.service
● clickhouse-server.service - ClickHouse Server (analytic DBMS for big data)
   Loaded: loaded (/etc/systemd/system/clickhouse-server.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2021-07-15 17:28:31 CST; 1min 11s ago
 Main PID: 508117 (clckhouse-watch)
    Tasks: 156 (limit: 49524)
   Memory: 127.5M
   CGroup: /system.slice/clickhouse-server.service
           ├─508117 clickhouse-watchdog --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
           └─508118 /usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid

7月 15 17:28:31 systemd[1]: Started ClickHouse Server (analytic DBMS for big data).
7月 15 17:28:31 clickhouse-server[508117]: Processing configuration file '/etc/clickhouse-server/config.xml'.
7月 15 17:28:31 clickhouse-server[508117]: Logging trace to /var/log/clickhouse-server/clickhouse-server.log
7月 15 17:28:31 clickhouse-server[508117]: Logging errors to /var/log/clickhouse-server/clickhouse-server.err.log
7月 15 17:28:32 clickhouse-server[508117]: Processing configuration file '/etc/clickhouse-server/config.xml'.
7月 15 17:28:32 clickhouse-server[508117]: Saved preprocessed configuration to '/var/lib/clickhouse/preprocessed_configs/config.xml'.
7月 15 17:28:32 clickhouse-server[508117]: Processing configuration file '/etc/clickhouse-server/users.xml'.
7月 15 17:28:32 clickhouse-server[508117]: Saved preprocessed configuration to '/var/lib/clickhouse/preprocessed_configs/users.xml'.
```
# Directory layout
```
(1) /etc/clickhouse-server: server configuration directory, including the global config.xml and the user config users.xml.
(2) /var/lib/clickhouse: default data directory (usually changed to a path mounted on a large-capacity disk).
(3) /var/log/clickhouse-server: default log directory (usually changed to a path mounted on a large-capacity disk).
Before going further, please notice the <path> element in config.xml. Path determines the location for data storage, so it should be located on a volume with large disk capacity; the default value is /var/lib/clickhouse/.
```
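If the service reports as active but clients cannot connect, a quick way to confirm that ClickHouse is actually listening is to probe its two default ports (8123 for HTTP, 9000 for the native TCP protocol). A minimal check, assuming a local install with default ports:

```
# The HTTP interface answers "Ok." on /ping when the server is healthy
curl -s http://127.0.0.1:8123/ping

# Confirm both the HTTP (8123) and native TCP (9000) listeners are up
ss -lntp | grep -E '8123|9000'
```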
# Configure remote access
```
chmod 777 config.xml users.xml

<!-- If IPv6 is disabled, use this setting -->
<listen_host>0.0.0.0</listen_host>
<!-- If IPv6 is not disabled, use this setting instead
<listen_host>::</listen_host>
-->
```
Password options in users.xml:
```
1. Plain text: <password>password</password>
2. sha256:     <password_sha256_hex>password</password_sha256_hex>
   Example of generating a password from the shell:
   PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha256sum | tr -d '-'
   The first line is the plain-text password, the second line is its sha256 hash.
3. sha1:       <password_double_sha1_hex>password</password_double_sha1_hex>
   Example of generating a password from the shell:
   PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha1sum | tr -d '-' | xxd -r -p | sha1sum | tr -d '-'
   The first line is the plain-text password, the second line is its double-sha1 hash.
```
# Generate a password
[root@ clickhouse-server]# PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha1sum | tr -d '-' | xxd -r -p | sha1sum | tr -d '-'
JUxG8EwNS
b1854b52ed2577d50a23591d0d8fda3420b4d2af

# Configure it
<password_double_sha1_hex>b1854b52ed2577d50a23591d0d8fda3420b4d2af</password_double_sha1_hex>

# Log in with the client (the HTTP port is 8123, the TCP port is 9000)
clickhouse-client -h 127.0.0.1 --port 9000 -m -u default --password JUxG8EwNS

# Import sample data (the ontime flight dataset)
curl -O https://datasets.clickhouse.tech/ontime/partitions/ontime.tar
# or download with 10 threads and a progress bar
axel -n 10 -a https://datasets.clickhouse.tech/ontime/partitions/ontime.tar
tar xvf ontime.tar -C /var/lib/clickhouse

# Check directory sizes
du -h --max-depth=1

# Restart the service
sudo systemctl restart clickhouse-server.service

# Count the rows (183,953,732 — over 180 million)
clickhouse-client -h 127.0.0.1 --port 9000 -m -u default --password JUxG8EwNS --query "select count(*) from datasets.ontime"

# Create a test table
CREATE TABLE emp_replacingmergetree
(
    emp_id     UInt16 COMMENT '员工id',
    name       String COMMENT '员工姓名',
    work_place String COMMENT '工作地点',
    age        UInt8 COMMENT '员工年龄',
    depart     String COMMENT '部门',
    salary     Decimal32(2) COMMENT '薪资'
) ENGINE = ReplacingMergeTree()
ORDER BY emp_id
PRIMARY KEY emp_id
PARTITION BY work_place;

-- Insert data
INSERT INTO emp_replacingmergetree VALUES (1,'tom','上海',25,'技术部',20000),(2,'jack','上海',26,'人事部',10000);
INSERT INTO emp_replacingmergetree VALUES (3,'bob','北京',33,'财务部',50000),(4,'tony','杭州',28,'销售事部',50000);

# Partitioned tables
Partition by time:
  toYYYYMM(EventDate): partition by month
  toMonday(EventDate): partition by week
  toDate(EventDate):   partition by day
Partition by a specific column:
  PARTITION BY column_name
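ReplacingMergeTree only deduplicates rows with the same sorting key during background merges, so it is worth verifying the behaviour explicitly. A small sketch that reuses the emp_replacingmergetree table and the credentials created above (adjust host and password to your environment):

```
# Insert a second version of emp_id = 1 into the same partition ('上海')
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password JUxG8EwNS --query \
  "INSERT INTO emp_replacingmergetree VALUES (1,'tom','上海',26,'技术部',30000)"

# Before a merge, both versions may still be visible
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password JUxG8EwNS --query \
  "SELECT * FROM emp_replacingmergetree WHERE emp_id = 1"

# Force a merge (or query with FINAL); only one row per emp_id should remain afterwards
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password JUxG8EwNS --query \
  "OPTIMIZE TABLE emp_replacingmergetree FINAL"
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password JUxG8EwNS --query \
  "SELECT * FROM emp_replacingmergetree FINAL WHERE emp_id = 1"
```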
Diagram
The MaterializeMySQL and MaterializedPostgreSQL engines
https://clickhouse.tech/docs/en/engines/database-engines/materialize-mysql/ (this feature is experimental)
https://clickhouse.tech/docs/en/engines/database-engines/materialized-postgresql/ (the official docs do not mark it as experimental; not tested here)
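As a rough illustration of the MaterializeMySQL route, the sketch below creates a mirrored database. It assumes a ClickHouse 21.x server, and a MySQL instance at mysql-host:3306 with row-based binlog and GTID mode enabled; the database, user, and password are placeholders. Note that the engine and setting names differ between versions (newer releases call it MaterializedMySQL), so treat this as an example, not a definitive recipe:

```
# MaterializeMySQL is experimental and must be enabled for the session
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password JUxG8EwNS --multiquery --query \
  "SET allow_experimental_database_materialize_mysql = 1;
   CREATE DATABASE mysql_mirror ENGINE = MaterializeMySQL('mysql-host:3306', 'source_db', 'repl_user', 'repl_password')"

# Tables from source_db are then replicated continuously and can be queried in place
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password JUxG8EwNS --query \
  "SHOW TABLES FROM mysql_mirror"
```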
Creating the table by hand (ontime flight test data)
CREATE TABLE `ontime` (
    `Year` UInt16, `Quarter` UInt8, `Month` UInt8, `DayofMonth` UInt8, `DayOfWeek` UInt8, `FlightDate` Date,
    `Reporting_Airline` String, `DOT_ID_Reporting_Airline` Int32, `IATA_CODE_Reporting_Airline` String, `Tail_Number` Int32, `Flight_Number_Reporting_Airline` String,
    `OriginAirportID` Int32, `OriginAirportSeqID` Int32, `OriginCityMarketID` Int32, `Origin` FixedString(5), `OriginCityName` String, `OriginState` FixedString(2), `OriginStateFips` String, `OriginStateName` String, `OriginWac` Int32,
    `DestAirportID` Int32, `DestAirportSeqID` Int32, `DestCityMarketID` Int32, `Dest` FixedString(5), `DestCityName` String, `DestState` FixedString(2), `DestStateFips` String, `DestStateName` String, `DestWac` Int32,
    `CRSDepTime` Int32, `DepTime` Int32, `DepDelay` Int32, `DepDelayMinutes` Int32, `DepDel15` Int32, `DepartureDelayGroups` String, `DepTimeBlk` String,
    `TaxiOut` Int32, `WheelsOff` Int32, `WheelsOn` Int32, `TaxiIn` Int32,
    `CRSArrTime` Int32, `ArrTime` Int32, `ArrDelay` Int32, `ArrDelayMinutes` Int32, `ArrDel15` Int32, `ArrivalDelayGroups` Int32, `ArrTimeBlk` String,
    `Cancelled` UInt8, `CancellationCode` FixedString(1), `Diverted` UInt8,
    `CRSElapsedTime` Int32, `ActualElapsedTime` Int32, `AirTime` Nullable(Int32), `Flights` Int32, `Distance` Int32, `DistanceGroup` UInt8,
    `CarrierDelay` Int32, `WeatherDelay` Int32, `NASDelay` Int32, `SecurityDelay` Int32, `LateAircraftDelay` Int32,
    `FirstDepTime` String, `TotalAddGTime` String, `LongestAddGTime` String, `DivAirportLandings` String, `DivReachedDest` String, `DivActualElapsedTime` String, `DivArrDelay` String, `DivDistance` String,
    `Div1Airport` String, `Div1AirportID` Int32, `Div1AirportSeqID` Int32, `Div1WheelsOn` String, `Div1TotalGTime` String, `Div1LongestGTime` String, `Div1WheelsOff` String, `Div1TailNum` String,
    `Div2Airport` String, `Div2AirportID` Int32, `Div2AirportSeqID` Int32, `Div2WheelsOn` String, `Div2TotalGTime` String, `Div2LongestGTime` String, `Div2WheelsOff` String, `Div2TailNum` String,
    `Div3Airport` String, `Div3AirportID` Int32, `Div3AirportSeqID` Int32, `Div3WheelsOn` String, `Div3TotalGTime` String, `Div3LongestGTime` String, `Div3WheelsOff` String, `Div3TailNum` String,
    `Div4Airport` String, `Div4AirportID` Int32, `Div4AirportSeqID` Int32, `Div4WheelsOn` String, `Div4TotalGTime` String, `Div4LongestGTime` String, `Div4WheelsOff` String, `Div4TailNum` String,
    `Div5Airport` String, `Div5AirportID` Int32, `Div5AirportSeqID` Int32, `Div5WheelsOn` String, `Div5TotalGTime` String, `Div5LongestGTime` String, `Div5WheelsOff` String, `Div5TailNum` String
) ENGINE = MergeTree()
PARTITION BY FlightDate
ORDER BY (Year, FlightDate)
SETTINGS index_granularity = 8192;
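Once the table exists and the pipeline starts writing into it, the partitioning scheme can be checked against system.parts. A sketch reusing the credentials from the install section above (adjust host and password as needed):

```
# List active parts of the ontime table with their partition and row counts
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password JUxG8EwNS --query \
  "SELECT partition, name, rows
   FROM system.parts
   WHERE database = 'default' AND table = 'ontime' AND active
   ORDER BY partition
   LIMIT 20"
```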
Note: if latency requirements are strict, CDC (Change Data Capture) may also be a good approach. MySQL needs binlog enabled and PostgreSQL needs Write-Ahead Logging (WAL) support enabled (see the configuration sketch below); Debezium or the Flink CDC connectors (which do not yet support ClickHouse) are also solid alternatives.
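A minimal sketch of the source-side settings, assuming default configuration file locations; paths and values vary by distribution and version, so treat them as placeholders to adapt. The StreamSets PostgreSQL CDC Client additionally requires a logical decoding plugin (such as wal2json) installed on the PostgreSQL server.

```
# MySQL: enable row-based binlog (needed by the MySQL Binary Log origin, Debezium, and Flink CDC)
cat >> /etc/my.cnf <<'EOF'
[mysqld]
server-id        = 1
log-bin          = mysql-bin
binlog_format    = ROW
binlog_row_image = FULL
EOF
systemctl restart mysqld

# PostgreSQL: enable logical decoding (needed by the PostgreSQL CDC Client origin and Debezium)
cat >> /var/lib/pgsql/data/postgresql.conf <<'EOF'
wal_level = logical
max_replication_slots = 4
max_wal_senders = 4
EOF
systemctl restart postgresql
```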
REFER:
https://docs.streamsets.com/portal/#datacollector/latest/help/index.html
http://streamsets.vip/
https://clickhouse.tech/docs/en/
https://streamsets.com/products/dataops-platform/data-collector/
https://accounts.streamsets.com/archives (full bundle, 6 GB+)
https://github.com/ververica/flink-cdc-connectors
https://github.com/Altinity/clickhouse-mysql-data-reader
https://github.com/debezium/debezium
https://juejin.cn/post/6860480798294966286
https://www.cnblogs.com/uestc2007/p/13704912.html
https://github.com/apache/nifi (Apache NiFi and StreamSets are both dataflow processing tools. StreamSets has the slicker UI, but the enterprise-grade StreamSets Control Hub is not open source; in terms of open-source, enterprise-level feature completeness, Apache NiFi is more comprehensive.)
https://github.com/ClickHouse/ClickHouse
https://clickhouse.com/docs/zh/interfaces/jdbc/(TCP JDBC)
Build a CDC Pipeline
https://www.youtube.com/watch?v=rf0fquEze9s&ab_channel=StreamSetsInc.
How to Create Apache Kafka Pipelines in Minutes
https://www.youtube.com/watch?v=SiZrkyEzpJc