StreamSets Data Collector: Real-Time Sync of PostgreSQL/MySQL Data to ClickHouse
StreamSets is a real-time big data ingestion and ETL tool that lets you build data collection and data movement flows without writing a single line of code. Through a drag-and-drop visual interface you design pipelines (Pipelines) and schedule them as recurring jobs. Its main strengths are:
- Visual, code-free operation: design pipelines for streaming, batch processing, and change data capture (CDC) in minutes
- Built-in monitoring: view basic transfer metrics and data quality of a running flow in real time
- Broad integration: full support for the commonly used components, including 50 origins, 44 processors, and 46 destinations.
For StreamSets, the most important concepts are origins (Origins), processors (Processors), and destinations (Destinations); configuring a pipeline is essentially a matter of these three parts. Common origins include Kafka, HTTP, UDP, JDBC, and HDFS; processors can filter, transform, encode, and aggregate individual fields; destinations are much like origins and can write to Kafka, Flume, JDBC, HDFS, Redis, and so on.
Origins (source systems to read from) (docs: https://docs.streamsets.com/portal/#datacollector/3.22.x/help/datacollector/UserGuide/Origins/Origins_title.html)
- An origin stage represents the source for the pipeline. You can use a single origin stage in a pipeline.
- Amazon S3
- Amazon SQS Consumer
- Azure Data Lake Storage Gen1
- Azure Data Lake Storage Gen2
- Azure IoT/Event Hub Consumer
- CoAP Server
- Cron Scheduler
- Directory
- Elasticsearch
- File Tail
- Google BigQuery
- Google Cloud Storage
- Google Pub/Sub Subscriber
- Groovy Scripting
- gRPC Client
- Hadoop FS
- Hadoop FS Standalone
- HTTP Client
- HTTP Server
- HTTP to Kafka (Deprecated)
- JavaScript Scripting
- JDBC Multitable Consumer
- JDBC Query Consumer
- JMS Consumer
- Jython Scripting
- Kafka Consumer
- Kafka Multitopic Consumer
- Kinesis Consumer
- MapR DB CDC
- MapR DB JSON
- MapR FS
- MapR FS Standalone
- MapR Multitopic Streams Consumer
- MapR Streams Consumer
- MongoDB
- MongoDB Oplog
- MQTT Subscriber
- MySQL Binary Log
- NiFi HTTP Server
- Omniture
- OPC UA Client
- Oracle Bulkload
- Oracle CDC Client
- PostgreSQL CDC Client
- Pulsar Consumer
- RabbitMQ Consumer
- Redis Consumer
- REST Service
- Salesforce
- SAP HANA Query Consumer
- SDC RPC
- SDC RPC to Kafka (Deprecated)
- SFTP/FTP/FTPS Client
- SQL Server 2019 BDC Multitable Consumer
- SQL Server CDC Client
- SQL Server Change Tracking
- Start Jobs
- Start Pipelines
- System Metrics
- TCP Server
- Teradata Consumer
- UDP Multithreaded Source
- UDP Source
- UDP to Kafka (Deprecated)
- WebSocket Client
- WebSocket Server
- Windows Event Log
Destinations (target systems to write to)
- Aerospike
- Amazon S3
- Azure Data Lake Storage (Legacy) (Deprecated)
- Azure Data Lake Storage Gen1
- Azure Data Lake Storage Gen2
- Azure Event Hub Producer
- Azure IoT Hub Producer
- Azure Synapse SQL
- Cassandra
- CoAP Client
- Couchbase
- Databricks Delta Lake
- Einstein Analytics
- Elasticsearch
- Flume
- Google BigQuery
- Google Bigtable
- Google Cloud Storage
- Google Pub/Sub Publisher
- GPSS Producer
- Hadoop FS
- HBase
- Hive Metastore
- Hive Streaming
- HTTP Client
- InfluxDB
- JDBC Producer
- JMS Producer
- Kafka Producer
- Kinesis Firehose
- Kinesis Producer
- KineticaDB
- Kudu
- Local FS
- MapR DB
- MapR DB JSON
- MapR FS
- MapR Streams Producer
- MemSQL Fast Loader
- MongoDB
- MQTT Publisher
- Named Pipe
- Pulsar Producer
- RabbitMQ Producer
- Redis
- Salesforce
- SDC RPC
- Send Response to Origin
- SFTP/FTP/FTPS Client
- Snowflake
- Solr
- Splunk
- SQL Server 2019 BDC Bulk Loader
- Syslog
- To Error
- Trash
- WebSocket Client
StreamSets installation
# Turn off the firewall
systemctl stop firewalld.service

# Disable SELinux
setenforce 0                     # takes effect immediately (until reboot)
# set SELINUX=disabled in /etc/selinux/config for a permanent change (effective after reboot)

# Raise the open-file limit
ulimit -n
vi /etc/security/limits.conf
sdc soft nofile 32768
sdc hard nofile 32768
vi /etc/profile
ulimit -HSn 32768
source /etc/profile

# On macOS, adjust the ulimit parameters like this
sudo sysctl -w kern.maxfilesperproc=65535
ulimit -n 65535

# Install axel (brew on macOS, rpm on CentOS)
brew install axel
wget http://download-ib01.fedoraproject.org/pub/epel/7/x86_64/Packages/a/axel-2.4-9.el7.x86_64.rpm
rpm -ivh axel-2.4-9.el7.x86_64.rpm

# Download streamsets-datacollector
axel -n 20 -a -v https://archives.streamsets.com/datacollector/3.22.3/tarball/activation/streamsets-datacollector-common-3.22.3.tgz
axel -n 20 -a -v https://archives.streamsets.com/datacollector/3.22.3/tarball/activation/streamsets-datacollector-all-3.22.3.tgz   # not recommended: full bundle with every stage library

# Install JDK 11
sudo yum install java-11-openjdk-devel
scp /Users/irving/Downloads/streamsets-datacollector-common-3.22.3.tgz root@10.34.12.255:/opt

# Start the service
tar xvzf streamsets-datacollector-common-3.22.3.tgz
nohup bin/streamsets dc &

# Manage it with pm2
pm2 start "bin/streamsets dc" --name streamsets-datacollector
pm2 startup
pm2 save

# Edit the configuration (enable form authentication instead of the default OAUTH2 authentication)
vi sdc.properties
# The authentication for the HTTP endpoint of Data Collector
# Valid values are: 'none', 'basic', 'digest', 'form' or 'aster'
http.authentication=form

pm2 logs streamsets-datacollector
0|streamse | Java 11 detected; adding $SDC_JAVA11_OPTS of "-Djdk.nio.maxCachedBufferSize=262144" to $SDC_JAVA_OPTS
0|streamse | Activation enabled, activation is valid and it does not expire
0|streamse | Logging initialized @3119ms to org.eclipse.jetty.util.log.Slf4jLog
0|streamsets-datacollector | Running on URI : 'http://:18630'

# Upload the JDBC drivers (copy them into /opt/streamsets-datacollector-3.22.3/streamsets-libs-extras/streamsets-datacollector-jdbc-lib/lib)
mvn dependency:copy-dependencies -DoutputDirectory=lib -DincludeScope=compile
scp -r /Users/irving/Desktop/git/microsrv/clickhouse/lib root@10.34.12.255:/opt/streamsets-datacollector-3.22.3/streamsets-libs-extras/streamsets-datacollector-jdbc-lib/lib
pm2 restart streamsets-datacollector
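Before moving on, it helps to confirm that Data Collector actually came up. A minimal sanity check, assuming it runs on the local host with the default web port 18630:

```
# The UI/REST endpoint should answer on port 18630 once startup finishes
curl -s -o /dev/null -w "HTTP %{http_code}\n" http://127.0.0.1:18630

# If the port does not respond, tail the pm2-managed log
pm2 logs streamsets-datacollector --lines 50
```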
Package and upload the JDBC driver jars with Maven (the MSSQL and PostgreSQL drivers are supported out of the box)
<dependencies>
    <dependency>
        <groupId>ru.yandex.clickhouse</groupId>
        <artifactId>clickhouse-jdbc</artifactId>
        <version>0.3.1</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.25</version>
    </dependency>
    <!--
    <dependency>
        <groupId>org.postgresql</groupId>
        <artifactId>postgresql</artifactId>
        <version>42.2.23</version>
    </dependency>
    -->
</dependencies>
# tree -h
.
├── [ 352K]  clickhouse-jdbc-0.3.1.jar
├── [ 327K]  commons-codec-1.11.jar
├── [  60K]  commons-logging-1.2.jar
├── [ 762K]  httpclient-4.5.13.jar
├── [ 321K]  httpcore-4.4.13.jar
├── [  41K]  httpmime-4.5.13.jar
├── [  65K]  jackson-annotations-2.9.10.jar
├── [ 318K]  jackson-core-2.9.10.jar
├── [ 1.3M]  jackson-databind-2.9.10.8.jar
├── [ 635K]  lz4-java-1.7.1.jar
├── [ 2.3M]  mysql-connector-java-8.0.25.jar
├── [ 1.6M]  protobuf-java-3.11.4.jar
└── [  40K]  slf4j-api-1.7.30.jar
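With the 0.3.x clickhouse-jdbc driver, the JDBC Producer destination is pointed at the ClickHouse HTTP port, i.e. a URL of the form jdbc:clickhouse://<host>:8123/<database>. Before wiring up the pipeline, it can save time to verify that the HTTP endpoint is reachable from the Data Collector host. A small sketch — clickhouse-host and <password> are placeholders to replace with your own values:

```
# Replace clickhouse-host with the actual server address
echo 'SELECT version()' | curl -s 'http://clickhouse-host:8123/' --data-binary @-

# If the default user has a password, pass credentials explicitly
echo 'SELECT 1' | curl -s 'http://clickhouse-host:8123/?user=default&password=<password>' --data-binary @-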
ClickHouse installation
# Check the environment: "SSE 4.2 supported" means ClickHouse can be installed.
# ClickHouse relies on the SSE hardware instruction set, which greatly speeds up CPU register-level computation.
grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"

# Install ClickHouse
sudo yum install yum-utils
sudo rpm --import https://repo.clickhouse.tech/CLICKHOUSE-KEY.GPG
sudo yum-config-manager --add-repo https://repo.clickhouse.tech/rpm/clickhouse.repo
sudo yum install clickhouse-server clickhouse-client

[root@ ~]# sudo yum install clickhouse-server clickhouse-client
Repository epel is listed more than once in the configuration
上次元数据过期检查:0:01:02 前,执行于 2021年07月15日 星期四 15时49分34秒。
依赖关系解决。
============================================================================================
 软件包                      架构      版本           仓库                 大小
============================================================================================
安装:
 clickhouse-client           noarch    21.7.3.14-2    clickhouse-stable     76 k
 clickhouse-server           noarch    21.7.3.14-2    clickhouse-stable    100 k
安装依赖关系:
 clickhouse-common-static    x86_64    21.7.3.14-2    clickhouse-stable    166 M

事务概要
============================================================================================
安装  3 软件包

总下载:166 M
安装大小:539 M
确定吗?[y/N]: y
下载软件包:
[MIRROR] clickhouse-client-21.7.3.14-2.noarch.rpm: Curl error (56): Failure when receiving data from the peer for http://repo.clickhouse.tech/rpm/stable/x86_64/clickhouse-client-21.7.3.14-2.noarch.rpm [Recv failure: 连接被对方重设]
(1/3): clickhouse-server-21.7.3.14-2.noarch.rpm           142 kB/s | 100 kB  00:00
(2/3): clickhouse-client-21.7.3.14-2.noarch.rpm            99 kB/s |  76 kB  00:00
(3/3): clickhouse-common-static-21.7.3.14-2.x86_64.rpm    5.3 MB/s | 166 MB  00:31
--------------------------------------------------------------------------------------------
总计                                                      5.3 MB/s | 166 MB  00:31
运行事务检查
事务检查成功。
运行事务测试
事务测试成功。
运行事务
  准备中  :                                                          1/1
  安装    : clickhouse-common-static-21.7.3.14-2.x86_64              1/3
  安装    : clickhouse-client-21.7.3.14-2.noarch                     2/3
  安装    : clickhouse-server-21.7.3.14-2.noarch                     3/3
  运行脚本: clickhouse-server-21.7.3.14-2.noarch                     3/3
ClickHouse binary is already located at /usr/bin/clickhouse
Symlink /usr/bin/clickhouse-server already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-server to /usr/bin/clickhouse.
Symlink /usr/bin/clickhouse-client already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-client to /usr/bin/clickhouse.
Symlink /usr/bin/clickhouse-local already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-local to /usr/bin/clickhouse.
Symlink /usr/bin/clickhouse-benchmark already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-benchmark to /usr/bin/clickhouse.
Symlink /usr/bin/clickhouse-copier already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-copier to /usr/bin/clickhouse.
Symlink /usr/bin/clickhouse-obfuscator already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-obfuscator to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-git-import to /usr/bin/clickhouse.
Symlink /usr/bin/clickhouse-compressor already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-compressor to /usr/bin/clickhouse.
Symlink /usr/bin/clickhouse-format already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-format to /usr/bin/clickhouse.
Symlink /usr/bin/clickhouse-extract-from-config already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-extract-from-config to /usr/bin/clickhouse.
Creating clickhouse group if it does not exist.
 groupadd -r clickhouse
Creating clickhouse user if it does not exist.
 useradd -r --shell /bin/false --home-dir /nonexistent -g clickhouse clickhouse
Will set ulimits for clickhouse user in /etc/security/limits.d/clickhouse.conf.
Creating config directory /etc/clickhouse-server/config.d that is used for tweaks of main server configuration.
Creating config directory /etc/clickhouse-server/users.d that is used for tweaks of users configuration.
Config file /etc/clickhouse-server/config.xml already exists, will keep it and extract path info from it.
/etc/clickhouse-server/config.xml has /var/lib/clickhouse/ as data path.
/etc/clickhouse-server/config.xml has /var/log/clickhouse-server/ as log path.
Users config file /etc/clickhouse-server/users.xml already exists, will keep it and extract users info from it.
 chown --recursive clickhouse:clickhouse '/etc/clickhouse-server'
Creating log directory /var/log/clickhouse-server/.
Creating data directory /var/lib/clickhouse/.
Creating pid directory /var/run/clickhouse-server.
 chown --recursive clickhouse:clickhouse '/var/log/clickhouse-server/'
 chown --recursive clickhouse:clickhouse '/var/run/clickhouse-server'
 chown clickhouse:clickhouse '/var/lib/clickhouse/'
 groupadd -r clickhouse-bridge
 useradd -r --shell /bin/false --home-dir /nonexistent -g clickhouse-bridge clickhouse-bridge
 chown --recursive clickhouse-bridge:clickhouse-bridge '/usr/bin/clickhouse-odbc-bridge'
 chown --recursive clickhouse-bridge:clickhouse-bridge '/usr/bin/clickhouse-library-bridge'
Password for default user is empty string. See /etc/clickhouse-server/users.xml and /etc/clickhouse-server/users.d to change it.
Setting capabilities for clickhouse binary. This is optional.
ClickHouse has been successfully installed.
Start clickhouse-server with:
 sudo clickhouse start
Start clickhouse-client with:
 clickhouse-client
Synchronizing state of clickhouse-server.service with SysV service script with /usr/lib/systemd/systemd-sysv-install.
Executing: /usr/lib/systemd/systemd-sysv-install enable clickhouse-server
Created symlink /etc/systemd/system/multi-user.target.wants/clickhouse-server.service → /etc/systemd/system/clickhouse-server.service.
  验证    : clickhouse-client-21.7.3.14-2.noarch                     1/3
  验证    : clickhouse-common-static-21.7.3.14-2.x86_64              2/3
  验证    : clickhouse-server-21.7.3.14-2.noarch                     3/3

已安装:
  clickhouse-client-21.7.3.14-2.noarch  clickhouse-common-static-21.7.3.14-2.x86_64  clickhouse-server-21.7.3.14-2.noarch

完毕!

# Check the version
clickhouse-server --version
ClickHouse server version 21.7.3.14 (official build).
# Start the service
systemctl start clickhouse-server.service
# Restart the service
systemctl restart clickhouse-server.service
# Stop the service
systemctl stop clickhouse-server.service
# Check the status
systemctl status clickhouse-server.service
```
[root@ clickhouse-server]# systemctl status clickhouse-server.service
● clickhouse-server.service - ClickHouse Server (analytic DBMS for big data)
   Loaded: loaded (/etc/systemd/system/clickhouse-server.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2021-07-15 17:28:31 CST; 1min 11s ago
 Main PID: 508117 (clckhouse-watch)
    Tasks: 156 (limit: 49524)
   Memory: 127.5M
   CGroup: /system.slice/clickhouse-server.service
           ├─508117 clickhouse-watchdog --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
           └─508118 /usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid

7月 15 17:28:31 systemd[1]: Started ClickHouse Server (analytic DBMS for big data).
7月 15 17:28:31 clickhouse-server[508117]: Processing configuration file '/etc/clickhouse-server/config.xml'.
7月 15 17:28:31 clickhouse-server[508117]: Logging trace to /var/log/clickhouse-server/clickhouse-server.log
7月 15 17:28:31 clickhouse-server[508117]: Logging errors to /var/log/clickhouse-server/clickhouse-server.err.log
7月 15 17:28:32 clickhouse-server[508117]: Processing configuration file '/etc/clickhouse-server/config.xml'.
7月 15 17:28:32 clickhouse-server[508117]: Saved preprocessed configuration to '/var/lib/clickhouse/preprocessed_configs/config.xml'.
7月 15 17:28:32 clickhouse-server[508117]: Processing configuration file '/etc/clickhouse-server/users.xml'.
7月 15 17:28:32 clickhouse-server[508117]: Saved preprocessed configuration to '/var/lib/clickhouse/preprocessed_configs/users.xml'.
```
# Directory layout
```
(1) /etc/clickhouse-server: server configuration directory, including the global config.xml and the user config users.xml.
(2) /var/lib/clickhouse: default data directory (usually changed to a path mounted on a large-capacity disk).
(3) /var/log/clickhouse-server: default log directory (usually changed to a path mounted on a large-capacity disk).
Before going further, please notice the <path> element in config.xml. Path determines the location for data storage, so it should be located on a volume with large disk capacity; the default value is /var/lib/clickhouse/.
```
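If the service reports as active but clients cannot connect, a quick way to confirm that ClickHouse is actually listening is to probe its two default ports (8123 for HTTP, 9000 for the native TCP protocol). A minimal check, assuming a local install with default ports:

```
# The HTTP interface answers "Ok." on /ping when the server is healthy
curl -s http://127.0.0.1:8123/ping

# Confirm both the HTTP (8123) and native TCP (9000) listeners are up
ss -lntp | grep -E '8123|9000'
```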
# Configure remote access
```
chmod 777 config.xml users.xml

<!-- If IPv6 is disabled, use this setting -->
<listen_host>0.0.0.0</listen_host>
<!-- If IPv6 is not disabled, use this setting instead
<listen_host>::</listen_host>
-->
```
Password options in users.xml:
```
1. Plain text: <password>password</password>
2. sha256:     <password_sha256_hex>password</password_sha256_hex>
   Example of generating a password from the shell:
   PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha256sum | tr -d '-'
   The first line is the plain-text password, the second line is its sha256 hash.
3. sha1:       <password_double_sha1_hex>password</password_double_sha1_hex>
   Example of generating a password from the shell:
   PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha1sum | tr -d '-' | xxd -r -p | sha1sum | tr -d '-'
   The first line is the plain-text password, the second line is its double-sha1 hash.
```
# Generate a password
[root@ clickhouse-server]# PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha1sum | tr -d '-' | xxd -r -p | sha1sum | tr -d '-'
JUxG8EwNS
b1854b52ed2577d50a23591d0d8fda3420b4d2af

# Configure it
<password_double_sha1_hex>b1854b52ed2577d50a23591d0d8fda3420b4d2af</password_double_sha1_hex>

# Log in with the client (the HTTP port is 8123, the TCP port is 9000)
clickhouse-client -h 127.0.0.1 --port 9000 -m -u default --password JUxG8EwNS

# Import sample data (the ontime flight dataset)
curl -O https://datasets.clickhouse.tech/ontime/partitions/ontime.tar
# or download with 10 threads and a progress bar
axel -n 10 -a https://datasets.clickhouse.tech/ontime/partitions/ontime.tar
tar xvf ontime.tar -C /var/lib/clickhouse

# Check directory sizes
du -h --max-depth=1

# Restart the service
sudo systemctl restart clickhouse-server.service

# Count the rows (183,953,732 — over 180 million)
clickhouse-client -h 127.0.0.1 --port 9000 -m -u default --password JUxG8EwNS --query "select count(*) from datasets.ontime"

# Create a test table
CREATE TABLE emp_replacingmergetree
(
    emp_id     UInt16 COMMENT '员工id',
    name       String COMMENT '员工姓名',
    work_place String COMMENT '工作地点',
    age        UInt8 COMMENT '员工年龄',
    depart     String COMMENT '部门',
    salary     Decimal32(2) COMMENT '薪资'
) ENGINE = ReplacingMergeTree()
ORDER BY emp_id
PRIMARY KEY emp_id
PARTITION BY work_place;

-- Insert data
INSERT INTO emp_replacingmergetree VALUES (1,'tom','上海',25,'技术部',20000),(2,'jack','上海',26,'人事部',10000);
INSERT INTO emp_replacingmergetree VALUES (3,'bob','北京',33,'财务部',50000),(4,'tony','杭州',28,'销售事部',50000);

# Partitioned tables
Partition by time:
  toYYYYMM(EventDate): partition by month
  toMonday(EventDate): partition by week
  toDate(EventDate):   partition by day
Partition by a specific column:
  PARTITION BY column_name
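ReplacingMergeTree only deduplicates rows with the same sorting key during background merges, so it is worth verifying the behaviour explicitly. A small sketch that reuses the emp_replacingmergetree table and the credentials created above (adjust host and password to your environment):

```
# Insert a second version of emp_id = 1 into the same partition ('上海')
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password JUxG8EwNS --query \
  "INSERT INTO emp_replacingmergetree VALUES (1,'tom','上海',26,'技术部',30000)"

# Before a merge, both versions may still be visible
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password JUxG8EwNS --query \
  "SELECT * FROM emp_replacingmergetree WHERE emp_id = 1"

# Force a merge (or query with FINAL); only one row per emp_id should remain afterwards
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password JUxG8EwNS --query \
  "OPTIMIZE TABLE emp_replacingmergetree FINAL"
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password JUxG8EwNS --query \
  "SELECT * FROM emp_replacingmergetree FINAL WHERE emp_id = 1"
```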
Diagram
The MaterializeMySQL and MaterializedPostgreSQL engines
https://clickhouse.tech/docs/en/engines/database-engines/materialize-mysql/ (this feature is experimental)
https://clickhouse.tech/docs/en/engines/database-engines/materialized-postgresql/ (the official docs do not mark it as experimental; not tested here)
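As a rough illustration of the MaterializeMySQL route, the sketch below creates a mirrored database. It assumes a ClickHouse 21.x server, and a MySQL instance at mysql-host:3306 with row-based binlog and GTID mode enabled; the database, user, and password are placeholders. Note that the engine and setting names differ between versions (newer releases call it MaterializedMySQL), so treat this as an example, not a definitive recipe:

```
# MaterializeMySQL is experimental and must be enabled for the session
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password JUxG8EwNS --multiquery --query \
  "SET allow_experimental_database_materialize_mysql = 1;
   CREATE DATABASE mysql_mirror ENGINE = MaterializeMySQL('mysql-host:3306', 'source_db', 'repl_user', 'repl_password')"

# Tables from source_db are then replicated continuously and can be queried in place
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password JUxG8EwNS --query \
  "SHOW TABLES FROM mysql_mirror"
```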
Creating the table by hand (ontime flight test data)
CREATE TABLE `ontime` (
    `Year` UInt16, `Quarter` UInt8, `Month` UInt8, `DayofMonth` UInt8, `DayOfWeek` UInt8, `FlightDate` Date,
    `Reporting_Airline` String, `DOT_ID_Reporting_Airline` Int32, `IATA_CODE_Reporting_Airline` String, `Tail_Number` Int32, `Flight_Number_Reporting_Airline` String,
    `OriginAirportID` Int32, `OriginAirportSeqID` Int32, `OriginCityMarketID` Int32, `Origin` FixedString(5), `OriginCityName` String, `OriginState` FixedString(2), `OriginStateFips` String, `OriginStateName` String, `OriginWac` Int32,
    `DestAirportID` Int32, `DestAirportSeqID` Int32, `DestCityMarketID` Int32, `Dest` FixedString(5), `DestCityName` String, `DestState` FixedString(2), `DestStateFips` String, `DestStateName` String, `DestWac` Int32,
    `CRSDepTime` Int32, `DepTime` Int32, `DepDelay` Int32, `DepDelayMinutes` Int32, `DepDel15` Int32, `DepartureDelayGroups` String, `DepTimeBlk` String,
    `TaxiOut` Int32, `WheelsOff` Int32, `WheelsOn` Int32, `TaxiIn` Int32,
    `CRSArrTime` Int32, `ArrTime` Int32, `ArrDelay` Int32, `ArrDelayMinutes` Int32, `ArrDel15` Int32, `ArrivalDelayGroups` Int32, `ArrTimeBlk` String,
    `Cancelled` UInt8, `CancellationCode` FixedString(1), `Diverted` UInt8,
    `CRSElapsedTime` Int32, `ActualElapsedTime` Int32, `AirTime` Nullable(Int32), `Flights` Int32, `Distance` Int32, `DistanceGroup` UInt8,
    `CarrierDelay` Int32, `WeatherDelay` Int32, `NASDelay` Int32, `SecurityDelay` Int32, `LateAircraftDelay` Int32,
    `FirstDepTime` String, `TotalAddGTime` String, `LongestAddGTime` String, `DivAirportLandings` String, `DivReachedDest` String, `DivActualElapsedTime` String, `DivArrDelay` String, `DivDistance` String,
    `Div1Airport` String, `Div1AirportID` Int32, `Div1AirportSeqID` Int32, `Div1WheelsOn` String, `Div1TotalGTime` String, `Div1LongestGTime` String, `Div1WheelsOff` String, `Div1TailNum` String,
    `Div2Airport` String, `Div2AirportID` Int32, `Div2AirportSeqID` Int32, `Div2WheelsOn` String, `Div2TotalGTime` String, `Div2LongestGTime` String, `Div2WheelsOff` String, `Div2TailNum` String,
    `Div3Airport` String, `Div3AirportID` Int32, `Div3AirportSeqID` Int32, `Div3WheelsOn` String, `Div3TotalGTime` String, `Div3LongestGTime` String, `Div3WheelsOff` String, `Div3TailNum` String,
    `Div4Airport` String, `Div4AirportID` Int32, `Div4AirportSeqID` Int32, `Div4WheelsOn` String, `Div4TotalGTime` String, `Div4LongestGTime` String, `Div4WheelsOff` String, `Div4TailNum` String,
    `Div5Airport` String, `Div5AirportID` Int32, `Div5AirportSeqID` Int32, `Div5WheelsOn` String, `Div5TotalGTime` String, `Div5LongestGTime` String, `Div5WheelsOff` String, `Div5TailNum` String
) ENGINE = MergeTree()
PARTITION BY FlightDate
ORDER BY (Year, FlightDate)
SETTINGS index_granularity = 8192;
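Once the table exists and the pipeline starts writing into it, the partitioning scheme can be checked against system.parts. A sketch reusing the credentials from the install section above (adjust host and password as needed):

```
# List active parts of the ontime table with their partition and row counts
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password JUxG8EwNS --query \
  "SELECT partition, name, rows
   FROM system.parts
   WHERE database = 'default' AND table = 'ontime' AND active
   ORDER BY partition
   LIMIT 20"
```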
Note: if latency requirements are strict, CDC (Change Data Capture) may also be a good approach. MySQL needs binlog enabled and PostgreSQL needs Write-Ahead Logging (WAL) support enabled (see the configuration sketch below); Debezium or the Flink CDC connectors (which do not yet support ClickHouse) are also solid alternatives.
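A minimal sketch of the source-side settings, assuming default configuration file locations; paths and values vary by distribution and version, so treat them as placeholders to adapt. The StreamSets PostgreSQL CDC Client additionally requires a logical decoding plugin (such as wal2json) installed on the PostgreSQL server.

```
# MySQL: enable row-based binlog (needed by the MySQL Binary Log origin, Debezium, and Flink CDC)
cat >> /etc/my.cnf <<'EOF'
[mysqld]
server-id        = 1
log-bin          = mysql-bin
binlog_format    = ROW
binlog_row_image = FULL
EOF
systemctl restart mysqld

# PostgreSQL: enable logical decoding (needed by the PostgreSQL CDC Client origin and Debezium)
cat >> /var/lib/pgsql/data/postgresql.conf <<'EOF'
wal_level = logical
max_replication_slots = 4
max_wal_senders = 4
EOF
systemctl restart postgresql
```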
REFER:
https://docs.streamsets.com/portal/#datacollector/latest/help/index.html
http://streamsets.vip/
https://clickhouse.tech/docs/en/
https://streamsets.com/products/dataops-platform/data-collector/
https://accounts.streamsets.com/archives (full bundle, 6 GB+)
https://github.com/ververica/flink-cdc-connectors
https://github.com/Altinity/clickhouse-mysql-data-reader
https://github.com/debezium/debezium
https://juejin.cn/post/6860480798294966286
https://www.cnblogs.com/uestc2007/p/13704912.html
https://github.com/apache/nifi (Apache NiFi and StreamSets are both dataflow processing tools. StreamSets has the slicker UI, but the enterprise-grade StreamSets Control Hub is not open source; in terms of open-source, enterprise-level feature completeness, Apache NiFi is more comprehensive.)
https://github.com/ClickHouse/ClickHouse
https://clickhouse.com/docs/zh/interfaces/jdbc/(TCP JDBC)
Build a CDC Pipeline
https://www.youtube.com/watch?v=rf0fquEze9s&ab_channel=StreamSetsInc.
How to Create Apache Kafka Pipelines in Minutes
https://www.youtube.com/watch?v=SiZrkyEzpJc