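Flink Study Notes: Flink MySQL CDC
1.Introduction to Flink CDC
Flink CDC provides a set of connectors for capturing changed data (change data capture) from other data sources. The Flink MySQL CDC connector is built on Debezium.
Official documentation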
https://ververica.github.io/flink-cdc-connectors/release-2.3/content/about.html
Official GitHub
https://github.com/ververica/flink-cdc-connectors
For the version compatibility between Flink and Flink CDC, see:
https://nightlies.apache.org/flink/flink-cdc-docs-release-3.1/docs/connectors/flink-sources/overview/
For usage examples with various data sources, see:
A real-time data lakehouse based on AWS S3, EMR Flink, Presto, and Hudi: migrating CDH with EMR
2.Flink MySQL CDC
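How it works
Flink MySQL CDC official documentation: https://github.com/apache/flink-cdc/blob/master/docs/content/docs/connectors/flink-sources/mysql-cdc.md
For MySQL binlog, see: MySQL Study Notes: binlog
Since version 2.0, Flink MySQL CDC has improved substantially in performance:
1.Lock-free reading
2.Parallel synchronization (speeds up the synchronization of large tables)
3.Checkpointing during the snapshot phase (if the job fails during the snapshot phase, it can resume instead of starting over)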
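To illustrate why points 2 and 3 help with large tables, the sketch below splits a primary-key range into fixed-size chunks that independent readers could scan concurrently, and that could be tracked one by one for recovery. It is a conceptual illustration only, not Flink CDC's actual splitting code; the class `ChunkSplitSketch`, the method `splitRange`, and the chunk size are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkSplitSketch {

    /**
     * Splits the inclusive key range [minId, maxId] into half-open chunks
     * [start, end) of at most chunkSize keys each.
     * Hypothetical helper: assumes a numeric primary key and small values (no overflow handling).
     */
    public static List<long[]> splitRange(long minId, long maxId, long chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<long[]> chunks = new ArrayList<>();
        for (long start = minId; start <= maxId; start += chunkSize) {
            long end = Math.min(start + chunkSize, maxId + 1);
            chunks.add(new long[] {start, end});
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Each chunk could be handed to a separate reader task.
        for (long[] chunk : splitRange(1, 10, 4)) {
            System.out.println("[" + chunk[0] + ", " + chunk[1] + ")");
        }
    }
}
```

Because each chunk is an independent unit of work, a failed job only needs to redo the chunks that had not finished, which is the intuition behind checkpointing during the snapshot phase.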
Using the DataStream API
package com.bigdata.flink;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.cdc.connectors.mysql.source.MySqlSource;
import org.apache.flink.cdc.debezium.JsonDebeziumDeserializationSchema;

public class MysqlCdcExample {

    public static void main(String[] args) throws Exception {
        MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
                .hostname("localhost")
                .port(3306)
                .databaseList("test") // set captured database; to synchronize the whole database, set tableList to ".*"
                .tableList("test.user") // set captured table
                .username("root")
                .password("123456")
                .serverTimeZone("America/Danmarkshavn")
                .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
                .build();

        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // enable checkpointing every 3000 ms
        env.enableCheckpointing(3000);

        env
                .fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL Source")
                // use 1 parallel source task
                .setParallelism(1)
                .print().setParallelism(1); // use parallelism 1 for the sink

        env.execute("Print MySQL Snapshot + Binlog");
    }
}
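Insert a row into MySQL; the output is as follows. Note that "op" is "r" here, which marks a row read during the snapshot phase (the row already existed when the job read the table); a row inserted while the job is reading the binlog is emitted with "op":"c" instead.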
{
  "before": null,
  "after": {
    "id": "AQ==",
    "username": "test",
    "email": "test@test.com"
  },
  "source": {
    "version": "1.9.8.Final",
    "connector": "mysql",
    "name": "mysql_binlog_source",
    "ts_ms": 0,
    "snapshot": "false",
    "db": "test",
    "sequence": null,
    "table": "user",
    "server_id": 0,
    "gtid": null,
    "file": "",
    "pos": 0,
    "row": 0,
    "thread": null,
    "query": null
  },
  "op": "r",
  "ts_ms": 1723951109745,
  "transaction": null
}
Update this row; the output is as follows
{
  "before": {
    "id": "AQ==",
    "username": "test",
    "email": "test@test.com"
  },
  "after": {
    "id": "AQ==",
    "username": "test123",
    "email": "test@test.com"
  },
  "source": {
    "version": "1.9.8.Final",
    "connector": "mysql",
    "name": "mysql_binlog_source",
    "ts_ms": 1723951151000,
    "snapshot": "false",
    "db": "test",
    "sequence": null,
    "table": "user",
    "server_id": 1,
    "gtid": null,
    "file": "mysql-bin.000005",
    "pos": 1307,
    "row": 0,
    "thread": 75,
    "query": null
  },
  "op": "u",
  "ts_ms": 1723951151768,
  "transaction": null
}
Delete this row; the output is as follows
{
  "before": {
    "id": "AQ==",
    "username": "test123",
    "email": "test@test.com"
  },
  "after": null,
  "source": {
    "version": "1.9.8.Final",
    "connector": "mysql",
    "name": "mysql_binlog_source",
    "ts_ms": 1723953255000,
    "snapshot": "false",
    "db": "test",
    "sequence": null,
    "table": "user",
    "server_id": 1,
    "gtid": null,
    "file": "mysql-bin.000005",
    "pos": 1627,
    "row": 0,
    "thread": 99,
    "query": null
  },
  "op": "d",
  "ts_ms": 1723953255153,
  "transaction": null
}
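The "op" field in the three outputs above tells a downstream consumer what kind of change each record represents. A minimal sketch of mapping the Debezium operation codes to readable names follows; the class `DebeziumOpSketch` and the method `describeOp` are hypothetical helpers for illustration and are not part of Flink CDC.

```java
public class DebeziumOpSketch {

    /** Maps a Debezium "op" code to a human-readable description. */
    public static String describeOp(String op) {
        if (op == null) {
            return "unknown";
        }
        switch (op) {
            case "r":
                return "snapshot read"; // row read during the initial snapshot
            case "c":
                return "insert";        // row created, read from the binlog
            case "u":
                return "update";        // "before" and "after" are both present
            case "d":
                return "delete";        // "after" is null
            default:
                return "unknown";
        }
    }

    public static void main(String[] args) {
        for (String op : new String[] {"r", "c", "u", "d"}) {
            System.out.println(op + " -> " + describeOp(op));
        }
    }
}
```

In a real job the code would first be extracted from the JSON string with a JSON library of your choice before being passed to such a helper.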
Using Flink SQL
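As a rough sketch of what the Flink SQL variant could look like, the helper below assembles a `CREATE TABLE` statement for the `mysql-cdc` connector using the same connection settings as the DataStream example. The table name `user_source`, the column types, and the class `MysqlCdcDdlSketch` are assumptions made for illustration; adjust the schema to match the real MySQL table.

```java
public class MysqlCdcDdlSketch {

    /**
     * Builds a Flink SQL DDL string for a MySQL CDC source table.
     * The column list (id, username, email) and their types are assumed, not taken from a real schema.
     */
    public static String buildDdl(String host, int port, String database, String table,
                                  String username, String password, String serverTimeZone) {
        return String.join("\n",
                "CREATE TABLE user_source (",
                "  id INT,",
                "  username STRING,",
                "  email STRING,",
                "  PRIMARY KEY (id) NOT ENFORCED",
                ") WITH (",
                "  'connector' = 'mysql-cdc',",
                "  'hostname' = '" + host + "',",
                "  'port' = '" + port + "',",
                "  'username' = '" + username + "',",
                "  'password' = '" + password + "',",
                "  'database-name' = '" + database + "',",
                "  'table-name' = '" + table + "',",
                "  'server-time-zone' = '" + serverTimeZone + "'",
                ")");
    }

    public static void main(String[] args) {
        System.out.println(buildDdl("localhost", 3306, "test", "user",
                "root", "123456", "America/Danmarkshavn"));
    }
}
```

With the Flink table planner and the MySQL CDC connector on the classpath, the resulting string could be passed to `TableEnvironment.executeSql(...)` and then queried with a statement such as `SELECT * FROM user_source`.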
Common errors
1.Caused by: org.apache.flink.table.api.ValidationException: The MySQL server has a timezone offset (0 seconds ahead of UTC) which does not match the configured timezone Asia/Shanghai. Specify the right server-time-zone to avoid inconsistencies for time-related fields.
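This happens because no time zone was specified for the MySQL CDC job, so the configured time zone does not match the server's. Check the MySQL server's time zone with the following commands: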
mysql> show variables like '%time_zone%';
+------------------+--------+
| Variable_name    | Value  |
+------------------+--------+
| system_time_zone | UTC    |
| time_zone        | SYSTEM |
+------------------+--------+
2 rows in set (0.02 sec)

mysql> select now();
+---------------------+
| now()               |
+---------------------+
| 2024-08-17 15:29:48 |
+---------------------+
1 row in set (0.00 sec)
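Assuming the MySQL server's time zone is UTC, configure the connector with a time zone that has the same offset. The example below uses America/Danmarkshavn, a zone whose offset is UTC+0: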
MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
        .hostname("localhost")
        .port(55000)
        .databaseList("default") // set captured database; to synchronize the whole database, set tableList to ".*"
        .tableList("default.test") // set captured table
        .username("root")
        .password("123456")
        .serverTimeZone("America/Danmarkshavn")
        .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
        .build();
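Before starting the job, it can help to confirm that the zone ID passed to `serverTimeZone` really has the same UTC offset as the server. The sketch below does this with `java.time` only; the class `ServerTimeZoneCheck` and the method `offsetSeconds` are hypothetical names, and the instant is taken from the `select now()` output above on the assumption that the server runs in UTC.

```java
import java.time.Instant;
import java.time.ZoneId;

public class ServerTimeZoneCheck {

    /** Returns the UTC offset, in seconds, of the given zone at the given instant. */
    public static int offsetSeconds(String zoneId, Instant at) {
        return ZoneId.of(zoneId).getRules().getOffset(at).getTotalSeconds();
    }

    public static void main(String[] args) {
        // The server time from "select now()", interpreted as UTC (an assumption).
        Instant at = Instant.parse("2024-08-17T15:29:48Z");
        for (String zone : new String[] {"UTC", "America/Danmarkshavn", "Asia/Shanghai"}) {
            System.out.println(zone + " offset seconds: " + offsetSeconds(zone, at));
        }
    }
}
```

A zone whose offset differs from the server's "0 seconds ahead of UTC", such as Asia/Shanghai, is what triggers the ValidationException shown in error 1.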
2.Caused by: org.apache.flink.util.FlinkRuntimeException: Cannot read the binlog filename and position via 'SHOW MASTER STATUS'. Make sure your server is correctly configured
Check whether the MySQL binlog is enabled (ON); if it is OFF, this error is raised
mysql> show variables like 'log_bin';
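The same check can be run from Java before submitting the job. This is a minimal sketch using plain JDBC; the class `BinlogCheckSketch` and its methods are hypothetical, and the `Connection` is assumed to come from a MySQL JDBC driver that is not shown here.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class BinlogCheckSketch {

    /** Interprets the Value column returned by SHOW VARIABLES LIKE 'log_bin'. */
    public static boolean isEnabled(String logBinValue) {
        return "ON".equalsIgnoreCase(logBinValue);
    }

    /** Runs the query on an existing connection (assumed to point at the MySQL server). */
    public static boolean binlogEnabled(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SHOW VARIABLES LIKE 'log_bin'")) {
            // Column 1 is Variable_name, column 2 is Value.
            return rs.next() && isEnabled(rs.getString(2));
        }
    }
}
```

If `binlogEnabled` returns false, the binlog has to be turned on in the MySQL server configuration before the CDC job can read it.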
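MySQL CDC industry use cases
1.Lakehouse solution in practice based on Flink CDC + Hudi (37 Interactive Entertainment)
2.Flink CDC + Hudi: ingesting massive data into the data lake at SF Express (SF Express)
This article is published only on cnblogs and tonglin0325's blog. Author: tonglin0325. When reposting, please cite the original link: https://www.cnblogs.com/tonglin0325/p/5321328.html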