[Flink] Flink CDC FAQ

Flink CDC Connectors FAQ

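I have recently run into quite a few Flink CDC problems, so this post builds on the FAQ document listed as the first reference and summarizes those problems systematically.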

MySQL CDC

Q: The job fails with ConnectException: A slave with the same server_uuid/server_id as this slave has connected to the master. What should I do?

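  • This error means the server id used by the job conflicts with the server id used by another job or another synchronization tool. The server id must be globally unique and is an integer of type int.
  • In CDC 2.x, every parallel instance of the source needs its own server id, so plan the server ids deliberately. For example, if the job's source runs with a parallelism of four, configure 'server-id' = '5001-5004' so that the source tasks do not conflict with each other.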

Recommended reading

  • Flink version: flink-1.13.5; CDC version: 2.1.1
  • Key error log
org.apache.flink.runtime.JobException: Recovery is suppressed by FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=3, backoffTimeMS=10000)

Caused by: com.ververica.cdc.connectors.shaded.org.apache.kafka.connect.errors.ConnectException: An exception occurred in the change event producer. This connector will be stopped.

Caused by: io.debezium.DebeziumException: A slave with the same server_uuid/server_id as this slave has connected to the master; the first event '' at 4, the last event read from '/data/mysql/storage/logs/bin_log/bin.001086' at 426321679, the last byte read from '/data/mysql/storage/logs/bin_log/bin.001086' at 426321679. Error code: 1236; SQLSTATE: HY000.

Caused by: com.github.shyiko.mysql.binlog.network.ServerException: A slave with the same server_uuid/server_id as this slave has connected to the master; the first event '' at 4, the last event read from '/data/mysql/storage/logs/bin_log/bin.001086' at 426321679, the last byte read from '/data/mysql/storage/logs/bin_log/bin.001086' at 426321679
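  • Root cause analysis
Flink CDC implements real-time MySQL synchronization on top of Debezium, and Debezium reads the MySQL binlog by acting as a slave (replica) server.
By default, a random number between 5400 and 6400 is generated and used as the server-id of this Debezium client,
and this id must be unique within the MySQL cluster. This error means a duplicate server-id is in use,
so configure the "server-id" option explicitly, as either a single number or a range.

In addition, when scan.incremental.snapshot.enabled is true (the default), a range is recommended, because the source can read the snapshot in parallel during incremental snapshot reading,
and each of these parallel clients also needs a unique server-id. The parallelism of the incremental snapshot read follows the job parallelism (for example "parallelism.default"), and the server-id range must contain at least as many ids as that parallelism.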
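
The range rule can be checked before a job is submitted. The following is a minimal sketch in plain Java and is not part of Flink CDC: the class ServerIdRangeCheck and its methods are hypothetical names introduced here, and the check assumes that the range needs at least one id per parallel source reader.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight helper; not part of Flink CDC. */
final class ServerIdRangeCheck {

    /** Expands "5001" or "5001-5004" into the individual server ids. */
    static List<Integer> expand(String serverId) {
        String[] parts = serverId.trim().split("-");
        if (parts.length > 2) {
            throw new IllegalArgumentException("Invalid server-id: " + serverId);
        }
        int start = Integer.parseInt(parts[0].trim());
        int end = parts.length == 2 ? Integer.parseInt(parts[1].trim()) : start;
        if (end < start) {
            throw new IllegalArgumentException("Invalid server-id range: " + serverId);
        }
        List<Integer> ids = new ArrayList<>();
        for (int id = start; id <= end; id++) {
            ids.add(id);
        }
        return ids;
    }

    /** True when every parallel source reader can be given its own server id. */
    static boolean coversParallelism(String serverId, int parallelism) {
        return expand(serverId).size() >= parallelism;
    }

    public static void main(String[] args) {
        // Four ids for four readers: sufficient.
        System.out.println(coversParallelism("5001-5004", 4)); // prints "true"
        // A single id for four readers: the readers would collide.
        System.out.println(coversParallelism("5001", 4)); // prints "false"
    }
}
```

The helper only validates the size of the range. It cannot detect whether another job or synchronization tool already uses one of the ids, so the ranges still have to be planned across all jobs that read from the same MySQL cluster.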

For details, see:
https://ververica.github.io/flink-cdc-connectors/master/content/connectors/mysql-cdc.html#connector-options
https://nightlies.apache.org/flink/flink-cdc-docs-release-3.1/docs/faq/faq/
See the descriptions of server-id and scan.incremental.snapshot.enabled on the connector options page.
  • Key log
com.github.shyiko.mysql.binlog.network.ServerException: A slave with the same server_uuid/server_id as this slave has connected to the master.
  • Solution:

The server id is now generated randomly by default. If an earlier job explicitly sets server-id in the MySQL advanced parameters, consider removing it, because several jobs may be using the same data source with the same server-id, which causes the conflict.

Q: The job fails with The connector is trying to read binlog starting at GTIDs xxx and binlog file 'binlog.000064', pos=89887992, skipping 4 events plus 1 rows, but this is no longer available on the server. Reconfigure the connector to use a snapshot when needed. What should I do?

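This error occurs in the following cases:

  • Case 1: The binlog file the job is reading has already been purged on the MySQL server. This usually happens because the binlog expiration time on the MySQL server is too short; increase it, for example to 7 days. On MySQL 8.0 and later, expire_logs_days is deprecated in favor of binlog_expire_logs_seconds.
mysql> show variables like 'expire_logs_days';
mysql> set global expire_logs_days=7;
  • Case 2: The Flink CDC job consumes the binlog too slowly. Allocating enough resources to the job usually resolves this.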

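Recommended reading

  • Key log
Caused by: org.apache.kafka.connect.errors.ConnectException: The connector is trying to read binlog starting at GTIDs xxx and binlog file 'binlog.xxx', pos=xxx, skipping 4 events plus 1 rows, but this is no longer available on the server.
Reconfigure the connector to use a snapshot when needed.
  • Cause:

This error is raised when the binlog file the job is reading has already been purged on the MySQL server. Binlogs can be purged for several reasons: the binlog retention time may be too short, or the job may process data more slowly than the binlog is produced, so that it falls behind by more than the maximum binlog retention time. Once the MySQL server purges the file, the binlog position the job is reading becomes invalid.

  • Solution:

If the job cannot keep up with the rate at which the binlog is produced, increase the binlog retention time, or optimize the job to reduce backpressure so that the source consumes faster. If the job state shows nothing abnormal, some other operation on the database may have purged the binlog and made it inaccessible; use the information on the MySQL side to determine why the binlog was purged.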
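
The relationship between job lag and binlog retention can be written down as a rough rule of thumb. The sketch below is a simplification in plain Java: the class BinlogRetentionCheck and its methods are hypothetical names, and it treats retention as a sharp time boundary, whereas MySQL purges whole binlog files, so the real boundary is coarser.

```java
import java.time.Duration;

/** Hypothetical helper illustrating the lag-versus-retention reasoning. */
final class BinlogRetentionCheck {

    /** Converts a retention given in days into seconds. */
    static long retentionSeconds(int days) {
        return Duration.ofDays(days).getSeconds();
    }

    /**
     * A binlog position is at risk once the job lags behind the newest
     * binlog event by at least the retention window.
     */
    static boolean positionAtRisk(Duration jobLag, Duration retention) {
        return jobLag.compareTo(retention) >= 0;
    }

    public static void main(String[] args) {
        // Seven days of retention expressed in seconds.
        System.out.println(retentionSeconds(7)); // prints "604800"
        // A job that is 30 hours behind loses its position with 1 day of retention ...
        System.out.println(positionAtRisk(Duration.ofHours(30), Duration.ofDays(1))); // prints "true"
        // ... but not with 7 days of retention.
        System.out.println(positionAtRisk(Duration.ofHours(30), Duration.ofDays(7))); // prints "false"
    }
}
```

In practice the retention should be chosen with a comfortable margin above the longest expected lag, including the time the job may spend stopped before it is restored from a checkpoint or savepoint.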

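Q: The job fails with The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires. What should I do?

  • Flink CDC official FAQ:
    This happens because the snapshot (full) phase of the job reads too slowly: by the time the snapshot phase finishes, the GTID position recorded at its start has already been purged by MySQL. Either increase the binlog retention time on the MySQL server, or increase the source parallelism so that the snapshot phase finishes sooner.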

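Q: Does MySQL CDC support capturing changes from a replica? How should the replica be configured?

  • Yes. The replica must set log-slave-updates = 1 so that changes replicated from the primary are also written to the replica's own binlog. If the primary has GTID mode enabled, the replica must enable it as well.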
log-slave-updates = 1
gtid_mode = on
enforce_gtid_consistency = on

Q: The job fails with ConnectException: Received DML ‘…’ for processing, binlog probably contains events generated with statement or mixed based replication format. What should I do?

This error means the MySQL server is misconfigured. Check whether binlog_format is ROW, using the following command:

mysql> show variables like '%binlog_format%';
  • Problem description

A Flink CDC MySQL job (2.4.0 | mysql-cdc) fails to start, and the JobManager log reports: FlinkRuntimeException: Failed to discovery tables to capture / IllegalArgumentException: Can't find any matched tables, please check your configured database-name: [xxx_devicecenter] and table-name: [xxx_device]

org.apache.flink.util.FlinkException: Global failure triggered by OperatorCoordinator for 'Source: CDC From DimDevice' (operator bc764cd8ddf7a0cff126f51c16239658).
	at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder$LazyInitializedCoordinatorContext.failJob(OperatorCoordinatorHolder.java:556) ~[flink-dist-1.15.0-h0.cbu.dli.330.r34.jar:1.15.0-h0.cbu.dli.330.r34]
	at org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator$QuiesceableContext.failJob(RecreateOnResetOperatorCoordinator.java:236) ~[flink-dist-1.15.0-h0.cbu.dli.330.r34.jar:1.15.0-h0.cbu.dli.330.r34]
	at org.apache.flink.runtime.source.coordinator.SourceCoordinatorContext.failJob(SourceCoordinatorContext.java:321) ~[flink-dist-1.15.0-h0.cbu.dli.330.r34.jar:1.15.0-h0.cbu.dli.330.r34]
	at org.apache.flink.runtime.source.coordinator.SourceCoordinator.lambda$runInEventLoop$9(SourceCoordinator.java:429) ~[flink-dist-1.15.0-h0.cbu.dli.330.r34.jar:1.15.0-h0.cbu.dli.330.r34]
	at org.apache.flink.util.ThrowableCatchingRunnable.run(ThrowableCatchingRunnable.java:40) ~[flink-core-1.15.0-h0.cbu.dli.330.r34.jar:1.15.0-h0.cbu.dli.330.r34]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_412]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_412]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_412]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_412]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_412]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_412]
	at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_412]
Caused by: org.apache.flink.util.FlinkRuntimeException: Failed to discovery tables to capture // key log 1
	at com.ververica.cdc.connectors.mysql.source.assigners.MySqlSnapshotSplitAssigner.discoveryCaptureTables(MySqlSnapshotSplitAssigner.java:186) ~[flink-sql-connector-mysql-cdc-2.4.0-h0.cbu.mrs.330.r1-SNAPSHOT.jar:2.4.0-h0.cbu.mrs.330.r1-SNAPSHOT]
	at com.ververica.cdc.connectors.mysql.source.assigners.MySqlSnapshotSplitAssigner.open(MySqlSnapshotSplitAssigner.java:171) ~[flink-sql-connector-mysql-cdc-2.4.0-h0.cbu.mrs.330.r1-SNAPSHOT.jar:2.4.0-h0.cbu.mrs.330.r1-SNAPSHOT]
	at com.ververica.cdc.connectors.mysql.source.assigners.MySqlHybridSplitAssigner.open(MySqlHybridSplitAssigner.java:93) ~[flink-sql-connector-mysql-cdc-2.4.0-h0.cbu.mrs.330.r1-SNAPSHOT.jar:2.4.0-h0.cbu.mrs.330.r1-SNAPSHOT]
	at com.ververica.cdc.connectors.mysql.source.enumerator.MySqlSourceEnumerator.start(MySqlSourceEnumerator.java:92) ~[flink-sql-connector-mysql-cdc-2.4.0-h0.cbu.mrs.330.r1-SNAPSHOT.jar:2.4.0-h0.cbu.mrs.330.r1-SNAPSHOT]
	at org.apache.flink.runtime.source.coordinator.SourceCoordinator.lambda$start$1(SourceCoordinator.java:217) ~[flink-dist-1.15.0-h0.cbu.dli.330.r34.jar:1.15.0-h0.cbu.dli.330.r34]
	at org.apache.flink.runtime.source.coordinator.SourceCoordinator.lambda$runInEventLoop$9(SourceCoordinator.java:415) ~[flink-dist-1.15.0-h0.cbu.dli.330.r34.jar:1.15.0-h0.cbu.dli.330.r34]
	... 8 more
Caused by: java.lang.IllegalArgumentException: Can't find any matched tables, please check your configured database-name: [xxx_devicecenter] and table-name: [xxx_device] // key log 2
	at com.ververica.cdc.connectors.mysql.debezium.DebeziumUtils.discoverCapturedTables(DebeziumUtils.java:196) ~[flink-sql-connector-mysql-cdc-2.4.0-h0.cbu.mrs.330.r1-SNAPSHOT.jar:2.4.0-h0.cbu.mrs.330.r1-SNAPSHOT]
	at com.ververica.cdc.connectors.mysql.source.assigners.MySqlSnapshotSplitAssigner.discoveryCaptureTables(MySqlSnapshotSplitAssigner.java:182) ~[flink-sql-connector-mysql-cdc-2.4.0-h0.cbu.mrs.330.r1-SNAPSHOT.jar:2.4.0-h0.cbu.mrs.330.r1-SNAPSHOT]
	at com.ververica.cdc.connectors.mysql.source.assigners.MySqlSnapshotSplitAssigner.open(MySqlSnapshotSplitAssigner.java:171) ~[flink-sql-connector-mysql-cdc-2.4.0-h0.cbu.mrs.330.r1-SNAPSHOT.jar:2.4.0-h0.cbu.mrs.330.r1-SNAPSHOT]
	at com.ververica.cdc.connectors.mysql.source.assigners.MySqlHybridSplitAssigner.open(MySqlHybridSplitAssigner.java:93) ~[flink-sql-connector-mysql-cdc-2.4.0-h0.cbu.mrs.330.r1-SNAPSHOT.jar:2.4.0-h0.cbu.mrs.330.r1-SNAPSHOT]
	at com.ververica.cdc.connectors.mysql.source.enumerator.MySqlSourceEnumerator.start(MySqlSourceEnumerator.java:92) ~[flink-sql-connector-mysql-cdc-2.4.0-h0.cbu.mrs.330.r1-SNAPSHOT.jar:2.4.0-h0.cbu.mrs.330.r1-SNAPSHOT]
	at org.apache.flink.runtime.source.coordinator.SourceCoordinator.lambda$start$1(SourceCoordinator.java:217) ~[flink-dist-1.15.0-h0.cbu.dli.330.r34.jar:1.15.0-h0.cbu.dli.330.r34]
	at org.apache.flink.runtime.source.coordinator.SourceCoordinator.lambda$runInEventLoop$9(SourceCoordinator.java:415) ~[flink-dist-1.15.0-h0.cbu.dli.330.r34.jar:1.15.0-h0.cbu.dli.330.r34]
	... 8 more
  • Problem analysis
  • Source code analysis

https://github.com/apache/flink-cdc/blob/release-2.4.0/flink-connector-mysql-cdc/src/main/java/com/ververica/cdc/connectors/mysql/source/assigners/MySqlSnapshotSplitAssigner.java

package com.ververica.cdc.connectors.mysql.source.assigners;

...
import static com.ververica.cdc.connectors.mysql.debezium.DebeziumUtils.discoverCapturedTables;
...

class MySqlSnapshotSplitAssigner implements MySqlSplitAssigner {

    @Override
    public void open() {
        chunkSplitter.open();
        discoveryCaptureTables(); // entry point of the problem, problem line 1
        captureNewlyAddedTables();
        startAsynchronouslySplit();
    }

    private void discoveryCaptureTables() { // problem line 2 <-- problem line 1
        // discovery the tables lazily
        if (needToDiscoveryTables()) {
            long start = System.currentTimeMillis();
            LOG.debug("The remainingTables is empty, start to discovery tables");
            try (JdbcConnection jdbc = openJdbcConnection(sourceConfig)) {
                final List<TableId> discoverTables = discoverCapturedTables(jdbc, sourceConfig); // problem line 3 | calls com.ververica.cdc.connectors.mysql.debezium.DebeziumUtils.discoverCapturedTables
                this.remainingTables.addAll(discoverTables);
                this.isTableIdCaseSensitive = DebeziumUtils.isTableIdCaseSensitive(jdbc);
            } catch (Exception e) {
                throw new FlinkRuntimeException("Failed to discovery tables to capture", e); // key log 1
            }
            LOG.debug(
                    "Discovery tables success, time cost: {} ms.",
                    System.currentTimeMillis() - start);
        }
        // when restore the job from legacy savepoint, the legacy state may haven't snapshot
        // remaining tables, discovery remaining table here
        else if (!isRemainingTablesCheckpointed && !isSnapshotAssigningFinished(assignerStatus)) {
            try (JdbcConnection jdbc = openJdbcConnection(sourceConfig)) {
                final List<TableId> discoverTables = discoverCapturedTables(jdbc, sourceConfig);
                discoverTables.removeAll(alreadyProcessedTables);
                this.remainingTables.addAll(discoverTables);
                this.isTableIdCaseSensitive = DebeziumUtils.isTableIdCaseSensitive(jdbc);
            } catch (Exception e) {
                throw new FlinkRuntimeException(
                        "Failed to discover remaining tables to capture", e);
            }
        }
    }


}

https://github.com/apache/flink-cdc/blob/release-2.4.0/flink-connector-mysql-cdc/src/main/java/com/ververica/cdc/connectors/mysql/debezium/DebeziumUtils.java

package com.ververica.cdc.connectors.mysql.debezium;

...
import static com.ververica.cdc.connectors.mysql.source.utils.TableDiscoveryUtils.listTables;
...

public class DebeziumUtils {

    public static List<TableId> discoverCapturedTables( // problem line 4 <-- problem line 3
            JdbcConnection jdbc, MySqlSourceConfig sourceConfig) {

        final List<TableId> capturedTableIds;
        try {
            capturedTableIds = listTables(jdbc, sourceConfig.getTableFilters()); // problem line 5
        } catch (SQLException e) {
            throw new FlinkRuntimeException("Failed to discover captured tables", e); // similar to key log 1, but a different message
        }
        if (capturedTableIds.isEmpty()) { // problem line 6
            throw new IllegalArgumentException(
                    String.format(
                            "Can't find any matched tables, please check your configured database-name: %s and table-name: %s",
                            sourceConfig.getDatabaseList(), sourceConfig.getTableList())); // key log 2
        }
        return capturedTableIds;
    }

}

https://github.com/apache/flink-cdc/blob/release-2.4.0/flink-connector-mysql-cdc/src/main/java/com/ververica/cdc/connectors/mysql/source/utils/TableDiscoveryUtils.java

package com.ververica.cdc.connectors.mysql.source.utils;

public class TableDiscoveryUtils {
    ...
	
    public static List<TableId> listTables(JdbcConnection jdbc, RelationalTableFilters tableFilters) // problem line 7 <-- problem line 5
            throws SQLException {
        final List<TableId> capturedTableIds = new ArrayList<>();
        // -------------------
        // READ DATABASE NAMES
        // -------------------
        // Get the list of databases ...
        LOG.info("Read list of available databases");
        final List<String> databaseNames = new ArrayList<>();

        jdbc.query( // problem line 8 | check whether the tables of the given database are visible to the user | SQL: SHOW FULL TABLES IN xxx_devicecenter where Table_Type = 'BASE TABLE'; -- xxx_device
                "SHOW DATABASES",
                rs -> {
                    while (rs.next()) {
                        databaseNames.add(rs.getString(1));
                    }
                });
        LOG.info("\t list of available databases is: {}", databaseNames);

        // ----------------
        // READ TABLE NAMES
        // ----------------
        // Get the list of table IDs for each database. We can't use a prepared statement with
        // MySQL, so we have to build the SQL statement each time. Although in other cases this
        // might lead to SQL injection, in our case we are reading the database names from the
        // database and not taking them from the user ...
        LOG.info("Read list of available tables in each database");
        for (String dbName : databaseNames) {
            try {
                jdbc.query(
                        "SHOW FULL TABLES IN " + quote(dbName) + " where Table_Type = 'BASE TABLE'",
                        rs -> {
                            while (rs.next()) {
                                TableId tableId = new TableId(dbName, null, rs.getString(1));
                                if (tableFilters.dataCollectionFilter().isIncluded(tableId)) {
                                    capturedTableIds.add(tableId);
                                    LOG.info("\t including '{}' for further processing", tableId);
                                } else {
                                    LOG.info("\t '{}' is filtered out of capturing", tableId);
                                }
                            }
                        });
            } catch (SQLException e) {
                // We were unable to execute the query or process the results, so skip this ...
                LOG.warn(
                        "\t skipping database '{}' due to error reading tables: {}",
                        dbName,
                        e.getMessage());
            }
        }
        return capturedTableIds;
    }
	
	...
}
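  • Cause analysis (summary)
  1. Wrong database connection settings: check that the connection details in the Flink CDC configuration are correct, including host name, port, user name and password.
  2. Incompatible Flink CDC version: make sure the Flink CDC version you use is compatible with your MySQL version. If it is not, try upgrading or downgrading Flink CDC.
  3. Misspelled table or database name: check that the table and database names in the Flink CDC configuration are spelled correctly.
    • Case sensitivity of table and database names | SQL: show variables like '%case%';
  4. Permissions: make sure the Flink CDC process has sufficient privileges to access the specified database and tables. If it does not, run it with a user that has them.
    • Check the user's binlog-related privileges | SQL: SHOW GRANTS FOR 'xxx_cdc'@'%';
    • Check whether the user can list the expected table in the database | SQL: SHOW FULL TABLES IN xxx_devicecenter where Table_Type = 'BASE TABLE'; -- xxx_device
  5. Corrupted binlog file: if a binlog file is corrupted, Flink CDC may not be able to read the correct data. Try regenerating the binlog files or restoring them from a backup.
  6. Network problems: check that the network connection between the Flink CDC process and the MySQL database works. An unstable network can prevent Flink CDC from reading data.
  7. Other Flink CDC configuration problems: check the remaining settings in the Flink CDC configuration, such as filter conditions and transformation logic, and make sure they are correct.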
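  • Final conclusion for this incident
    Investigation showed that the environments had been mixed up: the configuration file of the Flink job in environment A specified the MySQL host of environment B.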
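
A mix-up like this one can be caught before the job starts by comparing the configured names against the tables that the configured host actually exposes. The sketch below is plain Java and is not the connector's implementation: TableMatchPreflight is a hypothetical name, the table ids would in practice come from SHOW FULL TABLES on the configured host, and the names xxx_log, other_db and some_table are made up for illustration. The sketch matches case-sensitively by choice; it makes no claim about how the connector's own filters treat case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical pre-flight check; not part of Flink CDC. */
final class TableMatchPreflight {

    /** Returns the "db.table" ids whose database and table parts match the given regular expressions. */
    static List<String> match(List<String> availableTableIds, String databaseRegex, String tableRegex) {
        Pattern db = Pattern.compile(databaseRegex);
        Pattern table = Pattern.compile(tableRegex);
        List<String> matched = new ArrayList<>();
        for (String id : availableTableIds) {
            int dot = id.indexOf('.');
            if (dot < 0) {
                continue; // not a "db.table" id, skip it
            }
            String dbName = id.substring(0, dot);
            String tableName = id.substring(dot + 1);
            if (db.matcher(dbName).matches() && table.matcher(tableName).matches()) {
                matched.add(id);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Tables visible on the intended host (environment A).
        List<String> envA = Arrays.asList("xxx_devicecenter.xxx_device", "xxx_devicecenter.xxx_log");
        // Tables visible on the host that was configured by mistake (environment B).
        List<String> envB = Arrays.asList("other_db.some_table");

        System.out.println(match(envA, "xxx_devicecenter", "xxx_device")); // prints "[xxx_devicecenter.xxx_device]"
        // An empty result corresponds to the "Can't find any matched tables" failure.
        System.out.println(match(envB, "xxx_devicecenter", "xxx_device")); // prints "[]"
    }
}
```

An empty result for a database and table that are known to exist is a strong hint to re-check the host, port and user of the connection before looking at regular expressions or privileges.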
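  • References

show variables like '%case%';
With lower_case_table_names=0 (the default on Linux), table names are strictly case-sensitive, and a query that gets the case wrong fails with a table-does-not-exist error.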

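Flink CDC reports Can't find any matched tables, please check your configured database-name: [demo] and table-name: [test]. Flink is configured with the root user to monitor the binlog, but it never finds the table, even though the table clearly exists in the database. Why is it not picked up?
This problem may be caused by any of the following:

  1. Wrong database connection settings
  2. Incompatible Flink CDC version
  3. Misspelled table or database name
  4. Permissions
  5. Corrupted binlog file
  6. Network problems
  7. Other Flink CDC configuration problems
    More answers to this question are available in the original post: https://developer.aliyun.com/ask/590881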

X References

  • Flink CDC official website
posted @ 2024-08-23 15:11  千千寰宇