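Using Flink CDC

Environment

Version

flink-1.16.0-bin-scala_2.12.gz

Copy the JARs

flink-sql-connector-mysql-cdc-2.3.0.jar: captures MySQL data changes.

flink-sql-connector-tidb-cdc-2.3.0.jar: captures TiDB data changes.

flink-connector-jdbc-1.16.0.jar: connects to MySQL and writes data to MySQL.

flink-sql-connector-kafka-1.16.2.jar: connects to Kafka to consume and produce messages.

Copy them to ${flink_home}/lib. The JDBC connector does not bundle a database driver, so a MySQL driver JAR that provides com.mysql.cj.jdbc.Driver is also needed there.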

MySQL environment configuration

User privileges

This user is used only for change capture and has no write privileges.

mysql> CREATE USER 'debezium'@'%' IDENTIFIED BY '1qazXSW@';
mysql> GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'debezium'@'%';
mysql> FLUSH PRIVILEGES;
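
To confirm that the account ended up with the intended privileges, the grants can be listed. This is an optional check rather than a required step, and it assumes the user was created as 'debezium'@'%' as shown above.

```sql
-- List the privileges held by the capture user.
-- The result should name only the read and replication privileges granted above.
SHOW GRANTS FOR 'debezium'@'%';
```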

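Binary log configuration

Check whether binary logging is enabled:

show variables like 'log_bin';

If the value is OFF, add the following to the [mysqld] section of the MySQL configuration file (my.ini or my.cnf), then restart MySQL:

server-id=223344
log_bin=mysql-bin
binlog_format=ROW
binlog_row_image=FULL
expire_logs_days=10
gtid-mode=ON
enforce-gtid-consistency=ON
binlog_rows_query_log_events=ON

Option | Description
server-id | Must be unique for every server and replication client in the MySQL cluster. When the MySQL connector is set up, Debezium assigns the connector a unique server ID.
log_bin | The base name of the binary log file sequence.
binlog_format | Must be set to ROW (or row).
binlog_row_image | Must be set to FULL (or full).
expire_logs_days | Number of days after which binlog files are deleted automatically. The default is 0, which means no automatic deletion.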
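
After MySQL has been restarted, the settings can be read back in one statement. This is a small verification sketch; the values to expect are simply the ones written to the configuration file above.

```sql
-- Run in the mysql client after the restart.
-- Each row should match the value set in my.cnf / my.ini,
-- for example log_bin = ON and binlog_format = ROW.
SHOW VARIABLES WHERE Variable_name IN
    ('log_bin', 'binlog_format', 'binlog_row_image', 'server_id', 'gtid_mode');
```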

Local environment test

Startup and configuration

Create the table in MySQL and add test data to it:

create table source_user
(
    id          int auto_increment
        primary key,
    name        varchar(10)                        null,
    dept_id     int                                null,
    salary      decimal(10, 4)                     null,
    create_time datetime default CURRENT_TIMESTAMP not null
);

Change to ${flink_home}/bin:

[root@localhost bin]# ./start-cluster.sh
Starting cluster.
Starting standalonesession daemon on host localhost.localdomain.
Starting taskexecutor daemon on host localhost.localdomain.
[root@localhost bin]# ./sql-client.sh
...
Command history file path: /root/.flink-sql-history

Flink SQL> show tables;
Empty set

Flink SQL> show databases;
+------------------+
|    database name |
+------------------+
| default_database |
+------------------+
1 row in set

Flink SQL> use default_database;
[INFO] Execute statement succeed.

Flink SQL> show tables;
Empty set

Flink SQL> 
CREATE TABLE source_user(
    id INT,
    name STRING,
    dept_id INT,
    PRIMARY KEY (id) NOT ENFORCED
  ) WITH (
    'connector' = 'mysql-cdc' ,
    'hostname' = 'localhost',
    'port' = '3306',
    'username' = 'debezium',
    'password' = '1qazXSW@',
    'database-name' = 'test',
    'table-name' = 'source_user'
 );
[INFO] Execute statement succeed.

Flink SQL> 
CREATE TABLE target_user_aftermap (
   id INT,
   name STRING,
   dept_id INT,
   PRIMARY KEY (id) NOT ENFORCED
 ) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://localhost:3306/test',
    'driver' = 'com.mysql.cj.jdbc.Driver',
    'username' = 'root',
    'password' = '111111',
    'table-name' = 'target_user_aftermap'
 );
[INFO] Execute statement succeed.

Flink SQL> 
CREATE TABLE dept_dic (
   local_id INT,
   center_id INT,
   PRIMARY KEY (local_id) NOT ENFORCED
 ) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://localhost:3306/test',
    'driver' = 'com.mysql.cj.jdbc.Driver',
    'username' = 'root',
    'password' = '111111',
    'table-name' = 'dept_dic'
 );
[INFO] Execute statement succeed.

Flink SQL> 
insert into target_user_aftermap
(id, name, dept_id)
select a.id,a.name,b.center_id from source_user as a ,dept_dic as b where a.dept_id=b.local_id;
[INFO] Submitting SQL update statement to the cluster...
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: a9ff88da532fd7fa7a6a32d4e35aee12
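CDC option | Description
connector | Connector name
hostname | Address of the database to monitor
port | Port of the database to monitor
username | User for the monitored database (check its privileges; use the newly created and granted user)
password | Password for the monitored database
database-name | Name of the database to monitor (regular expressions are supported)
table-name | Name of the table to monitor (regular expressions are supported)

For more options, see the Debezium website.

JDBC option | Description
connector | Connector name
url | JDBC URL of the database
driver | Driver class to use
username | Database user (check its privileges if the table is written to)
password | Database password
table-name | Name of the table to connect to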
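
The INSERT job above uses a regular join. In that form the JDBC table dept_dic is read as a bounded scan when the job starts, so later changes to the dictionary are probably not picked up. One alternative is a lookup join, sketched below. The table name source_user_lookup and the column proc_time are hypothetical and introduced only for this illustration; everything else reuses the definitions above.

```sql
-- Hypothetical variant of source_user with a processing-time attribute,
-- which the lookup join syntax requires.
CREATE TABLE source_user_lookup (
    id INT,
    name STRING,
    dept_id INT,
    proc_time AS PROCTIME(),
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'localhost',
    'port' = '3306',
    'username' = 'debezium',
    'password' = '1qazXSW@',
    'database-name' = 'test',
    'table-name' = 'source_user'
);

-- Query dept_dic for each change record at the moment it is processed.
INSERT INTO target_user_aftermap (id, name, dept_id)
SELECT a.id, a.name, b.center_id
FROM source_user_lookup AS a
JOIN dept_dic FOR SYSTEM_TIME AS OF a.proc_time AS b
    ON a.dept_id = b.local_id;
```

A lookup join issues queries against MySQL while the job runs, so it trades extra database load for fresher dictionary values.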

Web UI

Job details:

If port 8081 on the server cannot be reached, edit ${flink_home}/conf/flink-conf.yaml and restart the cluster.

Before:
rest.bind-address: localhost
After:
rest.bind-address: 0.0.0.0

Data types

Data Type | Remarks
CHAR
VARCHAR
STRING
BOOLEAN
BYTES | BINARY and VARBINARY are not supported yet.
DECIMAL | Supports fixed precision and scale.
TINYINT
SMALLINT
INTEGER
BIGINT
FLOAT
DOUBLE
DATE
TIME | Supports only a precision of 0.
TIMESTAMP
TIMESTAMP_LTZ
INTERVAL | Supports only interval of MONTH and SECOND(3).
ARRAY
MULTISET
MAP
ROW
RAW
structured types | Only exposed in user-defined functions yet.

Time types

Use the TIMESTAMP type.

Flink SQL> CREATE TABLE target_user(
>     id INT,
>     name STRING,
>     dept_id INT,
>     salary DECIMAL(10,4),
>     create_time TIMESTAMP,
>     PRIMARY KEY (id) NOT ENFORCED
> ) WITH (
>     'connector' = 'mysql-cdc' ,
>     'hostname' = 'localhost',
>     'port' = '3306',
>     'username' = 'debezium',
>     'password' = '1qazXSW@',
>     'database-name' = 'test',
>     'table-name' = 'target_user'
>  );
[INFO] Execute statement succeed.

Flink SQL> SELECT * FROM target_user;
          id                           name     dept_id       salary                create_time
           2                             ch           1    1234.5600 2023-08-21 15:40:51.000000
           3                             hh           3    1234.5600 2023-08-21 15:40:51.000000

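Declaring the time column as STRING does not raise an error, but the query result is shown as a timestamp value. DECIMAL columns also need their precision and scale specified in full.

Alternatively, declare the time column as BIGINT and convert it with TO_TIMESTAMP_LTZ (no time zone had been set at this point).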

select TO_TIMESTAMP_LTZ(create_time,3) as var1 from target_user3;

                   var1
 -----------------------
 2023-08-21 23:40:51.000
 2023-08-21 23:40:51.000

Set the time zone and get the current time

Flink SQL> SET 'table.local-time-zone' = 'Asia/Shanghai';
[INFO] Session property has been set.

Flink SQL> select current_date;
2023-08-22

Flink SQL> select current_time;
11:32:07

Flink SQL> select current_timestamp;
2023-08-22 11:32:39.888

CDC options

Some of the information parsed from the database log can be exposed as table columns (metadata columns).

Example:

CREATE TABLE products (
    db_name STRING METADATA FROM 'database_name' VIRTUAL,
    table_name STRING METADATA  FROM 'table_name' VIRTUAL,
    operation_ts TIMESTAMP_LTZ(3) METADATA FROM 'op_ts' VIRTUAL,
    order_id INT,
    order_date TIMESTAMP(0),
    customer_name STRING,
    price DECIMAL(10, 5),
    product_id INT,
    order_status BOOLEAN,
    PRIMARY KEY(order_id) NOT ENFORCED
) WITH (
    'connector' = 'tidb-cdc',
    'tikv.grpc.timeout_in_ms' = '20000',
    'pd-addresses' = 'localhost:2379',
    'database-name' = 'mydb',
    'table-name' = 'orders'
);
CDC option | Required | Default | Type | Description
connector | Y | (none) | String | Specify what connector to use, here should be 'mysql-cdc'.
hostname | Y | (none) | String | IP address or hostname of the MySQL database server.
username | Y | (none) | String | Name of the MySQL user to use when connecting to the MySQL database server.
password | Y | (none) | String | Password to use when connecting to the MySQL database server.
database-name | Y | (none) | String | Database name of the MySQL server to monitor. The database-name also supports regular expressions to monitor multiple databases that match the regular expression.
table-name | Y | (none) | String | Table name of the MySQL database to monitor. The table-name also supports regular expressions to monitor multiple tables that match the regular expression. Note: when matching by regular expression, the MySQL CDC connector concatenates the user-supplied database-name and table-name with the string \\. to form a full-path regular expression, and then matches it against the fully qualified table names in the MySQL database.
port | N | 3306 | Integer | Integer port number of the MySQL database server.
server-id | N | (none) | String | A numeric ID or a numeric ID range of this database client. The numeric ID syntax is like '5400' and the numeric ID range syntax is like '5400-5408'. The range syntax is recommended when 'scan.incremental.snapshot.enabled' is enabled. Every ID must be unique across all currently-running database processes in the MySQL cluster. This connector joins the MySQL cluster as another server (with this unique ID) so it can read the binlog. By default, a random number is generated between 5400 and 6400, though we recommend setting an explicit value.
scan.incremental.snapshot.enabled | N | true | Boolean | Incremental snapshot is a new mechanism to read the snapshot of a table. Compared to the old snapshot mechanism, it has several advantages: (1) the source can read the snapshot in parallel, (2) the source can perform checkpoints at chunk granularity during snapshot reading, (3) the source does not need to acquire a global read lock (FLUSH TABLES WITH READ LOCK) before snapshot reading. If the source should run in parallel, each parallel reader needs a unique server ID, so 'server-id' must be a range like '5400-6400', and the range must be larger than the parallelism. See the Incremental Snapshot Reading section of the connector documentation for more detailed information.
scan.incremental.snapshot.chunk.size | N | 8096 | Integer | The chunk size (number of rows) of a table snapshot; captured tables are split into multiple chunks when the snapshot is read.
scan.snapshot.fetch.size | N | 1024 | Integer | The maximum fetch size per poll when reading a table snapshot.
scan.startup.mode | N | initial | String | Optional startup mode for the MySQL CDC consumer; valid enumerations are "initial", "earliest-offset", "latest-offset", "specific-offset" and "timestamp". See the Startup Reading Position section of the connector documentation for more detailed information. Note: when the whole table is read as a snapshot and the table is very large, raise the session connection time limit first.
scan.startup.specific-offset.file | N | (none) | String | Optional binlog file name used in case of "specific-offset" startup mode.
scan.startup.specific-offset.pos | N | (none) | Long | Optional binlog file position used in case of "specific-offset" startup mode.
scan.startup.specific-offset.gtid-set | N | (none) | String | Optional GTID set used in case of "specific-offset" startup mode.
scan.startup.specific-offset.skip-events | N | (none) | Long | Optional number of events to skip after the specific starting offset.
scan.startup.specific-offset.skip-rows | N | (none) | Long | Optional number of rows to skip after the specific starting offset.
server-time-zone | N | (none) | String | The session time zone of the database server, e.g. "Asia/Shanghai". It controls how the TIMESTAMP type in MySQL is converted to STRING. If not set, ZoneId.systemDefault() is used to determine the server time zone.
debezium.min.row.count.to.stream.result | N | 1000 | Integer | During a snapshot operation, the connector queries each included table to produce a read event for all rows in that table. This parameter determines whether the MySQL connection pulls all results for a table into memory (fast, but requires large amounts of memory) or streams the results instead (can be slower, but works for very large tables). The value specifies the minimum number of rows a table must contain before the connector streams results, and defaults to 1,000. Set this parameter to '0' to skip all table size checks and always stream all results during a snapshot.
connect.timeout | N | 30s | Duration | The maximum time that the connector should wait after trying to connect to the MySQL database server before timing out.
connect.max-retries | N | 3 | Integer | The maximum number of times the connector retries building a connection to the MySQL database server.
connection.pool.size | N | 20 | Integer | The connection pool size.
jdbc.properties.* | N | (none) | String | Option to pass custom JDBC URL properties. For example: 'jdbc.properties.useSSL' = 'false'.
heartbeat.interval | N | 30s | Duration | The interval for sending heartbeat events to track the latest available binlog offsets.
debezium.* | N | (none) | String | Passes Debezium properties through to the Debezium Embedded Engine that captures data changes from the MySQL server. For example: 'debezium.snapshot.mode' = 'never'. See the Debezium MySQL connector documentation for the available properties.
scan.incremental.close-idle-reader.enabled | N | false | Boolean | Whether to close idle readers at the end of the snapshot phase. The Flink version is required to be greater than or equal to 1.14 when 'execution.checkpointing.checkpoints-after-tasks-finish.enabled' is set to true.
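
As an illustration of how a few of these optional settings fit into a table definition, the sketch below extends the source_user example. The table name source_user_opts, the ID range, and the time zone are assumptions chosen for the example, not values taken from a tested setup.

```sql
-- Hypothetical mysql-cdc source that sets some of the optional options listed above.
CREATE TABLE source_user_opts (
    id INT,
    name STRING,
    dept_id INT,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'localhost',
    'port' = '3306',
    'username' = 'debezium',
    'password' = '1qazXSW@',
    'database-name' = 'test',
    'table-name' = 'source_user',
    -- One ID per parallel reader; the range must be larger than the source parallelism.
    'server-id' = '5400-5408',
    -- Assumed session time zone of the MySQL server; adjust to the real one.
    'server-time-zone' = 'Asia/Shanghai',
    -- Skip the initial snapshot and read only new binlog events.
    'scan.startup.mode' = 'latest-offset'
);
```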

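Execution mode

Reference: Execution Mode (Batch/Streaming) | Apache Flink

StreamingMode: for unbounded jobs that need continuous incremental processing and are expected to stay online indefinitely.

BatchMode: for bounded jobs that have a known, fixed input and do not run continuously.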

public EnvironmentSettings.Builder inBatchMode() {
    this.configuration.set(ExecutionOptions.RUNTIME_MODE, RuntimeExecutionMode.BATCH);
    return this;
}
public EnvironmentSettings.Builder inStreamingMode() {
    this.configuration.set(ExecutionOptions.RUNTIME_MODE, RuntimeExecutionMode.STREAMING);
    return this;
}

Set the execution mode when submitting the job to the cluster (recommended):

bin/flink run -Dexecution.runtime-mode=BATCH <jarFile>
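
For statements run from the SQL client rather than from a packaged JAR, the same option can be set per session. A minimal sketch; note that a CDC source is unbounded, so jobs reading from mysql-cdc or tidb-cdc tables are expected to need streaming mode.

```sql
-- Switch the current SQL client session to batch execution (bounded inputs only).
SET 'execution.runtime-mode' = 'batch';

-- Switch back to streaming, which unbounded CDC sources require.
SET 'execution.runtime-mode' = 'streaming';
```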

Errors

Insufficient server resources

[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not acquire the minimum required resources.

Syntax error

[ERROR] Could not execute SQL statement. Reason:
org.apache.calcite.runtime.CalciteException: Non-query expression encountered in illegal context

Wrong argument type in an expression

[ERROR] Could not execute SQL statement. Reason:
org.apache.calcite.sql.validate.SqlValidatorException: Cannot apply 'TO_TIMESTAMP_LTZ' to arguments of type 'TO_TIMESTAMP_LTZ(<TIME(0)>, <INTEGER>)'. Supported form(s): 'TO_TIMESTAMP_LTZ(<NUMERIC>, <INTEGER>)'

Query and sink column types do not match

[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.table.api.ValidationException: Column types of query result and sink for 'default_catalog.default_database.target_user11' do not match.
Cause: Incompatible types for sink column 'create_time' at position 4.

Query schema: [id: INT NOT NULL, name: STRING, dept_id: INT, salary: DECIMAL(10, 4), create_time: TIMESTAMP_LTZ(3)]
Sink schema:  [id: INT, name: STRING, dept_id: INT, salary: DECIMAL(10, 4), create_time: STRING]

Unsupported type

[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.table.api.ValidationException: The MySQL dialect doesn't support type: TIMESTAMP_LTZ(6).
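
One way to avoid both the type mismatch and the unsupported TIMESTAMP_LTZ error is to declare the sink column as TIMESTAMP and cast explicitly in the query. The sketch below is an assumption-laden illustration: the sink table target_user_ts is hypothetical, and target_user3 is assumed to have the columns id INT and create_time BIGINT (epoch milliseconds) as in the time type example.

```sql
-- Hypothetical JDBC sink whose time column is TIMESTAMP rather than TIMESTAMP_LTZ.
CREATE TABLE target_user_ts (
    id INT,
    create_time TIMESTAMP(3),
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:mysql://localhost:3306/test',
    'driver' = 'com.mysql.cj.jdbc.Driver',
    'username' = 'root',
    'password' = '111111',
    'table-name' = 'target_user_ts'
);

-- Convert epoch milliseconds to TIMESTAMP_LTZ, then cast to the sink's TIMESTAMP type.
-- The cast uses the session time zone ('table.local-time-zone').
INSERT INTO target_user_ts
SELECT id, CAST(TO_TIMESTAMP_LTZ(create_time, 3) AS TIMESTAMP(3))
FROM target_user3;
```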

Character case issues

TiDB

TiDB terminology

Reference: Glossary | PingCAP Docs

Reference: Understand TiDB in 15 minutes - Zhihu (zhihu.com)

CREATE TABLE products (
    db_name STRING METADATA FROM 'database_name' VIRTUAL,
    table_name STRING METADATA  FROM 'table_name' VIRTUAL,
    operation_ts TIMESTAMP_LTZ(3) METADATA FROM 'op_ts' VIRTUAL,
    order_id INT,
    order_date TIMESTAMP(0),
    customer_name STRING,
    price DECIMAL(10, 5),
    product_id INT,
    order_status BOOLEAN,
    PRIMARY KEY(order_id) NOT ENFORCED
) WITH (
    'connector' = 'tidb-cdc',
    'tikv.grpc.timeout_in_ms' = '20000',
    'pd-addresses' = 'localhost:2379',
    'database-name' = 'mydb',
    'table-name' = 'orders'
);
Option | Required | Default | Type | Description
connector | Y | (none) | String | Specify what connector to use, here should be 'tidb-cdc'.
database-name | Y | (none) | String | Database name of the TiDB server to monitor.
table-name | Y | (none) | String | Table name of the TiDB database to monitor.
scan.startup.mode | N | initial | String | Optional startup mode for the TiDB CDC consumer; valid enumerations are "initial" and "latest-offset". It controls whether existing table data is loaded: "initial" loads it, "latest-offset" does not.
pd-addresses | Y | (none) | String | PD address of the TiKV cluster.
tikv.grpc.timeout_in_ms | N | (none) | Long | TiKV gRPC timeout in ms.
tikv.grpc.scan_timeout_in_ms | N | (none) | Long | TiKV gRPC scan timeout in ms.
tikv.batch_get_concurrency | N | 20 | Integer | TiKV gRPC batch get concurrency.
tikv.* | N | (none) | String | Pass-through properties for the TiDB client.

TiDB test

[root@slave3 flink-1.16.0]# pwd
/home/flink/flink-1.16.0
[root@slave3 flink-1.16.0]# netstat -antl | grep 8081
[root@slave3 flink-1.16.0]#
[root@slave3 flink-1.16.0]# cd bin/
[root@slave3 bin]# ./start-cluster.sh
Starting cluster.
Starting standalonesession daemon on host slave3.
Starting taskexecutor daemon on host slave3.
[root@slave3 bin]# jps
933166 jar
2186802 jar
2477040 Jps
2476712 StandaloneSessionClusterEntrypoint
1840610 jar
1822628 jar
[root@slave3 bin]#

Register the CDC source table

create table ois_reg_info
(
    org_code                 STRING,                           
    branch_code              STRING,              
    opc_id                   STRING,                         
    card_type                STRING,                           
    card_type_code_org       STRING,                           
    card_type_name_org       STRING,                          
    card_data                STRING,                         
    reg_type                 STRING,                           
    reg_type_code_org        STRING,                           
    reg_type_name_org        STRING,                      
    reg_time                 TIMESTAMP,                             
    reg_source_type          STRING,                           
    reg_source_type_code_org STRING,                           
    reg_source_type_name_org STRING,                          
    reg_client_ip            STRING,                          
    reg_client_no            STRING,                          
    order_source             STRING,                          
    order_source_code_org    STRING,                           
    order_source_name_org    STRING,                          
    resv_sn                  STRING,                          
    msg_start                STRING,                          
    msg_end                  STRING,                          
    reg_input_empid          STRING,                          
    reg_input_empid_code_org STRING,                          
    reg_input_empid_name_org STRING,                          
    reg_dept                 STRING,                          
    reg_dept_code_org        STRING,                          
    reg_dept_name_org        STRING,                          
    regdoc_empid             STRING,                          
    regdoc_empid_code_org    STRING,                          
    regdoc_empid_name_org    STRING,                          
    clinic_class             STRING,                          
    clinic_class_code_org    STRING,                          
    clinic_class_name_org    STRING,                          
    invalid_flag             STRING,                          
    invalid_empid            STRING,                          
    invalid_empid_code_org   STRING,                          
    invalid_empid_name_org   STRING,                          
    invalid_time             TIMESTAMP,                             
    is_eme                   STRING,                         
    charge_no                STRING,                         
    invoice                  STRING,                         
    paper_invoice            STRING,                         
    sumfee                   DECIMAL(22, 2),                       
    reg_fee                  DECIMAL(22, 2),                      
    checkup_fee              DECIMAL(22, 2),                       
    experts_fee              DECIMAL(22, 2),                       
    casecard_fee             DECIMAL(22, 2),                       
    card_fee                 DECIMAL(22, 2),                       
    other_fee                DECIMAL(22, 2),                       
    ins_pson_type            STRING,               
    ins_pson_type_code_org   STRING,               
    ins_pson_type_name_org   STRING,               
    serve_way                STRING,               
    serve_way_code_org       STRING,               
    serve_way_name_org       STRING,               
    del_flag                 STRING,               
    modify_time_sys          TIMESTAMP,
    modify_empid             STRING,                          
    modify_empid_code_org    STRING,                          
    modify_empid_name_org    STRING,                          
    create_time_sys          TIMESTAMP,
    create_empid             STRING,                          
    create_empid_code_org    STRING,                          
    create_empid_name_org    STRING,                          
    modify_time_mfs          TIMESTAMP,                             
    create_time_mfs          TIMESTAMP,                             
    batch_version            STRING,                          
    batch_type               STRING,                          
    primary key (org_code, branch_code, opc_id, reg_time) NOT ENFORCED
) WITH (
    'connector' = 'tidb-cdc',
    'tikv.grpc.timeout_in_ms' = '30000',
    'pd-addresses' = 'x.x.x.x:2379',
    'database-name' = 'xxxx',
    'table-name' = 'ois_reg_info',
    'scan.startup.mode' = 'latest-offset'
);

Flink SQL> show tables ;
+--------------+
|   table name |
+--------------+
| ois_reg_info |
+--------------+
1 row in set

Flink SQL> select * from ois_reg_info;
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not acquire the minimum required resources.

Adjust the memory settings

#### Before ####
jobmanager.memory.process.size: 1600m

taskmanager.memory.process.size: 1728m
#### After ####
jobmanager.memory.process.size: 4096m

taskmanager.memory.process.size: 4096m
RESOURCE_PARAMS extraction logs:
jvm_params: -Xmx3462817376 -Xms3462817376 -XX:MaxMetaspaceSize=268435456
dynamic_configs: -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=429496736b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=3462817376b -D jobmanager.memory.jvm-overhead.max=429496736b
logs: INFO  [] - Loading configuration property: taskmanager.memory.process.size, 4096m
INFO  [] - Loading configuration property: jobmanager.bind-host, localhost
INFO  [] - Loading configuration property: taskmanager.bind-host, localhost
INFO  [] - Loading configuration property: taskmanager.host, localhost
INFO  [] - Loading configuration property: parallelism.default, 1
INFO  [] - Loading configuration property: jobmanager.execution.failover-strategy, region
INFO  [] - Loading configuration property: jobmanager.rpc.address, localhost
INFO  [] - Loading configuration property: taskmanager.numberOfTaskSlots, 1
INFO  [] - Loading configuration property: rest.address, localhost
INFO  [] - Loading configuration property: jobmanager.memory.process.size, 4096m
INFO  [] - Loading configuration property: jobmanager.rpc.port, 6123
INFO  [] - Loading configuration property: rest.bind-address, 0.0.0.0
INFO  [] - Final Master Memory configuration:
INFO  [] -   Total Process Memory: 4.000gb (4294967296 bytes)
INFO  [] -     Total Flink Memory: 3.350gb (3597035104 bytes)
INFO  [] -       JVM Heap:         3.225gb (3462817376 bytes)
INFO  [] -       Off-heap:         128.000mb (134217728 bytes)
INFO  [] -     JVM Metaspace:      256.000mb (268435456 bytes)
INFO  [] -     JVM Overhead:       409.600mb (429496736 bytes)

Check the log

2023-08-23 17:20:16,633 WARN  org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge [] - Could not acquire the minimum required resources, failing slot requests. Acquired: []. Current slot pool status: Registered TMs: 0, registered slots: 0 free slots: 0

The web UI shows 0 available task slots, 0 total task slots, and 0 task managers.

Change the number of task slots

taskmanager.numberOfTaskSlots: 50  # default is 1

Log inspection

Because no TaskManager had registered at all, check flink-root-taskexecutor-0-slave3.log for errors:

Error: VM option 'UseG1GC' is experimental and must be enabled via -XX:+UnlockExperimentalVMOptions.
Error: Could not create the Java Virtual Machine.
...

Remove the UseG1GC JVM option from taskmanager.sh, then restart.

Target table

create table flink_ois_reg_info
(
    org_code                 STRING,                           
    branch_code              STRING,              
    opc_id                   STRING,                         
    card_type                STRING,                           
    card_type_code_org       STRING,                           
    card_type_name_org       STRING,                          
    card_data                STRING,                         
    reg_type                 STRING,                           
    reg_type_code_org        STRING,                           
    reg_type_name_org        STRING,                      
    reg_time                 TIMESTAMP,                             
    reg_source_type          STRING,                           
    reg_source_type_code_org STRING,                           
    reg_source_type_name_org STRING,                          
    reg_client_ip            STRING,                          
    reg_client_no            STRING,                          
    order_source             STRING,                          
    order_source_code_org    STRING,                           
    order_source_name_org    STRING,                          
    resv_sn                  STRING,                          
    msg_start                STRING,                          
    msg_end                  STRING,                          
    reg_input_empid          STRING,                          
    reg_input_empid_code_org STRING,                          
    reg_input_empid_name_org STRING,                          
    reg_dept                 STRING,                          
    reg_dept_code_org        STRING,                          
    reg_dept_name_org        STRING,                          
    regdoc_empid             STRING,                          
    regdoc_empid_code_org    STRING,                          
    regdoc_empid_name_org    STRING,                          
    clinic_class             STRING,                          
    clinic_class_code_org    STRING,                          
    clinic_class_name_org    STRING,                          
    invalid_flag             STRING,                          
    invalid_empid            STRING,                          
    invalid_empid_code_org   STRING,                          
    invalid_empid_name_org   STRING,                          
    invalid_time             TIMESTAMP,                             
    is_eme                   STRING,                         
    charge_no                STRING,                         
    invoice                  STRING,                         
    paper_invoice            STRING,                         
    sumfee                   DECIMAL(22, 2),                       
    reg_fee                  DECIMAL(22, 2),                      
    checkup_fee              DECIMAL(22, 2),                       
    experts_fee              DECIMAL(22, 2),                       
    casecard_fee             DECIMAL(22, 2),                       
    card_fee                 DECIMAL(22, 2),                       
    other_fee                DECIMAL(22, 2),                       
    ins_pson_type            STRING,               
    ins_pson_type_code_org   STRING,               
    ins_pson_type_name_org   STRING,               
    serve_way                STRING,               
    serve_way_code_org       STRING,               
    serve_way_name_org       STRING,               
    del_flag                 STRING,               
    modify_time_sys          TIMESTAMP,
    modify_empid             STRING,                          
    modify_empid_code_org    STRING,                          
    modify_empid_name_org    STRING,                          
    create_time_sys          TIMESTAMP,
    create_empid             STRING,                          
    create_empid_code_org    STRING,                          
    create_empid_name_org    STRING,                          
    modify_time_mfs          TIMESTAMP,                             
    create_time_mfs          TIMESTAMP,                             
    batch_version            STRING,                          
    batch_type               STRING,                          
    primary key (org_code, branch_code, opc_id, reg_time) NOT ENFORCED
) WITH (
     'connector' = 'jdbc',
     'url' = 'jdbc:mysql://x.x.x.x:4000/test',
     'driver' = 'com.mysql.cj.jdbc.Driver',
     'username' = 'xxx',
     'password' = 'xxx',
     'table-name' = 'flink_ois_reg_info'
);

Synchronization

insert into  flink_ois_reg_info select * from ois_reg_info;
Flink SQL> show tables;
+--------------------+
|         table name |
+--------------------+
| flink_ois_reg_info |
|       ois_reg_info |
+--------------------+
2 rows in set

Flink SQL> insert into  flink_ois_reg_info select * from ois_reg_info;
[INFO] Submitting SQL update statement to the cluster...
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: 6b2dd0310e50d89ae5a6150ab18d6ff4

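When checked on the afternoon of 08-25, only one record had been received, and no further data arrived after that.

Integrating with Kafka

Capture database changes, map values through the dictionary table, and write the result to Kafka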

Flink SQL> CREATE TABLE source_user(
>     id INT,
>     name STRING,
>     dept_id INT,
>     PRIMARY KEY (id) NOT ENFORCED
>   ) WITH (
>     'connector' = 'mysql-cdc' ,
>     'hostname' = 'localhost',
>     'port' = '3306',
>     'username' = 'debezium',
>     'password' = '1qazXSW@',
>     'database-name' = 'test',
>     'table-name' = 'source_user'
>  );
[INFO] Execute statement succeed.

Flink SQL> select * from source_user;
[INFO] Result retrieval cancelled.

Flink SQL> CREATE TABLE target_user(
>     id INT,
>     name STRING,
>     dept_id INT
> ) WITH (
>   'connector' = 'kafka',
>   'topic' = 'test',
>   'properties.bootstrap.servers' = 'x.x.x.x:9092',
>   'properties.group.id' = 'testGroup',
>   'scan.startup.mode' = 'earliest-offset',
>   'format' = 'json'
> );
[INFO] Execute statement succeed.

Flink SQL> insert into target_user select * from source_user;
[INFO] Submitting SQL update statement to the cluster...
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.table.api.TableException: Table sink 'default_catalog.default_database.target_user' doesn't support consuming update and delete changes which is produced by node TableSourceScan(table=[[default_catalog, default_database, source_user]], fields=[id, name, dept_id])

Flink SQL> CREATE TABLE dept_dic (
>     local_id INT,
>     center_id INT,
>     PRIMARY KEY (local_id) NOT ENFORCED
>   ) WITH (
>      'connector' = 'jdbc',
>      'url' = 'jdbc:mysql://localhost:3306/test',
>      'driver' = 'com.mysql.cj.jdbc.Driver',
>      'username' = 'root',
>      'password' = '111111',
>      'table-name' = 'dept_dic'
>   );
[INFO] Execute statement succeed.

Flink SQL> insert into target_user
>  (id, name, dept_id)
> select a.id,a.name,b.center_id from source_user as a ,dept_dic as b where a.dept_id=b.local_id;
[INFO] Submitting SQL update statement to the cluster...
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.table.api.TableException: Table sink 'default_catalog.default_database.target_user' doesn't support consuming update and delete changes which is produced by node Join(joinType=[InnerJoin], where=[(dept_id = local_id)], select=[id, name, dept_id, local_id, center_id], leftInputSpec=[HasUniqueKey], rightInputSpec=[JoinKeyContainsUniqueKey])

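Change the Kafka connector configuration

Change 'format' = 'json' to 'value.format' = 'debezium-json'. The plain json format can only write inserts, whereas debezium-json can encode update and delete changes.

Reference: Kafka | Apache Flink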

Flink SQL> CREATE TABLE target_user1(
>     id INT,
>     name STRING,
>     dept_id INT
> ) WITH (
>   'connector' = 'kafka',
>   'topic' = 'test',
>   'properties.bootstrap.servers' = 'x.x.x.x:9092',
>   'properties.group.id' = 'testGroup',
>   'scan.startup.mode' = 'earliest-offset',
>   'value.format' = 'debezium-json'
> );
[INFO] Execute statement succeed.

Flink SQL> insert into target_user1
>  (id, name, dept_id)
> select a.id,a.name,b.center_id from source_user as a ,dept_dic as b where a.dept_id=b.local_id;
[INFO] Submitting SQL update statement to the cluster...
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: 076a9553a00fa4544b6d3b9db3953b39
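
Another way to write a changelog to Kafka is the upsert-kafka connector, which keys each message by the primary key and keeps the value as plain JSON. This is an alternative sketch rather than the approach used above; the table name target_user_upsert and the topic test_upsert are hypothetical.

```sql
-- Hypothetical upsert sink: updates are written as new values for the same key,
-- deletes are written as messages with a null value.
CREATE TABLE target_user_upsert (
    id INT,
    name STRING,
    dept_id INT,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'upsert-kafka',
    'topic' = 'test_upsert',
    'properties.bootstrap.servers' = 'x.x.x.x:9092',
    'key.format' = 'json',
    'value.format' = 'json'
);

INSERT INTO target_user_upsert (id, name, dept_id)
SELECT a.id, a.name, b.center_id
FROM source_user AS a, dept_dic AS b
WHERE a.dept_id = b.local_id;
```

Consumers that only need the latest row per key may find this easier to read than the before/after envelope of debezium-json.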

Consume the topic to verify the output

[root@slave3 kafka]# ./bin/kafka-console-consumer.sh --bootstrap-server x.x.x.x:9092 --topic test --from-beginning
{"before":null,"after":{"id":2,"name":"kk","dept_id":111},"op":"c"}
{"before":null,"after":{"id":3,"name":"hh","dept_id":333},"op":"c"}
{"before":null,"after":{"id":5,"name":"cc","dept_id":222},"op":"c"}
{"before":null,"after":{"id":7,"name":"xx","dept_id":111},"op":"c"}
{"before":null,"after":{"id":8,"name":"gg","dept_id":111},"op":"c"}
{"before":null,"after":{"id":9,"name":"yy","dept_id":222},"op":"c"}

Simulating a Flink integration with OGG

Generate simulated OGG log records

[root@slave3 kafka]# ./bin/kafka-topics.sh --bootstrap-server x.x.x.x:9092 --create --topic test1
Created topic test1.
[root@slave3 kafka]# ./bin/kafka-console-producer.sh --bootstrap-server x.x.x.x:9092  --topic test1
>{"table":"target_user2","op_type":"U","op_ts":"2023-08-30 14:43:26.214713","current_ts":"2023-08-30T14:43:32.126020","pos":"00000000050008402642","before":{"id":"23509681B1","name":"B"},"after":{"id":"23509681B1","name":"B"}}
>{"table":"target_user2","op_type":"U","op_ts":"2023-08-30 14:43:26.214713","current_ts":"2023-08-30T14:43:32.126020","pos":"00000000050008402642","before":{"id":"23509681B1","name":"B"},"after":{"id":"23509681B1","name":"B"}}
>

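Consuming the OGG records in Kafka with Flink

OGG records have a different structure from the change records that Flink CDC produces (Debezium JSON), so they cannot be parsed directly with the debezium-json format.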

Flink SQL> CREATE TABLE target_user2(
>     id STRING,
>     name STRING
> ) WITH (
>   'connector' = 'kafka',
>   'topic' = 'test1',
>   'properties.bootstrap.servers' = 'x.x.x.x:9092',
>   'properties.group.id' = 'testGroup',
>   'scan.startup.mode' = 'earliest-offset',
>   'value.format' = 'debezium-json'
> );
[INFO] Execute statement succeed.

Flink SQL> select * from target_user2;
[ERROR] Could not execute SQL statement. Reason:
java.io.IOException: Corrupt Debezium JSON message '{"table":"target_user2","op_type":"U","op_ts":"2023-08-30 14:43:26.214713","current_ts":"2023-08-30T14:43:32.126020","pos":"00000000050008402642","before":{"id":"23509681B1","name":"B"},"after":{"id":"23509681B1","name":"B"}}'.
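The rejected record carries op_type ("I", "U", "D") plus OGG metadata such as op_ts and pos, whereas the Debezium envelope expects an op field ("c", "u", "d"). A minimal sketch of the structural difference, assuming that one-to-one mapping of operation codes; the function name is hypothetical. Flink's documentation also lists an ogg-json format, which may be worth checking before hand-writing a conversion like this.

```python
import json

# Assumed mapping from OGG op_type codes to Debezium op codes.
OGG_TO_DEBEZIUM_OP = {"I": "c", "U": "u", "D": "d"}


def ogg_to_debezium(line):
    """Re-wrap one OGG JSON record in a Debezium-style envelope."""
    record = json.loads(line)
    return {
        "before": record.get("before"),
        "after": record.get("after"),
        "op": OGG_TO_DEBEZIUM_OP[record["op_type"]],
    }


# The record that triggered the "Corrupt Debezium JSON message" error.
ogg_line = (
    '{"table":"target_user2","op_type":"U",'
    '"op_ts":"2023-08-30 14:43:26.214713",'
    '"current_ts":"2023-08-30T14:43:32.126020",'
    '"pos":"00000000050008402642",'
    '"before":{"id":"23509681B1","name":"B"},'
    '"after":{"id":"23509681B1","name":"B"}}'
)
converted = ogg_to_debezium(ogg_line)
print(json.dumps(converted))
```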

Change the source table's format option

Flink SQL> CREATE TABLE target_user3(
>     id STRING,
>     name STRING
> ) WITH (
>   'connector' = 'kafka',
>   'topic' = 'test1',
>   'properties.bootstrap.servers' = 'x.x.x.x:9092',
>   'properties.group.id' = 'testGroup',
>   'scan.startup.mode' = 'earliest-offset',
>   'format' = 'json'
> );
[INFO] Execute statement succeed.

Mapping between JSON and Flink SQL data types

Flink SQL type                    JSON type
CHAR / VARCHAR / STRING           string
BOOLEAN                           boolean
BINARY / VARBINARY                string with encoding: base64
DECIMAL                           number
TINYINT                           number
SMALLINT                          number
INT                               number
BIGINT                            number
FLOAT                             number
DOUBLE                            number
DATE                              string with format: date
TIME                              string with format: time
TIMESTAMP                         string with format: date-time
TIMESTAMP_WITH_LOCAL_TIME_ZONE    string with format: date-time (with UTC time zone)
INTERVAL                          number
ARRAY                             array
MAP / MULTISET                    object
ROW                               object

Reserved keyword problem

Column names that are reserved keywords must be wrapped in backticks (``).

Sample data:

{"table":"target_user","op_type":"U","op_ts":"2023-08-30 14:43:26.214713","current_ts":"2023-08-30T14:43:32.126020","pos":"00000000050008402642","before":{"id":"23509681B1","name":"B"},"after":{"id":"23509681B1","name":"B"}}
{"table":"target_user","op_type":"U","op_ts":"2023-08-30 14:43:26.214713","current_ts":"2023-08-30T14:43:32.126020","pos":"00000000050008402642","before":{"id":"23509681B1","name":"B"},"after":{"id":"23509681B1","name":"B"}}
Flink SQL> CREATE TABLE target_user(
>      	table STRING,
>      	op_type STRING,
>    	before  ROW(ZYH STRING,MRCYSHBS STRING),
>    	after ROW(ZYH STRING,MRCYSHBS STRING)
> ) WITH (
>    'connector' = 'kafka',
>    'topic' = 'test1',
>    'properties.bootstrap.servers' = 'x.x.x.x:9092',
>    'properties.group.id' = 'Group0831',
>    'scan.startup.mode' = 'earliest-offset',
>    'format' = 'json'
> );
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.sql.parser.impl.ParseException: Encountered "table" at line 2, column 6.
Was expecting one of:
    "CONSTRAINT" ...
    "PRIMARY" ...
    "UNIQUE" ...
    "WATERMARK" ...
    <BRACKET_QUOTED_IDENTIFIER> ...
    <QUOTED_IDENTIFIER> ...
    <BACK_QUOTED_IDENTIFIER> ...
    <HYPHENATED_IDENTIFIER> ...
    <IDENTIFIER> ...
    <UNICODE_QUOTED_IDENTIFIER> ...


Flink SQL> CREATE TABLE target_user1(
>   `table` STRING,
>   op_type STRING,
>   before  ROW(),
>   after ROW()
> ) WITH (
>    'connector' = 'kafka',
>    'topic' = 'test1',
>    'properties.bootstrap.servers' = 'x.x.x.x:9092',
>    'properties.group.id' = 'Group0831',
>    'scan.startup.mode' = 'earliest-offset',
>    'format' = 'json'
> );
[INFO] Execute statement succeed.

Flink SQL> CREATE TABLE target_user1(
>   `table` STRING,
>   op_type STRING,
>   before  ROW(id STRING,name STRING),
>   after ROW(id STRING,name STRING)
> ) WITH (
>    'connector' = 'kafka',
>    'topic' = 'test1',
>    'properties.bootstrap.servers' = 'x.x.x.x:9092',
>    'properties.group.id' = 'Group0831',
>    'scan.startup.mode' = 'earliest-offset',
>    'format' = 'json'
> );
[INFO] Execute statement succeed.

Flink SQL> select * from target_user1;
[INFO] Result retrieval cancelled.

Flink SQL> select before.id,before.name from target_user1;
[INFO] Result retrieval cancelled.

Add create/update/delete (CUD) records

[root@slave3 kafka]# ./bin/kafka-console-producer.sh --bootstrap-server x.x.x.x:9092 --topic test0831
>{"table":"target_user","op_type":"U","op_ts":"2023-08-30 14:43:26.214713","current_ts":"2023-08-30T14:43:32.126020","pos":"00000000050008402642","before":{"id":"23509681B1","name":"B"},"after":{"id":"23509681B1","name":"B"}}
>{"table":"target_user2","op_type":"I","op_ts":"2023-08-30 14:43:26.214713","current_ts":"2023-08-30T14:43:32.126020","pos":"00000000050008402642","before":null,"after":{"id":"23509681C2","name":"C"}}
>{"table":"target_user2","op_type":"D","op_ts":"2023-08-30 14:43:26.214713","current_ts":"2023-08-30T14:43:32.126020","pos":"00000000050008402642","before":{"id":"23509681A1","name":"A"},"after":null}
>
Flink SQL> CREATE TABLE target_user(
>   `table` STRING,
>   op_type STRING,
>   before  ROW(id STRING,name STRING),
>   after ROW(id STRING,name STRING)
> ) WITH (
>    'connector' = 'kafka',
>    'topic' = 'test0831',
>    'properties.bootstrap.servers' = 'x.x.x.x:9092',
>    'properties.group.id' = 'Group0831',
>    'scan.startup.mode' = 'earliest-offset',
>    'format' = 'json'
> );
[INFO] Execute statement succeed.

Flink SQL> select * from target_user;
[INFO] Result retrieval cancelled.

Flink SQL>  select
>   a.`table`,
>   a.op_type,
>   case
>       when op_type = 'I'
>           THEN  a.after.id
>       when op_type = 'D'
>           THEN before.id
>       when op_type = 'U'
>           THEN a.after.id
>       ELSE '0' END  AS PK1
> from target_user as a ;
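The CASE expression picks the key from whichever row image exists for the operation: after for inserts and updates, before for deletes. A minimal Python sketch of the same rule, run against shortened copies of the three test records (the metadata fields are dropped here for brevity; the function name is hypothetical):

```python
import json


def primary_key(record):
    """Return the key from the row image that exists for this operation."""
    op_type = record["op_type"]
    if op_type in ("I", "U"):   # inserts and updates have an "after" image
        return record["after"]["id"]
    if op_type == "D":          # deletes only have a "before" image
        return record["before"]["id"]
    return "0"                  # same fallback as the ELSE branch


lines = [
    '{"table":"target_user","op_type":"U",'
    '"before":{"id":"23509681B1","name":"B"},'
    '"after":{"id":"23509681B1","name":"B"}}',
    '{"table":"target_user2","op_type":"I",'
    '"before":null,"after":{"id":"23509681C2","name":"C"}}',
    '{"table":"target_user2","op_type":"D",'
    '"before":{"id":"23509681A1","name":"A"},"after":null}',
]
keys = [primary_key(json.loads(line)) for line in lines]
print(keys)
```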

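Parameter tuning

Restart strategy

Not configured for now. If needed, a fixed-delay strategy can be set in flink-conf.yaml as follows: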

restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s

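Parallelism

parallelism.default: 4

Note:

taskmanager.numberOfTaskSlots: 12
# 5 * 4 = 20 > 12: more slots are requested than the 12 available, so the job cannot acquire its minimum required resources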
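The arithmetic in the comment can be checked before submitting. The sketch below assumes that the 5 stands for five jobs running at the same time, each at the default parallelism of 4, on a single TaskManager with 12 slots; it also assumes default slot sharing, where a job needs as many slots as its highest operator parallelism. The function names are hypothetical.

```python
def slots_required(job_count, parallelism):
    """Slots needed when every job runs at the same parallelism."""
    return job_count * parallelism


def fits(job_count, parallelism, task_managers, slots_per_task_manager):
    """True when the cluster offers at least as many slots as requested."""
    available = task_managers * slots_per_task_manager
    return slots_required(job_count, parallelism) <= available


required = slots_required(5, 4)
print(required)              # slots requested by 5 jobs at parallelism 4
print(fits(5, 4, 1, 12))     # does that fit into one TaskManager with 12 slots?
print(12 // 4)               # how many such jobs 12 slots can hold
```

Under these assumptions, either the parallelism has to come down or the slot count has to go up.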

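Heartbeat timeout

HEARTBEAT_INTERVAL is the interval between heartbeat checks (default 10 seconds), and HEARTBEAT_TIMEOUT is the timeout duration (default 50 seconds). The corresponding configuration keys are heartbeat.interval and heartbeat.timeout, both given in milliseconds.

Reference: [Source code analysis] Flink's heartbeat mechanism as seen from TimeoutException

Reference: Setting the heartbeat timeout: debugging the Flink 1.9.0 source code and increasing the timeout for debugging

Other parameters

Reference: Flink parameter configuration and common parameter tuning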

Flink jdbc-connector

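Query conditions are not pushed down to the database. By default the whole table is loaded, and the filtering is done in memory by Flink operators.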

streamEnv.executeSql("insert into target_user\n" +
        "select a.id,a.name,b.center_id,a.salary,a.create_time from source_user as a  left join dept_dic as b\n" +
        "on a.dept_id = b.local_id;");

Lookup join: suited to joins against large dimension tables and business tables

Use FOR SYSTEM_TIME AS OF a.proctime

optimize result:
 FlinkLogicalSink(table=[default_catalog.default_database.target_user], fields=[id, name, center_id, salary, create_time])
+- FlinkLogicalCalc(select=[id, name, center_id, salary, create_time])
   +- FlinkLogicalJoin(condition=[=($2, $5)], joinType=[left])
      :- FlinkLogicalTableSourceScan(table=[[default_catalog, default_database, source_user]], fields=[id, name, dept_id, salary, create_time])
      +- FlinkLogicalSnapshot(period=[$cor0.proctime])
         +- FlinkLogicalTableSourceScan(table=[[default_catalog, default_database, dept_dic]], fields=[local_id, center_id])
CREATE TABLE source_user( 
   id INT, 
   name STRING, 
   dept_id INT, 
   salary DECIMAL(10,4), 
   create_time TIMESTAMP, 
   proctime AS PROCTIME(), 
   PRIMARY KEY (id) NOT ENFORCED 
 ) WITH ( 
   'connector' = 'mysql-cdc' , 
   'hostname' = 'localhost', 
   'port' = '3306', 
   'username' = 'debezium', 
   'password' = '1qazXSW@', 
   'database-name' = 'test', 
   'table-name' = 'source_user' 
);

CREATE TABLE target_user( 
    id INT, 
    name STRING, 
    dept_id INT, 
    salary DECIMAL(10,4), 
    create_time TIMESTAMP, 
    PRIMARY KEY (id) NOT ENFORCED 
) WITH ( 
    'connector' = 'jdbc', 
    'url' = 'jdbc:mysql://localhost:3306/test', 
    'driver' = 'com.mysql.cj.jdbc.Driver', 
    'username' = 'root', 
    'password' = 'root', 
    'table-name' = 'target_user' 
);

CREATE TABLE dept_dic ( 
   local_id INT, 
   center_id INT, 
   PRIMARY KEY (local_id) NOT ENFORCED 
 ) WITH ( 
    'connector' = 'jdbc', 
    'url' = 'jdbc:mysql://localhost:3306/test', 
    'driver' = 'com.mysql.cj.jdbc.Driver', 
    'username' = 'root', 
    'password' = 'root', 
    'table-name' = 'dept_dic' 
 );
 
-- Submit the job
insert into target_user 
select a.id,a.name,b.center_id,a.salary,a.create_time from source_user as a   
left join dept_dic FOR SYSTEM_TIME AS OF a.proctime as b 
on a.dept_id = b.local_id;
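With FOR SYSTEM_TIME AS OF a.proctime, each source row is enriched by looking up the dimension table at the moment the row is processed, instead of loading dept_dic in full. The sketch below illustrates that semantics in plain Python, with a dict standing in for the dept_dic table; all values and the function name are hypothetical. Rows that were already emitted are not revised when the dimension table changes later.

```python
def lookup_join(source_rows, lookup):
    """Left lookup join: enrich each row with the value found right now."""
    results = []
    for row in source_rows:
        center_id = lookup(row["dept_id"])   # None when nothing matches
        results.append(
            {"id": row["id"], "name": row["name"], "center_id": center_id}
        )
    return results


dept_dic = {1: 111, 2: 222}                  # local_id -> center_id

first = lookup_join([{"id": 1, "name": "kk", "dept_id": 1}], dept_dic.get)

dept_dic[1] = 999                            # the dimension table changes later

second = lookup_join(
    [{"id": 2, "name": "hh", "dept_id": 1},
     {"id": 3, "name": "cc", "dept_id": 7}],
    dept_dic.get,
)
print(first)
print(second)
```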

Setting the job name in sql-client

Flink SQL> set pipeline.name = 'mysql2mysql';
[INFO] Session property has been set.

checkpoint

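Enable checkpoint persistence

# Modify the relevant parameters in the flink-conf.yaml configuration file.

state.backend: filesystem

# Set state.checkpoints.dir to the file system directory where checkpoint data is saved.
# file:// persists to local files; hdfs:// persists to HDFS; other file system schemes are also supported.

state.checkpoints.dir: file:///path/to/local/directory

Set the checkpoint interval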

Flink SQL> SET execution.checkpointing.interval = 3s;
[INFO] Session property has been set.

Checkpoint directory layout

[root@localhost flink-1.16.0]# cd ../checkpoints/
[root@localhost checkpoints]# ls
c1bc183d5c8386ded6fb302c5721676c
[root@localhost checkpoints]# cd c1bc183d5c8386ded6fb302c5721676c/
[root@localhost c1bc183d5c8386ded6fb302c5721676c]# ls
chk-195  shared  taskowned

Querying job status

# List running jobs and their IDs: ./bin/flink list -r
# Output format: start time : job ID : job name (status)

[root@localhost flink-1.16.0]# ./bin/flink list  -r c1bc183d5c8386ded6fb302c5721676c
Waiting for response...
------------------ Running/Restarting Jobs -------------------
05.09.2023 10:58:39 : c1bc183d5c8386ded6fb302c5721676c : mysql2mysql (RUNNING)
--------------------------------------------------------------
[root@localhost flink-1.16.0]# ./bin/flink list  -r c1bc183d5c8386ded6fb302c5721676c
Waiting for response...
No running jobs.

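Restarting from a checkpoint

Look up the checkpoint ID in the web UI or in the checkpoint directory.

# Before restarting, cancel the currently running job:
# (checkpoints survive cancellation only if externalized checkpoint retention is set to RETAIN_ON_CANCELLATION)

./bin/flink cancel <jobId>

# Then resubmit the job with flink run, restoring from the checkpoint path (for example <checkpoint-dir>/<jobId>/chk-195):
# Note: a restart of this kind can only be given a jar file
./bin/flink run -s <checkpoint_path> -d <jarFile>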
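Picking the checkpoint path by hand is error-prone because the chk-N directories must be compared numerically, not alphabetically. A minimal sketch that finds the newest chk-N directory under a job's checkpoint directory and assembles the flink run arguments; the helper names and job.jar are hypothetical, and the demo builds a throwaway directory that mimics the layout shown earlier. It does not check that the checkpoint is complete (a finished checkpoint normally contains a _metadata file).

```python
import re
import tempfile
from pathlib import Path

CHK_DIR = re.compile(r"^chk-(\d+)$")


def latest_checkpoint(job_checkpoint_dir):
    """Return the chk-N subdirectory with the highest N, or None."""
    best = None
    for entry in Path(job_checkpoint_dir).iterdir():
        match = CHK_DIR.match(entry.name)
        if entry.is_dir() and match:
            number = int(match.group(1))
            if best is None or number > best[0]:
                best = (number, entry)
    return None if best is None else best[1]


def restore_command(job_checkpoint_dir, jar_file):
    """Build the argument list for restoring from the latest checkpoint."""
    checkpoint = latest_checkpoint(job_checkpoint_dir)
    if checkpoint is None:
        raise FileNotFoundError(f"no chk-N directory in {job_checkpoint_dir}")
    return ["./bin/flink", "run", "-s", str(checkpoint), "-d", jar_file]


# Demo on a temporary directory with the same layout as a job's checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("chk-9", "chk-195", "shared", "taskowned"):
        (Path(tmp) / name).mkdir()
    demo_latest = latest_checkpoint(tmp).name
    demo_command = restore_command(tmp, "job.jar")
print(demo_latest)
```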
[root@localhost flink-1.16.0]# ./bin/flink run --help

Action "run" compiles and runs a program.

  Syntax: run [OPTIONS] <jar-file> <arguments>
  "run" action options:
     -c,--class <classname>                     Class with the program entry
                                                point ("main()" method). Only
                                                needed if the JAR file does not
                                                specify the class in its
                                                manifest.
     -C,--classpath <url>                       Adds a URL to each user code
                                                classloader  on all nodes in the
                                                cluster. The paths must specify
                                                a protocol (e.g. file://) and be
                                                accessible on all nodes (e.g. by
                                                means of a NFS share). You can
                                                use this option multiple times
                                                for specifying more than one
                                                URL. The protocol must be
                                                supported by the {@link
                                                java.net.URLClassLoader}.
     -d,--detached                              If present, runs the job in
                                                detached mode
     -n,--allowNonRestoredState                 Allow to skip savepoint state
                                                that cannot be restored. You
                                                need to allow this if you
                                                removed an operator from your
                                                program that was part of the
                                                program when the savepoint was
                                                triggered.
     -p,--parallelism <parallelism>             The parallelism with which to
                                                run the program. Optional flag
                                                to override the default value
                                                specified in the configuration.
     -py,--python <pythonFile>                  Python script with the program
                                                entry point. The dependent
                                                resources can be configured with
                                                the `--pyFiles` option.
     -pyarch,--pyArchives <arg>                 Add python archive files for
                                                job. The archive files will be
                                                extracted to the working
                                                directory of python UDF worker.
                                                For each archive file, a target
                                                directory be specified. If the
                                                target directory name is
                                                specified, the archive file will
                                                be extracted to a directory with
                                                the specified name. Otherwise,
                                                the archive file will be
                                                extracted to a directory with
                                                the same name of the archive
                                                file. The files uploaded via
                                                this option are accessible via
                                                relative path. '#' could be used
                                                as the separator of the archive
                                                file path and the target
                                                directory name. Comma (',')
                                                could be used as the separator
                                                to specify multiple archive
                                                files. This option can be used
                                                to upload the virtual
                                                environment, the data files used
                                                in Python UDF (e.g.,
                                                --pyArchives
                                                file:///tmp/py37.zip,file:///tmp
                                                /data.zip#data --pyExecutable
                                                py37.zip/py37/bin/python). The
                                                data files could be accessed in
                                                Python UDF, e.g.: f =
                                                open('data/data.txt', 'r').
     -pyclientexec,--pyClientExecutable <arg>   The path of the Python
                                                interpreter used to launch the
                                                Python process when submitting
                                                the Python jobs via "flink run"
                                                or compiling the Java/Scala jobs
                                                containing Python UDFs.
     -pyexec,--pyExecutable <arg>               Specify the path of the python
                                                interpreter used to execute the
                                                python UDF worker (e.g.:
                                                --pyExecutable
                                                /usr/local/bin/python3). The
                                                python UDF worker depends on
                                                Python 3.6+, Apache Beam
                                                (version == 2.38.0), Pip
                                                (version >= 20.3) and SetupTools
                                                (version >= 37.0.0). Please
                                                ensure that the specified
                                                environment meets the above
                                                requirements.
     -pyfs,--pyFiles <pythonFiles>              Attach custom files for job. The
                                                standard resource file suffixes
                                                such as .py/.egg/.zip/.whl or
                                                directory are all supported.
                                                These files will be added to the
                                                PYTHONPATH of both the local
                                                client and the remote python UDF
                                                worker. Files suffixed with .zip
                                                will be extracted and added to
                                                PYTHONPATH. Comma (',') could be
                                                used as the separator to specify
                                                multiple files (e.g., --pyFiles
                                                file:///tmp/myresource.zip,hdfs:
                                                ///$namenode_address/myresource2
                                                .zip).
     -pym,--pyModule <pythonModule>             Python module with the program
                                                entry point. This option must be
                                                used in conjunction with
                                                `--pyFiles`.
     -pyreq,--pyRequirements <arg>              Specify a requirements.txt file
                                                which defines the third-party
                                                dependencies. These dependencies
                                                will be installed and added to
                                                the PYTHONPATH of the python UDF
                                                worker. A directory which
                                                contains the installation
                                                packages of these dependencies
                                                could be specified optionally.
                                                Use '#' as the separator if the
                                                optional parameter exists (e.g.,
                                                --pyRequirements
                                                file:///tmp/requirements.txt#fil
                                                e:///tmp/cached_dir).
     -rm,--restoreMode <arg>                    Defines how should we restore
                                                from the given savepoint.
                                                Supported options: [claim -
                                                claim ownership of the savepoint
                                                and delete once it is subsumed,
                                                no_claim (default) - do not
                                                claim ownership, the first
                                                checkpoint will not reuse any
                                                files from the restored one,
                                                legacy - the old behaviour, do
                                                not assume ownership of the
                                                savepoint files, but can reuse
                                                some shared files.
     -s,--fromSavepoint <savepointPath>         Path to a savepoint to restore
                                                the job from (for example
                                                hdfs:///flink/savepoint-1537).
     -sae,--shutdownOnAttachedExit              If the job is submitted in
                                                attached mode, perform a
                                                best-effort cluster shutdown
                                                when the CLI is terminated
                                                abruptly, e.g., in response to a
                                                user interrupt, such as typing
                                                Ctrl + C.
  Options for Generic CLI mode:
     -D <property=value>   Allows specifying multiple generic configuration
                           options. The available options can be found at
                           https://nightlies.apache.org/flink/flink-docs-stable/
                           ops/config.html
     -e,--executor <arg>   DEPRECATED: Please use the -t option instead which is
                           also available with the "Application Mode".
                           The name of the executor to be used for executing the
                           given job, which is equivalent to the
                           "execution.target" config option. The currently
                           available executors are: "remote", "local",
                           "kubernetes-session", "yarn-per-job" (deprecated),
                           "yarn-session".
     -t,--target <arg>     The deployment target for the given application,
                           which is equivalent to the "execution.target" config
                           option. For the "run" action the currently available
                           targets are: "remote", "local", "kubernetes-session",
                           "yarn-per-job" (deprecated), "yarn-session". For the
                           "run-application" action the currently available
                           targets are: "kubernetes-application".

  Options for yarn-cluster mode:
     -m,--jobmanager <arg>            Set to yarn-cluster to use YARN execution
                                      mode.
     -yid,--yarnapplicationId <arg>   Attach to running YARN session
     -z,--zookeeperNamespace <arg>    Namespace to create the Zookeeper
                                      sub-paths for high availability mode

  Options for default mode:
     -D <property=value>             Allows specifying multiple generic
                                     configuration options. The available
                                     options can be found at
                                     https://nightlies.apache.org/flink/flink-do
                                     cs-stable/ops/config.html
     -m,--jobmanager <arg>           Address of the JobManager to which to
                                     connect. Use this flag to connect to a
                                     different JobManager than the one specified
                                     in the configuration. Attention: This
                                     option is respected only if the
                                     high-availability configuration is NONE.
     -z,--zookeeperNamespace <arg>   Namespace to create the Zookeeper sub-paths
                                     for high availability mode

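Submitting a job through the web UI (jar)

Upload the jar

Click the uploaded jar to configure it

If no checkpoint location is configured here, checkpoints are saved to the path set in the configuration file.

Errors

Execution environment problem

The LocalStreamEnvironment cannot be used when submitting a program through a client, or running in a TestEnvironment context.

Change how the execution environment is obtained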

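 env = StreamExecutionEnvironment.getExecutionEnvironment();
// Before the change: created an execution environment with a local web UI
//env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(configuration);

If submission fails, an error log pops up. Troubleshoot from that log by searching for your own main class name.

After a successful submission, the page redirects automatically.

Verification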

posted @ 2023-09-06 16:40  CHEN_zu_he