Why Data in a Hudi Table Appears Duplicated
During testing, we found that even though the write operation was explicitly set to upsert, and primaryKey, preCombineField, type, etc. were configured according to the conventions, queries still returned duplicate rows. Repeated tests showed the duplicate count was stable at 2, and the commit time of one copy of each duplicated row never changed. The duplicates shared the same partition but came from different HDFS files.
Related Issue
The root cause is that dropping a Hudi table does not delete the table's files on HDFS. When a table is re-created at the same path, SELECT still reads the old files, which is why the duplicates appear. Note that repeatedly creating and dropping the table always left exactly two files; no new ones accumulated. The following transcript reproduces the behavior:
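The failure mode can be modeled with a short sketch in plain Python (the `Catalog` class is hypothetical and only mimics the metadata/storage split; none of this is Hudi or Flink API):

```python
import os
import shutil
import tempfile

class Catalog:
    """Minimal model of a SQL catalog: it stores only table *metadata*
    (name -> storage path), while the data files live on the filesystem."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, path):
        os.makedirs(path, exist_ok=True)   # existing files at the path survive
        self.tables[name] = path

    def drop_table(self, name):
        # DROP TABLE removes only the metadata entry; for a path-based
        # Hudi table the files on HDFS are left untouched.
        del self.tables[name]

    def select(self, name):
        # "Querying" just reads whatever data files exist at the path.
        return sorted(os.listdir(self.tables[name]))

root = tempfile.mkdtemp()
path = os.path.join(root, "t06")

cat = Catalog()
cat.create_table("t06", path)
# Simulate the first insert producing a data file.
open(os.path.join(path, "file-group-0.parquet"), "w").close()

cat.drop_table("t06")           # drop: files are NOT deleted
cat.create_table("t06", path)   # re-create at the same path

print(cat.select("t06"))        # -> ['file-group-0.parquet']  (old data is back)

# The cleanup has to be explicit: delete the path as well,
# the filesystem-level equivalent of `hdfs dfs -rm -r <path>`.
cat.drop_table("t06")
shutil.rmtree(path)
```

The practical workaround follows the same logic: after dropping a path-based Hudi table, remove its directory on HDFS (e.g. `hdfs dfs -rm -r hdfs:///user/root/hudi/t06`) before re-creating a table at that path.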
Flink SQL> CREATE TABLE `default_catalog`.`default_database`.`t06` (
> `ds` BIGINT,
> `ut` VARCHAR(2147483647),
> `pk` BIGINT NOT NULL,
> `f0` BIGINT,
> `f1` BIGINT,
> `f2` BIGINT,
> `f3` BIGINT,
> CONSTRAINT `PK_3610` PRIMARY KEY (`pk`) NOT ENFORCED
> ) PARTITIONED BY (`ds`)
> WITH (
> 'connector' = 'hudi',
> 'index.type' = 'BUCKET',
> 'payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
> 'hoodie.compaction.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
> 'hoodie.datasource.write.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
> 'path' = 'hdfs:///user/root/hudi/t06',
> 'precombine.field' = 'ut',
> 'table.type' = 'MERGE_ON_READ',
> 'write.operation' = 'upsert',
> 'hoodie.bucket.index.num.buckets' = '2'
> )
> ;
[INFO] Execute statement succeed.
Flink SQL> insert into t06 (ds,ut,pk,f0) select 20230101 ds,CAST(current_timestamp AS STRING) ut,1001 pk,1 f0;
[INFO] Submitting SQL update statement to the cluster...
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: e4e4c8e8ab8c904ad9238336cbf8fb8f
Flink SQL> select * from t06;
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| op | ds | ut | pk | f0 | f1 | f2 | f3 |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| +I | 20230101 | 2023-05-16 12:44:54.004 | 1001 | 1 | <NULL> | <NULL> | <NULL> |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
Received a total of 1 row
Flink SQL> drop table t06;
[INFO] Execute statement succeed.
Flink SQL> CREATE TABLE `default_catalog`.`default_database`.`t06` (
> `ds` BIGINT,
> `ut` VARCHAR(2147483647),
> `pk` BIGINT NOT NULL,
> `f0` BIGINT,
> `f1` BIGINT,
> `f2` BIGINT,
> `f3` BIGINT,
> CONSTRAINT `PK_3610` PRIMARY KEY (`pk`) NOT ENFORCED
> ) PARTITIONED BY (`ds`)
> WITH (
> 'connector' = 'hudi',
> 'index.type' = 'BUCKET',
> 'payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
> 'hoodie.compaction.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
> 'hoodie.datasource.write.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
> 'path' = 'hdfs:///user/root/hudi/t06',
> 'precombine.field' = 'ut',
> 'table.type' = 'MERGE_ON_READ',
> 'write.operation' = 'upsert',
> 'hoodie.bucket.index.num.buckets' = '2'
> );
[INFO] Execute statement succeed.
Flink SQL> select * from t06;
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| op | ds | ut | pk | f0 | f1 | f2 | f3 |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| +I | 20230101 | 2023-05-16 12:44:54.004 | 1001 | 1 | <NULL> | <NULL> | <NULL> |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
Received a total of 1 row
Flink SQL> insert into t06 (ds,ut,pk,f0) select 20230101 ds,CAST(current_timestamp AS STRING) ut,1001 pk,1 f0;
[INFO] Submitting SQL update statement to the cluster...
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: 28e29ce22999e479dd14fc1d17bbda1e
Flink SQL> select * from t06;
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| op | ds | ut | pk | f0 | f1 | f2 | f3 |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| +I | 20230101 | 2023-05-16 12:45:41.939 | 1001 | 1 | <NULL> | <NULL> | <NULL> |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
Received a total of 1 row
Flink SQL> drop table t06;
[INFO] Execute statement succeed.
Flink SQL> CREATE TABLE `default_catalog`.`default_database`.`t06` (
> `ds` BIGINT,
> `ut` VARCHAR(2147483647),
> `pk` BIGINT NOT NULL,
> `f0` BIGINT,
> `f1` BIGINT,
> `f2` BIGINT,
> `f3` BIGINT,
> CONSTRAINT `PK_3610` PRIMARY KEY (`pk`) NOT ENFORCED
> ) PARTITIONED BY (`ds`)
> WITH (
> 'connector' = 'hudi',
> 'index.type' = 'BUCKET',
> 'payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
> 'hoodie.compaction.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
> 'hoodie.datasource.write.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
> 'path' = 'hdfs:///user/root/hudi/t06',
> 'precombine.field' = 'ut',
> 'table.type' = 'MERGE_ON_READ',
> 'write.operation' = 'upsert',
> 'hoodie.bucket.index.num.buckets' = '2'
> );
[INFO] Execute statement succeed.
Flink SQL> select * from t06;
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| op | ds | ut | pk | f0 | f1 | f2 | f3 |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| +I | 20230101 | 2023-05-16 12:45:41.939 | 1001 | 1 | <NULL> | <NULL> | <NULL> |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
Received a total of 1 row
Flink SQL> insert into t06 (ds,ut,pk,f0) select 20230101 ds,CAST(current_timestamp AS STRING) ut,1001 pk,10 f0;
[INFO] Submitting SQL update statement to the cluster...
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: b9c7ef4dcdbcf743e609edb92216272e
Flink SQL> select * from t06;
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| op | ds | ut | pk | f0 | f1 | f2 | f3 |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| +I | 20230101 | 2023-05-16 12:46:33.156 | 1001 | 10 | <NULL> | <NULL> | <NULL> |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
Received a total of 1 row
Flink SQL>