Why Hudi Table Data Appears Duplicated
During testing, we found that even though the write operation was explicitly set to upsert, and primaryKey, preCombineField, type, etc. were all configured per the documentation, queries still returned duplicate rows. Across repeated tests the duplication factor was stable at 2, and for any given record the commit time of one of its two copies never changed. The results showed that the duplicated rows shared the same partition but came from different HDFS files.
Related issue
The cause of this problem is that dropping a Hudi table does not delete the underlying table files. If a table is then created again at the same path, `select` can still read the old data, and this is why the results appear duplicated. Notably, repeated create/drop cycles always leave exactly two files, with no new ones added (likely matching the two buckets configured by `hoodie.bucket.index.num.buckets = 2`: the bucket index routes a given key to a fixed bucket, so re-inserts update existing file groups instead of creating new files).
Flink SQL> CREATE TABLE `default_catalog`.`default_database`.`t06` (
> `ds` BIGINT,
> `ut` VARCHAR(2147483647),
> `pk` BIGINT NOT NULL,
> `f0` BIGINT,
> `f1` BIGINT,
> `f2` BIGINT,
> `f3` BIGINT,
> CONSTRAINT `PK_3610` PRIMARY KEY (`pk`) NOT ENFORCED
> ) PARTITIONED BY (`ds`)
> WITH (
> 'connector' = 'hudi',
> 'index.type' = 'BUCKET',
> 'payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
> 'hoodie.compaction.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
> 'hoodie.datasource.write.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
> 'path' = 'hdfs:///user/root/hudi/t06',
> 'connector' = 'hudi',
> 'precombine.field' = 'ut',
> 'table.type' = 'MERGE_ON_READ',
> 'write.operation' = 'upsert',
> 'hoodie.bucket.index.num.buckets' = '2'
> )
> ;
[INFO] Execute statement succeed.
Flink SQL> insert into t06 (ds,ut,pk,f0) select 20230101 ds,CAST(current_timestamp AS STRING) ut,1001 pk,1 f0;
[INFO] Submitting SQL update statement to the cluster...
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: e4e4c8e8ab8c904ad9238336cbf8fb8f
Flink SQL> select * from t06;
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| op | ds | ut | pk | f0 | f1 | f2 | f3 |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| +I | 20230101 | 2023-05-16 12:44:54.004 | 1001 | 1 | <NULL> | <NULL> | <NULL> |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
Received a total of 1 row
Flink SQL> drop table t06;
[INFO] Execute statement succeed.
Flink SQL> CREATE TABLE `default_catalog`.`default_database`.`t06` (
> `ds` BIGINT,
> `ut` VARCHAR(2147483647),
> `pk` BIGINT NOT NULL,
> `f0` BIGINT,
> `f1` BIGINT,
> `f2` BIGINT,
> `f3` BIGINT,
> CONSTRAINT `PK_3610` PRIMARY KEY (`pk`) NOT ENFORCED
> ) PARTITIONED BY (`ds`)
> WITH (
> 'connector' = 'hudi',
> 'index.type' = 'BUCKET',
> 'payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
> 'hoodie.compaction.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
> 'hoodie.datasource.write.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
> 'path' = 'hdfs:///user/root/hudi/t06',
> 'connector' = 'hudi',
> 'precombine.field' = 'ut',
> 'table.type' = 'MERGE_ON_READ',
> 'write.operation' = 'upsert',
> 'hoodie.bucket.index.num.buckets' = '2'
> );
[INFO] Execute statement succeed.
Flink SQL> select * from t06;
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| op | ds | ut | pk | f0 | f1 | f2 | f3 |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| +I | 20230101 | 2023-05-16 12:44:54.004 | 1001 | 1 | <NULL> | <NULL> | <NULL> |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
Received a total of 1 row
Flink SQL> insert into t06 (ds,ut,pk,f0) select 20230101 ds,CAST(current_timestamp AS STRING) ut,1001 pk,1 f0;
[INFO] Submitting SQL update statement to the cluster...
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: 28e29ce22999e479dd14fc1d17bbda1e
Flink SQL> select * from t06;
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| op | ds | ut | pk | f0 | f1 | f2 | f3 |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| +I | 20230101 | 2023-05-16 12:45:41.939 | 1001 | 1 | <NULL> | <NULL> | <NULL> |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
Received a total of 1 row
Flink SQL> drop table t06;
[INFO] Execute statement succeed.
Flink SQL> CREATE TABLE `default_catalog`.`default_database`.`t06` (
> `ds` BIGINT,
> `ut` VARCHAR(2147483647),
> `pk` BIGINT NOT NULL,
> `f0` BIGINT,
> `f1` BIGINT,
> `f2` BIGINT,
> `f3` BIGINT,
> CONSTRAINT `PK_3610` PRIMARY KEY (`pk`) NOT ENFORCED
> ) PARTITIONED BY (`ds`)
> WITH (
> 'connector' = 'hudi',
> 'index.type' = 'BUCKET',
> 'payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
> 'hoodie.compaction.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
> 'hoodie.datasource.write.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
> 'path' = 'hdfs:///user/root/hudi/t06',
> 'connector' = 'hudi',
> 'precombine.field' = 'ut',
> 'table.type' = 'MERGE_ON_READ',
> 'write.operation' = 'upsert',
> 'hoodie.bucket.index.num.buckets' = '2'
> );
[INFO] Execute statement succeed.
Flink SQL> select * from t06;
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| op | ds | ut | pk | f0 | f1 | f2 | f3 |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| +I | 20230101 | 2023-05-16 12:45:41.939 | 1001 | 1 | <NULL> | <NULL> | <NULL> |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
Received a total of 1 row
Flink SQL> insert into t06 (ds,ut,pk,f0) select 20230101 ds,CAST(current_timestamp AS STRING) ut,1001 pk,10 f0;
[INFO] Submitting SQL update statement to the cluster...
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: b9c7ef4dcdbcf743e609edb92216272e
Flink SQL> select * from t06;
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| op | ds | ut | pk | f0 | f1 | f2 | f3 |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| +I | 20230101 | 2023-05-16 12:46:33.156 | 1001 | 10 | <NULL> | <NULL> | <NULL> |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
Received a total of 1 row
Flink SQL>
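As the transcript shows, `drop table` only removes the catalog entry; the data at `hdfs:///user/root/hudi/t06` survives and is picked up again by the next `create table` at the same path. To truly start fresh, the storage path must be deleted by hand. A minimal sketch of this, simulating the HDFS table path with a local temp directory (an assumption for illustration; on a real cluster the equivalents are `hdfs dfs -ls` and `hdfs dfs -rm -r`):

```shell
# Simulation: a local temp dir stands in for hdfs:///user/root/hudi/t06.
# On a real cluster:  hdfs dfs -ls /user/root/hudi/t06
#                     hdfs dfs -rm -r /user/root/hudi/t06
TABLE_PATH="$(mktemp -d)/t06"
mkdir -p "$TABLE_PATH/20230101"
touch "$TABLE_PATH/20230101/data.parquet"   # stand-in for a Hudi base file

# "DROP TABLE t06" in Flink SQL removes only the catalog entry;
# the files under the table path are untouched:
ls "$TABLE_PATH/20230101"

# Delete the path manually before re-creating the table at the same location:
rm -r "$TABLE_PATH"
```

After the manual delete, re-creating the table at the same path yields an empty table, and upserts behave as expected with no stale copies.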