Cause of Duplicate Data in a Hudi Table

In testing, duplicates still showed up in query results even though the write operation was explicitly set to upsert and primaryKey, preCombineField, type, and so on were all configured as required. Across repeated tests the duplicate count was stable at 2, and for a given record the commit time of one of the two copies never changed. The results showed the duplicate rows belonged to the same partition but came from different HDFS files.

Related Issue

The cause of this problem is that dropping a Hudi table does not delete the table's files on HDFS. When the table is created again at the same path, a select can still read the old data, which is why the duplicates appear. Repeatedly creating and dropping the table, however, always leaves just the same two files; no new ones are added.
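
This can be confirmed directly on HDFS. A minimal check, using the table path hdfs:///user/root/hudi/t06 from the DDL below (the exact listing will differ per environment): after DROP TABLE, the .hoodie metadata directory and the previously written data files are still under the path, and deleting the directory by hand before re-creating the table is one way to avoid reading the stale data.

    # list what is left under the table path after DROP TABLE
    hdfs dfs -ls -R hdfs:///user/root/hudi/t06
    # remove the leftover files before re-creating the table at the same path
    hdfs dfs -rm -r hdfs:///user/root/hudi/t06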

Flink SQL> CREATE TABLE `default_catalog`.`default_database`.`t06` (
>   `ds` BIGINT,
>   `ut` VARCHAR(2147483647),
>   `pk` BIGINT NOT NULL,
>   `f0` BIGINT,
>   `f1` BIGINT,
>   `f2` BIGINT,
>   `f3` BIGINT,
>   CONSTRAINT `PK_3610` PRIMARY KEY (`pk`) NOT ENFORCED
> ) PARTITIONED BY (`ds`)
> WITH (
>   'connector' = 'hudi',
>   'index.type' = 'BUCKET',
>   'payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
>   'hoodie.compaction.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
>   'hoodie.datasource.write.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
>   'path' = 'hdfs:///user/root/hudi/t06',
>   'connector' = 'hudi',
>   'precombine.field' = 'ut',
>   'table.type' = 'MERGE_ON_READ',
>   'write.operation' = 'upsert',
>   'hoodie.bucket.index.num.buckets' = '2'
> )
> ;
[INFO] Execute statement succeed.

Flink SQL> insert into t06 (ds,ut,pk,f0) select 20230101 ds,CAST(current_timestamp AS STRING) ut,1001 pk,1 f0;
[INFO] Submitting SQL update statement to the cluster...
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: e4e4c8e8ab8c904ad9238336cbf8fb8f


Flink SQL> select * from t06;
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| op |                   ds |                             ut |                   pk |                   f0 |                   f1 |                   f2 |                   f3 |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| +I |             20230101 |        2023-05-16 12:44:54.004 |                 1001 |                    1 |               <NULL> |               <NULL> |               <NULL> |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
Received a total of 1 row

Flink SQL> drop table t06;
[INFO] Execute statement succeed.

Flink SQL> CREATE TABLE `default_catalog`.`default_database`.`t06` (
>   `ds` BIGINT,
>   `ut` VARCHAR(2147483647),
>   `pk` BIGINT NOT NULL,
>   `f0` BIGINT,
>   `f1` BIGINT,
>   `f2` BIGINT,
>   `f3` BIGINT,
>   CONSTRAINT `PK_3610` PRIMARY KEY (`pk`) NOT ENFORCED
> ) PARTITIONED BY (`ds`)
> WITH (
>   'connector' = 'hudi',
>   'index.type' = 'BUCKET',
>   'payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
>   'hoodie.compaction.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
>   'hoodie.datasource.write.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
>   'path' = 'hdfs:///user/root/hudi/t06',
>   'connector' = 'hudi',
>   'precombine.field' = 'ut',
>   'table.type' = 'MERGE_ON_READ',
>   'write.operation' = 'upsert',
>   'hoodie.bucket.index.num.buckets' = '2'
> );
[INFO] Execute statement succeed.

Flink SQL> select * from t06;
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| op |                   ds |                             ut |                   pk |                   f0 |                   f1 |                   f2 |                   f3 |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| +I |             20230101 |        2023-05-16 12:44:54.004 |                 1001 |                    1 |               <NULL> |               <NULL> |               <NULL> |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
Received a total of 1 row

Flink SQL> insert into t06 (ds,ut,pk,f0) select 20230101 ds,CAST(current_timestamp AS STRING) ut,1001 pk,1 f0;
[INFO] Submitting SQL update statement to the cluster...
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: 28e29ce22999e479dd14fc1d17bbda1e


Flink SQL> select * from t06;
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| op |                   ds |                             ut |                   pk |                   f0 |                   f1 |                   f2 |                   f3 |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| +I |             20230101 |        2023-05-16 12:45:41.939 |                 1001 |                    1 |               <NULL> |               <NULL> |               <NULL> |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
Received a total of 1 row

Flink SQL> drop table t06;
[INFO] Execute statement succeed.

Flink SQL> CREATE TABLE `default_catalog`.`default_database`.`t06` (
>   `ds` BIGINT,
>   `ut` VARCHAR(2147483647),
>   `pk` BIGINT NOT NULL,
>   `f0` BIGINT,
>   `f1` BIGINT,
>   `f2` BIGINT,
>   `f3` BIGINT,
>   CONSTRAINT `PK_3610` PRIMARY KEY (`pk`) NOT ENFORCED
> ) PARTITIONED BY (`ds`)
> WITH (
>   'connector' = 'hudi',
>   'index.type' = 'BUCKET',
>   'payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
>   'hoodie.compaction.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
>   'hoodie.datasource.write.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
>   'path' = 'hdfs:///user/root/hudi/t06',
>   'connector' = 'hudi',
>   'precombine.field' = 'ut',
>   'table.type' = 'MERGE_ON_READ',
>   'write.operation' = 'upsert',
>   'hoodie.bucket.index.num.buckets' = '2'
> );
[INFO] Execute statement succeed.

Flink SQL> select * from t06;
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| op |                   ds |                             ut |                   pk |                   f0 |                   f1 |                   f2 |                   f3 |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| +I |             20230101 |        2023-05-16 12:45:41.939 |                 1001 |                    1 |               <NULL> |               <NULL> |               <NULL> |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
Received a total of 1 row

Flink SQL>  insert into t06 (ds,ut,pk,f0) select 20230101 ds,CAST(current_timestamp AS STRING) ut,1001 pk,10 f0;
[INFO] Submitting SQL update statement to the cluster...
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: b9c7ef4dcdbcf743e609edb92216272e


Flink SQL> select * from t06;
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| op |                   ds |                             ut |                   pk |                   f0 |                   f1 |                   f2 |                   f3 |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
| +I |             20230101 |        2023-05-16 12:46:33.156 |                 1001 |                   10 |               <NULL> |               <NULL> |               <NULL> |
+----+----------------------+--------------------------------+----------------------+----------------------+----------------------+----------------------+----------------------+
Received a total of 1 row

Flink SQL> 
