What changes on HDFS when a Hudi table is created
Spark SQL statement that creates the Hudi table:
CREATE TABLE t71 (
ds BIGINT,
ut STRING,
pk BIGINT,
f0 BIGINT,
f1 BIGINT,
f2 BIGINT,
f3 BIGINT,
f4 BIGINT
) USING hudi
PARTITIONED BY (ds)
TBLPROPERTIES ( -- OPTIONS can be used here as well (https://hudi.apache.org/docs/table_management)
type = 'mor',
primaryKey = 'pk',
preCombineField = 'ut',
hoodie.index.type = 'BUCKET',
hoodie.bucket.index.num.buckets = '2',
hoodie.compaction.payload.class = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
hoodie.datasource.write.payload.class = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
hoodie.archive.merge.enable = 'true',
hoodie.datasource.write.operation = 'upsert'
);
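After CREATE TABLE runs, a .hoodie subdirectory is created under the table path, holding a few (initially empty) subdirectories and the hoodie.properties file: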
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71
Found 1 items
drwxr-xr-x - zhangsan dfsusers 0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie
Found 5 items
drwxr-xr-x - zhangsan dfsusers 0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux
drwxr-xr-x - zhangsan dfsusers 0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.schema
drwxr-xr-x - zhangsan dfsusers 0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.temp
drwxr-xr-x - zhangsan dfsusers 0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/archived
-rw-r--r-- 3 zhangsan dfsusers 1501 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/hoodie.properties
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/archived
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.schema
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux
Found 1 items
drwxr-xr-x - zhangsan dfsusers 0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.temp
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap
Found 2 items
drwxr-xr-x - zhangsan dfsusers 0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap/.fileids
drwxr-xr-x - zhangsan dfsusers 0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap/.partitions
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap/.partitions
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap/.fileids
[/home/zhangsan]$ sh hadoop.sh fs -cat hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/hoodie.properties
#Properties saved on 2023-05-31T03:09:25.601Z
#Wed May 31 11:09:25 CST 2023
hoodie.table.precombine.field=ut
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.partition.fields=ds
hoodie.bucket.index.num.buckets=2
hoodie.table.type=MERGE_ON_READ
hoodie.archivelog.folder=archived
hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload
hoodie.table.version=5
hoodie.timeline.layout.version=1
hoodie.table.recordkey.fields=pk
hoodie.database.name=test
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.table.name=t71
hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
hoodie.datasource.write.hive_style_partitioning=true
hoodie.table.create.schema={"type"\:"record","name"\:"t71_record","namespace"\:"hoodie.t71","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"ut","type"\:["string","null"]},{"name"\:"pk","type"\:["long","null"]},{"name"\:"f0","type"\:["long","null"]},{"name"\:"f1","type"\:["long","null"]},{"name"\:"f2","type"\:["long","null"]},{"name"\:"f3","type"\:["long","null"]},{"name"\:"f4","type"\:["long","null"]},{"name"\:"ds","type"\:["long","null"]}]}
hoodie.index.type=BUCKET
hoodie.table.checksum=3938074607
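hoodie.properties is written in the Java properties format, which is why every ':' in the schema JSON appears as '\:'. The following is a minimal sketch of reading such a file back and decoding hoodie.table.create.schema. It handles only the subset of the format seen above (comments, key=value, escaped colons), the helper name parse_properties is made up for this illustration, and the sample text is a trimmed copy of the file shown above.

```python
import json


def parse_properties(text):
    """Parse a small subset of the Java .properties format into a dict."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        # The Java writer escapes ':' as '\:'; undo that escape.
        props[key.strip()] = value.strip().replace("\\:", ":")
    return props


# Trimmed copy of the hoodie.properties content shown above.
SAMPLE = r"""#Properties saved on 2023-05-31T03:09:25.601Z
hoodie.table.type=MERGE_ON_READ
hoodie.table.recordkey.fields=pk
hoodie.table.precombine.field=ut
hoodie.table.create.schema={"type"\:"record","name"\:"t71_record","fields"\:[{"name"\:"pk","type"\:["long","null"]}]}
"""

props = parse_properties(SAMPLE)
schema = json.loads(props["hoodie.table.create.schema"])
print(props["hoodie.table.type"], [f["name"] for f in schema["fields"]])
```

Once the escapes are removed, the schema value is ordinary Avro schema JSON and can be inspected with any JSON tool.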
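Running DROP TABLE deletes the table directory, for example hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71.
For a partitioned table, a subdirectory named after the partition value is created under the table directory when data is inserted, for example: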
insert into t71 (ds,ut,pk,f0) values (20230101,CURRENT_TIMESTAMP,1102,1);
The statement above creates a subdirectory named "ds=20230101" on HDFS:
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101
Found 2 items
-rw-r--r-- 3 zhangsan dfsusers 96 2023-05-31 11:29 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.hoodie_partition_metadata
-rw-r--r-- 3 zhangsan dfsusers 435756 2023-05-31 11:29 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/00000001-6776-4b80-915b-ad6bdff96948-0_1-21-19_20230531112913107.parquet
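The base file name starts with 00000001, which looks like the bucket number under hoodie.index.type = 'BUCKET'. The sketch below is an assumption about how the bucket is derived, based on my reading of Hudi's bucket index: take the Java hashCode of the single-element list holding the record key, clear the sign bit, and take it modulo the bucket count. The function names are invented for this illustration. With 2 buckets it maps pk=1102 to bucket 1, which agrees with the 00000001 prefix above, and pk=1006 (used at the end of this note) to bucket 0, which agrees with the 00000000 prefix shown there.

```python
def java_string_hash(s):
    """Signed 32-bit equivalent of Java's String.hashCode() (ASCII keys assumed)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def java_list_hash(values):
    """Signed 32-bit equivalent of Java's List<String>.hashCode()."""
    h = 1
    for v in values:
        h = (31 * h + java_string_hash(v)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def bucket_id(key_values, num_buckets):
    # Assumption: mirrors (hashCode & Integer.MAX_VALUE) % numBuckets.
    return (java_list_hash(key_values) & 0x7FFFFFFF) % num_buckets


def file_id_prefix(bucket):
    # Assumption: the bucket number is zero-padded to 8 digits in the file ID.
    return f"{bucket:08d}"


for pk in ("1102", "1006"):
    print(pk, file_id_prefix(bucket_id([pk], 2)))
```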
Run three inserts in a row:
insert into t71 (ds,ut,pk,f0) values (20230101,CURRENT_TIMESTAMP,1102,1);
select * from t71 where pk=1102;
insert into t71 (ds,ut,pk,f1) values (20230101,CURRENT_TIMESTAMP,1102,2);
select * from t71 where pk=1102;
insert into t71 (ds,ut,pk,f2) values (20230101,CURRENT_TIMESTAMP,1102,3);
select * from t71 where pk=1102;
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101
Found 3 items
-rw-r--r-- 3 zhangsan dfsusers 1048 2023-05-31 14:26 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_20230531141236926.log.1_1-8-6
-rw-r--r-- 3 zhangsan dfsusers 2096 2023-05-31 14:31 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_20230531141236926.log.1_1-8-6
-rw-r--r-- 3 zhangsan dfsusers 96 2023-05-31 14:13 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.hoodie_partition_metadata
-rw-r--r-- 3 zhangsan dfsusers 435757 2023-05-31 14:13 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_1-21-17_20230531141236926.parquet
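Two ".log." entries are listed above. They are the same file, captured once after it was first written (1048 bytes) and once after the following insert (2096 bytes), and are shown together for easier comparison.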
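The file names in the listing carry the file ID, the write token and the instant time. As a sketch of how to pull them apart, the regular expressions below follow the naming pattern visible in this listing; the field names (file_id, write_token, instant, version) are my own labels, not taken from the Hudi source.

```python
import re

# <fileId>_<writeToken>_<instantTime>.parquet
BASE_FILE_RE = re.compile(
    r"^(?P<file_id>[^_]+)_(?P<write_token>[^_]+)_(?P<instant>\d+)\.parquet$"
)
# .<fileId>_<baseInstantTime>.log.<version>_<writeToken>
LOG_FILE_RE = re.compile(
    r"^\.(?P<file_id>[^_]+)_(?P<instant>\d+)\.log\.(?P<version>\d+)_(?P<write_token>[^_]+)$"
)


def parse_data_file(name):
    """Classify a file name as base or log file and split it into its parts."""
    m = BASE_FILE_RE.match(name)
    if m:
        return {"kind": "base", **m.groupdict()}
    m = LOG_FILE_RE.match(name)
    if m:
        return {"kind": "log", **m.groupdict()}
    raise ValueError(f"unrecognized data file name: {name}")


base = parse_data_file(
    "00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_1-21-17_20230531141236926.parquet"
)
log = parse_data_file(
    ".00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_20230531141236926.log.1_1-8-6"
)
# Same file ID and same instant: the log file extends that base file.
print(base["file_id"] == log["file_id"], base["instant"] == log["instant"])
```

Both names share the file ID 00000001-b6d7-4eaa-8004-ac7d0626bf8d-0 and the instant 20230531141236926, so the log file holds the changes made on top of that base file.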
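In this setup, the first insert into a partition always produces a ".parquet" file rather than a ".log." file. The ".parquet" file is the columnar base file, which both COW and MOR tables have; a COW table rewrites the whole base file on every insert, whereas a MOR table does not. The ".log." file is the row-oriented delta log file and exists only in MOR tables. The file .hoodie_partition_metadata holds the partition metadata: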
[/home/zhangsan]$ sh hadoop.sh fs -cat hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.hoodie_partition_metadata
#partition metadata
#Wed May 31 11:29:49 CST 2023
commitTime=20230531112913107
partitionDepth=1
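The commitTime value is an instant time of the form yyyyMMddHHmmssSSS. A small sketch of turning it into a datetime follows; the helper name parse_instant is made up for this illustration. The value carries no time zone. Here it lines up with the CST wall-clock times in the listings, so it appears to be the writer's local time.

```python
from datetime import datetime, timedelta


def parse_instant(instant):
    """Convert a 17-digit instant time (yyyyMMddHHmmssSSS) to a naive datetime."""
    if len(instant) != 17 or not instant.isdigit():
        raise ValueError(f"expected 17 digits, got: {instant!r}")
    seconds_part = datetime.strptime(instant[:14], "%Y%m%d%H%M%S")
    return seconds_part + timedelta(milliseconds=int(instant[14:]))


commit = parse_instant("20230531112913107")
print(commit.isoformat(timespec="milliseconds"))
```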
".parquet" file
Opening the ".parquet" file with the online tool https://parquet-viewer-online.com/result shows that its content is identical to the SELECT result:
_hoodie_commit_time _hoodie_commit_seqno _hoodie_record_key _hoodie_partition_path _hoodie_file_name ut pk f0 f1 f2 f3 f4 ds
20230531141236926 20230531141236926_1_0 1102 ds=20230101 00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_1-21-17_20230531141236926.parquet 2023-05-31 14:12:37.126 1102 1 null null null null 20230
".log." file
The second insert created the ".log." file:
#HUDI#
4{"type":"record","name":"t71_record","namespace":"hoodie.t71","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"ut","type":"string"},{"name":"pk","type":"long"},{"name":"f0","type":["null","long"],"default":null},{"name":"f1","type":["null","long"],"default":null},{"name":"f2","type":["null","long"],"default":null},{"name":"f3","type":["null","long"],"default":null},{"name":"f4","type":["null","long"],"default":null},{"name":"ds","type":"long"}]} 20230531142614512 • ‰"20230531142614512*20230531142614512_1_11102ds=20230101L00000001-b6d7-4eaa-8004-ac7d0626bf8d-0.2023-05-31 14:26:14.761œ ª¿¥
The third insert appended to the ".log." file:
#HUDI#
4{"type":"record","name":"t71_record","namespace":"hoodie.t71","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"ut","type":"string"},{"name":"pk","type":"long"},{"name":"f0","type":["null","long"],"default":null},{"name":"f1","type":["null","long"],"default":null},{"name":"f2","type":["null","long"],"default":null},{"name":"f3","type":["null","long"],"default":null},{"name":"f4","type":["null","long"],"default":null},{"name":"ds","type":"long"}]} 20230531142614512 • ‰"20230531142614512*20230531142614512_1_11102ds=20230101L00000001-b6d7-4eaa-8004-ac7d0626bf8d-0.2023-05-31 14:26:14.761œ ª¿¥ #HUDI#
4{"type":"record","name":"t71_record","namespace":"hoodie.t71","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"ut","type":"string"},{"name":"pk","type":"long"},{"name":"f0","type":["null","long"],"default":null},{"name":"f1","type":["null","long"],"default":null},{"name":"f2","type":["null","long"],"default":null},{"name":"f3","type":["null","long"],"default":null},{"name":"f4","type":["null","long"],"default":null},{"name":"ds","type":"long"}]} 20230531143136695 • ‰"20230531143136695*20230531143136695_1_11102ds=20230101L00000001-b6d7-4eaa-8004-ac7d0626bf8d-0.2023-05-31 14:31:36.801œ ª¿¥
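The log now holds two records for pk=1102, with ut values 2023-05-31 14:26:14.761 and 2023-05-31 14:31:36.801. When the table is read with OverwriteNonDefaultsWithLatestAvroPayload, only the 2023-05-31 14:31:36.801 record is returned. This follows from the rule that the record with the larger preCombineField value wins, and that logic runs in HoodieRecordPayload::preCombine.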
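Each block in the dumps above starts with the marker #HUDI#, and the file size doubled exactly (1048 to 2096 bytes) when the second block was appended. As a rough way to count blocks in a downloaded log file, the sketch below simply scans for that marker. This is a naive approach chosen for illustration: a real reader would follow the length fields that come after the marker, and a record that happens to contain the bytes #HUDI# would be miscounted here.

```python
MAGIC = b"#HUDI#"


def block_offsets(data):
    """Return the byte offsets at which the #HUDI# marker occurs."""
    offsets = []
    pos = data.find(MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(MAGIC, pos + len(MAGIC))
    return offsets


# Synthetic stand-in for a log file with two appended blocks.
fake_log = MAGIC + b"\x00block-one" + MAGIC + b"\x00block-two"
print(len(block_offsets(fake_log)), block_offsets(fake_log))
```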
Relevant source code
// OverwriteNonDefaultsWithLatestAvroPayload does not override the preCombine method of OverwriteWithLatestAvroPayload
public class OverwriteNonDefaultsWithLatestAvroPayload extends OverwriteWithLatestAvroPayload {
    // (other members omitted in this excerpt)
}
public class OverwriteWithLatestAvroPayload extends BaseAvroPayload
        implements HoodieRecordPayload<OverwriteWithLatestAvroPayload> {
    @Override
    public OverwriteWithLatestAvroPayload preCombine(OverwriteWithLatestAvroPayload oldValue) {
        if (oldValue.recordBytes.length == 0) {
            // use natural order for delete record
            return this;
        }
        if (oldValue.orderingVal.compareTo(orderingVal) > 0) {
            // pick the payload with greatest ordering value
            return oldValue;
        } else {
            return this;
        }
    }
}
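To make the selection rule easier to try out, here is a Python model of the preCombine logic quoted above. It is a simplified stand-in, not Hudi code: the Payload class and its fields are named after the Java members, and the ordering value is a plain string, as ut is in table t71.

```python
from dataclasses import dataclass


@dataclass
class Payload:
    record_bytes: bytes   # serialized record; empty means a delete record
    ordering_val: str     # value of the preCombineField (ut in table t71)

    def pre_combine(self, old_value):
        if len(old_value.record_bytes) == 0:
            # The old value is a delete record: keep natural order.
            return self
        if old_value.ordering_val > self.ordering_val:
            # The old value has the greater ordering value, so it wins.
            return old_value
        # Greater or equal ordering value on this side: this payload wins.
        return self


older = Payload(b"second insert", "2023-05-31 14:26:14.761")
newer = Payload(b"third insert", "2023-05-31 14:31:36.801")
print(newer.pre_combine(older).ordering_val)
print(older.pre_combine(newer).ordering_val)
```

Because ut is a STRING column, the comparison is lexicographic. For fixed-width timestamps such as these, that matches chronological order. On a tie, the payload on which pre_combine is called wins.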
If PartialUpdateAvroPayload is used instead, the fields written by the separate inserts are merged into a single row:
_hoodie_commit_time _hoodie_commit_seqno _hoodie_record_key _hoodie_partition_path _hoodie_file_name ut pk f0 f1 f2 f3 f4
20230531164701237 20230531164701237_0_1 1006 ds=20230101 00000000-ad06-474e-a7ac-0580f60307e1-0 2023-05-31 16:47:02.337 1006 1 2 3 NULL NULL
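The merged row above is what a field-by-field merge would produce. The sketch below is a simplified model of that behaviour, not the PartialUpdateAvroPayload implementation: records are applied in ascending order of the ordering field, and a non-null field from a later record overwrites the earlier value. The three input rows and their ut values "t1" to "t3" are hypothetical placeholders, since the original inserts for pk=1006 are not shown.

```python
def partial_update_merge(records, ordering_field="ut"):
    """Merge records field by field; later non-null values win."""
    merged = {}
    for rec in sorted(records, key=lambda r: r[ordering_field]):
        for name, value in rec.items():
            # Keep an earlier non-null value when the later record has null.
            if value is not None or name not in merged:
                merged[name] = value
    return merged


def row(ut, **fields):
    """Build a hypothetical t71 row for pk=1006 with unset columns as None."""
    base = {"ut": ut, "pk": 1006, "f0": None, "f1": None,
            "f2": None, "f3": None, "f4": None, "ds": 20230101}
    base.update(fields)
    return base


inserts = [row("t1", f0=1), row("t2", f1=2), row("t3", f2=3)]
result = partial_update_merge(inserts)
print(result)
```

Under this model, f0, f1 and f2 each come from a different insert, while f3 and f4 stay null, matching the row shown above.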