表治理-Iceberg元数据合并-metadata.json文件
一、背景描述
元数据文件随时间增多,导致查询变慢。通过如下方式可以指定metadata个数,超过指定数量自动清理。
metadata文件对应Iceberg概念是Snapshots
二、解决方案
1、在建表时增加参数
‘write.metadata.delete-after-commit.enabled’=‘true’,
‘write.metadata.previous-versions-max’=‘5’
2、建表语句
自动清理metadata文件
CREATE TABLE iceberg_test.autoclean_usql_test(
`id` BIGINT,
`name_cn` STRING,
`description` STRING,
`status` INT
)USING iceberg
TBLPROPERTIES(
'format-version'='2'
,'property-version'='2'
,'write.upsert.enabled'='true'
,'write.metadata.delete-after-commit.enabled'='true'
,'write.metadata.previous-versions-max'='5'
)
不带metadata自动清理
CREATE TABLE iceberg_test.notclean_usql_test(
`id` BIGINT,
`name_cn` STRING,
`description` STRING,
`status` INT
)USING iceberg
TBLPROPERTIES(
'format-version'='2'
,'property-version'='2'
,'write.upsert.enabled'='true')
|
3、存储路径
/user/hive/warehouse/iceberg_test.db/autoclean_usql_test
三、插入数据测试
1、带metadata清理测试,因保留5个,第六次插入才清理
插入数据,查看生成的文件数量。查看命令如下
hdfs dfs -ls /user/hive/warehouse/iceberg_test.db/autoclean_usql_test/metadata |grep 'Found'
insert into iceberg_test.autoclean_usql_test values(1,'name1','desc',1); Found 4 items 2个metadata.json 1个*m0.avro 1个***.avro
insert into iceberg_test.autoclean_usql_test values(2,'name2','desc',1); Found 7 items 3个metadata.json 2个*m0.avro 2个***.avro
insert into iceberg_test.autoclean_usql_test values(3,'name3','desc',1); Found 10 items 4个metadata.json 3个*m0.avro 3个***.avro
insert into iceberg_test.autoclean_usql_test values(4,'name4','desc',1); Found 13 items 5个metadata.json 4个*m0.avro 4个***.avro
insert into iceberg_test.autoclean_usql_test values(5,'name5','desc',1); Found 16 items 6个metadata.json 5个*m0.avro 5个***.avro
---开始清理生效
insert into iceberg_test.autoclean_usql_test values(6,'name6','desc',1); Found 18 items 6个metadata.json 6个*m0.avro 6个***.avro
insert into iceberg_test.autoclean_usql_test values(7,'name7','desc',1); Found 20 items 6个metadata.json 7个*m0.avro 7个***.avro
2、不带metadata清理,每次增加三个,一直增加。
hdfs dfs -ls /user/hive/warehouse/iceberg_test.db/notclean_usql_test/metadata
insert into iceberg_test.notclean_usql_test values(1,'name1','desc',1); Found 4 items 2个metadata.json 1个*m0.avro 1个***.avro
insert into iceberg_test.notclean_usql_test values(2,'name2','desc',1); Found 7 items 3个metadata.json 2个*m0.avro 2个***.avro
insert into iceberg_test.notclean_usql_test values(3,'name3','desc',1); Found 10 items 4个metadata.json 3个*m0.avro 3个***.avro
insert into iceberg_test.notclean_usql_test values(4,'name4','desc',1); Found 13 items 5个metadata.json 4个*m0.avro 4个***.avro
insert into iceberg_test.notclean_usql_test values(5,'name5','desc',1); Found 16 items 6个metadata.json 5个*m0.avro 5个***.avro
insert into iceberg_test.notclean_usql_test values(6,'name6','desc',1); Found 19 items 7个metadata.json 6个*m0.avro 6个***.avro
insert into iceberg_test.notclean_usql_test values(7,'name7','desc',1); Found 22 items 8个metadata.json 7个*m0.avro 7个***.avro
四、参考文章
1、实践数据湖iceberg 元数据合并
https://blog.csdn.net/spark_dev/article/details/122876819
分类:
Iceberg
· 百万级群聊的设计实践
· 永远不要相信用户的输入:从 SQL 注入攻防看输入验证的重要性
· 全网最简单!3分钟用满血DeepSeek R1开发一款AI智能客服,零代码轻松接入微信、公众号、小程
· .NET 10 首个预览版发布,跨平台开发与性能全面提升
· 《HelloGitHub》第 107 期