Hudi: a data management framework for data lakes

ByteDance's real-time data lake platform based on Hudi

https://developer.volcengine.com/articles/7220345269954003004

Moving the data warehouse to real time: Hudi on Flink in practice at SF Express

https://www.logclub.com/articleInfo/NDE1NTk=

I. Concepts

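Apache Hudi (Hadoop Upserts anD Incrementals) is an open-source data management framework that supports incremental writes and record-level updates on large-scale data lakes. It was designed primarily for the Apache Hadoop ecosystem and works with big data processing engines such as Apache Spark and Apache Flink.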

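Key features and concepts:

Incremental writes: Hudi allows incremental writes to an existing data lake, supporting insert, update, and delete operations.
Time travel: Hudi supports time-travel queries on the data lake, so historical versions of the data can be queried.
Transactional consistency: Hudi provides transaction-based consistency guarantees, so that writes and queries stay consistent with each other.
Multiple data formats: Hudi supports several data formats, including Avro and Parquet.
Multiple storage backends: Hudi integrates with different storage backends, including Apache HDFS, Amazon S3, and Azure Data Lake Storage.
Flexible queries: Hudi supports SQL-style queries, so users can query the data in the lake with SQL.
Spark and Flink support: Hudi integrates with Apache Spark and Apache Flink and builds on their data processing capabilities.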

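Hudi's main use case is processing and managing large volumes of data in large-scale data lake environments. By providing features such as incremental writes and time-travel queries, Hudi supports building reliable, flexible, and efficient data lake solutions.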
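As a rough illustration of the write side, the sketch below builds the option map that a Spark job would typically pass to the Hudi data source for an upsert. The option keys are Hudi's datasource write configs. The helper `hudi_write_options`, the table name `user_events`, and the field names `uuid` and `ts` are hypothetical and exist only for this example, and the helper accepts only a subset of the operations Hudi supports.

```python
def hudi_write_options(table_name, record_key, precombine_field,
                       operation="upsert", table_type="COPY_ON_WRITE"):
    """Build a dict of Hudi datasource write options (illustrative helper)."""
    # Subset of write operations, chosen for this sketch.
    allowed_ops = {"insert", "upsert", "bulk_insert", "delete"}
    allowed_types = {"COPY_ON_WRITE", "MERGE_ON_READ"}
    if operation not in allowed_ops:
        raise ValueError(f"unsupported operation in this sketch: {operation}")
    if table_type not in allowed_types:
        raise ValueError(f"unknown table type: {table_type}")
    return {
        "hoodie.table.name": table_name,
        # Field that uniquely identifies a record (the primary key).
        "hoodie.datasource.write.recordkey.field": record_key,
        # Field used to pick the latest version when keys collide.
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.table.type": table_type,
    }


# Hypothetical table and field names.
opts = hudi_write_options("user_events", "uuid", "ts")

# With a SparkSession and a DataFrame `df` available (not created here),
# the options would be used roughly like this:
#   df.write.format("hudi").options(**opts).mode("append").save(base_path)
```

Passing `operation="delete"` with the same record key is how a keyed delete would be requested through the same write path.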

II. Hudi's core advantages fall into two parts:

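First, Hudi provides a solution for updates and deletes on Hadoop, so its core capability is incremental update and incremental delete. This matters because privacy regulations, both in China and internationally, now set high requirements for protecting personal data. In Hive, for example, cleaning out a single user's data is difficult: it amounts to reprocessing the entire dataset. With Hudi, the records can be located quickly by primary key and deleted.
Second, time travel. We previously had many applications that needed near-real-time computation. To find out what had changed in the last half hour, we had to pull a full day of data and filter through all of it. With Hudi's time-travel capability, SQL-like syntax is enough to pull out all the incremental changes quickly (Hudi's documentation calls this kind of read an incremental query). Downstream applications can then update business state directly from that data. This is the most important capability of Hudi's time travel.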
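The two advantages above can be sketched with a toy in-memory model: a table keyed by primary key, plus a commit timeline that records every change. This is not Hudi's implementation or API; the class `ToyKeyedTable`, its methods, and the sample records are all hypothetical, and integer commit numbers stand in for commit instants.

```python
from typing import Any, Dict, List, Optional, Tuple


class ToyKeyedTable:
    """In-memory stand-in for a keyed table with a commit timeline."""

    def __init__(self, key_field: str) -> None:
        self.key_field = key_field
        self.rows: Dict[Any, Dict[str, Any]] = {}
        # Each entry: (commit number, key, row), with row=None for a delete.
        self.timeline: List[Tuple[int, Any, Optional[Dict[str, Any]]]] = []
        self._last_commit = 0

    def _next_commit(self) -> int:
        self._last_commit += 1
        return self._last_commit

    def upsert(self, records: List[Dict[str, Any]]) -> int:
        """Insert new keys and overwrite existing ones; return the commit number."""
        commit = self._next_commit()
        for rec in records:
            key = rec[self.key_field]
            self.rows[key] = rec
            self.timeline.append((commit, key, rec))
        return commit

    def delete(self, keys: List[Any]) -> int:
        """Remove rows by primary key without touching any other row."""
        commit = self._next_commit()
        for key in keys:
            if self.rows.pop(key, None) is not None:
                self.timeline.append((commit, key, None))
        return commit

    def changes_since(self, commit: int) -> Dict[Any, Optional[Dict[str, Any]]]:
        """Latest change per key among commits strictly after `commit`."""
        latest: Dict[Any, Optional[Dict[str, Any]]] = {}
        for c, key, row in self.timeline:
            if c > commit:
                latest[key] = row
        return latest


table = ToyKeyedTable(key_field="user_id")
c1 = table.upsert([{"user_id": "u1", "city": "Shenzhen"},
                   {"user_id": "u2", "city": "Beijing"}])
c2 = table.upsert([{"user_id": "u2", "city": "Shanghai"}])
c3 = table.delete(["u1"])  # keyed delete, e.g. a privacy erasure request

# Incremental pull: only what changed after commit c1, not a full scan.
changes = table.changes_since(c1)
```

Here `changes` contains the updated row for `u2` and a `None` marker for the deleted `u1`, which is the kind of change set a downstream job would apply instead of rescanning a full day of data.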

III. How it works

https://blog.csdn.net/younger_china/article/details/125911292

1. Understanding dataset storage types: copy on write vs. merge on read

When you create a Hudi dataset, you specify that the dataset is either copy on write or merge on read.

Copy on Write (CoW) – Data is stored in a columnar format (Parquet), and each update creates a new version of files during a write. CoW is the default storage type.
Merge on Read (MoR) – Data is stored using a combination of columnar (Parquet) and row-based (Avro) formats. Updates are logged to row-based delta files and are compacted as needed to create new versions of the columnar files.

With CoW datasets, each time there is an update to a record, the file that contains the record is rewritten with the updated values. With a MoR dataset, each time there is an update, Hudi writes only the row for the changed record. MoR is better suited for write- or change-heavy workloads with fewer reads. CoW is better suited for read-heavy workloads on data that changes less frequently.
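The write-cost difference described above can be made concrete with a toy model of a single file. It is a simplified sketch, not how Hudi lays out files: the classes `CopyOnWriteFile` and `MergeOnReadFile` and the `rows_written` counter are hypothetical, and plain dicts and lists stand in for Parquet base files and Avro delta logs.

```python
class CopyOnWriteFile:
    """Toy CoW file: every update rewrites the whole base file."""

    def __init__(self, records):
        self.base = dict(records)
        self.rows_written = len(self.base)

    def update(self, key, value):
        new_base = dict(self.base)  # new version of the entire file
        new_base[key] = value
        self.base = new_base
        self.rows_written += len(new_base)

    def read(self):
        return dict(self.base)  # reads need no merge step


class MergeOnReadFile:
    """Toy MoR file: updates go to a delta log and are merged at read time."""

    def __init__(self, records):
        self.base = dict(records)
        self.log = []
        self.rows_written = len(self.base)

    def update(self, key, value):
        self.log.append((key, value))  # only the changed row is written
        self.rows_written += 1

    def read(self):
        merged = dict(self.base)
        for key, value in self.log:  # replay the log over the base file
            merged[key] = value
        return merged

    def compact(self):
        """Fold the log into a new base file version."""
        self.base = self.read()
        self.log = []
        self.rows_written += len(self.base)


initial = {f"k{i}": i for i in range(1000)}
cow = CopyOnWriteFile(initial)
mor = MergeOnReadFile(initial)
for i in range(10):
    cow.update(f"k{i}", -i)
    mor.update(f"k{i}", -i)

print(cow.rows_written, mor.rows_written)  # 11000 1010
```

Both objects return the same data from `read()`, but the CoW file pays for each update at write time while the MoR file defers the cost to reads and to `compact()`, which matches the read-heavy versus write-heavy guidance above.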

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-how-it-works.html

Posted 2024-01-05 23:36 by guoyu1
