DataWarehouse - Essay Category - fxjwind

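Paper Analysis -- Big Metadata: When Metadata is Big Data

Summary: The problem to solve is how a cloud-native database should manage its ever-growing metadata. Traditional databases all store the catalog in system tables. Among big data systems, Colossus, for example, stores its metadata in Bigtable; the Hadoop ecosystem has the Hive metastore. Delta Lake records metadata in a transaction log. And for AP Read more

posted @ 2022-05-18 16:17 fxjwind Views(330) Comments(0) Recommendations(0)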

Hudi Concepts

Summary: Compared with Hadoop, it adds two capabilities: updates and deletes, and deltas (a change stream). Apache Hudi (pronounced “Hudi”) provides the following streaming primitives over hadoop compatible storages Update/De Read more

posted @ 2022-05-06 16:03 fxjwind Views(158) Comments(0) Recommendations(0)

Introduction to Apache Hudi

Summary: Hudi: Uber Engineering’s Incremental Processing Framework on Apache Hadoop With the evolution of storage formats like Apache Parquet and Apache ORC an Read more

posted @ 2022-04-25 15:53 fxjwind Views(755) Comments(0) Recommendations(0)

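Paper Analysis -- Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics

Summary: Characteristics of the new data warehouse architecture: open formats that can be accessed directly, native support for machine learning frameworks, and good performance. This paper argues that the data warehouse architecture as we know it today will wane in the coming years and be rep Read more

posted @ 2022-04-12 18:07 fxjwind Views(389) Comments(0) Recommendations(0)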

Paper Analysis -- Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores

Summary: INTRODUCTION Proposes object storage as the storage layer of a data system: it is low cost and enables the separation of storage and compute. Cloud object stores such as Amazon S3 [4] and Azure Blob Storage [17] have become some of the largest and mos Read more

posted @ 2022-04-12 14:48 fxjwind Views(455) Comments(0) Recommendations(0)

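PrestoSQL (trinodb) Source Code Analysis - Execution (Part 2)

Summary: TaskExecutor Now that everything is ready, the real execution begins. TaskRunner threads are added at initialization. TaskRunner The core is to keep taking a split from waitingSplits and then process it. At this point a driver is created. CreateDriver First uses the earlier operatorFacto Read more

posted @ 2022-01-07 15:35 fxjwind Views(878) Comments(2) Recommendations(0)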
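
The TaskRunner loop described in the summary above (keep taking a split from waitingSplits, then process it) can be sketched in a few lines. This is a minimal Python illustration of that worker-loop pattern only, not the PrestoSQL source: `task_runner`, `run_splits`, the `None` sentinel, and modeling a split as a plain callable are all assumptions made for the example, and anything the summary does not mention (such as prioritization or driver creation) is left out.

```python
import queue
import threading


def task_runner(waiting_splits, results):
    """Hypothetical runner: repeatedly take a split from the queue and process it."""
    while True:
        split = waiting_splits.get()   # blocks until a split is available
        if split is None:              # assumed sentinel meaning "no more work"
            break
        results.append(split())        # "process" the split (here: just call it)


def run_splits(splits, num_runners=2):
    """Start the runner threads first, then enqueue the splits for them."""
    waiting_splits = queue.Queue()
    results = []
    runners = [
        threading.Thread(target=task_runner, args=(waiting_splits, results))
        for _ in range(num_runners)
    ]
    for runner in runners:
        runner.start()
    for split in splits:
        waiting_splits.put(split)
    for _ in runners:
        waiting_splits.put(None)       # one sentinel per runner thread
    for runner in runners:
        runner.join()
    return results


# Each "split" is modeled as a callable that returns its result.
results = run_splits([lambda i=i: i * i for i in range(5)])
```

The order of `results` depends on thread scheduling, so callers that need a stable order should sort or key the results themselves.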

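PrestoSQL (trinodb) Source Code Analysis - Execution (Part 1)

Summary: SqlTaskManager The worker's SqlTaskManager is responsible for receiving incoming TaskRequests. doUpdateTask Gets or creates the SqlTask; only a new task needs to be created. tasks is a LoadingCache<TaskId, SqlTask>. Finally, updateTask is called, ... Read more

posted @ 2021-12-21 17:47 fxjwind Views(913) Comments(0) Recommendations(0)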

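PrestoSQL (trinodb) Source Code Analysis - Optimization and Scheduling

Summary: A test server can be started through TpchQueryRunner. The example query is still ‘SELECT SUPPKEY, sum(QUANTITY) from lineitem where QUANTITY > 5 group by SUPPKEY limit 10’. On a Mac M1 the Java CLI has a bug; you can use py Read more

posted @ 2021-11-25 14:56 fxjwind Views(1989) Comments(1) Recommendations(0)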

Paper Analysis -- Building An Elastic Query Engine on Disaggregated Storage (NSDI 2020)

Summary: Introduction Introduces the shared-nothing architecture. Shared-nothing architectures have been the foundation of traditional query execution engines and data warehousing syst Read more

posted @ 2021-03-25 13:50 fxjwind Views(565) Comments(0) Recommendations(1)

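Columnar Storage Formats

Summary: https://zhuanlan.zhihu.com/p/35622907 https://blog.csdn.net/yu616568/article/details/51868447 Why to use columnar storage is not discussed here; we go straight to how the formats evolved. NSM (N-ary Storage Model): row-wise storage. D Read more

posted @ 2020-05-14 14:46 fxjwind Views(953) Comments(0) Recommendations(0)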

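Evaluating End-to-End Optimization for Data Analytics Applications in Weld

Summary: Reference: "Executor Optimization Techniques as Seen from the Weld Paper". The problem to solve: today's data analytics applications use many libraries, such as NumPy, Pandas, TensorFlow, and Spark. The interfaces and data structures of these libraries all differ, so to improve an application's performance you can only go one by one, improving each liba Read more

posted @ 2020-03-17 15:43 fxjwind Views(350) Comments(0) Recommendations(0)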

Pinot: Realtime OLAP for 530 Million Users

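Summary: Traditional TP databases struggle to meet AP needs. To meet them, there are currently several approaches. Columnar storage: it reduces the amount of data transferred and makes compression more effective. NewSQL, hybrid TP/AP: generally in-memory databases. Offline databases such as Hive, Presto, and Spark: fast or slow, they do not store data themselves and are only execution and query engines. Pre-aggregation c Read more

posted @ 2020-01-19 16:55 fxjwind Views(545) Comments(0) Recommendations(0)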

HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots

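Summary: HyPer is also an in-memory database. Traditional databases were basically all TP. BI needs, that is, AP needs, appeared later, and traditional databases could not satisfy them, so data warehouses emerged; however, ETL is required to sync TP data into the warehouse before running AP. Even real-time warehouses based on columnar storage need different storage engines for TP and AP. In short, supporting both TP and AP with a single set of data structures and a single system had not been achieved before. Read more

posted @ 2020-01-17 20:10 fxjwind Views(776) Comments(0) Recommendations(0)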

SAP HANA Database - Data Management for Modern Business Applications

Summary: A quick look at the architecture, which is divided into several parts. Connection And Session The Connection and Session Management component creates and manages sessions and connections for the database Read more

posted @ 2020-01-06 16:45 fxjwind Views(461) Comments(0) Recommendations(0)

Impala: A Modern, Open-Source SQL Engine for Hadoop

Summary: Impala is an open-source, fully-integrated, state-of-the-art MPP SQL query engine designed specifically to leverage the flexibility and scalability of Read more

posted @ 2019-12-26 18:08 fxjwind Views(350) Comments(0) Recommendations(0)

The Vertica Analytic Database: C-Store 7 Years Later

Summary: Vertica is the commercial implementation of the C-Store project, so the C-Store paper that precedes it should be read first. The Vertica Analytic Database (Vertica) is a distributed massively parallel RDBMS system that commercialize Read more

posted @ 2019-12-25 11:46 fxjwind Views(475) Comments(0) Recommendations(0)

Amazon Redshift and the Case for Simpler Data Warehouses

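Summary: Redshift is an evolution of an Amazon commercial product, not an evolution of technology; what it uses is simply technology from the traditional data warehouse field. If anything is innovative, it is the cloud-native architecture that makes heavy use of Amazon's own cloud services, which greatly improves the product's iteration speed, maintainability, and manageability. The premise, of course, is that Amazon has such good infrastructure available. Architecture DataPlane A typical Sh Read more

posted @ 2019-12-23 15:16 fxjwind Views(443) Comments(0) Recommendations(0)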

C-Store: A Column-oriented DBMS Mike

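Summary: This paper is fairly old and is a foundational paper on columnar storage; almost every columnar storage or OLAP paper cites it. Row stores are write-oriented and support OLTP. Column stores are read-oriented and support OLAP. For a disk-based DBMS the bottleneck is basically disk I/O, so all of the work trades spare CPU for disk I/O. The overall idea: compression makes the data to be stored smaller; densepack, storing more data together, Read more

posted @ 2019-12-20 12:30 fxjwind Views(702) Comments(0) Recommendations(0)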

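Introduction to Apache Kylin

Summary: http://kylin.apache.org/docs/index.html https://www.infoq.cn/article/vOrjsJCgVAVPim5hsj6p The core idea of Kylin is precomputation: for the specified dimensions and measures, all possible query results are computed in advance, trading space for time to accelerate queries with ... Read more

posted @ 2019-12-18 14:06 fxjwind Views(948) Comments(0) Recommendations(0)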
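
The precomputation idea in the Kylin summary above (compute every possible query result ahead of time for the chosen dimensions and measures, trading space for time) can be illustrated with a toy example. This is only a sketch of the general technique in plain Python, not Kylin's API or storage format: `build_cube`, `query`, and the sample rows are hypothetical, and the only measure supported is a sum.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate SUM(measure) for every subset of the dimensions."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)   # group-by key for this subset
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube


def query(cube, group_by):
    """Answer a GROUP BY with a dictionary lookup instead of scanning the rows.

    group_by must list dimensions in the same order used to build the cube.
    """
    return cube[tuple(group_by)]


rows = [
    {"region": "east", "product": "a", "sales": 10},
    {"region": "east", "product": "b", "sales": 5},
    {"region": "west", "product": "a", "sales": 7},
]
cube = build_cube(rows, ("region", "product"), "sales")
by_region = query(cube, ["region"])
grand_total = query(cube, [])
```

With n dimensions the sketch stores 2^n pre-aggregated groupings, which is the space cost paid for fast lookups at query time.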

Druid: A Real-time Analytical Data Store

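Summary: Druid is a real-time data warehouse; its target scenarios and purpose are stated quite clearly, as follows. Druid was originally designed to solve problems around ingesting and exploring large quantities of transactional events (l Read more

posted @ 2019-12-16 15:47 fxjwind Views(527) Comments(0) Recommendations(0)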
