What Is a Data Lake Engine?
A data lake engine is an open source software solution or cloud service that provides analytical capabilities across a wide range of data sources through a unified set of APIs and a unified data model. Data lake engines address several key needs: simplifying access to data, accelerating analytical processing, securing and masking data, curating datasets, and providing a unified catalog of data across all sources.
The tools used by millions of data consumers, such as BI tools, data science platforms, and dashboarding tools, assume all data exists in a single, high-performance relational database. When data lives in multiple systems, or in non-relational stores such as ADLS, Amazon S3, Hadoop, and NoSQL databases, these tools are limited in what they can do. As a result, IT is tasked with moving this data into relational environments, creating cubes, and generating proprietary extracts for different analytical tools. Data lake engines address these challenges by allowing companies to leave data where it is already managed while providing fast access for data consumers, regardless of the tool they use.
Organizations deploy data lake engines in their own data centers or on cloud platforms such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform.
Architecture of a Data Lake Engine
Data lake engines sit between the systems that manage your data and the tools used to analyze, visualize, and process it. Rather than moving data from its sources into a single repository, data lake engines are deployed between existing data sources and the tools of data consumers, such as BI tools and data science platforms.
From the perspective of these tools, data lake engines are accessed using standard SQL over ODBC, JDBC, or REST, and the data lake engine takes care of accessing and securing the data as efficiently as possible, wherever it is managed.
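To make this concrete, here is a minimal sketch of what that access looks like from a client script, assuming the engine exposes an ODBC endpoint registered under a DSN named datalake. The DSN, credentials, and the sales.transactions table are hypothetical; any engine that speaks standard SQL over ODBC would look the same to the client.

```python
# Minimal sketch: querying a data lake engine over ODBC with pyodbc.
# The DSN name, credentials, and table are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect("DSN=datalake;UID=analyst;PWD=secret")
cursor = conn.cursor()

# To the client this is ordinary SQL; the engine decides how to reach
# the underlying source (S3, ADLS, Hadoop, NoSQL, etc.).
cursor.execute("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales.transactions
    GROUP BY region
    ORDER BY total_sales DESC
""")
for region, total_sales in cursor.fetchall():
    print(region, total_sales)

conn.close()
```

The same connection could just as easily be opened from a BI tool's ODBC driver dialog; nothing in the query reveals where the data physically lives.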
Benefits of a Data Lake Engine
BI and analytics tools such as Tableau, Power BI, R, Python, and machine learning models were designed for a world in which data lives in a single, high-performance relational database. However, most organizations manage their data in multiple solutions using different data formats and different technologies. Most organizations now utilize one or more non-relational datastores, such as cloud storage (e.g., S3, ADLS), Hadoop, and NoSQL databases (e.g., Elasticsearch, Cassandra). In addition, with the rise of microservices design patterns, the data for a single application is often fragmented and distributed across multiple datastores.
BI tools, data science systems, and machine learning models work best when data lives in a single, high-performance relational database. Unfortunately, that’s not where data lives today. As a result, IT typically applies a combination of custom ETL development and proprietary products to integrate data from different systems in order to improve access for analytics. In many organizations, the analytics stack includes the following layers:
- Data lake. The data is moved from various operational databases into a single staging area such as a cloud storage service (e.g., Amazon S3, ADLS).
- Data warehouse. While it is possible to execute SQL queries directly on Hadoop and cloud storage, these systems are simply not designed to deliver interactive performance. Therefore, a subset of the data is usually loaded into a relational data warehouse or MPP database.
- Cubes, aggregation tables, and BI extracts. In order to provide interactive performance on large datasets, the data must be pre-aggregated and/or indexed, either by building cubes in an OLAP system or by maintaining materialized aggregation tables in the data warehouse (a sketch of this step follows the list).
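As a hedged illustration of that third layer, the following sketch shows a scheduled batch job rebuilding a materialized aggregation table in the warehouse, so that dashboards scan a few thousand pre-aggregated rows instead of the raw fact table. The DSN, table, and column names are hypothetical, and the DROP TABLE IF EXISTS syntax varies by warehouse.

```python
# Hedged sketch of the "aggregation table" layer: a batch job that a
# scheduler reruns nightly to refresh a summary table in the warehouse.
import pyodbc

REBUILD_AGG = """
    CREATE TABLE sales_daily_agg AS
    SELECT order_date,
           region,
           COUNT(*)    AS order_count,
           SUM(amount) AS total_amount
    FROM sales_fact
    GROUP BY order_date, region
"""

conn = pyodbc.connect("DSN=warehouse", autocommit=True)
conn.execute("DROP TABLE IF EXISTS sales_daily_agg")  # rebuild from scratch
conn.execute(REBUILD_AGG)
conn.close()
```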
This multi-layer architecture introduces many challenges. For example:
- Flexibility. As data sources change or data requirements evolve, every layer of the stack must be revisited to ensure data pipelines and tools continue to work. Changes can take months or even quarters to implement, all with high risk.
- Complexity. Business analysts must understand which component to use for a given query, an unnecessary complexity that is easy to get wrong.
- IT-centric. This architecture inhibits business self-service. Any change at any stage can be performed only by IT, leaving business users unable to do things for themselves.
- Engineering cost. This architecture requires extensive custom ETL development, DBA expertise, and data engineering to address the evolving data needs of the business.
- Infrastructure cost. This architecture is extremely expensive because it requires numerous proprietary technologies and typically results in many copies of the data stored in different systems. In addition, the use of proprietary technologies leads to vendor lock-in.
- Data governance. This architecture leads to data sprawl, making it very difficult to track lineage or maintain tight security.
- Data freshness. It takes time to move data between systems, and more time to build the derived structures (cubes, extracts, aggregation tables) on top of it. In addition, the data pipeline becomes increasingly inefficient over time, resulting in longer cycles and even staler data.
Data lake engine platforms take a different approach to powering data analytics. Rather than moving data into a single repository, data lake engines access the data where it is managed, and perform any necessary transformations and integrations of data dynamically. In addition, data lake engines provide a self-service model that enables data consumers to explore, organize, describe, and analyze data regardless of its location, size, or structure, using their favorite tools such as Power BI, Tableau, Python, and R.
Some data sources are not optimized for analytical processing and cannot provide efficient access to the data. Data lake engines can optimize the physical layout of, and access paths to, data independently of the logical schema through which data consumers organize and access it. With this ability, individual datasets can be optimized without changing the way data consumers access the data, and without changing the tools they use. These optimizations can be made over time to address the evolving needs of data consumers, as the sketch below illustrates.
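One hedged illustration of this idea, using PyArrow rather than any particular engine: the same logical table is rewritten with a physical layout (sorted, date-partitioned Parquet) that matches the filters analysts use most. Consumers still see one logical table; only the layout changes. The paths and column names are hypothetical.

```python
# Hedged sketch: optimize physical layout without touching the logical
# schema. Date-range queries can now prune entire partition directories.
import pyarrow.parquet as pq

table = pq.read_table("/lake/raw/events.parquet")

# Sort rows within files and partition on disk by event_date.
table = table.sort_by([("event_date", "ascending")])
pq.write_to_dataset(
    table,
    root_path="/lake/optimized/events",
    partition_cols=["event_date"],
)
```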
Data Lake Engines Compared to Traditional Solutions
Organizations employ multiple technologies to make data accessible by data consumers. Data lake engines are different in that they integrate many of these capabilities into a single solution that is self-service.
Data Lake Engines Compared to Data Lakes
Data lakes are typically deployed on Amazon S3, ADLS, or Hadoop. They provide a flexible, file-oriented storage layer. Hadoop includes interfaces for querying the data, while S3 and ADLS rely on other services for performing analysis beyond basic file-level access.
Data lake engines are typically deployed alongside data lakes. The data lake is used in two distinct ways: 1) as a data source, and 2) as a persistence layer for metadata or any data acceleration-related data structures. Data lake engines provide many features that are complementary to the data lake, including:
- Fast SQL-based access
- A searchable data catalog
- A logical data model
- Data provenance & lineage
- Data curation
- Row and column-level access controls
- Self-service access for data consumers
Typically, raw data would be loaded into the data lake, and the data lake engine would make the data ready for analysis, easy to find, and fast to analyze. These capabilities are provided in a self-service model so that data consumers can perform these tasks on their own.
Data Lake Engines Compared to Data Warehouses
Data warehouses are specialized relational databases that are optimized for analytical workloads. Data lake engines are typically deployed alongside data warehouses to simplify access to this data and to allow for uniform access to the data warehouse along with other sources, such as data lakes and NoSQL systems.
Data lake engines provide many features that are complementary to the data warehouse, including:
- Joins between the data warehouse and other systems (see the sketch after this list)
- SQL-based access to non-relational data sources such as NoSQL
- A searchable data catalog
- A logical data model
- Data provenance & lineage
- Data curation
- Self-service access for data consumers
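To illustrate the first item in this list, here is a hedged sketch of a federated join issued through a data lake engine: one SQL statement joins a warehouse table with a NoSQL event store that the engine exposes under a SQL namespace. The warehouse and eventstore namespaces, tables, and columns are hypothetical, and the naming convention for sources varies by engine.

```python
# Hedged sketch of a federated join: the engine resolves each namespace
# to a different backend and executes the join across them.
import pyodbc

FEDERATED_JOIN = """
    SELECT c.customer_id,
           c.segment,                -- from the data warehouse
           e.page,                   -- from a NoSQL / event store
           COUNT(*) AS views
    FROM warehouse.customers AS c
    JOIN eventstore.pageviews AS e
      ON e.customer_id = c.customer_id
    GROUP BY c.customer_id, c.segment, e.page
"""

conn = pyodbc.connect("DSN=datalake")
rows = conn.execute(FEDERATED_JOIN).fetchall()
conn.close()
```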
Data Lake Engines Compared to Cubes, Extracts, and Aggregation Tables
Cubes, extracts, and aggregation tables provide optimized access to data for different analytical tools. Organizations use these technologies by making a copy of the data and transforming it into one of these data structures. In order to create these resources, IT must have a strong understanding of the types of queries that will be issued by data consumers. As a result, designing and building these resources is typically a multi-week or multi-month exercise.
These resources tend to require a relational database as the input to their build process, so data must first be loaded into a relational database if it is non-relational (e.g., JSON). This can add significant time and effort to the work of building and maintaining cubes, extracts, and aggregation tables.
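A hedged sketch of that extra loading step, using pandas to flatten JSON into a relational staging table that a cube or extract build can then read. The file, fields, and connection string are hypothetical.

```python
# Hedged sketch: flatten non-relational (JSON) data into a relational
# staging table before cubes, extracts, or aggregation tables can be
# built over it.
import json
import pandas as pd
from sqlalchemy import create_engine

with open("orders.json") as f:
    records = json.load(f)

# Flatten nested JSON documents into rows and columns.
df = pd.json_normalize(records, sep="_")

# Load into a relational staging table for the downstream build.
engine = create_engine("postgresql://analyst:secret@warehouse/staging")
df.to_sql("orders_flat", engine, if_exists="replace", index=False)
```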
Some data lake engines provide an alternative to these techniques that is similar in concept to a materialized view in a relational database. Data is physically organized to suit different query patterns, for example by sorting, aggregating, or partitioning it. These data structures are then maintained in a columnar, compressed representation that is invisible to data consumers.
The query planner of the data lake engine determines which materializations of the data can be used to generate a more efficient query plan. Just as with materialized views, data consumers benefit from faster queries without changing their behavior. Unlike materialized views, data lake engines can perform this optimization on non-relational data, with no dependency on a relational database for storing or querying it.
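The following hedged sketch shows the substitution idea. The CREATE MATERIALIZED VIEW statement is generic relational-style SQL used purely for illustration; real engines expose this capability through their own syntax, UI, or API, and the names here are hypothetical.

```python
# Hedged illustration of planner-based substitution: the consumer's
# query never changes, but the planner can answer it from a maintained
# materialization instead of scanning the raw data.
import pyodbc

conn = pyodbc.connect("DSN=datalake", autocommit=True)

# One-time setup: ask the engine to maintain an aggregated
# materialization of a large raw dataset.
conn.execute("""
    CREATE MATERIALIZED VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total_amount
    FROM lake.sales_raw
    GROUP BY region
""")

# Consumers keep querying the raw table; if the planner determines that
# sales_by_region can answer this query, it rewrites the plan to use it.
rows = conn.execute("""
    SELECT region, SUM(amount)
    FROM lake.sales_raw
    GROUP BY region
""").fetchall()
conn.close()
```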