Enterprises aiming to maximize their data assets are adopting scalable, flexible and unified approaches to data storage and analysis. This trend is driven by enterprise architects, who are tasked with building infrastructure that meets changing business needs. Modern data lake architecture meets this need by integrating the scalability and flexibility of the data lake with the structure and performance optimization of the data warehouse. This article provides a reference architecture for understanding and implementing a modern data lake.
What is a modern data lake?
A modern data lake is one part data warehouse and one part data lake, with everything stored as objects. That may sound like a marketing trick—bundle two products together and call it something new—but the data warehouse described in this article is better than a traditional one: it is built on object storage and therefore inherits object storage's scalability and performance. Organizations taking this approach pay only for the capacity they need (enabled by the scalability of object storage) and achieve performance by backing the object store with NVMe drives connected over a high-end network.
Using object storage in this way was made possible by the rise of open table formats (OTFs) such as Apache Iceberg, Apache Hudi and Delta Lake. These specifications make it seamless to use object storage as the underlying storage solution for a data warehouse. They also provide features that traditional data warehouses may lack, including snapshots (also known as time travel), schema evolution, partitioning, partition evolution and zero-copy branching.
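To make the snapshot (time travel) feature concrete, here is a minimal toy sketch in Python. It is not the Iceberg, Hudi or Delta Lake API—the `ToyTable` class and its methods are invented for illustration—but it shows the core idea: every commit produces an immutable snapshot that can be read later, exactly as it was.

```python
import copy

class ToyTable:
    """A toy table that records an immutable snapshot after every commit,
    mimicking the 'time travel' feature of open table formats."""

    def __init__(self):
        self.rows = []
        self.snapshots = []  # list of (snapshot_id, frozen copy of rows)

    def commit(self, new_rows):
        """Append rows and freeze the table state as a new snapshot."""
        self.rows.extend(new_rows)
        snapshot_id = len(self.snapshots)
        self.snapshots.append((snapshot_id, copy.deepcopy(self.rows)))
        return snapshot_id

    def read(self, snapshot_id=None):
        """Read the latest state, or the table as of an older snapshot."""
        if snapshot_id is None:
            return self.rows
        return self.snapshots[snapshot_id][1]

table = ToyTable()
s0 = table.commit([{"id": 1, "amount": 10}])
s1 = table.commit([{"id": 2, "amount": 20}])
latest = table.read()    # two rows after the second commit
first = table.read(s0)   # time travel: one row, as of the first commit
```

Real OTFs implement this with metadata files in object storage rather than in-memory copies, which is what lets a query engine read any historical version without duplicating data.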
But a modern data lake is more than just a fancy data warehouse: it also includes a data lake for unstructured data. OTFs provide integration with external data in the data lake, allowing that data to be queried as SQL tables when needed. Alternatively, you can use a high-speed processing engine and familiar SQL commands to transform external data and route it into the data warehouse.
A modern data lake is therefore more than a data warehouse and a data lake under a new name. Together, the two halves provide more value than a traditional data warehouse or a standalone data lake.
Conceptual architecture
Layering is a convenient way to present the components and services a modern data lake needs. Layers group services that provide similar functionality and establish a hierarchy, with consumers at the top and data sources (and their raw data) at the bottom. From top to bottom, the layers of a modern data lake are:
- Consumer layer: Contains tools for advanced users to analyze data, as well as applications and AI/ML workloads that access the modern data lake programmatically.
- Semantic layer: Optional metadata layer for data discovery and management.
- Processing layer: Contains the compute clusters needed to query the modern data lake, as well as compute clusters for distributed model training. Because the data lake and data warehouse share the storage layer, complex transformations can be performed in this layer.
- Storage layer: Object storage is the primary storage service in a modern data lake; however, machine learning operations (MLOps) tooling may require other storage services, such as a relational database. If you are pursuing generative AI, you will also need a vector database.
- Ingestion layer: Contains the services needed to receive data. An advanced ingestion layer can also retrieve data on a schedule. A modern data lake should support multiple protocols and accept data arriving in both streams and batches. Simple and complex data transformations can occur in the ingestion layer.
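The point that ingestion must handle both batches and streams can be sketched with a toy example. The function names and the newline-delimited JSON format below are assumptions for illustration, not part of any specific product: the batch path parses a whole file at once, the streaming path yields records as they arrive, and both land in the same raw store.

```python
import io
import json

def ingest_batch(fileobj):
    """Batch path: parse a newline-delimited JSON file in one pass."""
    return [json.loads(line) for line in fileobj if line.strip()]

def ingest_stream(events):
    """Streaming path: yield one record at a time as events arrive."""
    for event in events:
        yield json.loads(event)

# Both paths normalize records into the same shape before landing
# them in the raw ("landing zone") bucket of the data lake.
batch = ingest_batch(io.StringIO('{"id": 1}\n{"id": 2}\n'))
stream = list(ingest_stream(['{"id": 3}']))
landing_zone = batch + stream
```

In a real deployment the landing zone would be an object storage bucket and the stream would come from a broker such as Kafka, but the shape of the code—one normalizing step shared by both arrival modes—stays the same.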
- Source layer: Technically, the data sources are not part of the modern data lake itself, but they are included in this article because a well-built modern data lake must support a variety of sources with differing capabilities for sending data.
The following diagram visually depicts these layers and the functionality that may be required to implement them. This is an end-to-end architecture, and the core of the platform is a modern data lake. The diagram also shows the components required to ingest, transform, discover, manage, and use data. It also describes the tools needed to support important use cases that rely on modern data lakes, such as MLOps storage, vector databases, and machine learning clusters.
The storage and processing layers are the core of the modern data lake. These two layers also feature the fastest-growing technologies for building data infrastructure: data warehouses built on open table formats, high-speed object storage and vector databases.
Storage layer
The storage layer is the cornerstone on which all other layers depend. Its purpose is to store data reliably and serve it efficiently. It includes separate object storage services for the data lake and the data warehouse sides of the modern data lake.
If desired, the two object storage services can be combined into a single physical instance of the object store, with buckets separating data warehouse storage from data lake storage. However, if your consumption layer and data pipelines will place different workloads on the two services, consider separating them and running them on different hardware.
For example, a common data flow is to put all new data into a data lake. It can then be transformed and ingested into a data warehouse for use by other applications and for data science and data analysis. In this data flow, a modern data lake places more load on your data warehouse, so you’ll want to run it on higher-end hardware (storage devices, storage clusters, and networks).
The external tables feature allows data warehouses and processing engines to read objects in the data lake as if they were SQL tables. If the data lake is used as a landing zone for raw data, this feature, together with the data warehouse's SQL capabilities, can be used to transform raw data before inserting it into the warehouse. Alternatively, external tables can be used as-is and joined with other tables and resources inside the data warehouse without ever leaving the data lake. This pattern keeps data in one location while making it available to external services, saving migration costs and sidestepping some data security issues.
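The external-table pattern can be illustrated with a small simulation. This is not how an engine like Trino or a warehouse implements external tables—they map objects in a bucket to a table without copying the data—but the toy below (using Python's built-in SQLite and invented table names) shows the payoff: raw lake objects become queryable with SQL and can be joined against warehouse tables in a single query.

```python
import json
import sqlite3

# Toy "data lake": raw JSON objects, as they might sit in an
# object storage bucket used as a landing zone.
lake_objects = ['{"sku": "A1", "qty": 3}', '{"sku": "B2", "qty": 5}']

conn = sqlite3.connect(":memory:")

# Warehouse-managed dimension table.
conn.execute("CREATE TABLE products (sku TEXT, name TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [("A1", "widget"), ("B2", "gadget")])

# Simulated "external table": expose the lake objects to SQL. A real
# engine would read the objects in place instead of loading them.
conn.execute("CREATE TEMP TABLE ext_orders (sku TEXT, qty INTEGER)")
for obj in lake_objects:
    rec = json.loads(obj)
    conn.execute("INSERT INTO ext_orders VALUES (?, ?)",
                 (rec["sku"], rec["qty"]))

# Join external (lake) data with a warehouse table in one query.
rows = conn.execute("""
    SELECT p.name, e.qty
    FROM ext_orders e
    JOIN products p ON p.sku = e.sku
    ORDER BY p.name
""").fetchall()
```

The same `JOIN` could feed an `INSERT INTO ... SELECT` that transforms raw lake data into a warehouse table, which is the ingestion pattern the paragraph above describes.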
You can also use this reference architecture to pursue AI/ML strategies, but that is beyond the scope of this article. Our AI/ML Modern Data Lake Reference Architecture provides information on building an AI data infrastructure.
Processing layer
The processing layer contains the computation required for all workloads supported by a modern data lake. At a high level, there are two types of computing: processing engines for data warehousing and clusters for distributed machine learning.
The data warehouse processing engine supports distributed execution of SQL commands on data stored in the warehouse. Transformations that are part of the ingestion process may also require compute from the processing layer. For example, in some data warehouses you may want to use a medallion architecture; in others you may choose a star schema with dimension tables. These designs often require extensive extraction, transformation and loading (ETL) of raw data during ingestion.
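To make the star-schema transformation concrete, here is a minimal sketch, assuming invented column names and data: raw ingested events are split into a dimension table with surrogate keys and a fact table that references those keys. A production pipeline would do this in SQL or Spark at much larger scale, but the shape of the transformation is the same.

```python
# Raw events, as they might land from the ingestion layer.
raw = [
    {"order_id": 1, "customer": "Ada", "amount": 10.0},
    {"order_id": 2, "customer": "Ada", "amount": 15.0},
    {"order_id": 3, "customer": "Bob", "amount": 7.5},
]

# Dimension table: one row per customer, keyed by a surrogate key.
dim_customer = {}
for row in raw:
    dim_customer.setdefault(row["customer"], len(dim_customer) + 1)

# Fact table: replace the natural key (the name) with the surrogate
# key so facts stay narrow and dimensions can evolve independently.
fact_orders = [
    {"order_id": r["order_id"],
     "customer_key": dim_customer[r["customer"]],
     "amount": r["amount"]}
    for r in raw
]
```

Analytical queries then join `fact_orders` back to `dim_customer`, which is exactly the workload the dedicated processing clusters described below are sized for.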
The data warehouse in a modern data lake separates compute from storage. Therefore, multiple processing engines can exist within a single data warehouse if required. (This is different from a traditional relational database, in which compute and storage are tightly coupled and each storage device has a fixed compute resource.)
One possible processing layer design is to have a processing engine for each entity in the consumer layer. For example, use one processing cluster for business intelligence (BI), a separate cluster for data analytics, and another cluster for data science. Each processing engine queries the same data warehouse service, but because each team has its own dedicated cluster, they don’t compete with each other for computation. If the BI team is performing a computationally intensive month-end report, they will not interfere with other teams running daily reports.
Machine learning models, especially large language models, train faster when training is distributed. Machine learning clusters support distributed training, which should be integrated with MLOps tooling for experiment tracking and checkpointing.
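Checkpointing is what lets a long distributed job survive a worker failure. A minimal sketch, with invented function names and a JSON file standing in for real checkpoint storage (frameworks typically write checkpoints to object storage or a shared filesystem):

```python
import json
import os
import tempfile

def save_checkpoint(path, step, weights):
    """Persist training state so an interrupted job can resume."""
    with open(path, "w") as f:
        json.dump({"step": step, "weights": weights}, f)

def load_checkpoint(path):
    """Restore the last saved training state."""
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
save_checkpoint(path, step=100, weights=[0.1, 0.2])

# A restarted worker picks up from step 100 instead of step 0.
state = load_checkpoint(path)
```

MLOps tools layer experiment metadata (which run, which hyperparameters) on top of this basic save-and-resume loop.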
Summary
This article introduces a high-level reference architecture for a modern data lake and explores its core components. The goal is to provide organizations with a strategic blueprint for establishing a platform to effectively manage and extract value from their large and diverse data sets.
Modern data lakes combine the advantages of OTF-based data warehouses and flexible data lakes to provide a unified, scalable solution for storing, processing and analyzing data. If you would like to learn more about these concepts, please contact the MinIO team at hello@min.io.