This article was contributed by Gunasekaran S., director of data engineering at Sigmoid.
Over the years, cloud data lake and warehousing architectures have helped enterprises scale their data management efforts while lowering costs. Conventionally, the data management architecture starts by extracting enterprise data from operational data repositories and storing it in a raw data lake. The next step is to run another round of ETL processes that move critical subsets of this data into a data warehouse, where it is used to generate business insights for decision-making. However, this set-up has several challenges, such as:
- Lack of consistency: Companies often find it difficult to keep their data lake and data warehouse consistent. Doing so is costly, and teams must also run continuous data engineering work to ETL/ELT data between the two systems. Each step can introduce failures and bugs that degrade overall data quality.
- Constantly changing datasets: The data stored in a data warehouse may not be as current as the data in a data lake; how far it lags depends on the schedule and frequency of the data pipeline.
- Vendor lock-in: Shifting large volumes of data into a centralized enterprise data warehouse (EDW) is challenging not only because of the time and resources the task requires, but also because the architecture creates a closed loop that causes vendor lock-in. Additionally, data stored in warehouses is harder to share with all data end users within an organization.
- Poor maintainability: With data lakes and data warehouses, companies need to maintain multiple systems and keep them synchronized, which makes the overall system complex and difficult to maintain in the long run.
- Data governance: While data in the data lake tends to sit in a variety of file-based formats, data in a warehouse is mostly held in database format, which adds complexity to data governance and lineage.
- Advanced analytics limitations: Machine learning frameworks such as PyTorch and TensorFlow aren’t fully compatible with data warehouses. Applications built on them fetch data from data lakes, where data quality is often not governed.
- Data copies and associated costs: Keeping data in both data lakes and data warehouses creates multiple copies of the same data, each with its own associated costs. Moreover, commercial warehouses that store data in proprietary formats increase the cost of migrating data.
A data lakehouse addresses these typical limitations of a data lake and data warehouse architecture by combining the best elements of both data warehouses and data lakes to deliver significant value for organizations.
The data lakehouse: A brief overview
A data lakehouse is essentially the next breed of cloud data lake and warehousing architecture that combines the best of both worlds. It is an architectural approach for managing all data formats (structured, semi-structured, or unstructured) as well as supporting multiple data workloads (data warehouse, BI, AI/ML, and streaming). Data lakehouses are underpinned by a new open system architecture that lets data teams implement data structures and data management features similar to those of data warehouses directly on top of the kind of low-cost storage used in data lakes.
A data lakehouse architecture allows data teams to glean insights faster as they have the opportunity to harness data without accessing multiple systems. A data lakehouse architecture can also help companies ensure that data teams have the most accurate and updated data at their disposal for mission-critical machine learning, enterprise analytics initiatives, and reporting purposes.
The benefits of a data lakehouse
There are several reasons to look at a modern data lakehouse architecture in order to drive sustainable data management practices. The following are some of the key factors that make a data lakehouse a strong option for enterprise data storage initiatives:
- Data quality delivered through simplified schema: A data lakehouse has a dual-layered architecture in which a warehouse layer sits on top of a data lake and enforces schema. This provides data quality and control and enables faster BI and reporting.
- Reduction of data drift: A data lakehouse architecture mitigates the need for multiple data copies and significantly reduces challenges related to data drift.
- Faster query: Faster interactive query coupled with true data democratization facilitates more informed decision-making. The architecture allows data scientists, engineers, and analysts to quickly access the required data. This results in a faster time-to-insight cycle.
- Effective administration: A data lakehouse architecture can save data teams significant time and effort because storing data, processing it, and delivering business insights all require fewer resources. A single platform for data management can also significantly reduce the administrative burden.
- Seamless data governance: A data lakehouse serves as a single source, thereby allowing data teams to embed advanced features such as audit logging and access control.
- Effective data access and data security: Data lakehouses give data teams the option to maintain the right access controls and encryption across pipelines for data integrity. Additionally, in a data lakehouse model, data teams do not have to manage security for multiple data copies, which makes security administration easier and more cost-effective.
- Low chances of data redundancy: Because a single platform serves the workloads that previously required both a data lake and a data warehouse, a data lakehouse removes the need to keep multiple copies of the same data, which in turn reduces data drift.
- High scalability: A data lakehouse offers high scalability of both data and metadata. This allows companies to run critical analytics projects with a fast time-to-insight cycle.
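To make the schema enforcement benefit listed above more concrete, here is a minimal Python sketch of validating records at write time. It is an illustration only and not how any particular lakehouse product implements it: `SchemaEnforcedTable`, `SchemaViolation`, and the `orders` example are hypothetical names, and an in-memory list stands in for files on low-cost storage.

```python
from dataclasses import dataclass, field


class SchemaViolation(ValueError):
    """Raised when a record does not match the table's declared schema."""


@dataclass
class SchemaEnforcedTable:
    # Hypothetical stand-in for a lakehouse table: a declared schema plus
    # an in-memory list playing the role of files on low-cost storage.
    schema: dict  # column name -> expected Python type
    rows: list = field(default_factory=list)

    def append(self, record: dict) -> None:
        """Validate a record against the schema, then store it."""
        missing = self.schema.keys() - record.keys()
        extra = record.keys() - self.schema.keys()
        if missing or extra:
            raise SchemaViolation(
                f"missing columns {sorted(missing)}, "
                f"unexpected columns {sorted(extra)}"
            )
        for column, expected_type in self.schema.items():
            if not isinstance(record[column], expected_type):
                raise SchemaViolation(
                    f"column {column!r} expects {expected_type.__name__}, "
                    f"got {type(record[column]).__name__}"
                )
        self.rows.append(dict(record))


orders = SchemaEnforcedTable(schema={"order_id": int, "amount": float})
orders.append({"order_id": 1, "amount": 19.99})

try:
    # A malformed record is rejected instead of silently landing in the lake.
    orders.append({"order_id": "2", "amount": 5.0})
except SchemaViolation as err:
    rejected_reason = str(err)
```

The design point is that the check happens on write, so downstream BI and reporting queries can rely on the declared columns and types rather than re-validating raw files.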
Emerging data lakehouse patterns
The Azure Databricks Lakehouse and Snowflake are the two leading lakehouse platforms that companies can leverage for their data management initiatives. However, the decision to opt for one should be based on a company’s requirements. Several companies use these platforms together, with Databricks handling data processing and Snowflake providing data warehousing capabilities. Over time, each platform has gradually started building the capabilities that the other offers in the quest to become the platform of choice for multiple workloads.
Now, let’s have a look at these distinct lakehouse patterns and how they have evolved over time.
Databricks: A data processing engine on data lakes adding data lakehouse capabilities
Databricks is essentially an Apache Spark-driven data processing tool that provides data teams with an agile programming environment with auto-scalable computing capability. Companies pay only for the computational resources they use. The Databricks platform is best suited for data processing at early stages in the pipeline, where data needs to be prepared and ingested. Companies can also leverage it to prepare data for transformation and enrichment, but it falls short when it comes to processing data for reporting.
In the last few years, Databricks has focused on building capabilities around traditional data warehouses. The platform comes with a built-in SQL query interface and intuitive visualization features. Apart from this, Databricks also comes with a database-like table structure built on the Delta file format. This format is leveraged to add database capabilities to data lakes: it supports ACID transactions, schema enforcement, and data versioning.
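The idea of adding transactions and data versioning on top of plain files can be sketched with a toy commit log. The Python below is a simplified illustration under stated assumptions, not the Delta format's actual implementation: `VersionedTable` is a hypothetical class, every version is stored as a full in-memory snapshot, and the only validation rule (each row needs an `id`) is invented for the example.

```python
import copy


class VersionedTable:
    """Toy append-only commit log; each commit produces a new table version."""

    def __init__(self):
        # Version 0 is the empty table. Each entry is a full snapshot, which
        # keeps the sketch short; a real system would log changes instead.
        self._snapshots = [[]]

    @property
    def latest_version(self) -> int:
        return len(self._snapshots) - 1

    def commit(self, new_rows: list) -> int:
        """Apply a batch atomically: all rows become visible, or none do."""
        candidate = copy.deepcopy(self._snapshots[-1])
        for row in new_rows:
            if "id" not in row:
                # Failing here leaves the log untouched, so readers never
                # observe a half-written batch.
                raise ValueError(f"row without 'id': {row}")
            candidate.append(dict(row))
        self._snapshots.append(candidate)
        return self.latest_version

    def read(self, version=None) -> list:
        """Read the latest snapshot, or an earlier one by version number."""
        if version is None:
            version = self.latest_version
        return copy.deepcopy(self._snapshots[version])


table = VersionedTable()
v1 = table.commit([{"id": 1, "status": "new"}])
v2 = table.commit([{"id": 2, "status": "new"}])

try:
    table.commit([{"id": 3}, {"status": "broken"}])  # second row is invalid
except ValueError:
    pass  # the whole batch is discarded, including the valid first row

current_rows = table.read()             # two rows, from versions 1 and 2
rows_as_of_v1 = table.read(version=1)   # one row, as the table looked at v1
```

Reading an older version by number is what makes audits and reproducible reports possible without keeping separate copies of the data.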
Key differentiators of the Azure Databricks lakehouse
- Comes with a ready-to-use Spark environment with no need for configuration
- Embedded open-source Delta Lake technology that serves as an additional storage layer
- Delivers better performance by consolidating smaller files in Delta tables
- ACID functionality in Delta tables helps ensure data consistency and integrity
- Has several language options such as Scala, Python, R, Java, and SQL
- Platform supports interactive data analysis with notebook-style coding
- Provides seamless integration options with other cloud platform services such as Blob Storage, Azure Data Factory, and Azure DevOps
- Provides open source library support
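The point above about consolidating smaller files can be illustrated with a short planning routine. This is a hedged sketch of the general small-file compaction idea, not the algorithm Databricks or Delta Lake actually uses: `plan_compaction` is a hypothetical function, and the 128 MB target size and the example file sizes are assumptions chosen for the illustration.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Group small files into batches of at most target_mb each.

    Returns a list of batches; each batch is a list of file sizes that would
    be rewritten together as one larger file. Files already at or above the
    target are left alone.
    """
    batches, current, current_size = [], [], 0
    for size in sorted(s for s in file_sizes_mb if s < target_mb):
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    # A batch holding a single file gains nothing from being rewritten.
    return [batch for batch in batches if len(batch) > 1]


small_files = [4, 8, 16, 200, 32, 64, 2, 100]   # sizes in MB (made up)
plan = plan_compaction(small_files, target_mb=128)

files_before = len(small_files)
files_after = files_before - sum(len(batch) for batch in plan) + len(plan)
```

With these example sizes, the six smallest files are merged into one 126 MB file, so a query has three files to open instead of eight. Fewer, larger files generally mean less per-file overhead when scanning a table.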
Snowflake: Cloud data warehouse extending to address data lake capabilities
Unlike Databricks, Snowflake transformed the data warehousing space a few years back by offering highly scalable, distributed computation capability. The platform achieved this by separating storage from processing in a data warehouse ecosystem. This separation is one of the approaches Snowflake has used to expand its solution into the data lake space.
Over the years, Snowflake has been gradually expanding its ELT capabilities, allowing companies to run their ELT processes in conjunction with the platform. For instance, while some companies leverage Snowflake Streams and Tasks to complete SQL tasks in Snowflake, others use dbt with Snowflake.
Key differentiators of the Snowflake data lakehouse
- Comes with built-in export and query tools
- The platform can seamlessly connect with BI tools such as Metabase, Tableau, Power BI, and more
- The platform supports JSON format for querying and output of data
- Provides secured and compressed storage options for semi-structured data
- Can be connected easily with object storage such as Amazon S3
- Comes with granular security to deliver maximum data integrity
- There’s no noticeable limit to the size of a query
- Presence of standard SQL dialect and robust function library
- Comes with virtual warehouses that allow data teams to separate and categorize workloads according to requirements
- Promotes secure data sharing and simple integration with other cloud technologies
Dremio and Firebolt – SQL lakehouse engine on data lake
Besides Snowflake and Databricks, data lakehouse tools such as Dremio and Firebolt are also emerging with advanced querying capabilities. Dremio’s SQL Lakehouse platform, for instance, can deliver high-performance dashboards and intuitive analytics directly on any data lake storage, thereby eliminating the need for a data warehouse. Similarly, Firebolt comes with advanced indexing capabilities that help data teams shrink data access down to data ranges that are even smaller than partitions.
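The idea of narrowing data access to ranges smaller than a partition can be sketched with a simple sparse index. The Python below illustrates the general technique only and is not Firebolt's actual indexing implementation: `SparseRangeIndex` is a hypothetical class, the block size of four is arbitrary, and the sketch assumes the indexed column is already sorted.

```python
from bisect import bisect_left, bisect_right


class SparseRangeIndex:
    """Toy sparse index over a sorted column, split into fixed-size blocks."""

    def __init__(self, sorted_values, block_size=4):
        self.blocks = [
            sorted_values[i:i + block_size]
            for i in range(0, len(sorted_values), block_size)
        ]
        # Only the first and last value of each block are kept in the index.
        self.block_mins = [block[0] for block in self.blocks]
        self.block_maxs = [block[-1] for block in self.blocks]

    def query(self, low, high):
        """Return (matching values, number of blocks actually scanned)."""
        # Skip blocks whose max is below `low` or whose min is above `high`.
        first = bisect_left(self.block_maxs, low)
        last = bisect_right(self.block_mins, high)
        hits = [
            value
            for block in self.blocks[first:last]
            for value in block
            if low <= value <= high
        ]
        return hits, max(0, last - first)


order_ids = list(range(100, 140))   # one "partition" of 40 sorted keys
index = SparseRangeIndex(order_ids, block_size=4)
hits, blocks_scanned = index.query(118, 121)
```

In this example the partition is split into ten blocks, and the query for keys 118 through 121 reads only two of them. The index itself stays small because it stores two values per block rather than one entry per row.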
An evolution over cloud data lakes and warehouses
A data lakehouse is an evolution over cloud data lake and warehousing architectures that provides data teams with an opportunity to capitalize on the best of both worlds while mitigating all historical data management weaknesses. When done right, a data lakehouse initiative can free up the data and enable a company to use it the way it wants and at the desired speed.
Going forward, as cloud data warehouse and data lake architectures converge, companies may soon find vendors that combine all the capabilities of all the data lakehouse tools. This may open up endless opportunities when it comes to building and managing data pipelines.