Lakehouse Architecture

The lakehouse combines the cheap, flexible storage of a data lake with the transactional guarantees, SQL support, and performance of a data warehouse. It is not a new product you buy. It is an architecture pattern: open table formats on object storage, queried by engines that provide warehouse-grade performance.

The Evolution

Data architectures have gone through three phases:

Phase 1: Data Warehouse (1990s-2010s)
  Sources -> ETL -> Warehouse (Teradata, Oracle, Netezza)
  Pros: SQL, ACID, fast queries, well-understood
  Cons: Expensive, rigid schema, vendor lock-in, no unstructured data

Phase 2: Data Lake (2010s)
  Sources -> Dump into HDFS/S3 -> Spark/Hive
  Pros: Cheap, any data format, scalable
  Cons: No ACID, no schema enforcement, "data swamp" risk, slow queries

Phase 3: Lakehouse (2020s)
  Sources -> Load into Object Storage (Parquet + Iceberg/Delta) -> SQL engines
  Pros: Cheap storage, ACID, SQL, schema enforcement, open formats
  Cons: More moving parts than a managed warehouse

The lakehouse does not replace the warehouse in all cases. It replaces the two-system architecture where you maintain both a data lake and a warehouse, copying data between them.

The Two-System Problem

Many organizations run both a data lake and a warehouse:

Traditional two-system architecture:
  Sources -> Data Lake (S3 + Spark)
                |
                v
          ETL/ELT copies data
                |
                v
          Data Warehouse (Snowflake/BigQuery/Redshift)
                |
                v
          Dashboards, reports, ML models

This creates real problems:

Data duplication. The same data exists in the lake and the warehouse. Storage costs double. Keeping them in sync is an ongoing engineering burden.
Stale data. The warehouse copy lags behind the lake. Analysts query stale data without knowing it.
Governance complexity. Access controls must be maintained in two systems. Lineage tracking spans two platforms.
Cost. You pay for warehouse storage (expensive) in addition to object storage (cheap).

The lakehouse eliminates the copy by making the lake queryable with warehouse-grade performance.

Core Components

A lakehouse architecture has four main components: object storage, a table format, a catalog, and query engines.

Object Storage

The foundation. All data lives in S3, GCS, or Azure Blob Storage as Parquet files managed by a table format.

s3://company-lakehouse/
  raw/
    stripe/charges/...
    salesforce/opportunities/...
  staging/
    stg_charges/...
    stg_opportunities/...
  marts/
    revenue/...
    customers/...

Table Format

Delta Lake, Apache Iceberg, or Apache Hudi. The table format provides ACID transactions, time travel, schema evolution, and efficient metadata for query planning.

Catalog Service

The catalog is the registry that maps table names to their physical locations and metadata. Without a catalog, every query would need to know the exact S3 path and metadata location for each table.

Catalog services:
  Hive Metastore    - The original. Open source, widely supported, shows its age.
                      Stores table metadata in a relational database (MySQL/Postgres).
                      Works with all three table formats.

  AWS Glue Catalog  - AWS-managed, Hive-compatible. Serverless, no infrastructure
                      to manage. Integrates with Athena, Redshift Spectrum, EMR.

  Unity Catalog     - Databricks' catalog. Fine-grained access control, lineage,
                      data sharing. Tightly integrated with Delta Lake.
                      Recently open-sourced.

  Nessie            - Git-like catalog for Iceberg. Branching and tagging for
                      tables. Interesting for CI/CD on data.

  Polaris           - Snowflake's open-source Iceberg catalog.
                      REST-based, designed for multi-engine interop.

  AWS Lake Formation - Access control and governance layer on top of Glue Catalog.

The catalog choice affects which engines can query your data and how you manage access control. For Iceberg-based lakehouses, a REST catalog (Polaris, Nessie, or the Iceberg REST spec) provides the broadest engine compatibility.

Query Engines

The engine executes SQL against data in the lakehouse. Different engines suit different workloads.

Apache Spark

The workhorse for batch processing and heavy transformations. Best for:

Large-scale ETL/ELT pipelines
Machine learning feature engineering
Processing that requires complex logic beyond SQL

Spark trade-offs:
  + Handles massive scale (petabytes)
  + Rich API (SQL, Python, Scala, Java)
  + Native support for all three table formats
  - High latency for interactive queries (cluster startup time)
  - Expensive for small queries
  - Operational overhead (cluster management)

Trino (formerly PrestoSQL)

A distributed SQL engine designed for interactive analytics. Best for:

Ad-hoc queries by analysts
Dashboard backends
Federated queries across multiple data sources

Trino trade-offs:
  + Fast interactive queries (seconds, not minutes)
  + Federated: query S3, Postgres, MySQL, and Kafka in one query
  + No data movement: queries data where it lives
  - Not designed for heavy ETL workloads
  - Requires a running cluster
  - Memory-intensive for large joins

DuckDB

An embedded analytical database. Best for:

Local development and testing
Single-machine analytics on moderate-sized data
Notebook-based exploration

-- DuckDB: query Iceberg tables directly from your laptop
SELECT region, SUM(revenue) AS total_revenue
FROM iceberg_scan('s3://lakehouse/marts/revenue/')
WHERE order_date >= '2025-01-01'
GROUP BY region;

DuckDB trade-offs:
  + Zero infrastructure (embedded, runs in-process)
  + Extremely fast on single-machine workloads
  + Native Parquet and Iceberg support
  + Free
  - Single machine: limited by local RAM and CPU
  - Not suitable for production multi-user workloads
  - No built-in access control

Other Engines

Databricks SQL: Managed engine optimized for Delta Lake. Warehouse-like experience on lakehouse data.
Snowflake: Can query Iceberg tables in S3 via external tables. Brings Snowflake's engine to lakehouse data.
BigQuery: Supports Iceberg and BigLake for querying data in GCS with BigQuery's engine.
StarRocks / ClickHouse: High-performance OLAP engines with lakehouse connector support.

A Practical Lakehouse Architecture

Data Sources
    |
    v
Ingestion (Fivetran, Airbyte, Kafka, custom)
    |
    v
Object Storage (S3/GCS)
  + Iceberg table format
  + Catalog (Glue / Polaris / Nessie)
    |
    +-- Spark (batch transforms, dbt on Spark)
    |
    +-- Trino (interactive queries, dashboards)
    |
    +-- DuckDB (local dev, notebooks)
    |
    v
Consumers (BI tools, ML pipelines, applications)

Orchestration

The catalog and table format handle storage. But you still need orchestration to coordinate:

When ingestion jobs run
When dbt models rebuild
When compaction and garbage collection happen
Alerting when jobs fail

Airflow, Dagster, and Prefect are the common choices. The orchestrator calls Spark or dbt to run transformations, then triggers downstream consumers.

Lakehouse vs Warehouse: When to Use Which

Use a Lakehouse When

You have diverse workloads. SQL analytics, machine learning, streaming, and batch processing all need the same data. A lakehouse lets each workload use the right engine.
You want to avoid vendor lock-in. Data in open formats (Parquet + Iceberg) on your object storage. You can switch engines without migrating data.
You have large data volumes. Object storage at $0.02/GB/month is significantly cheaper than warehouse storage at$ 0.04-0.10/GB/month. At petabyte scale, this difference is material.
You need to support non-SQL workloads. ML training on raw data, image processing, log analysis. Warehouses are SQL-only.

Use a Warehouse When

Your team is small. A lakehouse has more moving parts: object storage, table format, catalog, query engine, orchestrator. A warehouse is one product.
Your workload is purely SQL analytics. If every consumer needs SQL and nothing else, a warehouse is simpler and faster to set up.
You value managed operations. Snowflake and BigQuery handle storage, compute, optimization, and maintenance. A lakehouse requires you to manage more of this yourself.
Latency matters. Warehouses have optimized caching and query planning that is hard to match with a lakehouse setup.

The Hybrid Approach

Many teams land on a pragmatic hybrid:

Lakehouse (S3 + Iceberg):
  - Raw and staging data
  - ML feature stores
  - Large historical archives
  - Data shared across teams and engines

Warehouse (Snowflake/BigQuery):
  - Curated mart tables
  - Dashboard backends
  - Finance and executive reporting
  - Workloads requiring sub-second latency

The warehouse reads from the lakehouse (Snowflake external tables on Iceberg, BigQuery BigLake) rather than maintaining a separate copy. This gives you lakehouse economics for storage and warehouse performance for the queries that need it.

Common Pitfalls

Building a lakehouse when a warehouse would suffice. If your data fits in Snowflake, your team knows SQL, and you do not need ML training on raw data, a lakehouse adds complexity without clear benefit. Start with a warehouse and add lakehouse components when you hit its limits.

Neglecting the catalog. Without a catalog, your lakehouse is just files in S3. Nobody can discover tables, access controls are per-bucket, and engine integration requires manual path configuration. Invest in a catalog early.

Underestimating operational overhead. A warehouse manages compaction, statistics, caching, and optimization automatically. In a lakehouse, you manage compaction, garbage collection, table optimization, and catalog maintenance yourself (or pay for a managed platform like Databricks).

Choosing too many engines. Having Spark, Trino, DuckDB, and Flink all querying the same data sounds flexible. In practice, each engine has its own configuration, tuning, and failure modes. Start with one or two engines and add more only when you have a clear use case.

Ignoring data governance. Open formats on object storage do not come with built-in access control. You need to implement column-level security, row-level filtering, and audit logging through the catalog or a governance layer. This is easier in a managed warehouse.

Treating the lakehouse as a dump. A lakehouse without data quality checks, schema enforcement, and clear ownership is just a data swamp with ACID transactions. The table format provides the mechanism; you provide the discipline.

Key Takeaways

A lakehouse combines data lake economics (cheap object storage, open formats) with warehouse capabilities (ACID, SQL, performance)
The four components are object storage, a table format (Iceberg/Delta), a catalog, and query engines
Choose the right engine for each workload: Spark for batch, Trino for interactive SQL, DuckDB for local development
A lakehouse makes sense when you have diverse workloads, large data volumes, or need vendor independence
A managed warehouse is simpler when your team is small, workloads are purely SQL, and you value operational simplicity
The pragmatic path for many teams is a hybrid: lakehouse for storage and raw data, warehouse for curated marts and dashboards
Do not underestimate the operational overhead; a lakehouse trades managed simplicity for flexibility and control