Delta, Iceberg & Hudi

Data lakes solved the storage problem but introduced new ones. You cannot update or delete individual rows in a set of Parquet files sitting in S3. You cannot run a query and guarantee that you are reading a consistent snapshot. You cannot roll back a bad write. The lakehouse table formats (Delta Lake, Apache Iceberg, Apache Hudi) address all of these by adding database-like capabilities on top of object storage.

The Problem with Raw Parquet Files

Parquet on object storage gives you cheap, scalable, columnar storage. But it lacks the features that databases have provided for decades:

Missing from raw Parquet on S3:
  - ACID transactions (concurrent writes can corrupt data)
  - Row-level updates and deletes (Parquet files are immutable)
  - Managed schema evolution (there is no table-level schema, so column changes must be reconciled across files by hand or by rewriting them)
  - Time travel (no way to query data as it existed yesterday)
  - Consistent reads (a query might read partially-written data)
  - Efficient upserts (no merge operation)

These are not theoretical concerns. Every team that runs a data lake at scale hits them within months. A pipeline fails halfway through writing 100 files; downstream queries read the 50 that landed and produce incorrect results. GDPR requires deleting a user's data; you cannot delete one row from a Parquet file without rewriting the entire file.

How Table Formats Work

All three formats follow the same core principle: they add a metadata layer on top of immutable Parquet (or ORC) files in object storage.

Traditional data lake:
  Query Engine -> List files in S3 prefix -> Read Parquet files

Lakehouse table format:
  Query Engine -> Read metadata (manifest/log) -> Read only relevant Parquet files

The metadata layer tracks:

  - Which files belong to the current version of the table
  - Statistics about each file (row counts, column min/max)
  - The history of changes (what was added or removed in each transaction)
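
To make the value of per-file statistics concrete, here is a minimal Python sketch of how a planner could skip files using column min/max values. The `FileEntry` shape and the `files_to_scan` helper are hypothetical and are not taken from any of the three formats; they only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """One data file as a metadata layer might describe it (hypothetical shape)."""
    path: str
    row_count: int
    min_order_date: str  # ISO dates compare correctly as plain strings
    max_order_date: str


def files_to_scan(entries: list[FileEntry], wanted_date: str) -> list[str]:
    """Return only the files whose min/max range could contain wanted_date."""
    return [
        e.path
        for e in entries
        if e.min_order_date <= wanted_date <= e.max_order_date
    ]


# The file list of the current table version, as recorded in metadata.
current_version = [
    FileEntry("part-00000.parquet", 1000, "2025-01-01", "2025-01-10"),
    FileEntry("part-00001.parquet", 1200, "2025-01-11", "2025-01-20"),
    FileEntry("part-00002.parquet", 900, "2025-01-21", "2025-01-31"),
]

print(files_to_scan(current_version, "2025-01-15"))
```

Only one of the three files is opened for this predicate, and no object-store listing is needed to find it.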

Immutable Files, Mutable Tables

The key insight: individual data files are never modified in place. Instead, operations create new files and update the metadata to point to the new set of files.

Update operation (conceptual):
  1. Read the file containing the row to update
  2. Write a NEW file with the updated row
  3. Update metadata to point to the new file instead of the old one
  4. The old file is kept (for time travel) or garbage collected later

Delete operation (conceptual):
  1. Read the file containing the row to delete
  2. Write a NEW file without that row
  3. Update metadata to stop referencing the old file

This copy-on-write approach gives you the semantics of mutable tables on top of immutable storage.
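
The update steps above can be sketched in a few lines of Python. Everything here is an in-memory stand-in: `storage` plays the role of object storage, `versions` plays the role of the metadata layer, and the helper names are invented for illustration.

```python
# "storage" maps file names to immutable tuples of rows.
# "versions" is the metadata: one list of live file names per table version.
storage = {
    "file-a": ({"order_id": 1, "status": "new"}, {"order_id": 2, "status": "new"}),
    "file-b": ({"order_id": 3, "status": "new"},),
}
versions = [["file-a", "file-b"]]


def update_status(order_id, status):
    """Copy-on-write update: rewrite the affected file, then commit new metadata."""
    new_live = []
    for name in versions[-1]:
        rows = storage[name]
        if any(r["order_id"] == order_id for r in rows):
            # Write a NEW file; the old one is left untouched.
            new_name = f"{name}-v{len(versions)}"
            storage[new_name] = tuple(
                {**r, "status": status} if r["order_id"] == order_id else r
                for r in rows
            )
            new_live.append(new_name)
        else:
            new_live.append(name)
    # The commit is a metadata change: a new version pointing at the new file set.
    versions.append(new_live)


def read(version=-1):
    """Read the table as of a given version (latest by default)."""
    return [row for name in versions[version] for row in storage[name]]


update_status(2, "shipped")
print(read())    # latest version sees the update
print(read(0))   # version 0 still sees the original rows
```

Because the old file stays in `storage` until something removes it, reading version 0 still works, which is exactly what time travel relies on.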

Delta Lake

Delta Lake was created by Databricks. It uses a transaction log stored alongside the data files.

Structure

s3://warehouse/orders/
  _delta_log/
    00000000000000000000.json   (initial table creation)
    00000000000000000001.json   (first batch of inserts)
    00000000000000000002.json   (second batch, plus an update)
    00000000000000000010.checkpoint.parquet  (checkpoint every 10 commits)
  part-00000-abc123.parquet
  part-00001-def456.parquet
  part-00002-ghi789.parquet

The _delta_log directory contains a JSON file for each transaction. Each file describes which data files were added or removed. Checkpoints (in Parquet format) are written periodically to speed up metadata reads.
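
As a rough illustration of how a reader derives the current file set from the log, the sketch below replays `add` and `remove` actions in commit order. The commit contents are made up and heavily trimmed (real entries carry many more fields), and the `live_files` helper is hypothetical; this is not the Delta Lake reader implementation.

```python
import json

# Hypothetical, trimmed-down commits: version number -> JSON action lines.
commits = {
    0: ['{"add": {"path": "part-00000-abc123.parquet"}}'],
    1: ['{"add": {"path": "part-00001-def456.parquet"}}'],
    2: [
        '{"remove": {"path": "part-00000-abc123.parquet"}}',
        '{"add": {"path": "part-00002-ghi789.parquet"}}',
    ],
}


def live_files(commits, as_of_version=None):
    """Replay commits in order and return the set of live data files."""
    files = set()
    for version in sorted(commits):
        if as_of_version is not None and version > as_of_version:
            break
        for line in commits[version]:
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


print(sorted(live_files(commits)))                   # current table state
print(sorted(live_files(commits, as_of_version=1)))  # state as of version 1
```

Stopping the replay early is all that reading an older version requires in this model, and a checkpoint is simply a precomputed result of the replay up to some version.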

Key Features

-- Time travel: query previous versions
SELECT * FROM orders VERSION AS OF 5;
SELECT * FROM orders TIMESTAMP AS OF '2025-01-15 10:00:00';

-- Upserts via MERGE
MERGE INTO orders AS target
USING new_orders AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Schema evolution
ALTER TABLE orders ADD COLUMN shipping_method STRING;

Delta Lake Ecosystem

Delta Lake is tightly integrated with Databricks but also works with Spark, Trino, and other engines through open-source connectors. The Delta Sharing protocol enables sharing tables across organizations without copying data.

Apache Iceberg

Apache Iceberg was created at Netflix and donated to the Apache Software Foundation. It was designed from the ground up as an open table format, not tied to any particular engine or vendor.

Structure

s3://warehouse/orders/
  metadata/
    v1.metadata.json      (schema, partition spec, snapshot pointers)
    v2.metadata.json      (updated after next transaction)
    snap-001.avro         (manifest list for snapshot 1)
    snap-002.avro         (manifest list for snapshot 2)
    manifest-abc.avro     (manifest file: lists data files and stats)
    manifest-def.avro
  data/
    part-00000-abc123.parquet
    part-00001-def456.parquet

Iceberg uses three levels of metadata above the data files:

Metadata file -> Manifest list -> Manifest files -> Data files

  - Metadata file: current schema, partition spec, pointer to current snapshot
  - Manifest list: which manifest files belong to this snapshot
  - Manifest files: list of data files with per-file statistics
  - Data files: actual Parquet files containing the data

This hierarchy enables efficient planning. The query engine reads metadata to determine exactly which data files to read, without listing the object storage directory.
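
The traversal can be sketched with a few Python dataclasses. The class and field names below are simplified, hypothetical stand-ins for the real metadata structures, and `plan_scan` is an invented helper; the point is only the order in which the layers are walked.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str
    min_date: str  # per-file column statistics kept in the manifest
    max_date: str


@dataclass
class Manifest:
    data_files: list[DataFile]


@dataclass
class Snapshot:
    manifest_list: list[Manifest]  # which manifests make up this snapshot


@dataclass
class TableMetadata:
    current_snapshot_id: int
    snapshots: dict[int, Snapshot]


def plan_scan(meta, wanted_date, snapshot_id=None):
    """Walk metadata -> manifest list -> manifests and return files to read."""
    if snapshot_id is None:
        snapshot_id = meta.current_snapshot_id
    snapshot = meta.snapshots[snapshot_id]
    return [
        f.path
        for manifest in snapshot.manifest_list
        for f in manifest.data_files
        if f.min_date <= wanted_date <= f.max_date
    ]


m1 = Manifest([DataFile("data/part-00000-abc123.parquet", "2025-01-01", "2025-01-15")])
m2 = Manifest([DataFile("data/part-00001-def456.parquet", "2025-01-16", "2025-01-31")])
table = TableMetadata(
    current_snapshot_id=2,
    snapshots={1: Snapshot([m1]), 2: Snapshot([m1, m2])},
)

print(plan_scan(table, "2025-01-20"))                 # current snapshot
print(plan_scan(table, "2025-01-20", snapshot_id=1))  # older snapshot
```

Note that snapshot 2 reuses manifest `m1` unchanged; in this model a commit only needs to write the manifests that actually changed.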

Key Features

-- Time travel (exact syntax varies by engine; the version is a snapshot ID, 42 is a placeholder)
SELECT * FROM orders FOR SYSTEM_TIME AS OF TIMESTAMP '2025-01-15 10:00:00';
SELECT * FROM orders FOR VERSION AS OF 42;

-- Schema evolution (add, drop, rename, reorder columns)
ALTER TABLE orders ADD COLUMN shipping_method STRING;
ALTER TABLE orders DROP COLUMN legacy_field;
ALTER TABLE orders RENAME COLUMN old_name TO new_name;

-- Partition evolution (change partitioning without rewriting data)
ALTER TABLE orders ADD PARTITION FIELD month(order_date);

-- Hidden partitioning (users do not need to know the partition scheme)
-- Query just uses: WHERE order_date = '2025-01-15'
-- Iceberg automatically prunes based on the partition spec

Partition Evolution

Iceberg's partition evolution is a standout feature. In Delta Lake and Hudi, changing the partition scheme typically requires rewriting the entire table. In Iceberg, you can add or change partition fields, and old data keeps its original partitioning while new data uses the new scheme. The query planner handles both transparently.

Original partition: daily (order_date)
New partition: monthly (order_date)

After evolution:
  Old files: still partitioned by day
  New files: partitioned by month
  Queries: Iceberg prunes correctly across both schemes
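
One way to picture pruning across both schemes: each file remembers which partition spec it was written under, and the planner applies that spec's transform to the query predicate. The Python sketch below assumes a hypothetical `SPECS` registry and file list; it is a simplification, not Iceberg's actual planner.

```python
from datetime import date

# Hypothetical spec registry: spec id -> transform applied to order_date.
SPECS = {
    0: lambda d: d.isoformat(),        # daily partitioning, e.g. "2025-01-15"
    1: lambda d: d.strftime("%Y-%m"),  # monthly partitioning, e.g. "2025-02"
}

# Each file records the spec it was written with and its partition value.
FILES = [
    {"path": "old-1.parquet", "spec_id": 0, "partition": "2025-01-14"},
    {"path": "old-2.parquet", "spec_id": 0, "partition": "2025-01-15"},
    {"path": "new-1.parquet", "spec_id": 1, "partition": "2025-02"},
    {"path": "new-2.parquet", "spec_id": 1, "partition": "2025-03"},
]


def prune(files, wanted):
    """Keep files whose partition value matches the predicate under their own spec."""
    return [
        f["path"]
        for f in files
        if SPECS[f["spec_id"]](wanted) == f["partition"]
    ]


print(prune(FILES, date(2025, 1, 15)))  # hits a daily-partitioned file
print(prune(FILES, date(2025, 2, 3)))   # hits a monthly-partitioned file
```

The query itself never mentions the partition scheme, which is also why hidden partitioning and partition evolution fit together naturally.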

Apache Hudi

Apache Hudi (Hadoop Upserts Deletes and Incrementals) was created at Uber to handle the specific challenge of upserting ride data at massive scale.

Structure

Hudi supports two table types:

Copy-on-Write (CoW):
  - Data stored in Parquet files
  - Updates rewrite entire files
  - Read-optimized: queries are fast
  - Write-amplified: updates are expensive

Merge-on-Read (MoR):
  - Base files in Parquet + change logs in Avro
  - Updates append to log files
  - Write-optimized: updates are fast
  - Reads may need to merge base files with logs (slower)
  - Compaction merges logs into base files periodically
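
A minimal Python sketch of the merge-on-read idea follows, using in-memory dictionaries and lists in place of Parquet base files and Avro logs. The record layout, the `_deleted` marker, and the helper names are all invented for illustration and do not reflect Hudi's internal formats.

```python
# Base file: record key -> latest compacted record.
base_file = {1: {"ride_id": 1, "fare": 10.0}, 2: {"ride_id": 2, "fare": 7.5}}
# Log file: changes appended in arrival order, never rewritten in place.
log_file = []


def upsert(record):
    """Writes are cheap: just append to the log."""
    log_file.append(record)


def delete(ride_id):
    """Deletes are appended as tombstones (hypothetical marker field)."""
    log_file.append({"ride_id": ride_id, "_deleted": True})


def snapshot_read(base, log):
    """Reads pay the cost: merge the base file with the log on the fly."""
    merged = dict(base)
    for rec in log:
        if rec.get("_deleted"):
            merged.pop(rec["ride_id"], None)
        else:
            merged[rec["ride_id"]] = rec
    return merged


def compact(base, log):
    """Compaction folds the log into a new base file and starts an empty log."""
    return snapshot_read(base, log), []


upsert({"ride_id": 2, "fare": 9.0})
upsert({"ride_id": 3, "fare": 12.0})
delete(1)

before = snapshot_read(base_file, log_file)
base_file, log_file = compact(base_file, log_file)
after = snapshot_read(base_file, log_file)
print(before == after, len(log_file))
```

Compaction does not change what readers see; it only moves the merge cost from every read to a single background job.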

Key Features

Hudi strengths:
  - Fast upserts (designed for high-throughput ingestion)
  - Incremental queries (read only what changed since last query)
  - Built-in compaction and cleaning
  - Record-level indexing for fast lookups
  - Strong CDC (change data capture) support

Hudi's incremental query capability is particularly useful for pipelines that need to process only new or changed records rather than scanning entire tables.
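
The idea behind an incremental query can be sketched as follows: every commit has an instant on a timeline, and a consumer remembers the last instant it processed. The commit data and the `incremental_read` helper below are hypothetical and only model the concept.

```python
# Hypothetical timeline: (commit instant, records written in that commit).
commits = [
    ("20250115100000", [{"ride_id": 1, "fare": 10.0}, {"ride_id": 2, "fare": 7.5}]),
    ("20250115110000", [{"ride_id": 2, "fare": 9.0}]),
    ("20250115120000", [{"ride_id": 3, "fare": 12.0}]),
]


def incremental_read(commits, since):
    """Return (records committed after `since`, new checkpoint instant)."""
    changed = [
        {**rec, "commit_time": instant}
        for instant, records in commits
        if instant > since
        for rec in records
    ]
    latest = max((instant for instant, _ in commits), default=since)
    return changed, max(latest, since)


# A downstream job that already processed the first commit.
changed, checkpoint = incremental_read(commits, since="20250115100000")
print([r["ride_id"] for r in changed], checkpoint)

# Running again with the new checkpoint returns nothing new.
print(incremental_read(commits, since=checkpoint)[0])
```

The downstream job stores `checkpoint` between runs, so each run touches only the commits it has not seen yet.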

Comparing the Three Formats

                     Delta Lake            Apache Iceberg        Apache Hudi
Origin               Databricks            Netflix/Apache        Uber/Apache
License              Apache 2.0            Apache 2.0            Apache 2.0
Metadata format      JSON log              Avro manifests        Timeline + metadata
Default file format  Parquet               Parquet (or ORC)      Parquet (or ORC)
Schema evolution     Add; rename/drop via  Full (add/drop/       Add columns
                     column mapping        rename/reorder)
Partition evolution  No (rewrite)          Yes (no rewrite)      Limited
Time travel          Yes                   Yes                   Yes
Hidden partitioning  No                    Yes                   No
Merge-on-Read        Deletion vectors      Yes (v2 deletes)      Yes (native)
Ecosystem breadth    Databricks-centric    Broadest              Narrower
Governance           Linux Foundation      Apache (ASF)          Apache (ASF)

How to Choose

The practical advice, stripped of vendor marketing:

Go with Iceberg if you are not locked into Databricks. Iceberg has the broadest engine support (Spark, Trino, Flink, Dremio, Snowflake, BigQuery, AWS Athena, StarRocks), the most flexible schema and partition evolution, and the strongest open governance. It is becoming the industry standard.

Go with Delta Lake if you are already on Databricks or plan to be. Delta Lake is deeply integrated with the Databricks runtime, Unity Catalog, and the broader Databricks ecosystem. You will get the best performance and features within that platform.

Consider Hudi if your primary use case is high-throughput CDC ingestion with frequent upserts. Hudi was designed for this workload and has optimizations (merge-on-read, record-level indexing) that the others are still catching up on.

The convergence trend: all three formats are adding each other's features. Delta Lake added liquid clustering and deletion vectors. Iceberg added row-level deletes. The differences are narrowing. What matters more than the format itself is the ecosystem you build around it.

Migration Between Formats

If you chose wrong (or if the landscape shifts), migration is not catastrophic. Tools like Apache XTable (formerly OneTable) can convert metadata between formats without rewriting the underlying Parquet files.

Apache XTable:
  - Converts Delta -> Iceberg, Iceberg -> Delta, Hudi -> Iceberg, etc.
  - Only converts metadata; data files stay in place
  - Enables querying the same data with different engines

Snowflake and BigQuery both support reading Iceberg tables directly from object storage, which provides an exit path from warehouse lock-in.

Common Pitfalls

Choosing based on benchmarks instead of ecosystem fit. Micro-benchmarks comparing Delta vs Iceberg vs Hudi are misleading. Real-world performance depends on your data, query patterns, and engine. Choose based on ecosystem compatibility and operational simplicity.

Not running compaction. Merge-on-read tables accumulate small log and delete files. Without regular compaction, read performance degrades steadily. Set up automated compaction jobs from day one.

Ignoring garbage collection. Time travel keeps old file versions. Without a retention policy and vacuum/expire process, storage costs grow indefinitely.

-- Delta Lake: remove files older than 7 days
VACUUM orders RETAIN 168 HOURS;

-- Iceberg: expire snapshots older than 7 days
CALL catalog.system.expire_snapshots('orders', TIMESTAMP '2025-01-08 00:00:00');
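
What a vacuum or expire job has to work out can be sketched in Python: keep the snapshots inside the retention window (and always the latest one), then treat any file that no retained snapshot references as safe to delete. The snapshot layout and the `expire` helper are hypothetical simplifications of what the real commands do.

```python
from datetime import datetime, timedelta

# Hypothetical snapshot history: each snapshot lists the files it references.
snapshots = [
    {"id": 1, "committed_at": datetime(2025, 1, 1), "files": {"a", "b"}},
    {"id": 2, "committed_at": datetime(2025, 1, 10), "files": {"b", "c"}},
    {"id": 3, "committed_at": datetime(2025, 1, 14), "files": {"c", "d"}},
]


def expire(snapshots, now, retention):
    """Return (snapshots to keep, files no retained snapshot references)."""
    cutoff = now - retention
    kept = [
        s for s in snapshots
        # Always keep the latest snapshot, even if it is older than the cutoff.
        if s["committed_at"] >= cutoff or s is snapshots[-1]
    ]
    referenced = set().union(*(s["files"] for s in kept))
    all_files = set().union(*(s["files"] for s in snapshots))
    return kept, all_files - referenced


kept, deletable = expire(snapshots, now=datetime(2025, 1, 15), retention=timedelta(days=7))
print([s["id"] for s in kept], sorted(deletable))
```

File `b` survives even though it first appeared in an expired snapshot, because a retained snapshot still references it; only files that no retained snapshot needs are removed.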

Over-indexing on "open standard" vs "vendor." Delta Lake is open-source (Apache 2.0) and hosted by the Linux Foundation. Iceberg is an Apache project. Both are open. The real question is which engines and tools you use, not which format has a purer open-source pedigree.

Assuming table formats solve all data lake problems. Table formats give you ACID and time travel. They do not give you data quality, governance, access control, or discovery. You still need tools for those.

Key Takeaways

  - Lakehouse table formats (Delta, Iceberg, Hudi) add ACID transactions, time travel, schema evolution, and upserts to data lakes
  - All three work by adding a metadata layer on top of immutable Parquet files in object storage
  - Apache Iceberg is the strongest choice for most new projects due to broad engine support and flexible partition evolution
  - Delta Lake is the best choice within the Databricks ecosystem
  - Apache Hudi excels at high-throughput upsert and CDC workloads
  - Run compaction and garbage collection from day one; without them, performance and storage costs degrade over time
  - The formats are converging in features; ecosystem fit matters more than feature checklists