Schema Evolution

Schemas change. Columns are added, renamed, removed, and retyped. Data types are widened. Enums grow new values. Nested fields appear and disappear. If your data platform cannot handle schema changes gracefully, every upstream deployment becomes a potential outage for every downstream consumer. Schema evolution is the discipline of managing these changes without breaking pipelines.

Why Schema Evolution Is Hard

In a traditional application, you control both the code and the database. When you add a column, you update the code that reads it. The deployment is atomic: new code and new schema go live together.

In data engineering, you control neither the source nor (often) the consumer. The source team adds a column to their API without telling you. Your pipeline breaks at 3 AM. Or you add a column to a warehouse table and a downstream dashboard breaks because it uses SELECT *.

The fundamental problem: producers and consumers evolve at different speeds, and there is no compile-time check to catch incompatibilities.

Types of Schema Changes

Backward Compatible (Safe)

Changes that do not break existing consumers:

Adding a new column. Existing queries that do not reference the column are unaffected.
Adding a new optional field. Consumers that do not expect it simply ignore it.
Widening a data type. Changing INT to BIGINT or VARCHAR(50) to VARCHAR(100).

Backward Incompatible (Breaking)

Changes that break existing consumers:

Removing a column. Any query referencing it fails.
Renaming a column. Same effect as removing and adding — existing references break.
Narrowing a data type. Changing BIGINT to INT can truncate values.
Changing a column's semantics. The column revenue changes from cents to dollars. No schema error, but every downstream calculation is wrong.

The Semantic Change Problem

The most dangerous schema changes are invisible to schema validation. If amount silently changes from pre-tax to post-tax, every pipeline processes it without error but every number is wrong. This is why schema evolution requires communication, not just tooling.

Adding Columns Is Easy

Adding a column is the safest schema change. As long as consumers do not use SELECT * (and they should not), a new column is invisible to existing queries.

-- Source adds a new column
ALTER TABLE orders ADD COLUMN discount_code VARCHAR(50);

-- Existing downstream query: unaffected
SELECT order_id, customer_id, amount
FROM orders
WHERE order_date = '2025-03-15';

-- New downstream query: can use the column
SELECT order_id, amount, discount_code
FROM orders
WHERE discount_code IS NOT NULL;

Best practice: new columns should be nullable or have a default value. A non-nullable column without a default breaks INSERT statements that do not include it.

Removing & Renaming Is Hard

The Problem with Dropping Columns

You drop column legacy_status from the orders table. Three things happen:

A dbt model that references legacy_status fails.
A Looker explore that uses the column shows an error.
A data scientist's notebook that was working yesterday now throws an exception.

You did not know about any of these consumers because there is no registry of who uses what.

The Expand-Contract Pattern

Instead of making breaking changes atomically, use a two-phase approach:

Phase 1: Expand. Add the new column alongside the old one. Both exist simultaneously. Migrate consumers to use the new column.

-- Phase 1: Add the new column
ALTER TABLE orders ADD COLUMN order_status VARCHAR(20);

-- Backfill the new column from the old one
UPDATE orders SET order_status = legacy_status;

-- Communicate to consumers: use order_status, not legacy_status

Phase 2: Contract. Once all consumers have migrated, remove the old column.

-- Phase 2: After verifying no queries reference legacy_status
-- Check query logs for references
SELECT query_text
FROM information_schema.query_history
WHERE query_text ILIKE '%legacy_status%'
  AND start_time > DATEADD(day, -30, CURRENT_TIMESTAMP);

-- If no recent references, drop the column
ALTER TABLE orders DROP COLUMN legacy_status;

The expand phase can last weeks or months. That is fine. The point is that the breaking change (dropping the column) only happens after consumers have migrated.

Renaming via Expand-Contract

Renaming a column is just expand-contract with a copy:

-- Step 1: Add new column
ALTER TABLE orders ADD COLUMN customer_identifier BIGINT;

-- Step 2: Keep both in sync (via trigger, pipeline logic, or view)
UPDATE orders SET customer_identifier = cust_id;

-- Step 3: Migrate consumers from cust_id to customer_identifier

-- Step 4: Drop old column (after migration is complete)
ALTER TABLE orders DROP COLUMN cust_id;

Forward & Backward Compatibility

Backward Compatibility

A new schema can read data written with an old schema. This is the consumer's perspective: "I got an upgrade, but I can still read the old data."

Example: You add a discount_code column. The new schema can read old records (where discount_code is null) and new records (where it has a value).

Forward Compatibility

An old schema can read data written with a new schema. This is the producer's perspective: "I started writing new fields, but old consumers can still read my data."

Example: A producer adds a discount_code field to events. Old consumers that do not know about discount_code simply ignore it.

Why Both Matter

In a data platform, producers and consumers deploy independently. The producer might deploy a new schema before all consumers have updated. Forward compatibility handles this. A consumer might deploy an update that reads new fields before the producer starts writing them. Backward compatibility handles this.

Avro & Protobuf for Schema Evolution

Avro

Avro is a binary serialization format with built-in schema evolution. Each Avro message includes a reference to its schema. The reader can use a different (compatible) schema and Avro handles the translation.

# Avro schema (version 1)
schema_v1 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "amount", "type": "double"},
        {"name": "status", "type": "string"},
    ]
}

# Avro schema (version 2) - added discount_code with default
schema_v2 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "amount", "type": "double"},
        {"name": "status", "type": "string"},
        {"name": "discount_code", "type": ["null", "string"], "default": None},
    ]
}
# A reader using schema_v2 can read data written with schema_v1
# (discount_code defaults to null for old records)

Avro's compatibility rules:

Adding a field with a default: Backward and forward compatible.
Removing a field with a default: Backward and forward compatible.
Adding a field without a default: Only backward compatible (old data does not have it).
Removing a field without a default: Only forward compatible.

Protobuf

Protocol Buffers use field numbers instead of names, making schema evolution more flexible. Fields can be added or removed as long as field numbers are not reused.

// Version 1
message Order {
    int64 order_id = 1;
    double amount = 2;
    string status = 3;
}

// Version 2 - added field, never reuse field number 3 for something else
message Order {
    int64 order_id = 1;
    double amount = 2;
    string status = 3;
    string discount_code = 4;  // New field
}

Avro vs Protobuf

Feature              Avro                         Protobuf
-------              ----                         --------
Schema location      Stored with data             Compiled into code
Evolution model      Reader/writer schemas        Field numbers
Compression          Good                         Excellent
Ecosystem            Hadoop, Kafka, Spark         gRPC, microservices
Dynamic schemas      Yes (schema in header)       No (compiled)

Both work well. Avro is more common in the Kafka and data engineering ecosystem. Protobuf is more common in microservices.

The Schema Registry Pattern

A schema registry is a centralized service that stores and validates schemas. Before a producer can publish data with a new schema, the registry checks that the change is compatible.

How It Works

1. Producer wants to publish with schema v2
2. Producer registers schema v2 with the registry
3. Registry checks compatibility with v1:
   - Added field with default? OK (backward compatible)
   - Removed required field? REJECTED
4. If compatible, registry assigns a schema ID
5. Producer publishes messages with the schema ID in the header
6. Consumer reads the schema ID, fetches the schema from the registry
7. Consumer deserializes using its own schema + the writer's schema

Confluent Schema Registry

The most widely used schema registry, built for Kafka:

from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

registry_client = SchemaRegistryClient({'url': 'http://schema-registry:8081'})

# The registry enforces compatibility rules
# If you try to register an incompatible schema, it rejects it
serializer = AvroSerializer(
    schema_registry_client=registry_client,
    schema_str=schema_v2_json,
)

Compatibility Modes

BACKWARD:          New schema can read old data. Default.
FORWARD:           Old schema can read new data.
FULL:              Both backward and forward compatible.
BACKWARD_TRANSITIVE: Backward compatible with ALL previous versions.
FORWARD_TRANSITIVE:  Forward compatible with ALL previous versions.
FULL_TRANSITIVE:     Both, with all previous versions.
NONE:              No compatibility checking. Use with caution.

-- Version 1 (original)
CREATE TABLE analytics.orders_v1 (
    order_id BIGINT,
    amount DECIMAL(10,2),
    legacy_status VARCHAR(10)
);

-- Version 2 (breaking change: renamed column, new enum values)
CREATE TABLE analytics.orders_v2 (
    order_id BIGINT,
    amount DECIMAL(10,2),
    order_status VARCHAR(20)  -- Renamed, new values
);

-- View for backward compatibility during migration
CREATE VIEW analytics.orders AS SELECT * FROM analytics.orders_v2;

Pros: Clear separation. Consumers choose when to migrate. Cons: Maintaining multiple versions is expensive. Views add a layer of indirection.

Common Pitfalls

Using SELECT * in production pipelines. Any added column breaks the query or introduces unexpected data. Always list columns explicitly.
Making breaking changes without communication. A Slack message saying "we renamed customer_id to cust_id, deploying tomorrow" saves hours of debugging.
Not tracking who consumes your data. You cannot do expand-contract if you do not know who to migrate. Maintain a consumer registry or use lineage tools.
Reusing deleted column names. If you drop column status and later add a new status with different semantics, historical queries produce silently wrong results.
Ignoring semantic changes. A column that changes from UTC to local time, from cents to dollars, or from inclusive to exclusive ranges is a breaking change that no schema validator will catch.
Skipping the expand phase. Going directly from old schema to new schema without a transition period breaks consumers who have not updated yet.

Key Takeaways

Schema changes are inevitable. Adding columns is safe. Removing and renaming columns is risky and requires a migration strategy.
The expand-contract pattern is the safest approach: add the new thing alongside the old, migrate consumers, then remove the old thing.
Forward compatibility (old readers, new data) and backward compatibility (new readers, old data) determine which changes are safe.
Avro and Protobuf provide built-in schema evolution support. Use them for serialized data, especially in Kafka.
A schema registry enforces compatibility rules before changes are deployed. It prevents breaking changes from reaching production.
Communication is as important as tooling. Semantic changes (different units, different meanings) are invisible to automated checks and require human coordination.