Data Contracts

A data contract is an agreement between a data producer and its consumers about what the data looks like, how reliable it is, and who is responsible for it. It is the API contract concept applied to data. Without data contracts, producers change their data without knowing what they break. With data contracts, changes are negotiated, communicated, and managed, just like changes to a REST API.

The Problem Data Contracts Solve

In most organizations, the data flow looks like this:

Backend Team (produces data)
  -> Database/Events
    -> Data Engineering (ingests, transforms)
      -> Analytics Table
        -> Data Analysts (query for reports)
        -> ML Engineers (train models)
        -> Finance Team (revenue reporting)
        -> Marketing (attribution)

The backend team does not know that several downstream teams depend on the user_events table. They rename a column, change an enum value, or alter the event schema. The data engineering pipeline breaks. The analytics table is stale or incorrect. The finance report has wrong numbers. The ML model's predictions degrade.

The root cause: the producer has no visibility into who uses their data, and no obligation to maintain it.

A data contract makes this explicit. The backend team agrees: "We will produce events with this schema, at this frequency, with this quality level, and we will not make breaking changes without two weeks' notice."

What a Data Contract Contains

Schema Definition

The exact structure of the data: field names, types, nullability, and descriptions.

contract:
  name: order_events
  version: 2.1.0
  owner: backend-payments
  
  schema:
    fields:
      - name: event_id
        type: string
        format: uuid
        nullable: false
        description: Unique identifier for this event
        
      - name: order_id
        type: integer
        nullable: false
        description: The order this event relates to
        
      - name: event_type
        type: string
        nullable: false
        allowed_values: [created, paid, shipped, delivered, cancelled]
        description: The type of order lifecycle event
        
      - name: amount_cents
        type: integer
        nullable: false
        description: Order amount in cents (USD)
        
      - name: occurred_at
        type: timestamp
        format: ISO 8601 UTC
        nullable: false
        description: When the event actually occurred
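
A schema like this can be checked mechanically. The following is a minimal sketch, not a reference implementation: `FIELDS`, `PY_TYPES`, and `validate_event` are hypothetical names, the schema is copied into a Python list by hand rather than loaded from the YAML file, and timestamps are assumed to arrive as ISO 8601 strings.

```python
from datetime import datetime
from uuid import UUID

# Hand-copied subset of the order_events schema above (assumption: a real
# setup would load this from the contract file instead).
FIELDS = [
    {"name": "event_id", "type": "string", "format": "uuid", "nullable": False},
    {"name": "order_id", "type": "integer", "nullable": False},
    {"name": "event_type", "type": "string", "nullable": False,
     "allowed_values": ["created", "paid", "shipped", "delivered", "cancelled"]},
    {"name": "amount_cents", "type": "integer", "nullable": False},
    {"name": "occurred_at", "type": "timestamp", "nullable": False},
]

# Mapping from contract type names to Python types (an assumption of this sketch).
PY_TYPES = {"string": str, "integer": int, "timestamp": str}


def validate_event(event: dict, fields: list[dict]) -> list[str]:
    """Return a list of violations; an empty list means the event conforms."""
    violations = []
    for spec in fields:
        name = spec["name"]
        value = event.get(name)
        if value is None:
            if not spec.get("nullable", True):
                violations.append(f"{name}: missing or null")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, PY_TYPES[spec["type"]]) or isinstance(value, bool):
            violations.append(f"{name}: expected {spec['type']}")
            continue
        if "allowed_values" in spec and value not in spec["allowed_values"]:
            violations.append(f"{name}: {value!r} not in allowed_values")
        if spec.get("format") == "uuid":
            try:
                UUID(value)
            except ValueError:
                violations.append(f"{name}: not a valid UUID")
        if spec["type"] == "timestamp":
            try:
                datetime.fromisoformat(value)
            except ValueError:
                violations.append(f"{name}: not an ISO 8601 timestamp")
    return violations
```

Returning a list of violations rather than raising on the first one lets the caller decide whether to reject the event, log it, or route it elsewhere.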

SLAs (Service Level Agreements)

How quickly and reliably the data will be available.

sla:
  freshness:
    max_delay: 15 minutes
    description: Events appear in Kafka within 15 minutes of occurrence
    
  availability:
    uptime: 99.9%
    description: The event stream is available 99.9% of the time
    
  volume:
    expected_daily_events: 50000-200000
    alert_if_below: 10000
    alert_if_above: 500000
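
The freshness and volume terms translate directly into checks a monitoring job could run. This sketch assumes the job knows both when an event occurred and when it arrived on the stream; the function names and the returned labels are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_DELAY = timedelta(minutes=15)  # sla.freshness.max_delay
ALERT_IF_BELOW = 10_000            # sla.volume.alert_if_below
ALERT_IF_ABOVE = 500_000           # sla.volume.alert_if_above


def freshness_breached(occurred_at: datetime, arrived_at: datetime) -> bool:
    """True if the event reached the stream later than the contract allows."""
    return arrived_at - occurred_at > MAX_DELAY


def volume_alert(daily_event_count: int) -> Optional[str]:
    """Return 'below' or 'above' when the daily count leaves the alert band."""
    if daily_event_count < ALERT_IF_BELOW:
        return "below"
    if daily_event_count > ALERT_IF_ABOVE:
        return "above"
    return None
```

Note that the alert thresholds are wider than the expected range of 50,000 to 200,000 events, so a count of 30,000 is unusual but does not trigger an alert.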

Quality Expectations

What guarantees the producer makes about data quality.

quality:
  - field: event_id
    check: unique
    description: No duplicate event IDs within a 24-hour window
    
  - field: amount_cents
    check: range
    min: 0
    max: 10000000  # $100,000 max
    description: Amount is non-negative and within reasonable bounds
    
  - field: occurred_at
    check: freshness
    max_age: 48 hours
    description: Events are not backdated more than 48 hours
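
As a rough illustration, the three checks above could run over a batch of events as follows. The sketch assumes the batch already covers the 24-hour uniqueness window and that `occurred_at` has been parsed into a `datetime`; `check_quality` is a hypothetical helper.

```python
from datetime import datetime, timedelta

MAX_AMOUNT_CENTS = 10_000_000  # $100,000, from the range check
MAX_AGE = timedelta(hours=48)  # from the freshness check


def check_quality(events: list[dict], now: datetime) -> list[tuple[str, str]]:
    """Return (event_id, failed_check) pairs for every failed quality check."""
    failures = []
    seen_ids = set()
    for event in events:
        event_id = event["event_id"]
        # unique: the batch is assumed to span the 24-hour window.
        if event_id in seen_ids:
            failures.append((event_id, "unique"))
        seen_ids.add(event_id)
        # range: amount must be non-negative and within the upper bound.
        if not 0 <= event["amount_cents"] <= MAX_AMOUNT_CENTS:
            failures.append((event_id, "range"))
        # freshness: events must not be backdated more than 48 hours.
        if now - event["occurred_at"] > MAX_AGE:
            failures.append((event_id, "freshness"))
    return failures
```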

Ownership & Contact

Who is responsible for the data and how to reach them.

ownership:
  team: backend-payments
  slack_channel: "#payments-data"
  on_call: payments-oncall@company.pagerduty.com
  
  contacts:
    - name: Jane Smith
      role: Tech Lead
      email: jane@company.com
    - name: Bob Jones
      role: Data Producer Owner
      email: bob@company.com

Change Management

How changes to the contract are handled.

changes:
  breaking_change_notice: 14 days
  deprecation_notice: 30 days
  communication_channel: "#data-contracts-changes"
  approval_required_from: [data-engineering, analytics]
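
These rules are simple enough to enforce in tooling. A minimal sketch, assuming the announcement date and the list of approving teams are recorded somewhere the tooling can read; both function names are hypothetical.

```python
from datetime import date, timedelta

BREAKING_CHANGE_NOTICE = timedelta(days=14)             # changes.breaking_change_notice
REQUIRED_APPROVERS = {"data-engineering", "analytics"}  # changes.approval_required_from


def earliest_release(announced_on: date) -> date:
    """First day a breaking change may ship after it was announced."""
    return announced_on + BREAKING_CHANGE_NOTICE


def may_ship_breaking_change(announced_on: date, today: date, approvals: list[str]) -> bool:
    """True once the notice period has elapsed and every required team approved."""
    missing = REQUIRED_APPROVERS - set(approvals)
    return today >= earliest_release(announced_on) and not missing
```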

Why Data Contracts Matter

The Producer Does Not Know Who Uses Their Data

A backend engineer adds a feature that changes the status field from a string to an integer enum. In their world, this is a clean improvement. They have no idea that:

- A data pipeline parses status as a string and maps it to categories
- An ML model uses one-hot encoding on the string values
- A Looker dashboard has a filter that matches on the string "completed"

All three break silently. Data contracts prevent this by making the dependency explicit and the change process structured.
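
To see why the failure is silent rather than loud, consider typical consumer code. The category mapping and the integer code `2` below are invented for illustration; only the string "completed" comes from the example above.

```python
# Hypothetical consumer-side mapping from status strings to report categories.
STATUS_CATEGORIES = {"pending": "open", "completed": "closed", "cancelled": "closed"}


def categorize(status) -> str:
    # A defensive default hides schema drift instead of surfacing it.
    return STATUS_CATEGORIES.get(status, "unknown")


def count_completed(rows: list[dict]) -> int:
    # Mirrors a dashboard filter that matches on the string "completed".
    return sum(1 for row in rows if row["status"] == "completed")
```

Before the change, `categorize("completed")` returns `"closed"`. After the producer switches to integers, `categorize(2)` returns `"unknown"` and `count_completed` returns 0 for the same orders. Nothing raises an exception, so nothing alerts.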

Data Contracts as API Contracts for Data

REST APIs have contracts: endpoint paths, request/response schemas, authentication, rate limits, deprecation policies. We would never accept a REST API that changed its response format without notice. But we routinely accept this for data.

Data contracts apply the same discipline:

REST API Contract              Data Contract
----------------               -------------
Endpoint path                  Topic/table name
Request/response schema        Data schema
Authentication                 Access control
Rate limits                    Volume expectations
SLA (latency, uptime)          SLA (freshness, availability)
Deprecation policy             Change management policy
API versioning                 Schema versioning

Implementing Data Contracts

Contract-as-Code

Store contracts in version control alongside the code that produces the data.

repo: backend-payments/
  src/
    order_events.py
  contracts/
    order_events.yaml    <- The data contract
  tests/
    test_order_events.py
    test_contract.py     <- Tests that validate output matches contract

Validation at the Source

The producer validates that their output matches the contract before publishing.

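import json
import uuid
from datetime import timezone

# kafka_producer and validate_against_contract are assumed to be defined
# elsewhere in the producing service.

def publish_order_event(order_id, event_type, amount_cents, occurred_at):
    """Build an order event, check it against the contract, then publish it."""
    event = {
        'event_id': str(uuid.uuid4()),
        'order_id': order_id,
        'event_type': event_type,
        'amount_cents': amount_cents,
        # The contract requires ISO 8601 in UTC. Pass a timezone-aware
        # datetime; a naive one would be interpreted as local time here.
        'occurred_at': occurred_at.astimezone(timezone.utc).isoformat(),
    }

    # Validate against contract before publishing
    validate_against_contract(event, 'order_events')

    kafka_producer.produce('order_events', value=json.dumps(event))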

Validation at Ingestion

The data engineering team also validates incoming data against the contract.

def ingest_order_events():
    events = consume_from_kafka('order_events')

    # Validate against the published contract
    contract = load_contract('order_events')

    for event in events:
        violations = contract.validate(event)
        if violations:
            # Log the violation and route to dead letter queue
            dead_letter.publish(event, violations=violations)
            metrics.increment('contract_violations', tags={'contract': 'order_events'})
        else:
            write_to_warehouse(event)

CI/CD Integration

Block deployments that violate data contracts.

# In the CI pipeline for the backend-payments repo:
1. Run unit tests
2. Run integration tests
3. Validate data contract compatibility:
   - Load the current contract (order_events v2.1.0)
   - Generate the schema from the code
   - Check compatibility (backward compatible? breaking change?)
   - If breaking: FAIL the build. Require contract version bump and notice period.
4. Deploy
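
The compatibility check in step 3 could look roughly like the following. This is a sketch under a stated policy, not a standard: it treats removed fields, type changes, a required field becoming nullable, and removed enum values as breaking, and treats added fields as compatible. `breaking_changes` is a hypothetical helper, and generating the new schema from code is assumed to have happened already.

```python
def breaking_changes(old_fields: list[dict], new_fields: list[dict]) -> list[str]:
    """Compare two schema versions and describe every breaking change found."""
    new_by_name = {field["name"]: field for field in new_fields}
    problems = []
    for old in old_fields:
        name = old["name"]
        new = new_by_name.get(name)
        if new is None:
            # A rename looks like a removal to every existing consumer.
            problems.append(f"{name}: field removed or renamed")
            continue
        if new["type"] != old["type"]:
            problems.append(f"{name}: type changed from {old['type']} to {new['type']}")
        if not old.get("nullable", True) and new.get("nullable", True):
            problems.append(f"{name}: became nullable")
        if "allowed_values" in old and "allowed_values" in new:
            removed = sorted(set(old["allowed_values"]) - set(new["allowed_values"]))
            if removed:
                problems.append(f"{name}: allowed values removed: {removed}")
    return problems
```

A CI step would fail the build whenever the returned list is non-empty, for example by printing the problems and exiting with a non-zero status.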

Breaking Changes Require Communication

When a breaking change is necessary (it sometimes is), the process should be:

1. Producer proposes the change
   - Opens a PR updating the contract
   - Describes what changes and why
   
2. Consumers are notified
   - Automated notification to all registered consumers
   - Discussion in the data contracts channel
   
3. Migration timeline is agreed
   - "New schema available March 1"
   - "Old schema deprecated March 15"
   - "Old schema removed April 1"
   
4. Both schemas are supported during transition
   - Producer writes both formats, or
   - A compatibility layer translates old to new
   
5. Old schema is removed
   - Only after all consumers have confirmed migration

This is exactly how API versioning works in well-run organizations. Data should get the same treatment.

Real-World Example: E-Commerce Order Events

A mid-size e-commerce company has three teams consuming order events:

- Data Engineering: Ingests events into the warehouse for analytics tables.
- ML Team: Uses event streams to train a fraud detection model.
- Finance: Uses aggregated data for revenue reconciliation.

Without a contract, the backend team renames total_amount to order_total. The data engineering pipeline produces NULLs in the amount column. The ML model's fraud scores drop because a key feature is missing. Finance reports zero revenue for the day. Three teams spend a combined 20 hours debugging.

With a contract: the backend team's CI pipeline rejects the rename because it is a breaking change. The team opens a data contract change request. They agree to use the expand-contract pattern: add order_total, keep total_amount for 30 days, then remove it. All consumers migrate within the window. Zero downtime. Zero debugging.
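
The expand-contract pattern from this example can be sketched on both sides of the stream. The start date is invented, and removing the old field by date alone is a simplification: in practice removal should also wait for consumers to confirm migration. Both function names are hypothetical.

```python
from datetime import date, timedelta

EXPAND_START = date(2024, 3, 1)  # hypothetical day order_total was introduced
OVERLAP = timedelta(days=30)     # total_amount is kept for 30 days


def build_order_payload(amount_cents: int, today: date) -> dict:
    """Producer side: write both fields during the overlap window."""
    payload = {"order_total": amount_cents}
    if today < EXPAND_START + OVERLAP:
        payload["total_amount"] = amount_cents  # legacy field, same value
    return payload


def read_amount(event: dict) -> int:
    """Consumer side: prefer the new field, fall back to the old one."""
    if "order_total" in event:
        return event["order_total"]
    return event["total_amount"]
```

A consumer that adopts `read_amount` early keeps working before, during, and after the transition, which is what lets each team migrate on its own schedule.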

Common Mistakes

- Making contracts too strict too early. A contract with 50 quality checks on a table that changes weekly creates friction without value. Start with critical fields and expand.
- Not enforcing contracts. A contract that exists in a YAML file but is never validated is just documentation. Contracts must be enforced at the source, at ingestion, or both.
- Only validating schema, not semantics. A contract that checks types but not meanings misses the most dangerous changes (units, timezones, business logic).
- Making contracts the data engineering team's responsibility alone. Producers must own their contracts. If data engineering writes and maintains producer contracts, the system breaks down.
- Ignoring internal data. Contracts between teams within the same company matter just as much as contracts with external vendors. Internal teams break your pipelines too.
- No process for breaking changes. If there is no defined process, teams either never make breaking changes (accumulating tech debt) or make them without notice (breaking consumers).
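
As one example of a semantic check that goes beyond types, the sketch below tests whether a timestamp string is actually in UTC, which a plain "is it a string" check would never catch. `is_utc_timestamp` is a hypothetical helper; it relies on `datetime.fromisoformat`, which only accepts a trailing `Z` from Python 3.11 onward.

```python
from datetime import datetime, timedelta


def is_utc_timestamp(value: str) -> bool:
    """True only for ISO 8601 strings that carry an explicit zero UTC offset."""
    try:
        parsed = datetime.fromisoformat(value)
    except ValueError:
        return False
    # Naive timestamps return None here and are rejected as ambiguous.
    return parsed.utcoffset() == timedelta(0)
```

A value such as `2024-03-01T12:00:00+02:00` is a perfectly valid timestamp string, yet it violates the contract's UTC requirement, and this check is what flags it.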

Key Takeaways

- A data contract is an agreement between producer and consumer covering schema, SLAs, quality, ownership, and change management.
- Without contracts, producers have no visibility into who uses their data, and changes break downstream systems silently.
- Data contracts apply the same discipline as API contracts: versioning, backward compatibility, deprecation policies, and structured change management.
- Store contracts as code, validate in CI/CD, and monitor compliance in production. A contract that is not enforced is just documentation.
- Breaking changes require communication: propose, notify consumers, agree on a timeline, support both schemas during transition, then remove the old one.
- Start simple. A contract covering schema and ownership for your five most critical data sources is better than a comprehensive contract framework that nobody adopts.