Data Contracts
A data contract is an agreement between a data producer and its consumers about what the data looks like, how reliable it is, and who is responsible for it. It is the API contract concept applied to data. Without data contracts, producers change their data without knowing whose pipelines they break. With data contracts, changes are negotiated, communicated, and managed, just like changes to a REST API.
The Problem Data Contracts Solve
In most organizations, the data flow looks like this:
Backend Team (produces data)
  -> Database/Events
    -> Data Engineering (ingests, transforms)
      -> Analytics Table
        -> Data Analysts (query for reports)
        -> ML Engineers (train models)
        -> Finance Team (revenue reporting)
        -> Marketing (attribution)
The backend team does not know that several downstream teams depend on the user_events table. They rename a column, change an enum value, or alter the event schema. The data engineering pipeline breaks. The analytics table goes stale or becomes incorrect. The finance report shows wrong numbers. The ML model's predictions degrade.
The root cause: the producer has no visibility into who uses their data, and no obligation to maintain it.
A data contract makes this explicit. The backend team agrees: "We will produce events with this schema, at this frequency, with this quality level, and we will not make breaking changes without a 2-week notice."
What a Data Contract Contains
Schema Definition
The exact structure of the data: field names, types, nullability, and descriptions.
contract:
  name: order_events
  version: 2.1.0
  owner: backend-team
schema:
  fields:
    - name: event_id
      type: string
      format: uuid
      nullable: false
      description: Unique identifier for this event
    - name: order_id
      type: integer
      nullable: false
      description: The order this event relates to
    - name: event_type
      type: string
      nullable: false
      allowed_values: [created, paid, shipped, delivered, cancelled]
      description: The type of order lifecycle event
    - name: amount_cents
      type: integer
      nullable: false
      description: Order amount in cents (USD)
    - name: occurred_at
      type: timestamp
      format: ISO 8601 UTC
      nullable: false
      description: When the event actually occurred
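As a rough illustration of how the schema above could be enforced, here is a minimal Python sketch. It assumes the contract fields have already been loaded into a plain list of dicts (loading the YAML is not shown) and that timestamps travel as ISO 8601 strings. `FIELDS` and `validate_event` are hypothetical names, not part of any particular library.

```python
import uuid
from datetime import datetime, timedelta

# Hypothetical in-memory copy of the order_events schema fields.
FIELDS = [
    {"name": "event_id", "type": str, "format": "uuid", "nullable": False},
    {"name": "order_id", "type": int, "nullable": False},
    {"name": "event_type", "type": str, "nullable": False,
     "allowed_values": ["created", "paid", "shipped", "delivered", "cancelled"]},
    {"name": "amount_cents", "type": int, "nullable": False},
    {"name": "occurred_at", "type": str, "format": "timestamp", "nullable": False},
]


def validate_event(event: dict) -> list:
    """Return a list of violation messages (empty if the event conforms)."""
    violations = []
    for field in FIELDS:
        name = field["name"]
        value = event.get(name)
        if value is None:
            if not field["nullable"]:
                violations.append(f"{name}: missing or null")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, field["type"]):
            violations.append(f"{name}: expected {field['type'].__name__}")
            continue
        if "allowed_values" in field and value not in field["allowed_values"]:
            violations.append(f"{name}: {value!r} not in allowed values")
        if field.get("format") == "uuid":
            try:
                uuid.UUID(value)
            except ValueError:
                violations.append(f"{name}: not a valid UUID")
        if field.get("format") == "timestamp":
            try:
                parsed = datetime.fromisoformat(value)
            except ValueError:
                violations.append(f"{name}: not an ISO 8601 timestamp")
            else:
                # Naive timestamps (no offset) and non-zero offsets both fail.
                if parsed.utcoffset() != timedelta(0):
                    violations.append(f"{name}: timestamp is not in UTC")
    return violations
```

Returning a list of violations, rather than raising on the first one, lets a caller report every problem with an event at once.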
SLAs (Service Level Agreements)
How quickly and reliably the data will be available.
sla:
  freshness:
    max_delay: 15 minutes
    description: Events appear in Kafka within 15 minutes of occurrence
  availability:
    uptime: 99.9%
    description: The event stream is available 99.9% of the time
  volume:
    expected_daily_events: 50000-200000
    alert_if_below: 10000
    alert_if_above: 500000
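The freshness and volume thresholds above translate fairly directly into monitoring checks. The sketch below is one possible shape, with the thresholds hard-coded for brevity. The function names are illustrative assumptions, and a real monitor would read the values from the contract file.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_DELAY = timedelta(minutes=15)  # sla.freshness.max_delay
ALERT_IF_BELOW = 10_000            # sla.volume.alert_if_below
ALERT_IF_ABOVE = 500_000           # sla.volume.alert_if_above


def freshness_breached(occurred_at: datetime, arrived_at: datetime) -> bool:
    """True if the event took longer than max_delay to reach the stream."""
    return arrived_at - occurred_at > MAX_DELAY


def volume_alert(daily_event_count: int) -> Optional[str]:
    """Return 'below' or 'above' if the daily count crosses a threshold."""
    if daily_event_count < ALERT_IF_BELOW:
        return "below"
    if daily_event_count > ALERT_IF_ABOVE:
        return "above"
    return None
```

For scale, a 99.9% uptime target leaves room for roughly 43 minutes of unavailability in a 30-day month.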
Quality Expectations
What guarantees the producer makes about data quality.
quality:
  - field: event_id
    check: unique
    description: No duplicate event IDs within a 24-hour window
  - field: amount_cents
    check: range
    min: 0
    max: 10000000  # $100,000 max
    description: Amount is non-negative and within reasonable bounds
  - field: occurred_at
    check: freshness
    max_age: 48 hours
    description: Events are not backdated more than 48 hours
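The three quality rules above could be evaluated over a batch of events along these lines. This is a sketch under stated assumptions: `check_quality` is a hypothetical helper, `occurred_at` is assumed to be a timezone-aware `datetime` already, and "within a 24-hour window" is interpreted as two events with the same ID occurring no more than 24 hours apart.

```python
from datetime import datetime, timedelta, timezone


def check_quality(events, now):
    """Return (event_id, reason) pairs for events that fail a quality rule."""
    failures = []
    last_seen = {}  # event_id -> occurred_at of the previous event with that id
    for event in sorted(events, key=lambda e: e["occurred_at"]):
        event_id = event["event_id"]
        previous = last_seen.get(event_id)
        # Rule 1: no duplicate event IDs within a 24-hour window.
        if previous is not None and event["occurred_at"] - previous <= timedelta(hours=24):
            failures.append((event_id, "duplicate event_id within 24 hours"))
        last_seen[event_id] = event["occurred_at"]
        # Rule 2: amount between 0 and 10,000,000 cents ($100,000).
        if not 0 <= event["amount_cents"] <= 10_000_000:
            failures.append((event_id, "amount_cents out of range"))
        # Rule 3: events are not backdated more than 48 hours.
        if now - event["occurred_at"] > timedelta(hours=48):
            failures.append((event_id, "occurred_at older than 48 hours"))
    return failures
```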
Ownership & Contact
Who is responsible for the data and how to reach them.
ownership:
  team: backend-payments
  slack_channel: "#payments-data"
  on_call: payments-oncall@company.pagerduty.com
  contacts:
    - name: Jane Smith
      role: Tech Lead
      email: jane@company.com
    - name: Bob Jones
      role: Data Producer Owner
      email: bob@company.com
Change Management
How changes to the contract are handled.
changes:
  breaking_change_notice: 14 days
  deprecation_notice: 30 days
  communication_channel: "#data-contracts-changes"
  approval_required_from: [data-engineering, analytics]
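A change-management policy like this can also be checked mechanically. The following sketch assumes the announcement date and the list of approving teams are tracked somewhere, such as a pull request. `can_ship_breaking_change` is a hypothetical helper, not an existing tool.

```python
from datetime import date, timedelta

BREAKING_CHANGE_NOTICE = timedelta(days=14)             # changes.breaking_change_notice
REQUIRED_APPROVERS = {"data-engineering", "analytics"}  # changes.approval_required_from


def can_ship_breaking_change(announced_on: date, ship_on: date, approvals) -> bool:
    """True if the notice period has elapsed and every required team approved."""
    notice_ok = ship_on - announced_on >= BREAKING_CHANGE_NOTICE
    approvals_ok = REQUIRED_APPROVERS <= set(approvals)
    return notice_ok and approvals_ok
```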
Why Data Contracts Matter
The Producer Does Not Know Who Uses Their Data
A backend engineer adds a feature that changes the status field from a string to an integer enum. In their world, this is a clean improvement. They have no idea that:
- A data pipeline parses status as a string and maps it to categories
- An ML model uses one-hot encoding on the string values
- A Looker dashboard has a filter that matches on the string "completed"
All three break silently. Data contracts prevent this by making the dependency explicit and the change process structured.
Data Contracts as API Contracts for Data
REST APIs have contracts: endpoint paths, request/response schemas, authentication, rate limits, deprecation policies. We would never accept a REST API that changed its response format without notice. But we routinely accept this for data.
Data contracts apply the same discipline:
REST API Contract          Data Contract
-----------------          -------------
Endpoint path              Topic/table name
Request/response schema    Data schema
Authentication             Access control
Rate limits                Volume expectations
SLA (latency, uptime)      SLA (freshness, availability)
Deprecation policy         Change management policy
API versioning             Schema versioning
Implementing Data Contracts
Contract-as-Code
Store contracts in version control alongside the code that produces the data.
repo: backend-payments/
  src/
    order_events.py
  contracts/
    order_events.yaml      <- The data contract
  tests/
    test_order_events.py
    test_contract.py       <- Tests that validate output matches contract
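A file like test_contract.py might look roughly like the sketch below. It is only an illustration: the field names are copied by hand from the contract (a real test would more likely parse the YAML file), and `build_order_event` is a hypothetical stand-in for the producer code in src/order_events.py.

```python
import unittest
from datetime import datetime, timezone

# Copied by hand from contracts/order_events.yaml for this sketch.
CONTRACT_FIELDS = {"event_id", "order_id", "event_type", "amount_cents", "occurred_at"}
ALLOWED_EVENT_TYPES = {"created", "paid", "shipped", "delivered", "cancelled"}


def build_order_event(order_id, event_type, amount_cents, occurred_at):
    """Hypothetical stand-in for the producer code in src/order_events.py."""
    return {
        "event_id": "123e4567-e89b-12d3-a456-426614174000",
        "order_id": order_id,
        "event_type": event_type,
        "amount_cents": amount_cents,
        "occurred_at": occurred_at.isoformat(),
    }


class TestOrderEventsContract(unittest.TestCase):
    def setUp(self):
        self.event = build_order_event(
            42, "paid", 1999, datetime(2024, 3, 1, tzinfo=timezone.utc)
        )

    def test_field_names_match_contract(self):
        # Fails if the producer adds, drops, or renames a field.
        self.assertEqual(set(self.event), CONTRACT_FIELDS)

    def test_event_type_is_allowed(self):
        self.assertIn(self.event["event_type"], ALLOWED_EVENT_TYPES)
```

Such a test can be run with `python -m unittest` as part of the producer's normal test suite.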
Validation at the Source
The producer validates that their output matches the contract before publishing.
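import json
import uuid

# validate_against_contract and kafka_producer are assumed to be defined
# elsewhere in the producer codebase (for example, a configured Kafka client).

def publish_order_event(order_id, event_type, amount_cents, occurred_at):
    # occurred_at is expected to be a timezone-aware UTC datetime
    event = {
        'event_id': str(uuid.uuid4()),
        'order_id': order_id,
        'event_type': event_type,
        'amount_cents': amount_cents,
        'occurred_at': occurred_at.isoformat(),
    }
    # Validate against contract before publishing
    validate_against_contract(event, 'order_events')
    kafka_producer.produce('order_events', value=json.dumps(event))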
Validation at Ingestion
The data engineering team also validates incoming data against the contract.
# consume_from_kafka, load_contract, dead_letter, metrics, and
# write_to_warehouse are assumed to be provided by the ingestion codebase.

def ingest_order_events():
    events = consume_from_kafka('order_events')
    # Validate against the published contract
    contract = load_contract('order_events')
    for event in events:
        violations = contract.validate(event)
        if violations:
            # Log the violation and route to dead letter queue
            dead_letter.publish(event, violations=violations)
            metrics.increment('contract_violations', tags={'contract': 'order_events'})
        else:
            write_to_warehouse(event)
CI/CD Integration
Block deployments that violate data contracts.
# In the CI pipeline for the backend-payments repo:
1. Run unit tests
2. Run integration tests
3. Validate data contract compatibility:
   - Load the current contract (order_events v2.1.0)
   - Generate the schema from the code
   - Check compatibility (backward compatible? breaking change?)
   - If breaking: FAIL the build. Require contract version bump and notice period.
4. Deploy
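The compatibility check in step 3 can be as simple as diffing two schemas. The sketch below assumes each schema is reduced to a mapping of field name to type name, and it treats removing a field or changing its type as breaking and adding a field as backward compatible. That classification is an assumption, since policies differ between organizations, and `find_breaking_changes` is a hypothetical helper.

```python
def find_breaking_changes(old_fields: dict, new_fields: dict) -> list:
    """Compare two {field_name: type_name} schemas and list breaking changes."""
    problems = []
    for name, old_type in old_fields.items():
        if name not in new_fields:
            problems.append(f"removed field: {name}")
        elif new_fields[name] != old_type:
            problems.append(f"type changed: {name} ({old_type} -> {new_fields[name]})")
    return problems


# Schema from the current contract vs. schema generated from the code.
current = {"event_id": "string", "order_id": "integer", "amount_cents": "integer"}
proposed = {"event_id": "string", "order_id": "string",
            "amount_cents": "integer", "currency": "string"}

problems = find_breaking_changes(current, proposed)
# A CI job would exit with this code; non-zero fails the build.
exit_code = 1 if problems else 0
print(problems)
```

Here the added `currency` field passes, while the type change on `order_id` is flagged and would fail the build.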
Breaking Changes Require Communication
When a breaking change is necessary (it sometimes is), the process should be:
1. Producer proposes the change
   - Opens a PR updating the contract
   - Describes what changes and why
2. Consumers are notified
   - Automated notification to all registered consumers
   - Discussion in the data contracts channel
3. Migration timeline is agreed
   - "New schema available March 1"
   - "Old schema deprecated March 15"
   - "Old schema removed April 1"
4. Both schemas are supported during transition
   - Producer writes both formats, or
   - A compatibility layer translates old to new
5. Old schema is removed
   - Only after all consumers have confirmed migration
This is exactly how API versioning works in well-run organizations. Data should get the same treatment.
Real-World Example: E-Commerce Order Events
A mid-size e-commerce company has three teams consuming order events:
- Data Engineering: Ingests events into the warehouse for analytics tables.
- ML Team: Uses event streams to train a fraud detection model.
- Finance: Uses aggregated data for revenue reconciliation.
Without a contract, the backend team renames total_amount to order_total. The data engineering pipeline produces NULLs in the amount column. The ML model's fraud scores drop because a key feature is missing. Finance reports zero revenue for the day. Three teams spend a combined 20 hours debugging.
With a contract: the backend team's CI pipeline rejects the rename because it is a breaking change. The team opens a data contract change request. They agree to use the expand-contract pattern: add order_total, keep total_amount for 30 days, then remove it. All consumers migrate within the window. Zero downtime. Zero debugging.
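The expand-contract step in this example can be sketched as follows. The producer writes both field names during the migration window, and consumers prefer the new name but fall back to the old one. The removal date and both function names are hypothetical, chosen only to make the sketch concrete.

```python
from datetime import date

REMOVE_OLD_FIELD_ON = date(2024, 4, 1)  # hypothetical end of the 30-day window


def build_order_payload(amount: int, today: date) -> dict:
    """Producer side: always write order_total, keep total_amount until removal."""
    payload = {"order_total": amount}
    if today < REMOVE_OLD_FIELD_ON:
        payload["total_amount"] = amount  # old name, kept for unmigrated consumers
    return payload


def read_order_total(payload: dict) -> int:
    """Consumer side: prefer the new field, fall back to the old one."""
    if "order_total" in payload:
        return payload["order_total"]
    return payload["total_amount"]
```

Because the consumer accepts either name, producer and consumers can deploy in any order during the window.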
Maturity Levels
Level 0: No Contracts
Producers publish whatever they want. Consumers discover schema changes when pipelines break. This is where most organizations start.
Level 1: Documented Schemas
Schemas are documented in a wiki or README. Better than nothing, but documentation goes stale quickly and there is no enforcement.
Level 2: Schema Registry
A central registry stores schemas and checks compatibility. Incompatible changes are blocked. This prevents many breaking changes but does not cover SLAs or ownership.
Level 3: Full Data Contracts
Schemas, SLAs, quality expectations, ownership, and change management are defined as code, enforced in CI/CD, and monitored in production. This is the goal for mature data platforms.
Common Pitfalls
- Making contracts too strict too early. A contract with 50 quality checks on a table that changes weekly creates friction without value. Start with critical fields and expand.
- Not enforcing contracts. A contract that exists in a YAML file but is never validated is just documentation. Contracts must be enforced at the source, at ingestion, or both.
- Only validating schema, not semantics. A contract that checks types but not meanings misses the most dangerous changes (units, timezones, business logic).
- Making contracts the data engineering team's responsibility alone. Producers must own their contracts. If data engineering writes and maintains producer contracts, the system breaks down.
- Ignoring internal data. Contracts between teams within the same company matter just as much as contracts with external vendors. Internal teams break your pipelines too.
- No process for breaking changes. If there is no defined process, teams either never make breaking changes (accumulating tech debt) or make them without notice (breaking consumers).
Key Takeaways
- A data contract is an agreement between producer and consumer covering schema, SLAs, quality, ownership, and change management.
- Without contracts, producers have no visibility into who uses their data, and changes break downstream systems silently.
- Data contracts apply the same discipline as API contracts: versioning, backward compatibility, deprecation policies, and structured change management.
- Store contracts as code, validate in CI/CD, and monitor compliance in production. A contract that is not enforced is just documentation.
- Breaking changes require communication: propose, notify consumers, agree on a timeline, support both schemas during transition, then remove the old one.
- Start simple. A contract covering schema and ownership for your five most critical data sources is better than a comprehensive contract framework that nobody adopts.