Data Modeling

Overview

Data modeling is the process of defining how data is stored, organized, and accessed. Good data models make applications fast and maintainable. Bad data models create cascading problems that get worse as the system grows.

Normalization

Normalization eliminates data redundancy by organizing data into related tables. Each piece of information is stored in exactly one place.

Normal Forms

First Normal Form (1NF):
- Each column contains atomic values
- No repeating groups
- Bad:  orders(id, items: "book,pen,tape")
- Good: order_items(order_id, item_name)

Second Normal Form (2NF):
- Meets 1NF
- Every non-key column depends on the entire primary key
- Bad:  order_items(order_id, product_id, product_name)
- Good: products(product_id, product_name) + order_items(order_id, product_id)

Third Normal Form (3NF):
- Meets 2NF
- No transitive dependencies (non-key depending on non-key)
- Bad:  employees(id, department_id, department_name)
- Good: employees(id, department_id) + departments(department_id, department_name)

Benefits of Normalization

Single source of truth for each fact
Updates happen in one place, preventing inconsistency
Storage efficiency (no duplicate data)
Flexible querying with joins

Costs of Normalization

Reads require joins across multiple tables
Join performance degrades at scale
More complex queries for simple reads
Harder to shard across multiple machines

Denormalization

Denormalization intentionally introduces redundancy to optimize read performance. You duplicate data to avoid expensive joins.

When to Denormalize

Read-heavy workloads where join performance is a bottleneck
Data that is read together should be stored together
Reporting and analytics queries that scan large datasets
Distributed systems where cross-partition joins are impossible

Denormalization Strategies

Strategy 1: Duplicated columns
- Store the customer name directly on the order table
- Avoids joining to the customer table for order display
- Trade-off: Must update all orders if customer name changes

Strategy 2: Precomputed aggregates
- Store order_count directly on the customer record
- Avoids COUNT(*) queries on the orders table
- Trade-off: Must maintain the count on every insert/delete

Strategy 3: Materialized views
- Precompute and store the result of a complex query
- Database refreshes the view periodically or on change
- Trade-off: Stale data between refreshes

Strategy 4: Embedded documents (NoSQL)
- Store the full shipping address inside each order document
- Avoids a separate addresses collection lookup
- Trade-off: Address updates require updating all orders

Real-World Examples

Twitter denormalizes tweet data by storing the author's display name and avatar URL directly on the tweet. Updating a user profile is rare compared to reading tweets.

Amazon stores product information redundantly in the order history. The product details at time of purchase are preserved even if the product listing changes later.

Schema Design for Access Patterns

The most important principle: design your schema around how data will be read, not how it looks in the real world.

Access Pattern Analysis

Step 1: List all queries your application will make
  - "Get user profile by user_id"
  - "List all orders for a user, most recent first"
  - "Find all users in a given city"
  - "Get order details with line items and shipping status"

Step 2: Identify hot paths (high frequency queries)
  - User profile reads: 10,000/sec
  - Order listing: 5,000/sec
  - City search: 50/sec

Step 3: Optimize schema for hot paths
  - Index user_id on all relevant tables
  - Denormalize order display data to avoid joins
  - City search is low volume; a join is acceptable

Step 4: Validate with realistic data volumes
  - Test with production-scale data, not toy datasets
  - Measure query plans, not just results

Index Design

Indexes speed up reads but slow down writes.

Primary key: Unique identifier, clustered storage order
  - users(user_id) -> fast point lookups

Secondary index: Additional lookup paths
  - users(email) -> find user by email
  - orders(user_id, created_at DESC) -> recent orders for user

Composite index: Multiple columns, order matters
  - (country, city, created_at) supports:
    WHERE country = 'US'                          [yes]
    WHERE country = 'US' AND city = 'Seattle'     [yes]
    WHERE city = 'Seattle'                         [no - leftmost prefix required]

Covering index: Includes all columns needed by a query
  - Avoids table lookup entirely
  - orders(user_id, created_at) INCLUDE (total, status)

Entity Relationships

One-to-One

Example: User and UserProfile
- Store in same table if always accessed together
- Separate tables if profile is large and rarely accessed
- Use foreign key with unique constraint

One-to-Many

Example: User and Orders
- Foreign key on the "many" side (orders.user_id)
- In document stores, embed if bounded (user's addresses)
- Reference if unbounded (user's orders over time)

Rule of thumb for embedding (document stores):
- Embed if fewer than ~100 items and they don't change independently
- Reference if unbounded or independently updated

Many-to-Many

Example: Students and Courses
- Join table in relational: enrollments(student_id, course_id)
- In document stores: store array of IDs on one or both sides
- Consider which direction you query most often

Example: Products and Tags
- product_tags(product_id, tag_id)
- Enables: "all products with tag X" and "all tags for product Y"

Hierarchical Data

Example: Organizational chart, category trees, comment threads

Option 1: Adjacency list
  categories(id, parent_id, name)
  Simple but recursive queries are expensive

Option 2: Materialized path
  categories(id, path, name)
  path: "/electronics/computers/laptops"
  Easy subtree queries: WHERE path LIKE '/electronics/%'

Option 3: Nested sets
  categories(id, left, right, name)
  Fast subtree reads, expensive writes

Option 4: Closure table
  category_tree(ancestor_id, descendant_id, depth)
  Fast reads and writes, extra storage

Schema Evolution

Schemas change over time. How you handle changes determines whether deployments are smooth or terrifying.

Backward-Compatible Changes

These changes are safe to deploy without coordinating application and database changes:

Safe changes:
- Adding a new column with a default value
- Adding a new table
- Adding a new index (may lock table temporarily)
- Widening a column (INT -> BIGINT, VARCHAR(50) -> VARCHAR(100))

Unsafe changes:
- Removing a column (old code still references it)
- Renaming a column (old code uses old name)
- Changing a column type incompatibly
- Adding a NOT NULL constraint without a default

Migration Strategies

Expand-Contract Pattern (Zero-Downtime Migrations):

Phase 1 - Expand:
  Add new column, write to both old and new columns
  Deploy application code that writes to both

Phase 2 - Migrate:
  Backfill existing data from old column to new column
  Verify data consistency

Phase 3 - Contract:
  Deploy code that reads only from new column
  Remove old column after verification period

Example: Renaming "username" to "display_name"
  1. Add display_name column
  2. Write to both username and display_name
  3. Backfill display_name from username
  4. Switch reads to display_name
  5. Stop writing to username
  6. Drop username column (weeks later)

Schema Evolution in NoSQL

Document stores handle schema changes differently:

Strategy 1: Schema version field
  { "schema_version": 2, "name": "Alice", "email_addresses": [...] }
  Application code handles multiple versions

Strategy 2: Lazy migration
  Read old format, write new format
  Documents migrate naturally over time
  Old documents are upgraded on next access

Strategy 3: Background migration
  Run a job to update all documents to new format
  Safer but consumes resources

Protobuf/Avro schemas:
  - Support forward and backward compatibility
  - New fields must have defaults
  - Never reuse field numbers
  - Mark removed fields as reserved

Real-World Modeling Decisions

Slack stores messages in a sharded MySQL cluster. Messages are partitioned by channel ID and time range. This design optimizes for the primary access pattern: loading recent messages in a channel.

Shopify uses a heavily denormalized schema for the storefront. Product data, pricing, and inventory are combined into read-optimized structures while the admin panel uses normalized tables for writes.

LinkedIn models professional relationships as a graph but stores member profiles in a denormalized document format for fast profile page rendering.

Common Pitfalls

Premature denormalization: Start normalized. Denormalize when you have measured evidence of a performance problem, not before.
Ignoring write patterns: A schema optimized purely for reads can make writes painfully slow or complex.
Over-indexing: Every index slows down writes and consumes storage. Only index columns you actually query on.
Not planning for schema evolution: If your migration strategy is "take the database offline," you will eventually have a very bad day.
Modeling data as it looks in the UI: The UI changes frequently. Model data around business entities and access patterns, not screens.
Unbounded embedding in document stores: Embedding a growing list inside a document leads to ever-larger documents and eventually hits size limits.

Key Takeaways

Design your schema around access patterns, not entity-relationship diagrams.
Start normalized and denormalize strategically based on measured performance needs.
Every denormalization trades write complexity for read performance. Know the cost.
Schema evolution is inevitable. Plan for zero-downtime migrations from the start.
Index design is as important as table design. Understand composite index ordering and covering indexes.
Test your schema with production-scale data volumes. Queries that work on 1,000 rows can fail on 10 million.