Dimensional Modeling

Dimensional modeling is the foundation of analytical data warehousing. Developed by Ralph Kimball in the 1990s, it organizes data into a structure optimized for human understanding and query performance. If you work in data engineering, you will encounter dimensional models constantly — and you should know how to build them well.

The Core Concept: Facts & Dimensions

Dimensional modeling divides data into two types of tables:

Fact tables store measurable events — things that happened. An order was placed. A page was viewed. A payment was processed. Each row in a fact table represents one event, and the table contains numeric measures (amount, quantity, duration) along with foreign keys to dimension tables.

Dimension tables store the context around those events — the who, what, where, and when. Who placed the order? What product was it? Where was it shipped? When did it happen? Dimension tables are typically wide (many columns) and descriptive.

-- A fact table: each row is one order line item
CREATE TABLE fact_order_items (
    order_item_id    BIGINT PRIMARY KEY,
    order_id         BIGINT,
    customer_key     INT REFERENCES dim_customer,
    product_key      INT REFERENCES dim_product,
    date_key         INT REFERENCES dim_date,
    store_key        INT REFERENCES dim_store,
    quantity         INT,
    unit_price       DECIMAL(10,2),
    discount_amount  DECIMAL(10,2),
    total_amount     DECIMAL(10,2)
);

-- A dimension table: descriptive attributes about customers
CREATE TABLE dim_customer (
    customer_key     INT PRIMARY KEY,
    customer_id      VARCHAR(50),   -- natural key from source
    first_name       VARCHAR(100),
    last_name        VARCHAR(100),
    email            VARCHAR(255),
    city             VARCHAR(100),
    state            VARCHAR(50),
    country          VARCHAR(50),
    customer_segment VARCHAR(50),
    signup_date      DATE,
    is_active        BOOLEAN
);

The fact table is narrow and tall (few columns, many rows). The dimension table is wide and short (many columns, fewer rows). This separation is intentional — it makes queries intuitive and performant.

Star Schema

The star schema is the most common dimensional modeling pattern. One fact table sits at the center, surrounded by dimension tables. When you draw it on a whiteboard, it looks like a star.

                  dim_date
                     |
                     |
dim_customer --- fact_order_items --- dim_product
                     |
                     |
                  dim_store

Queries against a star schema are straightforward:

-- Total revenue by product category and month
SELECT
    d.calendar_month,
    p.category,
    SUM(f.total_amount) AS revenue
FROM fact_order_items f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
WHERE d.calendar_year = 2025
GROUP BY d.calendar_month, p.category
ORDER BY d.calendar_month, revenue DESC;

The beauty of this structure is that a business analyst can understand it immediately. They know that fact_order_items contains orders, and they can slice the data by joining to any dimension. No complex subqueries, no self-joins.

Snowflake Schema

The snowflake schema normalizes dimension tables further. Instead of a flat dim_product table with a category column, you have a separate dim_category table that dim_product references.

                      dim_date
                         |
dim_category             |
     |                   |
dim_product --- fact_order_items --- dim_customer --- dim_geography
                         |
                         |
                      dim_store

-- Snowflake: dim_product references dim_category
CREATE TABLE dim_category (
    category_key  INT PRIMARY KEY,
    category_name VARCHAR(100),
    department    VARCHAR(100)
);

CREATE TABLE dim_product (
    product_key   INT PRIMARY KEY,
    product_id    VARCHAR(50),
    product_name  VARCHAR(200),
    category_key  INT REFERENCES dim_category,
    brand         VARCHAR(100),
    unit_cost     DECIMAL(10,2)
);

Star vs snowflake in practice: Star schemas are simpler to query (fewer joins) and perform better on modern columnar warehouses like Snowflake, BigQuery, and Redshift. Snowflake schemas save a small amount of storage by eliminating redundant data in dimensions, but that savings is negligible with modern storage costs.

Use star schemas unless you have a specific reason to normalize dimensions further. Most practitioners and most modern tooling assume star schemas.

Designing Fact Tables

Grain

The grain of a fact table is the level of detail each row represents. This is the single most important decision in dimensional modeling. Get the grain wrong, and your model is either useless or misleading.

Grain examples for an e-commerce business:

Coarsest:  One row per customer per month    (aggregated)
           One row per order                 (order-level)
           One row per order line item       (line-item-level)
Finest:    One row per click event           (event-level)

Start with the finest grain that makes sense for your use case. You can always aggregate fine-grained data upward, but you cannot disaggregate a coarse-grained table. If your fact table is at the order level, you cannot answer "which products in this order were returned?" without going back to the source.

Types of Facts

Additive facts can be summed across all dimensions. Revenue is additive — you can sum it across time, products, and customers, and the result is meaningful.

Semi-additive facts can be summed across some dimensions but not all. Account balance is semi-additive — you can sum it across customers (total balance across all accounts) but not across time (summing today's balance and yesterday's balance is nonsensical).

Non-additive facts cannot be summed at all. Ratios and percentages are non-additive. If Product A has a 20% margin and Product B has a 30% margin, the combined margin is not 50%. Store the components (revenue and cost) and calculate the ratio at query time.

-- Store components, not ratios
-- Good: store revenue and cost in the fact table
SELECT
    p.category,
    SUM(f.revenue) AS total_revenue,
    SUM(f.cost) AS total_cost,
    SUM(f.revenue - f.cost) / NULLIF(SUM(f.revenue), 0) AS margin_pct
FROM fact_order_items f
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY p.category;

Factless Fact Tables

Sometimes you need to record that something happened without any numeric measurement. A student registered for a class. A customer viewed a promotion but did not click. These are factless fact tables — they contain only foreign keys to dimensions, with no measures.

-- Factless fact: records which promotions each customer was exposed to
CREATE TABLE fact_promotion_exposure (
    date_key      INT REFERENCES dim_date,
    customer_key  INT REFERENCES dim_customer,
    promotion_key INT REFERENCES dim_promotion,
    channel_key   INT REFERENCES dim_channel
);

Designing Dimension Tables

The Date Dimension

Every dimensional model needs a date dimension. Do not just use a raw date column — build a proper dimension table with pre-calculated attributes.

CREATE TABLE dim_date (
    date_key         INT PRIMARY KEY,        -- 20250115
    full_date        DATE,
    day_of_week      VARCHAR(10),            -- 'Wednesday'
    day_of_month     INT,
    day_of_year      INT,
    week_of_year     INT,
    calendar_month   VARCHAR(20),            -- 'January'
    month_number     INT,
    calendar_quarter VARCHAR(5),             -- 'Q1'
    calendar_year    INT,
    fiscal_quarter   VARCHAR(5),             -- may differ from calendar
    fiscal_year      INT,
    is_weekend       BOOLEAN,
    is_holiday       BOOLEAN,
    holiday_name     VARCHAR(100)
);

This table lets analysts filter and group by any time attribute without writing date functions in every query. "Revenue by fiscal quarter" becomes a simple GROUP BY instead of a complex CASE expression.

Conformed Dimensions

A conformed dimension is shared across multiple fact tables. dim_customer is used by fact_orders, fact_support_tickets, and fact_web_sessions. This ensures that when someone filters by customer segment, the definition is consistent everywhere.

Conformed dimensions are one of the most valuable aspects of dimensional modeling. They enforce a shared vocabulary across the organization.

Junk Dimensions

Low-cardinality flags and indicators (is_gift_wrap, payment_type, shipping_priority) can clutter a fact table. A junk dimension combines these into a single table.

CREATE TABLE dim_order_flags (
    order_flag_key    INT PRIMARY KEY,
    is_gift_wrap      BOOLEAN,
    payment_type      VARCHAR(20),
    shipping_priority VARCHAR(20),
    is_first_order    BOOLEAN
);

Instead of four columns in the fact table, you have one foreign key. This is a minor optimization, but it keeps fact tables clean.

Why Dimensional Modeling Violates Normalization (& Why That Is Fine)

In a normalized (3NF) model, you store each piece of information exactly once. Customer city appears only in the customer table, never duplicated. This is correct for operational databases where insert and update performance matters.

Dimensional models intentionally denormalize. The customer's city, state, and country all sit in dim_customer rather than being broken into separate tables. This violates third normal form but makes analytical queries faster and simpler.

The tradeoff is acceptable because:

Analytical workloads are read-heavy. You query the warehouse thousands of times for every write. Optimizing for reads (fewer joins) is the right call.
Storage is cheap. Repeating "United States" across 10 million rows in dim_customer costs fractions of a cent on a columnar warehouse.
Columnar storage handles it well. Modern warehouses compress repeated values in columns extremely efficiently. The denormalized data takes less space than you would expect.
Human comprehension matters. A star schema with 5 tables is easier to understand than a normalized schema with 30 tables. Analysts should be exploring data, not decoding join paths.

Common Pitfalls

Not defining the grain upfront. If you cannot state the grain of your fact table in one sentence ("one row per order line item per day"), your model is not clear enough.
Storing derived metrics in fact tables. Store the raw components (quantity, unit price) and calculate derived values (total, margin) at query time. Pre-calculated values become stale when business rules change.
Using operational keys as primary keys. Surrogate keys (auto-incrementing integers) in dimension tables protect you from source system key changes and enable SCD Type 2 tracking.
Skipping the date dimension. Querying raw dates works until someone asks for "fiscal quarter" or "business days only." Build the date dimension on day one.
Over-normalizing dimensions. In analytical contexts, the extra joins from snowflaking dimensions rarely justify the marginal storage savings. Prefer star schema.
Making fact tables too wide. If your fact table has 50 columns of descriptive attributes, those belong in dimension tables. Facts should be lean: keys and measures.

Key Takeaways

Dimensional modeling organizes data into fact tables (measurable events) and dimension tables (descriptive context).
Star schemas are the standard pattern: one fact table surrounded by dimension tables. Prefer them over snowflake schemas for simplicity and query performance.
The grain is the most important design decision. Define it explicitly and choose the finest grain your use case requires.
Store additive components in facts, not derived ratios. Calculate ratios at query time.
Conformed dimensions enforce consistent definitions across the organization.
Denormalization is deliberate and justified for analytical workloads where reads vastly outnumber writes.