Data Vault & Other Approaches

Dimensional modeling is the workhorse of analytical data warehousing, but it is not the only approach. Depending on your organization's size, the number of source systems you integrate, and your query patterns, other modeling approaches might serve you better. This guide covers Data Vault, One Big Table, Activity Schema, and when each makes sense.

Data Vault

Data Vault is a modeling methodology designed by Dan Linstedt for large enterprises that integrate data from many source systems. Where dimensional modeling optimizes for query simplicity, Data Vault optimizes for auditability, flexibility, and parallel loading.

The Three Building Blocks

A Data Vault model consists of three table types: hubs, links, and satellites.

Hubs represent core business entities. A hub contains only the business key (the natural identifier) and metadata about when and where the record was first loaded. No descriptive attributes.

CREATE TABLE hub_customer (
    hub_customer_hash_key  CHAR(32) PRIMARY KEY,  -- MD5 of business key
    customer_id            VARCHAR(50),            -- business key
    load_date              TIMESTAMP,
    record_source          VARCHAR(100)
);

Links represent relationships between hubs. An order connects a customer to a product, so there is a link table capturing that relationship.

CREATE TABLE link_order (
    link_order_hash_key       CHAR(32) PRIMARY KEY,
    hub_customer_hash_key     CHAR(32) REFERENCES hub_customer,
    hub_product_hash_key      CHAR(32) REFERENCES hub_product,
    hub_store_hash_key        CHAR(32) REFERENCES hub_store,
    order_id                  VARCHAR(50),
    load_date                 TIMESTAMP,
    record_source             VARCHAR(100)
);

Satellites store descriptive attributes and their history. Each hub or link can have multiple satellites, typically one per source system. Satellites are where all the descriptive data lives, along with timestamps that track when each version was valid.

CREATE TABLE sat_customer_crm (
    hub_customer_hash_key  CHAR(32),
    load_date              TIMESTAMP,
    load_end_date          TIMESTAMP,
    record_source          VARCHAR(100),
    first_name             VARCHAR(100),
    last_name              VARCHAR(100),
    email                  VARCHAR(255),
    phone                  VARCHAR(50),
    PRIMARY KEY (hub_customer_hash_key, load_date)
);

CREATE TABLE sat_customer_billing (
    hub_customer_hash_key  CHAR(32),
    load_date              TIMESTAMP,
    load_end_date          TIMESTAMP,
    record_source          VARCHAR(100),
    billing_address        VARCHAR(500),
    payment_method         VARCHAR(50),
    credit_limit           DECIMAL(10,2),
    PRIMARY KEY (hub_customer_hash_key, load_date)
);

How It Fits Together

sat_customer_crm                              sat_customer_billing
       \                                            /
        \                                          /
         hub_customer ---- link_order ---- hub_product
                                |                \
                                |                 \
                           sat_order          sat_product_catalog

Why Data Vault Exists

Data Vault solves specific problems that dimensional models handle poorly:

Many source systems. When customer data comes from CRM, billing, support, and marketing — each with different schemas, update frequencies, and data quality — Data Vault keeps them cleanly separated in different satellites while unifying them through a single hub.

Parallel loading. Hubs, links, and satellites are independent. You can load the CRM satellite without waiting for the billing satellite. In large enterprises with dozens of data feeds arriving on different schedules, this matters.

Full auditability. Every record tracks when it was loaded and from which source. You can always trace a value back to its origin. This is important in regulated industries (finance, healthcare) where auditors need a clear lineage.

Schema evolution. Adding a new source system means adding a new satellite. Existing tables are untouched. In a dimensional model, adding a new source often requires altering existing tables.

The Downsides

Query complexity. A simple question like "show me customer names and their orders" requires joining a hub, a satellite, and a link. In a dimensional model, it is one join.

-- Data Vault: getting customer name and order details
SELECT
    c.customer_id,
    sc.first_name,
    sc.last_name,
    lo.order_id,
    so.order_amount,
    so.order_date
FROM hub_customer c
JOIN sat_customer_crm sc
    ON c.hub_customer_hash_key = sc.hub_customer_hash_key
    AND sc.load_end_date IS NULL  -- current record
JOIN link_order lo
    ON c.hub_customer_hash_key = lo.hub_customer_hash_key
JOIN sat_order so
    ON lo.link_order_hash_key = so.link_order_hash_key
    AND so.load_end_date IS NULL;

Compare that to the dimensional model equivalent:

-- Dimensional: same question
SELECT
    c.customer_id,
    c.first_name,
    c.last_name,
    f.order_id,
    f.order_amount,
    f.order_date
FROM fact_orders f
JOIN dim_customer c ON f.customer_key = c.customer_key
WHERE c.is_current = TRUE;

Not analyst-friendly. Business users and analysts should not query the Data Vault directly. You build "business vault" or "information mart" layers on top — essentially dimensional models derived from the vault. This means more layers and more work.

Overhead for small teams. If you have one or two data engineers and a handful of data sources, Data Vault adds complexity without proportional benefit. The methodology assumes a scale of integration that most companies do not have.

When to Use Data Vault

You have 10+ source systems feeding into the warehouse
Regulatory requirements demand full auditability and lineage
Multiple teams need to load data independently and in parallel
Source system schemas change frequently
You have a team large enough to maintain the additional layers (raw vault, business vault, information marts)

When to Skip Data Vault

You have fewer than 5 source systems
Your data team is smaller than 5 people
Analysts need direct access to the data
Time-to-value matters more than architectural purity

One Big Table (OBT)

One Big Table is the opposite of normalized modeling. You denormalize everything into a single wide table. Every attribute, every metric, every dimension — all in one place.

-- One Big Table: everything about orders in a single table
CREATE TABLE obt_orders (
    order_id           BIGINT,
    order_date         DATE,
    order_amount       DECIMAL(10,2),
    discount_amount    DECIMAL(10,2),
    customer_id        VARCHAR(50),
    customer_name      VARCHAR(200),
    customer_city      VARCHAR(100),
    customer_state     VARCHAR(50),
    customer_segment   VARCHAR(50),
    customer_signup    DATE,
    product_id         VARCHAR(50),
    product_name       VARCHAR(200),
    product_category   VARCHAR(100),
    product_brand      VARCHAR(100),
    store_id           VARCHAR(50),
    store_name         VARCHAR(200),
    store_region       VARCHAR(50)
);

Why OBT Works

No joins at query time. Every query is a simple scan, filter, and aggregate on one table. This is fast on columnar warehouses and trivial for analysts to write.

-- OBT: revenue by customer segment and product category
SELECT
    customer_segment,
    product_category,
    SUM(order_amount) AS revenue
FROM obt_orders
WHERE order_date >= '2025-01-01'
GROUP BY customer_segment, product_category;

Low barrier for analysts. No need to understand join paths, star schemas, or cardinality. One table, filter it, aggregate it, done.

Good for specific use cases. OBT works well for a focused analytical domain — "everything about orders" or "everything about web sessions." It is less about replacing dimensional modeling and more about creating purpose-built analytical tables.

The Downsides

Data redundancy. Customer name is repeated on every order row. Product category is repeated on every order row. For a table with hundreds of millions of rows, this increases storage and build time. Columnar compression mitigates this, but it is still more data than a normalized model.

Update complexity. If a customer changes their name, you need to update every row in the OBT that references that customer. In a dimensional model, you update one row in dim_customer.

Schema rigidity. Adding a new attribute means altering a massive table and backfilling. In a dimensional model, you alter a dimension table that might have thousands of rows instead of millions.

When to Use OBT

Building a purpose-specific analytical table for a BI tool
The downstream consumers are non-technical and need the simplest possible interface
Query performance on joins is a bottleneck and you have confirmed that denormalization solves it
The table represents a single, well-defined analytical domain

When to Skip OBT

You need to support ad hoc queries across many domains
Dimension attributes change frequently
You need to track historical changes (SCD Type 2 in an OBT is painful)

Activity Schema

Activity Schema models data as a stream of activities (events) performed by entities. Instead of building domain-specific fact tables, you create a single activity stream with a consistent structure.

CREATE TABLE activity_stream (
    activity_id      BIGINT PRIMARY KEY,
    entity_id        VARCHAR(50),        -- who (customer, user, account)
    activity         VARCHAR(100),       -- what happened
    activity_ts      TIMESTAMP,          -- when
    feature_1        VARCHAR(500),       -- context (varies by activity)
    feature_2        VARCHAR(500),
    feature_3        VARCHAR(500),
    revenue_impact   DECIMAL(10,2)       -- optional numeric measure
);

Example data:

entity_id | activity          | activity_ts         | feature_1      | feature_2    | revenue_impact
U-1001    | signed_up         | 2025-01-01 10:00:00 | organic_search | mobile       | NULL
U-1001    | placed_order      | 2025-01-05 14:30:00 | ORD-5001       | electronics  | 149.99
U-1001    | contacted_support | 2025-01-06 09:15:00 | billing_issue  | chat         | NULL
U-1001    | placed_order      | 2025-01-20 11:00:00 | ORD-5002       | clothing     | 79.99

Why Activity Schema Works

Flexibility. Adding a new activity type does not require schema changes. You insert rows with a new activity value. No new tables, no migrations.

Customer journey analysis. Because all activities are in one table with a consistent structure, answering "what did this customer do?" is a single query.

-- Full customer journey
SELECT activity, activity_ts, feature_1, feature_2
FROM activity_stream
WHERE entity_id = 'U-1001'
ORDER BY activity_ts;

Funnel and cohort analysis. Activity schema makes it natural to build funnels (signed_up -> activated -> purchased) and cohorts (users who signed up in January).

The Downsides

Generic columns are confusing. feature_1 means "signup channel" for signup events and "order ID" for order events. Without documentation or a lookup table, analysts do not know what each column means for each activity type.

Aggregation is harder. Summing revenue across order activities requires filtering by activity type, which is less intuitive than summing a column in a fact table.

Not suited for complex joins. Activity schema works for entity-centric analysis but struggles when you need to join across entities (orders with products with suppliers).

When to Use Activity Schema

Product analytics where user journey analysis is the primary use case
Early-stage companies that need a flexible model before they understand their analytical requirements
Event-driven applications where the natural data shape is a stream of activities

When to Skip Activity Schema

You need rich dimensional context (product attributes, store hierarchies)
Your primary queries are aggregations across entities, not journey analysis
Your analysts expect a traditional star schema and are not comfortable with generic columns

Choosing Your Approach

Scenario                                    Recommended Approach
---------------------------------------------------------------
Small team, few sources, general analytics  Dimensional (star schema)
Large enterprise, many sources, audit needs Data Vault + dimensional marts
Single-domain analytical table, BI focus    One Big Table
Product analytics, user journey focus       Activity Schema
First data model, uncertain requirements    Dimensional (simplest correct choice)

The approaches are not mutually exclusive. A common enterprise pattern is:

Source Systems -> Data Vault (raw) -> Dimensional Models (marts) -> OBT (BI layer)

Each layer serves a different purpose. The vault provides auditability and integration. Dimensional models provide analytical structure. OBTs provide BI-friendly access.

Common Pitfalls

Choosing Data Vault because it sounds enterprise-grade. Data Vault solves integration and auditability problems. If you do not have those problems, you do not need Data Vault. A star schema built in dbt will serve most companies well.
Building an OBT as your only model. OBT works as a downstream artifact, not as a primary model. Without an upstream structured model, your OBT becomes impossible to maintain as requirements change.
Using Activity Schema without documenting feature columns. Generic columns are powerful but opaque. Maintain a reference that maps each activity type to its feature column meanings, or analysts will misinterpret the data.
Overcomplating your first data model. Your first model should be a simple star schema with a handful of fact tables and dimensions. You can always refactor later. You cannot get back the time spent building infrastructure you did not need.
Ignoring your team's skills. The best data model is one your team can build, maintain, and evolve. A theoretically superior approach that nobody understands is worse than a simple approach that everyone can work with.
Mixing paradigms without clear boundaries. If you use Data Vault for the raw layer and dimensional for marts, make it explicit. Analysts should never need to query the vault directly. Engineers should never build pipelines against the marts.

Key Takeaways

Data Vault (hubs, links, satellites) excels at integrating many source systems with full auditability. It is built for large enterprises and regulated industries, not for small teams.
One Big Table denormalizes everything into a single wide table for maximum query simplicity. Use it as a downstream artifact for BI, not as your primary model.
Activity Schema models data as a stream of entity activities. It is strong for product analytics and user journey analysis but weak for complex dimensional queries.
Dimensional modeling (star schema) remains the best default for most analytics use cases. Start there unless you have specific requirements that demand a different approach.
These approaches can be layered: vault for raw integration, dimensional for analytical structure, OBT for BI access.
Do not overcomplicate your first data model. Ship a working star schema, learn from how people use it, and evolve from there.