Data Vault & Other Approaches
Dimensional modeling is the workhorse of analytical data warehousing, but it is not the only approach. Depending on your organization's size, the number of source systems you integrate, and your query patterns, other modeling approaches might serve you better. This guide covers Data Vault, One Big Table, Activity Schema, and when each makes sense.
Data Vault
Data Vault is a modeling methodology designed by Dan Linstedt for large enterprises that integrate data from many source systems. Where dimensional modeling optimizes for query simplicity, Data Vault optimizes for auditability, flexibility, and parallel loading.
The Three Building Blocks
A Data Vault model consists of three table types: hubs, links, and satellites.
Hubs represent core business entities. A hub contains only the business key (the natural identifier) and metadata about when and where the record was first loaded. No descriptive attributes.
CREATE TABLE hub_customer (
hub_customer_hash_key CHAR(32) PRIMARY KEY, -- MD5 of business key
customer_id VARCHAR(50), -- business key
load_date TIMESTAMP,
record_source VARCHAR(100)
);
Links represent relationships between hubs. An order connects a customer to a product, so there is a link table capturing that relationship.
CREATE TABLE link_order (
link_order_hash_key CHAR(32) PRIMARY KEY,
hub_customer_hash_key CHAR(32) REFERENCES hub_customer,
hub_product_hash_key CHAR(32) REFERENCES hub_product,
hub_store_hash_key CHAR(32) REFERENCES hub_store,
order_id VARCHAR(50),
load_date TIMESTAMP,
record_source VARCHAR(100)
);
Satellites store descriptive attributes and their history. Each hub or link can have multiple satellites, typically one per source system. Satellites are where all the descriptive data lives, along with timestamps that track when each version was valid.
CREATE TABLE sat_customer_crm (
hub_customer_hash_key CHAR(32),
load_date TIMESTAMP,
load_end_date TIMESTAMP,
record_source VARCHAR(100),
first_name VARCHAR(100),
last_name VARCHAR(100),
email VARCHAR(255),
phone VARCHAR(50),
PRIMARY KEY (hub_customer_hash_key, load_date)
);
CREATE TABLE sat_customer_billing (
hub_customer_hash_key CHAR(32),
load_date TIMESTAMP,
load_end_date TIMESTAMP,
record_source VARCHAR(100),
billing_address VARCHAR(500),
payment_method VARCHAR(50),
credit_limit DECIMAL(10,2),
PRIMARY KEY (hub_customer_hash_key, load_date)
);
How It Fits Together
sat_customer_crm sat_customer_billing
\ /
\ /
hub_customer ---- link_order ---- hub_product
| \
| \
sat_order sat_product_catalog
Why Data Vault Exists
Data Vault solves specific problems that dimensional models handle poorly:
Many source systems. When customer data comes from CRM, billing, support, and marketing — each with different schemas, update frequencies, and data quality — Data Vault keeps them cleanly separated in different satellites while unifying them through a single hub.
Parallel loading. Hubs, links, and satellites are independent. You can load the CRM satellite without waiting for the billing satellite. In large enterprises with dozens of data feeds arriving on different schedules, this matters.
Full auditability. Every record tracks when it was loaded and from which source. You can always trace a value back to its origin. This is important in regulated industries (finance, healthcare) where auditors need a clear lineage.
Schema evolution. Adding a new source system means adding a new satellite. Existing tables are untouched. In a dimensional model, adding a new source often requires altering existing tables.
The Downsides
Query complexity. A simple question like "show me customer names and their orders" requires joining a hub, a satellite, and a link. In a dimensional model, it is one join.
-- Data Vault: getting customer name and order details
SELECT
c.customer_id,
sc.first_name,
sc.last_name,
lo.order_id,
so.order_amount,
so.order_date
FROM hub_customer c
JOIN sat_customer_crm sc
ON c.hub_customer_hash_key = sc.hub_customer_hash_key
AND sc.load_end_date IS NULL -- current record
JOIN link_order lo
ON c.hub_customer_hash_key = lo.hub_customer_hash_key
JOIN sat_order so
ON lo.link_order_hash_key = so.link_order_hash_key
AND so.load_end_date IS NULL;
Compare that to the dimensional model equivalent:
-- Dimensional: same question
SELECT
c.customer_id,
c.first_name,
c.last_name,
f.order_id,
f.order_amount,
f.order_date
FROM fact_orders f
JOIN dim_customer c ON f.customer_key = c.customer_key
WHERE c.is_current = TRUE;
Not analyst-friendly. Business users and analysts should not query the Data Vault directly. You build "business vault" or "information mart" layers on top — essentially dimensional models derived from the vault. This means more layers and more work.
Overhead for small teams. If you have one or two data engineers and a handful of data sources, Data Vault adds complexity without proportional benefit. The methodology assumes a scale of integration that most companies do not have.
When to Use Data Vault
- You have 10+ source systems feeding into the warehouse
- Regulatory requirements demand full auditability and lineage
- Multiple teams need to load data independently and in parallel
- Source system schemas change frequently
- You have a team large enough to maintain the additional layers (raw vault, business vault, information marts)
When to Skip Data Vault
- You have fewer than 5 source systems
- Your data team is smaller than 5 people
- Analysts need direct access to the data
- Time-to-value matters more than architectural purity
One Big Table (OBT)
One Big Table is the opposite of normalized modeling. You denormalize everything into a single wide table. Every attribute, every metric, every dimension — all in one place.
-- One Big Table: everything about orders in a single table
CREATE TABLE obt_orders (
order_id BIGINT,
order_date DATE,
order_amount DECIMAL(10,2),
discount_amount DECIMAL(10,2),
customer_id VARCHAR(50),
customer_name VARCHAR(200),
customer_city VARCHAR(100),
customer_state VARCHAR(50),
customer_segment VARCHAR(50),
customer_signup DATE,
product_id VARCHAR(50),
product_name VARCHAR(200),
product_category VARCHAR(100),
product_brand VARCHAR(100),
store_id VARCHAR(50),
store_name VARCHAR(200),
store_region VARCHAR(50)
);
Why OBT Works
No joins at query time. Every query is a simple scan, filter, and aggregate on one table. This is fast on columnar warehouses and trivial for analysts to write.
-- OBT: revenue by customer segment and product category
SELECT
customer_segment,
product_category,
SUM(order_amount) AS revenue
FROM obt_orders
WHERE order_date >= '2025-01-01'
GROUP BY customer_segment, product_category;
Low barrier for analysts. No need to understand join paths, star schemas, or cardinality. One table, filter it, aggregate it, done.
Good for specific use cases. OBT works well for a focused analytical domain — "everything about orders" or "everything about web sessions." It is less about replacing dimensional modeling and more about creating purpose-built analytical tables.
The Downsides
Data redundancy. Customer name is repeated on every order row. Product category is repeated on every order row. For a table with hundreds of millions of rows, this increases storage and build time. Columnar compression mitigates this, but it is still more data than a normalized model.
Update complexity. If a customer changes their name, you need to update every row in the OBT that references that customer. In a dimensional model, you update one row in dim_customer.
Schema rigidity. Adding a new attribute means altering a massive table and backfilling. In a dimensional model, you alter a dimension table that might have thousands of rows instead of millions.
When to Use OBT
- Building a purpose-specific analytical table for a BI tool
- The downstream consumers are non-technical and need the simplest possible interface
- Query performance on joins is a bottleneck and you have confirmed that denormalization solves it
- The table represents a single, well-defined analytical domain
When to Skip OBT
- You need to support ad hoc queries across many domains
- Dimension attributes change frequently
- You need to track historical changes (SCD Type 2 in an OBT is painful)
Activity Schema
Activity Schema models data as a stream of activities (events) performed by entities. Instead of building domain-specific fact tables, you create a single activity stream with a consistent structure.
CREATE TABLE activity_stream (
activity_id BIGINT PRIMARY KEY,
entity_id VARCHAR(50), -- who (customer, user, account)
activity VARCHAR(100), -- what happened
activity_ts TIMESTAMP, -- when
feature_1 VARCHAR(500), -- context (varies by activity)
feature_2 VARCHAR(500),
feature_3 VARCHAR(500),
revenue_impact DECIMAL(10,2) -- optional numeric measure
);
Example data:
entity_id | activity | activity_ts | feature_1 | feature_2 | revenue_impact
U-1001 | signed_up | 2025-01-01 10:00:00 | organic_search | mobile | NULL
U-1001 | placed_order | 2025-01-05 14:30:00 | ORD-5001 | electronics | 149.99
U-1001 | contacted_support | 2025-01-06 09:15:00 | billing_issue | chat | NULL
U-1001 | placed_order | 2025-01-20 11:00:00 | ORD-5002 | clothing | 79.99
Why Activity Schema Works
Flexibility. Adding a new activity type does not require schema changes. You insert rows with a new activity value. No new tables, no migrations.
Customer journey analysis. Because all activities are in one table with a consistent structure, answering "what did this customer do?" is a single query.
-- Full customer journey
SELECT activity, activity_ts, feature_1, feature_2
FROM activity_stream
WHERE entity_id = 'U-1001'
ORDER BY activity_ts;
Funnel and cohort analysis. Activity schema makes it natural to build funnels (signed_up -> activated -> purchased) and cohorts (users who signed up in January).
The Downsides
Generic columns are confusing. feature_1 means "signup channel" for signup events and "order ID" for order events. Without documentation or a lookup table, analysts do not know what each column means for each activity type.
Aggregation is harder. Summing revenue across order activities requires filtering by activity type, which is less intuitive than summing a column in a fact table.
Not suited for complex joins. Activity schema works for entity-centric analysis but struggles when you need to join across entities (orders with products with suppliers).
When to Use Activity Schema
- Product analytics where user journey analysis is the primary use case
- Early-stage companies that need a flexible model before they understand their analytical requirements
- Event-driven applications where the natural data shape is a stream of activities
When to Skip Activity Schema
- You need rich dimensional context (product attributes, store hierarchies)
- Your primary queries are aggregations across entities, not journey analysis
- Your analysts expect a traditional star schema and are not comfortable with generic columns
Choosing Your Approach
Scenario Recommended Approach
---------------------------------------------------------------
Small team, few sources, general analytics Dimensional (star schema)
Large enterprise, many sources, audit needs Data Vault + dimensional marts
Single-domain analytical table, BI focus One Big Table
Product analytics, user journey focus Activity Schema
First data model, uncertain requirements Dimensional (simplest correct choice)
The approaches are not mutually exclusive. A common enterprise pattern is:
Source Systems -> Data Vault (raw) -> Dimensional Models (marts) -> OBT (BI layer)
Each layer serves a different purpose. The vault provides auditability and integration. Dimensional models provide analytical structure. OBTs provide BI-friendly access.
Common Pitfalls
- Choosing Data Vault because it sounds enterprise-grade. Data Vault solves integration and auditability problems. If you do not have those problems, you do not need Data Vault. A star schema built in dbt will serve most companies well.
- Building an OBT as your only model. OBT works as a downstream artifact, not as a primary model. Without an upstream structured model, your OBT becomes impossible to maintain as requirements change.
- Using Activity Schema without documenting feature columns. Generic columns are powerful but opaque. Maintain a reference that maps each activity type to its feature column meanings, or analysts will misinterpret the data.
- Overcomplating your first data model. Your first model should be a simple star schema with a handful of fact tables and dimensions. You can always refactor later. You cannot get back the time spent building infrastructure you did not need.
- Ignoring your team's skills. The best data model is one your team can build, maintain, and evolve. A theoretically superior approach that nobody understands is worse than a simple approach that everyone can work with.
- Mixing paradigms without clear boundaries. If you use Data Vault for the raw layer and dimensional for marts, make it explicit. Analysts should never need to query the vault directly. Engineers should never build pipelines against the marts.
Key Takeaways
- Data Vault (hubs, links, satellites) excels at integrating many source systems with full auditability. It is built for large enterprises and regulated industries, not for small teams.
- One Big Table denormalizes everything into a single wide table for maximum query simplicity. Use it as a downstream artifact for BI, not as your primary model.
- Activity Schema models data as a stream of entity activities. It is strong for product analytics and user journey analysis but weak for complex dimensional queries.
- Dimensional modeling (star schema) remains the best default for most analytics use cases. Start there unless you have specific requirements that demand a different approach.
- These approaches can be layered: vault for raw integration, dimensional for analytical structure, OBT for BI access.
- Do not overcomplicate your first data model. Ship a working star schema, learn from how people use it, and evolve from there.