Warehouse Architecture

A data warehouse is not just a big database. It is a fundamentally different system designed for a fundamentally different workload. Understanding how warehouses work under the hood explains why they are fast at analytics and why they make different trade-offs than transactional databases.

Columnar Storage

This is the single most important architectural difference between a warehouse and a transactional database.

Row-Oriented Storage (Transactional Databases)

Traditional databases like PostgreSQL and MySQL store data row by row. Each row is written contiguously on disk.

Row storage on disk:
[user_1, "Alice", "alice@example.com", "2024-01-15", "premium"]
[user_2, "Bob", "bob@example.com", "2024-02-20", "free"]
[user_3, "Carol", "carol@example.com", "2024-03-10", "premium"]

This is great for transactional workloads. When you run SELECT * FROM users WHERE user_id = 2, the database reads one contiguous block and returns the whole row. Inserts, updates, and deletes are fast because you modify one row in one place.

Column-Oriented Storage (Warehouses)

Warehouses store data column by column. All values for a single column are stored together.

Column storage on disk:
user_id column:    [user_1, user_2, user_3, ...]
name column:       ["Alice", "Bob", "Carol", ...]
email column:      ["alice@...", "bob@...", "carol@...", ...]
created_at column: ["2024-01-15", "2024-02-20", "2024-03-10", ...]
plan column:       ["premium", "free", "premium", ...]

Why Columnar is Fast for Analytics

Consider this query on a table with 100 columns and 1 billion rows:

SELECT plan, COUNT(*) AS user_count
FROM users
GROUP BY plan;

In a row store, the database reads all 100 columns for every row, even though it only needs one column. That is 99% wasted I/O.

In a column store, the database reads only the plan column. It skips the other 99 columns entirely. On a table with 1 billion rows, this can be the difference between reading 100 GB and reading 1 GB.

Compression Benefits

Columnar storage also compresses dramatically better. A column of plan types might contain only three distinct values: "free", "premium", "enterprise". Run-length encoding or dictionary encoding compresses this to a fraction of its raw size. In a row store, these values are interleaved with other columns, making compression far less effective.

Plan column (raw):      ["premium", "free", "premium", "premium", "free", "enterprise", ...]
Plan column (encoded):  {0: "free", 1: "premium", 2: "enterprise"} -> [1, 0, 1, 1, 0, 2, ...]

Separation of Storage & Compute

Traditional data warehouses (old-school Teradata, Oracle Exadata) coupled storage and compute. The same machines stored the data and ran the queries. If you needed more compute, you had to buy more machines, which also came with more storage you might not need, and vice versa.

Modern cloud warehouses decouple them.

How It Works

Traditional (coupled):
[Machine 1: CPU + Disk] [Machine 2: CPU + Disk] [Machine 3: CPU + Disk]

Modern (decoupled):
Storage Layer: S3 / GCS / Azure Blob (cheap, nearly infinite)
    |
Compute Layer: Ephemeral clusters spun up on demand

Snowflake stores data in its internal storage layer (backed by cloud object storage). You spin up "virtual warehouses" (compute clusters) of any size. Multiple warehouses can query the same data simultaneously. You pay for compute only when it runs.

BigQuery takes this further with serverless compute. There are no clusters to manage. You submit a query, Google provisions compute automatically, and you pay per byte scanned.

Redshift Serverless follows a similar model, allocating compute dynamically based on workload.

Why This Matters

Scale compute independently. Need more processing power for month-end reports? Spin up a larger cluster for two hours, then shut it down.
Scale storage independently. Storing 5 years of historical data is cheap. You do not need proportionally more compute to hold it.
Concurrency without contention. Multiple teams can query the same data using separate compute resources. The marketing team's heavy dashboard refresh does not slow down the data science team's model training.
Cost control. Suspend compute when no one is querying. Storage costs pennies per gigabyte per month.

Massively Parallel Processing (MPP)

Warehouses do not process queries on a single machine. They distribute the work across many nodes that operate in parallel.

How a Query Runs in an MPP Warehouse

SELECT
    region,
    SUM(revenue) AS total_revenue
FROM sales
WHERE sale_date >= '2025-01-01'
GROUP BY region;

The warehouse executes this in stages:

Step 1: Parse and optimize the query plan
Step 2: Distribute the work across N compute nodes
  - Node 1 scans partition A, filters, computes partial SUM
  - Node 2 scans partition B, filters, computes partial SUM
  - Node 3 scans partition C, filters, computes partial SUM
  - ...
Step 3: Shuffle intermediate results by region (redistribute data)
Step 4: Each node computes the final SUM for its assigned regions
Step 5: Coordinator node collects results and returns to the client

The key insight: adding more nodes makes the query faster because each node does less work. This is horizontal scaling, and it is why warehouses can scan billions of rows in seconds.

Data Distribution

For MPP to work, data must be distributed across nodes. Warehouses use different strategies:

Hash distribution:  Rows assigned to nodes based on hash of a key column
                    Good for JOIN-heavy workloads (co-locate matching keys)

Round-robin:        Rows distributed evenly regardless of content
                    Good for scan-heavy workloads (balanced I/O)

Broadcast:          Small tables copied to every node
                    Good for small dimension tables joined with large fact tables

Most modern warehouses handle distribution automatically. Snowflake uses micro-partitions and does not require you to choose a distribution strategy. BigQuery manages distribution internally. Redshift lets you specify distribution keys but has sensible defaults.

Warehouse vs Transactional Database

The trade-offs are fundamental, not incidental.

                    Transactional DB          Warehouse
                    (PostgreSQL, MySQL)       (Snowflake, BigQuery)
----------------------------------------------------------------
Storage layout      Row-oriented              Column-oriented
Optimized for       Single-row reads/writes   Full-table scans
Query pattern       Point lookups, OLTP       Aggregations, analytics
Concurrency model   Many small transactions   Few large queries
Index strategy      B-tree indexes            Partition pruning, zone maps
Update model        In-place updates          Append-only, batch updates
Latency target      Milliseconds              Seconds to minutes
Scaling model       Vertical (bigger machine) Horizontal (more nodes)

A Practical Example

Consider a users table with 50 million rows.

In PostgreSQL (transactional):

-- Fast: point lookup using primary key index
SELECT * FROM users WHERE user_id = 12345;
-- Result in ~1ms

-- Slow: full table scan for analytics
SELECT country, COUNT(*) FROM users GROUP BY country;
-- Result in ~30 seconds (reads all columns for every row)

In Snowflake (warehouse):

-- Slow-ish: point lookup (not what it is designed for)
SELECT * FROM users WHERE user_id = 12345;
-- Result in ~500ms (overhead of distributed query planning)

-- Fast: columnar scan for analytics
SELECT country, COUNT(*) FROM users GROUP BY country;
-- Result in ~2 seconds (reads only the country column, parallel scan)

Neither is "better." They solve different problems.

Query Execution Internals

When you submit a query to a warehouse, several things happen:

1. Query Parsing & Optimization

The query planner analyzes the SQL, rewrites it for efficiency, and generates an execution plan. This includes:

Predicate pushdown: push WHERE filters as close to the storage layer as possible
Projection pruning: read only the columns referenced in the query
JOIN reordering: determine the most efficient order to join tables

2. Metadata Pruning

Before reading any data, the warehouse checks metadata to skip irrelevant files.

Partition metadata for sales table:
  Partition 1: sale_date range [2024-01-01, 2024-03-31], file size 2 GB
  Partition 2: sale_date range [2024-04-01, 2024-06-30], file size 1.8 GB
  Partition 3: sale_date range [2024-07-01, 2024-09-30], file size 2.1 GB
  ...

Query: WHERE sale_date >= '2025-01-01'
Result: Skip partitions 1-4 entirely, only read partitions covering 2025+

Snowflake calls this "micro-partition pruning." BigQuery uses clustered column metadata. The effect is the same: less data read means faster queries and lower cost.

3. Distributed Execution

The query plan is distributed to worker nodes. Each node:

Reads its assigned data from storage (column by column)
Applies filters
Computes partial aggregations
Sends intermediate results to other nodes for shuffling if needed

4. Result Assembly

The coordinator node merges results from all worker nodes and returns the final result set to the client.

Common Pitfalls

Treating a warehouse like a transactional database. Running millions of single-row lookups against a warehouse is slow and expensive. Use a transactional database or cache for that workload.

Selecting all columns when you need two. SELECT * forces the warehouse to read every column. On wide tables, this can be 10-50x more data than necessary. Always specify only the columns you need.

Ignoring query costs in serverless warehouses. BigQuery charges per byte scanned. A poorly written query on a large table can cost more than a well-written one by orders of magnitude. Partition pruning and column selection directly affect your bill.

Running too many small queries instead of batching. Each query has fixed overhead: parsing, planning, cluster spin-up. Running 1,000 small queries is slower and more expensive than running 1 query that does the equivalent work.

Not understanding your warehouse's caching. Most warehouses cache query results. Running the same query twice is nearly free. But minor changes (different filter value, added column) bypass the cache entirely.

Key Takeaways

Columnar storage is what makes warehouses fast for analytics: they read only the columns you query, not entire rows
Separation of storage and compute lets you scale each independently, paying only for what you use
MPP distributes query execution across many nodes in parallel, enabling sub-second scans of billions of rows
Warehouses and transactional databases solve different problems; using one for the other's workload produces poor results
Query performance depends on how much data the warehouse reads: prune with partitions, select only needed columns, and leverage metadata to skip irrelevant files