Materialized Views & Aggregates

Some queries are expensive. They scan billions of rows, join five tables, and compute aggregations that take minutes. When the same expensive query runs dozens of times a day from dashboards and reports, you are burning compute for the same result. Materialized views and aggregate tables pre-compute those results so the expensive work happens once.

The Problem

Consider a revenue dashboard that runs this query every time a user opens it:

SELECT
    DATE_TRUNC('day', o.order_date) AS order_day,
    p.product_category,
    c.region,
    SUM(o.amount) AS total_revenue,
    COUNT(DISTINCT o.customer_id) AS unique_customers,
    COUNT(*) AS order_count
FROM orders o
JOIN products p ON o.product_id = p.product_id
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY 1, 2, 3;

This query joins three tables, scans 90 days of order data, and computes aggregations. If the orders table has 500 million rows, this takes 30 seconds and scans 50 GB. With 20 dashboard users refreshing throughout the day, that is 600 seconds of compute and 1 TB of data scanned daily for the same result.

Materialized Views

A materialized view is a query whose results are stored as a physical table. Instead of re-running the query every time, consumers read the pre-computed results.

Creating a Materialized View

-- Snowflake
CREATE MATERIALIZED VIEW analytics.mv_daily_revenue AS
SELECT
    DATE_TRUNC('day', o.order_date) AS order_day,
    p.product_category,
    c.region,
    SUM(o.amount) AS total_revenue,
    COUNT(DISTINCT o.customer_id) AS unique_customers,
    COUNT(*) AS order_count
FROM orders o
JOIN products p ON o.product_id = p.product_id
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY 1, 2, 3;

-- BigQuery
CREATE MATERIALIZED VIEW analytics.mv_daily_revenue AS
SELECT
    DATE_TRUNC(order_date, DAY) AS order_day,
    product_category,
    region,
    SUM(amount) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers,
    COUNT(*) AS order_count
FROM analytics.orders_enriched
GROUP BY 1, 2, 3;

How Refresh Works

Materialized views must be refreshed to reflect changes in the underlying data. Refresh strategies vary by platform:

Snowflake:
  - Automatic refresh when base tables change
  - Background process, near-real-time for most workloads
  - Cost: compute credits for the refresh process

BigQuery:
  - Automatic refresh (up to every 30 minutes)
  - Smart refresh: only re-processes changed partitions
  - Cost: bytes processed during refresh

Redshift:
  - Manual refresh: REFRESH MATERIALIZED VIEW mv_name
  - Auto refresh option available
  - Incremental refresh for certain query patterns

PostgreSQL:
  - Manual refresh only: REFRESH MATERIALIZED VIEW mv_name
  - CONCURRENTLY option allows reads during refresh
  - No automatic scheduling built in

Automatic Query Rewriting

Some warehouses are smart enough to use a materialized view even when the user queries the base tables directly.

-- User writes this query
SELECT product_category, SUM(amount)
FROM orders
JOIN products ON orders.product_id = products.product_id
WHERE order_date >= '2025-01-01'
GROUP BY product_category;

-- BigQuery/Snowflake may automatically rewrite it to read from
-- the materialized view instead, if it can satisfy the query

This is called automatic query rewriting or materialized view matching. It means downstream users get the performance benefit without changing their queries.

Limitations of Materialized Views

Not all queries can be materialized. Typical restrictions:

Common limitations across warehouses:
  - No subqueries in the view definition
  - Limited set of aggregate functions (varies by platform)
  - No window functions
  - No UNION or set operations
  - Must reference base tables, not other views (varies)
  - BigQuery: must include GROUP BY for aggregations
  - Snowflake: no joins in materialized views (as of recent versions)

When the warehouse's materialized view feature is too restrictive, build aggregate tables instead.

Aggregate Tables

An aggregate table is a regular table that stores pre-computed aggregations. You create and refresh it yourself, giving you full control over the query, refresh schedule, and storage.

Daily Rollups

The most common aggregate table: summarize granular data into daily totals.

-- Create the aggregate table
CREATE TABLE analytics.daily_revenue_summary AS
SELECT
    DATE_TRUNC('day', o.order_date) AS order_day,
    p.product_category,
    c.region,
    SUM(o.amount) AS total_revenue,
    COUNT(DISTINCT o.customer_id) AS unique_customers,
    COUNT(*) AS order_count,
    AVG(o.amount) AS avg_order_value,
    MIN(o.amount) AS min_order_value,
    MAX(o.amount) AS max_order_value
FROM orders o
JOIN products p ON o.product_id = p.product_id
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY 1, 2, 3;

Weekly & Monthly Rollups

Build higher-level aggregations on top of daily rollups, not on top of raw data. This creates a layered aggregation strategy.

-- Weekly rollup from daily rollup (fast, since daily is already aggregated)
CREATE TABLE analytics.weekly_revenue_summary AS
SELECT
    DATE_TRUNC('week', order_day) AS order_week,
    product_category,
    region,
    SUM(total_revenue) AS total_revenue,
    SUM(order_count) AS order_count,
    SUM(total_revenue) / NULLIF(SUM(order_count), 0) AS avg_order_value
FROM analytics.daily_revenue_summary
GROUP BY 1, 2, 3;

Note that COUNT(DISTINCT customer_id) cannot be rolled up from daily to weekly because the same customer might appear on multiple days. For distinct counts at higher granularities, you either re-query the raw data or use approximate distinct count techniques like HyperLogLog.

Refreshing Aggregate Tables

Use dbt or a scheduled query to rebuild aggregate tables. The simplest approach: full rebuild daily.

-- In dbt: models/marts/daily_revenue_summary.sql
-- config: materialized='table' (full rebuild each run)
SELECT
    DATE_TRUNC('day', o.order_date) AS order_day,
    p.product_category,
    c.region,
    SUM(o.amount) AS total_revenue,
    COUNT(DISTINCT o.customer_id) AS unique_customers,
    COUNT(*) AS order_count
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_products') }} p ON o.product_id = p.product_id
JOIN {{ ref('stg_customers') }} c ON o.customer_id = c.customer_id
GROUP BY 1, 2, 3

For large tables where full rebuild is too slow, use incremental refresh:

-- Incremental: only reprocess the last 3 days
{{
    config(
        materialized='incremental',
        unique_key=['order_day', 'product_category', 'region'],
        incremental_strategy='merge'
    )
}}

SELECT
    DATE_TRUNC('day', o.order_date) AS order_day,
    p.product_category,
    c.region,
    SUM(o.amount) AS total_revenue,
    COUNT(DISTINCT o.customer_id) AS unique_customers,
    COUNT(*) AS order_count
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_products') }} p ON o.product_id = p.product_id
JOIN {{ ref('stg_customers') }} c ON o.customer_id = c.customer_id
{% if is_incremental() %}
WHERE o.order_date >= CURRENT_DATE - INTERVAL '3 days'
{% endif %}
GROUP BY 1, 2, 3

When to Materialize

Not every query deserves a materialized view or aggregate table. Materialization adds complexity: more tables to maintain, more storage, more refresh jobs to monitor.

Materialize When

The query runs frequently. A dashboard query that executes 100+ times per day is a strong candidate.
The query is expensive. Joins across large tables, complex aggregations, or full table scans.
Results change slowly. Daily revenue totals do not change every second. Refreshing hourly or daily is fine.
Multiple consumers need the same data. If five dashboards all compute revenue by region, do it once in an aggregate table.

Do Not Materialize When

The query runs rarely. An ad-hoc analysis run once a quarter does not need pre-computation.
Results change constantly. Real-time data that is stale after minutes does not benefit from materialization with hourly refresh.
The base query is already fast. If the query scans 1 GB and returns in 2 seconds, the overhead of maintaining a materialized view is not worth it.
The aggregation dimensions change often. If analysts keep adding and removing GROUP BY columns, maintaining aggregate tables for every combination is impractical.

The Storage vs Compute Trade-Off

Materialization trades storage cost for compute savings.

Without materialization:
  - 100 dashboard queries/day x 50 GB scanned x $5/TB = $25/day in compute
  - $0 in extra storage

With a daily aggregate table:
  - 100 dashboard queries/day x 0.1 GB scanned x $5/TB = $0.05/day in compute
  - 1 refresh job/day x 50 GB scanned x $5/TB = $0.25/day in refresh cost
  - 0.1 GB storage x $0.02/GB/month = negligible storage cost
  - Total: $0.30/day vs $25/day

The math almost always favors materialization for frequently-run expensive queries.

Layered Aggregation Strategy

A well-designed warehouse has a clear hierarchy of aggregation levels:

Layer 1: Raw data (billions of rows)
    |
Layer 2: Cleaned and enriched facts (billions of rows, better schema)
    |
Layer 3: Daily aggregates (millions of rows)
    |
Layer 4: Weekly/monthly aggregates (thousands of rows)
    |
Layer 5: Dashboard-specific summary tables (hundreds of rows)

Each layer reads from the layer above it. Dashboards query layers 3-5, never layer 1. Ad-hoc analysis might query layers 1-2 directly.

Common Pitfalls

Materializing everything. More materialized views means more refresh jobs, more storage, and more things that can go wrong. Materialize the expensive, frequent queries. Leave the rest as regular views.

Stale materialized views without monitoring. A materialized view that has not refreshed in 3 days is showing wrong data. Monitor refresh jobs and alert on staleness.

Not accounting for distinct counts in rollups. You cannot SUM daily distinct user counts to get a weekly distinct user count. Users appear on multiple days. Either re-query raw data at the target granularity or use approximate algorithms.

Building aggregate tables without a refresh strategy. An aggregate table created once and never refreshed is a snapshot, not a summary. Define the refresh frequency and mechanism before creating it.

Ignoring automatic query rewriting. If your warehouse supports materialized view matching, your existing dashboards may benefit from materialized views without any query changes. Check if this feature is enabled.

Materializing queries with volatile filters. A materialized view for "last 7 days" must be refreshed as the window shifts. If the refresh is daily, the view is always 1 day stale at the trailing edge. Use fixed date ranges or accept the staleness.

Key Takeaways

Materialized views and aggregate tables pre-compute expensive queries so the heavy work happens once, not on every dashboard load
Materialized views are managed by the warehouse (automatic refresh, query rewriting); aggregate tables are managed by you (full control, more flexibility)
Materialize queries that are expensive, run frequently, and whose results change slowly
The storage cost of materialization is almost always less than the compute cost of re-running the query repeatedly
Build a layered aggregation strategy: raw data, daily aggregates, weekly/monthly rollups, dashboard summaries
Monitor refresh freshness as carefully as you monitor data quality; stale aggregates are silently wrong