Scaling Data Systems

Every data system starts small. A single PostgreSQL instance, a few dbt models, an Airflow scheduler running on one machine. Then the data grows. Then the queries multiply. Then the business needs real-time instead of daily batches. Scaling a data system is about recognizing which bottleneck you are hitting and choosing the right approach: bigger machines, more machines, smarter queries, or a different architecture entirely.

Vertical Scaling: Bigger Machines

Vertical scaling means giving your existing system more resources: more CPU, more memory, more disk I/O. In cloud data warehouses, this usually means upgrading to a larger warehouse size.

When Vertical Scaling Works

Snowflake warehouse sizes:
  X-Small:  1 server    (~$2/hr)   -> Good for dev, light queries
  Small:    2 servers   (~$4/hr)   -> Small team, moderate data
  Medium:   4 servers   (~$8/hr)   -> Growing team, GB-scale tables
  Large:    8 servers   (~$16/hr)  -> Heavy workloads, TB-scale tables
  X-Large:  16 servers  (~$32/hr)  -> Large datasets, complex joins

Vertical scaling is the simplest approach. No code changes, no architecture changes. Just slide the slider up. It works until you hit the ceiling — there is a maximum machine size, and the cost doubles with each step.

The Limits of Vertical Scaling

Problem size      X-Small    Medium    X-Large
10 GB scan        30 sec     8 sec     2 sec
100 GB scan       5 min      75 sec    20 sec
1 TB scan         50 min     12 min    3 min
10 TB scan        8 hrs      2 hrs     30 min

Cost per hour:    $2         $8        $32

Doubling the warehouse size roughly halves the query time but doubles the cost per hour. For a query that runs once a day, the cost increase is small. For a query that runs thousands of times, it adds up fast.

Vertical scaling hits a wall when:

Your data exceeds what the largest available machine can handle
The cost of the largest machine exceeds your budget
Your bottleneck is not compute power but I/O, network, or concurrency

Horizontal Scaling: More Nodes

Horizontal scaling adds more machines to distribute the workload. This is how modern cloud warehouses (Snowflake, BigQuery, Redshift) handle large datasets — they spread data across many nodes and process in parallel.

Multi-Cluster Warehouses

Snowflake's multi-cluster feature automatically scales the number of clusters based on query concurrency.

Single cluster:
  10 concurrent queries -> queue them, run sequentially
  
Multi-cluster (auto-scale, max 5):
  10 concurrent queries -> spin up clusters, run in parallel
  50 concurrent queries -> spin up more clusters
  Back to 2 queries     -> scale down to 1 cluster

# Terraform: auto-scaling Snowflake warehouse
resource "snowflake_warehouse" "analytics" {
  name                = "ANALYTICS_WH"
  warehouse_size      = "medium"
  min_cluster_count   = 1
  max_cluster_count   = 5
  scaling_policy      = "standard"  # Scale up aggressively
  auto_suspend        = 300
  auto_resume         = true
}

Horizontal Scaling for Processing

For data processing frameworks like Spark, horizontal scaling means adding more executor nodes.

# Spark: configure for horizontal scaling
spark = SparkSession.builder \
    .config("spark.executor.instances", "20") \
    .config("spark.executor.cores", "4") \
    .config("spark.executor.memory", "8g") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.minExecutors", "5") \
    .config("spark.dynamicAllocation.maxExecutors", "50") \
    .getOrCreate()

Dynamic allocation lets the cluster grow and shrink based on the workload. A small transformation uses 5 executors. A large backfill spins up to 50.

Auto-Scaling for Bursty Workloads

Most data workloads are bursty. Overnight batch jobs spike at 2 AM. Dashboard refreshes spike at 9 AM when people start work. Month-end reporting spikes on the first business day of the month. Fixed-size infrastructure either wastes money during quiet periods or cannot handle peaks.

Patterns of Burstiness

Time           Queries/min    Ideal Size
00:00-02:00    5              X-Small
02:00-04:00    200            X-Large (batch jobs)
04:00-09:00    10             Small
09:00-11:00    150            Large (analysts start work)
11:00-17:00    80             Medium
17:00-00:00    20             Small

Auto-Scaling Strategies

Time-based scaling: Scale up before known peaks, scale down after.

# Scale up the warehouse before the batch window
0 1 * * * snowsql -q "ALTER WAREHOUSE TRANSFORM_WH SET WAREHOUSE_SIZE = 'X-LARGE'"
# Scale down after the batch window
0 5 * * * snowsql -q "ALTER WAREHOUSE TRANSFORM_WH SET WAREHOUSE_SIZE = 'SMALL'"

Load-based scaling: Let the cloud provider scale based on actual demand. This is what Snowflake multi-cluster and BigQuery slots provide.

Workload isolation: Use separate warehouses (or clusters) for different workloads so they do not compete.

BATCH_WH:      X-Large, runs 02:00-05:00, auto-suspend after
ANALYST_WH:    Medium, auto-scale 1-3 clusters, business hours
DASHBOARD_WH:  Small, always on (dashboards need low latency)
DEV_WH:        X-Small, auto-suspend after 1 minute

When to Optimize Queries vs Throw Hardware at It

This is the most important cost decision in data engineering. Engineer time is expensive. Compute time is (relatively) cheap. The right choice depends on the numbers.

The Decision Framework

Scenario 1: A query costs $5/run, runs 10x/day
  Monthly cost: $1,500
  
  Option A: Spend 40 hours optimizing -> save 80% -> $300/month
    Payback: 40 hours * $100/hr = $4,000 / $1,200 saved = 3.3 months
    -> Worth it if the query runs long-term
    
  Option B: Upgrade warehouse one size -> 2x cost but 2x faster
    New cost: $3,000/month (same number of runs, bigger warehouse)
    -> Makes it worse! (unless the problem is concurrency, not speed)
    
Scenario 2: A query costs $0.50/run, runs 2x/day
  Monthly cost: $30
  
  Option A: Spend 20 hours optimizing
    Cost of optimization: $2,000
    Even if you save 90%, payback is 74 months
    -> Not worth it
    
  Option B: Do nothing
    -> Correct choice

Quick Query Optimizations

Before throwing hardware at a problem, check these low-effort optimizations:

-- 1. Add partition filter
-- Before: full table scan
SELECT * FROM events WHERE user_id = 12345;

-- After: partition pruning
SELECT * FROM events WHERE event_date = '2025-03-15' AND user_id = 12345;

-- 2. Select only needed columns
-- Before: reads all 50 columns
SELECT * FROM orders WHERE status = 'pending';

-- After: reads 3 columns
SELECT order_id, customer_id, amount FROM orders WHERE status = 'pending';

-- 3. Filter before joining
-- Before: join, then filter (processes all rows)
SELECT o.*, c.name
FROM orders o JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date = '2025-03-15';

-- After: filter, then join (processes fewer rows)
WITH recent_orders AS (
    SELECT * FROM orders WHERE order_date = '2025-03-15'
)
SELECT o.*, c.name
FROM recent_orders o JOIN customers c ON o.customer_id = c.customer_id;

-- 4. Use approximate functions for large aggregations
-- Before: exact count (scans everything)
SELECT COUNT(DISTINCT user_id) FROM events;

-- After: approximate count (much faster, <2% error)
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;

When to Throw Hardware at It

The query is already well-optimized (partition pruning, selective columns, good joins)
The data is just big and there is no way to reduce the scan
The cost of more compute is less than the cost of engineering time to optimize
You need results faster and latency is the bottleneck, not cost

The Cost Curve

At small scale, more compute is cheaper than optimization. At large scale, optimization saves more than more compute.

Monthly data processing cost:

Scale          Brute Force    Optimized    Savings
----------     -----------    ---------    -------
1 TB/month     $500           $400         $100 (not worth optimizing)
10 TB/month    $5,000         $2,000       $3,000 (worth some effort)
100 TB/month   $50,000        $10,000      $40,000 (must optimize)
1 PB/month     $500,000       $50,000      $450,000 (survival depends on it)

The inflection point varies by organization, but the pattern is universal: at some point, throwing more hardware at the problem becomes prohibitively expensive and optimization becomes essential.

Optimization Priority Order

When you need to scale, optimize in this order (cheapest to most expensive):

1. Query optimization (free)
   - Add partition filters, reduce columns, fix joins
   
2. Data organization (low cost)
   - Partition tables, add clustering, compress files
   
3. Caching and materialization (medium cost)
   - Cache frequent queries, materialize expensive views
   
4. Architecture changes (high cost)
   - Incremental processing instead of full rebuilds
   - Move hot data to faster storage
   
5. Bigger/more hardware (ongoing cost)
   - Larger warehouses, more nodes
   - Last resort after exhausting optimization

Incremental Processing

The most impactful scaling technique: instead of reprocessing all data every run, process only what has changed.

-- Full rebuild: processes all historical data every day
CREATE OR REPLACE TABLE analytics.daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM staging.orders
GROUP BY order_date;

-- Incremental: processes only new/changed data
-- dbt incremental model
{{ config(materialized='incremental', unique_key='order_date') }}

SELECT
    order_date,
    SUM(amount) AS revenue
FROM staging.orders

{% if is_incremental() %}
WHERE order_date >= (SELECT MAX(order_date) FROM {{ this }})
{% endif %}

GROUP BY order_date

A full rebuild of a 5 TB table takes an hour and costs $50. An incremental run processing 50 GB of new data takes 3 minutes and costs$ 0.50. Over a month, that is $1,500 vs$ 15.

Common Pitfalls

Scaling hardware before optimizing queries. A poorly written query on a bigger machine is still a poorly written query. Optimize first, then scale.
Not isolating workloads. Batch jobs and interactive queries compete for the same resources. Use separate warehouses or clusters for different workload types.
Fixed-size infrastructure for variable workloads. If your peak is 10x your baseline, auto-scaling saves 80% compared to provisioning for peak.
Full rebuilds when incremental is possible. Reprocessing 5 years of data every day because "it is simpler" works until the table reaches 10 TB. Build incremental from the start for large tables.
Ignoring the cost curve. What works at 1 TB does not work at 100 TB. Revisit your scaling strategy as data grows. The right approach changes with scale.
Premature optimization. Spending two weeks building an elaborate caching layer for a table that costs $10/month to query is wasted effort. Measure costs before optimizing.
Not monitoring resource utilization. You cannot scale efficiently if you do not know what is bottlenecked. Monitor CPU, memory, I/O, and queue depth.

Key Takeaways

Vertical scaling (bigger machines) is simple but has a ceiling. Horizontal scaling (more machines) is how cloud warehouses handle real growth.
Auto-scaling matches infrastructure to bursty workloads. Use time-based scaling for predictable peaks and load-based scaling for variable demand.
The decision to optimize queries or add hardware depends on the math: compare engineering time cost against compute cost savings over the expected lifetime.
At small scale, more hardware is cheaper than optimization. At large scale, optimization is essential for survival.
Incremental processing is the most impactful scaling technique. Process only new data instead of rebuilding everything.
Optimize in order: query tuning, data organization, caching, architecture changes, then hardware. Each step has a different cost and impact profile.