Scaling Data Systems
Every data system starts small. A single PostgreSQL instance, a few dbt models, an Airflow scheduler running on one machine. Then the data grows. Then the queries multiply. Then the business needs real-time instead of daily batches. Scaling a data system is about recognizing which bottleneck you are hitting and choosing the right approach: bigger machines, more machines, smarter queries, or a different architecture entirely.
Vertical Scaling: Bigger Machines
Vertical scaling means giving your existing system more resources: more CPU, more memory, more disk I/O. In cloud data warehouses, this usually means upgrading to a larger warehouse size.
When Vertical Scaling Works
Snowflake warehouse sizes:
X-Small: 1 server (~$2/hr) -> Good for dev, light queries
Small: 2 servers (~$4/hr) -> Small team, moderate data
Medium: 4 servers (~$8/hr) -> Growing team, GB-scale tables
Large: 8 servers (~$16/hr) -> Heavy workloads, TB-scale tables
X-Large: 16 servers (~$32/hr) -> Large datasets, complex joins
Vertical scaling is the simplest approach. No code changes, no architecture changes. Just slide the slider up. It works until you hit the ceiling — there is a maximum machine size, and the cost doubles with each step.
The Limits of Vertical Scaling
Problem size X-Small Medium X-Large
10 GB scan 30 sec 8 sec 2 sec
100 GB scan 5 min 75 sec 20 sec
1 TB scan 50 min 12 min 3 min
10 TB scan 8 hrs 2 hrs 30 min
Cost per hour: $2 $8 $32
Doubling the warehouse size roughly halves the query time but doubles the cost per hour. For a query that runs once a day, the cost increase is small. For a query that runs thousands of times, it adds up fast.
Vertical scaling hits a wall when:
- Your data exceeds what the largest available machine can handle
- The cost of the largest machine exceeds your budget
- Your bottleneck is not compute power but I/O, network, or concurrency
Horizontal Scaling: More Nodes
Horizontal scaling adds more machines to distribute the workload. This is how modern cloud warehouses (Snowflake, BigQuery, Redshift) handle large datasets — they spread data across many nodes and process in parallel.
Multi-Cluster Warehouses
Snowflake's multi-cluster feature automatically scales the number of clusters based on query concurrency.
Single cluster:
10 concurrent queries -> queue them, run sequentially
Multi-cluster (auto-scale, max 5):
10 concurrent queries -> spin up clusters, run in parallel
50 concurrent queries -> spin up more clusters
Back to 2 queries -> scale down to 1 cluster
# Terraform: auto-scaling Snowflake warehouse
resource "snowflake_warehouse" "analytics" {
name = "ANALYTICS_WH"
warehouse_size = "medium"
min_cluster_count = 1
max_cluster_count = 5
scaling_policy = "standard" # Scale up aggressively
auto_suspend = 300
auto_resume = true
}
Horizontal Scaling for Processing
For data processing frameworks like Spark, horizontal scaling means adding more executor nodes.
# Spark: configure for horizontal scaling
spark = SparkSession.builder \
.config("spark.executor.instances", "20") \
.config("spark.executor.cores", "4") \
.config("spark.executor.memory", "8g") \
.config("spark.dynamicAllocation.enabled", "true") \
.config("spark.dynamicAllocation.minExecutors", "5") \
.config("spark.dynamicAllocation.maxExecutors", "50") \
.getOrCreate()
Dynamic allocation lets the cluster grow and shrink based on the workload. A small transformation uses 5 executors. A large backfill spins up to 50.
Auto-Scaling for Bursty Workloads
Most data workloads are bursty. Overnight batch jobs spike at 2 AM. Dashboard refreshes spike at 9 AM when people start work. Month-end reporting spikes on the first business day of the month. Fixed-size infrastructure either wastes money during quiet periods or cannot handle peaks.
Patterns of Burstiness
Time Queries/min Ideal Size
00:00-02:00 5 X-Small
02:00-04:00 200 X-Large (batch jobs)
04:00-09:00 10 Small
09:00-11:00 150 Large (analysts start work)
11:00-17:00 80 Medium
17:00-00:00 20 Small
Auto-Scaling Strategies
Time-based scaling: Scale up before known peaks, scale down after.
# Scale up the warehouse before the batch window
0 1 * * * snowsql -q "ALTER WAREHOUSE TRANSFORM_WH SET WAREHOUSE_SIZE = 'X-LARGE'"
# Scale down after the batch window
0 5 * * * snowsql -q "ALTER WAREHOUSE TRANSFORM_WH SET WAREHOUSE_SIZE = 'SMALL'"
Load-based scaling: Let the cloud provider scale based on actual demand. This is what Snowflake multi-cluster and BigQuery slots provide.
Workload isolation: Use separate warehouses (or clusters) for different workloads so they do not compete.
BATCH_WH: X-Large, runs 02:00-05:00, auto-suspend after
ANALYST_WH: Medium, auto-scale 1-3 clusters, business hours
DASHBOARD_WH: Small, always on (dashboards need low latency)
DEV_WH: X-Small, auto-suspend after 1 minute
When to Optimize Queries vs Throw Hardware at It
This is the most important cost decision in data engineering. Engineer time is expensive. Compute time is (relatively) cheap. The right choice depends on the numbers.
The Decision Framework
Scenario 1: A query costs $5/run, runs 10x/day
Monthly cost: $1,500
Option A: Spend 40 hours optimizing -> save 80% -> $300/month
Payback: 40 hours * $100/hr = $4,000 / $1,200 saved = 3.3 months
-> Worth it if the query runs long-term
Option B: Upgrade warehouse one size -> 2x cost but 2x faster
New cost: $3,000/month (same number of runs, bigger warehouse)
-> Makes it worse! (unless the problem is concurrency, not speed)
Scenario 2: A query costs $0.50/run, runs 2x/day
Monthly cost: $30
Option A: Spend 20 hours optimizing
Cost of optimization: $2,000
Even if you save 90%, payback is 74 months
-> Not worth it
Option B: Do nothing
-> Correct choice
Quick Query Optimizations
Before throwing hardware at a problem, check these low-effort optimizations:
-- 1. Add partition filter
-- Before: full table scan
SELECT * FROM events WHERE user_id = 12345;
-- After: partition pruning
SELECT * FROM events WHERE event_date = '2025-03-15' AND user_id = 12345;
-- 2. Select only needed columns
-- Before: reads all 50 columns
SELECT * FROM orders WHERE status = 'pending';
-- After: reads 3 columns
SELECT order_id, customer_id, amount FROM orders WHERE status = 'pending';
-- 3. Filter before joining
-- Before: join, then filter (processes all rows)
SELECT o.*, c.name
FROM orders o JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date = '2025-03-15';
-- After: filter, then join (processes fewer rows)
WITH recent_orders AS (
SELECT * FROM orders WHERE order_date = '2025-03-15'
)
SELECT o.*, c.name
FROM recent_orders o JOIN customers c ON o.customer_id = c.customer_id;
-- 4. Use approximate functions for large aggregations
-- Before: exact count (scans everything)
SELECT COUNT(DISTINCT user_id) FROM events;
-- After: approximate count (much faster, <2% error)
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;
When to Throw Hardware at It
- The query is already well-optimized (partition pruning, selective columns, good joins)
- The data is just big and there is no way to reduce the scan
- The cost of more compute is less than the cost of engineering time to optimize
- You need results faster and latency is the bottleneck, not cost
The Cost Curve
At small scale, more compute is cheaper than optimization. At large scale, optimization saves more than more compute.
Monthly data processing cost:
Scale Brute Force Optimized Savings
---------- ----------- --------- -------
1 TB/month $500 $400 $100 (not worth optimizing)
10 TB/month $5,000 $2,000 $3,000 (worth some effort)
100 TB/month $50,000 $10,000 $40,000 (must optimize)
1 PB/month $500,000 $50,000 $450,000 (survival depends on it)
The inflection point varies by organization, but the pattern is universal: at some point, throwing more hardware at the problem becomes prohibitively expensive and optimization becomes essential.
Optimization Priority Order
When you need to scale, optimize in this order (cheapest to most expensive):
1. Query optimization (free)
- Add partition filters, reduce columns, fix joins
2. Data organization (low cost)
- Partition tables, add clustering, compress files
3. Caching and materialization (medium cost)
- Cache frequent queries, materialize expensive views
4. Architecture changes (high cost)
- Incremental processing instead of full rebuilds
- Move hot data to faster storage
5. Bigger/more hardware (ongoing cost)
- Larger warehouses, more nodes
- Last resort after exhausting optimization
Incremental Processing
The most impactful scaling technique: instead of reprocessing all data every run, process only what has changed.
-- Full rebuild: processes all historical data every day
CREATE OR REPLACE TABLE analytics.daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM staging.orders
GROUP BY order_date;
-- Incremental: processes only new/changed data
-- dbt incremental model
{{ config(materialized='incremental', unique_key='order_date') }}
SELECT
order_date,
SUM(amount) AS revenue
FROM staging.orders
{% if is_incremental() %}
WHERE order_date >= (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
GROUP BY order_date
A full rebuild of a 5 TB table takes an hour and costs 0.50. Over a month, that is 15.
Common Pitfalls
- Scaling hardware before optimizing queries. A poorly written query on a bigger machine is still a poorly written query. Optimize first, then scale.
- Not isolating workloads. Batch jobs and interactive queries compete for the same resources. Use separate warehouses or clusters for different workload types.
- Fixed-size infrastructure for variable workloads. If your peak is 10x your baseline, auto-scaling saves 80% compared to provisioning for peak.
- Full rebuilds when incremental is possible. Reprocessing 5 years of data every day because "it is simpler" works until the table reaches 10 TB. Build incremental from the start for large tables.
- Ignoring the cost curve. What works at 1 TB does not work at 100 TB. Revisit your scaling strategy as data grows. The right approach changes with scale.
- Premature optimization. Spending two weeks building an elaborate caching layer for a table that costs $10/month to query is wasted effort. Measure costs before optimizing.
- Not monitoring resource utilization. You cannot scale efficiently if you do not know what is bottlenecked. Monitor CPU, memory, I/O, and queue depth.
Key Takeaways
- Vertical scaling (bigger machines) is simple but has a ceiling. Horizontal scaling (more machines) is how cloud warehouses handle real growth.
- Auto-scaling matches infrastructure to bursty workloads. Use time-based scaling for predictable peaks and load-based scaling for variable demand.
- The decision to optimize queries or add hardware depends on the math: compare engineering time cost against compute cost savings over the expected lifetime.
- At small scale, more hardware is cheaper than optimization. At large scale, optimization is essential for survival.
- Incremental processing is the most impactful scaling technique. Process only new data instead of rebuilding everything.
- Optimize in order: query tuning, data organization, caching, architecture changes, then hardware. Each step has a different cost and impact profile.