Storage & Compute Costs
Storage is cheap. Compute is expensive. This asymmetry drives nearly every data infrastructure decision. The cost of storing a terabyte of data in S3 is about 5 per scan. Run that query 100 times a month and your compute costs are 20x your storage costs. Understanding this relationship is the foundation of data infrastructure cost management.
The Cost Asymmetry
Storage Costs
Cloud storage prices have fallen dramatically and continue to drop:
AWS S3 Standard: ~$0.023 / GB / month
AWS S3 Infrequent Access: ~$0.0125 / GB / month
AWS S3 Glacier: ~$0.004 / GB / month
1 TB in S3 Standard: ~$23 / month
1 TB in S3 Glacier: ~$4 / month
1 PB in S3 Standard: ~$23,000 / month
Storage is so cheap that the default strategy is: store everything, compress it, and worry about organizing it later. Deleting data to save on storage costs is almost never worth the risk of needing it later.
Compute Costs
Compute is where the bills add up:
Snowflake X-Small warehouse: ~$2 / hour
Snowflake Medium warehouse: ~$16 / hour
Snowflake X-Large warehouse: ~$128 / hour
BigQuery on-demand: $6.25 / TB scanned
BigQuery flat-rate: Varies by commitment
Databricks DBU: ~$0.07-$0.55 / DBU depending on tier
A single poorly written query that scans 10 TB costs 625 for one afternoon of work.
Compression Reduces Both
Compression is the rare optimization that reduces storage cost and compute cost simultaneously. Smaller data on disk means less to store and less to read when querying.
Columnar Formats & Compression
Parquet and ORC files compress dramatically because columnar storage groups similar values together, which compresses better than row-oriented data.
Format Typical Compression Ratio (vs CSV)
------ ------------------------------------
CSV 1x (baseline)
JSON 0.8-1.2x (often larger due to keys)
Parquet 5-10x compression
Parquet + Zstd 8-15x compression
ORC + Zlib 8-15x compression
A 100 GB CSV file might compress to 10 GB in Parquet with Snappy compression, or 7 GB with Zstandard. That is 90% less storage cost and 90% less data scanned per query.
# Writing compressed Parquet with PyArrow
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.Table.from_pandas(df)
pq.write_table(
table,
'orders.parquet',
compression='zstd', # Zstandard: good compression, fast decompression
row_group_size=1_000_000, # Optimize for predicate pushdown
)
Choosing a Compression Codec
Codec Compression Ratio Speed Use Case
----- ----------------- ----- --------
Snappy Moderate Very fast Default for most workloads
Zstd High Fast Best balance of ratio and speed
Gzip High Slow Archival, infrequent reads
LZ4 Low-moderate Fastest Real-time workloads
For data engineering, Zstandard (zstd) is generally the best choice. It compresses nearly as well as gzip but decompresses much faster.
Partitioning Reduces Compute
Partitioning organizes data into subdirectories by a column value, allowing the query engine to skip irrelevant partitions entirely. This is called partition pruning, and it is the single most impactful optimization for query cost.
How Partitioning Works
Without partitioning:
s3://data-lake/orders/
orders.parquet <- 500 GB, all dates mixed together
With partitioning by date:
s3://data-lake/orders/
order_date=2025-01-01/
part-001.parquet <- 2 GB
order_date=2025-01-02/
part-001.parquet <- 2 GB
...
order_date=2025-03-15/
part-001.parquet <- 2 GB
A query for WHERE order_date = '2025-03-15' reads 2 GB instead of 500 GB. That is a 99.6% reduction in data scanned and a proportional reduction in cost.
Partitioning in the Warehouse
-- BigQuery: Partition by date
CREATE TABLE analytics.orders
PARTITION BY order_date
CLUSTER BY customer_id
AS SELECT * FROM staging.orders;
-- Query: partition pruning reduces scanned data
SELECT customer_id, SUM(amount)
FROM analytics.orders
WHERE order_date BETWEEN '2025-03-01' AND '2025-03-31'
GROUP BY customer_id;
-- Only scans March data, not the entire table
Over-Partitioning
Too many partitions create overhead. A table partitioned by user_id with 10 million users creates 10 million tiny files. Query engines struggle with millions of small files (the "small files problem").
Good partitioning: date (365 partitions/year), region (10-50 partitions)
Bad partitioning: user_id (millions of partitions), transaction_id (one row each)
Rule of thumb: Each partition should contain at least 128 MB of data
The Cost of Full Table Scans
A full table scan reads every row in the table. In cloud warehouses, you pay for every byte scanned.
Table: user_events (5 TB)
Query: SELECT COUNT(*) FROM user_events WHERE user_id = 12345
Without optimization: Scans 5 TB -> $31.25 (BigQuery on-demand)
With partition on date: Scans 50 GB -> $0.31 (if you add a date filter)
With clustering: Scans 5 GB -> $0.03 (if clustered by user_id)
The difference between 0.03 per query is the difference between a 10/month bill for the same workload.
Common Causes of Full Scans
- Missing
WHEREclause on the partition column SELECT *when only a few columns are needed (in columnar storage, selecting fewer columns reads less data)- Joining large tables without filtering first
- Using functions on partition columns:
WHERE YEAR(order_date) = 2025prevents partition pruning, butWHERE order_date >= '2025-01-01'enables it
Cloud Pricing: On-Demand vs Reserved
On-Demand
Pay per query (BigQuery) or per hour of active warehouse (Snowflake, Redshift). Good for unpredictable workloads. Expensive at scale.
BigQuery on-demand: $6.25 / TB scanned, no commitment
Snowflake on-demand: $2-4 / credit, pay for active time only
Redshift Serverless: ~$0.375 / RPU-hour
Reserved / Committed
Pre-purchase capacity at a discount. Good for predictable, steady workloads.
BigQuery flat-rate: $2,000/month for 100 slots (autoscale editions)
Snowflake committed: Up to 30% discount with annual commitment
Redshift Reserved Nodes: Up to 75% discount with 1-3 year commitment
Hybrid Strategy
Use reserved capacity for your baseline workload and on-demand for spikes.
Baseline: 50 concurrent queries, steady → Reserved capacity
Peak: 200 concurrent queries during month-end reporting → On-demand burst
Development: Unpredictable, bursty → On-demand
FinOps for Data Teams
FinOps (financial operations) is the practice of managing cloud costs as a team discipline, not just an infrastructure concern.
Cost Visibility
The first step is knowing where money goes. Tag resources, track costs per team, per pipeline, and per query.
-- Snowflake: check which warehouses are costing the most
SELECT
warehouse_name,
SUM(credits_used) AS total_credits,
SUM(credits_used) * 3.00 AS estimated_cost_usd
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(month, -1, CURRENT_TIMESTAMP)
GROUP BY warehouse_name
ORDER BY total_credits DESC;
-- BigQuery: identify expensive queries
SELECT
user_email,
query,
total_bytes_processed / POW(1024, 4) AS tb_scanned,
total_bytes_processed / POW(1024, 4) * 6.25 AS estimated_cost_usd
FROM region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 20;
Cost Optimization Checklist
Quick Wins (do these first):
[ ] Compress all data files (Parquet + Zstd)
[ ] Partition tables by date
[ ] Add WHERE clauses on partition columns
[ ] Replace SELECT * with explicit column lists
[ ] Auto-suspend idle warehouses (Snowflake)
[ ] Set query byte limits (BigQuery)
Medium Effort:
[ ] Cluster tables by frequently filtered columns
[ ] Materialize expensive subqueries as tables
[ ] Move cold data to cheaper storage tiers
[ ] Right-size warehouses (do not use XL for small queries)
[ ] Implement query result caching
High Effort:
[ ] Migrate from on-demand to reserved pricing
[ ] Implement incremental processing (process only new data)
[ ] Consolidate redundant pipelines
[ ] Implement chargeback (teams pay for their usage)
Chargeback & Showback
Showback: Show each team how much they spend, but do not charge them. Creates awareness.
Chargeback: Actually charge each team's budget for their data usage. Creates accountability.
Most organizations start with showback and move to chargeback as their data platform matures. The transition is cultural as much as technical.
Common Pitfalls
- Optimizing storage costs instead of compute costs. Spending a week saving 5,000/month in unoptimized queries is backwards.
- Not partitioning large tables. A 10 TB unpartitioned table is a ticking cost bomb. Every query pays the full table scan price.
- Over-partitioning. Millions of tiny partitions create overhead that exceeds the savings. Partition by date, not by high-cardinality columns.
- Running development workloads on production-grade infrastructure. Dev does not need an X-Large warehouse. Use X-Small or Small for development and testing.
- No query cost guardrails. Without byte scan limits or cost controls, a single runaway query can cost thousands. BigQuery custom quotas and Snowflake resource monitors prevent this.
- Ignoring idle resources. A Snowflake warehouse running 24/7 when it is only needed for 2 hours a day wastes 92% of its cost. Auto-suspend after 5 minutes of inactivity.
- Treating cost optimization as a one-time project. Costs drift. New pipelines, new analysts, new queries. Review costs monthly and build cost awareness into the team culture.
Key Takeaways
- Storage is cheap, compute is expensive. Optimize for compute costs first.
- Compression (Parquet + Zstd) reduces both storage and compute costs. There is no reason not to compress.
- Partitioning enables query engines to skip irrelevant data. Partition by date for time-series data. Ensure each partition has at least 128 MB.
- Full table scans are the primary cost driver. Add partition filters, select specific columns, and filter early.
- Use reserved pricing for predictable baseline workloads and on-demand for bursty peaks.
- FinOps is a team discipline: track costs per team and pipeline, set guardrails, and review monthly. Start with showback, evolve to chargeback.