5 min read
On this page

Storage & Compute Costs

Storage is cheap. Compute is expensive. This asymmetry drives nearly every data infrastructure decision. The cost of storing a terabyte of data in S3 is about 23permonth.Thecostofscanningthatterabytewithaqueryenginecanbe23 per month. The cost of scanning that terabyte with a query engine can be 5 per scan. Run that query 100 times a month and your compute costs are 20x your storage costs. Understanding this relationship is the foundation of data infrastructure cost management.

The Cost Asymmetry

Storage Costs

Cloud storage prices have fallen dramatically and continue to drop:

AWS S3 Standard:           ~$0.023 / GB / month
AWS S3 Infrequent Access:  ~$0.0125 / GB / month
AWS S3 Glacier:            ~$0.004 / GB / month

1 TB in S3 Standard:       ~$23 / month
1 TB in S3 Glacier:        ~$4 / month
1 PB in S3 Standard:       ~$23,000 / month

Storage is so cheap that the default strategy is: store everything, compress it, and worry about organizing it later. Deleting data to save on storage costs is almost never worth the risk of needing it later.

Compute Costs

Compute is where the bills add up:

Snowflake X-Small warehouse:  ~$2 / hour
Snowflake Medium warehouse:   ~$16 / hour
Snowflake X-Large warehouse:  ~$128 / hour

BigQuery on-demand:           $6.25 / TB scanned
BigQuery flat-rate:           Varies by commitment

Databricks DBU:               ~$0.07-$0.55 / DBU depending on tier

A single poorly written query that scans 10 TB costs 62.50inBigQueryondemandpricing.Ifananalystrunsit10timeswhiledebugging,thatis62.50 in BigQuery on-demand pricing. If an analyst runs it 10 times while debugging, that is 625 for one afternoon of work.

Compression Reduces Both

Compression is the rare optimization that reduces storage cost and compute cost simultaneously. Smaller data on disk means less to store and less to read when querying.

Columnar Formats & Compression

Parquet and ORC files compress dramatically because columnar storage groups similar values together, which compresses better than row-oriented data.

Format         Typical Compression Ratio (vs CSV)
------         ------------------------------------
CSV            1x (baseline)
JSON           0.8-1.2x (often larger due to keys)
Parquet        5-10x compression
Parquet + Zstd 8-15x compression
ORC + Zlib     8-15x compression

A 100 GB CSV file might compress to 10 GB in Parquet with Snappy compression, or 7 GB with Zstandard. That is 90% less storage cost and 90% less data scanned per query.

# Writing compressed Parquet with PyArrow
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(df)
pq.write_table(
    table,
    'orders.parquet',
    compression='zstd',         # Zstandard: good compression, fast decompression
    row_group_size=1_000_000,   # Optimize for predicate pushdown
)

Choosing a Compression Codec

Codec      Compression Ratio   Speed        Use Case
-----      -----------------   -----        --------
Snappy     Moderate            Very fast    Default for most workloads
Zstd       High                Fast         Best balance of ratio and speed
Gzip       High                Slow         Archival, infrequent reads
LZ4        Low-moderate        Fastest      Real-time workloads

For data engineering, Zstandard (zstd) is generally the best choice. It compresses nearly as well as gzip but decompresses much faster.

Partitioning Reduces Compute

Partitioning organizes data into subdirectories by a column value, allowing the query engine to skip irrelevant partitions entirely. This is called partition pruning, and it is the single most impactful optimization for query cost.

How Partitioning Works

Without partitioning:
  s3://data-lake/orders/
    orders.parquet          <- 500 GB, all dates mixed together

With partitioning by date:
  s3://data-lake/orders/
    order_date=2025-01-01/
      part-001.parquet      <- 2 GB
    order_date=2025-01-02/
      part-001.parquet      <- 2 GB
    ...
    order_date=2025-03-15/
      part-001.parquet      <- 2 GB

A query for WHERE order_date = '2025-03-15' reads 2 GB instead of 500 GB. That is a 99.6% reduction in data scanned and a proportional reduction in cost.

Partitioning in the Warehouse

-- BigQuery: Partition by date
CREATE TABLE analytics.orders
PARTITION BY order_date
CLUSTER BY customer_id
AS SELECT * FROM staging.orders;

-- Query: partition pruning reduces scanned data
SELECT customer_id, SUM(amount)
FROM analytics.orders
WHERE order_date BETWEEN '2025-03-01' AND '2025-03-31'
GROUP BY customer_id;
-- Only scans March data, not the entire table

Over-Partitioning

Too many partitions create overhead. A table partitioned by user_id with 10 million users creates 10 million tiny files. Query engines struggle with millions of small files (the "small files problem").

Good partitioning:   date (365 partitions/year), region (10-50 partitions)
Bad partitioning:    user_id (millions of partitions), transaction_id (one row each)
Rule of thumb:       Each partition should contain at least 128 MB of data

The Cost of Full Table Scans

A full table scan reads every row in the table. In cloud warehouses, you pay for every byte scanned.

Table: user_events (5 TB)
Query: SELECT COUNT(*) FROM user_events WHERE user_id = 12345

Without optimization:  Scans 5 TB  -> $31.25 (BigQuery on-demand)
With partition on date: Scans 50 GB -> $0.31 (if you add a date filter)
With clustering:        Scans 5 GB  -> $0.03 (if clustered by user_id)

The difference between 31.25and31.25 and 0.03 per query is the difference between a 10,000/monthbillanda10,000/month bill and a 10/month bill for the same workload.

Common Causes of Full Scans

  • Missing WHERE clause on the partition column
  • SELECT * when only a few columns are needed (in columnar storage, selecting fewer columns reads less data)
  • Joining large tables without filtering first
  • Using functions on partition columns: WHERE YEAR(order_date) = 2025 prevents partition pruning, but WHERE order_date >= '2025-01-01' enables it

Cloud Pricing: On-Demand vs Reserved

On-Demand

Pay per query (BigQuery) or per hour of active warehouse (Snowflake, Redshift). Good for unpredictable workloads. Expensive at scale.

BigQuery on-demand:     $6.25 / TB scanned, no commitment
Snowflake on-demand:    $2-4 / credit, pay for active time only
Redshift Serverless:    ~$0.375 / RPU-hour

Reserved / Committed

Pre-purchase capacity at a discount. Good for predictable, steady workloads.

BigQuery flat-rate:       $2,000/month for 100 slots (autoscale editions)
Snowflake committed:      Up to 30% discount with annual commitment
Redshift Reserved Nodes:  Up to 75% discount with 1-3 year commitment

Hybrid Strategy

Use reserved capacity for your baseline workload and on-demand for spikes.

Baseline: 50 concurrent queries, steady → Reserved capacity
Peak: 200 concurrent queries during month-end reporting → On-demand burst
Development: Unpredictable, bursty → On-demand

FinOps for Data Teams

FinOps (financial operations) is the practice of managing cloud costs as a team discipline, not just an infrastructure concern.

Cost Visibility

The first step is knowing where money goes. Tag resources, track costs per team, per pipeline, and per query.

-- Snowflake: check which warehouses are costing the most
SELECT
    warehouse_name,
    SUM(credits_used) AS total_credits,
    SUM(credits_used) * 3.00 AS estimated_cost_usd
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(month, -1, CURRENT_TIMESTAMP)
GROUP BY warehouse_name
ORDER BY total_credits DESC;
-- BigQuery: identify expensive queries
SELECT
    user_email,
    query,
    total_bytes_processed / POW(1024, 4) AS tb_scanned,
    total_bytes_processed / POW(1024, 4) * 6.25 AS estimated_cost_usd
FROM region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 20;

Cost Optimization Checklist

Quick Wins (do these first):
  [ ] Compress all data files (Parquet + Zstd)
  [ ] Partition tables by date
  [ ] Add WHERE clauses on partition columns
  [ ] Replace SELECT * with explicit column lists
  [ ] Auto-suspend idle warehouses (Snowflake)
  [ ] Set query byte limits (BigQuery)

Medium Effort:
  [ ] Cluster tables by frequently filtered columns
  [ ] Materialize expensive subqueries as tables
  [ ] Move cold data to cheaper storage tiers
  [ ] Right-size warehouses (do not use XL for small queries)
  [ ] Implement query result caching

High Effort:
  [ ] Migrate from on-demand to reserved pricing
  [ ] Implement incremental processing (process only new data)
  [ ] Consolidate redundant pipelines
  [ ] Implement chargeback (teams pay for their usage)

Chargeback & Showback

Showback: Show each team how much they spend, but do not charge them. Creates awareness.

Chargeback: Actually charge each team's budget for their data usage. Creates accountability.

Most organizations start with showback and move to chargeback as their data platform matures. The transition is cultural as much as technical.

Common Pitfalls

  • Optimizing storage costs instead of compute costs. Spending a week saving 50/monthonstoragewhileignoring50/month on storage while ignoring 5,000/month in unoptimized queries is backwards.
  • Not partitioning large tables. A 10 TB unpartitioned table is a ticking cost bomb. Every query pays the full table scan price.
  • Over-partitioning. Millions of tiny partitions create overhead that exceeds the savings. Partition by date, not by high-cardinality columns.
  • Running development workloads on production-grade infrastructure. Dev does not need an X-Large warehouse. Use X-Small or Small for development and testing.
  • No query cost guardrails. Without byte scan limits or cost controls, a single runaway query can cost thousands. BigQuery custom quotas and Snowflake resource monitors prevent this.
  • Ignoring idle resources. A Snowflake warehouse running 24/7 when it is only needed for 2 hours a day wastes 92% of its cost. Auto-suspend after 5 minutes of inactivity.
  • Treating cost optimization as a one-time project. Costs drift. New pipelines, new analysts, new queries. Review costs monthly and build cost awareness into the team culture.

Key Takeaways

  • Storage is cheap, compute is expensive. Optimize for compute costs first.
  • Compression (Parquet + Zstd) reduces both storage and compute costs. There is no reason not to compress.
  • Partitioning enables query engines to skip irrelevant data. Partition by date for time-series data. Ensure each partition has at least 128 MB.
  • Full table scans are the primary cost driver. Add partition filters, select specific columns, and filter early.
  • Use reserved pricing for predictable baseline workloads and on-demand for bursty peaks.
  • FinOps is a team discipline: track costs per team and pipeline, set guardrails, and review monthly. Start with showback, evolve to chargeback.