Partitioning & Clustering

Partitioning and clustering are the two most impactful performance optimizations in a data warehouse. Done right, they reduce query times from minutes to seconds and cut costs by orders of magnitude. Done wrong, they create millions of tiny files that make everything slower.

Partitioning

Partitioning splits a table into smaller, independent segments based on a column's value. The warehouse stores each partition separately, so queries that filter on the partition column can skip irrelevant partitions entirely.

How Partitioning Works

Table: events (500 GB, 2 billion rows)

Without partitioning:
  Query scans all 500 GB regardless of filter

Partitioned by event_date:
  Partition 2025-01-01: 1.5 GB
  Partition 2025-01-02: 1.4 GB
  Partition 2025-01-03: 1.6 GB
  ...
  Partition 2025-12-31: 1.3 GB

Query with WHERE event_date = '2025-06-15':
  Scans only 1 partition (~1.5 GB) instead of 500 GB

That is a 99.7% reduction in data scanned. On a platform like BigQuery where you pay per byte scanned, this is a direct cost reduction.

Partition Column Selection

The partition column should be:

Frequently used in WHERE clauses. If nobody filters on it, partitioning on it is useless.
Low-to-medium cardinality. Dates work well (365 values per year). User IDs do not (millions of values = millions of partitions).
Evenly distributed. Partitions should be roughly the same size. A partition scheme where 90% of data lands in one partition defeats the purpose.

Common partition strategies:

-- Partition by date (most common)
CREATE TABLE events
PARTITION BY DATE(event_timestamp);

-- Partition by date truncated to month (for smaller tables)
CREATE TABLE monthly_reports
PARTITION BY DATE_TRUNC(report_date, MONTH);

-- Partition by a categorical column (BigQuery)
CREATE TABLE sales
PARTITION BY RANGE_BUCKET(region_id, GENERATE_ARRAY(1, 100, 1));

Partition Pruning

Partition pruning is the mechanism by which the query planner skips partitions that cannot contain relevant data. It happens automatically when the query includes a filter on the partition column.

-- Partition pruning happens: only scans 1 day
SELECT COUNT(*) FROM events
WHERE event_date = '2025-06-15';

-- Partition pruning happens: scans 7 days
SELECT COUNT(*) FROM events
WHERE event_date BETWEEN '2025-06-01' AND '2025-06-07';

-- Partition pruning DOES NOT happen: scans everything
SELECT COUNT(*) FROM events
WHERE YEAR(event_date) = 2025;

That last query is a common mistake. Wrapping the partition column in a function prevents the query planner from matching it to partitions. Always filter directly on the partition column.

Partition Sizing

The sweet spot for partition size is 100 MB to 1 GB of compressed data per partition.

Too small (< 10 MB per partition):
  - Millions of tiny files
  - Metadata overhead dominates query time
  - File listing operations become the bottleneck
  - Object storage has per-request costs that add up

Too large (> 10 GB per partition):
  - Less granular pruning
  - Queries still read large amounts of unnecessary data
  - Single-partition queries are slow

Sweet spot (100 MB - 1 GB per partition):
  - Effective pruning without file overhead
  - Good compression ratios within each file
  - Efficient parallel reads

Over-Partitioning

Over-partitioning is the most common partitioning mistake. It happens when you partition on a high-cardinality column or combine multiple partition dimensions.

Bad: Partitioned by date AND user_id AND event_type
  = 365 days x 1M users x 50 event types = 18.25 billion partitions
  Most partitions contain a single row or are empty
  The metadata catalog alone is larger than the data

Bad: Partitioned by hour on a low-volume table
  = 8,760 partitions per year, each containing 100 KB
  Thousands of tiny files that are slower to list than to read

Good: Partitioned by date on a high-volume table
  = 365 partitions per year, each containing 500 MB
  Effective pruning, reasonable file sizes

Clustering

Clustering (also called sort keys or clustering keys) controls the physical order of data within partitions. While partitioning determines which partition a row goes into, clustering determines where within that partition the row is stored.

How Clustering Works

Table: events, partitioned by event_date

Without clustering:
  Within the 2025-06-15 partition, rows are in arbitrary order
  A query filtering on user_id must scan the entire partition

Clustered by user_id:
  Within the 2025-06-15 partition, rows are sorted by user_id
  A query filtering on user_id reads only the relevant sorted blocks

Clustering in Different Warehouses

BigQuery:

CREATE TABLE events
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type;

BigQuery automatically maintains clustering as data is loaded. It stores metadata about the min/max values in each block, enabling block-level pruning.

Snowflake:

-- Snowflake uses "clustering keys" on micro-partitions
ALTER TABLE events CLUSTER BY (event_date, user_id);

Snowflake automatically re-clusters data over time through a background process. You define the clustering key, and Snowflake manages the physical organization.

Redshift:

-- Redshift uses sort keys
CREATE TABLE events (
    event_id BIGINT,
    user_id BIGINT,
    event_type VARCHAR(50),
    event_timestamp TIMESTAMP
)
SORTKEY (event_timestamp, user_id);

Choosing Clustering Columns

Cluster on columns that are:

Frequently used in WHERE, JOIN, or ORDER BY clauses
High cardinality (clustering is most effective when it can skip large ranges)
Not already the partition column (partitioning already handles that column)

Good clustering candidates:
  - user_id (high cardinality, commonly filtered)
  - customer_id (high cardinality, common in JOINs)
  - event_type (medium cardinality, common in filters)
  - region (low-to-medium cardinality, but commonly filtered)

Poor clustering candidates:
  - boolean columns (only 2 values, minimal benefit)
  - columns rarely used in queries
  - columns with extreme skew (one value dominates)

Multi-Column Clustering

When you cluster on multiple columns, order matters. The first column has the strongest effect.

-- Cluster by user_id first, then event_type
CLUSTER BY (user_id, event_type)

Effect on different queries:
  WHERE user_id = 123                     -- Strong pruning (first cluster column)
  WHERE user_id = 123 AND event_type = X  -- Very strong pruning (both columns)
  WHERE event_type = 'click'              -- Weak pruning (second column only)

Put the column that appears most often in filters first.

Partitioning & Clustering Together

The most effective strategy combines both:

-- BigQuery example: partition by date, cluster by commonly filtered columns
CREATE TABLE analytics.events (
    event_id STRING,
    user_id STRING,
    event_type STRING,
    event_properties JSON,
    event_timestamp TIMESTAMP
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type;

Query: WHERE event_timestamp = '2025-06-15' AND user_id = 'abc123'

Step 1: Partition pruning skips all partitions except 2025-06-15
Step 2: Cluster pruning within the 2025-06-15 partition skips blocks
        where user_id range does not include 'abc123'
Step 3: Only a small fraction of the data is actually read

Measuring the Impact

Always verify that partitioning and clustering are actually helping. Check the query execution plan.

-- BigQuery: check bytes scanned
SELECT
    total_bytes_processed,
    total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE job_id = 'your-query-job-id';

-- Snowflake: check partition pruning effectiveness
SELECT *
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE QUERY_ID = 'your-query-id';
-- Look at PARTITIONS_SCANNED vs PARTITIONS_TOTAL

If PARTITIONS_SCANNED is close to PARTITIONS_TOTAL, your query is not benefiting from pruning. Either the filter does not align with the partition column, or the data is not clustered on the filtered column.

Repartitioning Existing Tables

When data grows or query patterns change, you may need to repartition.

-- BigQuery: create a new partitioned table from an existing one
CREATE TABLE analytics.events_partitioned
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type
AS SELECT * FROM analytics.events_legacy;

-- Then swap: rename old table, rename new table
ALTER TABLE analytics.events_legacy RENAME TO analytics.events_backup;
ALTER TABLE analytics.events_partitioned RENAME TO analytics.events;

In Snowflake, changing a clustering key is a metadata operation, but re-clustering large tables takes time and credits in the background.

Common Pitfalls

Partitioning on a high-cardinality column. Partitioning on user_id when you have 10 million users creates 10 million partitions. Each contains a handful of rows. The metadata overhead alone will kill performance.

Wrapping the partition column in a function. WHERE YEAR(event_date) = 2025 prevents partition pruning. Use WHERE event_date >= '2025-01-01' AND event_date < '2026-01-01' instead.

Not checking if pruning is working. You added partitioning but forgot to verify. Check the query plan. If the warehouse is scanning all partitions, your filter is not aligned with the partition column.

Clustering on rarely-queried columns. Clustering has maintenance costs (re-clustering in Snowflake, auto-reclustering in BigQuery). If nobody filters on the clustered column, you pay the cost without the benefit.

Over-partitioning low-volume tables. A table with 1 GB total does not need daily partitions. Monthly or even yearly partitions are fine. The overhead of managing 365 tiny partitions outweighs the pruning benefit.

Ignoring data skew. If 80% of your data falls into one partition (e.g., most events happen in one region), that partition becomes a hotspot. Queries hitting it are no faster than scanning the whole table. Consider a different partition column or a composite approach.

Key Takeaways

Partitioning splits tables into segments by a column value; clustering sorts data within those segments
Partition pruning lets the query planner skip irrelevant partitions entirely, often reducing data scanned by 90%+
Target 100 MB to 1 GB per partition; smaller partitions create file overhead, larger ones reduce pruning effectiveness
Cluster on high-cardinality columns frequently used in WHERE and JOIN clauses
Always filter directly on the partition column; wrapping it in a function disables pruning
Combine partitioning (coarse-grained, by date) with clustering (fine-grained, by high-cardinality columns) for maximum query performance
Verify pruning effectiveness by checking query execution plans, not by assuming it works