Query Optimization

Most query performance problems are solved by the same handful of techniques. You do not need to understand the internals of a query planner to make your queries fast. You need to read execution plans, understand indexing basics, and avoid a few common anti-patterns. This is the 80/20 of query performance — the small set of practices that solve the vast majority of slow query problems.

EXPLAIN ANALYZE: Read It Before Optimizing

Never optimize a query without first reading its execution plan. EXPLAIN ANALYZE runs the query and shows you what the database actually did — which tables it scanned, which indexes it used, how many rows it processed, and where time was spent.

-- PostgreSQL
EXPLAIN ANALYZE
SELECT c.customer_name, SUM(o.order_amount) AS total_spent
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2025-01-01'
GROUP BY c.customer_name
ORDER BY total_spent DESC
LIMIT 10;

A simplified execution plan output:

Limit  (cost=15234..15234 rows=10 width=48) (actual time=342.1..342.2 rows=10)
  -> Sort  (cost=15234..15298 rows=25600 width=48) (actual time=342.1..342.1 rows=10)
        Sort Key: (sum(o.order_amount)) DESC
        -> HashAggregate  (cost=14890..15146 rows=25600 width=48) (actual time=338.5..340.2 rows=25600)
              -> Hash Join  (cost=892..13256 rows=326800 width=23) (actual time=12.3..285.6 rows=326800)
                    Hash Cond: (o.customer_id = c.customer_id)
                    -> Seq Scan on orders o  (cost=0..9823 rows=326800 width=15) (actual time=0.02..95.4 rows=326800)
                          Filter: (order_date >= '2025-01-01')
                          Rows Removed by Filter: 1473200
                    -> Hash  (cost=542..542 rows=28000 width=20) (actual time=11.8..11.8 rows=28000)
                          -> Seq Scan on customers c  (cost=0..542 rows=28000 width=20) (actual time=0.01..5.2 rows=28000)
Planning Time: 0.8 ms
Execution Time: 342.5 ms

What to Look For

Sequential scans on large tables. Seq Scan on orders means the database read every row in the orders table. For a table with 1.8 million rows, it read all of them but only kept 326,800. An index on order_date would let it skip the 1.47 million irrelevant rows.

Rows Removed by Filter. When this number is much larger than the rows returned, you are scanning data you do not need. An index on the filter column would help.

Actual vs estimated rows. If the planner estimated 100 rows but got 100,000, it may have chosen a suboptimal plan. Running ANALYZE on the table updates statistics and often fixes this.

The slowest node. Look at actual time values. The node with the highest time delta between its start and end is your bottleneck. Optimize that node first.

BigQuery & Snowflake

Cloud warehouses do not have traditional EXPLAIN ANALYZE, but they provide query profiles:

-- BigQuery: check the execution details in the console
-- or use INFORMATION_SCHEMA
SELECT
    total_bytes_processed,
    total_slot_ms,
    cache_hit
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE job_id = 'your-job-id';

In Snowflake, use the Query Profile tab in the web UI. Look for the nodes with the highest percentage of total execution time.

Indexes

Indexes are data structures that let the database find rows without scanning the entire table. They are the single most impactful optimization for transactional databases (PostgreSQL, MySQL). Cloud warehouses handle indexing differently (via automatic clustering and partitioning), but the concepts transfer.

B-Tree Indexes

The default index type. Excellent for equality checks, range queries, and sorting.

-- Create an index on the column you filter most often
CREATE INDEX idx_orders_order_date ON orders (order_date);

-- Composite index for queries that filter on both columns
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);

Column order in composite indexes matters. An index on (customer_id, order_date) supports:

WHERE customer_id = 123 (uses the index)
WHERE customer_id = 123 AND order_date >= '2025-01-01' (uses the index fully)
WHERE order_date >= '2025-01-01' (does NOT use the index efficiently — wrong leading column)

Put the most selective column (the one that eliminates the most rows) first, or put the column used in equality conditions before the one used in range conditions.

Partial Indexes

Index only the rows that matter:

-- Only index active orders (skip the 90% that are completed)
CREATE INDEX idx_orders_active ON orders (customer_id, order_date)
WHERE status = 'active';

This keeps the index small and fast. If your queries always filter on status = 'active', the partial index is dramatically more efficient than a full index.

When Not to Index

Tables with fewer than a few thousand rows. Sequential scan is faster than index lookup for small tables.
Columns with very low cardinality (like a boolean is_active flag with 50/50 distribution). The index does not eliminate enough rows to justify the overhead.
Write-heavy tables where insert performance matters more than read performance. Each index adds overhead to every INSERT and UPDATE.

Partitioning

Partitioning splits a large table into smaller physical segments based on a column value. The query engine can skip entire partitions that do not match the query filter.

-- PostgreSQL: range partition by month
CREATE TABLE orders (
    order_id      BIGINT,
    order_date    DATE,
    customer_id   INT,
    order_amount  DECIMAL(10,2)
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2025_01 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE orders_2025_02 PARTITION OF orders
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- ... and so on

-- BigQuery: partition by date
CREATE TABLE project.dataset.orders
PARTITION BY DATE(order_date)
AS SELECT * FROM source_table;

-- Snowflake: automatic clustering
ALTER TABLE orders CLUSTER BY (order_date);

When to Partition

The table has hundreds of millions of rows or more
Queries almost always filter on a specific column (usually a date)
You want to efficiently delete old data (drop a partition instead of DELETE WHERE)

Partition Pruning

The key benefit: when you query with a filter on the partition column, the engine skips irrelevant partitions entirely.

-- Only scans the January 2025 partition, not the entire table
SELECT * FROM orders WHERE order_date = '2025-01-15';

This works only if you filter on the partition column directly. Wrapping it in a function — WHERE EXTRACT(MONTH FROM order_date) = 1 — may prevent partition pruning. Use direct comparisons.

Avoid SELECT *

-- Bad: reads every column, including that 10KB JSON blob
SELECT * FROM orders WHERE order_date = '2025-01-15';

-- Good: read only what you need
SELECT order_id, customer_id, order_amount
FROM orders
WHERE order_date = '2025-01-15';

In columnar warehouses (Snowflake, BigQuery, Redshift), this matters even more. These systems store data by column, so reading 5 columns out of 50 means reading 90% less data. In BigQuery, this directly reduces cost because you pay per byte scanned.

In row-store databases (PostgreSQL, MySQL), the impact is smaller but still real — fewer columns means smaller result sets, less network transfer, and potentially index-only scans.

Push Filters Down

Apply filters as early as possible. Do not join a million-row table and then filter the result — filter first, then join.

-- Bad: join everything, then filter
SELECT c.customer_name, o.order_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2025-01-01'
  AND o.order_amount > 100;

-- Better: filter in a CTE before joining
-- (modern optimizers often do this automatically, but be explicit)
WITH recent_large_orders AS (
    SELECT order_id, customer_id, order_amount
    FROM orders
    WHERE order_date >= '2025-01-01'
      AND order_amount > 100
)
SELECT c.customer_name, r.order_amount
FROM recent_large_orders r
JOIN customers c ON r.customer_id = c.customer_id;

Modern query optimizers usually push predicates down automatically. But in complex queries with many CTEs, subqueries, and views, the optimizer sometimes misses opportunities. Writing the filter explicitly in the earliest CTE ensures it happens.

Filter Pushdown in Joins

-- Bad: filter on the joined table after the join
SELECT o.order_id, o.order_amount
FROM orders o
LEFT JOIN returns r ON o.order_id = r.order_id
WHERE r.return_date >= '2025-01-01';
-- This silently converts the LEFT JOIN to an INNER JOIN

-- Good: filter in the ON clause to preserve the LEFT JOIN
SELECT o.order_id, o.order_amount
FROM orders o
LEFT JOIN returns r ON o.order_id = r.order_id
    AND r.return_date >= '2025-01-01';

This is a correctness issue as much as a performance issue. Filtering a LEFT JOIN table in the WHERE clause eliminates null rows, turning it into an inner join.

Materialized Views

A materialized view pre-computes and stores the result of a query. Subsequent reads hit the stored result instead of recomputing.

-- PostgreSQL: create a materialized view for an expensive aggregation
CREATE MATERIALIZED VIEW mv_daily_revenue AS
SELECT
    DATE(order_date) AS order_date,
    product_category,
    COUNT(*) AS order_count,
    SUM(order_amount) AS total_revenue,
    AVG(order_amount) AS avg_order_value
FROM orders o
JOIN products p ON o.product_id = p.product_id
GROUP BY DATE(order_date), product_category;

-- Refresh when underlying data changes
REFRESH MATERIALIZED VIEW mv_daily_revenue;

-- Query the materialized view (fast)
SELECT * FROM mv_daily_revenue WHERE order_date = '2025-01-15';

When to Use Materialized Views

A query is expensive and runs frequently with the same parameters
Dashboard queries that aggregate large tables and are queried dozens of times per day
You can tolerate slight staleness (the view is only as fresh as the last refresh)

Alternatives in Cloud Warehouses

BigQuery: Materialized views with automatic refresh
Snowflake: Dynamic tables (managed materialized views with declarative refresh)
Redshift: Materialized views with AUTO REFRESH option

In dbt, the equivalent pattern is an incremental model or a table materialization that rebuilds on schedule.

Join Optimization

Join Order

Most optimizers choose the join order automatically, but for complex queries with many joins, the optimizer sometimes picks a suboptimal plan.

-- If you know small_table has 100 rows and large_table has 10M:
-- Start with the small table (though the optimizer usually handles this)
SELECT l.*, s.category_name
FROM large_table l
JOIN small_table s ON l.category_id = s.category_id;

Avoid Cartesian Products

A missing join condition creates a cross join (every row paired with every row). On two million-row tables, that is a trillion-row result set.

-- Missing join condition: accidental cartesian product
SELECT o.order_id, c.customer_name
FROM orders o, customers c;  -- no WHERE clause = disaster

-- Always use explicit JOIN syntax with ON conditions
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

Join on Indexed Columns

Joins are fast when both sides have indexes (or in columnar databases, when join columns are clustered). Joining on non-indexed columns forces full table scans on both sides.

The Optimization Workflow

When a query is slow, follow this process:

1. Run EXPLAIN ANALYZE (or check the query profile)
2. Identify the slowest node in the plan
3. Check: Is it scanning more rows than necessary?
   -> Add an index or a filter
4. Check: Is it doing a full table scan on a huge table?
   -> Partition the table or add clustering
5. Check: Is it processing columns you don't need?
   -> Remove SELECT * and list specific columns
6. Check: Is it recomputing the same expensive result?
   -> Use a materialized view or pre-aggregated table
7. Re-run EXPLAIN ANALYZE to verify the improvement

Do not guess. Measure, change one thing, measure again. Performance intuition is frequently wrong — the database optimizer does things you do not expect.

Common Pitfalls

Optimizing without measuring. Adding indexes, rewriting queries, and changing join orders without reading the execution plan first. You might optimize a part of the query that is not the bottleneck.
Too many indexes. Each index slows down writes and consumes storage. Index the columns you actually filter and join on, not every column.
Functions on indexed columns. WHERE YEAR(order_date) = 2025 cannot use an index on order_date. Use WHERE order_date >= '2025-01-01' AND order_date < '2026-01-01' instead.
Ignoring statistics. Run ANALYZE (PostgreSQL) or equivalent to keep table statistics current. Stale statistics lead to bad query plans.
Premature optimization of warehouse queries. Cloud warehouses are designed to scan large datasets efficiently. A query that runs in 5 seconds on BigQuery probably does not need optimization. Focus on queries that take minutes or cost significant money.
Not considering the query pattern. A query that runs once a month does not need the same optimization attention as one that runs every 30 seconds. Optimize what matters.
Caching at the wrong layer. Sometimes the right answer is not to optimize the query but to cache the result in an application layer, a materialized view, or a pre-aggregated table.

Key Takeaways

Always read the execution plan (EXPLAIN ANALYZE) before optimizing. Do not guess.
Indexes are the highest-impact optimization for row-store databases. B-tree indexes handle most cases. Put the most selective or equality-filtered column first in composite indexes.
Partition large tables by the column you filter most frequently, usually a date.
Avoid SELECT * — specify only the columns you need, especially in columnar warehouses where it directly affects performance and cost.
Push filters as early as possible in your query. Filter before joining, not after.
Materialized views pre-compute expensive queries. Use them for frequently-run aggregations that tolerate slight staleness.
Follow a systematic optimization workflow: measure, identify the bottleneck, change one thing, measure again.