Query Optimization

The query optimizer transforms a parsed query into the cheapest execution plan. It is the most complex component of a DBMS.

Optimization Overview

Logical Query Plan (relational algebra tree)
    ↓  Heuristic optimization (rule-based transformations)
Optimized Logical Plan
    ↓  Cost-based optimization (enumerate physical plans, estimate costs)
Physical Execution Plan (with specific algorithms chosen)

Heuristic (Rule-Based) Optimization

Apply algebraic transformations that almost always improve performance.

Predicate Pushdown

Move selection (WHERE) as close to the data source as possible — filter rows early.

Before:                        After:
    σ_{u.age > 25}                 ⋈
       |                         / \
       ⋈                   σ_{age>25}  posts
      / \                      |
  users  posts              users

Filtering 1M users to 100K before joining is much cheaper than joining 1M then filtering.

Projection Pushdown

Remove unneeded columns early — reduces data volume.

Before: π_{name}(users ⋈ posts)
After:  π_{name}(π_{id,name}(users) ⋈ π_{user_id}(posts))

Join Reordering

The order of joins significantly affects performance. For n tables, there are n! possible orderings.

A ⋈ B ⋈ C:
  (A ⋈ B) ⋈ C  — if A ⋈ B is small, good!
  A ⋈ (B ⋈ C)  — if B ⋈ C is small, this is better!
  (A ⋈ C) ⋈ B  — might be best if A ⋈ C is very selective

Rule of thumb: Join smaller intermediate results first. Put the most selective join first.

Cost-Based Optimization

System R Style (Bottom-Up)

The classical approach (IBM System R, 1979). Used by PostgreSQL, MySQL, Oracle.

Enumerate all access paths for single tables (seq scan, each applicable index).
Consider all 2-table join orders + algorithms.
Build up to n-table plans using dynamic programming.
Cost each plan using statistics and cost formulas.
Pick the cheapest.

Search space: For n tables, O(n! × join_algorithms^n) plans. Dynamic programming reduces to O(2ⁿ × n) by memoizing optimal sub-plans.

Cost model: Estimate I/O cost (disk pages read/written) + CPU cost (comparisons, hashing).

Cost(Seq Scan) ≈ num_pages
Cost(Index Scan) ≈ selectivity × (index_pages + table_pages)
Cost(Hash Join) ≈ 3 × (|R_pages| + |S_pages|)  (build + probe + write partitions)
Cost(Sort-Merge) ≈ |R_pages| × log|R_pages| + |S_pages| × log|S_pages| + |R_pages| + |S_pages|

Volcano / Cascades Framework (Top-Down)

Used by SQL Server, CockroachDB, Apache Calcite, Greenplum.

Cascades (Graefe, 1995): Top-down exploration with memoization.

Start with the logical plan.
Apply transformation rules to generate equivalent plans (exploring alternatives).
Apply implementation rules to map logical operators to physical operators.
Use branch-and-bound pruning: If a partial plan already exceeds the best known cost, stop exploring.
Memoize subexpressions in a MEMO structure to avoid redundant work.

Advantages over System R: More extensible (new rules easily added). Better pruning. Handles complex queries (many tables) more gracefully.

Join Ordering Algorithms

Dynamic programming (System R): Optimal for up to ~12 tables. O(2ⁿ) states.

Greedy: Start with the two most selective tables. Add the next best table. Fast but suboptimal.

Genetic algorithm (PostgreSQL geqo): For queries with many tables (>12). Random search with crossover/mutation. Good solutions, not guaranteed optimal.

Linearization: For queries with a specific structure (star, chain), specialized algorithms find the optimal order efficiently.

Subquery Decorrelation

Correlated subquery: Inner query references the outer query → executed once per outer row (slow!).

-- Correlated (slow): N executions of inner query
SELECT * FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id AND o.total > 100);

-- Decorrelated (fast): single join
SELECT DISTINCT u.* FROM users u
JOIN orders o ON u.id = o.user_id WHERE o.total > 100;

The optimizer automatically decorrelates correlated subqueries when possible — transforming them into joins or semi-joins.

Common Decorrelation Patterns

IN subquery → Semi-join:

WHERE x IN (SELECT y FROM T) → SEMI JOIN T ON x = y

NOT IN → Anti-semi-join:

WHERE x NOT IN (SELECT y FROM T) → ANTI JOIN T ON x = y

EXISTS → Semi-join:

WHERE EXISTS (SELECT ... FROM T WHERE T.x = outer.y) → SEMI JOIN

Scalar subquery → Left join + aggregate:

SELECT (SELECT MAX(o.total) FROM orders o WHERE o.user_id = u.id) → LEFT JOIN + MAX

Common Subexpression Elimination

If the same subexpression appears multiple times, compute it once and reuse.

WITH cte AS (
    SELECT user_id, SUM(amount) AS total FROM orders GROUP BY user_id
)
SELECT * FROM cte WHERE total > 1000
UNION ALL
SELECT * FROM cte WHERE total < 10;
-- CTE computed once, used twice (if the optimizer doesn't inline it)

Adaptive Query Processing

Problem: Statistics are estimates. Bad estimates → bad plan → slow query.

Solution: Adjust the plan during execution based on actual row counts.

Adaptive join (Oracle): Start with nested loop join. If actual rows exceed threshold, switch to hash join mid-execution.

Spark AQE (Adaptive Query Execution): After each shuffle stage, reoptimize based on actual data sizes. Coalesce partitions, switch join strategies.

JIT Compilation

Compile query operators into machine code at execution time for CPU-intensive queries.

PostgreSQL (since 11): Uses LLVM to JIT-compile expression evaluation and tuple deforming. Benefit: 10-30% speedup for analytical queries.

Query compilation (HyPer, Umbra, MemSQL/SingleStore): Compile the entire query pipeline into native code. Eliminate iterator overhead. Dramatically faster for in-memory analytics.

Optimizer Hints

When the optimizer makes a bad choice, hints override its decisions.

-- MySQL
SELECT /*+ INDEX(orders idx_customer) */ * FROM orders WHERE customer_id = 42;

-- PostgreSQL: no official hints, but:
SET enable_seqscan = OFF;  -- force index usage
SET enable_hashjoin = OFF;  -- force merge or nested loop

-- Oracle
SELECT /*+ USE_HASH(o) PARALLEL(o, 4) */ * FROM orders o;

Philosophy: PostgreSQL intentionally avoids hints — fix statistics or the query instead. MySQL/Oracle provide hints as a pragmatic escape hatch.

Applications in CS

Performance tuning: Understanding the optimizer helps interpret EXPLAIN plans and identify bottlenecks.
Query writing: Knowing what the optimizer can/can't do helps write optimizer-friendly SQL.
Database development: Building a query optimizer is one of the most complex software engineering tasks.
Data engineering: ETL query performance depends on good optimization. Understanding join ordering and pushdown helps design efficient pipelines.