
Query Optimization

The query optimizer transforms a parsed query into the cheapest execution plan it can find. It is often considered the most complex component of a DBMS.

Optimization Overview

Logical Query Plan (relational algebra tree)
    ↓  Heuristic optimization (rule-based transformations)
Optimized Logical Plan
    ↓  Cost-based optimization (enumerate physical plans, estimate costs)
Physical Execution Plan (with specific algorithms chosen)

Heuristic (Rule-Based) Optimization

Apply algebraic transformations that almost always improve performance.

Predicate Pushdown

Move selection (WHERE) as close to the data source as possible — filter rows early.

Before:                        After:
    σ_{u.age > 25}                 ⋈
       |                         / \
       ⋈                   σ_{age>25}  posts
      / \                      |
  users  posts              users

Filtering 1M users to 100K before joining is much cheaper than joining 1M then filtering.
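The rewrite itself is mechanical: if a selection's predicate references only one input of a join, move it below the join on that side. A minimal sketch, using nested tuples as a stand-in for a plan tree (the `REFS` table and node shapes are illustrative, not a real optimizer IR):

```python
# Plans are nested tuples, leaves are table names. REFS records which
# base table each predicate references (assumed known from analysis).
REFS = {"age > 25": "users"}

def push_down(plan):
    """Push a selection below a join when its predicate touches only one side."""
    if plan[0] == "select" and isinstance(plan[2], tuple) and plan[2][0] == "join":
        _, pred, (_, left, right) = plan
        if REFS[pred] == left:      # filter needs only the left input
            return ("join", ("select", pred, left), right)
        if REFS[pred] == right:     # ...or only the right input
            return ("join", left, ("select", pred, right))
    return plan                     # predicate spans both sides: leave it

before = ("select", "age > 25", ("join", "users", "posts"))
print(push_down(before))
# ('join', ('select', 'age > 25', 'users'), 'posts')
```

A real optimizer applies this rule repeatedly (and through multi-way joins) until the filter sits directly above its base table.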

Projection Pushdown

Remove unneeded columns early — reduces data volume.

Before: π_{name}(users ⋈ posts)
After:  π_{name}(π_{id,name}(users) ⋈ π_{user_id}(posts))

Join Reordering

The order of joins significantly affects performance. For n tables there are already n! left-deep orderings, and even more bushy trees.

A ⋈ B ⋈ C:
  (A ⋈ B) ⋈ C  — if A ⋈ B is small, good!
  A ⋈ (B ⋈ C)  — if B ⋈ C is small, this is better!
  (A ⋈ C) ⋈ B  — might be best if A ⋈ C is very selective

Rule of thumb: Do the most selective joins first, so intermediate results stay small.
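A quick back-of-the-envelope check makes the point: the final result size is the same for every order, but the intermediate sizes are not. The cardinalities and selectivities below are made up, with the usual independence assumption:

```python
# Hypothetical cardinalities and per-predicate join selectivities.
card = {"A": 10_000, "B": 50_000, "C": 1_000}
sel_ab, sel_bc = 1e-4, 1e-5

# Order 1: (A join B) join C
inter1 = card["A"] * card["B"] * sel_ab      # intermediate: 50,000 rows
final1 = inter1 * card["C"] * sel_bc

# Order 2: A join (B join C)
inter2 = card["B"] * card["C"] * sel_bc      # intermediate: 500 rows
final2 = inter2 * card["A"] * sel_ab

assert round(final1) == round(final2) == 500  # same answer either way
print(inter1, inter2)  # order 2 builds a 100x smaller intermediate result
```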

Cost-Based Optimization

System R Style (Bottom-Up)

The classical approach (IBM System R, 1979). Used by PostgreSQL, MySQL, Oracle.

  1. Enumerate all access paths for single tables (seq scan, each applicable index).
  2. Consider all 2-table join orders + algorithms.
  3. Build up to n-table plans using dynamic programming.
  4. Cost each plan using statistics and cost formulas.
  5. Pick the cheapest.

Search space: For n tables, O(n! × join_algorithms^n) plans. Dynamic programming reduces this to O(2ⁿ × n) for left-deep plans by memoizing the optimal plan for each subset of tables.

Cost model: Estimate I/O cost (disk pages read/written) + CPU cost (comparisons, hashing).

Cost(Seq Scan) ≈ num_pages
Cost(Index Scan) ≈ selectivity × (index_pages + table_pages)
Cost(Hash Join) ≈ 3 × (|R_pages| + |S_pages|)  (build + probe + write partitions)
Cost(Sort-Merge) ≈ |R_pages| × log|R_pages| + |S_pages| × log|S_pages| + |R_pages| + |S_pages|
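The formulas above translate directly into code. A sketch with page counts as the only inputs (constants, CPU terms, and buffer sizes are deliberately omitted, so the numbers are only useful for comparing plans against each other):

```python
import math

def cost_seq_scan(pages):
    return pages

def cost_index_scan(selectivity, index_pages, table_pages):
    return selectivity * (index_pages + table_pages)

def cost_hash_join(r_pages, s_pages):
    # build + probe + write partitions (Grace hash join: 3 passes)
    return 3 * (r_pages + s_pages)

def cost_sort_merge(r_pages, s_pages):
    sort = lambda p: p * math.log2(p)   # sort each input, then merge
    return sort(r_pages) + sort(s_pages) + r_pages + s_pages

# Example: a 10,000-page table joined with a 1,000-page table
# (hypothetical sizes).
print(cost_hash_join(10_000, 1_000))              # 33000 page I/Os
print(round(cost_sort_merge(10_000, 1_000)))      # noticeably more
```

This is the comparison the optimizer runs for every candidate join: plug the estimated page counts into each algorithm's formula and keep the cheapest.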

Volcano / Cascades Framework (Top-Down)

Used by SQL Server, CockroachDB, Apache Calcite, Greenplum.

Cascades (Graefe, 1995): Top-down exploration with memoization.

  1. Start with the logical plan.
  2. Apply transformation rules to generate equivalent plans (exploring alternatives).
  3. Apply implementation rules to map logical operators to physical operators.
  4. Use branch-and-bound pruning: If a partial plan already exceeds the best known cost, stop exploring.
  5. Memoize subexpressions in a MEMO structure to avoid redundant work.

Advantages over System R: More extensible (new rules easily added). Better pruning. Handles complex queries (many tables) more gracefully.
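The MEMO-plus-pruning idea fits in a few lines. The toy below is a sketch of the mechanism, not real Cascades (no transformation or implementation rules); cardinalities are made up and cost is just the sum of intermediate result sizes:

```python
CARD = {"A": 1_000, "B": 100, "C": 10_000}   # hypothetical table sizes
JOIN_SEL = 0.001                             # assumed per-join selectivity

def est_rows(tables):
    rows = 1.0
    for t in tables:
        rows *= CARD[t]
    return rows * JOIN_SEL ** (len(tables) - 1)

MEMO = {}   # table set -> (best cost, plan): the memoized "group"

def optimize(tables):
    if tables in MEMO:
        return MEMO[tables]          # reuse an already-optimized group
    if len(tables) == 1:
        return 0.0, tables[0]
    best_cost, best_plan = float("inf"), None
    ts = list(tables)
    for mask in range(1, 2 ** len(ts) - 1):      # every binary split
        left = tuple(sorted(t for i, t in enumerate(ts) if mask >> i & 1))
        lc, lp = optimize(left)
        if lc + est_rows(tables) >= best_cost:
            continue        # branch-and-bound: cannot beat best known plan
        right = tuple(sorted(t for i, t in enumerate(ts) if not mask >> i & 1))
        rc, rp = optimize(right)
        cost = lc + rc + est_rows(tables)
        if cost < best_cost:
            best_cost, best_plan = cost, (lp, rp)
    MEMO[tables] = (best_cost, best_plan)
    return MEMO[tables]

cost, plan = optimize(tuple(sorted(CARD)))
print(plan)   # (('A', 'B'), 'C'): join the two small tables first
```

The pruning test is a valid lower bound because the right side's cost can only add to the total; real Cascades prunes the same way, but over physical properties and much richer rule sets.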

Join Ordering Algorithms

Dynamic programming (System R): Optimal for up to ~12 tables. O(2ⁿ) states.

Greedy: Start with the most selective two-table join, then repeatedly add the table that keeps the intermediate result smallest. Fast but suboptimal.

Genetic algorithm (PostgreSQL geqo): For queries with many tables (>12). Random search with crossover/mutation. Good solutions, not guaranteed optimal.

Linearization: For queries with a specific structure (star, chain), specialized algorithms find the optimal order efficiently.
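The System R dynamic program is short enough to sketch in full for left-deep trees. Again the cardinalities and the selectivity are hypothetical, and cost is the sum of intermediate result sizes:

```python
from itertools import combinations

CARD = {"A": 1_000, "B": 100, "C": 10_000, "D": 50}   # made-up sizes
JOIN_SEL = 0.001                                       # assumed selectivity

def est_rows(tables):
    rows = 1.0
    for t in tables:
        rows *= CARD[t]
    return rows * JOIN_SEL ** (len(tables) - 1)

def selinger(tables):
    """Bottom-up DP: best[S] holds the cheapest left-deep order for subset S."""
    best = {frozenset([t]): (0.0, (t,)) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in combinations(tables, size):
            s = frozenset(subset)
            for t in subset:                      # t is joined last
                sub_cost, sub_order = best[s - {t}]
                cost = sub_cost + est_rows(subset)
                if s not in best or cost < best[s][0]:
                    best[s] = (cost, sub_order + (t,))
    return best[frozenset(tables)]

cost, order = selinger(tuple(CARD))
print(order)   # ('D', 'B', 'A', 'C'): small tables first, the 10K-row one last
```

Each subset is solved once and reused for every larger subset containing it, which is exactly the O(2ⁿ) memoization mentioned above.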

Subquery Decorrelation

Correlated subquery: Inner query references the outer query → executed once per outer row (slow!).

-- Correlated (slow): N executions of inner query
SELECT * FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id AND o.total > 100);

-- Decorrelated (fast): single join
SELECT DISTINCT u.* FROM users u
JOIN orders o ON u.id = o.user_id WHERE o.total > 100;

The optimizer automatically decorrelates correlated subqueries when possible — transforming them into joins or semi-joins.

Common Decorrelation Patterns

IN subquery → Semi-join:

WHERE x IN (SELECT y FROM T) → SEMI JOIN T ON x = y

NOT IN → Anti-semi-join:

WHERE x NOT IN (SELECT y FROM T) → ANTI JOIN T ON x = y
(Careful: if the subquery can return NULL, NOT IN yields no rows at all, so this rewrite only applies when y is known to be non-nullable.)

EXISTS → Semi-join:

WHERE EXISTS (SELECT ... FROM T WHERE T.x = outer.y) → SEMI JOIN

Scalar subquery → Left join + aggregate:

SELECT (SELECT MAX(o.total) FROM orders o WHERE o.user_id = u.id) → LEFT JOIN + MAX

Common Subexpression Elimination

If the same subexpression appears multiple times, compute it once and reuse.

WITH cte AS (
    SELECT user_id, SUM(amount) AS total FROM orders GROUP BY user_id
)
SELECT * FROM cte WHERE total > 1000
UNION ALL
SELECT * FROM cte WHERE total < 10;
-- CTE computed once, used twice (if the optimizer doesn't inline it)

Adaptive Query Processing

Problem: Statistics are estimates. Bad estimates → bad plan → slow query.

Solution: Adjust the plan during execution based on actual row counts.

Adaptive join (Oracle): Start with nested loop join. If actual rows exceed threshold, switch to hash join mid-execution.

Spark AQE (Adaptive Query Execution): After each shuffle stage, reoptimize based on actual data sizes. Coalesce partitions, switch join strategies.
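The adaptive-join idea can be sketched in a few lines: buffer outer rows while counting them, and if the count blows past the planner's estimate, abandon the nested loop and build a hash table instead. Toy code in the spirit of Oracle's adaptive plans, not how any real engine implements it (names and the threshold are illustrative):

```python
from itertools import chain

THRESHOLD = 100   # planner's row-count estimate for the outer side

def adaptive_join(outer, inner, key_outer, key_inner):
    """Start as a nested loop; switch to hash join if the outer side
    turns out to be bigger than expected."""
    it = iter(outer)
    buffered = []
    for row in it:
        buffered.append(row)
        if len(buffered) > THRESHOLD:
            break
    if len(buffered) <= THRESHOLD:
        # Estimate held: nested loop (scan inner once per outer row).
        return ([(o, i) for o in buffered for i in inner
                 if key_outer(o) == key_inner(i)], "nested loop")
    # Estimate was wrong: build a hash table on inner, then probe with
    # the buffered rows plus the rest of the outer input.
    table = {}
    for i in inner:
        table.setdefault(key_inner(i), []).append(i)
    out = [(o, i)
           for o in chain(buffered, it)
           for i in table.get(key_outer(o), [])]
    return out, "hash"

users = [(uid, f"user{uid}") for uid in range(1_000)]   # 1,000 rows: too many
orders = [(oid, oid % 1_000) for oid in range(50)]
rows, strategy = adaptive_join(users, orders,
                               lambda u: u[0], lambda o: o[1])
print(strategy, len(rows))   # hash 50
```

Real implementations make the switch inside the executor without re-reading the input, but the decision rule is the same: compare actual rows against the estimate at a checkpoint.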

JIT Compilation

Compile query operators into machine code at execution time for CPU-intensive queries.

PostgreSQL (since 11): Uses LLVM to JIT-compile expression evaluation and tuple deforming. Benefit: 10-30% speedup for analytical queries.

Query compilation (HyPer, Umbra, MemSQL/SingleStore): Compile the entire query pipeline into native code. Eliminate iterator overhead. Dramatically faster for in-memory analytics.

Optimizer Hints

When the optimizer makes a bad choice, hints override its decisions.

-- MySQL
SELECT /*+ INDEX(orders idx_customer) */ * FROM orders WHERE customer_id = 42;

-- PostgreSQL: no official hints (extensions like pg_hint_plan exist), but:
SET enable_seqscan = OFF;   -- heavily penalize seq scans (favors indexes; not an absolute ban)
SET enable_hashjoin = OFF;  -- steer the planner toward merge or nested-loop joins

-- Oracle
SELECT /*+ USE_HASH(o) PARALLEL(o, 4) */ * FROM orders o;

Philosophy: PostgreSQL intentionally avoids hints — fix statistics or the query instead. MySQL/Oracle provide hints as a pragmatic escape hatch.

Applications in CS

  • Performance tuning: Understanding the optimizer helps interpret EXPLAIN plans and identify bottlenecks.
  • Query writing: Knowing what the optimizer can/can't do helps write optimizer-friendly SQL.
  • Database development: Building a query optimizer is one of the most complex software engineering tasks.
  • Data engineering: ETL query performance depends on good optimization. Understanding join ordering and pushdown helps design efficient pipelines.