Optimization

Compiler optimization transforms the IR to produce faster, smaller, or more power-efficient code without changing its behavior. The optimizer is typically the largest and most complex part of a compiler.

Optimization Levels

| Level (GCC/Clang) | Meaning | Effect |
|---|---|---|
| -O0 | No optimization | Fast compilation. Debuggable. |
| -O1 | Basic optimizations | Moderate speedup. Reasonable compile time. |
| -O2 | Standard optimizations | Good speedup. Recommended for production. |
| -O3 | Aggressive | Maximum speed. May increase code size. |
| -Os | Size optimization | Minimize binary size. Similar to -O2 but avoids size-increasing transforms. |
| -Oz | Minimum size | Even smaller than -Os. May sacrifice speed. |

Rust: cargo build --release uses optimization level 3 by default (configurable in Cargo.toml).

Local Optimizations

Transformations within a single basic block.

Constant Folding

Evaluate expressions with known constant values at compile time.

// Before                    After
x = 2 + 3                    x = 5
y = x * 4                    y = 20    (if x is constant 5)
z = "hello".len()            z = 5     (computed at compile time)

Constant Propagation

Replace variable uses with their known constant values.

// Before          After
x = 5              x = 5
y = x + 3          y = 8       (x is known to be 5)
z = y * 2          z = 16      (y is known to be 8)

Algebraic Simplification

Apply algebraic identities.

x + 0     → x
x * 1     → x
x * 0     → 0
x - x     → 0
x / x     → 1     (if x ≠ 0)
x * 2     → x << 1  (strength reduction)
x / 4     → x >> 2  (if x is unsigned or non-negative)
x % 8     → x & 7   (if x is unsigned)
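These identities can be spot-checked at runtime; a minimal sketch for unsigned integers (the signed cases need the extra conditions noted above):

```rust
// Spot-check the shift/mask identities for unsigned integers.
// (For signed x, x / 4 == x >> 2 only holds when x is non-negative.)
fn main() {
    for x in [0u32, 1, 7, 8, 9, 100, u32::MAX] {
        assert_eq!(x.wrapping_mul(2), x << 1); // x * 2 == x << 1
        assert_eq!(x / 4, x >> 2);             // x / 4 == x >> 2
        assert_eq!(x % 8, x & 7);              // x % 8 == x & 7
    }
    println!("all identities hold");
}
```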

Dead Code Elimination

Remove instructions whose results are never used.

// Before
x = a + b       // x is never used
y = c + d       // y IS used later
return y

// After
y = c + d
return y

Common Subexpression Elimination (CSE)

Don't recompute the same expression.

// Before              After
t1 = a + b            t1 = a + b
t2 = a + b            t2 = t1        (reuse t1)
t3 = t1 * t2          t3 = t1 * t1

Strength Reduction

Replace expensive operations with cheaper equivalents.

x * 2       → x << 1     (shift instead of multiply)
x * 15      → (x << 4) - x
x / 8       → x >> 3     (for unsigned)
x % 2       → x & 1
x ** 2      → x * x
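The shift-and-subtract trick for `x * 15` can be written out by hand; `mul15` below is an illustrative helper name (compilers apply this transform automatically):

```rust
// Strength reduction: multiply by 15 via shift-and-subtract.
fn mul15(x: u32) -> u32 {
    (x << 4).wrapping_sub(x) // x * 16 - x == x * 15
}

fn main() {
    for x in [0u32, 1, 2, 3, 1000, 123_456] {
        assert_eq!(mul15(x), x.wrapping_mul(15));
    }
}
```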

Global Optimizations

Transformations across multiple basic blocks within a function.

Data Flow Analysis

Compute properties of values at each program point by propagating information through the CFG.

Framework: Define a set of facts, transfer functions (how each instruction changes the facts), and a join function (how facts combine at merge points). Iterate until fixpoint.

| Analysis | Direction | Facts | Use |
|---|---|---|---|
| Reaching definitions | Forward | Which assignments reach each point | Constant propagation |
| Live variables | Backward | Which variables are used later | Dead code elimination |
| Available expressions | Forward | Which expressions are already computed | CSE |
| Very busy expressions | Backward | Which expressions will be computed on all paths | Code hoisting |
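The iterate-until-fixpoint framework can be sketched concretely. Below is a minimal backward live-variable analysis over a hand-built three-block CFG; the block names and the program are illustrative, not from the text:

```rust
use std::collections::{BTreeSet, HashMap};

type Vars = BTreeSet<&'static str>;

struct Block {
    uses: Vars,               // variables read before any write in the block
    defs: Vars,               // variables written in the block
    succs: Vec<&'static str>, // successor blocks in the CFG
}

fn live_variables(cfg: &HashMap<&'static str, Block>) -> HashMap<&'static str, Vars> {
    let mut live_in: HashMap<&'static str, Vars> =
        cfg.keys().map(|&k| (k, Vars::new())).collect();
    // Apply the transfer functions repeatedly until nothing changes (fixpoint).
    loop {
        let mut changed = false;
        for (&name, block) in cfg {
            // Join: live_out = union of successors' live_in.
            let mut out = Vars::new();
            for s in &block.succs {
                out.extend(&live_in[s]);
            }
            // Transfer: live_in = uses ∪ (live_out − defs).
            let mut inn = block.uses.clone();
            inn.extend(out.difference(&block.defs));
            if inn != live_in[name] {
                live_in.insert(name, inn);
                changed = true;
            }
        }
        if !changed {
            return live_in;
        }
    }
}

fn main() {
    let v = |xs: &[&'static str]| xs.iter().copied().collect::<Vars>();
    let mut cfg = HashMap::new();
    // entry: x = ...; y = ...   (branches to body or exit)
    cfg.insert("entry", Block { uses: v(&[]), defs: v(&["x", "y"]), succs: vec!["body", "exit"] });
    // body:  z = x + 1          (loops back to entry)
    cfg.insert("body", Block { uses: v(&["x"]), defs: v(&["z"]), succs: vec!["entry"] });
    // exit:  return y
    cfg.insert("exit", Block { uses: v(&["y"]), defs: v(&[]), succs: vec![] });
    let live = live_variables(&cfg);
    assert_eq!(live["exit"], v(&["y"]));
    assert_eq!(live["body"], v(&["x"])); // y is killed by entry's def, so only x is live into body
    println!("{:?}", live);
}
```

Each analysis in the table above fits this same skeleton; only the direction, the facts, and the transfer/join functions change.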

Global CSE

Extend CSE across basic blocks. If an expression is available (already computed and not invalidated) at a program point, reuse it.

Copy Propagation

After x = y, replace uses of x with y (if y hasn't changed).

// Before          After
x = y              (x = y can be removed)
z = x + 1          z = y + 1

Loop Optimizations

Loops are where programs spend most of their time. Loop optimizations provide the biggest speedups.

Loop-Invariant Code Motion (LICM)

Move computations that produce the same result on every iteration out of the loop.

// Before                  After
for i in 0..n {            let t = a * b;       // hoisted
    x[i] = a * b + i;      for i in 0..n {
}                              x[i] = t + i;
                            }
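The transformation above can be done by hand and checked for equivalence; `fill_before`/`fill_after` are illustrative names:

```rust
// Loop-invariant code motion done manually: `a * b` is computed once.
fn fill_before(a: i64, b: i64, n: usize) -> Vec<i64> {
    let mut x = vec![0; n];
    for i in 0..n {
        x[i] = a * b + i as i64; // a * b recomputed every iteration
    }
    x
}

fn fill_after(a: i64, b: i64, n: usize) -> Vec<i64> {
    let t = a * b; // hoisted: invariant across iterations
    let mut x = vec![0; n];
    for i in 0..n {
        x[i] = t + i as i64;
    }
    x
}

fn main() {
    // Both produce [42, 43, 44, 45, 46].
    assert_eq!(fill_before(6, 7, 5), fill_after(6, 7, 5));
}
```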

Induction Variable Elimination

Replace expensive operations on loop index variables with cheaper increments.

// Before                      After
for i in 0..n {                let mut addr = &x[0] as *const _;
    access x[i];               for _ in 0..n {
    // x[i] = base + i * size      access *addr;
}                                  addr = addr.add(1);
                               }
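The pointer-increment form can be written explicitly in Rust with raw pointers; `sum_indexed`/`sum_ptr` are illustrative names, and in practice the optimizer performs this rewrite itself:

```rust
// Induction-variable elimination by hand: replace `base + i * size`
// addressing with a pointer that is bumped each iteration.
fn sum_indexed(x: &[u64]) -> u64 {
    let mut s = 0;
    for i in 0..x.len() {
        s += x[i]; // address recomputed as base + i * 8 each time
    }
    s
}

fn sum_ptr(x: &[u64]) -> u64 {
    let mut s = 0;
    let mut p = x.as_ptr();
    for _ in 0..x.len() {
        // SAFETY: p stays within the slice for exactly x.len() reads.
        unsafe {
            s += *p;
            p = p.add(1); // cheap increment instead of multiply-and-add
        }
    }
    s
}

fn main() {
    let data = [3u64, 1, 4, 1, 5];
    assert_eq!(sum_indexed(&data), sum_ptr(&data)); // both 14
}
```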

Loop Unrolling

Replicate the loop body to reduce loop overhead (branch, counter increment) and enable more optimization within the unrolled body.

// Before (4 iterations)      After (unrolled 4×)
for i in 0..4 {                x[0] = a;
    x[i] = a;                  x[1] = a;
}                               x[2] = a;
                                x[3] = a;

Partial unrolling: Unroll by factor k, keep a loop for the remaining iterations. Common: unroll 4× or 8×.
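Partial unrolling with a remainder loop looks like this when written by hand; `sum_unrolled` is an illustrative name:

```rust
// Unroll by 4: four additions per loop test/branch, then a scalar
// remainder loop for the leftover iterations.
fn sum_unrolled(x: &[u32]) -> u32 {
    let mut s = 0u32;
    let chunks = x.len() / 4;
    for c in 0..chunks {
        let i = c * 4;
        s = s + x[i] + x[i + 1] + x[i + 2] + x[i + 3];
    }
    for i in chunks * 4..x.len() {
        s += x[i]; // remainder iterations
    }
    s
}

fn main() {
    let data: Vec<u32> = (1..=10).collect();
    assert_eq!(sum_unrolled(&data), 55);
}
```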

Loop Fusion

Merge two loops with the same bounds into one. Reduces loop overhead and improves cache locality.

// Before                      After
for i in 0..n { a[i] = ...; }  for i in 0..n {
for i in 0..n { b[i] = a[i]; }     a[i] = ...;
                                    b[i] = a[i];
                                }
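A runnable version of the fusion above (function names are illustrative); the fused loop touches `a[i]` while it is still in cache or a register:

```rust
// Loop fusion by hand: one traversal instead of two.
fn unfused(n: usize) -> (Vec<usize>, Vec<usize>) {
    let mut a = vec![0; n];
    let mut b = vec![0; n];
    for i in 0..n { a[i] = i * i; }
    for i in 0..n { b[i] = a[i] + 1; }
    (a, b)
}

fn fused(n: usize) -> (Vec<usize>, Vec<usize>) {
    let mut a = vec![0; n];
    let mut b = vec![0; n];
    for i in 0..n {
        a[i] = i * i;    // a[i] is still hot
        b[i] = a[i] + 1; // when b[i] is computed
    }
    (a, b)
}

fn main() {
    assert_eq!(unfused(6), fused(6));
}
```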

Loop Tiling (Blocking)

Process data in tiles that fit in cache. Dramatically improves cache behavior for multi-dimensional data.

// Before (cache-unfriendly for large N)
for i in 0..N {
    for j in 0..N {
        for k in 0..N {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

// After (tiled, cache-friendly)
for ii in (0..N).step_by(TILE) {
    for jj in (0..N).step_by(TILE) {
        for kk in (0..N).step_by(TILE) {
            for i in ii..min(ii + TILE, N) {
                for j in jj..min(jj + TILE, N) {
                    for k in kk..min(kk + TILE, N) {
                        C[i][j] += A[i][k] * B[k][j];
                    }
                }
            }
        }
    }
}
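Tiling only reorders the additions, so both versions compute the same product. A self-contained sketch (N and TILE are illustrative parameters):

```rust
// Tiled vs. naive matrix multiply; both produce identical results.
const N: usize = 8;
const TILE: usize = 4;

fn naive(a: &[[i64; N]; N], b: &[[i64; N]; N]) -> [[i64; N]; N] {
    let mut c = [[0; N]; N];
    for i in 0..N {
        for j in 0..N {
            for k in 0..N {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    c
}

fn tiled(a: &[[i64; N]; N], b: &[[i64; N]; N]) -> [[i64; N]; N] {
    let mut c = [[0; N]; N];
    for ii in (0..N).step_by(TILE) {
        for kk in (0..N).step_by(TILE) {
            for jj in (0..N).step_by(TILE) {
                // Work on one TILE x TILE block at a time so the
                // operands stay resident in cache.
                for i in ii..(ii + TILE).min(N) {
                    for k in kk..(kk + TILE).min(N) {
                        for j in jj..(jj + TILE).min(N) {
                            c[i][j] += a[i][k] * b[k][j];
                        }
                    }
                }
            }
        }
    }
    c
}

fn main() {
    let mut a = [[0i64; N]; N];
    let mut b = [[0i64; N]; N];
    for i in 0..N {
        for j in 0..N {
            a[i][j] = (i + 2 * j) as i64;
            b[i][j] = (3 * i + j) as i64;
        }
    }
    assert_eq!(naive(&a, &b), tiled(&a, &b));
}
```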

Vectorization (Auto-vectorization)

Transform scalar loops into SIMD operations.

// Before (scalar)            After (vectorized with AVX2)
for i in 0..n {                for i in (0..n).step_by(8) {
    a[i] = b[i] + c[i];           _mm256_store(a+i,
}                                      _mm256_add(_mm256_load(b+i),
                                                   _mm256_load(c+i)));
                               }

The compiler detects parallelism in loops and emits SIMD instructions (SSE, AVX, NEON).
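In safe Rust you rarely write intrinsics directly; instead you shape the loop so the auto-vectorizer can prove it safe. A sketch using fixed-width chunks (the function name is illustrative, and whether the compiler actually emits SIMD depends on target and optimization level):

```rust
// A loop shape auto-vectorizers handle well: fixed-width chunks with no
// bounds checks in the hot loop, plus a scalar remainder loop.
fn add_slices(a: &mut [f32], b: &[f32], c: &[f32]) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), c.len());
    let mut ai = a.chunks_exact_mut(8); // 8 lanes mirrors one AVX2 f32 add
    let mut bi = b.chunks_exact(8);
    let mut ci = c.chunks_exact(8);
    for ((x, y), z) in (&mut ai).zip(&mut bi).zip(&mut ci) {
        for l in 0..8 {
            x[l] = y[l] + z[l]; // candidate for a single SIMD add
        }
    }
    // Scalar remainder for lengths not divisible by 8.
    for ((x, y), z) in ai
        .into_remainder()
        .iter_mut()
        .zip(bi.remainder())
        .zip(ci.remainder())
    {
        *x = *y + *z;
    }
}

fn main() {
    let b: Vec<f32> = (0..10).map(|i| i as f32).collect();
    let c: Vec<f32> = (0..10).map(|i| (2 * i) as f32).collect();
    let mut a = vec![0.0; 10];
    add_slices(&mut a, &b, &c);
    assert_eq!(a[9], 27.0); // 9 + 18
}
```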

Interprocedural Optimizations

Transformations across function boundaries.

Inline Expansion (Inlining)

Replace a function call with the function's body.

// Before                      After
fn square(x: i32) -> i32 {     let result = y * y;
    x * x
}
let result = square(y);

Benefits: Eliminates call overhead. Enables further optimizations (constant folding through the inlined body, CSE).

Cost: Increases code size. Excessive inlining → instruction cache misses.

Heuristic: Inline small functions. Inline hot functions (profiling-guided). Don't inline recursive functions. Don't inline cold functions.
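In Rust you can influence these heuristics with attributes, which are requests rather than guarantees:

```rust
#[inline] // hint: consider inlining, including across crate boundaries
fn square(x: i32) -> i32 {
    x * x
}

#[inline(always)] // stronger request; use sparingly
fn double(x: i32) -> i32 {
    x + x
}

fn main() {
    // After inlining plus constant folding, the optimizer can reduce
    // this whole expression to the constant 52 at compile time.
    let result = square(double(3)) + 16;
    assert_eq!(result, 52);
}
```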

Tail Call Optimization

If the last action of a function is a call, reuse the current stack frame (jump instead of call).

// Before (one stack frame per call)
fn factorial(n, acc) {
    if n == 0 { acc }
    else { factorial(n - 1, n * acc) }
}

// After (reuses the current stack frame)
fn factorial(n, acc) {
    loop {
        if n == 0 { return acc; }
        acc = n * acc;
        n = n - 1;
    }
}

Prevents stack overflow for tail-recursive functions.
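Rust does not guarantee tail-call optimization, so the loop form is also the safe one to write by hand; a runnable sketch of both:

```rust
// Tail-recursive form: each call pushes a frame unless the compiler
// happens to optimize the tail call.
fn factorial_rec(n: u64, acc: u64) -> u64 {
    if n == 0 { acc } else { factorial_rec(n - 1, n * acc) }
}

// Loop form: the "after" shape, one stack frame total.
fn factorial_loop(mut n: u64, mut acc: u64) -> u64 {
    loop {
        if n == 0 {
            return acc;
        }
        acc *= n;
        n -= 1;
    }
}

fn main() {
    assert_eq!(factorial_rec(10, 1), 3_628_800);
    assert_eq!(factorial_loop(10, 1), 3_628_800);
}
```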

Escape Analysis

Determine if an object escapes the function that creates it.

  • Doesn't escape: Allocate on the stack (or eliminate entirely). Avoid heap allocation.
  • Escapes: Must allocate on the heap.

FUNCTION SUM_PAIR(a, b)
    pair ← (a, b)      // pair doesn't escape -> stack-allocated (or eliminated)
    RETURN pair.first + pair.second

JVM uses escape analysis to avoid heap allocation for non-escaping objects (scalar replacement).

Devirtualization

Replace virtual (dynamic dispatch) calls with direct calls when the concrete type is known.

// Before (dynamic dispatch via vtable)
FUNCTION PROCESS(shape: Shape)
    shape.AREA()

// After (if the compiler proves shape is always Circle)
FUNCTION PROCESS(shape: Circle)
    Circle.AREA(shape)   // direct call, can be inlined

Enables further optimizations (inlining, constant folding) that are impossible through a vtable.
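Rust exposes both dispatch forms directly: `dyn Trait` goes through a vtable, while a generic bound is monomorphized into a direct, inlinable call. A minimal sketch:

```rust
trait Shape {
    fn area(&self) -> f64;
}

struct Circle { r: f64 }

impl Shape for Circle {
    fn area(&self) -> f64 {
        std::f64::consts::PI * self.r * self.r
    }
}

fn process_dyn(shape: &dyn Shape) -> f64 {
    shape.area() // vtable lookup at runtime
}

fn process_static<S: Shape>(shape: &S) -> f64 {
    shape.area() // monomorphized: direct call, inlinable
}

fn main() {
    let c = Circle { r: 2.0 };
    assert_eq!(process_dyn(&c), process_static(&c));
}
```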

Polyhedral Model

A mathematical framework for analyzing and optimizing nested loops over arrays. Models loop iterations as integer points in a polyhedron.

Enables: Automatic parallelization, tiling, fusion, fission, skewing, interchange.

Tools: Polly (LLVM polyhedral optimizer), PLUTO.

Applications in CS

  • Performance: Understanding compiler optimizations helps write code that the compiler can optimize well (avoid aliasing, use simple loops, prefer stack allocation).
  • Benchmarking: Ensure the compiler doesn't optimize away the code you're benchmarking. Use black_box() in Rust.
  • Debugging optimized code: Optimized code may not correspond to source line-by-line. Understanding transformations helps debug optimized binaries.
  • Language design: Language features affect optimizability. Rust's ownership enables optimizations impossible in C (no aliasing between &mut references).