Optimization

Compiler optimization transforms the IR to produce faster, smaller, or more power-efficient code without changing its behavior. The optimizer is typically the largest and most complex part of a compiler.

Optimization Levels

Level	GCC/Clang	Effect
-O0	No optimization	Fast compilation. Debuggable.
-O1	Basic optimizations	Moderate speedup. Reasonable compile time.
-O2	Standard optimizations	Good speedup. Recommended for production.
-O3	Aggressive	Maximum speed. May increase code size.
-Os	Size optimization	Minimize binary size. Similar to -O2 but avoids size-increasing transforms.
-Oz	Minimum size	Even smaller than -Os. May sacrifice speed.

Rust: cargo build --release uses optimization level 3 by default (configurable in Cargo.toml).

Local Optimizations

Transformations within a single basic block.

Constant Folding

Evaluate expressions with known constant values at compile time.

// Before                   After
x = 2 + 3                  x = 5
y = x * 4                  y = 20    (if x is constant 5)
z = "hello".len()           z = 5     (computed at compile time)

Constant Propagation

Replace variable uses with their known constant values.

// Before          After
x = 5              x = 5
y = x + 3          y = 8       (x is known to be 5)
z = y * 2          z = 16      (y is known to be 8)

Algebraic Simplification

Apply algebraic identities.

x + 0     → x
x * 1     → x
x * 0     → 0
x - x     → 0
x / x     → 1     (if x ≠ 0)
x * 2     → x << 1  (strength reduction)
x / 4     → x >> 2  (if x is unsigned or non-negative)
x % 8     → x & 7   (if x is unsigned)

Dead Code Elimination

Remove instructions whose results are never used.

// Before
x = a + b       // x is never used
y = c + d       // y IS used later
return y

// After
y = c + d
return y

Common Subexpression Elimination (CSE)

Don't recompute the same expression.

// Before              After
t1 = a + b            t1 = a + b
t2 = a + b            t2 = t1        (reuse t1)
t3 = t1 * t2          t3 = t1 * t1

Strength Reduction

Replace expensive operations with cheaper equivalents.

x * 2       → x << 1     (shift instead of multiply)
x * 15      → (x << 4) - x
x / 8       → x >> 3     (for unsigned)
x % 2       → x & 1
x ** 2      → x * x

Global Optimizations

Transformations across multiple basic blocks within a function.

Data Flow Analysis

Compute properties of values at each program point by propagating information through the CFG.

Framework: Define a set of facts, transfer functions (how each instruction changes the facts), and a join function (how facts combine at merge points). Iterate until fixpoint.

Analysis	Direction	Facts	Use
Reaching definitions	Forward	Which assignments reach each point	Constant propagation
Live variables	Backward	Which variables are used later	Dead code elimination
Available expressions	Forward	Which expressions are already computed	CSE
Very busy expressions	Backward	Which expressions will be computed on all paths	Code hoisting

Global CSE

Extend CSE across basic blocks. If an expression is available (already computed and not invalidated) at a program point, reuse it.

Copy Propagation

After x = y, replace uses of x with y (if y hasn't changed).

// Before          After
x = y              (x = y can be removed)
z = x + 1          z = y + 1

Loop Optimizations

Loops are where programs spend most of their time. Loop optimizations provide the biggest speedups.

Loop-Invariant Code Motion (LICM)

Move computations that produce the same result on every iteration out of the loop.

// Before                  After
for i in 0..n {            let t = a * b;       // hoisted
    x[i] = a * b + i;      for i in 0..n {
}                              x[i] = t + i;
                            }

Induction Variable Elimination

Replace expensive operations on loop index variables with cheaper increments.

// Before                      After
for i in 0..n {                let mut addr = &x[0] as *const _;
    access x[i];               for _ in 0..n {
    // x[i] = base + i * size      access *addr;
}                                  addr = addr.add(1);
                               }

Loop Unrolling

Replicate the loop body to reduce loop overhead (branch, counter increment) and enable more optimization within the unrolled body.

// Before (4 iterations)      After (unrolled 4×)
for i in 0..4 {                x[0] = a;
    x[i] = a;                  x[1] = a;
}                               x[2] = a;
                                x[3] = a;

Partial unrolling: Unroll by factor k, keep a loop for the remaining iterations. Common: unroll 4× or 8×.

Loop Fusion

Merge two loops with the same bounds into one. Reduces loop overhead and improves cache locality.

// Before                      After
for i in 0..n { a[i] = ...; }  for i in 0..n {
for i in 0..n { b[i] = a[i]; }     a[i] = ...;
                                    b[i] = a[i];
                                }

Loop Tiling (Blocking)

Process data in tiles that fit in cache. Dramatically improves cache behavior for multi-dimensional data.

// Before (cache-unfriendly for large N)
for i in 0..N {
    for j in 0..N {
        C[i][j] += A[i][k] * B[k][j];
    }
}

// After (tiled, cache-friendly)
for ii in (0..N).step_by(TILE) {
    for jj in (0..N).step_by(TILE) {
        for i in ii..min(ii+TILE, N) {
            for j in jj..min(jj+TILE, N) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

Vectorization (Auto-vectorization)

Transform scalar loops into SIMD operations.

// Before (scalar)            After (vectorized with AVX2)
for i in 0..n {                for i in (0..n).step_by(8) {
    a[i] = b[i] + c[i];           _mm256_store(a+i,
}                                      _mm256_add(_mm256_load(b+i),
                                                   _mm256_load(c+i)));
                               }

The compiler detects parallelism in loops and emits SIMD instructions (SSE, AVX, NEON).

Interprocedural Optimizations

Transformations across function boundaries.

Inline Expansion (Inlining)

Replace a function call with the function's body.

// Before                After
fn square(x: i32) -> i32 {   let result = x * x;
    x * x
}
let result = square(y);

Benefits: Eliminates call overhead. Enables further optimizations (constant folding through the inlined body, CSE).

Cost: Increases code size. Excessive inlining → instruction cache misses.

Heuristic: Inline small functions. Inline hot functions (profiling-guided). Don't inline recursive functions. Don't inline cold functions.

Tail Call Optimization

If the last action of a function is a call, reuse the current stack frame (jump instead of call).

// Before (uses stack)       After (reuses stack frame)
fn factorial(n, acc) {        fn factorial(n, acc) {
    if n == 0 { acc }             loop {
    else { factorial(n-1, n*acc) } if n == 0 { return acc; }
}                                  acc = n * acc;
                                   n = n - 1;
                                 }
                              }

Prevents stack overflow for tail-recursive functions.

Escape Analysis

Determine if an object escapes the function that creates it.

Doesn't escape: Allocate on the stack (or eliminate entirely). Avoid heap allocation.
Escapes: Must allocate on the heap.

FUNCTION SUM_PAIR(a, b)
    pair ← (a, b)      // pair doesn't escape -> stack allocated (or eliminated)
    RETURN pair.first + pair.second

JVM uses escape analysis to avoid heap allocation for non-escaping objects (scalar replacement).

Devirtualization

Replace virtual (dynamic dispatch) calls with direct calls when the concrete type is known.

// Before (dynamic dispatch via vtable)
FUNCTION PROCESS(shape: Shape)  shape.AREA()

// After (if compiler proves shape is always Circle)
FUNCTION PROCESS(shape: Circle)  Circle.AREA(shape)   // direct call, can be inlined

Enables further optimizations (inlining, constant folding) that are impossible through a vtable.

Polyhedral Model

A mathematical framework for analyzing and optimizing nested loops over arrays. Models loop iterations as integer points in a polyhedron.

Enables: Automatic parallelization, tiling, fusion, fission, skewing, interchange.

Tools: Polly (LLVM polyhedral optimizer), PLUTO.

Applications in CS

Performance: Understanding compiler optimizations helps write code that the compiler can optimize well (avoid aliasing, use simple loops, prefer stack allocation).
Benchmarking: Ensure the compiler doesn't optimize away the code you're benchmarking. Use black_box() in Rust.
Debugging optimized code: Optimized code may not correspond to source line-by-line. Understanding transformations helps debug optimized binaries.
Language design: Language features affect optimizability. Rust's ownership enables optimizations impossible in C (no aliasing between &mut references).