Optimization
Compiler optimization transforms the IR to produce faster, smaller, or more power-efficient code without changing its behavior. The optimizer is typically the largest and most complex part of a compiler.
Optimization Levels
GCC/Clang optimization levels:

| Flag | Meaning | Effect |
|---|---|---|
| -O0 | No optimization | Fast compilation. Debuggable. |
| -O1 | Basic optimizations | Moderate speedup. Reasonable compile time. |
| -O2 | Standard optimizations | Good speedup. Recommended for production. |
| -O3 | Aggressive | Maximum speed. May increase code size. |
| -Os | Size optimization | Minimize binary size. Similar to -O2 but avoids size-increasing transforms. |
| -Oz | Minimum size | Even smaller than -Os. May sacrifice speed. |
Rust: `cargo build --release` uses optimization level 3 by default (configurable in `Cargo.toml`).
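For example, the level is set per profile in `Cargo.toml`; the values below are Cargo's documented defaults:

```toml
[profile.release]
opt-level = 3      # 0-3, or "s"/"z" for size; 3 is the release default

[profile.dev]
opt-level = 0      # debug default: fast compiles, easy debugging
```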
Local Optimizations
Transformations within a single basic block.
Constant Folding
Evaluate expressions with known constant values at compile time.
```
// Before
x = 2 + 3
y = x * 4
z = "hello".len()

// After
x = 5
y = 20    // if x is constant 5
z = 5     // computed at compile time
```
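As a minimal sketch of how a compiler might implement this, here is constant folding over a toy expression AST. The `Expr` type and `fold` function are invented for illustration, not taken from any real compiler:

```rust
// A toy expression AST; constants, variables, and two operators.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Const(i64),
    Var(String),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// Recursively fold subtrees; collapse an operator node whenever
/// both operands folded down to constants.
fn fold(e: Expr) -> Expr {
    use Expr::*;
    match e {
        Add(a, b) => match (fold(*a), fold(*b)) {
            (Const(x), Const(y)) => Const(x + y),
            (a, b) => Add(Box::new(a), Box::new(b)),
        },
        Mul(a, b) => match (fold(*a), fold(*b)) {
            (Const(x), Const(y)) => Const(x * y),
            (a, b) => Mul(Box::new(a), Box::new(b)),
        },
        other => other, // constants and variables fold to themselves
    }
}

fn main() {
    // (2 + 3) * 4 collapses to 20 before any code is emitted.
    let e = Expr::Mul(
        Box::new(Expr::Add(Box::new(Expr::Const(2)), Box::new(Expr::Const(3)))),
        Box::new(Expr::Const(4)),
    );
    assert_eq!(fold(e), Expr::Const(20));
}
```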
Constant Propagation
Replace variable uses with their known constant values.
```
// Before
x = 5
y = x + 3
z = y * 2

// After
x = 5
y = 8     // x is known to be 5
z = 16    // y is known to be 8
```
Algebraic Simplification
Apply algebraic identities.
```
x + 0  →  x
x * 1  →  x
x * 0  →  0
x - x  →  0
x / x  →  1         (if x ≠ 0)
x * 2  →  x << 1    (strength reduction)
x / 4  →  x >> 2    (if x is unsigned or non-negative)
x % 8  →  x & 7     (if x is unsigned)
```
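The shift and mask forms are easy to spot-check. A small sketch verifying them on unsigned values (the unsigned restriction matters: for signed x, `x >> 2` rounds toward negative infinity while `x / 4` rounds toward zero, so they disagree on negatives):

```rust
fn main() {
    // Spot-check the shift/mask identities on a few unsigned values.
    for x in [0u32, 1, 7, 8, 123, 4096, u32::MAX / 2] {
        assert_eq!(x * 2, x << 1);
        assert_eq!(x / 4, x >> 2);
        assert_eq!(x % 8, x & 7);
    }
}
```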
Dead Code Elimination
Remove instructions whose results are never used.
```
// Before
x = a + b    // x is never used
y = c + d    // y IS used later
return y

// After
y = c + d
return y
```
Common Subexpression Elimination (CSE)
Don't recompute the same expression.
```
// Before
t1 = a + b
t2 = a + b
t3 = t1 * t2

// After
t1 = a + b
t2 = t1         // reuse t1
t3 = t1 * t1
```
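A local CSE pass can be sketched as one forward scan over the block, remembering which expressions have already been computed. The `Instr` type below is a toy three-address form invented for illustration; a real pass would also invalidate remembered expressions when an operand is reassigned:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Instr {
    Add(String, String, String), // dest = a + b
    Copy(String, String),        // dest = src
}

/// Replace a repeated `a + b` with a copy of the first result.
/// Simplification: assumes operands are never reassigned in the block.
fn local_cse(block: &[Instr]) -> Vec<Instr> {
    let mut seen: HashMap<(String, String), String> = HashMap::new();
    let mut out = Vec::new();
    for i in block {
        match i {
            Instr::Add(dest, a, b) => {
                let key = (a.clone(), b.clone());
                if let Some(prev) = seen.get(&key) {
                    out.push(Instr::Copy(dest.clone(), prev.clone()));
                } else {
                    seen.insert(key, dest.clone());
                    out.push(i.clone());
                }
            }
            other => out.push(other.clone()),
        }
    }
    out
}

fn main() {
    let block = vec![
        Instr::Add("t1".into(), "a".into(), "b".into()),
        Instr::Add("t2".into(), "a".into(), "b".into()),
    ];
    let out = local_cse(&block);
    // the second a + b becomes t2 = t1
    assert_eq!(out[1], Instr::Copy("t2".into(), "t1".into()));
}
```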
Strength Reduction
Replace expensive operations with cheaper equivalents.
```
x * 2   →  x << 1         (shift instead of multiply)
x * 15  →  (x << 4) - x
x / 8   →  x >> 3         (for unsigned)
x % 2   →  x & 1
x ** 2  →  x * x
```
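The multiply-by-constant case generalizes: any `x * c` can be decomposed into one shift-and-add per set bit of `c` (compilers also use subtraction, as in the `x * 15` rewrite above). A sketch with a hypothetical helper name:

```rust
/// Hypothetical helper: compute x * c using only shifts and adds,
/// one term per set bit of the constant c.
fn mul_by_shifts(x: u64, c: u64) -> u64 {
    let mut acc = 0u64;
    let (mut c, mut shift) = (c, 0);
    while c != 0 {
        if c & 1 == 1 {
            acc = acc.wrapping_add(x << shift);
        }
        c >>= 1;
        shift += 1;
    }
    acc
}

fn main() {
    assert_eq!(mul_by_shifts(7, 15), 7 * 15);
}
```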
Global Optimizations
Transformations across multiple basic blocks within a function.
Data Flow Analysis
Compute properties of values at each program point by propagating information through the CFG.
Framework: Define a set of facts, transfer functions (how each instruction changes the facts), and a join function (how facts combine at merge points). Iterate until fixpoint.
| Analysis | Direction | Facts | Use |
|---|---|---|---|
| Reaching definitions | Forward | Which assignments reach each point | Constant propagation |
| Live variables | Backward | Which variables are used later | Dead code elimination |
| Available expressions | Forward | Which expressions are already computed | CSE |
| Very busy expressions | Backward | Which expressions will be computed on all paths | Code hoisting |
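A sketch of the framework instantiated for live-variable analysis: backward direction, join is set union, and the transfer function is in[B] = uses[B] ∪ (out[B] − defs[B]), iterated to a fixpoint. The `Block` representation is deliberately simplified (all uses are assumed to read values defined before the block):

```rust
use std::collections::HashSet;

// Simplified CFG node; succs are indices into the block list.
struct Block {
    defs: Vec<&'static str>,
    uses: Vec<&'static str>,
    succs: Vec<usize>,
}

/// Backward liveness, iterated until nothing changes:
///   out[B] = union of in[S] over successors S
///   in[B]  = uses[B] ∪ (out[B] − defs[B])
fn liveness(blocks: &[Block]) -> Vec<HashSet<&'static str>> {
    let n = blocks.len();
    let mut live_in: Vec<HashSet<&'static str>> = vec![HashSet::new(); n];
    let mut live_out: Vec<HashSet<&'static str>> = vec![HashSet::new(); n];
    let mut changed = true;
    while changed {
        changed = false;
        for i in (0..n).rev() {
            let mut out: HashSet<&'static str> = HashSet::new();
            for &s in &blocks[i].succs {
                out.extend(live_in[s].iter().copied());
            }
            let mut inn: HashSet<&'static str> =
                blocks[i].uses.iter().copied().collect();
            inn.extend(out.iter().copied().filter(|v| !blocks[i].defs.contains(v)));
            if inn != live_in[i] || out != live_out[i] {
                changed = true;
                live_in[i] = inn;
                live_out[i] = out;
            }
        }
    }
    live_in
}

fn main() {
    // Block 0 defines x, then falls through to block 1, which uses x and y.
    let blocks = vec![
        Block { defs: vec!["x"], uses: vec![], succs: vec![1] },
        Block { defs: vec![], uses: vec!["x", "y"], succs: vec![] },
    ];
    let live_in = liveness(&blocks);
    assert!(!live_in[0].contains("x")); // x is defined locally: dead on entry
    assert!(live_in[0].contains("y"));  // y must already be live on entry
}
```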
Global CSE
Extend CSE across basic blocks. If an expression is available (already computed and not invalidated) at a program point, reuse it.
Copy Propagation
After x = y, replace uses of x with y (if y hasn't changed).
```
// Before
x = y
z = x + 1

// After
z = y + 1    // x = y can now be removed
```
Loop Optimizations
Loops are where programs spend most of their time. Loop optimizations provide the biggest speedups.
Loop-Invariant Code Motion (LICM)
Move computations that produce the same result on every iteration out of the loop.
```
// Before
for i in 0..n {
    x[i] = a * b + i;
}

// After
let t = a * b;    // hoisted
for i in 0..n {
    x[i] = t + i;
}
```
Induction Variable Elimination
Replace expensive operations on loop index variables with cheaper increments.
```
// Before
for i in 0..n {
    access x[i];    // x[i] = base + i * size
}

// After
let mut addr = &x[0] as *const _;
for _ in 0..n {
    access *addr;
    addr = addr.add(1);
}
```
Loop Unrolling
Replicate the loop body to reduce loop overhead (branch, counter increment) and enable more optimization within the unrolled body.
```
// Before (4 iterations)
for i in 0..4 {
    x[i] = a;
}

// After (unrolled 4×)
x[0] = a;
x[1] = a;
x[2] = a;
x[3] = a;
```
Partial unrolling: Unroll by factor k, keep a loop for the remaining iterations. Common: unroll 4× or 8×.
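Partial unrolling can be written by hand in Rust with `chunks_exact`, whose `remainder()` gives the leftover-iterations loop for free. `sum_unrolled` is an illustrative name:

```rust
/// Illustrative 4× unrolled sum: four accumulations per loop
/// iteration, then a scalar loop for the < 4 leftover elements.
fn sum_unrolled(x: &[i64]) -> i64 {
    let chunks = x.chunks_exact(4);
    let rem = chunks.remainder(); // leftover elements, handled after
    let mut acc = 0i64;
    for c in chunks {
        // one branch and one counter update per four elements
        acc += c[0] + c[1] + c[2] + c[3];
    }
    for &v in rem {
        acc += v;
    }
    acc
}

fn main() {
    let v: Vec<i64> = (1..=10).collect();
    assert_eq!(sum_unrolled(&v), 55);
}
```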
Loop Fusion
Merge two loops with the same bounds into one. Reduces loop overhead and improves cache locality.
```
// Before
for i in 0..n { a[i] = ...; }
for i in 0..n { b[i] = a[i]; }

// After
for i in 0..n {
    a[i] = ...;
    b[i] = a[i];
}
```
Loop Tiling (Blocking)
Process data in tiles that fit in cache. Dramatically improves cache behavior for multi-dimensional data.
```
// Before (cache-unfriendly for large N)
for i in 0..N {
    for j in 0..N {
        for k in 0..N {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

// After (tiled, cache-friendly)
for ii in (0..N).step_by(TILE) {
    for jj in (0..N).step_by(TILE) {
        for i in ii..min(ii+TILE, N) {
            for j in jj..min(jj+TILE, N) {
                for k in 0..N {
                    C[i][j] += A[i][k] * B[k][j];
                }
            }
        }
    }
}
```
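A runnable version of the tiled multiply over flat row-major arrays, with the k dimension tiled as well so each TILE×TILE sub-block stays cache-resident while it is reused. `N` and `TILE` are arbitrary illustrative sizes; real tile sizes are tuned to the cache:

```rust
const N: usize = 64;
const TILE: usize = 16;

/// Tiled multiply of flat row-major N×N matrices: c += a * b.
fn matmul_tiled(a: &[f64], b: &[f64], c: &mut [f64]) {
    for ii in (0..N).step_by(TILE) {
        for kk in (0..N).step_by(TILE) {
            for jj in (0..N).step_by(TILE) {
                for i in ii..(ii + TILE).min(N) {
                    for k in kk..(kk + TILE).min(N) {
                        let aik = a[i * N + k]; // hoisted: invariant in j
                        for j in jj..(jj + TILE).min(N) {
                            c[i * N + j] += aik * b[k * N + j];
                        }
                    }
                }
            }
        }
    }
}

fn main() {
    // Multiplying by the identity must reproduce the input exactly.
    let a: Vec<f64> = (0..N * N).map(|i| i as f64).collect();
    let mut id = vec![0.0; N * N];
    for i in 0..N {
        id[i * N + i] = 1.0;
    }
    let mut c = vec![0.0; N * N];
    matmul_tiled(&a, &id, &mut c);
    assert_eq!(c, a);
}
```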
Vectorization (Auto-vectorization)
Transform scalar loops into SIMD operations.
```
// Before (scalar)
for i in 0..n {
    a[i] = b[i] + c[i];
}

// After (vectorized with AVX2; intrinsic names are pseudocode)
for i in (0..n).step_by(8) {
    _mm256_store(a+i, _mm256_add(_mm256_load(b+i), _mm256_load(c+i)));
}
```
The compiler detects parallelism in loops and emits SIMD instructions (SSE, AVX, NEON).
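In Rust, writing the loop over zipped slices lets the iterator prove the lengths match, so bounds checks disappear and LLVM gets a loop it can usually auto-vectorize without any intrinsics. Whether SIMD is actually emitted depends on the optimization level and target features (e.g. `-C target-cpu=native`):

```rust
/// Elementwise a[i] = b[i] + c[i], written to be auto-vectorizable.
/// zip stops at the shortest slice, so equal lengths are assumed.
fn add_arrays(a: &mut [f32], b: &[f32], c: &[f32]) {
    for ((ai, bi), ci) in a.iter_mut().zip(b).zip(c) {
        *ai = bi + ci;
    }
}

fn main() {
    let mut a = [0.0f32; 4];
    add_arrays(&mut a, &[1.0, 2.0, 3.0, 4.0], &[10.0, 20.0, 30.0, 40.0]);
    assert_eq!(a, [11.0, 22.0, 33.0, 44.0]);
}
```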
Interprocedural Optimizations
Transformations across function boundaries.
Inline Expansion (Inlining)
Replace a function call with the function's body.
```
// Before
fn square(x: i32) -> i32 {
    x * x
}
let result = square(y);

// After (call replaced by the body)
let result = y * y;
```
Benefits: Eliminates call overhead. Enables further optimizations (constant folding through the inlined body, CSE).
Cost: Increases code size. Excessive inlining → instruction cache misses.
Heuristic: Inline small functions. Inline hot functions (profiling-guided). Don't inline recursive functions. Don't inline cold functions.
Tail Call Optimization
If the last action of a function is a call, reuse the current stack frame (jump instead of call).
```
// Before (each call uses a new stack frame)
fn factorial(n, acc) {
    if n == 0 { acc }
    else { factorial(n-1, n*acc) }
}

// After (reuses the stack frame)
fn factorial(n, acc) {
    loop {
        if n == 0 { return acc; }
        acc = n * acc;
        n = n - 1;
    }
}
```
Prevents stack overflow for tail-recursive functions.
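The rewrite can be checked by hand: the recursive and loop forms below compute the same result. Note that Rust, unlike Scheme, does not guarantee tail-call optimization; LLVM applies it opportunistically:

```rust
// Tail-recursive form: the recursive call is the last action.
fn fact_rec(n: u64, acc: u64) -> u64 {
    if n == 0 { acc } else { fact_rec(n - 1, n * acc) }
}

// The loop the optimizer can turn it into: same state updates,
// one reused stack frame.
fn fact_loop(mut n: u64, mut acc: u64) -> u64 {
    loop {
        if n == 0 {
            return acc;
        }
        acc *= n;
        n -= 1;
    }
}

fn main() {
    assert_eq!(fact_rec(10, 1), 3_628_800);
    assert_eq!(fact_loop(10, 1), fact_rec(10, 1));
}
```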
Escape Analysis
Determine if an object escapes the function that creates it.
- Doesn't escape: Allocate on the stack (or eliminate entirely). Avoid heap allocation.
- Escapes: Must allocate on the heap.
```
FUNCTION SUM_PAIR(a, b)
    pair ← (a, b)    // pair doesn't escape -> stack allocated (or eliminated)
    RETURN pair.first + pair.second
```
JVM uses escape analysis to avoid heap allocation for non-escaping objects (scalar replacement).
Devirtualization
Replace virtual (dynamic dispatch) calls with direct calls when the concrete type is known.
```
// Before (dynamic dispatch via vtable)
FUNCTION PROCESS(shape: Shape)
    shape.AREA()

// After (if compiler proves shape is always Circle)
FUNCTION PROCESS(shape: Circle)
    Circle.AREA(shape)    // direct call, can be inlined
```
Enables further optimizations (inlining, constant folding) that are impossible through a vtable.
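Rust exposes both sides of this trade-off directly: a `&dyn Trait` call goes through a vtable, while a generic function is monomorphized into direct, inlinable calls. The `Shape`/`Circle` types below mirror the pseudocode above:

```rust
trait Shape {
    fn area(&self) -> f64;
}

struct Circle {
    r: f64,
}

impl Shape for Circle {
    fn area(&self) -> f64 {
        std::f64::consts::PI * self.r * self.r
    }
}

// Dynamic dispatch: the call goes through a vtable and cannot be
// inlined unless the compiler devirtualizes it.
fn area_dyn(s: &dyn Shape) -> f64 {
    s.area()
}

// Monomorphized generic: one direct call per concrete type, inlinable.
fn area_static<S: Shape>(s: &S) -> f64 {
    s.area()
}

fn main() {
    let c = Circle { r: 2.0 };
    assert!((area_dyn(&c) - area_static(&c)).abs() < 1e-12);
}
```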
Polyhedral Model
A mathematical framework for analyzing and optimizing nested loops over arrays. Models loop iterations as integer points in a polyhedron.
Enables: Automatic parallelization, tiling, fusion, fission, skewing, interchange.
Tools: Polly (LLVM polyhedral optimizer), PLUTO.
Applications in CS
- Performance: Understanding compiler optimizations helps write code that the compiler can optimize well (avoid aliasing, use simple loops, prefer stack allocation).
- Benchmarking: Ensure the compiler doesn't optimize away the code you're benchmarking. Use `black_box()` in Rust.
- Debugging optimized code: Optimized code may not correspond to source line-by-line. Understanding transformations helps debug optimized binaries.
- Language design: Language features affect optimizability. Rust's ownership enables optimizations impossible in C (no aliasing between &mut references).