Advanced Compiler Topics

This file covers advanced optimization and compilation techniques used in production compilers.

Profile-Guided Optimization (PGO)

Use runtime profiling data to guide optimization decisions.

Workflow

1. Compile with instrumentation:   RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release
2. Run a representative workload:  ./target/release/myapp < workload.txt
3. Merge the raw profiles:         llvm-profdata merge -o merged.profdata /tmp/pgo-data
4. Recompile with the profile:     RUSTFLAGS="-Cprofile-use=merged.profdata" cargo build --release

What PGO Optimizes

Branch prediction hints: Mark likely/unlikely branches based on actual execution frequencies.

Function inlining: Inline hot functions more aggressively. Don't inline cold functions.

Block placement: Arrange basic blocks so the hot path is fall-through (no taken branches). Move cold code (error handling) to the end.

Function placement: Place frequently co-called functions near each other in the binary (improves I-cache locality).

Indirect call devirtualization: If a virtual call site almost always calls the same function, specialize for that target.

Loop unrolling: Unroll loops with high trip counts. Don't unroll loops that rarely execute.
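Several of these decisions correspond to hints a programmer can also write by hand. A minimal Rust sketch of manual hot/cold annotations that PGO would otherwise derive automatically from profile data (the function names are illustrative):

```rust
// PGO infers hotness from the profile; these attributes express it manually.
#[cold]          // hint: rarely executed, so keep it off the hot path
#[inline(never)] // don't inline cold error handling into callers
fn handle_error(code: i32) -> i32 {
    eprintln!("error: {code}");
    -1
}

#[inline] // hot helper: inline aggressively
fn fast_path(x: i32) -> i32 {
    x.wrapping_mul(2)
}

fn process(x: i32) -> i32 {
    // With PGO, the compiler learns which branch dominates and lays out
    // the hot successor as the fall-through block.
    if x >= 0 { fast_path(x) } else { handle_error(x) }
}

fn main() {
    println!("{}", process(21)); // prints 42
}
```

The advantage of PGO is that these decisions come from measured frequencies rather than guesses, and they adapt when the workload changes.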

Typical improvement: 10-30% for CPU-bound workloads. Up to 50% for branch-heavy code.

Used by: Chromium, Firefox, Clang itself, Linux kernel (since 6.x for x86).

Link-Time Optimization (LTO)

Optimize across translation unit (file) boundaries at link time.

How It Works

  1. Each source file is compiled to an IR (LLVM bitcode or GCC GIMPLE) instead of machine code.
  2. At link time, the linker invokes the optimizer on the combined IR of all files.
  3. Cross-module optimizations are applied (inlining across files, interprocedural analysis, dead code elimination).

# Cargo.toml
[profile.release]
lto = true       # "fat" LTO — all code optimized together
# lto = "thin"   # "thin" LTO — parallel, faster compilation

Fat LTO vs Thin LTO

Fat LTO: Merge all bitcode into one module. Full optimization. Very slow compilation (single-threaded). Best code quality.

Thin LTO (LLVM): Index all modules. Only inline/optimize across module boundaries where profitable. Parallel compilation. 90%+ of fat LTO's benefit at a fraction of the compile time. In Cargo, lto = "thin" selects it explicitly; lto = true requests fat LTO, and the default release profile runs "thin local" LTO across the crate's own codegen units only.

Typical improvement: 5-20% over non-LTO builds. Mainly from cross-module inlining and dead code elimination.

Auto-Vectorization

The compiler automatically transforms scalar code into SIMD (vector) instructions.

What Gets Vectorized

// Simple loop — easily vectorized
PROCEDURE ADD_ARRAYS(a, b, c)
    FOR i ← 0 TO length(a) - 1
        c[i] ← a[i] + b[i]
// Compiler generates: SIMD add (e.g., AVX2: 8 floats at once)

What Blocks Vectorization

  • Loop-carried dependencies: a[i] = a[i-1] + 1 (each iteration depends on the previous).
  • Function calls: Unknown side effects prevent vectorization (unless the function is inlined).
  • Complex control flow: Branches inside the loop reduce vectorization effectiveness.
  • Aliasing: Compiler can't prove arrays don't overlap. Use restrict (C) or Rust's borrow rules (which guarantee no aliasing of &mut references).

Rust advantage: &mut guarantees no aliasing → the compiler can vectorize more aggressively than C (where aliasing is possible unless annotated with restrict).
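As a concrete sketch of the pseudocode above in safe Rust (the function name is illustrative): the lock-step iterator form avoids per-element bounds checks, which helps LLVM's loop vectorizer.

```rust
// Element-wise add. The zip iterators prove the accesses are in bounds
// and non-overlapping, so LLVM can emit SIMD adds (e.g. AVX2: 8 f32 at once).
fn add_arrays(a: &[f32], b: &[f32], c: &mut [f32]) {
    for ((ci, ai), bi) in c.iter_mut().zip(a).zip(b) {
        *ci = ai + bi;
    }
}

fn main() {
    let a = [1.0_f32, 2.0, 3.0, 4.0];
    let b = [10.0_f32, 20.0, 30.0, 40.0];
    let mut c = [0.0_f32; 4];
    add_arrays(&a, &b, &mut c);
    println!("{c:?}"); // [11.0, 22.0, 33.0, 44.0]
}
```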

Checking Vectorization

# LLVM: emit optimization remarks (debuginfo gives usable source locations)
RUSTFLAGS="-Cremark=loop-vectorize -Cdebuginfo=1" cargo build --release

# Inspect assembly
cargo asm mylib::add_arrays  # using cargo-show-asm

Just-In-Time Compilation

Compile code during execution based on runtime information.

JIT Architecture

Source → Bytecode (AOT)
    ↓
Interpreter (start executing immediately)
    ↓ (hot method detected)
Baseline JIT (quick compilation, basic optimizations)
    ↓ (still hot)
Optimizing JIT (slow compilation, aggressive optimizations)
    ↓ (assumption violated)
Deoptimization (fall back to interpreter or baseline)

Tiered Compilation (JVM HotSpot)

| Tier | Mode | Compile Speed | Code Quality |
|---|---|---|---|
| 0 | Interpreter | N/A | Slow execution |
| 1 | C1, no profiling | Fast | Moderate |
| 2 | C1, limited profiling | Fast | Moderate |
| 3 | C1, full profiling | Fast | Moderate + data collection |
| 4 | C2 (optimizing) | Slow | Best |

Typical path: 0 → 3 → 4 (interpret → profile → optimize).

Speculative Optimization

JIT compilers can make optimistic assumptions based on runtime observations:

  • Type specialization: "This variable is always an integer" → generate integer-only code.
  • Inline caching: "This virtual call always calls Circle::area()" → inline it.
  • Branch profiling: "This branch is taken 99.9% of the time" → optimize for the common case.

Deoptimization

If a speculative assumption is violated:

  1. Detect the violation (type guard fails, unexpected class loaded).
  2. Stop executing the optimized code.
  3. Reconstruct the interpreter state from the optimized code's state (on-stack replacement — OSR).
  4. Continue in the interpreter (or baseline JIT).

This is complex but essential — it allows the JIT to speculate aggressively while maintaining correctness.
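The speculate, guard, and deoptimize pattern can be sketched as ordinary code (a toy model of type specialization, not a real JIT, which would emit the fast path as machine code):

```rust
// Dynamic values, as an interpreter for a dynamic language would hold them.
#[derive(Debug, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
}

fn show(v: &Value) -> String {
    match v {
        Value::Int(i) => i.to_string(),
        Value::Str(s) => s.clone(),
    }
}

// Generic (slow) semantics: non-integer operands concatenate as strings.
fn add_generic(a: &Value, b: &Value) -> Value {
    Value::Str(format!("{}{}", show(a), show(b)))
}

// Speculatively specialized version: assumes "both operands are Int".
fn add_specialized(a: &Value, b: &Value) -> Value {
    match (a, b) {
        // Guard holds: fast integer-only path.
        (Value::Int(x), Value::Int(y)) => Value::Int(x + y),
        // Guard fails: "deoptimize" to the generic implementation.
        _ => add_generic(a, b),
    }
}

fn main() {
    println!("{:?}", add_specialized(&Value::Int(2), &Value::Int(3)));
    println!("{:?}", add_specialized(&Value::Int(2), &Value::Str("x".into())));
}
```

The guard is cheap; the payoff is that the common (all-integer) path never touches the generic machinery.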

On-Stack Replacement (OSR)

Switch between execution modes in the middle of a function — while it's on the stack.

OSR entry: Switch from interpreter to compiled code mid-function (e.g., enter a hot loop in compiled mode). Requires mapping interpreter state to compiled code state.

OSR exit (deoptimization): Switch from compiled code back to interpreter mid-function. Requires the reverse mapping.

Polyhedral Optimization

A mathematical framework for analyzing and transforming nested loops.

Key idea: Model loop iterations as integer points in a polyhedron (defined by loop bounds and conditions). Transformations = affine mappings on the polyhedron.

Enabled transformations: Loop interchange, tiling, fusion, fission, skewing, parallelization — all derived from the same mathematical framework.
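One of those transformations, loop tiling, written out by hand in Rust (a sketch of what a polyhedral tiler derives automatically for affine nests; the function name and block size are illustrative):

```rust
// Tiled matrix transpose: iterate in BLOCK x BLOCK tiles so both the
// row-major reads and the column-major writes stay within cache.
const BLOCK: usize = 4;

fn transpose_tiled(src: &[f64], dst: &mut [f64], n: usize) {
    for ii in (0..n).step_by(BLOCK) {
        for jj in (0..n).step_by(BLOCK) {
            // Inner loops sweep one tile; .min(n) handles ragged edges.
            for i in ii..(ii + BLOCK).min(n) {
                for j in jj..(jj + BLOCK).min(n) {
                    dst[j * n + i] = src[i * n + j];
                }
            }
        }
    }
}

fn main() {
    let n = 3;
    let src: Vec<f64> = (0..9).map(|x| x as f64).collect();
    let mut dst = vec![0.0; 9];
    transpose_tiled(&src, &mut dst, n);
    println!("{dst:?}"); // [0.0, 3.0, 6.0, 1.0, 4.0, 7.0, 2.0, 5.0, 8.0]
}
```

A polyhedral optimizer derives the tile loops, legality proof, and tile sizes from the iteration polyhedron instead of requiring this manual rewrite.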

Polly (LLVM): Polyhedral optimizer. Detects SCoPs (Static Control Parts) in the program and applies polyhedral transformations.

isl (Integer Set Library): The mathematical engine behind Polly and many polyhedral tools.

Limitations: Only works on "affine" loops (bounds and conditions are affine functions of loop indices). Can't handle indirect array accesses (A[B[i]]). Compilation time can be high.

Compiler Verification

Ensuring the compiler is correct — that it doesn't introduce bugs when transforming code.

Translation Validation

After each optimization pass, verify that the output is equivalent to the input.

Alive2 (LLVM): Automatically verifies LLVM IR transformations using an SMT solver. Has found hundreds of bugs in LLVM.

Verified Compilers

CompCert (Leroy, 2006): A C compiler formally verified in Coq. Proven correct: if the source program has defined behavior, the compiled code has the same behavior.

CakeML: A verified compiler for ML, verified in HOL4.

Trade-offs: Verified compilers produce slower code (fewer optimizations — each must be proved correct). CompCert is ~10% slower than GCC -O1. But guaranteed correct — critical for safety-critical systems.

Differential Testing (Fuzzing)

Csmith: Generates random C programs. Compile with multiple compilers. If outputs differ → compiler bug.

CReduce: When a test case triggers a bug, automatically reduce it to a minimal example.

This approach has found thousands of bugs in GCC, Clang, MSVC, and other compilers.
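The same idea scaled down to a single function: run a reference and an "optimized" implementation on many inputs and report any divergence (a toy harness in Rust, not Csmith):

```rust
// Reference implementation: bit-by-bit population count.
fn popcount_ref(mut x: u32) -> u32 {
    let mut n = 0;
    while x != 0 {
        n += x & 1;
        x >>= 1;
    }
    n
}

// "Optimized" implementation under test.
fn popcount_fast(x: u32) -> u32 {
    x.count_ones()
}

// Differential test: collect every input where the two versions disagree.
fn differential_test(inputs: &[u32]) -> Vec<u32> {
    inputs
        .iter()
        .copied()
        .filter(|&x| popcount_ref(x) != popcount_fast(x))
        .collect()
}

fn main() {
    let inputs: Vec<u32> = (0..10_000).collect();
    println!("mismatches: {}", differential_test(&inputs).len()); // 0
}
```

Csmith applies the same principle at the whole-compiler level: the "implementations" being compared are entire compilers, and the inputs are randomly generated programs.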

Summary of Advanced Techniques

| Technique | When Applied | Typical Improvement | Cost |
|---|---|---|---|
| PGO | Build time (after profiling) | 10-30% | Requires profiling run |
| LTO | Link time | 5-20% | Slower linking |
| Auto-vectorization | Compile time | 2-8× for vectorizable loops | Negligible |
| JIT | Runtime | Varies (can approach AOT) | Memory + compilation time |
| Polyhedral | Compile time | 2-10× for nested loops | High compile time |
| Compiler verification | Development time | Correctness guarantee | Fewer optimizations |

Applications in CS

  • Production builds: PGO + LTO + auto-vectorization for maximum performance (Chromium, Firefox, databases).
  • Language runtimes: JIT compilation for managed languages (JVM, .NET, V8, PyPy).
  • Scientific computing: Polyhedral optimization for dense linear algebra, stencil computations.
  • Safety-critical: Verified compilers (CompCert) for avionics, automotive, medical devices.
  • Compiler development: Fuzzing + differential testing finds bugs. Translation validation ensures correctness.