Advanced Compiler Topics
This file covers advanced optimization and compilation techniques used in production compilers.
Profile-Guided Optimization (PGO)
Use runtime profiling data to guide optimization decisions.
Workflow
1. Compile with instrumentation: RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release
2. Run with representative workload: ./target/release/myapp < workload.txt
3. Merge the raw profiles: llvm-profdata merge -o profile.profdata /tmp/pgo-data (llvm-profdata ships with the llvm-tools-preview rustup component)
4. Recompile with profile data: RUSTFLAGS="-Cprofile-use=profile.profdata" cargo build --release
What PGO Optimizes
Branch prediction hints: Mark likely/unlikely branches based on actual execution frequencies.
Function inlining: Inline hot functions more aggressively. Don't inline cold functions.
Block placement: Arrange basic blocks so the hot path is fall-through (no taken branches). Move cold code (error handling) to the end.
Function placement: Place frequently co-called functions near each other in the binary (improves I-cache locality).
Indirect call devirtualization: If a virtual call site almost always calls the same function, specialize for that target.
Loop unrolling: Unroll loops with high trip counts. Don't unroll loops that rarely execute.
Typical improvement: 10-30% for CPU-bound workloads. Up to 50% for branch-heavy code.
Used by: Chromium, Firefox, Clang itself, Linux kernel (since 6.x for x86).
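These hints can also be written by hand; PGO's value is deriving them automatically from real profiles. A rough Rust sketch of the manual equivalents (function names are hypothetical, and the attributes only approximate what profile data drives):

```rust
// Manual analogues of hints PGO infers from profile counts (illustrative only).
#[cold]           // "this path is rarely taken" — like a near-zero profile count
#[inline(never)]  // keep the rare path out of line, off the hot fall-through path
fn handle_overflow(acc: u64, x: u64) -> u64 {
    // Rare slow path: saturate instead of wrapping.
    let _ = (acc, x);
    u64::MAX
}

#[inline] // hot helper — PGO would inline this aggressively at hot call sites
fn hot_add(acc: u64, x: u64) -> Option<u64> {
    acc.checked_add(x)
}

pub fn sum_saturating(data: &[u64]) -> u64 {
    let mut acc = 0u64;
    for &x in data {
        match hot_add(acc, x) {
            Some(v) => acc = v,                      // hot, fall-through case
            None => return handle_overflow(acc, x),  // cold case, placed out of line
        }
    }
    acc
}
```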
Link-Time Optimization (LTO)
Optimize across translation unit (file) boundaries at link time.
How It Works
- Each source file is compiled to an IR (LLVM bitcode or GCC GIMPLE) instead of machine code.
- At link time, the linker invokes the optimizer on the combined IR of all files.
- Cross-module optimizations are applied (inlining across files, interprocedural analysis, dead code elimination).
# Cargo.toml
[profile.release]
lto = true # "fat" LTO — all code optimized together
# lto = "thin" # "thin" LTO — parallel, faster compilation
Fat LTO vs Thin LTO
Fat LTO: Merge all bitcode into one module. Full optimization. Very slow compilation (single-threaded). Best code quality.
Thin LTO (LLVM): Index all modules. Only inline/optimize across module boundaries where profitable. Parallel compilation. 90%+ of fat LTO's benefit at a fraction of the compile time. In Cargo, request it with lto = "thin" (lto = true requests fat LTO).
Typical improvement: 5-20% over non-LTO builds. Mainly from cross-module inlining and dead code elimination.
Auto-Vectorization
The compiler automatically transforms scalar code into SIMD (vector) instructions.
What Gets Vectorized
// Simple loop — easily vectorized
fn add_arrays(a: &[f32], b: &[f32], c: &mut [f32]) {
    for ((ci, ai), bi) in c.iter_mut().zip(a).zip(b) {
        *ci = ai + bi;
    }
}
// Compiler generates a SIMD add (e.g., AVX2: 8 f32 lanes at once)
What Blocks Vectorization
- Loop-carried dependencies: a[i] = a[i-1] + 1 (each iteration depends on the previous).
- Function calls: Unknown side effects prevent vectorization (unless the function is inlined).
- Complex control flow: Branches inside the loop reduce vectorization effectiveness.
- Aliasing: The compiler can't prove arrays don't overlap. Use restrict (C) or Rust's borrow rules (which guarantee no aliasing of &mut references).
Rust advantage: &mut guarantees no aliasing → the compiler can vectorize more aggressively than C (where aliasing is possible unless annotated with restrict).
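A hedged Rust sketch of the first blocker versus the happy case: the first loop has independent iterations (and no possible aliasing between the two slices), so it is a vectorization candidate; the second carries a dependency across iterations, so its lanes cannot run in parallel as written (function names are illustrative):

```rust
// Independent iterations: a vectorization candidate. The zip form also lets
// the compiler hoist bounds checks out of the loop.
pub fn double_each(a: &[i32], out: &mut [i32]) {
    for (o, x) in out.iter_mut().zip(a) {
        *o = x * 2;
    }
}

// Loop-carried dependency: a[i] needs the just-updated a[i-1], so iterations
// can't be packed into SIMD lanes as written.
pub fn running_sum(a: &mut [i32]) {
    for i in 1..a.len() {
        a[i] += a[i - 1];
    }
}
```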
Checking Vectorization
# LLVM: emit optimization remarks
RUSTFLAGS="-C llvm-args=-pass-remarks=loop-vectorize" cargo build --release
# Inspect assembly
cargo asm mylib::add_arrays # using cargo-show-asm
Just-In-Time Compilation
Compile code during execution based on runtime information.
JIT Architecture
Source → Bytecode (AOT)
↓
Interpreter (start executing immediately)
↓ (hot method detected)
Baseline JIT (quick compilation, basic optimizations)
↓ (still hot)
Optimizing JIT (slow compilation, aggressive optimizations)
↓ (assumption violated)
Deoptimization (fall back to interpreter or baseline)
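The tier-up step above can be sketched with a hot-call counter. Assumptions: "compilation" is modeled by swapping in a faster closure, and the closed-form sum stands in for optimized code; a real JIT emits machine code and profiles much more than a call count:

```rust
// Minimal sketch of tiered execution: interpret until hot, then tier up.
struct TieredFn {
    calls: u32,
    threshold: u32,
    compiled: Option<Box<dyn Fn(u64) -> u64>>,
}

impl TieredFn {
    fn new(threshold: u32) -> Self {
        Self { calls: 0, threshold, compiled: None }
    }

    // Computes 1 + 2 + … + n, in whichever tier is currently active.
    fn call(&mut self, n: u64) -> u64 {
        if let Some(f) = &self.compiled {
            return f(n); // optimized tier: closed form, no loop
        }
        self.calls += 1;
        if self.calls >= self.threshold {
            // Hot method detected: "compile" a specialized replacement.
            self.compiled = Some(Box::new(|n| n * (n + 1) / 2));
        }
        (1..=n).sum() // interpreter tier: compute step by step
    }
}
```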
Tiered Compilation (JVM HotSpot)
| Tier | Mode | Compile Speed | Code Quality |
|---|---|---|---|
| 0 | Interpreter | N/A | Slow execution |
| 1 | C1, no profiling | Fast | Moderate |
| 2 | C1, limited profiling | Fast | Moderate |
| 3 | C1, full profiling | Fast | Moderate + data collection |
| 4 | C2 (optimizing) | Slow | Best |
Typical path: 0 → 3 → 4 (interpret → profile → optimize).
Speculative Optimization
JIT compilers can make optimistic assumptions based on runtime observations:
- Type specialization: "This variable is always an integer" → generate integer-only code.
- Inline caching: "This virtual call always calls Circle::area()" → inline it.
- Branch profiling: "This branch is taken 99.9% of the time" → optimize for the common case.
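The inline-caching idea can be sketched as a monomorphic cache at one call site. Assumptions: all names are illustrative, and the guard only counts hits/misses here; a real JIT would run devirtualized, inlined code on the fast path and deoptimize on a miss:

```rust
use std::any::TypeId;

trait Shape {
    fn area(&self) -> f64;
    fn tag(&self) -> TypeId; // stands in for the object's hidden class / vtable identity
}

struct Circle { r: f64 }
impl Shape for Circle {
    fn area(&self) -> f64 { std::f64::consts::PI * self.r * self.r }
    fn tag(&self) -> TypeId { TypeId::of::<Circle>() }
}

struct Square { s: f64 }
impl Shape for Square {
    fn area(&self) -> f64 { self.s * self.s }
    fn tag(&self) -> TypeId { TypeId::of::<Square>() }
}

// Monomorphic inline cache: remember the last receiver type seen at this site.
struct CallSite {
    cached: Option<TypeId>,
    hits: u32,
    misses: u32,
}

impl CallSite {
    fn new() -> Self {
        Self { cached: None, hits: 0, misses: 0 }
    }

    fn call_area(&mut self, shape: &dyn Shape) -> f64 {
        if self.cached == Some(shape.tag()) {
            self.hits += 1; // guard passed: fast, specialized path
        } else {
            self.misses += 1; // guard failed: "deopt", then re-specialize
            self.cached = Some(shape.tag());
        }
        shape.area()
    }
}
```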
Deoptimization
If a speculative assumption is violated:
- Detect the violation (type guard fails, unexpected class loaded).
- Stop executing the optimized code.
- Reconstruct the interpreter state from the optimized code's state (on-stack replacement — OSR).
- Continue in the interpreter (or baseline JIT).
This is complex but essential — it allows the JIT to speculate aggressively while maintaining correctness.
On-Stack Replacement (OSR)
Switch between execution modes in the middle of a function — while it's on the stack.
OSR entry: Switch from interpreter to compiled code mid-function (e.g., enter a hot loop in compiled mode). Requires mapping interpreter state to compiled code state.
OSR exit (deoptimization): Switch from compiled code back to interpreter mid-function. Requires the reverse mapping.
Polyhedral Optimization
A mathematical framework for analyzing and transforming nested loops.
Key idea: Model loop iterations as integer points in a polyhedron (defined by loop bounds and conditions). Transformations = affine mappings on the polyhedron.
Enabled transformations: Loop interchange, tiling, fusion, fission, skewing, parallelization — all derived from the same mathematical framework.
Polly (LLVM): Polyhedral optimizer. Detects SCoPs (Static Control Parts) in the program and applies polyhedral transformations.
isl (Integer Set Library): The mathematical engine behind Polly and many polyhedral tools.
Limitations: Only works on "affine" loops (bounds and conditions are affine functions of loop indices). Can't handle indirect array accesses (A[B[i]]). Compilation time can be high.
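To make one of the transformations above concrete, here is a hand-tiled matrix transpose. Assumption: BLK = 4 stands in for a cache-tuned tile size; a polyhedral optimizer would derive such a tiled schedule automatically from the naive two-loop version:

```rust
const BLK: usize = 4; // tile size; tuned to cache geometry in practice

// Tiled transpose of an n×n row-major matrix: both src and dst are walked in
// BLK×BLK blocks, improving locality versus the naive column-strided walk.
fn transpose_tiled(src: &[f64], dst: &mut [f64], n: usize) {
    for ib in (0..n).step_by(BLK) {
        for jb in (0..n).step_by(BLK) {
            for i in ib..(ib + BLK).min(n) {
                for j in jb..(jb + BLK).min(n) {
                    dst[j * n + i] = src[i * n + j];
                }
            }
        }
    }
}
```

The .min(n) bounds handle matrices whose size is not a multiple of the tile size; in polyhedral terms, the tiled iteration space is still exactly the original polyhedron.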
Compiler Verification
Ensuring the compiler is correct — that it doesn't introduce bugs when transforming code.
Translation Validation
After each optimization pass, verify that the output is equivalent to the input.
Alive2 (LLVM): Automatically verifies LLVM IR transformations using an SMT solver. Has found hundreds of bugs in LLVM.
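A toy version of the idea, under the assumption that exhaustive checking over u8 stands in for the symbolic SMT query a tool like Alive2 discharges:

```rust
// "Before" and "after" of a peephole rewrite: x * 2  →  x << 1.
fn original(x: u8) -> u8 { x.wrapping_mul(2) }
fn optimized(x: u8) -> u8 { x << 1 }

// Translation validation over a small domain: accept the rewrite only if the
// two versions agree on every input.
fn validate() -> bool {
    (0..=u8::MAX).all(|x| original(x) == optimized(x))
}
```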
Verified Compilers
CompCert (Leroy, 2006): A C compiler formally verified in Coq. Proven correct: if the source program has defined behavior, the compiled code has the same behavior.
CakeML: A verified compiler for ML, verified in HOL4.
Trade-offs: Verified compilers produce slower code (fewer optimizations — each must be proved correct). CompCert is ~10% slower than GCC -O1. But guaranteed correct — critical for safety-critical systems.
Differential Testing (Fuzzing)
Csmith: Generates random C programs. Compile with multiple compilers. If outputs differ → compiler bug.
CReduce: When a test case triggers a bug, automatically reduce it to a minimal example.
This approach has found thousands of bugs in GCC, Clang, MSVC, and other compilers.
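A hedged miniature of this approach, assuming a tiny expression language stands in for C and the "two compilers" are a direct interpreter and a constant-folding evaluator (all names are illustrative):

```rust
// Expression language: literals, addition, multiplication.
enum Expr {
    Lit(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// "Compiler" #1: direct interpreter.
fn interpret(e: &Expr) -> i64 {
    match e {
        Expr::Lit(n) => *n,
        Expr::Add(a, b) => interpret(a).wrapping_add(interpret(b)),
        Expr::Mul(a, b) => interpret(a).wrapping_mul(interpret(b)),
    }
}

// "Compiler" #2: constant-folding optimizer (a bug here is what we hunt for).
fn fold(e: &Expr) -> Expr {
    match e {
        Expr::Lit(n) => Expr::Lit(*n),
        Expr::Add(a, b) => match (fold(a), fold(b)) {
            (Expr::Lit(x), Expr::Lit(y)) => Expr::Lit(x.wrapping_add(y)),
            (fa, fb) => Expr::Add(Box::new(fa), Box::new(fb)),
        },
        Expr::Mul(a, b) => match (fold(a), fold(b)) {
            (Expr::Lit(x), Expr::Lit(y)) => Expr::Lit(x.wrapping_mul(y)),
            (fa, fb) => Expr::Mul(Box::new(fa), Box::new(fb)),
        },
    }
}

// Random program generator (xorshift; no external crates).
fn gen_expr(seed: &mut u64, depth: u32) -> Expr {
    *seed ^= *seed << 13;
    *seed ^= *seed >> 7;
    *seed ^= *seed << 17;
    if depth == 0 || *seed % 3 == 0 {
        Expr::Lit((*seed % 100) as i64)
    } else if *seed % 2 == 0 {
        Expr::Add(Box::new(gen_expr(seed, depth - 1)), Box::new(gen_expr(seed, depth - 1)))
    } else {
        Expr::Mul(Box::new(gen_expr(seed, depth - 1)), Box::new(gen_expr(seed, depth - 1)))
    }
}

// Differential test: any disagreement between the two pipelines is a bug.
fn differential_test(trials: u32) -> bool {
    let mut seed = 0x9E37_79B9_7F4A_7C15_u64;
    (0..trials).all(|_| {
        let e = gen_expr(&mut seed, 4);
        interpret(&fold(&e)) == interpret(&e)
    })
}
```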
Summary of Advanced Techniques
| Technique | When Applied | Typical Improvement | Cost |
|---|---|---|---|
| PGO | Build time (after profiling) | 10-30% | Requires profiling run |
| LTO | Link time | 5-20% | Slower linking |
| Auto-vectorization | Compile time | 2-8× for vectorizable loops | Negligible |
| JIT | Runtime | Varies (can approach AOT) | Memory + compilation time |
| Polyhedral | Compile time | 2-10× for nested loops | High compile time |
| Compiler verification | Development time | Correctness guarantee | Fewer optimizations |
Applications in CS
- Production builds: PGO + LTO + auto-vectorization for maximum performance (Chromium, Firefox, databases).
- Language runtimes: JIT compilation for managed languages (JVM, .NET, V8, PyPy).
- Scientific computing: Polyhedral optimization for dense linear algebra, stencil computations.
- Safety-critical: Verified compilers (CompCert) for avionics, automotive, medical devices.
- Compiler development: Fuzzing + differential testing finds bugs. Translation validation ensures correctness.