Code Generation
Code generation transforms the optimized IR into target machine code. It must select instructions, allocate registers, and schedule operations for maximum performance.
Instruction Selection
Map IR operations to target machine instructions. The challenge: multiple ways to implement each IR operation — choose the cheapest.
Tree Pattern Matching
Model IR as trees. Machine instructions are tree patterns. Select patterns that cover the IR tree with minimum cost.
IR tree:                Possible instruction patterns:

      +                 ADD  r1, r2, r3       (register + register)
     / \                ADD  r1, r2, #imm     (register + immediate)
    a   *               MADD r1, r2, r3, r4   (multiply-add fused)
       / \
      b   c
// If target has MADD: use a single MADD instruction instead of MUL + ADD
BURG (Bottom-Up Rewrite Generator): Optimal instruction selection via dynamic programming on trees. Used in some production compilers.
LLVM's approach: SelectionDAG (tree-based) → pattern matching against .td (TableGen) target descriptions. Global instruction selection (GlobalISel) is replacing it for new targets.
Maximal Munch
Greedy: at each IR node, select the largest pattern that matches. Not optimal but simple and fast.
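The greedy tiling can be sketched in Python, with IR trees as nested tuples. The tuple encoding and the `munch` helper are illustrative assumptions of this sketch, not any particular compiler's API:

```python
# Sketch of maximal-munch instruction selection over a tiny IR.
# Trees are tuples: ('add', l, r), ('mul', l, r), or a leaf register name.
# Patterns are tried largest-first; the first match at each node wins.

def munch(tree, out, fresh=[0]):
    """Emit instructions for `tree` into `out`; return the result register."""
    if isinstance(tree, str):          # leaf: value already in a register
        return tree
    def new_reg():
        fresh[0] += 1
        return f"t{fresh[0]}"
    op, left, right = tree
    # Largest pattern first: add(x, mul(y, z)) -> fused multiply-add
    if op == 'add' and isinstance(right, tuple) and right[0] == 'mul':
        x = munch(left, out)
        y = munch(right[1], out)
        z = munch(right[2], out)
        r = new_reg()
        out.append(f"MADD {r}, {y}, {z}, {x}")   # r = y*z + x
        return r
    # Fallback: single-operation patterns
    l = munch(left, out)
    r2 = munch(right, out)
    r = new_reg()
    out.append(f"{op.upper()} {r}, {l}, {r2}")
    return r

code = []
munch(('add', 'a', ('mul', 'b', 'c')), code)   # a + b*c
print(code)   # ['MADD t1, b, c, a'] -- one instruction instead of MUL + ADD
```

Because the MADD pattern is tried before the smaller ADD pattern, the whole `a + b*c` tree is covered by one tile.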
Register Allocation
Map virtual registers (unlimited in SSA/IR) to physical registers (limited: 16 general-purpose on x86-64, 31 on AArch64, 32 on RISC-V).
The Problem
If there are more live virtual registers at a point than physical registers, some values must be spilled to memory (stack).
Live variables at this point: a, b, c, d, e, f, g, h, i, j
Physical registers available: 8 (r0-r7)
→ Must spill 2 variables to memory
Goal: Minimize spills (memory accesses are expensive). This is an NP-hard problem (graph coloring with K colors).
Graph Coloring
Interference graph: Nodes = virtual registers. Edge between two nodes if they are live simultaneously (can't share a physical register).
K-coloring: Assign K colors (physical registers) such that no adjacent nodes share a color. If the graph is K-colorable → no spills needed.
Algorithm (Chaitin):
- Build interference graph from liveness analysis.
- Simplify: Find a node with < K neighbors. Remove it (push onto stack). It can always be colored. Repeat.
- Spill: If no node has < K neighbors, choose one to spill (mark for memory storage). Remove it. Continue simplifying.
- Select: Pop nodes from stack. Assign colors (registers) avoiding colors of already-colored neighbors.
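The simplify/spill/select loop above can be sketched as follows, assuming the interference graph is already given as an adjacency dict. The spill heuristic here is plain highest degree; real allocators weigh estimated spill cost against degree:

```python
# Sketch of Chaitin-style graph coloring (simplify / spill / select).

def color(graph, k):
    """Return {node: color} with colors 0..k-1, or ('spill', node) on failure."""
    degrees = {n: len(graph[n]) for n in graph}
    stack, removed = [], set()
    while len(removed) < len(graph):
        # Simplify: remove a node with < k remaining neighbors (always colorable).
        cand = [n for n in graph if n not in removed and degrees[n] < k]
        if cand:
            n = cand[0]
        else:
            # Spill candidate: highest remaining degree (simplistic heuristic).
            n = max((m for m in graph if m not in removed), key=lambda m: degrees[m])
        stack.append(n)
        removed.add(n)
        for m in graph[n]:
            if m not in removed:
                degrees[m] -= 1
    # Select: pop and assign the lowest color unused by colored neighbors.
    coloring = {}
    while stack:
        n = stack.pop()
        used = {coloring[m] for m in graph[n] if m in coloring}
        free = [c for c in range(k) if c not in used]
        if not free:
            return ('spill', n)   # real spill: rewrite code, rebuild graph, retry
        coloring[n] = free[0]
    return coloring

# Triangle a-b-c: 3-colorable, but with k=2 some node must spill.
g = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b'}}
print(color(g, 3))
print(color(g, 2))   # ('spill', ...)
```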
Coalescing: If a MOVE instruction copies x to y, and x and y don't interfere, merge them into one node (eliminate the MOVE). Improves code quality but may make coloring harder.
Linear Scan
A simpler, faster alternative. Scan through the program linearly. Maintain a list of active intervals. When a register is needed and none is free, spill the variable with the farthest next use.
Time: O(n log n) vs O(n²) for graph coloring.
Used by: JIT compilers (where compilation speed matters more than code quality). HotSpot C1, V8 (originally).
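A sketch of the scan, assuming live intervals `(start, end)` have already been computed; the interval encoding and names are illustrative:

```python
# Sketch of linear-scan register allocation over live intervals,
# spilling the interval that ends farthest away when registers run out.

def linear_scan(intervals, num_regs):
    """intervals: {name: (start, end)}. Returns ({name: reg}, spilled names)."""
    alloc, spilled, active = {}, set(), []   # active: list of (end, name)
    free = list(range(num_regs))
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Expire intervals that ended before this one starts.
        for e, n in list(active):
            if e < start:
                active.remove((e, n))
                free.append(alloc[n])
        if free:
            alloc[name] = free.pop()
            active.append((end, name))
        else:
            # Spill whichever active interval ends last ("farthest next use").
            far_end, far_name = max(active)
            if far_end > end:
                alloc[name] = alloc.pop(far_name)   # steal its register
                spilled.add(far_name)
                active.remove((far_end, far_name))
                active.append((end, name))
            else:
                spilled.add(name)
    return alloc, spilled

ivs = {'a': (0, 9), 'b': (1, 3), 'c': (2, 8), 'd': (4, 6)}
print(linear_scan(ivs, 2))   # with 2 registers, 'a' (farthest end) is spilled
```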
Register Allocation in SSA
SSA form simplifies register allocation:
- Interference: In SSA, two values interfere iff one is live at the definition of the other. The interference graph of SSA programs is always chordal → chordal graph coloring is polynomial (O(V + E)).
- φ-functions correspond to parallel copies. After coloring, φ-functions are replaced with moves (most eliminated by coalescing).
Instruction Scheduling
Reorder instructions to maximize instruction-level parallelism and minimize pipeline stalls.
The Problem
// Naive order: pipeline stalls on data dependencies
LOAD r1, [addr] // takes 3 cycles
ADD r2, r1, r3 // must wait for LOAD (stall 2 cycles)
MUL r4, r5, r6 // independent — could run during the stall!
// Scheduled order: independent work fills the load latency
LOAD r1, [addr] // starts loading
MUL r4, r5, r6 // independent, fills the latency
NOP // (or another independent instruction)
ADD r2, r1, r3 // r1 is now ready
List Scheduling
- Build a dependency graph (data dependencies, memory dependencies).
- Compute the priority of each instruction (typically: longest path to the end).
- Greedily schedule: at each cycle, pick the highest-priority ready instruction.
Complexity: optimal scheduling is NP-hard in general. List scheduling is a greedy heuristic that produces good results in practice.
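The three steps above can be sketched for a single-issue machine, assuming given latencies and dependence sets (the encoding is an assumption of this sketch):

```python
# Sketch of list scheduling: priority = longest latency path to the end,
# then each cycle greedily issue the highest-priority ready instruction.

def list_schedule(latency, deps):
    """deps[i] = set of instructions that must complete before i starts."""
    succs = {i: set() for i in latency}
    for i, ds in deps.items():
        for d in ds:
            succs[d].add(i)
    prio = {}
    def path(i):   # critical-path length from i to the end
        if i not in prio:
            prio[i] = latency[i] + max((path(s) for s in succs[i]), default=0)
        return prio[i]
    order, done_at, cycle = [], {}, 0
    remaining = set(latency)
    while remaining:
        ready = [i for i in remaining
                 if all(d in done_at and done_at[d] <= cycle
                        for d in deps.get(i, ()))]
        if ready:
            i = max(ready, key=path)
            order.append(i)
            done_at[i] = cycle + latency[i]
            remaining.remove(i)
        cycle += 1
    return order

# LOAD (3 cycles) feeds ADD; the independent MUL fills the load latency.
lat = {'LOAD': 3, 'ADD': 1, 'MUL': 1}
deps = {'ADD': {'LOAD'}}
print(list_schedule(lat, deps))   # ['LOAD', 'MUL', 'ADD']
```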
Phase Ordering
Instruction scheduling and register allocation interact:
- Scheduling before allocation: More freedom to reorder (more virtual registers). May increase register pressure.
- Scheduling after allocation: Fewer registers → less scheduling freedom. But respects actual register constraints.
Typical approach: Schedule → allocate → reschedule (three passes). Or integrate scheduling with allocation.
Calling Conventions
Rules for how functions call each other: parameter passing, return values, register saving.
System V AMD64 (Linux, macOS x86-64)
| Purpose | Registers |
|---|---|
| Integer arguments | RDI, RSI, RDX, RCX, R8, R9 (first 6) |
| Float arguments | XMM0-XMM7 (first 8) |
| Return value | RAX (integer), XMM0 (float) |
| Caller-saved | RAX, RCX, RDX, RSI, RDI, R8-R11 |
| Callee-saved | RBX, RBP, R12-R15 |
| Stack pointer | RSP (16-byte aligned at call) |
| Frame pointer | RBP (optional) |
Arguments: First 6 integer args in registers, remaining on stack.
ARM AAPCS (AArch64)
| Purpose | Registers |
|---|---|
| Arguments | X0-X7 (integer), V0-V7 (float) |
| Return value | X0 (integer), V0 (float) |
| Caller-saved | X0-X18 |
| Callee-saved | X19-X28 |
| Stack pointer | SP |
| Frame pointer | X29 (FP) |
| Link register | X30 (LR) |
Stack Frame Layout
High addresses
┌──────────────────────┐
│ Caller's frame │
├──────────────────────┤ ← old RBP (frame pointer)
│ Return address │ (pushed by CALL instruction)
├──────────────────────┤ ← new RBP
│ Saved callee-saved │ (RBX, R12-R15 if used)
│ registers │
├──────────────────────┤
│ Local variables │ (known size at compile time)
├──────────────────────┤
│ Spilled registers │ (virtual regs that didn't get physical regs)
├──────────────────────┤
│ Outgoing arguments │ (args for functions we call, if > 6)
├──────────────────────┤ ← RSP (stack pointer, 16-byte aligned)
Low addresses
Code Generation for Expressions
Stack Machine
Simple. Push operands, pop and apply operators.
// Expression: a + b * c
PUSH a
PUSH b
PUSH c
MUL // pop b, c; push b*c
ADD // pop a, b*c; push a+b*c
Easy to generate but poor performance (many memory accesses to the stack). Used in JVM bytecode, WebAssembly (conceptually).
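A tiny evaluator for this instruction sequence, written in Python as an illustration (the string encoding of instructions is an assumption of this sketch):

```python
# Sketch of a stack-machine evaluator for the PUSH/MUL/ADD sequence above,
# with variable values supplied in an environment.

def run(code, env):
    stack = []
    for instr in code:
        op, *args = instr.split()
        if op == 'PUSH':
            stack.append(env[args[0]])
        elif op == 'MUL':
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == 'ADD':
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack.pop()

program = ['PUSH a', 'PUSH b', 'PUSH c', 'MUL', 'ADD']
print(run(program, {'a': 2, 'b': 3, 'c': 4}))   # 14  (2 + 3*4)
```

Every operand passes through the explicit stack, which is exactly the memory-traffic cost that register machines avoid.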
Register Machine
Map values to registers. Better performance. Requires register allocation.
// Expression: a + b * c
MUL r1, rb, rc // r1 = b * c
ADD r0, ra, r1 // r0 = a + b * c
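The corresponding tree-walk code generator can be sketched as follows; the fresh names `t1`, `t2` stand for virtual registers that register allocation would later map to physical ones (the tuple IR is illustrative):

```python
# Sketch of register-machine code generation by tree walk: each operator
# result gets a fresh virtual register.

def gen(tree, out, counter=[0]):
    if isinstance(tree, str):
        return f"r{tree}"            # variable assumed to live in a register
    op, left, right = tree
    l, r = gen(left, out), gen(right, out)
    counter[0] += 1
    dst = f"t{counter[0]}"
    out.append(f"{'MUL' if op == '*' else 'ADD'} {dst}, {l}, {r}")
    return dst

code = []
gen(('+', 'a', ('*', 'b', 'c')), code)   # a + b*c
print(code)   # ['MUL t1, rb, rc', 'ADD t2, ra, t1']
```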
Peephole Optimization
Post-code-generation local optimization. Scan a small window (peephole) of instructions and replace patterns with better sequences.
// Redundant load after store
STORE [addr], r1     →   STORE [addr], r1
LOAD r2, [addr]          MOV r2, r1     // avoid memory access
// Redundant jump
JUMP L1 → (remove — L1 is the next instruction)
L1:
// Strength reduction in generated code
MUL r1, r2, 2 → SHL r1, r2, 1 // shift instead of multiply
// Null sequences
ADD r1, r1, 0 → (remove)
MOV r1, r1 → (remove)
Cost: Very cheap (single pass). Catches patterns missed by higher-level optimizations.
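These patterns can be sketched as a single-pass rewriter over instruction tuples (the tuple encoding is illustrative; the redundant-jump pattern is omitted since it needs label information):

```python
# Sketch of a peephole pass over (op, dst, src...) tuples, implementing the
# store/load, strength-reduction, and null-sequence patterns above.

def peephole(code):
    out = []
    for instr in code:
        op, *args = instr
        # Null sequences: ADD r, r, 0 and MOV r, r
        if op == 'ADD' and args[1] == args[0] and args[2] == 0:
            continue
        if op == 'MOV' and args[0] == args[1]:
            continue
        # Strength reduction: MUL r, x, 2 -> SHL r, x, 1
        if op == 'MUL' and args[2] == 2:
            instr = ('SHL', args[0], args[1], 1)
        # Redundant load after store to the same address
        if op == 'LOAD' and out and out[-1][0] == 'STORE' and out[-1][1] == args[1]:
            instr = ('MOV', args[0], out[-1][2])
        out.append(instr)
    return out

code = [('STORE', 'addr', 'r1'), ('LOAD', 'r2', 'addr'),
        ('MUL', 'r3', 'r4', 2), ('ADD', 'r5', 'r5', 0), ('MOV', 'r6', 'r6')]
print(peephole(code))
# [('STORE', 'addr', 'r1'), ('MOV', 'r2', 'r1'), ('SHL', 'r3', 'r4', 1)]
```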
Applications in CS
- Compiler development: Code generation is where the rubber meets the road — the quality of generated code determines compiler performance.
- JIT compilers: Fast code generation (linear scan, simple scheduling) enables low-latency JIT compilation.
- Embedded systems: Code generation for constrained targets (limited registers, no FPU, specific ISA extensions).
- Performance engineering: Understanding code generation helps predict and improve performance of generated code.
- Cross-compilation: Code generation back end determines which targets a compiler supports.