Code Generation
Code generation transforms the optimized IR into target machine code. It must select instructions, allocate registers, and schedule operations for maximum performance.
Instruction Selection
Map IR operations to target machine instructions. The challenge: multiple ways to implement each IR operation — choose the cheapest.
Tree Pattern Matching
Model IR as trees. Machine instructions are tree patterns. Select patterns that cover the IR tree with minimum cost.
IR tree:                Possible instruction patterns:

      +                 ADD  r1, r2, r3       (register + register)
     / \                ADD  r1, r2, #imm     (register + immediate)
    a   *               MADD r1, r2, r3, r4   (multiply-add fused)
       / \
      b   c
// If target has MADD: use a single MADD instruction instead of MUL + ADD
BURG (Bottom-Up Rewrite Generator): Optimal instruction selection via dynamic programming on trees. Used in some production compilers.
LLVM's approach: SelectionDAG (tree-based) → pattern matching against .td (TableGen) target descriptions. Global instruction selection (GlobalISel) is replacing it for new targets.
Maximal Munch
Greedy: at each IR node, select the largest pattern that matches. Not optimal but simple and fast.
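The greedy tiling can be sketched in Python, with IR trees as nested tuples. The tuple encoding and the `munch` helper are illustrative assumptions of this sketch, not any particular compiler's API:

```python
# Sketch of maximal-munch instruction selection over a tiny IR.
# Trees are tuples: ('add', l, r), ('mul', l, r), or a leaf register name.
# Patterns are tried largest-first; the first match at each node wins.

def munch(tree, out, fresh=[0]):
    """Emit instructions for `tree` into `out`; return the result register."""
    if isinstance(tree, str):          # leaf: value already in a register
        return tree
    def new_reg():
        fresh[0] += 1
        return f"t{fresh[0]}"
    op, left, right = tree
    # Largest pattern first: add(x, mul(y, z)) -> fused multiply-add
    if op == 'add' and isinstance(right, tuple) and right[0] == 'mul':
        x = munch(left, out)
        y = munch(right[1], out)
        z = munch(right[2], out)
        r = new_reg()
        out.append(f"MADD {r}, {y}, {z}, {x}")   # r = y*z + x
        return r
    # Fallback: single-operation patterns
    l = munch(left, out)
    r2 = munch(right, out)
    r = new_reg()
    out.append(f"{op.upper()} {r}, {l}, {r2}")
    return r

code = []
munch(('add', 'a', ('mul', 'b', 'c')), code)   # a + b*c
print(code)   # ['MADD t1, b, c, a'] -- one instruction instead of MUL + ADD
```

Because the MADD pattern is tried before the smaller ADD pattern, the whole `a + b*c` tree is covered by one tile.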
Register Allocation
Map virtual registers (unlimited in SSA/IR) to physical registers (limited: 16 general-purpose on x86-64, 31 on AArch64, 32 on RISC-V).
The Problem
If there are more live virtual registers at a point than physical registers, some values must be spilled to memory (stack).
Live variables at this point: a, b, c, d, e, f, g, h, i, j
Physical registers available: 8 (r0-r7)
→ Must spill 2 variables to memory
Goal: Minimize spills (memory accesses are expensive). This is an NP-hard problem (graph coloring with K colors).
Graph Coloring
Interference graph: Nodes = virtual registers. Edge between two nodes if they are live simultaneously (can't share a physical register).
K-coloring: Assign K colors (physical registers) such that no adjacent nodes share a color. If the graph is K-colorable → no spills needed.
Algorithm (Chaitin):
- Build interference graph from liveness analysis.
- Simplify: Find a node with < K neighbors. Remove it (push onto stack). It can always be colored. Repeat.
- Spill: If no node has < K neighbors, choose one to spill (mark for memory storage). Remove it. Continue simplifying.
- Select: Pop nodes from stack. Assign colors (registers) avoiding colors of already-colored neighbors.
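The simplify/spill/select loop above can be sketched as follows, assuming the interference graph is already given as an adjacency dict. The spill heuristic here is plain highest degree; real allocators weigh estimated spill cost against degree:

```python
# Sketch of Chaitin-style graph coloring (simplify / spill / select).

def color(graph, k):
    """Return {node: color} with colors 0..k-1, or ('spill', node) on failure."""
    degrees = {n: len(graph[n]) for n in graph}
    stack, removed = [], set()
    while len(removed) < len(graph):
        # Simplify: remove a node with < k remaining neighbors (always colorable).
        cand = [n for n in graph if n not in removed and degrees[n] < k]
        if cand:
            n = cand[0]
        else:
            # Spill candidate: highest remaining degree (simplistic heuristic).
            n = max((m for m in graph if m not in removed), key=lambda m: degrees[m])
        stack.append(n)
        removed.add(n)
        for m in graph[n]:
            if m not in removed:
                degrees[m] -= 1
    # Select: pop and assign the lowest color unused by colored neighbors.
    coloring = {}
    while stack:
        n = stack.pop()
        used = {coloring[m] for m in graph[n] if m in coloring}
        free = [c for c in range(k) if c not in used]
        if not free:
            return ('spill', n)   # real spill: rewrite code, rebuild graph, retry
        coloring[n] = free[0]
    return coloring

# Triangle a-b-c: 3-colorable, but with k=2 some node must spill.
g = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b'}}
print(color(g, 3))
print(color(g, 2))   # ('spill', ...)
```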
Coalescing: If a MOVE instruction copies x to y, and x and y don't interfere, merge them into one node (eliminate the MOVE). Improves code quality but may make coloring harder.
Linear Scan
A simpler, faster alternative. Scan through the program linearly. Maintain a list of active intervals. When a register is needed and none is free, spill the variable with the farthest next use.
Time: O(n log n) vs O(n²) for graph coloring.
Used by: JIT compilers (where compilation speed matters more than code quality). HotSpot C1, V8 (originally).
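A sketch of the scan, assuming live intervals `(start, end)` have already been computed; the interval encoding and names are illustrative:

```python
# Sketch of linear-scan register allocation over live intervals,
# spilling the interval that ends farthest away when registers run out.

def linear_scan(intervals, num_regs):
    """intervals: {name: (start, end)}. Returns ({name: reg}, spilled names)."""
    alloc, spilled, active = {}, set(), []   # active: list of (end, name)
    free = list(range(num_regs))
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Expire intervals that ended before this one starts.
        for e, n in list(active):
            if e < start:
                active.remove((e, n))
                free.append(alloc[n])
        if free:
            alloc[name] = free.pop()
            active.append((end, name))
        else:
            # Spill whichever active interval ends last ("farthest next use").
            far_end, far_name = max(active)
            if far_end > end:
                alloc[name] = alloc.pop(far_name)   # steal its register
                spilled.add(far_name)
                active.remove((far_end, far_name))
                active.append((end, name))
            else:
                spilled.add(name)
    return alloc, spilled

ivs = {'a': (0, 9), 'b': (1, 3), 'c': (2, 8), 'd': (4, 6)}
print(linear_scan(ivs, 2))   # with 2 registers, 'a' (farthest end) is spilled
```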
Register Allocation in SSA
SSA form simplifies register allocation:
- Interference: In SSA, two values interfere iff one is live at the definition of the other. The interference graph of SSA programs is always chordal → chordal graph coloring is polynomial (O(V + E)).
- φ-functions correspond to parallel copies. After coloring, φ-functions are replaced with moves (most eliminated by coalescing).
Instruction Scheduling
Reorder instructions to maximize instruction-level parallelism and minimize pipeline stalls.
The Problem
// Naive order: pipeline stalls on data dependencies
LOAD r1, [addr] // takes 3 cycles
ADD r2, r1, r3 // must wait for LOAD (stall 2 cycles)
MUL r4, r5, r6 // independent — could run during the stall!
// Scheduled order: independent work fills the load latency
LOAD r1, [addr] // starts loading
MUL r4, r5, r6 // independent, fills the latency
NOP // (or another independent instruction)
ADD r2, r1, r3 // r1 is now ready
List Scheduling
- Build a dependency graph (data dependencies, memory dependencies).
- Compute the priority of each instruction (typically: longest path to the end).
- Greedily schedule: at each cycle, pick the highest-priority ready instruction.
Complexity: optimal scheduling is NP-hard in general. List scheduling is a greedy heuristic that produces good results in practice.
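The three steps above can be sketched for a single-issue machine, assuming given latencies and dependence sets (the encoding is an assumption of this sketch):

```python
# Sketch of list scheduling: priority = longest latency path to the end,
# then each cycle greedily issue the highest-priority ready instruction.

def list_schedule(latency, deps):
    """deps[i] = set of instructions that must complete before i starts."""
    succs = {i: set() for i in latency}
    for i, ds in deps.items():
        for d in ds:
            succs[d].add(i)
    prio = {}
    def path(i):   # critical-path length from i to the end
        if i not in prio:
            prio[i] = latency[i] + max((path(s) for s in succs[i]), default=0)
        return prio[i]
    order, done_at, cycle = [], {}, 0
    remaining = set(latency)
    while remaining:
        ready = [i for i in remaining
                 if all(d in done_at and done_at[d] <= cycle
                        for d in deps.get(i, ()))]
        if ready:
            i = max(ready, key=path)
            order.append(i)
            done_at[i] = cycle + latency[i]
            remaining.remove(i)
        cycle += 1
    return order

# LOAD (3 cycles) feeds ADD; the independent MUL fills the load latency.
lat = {'LOAD': 3, 'ADD': 1, 'MUL': 1}
deps = {'ADD': {'LOAD'}}
print(list_schedule(lat, deps))   # ['LOAD', 'MUL', 'ADD']
```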
Phase Ordering
Instruction scheduling and register allocation interact:
- Scheduling before allocation: More freedom to reorder (more virtual registers). May increase register pressure.
- Scheduling after allocation: Fewer registers → less scheduling freedom. But respects actual register constraints.
Typical approach: Schedule → allocate → reschedule (three passes). Or integrate scheduling with allocation.
Calling Conventions
Rules for how functions call each other: parameter passing, return values, register saving.
System V AMD64 (Linux, macOS x86-64)
| Purpose | Registers |
|---|---|
| Integer arguments | RDI, RSI, RDX, RCX, R8, R9 (first 6) |
| Float arguments | XMM0-XMM7 (first 8) |
| Return value | RAX (integer), XMM0 (float) |
| Caller-saved | RAX, RCX, RDX, RSI, RDI, R8-R11 |
| Callee-saved | RBX, RBP, R12-R15 |
| Stack pointer | RSP (16-byte aligned at call) |
| Frame pointer | RBP (optional) |
Arguments: First 6 integer args in registers, remaining on stack.
ARM AAPCS (AArch64)
| Purpose | Registers |
|---|---|
| Arguments | X0-X7 (integer), V0-V7 (float) |
| Return value | X0 (integer), V0 (float) |
| Caller-saved | X0-X18 |
| Callee-saved | X19-X28 |
| Stack pointer | SP |
| Frame pointer | X29 (FP) |
| Link register | X30 (LR) |
Stack Frame Layout
High addresses
┌──────────────────────┐
│ Caller's frame │
├──────────────────────┤ ← old RBP (frame pointer)
│ Return address │ (pushed by CALL instruction)
├──────────────────────┤ ← new RBP
│ Saved callee-saved │ (RBX, R12-R15 if used)
│ registers │
├──────────────────────┤
│ Local variables │ (known size at compile time)
├──────────────────────┤
│ Spilled registers │ (virtual regs that didn't get physical regs)
├──────────────────────┤
│ Outgoing arguments │ (args for functions we call, if > 6)
├──────────────────────┤ ← RSP (stack pointer, 16-byte aligned)
Low addresses
Code Generation for Expressions
Stack Machine
Simple. Push operands, pop and apply operators.
// Expression: a + b * c
PUSH a
PUSH b
PUSH c
MUL // pop b, c; push b*c
ADD // pop a, b*c; push a+b*c
Easy to generate but poor performance (many memory accesses to the stack). Used in JVM bytecode, WebAssembly (conceptually).
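A tiny evaluator for this instruction sequence, written in Python as an illustration (the string encoding of instructions is an assumption of this sketch):

```python
# Sketch of a stack-machine evaluator for the PUSH/MUL/ADD sequence above,
# with variable values supplied in an environment.

def run(code, env):
    stack = []
    for instr in code:
        op, *args = instr.split()
        if op == 'PUSH':
            stack.append(env[args[0]])
        elif op == 'MUL':
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == 'ADD':
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack.pop()

program = ['PUSH a', 'PUSH b', 'PUSH c', 'MUL', 'ADD']
print(run(program, {'a': 2, 'b': 3, 'c': 4}))   # 14  (2 + 3*4)
```

Every operand passes through the explicit stack, which is exactly the memory-traffic cost that register machines avoid.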
Register Machine
Map values to registers. Better performance. Requires register allocation.
// Expression: a + b * c
MUL r1, rb, rc // r1 = b * c
ADD r0, ra, r1 // r0 = a + b * c
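The corresponding tree-walk code generator can be sketched as follows; the fresh names `t1`, `t2` stand for virtual registers that register allocation would later map to physical ones (the tuple IR is illustrative):

```python
# Sketch of register-machine code generation by tree walk: each operator
# result gets a fresh virtual register.

def gen(tree, out, counter=[0]):
    if isinstance(tree, str):
        return f"r{tree}"            # variable assumed to live in a register
    op, left, right = tree
    l, r = gen(left, out), gen(right, out)
    counter[0] += 1
    dst = f"t{counter[0]}"
    out.append(f"{'MUL' if op == '*' else 'ADD'} {dst}, {l}, {r}")
    return dst

code = []
gen(('+', 'a', ('*', 'b', 'c')), code)   # a + b*c
print(code)   # ['MUL t1, rb, rc', 'ADD t2, ra, t1']
```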
Peephole Optimization
Post-code-generation local optimization. Scan a small window (peephole) of instructions and replace patterns with better sequences.
// Redundant load after store
STORE [addr], r1     →   STORE [addr], r1
LOAD r2, [addr]          MOV r2, r1     // avoid memory access
// Redundant jump
JUMP L1 → (remove — L1 is the next instruction)
L1:
// Strength reduction in generated code
MUL r1, r2, 2 → SHL r1, r2, 1 // shift instead of multiply
// Null sequences
ADD r1, r1, 0 → (remove)
MOV r1, r1 → (remove)
Cost: Very cheap (single pass). Catches patterns missed by higher-level optimizations.
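These patterns can be sketched as a single-pass rewriter over instruction tuples (the tuple encoding is illustrative; the redundant-jump pattern is omitted since it needs label information):

```python
# Sketch of a peephole pass over (op, dst, src...) tuples, implementing the
# store/load, strength-reduction, and null-sequence patterns above.

def peephole(code):
    out = []
    for instr in code:
        op, *args = instr
        # Null sequences: ADD r, r, 0 and MOV r, r
        if op == 'ADD' and args[1] == args[0] and args[2] == 0:
            continue
        if op == 'MOV' and args[0] == args[1]:
            continue
        # Strength reduction: MUL r, x, 2 -> SHL r, x, 1
        if op == 'MUL' and args[2] == 2:
            instr = ('SHL', args[0], args[1], 1)
        # Redundant load after store to the same address
        if op == 'LOAD' and out and out[-1][0] == 'STORE' and out[-1][1] == args[1]:
            instr = ('MOV', args[0], out[-1][2])
        out.append(instr)
    return out

code = [('STORE', 'addr', 'r1'), ('LOAD', 'r2', 'addr'),
        ('MUL', 'r3', 'r4', 2), ('ADD', 'r5', 'r5', 0), ('MOV', 'r6', 'r6')]
print(peephole(code))
# [('STORE', 'addr', 'r1'), ('MOV', 'r2', 'r1'), ('SHL', 'r3', 'r4', 1)]
```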
Applications in CS
- Compiler development: Code generation is where the rubber meets the road — the quality of generated code determines compiler performance.
- JIT compilers: Fast code generation (linear scan, simple scheduling) enables low-latency JIT compilation.
- Embedded systems: Code generation for constrained targets (limited registers, no FPU, specific ISA extensions).
- Performance engineering: Understanding code generation helps predict and improve performance of generated code.
- Cross-compilation: Code generation back end determines which targets a compiler supports.