
Advanced Pipelining

Modern high-performance processors go far beyond the simple 5-stage pipeline, executing multiple instructions per cycle, out of program order, and speculatively.

Superscalar Execution

A superscalar processor issues multiple instructions per cycle from a single instruction stream.

Cycle 1: Issue ADD + MUL (if independent)
Cycle 2: Issue LOAD + SUB + AND
Cycle 3: ...

Instruction-Level Parallelism (ILP): The degree to which instructions in a program can be executed simultaneously. Limited by data dependencies, control dependencies, and resource constraints.

Issue width: The maximum number of instructions issued per cycle (2-way, 4-way, 6-way, 8-way). Modern high-performance CPUs: 4-8 wide.

In-order superscalar: Issue multiple instructions per cycle, but in program order. Stalls if any instruction in the issue window has a dependency.

Out-of-order superscalar: Reorder instructions dynamically to maximize parallelism.
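
The issue-group rule for an in-order superscalar can be sketched as a toy model (illustrative only, not cycle-accurate; it checks only RAW dependencies within the pair):

```python
# Toy model of a 2-wide in-order issue stage. An instruction is
# (name, dest, sources); the pair dual-issues only when the second
# instruction does not read the first's destination (no RAW in the pair).

def dual_issue(window):
    """Return the instructions issued this cycle from a 2-entry window."""
    first = window[0]
    if len(window) == 1:
        return [first]
    second = window[1]
    _, dest, _ = first
    _, _, srcs2 = second
    if dest in srcs2:          # RAW dependence: second waits a cycle
        return [first]
    return [first, second]

add = ("ADD", "x3", ("x4", "x5"))
mul = ("MUL", "x6", ("x3", "x7"))   # reads x3, so it depends on ADD
sub = ("SUB", "x8", ("x1", "x2"))   # independent of ADD

print(len(dual_issue([add, sub])))  # 2: independent pair issues together
print(len(dual_issue([add, mul])))  # 1: dependent pair stalls the second op
```

A real issue stage also checks structural hazards (functional-unit and port availability), which this sketch omits.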

Out-of-Order Execution

Motivation

In-order execution stalls the pipeline when an instruction's operands aren't ready, even if later instructions are independent and could execute.

LOAD x1, [x2]      // Cache miss — 100 cycles!
ADD  x3, x4, x5    // Independent — could execute now but stalls in-order
MUL  x6, x1, x7    // Depends on LOAD — must wait

Out-of-order execution allows the ADD to execute during the LOAD's cache miss.


Tomasulo's Algorithm

Developed by Robert Tomasulo for the IBM System/360 Model 91 (1967). The foundation of modern out-of-order execution.

Key structures:

Reservation stations: Hold instructions waiting for operands. Each functional unit has its own reservation stations.

Common Data Bus (CDB): Broadcasts results to all reservation stations. Waiting instructions grab their operands.

Register renaming (implicit): Tags (reservation station IDs) replace register names, eliminating WAR and WAW hazards.

Algorithm Steps

  1. Issue: Decode instruction. Allocate reservation station. Read available operands from register file. If operand not ready, record the tag of the producing instruction.

  2. Execute: When all operands are available (forwarded via CDB), execute the operation.

  3. Write Result: Broadcast result on CDB with tag. All waiting reservation stations capture the value.
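
The tag-based wakeup at the heart of steps 1-3 can be sketched as follows (a minimal model of one reservation station; field names are assumptions, not IBM 360/91 detail):

```python
# A reservation station operand is a (value, tag) pair: tag None means the
# value is ready; otherwise the tag names the producing instruction's station.

class RS:
    def __init__(self, op, src1, src2):
        self.op = op
        self.v1, self.q1 = src1
        self.v2, self.q2 = src2

    def ready(self):
        """Executable once both operand tags have been resolved."""
        return self.q1 is None and self.q2 is None

    def capture(self, tag, value):
        """Snoop the Common Data Bus: grab a broadcast result if tags match."""
        if self.q1 == tag:
            self.v1, self.q1 = value, None
        if self.q2 == tag:
            self.v2, self.q2 = value, None

# MUL waits for a LOAD's result, identified by the tag "RS_LOAD1"
mul = RS("MUL", (None, "RS_LOAD1"), (7, None))
print(mul.ready())            # False: still waiting on the LOAD

mul.capture("RS_LOAD1", 42)   # LOAD completes and broadcasts on the CDB
print(mul.ready(), mul.v1)    # True 42: operand captured, MUL can execute
```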

Reorder Buffer (ROB)

Tomasulo's algorithm by itself doesn't guarantee in-order completion, which is required for precise exceptions.

The ROB is a circular buffer that tracks instructions in program order:

┌────────────────────────────────────────┐
│ Entry │ Instr │ Dest │ Value │ Done?   │
│   0   │ ADD   │  x1  │  42   │  yes    │ ← commit pointer
│   1   │ LOAD  │  x2  │  ---  │  no     │
│   2   │ MUL   │  x3  │  ---  │  no     │
│   3   │ SUB   │  x4  │  17   │  yes    │
│   4   │       │      │       │         │ ← issue pointer
└────────────────────────────────────────┘

Commit (retire): When the oldest instruction in the ROB is done, write its result to the architectural register file and remove it. Commit is always in order — preserving precise exceptions.

Flush on mispredict: If a branch misprediction is detected, flush all ROB entries after the branch. Restore register state to the point before the branch.
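
The commit and flush rules can be sketched with a simplified ROB (a plain in-order list rather than a fixed-size circular buffer; no register state restore):

```python
# Simplified reorder buffer: entries kept in program order, oldest first.
from collections import deque

class ROB:
    def __init__(self):
        self.entries = deque()          # each: {"instr": str, "done": bool}

    def issue(self, instr):
        self.entries.append({"instr": instr, "done": False})

    def complete(self, instr):
        for e in self.entries:
            if e["instr"] == instr:
                e["done"] = True

    def commit(self):
        """Retire from the head only while the oldest entry is done (in order)."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["instr"])
        return retired

    def flush_after(self, branch):
        """On a mispredict, squash every entry younger than the branch."""
        for i, e in enumerate(self.entries):
            if e["instr"] == branch:
                while len(self.entries) > i + 1:
                    self.entries.pop()
                return

rob = ROB()
for i in ["ADD", "LOAD", "BEQ", "MUL", "SUB"]:
    rob.issue(i)
rob.complete("ADD"); rob.complete("MUL")   # MUL is done but younger than LOAD
print(rob.commit())                        # ['ADD']: MUL can't retire early
rob.flush_after("BEQ")                     # mispredict: MUL and SUB squashed
print([e["instr"] for e in rob.entries])   # ['LOAD', 'BEQ']
```

Note how the finished MUL still cannot retire past the unfinished LOAD: that in-order commit rule is what keeps exceptions precise.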

Modern OoO Pipeline

Fetch → Decode → Rename → Dispatch → Issue → Execute → Complete → Commit
                  ↓                     ↑
           Register Alias Table    Reservation Stations
           (maps arch→phys regs)   (wait for operands)

Physical register file: Much larger than the architectural register file (e.g., 256 physical registers for 32 architectural registers). Enables renaming.

Register renaming (explicit): Map architectural registers to physical registers using a Register Alias Table (RAT). Eliminates WAR and WAW hazards completely. Each instruction gets unique physical destination registers.
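
Explicit renaming through a RAT can be sketched in a few lines (register counts and names are illustrative):

```python
# Each architectural write allocates a fresh physical register from a free
# list, so two writes to the same architectural register never collide
# (eliminating WAW), and readers of the old value keep their old mapping
# (eliminating WAR).

free_list = [f"p{i}" for i in range(32, 40)]   # free physical registers
rat = {f"x{i}": f"p{i}" for i in range(32)}    # arch -> phys mapping

def rename(dest, srcs):
    """Rename one instruction: read sources via the RAT, then remap the dest."""
    phys_srcs = [rat[s] for s in srcs]
    new_dest = free_list.pop(0)
    rat[dest] = new_dest
    return new_dest, phys_srcs

# WAW example: two back-to-back writes to x1 get distinct physical registers
d1, _ = rename("x1", ["x2", "x3"])
d2, _ = rename("x1", ["x4", "x5"])
print(d1, d2, d1 != d2)   # p32 p33 True
```

A real rename stage also checkpoints the RAT at branches so the mapping can be restored on a mispredict; that bookkeeping is omitted here.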

Speculative Execution

Execute instructions before knowing if they should execute (branch prediction).

Benefits: Hide latency of branches. Keep the pipeline full.

Requirement: Must be able to undo speculation if wrong. The ROB provides this — speculative results are buffered in the ROB and only committed when confirmed correct.

What is Speculated

  • Branch direction: Predicted by branch predictor
  • Branch target: Predicted by BTB
  • Load addresses: Loads speculated not to alias with prior stores (memory disambiguation)
  • Data values: Value prediction (experimental)
  • Prefetch: Speculative data prefetch based on access patterns

Security Implications

Spectre (2018): Speculative execution can access data that should be protected. Even though results are "rolled back," the cache side effects of speculative memory accesses remain — leaking information through timing side channels.

Mitigations: Retpoline (return trampoline), IBRS/IBPB (indirect branch control), STIBP (single-thread indirect branch), speculative load hardening, fences. All carry performance cost.

VLIW (Very Long Instruction Word)

The compiler explicitly schedules multiple operations into a single wide instruction:

VLIW instruction (128 bits):
[ALU op 1 | ALU op 2 | MEM op | Branch op]
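
A toy bundler shows the slot-packing idea (the 4-slot format here is hypothetical; a real VLIW compiler also honors dependencies and latencies, which this greedy sketch ignores):

```python
# Pack independent ops into one bundle, one slot per unit type.
# Slots left unfilled become NOPs, illustrating VLIW code bloat.

SLOTS = ["ALU", "ALU", "MEM", "BR"]

def bundle(ops):
    """Greedily place (unit, text) ops into one bundle; return bundle + leftovers."""
    slots = ["NOP"] * len(SLOTS)
    leftover = []
    for unit, text in ops:
        for i, s in enumerate(SLOTS):
            if s == unit and slots[i] == "NOP":
                slots[i] = text
                break
        else:
            leftover.append((unit, text))   # no free slot of that type
    return slots, leftover

ops = [("ALU", "ADD x1,x2,x3"), ("MEM", "LD x4,[x5]"),
       ("ALU", "SUB x6,x7,x8"), ("ALU", "AND x9,x9,x1")]  # 3rd ALU op won't fit
b, rest = bundle(ops)
print(b)      # ['ADD x1,x2,x3', 'SUB x6,x7,x8', 'LD x4,[x5]', 'NOP']
print(rest)   # [('ALU', 'AND x9,x9,x1')] spills to the next bundle
```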

Advantages: Simple hardware (no dependency checking, no reordering). Compiler does the hard work.

Disadvantages: Binary compatibility issues (recompile for new hardware). Code bloat (NOPs for unfilled slots). Hard for compiler to find ILP in general code. Poor at handling variable-latency operations (cache misses).

Examples: Intel Itanium (IA-64/EPIC), TI C6000 DSPs, Transmeta Crusoe, AMD TeraScale GPU shader cores (VLIW5/VLIW4).

Status: VLIW has largely failed for general-purpose CPUs but thrives in DSPs and specialized processors.

Dynamic Scheduling Details

Memory Disambiguation

Problem: A store and a later load might alias (access the same address). The load can't execute until we know the store's address.

Store buffer: Holds uncommitted stores with addresses and data.

Load bypass: If a load's address doesn't match any store in the store buffer, it can proceed. If it matches, forward the store's data.

Speculative loads: Execute loads before knowing all prior store addresses. Verify later. If wrong, replay the load.
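
The store-buffer lookup can be sketched as follows (simplified to word-granular addresses with no partial overlap, and without the speculative-replay machinery):

```python
# Uncommitted stores sit in a buffer, youngest last. A load checks the buffer
# before memory: a matching address forwards the store's data; no match means
# the load safely bypasses all pending stores.

store_buffer = []   # list of (addr, data), program order

def store(addr, data):
    store_buffer.append((addr, data))

def load(addr, memory):
    """Forward from the youngest matching store, else read memory."""
    for a, d in reversed(store_buffer):
        if a == addr:
            return d          # store-to-load forwarding
    return memory[addr]       # no alias: load bypasses pending stores

memory = {0x100: 1, 0x200: 2}
store(0x100, 99)              # uncommitted store to 0x100
print(load(0x100, memory))    # 99: forwarded from the store buffer
print(load(0x200, memory))    # 2: no alias, reads memory directly
```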

Branch Target Buffer (BTB)

Caches target addresses of taken branches. Indexed by PC. On fetch, if BTB hits and predictor says taken, redirect fetch to the target in the same cycle.
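
A direct-mapped BTB can be sketched as a small tag-checked table (the 16-entry size and index function are assumptions for illustration):

```python
# Direct-mapped BTB: indexed by low PC bits, tag-checked against the full PC.

BTB_SIZE = 16
btb = [None] * BTB_SIZE     # each entry: (tag, target) or None

def btb_update(pc, target):
    btb[pc % BTB_SIZE] = (pc, target)

def btb_lookup(pc):
    """On fetch: a hit returns the predicted target, a miss returns None."""
    entry = btb[pc % BTB_SIZE]
    if entry and entry[0] == pc:
        return entry[1]
    return None

btb_update(0x4000, 0x5000)            # taken branch at 0x4000 -> 0x5000
print(hex(btb_lookup(0x4000)))        # 0x5000: hit, fetch redirects
print(btb_lookup(0x4010))             # None: same index, tag mismatch (alias)
```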

Return Address Stack (RAS)

Predicts return addresses. On call (JAL), push return address. On return (JALR), pop and use as predicted target. Very high accuracy (>99%).
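
The push/pop discipline is simple enough to sketch directly (the 16-entry depth is an assumed, typical size):

```python
# Return address stack: push on call, pop on return. Calls and returns nest,
# so a stack predicts return targets almost perfectly.

RAS_DEPTH = 16
ras = []

def on_call(return_pc):
    if len(ras) == RAS_DEPTH:
        ras.pop(0)            # overflow: silently drop the oldest entry
    ras.append(return_pc)

def on_return():
    """Pop the predicted return target; an empty stack gives no prediction."""
    return ras.pop() if ras else None

on_call(0x1004)               # JAL at 0x1000: push the return address
on_call(0x2008)               # nested call
print(hex(on_return()))       # 0x2008: innermost return predicted first
print(hex(on_return()))       # 0x1004
```

Mispredictions come mainly from overflow on deep recursion and from code that mismatches calls and returns (e.g. hand-rolled tail calls).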

Instruction-Level Parallelism Limits

Data Dependencies

True dependencies (RAW) limit ILP. The dataflow graph shows which instructions can execute in parallel.

ILP in typical programs: 2-5 instructions per cycle (despite 4-8 issue width). Limited by dependencies, branches, and memory latency.
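
An upper bound on ILP falls out of the dataflow graph: with unlimited issue width and unit latencies, it is the instruction count divided by the critical-path length. A sketch:

```python
# ILP upper bound from a dataflow graph, assuming unlimited issue width and
# unit latency per instruction.

def ilp_bound(deps, n):
    """deps[i] = list of earlier (lower-index) instructions i depends on (RAW)."""
    depth = [0] * n
    for i in range(n):
        # depth = longest dependence chain ending at instruction i
        depth[i] = 1 + max((depth[j] for j in deps[i]), default=0)
    critical_path = max(depth)
    return n / critical_path

# A chain of 4 dependent instructions: no parallelism at all
print(ilp_bound({0: [], 1: [0], 2: [1], 3: [2]}, 4))   # 1.0
# 4 fully independent instructions: all could issue in one cycle
print(ilp_bound({0: [], 1: [], 2: [], 3: []}, 4))      # 4.0
```

Real hardware extracts less than this bound because of branches, memory latency, and the finite window discussed next.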

Window Size

The out-of-order window (ROB size) limits how far ahead the processor can look for independent instructions. Modern: 200-400 entries (Apple M-series: 600+).

Larger window → more ILP extracted → but more power and area.

Memory-Level Parallelism (MLP)

Overlap multiple cache misses. Even if each miss takes 100 cycles, keeping 4 independent misses in flight costs roughly 100 cycles total instead of 400, about 4× the effective memory throughput.

Techniques: Non-blocking caches, out-of-order execution, prefetching, speculative loads.
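
The back-of-envelope arithmetic behind that 4× figure:

```python
# Serial vs. overlapped miss handling (idealized: fully independent misses,
# no bandwidth limit at the memory controller).

MISS_LATENCY = 100   # cycles per miss
misses = 4

serial = misses * MISS_LATENCY   # blocking cache: one miss at a time
overlapped = MISS_LATENCY        # non-blocking cache: all misses in flight
print(serial, overlapped, serial / overlapped)   # 400 100 4.0
```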

Applications in CS

  • Performance optimization: Understanding OoO execution helps explain why some code restructurings improve performance (increasing instruction-level parallelism, reducing dependency chains).
  • Compiler design: Compilers must schedule instructions to maximize ILP. Software pipelining for loops. Register allocation interacts with renaming.
  • Security: Spectre/Meltdown attacks exploit speculative execution. Understanding microarchitecture is essential for security analysis.
  • Benchmarking: IPC (instructions per cycle) reveals how well the hardware extracts parallelism from the code. Low IPC → likely memory-bound or branch-misprediction-bound.
  • Architecture research: Exploring wider issue, deeper OoO, better speculation. Diminishing returns drive focus toward thread-level and data-level parallelism.