Pipelining

Pipelining is the key technique for high-throughput instruction execution. Like an assembly line, multiple instructions are in different stages of execution simultaneously.

Pipeline Concept

The Laundry Analogy

Without pipelining: Wash → Dry → Fold → Store, then start next load. With pipelining: Start a new load every stage-time. All machines busy simultaneously.

5-stage pipeline with hazard examples

Five-Stage Pipeline (Classic RISC)

Time →  1    2    3    4    5    6    7    8
I1:    IF   ID   EX   MEM  WB
I2:         IF   ID   EX   MEM  WB
I3:              IF   ID   EX   MEM  WB
I4:                   IF   ID   EX   MEM  WB

Stage	Name	Activity
IF	Instruction Fetch	Read instruction from I-cache; PC ← PC + 4
ID	Instruction Decode	Decode opcode; read registers; sign-extend immediate
EX	Execute	ALU operation; compute branch target; compute address
MEM	Memory Access	Read/write data cache (load/store only)
WB	Write Back	Write result to register file

Pipeline Registers

Between each stage, pipeline registers store intermediate values:

IF/ID: Instruction, PC+4
ID/EX: Register values, immediate, control signals, PC+4
EX/MEM: ALU result, write data, control signals
MEM/WB: Memory data, ALU result, control signals

These registers are clocked — at each edge, all stages advance simultaneously.

Performance

Ideal speedup: S = number of stages (k). With k = 5, throughput is 5× a non-pipelined processor.

Actual speedup: Less than k due to:

Pipeline register overhead (adds to clock period)
Hazards causing stalls
Unequal stage delays (clock limited by slowest stage)

Throughput: Ideally 1 instruction per cycle (CPI = 1).

Latency: Each individual instruction still takes k cycles. Pipelining improves throughput, not latency.

Pipeline Hazards

Situations that prevent the next instruction from executing in the next clock cycle.

Structural Hazards

Two instructions need the same hardware resource simultaneously.

Example: Single memory for both instructions and data. IF and MEM stages conflict.

Solution: Separate instruction and data caches (Harvard architecture). Modern processors always do this.

Example: Single-port register file — simultaneous read (ID) and write (WB).

Solution: Write in first half of cycle, read in second half. Or multi-port register file.

Data Hazards

An instruction depends on the result of a previous instruction still in the pipeline.

Types:

RAW (Read After Write): Most common. Instruction reads a register before a prior instruction writes it.
WAR (Write After Read): Later instruction writes before earlier reads. Occurs in out-of-order execution.
WAW (Write After Write): Two instructions write the same register. Occurs in out-of-order execution.

Example (RAW):

ADD x1, x2, x3    // Writes x1 in WB (cycle 5)
SUB x4, x1, x5    // Reads x1 in ID (cycle 3) — x1 not yet written!

Forwarding (Bypassing)

Route the result directly from where it's produced to where it's needed, without waiting for the write-back stage.

EX/MEM forwarding: Result available after EX → forward to next instruction's EX input
MEM/WB forwarding: Result available after MEM → forward to instruction 2 stages later

Hardware: Multiplexers at ALU inputs. Forwarding unit detects hazards and selects the correct input source.

Time →  1    2    3    4    5
ADD:   IF   ID   EX   MEM  WB
SUB:        IF   ID   EX←──┘ (forward from EX/MEM)
                      ↑
                  ALU input comes from forward path, not register file

Load-Use Hazard

A load followed immediately by an instruction using the loaded value cannot be resolved by forwarding alone — the data isn't available until after the MEM stage.

LW  x1, 0(x2)    // Data available after MEM (cycle 4)
ADD x3, x1, x4   // Needs x1 in EX (cycle 3) — too early!

Solution: Insert a pipeline bubble (1-cycle stall). The hardware detects the load-use hazard and inserts a NOP.

Compiler optimization: Schedule an independent instruction between the load and its use ("load delay slot scheduling") to avoid the stall.

Control Hazards (Branch Hazards)

The pipeline doesn't know which instruction to fetch next until the branch is resolved.

Problem: Branch decision is made in EX (cycle 3). But we've already fetched 2 instructions that may be wrong (if branch taken).

BEQ x1, x2, target  // Branch decided in EX
IF   ID   EX ← branch decided here
     IF   ID  ← fetched speculatively
          IF  ← fetched speculatively (may be wrong)

Solutions

1. Stall (flush): Always stall until branch is resolved. 2-cycle penalty. Simple but slow.

2. Assume not taken: Always fetch the next sequential instruction. If branch is taken, flush the incorrect instructions (2-cycle penalty). If not taken, no penalty. Works well for loops (most backward branches are taken, most forward are not).

3. Delayed branch: The instruction after the branch (in the "delay slot") always executes regardless of branch outcome. Compiler fills the slot with a useful instruction. Used in MIPS. Considered a design mistake — complicates the ISA permanently.

4. Branch prediction: Predict the outcome and speculatively fetch along the predicted path. If wrong, flush and recover. See below.

Branch Prediction

Static Prediction

Always not taken: Fetch PC + 4. Penalty on taken branches.
Always taken: Fetch branch target. Penalty on not-taken branches.
Backward taken, forward not taken (BTFNT): Backward branches are loops (usually taken). Forward branches are if-statements (often not taken). ~65% accuracy.

Dynamic Prediction

Use runtime history to predict future branches.

Branch History Table (BHT)

Index by low-order PC bits. Each entry stores a predictor.

1-bit predictor: Remember last outcome. Problem: always mispredicts twice on loop exit.

2-bit saturating counter state machine for branch prediction

2-bit saturating counter: States: Strongly Not Taken (00), Weakly Not Taken (01), Weakly Taken (10), Strongly Taken (11). Takes two mispredictions to change prediction. ~85% accuracy.

       Taken           Taken
SNT(00) → WNT(01) → WT(10) → ST(11)
  ↑ Not Taken    ↑ Not Taken    ↑
  └──────────────┘              │
                   Not Taken    │
                   WT(10) ← ST(11)

Branch Target Buffer (BTB)

Caches the target address of recently taken branches. Indexed by PC. If the branch is predicted taken and the target is in the BTB, we can fetch from the target immediately (no delay for address computation).

Correlating Predictors

Use the outcome of recent branches to predict the current branch.

(m, n) predictor: Use the last m branch outcomes (global history) to select among 2ᵐ separate n-bit predictors per branch.

Example: (2, 2) predictor: 4 separate 2-bit counters per branch, selected by the last 2 branch outcomes. ~92% accuracy.

Tournament Predictors

Combine multiple predictors with a selector that learns which predictor works better for each branch.

Alpha 21264: Local (per-branch history) + global (correlating) + selector. ~95%+ accuracy.

Neural Branch Predictors

Use a perceptron (simple neural network) to predict branches. Each input is a past branch outcome. Weights learned online.

TAGE predictor: Tagged geometric history length predictor. Multiple tables indexed by different history lengths. State of the art. Used in modern Intel/AMD processors. >99% accuracy.

Misprediction Penalty

When a prediction is wrong:

Flush incorrectly fetched/decoded instructions
Restart fetch at the correct address
Wasted cycles = pipeline depth to the branch resolution stage

Modern processors: 15-20 cycle misprediction penalty. This is why high accuracy matters enormously.

Pipeline Performance

CPI Breakdown

CPI = 1 + stall_cycles_per_instruction

CPI = 1 + (load_stalls × load_fraction) + (branch_penalty × branch_fraction × mispredict_rate)

Example: 20% loads with 50% load-use stalls (1 cycle each). 15% branches with 5% mispredict rate, 2-cycle penalty.

CPI = 1 + (0.2 × 0.5 × 1) + (0.15 × 0.05 × 2) = 1 + 0.10 + 0.015 = 1.115

Speedup

Speedup = CPI_unpipelined / CPI_pipelined × (Clock_pipelined / Clock_unpipelined)

Pipeline register overhead slightly increases clock period but dramatically improves throughput.

Deeper Pipelines

More stages → higher clock frequency but:

More hazards (longer forwarding paths, higher branch penalty)
More pipeline register overhead
Diminishing returns

Historical trend: Pentium 4 had 31 stages (too deep — high misprediction penalty). Modern designs: 14-20 stages.

Applications in CS

Compiler optimization: Instruction scheduling to avoid stalls. Loop unrolling to reduce branch overhead. Software pipelining for loops.
Performance analysis: Understanding pipeline behavior helps interpret profiler data (IPC, branch misprediction rate, cache miss stalls).
Operating systems: Context switch must save/restore pipeline state. Interrupt handling flushes the pipeline.
Security: Spectre exploits speculative execution (branch prediction + cache side channel). Meltdown exploits out-of-order execution. Understanding the pipeline is essential for understanding these attacks.
Embedded systems: Simpler pipelines (3-5 stages) for deterministic timing. WCET (worst-case execution time) analysis considers pipeline behavior.