Datapath and Control
The datapath performs data operations (arithmetic, memory access, register read/write). The control unit orchestrates the datapath by generating the right signals at the right time.
Single-Cycle Datapath
Each instruction completes in one clock cycle. The clock period must accommodate the slowest instruction.
Components
┌──────┐    ┌──────────┐    ┌─────┐    ┌────────┐    ┌──────────┐
│Instr │    │ Register │    │     │    │  Data  │    │ Register │
│Memory│───→│   File   │───→│ ALU │───→│ Memory │───→│   File   │
│(IM)  │    │   (RF)   │    │     │    │  (DM)  │    │   (WB)   │
└──────┘    └──────────┘    └─────┘    └────────┘    └──────────┘
   ↑             ↑
   │             │
   PC         Control
Instruction Execution Steps
R-type (ADD, SUB, AND, OR):
- Fetch instruction from IM at address PC
- Read two source registers from RF
- ALU performs the operation
- Write result to destination register in RF
- PC ← PC + 4
I-type Load (LW):
- Fetch instruction
- Read base register from RF
- ALU computes address: base + sign-extended offset
- Read data from DM at computed address
- Write loaded data to destination register
- PC ← PC + 4
S-type Store (SW):
- Fetch instruction
- Read base register and data register from RF
- ALU computes address: base + sign-extended offset
- Write data to DM at computed address
- PC ← PC + 4
B-type Branch (BEQ):
- Fetch instruction
- Read two registers from RF
- ALU compares (subtract, check zero flag)
- If condition met: PC ← PC + sign-extended offset × 2
- If not: PC ← PC + 4
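The four execution sequences above can be sketched as a single function that performs one instruction per call, much as a single-cycle datapath does one instruction per clock. This is a minimal illustration, not a faithful RISC-V model: the tuple layout `(op, rd, rs1, rs2, imm)`, the names `step` and `write_reg`, and the dict-based data memory are all assumptions made for the example, `imm` is taken to be the final sign-extended byte offset, and results are not truncated to the register width.

```python
def write_reg(regs, rd, value):
    """Write to the register file; x0 stays hardwired to zero."""
    if rd != 0:
        regs[rd] = value


def step(pc, regs, dmem, instr):
    """Execute one already-decoded instruction and return the next PC.

    instr is a hypothetical tuple (op, rd, rs1, rs2, imm).
    regs is a list of 32 ints; dmem maps byte addresses to words.
    """
    op, rd, rs1, rs2, imm = instr
    a, b = regs[rs1], regs[rs2]        # read source registers from RF
    next_pc = pc + 4                   # default: fall through

    if op in ("add", "sub", "and", "or"):      # R-type
        result = {"add": a + b, "sub": a - b, "and": a & b, "or": a | b}[op]
        write_reg(regs, rd, result)
    elif op == "lw":                           # load: address = base + offset
        write_reg(regs, rd, dmem.get(a + imm, 0))
    elif op == "sw":                           # store: rs2 holds the data
        dmem[a + imm] = b
    elif op == "beq":                          # branch: subtract, test for zero
        if a - b == 0:
            next_pc = pc + imm
    else:
        raise ValueError(f"unsupported op: {op}")
    return next_pc
```

Calling `step` repeatedly with the returned PC walks through a program one instruction at a time; every instruction class takes exactly one call, whatever work it does.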
Datapath Control Signals
| Signal | Purpose |
|---|---|
| RegWrite | Enable writing to register file |
| ALUSrc | Select ALU input: register or immediate |
| ALUOp | Select ALU operation (add, sub, and, etc.) |
| MemRead | Enable reading from data memory |
| MemWrite | Enable writing to data memory |
| MemToReg | Select write-back source: ALU result or memory data |
| Branch | Enable branch (PC update from branch target) |
| Jump | Force PC to jump target |
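As a rough illustration of how these signals are set per instruction class, the sketch below encodes a conventional textbook-style assignment as a lookup table. The table name `CONTROL`, the function `control_signals`, and the class labels are assumptions for the example. ALUOp and Jump are left out for brevity, and don't-care values (such as MemToReg for a store) are written as 0.

```python
# One row per instruction class. 1 = asserted, 0 = deasserted (or don't-care).
CONTROL = {
    "r-type": dict(RegWrite=1, ALUSrc=0, MemRead=0, MemWrite=0, MemToReg=0, Branch=0),
    "lw":     dict(RegWrite=1, ALUSrc=1, MemRead=1, MemWrite=0, MemToReg=1, Branch=0),
    "sw":     dict(RegWrite=0, ALUSrc=1, MemRead=0, MemWrite=1, MemToReg=0, Branch=0),
    "beq":    dict(RegWrite=0, ALUSrc=0, MemRead=0, MemWrite=0, MemToReg=0, Branch=1),
}


def control_signals(instr_class):
    """Return the control word for an instruction class (pure lookup)."""
    return CONTROL[instr_class]
```

A pure lookup mirrors the single-cycle case, where the opcode alone determines every signal.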
Performance
Clock period ≥ longest instruction delay.
Typically the load instruction is the critical path, because it uses every unit in sequence: tLW = tIM + tRF_read + tALU + tDM + tRF_write.
Problem: Simple instructions (ADD) take the same time as complex ones (LW). The clock is limited by the worst case, so most instructions leave part of every cycle idle.
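To make the critical-path argument concrete, the sketch below sums unit delays along the path each instruction class uses and takes the maximum. The delay values in `DELAY_PS` are illustrative assumptions, not measurements of any real design, and the names `PATHS`, `path_delay`, and `clock_period` are made up for the example.

```python
# Assumed unit delays in picoseconds (illustrative only).
DELAY_PS = {"IM": 200, "RF_read": 100, "ALU": 200, "DM": 200, "RF_write": 100}

# Units each instruction class passes through, in order.
PATHS = {
    "add": ["IM", "RF_read", "ALU", "RF_write"],
    "lw":  ["IM", "RF_read", "ALU", "DM", "RF_write"],
    "sw":  ["IM", "RF_read", "ALU", "DM"],
    "beq": ["IM", "RF_read", "ALU"],
}


def path_delay(instr):
    """Total delay of the units used by one instruction class."""
    return sum(DELAY_PS[unit] for unit in PATHS[instr])


def clock_period():
    """A single-cycle clock must cover the slowest instruction."""
    return max(path_delay(instr) for instr in PATHS)
```

Under these assumed delays the load path is 800 ps while ADD needs only 600 ps, so every ADD wastes a quarter of its cycle.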
Multi-Cycle Datapath
Break each instruction into multiple shorter steps, one per clock cycle. Different instructions take different numbers of cycles.
Steps (5 stages)
- Instruction Fetch (IF): IR ← IM[PC]; PC ← PC + 4
- Instruction Decode / Register Read (ID): Read registers; decode control signals; compute branch target
- Execute (EX): ALU operation (arithmetic, address calculation, comparison)
- Memory Access (MEM): Read/write data memory (only for load/store)
- Write Back (WB): Write result to register file
Cycle Counts
| Instruction | Cycles | Steps Used |
|---|---|---|
| R-type (ADD) | 4 | IF, ID, EX, WB |
| Load (LW) | 5 | IF, ID, EX, MEM, WB |
| Store (SW) | 4 | IF, ID, EX, MEM |
| Branch (BEQ) | 3 | IF, ID, EX |
| Jump (J) | 3 | IF, ID, EX |
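The cycle counts in the table can be combined with an instruction mix to estimate the average CPI of a multi-cycle design. In the sketch below, the cycle counts come from the table, but the frequencies in `MIX` are an assumed workload chosen only for illustration, and `average_cpi` is a hypothetical helper name.

```python
# Cycles per instruction class, taken from the table above.
CYCLES = {"r-type": 4, "load": 5, "store": 4, "branch": 3, "jump": 3}

# Assumed instruction mix (fractions of executed instructions).
MIX = {"r-type": 0.45, "load": 0.25, "store": 0.10, "branch": 0.15, "jump": 0.05}


def average_cpi(cycles, mix):
    """Weighted average of per-class cycle counts."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9, "frequencies must sum to 1"
    return sum(cycles[name] * freq for name, freq in mix.items())
```

With this mix the average comes out to about 4.05 cycles per instruction; a load-heavy workload would push it closer to 5.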
Advantages over Single-Cycle
- Shorter clock period (each stage is shorter than the longest instruction)
- Shared hardware (one ALU, one memory — used in different cycles)
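- CPI (cycles per instruction) varies by instruction class, so simple instructions finish in fewer cycles. CPI is higher than the single-cycle value of 1, but the average time per instruction can be lower because the clock period is much shorter.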
Disadvantages
- More complex control (FSM or microcode)
- Still sequential — next instruction waits until current one finishes
Control Unit
Hardwired Control
Control signals generated by combinational logic based on opcode and current state.
Single-cycle: Pure combinational. Opcode → control signals (one-level decode).
Multi-cycle: FSM. Current state + opcode → control signals + next state.
Opcode + State → Control Logic → Control Signals + Next State
Advantages: Fast, efficient for simple ISAs.
Disadvantages: Hard to modify, complex for large ISAs.
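The multi-cycle FSM can be sketched as a next-state function of the current state and the instruction class. The state names reuse the five steps listed earlier; the function names `next_state` and `states_for` and the class labels are assumptions for this example, and the control signals each state would assert are omitted.

```python
def next_state(state, instr_class):
    """Next FSM state given the current state and the instruction class."""
    if state == "IF":
        return "ID"
    if state == "ID":
        return "EX"
    if state == "EX":
        if instr_class in ("lw", "sw"):
            return "MEM"
        if instr_class == "r-type":
            return "WB"
        return "IF"                      # branch and jump finish after EX
    if state == "MEM":
        return "WB" if instr_class == "lw" else "IF"
    if state == "WB":
        return "IF"
    raise ValueError(f"unknown state: {state}")


def states_for(instr_class):
    """Sequence of states one instruction passes through."""
    seq = ["IF"]
    while True:
        nxt = next_state(seq[-1], instr_class)
        if nxt == "IF":                  # back to fetch: instruction is done
            return seq
        seq.append(nxt)
```

The length of `states_for(...)` reproduces the cycle counts given for the multi-cycle datapath: 5 for a load, 4 for R-type and store, 3 for branch and jump.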
Microprogrammed Control
Control signals stored in a control memory (microcode ROM). Each micro-instruction specifies control signals for one cycle.
Microprogram Counter → Control Memory → Control Signals
                             ↓
                     Micro-instruction
                       (control word)
Micro-instruction fields: ALUSrc, RegWrite, MemRead, next-μPC, branch-condition, etc.
Advantages: Flexible (change behavior by updating microcode). Natural for complex ISAs (x86).
Disadvantages: Slower than hardwired (ROM access time). More area.
Modern use: x86 processors use microcode for complex instructions, hardwired logic for simple/common ones. Microcode updates can fix CPU bugs post-manufacture (Intel/AMD regularly issue microcode patches).
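A microprogrammed sequencer can be sketched as a table of control words plus a small loop that steps the microprogram counter. Everything here is a toy assumption: the ROM contents, the `next` field values (`seq`, `dispatch`, `fetch`), the dispatch table, and the signal names `IRWrite` and `PCWrite` are hypothetical and do not describe any real processor's microcode.

```python
# Each entry is one micro-instruction: signals asserted this cycle
# and how to pick the next micro-PC.
ROM = [
    {"signals": {"MemRead", "IRWrite", "PCWrite"}, "next": "seq"},  # 0: fetch
    {"signals": set(), "next": "dispatch"},                         # 1: decode
    {"signals": {"ALUSrc"}, "next": "seq"},                         # 2: lw address
    {"signals": {"MemRead"}, "next": "seq"},                        # 3: lw memory read
    {"signals": {"RegWrite", "MemToReg"}, "next": "fetch"},         # 4: lw write back
    {"signals": set(), "next": "seq"},                              # 5: r-type execute
    {"signals": {"RegWrite"}, "next": "fetch"},                     # 6: r-type write back
]

# Where decode jumps to for each instruction class.
DISPATCH = {"lw": 2, "r-type": 5}


def run(instr_class):
    """Return the control signals asserted in each cycle of one instruction."""
    upc, trace = 0, []
    while True:
        word = ROM[upc]
        trace.append(sorted(word["signals"]))
        if word["next"] == "fetch":      # last micro-instruction of the routine
            return trace
        upc = DISPATCH[instr_class] if word["next"] == "dispatch" else upc + 1
```

Changing an instruction's behavior means editing `ROM` entries rather than redesigning logic, which is the flexibility argument made above.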
ALU Design
The ALU performs arithmetic and logic operations.
Simple ALU
Inputs: A, B (operands), ALUOp (operation select)
Output: Result, Zero flag, Overflow flag, Carry flag
Operations:
000: AND (A & B)
001: OR (A | B)
010: ADD (A + B)
011: SUB (A - B) [Add with B inverted + carry-in]
100: SLT (Set if A < B)
101: XOR (A ^ B)
110: SLL (A << B)
111: SRL (A >> B)
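The operation table can be sketched as a function over 64-bit values. The opcodes follow the list above; the rest is assumption for the example: the name `alu`, treating SLT as a signed comparison, using only the low 6 bits of B as the shift amount, and returning just the result and the Zero flag.

```python
MASK = (1 << 64) - 1  # 64-bit datapath


def to_signed(x):
    """Interpret a 64-bit pattern as a two's-complement integer."""
    return x - (1 << 64) if x >> 63 else x


def alu(op, a, b):
    """Return (result, zero_flag) for a 3-bit operation select."""
    a &= MASK
    b &= MASK
    if op == 0b000:
        r = a & b
    elif op == 0b001:
        r = a | b
    elif op == 0b010:
        r = a + b
    elif op == 0b011:
        r = a + (~b & MASK) + 1          # subtract: add inverted B with carry-in 1
    elif op == 0b100:
        r = int(to_signed(a) < to_signed(b))
    elif op == 0b101:
        r = a ^ b
    elif op == 0b110:
        r = a << (b & 63)
    elif op == 0b111:
        r = a >> (b & 63)                # logical shift: zeros shifted in
    else:
        raise ValueError(f"bad ALU op: {op}")
    r &= MASK                            # discard bits beyond the register width
    return r, r == 0
```

For instance, `alu(0b011, 5, 5)` returns `(0, True)`, which is the comparison a BEQ relies on.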
Flags / Condition Codes
| Flag | Meaning | Set When |
|---|---|---|
| Zero (Z) | Result is zero | Result == 0 |
| Negative (N) | Result is negative | MSB of result is 1 |
| Carry (C) | Unsigned overflow | Carry out of MSB |
| Overflow (V) | Signed overflow | Operands have the same sign but the result's sign differs (for addition) |
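The flag definitions can be checked on small cases. The sketch below computes all four flags for an addition, using an 8-bit width by default so the examples are easy to verify by hand; the function name `add_flags` and the dict layout are assumptions for the example.

```python
def add_flags(a, b, bits=8):
    """Add two unsigned bit patterns and return the result with Z, N, C, V."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    a &= mask
    b &= mask
    full = a + b                 # untruncated sum, may need bits + 1 bits
    r = full & mask
    return {
        "result": r,
        "Z": r == 0,
        "N": bool(r & sign),
        "C": full > mask,        # carry out of the MSB
        # Signed overflow: operands agree in sign, result disagrees.
        "V": bool(~(a ^ b) & (a ^ r) & sign),
    }
```

Two cases worth trying: `add_flags(0x7F, 0x01)` sets V but not C (127 + 1 overflows the signed range), while `add_flags(0xFF, 0x01)` sets C and Z but not V (as signed values this is -1 + 1 = 0).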
RISC-V note: RISC-V does not use condition codes. Instead, it uses compare-and-branch instructions (BEQ, BLT) that combine comparison and branch.
Register File Design
A register file is a small, fast memory inside the processor.
Typical Structure
32 registers × 64 bits (for RV64)
2 read ports + 1 write port (for basic pipeline)
Read: Combinational. Provide a register number, get the value immediately.
Write: Sequential. Data is written on the clock edge when RegWrite is asserted.
Register x0 (RISC-V): Hardwired to 0. Reads always return 0. Writes are discarded. Simplifies many operations (e.g., NOP = ADDI x0, x0, 0; MV rd, rs = ADDI rd, rs, 0).
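A behavioral sketch of such a register file is shown below: reads are plain lookups, writes happen only when a clock-edge method is called with RegWrite asserted, and x0 ignores writes. The class and method names (`RegisterFile`, `read`, `clock_edge`) are assumptions for the example and model behavior only, not timing or ports.

```python
class RegisterFile:
    """32 x 64-bit register file with two read ports and one write port."""

    def __init__(self, count=32, width=64):
        self._regs = [0] * count
        self._mask = (1 << width) - 1

    def read(self, rs1, rs2):
        """Combinational read of both ports."""
        return self._regs[rs1], self._regs[rs2]

    def clock_edge(self, reg_write, rd, data):
        """Sequential write: takes effect only if RegWrite is asserted."""
        if reg_write and rd != 0:        # writes to x0 are discarded
            self._regs[rd] = data & self._mask
```

Because x0 is never written, `read` returns 0 for it without any special case.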
Multi-Ported Register Files
Superscalar processors need multiple read/write ports:
- 2-issue: 4 read + 2 write ports
- 4-issue: 8 read + 4 write ports
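Area grows roughly as O(ports²), since each added port adds a wordline and bitlines to every cell. For wide machines this becomes prohibitive, so designs use techniques such as a register cache, or a physical register file with renaming.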
Data Hazards Overview
In a pipelined processor, where the execution of consecutive instructions overlaps, an instruction may depend on a result that is not yet available:
ADD x1, x2, x3 // Writes x1
SUB x4, x1, x5 // Reads x1 — but x1 not yet written!
This is a Read After Write (RAW) data hazard. Solutions:
- Stalling: Insert bubbles (NOPs) until data is available
- Forwarding/Bypassing: Route the result directly from where it's produced to where it's needed
- Compiler scheduling: Reorder instructions to avoid hazards
Detailed treatment in the pipelining file.
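Detecting RAW dependences in a short instruction sequence can be sketched as a scan over register names. The representation `(dest, src1, src2)`, the function name `raw_hazards`, and the `window` parameter are assumptions for the example; how many following instructions are actually affected depends on the pipeline depth and on whether forwarding is present.

```python
def raw_hazards(program, window=2):
    """Find RAW dependences between nearby instructions.

    program is a list of (dest, src1, src2) register-name tuples.
    Returns (producer_index, consumer_index, register) triples for consumers
    that read a register written at most `window` instructions earlier.
    """
    hazards = []
    for i, (rd, _, _) in enumerate(program):
        if rd == "x0":                       # x0 is never really written
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if rd in program[j][1:]:
                hazards.append((i, j, rd))
            if program[j][0] == rd:          # rd redefined: later reads see
                break                        # the newer value, not this one
    return hazards
```

Applied to the ADD/SUB pair above, written as `[("x1", "x2", "x3"), ("x4", "x1", "x5")]`, the scan reports one dependence on x1 between instruction 0 and instruction 1.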
Performance Metrics
Execution Time
CPU Time = Instruction Count × CPI × Clock Period
= IC × CPI / Clock Rate
CPI (Cycles Per Instruction)
CPI = Σ (CPIᵢ × Fᵢ)
where CPIᵢ is cycles for instruction class i and Fᵢ is its frequency.
- Single-cycle: CPI = 1, but long clock period.
- Multi-cycle: CPI > 1 on average, but shorter clock period.
- Pipelined: CPI ≈ 1 ideally (throughput of 1 instruction/cycle).
- Superscalar: CPI < 1 (IPC > 1, i.e., multiple instructions per cycle).
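The execution-time equation makes the trade-off between CPI and clock period easy to compare numerically. In the sketch below, the instruction count, CPI values, and clock periods are all assumed figures chosen for illustration, and `cpu_time_ps` is a hypothetical helper that works in picoseconds.

```python
def cpu_time_ps(instruction_count, cpi, clock_period_ps):
    """CPU Time = Instruction Count x CPI x Clock Period (in picoseconds)."""
    return instruction_count * cpi * clock_period_ps


IC = 1_000_000  # assumed program length

# Assumed design points: (CPI, clock period in ps).
DESIGNS = {
    "single-cycle": (1.0, 800),
    "multi-cycle":  (4.05, 200),
    "pipelined":    (1.0, 200),   # ideal case, no stalls
}

times = {name: cpu_time_ps(IC, cpi, period)
         for name, (cpi, period) in DESIGNS.items()}
```

Under these assumptions the multi-cycle design is roughly on par with the single-cycle one, because its shorter clock is offset by the higher CPI, while the ideal pipeline is about four times faster than either.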
Amdahl's Law
Speedup from improving a fraction f of execution by a factor S:
Speedup = 1 / ((1 - f) + f/S)
Consequence: Improving a small fraction of execution provides limited overall speedup. "Make the common case fast."
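A direct transcription of the formula shows how strongly the unimproved fraction limits the result. The function name `amdahl_speedup` and the sample values are chosen only for illustration.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of execution is sped up by factor s."""
    if not 0.0 <= f <= 1.0 or s <= 0:
        raise ValueError("need 0 <= f <= 1 and s > 0")
    return 1.0 / ((1.0 - f) + f / s)


small = amdahl_speedup(0.10, 10)   # speed up 10% of the time by 10x
large = amdahl_speedup(0.90, 10)   # speed up 90% of the time by 10x
```

Here `small` is about 1.10 and `large` is about 5.26. Even with an unbounded S, the speedup cannot exceed 1 / (1 - f), which is 10 when f = 0.9.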
Applications in CS
- Compiler optimization: Understanding the datapath helps compilers generate efficient code (instruction selection, register allocation, scheduling).
- Performance tuning: Knowing CPI breakdown helps identify bottlenecks (compute-bound vs memory-bound).
- Hardware design: Datapath design determines the tradeoffs of a processor (single-cycle simplicity vs multi-cycle efficiency vs pipelined throughput).
- Emulation: Software emulators implement the ISA's datapath in software. Understanding the hardware helps write efficient emulators.
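- Security: Microcode updates are one way vendors ship mitigations for hardware vulnerabilities (e.g., some Spectre mitigations were delivered as microcode updates). Timing differences in the ALU, such as variable-latency operations, can leak data through side channels.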