Datapath and Control

The datapath performs data operations (arithmetic, memory access, register read/write). The control unit orchestrates the datapath by generating the right signals at the right time.

Single-Cycle Datapath

Each instruction completes in one clock cycle. The clock period must accommodate the slowest instruction.

Components

┌──────┐    ┌──────────┐    ┌─────┐    ┌────────┐    ┌──────────┐
│Instr │    │ Register │    │     │    │  Data  │    │ Register │
│Memory│───→│  File    │───→│ ALU │───→│ Memory │───→│  File    │
│(IM)  │    │  (RF)    │    │     │    │  (DM)  │    │  (WB)    │
└──────┘    └──────────┘    └─────┘    └────────┘    └──────────┘
    ↑                           ↑
    │                           │
   PC                       Control

Instruction Execution Steps

R-type (ADD, SUB, AND, OR):

Fetch instruction from IM at address PC
Read two source registers from RF
ALU performs the operation
Write result to destination register in RF
PC ← PC + 4

I-type Load (LW):

Fetch instruction
Read base register from RF
ALU computes address: base + sign-extended offset
Read data from DM at computed address
Write loaded data to destination register
PC ← PC + 4

S-type Store (SW):

Fetch instruction
Read base register and data register from RF
ALU computes address: base + sign-extended offset
Write data to DM at computed address
PC ← PC + 4

B-type Branch (BEQ):

Fetch instruction
Read two registers from RF
ALU compares (subtract, check zero flag)
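If condition met: PC ← PC + sign-extended branch offset (RISC-V encodes the offset in units of 2 bytes, so the immediate field is shifted left by 1 before the add)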
If not: PC ← PC + 4
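
The four instruction flows above can be sketched as one function that carries out a whole instruction per call, the way a single-cycle datapath does per clock. This is a minimal sketch under several assumptions: instructions are pre-decoded tuples rather than encoded words, the `step` function and the tuple layout are hypothetical, memory is a dict keyed by byte address, values are not truncated to the register width, and the branch offset is already a sign-extended byte offset.

```python
def step(pc, regs, mem, imem):
    """Execute one pre-decoded instruction and return the next PC."""
    op, a, b, c = imem[pc]                     # fetch from instruction memory at PC
    if op in ("add", "sub", "and", "or"):      # R-type: (op, rd, rs1, rs2)
        x, y = regs[b], regs[c]                # read two source registers
        result = {"add": x + y, "sub": x - y, "and": x & y, "or": x | y}[op]
        regs[a] = result                       # write result to rd
    elif op == "lw":                           # load: (lw, rd, base, offset)
        regs[a] = mem[regs[b] + c]             # ALU forms the address, DM is read
    elif op == "sw":                           # store: (sw, rs2, base, offset)
        mem[regs[b] + c] = regs[a]             # ALU forms the address, DM is written
    elif op == "beq":                          # branch: (beq, rs1, rs2, offset)
        if regs[a] - regs[b] == 0:             # subtract, then test the zero flag
            return pc + c                      # taken: PC-relative target
    else:
        raise ValueError(f"unsupported instruction: {op}")
    regs[0] = 0                                # x0 always reads as zero
    return pc + 4                              # sequential next PC


regs = [0] * 32
regs[2], regs[3] = 5, 7
mem = {}
imem = {
    0: ("add", 1, 2, 3),    # x1 = x2 + x3
    4: ("sw", 1, 0, 16),    # mem[x0 + 16] = x1
    8: ("lw", 4, 0, 16),    # x4 = mem[x0 + 16]
    12: ("beq", 1, 4, 8),   # x1 == x4, so the branch is taken
}
pc = 0
for _ in range(4):
    pc = step(pc, regs, mem, imem)
```

After the four steps, x1 and x4 both hold 12 and the taken branch leaves the PC at 12 + 8 = 20.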

Datapath Control Signals

Signal	Purpose
RegWrite	Enable writing to register file
ALUSrc	Select ALU input: register or immediate
ALUOp	Select ALU operation (add, sub, and, etc.)
MemRead	Enable reading from data memory
MemWrite	Enable writing to data memory
MemToReg	Select write-back source: ALU result or memory data
Branch	Enable branch (PC update from branch target)
Jump	Force PC to jump target
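
One way to picture the single-cycle control unit is as a lookup from instruction class to a control word. The table below is a sketch following the common textbook design, with assumed encodings that the notes do not specify: ALUSrc 0 selects a register and 1 the immediate, MemToReg 0 selects the ALU result and 1 the memory data, `None` marks a don't-care, and ALUOp is written as a mnemonic ("func" meaning the operation comes from the instruction's function fields). Jump is omitted to keep the table small.

```python
FIELDS = ("RegWrite", "ALUSrc", "ALUOp", "MemRead", "MemWrite", "MemToReg", "Branch")

# Control word per instruction class, in the order given by FIELDS.
CONTROL = {
    "R-type": (1, 0, "func", 0, 0, 0, 0),
    "lw":     (1, 1, "add",  1, 0, 1, 0),
    "sw":     (0, 1, "add",  0, 1, None, 0),
    "beq":    (0, 0, "sub",  0, 0, None, 1),
}


def decode(kind):
    """Return the control signals for an instruction class as a dict."""
    return dict(zip(FIELDS, CONTROL[kind]))


lw_signals = decode("lw")
```

Only `lw` asserts MemRead and MemToReg together, and only `sw` and `beq` leave RegWrite deasserted, which matches the execution steps listed earlier.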

Performance

Clock period = longest instruction delay = tIM + tRF_read + tALU + tDM + tRF_write.

Typically the load instruction is the critical path.

Problem: Simple instructions (ADD) take as long as the slowest one (LW). Because the clock period is set by the worst case, faster instructions leave part of every cycle idle.
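
A short calculation makes the critical-path argument concrete. The delays below are assumed round numbers in picoseconds, chosen for illustration only; the unit names mirror the terms in the clock-period formula.

```python
# Assumed component delays in picoseconds (illustrative, not measured).
DELAY_PS = {"IM": 200, "RF_read": 100, "ALU": 200, "DM": 200, "RF_write": 100}

# Units each instruction class passes through in a single-cycle datapath.
PATHS = {
    "R-type": ["IM", "RF_read", "ALU", "RF_write"],
    "lw":     ["IM", "RF_read", "ALU", "DM", "RF_write"],
    "sw":     ["IM", "RF_read", "ALU", "DM"],
    "beq":    ["IM", "RF_read", "ALU"],
}

latency = {kind: sum(DELAY_PS[u] for u in units) for kind, units in PATHS.items()}
clock_period = max(latency.values())   # the slowest instruction sets the clock
```

With these numbers the load needs 800 ps while an R-type instruction needs 600 ps and a branch 500 ps, so every non-load instruction spends part of the 800 ps cycle doing nothing.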

Multi-Cycle Datapath

Break each instruction into multiple shorter steps, one per clock cycle. Different instructions take different numbers of cycles.

Steps (5 stages)

Instruction Fetch (IF): IR ← IM[PC]; PC ← PC + 4
Instruction Decode / Register Read (ID): Read registers; decode control signals; compute branch target
Execute (EX): ALU operation (arithmetic, address calculation, comparison)
Memory Access (MEM): Read/write data memory (only for load/store)
Write Back (WB): Write result to register file

Cycle Counts

Instruction	Cycles	Steps Used
R-type (ADD)	4	IF, ID, EX, WB
Load (LW)	5	IF, ID, EX, MEM, WB
Store (SW)	4	IF, ID, EX, MEM
Branch (BEQ)	3	IF, ID, EX
Jump (J)	3	IF, ID, EX

Advantages over Single-Cycle

Shorter clock period (each stage is shorter than the longest instruction)
Shared hardware (one ALU, one memory — used in different cycles)
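CPI (cycles per instruction) is greater than 1 and varies by instruction, but simple instructions no longer pay for the slowest one, so average time per instruction can be lower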

Disadvantages

More complex control (FSM or microcode)
Still sequential — next instruction waits until current one finishes

Control Unit

Hardwired Control

Control signals generated by combinational logic based on opcode and current state.

Single-cycle: Purely combinational. Opcode → control signals (one-level decode).

Multi-cycle: FSM. Current state + opcode → control signals + next state.

Opcode + State → Control Logic → Control Signals + Next State
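
The multi-cycle FSM can be sketched as a next-state function over the five step names used in the multi-cycle section. This is an illustrative model, not a synthesizable design: the state names, the instruction class strings, and the helper `states_used` are assumptions, and control-signal outputs are left out so that only sequencing is shown.

```python
def next_state(state, kind):
    """Next FSM state; returning 'IF' means the next instruction starts."""
    if state == "IF":
        return "ID"
    if state == "ID":
        return "EX"
    if state == "EX":
        if kind in ("lw", "sw"):
            return "MEM"
        if kind == "R-type":
            return "WB"
        return "IF"                      # beq and j finish in EX
    if state == "MEM":
        return "WB" if kind == "lw" else "IF"
    return "IF"                          # WB always ends the instruction


def states_used(kind):
    """Trace the states one instruction passes through."""
    state, trace = "IF", []
    while True:
        trace.append(state)
        state = next_state(state, kind)
        if state == "IF":
            return trace


cycle_counts = {k: len(states_used(k)) for k in ("R-type", "lw", "sw", "beq", "j")}
```

The resulting counts (4, 5, 4, 3, 3) reproduce the cycle-count table in the multi-cycle section.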

Advantages: Fast, efficient for simple ISAs.
Disadvantages: Hard to modify, complex for large ISAs.

Microprogrammed Control

Control signals stored in a control memory (microcode ROM). Each micro-instruction specifies control signals for one cycle.

Microprogram Counter → Control Memory → Control Signals
                              ↓
                         Micro-instruction
                         (control word)

Micro-instruction fields: ALUSrc, RegWrite, MemRead, next-μPC, branch-condition, etc.
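
A microprogrammed sequencer can be sketched as a list that plays the role of the control memory, indexed by the microprogram counter. Everything here is hypothetical and simplified: the micro-instruction format (a set of asserted signals plus a sequencing action standing in for the next-μPC field), the `IRWrite` and `PCWrite` signal names, the addresses, and the dispatch table are assumptions made for illustration, and only two instruction classes are covered.

```python
# Each entry: (signals asserted this cycle, sequencing action).
# Actions: "next" -> uPC + 1, "dispatch" -> jump by instruction class,
#          "fetch" -> this is the last micro-instruction; restart at address 0.
CONTROL_MEMORY = [
    ({"MemRead", "IRWrite", "PCWrite"}, "next"),   # 0: instruction fetch
    (set(), "dispatch"),                           # 1: decode, pick a routine
    ({"ALUSrc"}, "next"),                          # 2: lw, compute address
    ({"MemRead"}, "next"),                         # 3: lw, read data memory
    ({"RegWrite", "MemToReg"}, "fetch"),           # 4: lw, write back
    (set(), "next"),                               # 5: R-type, ALU operation
    ({"RegWrite"}, "fetch"),                       # 6: R-type, write back
]
DISPATCH = {"lw": 2, "R-type": 5}


def run_microprogram(kind):
    """Return the (uPC, signals) pairs visited while executing one instruction."""
    upc, trace = 0, []
    while True:
        signals, action = CONTROL_MEMORY[upc]
        trace.append((upc, signals))
        if action == "fetch":
            return trace
        upc = DISPATCH[kind] if action == "dispatch" else upc + 1


lw_addresses = [upc for upc, _ in run_microprogram("lw")]
```

Changing an instruction's behavior means editing entries in `CONTROL_MEMORY` rather than redesigning logic, which is the flexibility argument made for microcode.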

Advantages: Flexible (change behavior by updating microcode). Natural for complex ISAs (x86).
Disadvantages: Slower than hardwired (ROM access time). More area.

Modern use: x86 processors use microcode for complex instructions, hardwired logic for simple/common ones. Microcode updates can fix CPU bugs post-manufacture (Intel/AMD regularly issue microcode patches).

ALU Design

The ALU performs arithmetic and logic operations.

Simple ALU

Inputs: A, B (operands), ALUOp (operation select)
Output: Result, Zero flag, Overflow flag, Carry flag

Operations:
  000: AND    (A & B)
  001: OR     (A | B)
  010: ADD    (A + B)
  011: SUB    (A - B)     [A + ~B + 1: add with B inverted and carry-in = 1]
  100: SLT    (Set if A < B)
  101: XOR    (A ^ B)
  110: SLL    (A << B)
  111: SRL    (A >> B)
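
The operation table above maps directly onto a small behavioral model. This sketch assumes a 64-bit datapath, operands passed as unsigned 64-bit integers, SLT as a signed comparison, and shift amounts taken from the low 6 bits of B; the function names are illustrative, and only the Zero flag is produced.

```python
MASK = (1 << 64) - 1   # assumed 64-bit datapath


def to_signed(x):
    """Interpret a 64-bit pattern as a two's-complement value."""
    return x - (1 << 64) if x >> 63 else x


def alu(a, b, op):
    """Return (result, zero_flag) for the 3-bit ALUOp encoding listed above."""
    shamt = b & 63
    if op == 0b000:
        r = a & b
    elif op == 0b001:
        r = a | b
    elif op == 0b010:
        r = a + b
    elif op == 0b011:
        r = a + (~b & MASK) + 1        # subtract: invert B, carry-in of 1
    elif op == 0b100:
        r = int(to_signed(a) < to_signed(b))
    elif op == 0b101:
        r = a ^ b
    elif op == 0b110:
        r = a << shamt
    elif op == 0b111:
        r = a >> shamt                 # logical shift: zeros enter from the left
    else:
        raise ValueError(f"unknown ALUOp: {op}")
    r &= MASK                          # keep only the low 64 bits
    return r, int(r == 0)


difference, is_zero = alu(5, 5, 0b011)
```

A BEQ comparison is the case shown in the last line: subtracting equal operands yields zero, so the Zero flag is set.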

Flags / Condition Codes

Flag	Meaning	Set When
Zero (Z)	Result is zero	Result == 0
Negative (N)	Result is negative	MSB of result is 1
Carry (C)	Unsigned overflow	Carry out of MSB
Overflow (V)	Signed overflow	Operands of an addition share a sign but the result's sign differs
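
The difference between Carry and Overflow is easiest to see on a narrow adder. The sketch below computes all four flags for an addition; the 8-bit default width and the function name are assumptions chosen to keep the examples readable.

```python
def add_with_flags(a, b, width=8):
    """Add two width-bit values and return the result with Z, N, C, V flags."""
    mask = (1 << width) - 1
    sign = 1 << (width - 1)
    a &= mask
    b &= mask
    full = a + b                       # keeps the carry out of the MSB
    result = full & mask
    return {
        "result": result,
        "Z": int(result == 0),
        "N": int(bool(result & sign)),
        "C": int(full > mask),
        # Overflow: both operands differ in sign from the result.
        "V": int(bool((a ^ result) & (b ^ result) & sign)),
    }


signed_case = add_with_flags(0x7F, 0x01)     # 127 + 1
unsigned_case = add_with_flags(0xFF, 0x01)   # 255 + 1, or -1 + 1 when signed
```

The first case sets V but not C (127 + 1 does not fit in signed 8 bits), while the second sets C and Z but not V (the unsigned sum wraps, yet -1 + 1 = 0 is correct as a signed result).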

RISC-V note: RISC-V does not use condition codes. Instead, it uses compare-and-branch instructions (BEQ, BLT) that combine comparison and branch.

Register File Design

A register file is a small, fast memory inside the processor.

Typical Structure

32 registers × 64 bits (for RV64)
2 read ports + 1 write port (for basic pipeline)

Read: Combinational. Provide a register number, get the value immediately.
Write: Sequential. Data is written on the clock edge when RegWrite is asserted.

Register x0 (RISC-V): Hardwired to 0. Reads always return 0. Writes are discarded. Simplifies many operations (e.g., NOP = ADDI x0, x0, 0; MV rd, rs = ADDI rd, rs, 0).
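
The read, write, and x0 behavior described above fits in a few lines. This is a behavioral sketch with hypothetical class and method names; calling `clock_edge` stands in for the rising clock edge, and the 32 × 64-bit shape follows the RV64 structure given earlier.

```python
class RegisterFile:
    """Behavioral model: two read ports, one write port, x0 hardwired to zero."""

    def __init__(self, count=32, width=64):
        self._regs = [0] * count
        self._mask = (1 << width) - 1

    def read(self, rs1, rs2):
        """Both read ports; combinational in hardware, so no clock is involved."""
        return self._regs[rs1], self._regs[rs2]

    def clock_edge(self, reg_write, rd, data):
        """Write port: takes effect only when reg_write is asserted."""
        if reg_write and rd != 0:      # writes to x0 are discarded
            self._regs[rd] = data & self._mask


rf = RegisterFile()
rf.clock_edge(True, 5, 42)     # x5 <- 42
rf.clock_edge(True, 0, 99)     # ignored: x0 stays 0
rf.clock_edge(False, 6, 7)     # ignored: RegWrite not asserted
values = rf.read(5, 0)
```

Discarding the write at the port, rather than special-casing reads, keeps the stored value of x0 at zero without any extra read-side logic in this model.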

Multi-Ported Register Files

Superscalar processors need multiple read/write ports:

2-issue: 4 read + 2 write ports
4-issue: 8 read + 4 write ports

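Area grows roughly as O(ports²), because each port adds a wordline and a bitline to every cell. Beyond a few ports, designs turn to alternatives such as a register cache or a physical register file with renaming.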

Data Hazards Overview

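In a pipelined processor, an instruction may need a result that an earlier, still-executing instruction has not yet written. (A non-pipelined multi-cycle design avoids this because each instruction finishes before the next one starts.)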

ADD x1, x2, x3    // Writes x1
SUB x4, x1, x5    // Reads x1 — but x1 not yet written!

This is a Read After Write (RAW) data hazard. Solutions:

Stalling: Insert bubbles (NOPs) until data is available
Forwarding/Bypassing: Route the result directly from where it's produced to where it's needed
Compiler scheduling: Reorder instructions to avoid hazards
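
Detecting the ADD/SUB dependence shown above comes down to comparing a producer's destination register with a consumer's source registers. The sketch below is deliberately limited: it checks only adjacent instruction pairs, whereas a real pipeline compares against every older instruction still in flight, and the tuple format and function name are hypothetical.

```python
def raw_hazards(program):
    """Return (producer_index, consumer_index, register) for adjacent RAW pairs."""
    hazards = []
    for i in range(len(program) - 1):
        _, rd, _ = program[i]               # destination of the older instruction
        _, _, sources = program[i + 1]      # sources of the younger instruction
        if rd != 0 and rd in sources:       # x0 never creates a dependence
            hazards.append((i, i + 1, rd))
    return hazards


# (mnemonic, destination register, source registers)
program = [
    ("add", 1, (2, 3)),    # writes x1
    ("sub", 4, (1, 5)),    # reads x1
    ("or", 6, (7, 8)),     # independent
]
found = raw_hazards(program)
```

The same comparison is what a hazard unit uses to decide whether to stall or to select a forwarded value.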

Detailed treatment in the pipelining file.

Performance Metrics

Execution Time

CPU Time = Instruction Count × CPI × Clock Period
         = IC × CPI / Clock Rate

CPI (Cycles Per Instruction)

CPI = Σ (CPIᵢ × Fᵢ)

where CPIᵢ is cycles for instruction class i and Fᵢ is its frequency.
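
A worked example ties the two formulas together. The cycle counts come from the multi-cycle table earlier in these notes; the instruction mix, the instruction count, and the 200 ps clock period are hypothetical numbers chosen for illustration.

```python
CYCLES = {"R-type": 4, "load": 5, "store": 4, "branch": 3}
MIX = {"R-type": 0.5, "load": 0.2, "store": 0.1, "branch": 0.2}   # hypothetical


def average_cpi(cycles, mix):
    """Weighted CPI: sum of per-class cycles times per-class frequency."""
    return sum(cycles[kind] * mix[kind] for kind in mix)


def cpu_time(instruction_count, cpi, clock_period_s):
    """CPU time in seconds = instruction count x CPI x clock period."""
    return instruction_count * cpi * clock_period_s


cpi = average_cpi(CYCLES, MIX)
seconds = cpu_time(1_000_000, cpi, 200e-12)
```

With this mix the average CPI comes to about 4.0, so one million instructions at a 200 ps clock take roughly 0.8 ms.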

Single-cycle: CPI = 1, but long clock period.
Multi-cycle: CPI > 1 on average, but shorter clock period.
Pipelined: CPI ≈ 1 ideally (throughput of 1 instruction/cycle).
Superscalar: CPI < 1 (IPC > 1, multiple instructions per cycle).

Amdahl's Law

Speedup from improving a fraction f of execution by a factor S:

Speedup = 1 / ((1 - f) + f/S)

Consequence: Improving a small fraction of execution provides limited overall speedup. "Make the common case fast."
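
Plugging numbers into the formula shows how strongly the fraction f limits the result. The function below is a direct transcription of the speedup formula above; the sample fractions and factors are illustrative.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of execution time is sped up by factor s."""
    if not 0.0 <= f <= 1.0 or s <= 0:
        raise ValueError("f must be in [0, 1] and s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


small_fraction = amdahl_speedup(0.1, 10)   # speed up 10% of the run by 10x
large_fraction = amdahl_speedup(0.9, 10)   # speed up 90% of the run by 10x
```

A 10x improvement to 10% of the execution gives only about 1.10x overall, while the same improvement to 90% gives about 5.26x. Even as S grows without bound, the speedup cannot exceed 1 / (1 - f).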

Applications in CS

Compiler optimization: Understanding the datapath helps compilers generate efficient code (instruction selection, register allocation, scheduling).
Performance tuning: Knowing CPI breakdown helps identify bottlenecks (compute-bound vs memory-bound).
Hardware design: Datapath design determines the tradeoffs of a processor (single-cycle simplicity vs multi-cycle efficiency vs pipelined throughput).
Emulation: Software emulators implement the ISA's datapath in software. Understanding the hardware helps write efficient emulators.
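Security: Microcode updates are one delivery path for hardware vulnerability mitigations (some Spectre mitigations shipped as microcode updates). Data-dependent instruction timing in the ALU can leak information through side channels.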