GPU Architecture
Graphics Processing Units (GPUs) are massively parallel processors designed for throughput over latency. Originally built for graphics rendering, they now power machine learning, scientific computing, and cryptocurrency mining.
CPU vs GPU Design Philosophy
| Aspect | CPU | GPU |
|---|---|---|
| Cores | Few (4-64), complex | Many (thousands), simple |
| Design goal | Low latency (single thread fast) | High throughput (many threads) |
| Control logic | Large (OoO, branch prediction, speculation) | Small (in-order, simple) |
| Cache | Large per core (MB) | Small per core (KB), large shared |
| Memory bandwidth | ~50-100 GB/s | ~500-3000 GB/s |
| Best for | Serial, branchy, irregular code | Parallel, regular, data-parallel code |
CPU: Latency-oriented. Make one task finish fast. GPU: Throughput-oriented. Finish many tasks per unit time.
SIMD vs SIMT
SIMD (Single Instruction, Multiple Data)
One instruction operates on multiple data elements simultaneously (vector processing).
CPU SIMD: SSE (128-bit), AVX2 (256-bit), AVX-512 (512-bit). Explicit vector instructions.
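A toy model of SIMD in plain Python (the lane count and chunking are illustrative; real SSE/AVX registers hold 4-16 lanes depending on element width):

```python
def simd_add(a, b, lanes=4):
    """One 'vector instruction': adds all `lanes` elements at once."""
    assert len(a) == len(b) == lanes
    return [x + y for x, y in zip(a, b)]

def vector_sum_arrays(a, b, lanes=4):
    """Process two arrays one vector-width chunk at a time."""
    out = []
    for i in range(0, len(a), lanes):  # one SIMD instruction per chunk
        out += simd_add(a[i:i + lanes], b[i:i + lanes], lanes)
    return out

print(vector_sum_arrays([1, 2, 3, 4, 5, 6, 7, 8], [10] * 8))
# [11, 12, 13, 14, 15, 16, 17, 18] -- 2 vector ops instead of 8 scalar ops
```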
SIMT (Single Instruction, Multiple Threads)
NVIDIA's execution model. Multiple threads execute the same instruction simultaneously, each on different data. Unlike SIMD, each thread has its own registers and can follow different control flow paths (at the cost of serialization).
Key difference: SIMT allows threads to diverge (if/else branches), while SIMD doesn't. But divergence reduces efficiency.
GPU Execution Model
Thread Hierarchy (CUDA terminology)
Grid
├── Block (0,0)
│ ├── Thread (0,0)
│ ├── Thread (0,1)
│ ├── ...
│ └── Thread (31,31)
├── Block (0,1)
│ └── ...
└── Block (N,M)
└── ...
- Thread: Smallest unit of execution. Has private registers and local memory.
- Warp (NVIDIA) / Wavefront (AMD): 32 (NVIDIA) or 64 (AMD) threads executing in lockstep. The actual SIMT execution unit.
- Block (Thread Block / Workgroup): Group of threads that share shared memory and can synchronize. Mapped to one Streaming Multiprocessor.
- Grid: All blocks for a kernel launch.
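A minimal Python sketch of how a 1-D kernel launch maps the hierarchy above onto data. The names mirror CUDA's built-ins (`blockIdx.x`, `blockDim.x`, `threadIdx.x`), but the sequential loop is just a model; on hardware all these threads run concurrently:

```python
def global_index(block_idx, block_dim, thread_idx):
    # CUDA's canonical 1-D indexing: idx = blockIdx.x * blockDim.x + threadIdx.x
    return block_idx * block_dim + thread_idx

def launch_1d(grid_dim, block_dim, kernel):
    # A grid is every (block, thread) pair; each thread gets a unique index.
    for b in range(grid_dim):
        for t in range(block_dim):
            kernel(global_index(b, block_dim, t))

out = []
launch_1d(grid_dim=2, block_dim=4, kernel=out.append)
print(out)  # [0, 1, 2, 3, 4, 5, 6, 7]
```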
Warp Execution
All threads in a warp execute the same instruction simultaneously:
Warp 0, Cycle 1: All 32 threads execute: ADD r1, r2, r3
Warp 0, Cycle 2: All 32 threads execute: MUL r4, r1, r5
Warp divergence: When threads in a warp take different branches:
if (threadIdx.x < 16)
    A();  // first 16 threads
else
    B();  // last 16 threads
Both paths are executed serially — A() for the first half (second half masked off), then B() for the second half. Effective throughput halved!
Avoid divergence: Use data-parallel patterns where all threads follow the same path.
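The cost of divergence can be captured in a small model: each path taken by at least one thread in the warp executes once for the whole warp, with non-participating threads masked off (a simplification that ignores instruction-level overlap on recent architectures):

```python
def warp_cycles(mask, cost_a, cost_b):
    """Cycles a warp spends on if/else, given which threads take the if-branch."""
    cycles = 0
    if any(mask):        # some thread takes A(): whole warp steps through it
        cycles += cost_a
    if not all(mask):    # some thread takes B(): whole warp steps through it too
        cycles += cost_b
    return cycles

uniform   = [True] * 32                   # all 32 threads agree
divergent = [i < 16 for i in range(32)]   # threads split 16/16

print(warp_cycles(uniform, 100, 100))    # 100: only A() executes
print(warp_cycles(divergent, 100, 100))  # 200: both paths serialized
```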
GPU Memory Hierarchy
┌───────────────────────────────┐
│         Global Memory         │  (GB, high latency: ~400-800 cycles)
│          (GDDR6/HBM)          │
├───────────────────────────────┤
│         L2 Cache (MB)         │  (shared across all SMs)
├───────────────┬───────────────┤
│     SM 0      │     SM 1      │
│ ┌───────────┐ │ ┌───────────┐ │
│ │ Shared    │ │ │ Shared    │ │  (KB, low latency: ~20-30 cycles)
│ │ Mem / L1  │ │ │ Mem / L1  │ │
│ ├───────────┤ │ ├───────────┤ │
│ │ Registers │ │ │ Registers │ │  (fastest: 0-1 cycles)
│ └───────────┘ │ └───────────┘ │
└───────────────┴───────────────┘
Memory Types
| Memory | Scope | Latency | Size | Read/Write |
|---|---|---|---|---|
| Registers | Per thread | 0-1 cycle | ~256 KB per SM | R/W |
| Shared memory | Per block | ~20 cycles | 48-164 KB per SM | R/W |
| L1 cache | Per SM | ~30 cycles | Combined with shared | R (auto) |
| L2 cache | Global | ~200 cycles | 4-50 MB | R (auto) |
| Global memory | Global | ~400-800 cycles | 8-80 GB | R/W |
| Constant memory | Global | ~4 cycles (cached) | 64 KB | R |
| Texture memory | Global | ~4 cycles (cached) | (in global) | R |
Memory Coalescing
Global memory is accessed in large transactions (32/64/128 bytes). When threads in a warp access contiguous addresses, the accesses merge into fewer transactions.
Coalesced (good): Thread i accesses address base + i × 4. One transaction for 32 threads.
Strided (bad): Thread i accesses address base + i × stride. Many transactions → low bandwidth.
Random (worst): Each thread accesses an arbitrary address. 32 separate transactions.
Optimization: Ensure contiguous memory access patterns. Use Structure of Arrays (SoA) instead of Array of Structures (AoS).
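Transaction counting makes the difference concrete. A sketch assuming each transaction fetches one aligned 32-byte segment (real GPUs fetch 32-byte sectors of 128-byte lines, so the exact numbers vary):

```python
def transactions(addresses, granularity=32):
    """Memory transactions needed for one warp's accesses: one per
    distinct aligned `granularity`-byte segment touched."""
    return len({addr // granularity for addr in addresses})

warp = range(32)
coalesced = [i * 4 for i in warp]     # thread i reads a float at base + 4i
strided   = [i * 128 for i in warp]   # 128-byte stride (e.g. AoS with big structs)

print(transactions(coalesced))  # 4: 32 threads x 4 B = 128 B = four 32-B segments
print(transactions(strided))    # 32: every thread touches its own segment
```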
Streaming Multiprocessor (SM)
The basic processing unit of a GPU. Each SM contains:
- CUDA cores: Simple ALUs (INT, FP32). 64-128 per SM.
- Tensor cores: Matrix multiply-accumulate units (for ML). Compute 4×4 matrix operations per cycle.
- Load/Store units: Handle memory operations.
- Special function units (SFU): Transcendentals (sin, cos, exp, log).
- Warp schedulers: Select warps to execute each cycle. 2-4 per SM.
- Register file: Large (~256 KB). Shared among all threads on the SM.
- Shared memory / L1 cache: Configurable split.
Occupancy
Occupancy = active warps / maximum warps per SM.
Higher occupancy → better latency hiding (when one warp stalls on memory, another can execute).
Limited by: registers per thread, shared memory per block, block size, max threads per SM.
Tradeoff: Using more registers per thread → fewer concurrent warps → lower occupancy. But more registers may reduce spills to slow local memory.
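A simplified occupancy calculator; the per-SM limits below (64 K registers, 100 KB shared memory, 64 warps, 32 blocks) are assumptions in the right range for recent NVIDIA parts, but every architecture differs:

```python
def occupancy(regs_per_thread, smem_per_block, threads_per_block,
              sm_regs=65536, sm_smem=100 * 1024,
              max_warps=64, max_blocks=32, warp_size=32):
    """Fraction of the SM's warp slots that can be filled, given the
    tightest of the register, shared-memory, and block-count limits."""
    warps_per_block = -(-threads_per_block // warp_size)  # ceil division
    by_regs = sm_regs // (regs_per_thread * warp_size) // warps_per_block
    by_smem = sm_smem // smem_per_block if smem_per_block else max_blocks
    blocks = min(by_regs, by_smem, max_blocks, max_warps // warps_per_block)
    return blocks * warps_per_block / max_warps

# 256 threads/block, 64 registers/thread, 32 KB shared memory per block:
print(occupancy(64, 32 * 1024, 256))  # 0.375 -- shared memory is the limiter
```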
GPU vs CPU Design Tradeoffs
Latency Hiding
CPU: Hides latency with caches, OoO execution, speculation. Few threads, each running fast.
GPU: Hides latency with massive multithreading. When one warp stalls on memory, immediately switch to another ready warp. Thousands of threads in flight. No complex OoO needed.
Warp 0: LOAD (stall 400 cycles)
Warp 1: COMPUTE (executing) ← switch to ready warp
Warp 2: COMPUTE (executing)
...
Warp 15: COMPUTE (executing)
Warp 0: (data arrived) COMPUTE (resume)
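Back-of-envelope arithmetic for the schedule above: if each warp computes for c cycles between loads and then stalls for L cycles, the SM stays busy when enough other warps are resident (a toy model ignoring issue-width and multiple schedulers):

```python
def warps_to_hide(mem_latency, compute_cycles_between_loads):
    # While one warp waits `mem_latency` cycles, the others must supply
    # that many cycles of work: ceil(L / c) extra warps, plus the stalled one.
    return -(-mem_latency // compute_cycles_between_loads) + 1

# 400-cycle global-memory latency, 25 cycles of compute between loads:
print(warps_to_hide(400, 25))  # 17 resident warps keep the SM busy
```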
Area Allocation
CPU: ~50% cache, ~30% control logic, ~20% compute. GPU: ~5% cache, ~5% control, ~90% compute.
GPUs dedicate transistors to compute, relying on multithreading instead of caches for latency tolerance.
HBM (High Bandwidth Memory)
Stacked memory (HBM2, HBM2E, HBM3) provides massive bandwidth:
| Memory Type | Bandwidth | Used In |
|---|---|---|
| GDDR6 | 500-1000 GB/s | Gaming GPUs (RTX 4090) |
| HBM2E | 1.5-2.0 TB/s | Data center GPUs (A100) |
| HBM3 | 2.0-3.35 TB/s | Latest GPUs (H100, H200) |
| HBM3E | 4.8+ TB/s | Next-gen GPUs (B200) |
Bandwidth is the key differentiator for GPU performance in ML workloads (many models are memory-bandwidth-bound).
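Whether a kernel is bandwidth-bound follows from a simple roofline: attainable performance is the lesser of peak compute and bandwidth times arithmetic intensity (FLOPs per byte moved). The figures below are rough A100-class numbers for illustration:

```python
def attainable_tflops(peak_tflops, bandwidth_tbs, flops_per_byte):
    """Roofline model: performance is capped by compute or by memory."""
    return min(peak_tflops, bandwidth_tbs * flops_per_byte)

# Assumed: 312 TFLOPS FP16 peak, 2.0 TB/s HBM2E.
# Elementwise add in FP16: 1 FLOP per 4 bytes moved (2 reads, ignoring the write).
print(attainable_tflops(312, 2.0, 0.25))  # 0.5 TFLOPS -- badly bandwidth-bound
# A large matmul reuses each byte hundreds of times:
print(attainable_tflops(312, 2.0, 300))   # 312 TFLOPS -- compute-bound
```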
Modern GPU Features
Tensor Cores
Dedicated matrix multiply-accumulate units:
D = A × B + C (4×4 matrices, mixed precision)
Support FP16×FP16→FP32, BF16, TF32, INT8, FP8. Dramatically accelerate deep learning (training and inference).
NVIDIA A100: 312 TFLOPS FP16 with tensor cores. H100: 990 TFLOPS FP16.
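Those peak numbers translate into runtime estimates via FLOP counting (real kernels typically reach 60-90% of peak, so treat this as a lower bound):

```python
def matmul_time_ms(m, n, k, tflops):
    """Ideal time for an m x k times k x n matmul at a given peak rate."""
    flops = 2 * m * n * k  # each output element costs k multiply-adds
    return flops / (tflops * 1e12) * 1e3

# An 8192^3 FP16 matmul at A100's 312 TFLOPS tensor-core peak:
print(round(matmul_time_ms(8192, 8192, 8192, 312), 2))  # 3.52 ms
```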
Ray Tracing Cores
Dedicated hardware for BVH traversal and ray-triangle intersection. Accelerate real-time ray tracing by 10× vs software.
Sparsity Support
2:4 structured sparsity: 2 of every 4 elements can be zero. Tensor cores process sparse matrices at 2× throughput. Models can be pruned to 2:4 sparsity with minimal accuracy loss.
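Magnitude-based 2:4 pruning is simple to sketch: in every group of four weights, zero the two with the smallest absolute value (real pruning workflows also fine-tune afterwards to recover accuracy):

```python
def prune_2_4(row):
    """Enforce 2:4 structured sparsity by zeroing the 2 smallest-magnitude
    values in each group of 4."""
    out = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]),
                      reverse=True)[:2]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

print(prune_2_4([0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.7, 0.0]))
# [0.9, 0.0, 0.0, -1.2, 0.3, 0.0, -0.7, 0.0]
```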
Applications in CS
- Deep learning: Training (matrix multiplies, convolutions) and inference. GPUs enabled the deep learning revolution. NVIDIA CUDA ecosystem dominates.
- Scientific computing: Molecular dynamics, climate modeling, fluid simulation. CUDA, OpenCL, HIP.
- Computer graphics: Real-time rendering (rasterization + ray tracing). Game engines, CAD visualization.
- Cryptocurrency: SHA-256 mining (ASICs dominate now), Ethash (was GPU-friendly).
- Data analytics: GPU-accelerated databases (RAPIDS, BlazingSQL). Parallel sorting, joining, aggregation.
- Signal processing: GPU-accelerated FFT, filtering, image processing.
- Bioinformatics: Sequence alignment, molecular simulation (GROMACS).