GPU Architecture
Graphics Processing Units (GPUs) are massively parallel processors designed for throughput over latency. Originally built for graphics rendering, they now power machine learning, scientific computing, and cryptocurrency mining.
CPU vs GPU Design Philosophy
| Aspect | CPU | GPU |
|---|---|---|
| Cores | Few (4-64), complex | Many (thousands), simple |
| Design goal | Low latency (single thread fast) | High throughput (many threads) |
| Control logic | Large (OoO, branch prediction, speculation) | Small (in-order, simple) |
| Cache | Large per core (MB) | Small per core (KB), large shared |
| Memory bandwidth | ~50-100 GB/s | ~500-3000 GB/s |
| Best for | Serial, branchy, irregular code | Parallel, regular, data-parallel code |
CPU: Latency-oriented. Make one task finish fast. GPU: Throughput-oriented. Finish many tasks per unit time.
SIMD vs SIMT
SIMD (Single Instruction, Multiple Data)
One instruction operates on multiple data elements simultaneously (vector processing).
CPU SIMD: SSE (128-bit), AVX2 (256-bit), AVX-512 (512-bit). Explicit vector instructions.
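A toy model of SIMD in plain Python (the lane count and chunking are illustrative; real SSE/AVX registers hold 4-16 lanes depending on element width):

```python
def simd_add(a, b, lanes=4):
    """One 'vector instruction': adds all `lanes` elements at once."""
    assert len(a) == len(b) == lanes
    return [x + y for x, y in zip(a, b)]

def vector_sum_arrays(a, b, lanes=4):
    """Process two arrays one vector-width chunk at a time."""
    out = []
    for i in range(0, len(a), lanes):  # one SIMD instruction per chunk
        out += simd_add(a[i:i + lanes], b[i:i + lanes], lanes)
    return out

print(vector_sum_arrays([1, 2, 3, 4, 5, 6, 7, 8], [10] * 8))
# [11, 12, 13, 14, 15, 16, 17, 18] -- 2 vector ops instead of 8 scalar ops
```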
SIMT (Single Instruction, Multiple Threads)
NVIDIA's execution model. Multiple threads execute the same instruction simultaneously, each on different data. Unlike SIMD, each thread has its own registers and can follow different control flow paths (at the cost of serialization).
Key difference: SIMT allows threads to diverge (if/else branches), while SIMD doesn't. But divergence reduces efficiency.
GPU Execution Model
Thread Hierarchy (CUDA terminology)
Grid
├── Block (0,0)
│ ├── Thread (0,0)
│ ├── Thread (0,1)
│ ├── ...
│ └── Thread (31,31)
├── Block (0,1)
│ └── ...
└── Block (N,M)
└── ...
- Thread: Smallest unit of execution. Has private registers and local memory.
- Warp (NVIDIA) / Wavefront (AMD): 32 (NVIDIA) or 64 (AMD) threads executing in lockstep. The actual SIMT execution unit.
- Block (Thread Block / Workgroup): Group of threads that share shared memory and can synchronize. Mapped to one Streaming Multiprocessor.
- Grid: All blocks for a kernel launch.
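A minimal Python sketch of how a 1-D kernel launch maps the hierarchy above onto data. The names mirror CUDA's built-ins (`blockIdx.x`, `blockDim.x`, `threadIdx.x`), but the sequential loop is just a model; on hardware all these threads run concurrently:

```python
def global_index(block_idx, block_dim, thread_idx):
    # CUDA's canonical 1-D indexing: idx = blockIdx.x * blockDim.x + threadIdx.x
    return block_idx * block_dim + thread_idx

def launch_1d(grid_dim, block_dim, kernel):
    # A grid is every (block, thread) pair; each thread gets a unique index.
    for b in range(grid_dim):
        for t in range(block_dim):
            kernel(global_index(b, block_dim, t))

out = []
launch_1d(grid_dim=2, block_dim=4, kernel=out.append)
print(out)  # [0, 1, 2, 3, 4, 5, 6, 7]
```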
Warp Execution
All threads in a warp execute the same instruction simultaneously:
Warp 0, Cycle 1: All 32 threads execute: ADD r1, r2, r3
Warp 0, Cycle 2: All 32 threads execute: MUL r4, r1, r5
Warp divergence: When threads in a warp take different branches:
if (threadIdx.x < 16)
    A();  // first 16 threads
else
    B();  // last 16 threads
Both paths are executed serially — A() for the first half (second half masked off), then B() for the second half. Effective throughput halved!
Avoid divergence: Use data-parallel patterns where all threads follow the same path.
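The cost of divergence can be captured in a small model: each path taken by at least one thread in the warp executes once for the whole warp, with non-participating threads masked off (a simplification that ignores instruction-level overlap on recent architectures):

```python
def warp_cycles(mask, cost_a, cost_b):
    """Cycles a warp spends on if/else, given which threads take the if-branch."""
    cycles = 0
    if any(mask):        # some thread takes A(): whole warp steps through it
        cycles += cost_a
    if not all(mask):    # some thread takes B(): whole warp steps through it too
        cycles += cost_b
    return cycles

uniform   = [True] * 32                   # all 32 threads agree
divergent = [i < 16 for i in range(32)]   # threads split 16/16

print(warp_cycles(uniform, 100, 100))    # 100: only A() executes
print(warp_cycles(divergent, 100, 100))  # 200: both paths serialized
```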
GPU Memory Hierarchy
┌───────────────────────────────┐
│         Global Memory         │  (GB, high latency: ~400-800 cycles)
│          (GDDR6/HBM)          │
├───────────────────────────────┤
│         L2 Cache (MB)         │  (shared across all SMs)
├───────────────┬───────────────┤
│     SM 0      │     SM 1      │
│ ┌───────────┐ │ ┌───────────┐ │
│ │ Shared    │ │ │ Shared    │ │  (KB, low latency: ~20-30 cycles)
│ │ Mem / L1  │ │ │ Mem / L1  │ │
│ ├───────────┤ │ ├───────────┤ │
│ │ Registers │ │ │ Registers │ │  (fastest: 0-1 cycles)
│ └───────────┘ │ └───────────┘ │
└───────────────┴───────────────┘
Memory Types
| Memory | Scope | Latency | Size | Read/Write |
|---|---|---|---|---|
| Registers | Per thread | 0-1 cycle | ~256 KB per SM | R/W |
| Shared memory | Per block | ~20 cycles | 48-164 KB per SM | R/W |
| L1 cache | Per SM | ~30 cycles | Combined with shared | R (auto) |
| L2 cache | Global | ~200 cycles | 4-50 MB | R (auto) |
| Global memory | Global | ~400-800 cycles | 8-80 GB | R/W |
| Constant memory | Global | ~4 cycles (cached) | 64 KB | R |
| Texture memory | Global | ~4 cycles (cached) | (in global) | R |
Memory Coalescing
Global memory is accessed in large transactions (32/64/128 bytes). When threads in a warp access contiguous addresses, the accesses merge into fewer transactions.
Coalesced (good): Thread i accesses address base + i × 4. One transaction for 32 threads.
Strided (bad): Thread i accesses address base + i × stride. Many transactions → low bandwidth.
Random (worst): Each thread accesses an arbitrary address. 32 separate transactions.
Optimization: Ensure contiguous memory access patterns. Use Structure of Arrays (SoA) instead of Array of Structures (AoS).
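Transaction counting makes the difference concrete. A sketch assuming each transaction fetches one aligned 32-byte segment (real GPUs fetch 32-byte sectors of 128-byte lines, so the exact numbers vary):

```python
def transactions(addresses, granularity=32):
    """Memory transactions needed for one warp's accesses: one per
    distinct aligned `granularity`-byte segment touched."""
    return len({addr // granularity for addr in addresses})

warp = range(32)
coalesced = [i * 4 for i in warp]     # thread i reads a float at base + 4i
strided   = [i * 128 for i in warp]   # 128-byte stride (e.g. AoS with big structs)

print(transactions(coalesced))  # 4: 32 threads x 4 B = 128 B = four 32-B segments
print(transactions(strided))    # 32: every thread touches its own segment
```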
Streaming Multiprocessor (SM)
The basic processing unit of a GPU. Each SM contains:
- CUDA cores: Simple ALUs (INT, FP32). 64-128 per SM.
- Tensor cores: Matrix multiply-accumulate units (for ML). Compute 4×4 matrix operations per cycle.
- Load/Store units: Handle memory operations.
- Special function units (SFU): Transcendentals (sin, cos, exp, log).
- Warp schedulers: Select warps to execute each cycle. 2-4 per SM.
- Register file: Large (~256 KB). Shared among all threads on the SM.
- Shared memory / L1 cache: Configurable split.
Occupancy
Occupancy = active warps / maximum warps per SM.
Higher occupancy → better latency hiding (when one warp stalls on memory, another can execute).
Limited by: registers per thread, shared memory per block, block size, max threads per SM.
Tradeoff: Using more registers per thread → fewer concurrent warps → lower occupancy. But more registers may reduce spills to slow local memory.
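A simplified occupancy calculator; the per-SM limits below (64 K registers, 100 KB shared memory, 64 warps, 32 blocks) are assumptions in the right range for recent NVIDIA parts, but every architecture differs:

```python
def occupancy(regs_per_thread, smem_per_block, threads_per_block,
              sm_regs=65536, sm_smem=100 * 1024,
              max_warps=64, max_blocks=32, warp_size=32):
    """Fraction of the SM's warp slots that can be filled, given the
    tightest of the register, shared-memory, and block-count limits."""
    warps_per_block = -(-threads_per_block // warp_size)  # ceil division
    by_regs = sm_regs // (regs_per_thread * warp_size) // warps_per_block
    by_smem = sm_smem // smem_per_block if smem_per_block else max_blocks
    blocks = min(by_regs, by_smem, max_blocks, max_warps // warps_per_block)
    return blocks * warps_per_block / max_warps

# 256 threads/block, 64 registers/thread, 32 KB shared memory per block:
print(occupancy(64, 32 * 1024, 256))  # 0.375 -- shared memory is the limiter
```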
GPU vs CPU Design Tradeoffs
Latency Hiding
CPU: Hides latency with caches, OoO execution, speculation. Few threads, each running fast.
GPU: Hides latency with massive multithreading. When one warp stalls on memory, immediately switch to another ready warp. Thousands of threads in flight. No complex OoO needed.
Warp 0: LOAD (stall 400 cycles)
Warp 1: COMPUTE (executing) ← switch to ready warp
Warp 2: COMPUTE (executing)
...
Warp 15: COMPUTE (executing)
Warp 0: (data arrived) COMPUTE (resume)
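Back-of-envelope arithmetic for the schedule above: if each warp computes for c cycles between loads and then stalls for L cycles, the SM stays busy when enough other warps are resident (a toy model ignoring issue-width and multiple schedulers):

```python
def warps_to_hide(mem_latency, compute_cycles_between_loads):
    # While one warp waits `mem_latency` cycles, the others must supply
    # that many cycles of work: ceil(L / c) extra warps, plus the stalled one.
    return -(-mem_latency // compute_cycles_between_loads) + 1

# 400-cycle global-memory latency, 25 cycles of compute between loads:
print(warps_to_hide(400, 25))  # 17 resident warps keep the SM busy
```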
Area Allocation
CPU: ~50% cache, ~30% control logic, ~20% compute. GPU: ~5% cache, ~5% control, ~90% compute.
GPUs dedicate transistors to compute, relying on multithreading instead of caches for latency tolerance.
HBM (High Bandwidth Memory)
Stacked memory (HBM2, HBM2E, HBM3) provides massive bandwidth:
| Memory Type | Bandwidth | Used In |
|---|---|---|
| GDDR6 | 500-1000 GB/s | Gaming GPUs (RTX 4090) |
| HBM2E | 1.5-2.0 TB/s | Data center GPUs (A100) |
| HBM3 | 2.0-3.35 TB/s | Latest GPUs (H100, H200) |
| HBM3E | 4.8+ TB/s | Next-gen GPUs (B200) |
Bandwidth is the key differentiator for GPU performance in ML workloads (many models are memory-bandwidth-bound).
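Whether a kernel is bandwidth-bound follows from a simple roofline: attainable performance is the lesser of peak compute and bandwidth times arithmetic intensity (FLOPs per byte moved). The figures below are rough A100-class numbers for illustration:

```python
def attainable_tflops(peak_tflops, bandwidth_tbs, flops_per_byte):
    """Roofline model: performance is capped by compute or by memory."""
    return min(peak_tflops, bandwidth_tbs * flops_per_byte)

# Assumed: 312 TFLOPS FP16 peak, 2.0 TB/s HBM2E.
# Elementwise add in FP16: 1 FLOP per 4 bytes moved (2 reads, ignoring the write).
print(attainable_tflops(312, 2.0, 0.25))  # 0.5 TFLOPS -- badly bandwidth-bound
# A large matmul reuses each byte hundreds of times:
print(attainable_tflops(312, 2.0, 300))   # 312 TFLOPS -- compute-bound
```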
Modern GPU Features
Tensor Cores
Dedicated matrix multiply-accumulate units:
D = A × B + C (4×4 matrices, mixed precision)
Support FP16×FP16→FP32, BF16, TF32, INT8, FP8. Dramatically accelerate deep learning (training and inference).
NVIDIA A100: 312 TFLOPS FP16 with tensor cores. H100: 990 TFLOPS FP16.
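Those peak numbers translate into runtime estimates via FLOP counting (real kernels typically reach 60-90% of peak, so treat this as a lower bound):

```python
def matmul_time_ms(m, n, k, tflops):
    """Ideal time for an m x k times k x n matmul at a given peak rate."""
    flops = 2 * m * n * k  # each output element costs k multiply-adds
    return flops / (tflops * 1e12) * 1e3

# An 8192^3 FP16 matmul at A100's 312 TFLOPS tensor-core peak:
print(round(matmul_time_ms(8192, 8192, 8192, 312), 2))  # 3.52 ms
```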
Ray Tracing Cores
Dedicated hardware for BVH traversal and ray-triangle intersection. Accelerate real-time ray tracing by 10× vs software.
Sparsity Support
2:4 structured sparsity: 2 of every 4 elements can be zero. Tensor cores process sparse matrices at 2× throughput. Models can be pruned to 2:4 sparsity with minimal accuracy loss.
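Magnitude-based 2:4 pruning is simple to sketch: in every group of four weights, zero the two with the smallest absolute value (real pruning workflows also fine-tune afterwards to recover accuracy):

```python
def prune_2_4(row):
    """Enforce 2:4 structured sparsity by zeroing the 2 smallest-magnitude
    values in each group of 4."""
    out = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]),
                      reverse=True)[:2]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

print(prune_2_4([0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.7, 0.0]))
# [0.9, 0.0, 0.0, -1.2, 0.3, 0.0, -0.7, 0.0]
```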
Applications in CS
- Deep learning: Training (matrix multiplies, convolutions) and inference. GPUs enabled the deep learning revolution. NVIDIA CUDA ecosystem dominates.
- Scientific computing: Molecular dynamics, climate modeling, fluid simulation. CUDA, OpenCL, HIP.
- Computer graphics: Real-time rendering (rasterization + ray tracing). Game engines, CAD visualization.
- Cryptocurrency: SHA-256 mining (ASICs dominate now), Ethash (was GPU-friendly).
- Data analytics: GPU-accelerated databases (RAPIDS, BlazingSQL). Parallel sorting, joining, aggregation.
- Signal processing: GPU-accelerated FFT, filtering, image processing.
- Bioinformatics: Sequence alignment, molecular simulation (GROMACS).