GPU Computing
GPU Architecture Overview
GPUs are massively parallel processors designed for throughput, not latency. A modern GPU has thousands of simple cores organized hierarchically.
GPU
└── Streaming Multiprocessors (SMs) [~80-144 on modern GPUs]
    ├── CUDA Cores (ALUs) [~64-128 per SM]
    ├── Tensor Cores [for matrix ops]
    ├── Shared Memory / L1 Cache
    ├── Register File
    └── Warp Schedulers
Key design philosophy: hide memory latency through massive parallelism rather than caches and branch prediction.
CUDA Programming Model
Thread Hierarchy
Grid (entire kernel launch)
└── Blocks (independent groups, map to SMs)
└── Threads (individual execution units)
// Kernel launch syntax
kernel<<<numBlocks, threadsPerBlock>>>(args);
// Or with shared memory and stream
kernel<<<numBlocks, threadsPerBlock, sharedMemBytes, stream>>>(args);
Thread Indexing
// 1D indexing
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// 2D indexing
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
// Grid-stride loop (handles arbitrary sizes)
for (int i = idx; i < n; i += blockDim.x * gridDim.x) {
output[i] = process(input[i]);
}
Warps
A warp is a group of 32 threads that execute in SIMT (Single Instruction, Multiple Thread) lockstep.
- All threads in a warp execute the same instruction simultaneously
- Warp divergence: when threads in a warp take different branches, both paths execute serially (masked execution), reducing throughput
- Warp shuffle: direct register-to-register communication within a warp
// Warp-level primitives
int a = __shfl_sync(mask, var, srcLane);      // read var from a specific lane (broadcast)
int b = __shfl_down_sync(mask, var, delta);   // read var from lane + delta
int c = __shfl_xor_sync(mask, var, laneMask); // butterfly exchange (lane XOR laneMask)
unsigned ballot = __ballot_sync(mask, predicate); // vote
GPU Memory Hierarchy
| Memory | Scope | Latency | Size | Cached |
|--------|-------|---------|------|--------|
| Registers | Thread | ~0 cycles | ~255 per thread | N/A |
| Shared memory | Block | ~20-30 cycles | 48-228 KB per SM | N/A (SRAM) |
| L1 cache | SM | ~30 cycles | Configurable with shared | Yes |
| L2 cache | Device | ~200 cycles | 4-50 MB | Yes |
| Global memory (HBM) | Device | ~400 cycles | 16-80 GB | Via L1/L2 |
| Constant memory | Device | ~5 cycles (cached) | 64 KB | Yes (broadcast) |
| Texture memory | Device | ~400 cycles | Same as global | Spatial cache |
Shared Memory Usage
__global__ void kernel(const float *global_data) {
    __shared__ float smem[256];                // static allocation (size fixed at compile time)
    // extern __shared__ float dynamic_smem[]; // dynamic alternative: size given at launch
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    smem[threadIdx.x] = global_data[idx];      // each thread stages one element
    __syncthreads(); // barrier: all threads in block must reach this point
    // Now all threads can safely read any element of smem
}
Memory Coalescing
Global memory is accessed in 32-byte, 64-byte, or 128-byte transactions. When threads in a warp access contiguous addresses, accesses are coalesced into fewer transactions.
Coalesced (1 transaction):
Thread 0 reads addr+0, Thread 1 reads addr+4, ..., Thread 31 reads addr+124
Strided (32 transactions, worst case):
Thread 0 reads addr+0, Thread 1 reads addr+512, ...
Structure of Arrays (SoA) vs Array of Structures (AoS):
AoS: struct { float x, y, z; } particles[N]; // strided access
SoA: struct { float x[N], y[N], z[N]; } particles; // coalesced access
Prefer SoA layouts for GPU data whenever threads access fields independently; AoS only pays off if each thread consumes a whole struct at once.
Bank Conflicts
Shared memory is divided into 32 banks (one per warp lane). Consecutive 4-byte words map to consecutive banks.
Bank assignment: bank = (address / 4) % 32
No conflict: each thread accesses a different bank
2-way conflict: two threads access the same bank (2x latency)
Broadcast: multiple threads read the same address (no conflict)
Avoiding Bank Conflicts
// Conflict-free access pattern
smem[threadIdx.x] // stride 1: no conflicts
// 32-way bank conflict
smem[threadIdx.x * 32] // all access bank 0
// Padding trick for 2D tiles
__shared__ float tile[32][33]; // pad row length to 33 so column accesses map to different banks
// Without padding, a column access tile[threadIdx.x][c] has stride 32 words,
// so all 32 threads hit the same bank (32-way conflict)
Occupancy
Occupancy = active warps per SM / maximum warps per SM.
Higher occupancy helps hide memory latency (more warps available to schedule while others wait for memory). Limited by:
- Registers per thread: more registers -> fewer threads per SM
- Shared memory per block: more shared memory -> fewer blocks per SM
- Threads per block: should be a multiple of the warp size (32); a partial final warp still occupies a full warp slot
- Blocks per SM: hardware also caps the number of resident blocks per SM
Example:
SM supports 2048 threads max, 65536 registers, 48 KB shared memory
Kernel uses 32 registers/thread, 16 KB shared memory/block, 256 threads/block
Register limit: 65536 / 32 = 2048 threads
Shared mem limit: 48 KB / 16 KB = 3 blocks = 768 threads
Block limit: 2048 / 256 = 8 blocks = 2048 threads
Bottleneck: shared memory -> 768 / 2048 = 37.5% occupancy
Use NVIDIA's occupancy calculator or cudaOccupancyMaxPotentialBlockSize API.
Note: maximum occupancy does not always mean maximum performance. Sometimes using more registers or shared memory (lower occupancy) yields better per-thread performance.
CUDA Streams
Streams enable concurrent execution of kernels, memory transfers, and host computation.
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// Overlap copy and compute
cudaMemcpyAsync(d_a, h_a, size, cudaMemcpyHostToDevice, stream1); // h_a must be pinned (cudaMallocHost) for a truly async copy
kernel<<<grid, block, 0, stream2>>>(d_b); // runs concurrently with copy
cudaStreamSynchronize(stream1);
cudaStreamDestroy(stream1);
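Scaled up, this copy/compute overlap becomes a chunked pipeline. A minimal sketch under stated assumptions: `h_buf` was allocated with cudaMallocHost (pinned memory is required for cudaMemcpyAsync to overlap), `d_buf` holds `n` floats with `n` divisible by the chunk count, and `process` is a hypothetical in-place kernel:

```cuda
const int CHUNKS = 4;
int chunkElems = n / CHUNKS;                 // assumes n divides evenly
size_t chunkBytes = chunkElems * sizeof(float);

cudaStream_t streams[CHUNKS];
for (int i = 0; i < CHUNKS; ++i) cudaStreamCreate(&streams[i]);

for (int i = 0; i < CHUNKS; ++i) {
    int off = i * chunkElems;
    // Each chunk's copy-in, kernel, and copy-out are ordered within its own
    // stream, but chunks in different streams overlap with one another.
    cudaMemcpyAsync(d_buf + off, h_buf + off, chunkBytes,
                    cudaMemcpyHostToDevice, streams[i]);
    process<<<(chunkElems + 255) / 256, 256, 0, streams[i]>>>(d_buf + off, chunkElems);
    cudaMemcpyAsync(h_buf + off, d_buf + off, chunkBytes,
                    cudaMemcpyDeviceToHost, streams[i]);
}
for (int i = 0; i < CHUNKS; ++i) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}
```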
Events
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, stream);
kernel<<<grid, block, 0, stream>>>(args);
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);
float ms;
cudaEventElapsedTime(&ms, start, stop);
OpenCL
Open standard for heterogeneous computing (GPUs, CPUs, FPGAs, DSPs).
Terminology Mapping
| CUDA | OpenCL |
|------|--------|
| Thread | Work-item |
| Block | Work-group |
| Grid | NDRange |
| Shared memory | Local memory |
| Global memory | Global memory |
| Warp | Sub-group (vendor-dependent size) |
| __global__ | __kernel |
| __shared__ | __local |
Key Differences from CUDA
- Platform and device model: must query and select devices at runtime
- Kernels compiled at runtime from source strings (or pre-compiled SPIR-V)
- More verbose API but portable across vendors
- Command queues instead of streams
SYCL
Modern C++ abstraction for heterogeneous computing, built on top of OpenCL concepts.
#include <sycl/sycl.hpp>
using namespace sycl;
queue q;
buffer<float> buf(data, range<1>(N));
q.submit([&](handler &h) {
accessor acc(buf, h, read_write);
h.parallel_for(range<1>(N), [=](id<1> i) {
acc[i] = acc[i] * 2.0f;
});
});
q.wait();
- Single-source programming (host and device code in same file)
- Uses C++ lambdas for kernels
- Implementations: Intel oneAPI DPC++, AdaptiveCpp (formerly hipSYCL); ComputeCpp has been discontinued
Tensor Cores
Specialized matrix multiply-accumulate units on NVIDIA GPUs (Volta and later).
Operation: D = A * B + C
A: 16x16 matrix (FP16)
B: 16x16 matrix (FP16)
C, D: 16x16 matrix (FP16 or FP32)
WMMA API
#include <mma.h>
using namespace nvcuda::wmma;
fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
fragment<accumulator, 16, 16, 16, float> c_frag;
load_matrix_sync(a_frag, a_ptr, lda);
load_matrix_sync(b_frag, b_ptr, ldb);
fill_fragment(c_frag, 0.0f);
mma_sync(c_frag, a_frag, b_frag, c_frag);
store_matrix_sync(c_ptr, c_frag, ldc, mem_row_major);
Supported Precision (Hopper/Ada)
| Input | Accumulator | Peak throughput (H100) |
|-------|-------------|------------------------|
| FP16 | FP32 | ~990 TFLOPS |
| BF16 | FP32 | ~990 TFLOPS |
| TF32 | FP32 | ~495 TFLOPS |
| FP8 | FP32 | ~1979 TFLOPS |
| INT8 | INT32 | ~1979 TOPS |
Tensor cores are essential for deep learning training and inference workloads.
Performance Optimization Checklist
- Maximize parallelism: launch enough threads to saturate the GPU
- Coalesce global memory: use SoA, aligned access
- Use shared memory: reduce global memory traffic
- Avoid warp divergence: restructure branches, sort data
- Minimize bank conflicts: pad shared memory arrays
- Balance occupancy vs per-thread resources: profile and tune
- Overlap compute and transfers: use streams and async copies
- Use appropriate precision: FP16/BF16 for throughput, FP32/FP64 when needed
- Reduce synchronization: minimize __syncthreads(), use warp-level ops
- Profile: use Nsight Compute and Nsight Systems to find real bottlenecks