GPU Computing
GPU Architecture Overview
GPUs are massively parallel processors designed for throughput, not latency. A modern GPU has thousands of simple cores organized hierarchically.
GPU
└── Streaming Multiprocessors (SMs) [~80-144 on modern GPUs]
    ├── CUDA Cores (ALUs) [~64-128 per SM]
    ├── Tensor Cores [for matrix ops]
    ├── Shared Memory / L1 Cache
    ├── Register File
    └── Warp Schedulers
Key design philosophy: hide memory latency through massive parallelism rather than caches and branch prediction.
CUDA Programming Model
Thread Hierarchy
Grid (entire kernel launch)
└── Blocks (independent groups, map to SMs)
└── Threads (individual execution units)
// Kernel launch syntax
kernel<<<numBlocks, threadsPerBlock>>>(args);
// Or with shared memory and stream
kernel<<<numBlocks, threadsPerBlock, sharedMemBytes, stream>>>(args);
Thread Indexing
// 1D indexing
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// 2D indexing
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
// Grid-stride loop (handles arbitrary sizes)
for (int i = idx; i < n; i += blockDim.x * gridDim.x) {
output[i] = process(input[i]);
}
Warps
A warp is a group of 32 threads that execute in SIMT (Single Instruction, Multiple Thread) lockstep.
- All threads in a warp execute the same instruction simultaneously
- Warp divergence: when threads in a warp take different branches, both paths execute serially (masked execution), reducing throughput
- Warp shuffle: direct register-to-register communication within a warp
// Warp-level primitives
int a = __shfl_sync(mask, var, srcLane);      // read var from a specific lane (broadcast)
int b = __shfl_down_sync(mask, var, delta);   // read var from lane + delta
int c = __shfl_xor_sync(mask, var, laneMask); // butterfly exchange (lane XOR laneMask)
unsigned ballot = __ballot_sync(mask, predicate); // vote
GPU Memory Hierarchy
| Memory | Scope | Latency | Size | Cached |
|--------|-------|---------|------|--------|
| Registers | Thread | ~0 cycles | ~255 per thread | N/A |
| Shared memory | Block | ~20-30 cycles | 48-228 KB per SM | N/A (SRAM) |
| L1 cache | SM | ~30 cycles | Configurable with shared | Yes |
| L2 cache | Device | ~200 cycles | 4-50 MB | Yes |
| Global memory (HBM) | Device | ~400 cycles | 16-80 GB | Via L1/L2 |
| Constant memory | Device | ~5 cycles (cached) | 64 KB | Yes (broadcast) |
| Texture memory | Device | ~400 cycles | Same as global | Spatial cache |
Shared Memory Usage
__global__ void kernel(const float *global_data) {
    __shared__ float smem[256];                // static allocation (size fixed at compile time)
    // extern __shared__ float dynamic_smem[]; // dynamic alternative: size given at launch
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    smem[threadIdx.x] = global_data[idx];      // each thread stages one element
    __syncthreads(); // barrier: all threads in block must reach this point
    // Now all threads can safely read any element of smem
}
Memory Coalescing
Global memory is accessed in 32-byte, 64-byte, or 128-byte transactions. When threads in a warp access contiguous addresses, accesses are coalesced into fewer transactions.
Coalesced (1 transaction):
Thread 0 reads addr+0, Thread 1 reads addr+4, ..., Thread 31 reads addr+124
Strided (32 transactions, worst case):
Thread 0 reads addr+0, Thread 1 reads addr+512, ...
Structure of Arrays (SoA) vs Array of Structures (AoS):
AoS: struct { float x, y, z; } particles[N]; // strided access
SoA: struct { float x[N], y[N], z[N]; } particles; // coalesced access
Prefer SoA layouts for GPU data whenever threads access fields independently; AoS only pays off if each thread consumes a whole struct at once.
Bank Conflicts
Shared memory is divided into 32 banks (one per warp lane). Consecutive 4-byte words map to consecutive banks.
Bank assignment: bank = (address / 4) % 32
No conflict: each thread accesses a different bank
2-way conflict: two threads access the same bank (2x latency)
Broadcast: multiple threads read the same address (no conflict)
Avoiding Bank Conflicts
// Conflict-free access pattern
smem[threadIdx.x] // stride 1: no conflicts
// 32-way bank conflict
smem[threadIdx.x * 32] // all access bank 0
// Padding trick for 2D tiles
__shared__ float tile[32][33]; // pad row length to 33 so column accesses map to different banks
// Without padding, a column access tile[threadIdx.x][c] has stride 32 words,
// so all 32 threads hit the same bank (32-way conflict)
Occupancy
Occupancy = active warps per SM / maximum warps per SM.
Higher occupancy helps hide memory latency (more warps available to schedule while others wait for memory). Limited by:
- Registers per thread: more registers -> fewer threads per SM
- Shared memory per block: more shared memory -> fewer blocks per SM
- Threads per block: should be a multiple of the warp size (32); a partial final warp still occupies a full warp slot
- Blocks per SM: hardware also caps the number of resident blocks per SM
Example:
SM supports 2048 threads max, 65536 registers, 48 KB shared memory
Kernel uses 32 registers/thread, 16 KB shared memory/block, 256 threads/block
Register limit: 65536 / 32 = 2048 threads
Shared mem limit: 48 KB / 16 KB = 3 blocks = 768 threads
Block limit: 2048 / 256 = 8 blocks = 2048 threads
Bottleneck: shared memory -> 768 / 2048 = 37.5% occupancy
Use NVIDIA's occupancy calculator or cudaOccupancyMaxPotentialBlockSize API.
Note: maximum occupancy does not always mean maximum performance. Sometimes using more registers or shared memory (lower occupancy) yields better per-thread performance.
CUDA Streams
Streams enable concurrent execution of kernels, memory transfers, and host computation.
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// Overlap copy and compute
cudaMemcpyAsync(d_a, h_a, size, cudaMemcpyHostToDevice, stream1); // h_a must be pinned (cudaMallocHost) for a truly async copy
kernel<<<grid, block, 0, stream2>>>(d_b); // runs concurrently with copy
cudaStreamSynchronize(stream1);
cudaStreamDestroy(stream1);
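Scaled up, this copy/compute overlap becomes a chunked pipeline. A minimal sketch under stated assumptions: `h_buf` was allocated with cudaMallocHost (pinned memory is required for cudaMemcpyAsync to overlap), `d_buf` holds `n` floats with `n` divisible by the chunk count, and `process` is a hypothetical in-place kernel:

```cuda
const int CHUNKS = 4;
int chunkElems = n / CHUNKS;                 // assumes n divides evenly
size_t chunkBytes = chunkElems * sizeof(float);

cudaStream_t streams[CHUNKS];
for (int i = 0; i < CHUNKS; ++i) cudaStreamCreate(&streams[i]);

for (int i = 0; i < CHUNKS; ++i) {
    int off = i * chunkElems;
    // Each chunk's copy-in, kernel, and copy-out are ordered within its own
    // stream, but chunks in different streams overlap with one another.
    cudaMemcpyAsync(d_buf + off, h_buf + off, chunkBytes,
                    cudaMemcpyHostToDevice, streams[i]);
    process<<<(chunkElems + 255) / 256, 256, 0, streams[i]>>>(d_buf + off, chunkElems);
    cudaMemcpyAsync(h_buf + off, d_buf + off, chunkBytes,
                    cudaMemcpyDeviceToHost, streams[i]);
}
for (int i = 0; i < CHUNKS; ++i) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}
```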
Events
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, stream);
kernel<<<grid, block, 0, stream>>>(args);
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);
float ms;
cudaEventElapsedTime(&ms, start, stop);
OpenCL
Open standard for heterogeneous computing (GPUs, CPUs, FPGAs, DSPs).
Terminology Mapping
| CUDA | OpenCL |
|------|--------|
| Thread | Work-item |
| Block | Work-group |
| Grid | NDRange |
| Shared memory | Local memory |
| Global memory | Global memory |
| Warp | Sub-group (vendor-dependent size) |
| __global__ | __kernel |
| __shared__ | __local |
Key Differences from CUDA
- Platform and device model: must query and select devices at runtime
- Kernels compiled at runtime from source strings (or pre-compiled SPIR-V)
- More verbose API but portable across vendors
- Command queues instead of streams
SYCL
Modern C++ abstraction for heterogeneous computing, built on top of OpenCL concepts.
#include <sycl/sycl.hpp>
using namespace sycl;
queue q;
buffer<float> buf(data, range<1>(N));
q.submit([&](handler &h) {
accessor acc(buf, h, read_write);
h.parallel_for(range<1>(N), [=](id<1> i) {
acc[i] = acc[i] * 2.0f;
});
});
q.wait();
- Single-source programming (host and device code in same file)
- Uses C++ lambdas for kernels
- Implementations: Intel oneAPI DPC++, AdaptiveCpp (formerly hipSYCL); ComputeCpp has been discontinued
Tensor Cores
Specialized matrix multiply-accumulate units on NVIDIA GPUs (Volta and later).
Operation: D = A * B + C
A: 16x16 matrix (FP16)
B: 16x16 matrix (FP16)
C, D: 16x16 matrix (FP16 or FP32)
WMMA API
#include <mma.h>
using namespace nvcuda::wmma;
fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
fragment<accumulator, 16, 16, 16, float> c_frag;
load_matrix_sync(a_frag, a_ptr, lda);
load_matrix_sync(b_frag, b_ptr, ldb);
fill_fragment(c_frag, 0.0f);
mma_sync(c_frag, a_frag, b_frag, c_frag);
store_matrix_sync(c_ptr, c_frag, ldc, mem_row_major);
Supported Precision (Hopper/Ada)
| Input | Accumulator | Peak throughput (H100) |
|-------|-------------|------------------------|
| FP16 | FP32 | ~990 TFLOPS |
| BF16 | FP32 | ~990 TFLOPS |
| TF32 | FP32 | ~495 TFLOPS |
| FP8 | FP32 | ~1979 TFLOPS |
| INT8 | INT32 | ~1979 TOPS |
Tensor cores are essential for deep learning training and inference workloads.
Performance Optimization Checklist
- Maximize parallelism: launch enough threads to saturate the GPU
- Coalesce global memory: use SoA, aligned access
- Use shared memory: reduce global memory traffic
- Avoid warp divergence: restructure branches, sort data
- Minimize bank conflicts: pad shared memory arrays
- Balance occupancy vs per-thread resources: profile and tune
- Overlap compute and transfers: use streams and async copies
- Use appropriate precision: FP16/BF16 for throughput, FP32/FP64 when needed
- Reduce synchronization: minimize __syncthreads(), use warp-level ops
- Profile: use Nsight Compute and Nsight Systems to find real bottlenecks