
GPU Computing

GPU Architecture Overview

GPUs are massively parallel processors designed for throughput, not latency. A modern GPU has thousands of simple cores organized hierarchically.

GPU
 └── Streaming Multiprocessors (SMs)     [~80-144 on modern GPUs]
      ├── CUDA Cores (ALUs)              [~64-128 per SM]
      ├── Tensor Cores                   [for matrix ops]
      ├── Shared Memory / L1 Cache
      ├── Register File
      └── Warp Schedulers

Key design philosophy: hide memory latency through massive parallelism rather than caches and branch prediction.

CUDA Programming Model

Thread Hierarchy

Grid (entire kernel launch)
 └── Blocks (independent groups, map to SMs)
      └── Threads (individual execution units)

// Kernel launch syntax
kernel<<<numBlocks, threadsPerBlock>>>(args);
// Or with shared memory and stream
kernel<<<numBlocks, threadsPerBlock, sharedMemBytes, stream>>>(args);

Thread Indexing

// 1D indexing
int idx = blockIdx.x * blockDim.x + threadIdx.x;

// 2D indexing
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;

// Grid-stride loop (handles arbitrary sizes)
for (int i = idx; i < n; i += blockDim.x * gridDim.x) {
    output[i] = process(input[i]);
}

Warps

A warp is a group of 32 threads that execute in SIMT (Single Instruction, Multiple Thread) lockstep.

  • All threads in a warp execute the same instruction simultaneously
  • Warp divergence: when threads in a warp take different branches, both paths execute serially (masked execution), reducing throughput
  • Warp shuffle: direct register-to-register communication within a warp

// Warp-level primitives
int val = __shfl_sync(mask, var, srcLane);         // broadcast
int val = __shfl_down_sync(mask, var, delta);      // shift down
int val = __shfl_xor_sync(mask, var, laneMask);    // butterfly
unsigned ballot = __ballot_sync(mask, predicate);   // vote

GPU Memory Hierarchy

| Memory | Scope | Latency | Size | Cached |
|--------|-------|---------|------|--------|
| Registers | Thread | 0 cycles | ~255 per thread | N/A |
| Shared memory | Block | ~5 cycles | 48-228 KB per SM | N/A (SRAM) |
| L1 cache | SM | ~30 cycles | Configurable with shared | Yes |
| L2 cache | Device | ~200 cycles | 4-50 MB | Yes |
| Global memory (HBM) | Device | ~400 cycles | 16-80 GB | Via L1/L2 |
| Constant memory | Device | ~5 cycles (cached) | 64 KB | Yes (broadcast) |
| Texture memory | Device | ~400 cycles | Same as global | Spatial cache |

Shared Memory Usage

__global__ void kernel(const float *global_data) {
    __shared__ float smem[256];              // static allocation
    extern __shared__ float dynamic_smem[];  // dynamic (size given at launch)

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    smem[threadIdx.x] = global_data[idx];
    __syncthreads();  // barrier: all threads in the block must reach this point
    // Now all threads can safely read any element of smem
}

Memory Coalescing

Global memory is accessed in 32-byte, 64-byte, or 128-byte transactions. When threads in a warp access contiguous addresses, accesses are coalesced into fewer transactions.

Coalesced (1 transaction):
  Thread 0 reads addr+0, Thread 1 reads addr+4, ..., Thread 31 reads addr+124

Strided (32 transactions, worst case):
  Thread 0 reads addr+0, Thread 1 reads addr+512, ...

Structure of Arrays (SoA) vs Array of Structures (AoS):
  AoS: struct { float x, y, z; } particles[N];     // strided access
  SoA: struct { float x[N], y[N], z[N]; } particles; // coalesced access

Prefer SoA layouts for data that consecutive threads access element-wise, so that each field is read with stride-1 (coalesced) accesses.

Bank Conflicts

Shared memory is divided into 32 banks (one per warp lane). Consecutive 4-byte words map to consecutive banks.

Bank assignment: bank = (address / 4) % 32

No conflict:    each thread accesses a different bank
2-way conflict: two threads access the same bank (2x latency)
Broadcast:      multiple threads read the same address (no conflict)

Avoiding Bank Conflicts

// Conflict-free access pattern
smem[threadIdx.x]                    // stride 1: no conflicts

// 32-way bank conflict
smem[threadIdx.x * 32]              // all access bank 0

// Padding trick for 2D arrays
__shared__ float tile[32][33];       // pad to 33 columns to avoid conflicts
// Without padding, a column access tile[threadIdx.x][c] strides by 32 words,
// so all 32 threads hit the same bank (32-way conflict); with 33 columns the
// stride is 33, which cycles through all banks

Occupancy

Occupancy = active warps per SM / maximum warps per SM.

Higher occupancy helps hide memory latency (more warps available to schedule while others wait for memory). Limited by:

  1. Registers per thread: more registers -> fewer resident threads per SM
  2. Shared memory per block: more shared memory -> fewer resident blocks per SM
  3. Threads per block: the SM's maximum thread and block counts cap how many blocks can be resident; block sizes should be a multiple of the warp size (32)

Example:
  SM supports 2048 threads max, 65536 registers, 48 KB shared memory
  Kernel uses 32 registers/thread, 16 KB shared memory/block, 256 threads/block

  Register limit: 65536 / 32 = 2048 threads
  Shared mem limit: 48 KB / 16 KB = 3 blocks = 768 threads
  Block limit: 2048 / 256 = 8 blocks = 2048 threads

  Bottleneck: shared memory -> 768 / 2048 = 37.5% occupancy

Use NVIDIA's occupancy calculator or cudaOccupancyMaxPotentialBlockSize API.

Note: maximum occupancy does not always mean maximum performance. Sometimes using more registers or shared memory (lower occupancy) yields better per-thread performance.

CUDA Streams

Streams enable concurrent execution of kernels, memory transfers, and host computation.

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Overlap copy and compute (h_a must be pinned host memory, allocated with
// cudaMallocHost, for cudaMemcpyAsync to actually overlap with the kernel)
cudaMemcpyAsync(d_a, h_a, size, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream2>>>(d_b);  // runs concurrently with the copy

cudaStreamSynchronize(stream1);
cudaStreamDestroy(stream1);

Events

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, stream);
kernel<<<grid, block, 0, stream>>>(args);
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);
float ms;
cudaEventElapsedTime(&ms, start, stop);

OpenCL

Open standard for heterogeneous computing (GPUs, CPUs, FPGAs, DSPs).

Terminology Mapping

| CUDA | OpenCL |
|------|--------|
| Thread | Work-item |
| Block | Work-group |
| Grid | NDRange |
| Shared memory | Local memory |
| Global memory | Global memory |
| Warp | Sub-group (vendor-dependent size) |
| __global__ | __kernel |
| __shared__ | __local |

Key Differences from CUDA

  • Platform and device model: must query and select devices at runtime
  • Kernels compiled at runtime from source strings (or pre-compiled SPIR-V)
  • More verbose API but portable across vendors
  • Command queues instead of streams

SYCL

Modern C++ abstraction for heterogeneous computing, built on top of OpenCL concepts.

#include <sycl/sycl.hpp>
using namespace sycl;

queue q;
buffer<float> buf(data, range<1>(N));

q.submit([&](handler &h) {
    accessor acc(buf, h, read_write);
    h.parallel_for(range<1>(N), [=](id<1> i) {
        acc[i] = acc[i] * 2.0f;
    });
});
q.wait();

  • Single-source programming (host and device code in the same file)
  • Uses C++ lambdas for kernels
  • Implementations: Intel oneAPI DPC++, AdaptiveCpp (formerly hipSYCL)

Tensor Cores

Specialized matrix multiply-accumulate units on NVIDIA GPUs (Volta and later).

Operation: D = A * B + C
  A: 16x16 matrix (FP16)
  B: 16x16 matrix (FP16)
  C, D: 16x16 matrix (FP16 or FP32)

WMMA API

#include <mma.h>
using namespace nvcuda::wmma;

fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
fragment<accumulator, 16, 16, 16, float> c_frag;

load_matrix_sync(a_frag, a_ptr, lda);
load_matrix_sync(b_frag, b_ptr, ldb);
fill_fragment(c_frag, 0.0f);
mma_sync(c_frag, a_frag, b_frag, c_frag);
store_matrix_sync(c_ptr, c_frag, ldc, mem_row_major);

Supported Precision (Hopper/Ada)

| Input | Accumulator | Peak TFLOPS (H100) |
|-------|-------------|--------------------|
| FP16 | FP32 | ~990 |
| BF16 | FP32 | ~990 |
| TF32 | FP32 | ~495 |
| FP8 | FP32 | ~1979 |
| INT8 | INT32 | ~1979 TOPS |

Tensor cores are essential for deep learning training and inference workloads.

Performance Optimization Checklist

  1. Maximize parallelism: launch enough threads to saturate the GPU
  2. Coalesce global memory: use SoA, aligned access
  3. Use shared memory: reduce global memory traffic
  4. Avoid warp divergence: restructure branches, sort data
  5. Minimize bank conflicts: pad shared memory arrays
  6. Balance occupancy vs per-thread resources: profile and tune
  7. Overlap compute and transfers: use streams and async copies
  8. Use appropriate precision: FP16/BF16 for throughput, FP32/FP64 when needed
  9. Reduce synchronization: minimize __syncthreads(), use warp-level ops
  10. Profile: use Nsight Compute and Nsight Systems to find real bottlenecks