GPU Programming

Overview

Modern GPUs are massively parallel processors with thousands of cores optimized for the graphics pipeline and general-purpose computation. Programming them requires understanding shader stages, graphics APIs, memory hierarchies, and synchronization primitives.

Shader Types

Vertex Shader

Runs once per vertex. Transforms positions, computes per-vertex data.

// GLSL example (the version directive is required for layout qualifiers and in/out)
#version 330 core
layout(location = 0) in vec3 aPosition;
layout(location = 1) in vec3 aNormal;

uniform mat4 uMVP;
uniform mat3 uNormalMatrix;

out vec3 vNormal;

void main() {
    gl_Position = uMVP * vec4(aPosition, 1.0);
    vNormal = normalize(uNormalMatrix * aNormal);
}

Inputs: vertex attributes from buffers. Outputs: clip-space position and interpolants for the fragment shader.
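As a rough CPU-side illustration of what the vertex stage hands to the rasterizer, the sketch below multiplies a position by a 4x4 matrix and then applies the perspective divide that fixed-function hardware performs after the shader. The helper names (mat_vec4, to_ndc) and the sample matrix are assumptions made for this example, not part of any graphics API.

```python
def mat_vec4(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def to_ndc(clip):
    """Perspective divide: clip space -> normalized device coordinates."""
    x, y, z, w = clip
    return [x / w, y / w, z / w]

# Hypothetical MVP: scales x and y by 2 and copies z into w
mvp = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

clip = mat_vec4(mvp, [1.0, 0.5, 4.0, 1.0])  # the value gl_Position would hold
ndc = to_ndc(clip)                          # done by hardware, not the shader
```

The shader itself only writes the clip-space value; the divide by w happens later, which is why gl_Position stays a vec4.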

Fragment (Pixel) Shader

Runs once per fragment (candidate pixel). Computes final color.

#version 330 core
in vec3 vNormal;
out vec4 fragColor;

void main() {
    vec3 lightDir = normalize(vec3(1.0, 1.0, 1.0));
    float NdotL = max(dot(normalize(vNormal), lightDir), 0.0);
    fragColor = vec4(vec3(NdotL), 1.0);
}

Has access to interpolated vertex outputs, textures, and uniforms. Can discard fragments or write to multiple render targets (MRT).
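The diffuse term computed by the fragment shader above can be checked on the CPU. This is a minimal sketch of the same N.L calculation, assuming the same fixed light direction; the function names are chosen for this example only.

```python
import math

def normalize(v):
    """Scale a 3-component vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]

def lambert(normal, light_dir=(1.0, 1.0, 1.0)):
    """Return an RGBA color from the clamped dot product of normal and light."""
    n = normalize(normal)      # interpolated normals are generally not unit length
    l = normalize(light_dir)
    n_dot_l = max(sum(a * b for a, b in zip(n, l)), 0.0)
    return [n_dot_l, n_dot_l, n_dot_l, 1.0]

facing = lambert([1.0, 1.0, 1.0])    # surface facing the light
away = lambert([0.0, 0.0, -1.0])     # surface facing away, clamped to black
```

Re-normalizing in the fragment stage matters because linear interpolation between two unit normals yields a shorter vector.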

Geometry Shader

Operates on complete primitives (points, lines, triangles). Can emit zero or more new primitives. Use cases:

Billboard (point sprite) generation from points
Wireframe overlay
Shadow volume extrusion
Layered rendering (cubemap faces in one pass)

Performance caveat: often a bottleneck due to variable output and poor parallelism. Prefer compute shaders or mesh shaders when possible.
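To make the billboard use case concrete, the following sketch expands one point into the four corners a geometry shader would emit as a triangle strip. It runs on the CPU and assumes the camera's right and up vectors are already known; billboard_quad is a hypothetical helper, not an API function.

```python
def billboard_quad(center, right, up, half_size):
    """Return 4 corners in triangle-strip order for a camera-facing quad."""
    cx, cy, cz = center
    corners = []
    # Strip order: bottom-left, bottom-right, top-left, top-right
    for sx, sy in ((-1, -1), (1, -1), (-1, 1), (1, 1)):
        corners.append((
            cx + half_size * (sx * right[0] + sy * up[0]),
            cy + half_size * (sx * right[1] + sy * up[1]),
            cz + half_size * (sx * right[2] + sy * up[2]),
        ))
    return corners

quad = billboard_quad((0.0, 0.0, -5.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), 0.5)
```

The same expansion can be done in a vertex shader indexed by vertex ID or in a compute pass, which avoids the geometry stage entirely.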

Tessellation Shaders

Two stages that subdivide patches into finer geometry:

Tessellation Control Shader (TCS / Hull Shader):

Runs once per output control point
Sets tessellation levels (inner and outer)
Passes patch data to the tessellation engine

Tessellation Evaluation Shader (TES / Domain Shader):

Runs per generated vertex
Receives tessellation coordinates from the tessellator (barycentric for triangle patches, (u, v) for quads and isolines)
Evaluates the surface position (Bezier, PN-Triangles, displacement)

Input Patch --> TCS --> Fixed-Function Tessellator --> TES --> Output Vertices
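The evaluation step can be sketched for the simplest case, a flat triangle patch, where the TES blends the three control points by the barycentric coordinate it receives (gl_TessCoord in GLSL). The function name and sample patch below are assumptions for illustration.

```python
def evaluate_triangle_patch(p0, p1, p2, bary):
    """Blend three control points with barycentric weights (u, v, w)."""
    u, v, w = bary
    return tuple(u * a + v * b + w * c for a, b, c in zip(p0, p1, p2))

# Hypothetical patch in the z = 0 plane
p0, p1, p2 = (0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.0, 0.0)

corner = evaluate_triangle_patch(p0, p1, p2, (1.0, 0.0, 0.0))
edge_mid = evaluate_triangle_patch(p0, p1, p2, (0.5, 0.5, 0.0))
```

Curved schemes such as PN-Triangles or displacement mapping replace the linear blend with a higher-order evaluation but consume the same coordinates.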

Compute Shader

General-purpose GPU computation outside the graphics pipeline. Organized in workgroups with shared memory.

#version 430
layout(local_size_x = 256) in;

layout(std430, binding = 0) buffer Data { float values[]; };

shared float cache[256];

void main() {
    uint id = gl_GlobalInvocationID.x;
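    // Pad out-of-range invocations so every thread still reaches barrier()
    cache[gl_LocalInvocationID.x] = (id < uint(values.length())) ? values[id] : 0.0;
    barrier();  // shared-memory writes are visible to the whole workgroup after this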
    // parallel reduction, filtering, simulation, etc.
}

Dispatch model: a dispatch call (glDispatchCompute in OpenGL, vkCmdDispatch in Vulkan) takes (num_groups_x, num_groups_y, num_groups_z); each group runs local_size_x * local_size_y * local_size_z invocations.
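Two details of the dispatch model are easy to get wrong: rounding the group count up, and handling the partially filled last group. The sketch below emulates both on the CPU, together with a shared-memory tree reduction per workgroup. It assumes local_size is a power of two; the function names are illustrative.

```python
def num_groups(item_count, local_size):
    """Ceiling division: enough workgroups to cover every item."""
    return (item_count + local_size - 1) // local_size

def workgroup_sum(values, local_size=256):
    """Emulate one tree reduction per workgroup; returns one partial sum each."""
    partial = []
    for group in range(num_groups(len(values), local_size)):
        base = group * local_size
        # Load into "shared memory", padding out-of-range invocations with 0
        cache = [values[base + i] if base + i < len(values) else 0.0
                 for i in range(local_size)]
        stride = local_size // 2
        while stride > 0:
            # On a GPU, each pass would be separated by barrier()
            for i in range(stride):
                cache[i] += cache[i + stride]
            stride //= 2
        partial.append(cache[0])
    return partial

partials = workgroup_sum([1.0] * 1000)
```

With 1000 items and 256 invocations per group, four groups are dispatched and the last one covers only 232 real items.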

Mesh Shaders (Modern)

Replace vertex + geometry + tessellation stages with two programmable stages:

Task (Amplification) Shader: Optional, determines how many mesh shader workgroups to launch (culling, LOD)
Mesh Shader: Outputs vertices and primitives directly from a workgroup, with shared memory

Benefits: workgroup-level parallelism, explicit output, GPU-driven culling. Supported in Vulkan (VK_EXT_mesh_shader), DirectX 12, and Metal 3.
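The culling role of the task stage can be approximated on the CPU. This sketch tests each meshlet's bounding sphere against a set of planes and returns the meshlets that would receive a mesh shader workgroup. The data layout (center, radius pairs and (a, b, c, d) planes) is an assumption for this example; real implementations often add cone-based backface culling as well.

```python
def sphere_visible(center, radius, planes):
    """True unless the sphere lies entirely behind one of the planes."""
    for a, b, c, d in planes:
        distance = a * center[0] + b * center[1] + c * center[2] + d
        if distance < -radius:
            return False
    return True

def task_stage(meshlets, planes):
    """Return indices of meshlets that should launch a mesh shader workgroup."""
    return [i for i, (center, radius) in enumerate(meshlets)
            if sphere_visible(center, radius, planes)]

# Hypothetical scene: one plane that keeps everything with z <= -1
planes = [(0.0, 0.0, -1.0, -1.0)]
meshlets = [
    ((0.0, 0.0, -5.0), 1.0),   # fully inside: kept
    ((0.0, 0.0, 3.0), 1.0),    # fully behind the plane: culled
    ((0.0, 0.0, -0.5), 1.0),   # straddles the plane: kept
]
visible = task_stage(meshlets, planes)
```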

Graphics APIs

OpenGL / OpenGL ES

Cross-platform, mature, large ecosystem
Global state machine model
Single-threaded command submission
GLSL shading language
OpenGL ES for mobile/embedded
Being superseded by Vulkan but still widely used in education and tools

Vulkan

Low-overhead, explicit GPU control
Multi-threaded command buffer recording
Explicit memory management, synchronization, and resource transitions
SPIR-V binary shader format
Render passes with subpass dependencies
Descriptor sets for resource binding
Cross-platform (Windows, Linux, Android, macOS via MoltenVK)

DirectX 12

Microsoft's low-level API (Windows, Xbox)
Similar philosophy to Vulkan: explicit control, multi-threaded
HLSL compiled to DXIL
Root signatures for resource binding
Command lists and command queues
DirectX Raytracing (DXR) for hardware RT

Metal

Apple's GPU API (macOS, iOS, visionOS)
Moderate abstraction level between OpenGL and Vulkan
Metal Shading Language (C++-based)
Argument buffers for bindless-style access
Tile shading for on-chip memory on Apple GPUs
Mesh shaders and ray tracing in Metal 3

WebGPU

Modern web standard, successor to WebGL
Designed to map to Vulkan, Metal, and D3D12
WGSL shading language
Explicit resource binding and pipeline creation
Available in browsers (Chrome, Firefox) and native (wgpu, Dawn)

SPIR-V

Standard Portable Intermediate Representation for Vulkan and OpenCL. Binary format that decouples shading language from driver compilation.

Source (GLSL/HLSL) --> Compiler (glslc/DXC) --> SPIR-V --> Driver --> GPU ISA

Benefits:

Offline compilation catches errors early
Reduces driver complexity and shader compilation stalls
Enables cross-compilation between shading languages
Reflection data for automatic descriptor layout generation
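Because SPIR-V is a plain stream of 32-bit words, basic tooling is straightforward. The sketch below reads the five-word module header (magic number, version, generator, ID bound, reserved schema word). It assumes a little-endian module; the function name and the hand-built test header are illustrative only.

```python
import struct

SPIRV_MAGIC = 0x07230203

def parse_spirv_header(blob):
    """Decode the 5-word header at the start of a little-endian SPIR-V module."""
    if len(blob) < 20:
        raise ValueError("SPIR-V module is shorter than its 5-word header")
    magic, version, generator, bound, schema = struct.unpack("<5I", blob[:20])
    if magic != SPIRV_MAGIC:
        raise ValueError("not a little-endian SPIR-V module")
    return {
        # Version word layout: 0 | major | minor | 0 (one byte each)
        "version": ((version >> 16) & 0xFF, (version >> 8) & 0xFF),
        "generator": generator,
        "id_bound": bound,
        "schema": schema,
    }

# Hand-built header for SPIR-V 1.3 with an ID bound of 42
header = struct.pack("<5I", SPIRV_MAGIC, 0x00010300, 0, 42, 0)
info = parse_spirv_header(header)
```

Full reflection (bindings, push constants, entry points) requires walking the instruction stream that follows the header, which is what dedicated reflection libraries do.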

Render Passes

Vulkan Render Pass

Defines the structure of a rendering operation: attachments (color, depth, resolve), subpasses, and dependencies.

Render Pass:
  Attachment 0: Color (RGBA8, loadOp = CLEAR, storeOp = STORE)
  Attachment 1: Depth (D32F, loadOp = CLEAR, storeOp = DONT_CARE)

  Subpass 0: Color[0] as color attachment, Depth[1] as depth attachment

Load/store operations control whether attachments are cleared, loaded, or discarded, which is critical for tile-based GPUs that can avoid off-chip memory access.
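A back-of-the-envelope model shows why these flags matter on tile-based hardware. The sketch below estimates off-chip traffic per attachment per pass, under the simplifying assumption that LOAD costs one full read, STORE costs one full write, and CLEAR or DONT_CARE cost nothing. It ignores compression, MSAA, and partial-tile effects, so treat the numbers as illustrative only.

```python
def attachment_traffic_bytes(width, height, bytes_per_pixel, load_op, store_op):
    """Estimate off-chip bytes moved for one attachment in one render pass."""
    size = width * height * bytes_per_pixel
    reads = size if load_op == "LOAD" else 0       # CLEAR / DONT_CARE: no read
    writes = size if store_op == "STORE" else 0    # DONT_CARE: no write-back
    return reads + writes

# 1080p pass matching the attachment setup described above
color = attachment_traffic_bytes(1920, 1080, 4, "CLEAR", "STORE")
depth = attachment_traffic_bytes(1920, 1080, 4, "CLEAR", "DONT_CARE")
depth_wasteful = attachment_traffic_bytes(1920, 1080, 4, "LOAD", "STORE")
```

In this model a depth buffer that is only needed during the pass never touches main memory, while the LOAD/STORE variant moves about 16.6 MB per frame for no visible benefit.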

Dynamic rendering (VK_KHR_dynamic_rendering, core in Vulkan 1.3): Simplifies render pass setup by removing the need for framebuffer and render pass objects; attachments are specified when rendering begins. Preferred in modern Vulkan code.

Framebuffer Objects (OpenGL)

Group render targets (textures or renderbuffers) into a framebuffer for off-screen rendering:

glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, tex, 0);

Synchronization

GPU Synchronization Primitives

Primitive	Scope	Use Case
Fence	CPU-GPU	CPU waits for submitted GPU work to complete
Semaphore	Queue-Queue	Order work between queues
Barrier	Within command buffer	Resource transitions, memory visibility
Event	Within/across CBs	Fine-grained GPU-GPU sync

Pipeline Barriers (Vulkan)

Ensure memory operations are visible and resources are in the correct layout:

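// Schematic: VK_ prefixes and unrelated parameters omitted
vkCmdPipelineBarrier(
    srcStageMask = COLOR_ATTACHMENT_OUTPUT,
    dstStageMask = FRAGMENT_SHADER,
    imageMemoryBarrier = {
        srcAccessMask = COLOR_ATTACHMENT_WRITE,  // writes to make available
        dstAccessMask = SHADER_READ,             // reads that must see them
        oldLayout = COLOR_ATTACHMENT_OPTIMAL,
        newLayout = SHADER_READ_ONLY_OPTIMAL
    }
);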

Frames in Flight

Overlap CPU and GPU work by using multiple sets of resources (command buffers, semaphores, fences):

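CPU: records frame N while the GPU is still executing frame N-1
GPU: executes frame N-1, then picks up frame N as soon as it is submitted
A per-frame fence stops the CPU from reusing a resource set the GPU has not finished with.
Typically 2-3 frames in flight for good utilization.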
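The benefit of overlapping can be seen in a small timing model. The sketch below assumes fixed CPU record and GPU execute times per frame, and that the CPU must wait on the fence of the frame that last used the same resource slot. The function and its parameters are hypothetical, and real frame times vary.

```python
def simulate(frame_count, cpu_ms, gpu_ms, frames_in_flight):
    """Return the time at which the last frame finishes on the GPU."""
    gpu_done = []     # fence signal time for each frame
    cpu_free = 0.0    # when the CPU finished recording the previous frame
    for i in range(frame_count):
        start = cpu_free
        if i >= frames_in_flight:
            # Fence wait: this slot was last used by frame i - frames_in_flight
            start = max(start, gpu_done[i - frames_in_flight])
        recorded = start + cpu_ms
        # The GPU runs frames in order and needs the recording to be complete
        gpu_start = max(recorded, gpu_done[-1] if gpu_done else 0.0)
        gpu_done.append(gpu_start + gpu_ms)
        cpu_free = recorded
    return gpu_done[-1]

serial = simulate(4, cpu_ms=10.0, gpu_ms=10.0, frames_in_flight=1)
overlapped = simulate(4, cpu_ms=10.0, gpu_ms=10.0, frames_in_flight=2)
```

With one frame in flight the CPU and GPU take turns (80 ms for four frames in this model); with two, recording and execution overlap and the same work completes in 50 ms. More frames in flight add latency and memory use, which is why the count is usually kept small.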

Memory Management

GPU Memory Types

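Device-local: Fastest GPU access (VRAM); typically not CPU-visible on discrete GPUs, though integrated GPUs and resizable BAR expose heaps that are both device-local and host-visible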
Host-visible: CPU-accessible, used for uploads (may be slower for GPU)
Host-coherent: No explicit flush needed for CPU writes to be visible
Host-cached: CPU reads are fast (for readback)
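In Vulkan these properties are bit flags on each memory type, and the application picks a type that is both allowed for the resource and carries the required flags. The sketch below mirrors that common selection loop in Python. The flag values match the VK_MEMORY_PROPERTY_* bits, while the memory type list is a made-up discrete-GPU layout.

```python
DEVICE_LOCAL = 0x1    # VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
HOST_VISIBLE = 0x2    # VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
HOST_COHERENT = 0x4   # VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
HOST_CACHED = 0x8     # VK_MEMORY_PROPERTY_HOST_CACHED_BIT

def find_memory_type(type_bits, required, memory_types):
    """Return the first allowed type index with all required flags, or None.

    type_bits plays the role of memoryTypeBits from the resource's
    memory requirements: bit i set means memory type i is allowed.
    """
    for index, flags in enumerate(memory_types):
        allowed = type_bits & (1 << index)
        if allowed and (flags & required) == required:
            return index
    return None

# Hypothetical layout: VRAM, upload heap, readback heap
memory_types = [
    DEVICE_LOCAL,
    HOST_VISIBLE | HOST_COHERENT,
    HOST_VISIBLE | HOST_COHERENT | HOST_CACHED,
]

vram = find_memory_type(0b111, DEVICE_LOCAL, memory_types)
staging = find_memory_type(0b111, HOST_VISIBLE | HOST_COHERENT, memory_types)
```

Returning None instead of raising lets the caller fall back to a less demanding flag set, for example dropping HOST_CACHED when no such type exists.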

Buffer Usage Patterns

Staging buffer: Host-visible, used to upload data to device-local memory via transfer commands
Uniform buffer: Small, frequently updated data (matrices, parameters)
Storage buffer: Large, read/write data for compute shaders
Vertex/Index buffer: Geometry data, typically device-local

Descriptor Sets (Vulkan)

Group resource bindings into sets that can be bound together:

Set 0: Per-frame data (camera, lights)      -- bound once per frame
Set 1: Per-material data (textures, params)  -- bound per material
Set 2: Per-object data (model matrix)        -- bound per draw call

Organize sets by update frequency to minimize rebinding.
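The effect of organizing by update frequency can be counted directly. This sketch tallies how many times each set would be bound for a list of draws, assuming set 1 is rebound only when the material changes. The draw list and names are invented for illustration.

```python
def count_binds(draws):
    """draws: list of (material, object_id) pairs in submission order."""
    binds = {"set0": 1, "set1": 0, "set2": 0}   # set 0: once per frame
    current_material = None
    for material, _object_id in draws:
        if material != current_material:
            binds["set1"] += 1                  # material changed: rebind set 1
            current_material = material
        binds["set2"] += 1                      # per-object data: every draw
    return binds

unsorted_draws = [("brick", 0), ("metal", 1), ("brick", 2), ("metal", 3)]
sorted_draws = sorted(unsorted_draws)           # groups draws by material

before = count_binds(unsorted_draws)
after = count_binds(sorted_draws)
```

Sorting by material halves the set 1 binds in this example (4 down to 2) while the per-object binds stay the same.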

GPU Architecture Concepts

Warp/Wavefront: Group of threads executing in lockstep (SIMT); 32 on NVIDIA, 64 on AMD GCN, 32 or 64 on AMD RDNA
Occupancy: Ratio of active warps to maximum warps; higher occupancy helps hide latency
Register pressure: Too many registers per thread reduces occupancy
Shared memory (LDS): Fast on-chip memory shared within a workgroup (typically tens of KB per workgroup; the exact limit is hardware-dependent)
Texture cache: Optimized for 2D spatial locality
Quad-based derivatives: Fragment shaders run in 2x2 quads to compute ddx/ddy for mipmapping; helper invocations fill incomplete quads
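The relationship between register pressure, shared memory, and occupancy can be estimated with simple arithmetic. The sketch below uses made-up per-multiprocessor limits (64 resident warps, 65536 registers, 64 KB of shared memory) and ignores allocation granularity, so it is only a rough approximation of what vendor occupancy tools report.

```python
def occupancy(regs_per_thread, threads_per_group, shared_bytes_per_group=0,
              warp_size=32, max_warps=64, regs_per_sm=65536, shared_per_sm=65536):
    """Estimate the fraction of warp slots that can be resident at once."""
    warps_per_group = -(-threads_per_group // warp_size)   # ceiling division
    by_registers = regs_per_sm // (regs_per_thread * warp_size)
    by_shared = max_warps
    if shared_bytes_per_group > 0:
        by_shared = (shared_per_sm // shared_bytes_per_group) * warps_per_group
    active = min(max_warps, by_registers, by_shared)
    active -= active % warps_per_group    # only whole workgroups are scheduled
    return active / max_warps

light = occupancy(regs_per_thread=32, threads_per_group=256)
heavy = occupancy(regs_per_thread=128, threads_per_group=256)
lds_bound = occupancy(32, 256, shared_bytes_per_group=16384)
```

Under these assumed limits, going from 32 to 128 registers per thread drops occupancy from 100% to 25%, and a 16 KB shared allocation per workgroup caps it at 50% regardless of register use.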