GPU Programming
Overview
Modern GPUs are massively parallel processors with thousands of cores optimized for the graphics pipeline and general-purpose computation. Programming them requires understanding shader stages, graphics APIs, memory hierarchies, and synchronization primitives.
Shader Types
Vertex Shader
Runs once per vertex. Transforms positions, computes per-vertex data.
// GLSL example (vertex shader)
#version 330 core
layout(location = 0) in vec3 aPosition;
layout(location = 1) in vec3 aNormal;
uniform mat4 uMVP;
uniform mat3 uNormalMatrix;
out vec3 vNormal;
void main() {
gl_Position = uMVP * vec4(aPosition, 1.0);
vNormal = normalize(uNormalMatrix * aNormal);
}
Inputs: vertex attributes from buffers. Outputs: clip-space position and interpolants for the fragment shader.
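As a rough CPU-side illustration of what the vertex stage computes, the Python sketch below multiplies a position by a 4x4 matrix and then applies the perspective divide that fixed-function hardware performs after the shader. The function names, the list-of-rows matrix layout, and the `MVP` values are assumptions made for this example, not part of any API.

```python
def mat4_mul_vec4(m, v):
    """Multiply a 4x4 matrix (list of 4 rows) by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def vertex_stage(mvp, position):
    """Mimic `gl_Position = uMVP * vec4(aPosition, 1.0)`; returns clip-space coordinates."""
    x, y, z = position
    return mat4_mul_vec4(mvp, [x, y, z, 1.0])

def perspective_divide(clip):
    """Fixed-function step after the vertex stage: clip space -> normalized device coordinates."""
    x, y, z, w = clip
    return [x / w, y / w, z / w]

# Hypothetical matrix: shifts x by 2 and sets w to 2 (not a real camera matrix).
MVP = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
]

clip = vertex_stage(MVP, (1.0, 2.0, 3.0))
ndc = perspective_divide(clip)
```

The shader itself only produces the clip-space value; the divide by w is shown here to make clear why the output has four components.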
Fragment (Pixel) Shader
Runs once per fragment (candidate pixel). Computes final color.
#version 330 core
in vec3 vNormal;
out vec4 fragColor;
void main() {
vec3 lightDir = normalize(vec3(1.0, 1.0, 1.0));
float NdotL = max(dot(normalize(vNormal), lightDir), 0.0);
fragColor = vec4(vec3(NdotL), 1.0);
}
Has access to interpolated vertex outputs, textures, and uniforms. Can discard fragments or write to multiple render targets (MRT).
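The diffuse term in the fragment shader above can be checked on the CPU. This is a minimal sketch of the same N·L computation in Python; the helper names are assumptions for illustration, and the default light direction simply mirrors the hard-coded vector in the shader.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]

def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))

def lambert(normal, light_dir=(1.0, 1.0, 1.0)):
    """Clamped cosine between the surface normal and the light direction."""
    n = normalize(normal)
    l = normalize(light_dir)
    return max(dot(n, l), 0.0)

facing = lambert([1.0, 1.0, 1.0])     # normal points at the light
away = lambert([-1.0, -1.0, -1.0])    # normal points away; clamped to 0.0
```

Re-normalizing the interpolated normal matters because linear interpolation between unit vectors generally yields a shorter vector.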
Geometry Shader
Operates on complete primitives (points, lines, triangles). Can emit zero or more new primitives. Use cases:
- Billboard generation from point sprites
- Wireframe overlay
- Shadow volume extrusion
- Layered rendering (cubemap faces in one pass)
Performance caveat: often a bottleneck due to variable output and poor parallelism. Prefer compute shaders or mesh shaders when possible.
Tessellation Shaders
Two stages that subdivide patches into finer geometry:
Tessellation Control Shader (TCS / Hull Shader):
- Runs once per output control point
- Sets tessellation levels (inner and outer)
- Passes patch data to the tessellation engine
Tessellation Evaluation Shader (TES / Domain Shader):
- Runs per generated vertex
- Receives tessellation coordinates from the tessellator (barycentric for triangle patches, (u, v) for quads and isolines)
- Evaluates the surface position (Bezier, PN-Triangles, displacement)
Input Patch --> TCS --> Fixed-Function Tessellator --> TES --> Output Vertices
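To make the TES role concrete, the sketch below evaluates a flat triangle patch at barycentric coordinates. It is a simplification: the coordinate generator uses a single uniform level, whereas a real tessellator combines separate inner and outer levels and produces a different vertex pattern. Both function names are hypothetical.

```python
def uniform_barycentric_coords(level):
    """All (u, v, w) with u + v + w = 1 on a regular grid of the given level."""
    coords = []
    for i in range(level + 1):
        for j in range(level + 1 - i):
            k = level - i - j
            coords.append((i / level, j / level, k / level))
    return coords

def evaluate_triangle_patch(p0, p1, p2, bary):
    """TES-style evaluation: weight the three control points by the barycentric coordinate."""
    u, v, w = bary
    return tuple(u * a + v * b + w * c for a, b, c in zip(p0, p1, p2))

# Subdivide one triangle at level 2 and evaluate every generated vertex.
corners = ((0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.0, 0.0))
vertices = [evaluate_triangle_patch(*corners, b) for b in uniform_barycentric_coords(2)]
```

A curved surface (Bezier, PN-Triangles) or a displacement map would replace the linear blend inside `evaluate_triangle_patch`.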
Compute Shader
General-purpose GPU computation outside the graphics pipeline. Organized in workgroups with shared memory.
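#version 430 core
layout(local_size_x = 256) in;
layout(std430, binding = 0) buffer Data { float values[]; };
shared float cache[256];
void main() {
    uint id = gl_GlobalInvocationID.x;
    // Guard the load: the last workgroup may extend past the end of the buffer
    cache[gl_LocalInvocationID.x] = (id < uint(values.length())) ? values[id] : 0.0;
    barrier(); // every invocation has written its slot before any reads cache
    // parallel reduction, filtering, simulation, etc.
}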
Dispatch model: a dispatch call (glDispatchCompute in OpenGL, vkCmdDispatch in Vulkan) launches num_groups_x * num_groups_y * num_groups_z workgroups; each group runs local_size_x * local_size_y * local_size_z invocations.
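The dispatch arithmetic and the "parallel reduction" mentioned in the shader comment can be simulated on the CPU. The sketch below is an assumption-laden model rather than GPU code: it computes the group count by ceiling division, reconstructs the global invocation ID, and sums each group's slice with a tree reduction over a list standing in for shared memory. It assumes a power-of-two local size; all function names are hypothetical.

```python
LOCAL_SIZE = 256  # mirrors local_size_x in the shader above

def num_groups(element_count, local_size=LOCAL_SIZE):
    """Ceiling division: enough workgroups to cover every element."""
    return (element_count + local_size - 1) // local_size

def global_invocation_id(group_id, local_id, local_size=LOCAL_SIZE):
    """gl_GlobalInvocationID.x = gl_WorkGroupID.x * local_size_x + gl_LocalInvocationID.x"""
    return group_id * local_size + local_id

def workgroup_reduce(values, group_id, local_size=LOCAL_SIZE):
    """Simulate one workgroup summing its slice with a shared-memory tree reduction."""
    # Load phase: one element per invocation; out-of-range invocations load 0.
    cache = []
    for local_id in range(local_size):
        gid = global_invocation_id(group_id, local_id, local_size)
        cache.append(values[gid] if gid < len(values) else 0.0)
    # Reduction phase: halve the active range each step.
    # On a GPU a barrier() would separate consecutive steps.
    stride = local_size // 2
    while stride > 0:
        for local_id in range(stride):
            cache[local_id] += cache[local_id + stride]
        stride //= 2
    return cache[0]

def gpu_style_sum(values, local_size=LOCAL_SIZE):
    """Sum the per-group partial results, as a second pass or the CPU would."""
    groups = num_groups(len(values), local_size)
    return sum(workgroup_reduce(values, g, local_size) for g in range(groups))
```

Within one step the simulated invocations read only slots at or above `stride` and write only slots below it, which is why running them sequentially gives the same result as running them in parallel.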
Mesh Shaders (Modern)
Replace vertex + geometry + tessellation stages with two programmable stages:
- Task (Amplification) Shader: Optional, determines how many mesh shader workgroups to launch (culling, LOD)
- Mesh Shader: Outputs vertices and primitives directly from a workgroup, with shared memory
Benefits: workgroup-level parallelism, explicit output, GPU-driven culling. Supported in Vulkan (VK_EXT_mesh_shader), DirectX 12, and Metal 3.
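Mesh shaders consume geometry that has been pre-split into small clusters, usually called meshlets. The following Python sketch shows one naive way to build them, a greedy pass over the index buffer; production tools use smarter, locality-aware strategies. The function name, the dictionary layout, and the default limits of 64 vertices and 126 triangles are illustrative assumptions; real limits come from the API and hardware.

```python
def build_meshlets(indices, max_vertices=64, max_triangles=126):
    """Greedily pack triangles into meshlets with bounded vertex and triangle counts."""
    meshlets = []
    vertices = []    # unique global vertex indices in the current meshlet
    local_of = {}    # global vertex index -> local index within the meshlet
    triangles = []   # triangles expressed in local indices

    for t in range(0, len(indices), 3):
        tri = indices[t:t + 3]
        new_vertices = {v for v in tri if v not in local_of}
        too_many_vertices = len(vertices) + len(new_vertices) > max_vertices
        too_many_triangles = len(triangles) + 1 > max_triangles
        if triangles and (too_many_vertices or too_many_triangles):
            # Current meshlet is full: emit it and start a fresh one.
            meshlets.append({"vertices": vertices, "triangles": triangles})
            vertices, local_of, triangles = [], {}, []
        for v in tri:
            if v not in local_of:
                local_of[v] = len(vertices)
                vertices.append(v)
        triangles.append(tuple(local_of[v] for v in tri))

    if triangles:
        meshlets.append({"vertices": vertices, "triangles": triangles})
    return meshlets

# Two triangles sharing an edge (a quad).
quad = [0, 1, 2, 2, 1, 3]
one = build_meshlets(quad, max_vertices=4, max_triangles=2)
two = build_meshlets(quad, max_vertices=3, max_triangles=2)
```

Each meshlet maps naturally onto one mesh shader workgroup, and a task shader can cull whole meshlets before they are launched.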
Graphics APIs
OpenGL / OpenGL ES
- Cross-platform, mature, large ecosystem
- Global state machine model
- Single-threaded command submission
- GLSL shading language
- OpenGL ES for mobile/embedded
- Being superseded by Vulkan but still widely used in education and tools
Vulkan
- Low-overhead, explicit GPU control
- Multi-threaded command buffer recording
- Explicit memory management, synchronization, and resource transitions
- SPIR-V binary shader format
- Render passes with subpass dependencies
- Descriptor sets for resource binding
- Cross-platform (Windows, Linux, Android, macOS via MoltenVK)
DirectX 12
- Microsoft's low-level API (Windows, Xbox)
- Similar philosophy to Vulkan: explicit control, multi-threaded
- HLSL compiled to DXIL
- Root signatures for resource binding
- Command lists and command queues
- DirectX Raytracing (DXR) for hardware RT
Metal
- Apple's GPU API (macOS, iOS, visionOS)
- Moderate abstraction level between OpenGL and Vulkan
- Metal Shading Language (C++-based)
- Argument buffers for bindless-style access
- Tile shading for on-chip memory on Apple GPUs
- Mesh shaders and ray tracing in Metal 3
WebGPU
- Modern web standard designed as the successor to WebGL
- Designed to map to Vulkan, Metal, and D3D12
- WGSL shading language
- Explicit resource binding and pipeline creation
- Available in browsers (Chrome, Firefox) and native (wgpu, Dawn)
SPIR-V
Standard Portable Intermediate Representation for Vulkan and OpenCL. Binary format that decouples shading language from driver compilation.
Source (GLSL/HLSL) --> Compiler (glslc/DXC) --> SPIR-V --> Driver --> GPU ISA
Benefits:
- Offline compilation catches errors early
- Moves parsing and front-end validation out of the driver, reducing driver complexity and shortening (not eliminating) shader compilation stalls
- Enables cross-compilation between shading languages
- Reflection data for automatic descriptor layout generation
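Because SPIR-V is a simple stream of 32-bit words, its header is easy to inspect. The sketch below reads the five header words (magic number, version, generator, ID bound, reserved schema word) using Python's `struct` module. The function name and the returned dictionary shape are choices made for this example.

```python
import struct

SPIRV_MAGIC = 0x07230203

def parse_spirv_header(data):
    """Decode the five-word SPIR-V module header from raw bytes."""
    if len(data) < 20:
        raise ValueError("SPIR-V module must be at least 5 words (20 bytes)")
    # The magic number reads correctly in only one byte order, so try both.
    for byte_order in ("<", ">"):
        magic, version, generator, bound, schema = struct.unpack(
            byte_order + "5I", data[:20]
        )
        if magic == SPIRV_MAGIC:
            return {
                # Version word layout: 0x00 | major | minor | 0x00
                "version": ((version >> 16) & 0xFF, (version >> 8) & 0xFF),
                "generator": generator,
                "bound": bound,   # all result IDs in the module are below this value
                "schema": schema,
            }
    raise ValueError("not a SPIR-V module: bad magic number")

# Hand-built header for illustration: version 1.3, ID bound 42.
example_header = struct.pack("<5I", SPIRV_MAGIC, 0x00010300, 0, 42, 0)
info = parse_spirv_header(example_header)
```

Reflection tools walk the instruction stream that follows this header to recover bindings and descriptor layouts.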
Render Passes
Vulkan Render Pass
Defines the structure of a rendering operation: attachments (color, depth, resolve), subpasses, and dependencies.
Render Pass:
  Attachment 0: Color (RGBA8, loadOp = CLEAR, storeOp = STORE)
  Attachment 1: Depth (D32F, loadOp = CLEAR, storeOp = DONT_CARE)
  Subpass 0: attachment 0 as color attachment, attachment 1 as depth attachment
Load/store operations control whether attachments are cleared, loaded, or discarded, which is critical for tile-based GPUs that can avoid off-chip memory access.
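A back-of-the-envelope estimate shows why these operations matter. The sketch below uses an idealized model of a tile-based GPU in which only a LOAD load-op reads from memory and only a STORE store-op writes to it. Real hardware adds compression and other effects, so treat the numbers as an upper-level intuition only. The function and table names are hypothetical.

```python
BYTES_PER_PIXEL = {"RGBA8": 4, "D32F": 4}

def attachment_traffic_bytes(width, height, fmt, load_op, store_op):
    """Estimate off-chip bytes moved for one attachment in one render pass."""
    size = width * height * BYTES_PER_PIXEL[fmt]
    read = size if load_op == "LOAD" else 0      # CLEAR and DONT_CARE need no read
    write = size if store_op == "STORE" else 0   # DONT_CARE lets tile contents be dropped
    return read + write

# 1080p versions of the two attachments described above.
color = attachment_traffic_bytes(1920, 1080, "RGBA8", "CLEAR", "STORE")
depth = attachment_traffic_bytes(1920, 1080, "D32F", "CLEAR", "DONT_CARE")
depth_if_kept = attachment_traffic_bytes(1920, 1080, "D32F", "LOAD", "STORE")
```

Under this model the depth attachment costs no external bandwidth at all when it is cleared on load and discarded on store, compared with roughly 16.6 MB per frame when it is both loaded and stored.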
Dynamic rendering (VK_KHR_dynamic_rendering, core in Vulkan 1.3): Simplifies render pass setup by removing the need for framebuffer and render pass objects. Often preferred in modern Vulkan code that does not rely on multiple subpasses.
Framebuffer Objects (OpenGL)
Group render targets (textures or renderbuffers) into a framebuffer for off-screen rendering:
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, tex, 0);
Synchronization
GPU Synchronization Primitives
| Primitive | Scope              | Use Case                                |
|-----------|--------------------|-----------------------------------------|
| Fence     | CPU-GPU            | Wait for GPU work to complete on CPU    |
| Semaphore | Queue-Queue        | Order work between queues               |
| Barrier   | Within command buf | Resource transitions, memory visibility |
| Event     | Within/across CBs  | Fine-grained GPU-GPU sync               |
Pipeline Barriers (Vulkan)
Ensure memory operations are visible and resources are in the correct layout:
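// Abbreviated pseudocode: VK_ enum prefixes and unrelated parameters omitted
vkCmdPipelineBarrier(
    srcStageMask = COLOR_ATTACHMENT_OUTPUT,
    dstStageMask = FRAGMENT_SHADER,
    imageMemoryBarrier = {
        srcAccessMask = COLOR_ATTACHMENT_WRITE,  // writes that must be made available
        dstAccessMask = SHADER_READ,             // reads that must see those writes
        oldLayout = COLOR_ATTACHMENT_OPTIMAL,
        newLayout = SHADER_READ_ONLY_OPTIMAL
    }
);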
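The layout-transition half of a barrier behaves like a small state machine, which the toy model below tries to capture. It is not Vulkan code: the class and function names are hypothetical, layouts are plain strings, and only the rule that `oldLayout` must match the image's current layout (or be UNDEFINED, which discards contents) is modeled.

```python
class TrackedImage:
    """Toy stand-in for an image whose current layout is tracked on the CPU."""
    def __init__(self, layout="UNDEFINED"):
        self.layout = layout

def pipeline_barrier(image, old_layout, new_layout):
    """Apply a layout transition, rejecting a mismatched old layout."""
    if old_layout != "UNDEFINED" and image.layout != old_layout:
        raise ValueError(
            f"barrier expects {old_layout}, but image is in {image.layout}"
        )
    image.layout = new_layout

def can_sample(image):
    """Sampling in a shader expects the read-only layout in this simplified model."""
    return image.layout == "SHADER_READ_ONLY_OPTIMAL"

# Render to the image, then transition it so a later pass can sample it.
target = TrackedImage("COLOR_ATTACHMENT_OPTIMAL")
sampled_too_early = can_sample(target)
pipeline_barrier(target, "COLOR_ATTACHMENT_OPTIMAL", "SHADER_READ_ONLY_OPTIMAL")
sampled_after_barrier = can_sample(target)
```

Vulkan validation layers perform a far more complete version of this kind of tracking, including stage and access masks.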
Frames in Flight
Overlap CPU and GPU work by using multiple sets of resources (command buffers, semaphores, fences):
Frame N:   CPU records commands while the GPU executes frame N-1
Frame N+1: CPU records commands while the GPU executes frame N
(shown for two frames in flight)
Typically 2-3 frames in flight for good utilization.
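The bookkeeping behind frames in flight is mostly modular arithmetic. The sketch below, with hypothetical names, shows which resource slot each frame uses and which earlier frame's fence the CPU would need to wait on before reusing that slot.

```python
MAX_FRAMES_IN_FLIGHT = 2

def frame_schedule(frame_count, max_in_flight=MAX_FRAMES_IN_FLIGHT):
    """Return (frame, slot, waits_on) tuples for a run of frames.

    slot     -- index of the per-frame resource set (command buffer, fence, ...)
    waits_on -- earlier frame whose fence must signal before the slot is reused,
                or None if the slot has not been used yet
    """
    schedule = []
    for frame in range(frame_count):
        slot = frame % max_in_flight
        waits_on = frame - max_in_flight if frame >= max_in_flight else None
        schedule.append((frame, slot, waits_on))
    return schedule

plan = frame_schedule(4)
```

With two slots, the CPU can record frame N while the GPU is still executing frame N-1, and only blocks when it would otherwise overwrite resources still in use.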
Memory Management
GPU Memory Types
- Device-local: Fastest GPU access (VRAM); on discrete GPUs usually not CPU-visible, except where Resizable BAR or unified memory exposes it
- Host-visible: CPU-accessible, used for uploads (may be slower for GPU)
- Host-coherent: No explicit flush needed for CPU writes to be visible
- Host-cached: CPU reads are fast (for readback)
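In Vulkan these properties are bit flags attached to each memory type, and choosing a type is a small search. The sketch below models that search in Python. The flag values match `VkMemoryPropertyFlagBits`, but the function name and the list used to represent a device's memory types are assumptions for illustration.

```python
DEVICE_LOCAL = 0x1
HOST_VISIBLE = 0x2
HOST_COHERENT = 0x4
HOST_CACHED = 0x8

def find_memory_type(memory_types, type_bits, required):
    """Return the first memory type index that is allowed and has all required flags.

    memory_types -- property flags of each memory type, in index order
    type_bits    -- bitmask of type indices the resource may use
                    (memoryTypeBits from the resource's memory requirements)
    required     -- property flags the caller needs
    """
    for index, flags in enumerate(memory_types):
        allowed = type_bits & (1 << index)
        if allowed and (flags & required) == required:
            return index
    return None

# Hypothetical discrete-GPU layout: VRAM, upload heap, readback heap.
types = [
    DEVICE_LOCAL,
    HOST_VISIBLE | HOST_COHERENT,
    HOST_VISIBLE | HOST_COHERENT | HOST_CACHED,
]
staging = find_memory_type(types, 0b111, HOST_VISIBLE | HOST_COHERENT)
readback = find_memory_type(types, 0b111, HOST_VISIBLE | HOST_CACHED)
```

Returning the first match relies on the driver's ordering of memory types; a more defensive allocator would also rank candidates by preferred flags.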
Buffer Usage Patterns
- Staging buffer: Host-visible, used to upload data to device-local memory via transfer commands
- Uniform buffer: Small, frequently updated data (matrices, parameters)
- Storage buffer: Large, read/write data for compute shaders
- Vertex/Index buffer: Geometry data, typically device-local
Descriptor Sets (Vulkan)
Group resource bindings into sets that can be bound together:
Set 0: Per-frame data (camera, lights) -- bound once per frame
Set 1: Per-material data (textures, params) -- bound per material
Set 2: Per-object data (model matrix) -- bound per draw call
Organize sets by update frequency to minimize rebinding.
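The payoff of frequency-based sets is easiest to see by counting binds. The sketch below is a simplified model with hypothetical names: set 0 is bound once per frame, set 1 is rebound whenever the material changes between consecutive draws, and set 2 is bound for every draw.

```python
def count_set_binds(draws):
    """Count descriptor set binds for a list of (material, object) draws in order."""
    binds = {"set0": 1, "set1": 0, "set2": 0}  # set 0: once per frame
    current_material = None
    for material, _obj in draws:
        if material != current_material:
            binds["set1"] += 1                 # material changed: rebind set 1
            current_material = material
        binds["set2"] += 1                     # per-object data: every draw
    return binds

draws = [("brick", 1), ("metal", 2), ("brick", 3), ("metal", 4)]
unsorted_binds = count_set_binds(draws)
sorted_binds = count_set_binds(sorted(draws))  # group draws by material first
```

Sorting draws by material halves the set 1 binds in this small case; the per-object set is unaffected, which is why it sits at the highest set index.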
GPU Architecture Concepts
- Warp/Wavefront: Group of threads executing in lockstep (SIMT); 32 wide on NVIDIA, 64 on AMD GCN, 32 or 64 on AMD RDNA
- Occupancy: Ratio of active warps to maximum warps; higher occupancy helps hide latency
- Register pressure: Too many registers per thread reduces occupancy
- Shared memory (LDS): Fast on-chip memory shared within a workgroup (typically tens of KB per workgroup; the exact limit is hardware-dependent)
- Texture cache: Optimized for 2D spatial locality
- Quad-based derivatives: Fragment shaders run in 2x2 quads to compute ddx/ddy for mipmapping; helper invocations fill incomplete quads
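The link between register pressure and occupancy can be worked through numerically. The sketch below uses illustrative hardware numbers (a 65,536-entry register file and a 64-warp limit per multiprocessor), which are assumptions and not a statement about any specific GPU. It also ignores allocation granularity and shared-memory limits, both of which affect real occupancy.

```python
def occupancy(regs_per_thread, warp_size=32, regs_per_sm=65536, max_warps_per_sm=64):
    """Fraction of the warp slots that can be resident, limited only by registers."""
    regs_per_warp = regs_per_thread * warp_size
    warps_by_registers = regs_per_sm // regs_per_warp
    active_warps = min(max_warps_per_sm, warps_by_registers)
    return active_warps / max_warps_per_sm

light_kernel = occupancy(32)    # 1024 registers per warp, all 64 warps fit
heavy_kernel = occupancy(128)   # 4096 registers per warp, only 16 warps fit
```

Lower occupancy is not automatically slower: a kernel with few memory stalls can run well with fewer resident warps, so the figure is best read as latency-hiding headroom.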