Heterogeneous Computing
Overview
Heterogeneous computing uses multiple types of processors (CPU, GPU, FPGA, ASIC, DSP) within a single system to exploit the strengths of each architecture for different workloads.
Why Heterogeneous?
- End of Dennard scaling: power constraints limit single-core performance
- Dark silicon: not all transistors can be active simultaneously at full frequency
- Domain specialization: purpose-built accelerators achieve 10-1000x efficiency over general-purpose CPUs for specific workloads
- Amdahl's law applied to energy: offloading the dominant fraction of a workload to an efficient specialized unit yields outsized system-level energy savings
CPU + GPU Computing
The most common heterogeneous configuration.
Execution Model
Host (CPU)                         Device (GPU)
1. Allocate device memory
2. Transfer data H->D     ------->
3. Launch kernel          ------->  Execute in parallel
4. Synchronize            <-------
5. Transfer results D->H  <-------
6. Free device memory
Unified Memory (CUDA)
float *data;
cudaMallocManaged(&data, N * sizeof(float)); // accessible from both CPU and GPU
kernel<<<grid, block>>>(data, N);
cudaDeviceSynchronize();
// CPU can now access data directly -- page migration happens automatically
printf("%f\n", data[0]);
cudaFree(data);
The runtime migrates pages between CPU and GPU memory on demand. Simplifies programming but may not match the performance of explicit transfers due to page fault overhead.
PCIe and NVLink Bandwidth
| Interconnect | Bandwidth | Latency |
|-------------|-----------|---------|
| PCIe 4.0 x16 | ~32 GB/s per direction | ~1-2 us |
| PCIe 5.0 x16 | ~64 GB/s per direction | ~1-2 us |
| NVLink 4.0 (per link) | ~50 GB/s (~900 GB/s aggregate per H100 over 18 links) | ~0.5 us |
| CXL 2.0 (x16) | ~64 GB/s | ~100-200 ns |
Data transfer overhead is often the bottleneck. Strategies to mitigate:
- Overlap transfers with computation using streams
- Minimize transfer frequency (batch operations)
- Use pinned (page-locked) memory for higher transfer bandwidth
- Keep data resident on the GPU across kernels, even for moderately parallel stages, to avoid round-trip transfer costs
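Whether offloading pays off at all can be estimated with a simple cost model: the kernel speedup must outweigh the transfer time over the link. A minimal sketch, with all numbers (bandwidth, speedup, transfer sizes) as illustrative assumptions rather than measurements:

```cpp
#include <cassert>

// Time to offload: move `bytes` to the device and back over a link of
// `link_gbps` GB/s, then run the kernel. Assumes equal-sized H->D and
// D->H transfers and ignores launch latency (an assumption).
double offload_time_s(double bytes, double link_gbps, double gpu_compute_s) {
    double transfer_s = 2.0 * bytes / (link_gbps * 1e9); // H->D plus D->H
    return transfer_s + gpu_compute_s;
}

// Offload wins when transfer cost plus accelerated compute beats the CPU.
bool offload_pays_off(double bytes, double link_gbps,
                      double cpu_compute_s, double gpu_speedup) {
    return offload_time_s(bytes, link_gbps, cpu_compute_s / gpu_speedup)
           < cpu_compute_s;
}
```

For a long-running kernel the PCIe cost is amortized easily, but for a tiny workload the fixed transfer time alone can exceed the CPU's total runtime, which is why the batching and residency strategies above matter.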
FPGA Acceleration
Field-Programmable Gate Arrays provide reconfigurable hardware parallelism.
FPGA Architecture
Configurable Logic Blocks (CLBs)
├── Look-Up Tables (LUTs): implement arbitrary boolean functions
├── Flip-Flops: storage elements
└── Carry chains: fast arithmetic
DSP Blocks: multiply-accumulate units
Block RAM (BRAM): distributed on-chip memory (typically 18-36 Kbit blocks)
I/O Blocks: interface to external pins
Routing fabric: programmable interconnect
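The LUT is the key primitive: a k-input LUT is nothing more than a 2^k-entry truth table addressed by its input bits, which is why it can implement any boolean function of those inputs. A small software model of a 4-input LUT (the `Lut4` type and `make_xor4` helper are illustrative, not a vendor API):

```cpp
#include <cassert>
#include <cstdint>

// A 4-input LUT: 16 truth-table entries, one output bit per input combination.
struct Lut4 {
    uint16_t truth; // bit `addr` holds the output for input pattern `addr`
    bool eval(bool a, bool b, bool c, bool d) const {
        unsigned addr = (a << 0) | (b << 1) | (c << 2) | (d << 3);
        return (truth >> addr) & 1;
    }
};

// "Program" the LUT as a 4-input XOR by filling in its truth table:
// output is 1 exactly when an odd number of inputs are 1.
Lut4 make_xor4() {
    uint16_t t = 0;
    for (unsigned addr = 0; addr < 16; ++addr) {
        unsigned ones = 0;
        for (unsigned b = addr; b; b >>= 1) ones += b & 1;
        if (ones & 1) t |= (1u << addr);
    }
    return Lut4{t};
}
```

Reconfiguring the FPGA amounts to loading different truth-table bits (plus routing), which is why the same fabric can implement arbitrary logic.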
Advantages Over GPU/CPU
| Aspect | FPGA Advantage |
|--------|---------------|
| Latency | Sub-microsecond, deterministic |
| Power efficiency | 10-50x better perf/watt for certain workloads |
| Custom data widths | Arbitrary precision (e.g., 12-bit, 3-bit) |
| Custom pipelines | Deep, fully pipelined datapaths |
| I/O flexibility | Direct connection to network, sensors, storage |
Disadvantages
- Lower clock frequency (100-500 MHz vs 1-5 GHz)
- Complex development (hardware design mindset)
- Long compilation times (hours for place-and-route)
- Limited on-chip memory compared to GPU HBM
High-Level Synthesis (HLS)
Compile C/C++ to hardware descriptions, lowering the barrier to FPGA programming.
// Vivado HLS / Vitis HLS example
void vector_add(float *a, float *b, float *c, int n) {
#pragma HLS INTERFACE m_axi port=a offset=slave
#pragma HLS INTERFACE m_axi port=b offset=slave
#pragma HLS INTERFACE m_axi port=c offset=slave
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1  // initiation interval of 1: one new iteration per cycle
        c[i] = a[i] + b[i];
    }
}
Key HLS Pragmas
| Pragma | Effect |
|--------|--------|
| PIPELINE | Pipeline a loop, process new input every II cycles |
| UNROLL | Replicate loop body to execute iterations in parallel |
| ARRAY_PARTITION | Split arrays into sub-arrays for parallel access |
| DATAFLOW | Execute functions concurrently in a pipeline |
| INLINE | Remove function boundary for optimization |
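The pragmas compose: UNROLL replicates the datapath while ARRAY_PARTITION removes the memory-port bottleneck that would otherwise serialize it. A sketch in the same Vitis HLS style as the example above (the exact pragma options vary by tool version); note that a standard C++ compiler simply ignores unknown `#pragma` lines, which is how HLS designs are typically functionally simulated in software first:

```cpp
#include <cassert>

// Fixed-size dot product: fully unrolled inner loop, with both input arrays
// partitioned so the generated hardware can read all elements in parallel.
// Compiled as ordinary C++, the pragmas are no-ops and this runs in software.
float dot8(const float a[8], const float b[8]) {
#pragma HLS ARRAY_PARTITION variable=a complete
#pragma HLS ARRAY_PARTITION variable=b complete
    float sum = 0.0f;
    for (int i = 0; i < 8; i++) {
#pragma HLS UNROLL
        sum += a[i] * b[i];
    }
    return sum;
}
```

Without the partition pragmas, a BRAM with two ports could feed at most two multipliers per cycle regardless of how far the loop is unrolled.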
FPGA Use Cases
- Network packet processing (line-rate at 100+ Gbps)
- Financial trading (sub-microsecond latency)
- Genomics (Smith-Waterman, BLAST)
- Video transcoding and image processing
- Cloud search ranking and DNN serving (Microsoft Catapult/Brainwave)
- Cryptography
ASICs (Application-Specific Integrated Circuits)
Custom silicon designed for a single workload. Maximum performance and efficiency but zero flexibility.
ASIC vs FPGA vs GPU
| Metric | ASIC | FPGA | GPU |
|--------|------|------|-----|
| Performance | Highest | Medium | High |
| Power efficiency | Best | Good | Moderate |
| Flexibility | None | Reconfigurable | Programmable |
| Development cost | $1B+ at leading-edge nodes | ~$1M | ~$0 (software only) |
| Time to market | 12-24 months | Weeks-months | Days |
| Volume economics | Best at scale | Small-medium volume | N/A |
Examples
- Bitcoin mining ASICs (SHA-256)
- Google TPU (matrix multiply)
- Apple Neural Engine
- Video codec chips (H.264/H.265 encoders in every phone)
- Network switch silicon (merchant ASICs such as Broadcom's Tomahawk line)
Domain-Specific Architectures: TPU
Google's Tensor Processing Unit, designed for neural network inference and training.
TPU v4 Architecture
TPU v4 chip:
├── Matrix Multiply Units (MXUs): 128x128 systolic arrays
├── Vector Processing Unit
├── Scalar Unit
├── HBM memory (32 GB)
└── Inter-chip interconnect (ICI)
TPU v4 pod: 4096 chips connected in 3D torus topology
Peak: ~1.1 EXAFLOPS (BF16) per pod
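The pod figure is consistent with the commonly cited ~275 TFLOPS BF16 per-chip peak (that per-chip number is an assumption here, not stated above):

```cpp
#include <cassert>

// Pod peak = per-chip peak x chip count, converted TFLOPS -> exaFLOPS.
double pod_peak_exaflops(double per_chip_tflops, int chips) {
    return per_chip_tflops * 1e12 * chips / 1e18;
}
// 275 TFLOPS x 4096 chips ~= 1.13 exaFLOPS, matching the ~1.1 figure.
```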
Systolic Array
Data flows through a regular grid of processing elements, each performing a multiply-accumulate.
Matrix A flows left-to-right
Matrix B flows top-to-bottom
Results accumulate in place
b0 b1 b2
| | |
a0-[*]-[*]-[*]->
| | |
a1-[*]-[*]-[*]->
| | |
a2-[*]-[*]-[*]->
| | |
v v v
Each [*] computes: acc += a * b, then passes a right and b down
Properties:
- O(n^2) PEs compute O(n^3) multiply-accumulate in O(n) time
- High data reuse: each value read once from memory, used n times
- Simple control: no instruction fetch/decode per PE
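The dataflow above can be modeled cycle by cycle in software. A sketch of an output-stationary n x n array: rows of A enter from the left and columns of B from the top, each delayed ("skewed") by its index so that matching operands meet at PE(i,j), and each PE accumulates in place while forwarding its inputs right and down (the function name and register layout are illustrative):

```cpp
#include <cassert>
#include <vector>

std::vector<std::vector<int>> systolic_matmul(
        const std::vector<std::vector<int>>& A,
        const std::vector<std::vector<int>>& B) {
    int n = A.size();
    std::vector<std::vector<int>> acc(n, std::vector<int>(n, 0));   // per-PE accumulator
    std::vector<std::vector<int>> a_reg(n, std::vector<int>(n, 0)), // a flowing right
                                  b_reg(n, std::vector<int>(n, 0)); // b flowing down
    for (int cycle = 0; cycle < 3 * n; ++cycle) {
        // Shift registers right/down (back to front: one PE per cycle).
        for (int i = 0; i < n; ++i)
            for (int j = n - 1; j > 0; --j) a_reg[i][j] = a_reg[i][j - 1];
        for (int j = 0; j < n; ++j)
            for (int i = n - 1; i > 0; --i) b_reg[i][j] = b_reg[i - 1][j];
        // Inject skewed edge inputs: row i / column i lag by i cycles.
        for (int i = 0; i < n; ++i) {
            int k = cycle - i;
            a_reg[i][0] = (k >= 0 && k < n) ? A[i][k] : 0;
            b_reg[0][i] = (k >= 0 && k < n) ? B[k][i] : 0;
        }
        // Every PE performs one multiply-accumulate per cycle.
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) acc[i][j] += a_reg[i][j] * b_reg[i][j];
    }
    return acc;
}
```

The loop count reflects the O(n) time claim: the last result drains after roughly 3n - 2 cycles, even though n^3 multiply-accumulates were performed.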
TPU Software Stack
- XLA (Accelerated Linear Algebra): compiler that maps TensorFlow/JAX ops to TPU instructions
- JAX: NumPy-like API with automatic differentiation and XLA compilation
- GSPMD: automatic partitioning across TPU pods
oneAPI
Intel's unified programming model for heterogeneous architectures.
Supported Targets
oneAPI application
|
v
DPC++ (Data Parallel C++, based on SYCL)
|
+---> CPU (Intel, AMD)
+---> GPU (Intel, NVIDIA via plugins)
+---> FPGA (Intel)
+---> Other accelerators
DPC++ Example
#include <sycl/sycl.hpp>
using namespace sycl;

constexpr int N = 1024;

int main() {
    queue q(gpu_selector_v);               // select a GPU device
    int *data = malloc_shared<int>(N, q);  // USM allocation, visible to host and device
    for (int i = 0; i < N; i++) data[i] = i;
    q.parallel_for(range<1>(N), [=](id<1> i) {
        data[i] = data[i] * 2;
    }).wait();
    free(data, q);
}
Key Components
| Component | Purpose |
|-----------|---------|
| DPC++ | Core programming language |
| oneMKL | Math kernel library |
| oneDNN | Deep learning primitives |
| oneTBB | Threading building blocks |
| oneDAL | Data analytics |
| Level Zero | Low-level hardware interface |
Memory Coherence in Heterogeneous Systems
Maintaining a consistent view of memory across different processors.
Cache Coherence Challenges
Traditional CPU cache coherence (MESI/MOESI) does not extend to accelerators. Different approaches:
| Approach | Description | Example |
|----------|-------------|---------|
| Discrete memory | Separate address spaces, explicit copies | Traditional GPU (cudaMemcpy) |
| Unified virtual memory | Shared address space, page migration | CUDA UVM, HSA |
| Cache coherent | Hardware coherence across CPU-accelerator | CXL, CCIX, Gen-Z |
CXL (Compute Express Link)
CXL provides cache-coherent interconnect between CPU and accelerators.
CXL Protocol Types:
CXL.io: PCIe-based I/O protocol (discovery, configuration)
CXL.cache: Device caches host memory with coherence
CXL.mem: Host accesses device-attached memory
Type 1 device: CXL.io + CXL.cache (smart NIC, accelerator caching host memory)
Type 2 device: CXL.io + CXL.cache + CXL.mem (GPU/FPGA with own memory)
Type 3 device: CXL.io + CXL.mem (memory expander)
CXL enables fine-grained sharing between CPU and accelerator without explicit data movement, though with higher latency than local cache.
HSA (Heterogeneous System Architecture)
AMD-led standard for unified address space and memory model across CPU and GPU.
- Shared virtual memory with unified page tables
- User-mode dispatch queues (bypass kernel driver)
- Memory-based signaling between agents
- Supported on AMD APUs and some discrete GPUs
Computational Storage
Moving computation to where data resides, reducing data movement.
Architecture
Traditional:
Storage -> Network/Bus -> CPU -> Process -> CPU -> Network/Bus -> Storage
Computational Storage:
Storage -> Process in place -> Results (much smaller) -> CPU
Types
| Type | Description | Example |
|------|-------------|---------|
| Computational Storage Drive (CSD) | Processing logic inside SSD | Samsung SmartSSD (FPGA in SSD) |
| Computational Storage Processor (CSP) | Near-storage processing unit | NVIDIA BlueField DPU |
| In-Storage Computing | Minimal processing in storage controller | Search, filter, decompress |
Use Cases
- Database scan/filter pushdown (skip irrelevant data at the source)
- Compression/decompression at the storage layer
- Pattern matching and regex evaluation
- Data reduction for analytics pipelines
Challenges
- Limited compute capability near storage
- Programming model complexity (what runs where?)
- Standardization (SNIA Computational Storage specification)
- Thermal and power constraints in the storage form factor
Architecture Selection Guidelines
| Workload Characteristic | Best Accelerator |
|------------------------|------------------|
| Massive data parallelism, floating point | GPU |
| Low latency, deterministic timing | FPGA |
| Extremely high volume, fixed algorithm | ASIC |
| Matrix-heavy ML training | TPU / GPU |
| Streaming data at line rate | FPGA / SmartNIC |
| General purpose with some parallelism | Multi-core CPU |
| Data-intensive with minimal compute | Computational storage |
The trend is toward systems-on-chip that integrate multiple accelerator types (CPU + GPU + NPU + DSP) and chiplet-based designs that combine specialized dies via advanced packaging.