Heterogeneous Computing

Overview

Heterogeneous computing uses multiple types of processors (CPU, GPU, FPGA, ASIC, DSP) within a single system to exploit the strengths of each architecture for different workloads.

Why Heterogeneous?

  • End of Dennard scaling: power constraints limit single-core performance
  • Dark silicon: not all transistors can be active simultaneously at full frequency
  • Domain specialization: purpose-built accelerators can be 10-1000x more efficient than general-purpose CPUs on the workloads they target
  • Amdahl's law for power: energy efficiency gains from specialization compound

CPU + GPU Computing

The most common heterogeneous configuration.

Execution Model

Host (CPU)                          Device (GPU)
  1. Allocate device memory
  2. Transfer data H->D        -->
  3. Launch kernel              -->   Execute in parallel
  4. Synchronize                <--
  5. Transfer results D->H     <--
  6. Free device memory

Unified Memory (CUDA)

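#include <cstdio>

// Illustrative kernel: doubles each element in place
__global__ void kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20, block = 256, grid = (N + block - 1) / block;
    float *data;
    cudaMallocManaged(&data, N * sizeof(float));  // accessible from both CPU and GPU
    for (int i = 0; i < N; i++) data[i] = 1.0f;   // first touched on the CPU
    kernel<<<grid, block>>>(data, N);             // pages migrate to the GPU on access
    cudaDeviceSynchronize();                      // wait before the CPU reads the results
    printf("%f\n", data[0]);                      // pages migrate back automatically
    cudaFree(data);
}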

The runtime migrates pages between CPU and GPU memory on demand. This simplifies programming, but it may not match the performance of explicit transfers because each on-demand migration incurs page-fault overhead.

Interconnect           Bandwidth                Latency
PCIe 4.0 x16           ~32 GB/s per direction   ~1-2 us
PCIe 5.0 x16           ~64 GB/s per direction   ~1-2 us
NVLink 4.0 (per link)  ~50 GB/s bidirectional   ~0.5 us
CXL 2.0 (x16)          ~64 GB/s per direction   ~100-200 ns

Data transfer overhead is often the bottleneck. Strategies to mitigate:

  • Overlap transfers with computation using streams
  • Minimize transfer frequency (batch operations)
  • Use pinned (page-locked) memory for higher transfer bandwidth
  • Keep moderately parallel steps on the GPU when their data already resides there, rather than paying for a round trip to the host
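
As a rough way to reason about these strategies, the sketch below models a transfer as a fixed latency plus bytes divided by bandwidth, and compares a serial copy-then-compute schedule with a chunked schedule in which copies overlap kernels (as streams allow). The function names and the two-stage pipeline model are assumptions made for illustration; real behavior also depends on pinned memory, driver overhead, and link contention.

```cpp
#include <algorithm>
#include <cassert>

// Seconds to move `bytes` over a link with the given bandwidth (bytes/s)
// and a fixed per-transfer latency (s).
double transfer_seconds(double bytes, double bandwidth_bytes_per_s, double latency_s) {
    assert(bandwidth_bytes_per_s > 0.0);
    return latency_s + bytes / bandwidth_bytes_per_s;
}

// Serial schedule: copy everything, then compute.
double serial_seconds(double copy_s, double compute_s) {
    return copy_s + compute_s;
}

// Chunked schedule: the work is split into `chunks` equal pieces, and the copy
// of chunk k+1 overlaps the compute of chunk k. Simplification: the total copy
// time is divided evenly, so repeated per-chunk latency is ignored.
double overlapped_seconds(double copy_s, double compute_s, int chunks) {
    assert(chunks >= 1);
    const double c = copy_s / chunks;     // copy time per chunk
    const double k = compute_s / chunks;  // compute time per chunk
    // First copy and last compute cannot overlap; in between, the slower
    // stage sets the pace.
    return c + k + (chunks - 1) * std::max(c, k);
}
```

Under this model, a 1 s copy followed by a 1 s kernel takes 2 s serially but 1.25 s when split into 4 overlapped chunks. The gain shrinks when one stage dominates, which is why reducing the transfer volume itself often matters more than overlap.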

FPGA Acceleration

Field-Programmable Gate Arrays provide reconfigurable hardware parallelism.

FPGA Architecture

Configurable Logic Blocks (CLBs)
 └── Look-Up Tables (LUTs): implement arbitrary boolean functions
 └── Flip-Flops: storage elements
 └── Carry chains: fast arithmetic

DSP Blocks: multiply-accumulate units
Block RAM (BRAM): dedicated on-chip memory blocks spread across the fabric (typically 18-36 Kbit each)
I/O Blocks: interface to external pins
Routing fabric: programmable interconnect

Advantages Over GPU/CPU

Aspect              FPGA Advantage
Latency             Sub-microsecond, deterministic
Power efficiency    10-50x better perf/watt for certain workloads
Custom data widths  Arbitrary precision (e.g., 12-bit, 3-bit)
Custom pipelines    Deep, fully pipelined datapaths
I/O flexibility     Direct connection to network, sensors, storage

Disadvantages

  • Lower clock frequency (100-500 MHz vs 1-5 GHz)
  • Complex development (hardware design mindset)
  • Long compilation times (hours for place-and-route)
  • Limited on-chip memory compared to GPU HBM

High-Level Synthesis (HLS)

Compile C/C++ to hardware descriptions, lowering the barrier to FPGA programming.

// Vivado HLS / Vitis HLS example
void vector_add(float *a, float *b, float *c, int n) {
    #pragma HLS INTERFACE m_axi port=a offset=slave
    #pragma HLS INTERFACE m_axi port=b offset=slave
    #pragma HLS INTERFACE m_axi port=c offset=slave

    for (int i = 0; i < n; i++) {
        #pragma HLS PIPELINE II=1    // initiation interval of 1 cycle
        c[i] = a[i] + b[i];
    }
}

Key HLS Pragmas

Pragma           Effect
PIPELINE         Pipeline a loop, process new input every II cycles
UNROLL           Replicate loop body to execute iterations in parallel
ARRAY_PARTITION  Split arrays into sub-arrays for parallel access
DATAFLOW         Execute functions concurrently in a pipeline
INLINE           Remove function boundary for optimization
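
To see why the initiation interval matters, a common first-order estimate for a pipelined loop is cycles = depth + II * (trips - 1), where depth is the latency of one iteration. The helpers below apply that estimate. Their names and the example depth are illustrative assumptions; real figures come from the HLS tool's synthesis report.

```cpp
#include <cassert>
#include <cstdint>

// First-order cycle estimate for a pipelined loop:
// the first iteration takes `depth` cycles, and each later iteration
// completes `ii` cycles after the previous one.
std::uint64_t pipelined_loop_cycles(std::uint64_t trips, std::uint64_t ii, std::uint64_t depth) {
    assert(ii >= 1 && depth >= 1);
    if (trips == 0) return 0;
    return depth + ii * (trips - 1);
}

// Estimate for the same loop without pipelining: iterations run back to back.
std::uint64_t sequential_loop_cycles(std::uint64_t trips, std::uint64_t depth) {
    return trips * depth;
}
```

With an assumed depth of 10 cycles and 1024 iterations, II=1 gives 1033 cycles against 10240 for the unpipelined loop, and II=2 gives 2056. This is why memory ports or loop-carried dependencies that force a larger II are usually the first thing to investigate.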

FPGA Use Cases

  • Network packet processing (line-rate at 100+ Gbps)
  • Financial trading (sub-microsecond latency)
  • Genomics (Smith-Waterman, BLAST)
  • Video transcoding and image processing
  • Database and datacenter acceleration (e.g., Microsoft Catapult for search ranking, Brainwave for DNN inference)
  • Cryptography

ASICs (Application-Specific Integrated Circuits)

Custom silicon designed for a single workload. ASICs deliver maximum performance and efficiency, but their function is fixed at fabrication.

ASIC vs FPGA vs GPU

Metric            ASIC           FPGA                 GPU
Performance       Highest        Medium               High
Power efficiency  Best           Good                 Moderate
Flexibility       None           Reconfigurable       Programmable
Development cost  $10M-$1B+      $10K-$1M             $0 (software)
Time to market    12-24 months   Weeks-months         Days
Volume economics  Best at scale  Small-medium volume  N/A
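
The volume economics row can be made concrete with a break-even calculation: total cost is a one-time development cost (NRE) plus a per-unit cost times volume. The sketch below is a simplified model with hypothetical function names; it ignores respins, yield, and time-to-market effects.

```cpp
#include <cassert>

// Total cost of shipping `units` devices: one-time development (NRE) cost
// plus a per-unit cost.
double total_cost(double nre, double unit_cost, double units) {
    return nre + unit_cost * units;
}

// Volume above which an ASIC is cheaper than an FPGA.
// Returns 0 if the ASIC is never more expensive, and -1 if the ASIC's
// higher NRE can never be recovered (its unit cost is not lower).
double break_even_units(double asic_nre, double asic_unit,
                        double fpga_nre, double fpga_unit) {
    if (asic_nre <= fpga_nre && asic_unit <= fpga_unit) return 0.0;
    if (fpga_unit <= asic_unit) return -1.0;
    if (asic_nre <= fpga_nre) return 0.0;
    return (asic_nre - fpga_nre) / (fpga_unit - asic_unit);
}
```

With made-up inputs of $20M NRE and $10 per unit for the ASIC against $1M NRE and $200 per unit for the FPGA, the break-even point is 100,000 units; below that volume the FPGA is the cheaper choice under this model.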

Examples

  • Bitcoin mining ASICs (SHA-256)
  • Google TPU (matrix multiply)
  • Apple Neural Engine
  • Video codec chips (H.264/H.265 encoders in every phone)
  • Network switches (Memory-centric ASICs)

Domain-Specific Architectures: TPU

Google's Tensor Processing Unit, designed for neural network inference and training.

TPU v4 Architecture

TPU v4 chip:
  └── Matrix Multiply Units (MXUs): 128x128 systolic arrays
  └── Vector Processing Unit
  └── Scalar Unit
  └── HBM memory (32 GB)
  └── Inter-chip interconnect (ICI)

TPU v4 pod: 4096 chips connected in 3D torus topology
Peak: ~1.1 exaFLOPS (BF16) per pod (4096 chips x ~275 TFLOPS each)

Systolic Array

Data flows through a regular grid of processing elements, each performing a multiply-accumulate.

Matrix A flows left-to-right
Matrix B flows top-to-bottom
Results accumulate in place

   b0  b1  b2
    |   |   |
a0-[*]-[*]-[*]->
    |   |   |
a1-[*]-[*]-[*]->
    |   |   |
a2-[*]-[*]-[*]->
    |   |   |
    v   v   v

Each [*] computes: acc += a * b, then passes a right and b down

Properties:

  • O(n^2) PEs complete O(n^3) multiply-accumulates in O(n) time
  • High data reuse: each value read once from memory, used n times
  • Simple control: no instruction fetch/decode per PE
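
The dataflow in the diagram can be checked with a small cycle-by-cycle simulation. The sketch below assumes an output-stationary n x n array with skewed inputs (row i of A enters i cycles late, column j of B enters j cycles late), which is one common arrangement and not necessarily how any particular chip is built. The type and function names are illustrative.

```cpp
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Simulates an n x n output-stationary systolic array computing C = A * B
// for square n x n inputs. Each PE does acc += a * b, then forwards a to its
// right neighbor and b to the neighbor below on the next cycle.
Matrix systolic_matmul(const Matrix& A, const Matrix& B, int* cycles_out = nullptr) {
    const int n = static_cast<int>(A.size());
    assert(n > 0 && static_cast<int>(B.size()) == n);
    Matrix acc(n, std::vector<int>(n, 0));    // per-PE accumulators (the result)
    Matrix a_reg(n, std::vector<int>(n, 0));  // value each PE forwards right
    Matrix b_reg(n, std::vector<int>(n, 0));  // value each PE forwards down
    const int total_cycles = 3 * n - 2;       // last PE finishes at cycle 3n-3
    for (int t = 0; t < total_cycles; ++t) {
        Matrix a_next = a_reg, b_next = b_reg;
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                const int ka = t - i;  // skewed index into row i of A
                const int kb = t - j;  // skewed index into column j of B
                // Edge PEs read from memory (zero outside the valid window);
                // interior PEs read what their neighbor held last cycle.
                const int a_in = (j == 0) ? ((ka >= 0 && ka < n) ? A[i][ka] : 0)
                                          : a_reg[i][j - 1];
                const int b_in = (i == 0) ? ((kb >= 0 && kb < n) ? B[kb][j] : 0)
                                          : b_reg[i - 1][j];
                acc[i][j] += a_in * b_in;
                a_next[i][j] = a_in;
                b_next[i][j] = b_in;
            }
        }
        a_reg = a_next;
        b_reg = b_next;
    }
    if (cycles_out) *cycles_out = total_cycles;
    return acc;
}
```

Each element of A and B is read from memory once, at the array edge, and then reused by n PEs as it moves across the grid. The run takes 3n - 2 cycles, consistent with the O(n) time stated above.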

TPU Software Stack

  • XLA (Accelerated Linear Algebra): compiler that maps TensorFlow/JAX ops to TPU instructions
  • JAX: NumPy-like API with automatic differentiation and XLA compilation
  • GSPMD: automatic partitioning across TPU pods

oneAPI

Intel's unified programming model for heterogeneous architectures.

Supported Targets

oneAPI application
    |
    v
DPC++ (Data Parallel C++, based on SYCL)
    |
    +---> CPU (Intel, AMD)
    +---> GPU (Intel, NVIDIA via plugins)
    +---> FPGA (Intel)
    +---> Other accelerators

DPC++ Example

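#include <sycl/sycl.hpp>
#include <iostream>
using namespace sycl;

constexpr size_t N = 1024;  // element count (illustrative)

int main() {
    queue q(gpu_selector_v);  // select a GPU device; throws if none is available

    int *data = malloc_shared<int>(N, q);  // USM allocation visible to host and device
    for (size_t i = 0; i < N; i++) data[i] = static_cast<int>(i);  // initialize on the host

    q.parallel_for(range<1>(N), [=](id<1> i) {
        data[i] = data[i] * 2;
    }).wait();  // block until the kernel completes

    std::cout << data[1] << "\n";  // read the result directly on the host
    free(data, q);
    return 0;
}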

Key Components

Component   Purpose
DPC++       Core programming language
oneMKL      Math kernel library
oneDNN      Deep learning primitives
oneTBB      Threading building blocks
oneDAL      Data analytics
Level Zero  Low-level hardware interface

Memory Coherence in Heterogeneous Systems

Maintaining a consistent view of memory across different processors.

Cache Coherence Challenges

Traditional CPU cache coherence protocols (MESI/MOESI) do not automatically extend to accelerators attached over an I/O bus such as PCIe. Common approaches:

Approach                Description                                Example
Discrete memory         Separate address spaces, explicit copies   Traditional GPU (cudaMemcpy)
Unified virtual memory  Shared address space, page migration       CUDA UVM, HSA
Cache coherent          Hardware coherence across CPU-accelerator  CXL, CCIX, Gen-Z

CXL provides cache-coherent interconnect between CPU and accelerators.

CXL Protocol Types:
  CXL.io:     PCIe-based I/O protocol (discovery, configuration)
  CXL.cache:  Device caches host memory with coherence
  CXL.mem:    Host accesses device-attached memory

Type 1 device: CXL.io + CXL.cache (smart NIC, accelerator caching host memory)
Type 2 device: CXL.io + CXL.cache + CXL.mem (GPU/FPGA with own memory)
Type 3 device: CXL.io + CXL.mem (memory expander)
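
The mapping from protocol combinations to device types is small enough to express directly. The struct and function below are hypothetical names used only to restate the list above in code; they are not part of any CXL software interface.

```cpp
#include <cassert>

// Which CXL sub-protocols a device implements.
struct CxlProtocols {
    bool io;     // CXL.io: discovery and configuration, required by every device
    bool cache;  // CXL.cache: device coherently caches host memory
    bool mem;    // CXL.mem: host accesses device-attached memory
};

// Returns 1, 2, or 3 for the device types listed above,
// or 0 for a combination that does not match any of them.
int cxl_device_type(const CxlProtocols& p) {
    if (!p.io) return 0;             // CXL.io is mandatory
    if (p.cache && p.mem) return 2;  // accelerator with its own memory
    if (p.cache) return 1;           // caching device without exposed memory
    if (p.mem) return 3;             // memory expander
    return 0;
}
```

For example, a memory expander that implements CXL.io and CXL.mem but not CXL.cache classifies as Type 3.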

CXL enables fine-grained sharing between CPU and accelerator without explicit data movement, though with higher latency than local cache.

HSA (Heterogeneous System Architecture)

AMD-led standard for unified address space and memory model across CPU and GPU.

  • Shared virtual memory with unified page tables
  • User-mode dispatch queues (bypass kernel driver)
  • Memory-based signaling between agents
  • Supported on AMD APUs and some discrete GPUs

Computational Storage

Moving computation to where data resides, reducing data movement.

Architecture

Traditional:
  Storage -> Network/Bus -> CPU -> Process -> CPU -> Network/Bus -> Storage

Computational Storage:
  Storage -> Process in place -> Results (much smaller) -> CPU

Types

Type                                   Description                               Example
Computational Storage Drive (CSD)      Processing logic inside SSD               Samsung SmartSSD (FPGA in SSD)
Computational Storage Processor (CSP)  Near-storage processing unit              NVIDIA BlueField DPU
In-Storage Computing                   Minimal processing in storage controller  Search, filter, decompress

Use Cases

  • Database scan/filter pushdown (skip irrelevant data at the source)
  • Compression/decompression at the storage layer
  • Pattern matching and regex evaluation
  • Data reduction for analytics pipelines
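
The benefit of filter pushdown comes down to how many bytes still have to cross the storage-to-host link. The sketch below is a back-of-the-envelope model with illustrative function names; it assumes the in-storage filter keeps up with the media and ignores the drive's own processing time.

```cpp
#include <cassert>

// Bytes that cross the storage-to-host link when the filter runs in the drive.
// `selectivity` is the fraction of the data that survives the filter (0..1).
// Without pushdown, all of `table_bytes` would cross the link.
double pushdown_bytes(double table_bytes, double selectivity) {
    assert(selectivity >= 0.0 && selectivity <= 1.0);
    return table_bytes * selectivity;
}

// Seconds spent on the link for a given byte count and bandwidth (bytes/s).
double link_seconds(double bytes, double link_bytes_per_s) {
    assert(link_bytes_per_s > 0.0);
    return bytes / link_bytes_per_s;
}
```

If a filter keeps a quarter of the rows, the link carries a quarter of the bytes and, under this model, spends a quarter of the time. The more selective the query, the larger the saving.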

Challenges

  • Limited compute capability near storage
  • Programming model complexity (what runs where?)
  • Standardization (SNIA Computational Storage specification)
  • Thermal and power constraints in the storage form factor

Architecture Selection Guidelines

Workload Characteristic                   Best Accelerator
Massive data parallelism, floating point  GPU
Low latency, deterministic timing         FPGA
Extremely high volume, fixed algorithm    ASIC
Matrix-heavy ML training                  TPU / GPU
Streaming data at line rate               FPGA / SmartNIC
General purpose with some parallelism     Multi-core CPU
Data-intensive with minimal compute       Computational storage

The trend is toward systems-on-chip that integrate multiple accelerator types (CPU + GPU + NPU + DSP) and chiplet-based designs that combine specialized dies via advanced packaging.