
Heterogeneous Computing

Overview

Heterogeneous computing uses multiple types of processors (CPU, GPU, FPGA, ASIC, DSP) within a single system to exploit the strengths of each architecture for different workloads.

Why Heterogeneous?

  • End of Dennard scaling: power constraints limit single-core performance
  • Dark silicon: not all transistors can be active simultaneously at full frequency
  • Domain specialization: purpose-built accelerators achieve 10-1000x efficiency over general-purpose CPUs for specific workloads
  • Amdahl's law for power: energy efficiency gains from specialization compound

CPU + GPU Computing

The most common heterogeneous configuration.

Execution Model

Host (CPU)                          Device (GPU)
  1. Allocate device memory
  2. Transfer data H->D        -->
  3. Launch kernel              -->   Execute in parallel
  4. Synchronize                <--
  5. Transfer results D->H     <--
  6. Free device memory

Unified Memory (CUDA)

float *data;
cudaMallocManaged(&data, N * sizeof(float));  // accessible from both CPU and GPU

kernel<<<grid, block>>>(data, N);
cudaDeviceSynchronize();
// CPU can now access data directly -- page migration happens automatically
printf("%f\n", data[0]);

cudaFree(data);

The runtime migrates pages between CPU and GPU memory on demand. This simplifies programming, but it may not match the performance of explicit transfers because of page-fault and migration overhead.

| Interconnect | Bandwidth | Latency |
|--------------|-----------|---------|
| PCIe 4.0 x16 | ~32 GB/s per direction | ~1-2 us |
| PCIe 5.0 x16 | ~64 GB/s per direction | ~1-2 us |
| NVLink 4.0 (per link) | ~100 GB/s | ~0.5 us |
| CXL 2.0 | ~64 GB/s | ~100-200 ns |

Data transfer overhead is often the bottleneck. Strategies to mitigate:

  • Overlap transfers with computation using streams
  • Minimize transfer frequency (batch operations)
  • Use pinned (page-locked) memory for higher transfer bandwidth
  • Keep data resident on the GPU across kernels, even for moderately parallel steps, to avoid round-trip transfer costs

FPGA Acceleration

Field-Programmable Gate Arrays provide reconfigurable hardware parallelism.

FPGA Architecture

Configurable Logic Blocks (CLBs)
 ├── Look-Up Tables (LUTs): implement arbitrary boolean functions
 ├── Flip-Flops: storage elements
 └── Carry chains: fast arithmetic

DSP Blocks: multiply-accumulate units
Block RAM (BRAM): distributed on-chip memory (typically 18-36 Kbit blocks)
I/O Blocks: interface to external pins
Routing fabric: programmable interconnect

Advantages Over GPU/CPU

| Aspect | FPGA Advantage |
|--------|----------------|
| Latency | Sub-microsecond, deterministic |
| Power efficiency | 10-50x better perf/watt for certain workloads |
| Custom data widths | Arbitrary precision (e.g., 12-bit, 3-bit) |
| Custom pipelines | Deep, fully pipelined datapaths |
| I/O flexibility | Direct connection to network, sensors, storage |

Disadvantages

  • Lower clock frequency (100-500 MHz vs 1-5 GHz)
  • Complex development (hardware design mindset)
  • Long compilation times (hours for place-and-route)
  • Limited on-chip memory compared to GPU HBM

High-Level Synthesis (HLS)

Compile C/C++ to hardware descriptions, lowering the barrier to FPGA programming.

// Vivado HLS / Vitis HLS example
void vector_add(float *a, float *b, float *c, int n) {
    #pragma HLS INTERFACE m_axi port=a offset=slave
    #pragma HLS INTERFACE m_axi port=b offset=slave
    #pragma HLS INTERFACE m_axi port=c offset=slave

    for (int i = 0; i < n; i++) {
        #pragma HLS PIPELINE II=1    // initiation interval of 1 cycle
        c[i] = a[i] + b[i];
    }
}

Key HLS Pragmas

| Pragma | Effect |
|--------|--------|
| PIPELINE | Pipeline a loop, processing a new input every II cycles |
| UNROLL | Replicate the loop body to execute iterations in parallel |
| ARRAY_PARTITION | Split arrays into sub-arrays for parallel access |
| DATAFLOW | Execute functions concurrently in a pipeline |
| INLINE | Remove a function boundary for optimization |

FPGA Use Cases

  • Network packet processing (line-rate at 100+ Gbps)
  • Financial trading (sub-microsecond latency)
  • Genomics (Smith-Waterman, BLAST)
  • Video transcoding and image processing
  • Database acceleration (Microsoft Catapult/Brainwave)
  • Cryptography

ASICs (Application-Specific Integrated Circuits)

Custom silicon designed for a single workload. Maximum performance and efficiency but zero flexibility.

ASIC vs FPGA vs GPU

| Metric | ASIC | FPGA | GPU |
|--------|------|------|-----|
| Performance | Highest | Medium | High |
| Power efficiency | Best | Good | Moderate |
| Flexibility | None | Reconfigurable | Programmable |
| Development cost | $10M-$1B+ | $10K-$1M | $0 (software) |
| Time to market | 12-24 months | Weeks-months | Days |
| Volume economics | Best at scale | Small-medium volume | N/A |

Examples

  • Bitcoin mining ASICs (SHA-256)
  • Google TPU (matrix multiply)
  • Apple Neural Engine
  • Video codec chips (H.264/H.265 encoders in every phone)
  • Network switch ASICs (merchant switching silicon)

Domain-Specific Architectures: TPU

Google's Tensor Processing Unit, designed for neural network inference and training.

TPU v4 Architecture

TPU v4 chip:
  ├── Matrix Multiply Units (MXUs): 128x128 systolic arrays
  ├── Vector Processing Unit
  ├── Scalar Unit
  ├── HBM memory (32 GB)
  └── Inter-chip interconnect (ICI)

TPU v4 pod: 4096 chips connected in 3D torus topology
Peak: ~1.1 EXAFLOPS (BF16) per pod

Systolic Array

Data flows through a regular grid of processing elements, each performing a multiply-accumulate.

Matrix A flows left-to-right
Matrix B flows top-to-bottom
Results accumulate in place

   b0  b1  b2
    |   |   |
a0-[*]-[*]-[*]->
    |   |   |
a1-[*]-[*]-[*]->
    |   |   |
a2-[*]-[*]-[*]->
    |   |   |
    v   v   v

Each [*] computes: acc += a * b, then passes a right and b down

Properties:

  • O(n^2) PEs compute O(n^3) multiply-accumulate in O(n) time
  • High data reuse: each value read once from memory, used n times
  • Simple control: no instruction fetch/decode per PE

TPU Software Stack

  • XLA (Accelerated Linear Algebra): compiler that maps TensorFlow/JAX ops to TPU instructions
  • JAX: NumPy-like API with automatic differentiation and XLA compilation
  • GSPMD: automatic partitioning across TPU pods

oneAPI

Intel's unified programming model for heterogeneous architectures.

Supported Targets

oneAPI application
    |
    v
DPC++ (Data Parallel C++, based on SYCL)
    |
    +---> CPU (Intel, AMD)
    +---> GPU (Intel, NVIDIA via plugins)
    +---> FPGA (Intel)
    +---> Other accelerators

DPC++ Example

#include <sycl/sycl.hpp>
using namespace sycl;

int main() {
    constexpr size_t N = 1024;
    queue q(gpu_selector_v);  // select a GPU device

    int *data = malloc_shared<int>(N, q);  // USM allocation, visible to host and device
    for (size_t i = 0; i < N; i++) data[i] = (int)i;  // initialize on the host

    q.parallel_for(range<1>(N), [=](id<1> i) {
        data[i] = data[i] * 2;
    }).wait();

    free(data, q);
}

Key Components

| Component | Purpose |
|-----------|---------|
| DPC++ | Core programming language |
| oneMKL | Math kernel library |
| oneDNN | Deep learning primitives |
| oneTBB | Threading building blocks |
| oneDAL | Data analytics |
| Level Zero | Low-level hardware interface |

Memory Coherence in Heterogeneous Systems

Maintaining a consistent view of memory across different processors.

Cache Coherence Challenges

Traditional CPU cache coherence (MESI/MOESI) does not extend to accelerators. Different approaches:

| Approach | Description | Example |
|----------|-------------|---------|
| Discrete memory | Separate address spaces, explicit copies | Traditional GPU (cudaMemcpy) |
| Unified virtual memory | Shared address space, page migration | CUDA UVM, HSA |
| Cache coherent | Hardware coherence across CPU-accelerator | CXL, CCIX, Gen-Z |

CXL (Compute Express Link)

CXL provides a cache-coherent interconnect between the CPU and accelerators.

CXL Protocol Types:
  CXL.io:     PCIe-based I/O protocol (discovery, configuration)
  CXL.cache:  Device caches host memory with coherence
  CXL.mem:    Host accesses device-attached memory

Type 1 device: CXL.io + CXL.cache (smart NIC, accelerator caching host memory)
Type 2 device: CXL.io + CXL.cache + CXL.mem (GPU/FPGA with own memory)
Type 3 device: CXL.io + CXL.mem (memory expander)

CXL enables fine-grained sharing between CPU and accelerator without explicit data movement, though with higher latency than local cache.

HSA (Heterogeneous System Architecture)

AMD-led standard for unified address space and memory model across CPU and GPU.

  • Shared virtual memory with unified page tables
  • User-mode dispatch queues (bypass kernel driver)
  • Memory-based signaling between agents
  • Supported on AMD APUs and some discrete GPUs

Computational Storage

Moving computation to where data resides, reducing data movement.

Architecture

Traditional:
  Storage -> Network/Bus -> CPU -> Process -> CPU -> Network/Bus -> Storage

Computational Storage:
  Storage -> Process in place -> Results (much smaller) -> CPU

Types

| Type | Description | Example |
|------|-------------|---------|
| Computational Storage Drive (CSD) | Processing logic inside the SSD | Samsung SmartSSD (FPGA in SSD) |
| Computational Storage Processor (CSP) | Near-storage processing unit | NVIDIA BlueField DPU |
| In-Storage Computing | Minimal processing in the storage controller | Search, filter, decompress |

Use Cases

  • Database scan/filter pushdown (skip irrelevant data at the source)
  • Compression/decompression at the storage layer
  • Pattern matching and regex evaluation
  • Data reduction for analytics pipelines

Challenges

  • Limited compute capability near storage
  • Programming model complexity (what runs where?)
  • Standardization (SNIA Computational Storage specification)
  • Thermal and power constraints in the storage form factor

Architecture Selection Guidelines

| Workload Characteristic | Best Accelerator |
|-------------------------|------------------|
| Massive data parallelism, floating point | GPU |
| Low latency, deterministic timing | FPGA |
| Extremely high volume, fixed algorithm | ASIC |
| Matrix-heavy ML training | TPU / GPU |
| Streaming data at line rate | FPGA / SmartNIC |
| General purpose with some parallelism | Multi-core CPU |
| Data-intensive with minimal compute | Computational storage |

The trend is toward systems-on-chip that integrate multiple accelerator types (CPU + GPU + NPU + DSP) and chiplet-based designs that combine specialized dies via advanced packaging.