Heterogeneous Computing
Overview
Heterogeneous computing uses multiple types of processors (CPU, GPU, FPGA, ASIC, DSP) within a single system to exploit the strengths of each architecture for different workloads.
Why Heterogeneous?
- End of Dennard scaling: power constraints limit single-core performance
- Dark silicon: not all transistors can be active simultaneously at full frequency
- Domain specialization: purpose-built accelerators achieve 10-1000x efficiency over general-purpose CPUs for specific workloads
- Amdahl's law applied to energy: offloading the dominant fraction of a workload to an efficient specialized unit yields outsized system-level energy savings
CPU + GPU Computing
The most common heterogeneous configuration.
Execution Model
Host (CPU)                         Device (GPU)
1. Allocate device memory
2. Transfer data H->D     ------->
3. Launch kernel          ------->  Execute in parallel
4. Synchronize            <-------
5. Transfer results D->H  <-------
6. Free device memory
Unified Memory (CUDA)
float *data;
cudaMallocManaged(&data, N * sizeof(float)); // accessible from both CPU and GPU
kernel<<<grid, block>>>(data, N);
cudaDeviceSynchronize();
// CPU can now access data directly -- page migration happens automatically
printf("%f\n", data[0]);
cudaFree(data);
The runtime migrates pages between CPU and GPU memory on demand. Simplifies programming but may not match the performance of explicit transfers due to page fault overhead.
PCIe and NVLink Bandwidth
| Interconnect | Bandwidth | Latency |
|-------------|-----------|---------|
| PCIe 4.0 x16 | ~32 GB/s per direction | ~1-2 us |
| PCIe 5.0 x16 | ~64 GB/s per direction | ~1-2 us |
| NVLink 4.0 (per link) | ~50 GB/s (~900 GB/s aggregate per H100 over 18 links) | ~0.5 us |
| CXL 2.0 (x16) | ~64 GB/s | ~100-200 ns |
Data transfer overhead is often the bottleneck. Strategies to mitigate:
- Overlap transfers with computation using streams
- Minimize transfer frequency (batch operations)
- Use pinned (page-locked) memory for higher transfer bandwidth
- Keep data resident on the GPU across kernels, even for moderately parallel stages, to avoid round-trip transfer costs
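Whether offloading pays off at all can be estimated with a simple cost model: the kernel speedup must outweigh the transfer time over the link. A minimal sketch, with all numbers (bandwidth, speedup, transfer sizes) as illustrative assumptions rather than measurements:

```cpp
#include <cassert>

// Time to offload: move `bytes` to the device and back over a link of
// `link_gbps` GB/s, then run the kernel. Assumes equal-sized H->D and
// D->H transfers and ignores launch latency (an assumption).
double offload_time_s(double bytes, double link_gbps, double gpu_compute_s) {
    double transfer_s = 2.0 * bytes / (link_gbps * 1e9); // H->D plus D->H
    return transfer_s + gpu_compute_s;
}

// Offload wins when transfer cost plus accelerated compute beats the CPU.
bool offload_pays_off(double bytes, double link_gbps,
                      double cpu_compute_s, double gpu_speedup) {
    return offload_time_s(bytes, link_gbps, cpu_compute_s / gpu_speedup)
           < cpu_compute_s;
}
```

For a long-running kernel the PCIe cost is amortized easily, but for a tiny workload the fixed transfer time alone can exceed the CPU's total runtime, which is why the batching and residency strategies above matter.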
FPGA Acceleration
Field-Programmable Gate Arrays provide reconfigurable hardware parallelism.
FPGA Architecture
Configurable Logic Blocks (CLBs)
├── Look-Up Tables (LUTs): implement arbitrary boolean functions
├── Flip-Flops: storage elements
└── Carry chains: fast arithmetic
DSP Blocks: multiply-accumulate units
Block RAM (BRAM): distributed on-chip memory (typically 18-36 Kbit blocks)
I/O Blocks: interface to external pins
Routing fabric: programmable interconnect
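The LUT is the key primitive: a k-input LUT is nothing more than a 2^k-entry truth table addressed by its input bits, which is why it can implement any boolean function of those inputs. A small software model of a 4-input LUT (the `Lut4` type and `make_xor4` helper are illustrative, not a vendor API):

```cpp
#include <cassert>
#include <cstdint>

// A 4-input LUT: 16 truth-table entries, one output bit per input combination.
struct Lut4 {
    uint16_t truth; // bit `addr` holds the output for input pattern `addr`
    bool eval(bool a, bool b, bool c, bool d) const {
        unsigned addr = (a << 0) | (b << 1) | (c << 2) | (d << 3);
        return (truth >> addr) & 1;
    }
};

// "Program" the LUT as a 4-input XOR by filling in its truth table:
// output is 1 exactly when an odd number of inputs are 1.
Lut4 make_xor4() {
    uint16_t t = 0;
    for (unsigned addr = 0; addr < 16; ++addr) {
        unsigned ones = 0;
        for (unsigned b = addr; b; b >>= 1) ones += b & 1;
        if (ones & 1) t |= (1u << addr);
    }
    return Lut4{t};
}
```

Reconfiguring the FPGA amounts to loading different truth-table bits (plus routing), which is why the same fabric can implement arbitrary logic.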
Advantages Over GPU/CPU
| Aspect | FPGA Advantage |
|--------|---------------|
| Latency | Sub-microsecond, deterministic |
| Power efficiency | 10-50x better perf/watt for certain workloads |
| Custom data widths | Arbitrary precision (e.g., 12-bit, 3-bit) |
| Custom pipelines | Deep, fully pipelined datapaths |
| I/O flexibility | Direct connection to network, sensors, storage |
Disadvantages
- Lower clock frequency (100-500 MHz vs 1-5 GHz)
- Complex development (hardware design mindset)
- Long compilation times (hours for place-and-route)
- Limited on-chip memory compared to GPU HBM
High-Level Synthesis (HLS)
Compile C/C++ to hardware descriptions, lowering the barrier to FPGA programming.
// Vivado HLS / Vitis HLS example
void vector_add(float *a, float *b, float *c, int n) {
#pragma HLS INTERFACE m_axi port=a offset=slave
#pragma HLS INTERFACE m_axi port=b offset=slave
#pragma HLS INTERFACE m_axi port=c offset=slave
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1  // initiation interval of 1: one new iteration per cycle
        c[i] = a[i] + b[i];
    }
}
Key HLS Pragmas
| Pragma | Effect |
|--------|--------|
| PIPELINE | Pipeline a loop, process new input every II cycles |
| UNROLL | Replicate loop body to execute iterations in parallel |
| ARRAY_PARTITION | Split arrays into sub-arrays for parallel access |
| DATAFLOW | Execute functions concurrently in a pipeline |
| INLINE | Remove function boundary for optimization |
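The pragmas compose: UNROLL replicates the datapath while ARRAY_PARTITION removes the memory-port bottleneck that would otherwise serialize it. A sketch in the same Vitis HLS style as the example above (the exact pragma options vary by tool version); note that a standard C++ compiler simply ignores unknown `#pragma` lines, which is how HLS designs are typically functionally simulated in software first:

```cpp
#include <cassert>

// Fixed-size dot product: fully unrolled inner loop, with both input arrays
// partitioned so the generated hardware can read all elements in parallel.
// Compiled as ordinary C++, the pragmas are no-ops and this runs in software.
float dot8(const float a[8], const float b[8]) {
#pragma HLS ARRAY_PARTITION variable=a complete
#pragma HLS ARRAY_PARTITION variable=b complete
    float sum = 0.0f;
    for (int i = 0; i < 8; i++) {
#pragma HLS UNROLL
        sum += a[i] * b[i];
    }
    return sum;
}
```

Without the partition pragmas, a BRAM with two ports could feed at most two multipliers per cycle regardless of how far the loop is unrolled.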
FPGA Use Cases
- Network packet processing (line-rate at 100+ Gbps)
- Financial trading (sub-microsecond latency)
- Genomics (Smith-Waterman, BLAST)
- Video transcoding and image processing
- Cloud search ranking and DNN serving (Microsoft Catapult/Brainwave)
- Cryptography
ASICs (Application-Specific Integrated Circuits)
Custom silicon designed for a single workload. Maximum performance and efficiency but zero flexibility.
ASIC vs FPGA vs GPU
| Metric | ASIC | FPGA | GPU |
|--------|------|------|-----|
| Performance | Highest | Medium | High |
| Power efficiency | Best | Good | Moderate |
| Flexibility | None | Reconfigurable | Programmable |
| Development cost | $1B+ at leading-edge nodes | ~$1M | ~$0 (software only) |
| Time to market | 12-24 months | Weeks-months | Days |
| Volume economics | Best at scale | Small-medium volume | N/A |
Examples
- Bitcoin mining ASICs (SHA-256)
- Google TPU (matrix multiply)
- Apple Neural Engine
- Video codec chips (H.264/H.265 encoders in every phone)
- Network switch silicon (merchant ASICs such as Broadcom's Tomahawk line)
Domain-Specific Architectures: TPU
Google's Tensor Processing Unit, designed for neural network inference and training.
TPU v4 Architecture
TPU v4 chip:
├── Matrix Multiply Units (MXUs): 128x128 systolic arrays
├── Vector Processing Unit
├── Scalar Unit
├── HBM memory (32 GB)
└── Inter-chip interconnect (ICI)
TPU v4 pod: 4096 chips connected in 3D torus topology
Peak: ~1.1 EXAFLOPS (BF16) per pod
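The pod figure is consistent with the commonly cited ~275 TFLOPS BF16 per-chip peak (that per-chip number is an assumption here, not stated above):

```cpp
#include <cassert>

// Pod peak = per-chip peak x chip count, converted TFLOPS -> exaFLOPS.
double pod_peak_exaflops(double per_chip_tflops, int chips) {
    return per_chip_tflops * 1e12 * chips / 1e18;
}
// 275 TFLOPS x 4096 chips ~= 1.13 exaFLOPS, matching the ~1.1 figure.
```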
Systolic Array
Data flows through a regular grid of processing elements, each performing a multiply-accumulate.
Matrix A flows left-to-right
Matrix B flows top-to-bottom
Results accumulate in place
b0 b1 b2
| | |
a0-[*]-[*]-[*]->
| | |
a1-[*]-[*]-[*]->
| | |
a2-[*]-[*]-[*]->
| | |
v v v
Each [*] computes: acc += a * b, then passes a right and b down
Properties:
- O(n^2) PEs compute O(n^3) multiply-accumulate in O(n) time
- High data reuse: each value read once from memory, used n times
- Simple control: no instruction fetch/decode per PE
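The dataflow above can be modeled cycle by cycle in software. A sketch of an output-stationary n x n array: rows of A enter from the left and columns of B from the top, each delayed ("skewed") by its index so that matching operands meet at PE(i,j), and each PE accumulates in place while forwarding its inputs right and down (the function name and register layout are illustrative):

```cpp
#include <cassert>
#include <vector>

std::vector<std::vector<int>> systolic_matmul(
        const std::vector<std::vector<int>>& A,
        const std::vector<std::vector<int>>& B) {
    int n = A.size();
    std::vector<std::vector<int>> acc(n, std::vector<int>(n, 0));   // per-PE accumulator
    std::vector<std::vector<int>> a_reg(n, std::vector<int>(n, 0)), // a flowing right
                                  b_reg(n, std::vector<int>(n, 0)); // b flowing down
    for (int cycle = 0; cycle < 3 * n; ++cycle) {
        // Shift registers right/down (back to front: one PE per cycle).
        for (int i = 0; i < n; ++i)
            for (int j = n - 1; j > 0; --j) a_reg[i][j] = a_reg[i][j - 1];
        for (int j = 0; j < n; ++j)
            for (int i = n - 1; i > 0; --i) b_reg[i][j] = b_reg[i - 1][j];
        // Inject skewed edge inputs: row i / column i lag by i cycles.
        for (int i = 0; i < n; ++i) {
            int k = cycle - i;
            a_reg[i][0] = (k >= 0 && k < n) ? A[i][k] : 0;
            b_reg[0][i] = (k >= 0 && k < n) ? B[k][i] : 0;
        }
        // Every PE performs one multiply-accumulate per cycle.
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) acc[i][j] += a_reg[i][j] * b_reg[i][j];
    }
    return acc;
}
```

The loop count reflects the O(n) time claim: the last result drains after roughly 3n - 2 cycles, even though n^3 multiply-accumulates were performed.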
TPU Software Stack
- XLA (Accelerated Linear Algebra): compiler that maps TensorFlow/JAX ops to TPU instructions
- JAX: NumPy-like API with automatic differentiation and XLA compilation
- GSPMD: automatic partitioning across TPU pods
oneAPI
Intel's unified programming model for heterogeneous architectures.
Supported Targets
oneAPI application
|
v
DPC++ (Data Parallel C++, based on SYCL)
|
+---> CPU (Intel, AMD)
+---> GPU (Intel, NVIDIA via plugins)
+---> FPGA (Intel)
+---> Other accelerators
DPC++ Example
#include <sycl/sycl.hpp>
using namespace sycl;

constexpr int N = 1024;

int main() {
    queue q(gpu_selector_v);               // select a GPU device
    int *data = malloc_shared<int>(N, q);  // USM allocation, visible to host and device
    for (int i = 0; i < N; i++) data[i] = i;
    q.parallel_for(range<1>(N), [=](id<1> i) {
        data[i] = data[i] * 2;
    }).wait();
    free(data, q);
}
Key Components
| Component | Purpose |
|-----------|---------|
| DPC++ | Core programming language |
| oneMKL | Math kernel library |
| oneDNN | Deep learning primitives |
| oneTBB | Threading building blocks |
| oneDAL | Data analytics |
| Level Zero | Low-level hardware interface |
Memory Coherence in Heterogeneous Systems
Maintaining a consistent view of memory across different processors.
Cache Coherence Challenges
Traditional CPU cache coherence (MESI/MOESI) does not extend to accelerators. Different approaches:
| Approach | Description | Example |
|----------|-------------|---------|
| Discrete memory | Separate address spaces, explicit copies | Traditional GPU (cudaMemcpy) |
| Unified virtual memory | Shared address space, page migration | CUDA UVM, HSA |
| Cache coherent | Hardware coherence across CPU-accelerator | CXL, CCIX, Gen-Z |
CXL (Compute Express Link)
CXL provides cache-coherent interconnect between CPU and accelerators.
CXL Protocol Types:
CXL.io: PCIe-based I/O protocol (discovery, configuration)
CXL.cache: Device caches host memory with coherence
CXL.mem: Host accesses device-attached memory
Type 1 device: CXL.io + CXL.cache (smart NIC, accelerator caching host memory)
Type 2 device: CXL.io + CXL.cache + CXL.mem (GPU/FPGA with own memory)
Type 3 device: CXL.io + CXL.mem (memory expander)
CXL enables fine-grained sharing between CPU and accelerator without explicit data movement, though with higher latency than local cache.
HSA (Heterogeneous System Architecture)
AMD-led standard for unified address space and memory model across CPU and GPU.
- Shared virtual memory with unified page tables
- User-mode dispatch queues (bypass kernel driver)
- Memory-based signaling between agents
- Supported on AMD APUs and some discrete GPUs
Computational Storage
Moving computation to where data resides, reducing data movement.
Architecture
Traditional:
Storage -> Network/Bus -> CPU -> Process -> CPU -> Network/Bus -> Storage
Computational Storage:
Storage -> Process in place -> Results (much smaller) -> CPU
Types
| Type | Description | Example |
|------|-------------|---------|
| Computational Storage Drive (CSD) | Processing logic inside SSD | Samsung SmartSSD (FPGA in SSD) |
| Computational Storage Processor (CSP) | Near-storage processing unit | NVIDIA BlueField DPU |
| In-Storage Computing | Minimal processing in storage controller | Search, filter, decompress |
Use Cases
- Database scan/filter pushdown (skip irrelevant data at the source)
- Compression/decompression at the storage layer
- Pattern matching and regex evaluation
- Data reduction for analytics pipelines
Challenges
- Limited compute capability near storage
- Programming model complexity (what runs where?)
- Standardization (SNIA Computational Storage specification)
- Thermal and power constraints in the storage form factor
Architecture Selection Guidelines
| Workload Characteristic | Best Accelerator |
|------------------------|------------------|
| Massive data parallelism, floating point | GPU |
| Low latency, deterministic timing | FPGA |
| Extremely high volume, fixed algorithm | ASIC |
| Matrix-heavy ML training | TPU / GPU |
| Streaming data at line rate | FPGA / SmartNIC |
| General purpose with some parallelism | Multi-core CPU |
| Data-intensive with minimal compute | Computational storage |
The trend is toward systems-on-chip that integrate multiple accelerator types (CPU + GPU + NPU + DSP) and chiplet-based designs that combine specialized dies via advanced packaging.