Modern Architectures

This page surveys the architectures powering today's computers — from smartphones to data centers — and the trends shaping future designs.

ARM Architecture

ARM Overview

ARM (originally Acorn RISC Machine, later Advanced RISC Machines) dominates mobile and embedded computing, and is expanding into laptops and servers.

Key principles: Power efficiency, simplicity, licensable IP (ARM licenses designs to chip makers).

big.LITTLE and DynamIQ

big.LITTLE: Pairs high-performance "big" cores with power-efficient "little" cores. OS schedules workloads to appropriate cores.

[Big core 0] [Big core 1] ←→ Shared L3 ←→ [Little core 0] [Little core 1]
  High perf    High perf                      Power efficient   Power efficient

DynamIQ: The more flexible successor. Mixes any combination of big and little cores within a single cluster, up to 8 cores of different types.
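
The scheduling idea can be shown with a toy model (core performance and power numbers below are invented for illustration, not measurements of any real SoC):

```python
# Toy model of big.LITTLE-style scheduling (illustrative numbers only).
# Heavy tasks go to "big" cores, light tasks to "little" cores,
# trading peak performance against energy.

BIG_PERF, BIG_POWER = 4.0, 2.0        # hypothetical work/sec and watts
LITTLE_PERF, LITTLE_POWER = 1.0, 0.25

def schedule(tasks, heavy_threshold=2.0):
    """Assign each task (units of work) to a core type by demand."""
    return {name: ("big" if work >= heavy_threshold else "little")
            for name, work in tasks.items()}

def energy(tasks, placement):
    """Energy = power * (work / perf) for each task, summed."""
    total = 0.0
    for name, work in tasks.items():
        if placement[name] == "big":
            total += BIG_POWER * (work / BIG_PERF)
        else:
            total += LITTLE_POWER * (work / LITTLE_PERF)
    return total

tasks = {"video_encode": 8.0, "mail_sync": 0.5, "ui_anim": 1.0}
placement = schedule(tasks)
```

Running everything on big cores would finish the light tasks faster but spend more joules on them; the hybrid placement keeps the heavy encode fast while the background work sips power.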

ARM Cortex families:

  • Cortex-A (Application): High performance. Smartphones, tablets, laptops, servers. (A510, A710, A715, X3, X4)
  • Cortex-R (Real-time): Deterministic timing. Automotive, storage controllers.
  • Cortex-M (Microcontroller): Ultra-low power. IoT, embedded. (M0, M3, M4, M7, M33, M55)

ARM SVE/SVE2 (Scalable Vector Extension)

Vector ISA with scalable vector length (128 to 2048 bits). Software doesn't need to know the vector width — it adapts at runtime.

Per-lane predication: Each element can be individually masked. Handles loop tails without scalar cleanup.

Designed for HPC. Used in the Fugaku supercomputer (Fujitsu's ARM-based A64FX, 512-bit SVE).
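
Per-lane predication can be modeled in a few lines (a conceptual sketch: `vl=4` is just an illustrative width; real SVE hardware supplies the vector length, and `whilelt`-style instructions build the predicate):

```python
# Model of a predicated, vector-length-agnostic loop (as in SVE/RVV).
# A boolean predicate masks inactive lanes, so the loop tail needs
# no scalar cleanup code. vl stands in for the hardware vector length.

def predicated_add(a, b, vl=4):
    """c[i] = a[i] + b[i], processed vl lanes at a time under a predicate."""
    n = len(a)
    c = [0] * n
    i = 0
    while i < n:
        # whilelt-style predicate: a lane is active iff i + lane < n
        pred = [i + lane < n for lane in range(vl)]
        for lane in range(vl):
            if pred[lane]:                 # masked-off lanes do nothing
                c[i + lane] = a[i + lane] + b[i + lane]
        i += vl
    return c
```

Note that the same function works unchanged for any `vl` — that is the "software doesn't need to know the vector width" property.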

Neoverse (Server ARM)

ARM cores designed for data centers:

  • Neoverse N-series: Balanced performance/efficiency. AWS Graviton2, Ampere Altra (Neoverse N1).
  • Neoverse V-series: Maximum performance. AWS Graviton3 (V1), Graviton4 (V2).

AWS Graviton processors demonstrate ARM's competitiveness in the cloud: AWS cites up to ~40% better price-performance than comparable x86 instances for many workloads.

Apple Silicon

Architecture

Apple's custom ARM-based SoCs integrate CPU, GPU, Neural Engine, and memory on one chip.

M-series (M1, M2, M3, M4):

  • Unified Memory Architecture (UMA): CPU, GPU, and NPU share the same memory pool. No copying between CPU and GPU memory.
  • Performance cores (P-cores): Wide OoO (8-wide decode, 600+ entry ROB). Among the widest cores ever built.
  • Efficiency cores (E-cores): Narrower, lower power. Handle background tasks.
  • Neural Engine: 16-core dedicated ML accelerator. ~38 TOPS (M4).
  • GPU: Custom architecture. Tile-based deferred rendering. Up to 40 cores (M4 Max).
  • Media Engine: Dedicated encode/decode (H.264, HEVC, ProRes, AV1).

Key Innovations

Wide execution: P-cores decode 8 instructions/cycle, execute up to 9 μops/cycle. This is wider than most x86 cores.

Large reorder buffer: 600+ entries (M3/M4). Extracts more ILP.

Low memory latency: UMA with fast on-package memory (LPDDR5). No PCIe bottleneck for GPU.

Power efficiency: 5nm/3nm process. Low power consumption enables fanless laptops with desktop-class performance.

Intel / AMD x86

Intel Architecture Evolution

| Generation | Process | Key Features |
|---|---|---|
| Skylake (2015) | 14nm | Foundation of years of refinements |
| Ice Lake (2019) | 10nm | AVX-512, deeper buffers |
| Alder Lake (2021) | Intel 7 | Hybrid (P+E cores), Thread Director |
| Raptor Lake (2022) | Intel 7 | More E-cores, higher clocks |
| Meteor Lake (2023) | Intel 4 | Chiplet design, NPU |
| Lunar Lake (2024) | TSMC N3 (compute tile) | Larger NPU, improved E-cores |

AMD Architecture

| Generation | Process | Key Features |
|---|---|---|
| Zen (2017) | 14nm | AMD's comeback. Competitive IPC |
| Zen 2 (2019) | 7nm | Chiplet design (CCD + IOD) |
| Zen 3 (2020) | 7nm | Unified L3, +19% IPC |
| Zen 4 (2022) | 5nm | AVX-512, DDR5, +13% IPC |
| Zen 5 (2024) | 4/3nm | Wider front end, 2× AI throughput |

Hybrid Architectures

Intel (Alder Lake onward), ARM (big.LITTLE/DynamIQ), and Apple Silicon all combine heterogeneous core types:

Advantages: Run background tasks on E-cores (low power), burst to P-cores for demanding work. Better performance-per-watt.

Challenges: OS scheduler must understand core types. Thread Director (Intel) provides hardware hints. Some workloads suffer from core migration.

Domain-Specific Accelerators

TPU (Tensor Processing Unit) — Google

Custom ASIC for ML inference and training.

Architecture: Systolic array for matrix multiplication. Massive parallelism for matrix ops.

| Version | Year | Performance |
|---|---|---|
| TPU v1 | 2016 | 92 TOPS (INT8, inference only) |
| TPU v2 | 2017 | 45 TFLOPS (BF16, training) |
| TPU v3 | 2018 | 123 TFLOPS (BF16) |
| TPU v4 | 2021 | 275 TFLOPS (BF16) |
| TPU v5e | 2023 | Cost-optimized for inference |
| TPU v6 (Trillium) | 2024 | 4.7× v5e performance |

Systolic array: Data flows through a grid of processing elements. Each PE performs a multiply-accumulate and passes data to the next. Very efficient for matrix multiplication.
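
The dataflow can be sketched as a small simulation (a toy output-stationary model: at cycle t, PE (i, j) sees the operand pair with k = t - i - j, mimicking the skewed arrival of A from the left and B from the top; real arrays add pipelining and quantization details this omits):

```python
def systolic_matmul(A, B):
    """Toy model of an output-stationary systolic array computing C = A @ B.

    PE (i, j) accumulates one element of C. Operands arrive skewed in
    time: at cycle t, PE (i, j) multiplies A[i][k] and B[k][j] where
    k = t - i - j, then (implicitly) forwards them to its neighbors.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    # 3n - 2 cycles: enough for the last operands to reach PE (n-1, n-1).
    for t in range(3 * n - 2):
        for i in range(n):
            for j in range(n):
                k = t - i - j          # which operand pair is at PE (i, j) now
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C
```

Each PE only ever does one multiply-accumulate per cycle and talks to its immediate neighbors, which is why the structure maps so well to dense silicon.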

NPU (Neural Processing Unit)

Dedicated ML accelerators integrated into SoCs:

  • Apple Neural Engine: 16 cores, 38 TOPS (M4)
  • Qualcomm Hexagon: In Snapdragon chips
  • Intel NPU: In Meteor Lake+
  • AMD XDNA: Ryzen AI

Use cases: On-device inference (image classification, NLP, voice recognition) without burning CPU/GPU power.

Other Accelerators

  • Network processors: Programmable packet processing (Broadcom, Marvell, Intel IPU)
  • Cryptographic accelerators: AES-NI (x86), crypto extensions (ARM)
  • Video codecs: Fixed-function encode/decode (H.264, HEVC, AV1)
  • Compression: QAT (Intel Quick Assist Technology) for zlib/zstd acceleration

RISC-V Ecosystem

Current State

Commercial cores: SiFive (P670, X280), Andes, Alibaba T-Head (XuanTie C910).

Application processors: Emerging in Android devices, automotive, IoT.

Data center: Early stages, though membership in the RISC-V International consortium is growing rapidly.

Strengths: Open ISA (no licensing fees), customizable (add custom instructions), growing ecosystem.

Challenges: Software ecosystem still maturing. Performance gap vs ARM/x86 for high-end applications. Fragmentation risk from custom extensions.

RISC-V Vectors (RVV)

Scalable vector extension (inspired by ARM SVE). Vector length agnostic — same binary works on different hardware with different vector register widths.

Chiplet Design

Motivation

As transistors shrink, large monolithic dies become expensive (yield drops with die size) and inflexible (one size fits all).
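
The yield argument can be made concrete with the standard Poisson defect model, where yield falls exponentially with die area (the defect density and die areas below are illustrative, not any foundry's numbers):

```python
import math

def die_yield(defect_density, area_cm2):
    """Poisson yield model: probability a die has zero defects."""
    return math.exp(-defect_density * area_cm2)

D = 0.2  # defects per cm^2 (hypothetical)

# One 6 cm^2 monolithic die vs three 2 cm^2 chiplets.
monolithic = die_yield(D, 6.0)    # e^-1.2, roughly 30%
per_chiplet = die_yield(D, 2.0)   # e^-0.4, roughly 67%
```

Note a subtlety the model exposes: the product of three chiplet yields equals the monolithic yield exactly. The economic win comes from testing chiplets *before* assembly (known-good die), so defective small dies are discarded cheaply instead of scrapping a whole large die, plus better wafer-edge utilization and the ability to put I/O on an older, cheaper node.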

Approach

Decompose the SoC into smaller chiplets connected via advanced packaging:

┌──────────┬──────────┬──────────┐
│  CPU     │  CPU     │  I/O     │
│  Chiplet │  Chiplet │  Chiplet │
│  (5nm)   │  (5nm)   │  (12nm)  │
└────┬─────┴────┬─────┴────┬─────┘
     │          │          │
     └──────────┴──────────┘
           Silicon Interposer
           or Organic Substrate

Advantages:

  • Higher yield (smaller dies)
  • Mix different process nodes (CPU on 5nm, I/O on 12nm)
  • Modularity (mix and match chiplets for different products)
  • Cost reduction

Examples: AMD Zen 2+ (CCD + IOD), Intel Meteor Lake (CPU + GPU + SoC + I/O tiles), Apple M1 Ultra (two M1 Max connected via UltraFusion).

Interconnect Technologies

  • EMIB (Intel): Embedded Multi-die Interconnect Bridge. Short silicon bridges between adjacent chiplets.
  • Foveros (Intel): 3D stacking. Stack chiplets on top of each other with through-silicon vias (TSVs).
  • UltraFusion (Apple): Die-to-die connection with 2.5 TB/s bandwidth.
  • Infinity Fabric (AMD): Scalable interconnect between chiplets.
  • UCIe: Universal Chiplet Interconnect Express. Industry standard for chiplet-to-chiplet communication.

3D Stacking

Stack multiple layers of transistors or dies vertically:

  • 3D NAND: Stack 100+ layers of flash memory cells. TLC/QLC.
  • HBM: Stack DRAM dies (4-12 high) on a silicon interposer.
  • Foveros: Stack logic dies (CPU on top of base die).
  • Hybrid bonding: Direct copper-to-copper connections between dies. Very high density (~10 μm pitch).

Benefits: Shorter wires (less latency, less power), higher bandwidth, smaller footprint.

Challenges: Heat dissipation (interior layers have no direct cooling), manufacturing complexity, testing.

Process Technology

Leading-edge nodes: 3nm (2022), 2nm (~2025), 1.4nm (2027+). Gate-All-Around (GAA) transistors replace FinFETs around the 2nm node (Samsung introduced GAA already at 3nm).

Specialization

General-purpose performance gains are slowing. Future performance comes from domain-specific accelerators: tensor cores, ray tracing, video codec, cryptography, compression, network processing.

Energy Efficiency

Power is the primary constraint. Voltage scaling is ending. Innovations: near-threshold computing, dynamic voltage/frequency scaling, power gating, heterogeneous cores.
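
Why voltage is the most valuable lever follows from the classic dynamic-power relation P ≈ C·V²·f, sketched below (the capacitance and operating points are made-up illustrative values):

```python
def dynamic_power(c_eff, voltage, freq_hz):
    """Dynamic CMOS switching power: P = C_eff * V^2 * f."""
    return c_eff * voltage ** 2 * freq_hz

C_EFF = 1e-9  # effective switched capacitance in farads (hypothetical)

p_high = dynamic_power(C_EFF, 1.0, 3.0e9)   # 3 GHz at 1.0 V
p_low  = dynamic_power(C_EFF, 0.7, 1.5e9)   # 1.5 GHz at 0.7 V

# Halving frequency (and dropping voltage with it) cuts power ~4x,
# because the V^2 term compounds with the frequency reduction.
```

This quadratic dependence on voltage is exactly why the end of voltage scaling bites: once V can no longer drop with each node, power per transistor stops falling with frequency headroom.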

New Computing Paradigms

  • Neuromorphic: Mimic brain structure (Intel Loihi, IBM TrueNorth). Event-driven, low power.
  • Quantum: Fundamentally different computing model. Covered in quantum computing topic.
  • Processing-in-Memory (PIM): Compute where the data lives. Reduces data movement (the dominant energy cost).
  • Optical computing: Use photons for computation and interconnect. Very high bandwidth.
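
The PIM motivation — that data movement, not arithmetic, dominates energy — can be sketched with order-of-magnitude figures (the per-operation energies below are rough illustrative values in the spirit of widely cited estimates, not measurements of any specific system):

```python
# Rough energy accounting for summing a large array, using
# order-of-magnitude costs: arithmetic is ~pJ-scale, while hauling a
# word over the off-chip memory bus costs hundreds of pJ.

E_ADD_PJ = 1.0      # energy per add (picojoules, rough)
E_DRAM_PJ = 500.0   # per word fetched over the external bus (rough)
E_LOCAL_PJ = 50.0   # per word read by compute sitting next to the DRAM arrays

def sum_energy_pj(n_words, fetch_pj):
    """Total energy to fetch and add n words."""
    return n_words * (fetch_pj + E_ADD_PJ)

conventional = sum_energy_pj(1_000_000, E_DRAM_PJ)  # data crosses the bus
pim = sum_energy_pj(1_000_000, E_LOCAL_PJ)          # compute moved to the data
```

Even with these crude numbers the arithmetic is a rounding error; nearly all the energy goes into moving bytes, which is the case PIM tries to eliminate.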

Applications in CS

  • System design: Understanding architectures helps choose the right hardware for workloads (ARM for power-sensitive, x86 for compatibility, GPU for parallel compute).
  • Software optimization: Architecture-aware code (SIMD intrinsics, cache-friendly layouts, NUMA-aware allocation) can give 10-100× speedups.
  • Cloud computing: Instance types map to architectures (ARM: Graviton, GPU: A100/H100, FPGA: F1). Cost-performance depends on architecture match.
  • Mobile development: ARM cores, NPU, GPU — understanding the SoC helps optimize apps (use Neural Engine for ML, GPU for graphics).
  • ML infrastructure: TPU vs GPU vs CPU choice depends on model type, batch size, and latency requirements.
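
The "cache-friendly layouts" point above can be illustrated for row-major storage (a conceptual sketch: it shows the address patterns the two loop orders generate; the actual speedup comes from cache lines and prefetchers, which this model only hints at):

```python
# Why traversal order matters on row-major layouts (C, NumPy default).
# Element (i, j) lives at flat index i * ncols + j, so iterating j in
# the inner loop touches consecutive addresses (cache-friendly), while
# iterating i in the inner loop strides through memory by ncols.

def flat_index(i, j, ncols):
    return i * ncols + j

def row_major_order(nrows, ncols):
    """Flat indices visited by the cache-friendly loop nest (j inner)."""
    return [flat_index(i, j, ncols) for i in range(nrows) for j in range(ncols)]

def column_major_order(nrows, ncols):
    """Flat indices visited by the strided loop nest (i inner)."""
    return [flat_index(i, j, ncols) for j in range(ncols) for i in range(nrows)]
```

For a 2×3 array the first order visits addresses 0,1,2,3,4,5 while the second jumps 0,3,1,4,2,5; on large arrays that stride defeats cache lines and hardware prefetch, which is where the order-of-magnitude slowdowns come from.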