Multiprocessor Systems

As single-core performance gains slowed (end of Dennard scaling, ~2005), the industry shifted to multi-core processors. Understanding multiprocessor systems is essential for writing efficient parallel software.

Symmetric Multiprocessing (SMP)

All processors are identical and share the same memory. Any processor can run any process. The OS treats all processors equally.

┌──────┐  ┌──────┐  ┌──────┐  ┌──────┐
│ CPU0 │  │ CPU1 │  │ CPU2 │  │ CPU3 │
│  L1  │  │  L1  │  │  L1  │  │  L1  │
└──┬───┘  └──┬───┘  └──┬───┘  └──┬───┘
   │         │         │         │
   └─────────┴─────┬───┴─────────┘
                   │
            ┌──────┴──────┐
            │  Shared L3  │
            └──────┬──────┘
                   │
            ┌──────┴──────┐
            │ Main Memory │
            └─────────────┘

Cache Coherence

The Problem

When multiple processors cache the same memory location, writes by one processor can make other caches' copies stale.

CPU0 cache: X = 5
CPU1 cache: X = 5
CPU0 writes X = 7  → CPU0 cache: X = 7, CPU1 cache: X = 5 (stale!)
CPU1 reads X       → Gets 5 (WRONG!)

Cache coherence protocols ensure all processors see a consistent view.

Coherence Invariants

  1. Single Writer: At any time, either one cache has a modified (exclusive) copy, or multiple caches have read-only copies.
  2. Data Value: A read returns the most recently written value.

Snooping Protocols

Each cache monitors (snoops) the shared bus for transactions by other caches.

MSI Protocol

Three states per cache line:

  • M (Modified): Exclusive, dirty. Only copy. Must write back before others can read.
  • S (Shared): Clean copy. Other caches may also have copies.
  • I (Invalid): Not valid. Must fetch from memory or another cache.

Transitions (on CPU read/write and bus snoops):

| Event | I → | S → | M → |
|---|---|---|---|
| CPU Read | S (bus read) | S | M |
| CPU Write | M (bus read-invalid) | M (bus invalidate) | M |
| Bus Read (snoop) | I | S | S (write back, downgrade) |
| Bus Read-Invalid (snoop) | I | I (invalidate) | I (write back, invalidate) |
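The transition table can be encoded as a small pure function. This is a sketch with illustrative enum and function names, not a real hardware interface:

```c
#include <assert.h>

/* Sketch of the MSI transition table as a pure function.
   State and event names are illustrative, chosen to mirror the table. */
typedef enum { I, S, M } msi_state;
typedef enum { CPU_READ, CPU_WRITE, BUS_READ, BUS_READ_INVALID } msi_event;

msi_state msi_next(msi_state s, msi_event e) {
    switch (e) {
    case CPU_READ:         return (s == I) ? S : s;  /* I→S issues a bus read */
    case CPU_WRITE:        return M;                 /* I→M: bus read-invalid; S→M: bus invalidate */
    case BUS_READ:         return (s == M) ? S : s;  /* M→S with write-back */
    case BUS_READ_INVALID: return I;                 /* another cache gains exclusive access */
    }
    return s;
}
```

A real controller would also issue the bus transactions and write-backs noted in the comments; only the state change is modeled here.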

MESI Protocol

Adds E (Exclusive) state: exclusive and clean. No bus transaction needed to transition E → M (silent upgrade). Reduces bus traffic for private data.

  • M: Modified (exclusive, dirty)
  • E: Exclusive (exclusive, clean)
  • S: Shared (clean, others may have copies)
  • I: Invalid

Most common protocol in real processors.

MOESI Protocol

Adds O (Owned) state: the line is dirty yet may also be present in other caches (Shared). The owning cache, not memory, supplies the data on requests, avoiding a memory write-back when a dirty line is shared.

Used by AMD processors.

Directory-Based Protocols

For systems with many processors where bus snooping doesn't scale (the bus becomes a bottleneck).

Directory: A centralized (or distributed) structure that tracks which caches have each memory block.

Memory block X:
  Directory entry: {state: Shared, sharers: {CPU0, CPU3}}

On a write, the directory sends invalidations only to caches that have copies — no broadcast needed.

Scalability: a write triggers messages only to the caches that actually share the line (often just a few), whereas snooping broadcasts every transaction to all n caches. Better for large systems (NUMA servers, 100+ cores).
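The directory's bookkeeping can be sketched as a sharer bitmask per memory block. Names and the 64-sharer width here are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal sketch of a directory entry: one bit per cache.
   bit i set => cache i holds a copy of the block. */
typedef struct {
    uint64_t sharers;
} dir_entry;

/* On a write by `writer`: return the set of caches that must be
   invalidated (only the actual sharers, no broadcast), and update
   the entry so that only the writer holds the line. */
uint64_t dir_write(dir_entry *e, int writer) {
    uint64_t invalidate = e->sharers & ~(1ULL << writer);
    e->sharers = 1ULL << writer;
    return invalidate;
}
```

For the example entry above — sharers {CPU0, CPU3} — a write by CPU0 yields an invalidation set containing only CPU3.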

False Sharing

When two unrelated variables share the same cache line, writes to one invalidate the other's cache line — causing unnecessary coherence traffic.

// BAD: x and y likely in the same cache line
STRUCTURE SharedData
    x : 64-bit integer  // Written by thread 0
    y : 64-bit integer  // Written by thread 1 — false sharing!

// GOOD: Pad to separate cache lines (64 bytes)
STRUCTURE SharedData
    x : 64-bit integer
    _pad : 56 bytes of padding
    y : 64-bit integer
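In C11, the same layout can be expressed with `_Alignas` instead of hand-counted padding — a sketch assuming 64-byte cache lines:

```c
#include <assert.h>
#include <stddef.h>

/* BAD: x and y land in the same 64-byte cache line. */
struct shared_bad {
    long x;   /* written by thread 0 */
    long y;   /* written by thread 1 — false sharing */
};

/* GOOD: _Alignas(64) forces each field onto its own cache line,
   with the compiler inserting the padding. */
struct shared_good {
    _Alignas(64) long x;  /* line 0 */
    _Alignas(64) long y;  /* line 1 */
};
```

`offsetof` confirms the layout: in `shared_bad`, `y` sits 8 bytes after `x`; in `shared_good`, it starts at a 64-byte boundary.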

False sharing can degrade parallel performance by 10-100×. Detected with performance counters (high coherence misses).

Memory Consistency Models

Coherence guarantees that all processors agree on the order of writes to a single location. Consistency defines the order in which writes to different locations become visible.

Sequential Consistency (SC)

Lamport (1979): The result of any execution is the same as if all operations were executed in some sequential order, and within each processor, operations appear in program order.

Simple mental model: a global switch connecting one processor at a time to memory.

Problem: Very restrictive. Prevents many hardware optimizations (store buffers, write combining, out-of-order stores).

Total Store Order (TSO) — x86

Each processor has a store buffer: stores are queued and drain to memory in program order, while loads check the store buffer first (store forwarding) and otherwise read memory directly.

Consequence: A store by processor A may still sit in the buffer when A's next load (to a different address) completes, so other processors can observe the load before the store. But stores from processor A become visible to all processors in order.

Allowed reordering: Store → Load (store may be delayed, load may pass it). Disallowed: Store → Store, Load → Load, Load → Store (within same processor).

x86 uses TSO. Most C/C++/Rust code "just works" on TSO without explicit fences.
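The classic store-buffer litmus test makes this concrete. A sketch with C11 atomics and pthreads: with sequentially consistent operations, the outcome r0 == 0 && r1 == 0 is forbidden; with plain stores on x86, store buffering makes it possible:

```c
#include <pthread.h>
#include <stdatomic.h>

/* Store-buffer (Dekker) litmus test. atomic_store/atomic_load default
   to seq_cst, so at least one thread must observe the other's store. */
static atomic_int x, y;
static int r0, r1;

static void *t0(void *arg) { (void)arg; atomic_store(&x, 1); r0 = atomic_load(&y); return 0; }
static void *t1(void *arg) { (void)arg; atomic_store(&y, 1); r1 = atomic_load(&x); return 0; }

/* Runs the test `iters` times; returns how often the forbidden
   outcome r0 == 0 && r1 == 0 appeared (always 0 under seq_cst). */
int run_litmus(int iters) {
    int forbidden = 0;
    for (int i = 0; i < iters; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_t a, b;
        pthread_create(&a, 0, t0, 0);
        pthread_create(&b, 0, t1, 0);
        pthread_join(a, 0);
        pthread_join(b, 0);
        if (r0 == 0 && r1 == 0)
            forbidden++;
    }
    return forbidden;
}
```

Replacing the operations with `memory_order_relaxed` (or plain non-atomic accesses) lets the hardware exhibit the Store → Load reordering that TSO permits.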

Relaxed Memory Models (ARM, POWER)

Allow more reorderings:

  • Store → Store may be reordered
  • Load → Load may be reordered
  • Load → Store may be reordered

Require explicit memory barriers (fences) to enforce ordering.

ARM: Weak by default. DMB (Data Memory Barrier), DSB (Data Synchronization Barrier), ISB (Instruction Synchronization Barrier).

Relaxed models enable more hardware optimizations but require careful programming with fences.

Interconnection Networks

How processors, caches, and memory modules are connected.

Bus

Shared medium. All nodes see all transactions. Simple but doesn't scale (bandwidth shared, contention).

Crossbar

Full connectivity between N processors and M memory modules. N × M switches. Non-blocking but O(N²) complexity.

Mesh / Torus

2D grid of nodes with nearest-neighbor connections. Scalable. Used in many-core processors (Intel Xeon Phi, research chips).

  • Mesh: Nodes on the edges have fewer neighbors.
  • Torus: Wrap-around connections (reduces diameter).

Routing: XY routing (go in X direction first, then Y). Deadlock-free for meshes.
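XY routing can be sketched as a function that emits the visited nodes, always finishing the X dimension before turning to Y (function and parameter names are illustrative):

```c
#include <assert.h>

/* XY routing on a 2D mesh: move fully along X, then along Y.
   Writes the visited coordinates into path_x/path_y (caller-sized)
   and returns the number of nodes visited, source and destination
   included. The fixed X-then-Y order is what makes it deadlock-free. */
int xy_route(int x1, int y1, int x2, int y2,
             int path_x[], int path_y[]) {
    int n = 0;
    path_x[n] = x1; path_y[n] = y1; n++;
    while (x1 != x2) {                       /* X dimension first */
        x1 += (x2 > x1) ? 1 : -1;
        path_x[n] = x1; path_y[n] = y1; n++;
    }
    while (y1 != y2) {                       /* then Y */
        y1 += (y2 > y1) ? 1 : -1;
        path_x[n] = x1; path_y[n] = y1; n++;
    }
    return n;
}
```

For example, routing from (0,0) to (2,1) visits (0,0) → (1,0) → (2,0) → (2,1): the Manhattan distance plus one node.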

Ring

Nodes connected in a circle. Simple and low cost, but bandwidth is shared around the ring and latency grows with node count. Used in Intel Core i7 (ring bus connecting cores to L3 slices).

Hypercube

N = 2ⁿ nodes. Each node connected to n neighbors (differ by one bit). Diameter = n = log₂ N. High bandwidth but complex wiring.
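The bit-flip structure makes hypercube addressing trivial — a sketch:

```c
#include <assert.h>

/* Hypercube addressing: node IDs are n-bit labels. Neighbors differ
   in exactly one bit, so flipping bit `dim` gives the neighbor along
   that dimension. */
int hypercube_neighbor(int node, int dim) {
    return node ^ (1 << dim);
}

/* Hop distance between two nodes = Hamming distance of their IDs
   (route by fixing one differing bit per hop). */
int hypercube_distance(int a, int b) {
    int d = 0;
    for (int diff = a ^ b; diff; diff >>= 1)
        d += diff & 1;   /* count differing bits */
    return d;
}
```

In an 8-node cube (n = 3), node 5 (101) has neighbors 4 (100), 7 (111), and 1 (001), and the diameter is 3 = log₂ 8.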

NUMA (Non-Uniform Memory Access)

In large multiprocessors, memory is physically distributed. Each processor has "local" memory (fast) and "remote" memory (slow).

┌────────────┐     Interconnect     ┌────────────┐
│  CPU0-3    │ ←──────────────────→ │  CPU4-7    │
│  Local Mem │                      │  Local Mem │
└────────────┘                      └────────────┘

Local access: ~80 ns. Remote access: ~130-200 ns.

ccNUMA (cache-coherent NUMA): Hardware maintains coherence across NUMA nodes. The standard for modern servers.

Programming implications:

  • Allocate memory close to the processor that uses it (first-touch policy)
  • Use numactl to bind threads and memory to specific nodes
  • Avoid cross-node memory access in hot loops
  • Thread migration can cause sudden performance drops (memory becomes remote)
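The first-touch pattern can be sketched in C with pthreads. NUMA placement itself is not observable from portable code, so this only shows the initialization structure; names and sizes are made up:

```c
#include <pthread.h>
#include <stdlib.h>

/* First-touch sketch: on Linux, a page is typically placed on the NUMA
   node of the thread that first writes it, so each worker initializes
   (first-touches) exactly the slice it will later compute on. */
#define N (1 << 20)

typedef struct { double *base; long lo, hi; } slice;

static void *worker(void *arg) {
    slice *s = arg;
    for (long i = s->lo; i < s->hi; i++)
        s->base[i] = 1.0;   /* first touch: pages land near this thread */
    return 0;
}

/* Assumes nthreads <= 8 and nthreads divides N (sketch only). */
double run_first_touch(int nthreads) {
    double *a = malloc(N * sizeof *a);   /* pages not yet placed */
    pthread_t tid[8];
    slice sl[8];
    long chunk = N / nthreads;
    for (int t = 0; t < nthreads; t++) {
        sl[t] = (slice){ a, t * chunk, (t + 1) * chunk };
        pthread_create(&tid[t], 0, worker, &sl[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], 0);
    double sum = 0;
    for (long i = 0; i < N; i++)
        sum += a[i];
    free(a);
    return sum;
}
```

Pinning the same threads with `numactl --cpunodebind=0 --membind=0 ./app` keeps both the computation and the first-touched pages on one node.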

Applications in CS

  • Parallel programming: Understanding coherence and consistency is essential for writing correct concurrent code.
  • Lock-free algorithms: Must account for memory model. What's safe on x86 (TSO) may break on ARM (relaxed).
  • Database systems: NUMA-aware query execution. Buffer pool partitioning by NUMA node. Lock design must avoid false sharing.
  • Operating systems: Scheduler must be NUMA-aware (keep threads near their memory). Page migration for load balancing.
  • High-performance computing: MPI + shared memory hybrid. NUMA topology drives process placement.
  • Virtualization: VM placement must consider NUMA topology. vNUMA exposes topology to guest OS.
  • Cloud computing: Cloud instances may span NUMA nodes. Performance variability from NUMA effects.