Memory Hierarchy

The memory hierarchy bridges the speed gap between fast processors and slow main memory by exploiting locality — the tendency of programs to access nearby data (spatial) and recently accessed data (temporal).

Locality

Temporal Locality

Recently accessed data is likely to be accessed again soon.

Examples: Loop variables, function local variables, stack top, recently called functions, hot database rows.

Spatial Locality

Data near recently accessed data is likely to be accessed soon.

Examples: Sequential array traversal, instruction fetch (sequential execution), struct/object fields, matrix rows.

The Memory Wall

Processor speed has improved ~1000× faster than memory latency:

CPU: ~0.3 ns per operation (3+ GHz)
DRAM: ~50-100 ns access time

Without caches, the processor would spend >99% of its time waiting for memory.

Cache hierarchy showing L1/L2/L3/RAM/Disk with sizes and latencies

Cache Organization

Direct-mapped vs set-associative vs fully-associative cache organization

A cache is a small, fast memory that stores copies of frequently accessed data.

Cache Structure

┌──────┬──────────┬─────────────────────────────────┐
│ Valid │   Tag    │           Data Block             │
│  bit  │          │  (multiple bytes/words)          │
├──────┼──────────┼─────────────────────────────────┤
│  1   │ 0x1A3F  │  data data data data data data   │
│  1   │ 0x0B72  │  data data data data data data   │
│  0   │   —     │           (invalid)               │
│  ...  │  ...    │           ...                     │
└──────┴──────────┴─────────────────────────────────┘

Address Decomposition

For a cache with 2ˢ sets, block size of 2ᵇ bytes:

Address: [    Tag    |  Index (s bits)  |  Offset (b bits)  ]

Offset: Which byte within the block (b = log₂(block size))
Index: Which set to look in (s = log₂(number of sets))
Tag: Identifies the specific block within the set

Cache Types

Direct-Mapped (1-way)

Each memory block maps to exactly one cache line.

Cache line = (address / block_size) % num_lines

Advantages: Simplest, fastest lookup (one comparison). Disadvantages: Conflict misses — two blocks mapping to the same line evict each other.

Example conflict: Arrays A and B both have elements that map to the same cache line. Accessing them alternately causes thrashing.

Set-Associative (N-way)

Each memory block can go into any of N lines within a set.

Set = (address / block_size) % num_sets
Block can be in any of the N ways within the set

2-way: Each set has 2 lines. Check both tags in parallel.
4-way: Each set has 4 lines. Common in L1 caches.
8-way: Common in L2/L3 caches.

Tradeoff: Higher associativity → fewer conflict misses but slower (more comparators) and more power.

Fully Associative

A block can go anywhere in the cache. Must check all tags (expensive). Used only for very small caches (TLBs, victim caches).

Cache Replacement Policies

When a set is full and a new block arrives, which block to evict?

LRU (Least Recently Used)

Evict the block that hasn't been used for the longest time.

Implementation: Track access order. For 2-way: 1 bit. For 4-way: 6 bits (pseudo-LRU is common). For 8+ way: true LRU is expensive → use approximations.

FIFO (First In, First Out)

Evict the oldest block. Simpler than LRU but can evict frequently used blocks.

Random

Evict a random block. Surprisingly competitive with LRU for high associativity. Very simple hardware.

Pseudo-LRU (Tree-PLRU)

Binary tree of bits. Each bit points toward the LRU side. On access, point bits away from the accessed way. Approximates LRU with much less hardware.

Optimal (Bélády's)

Evict the block that won't be used for the longest time in the future. Impossible to implement (requires future knowledge). Used as a theoretical upper bound for comparison.

Write Policies

Write Hit

Write-through: Write to both cache and memory immediately.

Simpler. Memory always up-to-date.
High memory traffic (every write goes to memory).
Often combined with a write buffer to hide latency.

Write-back: Write only to cache. Mark line as dirty. Write to memory only when evicted.

Less memory traffic (only dirty lines written back).
More complex (dirty bit tracking, write-back on eviction).
Dominant in modern processors.

Write Miss

Write-allocate: Fetch the block into cache, then write. Exploits spatial locality (future writes likely nearby).

No-write-allocate (write-around): Write directly to memory without fetching into cache.

Common pairings:

Write-back + write-allocate (most common — modern caches)
Write-through + no-write-allocate

Cache Performance

Miss Rate

Miss rate = Cache misses / Total accesses
Hit rate = 1 - Miss rate

Typical L1 miss rates: 2-5% (data), <1% (instruction).

Three Cs of Cache Misses

Compulsory (cold) misses: First access to a block. Cannot be avoided (except prefetching).

Capacity misses: Cache is too small to hold all needed data. Reduce by increasing cache size.

Conflict misses: Multiple blocks compete for the same set. Reduce by increasing associativity.

(Fourth C: Coherence misses in multiprocessor systems.)

Average Memory Access Time (AMAT)

AMAT = Hit time + Miss rate × Miss penalty

For a multi-level cache:

AMAT = L1_hit_time + L1_miss_rate × (L2_hit_time + L2_miss_rate × L2_miss_penalty)

Example

L1: 1 cycle hit, 5% miss rate. L2: 10 cycles hit, 1% miss rate (of accesses that reach L2 → local miss rate 20%). Main memory: 100 cycles.

AMAT = 1 + 0.05 × (10 + 0.20 × 100) = 1 + 0.05 × 30 = 2.5 cycles

Without caches: AMAT = 100 cycles. 40× improvement.

Cache Hierarchy

Modern processors have 2-3 levels:

Level	Size	Latency	Associativity	Shared?
L1I (instruction)	32-64 KB	1-4 cycles	4-8 way	Per core
L1D (data)	32-64 KB	1-4 cycles	4-8 way	Per core
L2	256 KB - 1 MB	10-15 cycles	8-16 way	Per core
L3	4-64 MB	30-50 cycles	12-20 way	Shared
Main memory (DRAM)	8-256 GB	100-200 cycles	N/A	Shared

Inclusion Policies

Inclusive: L2 contains a copy of everything in L1. Eviction from L2 → back-invalidate L1. Simplifies coherence.

Exclusive: Data is in either L1 or L2, not both. Effective capacity = L1 + L2. Used by AMD.

Non-inclusive non-exclusive (NINE): No strict policy. L2 doesn't track L1 contents. Used by some Intel designs.

Cache Optimization Techniques

Hardware

Prefetching: Predict and fetch blocks before they're needed (hardware stream prefetcher, stride prefetcher).
Victim cache: Small fully-associative cache for recently evicted lines. Catches conflict misses.
Multi-banked cache: Multiple banks accessed in parallel for higher bandwidth.
Non-blocking cache: Allows hits during a miss (miss-under-miss, hit-under-miss). Uses MSHRs (Miss Status Holding Registers).

Software

Loop tiling/blocking: Process data in cache-sized blocks.
Data structure layout: Prefer arrays of structs vs structs of arrays depending on access pattern.
Prefetch instructions: Software-initiated prefetch (__builtin_prefetch in GCC/Clang, _mm_prefetch for SSE).
Cache-oblivious algorithms: Work well regardless of cache size (e.g., funnel sort, cache-oblivious matrix multiply).

Cache Coherence Introduction

In multiprocessor systems, multiple caches may have copies of the same memory location. When one processor writes, other caches' copies become stale.

Coherence protocols ensure all processors see a consistent view of memory. Covered in detail in the multiprocessor file.

Applications in CS

Performance optimization: Understanding cache behavior is essential for writing fast code. Cache-friendly access patterns can give 10-100× speedups.
Data structure design: Cache-aware data structures (B-trees for disks, cache-oblivious structures, packed arrays vs linked lists).
Algorithm design: Tiling matrix multiplication, merge sort vs quicksort (locality differences), hash table design (open addressing vs chaining).
Database systems: Buffer pool management is analogous to cache management. Page replacement policies (LRU, clock, LRU-K).
Operating systems: Page cache, file system caches, TLB management.
Game development: Entity Component System (ECS) designed for cache locality. Data-oriented design.