Advanced Memory Management

Huge Pages

The standard page size on x86-64 is 4 KB. With hundreds of gigabytes of RAM, the sheer number of page-table entries and the resulting TLB misses become a significant performance bottleneck.

Explicit Huge Pages

  • Allocated via hugetlbfs or mmap(MAP_HUGETLB)
  • Sizes: 2 MB (PMD level) or 1 GB (PUD level) on x86-64
  • Must be reserved at boot or runtime via /proc/sys/vm/nr_hugepages
  • Used by databases (Oracle, PostgreSQL), JVMs, DPDK
#include <stdio.h>
#include <sys/mman.h>
// Allocate one explicit 2 MB huge page
void *p = mmap(NULL, 2 * 1024 * 1024,
               PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if (p == MAP_FAILED)
    perror("mmap");  // ENOMEM if the huge page pool is empty

Transparent Huge Pages (THP)

THP lets the kernel automatically promote contiguous 4 KB pages to 2 MB pages without application modification.

Modes:

  • always -- system-wide automatic promotion
  • madvise -- only for regions marked with madvise(MADV_HUGEPAGE)
  • never -- disabled

Pitfalls of THP:

  • Memory bloat: a single-byte allocation can consume a full 2 MB
  • Allocation stalls: the kernel may compact memory synchronously to create 2 MB-aligned blocks
  • Latency spikes from khugepaged background promotion
  • Many database deployments (Redis, MongoDB) disable THP due to unpredictable latency

Memory Compaction

Over time, physical memory becomes fragmented -- free pages are scattered, preventing large contiguous allocations. Memory compaction migrates movable pages to create contiguous free blocks.

Compaction triggers:

  1. Direct compaction: triggered synchronously when a high-order allocation fails
  2. kcompactd: background kernel thread performing proactive compaction
  3. Manual: echo 1 > /proc/sys/vm/compact_memory

Page mobility types:

  • Unmovable: kernel slab allocations, DMA buffers, page tables
  • Movable: user-space anonymous pages and page cache (can be migrated)
  • Reclaimable: can be freed under pressure (e.g. dentry/inode slab caches)

The kernel groups pages by mobility in pageblocks to limit fragmentation.

NUMA-Aware Allocation

Non-Uniform Memory Access (NUMA) architectures have memory nodes attached to specific CPU sockets. Local memory access is 1.5-3x faster than remote access.

Allocation Policies

| Policy          | Behavior                           | Use Case                 |
|-----------------|------------------------------------|--------------------------|
| MPOL_DEFAULT    | Allocate on the local node         | General purpose          |
| MPOL_BIND       | Restrict to specified nodes        | Pinned workloads         |
| MPOL_INTERLEAVE | Round-robin across nodes           | Hash tables, shared data |
| MPOL_PREFERRED  | Prefer a node, fall back to others | Soft affinity            |

AutoNUMA (NUMA Balancing)

The kernel periodically unmaps pages and uses resulting page faults to track access patterns. Pages are then migrated to the node where they are most frequently accessed. Tasks may also be migrated to be closer to their data.

Control: /proc/sys/kernel/numa_balancing = 1
Tradeoff: migration overhead vs. improved locality

numactl and libnuma

# Run process on node 0 CPUs, allocate memory on node 0
numactl --cpunodebind=0 --membind=0 ./application

# Interleave memory across all nodes
numactl --interleave=all ./application

Kernel Same-page Merging (KSM)

KSM scans memory for pages with identical content and merges them into a single copy-on-write page. This is crucial for virtualization where many VMs run the same OS.

How it works:

  1. ksmd daemon scans pages registered via madvise(MADV_MERGEABLE)
  2. Page contents are checksummed and compared byte-for-byte (via red-black trees of candidate pages); identical pages are merged
  3. Merged pages are marked COW -- a write triggers a page fault and copy

Performance considerations:

  • Memory savings of 20-50% in homogeneous VM environments
  • CPU overhead from continuous scanning
  • COW faults on write cause latency -- not suitable for write-heavy workloads
  • Security concern: COW-fault timing side channels can reveal whether another tenant holds a page with identical content

Memory Cgroups (cgroup v2)

Memory cgroups provide per-group memory accounting and limits, essential for container isolation.

Key Controls

| Control         | Description                                       |
|-----------------|---------------------------------------------------|
| memory.max      | Hard limit -- OOM killer invoked on breach        |
| memory.high     | Soft limit -- triggers reclaim pressure           |
| memory.low      | Best-effort protection from reclaim               |
| memory.min      | Hard protection -- pages guaranteed unreclaimable |
| memory.swap.max | Limit on swap usage                               |

Memory Accounting

Cgroup v2 tracks:

  • Anonymous pages (heap, stack, mmap)
  • Page cache pages charged to the cgroup that first faulted them in
  • Kernel memory (slab objects, socket buffers, page tables)
  • Swap usage
# Set a 4 GB memory limit for a container cgroup
echo 4G > /sys/fs/cgroup/mycontainer/memory.max

# Check current usage
cat /sys/fs/cgroup/mycontainer/memory.current

OOM Killer

When a memory cgroup or the system exhausts memory and reclaim fails, the Out-Of-Memory (OOM) killer selects and terminates a process.

Selection Algorithm

Each process receives an oom_score (0-1000) based on:

  • Memory footprint: RSS, swap usage, and page-table size (primary factor)
  • oom_score_adj (-1000 to +1000) set by the administrator
  • In older kernels, child processes' memory was also weighed in
# Protect a critical process from OOM killing
echo -1000 > /proc/<pid>/oom_score_adj

# Make a process the preferred OOM victim
echo 1000 > /proc/<pid>/oom_score_adj

Cgroup-level OOM

In cgroup v2, an OOM event in a cgroup only kills processes within that cgroup. The memory.oom.group knob kills the entire cgroup as a unit (useful for containers).
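Enabling group kill for a container cgroup is a one-line write (a sketch; mycontainer is the same placeholder cgroup used above):

```shell
# Kill all tasks in the cgroup together when its OOM killer fires
echo 1 > /sys/fs/cgroup/mycontainer/memory.oom.group
```

This avoids a half-dead container where one worker process was killed but its siblings linger.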

userfaultfd

userfaultfd allows user-space programs to handle page faults for designated memory regions. The kernel sends fault events to a file descriptor, and user space resolves them.

Use Cases

  1. Live VM migration (postcopy): Start the VM at the destination with empty pages. When the VM touches a missing page, userfaultfd notifies a handler that fetches it from the source over the network.
  2. Heap snapshots: Lazy checkpoint of process memory for debugging or persistence.
  3. Garbage collectors: Custom page-fault handling for managed runtimes.
// Simplified userfaultfd flow (see <linux/userfaultfd.h>)
struct uffdio_api api = { .api = UFFD_API };
int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
ioctl(uffd, UFFDIO_API, &api);        // handshake: negotiate API version
struct uffdio_register reg = {
    .range = { .start = (unsigned long)addr, .len = len },  // region to manage
    .mode  = UFFDIO_REGISTER_MODE_MISSING };                // missing-page faults
ioctl(uffd, UFFDIO_REGISTER, &reg);
// Poll uffd for fault events, resolve with UFFDIO_COPY or UFFDIO_ZEROPAGE

Persistent Memory (PMEM)

Persistent memory (Intel Optane DCPMM, CXL-attached PMEM) sits on the memory bus but retains data across power cycles. It bridges the gap between DRAM and storage.

DAX (Direct Access)

DAX bypasses the page cache, mapping persistent memory directly into the application's virtual address space. File I/O becomes load/store instructions.

Traditional I/O:  app -> syscall -> page cache -> block driver -> device
DAX:              app -> load/store -> persistent memory (no copies)

File systems with DAX support: ext4-dax, XFS-dax, NOVA (research)
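Enabling DAX on ext4 or XFS is a mount option (a sketch; /dev/pmem0 is the typical device name the kernel assigns a PMEM namespace, and per-mount dax=always|never|inode exists since Linux 5.8):

```shell
# Map files on this PMEM device directly into page tables, bypassing the page cache
mount -o dax=always /dev/pmem0 /mnt/pmem
```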

Programming Model Challenges

  • Persistence ordering: CPU caches and store buffers can reorder writes. Applications must use clflush, clwb, sfence to ensure data reaches persistent media.
  • Failure atomicity: A crash mid-write can leave data structures in an inconsistent state. Solutions include logging (PMDK transactions), copy-on-write, or hardware-supported atomic writes.
// Persistent write with ordering guarantee
pmem_memcpy_persist(pmem_dest, src, len);
// Internally: memcpy + clwb + sfence

PMDK (Persistent Memory Development Kit)

Intel's PMDK provides libraries for persistent memory programming:

  • libpmem -- low-level flush/fence primitives
  • libpmemobj -- transactional object store with type-safe pointers
  • libpmemkv -- embedded key-value store on PMEM

CXL Memory

Compute Express Link (CXL) enables memory expansion and sharing over a PCIe-based interconnect.

CXL Memory Types

| Type   | Description                         | Use Case                     |
|--------|-------------------------------------|------------------------------|
| Type 1 | CXL.cache + CXL.io (accelerators)   | GPUs, FPGAs with coherency   |
| Type 2 | CXL.cache + CXL.mem (device memory) | GPU with host-accessible mem |
| Type 3 | CXL.mem only (memory expanders)     | DRAM/PMEM capacity expansion |

CXL in the Linux Kernel

  • CXL memory appears as a new NUMA node with higher latency
  • The kernel's tiered memory support (demotion) moves cold pages from fast DRAM to slower CXL memory
  • Memory tiering in Linux 6.x: hot pages on local DRAM, warm/cold pages on CXL DRAM
Memory hierarchy with CXL:
  HBM (fastest) -> Local DDR5 -> CXL-attached DRAM -> CXL-attached PMEM -> NVMe
  ~80ns            ~100ns        ~200-300ns            ~300-500ns           ~10us
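Demotion to a slower tier is off by default and can be switched on at runtime (a sketch; this knob appeared around Linux 5.15):

```shell
# Allow reclaim to demote cold pages to a slower NUMA node (e.g. CXL memory)
echo true > /sys/kernel/mm/numa/demotion_enabled

# Inspect node topology and distances to see the tiers
numactl --hardware
```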

Key Takeaways

  1. THP provides automatic huge page benefits but can cause latency spikes; use madvise mode for latency-sensitive workloads
  2. NUMA-aware allocation is essential for multi-socket performance; AutoNUMA handles it automatically but imperfectly
  3. KSM trades CPU cycles for memory savings -- most valuable in dense VM environments
  4. Memory cgroups (v2) provide hierarchical limits with soft/hard thresholds and OOM isolation
  5. userfaultfd enables powerful user-space memory management (live migration, snapshots)
  6. Persistent memory + DAX eliminates the storage I/O stack but demands careful ordering of stores
  7. CXL creates a tiered memory hierarchy managed transparently by the kernel's page demotion/promotion