Advanced Memory Management

Huge Pages

The standard page size on x86-64 is 4 KB. With hundreds of gigabytes of RAM, the sheer number of page-table entries and the resulting TLB misses become a significant performance bottleneck.

Explicit Huge Pages

  • Allocated via hugetlbfs or mmap(MAP_HUGETLB)
  • Sizes: 2 MB (PMD level) or 1 GB (PUD level) on x86-64
  • Must be reserved at boot or runtime via /proc/sys/vm/nr_hugepages
  • Used by databases (Oracle, PostgreSQL), JVMs, DPDK
#include <stdio.h>
#include <sys/mman.h>
// Allocate one explicit 2 MB huge page
void *p = mmap(NULL, 2 * 1024 * 1024,
               PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if (p == MAP_FAILED)
    perror("mmap");  // ENOMEM if the huge page pool is empty

Transparent Huge Pages (THP)

THP lets the kernel automatically promote contiguous 4 KB pages to 2 MB pages without application modification.

Modes:

  • always -- system-wide automatic promotion
  • madvise -- only for regions marked with madvise(MADV_HUGEPAGE)
  • never -- disabled

Pitfalls of THP:

  • Memory bloat: a single-byte allocation can consume a full 2 MB
  • Allocation stalls: the kernel may compact memory synchronously to create 2 MB-aligned blocks
  • Latency spikes from khugepaged background promotion
  • Many database deployments (Redis, MongoDB) disable THP due to unpredictable latency

Memory Compaction

Over time, physical memory becomes fragmented -- free pages are scattered, preventing large contiguous allocations. Memory compaction migrates movable pages to create contiguous free blocks.

Compaction triggers:

  1. Direct compaction: triggered synchronously when a high-order allocation fails
  2. kcompactd: background kernel thread performing proactive compaction
  3. Manual: echo 1 > /proc/sys/vm/compact_memory

Page mobility types:

  • Unmovable: kernel slab allocations, DMA buffers, page tables
  • Movable: user-space anonymous pages and page cache (can be migrated)
  • Reclaimable: can be freed under pressure (e.g. dentry/inode slab caches)

The kernel groups pages by mobility in pageblocks to limit fragmentation.

NUMA-Aware Allocation

Non-Uniform Memory Access (NUMA) architectures have memory nodes attached to specific CPU sockets. Local memory access is 1.5-3x faster than remote access.

Allocation Policies

| Policy          | Behavior                           | Use Case                 |
|-----------------|------------------------------------|--------------------------|
| MPOL_DEFAULT    | Allocate on the local node         | General purpose          |
| MPOL_BIND       | Restrict to specified nodes        | Pinned workloads         |
| MPOL_INTERLEAVE | Round-robin across nodes           | Hash tables, shared data |
| MPOL_PREFERRED  | Prefer a node, fall back to others | Soft affinity            |

AutoNUMA (NUMA Balancing)

The kernel periodically unmaps pages and uses resulting page faults to track access patterns. Pages are then migrated to the node where they are most frequently accessed. Tasks may also be migrated to be closer to their data.

Control: /proc/sys/kernel/numa_balancing = 1
Tradeoff: migration overhead vs. improved locality

numactl and libnuma

# Run process on node 0 CPUs, allocate memory on node 0
numactl --cpunodebind=0 --membind=0 ./application

# Interleave memory across all nodes
numactl --interleave=all ./application

Kernel Same-page Merging (KSM)

KSM scans memory for pages with identical content and merges them into a single copy-on-write page. This is crucial for virtualization where many VMs run the same OS.

How it works:

  1. ksmd daemon scans pages registered via madvise(MADV_MERGEABLE)
  2. Page contents are checksummed and compared byte-for-byte (via red-black trees of candidate pages); identical pages are merged
  3. Merged pages are marked COW -- a write triggers a page fault and copy

Performance considerations:

  • Memory savings of 20-50% in homogeneous VM environments
  • CPU overhead from continuous scanning
  • COW faults on write cause latency -- not suitable for write-heavy workloads
  • Security concern: COW-fault timing side channels can reveal whether another tenant holds a page with identical content

Memory Cgroups (cgroup v2)

Memory cgroups provide per-group memory accounting and limits, essential for container isolation.

Key Controls

| Control         | Description                                       |
|-----------------|---------------------------------------------------|
| memory.max      | Hard limit -- OOM killer invoked on breach        |
| memory.high     | Soft limit -- triggers reclaim pressure           |
| memory.low      | Best-effort protection from reclaim               |
| memory.min      | Hard protection -- pages guaranteed unreclaimable |
| memory.swap.max | Limit on swap usage                               |

Memory Accounting

Cgroup v2 tracks:

  • Anonymous pages (heap, stack, mmap)
  • Page cache pages charged to the cgroup that first faulted them in
  • Kernel memory (slab objects, socket buffers, page tables)
  • Swap usage
# Set a 4 GB memory limit for a container cgroup
echo 4G > /sys/fs/cgroup/mycontainer/memory.max

# Check current usage
cat /sys/fs/cgroup/mycontainer/memory.current

OOM Killer

When a memory cgroup or the system exhausts memory and reclaim fails, the Out-Of-Memory (OOM) killer selects and terminates a process.

Selection Algorithm

Each process receives an oom_score (0-1000) based on:

  • Memory footprint: RSS, swap usage, and page-table size (primary factor)
  • oom_score_adj (-1000 to +1000) set by the administrator
  • In older kernels, child processes' memory was also weighed in
# Protect a critical process from OOM killing
echo -1000 > /proc/<pid>/oom_score_adj

# Make a process the preferred OOM victim
echo 1000 > /proc/<pid>/oom_score_adj

Cgroup-level OOM

In cgroup v2, an OOM event in a cgroup only kills processes within that cgroup. The memory.oom.group knob kills the entire cgroup as a unit (useful for containers).
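Enabling group kill for a container cgroup is a one-line write (a sketch; mycontainer is the same placeholder cgroup used above):

```shell
# Kill all tasks in the cgroup together when its OOM killer fires
echo 1 > /sys/fs/cgroup/mycontainer/memory.oom.group
```

This avoids a half-dead container where one worker process was killed but its siblings linger.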

userfaultfd

userfaultfd allows user-space programs to handle page faults for designated memory regions. The kernel sends fault events to a file descriptor, and user space resolves them.

Use Cases

  1. Live VM migration (postcopy): Start the VM at the destination with empty pages. When the VM touches a missing page, userfaultfd notifies a handler that fetches it from the source over the network.
  2. Heap snapshots: Lazy checkpoint of process memory for debugging or persistence.
  3. Garbage collectors: Custom page-fault handling for managed runtimes.
// Simplified userfaultfd flow (see <linux/userfaultfd.h>)
struct uffdio_api api = { .api = UFFD_API };
int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
ioctl(uffd, UFFDIO_API, &api);        // handshake: negotiate API version
struct uffdio_register reg = {
    .range = { .start = (unsigned long)addr, .len = len },  // region to manage
    .mode  = UFFDIO_REGISTER_MODE_MISSING };                // missing-page faults
ioctl(uffd, UFFDIO_REGISTER, &reg);
// Poll uffd for fault events, resolve with UFFDIO_COPY or UFFDIO_ZEROPAGE

Persistent Memory (PMEM)

Persistent memory (Intel Optane DCPMM, CXL-attached PMEM) sits on the memory bus but retains data across power cycles. It bridges the gap between DRAM and storage.

DAX (Direct Access)

DAX bypasses the page cache, mapping persistent memory directly into the application's virtual address space. File I/O becomes load/store instructions.

Traditional I/O:  app -> syscall -> page cache -> block driver -> device
DAX:              app -> load/store -> persistent memory (no copies)

File systems with DAX support: ext4-dax, XFS-dax, NOVA (research)
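Enabling DAX on ext4 or XFS is a mount option (a sketch; /dev/pmem0 is the typical device name the kernel assigns a PMEM namespace, and per-mount dax=always|never|inode exists since Linux 5.8):

```shell
# Map files on this PMEM device directly into page tables, bypassing the page cache
mount -o dax=always /dev/pmem0 /mnt/pmem
```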

Programming Model Challenges

  • Persistence ordering: CPU caches and store buffers can reorder writes. Applications must use clflush, clwb, sfence to ensure data reaches persistent media.
  • Failure atomicity: A crash mid-write can leave data structures in an inconsistent state. Solutions include logging (PMDK transactions), copy-on-write, or hardware-supported atomic writes.
// Persistent write with ordering guarantee
pmem_memcpy_persist(pmem_dest, src, len);
// Internally: memcpy + clwb + sfence

PMDK (Persistent Memory Development Kit)

Intel's PMDK provides libraries for persistent memory programming:

  • libpmem -- low-level flush/fence primitives
  • libpmemobj -- transactional object store with type-safe pointers
  • libpmemkv -- embedded key-value store on PMEM

CXL Memory

Compute Express Link (CXL) enables memory expansion and sharing over a PCIe-based interconnect.

CXL Memory Types

| Type   | Description                         | Use Case                     |
|--------|-------------------------------------|------------------------------|
| Type 1 | CXL.cache + CXL.io (accelerators)   | GPUs, FPGAs with coherency   |
| Type 2 | CXL.cache + CXL.mem (device memory) | GPU with host-accessible mem |
| Type 3 | CXL.mem only (memory expanders)     | DRAM/PMEM capacity expansion |

CXL in the Linux Kernel

  • CXL memory appears as a new NUMA node with higher latency
  • The kernel's tiered memory support (demotion) moves cold pages from fast DRAM to slower CXL memory
  • Memory tiering in Linux 6.x: hot pages on local DRAM, warm/cold pages on CXL DRAM
Memory hierarchy with CXL:
  HBM (fastest) -> Local DDR5 -> CXL-attached DRAM -> CXL-attached PMEM -> NVMe
  ~80ns            ~100ns        ~200-300ns            ~300-500ns           ~10us
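Demotion to a slower tier is off by default and can be switched on at runtime (a sketch; this knob appeared around Linux 5.15):

```shell
# Allow reclaim to demote cold pages to a slower NUMA node (e.g. CXL memory)
echo true > /sys/kernel/mm/numa/demotion_enabled

# Inspect node topology and distances to see the tiers
numactl --hardware
```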

Key Takeaways

  1. THP provides automatic huge page benefits but can cause latency spikes; use madvise mode for latency-sensitive workloads
  2. NUMA-aware allocation is essential for multi-socket performance; AutoNUMA handles it automatically but imperfectly
  3. KSM trades CPU cycles for memory savings -- most valuable in dense VM environments
  4. Memory cgroups (v2) provide hierarchical limits with soft/hard thresholds and OOM isolation
  5. userfaultfd enables powerful user-space memory management (live migration, snapshots)
  6. Persistent memory + DAX eliminates the storage I/O stack but demands careful ordering of stores
  7. CXL creates a tiered memory hierarchy managed transparently by the kernel's page demotion/promotion