Runtime Systems
The runtime system provides the execution environment for compiled programs — memory layout, function calls, exception handling, and dynamic dispatch.
Memory Layout
High addresses
┌───────────────────┐
│ Kernel space │ (not accessible to user programs)
├───────────────────┤
│ Stack │ ↓ grows downward (function frames)
│ ... │
│ (gap) │
│ ... │
│ Heap │ ↑ grows upward (dynamic allocation)
├───────────────────┤
│ BSS │ uninitialized global data (zeroed)
├───────────────────┤
│ Data │ initialized global data
├───────────────────┤
│ Text (code) │ read-only, executable
└───────────────────┘
Low addresses
Text: Machine code. Mapped read-only + execute. Shared between processes running the same binary.
Data: Initialized globals and static variables. Read-write.
BSS (Block Started by Symbol): Uninitialized globals. Zeroed at program start. Doesn't occupy space in the binary file.
Heap: Dynamic allocation (malloc, Box::new). Grows upward via brk/mmap.
Stack: Function call frames. Grows downward. Fixed maximum size (default 1-8 MB per thread).
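The regions above can be observed directly. A minimal sketch (Linux/x86-64 assumed; exact addresses vary with ASLR, and the names here are illustrative):

```cpp
#include <cstdio>
#include <cstdlib>

int bss_var;            // uninitialized global -> BSS (zeroed at start)
int data_var = 42;      // initialized global   -> Data

// Prints one address from each region so their relative positions
// can be compared (text low, stack high on typical Linux/x86-64).
void print_layout() {
    int stack_var = 0;                                   // local -> stack
    int* heap_var = (int*)std::malloc(sizeof(int));      // -> heap
    std::printf("text : %p\n", (void*)&print_layout);
    std::printf("data : %p\n", (void*)&data_var);
    std::printf("bss  : %p\n", (void*)&bss_var);
    std::printf("heap : %p\n", (void*)heap_var);
    std::printf("stack: %p\n", (void*)&stack_var);
    std::free(heap_var);
}
```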
Activation Records (Stack Frames)
Each function call creates a stack frame containing:
┌──────────────────────┐
│ Arguments 7+ │ (passed on stack; first 6 integer args go in registers on x86-64 SysV)
├──────────────────────┤
│ Return address │ (pushed by CALL)
├──────────────────────┤ ← Frame pointer (RBP)
│ Old frame pointer │ (saved RBP)
├──────────────────────┤
│ Saved registers │ (callee-saved: RBX, R12-R15)
├──────────────────────┤
│ Local variables │
├──────────────────────┤
│ Temporary values │ (spilled registers)
└──────────────────────┘ ← Stack pointer (RSP)
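That frames are stacked at decreasing addresses can be checked empirically. A sketch (assumes x86-64 and a GCC/Clang `noinline` attribute to keep the frames from being collapsed):

```cpp
#include <cstdint>

// Each nested call gets a fresh frame; on x86-64 the deeper frame
// sits at a *lower* address because the stack grows downward.
__attribute__((noinline)) std::uintptr_t deepest_local(int depth) {
    volatile int local = depth;              // force a real stack slot
    if (depth == 0) return (std::uintptr_t)&local;
    std::uintptr_t inner = deepest_local(depth - 1);
    std::uintptr_t here  = (std::uintptr_t)&local;
    return inner < here ? inner : here;      // keep the lowest address seen
}

bool stack_grows_down() {
    volatile int outer = 0;
    return deepest_local(3) < (std::uintptr_t)&outer;
}
```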
Frame Pointer vs Frame Pointer Omission
With frame pointer (RBP): Each frame has a pointer to the previous frame. Stack unwinding follows the chain of frame pointers. Easy debugging.
Without frame pointer (-fomit-frame-pointer): RBP is used as a general-purpose register. Stack unwinding uses DWARF unwind info instead. One more register available. Default in release builds.
Dynamic Linking
PLT (Procedure Linkage Table)
Lazy binding of external function calls. First call goes through PLT → resolves the symbol → patches the GOT → subsequent calls go directly through GOT.
call printf@PLT // First call
→ PLT stub:
jmp *GOT[printf] // Initially points back to PLT resolver
→ Resolver:
addr = lookup("printf") // dynamic linker resolves the symbol (conceptually a dlsym)
GOT[printf] = addr // Patch GOT
jmp addr // Call printf
// Subsequent calls:
call printf@PLT
→ PLT stub:
jmp *GOT[printf] // Now points directly to printf → fast!
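The PLT/GOT dance can be mimicked with an ordinary function-pointer slot: the slot starts out pointing at a resolver, and the first call patches it. A sketch (names are illustrative, not real linker machinery):

```cpp
#include <cstdio>

static int resolver_calls = 0;

// The "real" external function the slot will eventually point to.
int real_fn(const char* s) { return std::printf("%s\n", s); }

int resolver_stub(const char* s);                 // forward declaration

// "GOT" slot: initially points at the resolver, like a fresh PLT entry.
int (*got_slot)(const char*) = &resolver_stub;

// First call lands here: resolve the symbol, patch the slot, forward.
int resolver_stub(const char* s) {
    ++resolver_calls;
    got_slot = &real_fn;                          // patch the "GOT"
    return real_fn(s);
}

// Every call site just jumps through the slot, PLT-style.
int call_fn(const char* s) { return got_slot(s); }
```

After the first call, `call_fn` reaches `real_fn` with a single indirect jump, which is exactly the fast path the real PLT converges to.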
GOT (Global Offset Table)
Table of pointers to external symbols. Updated by the dynamic linker at load time (eager binding) or first call (lazy binding).
Position-Independent Code (PIC): Code that works regardless of where it's loaded in memory. Uses PC-relative addressing + GOT/PLT for external references. Required for shared libraries.
Exception Handling
Table-Driven (Zero-Cost)
No runtime overhead when no exceptions are thrown. Only pays cost when an exception occurs.
Implementation: Compiler generates unwind tables mapping each PC range to cleanup actions (destructors, catch blocks).
.eh_frame / .gcc_except_table:
PC range [0x1000, 0x1050]: cleanup → call destructor for obj_a
PC range [0x1050, 0x1100]: catch(std::exception) → handler at 0x2000
On throw:
1. Examine the unwind table for the current PC.
2. If a matching catch is found → jump to handler.
3. If cleanup is needed → run destructors.
4. Unwind to the caller's frame (restore callee-saved registers from the frame).
5. Repeat from step 1 with the caller's PC.
Cost: Zero overhead on the non-exception path. Exception path is slower (table lookup, unwinding). Tables add to binary size.
Used by: C++ (via DWARF unwind info), Rust (panics use this mechanism).
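The unwinder running cleanup actions frame by frame is observable from C++: destructors fire in inner-to-outer order before the handler runs. A small sketch:

```cpp
#include <stdexcept>
#include <string>

static std::string log_;

struct Guard {                        // destructor = cleanup action
    const char* name;
    ~Guard() { log_ += name; log_ += ";"; }
};

void inner() {
    Guard a{"a"};
    throw std::runtime_error("boom"); // unwinder runs ~Guard for a
}

std::string run() {
    try {
        Guard b{"b"};
        inner();                      // unwinding also runs ~Guard for b
    } catch (const std::exception&) {
        log_ += "catch;";
    }
    return log_;                      // "a;b;catch;"
}
```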
setjmp/longjmp
C-style: setjmp saves the current execution context (callee-saved registers, stack pointer, program counter). longjmp restores it, transferring control back to the setjmp point with a nonzero return value.
Disadvantage: Destructors are not called during the unwind. Not composable. Error-prone.
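The control flow looks like this (frames between setjmp and longjmp are kept trivial here, since jumping over frames with live destructors is undefined behavior in C++):

```cpp
#include <csetjmp>

static std::jmp_buf env;

// longjmp never returns: control reappears at the setjmp call site.
void fail() { std::longjmp(env, 7); }

int protected_call() {
    int code = setjmp(env);   // returns 0 on the first pass,
    if (code == 0) {          // then longjmp's value on the second
        fail();               // never returns normally
        return -1;            // unreachable
    }
    return code;              // landed here via longjmp
}
```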
Garbage Collection Implementation
Mark-and-Sweep (Tracing GC)
- Root scanning: Find all GC roots (stack variables, globals, registers).
- Mark phase: Starting from roots, traverse all reachable objects. Mark each as live.
- Sweep phase: Scan the heap. Free unmarked objects. Reset marks.
Stop-the-world: All application threads are paused during GC. Pause time ∝ live data (mark) + total heap (sweep).
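The three phases can be sketched with a toy heap of objects holding references to other objects (a minimal illustration, not a production design):

```cpp
#include <cstddef>
#include <vector>

struct Obj {
    bool marked = false;
    std::vector<Obj*> refs;            // outgoing references
};

struct Heap {
    std::vector<Obj*> objects;         // everything ever allocated

    Obj* alloc() { Obj* o = new Obj; objects.push_back(o); return o; }

    // Mark phase: depth-first traversal from a root.
    void mark(Obj* o) {
        if (o == nullptr || o->marked) return;
        o->marked = true;
        for (Obj* r : o->refs) mark(r);
    }

    // Mark from all roots, then sweep: free unmarked, reset marks.
    std::size_t collect(const std::vector<Obj*>& roots) {
        for (Obj* r : roots) mark(r);
        std::size_t freed = 0;
        std::vector<Obj*> live;
        for (Obj* o : objects) {
            if (o->marked) { o->marked = false; live.push_back(o); }
            else           { delete o; ++freed; }
        }
        objects = live;
        return freed;
    }
};
```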
Copying Collector
Divide heap into from-space and to-space. Copy live objects from from-space to to-space. Swap spaces.
Advantages: Compacts memory (no fragmentation). Allocation is O(1) (bump pointer). Cost ∝ live data only (dead objects are never touched).
Disadvantage: Uses 2× memory (only half the heap is usable at a time).
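The O(1) bump-pointer allocation that a compacted to-space enables looks like this (a sketch; real collectors trigger a GC instead of returning null when space runs out):

```cpp
#include <cstddef>
#include <cstdint>

// Bump-pointer allocation: the free area is contiguous, so allocating
// is just "round up, advance a pointer".
struct BumpAllocator {
    std::uint8_t* base;
    std::size_t   size;
    std::size_t   used = 0;

    void* alloc(std::size_t n) {
        n = (n + 7) & ~std::size_t(7);        // keep 8-byte alignment
        if (used + n > size) return nullptr;  // exhausted -> would trigger GC
        void* p = base + used;
        used += n;
        return p;
    }
};
```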
Generational GC
Exploit the generational hypothesis: most objects die young.
Young generation: Small, collected frequently. Copying collector. Most objects die here → GC is very fast (only copies survivors).
Old generation: Large, collected infrequently. Mark-and-sweep or mark-compact. Contains long-lived objects.
Write barrier: Track references from old → young generation. When an old object is modified to point to a young object, record it. The young generation GC uses these records as additional roots.
Promotion: Objects surviving several young GC cycles are promoted to old generation.
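The write barrier plus remembered set can be sketched as follows (illustrative names; real runtimes inline the barrier into every pointer store):

```cpp
#include <unordered_set>
#include <vector>

struct GObj {
    bool young;                    // which generation the object lives in
    std::vector<GObj*> fields;
};

// Remembered set: old objects that point into the young generation.
std::unordered_set<GObj*> remembered;

// Every pointer store goes through the barrier: if an old object is
// made to point at a young one, record it as an extra root for the
// next young-generation collection.
void write_field(GObj* holder, GObj* value) {
    holder->fields.push_back(value);
    if (!holder->young && value->young) remembered.insert(holder);
}
```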
Concurrent/Incremental GC
GC runs alongside the application to minimize pauses.
Challenges: The application modifies the object graph while the GC is tracing it. Requires barriers:
- Write barrier: Notify GC when a reference is modified.
- Read barrier: Intercept reads to handle relocated objects.
Tri-color invariant (Dijkstra): Objects are white (unvisited), gray (visited but children unprocessed), black (fully processed). The invariant: no black object points to a white object.
Object Layout
C Structs
Fields laid out in declaration order with padding for alignment.
#[repr(C)] // declaration-order layout; without this, Rust may reorder fields
struct Example {
a: u8, // offset 0, 1 byte
// 7 bytes padding for alignment
b: u64, // offset 8, 8 bytes
c: u16, // offset 16, 2 bytes
// 6 bytes padding to next alignment boundary
}
// Total size: 24 bytes (with padding)
// Could be 11 bytes without padding
Field ordering optimization: Reorder fields by size (largest first) to minimize padding. Rust automatically reorders struct fields (unless #[repr(C)]).
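The same layout can be verified at compile time in C++, where declaration order is always preserved; the reordered variant shows the padding win:

```cpp
#include <cstddef>
#include <cstdint>

// Declaration-order layout, as in C (Rust needs #[repr(C)] for this).
struct Example {
    std::uint8_t  a;   // offset 0
    // 7 bytes padding so b is 8-aligned
    std::uint64_t b;   // offset 8
    std::uint16_t c;   // offset 16
    // 6 bytes tail padding rounds the size up to alignof(Example) == 8
};

static_assert(offsetof(Example, b) == 8,  "b after alignment padding");
static_assert(offsetof(Example, c) == 16, "c follows b");
static_assert(sizeof(Example) == 24,      "24 bytes with padding");

// Largest-first ordering: no interior padding at all.
struct Reordered {
    std::uint64_t b;   // offset 0
    std::uint16_t c;   // offset 8
    std::uint8_t  a;   // offset 10
    // 5 bytes tail padding -> 16 bytes total
};
static_assert(sizeof(Reordered) == 16, "smaller after reordering");
```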
Vtables (Virtual Function Tables)
For dynamic dispatch, each object has a pointer to its class's vtable — a table of function pointers.
// C++ / Java-style vtable
Object layout: Vtable for Circle:
┌──────────────┐ ┌─────────────────┐
│ vtable_ptr ──────────→│ Circle::area() │
├──────────────┤ │ Circle::draw() │
│ radius: f64 │ │ Circle::name() │
└──────────────┘ └─────────────────┘
Method call: obj.area() → obj.vtable_ptr[0](obj)
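In C++ the compiler emits the vtable and the hidden pointer automatically; a virtual call through a base reference compiles to exactly the indirect call described above. A minimal example:

```cpp
#include <string>

// One vtable per class with virtual functions; each object carries a
// hidden pointer to its class's table.
struct Shape {
    virtual double area() const = 0;
    virtual std::string name() const = 0;
    virtual ~Shape() = default;
};

struct Circle : Shape {
    double radius;
    explicit Circle(double r) : radius(r) {}
    double area() const override { return 3.141592653589793 * radius * radius; }
    std::string name() const override { return "circle"; }
};

// Dynamic dispatch: the target of s.area() is looked up through the
// object's vtable pointer at runtime, not chosen at compile time.
double measure(const Shape& s) { return s.area(); }
```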
Rust trait objects use fat pointers: (data_ptr, vtable_ptr). The vtable is not stored in the object — only in the reference.
// Rust fat pointer for &dyn Shape:
┌──────────────┐
│ data_ptr ──────→ Circle { radius: 5.0 }
│ vtable_ptr ────→ [area_fn, draw_fn, drop_fn, size, align]
└──────────────┘
Fat Pointers
Pointers that carry extra metadata beyond the address.
Rust fat pointers:
&dyn Trait: (data_ptr, vtable_ptr)
&[T] (slice): (data_ptr, length)
&str: (data_ptr, byte_length)
Coroutine Implementation
Stackful (Green Threads)
Each coroutine has its own stack. Context switch saves/restores registers + stack pointer.
Stack allocation: Small initial stacks (2-8 KB) that grow on demand (segmented stacks or growable stacks with guard pages).
Used by: Go goroutines, Lua coroutines, early Rust (before 1.0).
Stackless (State Machines)
Compiler transforms the async function into a state machine struct. Each await point becomes a state transition.
// Source:
ASYNC FUNCTION EXAMPLE()
a ← AWAIT FETCH_DATA() // suspend point 1
b ← AWAIT PROCESS(a) // suspend point 2
RETURN a + b
// Compiler generates (conceptually):
TYPE ExampleFuture =
State0 // initial state
| State1(a, fetch_future)
| State2(a, process_future)
| Done
FUNCTION POLL(self: ExampleFuture, context) -> Poll
MATCH self
State0: // start fetch, transition to State1
State1: // poll fetch, if ready go to State2
State2: // poll process, if ready return a + b
Done: ERROR "polled after completion"
Size: Each future is the size of its largest state variant. Typically 100-500 bytes (much smaller than a stack).
Used by: Rust async/await, JavaScript async/await, C# async/await, Python asyncio.
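The transformation above can be written out by hand. A sketch in C++ (the sub-futures are stubs that become ready on their second poll; `std::nullopt` plays the role of "pending"):

```cpp
#include <optional>

struct StubFuture {                  // pretend async op: ready on 2nd poll
    int polls = 0;
    int result;
    explicit StubFuture(int r) : result(r) {}
    std::optional<int> poll() {
        return ++polls >= 2 ? std::optional<int>(result) : std::nullopt;
    }
};

// Hand-written state machine for:
//   a = await fetch(); b = await process(a); return a + b
struct ExampleFuture {
    enum { Start, AwaitFetch, AwaitProcess, Done } state = Start;
    int a = 0;                        // live across an await -> stored here
    StubFuture fetch{10}, process{32};

    // Each call resumes at the saved state; awaits become transitions.
    std::optional<int> poll() {
        for (;;) {
            switch (state) {
            case Start:
                state = AwaitFetch;
                break;
            case AwaitFetch:
                if (auto r = fetch.poll()) { a = *r; state = AwaitProcess; break; }
                return std::nullopt;  // pending: yield to caller
            case AwaitProcess:
                if (auto r = process.poll()) { state = Done; return a + *r; }
                return std::nullopt;
            case Done:
                return std::nullopt;  // polled after completion
            }
        }
    }
};
```

Note that `a` lives in the struct, not on a stack: the future's size is exactly the data that must survive across suspension points.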
Applications in CS
- Systems programming: Understanding memory layout, calling conventions, and linking is essential for low-level programming.
- Debugging: Understanding stack frames, unwind info, and dynamic linking helps diagnose crashes (stack traces, core dumps, debugger integration).
- FFI (Foreign Function Interface): Calling C from Rust (or vice versa) requires matching calling conventions, memory layout, and ABI.
- Language implementation: GC design, object layout, and coroutine implementation are core concerns for language runtime developers.
- Performance: Understanding vtable overhead, allocation patterns, and GC pauses helps optimize performance.