Operating System Security

Mandatory Access Control (MAC)

Traditional UNIX uses Discretionary Access Control (DAC): file owners set permissions. MAC enforces a system-wide policy, set by the administrator, that object owners cannot override and that can confine even processes running as root.

SELinux

SELinux (Security-Enhanced Linux), originally developed by the NSA, implements Type Enforcement (TE), Role-Based Access Control (RBAC), and Multi-Level Security (MLS).

Core concepts:

Every process has a security context (user:role:type:level)
Every object (file, socket, port) also carries a security context; its type field is what Type Enforcement rules match on
Policy rules define which types can access which types and how

# Allow the web server to read web content files
allow httpd_t httpd_content_t:file { read open getattr };

# Deny everything not explicitly allowed (deny-by-default)
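
As a rough illustration of how deny-by-default type enforcement resolves a request, the following C sketch models the allow rule above as a table lookup. The te_rule struct, the PERM_* bit encoding, and te_allowed() are hypothetical names invented for this example; real SELinux compiles policy into a binary form and caches decisions in the kernel's access vector cache.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Permission bits for the "file" object class (hypothetical encoding). */
enum {
    PERM_READ    = 1u << 0,
    PERM_OPEN    = 1u << 1,
    PERM_GETATTR = 1u << 2,
    PERM_WRITE   = 1u << 3
};

/* One allow rule: source type, target type, object class, granted permissions. */
struct te_rule {
    const char *source;
    const char *target;
    const char *tclass;
    unsigned perms;
};

/* Toy policy mirroring:
 * allow httpd_t httpd_content_t:file { read open getattr }; */
static const struct te_rule policy[] = {
    { "httpd_t", "httpd_content_t", "file", PERM_READ | PERM_OPEN | PERM_GETATTR },
};

/* Returns true only if the matching rules together grant every requested
 * permission bit. With no matching rule nothing is granted, so the request
 * is denied (deny-by-default). */
static bool te_allowed(const char *source, const char *target,
                       const char *tclass, unsigned requested)
{
    unsigned granted = 0;
    for (size_t i = 0; i < sizeof(policy) / sizeof(policy[0]); i++) {
        if (strcmp(policy[i].source, source) == 0 &&
            strcmp(policy[i].target, target) == 0 &&
            strcmp(policy[i].tclass, tclass) == 0)
            granted |= policy[i].perms;
    }
    return (requested & ~granted) == 0;
}
```

In this model, te_allowed("httpd_t", "httpd_content_t", "file", PERM_READ) succeeds, while a write to the same type, or a read of a type with no rule at all, is refused without any explicit deny rule.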

Modes:

enforcing -- policy violations are blocked and logged
permissive -- violations are logged but not blocked (useful for policy development)
disabled -- SELinux is off

Booleans: Runtime-tunable policy switches (e.g., httpd_can_network_connect) that avoid full policy recompilation.

AppArmor

AppArmor uses path-based profiles instead of labels. It is simpler to configure than SELinux but less granular.

# AppArmor profile for nginx
/usr/sbin/nginx {
  /etc/nginx/** r,
  /var/www/** r,
  /var/log/nginx/** rw,
  /run/nginx.pid rw,
  network inet tcp,
  deny /etc/shadow r,
}
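
To make the path-based matching concrete, here is a minimal C sketch of the two wildcards used in the profile above: `*` stays within one path component, while `**` also crosses `/` separators. The function name aa_glob_match() is made up for this example, and the model is deliberately simplified: the real AppArmor parser supports more syntax (`?`, character classes, alternation) and has special cases for directory paths that are not reproduced here.

```c
#include <stdbool.h>

/* Match `path` against a simplified AppArmor-style glob.
 *   '*'  matches any run of characters except '/'
 *   '**' matches any run of characters including '/'
 * Every other pattern character must match literally. */
static bool aa_glob_match(const char *pat, const char *path)
{
    if (*pat == '\0')
        return *path == '\0';

    if (pat[0] == '*') {
        bool deep = (pat[1] == '*');
        const char *rest = pat + (deep ? 2 : 1);

        /* Try every possible length for the wildcard, shortest first. */
        for (const char *p = path;; p++) {
            if (aa_glob_match(rest, p))
                return true;
            /* Stop at end of path, or at '/' for the single-star form. */
            if (*p == '\0' || (!deep && *p == '/'))
                return false;
        }
    }

    return *pat == *path && aa_glob_match(pat + 1, path + 1);
}
```

Under these assumptions, /var/www/** covers /var/www/site/index.html, whereas /var/www/* would only cover files directly inside /var/www.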

AppArmor vs. SELinux:

Aspect	SELinux	AppArmor
Model	Type enforcement (labels)	Path-based profiles
Granularity	Very fine (object labels)	Medium (path-based)
Complexity	High (steep learning curve)	Lower
Default on	RHEL, Fedora, CentOS	Ubuntu, Debian, SUSE (historically)
File rename	Label follows the inode	Rules follow the path; a renamed or moved file may no longer match

Linux Capabilities

Traditional UNIX: root (UID 0) has all privileges. Linux capabilities split root's power into ~40 distinct capabilities.

Capability	Grants
`CAP_NET_BIND_SERVICE`	Bind to ports below 1024
`CAP_NET_RAW`	Use raw sockets (ping, packet capture)
`CAP_SYS_ADMIN`	Catch-all: mount, namespace ops, etc.
`CAP_DAC_OVERRIDE`	Bypass file permission checks
`CAP_SYS_PTRACE`	Trace/debug any process
`CAP_NET_ADMIN`	Network configuration

Capability sets per thread:

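Effective: capabilities the kernel actually checks when the thread attempts a privileged operation
Permitted: upper bound on the effective set
Inheritable: kept across exec, but only added to the new permitted set if the executed file also lists them as inheritable
Bounding: ceiling on what exec can grant from file capabilities (capabilities can be dropped from it, never added)
Ambient: preserved across exec of a file that is not setuid/setgid and has no file capabilities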
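
How these sets interact at exec time can be sketched in C. The transformation below follows the rules documented in capabilities(7); the struct and function names are invented for this example, and the model leaves out the special handling of UID 0 and the EPERM check that applies when a file with the effective bit set cannot obtain all of its permitted capabilities.

```c
#include <stdbool.h>
#include <stdint.h>

/* Capability numbers as defined in <linux/capability.h>. */
#define CAP_NET_BIND_SERVICE 10
#define CAP_NET_RAW          13
#define CAP_BIT(n) (UINT64_C(1) << (n))

/* Per-thread capability sets, one bit per capability. */
struct thread_caps {
    uint64_t permitted, inheritable, effective, bounding, ambient;
};

/* Capabilities attached to an executable file (e.g. via setcap). */
struct file_caps {
    uint64_t permitted, inheritable;
    bool effective;   /* the single file "effective" flag */
    bool privileged;  /* file has capabilities or is setuid/setgid */
};

/* Compute the thread's capability sets after execve() of file f. */
static struct thread_caps caps_after_exec(struct thread_caps p, struct file_caps f)
{
    struct thread_caps n;
    n.ambient     = f.privileged ? 0 : p.ambient;
    n.permitted   = (p.inheritable & f.inheritable) |
                    (f.permitted & p.bounding) |
                    n.ambient;
    n.effective   = f.effective ? n.permitted : n.ambient;
    n.inheritable = p.inheritable;
    n.bounding    = p.bounding;
    return n;
}
```

For a file marked cap_net_bind_service=+ep, an otherwise unprivileged thread with a full bounding set ends up with exactly that one capability in its permitted and effective sets.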

# Run a web server binding port 80 without root
setcap cap_net_bind_service=+ep /usr/bin/myserver

Containers heavily use capability dropping: Docker containers start with a restricted set (~14 of ~40 capabilities).

seccomp-bpf

seccomp-bpf restricts which system calls a process can make, using BPF programs to filter syscalls.

Actions:

SECCOMP_RET_ALLOW -- permit the syscall
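SECCOMP_RET_KILL_PROCESS -- terminate the whole process immediately (SECCOMP_RET_KILL_THREAD kills only the calling thread)
SECCOMP_RET_ERRNO -- skip the syscall and return the errno value chosen by the filter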
SECCOMP_RET_TRACE -- notify a ptrace tracer
SECCOMP_RET_LOG -- allow but log
SECCOMP_RET_USER_NOTIF -- delegate decision to a supervisor process

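// Install a seccomp filter (simplified; a real filter should also check seccomp_data.arch)
// Needs <stddef.h>, <sys/prctl.h>, <sys/syscall.h>, <linux/filter.h>, <linux/seccomp.h>
struct sock_filter filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),  // load syscall number
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_execve, 0, 1),  // execve: fall through, else skip one
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),  // block execve
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),          // allow others
};
struct sock_fprog prog = { .len = sizeof(filter) / sizeof(filter[0]), .filter = filter };
// Unprivileged callers must set no_new_privs first (otherwise CAP_SYS_ADMIN is required)
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);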
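
The jump offsets in such a filter are easy to get wrong, so it can help to trace the evaluation by hand. The sketch below is a tiny user-space model of the filter above, not the kernel's BPF interpreter: the insn struct, the opcode names, and run_filter() are invented for this example, and the syscall number 59 is execve on x86-64 only.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_EXECVE_X86_64 59  /* assumption: x86-64 syscall numbering */

enum op { OP_LOAD_NR, OP_JEQ, OP_RET };
enum verdict { VERDICT_ALLOW, VERDICT_KILL };

/* Simplified instruction: jt/jf are offsets relative to the NEXT instruction,
 * as in classic BPF. */
struct insn {
    enum op op;
    uint32_t k;
    uint8_t jt, jf;
};

/* Same shape as the four-instruction seccomp filter. */
static const struct insn block_execve[] = {
    { OP_LOAD_NR, 0, 0, 0 },                 /* acc = syscall number */
    { OP_JEQ, NR_EXECVE_X86_64, 0, 1 },      /* equal: next insn, else skip one */
    { OP_RET, VERDICT_KILL, 0, 0 },
    { OP_RET, VERDICT_ALLOW, 0, 0 },
};

/* Evaluate the program for one syscall number. */
static enum verdict run_filter(const struct insn *prog, size_t len, uint32_t syscall_nr)
{
    uint32_t acc = 0;
    for (size_t pc = 0; pc < len; pc++) {
        const struct insn *i = &prog[pc];
        switch (i->op) {
        case OP_LOAD_NR:
            acc = syscall_nr;
            break;
        case OP_JEQ:
            /* The loop's pc++ supplies the implicit "+1". */
            pc += (acc == i->k) ? i->jt : i->jf;
            break;
        case OP_RET:
            return (enum verdict)i->k;
        }
    }
    return VERDICT_KILL;  /* fail closed if the program runs off the end */
}
```

Calling run_filter(block_execve, 4, NR_EXECVE_X86_64) reaches the KILL return, while any other syscall number skips over it to ALLOW.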

Deployment: Chrome, Firefox, Docker, systemd, Flatpak, and Android all use seccomp-bpf to sandbox processes.

Sandboxing

Landlock (Linux)

Landlock is an unprivileged, stackable security module for file system access control. Unlike SELinux/AppArmor, any process can restrict itself without root or special configuration; it only needs to set no_new_privs first.

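// Needs _GNU_SOURCE, <linux/landlock.h>, <sys/syscall.h>, <sys/prctl.h>, <fcntl.h>, <unistd.h>; glibc has no wrappers, hence syscall()
struct landlock_ruleset_attr attr = { .handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE };
int ruleset_fd = syscall(SYS_landlock_create_ruleset, &attr, sizeof(attr), 0);
// Add rules for allowed paths (example: allow reading files beneath /var/www)
struct landlock_path_beneath_attr rule = {
    .allowed_access = LANDLOCK_ACCESS_FS_READ_FILE,
    .parent_fd = open("/var/www", O_PATH | O_CLOEXEC),
};
syscall(SYS_landlock_add_rule, ruleset_fd, LANDLOCK_RULE_PATH_BENEATH, &rule, 0);
// Enforce (no_new_privs is required unless the caller has CAP_SYS_ADMIN)
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
syscall(SYS_landlock_restrict_self, ruleset_fd, 0);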

Pledge and Unveil (OpenBSD)

OpenBSD's approach to sandboxing: simple, hard to misuse.

pledge: Restrict a process to a set of operations (e.g., pledge("stdio rpath", NULL) allows only standard I/O and reading files)
unveil: Restrict file system visibility (e.g., unveil("/var/www", "r") makes only /var/www visible, read-only)

// OpenBSD: restrict a web server
unveil("/var/www", "r");
unveil("/run/httpd.sock", "rw");
unveil(NULL, NULL);  // lock unveil
pledge("stdio rpath unix", NULL);

This model has influenced Linux's Landlock design.

Kernel Address Space Layout Randomization (KASLR)

KASLR randomizes the kernel's base virtual address at each boot, making exploits that rely on known kernel addresses harder to write.

Text KASLR: Randomizes the kernel code base address
Physical KASLR: Randomizes the kernel's physical memory location
FG-KASLR (fine-grained): Randomizes individual functions (research/experimental)

KASLR is defeated by information leaks (e.g., /proc/kallsyms, dmesg). Mitigations:

kernel.kptr_restrict = 1 -- hide kernel pointers printed with %pK from users without CAP_SYSLOG (2 hides them from everyone)
kernel.dmesg_restrict = 1 -- restrict dmesg to users with CAP_SYSLOG

Kernel Hardening

KASAN (Kernel Address Sanitizer)

KASAN detects memory errors in the kernel at runtime:

Out-of-bounds access (heap, stack, and globals)
Use-after-free
Relies on shadow memory (1 byte of shadow per 8 bytes of kernel memory)
Significant performance overhead (roughly 2-3x in the generic mode) -- the generic mode is meant for development and CI, not production
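
The shadow-memory idea can be modelled in a few lines of C. This sketch follows the encoding used by generic KASAN (0 means the whole 8-byte granule is accessible, 1 to 7 means only that many leading bytes are, and any negative value means the granule is inaccessible), but it works on a small fake arena addressed by offset. The names shadow_poison(), shadow_unpoison(), access_ok(), and SHADOW_FREED are invented for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE      8   /* bytes of memory described by one shadow byte */
#define ARENA_SIZE   64  /* toy arena; offsets passed in must stay below this */
#define SHADOW_FREED ((int8_t)-1)  /* hypothetical marker: any negative value means inaccessible */

static int8_t shadow[ARENA_SIZE / GRANULE];

/* Mark every granule overlapping [off, off + size) as inaccessible. */
static void shadow_poison(size_t off, size_t size)
{
    for (size_t g = off / GRANULE; g < (off + size + GRANULE - 1) / GRANULE; g++)
        shadow[g] = SHADOW_FREED;
}

/* Mark `size` bytes starting at the granule-aligned offset `off` as accessible. */
static void shadow_unpoison(size_t off, size_t size)
{
    size_t g = off / GRANULE;
    for (; size >= GRANULE; size -= GRANULE)
        shadow[g++] = 0;           /* whole granule valid */
    if (size > 0)
        shadow[g] = (int8_t)size;  /* only the first `size` bytes valid */
}

/* Check a one-byte access at offset `off`. */
static bool access_ok(size_t off)
{
    int8_t s = shadow[off / GRANULE];
    if (s == 0)
        return true;
    /* Partially valid or poisoned granule: compare the byte position
     * within the granule against the number of valid bytes. */
    return (int8_t)(off % GRANULE) < s;
}
```

With the arena poisoned and a 13-byte object unpoisoned at offset 16, offsets 16 through 28 pass the check, offset 29 is flagged as out of bounds, and poisoning the object again turns any later access into a use-after-free report.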

KCSAN (Kernel Concurrency Sanitizer)

KCSAN detects data races in kernel code:

Instruments memory accesses and checks for concurrent conflicting access
Reports races with full stack traces
Based on a sampling approach (low but non-zero overhead)

Other Hardening Features

Feature	Protection
Stack protector	Canary values detect stack buffer overflows
FORTIFY_SOURCE	Compile-time and runtime bounds checking
STACKLEAK	Erase the used kernel stack before returning to user space
Hardened usercopy	Validate user/kernel copy boundaries
init_on_alloc/free	Zero-initialize heap memory (info leak prevention)
CFI (Control Flow Integrity)	Indirect call targets validated at runtime
W^X enforcement	Memory is writable or executable, never both

Secure Boot and TPM

Secure Boot Chain

1. CPU reset -> ROM bootloader (immutable)
2. ROM verifies -> UEFI firmware (signed)
3. UEFI verifies -> bootloader/shim (signed by Microsoft/vendor)
4. Bootloader verifies -> kernel (signed)
5. Kernel verifies -> modules (signed)
6. Kernel enforces -> only signed modules load (CONFIG_MODULE_SIG_FORCE)

TPM (Trusted Platform Module)

A hardware security chip providing:

Platform Configuration Registers (PCRs): extend-only hash registers measuring each boot stage
Sealed storage: Encrypt data so it can only be decrypted if PCR values match (system is in known-good state)
Remote attestation: Prove to a remote verifier that the system booted securely
Key generation and storage: Hardware-protected keys that never leave the TPM

Measured boot vs. secure boot: Secure boot blocks unauthorized code. Measured boot records what ran (attestation) but does not block.
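
The extend operation and the sealing check can be sketched as follows. A real TPM computes PCR_new = H(PCR_old || digest) with SHA-256 (or another bank algorithm); to stay self-contained, this example swaps in 64-bit FNV-1a, which is NOT cryptographically secure and serves only as a stand-in. All function names and the stage names used in the test are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in hash: 64-bit FNV-1a. A real TPM would use SHA-256 here. */
static uint64_t fnv1a(const unsigned char *data, size_t len)
{
    uint64_t h = UINT64_C(0xcbf29ce484222325);
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= UINT64_C(0x100000001b3);
    }
    return h;
}

/* Extend: new PCR = H(old PCR || H(component)). The register can only be
 * updated this way, never set directly. */
static uint64_t pcr_extend(uint64_t pcr, const char *component)
{
    uint64_t digest = fnv1a((const unsigned char *)component, strlen(component));
    unsigned char buf[16];
    memcpy(buf, &pcr, 8);
    memcpy(buf + 8, &digest, 8);
    return fnv1a(buf, sizeof(buf));
}

/* Measured boot: fold every stage into the PCR, in boot order. */
static uint64_t measure_boot(const char *const *stages, size_t n)
{
    uint64_t pcr = 0;  /* PCRs start from a fixed value at reset */
    for (size_t i = 0; i < n; i++)
        pcr = pcr_extend(pcr, stages[i]);
    return pcr;
}

/* Sealed storage: release the secret only if the PCR matches the value
 * recorded at sealing time. */
static bool unseal_allowed(uint64_t sealed_pcr, uint64_t current_pcr)
{
    return sealed_pcr == current_pcr;
}
```

Because each value depends on everything extended before it, swapping or modifying any stage changes the final PCR. Measurement alone never stops the boot; it only causes unsealing or attestation to fail afterwards.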

Trusted Execution Environments (TEEs)

TEEs provide hardware-isolated enclaves where code and data are protected even from a compromised OS.

Intel SGX (Software Guard Extensions)

Creates encrypted memory enclaves (EPC -- Enclave Page Cache)
The CPU decrypts enclave memory only when executing enclave code
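By design, the OS, hypervisor, and physical memory probing cannot read enclave memory, although side-channel attacks such as Foreshadow have broken this in practice
Limited EPC size on older and client parts (128-512 MB), larger on recent Xeon Scalable servers; significant enclave entry/exit overhead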
Attestation: prove to a remote party that an enclave runs specific code
Deprecated on consumer CPUs (12th gen+), continues on server (Xeon)

ARM TrustZone

Partitions the system into Normal World and Secure World
Hardware-enforced isolation: separate address spaces, interrupts, peripherals
Secure World runs a trusted OS (e.g., OP-TEE) with trusted applications
Used for: secure key storage, DRM, biometric authentication, secure boot

AMD SEV (Secure Encrypted Virtualization)

Encrypts VM memory with per-VM keys managed by a dedicated security processor
SEV-ES: additionally encrypts CPU register state on VM exit
SEV-SNP: adds integrity protection (prevents host from replaying or remapping pages)
Use case: confidential cloud computing -- cloud provider cannot inspect VM memory

Comparison

TEE	Granularity	Threat Model	Attestation
SGX	Per-enclave	Untrusted OS + hypervisor	Remote
TrustZone	Two worlds	Normal world compromised	Platform
SEV-SNP	Per-VM	Untrusted hypervisor/host	Remote

Spectre and Meltdown Mitigations

These transient execution attacks exploit speculative execution to leak data across security boundaries.

Meltdown (CVE-2017-5754)

Attack: User process reads kernel memory via speculative execution before the permission check retires
Mitigation: KPTI (Kernel Page Table Isolation) -- separate page tables for user and kernel mode. Kernel pages are unmapped in user-space page tables.
Performance cost: roughly 1-5% for most workloads (higher for syscall-heavy workloads, lower on CPUs with PCID support)

Spectre Variant 1 (Bounds Check Bypass)

Attack: Speculative array access past bounds, leaking data via cache side channel
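Mitigation: array_index_nospec() clamps the index to the array bounds with branchless masking, so even a mispredicted bounds check cannot read out of range; a speculation barrier (lfence) after the check is the heavier alternative
Compiler support: Speculative Load Hardening (SLH) in LLVM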
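
The masking trick behind array_index_nospec() can be shown in user-space C. The sketch below mirrors the arithmetic of the kernel's generic array_index_mask_nospec(), under two assumptions: the compiler converts out-of-range values to long by wrapping and implements right shift of negative values as an arithmetic shift (both true for GCC and Clang), and size does not exceed LONG_MAX. The kernel additionally hides the index from the optimizer, which this sketch does not attempt, so treat it as an illustration of the arithmetic rather than a hardened primitive. The function names here are invented.

```c
#include <limits.h>
#include <stddef.h>

/* All-ones if index < size, zero otherwise, computed without a branch.
 * When index < size, neither operand of the OR has its top bit set, so the
 * complement is negative and the arithmetic shift smears the sign bit. */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (sizeof(long) * CHAR_BIT - 1));
}

/* Clamp: in-range indices pass through, out-of-range indices become 0. */
static unsigned long index_nospec(unsigned long index, unsigned long size)
{
    return index & index_mask_nospec(index, size);
}

/* Typical usage: keep the architectural bounds check, then sanitize the
 * index so a mispredicted branch cannot load from beyond the table. */
static int table_read(const int *table, unsigned long size, unsigned long index)
{
    if (index >= size)
        return -1;
    index = index_nospec(index, size);
    return table[index];
}
```

The key property is that the clamp is a data dependency rather than a control dependency: even if the CPU speculates past the bounds check, the load address is computed from the masked index.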

Spectre Variant 2 (Branch Target Injection)

Attack: Poison indirect branch predictor to redirect speculative execution
Mitigations:
- Retpolines: replace indirect calls with a return-based trampoline that traps speculation in a harmless loop instead of following the poisoned predictor
- IBRS/IBPB: hardware controls for indirect branch prediction
- eIBRS (enhanced IBRS): hardware mitigation on newer CPUs

Other Variants

Attack	Target	Mitigation
Spectre-RSB	Return Stack Buffer	RSB stuffing on context switch
MDS (RIDL, Fallout)	CPU internal buffers	Buffer clearing via VERW, disabling SMT (Hyper-Threading)
L1TF	L1 cache + page tables	PTE inversion, L1D flush on VM entry
Spectre-BHB	Branch History Buffer	BHB clearing sequences

Performance Impact Summary

Mitigations collectively cost roughly 5-30% depending on workload and CPU generation (syscall-heavy and VM-heavy workloads are most affected). Hardware fixes in newer CPU generations reduce the software mitigation overhead.

Key Takeaways

MAC (SELinux, AppArmor) enforces system-wide policies beyond DAC; SELinux is more granular, AppArmor is simpler
Linux capabilities decompose root privilege; containers drop most capabilities by default
seccomp-bpf restricts syscalls at the per-process level and is ubiquitous in sandboxing
Landlock enables unprivileged sandboxing; pledge/unveil (OpenBSD) inspired its design philosophy
Secure boot + TPM establishes a hardware root of trust; measured boot enables remote attestation
TEEs (SGX, TrustZone, SEV-SNP) protect code and data from the OS itself -- essential for confidential computing
Spectre/Meltdown mitigations (KPTI, retpolines, IBRS) are a permanent tax on performance, gradually reduced by hardware fixes