Operating System Security

Mandatory Access Control (MAC)

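Traditional UNIX uses Discretionary Access Control (DAC): file owners set permissions on their own files, and root can override them. MAC enforces a system-wide policy, set by the administrator, that ordinary users cannot change and that also confines processes running as root.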

SELinux

SELinux (Security-Enhanced Linux), originally developed by the NSA, implements Type Enforcement (TE), role-based access control (RBAC), and optional Multi-Level Security (MLS).

Core concepts:

  • Every process has a security context (user:role:type:level)
  • Every object (file, socket, port) has a type label
  • Policy rules define which types can access which types and how

# Allow the web server to read web content files
allow httpd_t httpd_content_t:file { read open getattr };

# Deny everything not explicitly allowed (deny-by-default)
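
The deny-by-default lookup can be sketched as a toy rule table in C. This is only a model of the idea, not how the kernel stores or evaluates policy: the struct `te_rule`, the table `toy_policy`, and the function `te_allowed` are hypothetical names invented for this illustration, and each entry holds a single permission for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One Type Enforcement rule: allow <source> <target>:<class> <permission>.
 * Hypothetical in-memory form; real policy is compiled to a binary format. */
struct te_rule {
    const char *source_type;
    const char *target_type;
    const char *object_class;
    const char *permission;
};

/* Toy policy mirroring the httpd rule in the text, one permission per entry. */
const struct te_rule toy_policy[] = {
    { "httpd_t", "httpd_content_t", "file", "read" },
    { "httpd_t", "httpd_content_t", "file", "open" },
    { "httpd_t", "httpd_content_t", "file", "getattr" },
};

/* Deny by default: access is granted only if some allow rule matches. */
bool te_allowed(const char *source_type, const char *target_type,
                const char *object_class, const char *permission)
{
    assert(source_type && target_type && object_class && permission);

    size_t count = sizeof(toy_policy) / sizeof(toy_policy[0]);
    for (size_t i = 0; i < count; i++) {
        const struct te_rule *rule = &toy_policy[i];
        if (strcmp(rule->source_type, source_type) == 0 &&
            strcmp(rule->target_type, target_type) == 0 &&
            strcmp(rule->object_class, object_class) == 0 &&
            strcmp(rule->permission, permission) == 0)
            return true;
    }
    return false; /* no matching allow rule, so the access is denied */
}
```

With this table, `te_allowed("httpd_t", "httpd_content_t", "file", "read")` is true, while a write to the same file or a read of a file with any other type label is denied without needing an explicit deny rule.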

Modes:

  • enforcing -- policy violations are blocked and logged
  • permissive -- violations are logged but not blocked (useful for policy development)
  • disabled -- SELinux is off

Booleans: Runtime-tunable policy switches (e.g., httpd_can_network_connect) that avoid full policy recompilation.

AppArmor

AppArmor confines programs with path-based profiles instead of labels. It is simpler to configure than SELinux but less granular.

# AppArmor profile for nginx
/usr/sbin/nginx {
  /etc/nginx/** r,
  /var/www/** r,
  /var/log/nginx/** rw,
  /run/nginx.pid rw,
  network inet tcp,
  deny /etc/shadow r,
}

AppArmor vs. SELinux:

| Aspect         | SELinux                     | AppArmor                  |
|----------------|-----------------------------|---------------------------|
| Model          | Type enforcement (labels)   | Path-based profiles       |
| Granularity    | Very fine (object labels)   | Medium (path-based)       |
| Complexity     | High (steep learning curve) | Lower                     |
| Default distro | RHEL, Fedora, CentOS        | Ubuntu, SUSE              |
| File rename    | Label follows the inode     | Path changes break policy |

Linux Capabilities

In traditional UNIX, root (UID 0) has all privileges. Linux capabilities split root's power into about 40 distinct privileges that can be granted or dropped independently.

| Capability           | Grants                                 |
|----------------------|----------------------------------------|
| CAP_NET_BIND_SERVICE | Bind to ports below 1024               |
| CAP_NET_RAW          | Use raw sockets (ping, packet capture) |
| CAP_SYS_ADMIN        | Catch-all: mount, namespace ops, etc.  |
| CAP_DAC_OVERRIDE     | Bypass file permission checks          |
| CAP_SYS_PTRACE       | Trace/debug any process                |
| CAP_NET_ADMIN        | Network configuration                  |

Capability sets per thread:

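  • Effective: capabilities the kernel actually checks when the thread attempts a privileged operation
  • Permitted: upper bound on the effective set; a thread can raise into effective only what it holds here
  • Inheritable: preserved across execve, but granted to the new program only if the file's inheritable set also names them
  • Bounding: ceiling on what execve can add to the permitted set from file capabilities; entries can be dropped but never added back
  • Ambient: capabilities preserved across execve of a program that is neither setuid/setgid nor carries file capabilities

# Run a web server binding port 80 without root
# (setcap itself needs CAP_SETFCAP, so run it as root)
setcap cap_net_bind_service=+ep /usr/bin/myserver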
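
How the five sets combine at execve can be sketched with the transformation rules documented in capabilities(7). The code below is a simplified model under stated assumptions: it ignores the special handling of UID 0, the struct and function names (`thread_caps`, `file_caps`, `caps_after_exec`) and the `TOY_` constants are invented for this sketch, and only the bit positions are taken from the kernel's capability numbering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions follow the kernel's capability numbering. */
#define CAP_BIT(n)               (UINT64_C(1) << (n))
#define TOY_CAP_DAC_OVERRIDE     CAP_BIT(1)
#define TOY_CAP_NET_BIND_SERVICE CAP_BIT(10)
#define TOY_CAP_NET_RAW          CAP_BIT(13)

struct thread_caps {
    uint64_t permitted, effective, inheritable, bounding, ambient;
};

struct file_caps {
    uint64_t permitted, inheritable;
    bool effective;   /* the single "effective" flag stored on the file */
    bool privileged;  /* file is setuid/setgid or carries file capabilities */
};

/* Capability sets of the thread after execve(), per capabilities(7),
 * ignoring the extra rules that apply when the process runs as UID 0. */
struct thread_caps caps_after_exec(struct thread_caps p, struct file_caps f)
{
    /* Kernel invariant: ambient is a subset of permitted and inheritable. */
    assert((p.ambient & ~(p.permitted & p.inheritable)) == 0);

    struct thread_caps n;
    n.ambient     = f.privileged ? 0 : p.ambient;
    n.permitted   = (p.inheritable & f.inheritable)
                  | (f.permitted & p.bounding)
                  | n.ambient;
    n.effective   = f.effective ? n.permitted : n.ambient;
    n.inheritable = p.inheritable;
    n.bounding    = p.bounding;
    return n;
}
```

For the `setcap ...=+ep` example above, the file has `permitted = CAP_NET_BIND_SERVICE` and the effective flag set, so an unprivileged thread with a full bounding set ends up with exactly that one capability in its permitted and effective sets. If the capability has been dropped from the bounding set, the same exec yields nothing.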

Containers heavily use capability dropping: Docker containers start with a restricted set (~14 of ~40 capabilities).

seccomp-bpf

seccomp-bpf restricts which system calls a process can make, using BPF programs to filter syscalls.

Actions:

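  • SECCOMP_RET_ALLOW -- permit the syscall
  • SECCOMP_RET_KILL_PROCESS -- terminate the whole process immediately
  • SECCOMP_RET_ERRNO -- skip the syscall and return the given errno to the caller
  • SECCOMP_RET_TRACE -- notify a ptrace tracer
  • SECCOMP_RET_LOG -- allow but log
  • SECCOMP_RET_USER_NOTIF -- delegate decision to a supervisor process

// Install a seccomp filter (simplified: a real filter must first check seccomp_data.arch)
// Needs <stddef.h>, <sys/prctl.h>, <sys/syscall.h>, <linux/filter.h>, <linux/seccomp.h>
struct sock_filter filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_execve, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),  // block execve
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),          // allow others
};
struct sock_fprog prog = {
    .len = sizeof(filter) / sizeof(filter[0]),
    .filter = filter,
};
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);  // required unless the caller has CAP_SYS_ADMIN
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);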
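
The two numbers at the end of `BPF_JUMP` are relative skip counts (taken when the comparison is true and false), which is easy to misread. A toy evaluator makes the control flow of the filter above explicit. Everything here is a hypothetical model: the `toy_` opcodes are not the kernel's BPF encoding, and `TOY_NR_EXECVE` assumes the x86-64 syscall number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy opcodes: NOT the kernel's encoding, just the three kinds used above. */
enum toy_op { TOY_LOAD_NR, TOY_JEQ, TOY_RET };
enum toy_verdict { TOY_ALLOW, TOY_KILL_PROCESS };

struct toy_insn {
    enum toy_op op;
    uint8_t jt;  /* instructions to skip when the comparison is true */
    uint8_t jf;  /* instructions to skip when the comparison is false */
    uint32_t k;  /* constant: syscall number for JEQ, verdict for RET */
};

#define TOY_NR_EXECVE 59u /* execve on x86-64; other architectures differ */

/* Same shape as the four-instruction filter in the text. */
const struct toy_insn toy_filter[] = {
    { TOY_LOAD_NR, 0, 0, 0 },
    { TOY_JEQ,     0, 1, TOY_NR_EXECVE },
    { TOY_RET,     0, 0, TOY_KILL_PROCESS },
    { TOY_RET,     0, 0, TOY_ALLOW },
};

/* Run the toy program against one syscall number and return its verdict. */
enum toy_verdict toy_run_filter(const struct toy_insn *prog, size_t len,
                                uint32_t syscall_nr)
{
    assert(prog != NULL);

    uint32_t acc = 0; /* the accumulator register */
    for (size_t pc = 0; pc < len; pc++) {
        const struct toy_insn *insn = &prog[pc];
        switch (insn->op) {
        case TOY_LOAD_NR:
            acc = syscall_nr;
            break;
        case TOY_JEQ:
            /* Skip jt or jf instructions; the loop's pc++ then moves on. */
            pc += (acc == insn->k) ? insn->jt : insn->jf;
            break;
        case TOY_RET:
            return (enum toy_verdict)insn->k;
        }
    }
    return TOY_KILL_PROCESS; /* fail closed if the program has no RET */
}
```

For `execve` the comparison is true, `jt = 0` falls through to the kill verdict; for any other syscall `jf = 1` skips that instruction and reaches the allow verdict.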

Deployment: Chrome, Firefox, Docker, systemd, Flatpak, and Android all use seccomp-bpf to sandbox processes.

Sandboxing

Landlock (Linux)

Landlock is an unprivileged, stackable security module for file system access control. Unlike SELinux/AppArmor, any process can restrict itself without root privileges or an administrator-written policy, provided the kernel has Landlock enabled.

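// glibc may not provide wrappers; each landlock_*() call stands for syscall(SYS_landlock_*, ...)
struct landlock_ruleset_attr attr = { .handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE };
int ruleset_fd = landlock_create_ruleset(&attr, sizeof(attr), 0);
// Add rules for allowed paths (/var/www is an example path)
struct landlock_path_beneath_attr rule = {
    .allowed_access = LANDLOCK_ACCESS_FS_READ_FILE,
    .parent_fd = open("/var/www", O_PATH | O_CLOEXEC),
};
landlock_add_rule(ruleset_fd, LANDLOCK_RULE_PATH_BENEATH, &rule, 0);
// Enforce (requires no_new_privs unless the caller has CAP_SYS_ADMIN)
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
landlock_restrict_self(ruleset_fd, 0);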

Pledge and Unveil (OpenBSD)

OpenBSD's approach to sandboxing: simple, hard to misuse.

  • pledge: Restrict a process to a set of operations (e.g., pledge("stdio rpath", NULL) allows only standard I/O and reading files)
  • unveil: Restrict file system visibility (e.g., unveil("/var/www", "r") makes only /var/www visible, read-only)

// OpenBSD: restrict a web server
unveil("/var/www", "r");
unveil("/run/httpd.sock", "rw");
unveil(NULL, NULL);  // lock unveil
pledge("stdio rpath unix", NULL);

This model has influenced Linux's Landlock design.

Kernel Address Space Layout Randomization (KASLR)

KASLR randomizes the kernel's base virtual address at each boot, so exploits cannot rely on hard-coded kernel addresses.

  • Text KASLR: Randomizes the kernel code base address
  • Physical KASLR: Randomizes the kernel's physical memory location
  • FG-KASLR (fine-grained): Randomizes individual functions (research/experimental)

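KASLR is defeated by information leaks: a single leaked kernel pointer (e.g., from /proc/kallsyms or dmesg) reveals the randomized base. Mitigations:

  • kernel.kptr_restrict = 1 -- hide kernel pointers (printed with %pK) from users without CAP_SYSLOG
  • kernel.dmesg_restrict = 1 -- restrict dmesg to privileged users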
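
To see why one leaked pointer is enough, note that text KASLR shifts the whole kernel image by a single offset, so the distance between any two symbols stays the same as in the unrandomized image. The arithmetic can be sketched as follows; the function names are invented for this sketch, and any addresses passed in are whatever the attacker has obtained, not values from a real system.

```c
#include <assert.h>
#include <stdint.h>

/* Slide = where a symbol was observed at runtime minus its link-time address
 * (the address it has in an unrandomized image, e.g. from System.map). */
uint64_t kaslr_slide(uint64_t leaked_runtime_addr, uint64_t link_time_addr)
{
    assert(leaked_runtime_addr >= link_time_addr);
    return leaked_runtime_addr - link_time_addr;
}

/* With the slide known, every other symbol's runtime address follows. */
uint64_t kaslr_relocate(uint64_t link_time_addr, uint64_t slide)
{
    return link_time_addr + slide;
}
```

For example, if a symbol linked at `0xffffffff81000000` is observed at `0xffffffff9a200000`, the slide is `0x19200000`, and a function linked at `0xffffffff810a1230` must be at `0xffffffff9a2a1230`. This is why hiding pointers from unprivileged users matters as much as the randomization itself.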

Kernel Hardening

KASAN (Kernel Address Sanitizer)

KASAN detects memory errors in the kernel at runtime:

  • Out-of-bounds access (heap and stack)
  • Use-after-free
  • Relies on shadow memory (1 byte of shadow per 8 bytes of kernel memory)
  • Significant performance overhead (~2-3x) -- development and CI use only
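
The 1:8 shadow mapping can be sketched in a few lines. This models the generic (software) KASAN scheme as described in the kernel documentation: one shadow byte describes an 8-byte granule, 0 means fully addressable, 1 to 7 means only the first N bytes are addressable, and a negative value means the granule is poisoned. The function names are invented for this sketch, and the shadow offset is left as a parameter because the real value is an architecture-specific constant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SHADOW_SCALE_SHIFT 3                         /* 8 bytes per shadow byte */
#define SHADOW_GRANULE     (1u << SHADOW_SCALE_SHIFT)

/* Address of the shadow byte that describes the granule containing addr. */
uint64_t shadow_address(uint64_t addr, uint64_t shadow_offset)
{
    return (addr >> SHADOW_SCALE_SHIFT) + shadow_offset;
}

/* Is a 1-byte access at addr valid, given that granule's shadow byte? */
bool one_byte_access_ok(int8_t shadow_value, uint64_t addr)
{
    assert(shadow_value < (int8_t)SHADOW_GRANULE);   /* valid encodings only */

    if (shadow_value == 0)
        return true;                 /* all 8 bytes addressable */
    if (shadow_value < 0)
        return false;                /* redzone, freed memory, etc. */
    /* Partially addressable granule: only the first shadow_value bytes. */
    return (addr & (SHADOW_GRANULE - 1)) < (uint64_t)shadow_value;
}
```

Compiler instrumentation inserts a check of this kind before each memory access, which is where most of the runtime overhead comes from.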

KCSAN (Kernel Concurrency Sanitizer)

KCSAN detects data races in kernel code:

  • Instruments memory accesses and checks for concurrent conflicting access
  • Reports races with full stack traces
  • Uses watchpoint-based sampling, which keeps overhead tunable but still too high for production kernels (testing and CI use)

Other Hardening Features

| Feature                      | Protection                                         |
|------------------------------|----------------------------------------------------|
| Stack protector              | Canary values detect stack buffer overflows        |
| FORTIFY_SOURCE               | Compile-time and runtime bounds checking           |
| STACKLEAK                    | Clear kernel stack between syscalls                |
| Hardened usercopy            | Validate user/kernel copy boundaries               |
| init_on_alloc/free           | Zero-initialize heap memory (info leak prevention) |
| CFI (Control Flow Integrity) | Indirect call targets validated at runtime         |
| W^X enforcement              | Memory is writable or executable, never both       |

Secure Boot and TPM

Secure Boot Chain

1. CPU reset -> ROM bootloader (immutable)
2. ROM verifies -> UEFI firmware (signed)
3. UEFI verifies -> bootloader/shim (signed by Microsoft/vendor)
4. Bootloader verifies -> kernel (signed)
5. Kernel verifies -> modules (signed)
6. Kernel enforces -> only signed modules load (CONFIG_MODULE_SIG_FORCE)

TPM (Trusted Platform Module)

A hardware security chip providing:

  • Platform Configuration Registers (PCRs): extend-only hash registers measuring each boot stage
  • Sealed storage: Encrypt data so it can only be decrypted if PCR values match (system is in known-good state)
  • Remote attestation: Prove to a remote verifier that the system booted securely
  • Key generation and storage: Hardware-protected keys that never leave the TPM

Measured boot vs. secure boot: Secure boot blocks unauthorized code. Measured boot records what ran (attestation) but does not block.
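
The extend-only behavior of a PCR can be written as a hash chain. In the sketch below, $H$ is the hash of the PCR bank (for example SHA-256), $\|$ is concatenation, $m_k$ is the digest of the $k$-th measured boot component, and $0$ stands for the all-zero reset value that most PCRs start from; the symbols are notation chosen for this illustration.

```latex
\mathrm{PCR}_{\text{new}} = H\bigl(\mathrm{PCR}_{\text{old}} \,\|\, m_k\bigr),
\qquad
\mathrm{PCR}_{\text{final}} = H\bigl(\cdots H\bigl(H(0 \,\|\, m_1) \,\|\, m_2\bigr) \cdots \,\|\, m_n\bigr)
```

Because $H$ is one-way, software can only append to the chain and cannot set a PCR to a chosen value. The final value therefore commits to every measured stage and to their order, which is what sealed storage and remote attestation rely on.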

Trusted Execution Environments (TEEs)

TEEs provide hardware-isolated enclaves where code and data are protected even from a compromised OS.

Intel SGX (Software Guard Extensions)

  • Creates encrypted memory enclaves (EPC -- Enclave Page Cache)
  • The CPU decrypts enclave memory only when executing enclave code
  • OS, hypervisor, and even physical access cannot read enclave memory
  • Limited EPC size (128-512 MB on client and Xeon E parts; recent Xeon Scalable CPUs support far larger EPC), significant enclave entry/exit overhead
  • Attestation: prove to a remote party that an enclave runs specific code
  • Deprecated on consumer CPUs (11th gen Core and later), continues on server (Xeon)

ARM TrustZone

  • Partitions the system into Normal World and Secure World
  • Hardware-enforced isolation: separate address spaces, interrupts, peripherals
  • Secure World runs a trusted OS (OP-TEE) with trusted applications
  • Used for: secure key storage, DRM, biometric authentication, secure boot

AMD SEV (Secure Encrypted Virtualization)

  • Encrypts VM memory with per-VM keys managed by a dedicated security processor
  • SEV-ES: additionally encrypts CPU register state on VM exit
  • SEV-SNP: adds integrity protection (prevents host from replaying or remapping pages)
  • Use case: confidential cloud computing -- cloud provider cannot inspect VM memory

Comparison

| TEE       | Granularity | Threat Model              | Attestation |
|-----------|-------------|---------------------------|-------------|
| SGX       | Per-enclave | Untrusted OS + hypervisor | Remote      |
| TrustZone | Two worlds  | Normal world compromised  | Platform    |
| SEV-SNP   | Per-VM      | Untrusted hypervisor/host | Remote      |

Spectre and Meltdown Mitigations

These transient execution attacks exploit speculative execution to leak data across security boundaries.

Meltdown (CVE-2017-5754)

  • Attack: User process reads kernel memory via speculative execution before the permission check retires
  • Mitigation: KPTI (Kernel Page Table Isolation) -- separate page tables for user and kernel mode. Kernel pages are unmapped in user-space page tables.
  • Performance cost: 1-5% for most workloads (higher for syscall-heavy workloads)

Spectre Variant 1 (Bounds Check Bypass)

  • Attack: Speculative array access past bounds, leaking data via cache side channel
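  • Mitigation: array_index_nospec() -- clamp the index to the array bounds with branch-free masking, so an out-of-bounds index becomes 0 even under speculation; a speculation barrier (lfence) is the heavier alternative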
  • Compiler support: Speculative Load Hardening (SLH)
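
The masking idea behind array_index_nospec() can be sketched in portable C. The arithmetic below is modeled on the kernel's generic C fallback, but the function names are invented for this illustration and it is not a usable mitigation as written: an optimizing compiler may fold the mask away after the bounds check, which is why the kernel hides the value from the optimizer or uses architecture-specific assembly. It also assumes arithmetic right shift of negative values, as GCC and Clang implement it.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/* All ones if index < size, zero otherwise, computed without a branch.
 * Requires both values to be at most LONG_MAX. */
unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    assert(index <= LONG_MAX && size <= LONG_MAX);
    /* For index < size neither operand has its top bit set, so the
     * complement is negative and the shift smears the sign bit. */
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (BITS_PER_LONG - 1));
}

/* In-bounds indices pass through unchanged; anything else becomes 0. */
unsigned long clamp_index_nospec(unsigned long index, unsigned long size)
{
    return index & index_mask_nospec(index, size);
}

/* Typical usage pattern: architectural bounds check, then clamp. */
bool load_bounded(const int *table, unsigned long size,
                  unsigned long index, int *out)
{
    if (index >= size)
        return false;
    index = clamp_index_nospec(index, size); /* also bounds speculation */
    *out = table[index];
    return true;
}
```

The point of the mask is that it is a data dependency instead of a branch: even if the CPU mispredicts the bounds check, the speculative load can only touch element 0.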

Spectre Variant 2 (Branch Target Injection)

  • Attack: Poison indirect branch predictor to redirect speculative execution
  • Mitigations:
    • Retpolines: replace indirect calls with a return-based trampoline that traps speculation in a harmless loop instead of consulting the indirect branch predictor
    • IBRS/IBPB: hardware controls for indirect branch prediction
    • eIBRS (enhanced IBRS): hardware mitigation on newer CPUs

Other Variants

| Attack              | Target                 | Mitigation                     |
|---------------------|------------------------|--------------------------------|
| Spectre-RSB         | Return Stack Buffer    | RSB stuffing on context switch |
| MDS (RIDL, Fallout) | CPU internal buffers   | VERW instruction, HT disabling |
| L1TF                | L1 cache + page tables | PTE inversion, VM flush        |
| Spectre-BHB         | Branch History Buffer  | BHB clearing sequences         |

Performance Impact Summary

Mitigations collectively cost 5-30% depending on workload (syscall-heavy and VM-heavy workloads are most affected). Hardware fixes in newer CPU generations reduce the software mitigation overhead.

Key Takeaways

  1. MAC (SELinux, AppArmor) enforces system-wide policies beyond DAC; SELinux is more granular, AppArmor is simpler
  2. Linux capabilities decompose root privilege; containers drop most capabilities by default
  3. seccomp-bpf restricts syscalls at the per-process level and is ubiquitous in sandboxing
  4. Landlock enables unprivileged sandboxing; pledge/unveil (OpenBSD) inspired its design philosophy
  5. Secure boot + TPM establishes a hardware root of trust; measured boot enables remote attestation
  6. TEEs (SGX, TrustZone, SEV-SNP) protect code and data from the OS itself -- essential for confidential computing
  7. Spectre/Meltdown mitigations (KPTI, retpolines, IBRS) are a permanent tax on performance, gradually reduced by hardware fixes