
Containers

Containers provide lightweight, isolated environments for applications using OS-level virtualization. Unlike VMs, containers share the host kernel, which makes them faster to start, smaller on disk, and lighter on memory.

OS-Level Virtualization

VMs:                      Containers:
[App1][App2][App3]        [App1 ][App2 ][App3 ]
[OS1 ][OS2 ][OS3 ]        [Libs ][Libs ][Libs ]
[   Hypervisor   ]        [ Container Runtime ]
[    Hardware    ]        [  Host OS Kernel   ]
                          [     Hardware      ]

Key difference: containers share the host kernel, so there is no guest OS overhead. The trade-off is weaker isolation than VMs: a shared kernel is a shared attack surface.

Linux Namespaces

Namespaces provide isolation — each container sees its own view of the system.

Namespace     Isolates            Effect
PID           Process IDs         The container's first process is PID 1 (init). It cannot see host processes.
Network       Network stack       Own interfaces, IP addresses, ports, routing tables, firewall rules.
Mount         Filesystem mounts   Own mount table. Host paths are invisible unless explicitly mounted in.
User          UIDs/GIDs           Container root (UID 0) can map to an unprivileged host UID (rootless containers).
UTS           Hostname            Own hostname and domain name.
IPC           IPC resources       Own message queues, semaphores, shared memory.
Cgroup        Cgroup root         Own view of the cgroup hierarchy.
Time          System clocks       Own offsets for the monotonic and boot-time clocks (Linux 5.6+).

Creating Namespaces

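#define _GNU_SOURCE   /* exposes clone() and the CLONE_NEW* flags in <sched.h> */
#include <sched.h>
#include <signal.h>   /* SIGCHLD */
// New PID + network + mount namespaces (needs CAP_SYS_ADMIN); stack_top = highest address of the child's stack
pid_t child_pid = clone(child_fn, stack_top,
    CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | SIGCHLD, NULL);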

unshare() creates namespaces for the current process. setns() joins an existing namespace.
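The namespaces a process belongs to are visible as symlinks under /proc/<pid>/ns, and these are also the files a process opens before calling setns(). The sketch below, which assumes a Linux host with /proc mounted, parses those links so that two processes can be compared. The helper names (parse_ns_link, namespaces_of, same_namespace) are illustrative and not part of any library.

```python
import os
import re

# A /proc/<pid>/ns/* symlink target looks like "pid:[4026531836]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> tuple:
    """Split a namespace symlink target into (namespace type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match["kind"], int(match["inode"])


def namespaces_of(pid: str = "self") -> dict:
    """Map each entry in /proc/<pid>/ns to its inode (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in os.listdir(ns_dir):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


def same_namespace(a: dict, b: dict, kind: str) -> bool:
    """Two processes share a namespace when the inode numbers are equal."""
    return a[kind] == b[kind]
```

On a Linux host, comparing namespaces_of() with namespaces_of("<pid of a containerized process>") would typically show different inodes for pid, net, and mnt. Reading another user's /proc entries may require elevated privileges.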

Cgroups (Control Groups)

Cgroups limit, account for, and isolate resource usage.

Controllers

Controller   Controls
cpu          CPU time allocation (shares, quotas)
cpuset       Pinning to specific CPUs/NUMA nodes
memory       Memory limit (hard/soft), swap limit
blkio / io   Block I/O bandwidth limits, weights (blkio in cgroup v1, io in cgroup v2)
pids         Maximum number of processes
devices      Access to specific device files
freezer      Pause/resume all processes in a cgroup
hugetlb      Huge page limits

Resource Limiting

# Create a cgroup (v2) limiting memory to 256 MiB; run as root
mkdir /sys/fs/cgroup/mycontainer
echo 268435456 > /sys/fs/cgroup/mycontainer/memory.max   # 256 * 1024 * 1024 bytes
echo $$ > /sys/fs/cgroup/mycontainer/cgroup.procs        # move the current shell into the cgroup
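The same steps can be planned from a program. The following sketch converts a human-readable size into the byte count written above and lists the file writes involved. It assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup; parse_size, memory_limit_writes, and apply_writes are hypothetical helper names, and apply_writes needs root, so it is defined but not called.

```python
from pathlib import Path

# Binary multiples, matching the 268435456 = 256 * 1024 * 1024 used above.
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert '256M', '1G', or a plain byte count into an integer."""
    text = text.strip().upper()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)


def memory_limit_writes(cgroup: str, limit: str, pid: int,
                        root: str = "/sys/fs/cgroup") -> list:
    """Return the (file, value) pairs that mirror the shell commands."""
    base = Path(root) / cgroup
    return [
        (base / "memory.max", str(parse_size(limit))),
        (base / "cgroup.procs", str(pid)),
    ]


def apply_writes(writes: list) -> None:
    """Perform the writes. Requires root and a cgroup v2 mount."""
    for path, value in writes:
        path.parent.mkdir(exist_ok=True)  # creating the directory creates the cgroup
        path.write_text(value)
```

Keeping the planning step separate from the privileged writes makes the logic easy to test without touching /sys.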

CPU Limits

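CPU shares (relative weight): if container A has 1024 shares and B has 512, A gets twice as much CPU time as B when both are busy. Shares only matter under contention; when the other container is idle, either one can use the whole CPU. cgroup v1 exposes this as cpu.shares, cgroup v2 as cpu.weight.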
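The proportional split can be worked out directly. This is a minimal sketch of the arithmetic only, not the kernel scheduler; cpu_fractions is an illustrative name, and the optional busy argument models the fact that idle containers do not compete for CPU time.

```python
def cpu_fractions(shares: dict, busy=None) -> dict:
    """Fraction of contended CPU time each busy container receives."""
    active = {name: weight for name, weight in shares.items()
              if busy is None or name in busy}
    total = sum(active.values())
    return {name: weight / total for name, weight in active.items()}


both = cpu_fractions({"A": 1024, "B": 512})           # A: 2/3, B: 1/3
only_b = cpu_fractions({"A": 1024, "B": 512}, {"B"})  # B alone gets everything
```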

CPU quota (absolute limit): 50000μs per 100000μs period → 50% of one CPU core.

echo "50000 100000" > /sys/fs/cgroup/mycontainer/cpu.max
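The two numbers in cpu.max are the quota and the period in microseconds, and the literal value max in the quota position means no limit. A small sketch of converting between that format and a number of cores follows; the function names are illustrative.

```python
from typing import Optional


def parse_cpu_max(line: str) -> Optional[float]:
    """Return the CPU limit in cores, or None when the quota is 'max'."""
    quota, period = line.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def format_cpu_max(cores: Optional[float], period_us: int = 100_000) -> str:
    """Build a cpu.max line for a limit expressed in cores."""
    if cores is None:
        return f"max {period_us}"
    return f"{round(cores * period_us)} {period_us}"
```

A quota larger than the period is valid: "200000 100000" allows up to two full cores.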

Memory Limits

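Hard limit (memory.max in cgroup v2): usage cannot exceed it. The kernel first tries to reclaim memory from the cgroup, and if that is not enough the OOM killer terminates a process inside it. Soft limit (memory.high in cgroup v2): the cgroup may burst past it, but it is then throttled and its memory is reclaimed rather than being OOM-killed.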
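The difference between the two limits can be illustrated with a toy decision function. This is a deliberately simplified model under stated assumptions, not the kernel's reclaim algorithm: it assumes a single soft threshold, a single hard threshold, and a known amount of reclaimable memory (for example page cache). All names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    NONE = "within limits"
    RECLAIM = "reclaim memory, no kill"
    OOM_KILL = "hard limit hit with nothing left to reclaim"


def on_charge(usage: int, request: int, soft: int, hard: int,
              reclaimable: int) -> Action:
    """Decide what happens when a cgroup asks for `request` more bytes."""
    wanted = usage + request
    if wanted > hard:
        # Over the hard limit: survive only if reclaim frees enough memory.
        if wanted - reclaimable <= hard:
            return Action.RECLAIM
        return Action.OOM_KILL
    if wanted > soft:
        # Bursting above the soft limit is allowed, at the cost of reclaim.
        return Action.RECLAIM
    return Action.NONE
```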

Container Runtimes

Low-Level Runtimes

runc: Reference implementation of the OCI (Open Container Initiative) runtime spec. Creates and runs containers using namespaces + cgroups. Written in Go.

crun: A faster, lower-memory alternative written in C. Commonly used as the runtime behind Podman.

youki: Container runtime written in Rust.

High-Level Runtimes

containerd: Industry-standard container runtime. Manages container lifecycle (pull images, create containers, manage storage). Used by Docker and Kubernetes.

CRI-O: Lightweight container runtime for Kubernetes. Implements the Container Runtime Interface (CRI).

Docker Architecture

User: docker run nginx
    │
    ▼
Docker CLI → Docker Daemon (dockerd)
                │
                ▼
            containerd → runc → Container
                │
                ▼
            Image management (pull, store)

Docker Components

  • Docker CLI: Command-line interface
  • Docker Daemon (dockerd): Manages containers, images, networks, volumes
  • containerd: Container lifecycle management
  • runc: Creates and runs containers (sets up namespaces, cgroups)

OCI Specifications

The Open Container Initiative defines standards:

OCI Image Spec: How container images are formatted (layers, config, manifest).

OCI Runtime Spec: How containers are configured and executed (config.json with namespaces, mounts, cgroups).

OCI Distribution Spec: How images are pushed/pulled from registries.
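To make the runtime spec concrete, the sketch below builds a fragment of a config.json as a Python dictionary. The field names (ociVersion, process, root, linux.namespaces, linux.resources) come from the OCI runtime spec, but this is not a complete configuration: a bundle that a runtime such as runc would accept also needs mounts and other settings. The function name, the hostname "demo", and the example values are assumptions made for illustration.

```python
import json


def minimal_runtime_config(args, memory_limit_bytes: int,
                           cpu_quota_us: int,
                           cpu_period_us: int = 100_000) -> dict:
    """Build an illustrative subset of an OCI runtime config.json."""
    return {
        "ociVersion": "1.0.2",
        "process": {
            "args": list(args),
            "cwd": "/",
            "user": {"uid": 0, "gid": 0},
        },
        "root": {"path": "rootfs", "readonly": True},
        "hostname": "demo",  # hypothetical value
        "linux": {
            # One entry per namespace the runtime should create.
            "namespaces": [{"type": kind}
                           for kind in ("pid", "network", "mount", "uts", "ipc")],
            # Translated by the runtime into cgroup settings.
            "resources": {
                "memory": {"limit": memory_limit_bytes},
                "cpu": {"quota": cpu_quota_us, "period": cpu_period_us},
            },
        },
    }


config_json = json.dumps(
    minimal_runtime_config(["nginx"], 268435456, 50000), indent=2)
```

The namespaces list and the resources block map onto the namespace and cgroup mechanisms described earlier on this page.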

Container Images

Layers

A container image is a stack of read-only layers. Each layer adds, modifies, or deletes files.

Layer 3 (top): COPY app.js /app/          (application code)
Layer 2:       RUN npm install             (dependencies)
Layer 1:       RUN apt-get install nodejs  (runtime)
Layer 0 (base): ubuntu:22.04              (base OS)

Union file system: Layers are stacked using a union mount. The container sees a merged view.
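The merged view can be modelled by applying layers from the base upward. In the sketch below each layer is a dictionary from path to file content, and deletions use the whiteout convention from the OCI image spec, where an entry named .wh.<name> hides <name> from lower layers. This is a simplified model: opaque whiteouts, directories, and file metadata are not handled, and the function names are illustrative.

```python
import posixpath

WHITEOUT_PREFIX = ".wh."


def apply_layer(view: dict, layer: dict) -> dict:
    """Return a new merged view with `layer` stacked on top of `view`."""
    merged = dict(view)
    # Pass 1: whiteouts hide files (or whole subtrees) from lower layers.
    for path in layer:
        directory, name = posixpath.split(path)
        if name.startswith(WHITEOUT_PREFIX):
            target = posixpath.join(directory, name[len(WHITEOUT_PREFIX):])
            merged = {p: c for p, c in merged.items()
                      if p != target and not p.startswith(target + "/")}
    # Pass 2: regular entries add new files or replace existing ones.
    for path, content in layer.items():
        if not posixpath.basename(path).startswith(WHITEOUT_PREFIX):
            merged[path] = content
    return merged


def merged_view(layers: list) -> dict:
    """Stack layers in order, base layer first."""
    view = {}
    for layer in layers:
        view = apply_layer(view, layer)
    return view
```

The lower layers are never modified; a deletion only records that a path should be hidden in the merged result.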

Overlay Filesystem (OverlayFS)

Container layer (read-write):  upperdir
                    │
Merged view:    merged = overlay(lowerdir + upperdir)
                    │
Image layers (read-only):  lowerdir (stacked)

Copy-on-write: First write to a file from a lower layer → copy to upperdir, then modify.
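Copy-on-write can be sketched with an in-memory model of the upper and lower directories. The Overlay class below is a teaching aid and not how OverlayFS is implemented in the kernel: files are byte strings in dictionaries, and a counter records how many copy-ups happened.

```python
class Overlay:
    """Toy model of an overlay mount: read-only lower layers plus an upper layer."""

    def __init__(self, lower_layers):
        self.lower = list(lower_layers)  # topmost layer first, never modified
        self.upper = {}                  # writable container layer
        self.whiteouts = set()           # paths deleted inside the container
        self.copy_ups = 0

    def _lookup_lower(self, path):
        for layer in self.lower:
            if path in layer:
                return layer[path]
        return None

    def read(self, path):
        if path in self.upper:
            return self.upper[path]
        if path in self.whiteouts:
            raise FileNotFoundError(path)
        data = self._lookup_lower(path)
        if data is None:
            raise FileNotFoundError(path)
        return data

    def append(self, path, data: bytes):
        if path not in self.upper:
            # First write: copy the whole file up, then modify the copy.
            self.upper[path] = self.read(path)
            self.copy_ups += 1
        self.upper[path] += data

    def delete(self, path):
        self.read(path)  # raises FileNotFoundError if the file is not visible
        self.upper.pop(path, None)
        if self._lookup_lower(path) is not None:
            self.whiteouts.add(path)
```

Only the first write to a file pays the copy cost; later writes go straight to the copy in the upper layer, and the image layers stay unchanged so other containers can keep sharing them.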

Image Building

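FROM rust:1.75-slim
WORKDIR /app
# Copy only the manifests first so the dependency layer stays cached until they change
COPY Cargo.toml Cargo.lock ./
# cargo needs a build target to load the manifest; a placeholder is enough (assumes a binary crate)
RUN mkdir src && echo 'fn main() {}' > src/main.rs && cargo fetch
COPY src/ src/
RUN cargo build --release
CMD ["./target/release/myapp"]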
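The instruction order above matters because each layer can be reused from cache only if every earlier step is unchanged. The sketch below models that with chained hashes: a step's key depends on its parent's key, the instruction text, and the content of any files it copies. This is an assumption-laden illustration of the idea, not the actual cache algorithm of Docker or BuildKit, and the step list is an abbreviated version of the Dockerfile.

```python
import hashlib


def cache_keys(steps: list) -> list:
    """steps: (instruction, input_bytes) pairs; returns one key per step."""
    keys = []
    parent = b""
    for instruction, inputs in steps:
        digest = hashlib.sha256(
            parent + instruction.encode() + b"\0" + inputs).hexdigest()
        keys.append(digest)
        parent = digest.encode()  # a change here invalidates every later step
    return keys


def reused_layers(old_keys: list, new_keys: list) -> int:
    """Count leading steps whose cache key is unchanged between two builds."""
    count = 0
    for old, new in zip(old_keys, new_keys):
        if old != new:
            break
        count += 1
    return count


def dockerfile_steps(manifests: bytes, sources: bytes) -> list:
    """Hypothetical stand-in for the build steps and their copied inputs."""
    return [
        ("FROM rust:1.75-slim", b""),
        ("WORKDIR /app", b""),
        ("COPY Cargo.toml Cargo.lock ./", manifests),
        ("RUN cargo fetch", b""),
        ("COPY src/ src/", sources),
        ("RUN cargo build --release", b""),
        ("CMD myapp", b""),
    ]
```

With this ordering, editing only the source files leaves the first four steps cached, so dependencies are not downloaded again; editing Cargo.toml invalidates everything from the manifest copy onward.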

Multi-stage builds: Use a build image, then copy only the binary to a minimal runtime image.

FROM rust:1.75 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release

FROM debian:bookworm-slim
COPY --from=builder /app/target/release/myapp /usr/local/bin/
CMD ["myapp"]

Kubernetes Concepts

Architecture

┌────────────────────────────────────────────────┐
│                 Control Plane                   │
│  API Server │ Scheduler │ Controller Manager    │
│  etcd (state store)                             │
└───────────────────┬────────────────────────────┘
                    │
    ┌───────────────┼───────────────┐
    │               │               │
┌───┴───┐      ┌───┴───┐      ┌───┴───┐
│ Node 1│      │ Node 2│      │ Node 3│
│kubelet│      │kubelet│      │kubelet│
│kube-  │      │kube-  │      │kube-  │
│proxy  │      │proxy  │      │proxy  │
│[Pod]  │      │[Pod]  │      │[Pod]  │
│[Pod]  │      │[Pod]  │      │[Pod]  │
└───────┘      └───────┘      └───────┘

Key Objects

Pod: Smallest deployable unit. One or more containers sharing network and storage.

Service: Stable network endpoint for a set of Pods. Load balancing. DNS name.

Deployment: Declarative updates for Pods. Rolling updates, rollbacks, scaling.

StatefulSet: Like Deployment but for stateful applications. Stable network IDs and persistent storage.

DaemonSet: Ensures a Pod runs on every node (monitoring, logging).

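ConfigMap / Secret: Configuration kept outside the image and injected as environment variables or mounted files. Secrets are meant for sensitive values, but by default they are only base64-encoded, not encrypted.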

Ingress: HTTP(S) routing from outside the cluster to Services.

Operators: Custom controllers that automate application management (database operators, message queue operators).
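As an illustration of how a Deployment and a Service relate, the sketch below builds both manifests as dictionaries and checks that the Service's label selector matches the Pods the Deployment creates. The field names follow the Kubernetes apps/v1 Deployment and v1 Service APIs; the helper functions, the app label key, and the example image are assumptions for the example.

```python
import json


def deployment(name: str, image: str, replicas: int, port: int) -> dict:
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels in the Pod template.
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {"containers": [{
                    "name": name,
                    "image": image,
                    "ports": [{"containerPort": port}],
                }]},
            },
        },
    }


def service(name: str, port: int, target_port: int) -> dict:
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }


def selects(svc: dict, dep: dict) -> bool:
    """True when the Service selector matches the Deployment's Pod labels."""
    pod_labels = dep["spec"]["template"]["metadata"]["labels"]
    return all(pod_labels.get(key) == value
               for key, value in svc["spec"]["selector"].items())


manifest_json = json.dumps(deployment("web", "nginx:1.25", 3, 80), indent=2)
```

Because kubectl accepts JSON as well as YAML, the serialized manifest could be saved to a file and applied with kubectl apply -f.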

Kubernetes Patterns

Sidecar: Helper container in the same Pod (logging, monitoring, proxy — Envoy in Istio).

Init container: Run before main containers. Database migrations, config setup.

Ambassador: Proxy container handling external communication.
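The init container and sidecar patterns can be shown in a single Pod spec. The sketch below builds one as a dictionary, using the initContainers, containers, volumes, and volumeMounts fields of the Kubernetes Pod API. The container names, images, the ./migrate command, and the shared log path are hypothetical.

```python
def pod_with_sidecar(name: str, app_image: str, init_image: str,
                     sidecar_image: str) -> dict:
    mount = {"name": "shared-logs", "mountPath": "/var/log/app"}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # emptyDir lives as long as the Pod and is shared by its containers.
            "volumes": [{"name": "shared-logs", "emptyDir": {}}],
            "initContainers": [
                {"name": "migrate", "image": init_image,
                 "command": ["./migrate"]},
            ],
            "containers": [
                {"name": "app", "image": app_image,
                 "volumeMounts": [dict(mount)]},
                {"name": "log-shipper", "image": sidecar_image,
                 "volumeMounts": [dict(mount)]},
            ],
        },
    }


def startup_phases(pod: dict) -> list:
    """Init containers run one at a time to completion, then the rest start."""
    spec = pod["spec"]
    phases = [[c["name"]] for c in spec.get("initContainers", [])]
    phases.append([c["name"] for c in spec["containers"]])
    return phases
```

The sidecar reads what the application writes through the shared volume, while the init container finishes before either of them starts.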

Container Security

Defense in Depth

  1. Minimal base images: Alpine, distroless, scratch — fewer packages = fewer vulnerabilities.
  2. Non-root user: Run as non-root inside the container.
  3. Read-only root filesystem: Prevent writes to the container's filesystem; mount volumes or tmpfs only where writes are needed.
  4. Drop capabilities: Remove unnecessary Linux capabilities (NET_RAW, SYS_ADMIN).
  5. Seccomp profiles: Restrict syscalls.
  6. AppArmor / SELinux: MAC policies for containers.
  7. Network policies: Restrict pod-to-pod communication in Kubernetes.
  8. Image scanning: Scan images for known vulnerabilities (Trivy, Snyk, Grype).
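Several of the items above correspond to fields of a Kubernetes container securityContext. The sketch below shows a hardened example and a small checker that reports which settings are missing. The field names (runAsNonRoot, readOnlyRootFilesystem, allowPrivilegeEscalation, capabilities.drop, seccompProfile.type) are real Kubernetes fields; the checker itself, its messages, and the UID 10001 are illustrative assumptions and not a replacement for a policy engine.

```python
HARDENED = {
    "runAsNonRoot": True,
    "runAsUser": 10001,  # hypothetical unprivileged UID
    "readOnlyRootFilesystem": True,
    "allowPrivilegeEscalation": False,
    "capabilities": {"drop": ["ALL"]},
    "seccompProfile": {"type": "RuntimeDefault"},
}


def audit_security_context(ctx: dict) -> list:
    """Return a list of findings; an empty list means every check passed."""
    findings = []
    if not ctx.get("runAsNonRoot"):
        findings.append("container may run as root")
    if not ctx.get("readOnlyRootFilesystem"):
        findings.append("root filesystem is writable")
    if ctx.get("allowPrivilegeEscalation", True):
        findings.append("privilege escalation is allowed")
    if "ALL" not in ctx.get("capabilities", {}).get("drop", []):
        findings.append("capabilities are not dropped")
    profile = ctx.get("seccompProfile", {}).get("type")
    if profile not in ("RuntimeDefault", "Localhost"):
        findings.append("no seccomp profile is set")
    return findings
```

Dropping ALL capabilities and adding back only what the application needs is stricter than removing a few individual ones such as NET_RAW.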

Container Isolation Limits

Containers share the host kernel. A kernel vulnerability can let a process escape the container and compromise the host.

Stronger isolation: gVisor (user-space kernel), Kata Containers (lightweight VMs with container API), Firecracker (micro-VMs used by AWS Lambda).

Applications in CS

  • Microservices: Each service in its own container. Independent deployment, scaling, updates.
  • CI/CD: Build and test in containers (reproducible environments). GitHub Actions, GitLab CI.
  • Development: Docker Compose for local multi-service development. "Works on my machine" → "Works in this container."
  • Cloud-native: Kubernetes orchestrates containers at scale. Auto-scaling, self-healing, rolling updates.
  • Edge computing: Lightweight containers on IoT devices and edge servers.
  • ML/AI: Containerized training and inference. GPU support (NVIDIA Container Toolkit). Reproducible experiments.
  • Serverless: AWS Lambda uses Firecracker (container-like micro-VMs). Cloud Run runs containers on demand.