
Containers

Containers provide lightweight, isolated environments for applications using OS-level virtualization. Unlike VMs, containers share the host kernel — making them faster to start, smaller, and more efficient.

OS-Level Virtualization

VMs:                          Containers:
[App1][App2][App3]            [App1][App2][App3]
[OS1 ][OS2 ][OS3 ]            [Libs ][Libs ][Libs ]
[   Hypervisor   ]            [  Container Runtime  ]
[   Hardware     ]            [   Host OS Kernel    ]
                              [      Hardware       ]

Key difference: Containers share the host kernel. No guest OS overhead. But less isolation than VMs (shared kernel = shared attack surface).

Linux Namespaces

Namespaces provide isolation — each container sees its own view of the system.

| Namespace | Isolates | Effect |
|---|---|---|
| PID | Process IDs | Container has PID 1 (init). Can't see host processes. |
| Network | Network stack | Own IP, ports, routing, firewall rules. |
| Mount | Filesystem mounts | Own mount points. Can't see host FS. |
| User | UIDs/GIDs | Container root ≠ host root (unprivileged containers). |
| UTS | Hostname | Own hostname and domain name. |
| IPC | IPC resources | Own message queues, semaphores, shared memory. |
| Cgroup | Cgroup root | Own view of the cgroup hierarchy. |
| Time (Linux 5.6+) | System clocks | Own clock offsets. |
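Each namespace a process belongs to is visible as a symlink under /proc/&lt;pid&gt;/ns. Comparing the link targets of two processes shows whether they share a namespace:

```shell
# Each target names the namespace type and its inode number;
# two processes with identical targets share that namespace.
readlink /proc/self/ns/pid   # e.g. pid:[4026531836]
readlink /proc/self/ns/net
```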

Creating Namespaces

// Create a child in new PID + network + mount namespaces (requires CAP_SYS_ADMIN)
#define _GNU_SOURCE
#include <sched.h>   // clone() and the CLONE_* flags

int child_pid = clone(child_fn, stack,   // `stack` = top of the child's stack
    CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | SIGCHLD, NULL);

unshare() moves the calling process into newly created namespaces. setns() joins an existing namespace via a file descriptor opened from /proc/&lt;pid&gt;/ns/.
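The util-linux unshare tool wraps the same syscall, so the effect can be demonstrated without writing C. A minimal sketch, assuming the kernel permits unprivileged user namespaces:

```shell
# --map-root-user maps our UID to root inside the new user namespace,
# which grants the privilege needed to change the namespaced hostname.
# The host's hostname is untouched.
unshare --user --map-root-user --uts sh -c 'hostname demo-ns; hostname'
```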

Cgroups (Control Groups)

Cgroups limit, account for, and isolate resource usage.

Controllers

| Controller | Controls |
|---|---|
| cpu | CPU time allocation (shares, quotas) |
| cpuset | Pin to specific CPUs/NUMA nodes |
| memory | Memory limit (hard/soft), swap limit |
| blkio / io | Block I/O bandwidth limits, weights |
| pids | Maximum number of processes |
| devices | Access to specific device files |
| freezer | Pause/resume all processes in a cgroup |
| hugetlb | Huge page limits |

Resource Limiting

# Create a cgroup capping memory at 256 MiB (cgroup v2, requires root)
mkdir /sys/fs/cgroup/mycontainer
echo 268435456 > /sys/fs/cgroup/mycontainer/memory.max
echo $$ > /sys/fs/cgroup/mycontainer/cgroup.procs   # move this shell into it

CPU Limits

CPU shares (relative weight; `cpu.shares` in cgroup v1, `cpu.weight` in v2): Container A gets 1024 shares, B gets 512 → A gets 2× the CPU when both are busy.

CPU quota (absolute limit): 50000μs per 100000μs period → 50% of one CPU core.

echo "50000 100000" > /sys/fs/cgroup/mycontainer/cpu.max
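The two numbers in cpu.max are the quota and period in microseconds ("max" means unlimited); their ratio is the fraction of one core the cgroup may use:

```shell
# Compute the effective share of one core from a sample cpu.max value
# (pure arithmetic, no root needed).
echo "50000 100000" | awk '{ printf "%g%% of one core\n", $1 / $2 * 100 }'
```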

Memory Limits

Hard limit (`memory.max` in cgroup v2): processes are OOM-killed if it is exceeded. Soft limit (`memory.high`): memory is reclaimed and allocations throttled under pressure, but short bursts are allowed.

Container Runtimes

Low-Level Runtimes

runc: Reference implementation of the OCI (Open Container Initiative) runtime spec. Creates and runs containers using namespaces + cgroups. Written in Go.

crun: Faster alternative written in C. Used by Podman.

youki: Container runtime written in Rust.

High-Level Runtimes

containerd: Industry-standard container runtime. Manages container lifecycle (pull images, create containers, manage storage). Used by Docker and Kubernetes.

CRI-O: Lightweight container runtime for Kubernetes. Implements the Container Runtime Interface (CRI).

Docker Architecture

User: docker run nginx
    │
    ▼
Docker CLI → Docker Daemon (dockerd)
                │
                ▼
            containerd → runc → Container
                │
                ▼
            Image management (pull, store)

Docker Components

  • Docker CLI: Command-line interface
  • Docker Daemon (dockerd): Manages containers, images, networks, volumes
  • containerd: Container lifecycle management
  • runc: Creates and runs containers (sets up namespaces, cgroups)

OCI Specifications

The Open Container Initiative defines standards:

OCI Image Spec: How container images are formatted (layers, config, manifest).

OCI Runtime Spec: How containers are configured and executed (config.json with namespaces, mounts, cgroups).

OCI Distribution Spec: How images are pushed/pulled from registries.
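To get a feel for the runtime spec, here is a heavily trimmed config.json; paths and values are illustrative, and the real schema has many more fields:

```json
{
  "ociVersion": "1.0.2",
  "process": { "args": ["sh"], "cwd": "/" },
  "root": { "path": "rootfs", "readonly": true },
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "mount" }
    ],
    "resources": { "memory": { "limit": 268435456 } }
  }
}
```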

Container Images

Layers

A container image is a stack of read-only layers. Each layer adds, modifies, or deletes files.

Layer 3 (top): COPY app.js /app/          (application code)
Layer 2:       RUN npm install             (dependencies)
Layer 1:       RUN apt-get install nodejs  (runtime)
Layer 0 (base): ubuntu:22.04              (base OS)

Union file system: Layers are stacked using a union mount. The container sees a merged view.

Overlay Filesystem (OverlayFS)

Container layer (read-write):  upperdir
                    │
Merged view:    merged = overlay(lowerdir + upperdir)
                    │
Image layers (read-only):  lowerdir (stacked)

Copy-on-write: First write to a file from a lower layer → copy to upperdir, then modify.
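The copy-up behaviour can be observed directly. A minimal sketch, assuming a kernel that allows unprivileged overlayfs mounts inside a user namespace (otherwise run the mount commands as root without the unshare wrapper):

```shell
# Build a one-layer "image" and mount an overlay over it.
mkdir -p lower upper work merged
echo "from image layer" > lower/base.txt

# Inside new user + mount namespaces we can mount without being real root.
unshare --user --map-root-user --mount sh -c '
  mount -t overlay overlay \
        -o lowerdir=lower,upperdir=upper,workdir=work merged
  echo "container write" > merged/base.txt    # first write triggers copy-up
'

cat lower/base.txt   # the read-only layer is unchanged
cat upper/base.txt   # the copied-up, modified file lives in upperdir
```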

Image Building

FROM rust:1.75-slim
WORKDIR /app
COPY Cargo.toml Cargo.lock ./
RUN cargo fetch
COPY src/ src/
RUN cargo build --release
CMD ["./target/release/myapp"]

Multi-stage builds: Use a build image, then copy only the binary to a minimal runtime image.

FROM rust:1.75 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release

FROM debian:bookworm-slim
COPY --from=builder /app/target/release/myapp /usr/local/bin/
CMD ["myapp"]

Kubernetes Concepts

Architecture

┌────────────────────────────────────────────────┐
│                 Control Plane                   │
│  API Server │ Scheduler │ Controller Manager    │
│  etcd (state store)                             │
└───────────────────┬────────────────────────────┘
                    │
    ┌───────────────┼───────────────┐
    │               │               │
┌───┴───┐      ┌───┴───┐      ┌───┴───┐
│ Node 1│      │ Node 2│      │ Node 3│
│kubelet│      │kubelet│      │kubelet│
│kube-  │      │kube-  │      │kube-  │
│proxy  │      │proxy  │      │proxy  │
│[Pod]  │      │[Pod]  │      │[Pod]  │
│[Pod]  │      │[Pod]  │      │[Pod]  │
└───────┘      └───────┘      └───────┘

Key Objects

Pod: Smallest deployable unit. One or more containers sharing network and storage.

Service: Stable network endpoint for a set of Pods; provides load balancing and a stable DNS name.

Deployment: Declarative updates for Pods. Rolling updates, rollbacks, scaling.

StatefulSet: Like Deployment but for stateful applications. Stable network IDs and persistent storage.

DaemonSet: Ensures a Pod runs on every node (monitoring, logging).

ConfigMap / Secret: External configuration. Injected as env vars or files.

Ingress: Routes external HTTP(S) traffic to Services.

Operators: Custom controllers that automate application management (database operators, message queue operators).
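As a sketch of how these objects compose, here is a minimal Deployment exposed by a Service (names and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
        - name: web
          image: nginx:1.25
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector: { app: web }
  ports:
    - port: 80
      targetPort: 80
```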

Kubernetes Patterns

Sidecar: Helper container in the same Pod (logging, monitoring, proxy — Envoy in Istio).

Init container: Run before main containers. Database migrations, config setup.

Ambassador: Proxy container handling external communication.
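A sketch combining the init-container and sidecar patterns in one Pod (image names are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-helpers
spec:
  initContainers:
    - name: migrate              # runs to completion before the main containers start
      image: myorg/migrator:latest
      command: ["sh", "-c", "run-migrations"]
  containers:
    - name: app
      image: myorg/app:latest
      volumeMounts:
        - { name: logs, mountPath: /var/log/app }
    - name: log-shipper          # sidecar: ships logs from the shared volume
      image: myorg/shipper:latest
      volumeMounts:
        - { name: logs, mountPath: /var/log/app }
  volumes:
    - name: logs
      emptyDir: {}
```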

Container Security

Defense in Depth

  1. Minimal base images: Alpine, distroless, scratch — fewer packages = fewer vulnerabilities.
  2. Non-root user: Run as non-root inside the container.
  3. Read-only root filesystem: Prevent writes to the image layer.
  4. Drop capabilities: Remove unnecessary Linux capabilities (NET_RAW, SYS_ADMIN).
  5. Seccomp profiles: Restrict syscalls.
  6. AppArmor / SELinux: MAC policies for containers.
  7. Network policies: Restrict pod-to-pod communication in Kubernetes.
  8. Image scanning: Scan images for known vulnerabilities (Trivy, Snyk, Grype).
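Several of these steps can be declared per Pod through securityContext; a minimal sketch (values illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened
spec:
  securityContext:
    runAsNonRoot: true           # refuse to start if the image runs as root
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault       # runtime's default syscall filter
  containers:
    - name: app
      image: myorg/app:latest
      securityContext:
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]          # then add back only what is needed
```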

Container Isolation Limits

Containers share the host kernel, so a single kernel vulnerability can let a process escape its container.

Stronger isolation: gVisor (user-space kernel), Kata Containers (lightweight VMs with container API), Firecracker (micro-VMs used by AWS Lambda).

Applications in CS

  • Microservices: Each service in its own container. Independent deployment, scaling, updates.
  • CI/CD: Build and test in containers (reproducible environments). GitHub Actions, GitLab CI.
  • Development: Docker Compose for local multi-service development. "Works on my machine" → "Works in this container."
  • Cloud-native: Kubernetes orchestrates containers at scale. Auto-scaling, self-healing, rolling updates.
  • Edge computing: Lightweight containers on IoT devices and edge servers.
  • ML/AI: Containerized training and inference. GPU support (NVIDIA Container Toolkit). Reproducible experiments.
  • Serverless: AWS Lambda uses Firecracker (container-like micro-VMs). Cloud Run runs containers on demand.
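The Docker Compose workflow mentioned above might use a compose.yaml along these lines (service names and images are illustrative):

```yaml
services:
  web:
    build: .            # build the app image from the local Dockerfile
    ports:
      - "8080:8080"
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: dev-only
```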