Containers
Containers provide lightweight, isolated environments for applications using OS-level virtualization. Unlike VMs, containers share the host kernel — making them faster to start, smaller, and more efficient.
OS-Level Virtualization
VMs:                       Containers:
[App1][App2][App3]         [App1][App2][App3]
[OS1 ][OS2 ][OS3 ]         [Libs ][Libs ][Libs ]
[   Hypervisor   ]         [ Container Runtime ]
[    Hardware    ]         [  Host OS Kernel   ]
                           [     Hardware      ]
Key difference: Containers share the host kernel. No guest OS overhead. But less isolation than VMs (shared kernel = shared attack surface).
Linux Namespaces
Namespaces provide isolation — each container sees its own view of the system.
| Namespace | Isolates | Effect |
|---|---|---|
| PID | Process IDs | Container has its own PID 1 (init). Can't see host processes. |
| Network | Network stack | Own IP, ports, routing, firewall rules. |
| Mount | Filesystem mounts | Own mount points. Can't see host FS. |
| User | UIDs/GIDs | Container root ≠ host root (unprivileged containers). |
| UTS | Hostname | Own hostname and domain name. |
| IPC | IPC resources | Own message queues, semaphores, shared memory. |
| Cgroup | Cgroup root | Own view of the cgroup hierarchy. |
| Time (Linux 5.6+) | System clocks | Own clock offsets. |
Creating Namespaces
// Create a child in new PID + network + mount namespaces
// (requires CAP_SYS_ADMIN; child_fn and stack are supplied by the caller)
#define _GNU_SOURCE
#include <sched.h>
int child_pid = clone(child_fn, stack,
    CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | SIGCHLD, NULL);
unshare() creates namespaces for the current process. setns() joins an existing namespace.
Cgroups (Control Groups)
Cgroups limit, account for, and isolate resource usage.
Controllers
| Controller | Controls |
|---|---|
| cpu | CPU time allocation (shares, quotas) |
| cpuset | Pin to specific CPUs/NUMA nodes |
| memory | Memory limit (hard/soft), swap limit |
| blkio / io | Block I/O bandwidth limits, weights |
| pids | Maximum number of processes |
| devices | Access to specific device files |
| freezer | Pause/resume all processes in a cgroup |
| hugetlb | Huge page limits |
Resource Limiting
# Create a cgroup (cgroup v2) limiting memory to 256 MB
mkdir /sys/fs/cgroup/mycontainer
echo 268435456 > /sys/fs/cgroup/mycontainer/memory.max
echo $$ > /sys/fs/cgroup/mycontainer/cgroup.procs   # move this shell into it
CPU Limits
CPU shares (relative weight): Container A gets 1024 shares, B gets 512 → A gets 2× CPU when both are busy.
CPU quota (absolute limit): 50000μs per 100000μs period → 50% of one CPU core.
echo "50000 100000" > /sys/fs/cgroup/mycontainer/cpu.max
Memory Limits
Hard limit (memory.max): OOM kill if exceeded. Soft limit (memory.high): memory is reclaimed under pressure, but short bursts are allowed.
Container Runtimes
Low-Level Runtimes
runc: Reference implementation of the OCI (Open Container Initiative) runtime spec. Creates and runs containers using namespaces + cgroups. Written in Go.
crun: Faster alternative written in C. Used by Podman.
youki: Container runtime written in Rust.
High-Level Runtimes
containerd: Industry-standard container runtime. Manages container lifecycle (pull images, create containers, manage storage). Used by Docker and Kubernetes.
CRI-O: Lightweight container runtime for Kubernetes. Implements the Container Runtime Interface (CRI).
Docker Architecture
User: docker run nginx
        │
        ▼
Docker CLI → Docker Daemon (dockerd)
                  │
                  ▼
             containerd → runc → Container
                  │
                  ▼
      image management (pull, store)
Docker Components
- Docker CLI: Command-line interface
- Docker Daemon (dockerd): Manages containers, images, networks, volumes
- containerd: Container lifecycle management
- runc: Creates and runs containers (sets up namespaces, cgroups)
OCI Specifications
The Open Container Initiative defines standards:
OCI Image Spec: How container images are formatted (layers, config, manifest).
OCI Runtime Spec: How containers are configured and executed (config.json with namespaces, mounts, cgroups).
OCI Distribution Spec: How images are pushed/pulled from registries.
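For the runtime spec, a rough sketch of what a config.json looks like (heavily trimmed; the real spec has many more required fields — this just shows where namespaces, the rootfs, and cgroup limits plug in):

```json
{
  "ociVersion": "1.0.2",
  "process": {
    "user": { "uid": 0, "gid": 0 },
    "args": ["sh"],
    "cwd": "/"
  },
  "root": { "path": "rootfs", "readonly": true },
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "mount" }
    ],
    "resources": {
      "memory": { "limit": 268435456 }
    }
  }
}
```

runc reads exactly this file (plus the rootfs directory next to it) to create a container.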
Container Images
Layers
A container image is a stack of read-only layers. Each layer adds, modifies, or deletes files.
Layer 3 (top): COPY app.js /app/ (application code)
Layer 2: RUN npm install (dependencies)
Layer 1: RUN apt-get install nodejs (runtime)
Layer 0 (base): ubuntu:22.04 (base OS)
Union file system: Layers are stacked using a union mount. The container sees a merged view.
Overlay Filesystem (OverlayFS)
Container layer (read-write): upperdir
│
Merged view: merged = overlay(lowerdir + upperdir)
│
Image layers (read-only): lowerdir (stacked)
Copy-on-write: First write to a file from a lower layer → copy to upperdir, then modify.
Image Building
FROM rust:1.75-slim
WORKDIR /app
COPY Cargo.toml Cargo.lock ./
RUN cargo fetch
COPY src/ src/
RUN cargo build --release
CMD ["./target/release/myapp"]
Multi-stage builds: Use a build image, then copy only the binary to a minimal runtime image.
FROM rust:1.75 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release
FROM debian:bookworm-slim
COPY --from=builder /app/target/release/myapp /usr/local/bin/
CMD ["myapp"]
Kubernetes Concepts
Architecture
┌────────────────────────────────────────────────┐
│ Control Plane │
│ API Server │ Scheduler │ Controller Manager │
│ etcd (state store) │
└───────────────────┬────────────────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
┌───┴───┐ ┌───┴───┐ ┌───┴───┐
│ Node 1│ │ Node 2│ │ Node 3│
│kubelet│ │kubelet│ │kubelet│
│kube- │ │kube- │ │kube- │
│proxy │ │proxy │ │proxy │
│[Pod] │ │[Pod] │ │[Pod] │
│[Pod] │ │[Pod] │ │[Pod] │
└───────┘ └───────┘ └───────┘
Key Objects
Pod: Smallest deployable unit. One or more containers sharing network and storage.
Service: Stable network endpoint for a set of Pods. Load balancing. DNS name.
Deployment: Declarative updates for Pods. Rolling updates, rollbacks, scaling.
StatefulSet: Like Deployment but for stateful applications. Stable network IDs and persistent storage.
DaemonSet: Ensures a Pod runs on every node (monitoring, logging).
ConfigMap / Secret: External configuration. Injected as env vars or files.
Ingress: HTTP routing from external to Services.
Operator: A custom controller that automates application management (database operators, message-queue operators).
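A sketch tying several of these objects together — a Deployment running three replicas behind a Service (the names web/web-svc and the nginx image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web-svc
spec:
  selector: { app: web }
  ports:
    - port: 80
      targetPort: 80
```

The Service selects Pods by label, so scaling or rolling the Deployment never changes the stable endpoint.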
Kubernetes Patterns
Sidecar: Helper container in the same Pod (logging, monitoring, proxy — Envoy in Istio).
Init container: Run before main containers. Database migrations, config setup.
Ambassador: Proxy container handling external communication.
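The init-container and sidecar patterns can be sketched in one Pod spec (images and the volume name are illustrative placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar
spec:
  initContainers:
    - name: migrate                   # runs to completion before "app" starts
      image: myapp-migrations:latest
  containers:
    - name: app
      image: myapp:latest
      volumeMounts:
        - { name: logs, mountPath: /var/log/app }
    - name: log-shipper               # sidecar: ships the app's logs
      image: fluent-bit:latest
      volumeMounts:
        - { name: logs, mountPath: /var/log/app, readOnly: true }
  volumes:
    - name: logs
      emptyDir: {}                    # shared scratch volume within the Pod
```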
Container Security
Defense in Depth
- Minimal base images: Alpine, distroless, scratch — fewer packages = fewer vulnerabilities.
- Non-root user: Run as non-root inside the container.
- Read-only root filesystem: Prevent writes to the image layer.
- Drop capabilities: Remove unnecessary Linux capabilities (NET_RAW, SYS_ADMIN).
- Seccomp profiles: Restrict syscalls.
- AppArmor / SELinux: MAC policies for containers.
- Network policies: Restrict pod-to-pod communication in Kubernetes.
- Image scanning: Scan images for known vulnerabilities (Trivy, Snyk, Grype).
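Several of these measures map directly onto a Kubernetes securityContext. A hardened-Pod sketch (name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened
spec:
  containers:
    - name: app
      image: myapp:latest
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]               # then add back only what the app needs
        seccompProfile:
          type: RuntimeDefault        # the runtime's default syscall filter
```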
Container Isolation Limits
Containers share the host kernel. A kernel vulnerability can escape the container.
Stronger isolation: gVisor (user-space kernel), Kata Containers (lightweight VMs with container API), Firecracker (micro-VMs used by AWS Lambda).
Applications in CS
- Microservices: Each service in its own container. Independent deployment, scaling, updates.
- CI/CD: Build and test in containers (reproducible environments). GitHub Actions, GitLab CI.
- Development: Docker Compose for local multi-service development. "Works on my machine" → "Works in this container."
- Cloud-native: Kubernetes orchestrates containers at scale. Auto-scaling, self-healing, rolling updates.
- Edge computing: Lightweight containers on IoT devices and edge servers.
- ML/AI: Containerized training and inference. GPU support (NVIDIA Container Toolkit). Reproducible experiments.
- Serverless: AWS Lambda uses Firecracker (container-like micro-VMs). Cloud Run runs containers on demand.