Supervisors and Supervision Trees

A GenServer that crashes is not a bug — it's a design feature, but only if something is watching to restart it. That something is a supervisor. Supervisors are processes whose only job is to start child processes and restart them when they die. Stack supervisors on top of supervisors and you get a supervision tree, which is the actual shape of every Elixir application worth running in production.

WhatsApp's reliability story isn't really about Erlang's syntax or its concurrency primitives. It's about supervision trees. When a single user's session process dies — bad input, network glitch, whatever — the supervisor restarts just that one process, and 50 million other users don't notice. That isolation is what people mean when they say BEAM is built for fault tolerance.

What a Supervisor Actually Does

A supervisor starts a list of children. If any child crashes, the supervisor decides what to do based on its strategy. That's it. Supervisors don't do business logic. They hold no application state, only the bookkeeping for their children. They handle no application messages. They watch and restart.

defmodule MyApp.Supervisor do
  use Supervisor

  def start_link(init_arg) do
    Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
  end

  @impl true
  def init(_init_arg) do
    children = [
      MyApp.Cache,
      MyApp.RateLimiter,
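      # Child ids must be unique. The default id is the module name,
      # so two workers from the same module need explicit ids.
      Supervisor.child_spec({MyApp.Worker, [name: :worker_1]}, id: :worker_1),
      Supervisor.child_spec({MyApp.Worker, [name: :worker_2]}, id: :worker_2)
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end

Each entry in children is a child spec. The shorthand MyApp.Cache is equivalent to {MyApp.Cache, []} and calls MyApp.Cache.child_spec/1, which has a default implementation when you use GenServer. The tuple form {MyApp.Worker, opts} calls MyApp.Worker.child_spec(opts), and the default spec passes opts on to start_link/1. That default spec uses the module name as its id, and ids must be unique within a supervisor, so the two MyApp.Worker entries are wrapped in Supervisor.child_spec/2 to give each its own id.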
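To see what the two forms resolve to, the spec can be built by hand with Supervisor.child_spec/2 and inspected. This is a minimal sketch; Demo.Cache and its :ttl option are hypothetical names used only for illustration.

```elixir
defmodule Demo.Cache do
  # `use GenServer` generates a default child_spec/1 for this module.
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts), do: {:ok, opts}
end

# A bare module name is treated as {Demo.Cache, []}.
shorthand = Supervisor.child_spec(Demo.Cache, [])

# The tuple form hands its second element to child_spec/1.
tuple = Supervisor.child_spec({Demo.Cache, [ttl: 60]}, [])

IO.inspect(shorthand, label: "shorthand")
IO.inspect(tuple, label: "tuple")
```

Both specs share the id Demo.Cache, which is why listing the same module twice in one supervisor needs an explicit id on at least one of the entries.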

Supervision Strategies

There are three strategies, and choosing the right one is the actual design decision.

one_for_one

If a child dies, restart only that child. The other children keep running, untouched.

This is what mix new --sup generates and it's right most of the time. (Elixir has no implicit default here: the :strategy option is required.) Use it when your children are independent. A cache, a rate limiter, and a metrics reporter don't need to know about each other, and one crashing shouldn't disrupt the others.

Supervisor.init(children, strategy: :one_for_one)
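
The isolation can be checked in a script. The sketch below is an illustration, not production code: it uses Agent processes as stand-ins for the cache and rate limiter, and Demo.Tree is a hypothetical helper that kills a child and polls until the supervisor has replaced it.

```elixir
defmodule Demo.Tree do
  # Current pid of the child registered under `id`.
  def pid_of(sup, id) do
    {^id, pid, _type, _mods} = List.keyfind(Supervisor.which_children(sup), id, 0)
    pid
  end

  # Kill a child, then poll until the supervisor has started a replacement.
  def crash(sup, id) do
    old = pid_of(sup, id)
    Process.exit(old, :kill)
    wait_for_replacement(sup, id, old)
  end

  defp wait_for_replacement(sup, id, old) do
    case pid_of(sup, id) do
      pid when is_pid(pid) and pid != old ->
        pid

      _not_yet ->
        Process.sleep(10)
        wait_for_replacement(sup, id, old)
    end
  end
end

children = [
  Supervisor.child_spec({Agent, fn -> :cache end}, id: :cache),
  Supervisor.child_spec({Agent, fn -> :rate_limiter end}, id: :rate_limiter)
]

{:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one)

cache_before = Demo.Tree.pid_of(sup, :cache)
limiter_before = Demo.Tree.pid_of(sup, :rate_limiter)

cache_after = Demo.Tree.crash(sup, :cache)
limiter_after = Demo.Tree.pid_of(sup, :rate_limiter)

IO.puts("cache restarted: #{cache_after != cache_before}")
IO.puts("rate limiter untouched: #{limiter_after == limiter_before}")
```

Expect a supervisor error report on stderr when the child is killed; that is the supervisor logging the crash before it restarts the child.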

rest_for_one

If a child dies, restart it and every child started after it. Children started before it are left alone.

Use this when later children depend on earlier ones. If you have a database connection pool, then a cache that warms from the DB, then a worker that uses the cache — and the cache crashes — the worker is now talking to a stale or empty cache. Better to restart the worker too. The DB pool is fine and stays up.

children = [
  MyApp.DBPool,       # if this crashes, restart everything below
  MyApp.Cache,        # if this crashes, restart Worker too
  MyApp.Worker        # if this crashes, just restart Worker
]
Supervisor.init(children, strategy: :rest_for_one)

The order of the list matters. This is the only strategy where it does (for restart purposes).
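
A small script can confirm which siblings are replaced. This is a sketch under assumptions: the three children are Agent stand-ins with hypothetical ids, and the script waits on a monitor of the last child because the supervisor stops it as part of the same restart.

```elixir
children = [
  Supervisor.child_spec({Agent, fn -> :db_pool end}, id: :db_pool),
  Supervisor.child_spec({Agent, fn -> :cache end}, id: :cache),
  Supervisor.child_spec({Agent, fn -> :worker end}, id: :worker)
]

{:ok, sup} = Supervisor.start_link(children, strategy: :rest_for_one)

# Snapshot of id => pid for every child.
pids = fn ->
  Map.new(Supervisor.which_children(sup), fn {id, pid, _type, _mods} -> {id, pid} end)
end

before = pids.()

# The worker is started after the cache, so it should be stopped too.
worker_ref = Process.monitor(before.worker)
Process.exit(before.cache, :kill)

receive do
  {:DOWN, ^worker_ref, :process, _pid, _reason} -> :ok
after
  1_000 -> raise "worker was never stopped"
end

after_restart = pids.()

IO.puts("db_pool kept: #{after_restart.db_pool == before.db_pool}")
IO.puts("cache replaced: #{after_restart.cache != before.cache}")
IO.puts("worker replaced: #{after_restart.worker != before.worker}")
```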

one_for_all

If any child dies, kill them all and restart everything.

Use this only when the children are tightly coupled — they share state, depend on a coordinated handshake at startup, or there's no meaningful way to recover one without resetting the others. It's a sledgehammer. If you find yourself reaching for it, ask whether your children are really one logical unit that should be redesigned as a single process or a smaller subtree.

Supervisor.init(children, strategy: :one_for_all)

Restart Intensity

A supervisor will give up if its children crash too often. The defaults are 3 restarts in 5 seconds. Exceed that and the supervisor terminates all of its children and then exits itself, propagating the failure up the tree.

Supervisor.init(children,
  strategy: :one_for_one,
  max_restarts: 5,
  max_seconds: 10
)

This is deliberate. If a process keeps crashing every time it's restarted, restarting it harder won't help. The supervisor escalates and lets a higher-level supervisor decide what to do — usually restart a larger chunk of the system, or eventually crash the whole application so the orchestrator (systemd, Kubernetes, whatever) can intervene.
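
The escalation can be observed directly. The following sketch assumes a child that fails every time it starts, modeled as a Task that exits with a made-up :bad_config reason, and a deliberately low limit of 2 restarts in 5 seconds.

```elixir
# Trap exits so this script survives when the linked supervisor gives up.
Process.flag(:trap_exit, true)

children = [
  # Task children default to restart: :temporary, so opt in to restarts.
  Supervisor.child_spec({Task, fn -> exit(:bad_config) end},
    id: :doomed,
    restart: :permanent
  )
]

{:ok, sup} =
  Supervisor.start_link(children,
    strategy: :one_for_one,
    max_restarts: 2,
    max_seconds: 5
  )

# Wait for the supervisor to exceed its restart intensity and exit.
reason =
  receive do
    {:EXIT, ^sup, reason} -> reason
  after
    2_000 -> :still_running
  end

IO.inspect(reason, label: "supervisor exit reason")
```

In a real tree the process receiving that exit would be the parent supervisor, which then applies its own strategy and its own restart limits.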

Child Specs

A child spec tells the supervisor how to start, stop, and restart a child. The full form looks like this:

%{
  id: MyApp.Worker,
  start: {MyApp.Worker, :start_link, [opts]},
  restart: :permanent,
  shutdown: 5_000,
  type: :worker
}

You almost never write this by hand. use GenServer and use Supervisor generate child_spec/1 for you. When you need to override something, do it inline:

children = [
  Supervisor.child_spec({MyApp.Worker, []}, id: :worker_a, restart: :transient),
  Supervisor.child_spec({MyApp.Worker, []}, id: :worker_b, restart: :transient)
]
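
Overrides can also live on the module itself, since options passed to use GenServer become defaults in the generated child_spec/1. The sketch below shows both levels; Demo.Worker and the chosen values are hypothetical.

```elixir
defmodule Demo.Worker do
  # Options given to `use GenServer` become defaults in child_spec/1.
  use GenServer, restart: :transient, shutdown: 10_000

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts), do: {:ok, opts}
end

# Module-level defaults, as the supervisor would see them.
module_level = Demo.Worker.child_spec([])

# Inline overrides win over the module-level defaults; untouched keys are kept.
inline = Supervisor.child_spec({Demo.Worker, []}, id: :worker_a, restart: :temporary)

IO.inspect(module_level, label: "module-level")
IO.inspect(inline, label: "inline override")
```

Module-level options suit settings that are true of every instance of the worker; inline overrides suit settings that depend on where the worker sits in the tree.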

Two settings matter in practice:

restart: :permanent (the default) always restarts, :temporary never restarts, and :transient restarts only on abnormal termination, meaning any exit reason other than :normal, :shutdown, or {:shutdown, term}. Most workers should be :permanent. One-shot tasks that should run once and disappear should be :temporary.

shutdown: how long the supervisor waits for the child to terminate gracefully before killing it. The default is 5000ms for workers and :infinity for supervisors. If your child needs to flush a buffer or finish an in-flight request before dying, set this higher. If it's a stateless worker, you can leave it alone or use :brutal_kill for instant termination.
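
The restart values are easiest to tell apart by watching what the supervisor does after a clean exit. This sketch uses two short-lived Task children with hypothetical ids; the 200ms sleep is a crude way to let both tasks finish.

```elixir
children = [
  # Exits normally: not restarted, but the supervisor keeps its spec.
  Supervisor.child_spec({Task, fn -> :ok end}, id: :transient_task, restart: :transient),
  # Exits normally: not restarted, and the supervisor forgets it entirely.
  Supervisor.child_spec({Task, fn -> :ok end}, id: :temporary_task, restart: :temporary)
]

{:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one)

# Give both tasks time to finish.
Process.sleep(200)

remaining_ids = for {id, _pid, _type, _mods} <- Supervisor.which_children(sup), do: id
counts = Supervisor.count_children(sup)

IO.inspect(remaining_ids, label: "specs still known")
IO.inspect(counts, label: "counts")
```

Had either task been :permanent, the clean exit would have triggered a restart, and a task that finishes instantly would quickly exhaust the supervisor's restart limit.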

The Application Module

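Every Elixir app that runs its own processes has an application module whose start/2 callback starts the root of its supervision tree. mix new --sup my_app generates the skeleton with an empty children list; filled in for a Phoenix-style app, it looks like this: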

defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      MyApp.Repo,
      {Phoenix.PubSub, name: MyApp.PubSub},
      MyAppWeb.Endpoint,
      MyApp.Worker
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

When you start your app (via mix run, iex -S mix, or a release), the runtime's application controller calls MyApp.Application.start/2, which starts the root supervisor, which starts everything else. When you stop the app, the root supervisor shuts down its children in reverse start order.
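
The link between the app and its application module is the mod: entry in mix.exs. A sketch of the relevant file, with placeholder values for the version and Elixir requirement:

```elixir
# mix.exs (sketch; version and Elixir requirement are placeholders)
defmodule MyApp.MixProject do
  use Mix.Project

  def project do
    [app: :my_app, version: "0.1.0", elixir: "~> 1.14", deps: []]
  end

  def application do
    [
      # Names the module whose start/2 is called on boot,
      # and the argument passed as its second parameter.
      mod: {MyApp.Application, []},
      extra_applications: [:logger]
    ]
  end
end
```

Without a mod: entry the app is a plain library: it can still be a dependency, but nothing is started for it at boot.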

Real Supervision Tree

Here's a realistic structure for a job processing service:

MyApp.Supervisor (one_for_one)
├── MyApp.Repo                       (DB connection pool)
├── {Phoenix.PubSub, name: MyApp.PubSub}
├── MyApp.JobsSupervisor (rest_for_one)
│   ├── MyApp.JobQueue              (GenServer holding queue state)
│   ├── MyApp.WorkerPool (one_for_one)
│   │   ├── MyApp.Worker (id: 1)
│   │   ├── MyApp.Worker (id: 2)
│   │   └── MyApp.Worker (id: 3)
│   └── MyApp.JobMonitor            (reports stuck jobs)
└── MyAppWeb.Endpoint               (Phoenix HTTP server)

The reasoning:

The root uses one_for_one because the database, jobs subsystem, and HTTP endpoint are independent. If Phoenix crashes, jobs keep processing.
JobsSupervisor uses rest_for_one because the workers and monitor depend on the queue. Queue dies, restart the workers and monitor too — they were holding references to a dead process anyway.
WorkerPool uses one_for_one because workers are independent. Worker 2 crashing has nothing to do with workers 1 and 3.

This isn't theoretical. Bleacher Report's real-time push system uses essentially this shape, with thousands of worker processes under a tree of supervisors.

Strategies in Code

defmodule MyApp.JobsSupervisor do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    children = [
      MyApp.JobQueue,
      MyApp.WorkerPool,
      MyApp.JobMonitor
    ]

    Supervisor.init(children, strategy: :rest_for_one)
  end
end

defmodule MyApp.WorkerPool do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    children =
      for id <- 1..3 do
        Supervisor.child_spec({MyApp.Worker, [id: id]}, id: {:worker, id})
      end

    Supervisor.init(children, strategy: :one_for_one)
  end
end

Each level has one job. Each level can be reasoned about in isolation. That's the architectural payoff.
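
One way to check that a running tree has the shape you intended is to walk it with Supervisor.which_children/1. The sketch below builds a scaled-down version of the jobs subtree from Agent stand-ins; Demo.TreePrinter and the ids are hypothetical.

```elixir
defmodule Demo.TreePrinter do
  # Recursively collect {depth, id, type} for every process under `sup`.
  def walk(sup, depth \\ 0) do
    sup
    |> Supervisor.which_children()
    |> Enum.flat_map(fn
      {id, pid, :supervisor, _mods} when is_pid(pid) ->
        [{depth, id, :supervisor} | walk(pid, depth + 1)]

      {id, _pid, type, _mods} ->
        [{depth, id, type}]
    end)
  end
end

workers =
  for n <- 1..3 do
    Supervisor.child_spec({Agent, fn -> n end}, id: {:worker, n})
  end

# A nested supervisor described with a full child spec map.
pool = %{
  id: :worker_pool,
  type: :supervisor,
  start: {Supervisor, :start_link, [workers, [strategy: :one_for_one]]}
}

children = [
  Supervisor.child_spec({Agent, fn -> [] end}, id: :job_queue),
  pool,
  Supervisor.child_spec({Agent, fn -> nil end}, id: :job_monitor)
]

{:ok, root} = Supervisor.start_link(children, strategy: :rest_for_one)

for {depth, id, type} <- Demo.TreePrinter.walk(root) do
  IO.puts(String.duplicate("  ", depth) <> "#{inspect(id)} (#{type})")
end
```

The printed order follows whatever which_children/1 returns, so this sketch does not rely on it matching start order.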

Common Pitfalls

Treating supervision as exception handling. A supervisor that restarts a process every 50ms because the process keeps hitting a bad config value isn't fault tolerance — it's a hot loop. Supervisors handle transient failures, not bugs. If something always fails, the supervisor will give up and crash, which is the correct behavior.

One giant supervisor at the root. Putting every process under the application supervisor flattens your tree and loses the strategy distinctions. Group related processes under sub-supervisors. The structure of the tree should mirror the structure of your system.

Wrong strategy for the shape. Choosing one_for_all because you couldn't be bothered to think about dependencies, or one_for_one for processes that obviously need to restart together. The strategy choice is the design. Get it wrong and recovery doesn't work the way you think it does.

Long init/1 callbacks blocking startup. A supervisor's start_link waits for every child's init/1 to return before starting the next sibling. If one child's init/1 takes 30 seconds, your whole boot is slow. Push expensive work to handle_continue/2.

Ignoring :transient and :temporary. Not every process should restart forever. A migration runner should be :transient (restart on crash, not on success). A one-shot Task spawned for a single request should be :temporary. The default :permanent is wrong for a non-trivial fraction of processes.
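
For the long init/1 pitfall above, the usual shape is to return from init/1 immediately and do the slow work in handle_continue/2. A minimal sketch, assuming a hypothetical Demo.SlowCache whose expensive load is simulated with a 500ms sleep:

```elixir
defmodule Demo.SlowCache do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  def get, do: GenServer.call(__MODULE__, :get)

  @impl true
  def init(_opts) do
    # Return immediately so the supervisor can move on to the next sibling.
    {:ok, :loading, {:continue, :warm_up}}
  end

  @impl true
  def handle_continue(:warm_up, :loading) do
    # Stand-in for an expensive load (hypothetical 500 ms delay).
    Process.sleep(500)
    {:noreply, {:ready, %{answer: 42}}}
  end

  @impl true
  def handle_call(:get, _from, state), do: {:reply, state, state}
end

# Time how long the supervisor takes to start its child.
{elapsed_us, {:ok, _sup}} =
  :timer.tc(fn -> Supervisor.start_link([Demo.SlowCache], strategy: :one_for_one) end)

IO.puts("supervisor started in #{div(elapsed_us, 1000)} ms")

# The continue step runs before any queued message, so this call waits for it.
state = Demo.SlowCache.get()
IO.inspect(state, label: "state after warm-up")
```

The trade-off is that the process is registered but busy during warm-up: early callers block until the continue step finishes, so their call timeouts need to allow for it.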

Key Takeaways

Supervisors restart children when they crash. They don't do business logic.
one_for_one is the usual choice and usually correct (the :strategy option itself is required). rest_for_one for dependent siblings. one_for_all for tightly-coupled groups.
The Application module is the root of every Elixir app's supervision tree.
Restart intensity (max_restarts / max_seconds) prevents hot-loop crash-restart cycles by escalating failure up the tree.
Use sub-supervisors to model dependencies in shape, not in code.
Set restart: deliberately. The choice between :permanent, :transient, and :temporary matters most for processes that are not long-running workers, such as one-shot tasks.