DynamicSupervisor Deep Dive

A regular Supervisor knows its children at boot time — you hand it a list, it starts them, it watches them. DynamicSupervisor is the opposite. It starts with zero children and you add them at runtime, one at a time, as the world demands them. This is the supervision pattern behind chat sessions, per-user worker processes, IoT device handlers, and any "process per entity" architecture.

WhatsApp ran something like two million per-user processes per node at peak. Discord routes every active conversation through a dynamically supervised gateway process. Nerves spawns a supervised process per attached sensor as devices come online. The shape is always the same: an event arrives, you ask "do I have a process for this thing yet?", and if not, you spin one up under a DynamicSupervisor.

The Mental Model

A DynamicSupervisor is a supervisor with a fixed strategy of :one_for_one and an empty child list. Its sole job is to receive start_child/2 calls, start the child you describe, stay linked to it, and restart it according to the :restart value you specified on the child spec.

defmodule MyApp.WorkerSupervisor do
  use DynamicSupervisor

  def start_link(opts) do
    DynamicSupervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    DynamicSupervisor.init(strategy: :one_for_one, max_children: 10_000)
  end
end

That's the whole supervisor. init/1 returns no child list, because there is no static child list. The strategy is locked to :one_for_one; the other strategies don't make sense when children are unrelated peers. max_children is optional (the default is :infinity) but worth setting; we'll come back to it.

To start a child, you call DynamicSupervisor.start_child/2 with the parent and a child spec:

def start_worker(user_id) do
  spec = {MyApp.Worker, user_id}
  DynamicSupervisor.start_child(MyApp.WorkerSupervisor, spec)
end

The spec follows the standard child spec shape: a tuple {module, arg}, a %{id: ..., start: {...}, restart: ...} map, or whatever child_spec/1 returns from the module. Behind the scenes, DynamicSupervisor calls the spec's start function (by default your module's start_link/1 with that argument). Because the child is started with start_link, it is linked to the supervisor, which traps exits and so learns when the child dies.
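As a minimal sketch of those three shapes, assuming a hypothetical Demo.SpecWorker module that is not part of the original example, each call below starts an equivalent child under an unnamed DynamicSupervisor:

```elixir
# Hypothetical worker used only for this illustration.
defmodule Demo.SpecWorker do
  use GenServer

  def start_link(user_id), do: GenServer.start_link(__MODULE__, user_id)

  @impl true
  def init(user_id), do: {:ok, user_id}
end

{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

# 1. Tuple form: the supervisor calls Demo.SpecWorker.child_spec("u1") for us.
{:ok, _a} = DynamicSupervisor.start_child(sup, {Demo.SpecWorker, "u1"})

# 2. Explicit map form: only :id and :start are required.
map_spec = %{id: Demo.SpecWorker, start: {Demo.SpecWorker, :start_link, ["u2"]}}
{:ok, _b} = DynamicSupervisor.start_child(sup, map_spec)

# 3. Expanding the tuple by hand yields the same kind of map as in (2).
expanded = Supervisor.child_spec({Demo.SpecWorker, "u3"}, [])
{:ok, _c} = DynamicSupervisor.start_child(sup, expanded)

counts = DynamicSupervisor.count_children(sup)
IO.inspect(counts, label: "children")
```

All three children share the same :id here. That is fine under a DynamicSupervisor, which identifies children by pid rather than by id.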

start_child Return Values

start_child/2 returns {:ok, pid}, {:ok, pid, info}, {:error, {:already_started, pid}} (the child tried to register a name that was already taken, typically via Registry), {:error, :max_children} when the supervisor is at its limit, {:error, reason} for any other start failure, or :ignore if the child's start function returned :ignore (for example because init/1 did). Production code has to handle at least the first three. The :already_started case is critical when you're racing on a Registry-backed name: two callers ask for the same worker, both call start_child, only one wins, and the loser gets back the winner's pid.

def get_or_start_worker(user_id) do
  case start_worker(user_id) do
    {:ok, pid} -> {:ok, pid}
    {:error, {:already_started, pid}} -> {:ok, pid}
    error -> error
  end
end

This shape is so common that you should write it once as a helper and call it from everywhere. The next chapter covers the Registry pattern that makes this race-safe.
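One possible shape for that helper, collapsing every return value listed above into {:ok, pid} or {:error, reason}. Demo.StartResult and the :ignored reason are illustrative names chosen for this sketch, not part of any library:

```elixir
defmodule Demo.StartResult do
  @moduledoc "Hypothetical helper: normalizes DynamicSupervisor.start_child/2 results."

  # Fresh start, with or without the extra info term.
  def normalize({:ok, pid}), do: {:ok, pid}
  def normalize({:ok, pid, _info}), do: {:ok, pid}

  # Lost the race on a registered name: the existing worker is just as good.
  def normalize({:error, {:already_started, pid}}), do: {:ok, pid}

  # The child declined to start; surface that as an error the caller can match on.
  def normalize(:ignore), do: {:error, :ignored}

  # Everything else (including :max_children) passes through unchanged.
  def normalize({:error, reason}), do: {:error, reason}
end

pid = self()
IO.inspect(Demo.StartResult.normalize({:error, {:already_started, pid}}))
IO.inspect(Demo.StartResult.normalize({:error, :max_children}))
```

Callers then pipe the raw result through it, for example `DynamicSupervisor.start_child(sup, spec) |> Demo.StartResult.normalize()`.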

Restart Strategies in DynamicSupervisor Context

The :restart option lives on the child spec, not the supervisor. It takes three values, and the right choice depends on what the child represents.

:permanent — restart on any exit, normal or abnormal. The default. Almost never what you want for dynamic children. A permanent child that exits cleanly (because, say, the user logged off) gets immediately restarted, defeating the point of dynamic supervision.

:transient — restart only on abnormal exit. If the child exits with reason :normal, :shutdown, or {:shutdown, _}, the supervisor lets it stay dead. If it crashes, it gets restarted. This is the right default for most session-style workers.

:temporary — never restart. The supervisor still monitors the child and cleans up, but if the child dies for any reason, that's the end. Use this when restarting would be wrong — for example, a one-shot job worker that has already done its work, or a process whose state is meaningless after a crash.

defmodule MyApp.SessionWorker do
  use GenServer, restart: :transient

  def start_link(user_id) do
    GenServer.start_link(__MODULE__, user_id)
  end

  # ... callbacks
end

The restart: :transient argument to use GenServer sets the default in the generated child_spec/1. You can override it per call with Supervisor.child_spec/2 if needed.
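A small sketch of a per-call override, assuming a hypothetical Demo.SessionWorker that mirrors the module above. Supervisor.child_spec/2 takes the tuple form plus a keyword list of fields to replace:

```elixir
# Hypothetical stand-in for MyApp.SessionWorker so the snippet runs on its own.
defmodule Demo.SessionWorker do
  use GenServer, restart: :transient

  def start_link(user_id), do: GenServer.start_link(__MODULE__, user_id)

  @impl true
  def init(user_id), do: {:ok, user_id}
end

# The module-level default generated by `use GenServer, restart: :transient`.
default_spec = Demo.SessionWorker.child_spec("u1")

# The same worker, but never restarted for this one call site.
one_shot_spec = Supervisor.child_spec({Demo.SessionWorker, "u1"}, restart: :temporary)

{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)
{:ok, _pid} = DynamicSupervisor.start_child(sup, one_shot_spec)

IO.inspect({default_spec.restart, one_shot_spec.restart})
```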

The trap: :permanent workers under a DynamicSupervisor with no termination logic. Every clean disconnect triggers a restart, and once restarts exceed the supervisor's intensity limit (by default 3 restarts within 5 seconds), the supervisor itself shuts down, taking every sibling with it.
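The difference is easy to observe. The sketch below uses two hypothetical modules, Demo.Reporter and Demo.RestartProbe: the worker reports every (re)start to the calling process, and the probe counts how many starts follow a clean exit for a given :restart value.

```elixir
defmodule Demo.Reporter do
  use GenServer

  def start_link(parent), do: GenServer.start_link(__MODULE__, parent)

  @impl true
  def init(parent) do
    # Announce every start, including restarts performed by the supervisor.
    send(parent, {:started, self()})
    {:ok, parent}
  end
end

defmodule Demo.RestartProbe do
  # Starts one worker, stops it with reason :normal, and returns how many
  # times it was started in total (the initial start plus any restarts).
  def starts_after_normal_exit(sup, restart) do
    spec = Supervisor.child_spec({Demo.Reporter, self()}, restart: restart)
    {:ok, pid} = DynamicSupervisor.start_child(sup, spec)
    :ok = GenServer.stop(pid, :normal)
    count_starts(0)
  end

  # Drain {:started, _} messages until none arrive for 200 ms.
  defp count_starts(n) do
    receive do
      {:started, _pid} -> count_starts(n + 1)
    after
      200 -> n
    end
  end
end

{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

transient_starts = Demo.RestartProbe.starts_after_normal_exit(sup, :transient)
permanent_starts = Demo.RestartProbe.starts_after_normal_exit(sup, :permanent)

IO.inspect(transient: transient_starts, permanent: permanent_starts)
```

The transient worker is started once and stays dead; the permanent one comes straight back, which is exactly the restart loop described above when it happens on every disconnect.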

max_children and Back-Pressure

DynamicSupervisor.init/1 accepts a :max_children option. When you hit the limit, start_child/2 returns {:error, :max_children}. This is the cheapest, dumbest back-pressure mechanism in OTP and you should almost always set it.

DynamicSupervisor.init(strategy: :one_for_one, max_children: 50_000)

Without a limit, a runaway loop or a traffic spike can spawn millions of processes before something else breaks. The BEAM can technically handle that, but every other system around it — your database connection pool, your downstream APIs, your monitoring — cannot. A :max_children ceiling turns "OOM crash" into "graceful 503."

Pick the number based on what your downstream can tolerate. If each worker holds a database connection, the limit is your pool size minus headroom. If each worker is purely in-memory, you can go much higher — Discord runs hundreds of thousands per node. Measure before you guess.
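To see the ceiling in action, here is a minimal sketch with a deliberately tiny limit of two children. It uses Agent processes as placeholder workers, which is an assumption of this example rather than anything the pattern requires:

```elixir
{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one, max_children: 2)

# Placeholder worker: any child spec would do.
spec = {Agent, fn -> :ok end}

# Two starts fit under the limit; the third is rejected without starting anything.
results = for _ <- 1..3, do: DynamicSupervisor.start_child(sup, spec)
[{:ok, first}, {:ok, _second}, rejected] = results

# Freeing a slot makes room again.
:ok = DynamicSupervisor.terminate_child(sup, first)
retry = DynamicSupervisor.start_child(sup, spec)

IO.inspect(rejected, label: "third start")
IO.inspect(retry, label: "after freeing a slot")
```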

The other side of back-pressure is what happens when the limit hits. The supervisor itself does not queue or retry — that's the caller's job. The standard pattern:

def try_start_worker(user_id, retries \\ 3) do
  case start_worker(user_id) do
    {:ok, pid} -> {:ok, pid}
    {:error, {:already_started, pid}} -> {:ok, pid}
    {:error, :max_children} when retries > 0 ->
      Process.sleep(50 + :rand.uniform(50))
      try_start_worker(user_id, retries - 1)
    error -> error
  end
end

For HTTP endpoints, return 503 — push back to the client rather than retrying forever.
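A rough sketch of that mapping, independent of any web framework. Demo.StartStatus and the choice of 500 for unexpected failures are assumptions made for illustration:

```elixir
defmodule Demo.StartStatus do
  @moduledoc "Hypothetical mapping from start_child/2 results to HTTP status codes."

  # A worker exists, whether we started it or somebody else did.
  def http_status({:ok, _pid}), do: 200
  def http_status({:ok, _pid, _info}), do: 200
  def http_status({:error, {:already_started, _pid}}), do: 200

  # At capacity: tell the client to back off instead of retrying inside the request.
  def http_status({:error, :max_children}), do: 503

  # Anything else is an unexpected start failure.
  def http_status(_other), do: 500
end

IO.inspect(Demo.StartStatus.http_status({:error, :max_children}))
```

In a real endpoint the 503 branch would typically also set a Retry-After header so well-behaved clients spread out their retries.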

Walked Example: Per-User Worker Spawned on First Message

Here's the full pattern, end to end. A worker process exists per user, spawned the first time a message arrives for them, kept alive while there's activity, dies after a configurable idle timeout.

defmodule Chat.UserWorker do
  use GenServer, restart: :transient

  @idle_timeout :timer.minutes(15)

  def start_link(user_id) do
    GenServer.start_link(__MODULE__, user_id, name: via(user_id))
  end

  def send_message(user_id, msg) do
    GenServer.cast(via(user_id), {:message, msg})
  end

  defp via(user_id), do: {:via, Registry, {Chat.UserRegistry, user_id}}

  @impl true
  def init(user_id) do
    {:ok, %{user_id: user_id, inbox: []}, @idle_timeout}
  end

  @impl true
  def handle_cast({:message, msg}, state) do
    {:noreply, %{state | inbox: [msg | state.inbox]}, @idle_timeout}
  end

  @impl true
  def handle_info(:timeout, state) do
    flush(state)
    {:stop, :normal, state}
  end

  defp flush(_state), do: :ok
end

The supervisor:

defmodule Chat.UserWorkerSupervisor do
  use DynamicSupervisor

  def start_link(opts) do
    DynamicSupervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def start_worker(user_id) do
    DynamicSupervisor.start_child(__MODULE__, {Chat.UserWorker, user_id})
  end

  @impl true
  def init(_opts) do
    DynamicSupervisor.init(strategy: :one_for_one, max_children: 100_000)
  end
end

The entry point that callers use:

defmodule Chat.Router do
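  def deliver(user_id, message) do
    # Only cast once a worker is known to exist; hand start errors back to the caller.
    with :ok <- ensure_worker(user_id) do
      Chat.UserWorker.send_message(user_id, message)
    end
  end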

  defp ensure_worker(user_id) do
    case Chat.UserWorkerSupervisor.start_worker(user_id) do
      {:ok, _pid} -> :ok
      {:error, {:already_started, _pid}} -> :ok
      {:error, reason} -> {:error, reason}
    end
  end
end

Three pieces working together: the Registry handles naming, the DynamicSupervisor handles lifecycle, the GenServer handles behavior. The idle timeout uses GenServer's built-in timeout return: if no message arrives within the window, handle_info(:timeout, state) fires. Because the worker is :transient, exiting with :normal lets it stay dead. Registry removes the entry automatically once the process exits (the cleanup is asynchronous, so it can lag the exit by a moment); the next message for this user starts fresh.
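The example assumes a unique-keyed Registry named Chat.UserRegistry is already running. A minimal sketch of that wiring follows, using stand-in Demo modules so the snippet runs on its own; in the real application the children would be the Registry plus Chat.UserWorkerSupervisor. The :rest_for_one strategy is one reasonable choice here, not the only one.

```elixir
# Hypothetical stand-in for Chat.UserWorker.
defmodule Demo.UserWorker do
  use GenServer, restart: :transient

  def start_link(user_id) do
    GenServer.start_link(__MODULE__, user_id, name: via(user_id))
  end

  def via(user_id), do: {:via, Registry, {Demo.UserRegistry, user_id}}

  @impl true
  def init(user_id), do: {:ok, user_id}
end

children = [
  # The Registry must be up before any worker tries to register a name in it.
  {Registry, keys: :unique, name: Demo.UserRegistry},
  {DynamicSupervisor, name: Demo.UserWorkerSupervisor, strategy: :one_for_one}
]

# :rest_for_one: if the Registry restarts, the worker supervisor (and its
# workers, whose registrations were lost) restarts after it.
{:ok, _tree} = Supervisor.start_link(children, strategy: :rest_for_one)

first = DynamicSupervisor.start_child(Demo.UserWorkerSupervisor, {Demo.UserWorker, "u1"})
second = DynamicSupervisor.start_child(Demo.UserWorkerSupervisor, {Demo.UserWorker, "u1"})

IO.inspect(first, label: "first start")
IO.inspect(second, label: "second start")
```

The second call loses the name race and comes back as {:error, {:already_started, pid}} carrying the first worker's pid, which is the case ensure_worker/1 treats as success.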

The Lifetime Question

"When does the worker die?" is the single hardest question in dynamic supervision. Three common answers, each with a different shape:

Idle timeout, as above. The worker dies when nothing has happened for N minutes. Cheap to implement, easy to reason about. Works well for chat sessions, user worker pools, anything where activity is bursty.

Explicit shutdown. Something, usually a coordinator process or a helper function on the supervisor module, calls DynamicSupervisor.terminate_child/2 with the worker's pid when the entity is done. Works for things like game sessions (game ends, kill the worker) or job workers (job done, exit). Requires an external signal.

Death on disconnect. The worker monitors something else — a Phoenix Channel, an external connection, the user's session pid — and exits when that thing dies. Common for processes that exist to mirror an external connection.

def init({user_id, conn_pid}) do
  Process.monitor(conn_pid)
  {:ok, %{user_id: user_id}}
end

def handle_info({:DOWN, _ref, :process, _pid, _reason}, state) do
  {:stop, :normal, state}
end

The worst answer is "it dies when it crashes." If the only way a worker exits is by crashing, your system has a memory leak shaped like supervised processes. Decide explicitly when each worker should die.
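For the explicit-shutdown option, a sketch of what the external signal can look like. Demo.Sessions, its registry and supervisor names, and the use of Agent as the worker are all assumptions of this example:

```elixir
defmodule Demo.Sessions do
  @registry Demo.SessionRegistry
  @supervisor Demo.SessionSupervisor

  # Children to place in the application's supervision tree.
  def children do
    [
      {Registry, keys: :unique, name: @registry},
      {DynamicSupervisor, name: @supervisor, strategy: :one_for_one}
    ]
  end

  def start(id) do
    name = {:via, Registry, {@registry, id}}

    spec = %{
      id: id,
      start: {Agent, :start_link, [fn -> %{} end, [name: name]]},
      restart: :transient
    }

    DynamicSupervisor.start_child(@supervisor, spec)
  end

  # Resolve the pid through the Registry, then ask the supervisor to stop it.
  def stop(id) do
    case Registry.lookup(@registry, id) do
      [{pid, _value}] -> DynamicSupervisor.terminate_child(@supervisor, pid)
      [] -> {:error, :not_found}
    end
  end
end

{:ok, _tree} = Supervisor.start_link(Demo.Sessions.children(), strategy: :rest_for_one)

{:ok, pid} = Demo.Sessions.start("game-1")
stopped = Demo.Sessions.stop("game-1")
alive_after_stop = Process.alive?(pid)

IO.inspect({stopped, alive_after_stop})
```

terminate_child/2 is synchronous and removes the child from the supervisor, so the worker is not restarted regardless of its :restart value.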

Graceful Shutdown

The :shutdown value on the child spec controls how long the supervisor waits for the child to terminate before killing it. Default is 5 seconds, which is fine for in-memory workers but wrong for workers that need to flush state to disk or finish in-flight requests.

use GenServer, restart: :transient, shutdown: 30_000

Define a terminate/2 callback to do the cleanup:

@impl true
def terminate(_reason, state) do
  flush_inbox_to_storage(state)
  :ok
end

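Caveats: terminate/2 is not guaranteed to run. When the supervisor shuts a child down it sends an exit signal, and a GenServer only turns that signal into a terminate/2 call if it traps exits (Process.flag(:trap_exit, true) in init/1); otherwise the process dies on the spot. It also never runs after a brutal kill, which is what the supervisor falls back to when the :shutdown timeout expires. For real durability, write through to storage on every state change, not just on shutdown. And at deploy time a DynamicSupervisor shuts down all of its children at the same time: with 100,000 workers flushing at once, your storage takes the whole burst, and any worker that misses its timeout is killed mid-flush. Either keep cleanup short or accept that some workers get killed without graceful shutdown, which is usually the right call for stateless or write-through workers.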
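A minimal sketch of a worker whose terminate/2 does run on supervisor shutdown because it traps exits. Demo.FlushingWorker is hypothetical, and sending a message to the caller stands in for the real flush to storage:

```elixir
defmodule Demo.FlushingWorker do
  use GenServer, restart: :transient, shutdown: 5_000

  def start_link(report_to), do: GenServer.start_link(__MODULE__, report_to)

  @impl true
  def init(report_to) do
    # Trapping exits turns the supervisor's :shutdown signal into a
    # terminate/2 call instead of an immediate exit.
    Process.flag(:trap_exit, true)
    {:ok, %{report_to: report_to, inbox: [:pending]}}
  end

  @impl true
  def terminate(reason, state) do
    # Stand-in for flushing to storage: report what would have been written.
    send(state.report_to, {:flushed, reason, state.inbox})
    :ok
  end
end

{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)
{:ok, pid} = DynamicSupervisor.start_child(sup, {Demo.FlushingWorker, self()})

# Same exit signal the supervisor sends to each child when it shuts down.
:ok = DynamicSupervisor.terminate_child(sup, pid)

flushed =
  receive do
    {:flushed, reason, inbox} -> {reason, inbox}
  after
    1_000 -> :no_flush
  end

IO.inspect(flushed, label: "terminate/2 saw")
```

Removing the Process.flag/2 line makes the receive fall through to :no_flush, because the worker exits before terminate/2 is ever invoked.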

DynamicSupervisor vs PartitionSupervisor

A single DynamicSupervisor can bottleneck. Every start_child is a GenServer.call to one supervisor process. At tens of thousands of starts per second, that mailbox becomes the hot spot. PartitionSupervisor (Elixir 1.14+) shards the work across N supervisors hash-distributed by key:

{PartitionSupervisor,
  child_spec: DynamicSupervisor,
  name: MyApp.WorkerSupervisors,
  partitions: System.schedulers_online()}

# routing a start to a partition by key
partition = {:via, PartitionSupervisor, {MyApp.WorkerSupervisors, user_id}}
DynamicSupervisor.start_child(partition, {MyApp.Worker, user_id})

Discord moved to this once their per-guild worker counts outgrew a single supervisor. For most apps under a few thousand starts per second, one DynamicSupervisor is fine — measure before sharding.
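A runnable sketch of the routing, assuming Elixir 1.14 or later. Demo.Partitions, the fixed count of four partitions, and the Agent placeholder workers are choices made for this example:

```elixir
{:ok, _sup} =
  PartitionSupervisor.start_link(
    child_spec: DynamicSupervisor,
    name: Demo.Partitions,
    partitions: 4
  )

# Each start is routed by key to one of the four DynamicSupervisors.
for user_id <- 1..20 do
  via = {:via, PartitionSupervisor, {Demo.Partitions, user_id}}
  {:ok, _pid} = DynamicSupervisor.start_child(via, {Agent, fn -> user_id end})
end

# Ask every partition how many workers it ended up with.
per_partition =
  for {partition, pid, _type, _modules} <- PartitionSupervisor.which_children(Demo.Partitions) do
    {partition, DynamicSupervisor.count_children(pid).active}
  end

total = per_partition |> Enum.map(fn {_partition, active} -> active end) |> Enum.sum()

IO.inspect(per_partition, label: "workers per partition")
IO.inspect(total, label: "total workers")
```

Each partition is an independent DynamicSupervisor, so options such as :max_children apply per partition rather than to the group as a whole.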

Common Pitfalls

Using :permanent for dynamic children. Default GenServer restart is :permanent, so unless you override it, every worker that exits normally gets immediately restarted. For session-style workers this defeats the entire point. Set restart: :transient on the child spec.

No :max_children. A bug in your routing code spawns processes in a loop. Without a ceiling, you OOM the node before alerts fire. Always set a limit, even a generous one — it's a circuit breaker for free.

Treating the supervisor as the entrypoint. Callers shouldn't be calling start_child directly. Wrap it in a "find or start" function so the race against an existing worker is handled in one place. Spreading start_child calls across your codebase guarantees somebody forgets the :already_started clause.

Forgetting that Registry entries die with the process. When a worker exits, Registry removes its entry automatically. New start_child for the same key will succeed. If you're caching pids in client code, they'll point to dead processes, so always look up through Registry or the manager function.

Long shutdown timeouts at scale. A 30-second shutdown is fine for one worker. With 50,000 workers during a deploy, the supervisor's exit can take the full 30 seconds while every terminate/2 callback hits storage at the same moment. Keep shutdowns short or design around losing in-flight work.

Spawning a DynamicSupervisor per entity. "I'll give each user their own DynamicSupervisor" usually means you're recreating a regular Supervisor pattern incorrectly. The shape is one DynamicSupervisor per kind of worker, not per instance.

Key Takeaways

  • DynamicSupervisor exists for "children I don't know about at boot." Chat sessions, per-user workers, IoT device handlers — anything spawned on demand.
  • Strategy is locked to :one_for_one; children are independent peers. Set :max_children as cheap back-pressure.
  • restart: :transient is the default you want for most dynamic children — restart on crash, stay dead on normal exit.
  • The {:ok, pid} / {:error, {:already_started, pid}} dance happens whenever you pair with Registry. Wrap it in a "find or start" helper.
  • "When does the worker die?" is the hardest design question. Idle timeout, explicit shutdown, or monitor-and-die are the three honest answers.
  • For graceful shutdown, set shutdown: ms on the child spec, trap exits, and implement terminate/2, but don't lean on it for durability.
  • A single DynamicSupervisor handles thousands of starts per second. Past that, PartitionSupervisor shards across multiple supervisors.