Agent Deep Dive

The survey treatment of Agent — start_link, get, update — covers maybe 30 percent of what you actually need to know to use one in production without regretting it. This chapter is the other 70: the atomic read-modify-write pattern, the slow-function trap that quietly serializes your whole system, why cast is rarely the answer, and how to supervise agents so they survive the boot order.

Agent is a thin GenServer wrapper. That framing matters. Every behavior, every gotcha, every performance characteristic of Agent comes from "it's a GenServer with a fixed callback shape." If you internalize that, the rest follows.

What Agent Actually Is

The Agent module is roughly a hundred lines on top of GenServer. It exposes four operations that map directly to GenServer callbacks:

  • Agent.get/2 is a synchronous handle_call that returns the result of running your function on the state.
  • Agent.update/2 is a synchronous handle_call that replaces the state with your function's return value.
  • Agent.get_and_update/2 is a single handle_call that does both in one trip.
  • Agent.cast/2 is a handle_cast — fire-and-forget update.

The functions you pass run inside the agent process, not the caller. That single fact is the source of every Agent pitfall. The agent has one mailbox, processes one message at a time, and your function holds it captive for as long as the function runs.
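A quick way to convince yourself of this is to compare pids. The sketch below uses a throwaway anonymous agent, and the variable names are illustrative only:

```elixir
# Start a throwaway agent holding an empty map.
{:ok, agent} = Agent.start_link(fn -> %{} end)

caller_pid = self()

# self() is evaluated by whichever process runs the function,
# and that process is the agent, not the caller.
pid_inside_fun = Agent.get(agent, fn _state -> self() end)

IO.inspect(pid_inside_fun == agent, label: "ran inside the agent")
IO.inspect(pid_inside_fun == caller_pid, label: "ran inside the caller")
```

The same applies to anything else that depends on the current process: the process dictionary, the mailbox, and links all belong to the agent while your function runs.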

defmodule FeatureFlags do
  use Agent

  def start_link(_opts) do
    Agent.start_link(fn -> load_initial_flags() end, name: __MODULE__)
  end

  def enabled?(flag), do: Agent.get(__MODULE__, &Map.get(&1, flag, false))
  def set(flag, val), do: Agent.update(__MODULE__, &Map.put(&1, flag, val))

  defp load_initial_flags, do: %{"new_checkout" => true}
end

This is the canonical shape. Notice the initializer is a function, not a value. The function runs inside the agent process, but start_link does not return until it has finished, so a slow load blocks whoever called start_link. Under a supervisor, that caller is the supervisor itself, which means a slow load stalls boot (more on that later).

When Agent Is the Right Tool

Reach for Agent when all three of these are true:

  1. You need shared mutable state.
  2. The operations on that state are pure functions of the state — no timers, no incoming messages from elsewhere, no side effects.
  3. Contention is low — a handful of writes per second, maybe a few hundred reads.

A feature flag cache fits perfectly. A small in-memory config that admins update once an hour fits. A counter that bumps on rare events fits.

The moment you need a timer (Process.send_after), a handle_info callback, a terminate cleanup hook, or any reactive behavior, you have outgrown Agent. At that point, "Agent plus a side process to send it messages" is a worse GenServer. Just write the GenServer.
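For a sense of what "just write the GenServer" costs, here is a minimal sketch of a counter that clears itself every window. WindowCounter, bump/1, and the :window_ms option are hypothetical names chosen for illustration; the point is only that the timer and the state live in one process:

```elixir
defmodule WindowCounter do
  # Hypothetical module: a per-key counter that resets itself on a timer.
  use GenServer

  def start_link(opts) do
    window_ms = Keyword.get(opts, :window_ms, 60_000)
    GenServer.start_link(__MODULE__, window_ms, name: __MODULE__)
  end

  # Increments the counter for `key` and returns the new value.
  def bump(key), do: GenServer.call(__MODULE__, {:bump, key})

  @impl true
  def init(window_ms) do
    schedule_reset(window_ms)
    {:ok, %{window_ms: window_ms, counts: %{}}}
  end

  @impl true
  def handle_call({:bump, key}, _from, state) do
    counts = Map.update(state.counts, key, 1, &(&1 + 1))
    {:reply, counts[key], %{state | counts: counts}}
  end

  # The timer message lands here. Agent has no equivalent callback.
  @impl true
  def handle_info(:reset, state) do
    schedule_reset(state.window_ms)
    {:noreply, %{state | counts: %{}}}
  end

  defp schedule_reset(window_ms) do
    Process.send_after(self(), :reset, window_ms)
  end
end
```

It is about thirty lines, and nothing outside the process has to remember to send it a reset.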

get_and_update for Atomic Read-Modify-Write

This is the single most underused Agent function. Most code that should use get_and_update instead uses get followed by update, which introduces a race window between the two calls.

# Wrong — two round trips, race between them
def reserve_slot(user_id) do
  taken = Agent.get(SlotCache, & &1)
  if MapSet.member?(taken, user_id) do
    {:error, :already_reserved}
  else
    Agent.update(SlotCache, &MapSet.put(&1, user_id))
    :ok
  end
end

If two callers race, both see "not taken," both do the update, and both return :ok, so two callers now believe they hold the same reservation. The fix is one atomic operation that both inspects and modifies state:

def reserve_slot(user_id) do
  Agent.get_and_update(SlotCache, fn taken ->
    if MapSet.member?(taken, user_id) do
      {{:error, :already_reserved}, taken}
    else
      {:ok, MapSet.put(taken, user_id)}
    end
  end)
end

The function returns {return_value, new_state}. The agent runs this inside its single-threaded loop, so no other operation can sneak in between the read and the write. Any time you find yourself writing "check, then change," reach for get_and_update.
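To see the guarantee hold under contention, the sketch below races fifty processes for one slot. SlotDemo and SlotDemo.Cache are hypothetical stand-ins so the example runs on its own; the reserve_slot/1 body is the same as above:

```elixir
defmodule SlotDemo do
  # Hypothetical demo module; SlotDemo.Cache stands in for SlotCache.
  def start do
    Agent.start_link(fn -> MapSet.new() end, name: SlotDemo.Cache)
  end

  def reserve_slot(user_id) do
    Agent.get_and_update(SlotDemo.Cache, fn taken ->
      if MapSet.member?(taken, user_id) do
        {{:error, :already_reserved}, taken}
      else
        {:ok, MapSet.put(taken, user_id)}
      end
    end)
  end
end

{:ok, _pid} = SlotDemo.start()

# Fifty processes race for the same slot.
results =
  1..50
  |> Enum.map(fn _ -> Task.async(fn -> SlotDemo.reserve_slot("user-1") end) end)
  |> Enum.map(&Task.await/1)

winners = Enum.count(results, &(&1 == :ok))
IO.inspect(winners, label: "successful reservations")
```

Exactly one caller wins, because the agent handles each get_and_update message to completion before it looks at the next one.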

The Slow-Function Trap

This is the bug that destroys naive Agent usage in production. Your function runs inside the agent. Every other client is queued behind it.

# This will quietly serialize every flag lookup in the system
def enabled?(flag) do
  Agent.get(__MODULE__, fn flags ->
    # someone thought it was clever to refresh from disk on miss
    case Map.fetch(flags, flag) do
      {:ok, val} -> val
      :error -> File.read!("flags.json") |> Jason.decode!() |> Map.get(flag)
    end
  end)
end

Disk read in Agent.get. If the disk takes 50ms, every caller queued behind a miss waits at least 50ms, and the waits stack up when several misses arrive together. In the worst case, where most lookups miss, throughput through the agent is capped at about twenty requests per second (1000ms / 50ms) on a 32-core box, however many requests hit your Phoenix endpoint. Telemetry shows the agent's mailbox climbing. The CPU is bored. You are bottlenecked on one process holding a file handle.

The rule is mechanical: the function inside Agent.get or Agent.update must do nothing that blocks. No file IO, no network calls, no GenServer.call to another process that might be busy, no :timer.sleep. Just compute on the state and return.
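The serialization is easy to measure. In this sketch, Process.sleep(50) is an assumed stand-in for the disk read, and five callers hit the agent at the same time:

```elixir
{:ok, agent} = Agent.start_link(fn -> %{} end)

slow_lookup = fn ->
  Agent.get(agent, fn state ->
    # Stand-in for the blocking disk read.
    Process.sleep(50)
    state
  end)
end

# Run five lookups concurrently and time the whole batch.
{elapsed_us, _results} =
  :timer.tc(fn ->
    1..5
    |> Enum.map(fn _ -> Task.async(slow_lookup) end)
    |> Enum.map(&Task.await/1)
  end)

elapsed_ms = div(elapsed_us, 1000)
IO.inspect(elapsed_ms, label: "five concurrent lookups took (ms)")
```

The callers are concurrent but the sleeps are not: the batch takes roughly five times the single-call cost, regardless of how many cores are idle.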

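If you need slow work, do it in the caller, then hand the agent only the finished result to merge:

def refresh_from_disk do
  # Slow part: runs in the calling process, not in the agent
  fresh = File.read!("flags.json") |> Jason.decode!()

  # Fast part: merge inside the agent, so writes that landed during the
  # disk read are kept rather than overwritten by a stale snapshot
  Agent.update(__MODULE__, &Map.merge(&1, fresh))
end

The disk read happens in whichever process called refresh_from_disk/0. The agent is only locked for the microseconds it takes to merge the map. Note that reading the state out with get, merging in the caller, and writing the result back with update would reopen the race from the previous section.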

Agent.cast and Why You Usually Don't Want It

Agent.cast is fire-and-forget. The caller returns immediately, the update happens eventually. The temptation is to use it for "writes I don't care about acknowledging" to speed things up.

# tempting
Agent.cast(MetricsCache, &Map.update(&1, :requests, 1, fn n -> n + 1 end))

Two problems. First, cast does not skip the queue — the message still has to wait its turn behind any pending get or update. If the agent is already backed up, cast makes it worse, not better, because the caller stops noticing the backpressure. Second, if the agent crashes with messages still in its mailbox, those messages are lost silently. For something like a request counter, that may be acceptable. For anything you would feel bad about losing, use update.

The honest use case for cast: when the caller genuinely cannot wait, the operation is idempotent or recoverable, and you have observability on the agent's queue length so a backup will show up in alerts before it causes a problem.
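"Observability on the queue length" can start as something very small. The helper below is a hypothetical sketch (AgentQueue and queue_len/1 are invented names) built on Process.whereis/1 and Process.info/2, which report the mailbox size of any local process:

```elixir
defmodule AgentQueue do
  # Hypothetical helper: returns how many messages are waiting in the
  # mailbox of a locally registered process, or nil if nothing is
  # registered under that name (or the process has just died).
  def queue_len(name) when is_atom(name) do
    with pid when is_pid(pid) <- Process.whereis(name),
         {:message_queue_len, len} <- Process.info(pid, :message_queue_len) do
      len
    else
      _ -> nil
    end
  end
end
```

Calling AgentQueue.queue_len(MetricsCache) from a periodic poller and alerting when the number keeps growing is enough to notice a backed-up agent before callers do.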

Real Example: Request Deduplication Counter

A common need in API gateways: count requests by key, accept the first N per window, reject the rest. ETS is the right answer at high scale, but for moderate traffic, an Agent is simpler and easier to reason about.

defmodule RequestDedup do
  use Agent

  def start_link(_), do: Agent.start_link(fn -> %{} end, name: __MODULE__)

  def allow?(key, limit) do
    Agent.get_and_update(__MODULE__, fn counts ->
      current = Map.get(counts, key, 0)
      if current < limit do
        {true, Map.put(counts, key, current + 1)}
      else
        {false, counts}
      end
    end)
  end

  def reset, do: Agent.update(__MODULE__, fn _ -> %{} end)
end

get_and_update keeps the check-and-increment atomic. A separate process resets the map every window. The agent never holds anything slow — the function only does map operations.

Real Example: Registry of In-Flight Downloads

Imagine a service that streams files and you want to deduplicate concurrent requests for the same URL — the second caller should wait for the first to finish rather than fetching twice.

defmodule InFlightDownloads do
  use Agent

  def start_link(_), do: Agent.start_link(fn -> %{} end, name: __MODULE__)

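  def claim_or_join(url) do
    # Capture the caller's pid here: inside the function below, self()
    # would return the agent's pid, because the function runs in the agent.
    caller = self()

    Agent.get_and_update(__MODULE__, fn in_flight ->
      case Map.get(in_flight, url) do
        nil ->
          ref = make_ref()
          # The owner already has the result, so it is not added as a waiter.
          {{:owner, ref}, Map.put(in_flight, url, {ref, []})}

        {ref, waiters} ->
          updated = Map.put(in_flight, url, {ref, [caller | waiters]})
          {{:waiter, ref}, updated}
      end
    end)
  end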

  def complete(url, result) do
    Agent.get_and_update(__MODULE__, fn in_flight ->
      case Map.pop(in_flight, url) do
        {nil, _} -> {:noop, in_flight}
        {{_ref, waiters}, rest} -> {{:waiters, waiters}, rest}
      end
    end)
    |> case do
      {:waiters, waiters} ->
        Enum.each(waiters, &send(&1, {:download_done, url, result}))
      _ ->
        :ok
    end
  end
end

The owner kicks off the actual download; waiters just sit in receive. When the owner finishes, it tells the agent, gets back the list of pids to notify, and sends each one the result. The agent itself never does the download — it just coordinates who is responsible.

This pattern shows up in image proxies, build caches, and CDN-style services. Larger request-coalescing systems follow the same shape, just with a different storage layer.
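The caller-side flow described above (the owner downloads, waiters sit in receive) can be sketched end to end. Coalescer is a hypothetical, compact variant of the registry, kept self-contained so it runs on its own; fetch/3, download_fun, and the timeout default are assumptions for illustration:

```elixir
defmodule Coalescer do
  # Hypothetical, compact variant of the in-flight registry.
  def start_link, do: Agent.start_link(fn -> %{} end, name: __MODULE__)

  # download_fun is an assumed zero-arity function that does the real fetch.
  def fetch(url, download_fun, timeout \\ 5_000) do
    # Captured in the caller; inside the agent, self() is the agent.
    caller = self()

    role =
      Agent.get_and_update(__MODULE__, fn in_flight ->
        case in_flight do
          %{^url => waiters} ->
            {:waiter, Map.put(in_flight, url, [caller | waiters])}

          _ ->
            {:owner, Map.put(in_flight, url, [])}
        end
      end)

    case role do
      :owner ->
        # The slow work runs in the caller, never in the agent.
        result = download_fun.()
        waiters = Agent.get_and_update(__MODULE__, &Map.pop(&1, url, []))
        Enum.each(waiters, &send(&1, {:download_done, url, result}))
        result

      :waiter ->
        receive do
          {:download_done, ^url, result} -> result
        after
          timeout -> {:error, :timeout}
        end
    end
  end
end
```

One limitation of this sketch: if the owner crashes mid-download, the entry stays in the map and waiters only find out through the timeout. A production version would need to monitor the owner, which is the kind of reactive behavior that pushes you toward a GenServer.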

Naming Patterns

Three naming options, in order of how often you should reach for them:

Module name as atom. name: __MODULE__ registers the agent under that atom on the local node, so any process on the node can reach it by name. One-per-app singletons: feature flags, config caches, the application-wide registry of in-flight things. Most agents.

Named via Registry. name: {:via, Registry, {MyApp.Registry, key}} if you genuinely have multiple agents distinguished by a dynamic key. Rare for agents — when you have N entities each with state, you usually want a GenServer per entity, not an Agent per entity.

Anonymous (pid). No name option. You get a pid back from start_link. Useful in tests, for short-lived agents owned by a single process, or when you want to pass the agent around explicitly. Production code should mostly avoid this — debugging "which agent is process #PID<0.847.0>" is harder than "the FeatureFlags agent."
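For the Registry option, a minimal sketch looks like this. MyApp.Registry is the registry name from the text; the "tenant-42" key, the state, and the via helper are hypothetical:

```elixir
# The registry must be running before any agent registers through it.
{:ok, _} = Registry.start_link(keys: :unique, name: MyApp.Registry)

# Small helper that builds the :via tuple for a dynamic key.
via = fn key -> {:via, Registry, {MyApp.Registry, key}} end

{:ok, _pid} = Agent.start_link(fn -> %{plan: :free} end, name: via.("tenant-42"))

# Any process on the node can now reach the agent by key instead of by pid.
plan = Agent.get(via.("tenant-42"), & &1.plan)
IO.inspect(plan, label: "plan")
```

In an application, the Registry would sit in the supervision tree ahead of the agents that register through it.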

Supervising Agents

Agents need supervision like any other process. The pattern is straightforward — Agent provides child_spec/1 via use Agent, so you can drop the module name into a supervisor's child list:

defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      FeatureFlags,
      RequestDedup,
      InFlightDownloads
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end
end

:one_for_one is right for agents that are independent. If one crashes, the others stay up. If two agents depend on each other (say, a downloads agent that reads from a config agent at init time), list the config agent first and use :rest_for_one, so that a config crash also restarts the downloads agent after it.

The restart option on the child spec controls what happens when the agent crashes. The default is :permanent, which is almost always what you want for agents — they hold state and you want them back. :transient is occasionally useful for "restart only on abnormal exit," but agents rarely exit normally on their own.
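There are two places to set the restart value, sketched below. SessionScratch is a hypothetical agent used only to show the mechanics: options given to use Agent are merged into the generated child_spec/1, and Supervisor.child_spec/2 overrides them at the call site.

```elixir
defmodule SessionScratch do
  # Hypothetical agent; the option passed to `use Agent` becomes part of
  # the child spec this module generates.
  use Agent, restart: :transient

  def start_link(_opts), do: Agent.start_link(fn -> %{} end, name: __MODULE__)
end

# Inspect what a supervisor will see for this child.
spec = SessionScratch.child_spec([])
IO.inspect(spec.restart, label: "restart from use Agent")

# The same field can be overridden where the child is listed,
# without touching the module.
temporary_spec = Supervisor.child_spec(SessionScratch, restart: :temporary)
IO.inspect(temporary_spec.restart, label: "restart after override")
```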

If the agent's initial state is expensive to compute (loading from a database, fetching from disk), be aware that this work happens during boot inside init. A slow agent init delays the whole supervisor and can cause boot timeouts. The fix is the same as for GenServer: don't put slow work in the init function. Start with empty state and load lazily, doing the slow part outside the agent, or push the initial load into a handle_continue-style follow-up via a wrapping GenServer.
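One way to start empty and fill in afterwards is a one-shot Task listed right after the agent. This is a sketch under assumptions: LazyFlags is a hypothetical variant of FeatureFlags, and slow_load/0 stands in for the database or disk read.

```elixir
defmodule LazyFlags do
  # Hypothetical variant of FeatureFlags that boots with empty state.
  use Agent

  def start_link(_opts), do: Agent.start_link(fn -> %{} end, name: __MODULE__)

  def enabled?(flag), do: Agent.get(__MODULE__, &Map.get(&1, flag, false))

  # Runs in the calling process; the agent only sees the final merge.
  def load do
    flags = slow_load()
    Agent.update(__MODULE__, &Map.merge(&1, flags))
  end

  # Stand-in for a database or disk read.
  defp slow_load do
    Process.sleep(100)
    %{"new_checkout" => true}
  end
end

children = [
  LazyFlags,
  # Task's default child spec uses restart: :temporary, so this runs once.
  {Task, &LazyFlags.load/0}
]
```

Passing children to Supervisor.start_link(children, strategy: :one_for_one) returns as soon as both processes have started, and the load finishes in the background. The trade-offs are that lookups return false until the load completes, and that a restarted agent comes back empty because the task does not run again.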

How This Differs From the Survey

The survey said: Agent is a GenServer for plain state, use it for caches and counters. True, but missing the load-bearing details. The points worth carrying forward from this chapter:

  • The function runs inside the agent. Everything that matters about Agent performance and correctness flows from that.
  • get_and_update for any check-then-modify pattern. Splitting the check and the write across separate calls introduces races.
  • cast is a footgun more often than a feature. Use it only when you have observability on the queue.
  • Supervise like any process. Default to :permanent, watch out for slow inits.
  • The "Agent grew up and became a GenServer" moment is when you need a timer or a handle_info. Don't fight it — rewrite.

Common Pitfalls

Doing IO inside Agent.get or Agent.update. Disk reads, HTTP calls, even calls to other GenServers can block the agent and serialize every client. Compute on state only; do slow work in the caller.

Using get + update instead of get_and_update. Two round trips means a race window between them. Anything that reads state to decide whether to write needs get_and_update for atomicity.

Treating Agent as a cheaper GenServer. It's the same cost — Agent is just GenServer with less code on the outside. If you need handle_info or a timer, you don't have an Agent, you have a GenServer with the wrong wrapper.

Forgetting that cast is still serialized. A backed-up agent processes casts in mailbox order, same as calls. Cast doesn't make the agent faster; it just hides the wait from the caller.

Anonymous agents in production. When something breaks, you want to find the process by name in observer or remote shell. Named agents make that trivial. Anonymous pids buried in supervisor children make it a treasure hunt.

Slow init blocking supervisor startup. Agent init runs synchronously during boot. A database fetch or large file load there will delay every other child below it in the supervision tree, sometimes triggering boot timeouts. Start empty, load lazily.

Key Takeaways

  • Agent is a GenServer wrapper. Every behavior follows from "your function runs inside the agent process."
  • Use Agent when the operations are pure functions of state, contention is low, and you don't need timers or reactive callbacks.
  • get_and_update is the right answer for any check-then-modify operation. It is the only call that returns a value and writes state in one atomic step.
  • Never block inside a function passed to Agent. Disk, network, slow GenServer calls — all of them serialize every other client.
  • cast is rarely worth it; the only real win is when the caller cannot wait and you have observability on queue length.
  • Supervise agents like any process, default to :permanent, and keep init functions fast.
  • The moment you reach for a timer or handle_info, you've outgrown Agent — write the GenServer.