Building a TCP Server

The single-connection echo server from the previous topic is a toy. Real BEAM servers follow a specific shape that has been refined over two decades of Erlang in production: an acceptor that loops on accept/1, a dynamic supervisor that owns one handler process per connection, and a handler GenServer that owns the socket and runs the protocol. Cowboy is built this way. Ranch — the socket pool that backs Cowboy and therefore Phoenix — is built this way. WhatsApp's chat servers are built this way. When you see "millions of connections per node," this is the shape you are seeing.

The pattern works because each piece has exactly one job and crashes are bounded. A handler that misbehaves takes down one connection. The acceptor keeps running. The supervisor sweeps up the corpse. The other nine hundred and ninety-nine thousand connections never notice.

The Three Pieces

Before any code, the architecture in plain prose:

  1. A listener / acceptor process. It owns the listen socket. Its only job is to call accept/1 in a loop. Each time accept/1 returns a client socket, the acceptor asks the dynamic supervisor to start a handler for that socket, then loops back to accept/1 immediately.

  2. A DynamicSupervisor. It owns the handler processes. Children are added on demand and restarted on crash according to whatever policy you pick (usually :temporary for connection handlers — if a connection dies, you do not magically reconnect the client).

  3. A handler GenServer per connection. It receives the socket from the acceptor, runs the protocol, manages any per-connection state, and dies when the connection closes or the client misbehaves.

A Registry is often a fourth piece — it lets handlers find each other by some key, which is what makes chat rooms, pub/sub topics, and user sessions easy to implement.

A Chat Server, End to End

Let's build a line-based chat server. Clients connect, send JOIN room_name to join a channel, then any line they send is broadcast to everyone else in the same channel. This is enough surface area to exercise the whole pattern without dragging in protocol details.

The Application Supervisor

The top-level supervision tree holds the registry, the dynamic supervisor, and the acceptor:

defmodule Chat.Application do
  use Application

  def start(_type, _args) do
    children = [
      {Registry, keys: :duplicate, name: Chat.Registry},
      {DynamicSupervisor, name: Chat.ConnectionSupervisor, strategy: :one_for_one},
      {Chat.Acceptor, port: 4040}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: Chat.Supervisor)
  end
end

keys: :duplicate on the registry is important: multiple processes can register under the same key (the room name), which is how broadcast works.
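
To see what duplicate keys buy you, here is a minimal sketch that can be pasted into iex or run as a script. Demo.Registry, the "general" room and the message shapes are made up for illustration; they are not part of the chat server.

```elixir
# Hypothetical names: Demo.Registry and "general" exist only for this sketch.
{:ok, _} = Registry.start_link(keys: :duplicate, name: Demo.Registry)

parent = self()

# Two stand-in "connections" register under the same room key.
members =
  for _ <- 1..2 do
    spawn(fn ->
      {:ok, _} = Registry.register(Demo.Registry, "general", nil)
      send(parent, {:joined, self()})

      receive do
        {:broadcast, text} -> send(parent, {:got, self(), text})
      end
    end)
  end

# Wait until both registrations have happened.
for pid <- members do
  receive do
    {:joined, ^pid} -> :ok
  end
end

# Both pids are listed under the one key...
entries = Registry.lookup(Demo.Registry, "general")

# ...so a single dispatch reaches every member of the room.
Registry.dispatch(Demo.Registry, "general", fn room_entries ->
  for {pid, _value} <- room_entries, do: send(pid, {:broadcast, "hello"})
end)

received =
  for pid <- members do
    receive do
      {:got, ^pid, text} -> text
    end
  end
```

With keys: :unique the second Registry.register/3 call would return an error instead, which is why a room-style registry needs :duplicate.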

The Acceptor

defmodule Chat.Acceptor do
  use GenServer
  require Logger

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(opts) do
    port = Keyword.fetch!(opts, :port)

    {:ok, listen_socket} =
      :gen_tcp.listen(port, [
        :binary,
        packet: :line,
        active: false,
        reuseaddr: true,
        backlog: 1024
      ])

    Logger.info("chat server listening on #{port}")
    send(self(), :accept)
    {:ok, %{listen_socket: listen_socket}}
  end

  @impl true
  def handle_info(:accept, %{listen_socket: listen_socket} = state) do
    case :gen_tcp.accept(listen_socket) do
      {:ok, client_socket} ->
        {:ok, pid} =
          DynamicSupervisor.start_child(
            Chat.ConnectionSupervisor,
            {Chat.Connection, client_socket}
          )

        # Transfer ownership of the socket to the handler process.
        :ok = :gen_tcp.controlling_process(client_socket, pid)
        send(pid, :go)

        send(self(), :accept)
        {:noreply, state}

      {:error, reason} ->
        Logger.error("accept failed: #{inspect(reason)}")
        send(self(), :accept)
        {:noreply, state}
    end
  end
end

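Two things deserve attention here. First, :gen_tcp.accept/1 blocks. We do not call it from init/1, where it would stall the supervisor that is starting us; instead we send ourselves an :accept message and call accept/1 once per message from handle_info. Between accepts the GenServer gets a chance to process anything else in its mailbox, such as system and debug messages. While it is blocked inside accept/1 it processes nothing, so if that matters, use :gen_tcp.accept/2 with a timeout to bound the wait.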
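
One way to bound how long the acceptor sits inside accept is to pass a timeout as the second argument. The sketch below is a hypothetical variant, not the acceptor from this topic: Demo.TimedAcceptor, the :on_accept option and the 500 ms value are all assumptions chosen for illustration.

```elixir
defmodule Demo.TimedAcceptor do
  # Sketch only: a hypothetical acceptor that bounds each accept call so the
  # process returns to its GenServer loop at least every @accept_timeout ms.
  use GenServer

  @accept_timeout 500

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  # An ordinary call; it can only be answered between two accept attempts.
  def port(server), do: GenServer.call(server, :port)

  @impl true
  def init(opts) do
    {:ok, listen_socket} =
      :gen_tcp.listen(Keyword.get(opts, :port, 0), [:binary, active: false, reuseaddr: true])

    send(self(), :accept)
    {:ok, %{listen_socket: listen_socket, on_accept: Keyword.fetch!(opts, :on_accept)}}
  end

  @impl true
  def handle_call(:port, _from, state) do
    {:ok, port} = :inet.port(state.listen_socket)
    {:reply, port, state}
  end

  @impl true
  def handle_info(:accept, state) do
    # Any error other than :timeout is left to crash the process.
    case :gen_tcp.accept(state.listen_socket, @accept_timeout) do
      {:ok, client_socket} -> state.on_accept.(client_socket)
      {:error, :timeout} -> :ok
    end

    send(self(), :accept)
    {:noreply, state}
  end
end
```

The trade-off is a wake-up every half second on an idle server in exchange for calls such as port/1 being answered promptly.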

Second, controlling_process/2. Sockets in the BEAM have an owner: the process that receives {:tcp, _, _} messages in active mode and whose death closes the socket. When the acceptor receives the socket from accept/1, the acceptor owns it. We immediately transfer ownership to the freshly-spawned handler. Until the transfer completes, active-mode messages are addressed to the acceptor, not the handler. The send(pid, :go) is a handshake: the handler waits for it before activating the socket, guaranteeing it never arms a socket it does not yet own.

Production servers run multiple acceptors in parallel; Ranch defaults to 10 per listener. They all call accept/1 against the same listen socket; the kernel hands each incoming connection to whichever is waiting. For most workloads one acceptor is enough; profile before optimising.
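
As a rough illustration of several acceptors sharing one listen socket, here is a minimal sketch. Demo.AcceptorPool and its start/3 function are hypothetical, and the acceptors are plain linked processes rather than supervised children, which a real pool would not get away with.

```elixir
defmodule Demo.AcceptorPool do
  # Sketch only: Demo.AcceptorPool is a hypothetical name. One process opens
  # the listen socket; `count` linked processes block in accept/1 on it.

  def start(port, count, handler) when count > 0 and is_function(handler, 1) do
    {:ok, listen_socket} =
      :gen_tcp.listen(port, [:binary, packet: :line, active: false, reuseaddr: true])

    acceptors =
      for _ <- 1..count do
        spawn_link(fn -> accept_loop(listen_socket, handler) end)
      end

    {:ok, listen_socket, acceptors}
  end

  defp accept_loop(listen_socket, handler) do
    case :gen_tcp.accept(listen_socket) do
      {:ok, client_socket} ->
        # The accepting process owns client_socket at this point. A real
        # server would start a supervised handler and transfer ownership.
        handler.(client_socket)
        accept_loop(listen_socket, handler)

      {:error, :closed} ->
        # The listen socket was closed; stop quietly.
        :ok
    end
  end
end
```

Calling Demo.AcceptorPool.start(4040, 4, handler) would leave four processes parked in accept/1; whichever one the runtime wakes for a given connection runs handler on it.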

The Connection Handler

This is a GenServer per connection. It holds the socket, knows which room (if any) it is in, and translates between the wire protocol and the registry.

defmodule Chat.Connection do
  use GenServer, restart: :temporary
  require Logger

  def start_link(socket) do
    GenServer.start_link(__MODULE__, socket)
  end

  @impl true
  def init(socket) do
    {:ok, %{socket: socket, room: nil}}
  end

  @impl true
  def handle_info(:go, %{socket: socket} = state) do
    # Now we own the socket. Arm it for the first line.
    :ok = :inet.setopts(socket, active: :once)
    {:noreply, state}
  end

  def handle_info({:tcp, socket, line}, %{socket: socket} = state) do
    new_state = handle_line(String.trim(line), state)
    :ok = :inet.setopts(socket, active: :once)
    {:noreply, new_state}
  end

  def handle_info({:tcp_closed, socket}, %{socket: socket} = state) do
    {:stop, :normal, state}
  end

  def handle_info({:tcp_error, socket, reason}, %{socket: socket} = state) do
    Logger.warning("tcp error: #{inspect(reason)}")
    {:stop, :normal, state}
  end

  def handle_info({:broadcast, from_pid, message}, %{socket: socket} = state)
      when from_pid != self() do
    :gen_tcp.send(socket, message)
    {:noreply, state}
  end

  def handle_info({:broadcast, _from_pid, _msg}, state) do
    # Skip our own broadcasts.
    {:noreply, state}
  end

  defp handle_line("JOIN " <> room, state) do
    if state.room do
      Registry.unregister(Chat.Registry, state.room)
    end

    {:ok, _} = Registry.register(Chat.Registry, room, nil)
    :gen_tcp.send(state.socket, "OK joined #{room}\n")
    %{state | room: room}
  end

  defp handle_line(_line, %{room: nil} = state) do
    :gen_tcp.send(state.socket, "ERR join a room first\n")
    state
  end

  defp handle_line(line, %{room: room} = state) do
    Registry.dispatch(Chat.Registry, room, fn entries ->
      for {pid, _} <- entries do
        send(pid, {:broadcast, self(), line <> "\n"})
      end
    end)

    state
  end
end

A handful of choices here are deliberate.

restart: :temporary means the supervisor never restarts a dead connection. If the TCP connection drops, the handler dies and stays dead, because there is no socket to restart against. With :permanent (or :transient after a crash) the supervisor would start a fresh handler with the same socket argument, which by then is closed, so the new process could do nothing useful with it. Connection handlers are almost always :temporary.

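The :go handshake activates the socket only after we know we own it. Active-mode messages are delivered to the socket's owner, so arming the socket before the transfer could deliver the first line to the acceptor instead of the handler. ({:error, :not_owner} is what :gen_tcp.controlling_process/2 returns when a process that does not own the socket tries to hand it over.)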

We use active: :once and re-arm after every message. This is the back-pressure mechanism the previous topic introduced. If a peer fires lines faster than we can broadcast them, the lines pile up in the kernel's receive buffer rather than in our mailbox. Eventually the kernel will advertise a shrinking and finally zero receive window, the peer's TCP stack will stop sending, and the peer will block on send. That is the right behaviour: push back, do not OOM.

Registry membership is {room, nil} — we do not need a per-entry value, the pid is enough to send messages. Registry.dispatch/3 walks every entry under the key and runs a function with the list; we use it to fan out the message. For thousands of subscribers this is a single Erlang process iterating a list, which is fast; for hundreds of thousands per channel you would shard.

Trying It Out

$ nc localhost 4040
JOIN general
OK joined general
hello everyone

In a second terminal:

$ nc localhost 4040
JOIN general
OK joined general
hello everyone
hi there

The first client sees hi there on the next line. Disconnect either and the other keeps working; the dead connection's handler exits with :normal, the supervisor reaps it, the registry entry vanishes.

Back-Pressure in Depth

active: :once is the simple knob, but the broader story is worth understanding. The BEAM cannot magically slow down a peer. What it can do is stop reading from the socket, which leaves bytes in the kernel's receive buffer. When that buffer fills, the kernel advertises a zero window, the peer's TCP stack stops sending, and the peer's :gen_tcp.send/2 eventually blocks. The pressure propagates back to whatever is producing the data.

The mistake active: true makes is bypassing this mechanism. The runtime keeps draining the kernel buffer and shovelling bytes into your mailbox. The kernel never advertises a closed window because as far as it knows you are reading fast. Your mailbox grows without bound, and the operating system's out-of-memory killer arrives before TCP back-pressure ever engages.

For high-throughput protocols where one setopts call per message is too much overhead, active: N is the answer. You give yourself a budget of N messages; when it runs out the socket goes passive automatically, you receive a {:tcp_passive, socket} message, and you call setopts to refill. Cowboy exposes this budget as its active_n option. The point is that there is always some upper bound, never an unbounded fire-hose into a single mailbox.
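
A handler that reads with a budget might look roughly like the sketch below. Demo.BudgetedReader and the budget of 100 are assumptions for illustration; the handler only counts lines so that the refill logic stays visible.

```elixir
defmodule Demo.BudgetedReader do
  # Sketch only: a hypothetical handler that reads with an active: N budget.
  use GenServer

  @budget 100

  def start_link(socket), do: GenServer.start_link(__MODULE__, socket)

  @impl true
  def init(socket), do: {:ok, %{socket: socket, lines: 0}}

  @impl true
  def handle_info(:go, %{socket: socket} = state) do
    # Ask for up to @budget messages before the socket turns passive.
    :ok = :inet.setopts(socket, active: @budget)
    {:noreply, state}
  end

  def handle_info({:tcp, socket, _line}, %{socket: socket} = state) do
    # No setopts here: the runtime counts the budget down for us.
    {:noreply, %{state | lines: state.lines + 1}}
  end

  def handle_info({:tcp_passive, socket}, %{socket: socket} = state) do
    # Budget exhausted and the socket is passive again. Refill it.
    :ok = :inet.setopts(socket, active: @budget)
    {:noreply, state}
  end

  def handle_info({:tcp_closed, socket}, %{socket: socket} = state) do
    {:stop, :normal, state}
  end
end
```

At most @budget socket messages can be queued in the mailbox between refills, so the mailbox stays bounded while setopts is called once per hundred lines instead of once per line.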

Connection Lifetime

A handler is born when DynamicSupervisor.start_child/2 succeeds and the socket ownership transfer completes. It lives as long as the TCP connection lives plus however long it takes to process whatever was in flight at close. It dies in one of three ways:

  1. The peer closes the connection. {:tcp_closed, socket} arrives, the handler returns {:stop, :normal, state}, the supervisor reaps it. This is the common case.

  2. The handler decides to close. Protocol violation, kick from moderator, idle timeout, server shutdown. The handler calls :gen_tcp.close/1 and exits. The peer sees a TCP close on its end.

  3. The handler crashes. A pattern match failure, an unhandled message, a Map.fetch!/2 on a missing key. The handler exits with a non-normal reason, the supervisor reaps it, and the connection is dropped. The peer sees tcp_closed from its end as the socket is collected. This is the "let it crash" path, and it is fine: losing one connection because one client sent malformed input is the right outcome.
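
The second case, the handler deciding to close, can be sketched with a GenServer timeout. Demo.IdleConnection, the 30 second default and the "ERR idle timeout" line are assumptions for illustration. Note that a GenServer timeout is reset by any message, so in the chat server incoming broadcasts would also count as activity.

```elixir
defmodule Demo.IdleConnection do
  # Sketch only: a hypothetical handler that closes connections that stay
  # idle for longer than idle_ms. Any message to the process resets the timer.
  use GenServer, restart: :temporary

  def start_link(socket, idle_ms \\ 30_000) do
    GenServer.start_link(__MODULE__, {socket, idle_ms})
  end

  @impl true
  def init({socket, idle_ms}) do
    {:ok, %{socket: socket, idle_ms: idle_ms}, idle_ms}
  end

  @impl true
  def handle_info(:go, %{socket: socket} = state) do
    :ok = :inet.setopts(socket, active: :once)
    {:noreply, state, state.idle_ms}
  end

  def handle_info({:tcp, socket, _line}, %{socket: socket} = state) do
    :ok = :inet.setopts(socket, active: :once)
    {:noreply, state, state.idle_ms}
  end

  def handle_info(:timeout, %{socket: socket} = state) do
    # The handler decides to close: tell the peer why, then stop normally.
    :gen_tcp.send(socket, "ERR idle timeout\n")
    :gen_tcp.close(socket)
    {:stop, :normal, state}
  end

  def handle_info({:tcp_closed, socket}, %{socket: socket} = state) do
    {:stop, :normal, state}
  end
end
```

Because the handler stops with :normal and is :temporary, the supervisor simply removes it, exactly as in the peer-closed case.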

Graceful Shutdown

On controlled server shutdown, the application supervisor stops its children in reverse start order. The acceptor stops first (no new connections accepted). Then the DynamicSupervisor stops its children. By default each handler is sent a :shutdown exit signal and given 5 seconds before it is killed; if you want longer, set the :shutdown option on the handler's child spec, for example use GenServer, restart: :temporary, shutdown: 10_000. (The supervisor's :max_restarts and :max_seconds options control restart intensity, not shutdown time.)

For protocols where a clean wind-down message matters — telling a client "server going away" before closing — handle terminate/2 in the connection GenServer:

@impl true
def terminate(_reason, %{socket: socket}) do
  :gen_tcp.send(socket, "BYE server shutting down\n")
  :gen_tcp.close(socket)
end

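terminate/2 is called when a callback returns {:stop, reason, state} or raises. On supervisor shutdown it is called only if the GenServer traps exits, so a handler that wants to say goodbye must call Process.flag(:trap_exit, true) in init/1. For brutal kills (:kill, or a shutdown timeout that expires) it does not run; for those you accept that the peer just sees a connection drop.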
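
Putting the pieces together, a handler that reliably says goodbye on shutdown could look like this sketch. Demo.GracefulConnection and the ten second shutdown value are assumptions; the catch-all handle_info is there because a process that traps exits also receives {:EXIT, _, _} messages it would otherwise have to match explicitly.

```elixir
defmodule Demo.GracefulConnection do
  # Sketch only: a hypothetical handler showing the two settings that let
  # terminate/2 run when the supervisor shuts the handler down.
  # shutdown: 10_000 allows terminate/2 up to ten seconds before a brutal kill.
  use GenServer, restart: :temporary, shutdown: 10_000

  def start_link(socket), do: GenServer.start_link(__MODULE__, socket)

  @impl true
  def init(socket) do
    # Without this flag the supervisor's :shutdown exit signal ends the
    # process immediately and terminate/2 is skipped.
    Process.flag(:trap_exit, true)
    {:ok, %{socket: socket}}
  end

  @impl true
  def handle_info(_message, state), do: {:noreply, state}

  @impl true
  def terminate(_reason, %{socket: socket}) do
    :gen_tcp.send(socket, "BYE server shutting down\n")
    :gen_tcp.close(socket)
  end
end
```

Stopping the handler with DynamicSupervisor.terminate_child/2, or stopping the whole supervisor, then delivers the BYE line before the socket closes.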

Production Implementations to Steal From

You almost never write this from scratch in a real project. The ecosystem has done it for you:

  • Ranch is the socket acceptor pool used by Cowboy and (transitively) by Phoenix applications that run on Cowboy. It runs N acceptors per listener, handles socket ownership transfer correctly, plugs into OTP supervision cleanly, and sits underneath many other BEAM networking libraries. About 1500 lines of Erlang that have been beaten on for a decade.
  • Cowboy is the HTTP/1.1, HTTP/2, and WebSocket server built on top of Ranch. Phoenix shipped with Cowboy as its default server for years. The connection handlers are state machines per connection, but the acceptor pattern underneath is exactly what is described above.
  • Bandit is the newer pure-Elixir HTTP server. It does not use Ranch; its acceptor pool comes from Thousand Island, a pure-Elixir socket server. Worth reading for a modern take. It is now the Phoenix default for new projects.

For your own protocols (a custom binary RPC, a line-based IoT protocol, a homemade message bus), building directly on Ranch is the right move. You implement the :ranch_protocol behaviour, whose single callback in Ranch 2.x is start_link/3, and by convention an init/3 that calls :ranch.handshake/1 to take ownership of the socket; Ranch handles the acceptor pool and supervision for you. The hand-rolled acceptor in this topic is for understanding, not for production.
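
For orientation, a Ranch-based handler might be shaped roughly like the sketch below. It assumes Ranch 2.x is a dependency and that the listener uses the :ranch_tcp transport (so socket messages arrive as {:tcp, _, _}); Demo.RanchEcho is a hypothetical module that echoes lines rather than running the chat protocol, and it uses handle_continue instead of a hand-written init/3 so that it can stay a GenServer.

```elixir
defmodule Demo.RanchEcho do
  # Sketch only, assuming Ranch 2.x and the :ranch_tcp transport.
  # Demo.RanchEcho is a hypothetical name.
  use GenServer
  @behaviour :ranch_protocol

  # Ranch calls start_link/3 for every accepted connection.
  @impl :ranch_protocol
  def start_link(ref, transport, opts) do
    GenServer.start_link(__MODULE__, {ref, transport, opts})
  end

  @impl GenServer
  def init({ref, transport, _opts}) do
    # Return first: the handshake must happen after start_link/3 has returned.
    {:ok, %{ref: ref, transport: transport, socket: nil}, {:continue, :handshake}}
  end

  @impl GenServer
  def handle_continue(:handshake, %{ref: ref, transport: transport} = state) do
    # handshake/1 returns once Ranch has handed this process the socket.
    {:ok, socket} = :ranch.handshake(ref)
    :ok = transport.setopts(socket, active: :once, packet: :line)
    {:noreply, %{state | socket: socket}}
  end

  @impl GenServer
  def handle_info({:tcp, socket, line}, %{socket: socket, transport: transport} = state) do
    transport.send(socket, line)
    :ok = transport.setopts(socket, active: :once)
    {:noreply, state}
  end

  def handle_info({:tcp_closed, socket}, %{socket: socket} = state) do
    {:stop, :normal, state}
  end
end
```

A listener for it would be started with something like :ranch.start_listener(:chat, :ranch_tcp, %{socket_opts: [port: 4040]}, Demo.RanchEcho, []), where :chat is an arbitrary listener name. The handshake plays the role of the :go message in the hand-rolled version.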

WhatsApp's chat backbone was Erlang code in roughly this shape, with bespoke tweaks for their traffic patterns. The point of the example is not that this code is production-ready, but that this shape is.

Common Pitfalls

Forgetting controlling_process/2 or activating the socket before ownership transfer. Without the transfer the handler never receives {:tcp, _, _} messages because the acceptor still owns the socket, and the messages land in the acceptor's mailbox instead. The :go handshake from the acceptor exists to prevent this race.

Restart strategy other than :temporary for connection handlers. A :permanent connection handler whose socket has died will be restarted by the supervisor with the same dead socket. It cannot do anything useful, and if it keeps crashing the supervisor exceeds its restart intensity and shuts down, taking every other connection with it. Connection handlers are one-shot, so :temporary is right.

Doing protocol work in the acceptor. The acceptor's hot loop should be accept, spawn, transfer, repeat. Protocol parsing, auth, logging belong in the handler. An acceptor doing work is an acceptor not accepting.

Broadcasting via Registry.dispatch/3 for very large fan-outs. It is a single process iterating a list. For 10,000 subscribers in a room, fine. For 10,000,000, you need a different topology — shard the channel, use Phoenix PubSub, or build a tree of relay processes.

Leaving terminate/2 doing slow work. Supervisors give children a bounded time to terminate. A handler that takes 30 seconds to flush state on shutdown will be brutally killed. Either reduce the work, or persist state during normal operation.

Key Takeaways

  • The standard BEAM TCP server shape: an acceptor, a DynamicSupervisor of handlers, one handler GenServer per connection, optionally a Registry for cross-handler addressing.
  • Use controlling_process/2 to transfer socket ownership from acceptor to handler with a handshake so the handler does not setopts before it owns the socket.
  • Connection handlers should be restart: :temporary. There is no point restarting a process whose only resource — the socket — is gone.
  • active: :once (or active: N for hot paths) gives you GenServer-friendly message delivery with built-in TCP back-pressure.
  • Letting one connection crash is fine — the supervisor reaps it, the other connections are untouched. This isolation is why the pattern scales.
  • In production you almost certainly want Ranch rather than your own acceptor pool. Real systems on this shape: Cowboy, Phoenix, Bandit, WhatsApp's chat servers, Discord's voice gateway, countless IoT and gaming backends.