Let It Crash

"Let it crash" is one of those phrases that sounds reckless until you've worked in a system that practices it. New Elixir developers usually arrive with reflexes from Java, Python, or JavaScript — wrap everything in try/catch, validate every input three times, defend against every conceivable failure. Then they spend six months in production code and realize they've been doing it wrong the whole time.

The philosophy isn't "crashes are good." It's "crashes are cheap, recoverable, and informative — so don't spend code preventing them when a supervisor will handle it for you."

The Java Reflex

Here's what defensive code in another language tends to look like:

public User findUser(String id) {
    try {
        if (id == null) return null;
        if (id.isEmpty()) return null;
        User user = repository.findById(id);
        if (user == null) {
            logger.warn("User not found");
            return null;
        }
        return user;
    } catch (DatabaseException e) {
        logger.error("DB error", e);
        return null;
    } catch (Exception e) {
        logger.error("Unexpected error", e);
        return null;
    }
}

Every layer wraps every call. Errors get swallowed and turned into nulls or generic responses. The control flow is buried under guard clauses. And when something does go wrong, the original cause is three exception layers deep, the stack trace is gone, and the system muddles forward in some inconsistent state.

The reason this style exists is that in those runtimes, an uncaught exception can take down a thread, a request, or the entire OS process (your whole web server, your whole worker), and whatever shared mutable state it was touching may be left half-updated. The cost of crashing is unacceptable, so you write code that never crashes, even when crashing would be the honest answer.

What Changes on the BEAM

Elixir runs on the BEAM, which has a different unit of failure: the process. A process crashing is a normal, isolated event. It doesn't take down the VM, doesn't take down other processes unless they are linked to it, and doesn't take down your service. The supervisor that started it gets a notification, restarts it, and the system carries on.

This changes the economics of error handling completely. You no longer pay for a crash with downtime. You pay with a single restart of one process, often costing milliseconds. So the cost-benefit of defensive coding flips:

Defensive code costs you complexity, readability, and slower response to bugs (because they get hidden).
Letting it crash costs you a process restart and gives you a clean stack trace pointing at the actual bug.

So you write the happy path, you assume your invariants hold, and when they don't — let it crash. A supervisor restarts a fresh process with known-good state, and you get a real error report with a real stack trace that you can fix.
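To make the isolation concrete, here is a minimal sketch that can be run as a script. It uses spawn_monitor/1 rather than a real supervisor, purely to observe a crash from the outside; the "boom" error is an illustrative stand-in for a broken invariant.

```elixir
# Spawn an isolated process that crashes, and watch it from the outside.
# spawn_monitor/1 does not link, so the crash cannot propagate to us.
{pid, ref} = spawn_monitor(fn -> raise "boom" end)

# The monitor delivers a :DOWN message describing why the process died.
reason =
  receive do
    {:DOWN, ^ref, :process, ^pid, reason} -> reason
  after
    1_000 -> :timeout
  end

# The exit reason carries the exception and the stack trace.
{%RuntimeError{message: "boom"}, _stacktrace} = reason

# The crashed process is gone; the calling process keeps working.
IO.puts("crashed process alive? #{Process.alive?(pid)}")
IO.puts("caller alive? #{Process.alive?(self())}")
```

A supervisor does essentially the same bookkeeping for you, and then starts a replacement process.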

What This Looks Like in Practice

def handle_call({:get_user, id}, _from, state) do
  user = Map.fetch!(state.users, id)
  {:reply, user, state}
end

There's no defensive case. No null check. No try/rescue. If id isn't in the map, Map.fetch!/2 raises a KeyError, the GenServer crashes, and the supervisor restarts it. The caller's GenServer.call/3 exits too, and if the caller is also a supervised process, its supervisor handles that in turn.

Compare to the defensive version:

def handle_call({:get_user, id}, _from, state) do
  case Map.fetch(state.users, id) do
    {:ok, user} ->
      {:reply, user, state}
    :error ->
      Logger.warning("User not found: #{id}")
      {:reply, {:error, :not_found}, state}
  end
end

Which one is right? It depends on whether "user not found" is an expected outcome or a bug. If callers reasonably ask about users that may not exist (looking up a session by ID, for instance), the second version is correct — it's an expected runtime situation. If the system invariant is "this process only ever receives valid IDs," the first is correct, and any deviation is a bug worth crashing over.

The rule of thumb: distinguish expected absence from broken invariants. Expected absence returns {:error, reason}. Broken invariants crash.
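One way to express that rule is to offer both shapes side by side, following the same convention as Map.fetch/2 and Map.fetch!/2. The sketch below is illustrative only: UserStore and its plain-map storage are hypothetical and not part of the GenServer shown earlier.

```elixir
defmodule UserStore do
  # Hypothetical module for illustration; `users` is a plain map of id => user.

  # Expected absence: callers may ask about ids that do not exist,
  # so "not found" is part of the return contract.
  def lookup(users, id) do
    case Map.fetch(users, id) do
      {:ok, user} -> {:ok, user}
      :error -> {:error, :not_found}
    end
  end

  # Broken invariant: callers promise the id exists, so a missing key
  # is a bug. Map.fetch!/2 raises KeyError and the process crashes.
  def fetch!(users, id), do: Map.fetch!(users, id)
end

users = %{"u1" => %{name: "Ada"}}

IO.inspect(UserStore.lookup(users, "u1"))
IO.inspect(UserStore.lookup(users, "nope"))
```

The caller picks the function that matches its own contract: lookup/2 when absence is a legitimate answer, fetch!/2 when absence means something upstream is broken.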

Defensive vs Offensive Programming

Defensive programming says "assume the inputs are bad and handle every case." Offensive programming says "assume the inputs are valid, and fail loud if they aren't." Elixir leans hard on offensive programming, because offensive code is shorter, clearer, and crashes give you better signals than silent fallbacks.

# Offensive
def total(items) do
  Enum.reduce(items, 0, fn %{price: p}, acc -> acc + p end)
end

If somebody passes [%{cost: 5}] instead of [%{price: 5}], no clause of the anonymous function matches, so this raises a FunctionClauseError with a stack trace pointing at the exact line. Crash, restart, fix the bug.

# Defensive
def total(items) do
  Enum.reduce(items, 0, fn item, acc ->
    case item do
      %{price: p} when is_number(p) -> acc + p
      _ -> acc
    end
  end)
end

This silently skips bad items and returns a wrong total. The bug ships to production, your invoices are off by 12%, and you find out next quarter when accounting flags it. The defensive code did more harm than the crash would have.
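The difference is easy to see by feeding both versions the same bad input. The sketch below wraps the two functions in a hypothetical Totals module with made-up data. The try/rescue is there only so the script can print the outcome; production code would not wrap the call.

```elixir
defmodule Totals do
  # Offensive: the anonymous function only matches maps with a :price key.
  def offensive(items) do
    Enum.reduce(items, 0, fn %{price: p}, acc -> acc + p end)
  end

  # Defensive: anything unexpected is silently skipped.
  def defensive(items) do
    Enum.reduce(items, 0, fn
      %{price: p}, acc when is_number(p) -> acc + p
      _item, acc -> acc
    end)
  end
end

# Hypothetical data: the second item uses the wrong key.
items = [%{price: 10}, %{cost: 5}]

# The defensive version returns a number, but it is the wrong number.
defensive_total = Totals.defensive(items)
IO.inspect(defensive_total)

# The offensive version refuses to produce a total at all.
offensive_result =
  try do
    Totals.offensive(items)
  rescue
    e in FunctionClauseError -> {:crashed, e.__struct__}
  end

IO.inspect(offensive_result)
```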

What Crashing Actually Costs

The first concern people raise: "But the user gets a 500 error." Sometimes, yes. But the alternative — silently corrupting state or returning bad data — is worse. And in many systems, the crash never reaches the user at all:

Phoenix LiveView processes that crash are recovered automatically: the client reconnects and the server mounts a fresh process.
A worker processing a job crashes; the job gets retried.
A user session process crashes; the next request boots a fresh one transparently.
A connection pool worker crashes; the pool replaces it before the next checkout.

Large Elixir deployments such as Discord's run very large numbers of processes per node, and at that scale occasional process crashes are normal operating behavior: a steady-state metric to watch, not a fire alarm.

The cost is reset state. If a GenServer crashes, the supervisor restarts it with whatever init/1 returns. State held only in process memory is gone. That's why important state lives elsewhere: an ETS table owned by a longer-lived process (a table is deleted when its owner dies), the database, or an external store.
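The state reset can be observed directly. This is a minimal sketch with a hypothetical Counter GenServer and a small Demo polling helper. The crash is simulated with Process.exit/2, and the polling exists only because the restart happens asynchronously.

```elixir
defmodule Counter do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, 0, name: __MODULE__)
  def increment, do: GenServer.call(__MODULE__, :increment)
  def value, do: GenServer.call(__MODULE__, :value)

  @impl true
  def init(initial), do: {:ok, initial}

  @impl true
  def handle_call(:increment, _from, count), do: {:reply, count + 1, count + 1}
  def handle_call(:value, _from, count), do: {:reply, count, count}
end

defmodule Demo do
  # Poll until the supervisor has registered a replacement process.
  def await_restart(name, old_pid) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        await_restart(name, old_pid)
    end
  end
end

{:ok, _sup} = Supervisor.start_link([Counter], strategy: :one_for_one)

Counter.increment()
Counter.increment()
before_crash = Counter.value()

# Simulate a crash by killing the process, then wait for the restart.
old_pid = Process.whereis(Counter)
Process.exit(old_pid, :kill)
Demo.await_restart(Counter, old_pid)

# The replacement starts from whatever init/1 returns, which is 0 here.
after_crash = Counter.value()
IO.inspect({before_crash, after_crash})
```

If the count mattered, it would have to live somewhere that survives the restart.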

When Not to Let It Crash

The philosophy isn't "never handle errors." It's "handle the errors that are part of the API, crash on the ones that aren't."

You should handle:

HTTP request failures from external services
File I/O errors when the file might genuinely not be there
Parsing errors on user input
Validation failures from form submission
Anything where the failure is part of the contract with the caller

You shouldn't handle:

"What if this internal map is missing a key it should always have"
"What if this struct has the wrong shape"
"What if this internal function returns something unexpected"

The first set is "the world is uncertain." The second set is "my code or my callers have a bug." Crashing on bugs surfaces them. Catching them hides them.
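The two sets can sit next to each other in the same module. In this sketch, Signup, parse_age/1, and monthly_cost/2 are hypothetical names chosen for illustration: the first function treats user input as uncertain, the second treats internal data as trusted.

```elixir
defmodule Signup do
  # Hypothetical form handler for illustration.

  # User input is uncertain: a bad age is an expected outcome and is
  # returned to the caller as a tagged tuple.
  def parse_age(input) when is_binary(input) do
    case Integer.parse(String.trim(input)) do
      {age, ""} when age >= 0 -> {:ok, age}
      _ -> {:error, :invalid_age}
    end
  end

  # Internal data is not uncertain: the plan map is built by our own code
  # and must have a :price key. If it does not, that is a bug, so the
  # function head is allowed to fail to match.
  def monthly_cost(%{price: price}, months), do: price * months
end

IO.inspect(Signup.parse_age(" 42 "))
IO.inspect(Signup.parse_age("forty-two"))
IO.inspect(Signup.monthly_cost(%{price: 10}, 3))
```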

Supervisors Are the Recovery Mechanism

The reason any of this works is supervisors. Crashing without a supervisor watching is just crashing — same as any other language. The pattern is:

Write code that's correct on the happy path.
Let it crash on broken invariants.
Put it under a supervisor.
The supervisor restarts it on crash, within a restart-intensity limit (at most max_restarts restarts in max_seconds seconds).
If it keeps crashing past that limit, the supervisor itself shuts down, and the supervisor above it decides what to do.

This is structured error recovery built into OTP, the framework that ships with the runtime. You don't have to think about it on every line of business code, the way you would with try/catch. The structure is in the supervision tree, not in every function.
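The restart limit and the escalation step can be seen in a small script. This is a sketch under assumptions: the always-crashing Task is a stand-in for a child with a persistent bug, and the limit of 2 restarts in 5 seconds is arbitrary. The script traps exits only so that it can observe the supervisor giving up instead of being taken down with it.

```elixir
# Trap exits so this script receives a message instead of dying when
# the linked supervisor gives up.
Process.flag(:trap_exit, true)

# A child that always crashes right after starting. Task children are
# :temporary by default, so the restart value is overridden here.
always_crashing =
  Supervisor.child_spec({Task, fn -> raise "boom" end}, restart: :permanent)

{:ok, sup} =
  Supervisor.start_link([always_crashing],
    strategy: :one_for_one,
    max_restarts: 2,
    max_seconds: 5
  )

# After the limit is exceeded the supervisor terminates its children
# and then itself, which is what a parent supervisor would react to.
outcome =
  receive do
    {:EXIT, ^sup, reason} -> {:supervisor_gave_up, reason}
  after
    2_000 -> :still_running
  end

IO.inspect(outcome)
```

In a real tree the process receiving that exit would be another supervisor, which applies its own restart strategy and limit.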

Concrete Comparison

Imagine a worker that processes uploaded images — resize, watermark, save thumbnails. The worker pulls jobs from a queue and processes them one at a time.

The defensive version:

def process(job) do
  try do
    image = case Image.open(job.path) do
      {:ok, img} -> img
      {:error, _} -> nil
    end

    if image do
      try do
        resized = Image.resize(image, 800)
        case Image.save(resized, job.output_path) do
          :ok -> :ok
          {:error, reason} ->
            Logger.error("save failed: #{inspect(reason)}")
            :error
        end
      rescue
        e ->
          Logger.error("resize crashed: #{inspect(e)}")
          :error
      end
    else
      Logger.warning("could not open image")
      :error
    end
  rescue
    e ->
      Logger.error("unexpected: #{inspect(e)}")
      :error
  end
end

The let-it-crash version:

def process(job) do
  {:ok, image} = Image.open(job.path)
  resized = Image.resize(image, 800)
  :ok = Image.save(resized, job.output_path)
end

If the image can't be opened, you crash. The supervisor restarts the worker, and the queue retries the job (or sends it to a dead-letter queue if it fails repeatedly). If the resize crashes because of a bad image, same thing. The worker's job is to process valid jobs; invalid ones are someone else's problem.

The defensive version is about six times longer, hides multiple failure modes, and still doesn't tell you what to do with a failed job. The let-it-crash version is honest: this is what success looks like; anything else is a problem the supervisor handles.

The catch: this depends on your queue retrying jobs and on having a dead-letter mechanism for jobs that fail forever. Without that infrastructure, the worker just keeps crashing on the bad job. Let-it-crash isn't license to ignore failure modes — it's a discipline about where in the system each failure is handled.
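As a rough illustration of that retry and dead-letter infrastructure, the sketch below runs each attempt in its own monitored process and gives up after a fixed number of crashes. JobRunner, the max_attempts default of 3, and the {:dead_letter, job} return value are all assumptions for this example; a real job queue provides its own versions of these.

```elixir
defmodule JobRunner do
  # Hypothetical runner for illustration: `work` is any one-arity function
  # that crashes on failure, and `max_attempts` is an assumed retry limit.
  def run(job, work, max_attempts \\ 3), do: attempt(job, work, 1, max_attempts)

  # Out of attempts: hand the job to whatever handles permanent failures.
  defp attempt(job, _work, n, max) when n > max, do: {:dead_letter, job}

  defp attempt(job, work, n, max) do
    # Run the work in its own monitored process so a crash stays isolated.
    {pid, ref} = spawn_monitor(fn -> work.(job) end)

    receive do
      {:DOWN, ^ref, :process, ^pid, :normal} -> {:ok, job}
      {:DOWN, ^ref, :process, ^pid, _crash} -> attempt(job, work, n + 1, max)
    end
  end
end

good = JobRunner.run(%{path: "a.png"}, fn _job -> :ok end)
bad = JobRunner.run(%{path: "broken.png"}, fn _job -> raise "cannot decode" end)

IO.inspect(good)
IO.inspect(bad)
```

The worker function itself stays on the happy path; the decision about a job that fails forever is made in one place, outside the worker.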

Comparing to try/catch Languages

In Java or Python, error handling is an inline concern. Every function that might fail can throw an exception, every caller that doesn't handle it propagates it, and you end up with try/catch/finally (or try/except/finally) blocks scattered through the codebase. Errors and control flow are tangled.

In Elixir, error handling is mostly out-of-band. The happy path is in your function. The recovery is in your supervision tree. The two are physically separate. You read your business logic without wading through error handling, and you read your supervisor structure without wading through business logic.

This separation is the actual win. Not "fewer crashes" — more crashes, even — but cleaner code and clearer recovery.

Common Pitfalls

Wrapping everything in try/rescue. If you find yourself adding try/rescue to most functions, you've imported defensive habits from another language. Stop. Let it crash. The supervisor will handle it.

Catching exceptions and converting them to nil or empty values. This silently corrupts data and makes debugging miserable. If you must catch, return a tagged tuple {:error, reason} and propagate it up — never swallow.

Treating all errors the same. "User not found" and "database connection refused" are different. The first is a normal API result; the second is an infrastructure problem. Don't pattern-match every operation against {:error, _} and shrug.

Hiding bugs as features. "If this map is missing the key, default to empty string" sounds reasonable until you realize you've made it impossible to ever notice when the map is wrong. Crashing on missing keys catches data shape changes immediately.

No supervisor. "Let it crash" in an unsupervised process is just crashing. The crash needs to land somewhere that can recover.
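The "hiding bugs as features" pitfall above is easy to reproduce. In this sketch the payload and the renamed key are hypothetical. The try/rescue is there only so the script can show the error; production code would let it propagate.

```elixir
# Hypothetical payload whose key was renamed upstream
# from :email to :email_address.
payload = %{email_address: "ada@example.com"}

# Defaulting hides the change: the code keeps running with an empty value.
hidden = Map.get(payload, :email, "")

# Fetching with a bang surfaces it immediately as a KeyError.
surfaced =
  try do
    Map.fetch!(payload, :email)
  rescue
    e in KeyError -> {:crashed, e.key}
  end

IO.inspect(hidden)
IO.inspect(surfaced)
```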

Key Takeaways

BEAM processes are the unit of failure. Crashing one doesn't take down the VM or other processes.
Defensive code is expensive — it adds complexity and hides bugs. Offensive code is cheap when supervisors are doing recovery.
Crash on broken invariants. Return {:error, reason} for expected failures.
Supervisors are the structured recovery mechanism. They turn "crash" into "restart with fresh state."
The win isn't fewer errors — it's separating happy-path code from recovery code. Read business logic without error noise; read supervisors without business logic.
"Let it crash" only works under a supervisor. Otherwise it's just crashing.