Writing for 3AM

The reader of a runbook is not sitting comfortably at their desk with a cup of coffee, browsing your docs out of curiosity. They are in bed. Their phone just woke them up. The system is down. Customers are affected. Their manager is asking for an ETA. Their brain is operating at 50% capacity, and the adrenaline is making it hard to think clearly. Write for that person.

The 3AM Reader

Understand who is reading your runbook and what state they are in:

The reader at 3AM:
  - Tired, possibly sleep-deprived
  - Stressed by the active incident
  - Cannot hold complex logic in working memory
  - Will not read paragraphs of explanation
  - Needs to act, not understand
  - May not be the person who wrote the system
  - May be on-call for the first time for this service

This is not a hypothetical. This is the actual context in which runbooks are used. Every design decision in a runbook should be evaluated against this reader.

Structure for Impaired Cognition

Short Sentences, Numbered Steps

Every action is a numbered step. Every step is one action. No compound steps, no branching logic within a step, no paragraphs.

Bad:
  If the API is returning 500 errors, first check if the database
  is reachable. You can do this by running a connectivity test from
  the API server. If the database is reachable, the problem is likely
  in the application layer, so check the application logs. If the
  database is not reachable, check the database server's status and
  network connectivity.

Good:
  1. Check if the database is reachable:
     kubectl exec -it api-server-0 -- pg_isready -h db.internal

  2. If output shows "accepting connections":
     Skip to step 5 (database is fine, check application).

  3. If output shows anything else ("rejecting connections", "no response", or a timeout):
     Continue to step 4 (database problem).

  4. Check database server status:
     ssh db-primary "systemctl status postgresql"
     Go to: Database Down Runbook (link)

  5. Check application logs for errors:
     kubectl logs api-server-0 --tail=100 | grep ERROR
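
The branch in steps 2 and 3 only works if every possible output maps to exactly one next step. As a rough sketch of that rule, the mapping could be written as a small shell function. The name `next_step_for_db_check` is hypothetical, and routing everything other than "accepting connections" to step 4 is an assumption made for illustration.

```shell
# Hypothetical helper: map pg_isready output text to the next runbook step.
# Assumption: anything other than "accepting connections" (including
# "rejecting connections", "no response", or empty output after a timeout)
# is treated as a database problem.
next_step_for_db_check() {
  case "$1" in
    *"accepting connections"*) echo "Skip to step 5 (database is fine, check application)." ;;
    *)                         echo "Continue to step 4 (database problem)." ;;
  esac
}

# Usage during an incident (not run here):
#   next_step_for_db_check "$(kubectl exec api-server-0 -- pg_isready -h db.internal)"
```

The catch-all branch is the point of the sketch: the reader never meets an output that the runbook does not cover.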

Copy-Pasteable Commands

Every command in a runbook must be copy-pasteable. No pseudo-commands, no placeholders that require editing, no "replace X with your value."

Bad:
  Run: kubectl logs <pod-name> --tail=100
  (Replace <pod-name> with the actual pod name)

Better:
  1. Get the pod name:
     kubectl get pods -l app=api-server -o name

  2. View recent logs (paste the pod name from step 1):
     kubectl logs pod/api-server-xxxxx --tail=100

Best:
  1. View recent API server logs:
     kubectl logs -l app=api-server --tail=100

The best version requires zero editing. The reader copies, pastes, and gets results.
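
One way to enforce this before an incident is to scan runbooks for leftover placeholders. The following is a minimal sketch, assuming placeholders are written in the `<pod-name>` style shown above. The function name `find_placeholders` is hypothetical, and the pattern will also flag HTML tags if a runbook contains any.

```shell
# Hypothetical lint: print every line (with its line number) that still
# contains an angle-bracket placeholder such as <pod-name>.
# grep exits non-zero when no placeholder is found.
find_placeholders() {
  grep -nE '<[A-Za-z][A-Za-z0-9_-]*>' "$1"
}

# Usage (not run here; the path is an example):
#   find_placeholders runbooks/api-error-rate-high.md
```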

No Ambiguity

Runbooks must be unambiguous. "It depends" is not allowed. If a decision must be made, provide a decision rule.

Bad:
  If the error rate seems high, you may want to restart the service.

Good:
  1. Check the error rate:
     curl -s http://localhost:9090/metrics | grep http_errors_total

  2. If error rate is above 50 errors/minute:
     Restart the service (step 3).
     If error rate is 50 errors/minute or below:
     Skip to step 6 (do not restart).

  3. Restart the service:
     kubectl rollout restart deployment/api-server
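
One caveat about step 1: if `http_errors_total` is a Prometheus-style cumulative counter (an assumption based on its name), the `grep` prints a running total, not a rate. A per-minute figure then needs two readings. The sketch below shows one way to derive it. The function names are hypothetical, and it assumes the metrics output carries no trailing timestamps.

```shell
# Sum every http_errors_total sample from Prometheus text-format input on stdin.
# Assumption: each sample line ends with the value (no timestamp column).
sum_errors() {
  awk '/^http_errors_total[{ ]/ { total += $NF } END { printf "%d\n", total }'
}

# Errors per minute between two counter readings.
# $1 = first reading, $2 = second reading, $3 = seconds between them
errors_per_minute() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%d\n", (b - a) * 60 / s }'
}

# Usage during an incident (not run here):
#   first=$(curl -s http://localhost:9090/metrics | sum_errors)
#   sleep 60
#   second=$(curl -s http://localhost:9090/metrics | sum_errors)
#   errors_per_minute "$first" "$second" 60
```

In a real runbook these lines would be wrapped into a single command so the reader pastes once and compares one number against the threshold.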

"Seems high" is ambiguous. "Above 50 errors/minute" is a decision rule a tired person can follow.

Runbook Template

Every runbook should follow the same structure. Consistency means the reader never has to figure out where information lives.

Runbook template:

  # [Service Name]: [Problem Description]

  ## When to Use This Runbook
  One sentence describing the alert or symptom that triggers this runbook.

  ## Impact
  What is broken and who is affected. Severity level.

  ## Prerequisites
  What access/tools you need before starting.

  ## Steps
  Numbered, copy-pasteable steps.

  ## Verification
  How to confirm the problem is resolved.

  ## Escalation
  Who to contact if the steps do not resolve the issue.

  ## History
  When this runbook was last used and what happened.
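
A consistent template can be checked mechanically. As a sketch, assuming each runbook is a text file that uses the exact `##` headings above, the following function lists the sections a file is missing. The name `missing_sections` is hypothetical.

```shell
# Hypothetical check: print the name of each template section that the
# runbook file given as $1 does not contain. Prints nothing if all exist.
missing_sections() {
  for section in "When to Use This Runbook" "Impact" "Prerequisites" \
                 "Steps" "Verification" "Escalation" "History"; do
    grep -qE "^[[:space:]]*## ${section}[[:space:]]*$" "$1" || echo "$section"
  done
}

# Usage (not run here; the path is an example):
#   missing_sections runbooks/api-error-rate-high.md
```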

The "When to Use" Section

This section prevents the wrong runbook from being followed. It should name the specific alert or symptom.

Good:
  ## When to Use This Runbook
  You received the alert: "API error rate > 5% for 5 minutes"
  (PagerDuty alert ID: api-error-rate-high)

Bad:
  ## When to Use This Runbook
  When the API is having problems.

The Prerequisites Section

List the tools and access required before the reader starts the steps. Discovering at step 7 that you need VPN access wastes critical minutes.

  ## Prerequisites
  - VPN connected to production network
  - kubectl configured for the production cluster
  - SSH access to db-primary (via bastion)
  - Access to Grafana dashboards (link)
  - PagerDuty responder role
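
Part of this list can be turned into a preflight check that the reader runs before step 1. The sketch below covers only the "is the tool installed" part. It does not verify VPN, cluster access, or roles, and the function name `missing_tools` is hypothetical.

```shell
# Hypothetical preflight: print each required tool that is not on PATH.
# Prints nothing when every tool is available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage (not run here):
#   missing_tools kubectl ssh curl
```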

What to Explain & What to Skip

At 3AM, explanation is overhead. But some explanation is necessary to prevent dangerous mistakes.

Explain:
  - Why a step is dangerous ("This will drop all active connections")
  - What to look for in output ("You should see 'OK'. If you see
    'TIMEOUT', do not proceed.")
  - When NOT to do something ("Do NOT restart if traffic is above
    10k rps. Page the database team instead.")

Skip:
  - How the system works architecturally
  - Why the system was designed this way
  - History of previous incidents
  - Alternative approaches that were considered

The runbook is not a teaching document. The reader is not learning. They are executing.

Warnings & Danger Zones

Some steps can make things worse. Mark them explicitly.

  7. CAUTION: Step 8 drops all active database connections.
     Before running it, verify that the connection pool has drained:
     kubectl exec -it api-server-0 -- curl localhost:8080/pool-status

     Expected output: "active_connections: 0"

     If active_connections is not 0, wait 60 seconds and check again.
     Do not proceed until active_connections is 0.

  8. Restart the database:
     ssh db-primary "systemctl restart postgresql"

Use plain language: CAUTION, WARNING, DO NOT. Not color coding or icons that may not render in a terminal.
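
The "wait and check again" rule in step 7 can also be expressed as a guard that the reader pastes once. This is a sketch only: it assumes the pool-status endpoint prints a line of exactly `active_connections: 0` when drained, as in the example, and the function name `pool_is_drained` is hypothetical.

```shell
# Hypothetical guard: succeed only if the pool-status output passed as $1
# contains a line reading exactly "active_connections: 0".
pool_is_drained() {
  printf '%s\n' "$1" | grep -qE '^active_connections: 0$'
}

# Usage during an incident (not run here): block until the pool has drained.
#   until pool_is_drained "$(kubectl exec api-server-0 -- curl -s localhost:8080/pool-status)"; do
#     sleep 60
#   done
```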

Verification Steps

Every runbook must end with verification: how to confirm the problem is actually resolved.

  ## Verification

  1. Check that the error rate has dropped:
     curl -s http://localhost:9090/metrics | grep http_errors_total

     Error rate should be below 5 errors/minute within 2 minutes
     of the restart.

  2. Check that the health endpoint is responding:
     curl -s http://api.example.com/health

     Expected output: {"status": "healthy"}

  3. Check the dashboard:
     https://grafana.internal/d/api-overview

     The error rate graph should show a sharp drop after the restart.

  4. If the error rate has not dropped within 5 minutes:
     Escalate to the API team lead (see Escalation section).

Without verification, the on-call engineer has no way to know if they actually fixed the problem or just made the symptoms temporarily disappear.
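
The time limit in step 4 is a bounded retry: check, wait, check again, then give up and escalate. A minimal sketch of that pattern follows. The names `retry_until` and `health_ok` are hypothetical, and the health check assumes the JSON shape shown in step 2.

```shell
# Hypothetical helper: run a check up to N times, pausing between attempts.
# $1 = max attempts, $2 = seconds between attempts, rest = command to run.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry_until() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Hypothetical check: succeed if the health endpoint reports "healthy".
health_ok() {
  curl -s http://api.example.com/health | grep -q '"status": *"healthy"'
}

# Usage during an incident (not run here): about 5 minutes of checks.
#   retry_until 10 30 health_ok || echo "Still unhealthy. Escalate (see Escalation section)."
```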

Writing & Maintenance Process

Write After Every Incident

The best time to write a runbook is immediately after an incident. The steps are fresh. The failure mode is documented in the postmortem. The fix has been verified.

Post-incident runbook process:
  1. During the postmortem, identify: "If this happens again,
     what are the exact steps to resolve it?"
  2. Write those steps as a runbook within one week
  3. Link the runbook to the relevant alert in PagerDuty
  4. Have someone who was NOT in the incident follow the runbook
     on a staging system to verify it works

Test Runbooks Regularly

A runbook that has not been tested is a runbook that might not work.

Runbook testing:
  - Run through each runbook in staging quarterly
  - New on-call engineers follow runbooks as part of onboarding
  - Game days: intentionally break things and verify runbooks work
  - After any infrastructure change, re-test affected runbooks
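
A quarterly cadence is easier to keep when overdue runbooks are listed automatically. The sketch below rests on loud assumptions: runbooks are `.md` files under one directory, and file modification time is an acceptable stand-in for "last reviewed". The function name `stale_runbooks` is hypothetical.

```shell
# Hypothetical check: list runbook files under directory $1 that have not
# been modified in more than 90 days (roughly one quarter).
stale_runbooks() {
  find "$1" -type f -name '*.md' -mtime +90
}

# Usage (not run here; the directory is an example):
#   stale_runbooks runbooks/
```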

Common Pitfalls

  • Paragraphs instead of steps — a runbook with prose is a runbook nobody can follow at 3AM. Numbered steps or nothing.
  • Placeholders in commands: <replace-this> requires the reader to think. Provide the exact command or a preceding step that retrieves the value.
  • Missing escalation paths — the steps did not work. Now what? Without an escalation section, the on-call engineer is stranded.
  • No verification — "restart the service" is not a resolution. "Restart the service and confirm error rate drops below 5/min" is a resolution.
  • Runbooks for systems that no longer exist — stale runbooks are dangerous. Review and prune after every major infrastructure change.
  • Assuming deep system knowledge — the reader may be on-call for the first time. They may be from a different team covering a shift. Write for the least experienced person who might be paged.
  • Branching logic buried in prose — decision points must be explicit and clearly formatted. "If X, go to step Y. If not X, go to step Z." Not hidden in a paragraph.

Key Takeaways

  • The reader is tired, stressed, and operating at 50% capacity. Every design decision flows from this fact.
  • One action per step. Every command copy-pasteable. No ambiguity. No "it depends."
  • Use a consistent template: when to use, impact, prerequisites, steps, verification, escalation, history.
  • Mark dangerous steps explicitly. Tell the reader what to look for in output and when NOT to proceed.
  • End every runbook with verification steps. If you cannot verify the fix, you do not know if it worked.
  • Write runbooks after incidents, test them regularly, and prune them when systems change.