Writing for 3AM
The reader of a runbook is not sitting comfortably at their desk with a cup of coffee, browsing your docs out of curiosity. They are in bed. Their phone just woke them up. The system is down. Customers are affected. Their manager is asking for an ETA. Their brain is operating at 50% capacity, and the adrenaline is making it hard to think clearly. Write for that person.
The 3AM Reader
Understand who is reading your runbook and what state they are in:
The reader at 3AM:
- Tired, possibly sleep-deprived
- Stressed by the active incident
- Cannot hold complex logic in working memory
- Will not read paragraphs of explanation
- Needs to act, not understand
- May not be the person who wrote the system
- May be on-call for the first time for this service
This is not a hypothetical. This is the actual context in which runbooks are used. Every design decision in a runbook should be evaluated against this reader.
Structure for Impaired Cognition
Short Sentences, Numbered Steps
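Every action is a numbered step. Every step is one action. No compound steps, no paragraphs. Branching is allowed only as an explicit condition followed by a single jump ("If X, skip to step 5"), never as nested logic buried inside a step.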
Bad:
If the API is returning 500 errors, first check if the database
is reachable. You can do this by running a connectivity test from
the API server. If the database is reachable, the problem is likely
in the application layer, so check the application logs. If the
database is not reachable, check the database server's status and
network connectivity.
Good:
1. Check if the database is reachable:
kubectl exec -it api-server-0 -- pg_isready -h db.internal
2. If output shows "accepting connections":
Skip to step 5 (database is fine, check application).
3. If output shows "no response" or "rejecting connections", or the command times out:
Continue to step 4 (database problem).
4. Check database server status:
ssh db-primary "systemctl status postgresql"
Then stop here and go to: Database Down Runbook (link)
5. Check application logs for errors:
kubectl logs api-server-0 --tail=100 | grep ERROR
Copy-Pasteable Commands
Every command in a runbook must be copy-pasteable. No pseudo-commands, no placeholders that require editing, no "replace X with your value."
Bad:
Run: kubectl logs <pod-name> --tail=100
(Replace <pod-name> with the actual pod name)
Better:
1. Get the pod name:
kubectl get pods -l app=api-server -o name
2. View recent logs (replace the example name with the pod name printed in step 1):
kubectl logs pod/api-server-xxxxx --tail=100
Best:
1. View recent API server logs:
kubectl logs -l app=api-server --tail=100
The best version requires zero editing. The reader copies, pastes, and gets results.
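One way to enforce this rule is to lint runbooks for leftover placeholders before they are published. The sketch below assumes placeholders are written in angle brackets, like `<pod-name>`; that convention and the function name `find_placeholders` are illustrative assumptions, not part of any standard tool.

```shell
#!/usr/bin/env bash
# Sketch: flag lines in a runbook that still contain <placeholder> tokens.
# Assumption: placeholders look like <pod-name> or <replace-this>, i.e. a
# letter followed by letters, digits, underscores, or hyphens in angle brackets.

find_placeholders() {
  # $1 = runbook file
  # Prints "line-number:line" for each match; exit status 1 means none found.
  grep -nE '<[A-Za-z][A-Za-z0-9_-]*>' "$1"
}
```

A review check could then be as small as `find_placeholders runbook.md && echo "FAIL: placeholders found"`, where `runbook.md` is a stand-in file name.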
No Ambiguity
Runbooks must be unambiguous. "It depends" is not allowed. If a decision must be made, provide a decision rule.
Bad:
If the error rate seems high, you may want to restart the service.
Good:
1. Check the error rate (run twice, 60 seconds apart; the increase in http_errors_total is the errors/minute):
curl -s http://localhost:9090/metrics | grep http_errors_total
2. If error rate is above 50 errors/minute:
Restart the service (step 3).
If error rate is 50 errors/minute or below:
Skip to step 6 (do not restart).
3. Restart the service:
kubectl rollout restart deployment/api-server
"Seems high" is ambiguous. "Above 50 errors/minute" is a decision rule a tired person can follow.
Runbook Template
Every runbook should follow the same structure. Consistency means the reader never has to figure out where information lives.
Runbook template:
# [Service Name]: [Problem Description]
## When to Use This Runbook
One sentence describing the alert or symptom that triggers this runbook.
## Impact
What is broken and who is affected. Severity level.
## Prerequisites
What access/tools you need before starting.
## Steps
Numbered, copy-pasteable steps.
## Verification
How to confirm the problem is resolved.
## Escalation
Who to contact if the steps do not resolve the issue.
## History
When this runbook was last used and what happened.
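A fixed structure is easy to check mechanically. The sketch below prints any template heading that a runbook file is missing. It assumes runbooks are stored as text files that use the exact `## ` headings from the template above; `missing_sections` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print every required template heading that a runbook file lacks.
# Assumption: each heading sits on its own line, spelled as in the template.

missing_sections() {
  # $1 = runbook file
  local file=$1 heading
  for heading in \
    "## When to Use This Runbook" "## Impact" "## Prerequisites" \
    "## Steps" "## Verification" "## Escalation" "## History"; do
    # -x: the whole line must match; -F: treat the heading as a literal string
    grep -qxF -- "$heading" "$file" || echo "$heading"
  done
}
```

No output means every section is present, so the check can run silently during review and only speak up when something is missing.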
The "When to Use" Section
This section prevents the wrong runbook from being followed. It should name the specific alert or symptom.
Good:
## When to Use This Runbook
You received the alert: "API error rate > 5% for 5 minutes"
(PagerDuty alert ID: api-error-rate-high)
Bad:
## When to Use This Runbook
When the API is having problems.
The Prerequisites Section
List the tools and access required before the reader starts the steps. Discovering at step 7 that you need VPN access wastes critical minutes.
## Prerequisites
- VPN connected to production network
- kubectl configured for the production cluster
- SSH access to db-primary (via bastion)
- Access to Grafana dashboards (link)
- PagerDuty responder role
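Part of this list can be verified by a single command at the top of the Steps section. A minimal sketch, assuming the tool names passed in match what the runbook actually uses; `missing_tools` is a hypothetical helper, and it only confirms the tools are installed, not that VPN, credentials, or roles are in place.

```shell
#!/usr/bin/env bash
# Sketch: report which required command-line tools are not on PATH.
# This checks installation only; it cannot confirm VPN or cluster access.

missing_tools() {
  # Arguments: the tool names the runbook depends on
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "MISSING: $tool"
  done
}
```

For the prerequisites above, the call would be `missing_tools kubectl ssh curl`. No output means all three are installed.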
What to Explain & What to Skip
At 3AM, explanation is overhead. But some explanation is necessary to prevent dangerous mistakes.
Explain:
- Why a step is dangerous ("This will drop all active connections")
- What to look for in output ("You should see 'OK'. If you see
'TIMEOUT', do not proceed.")
- When NOT to do something ("Do NOT restart if traffic is above
10k rps. Page the database team instead.")
Skip:
- How the system works architecturally
- Why the system was designed this way
- History of previous incidents (beyond a brief entry in the History section)
- Alternative approaches that were considered
The runbook is not a teaching document. The reader is not learning. They are executing.
Warnings & Danger Zones
Some steps can make things worse. Mark them explicitly.
7. CAUTION: Step 8 drops all active database connections.
Before running it, verify that the connection pool has drained:
kubectl exec -it api-server-0 -- curl localhost:8080/pool-status
Expected output: "active_connections: 0"
If active_connections is not 0, wait 60 seconds and check again.
Do not proceed until active_connections is 0.
8. Restart the database:
ssh db-primary "systemctl restart postgresql"
Use plain language: CAUTION, WARNING, DO NOT. Do not rely on color coding or icons, which may not render in a terminal.
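The wait-and-recheck loop in step 7 can also be scripted so that the dangerous command is unreachable until the check passes. This is a sketch only: `wait_for_zero` is a hypothetical helper, and it assumes the check command prints a line of exactly the form `active_connections: 0`, as in the expected output above.

```shell
#!/usr/bin/env bash
# Sketch: poll a check command until it reports "active_connections: 0".
# Assumption: the command prints that text on a line of its own.

wait_for_zero() {
  # $1 = max attempts, $2 = seconds between attempts, rest = check command
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" | grep -qx 'active_connections: 0'; then
      echo "DRAINED"
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  echo "NOT DRAINED: do not proceed"
  return 1
}
```

Chaining the restart with `&&` keeps it from running unless the pool drained, for example `wait_for_zero 10 60 kubectl exec api-server-0 -- curl -s localhost:8080/pool-status && ssh db-primary "systemctl restart postgresql"`. The `-it` flags are dropped here because a terminal is not needed when the output is piped.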
Verification Steps
Every runbook must end with verification: how to confirm the problem is actually resolved.
## Verification
1. Check that the error rate has dropped (run twice, 60 seconds apart; the increase in http_errors_total is the errors/minute):
curl -s http://localhost:9090/metrics | grep http_errors_total
Error rate should be below 5 errors/minute within 2 minutes
of the restart.
2. Check that the health endpoint is responding:
curl -s http://api.example.com/health
Expected output: {"status": "healthy"}
3. Check the dashboard:
https://grafana.internal/d/api-overview
The error rate graph should show a sharp drop after the restart.
4. If the error rate has not dropped within 5 minutes:
Escalate to the API team lead (see Escalation section).
Without verification, the on-call engineer has no way to know if they actually fixed the problem or just made the symptoms temporarily disappear.
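The first two checks can be reduced to a single pass/fail answer. The sketch below assumes `http_errors_total` is a cumulative integer counter, so a rate needs two samples, and that the health endpoint returns the JSON body shown above. The function names and the whitespace-tolerant pattern are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: combine the error-rate check and the health check into one verdict.
# Assumptions: the counter only increases; 5 errors/minute is the threshold
# from the example above; a healthy body contains "status": "healthy".

errors_per_minute() {
  # $1 = first counter sample, $2 = second sample, $3 = seconds between them
  echo $(( ($2 - $1) * 60 / $3 ))
}

verification_result() {
  # $1 = errors per minute, $2 = body returned by the health endpoint
  if [ "$1" -lt 5 ] &&
     printf '%s' "$2" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'; then
    echo "RESOLVED"
  else
    echo "NOT RESOLVED: escalate"
  fi
}
```

For example, counter samples of 1200 and 1203 taken 60 seconds apart give 3 errors/minute. Together with a healthy response body, that yields `RESOLVED`.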
Writing & Maintenance Process
Write After Every Incident
The best time to write a runbook is immediately after an incident. The steps are fresh. The failure mode is documented in the postmortem. The fix has been verified.
Post-incident runbook process:
1. During the postmortem, identify: "If this happens again,
what are the exact steps to resolve it?"
2. Write those steps as a runbook within one week
3. Link the runbook to the relevant alert in PagerDuty
4. Have someone who was NOT in the incident follow the runbook
on a staging system to verify it works
Test Runbooks Regularly
A runbook that has not been tested is a runbook that might not work.
Runbook testing:
- Run through each runbook in staging quarterly
- New on-call engineers follow runbooks as part of onboarding
- Game days: intentionally break things and verify runbooks work
- After any infrastructure change, re-test affected runbooks
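A quarterly cadence is easier to keep when something lists the overdue runbooks. A rough sketch, assuming runbooks live as `.md` files under one directory and that file modification time is an acceptable stand-in for "last tested". It often is not, and a date recorded in the History section would be more accurate. `stale_runbooks` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: list runbook files not modified in more than N days.
# Assumption: modification time approximates the last test date.

stale_runbooks() {
  # $1 = runbook directory, $2 = age in days (default 90, about one quarter)
  find "$1" -type f -name '*.md' -mtime +"${2:-90}" | sort
}
```

Running `stale_runbooks runbooks/` before each game day gives a ready-made list of what to exercise first; `runbooks/` is a placeholder for wherever the files are kept.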
Common Pitfalls
- Paragraphs instead of steps — a runbook with prose is a runbook nobody can follow at 3AM. Numbered steps or nothing.
- Placeholders in commands — <replace-this> requires the reader to think. Provide the exact command or a preceding step that retrieves the value.
- Missing escalation paths — the steps did not work. Now what? Without an escalation section, the on-call engineer is stranded.
- No verification — "restart the service" is not a resolution. "Restart the service and confirm error rate drops below 5/min" is a resolution.
- Runbooks for systems that no longer exist — stale runbooks are dangerous. Review and prune after every major infrastructure change.
- Assuming deep system knowledge — the reader may be on-call for the first time. They may be from a different team covering a shift. Write for the least experienced person who might be paged.
- Branching logic buried in prose — decision points must be explicit and clearly formatted. "If X, go to step Y. If not X, go to step Z." Not hidden in a paragraph.
Key Takeaways
- The reader is tired, stressed, and operating at 50% capacity. Every design decision flows from this fact.
- One action per step. Every command copy-pasteable. No ambiguity. No "it depends."
- Use a consistent template: when to use, impact, prerequisites, steps, verification, escalation, history.
- Mark dangerous steps explicitly. Tell the reader what to look for in output and when NOT to proceed.
- End every runbook with verification steps. If you cannot verify the fix, you do not know if it worked.
- Write runbooks after incidents, test them regularly, and prune them when systems change.