Incident Response Docs
When things break, you do not have time to figure out who to call, what to say, or whether this is a real emergency. Incident response documentation makes those decisions in advance so you can execute instead of deliberating. Every minute spent deciding who to notify or debating severity during a live incident is a minute of added downtime. Write the playbook before the crisis.
Severity Definitions
Severity levels must be defined precisely enough that an on-call engineer at 3 AM can classify an incident without a committee meeting.
Severity definitions:
SEV1 - Critical
- Complete service outage or data loss
- Revenue-impacting for all customers
- Security breach with active exploitation
- Examples: API returning 500 for all requests, database
corruption, credentials leaked publicly
- Response time: Immediate. All hands.
- Communication: Exec team notified within 15 minutes.
SEV2 - Major
- Partial outage or significant degradation
- Revenue-impacting for a subset of customers
- Security vulnerability discovered but not exploited
- Examples: One region down, payment processing delayed,
latency 10x normal
- Response time: Within 15 minutes.
- Communication: Engineering leadership notified within 30 min.
SEV3 - Minor
- Service degraded but functional
- Workaround available
- No immediate revenue impact
- Examples: Non-critical endpoint slow, dashboard broken,
batch job delayed
- Response time: Within 1 hour.
- Communication: Team Slack channel.
SEV4 - Low
- Cosmetic issue or minor bug
- No functional impact on users
- Examples: Typo in error message, minor UI glitch
- Response time: Next business day.
- Communication: Ticket created.
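The definitions above can also be kept in machine-readable form, so paging and notification tooling reads the same numbers people do. The following is a minimal sketch, not a prescribed schema: `SeverityPolicy`, `SEVERITY_POLICIES`, and `response_deadline` are hypothetical names, the values are copied from the definitions above, and `None` marks a target the definitions do not state in minutes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    """One row of the severity table (hypothetical structure)."""
    name: str
    respond_within_min: Optional[int]  # None = no minute-level target
    notify: str                        # who is told, or where it is recorded
    notify_within_min: Optional[int]   # None = no stated deadline


# Values copied from the severity definitions; 0 encodes "Immediate".
SEVERITY_POLICIES = {
    "SEV1": SeverityPolicy("Critical", 0, "Exec team", 15),
    "SEV2": SeverityPolicy("Major", 15, "Engineering leadership", 30),
    "SEV3": SeverityPolicy("Minor", 60, "Team Slack channel", None),
    "SEV4": SeverityPolicy("Low", None, "Ticket", None),  # next business day
}


def response_deadline(severity: str) -> str:
    """Return a human-readable response target for a severity label."""
    policy = SEVERITY_POLICIES[severity]
    if policy.respond_within_min is None:
        return "next business day"
    if policy.respond_within_min == 0:
        return "immediately"
    return f"within {policy.respond_within_min} minutes"
```

Keeping the table in one place means a change to a response target is made once instead of in every script that pages someone.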
Classification Decision Tree
When the on-call engineer is not sure about severity, give them a decision tree, not a judgment call.
Severity classification:
1. Are customers unable to use the product?
   Yes -> SEV1
   No -> Continue
2. Are customers experiencing degraded service?
   Yes -> Is there a workaround?
      No -> SEV2
      Yes -> SEV3
   No -> Continue
3. Is there any user-visible impact?
   Yes -> SEV3
   No -> SEV4
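The decision tree above translates directly into a function, which is one way to check that it has no gaps. This is a sketch with hypothetical parameter names; each boolean corresponds to one question in the tree, and the workaround question only matters when service is degraded.

```python
def classify_severity(
    customers_blocked: bool,
    service_degraded: bool,
    workaround_available: bool,
    user_visible_impact: bool,
) -> str:
    """Walk the three questions of the decision tree in order."""
    # 1. Are customers unable to use the product?
    if customers_blocked:
        return "SEV1"
    # 2. Are customers experiencing degraded service?
    if service_degraded:
        return "SEV3" if workaround_available else "SEV2"
    # 3. Is there any user-visible impact?
    return "SEV3" if user_visible_impact else "SEV4"
```

For example, a degraded service with no workaround classifies as SEV2 regardless of the answer to the third question, because the tree never reaches it.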
Roles & Responsibilities
Every incident needs clear roles. Define them before the incident so assignment takes seconds, not minutes.
Incident roles:
Incident Commander (IC)
- Owns the incident from declaration to resolution
- Makes decisions about severity changes and escalation
- Does NOT debug. Coordinates others who debug.
- Runs the communication cadence
Technical Lead
- Leads the technical investigation
- Directs debugging efforts
- Proposes and implements fixes
- Communicates technical status to the IC
Communications Lead
- Drafts and sends external customer communications
- Updates the status page
- Manages stakeholder notifications
- Ensures regular update cadence
Scribe
- Records the incident timeline in real time
- Captures decisions and their rationale
- Notes who did what and when
- This log becomes the foundation of the postmortem
Role Assignment
Role assignment process:
1. First responder (whoever gets paged) becomes temporary IC
2. Temporary IC creates the incident channel:
/incident create "Brief description" sev2
3. Temporary IC assigns roles:
"I need a Tech Lead. @alice, can you lead the technical
investigation?"
"I need a Comms Lead. @bob, can you handle status updates?"
4. If nobody with the right expertise is available:
Escalate per the escalation path (see below)
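The assignment process above can be modeled in a few lines, which helps when building tooling around the incident channel. This is an illustrative sketch only: `Incident`, `declare_incident`, and `unfilled_roles` are hypothetical names and do not represent the behavior of any particular `/incident` bot.

```python
from dataclasses import dataclass, field

ROLES = ("Incident Commander", "Technical Lead", "Communications Lead", "Scribe")


@dataclass
class Incident:
    description: str
    severity: str
    roles: dict = field(default_factory=dict)  # role name -> person

    def assign(self, role: str, person: str) -> None:
        """Assign a person to one of the four defined roles."""
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.roles[role] = person

    def unfilled_roles(self) -> list:
        """Roles nobody holds yet, in the order they are defined."""
        return [role for role in ROLES if role not in self.roles]


def declare_incident(description: str, severity: str, first_responder: str) -> Incident:
    """Step 1: whoever gets paged becomes the temporary IC."""
    incident = Incident(description, severity)
    incident.assign("Incident Commander", first_responder)
    return incident
```

A bot built on this could post `unfilled_roles()` into the channel until the list is empty, so a missing Scribe is noticed in the first minutes rather than at the postmortem.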
Escalation Paths
Document exactly who to contact and when. No guessing, no searching through an org chart.
Escalation paths:
API service:
On-call: PagerDuty schedule "api-oncall"
Team lead: Jane Smith (@jsmith, +1-555-0101)
Director: Mike Chen (@mchen, +1-555-0102)
VP Eng: Sarah Park (@spark, +1-555-0103)
Database:
On-call: PagerDuty schedule "db-oncall"
Team lead: Alex Kumar (@akumar, +1-555-0201)
Director: Mike Chen (@mchen, +1-555-0102)
Infrastructure:
On-call: PagerDuty schedule "infra-oncall"
Team lead: Pat Rivera (@privera, +1-555-0301)
Director: Tom Walsh (@twalsh, +1-555-0302)
When to escalate to the next level:
- No response from on-call within 10 minutes
- SEV1 not resolved within 30 minutes
- SEV2 not resolved within 2 hours
- IC needs a decision that requires authority they do not have
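The four escalation triggers above are simple enough to encode, so a bot or the IC's checklist can flag them rather than rely on someone watching the clock. The sketch below uses hypothetical names and makes one simplifying assumption: a single timer, `minutes_since_page`, stands in for both time since the page and time the incident has been unresolved.

```python
from typing import Optional

# Minutes an incident may stay unresolved before escalating (from the list above).
RESOLUTION_LIMIT_MIN = {"SEV1": 30, "SEV2": 120}
ACK_LIMIT_MIN = 10


def escalation_reason(
    severity: str,
    minutes_since_page: int,
    acknowledged: bool,
    needs_higher_authority: bool = False,
) -> Optional[str]:
    """Return why to escalate now, or None if no trigger applies yet.

    Assumes the incident is still unresolved when this is called.
    """
    if not acknowledged and minutes_since_page >= ACK_LIMIT_MIN:
        return "no response from on-call within 10 minutes"
    limit = RESOLUTION_LIMIT_MIN.get(severity)
    if limit is not None and minutes_since_page >= limit:
        return f"{severity} not resolved within {limit} minutes"
    if needs_higher_authority:
        return "IC needs a decision above their authority"
    return None
```

SEV3 and SEV4 have no time-based trigger in the list above, so for them the function only ever returns the acknowledgement or authority reasons.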
Third-Party Escalation
If your system depends on external services, document how to contact them during an incident.
Third-party escalation:
AWS Support:
Enterprise Support: support.aws.amazon.com
Account ID: 123456789012
Support plan: Enterprise
TAM: Name, email, phone
Stripe:
Support: dashboard.stripe.com/support
Account: acct_xxxx
Escalation email: urgent@stripe.com
Cloudflare:
Support portal: dash.cloudflare.com/support
Account: xxxx
Emergency email: enterprise-support@cloudflare.com
Communication Templates
Pre-written templates eliminate the "what do I say" delay during a live incident. The on-call engineer fills in the blanks instead of composing from scratch.
Internal Communication
Internal incident notification template:
Subject: [SEVX] Brief description of the incident
Status: Investigating / Identified / Monitoring / Resolved
Impact: What users are experiencing
Start time: YYYY-MM-DD HH:MM UTC
IC: Name
Channel: #incident-YYYYMMDD-brief-slug
Current understanding:
[1-2 sentences about what we know]
Next steps:
[What we are doing right now]
Next update in: [interval per the update cadence for this severity, e.g. 30 minutes]
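Filling in the blanks can itself be scripted, so the notification always has the same fields in the same order. The following is a sketch under assumptions: `render_notification` and its parameter names are hypothetical, and the function rejects naive datetimes so the "UTC" label in the output is always true.

```python
from datetime import datetime, timezone

TEMPLATE = """\
Subject: [{severity}] {summary}
Status: {status}
Impact: {impact}
Start time: {start} UTC
IC: {ic}
Channel: {channel}
Current understanding:
{understanding}
Next steps:
{next_steps}
Next update in: {next_update_min} minutes"""

VALID_STATUSES = {"Investigating", "Identified", "Monitoring", "Resolved"}


def render_notification(*, severity, summary, status, impact, start, ic,
                        channel, understanding, next_steps, next_update_min=30):
    """Fill the internal notification template; all fields are required."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status}")
    if start.tzinfo is None:
        raise ValueError("start must be a timezone-aware datetime")
    return TEMPLATE.format(
        severity=severity, summary=summary, status=status, impact=impact,
        start=start.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M"),
        ic=ic, channel=channel, understanding=understanding,
        next_steps=next_steps, next_update_min=next_update_min,
    )
```

Making every argument keyword-only means a missing field fails loudly as a `TypeError` instead of producing a notification with a blank line in it.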
External Communication
Status page update templates:
Investigating:
"We are investigating reports of [brief description of impact].
Some users may experience [specific symptoms]. We will provide
an update within [time]."
Identified:
"We have identified the cause of [brief description]. Our team
is working on a fix. [Specific impact description]. We expect
to have an update within [time]."
Monitoring:
"A fix has been applied for [brief description]. We are
monitoring the results. Some users may still experience
[residual symptoms] for [duration]."
Resolved:
"The incident affecting [brief description] has been resolved.
The issue lasted from [start time] to [end time]. All services
are operating normally. A full postmortem will follow."
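A common failure with bracketed templates is publishing a status update with a blank still unfilled. A small check like the one below can catch that before posting. It is a sketch with a hypothetical function name, and it assumes square brackets are used only for placeholders, so any legitimate bracketed text in a message would also be flagged.

```python
import re

# Matches any [bracketed] span, which the templates use for blanks.
PLACEHOLDER = re.compile(r"\[[^\[\]]+\]")


def unfilled_placeholders(message: str) -> list:
    """Return every [placeholder] still present in a drafted message."""
    return PLACEHOLDER.findall(message)


draft = (
    "We are investigating reports of elevated API errors. "
    "Some users may experience [specific symptoms]. "
    "We will provide an update within [time]."
)
leftover = unfilled_placeholders(draft)  # two blanks are still unfilled
```

Refusing to post while the returned list is non-empty is a cheap guard for the Communications Lead under time pressure.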
Update Cadence
Communication cadence by severity:
SEV1: Update every 15 minutes until identified, then every
30 minutes until resolved.
SEV2: Update every 30 minutes until identified, then every
hour until resolved.
SEV3: Update when status changes.
SEV4: No incident communication required.
Rule: If you have nothing new to say, say "No new updates.
Continuing to investigate. Next update at [time]." Silence
during an incident is worse than repetition.
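The cadence above can be turned into a "next update due" time, which is useful for a reminder bot in the incident channel. This is a minimal sketch with hypothetical names; it returns `None` for SEV3 and SEV4 because the cadence above gives them no fixed interval.

```python
from datetime import datetime, timedelta
from typing import Optional

# Minutes between updates: (before cause is identified, after it is identified).
CADENCE_MIN = {"SEV1": (15, 30), "SEV2": (30, 60)}


def next_update_due(severity: str, identified: bool,
                    last_update: datetime) -> Optional[datetime]:
    """Return when the next update is due, or None if no timer applies."""
    cadence = CADENCE_MIN.get(severity)
    if cadence is None:
        return None  # SEV3 updates on status change; SEV4 needs none
    before, after = cadence
    return last_update + timedelta(minutes=after if identified else before)
```

The timer restarts from each update actually sent, so a "No new updates" message still counts and resets the clock.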
The Incident Channel
Create a dedicated channel for every SEV1 or SEV2 incident. Use a consistent naming convention.
Incident channel conventions:
Name: #incident-YYYYMMDD-brief-slug
Example: #incident-20260418-api-500-errors
Topic: "SEV2 | API 500 errors in us-east-1 | IC: @jsmith"
Pin: incident declaration, status page link, runbook link, dashboard link.
Rule: incident discussion only. Side conversations go to threads.
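Generating the channel name and topic from the incident's fields keeps the convention consistent even when different people declare incidents. The helper names below are hypothetical, and the 40-character slug cap is an arbitrary assumption rather than a limit taken from any chat platform.

```python
import re
from datetime import date


def incident_channel_name(day: date, description: str, max_slug_len: int = 40) -> str:
    """Build '#incident-YYYYMMDD-brief-slug' from a date and a short description."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    slug = slug[:max_slug_len].rstrip("-")
    return f"#incident-{day:%Y%m%d}-{slug}"


def incident_topic(severity: str, summary: str, ic: str) -> str:
    """Build the channel topic in the 'SEV | summary | IC' layout."""
    return f"{severity} | {summary} | IC: {ic}"
```

Calling `incident_channel_name(date(2026, 4, 18), "API 500 errors")` reproduces the example name shown above.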
The Postmortem
The incident is resolved. Now document what happened so it does not happen again. Every postmortem follows the same structure: summary, timeline, root cause, resolution, what went well, what went wrong, and action items with owners and due dates.
Postmortem template:
# Incident: Brief description
Date: YYYY-MM-DD | Severity: SEVX | Duration: X hours Y minutes
## Summary
2-3 sentences describing what happened and the user impact.
## Timeline
14:03 UTC - Alert fired: API error rate > 5%
14:05 UTC - On-call engineer acknowledged
14:12 UTC - Incident channel created, SEV2 declared
## Root Cause
What actually caused the incident. Be specific.
## Resolution
What was done to mitigate and fix the incident, and when service returned to normal.
## What Went Well / What Went Wrong
Bullet points for each. Be honest.
## Action Items (each has an owner and due date)
- [ ] Add automated canary deployment (@jsmith, 2026-05-01)
- [ ] Update escalation contacts for DB team (@akumar, 2026-04-25)
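Because every action item in the template follows the same `(@owner, YYYY-MM-DD)` pattern, a short script can flag items that are missing an owner or a due date before the postmortem is published. This is a sketch with hypothetical function names, and it assumes action items are written exactly in the checkbox format shown in the template.

```python
import re
from datetime import date

# "- [ ] task text (@owner, YYYY-MM-DD)"; "[x]" marks a completed item.
ACTION_ITEM = re.compile(
    r"^- \[[ x]\] (?P<task>.+) \((?P<owner>@\w+), (?P<due>\d{4}-\d{2}-\d{2})\)$"
)


def parse_action_item(line: str):
    """Return (task, owner, due_date), or None if the line is malformed."""
    match = ACTION_ITEM.match(line.strip())
    if match is None:
        return None
    try:
        due = date.fromisoformat(match["due"])
    except ValueError:  # e.g. month 13
        return None
    return match["task"], match["owner"], due


def items_missing_owner_or_date(lines):
    """Return checkbox lines that lack a valid owner and due date."""
    return [
        line for line in lines
        if line.lstrip().startswith("- [") and parse_action_item(line) is None
    ]
```

Running this over the action item section in a pre-publish check turns "every action item has an owner and due date" from a convention into something enforced.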
Common Pitfalls
- Vague severity definitions: "major impact" means different things to different people. Define severity with specific, measurable criteria and examples.
- Missing escalation contacts: an escalation path that says "contact the team lead" without a name, phone number, and PagerDuty schedule is useless at 3 AM.
- No communication templates: composing customer communications during a live incident wastes time and produces worse results. Pre-write the templates.
- IC also debugging: the Incident Commander's job is coordination, not investigation. When the IC starts debugging, coordination stops. Separate the roles.
- Postmortems without action items: a postmortem that identifies root causes but creates no action items is documentation theater. Every postmortem needs owned, dated action items.
- Blame in postmortems: blameful postmortems teach people to hide mistakes. Blameless postmortems teach the organization to fix systems. Focus on what happened, not who did it.
- Stale contact information: an escalation path that lists people who left the company six months ago fails exactly when it is needed. Review contacts quarterly.
Key Takeaways
- Define severity levels with specific criteria and examples. A tired on-call engineer should be able to classify an incident in under 30 seconds.
- Assign roles immediately: Incident Commander, Technical Lead, Communications Lead, Scribe. The IC does not debug.
- Pre-write communication templates for internal notifications, status page updates, and stakeholder emails. Fill in blanks, do not compose from scratch.
- Document escalation paths with names, phone numbers, and PagerDuty schedules. Review them quarterly.
- Every incident produces a blameless postmortem with owned, dated action items. A postmortem without action items is theater.