Incident Response Docs

When things break, you do not have time to figure out who to call, what to say, or whether this is a real emergency. Incident response documentation pre-makes those decisions so you can execute instead of deliberate. Every minute spent deciding who to notify or debating severity during a live incident is a minute of added downtime. Write the playbook before the crisis.

Severity Definitions

Severity levels must be defined precisely enough that an on-call engineer at 3 AM can classify an incident without a committee meeting.

Severity definitions:

  SEV1 - Critical
    - Complete service outage or data loss
    - Revenue-impacting for all customers
    - Security breach with active exploitation
    - Examples: API returning 500 for all requests, database
      corruption, credentials leaked publicly
    - Response time: Immediate. All hands.
    - Communication: Exec team notified within 15 minutes.

  SEV2 - Major
    - Partial outage or significant degradation
    - Revenue-impacting for a subset of customers
    - Security vulnerability discovered but not exploited
    - Examples: One region down, payment processing delayed,
      latency 10x normal
    - Response time: Within 15 minutes.
    - Communication: Engineering leadership notified within 30 minutes.

  SEV3 - Minor
    - Service degraded but functional
    - Workaround available
    - No immediate revenue impact
    - Examples: Non-critical endpoint slow, dashboard broken,
      batch job delayed
    - Response time: Within 1 hour.
    - Communication: Team Slack channel.

  SEV4 - Low
    - Cosmetic issue or minor bug
    - No functional impact on users
    - Examples: Typo in error message, minor UI glitch
    - Response time: Next business day.
    - Communication: Ticket created.
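
The definitions above can also be kept as data, so that paging and reminder tooling reads the same numbers people do. The following is a minimal Python sketch under that assumption; the `SeverityPolicy` class, its field names, and the `describe` helper are hypothetical, and the values are copied from the list above.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    """One row of the severity table. Field names are illustrative."""
    name: str
    respond_within: Optional[timedelta]  # None means "next business day"
    notify: str


# Values mirror the severity definitions in this document.
POLICIES = {
    "SEV1": SeverityPolicy("Critical", timedelta(0), "Exec team within 15 minutes"),
    "SEV2": SeverityPolicy("Major", timedelta(minutes=15),
                           "Engineering leadership within 30 minutes"),
    "SEV3": SeverityPolicy("Minor", timedelta(hours=1), "Team Slack channel"),
    "SEV4": SeverityPolicy("Low", None, "Ticket created"),
}


def describe(severity: str) -> str:
    """Return a one-line summary an on-call engineer can read at a glance."""
    policy = POLICIES[severity]
    if policy.respond_within is None:
        deadline = "next business day"
    elif policy.respond_within == timedelta(0):
        deadline = "immediately"
    else:
        minutes = int(policy.respond_within.total_seconds() // 60)
        deadline = f"within {minutes} minutes"
    return f"{severity} ({policy.name}): respond {deadline}; notify: {policy.notify}"
```

Keeping one table like this is a design choice, not a requirement: it only helps if the written definitions and the data are updated together.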

Classification Decision Tree

When the on-call engineer is not sure about severity, give them a decision tree, not a judgment call.

Severity classification:

  1. Are customers unable to use the product?
     Yes -> SEV1
     No  -> Continue

  2. Are customers experiencing degraded service?
     Yes -> Is there a workaround?
            No  -> SEV2
            Yes -> SEV3
     No  -> Continue

  3. Is there any functional impact on users?
     Yes -> SEV3
     No  -> SEV4
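
The tree above is small enough to encode directly, which makes it easy to embed in a chat bot or paging form. This is a sketch only; the function name `classify` and its four boolean parameters are hypothetical, one per question in the tree.

```python
def classify(unable_to_use: bool, degraded: bool,
             workaround: bool, user_impact: bool) -> str:
    """Walk the three questions of the decision tree in order."""
    # Question 1: customers cannot use the product at all.
    if unable_to_use:
        return "SEV1"
    # Question 2: degraded service; a workaround lowers the severity.
    if degraded:
        return "SEV3" if workaround else "SEV2"
    # Question 3: any remaining impact on users.
    return "SEV3" if user_impact else "SEV4"
```

For example, `classify(False, True, False, True)` describes a degradation with no workaround and returns `"SEV2"`. The questions are evaluated in order, so an earlier "yes" always wins over later answers.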

Roles & Responsibilities

Every incident needs clear roles. Define them before the incident so assignment takes seconds, not minutes.

Incident roles:

  Incident Commander (IC)
    - Owns the incident from declaration to resolution
    - Makes decisions about severity changes and escalation
    - Does NOT debug. Coordinates others who debug.
    - Runs the communication cadence

  Technical Lead
    - Leads the technical investigation
    - Directs debugging efforts
    - Proposes and implements fixes
    - Communicates technical status to the IC

  Communications Lead
    - Drafts and sends external customer communications
    - Updates the status page
    - Manages stakeholder notifications
    - Ensures regular update cadence

  Scribe
    - Records the incident timeline in real time
    - Captures decisions and their rationale
    - Notes who did what and when
    - This log becomes the foundation of the postmortem

Role Assignment

Role assignment process:

  1. First responder (whoever gets paged) becomes temporary IC
  2. Temporary IC creates the incident channel:
     /incident create "Brief description" sev2
  3. Temporary IC assigns roles:
     "I need a Tech Lead. @alice, can you lead the technical
     investigation?"
     "I need a Comms Lead. @bob, can you handle status updates?"
  4. If nobody with the right expertise is available:
     Escalate per the escalation path (see below)
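
If incident tooling tracks roles, the process above could be modeled roughly as follows. This is a hedged sketch, not a description of any real incident tool: the `Incident` class, `declare` function, and the rule that one person cannot be both IC and Technical Lead are assumptions, the last one derived from "the IC does not debug".

```python
from dataclasses import dataclass, field

ROLES = ("Incident Commander", "Technical Lead", "Communications Lead", "Scribe")


@dataclass
class Incident:
    description: str
    severity: str
    roles: dict = field(default_factory=dict)  # role name -> person

    def assign(self, role: str, person: str) -> None:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        # The IC coordinates and does not debug, so one person may not
        # hold both the IC and Technical Lead roles (assumed policy).
        pair = {"Incident Commander", "Technical Lead"}
        if role in pair:
            (other,) = pair - {role}
            if self.roles.get(other) == person:
                raise ValueError(f"{person} cannot be both IC and Technical Lead")
        self.roles[role] = person

    def unfilled(self) -> list:
        """Roles that still need an owner, in a fixed order."""
        return [r for r in ROLES if r not in self.roles]


def declare(description: str, severity: str, first_responder: str) -> Incident:
    """Step 1: whoever gets paged becomes temporary IC."""
    incident = Incident(description, severity)
    incident.assign("Incident Commander", first_responder)
    return incident
```

Calling `declare("API 500 errors", "SEV2", "carol")` and then `unfilled()` lists the three roles the temporary IC still has to hand out.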

Escalation Paths

Document exactly who to contact and when. No guessing, no searching through an org chart.

Escalation paths:

  API service:
    On-call:     PagerDuty schedule "api-oncall"
    Team lead:   Jane Smith (@jsmith, +1-555-0101)
    Director:    Mike Chen (@mchen, +1-555-0102)
    VP Eng:      Sarah Park (@spark, +1-555-0103)

  Database:
    On-call:     PagerDuty schedule "db-oncall"
    Team lead:   Alex Kumar (@akumar, +1-555-0201)
    Director:    Mike Chen (@mchen, +1-555-0102)

  Infrastructure:
    On-call:     PagerDuty schedule "infra-oncall"
    Team lead:   Pat Rivera (@privera, +1-555-0301)
    Director:    Tom Walsh (@twalsh, +1-555-0302)

  When to escalate to the next level:
    - No response from on-call within 10 minutes
    - SEV1 not resolved within 30 minutes
    - SEV2 not resolved within 2 hours
    - IC needs a decision that requires authority they do not have
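
The four escalation triggers above are mechanical enough to check in code, for example from a reminder bot. A minimal sketch follows; the function name `escalation_reason` and its parameters are hypothetical, and the thresholds are the ones listed above.

```python
from datetime import timedelta
from typing import Optional

ACK_TIMEOUT = timedelta(minutes=10)
RESOLUTION_LIMITS = {
    "SEV1": timedelta(minutes=30),
    "SEV2": timedelta(hours=2),
}


def escalation_reason(severity: str, elapsed: timedelta, acknowledged: bool,
                      needs_higher_authority: bool = False) -> Optional[str]:
    """Return why the incident should escalate, or None if it should not.

    `elapsed` is the time since the page was sent.
    """
    if needs_higher_authority:
        return "IC needs a decision above their authority"
    if not acknowledged and elapsed >= ACK_TIMEOUT:
        return "no response from on-call within 10 minutes"
    limit = RESOLUTION_LIMITS.get(severity)
    if limit is not None and elapsed >= limit:
        return f"{severity} unresolved past its time limit"
    return None
```

SEV3 and SEV4 have no resolution limit in the table, so they escalate only on a missed acknowledgement or an authority gap.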

Third-Party Escalation

If your system depends on external services, document how to contact them during an incident.

Third-party escalation:

  AWS Support:
    Enterprise Support: support.aws.amazon.com
    Account ID: 123456789012
    Support plan: Enterprise
    TAM: Name, email, phone

  Stripe:
    Support: dashboard.stripe.com/support
    Account: acct_xxxx
    Escalation email: urgent@stripe.com

  Cloudflare:
    Support portal: dash.cloudflare.com/support
    Account: xxxx
    Emergency email: enterprise-support@cloudflare.com

Communication Templates

Pre-written templates eliminate the "what do I say" delay during a live incident. The on-call engineer fills in the blanks instead of composing from scratch.

Internal Communication

Internal incident notification template:

  Subject: [SEVX] Brief description of the incident

  Status: Investigating / Identified / Monitoring / Resolved
  Impact: What users are experiencing
  Start time: YYYY-MM-DD HH:MM UTC
  IC: Name
  Channel: #incident-YYYYMMDD-brief-description

  Current understanding:
  [1-2 sentences about what we know]

  Next steps:
  [What we are doing right now]

  Next update in: 30 minutes
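
Filling in the blanks can be automated so that a missing field fails loudly instead of being sent half-complete. The sketch below uses only Python's built-in string formatting; the function name `render_notification` and the keyword names are hypothetical, and the layout follows the template above.

```python
from datetime import datetime, timezone

TEMPLATE = """\
Subject: [{severity}] {title}

Status: {status}
Impact: {impact}
Start time: {start:%Y-%m-%d %H:%M} UTC
IC: {ic}
Channel: {channel}

Current understanding:
{understanding}

Next steps:
{next_steps}

Next update in: {next_update_minutes} minutes
"""

VALID_STATUSES = ("Investigating", "Identified", "Monitoring", "Resolved")


def render_notification(**fields) -> str:
    """Fill the internal template; raises KeyError if a blank is left unfilled."""
    if fields["status"] not in VALID_STATUSES:
        raise ValueError(f"status must be one of {VALID_STATUSES}")
    fields = dict(fields)
    # Expect a timezone-aware datetime; normalise to UTC before printing.
    fields["start"] = fields["start"].astimezone(timezone.utc)
    return TEMPLATE.format(**fields)
```

Because `str.format` raises `KeyError` on any missing placeholder, an incomplete notification cannot be rendered by accident.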

External Communication

Status page update templates:

  Investigating:
    "We are investigating reports of [brief description of impact].
    Some users may experience [specific symptoms]. We will provide
    an update within [time]."

  Identified:
    "We have identified the cause of [brief description]. Our team
    is working on a fix. [Specific impact description]. We expect
    to have an update within [time]."

  Monitoring:
    "A fix has been applied for [brief description]. We are
    monitoring the results. Some users may still experience
    [residual symptoms] for [duration]."

  Resolved:
    "The incident affecting [brief description] has been resolved.
    The issue lasted from [start time] to [end time]. All services
    are operating normally. A full postmortem will follow."

Update Cadence

Communication cadence by severity:

  SEV1: Update every 15 minutes until identified, then every
        30 minutes until resolved.
  SEV2: Update every 30 minutes until identified, then every
        hour until resolved.
  SEV3: Update when status changes.
  SEV4: No incident communication required.

Rule: If you have nothing new to say, say "No new updates.
Continuing to investigate. Next update at [time]." Silence
during an incident is worse than repetition.
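
The cadence above reduces to a small lookup, which a reminder bot could use to nudge the Communications Lead. This is a sketch under that assumption; `next_update_due` and `filler_update` are hypothetical names, and the "UTC" suffix in the filler message is an added convention, not part of the rule above.

```python
from datetime import datetime, timedelta
from typing import Optional

# (interval before the cause is identified, interval after it is identified)
CADENCE = {
    "SEV1": (timedelta(minutes=15), timedelta(minutes=30)),
    "SEV2": (timedelta(minutes=30), timedelta(hours=1)),
}


def next_update_due(severity: str, last_update: datetime,
                    identified: bool) -> Optional[datetime]:
    """Return when the next update is due, or None if none is scheduled."""
    intervals = CADENCE.get(severity)
    if intervals is None:
        # SEV3 updates on status change; SEV4 needs no incident communication.
        return None
    before, after = intervals
    return last_update + (after if identified else before)


def filler_update(next_time: datetime) -> str:
    """The 'nothing new' message, so silence is never the default."""
    return ("No new updates. Continuing to investigate. "
            f"Next update at {next_time:%H:%M} UTC.")
```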

The Incident Channel

Create a dedicated channel for every SEV1 or SEV2 incident. Use a consistent naming convention.

Incident channel conventions:

  Name:    #incident-YYYYMMDD-brief-slug
  Example: #incident-20260418-api-500-errors
  Topic:   "SEV2 | API 500 errors in us-east-1 | IC: @jsmith"

  Pin: incident declaration, status page link, runbook link, dashboard link.
  Rule: incident discussion only. Side conversations go to threads.
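
A naming convention holds up better when a helper generates the name than when people type it by hand. The following sketch builds the channel name and topic shown above; the function names and the 40-character slug cap are assumptions chosen for illustration.

```python
import re
from datetime import date


def channel_name(day: date, description: str, max_slug_len: int = 40) -> str:
    """Build '#incident-YYYYMMDD-brief-slug' from a free-text description."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    slug = slug[:max_slug_len].rstrip("-")
    return f"#incident-{day:%Y%m%d}-{slug}"


def channel_topic(severity: str, summary: str, ic_handle: str) -> str:
    """Build the topic line: severity, one-line summary, current IC."""
    return f"{severity} | {summary} | IC: @{ic_handle}"
```

With the example from above, `channel_name(date(2026, 4, 18), "API 500 errors")` returns `#incident-20260418-api-500-errors`.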

The Postmortem

The incident is resolved. Now document what happened so it does not happen again. Every postmortem follows the same structure: summary, timeline, root cause, resolution, what went well, what went wrong, and action items with owners and due dates.

Postmortem template:

  # Incident: Brief description
  Date: YYYY-MM-DD | Severity: SEVX | Duration: X hours Y minutes

  ## Summary
  2-3 sentences describing what happened and the user impact.

  ## Timeline
    14:03 UTC - Alert fired: API error rate > 5%
    14:05 UTC - On-call engineer acknowledged
    14:12 UTC - Incident channel created, SEV2 declared

  ## Root Cause
  What actually caused the incident. Be specific.

  ## Resolution
  What was done to restore service, and when.

  ## What Went Well / What Went Wrong
  Bullet points for each. Be honest.

  ## Action Items (each has an owner and due date)
  - [ ] Add automated canary deployment (@jsmith, 2026-05-01)
  - [ ] Update escalation contacts for DB team (@akumar, 2026-04-25)
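
The "owner and due date" rule can be checked automatically before a postmortem is published. This is a minimal sketch that assumes action items follow the exact `(@owner, YYYY-MM-DD)` shape used in the template above; the regular expression and the `parse_action_item` name are hypothetical.

```python
import re
from datetime import date

# Matches: - [ ] Task text (@owner, YYYY-MM-DD)
ACTION_ITEM = re.compile(
    r"^- \[[ x]\] (?P<task>.+) "
    r"\(@(?P<owner>[\w.-]+), (?P<due>\d{4}-\d{2}-\d{2})\)$"
)


def parse_action_item(line: str):
    """Return (task, owner, due date) or raise if owner or date is missing."""
    match = ACTION_ITEM.match(line.strip())
    if match is None:
        raise ValueError(f"action item needs '(@owner, YYYY-MM-DD)': {line!r}")
    # date.fromisoformat also rejects impossible dates such as 2026-02-30.
    return match["task"], match["owner"], date.fromisoformat(match["due"])
```

An item such as `- [ ] Fix the deploy script` fails the check, which is the point: an action item without an owner and a date is not yet an action item.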

Common Pitfalls

Vague severity definitions: "major impact" means different things to different people. Define severity with specific, measurable criteria and examples.
Missing escalation contacts: an escalation path that says "contact the team lead" without a name, phone number, and PagerDuty schedule is useless at 3 AM.
No communication templates: composing customer communications during a live incident wastes time and produces worse results. Pre-write the templates.
IC also debugging: the Incident Commander's job is coordination, not investigation. When the IC starts debugging, coordination stops. Separate the roles.
Postmortems without action items: a postmortem that identifies root causes but creates no action items is documentation theater. Every postmortem needs owned, dated action items.
Blame in postmortems: blameful postmortems teach people to hide mistakes. Blameless postmortems teach the organization to fix systems. Focus on what happened, not who did it.
Stale contact information: an escalation path that still lists people who left the company six months ago fails when it is needed most. Review contacts quarterly.

Key Takeaways

Define severity levels with specific criteria and examples. A tired on-call engineer should be able to classify an incident in under 30 seconds.
Assign roles immediately: Incident Commander, Technical Lead, Communications Lead, Scribe. The IC does not debug.
Pre-write communication templates for internal notifications, status page updates, and stakeholder emails. Fill in blanks; do not compose from scratch.
Document escalation paths with names, phone numbers, and PagerDuty schedules. Review them quarterly.
Every incident produces a blameless postmortem with owned, dated action items. A postmortem without action items is theater.