Post-Incident Review

A security incident is a failure. A post-incident review that produces no meaningful changes is a second failure. The blameless post-mortem transforms incidents from pure losses into investments in resilience. Organizations that learn from incidents get stronger. Organizations that skip the review repeat the same failures.

The Blameless Post-Mortem

Blameless does not mean accountability-free. It means focusing on systemic causes rather than individual blame. When you blame a person, you get a fired employee. When you fix a system, you prevent the next hundred incidents.

The engineer who clicked a phishing link is not the root cause. The root cause is the system that allowed a single phishing click to compromise production data. Maybe there was no MFA. Maybe phishing training was outdated. Maybe email filtering was misconfigured. These are fixable systemic issues.

# Blameless framing
Bad:  "John clicked a phishing link and caused the breach."
Good: "A phishing email bypassed our email filter. The clicked
      link harvested credentials that provided access to
      production because MFA was not enabled on the VPN.
      Our monitoring did not detect the unauthorized access
      for 6 hours because we lacked alerting on VPN logins
      from new locations."

The good framing identifies three systemic failures that can each be fixed: email filtering, MFA enforcement on the VPN, and alerting on VPN logins from new locations.

Conducting the Review

Timing

Hold the review within 3-5 business days of incident resolution. Too soon and people are still in crisis mode. Too late and details are forgotten. For major incidents (P1), schedule a preliminary review within 48 hours and a thorough follow-up within two weeks.

Participants

Include everyone who was involved in the response: the incident commander, technical responders, communications lead, and scribe. Also include the engineering team that owns the affected system. Do not include managers who were not directly involved, because their presence can inhibit honest discussion.

Format

# Post-incident review agenda (90 minutes)
1. Timeline review (30 min)
   - Walk through events chronologically
   - Each participant adds their perspective
   - Identify gaps and discrepancies

2. Root cause analysis (25 min)
   - What allowed this to happen?
   - What systemic issues contributed?
   - What previous warnings were missed?

3. What went well (10 min)
   - What worked in our response?
   - What should we keep doing?

4. What could improve (10 min)
   - Where did we struggle?
   - What information was missing?

5. Action items (15 min)
   - Specific changes with owners and deadlines
   - Priority ranking

Timeline Reconstruction

The timeline is the foundation of the review. It answers what happened and when, from initial compromise through detection, containment, and recovery.

# Example timeline
2024-03-15 02:14 UTC  Attacker exploits vulnerable dependency
                      in payment service (CVE-2024-XXXXX)
2024-03-15 02:17 UTC  Attacker establishes reverse shell
2024-03-15 02:23 UTC  Attacker enumerates database credentials
                      from environment variables
2024-03-15 02:31 UTC  First database query from attacker IP
2024-03-15 02:31-     Attacker exports customer records in
  05:47 UTC           batches of 1000
2024-03-15 06:02 UTC  Monitoring alert: unusual database query
                      volume (delayed due to aggregation window)
2024-03-15 06:15 UTC  On-call engineer acknowledges alert
2024-03-15 06:22 UTC  Engineer confirms unauthorized access
2024-03-15 06:25 UTC  Incident declared, IC assigned
2024-03-15 06:30 UTC  Database credentials rotated
2024-03-15 06:35 UTC  Attacker access terminated
2024-03-15 06:45 UTC  Affected systems isolated
2024-03-15 09:00 UTC  Scope assessment: 47,000 customer records
                      potentially accessed

Key metrics from the timeline:

Time to compromise: 17 minutes from initial exploit to database access.
Dwell time in the database: 3 hours 31 minutes from the first attacker query to the alert.
Time to detect: 3 hours 48 minutes from compromise to alert.
Time to contain: 20 minutes from alert acknowledgment to access termination.
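These figures can be recomputed directly from the timeline, which helps catch arithmetic slips before the review. The sketch below is one way to do that; the event names and helper functions are illustrative, and only the timestamps come from the example timeline. Containment is computed from both alert acknowledgment and confirmation of unauthorized access, since either is a defensible starting point.

```python
from datetime import datetime, timedelta

# Timestamps copied from the example timeline (all UTC).
# The event names are illustrative labels, not a standard vocabulary.
FMT = "%Y-%m-%d %H:%M"
timeline = {
    "initial_exploit": "2024-03-15 02:14",
    "first_db_query": "2024-03-15 02:31",
    "alert_fired": "2024-03-15 06:02",
    "alert_acknowledged": "2024-03-15 06:15",
    "access_confirmed": "2024-03-15 06:22",
    "access_terminated": "2024-03-15 06:35",
}
events = {name: datetime.strptime(ts, FMT) for name, ts in timeline.items()}


def elapsed(start: str, end: str) -> timedelta:
    """Elapsed time between two named timeline events."""
    return events[end] - events[start]


def fmt(delta: timedelta) -> str:
    """Render a timedelta as hours and minutes."""
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


metrics = {
    "time_to_compromise": elapsed("initial_exploit", "first_db_query"),
    "time_to_detect": elapsed("initial_exploit", "alert_fired"),
    "undetected_db_access": elapsed("first_db_query", "alert_fired"),
    "ack_to_termination": elapsed("alert_acknowledged", "access_terminated"),
    "confirm_to_termination": elapsed("access_confirmed", "access_terminated"),
}

for name, delta in metrics.items():
    print(f"{name:24s} {fmt(delta)}")
```

Keeping the start and end event of each metric explicit in the code makes it harder for two reviewers to quote the same metric name with different meanings.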

Each of these metrics represents an improvement opportunity. The detection time of nearly 4 hours is the most critical gap: reducing it to minutes would have prevented most of the data exfiltration.
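The timeline attributes the late alert to the monitoring aggregation window. The toy model below illustrates why window length matters. It is a sketch, not a description of any real monitoring product, and it rests on stated assumptions: fixed (tumbling) windows that are evaluated only when they close, a threshold expressed as events per minute, and a hypothetical attacker issuing 10 queries per minute. The function name and all numbers are invented for illustration.

```python
from collections import Counter
from typing import Iterable, Optional


def first_alert_minute(event_minutes: Iterable[int], window: int,
                       threshold_per_minute: float) -> Optional[int]:
    """Minute at which a tumbling-window alert first fires, or None.

    Assumes each window is evaluated only when it closes, and that it
    alerts when its event count exceeds threshold_per_minute * window.
    """
    counts = Counter(minute // window for minute in event_minutes)
    limit = threshold_per_minute * window
    for index in sorted(counts):
        if counts[index] > limit:
            return (index + 1) * window  # the alert fires at window close
    return None


# Hypothetical attacker: 10 queries per minute from minute 31 to minute 150.
attack_start = 31
attack = [m for m in range(attack_start, 151) for _ in range(10)]

for window in (60, 5):
    fired = first_alert_minute(attack, window, threshold_per_minute=5)
    print(f"{window:>2}-minute window: alert at minute {fired}, "
          f"{fired - attack_start} minutes after the attack began")
```

Under these assumptions the 60-minute window alerts 89 minutes after the attack begins, because the first partial hour dilutes the attack below the threshold and the alert has to wait for the next full window. The 5-minute window alerts after 4 minutes.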

Root Cause Analysis

Root cause analysis goes deeper than "what happened" to "why did the system allow it to happen." The technique is simple: keep asking "why" until you reach a systemic issue that can be fixed.

# The "5 Whys" technique
Why was customer data exfiltrated?
  Because the attacker had database access.

Why did the attacker have database access?
  Because database credentials were in environment variables
  accessible from the compromised service.

Why were credentials accessible from the service?
  Because we store database credentials as environment
  variables rather than using a secrets manager with
  scoped access.

Why don't we use a secrets manager?
  Because the migration was deprioritized in favor of
  feature work three quarters in a row.

Why was security work consistently deprioritized?
  Because our prioritization framework does not account
  for security debt as a risk factor.

The root cause is not "the attacker exploited a vulnerability." Taken together, the timeline and the chain of "why" questions point to four root causes: (1) a vulnerable dependency was not patched, (2) secrets were stored insecurely, (3) monitoring had insufficient granularity, and (4) the prioritization framework systematically deprioritized security work.

Not "Human Error"

Labeling a root cause as "human error" is a failure of analysis. Humans make errors constantly — that is a given. The system must be designed to tolerate human error.

# "Human error" root causes, reframed
"Engineer deployed to production without review"
  → Deployment pipeline allows unreviewed pushes to production

"Admin used weak password"
  → System allows weak passwords; MFA not enforced

"Developer committed secrets to repo"
  → No pre-commit hook scanning for secrets;
    no automated secret detection in CI

Every "human error" can be reframed as a missing guardrail, a broken process, or an inadequate tool.
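As an illustration of what a missing guardrail can look like in practice, the sketch below scans text for a few secret-like patterns, the kind of check a pre-commit hook or CI step might run. It is a minimal example under stated assumptions: the three patterns and the function name are illustrative, the sample values are dummies, and the wiring into a hook or pipeline is not shown. Dedicated scanners such as gitleaks or detect-secrets cover far more cases.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; dedicated scanners ship far larger rule sets.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private key header": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "hard-coded password": re.compile(
        r"(?i)\b(?:password|passwd|secret)\b\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, rule_name) for every line that matches a rule."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, rule))
    return findings


# Dummy content standing in for a staged file.
sample = '''db_host = "db.internal"
password = "correct-horse-battery"
aws_key = "AKIAIOSFODNN7EXAMPLE"
'''
for number, rule in scan_text(sample):
    print(f"line {number}: possible {rule}")
```

A hook built on a check like this would block the commit when `scan_text` returns any findings, which moves the control from the developer's memory into the system.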

Action Items

Action items are the output that matters. Without concrete, assigned actions that carry deadlines, the review is an expensive storytelling session.

Structure

Every action item needs three things: an owner (a specific person, not a team), a deadline, and a clear definition of done.

# Example action items
1. Migrate payment service secrets to HashiCorp Vault
   Owner: Sarah (Platform Engineering)
   Deadline: 2024-04-15
   Done when: All payment service credentials are sourced
   from Vault; environment variables removed

2. Reduce database query monitoring aggregation window
   from 1 hour to 5 minutes
   Owner: Marcus (Security Operations)
   Deadline: 2024-03-29
   Done when: Alert fires within 5 minutes of anomalous
   query patterns

3. Add CVE-2024-XXXXX to dependency scanning rules
   and audit all services for the vulnerable package
   Owner: Priya (Application Security)
   Deadline: 2024-03-22
   Done when: All instances patched and scan rule active

4. Propose security debt tracking in sprint planning
   Owner: James (Engineering Manager)
   Deadline: 2024-04-01
   Done when: Security debt items visible in backlog
   with risk-based prioritization
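The three required parts can be enforced mechanically if action items are tracked as structured records. The sketch below is one possible shape, not a prescribed schema: the class, field, and method names are assumptions, and the check date is hypothetical. The first item is taken from the examples above, and the second is a deliberately vague item that should be rejected.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    title: str
    owner: str                 # a named person, not a team
    deadline: Optional[date]
    done_when: str             # the definition of done
    completed: bool = False

    def missing_parts(self) -> List[str]:
        """List which of the three required parts are absent."""
        missing = []
        if not self.owner.strip():
            missing.append("owner")
        if self.deadline is None:
            missing.append("deadline")
        if not self.done_when.strip():
            missing.append("definition of done")
        return missing

    def is_overdue(self, today: date) -> bool:
        """True when the item is still open after its deadline."""
        return (not self.completed
                and self.deadline is not None
                and today > self.deadline)


items = [
    ActionItem(
        title="Reduce database query monitoring aggregation window",
        owner="Marcus",
        deadline=date(2024, 3, 29),
        done_when="Alert fires within 5 minutes of anomalous query patterns",
    ),
    ActionItem(title="Improve monitoring", owner="", deadline=None, done_when=""),
]

check_date = date(2024, 4, 15)  # hypothetical follow-up date
for item in items:
    gaps = item.missing_parts()
    if gaps:
        print(f"REJECT  {item.title}: missing {', '.join(gaps)}")
    elif item.is_overdue(check_date):
        print(f"OVERDUE {item.title} (owner: {item.owner}, due {item.deadline})")
    else:
        print(f"OK      {item.title}")
```

The same `is_overdue` check can drive the follow-up reviews, so overdue items surface automatically instead of depending on someone remembering to ask.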

Follow-Up

Track action items to completion. Schedule follow-up checks at 30 and 60 days. Action items that are never completed represent the organization's actual risk tolerance: if you consistently fail to complete post-incident actions, you are accepting the risk of recurrence.

Share sanitized incident summaries across the organization. Other teams likely have similar vulnerabilities. A post-incident summary distributed to all engineering teams may prevent ten other teams from making the same mistake.

# Internal incident summary template
Incident: Payment service data exposure
Date: 2024-03-15
Severity: P1
Impact: 47,000 customer records potentially accessed

What happened: [2-3 sentence summary]

Key lessons:
- Secrets in environment variables are accessible to
  anyone who compromises the host process
- Dependency patching delays directly translate to
  breach risk
- Monitoring aggregation windows determine minimum
  detection time

Action for your team:
- Audit your services for secrets stored in env vars
- Verify your dependency scanning covers all packages
- Review your monitoring aggregation windows

Some organizations publish post-incident reports publicly. Cloudflare, GitHub, GitLab, and Google regularly publish detailed incident analyses. External sharing builds trust with customers and contributes to the broader security community's knowledge base.

External reports are sanitized to remove sensitive details such as attacker techniques that could enable copycats, specific customer impact data, and proprietary security architecture details.

The Incident Database

Individual incidents teach individual lessons. An incident database reveals patterns across time.

# Incident database fields
- Incident ID and date
- Severity level
- Category (data breach, malware, DDoS, insider threat)
- Detection method (automated, user report, third-party)
- Time to detect, contain, and resolve
- Root causes
- Action items and completion status
- Cost estimate (response effort, business impact, fines)

Patterns emerge over time. If three incidents in six months involve exposed secrets, the pattern is clear regardless of the individual root causes. If detection time has not improved over a year despite action items, the organization is not investing enough in monitoring.

Review the incident database quarterly. Look for recurring root causes, trending categories, and whether detection and response times are improving. Present findings to leadership with resource requests tied to specific pattern data.
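A quarterly review of this kind can start from two simple queries: which root causes recur, and how detection time moves from quarter to quarter. The sketch below assumes a minimal record with a subset of the fields listed above. The class name, function names, and incident IDs are invented, and the second and third records are hypothetical sample data, not real incidents. Only the first record reflects the example incident in this document.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import Dict, List


@dataclass
class Incident:
    incident_id: str
    occurred: date
    category: str
    detect_minutes: int
    root_causes: List[str]


def recurring_root_causes(incidents: List[Incident],
                          min_count: int = 2) -> Dict[str, int]:
    """Root causes that appear in at least min_count incidents."""
    counts = Counter(c for inc in incidents for c in set(inc.root_causes))
    return {cause: n for cause, n in counts.items() if n >= min_count}


def median_detect_by_quarter(incidents: List[Incident]) -> Dict[str, float]:
    """Median time to detect, in minutes, grouped by calendar quarter."""
    by_quarter: Dict[str, List[int]] = {}
    for inc in incidents:
        key = f"{inc.occurred.year}-Q{(inc.occurred.month - 1) // 3 + 1}"
        by_quarter.setdefault(key, []).append(inc.detect_minutes)
    return {q: median(v) for q, v in sorted(by_quarter.items())}


incidents = [
    # The example incident from this document (3h 48m to detect).
    Incident("INC-001", date(2024, 3, 15), "data breach", 228,
             ["unpatched dependency", "secrets in environment variables",
              "coarse monitoring", "security debt not prioritized"]),
    # Hypothetical records added only to show a pattern.
    Incident("INC-002", date(2024, 5, 2), "data breach", 95,
             ["secrets in environment variables"]),
    Incident("INC-003", date(2024, 6, 20), "malware", 40,
             ["secrets committed to repository", "unpatched dependency"]),
]

print(recurring_root_causes(incidents))
print(median_detect_by_quarter(incidents))
```

Grouping by exact root-cause strings only works if the labels are kept consistent, so a small controlled vocabulary for root causes is worth agreeing on before the database grows.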

Common Pitfalls

Blaming individuals. People stop reporting incidents and participating honestly in reviews. The culture becomes one of hiding problems rather than fixing systems.
Root cause analysis that stops too early. "The dependency was unpatched" is not a root cause. Why was it unpatched? What process failed? Go deeper.
Action items without owners or deadlines. "We should improve monitoring" is not an action item. "Marcus will reduce the alert aggregation window to 5 minutes by March 29" is an action item.
Never following up on action items. Incomplete action items mean accepting recurrence risk. Track completion and escalate overdue items.
Skipping the review entirely. Under pressure to move on, teams skip the review and lose the most valuable part of the incident: the lessons.
Not sharing across teams. The same vulnerability often exists in multiple services. If only the affected team learns, the organization remains vulnerable.

Key Takeaways

Blameless post-mortems focus on systemic failures, not individual mistakes.
Timeline reconstruction with metrics (time to detect, time to contain) reveals the biggest improvement opportunities.
Root cause analysis must go beyond surface causes — keep asking "why" until you reach fixable systemic issues.
"Human error" is never a root cause. Find the missing guardrail or broken process.
Action items need specific owners, deadlines, and definitions of done.
Share lessons internally across teams and consider external publication.
An incident database tracks patterns over time, revealing systemic weaknesses that individual reviews miss.