Post-Incident Review
A security incident is a failure. A post-incident review that produces no meaningful changes is a second failure. The blameless post-mortem transforms incidents from pure losses into investments in resilience. Organizations that learn from incidents get stronger. Organizations that skip the review repeat the same failures.
The Blameless Post-Mortem
Blameless does not mean accountability-free. It means focusing on systemic causes rather than individual blame. When you blame a person, you get a fired employee. When you fix a system, you prevent the next hundred incidents.
The engineer who clicked a phishing link is not the root cause. The root cause is the system that allowed a single phishing click to compromise production data. Maybe there was no MFA. Maybe phishing training was outdated. Maybe email filtering was misconfigured. These are fixable systemic issues.
# Blameless framing
Bad:  "John clicked a phishing link and caused the breach."
Good: "A phishing email bypassed our email filter. The clicked
       link harvested credentials that provided access to
       production because MFA was not enabled on the VPN.
       Our monitoring did not detect the unauthorized access
       for 6 hours because we lacked alerting on VPN logins
       from new locations."
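The good framing identifies four systemic failures that can each be fixed: email filtering (the phishing email got through), MFA enforcement (stolen credentials alone were enough), VPN access controls (a single VPN login reached production), and monitoring coverage (six hours passed without an alert).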
Conducting the Review
Timing
Hold the review within 3-5 business days of incident resolution. Too soon and people are still in crisis mode. Too late and details are forgotten. For major incidents (P1), schedule a preliminary review within 48 hours and a thorough follow-up within two weeks.
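The 3-5 business day window is easy to miscount across a weekend. The sketch below computes it from a resolution date. It is a minimal illustration: the function names are hypothetical, the resolution date is only an example, and public holidays are deliberately ignored.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (Monday to Friday only)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def review_window(resolved_on: date) -> tuple[date, date]:
    """Earliest and latest review dates: 3 and 5 business days after resolution."""
    return add_business_days(resolved_on, 3), add_business_days(resolved_on, 5)


# Example resolution date (a Friday); holidays are not considered.
earliest, latest = review_window(date(2024, 3, 15))
print(f"Hold the review between {earliest} and {latest}")
```

A team with a holiday calendar would extend `add_business_days` to skip those dates as well.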
Participants
Include everyone who was involved in the response: the incident commander, technical responders, communications lead, and scribe. Also include the engineering team that owns the affected system. Do not include management who were not directly involved — their presence can inhibit honest discussion.
Format
# Post-incident review agenda (90 minutes)
1. Timeline review (30 min)
   - Walk through events chronologically
   - Each participant adds their perspective
   - Identify gaps and discrepancies
2. Root cause analysis (25 min)
   - What allowed this to happen?
   - What systemic issues contributed?
   - What previous warnings were missed?
3. What went well (10 min)
   - What worked in our response?
   - What should we keep doing?
4. What could improve (10 min)
   - Where did we struggle?
   - What information was missing?
5. Action items (15 min)
   - Specific changes with owners and deadlines
   - Priority ranking
Timeline Reconstruction
The timeline is the foundation of the review. It answers what happened and when, from initial compromise through detection, containment, and recovery.
# Example timeline
2024-03-15 02:14 UTC        Attacker exploits vulnerable dependency
                            in payment service (CVE-2024-XXXXX)
2024-03-15 02:17 UTC        Attacker establishes reverse shell
2024-03-15 02:23 UTC        Attacker enumerates database credentials
                            from environment variables
2024-03-15 02:31 UTC        First database query from attacker IP
2024-03-15 02:31-05:47 UTC  Attacker exports customer records in
                            batches of 1000
2024-03-15 06:02 UTC        Monitoring alert: unusual database query
                            volume (delayed due to aggregation window)
2024-03-15 06:15 UTC        On-call engineer acknowledges alert
2024-03-15 06:22 UTC        Engineer confirms unauthorized access
2024-03-15 06:25 UTC        Incident declared, IC assigned
2024-03-15 06:30 UTC        Database credentials rotated
2024-03-15 06:35 UTC        Attacker access terminated
2024-03-15 06:45 UTC        Affected systems isolated
2024-03-15 09:00 UTC        Scope assessment: 47,000 customer records
                            potentially accessed
Key metrics from the timeline:
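- Time to compromise: 17 minutes from initial exploit (02:14) to first database access (02:31).
- Dwell time: 3 hours 31 minutes of undetected database access, from the first attacker query (02:31) to the monitoring alert (06:02).
- Time to detect: 3 hours 48 minutes from initial exploit (02:14) to alert (06:02).
- Time to contain: 20 minutes from alert acknowledgment (06:15) to access termination (06:35), of which 13 minutes came after unauthorized access was confirmed (06:22).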
Each of these metrics represents an improvement opportunity. The detection time of nearly 4 hours is the most critical gap — reducing it to minutes would have prevented most data exfiltration.
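Deriving these intervals by hand invites arithmetic slips, so it can help to compute them directly from the recorded timestamps. The sketch below does that for the example timeline. The event labels (`initial_exploit`, `alert_fired`, and so on) and helper names are illustrative assumptions, not part of any standard schema.

```python
from datetime import datetime, timedelta

# Timestamps copied from the example timeline (all UTC, 2024-03-15).
# The dictionary keys are made-up labels for this sketch.
EVENTS = {
    "initial_exploit": "02:14",
    "first_db_query": "02:31",
    "alert_fired": "06:02",
    "alert_acknowledged": "06:15",
    "access_confirmed": "06:22",
    "access_terminated": "06:35",
}


def at(label: str) -> datetime:
    """Parse the timestamp recorded for a labeled event."""
    return datetime.strptime(f"2024-03-15 {EVENTS[label]}", "%Y-%m-%d %H:%M")


def elapsed(start: str, end: str) -> timedelta:
    """Time between two labeled events."""
    return at(end) - at(start)


def fmt(delta: timedelta) -> str:
    """Render a timedelta as hours and minutes."""
    hours, minutes = divmod(int(delta.total_seconds() // 60), 60)
    return f"{hours}h {minutes:02d}m"


intervals = {
    "exploit to first database access": ("initial_exploit", "first_db_query"),
    "first database access to alert": ("first_db_query", "alert_fired"),
    "exploit to alert": ("initial_exploit", "alert_fired"),
    "acknowledgment to access termination": ("alert_acknowledged", "access_terminated"),
    "confirmation to access termination": ("access_confirmed", "access_terminated"),
}
for name, (start, end) in intervals.items():
    print(f"{name}: {fmt(elapsed(start, end))}")
```

Naming the start and end event of every interval explicitly makes it obvious which definition of "time to contain" or "dwell time" a review is using.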
Root Cause Analysis
Root cause analysis goes deeper than "what happened" to "why did the system allow it to happen." The technique is simple: keep asking "why" until you reach a systemic issue that can be fixed.
# The "5 Whys" technique
Why was customer data exfiltrated?
  Because the attacker had database access.
Why did the attacker have database access?
  Because database credentials were in environment variables
  accessible from the compromised service.
Why were credentials accessible from the service?
  Because we store database credentials as environment
  variables rather than using a secrets manager with
  scoped access.
Why don't we use a secrets manager?
  Because the migration was deprioritized in favor of
  feature work three quarters in a row.
Why was security work consistently deprioritized?
  Because our prioritization framework does not account
  for security debt as a risk factor.
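The root cause is not "the attacker exploited a vulnerability." This chain of whys surfaces two systemic causes: secrets were stored insecurely, and the prioritization framework systematically deprioritized security work. Asking the same questions about the other branches of the timeline adds two more: a vulnerable dependency was not patched, and monitoring had insufficient granularity to alert in time. Each of those deserves its own chain of whys.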
Not "Human Error"
Labeling a root cause as "human error" is a failure of analysis. Humans make errors constantly — that is a given. The system must be designed to tolerate human error.
# "Human error" root causes, reframed
"Engineer deployed to production without review"
  → Deployment pipeline allows unreviewed pushes to production
"Admin used weak password"
  → System allows weak passwords; MFA not enforced
"Developer committed secrets to repo"
  → No pre-commit hook scanning for secrets;
    no automated secret detection in CI
Every "human error" can be reframed as a missing guardrail, a broken process, or an inadequate tool.
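To make "missing guardrail" concrete, here is a minimal sketch of the kind of check a pre-commit hook or CI step could run to catch committed secrets. The patterns are illustrative assumptions and far from complete; a real deployment would rely on a maintained secret-scanning tool rather than three hand-written regular expressions.

```python
import re

# Illustrative patterns only. These are assumptions for the sketch,
# not a complete or authoritative rule set.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
    "hardcoded_password": re.compile(
        r"(?i)\b(?:password|passwd|secret)\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every line that looks like a secret."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings


sample = 'db_host = "db.internal"\npassword = "hunter2-correct-horse"\n'
for line_number, name in scan_text(sample):
    print(f"line {line_number}: possible {name}")
```

A hook built on this would run `scan_text` over each staged file and exit with a non-zero status when anything is found, so the commit is blocked before the secret reaches the repository.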
Action Items
Action items are the output that matters. Without concrete, assigned, and deadlined actions, the review is an expensive storytelling session.
Structure
Every action item needs three things: an owner (a specific person, not a team), a deadline, and a clear definition of done.
# Example action items
1. Migrate payment service secrets to HashiCorp Vault
   Owner: Sarah (Platform Engineering)
   Deadline: 2024-04-15
   Done when: All payment service credentials are sourced
              from Vault; environment variables removed
2. Reduce database query monitoring aggregation window
   from 1 hour to 5 minutes
   Owner: Marcus (Security Operations)
   Deadline: 2024-03-29
   Done when: Alert fires within 5 minutes of anomalous
              query patterns
3. Add CVE-2024-XXXXX to dependency scanning rules
   and audit all services for the vulnerable package
   Owner: Priya (Application Security)
   Deadline: 2024-03-22
   Done when: All instances patched and scan rule active
4. Propose security debt tracking in sprint planning
   Owner: James (Engineering Manager)
   Deadline: 2024-04-01
   Done when: Security debt items visible in backlog
              with risk-based prioritization
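The owner, deadline, and definition-of-done rule can be enforced mechanically when action items are recorded. The following is a sketch under assumed names: `ActionItem`, `validation_errors`, and the `TEAMS` set are hypothetical and do not correspond to any particular tracking tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ActionItem:
    title: str
    owner: str                # a named person, not a team
    deadline: Optional[date]  # None means no deadline was set
    done_when: str            # observable definition of done


def validation_errors(item: ActionItem, team_names: frozenset) -> list[str]:
    """Return the reasons an item fails the owner/deadline/done-when rule."""
    errors = []
    owner = item.owner.strip()
    if not owner:
        errors.append("missing owner")
    elif owner in team_names:
        errors.append("owner is a team, not a person")
    if item.deadline is None:
        errors.append("missing deadline")
    if not item.done_when.strip():
        errors.append("missing definition of done")
    return errors


# Hypothetical list of team names used to catch team-level ownership.
TEAMS = frozenset({"Platform Engineering", "Security Operations"})

good = ActionItem(
    title="Migrate payment service secrets to HashiCorp Vault",
    owner="Sarah",
    deadline=date(2024, 4, 15),
    done_when="All payment service credentials are sourced from Vault",
)
vague = ActionItem("Improve monitoring", "Security Operations", None, "")

for item in (good, vague):
    print(item.title, "->", validation_errors(item, TEAMS) or "ok")
```

Rejecting items at entry time keeps vague intentions such as "improve monitoring" from reaching the tracker in the first place.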
Follow-Up
Track action items to completion. Schedule a follow-up check at 30 and 60 days. Action items that are never completed represent the organization's actual risk tolerance — if you consistently fail to complete post-incident actions, you are accepting the risk of recurrence.
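A small script can generate the follow-up dates and list what is overdue at each check. The sketch below assumes the 30 and 60 days are counted from the review date, which the text does not specify. The review date, item records, and completion states are hypothetical.

```python
from datetime import date, timedelta


def follow_up_dates(review_date: date, offsets_days=(30, 60)) -> list[date]:
    """Dates of the follow-up checks, counted from the review date (an assumption)."""
    return [review_date + timedelta(days=n) for n in offsets_days]


def overdue_items(items: list[dict], today: date) -> list[str]:
    """Titles of items that are past their deadline and still open."""
    return [
        item["title"]
        for item in items
        if not item["completed"] and item["deadline"] < today
    ]


# Hypothetical tracker state for illustration.
items = [
    {"title": "Reduce aggregation window", "deadline": date(2024, 3, 29), "completed": True},
    {"title": "Migrate secrets to Vault", "deadline": date(2024, 4, 15), "completed": False},
]

review_date = date(2024, 3, 20)  # hypothetical review date
for check in follow_up_dates(review_date):
    print(check, "overdue:", overdue_items(items, today=check))
```

Anything this prints at the 60-day check is a candidate for escalation, since it has survived two reviews without being closed.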
Sharing Lessons
Internal Sharing
Share sanitized incident summaries across the organization. Other teams likely have similar vulnerabilities. A post-incident summary distributed to all engineering teams may prevent ten other teams from making the same mistake.
# Internal incident summary template
Incident: Payment service data exposure
Date: 2024-03-15
Severity: P1
Impact: 47,000 customer records potentially accessed
What happened: [2-3 sentence summary]
Key lessons:
  - Secrets in environment variables are accessible to
    anyone who compromises the host process
  - Dependency patching delays directly translate to
    breach risk
  - Monitoring aggregation windows determine minimum
    detection time
Action for your team:
  - Audit your services for secrets stored in env vars
  - Verify your dependency scanning covers all packages
  - Review your monitoring aggregation windows
External Sharing
Some organizations publish post-incident reports publicly. Cloudflare, GitHub, GitLab, and Google regularly publish detailed incident analyses. External sharing builds trust with customers and contributes to the broader security community's knowledge base.
Sanitize external reports before publication. Remove attacker techniques that could enable copycats, specific customer impact data, and proprietary security architecture details.
The Incident Database
Individual incidents teach individual lessons. An incident database reveals patterns across time.
# Incident database fields
- Incident ID and date
- Severity level
- Category (data breach, malware, DDoS, insider threat)
- Detection method (automated, user report, third-party)
- Time to detect, contain, and resolve
- Root causes
- Action items and completion status
- Cost estimate (response effort, business impact, fines)
Patterns emerge over time. If three incidents in six months involve exposed secrets, the pattern is clear regardless of the individual root causes. If detection time has not improved over a year despite action items, the organization is not investing enough in monitoring.
Review the incident database quarterly. Look for recurring root causes, trending categories, and whether detection and response times are improving. Present findings to leadership with resource requests tied to specific pattern data.
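The quarterly review can start from two simple queries: which root causes recur, and whether detection time is trending down. The sketch below shows one possible shape for both. The `Incident` fields are a subset of the database fields listed earlier, and every record in `SAMPLE` is hypothetical data invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median


@dataclass
class Incident:
    incident_id: str
    quarter: str              # e.g. "2024-Q1"
    category: str
    root_causes: list[str]
    minutes_to_detect: int


def recurring_root_causes(incidents: list[Incident], threshold: int = 3) -> dict[str, int]:
    """Root causes that appear in at least `threshold` distinct incidents."""
    counts = Counter(
        cause for incident in incidents for cause in set(incident.root_causes)
    )
    return {cause: n for cause, n in counts.items() if n >= threshold}


def median_detection_by_quarter(incidents: list[Incident]) -> dict[str, float]:
    """Median time to detect, in minutes, for each quarter in chronological order."""
    by_quarter: dict[str, list[int]] = {}
    for incident in incidents:
        by_quarter.setdefault(incident.quarter, []).append(incident.minutes_to_detect)
    return {quarter: median(values) for quarter, values in sorted(by_quarter.items())}


# Hypothetical records, not real incident data.
SAMPLE = [
    Incident("INC-001", "2024-Q1", "data breach", ["exposed secrets", "unpatched dependency"], 228),
    Incident("INC-002", "2024-Q1", "data breach", ["exposed secrets"], 90),
    Incident("INC-003", "2024-Q2", "malware", ["missing MFA"], 45),
    Incident("INC-004", "2024-Q2", "data breach", ["exposed secrets", "missing MFA"], 60),
]

print(recurring_root_causes(SAMPLE))
print(median_detection_by_quarter(SAMPLE))
```

Results like these give leadership a concrete basis for resource requests: a root cause that clears the threshold, or a median detection time that is flat quarter over quarter, is pattern data rather than anecdote.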
Common Pitfalls
- Blaming individuals. People stop reporting incidents and participating honestly in reviews. The culture becomes one of hiding problems rather than fixing systems.
- Root cause analysis that stops too early. "The dependency was unpatched" is not a root cause. Why was it unpatched? What process failed? Go deeper.
- Action items without owners or deadlines. "We should improve monitoring" is not an action item. "Marcus will reduce the alert aggregation window to 5 minutes by March 29" is an action item.
- Never following up on action items. Incomplete action items mean accepting recurrence risk. Track completion and escalate overdue items.
- Skipping the review entirely. Under pressure to move on, teams skip the review and lose the most valuable part of the incident: the lessons.
- Not sharing across teams. The same vulnerability often exists in multiple services. If only the affected team learns, the organization remains vulnerable.
Key Takeaways
- Blameless post-mortems focus on systemic failures, not individual mistakes.
- Timeline reconstruction with metrics (time to detect, time to contain) reveals the biggest improvement opportunities.
- Root cause analysis must go beyond surface causes — keep asking "why" until you reach fixable systemic issues.
- "Human error" is never a root cause. Find the missing guardrail or broken process.
- Action items need specific owners, deadlines, and definitions of done.
- Share lessons internally across teams and consider external publication.
- An incident database tracks patterns over time, revealing systemic weaknesses that individual reviews miss.