Pre-Mortems
Overview
A post-mortem asks "what went wrong?" after a failure. A pre-mortem asks "what went wrong?" before you start. The technique, developed by psychologist Gary Klein, works like this: imagine your project has already failed. Now explain why it failed.
This simple shift in framing is surprisingly powerful because it counteracts the optimism bias that plagues planning. When a team is asked "what could go wrong?", people tend to downplay risks to avoid looking negative. When asked "the project failed, why?", they compete to identify the most likely cause. The framing gives people permission to be honest about risks they already see but hesitate to voice.
How to Run a Pre-Mortem
Step 1: Gather the team
Invite everyone involved in the project, including engineers,
product managers, designers, and anyone who will be affected
by the outcome. A group of 5-8 people is ideal.
Step 2: Set the frame
"It is six months from now. This project has failed completely.
It did not ship, or it shipped and was a disaster.
Take 5 minutes. Write down independently why it failed."
Key word: independently. No discussion yet. This prevents
groupthink and ensures the loudest person doesn't anchor
everyone else's thinking.
Step 3: Collect and share
Go around the room. Each person reads their failure reasons.
No judgment, no rebuttals. Just collect.
Step 4: Cluster and prioritize
Group similar failure reasons. Vote on which are most likely
and most impactful.
Step 5: Mitigate
For the top 3-5 risks, define concrete actions:
→ What would prevent this failure mode?
→ What early warning signal would tell us this is happening?
→ Who owns the mitigation?
Step 6: Revisit
Check back on these risks at regular intervals (monthly or
at each milestone). Were we right? Have new risks appeared?
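Steps 4 and 5 can be captured in a small risk register. The following is a minimal sketch, not a prescribed tool: the `Risk` fields, the `top_risks` helper, and the choice to add the two vote counts into one score are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One clustered failure reason from the pre-mortem (illustrative fields)."""
    description: str
    likelihood_votes: int      # Step 4: votes for "most likely"
    impact_votes: int          # Step 4: votes for "most impactful"
    prevention: str = ""       # Step 5: what would prevent this failure mode?
    warning_signal: str = ""   # Step 5: what early signal tells us it is happening?
    owner: str = ""            # Step 5: who owns the mitigation?

    @property
    def score(self) -> int:
        # Assumption: combine the two vote counts by simple addition.
        return self.likelihood_votes + self.impact_votes

    def is_actionable(self) -> bool:
        """True only when all three Step 5 questions have an answer."""
        return all([self.prevention, self.warning_signal, self.owner])


def top_risks(risks: list[Risk], limit: int = 5) -> list[Risk]:
    """Return the highest-scoring risks, at most `limit` of them."""
    return sorted(risks, key=lambda r: r.score, reverse=True)[:limit]
```

A risk that makes the top list but fails `is_actionable()` is a signal that Step 5 is not finished for it yet.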
Why Pre-Mortems Work Better Than Risk Assessment
Traditional risk assessment:
"What could go wrong?"
→ People minimize risks to appear confident
→ The planning fallacy kicks in: we assume the best case
→ Risks feel hypothetical and distant
→ The team has already committed to the plan emotionally
Pre-mortem:
"The project failed. Why?"
→ People compete to identify the most compelling failure
→ The framing breaks the optimism bias
→ Risks feel concrete because the failure has "already happened"
→ No one has to be the pessimist — everyone is explaining
what went wrong
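Research on prospective hindsight, the technique a pre-mortem
relies on, found that imagining an event has already occurred
increases the ability to correctly identify reasons for future
outcomes by 30% compared to standard prospective thinking
(Mitchell, Russo, and Pennington, 1989).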
Pre-Mortem for a System Migration
Project: Migrate the payment service from a monolith to
a standalone microservice
Pre-mortem framing: "It's three months later. The migration
failed. We had to roll back to the monolith. Why?"
Team member responses (written independently):
Engineer A:
"We didn't account for the 47 places in the monolith that
directly query the payments table. The new service had an API
but half the callers were still hitting the database directly."
Engineer B:
"The data migration had a bug. Some payment records were
duplicated. We didn't discover this until customers were
charged twice."
Product Manager:
"We set a deadline based on the executive presentation date,
not based on the actual complexity. We cut testing to hit
the deadline."
Engineer C:
"The new service couldn't handle the traffic spike on the
first of the month (billing day). We load tested with
average traffic, not peak traffic."
QA Engineer:
"We tested the happy path but not the failure paths. When
the new service went down, the monolith had no fallback.
In the old architecture, payments just worked because
everything was in-process."
Top risks after clustering and voting:
1. Hidden database dependencies (Engineer A)
2. Data migration errors (Engineer B)
3. Deadline-driven scope cuts (Product Manager)
4. Load testing gaps (Engineer C)
Mitigations:
1. Audit all database queries for payments tables before
starting. No migration until we have a complete dependency map.
2. Run data migration on a copy of production data first.
Reconcile every record. Automate the reconciliation check.
3. Define a minimum viable migration. Anything below that
scope means we delay, not cut quality.
4. Load test with 2x peak traffic, not average traffic.
Run the test monthly leading up to cutover.
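Mitigation 2 calls for an automated reconciliation check. One way it might look is sketched below, assuming payment records can be reduced to a payment ID and an amount in cents. The `reconcile` and `is_clean` names and the record shapes are hypothetical, not taken from any real payment system.

```python
from collections import Counter


def reconcile(source: dict[str, int],
              migrated: list[tuple[str, int]]) -> dict[str, list[str]]:
    """Compare monolith payment records with the migrated copy.

    source:   payment_id -> amount in cents, read from the monolith
    migrated: (payment_id, amount_cents) rows read from the new service
    """
    counts = Counter(pid for pid, _ in migrated)
    return {
        # In the monolith but absent from the new service.
        "missing": sorted(pid for pid in source if pid not in counts),
        # Present more than once: the double-charge scenario from Engineer B.
        "duplicated": sorted(pid for pid, n in counts.items() if n > 1),
        # In the new service but never in the monolith.
        "unexpected": sorted(pid for pid in counts if pid not in source),
        # Same ID, different amount in at least one migrated row.
        "amount_mismatch": sorted(
            {pid for pid, amount in migrated
             if pid in source and source[pid] != amount}
        ),
    }


def is_clean(report: dict[str, list[str]]) -> bool:
    """The migration passes only if every category is empty."""
    return not any(report.values())
```

Running this against the production copy after every trial migration turns "reconcile every record" into a pass/fail gate rather than a manual spot check.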
Pre-Mortem for a Sprint
Pre-mortems are not just for large projects. A 10-minute pre-mortem at the start of a sprint costs little and can surface the blockers most likely to derail the sprint goal.
Sprint goal: Ship the new user onboarding flow
Pre-mortem framing: "The sprint ended. We didn't ship the
onboarding flow. Why?"
Team responses:
"The design wasn't finalized. We started building based on
wireframes and the final designs required significant changes
midway through the sprint."
"We underestimated the email integration. The email service
has a 3-day SLA for new template approvals and we didn't
submit the templates before sprint start."
"The A/B testing framework had a bug that took two days to
diagnose. We assumed it worked because the last experiment
was months ago."
"Two of the three engineers had on-call shifts this sprint.
One incident took 8 hours."
Mitigations:
→ Get design sign-off before the sprint starts (or timebox
design changes to days 1-2 only)
→ Submit email templates today, before the sprint officially begins
→ Test the A/B framework with a dummy experiment on day 1
→ Account for on-call load in sprint capacity planning
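The last mitigation is simple arithmetic. A rough sketch follows, assuming on-call duty costs each on-call engineer a fixed fraction of their time. The `sprint_capacity` function and the 25% default are illustrative guesses and should be calibrated from your own incident history.

```python
def sprint_capacity(engineers: int, sprint_days: int,
                    on_call_engineers: int,
                    on_call_load: float = 0.25) -> float:
    """Engineer-days available for sprint work after on-call load.

    on_call_load is the assumed fraction of an on-call engineer's
    time that goes to incidents and interruptions (hypothetical default).
    """
    if not 0 <= on_call_engineers <= engineers:
        raise ValueError("on_call_engineers must be between 0 and engineers")
    if not 0.0 <= on_call_load <= 1.0:
        raise ValueError("on_call_load must be between 0 and 1")
    full_capacity = engineers * sprint_days
    lost_to_on_call = on_call_engineers * sprint_days * on_call_load
    return full_capacity - lost_to_on_call
```

For the scenario above (three engineers, two of them on call) and a 10-day sprint, `sprint_capacity(3, 10, 2)` gives 25 engineer-days instead of the 30 a naive plan would assume.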
Pre-Mortem for a Product Launch
Project: Launch the public API for external developers
Pre-mortem framing: "The API launched three months ago. It's
considered a failure. Adoption is near zero and the few users
we have are angry. Why?"
Team responses:
"The documentation was auto-generated from code comments.
It was technically accurate but had no tutorials, no quick
start guide, no example applications. Developers couldn't
figure out how to get started."
"We didn't version the API from day one. We made breaking
changes in the first month. Early adopters had to rewrite
their integrations. They posted about it publicly."
"Rate limits were too aggressive. Developers hit the limit
during normal usage and got cryptic 429 errors with no
explanation of the limit or when it resets."
"The authentication flow required six steps. Competitors
required two. Developers gave up during onboarding."
"We had no way to communicate with API users. When we
had planned downtime, they found out when their calls
started failing."
Top risks:
1. Documentation quality
2. API versioning
3. Developer experience (rate limits, auth, errors)
4. Communication channel
Mitigations:
1. Write a quick start guide and three example apps before
launch. Have an external developer test the docs.
2. Version the API from day one. Define the breaking change
policy. Commit to 6-month deprecation windows.
3. Test the full developer experience end-to-end. Simplify
auth. Make rate limit errors include the limit, current
usage, and reset time.
4. Create a developer changelog, status page, and mailing
list before launch.
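Mitigation 3 asks for rate limit errors that include the limit, current usage, and reset time. A minimal sketch of such a response is below. `Retry-After` is a standard HTTP header, while the `X-RateLimit-*` headers are a widely used convention rather than a standard. The function name and the JSON field names are assumptions made for illustration.

```python
import json
import math
import time
from typing import Optional


def rate_limit_response(limit: int, used: int, reset_epoch: int,
                        now: Optional[float] = None) -> tuple[int, dict[str, str], str]:
    """Build a self-explanatory 429 response: (status, headers, JSON body)."""
    now = time.time() if now is None else now
    # Round up so clients are never told to retry before the window resets.
    retry_after = max(0, math.ceil(reset_epoch - now))
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(retry_after),              # standard header, seconds
        "X-RateLimit-Limit": str(limit),              # convention, not a standard
        "X-RateLimit-Remaining": str(max(0, limit - used)),
        "X-RateLimit-Reset": str(reset_epoch),        # Unix timestamp
    }
    body = {
        "error": "rate_limit_exceeded",
        "message": (f"Limit of {limit} requests reached. "
                    f"Retry in {retry_after} seconds."),
        "limit": limit,
        "used": used,
        "reset_at": reset_epoch,
        "retry_after_seconds": retry_after,
    }
    return 429, headers, json.dumps(body)
```

A developer who hits this response can see what the limit is, how much of it they used, and exactly when to retry, without opening a support ticket.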
Pre-Mortem for Architecture Decisions
Decision: Adopt event-driven architecture for the order system
Pre-mortem framing: "We adopted event-driven architecture a
year ago. The team regrets it. Why?"
Team responses:
"Debugging is a nightmare. When an order fails, we have to
trace events across five services. There's no single place
to see the full order lifecycle."
"Event ordering issues cause subtle bugs. A 'payment received'
event sometimes arrives before the 'order created' event.
The code handles this most of the time but not always."
"The team doesn't understand event-driven patterns. Most
engineers came from a request-response background. They're
building request-response patterns on top of events, getting
the worst of both worlds."
"We have no way to replay events. When we need to fix a bug
in an event handler and reprocess, we have to write a
one-off script every time."
Mitigations:
→ Build distributed tracing from day one, not after problems appear
→ Design for out-of-order events explicitly. Every handler
must be idempotent and handle events arriving in any order.
→ Invest in training. Run workshops on event-driven patterns
before starting implementation.
→ Build event replay capability into the infrastructure
before writing the first event handler.
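The second mitigation, idempotent handlers that tolerate any arrival order, can be sketched as follows. This is an in-memory illustration under simplifying assumptions: the event names follow the example above, the `OrderProjection` class is hypothetical, and a real system would persist the seen event IDs and order state.

```python
from dataclasses import dataclass, field


@dataclass
class OrderState:
    created: bool = False
    paid: bool = False


@dataclass
class OrderProjection:
    """Builds order state from events that may be duplicated or out of order."""
    orders: dict[str, OrderState] = field(default_factory=dict)
    seen_event_ids: set[str] = field(default_factory=set)

    def handle(self, event_id: str, event_type: str, order_id: str) -> bool:
        """Apply one event. Returns False if it was a duplicate and skipped."""
        if event_type not in ("order_created", "payment_received"):
            raise ValueError(f"unknown event type: {event_type}")
        if event_id in self.seen_event_ids:
            return False  # idempotency: a redelivered event has no effect
        # Create state on first sight of the order, whichever event comes
        # first, so "payment_received" may precede "order_created".
        state = self.orders.setdefault(order_id, OrderState())
        if event_type == "order_created":
            state.created = True
        else:
            state.paid = True
        self.seen_event_ids.add(event_id)
        return True

    def is_ready_to_fulfil(self, order_id: str) -> bool:
        """An order is ready once both events have arrived, in any order."""
        state = self.orders.get(order_id)
        return state is not None and state.created and state.paid
```

Because each event only sets a flag, applying the same events in any order, with any number of redeliveries, ends in the same state. The same property is what makes replaying an event log into a fresh projection safe.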
Common Pre-Mortem Failure Categories
When running a pre-mortem, prompt the team to consider these categories if they get stuck.
Category             Example failures
──────────────────────────────────────────────────────────────────
Technical risk       "The third-party API we depend on has no SLA"
People risk          "The only person who knows the auth system
                     is on paternity leave during the migration"
Scope risk           "Requirements will change midway through"
Timeline risk        "The deadline assumes everything goes right"
Dependency risk      "We need the platform team to deploy X first
                     and they haven't committed to a date"
Knowledge risk       "Nobody on the team has built this before"
Operational risk     "We have no runbook for the new system"
Data risk            "Migration could corrupt or lose records"
Communication risk   "Stakeholders expect X but we're building Y"
Making Pre-Mortems a Habit
When to run a pre-mortem:
Always:
→ Before any project longer than 2 weeks
→ Before a major migration or architectural change
→ Before a product launch
Recommended:
→ At the start of each sprint (10-minute version)
→ Before any decision that is hard to reverse
→ When joining a project that's already in progress
("if this project fails from here, why?")
Format options:
→ Full session: 30-60 minutes for major projects
→ Quick round: 10 minutes at sprint planning
→ Async: Post the prompt in Slack, collect responses
over 24 hours, discuss the top risks synchronously
The async format works well for distributed teams. The key
is the independent writing step — people must write their
own failure reasons before seeing others' responses.
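The independence rule for the async format can be enforced mechanically by keeping responses hidden until collection closes. The sketch below is a hypothetical in-memory model, not an integration with Slack or any other chat tool. The `AsyncPreMortem` class and its method names are invented for illustration.

```python
class AsyncPreMortem:
    """Collects failure reasons privately and reveals them only after closing."""

    def __init__(self, prompt: str, participants: set[str]):
        self.prompt = prompt
        self.participants = set(participants)
        self._responses: dict[str, list[str]] = {}
        self.closed = False

    def submit(self, participant: str, reasons: list[str]) -> None:
        """Record one person's reasons; resubmitting replaces the earlier entry."""
        if self.closed:
            raise RuntimeError("collection is closed")
        if participant not in self.participants:
            raise KeyError(participant)
        self._responses[participant] = list(reasons)

    def pending(self) -> set[str]:
        """Participants who have not written anything yet."""
        return self.participants - set(self._responses)

    def close(self) -> None:
        """End the collection window (for example, after 24 hours)."""
        self.closed = True

    def reveal(self) -> dict[str, list[str]]:
        """Responses stay hidden until the collection window has closed."""
        if not self.closed:
            raise RuntimeError("responses stay hidden until collection closes")
        return {name: list(reasons) for name, reasons in self._responses.items()}
```

In practice the same effect can come from collecting responses by direct message or a form and posting them together once the window ends.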
Common Pitfalls
- Skipping the independent writing step: If you just ask the room "why did it fail?" the first person to speak anchors everyone else. The power of pre-mortems comes from independent, parallel thinking. People must write their reasons before any discussion.
- Not acting on the results: A pre-mortem that produces a list of risks and no mitigations is a waste of time. Every pre-mortem must end with specific actions assigned to specific people with specific deadlines.
- Only considering technical risks: Projects fail for people reasons as often as technical reasons. Make sure the pre-mortem covers communication gaps, knowledge concentration, unrealistic timelines, and organizational dependencies.
- Running it too late: A pre-mortem after the architecture is locked in and the deadline is set has limited value. Run it early enough that you can actually change the plan based on what you learn.
- Treating it as a one-time event: The risks you identify at the start of a project are not the only risks. Revisit the pre-mortem results at each milestone. Add new risks as the project progresses. Some original risks will have been mitigated; new ones will have emerged.
- Dismissing "unlikely" risks: The value of a pre-mortem is surfacing risks that people feel but don't voice. If someone raises a risk that seems unlikely, don't dismiss it. Discuss what the early warning signs would be and what a low-cost mitigation looks like.
Key Takeaways
- A pre-mortem imagines the project has already failed and asks why. This framing breaks optimism bias and gives people permission to voice risks they already see but hesitate to raise.
- The independent writing step is essential. Everyone writes down their failure reasons before discussion. This prevents groupthink and ensures diverse perspectives surface.
- Apply pre-mortems to projects, sprints, migrations, architecture decisions, and product launches. Scale the format from 10-minute quick rounds to 60-minute full sessions based on the stakes.
- Every pre-mortem must produce specific mitigations with owners and deadlines. A list of risks without actions is planning theater.
- Revisit pre-mortem results regularly. Risks evolve as the project progresses. Some are mitigated, new ones emerge, and the likelihood of others changes based on what you learn during execution.