5 Whys

The 5 Whys is a deceptively simple root-cause analysis technique: when a problem happens, ask "why?" — then ask "why?" about the answer, and keep going until you reach the underlying systemic cause rather than the surface symptom. The method is associated with Toyota and has become a default tool in engineering post-mortems, incident analysis, and process improvement. Its value is not in the exact count (sometimes 3 is enough, sometimes 7) but in the discipline of refusing to stop at the first plausible-sounding cause.

[Figure: 5 Whys cascade from symptom to root cause, with common traps to avoid]

Origin

The technique is credited to Sakichi Toyoda, founder of Toyota Industries, who is said to have developed it for his automatic loom business in the early twentieth century. Taiichi Ohno, the architect of the Toyota Production System, institutionalized it in post-war Toyota manufacturing as a core lean-management tool. It spread from automotive manufacturing into general management literature through books like The Toyota Way (Jeffrey Liker) and into software engineering through the DevOps and SRE communities, where it became a familiar part of post-incident review practice. It is now common in incident response at tech companies.

The Framework

Start:  State the problem as a factual observation.
Why 1:  Why did this happen?
Why 2:  Why did that happen?
Why 3:  Why did that happen?
Why 4:  Why did that happen?
Why 5:  Why did that happen?
Stop:   When the answer points to a systemic cause (process, design,
        incentive) rather than a person or single action.
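
The template above maps naturally onto a small data structure. The following Python sketch shows one way to record a chain as it is built. The `WhyChain` class and its method names are hypothetical, introduced here only for illustration; they are not part of any standard tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WhyChain:
    """One 5 Whys chain: a factual problem statement plus successive answers."""

    problem: str
    answers: List[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> "WhyChain":
        """Record the answer to "why?" asked about the previous link."""
        self.answers.append(answer)
        return self

    @property
    def root_cause(self) -> Optional[str]:
        """The deepest answer so far, or None if no "why?" has been asked."""
        return self.answers[-1] if self.answers else None

    def render(self) -> str:
        """Format the chain in the Start / Why N layout used above."""
        lines = [f"Start:  {self.problem}"]
        lines += [f"Why {i}:  {a}" for i, a in enumerate(self.answers, start=1)]
        return "\n".join(lines)


chain = (
    WhyChain("The payment service returned 500s for 17 minutes.")
    .ask_why("The retry handler threw a null pointer exception.")
    .ask_why("A deploy changed the downstream response format.")
)
print(chain.render())
```

The class deliberately does not enforce a count of five: the chain can hold three answers or seven, and deciding when the deepest answer is systemic remains a human judgment.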

Example

Problem: The payment service returned 500s to 8,200 customers for
         17 minutes on April 9.

Why 1?   Because the retry handler threw a null pointer exception
         when a downstream service returned an unexpected shape.

Why 2?   Because a deploy the previous day changed the downstream
         service's response format without updating the retry handler.

Why 3?   Because the retry handler's test suite did not exercise the
         new response format; the test was a static mock.

Why 4?   Because we do not have a contract-testing discipline between
         services that would have caught the shape change.

Why 5?   Because we adopted microservices without adopting the
         contract-testing practice that keeps them safe at scale,
         and no one on the platform team has been staffed to
         introduce it.

Stop:    The systemic root cause is a missing engineering practice
         at the organizational level. Fixes:
         - Immediate: add contract tests for the retry handler.
         - Systemic: staff a contract-testing initiative in Q3.

Notice what the 5 Whys produced that a 1-Why would not: the real problem is not a specific bug but a missing practice. The 1-Why fix ("fix the null pointer") removes this bug and leaves the next one of the same kind just around the corner.

How to Use It

In a Post-Mortem

1. State the problem as a factual observation, not an attribution.
   Good: "The payment service returned 500s for 17 minutes."
   Bad:  "Sam broke the payment service."

2. Ask "why?" Record the first plausible answer, but do not stop there.

3. Drill into the answer. Ask "why?" again.

4. Continue until you hit one of three stopping conditions:
   - Organizational/systemic cause (fixable via process/design)
   - Physical law or external constraint (cannot be fixed further)
   - You are looping (you have passed the useful depth)

5. Identify fixes at multiple levels:
   - Local fix (for the immediate trigger)
   - Systemic fix (for the root cause)
   - Monitoring (so you detect this class of issue faster next time)
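
Steps 4 and 5 can be sketched as two small checks. This is a minimal Python illustration, not a prescribed implementation: the names `Stop`, `check_stop`, and `missing_fix_levels` are hypothetical, and only the looping condition is detected mechanically. Whether an answer is systemic or an external constraint is assumed to be the facilitator's call, passed in as `latest_kind`.

```python
from enum import Enum
from typing import Dict, List, Optional


class Stop(Enum):
    SYSTEMIC = "organizational/systemic cause"
    EXTERNAL = "physical law or external constraint"
    LOOPING = "answers are repeating"


def check_stop(answers: List[str], latest_kind: Optional[Stop] = None) -> Optional[Stop]:
    """Return the reason to stop, or None to keep asking "why?".

    latest_kind is the facilitator's judgment of the newest answer
    (SYSTEMIC or EXTERNAL). Looping is detected by exact repetition,
    ignoring case and surrounding whitespace.
    """
    if not answers:
        return None
    normalized = [a.strip().lower() for a in answers]
    if normalized[-1] in normalized[:-1]:
        return Stop.LOOPING
    return latest_kind


REQUIRED_LEVELS = ("local", "systemic", "monitoring")


def missing_fix_levels(fixes: Dict[str, str]) -> List[str]:
    """List the fix levels from step 5 that have no action recorded yet."""
    return [level for level in REQUIRED_LEVELS if not fixes.get(level)]


fixes = {
    "local": "Add contract tests for the retry handler.",
    "systemic": "Staff a contract-testing initiative in Q3.",
}
print(missing_fix_levels(fixes))  # ['monitoring']
```

Run against the fixes from the payment-service example, the check points out that no monitoring action has been recorded, which is the level most often forgotten once the local and systemic fixes are agreed.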

In a Retro

5 Whys works in retros too, especially for "why did we miss the deadline" or "why did this initiative fail" conversations.

Why did Q2 launch slip?
  Because the API spec was incomplete at sprint start.

Why?
  Because the product team was still iterating on requirements.

Why?
  Because the customer research came back late.

Why?
  Because we started customer research after engineering kickoff.

Why?
  Because our intake process does not require customer research
  before engineering estimates.

Root cause: intake process design. Fix: update the intake
template to require validated research before eng estimates.

Multiple Paths

Real problems often have multiple causal chains. Run 5 Whys down each path separately rather than trying to merge them.

Incident had 3 contributing causes. Run 5 Whys on each:
  Path A: why was the alert silenced?
  Path B: why did the deploy succeed despite the test failure?
  Path C: why was the on-call engineer paged late?

Each path has its own root cause and its own fix.
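
Keeping the paths separate can be sketched as one record per path, each tracked to its own root cause and fix. In the Python sketch below, the `Path` class and `unfinished` helper are hypothetical names, and the answer and fix filled in for Path A are invented placeholders for illustration, not findings from a real incident.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Path:
    """One causal path, analyzed independently of the others."""

    question: str
    answers: List[str] = field(default_factory=list)
    fix: str = ""

    @property
    def root_cause(self) -> Optional[str]:
        return self.answers[-1] if self.answers else None


def unfinished(paths: Dict[str, Path]) -> List[str]:
    """Names of paths that still lack a root cause or a fix."""
    return [name for name, p in paths.items() if p.root_cause is None or not p.fix]


paths = {
    "A": Path("Why was the alert silenced?"),
    "B": Path("Why did the deploy succeed despite the test failure?"),
    "C": Path("Why was the on-call engineer paged late?"),
}

# Placeholder content for illustration only.
paths["A"].answers.append("(placeholder) Silences never expire automatically.")
paths["A"].fix = "(placeholder) Add an expiry to every alert silence."

print(unfinished(paths))  # ['B', 'C']
```

Tracking each path on its own makes it visible when one chain has been closed out and the others have been quietly dropped.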

Tech & Company Example

A quarterly review surfaces that OKR completion was 40%, the worst in two years. Leadership wants to know why. Bad analysis:

"Why was OKR completion low? Because teams missed their targets.
 Why? Because they were too ambitious. Fix: set easier OKRs."

This is a 2-Whys analysis, and its conclusion is almost certainly wrong. The 5 Whys version:

Why was OKR completion 40%?
  Because 18 of 45 OKRs landed <50% of target.

Why those 18?
  Because they were concentrated in two orgs — Platform and Data —
  and those orgs had significantly more unplanned work than others.

Why was there more unplanned work there?
  Because those orgs absorbed the bulk of incident response and
  the live-site load was 2x the prior quarter.

Why was live-site load 2x?
  Because the Q1 launch of Product X shipped with known reliability
  debt and the team has not had capacity to remediate.

Why has the team not had capacity?
  Because every quarter the OKRs are filled with new-feature work
  rather than reliability work, because feature work is how the org
  measures success at review time.

Root cause: The performance/review system rewards new features and
does not reward reliability work. Every quarter, the team is locally
rational (ship features) and the outcome is globally irrational
(chronic unplanned work that destroys OKR completion).

Fix:
  - Local:    Q3 OKRs explicitly include reliability work as a
              top-level goal (25% allocation).
  - Systemic: The performance calibration process now counts
              reliability investment, as measured by on-call load
              reduction, as equivalent to feature delivery.

The correct conclusion is the opposite of the superficial one ("lower OKR ambition"). The real fix is to change how the organization rewards reliability work, and the payoff shows up as higher OKR completion two quarters later.

When It Works

Post-incident reviews and blameless post-mortems
Retrospectives where a specific outcome needs to be understood
Process improvement (why is this process slow?)
Quality problems (why is this defect rate high?)
Team dynamics (why is this team missing commitments?)

When It Does Not Work

Problems with genuinely multiple independent causes: 5 Whys implies a single causal chain. Some problems are additive or statistical (e.g., "quality regressed because 10 small things got 5% worse each"); a causal tree or an Ishikawa diagram is more apt.
Systemic problems with long feedback loops: Organizational dysfunction rarely has a single root cause; it has many reinforcing loops. 5 Whys finds one thread; systems-thinking tools find the web.
Highly emotional or interpersonal situations: Asking "why?" of a person who is hurt or scared reads as interrogation. Use different frames for human-factors issues.
When facts are genuinely unknown: 5 Whys on speculation is just speculation.
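
The "10 small things got 5% worse each" case is worth working through, because the arithmetic shows why a single chain cannot explain it. The Python sketch below assumes the ten effects are equal and either simply add or compound multiplicatively; both models are assumptions made here for illustration.

```python
factors = 10
per_factor_drop = 0.05

# If the effects simply add up.
additive_drop = factors * per_factor_drop

# If each effect multiplies what is left after the previous one.
compounded_drop = 1 - (1 - per_factor_drop) ** factors

# Share of the additive regression explained by any single factor.
largest_single_share = per_factor_drop / additive_drop

print(f"additive: {additive_drop:.0%}")                      # additive: 50%
print(f"compounded: {compounded_drop:.1%}")                  # compounded: 40.1%
print(f"largest single share: {largest_single_share:.0%}")   # largest single share: 10%
```

Under either model the total regression is large, yet no one factor accounts for more than a tenth of it. Following any single "why?" chain to its root would fix at most that tenth.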

Common Failure Modes

Stopping at Blame: "Why did the site go down? Because Sam deployed bad code." The analysis stops there. The 5 Whys discipline is specifically to keep going past the person-blame answer to the systemic one.
Stopping at First Plausible: The first explanation that sounds right is accepted as final. Real 5 Whys keeps asking "why?" even when the current answer sounds complete.
Leading the Witness: Asking "why?" with a predetermined answer in mind. Participants feel interrogated and the conversation closes down.
Wishful Why: Leaving out uncomfortable causes (management, culture, incentives) because they are hard to fix. This is arguably the most common failure in organizational 5 Whys.
Counting as a Ritual: Mechanically asking exactly five "why"s when three were enough or seven were needed. The number is a heuristic, not a rule.
Single-Path Bias: Treating one causal chain as complete when the incident had several independent contributing factors.
Post-Mortem as Performance: Running 5 Whys with known answers and a predetermined conclusion. Everyone recognizes the theater.

Facilitation Tips

1. Write the problem on a whiteboard or doc. Everyone can see it.
2. Ask "why?" openly; let the group answer.
3. Write each answer literally, then ask "why?" about the answer.
4. If the group gives up at "we just need to be more careful," that
   is a stop sign, not a conclusion. Push once more.
5. When you reach a systemic cause, pause and ask: "Is this fixable?
   At what level?" Write fixes at local, systemic, and monitoring
   levels.
6. Beware the blame attractor. If an answer is "because Person X
   did Y," that is usually a rung too early. Keep going.
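
Tips 4 and 6 describe two answers that should prompt another "why?". As a rough aid for a note-taker, the Python sketch below flags them with simple pattern matching. The `flag_answer` function, the patterns, and the verb list are all hypothetical and deliberately naive; they will miss many phrasings and are no substitute for the facilitator's judgment.

```python
import re
from typing import List

# Hypothetical patterns; adjust them to your team's vocabulary.
STOP_SIGN = re.compile(r"\bmore careful\b", re.IGNORECASE)
# "because", then a capitalized name, then a verb that describes a personal action.
BLAME = re.compile(r"^\s*(?i:because)\s+[A-Z][a-z]+\s+(?:did|broke|forgot|deployed)\b")


def flag_answer(answer: str) -> List[str]:
    """Return facilitation prompts for an answer that looks like a premature stop."""
    flags = []
    if STOP_SIGN.search(answer):
        flags.append("stop sign: push once more")
    if BLAME.search(answer):
        flags.append("blame attractor: keep going")
    return flags


print(flag_answer("We just need to be more careful."))
print(flag_answer("Because Sam deployed bad code."))
print(flag_answer("Because the retry handler had no contract tests."))
```

The first two answers are flagged and the third is not, since it names a gap in the system and not a person or a vague resolution.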

Related Techniques

Ishikawa / Fishbone Diagram: Broader cause-categorization tool (People, Process, Tools, Environment, Materials, Measurement). Use when there are multiple causes and not a single chain.
Fault Tree Analysis: Formal engineering technique for reliability analysis, commonly used in aerospace and nuclear; 5 Whys is its informal cousin.
Bow-Tie Analysis: Causes on one side, consequences on the other, the event in the middle. Used in safety-critical industries.
SRE Incident Review: Google's post-incident review practice; uses 5 Whys among other tools.
After-Action Review (AAR): US Army post-event learning practice: what was supposed to happen, what actually happened, what did we learn, what do we do differently. 5 Whys fits inside this.
Blameless Post-Mortem (Etsy): Cultural frame in which 5 Whys is conducted; blamelessness is a precondition for 5 Whys working.