Systematic Diagnosis

When something breaks, the natural instinct is to start guessing. Change a setting, restart something, try random fixes until it works. But guessing is slow and unreliable. Systematic diagnosis is how you find the actual root cause — not just the symptom — and fix it once instead of fighting the same problem repeatedly.

The Everyday Version

The Car That Will Not Start

Your car does not start. You could randomly replace parts and hope for the best. Or you could think systematically.

Symptoms: Turn the key, nothing happens.

Step 1: What are the possible causes?
- Dead battery
- Bad starter motor
- No fuel
- Faulty ignition switch
- Loose or corroded battery connections

Step 2: Start with the most likely and easiest to check
- Battery: Do the dashboard lights come on?
  → Yes → Battery probably has some charge
  → No → Battery is likely dead or disconnected

- If lights come on: Does the engine crank (try to turn over)?
  → Yes but won't start → Fuel or ignition problem
  → No → Starter motor or connection issue

- Check fuel gauge: Is there fuel?
  → Empty → That is your answer
  → Has fuel → Move to next possibility

Step 3: Follow the chain
Each check eliminates possibilities and narrows
the search. You do not randomly replace the starter
when the fuel tank is empty.

The key principle: do not replace parts randomly. Diagnose first, then fix.

The Leaking Roof

Symptom: Water stain on the ceiling.

Wrong approach:
- Immediately patch the stain on the ceiling
- Water keeps coming → patch again
- Repeat forever

Systematic approach:
1. Where is the water coming from?
   - Check the ceiling stain location
   - Go to the attic above that spot
   - Follow the water trail (water travels before dripping)
   - Find the actual entry point on the roof

2. What is causing the entry?
   - Missing shingle?
   - Cracked flashing around the chimney?
   - Clogged gutter causing water backup?

3. Fix the root cause
   - Replace the shingle, fix the flashing, or clear the gutter
   - THEN repair the ceiling stain

Fixing the symptom (the stain) without fixing the cause
(the roof) means the problem always comes back.

The Slow Internet

Symptom: Web pages load slowly.

Random approach:
- Restart the router
- Call the ISP
- Buy a new router
- Blame the weather

Systematic approach:
1. Isolate the problem layer
   - Is it one device or all devices?
     → One device: Problem is with that device
     → All devices: Problem is with the network or ISP

2. If one device:
   - Try a different browser → Same problem? Not the browser.
   - Connect via ethernet instead of Wi-Fi → Faster?
     → Yes: Wi-Fi issue on that device
     → No: Something else on the device (run a malware scan,
       check for background downloads)

3. If all devices:
   - Run a speed test from a computer connected directly
     to the modem (bypass the router)
     → Fast: Router is the problem
     → Slow: ISP issue or modem problem
   - Check at different times: Slow only at peak hours?
     → Congestion. ISP issue.

Each step eliminates a category of causes.
Within a few minutes, you know which layer the problem is in.
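
The same decision tree can be written down as a small function. This is only a sketch: the function name, parameter names, and returned labels are made up for illustration, and each answer comes from a manual check you run yourself.

```python
def locate_slow_internet(all_devices_slow, faster_on_ethernet=False,
                         fast_at_modem=False, slow_only_at_peak=False):
    """Walk the slow-internet checks in order and name the suspect layer."""
    if not all_devices_slow:
        # Only one device is affected, so the network itself is fine.
        if faster_on_ethernet:
            return "Wi-Fi on that device"
        return "something else on that device"
    # Every device is slow: the router, the modem, or the ISP.
    if fast_at_modem:
        return "router"
    if slow_only_at_peak:
        return "ISP congestion"
    return "ISP or modem"


print(locate_slow_internet(all_devices_slow=False, faster_on_ethernet=True))
print(locate_slow_internet(all_devices_slow=True, fast_at_modem=True))
print(locate_slow_internet(all_devices_slow=True, slow_only_at_peak=True))
```

Each argument answers one question from the tree, so the function never needs more answers than the checks you have already run.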

The Diagnostic Framework

Step 1: Observe and describe the symptom precisely
- "It is slow" is vague
- "The login page takes 12 seconds to load, but other
  pages load in 1 second" is actionable

Step 2: Reproduce the problem
- Can you make it happen again?
- Does it happen every time or intermittently?
- What are the exact conditions?

Step 3: Form a hypothesis
- Based on the symptoms, what is the most likely cause?

Step 4: Test the hypothesis
- Change one thing at a time
- Did the symptom change?

Step 5: If the hypothesis is wrong, eliminate it and try the next
- Each failed test narrows the possibilities

Step 6: Find the root cause, not just the trigger
- "The server ran out of memory and crashed" is the trigger
- "A query returned 10 million rows instead of 100, and the
  server tried to hold them all in memory" is the root cause
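
Steps 3 to 5 amount to a loop: test one hypothesis at a time and keep a record of what you ruled out. Below is a minimal sketch of that loop. The `Hypothesis` class, the `diagnose` function, and the measurements in `observations` are all invented for illustration; in practice each check would be a real measurement.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hypothesis:
    """One candidate cause plus a check that confirms it or rules it out."""
    name: str
    check: Callable[[], bool]  # returns True when this cause is confirmed


def diagnose(hypotheses: list[Hypothesis]) -> tuple[Optional[str], list[str]]:
    """Test hypotheses in the order given (most likely first).

    Returns the first confirmed cause (or None) and the causes
    that were eliminated along the way.
    """
    eliminated: list[str] = []
    for hypothesis in hypotheses:
        if hypothesis.check():
            return hypothesis.name, eliminated
        eliminated.append(hypothesis.name)
    return None, eliminated


# Hypothetical measurements for the slow login page from Step 1.
observations = {"cpu_percent": 35, "cache_hit_rate": 0.97, "db_query_seconds": 11.4}

hypotheses = [
    Hypothesis("server CPU saturated", lambda: observations["cpu_percent"] > 90),
    Hypothesis("cache misses", lambda: observations["cache_hit_rate"] < 0.5),
    Hypothesis("slow database query", lambda: observations["db_query_seconds"] > 1.0),
]

cause, ruled_out = diagnose(hypotheses)
print("Confirmed:", cause)
print("Ruled out:", ruled_out)
```

The confirmed cause here is still only the trigger. Step 6 means asking why the query is slow before calling the diagnosis done.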

Connecting to Technology

Reading Stack Traces

When software crashes, it usually tells you where. A stack trace is the trail of function calls that led to the error.

Error: NullPointerException
  at UserService.getProfile(UserService.java:42)
  at ProfileController.show(ProfileController.java:18)
  at RequestHandler.handle(RequestHandler.java:95)

Reading bottom to top:
1. A request came in (RequestHandler, line 95)
2. It was routed to the profile page (ProfileController, line 18)
3. The profile service tried to get user data (UserService, line 42)
4. Something was null that should not have been

The stack trace tells you:
- WHERE the crash happened (UserService.java, line 42)
- HOW the code got there (the call chain)
- WHAT went wrong (NullPointerException)

It does not tell you WHY — that requires understanding
what data was null and why it was missing.
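
The same WHERE / HOW / WHAT reading works in other languages. Here is a small Python sketch using the standard `traceback` module; the three functions are hypothetical stand-ins for the handler, controller, and service in the trace above. One difference to keep in mind: Python prints tracebacks with the most recent call last, so the crash site is at the bottom of the printed trace rather than the top.

```python
import traceback


def get_profile(user):
    # Fails when user is None, roughly Python's version of a null reference.
    return user["name"]


def show_profile(user):
    return get_profile(user)


def handle_request(user):
    return show_profile(user)


try:
    handle_request(None)
except TypeError as error:
    frames = traceback.extract_tb(error.__traceback__)
    # Frames run from the outermost call down to the point of failure.
    call_chain = [frame.name for frame in frames]
    crash_site = frames[-1]
    what_went_wrong = type(error).__name__
    print("WHAT: ", what_went_wrong, "-", error)
    print("HOW:  ", " -> ".join(call_chain))
    print("WHERE:", crash_site.name, "line", crash_site.lineno)
```

As in the Java example, the trace shows where `user` was `None` but not why. That answer lives further up the call chain, in whatever passed `None` in.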

Using Logs

Logs are the breadcrumb trail your software leaves behind.

Good logging practice:

[2026-04-18 10:15:01] INFO  User 12345 logged in
[2026-04-18 10:15:02] INFO  User 12345 requested /profile
[2026-04-18 10:15:02] DEBUG Querying database for user 12345
[2026-04-18 10:15:05] WARN  Database query took 3 seconds (threshold: 1s)
[2026-04-18 10:15:05] INFO  Profile loaded, rendering page
[2026-04-18 10:15:06] INFO  Page rendered in 4.2 seconds

Reading this log tells a story:
- The user logged in fine
- They requested their profile
- The database query was slow (3 seconds, warning threshold is 1)
- Total page load: 4.2 seconds
- Where the time went: the database query (why it was slow
  is the next question)

Without logs:
"The page is slow" → Where? No idea. Check everything.

With logs:
"The page is slow" → The database query took 3 seconds
against a 1-second threshold → Investigate that specific query.
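
When a log is too long to read by eye, the same story can be pulled out with a short script. The sketch below parses the sample log above and reports the largest gap between consecutive entries. The `parse_line` helper is an assumption that only fits this exact `[timestamp] LEVEL message` layout; real logs will need their own parsing.

```python
from datetime import datetime

LOG = """\
[2026-04-18 10:15:01] INFO  User 12345 logged in
[2026-04-18 10:15:02] INFO  User 12345 requested /profile
[2026-04-18 10:15:02] DEBUG Querying database for user 12345
[2026-04-18 10:15:05] WARN  Database query took 3 seconds (threshold: 1s)
[2026-04-18 10:15:05] INFO  Profile loaded, rendering page
[2026-04-18 10:15:06] INFO  Page rendered in 4.2 seconds
"""


def parse_line(line):
    """Split '[timestamp] LEVEL message' into its three parts."""
    timestamp_text, rest = line[1:].split("] ", 1)
    level, message = rest.split(None, 1)
    timestamp = datetime.strptime(timestamp_text, "%Y-%m-%d %H:%M:%S")
    return timestamp, level, message


entries = [parse_line(line) for line in LOG.splitlines() if line.strip()]

# Measure the time between each entry and the next one. The step that
# was logged just before the largest gap is where the time went.
gaps = [
    ((later[0] - earlier[0]).total_seconds(), earlier[2])
    for earlier, later in zip(entries, entries[1:])
]
longest_gap, slow_step = max(gaps)
warnings = [message for _, level, message in entries if level in ("WARN", "ERROR")]

print(f"Longest gap: {longest_gap:.0f}s after '{slow_step}'")
print("Warnings:", warnings)
```

Both signals point at the same place: the gap follows the "Querying database" entry, and the only warning is about the query.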

Reproducing Bugs

A bug you cannot reproduce is a bug you cannot fix with confidence.

Steps to reproduce:
1. What exactly did the user do?
   - "They clicked the button" is not enough
   - "They clicked Submit on the order form with an empty
     cart while logged in as an admin user" is useful

2. What was the state of the system?
   - What data was in the database?
   - What time of day? (Timezone bugs are common)
   - What version of the software?

3. Can you do the exact same thing and get the same error?
   - Yes → You have a reproducible bug. Now you can test fixes.
   - No → The bug depends on something you have not identified yet
     (timing, data, load, specific user state)

4. If it is intermittent:
   - What is different between when it happens and when it does not?
   - Does it happen more under load?
   - Does it happen with specific data?
   - Does it happen at specific times?
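
One way to make a reproduction concrete is to write the exact conditions down as data and replay them. This is a sketch under invented assumptions: `ReproCase`, `average_item_price`, and the planted bug (admins skip the empty-cart check) are all hypothetical, chosen to match the "empty cart while logged in as an admin" example above.

```python
from dataclasses import dataclass, field


@dataclass
class ReproCase:
    """The exact conditions from a bug report, written down so they can be replayed."""
    action: str
    user_role: str
    cart_items: list = field(default_factory=list)
    app_version: str = "unknown"


def average_item_price(user_role, cart_items):
    """Hypothetical handler with a planted bug: admins skip the empty-cart check."""
    if user_role != "admin" and not cart_items:
        raise ValueError("cart is empty")
    # Raises ZeroDivisionError when an admin submits an empty cart.
    return sum(cart_items) / len(cart_items)


def count_failures(case, attempts=20):
    """Replay the case several times and count how often the reported error occurs."""
    failures = 0
    for _ in range(attempts):
        try:
            average_item_price(case.user_role, case.cart_items)
        except ZeroDivisionError:
            failures += 1
    return failures


def classify(failures, attempts):
    if failures == attempts:
        return "reproducible"
    if failures == 0:
        return "not reproduced"
    return "intermittent"


reported = ReproCase("Submit order form", "admin", [], "1.4.2")
control = ReproCase("Submit order form", "admin", [19.99, 5.00], "1.4.2")

for case in (reported, control):
    print(case.cart_items, "->", classify(count_failures(case), 20))
```

The control case differs from the reported one in a single field, so the comparison answers the question from step 4: what is different between when it happens and when it does not?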

Isolating Components

When a system has many parts, isolate which part is misbehaving.

Web application not working:

Is it the browser?
→ Try a different browser, try curl from command line

Is it the network?
→ Can you ping the server? Can you reach other sites?

Is it the web server?
→ Check if the server process is running
→ Check the server's access logs

Is it the application code?
→ Check application logs for errors
→ Does a simpler endpoint work?

Is it the database?
→ Can you query the database directly?
→ Is the database responding slowly?

Is it an external service?
→ Does the app depend on a third-party API?
→ Is that API responding?

Systematic isolation:
  Browser → Network → Web Server → App Code → Database → External API

Test each layer. The first one that fails is where
your problem lives.
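
The layer-by-layer walk can be automated as an ordered list of health checks. In this sketch the checks are placeholder lambdas with hard-coded results. In a real system each would wrap an actual probe (an HTTP request, a process check, a test query), and the function name `first_failing_layer` is an assumption for illustration.

```python
from typing import Callable, Optional


def first_failing_layer(checks: list[tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Run the checks in order and return the first layer that is unhealthy."""
    for layer, is_healthy in checks:
        try:
            healthy = is_healthy()
        except Exception:
            # A probe that raises counts as a failed layer.
            healthy = False
        if not healthy:
            return layer
    return None


# Placeholder results; the order matches the chain above.
checks = [
    ("Browser", lambda: True),
    ("Network", lambda: True),
    ("Web Server", lambda: True),
    ("App Code", lambda: True),
    ("Database", lambda: False),
    ("External API", lambda: True),
]

suspect = first_failing_layer(checks)
print("First failing layer:", suspect)
```

Stopping at the first failure is deliberate: layers further down the chain may only look broken because of the one before them.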

Root Cause vs Symptom

Symptom: The application is slow on Mondays.

Quick fix (treats symptom):
- Restart the servers every Monday morning
- Works temporarily, problem returns next week

Root cause investigation:
- Why Mondays? What is different about Monday?
- Monday morning: Automated weekly reports start at 9 AM
- Reports query the entire transaction history
- These queries lock the database for 20 minutes
- All other queries queue up behind the lock
- Users arriving at 9 AM find a sluggish system

Root cause fix:
- Move reports to a read replica database
- Or run reports during off-hours on Sunday
- Or optimize the report queries to not lock tables

The symptom was "slow on Mondays."
The root cause was "unoptimized report queries locking
the production database during business hours."

The Five Whys

A technique for drilling to the root cause by asking "why" repeatedly.

Problem: The website went down.

Why? The server ran out of memory.
Why? A process was consuming 32 GB of RAM.
Why? It was loading an entire dataset into memory.
Why? The query had no LIMIT clause.
Why? The developer assumed the table had only a few hundred rows,
     but it had grown to 50 million rows.

Root cause: Missing LIMIT clause combined with
unexpected data growth.

Fix: Add the LIMIT clause AND add monitoring
for table sizes so growth does not surprise you again.

Notice: The first "why" (out of memory) would lead
to "add more memory" — which only delays the problem.
The fifth "why" reveals the real fix.
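
Both halves of that fix can be sketched with Python's built-in `sqlite3` module. The `events` table, the page size, and the alert threshold are invented for illustration, and a production system would use its own database and monitoring tools.

```python
import sqlite3

PAGE_SIZE = 100
ROW_COUNT_ALERT = 1000  # hypothetical threshold for "bigger than we assumed"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [(f"event-{i}",) for i in range(5000)],
)


def fetch_page(connection, after_id=0, page_size=PAGE_SIZE):
    """Bounded query: memory use no longer grows with the table."""
    return connection.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()


def check_table_size(connection, threshold=ROW_COUNT_ALERT):
    """Return (alert, row_count) so that growth is noticed before it hurts."""
    (row_count,) = connection.execute("SELECT COUNT(*) FROM events").fetchone()
    return row_count > threshold, row_count


page = fetch_page(conn)
alert, row_count = check_table_size(conn)
print(f"Fetched {len(page)} rows; table has {row_count} rows; alert={alert}")
```

The `LIMIT` caps how much any one query can load, and the size check challenges the "only a few hundred rows" assumption before it causes another outage.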

Common Pitfalls

  • Guessing instead of diagnosing. Randomly changing things and hoping the problem goes away wastes time and can introduce new problems.
  • Fixing the symptom, not the cause. Restarting the server every day instead of finding out why it crashes. The problem always comes back.
  • Changing multiple things at once. If you change three settings and the problem goes away, you do not know which change fixed it. Change one thing at a time.
  • Not reproducing the bug first. If you cannot reproduce it, you cannot verify your fix actually works. You might ship a "fix" that fixes nothing.
  • Ignoring intermittent problems. "It only happens sometimes" does not mean it is not important. Intermittent bugs often have the most serious root causes (race conditions, resource leaks).
  • Stopping at the first explanation. The first thing that looks wrong might not be the root cause. Keep asking "why" until you reach something you can fix permanently.

Key Takeaways

  • Systematic diagnosis means following evidence to the root cause instead of guessing at fixes.
  • Always observe, reproduce, hypothesize, test, and eliminate — in that order.
  • Symptoms and root causes are different things. Fixing symptoms feels productive but leaves the real problem untouched.
  • Isolate components to narrow down where the problem lives. Test each layer of the system independently.
  • Logs, stack traces, and reproduction steps are your primary diagnostic tools. Without them, you are guessing.
  • The Five Whys technique helps you drill past surface explanations to find the actual root cause.