Hypothesis Testing

Hypothesis testing is the scientific method applied to troubleshooting. Instead of guessing randomly or trying everything, you form a specific prediction about what is wrong, design a test that could support or refute it, and let the evidence guide your next step. It is how scientists work, and it is how the best troubleshooters work too.

The Everyday Version

The Slow Wi-Fi and the Microwave

You notice the Wi-Fi gets slow every evening around dinner time.

Observation: Wi-Fi is slow between 6 PM and 7 PM daily.

Hypothesis 1: "The microwave is interfering with the Wi-Fi."
- Microwave ovens operate at about 2.45 GHz
- Many Wi-Fi routers also use the 2.4 GHz band
- Dinner time = microwave usage time

Test: Use the microwave and check Wi-Fi speed.
Then stop the microwave and check again.

Results:
- Microwave on: 5 Mbps
- Microwave off: 50 Mbps
- Microwave on again: 5 Mbps

Conclusion: Hypothesis confirmed. The microwave
interferes with 2.4 GHz Wi-Fi.

Fix: Switch the router to the 5 GHz band, or move the
router away from the kitchen.

What makes this effective is not the answer — it is the process. You did not assume the microwave was the problem. You formed a hypothesis and tested it.

The Plant That Is Dying

Your houseplant is wilting. Several possible causes exist.

Observation: Leaves are yellowing and drooping.

Hypothesis 1: "It needs more water."
Test: Water it thoroughly and check in 2 days.
Result: No improvement. → Hypothesis rejected.

Hypothesis 2: "It is getting too much sun."
Test: Move it to a shadier spot for a week.
Result: No improvement. → Hypothesis rejected.

Hypothesis 3: "The soil has no nutrients left."
Test: Add fertilizer and check in a week.
Result: New growth appears, leaves start recovering.
→ Hypothesis confirmed.

Key principle: Each failed test eliminates a possibility
and moves you closer to the answer.

The Car That Pulls to the Right

Observation: Car drifts right when you let go of the steering wheel.

Hypothesis 1: "Tire pressure is uneven."
Test: Check all four tires.
Result: Right front is 28 PSI, left front is 32 PSI.
All others are 32 PSI.
Fix: Inflate right front to 32 PSI.
Result: Car still pulls right.
→ Low pressure contributed, but was not the root cause.

Hypothesis 2: "The alignment is off."
Test: Take it to a mechanic for an alignment check.
Result: The front wheels are misaligned.
Fix: Realign the wheels.
Result: Car drives straight.
→ Hypothesis confirmed.

Note: The tire pressure was a real issue, but not THE issue.
Without systematic testing, you might have stopped at
"fixed the tire pressure" and lived with the pulling.

The Scientific Method for Troubleshooting

1. Observe: What exactly is happening?
   - Be specific and measurable
   - "The page loads in 8 seconds" not "the page is slow"

2. Hypothesize: What could cause this?
   - Based on your observations and knowledge
   - Rank by likelihood and ease of testing

3. Predict: If my hypothesis is correct, what should I see?
   - "If the database is slow, then queries should take
     longer than 100ms in the logs"
   - A good hypothesis makes testable predictions

4. Test: Run the experiment
   - Change only one variable at a time
   - Record the results precisely

5. Analyze: Did the results match the prediction?
   - Yes → Hypothesis is supported (not proven — more
     evidence is always useful)
   - No → Hypothesis is rejected. Form a new one.

6. Repeat: Continue until you find the root cause
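
The six steps above can be sketched as a small loop. This is only an illustration, not a prescribed tool: the `Hypothesis` class, the `investigate` function, and the likelihood and cost numbers are all hypothetical, and the test functions simply return the outcomes from the houseplant example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hypothesis:
    statement: str            # what you think is wrong
    prediction: str           # what you expect to see if you are right
    likelihood: float         # rough guess between 0 and 1 (an assumption)
    cost: float               # rough effort to test, here in days
    test: Callable[[], bool]  # returns True if the prediction held


def investigate(hypotheses: list[Hypothesis]) -> Optional[Hypothesis]:
    """Test hypotheses one at a time, likeliest and cheapest first."""
    ranked = sorted(hypotheses, key=lambda h: h.likelihood / h.cost, reverse=True)
    for h in ranked:
        held = h.test()
        print(f"{h.statement}: {'supported' if held else 'rejected'}")
        if held:
            return h
    return None  # nothing supported: go back to step 1 and observe again


# Outcomes hard-coded from the houseplant example; real tests would measure.
plant = [
    Hypothesis("soil has no nutrients left", "new growth within a week", 0.2, 7, lambda: True),
    Hypothesis("needs more water", "recovers within 2 days", 0.5, 2, lambda: False),
    Hypothesis("too much sun", "recovers in shade within a week", 0.3, 7, lambda: False),
]
supported = investigate(plant)
```

Ranking by likelihood divided by cost is one simple way to combine "likelihood and ease of testing"; any ordering that reflects both would serve the same purpose.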

Connecting to Technology

A/B Testing

A/B testing is hypothesis testing at scale, applied to product decisions.

Hypothesis: "Changing the signup button from blue to green
will increase the signup rate by at least 5% (relative)."

Prediction: If correct, the green button group will have
a signup rate at least 5% higher than the blue button group
(for example, 12.6% against a 12.0% baseline).

Experiment setup:
- Group A (control): See the blue button
- Group B (test): See the green button
- Sample size: 10,000 users per group
- Duration: 2 weeks
- Metric: Signup completion rate

Results:
- Group A (blue): 12.1% signup rate
- Group B (green): 12.3% signup rate
- Difference: 0.2 percentage points (about 1.7% relative, not 5%)
- Statistical significance: No (p ≈ 0.67, two-sided two-proportion z-test)

Conclusion: Hypothesis rejected. The test found no evidence
that button color meaningfully affects signups. Look elsewhere
for improvement.

Without A/B testing:
Someone would have changed the button to green,
noticed signups fluctuate naturally, and claimed
the color change "worked." That is not evidence.
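
The significance check in this example can be reproduced with a short calculation. The sketch below assumes a two-sided two-proportion z-test with a pooled standard error, which is one common choice; the exact p-value depends on the test used. The function name is hypothetical, and the counts are derived from the rates and group sizes stated above.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# 12.1% and 12.3% of 10,000 users per group.
z, p = two_proportion_z_test(1210, 10_000, 1230, 10_000)
relative_lift = (0.123 - 0.121) / 0.121  # lift as a fraction of the baseline

print(f"z = {z:.2f}, p = {p:.2f}, relative lift = {relative_lift:.1%}")
```

A p-value far above the usual 0.05 threshold means a gap this small is well within what random variation between two identical groups would produce.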

Logging Before and After Changes

When you deploy a change, logging lets you test whether your hypothesis about the change's effect is correct.

Hypothesis: "The new caching layer will reduce
average response time from 500ms to 100ms."

Before the change:
- Log average response time for one week
- Baseline: 480ms average, 95th percentile at 1200ms

Deploy the change.

After the change:
- Log average response time for one week
- New: 120ms average, 95th percentile at 350ms

Analysis:
- Average improved from 480ms to 120ms (4x improvement)
- 95th percentile improved from 1200ms to 350ms
- Hypothesis was close: predicted 100ms, got 120ms

What if you had not logged the "before"?
- You would have no baseline to compare against
- You could not prove the change helped
- If response time was 120ms, you would not know
  if that was an improvement or the same as before
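
A before-and-after comparison like this one needs only the raw response times from each period. The following is a minimal sketch, assuming the logs have already been parsed into lists of milliseconds; `summarize`, `compare`, and the sample values are hypothetical.

```python
from statistics import mean, quantiles


def summarize(latencies_ms: list[float]) -> dict[str, float]:
    """Average and 95th percentile of a list of response times."""
    p95 = quantiles(latencies_ms, n=100)[94]  # 95th of the 99 percentile cut points
    return {"avg_ms": mean(latencies_ms), "p95_ms": p95}


def compare(before: list[float], after: list[float]) -> dict[str, float]:
    """How many times faster the 'after' period is on each metric."""
    b, a = summarize(before), summarize(after)
    return {
        "avg_speedup": b["avg_ms"] / a["avg_ms"],
        "p95_speedup": b["p95_ms"] / a["p95_ms"],
    }


# Made-up samples standing in for a week of logged response times.
before_ms = [float(x) for x in range(400, 600, 10)]
after_ms = [x / 4 for x in before_ms]

print(summarize(before_ms))
print(compare(before_ms, after_ms))
```

Reporting the 95th percentile alongside the average matters because a change can improve the typical request while leaving the slowest requests untouched.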

Controlled Experiments in Production

Scenario: Users report that search results are less
relevant since last week's update.

Hypothesis: "The relevance scoring change in commit
abc123 degraded search quality."

Controlled experiment:
1. Route 10% of search traffic to the old algorithm
2. Route 10% of search traffic to the new algorithm
3. Leave the remaining 80% on the current default (the new
   algorithm), outside the comparison
4. Measure click-through rates on search results

Metrics after 3 days:
- Old algorithm: 34% click-through
- New algorithm: 28% click-through
- Click-through is 6 percentage points lower with the
  new algorithm (roughly 18% fewer clicks)

Conclusion: Hypothesis confirmed. The relevance scoring
change reduced search quality. Roll back the change
and investigate further.

Without a controlled experiment:
- "Let's just roll it back and see"
- But other changes happened too
- Was it the scoring change, the UI change,
  or normal traffic variation?
- You cannot tell without isolating the variable
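
One way to split traffic for an experiment like this is to hash each user into a stable bucket, so the same user always sees the same algorithm. The sketch below is an assumption about how such routing could work, not a description of any particular system; the function name, experiment label, and arm names are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "search-relevance") -> str:
    """Deterministically map a user to an experiment arm."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100   # stable value in 0..99
    if bucket < 10:
        return "old_algorithm"       # 10% measured on the old scoring
    if bucket < 20:
        return "new_algorithm"       # 10% measured on the new scoring
    return "default"                 # remaining 80%, outside the comparison


# Click-through rates from the example, compared two ways.
old_ctr, new_ctr = 0.34, 0.28
point_drop = (old_ctr - new_ctr) * 100          # percentage points
relative_drop = (old_ctr - new_ctr) / old_ctr   # fraction of the old rate

print(assign_variant("user-123"))
print(f"{point_drop:.0f} points, {relative_drop:.0%} relative drop")
```

Including the experiment name in the hash keeps bucket assignments independent across experiments, so users in one test arm are not systematically the same users in another.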

Performance Profiling

Hypothesis: "The API is slow because of the database queries."

Test with a profiler:

API request breakdown:
- Authentication check: 5ms
- Input validation: 2ms
- Database query 1: 8ms
- Database query 2: 450ms  ← suspicious
- Business logic: 12ms
- Response formatting: 3ms
- Total: 480ms

Result: Database query 2 takes 450ms out of 480ms total.
Hypothesis is partially confirmed — it is not "database
queries" in general, it is one specific query.
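
A breakdown like the one above can be collected with a few lines of timing code when no profiler is at hand. This is a rough sketch: the `timed` helper and stage names are hypothetical, and `time.sleep` stands in for real work.

```python
import time
from contextlib import contextmanager

timings_ms: dict[str, float] = {}


@contextmanager
def timed(stage: str):
    """Record the wall-clock duration of a block under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[stage] = (time.perf_counter() - start) * 1000


def handle_request() -> None:
    # Placeholder stages; each sleep simulates the work of that stage.
    with timed("input_validation"):
        time.sleep(0.002)
    with timed("database_query"):
        time.sleep(0.05)
    with timed("response_formatting"):
        time.sleep(0.003)


handle_request()
slowest = max(timings_ms, key=timings_ms.get)

# Print stages from slowest to fastest.
for stage, ms in sorted(timings_ms.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stage:<22}{ms:8.1f} ms")
```

Sorting by duration puts the stage worth investigating at the top, which is what turns a general hypothesis into a specific one.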

Next hypothesis: "Query 2 is slow because it scans
the full table instead of using an index."

Test: Run the query with EXPLAIN to see the execution plan.
Result: Full table scan on a 5 million row table.
Confirmed.

Fix: Add an index on the relevant column.
Result: Query 2 drops from 450ms to 3ms.
Total API time: 33ms.

Each hypothesis was specific and testable.
Each test pointed clearly to the next step.

Canary Deployments

A canary deployment is a hypothesis test for new software versions.

Hypothesis: "The new version of the service handles
production traffic correctly."

Experiment:
1. Deploy the new version to 1% of servers (the "canary")
2. Keep 99% of servers on the old version
3. Monitor error rates, latency, and resource usage
4. Compare canary servers to the rest

Possible outcomes:

Canary looks healthy (after 1 hour):
- Error rates match the old version
- Latency is the same or better
- No unusual resource consumption
→ Hypothesis supported. Gradually roll out to more servers.

Canary shows problems:
- Error rate spikes from 0.1% to 5%
→ Hypothesis rejected. Roll back the canary immediately.
   Only 1% of traffic was affected.

This is hypothesis testing applied to deployment risk.
Instead of deploying everywhere and hoping, you test
with a small sample first.
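
The comparison between canary and baseline can be written down as explicit pass and fail criteria before the rollout starts. The sketch below is illustrative only: the `Metrics` class, the `canary_verdict` function, and every threshold are assumptions that a real team would tune to its own service.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    baseline: Metrics,
    canary: Metrics,
    max_error_ratio: float = 2.0,    # assumed: canary errors at most 2x baseline
    max_latency_ratio: float = 1.2,  # assumed: canary at most 20% slower
    min_requests: int = 1000,        # assumed: minimum traffic before judging
) -> str:
    """Return 'wait', 'rollback', or 'promote' for the canary."""
    if canary.requests < min_requests:
        return "wait"  # not enough evidence yet
    if canary.error_rate > baseline.error_rate * max_error_ratio:
        return "rollback"
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"


baseline = Metrics(requests=99_000, errors=99, p95_latency_ms=350.0)  # 0.1% errors
healthy = Metrics(requests=1_000, errors=1, p95_latency_ms=340.0)     # 0.1% errors
failing = Metrics(requests=1_000, errors=50, p95_latency_ms=340.0)    # 5% errors

print(canary_verdict(baseline, healthy))
print(canary_verdict(baseline, failing))
```

Fixing the thresholds in advance keeps the decision from drifting after the results are in, and the minimum request count guards against judging the canary on too little traffic.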

Designing Good Tests

A good hypothesis test has these properties:

1. One variable at a time
   - If you change the database AND the caching AND the code,
     and performance improves, what fixed it?
   - Change one thing, measure, then change the next

2. Clear success and failure criteria
   - Before the test: "If the hypothesis is correct,
     I expect to see X"
   - After the test: "I saw Y, which does/does not match"
   - No moving the goalposts after seeing the results

3. Measurable outcomes
   - "It feels faster" is not a test result
   - "Response time dropped from 480ms to 120ms" is

4. Sufficient sample size
   - Testing with one user does not tell you much
   - Testing with thousands of users gives you far more
     confidence; the smaller the effect you want to detect,
     the more users you need
   - Small samples can produce misleading results

5. A control group
   - Compare against something unchanged
   - Without a control, you cannot distinguish your change
     from natural variation
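
How large a sample counts as "sufficient" can be estimated before the test runs. The sketch below uses a standard normal-approximation planning formula for comparing two proportions; the function name is hypothetical, and the significance level (5%, two-sided) and power (80%) are conventional defaults assumed here rather than values from the text.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(
    baseline_rate: float,
    expected_rate: float,
    alpha: float = 0.05,  # two-sided false-positive rate (assumed)
    power: float = 0.80,  # chance of detecting a real effect (assumed)
) -> int:
    """Approximate users needed in each group to detect the given difference."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = baseline_rate * (1 - baseline_rate) + expected_rate * (1 - expected_rate)
    effect = expected_rate - baseline_rate
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# A 12.1% baseline, with "5%" read two different ways.
n_relative = sample_size_per_group(0.121, 0.121 * 1.05)  # 5% relative lift
n_points = sample_size_per_group(0.121, 0.171)           # 5 percentage points

print(n_relative, n_points)
```

Under these assumptions, a 5% relative lift on a 12.1% baseline needs tens of thousands of users per group, while a 5-point lift needs fewer than a thousand, which is why the success criterion should state which one is meant.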

Common Pitfalls

- Confirmation bias. Looking for evidence that supports your hypothesis while ignoring evidence that contradicts it. Actively try to disprove your hypothesis, not prove it.
- Testing multiple changes at once. If you changed three things and the problem went away, you do not know which change fixed it. You may have introduced two new bugs while fixing one.
- No baseline measurement. Without a "before" number, you cannot evaluate your "after" number. Always measure the current state before making changes.
- Declaring success too early. The problem disappeared for a day and you closed the ticket. But it was a timing-dependent issue and it will return the next Monday morning. Monitor for long enough to be confident.
- Ignoring negative results. A rejected hypothesis is valuable information. It eliminates a possibility and narrows your search. Do not view it as wasted time.
- Vague hypotheses. "Something is wrong with the server" is not testable. "The server runs out of memory when processing files larger than 100 MB" is testable. Be specific.

Key Takeaways

- Hypothesis testing applies the scientific method to troubleshooting: observe, hypothesize, predict, test, analyze, repeat.
- Every hypothesis must be specific enough to be proven wrong. If it cannot be disproven, it is not useful.
- Change one variable at a time. Multiple changes make it impossible to know what worked.
- A/B testing, canary deployments, and performance profiling are all forms of hypothesis testing applied to software.
- Always measure before and after. Without a baseline, you cannot evaluate your results.
- Rejected hypotheses are progress, not failure. Each elimination narrows the search space and moves you closer to the answer.