Thinking in Distributions

"How long will this take?" is the wrong question. It asks for a point estimate: a single number that pretends certainty exists. The better question is: "What is the range of likely outcomes?" A point estimate of "2 weeks" does not say whether it means "definitely 2 weeks" or "somewhere between 1 week and 3 months, with 2 weeks as the optimistic scenario." Distributions capture uncertainty. Point estimates hide it. Engineers who think in distributions make better plans, set better expectations, and get surprised less often.

Point Estimates Are Always Wrong

Every point estimate is wrong by definition. The question is how wrong.

"This will take 2 weeks"

What the engineer means:
  Best case: 1 week (everything goes perfectly)
  Most likely: 2 weeks (my gut estimate)
  Worst case: 5 weeks (unknown unknowns)
  Expected value: probably 2.5-3 weeks

What the manager hears:
  "It will be done in exactly 2 weeks"

What actually happens:
  3.5 weeks later, everyone is surprised and unhappy.

The problem is not that the engineer estimated poorly. The problem is that a single number cannot represent a range of possible outcomes. Distributions solve this by explicitly representing the range and the shape of uncertainty.

Types of Distributions Engineers Should Know

Normal Distribution (Bell Curve)

Shape: Symmetric around the mean
Use:   Measurement errors, aggregated metrics
Example: API response times under steady load (an approximation; real latency usually has a right tail)

       *
      ***
     *****
    *******
   *********
  ***********
  |    |    |
 p10  p50  p90

Response time: 50ms mean, 10ms standard deviation
  p50: 50ms (half of requests faster, half slower)
  p90: 63ms
  p99: 73ms
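
The percentiles above follow directly from the mean and standard deviation. A minimal sketch using Python's standard-library `statistics.NormalDist`, assuming response times really are normally distributed (measured latency often is not):

```python
from statistics import NormalDist

# Response-time model from the example: mean 50 ms, standard deviation 10 ms.
response_time = NormalDist(mu=50, sigma=10)

# inv_cdf(q) returns the value below which a fraction q of requests fall.
quantiles = {"p50": 0.50, "p90": 0.90, "p99": 0.99}
percentiles = {name: response_time.inv_cdf(q) for name, q in quantiles.items()}

for name, value in percentiles.items():
    print(f"{name}: {value:.0f} ms")
```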

Log-Normal Distribution (Right-Skewed)

Shape: Long tail to the right
Use:   Task durations, incident recovery times, bug fix times
Example: Time to resolve production incidents

  *
  **
  ****
  *******
  ***********
  *****************
  |    |         |
 p10  p50       p99

Incident resolution:
  p50: 30 minutes (most incidents resolve quickly)
  p90: 4 hours (some take much longer)
  p99: 48 hours (rare catastrophic incidents)

The mean (average) is misleading here because the long
tail pulls it up. Median (p50) is more representative.
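
To see how far the tail pulls the mean, here is a rough sketch that fits a log-normal to only the p50 and p90 figures above and derives the mean from the fit. The exact log-normal shape is an assumption; real incident data may have an even heavier tail.

```python
import math
from statistics import NormalDist

# Figures from the example above, in minutes.
median_minutes = 30.0
p90_minutes = 240.0

# For a log-normal, ln(X) is normal: mu = ln(median), and
# ln(p90) = mu + z90 * sigma, where z90 is the standard normal 90th percentile.
z90 = NormalDist().inv_cdf(0.90)
mu = math.log(median_minutes)
sigma = (math.log(p90_minutes) - mu) / z90

# Mean of a log-normal: exp(mu + sigma^2 / 2).
mean_minutes = math.exp(mu + sigma**2 / 2)

print(f"median: {median_minutes:.0f} min, mean: {mean_minutes:.0f} min")
```

Under this fit the mean comes out at a bit under two hours, several times the 30-minute median, even though most incidents resolve well before that.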

Bimodal Distribution

Shape: Two peaks
Use:   Deployment times (either quick or very slow), page load times
Example: Page load time (cached vs uncached)

  *              *
  **            **
  ****        ****
  ******    ******
  |    |    |    |
  cached    uncached

Page load: either 100ms (cache hit) or 2s (cache miss)
  Mean: about 500ms at a ~79% cache hit rate (meaningless, nobody experiences 500ms)
  Reality: Two distinct user experiences
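
A small sketch of why the mean misleads for a bimodal metric. The sample is hypothetical: 100 page loads at a 79% cache hit rate, using the two load times from the example.

```python
from statistics import mean

# Hypothetical sample of 100 page loads: 79 cache hits at 100 ms, 21 misses at 2000 ms.
loads_ms = [100] * 79 + [2000] * 21

average = mean(loads_ms)

# Count how many loads actually landed within 100 ms of the average.
near_average = sum(1 for t in loads_ms if abs(t - average) <= 100)

print(f"mean: {average:.0f} ms, loads near the mean: {near_average}")
```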

Applied to Sprint Planning

Sprint planning with point estimates sets teams up to miss repeatedly. Story points (1, 2, 3, 5, 8) are supposed to represent complexity, but in practice they get treated as point estimates of time. A range with a shape carries more information.

Point estimate approach:
  Task: Build authentication flow
  Estimate: 5 story points
  Team velocity: 30 points/sprint
  Conclusion: This takes 1/6 of the sprint

Distribution approach:
  Task: Build authentication flow
  Best case: 2 days (if we use an existing library)
  Most likely: 5 days (standard implementation)
  Worst case: 12 days (edge cases in SSO integration)
  Shape: Right-skewed (unknowns push it longer)

  Plan for: 6-8 days (roughly p50 to p75; with right skew
            the median sits above the most likely 5 days)
  Buffer for: up to 12 days (have a fallback plan)
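
One way to turn a best / most likely / worst estimate into percentiles is to assume a triangular distribution and invert its CDF. The triangular shape is an assumption chosen for simplicity, and `triangular_percentile` is an illustrative helper, not a standard function.

```python
import math

def triangular_percentile(low: float, mode: float, high: float, q: float) -> float:
    """Inverse CDF of a triangular distribution at quantile q (0 < q < 1)."""
    split = (mode - low) / (high - low)  # CDF value at the mode
    if q <= split:
        return low + math.sqrt(q * (high - low) * (mode - low))
    return high - math.sqrt((1 - q) * (high - low) * (high - mode))

# Authentication-flow estimate from above, in days.
best, likely, worst = 2, 5, 12

p50 = triangular_percentile(best, likely, worst, 0.50)
p75 = triangular_percentile(best, likely, worst, 0.75)
print(f"p50: {p50:.1f} days, p75: {p75:.1f} days")
```

Under that assumption the median lands near 6 days and p75 near 8 days, both above the 5-day most likely value.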

The Cone of Uncertainty

Time horizon vs. estimate accuracy:

  At project start:    Actual could be 0.25x to 4x estimate
  After requirements:  Actual could be 0.5x to 2x estimate
  After design:        Actual could be 0.67x to 1.5x estimate
  After coding starts: Actual could be 0.8x to 1.25x estimate

  The distribution narrows as you learn more.
  Early estimates should have wide ranges.
  Late estimates can have narrow ranges.

  Giving a single number at project start is fiction.
  Giving a range (2-8 weeks) is honest.
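
The multipliers above can be applied mechanically to any single-number estimate. A minimal sketch; the `CONE` table and `estimate_range` helper are illustrative names, and the factors are copied from the list above.

```python
# Multipliers from the list above: (low, high) factors applied to the estimate.
CONE = {
    "project start": (0.25, 4.0),
    "after requirements": (0.5, 2.0),
    "after design": (0.67, 1.5),
    "after coding starts": (0.8, 1.25),
}

def estimate_range(estimate_weeks: float, phase: str) -> tuple[float, float]:
    """Return the (low, high) range implied by the cone at a given phase."""
    low, high = CONE[phase]
    return estimate_weeks * low, estimate_weeks * high

for phase in CONE:
    low, high = estimate_range(4, phase)
    print(f"{phase}: {low:.1f} to {high:.1f} weeks")
```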

Applied to SLA Targets

SLA targets are inherently distributional. An SLA of "99.9% uptime" is a statement about the tail of a distribution, not the average.

Response time SLA:

  Wrong way: "Average response time under 200ms"
  Problem: Average can be 200ms while 2% of users
           experience 5-second response times
           (the other 98% average about 100ms).

  Right way: "p99 response time under 500ms"
  This constrains the tail of the distribution,
  ensuring even the worst 1% of users get
  acceptable performance.

Latency percentiles for a real service:
  p50:  45ms   (typical user experience)
  p90:  120ms  (still acceptable)
  p95:  350ms  (starting to feel slow)
  p99:  1200ms (bad experience for 1 in 100 users)
  p99.9: 8000ms (terrible for 1 in 1000)

  If you only measure the average (100ms), you miss
  the 1% of users having a terrible time.
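
A short sketch of computing percentiles from raw samples and comparing them to the mean. The latency sample is hypothetical, and `percentile` uses the nearest-rank method, which is one of several common definitions.

```python
from statistics import mean

def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100) in integer math
    return ordered[rank - 1]

# Hypothetical latencies (ms): mostly fast, with a slow tail.
latencies_ms = [40] * 880 + [200] * 100 + [3000] * 20

print(f"mean: {mean(latencies_ms):.0f} ms")
for pct in (50, 90, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
```

In this sample the mean is about 115 ms, while the p99 is 3000 ms. The mean alone gives no hint that 2% of requests take 3 seconds.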

Why p99 Matters More Than Average

Service with 1M requests per day:
  Average latency: 100ms (sounds fine)
  p99 latency: 5000ms

  1% of 1M = 10,000 requests per day at 5+ seconds
  That is up to 10,000 frustrated users.
  The average hides them completely.

  Worse: high-value users often trigger the p99 path
  (complex queries, large accounts, heavy usage).
  Your best customers get the worst experience.

Applied to Capacity Planning

Capacity planning based on averages guarantees outages. You must plan for the distribution, especially the tail.

Average daily traffic: 10,000 requests per minute

Distribution of daily peaks:
  p50 peak:  15,000 req/min (normal daily peak)
  p90 peak:  25,000 req/min (busy day)
  p99 peak:  50,000 req/min (viral event, sale, etc.)

Plan for average (10K):
  You go down at peak almost every day.

Plan for p50 peak (15K):
  You go down at peak on about half of all days.

Plan for p90 peak (25K):
  You survive 90% of days. Down on exceptional days.

Plan for p99 peak (50K):
  You survive 99% of days. Occasional strain on extreme days.

Which to choose depends on the cost of downtime vs. the
cost of over-provisioning. This is an expected-value (EV)
calculation applied to a distribution.
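
A rough sketch of that expected-value comparison. Both cost constants are hypothetical, chosen only to make the trade-off visible; the capacities and exceedance fractions come from the peak distribution above.

```python
# Hypothetical costs, for illustration only.
COST_PER_1K_CAPACITY = 20        # daily cost of provisioning 1,000 req/min
COST_PER_OVERLOADED_DAY = 5000   # daily cost when the peak exceeds capacity

# (label, capacity in thousands of req/min, fraction of days whose peak exceeds it)
options = [
    ("p50 peak", 15, 0.50),
    ("p90 peak", 25, 0.10),
    ("p99 peak", 50, 0.01),
]

def expected_daily_cost(capacity_k: float, p_exceed: float) -> float:
    """Provisioning cost plus the probability-weighted cost of an overloaded day."""
    return capacity_k * COST_PER_1K_CAPACITY + p_exceed * COST_PER_OVERLOADED_DAY

costs = {label: expected_daily_cost(cap, p) for label, cap, p in options}
best = min(costs, key=costs.get)

for label, cost in costs.items():
    print(f"{label}: {cost:.0f} per day")
print(f"cheapest in expectation: {best}")
```

With these made-up costs the p90 plan wins narrowly over p99. A higher outage cost would tip the choice toward p99, which is the point: the answer depends on the cost ratio, not on the percentile alone.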

Auto-Scaling & Distributions

Without distributional thinking:
  Provision for: average load
  Result: outages during peaks, wasted capacity during lulls

With distributional thinking:
  Base capacity: p50 of expected load
  Auto-scale trigger: when load exceeds p75
  Scale-up target: p95 of expected load
  Hard ceiling: p99 (beyond this, shed load gracefully)

  This matches capacity to the actual distribution
  of traffic, not a single number.
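
The policy above can be sketched as a small decision function. `ScalingPolicy` and `scaling_action` are hypothetical names, and the threshold values are placeholders; in practice they would come from percentiles of measured traffic.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScalingPolicy:
    base: float      # p50 of expected load: always-on capacity
    trigger: float   # p75: start scaling up above this load
    target: float    # p95: capacity to scale up to
    ceiling: float   # p99: shed load beyond this

def scaling_action(load: float, policy: ScalingPolicy) -> str:
    """Map the current load to an action under the policy above."""
    if load > policy.ceiling:
        return "shed load"
    if load > policy.trigger:
        return f"scale up to {policy.target:.0f}"
    return "hold base capacity"

# Hypothetical thresholds in req/min.
policy = ScalingPolicy(base=10_000, trigger=14_000, target=22_000, ceiling=30_000)

for load in (9_000, 16_000, 35_000):
    print(load, "->", scaling_action(load, policy))
```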

Applied to Bug Estimation

"How many bugs will the release have?" is a distributional question.

Historical data for releases of similar size:

  p10:  2 bugs  (very clean release)
  p25:  5 bugs
  p50:  12 bugs (typical)
  p75:  20 bugs
  p90:  35 bugs (rough release)
  p99:  80 bugs (something went very wrong)

Useful planning outputs:
  "Expect 5-20 bugs (p25 to p75 range)"
  "Plan QA capacity for up to 35 bugs (p90)"
  "If we see 40+ bugs, something systemic is wrong"

Communicating Distributions

The hardest part of distributional thinking is communicating it to stakeholders who want a single number.

Bad: "It will take 6 weeks"
Better: "Most likely 6 weeks, could be 4-10 weeks"
Best: "We are 50% confident in 6 weeks, 80% confident
       in 8 weeks, and 95% confident in 12 weeks"

Practical format for stakeholders:
  "If things go well: 4 weeks
   Most likely: 6 weeks
   If we hit problems: 10 weeks
   We will know more after the first week of work"

The Three-Point Estimate

A practical technique for distribution-based planning:

For each task, estimate three numbers:
  O = Optimistic (best realistic case, near the minimum)
  M = Most likely (what you genuinely expect, the mode)
  P = Pessimistic (worst realistic case, near the maximum)

PERT estimate: (O + 4M + P) / 6
Standard deviation: (P - O) / 6 (treats O to P as spanning about 6 SDs)

Example:
  O = 3 days, M = 5 days, P = 15 days
  PERT = (3 + 20 + 15) / 6 = 6.3 days
  SD = (15 - 3) / 6 = 2 days

  Range: 4.3 to 8.3 days (mean +/- 1 SD)
  This is more honest than saying "5 days"
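
The same arithmetic as a small function, in case it is useful to apply across a task list. `pert` is an illustrative helper name.

```python
def pert(optimistic: float, most_likely: float, pessimistic: float) -> tuple[float, float]:
    """Return (PERT mean, standard deviation) for a three-point estimate."""
    mean = (optimistic + 4 * most_likely + pessimistic) / 6
    sd = (pessimistic - optimistic) / 6
    return mean, sd

mean, sd = pert(3, 5, 15)
print(f"PERT: {mean:.1f} days, SD: {sd:.1f} days")
print(f"range: {mean - sd:.1f} to {mean + sd:.1f} days")
```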

Monte Carlo Simulation for Projects

For projects with multiple uncertain tasks, use Monte Carlo simulation (even a rough version in a spreadsheet) to get a distribution of total project time.

Project with 5 tasks, each with a range:
  Task 1: 2-5 days
  Task 2: 1-3 days
  Task 3: 3-10 days
  Task 4: 1-2 days
  Task 5: 5-15 days

Simple sum of best cases: 12 days
Simple sum of worst cases: 35 days
Simple sum of most likely: 20 days

Monte Carlo (1000 simulations):
  p25: 17 days
  p50: 22 days
  p75: 27 days
  p90: 31 days

  The p50 (22 days) is higher than the "most likely"
  (20 days) because each right-skewed task has a mean
  above its most likely value, and the total tracks
  the sum of the means, not the sum of the modes.
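
A minimal Monte Carlo sketch for the five tasks above, using only the standard library. It assumes each task is triangular and independent of the others, and the per-task modes are hypothetical values chosen to sum to the 20-day "most likely" total. Exact percentiles depend on those assumptions, so the output will not match the figures above exactly; correlated tasks, for example, would widen the spread.

```python
import random

# (low, mode, high) in days. Ranges are from the list above; the modes are
# hypothetical values chosen to sum to the 20-day "most likely" total.
tasks = [(2, 3, 5), (1, 1.5, 3), (3, 5, 10), (1, 1.5, 2), (5, 9, 15)]

def simulate_totals(tasks, runs: int, seed: int = 42) -> list[float]:
    """Sample each task independently from a triangular distribution and sum."""
    rng = random.Random(seed)
    return sorted(
        sum(rng.triangular(low, high, mode) for low, mode, high in tasks)
        for _ in range(runs)
    )

totals = simulate_totals(tasks, runs=10_000)

def pct(sorted_values: list[float], q: float) -> float:
    """Approximate percentile by index into the sorted totals."""
    return sorted_values[min(len(sorted_values) - 1, int(q * len(sorted_values)))]

for q in (0.25, 0.50, 0.75, 0.90):
    print(f"p{round(q * 100)}: {pct(totals, q):.1f} days")
```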

Common Pitfalls

Reporting single numbers: Giving a point estimate when you should give a range. The point estimate creates false precision and guarantees disappointment when reality differs.
Confusing mean and median: Most engineering metrics are skewed, and for skewed distributions the mean is misleading. The median (p50) represents the typical case. Use percentiles.
Planning for the average: Average traffic, average task duration, average bug count. Averages ignore the tail that causes outages, missed deadlines, and quality crises.
Ignoring that distributions have shapes: Not all uncertainty is normally distributed. Task durations are right-skewed. Incident counts are often modeled as Poisson. Using the wrong distribution model leads to wrong conclusions.
Precision without accuracy: Saying "this will take 14.5 days" implies precision that does not exist. "2-4 weeks" is less precise but more accurate and more useful.
Not updating the distribution: As work progresses and uncertainty resolves, the distribution should narrow. A 2-8 week range at project start should narrow to 4-5 weeks after the first sprint. Update your ranges as you learn.

Key Takeaways

Replace point estimates with ranges. "2-4 weeks" is more honest and more useful than "3 weeks."
Use percentiles (p50, p90, p99) instead of averages for metrics, SLAs, and capacity planning. Averages hide the tail.
Plan for the distribution, not the average. Capacity planning for average load guarantees outages at peak.
Communicate uncertainty explicitly using three-point estimates: optimistic, most likely, pessimistic.
Task duration distributions are typically right-skewed: tasks rarely finish much earlier than expected but frequently finish much later. Account for this asymmetry.
The distribution narrows as you learn. Update your ranges as uncertainty resolves through the course of the project.