Probability Fundamentals

Probability theory is essential for algorithm analysis, machine learning, cryptography, and systems design. This section covers the fundamentals, with an emphasis on discrete probability — the foundation for randomized algorithms and average-case analysis.

Sample Spaces and Events

A sample space Ω is the set of all possible outcomes of a random experiment.

An event A is a subset of the sample space: A ⊆ Ω.

Example: Rolling a die. Ω = {1, 2, 3, 4, 5, 6}. Event "roll an even number" = {2, 4, 6}.

Axioms of Probability

A probability function P assigns a real number to each event, satisfying:

  1. Non-negativity: P(A) ≥ 0 for every event A.
  2. Normalization: P(Ω) = 1.
  3. Additivity: For mutually exclusive events A₁, A₂, ...: P(A₁ ∪ A₂ ∪ ...) = Σ P(Aᵢ).

Consequences

  • P(∅) = 0
  • P(Aᶜ) = 1 - P(A)
  • P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
  • If A ⊆ B, then P(A) ≤ P(B)
  • 0 ≤ P(A) ≤ 1

For finite sample spaces with equally likely outcomes:

P(A) = |A| / |Ω|
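For small finite sample spaces this counting rule can be checked mechanically. A minimal Python sketch using the die example above, with exact arithmetic via `fractions`:

```python
from fractions import Fraction

# Sample space for one die roll and the event "roll an even number".
omega = {1, 2, 3, 4, 5, 6}
even = {x for x in omega if x % 2 == 0}

# Equally likely outcomes: P(A) = |A| / |Omega|.
p_even = Fraction(len(even), len(omega))
print(p_even)  # 1/2
```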

Conditional Probability

The probability of A given B:

P(A | B) = P(A ∩ B) / P(B)    (when P(B) > 0)

Example: Draw two cards without replacement. P(second is ace | first is ace) = 3/51.
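The card example can be verified by brute-force enumeration of all ordered two-card draws. A sketch, with a hypothetical deck encoding where rank 1 represents an ace:

```python
from fractions import Fraction
from itertools import permutations

# 52-card deck: ranks 1..13 in four suits; rank 1 stands for an ace.
deck = [(rank, suit) for rank in range(1, 14) for suit in range(4)]

# All ordered draws of two distinct cards, each equally likely.
draws = list(permutations(deck, 2))
first_ace = [d for d in draws if d[0][0] == 1]
both_aces = [d for d in first_ace if d[1][0] == 1]

# P(second is ace | first is ace) = |both aces| / |first is ace|.
p_cond = Fraction(len(both_aces), len(first_ace))
print(p_cond)  # 1/17, i.e. 3/51
```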

Multiplication Rule

P(A ∩ B) = P(A | B) · P(B) = P(B | A) · P(A)

Chain rule:

P(A₁ ∩ A₂ ∩ ... ∩ Aₙ) = P(A₁) · P(A₂|A₁) · P(A₃|A₁∩A₂) · ... · P(Aₙ|A₁∩...∩Aₙ₋₁)
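As a worked instance of the chain rule: the probability of drawing three aces in a row without replacement is P(A₁) · P(A₂|A₁) · P(A₃|A₁∩A₂). In exact arithmetic:

```python
from fractions import Fraction

# Chain rule for three aces in a row, drawing without replacement:
# P(A1) * P(A2 | A1) * P(A3 | A1 and A2).
p = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50)
print(p)  # 1/5525
```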

Law of Total Probability

If B₁, B₂, ..., Bₙ partition the sample space (mutually exclusive, exhaustive):

P(A) = Σᵢ P(A | Bᵢ) · P(Bᵢ)

This is extremely useful for computing probabilities by conditioning on cases.
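A small illustration of conditioning on cases, using a hypothetical two-coin setup (pick one of two coins uniformly, then flip it):

```python
from fractions import Fraction

# Partition: B1 = chose the fair coin, B2 = chose a biased coin
# with P(heads) = 3/4. Each coin is chosen with probability 1/2.
p_B = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}
p_heads_given_B = {"fair": Fraction(1, 2), "biased": Fraction(3, 4)}

# Law of total probability: P(heads) = sum of P(heads | Bi) * P(Bi).
p_heads = sum(p_heads_given_B[b] * p_B[b] for b in p_B)
print(p_heads)  # 5/8
```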

Bayes' Theorem

P(B | A) = P(A | B) · P(B) / P(A)

With total probability in the denominator:

P(Bⱼ | A) = P(A | Bⱼ) · P(Bⱼ) / Σᵢ P(A | Bᵢ) · P(Bᵢ)

Terminology:

  • P(B): Prior — belief before observing A.
  • P(A | B): Likelihood — probability of evidence given hypothesis.
  • P(B | A): Posterior — updated belief after observing A.

Example: A test for a disease is 99% sensitive and 99% specific. Disease prevalence is 1%.

P(disease | positive) = (0.99 × 0.01) / (0.99 × 0.01 + 0.01 × 0.99)
                       = 0.0099 / 0.0198 = 0.5

Despite the test being 99% accurate, a positive result only gives 50% probability of disease. This is the base rate fallacy — the low prevalence dominates.
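The computation above is easy to script; a sketch plugging the example's numbers into Bayes' theorem, with the denominator expanded by total probability:

```python
# Numbers from the example: 99% sensitive, 99% specific, 1% prevalence.
sensitivity = 0.99   # P(positive | disease)
specificity = 0.99   # P(negative | no disease)
prevalence = 0.01    # P(disease)

# Denominator via the law of total probability:
# P(positive) = P(pos | disease) P(disease) + P(pos | healthy) P(healthy).
p_positive = (sensitivity * prevalence
              + (1 - specificity) * (1 - prevalence))
posterior = sensitivity * prevalence / p_positive
print(round(posterior, 4))  # 0.5
```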

Independence

Events A and B are independent if:

P(A ∩ B) = P(A) · P(B)

Equivalently: P(A | B) = P(A) and P(B | A) = P(B).

Mutual independence of A₁, ..., Aₙ: For every subset S ⊆ {1, ..., n}:

P(⋂ᵢ∈S Aᵢ) = Πᵢ∈S P(Aᵢ)

Pairwise independence does NOT imply mutual independence.

Example: Two coin flips are independent. P(both heads) = P(H₁) · P(H₂) = 1/4.
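The pairwise-vs-mutual distinction can be demonstrated with a classic counterexample on the same two-flip sample space: A = first flip heads, B = second flip heads, C = the two flips match. A sketch checking both properties by enumeration:

```python
from fractions import Fraction
from itertools import product

# Two fair coin flips, all four outcomes equally likely.
omega = list(product("HT", repeat=2))
p = Fraction(1, 4)

def P(event):
    return sum(p for w in omega if event(w))

A = lambda w: w[0] == "H"      # first flip is heads
B = lambda w: w[1] == "H"      # second flip is heads
C = lambda w: w[0] == w[1]     # the flips match

# Pairwise independent: P(A and C) = P(A) P(C) (likewise for A,B and B,C)...
print(P(lambda w: A(w) and C(w)) == P(A) * P(C))  # True
# ...but not mutually independent: P(A and B and C) = 1/4, not 1/8.
print(P(lambda w: A(w) and B(w) and C(w)) == P(A) * P(B) * P(C))  # False
```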

Random Variables

A random variable X is a function from the sample space to real numbers: X: Ω → ℝ.

Discrete Random Variables

Takes values from a countable set. Defined by its probability mass function (PMF):

p(x) = P(X = x)

Properties: p(x) ≥ 0 for all x, and Σ p(x) = 1.

Continuous Random Variables

Takes values from an uncountable set (e.g., ℝ). Defined by its probability density function (PDF) f(x):

P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx

Properties: f(x) ≥ 0 and ∫ f(x) dx = 1.

Note: For continuous RVs, P(X = x) = 0 for any specific x.

Cumulative Distribution Function (CDF)

F(x) = P(X ≤ x)

Properties: F is non-decreasing, right-continuous, F(-∞) = 0, F(∞) = 1.

Expectation (Expected Value)

The expected value E[X] is the "average" value weighted by probabilities.

Discrete: E[X] = Σ x · P(X = x)

Continuous: E[X] = ∫ x · f(x) dx

Properties

  • Linearity: E[aX + bY] = aE[X] + bE[Y] (always, even if X, Y are dependent!)
  • E[c] = c
  • If X ≥ 0, then E[X] ≥ 0.
  • If X and Y are independent: E[XY] = E[X] · E[Y].

Linearity of expectation is arguably the most powerful tool in probabilistic analysis. It works without independence and simplifies many counting arguments.

Example: Expected number of fixed points in a random permutation of n elements.

Let Xᵢ = 1 if element i is in position i, and 0 otherwise. Since a random permutation is equally likely to place element i in any position, E[Xᵢ] = P(Xᵢ = 1) = 1/n. By linearity, E[X] = E[Σ Xᵢ] = Σ E[Xᵢ] = n · (1/n) = 1.

The expected number of fixed points is 1, regardless of n.
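A quick Monte Carlo sanity check of this result (a simulation sketch, not a proof):

```python
import random

random.seed(0)

def fixed_points(perm):
    # Count positions i where the permutation maps i to itself.
    return sum(1 for i, x in enumerate(perm) if i == x)

n, trials = 100, 20_000
total = 0
for _ in range(trials):
    perm = list(range(n))
    random.shuffle(perm)
    total += fixed_points(perm)

print(total / trials)  # close to 1, regardless of n
```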

Law of the Unconscious Statistician (LOTUS)

E[g(X)] = Σ g(x) · P(X = x)     (discrete)
E[g(X)] = ∫ g(x) · f(x) dx      (continuous)
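For example, LOTUS gives E[X²] for a fair die directly from the PMF, without deriving the distribution of X²:

```python
from fractions import Fraction

# LOTUS with g(x) = x^2 and X a fair die: E[g(X)] = sum g(x) P(X = x).
omega = range(1, 7)
p = Fraction(1, 6)
e_x2 = sum(Fraction(x * x) * p for x in omega)
print(e_x2)  # 91/6
```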

Variance and Standard Deviation

The variance Var(X) measures the spread of X around its mean:

Var(X) = E[(X - E[X])²] = E[X²] - (E[X])²

The standard deviation σ(X) = √Var(X).

Properties

  • Var(X) ≥ 0
  • Var(c) = 0
  • Var(aX + b) = a² Var(X)
  • If X, Y are independent: Var(X + Y) = Var(X) + Var(Y)
  • In general: Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)

Covariance: Cov(X, Y) = E[XY] - E[X]E[Y].
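The general decomposition can be verified exactly on a small example with clearly dependent variables, say X = a die roll and Y = X²:

```python
from fractions import Fraction

# Exact check of Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
# for the dependent pair X = die roll, Y = X^2.
outcomes = range(1, 7)
prob = Fraction(1, 6)

def E(f):
    return sum(Fraction(f(w)) * prob for w in outcomes)

def Var(f):
    return E(lambda w: f(w) ** 2) - E(f) ** 2

X = lambda w: w
Y = lambda w: w ** 2
cov = E(lambda w: X(w) * Y(w)) - E(X) * E(Y)

lhs = Var(lambda w: X(w) + Y(w))
rhs = Var(X) + Var(Y) + 2 * cov
print(lhs == rhs)  # True
```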

Common Discrete Distributions

Bernoulli(p)

Single trial: X = 1 (success, probability p) or X = 0 (failure, probability 1-p).

E[X] = p,   Var(X) = p(1-p)

Binomial(n, p)

Number of successes in n independent Bernoulli trials.

P(X = k) = C(n, k) · pᵏ · (1-p)ⁿ⁻ᵏ,   k = 0, 1, ..., n
E[X] = np,   Var(X) = np(1-p)

Geometric(p)

Number of trials until the first success.

P(X = k) = (1-p)ᵏ⁻¹ · p,   k = 1, 2, 3, ...
E[X] = 1/p,   Var(X) = (1-p)/p²

Memoryless property: P(X > s + t | X > s) = P(X > t).
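Memorylessness follows from the tail formula P(X > k) = (1-p)ᵏ, and can be checked exactly for sample values of s and t:

```python
from fractions import Fraction

# Geometric tail: P(X > k) = (1 - p)^k, so
# P(X > s + t | X > s) = (1-p)^(s+t) / (1-p)^s = (1-p)^t = P(X > t).
p = Fraction(1, 3)
tail = lambda k: (1 - p) ** k

s, t = 4, 7
lhs = tail(s + t) / tail(s)   # P(X > s + t | X > s)
print(lhs == tail(t))  # True
```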

Poisson(λ)

Number of events in a fixed interval (when events occur independently at rate λ).

P(X = k) = e⁻λ · λᵏ / k!,   k = 0, 1, 2, ...
E[X] = λ,   Var(X) = λ

Approximates Binomial(n, p) when n is large and p is small, with λ = np.
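A sketch comparing the two PMFs term by term for illustrative parameters (n = 1000, p = 0.003, so λ = 3):

```python
from math import comb, exp, factorial

# Compare Binomial(n, p) with Poisson(lambda = n*p) for large n, small p.
n, p = 1000, 0.003
lam = n * p

def binom_pmf(k):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k):
    return exp(-lam) * lam**k / factorial(k)

for k in range(5):
    print(k, round(binom_pmf(k), 5), round(poisson_pmf(k), 5))
```

The per-k differences are on the order of np², which is why the approximation requires small p.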

Common Continuous Distributions

Uniform(a, b)

f(x) = 1/(b-a)   for a ≤ x ≤ b
E[X] = (a+b)/2,   Var(X) = (b-a)²/12

Normal (Gaussian) N(μ, σ²)

f(x) = (1/(σ√(2π))) · e^(-(x-μ)²/(2σ²))
E[X] = μ,   Var(X) = σ²

The standard normal Z ~ N(0, 1). Any normal X can be standardized: Z = (X - μ)/σ.

68-95-99.7 rule: Approximately 68% of values within 1σ, 95% within 2σ, 99.7% within 3σ.

Central Limit Theorem: The sum (or average) of many independent RVs tends toward a normal distribution, regardless of the original distribution. This is why the normal distribution appears everywhere.
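A small simulation sketch of the CLT: averages of uniform draws (which are far from normal individually) should satisfy the 68% rule:

```python
import random
import statistics

random.seed(1)

# Sample means of 30 Uniform(0,1) draws. By the CLT these means are
# approximately normal, so about 68% should fall within one standard
# deviation of their center.
means = [statistics.fmean(random.random() for _ in range(30))
         for _ in range(10_000)]
mu = statistics.fmean(means)
sigma = statistics.pstdev(means)

within_one_sigma = sum(abs(m - mu) <= sigma for m in means) / len(means)
print(round(within_one_sigma, 2))  # close to 0.68 (68-95-99.7 rule)
```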

Exponential(λ)

f(x) = λe⁻λˣ   for x ≥ 0
E[X] = 1/λ,   Var(X) = 1/λ²

Continuous analog of geometric. Memoryless: P(X > s + t | X > s) = P(X > t).

Models time between events in a Poisson process.
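The Poisson connection can be seen in simulation: accumulating Exponential(λ) inter-arrival times and counting events in a unit interval should yield counts with mean close to λ. A sketch with an illustrative rate λ = 4:

```python
import random

random.seed(2)

lam = 4.0  # arrival rate (events per unit time)

def events_in_unit_interval():
    # Accumulate exponential inter-arrival times until we pass t = 1.
    t, count = 0.0, 0
    while True:
        t += random.expovariate(lam)
        if t > 1.0:
            return count
        count += 1

counts = [events_in_unit_interval() for _ in range(20_000)]
print(round(sum(counts) / len(counts), 1))  # close to lam = 4.0
```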

Real-World Applications

  • Algorithm analysis: Expected running time (quicksort: O(n log n) expected), hash table analysis (expected O(1) lookup).
  • Randomized algorithms: Monte Carlo (may be wrong), Las Vegas (always correct, random time). Probability bounds guarantee performance.
  • Machine learning: Naive Bayes classifier uses Bayes' theorem. All probabilistic models rest on these foundations.
  • Cryptography: Security proofs bound adversary success probability.
  • Systems: Queueing models use Poisson arrivals and exponential service times. Network reliability uses independence.
  • A/B testing: Hypothesis testing and confidence intervals use normal distribution and Bayes' theorem.
  • Error analysis: Bernoulli trials model bit errors. Binomial gives error counts.