Probability Fundamentals
Probability theory is essential for algorithm analysis, machine learning, cryptography, and systems design. This section covers discrete probability, the foundation for randomized algorithms and average-case analysis.
Sample Spaces and Events
A sample space Ω is the set of all possible outcomes of a random experiment.
An event A is a subset of the sample space: A ⊆ Ω.
Example: Rolling a die. Ω = {1, 2, 3, 4, 5, 6}. Event "roll an even number" = {2, 4, 6}.
Axioms of Probability
A probability function P assigns a real number to each event, satisfying:
- Non-negativity: P(A) ≥ 0 for every event A.
- Normalization: P(Ω) = 1.
- Countable additivity: For pairwise mutually exclusive events A₁, A₂, ...: P(A₁ ∪ A₂ ∪ ...) = Σᵢ P(Aᵢ).
Consequences
- P(∅) = 0
- P(Aᶜ) = 1 - P(A)
- P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
- If A ⊆ B, then P(A) ≤ P(B)
- 0 ≤ P(A) ≤ 1
For finite sample spaces with equally likely outcomes:
P(A) = |A| / |Ω|
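The counting formula can be checked directly in code; a minimal sketch of the die example above, using exact fractions:

```python
from fractions import Fraction

# Sample space for one die roll and the event "roll an even number".
omega = {1, 2, 3, 4, 5, 6}
even = {x for x in omega if x % 2 == 0}

# Classical probability: P(A) = |A| / |Omega|.
p_even = Fraction(len(even), len(omega))
print(p_even)  # 1/2
```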
Conditional Probability
The probability of A given B:
P(A | B) = P(A ∩ B) / P(B) (when P(B) > 0)
Example: Draw two cards without replacement. P(second is ace | first is ace) = 3/51.
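The card example can be verified by brute-force enumeration over ordered two-card draws (modeling ranks only, since suits don't matter here):

```python
from fractions import Fraction
from itertools import permutations

# Model the deck by rank only: 4 aces among 52 cards.
deck = ['A'] * 4 + ['x'] * 48

# All ordered two-card draws without replacement.
draws = list(permutations(range(52), 2))
first_ace = [(i, j) for i, j in draws if deck[i] == 'A']
both_aces = [(i, j) for i, j in first_ace if deck[j] == 'A']

# P(second is ace | first is ace) = |both aces| / |first is ace|.
p = Fraction(len(both_aces), len(first_ace))
print(p)  # 1/17, i.e. 3/51
```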
Multiplication Rule
P(A ∩ B) = P(A | B) · P(B) = P(B | A) · P(A)
Chain rule:
P(A₁ ∩ A₂ ∩ ... ∩ Aₙ) = P(A₁) · P(A₂|A₁) · P(A₃|A₁∩A₂) · ... · P(Aₙ|A₁∩...∩Aₙ₋₁)
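A quick exact check of the chain rule on card draws (the three-ace event is chosen just for illustration):

```python
from fractions import Fraction

# Chain rule: P(first three cards drawn are all aces)
# = P(A₁) · P(A₂|A₁) · P(A₃|A₁∩A₂) = 4/52 · 3/51 · 2/50.
p = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50)
print(p)  # 1/5525
```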
Law of Total Probability
If B₁, B₂, ..., Bₙ partition the sample space (mutually exclusive, exhaustive):
P(A) = Σᵢ P(A | Bᵢ) · P(Bᵢ)
This is extremely useful for computing probabilities by conditioning on cases.
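A small worked example of conditioning on cases, using a hypothetical two-urn setup invented for illustration:

```python
from fractions import Fraction

# Hypothetical setup: pick one of two urns uniformly, then draw a ball.
# Urn 1 holds 3 red / 1 blue; urn 2 holds 1 red / 3 blue.
p_urn = {1: Fraction(1, 2), 2: Fraction(1, 2)}        # P(Bᵢ)
p_red_given = {1: Fraction(3, 4), 2: Fraction(1, 4)}  # P(A | Bᵢ)

# Law of total probability: P(A) = Σᵢ P(A | Bᵢ) · P(Bᵢ).
p_red = sum(p_red_given[i] * p_urn[i] for i in p_urn)
print(p_red)  # 1/2
```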
Bayes' Theorem
P(B | A) = P(A | B) · P(B) / P(A)
With total probability in the denominator:
P(Bⱼ | A) = P(A | Bⱼ) · P(Bⱼ) / Σᵢ P(A | Bᵢ) · P(Bᵢ)
Terminology:
- P(B): Prior — belief before observing A.
- P(A | B): Likelihood — probability of evidence given hypothesis.
- P(B | A): Posterior — updated belief after observing A.
Example: A test for a disease is 99% sensitive and 99% specific. Disease prevalence is 1%.
P(disease | positive) = (0.99 × 0.01) / (0.99 × 0.01 + 0.01 × 0.99)
= 0.0099 / 0.0198 = 0.5
Despite the test being 99% accurate, a positive result only gives 50% probability of disease. This is the base rate fallacy — the low prevalence dominates.
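The same calculation written out in code, with the denominator expanded by the law of total probability:

```python
# The disease-test calculation above, via Bayes' theorem.
sensitivity = 0.99   # P(positive | disease)
specificity = 0.99   # P(negative | no disease)
prevalence = 0.01    # P(disease)

# Denominator: P(positive), by the law of total probability.
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
posterior = sensitivity * prevalence / p_positive
print(round(posterior, 6))  # 0.5
```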
Independence
Events A and B are independent if:
P(A ∩ B) = P(A) · P(B)
Equivalently (when P(A), P(B) > 0): P(A | B) = P(A) and P(B | A) = P(B).
Mutual independence of A₁, ..., Aₙ: For every subset S ⊆ {1, ..., n}:
P(⋂ᵢ∈S Aᵢ) = Πᵢ∈S P(Aᵢ)
Pairwise independence does NOT imply mutual independence.
Example: Two coin flips are independent. P(both heads) = P(H₁) · P(H₂) = 1/4.
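The standard counterexample for the pairwise-vs-mutual distinction uses two fair coin flips plus their XOR; a sketch that checks both claims exactly:

```python
from fractions import Fraction
from itertools import product

# Counterexample: flip two fair coins, let the third bit be their XOR.
# The 4 outcomes (a, b, a XOR b) are equally likely.
outcomes = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]

def prob(event):
    # Classical probability over the 4 equally likely outcomes.
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

# Events Aᵢ = "bit i is 1", each with probability 1/2.
A = [lambda o, i=i: o[i] == 1 for i in range(3)]

# Pairwise independence holds: P(Aᵢ ∩ Aⱼ) = P(Aᵢ) · P(Aⱼ) = 1/4.
for i in range(3):
    for j in range(i + 1, 3):
        assert prob(lambda o: A[i](o) and A[j](o)) == prob(A[i]) * prob(A[j])

# Mutual independence fails: P(A₁ ∩ A₂ ∩ A₃) = 0, not 1/8.
p_all = prob(lambda o: all(a(o) for a in A))
print(p_all)  # 0
```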
Random Variables
A random variable X is a function from the sample space to real numbers: X: Ω → ℝ.
Discrete Random Variables
Takes values from a countable set. Defined by its probability mass function (PMF):
p(x) = P(X = x)
Properties: p(x) ≥ 0 for all x, and Σ p(x) = 1.
Continuous Random Variables
Takes values from an uncountable set (e.g., ℝ). Defined by its probability density function (PDF) f(x):
P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
Properties: f(x) ≥ 0 and ∫ f(x) dx = 1.
Note: For continuous RVs, P(X = x) = 0 for any specific x.
Cumulative Distribution Function (CDF)
F(x) = P(X ≤ x)
Properties: F is non-decreasing, right-continuous, F(-∞) = 0, F(∞) = 1.
Expectation (Expected Value)
The expected value E[X] is the "average" value weighted by probabilities.
Discrete: E[X] = Σ x · P(X = x)
Continuous: E[X] = ∫ x · f(x) dx
Properties
- Linearity: E[aX + bY] = aE[X] + bE[Y] (always, even if X, Y are dependent!)
- E[c] = c
- If X ≥ 0, then E[X] ≥ 0.
- If X and Y are independent: E[XY] = E[X] · E[Y].
Linearity of expectation is arguably the most powerful tool in probabilistic analysis. It works without independence and simplifies many counting arguments.
Example: Expected number of fixed points in a random permutation of n elements.
Let Xᵢ = 1 if element i is in position i. E[Xᵢ] = 1/n. E[X] = E[Σ Xᵢ] = Σ E[Xᵢ] = n · (1/n) = 1.
The expected number of fixed points is 1, regardless of n.
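This can be confirmed exactly for small n by averaging over all n! permutations:

```python
from itertools import permutations

# Exact check: total fixed points over all n! permutations equals n!,
# so the average number of fixed points is exactly 1 for every n.
for n in range(1, 7):
    perms = list(permutations(range(n)))
    total = sum(sum(1 for i, x in enumerate(p) if i == x) for p in perms)
    assert total == len(perms)
print("expected fixed points is 1 for n = 1..6")
```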
Law of the Unconscious Statistician (LOTUS)
E[g(X)] = Σ g(x) · P(X = x) (discrete)
E[g(X)] = ∫ g(x) · f(x) dx (continuous)
Variance and Standard Deviation
The variance Var(X) measures the spread of X around its mean:
Var(X) = E[(X - E[X])²] = E[X²] - (E[X])²
The standard deviation σ(X) = √Var(X).
Properties
- Var(X) ≥ 0
- Var(c) = 0
- Var(aX + b) = a² Var(X)
- If X, Y are independent: Var(X + Y) = Var(X) + Var(Y)
- In general: Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
Covariance: Cov(X, Y) = E[XY] - E[X]E[Y].
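A quick exact check of the identity Var(X) = E[X²] - (E[X])², using a fair die as a convenient example:

```python
from fractions import Fraction

# Fair die: verify Var(X) = E[X²] - (E[X])² with exact arithmetic.
values = range(1, 7)
p = Fraction(1, 6)

ex = sum(x * p for x in values)       # E[X]   = 7/2
ex2 = sum(x * x * p for x in values)  # E[X²]  = 91/6
var = ex2 - ex ** 2
print(var)  # 35/12
```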
Common Discrete Distributions
Bernoulli(p)
Single trial: X = 1 (success, probability p) or X = 0 (failure, probability 1-p).
E[X] = p, Var(X) = p(1-p)
Binomial(n, p)
Number of successes in n independent Bernoulli trials.
P(X = k) = C(n, k) · pᵏ · (1-p)ⁿ⁻ᵏ, k = 0, 1, ..., n
E[X] = np, Var(X) = np(1-p)
Geometric(p)
Number of trials until the first success.
P(X = k) = (1-p)ᵏ⁻¹ · p, k = 1, 2, 3, ...
E[X] = 1/p, Var(X) = (1-p)/p²
Memoryless property: P(X > s + t | X > s) = P(X > t).
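The memoryless property can be checked from the tail formula P(X > k) = (1-p)ᵏ (the choice p = 1/3 here is arbitrary):

```python
from fractions import Fraction

# Memorylessness of Geometric(p), checked exactly via P(X > k) = (1-p)^k.
p = Fraction(1, 3)

def tail(k):
    # P(X > k): the first k trials are all failures.
    return (1 - p) ** k

for s in range(5):
    for t in range(5):
        # P(X > s+t | X > s) = P(X > s+t) / P(X > s) should equal P(X > t).
        assert tail(s + t) / tail(s) == tail(t)
print("memoryless property verified for p = 1/3")
```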
Poisson(λ)
Number of events in a fixed interval (when events occur independently at rate λ).
P(X = k) = e⁻λ · λᵏ / k!, k = 0, 1, 2, ...
E[X] = λ, Var(X) = λ
Approximates Binomial(n, p) when n is large and p is small, with λ = np.
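A numerical sanity check of this approximation, with n = 1000 and p = 0.003 chosen for illustration:

```python
from math import comb, exp, factorial

# Compare Binomial(n, p) with Poisson(λ = np) for large n, small p.
n, p = 1000, 0.003
lam = n * p  # λ = 3

for k in range(8):
    binom = comb(n, k) * p**k * (1 - p)**(n - k)
    poisson = exp(-lam) * lam**k / factorial(k)
    assert abs(binom - poisson) < 1e-3  # the two PMFs agree closely
print("Poisson(3) approximates Binomial(1000, 0.003)")
```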
Common Continuous Distributions
Uniform(a, b)
f(x) = 1/(b-a) for a ≤ x ≤ b
E[X] = (a+b)/2, Var(X) = (b-a)²/12
Normal (Gaussian) N(μ, σ²)
f(x) = (1/(σ√(2π))) · e^(-(x-μ)²/(2σ²))
E[X] = μ, Var(X) = σ²
The standard normal Z ~ N(0, 1). Any normal X can be standardized: Z = (X - μ)/σ.
68-95-99.7 rule: Approximately 68% of values within 1σ, 95% within 2σ, 99.7% within 3σ.
Central Limit Theorem: The sum (or average) of many independent, identically distributed RVs with finite variance tends toward a normal distribution, regardless of the original distribution. This is why the normal distribution appears everywhere.
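A small simulation illustrates the CLT for sums of uniforms (the sample sizes here are arbitrary choices):

```python
import random

# Sum of 30 independent Uniform(0, 1) draws. The CLT says the sum is
# approximately Normal with mean 30 · 1/2 = 15 and variance 30 · 1/12 = 2.5.
random.seed(0)
sums = [sum(random.random() for _ in range(30)) for _ in range(20000)]

mean = sum(sums) / len(sums)
var = sum((s - mean) ** 2 for s in sums) / len(sums)
print(mean, var)  # close to 15 and 2.5
```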
Exponential(λ)
f(x) = λe⁻λˣ for x ≥ 0
E[X] = 1/λ, Var(X) = 1/λ²
Continuous analog of geometric. Memoryless: P(X > s + t | X > s) = P(X > t).
Models time between events in a Poisson process.
Real-World Applications
- Algorithm analysis: Expected running time (quicksort: O(n log n) expected), hash table analysis (expected O(1) lookup).
- Randomized algorithms: Monte Carlo (may be wrong), Las Vegas (always correct, random time). Probability bounds guarantee performance.
- Machine learning: Naive Bayes classifier uses Bayes' theorem. All probabilistic models rest on these foundations.
- Cryptography: Security proofs bound adversary success probability.
- Systems: Queueing models use Poisson arrivals and exponential service times. Network reliability uses independence.
- A/B testing: Hypothesis testing and confidence intervals use normal distribution and Bayes' theorem.
- Error analysis: Bernoulli trials model bit errors. Binomial gives error counts.