Advanced Probability
Advanced probability tools used extensively in algorithm analysis, randomized algorithms, and theoretical computer science.
Markov Chains
A Markov chain is a sequence of random variables X₀, X₁, X₂, ... where the future depends only on the present, not the past:
P(Xₙ₊₁ = j | Xₙ = i, Xₙ₋₁ = iₙ₋₁, ..., X₀ = i₀) = P(Xₙ₊₁ = j | Xₙ = i)
This is the Markov property (memorylessness).
Transition Matrix
For states {1, 2, ..., k}, the transition matrix P has entries:
Pᵢⱼ = P(Xₙ₊₁ = j | Xₙ = i)
Each row sums to 1 (it's a stochastic matrix).
The n-step transition probabilities: P(Xₙ = j | X₀ = i) = (Pⁿ)ᵢⱼ.
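As a quick sketch, the n-step probabilities are just matrix powers; the 3-state transition matrix below is a made-up example:

```python
import numpy as np

# A toy 3-state chain (hypothetical transition probabilities).
P = np.array([
    [0.7, 0.2, 0.1],   # transitions out of state 0
    [0.3, 0.4, 0.3],   # transitions out of state 1
    [0.2, 0.3, 0.5],   # transitions out of state 2
])

# Stochastic matrix: every row sums to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# n-step transition probabilities: P(X_n = j | X_0 = i) = (P^n)_ij.
P5 = np.linalg.matrix_power(P, 5)
```

Note that P⁵ is again a stochastic matrix, so its rows also sum to 1.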
Classification of States
- Absorbing: Pᵢᵢ = 1 (once entered, never left).
- Recurrent: The chain returns to state i with probability 1.
- Transient: There is a positive probability of never returning.
- Periodic: The chain can only return to state i at multiples of some period d > 1.
- Aperiodic: Period = 1.
- Ergodic: Positive recurrent and aperiodic (for a finite chain, recurrent and aperiodic suffice).
Stationary Distribution
A distribution π (a row vector with Σⱼ πⱼ = 1) is stationary if πP = π, i.e., π is a left eigenvector of P for eigenvalue 1.
For an irreducible (all states communicate) and aperiodic chain, the stationary distribution exists, is unique, and:
lim_{n→∞} (Pⁿ)ᵢⱼ = πⱼ (regardless of starting state i)
πⱼ = 1 / E[return time to j].
Example: PageRank is the stationary distribution of a random walk on the web graph.
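Power iteration makes the convergence to π concrete; a minimal sketch on a made-up 3-state chain:

```python
import numpy as np

# Toy 3-state chain (hypothetical transition probabilities).
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

# Power iteration: pi <- pi P converges to the stationary distribution
# for an irreducible, aperiodic chain, from any starting distribution.
pi = np.full(3, 1 / 3)
for _ in range(500):
    pi = pi @ P

# pi now satisfies pi P = pi (up to floating-point error).
```

This is essentially how PageRank is computed, with P replaced by the (damped) web-graph walk matrix.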
Expected Hitting Times
The expected time to reach state j starting from state i. Setting hⱼ = 0 and hᵢ = 1 + Σₖ Pᵢₖ hₖ for every i ≠ j gives a system of linear equations that can be solved directly.
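A sketch of solving that linear system with numpy, using a made-up 3-state chain and a chosen target state:

```python
import numpy as np

# Toy 3-state chain (hypothetical numbers); we compute hitting times into state 2.
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
target = 2

# For i != target: h_i = 1 + sum_k P_ik h_k, with h_target = 0.
# Restricting P to the non-target states Q, this is (I - Q) h = 1.
others = [i for i in range(len(P)) if i != target]
Q = P[np.ix_(others, others)]
h = np.linalg.solve(np.eye(len(others)) - Q, np.ones(len(others)))
# h[k] is the expected number of steps to hit `target` from state others[k].
```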
Random Walks
A random walk on a graph: at each step, move to a uniformly random neighbor.
Random Walk on ℤᵈ
Start at 0; at each step move to a uniformly random lattice neighbor (in 1D, +1 or -1 with equal probability).
- Recurrent in 1D and 2D: the walk returns to the origin with probability 1 (Pólya's theorem).
- Transient in 3D and higher: there is a positive probability of never returning.
- Expected return time: ∞ even in 1D (null recurrent).
- After n steps in 1D, expected distance from origin: √(2n/π) = Θ(√n).
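The √n scaling of the distance is easy to check by simulation; a minimal sketch with numpy (walk length and trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, trials = 2_500, 2_000

# Simulate many independent 1D walks: each step is +/-1 with equal probability.
steps = rng.choice([-1, 1], size=(trials, n_steps))
final_positions = steps.sum(axis=1)

avg_distance = np.abs(final_positions).mean()
predicted = np.sqrt(2 * n_steps / np.pi)   # asymptotic E|X_n|, about 39.9 here
```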
Gambler's Ruin
A gambler starts with i dollars and plays until reaching 0 (ruin) or n (the goal). At each round, the fortune moves +1 with probability p and -1 with probability 1-p.
Probability of ruin (reaching 0 before n):
- If p ≠ 1/2: P(ruin) = (((1-p)/p)^i - ((1-p)/p)^n) / (1 - ((1-p)/p)^n)
- If p = 1/2: P(ruin) = 1 - i/n
Even with fair odds (p = 1/2), the gambler eventually goes broke given infinite play (ruin probability → 1 as n → ∞).
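The closed-form ruin probabilities can be wrapped in a small helper (a sketch; the stake and goal values are arbitrary):

```python
def ruin_probability(i, n, p):
    """P(hit 0 before n), starting from i dollars, winning each round with prob p."""
    if p == 0.5:
        return 1 - i / n               # fair game: linear in the starting stake
    r = (1 - p) / p
    return (r**i - r**n) / (1 - r**n)  # general case

fair = ruin_probability(10, 100, 0.5)     # 0.9
unfair = ruin_probability(10, 100, 0.49)  # even slightly bad odds push this near 1
```

Note how sensitive ruin is to the odds: dropping p from 0.50 to 0.49 raises the ruin probability from 0.90 to above 0.99.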
Random Walk on Graphs
- Cover time: Expected time to visit all vertices. For complete graph Kₙ: Θ(n log n) (coupon collector). For path Pₙ: Θ(n²). For general graph: O(n³), Ω(n log n).
- Commute time: Expected round-trip time between u and v. Equals 2m · R_eff(u, v), where m is the number of edges and R_eff is the effective resistance between u and v when each edge is a unit resistor.
Birthday Problem
Problem: In a group of n people, what is the probability that at least two share a birthday?
P(collision) = 1 - P(all different)
= 1 - (365/365)(364/365)(363/365)···((365-n+1)/365)
= 1 - 365! / (365ⁿ · (365-n)!)
Approximation: P(collision) ≈ 1 - e^(-n²/(2·365)).
- n = 23: P ≈ 50.7% (surprisingly high!)
- n = 50: P ≈ 97%
- n = 70: P ≈ 99.9%
Generalized birthday problem: With m possible values, the expected number of samples until the first collision is about √(πm/2) ≈ 1.25√m.
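The exact probability is easy to compute directly, which also shows how good the exponential approximation is; a sketch:

```python
import math

def collision_probability(n, m=365):
    """Exact P(at least two of n uniform samples from m values coincide)."""
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (m - k) / m   # k-th sample avoids the first k values
    return 1 - p_all_distinct

exact = collision_probability(23)               # about 0.507
approx = 1 - math.exp(-23**2 / (2 * 365))       # the e^(-n^2/(2m)) approximation
```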
Applications:
- Birthday attack in cryptography: Finding hash collisions requires ~√(2ⁿ) = 2^(n/2) hashes for an n-bit hash. This is why hash outputs should be at least 256 bits (128-bit collision resistance).
- Hash table analysis: Expected number of collisions.
- DNA matching: Probability of coincidental matches.
Coupon Collector Problem
Problem: There are n types of coupons, each drawn uniformly at random. How many draws to collect all n types?
Expected time: E[T] = n · Hₙ = n · (1 + 1/2 + 1/3 + ... + 1/n) ≈ n ln n + γn
where Hₙ is the n-th harmonic number and γ ≈ 0.5772 is the Euler-Mascheroni constant.
Derivation: Divide into phases. Phase i starts when we have i-1 distinct coupons and ends when we collect the i-th distinct coupon. In phase i, each draw is new with probability (n-i+1)/n, so the length of phase i is geometric with mean n/(n-i+1).
E[T] = Σᵢ₌₁ⁿ n/(n-i+1) = n(1 + 1/2 + ... + 1/n) = nHₙ
Applications: How long until a random hash function maps to every bucket? Expected Θ(n log n) insertions for n buckets.
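The nHₙ formula can be checked by simulation; a minimal sketch (the coupon count and number of trials are arbitrary):

```python
import random

random.seed(1)

def draws_to_collect(n):
    """Draw uniform coupons until all n distinct types have been seen."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        draws += 1
    return draws

n, trials = 50, 2_000
expected = n * sum(1 / k for k in range(1, n + 1))   # n * H_n, about 225 for n = 50
average = sum(draws_to_collect(n) for _ in range(trials)) / trials
```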
Balls and Bins
Problem: Throw m balls into n bins uniformly at random.
Maximum Load
When m = n balls go into n bins:
- Expected maximum load: Θ(log n / log log n).
- With "power of two choices" (each ball chooses the less-loaded of 2 random bins): maximum load drops to Θ(log log n). This is an exponential improvement!
Empty Bins
Expected number of empty bins: n · (1 - 1/n)ⁿ ≈ n/e.
Collisions
Expected number of collisions (balls sharing a bin): m - n + n(1-1/n)^m.
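A simulation sketch comparing one choice against the power of two choices, and checking the n/e empty-bin count (the parameters are arbitrary):

```python
import random

random.seed(0)

def throw_balls(n_balls, n_bins, choices):
    """Each ball goes to the least-loaded of `choices` random bins."""
    bins = [0] * n_bins
    for _ in range(n_balls):
        picks = [random.randrange(n_bins) for _ in range(choices)]
        bins[min(picks, key=lambda b: bins[b])] += 1
    return bins

n = 100_000
bins1 = throw_balls(n, n, 1)
bins2 = throw_balls(n, n, 2)

one_choice_max = max(bins1)            # ~ log n / log log n
two_choice_max = max(bins2)            # ~ log log n: much smaller
empty_fraction = bins1.count(0) / n    # should be close to 1/e ~ 0.368
```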
Applications
- Hash tables: Keys are balls, buckets are bins. Maximum load determines worst-case lookup.
- Load balancing: Tasks are balls, servers are bins. Power of two choices is used in real load balancers.
- Caching: Hash-based cache placement.
Probabilistic Method
The probabilistic method proves existence of combinatorial objects by showing a random construction works with positive probability.
Key idea: If P(object has property) > 0, then an object with that property must exist.
Example: Graph Coloring Lower Bound
Theorem: There exists a graph with chromatic number > k and girth > g (for any k, g).
Proof: Take a random graph G(n, p) with appropriate p. Show that with positive probability, it has few short cycles (delete vertices to eliminate them) and large chromatic number (via independence number bound). The detailed proof uses first moment method and alteration. ∎
First Moment Method
If X is a non-negative integer random variable:
- P(X ≥ 1) ≤ E[X] (Markov's inequality with a = 1).
- Consequently, if E[X] < 1, then P(X ≥ 1) < 1, so P(X = 0) > 0: an object with no bad substructure exists.
Deletion method: Show E[number of bad events] < n/2; deleting at most one vertex per bad event fixes them all, leaving at least n/2 vertices.
Second Moment Method
If E[X] > 0 and Var(X) is small relative to (E[X])²:
P(X = 0) ≤ Var(X) / (E[X])² (Paley-Zygmund or Chebyshev)
If the RHS < 1, then X > 0 with positive probability.
Lovász Local Lemma (LLL)
When bad events are "mostly independent," we can avoid all of them simultaneously.
Symmetric version: If each bad event has probability ≤ p, and each depends on at most d others, and ep(d+1) ≤ 1, then:
P(no bad event occurs) > 0
Constructive LLL (Moser-Tardos, 2010): An efficient algorithm that finds the desired configuration in expected polynomial time.
Application: k-SAT — if every clause has k literals and shares variables with at most 2^k/e - 1 other clauses, the formula is satisfiable.
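For SAT, the Moser-Tardos algorithm is strikingly simple: start from a uniform random assignment and, while some clause is violated, resample that clause's variables. A minimal sketch (the tiny 3-SAT instance is made up for illustration):

```python
import random

random.seed(0)

def moser_tardos_sat(clauses, n_vars):
    """Resample violated clauses until the assignment satisfies everything.
    Literal +v means variable v is true; -v means it is false.
    Under the LLL condition this terminates in expected polynomial time."""
    value = [random.random() < 0.5 for _ in range(n_vars + 1)]  # 1-indexed

    def violated(clause):
        return all(value[abs(lit)] != (lit > 0) for lit in clause)

    while any(violated(c) for c in clauses):
        clause = next(c for c in clauses if violated(c))
        for lit in clause:                       # resample the clause's variables
            value[abs(lit)] = random.random() < 0.5
    return value

# A tiny 3-SAT instance (hypothetical) with sparse clause overlap.
clauses = [[1, 2, 3], [-1, 4, 5], [2, -4, 6], [-3, -5, -6]]
assignment = moser_tardos_sat(clauses, 6)
```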
Concentration Inequalities
These bound how much a random variable deviates from its expectation.
Markov's Inequality
For non-negative X and a > 0:
P(X ≥ a) ≤ E[X] / a
Weak but universally applicable.
Chebyshev's Inequality
For any X with finite variance, and k > 0:
P(|X - E[X]| ≥ k) ≤ Var(X) / k²
Equivalently: P(|X - μ| ≥ kσ) ≤ 1/k².
Stronger than Markov, requires variance.
Chernoff Bounds
For X = X₁ + X₂ + ... + Xₙ where Xᵢ are independent Bernoulli with μ = E[X]:
Upper tail: P(X ≥ (1 + δ)μ) ≤ (e^δ / (1+δ)^(1+δ))^μ
Simplified forms:
P(X ≥ (1+δ)μ) ≤ e^(-μδ²/3) for 0 < δ ≤ 1
P(X ≤ (1-δ)μ) ≤ e^(-μδ²/2) for 0 < δ < 1
Two-sided: P(|X - μ| ≥ δμ) ≤ 2e^(-μδ²/3).
Example: Flip 1000 fair coins. μ = 500. What is P(X ≥ 600)?
δ = 0.2, P ≤ e^(-500 · 0.04/3) ≈ e^(-6.67) ≈ 0.0013.
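This bound can be compared against the exact binomial tail; a sketch (the exact tail turns out to be far smaller than the bound, which is typical — Chernoff bounds are convenient, not tight):

```python
import math

n, mu, delta = 1000, 500, 0.2

# Chernoff upper tail: P(X >= 600) <= e^(-mu * delta^2 / 3), about 0.0013.
chernoff_bound = math.exp(-mu * delta**2 / 3)

# Exact tail of Binomial(1000, 1/2), summed from binomial coefficients.
exact_tail = sum(math.comb(n, k) for k in range(600, n + 1)) / 2**n
```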
Hoeffding's Inequality
For independent RVs Xᵢ ∈ [aᵢ, bᵢ] with S = Σ Xᵢ:
P(S - E[S] ≥ t) ≤ exp(-2t² / Σ(bᵢ - aᵢ)²)
More general than Chernoff — works for bounded (not just Bernoulli) variables.
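A numeric sketch of the bound for n samples bounded in [0, 1] (so each bᵢ - aᵢ = 1 and the denominator is just n; the sample count and deviation are arbitrary):

```python
import math

def hoeffding_bound(t, n):
    """P(S - E[S] >= t) for the sum S of n independent samples in [0, 1]."""
    return math.exp(-2 * t**2 / n)

# 1000 samples in [0, 1]: probability the sum exceeds its mean by 50
# (i.e., the sample mean deviates by 0.05) is at most e^(-5).
bound = hoeffding_bound(50, 1000)
```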
Applications of Concentration
- Randomized algorithms: High-probability guarantees for running time, approximation quality.
- Hashing: Prove that hash tables have small maximum load.
- Random graphs: Properties of G(n, p) — connectivity threshold, chromatic number, clique number.
- Machine learning: Generalization bounds (VC dimension + Hoeffding → PAC learning).
- Network design: Prove that random routing achieves near-optimal load.
Real-World Applications
- PageRank: Stationary distribution of a random walk on the web graph.
- MCMC: Markov Chain Monte Carlo methods use Markov chains to sample from complex distributions (Bayesian inference, statistical physics).
- Load balancing: Power of two choices reduces max load from O(log n) to O(log log n).
- Bloom filters: False positive probability analysis uses independence and union bounds.
- Randomized testing: Random test generation, property-based testing.
- Streaming algorithms: Concentration inequalities prove accuracy of sketches (count-min, HyperLogLog).
- Distributed systems: Randomized consensus, gossip protocols analyzed via Markov chains.