Bayesian Statistics

Bayesian statistics treats parameters as random variables with probability distributions, updated as data is observed. It provides a coherent framework for uncertainty quantification, sequential learning, and decision-making.

Bayesian Framework

Bayes' Theorem for Parameters

P(θ | data) = P(data | θ) · P(θ) / P(data)

| Term | Name | Role |
|---|---|---|
| P(θ) | Prior | Belief about θ before seeing data |
| P(data \| θ) | Likelihood | How probable the data is given θ |
| P(θ \| data) | Posterior | Updated belief after seeing data |
| P(data) | Evidence (marginal likelihood) | Normalizing constant |

Since P(data) is a constant with respect to θ:

Posterior ∝ Likelihood × Prior
P(θ | data) ∝ P(data | θ) · P(θ)

Bayesian vs Frequentist

| Aspect | Frequentist | Bayesian |
|---|---|---|
| Parameters | Fixed but unknown | Random variables |
| Probability | Long-run frequency | Degree of belief |
| Inference | Point estimates, CIs, p-values | Posterior distribution |
| Prior information | Not formally incorporated | Incorporated via prior |
| Interpretation of interval | 95% of such intervals contain θ | 95% probability θ is in interval |

Prior Distributions

Informative Priors

Incorporate genuine prior knowledge: expert opinion, previous studies, physical constraints.

Non-informative (Vague) Priors

Express minimal prior knowledge:

  • Uniform: P(θ) ∝ 1 (flat prior). Not always sensible (improper for unbounded θ).
  • Jeffreys prior: P(θ) ∝ √I(θ) where I(θ) is Fisher information. Invariant to reparameterization.
  • Reference priors: Maximize expected information gain from data.

Conjugate Priors

A prior is conjugate to a likelihood if the posterior belongs to the same family as the prior.

| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Bernoulli/Binomial | Beta(α, β) | Beta(α + successes, β + failures) |
| Poisson | Gamma(α, β) | Gamma(α + Σxᵢ, β + n) |
| Normal (known σ²) | Normal(μ₀, σ₀²) | Normal(weighted mean, combined precision) |
| Normal (known μ) | Inverse-Gamma(α, β) | Inverse-Gamma(α + n/2, β + SS/2) |
| Multinomial | Dirichlet(α₁,...,αₖ) | Dirichlet(α₁ + n₁,...,αₖ + nₖ) |
| Exponential | Gamma(α, β) | Gamma(α + n, β + Σxᵢ) |

Conjugate priors allow closed-form posterior computation — no numerical integration needed.

Example: Beta-Binomial

Prior: θ ~ Beta(α, β). Data: k successes in n trials.

Posterior: θ | data ~ Beta(α + k, β + n - k).

Posterior mean: (α + k) / (α + β + n). A weighted average between the prior mean α/(α+β) and the sample proportion k/n.

As n → ∞, the posterior concentrates around k/n (data overwhelms the prior).
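The update above can be sketched in a few lines of Python (the prior parameters and data are illustrative):

```python
# Beta-Binomial conjugate update: prior Beta(alpha, beta), data k successes in n trials.
alpha, beta = 2.0, 2.0        # illustrative prior
k, n = 7, 10                  # illustrative data: 7 successes in 10 trials

alpha_post = alpha + k        # posterior: Beta(alpha + k, beta + n - k)
beta_post = beta + (n - k)

prior_mean = alpha / (alpha + beta)
sample_prop = k / n
post_mean = alpha_post / (alpha_post + beta_post)

# The posterior mean is a weighted average of prior mean and sample proportion,
# with the data's weight n / (alpha + beta + n) growing as n grows:
weight_data = n / (alpha + beta + n)
assert abs(post_mean - (weight_data * sample_prop + (1 - weight_data) * prior_mean)) < 1e-12

print(alpha_post, beta_post, round(post_mean, 4))  # 9.0 5.0 0.6429
```

With only 10 trials the prior still pulls the estimate from 0.7 toward 0.5; as n grows, the data's weight approaches 1.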

Bayesian Inference

Point Estimates

  • Posterior mean: E[θ | data]. Minimizes squared error loss.
  • Posterior median: Minimizes absolute error loss.
  • MAP estimate: Mode of posterior. Equals MLE when prior is uniform.

Credible Intervals

A (1-α) credible interval [L, U] satisfies:

P(L ≤ θ ≤ U | data) = 1 - α

Highest Posterior Density (HPD) interval: The shortest interval with (1-α) probability. All points inside have higher posterior density than points outside.

Unlike frequentist confidence intervals, credible intervals have the intuitive interpretation: "there is a (1-α) probability that θ lies in this interval" (given the model and prior).
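In practice a credible interval is often read off posterior draws. A sketch for an equal-tailed 95% interval, using the illustrative Beta(9, 5) posterior from the Beta-Binomial example:

```python
import random

random.seed(0)
alpha_post, beta_post = 9.0, 5.0   # illustrative Beta posterior

# Draw from the posterior and sort; quantiles of the draws estimate quantiles of θ | data.
samples = sorted(random.betavariate(alpha_post, beta_post) for _ in range(100_000))

# Equal-tailed 95% credible interval: the 2.5% and 97.5% sample quantiles.
lo = samples[int(0.025 * len(samples))]
hi = samples[int(0.975 * len(samples))]
print(f"95% credible interval: [{lo:.3f}, {hi:.3f}]")
```

An HPD interval would instead scan for the shortest window of the sorted draws containing 95% of them; for a skewed posterior the two intervals differ.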

Posterior Predictive Distribution

Predict new observation x_new:

P(x_new | data) = ∫ P(x_new | θ) P(θ | data) dθ

This averages over parameter uncertainty: predictive intervals come out wider, and more honest, than those obtained by plugging a single point estimate into P(x_new | θ).
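The integral can be approximated by Monte Carlo: draw θ from the posterior, then average P(x_new | θ). For the Bernoulli case with the illustrative Beta(9, 5) posterior, the closed-form answer is simply the posterior mean, which gives us a check:

```python
import random

random.seed(1)
alpha_post, beta_post = 9.0, 5.0   # illustrative Beta posterior

# Monte Carlo approximation of P(x_new = 1 | data) = ∫ θ · P(θ | data) dθ:
# draw θ from the posterior, average the per-θ success probability (which is θ itself).
draws = 200_000
p_success = sum(random.betavariate(alpha_post, beta_post) for _ in range(draws)) / draws

exact = alpha_post / (alpha_post + beta_post)   # closed form: the posterior mean
print(round(p_success, 3), round(exact, 3))
```

The same recipe — sample parameters, then sample or average predictions — works for any model where the posterior can be sampled.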

Bayesian Hypothesis Testing

Bayes Factor

Compare two hypotheses:

BF₁₀ = P(data | H₁) / P(data | H₀)

| BF₁₀ | Evidence for H₁ |
|---|---|
| 1-3 | Barely worth mentioning |
| 3-10 | Substantial |
| 10-30 | Strong |
| 30-100 | Very strong |
| > 100 | Decisive |

Posterior odds = Prior odds × Bayes factor:

P(H₁|data)/P(H₀|data) = [P(H₁)/P(H₀)] × BF₁₀
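For conjugate models the marginal likelihoods have closed forms. A sketch for binomial data, testing H₀: θ = 0.5 against H₁: θ ~ Beta(1, 1) (the data values are illustrative):

```python
from math import exp, lgamma, log

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

k, n = 7, 10   # illustrative data: 7 successes in 10 trials

# H1: θ ~ Beta(1, 1) → marginal likelihood ∝ B(1 + k, 1 + n - k) / B(1, 1)
# H0: θ = 0.5        → likelihood ∝ 0.5^n
# (the binomial coefficient is common to both hypotheses and cancels in the ratio)
log_m1 = log_beta(1 + k, 1 + n - k) - log_beta(1, 1)
log_m0 = n * log(0.5)

bf10 = exp(log_m1 - log_m0)
print(round(bf10, 3))  # 0.776 — 7/10 successes slightly favor the point null
```

Note that a Bayes factor below 1 is evidence *for* H₀, something a p-value cannot express.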

Advantages over p-values

  • Can provide evidence for H₀ (not just fail to reject)
  • Natural stopping: collect data until BF is decisive
  • Not affected by multiple testing in the same way

MCMC Methods

When the posterior doesn't have a closed form, use Markov Chain Monte Carlo to draw samples from it.

Metropolis-Hastings Algorithm

  1. Start at θ₀.
  2. Propose θ* from proposal distribution q(θ* | θₜ).
  3. Compute acceptance ratio: α = min(1, P(θ*|data)q(θₜ|θ*) / P(θₜ|data)q(θ*|θₜ)).
  4. Accept θ* with probability α; otherwise keep θₜ.
  5. Repeat.

The chain converges to the posterior distribution.

Random walk MH: q(θ* | θₜ) = N(θₜ, σ²). Tune σ for ~23% acceptance rate (in high dimensions).
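A minimal random-walk Metropolis sampler, targeting a Beta(9, 5) posterior as an illustrative example (only the unnormalized log posterior is needed):

```python
import math
import random

random.seed(42)

def log_post(theta):
    """Unnormalized log posterior — here a Beta(9, 5) density, an illustrative target."""
    if not 0.0 < theta < 1.0:
        return float("-inf")
    return 8.0 * math.log(theta) + 4.0 * math.log(1.0 - theta)

theta = 0.5                  # θ₀
sigma = 0.25                 # proposal scale; tune for a healthy acceptance rate
samples, accepted = [], 0

for _ in range(50_000):
    proposal = random.gauss(theta, sigma)    # symmetric proposal, so the q terms cancel
    accept_prob = math.exp(min(0.0, log_post(proposal) - log_post(theta)))
    if random.random() < accept_prob:
        theta = proposal
        accepted += 1
    samples.append(theta)

kept = samples[5_000:]       # discard burn-in
post_mean = sum(kept) / len(kept)
print(round(post_mean, 2), round(accepted / 50_000, 2))   # mean ≈ 0.64 (true value 9/14)
```

Working on the log scale avoids underflow; because the Gaussian proposal is symmetric, the q(θₜ | θ*) / q(θ* | θₜ) factor in the acceptance ratio equals 1.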

Gibbs Sampling

Sample each parameter from its full conditional distribution (conditional on all other parameters and data).

For each iteration:
    θ₁ ~ P(θ₁ | θ₂, ..., θₖ, data)
    θ₂ ~ P(θ₂ | θ₁, θ₃, ..., θₖ, data)
    ⋮
    θₖ ~ P(θₖ | θ₁, ..., θₖ₋₁, data)

Works when full conditionals are known (often conjugate). No acceptance step needed.
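A toy Gibbs sampler for a standard bivariate normal with correlation ρ, where both full conditionals are known normals (the target is illustrative):

```python
import math
import random

random.seed(0)
rho = 0.8                         # illustrative target: bivariate normal, unit variances
cond_sd = math.sqrt(1 - rho ** 2)

x = y = 0.0
xs, ys = [], []
for _ in range(50_000):
    # Each full conditional of a bivariate normal is itself normal:
    x = random.gauss(rho * y, cond_sd)   # x | y ~ N(ρy, 1 - ρ²)
    y = random.gauss(rho * x, cond_sd)   # y | x ~ N(ρx, 1 - ρ²)
    xs.append(x)
    ys.append(y)

# Empirical correlation (means ≈ 0, variances ≈ 1 for this target):
n = len(xs)
emp_rho = sum(a * b for a, b in zip(xs, ys)) / n
print(round(emp_rho, 2))
```

Every draw is accepted; the cost is that highly correlated parameters (ρ near 1) make the chain move slowly, which is where HMC-style samplers shine.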

MCMC Diagnostics

  • Burn-in: Discard initial samples (before convergence).
  • Trace plots: Visual check for convergence (should look like "hairy caterpillar").
  • R̂ (Gelman-Rubin): Compare within-chain and between-chain variance. R̂ < 1.01 indicates convergence.
  • Effective sample size (ESS): Accounts for autocorrelation between draws. Aim for ESS in the hundreds per quantity of interest (a common rule of thumb is ESS > 400), not just a large raw draw count.
  • Autocorrelation plots: Should decay quickly. Thinning can help.
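The Gelman-Rubin statistic is simple to compute by hand. A sketch using four synthetic "chains" of i.i.d. draws from the same distribution — a converged scenario, so R̂ should be close to 1:

```python
import random
import statistics

random.seed(3)
# Four illustrative chains, each 2,000 draws from the same N(0, 1) target.
chains = [[random.gauss(0, 1) for _ in range(2_000)] for _ in range(4)]

n = len(chains[0])
means = [statistics.fmean(c) for c in chains]
W = statistics.fmean(statistics.variance(c) for c in chains)   # within-chain variance
B = n * statistics.variance(means)                             # between-chain variance

# Pooled posterior-variance estimate, then R-hat:
var_hat = (n - 1) / n * W + B / n
rhat = (var_hat / W) ** 0.5
print(round(rhat, 3))
```

If the chains had been started far apart and not yet mixed, B would dwarf W and R̂ would sit well above 1. (Modern practice splits each chain in half first — split-R̂ — to also catch within-chain drift.)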

Modern MCMC

  • Hamiltonian Monte Carlo (HMC): Uses gradient information for efficient exploration. Much better in high dimensions.
  • No-U-Turn Sampler (NUTS): Adaptive HMC that automatically tunes step size and trajectory length. Default in Stan.
  • Stan: Probabilistic programming language for Bayesian inference. Uses NUTS.
  • PyMC: Python library for Bayesian modeling.

Hierarchical Models

Model data at multiple levels:

Data: yᵢⱼ ~ N(θⱼ, σ²)          (observations within groups)
Group: θⱼ ~ N(μ, τ²)            (group-level parameters)
Hyperprior: μ ~ N(0, 100), τ ~ HalfCauchy(0, 5)

Partial pooling: Group estimates are shrunk toward the overall mean. Groups with less data are shrunk more. This is more principled than either complete pooling (ignore groups) or no pooling (treat groups independently).
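When σ² and τ² are treated as known, the shrinkage has a closed form: the posterior mean of θⱼ is a precision-weighted average of the group mean and μ. A sketch with illustrative values:

```python
# Partial-pooling shrinkage with known variances (all values illustrative).
mu, tau2, sigma2 = 0.0, 1.0, 4.0   # hyper-mean, group-level and observation variances

def shrunk_mean(group_mean, n_j):
    """Posterior mean of θⱼ given the group mean of n_j observations."""
    prec_data = n_j / sigma2       # precision of ȳⱼ as an estimate of θⱼ
    prec_prior = 1.0 / tau2        # precision of θⱼ around μ
    return (prec_data * group_mean + prec_prior * mu) / (prec_data + prec_prior)

# A small group is shrunk toward μ far more than a large group with the same mean:
print(round(shrunk_mean(2.0, 2), 3))    # n=2  → 0.667, pulled strongly toward μ = 0
print(round(shrunk_mean(2.0, 40), 3))   # n=40 → 1.818, barely shrunk
```

This is exactly the "groups with less data are shrunk more" behavior: the data precision nⱼ/σ² scales with group size, so small groups borrow strength from the population.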

Model Comparison

WAIC (Widely Applicable Information Criterion)

Bayesian analog of AIC. Uses the full posterior, not just point estimates.

LOO-CV (Leave-One-Out Cross-Validation)

Approximate LOO using PSIS (Pareto Smoothed Importance Sampling). Implemented in the loo R package and ArviZ.

Posterior Predictive Checks

Simulate data from the posterior predictive distribution and compare with observed data. If the model is good, simulated data should "look like" real data.

Applications in CS

  • A/B testing: Bayesian A/B testing provides posterior probability that variant B beats A. Can stop early when posterior is decisive.
  • Spam filtering: Naive Bayes classifier is fundamentally Bayesian. Prior probabilities of spam vs ham, updated with word frequencies.
  • Recommendation systems: Bayesian matrix factorization, Thompson sampling for exploration.
  • Bandits: Thompson sampling draws from posterior of each arm's reward, selects the best. Optimal exploration-exploitation.
  • Hyperparameter tuning: Bayesian optimization uses a Gaussian process posterior to select next hyperparameters to try.
  • Natural language processing: Latent Dirichlet Allocation (topic modeling) is Bayesian.
  • Robotics: Bayesian filtering (Kalman filter, particle filter) for state estimation.
  • Reinforcement learning: Bayesian RL maintains posterior over MDPs. Posterior sampling for exploration.
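Several of these applications reduce to the same conjugate machinery. A sketch of Thompson sampling for Bernoulli bandits, with illustrative (hidden) arm reward rates and Beta(1, 1) priors:

```python
import random

random.seed(7)
true_rates = [0.3, 0.5, 0.7]   # illustrative hidden reward probabilities per arm
alpha = [1.0, 1.0, 1.0]        # Beta(1, 1) prior on each arm's rate
beta = [1.0, 1.0, 1.0]

pulls = [0, 0, 0]
for _ in range(5_000):
    # Thompson sampling: draw a rate from each arm's posterior, pull the argmax.
    draws = [random.betavariate(alpha[i], beta[i]) for i in range(3)]
    arm = max(range(3), key=lambda i: draws[i])
    reward = 1 if random.random() < true_rates[arm] else 0
    alpha[arm] += reward           # conjugate Beta update, as in the table above
    beta[arm] += 1 - reward
    pulls[arm] += 1

print(pulls)   # the best arm (rate 0.7) should dominate the pull counts
```

Uncertain arms produce high-variance posterior draws and so still get pulled occasionally (exploration), while the posterior of the best arm concentrates and wins most draws (exploitation) — no tuning parameter required.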