Bayesian Statistics

Bayesian statistics treats parameters as random variables with probability distributions, updated as data is observed. It provides a coherent framework for uncertainty quantification, sequential learning, and decision-making.

Bayesian Framework

Bayes' Theorem for Parameters

P(θ | data) = P(data | θ) · P(θ) / P(data)

| Term | Name | Role |
|---|---|---|
| P(θ) | Prior | Belief about θ before seeing data |
| P(data \| θ) | Likelihood | How probable the data is given θ |
| P(θ \| data) | Posterior | Updated belief after seeing data |
| P(data) | Evidence (marginal likelihood) | Normalizing constant |

Since P(data) is a constant with respect to θ:

Posterior ∝ Likelihood × Prior
P(θ | data) ∝ P(data | θ) · P(θ)

Bayesian vs Frequentist

| Aspect | Frequentist | Bayesian |
|---|---|---|
| Parameters | Fixed but unknown | Random variables |
| Probability | Long-run frequency | Degree of belief |
| Inference | Point estimates, CIs, p-values | Posterior distribution |
| Prior information | Not formally incorporated | Incorporated via prior |
| Interpretation of interval | 95% of such intervals contain θ | 95% probability θ is in interval |

Prior Distributions

Informative Priors

Incorporate genuine prior knowledge: expert opinion, previous studies, physical constraints.

Non-informative (Vague) Priors

Express minimal prior knowledge:

  • Uniform: P(θ) ∝ 1 (flat prior). Not always sensible (improper for unbounded θ).
  • Jeffreys prior: P(θ) ∝ √I(θ) where I(θ) is Fisher information. Invariant to reparameterization.
  • Reference priors: Maximize expected information gain from data.

Conjugate Priors

A prior is conjugate to a likelihood if the posterior belongs to the same family as the prior.

| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Bernoulli/Binomial | Beta(α, β) | Beta(α + successes, β + failures) |
| Poisson | Gamma(α, β) | Gamma(α + Σxᵢ, β + n) |
| Normal (known σ²) | Normal(μ₀, σ₀²) | Normal(weighted mean, combined precision) |
| Normal (known μ) | Inverse-Gamma(α, β) | Inverse-Gamma(α + n/2, β + SS/2) |
| Multinomial | Dirichlet(α₁,...,αₖ) | Dirichlet(α₁ + n₁,...,αₖ + nₖ) |
| Exponential | Gamma(α, β) | Gamma(α + n, β + Σxᵢ) |

Conjugate priors allow closed-form posterior computation — no numerical integration needed.

Example: Beta-Binomial

Prior: θ ~ Beta(α, β). Data: k successes in n trials.

Posterior: θ | data ~ Beta(α + k, β + n - k).

Posterior mean: (α + k) / (α + β + n). A weighted average between the prior mean α/(α+β) and the sample proportion k/n.

As n → ∞, the posterior concentrates around k/n (data overwhelms the prior).
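The update above can be sketched in a few lines of Python (the prior parameters and data are illustrative):

```python
# Beta-Binomial conjugate update: prior Beta(alpha, beta), data k successes in n trials.
alpha, beta = 2.0, 2.0        # illustrative prior
k, n = 7, 10                  # illustrative data: 7 successes in 10 trials

alpha_post = alpha + k        # posterior: Beta(alpha + k, beta + n - k)
beta_post = beta + (n - k)

prior_mean = alpha / (alpha + beta)
sample_prop = k / n
post_mean = alpha_post / (alpha_post + beta_post)

# The posterior mean is a weighted average of prior mean and sample proportion,
# with the data's weight n / (alpha + beta + n) growing as n grows:
weight_data = n / (alpha + beta + n)
assert abs(post_mean - (weight_data * sample_prop + (1 - weight_data) * prior_mean)) < 1e-12

print(alpha_post, beta_post, round(post_mean, 4))  # 9.0 5.0 0.6429
```

With only 10 trials the prior still pulls the estimate from 0.7 toward 0.5; as n grows, the data's weight approaches 1.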

Bayesian Inference

Point Estimates

  • Posterior mean: E[θ | data]. Minimizes squared error loss.
  • Posterior median: Minimizes absolute error loss.
  • MAP estimate: Mode of posterior. Equals MLE when prior is uniform.

Credible Intervals

A (1-α) credible interval [L, U] satisfies:

P(L ≤ θ ≤ U | data) = 1 - α

Highest Posterior Density (HPD) interval: The shortest interval with (1-α) probability. All points inside have higher posterior density than points outside.

Unlike frequentist confidence intervals, credible intervals have the intuitive interpretation: "there is a (1-α) probability that θ lies in this interval" (given the model and prior).
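In practice a credible interval is often read off posterior draws. A sketch for an equal-tailed 95% interval, using the illustrative Beta(9, 5) posterior from the Beta-Binomial example:

```python
import random

random.seed(0)
alpha_post, beta_post = 9.0, 5.0   # illustrative Beta posterior

# Draw from the posterior and sort; quantiles of the draws estimate quantiles of θ | data.
samples = sorted(random.betavariate(alpha_post, beta_post) for _ in range(100_000))

# Equal-tailed 95% credible interval: the 2.5% and 97.5% sample quantiles.
lo = samples[int(0.025 * len(samples))]
hi = samples[int(0.975 * len(samples))]
print(f"95% credible interval: [{lo:.3f}, {hi:.3f}]")
```

An HPD interval would instead scan for the shortest window of the sorted draws containing 95% of them; for a skewed posterior the two intervals differ.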

Posterior Predictive Distribution

Predict new observation x_new:

P(x_new | data) = ∫ P(x_new | θ) P(θ | data) dθ

This averages over parameter uncertainty: predictive intervals come out wider, and more honest, than those obtained by plugging a single point estimate into P(x_new | θ).
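The integral can be approximated by Monte Carlo: draw θ from the posterior, then average P(x_new | θ). For the Bernoulli case with the illustrative Beta(9, 5) posterior, the closed-form answer is simply the posterior mean, which gives us a check:

```python
import random

random.seed(1)
alpha_post, beta_post = 9.0, 5.0   # illustrative Beta posterior

# Monte Carlo approximation of P(x_new = 1 | data) = ∫ θ · P(θ | data) dθ:
# draw θ from the posterior, average the per-θ success probability (which is θ itself).
draws = 200_000
p_success = sum(random.betavariate(alpha_post, beta_post) for _ in range(draws)) / draws

exact = alpha_post / (alpha_post + beta_post)   # closed form: the posterior mean
print(round(p_success, 3), round(exact, 3))
```

The same recipe — sample parameters, then sample or average predictions — works for any model where the posterior can be sampled.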

Bayesian Hypothesis Testing

Bayes Factor

Compare two hypotheses:

BF₁₀ = P(data | H₁) / P(data | H₀)

| BF₁₀ | Evidence for H₁ |
|---|---|
| 1-3 | Barely worth mentioning |
| 3-10 | Substantial |
| 10-30 | Strong |
| 30-100 | Very strong |
| > 100 | Decisive |

Posterior odds = Prior odds × Bayes factor:

P(H₁|data)/P(H₀|data) = [P(H₁)/P(H₀)] × BF₁₀
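For conjugate models the marginal likelihoods have closed forms. A sketch for binomial data, testing H₀: θ = 0.5 against H₁: θ ~ Beta(1, 1) (the data values are illustrative):

```python
from math import exp, lgamma, log

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

k, n = 7, 10   # illustrative data: 7 successes in 10 trials

# H1: θ ~ Beta(1, 1) → marginal likelihood ∝ B(1 + k, 1 + n - k) / B(1, 1)
# H0: θ = 0.5        → likelihood ∝ 0.5^n
# (the binomial coefficient is common to both hypotheses and cancels in the ratio)
log_m1 = log_beta(1 + k, 1 + n - k) - log_beta(1, 1)
log_m0 = n * log(0.5)

bf10 = exp(log_m1 - log_m0)
print(round(bf10, 3))  # 0.776 — 7/10 successes slightly favor the point null
```

Note that a Bayes factor below 1 is evidence *for* H₀, something a p-value cannot express.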

Advantages over p-values

  • Can provide evidence for H₀ (not just fail to reject)
  • Natural stopping: collect data until BF is decisive
  • Not affected by multiple testing in the same way

MCMC Methods

When the posterior doesn't have a closed form, use Markov Chain Monte Carlo to draw samples from it.

Metropolis-Hastings Algorithm

  1. Start at θ₀.
  2. Propose θ* from proposal distribution q(θ* | θₜ).
  3. Compute acceptance ratio: α = min(1, P(θ*|data)q(θₜ|θ*) / P(θₜ|data)q(θ*|θₜ)).
  4. Accept θ* with probability α; otherwise keep θₜ.
  5. Repeat.

The chain converges to the posterior distribution.

Random walk MH: q(θ* | θₜ) = N(θₜ, σ²). Tune σ for ~23% acceptance rate (in high dimensions).
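A minimal random-walk Metropolis sampler, targeting a Beta(9, 5) posterior as an illustrative example (only the unnormalized log posterior is needed):

```python
import math
import random

random.seed(42)

def log_post(theta):
    """Unnormalized log posterior — here a Beta(9, 5) density, an illustrative target."""
    if not 0.0 < theta < 1.0:
        return float("-inf")
    return 8.0 * math.log(theta) + 4.0 * math.log(1.0 - theta)

theta = 0.5                  # θ₀
sigma = 0.25                 # proposal scale; tune for a healthy acceptance rate
samples, accepted = [], 0

for _ in range(50_000):
    proposal = random.gauss(theta, sigma)    # symmetric proposal, so the q terms cancel
    accept_prob = math.exp(min(0.0, log_post(proposal) - log_post(theta)))
    if random.random() < accept_prob:
        theta = proposal
        accepted += 1
    samples.append(theta)

kept = samples[5_000:]       # discard burn-in
post_mean = sum(kept) / len(kept)
print(round(post_mean, 2), round(accepted / 50_000, 2))   # mean ≈ 0.64 (true value 9/14)
```

Working on the log scale avoids underflow; because the Gaussian proposal is symmetric, the q(θₜ | θ*) / q(θ* | θₜ) factor in the acceptance ratio equals 1.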

Gibbs Sampling

Sample each parameter from its full conditional distribution (conditional on all other parameters and data).

For each iteration:
    θ₁ ~ P(θ₁ | θ₂, ..., θₖ, data)
    θ₂ ~ P(θ₂ | θ₁, θ₃, ..., θₖ, data)
    ⋮
    θₖ ~ P(θₖ | θ₁, ..., θₖ₋₁, data)

Works when full conditionals are known (often conjugate). No acceptance step needed.
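A toy Gibbs sampler for a standard bivariate normal with correlation ρ, where both full conditionals are known normals (the target is illustrative):

```python
import math
import random

random.seed(0)
rho = 0.8                         # illustrative target: bivariate normal, unit variances
cond_sd = math.sqrt(1 - rho ** 2)

x = y = 0.0
xs, ys = [], []
for _ in range(50_000):
    # Each full conditional of a bivariate normal is itself normal:
    x = random.gauss(rho * y, cond_sd)   # x | y ~ N(ρy, 1 - ρ²)
    y = random.gauss(rho * x, cond_sd)   # y | x ~ N(ρx, 1 - ρ²)
    xs.append(x)
    ys.append(y)

# Empirical correlation (means ≈ 0, variances ≈ 1 for this target):
n = len(xs)
emp_rho = sum(a * b for a, b in zip(xs, ys)) / n
print(round(emp_rho, 2))
```

Every draw is accepted; the cost is that highly correlated parameters (ρ near 1) make the chain move slowly, which is where HMC-style samplers shine.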

MCMC Diagnostics

  • Burn-in: Discard initial samples (before convergence).
  • Trace plots: Visual check for convergence (should look like "hairy caterpillar").
  • R̂ (Gelman-Rubin): Compare within-chain and between-chain variance. R̂ < 1.01 indicates convergence.
  • Effective sample size (ESS): Accounts for autocorrelation between draws. Aim for ESS in the hundreds per quantity of interest (a common rule of thumb is ESS > 400), not just a large raw draw count.
  • Autocorrelation plots: Should decay quickly. Thinning can help.
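The Gelman-Rubin statistic is simple to compute by hand. A sketch using four synthetic "chains" of i.i.d. draws from the same distribution — a converged scenario, so R̂ should be close to 1:

```python
import random
import statistics

random.seed(3)
# Four illustrative chains, each 2,000 draws from the same N(0, 1) target.
chains = [[random.gauss(0, 1) for _ in range(2_000)] for _ in range(4)]

n = len(chains[0])
means = [statistics.fmean(c) for c in chains]
W = statistics.fmean(statistics.variance(c) for c in chains)   # within-chain variance
B = n * statistics.variance(means)                             # between-chain variance

# Pooled posterior-variance estimate, then R-hat:
var_hat = (n - 1) / n * W + B / n
rhat = (var_hat / W) ** 0.5
print(round(rhat, 3))
```

If the chains had been started far apart and not yet mixed, B would dwarf W and R̂ would sit well above 1. (Modern practice splits each chain in half first — split-R̂ — to also catch within-chain drift.)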

Modern MCMC

  • Hamiltonian Monte Carlo (HMC): Uses gradient information for efficient exploration. Much better in high dimensions.
  • No-U-Turn Sampler (NUTS): Adaptive HMC that automatically tunes step size and trajectory length. Default in Stan.
  • Stan: Probabilistic programming language for Bayesian inference. Uses NUTS.
  • PyMC: Python library for Bayesian modeling.

Hierarchical Models

Model data at multiple levels:

Data: yᵢⱼ ~ N(θⱼ, σ²)          (observations within groups)
Group: θⱼ ~ N(μ, τ²)            (group-level parameters)
Hyperprior: μ ~ N(0, 100), τ ~ HalfCauchy(0, 5)

Partial pooling: Group estimates are shrunk toward the overall mean. Groups with less data are shrunk more. This is more principled than either complete pooling (ignore groups) or no pooling (treat groups independently).
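When σ² and τ² are treated as known, the shrinkage has a closed form: the posterior mean of θⱼ is a precision-weighted average of the group mean and μ. A sketch with illustrative values:

```python
# Partial-pooling shrinkage with known variances (all values illustrative).
mu, tau2, sigma2 = 0.0, 1.0, 4.0   # hyper-mean, group-level and observation variances

def shrunk_mean(group_mean, n_j):
    """Posterior mean of θⱼ given the group mean of n_j observations."""
    prec_data = n_j / sigma2       # precision of ȳⱼ as an estimate of θⱼ
    prec_prior = 1.0 / tau2        # precision of θⱼ around μ
    return (prec_data * group_mean + prec_prior * mu) / (prec_data + prec_prior)

# A small group is shrunk toward μ far more than a large group with the same mean:
print(round(shrunk_mean(2.0, 2), 3))    # n=2  → 0.667, pulled strongly toward μ = 0
print(round(shrunk_mean(2.0, 40), 3))   # n=40 → 1.818, barely shrunk
```

This is exactly the "groups with less data are shrunk more" behavior: the data precision nⱼ/σ² scales with group size, so small groups borrow strength from the population.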

Model Comparison

WAIC (Widely Applicable Information Criterion)

Bayesian analog of AIC. Uses the full posterior, not just point estimates.

LOO-CV (Leave-One-Out Cross-Validation)

Approximate LOO using PSIS (Pareto Smoothed Importance Sampling). Implemented in the loo R package and ArviZ.

Posterior Predictive Checks

Simulate data from the posterior predictive distribution and compare with observed data. If the model is good, simulated data should "look like" real data.

Applications in CS

  • A/B testing: Bayesian A/B testing provides posterior probability that variant B beats A. Can stop early when posterior is decisive.
  • Spam filtering: Naive Bayes classifier is fundamentally Bayesian. Prior probabilities of spam vs ham, updated with word frequencies.
  • Recommendation systems: Bayesian matrix factorization, Thompson sampling for exploration.
  • Bandits: Thompson sampling draws from posterior of each arm's reward, selects the best. Optimal exploration-exploitation.
  • Hyperparameter tuning: Bayesian optimization uses a Gaussian process posterior to select next hyperparameters to try.
  • Natural language processing: Latent Dirichlet Allocation (topic modeling) is Bayesian.
  • Robotics: Bayesian filtering (Kalman filter, particle filter) for state estimation.
  • Reinforcement learning: Bayesian RL maintains posterior over MDPs. Posterior sampling for exploration.
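Several of these applications reduce to the same conjugate machinery. A sketch of Thompson sampling for Bernoulli bandits, with illustrative (hidden) arm reward rates and Beta(1, 1) priors:

```python
import random

random.seed(7)
true_rates = [0.3, 0.5, 0.7]   # illustrative hidden reward probabilities per arm
alpha = [1.0, 1.0, 1.0]        # Beta(1, 1) prior on each arm's rate
beta = [1.0, 1.0, 1.0]

pulls = [0, 0, 0]
for _ in range(5_000):
    # Thompson sampling: draw a rate from each arm's posterior, pull the argmax.
    draws = [random.betavariate(alpha[i], beta[i]) for i in range(3)]
    arm = max(range(3), key=lambda i: draws[i])
    reward = 1 if random.random() < true_rates[arm] else 0
    alpha[arm] += reward           # conjugate Beta update, as in the table above
    beta[arm] += 1 - reward
    pulls[arm] += 1

print(pulls)   # the best arm (rate 0.7) should dominate the pull counts
```

Uncertain arms produce high-variance posterior draws and so still get pulled occasionally (exploration), while the posterior of the best arm concentrates and wins most draws (exploitation) — no tuning parameter required.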