Probabilistic Models
Naive Bayes
Apply Bayes' theorem with the conditional independence assumption:
P(y|x_1,...,x_d) ∝ P(y) * prod_{j=1}^{d} P(x_j|y)
Prediction: y* = argmax_y P(y) * prod_j P(x_j|y)
In log space (numerically stable):
y* = argmax_y [ log P(y) + sum_j log P(x_j|y) ]
Variants by Feature Distribution
| Variant        | P(x_j \| y)              | Use Case               |
|----------------|--------------------------|------------------------|
| Gaussian NB    | N(mu_{jy}, sigma_{jy}^2) | Continuous features    |
| Multinomial NB | Multinomial              | Word counts, text data |
| Bernoulli NB   | Bernoulli                | Binary features        |
| Complement NB  | Complement class         | Imbalanced text        |
Laplace Smoothing
For multinomial NB, avoid zero probabilities:
P(x_j = v | y=c) = (count(x_j=v, y=c) + alpha) / (count(y=c) + alpha * |V|)
alpha = 1 gives standard Laplace (add-one) smoothing; general alpha (typically alpha < 1) gives Lidstone smoothing.
Despite the naive independence assumption, NB works well for text classification, spam filtering, and as a fast baseline.
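The smoothing formula and log-space prediction above can be sketched as a minimal multinomial NB on hypothetical toy word counts (the data and function names here are illustrative, not from any particular library):

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Fit multinomial NB on a count matrix X (n_docs, |V|) with Laplace smoothing."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # Smoothed word probabilities: (count + alpha) / (class total + alpha * |V|)
    counts = np.array([X[y == c].sum(axis=0) for c in classes])
    log_lik = np.log((counts + alpha) /
                     (counts.sum(axis=1, keepdims=True) + alpha * X.shape[1]))
    return classes, log_prior, log_lik

def predict_nb(X, classes, log_prior, log_lik):
    # Log-space scoring: argmax_y [ log P(y) + sum_j x_j * log P(word_j | y) ]
    return classes[np.argmax(X @ log_lik.T + log_prior, axis=1)]

# Toy counts: class 0 favors word 0, class 1 favors word 1
X = np.array([[5, 1, 0], [4, 0, 1], [0, 5, 1], [1, 4, 0]])
y = np.array([0, 0, 1, 1])
pred = predict_nb(X, *fit_multinomial_nb(X, y))
```

The multinomial coefficient is omitted from the score since it is constant per document and does not affect the argmax.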
Gaussian Mixture Models (GMM)
Model data as a mixture of K Gaussian components:
P(x) = sum_{k=1}^{K} pi_k * N(x | mu_k, Sigma_k)
Where pi_k are mixing coefficients (sum to 1), and each component has mean mu_k and covariance Sigma_k.
Expectation-Maximization (EM) Algorithm
Iteratively estimate parameters when latent variables (component assignments) are unobserved.
E-step: compute responsibilities (soft assignments):
r_{nk} = pi_k * N(x_n | mu_k, Sigma_k) / sum_j pi_j * N(x_n | mu_j, Sigma_j)
M-step: update parameters using responsibilities:
N_k = sum_n r_{nk}
mu_k = (1/N_k) * sum_n r_{nk} * x_n
Sigma_k = (1/N_k) * sum_n r_{nk} * (x_n - mu_k)(x_n - mu_k)^T
pi_k = N_k / N
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iter=100, tol=1e-6):
    n, d = X.shape
    # Initialize: random data points as means, identity covariances, uniform weights
    mu = X[np.random.choice(n, K, replace=False)].astype(float)
    sigma = [np.eye(d) for _ in range(K)]
    pi = np.ones(K) / K
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: r_{nk} proportional to pi_k * N(x_n | mu_k, Sigma_k)
        r = np.zeros((n, K))
        for k in range(K):
            r[:, k] = pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
        ll = np.log(r.sum(axis=1)).sum()  # log-likelihood from unnormalized responsibilities
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted means, covariances, and mixing weights
        for k in range(K):
            N_k = r[:, k].sum()
            mu[k] = (r[:, k] @ X) / N_k
            diff = X - mu[k]
            sigma[k] = (r[:, k][:, None] * diff).T @ diff / N_k
            sigma[k] += 1e-6 * np.eye(d)  # small ridge for numerical stability
            pi[k] = N_k / n
        # Check log-likelihood convergence
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mu, sigma, pi, r
Properties: EM monotonically increases the log-likelihood but may converge to a local optimum. Run multiple random restarts and keep the best solution.
Choosing K
- BIC: -2 * log L + p * log(n) (penalizes complexity)
- AIC: -2 * log L + 2p
- Lower BIC/AIC is better. BIC gives stronger penalty, favoring simpler models.
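The criteria above can be computed directly once the fitted log-likelihood is known; for a full-covariance GMM the parameter count is K*d means, K*d(d+1)/2 covariance entries, and K-1 free mixing weights (the helper name and example values here are illustrative):

```python
import numpy as np

def gmm_bic_aic(log_likelihood, n, d, K):
    """Information criteria for a full-covariance GMM with n points in d dimensions."""
    # Free parameters: means + covariances + mixing weights
    p = K * d + K * d * (d + 1) // 2 + (K - 1)
    bic = -2 * log_likelihood + p * np.log(n)
    aic = -2 * log_likelihood + 2 * p
    return bic, aic

# Example: K=3 components, d=2 dimensions, n=500 points, log L = -100
bic, aic = gmm_bic_aic(-100.0, n=500, d=2, K=3)
```

In practice one fits the GMM for a range of K values and picks the K with the lowest BIC (or AIC).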
Latent Dirichlet Allocation (LDA)
Generative model for topic modeling. Each document is a mixture of topics; each topic is a distribution over words.
For each document d:
theta_d ~ Dirichlet(alpha) # topic proportions
For each word position n:
z_dn ~ Categorical(theta_d) # topic assignment
w_dn ~ Categorical(beta_{z_dn}) # word from topic
Where beta_k is the word distribution for topic k, drawn from Dirichlet(eta).
Inference via collapsed Gibbs sampling or variational inference.
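The generative process above can be simulated directly; this sketch samples one document from the LDA model with toy sizes (K, V, doc_len, and the hyperparameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, doc_len = 3, 20, 50          # topics, vocabulary size, words per document
alpha, eta = 0.1, 0.01             # Dirichlet hyperparameters

# Topic-word distributions beta_k ~ Dirichlet(eta), one row per topic
beta = rng.dirichlet(np.full(V, eta), size=K)

def sample_doc():
    theta = rng.dirichlet(np.full(K, alpha))   # topic proportions theta_d
    z = rng.choice(K, size=doc_len, p=theta)   # topic assignment per word position
    return np.array([rng.choice(V, p=beta[k]) for k in z])  # word ids

doc = sample_doc()
```

Inference reverses this process: given only the word ids, recover theta_d and beta_k.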
Bayesian Linear Regression
Place a prior on weights: w ~ N(0, sigma_w^2 * I).
Posterior (conjugate Gaussian):
P(w | X, y) = N(w | mu_n, Sigma_n)
Sigma_n = (sigma_noise^{-2} * X^T X + sigma_w^{-2} * I)^{-1}
mu_n = sigma_noise^{-2} * Sigma_n * X^T y
Predictive distribution for new x*:
P(y* | x*, X, y) = N(mu_n^T x*, sigma_noise^2 + x*^T Sigma_n x*)
Key advantage: we get uncertainty estimates (predictive variance) for free.
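The posterior and predictive formulas above translate directly into a few lines of linear algebra; a minimal sketch on hypothetical 1-D data with a known true slope of 2:

```python
import numpy as np

def blr_posterior(X, y, sigma_noise, sigma_w):
    """Posterior N(mu_n, Sigma_n) over weights, per the conjugate formulas above."""
    d = X.shape[1]
    Sigma_n = np.linalg.inv(X.T @ X / sigma_noise**2 + np.eye(d) / sigma_w**2)
    mu_n = Sigma_n @ X.T @ y / sigma_noise**2
    return mu_n, Sigma_n

def blr_predict(x_star, mu_n, Sigma_n, sigma_noise):
    """Predictive mean and variance for a new input x_star."""
    mean = mu_n @ x_star
    var = sigma_noise**2 + x_star @ Sigma_n @ x_star
    return mean, var

rng = np.random.default_rng(1)
X = np.linspace(-1, 1, 100).reshape(-1, 1)
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)
mu_n, Sigma_n = blr_posterior(X, y, sigma_noise=0.1, sigma_w=10.0)
mean, var = blr_predict(np.array([0.5]), mu_n, Sigma_n, sigma_noise=0.1)
```

The predictive variance is always at least sigma_noise^2, and grows with x*^T Sigma_n x* for inputs far from the training data.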
MAP vs Full Bayesian
- MAP: point estimate w* = argmax P(w|data) -- with a Gaussian prior and Gaussian likelihood this is exactly ridge regression
- Full Bayesian: integrate over all w, giving predictive distributions. More computationally expensive but principled uncertainty quantification.
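The MAP/ridge equivalence can be checked numerically: the ridge solution with penalty lambda = sigma_noise^2 / sigma_w^2 coincides with the posterior mean mu_n (a sketch on hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

sigma_noise, sigma_w = 0.1, 1.0
lam = sigma_noise**2 / sigma_w**2            # equivalent ridge penalty
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Posterior mean mu_n from the conjugate formulas above
Sigma_n = np.linalg.inv(X.T @ X / sigma_noise**2 + np.eye(3) / sigma_w**2)
w_map = Sigma_n @ X.T @ y / sigma_noise**2
```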
Gaussian Processes
A GP defines a distribution over functions: f(x) ~ GP(m(x), k(x, x')).
Any finite collection of function values is jointly Gaussian:
[f(x_1), ..., f(x_n)]^T ~ N(m, K)
where K_ij = k(x_i, x_j) is the kernel/covariance function.
GP Regression
Given training data (X, y) with y = f(x) + epsilon, epsilon ~ N(0, sigma_n^2):
Predictive distribution at test point x*:
f* | X, y, x* ~ N(mu*, sigma*^2)
mu* = k*^T (K + sigma_n^2 I)^{-1} y
sigma*^2 = k(x*, x*) - k*^T (K + sigma_n^2 I)^{-1} k*
where k* = [k(x*, x_1), ..., k(x*, x_n)]^T.
import numpy as np

def gp_predict(X_train, y_train, X_test, kernel, sigma_n):
    # Noisy kernel matrix K + sigma_n^2 I and cross-covariances
    K = kernel(X_train, X_train) + sigma_n**2 * np.eye(len(X_train))
    K_s = kernel(X_train, X_test)
    K_ss = kernel(X_test, X_test)
    # Cholesky factorization K = L L^T for stable linear solves
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = K_s.T @ alpha           # predictive mean: k*^T (K + sigma_n^2 I)^{-1} y
    v = np.linalg.solve(L, K_s)
    sigma = K_ss - v.T @ v       # predictive covariance
    return mu, sigma
Complexity: O(n^3) training cost (Cholesky decomposition) limits exact GPs to roughly 10K points. Sparse GP approximations (inducing points) scale to larger datasets.
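gp_predict above takes the kernel as an argument; a common choice is the squared-exponential (RBF) kernel, sketched here (the function name, lengthscale ell, and variance var are illustrative defaults, not from any particular library):

```python
import numpy as np

def rbf_kernel(A, B, ell=1.0, var=1.0):
    """Squared-exponential kernel: k(a, b) = var * exp(-||a - b||^2 / (2 * ell^2)).
    A and B are (n, d) and (m, d) arrays; returns the (n, m) Gram matrix."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * sq / ell**2)

X = np.linspace(0, 1, 5).reshape(-1, 1)
K = rbf_kernel(X, X)
# K is symmetric positive semi-definite with unit diagonal (var=1),
# suitable to pass as `kernel` into gp_predict above.
```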
Variational Inference
Approximate intractable posterior P(z|x) with a tractable family q(z; phi).
Evidence Lower Bound (ELBO)
log P(x) = ELBO(q) + KL(q(z) || P(z|x))
ELBO(q) = E_q[log P(x, z)] - E_q[log q(z)]
= E_q[log P(x|z)] - KL(q(z) || P(z))
Since KL >= 0, ELBO is a lower bound on log P(x). Maximizing ELBO w.r.t. q minimizes KL(q || P(z|x)).
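The bound can be checked numerically on a model where the evidence is known in closed form. For z ~ N(0,1) and x|z ~ N(z,1), the marginal is P(x) = N(0,2); a Monte Carlo ELBO estimate under a deliberately suboptimal q (all values here are illustrative) stays below log P(x):

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1.0                                    # one observed data point
# Model: z ~ N(0,1), x | z ~ N(z,1)  =>  P(x) = N(0, 2) in closed form
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / (2 * 2.0)

def log_normal(v, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (v - mean) ** 2 / (2 * var)

# Variational family q(z) = N(m, s^2); here a suboptimal q equal to the prior
m, s = 0.0, 1.0
z = rng.normal(m, s, size=100_000)         # samples from q
# Monte Carlo ELBO: E_q[log P(x, z)] - E_q[log q(z)]
elbo = (log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
        - log_normal(z, m, s**2)).mean()
# The gap log P(x) - ELBO equals KL(q(z) || P(z|x))
```

Setting q to the true posterior N(x/2, 1/2) would close the gap exactly.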
Mean-Field Variational Inference
Assume q(z) factorizes: q(z) = prod_j q_j(z_j).
Optimal factor (coordinate ascent):
log q_j*(z_j) = E_{q_{-j}}[log P(x, z)] + const
Iterate over factors until convergence.
Stochastic Variational Inference (SVI)
For large datasets, use stochastic optimization on the ELBO:
- Subsample a mini-batch of data
- Compute noisy gradient of ELBO w.r.t. variational parameters
- Update parameters with SGD/Adam
Enables variational inference on millions of data points.
Variational Autoencoders (VAE)
Use neural networks for both q(z|x; phi) (encoder) and P(x|z; theta) (decoder). Optimize ELBO using the reparameterization trick:
z = mu + sigma * epsilon, epsilon ~ N(0, I)
This makes the sampling step differentiable, enabling backpropagation through the encoder.
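A numpy-only sketch of the reparameterization (the mu and sigma values stand in for encoder outputs): the sample is an affine function of the parameters, so dz/dmu = 1 and dz/dsigma = epsilon, and gradients flow through it.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5                 # encoder outputs (illustrative values)
eps = rng.normal(size=100_000)       # noise drawn independently of the parameters
z = mu + sigma * eps                 # reparameterized sample: differentiable in mu, sigma
# z is distributed as N(mu, sigma^2), but the randomness lives in eps,
# so backprop treats z as a deterministic function of (mu, sigma).
```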
Comparison of Inference Methods
| Method              | Scalability  | Exactness   | Flexibility |
|---------------------|--------------|-------------|-------------|
| Exact inference     | Small models | Exact       | Limited     |
| MCMC (Gibbs, HMC)   | Moderate     | Asymptotic  | High        |
| Variational (CAVI)  | Moderate     | Approximate | Moderate    |
| SVI                 | Large-scale  | Approximate | High        |
| Laplace approx      | Fast         | Rough       | Limited     |
Practical Notes
- Naive Bayes: first thing to try for text classification. Fast, few hyperparameters.
- GMM: better than K-means when clusters are elliptical or overlap. Use BIC for model selection.
- GPs: excellent for small datasets where uncertainty matters (Bayesian optimization, active learning).
- Variational inference: when MCMC is too slow and you need approximate Bayesian inference at scale.