
Probabilistic Models

Naive Bayes

Apply Bayes' theorem with the conditional independence assumption:

P(y|x_1,...,x_d) ∝ P(y) * prod_{j=1}^{d} P(x_j|y)

Prediction: y* = argmax_y P(y) * prod_j P(x_j|y)

In log space (numerically stable):

y* = argmax_y [ log P(y) + sum_j log P(x_j|y) ]

Variants by Feature Distribution

| Variant        | P(x_j \| y)                 | Use Case               |
|----------------|-----------------------------|------------------------|
| Gaussian NB    | N(mu_{jy}, sigma_{jy}^2)    | Continuous features    |
| Multinomial NB | Multinomial                 | Word counts, text data |
| Bernoulli NB   | Bernoulli                   | Binary features        |
| Complement NB  | Complement-class statistics | Imbalanced text        |

Laplace Smoothing

For multinomial NB, avoid zero probabilities:

P(x_j = v | y=c) = (count(x_j=v, y=c) + alpha) / (count(y=c) + alpha * |V|)

alpha = 1 is standard Laplace smoothing; alpha < 1 is Lidstone smoothing.

Despite the naive independence assumption, NB works well for text classification, spam filtering, and as a fast baseline.
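The prediction rule and Laplace smoothing above can be sketched together in NumPy (a minimal illustration; function names are mine, not from a particular library):

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Fit multinomial NB on count features X (n x |V|) with Laplace smoothing."""
    classes = np.unique(y)
    log_prior = np.log(np.array([np.mean(y == c) for c in classes]))
    # Smoothed word probabilities: (count(x_j, c) + alpha) / (count(c) + alpha * |V|)
    log_lik = np.vstack([
        np.log((X[y == c].sum(axis=0) + alpha) /
               (X[y == c].sum() + alpha * X.shape[1]))
        for c in classes
    ])
    return classes, log_prior, log_lik

def predict_multinomial_nb(X, classes, log_prior, log_lik):
    # argmax_y [ log P(y) + sum_j x_j * log P(word_j | y) ], computed in log space
    return classes[np.argmax(log_prior + X @ log_lik.T, axis=1)]
```

Note that for count features the per-word log-probabilities are weighted by the counts x_j, which is the multinomial form of the sum in the log-space rule above.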

Gaussian Mixture Models (GMM)

Model data as a mixture of K Gaussian components:

P(x) = sum_{k=1}^{K} pi_k * N(x | mu_k, Sigma_k)

Where pi_k are mixing coefficients (sum to 1), and each component has mean mu_k and covariance Sigma_k.

Expectation-Maximization (EM) Algorithm

Iteratively estimate parameters when latent variables (component assignments) are unobserved.

E-step: compute responsibilities (soft assignments):

r_{nk} = pi_k * N(x_n | mu_k, Sigma_k) / sum_j pi_j * N(x_n | mu_j, Sigma_j)

M-step: update parameters using responsibilities:

N_k = sum_n r_{nk}
mu_k = (1/N_k) * sum_n r_{nk} * x_n
Sigma_k = (1/N_k) * sum_n r_{nk} * (x_n - mu_k)(x_n - mu_k)^T
pi_k = N_k / N

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iter=100, tol=1e-6):
    n, d = X.shape
    # Initialize: random data points as means, identity covariances, uniform weights
    mu = X[np.random.choice(n, K, replace=False)].astype(float)
    sigma = [np.eye(d) for _ in range(K)]
    pi = np.ones(K) / K

    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: r_{nk} proportional to pi_k * N(x_n | mu_k, Sigma_k)
        r = np.zeros((n, K))
        for k in range(K):
            r[:, k] = pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
        ll = np.log(r.sum(axis=1)).sum()   # log-likelihood under current parameters
        r /= r.sum(axis=1, keepdims=True)

        # M-step: weighted maximum-likelihood updates
        for k in range(K):
            N_k = r[:, k].sum()
            mu[k] = (r[:, k] @ X) / N_k
            diff = X - mu[k]
            sigma[k] = (r[:, k][:, None] * diff).T @ diff / N_k
            pi[k] = N_k / n

        # Stop when the log-likelihood improvement falls below tol
        if ll - prev_ll < tol:
            break
        prev_ll = ll

    return mu, sigma, pi, r

Properties: EM monotonically increases the log-likelihood but only converges to a local optimum. Run multiple random restarts and keep the best solution.

Choosing K

  • BIC: -2 * log L + p * log(n), where p is the number of free parameters and n the number of data points
  • AIC: -2 * log L + 2p
  • Lower BIC/AIC is better. BIC penalizes complexity more heavily, favoring simpler models.
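These criteria are straightforward to compute once you have the fitted log-likelihood. A small sketch for a full-covariance GMM, where the free parameters are K-1 mixing weights, K*d means, and K*d(d+1)/2 covariance entries:

```python
import numpy as np

def gmm_num_params(K, d):
    """Free parameters of a full-covariance GMM in d dimensions."""
    return (K - 1) + K * d + K * d * (d + 1) // 2

def gmm_bic(log_likelihood, K, d, n):
    return -2 * log_likelihood + gmm_num_params(K, d) * np.log(n)

def gmm_aic(log_likelihood, K, d, n):
    return -2 * log_likelihood + 2 * gmm_num_params(K, d)
```

Fit models for a range of K, evaluate both criteria, and pick the K with the lowest value (they may disagree; BIC typically selects a smaller K).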

Latent Dirichlet Allocation (LDA)

Generative model for topic modeling. Each document is a mixture of topics; each topic is a distribution over words.

For each document d:
    theta_d ~ Dirichlet(alpha)              # topic proportions
    For each word position n:
        z_dn ~ Categorical(theta_d)         # topic assignment
        w_dn ~ Categorical(beta_{z_dn})     # word from topic

Where beta_k is the word distribution for topic k, drawn from Dirichlet(eta).

Inference via collapsed Gibbs sampling or variational inference.
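The generative process above can be simulated directly, which is a useful sanity check when implementing an LDA sampler (a minimal sketch; the function name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_lda_doc(n_words, alpha, beta):
    """Sample one document from the LDA generative process.
    alpha: (K,) Dirichlet parameter; beta: (K, V) topic-word distributions."""
    theta = rng.dirichlet(alpha)                          # topic proportions theta_d
    z = rng.choice(len(alpha), size=n_words, p=theta)     # topic assignment z_dn
    words = [rng.choice(beta.shape[1], p=beta[t]) for t in z]  # word w_dn ~ beta_{z_dn}
    return theta, z, words
```

With deterministic topics (e.g. beta = identity), each word reveals its topic exactly, which makes the mapping between z_dn and w_dn easy to verify.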

Bayesian Linear Regression

Place a prior on weights: w ~ N(0, sigma_w^2 * I).

Posterior (conjugate Gaussian):

P(w | X, y) = N(w | mu_n, Sigma_n)

Sigma_n = (sigma_noise^{-2} * X^T X + sigma_w^{-2} * I)^{-1}
mu_n = sigma_noise^{-2} * Sigma_n * X^T y

Predictive distribution for new x*:

P(y* | x*, X, y) = N(mu_n^T x*,  sigma_noise^2 + x*^T Sigma_n x*)

Key advantage: we get uncertainty estimates (predictive variance) for free.
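The posterior and predictive formulas above translate directly into a few lines of NumPy (a sketch; for larger problems you would solve linear systems instead of forming the inverse):

```python
import numpy as np

def blr_posterior(X, y, sigma_noise, sigma_w):
    """Posterior N(mu_n, Sigma_n) over weights for Bayesian linear regression."""
    d = X.shape[1]
    A = X.T @ X / sigma_noise**2 + np.eye(d) / sigma_w**2
    Sigma_n = np.linalg.inv(A)
    mu_n = Sigma_n @ X.T @ y / sigma_noise**2
    return mu_n, Sigma_n

def blr_predict(x_star, mu_n, Sigma_n, sigma_noise):
    """Predictive mean and variance at a new input x_star."""
    mean = mu_n @ x_star
    var = sigma_noise**2 + x_star @ Sigma_n @ x_star
    return mean, var
```

The predictive variance decomposes into irreducible noise (sigma_noise^2) plus uncertainty about the weights (x*^T Sigma_n x*), which shrinks as more data arrives.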

MAP vs Full Bayesian

  • MAP: point estimate w* = argmax_w P(w | data) -- with a Gaussian prior this is equivalent to Ridge regression
  • Full Bayesian: integrate over all w, giving predictive distributions. More computationally expensive but principled uncertainty quantification.

Gaussian Processes

A GP defines a distribution over functions: f(x) ~ GP(m(x), k(x, x')).

Any finite collection of function values is jointly Gaussian:

[f(x_1), ..., f(x_n)]^T ~ N(m, K)

where K_ij = k(x_i, x_j) is the kernel/covariance function.

GP Regression

Given training data (X, y) with y = f(x) + epsilon, epsilon ~ N(0, sigma_n^2):

Predictive distribution at test point x*:

f* | X, y, x* ~ N(mu*, sigma*^2)

mu* = k*^T (K + sigma_n^2 I)^{-1} y
sigma*^2 = k(x*, x*) - k*^T (K + sigma_n^2 I)^{-1} k*

where k* = [k(x*, x_1), ..., k(x*, x_n)]^T.

import numpy as np

def gp_predict(X_train, y_train, X_test, kernel, sigma_n):
    # Train-train (plus noise), train-test, and test-test kernel matrices
    K = kernel(X_train, X_train) + sigma_n**2 * np.eye(len(X_train))
    K_s = kernel(X_train, X_test)
    K_ss = kernel(X_test, X_test)

    # Solve (K + sigma_n^2 I)^{-1} y via Cholesky for numerical stability
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))

    mu = K_s.T @ alpha              # predictive mean
    v = np.linalg.solve(L, K_s)
    sigma = K_ss - v.T @ v          # predictive covariance

    return mu, sigma

Complexity: O(n^3) for training (Cholesky decomposition) and O(n^2) memory, limiting exact GPs to roughly 10K points. Sparse GP approximations (inducing points) scale to larger datasets.
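The gp_predict routine above takes the kernel as an argument; a common default is the squared-exponential (RBF) kernel. A minimal sketch, with lengthscale and variance as illustrative hyperparameters:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """k(a, b) = variance * exp(-||a - b||^2 / (2 * lengthscale^2)).
    A: (n, d), B: (m, d); returns the (n, m) kernel matrix."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)
```

The lengthscale controls how quickly correlation decays with distance; in practice it is tuned by maximizing the marginal likelihood.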

Variational Inference

Approximate intractable posterior P(z|x) with a tractable family q(z; phi).

Evidence Lower Bound (ELBO)

log P(x) = ELBO(q) + KL(q(z) || P(z|x))

ELBO(q) = E_q[log P(x, z)] - E_q[log q(z)]
         = E_q[log P(x|z)] - KL(q(z) || P(z))

Since KL >= 0, ELBO is a lower bound on log P(x). Maximizing ELBO w.r.t. q minimizes KL(q || P(z|x)).

Mean-Field Variational Inference

Assume q(z) factorizes: q(z) = prod_j q_j(z_j).

Optimal factor (coordinate ascent):

log q_j*(z_j) = E_{q_{-j}}[log P(x, z)] + const

Iterate over factors until convergence.

Stochastic Variational Inference (SVI)

For large datasets, use stochastic optimization on the ELBO:

  1. Subsample a mini-batch of data
  2. Compute noisy gradient of ELBO w.r.t. variational parameters
  3. Update parameters with SGD/Adam

Enables variational inference on millions of data points.

Variational Autoencoders (VAE)

Use neural networks for both q(z|x; phi) (encoder) and P(x|z; theta) (decoder). Optimize ELBO using the reparameterization trick:

z = mu + sigma * epsilon,   epsilon ~ N(0, I)

This makes the sampling step differentiable, enabling backpropagation through the encoder.
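In NumPy terms the trick, together with the closed-form KL term of the Gaussian VAE ELBO, looks like this (a sketch; in a real VAE these operate on framework tensors so gradients flow):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I); differentiable in mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), the regularizer in the Gaussian VAE ELBO."""
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
```

Parameterizing the encoder output as log-variance keeps sigma positive without constraints, and the KL term is zero exactly when q(z|x) matches the standard normal prior.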

Comparison of Inference Methods

| Method              | Scalability  | Exactness            | Flexibility |
|---------------------|--------------|----------------------|-------------|
| Exact inference     | Small models | Exact                | Limited     |
| MCMC (Gibbs, HMC)   | Moderate     | Asymptotically exact | High        |
| Variational (CAVI)  | Moderate     | Approximate          | Moderate    |
| SVI                 | Large-scale  | Approximate          | High        |
| Laplace approx.     | Fast         | Rough                | Limited     |

Practical Notes

  • Naive Bayes: first thing to try for text classification. Fast, few hyperparameters.
  • GMM: better than K-means when clusters are elliptical or overlap. Use BIC for model selection.
  • GPs: excellent for small datasets where uncertainty matters (Bayesian optimization, active learning).
  • Variational inference: when MCMC is too slow and you need approximate Bayesian inference at scale.