Probabilistic Models
Naive Bayes
Apply Bayes' theorem with the conditional independence assumption:
P(y|x_1,...,x_d) ∝ P(y) * prod_{j=1}^{d} P(x_j|y)
Prediction: y* = argmax_y P(y) * prod_j P(x_j|y)
In log space (numerically stable):
y* = argmax_y [ log P(y) + sum_j log P(x_j|y) ]
Variants by Feature Distribution
| Variant        | P(x_j \| y)              | Use Case               |
|----------------|--------------------------|------------------------|
| Gaussian NB    | N(mu_{jy}, sigma_{jy}^2) | Continuous features    |
| Multinomial NB | Multinomial              | Word counts, text data |
| Bernoulli NB   | Bernoulli                | Binary features        |
| Complement NB  | Complement class         | Imbalanced text        |
Laplace Smoothing
For multinomial NB, avoid zero probabilities:
P(x_j = v | y=c) = (count(x_j=v, y=c) + alpha) / (count(y=c) + alpha * |V|)
alpha = 1 gives standard Laplace (add-one) smoothing; general alpha (typically alpha < 1) gives Lidstone smoothing.
Despite the naive independence assumption, NB works well for text classification, spam filtering, and as a fast baseline.
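The smoothing formula and log-space prediction above can be sketched as a minimal multinomial NB on hypothetical toy word counts (the data and function names here are illustrative, not from any particular library):

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Fit multinomial NB on a count matrix X (n_docs, |V|) with Laplace smoothing."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # Smoothed word probabilities: (count + alpha) / (class total + alpha * |V|)
    counts = np.array([X[y == c].sum(axis=0) for c in classes])
    log_lik = np.log((counts + alpha) /
                     (counts.sum(axis=1, keepdims=True) + alpha * X.shape[1]))
    return classes, log_prior, log_lik

def predict_nb(X, classes, log_prior, log_lik):
    # Log-space scoring: argmax_y [ log P(y) + sum_j x_j * log P(word_j | y) ]
    return classes[np.argmax(X @ log_lik.T + log_prior, axis=1)]

# Toy counts: class 0 favors word 0, class 1 favors word 1
X = np.array([[5, 1, 0], [4, 0, 1], [0, 5, 1], [1, 4, 0]])
y = np.array([0, 0, 1, 1])
pred = predict_nb(X, *fit_multinomial_nb(X, y))
```

The multinomial coefficient is omitted from the score since it is constant per document and does not affect the argmax.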
Gaussian Mixture Models (GMM)
Model data as a mixture of K Gaussian components:
P(x) = sum_{k=1}^{K} pi_k * N(x | mu_k, Sigma_k)
Where pi_k are mixing coefficients (sum to 1), and each component has mean mu_k and covariance Sigma_k.
Expectation-Maximization (EM) Algorithm
Iteratively estimate parameters when latent variables (component assignments) are unobserved.
E-step: compute responsibilities (soft assignments):
r_{nk} = pi_k * N(x_n | mu_k, Sigma_k) / sum_j pi_j * N(x_n | mu_j, Sigma_j)
M-step: update parameters using responsibilities:
N_k = sum_n r_{nk}
mu_k = (1/N_k) * sum_n r_{nk} * x_n
Sigma_k = (1/N_k) * sum_n r_{nk} * (x_n - mu_k)(x_n - mu_k)^T
pi_k = N_k / N
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iter=100, tol=1e-6):
    n, d = X.shape
    # Initialize: random data points as means, identity covariances, uniform weights
    mu = X[np.random.choice(n, K, replace=False)].astype(float)
    sigma = [np.eye(d) for _ in range(K)]
    pi = np.ones(K) / K
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: r_{nk} proportional to pi_k * N(x_n | mu_k, Sigma_k)
        r = np.zeros((n, K))
        for k in range(K):
            r[:, k] = pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
        ll = np.log(r.sum(axis=1)).sum()  # log-likelihood from unnormalized responsibilities
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted means, covariances, and mixing weights
        for k in range(K):
            N_k = r[:, k].sum()
            mu[k] = (r[:, k] @ X) / N_k
            diff = X - mu[k]
            sigma[k] = (r[:, k][:, None] * diff).T @ diff / N_k
            sigma[k] += 1e-6 * np.eye(d)  # small ridge for numerical stability
            pi[k] = N_k / n
        # Check log-likelihood convergence
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mu, sigma, pi, r
Properties: EM monotonically increases the log-likelihood but may converge to a local optimum. Run multiple random restarts and keep the best solution.
Choosing K
- BIC: -2 * log L + p * log(n) (penalizes complexity)
- AIC: -2 * log L + 2p
- Lower BIC/AIC is better. BIC gives stronger penalty, favoring simpler models.
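The criteria above can be computed directly once the fitted log-likelihood is known; for a full-covariance GMM the parameter count is K*d means, K*d(d+1)/2 covariance entries, and K-1 free mixing weights (the helper name and example values here are illustrative):

```python
import numpy as np

def gmm_bic_aic(log_likelihood, n, d, K):
    """Information criteria for a full-covariance GMM with n points in d dimensions."""
    # Free parameters: means + covariances + mixing weights
    p = K * d + K * d * (d + 1) // 2 + (K - 1)
    bic = -2 * log_likelihood + p * np.log(n)
    aic = -2 * log_likelihood + 2 * p
    return bic, aic

# Example: K=3 components, d=2 dimensions, n=500 points, log L = -100
bic, aic = gmm_bic_aic(-100.0, n=500, d=2, K=3)
```

In practice one fits the GMM for a range of K values and picks the K with the lowest BIC (or AIC).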
Latent Dirichlet Allocation (LDA)
Generative model for topic modeling. Each document is a mixture of topics; each topic is a distribution over words.
For each document d:
theta_d ~ Dirichlet(alpha) # topic proportions
For each word position n:
z_dn ~ Categorical(theta_d) # topic assignment
w_dn ~ Categorical(beta_{z_dn}) # word from topic
Where beta_k is the word distribution for topic k, drawn from Dirichlet(eta).
Inference via collapsed Gibbs sampling or variational inference.
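The generative process above can be simulated directly; this sketch samples one document from the LDA model with toy sizes (K, V, doc_len, and the hyperparameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, doc_len = 3, 20, 50          # topics, vocabulary size, words per document
alpha, eta = 0.1, 0.01             # Dirichlet hyperparameters

# Topic-word distributions beta_k ~ Dirichlet(eta), one row per topic
beta = rng.dirichlet(np.full(V, eta), size=K)

def sample_doc():
    theta = rng.dirichlet(np.full(K, alpha))   # topic proportions theta_d
    z = rng.choice(K, size=doc_len, p=theta)   # topic assignment per word position
    return np.array([rng.choice(V, p=beta[k]) for k in z])  # word ids

doc = sample_doc()
```

Inference reverses this process: given only the word ids, recover theta_d and beta_k.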
Bayesian Linear Regression
Place a prior on weights: w ~ N(0, sigma_w^2 * I).
Posterior (conjugate Gaussian):
P(w | X, y) = N(w | mu_n, Sigma_n)
Sigma_n = (sigma_noise^{-2} * X^T X + sigma_w^{-2} * I)^{-1}
mu_n = sigma_noise^{-2} * Sigma_n * X^T y
Predictive distribution for new x*:
P(y* | x*, X, y) = N(mu_n^T x*, sigma_noise^2 + x*^T Sigma_n x*)
Key advantage: we get uncertainty estimates (predictive variance) for free.
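The posterior and predictive formulas above translate directly into a few lines of linear algebra; a minimal sketch on hypothetical 1-D data with a known true slope of 2:

```python
import numpy as np

def blr_posterior(X, y, sigma_noise, sigma_w):
    """Posterior N(mu_n, Sigma_n) over weights, per the conjugate formulas above."""
    d = X.shape[1]
    Sigma_n = np.linalg.inv(X.T @ X / sigma_noise**2 + np.eye(d) / sigma_w**2)
    mu_n = Sigma_n @ X.T @ y / sigma_noise**2
    return mu_n, Sigma_n

def blr_predict(x_star, mu_n, Sigma_n, sigma_noise):
    """Predictive mean and variance for a new input x_star."""
    mean = mu_n @ x_star
    var = sigma_noise**2 + x_star @ Sigma_n @ x_star
    return mean, var

rng = np.random.default_rng(1)
X = np.linspace(-1, 1, 100).reshape(-1, 1)
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)
mu_n, Sigma_n = blr_posterior(X, y, sigma_noise=0.1, sigma_w=10.0)
mean, var = blr_predict(np.array([0.5]), mu_n, Sigma_n, sigma_noise=0.1)
```

The predictive variance is always at least sigma_noise^2, and grows with x*^T Sigma_n x* for inputs far from the training data.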
MAP vs Full Bayesian
- MAP: point estimate w* = argmax P(w|data) -- with a Gaussian prior and Gaussian likelihood this is exactly ridge regression
- Full Bayesian: integrate over all w, giving predictive distributions. More computationally expensive but principled uncertainty quantification.
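The MAP/ridge equivalence can be checked numerically: the ridge solution with penalty lambda = sigma_noise^2 / sigma_w^2 coincides with the posterior mean mu_n (a sketch on hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

sigma_noise, sigma_w = 0.1, 1.0
lam = sigma_noise**2 / sigma_w**2            # equivalent ridge penalty
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Posterior mean mu_n from the conjugate formulas above
Sigma_n = np.linalg.inv(X.T @ X / sigma_noise**2 + np.eye(3) / sigma_w**2)
w_map = Sigma_n @ X.T @ y / sigma_noise**2
```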
Gaussian Processes
A GP defines a distribution over functions: f(x) ~ GP(m(x), k(x, x')).
Any finite collection of function values is jointly Gaussian:
[f(x_1), ..., f(x_n)]^T ~ N(m, K)
where K_ij = k(x_i, x_j) is the kernel/covariance function.
GP Regression
Given training data (X, y) with y = f(x) + epsilon, epsilon ~ N(0, sigma_n^2):
Predictive distribution at test point x*:
f* | X, y, x* ~ N(mu*, sigma*^2)
mu* = k*^T (K + sigma_n^2 I)^{-1} y
sigma*^2 = k(x*, x*) - k*^T (K + sigma_n^2 I)^{-1} k*
where k* = [k(x*, x_1), ..., k(x*, x_n)]^T.
import numpy as np

def gp_predict(X_train, y_train, X_test, kernel, sigma_n):
    # Noisy kernel matrix K + sigma_n^2 I and cross-covariances
    K = kernel(X_train, X_train) + sigma_n**2 * np.eye(len(X_train))
    K_s = kernel(X_train, X_test)
    K_ss = kernel(X_test, X_test)
    # Cholesky factorization K = L L^T for stable linear solves
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = K_s.T @ alpha           # predictive mean: k*^T (K + sigma_n^2 I)^{-1} y
    v = np.linalg.solve(L, K_s)
    sigma = K_ss - v.T @ v       # predictive covariance
    return mu, sigma
Complexity: O(n^3) training cost (Cholesky decomposition) limits exact GPs to roughly 10K points. Sparse GP approximations (inducing points) scale to larger datasets.
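gp_predict above takes the kernel as an argument; a common choice is the squared-exponential (RBF) kernel, sketched here (the function name, lengthscale ell, and variance var are illustrative defaults, not from any particular library):

```python
import numpy as np

def rbf_kernel(A, B, ell=1.0, var=1.0):
    """Squared-exponential kernel: k(a, b) = var * exp(-||a - b||^2 / (2 * ell^2)).
    A and B are (n, d) and (m, d) arrays; returns the (n, m) Gram matrix."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * sq / ell**2)

X = np.linspace(0, 1, 5).reshape(-1, 1)
K = rbf_kernel(X, X)
# K is symmetric positive semi-definite with unit diagonal (var=1),
# suitable to pass as `kernel` into gp_predict above.
```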
Variational Inference
Approximate intractable posterior P(z|x) with a tractable family q(z; phi).
Evidence Lower Bound (ELBO)
log P(x) = ELBO(q) + KL(q(z) || P(z|x))
ELBO(q) = E_q[log P(x, z)] - E_q[log q(z)]
= E_q[log P(x|z)] - KL(q(z) || P(z))
Since KL >= 0, ELBO is a lower bound on log P(x). Maximizing ELBO w.r.t. q minimizes KL(q || P(z|x)).
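The bound can be checked numerically on a model where the evidence is known in closed form. For z ~ N(0,1) and x|z ~ N(z,1), the marginal is P(x) = N(0,2); a Monte Carlo ELBO estimate under a deliberately suboptimal q (all values here are illustrative) stays below log P(x):

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1.0                                    # one observed data point
# Model: z ~ N(0,1), x | z ~ N(z,1)  =>  P(x) = N(0, 2) in closed form
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / (2 * 2.0)

def log_normal(v, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (v - mean) ** 2 / (2 * var)

# Variational family q(z) = N(m, s^2); here a suboptimal q equal to the prior
m, s = 0.0, 1.0
z = rng.normal(m, s, size=100_000)         # samples from q
# Monte Carlo ELBO: E_q[log P(x, z)] - E_q[log q(z)]
elbo = (log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
        - log_normal(z, m, s**2)).mean()
# The gap log P(x) - ELBO equals KL(q(z) || P(z|x))
```

Setting q to the true posterior N(x/2, 1/2) would close the gap exactly.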
Mean-Field Variational Inference
Assume q(z) factorizes: q(z) = prod_j q_j(z_j).
Optimal factor (coordinate ascent):
log q_j*(z_j) = E_{q_{-j}}[log P(x, z)] + const
Iterate over factors until convergence.
Stochastic Variational Inference (SVI)
For large datasets, use stochastic optimization on the ELBO:
- Subsample a mini-batch of data
- Compute noisy gradient of ELBO w.r.t. variational parameters
- Update parameters with SGD/Adam
Enables variational inference on millions of data points.
Variational Autoencoders (VAE)
Use neural networks for both q(z|x; phi) (encoder) and P(x|z; theta) (decoder). Optimize ELBO using the reparameterization trick:
z = mu + sigma * epsilon, epsilon ~ N(0, I)
This makes the sampling step differentiable, enabling backpropagation through the encoder.
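A numpy-only sketch of the reparameterization (the mu and sigma values stand in for encoder outputs): the sample is an affine function of the parameters, so dz/dmu = 1 and dz/dsigma = epsilon, and gradients flow through it.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5                 # encoder outputs (illustrative values)
eps = rng.normal(size=100_000)       # noise drawn independently of the parameters
z = mu + sigma * eps                 # reparameterized sample: differentiable in mu, sigma
# z is distributed as N(mu, sigma^2), but the randomness lives in eps,
# so backprop treats z as a deterministic function of (mu, sigma).
```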
Comparison of Inference Methods
| Method              | Scalability  | Exactness   | Flexibility |
|---------------------|--------------|-------------|-------------|
| Exact inference     | Small models | Exact       | Limited     |
| MCMC (Gibbs, HMC)   | Moderate     | Asymptotic  | High        |
| Variational (CAVI)  | Moderate     | Approximate | Moderate    |
| SVI                 | Large-scale  | Approximate | High        |
| Laplace approx      | Fast         | Rough       | Limited     |
Practical Notes
- Naive Bayes: first thing to try for text classification. Fast, few hyperparameters.
- GMM: better than K-means when clusters are elliptical or overlap. Use BIC for model selection.
- GPs: excellent for small datasets where uncertainty matters (Bayesian optimization, active learning).
- Variational inference: when MCMC is too slow and you need approximate Bayesian inference at scale.