Kernel Methods
Support Vector Machines
SVMs find the hyperplane that maximizes the margin between classes.
Hard-Margin SVM
For linearly separable data with labels y_i in {-1, +1}:
Maximize: margin = 2 / ||w||
Subject to: y_i * (w^T x_i + b) >= 1 for all i
Equivalent convex optimization (primal):
Minimize: (1/2) ||w||^2
Subject to: y_i * (w^T x_i + b) >= 1
Dual Formulation
Using Lagrange multipliers alpha_i >= 0:
Maximize: sum(alpha_i) - (1/2) * sum_i sum_j alpha_i alpha_j y_i y_j (x_i^T x_j)
Subject to: alpha_i >= 0, sum(alpha_i * y_i) = 0
Solution: w* = sum(alpha_i * y_i * x_i). Only support vectors have alpha_i > 0 (points on the margin boundary).
Decision function: f(x) = sign(sum(alpha_i * y_i * x_i^T x) + b)
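A tiny worked example makes the dual solution concrete. The two points below are toy data assumed for illustration (not from the text): x_1 = (1, 1) with y_1 = +1 and x_2 = (-1, -1) with y_2 = -1, for which solving the dual by hand gives alpha_1 = alpha_2 = 1/4.

```python
import numpy as np

# Toy data (assumed for illustration): one point per class on the line y = x
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])       # dual solution, obtainable by hand here

w = (alpha * y) @ X                  # w* = sum_i alpha_i y_i x_i  -> [0.5, 0.5]
b = 1 / y[0] - w @ X[0]              # from y_1 (w^T x_1 + b) = 1 on a support vector
margin = 2 / np.linalg.norm(w)       # 2 / ||w|| = 2 * sqrt(2)

print(w, b, margin)
```

Both points sit exactly on the margin boundaries (alpha > 0 for each), as expected when every training point is a support vector.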
Soft-Margin SVM
Allow misclassifications via slack variables xi_i >= 0:
Minimize: (1/2) ||w||^2 + C * sum(xi_i)
Subject to: y_i * (w^T x_i + b) >= 1 - xi_i, xi_i >= 0
- C is the regularization parameter
- Large C: slack heavily penalized, narrower margin, few training misclassifications (potential overfitting)
- Small C: wider margin, more misclassifications tolerated (stronger regularization)
Dual adds box constraint: 0 <= alpha_i <= C.
Hinge Loss Interpretation
Soft-margin SVM is equivalent to minimizing:
L(w) = (1/2) ||w||^2 + C * sum max(0, 1 - y_i * f(x_i))
The hinge loss max(0, 1 - y*f(x)) penalizes points inside the margin or misclassified.
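To make the loss view concrete, the objective above can be minimized directly with subgradient descent on a linear model. This is a sketch; `hinge_sgd` and the toy dataset are illustrative, not from the text.

```python
import numpy as np

def hinge_sgd(X, y, C=1.0, lr=0.01, epochs=300, seed=0):
    """Subgradient descent on (1/2)||w||^2 + C * sum(hinge). Sketch only."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(n):
            # Hinge subgradient is nonzero only for points with margin < 1
            if y[i] * (X[i] @ w + b) < 1:
                w -= lr * (w - C * y[i] * X[i])
                b += lr * C * y[i]
            else:
                w -= lr * w   # only the ||w||^2 term contributes
    return w, b

# Toy separable data (hypothetical): positives upper-right, negatives lower-left
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hinge_sgd(X, y)
```

Per-sample subgradient steps like this converge slowly; production solvers use coordinate or dual methods, but the objective is the same.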
The Kernel Trick
Replace inner products x_i^T x_j with a kernel function K(x_i, x_j) = phi(x_i)^T phi(x_j), where phi maps to a (possibly infinite-dimensional) feature space.
The dual becomes:
Maximize: sum(alpha_i) - (1/2) * sum_i sum_j alpha_i alpha_j y_i y_j K(x_i, x_j)
We never compute phi(x) explicitly -- only inner products via K.
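One way to see that the trick is exact: for the degree-2 polynomial kernel in 2-D, the feature map phi can be written out by expanding (x^T z + 1)^2 and compared against K directly. The map below is derived by hand for illustration.

```python
import numpy as np

def phi(u):
    # Explicit feature map for K(u, v) = (u^T v + 1)^2 with u in R^2,
    # obtained by expanding the square (hand-derived, for illustration)
    u1, u2 = u
    s = np.sqrt(2)
    return np.array([1.0, s * u1, s * u2, u1**2, u2**2, s * u1 * u2])

K = lambda u, v: (u @ v + 1.0) ** 2

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(u, v), phi(u) @ phi(v))   # both equal 4 (up to round-off)
```

For the RBF kernel no such finite expansion exists, which is exactly why computing K instead of phi matters.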
Mercer's Condition
A symmetric function K(x, z) is a valid kernel if and only if the Gram matrix K_ij = K(x_i, x_j) is positive semi-definite for every finite set of points. Equivalently, K can be written as an inner product in some feature space.
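The condition is easy to check numerically on a finite sample by inspecting the Gram matrix eigenvalues. `is_psd_gram` is an illustrative helper, not a library function.

```python
import numpy as np

def is_psd_gram(K_func, X, tol=1e-10):
    """Numerical check of Mercer's condition on a finite sample:
    the Gram matrix must have no eigenvalue below -tol."""
    G = np.array([[K_func(a, b) for b in X] for a in X])
    return bool(np.all(np.linalg.eigvalsh(G) >= -tol))

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))       # valid kernel
not_a_kernel = lambda a, b: -np.sum((a - b) ** 2)      # indefinite: zero trace, nonzero entries
print(is_psd_gram(rbf, X), is_psd_gram(not_a_kernel, X))
```

A check like this only certifies PSD-ness on the sampled points; validity requires it for every finite set.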
Common Kernels
| Kernel       | K(x, z)                   | Feature Space | Use Case              |
|--------------|---------------------------|---------------|-----------------------|
| Linear       | x^T z                     | Finite (d)    | Linearly separable    |
| Polynomial   | (x^T z + c)^p             | Finite        | Polynomial boundaries |
| RBF/Gaussian | exp(-gamma * ||x - z||^2) | Infinite      | General non-linear    |
| Sigmoid      | tanh(alpha * x^T z + c)   | --            | Neural network analog |
| Laplacian    | exp(-gamma * ||x - z||_1) | Infinite      | Sparse/sharp features |
RBF Kernel in Detail
K(x, z) = exp(-||x - z||^2 / (2 * sigma^2)) where gamma = 1/(2*sigma^2)
- Maps to infinite-dimensional space
- gamma controls the "reach" of each support vector
- Large gamma: tight influence (overfitting); small gamma: broad influence (underfitting)
import numpy as np

def rbf_kernel(X1, X2, gamma):
    # X1: (n1, d), X2: (n2, d) -> Gram matrix of shape (n1, n2)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a^T b, computed via broadcasting
    sq_dist = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))  # clip tiny negatives from round-off
SMO Algorithm
Sequential Minimal Optimization (Platt, 1998): solves the SVM dual by optimizing two alpha_i at a time.
import numpy as np

def smo_simplified(X, y, C, kernel, tol=1e-3, max_passes=5):
    # Simplified SMO; assumes helpers decision_function, select_random,
    # and compute_b are defined elsewhere (as in Platt's pseudocode).
    n = len(y)
    alpha = np.zeros(n)
    b = 0.0
    passes = 0
    while passes < max_passes:
        num_changed = 0
        for i in range(n):
            E_i = decision_function(X, y, alpha, b, kernel, X[i]) - y[i]
            # Pick i only if it violates the KKT conditions
            if (y[i]*E_i < -tol and alpha[i] < C) or \
               (y[i]*E_i > tol and alpha[i] > 0):
                j = select_random(i, n)  # heuristic selection
                E_j = decision_function(X, y, alpha, b, kernel, X[j]) - y[j]
                alpha_j_old = alpha[j]
                # Compute bounds L, H for alpha_j
                if y[i] != y[j]:
                    L = max(0, alpha[j] - alpha[i])
                    H = min(C, C + alpha[j] - alpha[i])
                else:
                    L = max(0, alpha[i] + alpha[j] - C)
                    H = min(C, alpha[i] + alpha[j])
                if L == H:
                    continue
                eta = 2*kernel(X[i], X[j]) - kernel(X[i], X[i]) - kernel(X[j], X[j])
                if eta >= 0:
                    continue
                # Update alpha_j along the constraint line, then clip to [L, H]
                alpha[j] = np.clip(alpha[j] - y[j]*(E_i - E_j)/eta, L, H)
                # Update alpha_i to preserve sum(alpha_i * y_i) = 0
                alpha[i] += y[i]*y[j]*(alpha_j_old - alpha[j])
                # Update b
                b = compute_b(X, y, alpha, E_i, E_j, i, j, kernel, C)
                num_changed += 1
        passes = passes + 1 if num_changed == 0 else 0
    return alpha, b
Support Vector Regression (SVR)
Adapt SVM for regression using epsilon-insensitive loss:
L_epsilon(y, f(x)) = max(0, |y - f(x)| - epsilon)
Optimization:
Minimize: (1/2)||w||^2 + C * sum(xi_i + xi_i*)
Subject to: y_i - f(x_i) <= epsilon + xi_i
f(x_i) - y_i <= epsilon + xi_i*
xi_i, xi_i* >= 0
Points within the epsilon-tube incur no loss. Support vectors lie on or outside the tube.
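The epsilon-insensitive loss is easy to verify numerically. `eps_insensitive` and the residual values below are illustrative.

```python
import numpy as np

def eps_insensitive(y, f, eps=0.5):
    """L_epsilon(y, f) = max(0, |y - f| - eps): zero inside the tube."""
    return np.maximum(0.0, np.abs(y - f) - eps)

residuals = np.array([0.1, 0.5, 0.9, -1.2])   # y - f(x) for four hypothetical points
print(eps_insensitive(residuals, 0.0))        # [0.  0.  0.4 0.7]
```

The first two points fall inside the tube (|residual| <= 0.5) and incur no loss; only the amount outside the tube is penalized.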
Kernel PCA
Apply PCA in the kernel-induced feature space without explicit mapping.
- Compute kernel matrix K_ij = K(x_i, x_j)
- Center the kernel matrix: K_c = K - 1_n K - K 1_n + 1_n K 1_n, where 1_n is the n x n matrix with every entry equal to 1/n
- Eigendecompose K_c = V Lambda V^T
- Project: z_k(x) = sum_i alpha_i^k K(x_i, x), where alpha^k is the k-th eigenvector of K_c scaled by 1/sqrt(lambda_k)
This captures non-linear structure that linear PCA misses (e.g., concentric circles).
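The steps above can be sketched in NumPy. `kernel_pca` is an illustrative implementation assuming an RBF kernel, with the 1_n centering matrix built explicitly; the concentric-circles data is a toy example.

```python
import numpy as np

def kernel_pca(X, gamma, k=2):
    """Kernel PCA sketch: RBF Gram matrix, double centering, top-k projections."""
    n = X.shape[0]
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    K = np.exp(-gamma * sq)
    one_n = np.full((n, n), 1.0 / n)                 # the "1_n" centering matrix
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    vals, vecs = np.linalg.eigh(Kc)                  # ascending eigenvalues
    order = np.argsort(vals)[::-1][:k]
    vals, vecs = vals[order], vecs[:, order]
    alphas = vecs / np.sqrt(np.maximum(vals, 1e-12))  # scale by 1/sqrt(lambda_k)
    return Kc @ alphas                               # z_k(x_i) for each training point

# Two concentric circles (toy data): linear PCA cannot separate them by radius
t = np.linspace(0, 2 * np.pi, 20, endpoint=False)
X = np.vstack([np.c_[np.cos(t), np.sin(t)], 3 * np.c_[np.cos(t), np.sin(t)]])
Z = kernel_pca(X, gamma=0.5, k=2)
print(Z.shape)  # (40, 2)
```

Because the Gram matrix is centered, each projected component has zero mean over the training set.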
Reproducing Kernel Hilbert Space (RKHS)
A Hilbert space H of functions with kernel K such that:
- Reproducing property: f(x) = <f, K(x, .)>_H for all f in H
- Representer theorem: the minimizer of an empirical risk regularized by a non-decreasing function of ||f||_H has the form:
f*(x) = sum_{i=1}^{n} alpha_i * K(x_i, x)
This guarantees that kernel methods need only store n coefficients, regardless of the (possibly infinite) dimensionality of the feature space.
RKHS Norm and Regularization
The RKHS norm ||f||_H controls function complexity. Kernel ridge regression:
Minimize: (1/n) sum(y_i - f(x_i))^2 + lambda * ||f||_H^2
Solution: alpha = (K + n*lambda*I)^{-1} y
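The closed-form solution above fits in a few lines. This is a minimal sketch with an RBF kernel; `krr_fit`/`krr_predict` are illustrative names and the sine data is a toy example.

```python
import numpy as np

def _rbf(A, B, gamma):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def krr_fit(X, y, gamma, lam):
    """Kernel ridge regression: alpha = (K + n*lambda*I)^{-1} y."""
    n = X.shape[0]
    return np.linalg.solve(_rbf(X, X, gamma) + n * lam * np.eye(n), y)

def krr_predict(X_train, alpha, gamma, X_new):
    # f(x) = sum_i alpha_i K(x_i, x), per the representer theorem
    return _rbf(X_new, X_train, gamma) @ alpha

# Fit a noiseless sine curve (toy data); small lambda nearly interpolates
X = np.linspace(0, 2 * np.pi, 30)[:, None]
y = np.sin(X).ravel()
alpha = krr_fit(X, y, gamma=1.0, lam=1e-4)
pred = krr_predict(X, alpha, gamma=1.0, X_new=X)
```

Note that only the n coefficients in `alpha` are stored, regardless of the infinite dimensionality of the RBF feature space.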
Practical Considerations
| Aspect        | Guidance                                                         |
|---------------|------------------------------------------------------------------|
| Kernel choice | Start with RBF; linear for high-dim sparse data                  |
| C tuning      | Cross-validate over log scale: 10^{-3} to 10^3                   |
| gamma tuning  | Try 1/d, then cross-validate on log scale                        |
| Scaling       | Always standardize features before kernel methods                |
| Large n       | SVMs scale O(n^2)--O(n^3); use linear SVM or approx kernels      |
| Multi-class   | One-vs-One (n_classes*(n_classes-1)/2 classifiers) or One-vs-All |
Kernel Approximation for Scalability
For large datasets, approximate the kernel with random features (Rahimi & Recht, 2007):
K(x, z) ~ psi(x)^T psi(z), where psi(x) in R^D and D << n
For the RBF kernel exp(-gamma * ||x - z||^2): psi(x) = sqrt(2/D) * cos(W x + b), where the rows of the D x d matrix W are sampled from N(0, 2*gamma*I) and the entries of b are uniform on [0, 2*pi].
This converts a kernel method into a linear method on D-dimensional features, enabling SGD training.
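A sketch of random Fourier features for the RBF kernel, compared against the exact Gram matrix on a few points. `rff` is an illustrative name; the sampling scale sqrt(2*gamma) comes from the spectral distribution of exp(-gamma ||x - z||^2).

```python
import numpy as np

def rff(X, D, gamma, rng):
    """Random Fourier features approximating exp(-gamma ||x - z||^2)."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))  # spectral samples
    b = rng.uniform(0, 2 * np.pi, size=D)                  # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
Z = rff(X, D=20000, gamma=0.5, rng=rng)

# Exact RBF Gram matrix for comparison
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K_exact = np.exp(-0.5 * sq)
print(np.max(np.abs(Z @ Z.T - K_exact)))  # Monte Carlo error shrinks as 1/sqrt(D)
```

In practice D is chosen far smaller than the 20000 used here for the comparison; even a few hundred features often suffice once a linear model is trained on top.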