Kernel Methods
Support Vector Machines
SVMs find the hyperplane that maximizes the margin between classes.
Hard-Margin SVM
For linearly separable data with labels y_i in {-1, +1}:
Maximize: margin = 2 / ||w||
Subject to: y_i * (w^T x_i + b) >= 1 for all i
Equivalent convex optimization (primal):
Minimize: (1/2) ||w||^2
Subject to: y_i * (w^T x_i + b) >= 1
Dual Formulation
Using Lagrange multipliers alpha_i >= 0:
Maximize: sum(alpha_i) - (1/2) * sum_i sum_j alpha_i alpha_j y_i y_j (x_i^T x_j)
Subject to: alpha_i >= 0, sum(alpha_i * y_i) = 0
Solution: w* = sum(alpha_i * y_i * x_i). Only support vectors have alpha_i > 0 (points on the margin boundary).
Decision function: f(x) = sign(sum(alpha_i * y_i * x_i^T x) + b)
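A tiny worked example makes the dual solution concrete. The two points below are toy data assumed for illustration (not from the text): x_1 = (1, 1) with y_1 = +1 and x_2 = (-1, -1) with y_2 = -1, for which solving the dual by hand gives alpha_1 = alpha_2 = 1/4.

```python
import numpy as np

# Toy data (assumed for illustration): one point per class on the line y = x
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])       # dual solution, obtainable by hand here

w = (alpha * y) @ X                  # w* = sum_i alpha_i y_i x_i  -> [0.5, 0.5]
b = 1 / y[0] - w @ X[0]              # from y_1 (w^T x_1 + b) = 1 on a support vector
margin = 2 / np.linalg.norm(w)       # 2 / ||w|| = 2 * sqrt(2)

print(w, b, margin)
```

Both points sit exactly on the margin boundaries (alpha > 0 for each), as expected when every training point is a support vector.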
Soft-Margin SVM
Allow misclassifications via slack variables xi_i >= 0:
Minimize: (1/2) ||w||^2 + C * sum(xi_i)
Subject to: y_i * (w^T x_i + b) >= 1 - xi_i, xi_i >= 0
- C is the regularization parameter
- Large C: slack heavily penalized, narrower margin, few training misclassifications (potential overfitting)
- Small C: wider margin, more misclassifications tolerated (stronger regularization)
Dual adds box constraint: 0 <= alpha_i <= C.
Hinge Loss Interpretation
Soft-margin SVM is equivalent to minimizing:
L(w) = (1/2) ||w||^2 + C * sum max(0, 1 - y_i * f(x_i))
The hinge loss max(0, 1 - y*f(x)) penalizes points inside the margin or misclassified.
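To make the loss view concrete, the objective above can be minimized directly with subgradient descent on a linear model. This is a sketch; `hinge_sgd` and the toy dataset are illustrative, not from the text.

```python
import numpy as np

def hinge_sgd(X, y, C=1.0, lr=0.01, epochs=300, seed=0):
    """Subgradient descent on (1/2)||w||^2 + C * sum(hinge). Sketch only."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(n):
            # Hinge subgradient is nonzero only for points with margin < 1
            if y[i] * (X[i] @ w + b) < 1:
                w -= lr * (w - C * y[i] * X[i])
                b += lr * C * y[i]
            else:
                w -= lr * w   # only the ||w||^2 term contributes
    return w, b

# Toy separable data (hypothetical): positives upper-right, negatives lower-left
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hinge_sgd(X, y)
```

Per-sample subgradient steps like this converge slowly; production solvers use coordinate or dual methods, but the objective is the same.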
The Kernel Trick
Replace inner products x_i^T x_j with a kernel function K(x_i, x_j) = phi(x_i)^T phi(x_j), where phi maps to a (possibly infinite-dimensional) feature space.
The dual becomes:
Maximize: sum(alpha_i) - (1/2) * sum_i sum_j alpha_i alpha_j y_i y_j K(x_i, x_j)
We never compute phi(x) explicitly -- only inner products via K.
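One way to see that the trick is exact: for the degree-2 polynomial kernel in 2-D, the feature map phi can be written out by expanding (x^T z + 1)^2 and compared against K directly. The map below is derived by hand for illustration.

```python
import numpy as np

def phi(u):
    # Explicit feature map for K(u, v) = (u^T v + 1)^2 with u in R^2,
    # obtained by expanding the square (hand-derived, for illustration)
    u1, u2 = u
    s = np.sqrt(2)
    return np.array([1.0, s * u1, s * u2, u1**2, u2**2, s * u1 * u2])

K = lambda u, v: (u @ v + 1.0) ** 2

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(u, v), phi(u) @ phi(v))   # both equal 4 (up to round-off)
```

For the RBF kernel no such finite expansion exists, which is exactly why computing K instead of phi matters.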
Mercer's Condition
A symmetric function K(x, z) is a valid kernel if and only if the Gram matrix K_ij = K(x_i, x_j) is positive semi-definite for every finite set of points. Equivalently, K can be written as an inner product in some feature space.
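The condition is easy to check numerically on a finite sample by inspecting the Gram matrix eigenvalues. `is_psd_gram` is an illustrative helper, not a library function.

```python
import numpy as np

def is_psd_gram(K_func, X, tol=1e-10):
    """Numerical check of Mercer's condition on a finite sample:
    the Gram matrix must have no eigenvalue below -tol."""
    G = np.array([[K_func(a, b) for b in X] for a in X])
    return bool(np.all(np.linalg.eigvalsh(G) >= -tol))

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))       # valid kernel
not_a_kernel = lambda a, b: -np.sum((a - b) ** 2)      # indefinite: zero trace, nonzero entries
print(is_psd_gram(rbf, X), is_psd_gram(not_a_kernel, X))
```

A check like this only certifies PSD-ness on the sampled points; validity requires it for every finite set.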
Common Kernels
| Kernel       | K(x, z)                   | Feature Space | Use Case              |
|--------------|---------------------------|---------------|-----------------------|
| Linear       | x^T z                     | Finite (d)    | Linearly separable    |
| Polynomial   | (x^T z + c)^p             | Finite        | Polynomial boundaries |
| RBF/Gaussian | exp(-gamma * ||x - z||^2) | Infinite      | General non-linear    |
| Sigmoid      | tanh(alpha * x^T z + c)   | --            | Neural network analog |
| Laplacian    | exp(-gamma * ||x - z||_1) | Infinite      | Sparse/sharp features |
RBF Kernel in Detail
K(x, z) = exp(-||x - z||^2 / (2 * sigma^2)) where gamma = 1/(2*sigma^2)
- Maps to infinite-dimensional space
- gamma controls the "reach" of each support vector
- Large gamma: tight influence (overfitting); small gamma: broad influence (underfitting)
import numpy as np

def rbf_kernel(X1, X2, gamma):
    # X1: (n1, d), X2: (n2, d) -> Gram matrix of shape (n1, n2)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a^T b, computed via broadcasting
    sq_dist = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))  # clip tiny negatives from round-off
SMO Algorithm
Sequential Minimal Optimization (Platt, 1998): solves the SVM dual by optimizing two alpha_i at a time.
import numpy as np

def smo_simplified(X, y, C, kernel, tol=1e-3, max_passes=5):
    # Simplified SMO; assumes helpers decision_function, select_random,
    # and compute_b are defined elsewhere (as in Platt's pseudocode).
    n = len(y)
    alpha = np.zeros(n)
    b = 0.0
    passes = 0
    while passes < max_passes:
        num_changed = 0
        for i in range(n):
            E_i = decision_function(X, y, alpha, b, kernel, X[i]) - y[i]
            # Pick i only if it violates the KKT conditions
            if (y[i]*E_i < -tol and alpha[i] < C) or \
               (y[i]*E_i > tol and alpha[i] > 0):
                j = select_random(i, n)  # heuristic selection
                E_j = decision_function(X, y, alpha, b, kernel, X[j]) - y[j]
                alpha_j_old = alpha[j]
                # Compute bounds L, H for alpha_j
                if y[i] != y[j]:
                    L = max(0, alpha[j] - alpha[i])
                    H = min(C, C + alpha[j] - alpha[i])
                else:
                    L = max(0, alpha[i] + alpha[j] - C)
                    H = min(C, alpha[i] + alpha[j])
                if L == H:
                    continue
                eta = 2*kernel(X[i], X[j]) - kernel(X[i], X[i]) - kernel(X[j], X[j])
                if eta >= 0:
                    continue
                # Update alpha_j along the constraint line, then clip to [L, H]
                alpha[j] = np.clip(alpha[j] - y[j]*(E_i - E_j)/eta, L, H)
                # Update alpha_i to preserve sum(alpha_i * y_i) = 0
                alpha[i] += y[i]*y[j]*(alpha_j_old - alpha[j])
                # Update b
                b = compute_b(X, y, alpha, E_i, E_j, i, j, kernel, C)
                num_changed += 1
        passes = passes + 1 if num_changed == 0 else 0
    return alpha, b
Support Vector Regression (SVR)
Adapt SVM for regression using epsilon-insensitive loss:
L_epsilon(y, f(x)) = max(0, |y - f(x)| - epsilon)
Optimization:
Minimize: (1/2)||w||^2 + C * sum(xi_i + xi_i*)
Subject to: y_i - f(x_i) <= epsilon + xi_i
f(x_i) - y_i <= epsilon + xi_i*
xi_i, xi_i* >= 0
Points within the epsilon-tube incur no loss. Support vectors lie on or outside the tube.
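The epsilon-insensitive loss is easy to verify numerically. `eps_insensitive` and the residual values below are illustrative.

```python
import numpy as np

def eps_insensitive(y, f, eps=0.5):
    """L_epsilon(y, f) = max(0, |y - f| - eps): zero inside the tube."""
    return np.maximum(0.0, np.abs(y - f) - eps)

residuals = np.array([0.1, 0.5, 0.9, -1.2])   # y - f(x) for four hypothetical points
print(eps_insensitive(residuals, 0.0))        # [0.  0.  0.4 0.7]
```

The first two points fall inside the tube (|residual| <= 0.5) and incur no loss; only the amount outside the tube is penalized.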
Kernel PCA
Apply PCA in the kernel-induced feature space without explicit mapping.
- Compute kernel matrix K_ij = K(x_i, x_j)
- Center the kernel matrix: K_c = K - 1_n K - K 1_n + 1_n K 1_n, where 1_n is the n x n matrix with every entry equal to 1/n
- Eigendecompose K_c = V Lambda V^T
- Project: z_k(x) = sum_i alpha_i^k K(x_i, x), where alpha^k is the k-th eigenvector of K_c scaled by 1/sqrt(lambda_k)
This captures non-linear structure that linear PCA misses (e.g., concentric circles).
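The steps above can be sketched in NumPy. `kernel_pca` is an illustrative implementation assuming an RBF kernel, with the 1_n centering matrix built explicitly; the concentric-circles data is a toy example.

```python
import numpy as np

def kernel_pca(X, gamma, k=2):
    """Kernel PCA sketch: RBF Gram matrix, double centering, top-k projections."""
    n = X.shape[0]
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    K = np.exp(-gamma * sq)
    one_n = np.full((n, n), 1.0 / n)                 # the "1_n" centering matrix
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    vals, vecs = np.linalg.eigh(Kc)                  # ascending eigenvalues
    order = np.argsort(vals)[::-1][:k]
    vals, vecs = vals[order], vecs[:, order]
    alphas = vecs / np.sqrt(np.maximum(vals, 1e-12))  # scale by 1/sqrt(lambda_k)
    return Kc @ alphas                               # z_k(x_i) for each training point

# Two concentric circles (toy data): linear PCA cannot separate them by radius
t = np.linspace(0, 2 * np.pi, 20, endpoint=False)
X = np.vstack([np.c_[np.cos(t), np.sin(t)], 3 * np.c_[np.cos(t), np.sin(t)]])
Z = kernel_pca(X, gamma=0.5, k=2)
print(Z.shape)  # (40, 2)
```

Because the Gram matrix is centered, each projected component has zero mean over the training set.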
Reproducing Kernel Hilbert Space (RKHS)
A Hilbert space H of functions with kernel K such that:
- Reproducing property: f(x) = <f, K(x, .)>_H for all f in H
- Representer theorem: the minimizer of an empirical risk regularized by a non-decreasing function of ||f||_H has the form:
f*(x) = sum_{i=1}^{n} alpha_i * K(x_i, x)
This guarantees that kernel methods need only store n coefficients, regardless of the (possibly infinite) dimensionality of the feature space.
RKHS Norm and Regularization
The RKHS norm ||f||_H controls function complexity. Kernel ridge regression:
Minimize: (1/n) sum(y_i - f(x_i))^2 + lambda * ||f||_H^2
Solution: alpha = (K + n*lambda*I)^{-1} y
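The closed-form solution above fits in a few lines. This is a minimal sketch with an RBF kernel; `krr_fit`/`krr_predict` are illustrative names and the sine data is a toy example.

```python
import numpy as np

def _rbf(A, B, gamma):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def krr_fit(X, y, gamma, lam):
    """Kernel ridge regression: alpha = (K + n*lambda*I)^{-1} y."""
    n = X.shape[0]
    return np.linalg.solve(_rbf(X, X, gamma) + n * lam * np.eye(n), y)

def krr_predict(X_train, alpha, gamma, X_new):
    # f(x) = sum_i alpha_i K(x_i, x), per the representer theorem
    return _rbf(X_new, X_train, gamma) @ alpha

# Fit a noiseless sine curve (toy data); small lambda nearly interpolates
X = np.linspace(0, 2 * np.pi, 30)[:, None]
y = np.sin(X).ravel()
alpha = krr_fit(X, y, gamma=1.0, lam=1e-4)
pred = krr_predict(X, alpha, gamma=1.0, X_new=X)
```

Note that only the n coefficients in `alpha` are stored, regardless of the infinite dimensionality of the RBF feature space.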
Practical Considerations
| Aspect        | Guidance                                                         |
|---------------|------------------------------------------------------------------|
| Kernel choice | Start with RBF; linear for high-dim sparse data                  |
| C tuning      | Cross-validate over log scale: 10^{-3} to 10^3                   |
| gamma tuning  | Try 1/d, then cross-validate on log scale                        |
| Scaling       | Always standardize features before kernel methods                |
| Large n       | SVMs scale O(n^2)--O(n^3); use linear SVM or approx kernels      |
| Multi-class   | One-vs-One (n_classes*(n_classes-1)/2 classifiers) or One-vs-All |
Kernel Approximation for Scalability
For large datasets, approximate the kernel with random features (Rahimi & Recht, 2007):
K(x, z) ~ psi(x)^T psi(z), where psi(x) in R^D and D << n
For the RBF kernel exp(-gamma * ||x - z||^2): psi(x) = sqrt(2/D) * cos(W x + b), where the rows of the D x d matrix W are sampled from N(0, 2*gamma*I) and the entries of b are uniform on [0, 2*pi].
This converts a kernel method into a linear method on D-dimensional features, enabling SGD training.
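A sketch of random Fourier features for the RBF kernel, compared against the exact Gram matrix on a few points. `rff` is an illustrative name; the sampling scale sqrt(2*gamma) comes from the spectral distribution of exp(-gamma ||x - z||^2).

```python
import numpy as np

def rff(X, D, gamma, rng):
    """Random Fourier features approximating exp(-gamma ||x - z||^2)."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))  # spectral samples
    b = rng.uniform(0, 2 * np.pi, size=D)                  # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
Z = rff(X, D=20000, gamma=0.5, rng=rng)

# Exact RBF Gram matrix for comparison
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K_exact = np.exp(-0.5 * sq)
print(np.max(np.abs(Z @ Z.T - K_exact)))  # Monte Carlo error shrinks as 1/sqrt(D)
```

In practice D is chosen far smaller than the 20000 used here for the comparison; even a few hundred features often suffice once a linear model is trained on top.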