Linear Models
Linear Regression
Model: y = X * w + b, or in augmented form y = X_aug * w where X_aug includes a bias column.
Goal: minimize the sum of squared residuals.
L(w) = ||y - Xw||^2 = (y - Xw)^T (y - Xw)
Normal Equation (Closed-Form Solution)
Set gradient to zero:
dL/dw = -2 X^T (y - Xw) = 0
w* = (X^T X)^{-1} X^T y
- Time complexity: O(n * d^2 + d^3) where n = samples, d = features
- Requires X^T X to be invertible (fails when features are collinear)
- Practical for d < ~10,000
```python
import numpy as np

def normal_equation(X, y):
    """Solve w* = (X^T X)^{-1} X^T y without forming the explicit inverse."""
    # X: (n, d), y: (n,)
    return np.linalg.solve(X.T @ X, X.T @ y)
```
Gradient Descent
Iteratively update the weights in the direction of steepest descent (here the loss is averaged over samples, (1/n) * ||y - Xw||^2, which introduces the 1/n factor):
w_{t+1} = w_t - alpha * dL/dw
        = w_t - alpha * (-2/n) * X^T (y - X * w_t)
```python
def gradient_descent(X, y, lr=0.01, epochs=1000):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        # Gradient of the mean squared error (1/n) * ||y - Xw||^2
        gradient = -(2 / n) * X.T @ (y - X @ w)
        w -= lr * gradient
    return w
```
Variants:
- Batch GD: full dataset per step (stable, slow for large n)
- Stochastic GD: single sample per step (noisy, fast)
- Mini-batch GD: batch of B samples (practical compromise, B=32..512)
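The mini-batch variant can be sketched as follows (a minimal illustration; `minibatch_gd` and its defaults are not a reference implementation):

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, epochs=100, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)  # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Same gradient as batch GD, averaged over the mini-batch only
            grad = -(2 / len(idx)) * Xb.T @ (yb - Xb @ w)
            w -= lr * grad
    return w
```

With batch_size=n this reduces to batch GD; with batch_size=1 it is plain SGD.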
Convergence Conditions
For convex quadratic loss with Hessian eigenvalues lambda_min, lambda_max:
Convergence guaranteed if: lr < 2 / lambda_max
Optimal fixed lr: 2 / (lambda_min + lambda_max)
Condition number: kappa = lambda_max / lambda_min (higher = slower)
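These quantities can be computed directly; a sketch on arbitrary synthetic data (the factor 2 appears because the Hessian of ||y - Xw||^2 is 2 X^T X):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

H = 2 * X.T @ X                    # Hessian of ||y - Xw||^2
eigs = np.linalg.eigvalsh(H)       # eigenvalues in ascending order
lam_min, lam_max = eigs[0], eigs[-1]

lr_bound = 2 / lam_max             # lr above this diverges
lr_opt = 2 / (lam_min + lam_max)   # best fixed step size
kappa = lam_max / lam_min          # condition number
```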
Regularized Linear Regression
Ridge Regression (L2)
L_ridge(w) = ||y - Xw||^2 + lambda * ||w||_2^2
Closed-form: w* = (X^T X + lambda * I)^{-1} X^T y
Properties:
- Shrinks coefficients toward zero but never exactly zero
- Makes X^T X + lambda * I invertible for any lambda > 0 (resolves collinearity)
- Equivalent to MAP estimate with Gaussian prior on w
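The closed form is a one-line change from the normal equation (a sketch; the function name is illustrative, and `np.linalg.solve` avoids forming the explicit inverse):

```python
import numpy as np

def ridge_regression(X, y, lam):
    """w* = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```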
Lasso Regression (L1)
L_lasso(w) = ||y - Xw||^2 + lambda * ||w||_1
Properties:
- Produces sparse solutions (some w_j = 0 exactly)
- Performs implicit feature selection
- No closed-form; solved via coordinate descent or ISTA
- Equivalent to MAP with Laplace prior on w
```python
def soft_threshold(rho, threshold):
    """Proximal operator of the L1 norm."""
    if rho > threshold:
        return rho - threshold
    elif rho < -threshold:
        return rho + threshold
    else:
        return 0.0

def coordinate_descent_lasso(X, y, lam, epochs=100):
    # Minimizes (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1;
    # the lam * n threshold below corresponds to this scaling convention.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for j in range(d):
            # Residual with feature j's contribution removed
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual
            w[j] = soft_threshold(rho, lam * n) / (X[:, j] @ X[:, j])
    return w
```
Elastic Net
Combines L1 and L2:
L_enet(w) = ||y - Xw||^2 + lambda_1 * ||w||_1 + lambda_2 * ||w||_2^2
Or equivalently with mixing ratio alpha in [0, 1]:
L = ||y - Xw||^2 + lambda * (alpha * ||w||_1 + (1-alpha) * ||w||_2^2)
- alpha=1: pure Lasso; alpha=0: pure Ridge
- Handles groups of correlated features better than Lasso alone
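Coordinate descent extends to the elastic net: the L2 term simply adds to the denominator of each coordinate update. A sketch, assuming the objective is scaled as (1/(2n)) * ||y - Xw||^2 (the same convention implied by the lam * n threshold in the Lasso code); names are illustrative:

```python
import numpy as np

def soft_threshold(rho, t):
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def coordinate_descent_enet(X, y, lam, alpha, epochs=100):
    # Minimizes (1/(2n))||y - Xw||^2 + lam * (alpha*||w||_1 + (1-alpha)*||w||_2^2)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for j in range(d):
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual
            # L2 penalty shows up as an extra term in the denominator
            denom = X[:, j] @ X[:, j] + 2 * n * lam * (1 - alpha)
            w[j] = soft_threshold(rho, n * lam * alpha) / denom
    return w
```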
Logistic Regression
Binary classification: model P(y=1|x) using the sigmoid function.
P(y=1|x) = sigma(w^T x + b) = 1 / (1 + exp(-(w^T x + b)))
Loss Function (Binary Cross-Entropy)
L(w) = -(1/n) * sum[ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]
This is convex -- gradient descent finds the global minimum.
dL/dw = (1/n) * X^T (sigma(Xw) - y)
Note: same form as linear regression gradient but with sigma applied.
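The gradient above plugs straight into gradient descent; a minimal sketch (function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, lr=0.1, epochs=1000):
    # y in {0, 1}; gradient is (1/n) * X^T (sigma(Xw) - y)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        w -= lr * (1 / n) * X.T @ (sigmoid(X @ w) - y)
    return w
```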
Multi-Class: Softmax Regression
For K classes, predict probability vector:
P(y=k|x) = exp(w_k^T x) / sum_{j=1}^{K} exp(w_j^T x)
Loss: categorical cross-entropy (negative log-likelihood).
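A sketch of the forward pass and loss (names are illustrative; subtracting the row max before exponentiating is the standard trick to avoid overflow):

```python
import numpy as np

def softmax(Z):
    # Subtract each row's max: shifts logits without changing the result
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def predict_proba(X, W):
    # W: (d, K), one weight column per class
    return softmax(X @ W)

def cross_entropy(P, y):
    # Mean negative log-likelihood of the true labels y (ints in [0, K))
    return -np.mean(np.log(P[np.arange(len(y)), y]))
```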
Decision Boundary
Logistic regression produces a linear decision boundary:
w^T x + b = 0 defines the boundary
Points are classified by which side of the boundary they fall on. Because the model is trained by maximizing likelihood, the sigmoid output is a probability estimate, an advantage over SVMs, whose raw decision scores are not probabilities.
The Perceptron
The simplest linear classifier. Binary output via sign function.
y_hat = sign(w^T x + b)
Perceptron Learning Algorithm
```python
def perceptron(X, y, max_iter=1000):
    # Labels y in {-1, +1}
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_iter):
        converged = True
        for i in range(len(X)):
            if y[i] * (w @ X[i] + b) <= 0:  # misclassified (or on the boundary)
                w += y[i] * X[i]
                b += y[i]
                converged = False
        if converged:
            break
    return w, b
```
Perceptron Convergence Theorem: if data is linearly separable with margin gamma, the algorithm converges in at most (R / gamma)^2 updates, where R = max ||x_i||.
Limitation: converges only when the data is linearly separable; a single linear boundary cannot represent problems like XOR.
Linear Discriminant Analysis (LDA)
Both a classifier and dimensionality reduction technique.
Fisher's Criterion
Find projection w that maximizes class separation:
J(w) = (w^T S_B w) / (w^T S_W w)
Where:
- S_B = between-class scatter = (mu_1 - mu_2)(mu_1 - mu_2)^T
- S_W = within-class scatter = sum of class covariance matrices
Solution: w* = S_W^{-1} (mu_1 - mu_2)
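The two-class solution can be sketched directly (the function name is illustrative; `np.linalg.solve` is used instead of an explicit inverse):

```python
import numpy as np

def lda_direction(X0, X1):
    """Fisher direction w = S_W^{-1} (mu_1 - mu_0) for two classes."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the two class scatter matrices
    S0 = (X0 - mu0).T @ (X0 - mu0)
    S1 = (X1 - mu1).T @ (X1 - mu1)
    return np.linalg.solve(S0 + S1, mu1 - mu0)
```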
Assumptions
- Each class is Gaussian with equal covariance
- When covariances differ, use Quadratic Discriminant Analysis (QDA)
Multi-Class LDA
Projects onto at most (K-1) dimensions, since S_B has rank at most K-1. Solves the generalized eigenvalue problem:
S_B w = lambda * S_W w
Take eigenvectors corresponding to the (K-1) largest eigenvalues.
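A multi-class sketch (illustrative names; it solves the generalized problem via the eigendecomposition of S_W^{-1} S_B, assuming S_W is invertible):

```python
import numpy as np

def multiclass_lda(X, y, n_components):
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)          # within-class scatter
        diff = (mc - mu)[:, None]
        Sb += len(Xc) * diff @ diff.T          # between-class scatter
    # S_B w = lambda * S_W w  <=>  eig of S_W^{-1} S_B
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1]     # largest eigenvalues first
    return eigvecs[:, order[:n_components]].real
```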
Feature Scaling
Linear models are sensitive to feature scales. Gradient descent converges faster with scaled features.
Methods
| Method          | Formula                 | Range    | Use Case            |
|-----------------|-------------------------|----------|---------------------|
| Standardization | (x - mu) / sigma        | ~[-3, 3] | Default choice      |
| Min-Max         | (x - min) / (max - min) | [0, 1]   | Bounded features    |
| Robust scaling  | (x - median) / IQR      | varies   | Outlier-heavy data  |
| Log transform   | log(1 + x)              | varies   | Right-skewed data   |
Critical rule: fit scaler on training data only, then apply to test data.
```python
class StandardScaler:
    def fit(self, X_train):
        # Learn statistics from the training split only
        self.mean = X_train.mean(axis=0)
        self.std = X_train.std(axis=0)
        return self

    def transform(self, X):
        return (X - self.mean) / self.std

    def fit_transform(self, X_train):
        return self.fit(X_train).transform(X_train)
```
Polynomial Features
Extend linear models to capture non-linear relationships:
[x_1, x_2] -> [1, x_1, x_2, x_1^2, x_1*x_2, x_2^2]
With degree p and d features: number of terms = C(d+p, p). Grows fast -- combine with regularization.
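The expansion can be sketched with `itertools.combinations_with_replacement`, which enumerates the monomials of each degree (the function name is illustrative):

```python
import numpy as np
from itertools import combinations_with_replacement

def polynomial_features(X, degree):
    """All monomials of total degree <= degree, including the bias term 1."""
    n, d = X.shape
    cols = []
    for p in range(degree + 1):
        # Each combo is a multiset of feature indices, e.g. (0, 1) -> x_1 * x_2
        for combo in combinations_with_replacement(range(d), p):
            col = np.ones(n)
            for j in combo:
                col = col * X[:, j]
            cols.append(col)
    return np.column_stack(cols)
```

For d=2, degree=2 this yields the six columns [1, x_1, x_2, x_1^2, x_1*x_2, x_2^2], matching C(2+2, 2) = 6.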
Comparison of Linear Models
| Model               | Loss           | Regularization | Output      | Sparsity |
|---------------------|----------------|----------------|-------------|----------|
| Linear Regression   | MSE            | None           | Continuous  | No       |
| Ridge               | MSE + L2       | L2             | Continuous  | No       |
| Lasso               | MSE + L1       | L1             | Continuous  | Yes      |
| Elastic Net         | MSE + L1 + L2  | L1 + L2        | Continuous  | Yes      |
| Logistic Regression | Cross-entropy  | Optional       | Probability | Optional |
| Perceptron          | Hinge-like     | None           | Binary      | No       |
| LDA                 | Generative     | None           | Class label | No       |