4 min read
On this page

Linear Models

Linear Regression

Model: y = X * w + b, or in augmented form y = X_aug * w where X_aug includes a bias column.

Goal: minimize the sum of squared residuals.

L(w) = ||y - Xw||^2 = (y - Xw)^T (y - Xw)

Normal Equation (Closed-Form Solution)

Set gradient to zero:

dL/dw = -2 X^T (y - Xw) = 0
w* = (X^T X)^{-1} X^T y
  • Time complexity: O(n * d^2 + d^3) where n = samples, d = features
  • Requires X^T X to be invertible (fails when features are collinear)
  • Practical for d < ~10,000
def normal_equation(X, y):
    # X: (n, d), y: (n,)
    return np.linalg.solve(X.T @ X, X.T @ y)

Gradient Descent

Iteratively update weights in the direction of steepest descent:

w_{t+1} = w_t - alpha * dL/dw
        = w_t - alpha * (-2/n) * X^T (y - X * w_t)
def gradient_descent(X, y, lr=0.01, epochs=1000):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        gradient = -(2/n) * X.T @ (y - X @ w)
        w -= lr * gradient
    return w

Variants:

  • Batch GD: full dataset per step (stable, slow for large n)
  • Stochastic GD: single sample per step (noisy, fast)
  • Mini-batch GD: batch of B samples (practical compromise, B=32..512)

Convergence Conditions

For convex quadratic loss with Hessian eigenvalues lambda_min, lambda_max:

Convergence guaranteed if: lr < 2 / lambda_max
Optimal fixed lr: 2 / (lambda_min + lambda_max)
Condition number: kappa = lambda_max / lambda_min  (higher = slower)

Regularized Linear Regression

Ridge Regression (L2)

L_ridge(w) = ||y - Xw||^2 + lambda * ||w||_2^2

Closed-form: w* = (X^T X + lambda * I)^{-1} X^T y

Properties:

  • Shrinks coefficients toward zero but never exactly zero
  • Makes X^T X + lambda I always invertible (solves collinearity)
  • Equivalent to MAP estimate with Gaussian prior on w

Lasso Regression (L1)

L_lasso(w) = ||y - Xw||^2 + lambda * ||w||_1

Properties:

  • Produces sparse solutions (some w_j = 0 exactly)
  • Performs implicit feature selection
  • No closed-form; solved via coordinate descent or ISTA
  • Equivalent to MAP with Laplace prior on w
def coordinate_descent_lasso(X, y, lam, epochs=100):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for j in range(d):
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual
            w[j] = soft_threshold(rho, lam * n) / (X[:, j] @ X[:, j])
    return w

def soft_threshold(rho, threshold):
    if rho > threshold:    return rho - threshold
    elif rho < -threshold: return rho + threshold
    else:                  return 0.0

Elastic Net

Combines L1 and L2:

L_enet(w) = ||y - Xw||^2 + lambda_1 * ||w||_1 + lambda_2 * ||w||_2^2

Or equivalently with mixing ratio alpha in [0, 1]:

L = ||y - Xw||^2 + lambda * (alpha * ||w||_1 + (1-alpha) * ||w||_2^2)
  • alpha=1: pure Lasso; alpha=0: pure Ridge
  • Handles groups of correlated features better than Lasso alone

Logistic Regression

Binary classification: model P(y=1|x) using the sigmoid function.

P(y=1|x) = sigma(w^T x + b) = 1 / (1 + exp(-(w^T x + b)))

Loss Function (Binary Cross-Entropy)

L(w) = -(1/n) * sum[ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]

This is convex -- gradient descent finds the global minimum.

dL/dw = (1/n) * X^T (sigma(Xw) - y)

Note: same form as linear regression gradient but with sigma applied.

Multi-Class: Softmax Regression

For K classes, predict probability vector:

P(y=k|x) = exp(w_k^T x) / sum_{j=1}^{K} exp(w_j^T x)

Loss: categorical cross-entropy (negative log-likelihood).

Decision Boundary

Logistic regression produces a linear decision boundary:

w^T x + b = 0  defines the boundary

Points are classified by which side they fall on. The sigmoid output gives calibrated probabilities (a major advantage over SVMs).

The Perceptron

The simplest linear classifier. Binary output via sign function.

y_hat = sign(w^T x + b)

Perceptron Learning Algorithm

def perceptron(X, y, max_iter=1000):
    # y in {-1, +1}
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_iter):
        converged = True
        for i in range(len(X)):
            if y[i] * (w @ X[i] + b) <= 0:  # misclassified
                w += y[i] * X[i]
                b += y[i]
                converged = False
        if converged:
            break
    return w, b

Perceptron Convergence Theorem: if data is linearly separable with margin gamma, the algorithm converges in at most (R / gamma)^2 updates, where R = max ||x_i||.

Limitation: only works for linearly separable data (XOR problem).

Linear Discriminant Analysis (LDA)

Both a classifier and dimensionality reduction technique.

Fisher's Criterion

Find projection w that maximizes class separation:

J(w) = (w^T S_B w) / (w^T S_W w)

Where:

  • S_B = between-class scatter = (mu_1 - mu_2)(mu_1 - mu_2)^T
  • S_W = within-class scatter = sum of class covariance matrices

Solution: w* = S_W^{-1} (mu_1 - mu_2)

Assumptions

  • Each class is Gaussian with equal covariance
  • When covariances differ, use Quadratic Discriminant Analysis (QDA)

Multi-Class LDA

Projects onto at most (K-1) dimensions. Solves generalized eigenvalue problem:

S_B w = lambda * S_W w

Take eigenvectors corresponding to the (K-1) largest eigenvalues.

Feature Scaling

Linear models are sensitive to feature scales. Gradient descent converges faster with scaled features.

Methods

Method Formula Range Use Case
Standardization (x - mu) / sigma ~[-3, 3] Default choice
Min-Max (x - min) / (max - min) [0, 1] Bounded features
Robust scaling (x - median) / IQR varies Outlier-heavy data
Log transform log(1 + x) varies Right-skewed data

Critical rule: fit scaler on training data only, then apply to test data.

class StandardScaler:
    def fit(self, X_train):
        self.mean = X_train.mean(axis=0)
        self.std = X_train.std(axis=0)

    def transform(self, X):
        return (X - self.mean) / self.std

    def fit_transform(self, X_train):
        self.fit(X_train)
        return self.transform(X_train)

Polynomial Features

Extend linear models to capture non-linear relationships:

[x_1, x_2] -> [1, x_1, x_2, x_1^2, x_1*x_2, x_2^2]

With degree p and d features: number of terms = C(d+p, p). Grows fast -- combine with regularization.

Comparison of Linear Models

Model Loss Regularization Output Sparsity
Linear Regression MSE None Continuous No
Ridge MSE + L2 L2 Continuous No
Lasso MSE + L1 L1 Continuous Yes
Elastic Net MSE + L1 + L2 L1 + L2 Continuous Yes
Logistic Regression Cross-entropy Optional Probability Optional
Perceptron Hinge-like None Binary No
LDA Generative None Class label No