
Linear Models

Linear Regression

Model: y = X * w + b, or in augmented form y = X_aug * w where X_aug includes a bias column.

Goal: minimize the sum of squared residuals.

L(w) = ||y - Xw||^2 = (y - Xw)^T (y - Xw)

Normal Equation (Closed-Form Solution)

Set gradient to zero:

dL/dw = -2 X^T (y - Xw) = 0
w* = (X^T X)^{-1} X^T y
  • Time complexity: O(n * d^2 + d^3) where n = samples, d = features
  • Requires X^T X to be invertible (fails when features are collinear)
  • Practical for d < ~10,000
import numpy as np

def normal_equation(X, y):
    # X: (n, d), y: (n,)
    # Solve the linear system directly; this is more stable and cheaper
    # than forming the explicit inverse np.linalg.inv(X.T @ X) @ X.T @ y.
    return np.linalg.solve(X.T @ X, X.T @ y)

Gradient Descent

Iteratively update weights in the direction of steepest descent:

w_{t+1} = w_t - alpha * dL/dw
        = w_t + alpha * (2/n) * X^T (y - X * w_t)

(The 1/n factor comes from averaging the squared error over samples; it keeps the effective learning rate independent of dataset size.)
def gradient_descent(X, y, lr=0.01, epochs=1000):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        gradient = -(2/n) * X.T @ (y - X @ w)
        w -= lr * gradient
    return w

Variants:

  • Batch GD: full dataset per step (stable, slow for large n)
  • Stochastic GD: single sample per step (noisy, fast)
  • Mini-batch GD: batch of B samples (practical compromise, B=32..512)
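The mini-batch variant above can be sketched as follows (a minimal sketch; the batch size, reshuffling scheme, and seed handling are illustrative choices, not part of any canonical implementation):

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, epochs=100, batch_size=32, seed=0):
    # Mini-batch gradient descent for least squares.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)            # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = -(2 / len(batch)) * Xb.T @ (yb - Xb @ w)
            w -= lr * grad
    return w
```

Setting `batch_size=n` recovers batch GD; `batch_size=1` recovers stochastic GD.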

Convergence Conditions

For convex quadratic loss with Hessian eigenvalues lambda_min, lambda_max:

Convergence guaranteed if: lr < 2 / lambda_max
Optimal fixed lr: 2 / (lambda_min + lambda_max)
Condition number: kappa = lambda_max / lambda_min  (higher = slower)
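These quantities can be computed directly. For the mean-squared-error loss L = (1/n)||y - Xw||^2 the Hessian is (2/n) X^T X, so a small sketch of the step-size bounds (function name is ours, for illustration):

```python
import numpy as np

def gd_step_size_bounds(X):
    # Hessian of L = (1/n)||y - Xw||^2 is H = (2/n) X^T X;
    # its extreme eigenvalues govern fixed-step convergence.
    n = X.shape[0]
    eig = np.linalg.eigvalsh((2 / n) * X.T @ X)   # sorted ascending
    lam_min, lam_max = eig[0], eig[-1]
    return {
        "max_stable_lr": 2 / lam_max,
        "optimal_lr": 2 / (lam_min + lam_max),
        "condition_number": lam_max / lam_min,
    }
```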

Regularized Linear Regression

Ridge Regression (L2)

L_ridge(w) = ||y - Xw||^2 + lambda * ||w||_2^2

Closed-form: w* = (X^T X + lambda * I)^{-1} X^T y

Properties:

  • Shrinks coefficients toward zero but never exactly zero
  • Makes X^T X + lambda I always invertible (solves collinearity)
  • Equivalent to MAP estimate with Gaussian prior on w
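The ridge closed form is a one-line change to the normal equation (a sketch; as with OLS, solving the system beats forming an explicit inverse):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    # Solve (X^T X + lam * I) w = X^T y. The added lam * I makes the
    # system well-posed even when columns of X are perfectly collinear.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

With duplicated columns, plain OLS fails (singular X^T X) while ridge splits the weight evenly between the copies.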

Lasso Regression (L1)

L_lasso(w) = ||y - Xw||^2 + lambda * ||w||_1

Properties:

  • Produces sparse solutions (some w_j = 0 exactly)
  • Performs implicit feature selection
  • No closed-form; solved via coordinate descent or ISTA
  • Equivalent to MAP with Laplace prior on w
def coordinate_descent_lasso(X, y, lam, epochs=100):
    # Assumes the scaled loss (1/(2n))||y - Xw||^2 + lam * ||w||_1
    # (sklearn's convention), hence the lam * n threshold below.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for j in range(d):
            # Partial residual: remove feature j's current contribution
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual
            w[j] = soft_threshold(rho, lam * n) / (X[:, j] @ X[:, j])
    return w

def soft_threshold(rho, threshold):
    if rho > threshold:    return rho - threshold
    elif rho < -threshold: return rho + threshold
    else:                  return 0.0

Elastic Net

Combines L1 and L2:

L_enet(w) = ||y - Xw||^2 + lambda_1 * ||w||_1 + lambda_2 * ||w||_2^2

Or equivalently with mixing ratio alpha in [0, 1]:

L = ||y - Xw||^2 + lambda * (alpha * ||w||_1 + (1-alpha) * ||w||_2^2)
  • alpha=1: pure Lasso; alpha=0: pure Ridge
  • Handles groups of correlated features better than Lasso alone
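Elastic net admits the same coordinate-descent update as Lasso; the L2 term only augments the denominator. A sketch under the 1/(2n) loss convention (sklearn-style, assumed here for concreteness; `soft_threshold` is redefined so the block is self-contained):

```python
import numpy as np

def soft_threshold(rho, threshold):
    return np.sign(rho) * max(abs(rho) - threshold, 0.0)

def coordinate_descent_enet(X, y, lam, alpha=0.5, epochs=100):
    # Coordinate descent for
    #   L = (1/(2n))||y - Xw||^2
    #       + lam * (alpha * ||w||_1 + (1 - alpha)/2 * ||w||_2^2)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for j in range(d):
            resid = y - X @ w + X[:, j] * w[j]    # partial residual
            rho = X[:, j] @ resid
            denom = X[:, j] @ X[:, j] + lam * (1 - alpha) * n
            w[j] = soft_threshold(rho, lam * alpha * n) / denom
    return w
```

At `alpha=1` the denominator and threshold reduce exactly to the Lasso update.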

Logistic Regression

Binary classification: model P(y=1|x) using the sigmoid function.

P(y=1|x) = sigma(w^T x + b) = 1 / (1 + exp(-(w^T x + b)))

Loss Function (Binary Cross-Entropy)

L(w) = -(1/n) * sum[ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]

This is convex -- gradient descent finds the global minimum.

dL/dw = (1/n) * X^T (sigma(Xw) - y)

Note: same form as linear regression gradient but with sigma applied.
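Putting the loss and gradient together gives a compact trainer (a minimal sketch; a bias term can be absorbed by appending a column of ones to X):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, lr=0.1, epochs=1000):
    # Gradient descent on binary cross-entropy; y in {0, 1}.
    # Uses dL/dw = (1/n) X^T (sigma(Xw) - y), as derived above.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ w) - y) / n
        w -= lr * grad
    return w
```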

Multi-Class: Softmax Regression

For K classes, predict probability vector:

P(y=k|x) = exp(w_k^T x) / sum_{j=1}^{K} exp(w_j^T x)

Loss: categorical cross-entropy (negative log-likelihood).
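A sketch of softmax regression trained by gradient descent (one-hot labels assumed; the max-subtraction in `softmax` is the standard numerical-stability trick):

```python
import numpy as np

def softmax(Z):
    # Row-wise softmax; subtracting the row max avoids overflow in exp.
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_regression_gd(X, Y, lr=0.1, epochs=500):
    # X: (n, d); Y: (n, K) one-hot labels; weight matrix W: (d, K).
    n, d = X.shape
    K = Y.shape[1]
    W = np.zeros((d, K))
    for _ in range(epochs):
        P = softmax(X @ W)
        W -= lr * X.T @ (P - Y) / n   # cross-entropy gradient
    return W
```

Note the gradient has the same (prediction - target) form as the binary case.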

Decision Boundary

Logistic regression produces a linear decision boundary:

w^T x + b = 0  defines the boundary

Points are classified by which side they fall on. The sigmoid output provides probability estimates rather than bare labels (an advantage over margin-based classifiers like SVMs, whose raw scores are not probabilities).

The Perceptron

The simplest linear classifier. Binary output via sign function.

y_hat = sign(w^T x + b)

Perceptron Learning Algorithm

def perceptron(X, y, max_iter=1000):
    # y in {-1, +1}
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_iter):
        converged = True
        for i in range(len(X)):
            if y[i] * (w @ X[i] + b) <= 0:  # misclassified
                w += y[i] * X[i]
                b += y[i]
                converged = False
        if converged:
            break
    return w, b

Perceptron Convergence Theorem: if data is linearly separable with margin gamma, the algorithm converges in at most (R / gamma)^2 updates, where R = max ||x_i||.

Limitation: the algorithm converges only when the data is linearly separable; on non-separable data it cycles forever. The classic counterexample is XOR, which no linear boundary can separate.

Linear Discriminant Analysis (LDA)

Both a classifier and dimensionality reduction technique.

Fisher's Criterion

Find projection w that maximizes class separation:

J(w) = (w^T S_B w) / (w^T S_W w)

Where:

  • S_B = between-class scatter = (mu_1 - mu_2)(mu_1 - mu_2)^T (two-class case)
  • S_W = within-class scatter = sum of class covariance matrices

Solution: w* = S_W^{-1} (mu_1 - mu_2)
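The solution translates directly into code (a sketch; the function name is ours, and the direction is normalized since Fisher's criterion is scale-invariant):

```python
import numpy as np

def fisher_lda_direction(X0, X1):
    # Two-class Fisher discriminant: w* = S_W^{-1} (mu_1 - mu_0).
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the (unnormalized) class scatters.
    S_W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    w = np.linalg.solve(S_W, mu1 - mu0)
    return w / np.linalg.norm(w)
```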

Assumptions

  • Each class is Gaussian with equal covariance
  • When covariances differ, use Quadratic Discriminant Analysis (QDA)

Multi-Class LDA

Projects onto at most (K-1) dimensions. Solves generalized eigenvalue problem:

S_B w = lambda * S_W w

Take eigenvectors corresponding to the (K-1) largest eigenvalues.

Feature Scaling

Linear models are sensitive to feature scales. Gradient descent converges faster with scaled features.

Methods

| Method          | Formula                 | Range    | Use Case           |
|-----------------|-------------------------|----------|--------------------|
| Standardization | (x - mu) / sigma        | ~[-3, 3] | Default choice     |
| Min-Max         | (x - min) / (max - min) | [0, 1]   | Bounded features   |
| Robust scaling  | (x - median) / IQR      | varies   | Outlier-heavy data |
| Log transform   | log(1 + x)              | varies   | Right-skewed data  |

Critical rule: fit scaler on training data only, then apply to test data.

class StandardScaler:
    def fit(self, X_train):
        self.mean = X_train.mean(axis=0)
        self.std = X_train.std(axis=0)
        self.std[self.std == 0] = 1.0  # avoid division by zero on constant features

    def transform(self, X):
        return (X - self.mean) / self.std

    def fit_transform(self, X_train):
        self.fit(X_train)
        return self.transform(X_train)

Polynomial Features

Extend linear models to capture non-linear relationships:

[x_1, x_2] -> [1, x_1, x_2, x_1^2, x_1*x_2, x_2^2]

With degree p and d features: number of terms = C(d+p, p). Grows fast -- combine with regularization.
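A dependency-free sketch of the expansion, enumerating monomials with `itertools.combinations_with_replacement` (the column ordering is one reasonable choice, not a fixed convention):

```python
import numpy as np
from itertools import combinations_with_replacement
from math import comb

def polynomial_features(X, degree):
    # All monomials of the input columns up to the given degree,
    # including the constant term; total count is C(d + degree, degree).
    n, d = X.shape
    cols = []
    for p in range(degree + 1):
        for idx in combinations_with_replacement(range(d), p):
            col = np.ones(n)
            for j in idx:
                col = col * X[:, j]   # multiply in each chosen feature
            cols.append(col)
    return np.column_stack(cols)
```

For [x_1, x_2] at degree 2 this produces the six columns listed above.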

Comparison of Linear Models

| Model               | Loss          | Regularization | Output      | Sparsity |
|---------------------|---------------|----------------|-------------|----------|
| Linear Regression   | MSE           | None           | Continuous  | No       |
| Ridge               | MSE + L2      | L2             | Continuous  | No       |
| Lasso               | MSE + L1      | L1             | Continuous  | Yes      |
| Elastic Net         | MSE + L1 + L2 | L1 + L2        | Continuous  | Yes      |
| Logistic Regression | Cross-entropy | Optional       | Probability | Optional |
| Perceptron          | Hinge-like    | None           | Binary      | No       |
| LDA                 | Generative    | None           | Class label | No       |