Linear Models

Linear Regression

Model: y = X * w + b, or in augmented form y = X_aug * w where X_aug includes a bias column.

Goal: minimize the sum of squared residuals.

L(w) = ||y - Xw||^2 = (y - Xw)^T (y - Xw)

Normal Equation (Closed-Form Solution)

Set gradient to zero:

dL/dw = -2 X^T (y - Xw) = 0
w* = (X^T X)^{-1} X^T y

Time complexity: O(n * d^2 + d^3) where n = samples, d = features
Requires X^T X to be invertible (fails when features are collinear)
Practical for d < ~10,000

def normal_equation(X, y):
    # X: (n, d), y: (n,)
    return np.linalg.solve(X.T @ X, X.T @ y)

Gradient Descent

Iteratively update weights in the direction of steepest descent:

w_{t+1} = w_t - alpha * dL/dw
        = w_t - alpha * (-2/n) * X^T (y - X * w_t)

def gradient_descent(X, y, lr=0.01, epochs=1000):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        gradient = -(2/n) * X.T @ (y - X @ w)
        w -= lr * gradient
    return w

Variants:

Batch GD: full dataset per step (stable, slow for large n)
Stochastic GD: single sample per step (noisy, fast)
Mini-batch GD: batch of B samples (practical compromise, B=32..512)

Convergence Conditions

For convex quadratic loss with Hessian eigenvalues lambda_min, lambda_max:

Convergence guaranteed if: lr < 2 / lambda_max
Optimal fixed lr: 2 / (lambda_min + lambda_max)
Condition number: kappa = lambda_max / lambda_min  (higher = slower)

Regularized Linear Regression

Ridge Regression (L2)

L_ridge(w) = ||y - Xw||^2 + lambda * ||w||_2^2

Closed-form: w* = (X^T X + lambda * I)^{-1} X^T y

Properties:

Shrinks coefficients toward zero but never exactly zero
Makes X^T X + lambda I always invertible (solves collinearity)
Equivalent to MAP estimate with Gaussian prior on w

Lasso Regression (L1)

L_lasso(w) = ||y - Xw||^2 + lambda * ||w||_1

Properties:

Produces sparse solutions (some w_j = 0 exactly)
Performs implicit feature selection
No closed-form; solved via coordinate descent or ISTA
Equivalent to MAP with Laplace prior on w

def coordinate_descent_lasso(X, y, lam, epochs=100):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for j in range(d):
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual
            w[j] = soft_threshold(rho, lam * n) / (X[:, j] @ X[:, j])
    return w

def soft_threshold(rho, threshold):
    if rho > threshold:    return rho - threshold
    elif rho < -threshold: return rho + threshold
    else:                  return 0.0

Elastic Net

Combines L1 and L2:

L_enet(w) = ||y - Xw||^2 + lambda_1 * ||w||_1 + lambda_2 * ||w||_2^2

Or equivalently with mixing ratio alpha in [0, 1]:

L = ||y - Xw||^2 + lambda * (alpha * ||w||_1 + (1-alpha) * ||w||_2^2)

alpha=1: pure Lasso; alpha=0: pure Ridge
Handles groups of correlated features better than Lasso alone

Logistic Regression

Binary classification: model P(y=1|x) using the sigmoid function.

P(y=1|x) = sigma(w^T x + b) = 1 / (1 + exp(-(w^T x + b)))

Loss Function (Binary Cross-Entropy)

L(w) = -(1/n) * sum[ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]

This is convex -- gradient descent finds the global minimum.

dL/dw = (1/n) * X^T (sigma(Xw) - y)

Note: same form as linear regression gradient but with sigma applied.

Multi-Class: Softmax Regression

For K classes, predict probability vector:

P(y=k|x) = exp(w_k^T x) / sum_{j=1}^{K} exp(w_j^T x)

Loss: categorical cross-entropy (negative log-likelihood).

Decision Boundary

Logistic regression produces a linear decision boundary:

w^T x + b = 0  defines the boundary

Points are classified by which side they fall on. The sigmoid output gives calibrated probabilities (a major advantage over SVMs).

The Perceptron

The simplest linear classifier. Binary output via sign function.

y_hat = sign(w^T x + b)

Perceptron Learning Algorithm

def perceptron(X, y, max_iter=1000):
    # y in {-1, +1}
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_iter):
        converged = True
        for i in range(len(X)):
            if y[i] * (w @ X[i] + b) <= 0:  # misclassified
                w += y[i] * X[i]
                b += y[i]
                converged = False
        if converged:
            break
    return w, b

Perceptron Convergence Theorem: if data is linearly separable with margin gamma, the algorithm converges in at most (R / gamma)^2 updates, where R = max ||x_i||.

Limitation: only works for linearly separable data (XOR problem).

Linear Discriminant Analysis (LDA)

Both a classifier and dimensionality reduction technique.

Fisher's Criterion

Find projection w that maximizes class separation:

J(w) = (w^T S_B w) / (w^T S_W w)

Where:

S_B = between-class scatter = (mu_1 - mu_2)(mu_1 - mu_2)^T
S_W = within-class scatter = sum of class covariance matrices

Solution: w* = S_W^{-1} (mu_1 - mu_2)

Assumptions

Each class is Gaussian with equal covariance
When covariances differ, use Quadratic Discriminant Analysis (QDA)

Multi-Class LDA

Projects onto at most (K-1) dimensions. Solves generalized eigenvalue problem:

S_B w = lambda * S_W w

Take eigenvectors corresponding to the (K-1) largest eigenvalues.

Feature Scaling

Linear models are sensitive to feature scales. Gradient descent converges faster with scaled features.

Methods

Method	Formula	Range	Use Case
Standardization	(x - mu) / sigma	~[-3, 3]	Default choice
Min-Max	(x - min) / (max - min)	[0, 1]	Bounded features
Robust scaling	(x - median) / IQR	varies	Outlier-heavy data
Log transform	log(1 + x)	varies	Right-skewed data

Critical rule: fit scaler on training data only, then apply to test data.

class StandardScaler:
    def fit(self, X_train):
        self.mean = X_train.mean(axis=0)
        self.std = X_train.std(axis=0)

    def transform(self, X):
        return (X - self.mean) / self.std

    def fit_transform(self, X_train):
        self.fit(X_train)
        return self.transform(X_train)

Polynomial Features

Extend linear models to capture non-linear relationships:

[x_1, x_2] -> [1, x_1, x_2, x_1^2, x_1*x_2, x_2^2]

With degree p and d features: number of terms = C(d+p, p). Grows fast -- combine with regularization.

Comparison of Linear Models

Model	Loss	Regularization	Output	Sparsity
Linear Regression	MSE	None	Continuous	No
Ridge	MSE + L2	L2	Continuous	No
Lasso	MSE + L1	L1	Continuous	Yes
Elastic Net	MSE + L1 + L2	L1 + L2	Continuous	Yes
Logistic Regression	Cross-entropy	Optional	Probability	Optional
Perceptron	Hinge-like	None	Binary	No
LDA	Generative	None	Class label	No