Machine Learning Foundations

Learning Paradigms
Supervised Learning
Learn a mapping f: X -> Y from labeled training data {(x_i, y_i)}.
- Classification: y is discrete (spam detection, image recognition)
- Regression: y is continuous (price prediction, temperature forecasting)
The goal is to minimize expected risk over the true data distribution:
R(f) = E_{(x,y)~P}[L(f(x), y)]
Since P is unknown, we minimize empirical risk over training data:
R_emp(f) = (1/n) * sum_{i=1}^{n} L(f(x_i), y_i)
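The empirical risk above is just an average of per-example losses. A minimal sketch with squared loss (the function names here are illustrative, not from any library):

```python
import numpy as np

def empirical_risk(f, X, y, loss=lambda pred, target: (pred - target) ** 2):
    """Average loss of predictor f over the training sample."""
    preds = np.array([f(x) for x in X])
    return np.mean(loss(preds, np.array(y)))

# Over constant predictors, the training mean minimizes empirical
# squared-error risk; its risk equals the variance of y.
X = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
mean_predictor = lambda x: np.mean(y)
print(empirical_risk(mean_predictor, X, y))  # 5.0 (variance of y)
```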
Unsupervised Learning
Find structure in unlabeled data {x_i}. No target variable.
- Clustering: group similar points (K-means, DBSCAN)
- Dimensionality reduction: compress features (PCA, t-SNE)
- Density estimation: model P(x) (GMM, KDE)
- Anomaly detection: identify outliers (isolation forest)
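To make clustering concrete, here is a minimal sketch of K-means (Lloyd's algorithm); the function name and initialization scheme are illustrative choices, not a reference implementation:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On well-separated data the algorithm converges in a few iterations; in general it only finds a local optimum, which is why practical implementations restart from multiple initializations.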
Semi-Supervised Learning
Combine small labeled set with large unlabeled set. Relies on assumptions:
- Smoothness: nearby points share labels
- Cluster: points in the same cluster share labels
- Manifold: data lies on a lower-dimensional manifold
Examples: label propagation, self-training, MixMatch.
Self-Supervised Learning
Generate supervisory signal from the data itself. The model solves a pretext task to learn representations useful for downstream tasks.
- Contrastive: SimCLR, MoCo (learn similar/dissimilar pairs)
- Masked prediction: BERT (masked tokens), MAE (masked patches)
- Next-token prediction: GPT-style autoregressive pretraining
Reinforcement Learning
An agent interacts with an environment to maximize cumulative reward.
Agent -> Action a_t -> Environment -> (State s_{t+1}, Reward r_t) -> Agent
Key distinction: delayed rewards, exploration-exploitation tradeoff.
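The interaction loop above can be sketched with tabular Q-learning on a toy problem. The environment (a 5-state chain with reward only at the goal) and all names are invented for illustration; the update rule is the standard Q-learning Bellman update:

```python
import numpy as np

# Toy deterministic chain: states 0..4, actions 0 (left) / 1 (right);
# reward 1 only for reaching the terminal state 4.
N_STATES, GOAL = 5, 4

def step(s, a):
    s2 = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == GOAL)

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.3, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, 2))
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            # Epsilon-greedy: explore with probability eps, else exploit
            a = rng.integers(2) if rng.random() < eps else int(Q[s].argmax())
            s2, r = step(s, a)
            # Bellman update toward the sampled target r + gamma * max Q(s')
            Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
            s = s2
    return Q

Q = q_learning()
# The learned greedy policy moves right in every non-terminal state.
```

Note how the reward for reaching the goal propagates backward through the Q-values over episodes; this is the "delayed reward" problem in miniature.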
Bias-Variance Tradeoff
For a model trained on dataset D, predicting target y = f(x) + epsilon where epsilon ~ N(0, sigma^2):
E_D[(y - f_hat(x))^2] = Bias(f_hat(x))^2 + Var(f_hat(x)) + sigma^2
                        \______________/   \___________/   \_____/
                          underfitting      overfitting   irreducible
Where:
Bias(f_hat(x)) = E_D[f_hat(x)] - f(x)                # systematic error
Var(f_hat(x))  = E_D[(f_hat(x) - E_D[f_hat(x)])^2]   # sensitivity to the training set
| Property       | High Bias     | High Variance         |
|----------------|---------------|-----------------------|
| Model          | Too simple    | Too complex           |
| Training error | High          | Low                   |
| Test error     | High          | High                  |
| Fix            | More features | More data, regularize |
Regularization controls complexity to navigate this tradeoff: L1, L2, dropout, early stopping, data augmentation.
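As a concrete instance of L2 regularization, ridge regression has a closed form: w = (X^T X + lam * I)^(-1) X^T y. A minimal sketch (function name and test data are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form L2-regularized least squares:
    w = (X^T X + lam * I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Larger lam shrinks the weights toward zero, trading variance for bias.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)
w_small = ridge_fit(X, y, lam=0.01)
w_large = ridge_fit(X, y, lam=100.0)
# ||w_large|| < ||w_small||: stronger regularization, smaller weights
```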
Cross-Validation
K-Fold Cross-Validation
import numpy as np

def k_fold_cv(X, y, make_model, k=5, seed=0):
    # Shuffle indices and split them into k roughly equal folds
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        model = make_model()                      # fresh model each fold
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[val_idx], y[val_idx]))
    return np.mean(scores), np.std(scores)
Variants
| Method              | Use Case                              |
|---------------------|---------------------------------------|
| Hold-out            | Large datasets, quick iteration       |
| K-Fold (k=5,10)     | Standard practice                     |
| Stratified K-Fold   | Imbalanced classification             |
| Leave-One-Out (LOO) | Very small datasets (high variance)   |
| Time-Series Split   | Temporal data (no future leakage)     |
| Group K-Fold        | Grouped data (e.g., patients)         |
Nested Cross-Validation
Outer loop evaluates model performance; inner loop tunes hyperparameters. Prevents optimistic bias from tuning on the test fold.
For each outer fold:
    For each inner fold:
        Tune hyperparameters
    Train with best hyperparams on the outer training set
    Evaluate on the outer test fold
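The nested loop above can be sketched in plain NumPy. This is a minimal illustration, not a library API: `fit_score(hp, X_tr, y_tr, X_te, y_te) -> score` and the candidate list `hyperparams` are assumed interfaces invented for this example.

```python
import numpy as np

def nested_cv(X, y, hyperparams, fit_score, k_outer=5, k_inner=3, seed=0):
    """Outer loop estimates performance; inner loop selects hyperparameters
    using only the outer training data, so the outer test fold stays clean."""
    rng = np.random.default_rng(seed)
    outer = np.array_split(rng.permutation(len(X)), k_outer)
    outer_scores = []
    for i in range(k_outer):
        test_idx = outer[i]
        train_idx = np.concatenate(outer[:i] + outer[i + 1:])
        # Inner loop: mean inner-CV score for each hyperparameter candidate
        inner = np.array_split(rng.permutation(train_idx), k_inner)
        def inner_score(hp):
            return np.mean([
                fit_score(hp,
                          X[np.concatenate(inner[:j] + inner[j + 1:])],
                          y[np.concatenate(inner[:j] + inner[j + 1:])],
                          X[inner[j]], y[inner[j]])
                for j in range(k_inner)])
        best_hp = max(hyperparams, key=inner_score)
        # Outer loop: evaluate the winner on the held-out outer fold
        outer_scores.append(fit_score(best_hp, X[train_idx], y[train_idx],
                                      X[test_idx], y[test_idx]))
    return np.mean(outer_scores)
```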
Evaluation Metrics
Classification Metrics
From the confusion matrix (TP, FP, TN, FN):
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP) # of predicted positives, how many correct?
Recall = TP / (TP + FN) # of actual positives, how many found?
F1 = 2 * P * R / (P + R) # harmonic mean of precision and recall
F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
When to use what:
- Accuracy: balanced classes only
- Precision: cost of false positives is high (spam filter)
- Recall: cost of false negatives is high (cancer screening)
- F1: balanced tradeoff between precision and recall
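The formulas above translate directly into code; a minimal sketch (function name is illustrative, with guards for empty denominators):

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F1 from raw confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1

# e.g. 80 TP, 20 FP, 890 TN, 10 FN on an imbalanced problem:
acc, p, r, f1 = classification_metrics(80, 20, 890, 10)
# acc = 0.97 looks great, but precision = 0.8 and recall ~ 0.889
# tell a more honest story -- hence the "balanced classes only" caveat.
```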
ROC and AUC
ROC curve plots TPR (recall) vs FPR (= FP / (FP + TN)) at varying thresholds.
- AUC = 1.0: perfect classifier
- AUC = 0.5: random guessing
- AUC is threshold-independent and invariant to class imbalance
Precision-Recall curve: preferred when classes are heavily imbalanced (use AUPRC instead of AUROC).
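AUROC has a useful probabilistic reading: it equals the probability that a randomly chosen positive is scored above a randomly chosen negative (ties counted half). A minimal sketch computing it directly from that definition (function name is illustrative; quadratic in sample size, so only for small examples):

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUROC as P(score(random positive) > score(random negative)),
    with ties counted as half a win."""
    pairs = [(p, n) for p in pos_scores for n in neg_scores]
    wins = sum(p > n for p, n in pairs) + 0.5 * sum(p == n for p, n in pairs)
    return wins / len(pairs)

print(auc_from_scores([0.9, 0.8], [0.2, 0.1]))  # 1.0: perfect separation
print(auc_from_scores([0.5], [0.5]))            # 0.5: indistinguishable
```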
Regression Metrics
MSE = (1/n) * sum(y_i - y_hat_i)^2 # penalizes large errors
RMSE = sqrt(MSE) # same units as y
MAE = (1/n) * sum|y_i - y_hat_i| # robust to outliers
R^2 = 1 - SS_res / SS_tot # proportion of variance explained
MAPE = (100/n) * sum|y_i - y_hat_i| / |y_i| # percentage error (undefined when y_i = 0)
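The regression metrics above, computed in one pass (function name is illustrative):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    # R^2 = 1 - SS_res / SS_tot
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}

m = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
# errors are 1, 0, -2 -> MAE = 1.0, MSE = 5/3; the squared metric
# weighs the single large error more heavily than MAE does.
```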
Multi-Class Extensions
- Macro-average: compute metric per class, take mean (treats all classes equally)
- Micro-average: aggregate TP/FP/FN globally, then compute (biased toward frequent classes)
- Weighted: macro weighted by class support
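Macro vs micro averaging can differ sharply under imbalance. A minimal sketch for precision (function names are illustrative; one-vs-rest counts per class):

```python
import numpy as np

def macro_micro_precision(y_true, y_pred, classes):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = []
    tp_sum = fp_sum = 0
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        per_class.append(tp / (tp + fp) if tp + fp else 0.0)
        tp_sum += tp
        fp_sum += fp
    macro = float(np.mean(per_class))       # every class weighted equally
    micro = tp_sum / (tp_sum + fp_sum)      # every prediction weighted equally
    return macro, micro

macro, micro = macro_micro_precision([0, 0, 0, 0, 1], [0, 0, 0, 1, 1], [0, 1])
# macro = (1.0 + 0.5) / 2 = 0.75; micro = 4/5 = 0.8
```

For single-label multi-class problems, micro-averaged precision, recall, and F1 all equal plain accuracy, which is why macro-F1 is usually reported when minority classes matter.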
The No Free Lunch Theorem
Theorem (Wolpert, 1996): Averaged over all possible data-generating distributions, no learning algorithm is universally better than any other.
Implications:
- There is no single "best" algorithm -- domain knowledge matters
- Assumptions about the data distribution are necessary and inescapable
- An algorithm that excels on one problem class must perform worse on another
- Model selection (cross-validation, domain expertise) is essential
This does NOT mean all algorithms are equal in practice. Real-world distributions are not uniform -- structured data has patterns that certain algorithms exploit better than others.
PAC Learning Framework
A concept class C is PAC-learnable if there exists an algorithm A such that for any epsilon > 0, delta > 0, and distribution D over X:
P[error(h) <= epsilon] >= 1 - delta
using a number of samples polynomial in 1/epsilon, 1/delta, and problem size.
Sample complexity for finite hypothesis class H:
m >= (1/epsilon) * (ln|H| + ln(1/delta))
This formalizes how much data is needed for reliable generalization.
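Plugging numbers into the bound makes it tangible. A sketch for the finite, realizable case stated above (the hypothesis-class size here is an arbitrary example):

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    """m >= (1/epsilon) * (ln|H| + ln(1/delta)) for a finite
    hypothesis class in the realizable setting."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)

# e.g. |H| = 2**20 hypotheses, epsilon = 0.1, delta = 0.05:
print(pac_sample_bound(2 ** 20, 0.1, 0.05))  # 169 samples suffice
```

Note the logarithmic dependence on |H|: doubling the hypothesis space adds only a constant number of required samples.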
VC Dimension
The VC dimension of a hypothesis class H is the size of the largest set of points that H can shatter (i.e., realize all 2^n possible labelings of those n points).
- Linear classifier in R^d: VC dim = d + 1
- Higher VC dim means more expressive but needs more data
- Generalization bound: with probability 1 - delta:
R(f) <= R_emp(f) + O(sqrt((VC(H) * ln(n) + ln(1/delta)) / n))
Practical Checklist
- Define the problem: classification, regression, ranking, generation?
- Collect and explore data: distributions, missing values, class balance
- Establish a baseline: simple model (majority class, linear regression)
- Feature engineering: domain knowledge, transformations
- Model selection: cross-validation, appropriate metrics
- Hyperparameter tuning: random/Bayesian search with nested CV
- Error analysis: where does the model fail? Bias or variance?
- Iterate: better features, more data, different architecture