Tree-Based Methods

Decision Trees

Recursively partition the feature space using axis-aligned splits.

       [Feature x2 < 3.5?]
        /              \
    [x1 < 2.0?]     Class B
     /        \
  Class A   Class B

Splitting Criteria

Information Gain (ID3):

IG(S, A) = H(S) - sum_{v in values(A)} (|S_v| / |S|) * H(S_v)

H(S) = -sum_{k} p_k * log2(p_k)     # entropy

Gain Ratio (C4.5): normalizes by split information to avoid bias toward multi-valued attributes.

GainRatio(S, A) = IG(S, A) / SplitInfo(S, A)
SplitInfo = -sum_{v} (|S_v|/|S|) * log2(|S_v|/|S|)

Gini Impurity (CART):

Gini(S) = 1 - sum_{k} p_k^2

For binary splits: Gini = 2 * p * (1 - p), maximized at p = 0.5.

Regression: minimize variance reduction or MSE at each split.

Building a Decision Tree

def build_tree(X, y, max_depth, min_samples):
    if max_depth == 0 or len(y) < min_samples or is_pure(y):
        return LeafNode(prediction=aggregate(y))

    best_feature, best_threshold = find_best_split(X, y)
    if best_feature is None:
        return LeafNode(prediction=aggregate(y))

    left_mask = X[:, best_feature] <= best_threshold
    left = build_tree(X[left_mask], y[left_mask], max_depth-1, min_samples)
    right = build_tree(X[~left_mask], y[~left_mask], max_depth-1, min_samples)
    return SplitNode(best_feature, best_threshold, left, right)

def find_best_split(X, y):
    best_gain, best_feat, best_thresh = -inf, None, None
    for feature in range(X.shape[1]):
        thresholds = unique_sorted(X[:, feature])
        for thresh in midpoints(thresholds):
            gain = compute_gain(y, X[:, feature] <= thresh)
            if gain > best_gain:
                best_gain, best_feat, best_thresh = gain, feature, thresh
    return best_feat, best_thresh

ID3 vs C4.5 vs CART

Feature	ID3	C4.5	CART
Split type	Multi-way	Multi-way	Binary only
Criterion	Info gain	Gain ratio	Gini / MSE
Feature types	Categorical	Cat + continuous	Cat + continuous
Missing values	No	Yes (surrogate)	Yes (surrogate)
Pruning	No	Error-based	Cost-complexity

Pruning

Pre-pruning (early stopping): limit max_depth, min_samples_leaf, min_impurity_decrease.

Post-pruning (cost-complexity / minimal cost-complexity):

R_alpha(T) = R(T) + alpha * |T|

Where |T| = number of leaves, R(T) = training error. Increase alpha from 0 and prune subtrees when a single leaf achieves lower R_alpha than the subtree. Select alpha via cross-validation.

Random Forests

Ensemble of decorrelated decision trees using bagging + feature randomization.

def random_forest(X, y, n_trees=100, max_features="sqrt"):
    trees = []
    for _ in range(n_trees):
        # Bootstrap sample
        indices = np.random.choice(len(X), size=len(X), replace=True)
        X_boot, y_boot = X[indices], y[indices]

        # Train tree with random feature subsets at each split
        tree = build_tree(X_boot, y_boot, max_features=max_features)
        trees.append(tree)
    return trees

def predict(trees, x):
    predictions = [tree.predict(x) for tree in trees]
    return majority_vote(predictions)  # or mean for regression

Key hyperparameters:

n_trees: more is generally better (diminishing returns)
max_features: sqrt(d) for classification, d/3 for regression
max_depth: often unlimited (each tree overfits, ensemble averages out)

Out-of-Bag (OOB) Error: each tree uses ~63.2% of data; evaluate on the remaining ~36.8%. Free cross-validation estimate.

Why Random Forests Work

For B trees with pairwise correlation rho and individual variance sigma^2:

Var(ensemble) = rho * sigma^2 + (1 - rho) * sigma^2 / B

Feature randomization reduces rho; bagging reduces variance through averaging. As B -> infinity, variance is bounded by rho * sigma^2.

Gradient Boosting

Build an additive model by fitting each new tree to the negative gradient (residuals) of the loss:

F_0(x) = argmin_gamma sum L(y_i, gamma)
For m = 1 to M:
    r_im = -dL(y_i, F_{m-1}(x_i)) / dF_{m-1}(x_i)    # pseudo-residuals
    Fit tree h_m to targets r_im
    F_m(x) = F_{m-1}(x) + eta * h_m(x)                 # eta = learning rate

For MSE loss, pseudo-residuals are simply y_i - F_{m-1}(x_i).

XGBoost

Key innovations:

Regularized objective: L(w) + Omega(tree) where Omega = gamma*T + (1/2)lambda||w||^2
Second-order approximation: uses both gradient g_i and Hessian h_i

Optimal leaf weight: w_j* = -sum(g_i) / (sum(h_i) + lambda)
Optimal split gain: Gain = (1/2) * [G_L^2/(H_L+lambda) + G_R^2/(H_R+lambda)
                            - (G_L+G_R)^2/(H_L+H_R+lambda)] - gamma

Sparsity-aware: native handling of missing values (learns optimal direction)
Column block: sorted feature values for efficient split finding
Weighted quantile sketch: approximate split finding for distributed training

LightGBM

Leaf-wise growth (vs level-wise): grows the leaf with highest gain first -- faster convergence but risk of overfitting on small data
GOSS (Gradient-based One-Side Sampling): keeps all high-gradient instances, samples low-gradient ones
EFB (Exclusive Feature Bundling): bundles mutually exclusive sparse features
Histogram-based: bins continuous features into discrete buckets (O(n) vs O(n log n))

CatBoost

Ordered boosting: uses permutation-based scheme to prevent target leakage (prediction shift)
Symmetric trees: balanced (oblivious) trees as weak learners
Native categorical support: ordered target statistics with random permutations

AdaBoost

Adaptive Boosting: re-weight samples based on previous errors.

def adaboost(X, y, n_estimators=50):
    n = len(y)
    weights = np.ones(n) / n
    models, alphas = [], []

    for _ in range(n_estimators):
        model = fit_weak_learner(X, y, sample_weights=weights)
        predictions = model.predict(X)

        err = np.sum(weights * (predictions != y)) / np.sum(weights)
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))

        weights *= np.exp(-alpha * y * predictions)
        weights /= np.sum(weights)

        models.append(model)
        alphas.append(alpha)

    return models, alphas

AdaBoost minimizes exponential loss. Equivalent to forward stagewise additive modeling.

Stacking (Stacked Generalization)

Train a meta-learner on the outputs of base models:

Level 0: Train models M1, M2, ..., Mk on training data
         Generate predictions P1, P2, ..., Pk using cross-validation
Level 1: Train meta-model on [P1, P2, ..., Pk] -> y

Use cross-validated predictions for level 0 to avoid overfitting. The meta-learner (often logistic regression or linear model) learns optimal combination weights.

Comparison

Method	Bias	Variance	Interpretable	Speed
Single tree	High	High	Yes	Fast
Random Forest	Low	Low	Partial	Fast
Gradient Boost	Low	Low	No	Moderate
AdaBoost	Low	Moderate	No	Moderate

Feature Importance

Impurity-based: sum of impurity decrease weighted by samples reaching each node (biased toward high-cardinality features)
Permutation importance: measure accuracy drop when feature values are shuffled (model-agnostic, more reliable)
SHAP values: game-theoretic attribution (see TreeSHAP for exact computation in O(TLD^2))