3 min read
On this page

Tree-Based Methods

Decision Trees

Recursively partition the feature space using axis-aligned splits.

       [Feature x2 < 3.5?]
        /              \
    [x1 < 2.0?]     Class B
     /        \
  Class A   Class B

Splitting Criteria

Information Gain (ID3):

IG(S, A) = H(S) - sum_{v in values(A)} (|S_v| / |S|) * H(S_v)

H(S) = -sum_{k} p_k * log2(p_k)     # entropy

Gain Ratio (C4.5): normalizes by split information to avoid bias toward multi-valued attributes.

GainRatio(S, A) = IG(S, A) / SplitInfo(S, A)
SplitInfo = -sum_{v} (|S_v|/|S|) * log2(|S_v|/|S|)

Gini Impurity (CART):

Gini(S) = 1 - sum_{k} p_k^2

For binary splits: Gini = 2 * p * (1 - p), maximized at p = 0.5.

Regression: minimize variance reduction or MSE at each split.

Building a Decision Tree

def build_tree(X, y, max_depth, min_samples):
    if max_depth == 0 or len(y) < min_samples or is_pure(y):
        return LeafNode(prediction=aggregate(y))

    best_feature, best_threshold = find_best_split(X, y)
    if best_feature is None:
        return LeafNode(prediction=aggregate(y))

    left_mask = X[:, best_feature] <= best_threshold
    left = build_tree(X[left_mask], y[left_mask], max_depth-1, min_samples)
    right = build_tree(X[~left_mask], y[~left_mask], max_depth-1, min_samples)
    return SplitNode(best_feature, best_threshold, left, right)

def find_best_split(X, y):
    best_gain, best_feat, best_thresh = -inf, None, None
    for feature in range(X.shape[1]):
        thresholds = unique_sorted(X[:, feature])
        for thresh in midpoints(thresholds):
            gain = compute_gain(y, X[:, feature] <= thresh)
            if gain > best_gain:
                best_gain, best_feat, best_thresh = gain, feature, thresh
    return best_feat, best_thresh

ID3 vs C4.5 vs CART

Feature ID3 C4.5 CART
Split type Multi-way Multi-way Binary only
Criterion Info gain Gain ratio Gini / MSE
Feature types Categorical Cat + continuous Cat + continuous
Missing values No Yes (surrogate) Yes (surrogate)
Pruning No Error-based Cost-complexity

Pruning

Pre-pruning (early stopping): limit max_depth, min_samples_leaf, min_impurity_decrease.

Post-pruning (cost-complexity / minimal cost-complexity):

R_alpha(T) = R(T) + alpha * |T|

Where |T| = number of leaves, R(T) = training error. Increase alpha from 0 and prune subtrees when a single leaf achieves lower R_alpha than the subtree. Select alpha via cross-validation.

Random Forests

Ensemble of decorrelated decision trees using bagging + feature randomization.

def random_forest(X, y, n_trees=100, max_features="sqrt"):
    trees = []
    for _ in range(n_trees):
        # Bootstrap sample
        indices = np.random.choice(len(X), size=len(X), replace=True)
        X_boot, y_boot = X[indices], y[indices]

        # Train tree with random feature subsets at each split
        tree = build_tree(X_boot, y_boot, max_features=max_features)
        trees.append(tree)
    return trees

def predict(trees, x):
    predictions = [tree.predict(x) for tree in trees]
    return majority_vote(predictions)  # or mean for regression

Key hyperparameters:

  • n_trees: more is generally better (diminishing returns)
  • max_features: sqrt(d) for classification, d/3 for regression
  • max_depth: often unlimited (each tree overfits, ensemble averages out)

Out-of-Bag (OOB) Error: each tree uses ~63.2% of data; evaluate on the remaining ~36.8%. Free cross-validation estimate.

Why Random Forests Work

For B trees with pairwise correlation rho and individual variance sigma^2:

Var(ensemble) = rho * sigma^2 + (1 - rho) * sigma^2 / B

Feature randomization reduces rho; bagging reduces variance through averaging. As B -> infinity, variance is bounded by rho * sigma^2.

Gradient Boosting

Build an additive model by fitting each new tree to the negative gradient (residuals) of the loss:

F_0(x) = argmin_gamma sum L(y_i, gamma)
For m = 1 to M:
    r_im = -dL(y_i, F_{m-1}(x_i)) / dF_{m-1}(x_i)    # pseudo-residuals
    Fit tree h_m to targets r_im
    F_m(x) = F_{m-1}(x) + eta * h_m(x)                 # eta = learning rate

For MSE loss, pseudo-residuals are simply y_i - F_{m-1}(x_i).

XGBoost

Key innovations:

  • Regularized objective: L(w) + Omega(tree) where Omega = gamma*T + (1/2)lambda||w||^2
  • Second-order approximation: uses both gradient g_i and Hessian h_i
Optimal leaf weight: w_j* = -sum(g_i) / (sum(h_i) + lambda)
Optimal split gain: Gain = (1/2) * [G_L^2/(H_L+lambda) + G_R^2/(H_R+lambda)
                            - (G_L+G_R)^2/(H_L+H_R+lambda)] - gamma
  • Sparsity-aware: native handling of missing values (learns optimal direction)
  • Column block: sorted feature values for efficient split finding
  • Weighted quantile sketch: approximate split finding for distributed training

LightGBM

  • Leaf-wise growth (vs level-wise): grows the leaf with highest gain first -- faster convergence but risk of overfitting on small data
  • GOSS (Gradient-based One-Side Sampling): keeps all high-gradient instances, samples low-gradient ones
  • EFB (Exclusive Feature Bundling): bundles mutually exclusive sparse features
  • Histogram-based: bins continuous features into discrete buckets (O(n) vs O(n log n))

CatBoost

  • Ordered boosting: uses permutation-based scheme to prevent target leakage (prediction shift)
  • Symmetric trees: balanced (oblivious) trees as weak learners
  • Native categorical support: ordered target statistics with random permutations

AdaBoost

Adaptive Boosting: re-weight samples based on previous errors.

def adaboost(X, y, n_estimators=50):
    n = len(y)
    weights = np.ones(n) / n
    models, alphas = [], []

    for _ in range(n_estimators):
        model = fit_weak_learner(X, y, sample_weights=weights)
        predictions = model.predict(X)

        err = np.sum(weights * (predictions != y)) / np.sum(weights)
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))

        weights *= np.exp(-alpha * y * predictions)
        weights /= np.sum(weights)

        models.append(model)
        alphas.append(alpha)

    return models, alphas

AdaBoost minimizes exponential loss. Equivalent to forward stagewise additive modeling.

Stacking (Stacked Generalization)

Train a meta-learner on the outputs of base models:

Level 0: Train models M1, M2, ..., Mk on training data
         Generate predictions P1, P2, ..., Pk using cross-validation
Level 1: Train meta-model on [P1, P2, ..., Pk] -> y

Use cross-validated predictions for level 0 to avoid overfitting. The meta-learner (often logistic regression or linear model) learns optimal combination weights.

Comparison

Method Bias Variance Interpretable Speed
Single tree High High Yes Fast
Random Forest Low Low Partial Fast
Gradient Boost Low Low No Moderate
AdaBoost Low Moderate No Moderate

Feature Importance

  • Impurity-based: sum of impurity decrease weighted by samples reaching each node (biased toward high-cardinality features)
  • Permutation importance: measure accuracy drop when feature values are shuffled (model-agnostic, more reliable)
  • SHAP values: game-theoretic attribution (see TreeSHAP for exact computation in O(TLD^2))