
Tree-Based Methods

Decision Trees

Recursively partition the feature space using axis-aligned splits.

       [Feature x2 < 3.5?]
        /              \
    [x1 < 2.0?]     Class B
     /        \
  Class A   Class B

Splitting Criteria

Information Gain (ID3):

IG(S, A) = H(S) - sum_{v in values(A)} (|S_v| / |S|) * H(S_v)

H(S) = -sum_{k} p_k * log2(p_k)     # entropy

Gain Ratio (C4.5): normalizes by split information to avoid bias toward multi-valued attributes.

GainRatio(S, A) = IG(S, A) / SplitInfo(S, A)
SplitInfo = -sum_{v} (|S_v|/|S|) * log2(|S_v|/|S|)

Gini Impurity (CART):

Gini(S) = 1 - sum_{k} p_k^2

For binary splits: Gini = 2 * p * (1 - p), maximized at p = 0.5.

Regression trees: choose the split that maximizes variance reduction, i.e. minimizes the weighted MSE of the two children.
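The criteria above are a few lines each; a minimal NumPy sketch (helper names are mine, not from any library):

```python
import numpy as np

def entropy(y):
    # H(S) = -sum_k p_k * log2(p_k) over the class proportions
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(y):
    # Gini(S) = 1 - sum_k p_k^2
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(y, left_mask):
    # IG for a binary split: parent entropy minus the size-weighted
    # entropy of the two children (left_mask selects the left child)
    n = len(y)
    left, right = y[left_mask], y[~left_mask]
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(y) - children
```

For a balanced binary sample, entropy is 1 bit and Gini is 0.5, the respective maxima.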

Building a Decision Tree

def build_tree(X, y, max_depth, min_samples):
    if max_depth == 0 or len(y) < min_samples or is_pure(y):
        return LeafNode(prediction=aggregate(y))

    best_feature, best_threshold = find_best_split(X, y)
    if best_feature is None:
        return LeafNode(prediction=aggregate(y))

    left_mask = X[:, best_feature] <= best_threshold
    left = build_tree(X[left_mask], y[left_mask], max_depth-1, min_samples)
    right = build_tree(X[~left_mask], y[~left_mask], max_depth-1, min_samples)
    return SplitNode(best_feature, best_threshold, left, right)

def find_best_split(X, y):
    # Accept a split only if it improves on the parent (gain > 0);
    # otherwise return (None, None) so the caller makes a leaf.
    best_gain, best_feat, best_thresh = 0.0, None, None
    for feature in range(X.shape[1]):
        values = unique_sorted(X[:, feature])
        for thresh in midpoints(values):
            gain = compute_gain(y, X[:, feature] <= thresh)
            if gain > best_gain:
                best_gain, best_feat, best_thresh = gain, feature, thresh
    return best_feat, best_thresh

ID3 vs C4.5 vs CART

| Feature        | ID3         | C4.5             | CART             |
|----------------|-------------|------------------|------------------|
| Split type     | Multi-way   | Multi-way        | Binary only      |
| Criterion      | Info gain   | Gain ratio       | Gini / MSE       |
| Feature types  | Categorical | Cat + continuous | Cat + continuous |
| Missing values | No          | Yes (fractional) | Yes (surrogate)  |
| Pruning        | No          | Error-based      | Cost-complexity  |

Pruning

Pre-pruning (early stopping): limit max_depth, min_samples_leaf, min_impurity_decrease.

Post-pruning (cost-complexity / minimal cost-complexity):

R_alpha(T) = R(T) + alpha * |T|

Where |T| = number of leaves, R(T) = training error. Increase alpha from 0 and prune subtrees when a single leaf achieves lower R_alpha than the subtree. Select alpha via cross-validation.
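A sketch of selecting alpha by cross-validation, assuming scikit-learn is available (its `ccp_alpha` parameter implements this R_alpha criterion):

```python
# Minimal cost-complexity pruning with scikit-learn (an assumption:
# any dataset with a fit/predict-compatible shape works the same way).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Enumerate the alphas at which successive subtrees get pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Keep the alpha whose pruned tree cross-validates best.
best_alpha = max(
    path.ccp_alphas,
    key=lambda a: cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5
    ).mean(),
)
```

Refitting with `ccp_alpha=best_alpha` gives the pruned tree.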

Random Forests

Ensemble of decorrelated decision trees using bagging + feature randomization.

def random_forest(X, y, n_trees=100, max_features="sqrt"):
    trees = []
    for _ in range(n_trees):
        # Bootstrap sample
        indices = np.random.choice(len(X), size=len(X), replace=True)
        X_boot, y_boot = X[indices], y[indices]

        # Train a tree that samples max_features candidate features at
        # each split (a variant of build_tree above that also takes
        # max_features)
        tree = build_tree(X_boot, y_boot, max_features=max_features)
        trees.append(tree)
    return trees

def predict(trees, x):
    predictions = [tree.predict(x) for tree in trees]
    return majority_vote(predictions)  # or mean for regression

Key hyperparameters:

  • n_trees: more is generally better (diminishing returns)
  • max_features: sqrt(d) for classification, d/3 for regression
  • max_depth: often unlimited (each tree overfits, ensemble averages out)

Out-of-Bag (OOB) Error: each tree uses ~63.2% of data; evaluate on the remaining ~36.8%. Free cross-validation estimate.
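The 63.2% figure follows from P(point excluded) = (1 - 1/n)^n -> e^{-1} ≈ 0.368; a quick simulation confirms it:

```python
import numpy as np

# Fraction of distinct original points appearing in a bootstrap sample.
rng = np.random.default_rng(0)
n, trials = 10_000, 20
fractions = []
for _ in range(trials):
    sample = rng.integers(0, n, size=n)          # sample n indices with replacement
    fractions.append(len(np.unique(sample)) / n)  # fraction of points included
mean_frac = np.mean(fractions)                    # ~ 1 - 1/e ~ 0.632
```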

Why Random Forests Work

For B trees with pairwise correlation rho and individual variance sigma^2:

Var(ensemble) = rho * sigma^2 + (1 - rho) * sigma^2 / B

Feature randomization reduces rho; bagging reduces variance through averaging. As B -> infinity, variance is bounded by rho * sigma^2.
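A quick numerical check of this variance formula, simulating equicorrelated unit-variance "tree predictions" (the construction below is a standard way to generate pairwise correlation rho):

```python
import numpy as np

rng = np.random.default_rng(1)
rho, B, n = 0.3, 50, 100_000

z = rng.standard_normal(n)                 # component shared by all trees
eps = rng.standard_normal((B, n))          # independent per-tree components
# Each row has variance 1 and pairwise correlation rho with the others.
trees = np.sqrt(rho) * z + np.sqrt(1 - rho) * eps
ensemble = trees.mean(axis=0)

empirical = ensemble.var()
predicted = rho + (1 - rho) / B            # formula above with sigma^2 = 1
```

With rho = 0.3 and B = 50, both come out near 0.314, well above the uncorrelated value 1/B = 0.02.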

Gradient Boosting

Build an additive model by fitting each new tree to the negative gradient (residuals) of the loss:

F_0(x) = argmin_gamma sum L(y_i, gamma)
For m = 1 to M:
    r_im = -dL(y_i, F_{m-1}(x_i)) / dF_{m-1}(x_i)    # pseudo-residuals
    Fit tree h_m to targets r_im
    F_m(x) = F_{m-1}(x) + eta * h_m(x)                 # eta = learning rate

For MSE loss, pseudo-residuals are simply y_i - F_{m-1}(x_i).
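The loop above, made runnable for squared loss with depth-1 stumps as weak learners (all helper names are mine):

```python
import numpy as np

def fit_stump(X, r):
    # Depth-1 regression tree: best (feature, threshold) pair by SSE,
    # predicting the residual mean in each half.
    best_sse, best = np.inf, None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            lm, rm = r[left].mean(), r[~left].mean()
            sse = ((r[left] - lm) ** 2).sum() + ((r[~left] - rm) ** 2).sum()
            if sse < best_sse:
                best_sse, best = sse, (j, t, lm, rm)
    return best

def stump_predict(stump, X):
    j, t, lm, rm = stump
    return np.where(X[:, j] <= t, lm, rm)

def gradient_boost_mse(X, y, M=100, eta=0.1):
    # F_0 is the mean (the argmin of squared loss over constants);
    # each round fits a stump to the pseudo-residuals y - F.
    F = np.full(len(y), y.mean())
    stumps = []
    for _ in range(M):
        stump = fit_stump(X, y - F)
        F = F + eta * stump_predict(stump, X)
        stumps.append(stump)
    return (y.mean(), eta, stumps)

def gb_predict(model, X):
    base, eta, stumps = model
    return base + eta * sum(stump_predict(s, X) for s in stumps)
```

On a simple step function the training residuals shrink by roughly a factor of (1 - eta) per round.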

XGBoost

Key innovations:

  • Regularized objective: L(w) + Omega(tree) where Omega = gamma*T + (1/2)lambda||w||^2
  • Second-order approximation: uses both gradient g_i and Hessian h_i
Optimal leaf weight: w_j* = -sum(g_i) / (sum(h_i) + lambda)
Optimal split gain: Gain = (1/2) * [G_L^2/(H_L+lambda) + G_R^2/(H_R+lambda)
                            - (G_L+G_R)^2/(H_L+H_R+lambda)] - gamma
  • Sparsity-aware: native handling of missing values (learns optimal direction)
  • Column block: sorted feature values for efficient split finding
  • Weighted quantile sketch: approximate split finding for distributed training
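The leaf-weight and gain formulas above are easy to verify numerically; a small sketch (function names are mine, not XGBoost's API):

```python
import numpy as np

def leaf_weight(g, h, lam):
    # w_j* = -G / (H + lambda), with G, H the summed gradients/Hessians
    return -g.sum() / (h.sum() + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    # Gain = (1/2)[G_L^2/(H_L+lam) + G_R^2/(H_R+lam)
    #              - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma
    def score(g, h):
        return g.sum() ** 2 / (h.sum() + lam)
    parent_g = np.concatenate([g_left, g_right])
    parent_h = np.concatenate([h_left, h_right])
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(parent_g, parent_h)) - gamma
```

For example, with G_L = -3, H_L = 2, G_R = 3, H_R = 1 and lambda = 1, gamma = 0, the parent score is 0 and the gain is 0.5 * (3 + 4.5) = 3.75.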

LightGBM

  • Leaf-wise growth (vs level-wise): grows the leaf with highest gain first -- faster convergence but risk of overfitting on small data
  • GOSS (Gradient-based One-Side Sampling): keeps all high-gradient instances, samples low-gradient ones
  • EFB (Exclusive Feature Bundling): bundles mutually exclusive sparse features
  • Histogram-based: bins continuous features into discrete buckets (O(n) vs O(n log n))

CatBoost

  • Ordered boosting: uses permutation-based scheme to prevent target leakage (prediction shift)
  • Symmetric trees: balanced (oblivious) trees as weak learners
  • Native categorical support: ordered target statistics with random permutations

AdaBoost

Adaptive Boosting: re-weight samples based on previous errors. Labels are assumed to be in {-1, +1}.

def adaboost(X, y, n_estimators=50):
    n = len(y)
    weights = np.ones(n) / n
    models, alphas = [], []

    for _ in range(n_estimators):
        model = fit_weak_learner(X, y, sample_weights=weights)
        predictions = model.predict(X)

        err = np.sum(weights * (predictions != y)) / np.sum(weights)
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))

        weights *= np.exp(-alpha * y * predictions)  # relies on y in {-1, +1}
        weights /= np.sum(weights)

        models.append(model)
        alphas.append(alpha)

    return models, alphas

AdaBoost minimizes exponential loss. Equivalent to forward stagewise additive modeling.
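The returned (models, alphas) combine into the final classifier as the sign of the alpha-weighted vote; a small companion sketch (ConstantLearner is a hypothetical stand-in for a fitted weak learner):

```python
import numpy as np

class ConstantLearner:
    # Hypothetical stub: a "fitted" weak learner with fixed predictions.
    def __init__(self, preds):
        self.preds = np.asarray(preds, dtype=float)
    def predict(self, X):
        return self.preds

def adaboost_predict(models, alphas, X):
    # Final classifier: sign of the alpha-weighted vote in {-1, +1}
    score = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(score)
```

A higher-accuracy learner gets a larger alpha, so its vote dominates when the weaker learners disagree.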

Stacking (Stacked Generalization)

Train a meta-learner on the outputs of base models:

Level 0: Train models M1, M2, ..., Mk on training data
         Generate predictions P1, P2, ..., Pk using cross-validation
Level 1: Train meta-model on [P1, P2, ..., Pk] -> y

Use cross-validated predictions for level 0 to avoid overfitting. The meta-learner (often logistic regression or linear model) learns optimal combination weights.
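A minimal sketch of the level-0 out-of-fold step (MeanRegressor is a hypothetical stand-in; any fit/predict base model works):

```python
import numpy as np

class MeanRegressor:
    # Hypothetical stand-in base learner: predicts the training mean.
    def fit(self, X, y):
        self.mu = y.mean()
        return self
    def predict(self, X):
        return np.full(len(X), self.mu)

def oof_predictions(make_learners, X, y, k=5, seed=0):
    # Each base model's prediction for a point comes from a fold that
    # did not train on it, so the meta-model sees honest inputs.
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    meta = np.zeros((n, len(make_learners)))
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        for j, make in enumerate(make_learners):
            model = make().fit(X[train], y[train])
            meta[fold, j] = model.predict(X[fold])
    return meta  # feature matrix for the level-1 meta-model
```

The meta-model is then trained on `meta` against `y`, one column per base model.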

Comparison

| Method         | Bias | Variance | Interpretable | Speed    |
|----------------|------|----------|---------------|----------|
| Single tree    | High | High     | Yes           | Fast     |
| Random Forest  | Low  | Low      | Partial       | Fast     |
| Gradient Boost | Low  | Low      | No            | Moderate |
| AdaBoost       | Low  | Moderate | No            | Moderate |

Feature Importance

  • Impurity-based: sum of impurity decrease weighted by samples reaching each node (biased toward high-cardinality features)
  • Permutation importance: measure accuracy drop when feature values are shuffled (model-agnostic, more reliable)
  • SHAP values: game-theoretic attribution (see TreeSHAP for exact computation in O(TLD^2))