Tree-Based Methods
Decision Trees
Recursively partition the feature space using axis-aligned splits.
```
          [x2 < 3.5?]
          /         \
   [x1 < 2.0?]    Class B
    /       \
Class A   Class B
```
Splitting Criteria
Information Gain (ID3):
IG(S, A) = H(S) - sum_{v in values(A)} (|S_v| / |S|) * H(S_v)
H(S) = -sum_{k} p_k * log2(p_k) # entropy
Gain Ratio (C4.5): normalizes by split information to avoid bias toward multi-valued attributes.
GainRatio(S, A) = IG(S, A) / SplitInfo(S, A)
SplitInfo = -sum_{v} (|S_v|/|S|) * log2(|S_v|/|S|)
Gini Impurity (CART):
Gini(S) = 1 - sum_{k} p_k^2
For binary splits: Gini = 2 * p * (1 - p), maximized at p = 0.5.
Regression: choose the split that maximizes variance reduction (equivalently, minimizes the weighted MSE of the child nodes).
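The criteria above are simple to compute directly; a minimal plain-Python sketch (the helper names `entropy`, `gini`, and `information_gain` are ours, not from any library):

```python
from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum_k p_k * log2(p_k)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini(S) = 1 - sum_k p_k^2
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    # IG(S, A) = H(S) - sum_v (|S_v| / |S|) * H(S_v)
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)
```

A 50/50 binary node has entropy 1 bit and Gini 0.5; a split that separates the classes perfectly recovers the full bit of information gain.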
Building a Decision Tree
```python
def build_tree(X, y, max_depth, min_samples):
    # Stop when the depth budget is spent, the node is too small, or labels are pure
    if max_depth == 0 or len(y) < min_samples or is_pure(y):
        return LeafNode(prediction=aggregate(y))  # majority class / mean
    best_feature, best_threshold = find_best_split(X, y)
    if best_feature is None:  # no split improves impurity
        return LeafNode(prediction=aggregate(y))
    left_mask = X[:, best_feature] <= best_threshold
    left = build_tree(X[left_mask], y[left_mask], max_depth - 1, min_samples)
    right = build_tree(X[~left_mask], y[~left_mask], max_depth - 1, min_samples)
    return SplitNode(best_feature, best_threshold, left, right)
```
```python
def find_best_split(X, y):
    # Require strictly positive gain, so (None, None) is returned
    # when no split helps and the caller creates a leaf
    best_gain, best_feat, best_thresh = 0.0, None, None
    for feature in range(X.shape[1]):
        values = unique_sorted(X[:, feature])
        # Candidate thresholds are midpoints between consecutive observed values
        for thresh in midpoints(values):
            gain = compute_gain(y, X[:, feature] <= thresh)
            if gain > best_gain:
                best_gain, best_feat, best_thresh = gain, feature, thresh
    return best_feat, best_thresh
```
ID3 vs C4.5 vs CART
| Feature        | ID3         | C4.5                     | CART                   |
|----------------|-------------|--------------------------|------------------------|
| Split type     | Multi-way   | Multi-way                | Binary only            |
| Criterion      | Info gain   | Gain ratio               | Gini / MSE             |
| Feature types  | Categorical | Cat + continuous         | Cat + continuous       |
| Missing values | No          | Yes (fractional weights) | Yes (surrogate splits) |
| Pruning        | No          | Error-based              | Cost-complexity        |
Pruning
Pre-pruning (early stopping): limit max_depth, min_samples_leaf, min_impurity_decrease.
Post-pruning (cost-complexity / minimal cost-complexity):
R_alpha(T) = R(T) + alpha * |T|
Where |T| = number of leaves, R(T) = training error. Increase alpha from 0 and prune subtrees when a single leaf achieves lower R_alpha than the subtree. Select alpha via cross-validation.
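For a single subtree, the pruning decision reduces to finding the alpha at which a collapsed leaf ties with the subtree; a small sketch (the helper name `critical_alpha` is ours):

```python
def critical_alpha(r_leaf, r_subtree, n_leaves):
    # Collapsing a subtree with n_leaves leaves into one leaf wins once
    # R(leaf) + alpha * 1 <= R(subtree) + alpha * n_leaves, i.e. at
    # alpha >= (R(leaf) - R(subtree)) / (n_leaves - 1)
    return (r_leaf - r_subtree) / (n_leaves - 1)
```

For example, a 3-leaf subtree with training error 0.02 whose collapsed leaf has error 0.08 survives only while alpha < 0.03.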
Random Forests
Ensemble of decorrelated decision trees using bagging + feature randomization.
```python
import numpy as np

def random_forest(X, y, n_trees=100, max_features="sqrt"):
    trees = []
    for _ in range(n_trees):
        # Bootstrap sample: n rows drawn with replacement
        indices = np.random.choice(len(X), size=len(X), replace=True)
        X_boot, y_boot = X[indices], y[indices]
        # Train a tree that considers only a random feature subset at each
        # split (build_tree assumed extended to accept max_features)
        tree = build_tree(X_boot, y_boot, max_features=max_features)
        trees.append(tree)
    return trees

def predict(trees, x):
    predictions = [tree.predict(x) for tree in trees]
    return majority_vote(predictions)  # or mean for regression
```
Key hyperparameters:
- n_trees: more is generally better (diminishing returns)
- max_features: sqrt(d) for classification, d/3 for regression
- max_depth: often unlimited (each tree overfits, ensemble averages out)
Out-of-Bag (OOB) Error: each tree uses ~63.2% of data; evaluate on the remaining ~36.8%. Free cross-validation estimate.
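The 63.2% figure follows from the probability 1 - (1 - 1/n)^n -> 1 - 1/e that a given point appears in a bootstrap sample; a quick simulation confirms it:

```python
import random

random.seed(0)
n = 10_000
# One bootstrap sample: n draws with replacement
sample = [random.randrange(n) for _ in range(n)]
in_bag = len(set(sample)) / n  # fraction of distinct points the tree sees
# Expected: 1 - (1 - 1/n)^n -> 1 - 1/e ~ 0.632; the rest are out-of-bag
```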
Why Random Forests Work
For B trees with pairwise correlation rho and individual variance sigma^2:
Var(ensemble) = rho * sigma^2 + (1 - rho) * sigma^2 / B
Feature randomization reduces rho; bagging reduces variance through averaging. As B -> infinity, variance is bounded by rho * sigma^2.
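The variance formula can be evaluated directly to see both effects (the function name is ours):

```python
def ensemble_variance(rho, sigma2, B):
    # Variance of the mean of B predictors, each with variance sigma2
    # and pairwise correlation rho
    return rho * sigma2 + (1 - rho) * sigma2 / B
```

With rho = 0.3 and sigma2 = 1, a single tree has variance 1.0, 100 trees reach 0.307, and no number of trees beats the 0.3 floor; only reducing rho (more feature randomization) lowers the floor itself.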
Gradient Boosting
Build an additive model by fitting each new tree to the negative gradient (residuals) of the loss:
F_0(x) = argmin_gamma sum L(y_i, gamma)
For m = 1 to M:
r_im = -dL(y_i, F_{m-1}(x_i)) / dF_{m-1}(x_i) # pseudo-residuals
Fit tree h_m to targets r_im
F_m(x) = F_{m-1}(x) + eta * h_m(x) # eta = learning rate
For MSE loss, pseudo-residuals are simply y_i - F_{m-1}(x_i).
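A minimal, self-contained sketch of this loop for MSE loss on 1-D data, using depth-1 stumps as weak learners (all names here are illustrative, not from any library):

```python
def fit_stump(x, residuals):
    # Best single-threshold split minimizing squared error (a depth-1 "tree")
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, n_rounds=50, eta=0.1):
    f0 = sum(y) / len(y)                     # F_0: best constant under MSE
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]  # pseudo-residuals for MSE
        h = fit_stump(x, resid)
        stumps.append(h)
        pred = [pi + eta * h(xi) for pi, xi in zip(pred, x)]
    return lambda xi: f0 + eta * sum(h(xi) for h in stumps)
```

On a simple step function (x = 1..6, y = 1,1,1,5,5,5), each round shrinks the remaining residual by a factor of (1 - eta), so 50 rounds land within about 0.01 of the targets.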
XGBoost
Key innovations:
- Regularized objective: L(w) + Omega(tree) where Omega = gamma*T + (1/2)lambda||w||^2
- Second-order approximation: uses both gradient g_i and Hessian h_i
Optimal leaf weight: w_j* = -sum(g_i) / (sum(h_i) + lambda)
Optimal split gain: Gain = (1/2) * [G_L^2/(H_L+lambda) + G_R^2/(H_R+lambda) - (G_L+G_R)^2/(H_L+H_R+lambda)] - gamma
- Sparsity-aware: native handling of missing values (learns optimal direction)
- Column block: sorted feature values for efficient split finding
- Weighted quantile sketch: approximate split finding for distributed training
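The leaf-weight and gain formulas can be checked numerically. A small sketch under squared-error loss, where g_i = F(x_i) - y_i and h_i = 1 (function names are ours):

```python
def leaf_weight(g, h, lam):
    # w_j* = -G_j / (H_j + lambda)
    return -sum(g) / (sum(h) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    def score(g, h):  # G^2 / (H + lambda): the structure score of a leaf
        return sum(g) ** 2 / (sum(h) + lam)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma
```

With current predictions 0 and targets (2, 2, -2, -2), the gradients are g = (-2, -2, 2, 2): the parent's gradients cancel (score 0), while splitting them apart yields a large positive gain.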
LightGBM
- Leaf-wise growth (vs level-wise): grows the leaf with highest gain first -- faster convergence but risk of overfitting on small data
- GOSS (Gradient-based One-Side Sampling): keeps all high-gradient instances, samples low-gradient ones
- EFB (Exclusive Feature Bundling): bundles mutually exclusive sparse features
- Histogram-based: bins continuous features into discrete buckets, so each split scan costs O(#bins) per feature after a linear binning pass, versus O(n log n) for sort-based split finding
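A sketch of the binning step that histogram methods rely on, using equal-frequency edges (one of several binning strategies; the helper name is ours):

```python
def bin_edges(values, n_bins):
    # Equal-frequency edges: a split scan then considers n_bins - 1
    # candidate thresholds instead of every distinct feature value
    s = sorted(values)
    return [s[(len(s) * k) // n_bins] for k in range(1, n_bins)]
```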
CatBoost
- Ordered boosting: uses permutation-based scheme to prevent target leakage (prediction shift)
- Symmetric trees: balanced (oblivious) trees as weak learners
- Native categorical support: ordered target statistics with random permutations
AdaBoost
Adaptive Boosting: re-weight samples based on previous errors.
```python
import numpy as np

def adaboost(X, y, n_estimators=50):
    # Assumes labels y are in {-1, +1}
    n = len(y)
    weights = np.ones(n) / n
    models, alphas = [], []
    for _ in range(n_estimators):
        model = fit_weak_learner(X, y, sample_weights=weights)
        predictions = model.predict(X)
        # Weighted error of this round's weak learner
        err = np.sum(weights * (predictions != y)) / np.sum(weights)
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
        # Misclassified samples (y * pred = -1) are up-weighted,
        # correct ones down-weighted
        weights *= np.exp(-alpha * y * predictions)
        weights /= np.sum(weights)
        models.append(model)
        alphas.append(alpha)
    return models, alphas
```
AdaBoost minimizes exponential loss. Equivalent to forward stagewise additive modeling.
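The alpha formula assigns zero weight to a chance-level learner and increasing weight as error falls; a small check (the helper name is ours):

```python
from math import log

def estimator_weight(err, eps=1e-10):
    # alpha = 0.5 * ln((1 - err) / err): 0 at chance level (err = 0.5),
    # growing without bound as err -> 0
    return 0.5 * log((1 - err) / (err + eps))
```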
Stacking (Stacked Generalization)
Train a meta-learner on the outputs of base models:
Level 0: Train models M1, M2, ..., Mk on training data
Generate predictions P1, P2, ..., Pk using cross-validation
Level 1: Train meta-model on [P1, P2, ..., Pk] -> y
Use cross-validated predictions for level 0 to avoid overfitting. The meta-learner (often logistic regression or linear model) learns optimal combination weights.
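The key mechanical step is generating the out-of-fold level-0 predictions; a plain-Python sketch with hypothetical `fit`/`predict` callables standing in for any base model:

```python
def out_of_fold_predictions(fit, predict, X, y, k=5):
    # Each point is predicted by a model trained on folds that exclude it,
    # so the meta-learner never sees leaked target information
    n = len(X)
    preds = [None] * n
    for i in range(k):
        fold = set(range(i, n, k))
        train_idx = [j for j in range(n) if j not in fold]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        for j in fold:
            preds[j] = predict(model, X[j])
    return preds
```

Concatenating these out-of-fold prediction columns across base models gives the level-1 training matrix.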
Comparison
| Method         | Bias          | Variance | Interpretable | Speed    |
|----------------|---------------|----------|---------------|----------|
| Single tree    | Low (if deep) | High     | Yes           | Fast     |
| Random Forest  | Low           | Low      | Partial       | Fast     |
| Gradient Boost | Low           | Low      | No            | Moderate |
| AdaBoost       | Low           | Moderate | No            | Moderate |
Feature Importance
- Impurity-based: sum of impurity decrease weighted by samples reaching each node (biased toward high-cardinality features)
- Permutation importance: measure accuracy drop when feature values are shuffled (model-agnostic, more reliable)
- SHAP values: game-theoretic attribution (see TreeSHAP for exact computation in O(TLD^2))
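A minimal sketch of permutation importance for one feature, assuming a caller-supplied `score_fn` that wraps the trained model's accuracy (all names are illustrative):

```python
import random

def permutation_importance(score_fn, X, y, feature, n_repeats=10, seed=0):
    # Drop in score when one feature column is shuffled: if the model relies
    # on the feature, breaking its link to y should hurt performance
    rng = random.Random(seed)
    base = score_fn(X, y)
    drops = []
    for _ in range(n_repeats):
        col = [row[feature] for row in X]
        rng.shuffle(col)
        X_perm = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, col)]
        drops.append(base - score_fn(X_perm, y))
    return sum(drops) / n_repeats
```

A feature the model depends on gets a large positive importance; an ignored feature scores near zero.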