
Practical Machine Learning

Feature Engineering

Transform raw data into features that better represent the underlying problem.

Common Transformations

| Technique | Description | Example |
|---|---|---|
| Binning | Discretize continuous features | Age -> age groups |
| Log transform | Reduce skewness | log(1 + income) |
| Polynomial | Capture non-linear relationships | x, x^2, x*y |
| Interaction | Combine features | price_per_sqft = price/area |
| Cyclical encoding | sin/cos for periodic features | hour -> sin(2πh/24) |
| Target encoding | Replace category with mean target | city -> avg_price_in_city |
| Frequency encoding | Replace category with its count | city -> count(city) |
| Date decomposition | Extract year, month, day, weekday, etc. | timestamp -> components |

import numpy as np

def cyclical_encode(values, period):
    """Encode periodic features as a (sin, cos) pair."""
    angle = 2 * np.pi * values / period
    return np.sin(angle), np.cos(angle)

def target_encode(train_df, col, target, smoothing=10):
    """Return smoothed per-category target means, fitted on training data only.

    Categories with few rows are pulled toward the global mean, which limits overfitting.
    """
    global_mean = train_df[target].mean()
    agg = train_df.groupby(col)[target].agg(['mean', 'count'])
    smooth = (agg['count'] * agg['mean'] + smoothing * global_mean) / (agg['count'] + smoothing)
    # Apply with df[col].map(smooth).fillna(global_mean) so unseen categories get the global mean.
    return smooth
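
To illustrate why cyclical encoding uses a sin/cos pair, the following sketch compares 23:00 with 00:00 and 12:00 after encoding. The helper name `encode_hour` is an assumption made for this example; it is a self-contained variant of the encoding above, not part of any library.

```python
import numpy as np

def encode_hour(hour, period=24):
    """Map an hour onto the unit circle as a (sin, cos) pair."""
    angle = 2 * np.pi * np.asarray(hour, dtype=float) / period
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

# 23:00 and 00:00 are one hour apart, but 23 units apart as raw integers.
late, midnight, noon = encode_hour(23), encode_hour(0), encode_hour(12)

dist_adjacent = np.linalg.norm(late - midnight)   # small: neighbors on the circle
dist_opposite = np.linalg.norm(noon - midnight)   # largest possible: opposite sides
```

In the encoded space the distance between two hours depends only on how far apart they are on the clock, which is the property a raw integer hour lacks.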

Text Features

  • Bag of words / TF-IDF: sparse vectors of word counts or TF-IDF weights
  • Word embeddings: Word2Vec, GloVe, FastText (dense, semantic)
  • Sentence embeddings: from pretrained transformers (BERT, Sentence-BERT)

Feature Selection

Remove irrelevant or redundant features to reduce overfitting and improve interpretability.

Filter Methods

Rank features independently of the model:

  • Variance threshold: remove near-constant features
  • Correlation: drop one feature from each highly correlated pair (e.g., |r| > 0.95)
  • Mutual information: I(X; Y) measures non-linear dependency
  • Chi-squared: for categorical features vs categorical target
  • ANOVA F-test: for continuous features vs categorical target
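
The first two filters in the list can be sketched with NumPy alone. The function name `filter_features`, its thresholds, and the synthetic data are assumptions for illustration; the greedy rule of keeping the earlier column of a correlated pair is one of several reasonable choices.

```python
import numpy as np

def filter_features(X, var_threshold=1e-8, corr_threshold=0.95):
    """Return indices of columns that survive a variance and a correlation filter."""
    X = np.asarray(X, dtype=float)
    # Step 1: drop near-constant columns.
    keep = [j for j in range(X.shape[1]) if X[:, j].var() > var_threshold]

    # Step 2: greedily drop the later column of each highly correlated pair.
    corr = np.abs(np.atleast_2d(np.corrcoef(X[:, keep], rowvar=False)))
    selected = []
    for pos, j in enumerate(keep):
        if all(corr[pos, keep.index(s)] <= corr_threshold for s in selected):
            selected.append(j)
    return selected

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([
    a,                                         # column 0: kept
    np.ones(200),                              # column 1: constant, removed
    a * 2 + rng.normal(scale=0.01, size=200),  # column 2: near-copy of column 0, removed
    b,                                         # column 3: independent, kept
])
selected = filter_features(X)
```

Neither filter looks at the target, so both are cheap but can discard a feature that is only useful in combination with others.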

Wrapper Methods

Use model performance to evaluate feature subsets:

  • Forward selection: start empty, greedily add best feature
  • Backward elimination: start with all, greedily remove worst
  • Recursive Feature Elimination (RFE): train model, remove least important feature, repeat
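
Forward selection, the first wrapper method above, might look like the sketch below. It assumes a linear least-squares model scored by validation MSE; the function names and the synthetic data are hypothetical, and any model and metric could be substituted.

```python
import numpy as np

def val_mse(X_tr, y_tr, X_va, y_va, cols):
    """Fit least squares (with intercept) on the given columns; return validation MSE."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, cols]])
    A_va = np.column_stack([np.ones(len(X_va)), X_va[:, cols]])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((A_va @ coef - y_va) ** 2))

def forward_selection(X_tr, y_tr, X_va, y_va, max_features):
    """Greedily add the feature that most reduces validation error."""
    selected, remaining = [], list(range(X_tr.shape[1]))
    # Baseline with no features: predict the training mean.
    best_score = float(np.mean((y_va - y_tr.mean()) ** 2))
    while remaining and len(selected) < max_features:
        scores = {j: val_mse(X_tr, y_tr, X_va, y_va, selected + [j]) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best_score:   # no candidate improves the score: stop early
            break
        best_score = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
# Only columns 1 and 4 carry signal in this synthetic target.
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + rng.normal(scale=0.1, size=400)
chosen = forward_selection(X[:300], y[:300], X[300:], y[300:], max_features=2)
```

Each round refits the model once per remaining feature, which is why wrapper methods cost far more than filter methods on wide datasets.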

Embedded Methods

Feature selection built into the model:

  • Lasso (L1): drives some coefficients exactly to zero, which removes those features
  • Tree-based importance: feature importance from random forests / gradient boosting
  • Elastic net: combines L1 selection with L2 stability

Handling Missing Data

Strategies

| Method | When to Use | Implementation |
|---|---|---|
| Drop rows | Very few missing, MCAR | df.dropna() |
| Drop columns | >50% missing in a feature | df.drop(columns=[...]) |
| Mean/median/mode | Simple baseline, not too many missing | SimpleImputer |
| KNN imputation | Features are correlated | KNNImputer |
| Iterative (MICE) | Complex patterns | IterativeImputer |
| Indicator variable | Missingness is informative | Add is_missing column |
| Model-native | Tree methods handle missing natively | XGBoost, LightGBM |

Missing data types:

  • MCAR (Missing Completely at Random): missingness unrelated to any data
  • MAR (Missing at Random): missingness depends on observed data
  • MNAR (Missing Not at Random): missingness depends on unobserved values
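
When missingness may itself carry signal (the MNAR case), median imputation is often combined with an indicator column. The sketch below does this with NumPy; the function name `impute_with_indicator` and the toy arrays are assumptions for illustration. scikit-learn's `SimpleImputer` offers similar behavior through its `add_indicator` option.

```python
import numpy as np

def impute_with_indicator(X_train, X_new):
    """Median-impute NaNs using training medians and append is_missing flags."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    medians = np.nanmedian(X_train, axis=0)   # learned on training data only
    missing = np.isnan(X_new)
    filled = np.where(missing, medians, X_new)
    # Output layout: original columns (imputed), then one 0/1 flag per column.
    return np.hstack([filled, missing.astype(float)])

X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 50.0]])
X_new = np.array([[np.nan, 20.0], [5.0, np.nan]])
result = impute_with_indicator(X_train, X_new)
```

Computing the medians on the training split and reusing them for new data avoids leaking test-set statistics into the features.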

Handling Imbalanced Classes

Class imbalance arises when one class is much rarer than the others (e.g., fraud detection, where 99.9% of examples are non-fraud).

Data-Level Methods

| Method | Description |
|---|---|
| Random oversample | Duplicate minority samples |
| Random undersample | Remove majority samples |
| SMOTE | Generate synthetic minority samples via interpolation |
| ADASYN | Focus synthetic samples on harder examples |
| Tomek links | Remove ambiguous majority samples near boundary |

SMOTE (Synthetic Minority Over-sampling Technique):

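import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_minority, k=5, n_synthetic=100, seed=None):
    """Generate synthetic minority samples by interpolating toward a random neighbor."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    # Query k + 1 neighbors: each point's nearest neighbor is the point itself.
    knn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    synthetic = []

    for _ in range(n_synthetic):
        idx = rng.integers(len(X_minority))
        x = X_minority[idx]
        # Drop the first hit (x itself) so interpolation always moves toward another sample.
        neighbors = knn.kneighbors([x], return_distance=False)[0][1:]
        nn = X_minority[rng.choice(neighbors)]

        # Interpolate between x and its neighbor
        lam = rng.random()
        synthetic.append(x + lam * (nn - x))

    return np.array(synthetic)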

Algorithm-Level Methods

  • Class weights: weight loss inversely proportional to class frequency
  • Focal loss: down-weight easy examples: L = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class
  • Threshold adjustment: optimize decision threshold on validation set using PR curve
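
To make the focal-loss bullet concrete, the sketch below compares how strongly cross-entropy and focal loss separate an easy example from a hard one. The values `alpha=0.25` and `gamma=2.0` and the two example probabilities are assumptions chosen for illustration.

```python
import numpy as np

def cross_entropy(p_true):
    """Cross-entropy per example, given the predicted probability of the true class."""
    p_true = np.clip(np.asarray(p_true, dtype=float), 1e-12, 1.0)
    return -np.log(p_true)

def focal_loss(p_true, alpha=0.25, gamma=2.0):
    """Focal loss per example: cross-entropy scaled by alpha * (1 - p_true)^gamma."""
    p_true = np.clip(np.asarray(p_true, dtype=float), 1e-12, 1.0)
    return -alpha * (1.0 - p_true) ** gamma * np.log(p_true)

easy, hard = 0.95, 0.30   # probability the model assigns to the correct class

# How many times larger the hard example's loss is under each objective.
ce_ratio = cross_entropy(hard) / cross_entropy(easy)
focal_ratio = focal_loss(hard) / focal_loss(easy)
```

With gamma = 2 the modulating factor alone favors the hard example by (0.7 / 0.05)^2 = 196 times more than cross-entropy does, so well-classified majority examples contribute far less to the gradient.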

Evaluation for Imbalanced Data

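Do NOT rely on accuracy: in the fraud example above, a model that always predicts non-fraud is 99.9% accurate while catching no fraud. Use precision, recall, F1, AUPRC, or Matthews Correlation Coefficient (MCC).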
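
A small worked example shows how accuracy can mislead on imbalanced data. The confusion-matrix counts below are hypothetical, and `binary_metrics` is an illustrative helper rather than a library function.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Compute common metrics from confusion-matrix counts (0.0 when undefined)."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": (tp + tn) / total, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# 10,000 examples, 10 of them fraud (0.1% positives).
always_negative = binary_metrics(tp=0, fp=0, fn=10, tn=9990)   # never flags fraud
useful_model = binary_metrics(tp=8, fp=12, fn=2, tn=9978)      # catches 8 of 10 frauds
```

Here the model that never flags fraud has the higher accuracy, while recall, F1, and MCC all rank the useful model first.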

Experiment Tracking

MLflow

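import mlflow
import mlflow.sklearn

# model, data, val_data, train_one_epoch, and evaluate are placeholders defined elsewhere.
with mlflow.start_run():
    mlflow.log_params({"lr": 0.001, "epochs": 50})

    for epoch in range(50):
        train_loss = train_one_epoch(model, data)
        val_acc = evaluate(model, val_data)
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_acc", val_acc, step=epoch)

    # log_artifact uploads an existing local file, so model.pkl must already be on disk.
    mlflow.log_artifact("model.pkl")
    # log_model stores a scikit-learn model in MLflow's own model format.
    mlflow.sklearn.log_model(model, "model")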

Weights & Biases (W&B)

import wandb

wandb.init(project="my-project", config={"lr": 0.001, "epochs": 50})

for epoch in range(50):
    train_loss = train_one_epoch(model, data)
    wandb.log({"train_loss": train_loss, "epoch": epoch})

wandb.finish()

Key capabilities: hyperparameter sweeps, artifact versioning, team collaboration, experiment comparison dashboards.

Model Serving

Deployment Patterns

| Pattern | Description | Latency | Use Case |
|---|---|---|---|
| Batch | Score entire dataset periodically | High | Recommendations |
| Real-time API | REST/gRPC endpoint | Low | Search ranking |
| Streaming | Score events as they arrive | Medium | Fraud detection |
| Edge | Run model on device | Lowest | Mobile, IoT |

Model Optimization for Serving

  • Quantization: FP32 -> INT8 (typically 2-4x speedup, usually with small accuracy loss)
  • Pruning: remove near-zero weights (structured or unstructured)
  • Distillation: train small student model to mimic large teacher
  • ONNX: standard format for cross-framework deployment
  • TensorRT / OpenVINO: hardware-specific optimization
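
As a rough illustration of the quantization bullet, the sketch below applies symmetric per-tensor INT8 quantization to a random weight matrix. This is a simplified scheme assumed for the example; production toolchains typically add calibration data, per-channel scales, or quantization-aware training.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization of FP32 weights to INT8 (assumes a non-zero tensor)."""
    scale = np.max(np.abs(weights)) / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 weights from INT8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
max_error = float(np.max(np.abs(dequantize(q, scale) - w)))   # bounded by about scale / 2
```

Storage drops by 4x (one byte per weight instead of four), and the rounding error per weight is at most about half a quantization step.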

A/B Testing

Compare model versions in production with statistical rigor.

Process

  1. Define metric and minimum detectable effect (MDE)
  2. Calculate sample size per group: n = (z_{alpha/2} + z_beta)^2 * 2 * sigma^2 / MDE^2, where sigma is the metric's standard deviation
  3. Randomly split traffic (e.g., 50/50 or 95/5 for risky changes)
  4. Run until sample size reached (do NOT peek and stop early without correction)
  5. Compute p-value or confidence interval
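
The sample-size formula in step 2 can be evaluated with the standard library. The sketch assumes a two-sided test on a difference in means with equal group sizes; the function name and the example values of sigma and MDE are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_group(sigma, mde, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample test of a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    z_beta = NormalDist().inv_cdf(power)            # z_beta, with power = 1 - beta
    n = (z_alpha + z_beta) ** 2 * 2 * sigma ** 2 / mde ** 2
    return math.ceil(n)

# Hypothetical metric: standard deviation 10, smallest effect worth detecting 1.
n = sample_size_per_group(sigma=10.0, mde=1.0)          # 1570 per group
n_half_mde = sample_size_per_group(sigma=10.0, mde=0.5) # roughly four times larger
```

Because n scales with 1 / MDE^2, halving the minimum detectable effect roughly quadruples the traffic required.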

Pitfalls

  • Peeking: checking results before planned sample size inflates false positive rate. Use sequential testing (SPRT) if early stopping is needed.
  • Multiple comparisons: testing many metrics requires Bonferroni or FDR correction.
  • Simpson's paradox: aggregate results can contradict subgroup results. Stratify analysis.
  • Network effects: user interactions violate independence assumption.

Data Versioning

Track datasets alongside code for reproducibility.

  • DVC (Data Version Control): git-like commands for data, stores large files externally (S3, GCS)
  • Delta Lake / Lakehouse: versioned data tables with time travel
  • LakeFS: git-like branching for data lakes
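
# DVC workflow
dvc init
dvc add data/training_set.csv                      # track with DVC, not git
git add data/training_set.csv.dvc data/.gitignore  # stage the metadata and the ignore rule
git commit -m "Track training set with DVC"
dvc push                                           # push data to remote storage (needs a configured remote)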

Model Monitoring

What to Monitor

| Category | Metrics |
|---|---|
| Data drift | KS test, PSI, chi-squared on feature distributions |
| Concept drift | Degradation in online metrics over time |
| Prediction drift | Distribution shift in model outputs |
| Performance | Latency (p50, p99), throughput, error rates |
| Resource usage | CPU, memory, GPU utilization |

Data Drift Detection

Population Stability Index (PSI):

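PSI = sum_i (actual_i - expected_i) * ln(actual_i / expected_i)

Here expected_i and actual_i are the fractions of reference and current samples that fall in bin i. Common rule-of-thumb thresholds:

  • PSI < 0.1: no significant change
  • PSI 0.1-0.25: moderate shift, investigate
  • PSI > 0.25: significant shift, consider retraining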
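
A minimal PSI sketch follows, assuming ten quantile bins derived from the reference sample and a small floor `eps` to avoid taking the log of an empty bin. Both choices, the function name, and the synthetic data are assumptions, not fixed parts of the definition.

```python
import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a current sample."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior bin edges come from quantiles of the reference sample.
    interior = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]

    def fractions(values):
        idx = np.searchsorted(interior, values, side="right")  # bin index 0..n_bins-1
        return np.bincount(idx, minlength=n_bins) / len(values)

    # Floor the proportions so empty bins do not produce log(0) or division by zero.
    exp_frac = np.clip(fractions(expected), eps, None)
    act_frac = np.clip(fractions(actual), eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 10_000)
psi_same = psi(reference, rng.normal(0.0, 1.0, 10_000))      # same distribution
psi_shifted = psi(reference, rng.normal(1.0, 1.0, 10_000))   # mean shifted by one sigma
```

On this synthetic data the unshifted sample falls well under the 0.1 threshold and the shifted one well above 0.25, matching the bands listed above.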

Retraining Strategies

  • Scheduled: retrain on fixed cadence (daily, weekly)
  • Triggered: retrain when drift exceeds threshold
  • Online learning: continuous updates with each new sample
  • Shadow mode: deploy new model alongside old, compare before switching

Responsible AI

Model Interpretability

SHAP (SHapley Additive exPlanations): game-theoretic attribution of prediction to features.

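phi_j = sum_{S subset F\{j}} [|S|! (|F|-|S|-1)! / |F|!] * [f(S u {j}) - f(S)]

Here F is the set of all features, S ranges over the subsets that exclude feature j, and f(S) is the model output when only the features in S are known.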

Properties: local accuracy, missingness, consistency. TreeSHAP computes exact values for tree models in polynomial time.
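
The Shapley formula can be evaluated directly for a handful of features by enumerating every coalition. The sketch below does so for a toy model and treats "absent" features as set to a zero baseline; that baseline convention, the toy model, and the function names are assumptions for illustration (SHAP itself averages over a background distribution).

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    features = list(range(n_features))
    phi = []
    for j in features:
        others = [f for f in features if f != j]
        total = 0.0
        for size in range(len(others) + 1):
            # |S|! (|F| - |S| - 1)! / |F|!
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {j}) - value_fn(s))
        phi.append(total)
    return phi

# Toy model f(x) = 2*x0 + 3*x1 + x0*x2, explained at x against a zero baseline.
x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]

def value_fn(subset):
    z = [x[i] if i in subset else baseline[i] for i in range(3)]
    return 2 * z[0] + 3 * z[1] + z[0] * z[2]

phi = shapley_values(value_fn, 3)   # the x0*x2 interaction is split evenly between 0 and 2
```

The attributions sum to f(x) minus f(baseline), which is the local accuracy property; the enumeration grows as 2^|F|, which is why TreeSHAP and sampling approximations matter in practice.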

LIME (Local Interpretable Model-agnostic Explanations):

  1. Perturb input around the point of interest
  2. Get model predictions for perturbed samples
  3. Fit a simple interpretable model (linear, decision tree) weighted by proximity
  4. Use the simple model as the local explanation
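
The four LIME steps can be sketched with NumPy as a proximity-weighted linear fit. Gaussian perturbations, the RBF kernel, and every name and parameter value below are assumptions for illustration; the `lime` package makes different sampling and feature-selection choices.

```python
import numpy as np

def lime_explain(predict_fn, x, n_samples=2000, scale=0.5, kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear surrogate around x; return its slopes."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # 1. Perturb the input with Gaussian noise.
    Z = x + rng.normal(scale=scale, size=(n_samples, x.size))
    # 2. Query the black-box model on the perturbed samples.
    y = predict_fn(Z)
    # 3. Weight samples by an RBF kernel on distance to x, then solve weighted least squares.
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / (kernel_width ** 2))
    A = np.column_stack([np.ones(n_samples), Z - x])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    # 4. The slopes of the surrogate are the local feature attributions.
    return coef[1:]

def black_box(Z):
    # Nonlinear in feature 0, linear in feature 1, ignores feature 2.
    return Z[:, 0] ** 2 + 3.0 * Z[:, 1]

attributions = lime_explain(black_box, x=[2.0, 0.0, 1.0])
```

Near x the surrogate's slopes approximate the model's local gradient (about 4, 3, and 0 here), so the explanation is only valid in the neighborhood that the kernel width defines.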

Fairness Metrics

| Metric | Definition |
|---|---|
| Demographic parity | P(Y_hat=1\|A=0) = P(Y_hat=1\|A=1) |
| Equalized odds | Equal TPR and FPR across groups |
| Equal opportunity | Equal TPR across groups |
| Predictive parity | Equal precision across groups |
| Calibration | P(Y=1\|Y_hat=p, A=a) = p for all a |

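Impossibility result: when base rates differ across groups, no imperfect classifier can satisfy equalized odds and predictive parity at the same time (Chouldechova, 2017), so demographic parity, equalized odds, and predictive parity cannot all hold except in trivial cases.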
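
The tension between these metrics shows up even on a tiny example. The sketch below computes the per-group quantities behind each metric on hypothetical data in which both groups share the same TPR and FPR but have different base rates; `group_rates` and `make_group` are illustrative helpers.

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Selection rate, TPR, FPR, and precision per group (binary labels).

    Assumes every group contains positives, negatives, and positive predictions.
    """
    y_true, y_pred, group = (np.asarray(a) for a in (y_true, y_pred, group))
    rates = {}
    for g in np.unique(group):
        t, p = y_true[group == g], y_pred[group == g]
        tp = int(np.sum((p == 1) & (t == 1)))
        fp = int(np.sum((p == 1) & (t == 0)))
        fn = int(np.sum((p == 0) & (t == 1)))
        tn = int(np.sum((p == 0) & (t == 0)))
        rates[g.item()] = {
            "selection_rate": (tp + fp) / len(t),   # compared by demographic parity
            "tpr": tp / (tp + fn),                  # equal opportunity, equalized odds
            "fpr": fp / (fp + tn),                  # equalized odds
            "precision": tp / (tp + fp),            # predictive parity
        }
    return rates

def make_group(tp, fn, fp, tn):
    """Build (y_true, y_pred) lists with the given confusion-matrix counts."""
    y_true = [1] * (tp + fn) + [0] * (fp + tn)
    y_pred = [1] * tp + [0] * fn + [1] * fp + [0] * tn
    return y_true, y_pred

# Both groups have TPR = 0.8 and FPR = 0.2, but base rates of 50% and 25%.
t0, p0 = make_group(tp=4, fn=1, fp=1, tn=4)
t1, p1 = make_group(tp=4, fn=1, fp=3, tn=12)
rates = group_rates(t0 + t1, p0 + p1, [0] * len(t0) + [1] * len(t1))
```

Equalized odds holds here, yet the selection rates (0.50 vs 0.35) and precisions (0.80 vs about 0.57) differ, so demographic parity and predictive parity are both violated.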

Mitigation Strategies

  • Pre-processing: rebalance or transform training data (reweighting, fair representations)
  • In-processing: add fairness constraints to the optimization (adversarial debiasing, constrained optimization)
  • Post-processing: adjust decision thresholds per group to equalize the chosen metric

ML Pipeline Checklist

  1. Define success metrics AND fairness criteria upfront
  2. Audit training data for representation and label quality
  3. Establish baselines and document assumptions
  4. Track experiments with versioned data, code, and configs
  5. Evaluate on held-out test set with stratified metrics
  6. Deploy with monitoring for drift, performance, and fairness
  7. Document model cards: intended use, limitations, ethical considerations
  8. Plan for model updates and deprecation