
Practical Machine Learning

ML training pipeline

Feature Engineering

Transform raw data into features that better represent the underlying problem.

Common Transformations

| Technique | Description | Example |
|---------------------|------------------------------------------|----------------------------|
| Binning | Discretize continuous features | Age -> age groups |
| Log transform | Reduce skewness | log(1 + income) |
| Polynomial | Capture non-linear relationships | x, x^2, xy |
| Interaction | Combine features | price_per_sqft = price/area |
| Cyclical encoding | sin/cos for periodic features | hour -> sin(2pi*h/24) |
| Target encoding | Replace category with mean target | city -> avg_price_in_city |
| Frequency encoding | Replace category with its count | city -> count(city) |
| Date decomposition | Extract year, month, day, weekday, etc. | timestamp -> components |

import numpy as np
import pandas as pd

def cyclical_encode(values, period):
    """Encode periodic features as a sin/cos pair."""
    return np.sin(2 * np.pi * values / period), np.cos(2 * np.pi * values / period)

def target_encode(train_df, col, target, smoothing=10):
    """Target encoding with smoothing to prevent overfitting on rare categories."""
    global_mean = train_df[target].mean()
    agg = train_df.groupby(col)[target].agg(['mean', 'count'])
    # Shrink each category mean toward the global mean; more rows -> less shrinkage
    smooth = (agg['count'] * agg['mean'] + smoothing * global_mean) / (agg['count'] + smoothing)
    return smooth
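
A quick usage sketch (the toy DataFrame, column names, and values below are illustrative):

df = pd.DataFrame({
    'hour': [0, 6, 12, 23],
    'city': ['NYC', 'NYC', 'SF', 'SF'],
    'price': [100.0, 120.0, 300.0, 280.0],
})
df['hour_sin'], df['hour_cos'] = cyclical_encode(df['hour'], period=24)
df['city_encoded'] = df['city'].map(target_encode(df, 'city', 'price'))  # city -> smoothed mean price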

Text Features

  • Bag of words / TF-IDF: sparse word count vectors
  • Word embeddings: Word2Vec, GloVe, FastText (dense, semantic)
  • Sentence embeddings: from pretrained transformers (BERT, Sentence-BERT)
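
A minimal TF-IDF sketch with scikit-learn (the toy documents are illustrative; dense embeddings would typically come from gensim or sentence-transformers instead):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))    # unigrams and bigrams
X_text = vectorizer.fit_transform(docs)             # sparse matrix: (n_docs, n_terms)
print(vectorizer.get_feature_names_out())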

Feature Selection

Remove irrelevant or redundant features to reduce overfitting and improve interpretability.

Filter Methods

Rank features independently of the model:

  • Variance threshold: remove near-constant features
  • Correlation: drop one feature from each highly correlated pair (e.g., |r| > 0.95)
  • Mutual information: I(X; Y) measures non-linear dependency
  • Chi-squared: for categorical features vs categorical target
  • ANOVA F-test: for continuous features vs categorical target
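
A sketch of two of these filters using scikit-learn on synthetic data (the dataset and thresholds are illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Variance threshold: drop near-constant features
X_var = VarianceThreshold(threshold=0.01).fit_transform(X)

# Mutual information: rank features by (possibly non-linear) dependency with the target
mi = mutual_info_classif(X_var, y, random_state=0)
top_features = np.argsort(mi)[::-1][:10]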

Wrapper Methods

Use model performance to evaluate feature subsets:

  • Forward selection: start empty, greedily add best feature
  • Backward elimination: start with all, greedily remove worst
  • Recursive Feature Elimination (RFE): train model, remove least important feature, repeat
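
For example, RFE with a linear model as the scorer (a sketch on synthetic data; the estimator and feature count are illustrative):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Repeatedly fit, drop the least important feature, and refit until 8 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=8, step=1)
rfe.fit(X, y)
selected_mask = rfe.support_   # boolean mask over the original features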

Embedded Methods

Feature selection built into the model:

  • Lasso (L1): drives coefficients to zero
  • Tree-based importance: feature importance from random forests / gradient boosting
  • Elastic net: combines L1 selection with L2 stability
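
A Lasso-based selection sketch (regression setting; alpha is illustrative and would normally be tuned):

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=500, n_features=20, n_informative=5, noise=0.1, random_state=0)

# L1 penalty drives uninformative coefficients to exactly zero;
# SelectFromModel keeps only features with non-zero coefficients
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
X_selected = selector.transform(X)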

Handling Missing Data

Strategies

| Method | When to Use | Implementation |
|---------------------|------------------------------------------|--------------------------|
| Drop rows | Very few missing, MCAR | df.dropna() |
| Drop columns | >50% missing in a feature | df.drop(columns=[...]) |
| Mean/median/mode | Simple baseline, not too many missing | SimpleImputer |
| KNN imputation | Features are correlated | KNNImputer |
| Iterative (MICE) | Complex patterns | IterativeImputer |
| Indicator variable | Missingness is informative | Add is_missing column |
| Model-native | Tree methods handle missing natively | XGBoost, LightGBM |

Missing data types:

  • MCAR (Missing Completely at Random): missingness unrelated to any data
  • MAR (Missing at Random): missingness depends on observed data
  • MNAR (Missing Not at Random): missingness depends on unobserved values
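
A sketch of two imputers from the table above (the toy matrix and neighbor count are illustrative):

import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

# Median imputation, plus an is_missing indicator column per imputed feature
X_simple = SimpleImputer(strategy="median", add_indicator=True).fit_transform(X)

# KNN imputation: fill each gap from the most similar complete rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)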

Handling Imbalanced Classes

When one class is much rarer (e.g., fraud detection: 99.9% non-fraud).

Data-Level Methods

| Method | Description |
|--------------------|--------------------------------------------------------|
| Random oversample | Duplicate minority samples |
| Random undersample | Remove majority samples |
| SMOTE | Generate synthetic minority samples via interpolation |
| ADASYN | Focus synthetic samples on harder examples |
| Tomek links | Remove ambiguous majority samples near boundary |

SMOTE (Synthetic Minority Over-sampling Technique):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_minority, k=5, n_synthetic=100):
    """Generate synthetic minority samples by interpolating between neighbors."""
    # Query k+1 neighbors because each point is its own nearest neighbor
    knn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    synthetic = []

    for _ in range(n_synthetic):
        idx = np.random.randint(len(X_minority))
        x = X_minority[idx]
        # Drop the first neighbor (the point itself)
        neighbors = knn.kneighbors([x], return_distance=False)[0][1:]
        nn = X_minority[np.random.choice(neighbors)]

        # Interpolate between x and its neighbor
        lam = np.random.random()
        synthetic.append(x + lam * (nn - x))

    return np.array(synthetic)

Algorithm-Level Methods

  • Class weights: weight loss inversely proportional to class frequency
  • Focal loss: down-weight easy examples: L = -alpha * (1-p)^gamma * log(p)
  • Threshold adjustment: optimize decision threshold on validation set using PR curve
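
A sketch of the first two ideas: class weighting via scikit-learn and a binary focal loss written directly from the formula above (the alpha and gamma values are illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Class weights: per-sample loss scaled inversely to class frequency
clf = LogisticRegression(class_weight="balanced", max_iter=1000)

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss: down-weights examples the model already classifies easily."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return -np.mean(alpha_t * (1 - p_t) ** gamma * np.log(p_t))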

Evaluation for Imbalanced Data

Do NOT use accuracy. Use: precision, recall, F1, AUPRC, Matthews Correlation Coefficient (MCC).
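
A sketch computing these with scikit-learn (the toy labels and scores are illustrative):

from sklearn.metrics import (average_precision_score, f1_score,
                             matthews_corrcoef, precision_score, recall_score)

y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred  = [0, 0, 0, 0, 0, 1, 0, 0, 1, 0]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.1, 0.4, 0.9, 0.45]

print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred), matthews_corrcoef(y_true, y_pred))
print(average_precision_score(y_true, y_score))   # area under the PR curve (AUPRC)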

Experiment Tracking

MLflow

import mlflow

with mlflow.start_run():
    mlflow.log_param("lr", 0.001)
    mlflow.log_param("epochs", 50)

    for epoch in range(50):
        train_loss = train_one_epoch(model, data)
        val_acc = evaluate(model, val_data)
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_acc", val_acc, step=epoch)

    mlflow.log_artifact("model.pkl")
    mlflow.sklearn.log_model(model, "model")

Weights & Biases (W&B)

import wandb

wandb.init(project="my-project", config={"lr": 0.001, "epochs": 50})

for epoch in range(50):
    train_loss = train_one_epoch(model, data)
    wandb.log({"train_loss": train_loss, "epoch": epoch})

wandb.finish()

Key capabilities: hyperparameter sweeps, artifact versioning, team collaboration, experiment comparison dashboards.

Model Serving

Deployment Patterns

| Pattern | Description | Latency | Use Case |
|---------------|------------------------------------------|----------|-----------------------|
| Batch | Score entire dataset periodically | High | Recommendations |
| Real-time API | REST/gRPC endpoint | Low | Search ranking |
| Streaming | Score events as they arrive | Medium | Fraud detection |
| Edge | Run model on device | Lowest | Mobile, IoT |

Model Optimization for Serving

  • Quantization: FP32 -> INT8 (2-4x speedup, minimal accuracy loss)
  • Pruning: remove near-zero weights (structured or unstructured)
  • Distillation: train small student model to mimic large teacher
  • ONNX: standard format for cross-framework deployment
  • TensorRT / OpenVINO: hardware-specific optimization
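
As one concrete example, post-training dynamic quantization in PyTorch (the toy model is illustrative; other frameworks expose analogous APIs):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

# Dynamic quantization: store Linear weights as INT8, quantize activations at runtime
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)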

A/B Testing

Compare model versions in production with statistical rigor.

Process

  1. Define metric and minimum detectable effect (MDE)
  2. Calculate per-variant sample size: n = 2 * (z_{alpha/2} + z_beta)^2 * sigma^2 / MDE^2
  3. Randomly split traffic (e.g., 50/50 or 95/5 for risky changes)
  4. Run until sample size reached (do NOT peek and stop early without correction)
  5. Compute p-value or confidence interval
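
A direct translation of the sample-size formula in step 2 (a per-variant sketch for a two-sample test of means; the sigma and MDE values are illustrative):

from scipy.stats import norm

def sample_size_per_variant(sigma, mde, alpha=0.05, power=0.8):
    """Per-variant n for a two-sample test of means."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance
    z_beta = norm.ppf(power)            # desired power
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2

n = sample_size_per_variant(sigma=1.0, mde=0.05)   # roughly 6,280 per variant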

Pitfalls

  • Peeking: checking results before planned sample size inflates false positive rate. Use sequential testing (SPRT) if early stopping is needed.
  • Multiple comparisons: testing many metrics requires Bonferroni or FDR correction.
  • Simpson's paradox: aggregate results can contradict subgroup results. Stratify analysis.
  • Network effects: user interactions violate independence assumption.

Data Versioning

Track datasets alongside code for reproducibility.

  • DVC (Data Version Control): git-like commands for data, stores large files externally (S3, GCS)
  • Delta Lake / Lakehouse: versioned data tables with time travel
  • LakeFS: git-like branching for data lakes

# DVC workflow
dvc init
dvc add data/training_set.csv     # track with DVC, not git
git add data/training_set.csv.dvc  # commit the metadata
dvc push                           # push data to remote storage

Model Monitoring

What to Monitor

| Category | Metrics |
|------------------|-----------------------------------------------------|
| Data drift | KS test, PSI, chi-squared on feature distributions |
| Concept drift | Degradation in online metrics over time |
| Prediction drift | Distribution shift in model outputs |
| Performance | Latency (p50, p99), throughput, error rates |
| Resource usage | CPU, memory, GPU utilization |

Data Drift Detection

Population Stability Index (PSI):

PSI = sum_i (actual_i - expected_i) * ln(actual_i / expected_i)

  • PSI < 0.1: no significant change
  • PSI 0.1-0.25: moderate shift, investigate
  • PSI > 0.25: significant shift, retrain
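
A minimal PSI sketch for one numeric feature, binning on the reference distribution (the bin count and simulated shift are illustrative):

import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a current sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return np.sum((a_pct - e_pct) * np.log(a_pct / e_pct))

rng = np.random.default_rng(0)
print(psi(rng.normal(0, 1, 10_000), rng.normal(0.3, 1, 10_000)))   # shifted mean -> elevated PSI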

Retraining Strategies

  • Scheduled: retrain on fixed cadence (daily, weekly)
  • Triggered: retrain when drift exceeds threshold
  • Online learning: continuous updates with each new sample
  • Shadow mode: deploy new model alongside old, compare before switching

Responsible AI

Model Interpretability

SHAP (SHapley Additive exPlanations): game-theoretic attribution of prediction to features.

phi_j = sum_{S subset F\{j}} |S|!(|F|-|S|-1)! / |F|! * [f(S u {j}) - f(S)]

Properties: local accuracy, missingness, consistency. TreeSHAP computes exact values for tree models in polynomial time.
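
A usage sketch assuming the shap package and a tree model (XGBoost here; the synthetic dataset is illustrative):

import shap
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = xgb.XGBClassifier(n_estimators=50).fit(X, y)

# TreeSHAP: exact per-feature attributions for each prediction
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)     # shape (n_samples, n_features)
shap.summary_plot(shap_values, X)          # global view of feature impact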

LIME (Local Interpretable Model-agnostic Explanations):

  1. Perturb input around the point of interest
  2. Get model predictions for perturbed samples
  3. Fit a simple interpretable model (linear, decision tree) weighted by proximity
  4. Use the simple model as the local explanation
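
A sketch using the lime package on tabular data (the dataset and model are illustrative):

from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(X, mode="classification")
# Perturb around X[0], fit a locally weighted linear surrogate, report top features
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=5)
print(exp.as_list())   # (feature condition, weight) pairs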

Fairness Metrics

| Metric | Definition |
|-------------------------|----------------------------------------------|
| Demographic parity | P(Y_hat=1 \| A=0) = P(Y_hat=1 \| A=1) |
| Equalized odds | Equal TPR and FPR across groups |
| Equal opportunity | Equal TPR across groups |
| Predictive parity | Equal precision across groups |
| Calibration | P(Y=1 \| Y_hat=p, A=a) = p for all groups a |

Impossibility result: except in trivial cases (equal base rates across groups or a perfect classifier), demographic parity, equalized odds, and predictive parity cannot all be satisfied simultaneously (Chouldechova, 2017).
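
These criteria are straightforward to audit directly from predictions; a sketch for two of them (binary predictions and a single sensitive attribute assumed):

import numpy as np

def demographic_parity_gap(y_pred, group):
    """Largest difference in positive-prediction rates across groups."""
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return max(rates) - min(rates)

def equal_opportunity_gap(y_true, y_pred, group):
    """Largest difference in true positive rates (recall) across groups."""
    tprs = [y_pred[(group == g) & (y_true == 1)].mean() for g in np.unique(group)]
    return max(tprs) - min(tprs)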

Mitigation Strategies

  • Pre-processing: rebalance or transform training data (reweighting, fair representations)
  • In-processing: add fairness constraints to the optimization (adversarial debiasing, constrained optimization)
  • Post-processing: adjust decision thresholds per group to equalize the chosen metric

ML Pipeline Checklist

  1. Define success metrics AND fairness criteria upfront
  2. Audit training data for representation and label quality
  3. Establish baselines and document assumptions
  4. Track experiments with versioned data, code, and configs
  5. Evaluate on held-out test set with stratified metrics
  6. Deploy with monitoring for drift, performance, and fairness
  7. Document model cards: intended use, limitations, ethical considerations
  8. Plan for model updates and deprecation