Evaluating ML Systems

Overview

Offline metrics tell you how a model performs on a test set. Online metrics tell you how it performs in the real world. The two often disagree: a model with 99% accuracy on a carefully curated test set can still fail badly once deployed to actual users, because the test set does not reflect the messy, shifting distribution of real-world inputs.

The gap between offline evaluation and production performance is where many ML projects fail. Bridging that gap requires measuring the right things at the right time, and never trusting a single number to tell the whole story.

Offline Metrics

Classification Metrics

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, 
    f1_score, roc_auc_score, confusion_matrix,
)

def evaluate_classifier(y_true, y_pred, y_prob=None):
    """Compute standard classification metrics.
    
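    No single metric tells the full story. Always report
    multiple metrics and understand the trade-offs.

    Note: with average="weighted", recall equals accuracy and
    minority-class failures are masked. Prefer average="macro"
    or per-class scores when classes are imbalanced.
    For AUC, pass y_prob as a 1-D array of positive-class scores
    (binary) or an (n_samples, n_classes) array (multiclass).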
    """
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted"),
        "recall": recall_score(y_true, y_pred, average="weighted"),
        "f1": f1_score(y_true, y_pred, average="weighted"),
    }
    
    if y_prob is not None:
        metrics["auc_roc"] = roc_auc_score(
            y_true, y_prob, multi_class="ovr", average="weighted"
        )
    
    return metrics

When to use which metric:

Metric      Use when...                           Example
Accuracy    Classes are balanced                   Sentiment (pos/neg, 50/50)
Precision   False positives are costly             Spam filter (don't lose real email)
Recall      False negatives are costly             Fraud detection (don't miss fraud)
F1          You need balance between P and R       Most classification tasks
AUC-ROC     You want threshold-independent eval    Ranking, scoring tasks

The Accuracy Trap

def demonstrate_accuracy_trap():
    """Accuracy is misleading for imbalanced classes.
    
    If 99% of emails are not spam, a model that always
    predicts 'not spam' has 99% accuracy but catches
    zero spam.
    """
    # 10,000 emails: 9,900 legitimate, 100 spam
    y_true = [0] * 9900 + [1] * 100
    
    # "Model" that always predicts not-spam
    y_pred_dumb = [0] * 10000
    
    # Actual spam filter
    y_pred_model = [0] * 9900 + [1] * 80 + [0] * 20
    
    print("Dumb model (always predicts not-spam):")
    print(f"  Accuracy: {accuracy_score(y_true, y_pred_dumb):.2%}")
    print(f"  Recall:   {recall_score(y_true, y_pred_dumb):.2%}")
    
    print("\nActual model:")
    print(f"  Accuracy: {accuracy_score(y_true, y_pred_model):.2%}")
    print(f"  Recall:   {recall_score(y_true, y_pred_model):.2%}")
    
    # Output:
    # Dumb model: Accuracy 99.00%, Recall 0.00%
    # Actual model: Accuracy 99.80%, Recall 80.00%
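The same example can be checked by hand from confusion-matrix counts. The sketch below assumes the counts implied by the lists above (80 spam caught, 20 missed, no legitimate email flagged) and uses a hypothetical helper, `prf_from_counts`, that is not part of scikit-learn. It only illustrates the definitions of precision, recall, and F1.

```python
def prf_from_counts(tp, fp, fn):
    """Precision, recall, and F1 from raw confusion-matrix counts.

    Hypothetical helper for illustration. Returns 0.0 wherever a ratio
    would otherwise divide by zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Always-not-spam baseline: it never predicts the positive class
baseline = prf_from_counts(tp=0, fp=0, fn=100)

# Spam filter from the example: catches 80 of 100 spam, no false alarms
model = prf_from_counts(tp=80, fp=0, fn=20)

print("baseline (precision, recall, f1):", baseline)
print("model    (precision, recall, f1):", model)
```

Both models score 99% or better on accuracy, but only the positive-class metrics separate them.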

Regression Metrics

from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

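def evaluate_regression(y_true, y_pred):
    """Standard regression metrics.

    MAPE is undefined when y_true contains zeros.
    """
    # Convert to float arrays so the MAPE arithmetic also works on plain lists
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": np.sqrt(mean_squared_error(y_true, y_pred)),
        "mape": np.mean(np.abs((y_true - y_pred) / y_true)) * 100,
    }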

Ranking Metrics

For search and recommendation systems, order matters more than exact values. Key metrics: MRR (mean reciprocal rank, how high is the first relevant result), Precision@K (of the top K results, how many are relevant), and NDCG@K (measures ranking quality accounting for both relevance and position).
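The three ranking metrics can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: each query is a list of relevance grades in ranked order, any grade above 0 counts as relevant, and NDCG uses linear gain with a log2 position discount (one common formulation; others use an exponential gain). The ideal ordering is computed from the returned list only, so it assumes every relevant item was retrieved. All function names are illustrative.

```python
import math


def reciprocal_rank(relevances):
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over a list of ranked result lists."""
    return sum(reciprocal_rank(r) for r in queries) / len(queries)


def precision_at_k(relevances, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for rel in relevances[:k] if rel > 0) / k


def dcg_at_k(relevances, k):
    """Discounted cumulative gain: linear gain, log2 position discount."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the best possible ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Three queries with binary relevance, in ranked order
queries = [[0, 1, 0, 1], [1, 0, 0, 0], [0, 0, 0, 0]]
print("MRR:", mean_reciprocal_rank(queries))
print("P@2 (first query):", precision_at_k(queries[0], k=2))

# One query with graded relevance; the grade-1 item is ranked below a grade-0 item
print("NDCG@4:", round(ndcg_at_k([3, 2, 0, 1], k=4), 3))
```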

Why Offline Metrics Lie

Distribution Mismatch

Your test set is a frozen snapshot. Production data evolves:

def demonstrate_distribution_shift():
    """
    January: Train model on 2024 data, test on held-out 2024 data.
    Result: 95% accuracy. Ship it.
    
    March: A new slang term becomes popular. Users start using
    it in support tickets. The model has never seen it.
    Result: Accuracy drops to 82% on tickets containing
    the new term. Overall accuracy: 91%.
    
    The test set said 95%. Production says 91% and falling.
    """
    pass
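The numbers in that scenario also show how slice-level accuracy blends into the overall figure. As a rough sketch, assume tickets without the new term still score 95% (the scenario does not state this). The share of traffic containing the term can then be backed out from the overall accuracy. Both helper names are illustrative.

```python
def blended_accuracy(acc_unaffected, acc_affected, affected_share):
    """Overall accuracy as a traffic-weighted mix of two slices."""
    return (1 - affected_share) * acc_unaffected + affected_share * acc_affected


def implied_affected_share(acc_unaffected, acc_affected, acc_overall):
    """Solve the blend equation for the share of affected traffic."""
    return (acc_unaffected - acc_overall) / (acc_unaffected - acc_affected)


# Assumption: unaffected tickets remain at the original 95% accuracy
share = implied_affected_share(acc_unaffected=0.95, acc_affected=0.82, acc_overall=0.91)
print(f"Implied share of tickets with the new term: {share:.1%}")
print(f"Check: blended accuracy = {blended_accuracy(0.95, 0.82, share):.2%}")
```

Under that assumption, roughly three in ten tickets would have to contain the new term to pull overall accuracy down to 91%, which is why a drop confined to one slice can still dominate the headline number.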

Label Leakage

Label leakage occurs when your training and test data include features that would not be available at inference time:

def demonstrate_label_leakage():
    """
    You're predicting whether a patient will be readmitted.
    
    Your dataset includes 'discharge_medication_count'.
    Patients who are readmitted tend to have higher
    medication counts at discharge -- because doctors
    prescribe more when the patient is sicker.
    
    Your model achieves 98% AUC on the test set.
    In production, it performs at 72% AUC.
    
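    Why: 'discharge_medication_count' is only recorded at
    discharge, after the prediction has to be made. The test
    set included this post-decision feature, which leaks
    information about the outcome.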
    """
    pass
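One way to guard against this kind of leakage is to record when each feature first becomes known and drop anything that arrives after the prediction point. The sketch below is hypothetical throughout: the stage names, the catalogue, and every feature except `discharge_medication_count` are invented for illustration.

```python
# Hypothetical pipeline stages, in chronological order
STAGES = ["admission", "during_stay", "discharge"]

# Hypothetical catalogue: the stage at which each feature is first known
FEATURE_AVAILABLE_AT = {
    "age": "admission",
    "admission_diagnosis": "admission",
    "length_of_stay_days": "discharge",
    "discharge_medication_count": "discharge",
}


def usable_features(feature_available_at, prediction_stage):
    """Return features known at or before the stage where the model predicts."""
    cutoff = STAGES.index(prediction_stage)
    return sorted(
        name
        for name, stage in feature_available_at.items()
        if STAGES.index(stage) <= cutoff
    )


# A model that predicts at admission must not see discharge-time features
print(usable_features(FEATURE_AVAILABLE_AT, "admission"))
```

Applying the same filter to both training and test data keeps the offline score tied to what the model can actually see in production.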

Adversarial Inputs

Real users produce inputs your test set never anticipated:

Test set input:  "What is the return policy for electronics?"
Model output:    Correct answer about return policy.

Production input: "whats ur return polcy for electornics lol"
Model output:     Confused response about electricity.

Production input: "RETURN POLICY NOW!!!!"
Model output:     Misclassified as complaint, not question.

Production input: "return policy" (just two words, no context)
Model output:     Generic response, wrong product category.
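Some of these failures can be caught offline by replaying noisy variants of clean test inputs and checking whether the prediction changes. The sketch below is a minimal illustration. The variants are simple and far from exhaustive, and `toy_classify` is a deliberately brittle stand-in for a real model, not any particular system.

```python
def make_variants(text):
    """Generate a few noisy versions of a clean input (illustrative only)."""
    words = text.split()
    return {
        "lowercase": text.lower(),
        "shouting": text.upper().rstrip("?.") + "!!!!",
        "no_punctuation": "".join(
            ch for ch in text if ch.isalnum() or ch.isspace()
        ),
        # Keep only the first half of the words to mimic a clipped message
        "truncated": " ".join(words[: max(1, len(words) // 2)]),
    }


def consistency_report(classify, text):
    """For each variant, report whether the prediction matches the clean input's."""
    expected = classify(text)
    return {
        name: classify(variant) == expected
        for name, variant in make_variants(text).items()
    }


def toy_classify(text):
    """Stand-in model: a case-sensitive keyword match, so it breaks easily."""
    return "return_policy" if "return policy" in text else "other"


report = consistency_report(
    toy_classify, "What is the return policy for electronics?"
)
for name, consistent in report.items():
    print(f"{name:15s} {'ok' if consistent else 'CHANGED'}")
```

The share of variants with unchanged predictions gives a rough robustness score to track alongside accuracy.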

Online Metrics

User Satisfaction Metrics

What actually matters once the model is deployed:

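def track_online_metrics(prediction_log):
    """Metrics that reflect real user experience.

    Expects a non-empty list of dicts with the keys used below.
    """
    if not prediction_log:
        raise ValueError("prediction_log must not be empty")

    # Only completed tasks have a meaningful completion time
    completion_times = [
        p["completion_time_seconds"]
        for p in prediction_log
        if p["task_completed"]
    ]

    metrics = {
        # Did the user accept the model's suggestion?
        "acceptance_rate": (
            sum(1 for p in prediction_log if p["user_accepted"])
            / len(prediction_log)
        ),

        # Did the user complete their task?
        "task_completion_rate": (
            sum(1 for p in prediction_log if p["task_completed"])
            / len(prediction_log)
        ),

        # How long did it take? (NaN if no task was completed)
        "avg_time_to_completion": (
            np.mean(completion_times) if completion_times else float("nan")
        ),

        # Did the user come back?
        # calculate_retention is a helper that is not defined in this snippet.
        "next_day_retention": calculate_retention(prediction_log),
    }

    return metrics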
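The snippet above calls `calculate_retention` without defining it. One possible version is sketched below, assuming each log entry also carries a `user_id` and a `timestamp` (a `datetime`). Both field names are assumptions, as is the definition of retention used here: the share of users who were active again on the day after their first logged day.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def calculate_retention(prediction_log):
    """Share of users active again the day after their first logged day.

    Assumes hypothetical "user_id" and "timestamp" (datetime) keys on
    each entry. Returns 0.0 for an empty log.
    """
    days_by_user = defaultdict(set)
    for entry in prediction_log:
        days_by_user[entry["user_id"]].add(entry["timestamp"].date())

    if not days_by_user:
        return 0.0

    retained = sum(
        1
        for days in days_by_user.values()
        if min(days) + timedelta(days=1) in days
    )
    return retained / len(days_by_user)


# Made-up log: only user "a" returns on the following day
log = [
    {"user_id": "a", "timestamp": datetime(2024, 3, 1, 9)},
    {"user_id": "a", "timestamp": datetime(2024, 3, 2, 14)},
    {"user_id": "b", "timestamp": datetime(2024, 3, 1, 10)},
    {"user_id": "c", "timestamp": datetime(2024, 3, 1, 11)},
    {"user_id": "c", "timestamp": datetime(2024, 3, 3, 8)},
]
print(f"Next-day retention: {calculate_retention(log):.1%}")
```

A production version would also exclude users whose first day is the last day in the log, since their next day has not been observed yet.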

A/B Testing for Models

The gold standard for comparing model versions:

def ab_test_analysis(control_metrics, treatment_metrics):
    """Compare two model versions with statistical rigor.
    
    Don't just compare averages. Check if the difference
    is statistically significant and practically meaningful.
    """
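    from scipy import stats
    
    # Statistical significance. equal_var=False selects Welch's t-test,
    # which does not assume both groups have the same variance.
    t_stat, p_value = stats.ttest_ind(
        control_metrics["satisfaction_scores"],
        treatment_metrics["satisfaction_scores"],
        equal_var=False,
    )
    
    # Effect size (raw difference in means, in the metric's own units)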
    control_mean = np.mean(control_metrics["satisfaction_scores"])
    treatment_mean = np.mean(treatment_metrics["satisfaction_scores"])
    effect_size = treatment_mean - control_mean
    relative_improvement = effect_size / control_mean * 100
    
    result = {
        "control_mean": control_mean,
        "treatment_mean": treatment_mean,
        "absolute_difference": effect_size,
        "relative_improvement_pct": relative_improvement,
        "p_value": p_value,
        "significant": p_value < 0.05,
    }
    
    return result
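A p-value alone does not say whether the difference is large enough to matter. One complementary check is a bootstrap confidence interval on the difference in means, sketched below with the standard library only. The function name, the sample scores, and the choice of a percentile bootstrap are illustrative assumptions, not part of the analysis above.

```python
import random
from statistics import mean


def bootstrap_diff_ci(control, treatment, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(treatment) - mean(control)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each group independently, with replacement
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(mean(t) - mean(c))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up satisfaction scores on a 1-5 scale
control_scores = [3, 4, 3, 5, 4, 3, 4, 4, 3, 5]
treatment_scores = [4, 5, 4, 5, 5, 4, 4, 5, 4, 5]

lower, upper = bootstrap_diff_ci(control_scores, treatment_scores)
observed = mean(treatment_scores) - mean(control_scores)
print(f"Observed difference: {observed:.2f}")
print(f"95% bootstrap interval: [{lower:.2f}, {upper:.2f}]")
```

If the lower bound sits above the smallest improvement worth shipping, the change is both statistically and practically meaningful. If the interval straddles that threshold, the test likely needs more data.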

Business Metrics

The metrics leadership cares about:

Online metric                  Business translation (when the metric improves)
Task completion rate           Users can do what they came to do
Time to completion             Users are more productive
Support ticket deflection      Fewer tickets = lower support costs
Revenue per session            Users are buying more
Churn rate                     Users are staying longer
Error escalation rate          Fewer problems reach human agents

The Evaluation Gap

Research vs Production Evaluation

Research evaluation:
  - Fixed test set, run once
  - Single number (accuracy, F1)
  - Compared to other models on same benchmark
  - Static: the test set never changes

Production evaluation:
  - Continuous monitoring on live traffic
  - Multiple metrics (user satisfaction, latency, cost)
  - Compared to business goals and user expectations
  - Dynamic: the data distribution shifts constantly

Closing the Gap

A production evaluation pipeline should include: (1) standard offline metrics as a sanity check, (2) slice-based evaluation to find weak spots, (3) latency measurement, (4) cost estimation per 1,000 requests, and (5) comparison to the current production model. Run all five before deploying any model change.
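Step (2), slice-based evaluation, is the least standard of the five, so here is a minimal sketch. It assumes each evaluation record is a dict with `label` and `prediction` fields plus one or more slicing attributes. The `device` field and the data are made up for illustration.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Accuracy computed separately for each value of slice_key."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        slice_value = record[slice_key]
        totals[slice_value] += 1
        correct[slice_value] += int(record["label"] == record["prediction"])
    return {value: correct[value] / totals[value] for value in totals}


# Made-up evaluation set: 8 desktop requests, 2 mobile requests
records = (
    [{"device": "desktop", "label": 1, "prediction": 1}] * 8
    + [
        {"device": "mobile", "label": 1, "prediction": 1},
        {"device": "mobile", "label": 1, "prediction": 0},
    ]
)

overall = sum(r["label"] == r["prediction"] for r in records) / len(records)
by_device = accuracy_by_slice(records, "device")

print(f"Overall accuracy: {overall:.0%}")
for device, acc in by_device.items():
    print(f"  {device}: {acc:.0%}")
```

The overall figure looks healthy while the mobile slice sits at chance level, which is the kind of weak spot an aggregate metric hides.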

Real-World Example: Search Ranking Model

A team builds a new search ranking model for an e-commerce site.

Offline evaluation: NDCG@10 improves from 0.72 to 0.78 on the test set. The team is excited and prepares to deploy.

Shadow deployment: The new model runs alongside the old one on live traffic, but its results are not shown to users. The team discovers that the new model is more than 3x slower (800 ms vs 250 ms latency). At scale, this would degrade the user experience.
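A shadow deployment can be sketched as a wrapper that always returns the production model's answer and only logs what the candidate would have done. This is a simplified illustration: `serve_with_shadow` and the stub models are hypothetical, and a real system would normally run the candidate asynchronously so that it adds no latency to the user's request.

```python
import time


def serve_with_shadow(query, production_model, candidate_model, shadow_log):
    """Serve the production result; record the candidate's result and latency."""
    start = time.perf_counter()
    served = production_model(query)
    production_ms = (time.perf_counter() - start) * 1000

    try:
        start = time.perf_counter()
        shadow = candidate_model(query)
        candidate_ms = (time.perf_counter() - start) * 1000
        shadow_log.append({
            "query": query,
            "agree": shadow == served,
            "production_ms": production_ms,
            "candidate_ms": candidate_ms,
        })
    except Exception as exc:
        # A failing candidate must never affect what the user sees
        shadow_log.append({"query": query, "error": repr(exc)})

    return served


# Stub models standing in for the old and new rankers
def old_ranker(query):
    return ["item_1", "item_2"]


def new_ranker(query):
    time.sleep(0.01)  # simulate a slower candidate
    return ["item_2", "item_1"]


shadow_log = []
response = serve_with_shadow("laptop", old_ranker, new_ranker, shadow_log)
print("served:", response)
print("logged:", shadow_log[0])
```

Aggregating `candidate_ms` from the log is how a latency regression like the one above would surface before any user is exposed to the new model.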

After optimization: Latency is reduced to 300ms. The team runs an A/B test: 50% of users see the new model's results.

A/B test results: Click-through rate improves by 4%. But revenue per session is flat. Users click more but buy the same amount. The model is surfacing more "interesting" results that attract clicks but not purchases.

Iteration: The team adds purchase-weighted relevance to the training signal. The updated model shows both higher click-through rate (+3%) and higher revenue per session (+1.5%). This version ships.

The offline metric (NDCG) was directionally correct but did not predict the revenue impact. Only the online A/B test revealed the full picture.

Common Pitfalls

  • Trusting a single metric: No single number captures model quality. Report accuracy, precision, recall, and task-specific metrics together.
  • Evaluating only on the test set: The test set is a starting point, not the final answer. Shadow deployments and A/B tests are required before you can trust a model in production.
  • Ignoring class imbalance: Accuracy on imbalanced datasets is misleading, because a model can score highly by always predicting the majority class. Use precision, recall, F1, or AUC instead.
  • Not measuring latency and cost: A model that is 5% more accurate but 10x slower or 10x more expensive is often a bad trade.
  • Skipping slice-based evaluation: Overall metrics hide failures on subgroups. Always break down performance by relevant dimensions.
  • Comparing to the wrong baseline: Compare your model to the current production system, not to random chance. A model with 90% accuracy adds little if the existing rule-based system already achieves 89%.

Key Takeaways

  • Offline metrics (accuracy, F1, AUC) are necessary but insufficient. They tell you how a model performs on a frozen test set, not on live traffic.
  • Online metrics (user satisfaction, task completion, business outcomes) are the true measure of model quality. Always validate with A/B tests before full deployment.
  • The evaluation gap between research and production is real. A model that wins on benchmarks can fail in production due to distribution shift, latency, cost, or edge cases.
  • Use multiple evaluation strategies: offline metrics for fast iteration, shadow deployments for safety, A/B tests for final validation.
  • Always evaluate by slice. Overall metrics hide failures on subgroups that matter to your users and your business.