Evaluating ML Systems
Overview
Offline metrics tell you how a model performs on a test set. Online metrics tell you how it performs in the real world. The two often disagree. A model with 99% accuracy on a carefully curated test set can still fail badly once deployed to actual users, because the test set does not reflect the messy, shifting distribution of real-world inputs.
The gap between offline evaluation and production performance is where many ML projects fail. Bridging that gap requires measuring the right things at the right time, and never trusting a single number to tell the whole story.
Offline Metrics
Classification Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix,
)

def evaluate_classifier(y_true, y_pred, y_prob=None):
    """Compute standard classification metrics.

    No single metric tells the full story. Always report
    multiple metrics and understand the trade-offs.
    """
    # Note: weighted-average recall is mathematically equal to accuracy.
    # Use average="macro" or per-class scores to expose weak classes.
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted"),
        "recall": recall_score(y_true, y_pred, average="weighted"),
        "f1": f1_score(y_true, y_pred, average="weighted"),
    }
    if y_prob is not None:
        metrics["auc_roc"] = roc_auc_score(
            y_true, y_prob, multi_class="ovr", average="weighted"
        )
    return metrics
When to use which metric:
| Metric | Use when... | Example |
| --- | --- | --- |
| Accuracy | Classes are balanced | Sentiment (pos/neg, 50/50) |
| Precision | False positives are costly | Spam filter (don't lose real email) |
| Recall | False negatives are costly | Fraud detection (don't miss fraud) |
| F1 | You need a balance between precision and recall | Most classification tasks |
| AUC-ROC | You want threshold-independent evaluation | Ranking, scoring tasks |
The Accuracy Trap
def demonstrate_accuracy_trap():
    """Accuracy is misleading for imbalanced classes.

    If 99% of emails are not spam, a model that always
    predicts 'not spam' has 99% accuracy but catches
    zero spam.
    """
    # 10,000 emails: 9,900 legitimate, 100 spam
    y_true = [0] * 9900 + [1] * 100
    # "Model" that always predicts not-spam
    y_pred_dumb = [0] * 10000
    # Actual spam filter: catches 80 of the 100 spam emails
    y_pred_model = [0] * 9900 + [1] * 80 + [0] * 20

    print("Dumb model (always predicts not-spam):")
    print(f" Accuracy: {accuracy_score(y_true, y_pred_dumb):.2%}")
    print(f" Recall: {recall_score(y_true, y_pred_dumb):.2%}")
    print("\nActual model:")
    print(f" Accuracy: {accuracy_score(y_true, y_pred_model):.2%}")
    print(f" Recall: {recall_score(y_true, y_pred_model):.2%}")

# Output:
# Dumb model: Accuracy 99.00%, Recall 0.00%
# Actual model: Accuracy 99.80%, Recall 80.00%
Regression Metrics
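from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

def evaluate_regression(y_true, y_pred):
    """Standard regression metrics.

    MAPE is undefined when y_true contains zeros.
    """
    # Convert to arrays so element-wise arithmetic also works on plain lists
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": np.sqrt(mean_squared_error(y_true, y_pred)),
        "mape": np.mean(np.abs((y_true - y_pred) / y_true)) * 100,
    }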
Ranking Metrics
For search and recommendation systems, order matters more than exact values. Key metrics: MRR (mean reciprocal rank, how high is the first relevant result), Precision@K (of the top K results, how many are relevant), and NDCG@K (measures ranking quality accounting for both relevance and position).
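The three ranking metrics can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names and the sample data are made up, each input is assumed to be a list of relevance grades in ranked order (0 means not relevant), DCG uses linear gain (some definitions use 2^rel - 1 instead), and the ideal ordering for NDCG is taken from the returned results only rather than from every relevant item in the corpus.

```python
import math

def reciprocal_rank(relevances):
    """Return 1/rank of the first relevant result, or 0.0 if there is none."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(result_lists):
    """Average the reciprocal rank over several queries."""
    return sum(reciprocal_rank(r) for r in result_lists) / len(result_lists)

def precision_at_k(relevances, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for rel in relevances[:k] if rel > 0) / k

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: linear gain, log2 position discount."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the same results in ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# One query: graded relevance of the results in ranked order (made-up data)
ranked = [0, 2, 1, 0, 1]
print(reciprocal_rank(ranked))         # first relevant result is at rank 2
print(precision_at_k(ranked, 3))       # 2 of the top 3 are relevant
print(round(ndcg_at_k(ranked, 3), 3))
```

MRR only looks at the first relevant hit, Precision@K ignores order within the top K, and NDCG@K rewards putting the most relevant results first, which is why the three are usually reported together.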
Why Offline Metrics Lie
Distribution Mismatch
Your test set is a frozen snapshot. Production data evolves:
def demonstrate_distribution_shift():
    """
    January: Train model on 2024 data, test on held-out 2024 data.
    Result: 95% accuracy. Ship it.

    March: A new slang term becomes popular. Users start using
    it in support tickets. The model has never seen it.
    Result: Accuracy drops to 82% on tickets containing
    the new term. Overall accuracy: 91%.

    The test set said 95%. Production says 91% and falling.
    """
    pass
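The numbers in that scenario also say how much traffic is affected. Overall accuracy is a traffic-weighted average of the two slices, so the share of tickets containing the new term can be backed out. The sketch below assumes accuracy on all other tickets stays at 95%, which the scenario does not state; the function names are illustrative.

```python
def blended_accuracy(acc_unaffected, acc_affected, affected_share):
    """Overall accuracy as a traffic-weighted average of two slices."""
    return (1 - affected_share) * acc_unaffected + affected_share * acc_affected

def implied_affected_share(acc_unaffected, acc_affected, acc_overall):
    """Solve the weighted average for the share of affected traffic."""
    return (acc_unaffected - acc_overall) / (acc_unaffected - acc_affected)

# Assumption: tickets without the new term are still classified at 95%
share = implied_affected_share(0.95, 0.82, 0.91)
print(f"{share:.1%}")  # → 30.8%
```

Under that assumption, roughly three in ten tickets already contain the new term, and the overall number will keep sliding as the share grows.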
Label Leakage
Features in your test set that would not be available at inference time:
def demonstrate_label_leakage():
    """
    You're predicting whether a patient will be readmitted.
    Your dataset includes 'discharge_medication_count'.

    Patients who are readmitted tend to have higher
    medication counts at discharge, because doctors
    prescribe more when the patient is sicker.

    Your model achieves 98% AUC on the test set.
    In production, it performs at 72% AUC.

    Why: the test set included post-decision features
    that leak information about the outcome. If the model
    has to score patients before discharge, the medication
    count does not exist yet at inference time.
    """
    pass
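Whether a feature leaks depends on when the prediction is made. One way to make that explicit is to record the point at which each feature's value first becomes known and to drop everything that arrives after the prediction point. The sketch below is hypothetical throughout: the stage names, the feature catalog, and the helper are illustrative and not part of any library.

```python
# Hypothetical ordering of points in a patient stay; names are illustrative.
STAGES = ["admission", "during_stay", "discharge", "post_discharge"]

# Hypothetical catalog: the stage at which each feature's value is first known.
FEATURE_AVAILABLE_AT = {
    "age": "admission",
    "admission_diagnosis": "admission",
    "length_of_stay_days": "discharge",
    "discharge_medication_count": "discharge",
    "followup_visit_count": "post_discharge",
}

def split_by_availability(feature_available_at, prediction_stage):
    """Split features into those known by prediction_stage and those that leak."""
    cutoff = STAGES.index(prediction_stage)
    usable, leaky = [], []
    for name, stage in feature_available_at.items():
        if STAGES.index(stage) <= cutoff:
            usable.append(name)
        else:
            leaky.append(name)
    return sorted(usable), sorted(leaky)

# A model that scores patients at admission cannot use discharge-time features
usable, leaky = split_by_availability(FEATURE_AVAILABLE_AT, "admission")
print(usable)
print(leaky)
```

Running the same split with `prediction_stage="discharge"` moves the medication count into the usable set, which shows that leakage is a property of the feature and the prediction time together.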
Adversarial Inputs
Real users produce inputs your test set never anticipated:
Test set input: "What is the return policy for electronics?"
Model output: Correct answer about return policy.
Production input: "whats ur return polcy for electornics lol"
Model output: Confused response about electricity.
Production input: "RETURN POLICY NOW!!!!"
Model output: Misclassified as complaint, not question.
Production input: "return policy" (just two words, no context)
Model output: Generic response, wrong product category.
Online Metrics
User Satisfaction Metrics
What actually matters once the model is deployed:
import numpy as np

def track_online_metrics(prediction_log):
    """Metrics that reflect real user experience.

    Assumes prediction_log is a non-empty list of dicts and that
    calculate_retention() is defined elsewhere.
    """
    metrics = {
        # Did the user accept the model's suggestion?
        "acceptance_rate": (
            sum(1 for p in prediction_log if p["user_accepted"])
            / len(prediction_log)
        ),
        # Did the user complete their task?
        "task_completion_rate": (
            sum(1 for p in prediction_log if p["task_completed"])
            / len(prediction_log)
        ),
        # How long did it take? (completed tasks only)
        "avg_time_to_completion": np.mean([
            p["completion_time_seconds"]
            for p in prediction_log
            if p["task_completed"]
        ]),
        # Did the user come back?
        "next_day_retention": calculate_retention(prediction_log),
    }
    return metrics
A/B Testing for Models
The gold standard for comparing model versions:
import numpy as np
from scipy import stats

def ab_test_analysis(control_metrics, treatment_metrics):
    """Compare two model versions with statistical rigor.

    Don't just compare averages. Check if the difference
    is statistically significant and practically meaningful.
    """
    # Statistical significance (two-sample t-test; assumes equal
    # variances by default, pass equal_var=False for Welch's test)
    t_stat, p_value = stats.ttest_ind(
        control_metrics["satisfaction_scores"],
        treatment_metrics["satisfaction_scores"],
    )
    # Effect size
    control_mean = np.mean(control_metrics["satisfaction_scores"])
    treatment_mean = np.mean(treatment_metrics["satisfaction_scores"])
    effect_size = treatment_mean - control_mean
    relative_improvement = effect_size / control_mean * 100
    result = {
        "control_mean": control_mean,
        "treatment_mean": treatment_mean,
        "absolute_difference": effect_size,
        "relative_improvement_pct": relative_improvement,
        "p_value": p_value,
        "significant": p_value < 0.05,
    }
    return result
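The docstring asks for a difference that is both statistically significant and practically meaningful, but a p-value alone only covers the first half. One possible way to cover the second is to put a confidence interval around the difference and compare its lower bound with the smallest improvement worth shipping. The sketch below is an assumption-laden illustration: `min_effect` is a hypothetical threshold you would choose yourself, higher scores are assumed to be better, and the interval uses a normal approximation that is only reasonable for large samples.

```python
import math
from statistics import NormalDist, mean, variance

def difference_with_ci(control, treatment, min_effect, confidence=0.95):
    """Difference in means with a normal-approximation confidence interval.

    min_effect is the smallest improvement considered worth shipping
    (a hypothetical, team-chosen threshold).
    """
    diff = mean(treatment) - mean(control)
    # Standard error of the difference, allowing unequal variances
    se = math.sqrt(
        variance(control) / len(control) + variance(treatment) / len(treatment)
    )
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    low, high = diff - z * se, diff + z * se
    return {
        "difference": diff,
        "ci_low": low,
        "ci_high": high,
        # The interval excludes zero
        "statistically_significant": low > 0 or high < 0,
        # Even the pessimistic end of the interval clears the threshold
        "practically_meaningful": low >= min_effect,
    }

# Made-up satisfaction scores, 200 users per arm
control_scores = [3.0, 3.5, 4.0, 3.5] * 50
treatment_scores = [4.0, 4.5, 5.0, 4.5] * 50
print(difference_with_ci(control_scores, treatment_scores, min_effect=0.5))
```

With very large samples a tiny difference can be significant without being useful, so checking the lower bound against `min_effect` keeps the decision tied to an improvement someone actually cares about.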
Business Metrics
The metrics leadership cares about:
| Online metric | Business translation |
| --- | --- |
| Task completion rate | Users can do what they came to do |
| Time to completion | Users are more productive |
| Support ticket deflection | Fewer tickets = lower support costs |
| Revenue per session | Users are buying more |
| Churn rate | Users are staying longer |
| Error escalation rate | Fewer problems reach human agents |
The Evaluation Gap
Research vs Production Evaluation
Research evaluation:
- Fixed test set, run once
- Single number (accuracy, F1)
- Compared to other models on same benchmark
- Static: the test set never changes
Production evaluation:
- Continuous monitoring on live traffic
- Multiple metrics (user satisfaction, latency, cost)
- Compared to business goals and user expectations
- Dynamic: the data distribution shifts constantly
Closing the Gap
A production evaluation pipeline should include: (1) standard offline metrics as a sanity check, (2) slice-based evaluation to find weak spots, (3) latency measurement, (4) cost estimation per 1,000 requests, and (5) comparison to the current production model. Run all five before deploying any model change.
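Step (2), slice-based evaluation, is the one that overall metrics cannot replace. A minimal sketch follows, assuming each evaluation record is a dict with `label`, `prediction`, and a slicing field such as `device`. The field names, the 5-point gap, and the minimum slice size are illustrative choices, not recommendations.

```python
from collections import defaultdict

def accuracy_by_slice(records, slice_key):
    """Return {slice value: {"n": count, "accuracy": share correct}}."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        value = record[slice_key]
        total[value] += 1
        correct[value] += int(record["label"] == record["prediction"])
    return {v: {"n": total[v], "accuracy": correct[v] / total[v]} for v in total}

def weak_slices(per_slice, overall_accuracy, max_gap=0.05, min_n=30):
    """Slices with at least min_n rows that trail overall accuracy by more than max_gap."""
    return sorted(
        value for value, m in per_slice.items()
        if m["n"] >= min_n and overall_accuracy - m["accuracy"] > max_gap
    )

# Made-up evaluation log: 100 desktop rows (95 correct), 50 mobile rows (35 correct)
records = (
    [{"device": "desktop", "label": 1, "prediction": 1}] * 95
    + [{"device": "desktop", "label": 1, "prediction": 0}] * 5
    + [{"device": "mobile", "label": 1, "prediction": 1}] * 35
    + [{"device": "mobile", "label": 1, "prediction": 0}] * 15
)
overall = sum(r["label"] == r["prediction"] for r in records) / len(records)
per_slice = accuracy_by_slice(records, "device")
print(weak_slices(per_slice, overall))  # → ['mobile']
```

Overall accuracy here is about 87%, which looks acceptable until the mobile slice shows 70%. The `min_n` guard keeps tiny slices from raising alarms on noise.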
Real-World Example: Search Ranking Model
A team builds a new search ranking model for an e-commerce site.
Offline evaluation: NDCG@10 improves from 0.72 to 0.78 on the test set. The team is excited and prepares to deploy.
Shadow deployment: The new model runs alongside the old one on live traffic, but its results are not shown to users. The team discovers that the new model is roughly 3x slower (800 ms vs 250 ms latency). At scale, this would degrade the user experience.
After optimization: Latency is reduced to 300ms. The team runs an A/B test: 50% of users see the new model's results.
A/B test results: Click-through rate improves by 4%. But revenue per session is flat. Users click more but buy the same amount. The model is surfacing more "interesting" results that attract clicks but not purchases.
Iteration: The team adds purchase-weighted relevance to the training signal. The updated model shows both higher click-through rate (+3%) and higher revenue per session (+1.5%). This version ships.
The offline metric (NDCG) was directionally correct but did not predict the revenue impact. Only the online A/B test revealed the full picture.
Common Pitfalls
- Trusting a single metric: No single number captures model quality. Report accuracy, precision, recall, and task-specific metrics together.
- Evaluating only on the test set: The test set is a starting point, not the final answer. Shadow deployments and A/B tests are required before you can trust a model in production.
- Ignoring class imbalance: Accuracy on heavily imbalanced datasets is misleading, because a model can score well by always predicting the majority class. Use precision, recall, F1, or AUC instead.
- Not measuring latency and cost: A model that is 5% more accurate but 10x slower or 10x more expensive is often a bad trade.
- Skipping slice-based evaluation: Overall metrics hide failures on subgroups. Always break down performance by relevant dimensions.
- Comparing to the wrong baseline: Compare your model to the current production system, not to random chance. A model with 90% accuracy adds little if the existing rule-based system already achieves 89%.
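Latency and cost are easy to measure before deployment. The sketch below times a prediction function and converts an hourly serving cost into a cost per 1,000 requests. It is a rough illustration under stated assumptions: `predict` is any callable standing in for the model, timing is single-threaded on one machine, and the cost formula assumes one instance serving at a steady throughput. All names and figures are hypothetical.

```python
import statistics
import time

def measure_latency_ms(predict, inputs, warmup=3):
    """Time predict() on each input and report latency percentiles in ms.

    Needs at least two inputs. Tail percentiles matter more than the
    mean, because slow requests are the ones users notice.
    """
    for x in inputs[:warmup]:
        predict(x)  # warm caches before timing
    samples = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        samples.append((time.perf_counter() - start) * 1000.0)
    # n=100 yields 99 cut points; index i holds the (i + 1)th percentile
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50_ms": cuts[49], "p95_ms": cuts[94], "p99_ms": cuts[98]}

def cost_per_1000_requests(hourly_instance_cost, requests_per_second):
    """Serving cost per 1,000 requests for one instance at steady throughput."""
    requests_per_hour = requests_per_second * 3600
    return hourly_instance_cost / requests_per_hour * 1000

# Stand-in model and made-up prices
latency = measure_latency_ms(lambda x: x * 2, list(range(200)))
print(latency)
print(cost_per_1000_requests(hourly_instance_cost=3.60, requests_per_second=10))
```

Running both functions for the candidate and for the current production model puts the accuracy gain next to its latency and cost, which is the comparison the trade-off actually depends on.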
Key Takeaways
- Offline metrics (accuracy, F1, AUC) are necessary but insufficient. They tell you how a model performs on a frozen test set, not on live traffic.
- Online metrics (user satisfaction, task completion, business outcomes) are the true measure of model quality. Always validate with A/B tests before full deployment.
- The evaluation gap between research and production is real. A model that wins on benchmarks can fail in production due to distribution shift, latency, cost, or edge cases.
- Use multiple evaluation strategies: offline metrics for fast iteration, shadow deployments for safety, A/B tests for final validation.
- Always evaluate by slice. Overall metrics hide failures on subgroups that matter to your users and your business.