A/B Testing Models
Your new model has better accuracy on the test set. Should you ship it? Not yet. Offline metrics (accuracy, F1, AUC) tell you how the model performs on historical data. Online metrics (click-through rate, conversion, revenue, user satisfaction) tell you how it performs in the real world. These two sets of metrics disagree more often than you would expect. A/B testing is how you resolve the disagreement.
Why Offline Metrics Are Not Enough
Offline evaluation uses a held-out dataset with known labels. The model makes predictions, you compare to ground truth, and you get a number. This number is necessary but insufficient.
Reasons offline metrics mislead:
1. Distribution mismatch: test set does not reflect current production traffic
2. Proxy metrics: accuracy on labels != business outcome you care about
3. Missing feedback loops: a recommendation model changes what users see,
which changes what they click, which changes the "correct" labels
4. Latency effects: a more accurate model that takes 3x longer may lose users
5. Edge cases: test set is averaged; production has long tails that matter
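Distribution mismatch, the first reason above, is one of the few items on this list you can check before going online. The following is a minimal sketch, assuming you can sample one numeric feature from both the test set and recent production traffic. It computes the Population Stability Index (PSI) between the two binned samples. The helper name, the bin count, and the synthetic data are illustrative assumptions; commonly cited rules of thumb treat a PSI above roughly 0.25 as a large shift, but that cutoff is a convention, not a guarantee.

```python
import math
import random

def population_stability_index(expected, actual, n_bins=10):
    """PSI between two samples of one numeric feature (hypothetical helper)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp values outside the reference range into the edge bins
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

rng = random.Random(0)
test_set = [rng.gauss(0.0, 1.0) for _ in range(5000)]         # synthetic "offline" data
same_traffic = [rng.gauss(0.0, 1.0) for _ in range(5000)]     # production, no drift
shifted_traffic = [rng.gauss(0.7, 1.0) for _ in range(5000)]  # production, mean has moved

psi_same = population_stability_index(test_set, same_traffic)
psi_shifted = population_stability_index(test_set, shifted_traffic)
print(f"no shift: {psi_same:.3f}, shifted: {psi_shifted:.3f}")
```

A low PSI does not make the offline metric trustworthy on its own; it only rules out one of the five failure modes.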
Real example: a search ranking model with 2% higher NDCG (a ranking quality metric) on the test set was A/B tested in production. It showed a 1.5% decrease in click-through rate. The model was more "correct" by the test set's definition but surfaced results that were technically relevant yet not what users wanted.
A/B Testing Fundamentals
Split production traffic between two groups: control (current model) and treatment (new model). Measure a business metric. Determine if the difference is statistically significant.
# Simple A/B test setup
import hashlib

def assign_variant(user_id, experiment_name, treatment_fraction=0.5):
    """Deterministic assignment: same user always gets same variant."""
    hash_input = f"{experiment_name}:{user_id}"
    hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
    bucket = (hash_value % 10000) / 10000  # 0.0 to 0.9999
    return "treatment" if bucket < treatment_fraction else "control"

def get_prediction(user_id, input_data, experiment_name="search_v2"):
    variant = assign_variant(user_id, experiment_name)
    if variant == "treatment":
        return new_model.predict(input_data), variant
    else:
        return current_model.predict(input_data), variant
Key requirements:
1. Random assignment: Users are randomly assigned to control or treatment.
No self-selection, no cherry-picking.
2. Deterministic: The same user always sees the same variant for the
duration of the experiment. Flip-flopping confuses users
and corrupts your data.
3. Sufficient sample: You need enough data to detect a meaningful difference.
Under-powered tests produce inconclusive results.
4. Single variable: Change one thing at a time. If you change the model
and the UI simultaneously, you cannot attribute the result.
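Requirements 1 and 2 can be spot-checked before an experiment starts. The sketch below repeats the hashing scheme from the setup code, confirms that repeated assignment gives identical results, and runs a sample ratio mismatch (SRM) check: a chi-square goodness-of-fit test on the group sizes. The `srm_p_value` helper, the synthetic user IDs, and the lopsided 10,500 / 9,500 split are illustrative assumptions.

```python
import hashlib
import math
from collections import Counter

def assign_variant(user_id, experiment_name, treatment_fraction=0.5):
    """Same hashing scheme as the setup code above."""
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    bucket = (int(digest, 16) % 10000) / 10000
    return "treatment" if bucket < treatment_fraction else "control"

def srm_p_value(n_control, n_treatment, expected_treatment_fraction=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) on group sizes."""
    total = n_control + n_treatment
    exp_t = total * expected_treatment_fraction
    exp_c = total - exp_t
    chi2 = (n_treatment - exp_t) ** 2 / exp_t + (n_control - exp_c) ** 2 / exp_c
    # Survival function of chi-square with 1 dof, via the complementary error function
    return math.erfc(math.sqrt(chi2 / 2))

user_ids = [f"user-{i}" for i in range(20000)]  # synthetic IDs
first = [assign_variant(u, "search_v2") for u in user_ids]
second = [assign_variant(u, "search_v2") for u in user_ids]
counts = Counter(first)

deterministic = first == second
p_observed = srm_p_value(counts["control"], counts["treatment"])
p_broken = srm_p_value(10500, 9500)  # a split this lopsided should be flagged
print(deterministic, dict(counts), round(p_observed, 3), p_broken)
```

A very small SRM p-value usually points to a bug in assignment or logging rather than a real effect, so it is worth investigating before reading any outcome metric.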
Statistical Significance
You need to determine whether the observed difference between control and treatment is real or just noise.
from scipy import stats
import numpy as np
def evaluate_ab_test(control_conversions, control_total,
                     treatment_conversions, treatment_total,
                     confidence_level=0.95):
    """Evaluate a simple A/B test on conversion rate."""
    control_rate = control_conversions / control_total
    treatment_rate = treatment_conversions / treatment_total

    # Pooled proportion
    pooled = (control_conversions + treatment_conversions) / (control_total + treatment_total)

    # Standard error
    se = np.sqrt(pooled * (1 - pooled) * (1/control_total + 1/treatment_total))

    # Z-score and p-value
    z_score = (treatment_rate - control_rate) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))  # two-tailed
    significant = p_value < (1 - confidence_level)

    return {
        "control_rate": control_rate,
        "treatment_rate": treatment_rate,
        "relative_lift": (treatment_rate - control_rate) / control_rate,
        "p_value": p_value,
        "significant": significant,
    }
# Example
result = evaluate_ab_test(
    control_conversions=1200, control_total=50000,
    treatment_conversions=1350, treatment_total=50000,
    confidence_level=0.95
)
# relative_lift: 12.5%, p_value: ~0.0026 (z is about 3.01), significant: True
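A p-value says whether an effect is distinguishable from noise, not how large it plausibly is. As a complement, here is a minimal sketch of a confidence interval for the absolute difference in conversion rates, using the same example counts. The function name is an assumption, and the interval is the simple Wald interval with an unpooled standard error, which is reasonable at these sample sizes but not for very small counts.

```python
from statistics import NormalDist

def diff_confidence_interval(control_conversions, control_total,
                             treatment_conversions, treatment_total,
                             confidence_level=0.95):
    """Wald interval for (treatment rate - control rate), unpooled standard error."""
    p_c = control_conversions / control_total
    p_t = treatment_conversions / treatment_total
    se = (p_c * (1 - p_c) / control_total + p_t * (1 - p_t) / treatment_total) ** 0.5
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    diff = p_t - p_c
    return diff - z * se, diff, diff + z * se

low, diff, high = diff_confidence_interval(1200, 50000, 1350, 50000)
print(f"absolute lift: {diff:.4f} (95% CI {low:.4f} to {high:.4f})")
# absolute lift: 0.0030 (95% CI 0.0010 to 0.0050)
```

The interval excludes zero, which agrees with the significance test, but it also shows that the true lift could plausibly be anywhere from about 0.1 to 0.5 percentage points.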
Sample Size Planning
Running a test that is too small wastes time: it cannot detect real differences. Running one longer than necessary keeps traffic on a potentially worse model.
import numpy as np
from scipy.stats import norm

def required_sample_size(baseline_rate, minimum_detectable_effect,
                         alpha=0.05, power=0.80):
    """
    Calculate required sample size per variant.

    baseline_rate: current conversion rate (e.g., 0.024 for 2.4%)
    minimum_detectable_effect: relative change to detect (e.g., 0.10 for 10%)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + minimum_detectable_effect)

    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided test
    z_beta = norm.ppf(power)

    pooled_p = (p1 + p2) / 2

    n = ((z_alpha * np.sqrt(2 * pooled_p * (1 - pooled_p)) +
          z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p2 - p1) ** 2

    return int(np.ceil(n))
# To detect a 10% relative lift on a 2.4% baseline conversion rate:
n = required_sample_size(0.024, 0.10)
# ~67,000 users per variant, ~134,000 total
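Once you have a per-variant sample size, the next question is how long the test must run. The sketch below converts a sample size into days, under the simplifying assumption that each day brings a fixed number of new, previously unseen users. The `daily_eligible_users` figure, the 40,000 sample size, and the function name are hypothetical; the `min_days` floor of 14 reflects the common advice to run at least two full weeks.

```python
import math

def estimated_duration_days(n_per_variant, daily_eligible_users,
                            treatment_fraction=0.5, min_days=14):
    """Days until the smaller arm reaches n_per_variant, floored at min_days."""
    smaller_arm_share = min(treatment_fraction, 1 - treatment_fraction)
    days_for_power = math.ceil(n_per_variant / (daily_eligible_users * smaller_arm_share))
    return max(days_for_power, min_days)

# Illustrative numbers only
print(estimated_duration_days(40000, daily_eligible_users=10000))                           # 14
print(estimated_duration_days(40000, daily_eligible_users=10000, treatment_fraction=0.25))  # 16
```

With a 50/50 split the power requirement is met in 8 days, so the two-week floor dominates; with only 25% of traffic in treatment, the smaller arm needs 16 days.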
Shadow Mode
Before splitting real traffic, run the new model in shadow mode: it receives the same inputs as the production model, produces predictions, but its results are not shown to users. You log both models' predictions and compare.
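import time

class ShadowDeployment:
    def __init__(self, production_model, shadow_model, logger):
        self.production = production_model
        self.shadow = shadow_model
        self.logger = logger

    def predict(self, input_data, request_id):
        # Production model serves the actual response
        start = time.perf_counter()
        prod_result = self.production.predict(input_data)
        prod_ms = (time.perf_counter() - start) * 1000

        # Shadow model sees the same input; its result is logged, never served.
        # This sketch calls it inline, which adds its latency to the request;
        # in production, run it asynchronously or off the request path.
        try:
            start = time.perf_counter()
            shadow_result = self.shadow.predict(input_data)
            shadow_ms = (time.perf_counter() - start) * 1000
            self.logger.log({
                "request_id": request_id,
                "input": input_data,
                "production_prediction": prod_result,
                "shadow_prediction": shadow_result,
                "agreement": prod_result == shadow_result,
                "production_latency_ms": prod_ms,
                "shadow_latency_ms": shadow_ms,
            })
        except Exception as e:
            self.logger.log_error(f"Shadow model failed: {e}")

        return prod_result  # always serve production result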
Shadow mode tells you:
- How often the models agree or disagree
- Where they disagree (which input patterns cause divergence)
- Whether the shadow model is faster or slower
- Whether the shadow model fails on edge cases the production model handles
Shadow mode does not tell you which model produces better business outcomes — for that, you need the real A/B test. But it de-risks the A/B test by catching crashes, latency regressions, and obvious quality problems before any user is affected.
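To see where the models diverge, aggregate the shadow logs by an input attribute. The following is a minimal sketch, assuming log records shaped like the ones `ShadowDeployment` writes and inputs that are dictionaries carrying a `locale` field; the field, the helper name, and the sample records are all hypothetical.

```python
from collections import defaultdict

def summarize_shadow_logs(records, segment_key):
    """Overall agreement rate plus per-segment disagreement rates."""
    totals = defaultdict(int)
    disagreements = defaultdict(int)
    for rec in records:
        segment = rec["input"][segment_key]
        totals[segment] += 1
        if rec["production_prediction"] != rec["shadow_prediction"]:
            disagreements[segment] += 1
    n = sum(totals.values())
    overall_agreement = 1 - sum(disagreements.values()) / n
    by_segment = {s: disagreements[s] / totals[s] for s in totals}
    return overall_agreement, by_segment

# Hypothetical log records
logs = [
    {"input": {"locale": "en"}, "production_prediction": "A", "shadow_prediction": "A"},
    {"input": {"locale": "en"}, "production_prediction": "B", "shadow_prediction": "B"},
    {"input": {"locale": "de"}, "production_prediction": "A", "shadow_prediction": "B"},
    {"input": {"locale": "de"}, "production_prediction": "A", "shadow_prediction": "A"},
]

agreement, disagreement_by_locale = summarize_shadow_logs(logs, "locale")
print(agreement, disagreement_by_locale)
# 0.75 {'en': 0.0, 'de': 0.5}
```

A segment with an unusually high disagreement rate is a good place to start manual review before exposing any users to the new model.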
Canary Deployments
A canary deployment is a middle ground between shadow mode and a full A/B test. Route a small percentage (1-5%) of traffic to the new model and monitor closely. If things look good, gradually increase traffic.
class CanaryDeployment:
    def __init__(self, stable_model, canary_model, monitor):
        self.stable = stable_model
        self.canary = canary_model
        self.monitor = monitor
        self.canary_fraction = 0.0

    def set_canary_fraction(self, fraction):
        """Gradually increase canary traffic."""
        self.canary_fraction = fraction

    def predict(self, user_id, input_data):
        # assign_variant is the hashing helper from the A/B test setup above.
        # Because a user's bucket is fixed, raising the fraction only adds
        # users to the canary; nobody already on it is moved back.
        variant = assign_variant(user_id, "canary", self.canary_fraction)
        if variant == "treatment":
            result = self.canary.predict(input_data)
            self.monitor.log("canary", input_data, result)
        else:
            result = self.stable.predict(input_data)
            self.monitor.log("stable", input_data, result)
        return result
# Rollout schedule
# Day 1: 1% canary traffic - check for errors and latency
# Day 2: 5% canary traffic - check for quality metrics
# Day 3: 25% canary traffic - statistical power for business metrics
# Day 5: 50% canary traffic - full A/B test
# Day 10: 100% canary traffic - rollout complete (or rollback)
Automatic Rollback
The real value of canary deployments: you can automatically roll back if something goes wrong.
class CanaryMonitor:
    def __init__(self, max_error_rate=0.05, max_latency_p99_ms=500,
                 min_confidence=0.5):
        self.thresholds = {
            "error_rate": max_error_rate,
            "latency_p99": max_latency_p99_ms,
            "min_avg_confidence": min_confidence,
        }

    def should_rollback(self, canary_metrics):
        """Return (rollback, reason). Unpack the tuple before testing it:
        a non-empty tuple is always truthy."""
        if canary_metrics["error_rate"] > self.thresholds["error_rate"]:
            return True, "Error rate exceeded threshold"
        if canary_metrics["latency_p99"] > self.thresholds["latency_p99"]:
            return True, "Latency exceeded threshold"
        if canary_metrics["avg_confidence"] < self.thresholds["min_avg_confidence"]:
            return True, "Prediction confidence too low"
        return False, "Canary looks healthy"
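The rollout schedule and the health check can be tied together in a small control loop. The sketch below is one possible shape, not a production controller: `run_rollout`, `check`, and `fake_metrics` are hypothetical names, `check` is a stand-in for `CanaryMonitor.should_rollback`, and the metrics are synthetic. A real loop would wait at each step long enough to collect meaningful metrics.

```python
def run_rollout(schedule, get_metrics, should_rollback, set_fraction):
    """Walk the schedule; roll back to 0% on the first failed health check."""
    for fraction in schedule:
        set_fraction(fraction)
        rollback, reason = should_rollback(get_metrics(fraction))
        if rollback:
            set_fraction(0.0)
            return False, f"Rolled back at {fraction:.0%}: {reason}"
    return True, "Rollout complete"

def check(metrics, max_error_rate=0.05, max_latency_p99_ms=500):
    """Stand-in for CanaryMonitor.should_rollback: returns (rollback, reason)."""
    if metrics["error_rate"] > max_error_rate:
        return True, "Error rate exceeded threshold"
    if metrics["latency_p99"] > max_latency_p99_ms:
        return True, "Latency exceeded threshold"
    return False, "Canary looks healthy"

applied = []  # records every fraction that was set, in order

def fake_metrics(fraction):
    # Synthetic data: latency degrades once the canary takes 25% of traffic
    return {"error_rate": 0.01, "latency_p99": 300 if fraction < 0.25 else 650}

ok, message = run_rollout([0.01, 0.05, 0.25, 0.50, 1.0], fake_metrics, check, applied.append)
print(ok, message, applied)
# False Rolled back at 25%: Latency exceeded threshold [0.01, 0.05, 0.25, 0.0]
```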
Multi-Armed Bandits
A fixed-split A/B test holds its allocation (commonly 50/50) for the entire experiment, so if one model is worse, roughly half of your users see it for weeks. Multi-armed bandits dynamically allocate more traffic to the better-performing variant.
import numpy as np
class ThompsonSampling:
    """Thompson Sampling bandit for model selection."""

    def __init__(self, n_models):
        # Beta distribution parameters for each model
        self.successes = np.ones(n_models)  # alpha
        self.failures = np.ones(n_models)   # beta

    def select_model(self):
        """Sample from each model's distribution, pick the highest."""
        samples = [
            np.random.beta(self.successes[i], self.failures[i])
            for i in range(len(self.successes))
        ]
        return np.argmax(samples)

    def update(self, model_index, reward):
        """Update beliefs based on observed reward."""
        if reward:
            self.successes[model_index] += 1
        else:
            self.failures[model_index] += 1
# Usage (models, input_data, and user_clicked are placeholders for your serving code)
bandit = ThompsonSampling(n_models=2) # control and treatment
# For each request
model_idx = bandit.select_model()
prediction = models[model_idx].predict(input_data)
# ... observe outcome ...
bandit.update(model_idx, reward=user_clicked)
Bandits shift traffic toward the better variant as evidence accumulates, which reduces exposure to the worse model, but the adaptive allocation makes statistical analysis harder and can be tricky to implement correctly. Use them when the cost of serving the worse model is high (ad serving, pricing) and standard A/B testing when you need clean causal inference.
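To see the traffic shift in action, the sketch below simulates Thompson Sampling against two models with known click rates. It uses only the standard library (`random.betavariate`) in place of NumPy. The true rates of 3% and 12%, the round count, and the function name are made-up values for illustration.

```python
import random

def simulate_thompson(true_rates, n_rounds, seed=0):
    """Run a Beta-Bernoulli Thompson Sampling loop and count pulls per arm."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [1] * n_arms  # Beta alpha, starting from a uniform prior
    failures = [1] * n_arms   # Beta beta
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Draw one plausible rate per arm and serve the arm with the highest draw
        samples = [rng.betavariate(successes[i], failures[i]) for i in range(n_arms)]
        arm = samples.index(max(samples))
        pulls[arm] += 1
        # Simulated user feedback: click with the arm's true probability
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls

pulls = simulate_thompson(true_rates=[0.03, 0.12], n_rounds=5000)
print(f"model 0 served {pulls[0]} times, model 1 served {pulls[1]} times")
```

Because the worse arm receives few pulls, its rate estimate stays noisy. That is the practical cost behind the harder statistical analysis mentioned above.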
Common Pitfalls
- Peeking at results early. Checking p-values repeatedly during the experiment inflates the false positive rate. Either commit to a fixed sample size upfront, or use sequential testing methods designed for continuous monitoring.
- Not accounting for novelty effects. Users click on new things because they are new, not because they are better. Run experiments for at least 2 weeks to let novelty wear off.
- Testing on the wrong metric. If you optimize for click-through rate, you may get more clicks but lower conversion. Choose a metric that aligns with actual business value.
- Network effects between variants. In social or marketplace applications, control and treatment users interact with each other, contaminating results. Use cluster-based randomization (randomize by geographic region or social cluster).
- Ignoring guardrail metrics. You test one primary metric, but you should also monitor guardrails — metrics that should not degrade. A model that improves search relevance but doubles page load time is a net loss.
- Running too many simultaneous experiments on the same surface. If three experiments modify the same feature, interactions between them make results uninterpretable. Coordinate experiments across teams.
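The first pitfall, peeking, is easy to demonstrate with a simulation. The sketch below runs A/A tests, where both arms share the same true conversion rate, so every "significant" result is a false positive. It compares testing once at the end with stopping at the first of ten interim looks that crosses the threshold. The simulation sizes, the 10% rate, and the function names are illustrative assumptions.

```python
import math
import random

def z_test_p_value(c_conv, c_n, t_conv, t_n):
    """Two-sided pooled two-proportion z-test."""
    pooled = (c_conv + t_conv) / (c_n + t_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / c_n + 1 / t_n))
    if se == 0:
        return 1.0  # no conversions (or all conversions) so far: nothing to test
    z = (t_conv / t_n - c_conv / c_n) / se
    return math.erfc(abs(z) / math.sqrt(2))

def simulate_aa_tests(n_sims=300, n_per_arm=2000, n_looks=10,
                      rate=0.1, alpha=0.05, seed=0):
    """Return (false positive rate with one final test, rate with peeking)."""
    rng = random.Random(seed)
    step = n_per_arm // n_looks
    fixed_fp = peeking_fp = 0
    for _ in range(n_sims):
        c = t = 0
        rejected_at_any_look = False
        for look in range(1, n_looks + 1):
            # Both arms draw from the same rate: there is no real effect
            c += sum(rng.random() < rate for _ in range(step))
            t += sum(rng.random() < rate for _ in range(step))
            p = z_test_p_value(c, look * step, t, look * step)
            rejected_at_any_look = rejected_at_any_look or p < alpha
        fixed_fp += p < alpha               # only the final look counts
        peeking_fp += rejected_at_any_look  # any look may stop the test
    return fixed_fp / n_sims, peeking_fp / n_sims

fixed_rate, peeking_rate = simulate_aa_tests()
print(f"fixed sample size: {fixed_rate:.1%} false positives, peeking: {peeking_rate:.1%}")
```

The fixed-sample rate should land near the nominal 5%, while the peeking rate is typically several times higher. Sequential testing methods exist to close that gap.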
Key Takeaways
- Offline metrics (accuracy, F1) and online metrics (CTR, revenue) disagree regularly. A controlled online experiment such as an A/B test is the most reliable way to know which model is actually better for your users.
- Use shadow mode to de-risk deployment before any user sees the new model. It catches crashes, latency regressions, and obvious quality problems.
- Canary deployments let you gradually shift traffic and automatically roll back on failure. Start at 1%, increase over days.
- Calculate sample size before starting the experiment. Under-powered tests waste time and produce inconclusive results.
- Do not peek at results. Commit to a sample size or use sequential testing methods.
- Multi-armed bandits reduce exposure to the worse model but sacrifice clean statistical inference. Use them when the cost of serving wrong predictions is high.