Monitoring ML Systems

Software monitoring tells you whether your application is up and responding. ML monitoring tells you whether your model is still giving good answers. A model can return HTTP 200 on every request while silently producing garbage predictions because the world changed and the model did not. Traditional monitoring catches the first problem. ML monitoring catches the second.

Why ML Monitoring Is Different

A web server either works or it does not. A model degrades. The inputs shift. The relationship between inputs and outputs changes. The user population evolves. None of this triggers an error — the model keeps predicting, just badly.

Software failure:  Server returns 500. Alert fires. You fix it.
ML failure:        Model returns predictions with high confidence.
                   Predictions are wrong. No alert fires.
                   You find out weeks later when a business metric drops.

ML monitoring exists to close that gap.

The Four Layers of ML Monitoring

Layer 1: Infrastructure Monitoring

Standard software monitoring. Is the model server up? Is latency acceptable? Is the GPU running out of memory?

Metrics to track:
  - Request latency (p50, p95, p99)
  - Throughput (requests/second)
  - Error rate (HTTP 5xx)
  - GPU utilization and memory
  - CPU and RAM usage
  - Model loading time
  - Queue depth (if using async inference)

This is table stakes. Use Prometheus, Datadog, CloudWatch — whatever your team already uses.
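
As a rough illustration of what the latency percentiles in the list above measure, the sketch below computes p50, p95, and p99 from recorded request latencies using the nearest-rank method. In practice your metrics backend computes these for you. The function name `latency_percentiles` and the sample data are hypothetical.

```python
import math

def latency_percentiles(latencies_ms, percentiles=(50, 95, 99)):
    """Nearest-rank percentiles of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    result = {}
    for p in percentiles:
        # Nearest-rank: smallest value with at least p% of samples at or below it
        rank = max(1, math.ceil(p * n / 100))
        result[f"p{p}"] = ordered[rank - 1]
    return result

# Hypothetical sample: 100 requests taking 1..100 ms
sample = list(range(1, 101))
summary = latency_percentiles(sample)
```

The gap between p50 and p99 is usually more informative than the mean, because a small share of slow requests can hide behind a healthy average.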

Layer 2: Data Monitoring (Input Drift)

The most important layer for catching silent failures. Data drift means the distribution of inputs to your model has changed compared to what it saw during training.

# Detecting data drift with simple statistics
import numpy as np
from scipy import stats

def detect_drift(reference_data, production_data, threshold=0.05):
    """
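    Compare each feature's distribution using the two-sample
    Kolmogorov-Smirnov test (numeric features only).
    Returns a dict mapping each feature name to its KS statistic,
    p-value, and a boolean "drifted" flag (p_value < threshold).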
    """
    drift_results = {}
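    # The KS test needs ordered numeric data, so skip non-numeric columns
    for feature in reference_data.select_dtypes(include="number").columns: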
        stat, p_value = stats.ks_2samp(
            reference_data[feature].dropna(),
            production_data[feature].dropna()
        )
        drift_results[feature] = {
            "statistic": stat,
            "p_value": p_value,
            "drifted": p_value < threshold
        }
    return drift_results

# Run this on a window of recent production data vs training data
# Example: compare last 24 hours of inputs to training distribution
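
One caveat with per-feature tests: when many features are each checked at p < 0.05, some will be flagged by chance alone. A minimal sketch of one common remedy, the Bonferroni correction, applied to the dictionary that `detect_drift` returns. This correction is not part of the original function, and the helper name `bonferroni_drifted` is hypothetical.

```python
def bonferroni_drifted(drift_results, alpha=0.05):
    """Return names of features whose p-value survives Bonferroni correction."""
    n_tests = len(drift_results)
    if n_tests == 0:
        return []
    # Divide the significance level by the number of tests performed
    corrected_alpha = alpha / n_tests
    return sorted(
        feature
        for feature, result in drift_results.items()
        if result["p_value"] < corrected_alpha
    )

# Hypothetical output of detect_drift for three features
example_results = {
    "age": {"statistic": 0.30, "p_value": 0.0001, "drifted": True},
    "income": {"statistic": 0.08, "p_value": 0.03, "drifted": True},
    "tenure": {"statistic": 0.02, "p_value": 0.60, "drifted": False},
}
flagged = bonferroni_drifted(example_results)
```

Here `income` passes the uncorrected 0.05 threshold but not the corrected one, so only `age` remains flagged. Bonferroni is conservative; it trades some sensitivity for fewer false alarms.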

Types of drift:

Covariate shift:    Input distribution changes. Users start asking different
                    kinds of questions. New product categories appear.
                    The model was never trained on these inputs.

Concept drift:      The relationship between inputs and outputs changes.
                    What was spam last month is not spam this month.
                    Customer sentiment around a topic reverses.

Label drift:        The distribution of correct answers changes.
                    Your fraud model was trained on 1% fraud rate,
                    but fraud rate increased to 5%.
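
Once labels are available, a change in the positive rate like the fraud example above can be checked with a two-proportion z-test. The sketch below is one possible approach under a normal approximation, not a method the text prescribes. The function name `label_rate_shift` and the counts are hypothetical.

```python
import math

def label_rate_shift(ref_positives, ref_total, prod_positives, prod_total):
    """Two-proportion z-test for a change in positive-label rate.

    Returns (z, p_value) with a two-sided p-value.
    """
    ref_rate = ref_positives / ref_total
    prod_rate = prod_positives / prod_total
    pooled = (ref_positives + prod_positives) / (ref_total + prod_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / ref_total + 1 / prod_total))
    if std_err == 0:
        return 0.0, 1.0  # both samples are all-positive or all-negative
    z = (prod_rate - ref_rate) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Hypothetical counts: 1% fraud in training labels, 5% in recent labeled production data
z, p_value = label_rate_shift(100, 10_000, 50, 1_000)
```

A large positive `z` with a tiny `p_value` indicates the production fraud rate is higher than the training rate by more than sampling noise would explain.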

Practical Drift Detection

For tabular features, track summary statistics such as means, null rates, and the set of observed categories:

class FeatureMonitor:
    def __init__(self, reference_stats):
        self.reference = reference_stats  # mean, std, quantiles from training data
    
    def check(self, production_window):
        """Check a window of production data against reference."""
        alerts = []
        
        for feature, ref in self.reference.items():
            column = production_window[feature]
            
            # Numerical features: mean shifted by more than 2 reference standard deviations
            if "mean" in ref:
                prod_mean = column.mean()
                if abs(prod_mean - ref["mean"]) > 2 * ref["std"]:
                    alerts.append(f"{feature}: mean shifted from {ref['mean']:.3f} to {prod_mean:.3f}")
            
            # Null rate more than doubled (any nulls count if the reference had none)
            prod_null_rate = column.isna().mean()
            if prod_null_rate > ref.get("null_rate", 0.0) * 2:
                alerts.append(f"{feature}: null rate increased to {prod_null_rate:.1%}")
            
            # Categorical features: categories never seen in the reference data
            if "categories" in ref:
                new_cats = set(column.dropna().unique()) - set(ref["categories"])
                if new_cats:
                    alerts.append(f"{feature}: new categories {new_cats}")
        
        return alerts
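
`FeatureMonitor` needs a `reference_stats` dictionary, which has to be computed from the training data ahead of time. A minimal sketch of how it might be built with pandas follows. The layout (mean, std, and null_rate for numeric columns, categories and null_rate for everything else) is an assumption inferred from how `check` reads the dictionary, and the helper name `build_reference_stats` is hypothetical.

```python
import pandas as pd

def build_reference_stats(training_df):
    """Summarize each column of the training data for later drift checks."""
    reference = {}
    for feature in training_df.columns:
        column = training_df[feature]
        stats_for_feature = {"null_rate": float(column.isna().mean())}
        if pd.api.types.is_numeric_dtype(column):
            stats_for_feature["mean"] = float(column.mean())
            stats_for_feature["std"] = float(column.std())
        else:
            # Treat every non-numeric column as categorical (an assumption)
            stats_for_feature["categories"] = sorted(column.dropna().unique().tolist())
        reference[feature] = stats_for_feature
    return reference

# Hypothetical training data
training_df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0, None],
    "plan": ["free", "pro", "free", "free"],
})
reference_stats = build_reference_stats(training_df)
```

Persist the result alongside the model artifact so that the monitor always compares production data against the statistics of the data that model was actually trained on.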

For text inputs, track embedding drift — embed a sample of production inputs and compare the centroid to the training centroid:

def text_drift_score(reference_embeddings, production_embeddings):
    """Measure how far production text has drifted from training text."""
    ref_centroid = np.mean(reference_embeddings, axis=0)
    prod_centroid = np.mean(production_embeddings, axis=0)
    
    cosine_sim = np.dot(ref_centroid, prod_centroid) / (
        np.linalg.norm(ref_centroid) * np.linalg.norm(prod_centroid)
    )
    return 1 - cosine_sim  # cosine distance: 0 = same direction, 2 = opposite; higher = more drift
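
A raw drift score is hard to interpret without knowing how much it fluctuates when nothing has changed. One way to get a threshold, sketched below under the assumption that embeddings are rows of a NumPy array, is to score random halves of the reference set against each other and take a high quantile of those scores. The helper names and the synthetic embeddings are hypothetical.

```python
import numpy as np

def centroid_cosine_distance(a, b):
    """Cosine distance between the centroids of two embedding sets."""
    ca, cb = np.mean(a, axis=0), np.mean(b, axis=0)
    return 1 - np.dot(ca, cb) / (np.linalg.norm(ca) * np.linalg.norm(cb))

def calibrate_drift_threshold(reference_embeddings, n_splits=200, quantile=0.99, seed=0):
    """Estimate the score that random halves of the reference data produce by chance."""
    rng = np.random.default_rng(seed)
    n = len(reference_embeddings)
    scores = []
    for _ in range(n_splits):
        order = rng.permutation(n)
        first, second = order[: n // 2], order[n // 2 :]
        scores.append(
            centroid_cosine_distance(reference_embeddings[first], reference_embeddings[second])
        )
    return float(np.quantile(scores, quantile))

# Synthetic stand-ins for real embeddings
rng = np.random.default_rng(42)
reference = rng.normal(loc=1.0, size=(500, 16))
shifted = rng.normal(loc=-1.0, size=(500, 16))

threshold = calibrate_drift_threshold(reference)
shifted_score = float(centroid_cosine_distance(reference, shifted))
```

The noise floor depends on sample size, so the halves should be roughly as large as the production windows you plan to score. Note also that a centroid comparison only sees shifts in the average embedding; a change in spread with the same mean would go unnoticed.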

Layer 3: Prediction Monitoring (Output Drift)

Track the distribution of your model's outputs, even when you do not have ground truth labels yet.

# Track prediction distribution over time
class PredictionMonitor:
    def __init__(self):
        self.windows = {}
    
    def log_prediction(self, prediction, confidence, timestamp):
        window_key = timestamp.strftime("%Y-%m-%d-%H")  # hourly windows
        if window_key not in self.windows:
            self.windows[window_key] = {"predictions": [], "confidences": []}
        self.windows[window_key]["predictions"].append(prediction)
        self.windows[window_key]["confidences"].append(confidence)
    
    def check_window(self, window_key, reference_stats):
        window = self.windows[window_key]
        alerts = []
        
        # Prediction distribution shift
        pred_mean = np.mean(window["predictions"])
        if abs(pred_mean - reference_stats["pred_mean"]) > reference_stats["pred_std"] * 2:
            alerts.append(f"Prediction mean shifted to {pred_mean:.3f}")
        
        # Confidence drop (calibration itself can only be checked against labels)
        avg_confidence = np.mean(window["confidences"])
        if avg_confidence < reference_stats["min_confidence"]:
            alerts.append(f"Average confidence dropped to {avg_confidence:.3f}")
        
        # All-same predictions (model collapse)
        unique_preds = len(set(window["predictions"]))
        if unique_preds == 1 and len(window["predictions"]) > 100:
            alerts.append("Model returning identical predictions for all inputs")
        
        return alerts

Things to watch for:

- Prediction distribution shift (model predicting more positives than usual)
- Confidence collapse (average confidence drops significantly)
- Prediction uniformity (model returns the same answer for everything)
- Latency increase (may indicate input complexity changed)
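
Comparing means can miss a shift in the shape of the prediction distribution. One widely used alternative, which the monitor above does not use, is the Population Stability Index (PSI) over binned scores. The sketch below assumes numeric predictions such as probabilities; the function name and bin count are illustrative choices.

```python
import numpy as np

def population_stability_index(reference, production, n_bins=10, eps=1e-6):
    """PSI between two samples, using bins defined by reference quantiles."""
    reference = np.asarray(reference, dtype=float)
    production = np.asarray(production, dtype=float)
    # Quantile edges give each bin roughly equal reference mass
    edges = np.unique(np.quantile(reference, np.linspace(0, 1, n_bins + 1)))
    interior = edges[1:-1]  # values outside the reference range land in the end bins
    n = max(len(edges) - 1, 1)
    ref_frac = np.bincount(np.searchsorted(interior, reference, side="right"), minlength=n)
    prod_frac = np.bincount(np.searchsorted(interior, production, side="right"), minlength=n)
    # Clip to eps so empty bins do not produce log(0) or division by zero
    ref_frac = np.clip(ref_frac / len(reference), eps, None)
    prod_frac = np.clip(prod_frac / len(production), eps, None)
    return float(np.sum((prod_frac - ref_frac) * np.log(prod_frac / ref_frac)))

# Synthetic scores: production is shifted upward relative to the reference
reference_scores = np.linspace(0.0, 1.0, 1000)
production_scores = np.linspace(0.5, 1.5, 1000)

psi_same = population_stability_index(reference_scores, reference_scores)
psi_shifted = population_stability_index(reference_scores, production_scores)
```

Commonly cited rules of thumb treat PSI values somewhere between 0.1 and 0.25 as the point where a shift deserves attention, but those cutoffs are conventions and should be tuned against your own history.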

Layer 4: Outcome Monitoring (Ground Truth)

When you eventually get ground truth labels (from user feedback, manual review, or delayed signals), compare them to predictions.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_window(predictions, ground_truth):
    """Evaluate model performance on a window of labeled data."""
    metrics = {
        "accuracy": accuracy_score(ground_truth, predictions),
        "precision": precision_score(ground_truth, predictions, average="weighted"),
        "recall": recall_score(ground_truth, predictions, average="weighted"),
        "f1": f1_score(ground_truth, predictions, average="weighted"),
    }
    return metrics

# Compare to baseline metrics from validation set
# Alert if any metric drops more than X% from baseline

The challenge: ground truth often arrives with a delay. Fraud labels come after investigation (days). Content quality labels come after user complaints (hours). Recommendation quality comes after purchase/engagement (minutes to days). Design your monitoring pipeline to handle delayed feedback.
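
A minimal sketch of handling delayed feedback: keep predictions keyed by a request identifier, join in whatever labels have arrived, and report both accuracy on the labeled subset and how much of the window is labeled so far. The record layout and the name `join_delayed_labels` are assumptions for illustration.

```python
def join_delayed_labels(prediction_log, labels_by_id):
    """Join logged predictions with whatever labels have arrived so far.

    prediction_log: list of dicts with "request_id" and "prediction"
    labels_by_id:   dict mapping request_id -> ground truth label
    Returns (accuracy_on_labeled, label_coverage); accuracy is None if no labels yet.
    """
    labeled = [
        (entry["prediction"], labels_by_id[entry["request_id"]])
        for entry in prediction_log
        if entry["request_id"] in labels_by_id
    ]
    coverage = len(labeled) / len(prediction_log) if prediction_log else 0.0
    if not labeled:
        return None, coverage
    correct = sum(1 for predicted, actual in labeled if predicted == actual)
    return correct / len(labeled), coverage

# Hypothetical log: four predictions, labels have arrived for three of them
prediction_log = [
    {"request_id": "r1", "prediction": 1},
    {"request_id": "r2", "prediction": 0},
    {"request_id": "r3", "prediction": 1},
    {"request_id": "r4", "prediction": 0},
]
labels_by_id = {"r1": 1, "r2": 1, "r3": 1}

accuracy, coverage = join_delayed_labels(prediction_log, labels_by_id)
```

Reporting coverage next to accuracy matters because early labels are rarely a random sample. Fraud that is caught quickly, for example, may differ from fraud that takes weeks to confirm.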

Feature Stores

Feature stores solve a specific problem: ensuring that the features used during training are identical to the features used during serving. This sounds trivial, but in practice mismatches between the two (training/serving skew) cause a large class of silent ML failures.

The problem:
  Training:  features computed in batch with PySpark, from a data warehouse
  Serving:   features computed in real-time with Python, from a production DB
  Result:    subtle differences in feature computation cause prediction errors
             that infrastructure monitoring cannot see

A feature store provides a single source of truth for feature definitions and computation:

# Feast feature store example
from feast import FeatureStore

store = FeatureStore(repo_path="feature_repo/")

# At training time
training_df = store.get_historical_features(
    entity_df=entity_df,  # entities + timestamps
    features=[
        "user_features:purchase_count_30d",
        "user_features:avg_session_duration",
        "item_features:price",
        "item_features:category",
    ],
).to_df()

# At serving time - same features, same computation
serving_features = store.get_online_features(
    features=[
        "user_features:purchase_count_30d",
        "user_features:avg_session_duration",
        "item_features:price",
        "item_features:category",
    ],
    entity_rows=[{"user_id": 12345, "item_id": 678}],
).to_dict()

Feature stores matter most when:

- Multiple models share the same features
- Feature computation is complex (aggregations over time windows, joins across tables)
- Training/serving skew has caused production incidents before

For simple models with few features, a feature store adds complexity without proportional benefit. You can enforce training/serving consistency by sharing feature computation code between pipelines.
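
A sketch of what sharing feature computation code can look like: one pure function, imported by both the training and the serving pipeline, with the evaluation time passed in explicitly. The feature name `purchase_count_30d` comes from the Feast example; the exact window boundaries and the sample data are assumptions.

```python
from datetime import datetime, timedelta

def purchase_count_30d(purchase_times, as_of):
    """Count purchases in the 30 days up to and including `as_of`."""
    window_start = as_of - timedelta(days=30)
    return sum(1 for t in purchase_times if window_start < t <= as_of)

# Hypothetical purchase history for one user
purchases = [
    datetime(2024, 1, 5),
    datetime(2024, 1, 20),
    datetime(2024, 2, 10),
]

# Training pipeline: point-in-time value as of the training example's timestamp
training_value = purchase_count_30d(purchases, as_of=datetime(2024, 1, 25))

# Serving pipeline: same function, evaluated at request time
serving_value = purchase_count_30d(purchases, as_of=datetime(2024, 2, 25))
```

Passing `as_of` explicitly keeps future purchases out of training features and makes the window definition identical in both paths, which is exactly where hand-duplicated implementations tend to diverge.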

Alerting Strategy

The hardest part of ML monitoring is setting alert thresholds. Too sensitive and you get alert fatigue. Too loose and you miss real problems.

Tiered alerting:

P1 (page someone):
  - Model server is down (standard infra alert)
  - Prediction throughput drops to zero
  - 100% of predictions are the same value (model collapse)
  - Ground truth metrics drop more than 20% from baseline

P2 (Slack notification, investigate within hours):
  - Data drift detected on >3 features simultaneously
  - Average confidence drops more than 15%
  - Prediction distribution shifts significantly
  - Feature null rate doubles

P3 (Daily report, investigate within days):
  - Mild drift on individual features
  - Gradual confidence decline
  - Slow shift in prediction distribution
  - Ground truth metrics decline 5-10%
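
The tiers above can be encoded as a small routing function. The sketch below covers only a subset of the listed signals and makes one assumption the list does not state: a ground truth drop between 10% and 20% is routed to P2. The signal names and the function name `classify_alert` are hypothetical.

```python
def classify_alert(signals):
    """Map monitoring signals to an alert tier ("P1", "P2", "P3") or None."""
    metric_drop = signals.get("metric_drop", 0.0)          # relative drop vs baseline
    confidence_drop = signals.get("confidence_drop", 0.0)  # relative drop vs baseline
    drifted = signals.get("drifted_features", 0)

    if (
        signals.get("throughput", 1) == 0
        or signals.get("all_predictions_identical", False)
        or metric_drop > 0.20
    ):
        return "P1"
    if (
        drifted > 3
        or confidence_drop > 0.15
        or signals.get("null_rate_doubled", False)
        or metric_drop > 0.10  # assumption: the tiers leave the 10-20% range unassigned
    ):
        return "P2"
    if drifted >= 1 or metric_drop >= 0.05:
        return "P3"
    return None

tier = classify_alert({"drifted_features": 4, "metric_drop": 0.02})
```

Checking the most severe tier first means a single evaluation returns the highest applicable severity, so one incident does not page and also flood the daily report.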

Common Pitfalls

- Monitoring only infrastructure, not data or predictions. An ML system can be healthy by every infrastructure metric and still produce useless predictions. Monitor all four layers.
- No reference distribution. You cannot detect drift if you do not know what "normal" looks like. Save statistics from your training data and first successful production deployment as a baseline.
- Alerting on single-feature drift. Individual features drift naturally. Alert when multiple features drift simultaneously or when output distribution changes. Single-feature drift is informational, not actionable.
- Not logging model inputs and outputs. You cannot debug a production issue if you do not know what the model saw and what it predicted. Log everything (with appropriate PII protections).
- Delayed ground truth treated as no ground truth. Even if labels arrive weeks later, they are valuable for detecting model degradation. Build a feedback pipeline that joins predictions with delayed labels.
- Ignoring training/serving skew. The most insidious ML bug: the model works perfectly in evaluation but fails in production because features are computed differently. A feature store or shared computation code prevents this.
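
To make the logging pitfall concrete, here is a minimal sketch of a structured prediction record that hashes fields marked as PII before they are written. The field names, the `pii_fields` parameter, and the use of an unsalted SHA-256 hash are illustrative assumptions, not a compliance recommendation.

```python
import hashlib
import json
from datetime import datetime, timezone

def build_prediction_record(request_id, features, prediction, confidence,
                            model_version, pii_fields=()):
    """Build a JSON-serializable log record, hashing fields listed as PII."""
    logged_features = {}
    for name, value in features.items():
        if name in pii_fields:
            # One-way hash: values stay joinable without being stored in the clear
            logged_features[name] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        else:
            logged_features[name] = value
    return {
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "features": logged_features,
        "prediction": prediction,
        "confidence": confidence,
    }

record = build_prediction_record(
    request_id="r1",
    features={"email": "user@example.com", "purchase_count_30d": 3},
    prediction=1,
    confidence=0.87,
    model_version="v1",  # hypothetical version label
    pii_fields=("email",),
)
log_line = json.dumps(record)
```

Including the request identifier and model version in every record is what later makes it possible to join delayed labels back to predictions and to tell which model produced them. Plain hashing of low-entropy values can be reversed by brute force, so treat it as a placeholder for whatever your privacy requirements call for.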

Key Takeaways

- ML monitoring has four layers: infrastructure, data (input drift), predictions (output drift), and outcomes (ground truth comparison). Most teams only implement the first layer.
- Data drift detection is a high-leverage monitoring investment. It needs no labels, so it can flag problems long before ground truth or business metrics reveal them.
- Feature stores enforce consistency between training and serving, eliminating an entire class of silent production bugs.
- Log model inputs, outputs, and confidence scores. You cannot debug what you did not record.
- Set tiered alerts. Not every drift signal needs to wake someone up. Reserve paging for model collapse and severe metric drops.
- Ground truth often arrives with a delay, but it is still valuable when it does. Build the pipeline to join it back to predictions as it arrives.