Regression Testing for ML

Overview

Your new model is better on average. The aggregate metrics improved across the board. You deploy it, and within hours, a specific customer segment reports that the system is broken for them. The new model handles 95% of cases better but handles 5% of cases significantly worse. Nobody noticed because the 5% was invisible in the averages.

ML regression testing exists to prevent this. It is the practice of maintaining a curated set of inputs with known expected outputs, running every model change against that set, and blocking deployment when important cases degrade. If traditional software testing asks "does the code still work?", ML regression testing asks "does the model still work for the cases we know matter?"

Why ML Needs Regression Testing

The Average Hides the Worst Case

def demonstrate_hidden_regression():
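    """
    Illustrative numbers. The slices overlap and are not exhaustive,
    so they do not have to average out to the overall figure.

    Model v1: 90% accuracy overall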
      - English inputs:   92% accuracy (8,000 examples)
      - Spanish inputs:   78% accuracy (1,500 examples)
      - Short queries:    88% accuracy (3,000 examples)
      - Long queries:     91% accuracy (5,500 examples)
    
    Model v2: 91% accuracy overall (improvement!)
      - English inputs:   93% accuracy (+1, good)
      - Spanish inputs:   71% accuracy (-7, disaster)
      - Short queries:    94% accuracy (+6, great)
      - Long queries:     90% accuracy (-1, minor)
    
    Overall accuracy went up. But Spanish-speaking users
    are now having a significantly worse experience.
    Without regression tests for Spanish inputs,
    this ships unnoticed.
    """
    pass
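The comparison above can be automated by computing accuracy per slice and flagging any slice that drops by more than a tolerance. The following is a minimal sketch, not a definitive implementation: the record format (`slice`, `correct`), the helper names, the 2-point tolerance, and the toy counts are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """records: iterable of dicts with 'slice' (str) and 'correct' (bool)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record["slice"]] += 1
        hits[record["slice"]] += int(record["correct"])
    return {name: hits[name] / totals[name] for name in totals}


def find_slice_regressions(old_records, new_records, max_drop=0.02):
    """Return {slice: accuracy drop} for slices that got worse by more than max_drop."""
    old = accuracy_by_slice(old_records)
    new = accuracy_by_slice(new_records)
    regressions = {}
    for name, old_acc in old.items():
        if name not in new:
            continue  # slice missing from the new run; handle separately
        drop = old_acc - new[name]
        if drop > max_drop:
            regressions[name] = round(drop, 4)
    return regressions


def make_records(slice_name, n_total, n_correct):
    """Hypothetical helper that fabricates toy evaluation records."""
    return [{"slice": slice_name, "correct": i < n_correct} for i in range(n_total)]


# Toy data, invented for this example only.
old_records = make_records("english", 80, 72) + make_records("spanish", 20, 16)
new_records = make_records("english", 80, 77) + make_records("spanish", 20, 13)

overall_old = sum(r["correct"] for r in old_records) / len(old_records)
overall_new = sum(r["correct"] for r in new_records) / len(new_records)
regressions = find_slice_regressions(old_records, new_records)
```

In this toy data the overall accuracy rises while the Spanish slice falls, so `regressions` contains only the Spanish slice even though the aggregate looks like a win.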

ML Changes Are Unpredictable

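In well-designed traditional software, changing function A rarely breaks an unrelated function B, because the dependencies between them are explicit. In ML, every prediction depends on the same learned parameters, so changing the training data or hyperparameters can shift model behavior on inputs that have nothing to do with the change: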

Traditional software:
  Change: Fix bug in payment processing
  Impact: Payment processing works better
  Other systems: Unaffected

ML:
  Change: Add 5,000 new training examples
  Impact: Overall accuracy improves by 2%
  Surprise: Model now classifies "urgent" emails differently
  Surprise: Latency increased because the model now generates
            longer responses for certain input types

Building a Regression Test Suite

Golden Datasets

A golden dataset is a curated set of examples with verified correct outputs:

def build_golden_dataset():
    """Build a golden dataset for regression testing.
    
    Include:
    - Common cases (the bread and butter)
    - Edge cases (known tricky inputs)
    - Bug fixes (inputs that caused past failures)
    - Critical user flows (high-value scenarios)
    """
    golden_examples = [
        # Common cases
        {
            "id": "common_001",
            "input": "What is your return policy?",
            "expected_output_contains": ["30 days", "refund"],
            "category": "common",
            "priority": "high",
        },
        # Edge case: empty-ish input
        {
            "id": "edge_001",
            "input": "???",
            "expected_behavior": "asks_for_clarification",
            "category": "edge_case",
            "priority": "medium",
        },
        # Past bug: model used to hallucinate a phone number
        {
            "id": "bugfix_001",
            "input": "What is your support phone number?",
            "expected_output_not_contains": ["555-", "1-800"],
            "expected_behavior": "directs_to_help_page",
            "category": "bugfix",
            "priority": "critical",
        },
        # Critical flow: billing question
        {
            "id": "critical_001",
            "input": "I was charged twice for my subscription",
            "expected_output_contains": ["sorry", "billing", "refund"],
            "expected_behavior": "routes_to_billing",
            "category": "critical_flow",
            "priority": "critical",
        },
    ]
    
    return golden_examples

Assertion Types for ML

ML assertions are softer than traditional software assertions. Common types:

  • output_contains: Check that output includes expected terms
  • output_not_contains: Check that output excludes forbidden terms (e.g., hallucinated phone numbers)
  • output_length_in_range: Verify word count is within bounds
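  • semantic_similarity_above_threshold: Check that output is semantically similar to a reference. A cosine-similarity threshold around 0.8 is a common starting point, but the right value depends on the embedding model and should be calibrated on known good and bad outputs
  • classification_matches: Verify that the predicted class matches the expected label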
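One possible shape for these assertions is a small helper class whose methods each return a dict with a `passed` flag plus diagnostic details. This is a sketch under assumptions: the class name `MLAssertions`, the result keys, and the case-insensitive substring matching are illustrative choices, and the similarity check takes precomputed vectors rather than calling any particular embedding model.

```python
import math


class MLAssertions:
    """Soft assertions; each returns a dict with a boolean "passed" key."""

    @staticmethod
    def output_contains(output, required_terms):
        text = output.lower()
        missing = [t for t in required_terms if t.lower() not in text]
        return {"passed": not missing, "missing_terms": missing}

    @staticmethod
    def output_not_contains(output, forbidden_terms):
        text = output.lower()
        found = [t for t in forbidden_terms if t.lower() in text]
        return {"passed": not found, "found_forbidden": found}

    @staticmethod
    def output_length_in_range(output, min_words, max_words):
        word_count = len(output.split())
        return {"passed": min_words <= word_count <= max_words,
                "word_count": word_count}

    @staticmethod
    def semantic_similarity_above_threshold(vec_a, vec_b, threshold=0.8):
        # Cosine similarity on plain sequences of floats.
        dot = sum(x * y for x, y in zip(vec_a, vec_b))
        norm = (math.sqrt(sum(x * x for x in vec_a))
                * math.sqrt(sum(y * y for y in vec_b)))
        similarity = dot / norm if norm else 0.0
        return {"passed": similarity >= threshold, "similarity": similarity}

    @staticmethod
    def classification_matches(predicted_label, expected_label):
        return {"passed": predicted_label == expected_label,
                "predicted": predicted_label, "expected": expected_label}


# Usage with made-up outputs.
reply = "You can return items within 30 days for a full refund."
contains_check = MLAssertions.output_contains(reply, ["30 days", "refund"])
forbidden_check = MLAssertions.output_not_contains(
    "Please call 1-800-000-0000.", ["555-", "1-800"]
)
```

Returning details alongside `passed` keeps failure reports readable: a failed check says which terms were missing or found, not only that it failed.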

Running the Test Suite

def run_regression_tests(model, golden_dataset):
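    """Run all regression tests and report results.

    Assumes `model.generate(text)` returns a string and that an
    `MLAssertions` helper provides the checks, each returning a dict
    with a "passed" key. Only the contains / not-contains assertions
    are evaluated here; `expected_behavior` needs its own check.
    """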
    results = {
        "total": 0,
        "passed": 0,
        "failed": 0,
        "failures": [],
    }
    
    for example in golden_dataset:
        results["total"] += 1
        output = model.generate(example["input"])
        
        test_passed = True
        failure_reasons = []
        
        # Run all applicable assertions
        if "expected_output_contains" in example:
            check = MLAssertions.output_contains(
                output, example["expected_output_contains"]
            )
            if not check["passed"]:
                test_passed = False
                failure_reasons.append(
                    f"Missing terms: {check['missing_terms']}"
                )
        
        if "expected_output_not_contains" in example:
            check = MLAssertions.output_not_contains(
                output, example["expected_output_not_contains"]
            )
            if not check["passed"]:
                test_passed = False
                failure_reasons.append(
                    f"Found forbidden: {check['found_forbidden']}"
                )
        
        if test_passed:
            results["passed"] += 1
        else:
            results["failed"] += 1
            results["failures"].append({
                "id": example["id"],
                "input": example["input"],
                "output": output[:200],
                "reasons": failure_reasons,
                "priority": example.get("priority", "medium"),
            })
    
    return results
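A results dict of this shape can then feed a deployment gate that blocks on high-priority failures and only warns on the rest. The sketch below assumes the `failures` entries carry `id` and `priority` as above; the function name, the choice of `critical` as the only blocking priority, and the 5% cap on lower-priority failures are illustrative assumptions.

```python
def deployment_decision(results, blocking_priorities=("critical",),
                        max_nonblocking_failure_rate=0.05):
    """Decide whether a model may ship, based on regression test results."""
    failures = results["failures"]
    blocking = [f["id"] for f in failures
                if f["priority"] in blocking_priorities]
    warnings = [f["id"] for f in failures
                if f["priority"] not in blocking_priorities]

    total = results["total"]
    warning_rate = len(warnings) / total if total else 0.0

    if blocking:
        reason = "blocking-priority test failed"
    elif warning_rate > max_nonblocking_failure_rate:
        reason = "too many lower-priority failures"
    else:
        reason = None

    return {
        "deploy": reason is None,
        "reason": reason,
        "blocking_ids": blocking,
        "warning_ids": warnings,
    }


# Made-up results for illustration.
blocked = deployment_decision({
    "total": 4,
    "failures": [
        {"id": "bugfix_001", "priority": "critical"},
        {"id": "edge_001", "priority": "medium"},
    ],
})
allowed = deployment_decision({
    "total": 100,
    "failures": [{"id": "edge_001", "priority": "medium"}],
})
```

Keeping the warning IDs in the result means minor regressions are still visible in the report even when they do not block the release.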

Snapshot Testing for Embeddings

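When your embedding model changes, every downstream system that relies on vector similarity (search, clustering, retrieval) can break, because vectors from the old and new models are generally not comparable: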

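import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embedding_regression_test(
    new_model, reference_embeddings, threshold=0.95
):
    """Test that a new embedding model keeps known pairs where expected.
    
    The goal: if 'cat' was closer to 'dog' than to 'car' in the old
    model, it should still be closer in the new model. The absolute
    vectors can change, but the relative ordering should not.
    
    This version approximates that with fixed thresholds: similar
    pairs must score >= threshold, dissimilar pairs must score <= 0.5.
    Calibrate both values on the old model's scores; they are
    model-dependent. `reference_embeddings` is not used by this check.
    """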
    test_pairs = [
        # (text_a, text_b) pairs that should be similar
        ("machine learning", "deep learning"),
        ("python programming", "software development"),
        ("customer complaint", "user dissatisfied"),
    ]
    
    dissimilar_pairs = [
        # (text_a, text_b) pairs that should be dissimilar
        ("machine learning", "pizza recipe"),
        ("python programming", "ocean waves"),
    ]
    
    failures = []
    
    # Check that similar pairs remain similar
    for text_a, text_b in test_pairs:
        emb_a = new_model.encode(text_a)
        emb_b = new_model.encode(text_b)
        similarity = cosine_similarity(emb_a, emb_b)
        
        if similarity < threshold:
            failures.append({
                "pair": (text_a, text_b),
                "similarity": similarity,
                "expected": f">= {threshold}",
                "type": "similar_pair_diverged",
            })
    
    # Check that dissimilar pairs remain dissimilar
    for text_a, text_b in dissimilar_pairs:
        emb_a = new_model.encode(text_a)
        emb_b = new_model.encode(text_b)
        similarity = cosine_similarity(emb_a, emb_b)
        
        if similarity > 0.5:
            failures.append({
                "pair": (text_a, text_b),
                "similarity": similarity,
                "expected": "< 0.5",
                "type": "dissimilar_pair_converged",
            })
    
    return {
        "passed": len(failures) == 0,
        "failures": failures,
    }
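Fixed thresholds are sensitive to the score scale of each embedding model. A complementary check tests the relative ordering directly with (anchor, closer, farther) triplets. The following is a sketch under assumptions: the function name, the triplet format, and the `ToyModel` stand-in with hand-written 2-D vectors are invented for illustration, and any object with an `encode(text)` method returning a vector would work in its place.

```python
import numpy as np


def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def relative_order_test(model, triplets):
    """Each triplet is (anchor, closer, farther): the anchor must be
    more similar to `closer` than to `farther`."""
    failures = []
    for anchor, closer, farther in triplets:
        emb_anchor = model.encode(anchor)
        sim_close = cosine_similarity(emb_anchor, model.encode(closer))
        sim_far = cosine_similarity(emb_anchor, model.encode(farther))
        if sim_close <= sim_far:
            failures.append({
                "triplet": (anchor, closer, farther),
                "sim_closer": sim_close,
                "sim_farther": sim_far,
            })
    return {"passed": not failures, "failures": failures}


class ToyModel:
    """Hypothetical stand-in for an embedding model (lookup table)."""

    def __init__(self, table):
        self.table = table

    def encode(self, text):
        return np.array(self.table[text], dtype=float)


triplets = [("cat", "dog", "car")]
good_model = ToyModel({"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.1, 1.0]})
bad_model = ToyModel({"cat": [1.0, 0.1], "dog": [0.1, 1.0], "car": [0.9, 0.2]})

good_result = relative_order_test(good_model, triplets)
bad_result = relative_order_test(bad_model, triplets)
```

Because only the ordering is compared, this check does not need recalibration when a new model compresses or stretches its similarity scores.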

CI/CD for ML

The ML Testing Pipeline

A code change or a data change triggers the pipeline:

Stage 1: Unit tests (seconds)
  - Input validation, output format, deterministic logic

Stage 2: Regression tests (minutes)
  - Run golden dataset, block on critical failures

Stage 3: Evaluation suite (minutes to hours)
  - Full offline metrics, slice-based, comparison to production

Stage 4: Shadow deployment (hours to days)
  - Run alongside production, compare outputs, no user impact

Stage 5: Canary deployment (hours to days)
  - 5% of traffic, monitor online metrics, auto-rollback
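The auto-rollback step in Stage 5 can be as simple as comparing a success metric between production and the canary. This is a deliberately naive sketch, assuming each side reports `successes` and `requests` counters; the function name, the 5% relative-drop limit, and the 500-request minimum are hypothetical, and a real system would likely use a proper statistical test rather than a raw comparison.

```python
def canary_decision(baseline, canary, max_relative_drop=0.05, min_requests=500):
    """Return "wait", "rollback", or "continue" for a canary deployment.

    baseline / canary: dicts with integer "successes" and "requests".
    """
    # Not enough traffic yet to judge the canary either way.
    if canary["requests"] < min_requests or baseline["requests"] == 0:
        return "wait"

    baseline_rate = baseline["successes"] / baseline["requests"]
    canary_rate = canary["successes"] / canary["requests"]
    if baseline_rate == 0:
        return "wait"  # no meaningful baseline to compare against

    relative_drop = (baseline_rate - canary_rate) / baseline_rate
    return "rollback" if relative_drop > max_relative_drop else "continue"


# Made-up counters for illustration.
production = {"successes": 9500, "requests": 10000}
healthy = canary_decision(production, {"successes": 475, "requests": 500})
degraded = canary_decision(production, {"successes": 430, "requests": 500})
too_early = canary_decision(production, {"successes": 80, "requests": 100})
```

The minimum-request guard matters: with very little canary traffic, a handful of failures can look like a large drop and trigger a spurious rollback.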

Growing Your Test Suite

Every production incident becomes a regression test. When a model fails, add the failing input to the golden dataset with priority: "critical". This ensures that a recurrence of the same failure is caught before deployment.
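Turning an incident into a test case can be a few lines of code. The sketch below reuses the golden example fields shown earlier; the function names, the `incident_` ID prefix, and the rule that every incident needs at least one assertion are assumptions for illustration.

```python
def incident_to_golden_example(incident_id, failing_input,
                               required_terms=(), forbidden_terms=()):
    """Build a critical-priority golden example from a production incident."""
    if not required_terms and not forbidden_terms:
        raise ValueError("an incident test needs at least one assertion")

    example = {
        "id": f"incident_{incident_id}",
        "input": failing_input,
        "category": "bugfix",
        "priority": "critical",
    }
    if required_terms:
        example["expected_output_contains"] = list(required_terms)
    if forbidden_terms:
        example["expected_output_not_contains"] = list(forbidden_terms)
    return example


def add_to_golden_dataset(golden_dataset, example):
    """Append the example, refusing duplicate IDs."""
    if any(existing["id"] == example["id"] for existing in golden_dataset):
        raise ValueError(f"duplicate test id: {example['id']}")
    golden_dataset.append(example)
    return golden_dataset


golden_dataset = []
new_example = incident_to_golden_example(
    "42",  # hypothetical incident number
    "What is your support phone number?",
    forbidden_terms=["555-", "1-800"],
)
add_to_golden_dataset(golden_dataset, new_example)
```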

Review the test suite monthly: remove irrelevant tests, add tests for new features, update expected outputs when requirements change, and verify the suite catches known regressions by running against an intentionally degraded model.
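The last check, running the suite against an intentionally degraded model, can itself be automated. In this sketch the degraded model simply returns a canned unhelpful answer; the class and function names, the simplified contains-only runner, and the 90% minimum failure rate are all assumptions for illustration.

```python
class DegradedModel:
    """Deliberately bad model: returns the same unhelpful answer every time."""

    def generate(self, text):
        return "I cannot help with that."


def run_contains_suite(model, golden_dataset):
    """Simplified runner: returns the IDs of examples whose required terms are missing."""
    failed_ids = []
    for example in golden_dataset:
        output = model.generate(example["input"]).lower()
        required = example.get("expected_output_contains", [])
        if any(term.lower() not in output for term in required):
            failed_ids.append(example["id"])
    return failed_ids


def suite_detects_degradation(golden_dataset, min_failure_rate=0.9):
    """A healthy suite should fail most tests when given a degraded model."""
    if not golden_dataset:
        return False
    failed = run_contains_suite(DegradedModel(), golden_dataset)
    return len(failed) / len(golden_dataset) >= min_failure_rate


strong_suite = [
    {"id": "common_001", "input": "What is your return policy?",
     "expected_output_contains": ["30 days", "refund"]},
    {"id": "critical_001", "input": "I was charged twice for my subscription",
     "expected_output_contains": ["billing"]},
]
# A test with no assertions passes for any model, so it detects nothing.
weak_suite = [{"id": "weak_001", "input": "Hello", "expected_output_contains": []}]

strong_ok = suite_detects_degradation(strong_suite)
weak_ok = suite_detects_degradation(weak_suite)
```

A suite that still passes against the degraded model is not testing anything useful and should be reviewed before it is trusted as a deployment gate.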

Real-World Example: Deploying a New Summarization Model

A team upgrades their document summarization model from v3 to v4.

Step 1: Run the golden dataset (200 documents with expert-written reference summaries). Results: 195 pass, 5 fail. All 5 failures are on legal documents where the new model omits key clauses.

Step 2: Investigate the failures. The new model was trained on more general text and has less exposure to legal language. The team adds 50 legal documents to the training data and retrains.

Step 3: Rerun. All 200 pass. Run full evaluation: ROUGE scores improve on average. Slice-based evaluation shows improvement across all categories including legal.

Step 4: Shadow deployment for 48 hours. Compare v3 and v4 outputs on live traffic. No anomalies detected.

Step 5: Canary deployment at 10%. Monitor acceptance rate (do users edit the summaries?). V4 acceptance rate is 2% higher than v3. Roll out to 100%.

Step 6: Keep the 5 legal documents that failed in the golden dataset permanently and mark them as critical. Future model changes will be tested against them.

Common Pitfalls

  • No regression test suite at all: Many ML teams ship model updates without any regression testing. This is the equivalent of deploying code without running tests.
  • Testing only on aggregate metrics: If you only check overall accuracy, you will miss regressions on specific subgroups that matter.
  • Stale golden datasets: If your golden dataset does not evolve with your product, it stops catching relevant failures. Review and update it regularly.
  • Flaky tests due to non-determinism: LLM outputs are non-deterministic. Use temperature=0 for regression tests to reduce variation (it does not always eliminate it), and prefer assertion types that tolerate variation (semantic similarity, contains-check instead of exact match).
  • Blocking on every failure: Not all failures are equal. A critical flow regression should block deployment. A minor edge case regression might be acceptable. Use priority levels.
  • Manual regression testing: If humans have to run the tests manually, they will eventually skip them. Automate regression tests in CI/CD.
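For the flaky-test pitfall, one additional option is to sample the model several times and require a minimum pass rate instead of a single pass. This is a sketch under assumptions: the function names, the five runs, and the 0.8 pass-rate floor are illustrative, and the fake generator stands in for a real model call.

```python
from itertools import cycle


def pass_rate(generate, prompt, check, n_runs=5):
    """Fraction of n_runs generations for which check(output) is true."""
    passes = sum(1 for _ in range(n_runs) if check(generate(prompt)))
    return passes / n_runs


def is_stable_pass(generate, prompt, check, n_runs=5, min_pass_rate=0.8):
    """Pass only if the check holds in at least min_pass_rate of the runs."""
    return pass_rate(generate, prompt, check, n_runs) >= min_pass_rate


# Fake non-deterministic model: cycles through canned replies.
_replies = cycle([
    "You can get a refund within 30 days.",
    "Refunds are available for 30 days after purchase.",
    "Please see our returns page.",
    "We offer a full refund within 30 days.",
    "A refund is possible within 30 days.",
])


def fake_generate(prompt):
    return next(_replies)


def mentions_refund(output):
    return "refund" in output.lower()


rate = pass_rate(fake_generate, "What is your return policy?", mentions_refund)
stable = is_stable_pass(fake_generate, "What is your return policy?", mentions_refund)
strict = is_stable_pass(fake_generate, "What is your return policy?",
                        mentions_refund, min_pass_rate=1.0)
```

Repeated sampling multiplies the cost of the suite by the number of runs, so it is usually worth reserving for high-priority tests.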

Key Takeaways

  • ML regression testing prevents the "better on average, worse for you" problem. Every model change should be tested against a curated set of known-good examples.
  • Build a golden dataset that includes common cases, edge cases, past bug fixes, and critical user flows. Grow it from every production incident.
  • Use soft assertions for ML: semantic similarity, contains-checks, classification matching. Exact match is too brittle for non-deterministic outputs.
  • Embed regression tests in CI/CD. Block deployment on critical failures. Warn on minor degradations.
  • Test embeddings separately. A new embedding model can silently break all downstream systems that depend on vector similarity.
  • Auto-rollback is your safety net. Monitor online metrics and revert automatically when they cross thresholds.