Bias & Dataset Versioning

Overview

Every dataset has bias. This is not a moral judgment but a statistical reality. The data you collect reflects who collected it, how it was collected, and what was available at the time. A hiring dataset built from historical decisions encodes past discrimination. A medical dataset from one hospital reflects that hospital's patient demographics. A language dataset from the internet overrepresents English speakers with internet access.

Acknowledging bias is the first step. Measuring it is the second. Versioning your datasets so you can trace which data produced which model is the third. Together, these practices make the difference between ML systems that fail silently and systems you can debug, improve, and trust.

Types of Dataset Bias

Selection Bias

The data you have is not representative of the data you will encounter:

# Example: selection bias in a product review dataset
def demonstrate_selection_bias():
    """
    Problem: Your training data comes from users who voluntarily
    left reviews. But most users don't leave reviews.
    
    Users who leave reviews tend to have extreme opinions:
    either very happy or very unhappy.
    
    Result: Your sentiment model thinks everything is either
    great or terrible. It struggles with neutral opinions.
    """
    review_distribution = {
        "1 star": 0.25,  # Overrepresented (angry users review)
        "2 star": 0.05,
        "3 star": 0.05,  # Underrepresented (neutral users don't bother)
        "4 star": 0.10,
        "5 star": 0.55,  # Overrepresented (happy users review)
    }
    
    actual_satisfaction = {
        "1 star": 0.05,
        "2 star": 0.10,
        "3 star": 0.40,  # Most users are actually neutral
        "4 star": 0.30,
        "5 star": 0.15,
    }
    
    return review_distribution, actual_satisfaction
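
One way to put a number on the gap between the two distributions above is the total variation distance, alongside per-class importance weights (true share divided by observed share). This is a minimal sketch that reuses the illustrative numbers from the example. The function names are hypothetical, and reweighting is only one common correction, not something the example itself prescribes.

```python
def total_variation_distance(p, q):
    """Half the sum of absolute differences between two distributions."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)


def importance_weights(observed, target):
    """Weight each class by target share / observed share."""
    return {k: target[k] / observed[k] for k in observed}


# Illustrative numbers copied from demonstrate_selection_bias()
review_distribution = {
    "1 star": 0.25, "2 star": 0.05, "3 star": 0.05,
    "4 star": 0.10, "5 star": 0.55,
}
actual_satisfaction = {
    "1 star": 0.05, "2 star": 0.10, "3 star": 0.40,
    "4 star": 0.30, "5 star": 0.15,
}

tvd = total_variation_distance(review_distribution, actual_satisfaction)
weights = importance_weights(review_distribution, actual_satisfaction)

print(f"Total variation distance: {tvd:.2f}")
for rating, weight in weights.items():
    print(f"{rating}: weight {weight:.2f}")
```

A weight above 1 marks a class the reviews underrepresent (here, 3-star opinions), and a weight below 1 marks one they overrepresent. In practice the true distribution is rarely known and has to be estimated, for example from a random survey.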

Measurement Bias

The way you measure or label data introduces systematic errors:

# Example: measurement bias in labeling
def demonstrate_measurement_bias():
    """
    Problem: Different annotators have different thresholds
    for what counts as "toxic" content.
    
    Annotator A flags anything mildly negative as toxic.
    Annotator B only flags explicit hate speech.
    
    If annotator A labeled most of your training data,
    your model will be overly sensitive.
    """
    # Same comment, different annotators
    comment = "This product is a complete waste of money"
    
    annotator_labels = {
        "annotator_A": "toxic",       # Strict threshold
        "annotator_B": "not_toxic",   # Lenient threshold
        "annotator_C": "not_toxic",   # Lenient threshold
    }
    
    # Majority vote: not_toxic (2 vs 1)
    # But if annotator_A labeled 80% of the dataset alone,
    # the model learns A's threshold, not the majority's
    return annotator_labels
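
Disagreement like this can be measured before training by having annotators label an overlapping sample and computing an agreement score. The sketch below uses Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The labels are hypothetical and the helper is written from scratch for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(labels_a) | set(labels_b)
    )
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on the same 8 comments
annotator_a = ["toxic", "toxic", "toxic", "not_toxic",
               "toxic", "not_toxic", "toxic", "toxic"]
annotator_b = ["toxic", "not_toxic", "not_toxic", "not_toxic",
               "toxic", "not_toxic", "not_toxic", "toxic"]

kappa = cohens_kappa(annotator_a, annotator_b)
toxic_rate_a = annotator_a.count("toxic") / len(annotator_a)
toxic_rate_b = annotator_b.count("toxic") / len(annotator_b)

print(f"Cohen's kappa: {kappa:.2f}")
print(f"Toxic rate, A: {toxic_rate_a:.2f}  B: {toxic_rate_b:.2f}")
```

A low kappa combined with very different per-annotator toxic rates suggests the annotators are applying different thresholds, which is worth resolving in the labeling guidelines before one annotator's threshold dominates the dataset.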

Historical Bias

The data reflects past practices, including past discrimination:

Examples of historical bias in datasets:

Hiring data:
  If a company historically hired fewer women for engineering roles,
  a model trained on that data will score women lower.
  The model is "accurate" -- it predicts past behavior.
  But past behavior was biased.

Loan approval data:
  If certain zip codes were historically redlined,
  a model trained on approval/denial data will learn
  to deny applicants from those zip codes.

Medical data:
  If clinical trials historically underrepresented minorities,
  a model trained on trial data will be less accurate
  for minority patients.

Survivorship Bias

You only see data from entities that survived a process. If you train a model to predict startup success using data from existing companies, you have no data on the thousands that failed. The model learns what survivors look like, not what distinguishes success from failure.

Detecting Bias

Demographic Analysis

Examine how your data breaks down across relevant groups:

import pandas as pd

def analyze_dataset_demographics(df, label_col, demographic_cols):
    """Check if labels are distributed evenly across demographic groups.
    
    If approval rates differ significantly between groups,
    that is a signal of potential bias.
    """
    report = {}
    
    for demo_col in demographic_cols:
        group_stats = df.groupby(demo_col)[label_col].agg([
            "count", "mean", "std"
        ])
        
        # Check for representation imbalance
        total = group_stats["count"].sum()
        group_stats["representation"] = group_stats["count"] / total
        
        # Check for outcome imbalance
        overall_mean = df[label_col].mean()
        group_stats["deviation_from_mean"] = (
            group_stats["mean"] - overall_mean
        )
        
        report[demo_col] = group_stats
        
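        # Flag groups whose mean outcome is far from the overall mean
        # (the 0.1 threshold assumes a binary 0/1 label)
        max_deviation = group_stats["deviation_from_mean"].abs().max()
        if max_deviation > 0.1:
            print(f"Warning: a {demo_col} group deviates from the overall "
                  f"mean outcome by {max_deviation * 100:.1f} percentage points")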
    
    return report
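
A fixed deviation threshold ignores group size: a 10-point gap means little for a group of 20 and a lot for a group of 20,000. One way to account for that is a two-proportion z-test on the rates of two groups. This is a sketch using only the standard library, with hypothetical counts. It relies on the normal approximation, so it is only reasonable when each group has a fair number of examples.

```python
import math


def two_proportion_z_test(positives_a, n_a, positives_b, n_b):
    """Two-sided z-test for a difference between two group rates.

    Returns (z, p_value).
    """
    rate_a = positives_a / n_a
    rate_b = positives_b / n_b
    pooled = (positives_a + positives_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # every label is identical, nothing to test
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value under the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 60 of 100 approved in group A, 45 of 100 in group B
z, p_value = two_proportion_z_test(60, 100, 45, 100)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A small p-value says the difference is unlikely to be sampling noise. It does not say why the rates differ, so it is a prompt for investigation rather than proof of bias.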

Slice-Based Evaluation

Don't just measure overall performance. Break it down by subgroups:

def slice_based_evaluation(model, test_data, slicing_functions, evaluate):
    """Evaluate model performance on specific data slices.
    
    `evaluate(model, data)` is supplied by the caller and must return
    a dict with "accuracy" and "f1" keys.
    
    Overall accuracy can hide poor performance on minorities.
    A model with 95% accuracy overall might have 60% accuracy
    on a specific subgroup.
    """
    overall_metrics = evaluate(model, test_data)
    print(f"Overall accuracy: {overall_metrics['accuracy']:.3f}")
    print(f"Overall F1:       {overall_metrics['f1']:.3f}")
    print()
    
    for slice_name, slice_fn in slicing_functions.items():
        slice_data = [d for d in test_data if slice_fn(d)]
        
        if len(slice_data) < 30:
            print(f"Slice '{slice_name}': too few examples "
                  f"({len(slice_data)}) for reliable metrics")
            continue
        
        slice_metrics = evaluate(model, slice_data)
        
        gap = overall_metrics["accuracy"] - slice_metrics["accuracy"]
        flag = " *** ALERT ***" if gap > 0.1 else ""
        
        print(f"Slice '{slice_name}' (n={len(slice_data)}):")
        print(f"  Accuracy: {slice_metrics['accuracy']:.3f} "
              f"(gap: {gap:+.3f}){flag}")
        print(f"  F1:       {slice_metrics['f1']:.3f}")

# Define slicing functions
slices = {
    "short_text": lambda d: len(d["text"].split()) < 10,
    "long_text": lambda d: len(d["text"].split()) > 100,
    # Match whole tokens so "know" or "nothing" don't count as negations
    "contains_negation": lambda d: any(
        w in d["text"].lower().split()
        for w in ["not", "no", "never", "don't"]
    ),
    # Non-ASCII characters are only a rough proxy for non-English text
    "non_ascii_characters": lambda d: any(
        ord(c) > 127 for c in d["text"]
    ),
}
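
`slice_based_evaluation` relies on an `evaluate(model, data)` callable that returns a dict with "accuracy" and "f1". The text does not define it, so here is one minimal way to write it for a binary classifier. The interface is an assumption: `model` is a callable that maps an example dict to 0 or 1, and each example stores its true class under a "label" key.

```python
def evaluate(model, data):
    """Return accuracy and F1 (for the positive class) on `data`.

    Assumptions, not from the text: `model` is a callable taking one
    example dict and returning 0 or 1; examples have a "label" key.
    """
    if not data:
        raise ValueError("cannot evaluate on an empty dataset")
    tp = fp = fn = correct = 0
    for example in data:
        predicted, actual = model(example), example["label"]
        correct += predicted == actual
        tp += predicted == 1 and actual == 1
        fp += predicted == 1 and actual == 0
        fn += predicted == 0 and actual == 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(data), "f1": f1}


# Toy keyword "model" and hypothetical examples, for illustration only
toy_model = lambda d: 1 if "good" in d["text"] else 0
toy_data = [
    {"text": "good product", "label": 1},
    {"text": "bad product", "label": 0},
    {"text": "works fine", "label": 1},
    {"text": "good value", "label": 1},
    {"text": "not good", "label": 0},
]
toy_metrics = evaluate(toy_model, toy_data)
print(toy_metrics)
```

With a real model you would typically replace this with your framework's metric functions. The only requirement is that the returned dict has the two keys the slicing code reads.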

Data Distribution Comparison

Compare your training data distribution to production data using statistical tests such as the two-sample Kolmogorov-Smirnov test, which applies to continuous numeric features. For each feature, check whether the training and production distributions differ significantly. If they do, your model may perform poorly in production even if it did well on the test set, because the test set usually comes from the same distribution as the training data.
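
The comparison described above can be sketched for a single numeric feature with only the standard library. The code computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) and compares it with the usual large-sample critical value. The function names and the sample data are hypothetical. If SciPy is available, `scipy.stats.ks_2samp` returns the statistic together with a p-value and is the more practical choice.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def has_drifted(train_values, prod_values, alpha=0.05):
    """True if the KS statistic exceeds the large-sample critical value."""
    n, m = len(train_values), len(prod_values)
    critical = (math.sqrt(-math.log(alpha / 2) / 2)
                * math.sqrt((n + m) / (n * m)))
    return ks_statistic(train_values, prod_values) > critical


# Hypothetical feature values: production is shifted well above training
train_feature = [float(v) for v in range(50)]
prod_feature = [float(v) for v in range(100, 150)]

print(has_drifted(train_feature, train_feature))
print(has_drifted(train_feature, prod_feature))
```

With very large samples even tiny shifts become statistically significant, so it helps to look at the size of the statistic as well as the yes/no answer.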

Dataset Versioning

Why Version Datasets

The same code with different data produces different models. If you cannot reproduce a model, you cannot debug it:

Model v2.3 is worse than v2.2. Why?

Without dataset versioning:
  "We changed the data at some point, but we don't know
   exactly what changed or when."

With dataset versioning:
  Model v2.2 used dataset v1.4.0 (hash: abc123)
  Model v2.3 used dataset v1.5.0 (hash: def456)
  Diff: 2,340 new examples added, 89 labels corrected,
  3 duplicate examples removed.
  Root cause: 200 of the new examples were mislabeled.
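
To make the idea of a dataset hash and a version diff concrete, here is a small sketch that fingerprints a dataset and summarizes what changed between two versions. It assumes a dataset small enough to hold in memory as a dict of example ID to label. The function names and example IDs are hypothetical, and tools such as DVC do this at the file level for you.

```python
import hashlib
import json


def dataset_fingerprint(examples):
    """Short content hash of a dataset given as {example_id: label}."""
    # sort_keys makes the hash independent of insertion order
    canonical = json.dumps(examples, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


def dataset_diff(old, new):
    """Summarize changes between two {example_id: label} versions."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "relabeled": sorted(
            k for k in old.keys() & new.keys() if old[k] != new[k]
        ),
    }


# Hypothetical dataset versions
v1 = {"ex1": "positive", "ex2": "negative", "ex3": "positive"}
v2 = {"ex1": "positive", "ex2": "positive", "ex4": "negative"}

diff = dataset_diff(v1, v2)
print(dataset_fingerprint(v1), dataset_fingerprint(v2))
print(diff)
```

Storing the fingerprint next to each trained model is what lets you later say which dataset version a model used, and the diff narrows down where to look when a new version performs worse.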

DVC (Data Version Control)

Git for data. Tracks data files alongside code:

# Initialize DVC in your git repo
dvc init

# Track a large data file
dvc add data/training_set.parquet

# This creates data/training_set.parquet.dvc (a small pointer file)
# The pointer file goes in git, the actual data goes in DVC storage

git add data/training_set.parquet.dvc data/.gitignore
git commit -m "Add training set v1.0"

# Push data to remote storage (S3, GCS, Azure, etc.)
dvc remote add -d myremote s3://my-bucket/dvc-store
dvc push

# Later: reproduce exact data state from any git commit
git checkout v2.2-release
dvc checkout  # Restores that commit's data from the local DVC cache
dvc pull      # Use instead if the data is not cached locally (fetches from the remote)

Git LFS

Simpler than DVC, works well for smaller datasets. Track data files by extension (git lfs track "*.parquet"), and the large files are stored in LFS rather than the git repo itself. Good for teams already comfortable with git workflows.

Choosing a Versioning Approach

Approach     Best for                    Complexity
git-lfs      Small datasets (<1 GB)      Low
DVC          Medium datasets (1-100 GB)  Medium
LakeFS       Large data lakes (>100 GB)  High
Manual       Don't do this               N/A

Reproducibility

Practical Reproducibility Checklist

For every model you ship to production, record:

Data:
  [ ] Exact dataset version (hash or version tag)
  [ ] Data preprocessing steps (code version)
  [ ] Train/validation/test split ratios and method

Code:
  [ ] Git commit hash
  [ ] All dependency versions (requirements.txt or lock file)
  [ ] Python/runtime version

Training:
  [ ] Random seed
  [ ] Hyperparameters
  [ ] Hardware (GPU type, number of GPUs)
  [ ] Training duration and final metrics

Evaluation:
  [ ] Test set version (hash or version tag)
  [ ] All evaluation metrics
  [ ] Slice-based metrics for key subgroups
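
Part of this checklist can be captured automatically at training time. The sketch below writes a run manifest using only the standard library. The field names and example values are hypothetical, and it covers only some checklist items: dependency versions, hardware, and metrics would need to be added from your own tooling.

```python
import json
import platform
import subprocess


def current_git_commit():
    """Return the current commit hash, or "unknown" outside a git repo."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def build_run_manifest(dataset_version, test_set_version, seed,
                       hyperparameters):
    """Collect the reproducibility facts for one training run."""
    return {
        "dataset_version": dataset_version,
        "test_set_version": test_set_version,
        "git_commit": current_git_commit(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "random_seed": seed,
        "hyperparameters": hyperparameters,
    }


# Hypothetical values for illustration
manifest = build_run_manifest(
    dataset_version="v1.5.0",
    test_set_version="v1.2.0",
    seed=42,
    hyperparameters={"learning_rate": 0.001, "epochs": 10},
)
print(json.dumps(manifest, indent=2))
```

Saving this JSON alongside the model artifact means the record travels with the model instead of living in someone's notes.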

Real-World Example: Detecting Bias in a Loan Approval Model

A bank builds a model to predict loan defaults. The model achieves 92% accuracy overall, but a fairness audit reveals problems.

Step 1: Demographic analysis. The training data contains 70% white applicants, 15% Hispanic, 10% Black, 5% other. The actual applicant pool is more evenly distributed. The model has less data to learn from for minority groups.

Step 2: Slice-based evaluation. Accuracy for white applicants is 94%. Accuracy for Black applicants is 81%. The model is significantly worse for underrepresented groups.

Step 3: Feature analysis. The model uses zip code as a feature. Zip codes correlate with race due to historical segregation. Removing zip code and using income-based features instead reduces the accuracy gap from 13 points to 4 points.

Step 4: Dataset versioning. The team versions the dataset, records the fairness metrics for each version, and establishes a rule: no model ships if the accuracy gap between any two demographic groups exceeds 5 percentage points.

Step 5: Ongoing monitoring. In production, the team monitors approval rates by demographic group weekly. If drift is detected, they investigate whether the data distribution has shifted.
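
The shipping rule from Step 4 is simple enough to automate as a release check. This is a minimal sketch with hypothetical function names. The "before" accuracies come from Step 2, while the "after" accuracies are made-up values chosen only to produce the 4-point gap mentioned in Step 3.

```python
def max_accuracy_gap(accuracy_by_group):
    """Largest accuracy difference between any two groups."""
    values = list(accuracy_by_group.values())
    return max(values) - min(values)


def passes_fairness_gate(accuracy_by_group, max_gap=0.05):
    """True if no pair of groups differs by more than max_gap."""
    # Small tolerance so a gap of exactly max_gap is not rejected
    # because of floating-point rounding
    return max_accuracy_gap(accuracy_by_group) <= max_gap + 1e-9


before = {"white": 0.94, "Black": 0.81}  # accuracies from Step 2
after = {"white": 0.92, "Black": 0.88}   # hypothetical, 4-point gap

for name, accuracies in [("before", before), ("after", after)]:
    gap = max_accuracy_gap(accuracies)
    verdict = "ship" if passes_fairness_gate(accuracies) else "block"
    print(f"{name}: gap {gap * 100:.0f} points -> {verdict}")
```

Running a check like this in the same pipeline that trains the model keeps the rule from depending on someone remembering to look at the fairness report.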

Common Pitfalls

  • Assuming your dataset is unbiased: Every dataset has bias. The question is not "is there bias?" but "what kind of bias, and how much does it matter for this use case?"
  • Measuring only overall metrics: A model with 95% overall accuracy can have 60% accuracy on a minority subgroup. Always evaluate by slice.
  • Overlooking proxies for protected attributes: Even if you remove race or gender from the feature set, proxies like zip code or name can encode the same information.
  • Not versioning datasets: When your model degrades, you need to know if the data changed. Without versioning, you are debugging blind.
  • Manual dataset management: Copying files to folders named "data_v2_final_FINAL" does not scale. Use proper versioning tools.
  • Treating reproducibility as optional: If you cannot reproduce a model, you cannot debug it, audit it, or explain it. Reproducibility is a requirement, not a nice-to-have.

Key Takeaways

  • Every dataset has bias: selection bias, measurement bias, historical bias, and survivorship bias. Acknowledge it, measure it, and mitigate it.
  • Slice-based evaluation is essential. Overall metrics hide disparities between subgroups. Break down performance by every relevant dimension.
  • Dataset versioning is as important as code versioning. Use DVC, git-lfs, or LakeFS depending on your data scale.
  • Reproducibility requires recording the exact dataset version, code commit, random seed, and hyperparameters for every model you train.
  • Bias detection is not a one-time audit. Monitor fairness metrics continuously in production, just as you monitor accuracy and latency.
  • The same data, code, random seed, and environment should produce the same model. If they do not, you have a reproducibility gap that will eventually cause problems.