Data Collection & Labeling

Overview

Data is usually the most expensive part of an ML project, ahead of compute, model architecture, and deployment infrastructure. Getting high-quality, representative training data often consumes more time and budget than everything else combined. Yet most teams underestimate this cost and jump straight to modeling.

The quality of your data puts a ceiling on the quality of your model. A simple model trained on excellent data will usually outperform a sophisticated model trained on noisy, biased, or insufficient data. An hour spent improving your data pipeline typically pays more dividends than an hour spent tuning hyperparameters.

Where to Get Training Data

Existing Databases

The cheapest data is data you already have. Most companies sit on valuable training data without realizing it:

# Mining existing databases for training data
import pandas as pd

def extract_training_data_from_logs(db_connection):
    """Extract labeled examples from existing user behavior.
    
    If users already categorize support tickets manually,
    those categorizations are free training labels.
    """
    query = """
        SELECT 
            ticket_text,
            agent_assigned_category,
            final_category,
            resolution_status
        FROM support_tickets
        WHERE agent_assigned_category IS NOT NULL
          AND resolution_status = 'resolved'
          AND created_at > '2024-01-01'
    """
    df = pd.read_sql(query, db_connection)
    
    # Filter for high-confidence labels
    # Only use tickets where the category wasn't changed later
    df = df[df["agent_assigned_category"] == df["final_category"]]
    
    return df

Sources to check first:

Application databases: User-generated categorizations, tags, ratings
Search logs: Queries paired with clicked results are implicit relevance labels
Customer support systems: Agent-categorized tickets, resolved issues
Content moderation queues: Already-reviewed content with accept/reject decisions

User Behavior Logs

Implicit signals from user behavior are abundant but noisy:

# Implicit labeling from user behavior
def extract_implicit_labels(click_logs):
    """Convert user clicks into training signals.
    
    Caveat: clicks indicate interest, not quality.
    A user clicking a search result doesn't mean
    the result was good -- they might have bounced
    immediately.
    """
    labeled_pairs = []
    for session in click_logs:
        query = session["query"]
        for result in session["results"]:
            # dwell_time is assumed to be measured in seconds
            if result["clicked"] and result["dwell_time"] > 30:
                # Clicked and stayed = positive signal
                labeled_pairs.append({
                    "query": query,
                    "document": result["doc_id"],
                    "label": "relevant",
                })
            elif not result["clicked"] and result["position"] <= 3:
                # Shown prominently but ignored = negative signal
                labeled_pairs.append({
                    "query": query,
                    "document": result["doc_id"],
                    "label": "not_relevant",
                })
    return labeled_pairs
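
Because a single session is a noisy signal, one option is to pool signals for the same query and document across sessions and keep only the pairs where most sessions agree. The sketch below assumes input in the format produced by extract_implicit_labels above; the function name and the min_votes and min_majority thresholds are hypothetical and would need tuning on real logs.

```python
from collections import defaultdict

def aggregate_implicit_labels(labeled_pairs, min_votes=3, min_majority=0.8):
    """Keep only (query, document) pairs with consistent signals across sessions.

    min_votes and min_majority are illustrative thresholds, not tuned values.
    """
    votes = defaultdict(lambda: {"relevant": 0, "not_relevant": 0})
    for pair in labeled_pairs:
        votes[(pair["query"], pair["document"])][pair["label"]] += 1

    consolidated = []
    for (query, document), counts in votes.items():
        total = counts["relevant"] + counts["not_relevant"]
        if total < min_votes:
            continue  # Too few sessions to trust the signal
        label = max(counts, key=counts.get)
        if counts[label] / total >= min_majority:
            consolidated.append({
                "query": query,
                "document": document,
                "label": label,
                "votes": total,
            })
    return consolidated

# Hypothetical input: doc_1 has a clear majority, doc_2 is split evenly
pairs = (
    [{"query": "reset password", "document": "doc_1", "label": "relevant"}] * 4
    + [{"query": "reset password", "document": "doc_1", "label": "not_relevant"}]
    + [{"query": "reset password", "document": "doc_2", "label": "relevant"}] * 2
    + [{"query": "reset password", "document": "doc_2", "label": "not_relevant"}] * 2
)
consolidated_pairs = aggregate_implicit_labels(pairs)
print(consolidated_pairs)
```

Pairs with split signals are dropped rather than guessed, which trades volume for cleaner labels.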

Public Datasets

Useful for prototyping and benchmarking, but rarely sufficient for production:

Hugging Face Datasets: Thousands of pre-labeled datasets across tasks
Kaggle: Competition datasets with known baselines
Government open data: Census, weather, financial filings
Academic datasets: ImageNet, SQuAD, GLUE benchmarks

The catch: public datasets reflect someone else's problem. Your production distribution will differ.
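
One rough way to see how far a public dataset sits from your own problem is to compare its label mix with a small hand-labeled production sample. The sketch below uses total variation distance between the two label distributions. The function names and the example data are hypothetical, and the check only covers label frequencies, not differences in the inputs themselves.

```python
from collections import Counter

def label_distribution(labels):
    """Return each label's share of the dataset as a dict."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def total_variation_distance(labels_a, labels_b):
    """Half the sum of absolute differences between two label distributions.

    0.0 means identical label mixes; 1.0 means no labels in common.
    """
    dist_a = label_distribution(labels_a)
    dist_b = label_distribution(labels_b)
    all_labels = set(dist_a) | set(dist_b)
    return 0.5 * sum(
        abs(dist_a.get(label, 0.0) - dist_b.get(label, 0.0))
        for label in all_labels
    )

# Hypothetical data: a public dataset vs. a small labeled production sample
public_labels = ["billing"] * 50 + ["shipping"] * 50
production_labels = ["billing"] * 80 + ["shipping"] * 10 + ["account"] * 10
gap = total_variation_distance(public_labels, production_labels)
print(f"Label distribution gap: {gap:.2f}")
```

A large gap, or a label that appears in production but not in the public data, is a sign that the public dataset should be treated as a starting point only.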

Synthetic Data Generation

When real data is scarce, expensive, or privacy-sensitive:

# Generating synthetic training data with an LLM
import json

from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def generate_synthetic_examples(category, num_examples=50):
    """Generate synthetic training examples for a text classifier.
    
    Use a strong model to generate training data for a weaker,
    cheaper model. This is a form of model distillation.
    """
    prompt = f"""Generate {num_examples} realistic customer support 
    tickets for the category: {category}.
    
    Requirements:
    - Each ticket should be 1-3 sentences
    - Include variety in tone (angry, polite, confused)
    - Include common typos and informal language
    - Output as JSON array of strings
    """
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,  # Higher temperature for diversity
    )
    
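    # Assumes the reply is a bare JSON array. json.loads raises
    # json.JSONDecodeError if the model wraps it in prose or code fences.
    return json.loads(response.choices[0].message.content)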

Synthetic data works well for augmenting real data. It works poorly as a complete replacement.
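
A minimal sketch of the augmentation approach: combine real and synthetic examples while capping the synthetic share, and tag each row with its source so synthetic rows can be kept out of evaluation sets. The function name, the row format, and the 10% default cap are assumptions for illustration.

```python
import random

def mix_synthetic(real_examples, synthetic_examples,
                  max_synthetic_fraction=0.1, seed=0):
    """Combine real and synthetic examples, capping the synthetic share.

    max_synthetic_fraction is the largest share of the final dataset that
    may be synthetic. The 0.1 default is an illustrative assumption.
    """
    if not 0 <= max_synthetic_fraction < 1:
        raise ValueError("max_synthetic_fraction must be in [0, 1)")

    # Solve s / (r + s) <= f for s, which gives s <= f * r / (1 - f)
    max_synthetic = int(
        max_synthetic_fraction * len(real_examples) / (1 - max_synthetic_fraction)
    )
    rng = random.Random(seed)
    keep = min(max_synthetic, len(synthetic_examples))
    sampled = rng.sample(synthetic_examples, keep)

    # Tag the source so synthetic rows can be excluded from evaluation sets
    mixed = [{"text": text, "source": "real"} for text in real_examples]
    mixed += [{"text": text, "source": "synthetic"} for text in sampled]
    rng.shuffle(mixed)
    return mixed

# Hypothetical data
real = [f"real ticket {i}" for i in range(90)]
synthetic = [f"synthetic ticket {i}" for i in range(50)]
training_set = mix_synthetic(real, synthetic)
n_synthetic = sum(1 for row in training_set if row["source"] == "synthetic")
print(f"{n_synthetic} synthetic rows out of {len(training_set)}")
```

Evaluating only on real examples is what tells you whether the synthetic rows helped.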

Labeling Strategies

Manual Labeling

The gold standard, but expensive and slow:

Cost comparison (approximate per 1,000 labels):

In-house domain experts:    $500-2,000 (highest quality, slowest)
Crowdsourcing platforms:    $50-200    (variable quality, fast)
Offshore labeling services: $20-100    (needs heavy QA)
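
The ranges above turn into a budget estimate with simple arithmetic. The sketch below copies the approximate rates from the comparison and multiplies by the number of labels, which grows when several annotators label each example. The function name and dictionary keys are hypothetical.

```python
# Approximate cost ranges per 1,000 labels, copied from the comparison above
COST_PER_1000_LABELS = {
    "in_house": (500, 2000),
    "crowdsourcing": (50, 200),
    "offshore": (20, 100),
}

def estimate_labeling_cost(num_examples, annotators_per_example, source):
    """Return a (low, high) cost estimate in dollars."""
    low_rate, high_rate = COST_PER_1000_LABELS[source]
    # Every annotator pass over an example counts as one label
    total_labels = num_examples * annotators_per_example
    return total_labels * low_rate / 1000, total_labels * high_rate / 1000

low, high = estimate_labeling_cost(10_000, 3, "crowdsourcing")
print(f"Estimated cost: ${low:,.0f} to ${high:,.0f}")
```

Labeling 10,000 examples with three crowdsourced annotators each means 30,000 labels, so redundancy for measuring agreement triples the bill.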

Best practices for manual labeling:

Write detailed labeling guidelines with examples and edge cases
Include "I don't know" as a valid label to avoid forced guessing
Have multiple annotators label each example to measure agreement
Start with a small pilot before scaling up
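
When several annotators label the same example and "I don't know" is allowed, their votes still have to be resolved into one label. Here is a minimal sketch using majority voting that treats "unsure" as an abstention and escalates examples without a clear majority. The function name, the "unsure" label string, and the two-thirds threshold are assumptions.

```python
from collections import Counter

def aggregate_votes(votes, unsure_label="unsure", min_agreement=2 / 3):
    """Resolve one example's annotator votes into a single label.

    Returns (label, agreement). label is None when the example should be
    escalated to an expert instead of entering the training set.
    """
    # "unsure" votes are abstentions, not a class of their own
    decided = [vote for vote in votes if vote != unsure_label]
    if not decided:
        return None, 0.0

    label, count = Counter(decided).most_common(1)[0]
    # Abstentions stay in the denominator, so they lower the agreement
    agreement = count / len(votes)
    if agreement < min_agreement:
        return None, agreement
    return label, agreement

print(aggregate_votes(["spam", "spam", "ham"]))
print(aggregate_votes(["spam", "ham", "unsure"]))
```

The escalated examples are worth reading in bulk, since they tend to cluster around the edge cases the guidelines do not yet cover.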

Semi-Automated Labeling

Use a model to pre-label, then have humans correct:

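def semi_automated_labeling_pipeline(unlabeled_data, model):
    """Pre-label with a model, then route to human review.
    
    This can be 3-5x faster than labeling from scratch because
    humans verify a suggested label faster than they produce one.
    
    model.predict(item) is assumed to return a dict with
    "label" and "confidence" keys.
    """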
    results = []
    for item in unlabeled_data:
        prediction = model.predict(item)
        confidence = prediction["confidence"]
        
        if confidence > 0.95:
            # High confidence: accept automatically
            results.append({
                "data": item,
                "label": prediction["label"],
                "source": "auto",
            })
        else:
            # Low confidence: send to human review
            results.append({
                "data": item,
                "label": None,  # Human will fill this in
                "suggested_label": prediction["label"],
                "source": "needs_review",
            })
    
    return results
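
Auto-accepted labels are never seen by a human, so a confidently wrong model can quietly pollute the dataset. One possible safeguard is to audit a random slice of the auto-accepted items and track how often the human disagrees. The sketch below assumes the result format produced by the pipeline above; the function names, the 5% audit rate, and the "human_label" field are hypothetical.

```python
import random

def sample_for_audit(results, audit_rate=0.05, seed=0):
    """Pick a random subset of auto-accepted items for human verification."""
    auto_accepted = [r for r in results if r["source"] == "auto"]
    if not auto_accepted:
        return []
    sample_size = max(1, round(len(auto_accepted) * audit_rate))
    return random.Random(seed).sample(auto_accepted, sample_size)

def auto_label_error_rate(audited):
    """Fraction of audited items where the human label differs from the auto label.

    Each item is assumed to carry a "human_label" field added during the audit.
    """
    if not audited:
        return 0.0
    wrong = sum(1 for r in audited if r["human_label"] != r["label"])
    return wrong / len(audited)

# Hypothetical pipeline output: 40 auto-accepted items, 10 awaiting review
results = [{"data": f"item {i}", "label": "billing", "source": "auto"} for i in range(40)]
results += [{"data": f"item {i}", "label": None, "source": "needs_review"} for i in range(40, 50)]
audit_batch = sample_for_audit(results)
print(f"Auditing {len(audit_batch)} of 40 auto-accepted items")
```

If the audited error rate climbs, raising the confidence threshold sends more items back to human review.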

Active Learning

Let the model choose which examples to label next:

def active_learning_selection(model, unlabeled_pool, budget=100):
    """Select the most informative examples for labeling.
    
    Instead of labeling random examples, label the ones
    where the model is most uncertain. This gets better
    results with fewer labels.
    """
    predictions = model.predict_proba(unlabeled_pool)
    
    # Uncertainty sampling (least confidence): score each example by
    # 1 - max class probability. For a binary task, the highest scores
    # are the predictions closest to 50/50
    uncertainty = 1 - predictions.max(axis=1)
    
    # Select the top-k most uncertain examples
    indices = uncertainty.argsort()[-budget:]
    
    return unlabeled_pool[indices]

In favorable cases, active learning can reduce labeling costs by 50-80% compared to random sampling, though the savings vary by task and model. The key insight: not all labels are equally valuable. Labels on examples where the model is already confident and correct teach it very little.
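
Least confidence is only one way to score uncertainty. With more than two classes, entropy and margin are common alternatives, and they do not always rank examples the same way. The following standard-library sketch shows both; the function names and the probability rows are made up for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy of one predicted distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def margin(probs):
    """Gap between the top two class probabilities (lower = more uncertain)."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]

def select_by_entropy(prob_rows, budget):
    """Return the indices of the `budget` rows with the highest entropy."""
    ranked = sorted(
        range(len(prob_rows)),
        key=lambda i: entropy(prob_rows[i]),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical predicted probabilities for three examples and three classes
prob_rows = [
    [0.90, 0.05, 0.05],  # Confident prediction
    [0.40, 0.35, 0.25],  # Probability spread across all classes
    [0.50, 0.50, 0.00],  # Torn between exactly two classes
]
selected = select_by_entropy(prob_rows, budget=2)
print(selected)
```

Entropy ranks the second row as most uncertain because its probability is spread over all three classes, while margin would pick the third row first because its top two classes are tied.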

Label Quality Over Quantity

Inter-Annotator Agreement

If two humans disagree on a label, either the example is ambiguous or the labeling guidelines are. Measure this:

from sklearn.metrics import cohen_kappa_score

def measure_agreement(annotator_1_labels, annotator_2_labels):
    """Measure how often two annotators agree.
    
    Cohen's Kappa accounts for agreement by chance.
    Kappa > 0.8: excellent agreement
    Kappa 0.6-0.8: good agreement
    Kappa < 0.6: your task definition needs work
    """
    kappa = cohen_kappa_score(annotator_1_labels, annotator_2_labels)
    
    if kappa < 0.6:
        print("Warning: Low agreement. Review labeling guidelines.")
        print("Common causes:")
        print("  - Ambiguous category definitions")
        print("  - Missing edge case examples in guidelines")
        print("  - Annotators need more training")
    
    return kappa
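
To make the chance correction concrete, here is Cohen's Kappa computed by hand for two annotators: kappa = (observed agreement - expected agreement) / (1 - expected agreement), where expected agreement comes from each annotator's own label frequencies. The function name and the example labels are hypothetical; in practice cohen_kappa_score does this for you.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same examples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set")
    n = len(labels_a)

    # Observed agreement: share of examples where the two labels match
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement: chance that both pick the same class if each
    # annotator labeled at random according to their own class frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # Both annotators used a single identical class
    return (observed - expected) / (1 - expected)

# Hypothetical labels: the annotators disagree on one example out of ten
annotator_1 = ["toxic"] * 5 + ["ok"] * 5
annotator_2 = ["toxic"] * 4 + ["ok"] * 6
kappa_value = cohen_kappa(annotator_1, annotator_2)
print(f"Kappa: {kappa_value:.2f}")
```

Here the annotators agree on 90% of examples, but 50% agreement is expected by chance, so kappa is 0.8 rather than 0.9.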

Cleaning Noisy Labels

Real-world labels always contain errors. Detect and fix them:

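def find_label_errors(features, labels, model):
    """Find likely mislabeled examples using confident learning.
    
    Get out-of-sample predictions from a model, then look for
    examples where the model confidently disagrees with the
    given label. These are candidates for relabeling.
    
    model is assumed to be a scikit-learn style classifier
    that supports predict_proba.
    """
    from cleanlab import Datalab
    from sklearn.model_selection import cross_val_predict
    
    # Score each example with a model that never saw it in training
    pred_probs = cross_val_predict(
        model, features, labels, cv=5, method="predict_proba"
    )
    
    # label_name tells Datalab which field holds the given labels
    lab = Datalab(data={"labels": labels}, label_name="labels")
    lab.find_issues(pred_probs=pred_probs, issue_types={"label": {}})
    
    # Get per-example results, then keep the flagged examples
    label_issues = lab.get_issues("label")
    mislabeled = label_issues[label_issues["is_label_issue"]]
    
    print(f"Found {len(mislabeled)} likely mislabeled examples")
    print(f"out of {len(labels)} total ({100*len(mislabeled)/len(labels):.1f}%)")
    
    return mislabeled.index.tolist()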

A dataset with 1,000 clean labels will often outperform a dataset with 10,000 noisy labels.

Real-World Example: Building a Content Moderation Dataset

A social media platform needs to detect toxic comments.

Phase 1: Bootstrap with existing data. Pull 50,000 comments that were reported by users and reviewed by moderators. This gives you labels for free, but with selection bias: only reported comments are labeled, and most comments are never reported.

Phase 2: Active learning on unlabeled data. Train a model on the Phase 1 data. Run it on 1 million unlabeled comments. Select the 5,000 where the model is most uncertain. Send those to human annotators. This fills gaps in the training distribution.

Phase 3: Synthetic augmentation. Generate 2,000 synthetic toxic comments covering categories underrepresented in real data (subtle sarcasm, coded language, context-dependent toxicity). Mix with real data at a 10% ratio.

Phase 4: Continuous labeling. In production, sample 100 comments per day for human review. Use disagreements between the model and humans as training signals. The dataset grows and improves continuously.
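
Phase 4 can be sketched as two small steps: draw a uniform random sample of the day's comments for review, then turn the cases where the human overruled the model into new training rows. The function names and the "model_label" and "human_label" fields are assumptions about how the review tool stores its results.

```python
import random

def daily_review_sample(comments, sample_size=100, seed=None):
    """Draw a uniform random sample of the day's comments for human review."""
    rng = random.Random(seed)
    return rng.sample(comments, min(sample_size, len(comments)))

def disagreements_to_training_rows(reviewed):
    """Turn model/human disagreements into new labeled examples.

    Each reviewed item is assumed to have "text", "model_label", and
    "human_label" fields. The human label is treated as ground truth.
    """
    return [
        {"text": item["text"], "label": item["human_label"]}
        for item in reviewed
        if item["human_label"] != item["model_label"]
    ]

# Hypothetical day of traffic and review results
todays_comments = [f"comment {i}" for i in range(500)]
review_batch = daily_review_sample(todays_comments, seed=42)
reviewed = [
    {"text": "nice post", "model_label": "ok", "human_label": "ok"},
    {"text": "coded insult", "model_label": "ok", "human_label": "toxic"},
]
new_rows = disagreements_to_training_rows(reviewed)
print(new_rows)
```

Because the sample is uniform, the same batch also gives an unbiased estimate of the model's production accuracy.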

Common Pitfalls

Collecting data without a clear task definition: If you don't know exactly what the model needs to predict, you will collect the wrong data. Define the task first, then design the data collection.
Assuming more data always helps: After a certain point, adding more of the same type of data has diminishing returns. Diversity and quality matter more than raw volume.
Ignoring class imbalance: If 95% of your labels are one class, a model can reach 95% accuracy by always predicting that class. Oversample the minority class or use weighted loss functions, and evaluate with per-class precision and recall rather than accuracy alone.
Not versioning your datasets: When your model breaks, you need to know whether the data changed. Track every version of every dataset.
Labeling in isolation: Labels created by people who don't understand the downstream task are often useless. Annotators need context about why the labels matter.
Skipping the pilot round: Always label a small batch first, measure agreement, fix the guidelines, then scale up. Scaling bad guidelines wastes money.
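
For the class imbalance pitfall, a common starting point for a weighted loss is to weight each class inversely to its frequency. The sketch below uses the formula n_samples / (n_classes * class_count), the same heuristic scikit-learn applies for class_weight="balanced". The function name and the example labels are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }

# Hypothetical 95/5 split, matching the imbalance described above
labels = ["ok"] * 95 + ["toxic"] * 5
weights = balanced_class_weights(labels)
print(weights)
```

With a 95/5 split, each minority example counts for 10.0 in the loss and each majority example for about 0.53, so both classes contribute equally in total.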

Key Takeaways

Data collection and labeling consume more time and budget than any other part of an ML project. Plan for this upfront.
Start with data you already have: application databases, user behavior logs, and existing human decisions are free training labels.
Label quality matters more than quantity. One thousand clean, consistent labels often outperform ten thousand noisy ones.
Use semi-automated labeling and active learning to reduce labeling costs, by 50-80% in favorable cases, while continuing to check label quality.
Measure inter-annotator agreement early. Low agreement means your task definition is ambiguous, and no amount of data will fix that.
Synthetic data is a useful supplement but not a replacement for real data. Mix it in at low ratios and validate carefully.