Data Collection & Labeling
Overview
Data is the most expensive part of any ML project. Not compute, not model architecture, not deployment infrastructure. Getting high-quality, representative training data consumes more time and budget than everything else combined. Yet most teams underestimate this cost and jump straight to modeling.
The quality of your data puts a hard ceiling on the quality of your model. A simple model trained on excellent data will outperform a sophisticated model trained on noisy, biased, or insufficient data. Every hour spent improving your data pipeline pays more dividends than an hour spent tuning hyperparameters.
Where to Get Training Data
Existing Databases
The cheapest data is data you already have. Most companies sit on valuable training data without realizing it:
# Mining existing databases for training data
import pandas as pd

def extract_training_data_from_logs(db_connection):
    """Extract labeled examples from existing user behavior.

    If users already categorize support tickets manually,
    those categorizations are free training labels.
    """
    query = """
        SELECT
            ticket_text,
            agent_assigned_category,
            final_category,
            resolution_status
        FROM support_tickets
        WHERE agent_assigned_category IS NOT NULL
          AND resolution_status = 'resolved'
          AND created_at > '2024-01-01'
    """
    df = pd.read_sql(query, db_connection)
    # Filter for high-confidence labels:
    # only use tickets where the category wasn't changed later.
    # final_category must be selected above for this comparison to work.
    df = df[df["agent_assigned_category"] == df["final_category"]]
    return df
Sources to check first:
- Application databases: User-generated categorizations, tags, ratings
- Search logs: Queries paired with clicked results are implicit relevance labels
- Customer support systems: Agent-categorized tickets, resolved issues
- Content moderation queues: Already-reviewed content with accept/reject decisions
User Behavior Logs
Implicit signals from user behavior are abundant but noisy:
# Implicit labeling from user behavior
def extract_implicit_labels(click_logs):
    """Convert user clicks into training signals.

    Caveat: clicks indicate interest, not quality.
    A user clicking a search result doesn't mean
    the result was good -- they might have bounced
    immediately.
    """
    labeled_pairs = []
    for session in click_logs:
        query = session["query"]
        for result in session["results"]:
            if result["clicked"] and result["dwell_time"] > 30:
                # Clicked and stayed = positive signal
                labeled_pairs.append({
                    "query": query,
                    "document": result["doc_id"],
                    "label": "relevant",
                })
            elif not result["clicked"] and result["position"] <= 3:
                # Shown prominently but ignored = negative signal
                labeled_pairs.append({
                    "query": query,
                    "document": result["doc_id"],
                    "label": "not_relevant",
                })
    return labeled_pairs
Public Datasets
Useful for prototyping and benchmarking, but rarely sufficient for production:
- Hugging Face Datasets: Thousands of pre-labeled datasets across tasks
- Kaggle: Competition datasets with known baselines
- Government open data: Census, weather, financial filings
- Academic datasets: ImageNet, SQuAD, GLUE benchmarks
The catch: public datasets reflect someone else's problem. Your production distribution will differ.
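One rough way to see how far a public dataset sits from your own traffic is to compare label mixes. The sketch below computes the total variation distance between two label samples. The function names and the sample labels are made up for illustration, and a small distance only says the label proportions match, not that the inputs look alike.

```python
from collections import Counter


def label_distribution(labels):
    """Return each label's share of the total as a dict."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(labels_a, labels_b):
    """Half the sum of absolute differences between two label distributions.

    0.0 means identical label mixes; 1.0 means no labels in common.
    Both inputs must be non-empty.
    """
    dist_a = label_distribution(labels_a)
    dist_b = label_distribution(labels_b)
    all_labels = set(dist_a) | set(dist_b)
    return 0.5 * sum(
        abs(dist_a.get(label, 0.0) - dist_b.get(label, 0.0))
        for label in all_labels
    )


# Hypothetical label samples, for illustration only
public_labels = ["billing"] * 50 + ["shipping"] * 50
production_labels = ["billing"] * 80 + ["shipping"] * 10 + ["returns"] * 10
gap = total_variation_distance(public_labels, production_labels)
```

Here the production sample contains a "returns" class the public data never saw, which is exactly the kind of gap that a benchmark score will not reveal.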
Synthetic Data Generation
When real data is scarce, expensive, or privacy-sensitive:
# Generating synthetic training data with an LLM
import json
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def generate_synthetic_examples(category, num_examples=50):
    """Generate synthetic training examples for a text classifier.

    Use a strong model to generate training data for a weaker,
    cheaper model. This is a form of model distillation.
    """
    prompt = f"""Generate {num_examples} realistic customer support
tickets for the category: {category}.
Requirements:
- Each ticket should be 1-3 sentences
- Include variety in tone (angry, polite, confused)
- Include common typos and informal language
- Output as JSON array of strings
"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,  # Higher temperature for diversity
    )
    # json.loads raises JSONDecodeError if the model adds text around the array
    return json.loads(response.choices[0].message.content)
Synthetic data works well for augmenting real data. It works poorly as a complete replacement.
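A minimal sketch of treating synthetic data as a supplement: cap its share of the final training set. The function name, the record layout, and the 10% default (borrowed from the worked example later in this document) are illustrative assumptions, not a recommended setting.

```python
import random


def mix_synthetic(real_examples, synthetic_examples, synthetic_share=0.1, seed=0):
    """Combine real and synthetic examples, capping the synthetic share.

    synthetic_share is the largest fraction of the final dataset that
    may be synthetic (0 <= share < 1).
    """
    if not 0 <= synthetic_share < 1:
        raise ValueError("synthetic_share must be in [0, 1)")
    rng = random.Random(seed)
    # Solve s / (r + s) = share for s, given r real examples.
    # The small epsilon guards against float rounding just below an integer.
    max_synthetic = int(
        len(real_examples) * synthetic_share / (1 - synthetic_share) + 1e-9
    )
    n_synthetic = min(max_synthetic, len(synthetic_examples))
    chosen = rng.sample(list(synthetic_examples), n_synthetic)
    # Tag every record with its origin so synthetic rows can be audited later
    mixed = [{"text": text, "source": "real"} for text in real_examples]
    mixed += [{"text": text, "source": "synthetic"} for text in chosen]
    rng.shuffle(mixed)
    return mixed


# Hypothetical data: 90 real tickets, 50 generated ones
real = [f"real ticket {i}" for i in range(90)]
synthetic = [f"synthetic ticket {i}" for i in range(50)]
mixed = mix_synthetic(real, synthetic)
```

Keeping a source field on every record makes it possible to evaluate on real examples only, which is one way to check whether the synthetic portion helps.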
Labeling Strategies
Manual Labeling
The gold standard, but expensive and slow:
Cost comparison (approximate per 1,000 labels):
In-house domain experts: $500-2,000 (highest quality, slowest)
Crowdsourcing platforms: $50-200 (variable quality, fast)
Offshore labeling services: $20-100 (needs heavy QA)
Best practices for manual labeling:
- Write detailed labeling guidelines with examples and edge cases
- Include "I don't know" as a valid label to avoid forced guessing
- Have multiple annotators label each example to measure agreement
- Start with a small pilot before scaling up
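When several annotators label each example and "I don't know" is allowed, the votes have to be combined somehow. The sketch below uses a simple majority vote that ignores abstentions and sends ties back for review. The "unsure" label name and the min_votes default are assumptions made for illustration.

```python
from collections import Counter

UNSURE = "unsure"  # hypothetical name for the "I don't know" label


def aggregate_votes(votes, min_votes=2):
    """Majority vote over one example's annotator labels.

    Returns the winning label, or None when the example should go
    back for review (too few usable votes, or a tie for first place).
    """
    usable = [vote for vote in votes if vote != UNSURE]
    if len(usable) < min_votes:
        return None
    ranked = Counter(usable).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no clear majority
    return ranked[0][0]
```

Examples that come back as None are worth reading during the pilot, since they often point at the edge cases the guidelines fail to cover.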
Semi-Automated Labeling
Use a model to pre-label, then have humans correct:
def semi_automated_labeling_pipeline(unlabeled_data, model):
    """Pre-label with a model, then route to human review.

    This can be 3-5x faster than labeling from scratch because
    humans verify a suggested label faster than they produce one.
    Assumes model.predict returns a dict with "label" and "confidence".
    """
    results = []
    for item in unlabeled_data:
        prediction = model.predict(item)
        confidence = prediction["confidence"]
        if confidence > 0.95:
            # High confidence: accept automatically
            results.append({
                "data": item,
                "label": prediction["label"],
                "source": "auto",
            })
        else:
            # Low confidence: send to human review
            results.append({
                "data": item,
                "label": None,  # Human will fill this in
                "suggested_label": prediction["label"],
                "source": "needs_review",
            })
    return results
Active Learning
Let the model choose which examples to label next:
def active_learning_selection(model, unlabeled_pool, budget=100):
    """Select the most informative examples for labeling.

    Instead of labeling random examples, label the ones
    where the model is most uncertain. This gets better
    results with fewer labels.
    Assumes unlabeled_pool is a NumPy array and model exposes
    predict_proba, as scikit-learn classifiers do.
    """
    predictions = model.predict_proba(unlabeled_pool)
    # Uncertainty sampling (least confidence): score each example by
    # 1 - max class probability. In the binary case, the most uncertain
    # examples are the ones closest to 50/50.
    uncertainty = 1 - predictions.max(axis=1)
    # Select the top-k most uncertain examples (argsort is ascending)
    indices = uncertainty.argsort()[-budget:]
    return unlabeled_pool[indices]
Active learning can reduce labeling costs by 50-80% compared to random sampling. The key insight: not all labels are equally valuable. Labels on examples where the model is already confident teach it very little.
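With more than two classes, "most uncertain" can be scored in several ways, and the scores do not always pick the same example. The sketch below compares three common ones: least confidence, margin, and entropy. The probability matrix is made up for illustration.

```python
import numpy as np


def least_confidence(probs):
    """1 - probability of the top class."""
    return 1.0 - probs.max(axis=1)


def margin_uncertainty(probs):
    """1 - gap between the top two classes; a small gap means high uncertainty."""
    top_two = np.sort(probs, axis=1)[:, -2:]
    return 1.0 - (top_two[:, 1] - top_two[:, 0])


def entropy_uncertainty(probs):
    """Shannon entropy of each row; higher means a flatter distribution."""
    clipped = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(clipped * np.log(clipped)).sum(axis=1)


# Hypothetical predicted probabilities for three examples, three classes
probs = np.array([
    [0.98, 0.01, 0.01],  # confident
    [0.50, 0.49, 0.01],  # torn between two classes
    [0.40, 0.30, 0.30],  # spread over all three
])

most_uncertain = {
    "least_confidence": int(least_confidence(probs).argmax()),
    "margin": int(margin_uncertainty(probs).argmax()),
    "entropy": int(entropy_uncertainty(probs).argmax()),
}
```

On this toy input, margin sampling selects the second row while the other two scores select the third, so the choice of score changes which examples reach the annotators.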
Label Quality Over Quantity
Inter-Annotator Agreement
If two humans disagree on a label, either the example or the labeling guideline is ambiguous. Measure this:
from sklearn.metrics import cohen_kappa_score

def measure_agreement(annotator_1_labels, annotator_2_labels):
    """Measure how often two annotators agree.

    Cohen's Kappa accounts for agreement by chance.
    Kappa > 0.8: excellent agreement
    Kappa 0.6-0.8: good agreement
    Kappa < 0.6: your task definition needs work
    """
    kappa = cohen_kappa_score(annotator_1_labels, annotator_2_labels)
    if kappa < 0.6:
        print("Warning: Low agreement. Review labeling guidelines.")
        print("Common causes:")
        print("  - Ambiguous category definitions")
        print("  - Missing edge case examples in guidelines")
        print("  - Annotators need more training")
    return kappa
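To see why the chance correction matters, it can help to compute kappa by hand as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected if both annotators labeled at random with their own label frequencies. The sketch below is a from-scratch illustration with made-up labels, not a replacement for the scikit-learn function.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa from first principles: (p_o - p_e) / (1 - p_e)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of examples with identical labels
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when both used one identical label
    return (p_o - p_e) / (1 - p_e)


# Hypothetical pilot: both annotators mark 9 of 10 comments "ok",
# but they never agree on which comment is "toxic".
annotator_1 = ["ok"] * 8 + ["toxic", "ok"]
annotator_2 = ["ok"] * 8 + ["ok", "toxic"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Raw agreement here is 80%, yet kappa is slightly negative because two annotators who almost always say "ok" would agree 82% of the time by chance. On imbalanced tasks, raw agreement alone can hide a broken task definition.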
Cleaning Noisy Labels
Real-world labels always contain errors. Detect and fix them:
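def find_label_errors(features, labels, model):
    """Find likely mislabeled examples using confident learning.

    Get out-of-sample predicted probabilities from the model, then
    look for examples where the model confidently disagrees with
    the given label. These are candidates for relabeling.
    """
    from cleanlab import Datalab
    from sklearn.model_selection import cross_val_predict

    # Cross-validated probabilities, so the model never scores
    # an example it was trained on
    pred_probs = cross_val_predict(
        model, features, labels, cv=5, method="predict_proba"
    )
    lab = Datalab(data={"labels": labels}, label_name="labels")
    lab.find_issues(pred_probs=pred_probs)
    # Keep flagged examples, lowest label_score (most suspicious) first
    label_issues = lab.get_issues("label")
    mislabeled = label_issues[label_issues["is_label_issue"]]
    mislabeled = mislabeled.sort_values("label_score")
    print(f"Found {len(mislabeled)} likely mislabeled examples")
    print(f"out of {len(labels)} total ({100*len(mislabeled)/len(labels):.1f}%)")
    return mislabeled.index.tolist()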
A dataset with 1,000 clean labels can outperform a dataset with 10,000 noisy labels on many tasks.
Real-World Example: Building a Content Moderation Dataset
A social media platform needs to detect toxic comments.
Phase 1: Bootstrap with existing data. Pull 50,000 comments that were reported by users and reviewed by moderators. This gives you labels for free, but with selection bias: only reported comments are labeled, and most comments are never reported.
Phase 2: Active learning on unlabeled data. Train a model on the Phase 1 data. Run it on 1 million unlabeled comments. Select the 5,000 where the model is most uncertain. Send those to human annotators. This fills gaps in the training distribution.
Phase 3: Synthetic augmentation. Generate 2,000 synthetic toxic comments covering categories underrepresented in real data (subtle sarcasm, coded language, context-dependent toxicity). Mix with real data at a 10% ratio.
Phase 4: Continuous labeling. In production, sample 100 comments per day for human review. Use disagreements between the model and humans as training signals. The dataset grows and improves continuously.
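The Phase 4 loop can be sketched in a few lines: draw a daily random sample for review, then keep the cases where the human label differs from the model's. The function names and the dict-based record layout are assumptions for illustration; a real pipeline would read from the platform's own review queue.

```python
import random


def sample_for_review(comment_ids, sample_size=100, seed=None):
    """Draw a uniform random sample of comment ids for human review."""
    rng = random.Random(seed)
    ids = list(comment_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def collect_disagreements(model_labels, human_labels):
    """Return the reviewed comments where the human overruled the model.

    Both arguments map comment id -> label. Ids reviewed by a human
    but missing from model_labels are skipped.
    """
    return [
        {"id": comment_id, "model": model_labels[comment_id], "human": human}
        for comment_id, human in human_labels.items()
        if comment_id in model_labels and model_labels[comment_id] != human
    ]


# Hypothetical day: the model labels 1,000 comments, 100 go to review
model_labels = {i: "ok" for i in range(1000)}
review_ids = sample_for_review(model_labels.keys(), sample_size=100, seed=7)
# Pretend the reviewer flags every comment whose id is divisible by 10
human_labels = {i: ("toxic" if i % 10 == 0 else "ok") for i in review_ids}
new_training_rows = collect_disagreements(model_labels, human_labels)
```

Sampling uniformly, rather than only from reported comments, is what lets this phase offset the selection bias noted in Phase 1.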
Common Pitfalls
- Collecting data without a clear task definition: If you don't know exactly what the model needs to predict, you will collect the wrong data. Define the task first, then design the data collection.
- Assuming more data always helps: After a certain point, adding more of the same type of data has diminishing returns. Diversity and quality matter more than raw volume.
- Ignoring class imbalance: If 95% of your labels are one class, the model can reach 95% accuracy by always predicting that class while learning nothing about the minority. Oversample the minority class or use weighted loss functions.
- Not versioning your datasets: When your model breaks, you need to know whether the data changed. Track every version of every dataset.
- Labeling in isolation: Labels created by people who don't understand the downstream task are often useless. Annotators need context about why the labels matter.
- Skipping the pilot round: Always label a small batch first, measure agreement, fix the guidelines, then scale up. Scaling bad guidelines wastes money.
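For the class imbalance pitfall above, one common starting point for a weighted loss is inverse-frequency weighting: n_samples / (n_classes * class_count), the same heuristic scikit-learn uses for class_weight="balanced". The sketch below applies it to the 95/5 split from the example. The function name and labels are illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical 95/5 split
labels = ["ok"] * 95 + ["toxic"] * 5
weights = balanced_class_weights(labels)
```

With these weights, each "toxic" example counts 19 times as much as an "ok" example in the loss, so both classes contribute equally in total.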
Key Takeaways
- Data collection and labeling consume more time and budget than any other part of an ML project. Plan for this upfront.
- Start with data you already have: application databases, user behavior logs, and existing human decisions are free training labels.
- Label quality matters more than quantity. One thousand clean, consistent labels can outperform ten thousand noisy ones.
- Use semi-automated labeling and active learning to cut labeling costs without sacrificing quality. Active learning alone can save 50-80% compared to random sampling.
- Measure inter-annotator agreement early. Low agreement means your task definition is ambiguous, and no amount of data will fix that.
- Synthetic data is a useful supplement but not a replacement for real data. Mix it in at low ratios and validate carefully.