Data Augmentation & Synthetic Data
Overview
You often need more training data than you have. Collecting and labeling real data is slow and expensive. Data augmentation and synthetic data generation let you multiply your existing dataset or create new examples from scratch. When done well, these techniques improve model robustness and reduce overfitting. When done poorly, they introduce noise and artifacts that degrade performance.
The core principle is simple: create new training examples that are different enough to teach the model something new, but realistic enough that they represent the actual distribution the model will encounter in production.
Text Augmentation Techniques
Synonym Replacement
The simplest augmentation: swap words with their synonyms.
import random
from nltk.corpus import wordnet  # run nltk.download("wordnet") once beforehand

def synonym_replacement(text, n_replacements=2):
    """Replace n random words with synonyms.

    Simple but limited. Works for basic text classification
    but can change meaning in subtle ways.
    """
    words = text.split()
    new_words = words.copy()
    # Candidate indices: a crude length filter stands in for stopword removal
    replaceable = [
        i for i, w in enumerate(words)
        if len(w) > 3
    ]
    random.shuffle(replaceable)
    for idx in replaceable[:n_replacements]:
        word = words[idx]
        synonyms = []
        for syn in wordnet.synsets(word):
            for lemma in syn.lemmas():
                # Compare case-insensitively so "Slow" is not "replaced" by "slow"
                if lemma.name().lower() != word.lower():
                    synonyms.append(lemma.name().replace("_", " "))
        if synonyms:
            new_words[idx] = random.choice(synonyms)
    return " ".join(new_words)

# Example
original = "The customer complained about slow delivery times"
augmented = synonym_replacement(original)
# One possible output (the choice is random):
# "The customer grumbled about slow delivery times"
Limitations: synonym replacement preserves meaning inconsistently. "Fast car" and "quick car" are fine, but "fast friend" and "quick friend" mean different things.
Back-Translation
Translate to another language and back. The round-trip introduces natural paraphrasing:
from transformers import MarianMTModel, MarianTokenizer

def back_translate(text, intermediate_lang="fr"):
    """Translate text to another language and back.

    The round-trip introduces natural variation:
    'The food was excellent' -> 'La nourriture était excellente'
    -> 'The food was outstanding'
    """
    # Assumes an opus-mt checkpoint exists for both directions of the pair.
    # Models are loaded on every call; cache them outside the function
    # when translating in bulk.
    forward_name = f"Helsinki-NLP/opus-mt-en-{intermediate_lang}"
    backward_name = f"Helsinki-NLP/opus-mt-{intermediate_lang}-en"

    # English -> intermediate language
    forward_model = MarianMTModel.from_pretrained(forward_name)
    forward_tokenizer = MarianTokenizer.from_pretrained(forward_name)
    tokens = forward_tokenizer(text, return_tensors="pt", padding=True)
    translated = forward_model.generate(**tokens)
    intermediate_text = forward_tokenizer.decode(translated[0], skip_special_tokens=True)

    # Intermediate language -> English
    backward_model = MarianMTModel.from_pretrained(backward_name)
    backward_tokenizer = MarianTokenizer.from_pretrained(backward_name)
    tokens = backward_tokenizer(intermediate_text, return_tensors="pt", padding=True)
    back_translated = backward_model.generate(**tokens)
    result = backward_tokenizer.decode(back_translated[0], skip_special_tokens=True)
    return result
Back-translation produces more natural paraphrases than synonym replacement. Use multiple intermediate languages (French, German, Chinese) to generate diverse variations.
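As a rough sketch of the multi-language idea, the helper below runs one sentence through several intermediate languages and keeps only the distinct results. The name multi_pivot_back_translate and its back_translate_fn parameter are assumptions made for illustration; any function that takes (text, language code) and returns English text would fit. The stand-in translator returns canned strings so the example runs without downloading models.

```python
from typing import Callable


def multi_pivot_back_translate(
    text: str,
    back_translate_fn: Callable[[str, str], str],
    pivot_langs: tuple[str, ...] = ("fr", "de", "zh"),
) -> list[str]:
    """Back-translate through several languages and keep unique variants."""
    seen = {text.strip().lower()}  # the original counts as already seen
    variants = []
    for lang in pivot_langs:
        candidate = back_translate_fn(text, lang).strip()
        key = candidate.lower()
        if candidate and key not in seen:
            seen.add(key)
            variants.append(candidate)
    return variants


# Stand-in translator for demonstration only; a real one would call a model.
def fake_back_translate(text: str, lang: str) -> str:
    canned = {
        "fr": "The food was outstanding",
        "de": "The meal was excellent",
        "zh": "The food was excellent",  # same as the input, so it is dropped
    }
    return canned[lang]


variants = multi_pivot_back_translate("The food was excellent", fake_back_translate)
print(variants)  # ['The food was outstanding', 'The meal was excellent']
```

Dropping round-trips that come back unchanged matters in practice: short, simple sentences often survive translation intact and would otherwise enter the dataset as exact duplicates.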
Paraphrasing with LLMs
Modern LLMs generate high-quality paraphrases with controllable variation:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_paraphrase(text, variation="moderate"):
    """Use an LLM to paraphrase text with controlled variation.

    Low variation: minor rewording, same structure
    Moderate variation: different structure, same meaning
    High variation: same core meaning, very different expression
    """
    variation_instructions = {
        "low": "Slightly reword this text while keeping the same structure.",
        "moderate": "Rewrite this text with a different sentence structure but the same meaning.",
        "high": "Express the same idea in a completely different way.",
    }
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"{variation_instructions[variation]} "
                           "Return only the paraphrased text, nothing else."
            },
            {"role": "user", "content": text}
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()
LLM-Generated Synthetic Data
Generating Training Data for Fine-Tuning
Use a strong model to generate training data for a smaller, cheaper model:
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_classification_training_data(
    categories: list[str],
    examples_per_category: int = 100,
    domain: str = "customer support",
):
    """Generate a synthetic classification dataset.

    Strategy: use GPT-4o to generate diverse examples,
    then use them to fine-tune GPT-4o-mini or an open-source model.
    """
    dataset = []
    for category in categories:
        prompt = f"""Generate {examples_per_category} realistic {domain}
messages that belong to the category: "{category}".
Requirements:
- Each message should be 1-4 sentences
- Include variety: formal/informal, short/long, clear/ambiguous
- Include realistic typos and abbreviations
- Each message should clearly belong to this category
- Output as a JSON array of strings
Examples of the kind of diversity I want:
- A one-word complaint
- A polite request with full sentences
- An angry message with typos
- A message that is borderline but still fits this category"""
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.9,
        )
        # Assumes the reply is a bare JSON array; json.loads raises otherwise
        messages = json.loads(response.choices[0].message.content)
        for msg in messages:
            dataset.append({"text": msg, "label": category})
    return dataset
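The generation code calls json.loads directly on the model reply, which fails when the reply is wrapped in a Markdown code fence or contains non-string items. A defensive parser might look like the sketch below. The helper name parse_json_string_array is hypothetical, and the set of reply shapes it handles is an assumption rather than a guarantee about any particular model.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker


def parse_json_string_array(raw: str) -> list[str]:
    """Extract a JSON array of non-empty strings from a model reply.

    Strips one surrounding Markdown code fence if present and raises
    ValueError when the reply is not a JSON array.
    """
    text = raw.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        text = text[len(FENCE):-len(FENCE)].strip()
        if text.lower().startswith("json"):  # drop the language tag
            text = text[4:].strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")
    # Keep only non-empty strings; silently skip anything else
    return [item.strip() for item in data if isinstance(item, str) and item.strip()]


reply = FENCE + 'json\n["wheres my order", " refund pls ", 42, ""]\n' + FENCE
print(parse_json_string_array(reply))  # ['wheres my order', 'refund pls']
```

Raising ValueError instead of returning an empty list keeps a failed generation call visible, so a retry can be triggered rather than silently producing a smaller dataset.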
You can also generate instruction-response pairs for instruction tuning, the approach popularized by Stanford Alpaca and adopted by many open-source fine-tuned models. The same principle applies: use a strong model (such as GPT-4o) to generate diverse training pairs, then fine-tune a smaller model on the results.
When Synthetic Data Helps vs When It Misleads
Where Synthetic Data Works Well
Training classifiers: Synthetic examples expand coverage of underrepresented classes. If you have 1,000 examples of category A and 50 of category B, generating 500 synthetic examples of category B narrows the imbalance from 20:1 to roughly 2:1.
Bootstrapping new tasks: When you have zero training data for a new task, synthetic data gives you a starting point. Train on synthetic data first, then fine-tune on real data as it becomes available.
Data privacy: When real data contains PII, synthetic data with similar statistical properties lets you train with far less privacy exposure. Check that generated examples do not reproduce real records verbatim.
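Before generating anything for class balancing, it helps to work out how many synthetic examples each class actually needs. The sketch below is one way to do that arithmetic; the function name and the target_ratio parameter are assumptions introduced for this example.

```python
from collections import Counter


def synthetic_needed_per_class(labels, target_ratio=1.0):
    """Return how many synthetic examples each class needs.

    target_ratio is the desired size of every class relative to the
    largest class (1.0 means fully balanced).
    """
    counts = Counter(labels)
    target = round(max(counts.values()) * target_ratio)
    return {label: max(0, target - n) for label, n in counts.items()}


labels = ["A"] * 1000 + ["B"] * 50
print(synthetic_needed_per_class(labels))        # {'A': 0, 'B': 950}
print(synthetic_needed_per_class(labels, 0.55))  # {'A': 0, 'B': 500}
```

Full balance here would mean 950 synthetic examples against 50 real ones for category B, so a partial target is often the safer choice for a very small class.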
import json
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def augment_minority_class(dataset, target_class, target_count):
    """Generate synthetic examples to balance a skewed dataset.

    Appends to dataset in place and returns it.
    """
    current_examples = [
        d for d in dataset if d["label"] == target_class
    ]
    current_count = len(current_examples)
    if current_count >= target_count:
        return dataset
    needed = target_count - current_count
    # Use existing examples as seed for generation
    seed_examples = random.sample(
        current_examples, min(10, current_count)
    )
    seed_text = "\n".join(
        f"- {ex['text']}" for ex in seed_examples
    )
    prompt = f"""Here are examples of the category "{target_class}":
{seed_text}
Generate {needed} more examples in the same style but with variety.
Output as a JSON array of strings."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,
    )
    # Assumes the reply is a bare JSON array; json.loads raises otherwise
    new_examples = json.loads(response.choices[0].message.content)
    for text in new_examples:
        dataset.append({"text": text, "label": target_class})
    return dataset
Where Synthetic Data Misleads
Evaluation sets: Never use synthetic data for evaluation. If you generate test data with the same model that produced your training data, you are measuring how well your model imitates the generator, not how well it handles real inputs.
Distribution shift: Synthetic data reflects the generating model's understanding, not reality. An LLM asked to generate "angry customer messages" produces its idea of anger, which may not match how your actual customers express frustration.
Subtle patterns: Synthetic data lacks the messy, inconsistent patterns that exist in real data. Models trained only on synthetic data are brittle when deployed against real inputs.
Good uses of synthetic data:
  Training set augmentation        YES  (mix with real data)
  Minority class balancing         YES  (supplement real examples)
  Privacy-safe model development   YES  (replaces PII-containing data)
  Cold-start bootstrapping         YES  (until real data is available)
Bad uses of synthetic data:
  Test set creation                NO   (always use real data)
  Final model evaluation           NO   (measures nothing useful)
  Replacing real data entirely     NO   (distribution shift)
  Evaluating bias/fairness         NO   (inherits generator biases)
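One cheap way to spot distribution shift is to compare word frequencies in real and synthetic text. The sketch below uses Jensen-Shannon divergence over whitespace-split tokens. That choice of metric and tokenizer is an assumption made for illustration, and the function names are hypothetical. A score near 0 means the two vocabularies are used similarly, and a score near 1 means they barely overlap.

```python
import math
from collections import Counter


def token_distribution(texts):
    """Relative frequency of each lowercase whitespace-split token."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 = identical, 1 = disjoint."""
    vocab = set(p) | set(q)
    # Mixture distribution: the average of p and q over the joint vocabulary
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a):
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


real = ["wheres my order??", "order still not here", "refund pls"]
synthetic = ["I am writing to express my dissatisfaction with my order"]
score = js_divergence(token_distribution(real), token_distribution(synthetic))
print(f"JS divergence: {score:.2f}")
```

Token frequencies capture only surface style. A low score does not prove the synthetic data is realistic, but a high score is a useful early warning that the generator writes differently from your users.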
Quality Control for Generated Data
Automated Filtering
Not all synthetic examples are good. Filter aggressively with three checks: (1) length check (discard examples that are too short or too long), (2) deduplication (use embedding similarity to remove near-duplicates above 0.95 cosine similarity), and (3) label verification (use a classifier trained on real data to confirm the label is correct; discard mismatches).
Human Spot-Checking
Automate what you can, but always spot-check with humans. Don't review everything. Review a random sample of 100 examples and measure the error rate. If the error rate is above 5%, fix the generation process before using the data.
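A sample of 100 gives only a coarse estimate of the true error rate, so it can help to look at an interval as well as the point estimate. The sketch below draws a reproducible review sample and computes a 95% Wilson score interval. The function names are hypothetical, and the use of a Wilson interval is an addition for illustration, not part of the procedure described above.

```python
import math
import random


def draw_review_sample(examples, sample_size=100, seed=0):
    """Pick a reproducible random sample for human review."""
    rng = random.Random(seed)
    return rng.sample(examples, min(sample_size, len(examples)))


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for an error proportion (z=1.96 gives 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def passes_spot_check(errors, n, max_error_rate=0.05):
    """True when the observed error rate is at or below the threshold."""
    return errors / n <= max_error_rate


low, high = wilson_interval(errors=5, n=100)
print(f"observed 5.0%, plausible range {low:.1%} to {high:.1%}")
print(passes_spot_check(5, 100), passes_spot_check(6, 100))  # True False
```

With 5 errors in 100, the plausible range runs from roughly 2% to 11%. A result sitting right at the threshold is a reason to review a second sample before trusting the data.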
Mixing Ratios
The ratio of synthetic to real data matters:
Recommended mixing ratios:
  Scenario                       Real : Synthetic
  Plenty of real data (>10k)     Only use real data
  Moderate real data (1k-10k)    70:30 to 50:50
  Small real data (<1k)          30:70 (augment heavily)
  No real data (cold start)      0:100 (temporary, replace ASAP)
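To apply a ratio from the table, keep every real example and sample just enough synthetic ones to reach the target share. The sketch below shows one way to do that; mix_datasets and its real_share parameter are names invented for this example. The cold-start row has no real data and simply uses the synthetic set as is.

```python
import random


def mix_datasets(real, synthetic, real_share, seed=0):
    """Combine all real examples with enough synthetic ones to hit real_share.

    real_share is the fraction of the final set that should be real,
    e.g. 0.7 for a 70:30 real-to-synthetic mix.
    """
    if not 0 < real_share <= 1:
        raise ValueError("real_share must be in (0, 1]")
    rng = random.Random(seed)
    wanted = round(len(real) * (1 - real_share) / real_share)
    n_synthetic = min(wanted, len(synthetic))  # cannot use more than we have
    mixed = list(real) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(mixed)
    return mixed


real = [("real", i) for i in range(700)]
synthetic = [("synthetic", i) for i in range(1000)]
mixed = mix_datasets(real, synthetic, real_share=0.7)
n_real = sum(1 for source, _ in mixed if source == "real")
print(len(mixed), n_real)  # 1000 700
```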
Real-World Example: Multi-Language Support
A company has a text classifier trained on 50,000 English customer messages. They need to support Spanish, French, and German, but have fewer than 500 labeled messages in each language.
Step 1: Machine-translate the English dataset into each target language to create 50,000 synthetic examples per language. This gives a rough starting point.
Step 2: Use GPT-4o to generate 2,000 native-sounding examples per language, with prompts written by native speakers who understand the cultural differences in how customers communicate.
Step 3: Have native speakers label 500 real messages per language as a test set. Never use synthetic data for evaluation.
Step 4: Train with a mix of back-translated, LLM-generated, and real data. Use the real test set to measure actual performance.
Result: The augmented model achieves 85% accuracy in new languages compared to 91% in English. Without augmentation, accuracy was 62%. The model improves to 89% after one month of collecting real labeled data in production.
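The data handling in Steps 3 and 4 can be sketched as a split function that reserves real messages for testing before any synthetic data is added. The function build_splits and the "source" tag it attaches are assumptions made for this example. The tag lets you later measure how much each data source contributes.

```python
import random


def build_splits(real, translated, generated, test_size=500, seed=0):
    """Hold out a real-only test set, then pool everything else for training."""
    if len(real) <= test_size:
        raise ValueError("need more real examples than test_size")
    rng = random.Random(seed)
    shuffled = list(real)
    rng.shuffle(shuffled)
    test = shuffled[:test_size]          # real data only, never synthetic
    train_real = shuffled[test_size:]
    train = (
        [dict(ex, source="real") for ex in train_real]
        + [dict(ex, source="translated") for ex in translated]
        + [dict(ex, source="generated") for ex in generated]
    )
    rng.shuffle(train)
    return train, test


# Toy data standing in for one language's messages
real = [{"text": f"real {i}", "label": "billing"} for i in range(600)]
translated = [{"text": f"translated {i}", "label": "billing"} for i in range(50)]
generated = [{"text": f"generated {i}", "label": "billing"} for i in range(20)]

train, test = build_splits(real, translated, generated)
print(len(train), len(test))  # 170 500
```

Splitting before mixing is the important part: once synthetic and real examples are pooled, it becomes easy to leak synthetic data into the test set by accident.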
Common Pitfalls
- Using synthetic data for evaluation: This is the most common and most damaging mistake. Your test set must always be real, representative data. Synthetic evaluation tells you nothing about production performance.
- Over-augmenting: If you generate 100x more synthetic data than real data, the model learns the synthetic distribution, not the real one. Keep ratios reasonable.
- Not checking for duplicates: LLMs often generate near-identical examples. Deduplicate aggressively before training.
- Ignoring the generator's biases: GPT-4o has its own biases about how text should look. Synthetic data inherits these biases. Check that synthetic examples match real data distributions.
- Augmenting without a baseline: Always measure performance on real data before and after augmentation. If augmentation doesn't improve your real-data metrics, it is not helping.
- Treating all augmentation techniques equally: Back-translation works well for classification but poorly for extraction tasks where exact spans matter. Match the technique to the task.
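The baseline pitfall is easy to guard against with a small comparison harness. The sketch below assumes each model is exposed as a callable that maps text to a label; the function names and the toy predictors are hypothetical.

```python
def accuracy(predict, test_set):
    """Fraction of real test examples the model labels correctly."""
    correct = sum(1 for ex in test_set if predict(ex["text"]) == ex["label"])
    return correct / len(test_set)


def augmentation_gain(baseline_predict, augmented_predict, real_test_set):
    """Compare two models on the same real test set."""
    base = accuracy(baseline_predict, real_test_set)
    aug = accuracy(augmented_predict, real_test_set)
    return {"baseline": base, "augmented": aug, "gain": aug - base}


real_test_set = [
    {"text": "where is my order", "label": "delivery"},
    {"text": "package is late", "label": "delivery"},
    {"text": "charged twice", "label": "billing"},
    {"text": "need a refund", "label": "billing"},
]


# Toy predictors standing in for models trained without and with augmentation
def baseline_predict(text):
    return "delivery"


def augmented_predict(text):
    return "billing" if "charged" in text else "delivery"


report = augmentation_gain(baseline_predict, augmented_predict, real_test_set)
print(report)  # {'baseline': 0.5, 'augmented': 0.75, 'gain': 0.25}
```

If the gain is zero or negative on real data, the augmentation is adding cost without benefit and the generation process is worth revisiting.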
Key Takeaways
- Data augmentation multiplies your existing dataset. Synthetic generation creates data from scratch. Both require careful quality control.
- For text, the most effective augmentation techniques are back-translation, LLM paraphrasing, and minority class generation. Synonym replacement is limited.
- Use synthetic data for training, never for evaluation. The evaluation set must always be real, representative data.
- Quality control is mandatory: filter for duplicates, verify labels, check that synthetic examples match real data distributions.
- Mix synthetic and real data at appropriate ratios. More real data means less synthetic data needed.
- LLM-generated synthetic data is a powerful tool for bootstrapping new tasks and balancing class distributions, but it cannot replace real data collection long-term.