Data Preparation

Overview

The quality of your fine-tuned model depends mostly on the quality of your training data. This is not a platitude; it is usually the single most predictive factor. A model fine-tuned on 1,000 carefully curated examples will often outperform one trained on 100,000 noisy examples.

Data preparation for fine-tuning is methodical work: collecting examples, formatting them correctly, cleaning errors, removing duplicates, balancing categories, and validating the final dataset. Shortcuts here cost you weeks of debugging later.

Data Format

Fine-tuning data for language models typically follows an instruction-response format: each example pairs an input (instruction plus context) with the desired output, often written as a conversation.

OpenAI Format

# Each training example is a conversation
training_example = {
    "messages": [
        {
            "role": "system",
            "content": "You are a customer support agent for Acme Corp."
        },
        {
            "role": "user",
            "content": "I was charged twice for order #4521. I need a refund for the duplicate charge."
        },
        {
            "role": "assistant",
            "content": "I can see the duplicate charge on order #4521. I've initiated a refund of $49.99 to your original payment method. You should see it within 3-5 business days. Is there anything else I can help with?"
        }
    ]
}
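
Training files in this chat format are normally uploaded as JSON Lines (JSONL): one complete example per line. The sketch below shows one way to serialize and re-read such a file; the function names to_jsonl, from_jsonl, and write_jsonl are illustrative and not part of any SDK.

```python
import json

def to_jsonl(examples: list[dict]) -> str:
    """Serialize examples as JSON Lines: one JSON object per line."""
    # json.dumps escapes newlines inside strings, so each example stays on one line
    return "".join(json.dumps(ex, ensure_ascii=False) + "\n" for ex in examples)

def from_jsonl(text: str) -> list[dict]:
    """Parse JSON Lines text back into a list of examples."""
    # Split on "\n" only; str.splitlines() would also split on characters
    # such as U+2028 that may appear unescaped inside a JSON string
    return [json.loads(line) for line in text.split("\n") if line.strip()]

def write_jsonl(examples: list[dict], path: str) -> None:
    """Write examples to a .jsonl file (path is chosen by the caller)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(to_jsonl(examples))
```

Round-tripping a few examples through to_jsonl and from_jsonl before uploading is a cheap way to catch serialization problems early.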

Alpaca/Stanford Format (Open Source)

# Common format for open source fine-tuning
training_example = {
    "instruction": "Classify this support ticket by category and priority.",
    "input": "My account was hacked and someone changed my password and email address.",
    "output": "Category: security\nPriority: critical\nReason: Account compromise with credential changes requires immediate escalation."
}
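
If a pipeline needs both formats, an Alpaca-style record can be mapped onto the chat structure shown earlier. This is a minimal sketch under one assumption: the instruction and the optional input are joined with a blank line to form the user message. The function name alpaca_to_messages is hypothetical.

```python
from typing import Optional

def alpaca_to_messages(example: dict, system_prompt: Optional[str] = None) -> dict:
    """Convert an instruction/input/output record to a chat-style example."""
    instruction = example["instruction"].strip()
    context = example.get("input", "").strip()
    # Assumption: instruction and context are separated by a blank line
    user_content = f"{instruction}\n\n{context}" if context else instruction

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": example["output"]})
    return {"messages": messages}
```

Records with an empty input field produce a user message that contains only the instruction.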

Multi-Turn Conversations

# For conversational fine-tuning, include full dialogues
training_example = {
    "messages": [
        {"role": "system", "content": "You are a technical support agent."},
        {"role": "user", "content": "My app crashes when I try to upload a file."},
        {"role": "assistant", "content": "What type of file are you uploading, and how large is it?"},
        {"role": "user", "content": "It's a PDF, about 25MB."},
        {"role": "assistant", "content": "The upload limit is 10MB for PDFs. Please compress the file or split it into smaller parts. You can use our built-in compression tool at Settings > File Tools > Compress PDF."}
    ]
}

Data Collection Strategies

From Existing Systems

The best training data comes from real interactions that have been validated.

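# Assumption: the system prompt your deployed assistant uses; adjust as needed
SYSTEM_PROMPT = "You are a customer support agent for Acme Corp."

def extract_training_data_from_logs(support_logs: list[dict]) -> list[dict]: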
    """Extract high-quality training examples from support chat logs.
    
    Filter for: resolved conversations with positive satisfaction scores.
    """
    training_data = []
    
    for log in support_logs:
        # Quality filters
        if log["resolution_status"] != "resolved":
            continue
        if log.get("satisfaction_score", 0) < 4:  # 4 out of 5
            continue
        if log["agent_type"] != "human":  # Learn from human agents
            continue
        if len(log["messages"]) < 2:  # Need at least one exchange
            continue
        
        # Convert to training format
        messages = [{"role": "system", "content": SYSTEM_PROMPT}]
        for msg in log["messages"]:
            role = "user" if msg["sender"] == "customer" else "assistant"
            messages.append({"role": role, "content": msg["text"]})
        
        training_data.append({"messages": messages})
    
    return training_data

From Manual Annotation

When you don't have existing high-quality data, create it manually.

Annotation process:

  1. Collect raw inputs (real user queries, documents, etc.)
  2. Write ideal outputs for each input
  3. Have a second person review every example
  4. Resolve disagreements with a third reviewer
  5. Format into training structure
  
Budget: Expect 5-15 minutes per example for complex tasks.
1,000 examples = roughly 83-250 person-hours of annotation work.
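
The budget arithmetic can be reproduced for other dataset sizes with a small helper. The name annotation_hours is illustrative, and the defaults simply restate the 5-15 minute range given above.

```python
def annotation_hours(num_examples: int,
                     minutes_low: float = 5,
                     minutes_high: float = 15) -> tuple[float, float]:
    """Return the (low, high) estimate of annotation effort in person-hours."""
    return (num_examples * minutes_low / 60, num_examples * minutes_high / 60)

low, high = annotation_hours(1000)
print(f"{low:.0f}-{high:.0f} person-hours")  # 83-250 person-hours
```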

Synthetic Data Generation

Use a stronger model to generate training data for a weaker model.

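import json

from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

def generate_synthetic_examples(seed_examples: list[dict], 
                                 num_to_generate: int = 500) -> list[dict]:
    """Use GPT-4o to generate training data for fine-tuning GPT-4o-mini."""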
    
    # Show seed examples so the generator understands the pattern
    seed_text = "\n\n".join([
        f"Input: {ex['input']}\nOutput: {ex['output']}"
        for ex in seed_examples[:5]
    ])
    
    synthetic_examples = []
    
    for _ in range(num_to_generate):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": f"""Generate a new training example 
following the same pattern as these examples. Create a realistic, diverse input 
and the correct output. Do not copy existing examples.

Existing examples:
{seed_text}

Return JSON: {{"input": "...", "output": "..."}}"""},
                {"role": "user", "content": "Generate one new example."}
            ],
            temperature=0.9,  # High temperature for diversity
            response_format={"type": "json_object"}
        )
        
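        # Skip malformed generations instead of aborting the whole run,
        # so the result may contain fewer than num_to_generate examples
        try:
            example = json.loads(response.choices[0].message.content)
        except json.JSONDecodeError:
            continue
        if isinstance(example, dict) and {"input", "output"} <= example.keys():
            synthetic_examples.append(example)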
    
    return synthetic_examples

Synthetic data guidelines:
  - Always validate synthetic examples (spot-check 10-20%)
  - Use high temperature (0.8-1.0) for diversity
  - Mix synthetic data with real data (never 100% synthetic)
  - Typical mix: 30% real + 70% synthetic, or 50/50
  - Synthetic data is better for augmentation than replacement
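
The spot-check and mixing guidelines above could be implemented along these lines. This is a sketch, not a prescribed method: sample_for_review and mix_datasets are hypothetical names, and the mix keeps every real example while drawing only as many synthetic ones as the target share requires.

```python
import random

def sample_for_review(examples: list, fraction: float = 0.15, seed: int = 0) -> list:
    """Pick a random subset of synthetic examples for human spot-checking."""
    if not examples:
        return []
    k = max(1, round(len(examples) * fraction))
    return random.Random(seed).sample(examples, k)

def mix_datasets(real: list, synthetic: list,
                 synthetic_share: float = 0.5, seed: int = 0) -> list:
    """Combine all real examples with enough synthetic ones to hit the target share."""
    if not 0 <= synthetic_share < 1:
        raise ValueError("synthetic_share must be in [0, 1): never 100% synthetic")
    # n_syn / (n_real + n_syn) = share  =>  n_syn = n_real * share / (1 - share)
    wanted = round(len(real) * synthetic_share / (1 - synthetic_share))
    n_syn = min(len(synthetic), wanted)
    rng = random.Random(seed)
    mixed = list(real) + rng.sample(synthetic, n_syn)
    rng.shuffle(mixed)
    return mixed
```

With 30 real examples and a 0.7 synthetic share, this draws 70 synthetic examples for a combined set of 100.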

Data Cleaning

Removing Bad Examples

def clean_training_data(examples: list[dict]) -> list[dict]:
    """Filter out problematic training examples."""
    cleaned = []
    removed_reasons = {}
    
    for example in examples:
        messages = example["messages"]
        
        # Get the assistant's response
        assistant_msgs = [m for m in messages if m["role"] == "assistant"]
        if not assistant_msgs:
            removed_reasons["no_response"] = removed_reasons.get("no_response", 0) + 1
            continue
        
        response = assistant_msgs[-1]["content"]
        
        # Check for empty or near-empty responses
        if len(response.strip()) < 10:
            removed_reasons["too_short"] = removed_reasons.get("too_short", 0) + 1
            continue
        
        # Check for error messages or non-answers.
        # Patterns are lowercase because they are matched against response.lower().
        error_patterns = [
            "i cannot", "i'm unable to", "error occurred",
            "something went wrong", "please try again"
        ]
        if any(p in response.lower() for p in error_patterns):
            removed_reasons["error_response"] = removed_reasons.get("error_response", 0) + 1
            continue
        
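        # Check every message for PII, not only the final response
        # (contains_pii is a helper you must supply; it is not defined here)
        if any(contains_pii(m["content"]) for m in messages):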
            removed_reasons["pii_detected"] = removed_reasons.get("pii_detected", 0) + 1
            continue
        
        cleaned.append(example)
    
    print(f"Kept {len(cleaned)}/{len(examples)} examples")
    print(f"Removed: {removed_reasons}")
    
    return cleaned
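
The cleaning function calls contains_pii without defining it. A hypothetical regex-based version is sketched below. It only catches email addresses and North American style phone numbers, which is an assumption made for illustration; names, addresses, and account numbers need a dedicated PII detection tool or human review.

```python
import re

# Illustrative patterns only; real PII detection needs broader coverage
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?\d{1,3}[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)

def contains_pii(text: str) -> bool:
    """Return True if the text appears to contain an email address or phone number."""
    return bool(EMAIL_RE.search(text) or PHONE_RE.search(text))
```

A pattern-based check like this works best as a first filter that flags examples for review, since it will miss many kinds of personal data.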

Deduplication

Duplicate examples cause the model to memorize rather than generalize. Embed all training inputs and, for each pair with cosine similarity above 0.95, keep only one of the two examples. Exact duplicates are also common in production data (the same support question asked by different users gets the same template response).
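
A minimal sketch of this deduplication step follows. It assumes the embeddings have already been computed by whatever embedding model you use and are passed in as plain lists of floats; the function names are illustrative. The greedy comparison is quadratic in the number of examples, so large datasets would need an approximate nearest-neighbor index instead.

```python
import math

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def deduplicate(texts: list[str], embeddings: list[list[float]],
                threshold: float = 0.95) -> list[int]:
    """Return the indices of examples to keep, in their original order."""
    kept: list[int] = []
    seen: set[str] = set()
    for i, (text, emb) in enumerate(zip(texts, embeddings)):
        key = normalize(text)
        if key in seen:
            continue  # exact duplicate after normalization
        if any(cosine_similarity(emb, embeddings[j]) > threshold for j in kept):
            continue  # near-duplicate of an example that was already kept
        seen.add(key)
        kept.append(i)
    return kept
```

The first occurrence of each duplicate group is the one that survives, so sort the data by quality beforehand if some copies are better than others.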

Balancing

If your training data is imbalanced (90% of one category, 10% of another), the model will be biased toward the majority category. Two strategies:

Upsample: Duplicate minority class examples to match the majority count
Downsample: Randomly remove majority class examples to match the minority count

Print category distributions before and after balancing to verify.
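
Both strategies can be sketched with the standard library. The example assumes each record carries its label under a "category" key, which is a hypothetical field name, and the function name balance is illustrative.

```python
import random
from collections import Counter, defaultdict

def balance(examples: list[dict], label_key: str = "category",
            strategy: str = "downsample", seed: int = 0) -> list[dict]:
    """Balance classes by downsampling to the smallest or upsampling to the largest."""
    if strategy not in ("downsample", "upsample"):
        raise ValueError("strategy must be 'downsample' or 'upsample'")
    if not examples:
        return []

    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)

    sizes = [len(group) for group in by_label.values()]
    target = min(sizes) if strategy == "downsample" else max(sizes)

    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        if strategy == "downsample":
            balanced.extend(rng.sample(group, target))
        else:
            balanced.extend(group)
            # Sample with replacement to fill the gap up to the target size
            balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

data = [{"category": "billing"}] * 90 + [{"category": "security"}] * 10
print("before:", Counter(ex["category"] for ex in data))
print("after: ", Counter(ex["category"] for ex in balance(data)))
```

If you upsample, it is safer to do so after splitting and only on the training split, so that copies of the same example cannot land in both the training and evaluation sets.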

Validation Split

Always hold out data for evaluation. Never train on your test set.

Split guidelines:
  - Train (85%): Model learns from this
  - Validation (10%): Used during training to detect overfitting
  - Test (5%): Used ONLY for final evaluation, never during training
  
  For small datasets (< 1000 examples):
  - Train (80%), Validation (10%), Test (10%)
  - Minimum test set: 50 examples for meaningful metrics

Shuffle before splitting. Stratify by category if applicable so each split has representative examples from all classes.
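
A shuffled, stratified split along these lines might look as follows. This is a sketch that again assumes a hypothetical "category" field on each example; the default ratios restate the 85/10/5 guideline above.

```python
import random
from collections import defaultdict

def stratified_split(examples: list[dict], label_key: str = "category",
                     ratios: tuple = (0.85, 0.10, 0.05), seed: int = 42):
    """Split into (train, val, test) so each class keeps roughly the same proportions."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")

    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for group in by_label.values():
        group = list(group)  # copy so the caller's data is not reordered
        rng.shuffle(group)
        n_val = round(len(group) * ratios[1])
        n_test = round(len(group) * ratios[2])
        n_train = len(group) - n_val - n_test
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])

    # Shuffle each split so examples are not grouped by class
    for split in (train, val, test):
        rng.shuffle(split)
    return train, val, test
```

Fixing the seed makes the split reproducible, which matters because a test set that changes between runs is no longer truly held out. Very small classes can round to zero validation or test examples, so check the per-class counts afterward.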

Data Quality Checklist

Before starting fine-tuning, verify:

Format:
  [ ] All examples follow the correct message format
  [ ] System prompts are consistent across examples
  [ ] No truncated or incomplete responses
  [ ] Token counts are within model limits

Content:
  [ ] Responses are factually correct
  [ ] Responses match the desired style and tone
  [ ] No PII (names, emails, phone numbers, etc.)
  [ ] No harmful, biased, or offensive content
  [ ] Edge cases are represented in the data

Distribution:
  [ ] Categories are reasonably balanced
  [ ] Input lengths are varied (short, medium, long)
  [ ] Topics are diverse within the domain
  [ ] Near-duplicates have been removed

Splits:
  [ ] Train/val/test split is clean (no leakage)
  [ ] Test set is representative of production inputs
  [ ] Test set is held out and never used for training decisions
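
Some of the format and split checks can be automated. The sketch below covers message structure, system prompt consistency, and train/test leakage on user inputs; token counts are left out because they depend on the tokenizer of the target model. The names validate_example and find_leakage are hypothetical.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_example(example: dict, expected_system_prompt=None) -> list:
    """Return a list of problems found in one chat-format example (empty if none)."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i}: not an object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {message.get('role')!r}")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")

    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message is not an assistant response")

    if expected_system_prompt is not None:
        first = messages[0]
        if (not isinstance(first, dict) or first.get("role") != "system"
                or first.get("content") != expected_system_prompt):
            problems.append("system prompt missing or inconsistent")
    return problems

def find_leakage(train: list, test: list) -> set:
    """Return normalized user inputs that appear in both the train and test sets."""
    def user_inputs(examples):
        return {
            " ".join(m["content"].lower().split())
            for ex in examples
            for m in ex["messages"]
            if m["role"] == "user"
        }
    return user_inputs(train) & user_inputs(test)
```

Run validate_example over every example before find_leakage, since the leakage check assumes well-formed messages. Content checks such as factual correctness and tone still need human review.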

Common Pitfalls

Garbage in, garbage out: This cliche is especially true for fine-tuning. Every error in your training data teaches the model to make that error. Spend 80% of your fine-tuning effort on data quality.
Training on model outputs: Fine-tuning GPT-4o-mini on outputs from GPT-4o-mini creates a feedback loop. Use human-written or expert-reviewed outputs as the gold standard.
Ignoring PII: If your training data contains personal information, the model may memorize and reproduce it. Scrub PII before training.
No held-out test set: If you evaluate on training data, you measure memorization, not generalization. Always hold out a test set that the model never sees during training.
Imbalanced data without addressing it: A model trained on 95% positive and 5% negative examples will almost always predict positive. Balance your categories or at least acknowledge the bias.
Quantity over quality: 500 expert-reviewed examples beat 50,000 noisy scraped examples. If you have to choose between more data and better data, choose better data every time.

Key Takeaways

Data quality is the most important factor in fine-tuning success. Invest 80% of your fine-tuning time in data preparation.
Use the instruction-response format: system prompt, user message, ideal assistant response. Multi-turn conversations work the same way with more messages.
Clean your data systematically: remove bad examples, deduplicate, balance categories, scrub PII. Validate with human review before training.
Synthetic data generation (using a stronger model to create training data) is a legitimate augmentation strategy, but always mix with real data and validate quality.
Always split into train/validation/test sets. Never evaluate on training data. Hold the test set sacred — use it only for final evaluation.