When to Fine-Tune

Overview

Fine-tuning is the process of taking a pre-trained model and training it further on your own data so it learns your specific patterns, style, or domain knowledge. It is a powerful technique — and it is almost never the first thing you should try.

The escalation ladder for AI quality improvement is: prompting, few-shot examples, RAG, then fine-tuning. Each step is more expensive and complex than the last. Most production AI features never need fine-tuning because the earlier steps are sufficient.

The Escalation Ladder

Step 1: Better prompts (cost: $0, time: minutes)
  → Be more specific, add constraints, define output format
  → Fixes 60% of quality issues

Step 2: Few-shot examples (cost: $0, time: hours)
  → Add 3-5 examples of ideal input-output pairs
  → Fixes 20% of remaining issues

Step 3: RAG (cost: low, time: days)
  → Retrieve relevant context from your documents
  → Fixes knowledge gaps and hallucination

Step 4: Fine-tuning (cost: medium-high, time: weeks)
  → Train on your specific data
  → Fixes style, format, and deep domain behavior

Step 5: Train from scratch (cost: very high, time: months)
  → You almost certainly don't need this
  → Only for fundamental capabilities no existing model has

# Step 1: Better prompting (try this first)
# Before: vague prompt
response = call_llm("Summarize this support ticket.")

# After: specific prompt with format requirements
response = call_llm("""Summarize this support ticket in exactly 3 lines:
Line 1: Customer issue (one sentence)
Line 2: What the customer has already tried
Line 3: Recommended next action

Use formal tone. Do not include the customer's personal information.""")

# If this works, you're done. No fine-tuning needed.

When Fine-Tuning Is the Right Choice

You Need a Specific Style or Format

When your output must consistently match a specific writing style, tone, or format that is hard to specify in a prompt.

Example: A legal document generator that must produce output in a
specific firm's house style — paragraph structure, citation format,
clause ordering, terminology. A 1000-word style guide in the system
prompt helps but doesn't capture the subtleties that 500 examples can.

Example: A medical note summarizer that must follow SOAP format
(Subjective, Objective, Assessment, Plan) with specific abbreviations
and section ordering used by a particular hospital system.

You Need Domain Knowledge Baked In

When the model needs to understand specialized terminology, relationships, or reasoning patterns that aren't well-represented in its training data.

# General model: struggles with domain-specific terminology
prompt = "What is the clinical significance of a widened QRS complex?"
# Generic response, may miss nuances a cardiologist needs

# Fine-tuned model: trained on cardiology literature and clinical notes
# Understands QRS complex in the context of specific conditions,
# drug effects, and clinical decision-making
# Gives more precise, clinically relevant answers

You Need to Reduce Latency or Cost

A smaller fine-tuned model can match or exceed the quality of a larger general model for your specific task. This means faster inference and lower cost.

Before fine-tuning:
  GPT-4o for customer email classification
  Latency: 800ms, Cost: $0.005/email
  Accuracy: 94%

After fine-tuning:
  GPT-4o-mini fine-tuned on 5,000 classified emails
  Latency: 200ms, Cost: $0.0005/email
  Accuracy: 96%

10x cost reduction, 4x latency reduction, 2% accuracy improvement.
This is the strongest business case for fine-tuning.

You Need Consistent Behavior at Scale

At high volume, even small inconsistencies in prompt-based output create problems. Fine-tuning produces more consistent behavior than prompting because the patterns are learned, not instructed.

Prompt-based classification (10,000 requests):
  98% follow the exact output format
  2% include extra explanation despite instructions
  0.5% return wrong format entirely
  → 50 failures per 10K requests requiring error handling

Fine-tuned classification (10,000 requests):
  99.8% follow the exact output format
  0.2% minor deviations
  → 20 failures per 10K requests

When NOT to Fine-Tune

You Just Need Up-to-Date Information

WRONG approach: Fine-tune the model on your latest product catalog
  → Model only knows products from training time
  → New products require retraining
  → Old products remain in model's weights even after discontinuation

RIGHT approach: Use RAG to retrieve current product information
  → Add/remove products by updating the document store
  → Changes are immediate, no retraining needed
  → Source attribution shows which product page was referenced

You Haven't Tried Prompting or Few-Shot

Common mistake:
  "The model isn't giving good results" → "Let's fine-tune!"
  
Better process:
  "The model isn't giving good results"
  → "Let's look at specific failures"
  → "These failures are all format-related"
  → "Add explicit format instructions to the prompt"
  → "Quality improved from 70% to 90%"
  → "Add 5 few-shot examples for the remaining edge cases"
  → "Quality improved to 95%"
  → "95% is good enough. Ship it."

You Don't Have Enough Quality Data

Fine-tuning amplifies what's in your data. If your data is noisy, biased, or small, your fine-tuned model will be noisy, biased, or overfit.

Data requirements (rough minimums):

  Task                         Minimum examples    Ideal
  ────────────────────────────────────────────────────────
  Classification               200                 2,000
  Extraction                   500                 5,000
  Style/format adaptation      500                 2,000
  Domain knowledge             1,000               10,000
  Open-ended generation        2,000               20,000
  
  Below the minimum: fine-tuning will likely overfit.
  Use few-shot prompting instead.

The Task Is Too General

Fine-tuning makes models better at specific tasks but can make them worse at general tasks. This is called catastrophic forgetting.

If your fine-tuned model needs to:
  - Handle many different task types
  - Work with diverse input formats
  - Generalize to cases not in training data
  
Then fine-tuning is risky. The model may excel at your training
distribution and fail on anything outside it.

Use prompting for general tasks. Use fine-tuning for narrow,
well-defined tasks.

The Decision Matrix

                          Use Prompting/    Use RAG          Use Fine-Tuning
                          Few-Shot
────────────────────────────────────────────────────────────────────────────
Knowledge is external     No                YES              No
Style/format matters      Sometimes         No               YES
Data changes frequently   N/A               YES              No
Need low latency          Sometimes         Sometimes        YES
Have < 500 examples       YES               N/A              No
Need to reduce cost       No                No               YES
Task is narrow            YES               Sometimes        YES
Task is general           YES               Sometimes        Risky
Need explainability       YES (visible      YES (visible     No (learned
                          in prompt)        documents)       behavior)

Real-World Decision Example

A company wants their customer support bot to respond in their brand voice: concise, warm but professional, always ending with a specific call-to-action format.

Attempt 1: Detailed system prompt describing the voice
  Result: 80% of responses match the desired tone
  Problem: Model sometimes reverts to generic assistant voice
  
Attempt 2: Add 5 few-shot examples of ideal responses
  Result: 90% match
  Problem: Some edge cases still don't match, especially for complaints
  
Attempt 3: Fine-tune on 2,000 real support responses in the correct voice
  Result: 97% match
  Problem: None significant

Decision: Fine-tuning was justified because:
  - Tone/style is hard to fully specify in a prompt
  - 2,000 high-quality examples were available from existing support logs
  - The 10% quality gap between few-shot and fine-tuning mattered for brand consistency
  - The model would handle thousands of conversations per day

Common Pitfalls

Fine-tuning as the first step: This is the most expensive mistake. Always try prompting and few-shot first. Many teams spend weeks preparing fine-tuning data only to discover that a better prompt solves the problem.
Fine-tuning for knowledge: Use RAG for knowledge grounding. Fine-tuning bakes knowledge into model weights where it cannot be easily updated, audited, or attributed.
Not evaluating before and after: Without baseline metrics from prompting, you cannot prove fine-tuning was worth the effort. Measure the prompt-based approach first, then measure the fine-tuned approach on the same test set.
Using noisy training data: 1,000 carefully reviewed examples outperform 50,000 scraped examples with errors. Quality over quantity applies double to fine-tuning data.
Catastrophic forgetting: Fine-tuning on a narrow dataset can degrade the model's general capabilities. Test not just your target task but also related tasks the model should still handle.
One-and-done training: Fine-tuned models need retraining as your domain evolves. Budget for ongoing data collection, evaluation, and retraining cycles.

Key Takeaways

Follow the escalation ladder: prompting, few-shot, RAG, then fine-tuning. Each step is more expensive. Most features stop at step 1 or 2.
Fine-tune when you need consistent style/format, domain-specific behavior, or cost/latency reduction through smaller models. These are the strongest use cases.
Do not fine-tune for knowledge (use RAG), for general tasks (use prompting), or when you have fewer than 500 quality examples (use few-shot).
Always measure the baseline (prompt-only performance) before investing in fine-tuning. If prompting gets you to 95% and fine-tuning gets you to 97%, the improvement may not justify the cost.
Fine-tuning is not a one-time investment. Plan for ongoing data collection, evaluation, and retraining as your domain evolves.