When to Fine-Tune
Overview
Fine-tuning is the process of taking a pre-trained model and training it further on your own data so it learns your specific patterns, style, or domain knowledge. It is a powerful technique — and it is almost never the first thing you should try.
The escalation ladder for AI quality improvement is: prompting, few-shot examples, RAG, then fine-tuning. Each step is more expensive and complex than the last. Most production AI features never need fine-tuning because the earlier steps are sufficient.
The Escalation Ladder
Step 1: Better prompts (cost: $0, time: minutes)
→ Be more specific, add constraints, define output format
→ Fixes 60% of quality issues
Step 2: Few-shot examples (cost: $0, time: hours)
→ Add 3-5 examples of ideal input-output pairs
→ Fixes 20% of remaining issues
Step 3: RAG (cost: low, time: days)
→ Retrieve relevant context from your documents
→ Fixes knowledge gaps and hallucination
Step 4: Fine-tuning (cost: medium-high, time: weeks)
→ Train on your specific data
→ Fixes style, format, and deep domain behavior
Step 5: Train from scratch (cost: very high, time: months)
→ You almost certainly don't need this
→ Only for fundamental capabilities no existing model has
# Step 1: Better prompting (try this first)
# Before: vague prompt
response = call_llm("Summarize this support ticket.")
# After: specific prompt with format requirements
response = call_llm("""Summarize this support ticket in exactly 3 lines:
Line 1: Customer issue (one sentence)
Line 2: What the customer has already tried
Line 3: Recommended next action
Use formal tone. Do not include the customer's personal information.""")
# If this works, you're done. No fine-tuning needed.
When Fine-Tuning Is the Right Choice
You Need a Specific Style or Format
When your output must consistently match a specific writing style, tone, or format that is hard to specify in a prompt.
Example: A legal document generator that must produce output in a
specific firm's house style — paragraph structure, citation format,
clause ordering, terminology. A 1000-word style guide in the system
prompt helps but doesn't capture the subtleties that 500 examples can.
Example: A medical note summarizer that must follow SOAP format
(Subjective, Objective, Assessment, Plan) with specific abbreviations
and section ordering used by a particular hospital system.
You Need Domain Knowledge Baked In
When the model needs to understand specialized terminology, relationships, or reasoning patterns that aren't well-represented in its training data.
# General model: struggles with domain-specific terminology
prompt = "What is the clinical significance of a widened QRS complex?"
# Generic response, may miss nuances a cardiologist needs
# Fine-tuned model: trained on cardiology literature and clinical notes
# Understands QRS complex in the context of specific conditions,
# drug effects, and clinical decision-making
# Gives more precise, clinically relevant answers
You Need to Reduce Latency or Cost
A smaller fine-tuned model can match or exceed the quality of a larger general model for your specific task. This means faster inference and lower cost.
Before fine-tuning:
GPT-4o for customer email classification
Latency: 800ms, Cost: $0.005/email
Accuracy: 94%
After fine-tuning:
GPT-4o-mini fine-tuned on 5,000 classified emails
Latency: 200ms, Cost: $0.0005/email
Accuracy: 96%
10x cost reduction, 4x latency reduction, 2% accuracy improvement.
This is the strongest business case for fine-tuning.
You Need Consistent Behavior at Scale
At high volume, even small inconsistencies in prompt-based output create problems. Fine-tuning produces more consistent behavior than prompting because the patterns are learned, not instructed.
Prompt-based classification (10,000 requests):
98% follow the exact output format
2% include extra explanation despite instructions
0.5% return wrong format entirely
→ 50 failures per 10K requests requiring error handling
Fine-tuned classification (10,000 requests):
99.8% follow the exact output format
0.2% minor deviations
→ 20 failures per 10K requests
When NOT to Fine-Tune
You Just Need Up-to-Date Information
WRONG approach: Fine-tune the model on your latest product catalog
→ Model only knows products from training time
→ New products require retraining
→ Old products remain in model's weights even after discontinuation
RIGHT approach: Use RAG to retrieve current product information
→ Add/remove products by updating the document store
→ Changes are immediate, no retraining needed
→ Source attribution shows which product page was referenced
You Haven't Tried Prompting or Few-Shot
Common mistake:
"The model isn't giving good results" → "Let's fine-tune!"
Better process:
"The model isn't giving good results"
→ "Let's look at specific failures"
→ "These failures are all format-related"
→ "Add explicit format instructions to the prompt"
→ "Quality improved from 70% to 90%"
→ "Add 5 few-shot examples for the remaining edge cases"
→ "Quality improved to 95%"
→ "95% is good enough. Ship it."
You Don't Have Enough Quality Data
Fine-tuning amplifies what's in your data. If your data is noisy, biased, or small, your fine-tuned model will be noisy, biased, or overfit.
Data requirements (rough minimums):
Task Minimum examples Ideal
────────────────────────────────────────────────────────
Classification 200 2,000
Extraction 500 5,000
Style/format adaptation 500 2,000
Domain knowledge 1,000 10,000
Open-ended generation 2,000 20,000
Below the minimum: fine-tuning will likely overfit.
Use few-shot prompting instead.
The Task Is Too General
Fine-tuning makes models better at specific tasks but can make them worse at general tasks. This is called catastrophic forgetting.
If your fine-tuned model needs to:
- Handle many different task types
- Work with diverse input formats
- Generalize to cases not in training data
Then fine-tuning is risky. The model may excel at your training
distribution and fail on anything outside it.
Use prompting for general tasks. Use fine-tuning for narrow,
well-defined tasks.
The Decision Matrix
Use Prompting/ Use RAG Use Fine-Tuning
Few-Shot
────────────────────────────────────────────────────────────────────────────
Knowledge is external No YES No
Style/format matters Sometimes No YES
Data changes frequently N/A YES No
Need low latency Sometimes Sometimes YES
Have < 500 examples YES N/A No
Need to reduce cost No No YES
Task is narrow YES Sometimes YES
Task is general YES Sometimes Risky
Need explainability YES (visible YES (visible No (learned
in prompt) documents) behavior)
Real-World Decision Example
A company wants their customer support bot to respond in their brand voice: concise, warm but professional, always ending with a specific call-to-action format.
Attempt 1: Detailed system prompt describing the voice
Result: 80% of responses match the desired tone
Problem: Model sometimes reverts to generic assistant voice
Attempt 2: Add 5 few-shot examples of ideal responses
Result: 90% match
Problem: Some edge cases still don't match, especially for complaints
Attempt 3: Fine-tune on 2,000 real support responses in the correct voice
Result: 97% match
Problem: None significant
Decision: Fine-tuning was justified because:
- Tone/style is hard to fully specify in a prompt
- 2,000 high-quality examples were available from existing support logs
- The 10% quality gap between few-shot and fine-tuning mattered for brand consistency
- The model would handle thousands of conversations per day
Common Pitfalls
- Fine-tuning as the first step: This is the most expensive mistake. Always try prompting and few-shot first. Many teams spend weeks preparing fine-tuning data only to discover that a better prompt solves the problem.
- Fine-tuning for knowledge: Use RAG for knowledge grounding. Fine-tuning bakes knowledge into model weights where it cannot be easily updated, audited, or attributed.
- Not evaluating before and after: Without baseline metrics from prompting, you cannot prove fine-tuning was worth the effort. Measure the prompt-based approach first, then measure the fine-tuned approach on the same test set.
- Using noisy training data: 1,000 carefully reviewed examples outperform 50,000 scraped examples with errors. Quality over quantity applies double to fine-tuning data.
- Catastrophic forgetting: Fine-tuning on a narrow dataset can degrade the model's general capabilities. Test not just your target task but also related tasks the model should still handle.
- One-and-done training: Fine-tuned models need retraining as your domain evolves. Budget for ongoing data collection, evaluation, and retraining cycles.
Key Takeaways
- Follow the escalation ladder: prompting, few-shot, RAG, then fine-tuning. Each step is more expensive. Most features stop at step 1 or 2.
- Fine-tune when you need consistent style/format, domain-specific behavior, or cost/latency reduction through smaller models. These are the strongest use cases.
- Do not fine-tune for knowledge (use RAG), for general tasks (use prompting), or when you have fewer than 500 quality examples (use few-shot).
- Always measure the baseline (prompt-only performance) before investing in fine-tuning. If prompting gets you to 95% and fine-tuning gets you to 97%, the improvement may not justify the cost.
- Fine-tuning is not a one-time investment. Plan for ongoing data collection, evaluation, and retraining as your domain evolves.