Fine-Tuning Techniques
Overview
There are multiple ways to fine-tune a language model, ranging from full fine-tuning (updating every parameter) to parameter-efficient methods like LoRA that train a tiny fraction of the model. The right choice depends on your hardware, budget, data size, and performance requirements.
For most practitioners, the decision is simple: use LoRA. Full fine-tuning requires more memory than most teams have available, and LoRA achieves comparable quality for the vast majority of use cases.
Full Fine-Tuning
Full fine-tuning updates every parameter in the model. This gives the model maximum flexibility to adapt to your data but requires significant compute resources.
Full fine-tuning requirements:
Model Size GPU Memory (training) Typical Hardware
──────────────────────────────────────────────────────────
7B ~60 GB 2x A100 80GB
13B ~120 GB 4x A100 80GB
70B ~600 GB 8x A100 80GB
These numbers assume mixed precision (bf16) training
with a reasonable batch size. Actual requirements vary
with batch size, sequence length, and optimizer.
When to Use Full Fine-Tuning
- You need maximum quality and have the compute budget
- Your task is significantly different from the base model's training distribution
- You have a very large dataset (50K+ examples)
- You are a research lab or large company with dedicated GPU infrastructure
When to Skip Full Fine-Tuning
- You have limited GPU resources (most teams)
- Your dataset is small (under 10K examples)
- Your task is close to what the base model already does well
- You need to iterate quickly (full fine-tuning is slow)
LoRA (Low-Rank Adaptation)
LoRA is the standard parameter-efficient fine-tuning method. Instead of updating all model parameters, LoRA freezes the original weights and trains small adapter matrices that modify the model's behavior.
How LoRA Works
Original model weight matrix W: [4096 x 4096] = 16M parameters
LoRA decomposition:
W' = W + (A x B)
where A: [4096 x 16] = 65K parameters
B: [16 x 4096] = 65K parameters
Total trainable: 130K parameters (0.8% of the original)
The rank (16 in this example) controls the adapter's capacity.
Higher rank = more parameters = more capacity = more memory.
# Fine-tuning with LoRA using Hugging Face PEFT
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer
# Load base model
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
torch_dtype=torch.bfloat16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# Configure LoRA
lora_config = LoraConfig(
r=16, # Rank: controls adapter size
lora_alpha=32, # Scaling factor (typically 2x rank)
target_modules=[ # Which layers to adapt
"q_proj", "k_proj", # Attention query and key
"v_proj", "o_proj", # Attention value and output
"gate_proj", "up_proj", # MLP layers
"down_proj"
],
lora_dropout=0.05, # Regularization
task_type=TaskType.CAUSAL_LM
)
# Apply LoRA to the model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 13.6M || all params: 8.0B || trainable%: 0.17%
LoRA Hyperparameters
Parameter What it controls Recommended range
────────────────────────────────────────────────────────────────────
r (rank) Adapter capacity 8-64
Higher = more capacity, Start at 16
more memory, slower
lora_alpha Scaling factor Usually 2x rank
Controls how much the (if r=16, alpha=32)
adapter affects output
target_modules Which layers get adapters At minimum: q_proj, v_proj
Better: all attention + MLP
lora_dropout Regularization 0.05-0.1
Prevents overfitting Higher for small datasets
LoRA memory requirements (approximate):
Model Size GPU Memory (LoRA) Typical Hardware
──────────────────────────────────────────────────────
7B ~16 GB 1x A100 or 1x RTX 4090
13B ~28 GB 1x A100 80GB
70B ~80 GB 2x A100 80GB
LoRA reduces memory by ~4x compared to full fine-tuning.
QLoRA (Quantized LoRA)
QLoRA goes further: it loads the base model in 4-bit quantized precision and trains LoRA adapters on top. This cuts memory requirements by another 2-3x.
from transformers import BitsAndBytesConfig
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # Normalized float 4-bit
bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bf16 for speed
bnb_4bit_use_double_quant=True # Quantize the quantization constants
)
# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
quantization_config=bnb_config,
device_map="auto"
)
# Apply LoRA on top of the quantized model
model = get_peft_model(model, lora_config)
QLoRA memory requirements:
Model Size GPU Memory (QLoRA) Typical Hardware
──────────────────────────────────────────────────────
7B ~6 GB 1x RTX 3090/4090
13B ~12 GB 1x RTX 4090
70B ~36 GB 1x A100 80GB
QLoRA makes 70B model fine-tuning possible on a single GPU.
LoRA vs QLoRA Quality
Quality comparison (typical results):
Method Quality vs Full Memory Training Speed
────────────────────────────────────────────────────────────────
Full fine-tune 100% (baseline) Very high Slowest
LoRA 95-99% Medium Fast
QLoRA 93-98% Low Medium
The quality gap is small for most practical tasks.
QLoRA is the best choice when GPU memory is limited.
LoRA is the default choice when you have enough memory.
Training Hyperparameters
Key training arguments:
learning_rate: 2e-4 for LoRA, 2e-5 for full fine-tuning
num_train_epochs: 3 (start here, adjust based on val loss)
batch_size: 4-16 effective (use gradient accumulation)
weight_decay: 0.01
warmup_ratio: 0.03
lr_scheduler: cosine
precision: bf16 on A100/H100
Hyperparameter Guidelines
Parameter Too low Sweet spot Too high
──────────────────────────────────────────────────────────────────────────────
Learning rate Model barely changes Steady improvement Loss spikes,
from base in validation loss divergence
Epochs Underfitting: model Validation loss Overfitting:
hasn't learned yet plateaus memorization
Batch size Noisy gradients, Stable training Requires too
slow convergence much memory
LoRA rank Not enough capacity Good quality with Diminishing
to learn task reasonable memory returns
Training Process
Use Hugging Face's SFTTrainer (from the trl library) for the training loop. Load your dataset from JSONL, format examples using the tokenizer's chat template, and enable packing=True to pack short examples together for efficiency. Save the adapter after training with trainer.save_model().
Evaluation
Loss curves tell you if training is progressing, but they do not tell you if the model is actually better at your task.
Loss Monitoring
Healthy training:
- Training loss decreases steadily
- Validation loss decreases, then plateaus
- Gap between train and val loss is small
Overfitting:
- Training loss keeps decreasing
- Validation loss starts INCREASING
→ Stop training. Use the checkpoint before val loss increased.
Underfitting:
- Both losses are high and not decreasing
→ Increase learning rate, add more epochs, or increase LoRA rank.
Divergence:
- Loss spikes to very high values
→ Learning rate is too high. Reduce by 2-5x and restart.
Task-Specific Evaluation
Run your fine-tuned model on the held-out test set with real task metrics (accuracy, F1, human preference), not just loss. Compare against these baselines:
Always compare fine-tuned model against:
1. Base model with zero-shot prompt
2. Base model with few-shot prompt
3. Base model with best prompt engineering
4. Previous version of fine-tuned model (if iterating)
If the fine-tuned model is not significantly better than #3,
fine-tuning was not worth it. Go back to prompt engineering.
When Your Fine-Tuned Model Is Worse
This happens more often than people admit. Common causes:
- Overfitting: Works on training-like inputs, fails on novel ones. Fix: more diverse data, fewer epochs, higher dropout.
- Catastrophic forgetting: Loses general capabilities. Fix: lower learning rate, mix in 10-20% general-purpose examples.
- Bad training data: Outputs match the format but are factually wrong. Fix: audit data quality.
- Imbalanced data: Better on average but worse on minority categories. Fix: balance or oversample.
API-Based Fine-Tuning
For proprietary models, providers offer fine-tuning APIs (OpenAI, Google). Upload a JSONL file, specify hyperparameters, and the provider handles training infrastructure. The fine-tuned model is available as a new model ID.
API fine-tuning costs (approximate):
Model Training cost Inference cost
───────────────────────────────────────────────────────
GPT-4o-mini $3/1M tokens 2x base rate
GPT-4o $25/1M tokens 2x base rate
A 1,000-example dataset with 500 tokens/example:
500K training tokens x 3 epochs = 1.5M tokens
GPT-4o-mini: ~$4.50 total training cost
Common Pitfalls
- Not comparing against prompting baselines: If you don't measure prompt-only performance first, you cannot prove fine-tuning was worth the effort. Always establish a baseline.
- Training for too many epochs: More epochs does not always mean better. Watch validation loss and stop when it plateaus or increases. Three epochs is a good starting point.
- Using too high a learning rate: This is the most common training failure. If loss spikes or the model produces garbage, reduce the learning rate by 2-5x.
- Ignoring catastrophic forgetting: Test your fine-tuned model on general tasks, not just your specific task. If it lost the ability to follow basic instructions, your learning rate or epoch count is too high.
- Evaluating only on loss: Low loss does not mean the model is good at your task. Always evaluate on real task examples with task-specific metrics.
- Not versioning models and data: Track which data version produced which model. When a fine-tuned model behaves unexpectedly, you need to trace back to the training data.
Key Takeaways
- LoRA is the default fine-tuning technique for most practitioners. It achieves 95-99% of full fine-tuning quality at a fraction of the memory cost. Use QLoRA when GPU memory is very limited.
- The most important hyperparameters are learning rate and number of epochs. Start with learning rate 2e-4 (LoRA) or 2e-5 (full), 3 epochs, and adjust based on validation loss.
- Evaluate on real tasks, not just loss. A model with low loss can still produce wrong answers. Compare against prompting baselines to prove fine-tuning adds value.
- When your fine-tuned model is worse than the base model, the problem is usually data quality, overfitting, or catastrophic forgetting. Diagnose systematically before adding more training.
- API-based fine-tuning (OpenAI, etc.) is the easiest path. Self-hosted fine-tuning with LoRA/QLoRA gives more control but requires GPU infrastructure and ML engineering expertise.