Fine-Tuning Techniques

Overview

There are multiple ways to fine-tune a language model, ranging from full fine-tuning (updating every parameter) to parameter-efficient methods like LoRA that train a tiny fraction of the model. The right choice depends on your hardware, budget, data size, and performance requirements.

For most practitioners, the decision is simple: use LoRA. Full fine-tuning requires more memory than most teams have available, and LoRA achieves comparable quality for the vast majority of use cases.

Full Fine-Tuning

Full fine-tuning updates every parameter in the model. This gives the model maximum flexibility to adapt to your data but requires significant compute resources.

Full fine-tuning requirements:

  Model Size     GPU Memory (training)    Typical Hardware
  ──────────────────────────────────────────────────────────
  7B             ~60 GB                   2x A100 80GB
  13B            ~120 GB                  4x A100 80GB
  70B            ~600 GB                  8x A100 80GB
  
  These numbers assume mixed precision (bf16) training
  with a reasonable batch size. Actual requirements vary
  with batch size, sequence length, and optimizer.

When to Use Full Fine-Tuning

You need maximum quality and have the compute budget
Your task is significantly different from the base model's training distribution
You have a very large dataset (50K+ examples)
You are a research lab or large company with dedicated GPU infrastructure

When to Skip Full Fine-Tuning

You have limited GPU resources (most teams)
Your dataset is small (under 10K examples)
Your task is close to what the base model already does well
You need to iterate quickly (full fine-tuning is slow)

LoRA (Low-Rank Adaptation)

LoRA is the standard parameter-efficient fine-tuning method. Instead of updating all model parameters, LoRA freezes the original weights and trains small adapter matrices that modify the model's behavior.

How LoRA Works

Original model weight matrix W: [4096 x 4096] = 16M parameters

LoRA decomposition:
  W' = W + (A x B)
  where A: [4096 x 16] = 65K parameters
        B: [16 x 4096] = 65K parameters
  
  Total trainable: 130K parameters (0.8% of the original)
  
  The rank (16 in this example) controls the adapter's capacity.
  Higher rank = more parameters = more capacity = more memory.

# Fine-tuning with LoRA using Hugging Face PEFT
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Configure LoRA
lora_config = LoraConfig(
    r=16,                       # Rank: controls adapter size
    lora_alpha=32,              # Scaling factor (typically 2x rank)
    target_modules=[            # Which layers to adapt
        "q_proj", "k_proj",     # Attention query and key
        "v_proj", "o_proj",     # Attention value and output
        "gate_proj", "up_proj", # MLP layers
        "down_proj"
    ],
    lora_dropout=0.05,          # Regularization
    task_type=TaskType.CAUSAL_LM
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 13.6M || all params: 8.0B || trainable%: 0.17%

LoRA Hyperparameters

Parameter        What it controls              Recommended range
────────────────────────────────────────────────────────────────────
r (rank)         Adapter capacity              8-64
                 Higher = more capacity,        Start at 16
                 more memory, slower
                 
lora_alpha       Scaling factor                Usually 2x rank
                 Controls how much the          (if r=16, alpha=32)
                 adapter affects output
                 
target_modules   Which layers get adapters     At minimum: q_proj, v_proj
                                               Better: all attention + MLP
                                               
lora_dropout     Regularization                0.05-0.1
                 Prevents overfitting           Higher for small datasets

LoRA memory requirements (approximate):

  Model Size     GPU Memory (LoRA)    Typical Hardware
  ──────────────────────────────────────────────────────
  7B             ~16 GB               1x A100 or 1x RTX 4090
  13B            ~28 GB               1x A100 80GB
  70B            ~80 GB               2x A100 80GB
  
  LoRA reduces memory by ~4x compared to full fine-tuning.

QLoRA (Quantized LoRA)

QLoRA goes further: it loads the base model in 4-bit quantized precision and trains LoRA adapters on top. This cuts memory requirements by another 2-3x.

from transformers import BitsAndBytesConfig

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # Normalized float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bf16 for speed
    bnb_4bit_use_double_quant=True         # Quantize the quantization constants
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto"
)

# Apply LoRA on top of the quantized model
model = get_peft_model(model, lora_config)

QLoRA memory requirements:

  Model Size     GPU Memory (QLoRA)   Typical Hardware
  ──────────────────────────────────────────────────────
  7B             ~6 GB                1x RTX 3090/4090
  13B            ~12 GB               1x RTX 4090
  70B            ~36 GB               1x A100 80GB
  
  QLoRA makes 70B model fine-tuning possible on a single GPU.

LoRA vs QLoRA Quality

Quality comparison (typical results):

  Method          Quality vs Full     Memory     Training Speed
  ────────────────────────────────────────────────────────────────
  Full fine-tune  100% (baseline)     Very high  Slowest
  LoRA            95-99%              Medium     Fast
  QLoRA           93-98%              Low        Medium
  
  The quality gap is small for most practical tasks.
  QLoRA is the best choice when GPU memory is limited.
  LoRA is the default choice when you have enough memory.

Training Hyperparameters

Key training arguments:
  learning_rate:     2e-4 for LoRA, 2e-5 for full fine-tuning
  num_train_epochs:  3 (start here, adjust based on val loss)
  batch_size:        4-16 effective (use gradient accumulation)
  weight_decay:      0.01
  warmup_ratio:      0.03
  lr_scheduler:      cosine
  precision:         bf16 on A100/H100

Hyperparameter Guidelines

Parameter          Too low                 Sweet spot              Too high
──────────────────────────────────────────────────────────────────────────────
Learning rate      Model barely changes    Steady improvement      Loss spikes,
                   from base               in validation loss      divergence
                   
Epochs             Underfitting: model     Validation loss         Overfitting:
                   hasn't learned yet      plateaus                memorization
                   
Batch size         Noisy gradients,        Stable training         Requires too
                   slow convergence                                much memory
                   
LoRA rank          Not enough capacity     Good quality with       Diminishing
                   to learn task           reasonable memory       returns

Training Process

Use Hugging Face's SFTTrainer (from the trl library) for the training loop. Load your dataset from JSONL, format examples using the tokenizer's chat template, and enable packing=True to pack short examples together for efficiency. Save the adapter after training with trainer.save_model().

Evaluation

Loss curves tell you if training is progressing, but they do not tell you if the model is actually better at your task.

Loss Monitoring

Healthy training:
  - Training loss decreases steadily
  - Validation loss decreases, then plateaus
  - Gap between train and val loss is small

Overfitting:
  - Training loss keeps decreasing
  - Validation loss starts INCREASING
  → Stop training. Use the checkpoint before val loss increased.

Underfitting:
  - Both losses are high and not decreasing
  → Increase learning rate, add more epochs, or increase LoRA rank.

Divergence:
  - Loss spikes to very high values
  → Learning rate is too high. Reduce by 2-5x and restart.

Task-Specific Evaluation

Run your fine-tuned model on the held-out test set with real task metrics (accuracy, F1, human preference), not just loss. Compare against these baselines:

Always compare fine-tuned model against:

  1. Base model with zero-shot prompt
  2. Base model with few-shot prompt
  3. Base model with best prompt engineering
  4. Previous version of fine-tuned model (if iterating)
  
If the fine-tuned model is not significantly better than #3,
fine-tuning was not worth it. Go back to prompt engineering.

When Your Fine-Tuned Model Is Worse

This happens more often than people admit. Common causes:

Overfitting: Works on training-like inputs, fails on novel ones. Fix: more diverse data, fewer epochs, higher dropout.
Catastrophic forgetting: Loses general capabilities. Fix: lower learning rate, mix in 10-20% general-purpose examples.
Bad training data: Outputs match the format but are factually wrong. Fix: audit data quality.
Imbalanced data: Better on average but worse on minority categories. Fix: balance or oversample.

API-Based Fine-Tuning

For proprietary models, providers offer fine-tuning APIs (OpenAI, Google). Upload a JSONL file, specify hyperparameters, and the provider handles training infrastructure. The fine-tuned model is available as a new model ID.

API fine-tuning costs (approximate):

  Model             Training cost        Inference cost
  ───────────────────────────────────────────────────────
  GPT-4o-mini       $3/1M tokens         2x base rate
  GPT-4o            $25/1M tokens        2x base rate
  
  A 1,000-example dataset with 500 tokens/example:
  500K training tokens x 3 epochs = 1.5M tokens
  GPT-4o-mini: ~$4.50 total training cost

Common Pitfalls

Not comparing against prompting baselines: If you don't measure prompt-only performance first, you cannot prove fine-tuning was worth the effort. Always establish a baseline.
Training for too many epochs: More epochs does not always mean better. Watch validation loss and stop when it plateaus or increases. Three epochs is a good starting point.
Using too high a learning rate: This is the most common training failure. If loss spikes or the model produces garbage, reduce the learning rate by 2-5x.
Ignoring catastrophic forgetting: Test your fine-tuned model on general tasks, not just your specific task. If it lost the ability to follow basic instructions, your learning rate or epoch count is too high.
Evaluating only on loss: Low loss does not mean the model is good at your task. Always evaluate on real task examples with task-specific metrics.
Not versioning models and data: Track which data version produced which model. When a fine-tuned model behaves unexpectedly, you need to trace back to the training data.

Key Takeaways

LoRA is the default fine-tuning technique for most practitioners. It achieves 95-99% of full fine-tuning quality at a fraction of the memory cost. Use QLoRA when GPU memory is very limited.
The most important hyperparameters are learning rate and number of epochs. Start with learning rate 2e-4 (LoRA) or 2e-5 (full), 3 epochs, and adjust based on validation loss.
Evaluate on real tasks, not just loss. A model with low loss can still produce wrong answers. Compare against prompting baselines to prove fine-tuning adds value.
When your fine-tuned model is worse than the base model, the problem is usually data quality, overfitting, or catastrophic forgetting. Diagnose systematically before adding more training.
API-based fine-tuning (OpenAI, etc.) is the easiest path. Self-hosted fine-tuning with LoRA/QLoRA gives more control but requires GPU infrastructure and ML engineering expertise.