Large Language Models

Overview

Large language models (LLMs) are transformer-based models trained on massive text corpora that exhibit broad capabilities across language understanding and generation tasks. This document covers the full LLM lifecycle: pre-training, fine-tuning, alignment, inference optimization, evaluation, prompt engineering, and agentic use.


Pre-training

Data

LLM training data is drawn from diverse sources.

| Source | Examples | Considerations |
| --- | --- | --- |
| Web crawl | Common Crawl, C4, RefinedWeb | Requires aggressive deduplication and quality filtering |
| Books | Books3, Project Gutenberg | Long-form, high quality |
| Code | GitHub, The Stack | Improves reasoning and code generation |
| Scientific | Semantic Scholar, arXiv | Domain knowledge |
| Curated | Wikipedia, StackExchange | High quality, smaller scale |

Data processing pipeline:

  1. Language identification and filtering
  2. Deduplication (MinHash, exact substring)
  3. Quality filtering (perplexity, classifier-based)
  4. PII removal and content filtering
  5. Tokenization
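
The MinHash step in the pipeline can be illustrated with a small sketch. It assumes word 3-gram shingles, 64 salted MD5 hashes, and toy documents; production pipelines typically add locality-sensitive hashing on top so that not every pair of documents has to be compared. The function names are illustrative and do not come from any particular library.

```python
import hashlib

def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(shingle_set, num_hashes=64):
    """For each salted hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_hashes):
        salt = str(seed).encode()
        signature.append(min(
            int.from_bytes(hashlib.md5(salt + s.encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Toy documents (assumed for illustration): a and b differ only in the last word.
doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "large language models are trained on massive text corpora"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
near_dup = estimated_jaccard(sig_a, sig_b)   # high: candidate near-duplicate
unrelated = estimated_jaccard(sig_a, sig_c)  # no shared shingles
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates and collapsed to a single copy.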

Scale: LLaMA 2 trained on 2T tokens; LLaMA 3 on 15T tokens.

Training Infrastructure

| Component | Details |
| --- | --- |
| Hardware | Clusters of thousands of GPUs (A100, H100) or TPUs |
| Interconnect | NVLink, InfiniBand for high-bandwidth inter-node communication |
| Frameworks | Megatron-LM, DeepSpeed, JAX/XLA, PyTorch FSDP |
| Cost | GPT-4 estimated at $100M+; LLaMA 3 405B ~30M GPU-hours |

Parallelism Strategies

| Strategy | Splits | Communication |
| --- | --- | --- |
| Data parallelism (DP) | Data batches across replicas | All-reduce gradients |
| Tensor parallelism (TP) | Individual layers (attention heads, FFN) | All-reduce within layer |
| Pipeline parallelism (PP) | Layer groups across stages | Forward/backward between stages |
| Expert parallelism (EP) | MoE experts across devices | All-to-all routing |
| Sequence parallelism (SP) | Long sequences across devices | Communication at attention |
| ZeRO / FSDP | Optimizer states, gradients, params | All-gather when needed |

Typical setup: TP within a node (8 GPUs), PP across nodes, DP across node groups.
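
As a rough illustration of how the three degrees combine, the sketch below derives the data-parallel degree from a hypothetical 1024-GPU cluster and maps a global rank to its position in each group. The rank ordering (TP fastest, then PP, then DP) is an assumption chosen for this example; real frameworks define their own orderings.

```python
def parallel_layout(world_size, tp, pp):
    """Derive the data-parallel degree from the GPU count and the TP/PP degrees."""
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    return {"tp": tp, "pp": pp, "dp": world_size // (tp * pp)}

def rank_to_coords(rank, tp, pp):
    """Map a global rank to (dp, pp, tp) indices.

    Assumed ordering: TP varies fastest, so one TP group occupies
    consecutive ranks (for tp=8, the 8 GPUs of a single node).
    """
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx

# Hypothetical cluster: 1024 GPUs, TP=8 within a node, PP=4 across nodes.
layout = parallel_layout(world_size=1024, tp=8, pp=4)
```

Here each model replica spans 32 GPUs (8 x 4), and the remaining factor becomes the number of data-parallel replicas.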

Training Stability

  • Mixed precision: BF16 forward/backward, FP32 master weights and optimizer states
  • Gradient clipping (typically 1.0)
  • Learning rate warmup + cosine decay
  • Loss spikes: roll back to an earlier checkpoint, investigate or skip the offending data, reduce the learning rate
  • Checkpointing every N steps for recovery
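
Two of the items above, warmup with cosine decay and gradient clipping, are simple enough to sketch directly. The hyperparameter values below (peak and minimum learning rate, warmup length, total steps) are placeholders for illustration, not recommendations, and gradients are represented as a flat list of floats rather than framework tensors.

```python
import math

def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=2000, total_steps=100_000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + (max_lr - min_lr) * cosine

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so their global L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0])    # norm 5.0 is scaled down to about 1.0
unchanged = clip_by_global_norm([0.3, 0.4])  # norm 0.5 is left as is
```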

Fine-tuning

Full Fine-tuning

Update all model parameters on task-specific data. Effective but expensive for large models and risks catastrophic forgetting.

Parameter-Efficient Fine-tuning (PEFT)

Train a small number of parameters while freezing the base model.

LoRA (Low-Rank Adaptation):

  • Decompose weight updates as delta_W = A * B where A is d x r and B is r x d
  • Rank r is typically 8-64 (vs d = 4096-8192)
  • Reduces trainable parameters by 100-1000x
  • Applied to attention projection matrices (Q, K, V, O)
  • Merged into base weights at inference (no latency cost)
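
The parameter savings and the merge step can be checked with a little arithmetic. The sketch below follows the notation above (delta_W = A * B with A of shape d x r and B of shape r x d) and uses plain Python lists for a toy 3 x 3 example; the `scale` argument is an assumed stand-in for the scaling factor that LoRA implementations usually apply to the update.

```python
def lora_param_counts(d, r):
    """Trainable parameters for a dense d x d update vs. a rank-r LoRA update."""
    full = d * d           # dense delta_W
    lora = d * r + r * d   # A (d x r) plus B (r x d)
    return full, lora

def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W, A, B, scale=1.0):
    """Fold the low-rank update into the base weights: W + scale * (A @ B)."""
    delta = matmul(A, B)
    return [[w + scale * dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

full, lora = lora_param_counts(d=4096, r=8)
reduction = full / lora  # 16,777,216 / 65,536 = 256x fewer trainable parameters

# Toy rank-1 update on a 3 x 3 identity matrix.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[1.0], [2.0], [3.0]]   # 3 x 1
B = [[0.5, 0.0, -0.5]]      # 1 x 3
merged = merge_lora(W, A, B)
```

Because the merged matrix has the same shape as W, inference uses a single matrix multiply per layer, which is why merging adds no latency.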

QLoRA:

  • Quantize base model to 4-bit (NF4 quantization)
  • Apply LoRA adapters in BF16
  • Enables fine-tuning 65B models on a single 48GB GPU
  • Double quantization: quantize the quantization constants

Adapters:

  • Insert small bottleneck layers (down-project, nonlinearity, up-project) between transformer layers
  • Train only adapter parameters; freeze everything else
  • Modular: swap adapters for different tasks

Prefix Tuning / Prompt Tuning:

  • Learn continuous vectors prepended to the input
  • Prefix tuning: learned vectors at every layer
  • Prompt tuning: learned vectors only at the input layer
  • Extremely parameter-efficient but less expressive

Instruction Tuning

Fine-tune on diverse (instruction, response) pairs to follow arbitrary instructions.

  • Datasets: FLAN, OpenAssistant, ShareGPT, Alpaca
  • Dramatically improves zero-shot task performance
  • Self-instruct: use an LLM to generate training instructions
  • Key models: FLAN-T5, Alpaca, Vicuna, Mistral-Instruct

Alignment

Training LLMs to be helpful, harmless, and honest.

RLHF (Reinforcement Learning from Human Feedback)

  1. Supervised fine-tuning (SFT): Fine-tune on high-quality demonstrations
  2. Reward model training: Train a model to predict human preferences between response pairs
  3. RL optimization: Use PPO to optimize the policy (LLM) against the reward model, with a KL penalty to stay close to the SFT model

Challenges: Reward hacking, reward model quality, instability of PPO training.
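
Steps 2 and 3 each reduce to a short formula. The sketch below assumes a Bradley-Terry style pairwise loss for the reward model and a per-sample log-ratio as the KL estimate, which is a common formulation but not the only one; `beta` and the scalar inputs are illustrative values.

```python
import math

def pairwise_reward_loss(r_chosen, r_rejected):
    """Reward-model loss for one preference pair: -log sigmoid(r_chosen - r_rejected)."""
    return math.log1p(math.exp(-(r_chosen - r_rejected)))

def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Reward used in the RL step: reward model score minus a KL penalty.

    logp_policy - logp_ref is a single-sample estimate of the KL divergence
    between the current policy and the SFT reference.
    """
    return reward - beta * (logp_policy - logp_ref)

correct_order = pairwise_reward_loss(2.0, 0.0)  # chosen scored higher: low loss
wrong_order = pairwise_reward_loss(0.0, 2.0)    # chosen scored lower: high loss
shaped = kl_penalized_reward(1.0, logp_policy=-2.0, logp_ref=-3.0)
```

The penalty grows as the policy assigns much higher probability to a response than the reference does, which discourages drifting toward outputs that merely exploit the reward model.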

DPO (Direct Preference Optimization)

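Eliminates the explicit reward model: the KL-constrained RLHF objective has a closed-form optimal policy, which turns preference learning into a classification-style loss trained directly on preference pairs.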

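L_DPO = -log sigma(beta * (log[pi(y_w|x) / pi_ref(y_w|x)] - log[pi(y_l|x) / pi_ref(y_l|x)]))

  • y_w: preferred response, y_l: dispreferred response; pi is the policy being trained, pi_ref the frozen reference model (usually the SFT model), and beta controls how far pi may drift from pi_ref
  • Simpler and more stable than PPO-based RLHF
  • No separate reward model needed
  • Widely adopted (LLaMA 3, Zephyr, many open models)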
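
For a single preference pair, the loss above can be computed from four sequence log-probabilities. This is a minimal sketch assuming those log-probabilities have already been summed over tokens; the numeric inputs are made up for illustration.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, dispreferred) pair.

    logp_*     : log pi(y|x) under the policy being trained
    ref_logp_* : log pi_ref(y|x) under the frozen reference model
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # equals -log sigmoid(margin)

# Policy identical to the reference: margin is 0, loss is log(2).
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the preferred response: lower loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```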

Constitutional AI (CAI)

  1. Generate responses, then self-critique against a set of principles (constitution)
  2. Revise responses based on critiques
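  3. Fine-tune on the revised responses, then have the model compare pairs of responses against the principles and use those AI-generated preferences for RLHF/DPO

  • Reduces reliance on human annotation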
  • Principles are explicit and auditable

Other Alignment Methods

| Method | Key Idea |
| --- | --- |
| RLAIF | AI-generated feedback instead of human |
| KTO | Kahneman-Tversky optimization; binary signal (good/bad) |
| IPO | Identity preference optimization; addresses DPO overfitting |
| ORPO | Odds ratio preference optimization; no reference model needed |

Inference Optimization

Quantization

Reduce model precision to lower memory and compute requirements.

| Precision | Bits | Memory (70B model) | Quality Impact |
| --- | --- | --- | --- |
| FP32 | 32 | ~280 GB | Baseline |
| FP16/BF16 | 16 | ~140 GB | Negligible |
| INT8 | 8 | ~70 GB | Minimal |
| INT4 (GPTQ, AWQ) | 4 | ~35 GB | Small |
| GGUF Q4_K_M | ~4.5 | ~40 GB | Small |

  • GPTQ: Post-training quantization using approximate second-order information
  • AWQ: Activation-aware quantization; protects salient weights
  • GGUF (llama.cpp): CPU-friendly quantization with mixed precision per layer
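
The memory column follows from parameters x bits / 8. The sketch below reproduces that arithmetic and adds the simplest possible quantizer, symmetric absmax round-to-nearest INT8, purely to show what "reducing precision" means. It is not GPTQ, AWQ, or the GGUF scheme, all of which are considerably more involved.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

def quantize_absmax_int8(weights):
    """Symmetric round-to-nearest quantization to the range [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate float weights."""
    return [v * scale for v in q]

fp32_gb = weight_memory_gb(70e9, 32)   # 280.0
int4_gb = weight_memory_gb(70e9, 4)    # 35.0

weights = [0.5, -1.0, 0.25, 0.0]       # toy values for illustration
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
```

These figures cover weights only; activations and the KV cache add to the total at inference time.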

KV Cache Optimization

Autoregressive generation caches key-value pairs from previous tokens to avoid recomputation.

  • Memory scales as O(batch_size * seq_len * num_layers * hidden_dim)
  • Multi-Query Attention (MQA): All query heads share a single KV head (reduces KV cache by a factor of num_heads)
  • Grouped-Query Attention (GQA): Groups of query heads share one KV head each, a middle ground between full multi-head attention and MQA (LLaMA 2 70B, Mistral 7B)
  • PagedAttention (vLLM): Manage KV cache memory like virtual memory pages; eliminates fragmentation
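
To make the scaling concrete, the sketch below computes KV cache size for the three attention variants. The configuration (80 layers, 64 query heads, head dimension 128, 4096-token context, 2 bytes per value) is an assumed 70B-class shape chosen for illustration, not the published specification of any particular model.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # The leading factor of 2 accounts for storing both keys and values.
    return (2 * batch_size * seq_len * num_layers
            * num_kv_heads * head_dim * bytes_per_value)

# Assumed configuration for illustration (FP16/BF16 cache, single sequence).
cfg = dict(batch_size=1, seq_len=4096, num_layers=80, head_dim=128)

mha = kv_cache_bytes(num_kv_heads=64, **cfg)  # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **cfg)   # 8 groups of 8 query heads
mqa = kv_cache_bytes(num_kv_heads=1, **cfg)   # a single shared KV head

mha_gib = mha / 2**30
```

Under these assumptions a single 4096-token sequence needs 10 GiB with full multi-head attention, one eighth of that with 8 KV groups, and one sixty-fourth with MQA.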

Speculative Decoding

Use a small draft model to generate candidate tokens, then verify in parallel with the large model.

  1. Draft model generates k tokens autoregressively (fast)
  2. Large model scores all k tokens in a single forward pass (parallel)
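  3. Accept each drafted token with probability min(1, p_large / p_draft); at the first rejection, resample that position from the large model's adjusted distribution and discard the remaining drafts

  • With this acceptance rule, the output distribution is identical to sampling from the large model alone
  • Typical speedup is 2-3x, depending on the acceptance rate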
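
The verification step can be sketched with toy probability tables in place of real models. The code assumes the standard rejection-sampling rule: accept a drafted token with probability min(1, p/q), and on rejection resample from the renormalized residual max(0, p - q). It omits the extra token that full implementations sample from the large model when every draft is accepted.

```python
import random

def speculative_step(draft_tokens, draft_probs, target_probs, rng):
    """Verify drafted tokens left to right.

    draft_probs[i] / target_probs[i]: dict token -> probability at position i
    under the draft and large model respectively (toy stand-ins for logits).
    Returns the accepted tokens, plus one resampled token at the first rejection.
    """
    output = []
    for tok, q, p in zip(draft_tokens, draft_probs, target_probs):
        accept_prob = min(1.0, p.get(tok, 0.0) / q[tok])
        if rng.random() < accept_prob:
            output.append(tok)
            continue
        # Rejected: resample from max(0, p - q); rng.choices normalizes the weights.
        residual = {t: max(0.0, p_t - q.get(t, 0.0)) for t, p_t in p.items()}
        tokens, weights = zip(*residual.items())
        output.append(rng.choices(tokens, weights=weights, k=1)[0])
        break
    return output

rng = random.Random(0)

# Draft and large model agree exactly: every drafted token is accepted.
same = [{"a": 0.6, "b": 0.4}] * 3
all_accepted = speculative_step(["a", "b", "a"], same, same, rng)

# Large model gives the drafted token zero probability: rejected and replaced.
replaced = speculative_step(["a", "a"], [{"a": 1.0}] * 2,
                            [{"a": 0.0, "b": 1.0}] * 2, rng)
```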

Serving Systems

| System | Key Feature |
| --- | --- |
| vLLM | PagedAttention, continuous batching |
| TensorRT-LLM | NVIDIA-optimized kernels, in-flight batching |
| TGI (HuggingFace) | Production serving with streaming |
| SGLang | Radix attention for shared prefixes |
| Ollama | Local deployment, GGUF models |

Continuous batching: Add new requests to a running batch as slots open, maximizing GPU utilization.
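
A toy scheduler shows why this helps. The simulation below assumes each request needs a known number of decode steps and that every step advances all running requests by one token; real servers also handle prefill, memory limits, and preemption, none of which is modeled here.

```python
from collections import deque

def continuous_batching(requests, max_batch_size):
    """Simulate decode steps; requests is a list of (request_id, tokens_to_generate).

    Returns {request_id: step at which the request finished}.
    """
    waiting = deque(requests)
    running = {}   # request_id -> tokens still to generate
    finished = {}
    step = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < max_batch_size:
            rid, remaining = waiting.popleft()
            running[rid] = remaining
        step += 1
        for rid in list(running):
            running[rid] -= 1
            if running[rid] <= 0:
                finished[rid] = step
                del running[rid]
    return finished

# Hypothetical workload: request "b" is much longer than the others.
finish_steps = continuous_batching([("a", 2), ("b", 5), ("c", 1), ("d", 2)],
                                   max_batch_size=2)
total_steps = max(finish_steps.values())
```

With static batches of two, the same workload would take 5 steps for (a, b) and then 2 more for (c, d), 7 in total; here the short requests slot in while "b" is still decoding and everything completes in 5.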


Evaluation

Benchmarks

| Benchmark | Measures |
| --- | --- |
| MMLU | Multi-task knowledge (57 subjects) |
| HumanEval / MBPP | Code generation |
| GSM8K / MATH | Mathematical reasoning |
| HellaSwag | Commonsense reasoning |
| TruthfulQA | Truthfulness |
| MT-Bench | Multi-turn conversation quality |
| GPQA | Graduate-level expert QA |
| Arena Elo (Chatbot Arena) | Human preference ranking |

Evaluation Challenges

  • Benchmark contamination: test data may appear in training data
  • Metric gaming: optimizing for benchmarks does not guarantee real-world quality
  • Chatbot Arena (crowdsourced pairwise human preferences) is widely regarded as one of the most trusted evaluations of chat quality, though it measures preference rather than correctness
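
One common heuristic for spotting benchmark contamination is to look for long word n-grams that a test item shares with the training corpus. The sketch below assumes 8-word n-grams and exact matching after lowercasing; the threshold and the toy texts are illustrative, and paraphrased leaks would not be caught.

```python
def ngrams(text, n):
    """Set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(test_examples, training_text, n=8):
    """Fraction of test examples sharing at least one n-gram with the training text."""
    if not test_examples:
        return 0.0, []
    train_grams = ngrams(training_text, n)
    flagged = [ex for ex in test_examples if ngrams(ex, n) & train_grams]
    return len(flagged) / len(test_examples), flagged

# Toy data: the first test question appears verbatim inside the training text.
training_text = ("worked example if there are 3 cars with 4 wheels each "
                 "how many wheels total the answer is 12")
test_examples = [
    "If there are 3 cars with 4 wheels each how many wheels total",
    "A train travels 60 miles in 2 hours what is its speed",
]
rate, flagged = contamination_rate(test_examples, training_text)
```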

Prompt Engineering

Few-Shot Prompting

Provide examples of the desired input-output mapping in the prompt.

Classify the sentiment:
"I love this movie" -> Positive
"Terrible experience" -> Negative
"The food was okay" ->

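A prompt in the format above is usually assembled from a list of labeled examples rather than written by hand. The helper below is a hypothetical sketch of that step; the function name and the quoting convention simply mirror the example and are not tied to any library.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Format labeled examples followed by the unlabeled query."""
    lines = [instruction]
    lines += [f'"{text}" -> {label}' for text, label in examples]
    lines.append(f'"{query}" ->')  # left open for the model to complete
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment:",
    [("I love this movie", "Positive"), ("Terrible experience", "Negative")],
    "The food was okay",
)
```
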
Chain-of-Thought (CoT)

Instruct the model to reason step-by-step before answering.

Q: If there are 3 cars with 4 wheels each, how many wheels total?
A: Let me think step by step.
   Each car has 4 wheels.
   There are 3 cars.
   3 * 4 = 12 wheels total.
   The answer is 12.

  • Substantially improves performance on math, logic, and multi-step reasoning, mainly in larger models
  • Zero-shot CoT: simply append "Let's think step by step"
  • Self-consistency: sample multiple CoT paths, take majority vote
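
Self-consistency reduces to extracting a final answer from each sampled completion and taking the most common one. The sketch below assumes completions end with a phrase of the form "The answer is N", as in the example above; the sampled strings are hard-coded stand-ins for model outputs.

```python
import re
from collections import Counter

def extract_answer(completion):
    """Pull the number out of a trailing 'The answer is N' statement, if present."""
    match = re.search(r"The answer is\s+(-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else None

def self_consistent_answer(completions):
    """Majority vote over the answers extracted from several CoT samples."""
    answers = [a for a in map(extract_answer, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

# Stand-ins for three sampled reasoning paths; one of them makes an error.
samples = [
    "Each car has 4 wheels. 3 * 4 = 12. The answer is 12.",
    "4 + 4 + 4 = 12 wheels. The answer is 12.",
    "3 cars plus 4 wheels is 7. The answer is 7.",
]
voted = self_consistent_answer(samples)
```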

Advanced Prompting Techniques

| Technique | Description |
| --- | --- |
| ReAct | Interleave reasoning and actions (tool calls) |
| Tree of Thought | Explore multiple reasoning branches |
| Retrieval-augmented | Include retrieved context in the prompt |
| Structured output | Request JSON, XML, or specific formats |
| System prompts | Set behavioral guidelines and persona |

Agents and Tool Use

LLMs as agents that plan, reason, and interact with external tools.

Tool Use

The model generates structured calls to external APIs/tools, receives results, and continues reasoning.

Common tools: web search, code execution, calculators, databases, APIs.

Function calling: Model outputs structured JSON specifying function name and arguments; runtime executes and returns results.
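
On the runtime side, function calling amounts to parsing the model's JSON, looking up the named function, and returning the result as text. The sketch below assumes a call shape of {"name": ..., "arguments": {...}} and a hypothetical `add` tool; actual providers define their own schemas, so treat the field names as placeholders.

```python
import json

def add(a, b):
    """Hypothetical tool used for illustration."""
    return a + b

TOOLS = {"add": add}

def execute_tool_call(raw_json, tools):
    """Parse a model-emitted call, run the tool, and return a JSON result string."""
    call = json.loads(raw_json)
    name, args = call["name"], call.get("arguments", {})
    if name not in tools:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        result = tools[name](**args)
    except TypeError as exc:  # wrong or missing arguments
        return json.dumps({"error": str(exc)})
    return json.dumps({"result": result})

model_output = '{"name": "add", "arguments": {"a": 2, "b": 3}}'
tool_result = execute_tool_call(model_output, TOOLS)
```

Returning errors as data rather than raising lets the model see what went wrong and retry with corrected arguments.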

Agent Architectures

| Pattern | Description |
| --- | --- |
| ReAct | Thought -> Action -> Observation loop |
| Plan-and-execute | Generate full plan, then execute steps |
| Reflexion | Self-evaluate and retry on failure |
| Multi-agent | Multiple specialized agents collaborating |
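
The ReAct pattern can be sketched as a bounded loop around a model call. In the example below the "model" is a scripted stub and the only tool is a two-operand calculator, both hypothetical, so that the control flow is visible without a real LLM.

```python
def react_loop(model, tools, question, max_steps=5):
    """Run Thought -> Action -> Observation until the model gives an answer."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model(transcript)  # dict with "thought" and either "action" or "answer"
        transcript.append(f"Thought: {step['thought']}")
        if "answer" in step:
            return step["answer"], transcript
        name, arg = step["action"]
        observation = tools[name](arg)
        transcript.append(f"Action: {name}[{arg}]")
        transcript.append(f"Observation: {observation}")
    return None, transcript  # step budget exhausted without an answer

def calculator(expr):
    """Toy tool: evaluates 'a op b' for integer operands and + or *."""
    a, op, b = expr.split()
    ops = {"+": lambda x, y: x + y, "*": lambda x, y: x * y}
    return ops[op](int(a), int(b))

def scripted_model(transcript):
    """Stand-in for an LLM: calls the calculator once, then answers."""
    observations = [line for line in transcript if line.startswith("Observation:")]
    if not observations:
        return {"thought": "I should compute 3 * 4.", "action": ("calculator", "3 * 4")}
    return {"thought": "I have the result.",
            "answer": observations[-1].split(": ", 1)[1]}

answer, transcript = react_loop(scripted_model, {"calculator": calculator},
                                "How many wheels do 3 cars have?")
```

The `max_steps` bound is one simple guardrail against runaway loops.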

Challenges

  • Planning: LLMs struggle with long-horizon planning
  • Error recovery: Cascading errors in multi-step workflows
  • Safety: Autonomous actions require guardrails
  • Cost: Agent loops multiply API calls and latency
  • Evaluation: No standard benchmarks; SWE-bench (code), WebArena (web), GAIA (general) are emerging

Key Takeaways

  • Pre-training requires massive data curation, distributed training across thousands of GPUs, and careful stability management
  • PEFT methods (LoRA, QLoRA) make fine-tuning feasible on a single GPU, and on consumer hardware for smaller models
  • Alignment (RLHF, DPO) transforms base models into helpful assistants; DPO is simpler and increasingly preferred
  • Inference optimization (quantization, KV cache, speculative decoding, vLLM) is essential for practical deployment
  • Prompt engineering (CoT, few-shot) unlocks capabilities without training; careful prompting can sometimes substitute for fine-tuning
  • LLM agents with tool use extend capabilities beyond text generation to real-world interaction