Large Language Models
Overview
Large language models (LLMs) are transformer-based models trained on massive text corpora that exhibit broad capabilities across language understanding and generation tasks. This document covers the full LLM lifecycle: pre-training, fine-tuning, alignment, inference optimization, evaluation, prompt engineering, and agentic use.
Pre-training
Data
LLM training data is drawn from diverse sources.
| Source | Examples | Considerations |
|---|---|---|
| Web crawl | Common Crawl, C4, RefinedWeb | Requires aggressive deduplication and quality filtering |
| Books | Books3, Project Gutenberg | Long-form, high quality |
| Code | GitHub, The Stack | Improves reasoning and code generation |
| Scientific | Semantic Scholar, arXiv | Domain knowledge |
| Curated | Wikipedia, StackExchange | High quality, smaller scale |
Data processing pipeline:
- Language identification and filtering
- Deduplication (MinHash, exact substring)
- Quality filtering (perplexity, classifier-based)
- PII removal and content filtering
- Tokenization
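The deduplication and quality-filtering steps above can be sketched in a few lines. This is a toy pipeline: it catches only exact duplicates after normalization (real pipelines use MinHash for near-duplicates) and uses a trivial length heuristic as a stand-in for perplexity or classifier-based filters; the thresholds are illustrative.

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical docs hash alike."""
    return " ".join(text.lower().split())

def dedup_and_filter(docs, min_words=5):
    """Exact deduplication by content hash, plus a trivial length filter."""
    seen, kept = set(), []
    for doc in docs:
        if len(doc.split()) < min_words:
            continue  # quality filter: drop very short documents
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(doc)
    return kept

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick  brown fox jumps over the lazy dog.",  # duplicate up to whitespace
    "Too short.",                                      # fails the length filter
    "Large language models are trained on web-scale text corpora.",
]
print(len(dedup_and_filter(corpus)))  # 2 documents survive
```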
Scale: LLaMA 2 trained on 2T tokens; LLaMA 3 on 15T tokens.
Training Infrastructure
| Component | Details |
|---|---|
| Hardware | Clusters of thousands of GPUs (A100, H100) or TPUs |
| Interconnect | NVLink, InfiniBand for high-bandwidth inter-node communication |
| Frameworks | Megatron-LM, DeepSpeed, JAX/XLA, PyTorch FSDP |
| Cost | GPT-4 estimated at $100M+; LLaMA 3 405B ~30M GPU-hours |
Parallelism Strategies
| Strategy | Splits | Communication |
|---|---|---|
| Data parallelism (DP) | Data batches across replicas | All-reduce gradients |
| Tensor parallelism (TP) | Individual layers (attention heads, FFN) | All-reduce within layer |
| Pipeline parallelism (PP) | Layer groups across stages | Forward/backward between stages |
| Expert parallelism (EP) | MoE experts across devices | All-to-all routing |
| Sequence parallelism (SP) | Long sequences across devices | Communication at attention |
| ZeRO / FSDP | Optimizer states, gradients, params | All-gather when needed |
Typical setup: TP within a node (8 GPUs), PP across nodes, DP across node groups.
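The arithmetic behind such a layout is simple: the cluster size must factor into the product of the parallelism degrees. A minimal sketch (the 1,024-GPU figure is illustrative, not from any particular deployment):

```python
def parallelism_layout(world_size, tp, pp):
    """Given a cluster size and TP/PP degrees, derive the implied DP degree.
    world_size must factor as tp * pp * dp."""
    assert world_size % (tp * pp) == 0, "degrees must divide cluster size"
    dp = world_size // (tp * pp)
    return {"tensor": tp, "pipeline": pp, "data": dp}

# e.g. 1,024 GPUs: TP=8 inside each 8-GPU node, PP=8 across nodes
print(parallelism_layout(1024, tp=8, pp=8))  # {'tensor': 8, 'pipeline': 8, 'data': 16}
```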
Training Stability
- Mixed precision: BF16 forward/backward, FP32 master weights and optimizer states
- Gradient clipping (typically 1.0)
- Learning rate warmup + cosine decay
- Loss spikes: checkpoint averaging, data investigation, learning rate reduction
- Checkpointing every N steps for recovery
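The warmup-plus-cosine schedule above is easy to state precisely. A sketch, with illustrative hyperparameters (peak/min learning rates and warmup length vary by model and are not taken from any specific run):

```python
import math

def lr_schedule(step, max_steps, peak_lr=3e-4, min_lr=3e-5, warmup_steps=2000):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # 1 -> 0 over training
    return min_lr + (peak_lr - min_lr) * cosine

assert lr_schedule(0, 100_000) < lr_schedule(1999, 100_000)  # still warming up
assert abs(lr_schedule(100_000, 100_000) - 3e-5) < 1e-9      # decayed to the floor
```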
Fine-tuning
Full Fine-tuning
Update all model parameters on task-specific data. This is effective but expensive for large models, and it risks catastrophic forgetting.
Parameter-Efficient Fine-tuning (PEFT)
Train a small number of parameters while freezing the base model.
LoRA (Low-Rank Adaptation):
- Decompose weight updates as delta_W = A * B where A is d x r and B is r x d
- Rank r is typically 8-64 (vs d = 4096-8192)
- Reduces trainable parameters by 100-1000x
- Applied to attention projection matrices (Q, K, V, O)
- Merged into base weights at inference (no latency cost)
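The bullets above can be made concrete with a tiny worked example, following this text's convention (delta_W = A * B with A of shape d x r, B of shape r x d) and including the usual alpha/r scaling. Pure Python with naive matrix multiplies, toy dimensions only:

```python
def matmul(X, Y):
    """Naive matrix multiply for illustration (no numpy)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matadd(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def lora_forward(x, W, A, B, alpha, r):
    """y = x @ W + (alpha / r) * x @ A @ B, with W frozen; only A, B are trained."""
    scale = alpha / r
    delta = matmul(matmul(x, A), B)
    base = matmul(x, W)
    return [[b + scale * d for b, d in zip(rb, rd)] for rb, rd in zip(base, delta)]

# Tiny example: d = 2, r = 1
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [0.0]]   # d x r
B = [[0.5, 0.5]]     # r x d
x = [[2.0, 3.0]]
y = lora_forward(x, W, A, B, alpha=1, r=1)

# Merging W' = W + (alpha/r) * A @ B reproduces the output with no extra latency
W_merged = matadd(W, matmul(A, B))
assert y == matmul(x, W_merged)
```

The final assertion is the "no latency cost" claim in miniature: once training is done, the low-rank update folds into the base weight matrix.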
QLoRA:
- Quantize base model to 4-bit (NF4 quantization)
- Apply LoRA adapters in BF16
- Enables fine-tuning 65B models on a single 48GB GPU
- Double quantization: quantize the quantization constants
Adapters:
- Insert small bottleneck layers (down-project, nonlinearity, up-project) between transformer layers
- Train only adapter parameters; freeze everything else
- Modular: swap adapters for different tasks
Prefix Tuning / Prompt Tuning:
- Learn continuous vectors prepended to the input
- Prefix tuning: learned vectors at every layer
- Prompt tuning: learned vectors only at the input layer
- Extremely parameter-efficient but less expressive
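Prompt tuning reduces to a single concatenation at the embedding layer. A toy sketch with 3-dimensional vectors standing in for real embeddings (only `soft_prompt` would receive gradients; the token embeddings stay frozen):

```python
def prompt_tuned_input(soft_prompt, token_embeddings):
    """Prompt tuning: prepend trained continuous vectors to the frozen
    input token embeddings."""
    return soft_prompt + token_embeddings

soft_prompt = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]  # 2 learned vectors
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]         # embeddings of real tokens
seq = prompt_tuned_input(soft_prompt, tokens)
assert len(seq) == 4  # sequence grows by the soft-prompt length
```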
Instruction Tuning
Fine-tune on diverse (instruction, response) pairs to follow arbitrary instructions.
- Datasets: FLAN, OpenAssistant, ShareGPT, Alpaca
- Dramatically improves zero-shot task performance
- Self-instruct: use an LLM to generate training instructions
- Key models: FLAN-T5, Alpaca, Vicuna, Mistral-Instruct
Alignment
Training LLMs to be helpful, harmless, and honest.
RLHF (Reinforcement Learning from Human Feedback)
- Supervised fine-tuning (SFT): Fine-tune on high-quality demonstrations
- Reward model training: Train a model to predict human preferences between response pairs
- RL optimization: Use PPO to optimize the policy (LLM) against the reward model, with a KL penalty to stay close to the SFT model
Challenges: Reward hacking, reward model quality, instability of PPO training.
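The KL-penalized objective from the RL step can be written down directly. A per-token sketch, assuming the common formulation where the reward-model score is offset by the policy/reference log-ratio; the coefficient value is illustrative:

```python
def rlhf_reward(rm_score, logp_policy, logp_ref, kl_coef=0.1):
    """PPO training signal: reward-model score minus a KL penalty that keeps
    the policy close to the SFT reference model."""
    kl = logp_policy - logp_ref  # per-token log-ratio estimate of the KL term
    return rm_score - kl_coef * kl

# Drifting far above the reference (large log-ratio) eats into the reward
assert rlhf_reward(1.0, logp_policy=-1.0, logp_ref=-1.0) > rlhf_reward(1.0, -0.2, -1.0)
```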
DPO (Direct Preference Optimization)
Eliminates the explicit reward model by deriving a closed-form loss from the preference data.
L_DPO = -log sigma(beta * (log(pi(y_w|x) / pi_ref(y_w|x)) - log(pi(y_l|x) / pi_ref(y_l|x))))
- y_w: preferred response, y_l: dispreferred response
- Simpler and more stable than RLHF
- No separate reward model needed
- Widely adopted (Zephyr, LLaMA 3, many open models)
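The loss for a single preference pair translates directly into code. A sketch in terms of summed sequence log-probabilities (the inputs here are made-up numbers, not real model outputs):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, from log-probabilities of the
    preferred (w) and dispreferred (l) responses under the policy and
    the frozen reference model."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1 / (1 + math.exp(-beta * margin)))  # -log sigmoid(beta * margin)

# If the policy prefers y_w more strongly than the reference does, loss is low
low = dpo_loss(logp_w=-1.0, logp_l=-5.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
high = dpo_loss(logp_w=-5.0, logp_l=-1.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
assert low < math.log(2) < high  # log(2) is the loss at zero margin
```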
Constitutional AI (CAI)
- Generate responses, then self-critique against a set of principles (constitution)
- Revise responses based on critiques
- Use revised responses as preference data for RLHF/DPO
- Reduces reliance on human annotation
- Principles are explicit and auditable
Other Alignment Methods
| Method | Key Idea |
|---|---|
| RLAIF | AI-generated feedback instead of human |
| KTO | Kahneman-Tversky optimization; binary signal (good/bad) |
| IPO | Identity preference optimization; addresses DPO overfitting |
| ORPO | Odds ratio preference optimization; no reference model needed |
Inference Optimization
Quantization
Reduce model precision to lower memory and compute requirements.
| Precision | Bits | Memory (70B model) | Quality Impact |
|---|---|---|---|
| FP32 | 32 | ~280 GB | Baseline |
| FP16/BF16 | 16 | ~140 GB | Negligible |
| INT8 | 8 | ~70 GB | Minimal |
| INT4 (GPTQ, AWQ) | 4 | ~35 GB | Small |
| GGUF Q4_K_M | ~4.5 | ~40 GB | Small |
- GPTQ: Post-training quantization using approximate second-order information
- AWQ: Activation-aware quantization; protects salient weights
- GGUF (llama.cpp): CPU-friendly quantization with mixed precision per layer
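The core round-to-grid idea behind all of these methods fits in a few lines. A sketch of symmetric per-tensor INT8 quantization; GPTQ and AWQ are far more sophisticated (second-order error correction, activation-aware scaling), but they share this basic mechanism:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: w ~= scale * q, with q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [scale * v for v in q]

weights = [0.02, -0.51, 0.13, 1.27, -0.88]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, recovered))
assert max_err <= scale / 2  # round-to-nearest error is at most half a step
```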
KV Cache Optimization
Autoregressive generation caches key-value pairs from previous tokens to avoid recomputation.
- Memory scales as O(batch_size * seq_len * num_layers * hidden_dim)
- Multi-Query Attention (MQA): Share KV heads across query heads (reduces KV cache by num_heads x)
- Grouped-Query Attention (GQA): Share KV heads among groups of query heads (LLaMA 2, Mistral)
- PagedAttention (vLLM): Manage KV cache memory like virtual memory pages; eliminates fragmentation
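The memory scaling and the GQA savings above can be checked with arithmetic. A sketch using LLaMA-2-70B-like shape assumptions (80 layers, head dim 128, 64 query heads, 8 KV heads under GQA, BF16):

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """KV cache size: a factor of 2 for K and V, per layer, per token, per KV head."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes

# 70B-class model with GQA (8 KV heads), 4K context, BF16
gqa = kv_cache_bytes(batch=1, seq_len=4096, n_layers=80, n_kv_heads=8, head_dim=128)
# Same shapes with full multi-head KV (64 heads)
mha = kv_cache_bytes(batch=1, seq_len=4096, n_layers=80, n_kv_heads=64, head_dim=128)

print(gqa / 2**30, "GiB")   # 1.25 GiB per sequence
assert mha == 8 * gqa       # GQA shrinks the cache by heads / kv_heads = 8x
```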
Speculative Decoding
Use a small draft model to generate candidate tokens, then verify in parallel with the large model.
- Draft model generates k tokens autoregressively (fast)
- Large model scores all k tokens in a single forward pass (parallel)
- Accept tokens that the large model agrees with; reject and resample from the first disagreement
- Guarantees identical output distribution to the large model alone
- Speedup of 2-3x typical, depending on acceptance rate
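The accept/reject loop is easiest to see in the greedy-decoding special case, sketched below. Here `target_next_token` stands in for the large model's single batched forward pass (this toy version calls it per position), and the full scheme uses probabilistic accept/reject to match the target's *sampling* distribution exactly:

```python
def verify_draft(draft_tokens, target_next_token):
    """Greedy variant of speculative verification: accept draft tokens until
    the first position where the target's argmax disagrees, then take the
    target's token there and stop."""
    accepted = []
    for tok in draft_tokens:
        expected = target_next_token(accepted)
        if tok == expected:
            accepted.append(tok)       # target agrees: the token was free
        else:
            accepted.append(expected)  # first disagreement: correct and stop
            break
    return accepted

# Toy target model that deterministically continues 1, 2, 3, 4, ...
target = lambda prefix: len(prefix) + 1
print(verify_draft([1, 2, 9, 9], target))  # [1, 2, 3]: two accepted + one corrected
```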
Serving Systems
| System | Key Feature |
|---|---|
| vLLM | PagedAttention, continuous batching |
| TensorRT-LLM | NVIDIA-optimized kernels, in-flight batching |
| TGI (HuggingFace) | Production serving with streaming |
| SGLang | Radix attention for shared prefixes |
| Ollama | Local deployment, GGUF models |
Continuous batching: Add new requests to a running batch as slots open, maximizing GPU utilization.
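A toy simulation makes the benefit concrete: when one request finishes, a queued request takes its slot mid-batch instead of waiting for the whole batch to drain. The request lengths and batch size below are made up for illustration:

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """Simulate decode steps; each request is (name, steps_remaining)."""
    queue = deque(requests)
    active, steps, finished = [], 0, []
    while queue or active:
        while queue and len(active) < max_batch:
            active.append(list(queue.popleft()))  # fill freed slots immediately
        steps += 1                                 # one decode step for the batch
        for req in active:
            req[1] -= 1
        for req in [r for r in active if r[1] == 0]:
            finished.append(req[0])
            active.remove(req)
    return steps, finished

steps, order = continuous_batching([("a", 3), ("b", 1), ("c", 2)])
print(steps, order)  # 3 steps; static batching would need 5 for the same work
```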
Evaluation
Benchmarks
| Benchmark | Measures |
|---|---|
| MMLU | Multi-task knowledge (57 subjects) |
| HumanEval / MBPP | Code generation |
| GSM8K / MATH | Mathematical reasoning |
| HellaSwag | Commonsense reasoning |
| TruthfulQA | Truthfulness |
| MT-Bench | Multi-turn conversation quality |
| GPQA | Graduate-level expert QA |
| Arena Elo (Chatbot Arena) | Human preference ranking |
Evaluation Challenges
- Benchmark contamination: test data may appear in training data
- Metric gaming: optimizing for benchmarks does not guarantee real-world quality
- Chatbot Arena (crowdsourced human preferences) is widely regarded as one of the most trustworthy evaluations, since live human votes are harder to contaminate or game than static benchmarks
Prompt Engineering
Few-Shot Prompting
Provide examples of the desired input-output mapping in the prompt.
```
Classify the sentiment:
"I love this movie" -> Positive
"Terrible experience" -> Negative
"The food was okay" ->
```
Chain-of-Thought (CoT)
Instruct the model to reason step-by-step before answering.
```
Q: If there are 3 cars with 4 wheels each, how many wheels total?
A: Let me think step by step.
Each car has 4 wheels.
There are 3 cars.
3 * 4 = 12 wheels total.
The answer is 12.
```
- Dramatically improves performance on math, logic, and multi-step reasoning
- Zero-shot CoT: simply append "Let's think step by step"
- Self-consistency: sample multiple CoT paths, take majority vote
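The self-consistency vote is simple to implement once each sampled path ends with a recognizable answer marker. A sketch assuming paths conclude with "The answer is ..." (the answer-extraction heuristic here is illustrative; real implementations use more robust parsing):

```python
from collections import Counter

def self_consistent_answer(samples):
    """Extract each sampled path's final answer and return the majority vote."""
    answers = [s.rsplit("The answer is", 1)[-1].strip(" .") for s in samples]
    return Counter(answers).most_common(1)[0][0]

# Three sampled reasoning paths for the same question; one makes an error
paths = [
    "Each car has 4 wheels. 3 * 4 = 12. The answer is 12.",
    "3 cars times 4 wheels is 12. The answer is 12.",
    "4 + 3 = 7. The answer is 7.",  # faulty reasoning path, outvoted
]
print(self_consistent_answer(paths))  # 12
```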
Advanced Prompting Techniques
| Technique | Description |
|---|---|
| ReAct | Interleave reasoning and actions (tool calls) |
| Tree of Thought | Explore multiple reasoning branches |
| Retrieval-augmented | Include retrieved context in the prompt |
| Structured output | Request JSON, XML, or specific formats |
| System prompts | Set behavioral guidelines and persona |
Agents and Tool Use
LLMs as agents that plan, reason, and interact with external tools.
Tool Use
The model generates structured calls to external APIs/tools, receives results, and continues reasoning.
Common tools: web search, code execution, calculators, databases, APIs.
Function calling: Model outputs structured JSON specifying function name and arguments; runtime executes and returns results.
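The runtime side of function calling reduces to parse, dispatch, return. A sketch with a hypothetical tool registry; the tool names, JSON shape (`name`/`arguments`), and functions below are illustrative, not any particular provider's API:

```python
import json

# Hypothetical tool registry for illustration
TOOLS = {
    "add": lambda args: args["a"] + args["b"],
    "lookup_country": lambda args: {"paris": "France"}.get(args["city"].lower(), "unknown"),
}

def run_tool_call(model_output: str):
    """Parse the model's structured call, dispatch to the matching tool,
    and return the result to be appended to the model's context."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    return fn(call["arguments"])

# The model emits JSON; the runtime executes it and returns the observation
print(run_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}'))            # 5
print(run_tool_call('{"name": "lookup_country", "arguments": {"city": "Paris"}}'))  # France
```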
Agent Architectures
| Pattern | Description |
|---|---|
| ReAct | Thought -> Action -> Observation loop |
| Plan-and-execute | Generate full plan, then execute steps |
| Reflexion | Self-evaluate and retry on failure |
| Multi-agent | Multiple specialized agents collaborating |
Challenges
- Planning: LLMs struggle with long-horizon planning
- Error recovery: Cascading errors in multi-step workflows
- Safety: Autonomous actions require guardrails
- Cost: Agent loops multiply API calls and latency
- Evaluation: No standard benchmarks; SWE-bench (code), WebArena (web), GAIA (general) are emerging
Key Takeaways
- Pre-training requires massive data curation, distributed training across thousands of GPUs, and careful stability management
- PEFT methods (LoRA, QLoRA) make fine-tuning accessible on consumer hardware
- Alignment (RLHF, DPO) transforms base models into helpful assistants; DPO is simpler and increasingly preferred
- Inference optimization (quantization, KV cache, speculative decoding, vLLM) is essential for practical deployment
- Prompt engineering (CoT, few-shot) unlocks capabilities without training; careful prompting can substitute for fine-tuning
- LLM agents with tool use extend capabilities beyond text generation to real-world interaction