
Large Language Models

Overview

Large language models (LLMs) are transformer-based models trained on massive text corpora that exhibit broad capabilities across language understanding and generation tasks. This document covers the full LLM lifecycle: pre-training, fine-tuning, alignment, inference optimization, evaluation, prompt engineering, and agentic use.


Pre-training

Data

LLM training data is drawn from diverse sources.

| Source | Examples | Considerations |
|---|---|---|
| Web crawl | Common Crawl, C4, RefinedWeb | Requires aggressive deduplication and quality filtering |
| Books | Books3, Project Gutenberg | Long-form, high quality |
| Code | GitHub, The Stack | Improves reasoning and code generation |
| Scientific | Semantic Scholar, arXiv | Domain knowledge |
| Curated | Wikipedia, StackExchange | High quality, smaller scale |

Data processing pipeline:

  1. Language identification and filtering
  2. Deduplication (MinHash, exact substring)
  3. Quality filtering (perplexity, classifier-based)
  4. PII removal and content filtering
  5. Tokenization
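The deduplication step above can be sketched with a toy MinHash over word shingles, using only the standard library. The shingle size and number of hash functions here are illustrative choices, not values from any particular pipeline:

```python
import hashlib

def shingles(text, k=5):
    """Split text into overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text, num_hashes=64, k=5):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in shingles(text, k)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Production pipelines compare signatures with locality-sensitive hashing so that candidate duplicate pairs are found without all-pairs comparison.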

Scale: LLaMA 2 trained on 2T tokens; LLaMA 3 on 15T tokens.

Training Infrastructure

| Component | Details |
|---|---|
| Hardware | Clusters of thousands of GPUs (A100, H100) or TPUs |
| Interconnect | NVLink, InfiniBand for high-bandwidth inter-node communication |
| Frameworks | Megatron-LM, DeepSpeed, JAX/XLA, PyTorch FSDP |
| Cost | GPT-4 estimated at $100M+; LLaMA 3 405B ~30M GPU-hours |

Parallelism Strategies

| Strategy | Splits | Communication |
|---|---|---|
| Data parallelism (DP) | Data batches across replicas | All-reduce gradients |
| Tensor parallelism (TP) | Individual layers (attention heads, FFN) | All-reduce within layer |
| Pipeline parallelism (PP) | Layer groups across stages | Forward/backward between stages |
| Expert parallelism (EP) | MoE experts across devices | All-to-all routing |
| Sequence parallelism (SP) | Long sequences across devices | Communication at attention |
| ZeRO / FSDP | Optimizer states, gradients, params | All-gather when needed |

Typical setup: TP within a node (8 GPUs), PP across nodes, DP across node groups.
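The memory arithmetic behind ZeRO/FSDP sharding can be sketched as follows, assuming BF16 training with Adam (2 B weights + 2 B gradients + 12 B optimizer state per parameter: FP32 master copy plus two moments) and ignoring activation memory:

```python
def per_gpu_memory_gb(num_params, num_gpus, zero_stage=3):
    """Approximate per-GPU training state for mixed-precision Adam.
    Stage 1 shards optimizer states, stage 2 adds gradients,
    stage 3 adds the parameters themselves."""
    weights = 2.0 * num_params   # BF16 weights
    grads = 2.0 * num_params     # BF16 gradients
    opt = 12.0 * num_params      # FP32 master weights + Adam moments
    if zero_stage >= 1:
        opt /= num_gpus
    if zero_stage >= 2:
        grads /= num_gpus
    if zero_stage >= 3:
        weights /= num_gpus
    return (weights + grads + opt) / 1e9

# A 70B model needs ~1120 GB of training state unsharded;
# ZeRO-3 across 64 GPUs brings that to ~17.5 GB per GPU.
```

This is why a 70B model that fits nowhere in one GPU's memory becomes trainable once optimizer states, gradients, and parameters are all sharded.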

Training Stability

  • Mixed precision: BF16 forward/backward, FP32 master weights and optimizer states
  • Gradient clipping (typically 1.0)
  • Learning rate warmup + cosine decay
  • Loss spikes: checkpoint averaging, data investigation, learning rate reduction
  • Checkpointing every N steps for recovery
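The warmup-plus-cosine schedule above can be written as a small function; the peak rate, floor, and step counts here are placeholder values, not recommendations:

```python
import math

def lr_at(step, peak_lr=3e-4, min_lr=3e-5, warmup_steps=2000, total_steps=100_000):
    """Linear warmup from 0 to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```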

Fine-tuning

Full Fine-tuning

Update all model parameters on task-specific data. Effective but expensive for large models and risks catastrophic forgetting.

Parameter-Efficient Fine-tuning (PEFT)

Train a small number of parameters while freezing the base model.

LoRA (Low-Rank Adaptation):

  • Decompose weight updates as delta_W = A * B where A is d x r and B is r x d
  • Rank r is typically 8-64 (vs d = 4096-8192)
  • Reduces trainable parameters by 100-1000x
  • Applied to attention projection matrices (Q, K, V, O)
  • Merged into base weights at inference (no latency cost)

QLoRA:

  • Quantize base model to 4-bit (NF4 quantization)
  • Apply LoRA adapters in BF16
  • Enables fine-tuning 65B models on a single 48GB GPU
  • Double quantization: quantize the quantization constants

Adapters:

  • Insert small bottleneck layers (down-project, nonlinearity, up-project) between transformer layers
  • Train only adapter parameters; freeze everything else
  • Modular: swap adapters for different tasks

Prefix Tuning / Prompt Tuning:

  • Learn continuous vectors prepended to the input
  • Prefix tuning: learned vectors at every layer
  • Prompt tuning: learned vectors only at the input layer
  • Extremely parameter-efficient but less expressive

Instruction Tuning

Fine-tune on diverse (instruction, response) pairs to follow arbitrary instructions.

  • Datasets: FLAN, OpenAssistant, ShareGPT, Alpaca
  • Dramatically improves zero-shot task performance
  • Self-instruct: use an LLM to generate training instructions
  • Key models: FLAN-T5, Alpaca, Vicuna, Mistral-Instruct

Alignment

Training LLMs to be helpful, harmless, and honest.

RLHF (Reinforcement Learning from Human Feedback)

  1. Supervised fine-tuning (SFT): Fine-tune on high-quality demonstrations
  2. Reward model training: Train a model to predict human preferences between response pairs
  3. RL optimization: Use PPO to optimize the policy (LLM) against the reward model, with a KL penalty to stay close to the SFT model
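Step 2's reward model is typically trained with a Bradley-Terry pairwise loss on scalar rewards for the chosen and rejected responses; a minimal sketch:

```python
import math

def reward_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Minimized when the reward model scores the preferred response higher."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```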

Challenges: Reward hacking, reward model quality, instability of PPO training.

DPO (Direct Preference Optimization)

Eliminates the explicit reward model by deriving a closed-form loss from the preference data.

L_DPO = -log sigma(beta * (log(pi(y_w|x) / pi_ref(y_w|x)) - log(pi(y_l|x) / pi_ref(y_l|x))))

  • y_w: preferred response, y_l: dispreferred response
  • Simpler and more stable than RLHF
  • No separate reward model needed
  • Widely adopted (LLaMA 2, Zephyr, many open models)
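The loss above, written out for a single preference pair given per-response log-probabilities under the policy and the frozen reference model:

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss: -log sigmoid of the beta-scaled difference in log-ratios
    between the preferred (w) and dispreferred (l) responses."""
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy equals the reference, both log-ratios are zero and the loss sits at log 2; upweighting the preferred response relative to the reference drives it down.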

Constitutional AI (CAI)

  1. Generate responses, then self-critique against a set of principles (constitution)
  2. Revise responses based on critiques
  3. Use revised responses as preference data for RLHF/DPO
  • Reduces reliance on human annotation
  • Principles are explicit and auditable

Other Alignment Methods

| Method | Key Idea |
|---|---|
| RLAIF | AI-generated feedback instead of human |
| KTO | Kahneman-Tversky optimization; binary signal (good/bad) |
| IPO | Identity preference optimization; addresses DPO overfitting |
| ORPO | Odds ratio preference optimization; no reference model needed |


Inference Optimization

Quantization

Reduce model precision to lower memory and compute requirements.

| Precision | Bits | Memory (70B model) | Quality Impact |
|---|---|---|---|
| FP32 | 32 | ~280 GB | Baseline |
| FP16/BF16 | 16 | ~140 GB | Negligible |
| INT8 | 8 | ~70 GB | Minimal |
| INT4 (GPTQ, AWQ) | 4 | ~35 GB | Small |
| GGUF Q4_K_M | ~4.5 | ~40 GB | Small |

  • GPTQ: Post-training quantization using approximate second-order information
  • AWQ: Activation-aware quantization; protects salient weights
  • GGUF (llama.cpp): CPU-friendly quantization with mixed precision per layer
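The simplest form of the idea is symmetric absmax quantization, sketched below; GPTQ and AWQ are considerably more sophisticated (second-order error correction, activation-aware scaling), but the round-trip structure is the same:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor absmax quantization: map [-max|w|, max|w|] to [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale
```

Per-channel or per-group scales (as in GPTQ/AWQ/GGUF) shrink the round-trip error further because one large outlier no longer inflates the scale for the whole tensor.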

KV Cache Optimization

Autoregressive generation caches key-value pairs from previous tokens to avoid recomputation.

  • Memory scales as O(batch_size * seq_len * num_layers * hidden_dim)
  • Multi-Query Attention (MQA): Share KV heads across query heads (reduces KV cache by num_heads x)
  • Grouped-Query Attention (GQA): Share KV heads among groups of query heads (LLaMA 2, Mistral)
  • PagedAttention (vLLM): Manage KV cache memory like virtual memory pages; eliminates fragmentation
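The cache-size arithmetic is worth doing once. The sketch below uses a LLaMA-2-70B-style configuration (80 layers, head dimension 128, 8 KV heads under GQA versus 64 under standard multi-head attention) with 16-bit cache entries:

```python
def kv_cache_gb(batch, seq_len, layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) of shape
    [batch, seq_len, layers, kv_heads, head_dim] at bytes_per_elem each."""
    return 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_elem / 1e9

# One 4096-token sequence: ~1.34 GB with GQA (8 KV heads)
# versus ~10.7 GB with full multi-head attention (64 KV heads).
```

The 8x reduction from GQA is exactly the ratio of query heads to KV heads, which is why it directly multiplies the number of concurrent sequences a server can hold.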

Speculative Decoding

Use a small draft model to generate candidate tokens, then verify in parallel with the large model.

  1. Draft model generates k tokens autoregressively (fast)
  2. Large model scores all k tokens in a single forward pass (parallel)
  3. Accept tokens that the large model agrees with; reject and resample from the first disagreement
  4. Guarantees identical output distribution to the large model alone
  5. Speedup of 2-3x typical, depending on acceptance rate
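The verification rule in step 3 can be sketched directly: each draft token is accepted with probability min(1, p_target/p_draft), and generation falls back to the large model at the first rejection. The per-token probabilities and uniform samples here are supplied as plain lists for illustration:

```python
def verify_draft(target_probs, draft_probs, uniforms):
    """Return the number of accepted draft tokens.
    target_probs[i]/draft_probs[i] are the two models' probabilities for
    draft token i; uniforms[i] is a uniform(0,1) sample. Stops at the
    first rejection."""
    accepted = 0
    for p, q, u in zip(target_probs, draft_probs, uniforms):
        if u < min(1.0, p / q):
            accepted += 1
        else:
            break
    return accepted
```

When the large model assigns at least as much probability as the draft, the token is always kept; on rejection, the full scheme resamples from the normalized residual max(0, p - q), which is what preserves the large model's exact output distribution.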

Serving Systems

| System | Key Feature |
|---|---|
| vLLM | PagedAttention, continuous batching |
| TensorRT-LLM | NVIDIA-optimized kernels, in-flight batching |
| TGI (HuggingFace) | Production serving with streaming |
| SGLang | Radix attention for shared prefixes |
| Ollama | Local deployment, GGUF models |

Continuous batching: Add new requests to a running batch as slots open, maximizing GPU utilization.


Evaluation

Benchmarks

| Benchmark | Measures |
|---|---|
| MMLU | Multi-task knowledge (57 subjects) |
| HumanEval / MBPP | Code generation |
| GSM8K / MATH | Mathematical reasoning |
| HellaSwag | Commonsense reasoning |
| TruthfulQA | Truthfulness |
| MT-Bench | Multi-turn conversation quality |
| GPQA | Graduate-level expert QA |
| Arena Elo (Chatbot Arena) | Human preference ranking |

Evaluation Challenges

  • Benchmark contamination: test data may appear in training data
  • Metric gaming: optimizing for benchmarks does not guarantee real-world quality
  • Chatbot Arena (crowdsourced human preferences) is among the most trusted evaluations, since it measures real human preference rather than a fixed test set

Prompt Engineering

Few-Shot Prompting

Provide examples of the desired input-output mapping in the prompt.

Classify the sentiment:
"I love this movie" -> Positive
"Terrible experience" -> Negative
"The food was okay" ->

Chain-of-Thought (CoT)

Instruct the model to reason step-by-step before answering.

Q: If there are 3 cars with 4 wheels each, how many wheels total?
A: Let me think step by step.
   Each car has 4 wheels.
   There are 3 cars.
   3 * 4 = 12 wheels total.
   The answer is 12.
  • Dramatically improves performance on math, logic, and multi-step reasoning
  • Zero-shot CoT: simply append "Let's think step by step"
  • Self-consistency: sample multiple CoT paths, take majority vote
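Self-consistency amounts to a majority vote over the final answers extracted from several sampled reasoning paths; a minimal sketch:

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over final answers from independently sampled CoT paths."""
    return Counter(answers).most_common(1)[0][0]
```

In practice the answers come from sampling the same CoT prompt several times at nonzero temperature and parsing out each path's final answer.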

Advanced Prompting Techniques

| Technique | Description |
|---|---|
| ReAct | Interleave reasoning and actions (tool calls) |
| Tree of Thought | Explore multiple reasoning branches |
| Retrieval-augmented | Include retrieved context in the prompt |
| Structured output | Request JSON, XML, or specific formats |
| System prompts | Set behavioral guidelines and persona |


Agents and Tool Use

LLMs as agents that plan, reason, and interact with external tools.

Tool Use

The model generates structured calls to external APIs/tools, receives results, and continues reasoning.

Common tools: web search, code execution, calculators, databases, APIs.

Function calling: Model outputs structured JSON specifying function name and arguments; runtime executes and returns results.
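A minimal sketch of the runtime side, using the standard library's JSON parser; the tool names and the exact JSON shape (`name` plus `arguments`) are illustrative, and real APIs differ in schema and validation:

```python
import json

# Hypothetical tool registry; names and signatures are illustrative only.
TOOLS = {
    "get_weather": lambda city: f"22C and sunny in {city}",
    "add": lambda a, b: a + b,
}

def dispatch(model_output: str):
    """Parse the model's JSON function call and execute the named tool."""
    call = json.loads(model_output)
    return TOOLS[call["name"]](**call["arguments"])
```

The result is serialized back into the conversation so the model can continue reasoning with it; production runtimes add schema validation and error handling around this loop.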

Agent Architectures

| Pattern | Description |
|---|---|
| ReAct | Thought -> Action -> Observation loop |
| Plan-and-execute | Generate full plan, then execute steps |
| Reflexion | Self-evaluate and retry on failure |
| Multi-agent | Multiple specialized agents collaborating |

Challenges

  • Planning: LLMs struggle with long-horizon planning
  • Error recovery: Cascading errors in multi-step workflows
  • Safety: Autonomous actions require guardrails
  • Cost: Agent loops multiply API calls and latency
  • Evaluation: No standard benchmarks; SWE-bench (code), WebArena (web), GAIA (general) are emerging

Key Takeaways

  • Pre-training requires massive data curation, distributed training across thousands of GPUs, and careful stability management
  • PEFT methods (LoRA, QLoRA) make fine-tuning accessible on consumer hardware
  • Alignment (RLHF, DPO) transforms base models into helpful assistants; DPO is simpler and increasingly preferred
  • Inference optimization (quantization, KV cache, speculative decoding, vLLM) is essential for practical deployment
  • Prompt engineering (CoT, few-shot) unlocks capabilities without training; careful prompting can substitute for fine-tuning
  • LLM agents with tool use extend capabilities beyond text generation to real-world interaction