
Large Language Models

Overview

Large language models (LLMs) are transformer-based models trained on massive text corpora that exhibit broad capabilities across language understanding and generation tasks. This document covers the full LLM lifecycle: pre-training, fine-tuning, alignment, inference optimization, evaluation, prompt engineering, and agentic use.


Pre-training

Data

LLM training data is drawn from diverse sources.

| Source | Examples | Considerations |
|---|---|---|
| Web crawl | Common Crawl, C4, RefinedWeb | Requires aggressive deduplication and quality filtering |
| Books | Books3, Project Gutenberg | Long-form, high quality |
| Code | GitHub, The Stack | Improves reasoning and code generation |
| Scientific | Semantic Scholar, arXiv | Domain knowledge |
| Curated | Wikipedia, StackExchange | High quality, smaller scale |

Data processing pipeline:

  1. Language identification and filtering
  2. Deduplication (MinHash, exact substring)
  3. Quality filtering (perplexity, classifier-based)
  4. PII removal and content filtering
  5. Tokenization
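The deduplication step above can be sketched with a toy MinHash over word shingles, using only the standard library. The shingle size and number of hash functions here are illustrative choices, not values from any particular pipeline:

```python
import hashlib

def shingles(text, k=5):
    """Split text into overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text, num_hashes=64, k=5):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in shingles(text, k)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Production pipelines compare signatures with locality-sensitive hashing so that candidate duplicate pairs are found without all-pairs comparison.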

Scale: LLaMA 2 trained on 2T tokens; LLaMA 3 on 15T tokens.

Training Infrastructure

| Component | Details |
|---|---|
| Hardware | Clusters of thousands of GPUs (A100, H100) or TPUs |
| Interconnect | NVLink, InfiniBand for high-bandwidth inter-node communication |
| Frameworks | Megatron-LM, DeepSpeed, JAX/XLA, PyTorch FSDP |
| Cost | GPT-4 estimated at $100M+; LLaMA 3 405B ~30M GPU-hours |

Parallelism Strategies

| Strategy | Splits | Communication |
|---|---|---|
| Data parallelism (DP) | Data batches across replicas | All-reduce gradients |
| Tensor parallelism (TP) | Individual layers (attention heads, FFN) | All-reduce within layer |
| Pipeline parallelism (PP) | Layer groups across stages | Forward/backward between stages |
| Expert parallelism (EP) | MoE experts across devices | All-to-all routing |
| Sequence parallelism (SP) | Long sequences across devices | Communication at attention |
| ZeRO / FSDP | Optimizer states, gradients, params | All-gather when needed |

Typical setup: TP within a node (8 GPUs), PP across nodes, DP across node groups.
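The memory arithmetic behind ZeRO/FSDP sharding can be sketched as follows, assuming BF16 training with Adam (2 B weights + 2 B gradients + 12 B optimizer state per parameter: FP32 master copy plus two moments) and ignoring activation memory:

```python
def per_gpu_memory_gb(num_params, num_gpus, zero_stage=3):
    """Approximate per-GPU training state for mixed-precision Adam.
    Stage 1 shards optimizer states, stage 2 adds gradients,
    stage 3 adds the parameters themselves."""
    weights = 2.0 * num_params   # BF16 weights
    grads = 2.0 * num_params     # BF16 gradients
    opt = 12.0 * num_params      # FP32 master weights + Adam moments
    if zero_stage >= 1:
        opt /= num_gpus
    if zero_stage >= 2:
        grads /= num_gpus
    if zero_stage >= 3:
        weights /= num_gpus
    return (weights + grads + opt) / 1e9

# A 70B model needs ~1120 GB of training state unsharded;
# ZeRO-3 across 64 GPUs brings that to ~17.5 GB per GPU.
```

This is why a 70B model that fits nowhere in one GPU's memory becomes trainable once optimizer states, gradients, and parameters are all sharded.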

Training Stability

  • Mixed precision: BF16 forward/backward, FP32 master weights and optimizer states
  • Gradient clipping (typically 1.0)
  • Learning rate warmup + cosine decay
  • Loss spikes: checkpoint averaging, data investigation, learning rate reduction
  • Checkpointing every N steps for recovery
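The warmup-plus-cosine schedule above can be written as a small function; the peak rate, floor, and step counts here are placeholder values, not recommendations:

```python
import math

def lr_at(step, peak_lr=3e-4, min_lr=3e-5, warmup_steps=2000, total_steps=100_000):
    """Linear warmup from 0 to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```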

Fine-tuning

Full Fine-tuning

Update all model parameters on task-specific data. Effective but expensive for large models and risks catastrophic forgetting.

Parameter-Efficient Fine-tuning (PEFT)

Train a small number of parameters while freezing the base model.

LoRA (Low-Rank Adaptation):

  • Decompose weight updates as delta_W = A * B where A is d x r and B is r x d
  • Rank r is typically 8-64 (vs d = 4096-8192)
  • Reduces trainable parameters by 100-1000x
  • Applied to attention projection matrices (Q, K, V, O)
  • Merged into base weights at inference (no latency cost)

QLoRA:

  • Quantize base model to 4-bit (NF4 quantization)
  • Apply LoRA adapters in BF16
  • Enables fine-tuning 65B models on a single 48GB GPU
  • Double quantization: quantize the quantization constants

Adapters:

  • Insert small bottleneck layers (down-project, nonlinearity, up-project) between transformer layers
  • Train only adapter parameters; freeze everything else
  • Modular: swap adapters for different tasks

Prefix Tuning / Prompt Tuning:

  • Learn continuous vectors prepended to the input
  • Prefix tuning: learned vectors at every layer
  • Prompt tuning: learned vectors only at the input layer
  • Extremely parameter-efficient but less expressive

Instruction Tuning

Fine-tune on diverse (instruction, response) pairs to follow arbitrary instructions.

  • Datasets: FLAN, OpenAssistant, ShareGPT, Alpaca
  • Dramatically improves zero-shot task performance
  • Self-instruct: use an LLM to generate training instructions
  • Key models: FLAN-T5, Alpaca, Vicuna, Mistral-Instruct

Alignment

Training LLMs to be helpful, harmless, and honest.

RLHF (Reinforcement Learning from Human Feedback)

  1. Supervised fine-tuning (SFT): Fine-tune on high-quality demonstrations
  2. Reward model training: Train a model to predict human preferences between response pairs
  3. RL optimization: Use PPO to optimize the policy (LLM) against the reward model, with a KL penalty to stay close to the SFT model
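Step 2's reward model is typically trained with a Bradley-Terry pairwise loss on scalar rewards for the chosen and rejected responses; a minimal sketch:

```python
import math

def reward_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Minimized when the reward model scores the preferred response higher."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```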

Challenges: Reward hacking, reward model quality, instability of PPO training.

DPO (Direct Preference Optimization)

Eliminates the explicit reward model by deriving a closed-form loss from the preference data.

L_DPO = -log sigma(beta * (log(pi(y_w|x) / pi_ref(y_w|x)) - log(pi(y_l|x) / pi_ref(y_l|x))))

  • y_w: preferred response, y_l: dispreferred response
  • Simpler and more stable than RLHF
  • No separate reward model needed
  • Widely adopted (LLaMA 2, Zephyr, many open models)
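The loss above, written out for a single preference pair given per-response log-probabilities under the policy and the frozen reference model:

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss: -log sigmoid of the beta-scaled difference in log-ratios
    between the preferred (w) and dispreferred (l) responses."""
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy equals the reference, both log-ratios are zero and the loss sits at log 2; upweighting the preferred response relative to the reference drives it down.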

Constitutional AI (CAI)

  1. Generate responses, then self-critique against a set of principles (constitution)
  2. Revise responses based on critiques
  3. Use revised responses as preference data for RLHF/DPO
  • Reduces reliance on human annotation
  • Principles are explicit and auditable

Other Alignment Methods

| Method | Key Idea |
|---|---|
| RLAIF | AI-generated feedback instead of human |
| KTO | Kahneman-Tversky optimization; binary signal (good/bad) |
| IPO | Identity preference optimization; addresses DPO overfitting |
| ORPO | Odds ratio preference optimization; no reference model needed |


Inference Optimization

Quantization

Reduce model precision to lower memory and compute requirements.

| Precision | Bits | Memory (70B model) | Quality Impact |
|---|---|---|---|
| FP32 | 32 | ~280 GB | Baseline |
| FP16/BF16 | 16 | ~140 GB | Negligible |
| INT8 | 8 | ~70 GB | Minimal |
| INT4 (GPTQ, AWQ) | 4 | ~35 GB | Small |
| GGUF Q4_K_M | ~4.5 | ~40 GB | Small |

  • GPTQ: Post-training quantization using approximate second-order information
  • AWQ: Activation-aware quantization; protects salient weights
  • GGUF (llama.cpp): CPU-friendly quantization with mixed precision per layer
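The simplest form of the idea is symmetric absmax quantization, sketched below; GPTQ and AWQ are considerably more sophisticated (second-order error correction, activation-aware scaling), but the round-trip structure is the same:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor absmax quantization: map [-max|w|, max|w|] to [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale
```

Per-channel or per-group scales (as in GPTQ/AWQ/GGUF) shrink the round-trip error further because one large outlier no longer inflates the scale for the whole tensor.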

KV Cache Optimization

Autoregressive generation caches key-value pairs from previous tokens to avoid recomputation.

  • Memory scales as O(batch_size * seq_len * num_layers * hidden_dim)
  • Multi-Query Attention (MQA): Share KV heads across query heads (reduces KV cache by num_heads x)
  • Grouped-Query Attention (GQA): Share KV heads among groups of query heads (LLaMA 2, Mistral)
  • PagedAttention (vLLM): Manage KV cache memory like virtual memory pages; eliminates fragmentation
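The cache-size arithmetic is worth doing once. The sketch below uses a LLaMA-2-70B-style configuration (80 layers, head dimension 128, 8 KV heads under GQA versus 64 under standard multi-head attention) with 16-bit cache entries:

```python
def kv_cache_gb(batch, seq_len, layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) of shape
    [batch, seq_len, layers, kv_heads, head_dim] at bytes_per_elem each."""
    return 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_elem / 1e9

# One 4096-token sequence: ~1.34 GB with GQA (8 KV heads)
# versus ~10.7 GB with full multi-head attention (64 KV heads).
```

The 8x reduction from GQA is exactly the ratio of query heads to KV heads, which is why it directly multiplies the number of concurrent sequences a server can hold.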

Speculative Decoding

Use a small draft model to generate candidate tokens, then verify in parallel with the large model.

  1. Draft model generates k tokens autoregressively (fast)
  2. Large model scores all k tokens in a single forward pass (parallel)
  3. Accept tokens that the large model agrees with; reject and resample from the first disagreement
  4. Guarantees identical output distribution to the large model alone
  5. Speedup of 2-3x typical, depending on acceptance rate
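The verification rule in step 3 can be sketched directly: each draft token is accepted with probability min(1, p_target/p_draft), and generation falls back to the large model at the first rejection. The per-token probabilities and uniform samples here are supplied as plain lists for illustration:

```python
def verify_draft(target_probs, draft_probs, uniforms):
    """Return the number of accepted draft tokens.
    target_probs[i]/draft_probs[i] are the two models' probabilities for
    draft token i; uniforms[i] is a uniform(0,1) sample. Stops at the
    first rejection."""
    accepted = 0
    for p, q, u in zip(target_probs, draft_probs, uniforms):
        if u < min(1.0, p / q):
            accepted += 1
        else:
            break
    return accepted
```

When the large model assigns at least as much probability as the draft, the token is always kept; on rejection, the full scheme resamples from the normalized residual max(0, p - q), which is what preserves the large model's exact output distribution.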

Serving Systems

| System | Key Feature |
|---|---|
| vLLM | PagedAttention, continuous batching |
| TensorRT-LLM | NVIDIA-optimized kernels, in-flight batching |
| TGI (HuggingFace) | Production serving with streaming |
| SGLang | Radix attention for shared prefixes |
| Ollama | Local deployment, GGUF models |

Continuous batching: Add new requests to a running batch as slots open, maximizing GPU utilization.


Evaluation

Benchmarks

| Benchmark | Measures |
|---|---|
| MMLU | Multi-task knowledge (57 subjects) |
| HumanEval / MBPP | Code generation |
| GSM8K / MATH | Mathematical reasoning |
| HellaSwag | Commonsense reasoning |
| TruthfulQA | Truthfulness |
| MT-Bench | Multi-turn conversation quality |
| GPQA | Graduate-level expert QA |
| Arena Elo (Chatbot Arena) | Human preference ranking |

Evaluation Challenges

  • Benchmark contamination: test data may appear in training data
  • Metric gaming: optimizing for benchmarks does not guarantee real-world quality
  • Chatbot Arena (crowdsourced human preferences) is among the most trusted evaluations, since it measures real human preference rather than a fixed test set

Prompt Engineering

Few-Shot Prompting

Provide examples of the desired input-output mapping in the prompt.

Classify the sentiment:
"I love this movie" -> Positive
"Terrible experience" -> Negative
"The food was okay" ->

Chain-of-Thought (CoT)

Instruct the model to reason step-by-step before answering.

Q: If there are 3 cars with 4 wheels each, how many wheels total?
A: Let me think step by step.
   Each car has 4 wheels.
   There are 3 cars.
   3 * 4 = 12 wheels total.
   The answer is 12.
  • Dramatically improves performance on math, logic, and multi-step reasoning
  • Zero-shot CoT: simply append "Let's think step by step"
  • Self-consistency: sample multiple CoT paths, take majority vote
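Self-consistency amounts to a majority vote over the final answers extracted from several sampled reasoning paths; a minimal sketch:

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over final answers from independently sampled CoT paths."""
    return Counter(answers).most_common(1)[0][0]
```

In practice the answers come from sampling the same CoT prompt several times at nonzero temperature and parsing out each path's final answer.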

Advanced Prompting Techniques

| Technique | Description |
|---|---|
| ReAct | Interleave reasoning and actions (tool calls) |
| Tree of Thought | Explore multiple reasoning branches |
| Retrieval-augmented | Include retrieved context in the prompt |
| Structured output | Request JSON, XML, or specific formats |
| System prompts | Set behavioral guidelines and persona |


Agents and Tool Use

LLMs as agents that plan, reason, and interact with external tools.

Tool Use

The model generates structured calls to external APIs/tools, receives results, and continues reasoning.

Common tools: web search, code execution, calculators, databases, APIs.

Function calling: Model outputs structured JSON specifying function name and arguments; runtime executes and returns results.
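A minimal sketch of the runtime side, using the standard library's JSON parser; the tool names and the exact JSON shape (`name` plus `arguments`) are illustrative, and real APIs differ in schema and validation:

```python
import json

# Hypothetical tool registry; names and signatures are illustrative only.
TOOLS = {
    "get_weather": lambda city: f"22C and sunny in {city}",
    "add": lambda a, b: a + b,
}

def dispatch(model_output: str):
    """Parse the model's JSON function call and execute the named tool."""
    call = json.loads(model_output)
    return TOOLS[call["name"]](**call["arguments"])
```

The result is serialized back into the conversation so the model can continue reasoning with it; production runtimes add schema validation and error handling around this loop.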

Agent Architectures

| Pattern | Description |
|---|---|
| ReAct | Thought -> Action -> Observation loop |
| Plan-and-execute | Generate full plan, then execute steps |
| Reflexion | Self-evaluate and retry on failure |
| Multi-agent | Multiple specialized agents collaborating |

Challenges

  • Planning: LLMs struggle with long-horizon planning
  • Error recovery: Cascading errors in multi-step workflows
  • Safety: Autonomous actions require guardrails
  • Cost: Agent loops multiply API calls and latency
  • Evaluation: No standard benchmarks; SWE-bench (code), WebArena (web), GAIA (general) are emerging

Key Takeaways

  • Pre-training requires massive data curation, distributed training across thousands of GPUs, and careful stability management
  • PEFT methods (LoRA, QLoRA) make fine-tuning accessible on consumer hardware
  • Alignment (RLHF, DPO) transforms base models into helpful assistants; DPO is simpler and increasingly preferred
  • Inference optimization (quantization, KV cache, speculative decoding, vLLM) is essential for practical deployment
  • Prompt engineering (CoT, few-shot) unlocks capabilities without training; careful prompting can substitute for fine-tuning
  • LLM agents with tool use extend capabilities beyond text generation to real-world interaction