5 min read
On this page

Language Models

Overview

A language model assigns probabilities to sequences of tokens. It captures the statistical structure of language and serves as the foundation for generation, classification, and understanding tasks. The field has progressed from count-based n-gram models to massive neural models exhibiting emergent capabilities.


N-gram Language Models

An n-gram model estimates the probability of a word given the previous n-1 words using maximum likelihood estimation from counts.

P(w_t | w_{t-n+1}, ..., w_{t-1}) = count(w_{t-n+1}...w_t) / count(w_{t-n+1}...w_{t-1})

Sparsity and Smoothing

Most n-grams never appear in training data, yielding zero probabilities. Smoothing redistributes probability mass to unseen events.

| Method | Idea | |---|---| | Laplace (add-1) | Add 1 to all counts; simple but over-smooths | | Add-k | Add k < 1; slightly better | | Backoff | Fall back to (n-1)-gram when n-gram count is zero | | Interpolation | Weighted mixture of unigram through n-gram estimates | | Kneser-Ney | Discount fixed amount from each count; distribute mass based on continuation probability |

Kneser-Ney Smoothing

The key insight: a word's backoff probability should reflect how many different contexts it appears in (continuation count), not its raw frequency.

  • "Francisco" is frequent but almost always follows "San" -- low continuation count
  • "the" appears in many contexts -- high continuation count
  • Modified Kneser-Ney (with multiple discount values) is the gold standard for n-gram LMs

Perplexity

The standard intrinsic evaluation metric for language models.

PP(W) = P(w_1, w_2, ..., w_N)^{-1/N}
      = exp(- (1/N) sum log P(w_t | context))
  • Lower perplexity = better model (assigns higher probability to test data)
  • Equivalent to the exponential of cross-entropy
  • A perplexity of k means the model is as uncertain as choosing uniformly among k options
  • Only comparable between models using the same vocabulary/tokenization

Neural Language Models

Feed-forward Neural LM (Bengio et al., 2003)

  • Fixed context window of n-1 words, each mapped to an embedding
  • Concatenated embeddings passed through hidden layers to predict next word
  • First to show that learned embeddings + neural prediction outperform n-grams
  • Limited by fixed context window

Recurrent Neural LMs

  • RNNs process sequences of arbitrary length with a hidden state
  • LSTMs and GRUs solve the vanishing gradient problem
  • Theoretically unlimited context but practically limited to ~200 tokens
  • Trained with truncated backpropagation through time

Transformer-based LMs

Transformer Architecture with Encoder, Decoder, and Attention

Transformers use self-attention to model long-range dependencies in parallel. Three main architectures define modern LMs.


Autoregressive Models (Decoder-only)

Generate text left-to-right; each token attends only to previous tokens (causal masking).

Architecture: Stack of transformer decoder blocks with causal self-attention.

Training objective: Next-token prediction.

L = -sum_t log P(x_t | x_1, ..., x_{t-1})

Examples: GPT-2, GPT-3, GPT-4, LLaMA, Mistral, Claude

Properties:

  • Natural fit for text generation
  • In-context learning emerges at scale
  • Scale well with more parameters and data
  • Dominate current LLM landscape

Masked Language Models (Encoder-only)

Bidirectional: each token attends to all other tokens. Not autoregressive.

Training objective: Mask 15% of tokens; predict the masked tokens from context.

L = -sum_{t in masked} log P(x_t | x_{\masked})

Examples: BERT, RoBERTa, ALBERT, DeBERTa

Properties:

  • Bidirectional context produces strong representations for understanding tasks
  • Not naturally suited for generation
  • Fine-tuned for classification, NER, QA, etc.
  • RoBERTa improvements: more data, longer training, dynamic masking, no NSP

Encoder-Decoder Models

Encoder processes input bidirectionally; decoder generates output autoregressively, attending to encoder representations via cross-attention.

Training objective: Typically span corruption (T5) -- mask spans of text, generate the masked spans.

Examples: T5, BART, mT5, Flan-T5

Properties:

  • Natural for sequence-to-sequence tasks (translation, summarization)
  • T5 frames all tasks as text-to-text
  • BART uses denoising autoencoder pretraining (deletion, infilling, shuffling, masking)

Scaling Laws

Kaplan et al. (2020) and Hoffmann et al. (2022, "Chinchilla") established power-law relationships.

L(N, D) ~ a/N^alpha + b/D^beta + c

Where N = parameters, D = data tokens, L = loss.

Key findings:

  • Loss decreases as a power law with both model size and data size
  • Chinchilla scaling: optimal training uses roughly 20 tokens per parameter
  • GPT-3 (175B params, 300B tokens) was undertrained by Chinchilla standards
  • LLaMA showed strong performance by training smaller models on much more data

Compute-Optimal Training

| Model | Parameters | Training Tokens | Ratio | |---|---|---|---| | Chinchilla | 70B | 1.4T | 20:1 | | LLaMA | 7-65B | 1-1.4T | 20-200:1 | | LLaMA 2 | 7-70B | 2T | 29-286:1 |

The trend has shifted toward overtrained smaller models for cheaper inference.


Emergent Abilities

Capabilities that appear suddenly at a certain scale, absent in smaller models.

  • Chain-of-thought reasoning: step-by-step problem solving
  • Arithmetic: multi-digit addition/multiplication
  • Code generation: writing functional programs
  • Multilingual transfer: performance on languages with little training data
  • Debate exists on whether emergence is real or an artifact of metric choice (Wei et al. vs Schaeffer et al.)

In-Context Learning (ICL)

The ability to perform tasks by conditioning on examples in the prompt, without gradient updates.

Zero-shot: Task description only, no examples. Few-shot: Task description plus k input-output examples.

Translate English to French:
sea otter => loutre de mer
cheese => fromage
hello =>

Properties:

  • Performance improves with more examples (up to a point)
  • Sensitive to example ordering and formatting
  • May perform pattern matching rather than true learning
  • Theoretical explanations include implicit Bayesian inference, mesa-optimization, and in-context gradient descent

Training Infrastructure

Modern LMs require distributed training across many accelerators.

| Technique | What It Parallelizes | |---|---| | Data parallelism | Replicate model, split data across devices | | Tensor parallelism | Split individual layers across devices | | Pipeline parallelism | Split layers across devices in stages | | Expert parallelism | Route tokens to different experts (MoE) | | ZeRO (DeepSpeed) | Shard optimizer states, gradients, parameters | | FSDP (PyTorch) | Fully sharded data parallelism |

Mixed-precision training (FP16/BF16 with FP32 master weights) and gradient checkpointing reduce memory.


Key Takeaways

  • N-gram models with Kneser-Ney smoothing were the standard for decades and remain useful baselines
  • Perplexity measures how well a model predicts held-out text; lower is better
  • The three transformer LM architectures (autoregressive, masked, encoder-decoder) serve different purposes
  • Scaling laws predict that loss decreases as a power law with compute, data, and parameters
  • Emergent abilities and in-context learning make large autoregressive models qualitatively different from smaller ones
  • Compute-optimal training balances model size against data quantity