Transformers
Self-Attention
The core mechanism: each token attends to all other tokens to compute a context-aware representation.
Given input embeddings X (sequence_length x d_model):
Q = X * W_Q # queries (n x d_k)
K = X * W_K # keys (n x d_k)
V = X * W_V # values (n x d_v)
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
The sqrt(d_k) scaling prevents dot products from growing large, which would push softmax into regions with tiny gradients.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift by max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # masked positions get ~zero weight
    weights = softmax(scores, axis=-1)
    return weights @ V
Complexity
- Time: O(n^2 * d) where n = sequence length, d = dimension
- Memory: O(n^2) for the attention matrix
- This quadratic scaling is the primary bottleneck for long sequences
Multi-Head Attention
Run h attention heads in parallel, each with its own projections:
head_i = Attention(X * W_Q^i, X * W_K^i, X * W_V^i)
MultiHead(X) = Concat(head_1, ..., head_h) * W_O
Where W_Q^i, W_K^i: (d_model x d_k), W_V^i: (d_model x d_v), d_k = d_v = d_model / h.
Each head can learn different attention patterns: syntactic, semantic, positional, etc.
Parameter count: 4 * d_model^2 (Q, K, V projections + output projection).
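A minimal NumPy sketch of multi-head attention (dimensions are illustrative and biases are omitted; the final shape check confirms the 4 * d_model^2 parameter count):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, h = 4, 16, 4
d_k = d_model // h  # per-head dimension

X = rng.standard_normal((n, d_model))
W_Q = rng.standard_normal((h, d_model, d_k))  # one projection per head
W_K = rng.standard_normal((h, d_model, d_k))
W_V = rng.standard_normal((h, d_model, d_k))
W_O = rng.standard_normal((d_model, d_model))  # output projection

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for i in range(h):
    Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]  # each (n, d_k)
    A = softmax(Q @ K.T / np.sqrt(d_k))           # (n, n) attention weights
    heads.append(A @ V)                           # (n, d_k)

out = np.concatenate(heads, axis=-1) @ W_O        # (n, d_model)
```

Counting entries: the h heads contribute h * 3 * d_model * d_k = 3 * d_model^2 parameters, and W_O adds d_model^2 more.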
Positional Encoding
Self-attention is permutation-equivariant -- it has no notion of position. Positional information must be injected.
Sinusoidal (Original Transformer)
PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
Properties:
- Deterministic, no learned parameters
- Can generalize to unseen sequence lengths (in theory)
- Each dimension oscillates at a different frequency
- PE(pos+k) can be expressed as a linear function of PE(pos)
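The table of encodings can be built in a few vectorized lines; a NumPy sketch (max_len and d_model are arbitrary choices here):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_pe(128, 64)  # position 0 gives sin(0)=0 / cos(0)=1 in each pair
```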
Rotary Position Embedding (RoPE)
Encode position by rotating the query and key vectors:
f(x, pos) = R(pos) * x
where R(pos) is a block-diagonal rotation matrix:
R(pos) = diag(R_1(pos), R_2(pos), ..., R_{d/2}(pos))
R_i(pos) = [[cos(pos*theta_i), -sin(pos*theta_i)],
            [sin(pos*theta_i),  cos(pos*theta_i)]]
theta_i = 10000^{-2i/d}
Key insight: the dot product q^T k depends on relative position (pos_q - pos_k), giving relative position encoding without explicit pairwise computation.
Used in LLaMA, PaLM, and most modern LLMs.
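The relative-position property can be verified numerically: rotate a query and a key at two different absolute offsets with the same relative distance and check that the dot products agree. A NumPy sketch (single vectors rather than full sequences, for brevity):

```python
import numpy as np

def rope(x, pos):
    """Rotate consecutive pairs (x[2i], x[2i+1]) of one vector by pos * theta_i."""
    d = x.shape[-1]
    theta = 10000.0 ** (-np.arange(0, d, 2) / d)  # theta_i = 10000^{-2i/d}
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
d1 = rope(q, 5) @ rope(k, 3)    # positions (5, 3), relative distance -2
d2 = rope(q, 12) @ rope(k, 10)  # positions (12, 10), same relative distance
```

Since R(a)^T R(b) = R(b - a) for rotations, d1 and d2 are equal up to floating-point error.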
ALiBi (Attention with Linear Biases)
Add a linear bias to attention scores based on distance:
softmax(q_i^T k_j - m * |i - j|)
where m is a head-specific slope (geometric sequence from 2^{-8/n} to 2^{-8}).
- No positional embeddings at all -- bias directly in attention
- Strong length extrapolation
- Used in BLOOM, MPT
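The head-specific slopes can be generated directly from the geometric-sequence rule above; a short sketch (for n = 8 heads the slopes run 2^{-1}, 2^{-2}, ..., 2^{-8}):

```python
def alibi_slopes(n_heads):
    # Geometric sequence: m_i = (2^(-8/n))^i for i = 1..n
    ratio = 2.0 ** (-8.0 / n_heads)
    return [ratio ** i for i in range(1, n_heads + 1)]

slopes = alibi_slopes(8)
# Bias added to head h's score at positions (i, j): -slopes[h] * abs(i - j)
```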
Transformer Architecture
Encoder Block
x = x + MultiHeadAttention(LayerNorm(x)) # Pre-norm variant
x = x + FFN(LayerNorm(x))
FFN(x) = GELU(x * W_1 + b_1) * W_2 + b_2 # d_model -> d_ff -> d_model
d_ff is typically 4 * d_model. The FFN acts as a per-position "memory" storing learned patterns.
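The FFN sublayer with its pre-norm residual can be sketched in NumPy (dimensions are illustrative; the tanh approximation of GELU is used, and LayerNorm's learned scale/shift are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def ffn(x, W1, b1, W2, b2):
    return gelu(x @ W1 + b1) @ W2 + b2  # d_model -> d_ff -> d_model

rng = np.random.default_rng(0)
n, d_model, d_ff = 4, 8, 32  # d_ff = 4 * d_model
x = rng.standard_normal((n, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

# Pre-norm residual: x = x + FFN(LayerNorm(x))
y = x + ffn(layer_norm(x), W1, b1, W2, b2)
```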
Decoder Block
Same as encoder but with:
- Causal mask: prevent attending to future positions (lower triangular mask)
- Cross-attention: queries from decoder, keys/values from encoder output
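The causal mask itself is just a lower triangular boolean matrix; a tiny sketch:

```python
import numpy as np

n = 4
# True on/below the diagonal: position i may attend to positions 0..i only
causal_mask = np.tril(np.ones((n, n), dtype=bool))
# Applied by setting scores at masked (False) positions to a large negative number
```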
Pre-Norm vs Post-Norm
- Post-norm (original): LayerNorm after residual addition. Harder to train, needs careful warmup.
- Pre-norm: LayerNorm before attention/FFN. More stable training, slightly worse final performance without tuning.
Most modern models use pre-norm.
Key Architectures
BERT (Bidirectional Encoder Representations from Transformers)
- Architecture: encoder-only transformer
- Pretraining: masked language modeling (randomly select 15% of tokens, replace most with [MASK], predict the originals) + next sentence prediction
- Fine-tuning: add task-specific head, fine-tune all parameters
- Sizes: BERT-base (110M, 12 layers, 768 dim), BERT-large (340M, 24 layers, 1024 dim)
Best for: classification, NER, question answering, sentence similarity.
GPT (Generative Pre-trained Transformer)
- Architecture: decoder-only transformer with causal masking
- Pretraining: next-token prediction (autoregressive language modeling)
- Inference: generate tokens one at a time, each conditioned on all previous
- Scaling: GPT-2 (1.5B), GPT-3 (175B), GPT-4 (undisclosed)
The scaling laws (Kaplan et al., 2020): loss scales as power laws in model size, data size, and compute.
T5 (Text-to-Text Transfer Transformer)
- Architecture: encoder-decoder
- Approach: frame every NLP task as text-to-text (e.g., "translate English to French: ...")
- Pretraining: span corruption (mask contiguous spans, predict them)
Vision Transformer (ViT)
Apply transformer to images by treating patches as tokens:
1. Split image into P x P patches (e.g., 16x16)
2. Flatten and linearly project each patch to d_model
3. Prepend [CLS] token, add positional embeddings
4. Apply standard transformer encoder
5. Classify using [CLS] token representation
Requires large datasets or strong regularization. With sufficient data/pretraining, matches or exceeds CNNs.
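Steps 1-2 above (patch extraction and flattening) can be sketched in NumPy; a 224x224x3 image with 16x16 patches yields 14 x 14 = 196 tokens of dimension 16*16*3 = 768, before the linear projection:

```python
import numpy as np

def patchify(img, P):
    """Split an (H, W, C) image into flattened P x P patches."""
    H, W, C = img.shape
    patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)  # (num_patches, P*P*C)

img = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(img, 16)  # (196, 768); a learned (768, d_model) projection follows
```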
Efficient Transformers
FlashAttention
Exact attention computation using tiling and kernel fusion:
- Divide Q, K, V into blocks that fit in SRAM (fast memory)
- Compute attention block-by-block using online softmax trick
- Never materialize the full n x n attention matrix in HBM (slow memory)
Standard attention: O(n^2) memory, many HBM reads
FlashAttention: O(n) memory, fewer HBM reads (2-4x faster)
FlashAttention-2/3 further optimize with better work partitioning across GPU warps.
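The online softmax trick can be illustrated in one dimension: process the scores block by block, keep a running max and normalizer, and rescale the accumulator whenever the max changes. A minimal NumPy sketch for a single query row (real FlashAttention tiles this over 2-D blocks of Q, K, V):

```python
import numpy as np

def online_softmax_weighted_sum(scores, V, block=2):
    """Streaming softmax(scores) @ V without materializing the full weight vector."""
    m = -np.inf                  # running max of scores seen so far
    s = 0.0                      # running softmax normalizer
    acc = np.zeros(V.shape[-1])  # running weighted sum of values
    for start in range(0, len(scores), block):
        sc = scores[start:start + block]
        vb = V[start:start + block]
        m_new = max(m, sc.max())
        correction = np.exp(m - m_new)  # rescale previous partial results
        p = np.exp(sc - m_new)
        s = s * correction + p.sum()
        acc = acc * correction + p @ vb
        m = m_new
    return acc / s

rng = np.random.default_rng(0)
scores = rng.standard_normal(8)
V = rng.standard_normal((8, 4))
w = np.exp(scores - scores.max())
ref = (w / w.sum()) @ V                            # standard full softmax
out = online_softmax_weighted_sum(scores, V)       # matches ref
```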
Multi-Query Attention (MQA) and Grouped-Query Attention (GQA)
- MQA: all heads share one set of K, V projections. Reduces KV cache size during inference.
- GQA: group heads, share K, V within groups. Compromise between MHA and MQA.
- GQA with g groups: g sets of K, V shared across h/g heads each.
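The head-to-group mapping can be sketched in NumPy (h = 8 query heads sharing g = 2 KV groups is an assumed example configuration; the KV cache shrinks by a factor of h/g versus MHA):

```python
import numpy as np

h, g, n, d_k = 8, 2, 5, 16
rng = np.random.default_rng(0)
Q = rng.standard_normal((h, n, d_k))   # one set of queries per head
K = rng.standard_normal((g, n, d_k))   # only g sets of keys...
V = rng.standard_normal((g, n, d_k))   # ...and values are cached

outputs = []
for head in range(h):
    group = head // (h // g)           # map each head to its KV group
    scores = Q[head] @ K[group].T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    outputs.append(w @ V[group])
out = np.stack(outputs)                # (h, n, d_k)
```

With g = h this reduces to standard MHA; with g = 1 it is exactly MQA.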
KV Cache
During autoregressive generation, cache computed K, V for all previous tokens to avoid recomputation:
Memory per token: 2 * n_layers * d_model * bytes_per_param
For a 7B model: ~0.5 MB per token
At 128K context: ~64 GB just for KV cache
This motivates MQA/GQA and quantized KV caches.
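The figures above follow directly from the per-token formula; a quick check assuming a LLaMA-7B-like shape (32 layers, d_model = 4096, FP16 weights):

```python
def kv_cache_bytes_per_token(n_layers, d_model, bytes_per_param=2):
    # Factor of 2: one K entry and one V entry per layer
    return 2 * n_layers * d_model * bytes_per_param

per_token = kv_cache_bytes_per_token(32, 4096)  # 524288 B = 0.5 MB
total_gib = per_token * 128 * 1024 / 2 ** 30    # 64.0 GiB at 128K context
```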
Other Efficiency Techniques
| Method           | Approach                              | Complexity     |
|------------------|---------------------------------------|----------------|
| Sparse attention | Attend to subset of positions         | O(n * sqrt(n)) |
| Linear attention | Replace softmax with kernel approx    | O(n * d^2)     |
| Sliding window   | Local attention within fixed window   | O(n * w)       |
| Ring attention   | Distribute long sequences across GPUs | O(n^2 / p)     |
Mixture of Experts (MoE)
Replace the dense FFN with a sparse mixture:
y = sum_{i=1}^{E} G(x)_i * Expert_i(x)
where G(x) is a gating network that routes each token to the top-k experts (typically k=1 or 2 out of E=8..64 experts).
import numpy as np

def moe_layer(x, experts, gate, top_k=2):
    gate_scores = softmax(gate(x))                   # (batch, n_experts)
    top_k_indices = np.argsort(gate_scores, axis=-1)[:, -top_k:]
    output = np.zeros_like(x)
    for b in range(x.shape[0]):                      # route each token independently
        for idx in top_k_indices[b]:
            output[b] += gate_scores[b, idx] * experts[idx](x[b])
    return output
Benefits:
- Scale model parameters without proportional compute increase
- Each token uses only k/E of total parameters
- Mixtral 8x7B: 47B total params but ~13B active per token
Challenges: load balancing (auxiliary loss to encourage uniform expert usage), training instability, communication overhead in distributed settings.
Training at Scale
Key Practices
- Mixed precision (FP16/BF16 + FP32 master weights): 2x memory savings, faster matmuls
- Gradient accumulation: simulate larger batch sizes
- Data parallelism: replicate model, split data across GPUs
- Tensor parallelism: split individual layers across GPUs
- Pipeline parallelism: split layers across GPUs, micro-batch pipelining
- ZeRO: shard optimizer states, gradients, and parameters across GPUs
Tokenization
- BPE (Byte Pair Encoding): iteratively merge most frequent character pairs
- WordPiece: similar to BPE but uses likelihood-based merging
- SentencePiece: language-agnostic; operates on raw text with no pre-tokenization (whitespace is an ordinary symbol); implements both BPE and unigram LM
- Typical vocab size: 32K-128K tokens
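The BPE merge loop can be sketched on the classic toy vocabulary (word frequencies are assumed; real implementations add end-of-word markers and operate on much larger corpora):

```python
from collections import Counter

def bpe_train(word_counts, num_merges):
    """Toy BPE: learn merge rules from {word: count}."""
    vocab = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, count in vocab.items():            # count adjacent symbol pairs
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)             # most frequent pair
        merges.append(best)
        new_vocab = {}
        for word, count in vocab.items():            # apply the merge everywhere
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

merges = bpe_train({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
# First merges: ('e', 's'), then ('es', 't') -- "est" becomes a single token
```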