Machine Translation
Overview
Machine translation (MT) automatically translates text between languages. The field has evolved from rule-based systems through statistical models to neural approaches, with transformers now dominating. MT is both a practical application and a proving ground for sequence-to-sequence modeling.
Statistical Machine Translation (SMT)
Noisy Channel Model
Based on Bayes' theorem, translation from French (f) to English (e):
e* = argmax_e P(e|f) = argmax_e P(f|e) * P(e)
- Translation model P(f|e): trained on parallel corpora; models word/phrase alignments
- Language model P(e): trained on monolingual English data; ensures fluency
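The decomposition above can be sketched as a toy reranker, with hand-picked (hypothetical) probabilities standing in for the trained translation and language models:

```python
import math

# Noisy-channel reranking: pick the English candidate e maximizing
# log P(f|e) + log P(e). All numbers are hypothetical, for illustration.
candidates = {
    "the cat":  {"tm": 0.20, "lm": 0.0100},   # P(f|e), P(e)
    "cat the":  {"tm": 0.20, "lm": 0.0001},   # same TM score, disfluent
    "a feline": {"tm": 0.05, "lm": 0.0080},
}

def noisy_channel_score(scores):
    return math.log(scores["tm"]) + math.log(scores["lm"])

best = max(candidates, key=lambda e: noisy_channel_score(candidates[e]))
print(best)  # → the cat
```

Note how the language model breaks the tie between "the cat" and "cat the", which the translation model scores identically: fluency comes from P(e).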
Word-Based SMT (IBM Models)
IBM Models 1-5 progressively add alignment complexity:
| Model | Alignment Properties |
|---|---|
| Model 1 | Uniform alignment (position-independent) |
| Model 2 | Position-dependent alignment |
| Model 3 | Fertility (one source word -> multiple target words) |
| Model 4 | Relative distortion |
| Model 5 | Deficiency correction |
Trained with EM algorithm. Model 1's convexity guarantees a global optimum; higher models are initialized from lower ones.
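Model 1's EM training can be sketched on a two-sentence toy corpus (the sentence pairs and iteration count are illustrative choices):

```python
from collections import defaultdict

# Minimal IBM Model 1 EM: t[f][e] approximates P(f|e).
# Uniform initialization, then alternating E- and M-steps.
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
]

f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}
t = {f: {e: 1.0 / len(f_vocab) for e in e_vocab} for f in f_vocab}

for _ in range(10):  # EM iterations
    count = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    for fs, es in corpus:
        for f in fs:
            z = sum(t[f][e] for e in es)   # E-step: normalize over alignments
            for e in es:
                c = t[f][e] / z            # expected alignment count
                count[f][e] += c
                total[e] += c
    for f in f_vocab:                      # M-step: renormalize counts
        for e in e_vocab:
            t[f][e] = count[f][e] / total[e] if total[e] else 0.0

# t["das"]["the"] approaches 1.0: "das" co-occurs with "the" in both pairs,
# so EM pulls probability mass onto that alignment.
```

Because the Model 1 likelihood is convex, this converges to the same solution regardless of (uniform) initialization.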
Phrase-Based SMT
Translates contiguous sequences of words (phrases) rather than individual words.
Key components:
- Phrase table: extracted from word-aligned parallel data; stores phrase pairs with scores
- Reordering model: penalizes or allows phrase reordering
- Language model: n-gram model for target fluency
- Decoding: beam search over phrase segmentations and translations
- Tuning: MERT (Minimum Error Rate Training) optimizes feature weights on dev set
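The log-linear hypothesis scoring that MERT tunes can be illustrated with hypothetical feature values and weights:

```python
import math

# Phrase-based decoders score each hypothesis as a weighted sum of
# features; MERT tunes the weights on a dev set. All values below are
# hypothetical, for illustration only.
weights = {"log_tm": 1.0, "log_lm": 0.8, "distortion": 0.3, "word_penalty": 0.1}

def score(features):
    return sum(weights[k] * v for k, v in features.items())

hyp_a = {"log_tm": math.log(0.30), "log_lm": math.log(0.050),
         "distortion": -1.0, "word_penalty": -4.0}
hyp_b = {"log_tm": math.log(0.35), "log_lm": math.log(0.001),
         "distortion": -3.0, "word_penalty": -4.0}

best = max([hyp_a, hyp_b], key=score)  # hyp_a wins: fluency outweighs TM edge
```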
Phrase-based SMT dominated 2003-2015 and remains a useful baseline.
Neural Machine Translation (NMT)
Encoder-Decoder Architecture
The foundational NMT architecture (Sutskever et al., 2014; Cho et al., 2014):
Source sentence -> Encoder (RNN, e.g. LSTM/GRU) -> Context vector -> Decoder (RNN, e.g. LSTM/GRU) -> Target sentence
- Encoder reads source sentence into a fixed-length vector
- Decoder generates target tokens autoregressively, conditioned on the context vector
- Bottleneck problem: compressing the entire source into one vector loses information for long sentences
Attention Mechanism
Bahdanau et al. (2015) introduced attention to address the bottleneck.
- At each decoding step, compute attention weights over all encoder states
- Context vector is a weighted sum of encoder states, different for each target position
- Attention weights are learned, allowing the decoder to "focus" on relevant source words
score(s_t, h_j) = v^T tanh(W_1 s_t + W_2 h_j) [additive/Bahdanau]
score(s_t, h_j) = s_t^T W h_j [multiplicative/Luong]
Attention provides interpretable alignment as a byproduct and dramatically improves translation quality.
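Both scoring functions and the resulting attention distribution can be sketched with small hand-picked vectors (the matrices here are illustrative stand-ins, not trained parameters):

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def additive_score(s, h, W1, W2, v):        # Bahdanau: v^T tanh(W1 s + W2 h)
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, s), matvec(W2, h))]
    return dot(v, hidden)

def multiplicative_score(s, h, W):          # Luong: s^T W h
    return dot(s, matvec(W, h))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

s = [1.0, 0.0]                              # decoder state s_t
encoder_states = [[1.0, 0.0], [0.0, 1.0]]   # encoder states h_j
W = [[1.0, 0.0], [0.0, 1.0]]                # identity, for illustration

scores = [multiplicative_score(s, h, W) for h in encoder_states]
attn = softmax(scores)                      # attention weights over source
add_w = additive_score(s, encoder_states[0], W, W, [1.0, 1.0])
```

With these toy values, the decoder state attends mostly to the first encoder state; the context vector would be the `attn`-weighted sum of `encoder_states`.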
Transformer-Based NMT
The Transformer (Vaswani et al., 2017) replaced RNNs entirely.
Advantages over RNN-based NMT:
- Parallelizable training (no sequential dependence)
- Self-attention captures long-range dependencies directly
- Multi-head attention captures different relationship types
- Positional encodings replace recurrence for position information
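The last point, sinusoidal positional encodings from the original paper, can be sketched directly:

```python
import math

# PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
# PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
def positional_encoding(pos, d_model):
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** ((i // 2 * 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

pe0 = positional_encoding(0, 4)  # → [0.0, 1.0, 0.0, 1.0]
pe1 = positional_encoding(1, 4)  # distinct vector for position 1
```

Each position gets a unique, deterministic vector whose frequencies vary by dimension, letting attention recover relative positions without recurrence.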
The Transformer is the foundation for all modern MT systems.
Subword Segmentation for MT
Subword tokenization is critical for MT to handle:
- Morphologically rich languages (agglutinative, fusional)
- Rare words and names (transliteration)
- Code-switching and neologisms
Shared vocabulary: Source and target languages often share a joint BPE/SentencePiece vocabulary, enabling:
- Parameter sharing in the embedding layer
- Better handling of cognates and borrowed words
- A prerequisite for multilingual models
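A minimal sketch of BPE merge learning (after Sennrich et al., 2016) on a toy word-frequency vocabulary, with `</w>` marking word ends:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    pattern = r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)"
    merged = "".join(pair)
    return {re.sub(pattern, merged, w): f for w, f in vocab.items()}

vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6}
merges = []
for _ in range(3):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)
# First merge is ("w", "e"): it occurs 2 + 6 = 8 times.
```

The learned merge list is applied in order at inference time, so frequent words end up as single tokens while rare words decompose into subwords.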
Decoding Strategies
Greedy Decoding
Select the highest-probability token at each step. Fast but often suboptimal (locally optimal choices can lead to globally poor translations).
Beam Search
Maintain the top-k (beam width) partial hypotheses at each step.
- Beam width b=4-6 is typical for MT
- Length normalization: divide log-probability by sequence length to avoid favoring short outputs
- Coverage penalty: encourage attending to all source words
- Beam search is near-universal in MT; it significantly outperforms greedy decoding
Length Control
- Without correction, beam search favors shorter sequences (fewer multiplicative probabilities)
- Length normalization: score = (1/|y|^alpha) * log P(y|x), alpha typically 0.6-1.0
- Minimum/maximum length constraints
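Beam search with length normalization can be sketched over a toy next-token table standing in for an NMT decoder (all probabilities are hypothetical):

```python
import math

EOS = "</s>"
# Hand-built next-token distributions per prefix; a real system would
# query the decoder network here.
model = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.5, EOS: 0.5},
    ("a",): {"cat": 0.9, EOS: 0.1},
    ("the", "cat"): {EOS: 1.0},
    ("a", "cat"): {EOS: 1.0},
}

def beam_search(beam_width=2, max_len=3, alpha=0.6):
    beams = [((), 0.0)]                 # (prefix, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            for tok, p in model[prefix].items():
                cand = (prefix + (tok,), logp + math.log(p))
                (finished if tok == EOS else candidates).append(cand)
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if not beams:
            break
    # Length-normalized score: log P(y|x) / |y|^alpha
    return max(finished, key=lambda c: c[1] / len(c[0]) ** alpha)

best, logp = beam_search()
print(" ".join(best))  # → a cat </s>
```

Note that greedy decoding would commit to "the" (p=0.6) at step one, while the beam keeps "a" alive long enough for its high-probability continuation to win.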
Evaluation Metrics
BLEU (Bilingual Evaluation Understudy)
The most widely used MT metric, based on n-gram precision.
BLEU = BP * exp(sum_{n=1}^{4} w_n * log(precision_n))
BP = min(1, exp(1 - ref_length/hyp_length))
- Computes modified precision for 1-grams through 4-grams
- Brevity penalty (BP) penalizes short translations
- Corpus-level metric (unreliable at sentence level)
- Limitations: ignores synonyms, word order sensitivity is limited, does not correlate well with human judgment at high quality levels
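A minimal sentence-level BLEU sketch following the formula above (single reference, no smoothing, so any zero n-gram precision yields BLEU 0):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, ref, max_n=4):
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())        # clipped (modified) counts
        total = max(sum(h.values()), 1)
        if overlap == 0:
            return 0.0                         # unsmoothed: zero precision
        log_prec += (1 / max_n) * math.log(overlap / total)
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))   # brevity penalty
    return bp * math.exp(log_prec)

hyp = "the cat sat on the mat".split()
print(bleu(hyp, hyp))  # → 1.0 (identical hypothesis and reference)
```

The `Counter` intersection implements count clipping: a hypothesis repeating "the" cannot earn more credit than the reference supplies. Production systems should use a standard implementation (e.g. sacreBLEU) for comparable scores.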
Beyond BLEU
| Metric | Approach | Advantages |
|---|---|---|
| chrF | Character n-gram F-score | Better for morphologically rich languages |
| TER | Edit distance to reference | Interpretable (number of edits) |
| METEOR | Unigram matching with stemming, synonyms | Better sentence-level correlation |
| BERTScore | Cosine similarity of BERT embeddings per token | Captures semantic similarity |
| COMET | Trained regression model on human scores | Best correlation with human judgment |
| BLEURT | Fine-tuned BERT for quality estimation | Learned metric |
Current best practice: Report BLEU for comparability, COMET for quality assessment.
Human Evaluation
- Adequacy: Does the translation convey the meaning? (1-5 scale)
- Fluency: Is the translation grammatical and natural? (1-5 scale)
- Direct Assessment (DA): Rate translation quality on a 0-100 scale
- MQM (Multidimensional Quality Metrics): Annotate specific error types and severities
- Human evaluation is the gold standard but expensive and slow
Multilingual Models
Multilingual BERT (mBERT)
- Pretrained on 104 languages with shared WordPiece vocabulary
- Surprisingly effective for zero-shot cross-lingual transfer
- Not explicitly trained for translation but representations are partially aligned across languages
Multilingual Translation Models
| Model | Languages | Key Feature |
|---|---|---|
| mBART | 25 languages | Denoising autoencoder pretraining |
| M2M-100 | 100 languages | Direct many-to-many (no English pivot) |
| NLLB-200 | 200 languages | Focus on low-resource languages |
| MADLAD-400 | 450+ languages | Largest language coverage |
Key techniques:
- Temperature-based sampling: upsample low-resource languages during training
- Language tokens: prepend target language tag to control output language
- Back-translation: translate monolingual target data back to source for data augmentation
- Shared encoder-decoder with language-specific adapters
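The first technique can be sketched as p_l ∝ (n_l / N)^(1/T): with temperature T = 1 languages are sampled in proportion to data size, and larger T flattens the mix toward low-resource languages (corpus sizes below are hypothetical):

```python
# Temperature-based sampling probabilities for a multilingual training mix.
sizes = {"en-fr": 40_000_000, "en-de": 9_000_000, "en-sw": 100_000}

def sampling_probs(sizes, T):
    total = sum(sizes.values())
    weights = {l: (n / total) ** (1 / T) for l, n in sizes.items()}
    z = sum(weights.values())
    return {l: w / z for l, w in weights.items()}

p1 = sampling_probs(sizes, T=1)   # proportional: en-sw is rarely sampled
p5 = sampling_probs(sizes, T=5)   # flatter: en-sw upsampled substantially
```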
Zero-Shot Translation
Multilingual models can translate between language pairs never seen together in training (e.g., train on En-Fr and En-De, translate Fr-De). Quality is lower than supervised but useful for low-resource pairs.
Challenges in MT
Low-resource languages: Limited parallel data. Mitigations include transfer learning, back-translation, multilingual models, and leveraging monolingual data.
Document-level translation: Sentence-level MT ignores discourse coherence, pronoun resolution across sentences, and document-level terminology consistency.
Domain adaptation: Models trained on general data perform poorly on specialized domains (medical, legal). Fine-tuning on domain-specific parallel data or terminology injection helps.
Hallucination: MT models sometimes generate fluent but unfaithful translations, especially under domain shift or for low-resource languages.
Real-time translation: Speech translation, simultaneous interpretation require streaming architectures with partial input.
Key Takeaways
- SMT decomposed translation into separately trainable components; phrase-based SMT dominated for a decade
- Attention-based encoder-decoder models eliminated the information bottleneck of fixed-length encoding
- Transformers enable parallel training and model long-range dependencies, becoming the universal MT architecture
- BLEU is standard but correlates poorly with human judgment at high quality; COMET is preferred
- Multilingual models (NLLB, M2M-100) enable translation across 200+ languages, including low-resource pairs
- Beam search with length normalization is essential for high-quality MT decoding