
Machine Translation

Overview

Machine translation (MT) automatically translates text between languages. The field has evolved from rule-based systems through statistical models to neural approaches, with transformers now dominating. MT is both a practical application and a proving ground for sequence-to-sequence modeling.


Statistical Machine Translation (SMT)

Noisy Channel Model

Based on Bayes' theorem, translation from French (f) to English (e):

e* = argmax_e P(e|f) = argmax_e P(f|e) * P(e)
  • Translation model P(f|e): trained on parallel corpora; models word/phrase alignments
  • Language model P(e): trained on monolingual English data; ensures fluency
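The decomposition above can be sketched with a toy decoder. The candidate sentences and all probabilities here are illustrative, not from any real model:

```python
import math

# Toy noisy-channel decoder: pick the English candidate maximizing
# P(f|e) * P(e). All probabilities are made up for illustration.
translation_model = {          # P(f|e) for one fixed French input
    "the house": 0.6,
    "house the": 0.6,          # TM alone cannot distinguish word order
}
language_model = {             # P(e) from monolingual English data
    "the house": 0.09,
    "house the": 0.0001,       # LM penalizes disfluent output
}

def decode(candidates):
    """Return argmax_e log P(f|e) + log P(e)."""
    return max(candidates,
               key=lambda e: math.log(translation_model[e])
                           + math.log(language_model[e]))

best = decode(["the house", "house the"])
print(best)  # the fluent candidate wins thanks to the language model
```

Note how the translation model scores both word orders identically; it is the language model that breaks the tie in favor of fluent English.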

Word-Based SMT (IBM Models)

IBM Models 1-5 progressively add alignment complexity:

| Model | Alignment Properties |
|---|---|
| Model 1 | Uniform alignment (position-independent) |
| Model 2 | Position-dependent alignment |
| Model 3 | Fertility (one source word -> multiple target words) |
| Model 4 | Relative distortion |
| Model 5 | Deficiency correction |

Trained with EM algorithm. Model 1's convexity guarantees a global optimum; higher models are initialized from lower ones.
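EM training for Model 1 fits in a few lines. This is a minimal sketch on a toy two-sentence corpus (no NULL word, no smoothing):

```python
from collections import defaultdict

# Minimal IBM Model 1 EM sketch on a toy parallel corpus. The Model 1
# likelihood is convex in t(f|e), so EM reaches the global optimum.
corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"])]

# Uniform initialization of the translation table t(f|e)
f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(20):                     # EM iterations
    count = defaultdict(float)          # expected counts c(f, e)
    total = defaultdict(float)          # marginal counts c(e)
    for fs, es in corpus:
        for f in fs:                    # E-step: posterior alignment probs
            z = sum(t[(f, e)] for e in es)
            for e in es:
                p = t[(f, e)] / z
                count[(f, e)] += p
                total[e] += p
    for (f, e), c in count.items():     # M-step: renormalize per e
        t[(f, e)] = c / total[e]

print(round(t[("haus", "house")], 3))   # converges toward 1
```

Because "das" co-occurs with both "the house" and "the book" while "haus" only co-occurs with "house", the expected counts concentrate probability on the correct alignments within a few iterations.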

Phrase-Based SMT

Translates contiguous sequences of words (phrases) rather than individual words.

Key components:

  1. Phrase table: extracted from word-aligned parallel data; stores phrase pairs with scores
  2. Reordering model: penalizes or allows phrase reordering
  3. Language model: n-gram model for target fluency
  4. Decoding: beam search over phrase segmentations and translations
  5. Tuning: MERT (Minimum Error Rate Training) optimizes feature weights on dev set

Phrase-based SMT dominated 2003-2015 and remains a useful baseline.


Neural Machine Translation (NMT)

Encoder-Decoder Architecture

The foundational NMT architecture (Sutskever et al., 2014; Cho et al., 2014):

Source sentence -> Encoder (BiLSTM) -> Context vector -> Decoder (LSTM) -> Target sentence
  • Encoder reads source sentence into a fixed-length vector
  • Decoder generates target tokens autoregressively, conditioned on the context vector
  • Bottleneck problem: compressing the entire source into one vector loses information for long sentences

Attention Mechanism

Bahdanau et al. (2015) introduced attention to address the bottleneck.

  • At each decoding step, compute attention weights over all encoder states
  • Context vector is a weighted sum of encoder states, different for each target position
  • Attention weights are learned, allowing the decoder to "focus" on relevant source words
score(s_t, h_j) = v^T tanh(W_1 s_t + W_2 h_j)    [additive/Bahdanau]
score(s_t, h_j) = s_t^T W h_j                       [multiplicative/Luong]
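Both scoring functions reduce to small vector operations. A pure-Python sketch with 2-dimensional states and fixed (not learned) weight matrices, purely for illustration:

```python
import math

# Toy attention: one decoder state s_t attends over three encoder
# states h_j. W1, W2, W, v are fixed illustrative values, not learned.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def additive_score(s, h, W1, W2, v):
    """Bahdanau: v^T tanh(W1 s + W2 h), with W1, W2 as 2x2 lists."""
    proj = [math.tanh(sum(W1[i][k] * s[k] for k in range(2))
                      + sum(W2[i][k] * h[k] for k in range(2)))
            for i in range(2)]
    return dot(v, proj)

def multiplicative_score(s, h, W):
    """Luong: s^T W h."""
    return dot(s, [sum(W[i][k] * h[k] for k in range(2)) for i in range(2)])

s_t = [1.0, 0.5]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]           # identity, for readability
v = [1.0, 1.0]

scores = [additive_score(s_t, h, W, W, v) for h in encoder_states]
weights = softmax(scores)               # attention distribution, sums to 1
# Context vector = attention-weighted sum of encoder states
context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
           for i in range(2)]
mult = [multiplicative_score(s_t, h, W) for h in encoder_states]
print([round(w, 3) for w in weights], [round(c, 3) for c in context], mult)
```

The key point is that the context vector is recomputed at every decoder step from a fresh set of weights, so each target position sees a different summary of the source.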

Attention provides interpretable alignment as a byproduct and dramatically improves translation quality.

Transformer-Based NMT

The Transformer (Vaswani et al., 2017) replaced RNNs entirely.

Advantages over RNN-based NMT:

  • Parallelizable training (no sequential dependence)
  • Self-attention captures long-range dependencies directly
  • Multi-head attention captures different relationship types
  • Positional encodings replace recurrence for position information
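The original Transformer's fixed sinusoidal encodings can be sketched directly from the formulas PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)):

```python
import math

# Sinusoidal positional encodings (Vaswani et al., 2017): each position
# maps to a d_model-dimensional vector of sines and cosines at
# geometrically spaced frequencies.
def positional_encoding(pos, d_model):
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))   # even dimension
        pe.append(math.cos(angle))   # odd dimension
    return pe[:d_model]

# Position 0 encodes as alternating sin(0)=0 and cos(0)=1
print(positional_encoding(0, 8))
```

Because each dimension is a sinusoid of a different wavelength, relative offsets between positions correspond to linear transformations of the encoding, which is what lets attention reason about distance without recurrence.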

The Transformer is the foundation for all modern MT systems.


Subword Segmentation for MT

Subword tokenization is critical for MT to handle:

  • Morphologically rich languages (agglutinative, fusional)
  • Rare words and names (transliteration)
  • Code-switching and neologisms

Shared vocabulary: Source and target languages often share a joint BPE/SentencePiece vocabulary, enabling:

  • Parameter sharing in the embedding layer
  • Better handling of cognates and borrowed words
  • A prerequisite for multilingual models
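The core BPE merge loop is small enough to sketch in full. Training it on a joint corpus of both languages is what produces the shared vocabulary; the toy English/German words below are illustrative:

```python
from collections import Counter

# Minimal BPE sketch: learn merges on a joint source+target corpus so
# both languages share one subword vocabulary. Words are tuples of symbols.
def learn_bpe(words, num_merges):
    vocab = Counter(words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():       # count adjacent symbol pairs
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent pair
        merges.append(best)
        merged = {}
        for word, freq in vocab.items():       # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        vocab = merged
    return merges

# Joint English + German toy corpus: cognates end up sharing subwords
corpus = [tuple("house"), tuple("haus"), tuple("house"),
          tuple("mouse"), tuple("maus")]
merges = learn_bpe(corpus, 3)
print(merges)  # first merge is the shared "us" ending
```

Cognates like "house"/"haus" reuse the same merged units, which is exactly the parameter-sharing benefit listed above.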

Decoding Strategies

Greedy Decoding

Select the highest-probability token at each step. Fast but often suboptimal (locally optimal choices can lead to globally poor translations).

Beam Search

Maintain the top-k (beam width) partial hypotheses at each step.

  • Beam width b=4-6 is typical for MT
  • Length normalization: divide log-probability by sequence length to avoid favoring short outputs
  • Coverage penalty: encourage attending to all source words
  • Beam search is near-universal in MT; it significantly outperforms greedy decoding
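The procedure can be sketched end to end over a toy next-token distribution. The hand-written probability table below stands in for a real NMT decoder and is purely an assumption for illustration; the length-normalization exponent alpha = 0.7 is one common choice in the 0.6-1.0 range:

```python
import math

EOS = "</s>"
ALPHA = 0.7     # length-normalization exponent

def next_token_probs(prefix):
    # Illustrative distribution standing in for a trained decoder.
    table = {
        (): {"a": 0.6, "b": 0.3, EOS: 0.1},
        ("a",): {"b": 0.7, "a": 0.1, EOS: 0.2},
        ("b",): {"a": 0.2, "b": 0.1, EOS: 0.7},
        ("a", "b"): {EOS: 0.9, "a": 0.1},
    }
    return table.get(prefix, {EOS: 1.0})

def beam_search(beam_width=2, max_len=4):
    beams = [((), 0.0)]                       # (tokens, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, lp in beams:
            for tok, p in next_token_probs(tokens).items():
                cand = (tokens + (tok,), lp + math.log(p))
                if tok == EOS:
                    finished.append(cand)     # hypothesis is complete
                else:
                    candidates.append(cand)
        # Keep only the top beam_width partial hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if not beams:
            break
    # Length normalization: score = log P(y|x) / |y|^alpha
    return max(finished, key=lambda c: c[1] / (len(c[0]) ** ALPHA))

tokens, _ = beam_search()
print(tokens)
```

Without the final normalization, the single-token hypothesis ending immediately in EOS would be hard to beat, since every extra token multiplies in another probability below 1.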

Length Control

  • Without correction, beam search favors shorter sequences (fewer multiplicative probabilities)
  • Length normalization: score = (1/|y|^alpha) * log P(y|x), alpha typically 0.6-1.0
  • Minimum/maximum length constraints

Evaluation Metrics

BLEU (Bilingual Evaluation Understudy)

The most widely used MT metric, based on n-gram precision.

BLEU = BP * exp(sum_{n=1}^{4} w_n * log(precision_n))
BP = min(1, exp(1 - ref_length/hyp_length))
  • Computes modified precision for 1-grams through 4-grams
  • Brevity penalty (BP) penalizes short translations
  • Corpus-level metric (unreliable at sentence level)
  • Limitations: ignores synonyms, word order sensitivity is limited, does not correlate well with human judgment at high quality levels
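The formula above fits in a short function. This is a minimal sketch for a single hypothesis/reference pair with uniform weights and no smoothing; real implementations (e.g. sacreBLEU) aggregate n-gram counts at the corpus level:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, ref, max_n=4):
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        # Modified precision: clip hypothesis counts by reference counts
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0                 # any zero precision zeroes BLEU
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: BP = min(1, exp(1 - ref_length/hyp_length))
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return bp * math.exp(sum(log_precisions) / max_n)

hyp = "the cat sat on the mat".split()
ref = "the cat sat on the mat".split()
print(round(bleu(hyp, ref), 2))  # identical sentences score 1.0
```

The zero-precision early return shows why unsmoothed BLEU is unreliable at the sentence level: one missing 4-gram collapses the whole score to zero.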

Beyond BLEU

| Metric | Approach | Advantages |
|---|---|---|
| chrF | Character n-gram F-score | Better for morphologically rich languages |
| TER | Edit distance to reference | Interpretable (number of edits) |
| METEOR | Unigram matching with stemming, synonyms | Better sentence-level correlation |
| BERTScore | Cosine similarity of BERT embeddings per token | Captures semantic similarity |
| COMET | Trained regression model on human scores | Best correlation with human judgment |
| BLEURT | Fine-tuned BERT for quality estimation | Learned metric |

Current best practice: Report BLEU for comparability, COMET for quality assessment.

Human Evaluation

  • Adequacy: Does the translation convey the meaning? (1-5 scale)
  • Fluency: Is the translation grammatical and natural? (1-5 scale)
  • Direct Assessment (DA): Rate translation quality on a 0-100 scale
  • MQM (Multidimensional Quality Metrics): Annotate specific error types and severities
  • Human evaluation is the gold standard but expensive and slow

Multilingual Models

Multilingual BERT (mBERT)

  • Pretrained on 104 languages with shared WordPiece vocabulary
  • Surprisingly effective for zero-shot cross-lingual transfer
  • Not explicitly trained for translation but representations are partially aligned across languages

Multilingual Translation Models

| Model | Languages | Key Feature |
|---|---|---|
| mBART | 25 languages | Denoising autoencoder pretraining |
| M2M-100 | 100 languages | Direct many-to-many (no English pivot) |
| NLLB-200 | 200 languages | Focus on low-resource languages |
| MADLAD-400 | 450+ languages | Largest language coverage |

Key techniques:

  • Temperature-based sampling: upsample low-resource languages during training
  • Language tokens: prepend target language tag to control output language
  • Back-translation: translate monolingual target data back to source for data augmentation
  • Shared encoder-decoder with language-specific adapters
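Temperature-based sampling from the list above amounts to raising each language pair's data share to the power 1/T and renormalizing. The example language-pair sizes are made up for illustration:

```python
# Temperature-based data sampling: pair l with n_l examples is sampled
# with probability proportional to (n_l / N)^(1/T). T=1 follows the raw
# data distribution; larger T flattens it, upsampling low-resource pairs.
def sampling_probs(sizes, T):
    total = sum(sizes.values())
    weights = {l: (n / total) ** (1.0 / T) for l, n in sizes.items()}
    z = sum(weights.values())
    return {l: w / z for l, w in weights.items()}

sizes = {"en-fr": 1_000_000, "en-de": 100_000, "en-sw": 1_000}
print(sampling_probs(sizes, T=1))   # proportional to data size
print(sampling_probs(sizes, T=5))   # low-resource en-sw upsampled
```

At T=1 the low-resource pair is seen in under 0.1% of batches; at T=5 its share grows by two orders of magnitude, at the cost of repeating its data more often.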

Zero-Shot Translation

Multilingual models can translate between language pairs never seen together in training (e.g., train on En-Fr and En-De, translate Fr-De). Quality is lower than supervised but useful for low-resource pairs.


Challenges in MT

Low-resource languages: Limited parallel data. Mitigations include transfer learning, back-translation, multilingual models, and leveraging monolingual data.

Document-level translation: Sentence-level MT ignores discourse coherence, pronoun resolution across sentences, and document-level terminology consistency.

Domain adaptation: Models trained on general data perform poorly on specialized domains (medical, legal). Fine-tuning on domain-specific parallel data or terminology injection helps.

Hallucination: MT models sometimes generate fluent but unfaithful translations, especially under domain shift or for low-resource languages.

Real-time translation: Speech translation and simultaneous interpretation require streaming architectures that translate from partial input.


Key Takeaways

  • SMT decomposed translation into separately trainable components; phrase-based SMT dominated for a decade
  • Attention-based encoder-decoder models eliminated the information bottleneck of fixed-length encoding
  • Transformers enable parallel training and model long-range dependencies, becoming the universal MT architecture
  • BLEU is standard but correlates poorly with human judgment at high quality; COMET is preferred
  • Multilingual models (NLLB, M2M-100) enable translation across 200+ languages, including low-resource pairs
  • Beam search with length normalization is essential for high-quality MT decoding