
Machine Translation

Overview

Machine translation (MT) automatically translates text between languages. The field has evolved from rule-based systems through statistical models to neural approaches, with transformers now dominating. MT is both a practical application and a proving ground for sequence-to-sequence modeling.


Statistical Machine Translation (SMT)

Noisy Channel Model

Based on Bayes' theorem, translation from French (f) to English (e):

e* = argmax_e P(e|f) = argmax_e P(f|e) * P(e)
  • Translation model P(f|e): trained on parallel corpora; models word/phrase alignments
  • Language model P(e): trained on monolingual English data; ensures fluency
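The decomposition above can be sketched with a toy decoder. The candidate sentences and all probabilities here are illustrative, not from any real model:

```python
import math

# Toy noisy-channel decoder: pick the English candidate maximizing
# P(f|e) * P(e). All probabilities are made up for illustration.
translation_model = {          # P(f|e) for one fixed French input
    "the house": 0.6,
    "house the": 0.6,          # TM alone cannot distinguish word order
}
language_model = {             # P(e) from monolingual English data
    "the house": 0.09,
    "house the": 0.0001,       # LM penalizes disfluent output
}

def decode(candidates):
    """Return argmax_e log P(f|e) + log P(e)."""
    return max(candidates,
               key=lambda e: math.log(translation_model[e])
                           + math.log(language_model[e]))

best = decode(["the house", "house the"])
print(best)  # the fluent candidate wins thanks to the language model
```

Note how the translation model scores both word orders identically; it is the language model that breaks the tie in favor of fluent English.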

Word-Based SMT (IBM Models)

IBM Models 1-5 progressively add alignment complexity:

| Model | Alignment Properties |
|---|---|
| Model 1 | Uniform alignment (position-independent) |
| Model 2 | Position-dependent alignment |
| Model 3 | Fertility (one source word -> multiple target words) |
| Model 4 | Relative distortion |
| Model 5 | Deficiency correction |

Trained with EM algorithm. Model 1's convexity guarantees a global optimum; higher models are initialized from lower ones.
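EM training for Model 1 fits in a few lines. This is a minimal sketch on a toy two-sentence corpus (no NULL word, no smoothing):

```python
from collections import defaultdict

# Minimal IBM Model 1 EM sketch on a toy parallel corpus. The Model 1
# likelihood is convex in t(f|e), so EM reaches the global optimum.
corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"])]

# Uniform initialization of the translation table t(f|e)
f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(20):                     # EM iterations
    count = defaultdict(float)          # expected counts c(f, e)
    total = defaultdict(float)          # marginal counts c(e)
    for fs, es in corpus:
        for f in fs:                    # E-step: posterior alignment probs
            z = sum(t[(f, e)] for e in es)
            for e in es:
                p = t[(f, e)] / z
                count[(f, e)] += p
                total[e] += p
    for (f, e), c in count.items():     # M-step: renormalize per e
        t[(f, e)] = c / total[e]

print(round(t[("haus", "house")], 3))   # converges toward 1
```

Because "das" co-occurs with both "the house" and "the book" while "haus" only co-occurs with "house", the expected counts concentrate probability on the correct alignments within a few iterations.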

Phrase-Based SMT

Translates contiguous sequences of words (phrases) rather than individual words.

Key components:

  1. Phrase table: extracted from word-aligned parallel data; stores phrase pairs with scores
  2. Reordering model: penalizes or allows phrase reordering
  3. Language model: n-gram model for target fluency
  4. Decoding: beam search over phrase segmentations and translations
  5. Tuning: MERT (Minimum Error Rate Training) optimizes feature weights on dev set

Phrase-based SMT dominated 2003-2015 and remains a useful baseline.


Neural Machine Translation (NMT)

Encoder-Decoder Architecture

The foundational NMT architecture (Sutskever et al., 2014; Cho et al., 2014):

Source sentence -> Encoder (BiLSTM) -> Context vector -> Decoder (LSTM) -> Target sentence
  • Encoder reads source sentence into a fixed-length vector
  • Decoder generates target tokens autoregressively, conditioned on the context vector
  • Bottleneck problem: compressing the entire source into one vector loses information for long sentences

Attention Mechanism

Bahdanau et al. (2015) introduced attention to address the bottleneck.

  • At each decoding step, compute attention weights over all encoder states
  • Context vector is a weighted sum of encoder states, different for each target position
  • Attention weights are learned, allowing the decoder to "focus" on relevant source words
score(s_t, h_j) = v^T tanh(W_1 s_t + W_2 h_j)    [additive/Bahdanau]
score(s_t, h_j) = s_t^T W h_j                       [multiplicative/Luong]
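Both scoring functions reduce to small vector operations. A pure-Python sketch with 2-dimensional states and fixed (not learned) weight matrices, purely for illustration:

```python
import math

# Toy attention: one decoder state s_t attends over three encoder
# states h_j. W1, W2, W, v are fixed illustrative values, not learned.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def additive_score(s, h, W1, W2, v):
    """Bahdanau: v^T tanh(W1 s + W2 h), with W1, W2 as 2x2 lists."""
    proj = [math.tanh(sum(W1[i][k] * s[k] for k in range(2))
                      + sum(W2[i][k] * h[k] for k in range(2)))
            for i in range(2)]
    return dot(v, proj)

def multiplicative_score(s, h, W):
    """Luong: s^T W h."""
    return dot(s, [sum(W[i][k] * h[k] for k in range(2)) for i in range(2)])

s_t = [1.0, 0.5]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]           # identity, for readability
v = [1.0, 1.0]

scores = [additive_score(s_t, h, W, W, v) for h in encoder_states]
weights = softmax(scores)               # attention distribution, sums to 1
# Context vector = attention-weighted sum of encoder states
context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
           for i in range(2)]
mult = [multiplicative_score(s_t, h, W) for h in encoder_states]
print([round(w, 3) for w in weights], [round(c, 3) for c in context], mult)
```

The key point is that the context vector is recomputed at every decoder step from a fresh set of weights, so each target position sees a different summary of the source.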

Attention provides interpretable alignment as a byproduct and dramatically improves translation quality.

Transformer-Based NMT

The Transformer (Vaswani et al., 2017) replaced RNNs entirely.

Advantages over RNN-based NMT:

  • Parallelizable training (no sequential dependence)
  • Self-attention captures long-range dependencies directly
  • Multi-head attention captures different relationship types
  • Positional encodings replace recurrence for position information
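The original Transformer's fixed sinusoidal encodings can be sketched directly from the formulas PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)):

```python
import math

# Sinusoidal positional encodings (Vaswani et al., 2017): each position
# maps to a d_model-dimensional vector of sines and cosines at
# geometrically spaced frequencies.
def positional_encoding(pos, d_model):
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))   # even dimension
        pe.append(math.cos(angle))   # odd dimension
    return pe[:d_model]

# Position 0 encodes as alternating sin(0)=0 and cos(0)=1
print(positional_encoding(0, 8))
```

Because each dimension is a sinusoid of a different wavelength, relative offsets between positions correspond to linear transformations of the encoding, which is what lets attention reason about distance without recurrence.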

The Transformer is the foundation for all modern MT systems.


Subword Segmentation for MT

Subword tokenization is critical for MT to handle:

  • Morphologically rich languages (agglutinative, fusional)
  • Rare words and names (transliteration)
  • Code-switching and neologisms

Shared vocabulary: Source and target languages often share a joint BPE/SentencePiece vocabulary, enabling:

  • Parameter sharing in the embedding layer
  • Better handling of cognates and borrowed words
  • A prerequisite for multilingual models
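The core BPE merge loop is small enough to sketch in full. Training it on a joint corpus of both languages is what produces the shared vocabulary; the toy English/German words below are illustrative:

```python
from collections import Counter

# Minimal BPE sketch: learn merges on a joint source+target corpus so
# both languages share one subword vocabulary. Words are tuples of symbols.
def learn_bpe(words, num_merges):
    vocab = Counter(words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():       # count adjacent symbol pairs
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent pair
        merges.append(best)
        merged = {}
        for word, freq in vocab.items():       # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        vocab = merged
    return merges

# Joint English + German toy corpus: cognates end up sharing subwords
corpus = [tuple("house"), tuple("haus"), tuple("house"),
          tuple("mouse"), tuple("maus")]
merges = learn_bpe(corpus, 3)
print(merges)  # first merge is the shared "us" ending
```

Cognates like "house"/"haus" reuse the same merged units, which is exactly the parameter-sharing benefit listed above.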

Decoding Strategies

Greedy Decoding

Select the highest-probability token at each step. Fast but often suboptimal (locally optimal choices can lead to globally poor translations).

Beam Search

Maintain the top-k (beam width) partial hypotheses at each step.

  • Beam width b=4-6 is typical for MT
  • Length normalization: divide log-probability by sequence length to avoid favoring short outputs
  • Coverage penalty: encourage attending to all source words
  • Beam search is near-universal in MT; it significantly outperforms greedy decoding
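The procedure can be sketched end to end over a toy next-token distribution. The hand-written probability table below stands in for a real NMT decoder and is purely an assumption for illustration; the length-normalization exponent alpha = 0.7 is one common choice in the 0.6-1.0 range:

```python
import math

EOS = "</s>"
ALPHA = 0.7     # length-normalization exponent

def next_token_probs(prefix):
    # Illustrative distribution standing in for a trained decoder.
    table = {
        (): {"a": 0.6, "b": 0.3, EOS: 0.1},
        ("a",): {"b": 0.7, "a": 0.1, EOS: 0.2},
        ("b",): {"a": 0.2, "b": 0.1, EOS: 0.7},
        ("a", "b"): {EOS: 0.9, "a": 0.1},
    }
    return table.get(prefix, {EOS: 1.0})

def beam_search(beam_width=2, max_len=4):
    beams = [((), 0.0)]                       # (tokens, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, lp in beams:
            for tok, p in next_token_probs(tokens).items():
                cand = (tokens + (tok,), lp + math.log(p))
                if tok == EOS:
                    finished.append(cand)     # hypothesis is complete
                else:
                    candidates.append(cand)
        # Keep only the top beam_width partial hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if not beams:
            break
    # Length normalization: score = log P(y|x) / |y|^alpha
    return max(finished, key=lambda c: c[1] / (len(c[0]) ** ALPHA))

tokens, _ = beam_search()
print(tokens)
```

Without the final normalization, the single-token hypothesis ending immediately in EOS would be hard to beat, since every extra token multiplies in another probability below 1.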

Length Control

  • Without correction, beam search favors shorter sequences (fewer multiplicative probabilities)
  • Length normalization: score = (1/|y|^alpha) * log P(y|x), alpha typically 0.6-1.0
  • Minimum/maximum length constraints

Evaluation Metrics

BLEU (Bilingual Evaluation Understudy)

The most widely used MT metric, based on n-gram precision.

BLEU = BP * exp(sum_{n=1}^{4} w_n * log(precision_n))
BP = min(1, exp(1 - ref_length/hyp_length))
  • Computes modified precision for 1-grams through 4-grams
  • Brevity penalty (BP) penalizes short translations
  • Corpus-level metric (unreliable at sentence level)
  • Limitations: ignores synonyms, word order sensitivity is limited, does not correlate well with human judgment at high quality levels
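The formula above fits in a short function. This is a minimal sketch for a single hypothesis/reference pair with uniform weights and no smoothing; real implementations (e.g. sacreBLEU) aggregate n-gram counts at the corpus level:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, ref, max_n=4):
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        # Modified precision: clip hypothesis counts by reference counts
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0                 # any zero precision zeroes BLEU
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: BP = min(1, exp(1 - ref_length/hyp_length))
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return bp * math.exp(sum(log_precisions) / max_n)

hyp = "the cat sat on the mat".split()
ref = "the cat sat on the mat".split()
print(round(bleu(hyp, ref), 2))  # identical sentences score 1.0
```

The zero-precision early return shows why unsmoothed BLEU is unreliable at the sentence level: one missing 4-gram collapses the whole score to zero.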

Beyond BLEU

| Metric | Approach | Advantages |
|---|---|---|
| chrF | Character n-gram F-score | Better for morphologically rich languages |
| TER | Edit distance to reference | Interpretable (number of edits) |
| METEOR | Unigram matching with stemming, synonyms | Better sentence-level correlation |
| BERTScore | Cosine similarity of BERT embeddings per token | Captures semantic similarity |
| COMET | Trained regression model on human scores | Best correlation with human judgment |
| BLEURT | Fine-tuned BERT for quality estimation | Learned metric |

Current best practice: Report BLEU for comparability, COMET for quality assessment.

Human Evaluation

  • Adequacy: Does the translation convey the meaning? (1-5 scale)
  • Fluency: Is the translation grammatical and natural? (1-5 scale)
  • Direct Assessment (DA): Rate translation quality on a 0-100 scale
  • MQM (Multidimensional Quality Metrics): Annotate specific error types and severities
  • Human evaluation is the gold standard but expensive and slow

Multilingual Models

Multilingual BERT (mBERT)

  • Pretrained on 104 languages with shared WordPiece vocabulary
  • Surprisingly effective for zero-shot cross-lingual transfer
  • Not explicitly trained for translation but representations are partially aligned across languages

Multilingual Translation Models

| Model | Languages | Key Feature |
|---|---|---|
| mBART | 25 languages | Denoising autoencoder pretraining |
| M2M-100 | 100 languages | Direct many-to-many (no English pivot) |
| NLLB-200 | 200 languages | Focus on low-resource languages |
| MADLAD-400 | 450+ languages | Largest language coverage |

Key techniques:

  • Temperature-based sampling: upsample low-resource languages during training
  • Language tokens: prepend target language tag to control output language
  • Back-translation: translate monolingual target data back to source for data augmentation
  • Shared encoder-decoder with language-specific adapters
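Temperature-based sampling from the list above amounts to raising each language pair's data share to the power 1/T and renormalizing. The example language-pair sizes are made up for illustration:

```python
# Temperature-based data sampling: pair l with n_l examples is sampled
# with probability proportional to (n_l / N)^(1/T). T=1 follows the raw
# data distribution; larger T flattens it, upsampling low-resource pairs.
def sampling_probs(sizes, T):
    total = sum(sizes.values())
    weights = {l: (n / total) ** (1.0 / T) for l, n in sizes.items()}
    z = sum(weights.values())
    return {l: w / z for l, w in weights.items()}

sizes = {"en-fr": 1_000_000, "en-de": 100_000, "en-sw": 1_000}
print(sampling_probs(sizes, T=1))   # proportional to data size
print(sampling_probs(sizes, T=5))   # low-resource en-sw upsampled
```

At T=1 the low-resource pair is seen in under 0.1% of batches; at T=5 its share grows by two orders of magnitude, at the cost of repeating its data more often.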

Zero-Shot Translation

Multilingual models can translate between language pairs never seen together in training (e.g., train on En-Fr and En-De, translate Fr-De). Quality is lower than supervised but useful for low-resource pairs.


Challenges in MT

Low-resource languages: Limited parallel data. Mitigations include transfer learning, back-translation, multilingual models, and leveraging monolingual data.

Document-level translation: Sentence-level MT ignores discourse coherence, pronoun resolution across sentences, and document-level terminology consistency.

Domain adaptation: Models trained on general data perform poorly on specialized domains (medical, legal). Fine-tuning on domain-specific parallel data or terminology injection helps.

Hallucination: MT models sometimes generate fluent but unfaithful translations, especially under domain shift or for low-resource languages.

Real-time translation: Speech translation and simultaneous interpretation require streaming architectures that translate from partial input.


Key Takeaways

  • SMT decomposed translation into separately trainable components; phrase-based SMT dominated for a decade
  • Attention-based encoder-decoder models eliminated the information bottleneck of fixed-length encoding
  • Transformers enable parallel training and model long-range dependencies, becoming the universal MT architecture
  • BLEU is standard but correlates poorly with human judgment at high quality; COMET is preferred
  • Multilingual models (NLLB, M2M-100) enable translation across 200+ languages, including low-resource pairs
  • Beam search with length normalization is essential for high-quality MT decoding