
Speech Recognition (ASR)

Overview


Automatic Speech Recognition (ASR) converts spoken language into text. The fundamental challenge: mapping a variable-length acoustic signal to a variable-length symbol sequence across diverse speakers, accents, noise conditions, and vocabularies.

Audio waveform -> Feature extraction -> Acoustic model -> Decoder -> Text

Traditional Pipeline: GMM-HMM

The dominant paradigm from the 1980s through ~2012.

Hidden Markov Models (HMMs)

Model temporal structure of speech at the phoneme level:

  • States represent sub-phoneme units (typically 3 states per phoneme: beginning, middle, end)
  • Transitions capture duration via self-loops and left-to-right topology
  • Emission probabilities model acoustic observations at each state
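As a concrete (hypothetical) example, the left-to-right topology with self-loops for a single 3-state phoneme model can be written as a transition matrix:

```python
import numpy as np

# Hypothetical 3-state left-to-right HMM for one phoneme.
# Self-loops model duration; each state may only stay or advance,
# so the matrix is upper-triangular (no backward transitions).
A = np.array([
    [0.6, 0.4, 0.0],   # begin  -> begin | middle
    [0.0, 0.7, 0.3],   # middle -> middle | end
    [0.0, 0.0, 1.0],   # end    -> end (exit handled by the search graph)
])

assert np.allclose(A.sum(axis=1), 1.0)   # each row is a valid distribution
assert np.allclose(np.tril(A, -1), 0.0)  # left-to-right: no backward moves
```

The self-loop probabilities (0.6, 0.7 here) implicitly define a geometric duration distribution for each sub-phoneme state.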

Gaussian Mixture Models (GMMs)

Model the emission probability distribution at each HMM state:

p(x | state_j) = sum_{m=1}^{M} w_jm * N(x; mu_jm, Sigma_jm)

Typically 16-64 mixture components per state. Input: 39-dim MFCC + delta + delta-delta.
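A minimal sketch of the emission computation, assuming diagonal covariances (a common simplification in GMM-HMM systems) and using log-sum-exp for numerical stability:

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | state) for a diagonal-covariance GMM.

    x: (D,) feature vector, e.g. 39-dim MFCC+deltas
    weights: (M,) mixture weights, means/variances: (M, D)
    """
    # Per-component log N(x; mu_m, diag(sigma2_m))
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_exp
    # Log-sum-exp over the M mixture components
    m = log_comp.max()
    return m + np.log(np.sum(np.exp(log_comp - m)))
```

With M=1, zero mean, and unit variance this reduces to the standard Gaussian log-density, which makes it easy to sanity-check.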

Training

  • Baum-Welch (EM algorithm): Iteratively estimate GMM parameters and state alignments
  • Viterbi training: Use best-path alignment instead of full forward-backward
  • Decision tree clustering: Tie states across triphone contexts to handle data sparsity
  • Discriminative training: MMI, MPE, sMBR objectives improve over maximum likelihood

Decoding

Viterbi search through the composed WFST (Weighted Finite-State Transducer):

Search graph = H ∘ C ∘ L ∘ G (HMM topology ∘ context dependency ∘ lexicon ∘ language model)

Beam search prunes the search space. The decoder finds the word sequence W that maximizes:

W* = argmax_W [ log P(X|W) + alpha * log P(W) + beta * |W| ]

Where alpha = language model weight and beta = word insertion penalty (a per-word bonus that counteracts the bias toward short outputs).


DNN-HMM Hybrid

Replace GMMs with deep neural networks (~2012 onward):

  • DNN takes a context window of frames and predicts HMM state posteriors
  • Convert posteriors to scaled likelihoods via Bayes' rule: p(x|s) ∝ p(s|x) / p(s); the p(x) term is constant per frame and drops out of decoding
  • HMM still handles temporal alignment and decoding
  • Large improvement over GMMs, especially with deep architectures
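The posterior-to-likelihood conversion can be illustrated with made-up numbers (in practice the state priors p(s) are estimated by counting training alignments):

```python
import numpy as np

# Hypothetical DNN softmax output for one frame over 4 HMM states.
posteriors = np.array([0.70, 0.15, 0.10, 0.05])    # p(s | x)
state_priors = np.array([0.40, 0.30, 0.20, 0.10])  # p(s), from alignments

# Scaled log-likelihoods: log p(x|s) = log p(s|x) - log p(s) + const,
# where the constant log p(x) is identical for every state.
scaled_loglik = np.log(posteriors) - np.log(state_priors)
best_state = int(scaled_loglik.argmax())
```

Dividing by the prior down-weights states that are frequent in training data, which matters because the HMM transition model already accounts for state frequency.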

Progression of acoustic models:

  1. DNN (feed-forward): Context window of ~11 frames
  2. RNN/LSTM: Process full sequence, capture long-range dependencies
  3. TDNN (Time-Delay Neural Network): Efficient temporal context via dilated 1D convolutions
  4. CNN-TDNN-LSTM: Combine complementary architectures

End-to-End Models

Eliminate the separate components (lexicon, pronunciation model, HMM) in favor of a single neural network that directly maps audio to text.

Connectionist Temporal Classification (CTC)

Maps input frames to output labels without requiring pre-aligned training data.

Key ideas:

  • Output alphabet includes a blank token (for silence/repetition)
  • All alignments that collapse to the same label sequence are marginalized: "aaa-bb-" and "-aa--bb" both map to "ab" (where - is blank)
  • Forward-backward algorithm computes the total probability efficiently
  • Conditional independence assumption: each frame's output is independent given the input

Limitations:

  • Cannot model output dependencies (no implicit language model)
  • Tends to produce peaky, spiky alignments
  • Often combined with external language model in decoding

Used in: DeepSpeech, early wav2vec systems.

Attention-Based: Listen, Attend and Spell (LAS)

Encoder-decoder architecture with attention:

  1. Listener (encoder): Pyramidal BiLSTM reduces input sequence length
  2. Attention: Learns soft alignment between encoder states and output tokens
  3. Speller (decoder): Autoregressive LSTM generates characters/subwords
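Step 1's pyramidal reduction is easy to sketch: each layer concatenates adjacent frame pairs, halving the time axis (a simplified stand-in for the interleaved BiLSTM layers):

```python
import numpy as np

def pyramid_step(x):
    """Concatenate adjacent frame pairs: (T, F) -> (T // 2, 2 * F)."""
    T, F = x.shape
    x = x[: T - (T % 2)]          # drop an odd trailing frame
    return x.reshape(T // 2, 2 * F)

frames = np.zeros((100, 40))       # e.g. 100 frames of 40-dim filterbanks
out = pyramid_step(pyramid_step(pyramid_step(frames)))
# Three pyramid layers: roughly 8x fewer timesteps, 8x wider features,
# which makes attention over the encoder states far cheaper.
```

Shortening the encoder sequence is what makes attention tractable: the decoder attends over ~T/8 states instead of every 10 ms frame.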

Advantages: Jointly learns alignment and language model. No conditional independence assumption.

Limitations: Attention-based alignment can fail on long utterances. Requires full input before decoding (not streaming-friendly without modifications).

RNN-Transducer (RNN-T)

Combines CTC-style frame processing with autoregressive label prediction:

  • Encoder: Processes audio frames (like CTC)
  • Prediction network: Processes previous labels (like a language model)
  • Joint network: Combines both to predict next label or blank
  • Supports streaming decoding (process audio left-to-right)

Dominant architecture for on-device ASR (Google, Apple). Handles the CTC conditional independence limitation while remaining streamable.
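A shape-level sketch of the joint network, using random stand-ins for the encoder and prediction-network outputs (real systems use LSTM or Conformer encoders; the additive combination plus tanh shown here is one common choice):

```python
import numpy as np

rng = np.random.default_rng(0)
T, U, H, V = 5, 3, 8, 10   # frames, label positions, hidden size, vocab (incl. blank)

enc = rng.normal(size=(T, H))    # encoder: one vector per audio frame
pred = rng.normal(size=(U, H))   # prediction net: one vector per label history

# Joint network: combine every (frame t, label position u) pair,
# then project to vocabulary logits. Emitting blank advances t;
# emitting a label advances u, so decoding walks a T-by-U lattice.
W = rng.normal(size=(H, V))
joint = np.tanh(enc[:, None, :] + pred[None, :, :])  # (T, U, H)
logits = joint @ W                                    # (T, U, V)
```

The (T, U, V) lattice is why RNN-T training is memory-hungry: the loss marginalizes over all monotonic paths through it, analogous to CTC's forward-backward.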

Conformer

State-of-the-art encoder architecture combining convolution and self-attention:

Input -> FeedForward -> Multi-Head Self-Attention -> Convolution -> FeedForward -> Output
  • Self-attention captures global context
  • Convolution captures local patterns
  • Macaron-style feed-forward layers (half before, half after)
  • Relative positional encoding

Used as the encoder in both CTC and RNN-T systems. Consistently outperforms pure Transformer or pure CNN encoders on speech tasks.

Whisper (OpenAI)

Large-scale weakly supervised ASR model:

  • Architecture: Encoder-decoder Transformer (standard, no novel components)
  • Training data: 680,000 hours of labeled audio from the internet
  • Multitask: Transcription, translation, language ID, timestamp prediction via special tokens
  • Input: 30-second mel-spectrogram chunks (80 mel bins, 16 kHz)
  • Sizes range from Tiny (39M params) to Large-v3 (1.5B params)

Key insight: Scale and data diversity matter more than architectural innovation. Whisper achieves strong zero-shot performance across languages and domains without fine-tuning.

Distil-Whisper provides 6x faster inference with minimal accuracy loss.


Language Model Integration

External language models improve ASR output, especially for rare words and domain adaptation:

Shallow Fusion

Interpolate ASR and LM scores during beam search:

score = log P_ASR(y|x) + lambda * log P_LM(y)

Simple and effective. Works with any LM (n-gram, neural).
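A toy illustration with made-up probabilities (the classic "wreck a nice beach" confusion), showing how the LM term can flip the beam-search ranking; the lambda value here is illustrative and would be tuned on a dev set:

```python
import math

# Hypothetical per-candidate scores for the same audio.
candidates = {
    "recognize speech": {"asr": math.log(0.40), "lm": math.log(0.020)},
    "wreck a nice beach": {"asr": math.log(0.45), "lm": math.log(0.001)},
}
lm_weight = 0.3  # lambda, tuned on held-out data

def fused_score(c):
    # score = log P_ASR(y|x) + lambda * log P_LM(y)
    return c["asr"] + lm_weight * c["lm"]

best = max(candidates, key=lambda k: fused_score(candidates[k]))
```

Acoustically the second hypothesis scores higher, but the language model prior pulls the fused score back toward the sensible transcription.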

Deep Fusion and Cold Fusion

Integrate LM representations into the decoder at the hidden state level during training.

Rescoring

N-best list or lattice rescoring with a large LM after initial decoding. Allows using expensive models (GPT-scale) without slowing real-time decoding.


Evaluation: Word Error Rate (WER)

WER = (Substitutions + Insertions + Deletions) / Reference Words * 100%

Computed via dynamic programming alignment between hypothesis and reference.
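A self-contained implementation via the standard edit-distance recurrence over words:

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein alignment of word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

# 1 substitution + 1 deletion over 4 reference words -> 50% WER
assert wer("the cat sat down", "the hat sat") == 0.5
```

Note that WER can exceed 100% when the hypothesis contains many insertions, since the denominator counts only reference words.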

| Metric | Description |
|--------|-------------|
| WER | Standard word-level error rate |
| CER | Character Error Rate (useful for character-based languages) |
| SER | Sentence Error Rate (binary: any error in sentence) |
| RTF | Real-Time Factor: processing_time / audio_duration |

Benchmark datasets: LibriSpeech (English read speech), Common Voice (multilingual), Switchboard (conversational), GigaSpeech, FLEURS (multilingual).


Multilingual and Cross-Lingual ASR

Challenges

  • Low-resource languages lack labeled training data
  • Diverse phoneme inventories and tonal systems
  • Writing system variation (alphabetic, logographic, abjad)

Approaches

  • Multilingual pre-training: Train shared encoder on many languages, fine-tune on target
  • Self-supervised learning: wav2vec 2.0, HuBERT, WavLM learn representations from unlabeled audio, then fine-tune with minimal labeled data
  • Cross-lingual transfer: Phone-level representations transfer across related languages
  • Massively multilingual models: Whisper (~100 languages) and MMS from Meta (1,000+ languages)

Self-supervised models (wav2vec 2.0, HuBERT) have dramatically reduced the labeled data requirements, enabling ASR for hundreds of previously unsupported languages.

Self-Supervised Pre-Training Pipeline

Unlabeled audio -> Encoder -> Contrastive/Masked prediction -> Pre-trained model
Fine-tune with CTC + small labeled dataset -> ASR system

wav2vec 2.0 achieves competitive WER on LibriSpeech with only 10 minutes of labeled data.