Speaker Recognition
Overview
Speaker recognition determines who is speaking, as opposed to what is being said (speech recognition). Two main tasks:
- Speaker Verification: Is this person who they claim to be? (1:1 comparison, binary decision)
- Speaker Identification: Which known speaker is this? (1:N comparison, classification)
Both operate in either text-dependent (fixed passphrase) or text-independent (any speech content) modes. Text-independent is harder but more practical.
Enrollment: Audio -> Feature extraction -> Speaker model -> Store voiceprint
Verification: Audio -> Feature extraction -> Compare with stored model -> Accept/Reject
Traditional Features and Models
Spectral Features
Speaker identity is encoded primarily in the spectral envelope, shaped by vocal tract anatomy (length, shape, nasal cavity):
- MFCCs: Standard features capturing vocal tract shape
- Formant frequencies and bandwidths
- Long-term spectral statistics
- Pitch (F0) carries some speaker info but varies with emotion/intent
GMM-UBM Framework
The classic approach (Reynolds et al., 2000):
- Train a Universal Background Model (UBM): Large GMM on diverse speaker data
- Enrollment: Adapt UBM to target speaker via MAP adaptation (shift means only)
- Scoring: Log-likelihood ratio between speaker model and UBM
score = log P(X | speaker_GMM) - log P(X | UBM)
Threshold on score for verification decision.
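As a toy sketch of this scoring, the snippet below uses scikit-learn's `GaussianMixture` with synthetic data; a fresh GMM initialized from the UBM stands in for true MAP mean adaptation, and all distributions here are illustrative assumptions, not a real system:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-ins for pooled MFCC frames
background = rng.normal(0.0, 1.0, size=(2000, 13))  # diverse-speaker data
target = rng.normal(1.5, 1.0, size=(300, 13))       # target speaker's enrollment data

# 1) Universal Background Model: one GMM over diverse data
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(background)

# 2) "Enrollment": refit a GMM initialized from the UBM (a real system would
#    MAP-adapt the means only, keeping UBM weights and covariances)
spk = GaussianMixture(n_components=8, covariance_type="diag", random_state=0,
                      means_init=ubm.means_, weights_init=ubm.weights_)
spk.fit(target)

def llr_score(test_frames):
    """Average log-likelihood ratio: log P(X | speaker) - log P(X | UBM)."""
    return spk.score(test_frames) - ubm.score(test_frames)

genuine = llr_score(rng.normal(1.5, 1.0, size=(200, 13)))   # same "speaker"
impostor = llr_score(rng.normal(0.0, 1.0, size=(200, 13)))  # different "speaker"
```

Genuine trials should score well above impostor trials, and the verification threshold is then tuned on held-out trials.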
GMM Supervectors
Concatenate adapted GMM means into a high-dimensional vector (supervector). Enables use of discriminative classifiers (SVM) on speaker representations.
i-Vectors
Factor analysis approach that replaced GMM supervectors (~2010):
Total Variability Space
Model both speaker and channel variability in a single low-dimensional subspace:
M = m + T * w
- M: GMM supervector for an utterance
- m: UBM supervector (speaker-independent mean)
- T: Total variability matrix (learned from data)
- w: i-vector (low-dimensional representation, typically 400-600 dims)
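Ignoring the Baum-Welch statistics and Gaussian posterior used by a real i-vector extractor, the generative model `M = m + T * w` can be sketched as a least-squares projection; the dimensions below are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
sv_dim, iv_dim = 1024, 40                  # toy sizes; real systems: ~20k -> 400-600

m = rng.normal(size=sv_dim)                # UBM supervector (speaker-independent mean)
T = rng.normal(size=(sv_dim, iv_dim))      # total variability matrix (learned in practice)

w_true = rng.normal(size=iv_dim)
M = m + T @ w_true                         # utterance supervector per M = m + T w

# "Extraction" as least squares: recover w from (M - m). A real extractor
# instead computes a Gaussian posterior over w from zeroth/first-order stats.
w_hat, *_ = np.linalg.lstsq(T, M - m, rcond=None)
```

In the noiseless toy case the projection recovers `w` exactly, which makes the role of `T` as a learned low-dimensional basis concrete.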
Scoring with i-Vectors
- Cosine similarity: Simple but effective after length normalization
- PLDA (Probabilistic Linear Discriminant Analysis): Models speaker and session variability separately; standard backend for i-vector systems
score = log ( P(w1, w2 | same speaker) / P(w1, w2 | different speakers) )
PLDA with i-vectors was state-of-the-art until deep learning approaches emerged.
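The cosine backend is short enough to write out directly; the vectors below are toy values, and the length normalization is the same step applied before PLDA scoring:

```python
import numpy as np

def length_norm(w):
    """Project an i-vector onto the unit hypersphere (length normalization)."""
    return w / np.linalg.norm(w)

def cosine_score(w1, w2):
    """Cosine similarity between two length-normalized vectors."""
    return float(length_norm(w1) @ length_norm(w2))

w_enroll = np.array([1.0, 2.0, 0.5])
same = cosine_score(w_enroll, 2.0 * w_enroll)              # scaled copy -> 1.0
diff = cosine_score(w_enroll, np.array([-2.0, 1.0, 0.0]))  # orthogonal -> 0.0
```

Because the score depends only on direction, per-utterance energy and duration effects that scale the vector's magnitude are discarded.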
Deep Speaker Embeddings
d-Vectors
Early neural approach (Google, 2014):
- DNN trained on speaker classification (softmax over speaker IDs)
- Extract activations from a hidden layer as speaker embedding
- Frame-level features averaged over utterance
x-Vectors
TDNN-based speaker embeddings (Snyder et al., 2018):
Input frames -> TDNN layers (temporal context) -> Statistics pooling (mean + stddev) -> FC layers -> Embedding
- Frame-level layers: TDNN (1D convolutions) process acoustic features with expanding context
- Statistics pooling: Compute mean and standard deviation across all frames
- Segment-level layers: Fully connected layers produce fixed-dimensional embedding
- Training: Softmax cross-entropy over speaker IDs (+ data augmentation)
The embedding extracted before the final classification layer (typically 512-dim) serves as the speaker representation. Scored with PLDA or cosine similarity.
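The pooling step is what converts variable-length input into a fixed-size representation; a numpy sketch with toy dimensions:

```python
import numpy as np

def statistics_pooling(frame_feats):
    """Map (T, D) frame-level activations to a fixed 2*D vector [mean; stddev]."""
    return np.concatenate([frame_feats.mean(axis=0), frame_feats.std(axis=0)])

rng = np.random.default_rng(0)
short = statistics_pooling(rng.normal(size=(50, 512)))   # short utterance
long = statistics_pooling(rng.normal(size=(900, 512)))   # long utterance
```

Utterances of any duration map to the same dimensionality, so the segment-level layers never see variable-length input.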
ECAPA-TDNN
Enhanced x-vector architecture (Desplanques et al., 2020), dominant in modern systems:
- Squeeze-Excitation (SE) blocks: Channel attention rescales feature maps
- Res2Net modules: Multi-scale feature aggregation within residual blocks
- Attentive statistics pooling: Attention-weighted mean and stddev (not uniform)
- Multi-layer feature aggregation: Concatenate outputs from multiple TDNN layers
Achieves significantly lower error rates than standard x-vectors. Typical embedding: 192 dims.
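Attentive statistics pooling replaces the uniform frame weighting with learned weights; in the sketch below a single parameter vector `w_att` stands in for the small attention MLP ECAPA actually uses:

```python
import numpy as np

def attentive_stats_pooling(h, w_att):
    """Attention-weighted mean and stddev over frames.

    h:     (T, D) frame-level features
    w_att: (D,)   toy attention parameters (ECAPA uses a small MLP here)
    """
    logits = h @ w_att                      # (T,) per-frame attention scores
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()                    # softmax over frames
    mean = alpha @ h                        # weighted mean, (D,)
    var = alpha @ (h ** 2) - mean ** 2      # weighted variance
    std = np.sqrt(np.clip(var, 1e-12, None))
    return np.concatenate([mean, std])

rng = np.random.default_rng(0)
h = rng.normal(size=(100, 16))
pooled = attentive_stats_pooling(h, rng.normal(size=16))
```

With zero attention parameters the weights become uniform and the result reduces to plain statistics pooling, which makes clear that attention is a strict generalization.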
Training Objectives
Modern speaker embedding training has moved beyond softmax:
| Loss | Description |
|------|-------------|
| Softmax + CE | Basic classification, requires speaker labels |
| AAM-Softmax | Additive angular margin, improves discrimination |
| Sub-center AAM | Multiple class centers, handles within-speaker variation |
| Prototypical networks | Metric learning with episode-based sampling |
| Contrastive / Triplet | Learn by comparing positive and negative pairs |
AAM-Softmax (Additive Angular Margin Softmax) is the current standard:
L = -log( exp(s * cos(theta_y + m)) / (exp(s * cos(theta_y + m)) + sum_{j != y} exp(s * cos(theta_j))) )
where s is a scale factor and m the angular margin; the margin forces embeddings of the same speaker closer together on the unit hypersphere.
Large-Scale Pre-Trained Models
- WavLM / HuBERT / wav2vec 2.0: Self-supervised models fine-tuned for speaker tasks
- TitaNet (NVIDIA): Scalable architecture with squeeze-excitation and channel attention
- CAM++: Channel-attention-based approach with strong performance
Speaker Diarization
Who spoke when? Segment a multi-speaker recording into speaker-homogeneous regions.
Traditional Pipeline
Audio -> VAD -> Segmentation -> Embedding extraction -> Clustering -> Resegmentation
- Voice Activity Detection (VAD): Identify speech vs non-speech regions
- Segmentation: Divide speech into short uniform segments (1-2 seconds)
- Embedding extraction: Compute x-vector/ECAPA-TDNN embedding per segment
- Clustering: Group segments by speaker
Clustering Methods
- Agglomerative Hierarchical Clustering (AHC): Bottom-up merging with PLDA scores
- Spectral clustering: Eigen-decomposition of affinity matrix, auto-selects speaker count
- VBx (Variational Bayes): Bayesian HMM resegmentation refining initial clustering
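A minimal AHC sketch over toy segment embeddings using scipy; cosine distance stands in for the PLDA affinities a production system would use, and the two synthetic "speaker" clusters are assumptions for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Toy segment embeddings: two "speakers" as tight clusters around
# orthogonal directions (10 segments each)
base_a, base_b = np.zeros(8), np.zeros(8)
base_a[0], base_b[1] = 1.0, 1.0
emb = np.vstack([base_a + rng.normal(0, 0.05, size=(10, 8)),
                 base_b + rng.normal(0, 0.05, size=(10, 8))])

# Bottom-up merging on cosine distance (real systems often use PLDA scores)
dist = pdist(emb, metric="cosine")
tree = linkage(dist, method="average")

# Cutting the tree at a distance threshold implicitly selects the speaker count
labels = fcluster(tree, t=0.5, criterion="distance")
```

The stopping threshold plays the same role as the verification threshold: it trades splitting one speaker into two against merging two speakers into one.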
Neural Diarization: EEND
End-to-End Neural Diarization (Fujita et al., 2019):
- Frame-level multi-label classification: For each frame, predict which speakers are active
- Handles overlapping speech naturally (multiple speakers active simultaneously)
- Self-attention (Transformer) encoder processes the full recording
- EEND-EDA (Encoder-Decoder Attractors): Extends EEND to a flexible number of speakers
Limitations: Fixed maximum number of speakers, struggles with very long recordings.
Hybrid Approaches
Combine EEND with clustering for practical systems:
- EEND-VC (EEND with vector clustering): Local EEND + global clustering
- TS-VAD (Target Speaker VAD): Given a speaker embedding, detect their speech regions
- Process long recordings in overlapping blocks, merge results
Evaluation
- DER (Diarization Error Rate): Sum of missed speech, false alarm, and speaker confusion
- Typically computed with a 0.25s forgiveness collar around reference boundaries
- JER (Jaccard Error Rate): Per-speaker intersection-over-union based metric
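A frame-level DER sketch makes the three error terms concrete; this toy version assumes one speaker per frame and a fixed speaker mapping, whereas real scoring finds the optimal reference-to-hypothesis mapping and applies the collar:

```python
def der(ref, hyp):
    """Frame-level DER sketch.

    ref, hyp: per-frame speaker labels, None meaning non-speech.
    Assumes single-speaker frames and already-aligned speaker labels.
    """
    missed = sum(1 for r, h in zip(ref, hyp) if r is not None and h is None)
    false_alarm = sum(1 for r, h in zip(ref, hyp) if r is None and h is not None)
    confusion = sum(1 for r, h in zip(ref, hyp)
                    if r is not None and h is not None and r != h)
    total_ref_speech = sum(1 for r in ref if r is not None)
    return (missed + false_alarm + confusion) / total_ref_speech

ref = ["A", "A", "B", "B", None, "B"]
hyp = ["A", "B", "B", None, "A", "B"]
error = der(ref, hyp)  # 1 confusion + 1 miss + 1 false alarm over 5 speech frames
```

Note the denominator counts only reference speech, so DER can exceed 100% when false alarms are heavy.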
Anti-Spoofing and Presentation Attack Detection
Speaker verification systems are vulnerable to attacks:
| Attack Type | Method |
|-------------|--------|
| Replay | Play back recorded speech through a loudspeaker |
| TTS | Synthesize the target speaker's voice |
| Voice conversion | Transform the attacker's voice to sound like the target |
| Deepfake | Neural-generated speech mimicking the target |
Countermeasures
- Front-end features: Linear frequency cepstral coefficients (LFCCs), spectral features capturing synthesis artifacts
- Neural classifiers: AASIST (Audio Anti-Spoofing using Integrated Spectro-Temporal graph attention networks), RawNet2
- Self-supervised features: wav2vec 2.0 representations detect subtle artifacts
- Logical vs physical access: Different detection strategies for TTS/VC vs replay
ASVspoof Challenge
Standard benchmark for anti-spoofing research. Evaluates systems on:
- Equal Error Rate (EER) for bonafide vs spoof classification
- t-DCF (tandem Detection Cost Function): Cost of the combined countermeasure + speaker-verification system
- Increasingly challenging with improving synthesis quality
Applications
- Voice authentication: Banking, device unlock, secure access
- Smart speakers: Personalized responses per household member
- Forensics: Speaker identification from recordings
- Meeting transcription: Diarization for multi-speaker notes
- Call centers: Customer and agent identification
- Media indexing: Who speaks in podcasts, broadcasts, interviews