Speaker Recognition
Overview
Speaker recognition determines who is speaking, as opposed to what is being said (speech recognition). Two main tasks:
- Speaker Verification: Is this person who they claim to be? (1:1 comparison, binary decision)
- Speaker Identification: Which known speaker is this? (1:N comparison, classification)
Both operate in either text-dependent (fixed passphrase) or text-independent (any speech content) modes. Text-independent is harder but more practical.
Enrollment: Audio -> Feature extraction -> Speaker model -> Store voiceprint
Verification: Audio -> Feature extraction -> Compare with stored model -> Accept/Reject
Traditional Features and Models
Spectral Features
Speaker identity is encoded primarily in the spectral envelope, shaped by vocal tract anatomy (length, shape, nasal cavity):
- MFCCs: Standard features capturing vocal tract shape
- Formant frequencies and bandwidths
- Long-term spectral statistics
- Pitch (F0) carries some speaker info but varies with emotion/intent
GMM-UBM Framework
The classic approach (Reynolds et al., 2000):
- Train a Universal Background Model (UBM): Large GMM on diverse speaker data
- Enrollment: Adapt UBM to target speaker via MAP adaptation (shift means only)
- Scoring: Log-likelihood ratio between speaker model and UBM
score = log P(X | speaker_GMM) - log P(X | UBM)
Threshold on score for verification decision.
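As a toy sketch of this scoring, the snippet below uses scikit-learn's `GaussianMixture` with synthetic data; a fresh GMM initialized from the UBM stands in for true MAP mean adaptation, and all distributions here are illustrative assumptions, not a real system:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-ins for pooled MFCC frames
background = rng.normal(0.0, 1.0, size=(2000, 13))  # diverse-speaker data
target = rng.normal(1.5, 1.0, size=(300, 13))       # target speaker's enrollment data

# 1) Universal Background Model: one GMM over diverse data
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(background)

# 2) "Enrollment": refit a GMM initialized from the UBM (a real system would
#    MAP-adapt the means only, keeping UBM weights and covariances)
spk = GaussianMixture(n_components=8, covariance_type="diag", random_state=0,
                      means_init=ubm.means_, weights_init=ubm.weights_)
spk.fit(target)

def llr_score(test_frames):
    """Average log-likelihood ratio: log P(X | speaker) - log P(X | UBM)."""
    return spk.score(test_frames) - ubm.score(test_frames)

genuine = llr_score(rng.normal(1.5, 1.0, size=(200, 13)))   # same "speaker"
impostor = llr_score(rng.normal(0.0, 1.0, size=(200, 13)))  # different "speaker"
```

Genuine trials should score well above impostor trials, and the verification threshold is then tuned on held-out trials.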
GMM Supervectors
Concatenate adapted GMM means into a high-dimensional vector (supervector). Enables use of discriminative classifiers (SVM) on speaker representations.
i-Vectors
Factor analysis approach that replaced GMM supervectors (~2010):
Total Variability Space
Model both speaker and channel variability in a single low-dimensional subspace:
M = m + T * w
- M: GMM supervector for an utterance
- m: UBM supervector (speaker-independent mean)
- T: Total variability matrix (learned from data)
- w: i-vector (low-dimensional representation, typically 400-600 dims)
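Ignoring the Baum-Welch statistics and Gaussian posterior used by a real i-vector extractor, the generative model `M = m + T * w` can be sketched as a least-squares projection; the dimensions below are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
sv_dim, iv_dim = 1024, 40                  # toy sizes; real systems: ~20k -> 400-600

m = rng.normal(size=sv_dim)                # UBM supervector (speaker-independent mean)
T = rng.normal(size=(sv_dim, iv_dim))      # total variability matrix (learned in practice)

w_true = rng.normal(size=iv_dim)
M = m + T @ w_true                         # utterance supervector per M = m + T w

# "Extraction" as least squares: recover w from (M - m). A real extractor
# instead computes a Gaussian posterior over w from zeroth/first-order stats.
w_hat, *_ = np.linalg.lstsq(T, M - m, rcond=None)
```

In the noiseless toy case the projection recovers `w` exactly, which makes the role of `T` as a learned low-dimensional basis concrete.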
Scoring with i-Vectors
- Cosine similarity: Simple but effective after length normalization
- PLDA (Probabilistic Linear Discriminant Analysis): Models speaker and session variability separately; standard backend for i-vector systems
score = log ( P(w1, w2 | same speaker) / P(w1, w2 | different speakers) )
PLDA with i-vectors was state-of-the-art until deep learning approaches emerged.
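The cosine backend is short enough to write out directly; the vectors below are toy values, and the length normalization is the same step applied before PLDA scoring:

```python
import numpy as np

def length_norm(w):
    """Project an i-vector onto the unit hypersphere (length normalization)."""
    return w / np.linalg.norm(w)

def cosine_score(w1, w2):
    """Cosine similarity between two length-normalized vectors."""
    return float(length_norm(w1) @ length_norm(w2))

w_enroll = np.array([1.0, 2.0, 0.5])
same = cosine_score(w_enroll, 2.0 * w_enroll)              # scaled copy -> 1.0
diff = cosine_score(w_enroll, np.array([-2.0, 1.0, 0.0]))  # orthogonal -> 0.0
```

Because the score depends only on direction, per-utterance energy and duration effects that scale the vector's magnitude are discarded.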
Deep Speaker Embeddings
d-Vectors
Early neural approach (Google, 2014):
- DNN trained on speaker classification (softmax over speaker IDs)
- Extract activations from a hidden layer as speaker embedding
- Frame-level features averaged over utterance
x-Vectors
TDNN-based speaker embeddings (Snyder et al., 2018):
Input frames -> TDNN layers (temporal context) -> Statistics pooling (mean + stddev) -> FC layers -> Embedding
- Frame-level layers: TDNN (1D convolutions) process acoustic features with expanding context
- Statistics pooling: Compute mean and standard deviation across all frames
- Segment-level layers: Fully connected layers produce fixed-dimensional embedding
- Training: Softmax cross-entropy over speaker IDs (+ data augmentation)
The embedding extracted before the final classification layer (typically 512-dim) serves as the speaker representation. Scored with PLDA or cosine similarity.
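The pooling step is what converts variable-length input into a fixed-size representation; a numpy sketch with toy dimensions:

```python
import numpy as np

def statistics_pooling(frame_feats):
    """Map (T, D) frame-level activations to a fixed 2*D vector [mean; stddev]."""
    return np.concatenate([frame_feats.mean(axis=0), frame_feats.std(axis=0)])

rng = np.random.default_rng(0)
short = statistics_pooling(rng.normal(size=(50, 512)))   # short utterance
long = statistics_pooling(rng.normal(size=(900, 512)))   # long utterance
```

Utterances of any duration map to the same dimensionality, so the segment-level layers never see variable-length input.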
ECAPA-TDNN
Enhanced x-vector architecture (Desplanques et al., 2020), dominant in modern systems:
- Squeeze-Excitation (SE) blocks: Channel attention rescales feature maps
- Res2Net modules: Multi-scale feature aggregation within residual blocks
- Attentive statistics pooling: Attention-weighted mean and stddev (not uniform)
- Multi-layer feature aggregation: Concatenate outputs from multiple TDNN layers
Achieves significantly lower error rates than standard x-vectors. Typical embedding: 192 dims.
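Attentive statistics pooling replaces the uniform frame weighting with learned weights; in the sketch below a single parameter vector `w_att` stands in for the small attention MLP ECAPA actually uses:

```python
import numpy as np

def attentive_stats_pooling(h, w_att):
    """Attention-weighted mean and stddev over frames.

    h:     (T, D) frame-level features
    w_att: (D,)   toy attention parameters (ECAPA uses a small MLP here)
    """
    logits = h @ w_att                      # (T,) per-frame attention scores
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()                    # softmax over frames
    mean = alpha @ h                        # weighted mean, (D,)
    var = alpha @ (h ** 2) - mean ** 2      # weighted variance
    std = np.sqrt(np.clip(var, 1e-12, None))
    return np.concatenate([mean, std])

rng = np.random.default_rng(0)
h = rng.normal(size=(100, 16))
pooled = attentive_stats_pooling(h, rng.normal(size=16))
```

With zero attention parameters the weights become uniform and the result reduces to plain statistics pooling, which makes clear that attention is a strict generalization.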
Training Objectives
Modern speaker embedding training has moved beyond softmax:
| Loss | Description |
|------|-------------|
| Softmax + CE | Basic classification, requires speaker labels |
| AAM-Softmax | Additive angular margin, improves discrimination |
| Sub-center AAM | Multiple class centers, handles within-speaker variation |
| Prototypical networks | Metric learning with episode-based sampling |
| Contrastive / Triplet | Learn by comparing positive and negative pairs |
AAM-Softmax (Additive Angular Margin Softmax) is the current standard:
L = -log( exp(s * cos(theta_y + m)) / (exp(s * cos(theta_y + m)) + sum_{j != y} exp(s * cos(theta_j))) )
where s is a scale factor and m the angular margin; the margin forces embeddings of the same speaker closer together on the unit hypersphere.
Large-Scale Pre-Trained Models
- WavLM / HuBERT / wav2vec 2.0: Self-supervised models fine-tuned for speaker tasks
- TitaNet (NVIDIA): Scalable architecture with squeeze-excitation and channel attention
- CAM++: Channel-attention-based approach with strong performance
Speaker Diarization
Who spoke when? Segment a multi-speaker recording into speaker-homogeneous regions.
Traditional Pipeline
Audio -> VAD -> Segmentation -> Embedding extraction -> Clustering -> Resegmentation
- Voice Activity Detection (VAD): Identify speech vs non-speech regions
- Segmentation: Divide speech into short uniform segments (1-2 seconds)
- Embedding extraction: Compute x-vector/ECAPA-TDNN embedding per segment
- Clustering: Group segments by speaker
Clustering Methods
- Agglomerative Hierarchical Clustering (AHC): Bottom-up merging with PLDA scores
- Spectral clustering: Eigen-decomposition of affinity matrix, auto-selects speaker count
- VBx (Variational Bayes): Bayesian HMM resegmentation refining initial clustering
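A minimal AHC sketch over toy segment embeddings using scipy; cosine distance stands in for the PLDA affinities a production system would use, and the two synthetic "speaker" clusters are assumptions for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Toy segment embeddings: two "speakers" as tight clusters around
# orthogonal directions (10 segments each)
base_a, base_b = np.zeros(8), np.zeros(8)
base_a[0], base_b[1] = 1.0, 1.0
emb = np.vstack([base_a + rng.normal(0, 0.05, size=(10, 8)),
                 base_b + rng.normal(0, 0.05, size=(10, 8))])

# Bottom-up merging on cosine distance (real systems often use PLDA scores)
dist = pdist(emb, metric="cosine")
tree = linkage(dist, method="average")

# Cutting the tree at a distance threshold implicitly selects the speaker count
labels = fcluster(tree, t=0.5, criterion="distance")
```

The stopping threshold plays the same role as the verification threshold: it trades splitting one speaker into two against merging two speakers into one.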
Neural Diarization: EEND
End-to-End Neural Diarization (Fujita et al., 2019):
- Frame-level multi-label classification: For each frame, predict which speakers are active
- Handles overlapping speech naturally (multiple speakers active simultaneously)
- Self-attention (Transformer) encoder processes the full recording
- EEND-EDA (Encoder-Decoder Attractors): Extends EEND to a flexible number of speakers
Limitations: Fixed maximum number of speakers, struggles with very long recordings.
Hybrid Approaches
Combine EEND with clustering for practical systems:
- EEND-VC (EEND with vector clustering): Local EEND + global clustering
- TS-VAD (Target Speaker VAD): Given a speaker embedding, detect their speech regions
- Process long recordings in overlapping blocks, merge results
Evaluation
- DER (Diarization Error Rate): Sum of missed speech, false alarm, and speaker confusion
- Typically computed with a 0.25s forgiveness collar around reference boundaries
- JER (Jaccard Error Rate): Per-speaker intersection-over-union based metric
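A frame-level DER sketch makes the three error terms concrete; this toy version assumes one speaker per frame and a fixed speaker mapping, whereas real scoring finds the optimal reference-to-hypothesis mapping and applies the collar:

```python
def der(ref, hyp):
    """Frame-level DER sketch.

    ref, hyp: per-frame speaker labels, None meaning non-speech.
    Assumes single-speaker frames and already-aligned speaker labels.
    """
    missed = sum(1 for r, h in zip(ref, hyp) if r is not None and h is None)
    false_alarm = sum(1 for r, h in zip(ref, hyp) if r is None and h is not None)
    confusion = sum(1 for r, h in zip(ref, hyp)
                    if r is not None and h is not None and r != h)
    total_ref_speech = sum(1 for r in ref if r is not None)
    return (missed + false_alarm + confusion) / total_ref_speech

ref = ["A", "A", "B", "B", None, "B"]
hyp = ["A", "B", "B", None, "A", "B"]
error = der(ref, hyp)  # 1 confusion + 1 miss + 1 false alarm over 5 speech frames
```

Note the denominator counts only reference speech, so DER can exceed 100% when false alarms are heavy.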
Anti-Spoofing and Presentation Attack Detection
Speaker verification systems are vulnerable to attacks:
| Attack Type | Method |
|-------------|--------|
| Replay | Play back recorded speech through a loudspeaker |
| TTS | Synthesize the target speaker's voice |
| Voice conversion | Transform the attacker's voice to sound like the target |
| Deepfake | Neural-generated speech mimicking the target |
Countermeasures
- Front-end features: Linear frequency cepstral coefficients (LFCCs), spectral features capturing synthesis artifacts
- Neural classifiers: AASIST (Audio Anti-Spoofing using Integrated Spectro-Temporal graph attention networks), RawNet2
- Self-supervised features: wav2vec 2.0 representations detect subtle artifacts
- Logical vs physical access: Different detection strategies for TTS/VC vs replay
ASVspoof Challenge
Standard benchmark for anti-spoofing research. Evaluates systems on:
- Equal Error Rate (EER) for bonafide vs spoof classification
- t-DCF (tandem Detection Cost Function): Cost of the combined countermeasure + speaker-verification system
- Increasingly challenging with improving synthesis quality
Applications
- Voice authentication: Banking, device unlock, secure access
- Smart speakers: Personalized responses per household member
- Forensics: Speaker identification from recordings
- Meeting transcription: Diarization for multi-speaker notes
- Call centers: Customer and agent identification
- Media indexing: Who speaks in podcasts, broadcasts, interviews