
Speaker Recognition

Overview

Speaker recognition determines who is speaking, as opposed to what is being said (speech recognition). Two main tasks:

  • Speaker Verification: Is this person who they claim to be? (1:1 comparison, binary decision)
  • Speaker Identification: Which known speaker is this? (1:N comparison, classification)

Both operate in either text-dependent (fixed passphrase) or text-independent (any speech content) modes. Text-independent is harder but more practical.

Enrollment: Audio -> Feature extraction -> Speaker model -> Store voiceprint
Verification: Audio -> Feature extraction -> Compare with stored model -> Accept/Reject
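The two flows above can be sketched end to end. The `extract_embedding` stub below (a mean over frame features) is a hypothetical stand-in for the real speaker models covered later, and the 0.7 threshold is purely illustrative:

```python
import numpy as np

def extract_embedding(audio_frames: np.ndarray) -> np.ndarray:
    """Stand-in for feature extraction + speaker model: average
    frame-level features into a fixed-length, L2-normalized voiceprint."""
    emb = audio_frames.mean(axis=0)
    return emb / np.linalg.norm(emb)

def verify(test_audio: np.ndarray, voiceprint: np.ndarray,
           threshold: float = 0.7) -> bool:
    """Accept/Reject: cosine similarity against the stored voiceprint."""
    score = float(extract_embedding(test_audio) @ voiceprint)
    return score >= threshold

# Synthetic stand-ins for two speakers' frame features (40-dim):
rng = np.random.default_rng(0)
enrolled = extract_embedding(rng.normal(size=(200, 40)) + 1.0)  # enrollment
same = rng.normal(size=(150, 40)) + 1.0    # same "speaker" offset
other = rng.normal(size=(150, 40)) - 1.0   # different offset
print(verify(same, enrolled), verify(other, enrolled))  # True False
```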

Traditional Features and Models

Spectral Features

Speaker identity is encoded primarily in the spectral envelope, shaped by vocal tract anatomy (length, shape, nasal cavity):

  • MFCCs: Standard features capturing vocal tract shape
  • Formant frequencies and bandwidths
  • Long-term spectral statistics
  • Pitch (F0) carries some speaker info but varies with emotion/intent

GMM-UBM Framework

The classic approach (Reynolds et al., 2000):

  1. Train a Universal Background Model (UBM): Large GMM on diverse speaker data
  2. Enrollment: Adapt UBM to target speaker via MAP adaptation (shift means only)
  3. Scoring: Log-likelihood ratio between speaker model and UBM
score = log P(X | speaker_GMM) - log P(X | UBM)

Threshold on score for verification decision.
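The three steps can be sketched with scikit-learn's `GaussianMixture`, using a simplified mean-only MAP update (the relevance factor `r = 16` is a conventional but illustrative choice, and the Gaussian data stands in for real acoustic features):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
ubm_data = rng.normal(size=(2000, 8))          # pooled "diverse speakers"
spk_data = rng.normal(loc=0.8, size=(300, 8))  # target speaker enrollment

# 1. Universal Background Model: one GMM over the pooled data
ubm = GaussianMixture(n_components=4, covariance_type="diag",
                      random_state=0).fit(ubm_data)

# 2. Simplified MAP adaptation: shift means only, keep weights/covariances
r = 16.0                                       # relevance factor
resp = ubm.predict_proba(spk_data)             # (N, C) responsibilities
n_c = resp.sum(axis=0)                         # soft counts per component
ex_c = resp.T @ spk_data / np.maximum(n_c[:, None], 1e-8)
alpha = (n_c / (n_c + r))[:, None]             # data-dependent interpolation
spk = GaussianMixture(n_components=4, covariance_type="diag")
spk.weights_, spk.covariances_ = ubm.weights_, ubm.covariances_
spk.precisions_cholesky_ = ubm.precisions_cholesky_
spk.means_ = alpha * ex_c + (1 - alpha) * ubm.means_

# 3. Log-likelihood ratio score, averaged over frames
def llr(x: np.ndarray) -> float:
    return float(spk.score_samples(x).mean() - ubm.score_samples(x).mean())

target = rng.normal(loc=0.8, size=(200, 8))
impostor = rng.normal(size=(200, 8))
print(llr(target) > llr(impostor))  # True: target scores higher
```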

GMM Supervectors

Concatenate adapted GMM means into a high-dimensional vector (supervector). Enables use of discriminative classifiers (SVM) on speaker representations.
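The construction itself is just a reshape; the means here are random placeholders for a MAP-adapted GMM:

```python
import numpy as np

# Hypothetical adapted GMM: C = 4 components over D = 8-dim features
adapted_means = np.random.default_rng(2).normal(size=(4, 8))
supervector = adapted_means.reshape(-1)  # concatenate means: shape (C*D,)
print(supervector.shape)  # (32,)
```

With real dimensions (e.g. 2048 components, 60-dim features) the supervector is very high-dimensional, which is what motivates the i-vector compression below.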


i-Vectors

Factor analysis approach that replaced GMM supervectors (~2010):

Total Variability Space

Model both speaker and channel variability in a single low-dimensional subspace:

M = m + T * w
  • M: GMM supervector for an utterance
  • m: UBM supervector (speaker-independent mean)
  • T: Total variability matrix (learned from data)
  • w: i-vector (low-dimensional representation, typically 400-600 dims)
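Given `m` and `T`, extracting `w` amounts to solving the linear model above. Real systems compute a MAP posterior estimate from zeroth- and first-order Baum-Welch statistics; this toy sketch uses a plain least-squares point estimate on synthetic quantities:

```python
import numpy as np

rng = np.random.default_rng(3)
CD, R = 32, 5                      # supervector dim (C*D), i-vector dim
m = rng.normal(size=CD)            # UBM supervector
T = rng.normal(size=(CD, R))       # total variability matrix (learned)
w_true = rng.normal(size=R)
M = m + T @ w_true                 # utterance supervector per the model

# Point estimate of the i-vector (least squares, no posterior covariance):
w_hat, *_ = np.linalg.lstsq(T, M - m, rcond=None)
print(np.allclose(w_hat, w_true))  # True
```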

Scoring with i-Vectors

  • Cosine similarity: Simple but effective after length normalization
  • PLDA (Probabilistic Linear Discriminant Analysis): Models speaker and session variability separately; standard backend for i-vector systems
score = log [ P(w1, w2 | same speaker) / P(w1, w2 | different speakers) ]

PLDA with i-vectors was state-of-the-art until deep learning approaches emerged.
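Cosine scoring is simple enough to show directly; length normalization reduces it to a dot product of unit vectors:

```python
import numpy as np

def cosine_score(w1: np.ndarray, w2: np.ndarray) -> float:
    """Length-normalize both i-vectors, then take their dot product."""
    w1 = w1 / np.linalg.norm(w1)
    w2 = w2 / np.linalg.norm(w2)
    return float(w1 @ w2)

a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 4.0, 4.0])   # same direction, different length
c = np.array([-2.0, 1.0, 0.0])  # orthogonal to a
print(cosine_score(a, b), cosine_score(a, c))  # 1.0 0.0
```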


Deep Speaker Embeddings

d-Vectors

Early neural approach (Google, 2014):

  • DNN trained on speaker classification (softmax over speaker IDs)
  • Extract activations from a hidden layer as speaker embedding
  • Frame-level activations are averaged over the utterance to form a fixed-length d-vector

x-Vectors

TDNN-based speaker embeddings (Snyder et al., 2018):

Input frames -> TDNN layers (temporal context) -> Statistics pooling -> FC layers -> Embedding
                                                  (mean + stddev)
  1. Frame-level layers: TDNN (1D convolutions) process acoustic features with expanding context
  2. Statistics pooling: Compute mean and standard deviation across all frames
  3. Segment-level layers: Fully connected layers produce fixed-dimensional embedding
  4. Training: Softmax cross-entropy over speaker IDs (+ data augmentation)

The embedding extracted before the final classification layer (typically 512-dim) serves as the speaker representation. Scored with PLDA or cosine similarity.
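Statistics pooling is the step that converts a variable number of frames into a fixed-size vector; a minimal numpy version:

```python
import numpy as np

def stats_pooling(frame_feats: np.ndarray) -> np.ndarray:
    """Map variable-length (T, D) frame features to a fixed (2*D,) vector
    by concatenating the mean and standard deviation over time."""
    mean = frame_feats.mean(axis=0)
    std = frame_feats.std(axis=0)
    return np.concatenate([mean, std])

short = np.random.default_rng(4).normal(size=(50, 512))   # 50 frames
long_ = np.random.default_rng(5).normal(size=(900, 512))  # 900 frames
print(stats_pooling(short).shape, stats_pooling(long_).shape)  # (1024,) (1024,)
```

Both utterances map to the same dimensionality regardless of duration, which is what lets the segment-level layers be ordinary fully connected layers.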

ECAPA-TDNN

Enhanced x-vector architecture (Desplanques et al., 2020), dominant in modern systems:

  • Squeeze-Excitation (SE) blocks: Channel attention rescales feature maps
  • Res2Net modules: Multi-scale feature aggregation within residual blocks
  • Attentive statistics pooling: Attention-weighted mean and stddev (not uniform)
  • Multi-layer feature aggregation: Concatenate outputs from multiple TDNN layers

Achieves significantly lower error rates than standard x-vectors. Typical embedding: 192 dims.
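The attentive statistics pooling step can be sketched in isolation. The attention logits would come from a small learned network inside the model; here they are assumed given:

```python
import numpy as np

def attentive_stats_pooling(h: np.ndarray, att_logits: np.ndarray) -> np.ndarray:
    """Attention-weighted mean/std over time.
    h: (T, D) frame features; att_logits: (T,) per-frame attention scores."""
    a = np.exp(att_logits - att_logits.max())
    a /= a.sum()                                   # softmax over frames
    mean = (a[:, None] * h).sum(axis=0)            # weighted mean
    var = (a[:, None] * (h - mean) ** 2).sum(axis=0)
    return np.concatenate([mean, np.sqrt(np.maximum(var, 1e-9))])

h = np.random.default_rng(8).normal(size=(100, 192))
uniform = attentive_stats_pooling(h, np.zeros(100))  # equal weights
print(uniform.shape)  # (384,)
```

With uniform logits this reduces exactly to plain statistics pooling; non-uniform logits let the model emphasize frames that are more speaker-discriminative.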

Training Objectives

Modern speaker embedding training has moved beyond softmax:

| Loss | Description |
|------|-------------|
| Softmax + CE | Basic classification, requires speaker labels |
| AAM-Softmax | Additive angular margin, improves discrimination |
| Sub-center AAM | Multiple class centers, handles within-speaker variation |
| Prototypical networks | Metric learning on episode-based sampling |
| Contrastive / Triplet | Learn by comparing positive and negative pairs |

AAM-Softmax (Additive Angular Margin Softmax) is the current standard:

L = -log( exp(s * cos(theta_y + m)) / (exp(s * cos(theta_y + m)) + sum_{j != y} exp(s * cos(theta_j))) )

Where s is a scale factor and m an angular margin added to the target-class angle theta_y; the margin penalizes the target logit during training, forcing embeddings of the same speaker closer together on the hypersphere.
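A single-sample numpy version of the loss makes the margin's effect concrete (real training batches this; s = 30 and m = 0.2 are typical but illustrative values):

```python
import numpy as np

def aam_softmax_loss(embedding, class_weights, label, s=30.0, m=0.2):
    """AAM-softmax for one sample. embedding: (D,), class_weights: (C, D).
    Both are L2-normalized so logits are cosines; the margin m is added
    to the target-class angle before rescaling by s."""
    e = embedding / np.linalg.norm(embedding)
    W = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = W @ e                                    # (C,) cosine logits
    theta = np.arccos(np.clip(cos[label], -1.0, 1.0))
    logits = s * cos
    logits[label] = s * np.cos(theta + m)          # penalized target logit
    logits -= logits.max()                         # numerical stability
    return float(-np.log(np.exp(logits[label]) / np.exp(logits).sum()))

rng = np.random.default_rng(7)
W = rng.normal(size=(5, 16))            # 5 hypothetical speaker classes
x = W[2] + 0.1 * rng.normal(size=16)    # a sample close to class 2
loss_m = aam_softmax_loss(x, W, label=2, m=0.2)
loss_0 = aam_softmax_loss(x, W, label=2, m=0.0)  # plain scaled softmax
print(loss_m > loss_0)  # True: the margin makes the objective harder
```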

Large-Scale Pre-Trained Models

  • WavLM / HuBERT / wav2vec 2.0: Self-supervised models fine-tuned for speaker tasks
  • TitaNet (NVIDIA): Scalable architecture with squeeze-excitation and channel attention
  • CAM++: Channel-attention-based approach with strong performance

Speaker Diarization

Who spoke when? Segment a multi-speaker recording into speaker-homogeneous regions.

Traditional Pipeline

Audio -> VAD -> Segmentation -> Embedding extraction -> Clustering -> Resegmentation
  1. Voice Activity Detection (VAD): Identify speech vs non-speech regions
  2. Segmentation: Divide speech into short uniform segments (1-2 seconds)
  3. Embedding extraction: Compute x-vector/ECAPA-TDNN embedding per segment
  4. Clustering: Group segments by speaker
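Step 4 can be sketched with SciPy's hierarchical clustering on cosine distances between segment embeddings. The 2-D-free synthetic embeddings below stand in for real x-vectors, and real systems typically cluster on PLDA scores with calibrated stopping criteria:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(6)
D = 32
# Hypothetical per-segment embeddings for a two-speaker recording:
center_a, center_b = np.zeros(D), np.zeros(D)
center_a[0], center_b[1] = 5.0, 5.0          # well-separated directions
emb = np.vstack([center_a + 0.15 * rng.normal(size=(10, D)),
                 center_b + 0.15 * rng.normal(size=(10, D))])

# Agglomerative hierarchical clustering on pairwise cosine distances
Z = linkage(pdist(emb, metric="cosine"), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 speakers
print(labels)
```

In practice the number of speakers is usually unknown, so the tree is cut with a distance threshold (or spectral clustering estimates the count) rather than a fixed `maxclust`.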

Clustering Methods

  • Agglomerative Hierarchical Clustering (AHC): Bottom-up merging with PLDA scores
  • Spectral clustering: Eigen-decomposition of affinity matrix, auto-selects speaker count
  • VBx (Variational Bayes): Bayesian HMM resegmentation refining initial clustering

Neural Diarization: EEND

End-to-End Neural Diarization (Fujita et al., 2019):

  • Frame-level multi-label classification: For each frame, predict which speakers are active
  • Handles overlapping speech naturally (multiple speakers active simultaneously)
  • SA-EEND (Self-Attentive EEND): Self-attention (Transformer) encoder processes the full recording
  • EEND-EDA (Encoder-Decoder Attractors): Extends EEND to a flexible number of speakers

Limitations: Fixed maximum number of speakers, struggles with very long recordings.

Hybrid Approaches

Combine EEND with clustering for practical systems:

  • EEND-VC (EEND with vector clustering): Local EEND + global clustering
  • TS-VAD (Target Speaker VAD): Given a speaker embedding, detect their speech regions
  • Process long recordings in overlapping blocks, merge results

Evaluation

  • DER (Diarization Error Rate): Sum of missed speech, false alarm, and speaker confusion
  • Typically computed with a 0.25s forgiveness collar around reference boundaries
  • JER (Jaccard Error Rate): Per-speaker intersection-over-union based metric
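A frame-level DER sketch shows how the three error types combine. This toy version assumes single-speaker frames, no collar, and already-aligned speaker labels; real scoring tools additionally find the optimal reference-to-hypothesis speaker mapping and handle overlap:

```python
import numpy as np

def frame_der(ref, hyp) -> float:
    """Frame-level DER: (missed + false alarm + confusion) / total ref speech.
    ref/hyp: integer label arrays; 0 = non-speech, >0 = a speaker label."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    missed = np.sum((ref > 0) & (hyp == 0))        # speech marked non-speech
    false_alarm = np.sum((ref == 0) & (hyp > 0))   # non-speech marked speech
    confusion = np.sum((ref > 0) & (hyp > 0) & (ref != hyp))
    return float((missed + false_alarm + confusion) / max(np.sum(ref > 0), 1))

ref = [0, 1, 1, 1, 2, 2, 0, 0]
hyp = [0, 1, 1, 2, 2, 2, 2, 0]
print(frame_der(ref, hyp))  # (0 missed + 1 FA + 1 confusion) / 5 = 0.4
```

Note that DER is normalized by total reference speech time, so it can exceed 100% when false alarms are severe.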

Anti-Spoofing and Presentation Attack Detection

Speaker verification systems are vulnerable to attacks:

| Attack Type | Method |
|-------------|--------|
| Replay | Play back recorded speech through a loudspeaker |
| TTS | Synthesize the target speaker's voice |
| Voice conversion | Transform the attacker's voice to sound like the target |
| Deepfake | Neural-generated speech mimicking the target |

Countermeasures

  • Front-end features: Linear frequency cepstral coefficients (LFCCs), spectral features capturing synthesis artifacts
  • Neural classifiers: AASIST (Audio Anti-Spoofing using Integrated Spectro-Temporal graph attention networks), RawNet2
  • Self-supervised features: wav2vec 2.0 representations detect subtle artifacts
  • Logical vs physical access: Different detection strategies for TTS/VC vs replay

ASVspoof Challenge

Standard benchmark for anti-spoofing research. Evaluates systems on:

  • Equal Error Rate (EER) for bonafide vs spoof classification
  • t-DCF (tandem Detection Cost Function): Joint optimization with speaker verification
  • Increasingly challenging with improving synthesis quality

Applications

  • Voice authentication: Banking, device unlock, secure access
  • Smart speakers: Personalized responses per household member
  • Forensics: Speaker identification from recordings
  • Meeting transcription: Diarization for multi-speaker notes
  • Call centers: Customer and agent identification
  • Media indexing: Who speaks in podcasts, broadcasts, interviews