
Audio Processing Applications

Noise Reduction

Spectral Subtraction

Classic single-channel noise reduction:

  1. Estimate noise spectrum from non-speech segments (silence/noise-only frames)
  2. Subtract estimated noise power spectrum from noisy signal:
    |S_hat(k)|^2 = |X(k)|^2 - alpha * |N_hat(k)|^2
    
  3. Apply spectral floor (beta * |N_hat(k)|^2) to prevent negative values
  4. Reconstruct using noisy phase (phase is not modified)

Parameters alpha (over-subtraction factor) and beta (spectral floor) control the trade-off between noise reduction and musical noise (isolated tonal artifacts from random spectral peaks surviving subtraction).

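The subtraction and flooring steps above can be sketched per STFT frame. This is a minimal illustration (single frame, power-domain subtraction), not a production implementation:

```python
import numpy as np

def spectral_subtraction(noisy, noise_est, alpha=2.0, beta=0.01):
    """Power spectral subtraction with a spectral floor.

    noisy: complex STFT frame of the noisy signal, shape (n_bins,)
    noise_est: estimated noise power spectrum |N_hat|^2, shape (n_bins,)
    alpha: over-subtraction factor; beta: spectral floor factor.
    """
    noisy_power = np.abs(noisy) ** 2
    clean_power = noisy_power - alpha * noise_est
    # Spectral floor prevents negative power estimates (step 3)
    clean_power = np.maximum(clean_power, beta * noise_est)
    # Reconstruct with the noisy phase, which is left unmodified (step 4)
    return np.sqrt(clean_power) * np.exp(1j * np.angle(noisy))
```

Raising alpha suppresses more noise but carves deeper spectral valleys; raising beta fills those valleys and masks musical noise at the cost of residual noise.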
Wiener Filtering

Estimate the optimal linear filter minimizing mean squared error:

H(k) = |S(k)|^2 / (|S(k)|^2 + |N(k)|^2) = SNR(k) / (SNR(k) + 1)

Requires estimates of the signal and noise power spectra. Iterative Wiener filtering alternates between filter estimation and signal estimation. It produces smoother output than spectral subtraction but depends on good SNR estimation.

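Once the power spectra (or SNR) are available, the gain reduces to a one-liner. A sketch assuming they are known:

```python
def wiener_gain(signal_power, noise_power, eps=1e-12):
    """Frequency-domain Wiener gain H(k) = SNR(k) / (SNR(k) + 1).

    Accepts scalars or numpy arrays of per-bin power estimates.
    """
    snr = signal_power / (noise_power + eps)
    return snr / (snr + 1.0)
```

At SNR = 1 (0 dB) the gain is 0.5; it approaches 1 for strong bins and 0 for noise-dominated bins.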
Statistical Model-Based

  • MMSE-STSA (Minimum Mean Square Error Short-Time Spectral Amplitude): Bayesian estimation assuming Gaussian speech and noise, estimates spectral amplitude
  • Log-MMSE: Operates in log-spectral domain, reduces musical noise
  • Decision-directed SNR estimation (Ephraim-Malah): Smooth a priori SNR estimation combining current frame and previous estimate

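As a sketch, the decision-directed a priori SNR update for a single frequency bin might look like this (variable names and the default smoothing factor are illustrative):

```python
def decision_directed_snr(prev_amp_sq, noise_power, post_power,
                          alpha=0.98, eps=1e-12):
    """Decision-directed a priori SNR estimate (Ephraim-Malah style).

    prev_amp_sq: |S_hat|^2 from the previous frame's estimate
    post_power: |X|^2 of the current noisy frame
    alpha: smoothing factor, typically 0.95-0.99
    """
    gamma = post_power / (noise_power + eps)  # a posteriori SNR
    # Blend previous estimate with current instantaneous SNR
    xi = (alpha * prev_amp_sq / (noise_power + eps)
          + (1 - alpha) * max(gamma - 1.0, 0.0))
    return xi
```

The heavy smoothing (alpha near 1) is what suppresses the frame-to-frame gain fluctuations that cause musical noise.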
RNNoise (Valin, 2018)

Lightweight neural noise suppression designed for real-time communication:

  • GRU-based architecture (3 GRU layers, ~60k parameters)
  • Operates on Bark-scale bands (22 bands): Predicts gain per band
  • Combines neural network with traditional DSP (pitch filtering)
  • Runs in real-time on a single CPU core with minimal latency
  • Trained on synthetic mixtures of clean speech + diverse noise types

Pipeline:

Noisy BFCC features + pitch -> GRU network -> Band gains -> Apply to signal
                                            -> VAD probability

Modern Deep Noise Suppression

  • DTLN (Dual-signal Transformation LSTM Network): Two-stage (magnitude + time-domain)
  • FullSubNet: Full-band and sub-band fusion for frequency-domain processing
  • DCCRN (Deep Complex CRN): Complex-valued convolutions preserving phase information
  • DeepFilterNet: Combines ERB-band gains (coarse) with deep filtering (fine, low bands)
  • DNS Challenge (Microsoft): Benchmark driving real-time noise suppression research

Acoustic Echo Cancellation (AEC)

Remove the echo of a loudspeaker signal captured by a nearby microphone during full-duplex communication (e.g., speakerphone calls).

Adaptive Filtering

Model the acoustic path from loudspeaker to microphone:

echo_estimate = h * x  (convolution of loudspeaker signal with room impulse response)
error = mic_signal - echo_estimate
  • NLMS (Normalized Least Mean Squares): Simple, robust, widely deployed
  • RLS (Recursive Least Squares): Faster convergence, higher complexity
  • Frequency-domain adaptive filtering (FDAF): Block-based, efficient for long filters
  • Filter length must cover room reverberation (typically 100-500 ms, thousands of taps)

Challenges

  • Double-talk: Both near-end and far-end speakers active simultaneously
  • Non-linearities: Loudspeaker distortion not captured by linear filter
  • Acoustic changes: Room changes, speaker/mic movement require re-adaptation
  • Residual echo: Remaining echo after linear cancellation

Neural AEC

  • Post-filter networks suppress residual echo after linear AEC
  • End-to-end neural AEC replacing or augmenting adaptive filters
  • AEC Challenge (Microsoft): Standardized evaluation with real recordings
  • Modern approaches combine adaptive filtering with neural residual suppression

Sound Event Detection (SED)

Identify and temporally locate sound events in audio recordings.

Task Variants

| Variant | Description |
|---------|-------------|
| Audio tagging | Clip-level labels (event present/absent in recording) |
| SED | Frame-level labels with onset/offset times |
| Polyphonic SED | Multiple overlapping events detected simultaneously |
| Few-shot SED | Detect novel event types from few examples |

Architectures

  • CNN on mel-spectrograms: VGG-like, ResNet adapted for audio
  • CRNN: CNN feature extractor + RNN for temporal modeling
  • Audio Spectrogram Transformer (AST): Vision Transformer on spectrogram patches
  • BEATs: Self-supervised pre-training with audio tokenizer, fine-tuned for SED
  • PANNs (Pre-trained Audio Neural Networks): Large-scale models on AudioSet

AudioSet

Google's large-scale audio dataset: 2M+ 10-second YouTube clips, 527 event categories, multi-label and weakly labeled (clip-level annotations, no frame-level timing). Primary training resource for general audio understanding.

DCASE Challenge

Annual Detection and Classification of Acoustic Scenes and Events challenge:

  • Acoustic scene classification (airport, park, metro, etc.)
  • Sound event detection and localization
  • Anomalous sound detection for machine monitoring
  • Few-shot bioacoustic event detection

Acoustic Source Localization

Determine the spatial position or direction of a sound source using microphone arrays.

Direction of Arrival (DOA) Estimation

Estimate the angle from which sound arrives at the array.

GCC-PHAT (Generalized Cross-Correlation with Phase Transform):

  1. Compute cross-correlation between microphone pairs
  2. Phase transform (whiten) sharpens the correlation peak
  3. Peak position gives the time difference of arrival (TDOA)
  4. TDOA maps to DOA angle via array geometry:
    theta = arcsin(TDOA * c / d)   (for linear array, d = mic spacing)
    

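Steps 1-4 can be sketched for one microphone pair. This is a simplified illustration; real systems add windowing, interpolation around the peak, and tracking over time:

```python
import numpy as np

def gcc_phat_tdoa(sig, ref, fs, max_tau=None):
    """Estimate the TDOA of `sig` relative to `ref` via GCC-PHAT.

    Returns the delay in seconds (positive: sig lags ref).
    """
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    # Phase transform: whiten the cross-spectrum to sharpen the peak
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    # Rearrange so negative lags precede positive lags
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(cc) - max_shift
    return shift / fs
```

The returned TDOA then maps to an angle with theta = arcsin(TDOA * c / d) for a linear array.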
MUSIC (Multiple Signal Classification):

  • Eigendecomposition of spatial covariance matrix
  • Separate signal and noise subspaces
  • Scan steering vectors against noise subspace for sharp DOA peaks
  • Super-resolution (resolves closely spaced sources beyond array limits)

Neural DOA: CNN/CRNN models on multi-channel spectrograms predict DOA directly and handle reverberation and noise better than classical methods.

Beamforming

Spatially filter a microphone array to enhance signals from a target direction:

Delay-and-Sum: Align signals by compensating inter-microphone delays, then average. Simplest beamformer. Array gain = N (number of microphones).

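A frequency-domain sketch of delay-and-sum with fractional steering delays, assuming the per-mic propagation delays toward the target are known:

```python
import numpy as np

def delay_and_sum(frames, delays, fs):
    """Delay-and-sum beamformer with fractional delays via phase shifts.

    frames: (n_mics, n_samples) array of microphone frames
    delays: per-mic propagation delays in seconds (to be compensated)
    """
    n_mics, n = frames.shape
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spec = np.fft.rfft(frames, axis=1)
    # Advance each mic by its delay (linear phase), aligning the target
    spec *= np.exp(2j * np.pi * freqs[None, :] * np.asarray(delays)[:, None])
    return np.fft.irfft(spec.sum(axis=0), n=n) / n_mics
```

Note the phase compensation is circular; block-based implementations use overlap-add or zero padding to avoid wrap-around artifacts.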
MVDR (Minimum Variance Distortionless Response):

  • Minimize output power while preserving signal from target direction
  • Requires noise covariance matrix estimation
  • Optimal for stationary noise; adaptive variants track changes

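The closed-form MVDR weights, w = R^-1 d / (d^H R^-1 d), can be computed directly. A sketch assuming the noise covariance R and steering vector d are given:

```python
import numpy as np

def mvdr_weights(R_noise, steering):
    """MVDR beamformer weights w = R^-1 d / (d^H R^-1 d).

    R_noise: (M, M) Hermitian noise spatial covariance matrix
    steering: (M,) steering vector toward the target direction
    """
    # Solve R w' = d rather than inverting R explicitly
    Rinv_d = np.linalg.solve(R_noise, steering)
    return Rinv_d / (steering.conj() @ Rinv_d)
```

The denominator enforces the distortionless constraint w^H d = 1, so the target direction passes with unit gain while output power (hence noise) is minimized.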
Neural beamforming: Estimate masks or beamformer coefficients with neural networks. Commonly used as front-end for far-field ASR (e.g., CHiME challenge).

Applications

  • Smart speakers (far-field voice interaction)
  • Video conferencing (speaker tracking)
  • Hearing aids (directional processing)
  • Surveillance and monitoring
  • Robot audition

Spatial Audio

Binaural Audio

Two-channel audio reproducing 3D perception over headphones:

  • Simulate how each ear receives sound from a specific direction
  • Requires HRTF (Head-Related Transfer Function) -- the frequency-dependent filter from a point in space to each eardrum

HRTF encodes:

  • ITD (Interaural Time Difference): Arrival time difference between ears (~0-0.7 ms)
  • ILD (Interaural Level Difference): Level difference, especially at high frequencies
  • Spectral coloring: Pinna reflections encode elevation cues

Personalized HRTFs (measured or estimated) give the most realistic spatialization. Generic HRTFs work reasonably but suffer from front-back confusion and poor elevation.

Ambisonics

Represent a full 3D sound field using spherical harmonic decomposition:

  • First-order (FOA): 4 channels (W, X, Y, Z) -- omnidirectional + 3 figure-eight patterns
  • Higher-order (HOA): (N+1)^2 channels for order N -- better spatial resolution
  • Scene-based: Records/represents the entire sound field, not individual sources
  • Format-agnostic: Decode to any speaker layout or headphones (via HRTF)

Workflow:

Source positions + audio -> Ambisonics encoding -> B-format -> Decoding -> Speakers/headphones

Used in VR/AR, 360-degree video, and immersive audio production. YouTube and Facebook support first-order ambisonics for spatial audio in 360 video.

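First-order encoding of a mono source can be sketched as follows. The W-channel scaling of 1/sqrt(2) follows the traditional B-format convention; other normalizations (SN3D/N3D, as in ambiX) differ, so treat the scaling here as one choice among several:

```python
import numpy as np

def foa_encode(mono, azimuth, elevation):
    """Encode a mono source at (azimuth, elevation) in radians into
    first-order B-format channels (W, X, Y, Z)."""
    w = mono / np.sqrt(2.0)                             # omnidirectional
    x = mono * np.cos(azimuth) * np.cos(elevation)      # front-back
    y = mono * np.sin(azimuth) * np.cos(elevation)      # left-right
    z = mono * np.sin(elevation)                        # up-down
    return np.stack([w, x, y, z])
```

A source straight ahead (azimuth 0, elevation 0) lands entirely in W and X, with Y and Z silent.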
Other Spatial Audio Formats

  • Dolby Atmos: Object-based + channel-based hybrid; supports up to 128 tracks
  • MPEG-H 3D Audio: ISO standard for interactive, personalized spatial audio
  • Sony 360 Reality Audio: Object-based format using MPEG-H
  • Auro-3D: Channel-based height format for cinema
  • Steam Audio / Project Acoustics: Real-time spatial audio for games with physics-based room simulation

Audio Codec Design

Principles

Audio codecs balance bitrate, quality, latency, and complexity:

Quality = f(bitrate, algorithm, content_type, latency_constraints)

Psychoacoustic Coding (Traditional)

Used by MP3, AAC, Opus (CELT mode):

  1. Transform: MDCT (Modified Discrete Cosine Transform) for time-frequency analysis
  2. Psychoacoustic model: Compute masking thresholds per frequency band
  3. Quantization: Allocate bits based on perceptual importance (mask inaudible parts)
  4. Entropy coding: Huffman or arithmetic coding of quantized coefficients
  5. Bitstream packing: Format into decodable bitstream

Opus Codec

The modern standard for real-time communication and streaming:

  • SILK mode: Linear prediction-based, optimized for speech (6-40 kbps)
  • CELT mode: MDCT-based, optimized for music (48-510 kbps)
  • Hybrid mode: SILK for low band + CELT for high band (crossover at ~8 kHz)
  • Frame sizes: 2.5, 5, 10, 20, 40, 60 ms (algorithm latency as low as 2.5 ms)
  • Seamless transition between modes based on content
  • IETF standard (RFC 6716), royalty-free, open-source

Neural Audio Codecs

Learned compression using neural networks:

SoundStream (Google, 2021):

  • Encoder-decoder with residual vector quantization (RVQ)
  • Each quantizer captures progressively finer detail
  • Trained end-to-end with reconstruction + adversarial + perceptual losses
  • Operates at 3-18 kbps, with low-bitrate quality rivaling Opus running at higher bitrates

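The residual quantization idea can be sketched with plain nearest-neighbor lookups. The toy codebooks below are illustrative; real codecs learn them end-to-end:

```python
import numpy as np

def rvq_encode(vec, codebooks):
    """Residual vector quantization: each stage quantizes the residual
    left by the previous stage, so later stages add finer detail.

    vec: (D,) float vector; codebooks: list of (K, D) arrays.
    Returns the chosen indices and the reconstructed vector.
    """
    residual = vec.copy()
    indices, quantized = [], np.zeros_like(vec)
    for cb in codebooks:
        # Nearest codeword to the current residual
        i = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        indices.append(i)
        quantized += cb[i]
        residual -= cb[i]
    return indices, quantized
```

Dropping trailing codebooks at decode time trades quality for bitrate, which is how a single RVQ model serves multiple bitrates.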
EnCodec (Meta, 2022):

  • Similar RVQ architecture to SoundStream
  • Supports 1.5, 3, 6, 12, 24 kbps
  • Balances spectral and time-domain losses
  • Foundation for audio language models (MusicGen, VALL-E)

DAC (Descript Audio Codec):

  • Improved codebook utilization via factorized codes
  • Higher quality at very low bitrates
  • 44.1 kHz support for music

Codec Applications Beyond Compression

Neural codecs serve as discrete audio tokenizers for generative models:

  • TTS: VALL-E generates EnCodec tokens conditioned on text
  • Music generation: MusicGen operates on EnCodec token sequences
  • Audio understanding: Discrete tokens enable LLM-style audio processing
  • Bandwidth extension: Predict high-frequency codes from low-frequency ones

Evaluation Metrics

| Metric | Type | Description |
|--------|------|-------------|
| PESQ | Intrusive | Perceptual Evaluation of Speech Quality (ITU-T P.862) |
| POLQA | Intrusive | Successor to PESQ (ITU-T P.863) |
| ViSQOL | Intrusive | Virtual Speech Quality Objective Listener |
| SI-SDR | Intrusive | Scale-Invariant Signal-to-Distortion Ratio |
| DNSMOS | Non-intrusive | Deep noise suppression MOS predictor |
| MUSHRA | Subjective | Multiple Stimuli with Hidden Reference and Anchor |
| MOS | Subjective | Mean Opinion Score (1-5 scale, human listeners) |

Intrusive metrics compare processed audio against a clean reference. Non-intrusive metrics estimate quality from the signal alone. Subjective listening tests remain the gold standard for perceptual quality evaluation.
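Of the intrusive metrics, SI-SDR is simple enough to compute directly. A sketch; the zero-mean normalization follows common practice:

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-Invariant SDR in dB: project the estimate onto the
    reference, then compare target energy to residual energy."""
    ref = reference - reference.mean()
    est = estimate - estimate.mean()
    # Optimal scaling of the reference (projection)
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    residual = est - target
    return 10 * np.log10(np.dot(target, target)
                         / (np.dot(residual, residual) + eps))
```

Because of the projection, rescaling the estimate leaves the score unchanged, unlike plain SNR.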