Audio Processing Applications
Noise Reduction
Spectral Subtraction
Classic single-channel noise reduction:
- Estimate noise spectrum from non-speech segments (silence/noise-only frames)
- Subtract estimated noise power spectrum from noisy signal:
  |S_hat(k)|^2 = |X(k)|^2 - alpha * |N_hat(k)|^2
- Apply spectral floor (beta * |N_hat(k)|^2) to prevent negative values
- Reconstruct using noisy phase (phase is not modified)
Parameters alpha (over-subtraction factor) and beta (spectral floor) control the trade-off between noise reduction and musical noise (isolated tonal artifacts from random spectral peaks surviving subtraction).
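The steps above fit in a few lines of NumPy. This is an illustrative sketch: the frame length, hop size, Hann window, and overlap-add synthesis are common choices, not part of the method's definition.

```python
import numpy as np

def spectral_subtraction(noisy, noise_est, alpha=2.0, beta=0.01,
                         n_fft=512, hop=128):
    """Power spectral subtraction sketch.

    noisy: noisy signal; noise_est: a noise-only segment.
    alpha: over-subtraction factor; beta: spectral floor.
    """
    win = np.hanning(n_fft)
    # Estimate the noise power spectrum from the noise-only segment
    frames = [noise_est[i:i + n_fft] * win
              for i in range(0, len(noise_est) - n_fft, hop)]
    noise_psd = np.mean([np.abs(np.fft.rfft(f))**2 for f in frames], axis=0)

    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for i in range(0, len(noisy) - n_fft, hop):
        X = np.fft.rfft(noisy[i:i + n_fft] * win)
        power = np.abs(X)**2 - alpha * noise_psd        # subtract noise power
        power = np.maximum(power, beta * noise_psd)     # spectral floor
        S = np.sqrt(power) * np.exp(1j * np.angle(X))   # keep the noisy phase
        out[i:i + n_fft] += np.fft.irfft(S) * win       # weighted overlap-add
        norm[i:i + n_fft] += win**2
    return out / np.maximum(norm, 1e-8)
```

With alpha large and beta small, more noise is removed but the isolated surviving peaks become audible as musical noise.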
Wiener Filtering
Estimate the optimal linear filter minimizing mean squared error:
H(k) = |S(k)|^2 / (|S(k)|^2 + |N(k)|^2) = SNR(k) / (SNR(k) + 1)
Requires estimation of signal and noise power spectra. Iterative Wiener filtering alternates between filter estimation and signal estimation. Smoother output than spectral subtraction but requires good SNR estimation.
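The gain curve follows directly from the per-bin SNR; a minimal sketch:

```python
import numpy as np

def wiener_gain(snr):
    """Per-bin Wiener gain H(k) = SNR(k) / (SNR(k) + 1)."""
    snr = np.asarray(snr, dtype=float)
    return snr / (snr + 1.0)

# High-SNR bins pass almost unchanged; low-SNR bins are strongly attenuated.
gains = wiener_gain([100.0, 1.0, 0.01])
```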
Statistical Model-Based
- MMSE-STSA (Minimum Mean Square Error Short-Time Spectral Amplitude): Bayesian estimation assuming Gaussian speech and noise, estimates spectral amplitude
- Log-MMSE: Operates in log-spectral domain, reduces musical noise
- Decision-directed SNR estimation (Ephraim-Malah): Smooth a priori SNR estimation combining current frame and previous estimate
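The decision-directed update can be sketched as below; the variable names are illustrative, and alpha ~ 0.98 is a typical smoothing constant from the literature.

```python
import numpy as np

def decision_directed_snr(noisy_power, noise_psd, prev_clean_power,
                          alpha=0.98):
    """Decision-directed a priori SNR estimate (sketch).

    Mixes the previous frame's clean-speech power estimate with the
    current frame's instantaneous (a posteriori) SNR minus one.
    """
    post_snr = noisy_power / noise_psd
    prior_snr = (alpha * prev_clean_power / noise_psd
                 + (1 - alpha) * np.maximum(post_snr - 1.0, 0.0))
    return prior_snr
```

The heavy smoothing toward the previous estimate is what suppresses the frame-to-frame SNR fluctuations that cause musical noise.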
RNNoise (Valin, 2018)
Lightweight neural noise suppression designed for real-time communication:
- GRU-based architecture (3 GRU layers, ~60k parameters)
- Operates on Bark-scale bands (22 bands): Predicts gain per band
- Combines neural network with traditional DSP (pitch filtering)
- Runs in real-time on a single CPU core with minimal latency
- Trained on synthetic mixtures of clean speech + diverse noise types
Pipeline:
Noisy BFCC features + pitch -> GRU network -> Band gains -> Apply to signal
                                           -> VAD probability (auxiliary output)
Modern Deep Noise Suppression
- DTLN (Dual-signal Transformation LSTM Network): Two-stage (magnitude + time-domain)
- FullSubNet: Full-band and sub-band fusion for frequency-domain processing
- DCCRN (Deep Complex CRN): Complex-valued convolutions preserving phase information
- DeepFilterNet: Combines ERB-band gains (coarse) with deep filtering (fine, low bands)
- DNS Challenge (Microsoft): Benchmark driving real-time noise suppression research
Acoustic Echo Cancellation (AEC)
Remove the echo of a loudspeaker signal captured by a nearby microphone during full-duplex communication (e.g., speakerphone calls).
Adaptive Filtering
Model the acoustic path from loudspeaker to microphone:
echo_estimate = h * x (convolution of loudspeaker signal with room impulse response)
error = mic_signal - echo_estimate
- NLMS (Normalized Least Mean Squares): Simple, robust, widely deployed
- RLS (Recursive Least Squares): Faster convergence, higher complexity
- Frequency-domain adaptive filtering (FDAF): Block-based, efficient for long filters
- Filter length must cover room reverberation (typically 100-500 ms, thousands of taps)
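The NLMS update from the list above, sketched on a toy echo path (the filter length, step size, and synthetic echo are illustrative):

```python
import numpy as np

def nlms_aec(far_end, mic, n_taps=16, mu=0.5, eps=1e-8):
    """NLMS echo canceller sketch: adapt an FIR estimate of the echo path.

    far_end: loudspeaker signal x; mic: microphone signal d.
    Returns the error signal e = d - echo_estimate (the cancelled output).
    """
    w = np.zeros(n_taps)
    e = np.zeros(len(mic))
    for n in range(n_taps, len(mic)):
        x = far_end[n - n_taps + 1:n + 1][::-1]  # newest sample first
        echo_estimate = w @ x
        e[n] = mic[n] - echo_estimate
        w += mu * e[n] * x / (x @ x + eps)       # normalized update
    return e
```

Normalizing by the input energy (x @ x) is what makes NLMS robust to level changes in the far-end signal; plain LMS with a fixed step can diverge when the loudspeaker gets loud.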
Challenges
- Double-talk: Both near-end and far-end speakers active simultaneously
- Non-linearities: Loudspeaker distortion not captured by linear filter
- Acoustic changes: Room changes, speaker/mic movement require re-adaptation
- Residual echo: Remaining echo after linear cancellation
Neural AEC
- Post-filter networks suppress residual echo after linear AEC
- End-to-end neural AEC replacing or augmenting adaptive filters
- AEC Challenge (Microsoft): Standardized evaluation with real recordings
- Modern approaches combine adaptive filtering with neural residual suppression
Sound Event Detection (SED)
Identify and temporally locate sound events in audio recordings.
Task Variants
| Variant | Description |
|---------|-------------|
| Audio tagging | Clip-level labels (event present/absent in recording) |
| SED | Frame-level labels with onset/offset times |
| Polyphonic SED | Multiple overlapping events detected simultaneously |
| Few-shot SED | Detect novel event types from few examples |
Architectures
- CNN on mel-spectrograms: VGG-like, ResNet adapted for audio
- CRNN: CNN feature extractor + RNN for temporal modeling
- Audio Spectrogram Transformer (AST): Vision Transformer on spectrogram patches
- BEATs: Self-supervised pre-training with audio tokenizer, fine-tuned for SED
- PANNs (Pre-trained Audio Neural Networks): Large-scale models on AudioSet
AudioSet
Google's large-scale audio dataset: 2M+ 10-second YouTube clips, 527 event categories, multi-label weakly labeled (clip-level, not frame-level). Primary training resource for general audio understanding.
DCASE Challenge
Annual Detection and Classification of Acoustic Scenes and Events challenge:
- Acoustic scene classification (airport, park, metro, etc.)
- Sound event detection and localization
- Anomalous sound detection for machine monitoring
- Few-shot bioacoustic event detection
Acoustic Source Localization
Determine the spatial position or direction of a sound source using microphone arrays.
Direction of Arrival (DOA) Estimation
Estimate the angle from which sound arrives at the array.
GCC-PHAT (Generalized Cross-Correlation with Phase Transform):
- Compute cross-correlation between microphone pairs
- Phase transform (whiten) sharpens the correlation peak
- Peak position gives the time difference of arrival (TDOA)
- TDOA maps to DOA angle via array geometry:
theta = arcsin(TDOA * c / d) (for linear array, d = mic spacing)
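The GCC-PHAT steps above can be sketched directly; the FFT length and the peak search range are implementation choices:

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """GCC-PHAT TDOA estimate between two microphone signals (sketch).

    Returns the delay of sig relative to ref, in seconds. Map to an
    angle afterwards via theta = arcsin(tdoa * c / d) for a mic pair.
    """
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n)
    REF = np.fft.rfft(ref, n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                 # phase transform (whitening)
    cc = np.fft.irfft(R, n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Re-centre so negative lags precede positive lags
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs
```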
MUSIC (Multiple Signal Classification):
- Eigendecomposition of spatial covariance matrix
- Separate signal and noise subspaces
- Scan steering vectors against noise subspace for sharp DOA peaks
- Super-resolution (resolves closely spaced sources beyond array limits)
Neural DOA: CNNs/CRNNs applied to multi-channel spectrograms predict DOA directly, and typically handle reverberation and noise better than classical methods.
Beamforming
Spatially filter a microphone array to enhance signals from a target direction:
Delay-and-Sum: Align signals by compensating inter-microphone delays, then average. Simplest beamformer. Array gain = N (number of microphones).
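A toy delay-and-sum beamformer, assuming integer-sample steering delays for simplicity (real arrays use fractional delays via interpolation or frequency-domain phase shifts):

```python
import numpy as np

def delay_and_sum(mics, delays_samples):
    """Delay-and-sum sketch: align each channel to the target direction
    by removing its steering delay, then average.

    mics: (n_mics, n_samples) array; delays_samples: per-mic integer
    delay of the target signal at that microphone.
    """
    n_mics, n_samples = mics.shape
    out = np.zeros(n_samples)
    for m in range(n_mics):
        out += np.roll(mics[m], -delays_samples[m])  # undo the delay
    return out / n_mics
```

After alignment the target adds coherently while uncorrelated noise adds incoherently, which is where the array gain of N comes from.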
MVDR (Minimum Variance Distortionless Response):
- Minimize output power while preserving signal from target direction
- Requires noise covariance matrix estimation
- Optimal for stationary noise; adaptive variants track changes
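The MVDR solution has the closed form w = R^{-1} d / (d^H R^{-1} d), where R is the noise covariance and d the steering vector; a minimal sketch:

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """MVDR beamformer weights: minimize output power subject to
    w^H d = 1 (distortionless response toward the steering vector)."""
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)
```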
Neural beamforming: Estimate masks or beamformer coefficients with neural networks. Commonly used as front-end for far-field ASR (e.g., CHiME challenge).
Applications
- Smart speakers (far-field voice interaction)
- Video conferencing (speaker tracking)
- Hearing aids (directional processing)
- Surveillance and monitoring
- Robot audition
Spatial Audio
Binaural Audio
Two-channel audio reproducing 3D perception over headphones:
- Simulate how each ear receives sound from a specific direction
- Requires HRTF (Head-Related Transfer Function) -- the frequency-dependent filter from a point in space to each eardrum
HRTF encodes:
- ITD (Interaural Time Difference): Arrival time difference between ears (~0-0.7 ms)
- ILD (Interaural Level Difference): Level difference, especially at high frequencies
- Spectral coloring: Pinna reflections encode elevation cues
Personalized HRTFs (measured or estimated) give the most realistic spatialization. Generic HRTFs work reasonably but suffer from front-back confusion and poor elevation.
Ambisonics
Represent a full 3D sound field using spherical harmonic decomposition:
- First-order (FOA): 4 channels (W, X, Y, Z) -- omnidirectional + 3 figure-eight patterns
- Higher-order (HOA): (N+1)^2 channels for order N -- better spatial resolution
- Scene-based: Records/represents the entire sound field, not individual sources
- Format-agnostic: Decode to any speaker layout or headphones (via HRTF)
Workflow:
Source positions + audio -> Ambisonics encoding -> B-format -> Decoding -> Speakers/headphones
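The encoding step for a single point source reduces to fixed per-channel gains. This sketch uses the FuMa-style W weighting of 1/sqrt(2); other conventions (e.g. AmbiX/SN3D) scale W differently.

```python
import numpy as np

def foa_encode(mono, azimuth, elevation):
    """Encode a mono source into first-order B-format (W, X, Y, Z).

    azimuth/elevation in radians; X points front, Y left, Z up.
    """
    w = mono / np.sqrt(2.0)  # omnidirectional (FuMa weighting, convention-dependent)
    x = mono * np.cos(azimuth) * np.cos(elevation)
    y = mono * np.sin(azimuth) * np.cos(elevation)
    z = mono * np.sin(elevation)
    return np.stack([w, x, y, z])
```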
Used in VR/AR, 360-degree video, and immersive audio production. YouTube and Facebook support first-order ambisonics for spatial audio in 360 video.
Other Spatial Audio Formats
- Dolby Atmos: Object-based + channel-based hybrid; supports up to 128 tracks
- MPEG-H 3D Audio: ISO standard for interactive, personalized spatial audio
- Sony 360 Reality Audio: Object-based format using MPEG-H
- Auro-3D: Channel-based height format for cinema
- Steam Audio / Project Acoustics: Real-time spatial audio for games with physics-based room simulation
Audio Codec Design
Principles
Audio codecs balance bitrate, quality, latency, and complexity:
Quality = f(bitrate, algorithm, content_type, latency_constraints)
Psychoacoustic Coding (Traditional)
Used by MP3, AAC, Opus (CELT mode):
- Transform: MDCT (Modified Discrete Cosine Transform) for time-frequency analysis
- Psychoacoustic model: Compute masking thresholds per frequency band
- Quantization: Allocate bits based on perceptual importance (mask inaudible parts)
- Entropy coding: Huffman or arithmetic coding of quantized coefficients
- Bitstream packing: Format into decodable bitstream
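The quantization step can be illustrated with a toy greedy bit allocator. This is a hypothetical simplification of what MP3/AAC-style coders do: bands whose energy exceeds the masking threshold receive bits, and each extra bit of quantizer resolution is assumed to buy roughly 6 dB of noise reduction.

```python
import numpy as np

def allocate_bits(band_energy, mask_threshold, total_bits):
    """Greedy perceptual bit allocation sketch (hypothetical simplification)."""
    band_energy = np.asarray(band_energy, dtype=float)
    mask_threshold = np.asarray(mask_threshold, dtype=float)
    smr_db = 10 * np.log10(band_energy / mask_threshold)  # signal-to-mask ratio
    bits = np.zeros(len(band_energy), dtype=int)
    for _ in range(total_bits):
        # Remaining noise-to-mask ratio after the bits granted so far
        nmr = smr_db - 6.0 * bits
        best = int(np.argmax(nmr))
        if nmr[best] <= 0:
            break  # all quantization noise is already masked
        bits[best] += 1
    return bits
```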
Opus Codec
The modern standard for real-time communication and streaming:
- SILK mode: Linear prediction-based, optimized for speech (6-40 kbps)
- CELT mode: MDCT-based, optimized for music (48-510 kbps)
- Hybrid mode: SILK for low band + CELT for high band (crossover at ~8 kHz)
- Frame sizes: 2.5, 5, 10, 20, 40, 60 ms (algorithmic delay as low as 5 ms in restricted low-delay mode)
- Seamless transition between modes based on content
- IETF standard (RFC 6716), royalty-free, open-source
Neural Audio Codecs
Learned compression using neural networks:
SoundStream (Google, 2021):
- Encoder-decoder with residual vector quantization (RVQ)
- Each quantizer captures progressively finer detail
- Trained end-to-end with reconstruction + adversarial + perceptual losses
- Operates at 3-18 kbps; quality at 3 kbps reported comparable to Opus at much higher bitrates
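The RVQ idea is simple: each stage vector-quantizes the residual left by the previous stages, so codes get progressively finer. A minimal sketch with toy codebooks (real codecs learn the codebooks end-to-end):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization sketch.

    codebooks: list of (n_codes, dim) arrays, one per stage.
    Returns the per-stage code indices and the reconstruction.
    """
    residual = x.astype(float).copy()
    indices = []
    recon = np.zeros_like(residual)
    for cb in codebooks:
        # Nearest codeword to the current residual
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        recon += cb[idx]
        residual = residual - cb[idx]
    return indices, recon
```

Dropping later stages at decode time degrades quality gracefully, which is how RVQ codecs support multiple bitrates from one model.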
EnCodec (Meta, 2022):
- Similar RVQ architecture to SoundStream
- Supports 1.5, 3, 6, 12, 24 kbps
- Balances spectral and time-domain losses
- Foundation for audio language models (MusicGen, VALL-E)
DAC (Descript Audio Codec):
- Improved codebook utilization via factorized codes
- Higher quality at very low bitrates
- 44.1 kHz support for music
Codec Applications Beyond Compression
Neural codecs serve as discrete audio tokenizers for generative models:
- TTS: VALL-E generates EnCodec tokens conditioned on text
- Music generation: MusicGen operates on EnCodec token sequences
- Audio understanding: Discrete tokens enable LLM-style audio processing
- Bandwidth extension: Predict high-frequency codes from low-frequency ones
Evaluation Metrics
| Metric | Type | Description |
|--------|------|-------------|
| PESQ | Intrusive | Perceptual Evaluation of Speech Quality (ITU-T P.862) |
| POLQA | Intrusive | Successor to PESQ (P.863) |
| ViSQOL | Intrusive | Virtual Speech Quality Objective Listener |
| SI-SDR | Intrusive | Scale-Invariant Signal-to-Distortion Ratio |
| DNSMOS | Non-intrusive | Deep noise suppression MOS predictor |
| MUSHRA | Subjective | Multiple Stimuli with Hidden Reference and Anchor |
| MOS | Subjective | Mean Opinion Score (1-5 scale, human listeners) |
Intrusive metrics compare processed audio against a clean reference. Non-intrusive metrics estimate quality from the signal alone. Subjective listening tests remain the gold standard for perceptual quality evaluation.
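Of the intrusive metrics, SI-SDR is the easiest to compute directly: project the estimate onto the reference to remove any overall scaling, then compare the projected target against the residual. A sketch:

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-Invariant SDR in dB (sketch)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Optimal scaling of the reference toward the estimate
    scale = (estimate @ reference) / (reference @ reference + eps)
    target = scale * reference
    noise = estimate - target
    return 10 * np.log10((target @ target) / (noise @ noise + eps))
```

Because of the projection, multiplying the estimate by any constant leaves the score unchanged, unlike plain SNR.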