
Audio Processing Applications

Noise Reduction

Spectral Subtraction

Classic single-channel noise reduction:

  1. Estimate noise spectrum from non-speech segments (silence/noise-only frames)
  2. Subtract estimated noise power spectrum from noisy signal:
    |S_hat(k)|^2 = |X(k)|^2 - alpha * |N_hat(k)|^2
    
  3. Apply spectral floor (beta * |N_hat(k)|^2) to prevent negative values
  4. Reconstruct using noisy phase (phase is not modified)

Parameters alpha (over-subtraction factor) and beta (spectral floor) control the trade-off between noise reduction and musical noise (isolated tonal artifacts from random spectral peaks surviving subtraction).

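The subtraction and flooring steps above can be sketched per STFT frame. This is a minimal illustration (single frame, power-domain subtraction), not a production implementation:

```python
import numpy as np

def spectral_subtraction(noisy, noise_est, alpha=2.0, beta=0.01):
    """Power spectral subtraction with a spectral floor.

    noisy: complex STFT frame of the noisy signal, shape (n_bins,)
    noise_est: estimated noise power spectrum |N_hat|^2, shape (n_bins,)
    alpha: over-subtraction factor; beta: spectral floor factor.
    """
    noisy_power = np.abs(noisy) ** 2
    clean_power = noisy_power - alpha * noise_est
    # Spectral floor prevents negative power estimates (step 3)
    clean_power = np.maximum(clean_power, beta * noise_est)
    # Reconstruct with the noisy phase, which is left unmodified (step 4)
    return np.sqrt(clean_power) * np.exp(1j * np.angle(noisy))
```

Raising alpha suppresses more noise but carves deeper spectral valleys; raising beta fills those valleys and masks musical noise at the cost of residual noise.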
Wiener Filtering

Estimate the optimal linear filter minimizing mean squared error:

H(k) = |S(k)|^2 / (|S(k)|^2 + |N(k)|^2) = SNR(k) / (SNR(k) + 1)

Requires estimates of the signal and noise power spectra. Iterative Wiener filtering alternates between filter estimation and signal estimation. It produces smoother output than spectral subtraction but depends on good SNR estimation.

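Once the power spectra (or SNR) are available, the gain reduces to a one-liner. A sketch assuming they are known:

```python
def wiener_gain(signal_power, noise_power, eps=1e-12):
    """Frequency-domain Wiener gain H(k) = SNR(k) / (SNR(k) + 1).

    Accepts scalars or numpy arrays of per-bin power estimates.
    """
    snr = signal_power / (noise_power + eps)
    return snr / (snr + 1.0)
```

At SNR = 1 (0 dB) the gain is 0.5; it approaches 1 for strong bins and 0 for noise-dominated bins.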
Statistical Model-Based

  • MMSE-STSA (Minimum Mean Square Error Short-Time Spectral Amplitude): Bayesian estimation assuming Gaussian speech and noise, estimates spectral amplitude
  • Log-MMSE: Operates in log-spectral domain, reduces musical noise
  • Decision-directed SNR estimation (Ephraim-Malah): Smooth a priori SNR estimation combining current frame and previous estimate

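As a sketch, the decision-directed a priori SNR update for a single frequency bin might look like this (variable names and the default smoothing factor are illustrative):

```python
def decision_directed_snr(prev_amp_sq, noise_power, post_power,
                          alpha=0.98, eps=1e-12):
    """Decision-directed a priori SNR estimate (Ephraim-Malah style).

    prev_amp_sq: |S_hat|^2 from the previous frame's estimate
    post_power: |X|^2 of the current noisy frame
    alpha: smoothing factor, typically 0.95-0.99
    """
    gamma = post_power / (noise_power + eps)  # a posteriori SNR
    # Blend previous estimate with current instantaneous SNR
    xi = (alpha * prev_amp_sq / (noise_power + eps)
          + (1 - alpha) * max(gamma - 1.0, 0.0))
    return xi
```

The heavy smoothing (alpha near 1) is what suppresses the frame-to-frame gain fluctuations that cause musical noise.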
RNNoise (Valin, 2018)

Lightweight neural noise suppression designed for real-time communication:

  • GRU-based architecture (3 GRU layers, ~60k parameters)
  • Operates on Bark-scale bands (22 bands): Predicts gain per band
  • Combines neural network with traditional DSP (pitch filtering)
  • Runs in real-time on a single CPU core with minimal latency
  • Trained on synthetic mixtures of clean speech + diverse noise types

Pipeline:

Noisy BFCC features + pitch -> GRU network -> Band gains -> Apply to signal
                                            -> VAD probability

Modern Deep Noise Suppression

  • DTLN (Dual-signal Transformation LSTM Network): Two-stage (magnitude + time-domain)
  • FullSubNet: Full-band and sub-band fusion for frequency-domain processing
  • DCCRN (Deep Complex CRN): Complex-valued convolutions preserving phase information
  • DeepFilterNet: Combines ERB-band gains (coarse) with deep filtering (fine, low bands)
  • DNS Challenge (Microsoft): Benchmark driving real-time noise suppression research

Acoustic Echo Cancellation (AEC)

Remove the echo of a loudspeaker signal captured by a nearby microphone during full-duplex communication (e.g., speakerphone calls).

Adaptive Filtering

Model the acoustic path from loudspeaker to microphone:

echo_estimate = h * x  (convolution of loudspeaker signal with room impulse response)
error = mic_signal - echo_estimate
  • NLMS (Normalized Least Mean Squares): Simple, robust, widely deployed
  • RLS (Recursive Least Squares): Faster convergence, higher complexity
  • Frequency-domain adaptive filtering (FDAF): Block-based, efficient for long filters
  • Filter length must cover room reverberation (typically 100-500 ms, thousands of taps)

Challenges

  • Double-talk: Both near-end and far-end speakers active simultaneously
  • Non-linearities: Loudspeaker distortion not captured by linear filter
  • Acoustic changes: Room changes, speaker/mic movement require re-adaptation
  • Residual echo: Remaining echo after linear cancellation

Neural AEC

  • Post-filter networks suppress residual echo after linear AEC
  • End-to-end neural AEC replacing or augmenting adaptive filters
  • AEC Challenge (Microsoft): Standardized evaluation with real recordings
  • Modern approaches combine adaptive filtering with neural residual suppression

Sound Event Detection (SED)

Identify and temporally locate sound events in audio recordings.

Task Variants

| Variant | Description |
|---------|-------------|
| Audio tagging | Clip-level labels (event present/absent in recording) |
| SED | Frame-level labels with onset/offset times |
| Polyphonic SED | Multiple overlapping events detected simultaneously |
| Few-shot SED | Detect novel event types from few examples |

Architectures

  • CNN on mel-spectrograms: VGG-like, ResNet adapted for audio
  • CRNN: CNN feature extractor + RNN for temporal modeling
  • Audio Spectrogram Transformer (AST): Vision Transformer on spectrogram patches
  • BEATs: Self-supervised pre-training with audio tokenizer, fine-tuned for SED
  • PANNs (Pre-trained Audio Neural Networks): Large-scale models on AudioSet

AudioSet

Google's large-scale audio dataset: 2M+ 10-second YouTube clips, 527 event categories, multi-label and weakly labeled (clip-level annotations, no frame-level timing). Primary training resource for general audio understanding.

DCASE Challenge

Annual Detection and Classification of Acoustic Scenes and Events challenge:

  • Acoustic scene classification (airport, park, metro, etc.)
  • Sound event detection and localization
  • Anomalous sound detection for machine monitoring
  • Few-shot bioacoustic event detection

Acoustic Source Localization

Determine the spatial position or direction of a sound source using microphone arrays.

Direction of Arrival (DOA) Estimation

Estimate the angle from which sound arrives at the array.

GCC-PHAT (Generalized Cross-Correlation with Phase Transform):

  1. Compute cross-correlation between microphone pairs
  2. Phase transform (whiten) sharpens the correlation peak
  3. Peak position gives the time difference of arrival (TDOA)
  4. TDOA maps to DOA angle via array geometry:
    theta = arcsin(TDOA * c / d)   (for linear array, d = mic spacing)
    

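Steps 1-4 can be sketched for one microphone pair. This is a simplified illustration; real systems add windowing, interpolation around the peak, and tracking over time:

```python
import numpy as np

def gcc_phat_tdoa(sig, ref, fs, max_tau=None):
    """Estimate the TDOA of `sig` relative to `ref` via GCC-PHAT.

    Returns the delay in seconds (positive: sig lags ref).
    """
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    # Phase transform: whiten the cross-spectrum to sharpen the peak
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    # Rearrange so negative lags precede positive lags
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(cc) - max_shift
    return shift / fs
```

The returned TDOA then maps to an angle with theta = arcsin(TDOA * c / d) for a linear array.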
MUSIC (Multiple Signal Classification):

  • Eigendecomposition of spatial covariance matrix
  • Separate signal and noise subspaces
  • Scan steering vectors against noise subspace for sharp DOA peaks
  • Super-resolution (resolves closely spaced sources beyond array limits)

Neural DOA: CNN/CRNN models on multi-channel spectrograms predict DOA directly and handle reverberation and noise better than classical methods.

Beamforming

Spatially filter a microphone array to enhance signals from a target direction:

Delay-and-Sum: Align signals by compensating inter-microphone delays, then average. Simplest beamformer. Array gain = N (number of microphones).

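A frequency-domain sketch of delay-and-sum with fractional steering delays, assuming the per-mic propagation delays toward the target are known:

```python
import numpy as np

def delay_and_sum(frames, delays, fs):
    """Delay-and-sum beamformer with fractional delays via phase shifts.

    frames: (n_mics, n_samples) array of microphone frames
    delays: per-mic propagation delays in seconds (to be compensated)
    """
    n_mics, n = frames.shape
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spec = np.fft.rfft(frames, axis=1)
    # Advance each mic by its delay (linear phase), aligning the target
    spec *= np.exp(2j * np.pi * freqs[None, :] * np.asarray(delays)[:, None])
    return np.fft.irfft(spec.sum(axis=0), n=n) / n_mics
```

Note the phase compensation is circular; block-based implementations use overlap-add or zero padding to avoid wrap-around artifacts.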
MVDR (Minimum Variance Distortionless Response):

  • Minimize output power while preserving signal from target direction
  • Requires noise covariance matrix estimation
  • Optimal for stationary noise; adaptive variants track changes

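The closed-form MVDR weights, w = R^-1 d / (d^H R^-1 d), can be computed directly. A sketch assuming the noise covariance R and steering vector d are given:

```python
import numpy as np

def mvdr_weights(R_noise, steering):
    """MVDR beamformer weights w = R^-1 d / (d^H R^-1 d).

    R_noise: (M, M) Hermitian noise spatial covariance matrix
    steering: (M,) steering vector toward the target direction
    """
    # Solve R w' = d rather than inverting R explicitly
    Rinv_d = np.linalg.solve(R_noise, steering)
    return Rinv_d / (steering.conj() @ Rinv_d)
```

The denominator enforces the distortionless constraint w^H d = 1, so the target direction passes with unit gain while output power (hence noise) is minimized.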
Neural beamforming: Estimate masks or beamformer coefficients with neural networks. Commonly used as front-end for far-field ASR (e.g., CHiME challenge).

Applications

  • Smart speakers (far-field voice interaction)
  • Video conferencing (speaker tracking)
  • Hearing aids (directional processing)
  • Surveillance and monitoring
  • Robot audition

Spatial Audio

Binaural Audio

Two-channel audio reproducing 3D perception over headphones:

  • Simulate how each ear receives sound from a specific direction
  • Requires HRTF (Head-Related Transfer Function) -- the frequency-dependent filter from a point in space to each eardrum

HRTF encodes:

  • ITD (Interaural Time Difference): Arrival time difference between ears (~0-0.7 ms)
  • ILD (Interaural Level Difference): Level difference, especially at high frequencies
  • Spectral coloring: Pinna reflections encode elevation cues

Personalized HRTFs (measured or estimated) give the most realistic spatialization. Generic HRTFs work reasonably but suffer from front-back confusion and poor elevation.

Ambisonics

Represent a full 3D sound field using spherical harmonic decomposition:

  • First-order (FOA): 4 channels (W, X, Y, Z) -- omnidirectional + 3 figure-eight patterns
  • Higher-order (HOA): (N+1)^2 channels for order N -- better spatial resolution
  • Scene-based: Records/represents the entire sound field, not individual sources
  • Format-agnostic: Decode to any speaker layout or headphones (via HRTF)

Workflow:

Source positions + audio -> Ambisonics encoding -> B-format -> Decoding -> Speakers/headphones

Used in VR/AR, 360-degree video, and immersive audio production. YouTube and Facebook support first-order ambisonics for spatial audio in 360 video.

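First-order encoding of a mono source can be sketched as follows. The W-channel scaling of 1/sqrt(2) follows the traditional B-format convention; other normalizations (SN3D/N3D, as in ambiX) differ, so treat the scaling here as one choice among several:

```python
import numpy as np

def foa_encode(mono, azimuth, elevation):
    """Encode a mono source at (azimuth, elevation) in radians into
    first-order B-format channels (W, X, Y, Z)."""
    w = mono / np.sqrt(2.0)                             # omnidirectional
    x = mono * np.cos(azimuth) * np.cos(elevation)      # front-back
    y = mono * np.sin(azimuth) * np.cos(elevation)      # left-right
    z = mono * np.sin(elevation)                        # up-down
    return np.stack([w, x, y, z])
```

A source straight ahead (azimuth 0, elevation 0) lands entirely in W and X, with Y and Z silent.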
Other Spatial Audio Formats

  • Dolby Atmos: Object-based + channel-based hybrid; supports up to 128 tracks
  • MPEG-H 3D Audio: ISO standard for interactive, personalized spatial audio
  • Sony 360 Reality Audio: Object-based format using MPEG-H
  • Auro-3D: Channel-based height format for cinema
  • Steam Audio / Project Acoustics: Real-time spatial audio for games with physics-based room simulation

Audio Codec Design

Principles

Audio codecs balance bitrate, quality, latency, and complexity:

Quality = f(bitrate, algorithm, content_type, latency_constraints)

Psychoacoustic Coding (Traditional)

Used by MP3, AAC, Opus (CELT mode):

  1. Transform: MDCT (Modified Discrete Cosine Transform) for time-frequency analysis
  2. Psychoacoustic model: Compute masking thresholds per frequency band
  3. Quantization: Allocate bits based on perceptual importance (mask inaudible parts)
  4. Entropy coding: Huffman or arithmetic coding of quantized coefficients
  5. Bitstream packing: Format into decodable bitstream

Opus Codec

The modern standard for real-time communication and streaming:

  • SILK mode: Linear prediction-based, optimized for speech (6-40 kbps)
  • CELT mode: MDCT-based, optimized for music (48-510 kbps)
  • Hybrid mode: SILK for low band + CELT for high band (crossover at ~8 kHz)
  • Frame sizes: 2.5, 5, 10, 20, 40, 60 ms (algorithm latency as low as 2.5 ms)
  • Seamless transition between modes based on content
  • IETF standard (RFC 6716), royalty-free, open-source

Neural Audio Codecs

Learned compression using neural networks:

SoundStream (Google, 2021):

  • Encoder-decoder with residual vector quantization (RVQ)
  • Each quantizer captures progressively finer detail
  • Trained end-to-end with reconstruction + adversarial + perceptual losses
  • Operates at 3-18 kbps, with low-bitrate quality rivaling Opus running at higher bitrates

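The residual quantization idea can be sketched with plain nearest-neighbor lookups. The toy codebooks below are illustrative; real codecs learn them end-to-end:

```python
import numpy as np

def rvq_encode(vec, codebooks):
    """Residual vector quantization: each stage quantizes the residual
    left by the previous stage, so later stages add finer detail.

    vec: (D,) float vector; codebooks: list of (K, D) arrays.
    Returns the chosen indices and the reconstructed vector.
    """
    residual = vec.copy()
    indices, quantized = [], np.zeros_like(vec)
    for cb in codebooks:
        # Nearest codeword to the current residual
        i = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        indices.append(i)
        quantized += cb[i]
        residual -= cb[i]
    return indices, quantized
```

Dropping trailing codebooks at decode time trades quality for bitrate, which is how a single RVQ model serves multiple bitrates.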
EnCodec (Meta, 2022):

  • Similar RVQ architecture to SoundStream
  • Supports 1.5, 3, 6, 12, 24 kbps
  • Balances spectral and time-domain losses
  • Foundation for audio language models (MusicGen, VALL-E)

DAC (Descript Audio Codec):

  • Improved codebook utilization via factorized codes
  • Higher quality at very low bitrates
  • 44.1 kHz support for music

Codec Applications Beyond Compression

Neural codecs serve as discrete audio tokenizers for generative models:

  • TTS: VALL-E generates EnCodec tokens conditioned on text
  • Music generation: MusicGen operates on EnCodec token sequences
  • Audio understanding: Discrete tokens enable LLM-style audio processing
  • Bandwidth extension: Predict high-frequency codes from low-frequency ones

Evaluation Metrics

| Metric | Type | Description |
|--------|------|-------------|
| PESQ | Intrusive | Perceptual Evaluation of Speech Quality (ITU-T P.862) |
| POLQA | Intrusive | Successor to PESQ (ITU-T P.863) |
| ViSQOL | Intrusive | Virtual Speech Quality Objective Listener |
| SI-SDR | Intrusive | Scale-Invariant Signal-to-Distortion Ratio |
| DNSMOS | Non-intrusive | Deep noise suppression MOS predictor |
| MUSHRA | Subjective | Multiple Stimuli with Hidden Reference and Anchor |
| MOS | Subjective | Mean Opinion Score (1-5 scale, human listeners) |

Intrusive metrics compare processed audio against a clean reference. Non-intrusive metrics estimate quality from the signal alone. Subjective listening tests remain the gold standard for perceptual quality evaluation.
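Of the intrusive metrics, SI-SDR is simple enough to compute directly. A sketch; the zero-mean normalization follows common practice:

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-Invariant SDR in dB: project the estimate onto the
    reference, then compare target energy to residual energy."""
    ref = reference - reference.mean()
    est = estimate - estimate.mean()
    # Optimal scaling of the reference (projection)
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    residual = est - target
    return 10 * np.log10(np.dot(target, target)
                         / (np.dot(residual, residual) + eps))
```

Because of the projection, rescaling the estimate leaves the score unchanged, unlike plain SNR.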