Audio Feature Extraction
Overview
Audio features transform raw waveforms into compact, informative representations for downstream tasks (speech recognition, music analysis, sound classification). Features are extracted from short overlapping frames (20-50 ms with 10-25 ms hop).
Raw waveform -> Framing -> Windowing -> Feature Extraction -> [time x feature] matrix
Pre-emphasis (high-pass filter, y[n] = x[n] - alpha*x[n-1], alpha ~ 0.97) is often applied first to boost high-frequency energy and flatten the spectrum.
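The framing and pre-emphasis steps above can be sketched in NumPy (frame length, hop, and alpha are illustrative defaults, not prescribed values):

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames: [n_frames x frame_len]."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]
```

At a 16 kHz sample rate, a 25 ms window with a 10 ms hop corresponds to `frame_signal(x, 400, 160)`.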
Time-Domain Features
Computed directly from the waveform samples without frequency transformation.
Zero-Crossing Rate (ZCR)
Number of times the signal crosses zero per frame:
ZCR = (1/2N) * sum_{n=1}^{N-1} |sign(x[n]) - sign(x[n-1])|
- High ZCR: noise, unvoiced speech (/s/, /f/), high-frequency content
- Low ZCR: voiced speech, tonal sounds
- Useful for voice activity detection and speech/music discrimination
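A direct NumPy implementation of the per-frame ZCR formula above:

```python
import numpy as np

def zero_crossing_rate(frame):
    """ZCR = (1/2N) * sum |sign(x[n]) - sign(x[n-1])| over one frame."""
    signs = np.sign(frame)
    return np.sum(np.abs(np.diff(signs))) / (2 * len(frame))
```

A fully alternating frame gives a ZCR near its maximum of 0.5; a constant-sign frame gives 0.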
Short-Time Energy
Sum of squared sample values in a frame:
E = sum_{n=0}^{N-1} x[n]^2
Distinguishes silence from active signal. Energy spans a wide dynamic range across signal types, so it is often log-compressed before use.
Root Mean Square (RMS)
Square root of mean squared amplitude -- measures frame-level loudness:
RMS = sqrt((1/N) * sum_{n=0}^{N-1} x[n]^2)
More perceptually meaningful than raw energy. Often converted to dB: 20*log10(RMS/ref).
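Energy, RMS, and the dB conversion above in NumPy (the epsilon guard against log of zero is an implementation detail, not part of the definitions):

```python
import numpy as np

def short_time_energy(frame):
    """E = sum x[n]^2 over the frame."""
    return np.sum(frame ** 2)

def rms(frame):
    """RMS = sqrt(mean(x[n]^2))."""
    return np.sqrt(np.mean(frame ** 2))

def rms_db(frame, ref=1.0, eps=1e-12):
    """20 * log10(RMS / ref), guarded against silent frames."""
    return 20 * np.log10(np.maximum(rms(frame), eps) / ref)
```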
Autocorrelation
Measures self-similarity at lag tau. Peaks at multiples of the fundamental period, making it useful for pitch detection:
R(tau) = sum_{n=0}^{N-1-tau} x[n] * x[n + tau]
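A sketch of autocorrelation-based pitch estimation using the formula above (the search range in Hz is an illustrative choice for speech):

```python
import numpy as np

def autocorr_pitch(frame, sr, fmin=50.0, fmax=500.0):
    """Estimate f0 by locating the autocorrelation peak within a lag range."""
    # Full autocorrelation; index len(frame)-1 corresponds to lag 0.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(r[lo:hi])
    return sr / lag
```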
Frequency-Domain Features
Short-Time Fourier Transform (STFT)
Apply DFT to each windowed frame to obtain time-frequency representation:
X(m, k) = sum_{n=0}^{N-1} x[n + m*H] * w[n] * exp(-j*2*pi*k*n/N)
Where m = frame index, k = frequency bin, H = hop size, w[n] = window function.
Window functions trade off frequency resolution vs spectral leakage:
- Rectangular: best frequency resolution, worst leakage
- Hann: good general-purpose compromise
- Hamming: lower first sidelobe than Hann, but slower sidelobe decay
- Blackman: best sidelobe suppression, widest main lobe
Spectrogram: magnitude-squared of STFT displayed as a 2D image (time x frequency). Power spectrograms are typically shown in log scale (dB).
Key trade-off: longer windows give better frequency resolution but worse time resolution (the uncertainty principle). Typical: 25 ms window, 10 ms hop for speech; longer for music.
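The STFT formula above can be implemented directly with NumPy's real FFT (n_fft and hop defaults here are illustrative, roughly matching the speech settings mentioned):

```python
import numpy as np

def stft(x, n_fft=512, hop=160, window=None):
    """X(m, k) = FFT of windowed frames; returns [n_frames x n_fft//2+1] complex."""
    w = np.hanning(n_fft) if window is None else window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[m * hop:m * hop + n_fft] * w for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def power_spectrogram_db(x, **kw):
    """Magnitude-squared STFT in dB, with a small floor to avoid log(0)."""
    S = np.abs(stft(x, **kw)) ** 2
    return 10 * np.log10(S + 1e-10)
```

A 1 kHz tone at 16 kHz sampling lands in bin 1000 * 512 / 16000 = 32.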
Mel Scale and Mel-Spectrogram
The mel scale approximates human pitch perception:
mel(f) = 2595 * log10(1 + f/700)
A mel-spectrogram applies a triangular filterbank (typically 40-128 filters) spaced uniformly on the mel scale to the power spectrum:
- Compute power spectrum |X(k)|^2
- Apply mel filterbank (triangular, overlapping filters)
- Take log of filterbank energies
Mel-spectrograms are the dominant input for modern neural audio models (speech recognition, TTS, music tagging). Typical: 80-128 mel bands for neural TTS; 40-80 for ASR.
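The three steps above can be sketched in NumPy. This is a minimal filterbank construction (bin rounding and normalization details vary between libraries; librosa, for example, area-normalizes its filters by default):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr, fmin=0.0, fmax=None):
    """Triangular filters spaced uniformly in mel; [n_mels x n_fft//2+1]."""
    fmax = fmax or sr / 2
    # n_mels filters need n_mels + 2 edge points.
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fb
```

A log mel-spectrogram is then `np.log(power_spectrum @ fb.T + eps)`.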
Mel-Frequency Cepstral Coefficients (MFCCs)
The standard feature for traditional speech systems:
- Compute power spectrum
- Apply mel filterbank
- Take log of filterbank energies
- Apply DCT (Discrete Cosine Transform)
- Keep first 13 coefficients (discard higher order)
The DCT decorrelates the log mel energies, producing compact features. Typically augmented with delta (first derivative) and delta-delta (second derivative) coefficients for a 39-dimensional feature vector.
- MFCC-0 (or c0): Overall energy
- MFCCs 1-12: Spectral shape (vocal tract characteristics)
- Higher MFCCs: Fine spectral detail (usually discarded)
MFCCs were dominant in GMM-HMM ASR. Neural models increasingly prefer raw mel-spectrograms or learn features directly from waveforms.
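The final DCT step can be written out explicitly (a DCT-II basis built by hand rather than a library call; scaling conventions differ slightly from scipy's `dct`):

```python
import numpy as np

def dct2(x, n_out):
    """Unnormalized DCT-II of a vector; keep the first n_out coefficients."""
    N = len(x)
    n = np.arange(N)
    basis = np.cos(np.pi * np.outer(np.arange(n_out), (2 * n + 1)) / (2 * N))
    return basis @ x

def mfcc_from_logmel(log_mel, n_ceps=13):
    """MFCCs: DCT of log mel filterbank energies, truncated to n_ceps."""
    return dct2(log_mel, n_ceps)

def deltas(c):
    """Simple first difference along time as a delta approximation."""
    return np.diff(c, axis=0, prepend=c[:1])
```

A flat log-mel vector has all its energy in c0, illustrating how the DCT compacts the representation.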
Spectral Centroid
"Center of mass" of the spectrum -- correlates with perceived brightness:
SC = sum_k (f_k * |X(k)|^2) / sum_k |X(k)|^2
Higher centroid = brighter timbre. Useful for music timbre classification.
Spectral Rolloff
The smallest frequency bin k_rolloff below which a specified fraction (typically 85%) of spectral energy is contained:
sum_{k=0}^{k_rolloff} |X(k)|^2 >= 0.85 * sum_{k} |X(k)|^2
Distinguishes harmonic (lower rolloff) from noisy (higher rolloff) content.
Spectral Flux
Measures rate of spectral change between consecutive frames:
SF(m) = sum_k (|X(m,k)| - |X(m-1,k)|)^2
High at onsets and transients. Core feature for onset detection in music.
Spectral Bandwidth and Flatness
- Bandwidth: Weighted standard deviation of frequencies around the centroid
- Flatness: Ratio of geometric mean to arithmetic mean of power spectrum (0 = tonal, 1 = noise)
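The spectral descriptors above, computed from a single frame's power spectrum (the epsilon in the flatness guards the log against zero bins):

```python
import numpy as np

def spectral_descriptors(power, freqs):
    """Centroid, 85% rolloff frequency, and flatness of one power spectrum."""
    total = np.sum(power)
    centroid = np.sum(freqs * power) / total
    cum = np.cumsum(power)
    rolloff = freqs[np.searchsorted(cum, 0.85 * total)]
    # Geometric mean / arithmetic mean: 0 = tonal, 1 = flat (noise-like).
    flatness = np.exp(np.mean(np.log(power + 1e-12))) / np.mean(power)
    return centroid, rolloff, flatness

def spectral_flux(mag_prev, mag_cur):
    """Squared spectral change between consecutive frames."""
    return np.sum((mag_cur - mag_prev) ** 2)
```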
Cepstral Analysis
The cepstrum is the inverse Fourier transform of the log spectrum:
cepstrum = IFFT(log(|FFT(x)|))
It separates the slowly varying spectral envelope (vocal tract) from the rapidly varying excitation (pitch harmonics):
- Low quefrency (cepstral domain analog of time): spectral envelope
- High quefrency: fine harmonic structure, pitch period
This source-filter separation is fundamental to speech analysis. MFCCs are a mel-warped variant of cepstral analysis.
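The real cepstrum per the formula above, in NumPy. For a periodic signal, a peak appears at the quefrency equal to the pitch period in samples:

```python
import numpy as np

def real_cepstrum(x):
    """cepstrum = IFFT(log|FFT(x)|); low quefrency ~ envelope, high ~ pitch."""
    spectrum = np.fft.fft(x)
    return np.real(np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)))
```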
Constant-Q Transform (CQT)
Frequency-domain transform with logarithmically spaced frequency bins:
Frequency ratio between adjacent bins = 2^(1/B) (B = bins per octave)
Properties:
- Q factor (f/bandwidth) is constant across all bins
- Low frequencies get long windows (good frequency resolution)
- High frequencies get short windows (good time resolution)
- Bins align naturally with musical pitches (12 bins/octave = semitones)
Preferred over STFT for music analysis tasks: pitch tracking, chord recognition, music transcription. Computationally more expensive than FFT.
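A full CQT is involved, but its geometry (which the properties above describe) is easy to compute. A sketch, with fmin defaulting to C1 and other parameters illustrative:

```python
import numpy as np

def cqt_geometry(fmin=32.70, n_bins=84, bins_per_octave=12, sr=22050):
    """Center frequencies, constant Q, and per-bin window lengths for a CQT."""
    k = np.arange(n_bins)
    freqs = fmin * 2.0 ** (k / bins_per_octave)        # log-spaced bins
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)   # same for every bin
    win_lengths = np.ceil(Q * sr / freqs).astype(int)  # long at low f, short at high f
    return freqs, Q, win_lengths
```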
Feature Normalization
Raw features have varying scales and distributions. Normalization improves model robustness:
Cepstral Mean Normalization (CMN)
Subtract the mean of each coefficient over an utterance or sliding window:
c_norm[n] = c[n] - mean(c)
Removes convolutional channel effects (microphone, room). Standard in ASR.
Cepstral Mean and Variance Normalization (CMVN)
Additionally normalize each coefficient to unit variance:
c_norm[n] = (c[n] - mean(c)) / std(c)
More aggressive normalization. Can be utterance-level or global.
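Utterance-level CMN and CMVN over a [time x coefficient] feature matrix (the epsilon guards against constant coefficients):

```python
import numpy as np

def cmvn(features, variance=True, eps=1e-8):
    """Subtract per-coefficient mean; optionally divide by per-coefficient std."""
    out = features - features.mean(axis=0)
    if variance:
        out = out / (features.std(axis=0) + eps)
    return out
```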
Per-Channel Energy Normalization (PCEN)
Learned normalization that replaces log compression in mel-spectrograms:
PCEN(E) = (E / (eps + smoothed_E)^alpha + delta)^r - delta^r
Adaptive to local energy, more robust to noise than fixed log compression. Used in sound event detection and keyword spotting.
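A sketch of PCEN with a first-order IIR smoother for smoothed_E (parameter defaults are common starting values, not canonical; in learned PCEN they are trained):

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """PCEN over a [time x mel] energy matrix.

    M is a running low-pass estimate of E; dividing by M^alpha performs
    per-channel automatic gain control before root compression.
    """
    M = np.empty_like(E)
    M[0] = E[0]
    for t in range(1, len(E)):
        M[t] = (1 - s) * M[t - 1] + s * E[t]
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```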
SpecAugment
Data augmentation applied to spectrograms during training (not strictly normalization):
- Time warping: Slight temporal distortion
- Frequency masking: Zero out random frequency bands
- Time masking: Zero out random time steps
Dramatically improves ASR robustness. Standard practice in modern speech models.
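The two masking operations (time warping omitted, as it is rarely the important part in practice) can be sketched as follows; mask counts and maximum widths are illustrative hyperparameters:

```python
import numpy as np

def spec_augment(spec, n_freq_masks=2, F=15, n_time_masks=2, T=40, rng=None):
    """Frequency and time masking on a [freq x time] spectrogram."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, F + 1)                 # mask width in bins
        f0 = rng.integers(0, max(n_freq - f, 1))   # mask start
        out[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = rng.integers(0, T + 1)
        t0 = rng.integers(0, max(n_time - t, 1))
        out[:, t0:t0 + t] = 0.0
    return out
```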
Feature Summary Table
| Feature | Domain | Dimensions | Primary Use |
|---------|--------|------------|-------------|
| ZCR | Time | 1 | VAD, voiced/unvoiced |
| RMS Energy | Time | 1 | VAD, loudness |
| MFCCs | Cepstral | 13 (+ deltas = 39) | Traditional ASR |
| Mel-spectrogram | Frequency | 40-128 bins | Neural ASR, TTS |
| Spectral centroid | Frequency | 1 | Timbre, brightness |
| Spectral flux | Frequency | 1 | Onset detection |
| CQT | Frequency | 12-48/octave | Music pitch analysis |
| Chroma | Frequency | 12 | Chord recognition |
Chroma Features
Project the spectrum onto 12 pitch classes (C, C#, D, ..., B), collapsing octave information. Captures harmonic content independent of octave, making them ideal for chord recognition and cover song identification. Can be computed from STFT or CQT.