Audio Feature Extraction
Overview
Audio features transform raw waveforms into compact, informative representations for downstream tasks (speech recognition, music analysis, sound classification). Features are extracted from short overlapping frames (20-50 ms with 10-25 ms hop).
Raw waveform -> Framing -> Windowing -> Feature Extraction -> [time x feature] matrix
Pre-emphasis (high-pass filter, y[n] = x[n] - alpha*x[n-1], alpha ~ 0.97) is often applied first to boost high-frequency energy and flatten the spectrum.
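The framing and pre-emphasis steps above can be sketched in NumPy (frame length, hop, and alpha are illustrative defaults, not prescribed values):

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames: [n_frames x frame_len]."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]
```

At a 16 kHz sample rate, a 25 ms window with a 10 ms hop corresponds to `frame_signal(x, 400, 160)`.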
Time-Domain Features
Computed directly from the waveform samples without frequency transformation.
Zero-Crossing Rate (ZCR)
Number of times the signal crosses zero per frame:
ZCR = (1/2N) * sum_{n=1}^{N-1} |sign(x[n]) - sign(x[n-1])|
- High ZCR: noise, unvoiced speech (/s/, /f/), high-frequency content
- Low ZCR: voiced speech, tonal sounds
- Useful for voice activity detection and speech/music discrimination
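A direct NumPy implementation of the per-frame ZCR formula above:

```python
import numpy as np

def zero_crossing_rate(frame):
    """ZCR = (1/2N) * sum |sign(x[n]) - sign(x[n-1])| over one frame."""
    signs = np.sign(frame)
    return np.sum(np.abs(np.diff(signs))) / (2 * len(frame))
```

A fully alternating frame gives a ZCR near its maximum of 0.5; a constant-sign frame gives 0.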
Short-Time Energy
Sum of squared sample values in a frame:
E = sum_{n=0}^{N-1} x[n]^2
Distinguishes silence from active signal. Energy spans a wide dynamic range across signal types, so it is often log-compressed before use.
Root Mean Square (RMS)
Square root of mean squared amplitude -- measures frame-level loudness:
RMS = sqrt((1/N) * sum_{n=0}^{N-1} x[n]^2)
More perceptually meaningful than raw energy. Often converted to dB: 20*log10(RMS/ref).
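Energy, RMS, and the dB conversion above in NumPy (the epsilon guard against log of zero is an implementation detail, not part of the definitions):

```python
import numpy as np

def short_time_energy(frame):
    """E = sum x[n]^2 over the frame."""
    return np.sum(frame ** 2)

def rms(frame):
    """RMS = sqrt(mean(x[n]^2))."""
    return np.sqrt(np.mean(frame ** 2))

def rms_db(frame, ref=1.0, eps=1e-12):
    """20 * log10(RMS / ref), guarded against silent frames."""
    return 20 * np.log10(np.maximum(rms(frame), eps) / ref)
```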
Autocorrelation
Measures self-similarity at lag tau. Peaks at multiples of the fundamental period, making it useful for pitch detection:
R(tau) = sum_{n=0}^{N-1-tau} x[n] * x[n + tau]
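A sketch of autocorrelation-based pitch estimation using the formula above (the search range in Hz is an illustrative choice for speech):

```python
import numpy as np

def autocorr_pitch(frame, sr, fmin=50.0, fmax=500.0):
    """Estimate f0 by locating the autocorrelation peak within a lag range."""
    # Full autocorrelation; index len(frame)-1 corresponds to lag 0.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(r[lo:hi])
    return sr / lag
```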
Frequency-Domain Features
Short-Time Fourier Transform (STFT)
Apply DFT to each windowed frame to obtain time-frequency representation:
X(m, k) = sum_{n=0}^{N-1} x[n + m*H] * w[n] * exp(-j*2*pi*k*n/N)
Where m = frame index, k = frequency bin, H = hop size, w[n] = window function.
Window functions trade off frequency resolution vs spectral leakage:
- Rectangular: best frequency resolution, worst leakage
- Hann: good general-purpose compromise
- Hamming: lower first sidelobe than Hann, but slower sidelobe decay
- Blackman: best sidelobe suppression, widest main lobe
Spectrogram: magnitude-squared of STFT displayed as a 2D image (time x frequency). Power spectrograms are typically shown in log scale (dB).
Key trade-off: longer windows give better frequency resolution but worse time resolution (the uncertainty principle). Typical: 25 ms window, 10 ms hop for speech; longer for music.
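The STFT formula above can be implemented directly with NumPy's real FFT (n_fft and hop defaults here are illustrative, roughly matching the speech settings mentioned):

```python
import numpy as np

def stft(x, n_fft=512, hop=160, window=None):
    """X(m, k) = FFT of windowed frames; returns [n_frames x n_fft//2+1] complex."""
    w = np.hanning(n_fft) if window is None else window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[m * hop:m * hop + n_fft] * w for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def power_spectrogram_db(x, **kw):
    """Magnitude-squared STFT in dB, with a small floor to avoid log(0)."""
    S = np.abs(stft(x, **kw)) ** 2
    return 10 * np.log10(S + 1e-10)
```

A 1 kHz tone at 16 kHz sampling lands in bin 1000 * 512 / 16000 = 32.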
Mel Scale and Mel-Spectrogram
The mel scale approximates human pitch perception:
mel(f) = 2595 * log10(1 + f/700)
A mel-spectrogram applies a triangular filterbank (typically 40-128 filters) spaced uniformly on the mel scale to the power spectrum:
- Compute power spectrum |X(k)|^2
- Apply mel filterbank (triangular, overlapping filters)
- Take log of filterbank energies
Mel-spectrograms are the dominant input for modern neural audio models (speech recognition, TTS, music tagging). Typical: 80-128 mel bands for neural TTS; 40-80 for ASR.
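The three steps above can be sketched in NumPy. This is a minimal filterbank construction (bin rounding and normalization details vary between libraries; librosa, for example, area-normalizes its filters by default):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr, fmin=0.0, fmax=None):
    """Triangular filters spaced uniformly in mel; [n_mels x n_fft//2+1]."""
    fmax = fmax or sr / 2
    # n_mels filters need n_mels + 2 edge points.
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fb
```

A log mel-spectrogram is then `np.log(power_spectrum @ fb.T + eps)`.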
Mel-Frequency Cepstral Coefficients (MFCCs)
The standard feature for traditional speech systems:
- Compute power spectrum
- Apply mel filterbank
- Take log of filterbank energies
- Apply DCT (Discrete Cosine Transform)
- Keep first 13 coefficients (discard higher order)
The DCT decorrelates the log mel energies, producing compact features. Typically augmented with delta (first derivative) and delta-delta (second derivative) coefficients for a 39-dimensional feature vector.
- MFCC-0 (or c0): Overall energy
- MFCCs 1-12: Spectral shape (vocal tract characteristics)
- Higher MFCCs: Fine spectral detail (usually discarded)
MFCCs were dominant in GMM-HMM ASR. Neural models increasingly prefer raw mel-spectrograms or learn features directly from waveforms.
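The final DCT step can be written out explicitly (a DCT-II basis built by hand rather than a library call; scaling conventions differ slightly from scipy's `dct`):

```python
import numpy as np

def dct2(x, n_out):
    """Unnormalized DCT-II of a vector; keep the first n_out coefficients."""
    N = len(x)
    n = np.arange(N)
    basis = np.cos(np.pi * np.outer(np.arange(n_out), (2 * n + 1)) / (2 * N))
    return basis @ x

def mfcc_from_logmel(log_mel, n_ceps=13):
    """MFCCs: DCT of log mel filterbank energies, truncated to n_ceps."""
    return dct2(log_mel, n_ceps)

def deltas(c):
    """Simple first difference along time as a delta approximation."""
    return np.diff(c, axis=0, prepend=c[:1])
```

A flat log-mel vector has all its energy in c0, illustrating how the DCT compacts the representation.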
Spectral Centroid
"Center of mass" of the spectrum -- correlates with perceived brightness:
SC = sum_k (f_k * |X(k)|^2) / sum_k |X(k)|^2
Higher centroid = brighter timbre. Useful for music timbre classification.
Spectral Rolloff
The smallest frequency bin k_rolloff below which a specified fraction (typically 85%) of spectral energy is contained:
sum_{k=0}^{k_rolloff} |X(k)|^2 >= 0.85 * sum_{k} |X(k)|^2
Distinguishes harmonic (lower rolloff) from noisy (higher rolloff) content.
Spectral Flux
Measures rate of spectral change between consecutive frames:
SF(m) = sum_k (|X(m,k)| - |X(m-1,k)|)^2
High at onsets and transients. Core feature for onset detection in music.
Spectral Bandwidth and Flatness
- Bandwidth: Weighted standard deviation of frequencies around the centroid
- Flatness: Ratio of geometric mean to arithmetic mean of power spectrum (0 = tonal, 1 = noise)
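The spectral descriptors above, computed from a single frame's power spectrum (the epsilon in the flatness guards the log against zero bins):

```python
import numpy as np

def spectral_descriptors(power, freqs):
    """Centroid, 85% rolloff frequency, and flatness of one power spectrum."""
    total = np.sum(power)
    centroid = np.sum(freqs * power) / total
    cum = np.cumsum(power)
    rolloff = freqs[np.searchsorted(cum, 0.85 * total)]
    # Geometric mean / arithmetic mean: 0 = tonal, 1 = flat (noise-like).
    flatness = np.exp(np.mean(np.log(power + 1e-12))) / np.mean(power)
    return centroid, rolloff, flatness

def spectral_flux(mag_prev, mag_cur):
    """Squared spectral change between consecutive frames."""
    return np.sum((mag_cur - mag_prev) ** 2)
```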
Cepstral Analysis
The cepstrum is the inverse Fourier transform of the log spectrum:
cepstrum = IFFT(log(|FFT(x)|))
It separates the slowly varying spectral envelope (vocal tract) from the rapidly varying excitation (pitch harmonics):
- Low quefrency (cepstral domain analog of time): spectral envelope
- High quefrency: fine harmonic structure, pitch period
This source-filter separation is fundamental to speech analysis. MFCCs are a mel-warped variant of cepstral analysis.
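The real cepstrum per the formula above, in NumPy. For a periodic signal, a peak appears at the quefrency equal to the pitch period in samples:

```python
import numpy as np

def real_cepstrum(x):
    """cepstrum = IFFT(log|FFT(x)|); low quefrency ~ envelope, high ~ pitch."""
    spectrum = np.fft.fft(x)
    return np.real(np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)))
```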
Constant-Q Transform (CQT)
Frequency-domain transform with logarithmically spaced frequency bins:
Frequency ratio between adjacent bins = 2^(1/B) (B = bins per octave)
Properties:
- Q factor (f/bandwidth) is constant across all bins
- Low frequencies get long windows (good frequency resolution)
- High frequencies get short windows (good time resolution)
- Bins align naturally with musical pitches (12 bins/octave = semitones)
Preferred over STFT for music analysis tasks: pitch tracking, chord recognition, music transcription. Computationally more expensive than FFT.
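A full CQT is involved, but its geometry (which the properties above describe) is easy to compute. A sketch, with fmin defaulting to C1 and other parameters illustrative:

```python
import numpy as np

def cqt_geometry(fmin=32.70, n_bins=84, bins_per_octave=12, sr=22050):
    """Center frequencies, constant Q, and per-bin window lengths for a CQT."""
    k = np.arange(n_bins)
    freqs = fmin * 2.0 ** (k / bins_per_octave)        # log-spaced bins
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)   # same for every bin
    win_lengths = np.ceil(Q * sr / freqs).astype(int)  # long at low f, short at high f
    return freqs, Q, win_lengths
```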
Feature Normalization
Raw features have varying scales and distributions. Normalization improves model robustness:
Cepstral Mean Normalization (CMN)
Subtract the mean of each coefficient over an utterance or sliding window:
c_norm[n] = c[n] - mean(c)
Removes convolutional channel effects (microphone, room). Standard in ASR.
Cepstral Mean and Variance Normalization (CMVN)
Additionally normalize each coefficient to unit variance:
c_norm[n] = (c[n] - mean(c)) / std(c)
More aggressive normalization. Can be utterance-level or global.
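Utterance-level CMN and CMVN over a [time x coefficient] feature matrix (the epsilon guards against constant coefficients):

```python
import numpy as np

def cmvn(features, variance=True, eps=1e-8):
    """Subtract per-coefficient mean; optionally divide by per-coefficient std."""
    out = features - features.mean(axis=0)
    if variance:
        out = out / (features.std(axis=0) + eps)
    return out
```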
Per-Channel Energy Normalization (PCEN)
Learned normalization that replaces log compression in mel-spectrograms:
PCEN(E) = (E / (eps + smoothed_E)^alpha + delta)^r - delta^r
Adaptive to local energy, more robust to noise than fixed log compression. Used in sound event detection and keyword spotting.
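A sketch of PCEN with a first-order IIR smoother for smoothed_E (parameter defaults are common starting values, not canonical; in learned PCEN they are trained):

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """PCEN over a [time x mel] energy matrix.

    M is a running low-pass estimate of E; dividing by M^alpha performs
    per-channel automatic gain control before root compression.
    """
    M = np.empty_like(E)
    M[0] = E[0]
    for t in range(1, len(E)):
        M[t] = (1 - s) * M[t - 1] + s * E[t]
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```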
SpecAugment
Data augmentation applied to spectrograms during training (not strictly normalization):
- Time warping: Slight temporal distortion
- Frequency masking: Zero out random frequency bands
- Time masking: Zero out random time steps
Dramatically improves ASR robustness. Standard practice in modern speech models.
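The two masking operations (time warping omitted, as it is rarely the important part in practice) can be sketched as follows; mask counts and maximum widths are illustrative hyperparameters:

```python
import numpy as np

def spec_augment(spec, n_freq_masks=2, F=15, n_time_masks=2, T=40, rng=None):
    """Frequency and time masking on a [freq x time] spectrogram."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, F + 1)                 # mask width in bins
        f0 = rng.integers(0, max(n_freq - f, 1))   # mask start
        out[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = rng.integers(0, T + 1)
        t0 = rng.integers(0, max(n_time - t, 1))
        out[:, t0:t0 + t] = 0.0
    return out
```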
Feature Summary Table
| Feature | Domain | Dimensions | Primary Use |
|---------|--------|------------|-------------|
| ZCR | Time | 1 | VAD, voiced/unvoiced |
| RMS Energy | Time | 1 | VAD, loudness |
| MFCCs | Cepstral | 13 (+ deltas = 39) | Traditional ASR |
| Mel-spectrogram | Frequency | 40-128 bins | Neural ASR, TTS |
| Spectral centroid | Frequency | 1 | Timbre, brightness |
| Spectral flux | Frequency | 1 | Onset detection |
| CQT | Frequency | 12-48/octave | Music pitch analysis |
| Chroma | Frequency | 12 | Chord recognition |
Chroma Features
Project the spectrum onto 12 pitch classes (C, C#, D, ..., B), collapsing octave information. Captures harmonic content independent of octave, making them ideal for chord recognition and cover song identification. Can be computed from STFT or CQT.