
Music Information Retrieval (MIR)

Overview

Music Information Retrieval applies signal processing and machine learning to extract meaningful information from music signals. Tasks span rhythm analysis, pitch and harmony, structure, similarity, and generation.


Rhythm Analysis

Onset Detection

Identify the start times of musical events (notes, drum hits):

  1. Compute a detection function that peaks at onsets:
    • Spectral flux: Sum of positive differences in magnitude spectrum between frames
    • High-frequency content: Weighted spectral energy emphasizing high frequencies
    • Complex-domain: Uses both magnitude and phase deviations
    • Superflux: Spectral flux with vibrato suppression via max-filtering
  2. Peak picking: Find local maxima above adaptive threshold
  3. Post-processing: Enforce minimum inter-onset interval

Neural onset detectors (CNN/RNN on mel-spectrograms) now outperform signal-processing methods, especially for soft onsets and polyphonic music.
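A minimal numpy sketch of the classical pipeline above (spectral flux, adaptive threshold, minimum inter-onset spacing); the frame/hop sizes, smoothing kernel, and threshold offset are illustrative choices, not canonical values:

```python
import numpy as np

def spectral_flux_onsets(x, frame_len=1024, hop=512, delta=0.1, min_gap=2):
    """Classical onset detection: spectral flux -> peak picking -> min spacing."""
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hanning(frame_len)
    mags = np.array([np.abs(np.fft.rfft(win * x[i*hop : i*hop + frame_len]))
                     for i in range(n_frames)])
    # Spectral flux: sum of positive magnitude increases between frames
    diff = np.diff(mags, axis=0)
    flux = np.sum(np.maximum(diff, 0.0), axis=1)
    flux = np.concatenate([[0.0], flux])          # align with frame index
    # Adaptive threshold: local mean plus a fixed fraction of the global max
    thresh = np.convolve(flux, np.ones(5) / 5, mode="same") + delta * flux.max()
    # Peak picking: local maxima above the threshold
    peaks = [i for i in range(1, len(flux) - 1)
             if flux[i] > flux[i - 1] and flux[i] >= flux[i + 1]
             and flux[i] > thresh[i]]
    # Enforce a minimum inter-onset interval (in frames)
    onsets = []
    for p in peaks:
        if not onsets or p - onsets[-1] >= min_gap:
            onsets.append(p)
    return np.array(onsets)
```

Run on a signal with two tone bursts, this returns one onset frame per burst.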

Beat Tracking

Estimate the regular pulse (beat positions) of music:

  • Compute onset detection function
  • Autocorrelation or comb filter analysis to estimate tempo period
  • Dynamic programming to select beat positions maximizing:
    • Alignment with onset strength
    • Regularity of inter-beat intervals
  • Predominant local pulse (PLP): handles locally varying tempo and expressive timing

State-of-the-art: TCN (Temporal Convolutional Network) beat trackers process spectrograms and output beat activation functions, followed by dynamic Bayesian networks or Viterbi decoding for final beat positions.
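The dynamic-programming step can be sketched in the style of Ellis's 2007 beat tracker; the `tightness` weight and the half-to-double-period search window are common but arbitrary choices:

```python
import numpy as np

def dp_beat_track(onset_env, period, tightness=100.0):
    """Select beat frames maximizing onset strength plus inter-beat
    regularity, given an estimated tempo period (in frames)."""
    n = len(onset_env)
    score = onset_env.astype(float).copy()
    backlink = np.full(n, -1)
    for i in range(n):
        # Consider predecessors roughly half to twice the tempo period back
        prev = np.arange(max(i - 2 * period, 0), max(i - period // 2, 0))
        if len(prev) == 0:
            continue
        # Penalize deviation of the inter-beat interval from the ideal period
        txcost = -tightness * (np.log((i - prev) / period) ** 2)
        candidates = score[prev] + txcost
        best = np.argmax(candidates)
        if candidates[best] > 0:
            score[i] += candidates[best]
            backlink[i] = prev[best]
    # Backtrace from the best-scoring frame
    beats = [int(np.argmax(score))]
    while backlink[beats[-1]] >= 0:
        beats.append(int(backlink[beats[-1]]))
    return beats[::-1]
```

On a perfectly regular onset envelope it recovers the pulse exactly; on real music the log-interval penalty is what keeps the path from drifting.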

Tempo Estimation

Estimate beats per minute (BPM):

  • Periodicity analysis of onset/beat activation functions
  • Report one or two tempo estimates (to handle octave ambiguity: 60 vs 120 BPM)
  • Global tempo vs local tempo tracking for music with tempo changes
  • Accuracy metrics: Accuracy1 allows 4% tolerance; Accuracy2 additionally forgives metrical-level errors (2x, 3x, 1/2, 1/3 of the reference tempo)
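A toy autocorrelation tempo estimator plus an Accuracy1 check; the BPM range and envelope frame rate are illustrative parameters:

```python
import numpy as np

def estimate_tempo(onset_env, fps, bpm_range=(60, 200)):
    """Estimate a global tempo by autocorrelating an onset envelope
    sampled at fps frames per second."""
    env = onset_env - onset_env.mean()
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]  # lags 0..N-1
    # Convert the candidate BPM range to a lag range (frames per beat)
    min_lag = int(round(fps * 60.0 / bpm_range[1]))
    max_lag = int(round(fps * 60.0 / bpm_range[0]))
    lag = min_lag + int(np.argmax(ac[min_lag:max_lag + 1]))
    return 60.0 * fps / lag

def accuracy1(est_bpm, ref_bpm, tol=0.04):
    """Accuracy1: estimate within 4% of the reference tempo."""
    return abs(est_bpm - ref_bpm) <= tol * ref_bpm
```

Note that an octave error (60 vs 120 BPM) fails Accuracy1 but would be forgiven by Accuracy2.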

Pitch Tracking

Estimate the fundamental frequency (F0) of a signal over time.

YIN Algorithm

Autocorrelation-based pitch detector (de Cheveigné & Kawahara, 2002):

  1. Compute difference function: d(tau) = sum_n (x[n] - x[n+tau])^2
  2. Cumulative mean normalization: d'(tau) = d(tau) / ((1/tau) * sum_{j=1..tau} d(j)), with d'(0) = 1
  3. Absolute threshold: Find first tau where d'(tau) < threshold (typically 0.1-0.15)
  4. Parabolic interpolation: Refine estimate for sub-sample accuracy

Fast, accurate for monophonic signals. Struggles with polyphonic music.
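A single-frame YIN sketch in numpy following the four steps above; the frame length, search range, and 0.15 threshold are typical but adjustable settings:

```python
import numpy as np

def yin_f0(x, sr, fmin=80.0, fmax=800.0, threshold=0.15):
    """Single-frame YIN: difference function, cumulative mean
    normalization, absolute threshold, parabolic interpolation."""
    tau_min = int(sr / fmax)
    tau_max = int(sr / fmin)
    # 1. Difference function d(tau) over a fixed integration window
    win = len(x) - tau_max
    d = np.array([np.sum((x[:win] - x[tau:tau + win]) ** 2)
                  for tau in range(tau_max + 1)])
    # 2. Cumulative mean normalized difference d'(tau)
    dprime = np.ones_like(d)
    cumsum = np.cumsum(d[1:])
    dprime[1:] = d[1:] * np.arange(1, len(d)) / np.maximum(cumsum, 1e-12)
    # 3. Absolute threshold: first tau where d'(tau) dips below it
    below = np.where(dprime[tau_min:tau_max] < threshold)[0]
    if len(below) == 0:
        return None  # unvoiced / no reliable pitch
    tau = tau_min + below[0]
    while tau + 1 < tau_max and dprime[tau + 1] < dprime[tau]:
        tau += 1  # walk down to the local minimum of this dip
    # 4. Parabolic interpolation for sub-sample accuracy
    a, b, c = dprime[tau - 1], dprime[tau], dprime[tau + 1]
    shift = 0.5 * (a - c) / (a - 2 * b + c)
    return sr / (tau + shift)
```

On a clean sine the interpolated estimate lands within a few cents of the true pitch even though the period is not an integer number of samples.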

pYIN

Probabilistic extension of YIN:

  • Multiple pitch candidates per frame with probabilities
  • HMM decoding for smooth pitch trajectory
  • Jointly estimates voiced/unvoiced decisions
  • Handles uncertain regions much better than hard-threshold YIN

CREPE (2018)

CNN-based pitch tracker trained on synthesized data:

  • Input: 1024-sample raw audio frame (64 ms at 16 kHz)
  • Architecture: 6-layer CNN
  • Output: 360-bin pitch salience (roughly 32.7 Hz to 2 kHz, 20 cents per bin)
  • Weighted average of activations gives continuous F0 estimate

Significantly more robust than signal-processing methods, especially in noise. Variants: Tiny CREPE (smaller model), FCNF0 (fully convolutional).
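The weighted-average decoding step can be illustrated as follows; the 20-cent grid starting near C1 and the local window size are simplifications of CREPE's actual mapping:

```python
import numpy as np

F_REF = 10.0  # reference frequency for the cents scale
# 360 bins, 20 cents apart, starting near C1 (~32.7 Hz); approximate grid
CENTS = 1200.0 * np.log2(32.70 / F_REF) + 20.0 * np.arange(360)

def decode_f0(salience, window=4):
    """Salience-weighted average of cents in a window around the peak
    bin, converted back to Hz (CREPE-style local decoding)."""
    peak = int(np.argmax(salience))
    lo, hi = max(0, peak - window), min(len(salience), peak + window + 1)
    w = salience[lo:hi]
    cents = np.sum(w * CENTS[lo:hi]) / np.sum(w)
    return F_REF * 2.0 ** (cents / 1200.0)
```

Averaging in cents rather than taking the argmax bin is what turns the 20-cent quantized output into a continuous F0 estimate.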

Multi-Pitch Estimation

Detect multiple simultaneous pitches in polyphonic music:

  • Iterative subtraction approaches
  • Non-negative matrix factorization (NMF) of spectrograms
  • Deep learning: Multi-F0 estimation networks (e.g., DeepSalience, Basic Pitch)
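A bare-bones NMF with Euclidean multiplicative updates (Lee & Seung); real multi-pitch systems additionally constrain the columns of W to harmonic templates, which this sketch omits:

```python
import numpy as np

def nmf(V, n_components, n_iter=200, eps=1e-9):
    """Multiplicative-update NMF: V (freq x time) ~= W @ H, where W holds
    spectral templates and H their time-varying activations."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, n_components)) + eps
    H = rng.random((n_components, T)) + eps
    for _ in range(n_iter):
        # Euclidean-distance multiplicative updates keep W, H non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

For pitch estimation, each column of W would model one note's harmonic spectrum and the corresponding row of H indicates when that note is active.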

Chord Recognition

Identify the chord (harmonic label) at each time point.

Pipeline

Audio -> CQT/Chroma features -> Chord classifier -> Chord sequence -> Smoothing
  1. Features: Chroma vectors (12-dim pitch class profiles) from CQT or harmonic CQT
  2. Classification: CNN/CRNN on chromagram or CQT spectrogram
  3. Temporal smoothing: CRF, HMM, or Viterbi decoding for consistent chord sequences
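Step 1 (chroma from a magnitude spectrum) reduces to folding frequency bins onto 12 pitch classes; the frequency limits here are illustrative:

```python
import numpy as np

def chroma_from_spectrum(mags, freqs, fmin=60.0, fmax=4000.0):
    """Fold one STFT magnitude frame into a 12-dim pitch-class vector."""
    chroma = np.zeros(12)
    for mag, f in zip(mags, freqs):
        if fmin <= f <= fmax:
            # MIDI pitch from frequency, then pitch class modulo 12
            midi = 69 + 12 * np.log2(f / 440.0)
            chroma[int(round(midi)) % 12] += mag
    total = chroma.sum()
    return chroma / total if total > 0 else chroma
```

A C major triad (C, E, G partials) concentrates energy in pitch classes 0, 4, and 7, which is exactly the pattern a chord classifier learns to recognize.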

Chord Vocabulary

  • Major/minor triads: 24 chords + "no chord" (simplest vocabulary)
  • Sevenths: Adding major 7, minor 7, and dominant 7 qualities expands the vocabulary to 60+ labels
  • Full vocabulary: Extensions, inversions, slash chords

State-of-the-art uses CRNNs (CNN encoder + BiLSTM + softmax) on CQT representations. Self-supervised pre-training and large-scale training data continue to improve accuracy.


Music Transcription (AMT)

Automatic Music Transcription converts audio to symbolic notation (MIDI, sheet music).

Piano Transcription

The most studied AMT task:

  • Onsets and Frames (Google, 2018): Dual-head CNN predicts onset events and sustained frames jointly; onset predictions gate frame predictions
  • High-resolution piano transcription: Predict onset, offset, velocity per note
  • Input: Mel-spectrogram or CQT
  • Output: Piano roll (time x 88 keys) converted to MIDI
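Converting a binary piano roll into note events is a matter of finding rising and falling edges per key; the frame duration below is an assumed hop size, not a standard:

```python
import numpy as np

def roll_to_notes(roll, frame_time=0.032):
    """Convert a binary piano roll (frames x 88 keys) into
    (midi_pitch, onset_sec, offset_sec) note events."""
    notes = []
    # Pad with silent frames so edges at the boundaries are detected
    pad = np.zeros((1, roll.shape[1]))
    padded = np.vstack([pad, roll, pad])
    for key in range(roll.shape[1]):
        changes = np.diff(padded[:, key])
        onsets = np.where(changes == 1)[0]    # rising edges
        offsets = np.where(changes == -1)[0]  # falling edges
        for on, off in zip(onsets, offsets):
            notes.append((key + 21, on * frame_time, off * frame_time))  # MIDI 21 = A0
    return sorted(notes, key=lambda n: n[1])
```

The resulting (pitch, onset, offset) triples map directly onto MIDI note-on/note-off events.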

Multi-Instrument Transcription

Far more challenging due to timbral overlap:

  • Instrument-specific models or multi-task approaches
  • Source separation as pre-processing step
  • MT3 (Google): Transformer-based model transcribing arbitrary instruments using sequence-to-sequence with spectrogram input and MIDI-like token output

Source Separation

Isolate individual sources (vocals, drums, bass, other) from a mixture.

Classical Approaches

  • NMF (Non-negative Matrix Factorization): Decompose spectrogram into basis and activation
  • RPCA (Robust PCA): Separate low-rank (accompaniment) from sparse (vocals) components
  • Ideal ratio mask: Oracle mask applied to mixture STFT (upper bound reference)
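The ideal ratio mask is easy to state in code; this toy example uses magnitude spectrograms and ignores phase, which is why the IRM is only an oracle upper bound rather than a practical separator:

```python
import numpy as np

def ideal_ratio_mask(source_mags, eps=1e-8):
    """Oracle IRM per source: |S_i| / sum_j |S_j|, to be applied
    elementwise to the mixture spectrogram."""
    total = sum(source_mags) + eps
    return [m / total for m in source_mags]

# Toy usage with single-frame "spectrograms" (3 frequency bins)
vocals = np.array([3.0, 0.0, 1.0])
accomp = np.array([1.0, 2.0, 1.0])
mix = vocals + accomp  # magnitudes only add approximately in practice
masks = ideal_ratio_mask([vocals, accomp])
vocal_est = masks[0] * mix
```

Because the masks sum to one per bin, the masked estimates always sum back to the mixture.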

Deep Learning

  • U-Net architectures: Encoder-decoder with skip connections on spectrograms
  • Open-Unmix: Reference open-source system using BiLSTM on magnitude spectrograms
  • Conv-TasNet: Time-domain separation using learned encoder/decoder and TCN mask estimator

Demucs (Meta)

State-of-the-art music source separation:

  • Hybrid architecture: Processes both waveform (temporal) and spectrogram (spectral) domains
  • Dual-path processing: Temporal encoder-decoder + spectral encoder-decoder with cross-attention
  • Outputs: 4 stems (vocals, drums, bass, other) or 6 stems
  • HTDemucs: Hybrid Transformer version with cross-domain attention
  • Trained on a large internal dataset in addition to MUSDB18

SDR (Signal-to-Distortion Ratio) is the primary evaluation metric. Demucs variants reach roughly 9 dB SDR on vocals separation on MUSDB18.


Music Generation

Symbolic Generation

Generate MIDI or other symbolic representations:

  • Music Transformer: Self-attention with relative positional encoding over MIDI events
  • MuseNet (OpenAI): GPT-style model generating multi-instrument MIDI

Audio Generation

Generate raw audio or neural codec tokens:

MusicGen (Meta, 2023):

  • Autoregressive Transformer over EnCodec discrete audio tokens
  • Codebook interleaving patterns avoid the need for multiple model passes
  • Conditioned on text descriptions, melody, or both
  • Single-stage model (no separate planning/refinement)

MusicLM (Google):

  • Hierarchical generation: semantic tokens -> acoustic tokens -> audio
  • Text and melody conditioning via MuLan and w2v-BERT embeddings

Stable Audio / Riffusion: Diffusion-based music generation from text prompts.

Evaluation

  • FAD (Fréchet Audio Distance): Compare distributions of generated vs real audio embeddings
  • KL divergence on audio classifier outputs
  • Human evaluation remains essential (musicality, coherence, text adherence)

Audio Fingerprinting

Identify a recording from a short audio excerpt, robust to noise and distortion.

Shazam Algorithm (Wang, 2003)

  1. Compute spectrogram, find spectral peaks (time-frequency landmarks)
  2. Form landmark pairs: Combine nearby peaks into (f1, f2, delta_t) tuples
  3. Hash each landmark: Compact representation for database lookup
  4. Search: Match query hashes against database, find time-coherent cluster of matches
  5. Offset histogram determines the matching song and time position

Properties: Robust to noise, reverberation, and encoding. Works with 3-5 second queries. Database can hold millions of tracks with sub-second lookup.
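A simplified version of the landmark-pair hashing and offset-histogram search; the bit widths, fan-out, and time limits are illustrative, not Shazam's actual parameters:

```python
import numpy as np

def landmark_hashes(peaks, fan_out=5, max_dt=63):
    """Pair each spectral peak (t, f) with a few later peaks and hash
    the (f1, f2, delta_t) triple into a compact integer key."""
    peaks = sorted(peaks)  # sort by time
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            dt = t2 - t1
            if 0 < dt <= max_dt:
                # Pack f1 (upper bits), f2 (10 bits), dt (6 bits) into one int
                key = (f1 << 16) | (f2 << 6) | dt
                hashes.append((key, t1))  # keep anchor time for offset voting
    return hashes

def match_offset(query_hashes, db_hashes):
    """Vote on time offsets between matching hashes; a sharp histogram
    peak gives the query's position within the database track."""
    index = {}
    for key, t in db_hashes:
        index.setdefault(key, []).append(t)
    votes = {}
    for key, tq in query_hashes:
        for tdb in index.get(key, []):
            off = tdb - tq
            votes[off] = votes.get(off, 0) + 1
    return max(votes, key=votes.get) if votes else None
```

A query whose peaks are a time-shifted subset of a database track votes overwhelmingly for a single offset, which is what makes the lookup robust to missing or spurious peaks.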

Neural Fingerprinting

  • Learn embeddings that are invariant to distortions but discriminative across songs
  • Contrastive learning on augmented audio pairs
  • Neural Audio Fingerprint (2021): CNN encoder produces compact embeddings
  • Can handle more severe distortions than landmark-based methods
  • Trade-off: Higher computational cost, requires GPU for indexing

Other MIR Tasks

  • Key detection: Estimate musical key (C major, A minor, etc.) from chroma profiles
  • Structure segmentation: Identify verse, chorus, bridge sections
  • Music tagging: Genre, mood, instrument classification from audio
  • Cover song detection: Identify different performances of the same composition
  • Music recommendation: Content-based similarity using learned audio embeddings
  • Lyrics alignment: Synchronize lyrics text with audio timestamps