
Music Information Retrieval (MIR)

Overview

Music Information Retrieval applies signal processing and machine learning to extract meaningful information from music signals. Tasks span rhythm analysis, pitch and harmony, structure, similarity, and generation.


Rhythm Analysis

Onset Detection

Identify the start times of musical events (notes, drum hits):

  1. Compute a detection function that peaks at onsets:
    • Spectral flux: Sum of positive differences in magnitude spectrum between frames
    • High-frequency content: Weighted spectral energy emphasizing high frequencies
    • Complex-domain: Uses both magnitude and phase deviations
    • Superflux: Spectral flux with vibrato suppression via max-filtering
  2. Peak picking: Find local maxima above adaptive threshold
  3. Post-processing: Enforce minimum inter-onset interval

Neural onset detectors (CNN/RNN on mel-spectrograms) now outperform signal-processing methods, especially for soft onsets and polyphonic music.
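A minimal numpy sketch of the classical pipeline above (spectral flux, adaptive threshold, minimum inter-onset spacing); the frame/hop sizes, smoothing kernel, and threshold offset are illustrative choices, not canonical values:

```python
import numpy as np

def spectral_flux_onsets(x, frame_len=1024, hop=512, delta=0.1, min_gap=2):
    """Classical onset detection: spectral flux -> peak picking -> min spacing."""
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hanning(frame_len)
    mags = np.array([np.abs(np.fft.rfft(win * x[i*hop : i*hop + frame_len]))
                     for i in range(n_frames)])
    # Spectral flux: sum of positive magnitude increases between frames
    diff = np.diff(mags, axis=0)
    flux = np.sum(np.maximum(diff, 0.0), axis=1)
    flux = np.concatenate([[0.0], flux])          # align with frame index
    # Adaptive threshold: local mean plus a fixed fraction of the global max
    thresh = np.convolve(flux, np.ones(5) / 5, mode="same") + delta * flux.max()
    # Peak picking: local maxima above the threshold
    peaks = [i for i in range(1, len(flux) - 1)
             if flux[i] > flux[i - 1] and flux[i] >= flux[i + 1]
             and flux[i] > thresh[i]]
    # Enforce a minimum inter-onset interval (in frames)
    onsets = []
    for p in peaks:
        if not onsets or p - onsets[-1] >= min_gap:
            onsets.append(p)
    return np.array(onsets)
```

Run on a signal with two tone bursts, this returns one onset frame per burst.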

Beat Tracking

Estimate the regular pulse (beat positions) of music:

  • Compute onset detection function
  • Autocorrelation or comb filter analysis to estimate tempo period
  • Dynamic programming to select beat positions maximizing:
    • Alignment with onset strength
    • Regularity of inter-beat intervals
  • Predominant local pulse (PLP): handles locally varying tempo and expressive timing

State-of-the-art: TCN (Temporal Convolutional Network) beat trackers process spectrograms and output beat activation functions, followed by dynamic Bayesian networks or Viterbi decoding for final beat positions.
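The dynamic-programming step can be sketched in the style of Ellis's 2007 beat tracker; the `tightness` weight and the half-to-double-period search window are common but arbitrary choices:

```python
import numpy as np

def dp_beat_track(onset_env, period, tightness=100.0):
    """Select beat frames maximizing onset strength plus inter-beat
    regularity, given an estimated tempo period (in frames)."""
    n = len(onset_env)
    score = onset_env.astype(float).copy()
    backlink = np.full(n, -1)
    for i in range(n):
        # Consider predecessors roughly half to twice the tempo period back
        prev = np.arange(max(i - 2 * period, 0), max(i - period // 2, 0))
        if len(prev) == 0:
            continue
        # Penalize deviation of the inter-beat interval from the ideal period
        txcost = -tightness * (np.log((i - prev) / period) ** 2)
        candidates = score[prev] + txcost
        best = np.argmax(candidates)
        if candidates[best] > 0:
            score[i] += candidates[best]
            backlink[i] = prev[best]
    # Backtrace from the best-scoring frame
    beats = [int(np.argmax(score))]
    while backlink[beats[-1]] >= 0:
        beats.append(int(backlink[beats[-1]]))
    return beats[::-1]
```

On a perfectly regular onset envelope it recovers the pulse exactly; on real music the log-interval penalty is what keeps the path from drifting.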

Tempo Estimation

Estimate beats per minute (BPM):

  • Periodicity analysis of onset/beat activation functions
  • Report one or two tempo estimates (to handle octave ambiguity: 60 vs 120 BPM)
  • Global tempo vs local tempo tracking for music with tempo changes
  • Accuracy metrics: Accuracy1 allows 4% tolerance; Accuracy2 additionally forgives metrical-level errors (2x, 3x, 1/2, 1/3 of the reference tempo)
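A toy autocorrelation tempo estimator plus an Accuracy1 check; the BPM range and envelope frame rate are illustrative parameters:

```python
import numpy as np

def estimate_tempo(onset_env, fps, bpm_range=(60, 200)):
    """Estimate a global tempo by autocorrelating an onset envelope
    sampled at fps frames per second."""
    env = onset_env - onset_env.mean()
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]  # lags 0..N-1
    # Convert the candidate BPM range to a lag range (frames per beat)
    min_lag = int(round(fps * 60.0 / bpm_range[1]))
    max_lag = int(round(fps * 60.0 / bpm_range[0]))
    lag = min_lag + int(np.argmax(ac[min_lag:max_lag + 1]))
    return 60.0 * fps / lag

def accuracy1(est_bpm, ref_bpm, tol=0.04):
    """Accuracy1: estimate within 4% of the reference tempo."""
    return abs(est_bpm - ref_bpm) <= tol * ref_bpm
```

Note that an octave error (60 vs 120 BPM) fails Accuracy1 but would be forgiven by Accuracy2.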

Pitch Tracking

Estimate the fundamental frequency (F0) of a signal over time.

YIN Algorithm

Autocorrelation-based pitch detector (de Cheveigné & Kawahara, 2002):

  1. Compute difference function: d(tau) = sum_n (x[n] - x[n+tau])^2
  2. Cumulative mean normalization: d'(tau) = d(tau) / ((1/tau) * sum_{j=1..tau} d(j)), with d'(0) = 1
  3. Absolute threshold: Find first tau where d'(tau) < threshold (typically 0.1-0.15)
  4. Parabolic interpolation: Refine estimate for sub-sample accuracy

Fast, accurate for monophonic signals. Struggles with polyphonic music.
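A single-frame YIN sketch in numpy following the four steps above; the frame length, search range, and 0.15 threshold are typical but adjustable settings:

```python
import numpy as np

def yin_f0(x, sr, fmin=80.0, fmax=800.0, threshold=0.15):
    """Single-frame YIN: difference function, cumulative mean
    normalization, absolute threshold, parabolic interpolation."""
    tau_min = int(sr / fmax)
    tau_max = int(sr / fmin)
    # 1. Difference function d(tau) over a fixed integration window
    win = len(x) - tau_max
    d = np.array([np.sum((x[:win] - x[tau:tau + win]) ** 2)
                  for tau in range(tau_max + 1)])
    # 2. Cumulative mean normalized difference d'(tau)
    dprime = np.ones_like(d)
    cumsum = np.cumsum(d[1:])
    dprime[1:] = d[1:] * np.arange(1, len(d)) / np.maximum(cumsum, 1e-12)
    # 3. Absolute threshold: first tau where d'(tau) dips below it
    below = np.where(dprime[tau_min:tau_max] < threshold)[0]
    if len(below) == 0:
        return None  # unvoiced / no reliable pitch
    tau = tau_min + below[0]
    while tau + 1 < tau_max and dprime[tau + 1] < dprime[tau]:
        tau += 1  # walk down to the local minimum of this dip
    # 4. Parabolic interpolation for sub-sample accuracy
    a, b, c = dprime[tau - 1], dprime[tau], dprime[tau + 1]
    shift = 0.5 * (a - c) / (a - 2 * b + c)
    return sr / (tau + shift)
```

On a clean sine the interpolated estimate lands within a few cents of the true pitch even though the period is not an integer number of samples.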

pYIN

Probabilistic extension of YIN:

  • Multiple pitch candidates per frame with probabilities
  • HMM decoding for smooth pitch trajectory
  • Jointly estimates voiced/unvoiced decisions
  • Handles uncertain regions much better than hard-threshold YIN

CREPE (2018)

CNN-based pitch tracker trained on synthesized data:

  • Input: 1024-sample raw audio frame (64 ms at 16 kHz)
  • Architecture: 6-layer CNN
  • Output: 360-bin pitch salience (roughly 32.7 Hz to 2 kHz, 20 cents per bin)
  • Weighted average of activations gives continuous F0 estimate

Significantly more robust than signal-processing methods, especially in noise. Variants: Tiny CREPE (smaller model), FCNF0 (fully convolutional).
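The weighted-average decoding step can be illustrated as follows; the 20-cent grid starting near C1 and the local window size are simplifications of CREPE's actual mapping:

```python
import numpy as np

F_REF = 10.0  # reference frequency for the cents scale
# 360 bins, 20 cents apart, starting near C1 (~32.7 Hz); approximate grid
CENTS = 1200.0 * np.log2(32.70 / F_REF) + 20.0 * np.arange(360)

def decode_f0(salience, window=4):
    """Salience-weighted average of cents in a window around the peak
    bin, converted back to Hz (CREPE-style local decoding)."""
    peak = int(np.argmax(salience))
    lo, hi = max(0, peak - window), min(len(salience), peak + window + 1)
    w = salience[lo:hi]
    cents = np.sum(w * CENTS[lo:hi]) / np.sum(w)
    return F_REF * 2.0 ** (cents / 1200.0)
```

Averaging in cents rather than taking the argmax bin is what turns the 20-cent quantized output into a continuous F0 estimate.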

Multi-Pitch Estimation

Detect multiple simultaneous pitches in polyphonic music:

  • Iterative subtraction approaches
  • Non-negative matrix factorization (NMF) of spectrograms
  • Deep learning: Multi-F0 estimation networks (e.g., DeepSalience, Basic Pitch)
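A bare-bones NMF with Euclidean multiplicative updates (Lee & Seung); real multi-pitch systems additionally constrain the columns of W to harmonic templates, which this sketch omits:

```python
import numpy as np

def nmf(V, n_components, n_iter=200, eps=1e-9):
    """Multiplicative-update NMF: V (freq x time) ~= W @ H, where W holds
    spectral templates and H their time-varying activations."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, n_components)) + eps
    H = rng.random((n_components, T)) + eps
    for _ in range(n_iter):
        # Euclidean-distance multiplicative updates keep W, H non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

For pitch estimation, each column of W would model one note's harmonic spectrum and the corresponding row of H indicates when that note is active.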

Chord Recognition

Identify the chord (harmonic label) at each time point.

Pipeline

Audio -> CQT/Chroma features -> Chord classifier -> Chord sequence -> Smoothing
  1. Features: Chroma vectors (12-dim pitch class profiles) from CQT or harmonic CQT
  2. Classification: CNN/CRNN on chromagram or CQT spectrogram
  3. Temporal smoothing: CRF, HMM, or Viterbi decoding for consistent chord sequences
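Step 1 (chroma from a magnitude spectrum) reduces to folding frequency bins onto 12 pitch classes; the frequency limits here are illustrative:

```python
import numpy as np

def chroma_from_spectrum(mags, freqs, fmin=60.0, fmax=4000.0):
    """Fold one STFT magnitude frame into a 12-dim pitch-class vector."""
    chroma = np.zeros(12)
    for mag, f in zip(mags, freqs):
        if fmin <= f <= fmax:
            # MIDI pitch from frequency, then pitch class modulo 12
            midi = 69 + 12 * np.log2(f / 440.0)
            chroma[int(round(midi)) % 12] += mag
    total = chroma.sum()
    return chroma / total if total > 0 else chroma
```

A C major triad (C, E, G partials) concentrates energy in pitch classes 0, 4, and 7, which is exactly the pattern a chord classifier learns to recognize.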

Chord Vocabulary

  • Major/minor triads: 24 chords + "no chord" (simplest vocabulary)
  • Sevenths: Adding major 7, minor 7, and dominant 7 qualities expands the vocabulary to 60+ labels
  • Full vocabulary: Extensions, inversions, slash chords

State-of-the-art uses CRNNs (CNN encoder + BiLSTM + softmax) on CQT representations. Self-supervised pre-training and large-scale training data continue to improve accuracy.


Music Transcription (AMT)

Automatic Music Transcription converts audio to symbolic notation (MIDI, sheet music).

Piano Transcription

The most studied AMT task:

  • Onsets and Frames (Google, 2018): Dual-head CNN predicts onset events and sustained frames jointly; onset predictions gate frame predictions
  • High-resolution piano transcription: Predict onset, offset, velocity per note
  • Input: Mel-spectrogram or CQT
  • Output: Piano roll (time x 88 keys) converted to MIDI
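Converting a binary piano roll into note events is a matter of finding rising and falling edges per key; the frame duration below is an assumed hop size, not a standard:

```python
import numpy as np

def roll_to_notes(roll, frame_time=0.032):
    """Convert a binary piano roll (frames x 88 keys) into
    (midi_pitch, onset_sec, offset_sec) note events."""
    notes = []
    # Pad with silent frames so edges at the boundaries are detected
    pad = np.zeros((1, roll.shape[1]))
    padded = np.vstack([pad, roll, pad])
    for key in range(roll.shape[1]):
        changes = np.diff(padded[:, key])
        onsets = np.where(changes == 1)[0]    # rising edges
        offsets = np.where(changes == -1)[0]  # falling edges
        for on, off in zip(onsets, offsets):
            notes.append((key + 21, on * frame_time, off * frame_time))  # MIDI 21 = A0
    return sorted(notes, key=lambda n: n[1])
```

The resulting (pitch, onset, offset) triples map directly onto MIDI note-on/note-off events.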

Multi-Instrument Transcription

Far more challenging due to timbral overlap:

  • Instrument-specific models or multi-task approaches
  • Source separation as pre-processing step
  • MT3 (Google): Transformer-based model transcribing arbitrary instruments using sequence-to-sequence with spectrogram input and MIDI-like token output

Source Separation

Isolate individual sources (vocals, drums, bass, other) from a mixture.

Classical Approaches

  • NMF (Non-negative Matrix Factorization): Decompose spectrogram into basis and activation
  • RPCA (Robust PCA): Separate low-rank (accompaniment) from sparse (vocals) components
  • Ideal ratio mask: Oracle mask applied to mixture STFT (upper bound reference)
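The ideal ratio mask is easy to state in code; this toy example uses magnitude spectrograms and ignores phase, which is why the IRM is only an oracle upper bound rather than a practical separator:

```python
import numpy as np

def ideal_ratio_mask(source_mags, eps=1e-8):
    """Oracle IRM per source: |S_i| / sum_j |S_j|, to be applied
    elementwise to the mixture spectrogram."""
    total = sum(source_mags) + eps
    return [m / total for m in source_mags]

# Toy usage with single-frame "spectrograms" (3 frequency bins)
vocals = np.array([3.0, 0.0, 1.0])
accomp = np.array([1.0, 2.0, 1.0])
mix = vocals + accomp  # magnitudes only add approximately in practice
masks = ideal_ratio_mask([vocals, accomp])
vocal_est = masks[0] * mix
```

Because the masks sum to one per bin, the masked estimates always sum back to the mixture.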

Deep Learning

  • U-Net architectures: Encoder-decoder with skip connections on spectrograms
  • Open-Unmix: Reference open-source system using BiLSTM on magnitude spectrograms
  • Conv-TasNet: Time-domain separation using learned encoder/decoder and TCN mask estimator

Demucs (Meta)

State-of-the-art music source separation:

  • Hybrid architecture: Processes both waveform (temporal) and spectrogram (spectral) domains
  • Dual-path processing: Temporal encoder-decoder + spectral encoder-decoder with cross-attention
  • Outputs: 4 stems (vocals, drums, bass, other) or 6 stems
  • HTDemucs: Hybrid Transformer version with cross-domain attention
  • Trained on a large internal dataset in addition to MUSDB18

SDR (Signal-to-Distortion Ratio) is the primary evaluation metric. Demucs variants reach roughly 9 dB SDR on vocals separation on MUSDB18.


Music Generation

Symbolic Generation

Generate MIDI or other symbolic representations:

  • Music Transformer: Self-attention with relative positional encoding over MIDI events
  • MuseNet (OpenAI): GPT-style model generating multi-instrument MIDI

Audio Generation

Generate raw audio or neural codec tokens:

MusicGen (Meta, 2023):

  • Autoregressive Transformer over EnCodec discrete audio tokens
  • Codebook interleaving patterns avoid the need for multiple model passes
  • Conditioned on text descriptions, melody, or both
  • Single-stage model (no separate planning/refinement)

MusicLM (Google):

  • Hierarchical generation: semantic tokens -> acoustic tokens -> audio
  • Text and melody conditioning via MuLan and w2v-BERT embeddings

Stable Audio / Riffusion: Diffusion-based music generation from text prompts.

Evaluation

  • FAD (Fréchet Audio Distance): Compare distributions of generated vs real audio embeddings
  • KL divergence on audio classifier outputs
  • Human evaluation remains essential (musicality, coherence, text adherence)

Audio Fingerprinting

Identify a recording from a short audio excerpt, robust to noise and distortion.

Shazam Algorithm (Wang, 2003)

  1. Compute spectrogram, find spectral peaks (time-frequency landmarks)
  2. Form landmark pairs: Combine nearby peaks into (f1, f2, delta_t) tuples
  3. Hash each landmark: Compact representation for database lookup
  4. Search: Match query hashes against database, find time-coherent cluster of matches
  5. Offset histogram determines the matching song and time position

Properties: Robust to noise, reverberation, and encoding. Works with 3-5 second queries. Database can hold millions of tracks with sub-second lookup.
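A simplified version of the landmark-pair hashing and offset-histogram search; the bit widths, fan-out, and time limits are illustrative, not Shazam's actual parameters:

```python
import numpy as np

def landmark_hashes(peaks, fan_out=5, max_dt=63):
    """Pair each spectral peak (t, f) with a few later peaks and hash
    the (f1, f2, delta_t) triple into a compact integer key."""
    peaks = sorted(peaks)  # sort by time
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            dt = t2 - t1
            if 0 < dt <= max_dt:
                # Pack f1 (upper bits), f2 (10 bits), dt (6 bits) into one int
                key = (f1 << 16) | (f2 << 6) | dt
                hashes.append((key, t1))  # keep anchor time for offset voting
    return hashes

def match_offset(query_hashes, db_hashes):
    """Vote on time offsets between matching hashes; a sharp histogram
    peak gives the query's position within the database track."""
    index = {}
    for key, t in db_hashes:
        index.setdefault(key, []).append(t)
    votes = {}
    for key, tq in query_hashes:
        for tdb in index.get(key, []):
            off = tdb - tq
            votes[off] = votes.get(off, 0) + 1
    return max(votes, key=votes.get) if votes else None
```

A query whose peaks are a time-shifted subset of a database track votes overwhelmingly for a single offset, which is what makes the lookup robust to missing or spurious peaks.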

Neural Fingerprinting

  • Learn embeddings that are invariant to distortions but discriminative across songs
  • Contrastive learning on augmented audio pairs
  • Neural Audio Fingerprint (2021): CNN encoder produces compact embeddings
  • Can handle more severe distortions than landmark-based methods
  • Trade-off: Higher computational cost, requires GPU for indexing

Other MIR Tasks

  • Key detection: Estimate musical key (C major, A minor, etc.) from chroma profiles
  • Structure segmentation: Identify verse, chorus, bridge sections
  • Music tagging: Genre, mood, instrument classification from audio
  • Cover song detection: Identify different performances of the same composition
  • Music recommendation: Content-based similarity using learned audio embeddings
  • Lyrics alignment: Synchronize lyrics text with audio timestamps