Audio Fundamentals
Sound Physics
Sound is a longitudinal pressure wave propagating through a medium (air, water, solids). The wave creates alternating compressions and rarefactions that the ear perceives as sound.
Core Properties
| Property | Definition | Unit | Perceptual Correlate |
|----------|-----------|------|---------------------|
| Frequency | Oscillation cycles per second | Hz | Pitch |
| Amplitude | Peak displacement from equilibrium | Pa (pressure) | Loudness |
| Phase | Position within the wave cycle | Radians/degrees | Spatial perception |
| Wavelength | Distance for one full cycle (lambda = v/f) | Meters | -- |
Human hearing range: approximately 20 Hz to 20 kHz. Speed of sound in air at 20 °C: ~343 m/s.
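The lambda = v/f relation from the table can be put to work directly. A minimal sketch, assuming the ~343 m/s figure above:

```python
# Wavelength of sound in air (lambda = v / f), assuming v = 343 m/s at 20 °C.
SPEED_OF_SOUND = 343.0  # m/s

def wavelength(freq_hz: float, speed: float = SPEED_OF_SOUND) -> float:
    """Return the wavelength in meters for a frequency in Hz."""
    return speed / freq_hz

# The audible range spans roughly three orders of magnitude in wavelength:
print(wavelength(20))     # ~17.15 m at the low end of hearing
print(wavelength(20000))  # ~0.017 m (17 mm) at the high end
```

The thousand-fold wavelength spread is why low frequencies diffract around obstacles while high frequencies are easily shadowed.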
Harmonics and Timbre
A pure tone is a single sinusoid. Natural sounds contain a fundamental frequency (f0) plus harmonics (integer multiples: 2f0, 3f0, ...). The relative amplitudes of these harmonics define the timbre -- why a piano and violin playing the same note sound different.
- Harmonic series: f0, 2f0, 3f0, 4f0, ... (periodic signals)
- Inharmonic partials: Non-integer multiples (bells, percussion)
- Noise: Aperiodic signals with continuous spectra (wind, fricatives like /s/)
- Formants: Resonant peaks in the vocal tract transfer function, critical for vowel identity
Waveform Composition (Fourier's Theorem)
Any periodic signal can be decomposed into a sum of sinusoids:
x(t) = sum_{k=1}^{N} A_k * sin(2*pi*k*f0*t + phi_k)
This is the foundation of all frequency-domain audio analysis.
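The synthesis formula above can be evaluated directly (additive synthesis). A sketch that builds a band-limited sawtooth using the standard amplitude recipe A_k = 1/k with phi_k = 0 (the choice of waveform and harmonic count here is illustrative):

```python
import numpy as np

def additive_sawtooth(f0, duration, sample_rate=48000, n_harmonics=30):
    """Sum sinusoids at k*f0 with amplitudes 1/k -- Fourier synthesis of a sawtooth."""
    t = np.arange(int(duration * sample_rate)) / sample_rate
    x = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        x += (1.0 / k) * np.sin(2 * np.pi * k * f0 * t)  # A_k = 1/k, phi_k = 0
    return (2 / np.pi) * x  # scale toward the ideal sawtooth's [-1, 1] range

wave = additive_sawtooth(220.0, duration=0.1)
```

Truncating at N harmonics leaves small ripples near the waveform's discontinuities (the Gibbs phenomenon), which is why the peak slightly overshoots 1.0.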
Digital Audio
Analog-to-Digital Conversion
Converting continuous sound to discrete digital representation requires two steps:
- Sampling: Measure amplitude at regular time intervals
- Quantization: Map continuous amplitude values to discrete levels
Sampling Rate
The Nyquist-Shannon sampling theorem states that a signal must be sampled at a rate of at least twice its highest frequency component to be perfectly reconstructable:
f_sample >= 2 * f_max
| Standard | Sample Rate | Bandwidth | Use Case |
|----------|------------|-----------|----------|
| Telephony | 8 kHz | 4 kHz | Voice calls |
| Wideband speech | 16 kHz | 8 kHz | Speech recognition, VoIP |
| CD audio | 44.1 kHz | 22.05 kHz | Music distribution |
| Professional | 48 kHz | 24 kHz | Film, broadcast |
| Hi-res | 96/192 kHz | 48/96 kHz | Studio recording |
Aliasing occurs when sampling below the Nyquist rate -- high frequencies fold back as spurious low-frequency artifacts. Anti-aliasing filters remove content above f_sample/2 before sampling.
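The fold-back is easy to demonstrate numerically. In this sketch (the specific frequencies are chosen for illustration), a 7 kHz tone sampled at 8 kHz appears at |8000 - 7000| = 1000 Hz:

```python
import numpy as np

fs = 8000       # sample rate -- below 2 * 7000, so aliasing occurs
f_true = 7000   # actual tone frequency, above Nyquist (4 kHz)
n = 8000        # 1 second of samples -> 1 Hz FFT resolution

t = np.arange(n) / fs
x = np.sin(2 * np.pi * f_true * t)

spectrum = np.abs(np.fft.rfft(x))
f_apparent = int(np.argmax(spectrum))  # bin index equals frequency in Hz here
print(f_apparent)  # 1000 -- the tone folded below Nyquist, not 7000
```

Once the samples are taken, the 7 kHz and 1 kHz interpretations are indistinguishable, which is why the anti-aliasing filter must run before the converter, not after.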
Bit Depth and Quantization
Bit depth determines the number of discrete amplitude levels (2^n levels for n bits):
- 8-bit: 256 levels, ~48 dB dynamic range (telephony)
- 16-bit: 65,536 levels, ~96 dB dynamic range (CD quality)
- 24-bit: 16.7M levels, ~144 dB dynamic range (professional)
- 32-bit float: Virtually unlimited headroom (internal processing)
Quantization noise is the rounding error from mapping continuous to discrete values. Signal-to-quantization-noise ratio for a full-scale sinusoid: SQNR ~ 6.02n + 1.76 dB (for n bits). Dithering adds small random noise before quantization to decorrelate the error, converting distortion into a flat noise floor.
PCM (Pulse Code Modulation)
The standard uncompressed digital audio encoding. Each sample is a fixed-width integer (or float) representing instantaneous amplitude. Variants include LPCM (linear), A-law, and mu-law (logarithmic companding used in telephony).
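The logarithmic companding mentioned above follows the standard mu-law curve with mu = 255 (the North American telephony value). A sketch of the continuous encode/decode pair, before any 8-bit quantization step:

```python
import numpy as np

MU = 255.0  # mu-law constant used in telephony

def mulaw_encode(x):
    """Compress amplitude in [-1, 1] logarithmically: more resolution near zero."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mulaw_decode(y):
    """Inverse companding: expand back to linear amplitude."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

x = np.linspace(-1, 1, 101)
roundtrip = mulaw_decode(mulaw_encode(x))
print(bool(np.max(np.abs(roundtrip - x)) < 1e-9))  # True: companding alone is lossless
```

The point of the curve is that when 8-bit quantization is applied *after* encoding, quiet signals get finer effective step sizes than loud ones, matching how hearing works.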
Audio Formats
| Format | Type | Compression | Typical Use |
|--------|------|-------------|-------------|
| WAV | Container (RIFF) | None (PCM) | Professional editing |
| AIFF | Container | None (PCM) | macOS professional |
| FLAC | Codec | Lossless (~60% size) | Archival, audiophile |
| ALAC | Codec | Lossless | Apple ecosystem |
| MP3 | Codec | Lossy (perceptual) | Music distribution |
| AAC | Codec | Lossy (better than MP3) | Streaming, mobile |
| Opus | Codec | Lossy (state-of-art) | VoIP, streaming, low latency |
| Vorbis | Codec | Lossy | Open-source alternative |
Lossy codecs exploit psychoacoustic masking to discard inaudible information. Opus is notable for handling both speech and music well across 6-510 kbps.
Audio I/O Systems
JACK (JACK Audio Connection Kit)
A professional-grade, low-latency audio server for Linux/macOS:
- Routes audio between applications with sample-accurate synchronization
- Supports arbitrary inter-application connections (patchbay model)
- Typical round-trip latency: 2-10 ms
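The latency figures above come mostly from buffer-size arithmetic. A simple sketch (the 128-frame example setting is illustrative; real round-trip latency also includes converter and driver delays):

```python
def period_latency_ms(frames: int, sample_rate: int) -> float:
    """One period's worth of latency: buffer duration in milliseconds."""
    return 1000.0 * frames / sample_rate

# e.g. a JACK-style setting of 128 frames/period at 48 kHz:
one_way = period_latency_ms(128, 48000)
print(round(one_way, 2))  # 2.67 ms per period
```

Halving the period size halves this term but doubles the interrupt rate, which is the basic latency/CPU trade-off in low-latency audio configuration.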
PipeWire
Modern Linux audio/video server replacing both PulseAudio and JACK:
- Unified graph-based processing for consumer and professional audio
- Compatible with JACK, PulseAudio, and ALSA APIs simultaneously
- Dynamic sample rate and buffer size negotiation
- Handles audio, video, and MIDI in a single framework
Other Systems
- ALSA: Linux kernel-level audio interface (direct hardware access)
- PulseAudio: Linux consumer audio server (mixing, per-app volume)
- CoreAudio: macOS native audio framework
- WASAPI/ASIO: Windows audio interfaces (ASIO for low latency)
Psychoacoustics
Psychoacoustics studies how the auditory system perceives sound, directly informing codec design, audio processing, and interface design.
Loudness Perception
Perceived loudness is not linearly proportional to physical intensity:
- Follows approximately a power law (Stevens' law): L ~ I^0.3
- Measured in phons (equal-loudness) and sones (linear loudness scale)
- 1 sone = loudness of a 1 kHz tone at 40 dB SPL
- Doubling sones ~ +10 phon ~ +10 dB increase
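The sone/phon relationship in the list above can be written as a closed form, valid above roughly 40 phon: S = 2^((P - 40)/10). A sketch:

```python
import math

def phon_to_sone(phon: float) -> float:
    """Loudness in sones: doubles for every +10 phon above the 40-phon reference."""
    return 2.0 ** ((phon - 40.0) / 10.0)

def sone_to_phon(sone: float) -> float:
    """Inverse mapping back to loudness level in phons."""
    return 40.0 + 10.0 * math.log2(sone)

print(phon_to_sone(40))  # 1.0 -- reference: 1 kHz tone at 40 dB SPL
print(phon_to_sone(50))  # 2.0 -- perceived as twice as loud
```

This is why a "twice as loud" mix change needs roughly +10 dB, not +3 dB (which merely doubles power).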
Equal-Loudness Contours (ISO 226)
The ear's sensitivity varies with frequency. Equal-loudness contours (Fletcher-Munson curves) show SPL levels perceived as equally loud across frequencies:
- Most sensitive: 2-5 kHz (ear canal resonance)
- Less sensitive at low frequencies, especially at quiet levels
- At high SPL, response flattens
- Basis for A-weighting (dBA) used in noise measurements
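The A-weighting curve mentioned in the last bullet has a standard analytic form (IEC 61672), normalized to 0 dB at 1 kHz. A sketch:

```python
import math

def a_weighting_db(f: float) -> float:
    """A-weighting in dB at frequency f (IEC 61672 analytic approximation)."""
    f2 = f * f
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20.0 * math.log10(ra) + 2.00  # offset normalizes 1 kHz to ~0 dB

print(round(a_weighting_db(1000), 1))  # ~0.0 dB at the 1 kHz reference
print(round(a_weighting_db(100), 1))   # ~-19.1 dB: low frequencies discounted
```

The heavy low-frequency attenuation mirrors the equal-loudness contours at quiet levels, which is exactly why dBA is used for environmental noise measurement.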
Auditory Masking
A louder sound can render a quieter sound inaudible:
- Simultaneous masking: Signals at nearby frequencies mask each other (frequency domain)
- Temporal masking: Pre-masking (~5 ms before) and post-masking (~100-200 ms after)
- Energetic masking: Physical overlap in cochlear excitation
- Informational masking: Cognitive interference (e.g., competing speech)
Lossy codecs allocate bits based on masking thresholds -- masked components need not be encoded.
Critical Bands
The cochlea performs frequency analysis in overlapping critical bands (roughly two dozen bands spanning 20 Hz - 20 kHz on the Bark scale). Two tones within the same critical band interact (beat, mask); tones in different bands are perceived independently.
- Critical bandwidth approximated by the ERB (Equivalent Rectangular Bandwidth) scale
- Bark scale: 24 critical bands, roughly linear below 500 Hz, logarithmic above
- Mel scale: Perceptual pitch scale, approximately linear below 1 kHz, logarithmic above
- These scales motivate mel-frequency filterbanks used in speech/audio feature extraction
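The mel and ERB scales above have widely used analytic approximations (O'Shaughnessy's mel formula and the Glasberg-Moore ERB formula). A sketch:

```python
import math

def hz_to_mel(f: float) -> float:
    """Mel scale (O'Shaughnessy): ~linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def erb_bandwidth(f: float) -> float:
    """Equivalent Rectangular Bandwidth (Hz) of the auditory filter centered at f."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

print(round(hz_to_mel(1000)))      # ~1000 mel at 1 kHz
print(round(erb_bandwidth(1000)))  # ~133 Hz wide critical band at 1 kHz
```

Spacing filterbank centers uniformly in mel (rather than in Hz) is what gives mel-frequency features their perceptually motivated resolution: dense at low frequencies, coarse at high.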
Auditory Scene Analysis
The brain groups acoustic components into coherent auditory streams using:
- Common onset/offset and shared harmonicity (components of one fundamental)
- Proximity in frequency and time
- Continuity and smooth transitions
- Spatial cues (ITD, ILD)
This perceptual grouping is what computational auditory scene analysis (CASA) and source separation algorithms attempt to replicate.