Audio Fundamentals
Sound Physics
Sound is a longitudinal pressure wave propagating through a medium (air, water, solids). The wave creates alternating compressions and rarefactions that the ear perceives as sound.
Core Properties
| Property | Definition | Unit | Perceptual Correlate |
|----------|-----------|------|---------------------|
| Frequency | Oscillation cycles per second | Hz | Pitch |
| Amplitude | Peak displacement from equilibrium | Pa (pressure) | Loudness |
| Phase | Position within the wave cycle | Radians/degrees | Spatial perception |
| Wavelength | Distance for one full cycle (lambda = v/f) | Meters | -- |
Human hearing range: approximately 20 Hz to 20 kHz. Speed of sound in air at 20 °C: ~343 m/s.
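The lambda = v/f relation from the table can be put to work directly. A minimal sketch, assuming the ~343 m/s figure above:

```python
# Wavelength of sound in air (lambda = v / f), assuming v = 343 m/s at 20 °C.
SPEED_OF_SOUND = 343.0  # m/s

def wavelength(freq_hz: float, speed: float = SPEED_OF_SOUND) -> float:
    """Return the wavelength in meters for a frequency in Hz."""
    return speed / freq_hz

# The audible range spans roughly three orders of magnitude in wavelength:
print(wavelength(20))     # ~17.15 m at the low end of hearing
print(wavelength(20000))  # ~0.017 m (17 mm) at the high end
```

The thousand-fold wavelength spread is why low frequencies diffract around obstacles while high frequencies are easily shadowed.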
Harmonics and Timbre
A pure tone is a single sinusoid. Natural sounds contain a fundamental frequency (f0) plus harmonics (integer multiples: 2f0, 3f0, ...). The relative amplitudes of these harmonics define the timbre -- why a piano and violin playing the same note sound different.
- Harmonic series: f0, 2f0, 3f0, 4f0, ... (periodic signals)
- Inharmonic partials: Non-integer multiples (bells, percussion)
- Noise: Aperiodic signals with continuous spectra (wind, fricatives like /s/)
- Formants: Resonant peaks in the vocal tract transfer function, critical for vowel identity
Waveform Composition (Fourier's Theorem)
Any periodic signal can be decomposed into a sum of sinusoids:
x(t) = sum_{k=1}^{N} A_k * sin(2*pi*k*f0*t + phi_k)
This is the foundation of all frequency-domain audio analysis.
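The synthesis formula above can be evaluated directly (additive synthesis). A sketch that builds a band-limited sawtooth using the standard amplitude recipe A_k = 1/k with phi_k = 0 (the choice of waveform and harmonic count here is illustrative):

```python
import numpy as np

def additive_sawtooth(f0, duration, sample_rate=48000, n_harmonics=30):
    """Sum sinusoids at k*f0 with amplitudes 1/k -- Fourier synthesis of a sawtooth."""
    t = np.arange(int(duration * sample_rate)) / sample_rate
    x = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        x += (1.0 / k) * np.sin(2 * np.pi * k * f0 * t)  # A_k = 1/k, phi_k = 0
    return (2 / np.pi) * x  # scale toward the ideal sawtooth's [-1, 1] range

wave = additive_sawtooth(220.0, duration=0.1)
```

Truncating at N harmonics leaves small ripples near the waveform's discontinuities (the Gibbs phenomenon), which is why the peak slightly overshoots 1.0.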
Digital Audio
Analog-to-Digital Conversion
Converting continuous sound to discrete digital representation requires two steps:
- Sampling: Measure amplitude at regular time intervals
- Quantization: Map continuous amplitude values to discrete levels
Sampling Rate
The Nyquist-Shannon sampling theorem states that a signal must be sampled at a rate of at least twice its highest frequency component to be perfectly reconstructable:
f_sample >= 2 * f_max
| Standard | Sample Rate | Bandwidth | Use Case |
|----------|------------|-----------|----------|
| Telephony | 8 kHz | 4 kHz | Voice calls |
| Wideband speech | 16 kHz | 8 kHz | Speech recognition, VoIP |
| CD audio | 44.1 kHz | 22.05 kHz | Music distribution |
| Professional | 48 kHz | 24 kHz | Film, broadcast |
| Hi-res | 96/192 kHz | 48/96 kHz | Studio recording |
Aliasing occurs when sampling below the Nyquist rate -- high frequencies fold back as spurious low-frequency artifacts. Anti-aliasing filters remove content above f_sample/2 before sampling.
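The fold-back is easy to demonstrate numerically. In this sketch (the specific frequencies are chosen for illustration), a 7 kHz tone sampled at 8 kHz appears at |8000 - 7000| = 1000 Hz:

```python
import numpy as np

fs = 8000       # sample rate -- below 2 * 7000, so aliasing occurs
f_true = 7000   # actual tone frequency, above Nyquist (4 kHz)
n = 8000        # 1 second of samples -> 1 Hz FFT resolution

t = np.arange(n) / fs
x = np.sin(2 * np.pi * f_true * t)

spectrum = np.abs(np.fft.rfft(x))
f_apparent = int(np.argmax(spectrum))  # bin index equals frequency in Hz here
print(f_apparent)  # 1000 -- the tone folded below Nyquist, not 7000
```

Once the samples are taken, the 7 kHz and 1 kHz interpretations are indistinguishable, which is why the anti-aliasing filter must run before the converter, not after.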
Bit Depth and Quantization
Bit depth determines the number of discrete amplitude levels (2^n levels for n bits):
- 8-bit: 256 levels, ~48 dB dynamic range (telephony)
- 16-bit: 65,536 levels, ~96 dB dynamic range (CD quality)
- 24-bit: 16.7M levels, ~144 dB dynamic range (professional)
- 32-bit float: Virtually unlimited headroom (internal processing)
Quantization noise is the rounding error from mapping continuous to discrete values. Signal-to-quantization-noise ratio for a full-scale sinusoid: SQNR ~ 6.02n + 1.76 dB (for n bits). Dithering adds small random noise before quantization to decorrelate the error, converting distortion into a flat noise floor.
PCM (Pulse Code Modulation)
The standard uncompressed digital audio encoding. Each sample is a fixed-width integer (or float) representing instantaneous amplitude. Variants include LPCM (linear), A-law, and mu-law (logarithmic companding used in telephony).
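The logarithmic companding mentioned above follows the standard mu-law curve with mu = 255 (the North American telephony value). A sketch of the continuous encode/decode pair, before any 8-bit quantization step:

```python
import numpy as np

MU = 255.0  # mu-law constant used in telephony

def mulaw_encode(x):
    """Compress amplitude in [-1, 1] logarithmically: more resolution near zero."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mulaw_decode(y):
    """Inverse companding: expand back to linear amplitude."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

x = np.linspace(-1, 1, 101)
roundtrip = mulaw_decode(mulaw_encode(x))
print(bool(np.max(np.abs(roundtrip - x)) < 1e-9))  # True: companding alone is lossless
```

The point of the curve is that when 8-bit quantization is applied *after* encoding, quiet signals get finer effective step sizes than loud ones, matching how hearing works.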
Audio Formats
| Format | Type | Compression | Typical Use |
|--------|------|-------------|-------------|
| WAV | Container (RIFF) | None (PCM) | Professional editing |
| AIFF | Container | None (PCM) | macOS professional |
| FLAC | Codec | Lossless (~60% size) | Archival, audiophile |
| ALAC | Codec | Lossless | Apple ecosystem |
| MP3 | Codec | Lossy (perceptual) | Music distribution |
| AAC | Codec | Lossy (better than MP3) | Streaming, mobile |
| Opus | Codec | Lossy (state-of-art) | VoIP, streaming, low latency |
| Vorbis | Codec | Lossy | Open-source alternative |
Lossy codecs exploit psychoacoustic masking to discard inaudible information. Opus is notable for handling both speech and music well across 6-510 kbps.
Audio I/O Systems
JACK (JACK Audio Connection Kit)
A professional-grade, low-latency audio server for Linux/macOS:
- Routes audio between applications with sample-accurate synchronization
- Supports arbitrary inter-application connections (patchbay model)
- Typical round-trip latency: 2-10 ms
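The latency figures above come mostly from buffer-size arithmetic. A simple sketch (the 128-frame example setting is illustrative; real round-trip latency also includes converter and driver delays):

```python
def period_latency_ms(frames: int, sample_rate: int) -> float:
    """One period's worth of latency: buffer duration in milliseconds."""
    return 1000.0 * frames / sample_rate

# e.g. a JACK-style setting of 128 frames/period at 48 kHz:
one_way = period_latency_ms(128, 48000)
print(round(one_way, 2))  # 2.67 ms per period
```

Halving the period size halves this term but doubles the interrupt rate, which is the basic latency/CPU trade-off in low-latency audio configuration.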
PipeWire
Modern Linux audio/video server replacing both PulseAudio and JACK:
- Unified graph-based processing for consumer and professional audio
- Compatible with JACK, PulseAudio, and ALSA APIs simultaneously
- Dynamic sample rate and buffer size negotiation
- Handles audio, video, and MIDI in a single framework
Other Systems
- ALSA: Linux kernel-level audio interface (direct hardware access)
- PulseAudio: Linux consumer audio server (mixing, per-app volume)
- CoreAudio: macOS native audio framework
- WASAPI/ASIO: Windows audio interfaces (ASIO for low latency)
Psychoacoustics
Psychoacoustics studies how the auditory system perceives sound, directly informing codec design, audio processing, and interface design.
Loudness Perception
Perceived loudness is not linearly proportional to physical intensity:
- Follows approximately a power law (Stevens' law): L ~ I^0.3
- Measured in phons (equal-loudness) and sones (linear loudness scale)
- 1 sone = loudness of a 1 kHz tone at 40 dB SPL
- Doubling sones ~ +10 phon ~ +10 dB increase
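The sone/phon relationship in the list above can be written as a closed form, valid above roughly 40 phon: S = 2^((P - 40)/10). A sketch:

```python
import math

def phon_to_sone(phon: float) -> float:
    """Loudness in sones: doubles for every +10 phon above the 40-phon reference."""
    return 2.0 ** ((phon - 40.0) / 10.0)

def sone_to_phon(sone: float) -> float:
    """Inverse mapping back to loudness level in phons."""
    return 40.0 + 10.0 * math.log2(sone)

print(phon_to_sone(40))  # 1.0 -- reference: 1 kHz tone at 40 dB SPL
print(phon_to_sone(50))  # 2.0 -- perceived as twice as loud
```

This is why a "twice as loud" mix change needs roughly +10 dB, not +3 dB (which merely doubles power).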
Equal-Loudness Contours (ISO 226)
The ear's sensitivity varies with frequency. Equal-loudness contours (Fletcher-Munson curves) show SPL levels perceived as equally loud across frequencies:
- Most sensitive: 2-5 kHz (ear canal resonance)
- Less sensitive at low frequencies, especially at quiet levels
- At high SPL, response flattens
- Basis for A-weighting (dBA) used in noise measurements
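The A-weighting curve mentioned in the last bullet has a standard analytic form (IEC 61672), normalized to 0 dB at 1 kHz. A sketch:

```python
import math

def a_weighting_db(f: float) -> float:
    """A-weighting in dB at frequency f (IEC 61672 analytic approximation)."""
    f2 = f * f
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20.0 * math.log10(ra) + 2.00  # offset normalizes 1 kHz to ~0 dB

print(round(a_weighting_db(1000), 1))  # ~0.0 dB at the 1 kHz reference
print(round(a_weighting_db(100), 1))   # ~-19.1 dB: low frequencies discounted
```

The heavy low-frequency attenuation mirrors the equal-loudness contours at quiet levels, which is exactly why dBA is used for environmental noise measurement.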
Auditory Masking
A louder sound can render a quieter sound inaudible:
- Simultaneous masking: Signals at nearby frequencies mask each other (frequency domain)
- Temporal masking: Pre-masking (~5 ms before) and post-masking (~100-200 ms after)
- Energetic masking: Physical overlap in cochlear excitation
- Informational masking: Cognitive interference (e.g., competing speech)
Lossy codecs allocate bits based on masking thresholds -- masked components need not be encoded.
Critical Bands
The cochlea performs frequency analysis in overlapping critical bands (roughly two dozen bands spanning 20 Hz - 20 kHz on the Bark scale). Two tones within the same critical band interact (beat, mask); tones in different bands are perceived independently.
- Critical bandwidth approximated by the ERB (Equivalent Rectangular Bandwidth) scale
- Bark scale: 24 critical bands, roughly linear below 500 Hz, logarithmic above
- Mel scale: Perceptual pitch scale, approximately linear below 1 kHz, logarithmic above
- These scales motivate mel-frequency filterbanks used in speech/audio feature extraction
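The mel and ERB scales above have widely used analytic approximations (O'Shaughnessy's mel formula and the Glasberg-Moore ERB formula). A sketch:

```python
import math

def hz_to_mel(f: float) -> float:
    """Mel scale (O'Shaughnessy): ~linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def erb_bandwidth(f: float) -> float:
    """Equivalent Rectangular Bandwidth (Hz) of the auditory filter centered at f."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

print(round(hz_to_mel(1000)))      # ~1000 mel at 1 kHz
print(round(erb_bandwidth(1000)))  # ~133 Hz wide critical band at 1 kHz
```

Spacing filterbank centers uniformly in mel (rather than in Hz) is what gives mel-frequency features their perceptually motivated resolution: dense at low frequencies, coarse at high.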
Auditory Scene Analysis
The brain groups acoustic components into coherent auditory streams using:
- Common onset/offset and shared harmonicity (components of one fundamental)
- Proximity in frequency and time
- Continuity and smooth transitions
- Spatial cues (ITD, ILD)
This perceptual grouping is what computational auditory scene analysis (CASA) and source separation algorithms attempt to replicate.