
Source Coding

Source Coding Theorem (Shannon's First Theorem)

For a discrete memoryless source (DMS) with entropy $H(X)$, the minimum expected codeword length $L$ over uniquely decodable codes satisfies:

$$H(X) \leq L < H(X) + 1$$

More generally, for block coding of $n$ i.i.d. symbols, there exists a code with rate approaching $H(X)$ bits/symbol as $n \to \infty$. No lossless code can achieve rate below $H(X)$.

Asymptotic equipartition property (AEP): for i.i.d. $X_1, \ldots, X_n$:

$$-\frac{1}{n} \log p(X_1, \ldots, X_n) \xrightarrow{p} H(X)$$

The typical set $A_\epsilon^{(n)} = \{x^n : 2^{-n(H+\epsilon)} \leq p(x^n) \leq 2^{-n(H-\epsilon)}\}$ has probability approaching 1, size approximately $2^{nH}$, and each element has roughly equal probability. This justifies encoding only typical sequences.
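
A quick simulation makes the convergence concrete (a self-contained sketch; the Bernoulli parameter, sample size, and function name are illustrative choices):

```python
import math
import random

def empirical_aep(p=0.3, n=10_000, seed=0):
    """Draw one i.i.d. Bernoulli(p) sequence and return its normalized
    negative log2-probability alongside the entropy H(X)."""
    rng = random.Random(seed)
    xs = [1 if rng.random() < p else 0 for _ in range(n)]
    # log2-probability of the realized sequence
    log_p = sum(math.log2(p if x else 1 - p) for x in xs)
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return -log_p / n, h

rate, h = empirical_aep()  # the two values get close as n grows
```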

Kraft Inequality

For a $D$-ary prefix-free code with codeword lengths $\ell_1, \ldots, \ell_m$:

$$\sum_{i=1}^{m} D^{-\ell_i} \leq 1$$

Conversely, given lengths satisfying Kraft's inequality, a prefix-free code with those lengths exists. The McMillan inequality extends the same bound to all uniquely decodable codes. The ideal codeword length for symbol $x_i$ is $\ell_i^* = -\log_D p(x_i)$, which is generally non-integer, motivating arithmetic coding.
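
Both directions are easy to check numerically (a sketch; function names are illustrative). Rounding the ideal lengths up to integers, $\lceil -\log_D p_i \rceil$, always satisfies Kraft, which is how the $H(X) + 1$ upper bound is proved:

```python
import math

def kraft_sum(lengths, D=2):
    """Left-hand side of the Kraft inequality."""
    return sum(D ** -l for l in lengths)

def shannon_lengths(probs, D=2):
    """Integer lengths ceil(-log_D p); these always satisfy Kraft."""
    return [math.ceil(-math.log2(p) / math.log2(D)) for p in probs]

# {0, 10, 110, 111} is prefix-free; its lengths meet Kraft with equality.
full = kraft_sum([1, 2, 3, 3])
shan = shannon_lengths([0.4, 0.3, 0.2, 0.1])
```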

Huffman Coding

Optimal prefix-free code for known symbol probabilities. Construction via greedy bottom-up tree building:

  1. Create leaf node for each symbol with its probability
  2. Repeatedly merge the two lowest-probability nodes into a parent
  3. Assign 0/1 to left/right branches
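
The three steps above can be sketched with a heap (a compact illustration, not a production encoder):

```python
import heapq

def huffman_codes(probs):
    """Greedy bottom-up Huffman construction from {symbol: probability}.
    Returns {symbol: bitstring}."""
    # Heap entries: (probability, tiebreak counter, {symbol: partial code}).
    # The counter avoids comparing dicts when probabilities tie.
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # two lowest-probability nodes
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c0.items()}        # left branch
        merged.update({s: "1" + code for s, code in c1.items()})  # right branch
        heapq.heappush(heap, (p0 + p1, count, merged))
        count += 1
    return heap[0][2]

codes = huffman_codes({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
```

For these dyadic probabilities the code lengths (1, 2, 3, 3) match the ideal lengths $-\log_2 p_i$ exactly.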

Properties:

  • Achieves $H(X) \leq L_{\text{Huffman}} < H(X) + 1$
  • Optimal among all prefix-free codes for single-symbol coding
  • Suboptimal when probabilities are far from powers of 2 (e.g., $p = 0.99$ yields 1 bit/symbol vs. $H \approx 0.08$)
  • Adaptive Huffman (Vitter's algorithm): updates tree online without two-pass processing
  • Canonical Huffman: standardized assignment enabling compact code description (used in DEFLATE)

Arithmetic Coding

Encodes an entire message as a single number in $[0, 1)$, approaching the entropy limit even for heavily skewed distributions.

Encoding: maintain interval $[L, H)$, initially $[0, 1)$. For each symbol $s$ with CDF range $[c_s, c_s + p_s)$:

$$L_{\text{new}} = L + (H - L) \cdot c_s, \quad H_{\text{new}} = L + (H - L) \cdot (c_s + p_s)$$

Decoding: reverse the process, reading the fractional value and determining which symbol's interval contains it.
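
The interval updates and the rescaling decoder can be sketched in floating point (illustration only; as noted below, real coders use integer arithmetic with renormalization, since float precision collapses for long messages):

```python
def arith_encode(msg, probs):
    """Float-precision arithmetic encoder: shrink [lo, hi) per symbol."""
    cdf, c = {}, 0.0
    for s, p in probs.items():
        cdf[s] = (c, c + p)
        c += p
    lo, hi = 0.0, 1.0
    for s in msg:
        cs, ce = cdf[s]
        lo, hi = lo + (hi - lo) * cs, lo + (hi - lo) * ce
    return (lo + hi) / 2          # any number in [lo, hi) tags the message

def arith_decode(x, n, probs):
    """Recover n symbols by locating x's subinterval and rescaling."""
    cdf, c = {}, 0.0
    for s, p in probs.items():
        cdf[s] = (c, c + p)
        c += p
    out = []
    for _ in range(n):
        for s, (cs, ce) in cdf.items():
            if cs <= x < ce:
                out.append(s)
                x = (x - cs) / (ce - cs)   # zoom into the chosen interval
                break
    return "".join(out)

P = {"a": 0.8, "b": 0.2}
code = arith_encode("abba", P)
decoded = arith_decode(code, 4, P)
```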

Practical considerations:

  • Finite precision: use integer arithmetic with periodic renormalization (output leading bits when interval narrows sufficiently)
  • Carry propagation: handle via bit-stuffing or range coder variant
  • Achieves within 2 bits of optimal for any message length (vs. Huffman's 1 bit/symbol overhead)
  • Used in JPEG 2000, H.264 (CABAC), and modern ML-based compressors

Range coding: variant that operates on bytes instead of bits, avoiding carry issues and simplifying implementation. Essentially equivalent compression ratio.

Asymmetric Numeral Systems (ANS)

A modern entropy coding family by Jarek Duda (2009) combining arithmetic coding's compression ratio with Huffman's speed.

Basic ANS: the state $x$ encodes symbol $s$ via $C(x, s) = \lfloor x/f_s \rfloor \cdot M + \text{cumul}_s + (x \bmod f_s)$, where $f_s$ is the quantized frequency of $s$, $\text{cumul}_s = \sum_{s' < s} f_{s'}$, and $M = \sum_s f_s$.

Two practical variants:

  • tANS (tabled ANS): precomputed lookup tables, single table lookup per symbol, no multiplications. Extremely fast decoding
  • rANS (ranged ANS): operates like a range coder but with a single state integer. Supports streaming with renormalization
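
A toy rANS with an unbounded integer state makes the mechanics concrete (a sketch under simplifying assumptions: real implementations renormalize to keep the state in a machine word and use tables instead of the slot dictionary built here):

```python
def rans_encode(msg, freq, cumul, M):
    """Push symbols onto integer state x via
    C(x, s) = (x // f_s) * M + cumul_s + (x mod f_s)."""
    x = 0
    for s in reversed(msg):              # LIFO: encode backwards
        x = (x // freq[s]) * M + cumul[s] + x % freq[s]
    return x

def rans_decode(x, n, freq, cumul, M):
    """Pop n symbols; the slot r = x mod M identifies each symbol."""
    slot = {}
    for s in freq:
        for r in range(cumul[s], cumul[s] + freq[s]):
            slot[r] = s
    out = []
    for _ in range(n):
        r = x % M
        s = slot[r]
        out.append(s)
        x = freq[s] * (x // M) + r - cumul[s]   # inverse of C
    return "".join(out)

freq, cumul, M = {"a": 3, "b": 1}, {"a": 0, "b": 3}, 4
state = rans_encode("abba", freq, cumul, M)
decoded = rans_decode(state, 4, freq, cumul, M)
```

Note that probable symbols (large $f_s$) grow the state slowly, so they cost few bits, which is exactly the entropy-coding behavior.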

Properties:

  • Asymptotically optimal (approaches entropy)
  • LIFO encoding order (encode in reverse, decode forward) unless interleaved
  • Used in Zstandard (zstd), LZFSE (Apple), and JPEG XL

Lempel-Ziv Compression

Dictionary-based methods that achieve universality (no knowledge of source statistics needed).

LZ77 (Sliding Window)

Encode each position as $(d, \ell, c)$: distance back to the match, length of the match, next character. Uses a sliding window of recent history as the dictionary.

  • Foundation of DEFLATE, gzip, zlib
  • Parsing is greedy (longest match); optimal parsing is $O(n \log n)$
  • Universal: achieves entropy rate for stationary ergodic sources
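
A naive greedy parser illustrates the triple format (a sketch with quadratic match search; real implementations use hash chains or suffix structures, and the window/length limits here are arbitrary):

```python
def lz77_parse(data, window=4096, max_len=255):
    """Greedy LZ77: emit (distance, length, next_char) triples."""
    out, i = [], 0
    while i < len(data):
        best_d, best_l = 0, 0
        for j in range(max(0, i - window), i):
            l = 0
            # A match may run into the lookahead (overlapping copy), but
            # always leave one literal to serve as the triple's next_char.
            while (i + l < len(data) - 1 and l < max_len
                   and data[j + l] == data[i + l]):
                l += 1
            if l > best_l:
                best_d, best_l = i - j, l
        out.append((best_d, best_l, data[i + best_l]))
        i += best_l + 1
    return out

def lz77_unparse(triples):
    """Decode: copy `length` chars from `distance` back, then the literal."""
    buf = []
    for d, l, c in triples:
        for _ in range(l):
            buf.append(buf[-d])   # one char at a time handles overlap
        buf.append(c)
    return "".join(buf)

s = "abracadabra abracadabra"
triples = lz77_parse(s)
```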

LZ78 (Dictionary Building)

Build dictionary incrementally: each new phrase is the longest previously seen phrase plus one new character. Dictionary entries are numbered sequentially.

  • Simpler dictionary management than LZ77
  • LZW variant: dictionary initialized with single-character strings; output dictionary index only (no extra character). Used in GIF, early Unix compress
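
A compact LZW pair shows the index-only output and the incremental dictionary (a sketch: 256-entry byte-oriented initial dictionary, no variable-width code packing):

```python
def lzw_compress(text):
    """Emit dictionary indices only; each miss defines a new phrase."""
    dictionary = {chr(i): i for i in range(256)}
    w, out = "", []
    for ch in text:
        if w + ch in dictionary:
            w += ch                                  # extend current phrase
        else:
            out.append(dictionary[w])
            dictionary[w + ch] = len(dictionary)     # new entry: phrase + char
            w = ch
    if w:
        out.append(dictionary[w])
    return out

def lzw_decompress(codes):
    dictionary = {i: chr(i) for i in range(256)}
    w = dictionary[codes[0]]
    out = [w]
    for k in codes[1:]:
        # k can reference the entry being defined right now (the cScSc case):
        entry = dictionary.get(k, w + w[0])
        out.append(entry)
        dictionary[len(dictionary)] = w + entry[0]
        w = entry
    return "".join(out)

s = "TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(s)
```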

Universality

Both LZ77 and LZ78 are universal compressors: for any stationary ergodic source, the compression rate converges to the entropy rate $H(\mathcal{X})$. This is remarkable: no knowledge of the source distribution is needed.

Practical Compression Systems

| System | Algorithm | Typical Speed | Notes |
|---|---|---|---|
| gzip | DEFLATE (LZ77 + Huffman) | Moderate | Ubiquitous, RFC 1951 |
| Brotli | LZ77 + context-dependent Huffman + static dictionary | Slower encode | Web standard, ~20% better than gzip |
| Zstandard (zstd) | LZ77 + tANS + Huffman | Very fast | Facebook, dictionary training support |
| LZMA/xz | LZ77 + range coding | Slow encode, fast decode | High ratio, 7-Zip |
| lz4 | LZ77 (simple) | Extremely fast | Low ratio, real-time use |
| bzip2 | BWT + MTF + Huffman | Moderate | Good ratio, largely superseded |

Burrows-Wheeler Transform (BWT): reversible permutation that groups similar contexts together, making the output highly compressible by subsequent stages (move-to-front + entropy coding).
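
A naive BWT pair shows the forward transform and its inversion (a sketch: $O(n^2 \log n)$ via explicit sorted rotations; production code uses suffix arrays):

```python
def bwt(s, sentinel="\x00"):
    """Last column of the sorted rotation matrix of s + sentinel."""
    s += sentinel                       # unique terminator makes it invertible
    rots = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rots)

def inverse_bwt(last, sentinel="\x00"):
    """Rebuild the rotation table by repeated prepend-and-sort."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(c + row for c, row in zip(last, table))
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]                     # drop the sentinel
```

Sorting rotations places equal contexts next to each other, which is why the last column has long runs of repeated characters for text input.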

Rate-Distortion Theory

For lossy compression, Shannon's rate-distortion theory characterizes the minimum rate $R$ needed to represent a source within distortion $D$:

$$R(D) = \min_{p(\hat{x}|x):\, \mathbb{E}[d(X,\hat{X})] \leq D} I(X; \hat{X})$$

Key results:

  • Gaussian source with MSE distortion: $R(D) = \frac{1}{2}\log\frac{\sigma^2}{D}$ for $0 \leq D \leq \sigma^2$
  • Binary source with Hamming distortion: $R(D) = H_b(p) - H_b(D)$ for $0 \leq D \leq \min(p, 1-p)$
  • $R(D)$ is convex and non-increasing in $D$
  • Achieved by random codebooks of size $2^{nR}$ with joint typicality encoding

Blahut-Arimoto algorithm: iterative method to compute $R(D)$ (alternating minimization).
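
A bare-bones version of the alternating minimization for finite alphabets (a sketch: it is parameterized by the Lagrangian slope $\beta$, so each run traces one $(D, R)$ point on the curve; the iteration count is an arbitrary choice):

```python
import math

def blahut_arimoto(p, d, beta, iters=200):
    """One point (D, R) on the rate-distortion curve, R in bits.
    p[i]: source probabilities; d[i][j]: distortion matrix; beta > 0: slope."""
    n, m = len(p), len(d[0])
    q = [1.0 / m] * m                    # output marginal, start uniform
    for _ in range(iters):
        # Step 1: optimal test channel q(x_hat | x) for the current marginal.
        cond = []
        for i in range(n):
            row = [q[j] * math.exp(-beta * d[i][j]) for j in range(m)]
            z = sum(row)
            cond.append([r / z for r in row])
        # Step 2: re-estimate the output marginal from the channel.
        q = [sum(p[i] * cond[i][j] for i in range(n)) for j in range(m)]
    D = sum(p[i] * cond[i][j] * d[i][j] for i in range(n) for j in range(m))
    R = sum(p[i] * cond[i][j] * math.log2(cond[i][j] / q[j])
            for i in range(n) for j in range(m) if cond[i][j] > 0)
    return D, R

# Binary symmetric source with Hamming distortion: should land on R = 1 - H_b(D).
D, R = blahut_arimoto([0.5, 0.5], [[0, 1], [1, 0]], beta=2.0)
```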

The distortion-rate function $D(R)$ is the inverse, giving the minimum achievable distortion at rate $R$. Practical codecs (JPEG, HEVC, AV1) operate above the $R(D)$ bound but approach it with increasing block size and complexity.

Successive Refinement and Scalable Coding

A source is successively refinable if encoding at rate $R_1$ to distortion $D_1$, then sending additional rate $R_2$ to reach distortion $D_2$, achieves the same total rate $R_1 + R_2$ as encoding directly at $D_2$. Equitz and Cover showed this holds iff the optimal descriptions can be chosen to form a Markov chain; Gaussian sources with MSE distortion are successively refinable, but not all sources are.

Distributed Source Coding

Slepian-Wolf Theorem

For lossless distributed compression of correlated sources $(X, Y)$ encoded separately but decoded jointly, the achievable rate region is:

$$R_X \geq H(X|Y), \quad R_Y \geq H(Y|X), \quad R_X + R_Y \geq H(X, Y)$$

This is remarkable: separate encoding is as efficient as joint encoding. The corner points correspond to one encoder operating at full rate while the other exploits correlation.

Practical schemes: DISCUS (distributed source coding using syndromes), based on channel codes. Turbo/LDPC codes can approach Slepian-Wolf bounds.

Wyner-Ziv Theorem

Lossy analog of Slepian-Wolf: encode $X$ with side information $Y$ available only at the decoder. In the quadratic-Gaussian case there is no rate loss: the Wyner-Ziv rate-distortion function equals the conditional rate-distortion function with $Y$ available at the encoder too. For general sources and distortion measures, there can be a rate loss.

Quantization

Scalar quantization: partition the input range into intervals, map to representative points.

  • Uniform quantizer: intervals of equal width $\Delta$. In the high-rate regime, MSE distortion $\approx \frac{\Delta^2}{12}$
  • Lloyd-Max quantizer: iteratively optimizes partition boundaries and centroids to minimize MSE. Optimal for a given number of levels
  • Companding: apply a nonlinear transform before uniform quantization ($\mu$-law, A-law in telephony)
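
The Lloyd-Max alternation can be sketched on empirical data (this is Lloyd's algorithm run on samples rather than a density; the initialization and iteration count are illustrative):

```python
def lloyd(samples, levels, iters=50):
    """Alternate nearest-point partitioning and centroid updates (MSE)."""
    samples = sorted(samples)
    lo, hi = samples[0], samples[-1]
    # Spread initial codepoints evenly over the data range.
    points = [lo + (hi - lo) * (k + 0.5) / levels for k in range(levels)]
    for _ in range(iters):
        # Partition step: each sample goes to its nearest codepoint.
        cells = [[] for _ in range(levels)]
        for x in samples:
            k = min(range(levels), key=lambda j: (x - points[j]) ** 2)
            cells[k].append(x)
        # Centroid step: move each codepoint to its cell's mean.
        points = [sum(c) / len(c) if c else points[k]
                  for k, c in enumerate(cells)]
    return points

# Two well-separated clusters: the codepoints converge to the cluster means.
pts = lloyd([0.0, 0.1, 0.2, 10.0, 10.1, 10.2], levels=2)
```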

Vector quantization (VQ): quantize blocks of samples jointly. Achieves rate-distortion bound as dimension grows. LBG algorithm (generalized Lloyd) for codebook design. Computational cost is the main limitation, addressed by structured codebooks (lattice VQ, tree-structured VQ).

Transform Coding

Apply invertible transform, then quantize coefficients independently:

  1. KLT (Karhunen-Loève): optimal decorrelation, diagonalizes the covariance. Data-dependent
  2. DCT: approximates KLT for Markov sources. Foundation of JPEG, MPEG
  3. Wavelets: multi-resolution analysis. Used in JPEG 2000

Reverse water-filling: the optimal bit allocation across transform coefficients gives more bits to higher-variance components and zero bits to components whose variance falls below a common threshold (the water level).
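
The allocation can be computed by bisecting on the water level $\theta$ so that the active components' rates $\tfrac{1}{2}\log_2(\sigma_i^2/\theta)$ sum to the rate budget (a sketch; the variances and tolerance below are illustrative):

```python
import math

def reverse_waterfill(variances, total_rate, tol=1e-9):
    """Find theta with sum_i max(0, 0.5*log2(var_i/theta)) = total_rate.
    Components with variance <= theta get zero rate (distortion = variance)."""
    def rate(theta):
        return sum(max(0.0, 0.5 * math.log2(v / theta)) for v in variances)
    lo, hi = tol, max(variances)        # rate() is decreasing in theta
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if rate(mid) > total_rate:
            lo = mid                    # too many bits spent: raise the water
        else:
            hi = mid
    theta = (lo + hi) / 2
    bits = [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]
    dist = [min(v, theta) for v in variances]
    return theta, bits, dist

# Variances 4 and 1 at total rate 1: the level settles at theta = 1, so the
# low-variance component gets zero bits and all rate goes to the other.
theta, bits, dist = reverse_waterfill([4.0, 1.0], total_rate=1.0)
```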

Practical Lossy Compression Pipeline

Modern codecs (H.265/HEVC, AV1, VVC) follow:

  1. Prediction (intra/inter) to exploit spatial/temporal redundancy
  2. Transform (DCT variants) of residual
  3. Quantization (with rate-distortion optimization per block)
  4. Entropy coding (CABAC/ANS) of quantized coefficients

Neural compression: learned nonlinear transforms + learned entropy models (hyperprior, autoregressive). Competitive with or superior to traditional codecs (e.g., Ballé et al.'s framework, VTM-level performance at lower complexity in some regimes).