
Source Coding

Source Coding Theorem (Shannon's First Theorem)

For a discrete memoryless source (DMS) with entropy $H(X)$, the minimum expected codeword length $L$ over uniquely decodable codes satisfies:

$$H(X) \leq L < H(X) + 1$$

More generally, for block coding of $n$ i.i.d. symbols, there exists a code with rate approaching $H(X)$ bits/symbol as $n \to \infty$. No lossless code can achieve rate below $H(X)$.

Asymptotic equipartition property (AEP): for i.i.d. $X_1, \ldots, X_n$:

$$-\frac{1}{n} \log p(X_1, \ldots, X_n) \xrightarrow{p} H(X)$$

The typical set $A_\epsilon^{(n)} = \{x^n : 2^{-n(H+\epsilon)} \leq p(x^n) \leq 2^{-n(H-\epsilon)}\}$ has probability approaching 1, size approximately $2^{nH}$, and each element has roughly equal probability. This justifies encoding only typical sequences.
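
A quick simulation makes the convergence concrete (a self-contained sketch; the Bernoulli parameter, sample size, and function name are illustrative choices):

```python
import math
import random

def empirical_aep(p=0.3, n=10_000, seed=0):
    """Draw one i.i.d. Bernoulli(p) sequence and return its normalized
    negative log2-probability alongside the entropy H(X)."""
    rng = random.Random(seed)
    xs = [1 if rng.random() < p else 0 for _ in range(n)]
    # log2-probability of the realized sequence
    log_p = sum(math.log2(p if x else 1 - p) for x in xs)
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return -log_p / n, h

rate, h = empirical_aep()  # the two values get close as n grows
```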

Kraft Inequality

For a $D$-ary prefix-free code with codeword lengths $\ell_1, \ldots, \ell_m$:

$$\sum_{i=1}^{m} D^{-\ell_i} \leq 1$$

Conversely, given lengths satisfying Kraft's inequality, a prefix-free code with those lengths exists. The McMillan inequality extends the same bound to all uniquely decodable codes. The ideal codeword length for symbol $x_i$ is $\ell_i^* = -\log_D p(x_i)$, which is generally non-integer, motivating arithmetic coding.
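
Both directions are easy to check numerically (a sketch; function names are illustrative). Rounding the ideal lengths up to integers, $\lceil -\log_D p_i \rceil$, always satisfies Kraft, which is how the $H(X) + 1$ upper bound is proved:

```python
import math

def kraft_sum(lengths, D=2):
    """Left-hand side of the Kraft inequality."""
    return sum(D ** -l for l in lengths)

def shannon_lengths(probs, D=2):
    """Integer lengths ceil(-log_D p); these always satisfy Kraft."""
    return [math.ceil(-math.log2(p) / math.log2(D)) for p in probs]

# {0, 10, 110, 111} is prefix-free; its lengths meet Kraft with equality.
full = kraft_sum([1, 2, 3, 3])
shan = shannon_lengths([0.4, 0.3, 0.2, 0.1])
```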

Huffman Coding

Optimal prefix-free code for known symbol probabilities. Construction via greedy bottom-up tree building:

  1. Create leaf node for each symbol with its probability
  2. Repeatedly merge the two lowest-probability nodes into a parent
  3. Assign 0/1 to left/right branches
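
The three steps above can be sketched with a heap (a compact illustration, not a production encoder):

```python
import heapq

def huffman_codes(probs):
    """Greedy bottom-up Huffman construction from {symbol: probability}.
    Returns {symbol: bitstring}."""
    # Heap entries: (probability, tiebreak counter, {symbol: partial code}).
    # The counter avoids comparing dicts when probabilities tie.
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # two lowest-probability nodes
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c0.items()}        # left branch
        merged.update({s: "1" + code for s, code in c1.items()})  # right branch
        heapq.heappush(heap, (p0 + p1, count, merged))
        count += 1
    return heap[0][2]

codes = huffman_codes({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
```

For these dyadic probabilities the code lengths (1, 2, 3, 3) match the ideal lengths $-\log_2 p_i$ exactly.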

Properties:

  • Achieves $H(X) \leq L_{\text{Huffman}} < H(X) + 1$
  • Optimal among all prefix-free codes for single-symbol coding
  • Suboptimal when probabilities are far from powers of 2 (e.g., $p = 0.99$ yields 1 bit/symbol vs. $H \approx 0.08$)
  • Adaptive Huffman (Vitter's algorithm): updates tree online without two-pass processing
  • Canonical Huffman: standardized assignment enabling compact code description (used in DEFLATE)

Arithmetic Coding

Encodes an entire message as a single number in $[0, 1)$, approaching the entropy limit even for heavily skewed distributions.

Encoding: maintain interval $[L, H)$, initially $[0, 1)$. For each symbol $s$ with CDF range $[c_s, c_s + p_s)$:

$$L_{\text{new}} = L + (H - L) \cdot c_s, \quad H_{\text{new}} = L + (H - L) \cdot (c_s + p_s)$$

Decoding: reverse the process, reading the fractional value and determining which symbol's interval contains it.
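
The interval updates and the rescaling decoder can be sketched in floating point (illustration only; as noted below, real coders use integer arithmetic with renormalization, since float precision collapses for long messages):

```python
def arith_encode(msg, probs):
    """Float-precision arithmetic encoder: shrink [lo, hi) per symbol."""
    cdf, c = {}, 0.0
    for s, p in probs.items():
        cdf[s] = (c, c + p)
        c += p
    lo, hi = 0.0, 1.0
    for s in msg:
        cs, ce = cdf[s]
        lo, hi = lo + (hi - lo) * cs, lo + (hi - lo) * ce
    return (lo + hi) / 2          # any number in [lo, hi) tags the message

def arith_decode(x, n, probs):
    """Recover n symbols by locating x's subinterval and rescaling."""
    cdf, c = {}, 0.0
    for s, p in probs.items():
        cdf[s] = (c, c + p)
        c += p
    out = []
    for _ in range(n):
        for s, (cs, ce) in cdf.items():
            if cs <= x < ce:
                out.append(s)
                x = (x - cs) / (ce - cs)   # zoom into the chosen interval
                break
    return "".join(out)

P = {"a": 0.8, "b": 0.2}
code = arith_encode("abba", P)
decoded = arith_decode(code, 4, P)
```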

Practical considerations:

  • Finite precision: use integer arithmetic with periodic renormalization (output leading bits when interval narrows sufficiently)
  • Carry propagation: handle via bit-stuffing or range coder variant
  • Achieves within 2 bits of optimal for any message length (vs. Huffman's 1 bit/symbol overhead)
  • Used in JPEG 2000, H.264 (CABAC), and modern ML-based compressors

Range coding: variant that operates on bytes instead of bits, avoiding carry issues and simplifying implementation. Essentially equivalent compression ratio.

Asymmetric Numeral Systems (ANS)

A modern entropy coding family by Jarek Duda (2009) combining arithmetic coding's compression ratio with Huffman's speed.

Basic ANS: the state $x$ encodes symbol $s$ via $C(x, s) = \lfloor x/f_s \rfloor \cdot M + \text{cumul}_s + (x \bmod f_s)$, where $f_s$ is the quantized frequency of $s$, $\text{cumul}_s = \sum_{s' < s} f_{s'}$, and $M = \sum_s f_s$.

Two practical variants:

  • tANS (tabled ANS): precomputed lookup tables, single table lookup per symbol, no multiplications. Extremely fast decoding
  • rANS (ranged ANS): operates like a range coder but with a single state integer. Supports streaming with renormalization
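
A toy rANS with an unbounded integer state makes the mechanics concrete (a sketch under simplifying assumptions: real implementations renormalize to keep the state in a machine word and use tables instead of the slot dictionary built here):

```python
def rans_encode(msg, freq, cumul, M):
    """Push symbols onto integer state x via
    C(x, s) = (x // f_s) * M + cumul_s + (x mod f_s)."""
    x = 0
    for s in reversed(msg):              # LIFO: encode backwards
        x = (x // freq[s]) * M + cumul[s] + x % freq[s]
    return x

def rans_decode(x, n, freq, cumul, M):
    """Pop n symbols; the slot r = x mod M identifies each symbol."""
    slot = {}
    for s in freq:
        for r in range(cumul[s], cumul[s] + freq[s]):
            slot[r] = s
    out = []
    for _ in range(n):
        r = x % M
        s = slot[r]
        out.append(s)
        x = freq[s] * (x // M) + r - cumul[s]   # inverse of C
    return "".join(out)

freq, cumul, M = {"a": 3, "b": 1}, {"a": 0, "b": 3}, 4
state = rans_encode("abba", freq, cumul, M)
decoded = rans_decode(state, 4, freq, cumul, M)
```

Note that probable symbols (large $f_s$) grow the state slowly, so they cost few bits, which is exactly the entropy-coding behavior.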

Properties:

  • Asymptotically optimal (approaches entropy)
  • LIFO encoding order (encode in reverse, decode forward) unless interleaved
  • Used in Zstandard (zstd), LZFSE (Apple), and JPEG XL

Lempel-Ziv Compression

Dictionary-based methods that achieve universality (no knowledge of source statistics needed).

LZ77 (Sliding Window)

Encode each position as $(d, \ell, c)$: distance back to the match, length of the match, next character. Uses a sliding window of recent history as the dictionary.

  • Foundation of DEFLATE, gzip, zlib
  • Parsing is greedy (longest match); optimal parsing is $O(n \log n)$
  • Universal: achieves entropy rate for stationary ergodic sources
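
A naive greedy parser illustrates the triple format (a sketch with quadratic match search; real implementations use hash chains or suffix structures, and the window/length limits here are arbitrary):

```python
def lz77_parse(data, window=4096, max_len=255):
    """Greedy LZ77: emit (distance, length, next_char) triples."""
    out, i = [], 0
    while i < len(data):
        best_d, best_l = 0, 0
        for j in range(max(0, i - window), i):
            l = 0
            # A match may run into the lookahead (overlapping copy), but
            # always leave one literal to serve as the triple's next_char.
            while (i + l < len(data) - 1 and l < max_len
                   and data[j + l] == data[i + l]):
                l += 1
            if l > best_l:
                best_d, best_l = i - j, l
        out.append((best_d, best_l, data[i + best_l]))
        i += best_l + 1
    return out

def lz77_unparse(triples):
    """Decode: copy `length` chars from `distance` back, then the literal."""
    buf = []
    for d, l, c in triples:
        for _ in range(l):
            buf.append(buf[-d])   # one char at a time handles overlap
        buf.append(c)
    return "".join(buf)

s = "abracadabra abracadabra"
triples = lz77_parse(s)
```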

LZ78 (Dictionary Building)

Build dictionary incrementally: each new phrase is the longest previously seen phrase plus one new character. Dictionary entries are numbered sequentially.

  • Simpler dictionary management than LZ77
  • LZW variant: dictionary initialized with single-character strings; output dictionary index only (no extra character). Used in GIF, early Unix compress
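
A compact LZW pair shows the index-only output and the incremental dictionary (a sketch: 256-entry byte-oriented initial dictionary, no variable-width code packing):

```python
def lzw_compress(text):
    """Emit dictionary indices only; each miss defines a new phrase."""
    dictionary = {chr(i): i for i in range(256)}
    w, out = "", []
    for ch in text:
        if w + ch in dictionary:
            w += ch                                  # extend current phrase
        else:
            out.append(dictionary[w])
            dictionary[w + ch] = len(dictionary)     # new entry: phrase + char
            w = ch
    if w:
        out.append(dictionary[w])
    return out

def lzw_decompress(codes):
    dictionary = {i: chr(i) for i in range(256)}
    w = dictionary[codes[0]]
    out = [w]
    for k in codes[1:]:
        # k can reference the entry being defined right now (the cScSc case):
        entry = dictionary.get(k, w + w[0])
        out.append(entry)
        dictionary[len(dictionary)] = w + entry[0]
        w = entry
    return "".join(out)

s = "TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(s)
```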

Universality

Both LZ77 and LZ78 are universal compressors: for any stationary ergodic source, the compression rate converges to the entropy rate $H(\mathcal{X})$. This is remarkable: no knowledge of the source distribution is needed.

Practical Compression Systems

| System | Algorithm | Typical Speed | Notes |
|---|---|---|---|
| gzip | DEFLATE (LZ77 + Huffman) | Moderate | Ubiquitous, RFC 1951 |
| Brotli | LZ77 + context-dependent Huffman + static dictionary | Slower encode | Web standard, ~20% better than gzip |
| Zstandard (zstd) | LZ77 + tANS + Huffman | Very fast | Facebook, dictionary training support |
| LZMA/xz | LZ77 + range coding | Slow encode, fast decode | High ratio, 7-Zip |
| lz4 | LZ77 (simple) | Extremely fast | Low ratio, real-time use |
| bzip2 | BWT + MTF + Huffman | Moderate | Good ratio, largely superseded |

Burrows-Wheeler Transform (BWT): reversible permutation that groups similar contexts together, making the output highly compressible by subsequent stages (move-to-front + entropy coding).
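
A naive BWT pair shows the forward transform and its inversion (a sketch: $O(n^2 \log n)$ via explicit sorted rotations; production code uses suffix arrays):

```python
def bwt(s, sentinel="\x00"):
    """Last column of the sorted rotation matrix of s + sentinel."""
    s += sentinel                       # unique terminator makes it invertible
    rots = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rots)

def inverse_bwt(last, sentinel="\x00"):
    """Rebuild the rotation table by repeated prepend-and-sort."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(c + row for c, row in zip(last, table))
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]                     # drop the sentinel
```

Sorting rotations places equal contexts next to each other, which is why the last column has long runs of repeated characters for text input.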

Rate-Distortion Theory

For lossy compression, Shannon's rate-distortion theory characterizes the minimum rate $R$ needed to represent a source within distortion $D$:

$$R(D) = \min_{p(\hat{x}|x):\, \mathbb{E}[d(X,\hat{X})] \leq D} I(X; \hat{X})$$

Key results:

  • Gaussian source with MSE distortion: $R(D) = \frac{1}{2}\log\frac{\sigma^2}{D}$ for $0 \leq D \leq \sigma^2$
  • Binary source with Hamming distortion: $R(D) = H_b(p) - H_b(D)$ for $0 \leq D \leq \min(p, 1-p)$
  • $R(D)$ is convex and non-increasing in $D$
  • Achieved by random codebooks of size $2^{nR}$ with joint typicality encoding

Blahut-Arimoto algorithm: iterative method to compute $R(D)$ (alternating minimization).
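
A bare-bones version of the alternating minimization for finite alphabets (a sketch: it is parameterized by the Lagrangian slope $\beta$, so each run traces one $(D, R)$ point on the curve; the iteration count is an arbitrary choice):

```python
import math

def blahut_arimoto(p, d, beta, iters=200):
    """One point (D, R) on the rate-distortion curve, R in bits.
    p[i]: source probabilities; d[i][j]: distortion matrix; beta > 0: slope."""
    n, m = len(p), len(d[0])
    q = [1.0 / m] * m                    # output marginal, start uniform
    for _ in range(iters):
        # Step 1: optimal test channel q(x_hat | x) for the current marginal.
        cond = []
        for i in range(n):
            row = [q[j] * math.exp(-beta * d[i][j]) for j in range(m)]
            z = sum(row)
            cond.append([r / z for r in row])
        # Step 2: re-estimate the output marginal from the channel.
        q = [sum(p[i] * cond[i][j] for i in range(n)) for j in range(m)]
    D = sum(p[i] * cond[i][j] * d[i][j] for i in range(n) for j in range(m))
    R = sum(p[i] * cond[i][j] * math.log2(cond[i][j] / q[j])
            for i in range(n) for j in range(m) if cond[i][j] > 0)
    return D, R

# Binary symmetric source with Hamming distortion: should land on R = 1 - H_b(D).
D, R = blahut_arimoto([0.5, 0.5], [[0, 1], [1, 0]], beta=2.0)
```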

The distortion-rate function $D(R)$ is the inverse, giving the minimum achievable distortion at rate $R$. Practical codecs (JPEG, HEVC, AV1) operate above the $R(D)$ bound but approach it with increasing block size and complexity.

Successive Refinement and Scalable Coding

A source is successively refinable if encoding at rate $R_1$ to distortion $D_1$, then sending additional rate $R_2$ to reach distortion $D_2$, achieves the same total rate $R_1 + R_2$ as encoding directly at $D_2$. Equitz and Cover showed this holds iff the optimal descriptions can be chosen to form a Markov chain; Gaussian sources with MSE distortion are successively refinable, but not all sources are.

Distributed Source Coding

Slepian-Wolf Theorem

For lossless distributed compression of correlated sources $(X, Y)$ encoded separately but decoded jointly, the achievable rate region is:

$$R_X \geq H(X|Y), \quad R_Y \geq H(Y|X), \quad R_X + R_Y \geq H(X, Y)$$

This is remarkable: separate encoding is as efficient as joint encoding. The corner points correspond to one encoder operating at full rate while the other exploits correlation.

Practical schemes: DISCUS (distributed source coding using syndromes), based on channel codes. Turbo/LDPC codes can approach Slepian-Wolf bounds.

Wyner-Ziv Theorem

Lossy analog of Slepian-Wolf: encode $X$ with side information $Y$ available only at the decoder. In the quadratic-Gaussian case there is no rate loss: the Wyner-Ziv rate-distortion function equals the conditional rate-distortion function with $Y$ available at the encoder too. For general sources and distortion measures, there can be a rate loss.

Quantization

Scalar quantization: partition the input range into intervals, map to representative points.

  • Uniform quantizer: intervals of equal width $\Delta$. In the high-rate regime, MSE distortion $\approx \frac{\Delta^2}{12}$
  • Lloyd-Max quantizer: iteratively optimizes partition boundaries and centroids to minimize MSE. Optimal for a given number of levels
  • Companding: apply a nonlinear transform before uniform quantization ($\mu$-law, A-law in telephony)
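
The Lloyd-Max alternation can be sketched on empirical data (this is Lloyd's algorithm run on samples rather than a density; the initialization and iteration count are illustrative):

```python
def lloyd(samples, levels, iters=50):
    """Alternate nearest-point partitioning and centroid updates (MSE)."""
    samples = sorted(samples)
    lo, hi = samples[0], samples[-1]
    # Spread initial codepoints evenly over the data range.
    points = [lo + (hi - lo) * (k + 0.5) / levels for k in range(levels)]
    for _ in range(iters):
        # Partition step: each sample goes to its nearest codepoint.
        cells = [[] for _ in range(levels)]
        for x in samples:
            k = min(range(levels), key=lambda j: (x - points[j]) ** 2)
            cells[k].append(x)
        # Centroid step: move each codepoint to its cell's mean.
        points = [sum(c) / len(c) if c else points[k]
                  for k, c in enumerate(cells)]
    return points

# Two well-separated clusters: the codepoints converge to the cluster means.
pts = lloyd([0.0, 0.1, 0.2, 10.0, 10.1, 10.2], levels=2)
```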

Vector quantization (VQ): quantize blocks of samples jointly. Achieves rate-distortion bound as dimension grows. LBG algorithm (generalized Lloyd) for codebook design. Computational cost is the main limitation, addressed by structured codebooks (lattice VQ, tree-structured VQ).

Transform Coding

Apply invertible transform, then quantize coefficients independently:

  1. KLT (Karhunen-Loève): optimal decorrelation, diagonalizes the covariance. Data-dependent
  2. DCT: approximates KLT for Markov sources. Foundation of JPEG, MPEG
  3. Wavelets: multi-resolution analysis. Used in JPEG 2000

Reverse water-filling: the optimal bit allocation across transform coefficients gives more bits to higher-variance components and zero bits to components whose variance falls below a common threshold (the water level).
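
The allocation can be computed by bisecting on the water level $\theta$ so that the active components' rates $\tfrac{1}{2}\log_2(\sigma_i^2/\theta)$ sum to the rate budget (a sketch; the variances and tolerance below are illustrative):

```python
import math

def reverse_waterfill(variances, total_rate, tol=1e-9):
    """Find theta with sum_i max(0, 0.5*log2(var_i/theta)) = total_rate.
    Components with variance <= theta get zero rate (distortion = variance)."""
    def rate(theta):
        return sum(max(0.0, 0.5 * math.log2(v / theta)) for v in variances)
    lo, hi = tol, max(variances)        # rate() is decreasing in theta
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if rate(mid) > total_rate:
            lo = mid                    # too many bits spent: raise the water
        else:
            hi = mid
    theta = (lo + hi) / 2
    bits = [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]
    dist = [min(v, theta) for v in variances]
    return theta, bits, dist

# Variances 4 and 1 at total rate 1: the level settles at theta = 1, so the
# low-variance component gets zero bits and all rate goes to the other.
theta, bits, dist = reverse_waterfill([4.0, 1.0], total_rate=1.0)
```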

Practical Lossy Compression Pipeline

Modern codecs (H.265/HEVC, AV1, VVC) follow:

  1. Prediction (intra/inter) to exploit spatial/temporal redundancy
  2. Transform (DCT variants) of residual
  3. Quantization (with rate-distortion optimization per block)
  4. Entropy coding (CABAC/ANS) of quantized coefficients

Neural compression: learned nonlinear transforms + learned entropy models (hyperprior, autoregressive). Competitive with or superior to traditional codecs (e.g., Ballé et al.'s framework, VTM-level performance at lower complexity in some regimes).