Source Coding

Source Coding Theorem (Shannon's First Theorem)
For a discrete memoryless source (DMS) with entropy $H(X)$, the expected codeword length $\bar{L}$ of any uniquely decodable code satisfies $\bar{L} \ge H(X)$, and an optimal code achieves $H(X) \le \bar{L} < H(X) + 1$.
More generally, for block coding of $n$ i.i.d. symbols, there exists a code with rate approaching $H(X)$ bits/symbol as $n \to \infty$. No lossless code can achieve rate below $H(X)$.
Asymptotic equipartition property (AEP): for i.i.d. $X_1, \dots, X_n \sim p(x)$: $-\tfrac{1}{n} \log_2 p(X_1, \dots, X_n) \to H(X)$ in probability.
The typical set $A_\epsilon^{(n)}$ has probability approaching 1, size approximately $2^{nH(X)}$, and each element has probability roughly $2^{-nH(X)}$. This justifies encoding only typical sequences.
Kraft Inequality
For a $D$-ary prefix-free code with codeword lengths $\ell_1, \dots, \ell_M$: $\sum_{i=1}^{M} D^{-\ell_i} \le 1$.
Conversely, given lengths satisfying Kraft's inequality, a prefix-free code with those lengths exists. Extended to uniquely decodable codes by the McMillan inequality (same bound). The optimal codeword length for symbol $i$ is $\ell_i^* = -\log_D p_i$, which is generally non-integer, motivating arithmetic coding.
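A quick check of both facts, sketched in Python (the function name `kraft_sum` is illustrative):

```python
import math

def kraft_sum(lengths, D=2):
    """Sum of D^{-l_i}; a prefix-free code with these lengths exists iff <= 1."""
    return sum(D ** -l for l in lengths)

# Lengths of the code {0, 10, 110, 111} satisfy Kraft with equality.
assert kraft_sum([1, 2, 3, 3]) == 1.0

# Ideal (generally non-integer) lengths -log2 p_i for a dyadic source
# happen to be integers here, so a prefix-free code matches them exactly.
probs = [0.5, 0.25, 0.125, 0.125]
ideal = [-math.log2(p) for p in probs]  # [1.0, 2.0, 3.0, 3.0]
```

For non-dyadic probabilities the ideal lengths are fractional, which is exactly the gap arithmetic coding closes.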
Huffman Coding
Optimal prefix-free code for known symbol probabilities. Construction via greedy bottom-up tree building:
- Create leaf node for each symbol with its probability
- Repeatedly merge the two lowest-probability nodes into a parent
- Assign 0/1 to left/right branches
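The greedy merge can be sketched with a heap; this version tracks only the resulting codeword lengths (each merge adds one bit to every symbol in the merged subtrees), not the actual bit assignments:

```python
import heapq

def huffman_lengths(probs):
    """Greedy bottom-up merge; returns the codeword length of each symbol."""
    if len(probs) == 1:
        return [1]
    # Heap entries: (probability, tiebreak, symbol indices in this subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:  # every symbol in the merged subtree gains one bit
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths

# Dyadic probabilities: lengths match -log2 p exactly.
assert sorted(huffman_lengths([0.5, 0.25, 0.125, 0.125])) == [1, 2, 3, 3]
```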
Properties:
- Achieves $H(X) \le \bar{L} < H(X) + 1$
- Optimal among all prefix-free codes for single-symbol coding
- Suboptimal when probabilities are far from powers of 2 (e.g., a binary source with $p = (0.99, 0.01)$ yields 1 bit/symbol vs. $H \approx 0.08$ bits)
- Adaptive Huffman (Vitter's algorithm): updates tree online without two-pass processing
- Canonical Huffman: standardized assignment enabling compact code description (used in DEFLATE)
Arithmetic Coding
Encodes an entire message as a single number in $[0, 1)$, approaching entropy optimally even for skewed distributions.
Encoding: maintain an interval $[l, h)$, initially $[0, 1)$. For each symbol $s$ with CDF range $[F_{lo}(s), F_{hi}(s))$: $(l, h) \leftarrow \big(l + (h-l)F_{lo}(s),\; l + (h-l)F_{hi}(s)\big)$.
Decoding: reverse the process, reading the fractional value and determining which symbol's interval contains it.
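A toy float-precision encoder/decoder illustrating the interval updates (real implementations use the integer renormalization noted below; floats limit this sketch to short messages):

```python
def _cdf(probs):
    """Map each symbol to its half-open CDF range [F_lo, F_hi)."""
    out, c = {}, 0.0
    for s, p in probs.items():
        out[s] = (c, c + p)
        c += p
    return out

def arith_encode(msg, probs):
    cdf, lo, hi = _cdf(probs), 0.0, 1.0
    for s in msg:
        f_lo, f_hi = cdf[s]
        lo, hi = lo + (hi - lo) * f_lo, lo + (hi - lo) * f_hi
    return (lo + hi) / 2  # any number inside the final interval works

def arith_decode(x, n, probs):
    cdf, lo, hi = _cdf(probs), 0.0, 1.0
    out = []
    for _ in range(n):
        for s, (f_lo, f_hi) in cdf.items():
            a, b = lo + (hi - lo) * f_lo, lo + (hi - lo) * f_hi
            if a <= x < b:  # x falls in exactly one subinterval
                out.append(s)
                lo, hi = a, b
                break
    return "".join(out)

probs = {"a": 0.7, "b": 0.2, "c": 0.1}
assert arith_decode(arith_encode("abacab", probs), 6, probs) == "abacab"
```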
Practical considerations:
- Finite precision: use integer arithmetic with periodic renormalization (output leading bits when interval narrows sufficiently)
- Carry propagation: handle via bit-stuffing or range coder variant
- Achieves within 2 bits of optimal for any message length (vs. Huffman's up-to-1-bit-per-symbol overhead)
- Used in JPEG 2000, H.264 (CABAC), and modern ML-based compressors
Range coding: variant that operates on bytes instead of bits, avoiding carry issues and simplifying implementation. Essentially equivalent compression ratio.
Asymmetric Numeral Systems (ANS)
A modern entropy coding family by Jarek Duda (2009) combining arithmetic coding's compression ratio with Huffman's speed.
Basic ANS: state $x$ encodes symbol $s$ via $x' = \lfloor x / f_s \rfloor \cdot M + B_s + (x \bmod f_s)$, where $f_s$ is the quantized frequency of $s$, $B_s = \sum_{i<s} f_i$ is its cumulative frequency, and $M = \sum_s f_s$.
Two practical variants:
- tANS (tabled ANS): precomputed lookup tables, single table lookup per symbol, no multiplications. Extremely fast decoding
- rANS (ranged ANS): operates like a range coder but with a single state integer. Supports streaming with renormalization
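A minimal rANS-style sketch of the state update $x' = \lfloor x/f_s \rfloor M + B_s + (x \bmod f_s)$, using Python's arbitrary-precision integers in place of the streamed, renormalized state of a real implementation:

```python
def _cum(freq):
    """B[s] = total frequency of symbols before s (in dict order)."""
    B, c = {}, 0
    for s, f in freq.items():
        B[s] = c
        c += f
    return B

def rans_encode(msg, freq, M):
    B, x = _cum(freq), 1
    for s in reversed(msg):  # LIFO: encode in reverse, decode forward
        f = freq[s]
        x = (x // f) * M + B[s] + (x % f)
    return x

def rans_decode(x, n, freq, M):
    B, out = _cum(freq), []
    for _ in range(n):
        slot = x % M  # identifies the symbol's frequency slot
        for s, f in freq.items():
            if B[s] <= slot < B[s] + f:
                x = f * (x // M) + slot - B[s]
                out.append(s)
                break
    return "".join(out)

freq, M = {"a": 3, "b": 1}, 4  # quantized frequencies summing to M
assert rans_decode(rans_encode("abaab", freq, M), 5, freq, M) == "abaab"
```

Skewed frequencies make common symbols grow the state more slowly, which is where the compression comes from.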
Properties:
- Asymptotically optimal (approaches entropy)
- LIFO encoding order (encode in reverse, decode forward) unless interleaved
- Used in Zstandard (zstd), LZFSE (Apple), and JPEG XL
Lempel-Ziv Compression
Dictionary-based methods that achieve universality (no knowledge of source statistics needed).
LZ77 (Sliding Window)
Encode each position as a triple $(d, \ell, c)$: distance $d$ back to the match, length $\ell$ of the match, next character $c$. Uses a sliding window of recent history as the dictionary.
- Foundation of DEFLATE, gzip, zlib
- Parsing is typically greedy (longest match); optimal parsing can be cast as a shortest-path problem and solved by dynamic programming
- Universal: achieves entropy rate for stationary ergodic sources
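A naive greedy LZ77 sketch (brute-force match search for clarity; production coders use hash chains or suffix structures):

```python
def lz77_compress(data, window=4096):
    """Greedy longest-match parse into (distance, length, next_char) triples."""
    out, i = [], 0
    while i < len(data):
        best_d, best_l = 0, 0
        for j in range(max(0, i - window), i):
            l = 0
            # Matches may extend past i (overlapping copies, e.g. runs).
            while i + l < len(data) - 1 and data[j + l] == data[i + l]:
                l += 1
            if l > best_l:
                best_d, best_l = i - j, l
        out.append((best_d, best_l, data[i + best_l]))
        i += best_l + 1
    return out

def lz77_decompress(triples):
    buf = []
    for d, l, c in triples:
        for _ in range(l):
            buf.append(buf[-d])  # byte-by-byte copy handles overlap
        buf.append(c)
    return "".join(buf)

s = "abracadabra abracadabra"
assert lz77_decompress(lz77_compress(s)) == s
```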
LZ78 (Dictionary Building)
Build dictionary incrementally: each new phrase is the longest previously seen phrase plus one new character. Dictionary entries are numbered sequentially.
- Simpler dictionary management than LZ77
- LZW variant: dictionary initialized with all single-character strings; outputs dictionary indices only (no extra character). Used in GIF and early Unix compress
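A compact LZW sketch, including the classic corner case where the decoder sees an index it has not built yet:

```python
def lzw_encode(data):
    """Dictionary starts with single bytes; emits indices only."""
    dictionary = {chr(i): i for i in range(256)}
    w, out = "", []
    for c in data:
        if w + c in dictionary:
            w += c  # keep extending the current phrase
        else:
            out.append(dictionary[w])
            dictionary[w + c] = len(dictionary)  # new phrase = old + 1 char
            w = c
    if w:
        out.append(dictionary[w])
    return out

def lzw_decode(codes):
    dictionary = {i: chr(i) for i in range(256)}
    w = dictionary[codes[0]]
    out = [w]
    for k in codes[1:]:
        # k may reference the phrase being defined right now (KwKwK case)
        entry = dictionary[k] if k in dictionary else w + w[0]
        out.append(entry)
        dictionary[len(dictionary)] = w + entry[0]
        w = entry
    return "".join(out)

s = "TOBEORNOTTOBEORTOBEORNOT"
assert lzw_decode(lzw_encode(s)) == s
```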
Universality
Both LZ77 and LZ78 are universal compressors: for any stationary ergodic source, the compression rate converges to the entropy rate $\mathcal{H}$. This is remarkable---no knowledge of the source distribution is needed.
Practical Compression Systems
| System | Algorithm | Typical Speed | Notes |
|---|---|---|---|
| gzip | DEFLATE (LZ77 + Huffman) | Moderate | Ubiquitous, RFC 1951 |
| Brotli | LZ77 + context-dependent Huffman + static dictionary | Slower encode | Web standard, ~20% better than gzip |
| Zstandard (zstd) | LZ77 + tANS + Huffman | Very fast | Facebook, dictionary training support |
| LZMA/xz | LZ77 + range coding | Slow encode, fast decode | High ratio, 7-Zip |
| lz4 | LZ77 (simple) | Extremely fast | Low ratio, real-time use |
| Bzip2 | BWT + MTF + Huffman | Moderate | Good ratio, largely superseded |
Burrows-Wheeler Transform (BWT): reversible permutation that groups similar contexts together, making the output highly compressible by subsequent stages (move-to-front + entropy coding).
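A teaching-sized BWT with a sentinel character, plus the quadratic-time inverse (real implementations build the transform from a suffix array and invert it in linear time):

```python
def bwt(s, sentinel="\0"):
    """Last column of the sorted rotations of s + sentinel."""
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def ibwt(t, sentinel="\0"):
    """Invert by repeatedly prepending t and re-sorting (O(n^2 log n))."""
    table = [""] * len(t)
    for _ in range(len(t)):
        table = sorted(t[i] + table[i] for i in range(len(t)))
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]

# Similar contexts cluster: repeated characters end up adjacent.
assert bwt("banana") == "annb\0aa"
assert ibwt(bwt("banana")) == "banana"
```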
Rate-Distortion Theory
For lossy compression, Shannon's rate-distortion theory characterizes the minimum rate $R$ to represent a source within distortion $D$: $R(D) = \min_{p(\hat{x} \mid x):\ \mathbb{E}[d(X, \hat{X})] \le D} I(X; \hat{X})$.
Key results:
- Gaussian source with MSE distortion: $R(D) = \tfrac{1}{2} \log_2 \tfrac{\sigma^2}{D}$ for $0 \le D \le \sigma^2$
- Binary source $\mathrm{Bern}(p)$ with Hamming distortion: $R(D) = H_b(p) - H_b(D)$ for $0 \le D \le \min(p, 1-p)$
- $R(D)$ is convex and non-increasing in $D$
- Achieved by random codebooks of size $2^{nR}$ with joint typicality encoding
Blahut-Arimoto algorithm: iterative method to compute $R(D)$ (alternating minimization).
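A small Blahut-Arimoto sketch, parameterized by the Lagrange multiplier $\beta$ (the slope of the $R(D)$ curve); the test case is the binary symmetric source, where the result can be checked against $R(D) = 1 - H_b(D)$:

```python
import math

def blahut_arimoto_RD(p, d, beta, iters=200):
    """One point on R(D) for source pmf p and distortion matrix d[i][j].
    Alternates the optimal conditional W(x_hat|x) and output marginal q."""
    n, m = len(p), len(d[0])
    q = [1.0 / m] * m
    for _ in range(iters):
        W = []
        for i in range(n):
            # W(j|i) proportional to q(j) * exp(-beta * d(i, j))
            row = [q[j] * math.exp(-beta * d[i][j]) for j in range(m)]
            z = sum(row)
            W.append([w / z for w in row])
        q = [sum(p[i] * W[i][j] for i in range(n)) for j in range(m)]
    R = sum(p[i] * W[i][j] * math.log2(W[i][j] / q[j])
            for i in range(n) for j in range(m) if W[i][j] > 0)
    D = sum(p[i] * W[i][j] * d[i][j] for i in range(n) for j in range(m))
    return R, D

# Uniform binary source, Hamming distortion.
R, D = blahut_arimoto_RD([0.5, 0.5], [[0, 1], [1, 0]], beta=2.0)
```

Sweeping $\beta$ from 0 to large values traces the whole curve from $(R, D) = (0, D_{\max})$ toward $(H, 0)$.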
The distortion-rate function $D(R)$ is the inverse, giving the minimum achievable distortion at rate $R$. Practical codecs (JPEG, HEVC, AV1) operate above $R(D)$ but approach it with increasing block size and complexity.
Successive Refinement and Scalable Coding
A source is successively refinable if encoding at rate $R_1$ to distortion $D_1$, then sending additional rate $R_2 - R_1$ to reach distortion $D_2 < D_1$, achieves the same total rate $R_2 = R(D_2)$ as encoding directly at $D_2$. Gaussian sources with MSE are successively refinable; binary sources with Hamming distortion generally are not.
Distributed Source Coding
Slepian-Wolf Theorem
For lossless distributed compression of correlated sources $(X_1, X_2)$ encoded separately but decoded jointly, the achievable rate region is: $R_1 \ge H(X_1 \mid X_2)$, $R_2 \ge H(X_2 \mid X_1)$, $R_1 + R_2 \ge H(X_1, X_2)$.
This is remarkable: separate encoding is as efficient as joint encoding. The corner points correspond to one encoder operating at full rate while the other exploits correlation.
Practical schemes: DISCUS (distributed source coding using syndromes), based on channel codes. Turbo/LDPC codes can approach Slepian-Wolf bounds.
Wyner-Ziv Theorem
Lossy analog of Slepian-Wolf: for encoding $X$ with side information $Y$ available only at the decoder, the rate-distortion function equals what it would be if $Y$ were available at the encoder too in the quadratic Gaussian case (Gaussian sources with MSE distortion). For general sources, there can be a rate loss.
Quantization
Scalar quantization: partition the input range into intervals, map to representative points.
- Uniform quantizer: intervals of equal width $\Delta$. In the high-rate regime, distortion $D \approx \Delta^2 / 12$
- Lloyd-Max quantizer: iteratively optimizes partition boundaries and centroids to minimize MSE. Optimal for a given number of levels
- Companding: apply a nonlinear transform before uniform quantization ($\mu$-law, A-law in telephony)
Vector quantization (VQ): quantize blocks of samples jointly. Achieves rate-distortion bound as dimension grows. LBG algorithm (generalized Lloyd) for codebook design. Computational cost is the main limitation, addressed by structured codebooks (lattice VQ, tree-structured VQ).
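The Lloyd iteration underlying both Lloyd-Max and LBG, in a scalar MSE sketch: alternate nearest-centroid partitioning with centroid (conditional-mean) updates:

```python
import random

def lloyd_max(samples, levels, iters=50):
    """Scalar Lloyd-Max: alternate nearest-centroid cells and centroid updates."""
    samples = sorted(samples)
    # Initialize the codebook at evenly spaced sample quantiles.
    step = len(samples) // levels
    codebook = [samples[step // 2 + k * step] for k in range(levels)]
    for _ in range(iters):
        cells = [[] for _ in range(levels)]
        for x in samples:
            k = min(range(levels), key=lambda j: (x - codebook[j]) ** 2)
            cells[k].append(x)
        # Centroid (mean) of each cell minimizes MSE; keep old point if empty.
        codebook = [sum(c) / len(c) if c else codebook[k]
                    for k, c in enumerate(cells)]
    return codebook

random.seed(0)
samples = [random.random() for _ in range(2000)]
codebook = lloyd_max(samples, levels=4)
```

The LBG algorithm is the same iteration on vectors, with squared Euclidean distance and vector means.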
Transform Coding
Apply invertible transform, then quantize coefficients independently:
- KLT (Karhunen-Loeve): optimal decorrelation, diagonalizes covariance. Data-dependent
- DCT: approximates KLT for Markov sources. Foundation of JPEG, MPEG
- Wavelets: multi-resolution analysis. Used in JPEG 2000
Reverse water-filling: optimal bit allocation across transform coefficients---allocate more bits to higher-variance components, and zero bits to components whose variance falls below the water level.
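A sketch of the allocation for independent Gaussian components: bisect on the water level $\theta$ so that the active components' rates $\tfrac{1}{2}\log_2(\sigma_i^2/\theta)$ sum to the rate budget (the function name and interface are illustrative):

```python
import math

def reverse_waterfill(variances, R_total, tol=1e-9):
    """Per-component rates r_i = max(0, 0.5*log2(var_i/theta)) summing to
    R_total; components below the water level theta get zero rate."""
    def rates(theta):
        return [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]

    lo, hi = tol, max(variances)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sum(rates(mid)) > R_total:
            lo = mid  # water level too low: spending too many bits
        else:
            hi = mid
    theta = (lo + hi) / 2
    # Distortion per component: theta if coded, its full variance if not.
    return rates(theta), [min(v, theta) for v in variances]

r, d = reverse_waterfill([4.0, 1.0, 0.25], R_total=2.0)
```

Here the lowest-variance component falls below the water level, receives zero rate, and incurs its full variance as distortion.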
Practical Lossy Compression Pipeline
Modern codecs (H.265/HEVC, AV1, VVC) follow:
- Prediction (intra/inter) to exploit spatial/temporal redundancy
- Transform (DCT variants) of residual
- Quantization (with rate-distortion optimization per block)
- Entropy coding (CABAC/ANS) of quantized coefficients
Neural compression: learned nonlinear transforms + learned entropy models (hyperprior, autoregressive). Competitive with or superior to traditional codecs (e.g., Ballé et al.'s framework, VTM-level performance at lower complexity in some regimes).