
Kolmogorov Complexity

Algorithmic Information Theory

Kolmogorov complexity (algorithmic complexity, descriptive complexity) measures the information content of an individual string, independent of any probability distribution. While Shannon entropy characterizes the average information of a random source, Kolmogorov complexity captures the intrinsic complexity of a single object.

Definition: The Kolmogorov complexity of a string $x$ with respect to a universal Turing machine $U$ is:

$$K_U(x) = \min\{|p| : U(p) = x\}$$

where $|p|$ is the length of program $p$ in bits. Intuitively, $K(x)$ is the length of the shortest program that produces $x$.

Conditional complexity: $K(x \mid y) = \min\{|p| : U(p, y) = x\}$ --- the shortest program that produces $x$ given $y$ as auxiliary input.

Joint complexity: $K(x, y) = \min\{|p| : U(p) = \langle x, y \rangle\}$, where $\langle \cdot, \cdot \rangle$ is an effective pairing function.
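
Although $K$ is uncomputable (see below), any lossless compressor yields an upper bound on it: the compressed data together with a fixed decompressor is a program that prints the string. A minimal sketch, with zlib standing in for the shortest description (the function name is ours):

```python
import os
import zlib

def k_upper_bound(s: bytes) -> int:
    """Upper bound on K(s) in bits: the compressed size, with the fixed
    decompressor absorbed into the O(1). A real compressor can only
    over-estimate K, never under-estimate it."""
    return 8 * len(zlib.compress(s, 9))

structured = b"ab" * 500           # 1000 bytes, highly regular
random_ish = os.urandom(1000)      # 1000 bytes, incompressible w.h.p.

print(k_upper_bound(structured))   # a few hundred bits at most
print(k_upper_bound(random_ish))   # about 8000 bits or slightly more
```

The gap between the two bounds is exactly the intuition behind $K$: regular strings have short descriptions, random-looking ones do not.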

Invariance Theorem

Theorem (Solomonoff, Kolmogorov, Chaitin): For any two universal Turing machines $U_1, U_2$:

$$|K_{U_1}(x) - K_{U_2}(x)| \leq c_{U_1, U_2}$$

where $c_{U_1, U_2}$ is a constant depending only on the machines, not on $x$.

Proof sketch: $U_1$ can simulate $U_2$ by prepending a fixed interpreter program. The constant equals the length of this interpreter.

Consequence: $K(x)$ is well-defined up to an additive constant, so we fix a reference universal machine and suppress the subscript. All asymptotic statements about $K$ are machine-independent.

Prefix-Free Kolmogorov Complexity

To connect with coding theory and probability, we restrict to prefix-free programs (no program is a prefix of another):

$$K(x) = \min\{|p| : U(p) = x\}, \quad \text{dom}(U) \text{ prefix-free}$$

This variant satisfies Kraft's inequality: $\sum_{x} 2^{-K(x)} \leq 1$, enabling interpretation as a probability distribution (algorithmic probability / universal prior).

The prefix-free complexity $K(x)$ and plain complexity $C(x)$ are related:

$$C(x) \leq K(x) \leq C(x) + 2\log C(x) + O(1)$$
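
The prefix-free restriction is exactly what makes Kraft's inequality available. A quick check of the inequality on a small hand-built code (helper names are ours):

```python
def kraft_sum(codewords):
    """Sum of 2^{-|w|} over the codewords. Kraft's inequality says this
    is <= 1 for any prefix-free code (with equality iff it is complete)."""
    return sum(2.0 ** -len(w) for w in codewords)

def is_prefix_free(codewords):
    """True iff no codeword is a proper prefix of another."""
    return not any(a != b and b.startswith(a)
                   for a in codewords for b in codewords)

code = ["0", "10", "110", "111"]   # a complete prefix-free code
assert is_prefix_free(code)
print(kraft_sum(code))             # -> 1.0

assert not is_prefix_free(["0", "01"])   # "0" is a prefix of "01"
```

Because the shortest programs of a prefix-free machine form such a code, the weights $2^{-K(x)}$ themselves sum to at most 1.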

Basic Properties

  • Upper bound: $K(x) \leq |x| + O(\log |x|)$ (the program can simply contain $x$ as a literal, preceded by a self-delimiting encoding of its length)
  • Subadditivity: $K(x, y) \leq K(x) + K(y) + O(\log(K(x) + K(y)))$
  • Symmetry of information: $K(x, y) = K(x) + K(y \mid x^*) + O(1)$, where $x^*$ is the shortest program for $x$
  • Uncomputability: $K(x)$ is not computable, but it is upper semi-computable: we can enumerate ever shorter programs, converging to $K(x)$ from above
  • Non-monotonicity: $K$ is not monotone under substrings; a substring of a random string can be compressible
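
The literal upper bound can be made concrete: prepend a self-delimiting encoding of the length (each length bit doubled, terminated by the marker 01), giving a prefix-free "program" of length $|x| + 2\log|x| + O(1)$. A toy sketch, with hypothetical helper names:

```python
def literal_code(x: str) -> str:
    """Self-delimiting 'literal program' for bitstring x, witnessing
    K(x) <= |x| + 2 log |x| + O(1): each bit of |x| in binary is
    doubled, the pair 01 ends the header, then x follows verbatim."""
    header = "".join(b + b for b in format(len(x), "b")) + "01"
    return header + x

def decode(p: str):
    """Invert literal_code; returns (x, leftover). No end marker for x
    is needed --- that is the point of a self-delimiting code."""
    i, nbits = 0, ""
    while p[i:i + 2] != "01":
        nbits += p[i]          # first bit of each doubled pair
        i += 2
    n = int(nbits, 2)
    return p[i + 2 : i + 2 + n], p[i + 2 + n :]

x = "1011001110"
p = literal_code(x)
print(len(p), len(x))          # 20 10: overhead is 2*4 + 2 bits here
assert decode(p + "111") == (x, "111")
```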

Incompressibility Method

A string $x$ of length $n$ is $c$-incompressible if $K(x) \geq n - c$.

Counting argument: there are at most $2^n - 1$ programs shorter than $n$ bits, so at most $2^n - 1$ strings have $K(x) < n$. Hence at least one string of each length is incompressible, and the fraction of length-$n$ strings that are $c$-incompressible is at least $1 - 2^{-c}$.
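
The counting argument is pure arithmetic and easy to check directly:

```python
n, c = 20, 8

# Programs shorter than n bits: 2^0 + ... + 2^(n-1) = 2^n - 1 of them,
# so at most 2^n - 1 of the 2^n strings of length n can have K(x) < n.
programs_shorter = sum(2 ** k for k in range(n))
assert programs_shorter == 2 ** n - 1

# Strings with K(x) < n - c: at most 2^(n-c) - 1, i.e. strictly fewer
# than a 2^-c fraction of all length-n strings are c-compressible.
frac_compressible = (2 ** (n - c) - 1) / 2 ** n
print(frac_compressible)       # just under 2**-8 = 0.00390625
assert frac_compressible < 2 ** -c
```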

The incompressibility method proves combinatorial results by choosing a random (incompressible) object and deriving contradictions. Applications:

  • Lower bounds on sorting: any comparison sort of $n$ elements requires $\Omega(n \log n)$ comparisons. Take an incompressible permutation; if the sort used fewer comparisons, the sequence of comparison outcomes would encode a shorter description of the permutation
  • Average-case analysis: the average number of comparisons in Quicksort is $\Theta(n \log n)$
  • Graph theory: existence of graphs with specific properties (high girth and high chromatic number)
  • Communication complexity: lower bounds on communication protocols
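
The Quicksort claim is easy to check empirically: count comparisons over random permutations and confirm the average lands between the information-theoretic bound $\log_2 n!$ and roughly $2 n \ln n$. A sketch (function names are ours):

```python
import math
import random

def quicksort_comparisons(a):
    """Comparisons made by randomized quicksort on a list of distinct
    keys: n - 1 per partition step (each non-pivot vs. the pivot)."""
    if len(a) <= 1:
        return 0
    pivot = a[random.randrange(len(a))]
    less = [x for x in a if x < pivot]
    greater = [x for x in a if x > pivot]
    return (len(a) - 1) + quicksort_comparisons(less) + quicksort_comparisons(greater)

n, trials = 200, 50
random.seed(1)
avg = sum(quicksort_comparisons(random.sample(range(10**6), n))
          for _ in range(trials)) / trials
lower = math.log2(math.factorial(n))   # any comparison sort needs this much
print(lower, avg, 2 * n * math.log(n))
```

The average sits well inside the $\Theta(n \log n)$ window, matching the incompressibility analysis.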

Algorithmic Randomness

Martin-Löf Randomness

An infinite binary sequence $\omega$ is Martin-Löf random if it passes all effective statistical tests. Formally, it avoids all constructive measure-zero sets.

A Martin-Löf test is a uniformly computably enumerable sequence of open sets $\{U_m\}_{m=1}^\infty$ with $\mu(U_m) \leq 2^{-m}$. A sequence $\omega$ is ML-random iff $\omega \notin \bigcap_m U_m$ for every ML test.

Equivalent characterizations:

  • Levin-Schnorr theorem: $\omega$ is ML-random iff $K(\omega_{1:n}) \geq n - O(1)$ for all $n$ (prefix-free complexity stays near maximum)
  • Unpredictability: no computable martingale succeeds on $\omega$ (no effective gambling strategy betting on its bits wins unbounded wealth)
  • Typicality: $\omega$ satisfies all effectively testable properties that hold with probability 1 (law of large numbers, law of the iterated logarithm, etc.)
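
The martingale characterization can be illustrated on a clearly non-random sequence: a computable strategy that stakes half its capital on the predicted next bit grows its wealth exponentially on $0101\ldots$, certifying non-randomness. A toy sketch (names are ours):

```python
def martingale_on(seq, predict):
    """Gamble along seq: start with capital 1; at each step stake half
    the current capital on the predicted next bit, double-or-nothing on
    the stake. Fairness: d(w0) + d(w1) = 2 d(w), so this is a martingale."""
    capital = 1.0
    for i, bit in enumerate(seq):
        stake = capital / 2
        if predict(seq[:i]) == bit:
            capital += stake       # won the stake
        else:
            capital -= stake       # lost the stake
    return capital

# 010101... is far from random: the computable strategy "predict the
# flip of the previous bit" multiplies its capital by 1.5 every step.
seq = [0, 1] * 20
predict = lambda prefix: 1 - prefix[-1] if prefix else 0
print(martingale_on(seq, predict))   # 1.5 ** 40, about 1.1e7
```

On an ML-random sequence, by definition, every such computable strategy keeps its capital bounded.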

Hierarchy of Randomness Notions

From weakest to strongest:

  1. Schnorr randomness: pass all computable tests (test measures must be computably convergent)
  2. Computable randomness: no computable martingale succeeds
  3. Martin-Löf randomness: pass all c.e. tests
  4. 2-randomness: pass all $\Sigma^0_2$ tests (ML-random relative to the halting problem)

Each level strictly contains the next. ML-randomness is the most widely accepted formalization of algorithmic randomness.

Chaitin's Omega

The halting probability of a universal prefix-free machine $U$:

$$\Omega_U = \sum_{p \,:\, U(p) \text{ halts}} 2^{-|p|}$$

Properties:

  • $0 < \Omega < 1$ (the sum converges by Kraft's inequality)
  • Martin-Löf random: the binary expansion of $\Omega$ is an ML-random sequence
  • Computably enumerable from below: we can compute increasingly accurate lower bounds
  • Not computable: knowing the first $n$ bits of $\Omega$ solves the halting problem for all programs of length $\leq n$
  • Maximally unknowable: no consistent formal system of complexity $k$ can prove more than $k + O(1)$ bits of $\Omega$

$\Omega$ concentrates the difficulty of the halting problem: its first $n$ bits encode the answers to all halting questions for short programs. It is sometimes called the "number of wisdom."
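
"Enumerable from below" can be made concrete with a toy machine in place of a true universal one (for which the bounds would converge with no computable rate). Here the prefix-free domain is $\{1^k 0\}$ and program $1^k 0$ "halts" after $2^k$ steps; dovetailing programs against step budgets yields a non-decreasing sequence of lower bounds:

```python
def toy_machine(p: str, steps: int) -> bool:
    """Toy stand-in for U with prefix-free domain {1^k 0 : k >= 0}:
    program 1^k 0 'halts' once given 2^k steps. (Not universal; it only
    makes the dovetailing pattern concrete.)"""
    if not p.endswith("0") or "0" in p[:-1]:
        return False                    # not in the domain
    return steps >= 2 ** (len(p) - 1)

def omega_lower_bounds(stages: int):
    """Dovetail programs against step budgets; the partial sums of
    2^{-|p|} over halted programs only ever increase."""
    bounds = []
    for s in range(1, stages + 1):
        halted = ["1" * k + "0" for k in range(s)
                  if toy_machine("1" * k + "0", s)]
        bounds.append(sum(2.0 ** -len(p) for p in halted))
    return bounds

print(omega_lower_bounds(8))   # non-decreasing; this toy Omega equals 1
```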

Connections to Gödel's Theorems

Kolmogorov complexity provides an elegant information-theoretic perspective on incompleteness:

Chaitin's incompleteness theorem: for any consistent formal system $F$ of complexity $K(F) = k$:

$$\text{there exists } c \text{ such that } F \text{ cannot prove } K(x) > k + c \text{ for any } x$$

Proof sketch: if $F$ could prove some statement of the form $K(x) > k + c$, we could enumerate the theorems of $F$, find the first proof of such a statement, and output the corresponding $x$ --- using only $k + O(1)$ bits (to specify $F$, the search procedure, and $c$), contradicting $K(x) > k + c$ for large enough $c$.

This mirrors Gödel's first incompleteness theorem: any sufficiently powerful formal system has true but unprovable statements. The information-theoretic version reveals that the unprovable statements include precisely those asserting high complexity (randomness) of specific strings.

Algorithmic Probability and Solomonoff Induction

Algorithmic probability (Solomonoff, 1964): the probability that $U$ outputs $x$ when fed uniformly random bits:

$$m(x) = \sum_{p \,:\, U(p) = x} 2^{-|p|}$$

This is a universal semimeasure: it dominates every computable semimeasure up to a multiplicative constant. By the coding theorem, $-\log m(x) = K(x) + O(1)$.

Solomonoff's theory of induction: predict the next bit of a sequence by weighting all computable hypotheses by their algorithmic probability. This is the Bayesian-optimal predictor relative to the universal prior. It converges to the true distribution (if computable) at an optimal rate.

Connection to Occam's razor: simpler (shorter program) hypotheses receive higher prior weight. This provides a formal justification for preferring simpler explanations.

Minimum Description Length (MDL) Principle

MDL (Rissanen, 1978) is the practical operationalization of Kolmogorov complexity for statistical inference. Select the model $M$ and parameters $\theta$ minimizing:

$$L(M) + L(\theta \mid M) + L(\text{data} \mid M, \theta)$$

where $L(\cdot)$ denotes description length (in bits).

Two-part MDL (crude): minimize model code length + data-given-model code length. Requires discretization of parameters.
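
A minimal sketch of crude two-part MDL on the simplest possible model class: a Bernoulli coin versus a literal code for a bitstring. The parameter cost $\tfrac{1}{2}\log_2 n$ reflects the optimal discretization of one parameter; names and the example are ours, not from any library:

```python
import math

def h(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mdl_bernoulli(bits):
    """Crude two-part MDL: compare the literal code (n bits) against a
    Bernoulli model ((1/2) log2 n bits for the discretized parameter,
    plus n * H(p_hat) bits for the data given it). Returns the cheaper
    description's name and its length in bits."""
    n = len(bits)
    p_hat = sum(bits) / n
    literal = n
    bernoulli = 0.5 * math.log2(n) + n * h(p_hat)
    return ("bernoulli", bernoulli) if bernoulli < literal else ("literal", literal)

biased = [1] * 100 + [0] * 900      # p_hat = 0.1: strongly compressible
balanced = [0, 1] * 500             # p_hat = 0.5: the model cannot help

print(mdl_bernoulli(biased))        # ('bernoulli', ...): wins by far
print(mdl_bernoulli(balanced))      # ('literal', 1000)
```

The parameter cost is exactly what keeps the richer model from winning on data it does not actually compress.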

Refined MDL (normalized maximum likelihood): the minimax-optimal universal code for a model class $\mathcal{M}$:

$$p_{\text{NML}}(x \mid \mathcal{M}) = \frac{p(x \mid \hat{\theta}(x), \mathcal{M})}{\sum_{x'} p(x' \mid \hat{\theta}(x'), \mathcal{M})}$$

The parametric complexity (the log of the NML denominator) asymptotically equals $\frac{k}{2}\log\frac{n}{2\pi} + \log\int\sqrt{\det I(\theta)}\, d\theta + o(1)$, connecting MDL to BIC and Fisher information.

MDL advantages over classical hypothesis testing:

  • No need for significance levels or p-values
  • Automatically penalizes complexity (Occam)
  • Consistent model selection
  • Works with nested and non-nested models

Normalized Compression Distance (NCD)

A practical approximation to the normalized information distance:

$$\text{NID}(x, y) = \frac{\max(K(x \mid y^*), K(y \mid x^*))}{\max(K(x), K(y))}$$

which is a universal metric (up to additive precision): it minorizes every computable normalized distance.

NCD replaces $K$ with a real compressor $C$ (gzip, bzip2, etc.):

$$\text{NCD}(x, y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}$$

Applications: language classification, phylogenetic tree construction, plagiarism detection, malware analysis, music clustering. Remarkably parameter-free---the compressor implicitly captures all regularities.
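
A minimal NCD implementation needs only a standard compressor (here zlib; function names and the sample data are ours):

```python
import os
import zlib

def C(b: bytes) -> int:
    """Compressed size in bytes: a computable stand-in for K."""
    return len(zlib.compress(b, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: how much of y's regularity is
    already captured by x (and vice versa), on a roughly [0, 1] scale."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

english = b"the quick brown fox jumps over the lazy dog " * 20
similar = b"the lazy dog naps while the quick fox jumps " * 20
noise = os.urandom(900)

print(ncd(english, english))   # near 0: the second copy adds nothing
print(ncd(english, similar))   # intermediate: shared words compress jointly
print(ncd(english, noise))     # near 1: no shared regularities
```

Real applications use exactly this pattern, swapping in stronger compressors (bzip2, PPM) when the objects exceed the compressor's window.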

Algorithmic Statistics

Algorithmic sufficient statistic: a finite set $S$ containing $x$ is an algorithmic sufficient statistic if:

  • $K(S)$ is small (the "model" is simple)
  • $\log|S| \approx K(x \mid S)$ (within $S$, $x$ looks random)
  • $K(S) + \log|S| \approx K(x)$ (the two-part description is near-optimal)

The structure function $h_x(k) = \min\{\log|S| : x \in S,\ K(S) \leq k\}$ characterizes the tradeoff between model complexity and randomness deficiency. Its behavior reveals whether $x$ has useful structure (stochastic, non-stochastic, or random).

Connections and Applications

  • Kolmogorov structure function links to rate-distortion theory: minimum "noise" for a given model complexity budget
  • Resource-bounded complexity: restrict to polynomial-time programs, connecting to computational complexity (e.g., pseudorandomness means $K^{\text{poly}}(x) \approx |x|$)
  • Logical depth (Bennett): the running time of the shortest program for $x$. Captures "organized complexity"---deep objects are neither simple nor random
  • Sophistication: the complexity of the simplest "meaningful" description of $x$ (the algorithmic sufficient statistic)
  • Effective dimension (Lutz): use Kolmogorov complexity to define Hausdorff dimension for individual sequences, connecting algorithmic randomness to fractal geometry