
Kolmogorov Complexity

Algorithmic Information Theory

Kolmogorov complexity (algorithmic complexity, descriptive complexity) measures the information content of an individual string, independent of any probability distribution. While Shannon entropy characterizes the average information of a random source, Kolmogorov complexity captures the intrinsic complexity of a single object.

Definition: The Kolmogorov complexity of a string $x$ with respect to a universal Turing machine $U$ is:

$$K_U(x) = \min\{|p| : U(p) = x\}$$

where $|p|$ is the length of program $p$ in bits. Intuitively, $K(x)$ is the length of the shortest program that produces $x$.

Conditional complexity: $K(x \mid y) = \min\{|p| : U(p, y) = x\}$ --- the shortest program that produces $x$ given $y$ as auxiliary input.

Joint complexity: $K(x, y) = \min\{|p| : U(p) = \langle x, y \rangle\}$, where $\langle \cdot, \cdot \rangle$ is an effective pairing function.
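
Although $K$ is uncomputable (see below), any lossless compressor yields an upper bound on it: the compressed data together with a fixed decompressor is a program that prints the string. A minimal sketch, with zlib standing in for the shortest description (the function name is ours):

```python
import os
import zlib

def k_upper_bound(s: bytes) -> int:
    """Upper bound on K(s) in bits: the compressed size, with the fixed
    decompressor absorbed into the O(1). A real compressor can only
    over-estimate K, never under-estimate it."""
    return 8 * len(zlib.compress(s, 9))

structured = b"ab" * 500           # 1000 bytes, highly regular
random_ish = os.urandom(1000)      # 1000 bytes, incompressible w.h.p.

print(k_upper_bound(structured))   # a few hundred bits at most
print(k_upper_bound(random_ish))   # about 8000 bits or slightly more
```

The gap between the two bounds is exactly the intuition behind $K$: regular strings have short descriptions, random-looking ones do not.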

Invariance Theorem

Theorem (Solomonoff, Kolmogorov, Chaitin): For any two universal Turing machines $U_1, U_2$:

$$|K_{U_1}(x) - K_{U_2}(x)| \leq c_{U_1, U_2}$$

where $c_{U_1, U_2}$ is a constant depending only on the machines, not on $x$.

Proof sketch: $U_1$ can simulate $U_2$ by prepending a fixed interpreter program. The constant equals the length of this interpreter.

Consequence: $K(x)$ is well-defined up to an additive constant, so we fix a reference universal machine and suppress the subscript. All asymptotic statements about $K$ are machine-independent.

Prefix-Free Kolmogorov Complexity

To connect with coding theory and probability, we restrict to prefix-free programs (no program is a prefix of another):

$$K(x) = \min\{|p| : U(p) = x\}, \quad \text{dom}(U) \text{ prefix-free}$$

This variant satisfies Kraft's inequality: $\sum_{x} 2^{-K(x)} \leq 1$, enabling interpretation as a probability distribution (algorithmic probability / universal prior).

The prefix-free complexity $K(x)$ and plain complexity $C(x)$ are related:

$$C(x) \leq K(x) \leq C(x) + 2\log C(x) + O(1)$$
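
The prefix-free restriction is exactly what makes Kraft's inequality available. A quick check of the inequality on a small hand-built code (helper names are ours):

```python
def kraft_sum(codewords):
    """Sum of 2^{-|w|} over the codewords. Kraft's inequality says this
    is <= 1 for any prefix-free code (with equality iff it is complete)."""
    return sum(2.0 ** -len(w) for w in codewords)

def is_prefix_free(codewords):
    """True iff no codeword is a proper prefix of another."""
    return not any(a != b and b.startswith(a)
                   for a in codewords for b in codewords)

code = ["0", "10", "110", "111"]   # a complete prefix-free code
assert is_prefix_free(code)
print(kraft_sum(code))             # -> 1.0

assert not is_prefix_free(["0", "01"])   # "0" is a prefix of "01"
```

Because the shortest programs of a prefix-free machine form such a code, the weights $2^{-K(x)}$ themselves sum to at most 1.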

Basic Properties

  • Upper bound: $K(x) \leq |x| + O(\log |x|)$ (the program can simply contain $x$ as a literal, preceded by a self-delimiting encoding of its length)
  • Subadditivity: $K(x, y) \leq K(x) + K(y) + O(\log(K(x) + K(y)))$
  • Symmetry of information: $K(x, y) = K(x) + K(y \mid x^*) + O(1)$, where $x^*$ is the shortest program for $x$
  • Uncomputability: $K(x)$ is not computable, but it is upper semi-computable: we can enumerate ever shorter programs, converging to $K(x)$ from above
  • Non-monotonicity: $K$ is not monotone under substrings; a substring of a random string can be compressible
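
The literal upper bound can be made concrete: prepend a self-delimiting encoding of the length (each length bit doubled, terminated by the marker 01), giving a prefix-free "program" of length $|x| + 2\log|x| + O(1)$. A toy sketch, with hypothetical helper names:

```python
def literal_code(x: str) -> str:
    """Self-delimiting 'literal program' for bitstring x, witnessing
    K(x) <= |x| + 2 log |x| + O(1): each bit of |x| in binary is
    doubled, the pair 01 ends the header, then x follows verbatim."""
    header = "".join(b + b for b in format(len(x), "b")) + "01"
    return header + x

def decode(p: str):
    """Invert literal_code; returns (x, leftover). No end marker for x
    is needed --- that is the point of a self-delimiting code."""
    i, nbits = 0, ""
    while p[i:i + 2] != "01":
        nbits += p[i]          # first bit of each doubled pair
        i += 2
    n = int(nbits, 2)
    return p[i + 2 : i + 2 + n], p[i + 2 + n :]

x = "1011001110"
p = literal_code(x)
print(len(p), len(x))          # 20 10: overhead is 2*4 + 2 bits here
assert decode(p + "111") == (x, "111")
```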

Incompressibility Method

A string $x$ of length $n$ is $c$-incompressible if $K(x) \geq n - c$.

Counting argument: there are at most $2^n - 1$ programs shorter than $n$ bits, so at most $2^n - 1$ strings have $K(x) < n$. Hence at least one string of each length is incompressible, and the fraction of length-$n$ strings that are $c$-incompressible is at least $1 - 2^{-c}$.
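
The counting argument is pure arithmetic and easy to check directly:

```python
n, c = 20, 8

# Programs shorter than n bits: 2^0 + ... + 2^(n-1) = 2^n - 1 of them,
# so at most 2^n - 1 of the 2^n strings of length n can have K(x) < n.
programs_shorter = sum(2 ** k for k in range(n))
assert programs_shorter == 2 ** n - 1

# Strings with K(x) < n - c: at most 2^(n-c) - 1, i.e. strictly fewer
# than a 2^-c fraction of all length-n strings are c-compressible.
frac_compressible = (2 ** (n - c) - 1) / 2 ** n
print(frac_compressible)       # just under 2**-8 = 0.00390625
assert frac_compressible < 2 ** -c
```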

The incompressibility method proves combinatorial results by choosing a random (incompressible) object and deriving contradictions. Applications:

  • Lower bounds on sorting: any comparison sort of $n$ elements requires $\Omega(n \log n)$ comparisons. Take an incompressible permutation; if the sort used fewer comparisons, the sequence of comparison outcomes would encode a shorter description of the permutation
  • Average-case analysis: the average number of comparisons in Quicksort is $\Theta(n \log n)$
  • Graph theory: existence of graphs with specific properties (high girth and high chromatic number)
  • Communication complexity: lower bounds on communication protocols
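
The Quicksort claim is easy to check empirically: count comparisons over random permutations and confirm the average lands between the information-theoretic bound $\log_2 n!$ and roughly $2 n \ln n$. A sketch (function names are ours):

```python
import math
import random

def quicksort_comparisons(a):
    """Comparisons made by randomized quicksort on a list of distinct
    keys: n - 1 per partition step (each non-pivot vs. the pivot)."""
    if len(a) <= 1:
        return 0
    pivot = a[random.randrange(len(a))]
    less = [x for x in a if x < pivot]
    greater = [x for x in a if x > pivot]
    return (len(a) - 1) + quicksort_comparisons(less) + quicksort_comparisons(greater)

n, trials = 200, 50
random.seed(1)
avg = sum(quicksort_comparisons(random.sample(range(10**6), n))
          for _ in range(trials)) / trials
lower = math.log2(math.factorial(n))   # any comparison sort needs this much
print(lower, avg, 2 * n * math.log(n))
```

The average sits well inside the $\Theta(n \log n)$ window, matching the incompressibility analysis.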

Algorithmic Randomness

Martin-Löf Randomness

An infinite binary sequence $\omega$ is Martin-Löf random if it passes all effective statistical tests. Formally, it avoids all constructive measure-zero sets.

A Martin-Löf test is a uniformly computably enumerable sequence of open sets $\{U_m\}_{m=1}^\infty$ with $\mu(U_m) \leq 2^{-m}$. A sequence $\omega$ is ML-random iff $\omega \notin \bigcap_m U_m$ for every ML test.

Equivalent characterizations:

  • Levin-Schnorr theorem: $\omega$ is ML-random iff $K(\omega_{1:n}) \geq n - O(1)$ for all $n$ (prefix-free complexity stays near maximum)
  • Unpredictability: no computable martingale succeeds on $\omega$ (no effective gambling strategy betting on its bits wins unbounded wealth)
  • Typicality: $\omega$ satisfies all effectively testable properties that hold with probability 1 (law of large numbers, law of the iterated logarithm, etc.)
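
The martingale characterization can be illustrated on a clearly non-random sequence: a computable strategy that stakes half its capital on the predicted next bit grows its wealth exponentially on $0101\ldots$, certifying non-randomness. A toy sketch (names are ours):

```python
def martingale_on(seq, predict):
    """Gamble along seq: start with capital 1; at each step stake half
    the current capital on the predicted next bit, double-or-nothing on
    the stake. Fairness: d(w0) + d(w1) = 2 d(w), so this is a martingale."""
    capital = 1.0
    for i, bit in enumerate(seq):
        stake = capital / 2
        if predict(seq[:i]) == bit:
            capital += stake       # won the stake
        else:
            capital -= stake       # lost the stake
    return capital

# 010101... is far from random: the computable strategy "predict the
# flip of the previous bit" multiplies its capital by 1.5 every step.
seq = [0, 1] * 20
predict = lambda prefix: 1 - prefix[-1] if prefix else 0
print(martingale_on(seq, predict))   # 1.5 ** 40, about 1.1e7
```

On an ML-random sequence, by definition, every such computable strategy keeps its capital bounded.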

Hierarchy of Randomness Notions

From weakest to strongest:

  1. Schnorr randomness: pass all computable tests (test measures must be computably convergent)
  2. Computable randomness: no computable martingale succeeds
  3. Martin-Löf randomness: pass all c.e. tests
  4. 2-randomness: pass all $\Sigma^0_2$ tests (ML-random relative to the halting problem)

Each level strictly contains the next. ML-randomness is the most widely accepted formalization of algorithmic randomness.

Chaitin's Omega

The halting probability of a universal prefix-free machine $U$:

$$\Omega_U = \sum_{p \,:\, U(p) \text{ halts}} 2^{-|p|}$$

Properties:

  • $0 < \Omega < 1$ (the sum converges by Kraft's inequality)
  • Martin-Löf random: the binary expansion of $\Omega$ is an ML-random sequence
  • Computably enumerable from below: we can compute increasingly accurate lower bounds
  • Not computable: knowing the first $n$ bits of $\Omega$ solves the halting problem for all programs of length $\leq n$
  • Maximally unknowable: no consistent formal system of complexity $k$ can prove more than $k + O(1)$ bits of $\Omega$

$\Omega$ concentrates the difficulty of the halting problem: its first $n$ bits encode the answers to all halting questions for short programs. It is sometimes called the "number of wisdom."
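
"Enumerable from below" can be made concrete with a toy machine in place of a true universal one (for which the bounds would converge with no computable rate). Here the prefix-free domain is $\{1^k 0\}$ and program $1^k 0$ "halts" after $2^k$ steps; dovetailing programs against step budgets yields a non-decreasing sequence of lower bounds:

```python
def toy_machine(p: str, steps: int) -> bool:
    """Toy stand-in for U with prefix-free domain {1^k 0 : k >= 0}:
    program 1^k 0 'halts' once given 2^k steps. (Not universal; it only
    makes the dovetailing pattern concrete.)"""
    if not p.endswith("0") or "0" in p[:-1]:
        return False                    # not in the domain
    return steps >= 2 ** (len(p) - 1)

def omega_lower_bounds(stages: int):
    """Dovetail programs against step budgets; the partial sums of
    2^{-|p|} over halted programs only ever increase."""
    bounds = []
    for s in range(1, stages + 1):
        halted = ["1" * k + "0" for k in range(s)
                  if toy_machine("1" * k + "0", s)]
        bounds.append(sum(2.0 ** -len(p) for p in halted))
    return bounds

print(omega_lower_bounds(8))   # non-decreasing; this toy Omega equals 1
```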

Connections to Gödel's Theorems

Kolmogorov complexity provides an elegant information-theoretic perspective on incompleteness:

Chaitin's incompleteness theorem: for any consistent formal system $F$ of complexity $K(F) = k$:

$$\text{there exists } c \text{ such that } F \text{ cannot prove } K(x) > k + c \text{ for any } x$$

Proof sketch: if $F$ could prove some statement of the form $K(x) > k + c$, we could enumerate the theorems of $F$, find the first proof of such a statement, and output the corresponding $x$ --- using only $k + O(1)$ bits (to specify $F$, the search procedure, and $c$), contradicting $K(x) > k + c$ for large enough $c$.

This mirrors Gödel's first incompleteness theorem: any sufficiently powerful formal system has true but unprovable statements. The information-theoretic version reveals that the unprovable statements include precisely those asserting high complexity (randomness) of specific strings.

Algorithmic Probability and Solomonoff Induction

Algorithmic probability (Solomonoff, 1964): the probability that $U$ outputs $x$ when fed uniformly random bits:

$$m(x) = \sum_{p \,:\, U(p) = x} 2^{-|p|}$$

This is a universal semimeasure: it dominates every computable semimeasure up to a multiplicative constant. By the coding theorem, $-\log m(x) = K(x) + O(1)$.

Solomonoff's theory of induction: predict the next bit of a sequence by weighting all computable hypotheses by their algorithmic probability. This is the Bayesian-optimal predictor relative to the universal prior. It converges to the true distribution (if computable) at an optimal rate.

Connection to Occam's razor: simpler (shorter program) hypotheses receive higher prior weight. This provides a formal justification for preferring simpler explanations.

Minimum Description Length (MDL) Principle

MDL (Rissanen, 1978) is the practical operationalization of Kolmogorov complexity for statistical inference. Select the model $M$ and parameters $\theta$ minimizing:

$$L(M) + L(\theta \mid M) + L(\text{data} \mid M, \theta)$$

where $L(\cdot)$ denotes description length (in bits).

Two-part MDL (crude): minimize model code length + data-given-model code length. Requires discretization of parameters.
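
A minimal sketch of crude two-part MDL on the simplest possible model class: a Bernoulli coin versus a literal code for a bitstring. The parameter cost $\tfrac{1}{2}\log_2 n$ reflects the optimal discretization of one parameter; names and the example are ours, not from any library:

```python
import math

def h(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mdl_bernoulli(bits):
    """Crude two-part MDL: compare the literal code (n bits) against a
    Bernoulli model ((1/2) log2 n bits for the discretized parameter,
    plus n * H(p_hat) bits for the data given it). Returns the cheaper
    description's name and its length in bits."""
    n = len(bits)
    p_hat = sum(bits) / n
    literal = n
    bernoulli = 0.5 * math.log2(n) + n * h(p_hat)
    return ("bernoulli", bernoulli) if bernoulli < literal else ("literal", literal)

biased = [1] * 100 + [0] * 900      # p_hat = 0.1: strongly compressible
balanced = [0, 1] * 500             # p_hat = 0.5: the model cannot help

print(mdl_bernoulli(biased))        # ('bernoulli', ...): wins by far
print(mdl_bernoulli(balanced))      # ('literal', 1000)
```

The parameter cost is exactly what keeps the richer model from winning on data it does not actually compress.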

Refined MDL (normalized maximum likelihood): the minimax-optimal universal code for a model class $\mathcal{M}$:

$$p_{\text{NML}}(x \mid \mathcal{M}) = \frac{p(x \mid \hat{\theta}(x), \mathcal{M})}{\sum_{x'} p(x' \mid \hat{\theta}(x'), \mathcal{M})}$$

The parametric complexity (the log of the NML denominator) asymptotically equals $\frac{k}{2}\log\frac{n}{2\pi} + \log\int\sqrt{\det I(\theta)}\, d\theta + o(1)$, connecting MDL to BIC and Fisher information.

MDL advantages over classical hypothesis testing:

  • No need for significance levels or p-values
  • Automatically penalizes complexity (Occam)
  • Consistent model selection
  • Works with nested and non-nested models

Normalized Compression Distance (NCD)

A practical approximation to the normalized information distance:

$$\text{NID}(x, y) = \frac{\max(K(x \mid y^*), K(y \mid x^*))}{\max(K(x), K(y))}$$

which is a universal metric (up to additive precision): it minorizes every computable normalized distance.

NCD replaces $K$ with a real compressor $C$ (gzip, bzip2, etc.):

$$\text{NCD}(x, y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}$$

Applications: language classification, phylogenetic tree construction, plagiarism detection, malware analysis, music clustering. Remarkably parameter-free---the compressor implicitly captures all regularities.
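
A minimal NCD implementation needs only a standard compressor (here zlib; function names and the sample data are ours):

```python
import os
import zlib

def C(b: bytes) -> int:
    """Compressed size in bytes: a computable stand-in for K."""
    return len(zlib.compress(b, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: how much of y's regularity is
    already captured by x (and vice versa), on a roughly [0, 1] scale."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

english = b"the quick brown fox jumps over the lazy dog " * 20
similar = b"the lazy dog naps while the quick fox jumps " * 20
noise = os.urandom(900)

print(ncd(english, english))   # near 0: the second copy adds nothing
print(ncd(english, similar))   # intermediate: shared words compress jointly
print(ncd(english, noise))     # near 1: no shared regularities
```

Real applications use exactly this pattern, swapping in stronger compressors (bzip2, PPM) when the objects exceed the compressor's window.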

Algorithmic Statistics

Algorithmic sufficient statistic: a finite set $S$ containing $x$ is an algorithmic sufficient statistic if:

  • $K(S)$ is small (the "model" is simple)
  • $\log|S| \approx K(x \mid S)$ (within $S$, $x$ looks random)
  • $K(S) + \log|S| \approx K(x)$ (the two-part description is near-optimal)

The structure function $h_x(k) = \min\{\log|S| : x \in S,\ K(S) \leq k\}$ characterizes the tradeoff between model complexity and randomness deficiency. Its behavior reveals whether $x$ has useful structure (stochastic, non-stochastic, or random).

Connections and Applications

  • Kolmogorov structure function links to rate-distortion theory: minimum "noise" for a given model complexity budget
  • Resource-bounded complexity: restrict to polynomial-time programs, connecting to computational complexity (e.g., pseudorandomness means $K^{\text{poly}}(x) \approx |x|$)
  • Logical depth (Bennett): the running time of the shortest program for $x$. Captures "organized complexity"---deep objects are neither simple nor random
  • Sophistication: the complexity of the simplest "meaningful" description of $x$ (the algorithmic sufficient statistic)
  • Effective dimension (Lutz): use Kolmogorov complexity to define Hausdorff dimension for individual sequences, connecting algorithmic randomness to fractal geometry