Entropy and Information

Self-Information

The self-information (surprisal) of an event $x$ with probability $p(x)$ quantifies the information gained upon observing it:

$$I(x) = -\log_b p(x)$$

Properties:

  • $I(x) \geq 0$ (non-negative)
  • $I(x) = 0$ when $p(x) = 1$ (certain events carry no information)
  • Rare events carry more information
  • For independent events: $I(x, y) = I(x) + I(y)$ (additivity)

Base $b = 2$ gives bits, $b = e$ gives nats, $b = 3$ gives trits.
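
In code (a minimal sketch using the standard library; the function name is ours):

```python
import math

def self_information(p, base=2):
    """Surprisal -log_b p(x): bits for base 2, nats for base e."""
    return -math.log(p, base)

# A fair-coin outcome carries exactly 1 bit of surprise.
bit = self_information(0.5)
# Rare events carry more information.
rare = self_information(0.01)
# Additivity for independent events: I(x, y) = I(x) + I(y).
joint = self_information(0.5 * 0.01)
```
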

Shannon Entropy

The Shannon entropy is the expected self-information of a random variable $X$ with distribution $p$:

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$

Key properties:

  • $H(X) \geq 0$, with equality iff $X$ is deterministic
  • $H(X) \leq \log |\mathcal{X}|$, with equality iff $X$ is uniform
  • Concavity: $H(\lambda p + (1-\lambda)q) \geq \lambda H(p) + (1-\lambda) H(q)$
  • Permutation invariance
  • Continuity in $p$

The binary entropy function $H_b(p) = -p\log p - (1-p)\log(1-p)$ is fundamental in coding theory. It achieves its maximum of 1 bit at $p = 1/2$.
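
A direct translation (base 2, so entropies are in bits; the distributions are arbitrary examples):

```python
import math

def entropy(p, base=2):
    """Shannon entropy H(p) = -sum p log p, with 0 log 0 := 0."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def binary_entropy(p):
    return entropy([p, 1 - p])

uniform4 = entropy([0.25] * 4)           # log2(4) = 2 bits: the maximum on 4 symbols
skewed4  = entropy([0.7, 0.1, 0.1, 0.1]) # any non-uniform pmf has less entropy
peak     = binary_entropy(0.5)           # binary entropy peaks at 1 bit for p = 1/2
```
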

Joint and Conditional Entropy

Joint entropy of $(X, Y)$:

$$H(X, Y) = -\sum_{x,y} p(x,y) \log p(x,y)$$

Conditional entropy measures the remaining uncertainty in $X$ given $Y$:

$$H(X|Y) = -\sum_{x,y} p(x,y) \log p(x|y) = H(X,Y) - H(Y)$$

Critical inequalities:

  • $H(X|Y) \leq H(X)$ (conditioning reduces entropy)
  • $H(X,Y) \leq H(X) + H(Y)$ (subadditivity, equality iff independent)
  • $H(X|Y) \geq 0$ (for discrete variables; the continuous analogue can be negative)
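
These identities and inequalities are easy to verify numerically. A minimal check on a small joint distribution (values chosen for illustration):

```python
import math

def H(probs):
    """Entropy in bits; 0 log 0 := 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A correlated joint pmf p(x, y) on {0,1} x {0,1}; both marginals are uniform.
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
H_xy = H(pxy.values())
H_x = H([0.5, 0.5])
H_y = H([0.5, 0.5])

H_x_given_y = H_xy - H_y   # H(X|Y) = H(X,Y) - H(Y)
```

Conditioning reduces entropy here because $X$ and $Y$ are correlated; for independent variables the inequalities hold with equality.
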

Chain Rule for Entropy

For random variables $X_1, X_2, \ldots, X_n$:

$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X_1, \ldots, X_{i-1})$$

This decomposes joint entropy into successive conditional entropies and is the foundation for sequential coding schemes.
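
The decomposition can be verified by computing each conditional entropy directly from its definition (the three-variable pmf below is an arbitrary example):

```python
import math
from itertools import product

def H(probs):
    """Entropy in bits; 0 log 0 := 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An arbitrary normalized joint pmf over three binary variables.
weights = {xs: i + 1 for i, xs in enumerate(product([0, 1], repeat=3))}
Z = sum(weights.values())
p123 = {xs: w / Z for xs, w in weights.items()}

def marginal(pmf, keep):
    """Marginalize pmf onto the coordinate indices in `keep`."""
    out = {}
    for xs, p in pmf.items():
        k = tuple(xs[i] for i in keep)
        out[k] = out.get(k, 0.0) + p
    return out

p1  = marginal(p123, (0,))
p12 = marginal(p123, (0, 1))

# Each term computed straight from its definition, not by telescoping.
H1 = H(p1.values())
H2_given_1  = -sum(p * math.log2(p / p1[(x1,)])
                   for (x1, x2), p in p12.items() if p > 0)
H3_given_12 = -sum(p * math.log2(p / p12[(x1, x2)])
                   for (x1, x2, x3), p in p123.items() if p > 0)

chain_sum = H1 + H2_given_1 + H3_given_12   # should equal H(X1, X2, X3)
```
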

Mutual Information

Mutual information quantifies the information shared between $X$ and $Y$:

$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)$$

Equivalently, in terms of KL divergence:

$$I(X; Y) = D_{\text{KL}}(p(x,y) \| p(x)p(y))$$

Properties:

  • $I(X; Y) \geq 0$, with equality iff $X \perp Y$
  • $I(X; Y) = I(Y; X)$ (symmetric)
  • $I(X; X) = H(X)$ (a variable's mutual information with itself is its entropy)

Conditional mutual information: $I(X; Y | Z) = H(X|Z) - H(X|Y,Z)$.

Chain rule for MI: $I(X_1, \ldots, X_n; Y) = \sum_{i=1}^n I(X_i; Y | X_1, \ldots, X_{i-1})$.
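
The equivalent forms can be cross-checked on a small example (a hypothetical joint distribution with uniform marginals):

```python
import math

def H(probs):
    """Entropy in bits; 0 log 0 := 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {0: 0.5, 1: 0.5}
py = {0: 0.5, 1: 0.5}

# Entropy form: I(X;Y) = H(X) + H(Y) - H(X,Y)
mi_entropy = H(px.values()) + H(py.values()) - H(pxy.values())

# KL form: I(X;Y) = D_KL(p(x,y) || p(x) p(y))
mi_kl = sum(p * math.log2(p / (px[x] * py[y]))
            for (x, y), p in pxy.items() if p > 0)

# I(X;X) = H(X): the "joint" of X with itself lives on the diagonal.
pxx = {(0, 0): 0.5, (1, 1): 0.5}
mi_self = sum(p * math.log2(p / (px[x1] * px[x2]))
              for (x1, x2), p in pxx.items())
```
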

Data Processing Inequality

If $X \to Y \to Z$ forms a Markov chain (i.e., $X$ and $Z$ are conditionally independent given $Y$):

$$I(X; Z) \leq I(X; Y)$$

No processing of $Y$ can increase information about $X$. Equality holds iff $X \to Z \to Y$ also forms a Markov chain (i.e., $Z$ is a sufficient statistic of $Y$ for $X$). This has profound implications:

  • Learned representations cannot contain more task-relevant information than the input
  • Each layer in a neural network can only lose or preserve (never gain) information about the input
  • Motivates the information bottleneck framework
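
The inequality can be checked exactly on a toy chain; below, both links are binary symmetric channels with a hypothetical 10% flip probability:

```python
import math

def mi(pab, pa, pb):
    """Mutual information in bits from a joint pmf and its marginals."""
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in pab.items() if p > 0)

# X -> Y -> Z: X ~ Bernoulli(0.5); each arrow flips its input w.p. 0.1.
px = {0: 0.5, 1: 0.5}
flip = 0.1
chan = {(a, b): (1 - flip) if a == b else flip for a in (0, 1) for b in (0, 1)}

pxy = {(x, y): px[x] * chan[(x, y)] for x in (0, 1) for y in (0, 1)}
py = {y: sum(pxy[(x, y)] for x in (0, 1)) for y in (0, 1)}

# Composing the two channels: p(x, z) = p(x) * sum_y p(y|x) p(z|y).
pxz = {(x, z): sum(px[x] * chan[(x, y)] * chan[(y, z)] for y in (0, 1))
       for x in (0, 1) for z in (0, 1)}
pz = {z: sum(pxz[(x, z)] for x in (0, 1)) for z in (0, 1)}

I_xy = mi(pxy, px, py)
I_xz = mi(pxz, px, pz)   # data processing: I(X;Z) <= I(X;Y)
```

The second hop degrades the signal (two cascaded 10% flips behave like one 18% flip), so $I(X;Z)$ is strictly smaller here.
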

Entropy Rate

For a stochastic process $\{X_i\}$, the entropy rate captures the per-symbol entropy:

$$H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)$$

For stationary processes, equivalently: $H(\mathcal{X}) = \lim_{n \to \infty} H(X_n | X_{n-1}, \ldots, X_1)$.

For a stationary ergodic Markov chain with transition matrix $P$ and stationary distribution $\mu$:

$$H(\mathcal{X}) = -\sum_{i} \mu_i \sum_{j} P_{ij} \log P_{ij}$$

The entropy rate determines the fundamental compression limit for the process (Shannon's source coding theorem).
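
For a concrete two-state chain (transition probabilities made up for illustration), the rate follows directly from the formula above:

```python
import math

# Two-state chain; rows of P sum to 1.
P = [[0.9, 0.1],
     [0.4, 0.6]]

# Stationary distribution by power iteration: mu = mu P.
mu = [0.5, 0.5]
for _ in range(1000):
    mu = [sum(mu[i] * P[i][j] for i in range(2)) for j in range(2)]

# Entropy rate in bits/symbol: H = -sum_i mu_i sum_j P_ij log2 P_ij.
rate = -sum(mu[i] * P[i][j] * math.log2(P[i][j])
            for i in range(2) for j in range(2) if P[i][j] > 0)
```

For this chain $\mu = (0.8, 0.2)$, and the rate (about 0.57 bits/symbol) is below the stationary distribution's entropy, reflecting the memory in the process.
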

Maximum Entropy Principle

Given constraints (e.g., known moments), the maximum entropy distribution is the least informative distribution consistent with those constraints. Under constraints $\mathbb{E}[f_i(X)] = \alpha_i$:

$$p^*(x) = \frac{1}{Z} \exp\left(-\sum_i \lambda_i f_i(x)\right)$$

This yields the exponential family. Examples:

  • Known mean and variance on $\mathbb{R}$: Gaussian
  • Known mean on $[0, \infty)$: Exponential
  • No constraints on a finite set: Uniform
  • Known mean on $\{0, 1, 2, \ldots\}$: Geometric

Jaynes' MaxEnt principle connects information theory to statistical mechanics, where the Boltzmann distribution arises as the maximum entropy distribution under fixed expected energy.
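
As a sketch of the principle (not production code): on a finite support with a mean constraint, bisecting on the Lagrange multiplier $\lambda$ recovers the geometric-like max-entropy pmf $p^*(x) \propto e^{-\lambda x}$, which beats any other distribution with the same mean. All values here are illustrative:

```python
import math

def H(p):
    """Entropy in bits; 0 log 0 := 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def maxent_mean(support, target_mean):
    """Max-entropy pmf on `support` with E[X] = target_mean: p(x) ~ exp(-lam*x).
    Finds lam by bisection; the mean is decreasing in lam."""
    def mean_for(lam):
        w = [math.exp(-lam * x) for x in support]
        return sum(x * wi for x, wi in zip(support, w)) / sum(w)
    lo, hi = -5.0, 5.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if mean_for(mid) > target_mean:
            lo = mid   # need faster decay -> larger lam
        else:
            hi = mid
    w = [math.exp(-lo * x) for x in support]
    Z = sum(w)
    return [wi / Z for wi in w]

support = list(range(21))
p_star = maxent_mean(support, 4.0)

# Another pmf with the same mean (a two-point one) has lower entropy.
p_alt = [0.0] * 21
p_alt[0], p_alt[20] = 0.8, 0.2   # mean = 0.2 * 20 = 4
```
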

Differential Entropy

For a continuous random variable $X$ with density $f$:

$$h(X) = -\int f(x) \log f(x) \, dx$$

Key differences from discrete entropy:

  • Can be negative (e.g., $\text{Uniform}(0, a)$ with $a < 1$)
  • Not invariant under change of variables: $h(g(X)) = h(X) + \mathbb{E}[\log |g'(X)|]$
  • Not a limit of discrete entropy (off by $\log \Delta$ for discretization width $\Delta$)

Notable values: the Gaussian's $h(X) = \frac{1}{2}\log(2\pi e \sigma^2)$ is the maximum differential entropy for fixed variance. Mutual information $I(X;Y)$ remains well-defined and non-negative for continuous variables, unlike differential entropy.
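
Both the closed form and the possibility of negative values are easy to check numerically; the sketch below compares the Gaussian formula against a Monte Carlo estimate of $-\mathbb{E}[\log f(X)]$ (sample size chosen arbitrarily):

```python
import math
import random

random.seed(0)
sigma = 2.0

# Closed form (nats): h(X) = 0.5 * log(2 * pi * e * sigma^2)
h_closed = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def log_density(x, sigma):
    """Log-density of N(0, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - x ** 2 / (2 * sigma ** 2)

# Monte Carlo estimate of h(X) = -E[log f(X)].
n = 200_000
h_mc = -sum(log_density(random.gauss(0, sigma), sigma) for _ in range(n)) / n

# Differential entropy can be negative: Uniform(0, 0.5) has h = log(1/2) < 0.
h_unif = math.log(0.5)
```
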

KL Divergence

The Kullback-Leibler divergence (relative entropy) from $q$ to $p$:

$$D_{\text{KL}}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_p\left[\log \frac{p(X)}{q(X)}\right]$$

Properties:

  • $D_{\text{KL}}(p \| q) \geq 0$ (Gibbs' inequality), with equality iff $p = q$
  • Not symmetric: $D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$ in general
  • Not a metric: violates the triangle inequality
  • Connects to maximum likelihood: minimizing $D_{\text{KL}}(p_{\text{data}} \| p_\theta)$ is equivalent to maximizing $\mathbb{E}_{p_{\text{data}}}[\log p_\theta(X)]$

Forward KL ($D_{\text{KL}}(p \| q)$) is zero-avoiding in $q$ (mean-seeking); reverse KL ($D_{\text{KL}}(q \| p)$) is zero-forcing in $q$ (mode-seeking). This asymmetry is crucial in variational inference.
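
The asymmetry is visible on any pair of distributions that disagree about which outcomes are rare (values below are arbitrary):

```python
import math

def kl(p, q):
    """D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.4, 0.1]
q = [0.1, 0.2, 0.7]

forward = kl(p, q)   # D_KL(p || q)
reverse = kl(q, p)   # D_KL(q || p): a different number in general
```
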

f-Divergences

A general family: for convex $f$ with $f(1) = 0$:

$$D_f(p \| q) = \sum_x q(x) f\left(\frac{p(x)}{q(x)}\right)$$

Special cases:

| Divergence | $f(t)$ |
|---|---|
| KL | $t \log t$ |
| Reverse KL | $-\log t$ |
| Total variation | $\frac{1}{2}\lvert t - 1 \rvert$ |
| $\chi^2$ | $(t-1)^2$ |
| Hellinger | $(\sqrt{t} - 1)^2$ |

Key properties shared by all f-divergences:

  • Non-negative, zero iff $p = q$
  • Invariant under sufficient statistics
  • Satisfy the data processing inequality: $D_f(p_Y \| q_Y) \leq D_f(p_X \| q_X)$ for any channel $Y = g(X)$
  • Variational representation (via the Fenchel conjugate $f^*$): $D_f(p \| q) = \sup_T \{\mathbb{E}_p[T(X)] - \mathbb{E}_q[f^*(T(X))]\}$, enabling neural estimation (f-GAN)
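
A single generic routine covers the whole family; plugging in each $f$ from the table recovers the individual divergences (the distributions below are arbitrary examples, with natural logs):

```python
import math

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) f(p(x)/q(x)); assumes q(x) > 0 everywhere."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.4, 0.1]
q = [0.1, 0.2, 0.7]

kl_f = f_divergence(p, q, lambda t: t * math.log(t) if t > 0 else 0.0)
tv   = f_divergence(p, q, lambda t: 0.5 * abs(t - 1))
chi2 = f_divergence(p, q, lambda t: (t - 1) ** 2)
hell = f_divergence(p, q, lambda t: (math.sqrt(t) - 1) ** 2)

# Cross-check: the f-divergence with f(t) = t log t is exactly KL.
kl_direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```
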

Jensen-Shannon Divergence

A symmetrized, bounded divergence:

$$\text{JSD}(p \| q) = \frac{1}{2} D_{\text{KL}}(p \| m) + \frac{1}{2} D_{\text{KL}}(q \| m), \quad m = \frac{p + q}{2}$$

Properties:

  • $0 \leq \text{JSD}(p \| q) \leq \log 2$ (i.e., at most 1 bit)
  • Symmetric: $\text{JSD}(p \| q) = \text{JSD}(q \| p)$
  • $\sqrt{\text{JSD}}$ is a proper metric (satisfies the triangle inequality)
  • JSD is an f-divergence with $f(t) = t\log t - (t+1)\log\frac{t+1}{2}$

The generalized JSD with weights $\pi_1, \ldots, \pi_n$ and distributions $p_1, \ldots, p_n$:

$$\text{JSD}_\pi(p_1, \ldots, p_n) = H\left(\sum_i \pi_i p_i\right) - \sum_i \pi_i H(p_i)$$

JSD was the original GAN training objective. Its boundedness causes training instabilities when the supports of $p$ and $q$ have minimal overlap (the vanishing gradient problem), motivating Wasserstein distances.
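
A short sketch showing the bounds, the symmetry, and the saturation at $\log 2$ (1 bit here) for distributions with disjoint supports:

```python
import math

def kl(p, q):
    """D_KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence in bits via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [1.0, 0.0]
q = [0.0, 1.0]   # disjoint support from p: JSD saturates at 1 bit
r = [0.5, 0.5]

j_pq = jsd(p, q)
```

Saturation is exactly the vanishing gradient problem: once the supports are disjoint, moving $q$ around (while staying disjoint) leaves the objective flat.
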

Fano's Inequality

A lower bound on the error probability of estimating $X$ from noisy observations. If $X \to Y \to \hat{X}$ with $P_e = \Pr(\hat{X} \neq X)$:

$$H(X|Y) \leq H_b(P_e) + P_e \log(|\mathcal{X}| - 1)$$

Rearranging gives a lower bound on error probability in terms of conditional entropy. This is the key tool for proving converse results in information theory (showing that rates beyond capacity are unachievable).
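
A numeric check on a small symmetric channel (illustrative values; for this channel Fano's bound happens to hold with equality, since errors are uniform over the wrong symbols):

```python
import math

def H(probs):
    """Entropy in bits; 0 log 0 := 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# X uniform on {0,1,2}; the channel outputs the true symbol w.p. 0.7
# and each of the two wrong symbols w.p. 0.15.
n = 3
pxy = {(x, y): (1 / n) * (0.7 if x == y else 0.15)
       for x in range(n) for y in range(n)}
py = {y: sum(pxy[(x, y)] for x in range(n)) for y in range(n)}

H_x_given_y = H(pxy.values()) - H(py.values())   # H(X|Y) = H(X,Y) - H(Y)

# The MAP estimator picks xhat = y; its error probability:
Pe = 1 - sum(pxy[(x, x)] for x in range(n))      # = 0.3

# Fano: H(X|Y) <= H_b(Pe) + Pe * log2(|X| - 1)
fano_bound = H([Pe, 1 - Pe]) + Pe * math.log2(n - 1)
```
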

Pinsker's Inequality

Relates total variation to KL divergence:

$$\delta(p, q) \leq \sqrt{\frac{1}{2} D_{\text{KL}}(p \| q)}$$

where $\delta(p,q) = \frac{1}{2}\sum_x |p(x) - q(x)|$ is the total variation distance. This bound is tight up to constants and bridges information-theoretic and statistical distances.
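
A quick empirical check over random distribution pairs (note Pinsker's inequality uses KL in nats):

```python
import math
import random

random.seed(1)

def kl_nats(p, q):
    """D_KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def random_dist(k):
    w = [random.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

# Pinsker: delta(p, q) <= sqrt(KL(p || q) / 2) for every pair.
ok = True
for _ in range(1000):
    p, q = random_dist(5), random_dist(5)
    ok = ok and total_variation(p, q) <= math.sqrt(0.5 * kl_nats(p, q)) + 1e-12
```
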

Information Geometry (Brief)

KL divergence induces a Riemannian structure on the statistical manifold. The Fisher information matrix:

$$[I(\theta)]_{ij} = \mathbb{E}\left[\frac{\partial \log p(X|\theta)}{\partial \theta_i} \frac{\partial \log p(X|\theta)}{\partial \theta_j}\right]$$

serves as the metric tensor. Locally, $D_{\text{KL}}(p_\theta \| p_{\theta+d\theta}) \approx \frac{1}{2} d\theta^T I(\theta) d\theta$. Natural gradient descent uses $I(\theta)^{-1} \nabla_\theta$ instead of $\nabla_\theta$, providing invariance to parameterization and faster convergence (e.g., in policy gradient methods).
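
The local quadratic behavior can be checked on a one-parameter Bernoulli family, whose Fisher information is $1/(\theta(1-\theta))$ (step size chosen for illustration):

```python
import math

def bern_kl_nats(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

theta = 0.3
fisher = 1.0 / (theta * (1 - theta))   # Fisher information of Bernoulli(theta)

# Local quadratic approximation: KL(p_theta || p_{theta+d}) ~ (1/2) I(theta) d^2
d = 1e-3
exact = bern_kl_nats(theta, theta + d)
approx = 0.5 * fisher * d ** 2
```

For small $d$ the two values agree to within $O(d^3)$, which is the sense in which the Fisher matrix acts as a local metric.
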