Entropy and Information

Self-Information

The self-information (surprisal) of an event $x$ with probability $p(x)$ quantifies the information gained upon observing it:

$$I(x) = -\log_b p(x)$$

Properties:

  • $I(x) \geq 0$ (non-negative)
  • $I(x) = 0$ when $p(x) = 1$ (certain events carry no information)
  • Rare events carry more information
  • For independent events: $I(x, y) = I(x) + I(y)$ (additivity)

Base $b = 2$ gives bits, $b = e$ gives nats, $b = 3$ gives trits.
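
In code (a minimal sketch using the standard library; the function name is ours):

```python
import math

def self_information(p, base=2):
    """Surprisal -log_b p(x): bits for base 2, nats for base e."""
    return -math.log(p, base)

# A fair-coin outcome carries exactly 1 bit of surprise.
bit = self_information(0.5)
# Rare events carry more information.
rare = self_information(0.01)
# Additivity for independent events: I(x, y) = I(x) + I(y).
joint = self_information(0.5 * 0.01)
```
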

Shannon Entropy

The Shannon entropy is the expected self-information of a random variable $X$ with distribution $p$:

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$

Key properties:

  • $H(X) \geq 0$, with equality iff $X$ is deterministic
  • $H(X) \leq \log |\mathcal{X}|$, with equality iff $X$ is uniform
  • Concavity: $H(\lambda p + (1-\lambda)q) \geq \lambda H(p) + (1-\lambda) H(q)$
  • Permutation invariance
  • Continuity in $p$

The binary entropy function $H_b(p) = -p\log p - (1-p)\log(1-p)$ is fundamental in coding theory. It achieves its maximum of 1 bit at $p = 1/2$.
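
A direct translation (base 2, so entropies are in bits; the distributions are arbitrary examples):

```python
import math

def entropy(p, base=2):
    """Shannon entropy H(p) = -sum p log p, with 0 log 0 := 0."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def binary_entropy(p):
    return entropy([p, 1 - p])

uniform4 = entropy([0.25] * 4)           # log2(4) = 2 bits: the maximum on 4 symbols
skewed4  = entropy([0.7, 0.1, 0.1, 0.1]) # any non-uniform pmf has less entropy
peak     = binary_entropy(0.5)           # binary entropy peaks at 1 bit for p = 1/2
```
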

Joint and Conditional Entropy

Joint entropy of $(X, Y)$:

$$H(X, Y) = -\sum_{x,y} p(x,y) \log p(x,y)$$

Conditional entropy measures the remaining uncertainty in $X$ given $Y$:

$$H(X|Y) = -\sum_{x,y} p(x,y) \log p(x|y) = H(X,Y) - H(Y)$$

Critical inequalities:

  • $H(X|Y) \leq H(X)$ (conditioning reduces entropy)
  • $H(X,Y) \leq H(X) + H(Y)$ (subadditivity, equality iff independent)
  • $H(X|Y) \geq 0$ (for discrete variables; the continuous analogue can be negative)
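
These identities and inequalities are easy to verify numerically. A minimal check on a small joint distribution (values chosen for illustration):

```python
import math

def H(probs):
    """Entropy in bits; 0 log 0 := 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A correlated joint pmf p(x, y) on {0,1} x {0,1}; both marginals are uniform.
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
H_xy = H(pxy.values())
H_x = H([0.5, 0.5])
H_y = H([0.5, 0.5])

H_x_given_y = H_xy - H_y   # H(X|Y) = H(X,Y) - H(Y)
```

Conditioning reduces entropy here because $X$ and $Y$ are correlated; for independent variables the inequalities hold with equality.
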

Chain Rule for Entropy

For random variables $X_1, X_2, \ldots, X_n$:

$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X_1, \ldots, X_{i-1})$$

This decomposes joint entropy into successive conditional entropies and is the foundation for sequential coding schemes.
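
The decomposition can be verified by computing each conditional entropy directly from its definition (the three-variable pmf below is an arbitrary example):

```python
import math
from itertools import product

def H(probs):
    """Entropy in bits; 0 log 0 := 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An arbitrary normalized joint pmf over three binary variables.
weights = {xs: i + 1 for i, xs in enumerate(product([0, 1], repeat=3))}
Z = sum(weights.values())
p123 = {xs: w / Z for xs, w in weights.items()}

def marginal(pmf, keep):
    """Marginalize pmf onto the coordinate indices in `keep`."""
    out = {}
    for xs, p in pmf.items():
        k = tuple(xs[i] for i in keep)
        out[k] = out.get(k, 0.0) + p
    return out

p1  = marginal(p123, (0,))
p12 = marginal(p123, (0, 1))

# Each term computed straight from its definition, not by telescoping.
H1 = H(p1.values())
H2_given_1  = -sum(p * math.log2(p / p1[(x1,)])
                   for (x1, x2), p in p12.items() if p > 0)
H3_given_12 = -sum(p * math.log2(p / p12[(x1, x2)])
                   for (x1, x2, x3), p in p123.items() if p > 0)

chain_sum = H1 + H2_given_1 + H3_given_12   # should equal H(X1, X2, X3)
```
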

Mutual Information

Mutual information quantifies the information shared between $X$ and $Y$:

$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)$$

Equivalently, in terms of KL divergence:

$$I(X; Y) = D_{\text{KL}}(p(x,y) \| p(x)p(y))$$

Properties:

  • $I(X; Y) \geq 0$, with equality iff $X \perp Y$
  • $I(X; Y) = I(Y; X)$ (symmetric)
  • $I(X; X) = H(X)$ (a variable's mutual information with itself is its entropy)

Conditional mutual information: $I(X; Y | Z) = H(X|Z) - H(X|Y,Z)$.

Chain rule for MI: $I(X_1, \ldots, X_n; Y) = \sum_{i=1}^n I(X_i; Y | X_1, \ldots, X_{i-1})$.
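
The equivalent forms can be cross-checked on a small example (a hypothetical joint distribution with uniform marginals):

```python
import math

def H(probs):
    """Entropy in bits; 0 log 0 := 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {0: 0.5, 1: 0.5}
py = {0: 0.5, 1: 0.5}

# Entropy form: I(X;Y) = H(X) + H(Y) - H(X,Y)
mi_entropy = H(px.values()) + H(py.values()) - H(pxy.values())

# KL form: I(X;Y) = D_KL(p(x,y) || p(x) p(y))
mi_kl = sum(p * math.log2(p / (px[x] * py[y]))
            for (x, y), p in pxy.items() if p > 0)

# I(X;X) = H(X): the "joint" of X with itself lives on the diagonal.
pxx = {(0, 0): 0.5, (1, 1): 0.5}
mi_self = sum(p * math.log2(p / (px[x1] * px[x2]))
              for (x1, x2), p in pxx.items())
```
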

Data Processing Inequality

If $X \to Y \to Z$ forms a Markov chain (i.e., $X$ and $Z$ are conditionally independent given $Y$):

$$I(X; Z) \leq I(X; Y)$$

No processing of $Y$ can increase information about $X$. Equality holds iff $X \to Z \to Y$ also forms a Markov chain (i.e., $Z$ is a sufficient statistic of $Y$ for $X$). This has profound implications:

  • Learned representations cannot contain more task-relevant information than the input
  • Each layer in a neural network can only lose or preserve (never gain) information about the input
  • Motivates the information bottleneck framework
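
The inequality can be checked exactly on a toy chain; below, both links are binary symmetric channels with a hypothetical 10% flip probability:

```python
import math

def mi(pab, pa, pb):
    """Mutual information in bits from a joint pmf and its marginals."""
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in pab.items() if p > 0)

# X -> Y -> Z: X ~ Bernoulli(0.5); each arrow flips its input w.p. 0.1.
px = {0: 0.5, 1: 0.5}
flip = 0.1
chan = {(a, b): (1 - flip) if a == b else flip for a in (0, 1) for b in (0, 1)}

pxy = {(x, y): px[x] * chan[(x, y)] for x in (0, 1) for y in (0, 1)}
py = {y: sum(pxy[(x, y)] for x in (0, 1)) for y in (0, 1)}

# Composing the two channels: p(x, z) = p(x) * sum_y p(y|x) p(z|y).
pxz = {(x, z): sum(px[x] * chan[(x, y)] * chan[(y, z)] for y in (0, 1))
       for x in (0, 1) for z in (0, 1)}
pz = {z: sum(pxz[(x, z)] for x in (0, 1)) for z in (0, 1)}

I_xy = mi(pxy, px, py)
I_xz = mi(pxz, px, pz)   # data processing: I(X;Z) <= I(X;Y)
```

The second hop degrades the signal (two cascaded 10% flips behave like one 18% flip), so $I(X;Z)$ is strictly smaller here.
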

Entropy Rate

For a stochastic process $\{X_i\}$, the entropy rate captures the per-symbol entropy:

$$H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)$$

For stationary processes, equivalently: $H(\mathcal{X}) = \lim_{n \to \infty} H(X_n | X_{n-1}, \ldots, X_1)$.

For a stationary ergodic Markov chain with transition matrix $P$ and stationary distribution $\mu$:

$$H(\mathcal{X}) = -\sum_{i} \mu_i \sum_{j} P_{ij} \log P_{ij}$$

The entropy rate determines the fundamental compression limit for the process (Shannon's source coding theorem).
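
For a concrete two-state chain (transition probabilities made up for illustration), the rate follows directly from the formula above:

```python
import math

# Two-state chain; rows of P sum to 1.
P = [[0.9, 0.1],
     [0.4, 0.6]]

# Stationary distribution by power iteration: mu = mu P.
mu = [0.5, 0.5]
for _ in range(1000):
    mu = [sum(mu[i] * P[i][j] for i in range(2)) for j in range(2)]

# Entropy rate in bits/symbol: H = -sum_i mu_i sum_j P_ij log2 P_ij.
rate = -sum(mu[i] * P[i][j] * math.log2(P[i][j])
            for i in range(2) for j in range(2) if P[i][j] > 0)
```

For this chain $\mu = (0.8, 0.2)$, and the rate (about 0.57 bits/symbol) is below the stationary distribution's entropy, reflecting the memory in the process.
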

Maximum Entropy Principle

Given constraints (e.g., known moments), the maximum entropy distribution is the least informative distribution consistent with those constraints. Under constraints $\mathbb{E}[f_i(X)] = \alpha_i$:

$$p^*(x) = \frac{1}{Z} \exp\left(-\sum_i \lambda_i f_i(x)\right)$$

This yields the exponential family. Examples:

  • Known mean and variance on $\mathbb{R}$: Gaussian
  • Known mean on $[0, \infty)$: Exponential
  • No constraints on a finite set: Uniform
  • Known mean on $\{0, 1, 2, \ldots\}$: Geometric

Jaynes' MaxEnt principle connects information theory to statistical mechanics, where the Boltzmann distribution arises as the maximum entropy distribution under fixed expected energy.
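
As a sketch of the principle (not production code): on a finite support with a mean constraint, bisecting on the Lagrange multiplier $\lambda$ recovers the geometric-like max-entropy pmf $p^*(x) \propto e^{-\lambda x}$, which beats any other distribution with the same mean. All values here are illustrative:

```python
import math

def H(p):
    """Entropy in bits; 0 log 0 := 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def maxent_mean(support, target_mean):
    """Max-entropy pmf on `support` with E[X] = target_mean: p(x) ~ exp(-lam*x).
    Finds lam by bisection; the mean is decreasing in lam."""
    def mean_for(lam):
        w = [math.exp(-lam * x) for x in support]
        return sum(x * wi for x, wi in zip(support, w)) / sum(w)
    lo, hi = -5.0, 5.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if mean_for(mid) > target_mean:
            lo = mid   # need faster decay -> larger lam
        else:
            hi = mid
    w = [math.exp(-lo * x) for x in support]
    Z = sum(w)
    return [wi / Z for wi in w]

support = list(range(21))
p_star = maxent_mean(support, 4.0)

# Another pmf with the same mean (a two-point one) has lower entropy.
p_alt = [0.0] * 21
p_alt[0], p_alt[20] = 0.8, 0.2   # mean = 0.2 * 20 = 4
```
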

Differential Entropy

For a continuous random variable $X$ with density $f$:

$$h(X) = -\int f(x) \log f(x) \, dx$$

Key differences from discrete entropy:

  • Can be negative (e.g., $\text{Uniform}(0, a)$ with $a < 1$)
  • Not invariant under change of variables: $h(g(X)) = h(X) + \mathbb{E}[\log |g'(X)|]$
  • Not a limit of discrete entropy (off by $\log \Delta$ for discretization width $\Delta$)

Notable values: the Gaussian's $h(X) = \frac{1}{2}\log(2\pi e \sigma^2)$ is the maximum differential entropy for fixed variance. Mutual information $I(X;Y)$ remains well-defined and non-negative for continuous variables, unlike differential entropy.
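
Both the closed form and the possibility of negative values are easy to check numerically; the sketch below compares the Gaussian formula against a Monte Carlo estimate of $-\mathbb{E}[\log f(X)]$ (sample size chosen arbitrarily):

```python
import math
import random

random.seed(0)
sigma = 2.0

# Closed form (nats): h(X) = 0.5 * log(2 * pi * e * sigma^2)
h_closed = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def log_density(x, sigma):
    """Log-density of N(0, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - x ** 2 / (2 * sigma ** 2)

# Monte Carlo estimate of h(X) = -E[log f(X)].
n = 200_000
h_mc = -sum(log_density(random.gauss(0, sigma), sigma) for _ in range(n)) / n

# Differential entropy can be negative: Uniform(0, 0.5) has h = log(1/2) < 0.
h_unif = math.log(0.5)
```
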

KL Divergence

The Kullback-Leibler divergence (relative entropy) from $q$ to $p$:

$$D_{\text{KL}}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_p\left[\log \frac{p(X)}{q(X)}\right]$$

Properties:

  • $D_{\text{KL}}(p \| q) \geq 0$ (Gibbs' inequality), with equality iff $p = q$
  • Not symmetric: $D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$ in general
  • Not a metric: violates the triangle inequality
  • Connects to maximum likelihood: minimizing $D_{\text{KL}}(p_{\text{data}} \| p_\theta)$ is equivalent to maximizing $\mathbb{E}_{p_{\text{data}}}[\log p_\theta(X)]$

Forward KL ($D_{\text{KL}}(p \| q)$) is zero-avoiding in $q$ (mean-seeking); reverse KL ($D_{\text{KL}}(q \| p)$) is zero-forcing in $q$ (mode-seeking). This asymmetry is crucial in variational inference.
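
The asymmetry is visible on any pair of distributions that disagree about which outcomes are rare (values below are arbitrary):

```python
import math

def kl(p, q):
    """D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.4, 0.1]
q = [0.1, 0.2, 0.7]

forward = kl(p, q)   # D_KL(p || q)
reverse = kl(q, p)   # D_KL(q || p): a different number in general
```
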

f-Divergences

A general family: for convex $f$ with $f(1) = 0$:

$$D_f(p \| q) = \sum_x q(x) f\left(\frac{p(x)}{q(x)}\right)$$

Special cases:

| Divergence | $f(t)$ |
|---|---|
| KL | $t \log t$ |
| Reverse KL | $-\log t$ |
| Total variation | $\frac{1}{2}\lvert t - 1 \rvert$ |
| $\chi^2$ | $(t-1)^2$ |
| Hellinger | $(\sqrt{t} - 1)^2$ |

Key properties shared by all f-divergences:

  • Non-negative, zero iff $p = q$
  • Invariant under sufficient statistics
  • Satisfy the data processing inequality: $D_f(p_Y \| q_Y) \leq D_f(p_X \| q_X)$ for any channel $Y = g(X)$
  • Variational representation (via the Fenchel conjugate $f^*$): $D_f(p \| q) = \sup_T \{\mathbb{E}_p[T(X)] - \mathbb{E}_q[f^*(T(X))]\}$, enabling neural estimation (f-GAN)
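
A single generic routine covers the whole family; plugging in each $f$ from the table recovers the individual divergences (the distributions below are arbitrary examples, with natural logs):

```python
import math

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) f(p(x)/q(x)); assumes q(x) > 0 everywhere."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.4, 0.1]
q = [0.1, 0.2, 0.7]

kl_f = f_divergence(p, q, lambda t: t * math.log(t) if t > 0 else 0.0)
tv   = f_divergence(p, q, lambda t: 0.5 * abs(t - 1))
chi2 = f_divergence(p, q, lambda t: (t - 1) ** 2)
hell = f_divergence(p, q, lambda t: (math.sqrt(t) - 1) ** 2)

# Cross-check: the f-divergence with f(t) = t log t is exactly KL.
kl_direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```
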

Jensen-Shannon Divergence

A symmetrized, bounded divergence:

$$\text{JSD}(p \| q) = \frac{1}{2} D_{\text{KL}}(p \| m) + \frac{1}{2} D_{\text{KL}}(q \| m), \quad m = \frac{p + q}{2}$$

Properties:

  • $0 \leq \text{JSD}(p \| q) \leq \log 2$ (i.e., at most 1 bit)
  • Symmetric: $\text{JSD}(p \| q) = \text{JSD}(q \| p)$
  • $\sqrt{\text{JSD}}$ is a proper metric (satisfies the triangle inequality)
  • JSD is an f-divergence with $f(t) = t\log t - (t+1)\log\frac{t+1}{2}$

The generalized JSD with weights $\pi_1, \ldots, \pi_n$ and distributions $p_1, \ldots, p_n$:

$$\text{JSD}_\pi(p_1, \ldots, p_n) = H\left(\sum_i \pi_i p_i\right) - \sum_i \pi_i H(p_i)$$

JSD was the original GAN training objective. Its boundedness causes training instabilities when the supports of $p$ and $q$ have minimal overlap (the vanishing gradient problem), motivating Wasserstein distances.
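
A short sketch showing the bounds, the symmetry, and the saturation at $\log 2$ (1 bit here) for distributions with disjoint supports:

```python
import math

def kl(p, q):
    """D_KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence in bits via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [1.0, 0.0]
q = [0.0, 1.0]   # disjoint support from p: JSD saturates at 1 bit
r = [0.5, 0.5]

j_pq = jsd(p, q)
```

Saturation is exactly the vanishing gradient problem: once the supports are disjoint, moving $q$ around (while staying disjoint) leaves the objective flat.
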

Fano's Inequality

A lower bound on the error probability of estimating $X$ from noisy observations. If $X \to Y \to \hat{X}$ with $P_e = \Pr(\hat{X} \neq X)$:

$$H(X|Y) \leq H_b(P_e) + P_e \log(|\mathcal{X}| - 1)$$

Rearranging gives a lower bound on error probability in terms of conditional entropy. This is the key tool for proving converse results in information theory (showing that rates beyond capacity are unachievable).
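
A numeric check on a small symmetric channel (illustrative values; for this channel Fano's bound happens to hold with equality, since errors are uniform over the wrong symbols):

```python
import math

def H(probs):
    """Entropy in bits; 0 log 0 := 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# X uniform on {0,1,2}; the channel outputs the true symbol w.p. 0.7
# and each of the two wrong symbols w.p. 0.15.
n = 3
pxy = {(x, y): (1 / n) * (0.7 if x == y else 0.15)
       for x in range(n) for y in range(n)}
py = {y: sum(pxy[(x, y)] for x in range(n)) for y in range(n)}

H_x_given_y = H(pxy.values()) - H(py.values())   # H(X|Y) = H(X,Y) - H(Y)

# The MAP estimator picks xhat = y; its error probability:
Pe = 1 - sum(pxy[(x, x)] for x in range(n))      # = 0.3

# Fano: H(X|Y) <= H_b(Pe) + Pe * log2(|X| - 1)
fano_bound = H([Pe, 1 - Pe]) + Pe * math.log2(n - 1)
```
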

Pinsker's Inequality

Relates total variation to KL divergence:

$$\delta(p, q) \leq \sqrt{\frac{1}{2} D_{\text{KL}}(p \| q)}$$

where $\delta(p,q) = \frac{1}{2}\sum_x |p(x) - q(x)|$ is the total variation distance. This bound is tight up to constants and bridges information-theoretic and statistical distances.
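
A quick empirical check over random distribution pairs (note Pinsker's inequality uses KL in nats):

```python
import math
import random

random.seed(1)

def kl_nats(p, q):
    """D_KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def random_dist(k):
    w = [random.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

# Pinsker: delta(p, q) <= sqrt(KL(p || q) / 2) for every pair.
ok = True
for _ in range(1000):
    p, q = random_dist(5), random_dist(5)
    ok = ok and total_variation(p, q) <= math.sqrt(0.5 * kl_nats(p, q)) + 1e-12
```
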

Information Geometry (Brief)

KL divergence induces a Riemannian structure on the statistical manifold. The Fisher information matrix:

$$[I(\theta)]_{ij} = \mathbb{E}\left[\frac{\partial \log p(X|\theta)}{\partial \theta_i} \frac{\partial \log p(X|\theta)}{\partial \theta_j}\right]$$

serves as the metric tensor. Locally, $D_{\text{KL}}(p_\theta \| p_{\theta+d\theta}) \approx \frac{1}{2} d\theta^T I(\theta) d\theta$. Natural gradient descent uses $I(\theta)^{-1} \nabla_\theta$ instead of $\nabla_\theta$, providing invariance to parameterization and faster convergence (e.g., in policy gradient methods).
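
The local quadratic behavior can be checked on a one-parameter Bernoulli family, whose Fisher information is $1/(\theta(1-\theta))$ (step size chosen for illustration):

```python
import math

def bern_kl_nats(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

theta = 0.3
fisher = 1.0 / (theta * (1 - theta))   # Fisher information of Bernoulli(theta)

# Local quadratic approximation: KL(p_theta || p_{theta+d}) ~ (1/2) I(theta) d^2
d = 1e-3
exact = bern_kl_nats(theta, theta + d)
approx = 0.5 * fisher * d ** 2
```

For small $d$ the two values agree to within $O(d^3)$, which is the sense in which the Fisher matrix acts as a local metric.
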