Entropy and Information
Self-Information
The self-information (surprisal) of an event $x$ with probability $p(x)$ quantifies the information gained upon observing it:

$$I(x) = -\log p(x)$$
Properties:
- $I(x) \ge 0$ (non-negative)
- $I(x) = 0$ when $p(x) = 1$ (certain events carry no information)
- Rare events carry more information: $I(x)$ grows as $p(x) \to 0$
- For independent events: $I(x, y) = I(x) + I(y)$ (additivity)
Base $2$ gives bits, base $e$ gives nats, base $3$ gives trits.
Shannon Entropy
The Shannon entropy is the expected self-information of a random variable $X$ with distribution $p$:

$$H(X) = -\sum_x p(x) \log p(x)$$
Key properties:
- $H(X) \ge 0$, with equality iff $X$ is deterministic
- $H(X) \le \log |\mathcal{X}|$, with equality iff $X$ is uniform
- Concavity: $H$ is concave in $p$
- Permutation invariance: $H$ depends only on the multiset of probabilities
- Continuity in $p$
The binary entropy function $H_2(p) = -p \log_2 p - (1-p) \log_2 (1-p)$ is fundamental in coding theory. It achieves its maximum of 1 bit at $p = 1/2$.
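The binary entropy function is easy to check numerically; a minimal sketch in Python (the function name is my own):

```python
from math import log2

def binary_entropy(p: float) -> float:
    """H2(p) = -p*log2(p) - (1-p)*log2(1-p), with H2(0) = H2(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

# Symmetric about p = 1/2, where it attains its maximum of 1 bit.
print(binary_entropy(0.5))  # 1.0
```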
Joint and Conditional Entropy
Joint entropy of $(X, Y)$:

$$H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y)$$

Conditional entropy measures remaining uncertainty in $X$ given $Y$:

$$H(X \mid Y) = -\sum_{x,y} p(x, y) \log p(x \mid y) = H(X, Y) - H(Y)$$
Critical inequalities:
- $H(X \mid Y) \le H(X)$ (conditioning reduces entropy)
- $H(X, Y) \le H(X) + H(Y)$ (subadditivity, equality iff independent)
- $H(X \mid Y) \ge 0$ (for discrete variables; can be negative for continuous)
Chain Rule for Entropy
For random variables $X_1, \dots, X_n$:

$$H(X_1, \dots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1})$$
This decomposes joint entropy into successive conditional entropies and is the foundation for sequential coding schemes.
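The chain rule can be verified on a small example; a minimal sketch, using a made-up joint distribution:

```python
from math import log2

# Hypothetical joint distribution p(x, y) on {0,1} x {0,1} (made-up numbers).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

p_x = {x: p_xy[(x, 0)] + p_xy[(x, 1)] for x in (0, 1)}

# H(Y | X) = sum_x p(x) * H(Y | X = x), computed from the conditionals directly.
h_y_given_x = sum(p_x[x] * H([p_xy[(x, y)] / p_x[x] for y in (0, 1)])
                  for x in (0, 1))

# Chain rule: H(X, Y) = H(X) + H(Y | X).
assert abs(H(p_xy.values()) - (H(p_x.values()) + h_y_given_x)) < 1e-12
```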
Mutual Information
Mutual information quantifies shared information between $X$ and $Y$:

$$I(X; Y) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y)$$

Equivalently, in terms of KL divergence:

$$I(X; Y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\,p(y)\big)$$
Properties:
- $I(X; Y) \ge 0$, with equality iff $X$ and $Y$ are independent
- $I(X; Y) = I(Y; X)$ (symmetric)
- $I(X; X) = H(X)$ (self-information equals entropy)
Conditional mutual information: $I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)$.
Chain rule for MI: $I(X_1, \dots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_1, \dots, X_{i-1})$.
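The two definitions (via entropies and via KL) agree, which is easy to confirm numerically; a minimal sketch with a made-up joint distribution:

```python
from math import log2

# Hypothetical correlated joint distribution (made-up numbers).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: p_xy[(x, 0)] + p_xy[(x, 1)] for x in (0, 1)}
p_y = {y: p_xy[(0, y)] + p_xy[(1, y)] for y in (0, 1)}

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Definition via entropies: I(X;Y) = H(X) + H(Y) - H(X,Y).
mi_entropy = H(p_x.values()) + H(p_y.values()) - H(p_xy.values())

# Definition via KL: I(X;Y) = KL(p(x,y) || p(x) p(y)).
mi_kl = sum(p * log2(p / (p_x[x] * p_y[y]))
            for (x, y), p in p_xy.items() if p > 0)

assert abs(mi_entropy - mi_kl) < 1e-12
assert mi_entropy > 0  # X and Y are dependent here
```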
Data Processing Inequality
If $X \to Y \to Z$ forms a Markov chain (i.e., $X$ and $Z$ are conditionally independent given $Y$):

$$I(X; Z) \le I(X; Y)$$

No processing of $Y$ can increase information about $X$. Equality holds iff $X \to Z \to Y$ also forms a Markov chain (i.e., $Z$ is a sufficient statistic of $Y$ for $X$). This has profound implications:
- Learned representations cannot contain more task-relevant information than the input
- Each layer in a neural network can only lose or preserve (never gain) information about the input
- Motivates the information bottleneck framework
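A concrete illustration: composing two noisy channels can only lose information. A minimal sketch, where both channels are binary symmetric with a made-up flip probability:

```python
from math import log2

def mutual_info(p_joint):
    """I(A;B) in bits from a joint distribution {(a, b): prob}."""
    p_a, p_b = {}, {}
    for (a, b), p in p_joint.items():
        p_a[a] = p_a.get(a, 0.0) + p
        p_b[b] = p_b.get(b, 0.0) + p
    return sum(p * log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in p_joint.items() if p > 0)

# Hypothetical Markov chain X -> Y -> Z: each arrow is a binary symmetric
# channel flipping its input with probability 0.1 (made-up noise level).
flip = 0.1
p_xy = {(x, y): 0.5 * (flip if x != y else 1 - flip)
        for x in (0, 1) for y in (0, 1)}
# p(x, z) = sum_y p(x, y) * p(z | y); Z depends on X only through Y.
p_xz = {(x, z): sum(p_xy[(x, y)] * (flip if y != z else 1 - flip)
                    for y in (0, 1))
        for x in (0, 1) for z in (0, 1)}

# Data processing inequality: I(X; Z) <= I(X; Y).
assert mutual_info(p_xz) <= mutual_info(p_xy) + 1e-12
```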
Entropy Rate
For a stochastic process $\{X_i\}$, the entropy rate captures the per-symbol entropy:

$$H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, \dots, X_n)$$

For stationary processes, equivalently: $H(\mathcal{X}) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, \dots, X_1)$.
For a stationary ergodic Markov chain with transition matrix $P$ and stationary distribution $\mu$:

$$H(\mathcal{X}) = -\sum_{i,j} \mu_i P_{ij} \log P_{ij}$$
The entropy rate determines the fundamental compression limit for the process (Shannon's source coding theorem).
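The Markov-chain formula is straightforward to evaluate; a minimal sketch with a made-up two-state chain:

```python
from math import log2

# Hypothetical two-state chain with transition matrix P (made-up numbers).
P = [[0.9, 0.1],
     [0.4, 0.6]]

# The stationary distribution solves mu P = mu; for a two-state chain it is
# proportional to (P[1][0], P[0][1]).
mu = [P[1][0] / (P[0][1] + P[1][0]),
      P[0][1] / (P[0][1] + P[1][0])]

# Entropy rate in bits per symbol: H = -sum_{i,j} mu_i P_ij log2 P_ij.
rate = -sum(mu[i] * P[i][j] * log2(P[i][j])
            for i in range(2) for j in range(2) if P[i][j] > 0)
print(rate)
```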
Maximum Entropy Principle
Given constraints (e.g., known moments), the maximum entropy distribution is the least informative distribution consistent with those constraints. Under constraints $\mathbb{E}[f_k(X)] = c_k$:

$$p(x) \propto \exp\left(\sum_k \lambda_k f_k(x)\right)$$
This yields the exponential family. Examples:
- Known mean and variance on $\mathbb{R}$: Gaussian
- Known mean on $[0, \infty)$: Exponential
- No constraints on a finite set: Uniform
- Known mean on $\{0, 1, 2, \dots\}$: Geometric
Jaynes' MaxEnt principle connects information theory to statistical mechanics, where the Boltzmann distribution arises as the maximum entropy distribution under fixed expected energy.
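The principle can be checked numerically on Jaynes' classic biased-die example; a minimal sketch (the bisection solver and comparison distribution are my own choices):

```python
from math import exp, log

# MaxEnt on {1,...,6} under a single mean constraint E[X] = 4.5 (Jaynes' dice).
# The solution has exponential-family form p(x) ∝ exp(lambda * x); we solve
# for lambda by bisection on the resulting mean.
xs = list(range(1, 7))

def mean_for(lam):
    w = [exp(lam * x) for x in xs]
    return sum(x * wi for x, wi in zip(xs, w)) / sum(w)

lo, hi = 0.0, 5.0
for _ in range(100):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if mean_for(mid) < 4.5 else (lo, mid)
lam = (lo + hi) / 2

w = [exp(lam * x) for x in xs]
p = [wi / sum(w) for wi in w]

def H(probs):
    """Entropy in nats."""
    return -sum(q * log(q) for q in probs if q > 0)

# Any other distribution with the same mean has lower entropy, e.g. this one:
q = [0.0, 0.0, 0.1, 0.3, 0.6, 0.0]  # mean = 3*0.1 + 4*0.3 + 5*0.6 = 4.5
assert abs(sum(x * pi for x, pi in zip(xs, p)) - 4.5) < 1e-9
assert H(p) > H(q)
```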
Differential Entropy
For a continuous random variable $X$ with density $f$:

$$h(X) = -\int f(x) \log f(x)\, dx$$
Key differences from discrete entropy:
- Can be negative (e.g., $X \sim \mathrm{Uniform}(0, a)$ with $a < 1$ has $h(X) = \log a < 0$)
- Not invariant under change of variables: $h(aX) = h(X) + \log |a|$
- Not a limit of discrete entropy: the entropy of a $\Delta$-quantized version behaves as $h(X) - \log \Delta$ (off by $\log \Delta$ for discretization width $\Delta$)
Notable values: the Gaussian, with $h = \frac{1}{2} \log(2\pi e \sigma^2)$, has the maximum differential entropy for fixed variance. Mutual information remains well-defined and non-negative for continuous variables, unlike differential entropy.
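The Gaussian value can be sanity-checked by Monte Carlo, since $h(X) = \mathbb{E}[-\log f(X)]$; a minimal sketch (sample size and seed are arbitrary choices):

```python
from math import log, pi, e
import random

# Differential entropy (nats) of a Gaussian: h = (1/2) ln(2 pi e sigma^2).
def gaussian_h(sigma2):
    return 0.5 * log(2 * pi * e * sigma2)

# Uniform(0, a) has h = ln(a), negative whenever a < 1.
assert log(0.5) < 0

# Monte Carlo check via h = E[-ln f(X)] for sigma = 1:
# -ln f(x) = (1/2) ln(2 pi) + x^2 / 2.
random.seed(0)
n = 200_000
est = sum(0.5 * log(2 * pi) + 0.5 * random.gauss(0.0, 1.0) ** 2
          for _ in range(n)) / n
assert abs(est - gaussian_h(1.0)) < 0.01
```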
KL Divergence
The Kullback-Leibler divergence (relative entropy) of $P$ from $Q$:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$
Properties:
- $D_{\mathrm{KL}}(P \,\|\, Q) \ge 0$ (Gibbs' inequality), with equality iff $P = Q$
- Not symmetric: $D_{\mathrm{KL}}(P \,\|\, Q) \neq D_{\mathrm{KL}}(Q \,\|\, P)$ in general
- Not a metric: violates the triangle inequality
- Connects to maximum likelihood: minimizing $D_{\mathrm{KL}}(\hat{p}_{\mathrm{data}} \,\|\, p_\theta)$ over $\theta$ is equivalent to maximizing the likelihood
Forward KL ($D_{\mathrm{KL}}(P \,\|\, Q)$, minimized over $Q$): zero-avoiding in $Q$ (mean-seeking). Reverse KL ($D_{\mathrm{KL}}(Q \,\|\, P)$): zero-forcing in $Q$ (mode-seeking). This asymmetry is crucial in variational inference.
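The asymmetry is easy to see on a small example; a minimal sketch with made-up distributions:

```python
from math import log

def kl(p, q):
    """KL divergence in nats between two discrete distributions."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

print(kl(p, q), kl(q, p))  # two different non-negative numbers
assert kl(p, q) >= 0 and kl(q, p) >= 0
assert abs(kl(p, q) - kl(q, p)) > 1e-6  # D(P||Q) != D(Q||P) in general
assert kl(p, p) == 0
```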
f-Divergences
A general family: for convex $f$ with $f(1) = 0$:

$$D_f(P \,\|\, Q) = \sum_x q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)$$

Special cases:

| Divergence | $f(t)$ |
|---|---|
| KL | $t \log t$ |
| Reverse KL | $-\log t$ |
| Total variation | $\frac{1}{2} \lvert t - 1 \rvert$ |
| $\chi^2$ | $(t - 1)^2$ |
| Hellinger (squared) | $(\sqrt{t} - 1)^2$ |
Key properties shared by all f-divergences:
- Non-negative, zero iff $P = Q$
- Invariant under sufficient statistics
- Satisfy the data processing inequality: $D_f(P_Y \,\|\, Q_Y) \le D_f(P_X \,\|\, Q_X)$ for any channel $X \to Y$
- Variational representation (Fenchel conjugate): $D_f(P \,\|\, Q) = \sup_T \big\{ \mathbb{E}_P[T(X)] - \mathbb{E}_Q[f^*(T(X))] \big\}$, enabling neural estimation (f-GAN)
Jensen-Shannon Divergence
A symmetrized, bounded divergence:

$$\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \,\|\, M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \,\|\, M), \qquad M = \tfrac{1}{2}(P + Q)$$
Properties:
- $0 \le \mathrm{JSD}(P \,\|\, Q) \le 1$ (in bits)
- Symmetric: $\mathrm{JSD}(P \,\|\, Q) = \mathrm{JSD}(Q \,\|\, P)$
- $\sqrt{\mathrm{JSD}}$ is a proper metric (satisfies triangle inequality)
- JSD is an f-divergence with $f(t) = \frac{t}{2} \log \frac{2t}{t+1} + \frac{1}{2} \log \frac{2}{t+1}$
The generalized JSD with weights $\pi_i$ and distributions $P_i$:

$$\mathrm{JSD}_{\pi}(P_1, \dots, P_n) = H\!\left(\sum_i \pi_i P_i\right) - \sum_i \pi_i H(P_i)$$
JSD was the original GAN training objective. Its boundedness causes training instabilities when the supports of the data and generator distributions have minimal overlap (the vanishing gradient problem), motivating Wasserstein distances.
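The definition and the 1-bit bound can be checked directly; a minimal sketch (function names are my own):

```python
from math import log2

def kl_bits(p, q):
    """KL divergence in bits."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """JSD(P, Q) = 0.5 KL(P || M) + 0.5 KL(Q || M) with M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

# Disjoint supports saturate the 1-bit bound:
assert abs(jsd([1.0, 0.0], [0.0, 1.0]) - 1.0) < 1e-12

r, s = [0.6, 0.4], [0.5, 0.5]
assert abs(jsd(r, s) - jsd(s, r)) < 1e-12  # symmetric
assert 0 <= jsd(r, s) <= 1
```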
Fano's Inequality
A lower bound on error probability for estimation from noisy observations. If $X \to Y \to \hat{X}$ with $P_e = \Pr(\hat{X} \neq X)$:

$$H(X \mid Y) \le H_2(P_e) + P_e \log\big(|\mathcal{X}| - 1\big)$$
Rearranging gives a lower bound on error probability in terms of conditional entropy. This is the key tool for proving converse results in information theory (showing that rates beyond capacity are unachievable).
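A minimal numeric check, using a made-up symmetric channel where Fano happens to hold with equality:

```python
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# X uniform on {0,1,2}; Y = X with prob 0.8, else each wrong symbol with
# prob 0.1 (made-up channel). The MAP estimator is X_hat = Y, so Pe = 0.2.
h_x_given_y = H([0.8, 0.1, 0.1])  # same for every y by symmetry
pe = 0.2
fano_bound = H([pe, 1 - pe]) + pe * log2(3 - 1)

assert h_x_given_y <= fano_bound + 1e-12
# For this symmetric channel the bound is met with equality:
assert abs(h_x_given_y - fano_bound) < 1e-12
```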
Pinsker's Inequality
Relates total variation to KL divergence (with KL in nats):

$$\delta(P, Q) \le \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(P \,\|\, Q)}$$

where $\delta(P, Q) = \frac{1}{2} \sum_x |p(x) - q(x)|$ is the total variation distance. This is tight up to constants and bridges information-theoretic and statistical distances.
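The inequality can be stress-tested on random distributions; a minimal sketch (alphabet size, sample count, and seed are arbitrary choices):

```python
from math import log, sqrt
import random

def kl(p, q):
    """KL divergence in nats."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv(p, q):
    """Total variation distance: (1/2) * sum |p_i - q_i|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Pinsker: TV(P, Q) <= sqrt(KL(P || Q) / 2), on random 4-point distributions.
random.seed(1)
for _ in range(1000):
    a = [random.random() + 1e-3 for _ in range(4)]
    b = [random.random() + 1e-3 for _ in range(4)]
    p = [x / sum(a) for x in a]
    q = [x / sum(b) for x in b]
    assert tv(p, q) <= sqrt(kl(p, q) / 2) + 1e-12
```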
Information Geometry (Brief)
KL divergence induces a Riemannian structure on the statistical manifold. The Fisher information matrix:

$$\mathcal{F}_{ij}(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[\partial_{\theta_i} \log p_\theta(x)\; \partial_{\theta_j} \log p_\theta(x)\right]$$

serves as the metric tensor. Locally, $D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta + d\theta}) \approx \frac{1}{2}\, d\theta^\top \mathcal{F}(\theta)\, d\theta$. Natural gradient descent uses $\mathcal{F}^{-1} \nabla_\theta L$ instead of $\nabla_\theta L$, providing invariance to parameterization and faster convergence (e.g., in policy gradient methods).
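The local quadratic expansion of KL in terms of Fisher information can be checked in one dimension; a minimal sketch for a Bernoulli family, where $\mathcal{F}(\theta) = 1/(\theta(1-\theta))$:

```python
from math import log

def kl_bernoulli(a, b):
    """KL divergence in nats between Bernoulli(a) and Bernoulli(b)."""
    return a * log(a / b) + (1 - a) * log((1 - a) / (1 - b))

# Check KL(p_theta || p_{theta + d}) ≈ (1/2) F(theta) d^2 for small d
# (theta and d are made-up values).
theta, d = 0.3, 1e-3
fisher = 1.0 / (theta * (1 - theta))
quadratic = 0.5 * fisher * d ** 2
exact = kl_bernoulli(theta, theta + d)
assert abs(exact - quadratic) / quadratic < 0.01
```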