Information Theory in Machine Learning
Information Bottleneck
The information bottleneck (IB) method (Tishby, Pereira, Bialek, 1999) formalizes the tradeoff between compression and prediction. Given a joint distribution $p(X, Y)$, find a compressed representation $T$ of $X$ that retains maximal information about $Y$:

$$\min_{p(t \mid x)} \; I(X;T) - \beta \, I(T;Y)$$
subject to the Markov chain $Y \to X \to T$.
The Lagrange multiplier $\beta$ controls the tradeoff: $\beta \to 0$ yields maximum compression ($T$ independent of $X$), $\beta \to \infty$ yields maximum prediction ($T$ retains everything $X$ knows about $Y$).
Self-consistent equations (iterative solution):

$$p(t \mid x) = \frac{p(t)}{Z(x, \beta)} \exp\!\big(-\beta \, D_{\mathrm{KL}}[\, p(y \mid x) \,\|\, p(y \mid t) \,]\big)$$
$$p(t) = \sum_x p(x) \, p(t \mid x), \qquad p(y \mid t) = \frac{1}{p(t)} \sum_x p(y \mid x) \, p(t \mid x) \, p(x)$$

where $Z(x, \beta)$ is a normalization constant. This has the form of a rate-distortion problem with the KL divergence $D_{\mathrm{KL}}[p(y \mid x) \| p(y \mid t)]$ as the distortion measure.
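For a small discrete alphabet the three self-consistent equations above can be iterated directly. The sketch below (function name, iteration count, and the small epsilon guards are illustrative choices, not part of the original algorithm statement) alternates the encoder, marginal, and decoder updates:

```python
import numpy as np

def ib_iterate(p_xy, n_t, beta, n_iter=200, seed=0):
    """Iterate the self-consistent IB equations for a discrete p(x, y).

    p_xy : (|X|, |Y|) joint distribution; n_t : cardinality of T.
    Returns the encoder p(t|x) as an (|X|, n_t) row-stochastic array.
    """
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)                      # p(x)
    p_y_given_x = p_xy / p_x[:, None]           # p(y|x)
    # random initial encoder p(t|x)
    q = rng.random((n_x, n_t))
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_t = p_x @ q                           # p(t) = sum_x p(x) p(t|x)
        p_xt = q * p_x[:, None]                 # joint p(x, t)
        p_x_given_t = p_xt / (p_t[None, :] + 1e-12)
        p_y_given_t = p_x_given_t.T @ p_y_given_x    # (n_t, |Y|)
        # D_KL[p(y|x) || p(y|t)] for every (x, t) pair
        kl = np.sum(p_y_given_x[:, None, :] *
                    (np.log(p_y_given_x[:, None, :] + 1e-12) -
                     np.log(p_y_given_t[None, :, :] + 1e-12)), axis=2)
        # encoder update: p(t|x) proportional to p(t) exp(-beta * KL)
        q = p_t[None, :] * np.exp(-beta * kl)
        q /= q.sum(axis=1, keepdims=True)       # normalization Z(x, beta)
    return q
```

Larger $\beta$ pushes the encoder toward determinism (prediction), smaller $\beta$ toward a collapsed, maximally compressed $T$.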
IB and Deep Learning
Shwartz-Ziv and Tishby (2017) proposed the deep learning as information bottleneck hypothesis: during training, networks undergo two phases:
- Fitting phase: $I(T;Y)$ increases (the network learns to predict)
- Compression phase: $I(X;T)$ decreases (the network forgets irrelevant details)
This claim generated significant debate. Critiques:
- Saxe et al. (2018): compression phase depends on activation function (observed with tanh, not ReLU); may be an artifact of binning-based MI estimation
- MI estimation in high dimensions is notoriously difficult; results are estimator-dependent
- The framework assumes a fixed joint distribution, while deep learning operates on finite samples
Despite controversies, the IB perspective remains influential for understanding representation learning, regularization, and the role of noise in training.
Variational Information Bottleneck
Alemi et al. (2017) derived a tractable variational bound for the IB objective:

$$\mathcal{L}_{\mathrm{VIB}} = \mathbb{E}_{p(x,y)} \, \mathbb{E}_{p(t \mid x)} \big[ -\log q(y \mid t) \big] + \beta \, \mathbb{E}_{p(x)} \, D_{\mathrm{KL}}[\, p(t \mid x) \,\|\, r(t) \,]$$

where $q(y \mid t)$ is a variational decoder and $r(t)$ is a variational prior. This is closely related to the $\beta$-VAE objective and connects the IB to variational autoencoders.
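A minimal numpy sketch of the objective above, assuming a diagonal-Gaussian encoder, a standard-normal variational prior $r(t) = \mathcal{N}(0, I)$ (so the KL term has a closed form), and a softmax decoder; all names and shapes are illustrative:

```python
import numpy as np

def vib_loss(mu, sigma, logits, y, beta=1e-3):
    """Single-sample variational IB objective.

    mu, sigma : (B, d) parameters of the Gaussian encoder p(t|x)
    logits    : (B, C) decoder scores q(y|t) for a sampled t
    y         : (B,) integer labels
    """
    # cross-entropy term  E[-log q(y|t)]  via a stable log-softmax
    logits = logits - logits.max(axis=1, keepdims=True)
    log_q = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_q[np.arange(len(y)), y].mean()
    # closed-form KL[N(mu, diag(sigma^2)) || N(0, I)] -- the prior r(t)
    kl = 0.5 * (sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma)).sum(axis=1).mean()
    return ce + beta * kl
```

With `mu = 0`, `sigma = 1` the KL term vanishes and the loss reduces to the plain cross-entropy, recovering the unregularized classifier as the $\beta \to 0$ limit.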
InfoNCE and Contrastive Learning
InfoNCE (van den Oord et al., 2018, CPC): a contrastive loss that provides a lower bound on mutual information:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\,\mathbb{E} \left[ \log \frac{e^{f(x, y_1)}}{\sum_{i=1}^{K} e^{f(x, y_i)}} \right]$$

where $f$ is a critic function, $(x, y_1)$ is a positive pair with $y_1 \sim p(y \mid x)$, and $y_2, \dots, y_K \sim p(y)$ are negative samples.
Relation to MI: $I(X;Y) \ge \log K - \mathcal{L}_{\mathrm{InfoNCE}}$. The bound saturates at $\log K$, so more negatives yield tighter bounds.
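The bound is easy to compute from a batch of critic scores. A sketch, assuming the common in-batch convention where diagonal entries of the score matrix are the positive pairs and each row's off-diagonal entries serve as negatives:

```python
import numpy as np

def infonce_bound(f_matrix):
    """InfoNCE lower bound on I(X;Y), in nats.

    f_matrix[i, j] = f(x_i, y_j); diagonal entries are positives,
    so K equals the batch size. Returns log K - L_InfoNCE.
    """
    K = f_matrix.shape[0]
    row_max = f_matrix.max(axis=1, keepdims=True)
    log_softmax = f_matrix - row_max - np.log(
        np.exp(f_matrix - row_max).sum(axis=1, keepdims=True))
    loss = -np.mean(np.diag(log_softmax))       # L_InfoNCE
    return np.log(K) - loss                     # always <= log K
```

A critic that perfectly separates positives from negatives drives the bound to its ceiling $\log K$; an uninformative critic gives a bound of 0.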
InfoNCE is the foundation of modern contrastive learning:
- SimCLR (Chen et al., 2020): contrastive learning on augmented views, with projection head
- MoCo (He et al., 2020): momentum-updated encoder for consistent negative representations
- CLIP (Radford et al., 2021): contrastive image-text pretraining
- BYOL, SimSiam: avoid negatives via asymmetric architectures (stop-gradient, predictor), departing from strict MI maximization
Limitations of MI maximization: McAllester and Stratos (2020) showed that the sample complexity of MI estimation scales exponentially in the true MI, bounding the utility of MI-based objectives for high-MI pairs. Tschannen et al. (2020) argued that representation quality in contrastive learning depends more on the encoder architecture and critic design than on the tightness of the MI bound.
Mutual Information Estimation
MINE (Mutual Information Neural Estimation)
Belghazi et al. (2018): use the Donsker-Varadhan representation of the KL divergence:

$$I(X;Y) = D_{\mathrm{KL}}\big(p(x,y) \,\|\, p(x)p(y)\big) \ge \sup_{\theta} \; \mathbb{E}_{p(x,y)}[T_\theta(x,y)] - \log \mathbb{E}_{p(x)p(y)}\big[e^{T_\theta(x,y)}\big]$$

where $T_\theta$ is parameterized by a neural network. The bound is tight when $T_\theta(x,y) = \log \frac{p(x,y)}{p(x)p(y)} + \text{const}$.
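The tightness condition can be verified numerically: for correlated Gaussians the optimal critic (the log density ratio) is known in closed form, and plugging it into the DV bound recovers the analytic MI $-\tfrac{1}{2}\log(1-\rho^2)$. A sketch, with no neural network involved (sample size and $\rho$ are arbitrary choices):

```python
import numpy as np

def dv_bound(t_joint, t_marg):
    """Donsker-Varadhan lower bound: E_P[T] - log E_Q[e^T]."""
    return t_joint.mean() - np.log(np.mean(np.exp(t_marg)))

rng = np.random.default_rng(0)
rho, n = 0.8, 200_000
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

def t_star(x, y, rho=rho):
    # optimal critic: log p(x,y) / (p(x) p(y)) for correlation rho
    return (-0.5 * np.log(1 - rho**2)
            - (rho**2 * (x**2 + y**2) - 2 * rho * x * y) / (2 * (1 - rho**2)))

# marginal samples via permutation breaks the pairing
est = dv_bound(t_star(x, y), t_star(x, rng.permutation(y)))
true_mi = -0.5 * np.log(1 - rho**2)   # analytic MI, about 0.51 nats
```

With a learned $T_\theta$ instead of `t_star`, the `log E_Q[e^T]` term is exactly where the practical issues below originate.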
Practical issues:
- High variance of the gradient of the log-partition function; biased gradient estimates
- Exponential moving average of the denominator reduces variance but introduces bias
- Difficult to optimize when MI is large
Other Neural MI Estimators
| Estimator | Bound Type | Variance | Bias |
|---|---|---|---|
| MINE (DV) | Lower (tight) | High | Low (asymptotic) |
| NWJ (f-divergence) | Lower | Lower than MINE | More biased |
| InfoNCE | Lower (saturates at $\log K$) | Low | Bounded |
| SMILE | Lower | Controlled via clipping | Moderate |
| BA (Barber-Agakov) | Lower | Low | Can be loose |
Non-parametric methods: the KSG estimator (Kraskov, Stögbauer, Grassberger) based on $k$-nearest neighbors. Works well in low dimensions but degrades in high dimensions.
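A brute-force sketch of the first KSG estimator for two scalar variables, assuming SciPy is available for the digamma function (the $O(N^2)$ distance computation is for clarity; real implementations use k-d trees):

```python
import numpy as np
from scipy.special import digamma

def ksg_mi(x, y, k=3):
    """KSG estimator (algorithm 1) of I(X;Y) for scalar x, y, in nats."""
    x, y = x.reshape(-1, 1), y.reshape(-1, 1)
    n = len(x)
    dx = np.abs(x - x.T)                 # pairwise |x_i - x_j|
    dy = np.abs(y - y.T)
    dz = np.maximum(dx, dy)              # max-norm in the joint space
    np.fill_diagonal(dz, np.inf)
    eps = np.sort(dz, axis=1)[:, k - 1]  # distance to k-th joint neighbor
    # counts of strictly closer neighbors in each marginal (excluding self)
    nx = (dx < eps[:, None]).sum(axis=1) - 1
    ny = (dy < eps[:, None]).sum(axis=1) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))
```

On correlated Gaussians with a few hundred samples this lands close to the analytic MI; in high dimensions the neighbor counts concentrate and the estimate degrades, as noted above.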
PAC-Bayes Theory
PAC-Bayes bounds provide generalization guarantees for stochastic classifiers, naturally expressed in information-theoretic terms.
Classic PAC-Bayes bound (McAllester, 1999; stated here in one common form): for any prior $P$ (independent of the data) and posterior $Q$ over hypotheses, with probability $1 - \delta$ over a training set of size $n$:

$$\mathbb{E}_{h \sim Q}[L(h)] \le \mathbb{E}_{h \sim Q}[\hat{L}(h)] + \sqrt{\frac{D_{\mathrm{KL}}(Q \,\|\, P) + \ln(2\sqrt{n}/\delta)}{2n}}$$

where $L$ is the population loss and $\hat{L}$ is the empirical loss.
Interpretation: the generalization gap is controlled by the KL divergence between the learned posterior and the prior, an information-theoretic complexity measure. Models that stay close to the prior (in the KL sense) generalize better.
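Plugging numbers in makes the KL-as-complexity reading concrete. A sketch assuming the square-root form above with a loss bounded in $[0,1]$:

```python
import numpy as np

def mcallester_bound(emp_loss, kl, n, delta=0.05):
    """McAllester-style PAC-Bayes upper bound on the population loss
    of a stochastic classifier (loss assumed bounded in [0, 1])."""
    slack = np.sqrt((kl + np.log(2.0 * np.sqrt(n) / delta)) / (2.0 * n))
    return emp_loss + slack

# a posterior 50 nats from the prior, 10k examples, 5% empirical error:
bound = mcallester_bound(0.05, kl=50.0, n=10_000)
```

Doubling the KL roughly adds $\sqrt{D_{\mathrm{KL}}/2n}$ worth of slack, which is why data-dependent priors (below) help: they shrink the KL term directly.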
Extensions:
- Data-dependent priors: use part of the data to choose $P$, tightening bounds significantly
- Catoni's bound: tighter than the square-root form, using moment-generating function
- PAC-Bayes with unbounded losses: extensions to heavy-tailed losses via truncation or Bernstein-type bounds
- PAC-Bayes for neural networks: Dziugaite and Roy (2017) showed non-vacuous bounds for stochastic neural networks via noise injection and optimization of the bound
Connection to SGD: SGD implicitly performs a form of posterior inference. The "noise" in SGD can be interpreted as limiting the information the algorithm extracts from the training data, providing a PAC-Bayes-like generalization mechanism.
Information-Theoretic Generalization Bounds
Xu-Raginsky Framework
Xu and Raginsky (2017): the expected generalization gap is bounded by the mutual information between the training data $S$ and the learned hypothesis $W$:

$$\big| \mathbb{E}[L(W) - \hat{L}_S(W)] \big| \le \sqrt{\frac{2\sigma^2}{n} \, I(S; W)}$$

where $\sigma$ is the sub-Gaussian parameter of the loss. This directly connects information leakage from data to hypothesis with generalization.
Implications:
- Noisy algorithms (DP-SGD, dropout) limit $I(S;W)$ and hence generalize
- Memorization corresponds to high $I(S;W)$
- Extends to individual-sample MI: $I(Z_i; W)$ controls the contribution of sample $i$
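As a quick numerical reading of the bound above (taking $\sigma = 1/2$, which covers any loss bounded in $[0,1]$):

```python
import numpy as np

def xu_raginsky_gap(mi, n, sigma=0.5):
    """Xu-Raginsky bound on the expected generalization gap:
    sqrt(2 * sigma^2 * I(S;W) / n). sigma = 0.5 suffices for
    losses bounded in [0, 1]."""
    return np.sqrt(2.0 * sigma**2 * mi / n)

# an algorithm that extracts 10 nats from 1000 samples:
gap = xu_raginsky_gap(10.0, 1000)   # about 0.07
```

An algorithm that memorizes (MI growing linearly in $n$) gets a vacuous constant bound, while one whose information extraction grows sublinearly is guaranteed to generalize.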
Conditional Mutual Information (CMI) Bounds
Steinke and Zakynthinou (2020): use a "supersample" construction, bounding generalization by the conditional mutual information between the hypothesis and the bits selecting which $n$ of $2n$ candidate samples were used for training. Since this CMI is at most $n \log 2$, the bounds are never vacuous; they are tighter than raw MI bounds and can explain the generalization of interpolating classifiers.
Classical Complexity Measures
VC Dimension
The Vapnik-Chervonenkis dimension of a hypothesis class $\mathcal{H}$ is the largest set size that $\mathcal{H}$ can shatter (realize all labelings).
VC bound: with probability $1 - \delta$:

$$L(h) \le \hat{L}(h) + O\!\left( \sqrt{\frac{d_{\mathrm{VC}}(\mathcal{H}) \log(n / d_{\mathrm{VC}}) + \log(1/\delta)}{n}} \right)$$
Limitations: VC dimension is a worst-case combinatorial measure. For neural networks, VC dimension scales with number of parameters, yielding vacuous bounds for overparameterized networks. The gap between VC predictions and observed generalization in deep learning is enormous.
Rademacher Complexity
Empirical Rademacher complexity:

$$\hat{\mathcal{R}}_S(\mathcal{H}) = \mathbb{E}_{\sigma} \left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(x_i) \right]$$

where the $\sigma_i$ are i.i.d. Rademacher ($\pm 1$) variables. It measures how well $\mathcal{H}$ can fit random noise.

Generalization bound: $L(h) \le \hat{L}(h) + 2 \hat{\mathcal{R}}_S(\mathcal{H}) + O\!\big(\sqrt{\log(1/\delta)/n}\big)$.
Advantages over VC: data-dependent, can capture structure. For neural networks, norm-based Rademacher bounds (spectral norm, Frobenius norm constraints) give tighter (though still often vacuous) bounds.
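For a finite hypothesis class the expectation over $\sigma$ can be approximated by Monte Carlo, which makes the "fit random noise" reading tangible. A sketch (names and the draw count are arbitrary):

```python
import numpy as np

def empirical_rademacher(preds, n_draws=2000, seed=0):
    """Monte Carlo estimate of empirical Rademacher complexity.

    preds : (|H|, n) matrix of h(x_i) in {-1, +1}, one row per
            hypothesis h in a finite class H on a fixed sample.
    """
    rng = np.random.default_rng(seed)
    _, n = preds.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # Rademacher variables
        total += np.max(preds @ sigma) / n        # sup_h (1/n) sum_i sigma_i h(x_i)
    return total / n_draws
```

A singleton class cannot correlate with random signs (complexity near 0), while a class that shatters the sample attains the maximum value 1: exactly the noise-fitting dichotomy the definition encodes.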
Compression and Generalization
Compression-based generalization (Littlestone and Warmuth, 1986; Arora et al., 2018): if a learning algorithm's output can be described by a subset of $k$ training examples (or $k$ bits), the generalization error scales as $\tilde{O}\big(\sqrt{k/n}\big)$.
Neural network compression: if a trained network can be compressed (pruned, quantized, distilled) to $k$ effective bits, this provides a generalization bound. This partially explains why overparameterized networks generalize: they learn compressible functions.
Connections:
- Flat minima (Hochreiter and Schmidhuber, 1997): parameters in flat minima can be described with fewer bits (less precision needed). Flat minima $\Rightarrow$ compressible $\Rightarrow$ good generalization
- Noise stability: if adding noise to weights doesn't change predictions much, the function is compressible
- PAC-Bayes as compression: $D_{\mathrm{KL}}(Q \,\|\, P)$ measures by how many bits the posterior differs from the prior
MDL for Model Selection
Minimum Description Length applied to model selection in ML:

$$\hat{M} = \arg\min_{M} \; \big[ -\log p(D \mid M) + L(M) \big]$$

where $L(M)$ is the model complexity (parametric complexity, stochastic complexity).
Practical forms:
- BIC approximation: $-\log p(D \mid \hat{\theta}_M) + \frac{k}{2} \log n$ (coincides with MDL to first order)
- Normalized maximum likelihood (NML): minimax optimal but often intractable
- Prequential (predictive) MDL: $\sum_i -\log p(x_i \mid x_{<i})$ using sequential prediction. No explicit model-complexity term; complexity is implicit in prediction quality
MDL naturally handles:
- Model order selection (polynomial degree, number of clusters, tree depth)
- Feature selection (description length of features + data given features)
- Regularization (compression penalty = regularization strength)
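A worked instance of MDL model-order selection via the BIC form: choosing a polynomial degree by trading residual fit against $\frac{k}{2}\log n$. The sketch assumes Gaussian residuals so the negative log-likelihood has a closed form; function name and parameter count convention are illustrative:

```python
import numpy as np

def bic_poly_degree(x, y, max_degree=8):
    """Select a polynomial degree by minimizing
    -log p(D | theta_hat) + (k/2) log n, assuming Gaussian residuals."""
    n = len(x)
    scores = []
    for d in range(max_degree + 1):
        coeffs = np.polyfit(x, y, d)
        resid = y - np.polyval(coeffs, x)
        sigma2 = max(resid.var(), 1e-12)
        # Gaussian NLL at the MLE variance
        neg_loglik = 0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
        k = d + 2                       # d+1 coefficients + noise variance
        scores.append(neg_loglik + 0.5 * k * np.log(n))
    return int(np.argmin(scores))
```

On data generated by a cubic, degrees below 3 pay heavily in description length of the residuals, while degrees above 3 pay the $\frac{1}{2}\log n$ per extra parameter without buying enough fit.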
Information-Theoretic Privacy
Differential Privacy via Information Theory
Differential privacy (DP) limits information leakage about any individual: a mechanism $\mathcal{M}$ is $(\varepsilon, \delta)$-DP if

$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \Pr[\mathcal{M}(D') \in S] + \delta$$

for adjacent datasets $D, D'$ differing in one record.

MI bound: $\varepsilon$-DP implies $I(X_i; \mathcal{M}(D)) = O(\varepsilon)$ per individual (for small $\varepsilon$).
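The DP definition can be checked concretely for the Laplace mechanism, whose output density ratio on adjacent inputs is bounded by $e^{\varepsilon}$ exactly. A sketch (the query values 10 vs. 11 and the grid are arbitrary):

```python
import numpy as np

def laplace_mechanism(value, sensitivity, eps, rng):
    """Release value + Lap(sensitivity/eps); satisfies eps-DP for a
    query with the given L1 sensitivity."""
    return value + rng.laplace(scale=sensitivity / eps)

# numerical check of the e^eps likelihood-ratio bound for adjacent
# query answers 10 vs 11 (sensitivity 1):
eps = 0.5
b = 1.0 / eps                            # Laplace scale = sensitivity/eps
z = np.linspace(-20.0, 40.0, 10_001)
ratio = np.exp(-np.abs(z - 10.0) / b) / np.exp(-np.abs(z - 11.0) / b)
# max over z of the density ratio equals e^eps, matching the definition
```

The same likelihood-ratio view is what the MI and Rényi formulations below quantify in divergence terms.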
Connections:
- Mutual-information DP: $I(X_i; \mathcal{M}(D) \mid X_{-i}) \le \varepsilon$ as an alternative privacy definition (weaker than DP but information-theoretically natural)
- Rényi DP: uses the Rényi divergence, composes more tightly. $(\alpha, \varepsilon)$-RDP: $D_\alpha\big(\mathcal{M}(D) \,\|\, \mathcal{M}(D')\big) \le \varepsilon$
- Fisher information and privacy: local DP mechanisms limit Fisher information about individual data points
Maximal Leakage
Maximal leakage (Issa, Wagner, Kamath, 2019):

$$\mathcal{L}(X \to Y) = \log \sum_{y} \max_{x : \, p(x) > 0} p(y \mid x)$$

It measures the maximum information that $Y$ reveals about any function of $X$, equals the Sibson mutual information of order $\infty$, and provides operational security guarantees beyond mutual information.
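For a discrete channel the formula is a one-liner. A sketch over a row-stochastic channel matrix (representation is an assumption of this example):

```python
import numpy as np

def maximal_leakage(channel):
    """Maximal leakage L(X -> Y) = log sum_y max_x p(y|x), in nats.

    channel : (|X|, |Y|) row-stochastic matrix, channel[x, y] = p(y|x).
    """
    return np.log(channel.max(axis=0).sum())
```

The identity channel leaks $\log |X|$ (everything), while a channel whose rows are identical leaks 0, independent of the input distribution: unlike mutual information, maximal leakage depends only on the channel.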
Information-Theoretic Bounds on Learning with Privacy
The privacy-utility tradeoff is fundamental:
- DP-SGD adds Gaussian noise proportional to the gradient sensitivity, limiting $I(S;W)$
- Generalization bounds improve with privacy (less overfitting), but utility degrades
- The optimal tradeoff depends on the problem structure (convexity, smoothness, dimensionality)
Bassily et al. (2014): for $(\varepsilon, \delta)$-DP ERM over convex Lipschitz losses, the excess risk is $\tilde{\Theta}\big(\sqrt{d}/(n\varepsilon)\big)$ (matching upper and lower bounds).
Additional Information-Theoretic Tools in ML
Rate-Distortion for Representations
Neural network representations can be analyzed through rate-distortion theory: what is the minimum description complexity of a representation that achieves a given task performance? This connects to the information plane (plotting $I(X;T)$ vs. $I(T;Y)$ for intermediate representations $T$).
Channel Capacity of Learning
Learning can be viewed as communication through a "training channel": the training set is the input, the learned model is the output. The capacity of this channel bounds the maximum amount of information extractable from samples.
Entropy-Based Objectives
- Cross-entropy loss: minimizes $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$, where the entropy term $H(p)$ is constant in the model
- Label smoothing: increases $H(p)$ of the target distribution, acting as entropy regularization
- Entropy regularization in RL: encourages exploration (SAC, SQL)
- Maximum entropy IRL: infer reward function that makes observed behavior maximum-entropy optimal
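The label-smoothing bullet above is easy to verify directly: mixing one-hot targets with the uniform distribution raises $H(p)$. A minimal sketch (names and the mixing convention are illustrative):

```python
import numpy as np

def smoothed_targets(labels, n_classes, alpha=0.1):
    """Label smoothing: mix one-hot targets with the uniform
    distribution, raising the entropy H(p) of the targets."""
    one_hot = np.eye(n_classes)[labels]
    return (1 - alpha) * one_hot + alpha / n_classes

def entropy(p):
    """Shannon entropy in nats along the last axis."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)
```

A one-hot target has entropy 0; the smoothed target has strictly positive entropy, which penalizes overconfident predictions when used with cross-entropy.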