
Information Theory in Machine Learning

Information Bottleneck

The information bottleneck (IB) method (Tishby, Pereira, Bialek, 1999) formalizes the tradeoff between compression and prediction. Given a joint distribution $p(X, Y)$, find a compressed representation $T$ of $X$ that retains maximal information about $Y$:

$$\min_{p(t|x)} \; I(X; T) - \beta \, I(T; Y)$$

subject to the Markov chain $Y \to X \to T$.

The Lagrange multiplier $\beta$ controls the tradeoff: $\beta \to 0$ yields maximum compression ($T$ independent of $X$), while $\beta \to \infty$ yields maximum prediction ($T$ retains everything about $Y$).

Self-consistent equations (iterative solution):

$$p(t|x) = \frac{p(t)}{Z(x, \beta)} \exp\!\left(-\beta \, D_{\text{KL}}\big(p(y|x) \,\|\, p(y|t)\big)\right)$$

where $Z(x, \beta)$ is a normalization constant. This has the form of a rate-distortion problem with KL divergence as the distortion measure.
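
For discrete distributions, the self-consistent equations can be iterated directly as a fixed point. A minimal NumPy sketch (the function name, initialization, and smoothing constants are illustrative, not from a particular library):

```python
import numpy as np

def ib_update(p_xy, n_t, beta, n_iter=300, seed=0):
    """Iterate the IB self-consistent equations for a discrete joint p(X, Y).

    p_xy : (n_x, n_y) joint distribution (sums to 1)
    n_t  : cardinality of the compressed variable T
    Returns the encoder p(t|x) as an (n_x, n_t) array.
    """
    rng = np.random.default_rng(seed)
    n_x, _ = p_xy.shape
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]

    # random soft initialization of the encoder p(t|x)
    enc = rng.random((n_x, n_t))
    enc /= enc.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_t = np.maximum(p_x @ enc, 1e-12)            # marginal p(t)
        # decoder p(y|t) = sum_x p(y|x) p(x|t)
        p_x_given_t = (enc * p_x[:, None] / p_t).T     # (n_t, n_x)
        p_y_given_t = p_x_given_t @ p_y_given_x        # (n_t, n_y)
        # D_KL(p(y|x) || p(y|t)) for every (x, t) pair
        log_ratio = (np.log(p_y_given_x[:, None, :] + 1e-12)
                     - np.log(p_y_given_t[None, :, :] + 1e-12))
        d_kl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)
        # fixed-point update: p(t|x) proportional to p(t) exp(-beta * D_KL)
        enc = p_t[None, :] * np.exp(-beta * d_kl)
        enc /= enc.sum(axis=1, keepdims=True)
    return enc
```

Larger $\beta$ drives the encoder toward determinism; small $\beta$ collapses all inputs onto the same cluster.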

IB and Deep Learning

Shwartz-Ziv and Tishby (2017) proposed the "deep learning as information bottleneck" hypothesis: during training, networks undergo two phases:

  1. Fitting phase: $I(T; Y)$ increases (the network learns to predict)
  2. Compression phase: $I(X; T)$ decreases (it forgets irrelevant input details)

This claim generated significant debate. Critiques:

  • Saxe et al. (2018): compression phase depends on activation function (observed with tanh, not ReLU); may be an artifact of binning-based MI estimation
  • MI estimation in high dimensions is notoriously difficult; results are estimator-dependent
  • The framework assumes a fixed joint distribution, while deep learning operates on finite samples

Despite controversies, the IB perspective remains influential for understanding representation learning, regularization, and the role of noise in training.

Variational Information Bottleneck

Alemi et al. (2017): tractable variational bound for the IB objective:

$$\mathcal{L} = \mathbb{E}_{p(x,y)}\!\left[-\mathbb{E}_{p(t|x)}[\log q(y|t)]\right] + \beta \, \mathbb{E}_{p(x)}\!\left[D_{\text{KL}}\big(p(t|x) \,\|\, r(t)\big)\right]$$

where $q(y|t)$ is a variational decoder and $r(t)$ is a variational prior. This is closely related to the $\beta$-VAE objective and connects the IB to variational autoencoders.
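
With a diagonal-Gaussian encoder and standard-normal prior $r(t) = \mathcal{N}(0, I)$, the KL term is available in closed form, which is what makes the bound practical. A minimal NumPy sketch of the per-batch objective (function names and the single-sample Monte Carlo setup are illustrative):

```python
import numpy as np

def vib_objective(mu, logvar, log_q_y_given_t, beta):
    """Variational IB objective for one minibatch.

    mu, logvar      : (batch, d) parameters of the encoder p(t|x) = N(mu, diag(exp(logvar)))
    log_q_y_given_t : (batch,) decoder log-likelihood log q(y|t) at a sample t ~ p(t|x)
    beta            : compression weight
    """
    # closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), per example
    kl = 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar).sum(axis=1)
    nll = -log_q_y_given_t                 # prediction (cross-entropy) term
    return (nll + beta * kl).mean()

def sample_t(mu, logvar, rng):
    """Reparameterized sample t = mu + sigma * eps, used to evaluate the decoder."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
```

At $\beta \to 0$ this reduces to maximum-likelihood training of the decoder; increasing $\beta$ pressures the encoder toward the prior.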

InfoNCE and Contrastive Learning

InfoNCE (van den Oord et al., 2018, CPC): a contrastive loss that provides a lower bound on mutual information:

$$\mathcal{L}_{\text{InfoNCE}} = -\mathbb{E}\left[\log \frac{f(x, y^+)}{f(x, y^+) + \sum_{j=1}^{K-1} f(x, y_j^-)}\right]$$

where $f(x, y) = \exp(g(x)^\top h(y) / \tau)$ is a critic function, $y^+$ is the positive sample paired with $x$, and the $y_j^-$ are negative samples.

Relation to MI: $I(X; Y) \geq \log K - \mathcal{L}_{\text{InfoNCE}}$. The bound saturates at $\log K$, so more negatives yield tighter bounds.
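
With in-batch negatives (each row's positive serving as every other row's negative), the loss is a cross-entropy over similarity logits. A minimal NumPy sketch, assuming pre-normalized embeddings:

```python
import numpy as np

def info_nce(z_x, z_y, tau=0.1):
    """InfoNCE with in-batch negatives: row i of z_y is the positive for row i of z_x.

    z_x, z_y : (K, d) embeddings (assumed L2-normalized)
    Returns (loss, mi_lower_bound) where I(X; Y) >= log K - loss.
    """
    K = z_x.shape[0]
    logits = z_x @ z_y.T / tau                            # critic in log space
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -log_softmax[np.arange(K), np.arange(K)].mean()
    return loss, np.log(K) - loss
```

Note that even a perfect critic cannot certify more than $\log K$ nats of MI, which is why batch size matters for contrastive objectives.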

InfoNCE is the foundation of modern contrastive learning:

  • SimCLR (Chen et al., 2020): contrastive learning on augmented views, with projection head
  • MoCo (He et al., 2020): momentum-updated encoder for consistent negative representations
  • CLIP (Radford et al., 2021): contrastive image-text pretraining
  • BYOL, SimSiam: avoid negatives via asymmetric architectures (stop-gradient, predictor), departing from strict MI maximization

Limitations of MI maximization: McAllester and Stratos (2020) showed that the sample complexity of MI estimation scales exponentially in the true MI, bounding the utility of MI-based objectives for high-MI pairs. Tschannen et al. (2020) argued that representation quality in contrastive learning depends more on the encoder architecture and critic design than on the tightness of the MI bound.

Mutual Information Estimation

MINE (Mutual Information Neural Estimation)

Belghazi et al. (2018): use the Donsker-Varadhan representation of KL divergence:

$$I(X; Y) = \sup_{T \in \mathcal{F}} \; \mathbb{E}_{p(x,y)}[T(x,y)] - \log \mathbb{E}_{p(x)p(y)}\big[e^{T(x,y)}\big]$$

where $T$ is parameterized by a neural network. The bound is tight when $T^*(x,y) = \log \frac{p(x,y)}{p(x)p(y)} + C$.

Practical issues:

  • High variance of the gradient of the log-partition function; biased gradient estimates
  • Exponential moving average of the denominator reduces variance but introduces bias
  • Difficult to optimize when MI is large
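
The DV bound can be sanity-checked without training a critic: for a correlated Gaussian pair the optimal critic $T^*$ is a closed-form log density ratio, and plugging it into the empirical bound recovers the true MI. A sketch (shuffling one coordinate to simulate product-of-marginals samples is a standard trick):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.5, 200_000
true_mi = -0.5 * np.log(1 - rho**2)     # MI of a correlated Gaussian pair

# joint samples, and shuffled pairs as product-of-marginals samples
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
y_shuffled = rng.permutation(y)

def t_star(x, y):
    """Optimal DV critic: log p(x,y) - log p(x)p(y) for this Gaussian pair."""
    return (-0.5 * (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
            + 0.5 * (x**2 + y**2)
            - 0.5 * np.log(1 - rho**2))

def dv_bound(t_joint, t_marginal):
    """Empirical Donsker-Varadhan bound: E_p[T] - log E_q[exp T]."""
    return t_joint.mean() - np.log(np.exp(t_marginal).mean())

dv_estimate = dv_bound(t_star(x, y), t_star(x, y_shuffled))
```

The `log E[exp T]` term is exactly the high-variance log-partition estimate discussed above; pushing $\rho$ (and hence the MI) higher makes `exp(t_star)` increasingly heavy-tailed under the shuffled pairs.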

Other Neural MI Estimators

| Estimator | Bound Type | Variance | Bias |
|---|---|---|---|
| MINE (DV) | Lower (tight) | High | Low (asymptotic) |
| NWJ (f-divergence) | Lower | Lower than MINE | More biased |
| InfoNCE | Lower (saturates at $\log K$) | Low | Bounded |
| SMILE | Lower | Controlled via clipping | Moderate |
| BA (Barber-Agakov) | Lower | Low | Can be loose |

Non-parametric methods: the KSG estimator (Kraskov, Stögbauer, Grassberger) is based on $k$-nearest-neighbor statistics. It works well in low dimensions but degrades in high dimensions.
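
A compact sketch of the KSG estimator (algorithm 1 of the paper) for two 1-D samples, using SciPy's KD-trees with the max-norm metric; the strict-inequality handling via a tiny radius offset is a common implementation shortcut:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mi(x, y, k=4):
    """KSG mutual-information estimate (nats) for two 1-D samples."""
    x = np.asarray(x).reshape(-1, 1)
    y = np.asarray(y).reshape(-1, 1)
    n = len(x)
    joint = np.hstack([x, y])
    # distance to the k-th nearest neighbor in the joint space (max-norm)
    dist, _ = cKDTree(joint).query(joint, k=k + 1, p=np.inf)
    eps = dist[:, -1]
    # n_x, n_y: marginal neighbors strictly within eps, excluding the point itself
    nx = cKDTree(x).query_ball_point(x, eps - 1e-12, p=np.inf, return_length=True) - 1
    ny = cKDTree(y).query_ball_point(y, eps - 1e-12, p=np.inf, return_length=True) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))
```

On correlated Gaussians with a few thousand samples this tracks the closed-form MI closely; in high dimensions the neighborhood counts become uninformative, which is the degradation mentioned above.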

PAC-Bayes Theory

PAC-Bayes bounds provide generalization guarantees for stochastic classifiers, naturally expressed in information-theoretic terms.

Classic PAC-Bayes bound (McAllester, 1999): for any prior $P$ (independent of the data) and any posterior $Q$ over hypotheses, with probability $\geq 1-\delta$ over a training set $S$ of size $n$:

$$\mathbb{E}_{h \sim Q}[L(h)] \leq \mathbb{E}_{h \sim Q}[\hat{L}(h)] + \sqrt{\frac{D_{\text{KL}}(Q \,\|\, P) + \ln(n/\delta)}{2n}}$$

where $L(h)$ is the population loss and $\hat{L}(h)$ is the empirical loss.

Interpretation: the generalization gap is controlled by the KL divergence between the learned posterior and the prior, an information-theoretic complexity measure. Models that stay close to the prior (in the KL sense) generalize better.
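
The bound is directly computable once the empirical loss and the KL term are known. A one-function sketch for a bounded (e.g. 0-1) loss, useful for seeing how KL and sample size trade off numerically:

```python
import numpy as np

def mcallester_bound(emp_loss, kl, n, delta=0.05):
    """McAllester PAC-Bayes bound on E_{h~Q}[L(h)] for a loss in [0, 1].

    emp_loss : empirical risk E_{h~Q}[L_hat(h)]
    kl       : D_KL(Q || P) in nats
    n        : training-set size
    """
    return emp_loss + np.sqrt((kl + np.log(n / delta)) / (2.0 * n))
```

For instance, at $n = 10{,}000$ the complexity term with a small KL is only a few percent, while a posterior that drifts far from the prior quickly makes the bound vacuous.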

Extensions:

  • Data-dependent priors: use part of the data to choose PP, tightening bounds significantly
  • Catoni's bound: tighter than the square-root form, using moment-generating function
  • PAC-Bayes with unbounded losses: extensions to heavy-tailed losses via truncation or Bernstein-type bounds
  • PAC-Bayes for neural networks: Dziugaite and Roy (2017) showed non-vacuous bounds for stochastic neural networks via noise injection and optimization of the bound

Connection to SGD: SGD implicitly performs a form of posterior inference. The "noise" in SGD can be interpreted as limiting the information the algorithm extracts from the training data, providing a PAC-Bayes-like generalization mechanism.

Information-Theoretic Generalization Bounds

Xu-Raginsky Framework

Xu and Raginsky (2017): the expected generalization gap is bounded by the mutual information between the training data $S$ and the learned hypothesis $W$:

$$\big|\mathbb{E}[\text{gen}(W, S)]\big| \leq \sqrt{\frac{2\sigma^2 \, I(W; S)}{n}}$$

where $\sigma^2$ is the sub-Gaussian parameter of the loss. This directly links information leakage from data to hypothesis with generalization.

Implications:

  • Noisy algorithms (DP-SGD, dropout) limit $I(W; S)$ and hence generalize
  • Memorization corresponds to high $I(W; S)$
  • Extends to individual-sample MI: $I(W; Z_i)$ controls the contribution of sample $i$
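As a toy instantiation (my construction, not from the paper): let the "algorithm" output the empirical mean of $n$ Gaussian samples plus independent Gaussian noise. Then $W$ depends on $S$ only through the sample mean, $I(W; S)$ is a Gaussian-channel capacity in closed form, and the bound can be evaluated directly:

```python
import numpy as np

def xu_raginsky_bound(mi, sigma_sq, n):
    """|E[gen(W, S)]| <= sqrt(2 * sigma^2 * I(W; S) / n)."""
    return np.sqrt(2.0 * sigma_sq * mi / n)

def mi_noisy_mean(data_var, noise_var, n):
    """I(W; S) for W = sample_mean(S) + N(0, noise_var), data i.i.d. N(mu, data_var).

    The sample mean has variance data_var / n, so this is a Gaussian channel:
    I = 0.5 * log(1 + (data_var / n) / noise_var).
    """
    return 0.5 * np.log(1.0 + (data_var / n) / noise_var)
```

More output noise means less information extracted from the data and a smaller generalization bound, which is exactly the DP-SGD/dropout intuition in the list above.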

Conditional Mutual Information (CMI) Bounds

Steinke and Zakynthinou (2020): use a "supersample" construction, conditioning on a ghost set of $2n$ samples and measuring how much $W$ reveals about which $n$ of them were used for training. The resulting CMI is always finite, the bounds are tighter than raw MI bounds, and they can explain the generalization of interpolating classifiers.

Classical Complexity Measures

VC Dimension

The Vapnik-Chervonenkis dimension of a hypothesis class $\mathcal{H}$ is the size $d$ of the largest set that $\mathcal{H}$ can shatter (realize all $2^d$ labelings).

VC bound: with probability $\geq 1-\delta$:

$$\sup_{h \in \mathcal{H}} |L(h) - \hat{L}(h)| = O\left(\sqrt{\frac{d_{\text{VC}} \log n + \log(1/\delta)}{n}}\right)$$

Limitations: VC dimension is a worst-case combinatorial measure. For neural networks, VC dimension scales with the number of parameters, yielding vacuous bounds for overparameterized networks. The gap between VC predictions and observed generalization in deep learning is enormous.

Rademacher Complexity

Empirical Rademacher complexity:

$$\hat{\mathcal{R}}_n(\mathcal{H}) = \mathbb{E}_\sigma\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \sigma_i h(x_i)\right]$$

where the $\sigma_i$ are i.i.d. Rademacher ($\pm 1$) variables. It measures how well $\mathcal{H}$ can fit random noise.

Generalization bound: $\mathbb{E}[\sup_{h} |L(h) - \hat{L}(h)|] \leq 2\mathcal{R}_n(\mathcal{H})$.

Advantages over VC: data-dependent, can capture structure. For neural networks, norm-based Rademacher bounds (spectral norm, Frobenius norm constraints) give tighter (though still often vacuous) bounds.
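
For simple classes the expectation over $\sigma$ can be approximated by Monte Carlo. A sketch for 1-D threshold classifiers $h_t(x) = \mathrm{sign}(x - t)$, where the supremum over $t$ reduces to a scan over sorted data (the class choice and helper are illustrative):

```python
import numpy as np

def rademacher_thresholds(x, n_mc=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    1-D threshold classifiers h_t(x) = sign(x - t) on the sample x."""
    rng = np.random.default_rng(seed)
    n = len(x)
    order = np.argsort(x)
    total = 0.0
    for _ in range(n_mc):
        s = rng.choice([-1.0, 1.0], size=n)[order]
        # with j sorted points below the threshold, sum_i sigma_i h_t(x_i)
        # equals (total sum of sigma) - 2 * (prefix sum up to j)
        prefix = np.concatenate([[0.0], np.cumsum(s)])
        corr = prefix[-1] - 2.0 * prefix
        total += corr.max() / n
    return total / n_mc
```

Since thresholds have VC dimension 1, the estimate should shrink roughly like $1/\sqrt{n}$, matching the VC-based prediction.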

Compression and Generalization

Compression-based generalization (Littlestone and Warmuth, 1986; Arora et al., 2018): if a learning algorithm's output can be described by a subset of $k$ training examples (or by $k$ bits), generalization error scales as $O(\sqrt{k/n})$.

Neural network compression: if a trained network can be compressed (pruned, quantized, distilled) to $k$ effective bits, this provides a generalization bound. This partially explains why overparameterized networks generalize: they learn compressible functions.

Connections:

  • Flat minima (Hochreiter and Schmidhuber, 1997): parameters in flat minima can be described with fewer bits (less precision needed). Flat minima $\implies$ compressible $\implies$ good generalization
  • Noise stability: if adding noise to weights doesn't change predictions much, the function is compressible
  • PAC-Bayes as compression: $D_{\text{KL}}(Q \,\|\, P)$ measures, in bits (or nats), how far the posterior moves from the prior

MDL for Model Selection

Minimum Description Length applied to model selection in ML:

$$\text{Best model} = \arg\min_{M} \left[-\log p(\text{data} \mid \hat{\theta}_M, M) + \text{COMP}(M)\right]$$

where $\text{COMP}(M)$ is the model complexity (parametric complexity, stochastic complexity).

Practical forms:

  • BIC approximation: $-\log p(\text{data} \mid \hat{\theta}) + \frac{k}{2}\log n$ (coincides with MDL to first order)
  • Normalized maximum likelihood (NML): minimax optimal but often intractable
  • Prequential (predictive) MDL: $-\sum_{i=1}^n \log p(x_i \mid x_1, \ldots, x_{i-1})$ using sequential prediction. No explicit model-complexity term; complexity is implicit in prediction quality
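
The BIC form is straightforward to use for model-order selection. A sketch selecting a polynomial degree on synthetic data (the data-generating process and helper names are illustrative):

```python
import numpy as np

def bic(x, y, degree):
    """BIC = -log-likelihood + (k/2) log n for a polynomial fit with
    Gaussian residuals; k counts the coefficients plus the noise variance."""
    n = len(x)
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    sigma2 = max(resid.var(), 1e-12)          # MLE of the noise variance
    nll = 0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    return nll + 0.5 * (degree + 2) * np.log(n)

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 200)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.3 * rng.standard_normal(200)   # true degree: 3
best_degree = min(range(1, 8), key=lambda d: bic(x, y, d))
```

Higher degrees keep reducing the residual slightly, but the $\frac{k}{2}\log n$ penalty outweighs the gain once the true structure is captured.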

MDL naturally handles:

  • Model order selection (polynomial degree, number of clusters, tree depth)
  • Feature selection (description length of features + data given features)
  • Regularization (compression penalty = regularization strength)

Information-Theoretic Privacy

Differential Privacy via Information Theory

Differential privacy (DP) limits information leakage about any individual:

$$\Pr[\mathcal{M}(D) \in S] \leq e^\epsilon \Pr[\mathcal{M}(D') \in S]$$

for adjacent datasets $D, D'$ differing in one record.

MI bound: $(\epsilon, \delta)$-DP implies $I(X_i; \mathcal{M}(D)) \leq O(\epsilon^2)$ per individual $i$ (for small $\epsilon$).

Connections:

  • Mutual-information DP: $I(X_i; W) \leq \epsilon$ as an alternative privacy definition (weaker than DP but information-theoretically natural)
  • Rényi DP: uses the Rényi divergence and composes more tightly. $(\alpha, \epsilon)$-RDP requires $D_\alpha(\mathcal{M}(D) \,\|\, \mathcal{M}(D')) \leq \epsilon$
  • Fisher information and privacy: local DP mechanisms limit the Fisher information about individual data points
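
Rényi DP is computationally convenient because the Gaussian mechanism satisfies $(\alpha, \alpha\Delta^2/(2\sigma^2))$-RDP, RDP parameters add under composition, and a standard conversion (Mironov, 2017) recovers $(\epsilon, \delta)$-DP. A sketch of this accounting loop (the grid of orders is a common heuristic):

```python
import numpy as np

def gaussian_rdp(alpha, sigma, sensitivity=1.0):
    """RDP of the Gaussian mechanism: eps(alpha) = alpha * Delta^2 / (2 sigma^2)."""
    return alpha * sensitivity**2 / (2.0 * sigma**2)

def rdp_to_dp(alpha, rdp_eps, delta):
    """Convert (alpha, rdp_eps)-RDP to (epsilon, delta)-DP."""
    return rdp_eps + np.log(1.0 / delta) / (alpha - 1.0)

def composed_dp_epsilon(sigma, n_steps, delta=1e-5):
    """Compose n_steps identical Gaussian mechanisms in RDP, then pick the
    order alpha that gives the smallest converted (epsilon, delta) guarantee."""
    alphas = np.arange(2, 256)
    return min(rdp_to_dp(a, n_steps * gaussian_rdp(a, sigma), delta) for a in alphas)
```

This is the core of moments-accountant-style analyses of DP-SGD (without subsampling amplification, which tightens things further).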

Maximal Leakage

Maximal leakage (Issa, Wagner, Kamath, 2019):

$$\mathcal{L}(X \to Y) = \sup_{U:\, U - X - Y} \log \frac{\max_{\hat{u}(\cdot)} \Pr[U = \hat{u}(Y)]}{\max_u \Pr[U = u]}$$

Measures the maximum information that $Y$ reveals about any function of $X$. Equals the Sibson MI of order $\infty$. Provides operational security guarantees beyond mutual information.

Information-Theoretic Bounds on Learning with Privacy

The privacy-utility tradeoff is fundamental:

  • DP-SGD adds Gaussian noise proportional to gradient sensitivity, limiting $I(W; S)$
  • Generalization bounds improve with privacy (less overfitting), but utility degrades
  • The optimal tradeoff depends on the problem structure (convexity, smoothness, dimensionality)

Bassily et al. (2014): for $(\epsilon, \delta)$-DP ERM over convex losses, the excess risk is $\Theta\!\left(\frac{\sqrt{d \log(1/\delta)}}{n\epsilon}\right)$ (matching upper and lower bounds).

Additional Information-Theoretic Tools in ML

Rate-Distortion for Representations

Neural network representations can be analyzed through rate-distortion theory: what is the minimum description complexity of a representation that achieves a given task performance? This connects to the information plane (plotting $I(X;T)$ vs. $I(T;Y)$ for intermediate representations $T$).

Channel Capacity of Learning

Learning can be viewed as communication through a "training channel": the training set $S$ is the input, the learned model $W$ is the output. The capacity of this channel bounds the maximum amount of information extractable from $n$ samples.

Entropy-Based Objectives

  • Cross-entropy loss: minimizes $D_{\text{KL}}(p_{\text{data}} \,\|\, p_\theta) + H(p_{\text{data}})$, where the entropy term is constant
  • Label smoothing: increases $H(\text{targets})$, acting as entropy regularization
  • Entropy regularization in RL: $\max_\pi \mathbb{E}[R] + \alpha H(\pi)$ encourages exploration (SAC, SQL)
  • Maximum entropy IRL: infer a reward function that makes the observed behavior maximum-entropy optimal
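
The label-smoothing point is easy to verify numerically: smoothing mixes the one-hot target with the uniform distribution, raising its entropy from (near) zero while keeping it below the uniform maximum $\log C$. A minimal sketch:

```python
import numpy as np

def smooth_labels(labels, n_classes, eps=0.1):
    """Label smoothing: (1 - eps) * one_hot + eps * uniform."""
    one_hot = np.eye(n_classes)[labels]
    return (1.0 - eps) * one_hot + eps / n_classes

def entropy(p, axis=-1):
    """Shannon entropy in nats, with clipping to avoid log(0)."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=axis)
```

Training against the smoothed targets adds a constant-weight KL pull toward the uniform distribution, which is the entropy-regularization reading above.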