
Information Theory in Machine Learning

Information Bottleneck

The information bottleneck (IB) method (Tishby, Pereira, Bialek, 1999) formalizes the tradeoff between compression and prediction. Given a joint distribution $p(X, Y)$, find a compressed representation $T$ of $X$ that retains maximal information about $Y$:

$$\min_{p(t|x)} \; I(X; T) - \beta \, I(T; Y)$$

subject to the Markov chain $Y \to X \to T$.

The Lagrange multiplier $\beta$ controls the tradeoff: $\beta \to 0$ yields maximum compression ($T$ independent of $X$), while $\beta \to \infty$ yields maximum prediction ($T$ retains everything about $Y$).

Self-consistent equations (iterative solution):

$$p(t|x) = \frac{p(t)}{Z(x, \beta)} \exp\!\left(-\beta \, D_{\text{KL}}\big(p(y|x) \,\|\, p(y|t)\big)\right)$$

where $Z(x, \beta)$ is a normalization constant. This has the form of a rate-distortion problem with KL divergence as the distortion measure.
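
For discrete distributions, the self-consistent equations can be iterated directly as a fixed point. A minimal NumPy sketch (the function name, initialization, and smoothing constants are illustrative, not from a particular library):

```python
import numpy as np

def ib_update(p_xy, n_t, beta, n_iter=300, seed=0):
    """Iterate the IB self-consistent equations for a discrete joint p(X, Y).

    p_xy : (n_x, n_y) joint distribution (sums to 1)
    n_t  : cardinality of the compressed variable T
    Returns the encoder p(t|x) as an (n_x, n_t) array.
    """
    rng = np.random.default_rng(seed)
    n_x, _ = p_xy.shape
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]

    # random soft initialization of the encoder p(t|x)
    enc = rng.random((n_x, n_t))
    enc /= enc.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_t = np.maximum(p_x @ enc, 1e-12)            # marginal p(t)
        # decoder p(y|t) = sum_x p(y|x) p(x|t)
        p_x_given_t = (enc * p_x[:, None] / p_t).T     # (n_t, n_x)
        p_y_given_t = p_x_given_t @ p_y_given_x        # (n_t, n_y)
        # D_KL(p(y|x) || p(y|t)) for every (x, t) pair
        log_ratio = (np.log(p_y_given_x[:, None, :] + 1e-12)
                     - np.log(p_y_given_t[None, :, :] + 1e-12))
        d_kl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)
        # fixed-point update: p(t|x) proportional to p(t) exp(-beta * D_KL)
        enc = p_t[None, :] * np.exp(-beta * d_kl)
        enc /= enc.sum(axis=1, keepdims=True)
    return enc
```

Larger $\beta$ drives the encoder toward determinism; small $\beta$ collapses all inputs onto the same cluster.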

IB and Deep Learning

Shwartz-Ziv and Tishby (2017) proposed the "deep learning as information bottleneck" hypothesis: during training, networks undergo two phases:

  1. Fitting phase: $I(T; Y)$ increases (the network learns to predict)
  2. Compression phase: $I(X; T)$ decreases (it forgets irrelevant input details)

This claim generated significant debate. Critiques:

  • Saxe et al. (2018): compression phase depends on activation function (observed with tanh, not ReLU); may be an artifact of binning-based MI estimation
  • MI estimation in high dimensions is notoriously difficult; results are estimator-dependent
  • The framework assumes a fixed joint distribution, while deep learning operates on finite samples

Despite controversies, the IB perspective remains influential for understanding representation learning, regularization, and the role of noise in training.

Variational Information Bottleneck

Alemi et al. (2017): tractable variational bound for the IB objective:

$$\mathcal{L} = \mathbb{E}_{p(x,y)}\!\left[-\mathbb{E}_{p(t|x)}[\log q(y|t)]\right] + \beta \, \mathbb{E}_{p(x)}\!\left[D_{\text{KL}}\big(p(t|x) \,\|\, r(t)\big)\right]$$

where $q(y|t)$ is a variational decoder and $r(t)$ is a variational prior. This is closely related to the $\beta$-VAE objective and connects the IB to variational autoencoders.
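
With a diagonal-Gaussian encoder and standard-normal prior $r(t) = \mathcal{N}(0, I)$, the KL term is available in closed form, which is what makes the bound practical. A minimal NumPy sketch of the per-batch objective (function names and the single-sample Monte Carlo setup are illustrative):

```python
import numpy as np

def vib_objective(mu, logvar, log_q_y_given_t, beta):
    """Variational IB objective for one minibatch.

    mu, logvar      : (batch, d) parameters of the encoder p(t|x) = N(mu, diag(exp(logvar)))
    log_q_y_given_t : (batch,) decoder log-likelihood log q(y|t) at a sample t ~ p(t|x)
    beta            : compression weight
    """
    # closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), per example
    kl = 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar).sum(axis=1)
    nll = -log_q_y_given_t                 # prediction (cross-entropy) term
    return (nll + beta * kl).mean()

def sample_t(mu, logvar, rng):
    """Reparameterized sample t = mu + sigma * eps, used to evaluate the decoder."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
```

At $\beta \to 0$ this reduces to maximum-likelihood training of the decoder; increasing $\beta$ pressures the encoder toward the prior.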

InfoNCE and Contrastive Learning

InfoNCE (van den Oord et al., 2018, CPC): a contrastive loss that provides a lower bound on mutual information:

$$\mathcal{L}_{\text{InfoNCE}} = -\mathbb{E}\left[\log \frac{f(x, y^+)}{f(x, y^+) + \sum_{j=1}^{K-1} f(x, y_j^-)}\right]$$

where $f(x, y) = \exp(g(x)^\top h(y) / \tau)$ is a critic function, $y^+$ is the positive sample paired with $x$, and the $y_j^-$ are negative samples.

Relation to MI: $I(X; Y) \geq \log K - \mathcal{L}_{\text{InfoNCE}}$. The bound saturates at $\log K$, so more negatives yield tighter bounds.
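
With in-batch negatives (each row's positive serving as every other row's negative), the loss is a cross-entropy over similarity logits. A minimal NumPy sketch, assuming pre-normalized embeddings:

```python
import numpy as np

def info_nce(z_x, z_y, tau=0.1):
    """InfoNCE with in-batch negatives: row i of z_y is the positive for row i of z_x.

    z_x, z_y : (K, d) embeddings (assumed L2-normalized)
    Returns (loss, mi_lower_bound) where I(X; Y) >= log K - loss.
    """
    K = z_x.shape[0]
    logits = z_x @ z_y.T / tau                            # critic in log space
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -log_softmax[np.arange(K), np.arange(K)].mean()
    return loss, np.log(K) - loss
```

Note that even a perfect critic cannot certify more than $\log K$ nats of MI, which is why batch size matters for contrastive objectives.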

InfoNCE is the foundation of modern contrastive learning:

  • SimCLR (Chen et al., 2020): contrastive learning on augmented views, with projection head
  • MoCo (He et al., 2020): momentum-updated encoder for consistent negative representations
  • CLIP (Radford et al., 2021): contrastive image-text pretraining
  • BYOL, SimSiam: avoid negatives via asymmetric architectures (stop-gradient, predictor), departing from strict MI maximization

Limitations of MI maximization: McAllester and Stratos (2020) showed that the sample complexity of MI estimation scales exponentially in the true MI, bounding the utility of MI-based objectives for high-MI pairs. Tschannen et al. (2020) argued that representation quality in contrastive learning depends more on the encoder architecture and critic design than on the tightness of the MI bound.

Mutual Information Estimation

MINE (Mutual Information Neural Estimation)

Belghazi et al. (2018): use the Donsker-Varadhan representation of KL divergence:

$$I(X; Y) = \sup_{T \in \mathcal{F}} \; \mathbb{E}_{p(x,y)}[T(x,y)] - \log \mathbb{E}_{p(x)p(y)}\big[e^{T(x,y)}\big]$$

where $T$ is parameterized by a neural network. The bound is tight when $T^*(x,y) = \log \frac{p(x,y)}{p(x)p(y)} + C$.

Practical issues:

  • High variance of the gradient of the log-partition function; biased gradient estimates
  • Exponential moving average of the denominator reduces variance but introduces bias
  • Difficult to optimize when MI is large
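
The DV bound can be sanity-checked without training a critic: for a correlated Gaussian pair the optimal critic $T^*$ is a closed-form log density ratio, and plugging it into the empirical bound recovers the true MI. A sketch (shuffling one coordinate to simulate product-of-marginals samples is a standard trick):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.5, 200_000
true_mi = -0.5 * np.log(1 - rho**2)     # MI of a correlated Gaussian pair

# joint samples, and shuffled pairs as product-of-marginals samples
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
y_shuffled = rng.permutation(y)

def t_star(x, y):
    """Optimal DV critic: log p(x,y) - log p(x)p(y) for this Gaussian pair."""
    return (-0.5 * (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
            + 0.5 * (x**2 + y**2)
            - 0.5 * np.log(1 - rho**2))

def dv_bound(t_joint, t_marginal):
    """Empirical Donsker-Varadhan bound: E_p[T] - log E_q[exp T]."""
    return t_joint.mean() - np.log(np.exp(t_marginal).mean())

dv_estimate = dv_bound(t_star(x, y), t_star(x, y_shuffled))
```

The `log E[exp T]` term is exactly the high-variance log-partition estimate discussed above; pushing $\rho$ (and hence the MI) higher makes `exp(t_star)` increasingly heavy-tailed under the shuffled pairs.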

Other Neural MI Estimators

| Estimator | Bound Type | Variance | Bias |
|---|---|---|---|
| MINE (DV) | Lower (tight) | High | Low (asymptotic) |
| NWJ (f-divergence) | Lower | Lower than MINE | More biased |
| InfoNCE | Lower (saturates at $\log K$) | Low | Bounded |
| SMILE | Lower | Controlled via clipping | Moderate |
| BA (Barber-Agakov) | Lower | Low | Can be loose |

Non-parametric methods: the KSG estimator (Kraskov, Stögbauer, Grassberger) is based on $k$-nearest-neighbor statistics. It works well in low dimensions but degrades in high dimensions.
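
A compact sketch of the KSG estimator (algorithm 1 of the paper) for two 1-D samples, using SciPy's KD-trees with the max-norm metric; the strict-inequality handling via a tiny radius offset is a common implementation shortcut:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mi(x, y, k=4):
    """KSG mutual-information estimate (nats) for two 1-D samples."""
    x = np.asarray(x).reshape(-1, 1)
    y = np.asarray(y).reshape(-1, 1)
    n = len(x)
    joint = np.hstack([x, y])
    # distance to the k-th nearest neighbor in the joint space (max-norm)
    dist, _ = cKDTree(joint).query(joint, k=k + 1, p=np.inf)
    eps = dist[:, -1]
    # n_x, n_y: marginal neighbors strictly within eps, excluding the point itself
    nx = cKDTree(x).query_ball_point(x, eps - 1e-12, p=np.inf, return_length=True) - 1
    ny = cKDTree(y).query_ball_point(y, eps - 1e-12, p=np.inf, return_length=True) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))
```

On correlated Gaussians with a few thousand samples this tracks the closed-form MI closely; in high dimensions the neighborhood counts become uninformative, which is the degradation mentioned above.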

PAC-Bayes Theory

PAC-Bayes bounds provide generalization guarantees for stochastic classifiers, naturally expressed in information-theoretic terms.

Classic PAC-Bayes bound (McAllester, 1999): for any prior $P$ (independent of the data) and any posterior $Q$ over hypotheses, with probability $\geq 1-\delta$ over a training set $S$ of size $n$:

$$\mathbb{E}_{h \sim Q}[L(h)] \leq \mathbb{E}_{h \sim Q}[\hat{L}(h)] + \sqrt{\frac{D_{\text{KL}}(Q \,\|\, P) + \ln(n/\delta)}{2n}}$$

where $L(h)$ is the population loss and $\hat{L}(h)$ is the empirical loss.

Interpretation: the generalization gap is controlled by the KL divergence between the learned posterior and the prior, an information-theoretic complexity measure. Models that stay close to the prior (in the KL sense) generalize better.
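
The bound is directly computable once the empirical loss and the KL term are known. A one-function sketch for a bounded (e.g. 0-1) loss, useful for seeing how KL and sample size trade off numerically:

```python
import numpy as np

def mcallester_bound(emp_loss, kl, n, delta=0.05):
    """McAllester PAC-Bayes bound on E_{h~Q}[L(h)] for a loss in [0, 1].

    emp_loss : empirical risk E_{h~Q}[L_hat(h)]
    kl       : D_KL(Q || P) in nats
    n        : training-set size
    """
    return emp_loss + np.sqrt((kl + np.log(n / delta)) / (2.0 * n))
```

For instance, at $n = 10{,}000$ the complexity term with a small KL is only a few percent, while a posterior that drifts far from the prior quickly makes the bound vacuous.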

Extensions:

  • Data-dependent priors: use part of the data to choose PP, tightening bounds significantly
  • Catoni's bound: tighter than the square-root form, using moment-generating function
  • PAC-Bayes with unbounded losses: extensions to heavy-tailed losses via truncation or Bernstein-type bounds
  • PAC-Bayes for neural networks: Dziugaite and Roy (2017) showed non-vacuous bounds for stochastic neural networks via noise injection and optimization of the bound

Connection to SGD: SGD implicitly performs a form of posterior inference. The "noise" in SGD can be interpreted as limiting the information the algorithm extracts from the training data, providing a PAC-Bayes-like generalization mechanism.

Information-Theoretic Generalization Bounds

Xu-Raginsky Framework

Xu and Raginsky (2017): the expected generalization gap is bounded by the mutual information between the training data $S$ and the learned hypothesis $W$:

$$\big|\mathbb{E}[\text{gen}(W, S)]\big| \leq \sqrt{\frac{2\sigma^2 \, I(W; S)}{n}}$$

where $\sigma^2$ is the sub-Gaussian parameter of the loss. This directly links information leakage from data to hypothesis with generalization.

Implications:

  • Noisy algorithms (DP-SGD, dropout) limit $I(W; S)$ and hence generalize
  • Memorization corresponds to high $I(W; S)$
  • Extends to individual-sample MI: $I(W; Z_i)$ controls the contribution of sample $i$
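As a toy instantiation (my construction, not from the paper): let the "algorithm" output the empirical mean of $n$ Gaussian samples plus independent Gaussian noise. Then $W$ depends on $S$ only through the sample mean, $I(W; S)$ is a Gaussian-channel capacity in closed form, and the bound can be evaluated directly:

```python
import numpy as np

def xu_raginsky_bound(mi, sigma_sq, n):
    """|E[gen(W, S)]| <= sqrt(2 * sigma^2 * I(W; S) / n)."""
    return np.sqrt(2.0 * sigma_sq * mi / n)

def mi_noisy_mean(data_var, noise_var, n):
    """I(W; S) for W = sample_mean(S) + N(0, noise_var), data i.i.d. N(mu, data_var).

    The sample mean has variance data_var / n, so this is a Gaussian channel:
    I = 0.5 * log(1 + (data_var / n) / noise_var).
    """
    return 0.5 * np.log(1.0 + (data_var / n) / noise_var)
```

More output noise means less information extracted from the data and a smaller generalization bound, which is exactly the DP-SGD/dropout intuition in the list above.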

Conditional Mutual Information (CMI) Bounds

Steinke and Zakynthinou (2020): use a "supersample" construction, conditioning on a ghost set of $2n$ samples and measuring how much $W$ reveals about which $n$ of them were used for training. The resulting CMI is always finite, the bounds are tighter than raw MI bounds, and they can explain the generalization of interpolating classifiers.

Classical Complexity Measures

VC Dimension

The Vapnik-Chervonenkis dimension of a hypothesis class $\mathcal{H}$ is the size $d$ of the largest set that $\mathcal{H}$ can shatter (realize all $2^d$ labelings).

VC bound: with probability $\geq 1-\delta$:

$$\sup_{h \in \mathcal{H}} |L(h) - \hat{L}(h)| = O\left(\sqrt{\frac{d_{\text{VC}} \log n + \log(1/\delta)}{n}}\right)$$

Limitations: VC dimension is a worst-case combinatorial measure. For neural networks, VC dimension scales with the number of parameters, yielding vacuous bounds for overparameterized networks. The gap between VC predictions and observed generalization in deep learning is enormous.

Rademacher Complexity

Empirical Rademacher complexity:

$$\hat{\mathcal{R}}_n(\mathcal{H}) = \mathbb{E}_\sigma\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \sigma_i h(x_i)\right]$$

where the $\sigma_i$ are i.i.d. Rademacher ($\pm 1$) variables. It measures how well $\mathcal{H}$ can fit random noise.

Generalization bound: $\mathbb{E}[\sup_{h} |L(h) - \hat{L}(h)|] \leq 2\mathcal{R}_n(\mathcal{H})$.

Advantages over VC: data-dependent, can capture structure. For neural networks, norm-based Rademacher bounds (spectral norm, Frobenius norm constraints) give tighter (though still often vacuous) bounds.
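
For simple classes the expectation over $\sigma$ can be approximated by Monte Carlo. A sketch for 1-D threshold classifiers $h_t(x) = \mathrm{sign}(x - t)$, where the supremum over $t$ reduces to a scan over sorted data (the class choice and helper are illustrative):

```python
import numpy as np

def rademacher_thresholds(x, n_mc=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    1-D threshold classifiers h_t(x) = sign(x - t) on the sample x."""
    rng = np.random.default_rng(seed)
    n = len(x)
    order = np.argsort(x)
    total = 0.0
    for _ in range(n_mc):
        s = rng.choice([-1.0, 1.0], size=n)[order]
        # with j sorted points below the threshold, sum_i sigma_i h_t(x_i)
        # equals (total sum of sigma) - 2 * (prefix sum up to j)
        prefix = np.concatenate([[0.0], np.cumsum(s)])
        corr = prefix[-1] - 2.0 * prefix
        total += corr.max() / n
    return total / n_mc
```

Since thresholds have VC dimension 1, the estimate should shrink roughly like $1/\sqrt{n}$, matching the VC-based prediction.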

Compression and Generalization

Compression-based generalization (Littlestone and Warmuth, 1986; Arora et al., 2018): if a learning algorithm's output can be described by a subset of $k$ training examples (or by $k$ bits), generalization error scales as $O(\sqrt{k/n})$.

Neural network compression: if a trained network can be compressed (pruned, quantized, distilled) to $k$ effective bits, this provides a generalization bound. This partially explains why overparameterized networks generalize: they learn compressible functions.

Connections:

  • Flat minima (Hochreiter and Schmidhuber, 1997): parameters in flat minima can be described with fewer bits (less precision needed). Flat minima $\implies$ compressible $\implies$ good generalization
  • Noise stability: if adding noise to weights doesn't change predictions much, the function is compressible
  • PAC-Bayes as compression: $D_{\text{KL}}(Q \,\|\, P)$ measures, in bits (or nats), how far the posterior moves from the prior

MDL for Model Selection

Minimum Description Length applied to model selection in ML:

$$\text{Best model} = \arg\min_{M} \left[-\log p(\text{data} \mid \hat{\theta}_M, M) + \text{COMP}(M)\right]$$

where $\text{COMP}(M)$ is the model complexity (parametric complexity, stochastic complexity).

Practical forms:

  • BIC approximation: $-\log p(\text{data} \mid \hat{\theta}) + \frac{k}{2}\log n$ (coincides with MDL to first order)
  • Normalized maximum likelihood (NML): minimax optimal but often intractable
  • Prequential (predictive) MDL: $-\sum_{i=1}^n \log p(x_i \mid x_1, \ldots, x_{i-1})$ using sequential prediction. No explicit model-complexity term; complexity is implicit in prediction quality
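
The BIC form is straightforward to use for model-order selection. A sketch selecting a polynomial degree on synthetic data (the data-generating process and helper names are illustrative):

```python
import numpy as np

def bic(x, y, degree):
    """BIC = -log-likelihood + (k/2) log n for a polynomial fit with
    Gaussian residuals; k counts the coefficients plus the noise variance."""
    n = len(x)
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    sigma2 = max(resid.var(), 1e-12)          # MLE of the noise variance
    nll = 0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    return nll + 0.5 * (degree + 2) * np.log(n)

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 200)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.3 * rng.standard_normal(200)   # true degree: 3
best_degree = min(range(1, 8), key=lambda d: bic(x, y, d))
```

Higher degrees keep reducing the residual slightly, but the $\frac{k}{2}\log n$ penalty outweighs the gain once the true structure is captured.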

MDL naturally handles:

  • Model order selection (polynomial degree, number of clusters, tree depth)
  • Feature selection (description length of features + data given features)
  • Regularization (compression penalty = regularization strength)

Information-Theoretic Privacy

Differential Privacy via Information Theory

Differential privacy (DP) limits information leakage about any individual:

$$\Pr[\mathcal{M}(D) \in S] \leq e^\epsilon \Pr[\mathcal{M}(D') \in S]$$

for adjacent datasets $D, D'$ differing in one record.

MI bound: $(\epsilon, \delta)$-DP implies $I(X_i; \mathcal{M}(D)) \leq O(\epsilon^2)$ per individual $i$ (for small $\epsilon$).

Connections:

  • Mutual-information DP: $I(X_i; W) \leq \epsilon$ as an alternative privacy definition (weaker than DP but information-theoretically natural)
  • Rényi DP: uses the Rényi divergence and composes more tightly. $(\alpha, \epsilon)$-RDP requires $D_\alpha(\mathcal{M}(D) \,\|\, \mathcal{M}(D')) \leq \epsilon$
  • Fisher information and privacy: local DP mechanisms limit the Fisher information about individual data points
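
Rényi DP is computationally convenient because the Gaussian mechanism satisfies $(\alpha, \alpha\Delta^2/(2\sigma^2))$-RDP, RDP parameters add under composition, and a standard conversion (Mironov, 2017) recovers $(\epsilon, \delta)$-DP. A sketch of this accounting loop (the grid of orders is a common heuristic):

```python
import numpy as np

def gaussian_rdp(alpha, sigma, sensitivity=1.0):
    """RDP of the Gaussian mechanism: eps(alpha) = alpha * Delta^2 / (2 sigma^2)."""
    return alpha * sensitivity**2 / (2.0 * sigma**2)

def rdp_to_dp(alpha, rdp_eps, delta):
    """Convert (alpha, rdp_eps)-RDP to (epsilon, delta)-DP."""
    return rdp_eps + np.log(1.0 / delta) / (alpha - 1.0)

def composed_dp_epsilon(sigma, n_steps, delta=1e-5):
    """Compose n_steps identical Gaussian mechanisms in RDP, then pick the
    order alpha that gives the smallest converted (epsilon, delta) guarantee."""
    alphas = np.arange(2, 256)
    return min(rdp_to_dp(a, n_steps * gaussian_rdp(a, sigma), delta) for a in alphas)
```

This is the core of moments-accountant-style analyses of DP-SGD (without subsampling amplification, which tightens things further).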

Maximal Leakage

Maximal leakage (Issa, Wagner, Kamath, 2019):

$$\mathcal{L}(X \to Y) = \sup_{U:\, U - X - Y} \log \frac{\max_{\hat{u}(\cdot)} \Pr[U = \hat{u}(Y)]}{\max_u \Pr[U = u]}$$

Measures the maximum information that $Y$ reveals about any function of $X$. Equals the Sibson MI of order $\infty$. Provides operational security guarantees beyond mutual information.

Information-Theoretic Bounds on Learning with Privacy

The privacy-utility tradeoff is fundamental:

  • DP-SGD adds Gaussian noise proportional to gradient sensitivity, limiting $I(W; S)$
  • Generalization bounds improve with privacy (less overfitting), but utility degrades
  • The optimal tradeoff depends on the problem structure (convexity, smoothness, dimensionality)

Bassily et al. (2014): for $(\epsilon, \delta)$-DP ERM over convex losses, the excess risk is $\Theta\!\left(\frac{\sqrt{d \log(1/\delta)}}{n\epsilon}\right)$ (matching upper and lower bounds).

Additional Information-Theoretic Tools in ML

Rate-Distortion for Representations

Neural network representations can be analyzed through rate-distortion theory: what is the minimum description complexity of a representation that achieves a given task performance? This connects to the information plane (plotting $I(X;T)$ vs. $I(T;Y)$ for intermediate representations $T$).

Channel Capacity of Learning

Learning can be viewed as communication through a "training channel": the training set $S$ is the input, the learned model $W$ is the output. The capacity of this channel bounds the maximum amount of information extractable from $n$ samples.

Entropy-Based Objectives

  • Cross-entropy loss: minimizes $D_{\text{KL}}(p_{\text{data}} \,\|\, p_\theta) + H(p_{\text{data}})$, where the entropy term is constant
  • Label smoothing: increases $H(\text{targets})$, acting as entropy regularization
  • Entropy regularization in RL: $\max_\pi \mathbb{E}[R] + \alpha H(\pi)$ encourages exploration (SAC, SQL)
  • Maximum entropy IRL: infer a reward function that makes the observed behavior maximum-entropy optimal
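
The label-smoothing point is easy to verify numerically: smoothing mixes the one-hot target with the uniform distribution, raising its entropy from (near) zero while keeping it below the uniform maximum $\log C$. A minimal sketch:

```python
import numpy as np

def smooth_labels(labels, n_classes, eps=0.1):
    """Label smoothing: (1 - eps) * one_hot + eps * uniform."""
    one_hot = np.eye(n_classes)[labels]
    return (1.0 - eps) * one_hot + eps / n_classes

def entropy(p, axis=-1):
    """Shannon entropy in nats, with clipping to avoid log(0)."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=axis)
```

Training against the smoothed targets adds a constant-weight KL pull toward the uniform distribution, which is the entropy-regularization reading above.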