Interpretability

Motivation and Taxonomy

Interpretability addresses the question: why did the model produce this output? As ML models are deployed in high-stakes domains (medicine, law, finance, autonomous systems), understanding model behavior is critical for trust, debugging, regulatory compliance, and scientific discovery.

Post-Hoc vs. Intrinsic

  • Intrinsic interpretability: The model is inherently interpretable (linear models, decision trees, rule lists, GAMs). Limited expressiveness.
  • Post-hoc interpretability: Apply explanation methods to an already-trained model. Works with any model, but explanations may be unfaithful to the model's actual reasoning.

Local vs. Global

  • Local explanations: Explain a single prediction (why did the model classify this image as a cat?).
  • Global explanations: Explain overall model behavior (what features does the model generally use?).

Feature Attribution Methods

SHAP (SHapley Additive exPlanations)

Lundberg and Lee (2017) unified several attribution methods under the Shapley value framework from cooperative game theory. The Shapley value of feature i for prediction f(x) is:

φ_i = Σ_{S ⊆ N \ {i}} [|S|! (|N| - |S| - 1)! / |N|!] · [f(S ∪ {i}) - f(S)]

where f(S) is the model prediction using only features in S (others marginalized out).

Properties: The Shapley value is the unique attribution scheme satisfying efficiency (attributions sum to f(x) - E[f(x)]), symmetry (interchangeable features get equal credit), dummy (irrelevant features get zero attribution), and linearity.
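Under the convention that features absent from S are held at a fixed baseline, the formula can be checked by brute-force enumeration. A sketch with an illustrative toy model (exponential in the number of features, as discussed below):

```python
import math
from itertools import combinations

import numpy as np

def exact_shapley(f, x, baseline):
    """Brute-force Shapley values; features outside S are held at the baseline."""
    n = len(x)
    phi = np.zeros(n)

    def eval_subset(S):
        z = baseline.copy()
        z[list(S)] = x[list(S)]
        return f(z)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                # |S|! (|N|-|S|-1)! / |N|!  weighting from the formula above
                w = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
                phi[i] += w * (eval_subset(S + (i,)) - eval_subset(S))
    return phi

# Toy model with an interaction between features 1 and 2.
f = lambda z: 2.0 * z[0] + z[1] * z[2]
x, baseline = np.array([1.0, 2.0, 3.0]), np.zeros(3)
phi = exact_shapley(f, x, baseline)
```

Efficiency holds exactly here: the attributions sum to f(x) - f(baseline) = 8, and the interaction term's credit is split equally between features 1 and 2 by symmetry (φ = [2, 3, 3]).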

Computing SHAP: Exact computation requires 2^n evaluations (exponential in features). Approximations:

  • KernelSHAP: Sample subsets and fit a weighted linear model. Model-agnostic but slow.
  • TreeSHAP: Exact polynomial-time algorithm for tree-based models (random forests, XGBoost, LightGBM). Exploits tree structure to compute exact Shapley values in O(TLD²) per sample.
  • DeepSHAP: Combines DeepLIFT's backpropagation with Shapley value computation for neural networks.

Limitations: The choice of baseline/reference distribution affects explanations. Marginalizing features by replacement (interventional vs. observational) yields different explanations — observational SHAP can produce misleading attributions for correlated features.

LIME (Local Interpretable Model-agnostic Explanations)

Ribeiro et al. (2016): Explain any model locally by fitting an interpretable surrogate model in the neighborhood of the instance.

  1. Perturb: Generate perturbed samples around x by randomly toggling interpretable features (e.g., superpixels for images, words for text).
  2. Weight: Weight perturbed samples by proximity to x (exponential kernel).
  3. Fit: Train a linear model (or decision tree) on weighted perturbed samples to approximate f locally.
  4. Explain: The surrogate model's coefficients are the feature attributions.
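Steps 1-4 can be sketched in a few lines of numpy, assuming binary interpretable features whose "off" state is the value 0 (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def lime_explain(f, x, n_samples=2000, kernel_width=0.75):
    """Minimal LIME sketch: each interpretable feature is kept (1) or zeroed (0)."""
    n = len(x)
    # 1. Perturb: random on/off masks over interpretable features.
    masks = rng.integers(0, 2, size=(n_samples, n))
    preds = np.array([f(z) for z in masks * x])
    # 2. Weight: exponential kernel on distance from the all-on mask.
    dist = np.sqrt(((1 - masks) ** 2).sum(axis=1)) / np.sqrt(n)
    weights = np.exp(-dist**2 / kernel_width**2)
    # 3. Fit: weighted least-squares linear surrogate on the masks.
    A = np.hstack([masks, np.ones((n_samples, 1))])   # add intercept column
    W = np.sqrt(weights)[:, None]
    coef, *_ = np.linalg.lstsq(W * A, W[:, 0] * preds, rcond=None)
    # 4. Explain: coefficients (excluding the intercept) are the attributions.
    return coef[:n]

f = lambda z: 3.0 * z[0] - 2.0 * z[1]                 # toy model
attr = lime_explain(f, np.array([1.0, 1.0, 1.0]))
```

For this linear toy model the surrogate recovers the attributions [3, -2, 0] exactly; the instability criticized below only shows up for nonlinear decision boundaries.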

Criticism: LIME explanations are sensitive to the perturbation distribution, kernel width, and number of perturbations. Different runs can produce different explanations. The surrogate may not faithfully approximate the model's decision boundary.

Integrated Gradients

Sundararajan et al. (2017): Accumulate gradients along a straight-line path from a baseline x' (e.g., black image) to the input x:

IG_i(x) = (x_i - x'_i) × ∫₀¹ (∂f/∂x_i)(x' + α(x - x')) dα

Axiomatic justification: Satisfies sensitivity (if changing feature i changes the output, it gets non-zero attribution), implementation invariance (two functionally identical networks get the same attributions), and completeness (attributions sum to f(x) - f(x')). Computed by averaging gradients at 20-300 interpolation points.
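A minimal Riemann-sum (midpoint rule) implementation on a toy model with an analytic gradient; in practice ∂f/∂x comes from autodiff:

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=100):
    """Midpoint-rule approximation of the IG path integral."""
    alphas = (np.arange(steps) + 0.5) / steps         # midpoints in (0, 1)
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Toy model f(z) = z0^2 + 3*z1 with an analytic gradient.
f = lambda z: z[0] ** 2 + 3.0 * z[1]
grad_f = lambda z: np.array([2.0 * z[0], 3.0])

x, baseline = np.array([2.0, 1.0]), np.zeros(2)
ig = integrated_gradients(grad_f, x, baseline)
# Completeness: attributions sum to f(x) - f(baseline) = 7.
```

Here the attributions are [4, 3] and sum exactly to f(x) - f(x'), illustrating completeness.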

Limitation: Sensitive to baseline choice. The straight-line path may pass through unrealistic regions of input space.

GradCAM (Gradient-weighted Class Activation Mapping)

Selvaraju et al. (2017): For CNNs, compute the gradient of the class score with respect to the final convolutional feature map A^k. The importance weight of feature map k:

α_k = (1/Z) Σ_i Σ_j (∂y^c / ∂A^k_{ij})

The GradCAM heatmap: L = ReLU(Σ_k α_k A^k). ReLU retains only features with positive influence on the class. Produces coarse localization maps highlighting which image regions contribute to the prediction.
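The two formulas amount to a global average pool over gradients followed by a weighted sum and ReLU. A numpy sketch on synthetic tensors (in practice A^k and ∂y^c/∂A^k come from a forward and backward pass through the CNN):

```python
import numpy as np

def grad_cam(feature_maps, grads):
    """feature_maps, grads: shape (K, H, W) -- the final conv activations A^k
    and the gradients dy^c/dA^k for one image and one class c."""
    # alpha_k: global-average-pool the gradients over spatial positions.
    alphas = grads.mean(axis=(1, 2))                     # shape (K,)
    # Weighted sum of feature maps, then ReLU to keep positive influence.
    heatmap = np.tensordot(alphas, feature_maps, axes=1) # shape (H, W)
    return np.maximum(heatmap, 0.0)

# Synthetic example: 2 feature maps of size 4x4.
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 4, 4))
dA = rng.standard_normal((2, 4, 4))
heat = grad_cam(A, dA)
```

The heatmap has the (coarse) spatial resolution of the final conv layer; for display it is typically upsampled to the input size and overlaid on the image.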

GradCAM++: Improved weighting using second-order gradients. Score-CAM: Uses activation maps as masks and measures the change in class score, avoiding gradient computation entirely.

Concept-Based Explanations

TCAV (Testing with Concept Activation Vectors)

Kim et al. (2018): Explain model predictions in terms of human-understandable concepts (e.g., "striped," "furry," "wheel-shaped") rather than individual pixels.

  1. Define a concept: Collect a set of images exemplifying the concept (e.g., images with stripes) and a set without.
  2. Train a linear probe: Fit a linear classifier in the model's activation space to distinguish concept-present vs. concept-absent activations. The normal vector to the decision boundary is the Concept Activation Vector (CAV).
  3. Directional derivative: Compute the sensitivity of the class prediction to the concept direction: S_{C,k,l} = ∇h_l f_k(x) · v_CAV. A positive value means the concept positively influences the class prediction.

Statistical testing: Use a t-test to determine if the concept's influence is significantly different from random concepts. This controls for spurious correlations.
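A numpy sketch of steps 2-3 on synthetic activations, using a least-squares linear probe as a stand-in for the concept classifier (all data and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def compute_cav(acts_pos, acts_neg):
    """Least-squares linear probe separating concept vs. non-concept
    activations; the normalized weight vector is the CAV."""
    X = np.vstack([acts_pos, acts_neg])
    y = np.concatenate([np.ones(len(acts_pos)), -np.ones(len(acts_neg))])
    Xb = np.hstack([X, np.ones((len(X), 1))])            # intercept column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    v = w[:-1]
    return v / np.linalg.norm(v)

# Synthetic layer activations: concept examples shifted along dimension 0.
d = 16
acts_neg = rng.standard_normal((200, d))
acts_pos = rng.standard_normal((200, d))
acts_pos[:, 0] += 3.0

cav = compute_cav(acts_pos, acts_neg)
# TCAV sensitivity: directional derivative of the class logit along the CAV.
grad_logit = np.zeros(d); grad_logit[0] = 1.0            # stand-in for ∇h f_k(x)
sensitivity = grad_logit @ cav
```

The probe recovers the planted concept direction, so the directional derivative is positive: the concept positively influences this (synthetic) class logit.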

Concept Bottleneck Models

Koh et al. (2020): Insert an explicit concept layer into the model architecture. The model first predicts human-defined concepts (e.g., "wing color," "beak shape"), then predicts the class from concepts. Enables: concept-level explanations (the model predicts "robin" because it detects "red breast"), human intervention (correct a concept prediction to fix the class prediction), and systematic auditing.

Tradeoff: Requires concept annotations. If concepts are incomplete (don't capture all relevant information), prediction accuracy suffers.

Mechanistic Interpretability

Goal

Understand the internal mechanisms of neural networks at the level of individual neurons, circuits, and computational structures. Rather than asking "which input features matter?" (feature attribution), ask "how does the network compute its answer internally?"

Probing

Train simple classifiers (linear probes) on intermediate representations to test what information is encoded. For language models: probe for part-of-speech, syntactic structure, semantic roles, world knowledge at different layers. High probe accuracy indicates the information is linearly accessible in the representation.

Limitations: A probe may learn to extract information that is present but not used by the model. Selectivity (Hewitt and Liang, 2019) controls for this by comparing against random baselines. Causal probing (Elazar et al., 2021) intervenes on representations to test if the probed information is causally relevant.
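A minimal sketch of probing with a shuffled-label control in the spirit of selectivity, on synthetic "activations" that linearly encode a label:

```python
import numpy as np

rng = np.random.default_rng(0)

def probe_accuracy(reps, labels):
    """Fit a least-squares linear probe; return its training accuracy."""
    X = np.hstack([reps, np.ones((len(reps), 1))])       # intercept column
    w, *_ = np.linalg.lstsq(X, np.where(labels, 1.0, -1.0), rcond=None)
    return ((X @ w > 0) == labels).mean()

# Synthetic layer activations: the label is linearly encoded in one dimension.
n, d = 1000, 16
labels = rng.integers(0, 2, n).astype(bool)
reps = rng.standard_normal((n, d))
reps[:, 3] += np.where(labels, 2.0, -2.0)

acc = probe_accuracy(reps, labels)
# Control: the same probe on shuffled labels should be near chance.
acc_control = probe_accuracy(reps, rng.permutation(labels))
```

High accuracy on the real labels but chance-level accuracy on the control suggests the information is genuinely encoded, not merely extractable by an expressive probe.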

Circuits

Olah et al. (2020) and the Anthropic interpretability team study circuits — subnetworks of neurons and connections that implement specific computations. Examples:

  • Induction heads (Olsson et al., 2022): Attention heads in transformers that implement the pattern "[A][B]...[A] → [B]" (copy the token that followed the previous occurrence). These are fundamental to in-context learning. Composed of two heads: a "previous token" head and a "match and copy" head.
  • Indirect object identification (Wang et al., 2023): A circuit in GPT-2 that identifies the indirect object in sentences like "John gave Mary the..." Involves 26 heads across 10 layers, organized into name-mover, backup name-mover, and inhibition heads.

Sparse Autoencoders (SAEs)

Bricken et al. (2023), Cunningham et al. (2023): Address the superposition hypothesis — individual neurons encode multiple concepts simultaneously because the model has more concepts than neurons.

SAEs decompose neural activations into a sparse, overcomplete basis:

h ≈ Σ_i a_i d_i (sparse coefficients a_i, dictionary vectors d_i)

The SAE is trained to reconstruct activations with an L1 sparsity penalty on coefficients. Each dictionary feature (direction d_i) often corresponds to an interpretable concept. This reveals monosemantic features hidden in polysemantic neurons.
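A toy numpy sketch of this objective (reconstruction plus L1 sparsity, ReLU encoder) trained by plain SGD on synthetic superposed activations; it illustrates the loss, not any production training recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations: sparse nonnegative combinations of 32 "true" feature
# directions superposed in a 16-dim space.
d_model, n_true, n = 16, 32, 4096
dirs = rng.standard_normal((n_true, d_model))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
codes = rng.random((n, n_true)) * (rng.random((n, n_true)) < 0.05)  # sparse
H = codes @ dirs

# Overcomplete SAE (64 dictionary features), objective:
# 0.5*||a @ W_d - h||^2 + lam * ||a||_1, with a = ReLU(h @ W_e + b_e).
d_sae, lam, lr = 64, 1e-3, 0.1
W_e = rng.standard_normal((d_model, d_sae)) / np.sqrt(d_model)
b_e = np.zeros(d_sae)
W_d = rng.standard_normal((d_sae, d_model)) / np.sqrt(d_sae)

def recon_mse(acts):
    a = np.maximum(acts @ W_e + b_e, 0.0)
    return ((a @ W_d - acts) ** 2).mean()

loss_init = recon_mse(H)
for _ in range(2000):
    h = H[rng.integers(0, n, 64)]                        # minibatch
    a = np.maximum(h @ W_e + b_e, 0.0)                   # sparse codes a_i
    err = a @ W_d - h
    g_a = (err @ W_d.T + lam * np.sign(a)) * (a > 0)     # backprop through ReLU
    W_d -= lr * a.T @ err / len(h)
    W_e -= lr * h.T @ g_a / len(h)
    b_e -= lr * g_a.mean(axis=0)
loss_final = recon_mse(H)
```

The rows of W_d play the role of the dictionary vectors d_i; after training, reconstruction error drops while the L1 penalty keeps most coefficients at zero.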

Scaling: Templeton et al. (2024) scaled SAEs to Claude (Anthropic's production model) with millions of features, finding interpretable features for safety-relevant concepts (deception, bias, harmful content).

Superposition

Elhage et al. (2022): Neural networks represent more features than they have dimensions by using superposition — encoding features as nearly-orthogonal directions in a lower-dimensional space. Features are linearly decodable but interfere with each other (crosstalk). Sparse features (rarely active) can be packed more densely: interference only matters when two superposed features are active simultaneously, which is rare when each is sparse.

Toy models: Networks trained on sparse synthetic features demonstrate phase transitions — as feature sparsity increases, the network transitions from dedicating one neuron per feature to superposed representations. This explains why individual neurons are often polysemantic.

Counterfactual Explanations

Counterfactual: "What is the smallest change to input x that would change the prediction?" Find x' = argmin d(x, x') such that f(x') ≠ f(x).

Properties of good counterfactuals: Proximity (minimal change), plausibility (x' is realistic), diversity (multiple counterfactuals exploring different changes), actionability (changes are feasible for the user).

DiCE (Mothilal et al., 2020): Generate diverse counterfactual explanations by optimizing for both proximity and diversity across a set of counterfactuals.

Algorithmic recourse: Counterfactuals that recommend actionable changes for individuals (e.g., "increase income by $5,000 and reduce debt by $2,000 to get the loan approved"). Must respect causal constraints: immutable features such as age cannot be changed.
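The argmin formulation above can be sketched with gradient descent for a differentiable model. This is the generic proximity-penalized formulation for a logistic classifier, not the DiCE algorithm (all names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def counterfactual(w, b, x, lr=0.1, lam=0.2, steps=500):
    """Find x' near x that sigmoid(w·x' + b) classifies as positive:
    minimize -log p(x') + lam * ||x' - x||^2 by gradient descent."""
    xp = x.copy()
    for _ in range(steps):
        p = sigmoid(w @ xp + b)
        g = (p - 1.0) * w + 2.0 * lam * (xp - x)   # gradient of the objective
        xp -= lr * g
    return xp

w, b = np.array([1.0, -2.0, 0.5]), -1.0
x = np.array([0.0, 1.0, 0.0])                      # classified negative
xp = counterfactual(w, b, x)
p_before, p_after = sigmoid(w @ x + b), sigmoid(w @ xp + b)
```

The penalty weight lam trades off proximity against crossing the decision boundary; diversity and actionability constraints (as in DiCE) require additional terms.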

Influence Functions

Koh and Liang (2017): Estimate the influence of a single training example z on a prediction f(x_test). Using a second-order Taylor approximation around the optimal parameters θ*:

I(z, x_test) = -∇_θ L(x_test, θ*)^T H_θ^{-1} ∇_θ L(z, θ*)

where H is the Hessian of the empirical risk. This measures how much upweighting training example z would change the prediction on x_test.
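For ridge regression the formula can be evaluated exactly and compared against true leave-one-out retraining. Note the sign convention: I(z, x_test) measures the effect of up-weighting z, so removing z should produce the opposite change in test loss (a sketch on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ridge regression: per-example loss 0.5*(x·w - y)^2 plus (lam/2)*||w||^2.
n, d, lam = 50, 5, 1e-2
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def fit(Xm, ym):
    return np.linalg.solve(Xm.T @ Xm / len(ym) + lam * np.eye(d),
                           Xm.T @ ym / len(ym))

w = fit(X, y)
H = X.T @ X / n + lam * np.eye(d)             # Hessian of the empirical risk
x_test, y_test = rng.standard_normal(d), 0.0
g_test = (x_test @ w - y_test) * x_test       # gradient of the test loss

# I(z_i, x_test) = -g_test^T H^{-1} grad L(z_i, w*), for all z_i at once.
grads = (X @ w - y)[:, None] * X              # per-example training gradients
infl = -grads @ np.linalg.solve(H, g_test)

# Ground truth: actual leave-one-out retraining (removing = down-weighting,
# so the LOO change in test loss should anti-correlate with infl).
loo = []
for i in range(n):
    w_i = fit(np.delete(X, i, 0), np.delete(y, i, 0))
    loo.append(0.5 * (x_test @ w_i - y_test) ** 2
               - 0.5 * (x_test @ w - y_test) ** 2)
corr = np.corrcoef(infl, np.array(loo))[0, 1]
```

For this quadratic loss the approximation is nearly exact, so the correlation is strongly negative; for deep networks it degrades, as noted below.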

Applications: Identify training examples responsible for a prediction (data attribution), detect mislabeled training data (high self-influence), understand model errors, detect data poisoning.

Scalability: Computing H^{-1} is O(np² + p³) for n examples and p parameters — intractable for large models. Approximations: stochastic estimation (LiSSA), Kronecker-factored curvature (K-FAC), Arnoldi iteration. TracIn (Pruthi et al., 2020) avoids the Hessian entirely by summing gradient dot products over training checkpoints.

Limitations for deep networks: Influence functions rely on the quadratic approximation being accurate, which may not hold for highly non-convex loss landscapes. Empirical validation (leave-one-out retraining) shows that influence function estimates can be inaccurate for large neural networks.

Faithfulness and Evaluation

A central challenge: are explanations faithful to the model's actual reasoning? An explanation method might produce plausible-looking attributions that do not reflect the model's computation.

Evaluation metrics:

  • Deletion/insertion: Progressively remove (or add) features in order of attribution. Faithful attributions should cause rapid change in model output.
  • ROAR (Remove And Retrain): Remove top-attributed features and retrain. If attributions are meaningful, retraining without important features should degrade performance.
  • Sanity checks (Adebayo et al., 2018): Randomize model weights progressively and check if explanations change. Methods that produce similar explanations for trained and random models (e.g., some gradient-based methods) are not explaining the model.
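The deletion metric is straightforward to sketch; here a faithful attribution ranking for a toy linear model is compared against a deliberately reversed one (illustrative names throughout):

```python
import numpy as np

def deletion_curve(f, x, attributions, baseline=0.0):
    """Zero out features in decreasing |attribution| order; record outputs."""
    order = np.argsort(-np.abs(attributions))
    z = x.astype(float).copy()
    outputs = [f(z)]
    for i in order:
        z[i] = baseline
        outputs.append(f(z))
    return np.array(outputs)

f = lambda z: 4.0 * z[0] + 1.0 * z[1] + 0.1 * z[2]
x = np.ones(3)
good = deletion_curve(f, x, np.array([4.0, 1.0, 0.1]))  # faithful ranking
bad = deletion_curve(f, x, np.array([0.1, 1.0, 4.0]))   # reversed ranking
```

A faithful ranking drives the output down faster, giving a smaller area under the deletion curve; the insertion variant reverses the direction (start from the baseline, add features).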