
Meta-Learning

Overview

Meta-learning ("learning to learn") addresses the problem of rapid adaptation: given experience across many tasks, learn an inductive bias that enables fast learning on new, unseen tasks with very few examples. Standard supervised learning optimizes performance on a single task with many examples. Meta-learning optimizes the learning process itself across a distribution of tasks.

Problem Formulation

A task T = (D_train, D_test) consists of a small training set (support set) and a test set (query set). Tasks are drawn from a task distribution p(T). During meta-training, the learner processes many tasks, learning a strategy that generalizes. During meta-testing, the learner faces new tasks from the same distribution, applying the learned strategy to the support set and evaluating on the query set.

N-way K-shot classification: Each task has N classes, each with K labeled examples in the support set (typically K ∈ {1, 5}). The query set contains additional examples from the same N classes. Standard benchmarks: Omniglot (1623 characters, 20 examples each), miniImageNet (100 classes from ImageNet, 600 examples each), tieredImageNet (608 classes, hierarchically structured).
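The N-way K-shot episode construction can be sketched as a sampler that draws N classes and splits each class's examples into support and query sets. This is a minimal NumPy sketch; `sample_episode` and its argument names are illustrative, not from any benchmark library.

```python
import numpy as np

def sample_episode(data_by_class, n_way=5, k_shot=1, q_queries=3, rng=None):
    """Sample one N-way K-shot episode (support + query) from a dataset.

    data_by_class: dict mapping class id -> array of examples, shape (n_i, d).
    Returns (support_x, support_y, query_x, query_y) with episode-local
    labels 0..n_way-1.
    """
    rng = rng or np.random.default_rng()
    classes = rng.choice(list(data_by_class), size=n_way, replace=False)
    sx, sy, qx, qy = [], [], [], []
    for label, c in enumerate(classes):
        # draw k_shot + q_queries distinct examples of this class
        idx = rng.permutation(len(data_by_class[c]))[:k_shot + q_queries]
        examples = data_by_class[c][idx]
        sx.append(examples[:k_shot]); sy += [label] * k_shot
        qx.append(examples[k_shot:]); qy += [label] * q_queries
    return (np.concatenate(sx), np.array(sy),
            np.concatenate(qx), np.array(qy))
```

During meta-training, one such episode is sampled per iteration and the loss is computed on the query set, mirroring the test-time protocol.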

Taxonomy

Meta-learning approaches are categorized by what is learned:

  1. Metric-based: Learn a similarity space where comparison-based classification is effective.
  2. Optimization-based: Learn an initialization or optimization procedure that enables fast fine-tuning.
  3. Model-based: Learn a model architecture or parameterization that can quickly absorb new information.

Metric-Based Meta-Learning

Core Idea

Learn an embedding function fθ : X → ℝ^d such that classification in embedding space reduces to nearest-neighbor-like comparison. At test time, embed support and query examples, classify queries by proximity to support embeddings.

Siamese Networks

Koch et al. (2015) train a Siamese network to determine whether two images belong to the same class. Twin networks with shared weights embed both images; the L1 distance between embeddings is fed to a sigmoid classifier. At test time, the query is compared to each support example; the class of the nearest match is predicted.

Limitation: Pairwise comparison — does not leverage the full support set structure.

Matching Networks

Vinyals et al. (2016) classify queries via a soft attention over support set embeddings:

p(y = k | x, S) = Σ_{(x_i, y_i) ∈ S} a(x, x_i) · 1[y_i = k]

where a(x, x_i) = softmax(cosine(f(x), g(x_i))) and f, g are embedding functions (potentially with context via LSTM over the support set — Full Context Embeddings). The attention mechanism weights each support example's contribution.
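The attention-based classifier follows directly from the formula: cosine similarities, a softmax over the support set, then per-class accumulation of attention weights. In this sketch the embedding arrays stand in for the outputs of trained networks f and g.

```python
import numpy as np

def matching_net_predict(query, support, support_labels, n_classes):
    """Attention-based classification in the style of Vinyals et al. (2016).

    query: (d,) embedding f(x); support: (m, d) embeddings g(x_i).
    Returns class probabilities p(y = k | x, S).
    """
    # cosine similarity between the query and each support embedding
    sims = support @ query / (
        np.linalg.norm(support, axis=1) * np.linalg.norm(query) + 1e-8)
    # softmax attention a(x, x_i) over the support set
    a = np.exp(sims - sims.max())
    a /= a.sum()
    # accumulate attention mass per class: sum_i a(x, x_i) * 1[y_i = k]
    probs = np.zeros(n_classes)
    np.add.at(probs, support_labels, a)
    return probs
```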

Episodic training: Training mimics the test scenario — sample N-way K-shot tasks from the training set, compute loss on query examples. This "learning in the same way as testing" principle is fundamental to meta-learning.

Prototypical Networks

Snell et al. (2017) compute a prototype (mean embedding) for each class from the support set:

c_k = (1/|S_k|) Σ_{x_i ∈ S_k} fθ(x_i)

Classification uses the negative squared Euclidean distance:

p(y = k | x) = softmax(-||fθ(x) - c_k||²)

Simplicity and effectiveness: Prototypical networks are simpler than matching networks (no attention, no LSTM context) yet consistently outperform them. Because squared Euclidean distance is a Bregman divergence, classification with class-mean prototypes is equivalent to a linear model applied to the embedding (Snell et al., 2017).

Theoretical insight: The prototype computation is a sufficient statistic for classification under a Gaussian class-conditional assumption. The squared Euclidean distance produces a linear decision boundary between classes.
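The two equations above amount to a few lines of code: average the support embeddings per class, then take a softmax over negative squared distances. A minimal NumPy sketch, with the embedding arrays standing in for fθ outputs:

```python
import numpy as np

def proto_net_predict(query, support, support_labels, n_classes):
    """Prototypical network classification (Snell et al., 2017).

    support: (m, d) support embeddings; query: (d,) query embedding.
    """
    # prototype c_k = mean of support embeddings of class k
    protos = np.stack([support[support_labels == k].mean(axis=0)
                       for k in range(n_classes)])
    # softmax over negative squared Euclidean distances to each prototype
    neg_sq_dist = -((protos - query) ** 2).sum(axis=1)
    z = np.exp(neg_sq_dist - neg_sq_dist.max())
    return z / z.sum()
```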

Relation Networks

Sung et al. (2018) replace the fixed distance metric with a learned relation module: a small neural network that takes concatenated embeddings and outputs a similarity score. This enables learning non-linear, task-adaptive similarity measures.
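The relation module is just a small MLP over concatenated embeddings. A sketch with hypothetical learned parameters (in practice these are meta-trained end-to-end with the embedding network):

```python
import numpy as np

def relation_score(q_emb, s_emb, W1, b1, w2, b2):
    """Learned similarity in the style of Sung et al. (2018).

    Concatenates query and support embeddings, passes them through a
    one-hidden-layer MLP, and outputs a similarity score in (0, 1).
    W1, b1, w2, b2 are hypothetical meta-learned parameters.
    """
    z = np.concatenate([q_emb, s_emb])           # concatenated embeddings
    h = np.maximum(z @ W1 + b1, 0.0)             # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))  # sigmoid relation score
```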

Optimization-Based Meta-Learning

MAML (Model-Agnostic Meta-Learning)

Finn et al. (2017) learn an initialization θ of model parameters such that a few gradient steps on a new task's support set yields good performance on the query set.

Inner loop (task adaptation): For each task T_i, compute adapted parameters:

θ'_i = θ - α ∇_θ L_{T_i}^{support}(θ)

Outer loop (meta-update): Update θ to minimize the post-adaptation loss:

θ ← θ - β ∇_θ Σ_i L_{T_i}^{query}(θ'_i)

The outer gradient involves differentiating through the inner gradient step — a second-order gradient (gradient of a gradient). This is computationally expensive but captures how the initialization affects post-adaptation performance.
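The two-loop structure is easiest to see on a toy problem where everything is analytic. For 1-D quadratic tasks L_i(θ) = (θ − t_i)², both the inner step and its derivative dθ'_i/dθ = 1 − 2α have closed forms, so exact second-order MAML fits in a few lines. This is a sketch; the tasks and hyperparameters are illustrative.

```python
def maml_quadratic(task_centers, theta0=0.0, alpha=0.1, beta=0.05, steps=200):
    """Exact second-order MAML on toy 1-D tasks L_i(theta) = (theta - t_i)^2.

    For a quadratic, theta'_i = theta - alpha * 2 (theta - t_i) and
    d theta'_i / d theta = 1 - 2 alpha, so the outer gradient of the
    post-adaptation loss is computable in closed form.
    """
    theta = theta0
    for _ in range(steps):
        outer_grad = 0.0
        for t in task_centers:
            grad_inner = 2.0 * (theta - t)          # inner-loop gradient
            theta_i = theta - alpha * grad_inner    # adapted parameters
            # chain rule through the inner step (the second-order term)
            outer_grad += 2.0 * (theta_i - t) * (1.0 - 2.0 * alpha)
        theta -= beta * outer_grad                  # meta-update
    return theta
```

For tasks placed symmetrically (e.g. t ∈ {0, 2}), the learned initialization converges to the point from which one gradient step reaches every task optimum fastest, here the midpoint.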

First-order approximations:

  • FOMAML: Ignore second-order terms in the outer gradient. Use ∇_{θ'_i} L^{query}(θ'_i) as an approximation. Surprisingly effective in practice.
  • Reptile (Nichol et al., 2018): Even simpler — perform k inner gradient steps to get θ'_i, then update θ toward θ'_i: θ ← θ + ε(θ'_i - θ). No second-order gradients needed. Interpretable as approximating both joint training and MAML.
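Reptile on the same toy quadratics shows how little machinery it needs: no second-order gradients, just inner SGD followed by a step toward the adapted parameters. This sketch averages the update over all tasks per meta-step (a batched variant, used here so the illustration is deterministic).

```python
def reptile_quadratic(task_centers, theta0=0.0, alpha=0.1,
                      inner_steps=5, eps=0.1, meta_steps=200):
    """Reptile (Nichol et al., 2018) on toy tasks L_t(theta) = (theta - t)^2."""
    theta = theta0
    for _ in range(meta_steps):
        deltas = []
        for t in task_centers:
            theta_i = theta
            for _ in range(inner_steps):
                theta_i -= alpha * 2.0 * (theta_i - t)  # inner SGD step
            deltas.append(theta_i - theta)  # direction toward theta'_i
        theta += eps * sum(deltas) / len(deltas)  # theta += eps (theta' - theta)
    return theta
```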

Multi-Step MAML

Standard MAML uses one inner gradient step. Using multiple steps (θ^(j+1) = θ^(j) - α ∇ L^{support}(θ^(j)), starting from θ^(0) = θ) improves adaptation quality but increases computational cost and memory (all intermediate parameters must be stored for the second-order gradient).

Gradient checkpointing and implicit differentiation (iMAML, Rajeswaran et al., 2019) reduce memory cost. iMAML computes the outer gradient without storing intermediate parameters by solving a linear system involving the Hessian.

MAML Variants

  • ANIL (Almost No Inner Loop): Only adapt the last (classification) layer in the inner loop; the feature extractor is updated only in the outer loop, not during adaptation. Nearly matches MAML performance, suggesting that the learned initialization primarily provides a good feature representation, and the last layer is the main adaptation target.
  • Meta-SGD: Learn per-parameter learning rates (the inner loop α is a learnable vector, not a scalar).
  • CAVIA: Adapt only a small set of context parameters (appended to the input), not the full model. Reduces inner loop dimensionality.
  • LEO (Latent Embedding Optimization): Perform inner loop optimization in a low-dimensional latent space of model parameters, then decode to full parameters. Addresses MAML's difficulty with high-dimensional parameter spaces.

Model-Based Meta-Learning

Memory-Augmented Neural Networks (MANN)

Santoro et al. (2016) use a Neural Turing Machine (NTM) with external memory for few-shot learning. The memory stores representations of previously seen examples. When a new example arrives, the controller reads relevant memories and writes the new example. Classification is based on memory content.

Hypernetworks

A hypernetwork generates the weights of a task-specific model. Given a task description (e.g., support set), the hypernetwork produces a full set of parameters for a classifier. This amortizes the cost of adaptation — no inner loop optimization needed.

Task-conditioned hypernetworks: Encode the support set with a set encoder (e.g., DeepSets, Set Transformer), then use the encoding to generate task-specific weights via a hypernetwork.
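A minimal version of this pipeline: mean-pool the support set per class (a trivial DeepSets-style encoder), then map each class encoding to that class's classifier weight vector with a linear hypernetwork. `W_h` and `b_h` are hypothetical meta-learned parameters; real systems use deeper set encoders and generators.

```python
import numpy as np

def hypernet_generate(support_x, support_y, n_classes, W_h, b_h):
    """Task-conditioned weight generation -- a minimal sketch.

    support_x: (m, d) support features; W_h: (d, d), b_h: (d,) are
    hypothetical meta-learned hypernetwork parameters.
    Returns W_task of shape (n_classes, d); classify queries via
    logits = query @ W_task.T. No inner-loop optimization is run.
    """
    # set encoder: permutation-invariant mean per class -> (n_classes, d)
    enc = np.stack([support_x[support_y == k].mean(axis=0)
                    for k in range(n_classes)])
    # hypernetwork: generate one classifier weight row per class
    W_task = enc @ W_h + b_h
    return W_task
```

With `W_h` set to the identity and `b_h` to zero, this degenerates to a nearest-class-mean classifier under the dot product, which makes the amortization idea concrete: adaptation is a single forward pass.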

Learned Optimizers

Learn the optimization algorithm itself. An LSTM or Transformer takes as input the gradient history and outputs parameter updates:

θ_{t+1} = θ_t + g_φ(∇L_t, ∇L_{t-1}, ..., h_t)

where g_φ is the learned optimizer and h_t is its hidden state. The learned optimizer is meta-trained to produce fast convergence across tasks. Andrychowicz et al. (2016) introduced this direction with an LSTM optimizer. (MetaOptNet, sometimes grouped here, is different in kind: it keeps standard optimization but solves a convex base learner, a differentiable SVM, exactly in the inner loop.)
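The update rule can be illustrated with a toy g_φ whose "hidden state" is a momentum-like running average of past gradients, standing in for an LSTM state. The scalars in `phi` play the role of meta-learned parameters; here they are simply fixed by hand.

```python
def learned_opt_step(theta, grad, h, phi):
    """One step of a toy learned optimizer g_phi -- a sketch.

    phi = (w_g, w_h, decay): hypothetical meta-learned scalars.
    h: the optimizer's hidden state (an exponential average of gradients,
    standing in for an LSTM hidden state).
    Returns (updated parameters, updated hidden state).
    """
    w_g, w_h, decay = phi
    h = decay * h + (1.0 - decay) * grad   # update hidden state
    update = -(w_g * grad + w_h * h)       # g_phi(grad, h)
    return theta + update, h
```

In a real learned optimizer, `phi` would be the weights of a recurrent network, meta-trained by backpropagating the task loss after many unrolled steps.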

Few-Shot Benchmarks and Evaluation

Standard Benchmarks

| Benchmark      | Domain             | Classes  | Images/class | N-way    | K-shot   |
|----------------|--------------------|----------|--------------|----------|----------|
| Omniglot       | Handwritten chars  | 1623     | 20           | 5/20     | 1/5      |
| miniImageNet   | Natural images     | 100      | 600          | 5        | 1/5      |
| tieredImageNet | Natural images     | 608      | ~1200        | 5        | 1/5      |
| CUB-200        | Fine-grained birds | 200      | ~60          | 5        | 1/5      |
| Meta-Dataset   | Multi-domain       | Variable | Variable     | Variable | Variable |

Evaluation Protocol

Randomly sample many test episodes (typically 600 to 10,000). Report mean accuracy with a 95% confidence interval over the query sets. Meta-Dataset (Triantafillou et al., 2020) introduced variable-way variable-shot evaluation from multiple domains, testing generalization across diverse distributions.
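The reported interval is the standard normal-approximation CI over per-episode accuracies; a small helper makes the convention explicit:

```python
import numpy as np

def episode_ci(accuracies, z=1.96):
    """Mean accuracy with a 95% confidence interval over test episodes.

    accuracies: one query-set accuracy per episode.
    Returns (mean, half_width); report as mean +/- half_width, where
    half_width = z * sample_std / sqrt(n_episodes).
    """
    acc = np.asarray(accuracies, dtype=float)
    mean = acc.mean()
    half_width = z * acc.std(ddof=1) / np.sqrt(len(acc))
    return mean, half_width
```

Because the half-width shrinks like 1/sqrt(n), evaluating on only a few hundred episodes can leave intervals wide enough to make nearby methods statistically indistinguishable.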

Strong Baselines

Surprisingly, fine-tuning a pre-trained feature extractor (train on all meta-training classes with standard cross-entropy, then fine-tune the last layer on the support set) is competitive with or exceeds many meta-learning methods (Chen et al., 2019). This suggests that learning a good representation is more important than the meta-learning algorithm.
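The baseline reduces to multinomial logistic regression on frozen features. A minimal sketch in the spirit of Chen et al. (2019), with a plain gradient-descent fit (real implementations typically use an off-the-shelf solver):

```python
import numpy as np

def finetune_last_layer(feats_s, ys, n_classes, lr=0.5, steps=200):
    """Fit only a linear classifier on frozen support features.

    feats_s: (m, d) pre-trained features of the support set (frozen).
    Trains W, b with softmax cross-entropy; classify queries via
    logits = query_feats @ W + b.
    """
    m, d = feats_s.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[ys]                        # one-hot labels
    for _ in range(steps):
        logits = feats_s @ W + b
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)            # softmax
        g = (p - Y) / m                              # CE gradient wrt logits
        W -= lr * feats_s.T @ g
        b -= lr * g.sum(axis=0)
    return W, b
```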

Cross-Domain Meta-Learning

Standard meta-learning assumes meta-train and meta-test tasks come from the same distribution. Cross-domain meta-learning tests generalization to entirely different domains (e.g., train on ImageNet, test on medical images or satellite imagery).

Feature-wise Linear Modulation (FiLM): Modulate feature maps with task-specific affine transformations. URL (Universal Representation Learning): Learn domain-agnostic representations by training across diverse domains with domain-specific heads.

CDFSL benchmark (Guo et al., 2020): Evaluate methods trained on miniImageNet on CropDisease, EuroSAT, ISIC (dermatology), ChestX. Large domain shifts reveal that many meta-learning methods fail to generalize, while simple transfer learning baselines remain competitive.

Connections to Other Fields

  • Transfer learning: Meta-learning can be seen as learning what to transfer across tasks.
  • Multi-task learning: Meta-learning shares representations across tasks but with explicit adaptation mechanisms.
  • Bayesian learning: MAML's inner-loop adaptation can be viewed as approximate MAP inference of task-specific parameters, with the meta-learned initialization acting as the prior (Grant et al., 2018).
  • Continual learning: Meta-learning methods (OML, ANML) address catastrophic forgetting by learning to learn without forgetting.
  • AutoML: Meta-learning can select hyperparameters, architectures, or augmentation strategies based on task characteristics.