Neural Architecture Search
Overview
Neural Architecture Search (NAS) automates the design of neural network architectures, replacing human expertise with algorithmic search. The goal: given a task and dataset, find an architecture that optimizes the desired objectives (accuracy, latency, parameter count, etc.) within a defined search space. NAS has discovered architectures matching or exceeding human-designed ones, but at significant computational cost, motivating research into efficient search methods.
NAS Components
Every NAS method specifies three components:
- Search space: The set of possible architectures. Defines what architectures can be discovered.
- Search strategy: How to explore the search space. Determines the efficiency and effectiveness of the search.
- Performance estimation: How to evaluate candidate architectures. The bottleneck — training each candidate to convergence is prohibitively expensive.
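The three components can be sketched together in a toy example. Everything here is an illustrative assumption: the operation set, the random-search strategy, and the stand-in scoring function that replaces real training.

```python
import random

# Toy search space: an architecture is one operation per edge.
OPS = ["conv3x3", "conv5x5", "maxpool", "skip"]

def sample_architecture(num_edges=4):
    """Search space: draw one operation for each of num_edges edges."""
    return tuple(random.choice(OPS) for _ in range(num_edges))

def estimate_performance(arch):
    """Performance estimation: a cheap proxy in place of full training.

    Toy score (an assumption, not a real metric): count convolutions,
    plus a little noise to mimic evaluation variance.
    """
    return sum(op.startswith("conv") for op in arch) + random.random() * 0.1

def random_search(num_trials=100, seed=0):
    """Search strategy: sample candidates, keep the best-scoring one."""
    random.seed(seed)
    best, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = sample_architecture()
        score = estimate_performance(arch)
        if score > best_score:
            best, best_score = arch, score
    return best, best_score

best, score = random_search()
```

Random search is the weakest strategy, but it is the baseline every NAS method is compared against; swapping `random_search` for RL, evolution, or gradient-based search changes only the middle component.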
Search Spaces
Macro Search Space
The entire network topology is searchable: number of layers, layer types (conv, pool, FC, skip), connections between layers. Highly expressive but enormous — early NAS methods used this (Zoph and Le, 2017).
Cell-Based (Micro) Search Space
Search for a small building block (cell), then stack copies of the cell to form the full network. Two cell types are typically learned:
- Normal cell: Preserves spatial dimensions.
- Reduction cell: Reduces spatial dimensions (stride-2 operations).
The stacking pattern (number of cells, where to place reduction cells) is predefined. This dramatically reduces the search space (searching one cell of ~5-7 nodes vs. an entire 20+ layer network) and improves transferability — a cell found on CIFAR-10 can be stacked differently for ImageNet.
Cell internals: A directed acyclic graph (DAG) of nodes. Each node represents a latent representation. Each edge applies an operation (3×3 conv, 5×5 conv, max pool, skip connection, zero/none). NAS selects which operations to place on which edges and how nodes combine inputs (sum, concatenation).
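A minimal sketch of a cell as a DAG, assuming a toy genotype encoding (a list of (operation, source-node) pairs per intermediate node) and stand-in operations; a real cell would use convolutions and pooling instead of the scalar ops here.

```python
import numpy as np

# Stand-in operations so the sketch runs without a DL framework.
# "scale" is a placeholder for a learned conv; "zero" prunes an edge.
OPS = {
    "skip": lambda x: x,
    "zero": lambda x: np.zeros_like(x),
    "scale": lambda x: 0.5 * x,
}

def eval_cell(genotype, inputs):
    """genotype: per intermediate node, a list of (op_name, src_index) edges.

    Each intermediate node sums its incoming edges' outputs; the cell
    output concatenates the intermediate nodes.
    """
    nodes = list(inputs)                       # nodes 0..k-1 are cell inputs
    for edges in genotype:
        nodes.append(sum(OPS[op](nodes[src]) for op, src in edges))
    return np.concatenate(nodes[len(inputs):])

x0, x1 = np.ones(4), 2 * np.ones(4)
genotype = [
    [("skip", 0), ("scale", 1)],   # node 2 = x0 + 0.5 * x1
    [("zero", 0), ("skip", 2)],    # node 3 = 0 + node 2
]
out = eval_cell(genotype, [x0, x1])
```

NAS over this space means searching over the genotype: which operation sits on which edge, and which earlier nodes each node reads from.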
Operation Choices
Typical operation sets: {3×3 separable conv, 5×5 separable conv, 3×3 dilated conv, 3×3 max pool, 3×3 avg pool, skip connection, zero}. The identity (skip connection) enables residual-like learning. The zero operation enables edge pruning.
Hierarchical and Flexible Spaces
- Hierarchical NAS: Search at multiple levels: primitive operations, motifs (small subgraphs), cells, and network-level stacking patterns.
- FBNet, OFA (Once-for-All): Search spaces that include variable kernel sizes, expansion ratios, and channel numbers per layer, targeting mobile deployment with hardware constraints.
Search Strategies
Reinforcement Learning
NASNet (Zoph et al., 2018)
An RNN controller generates architecture descriptions as sequences of tokens (operation type, skip connections). The controller is trained with REINFORCE:
J(θ_c) = E_{a~π_θ}[R(a)],  with gradient estimated as ∇_θ J ≈ (1/m) Σ_k R(a_k) ∇_θ log π_θ(a_k)
where R(a) is the validation accuracy of architecture a (trained for a fixed number of epochs) and a_1, ..., a_m are architectures sampled from the controller. The controller learns to propose better architectures over time.
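A minimal REINFORCE sketch of the controller update. To stay self-contained it makes two simplifying assumptions: the policy is an independent softmax over operations per edge (not an RNN), and the reward is a stand-in function rather than validation accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_EDGES, NUM_OPS = 3, 4
theta = np.zeros((NUM_EDGES, NUM_OPS))        # policy logits (the "controller")

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def reward(arch):
    """Stand-in for validation accuracy: prefer op 0 on every edge."""
    return float(np.mean(arch == 0))

baseline, lr = 0.0, 0.5
for step in range(300):
    probs = softmax(theta)
    # Sample an architecture: one operation index per edge.
    arch = np.array([rng.choice(NUM_OPS, p=p) for p in probs])
    R = reward(arch)
    baseline = 0.9 * baseline + 0.1 * R       # moving-average baseline
    # grad of log pi(arch) per edge: one_hot(chosen op) - probs
    grad = -probs
    grad[np.arange(NUM_EDGES), arch] += 1.0
    theta += lr * (R - baseline) * grad       # REINFORCE update

final = softmax(theta)
```

The baseline subtracts the running mean reward, reducing gradient variance; the same estimator, scaled up with an RNN policy and real training-in-the-loop rewards, is what made the original NAS so expensive.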
Cost: The original NAS (Zoph and Le, 2017) required 800 GPUs for 28 days (22,400 GPU-days) to search on CIFAR-10. NASNet reduced this with the cell-based search space but still required ~500 GPU-days.
ENAS (Efficient NAS, Pham et al., 2018): Share weights across candidate architectures by defining a single over-parameterized supernet. The controller samples subgraphs (architectures) from the supernet; each subgraph reuses the supernet's trained weights. Reduces search cost to ~0.5 GPU-days.
Evolutionary Search
AmoebaNet (Real et al., 2019): Evolve architectures using a tournament selection evolutionary algorithm:
- Initialize a population of random architectures.
- Tournament selection: Sample a subset, select the fittest (highest validation accuracy).
- Mutation: Modify the selected architecture (change an operation, rewire a connection, add/remove a node).
- Aging: Remove the oldest architectures from the population (regularized evolution).
- Repeat.
Regularized evolution (aging) prevents premature convergence by continuously injecting diversity. AmoebaNet achieved results comparable to RL-based NAS with similar computational cost.
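The loop above can be sketched directly. The genome (a tuple of operations), the fitness function, and all hyperparameters are illustrative assumptions; a real run would replace `fitness` with training and validation.

```python
import collections
import random

OPS = ["conv3x3", "conv5x5", "maxpool", "skip"]

def fitness(arch):
    """Stand-in for validation accuracy (an assumption for the sketch)."""
    return sum(op == "conv3x3" for op in arch)

def mutate(arch):
    """Change one operation at a random position."""
    child = list(arch)
    i = random.randrange(len(child))
    child[i] = random.choice(OPS)
    return tuple(child)

def regularized_evolution(pop_size=20, cycles=200, sample_size=5, seed=0):
    random.seed(seed)
    population = collections.deque(
        tuple(random.choice(OPS) for _ in range(6)) for _ in range(pop_size)
    )
    history = list(population)
    for _ in range(cycles):
        tournament = random.sample(list(population), sample_size)
        parent = max(tournament, key=fitness)  # tournament selection
        child = mutate(parent)
        population.append(child)
        population.popleft()                   # aging: drop the OLDEST, not the worst
        history.append(child)
    return max(history, key=fitness)

best = regularized_evolution()
```

The `popleft()` line is the "regularized" part: removing the oldest individual (rather than the least fit) forces even good architectures to re-prove themselves through descendants, which keeps diversity in the population.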
NSGANetV2: Multi-objective evolutionary NAS using NSGA-II. Optimizes Pareto fronts of accuracy vs. latency/FLOPs.
One-Shot / Weight-Sharing Methods
DARTS (Differentiable Architecture Search)
Liu et al. (2019) make the discrete architecture search continuous and differentiable:
- Continuous relaxation: Instead of selecting one operation per edge, compute a softmax-weighted sum over the candidate set O: ō(x) = Σ_{o∈O} [exp(α_o) / Σ_{o'∈O} exp(α_{o'})] · o(x), where the α are learnable architecture parameters.
- Bi-level optimization: Alternate between updating network weights w by SGD on the training loss (inner level) and updating architecture parameters α by gradient descent on the validation loss (outer level).
- Discretization: After search, select the operation with the highest α weight on each edge.
Cost: ~1-4 GPU-days on CIFAR-10 (orders of magnitude cheaper than RL/evolution).
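A sketch of the continuous relaxation on a single edge. The candidate operations are scalar stand-ins (an assumption; real DARTS uses convolutions and pooling), but the mechanics are the same: the edge output is a softmax-weighted mixture, and discretization keeps the argmax operation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Candidate operations on one edge (stand-ins for conv / skip / zero).
ops = [
    lambda x, w: w * x,             # "conv": learnable scale w
    lambda x, w: x,                 # skip connection
    lambda x, w: np.zeros_like(x),  # zero (prunes the edge)
]

def mixed_op(x, alpha, w):
    """DARTS mixed operation: softmax(alpha)-weighted sum of all candidates."""
    weights = softmax(alpha)
    return sum(wt * op(x, w) for wt, op in zip(weights, ops))

alpha = np.zeros(3)                 # architecture parameters, one per candidate
x = np.array([1.0, 2.0])
y = mixed_op(x, alpha, w=0.5)       # uniform weights: (0.5*x + x + 0) / 3
# Discretization: keep the operation with the largest architecture weight.
chosen = int(np.argmax(softmax(alpha)))
```

Because `mixed_op` is differentiable in both `alpha` and `w`, gradients from the validation loss can update `alpha` while gradients from the training loss update `w`, which is exactly what makes the bi-level scheme possible.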
Issues:
- Performance collapse: DARTS sometimes converges to degenerate architectures dominated by skip connections (parameter-free operations). The search favors low-training-loss architectures, which may not generalize.
- Discretization gap: The continuous supernet's optimal α may not correspond to the best discrete architecture.
Fixes:
- DARTS+: Early stopping based on skip connection proportion.
- P-DARTS: Progressive search space shrinking.
- FairDARTS: Sigmoid instead of softmax (independent operation selection).
- GDAS: Sample one operation per edge using Gumbel-softmax.
Supernet Training
Train a supernet containing all possible architectures as subnetworks. Each forward pass samples a random subnetwork (uniform or guided). After training, evaluate candidate architectures by inheriting supernet weights (no retraining). Single-Path One-Shot: Sample a single path per step. BigNAS: Train the supernet with diverse subnetwork sizes simultaneously.
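A minimal sketch of single-path sampling, assuming a toy layer/operation layout and a counter in place of a gradient step: the supernet holds one weight slot per (layer, operation) pair, and each "training step" touches only the slots on the sampled path.

```python
import random

OPS = ["conv3x3", "conv5x5", "skip"]
NUM_LAYERS = 4

# Supernet "weights": one slot per (layer, op) pair, shared by all subnetworks.
supernet = {(layer, op): 0 for layer in range(NUM_LAYERS) for op in OPS}

def sample_path():
    """Uniformly sample one operation per layer (a single subnetwork)."""
    return [random.choice(OPS) for _ in range(NUM_LAYERS)]

def train_step(path):
    """Stand-in for a gradient step: only the sampled path's slots update."""
    for layer, op in enumerate(path):
        supernet[(layer, op)] += 1

random.seed(0)
for _ in range(300):
    train_step(sample_path())

# After training, candidates inherit the shared weights without retraining;
# on average each slot has received 300 / len(OPS) updates.
updates_per_slot = sum(supernet.values()) / len(supernet)
```

The key property this illustrates: every candidate architecture's weights are trained "for free" as a side effect of supernet training, so ranking candidates afterwards needs only inference, not training.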
Zero-Cost Proxies
Estimate architecture quality without any training, using properties of the randomly initialized network.
- SynFlow (Tanaka et al., 2020): Product of absolute parameter values through the network. Proxy for network connectivity/expressiveness.
- NASWOT (Mellor et al., 2021): Overlap of binary activation patterns across mini-batch samples. High overlap = less expressive.
- Jacob_cov: Correlation of the network's input Jacobians across mini-batch samples at initialization. Measures how differently the network responds to different inputs; diverse sensitivities score higher.
- Zen-Score: Gaussian complexity of the network function.
Zero-cost proxies rank architectures in seconds but correlate only weakly with true performance, especially on large-scale tasks. They are best used for initial filtering, reducing the number of candidates that require actual training.
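A NASWOT-style sketch of one such proxy, simplified to a small fully connected ReLU network with random weights (an assumption; the original method operates on candidate conv networks): the score is the log-determinant of a kernel built from the binary activation patterns each sample induces.

```python
import numpy as np

def naswot_score(weights, batch):
    """Score an UNTRAINED ReLU net by the distinctness of its activation codes."""
    h = batch
    codes = []
    for W in weights:
        h = np.maximum(h @ W, 0.0)            # ReLU layer
        codes.append((h > 0).astype(float))   # binary activation pattern
    C = np.concatenate(codes, axis=1)         # one binary code per sample
    # Kernel entry (i, j) = number of activation units where samples agree.
    K = C @ C.T + (1.0 - C) @ (1.0 - C).T
    sign, logdet = np.linalg.slogdet(K)
    return logdet                             # higher = more distinct codes

rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 16))              # mini-batch of 8 samples
weights = [rng.normal(size=(16, 32)), rng.normal(size=(32, 32))]
score = naswot_score(weights, batch)
```

If two samples produce identical activation patterns, two rows of `K` coincide and the log-determinant collapses, which is how the score penalizes networks that cannot distinguish their inputs.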
Performance Estimation Strategies
Reduced Training
Train each candidate for fewer epochs (e.g., 20 instead of 600). Cheap but early-epoch performance may not correlate with final performance for all architectures.
Learning Curve Extrapolation
Train for a few epochs, fit a parametric curve (e.g., exponential decay), predict final performance. Terminate unpromising candidates early (successive halving, Hyperband).
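A sketch of the extrapolation step under a simple assumed curve model: validation error decays exponentially, error(t) = b · exp(-c·t), so a linear fit in log-space recovers the parameters from a few early epochs. Real systems (and the synthetic data here) would be noisier and use richer curve families.

```python
import numpy as np

def extrapolate_error(epochs, errors, final_epoch):
    """Fit log(error) = log(b) - c*t by least squares, predict error(final)."""
    slope, log_b = np.polyfit(epochs, np.log(errors), 1)
    return float(np.exp(log_b + slope * final_epoch))

# Synthetic early learning curve: error = 0.5 * exp(-0.1 * t), epochs 1..5.
t = np.arange(1, 6)
errors = 0.5 * np.exp(-0.1 * t)

pred = extrapolate_error(t, errors, final_epoch=50)
```

In a successive-halving or Hyperband scheduler, a prediction like `pred` (or simply the current partial score) decides whether a candidate gets its next budget allocation or is terminated.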
Weight Sharing / Inheritance
Reuse weights from a supernet or parent architecture. New architectures start from trained weights, requiring minimal additional training. Risk: shared weights may not accurately reflect the performance of standalone architectures.
Predictor-Based
Train a surrogate model (neural network, Gaussian process, random forest) to predict architecture performance from architecture features (operation types, graph structure, zero-cost proxy values). The predictor is trained on a small sample of fully evaluated architectures and used to score the rest. BANANAS, SemiNAS, BRP-NAS use this approach.
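A minimal predictor-based sketch using least squares as the surrogate (an assumption; BANANAS and friends use neural or ensemble predictors) and a synthetic "ground truth" accuracy in place of real training runs.

```python
import random
import numpy as np

OPS = ["conv3x3", "conv5x5", "maxpool", "skip"]

def featurize(arch):
    """Simple architecture features: count of each op type, plus a bias term."""
    return np.array([arch.count(op) for op in OPS] + [1.0])

def true_accuracy(arch):
    """Hidden ground truth the surrogate should recover (illustrative only)."""
    return 0.6 + 0.05 * arch.count("conv3x3") - 0.02 * arch.count("skip")

# "Fully evaluate" a small sample of architectures...
random.seed(0)
train_archs = [tuple(random.choice(OPS) for _ in range(6)) for _ in range(30)]
X = np.stack([featurize(a) for a in train_archs])
y = np.array([true_accuracy(a) for a in train_archs])

# ...fit the surrogate on them...
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# ...then score unseen candidates at negligible cost.
candidate = ("conv3x3",) * 6
pred = float(featurize(candidate) @ coef)
```

The economics are the point: 30 expensive evaluations buy a predictor that can rank thousands of candidates, and the top-ranked few are then verified with real training.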
Efficiency-Aware NAS
Hardware-Aware Search
Optimize not just accuracy but also latency, FLOPs, parameter count, energy consumption, or memory footprint on target hardware.
Latency prediction: Build a lookup table of per-operation latency on the target device. Architecture latency ≈ sum of operation latencies. Alternatively, train a latency predictor from measured architectures.
Multi-objective optimization: Pareto-optimal front of accuracy vs. efficiency. MnasNet (Tan et al., 2019): RL-based NAS with reward = accuracy × (latency/target)^w. FBNet: DARTS-style search with latency in the loss function.
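The two ideas above fit in a few lines: a per-operation latency lookup table summed over layers, and an MnasNet-style soft-constraint reward. The table entries are illustrative placeholders, not real device measurements; the exponent w = -0.07 is the value reported in the MnasNet paper.

```python
# Hypothetical measured ms per operation on the target device.
LATENCY_LUT_MS = {
    "conv3x3": 1.8,
    "conv5x5": 3.1,
    "maxpool": 0.4,
    "skip": 0.05,
}

def predict_latency_ms(arch):
    """Architecture latency ~ sum of its operations' table entries."""
    return sum(LATENCY_LUT_MS[op] for op in arch)

def mnasnet_reward(acc, latency_ms, target_ms, w=-0.07):
    """MnasNet-style reward: accuracy * (latency/target)^w, with w < 0.

    Exceeding the latency target shrinks the reward smoothly instead of
    imposing a hard constraint.
    """
    return acc * (latency_ms / target_ms) ** w

arch = ["conv3x3", "maxpool", "conv5x5", "skip"]
lat = predict_latency_ms(arch)  # 1.8 + 0.4 + 3.1 + 0.05 = 5.35 ms
```

The lookup-table approximation ignores operator fusion and memory effects, which is why trained latency predictors are often more accurate on real hardware.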
Once-for-All (OFA)
Cai et al. (2020): Train a single network that supports many sub-networks of different sizes (elastic depth, width, kernel size, resolution). At deployment, extract the sub-network matching the target device's constraints without retraining. Progressive shrinking training ensures sub-networks perform well.
Transferable Architectures
Architectures found on proxy tasks (CIFAR-10, small resolution) are transferred to target tasks (ImageNet, detection, segmentation) by adjusting the stacking pattern. The cell-based search space enables this: search for cells on CIFAR-10 (cheap), stack more cells for ImageNet.
EfficientNet (Tan and Le, 2019): NAS-derived baseline (EfficientNet-B0) with compound scaling: uniformly scale depth, width, and resolution using a compound coefficient φ. EfficientNet-B0 through B7 achieved state-of-the-art accuracy-efficiency tradeoffs at the time of publication.
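Compound scaling in concrete terms, using the base coefficients reported in the EfficientNet paper (α = 1.2 for depth, β = 1.1 for width, γ = 1.15 for resolution, chosen so that α·β²·γ² ≈ 2, i.e. FLOPs roughly double per unit of φ):

```python
# Base coefficients from the EfficientNet paper (found by small grid search).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

d, w, r = compound_scale(phi=1)

# FLOPs scale with depth * width^2 * resolution^2, so the per-phi
# FLOPs growth factor is alpha * beta^2 * gamma^2 (close to 2).
flops_factor = ALPHA * BETA**2 * GAMMA**2
```

Given a compute budget of 2^φ times the baseline, a single choice of φ fixes all three dimensions, replacing the usual ad-hoc decision of whether to go deeper, wider, or higher-resolution.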
EfficientNetV2: NAS with progressive learning (increasing image size during training) and Fused-MBConv blocks. Faster training and inference than V1.
AutoML
NAS is one component of the broader AutoML (Automated Machine Learning) pipeline:
Full Pipeline Automation
- Feature engineering: Automated feature selection, transformation, generation.
- Model selection: Choose among architectures and model families.
- Hyperparameter optimization (HPO): Bayesian optimization (GP-based, TPE), random search, Hyperband, BOHB (Bayesian + Hyperband).
- Data augmentation search: AutoAugment (RL to find augmentation policies), RandAugment (simplified random augmentation), TrivialAugment.
AutoML Frameworks
- Auto-sklearn: Bayesian optimization over scikit-learn pipelines (preprocessing + model + hyperparameters). Meta-learning warm-starts the search using dataset meta-features.
- Auto-WEKA: Bayesian optimization over WEKA classifiers and preprocessing.
- Google AutoML (Cloud AutoML, Vertex AI): Managed NAS and HPO services.
- Ray Tune: Scalable HPO framework supporting many search algorithms and schedulers.
- Optuna: Lightweight HPO with pruning (early stopping of unpromising trials) and multi-objective support.
Meta-Features and Warm-Starting
Use dataset characteristics (number of samples, features, classes, statistical properties) to predict which models/hyperparameters are likely to work well, warm-starting the search from historically successful configurations on similar datasets. Reduces the number of evaluations needed.
Current State and Outlook
NAS has matured from prohibitively expensive (tens of thousands of GPU-days) to practical (GPU-hours with weight sharing and zero-cost proxies). Key trends:
- Benchmark suites: NAS-Bench-101, NAS-Bench-201, NAS-Bench-301 provide precomputed performance data for all architectures in defined search spaces, enabling reproducible comparison without GPU costs.
- Foundation model era: With the rise of transformers and large pretrained models, the value of architecture search is shifting. Transformer architectures are relatively uniform; the important decisions are scale (parameters, data, compute) and training recipes rather than topology.
- NAS for specialized domains: Hardware design (accelerator-aware architectures), scientific computing (physics-informed architectures), edge deployment (tiny models for microcontrollers).