Numerical Optimization

Numerical optimization finds the minimum (or maximum) of a function. It is the computational engine behind machine learning, engineering design, and operations research.

Problem Formulation

Unconstrained

minimize f(x)    where x ∈ ℝⁿ

Constrained

minimize f(x)
subject to gᵢ(x) ≤ 0    (inequality constraints)
           hⱼ(x) = 0     (equality constraints)

Necessary Conditions (Unconstrained)

First-order (necessary): ∇f(x*) = 0 (x* is a critical point). Second-order (necessary): ∇²f(x*) is positive semidefinite. These conditions are necessary but not sufficient; ∇f(x*) = 0 together with ∇²f(x*) positive definite is sufficient for a strict local minimum.
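
As a small illustration (the example functions are my own), the second-order test for a function of two variables reduces to sign checks on the determinant and trace of the 2×2 Hessian at a critical point:

```python
# Classify a critical point of f: R^2 -> R from its 2x2 symmetric Hessian.
# A symmetric 2x2 matrix is positive definite iff det > 0 and trace > 0.

def classify_critical_point(hessian):
    """hessian = [[a, b], [b, c]], evaluated where the gradient vanishes."""
    (a, b), (_, c) = hessian
    det = a * c - b * b
    if det > 0 and a + c > 0:
        return "local minimum"      # positive definite
    if det > 0 and a + c < 0:
        return "local maximum"      # negative definite
    if det < 0:
        return "saddle point"       # indefinite
    return "inconclusive"           # semidefinite: higher-order terms decide

# f(x, y) = x^2 + y^2: Hessian [[2, 0], [0, 2]] at (0, 0) -> local minimum
# f(x, y) = x^2 - y^2: Hessian [[2, 0], [0, -2]] at (0, 0) -> saddle point
```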

Unconstrained Optimization

Steepest Descent (Gradient Descent)

x_{k+1} = xₖ - αₖ ∇f(xₖ)

where αₖ is the step size (learning rate).

Properties:

  • Direction: -∇f (steepest descent in ℓ² norm)
  • Convergence: Linear. Rate depends on condition number κ: (κ-1)/(κ+1) per step.
  • Zigzags on ill-conditioned problems (elongated contours)
  • Simple but slow for difficult problems

Step size selection:

  • Fixed: α = constant. Must be small enough (α < 2/L for L-smooth functions).
  • Exact line search: αₖ = argmin_α f(xₖ - α∇f(xₖ)). Impractical but theoretically useful.
  • Backtracking (Armijo): Start with large α, halve until sufficient decrease condition holds.
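
The update rule and the backtracking strategy fit in a few lines. A minimal sketch on a toy quadratic of my own choosing (condition number κ = 10), halving α until the Armijo sufficient-decrease condition holds:

```python
# Gradient descent with Armijo backtracking.
# f(x, y) = (x - 1)^2 + 10*(y + 2)^2, minimum at (1, -2).

def f(p):
    x, y = p
    return (x - 1) ** 2 + 10 * (y + 2) ** 2

def grad(p):
    x, y = p
    return [2 * (x - 1), 20 * (y + 2)]

def gradient_descent(p, iters=200, c1=1e-4):
    for _ in range(iters):
        g = grad(p)
        alpha = 1.0
        # Backtracking: halve alpha until sufficient decrease holds.
        # With direction d = -g, the Armijo bound is f(p) - c1*alpha*||g||^2.
        while f([p[0] - alpha * g[0], p[1] - alpha * g[1]]) > \
              f(p) - c1 * alpha * (g[0] ** 2 + g[1] ** 2):
            alpha *= 0.5
        p = [p[0] - alpha * g[0], p[1] - alpha * g[1]]
    return p

x_star = gradient_descent([0.0, 0.0])   # converges to (1, -2)
```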

Newton's Method

Uses second-order (curvature) information:

x_{k+1} = xₖ - [∇²f(xₖ)]⁻¹ ∇f(xₖ)

Properties:

  • Quadratic convergence near a minimum (if Hessian is positive definite)
  • Affine-invariant: unaffected by invertible linear changes of coordinates (unlike gradient descent)
  • Each step requires computing and inverting the Hessian: O(n³)
  • May diverge if started far from the minimum
  • May converge to saddle points or maxima

Modified Newton: Replace H with H + μI if H is not positive definite (Levenberg-Marquardt-like damping).

In practice: Solve Hₖ dₖ = -∇f(xₖ) for the direction dₖ instead of inverting H.
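
A sketch of one Newton step on a toy quadratic (my own example), solving Hₖdₖ = -∇f(xₖ) by Cramer's rule rather than inverting H. Because the Hessian of a quadratic is constant, a single step lands exactly on the minimizer:

```python
# One Newton step on f(x, y) = 3x^2 + 2xy + y^2 - 4x.
# grad = (6x + 2y - 4, 2x + 2y), Hessian H = [[6, 2], [2, 2]], minimizer (1, -1).

def solve_2x2(H, b):
    """Solve H d = b by Cramer's rule (no explicit inverse)."""
    (a, c), (e, g) = H
    det = a * g - c * e
    return [(b[0] * g - c * b[1]) / det, (a * b[1] - b[0] * e) / det]

def newton_step(p):
    x, y = p
    g = [6 * x + 2 * y - 4, 2 * x + 2 * y]
    H = [[6.0, 2.0], [2.0, 2.0]]
    d = solve_2x2(H, [-g[0], -g[1]])   # Newton direction: H d = -grad
    return [x + d[0], y + d[1]]

p1 = newton_step([10.0, -7.0])   # any start works on a quadratic -> (1, -1)
```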

Quasi-Newton Methods

Approximate the Hessian (or its inverse) using gradient information accumulated over iterations.

BFGS

The most popular quasi-Newton method. Maintains an approximation Bₖ ≈ ∇²f(xₖ):

sₖ = xₖ₊₁ - xₖ
yₖ = ∇f(xₖ₊₁) - ∇f(xₖ)
Bₖ₊₁ = Bₖ + yₖyₖᵀ/(yₖᵀsₖ) - (Bₖsₖ)(Bₖsₖ)ᵀ/(sₖᵀBₖsₖ)

Properties:

  • Superlinear convergence
  • O(n²) per iteration (vs O(n³) for Newton)
  • Self-correcting: even if B₀ = I, it converges
  • Maintains positive definiteness of B
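
A minimal sketch of a single BFGS update in two dimensions (the example vectors are my own). A quick sanity check on the formula is the secant condition Bₖ₊₁sₖ = yₖ, which the update enforces by construction:

```python
# One BFGS update with plain 2-vectors and 2x2 matrices.

def matvec(B, v):
    return [B[0][0]*v[0] + B[0][1]*v[1], B[1][0]*v[0] + B[1][1]*v[1]]

def dot(u, v):
    return u[0]*v[0] + u[1]*v[1]

def outer(u, v):
    return [[u[0]*v[0], u[0]*v[1]], [u[1]*v[0], u[1]*v[1]]]

def bfgs_update(B, s, y):
    Bs = matvec(B, s)
    t1 = outer(y, y)              # y y^T, scaled by 1/(y^T s)
    t2 = outer(Bs, Bs)            # (B s)(B s)^T, scaled by 1/(s^T B s)
    ys, sBs = dot(y, s), dot(s, Bs)
    return [[B[i][j] + t1[i][j] / ys - t2[i][j] / sBs for j in range(2)]
            for i in range(2)]

B0 = [[1.0, 0.0], [0.0, 1.0]]     # start from the identity
s = [1.0, 0.5]                    # step: x_{k+1} - x_k
y = [3.0, 1.0]                    # gradient difference (y^T s > 0 keeps B PD)
B1 = bfgs_update(B0, s, y)        # secant condition: B1 s == y
```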

L-BFGS (Limited-memory BFGS)

Stores only the last m (typically 5-20) pairs (sₖ, yₖ) instead of the full n × n matrix. Computes H⁻¹∇f using a two-loop recursion.

Memory: O(mn) instead of O(n²). Per iteration: O(mn).

The method of choice for large-scale smooth unconstrained optimization. Used in many ML libraries.
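
The two-loop recursion itself is short. A sketch with plain Python lists (the pair data below is my own toy example, constructed so that yₖ = Asₖ for A = diag(2, 10)); with pairs spanning the space, the recursion reproduces the exact Newton direction -A⁻¹∇f for that quadratic:

```python
# L-BFGS two-loop recursion: compute d = -H^{-1} grad from the last m
# (s, y) pairs without ever forming a matrix.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def two_loop(grad, pairs):
    """pairs = [(s, y), ...], oldest first, nonempty. Returns -H^{-1} grad."""
    q = list(grad)
    alphas = []
    for s, y in reversed(pairs):                  # first loop: newest to oldest
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        q = [qi - a * yi for qi, yi in zip(q, y)]
        alphas.append((a, rho))
    s, y = pairs[-1]                              # initial scaling H0 = gamma*I
    gamma = dot(s, y) / dot(y, y)
    q = [gamma * qi for qi in q]
    for (s, y), (a, rho) in zip(pairs, reversed(alphas)):  # second loop
        b = rho * dot(y, q)
        q = [qi + (a - b) * si for qi, si in zip(q, s)]
    return [-qi for qi in q]

# Pairs consistent with f(x) = 1/2 x^T A x, A = diag(2, 10):
pairs = [([1.0, 0.0], [2.0, 0.0]), ([0.0, 1.0], [0.0, 10.0])]
g = [2.0, 10.0]                  # gradient at x = (1, 1)
d = two_loop(g, pairs)           # exact Newton direction: (-1, -1)
```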

Conjugate Gradient (for Optimization)

Nonlinear CG extends the linear CG method to general functions:

dₖ₊₁ = -∇f(xₖ₊₁) + βₖ dₖ

Choices for β:

  • Fletcher-Reeves: βₖ = ‖∇fₖ₊₁‖²/‖∇fₖ‖²
  • Polak-Ribière: βₖ = ∇fₖ₊₁ᵀ(∇fₖ₊₁ - ∇fₖ)/‖∇fₖ‖²

Properties: O(n) memory. On quadratics it reduces to linear CG and terminates exactly in at most n steps; on general functions it is typically restarted every n iterations, giving n-step superlinear convergence.
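
A sketch of Fletcher-Reeves CG with exact line search on a 2-D quadratic f(x) = ½xᵀAx - bᵀx (A and b are my own toy data). For a quadratic, exact line search has the closed form αₖ = -gₖᵀdₖ/(dₖᵀAdₖ), and the iteration terminates in n = 2 steps:

```python
# Fletcher-Reeves CG on f(x) = 1/2 x^T A x - b^T x; grad = A x - b.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]                    # exact solution of A x = b: (1/11, 7/11)

def matvec(M, v):
    return [M[0][0]*v[0] + M[0][1]*v[1], M[1][0]*v[0] + M[1][1]*v[1]]

def dot(u, v):
    return u[0]*v[0] + u[1]*v[1]

def cg(x, steps=2):
    g = [gi - bi for gi, bi in zip(matvec(A, x), b)]
    d = [-gi for gi in g]
    for _ in range(steps):
        Ad = matvec(A, d)
        alpha = -dot(g, d) / dot(d, Ad)            # exact line search
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g_new = [gi - bi for gi, bi in zip(matvec(A, x), b)]
        beta = dot(g_new, g_new) / dot(g, g)       # Fletcher-Reeves
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        g = g_new
    return x

x_star = cg([0.0, 0.0])           # exact after n = 2 steps
```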

Line Search Methods

Given direction dₖ, choose step size αₖ:

Armijo Condition (Sufficient Decrease)

f(xₖ + αdₖ) ≤ f(xₖ) + c₁ α ∇fₖᵀdₖ

Typical c₁ = 10⁻⁴. Ensures the step decreases f by at least a fraction c₁ of the decrease predicted by the first-order model.

Wolfe Conditions

Armijo + curvature condition:

∇f(xₖ + αdₖ)ᵀdₖ ≥ c₂ ∇fₖᵀdₖ

Typical c₂ = 0.9 (for quasi-Newton) or c₂ = 0.1 (for CG). Prevents too-small steps.

Strong Wolfe: Replace the curvature condition with |∇f(xₖ + αdₖ)ᵀdₖ| ≤ c₂ |∇fₖᵀdₖ|.

Trust Region Methods

Instead of choosing a direction then a step size, trust region methods solve:

minimize mₖ(d) = f(xₖ) + ∇fₖᵀd + ½ dᵀBₖd
subject to ‖d‖ ≤ Δₖ

where Δₖ is the trust region radius.

If the model predicts well → expand Δₖ. If poorly → shrink Δₖ.

Advantages over line search: More robust. Natural handling of negative curvature. No step size selection.

Solving the subproblem: Steihaug-CG (truncated CG), dogleg method, exact solution via secular equation.
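
A minimal sketch of the trust-region loop using only the Cauchy point (the model minimizer along -∇f, clipped to the radius) as the subproblem solver, on a toy quadratic of my own choosing. For simplicity it expands whenever ρ > 0.75, rather than only when the step hits the boundary:

```python
import math

# Toy problem: f(x, y) = (x - 1)^2 + 10*(y + 2)^2, exact Hessian diag(2, 20).
def f(p):
    return (p[0] - 1) ** 2 + 10 * (p[1] + 2) ** 2

def grad(p):
    return [2 * (p[0] - 1), 20 * (p[1] + 2)]

B = [2.0, 20.0]                   # diagonal of the model Hessian

def cauchy_point(g, delta):
    """Minimize the quadratic model along -g, subject to ||d|| <= delta."""
    gBg = B[0] * g[0] ** 2 + B[1] * g[1] ** 2
    gnorm = math.hypot(g[0], g[1])
    tau = min(1.0, gnorm ** 3 / (delta * gBg))
    return [-tau * delta / gnorm * gi for gi in g]

def trust_region(p, delta=1.0, iters=100):
    for _ in range(iters):
        g = grad(p)
        if math.hypot(g[0], g[1]) < 1e-10:
            break
        d = cauchy_point(g, delta)
        pred = -(g[0]*d[0] + g[1]*d[1]) - 0.5 * (B[0]*d[0]**2 + B[1]*d[1]**2)
        trial = [p[0] + d[0], p[1] + d[1]]
        rho = (f(p) - f(trial)) / pred     # actual vs predicted reduction
        if rho < 0.25:
            delta *= 0.25                  # model was poor: shrink
        elif rho > 0.75:
            delta *= 2.0                   # model was good: expand
        if rho > 0.1:
            p = trial                      # accept the step
    return p

p_star = trust_region([0.0, 0.0])          # converges to (1, -2)
```

Since the model here is exactly the objective, ρ = 1 on every iteration, so every step is accepted and the radius grows until the Cauchy point becomes an interior (exact line search) step.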

Constrained Optimization

Penalty Methods

Replace constrained problem with unconstrained:

minimize f(x) + μ Σ max(0, gᵢ(x))² + μ Σ hⱼ(x)²

Increase μ → solution approaches feasible region. Problem: ill-conditioning as μ → ∞.
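
A tiny worked example of my own: minimize x² subject to x ≥ 1, i.e. g(x) = 1 - x ≤ 0. The penalized objective has a closed-form minimizer, which makes the μ → ∞ behavior explicit:

```python
# Quadratic penalty for: minimize x^2 subject to x >= 1.
# Penalized objective: x^2 + mu * max(0, 1 - x)^2.
# For x < 1, setting the derivative 2x - 2*mu*(1 - x) to zero gives
# the closed-form minimizer x(mu) = mu / (1 + mu).

def penalty_minimizer(mu):
    return mu / (1.0 + mu)

# The minimizer approaches the true solution x* = 1 only as mu -> infinity,
# and is always slightly infeasible (x < 1) -- typical of penalty methods.
for mu in [1.0, 10.0, 100.0, 1000.0]:
    print(mu, penalty_minimizer(mu))
```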

Augmented Lagrangian

Combines penalty with Lagrange multiplier estimates:

L_A(x, λ, μ) = f(x) + Σ λⱼhⱼ(x) + (μ/2) Σ hⱼ(x)²

Alternately minimize over x (with λ, μ fixed) and update λⱼ ← λⱼ + μ hⱼ(x). Inequality constraints are handled via slack variables or a shifted max-form penalty. Avoids the ill-conditioning of pure penalty methods, since μ need not go to infinity.
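
A worked sketch on an equality-constrained toy problem of my own (minimize x² + y² subject to x + y = 1, whose solution is x = y = ½ with multiplier λ = -1). By symmetry the inner minimization has a closed form, so only the multiplier update remains:

```python
# Augmented Lagrangian for: minimize x^2 + y^2 subject to h = x + y - 1 = 0.
# By symmetry the minimizer of L_A has x = y, so the inner problem is 1-D:
#   d/dx [2x^2 + lam*(2x - 1) + (mu/2)*(2x - 1)^2] = 0
#   =>  2x + lam + mu*(2x - 1) = 0  =>  x = (mu - lam) / (2 + 2*mu)

def augmented_lagrangian(mu=10.0, iters=20):
    lam = 0.0
    for _ in range(iters):
        x = (mu - lam) / (2.0 + 2.0 * mu)   # inner minimization (closed form)
        lam += mu * (2.0 * x - 1.0)         # multiplier update: lam += mu*h(x)
    return x, lam

x, lam = augmented_lagrangian()   # -> x = 0.5, lam = -1
# KKT check at the limit: grad f + lam * grad h = (1, 1) + (-1)*(1, 1) = 0.
```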

Sequential Quadratic Programming (SQP)

At each iteration, solve a quadratic programming subproblem:

minimize ∇fₖᵀd + ½ dᵀHₖd
subject to ∇gᵢ(xₖ)ᵀd + gᵢ(xₖ) ≤ 0
           ∇hⱼ(xₖ)ᵀd + hⱼ(xₖ) = 0

Properties: Superlinear convergence near the solution. The gold standard for smooth nonlinear constrained optimization.
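
A sketch of a single SQP step on a toy problem of my own (minimize x² + y² subject to x + y = 1). Since the objective is quadratic and the constraint linear, the QP subproblem is the original problem, so one step reaches the solution from any starting point; the KKT system for the step d and multiplier λ is solved by substitution:

```python
# One SQP step for: minimize x^2 + y^2 subject to h(x, y) = x + y - 1 = 0.
# KKT system at the current point:  H d + lam * a = -g,  a^T d = -h,
# with exact Hessian H = 2I and constraint gradient a = (1, 1).

def sqp_step(p):
    x, y = p
    g = [2.0 * x, 2.0 * y]                 # objective gradient
    h = x + y - 1.0                        # constraint residual
    # From H = 2I: d = -(g + lam*a)/2. Substituting into a^T d = -h:
    #   (g[0] + g[1] + 2*lam)/2 = h  =>  lam = h - (g[0] + g[1])/2
    lam = h - (g[0] + g[1]) / 2.0
    d = [-(g[0] + lam) / 2.0, -(g[1] + lam) / 2.0]
    return [x + d[0], y + d[1]], lam

p_new, lam = sqp_step([3.0, -2.0])   # -> [0.5, 0.5], lam = -1 from any start
```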

Interior Point Methods

Approach the boundary of the feasible region gradually using barrier functions:

minimize f(x) - μ Σ ln(-gᵢ(x))

As μ → 0, the solution approaches the constrained optimum.

Modern interior point methods use Newton steps on the KKT system. Polynomial-time for linear and convex programs.
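
A worked barrier example of my own: minimize (x - 2)² subject to x ≥ 3 (so g(x) = 3 - x ≤ 0 and the constraint is active at the optimum x* = 3). The barrier subproblem has a closed-form minimizer, so the central path can be written out explicitly:

```python
import math

# Barrier problem: minimize (x - 2)^2 - mu * ln(x - 3) over x > 3.
# Stationarity: 2*(x - 2) - mu/(x - 3) = 0  =>  2*(x - 2)*(x - 3) = mu,
# whose root with x > 3 is x(mu) = (5 + sqrt(1 + 2*mu)) / 2.

def barrier_minimizer(mu):
    return (5.0 + math.sqrt(1.0 + 2.0 * mu)) / 2.0

# As mu -> 0, the central path x(mu) approaches the constrained optimum x* = 3
# from the strict interior of the feasible region.
central_path = [barrier_minimizer(mu) for mu in [1.0, 0.1, 0.01, 0.001]]
```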

Global Optimization

Local methods find local minima. Global optimization seeks the global minimum.

Deterministic

  • Branch and bound: Partition the search space, compute bounds, prune subregions.
  • Lipschitz optimization: Use Lipschitz constant to bound the function.

Stochastic

  • Simulated annealing: Accept uphill moves with decreasing probability. Converges in probability to the global optimum, but only under impractically slow (logarithmic) cooling schedules.
  • Basin-hopping: Repeatedly perturb the current point, run a local optimization, and accept or reject the result. The simpler multi-start variant runs local optimization from random starting points and keeps the best.
  • Genetic algorithms / evolutionary strategies: Population-based. CMA-ES for continuous optimization.
  • Particle swarm optimization: Swarm-based search.

Global optimization is generally NP-hard. No guarantee of finding the global optimum for non-convex problems.
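
A sketch of the simplest stochastic strategy, multi-start local search, on a 1-D function of my own with two basins (global minimum near x ≈ -1.04). The local solver is plain fixed-step gradient descent, and the random seed is fixed so the run is reproducible:

```python
import random

# f has a shallow local minimum near x = +1 and the global minimum near x = -1.04.
def f(x):
    return (x * x - 1.0) ** 2 + 0.3 * x

def fprime(x):
    return 4.0 * x * (x * x - 1.0) + 0.3

def local_descent(x, step=0.01, iters=2000):
    """Fixed-step gradient descent: converges to the minimum of whichever
    basin the starting point falls into."""
    for _ in range(iters):
        x -= step * fprime(x)
    return x

def multi_start(n_starts=20, seed=0):
    rng = random.Random(seed)            # fixed seed for reproducibility
    best = None
    for _ in range(n_starts):
        x = local_descent(rng.uniform(-2.0, 2.0))
        if best is None or f(x) < f(best):
            best = x
    return best

x_best = multi_start()                   # lands in the left (global) basin
```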

Convergence Rate Summary

| Method | Rate | Cost per iteration |
|---|---|---|
| Gradient descent | Linear: (κ-1)/(κ+1) | O(n) |
| Conjugate gradient | Superlinear (exact in n steps on quadratics) | O(n) |
| L-BFGS | Superlinear | O(mn) |
| BFGS | Superlinear | O(n²) |
| Newton | Quadratic | O(n³) |
| Truncated Newton | Superlinear | O(n · CG_iters) |

Applications in CS

  • Machine learning: SGD, Adam, L-BFGS for training neural networks. SQP for constrained ML. Bayesian optimization for hyperparameters.
  • Computer vision: Bundle adjustment (Levenberg-Marquardt) for 3D reconstruction. Optical flow. Image registration.
  • Robotics: Trajectory optimization (SQP, iLQR). Inverse kinematics.
  • Operations research: Supply chain optimization, scheduling, resource allocation.
  • Signal processing: Filter design (minimax optimization). Compressed sensing (L1 minimization).
  • Finance: Portfolio optimization (Markowitz). Risk management (CVaR optimization).
  • Engineering design: Shape optimization (aerodynamics, structures). Topology optimization.
  • Compiler optimization: Register allocation, instruction scheduling (combinatorial optimization).