Numerical Optimization
Numerical optimization finds the minimum (or maximum) of a function. It is the computational engine behind machine learning, engineering design, and operations research.
Problem Formulation
Unconstrained
minimize f(x) where x ∈ ℝⁿ
Constrained
minimize f(x)
subject to gᵢ(x) ≤ 0 (inequality constraints)
hⱼ(x) = 0 (equality constraints)
Necessary Conditions (Unconstrained)
First-order necessary: ∇f(x*) = 0 (critical point). Second-order necessary: ∇²f(x*) is positive semidefinite. Sufficient: ∇f(x*) = 0 together with ∇²f(x*) positive definite guarantees a strict local minimum.
Unconstrained Optimization
Steepest Descent (Gradient Descent)
x_{k+1} = xₖ - αₖ ∇f(xₖ)
where αₖ is the step size (learning rate).
Properties:
- Direction: -∇f (steepest descent in ℓ² norm)
- Convergence: Linear. The contraction factor depends on the condition number κ of the Hessian: (κ-1)/(κ+1) per step (for quadratics with exact line search).
- Zigzags on ill-conditioned problems (elongated contours)
- Simple but slow for difficult problems
Step size selection:
- Fixed: α = constant. Must be small enough (α < 2/L for L-smooth functions).
- Exact line search: αₖ = argmin_α f(xₖ - α∇f(xₖ)). Impractical but theoretically useful.
- Backtracking (Armijo): Start with large α, halve until sufficient decrease condition holds.
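The backtracking variant above can be sketched in a few lines of pure Python. This is a minimal illustration, not a library routine: the function name, the toy quadratic, and the halving factor of 0.5 are all illustrative choices.

```python
import math

def grad_descent_backtracking(f, grad, x0, c1=1e-4, tol=1e-8, max_iter=10_000):
    """Steepest descent with Armijo backtracking: start at alpha = 1,
    halve until the sufficient-decrease condition holds."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        gnorm2 = sum(gi * gi for gi in g)
        if math.sqrt(gnorm2) < tol:
            break
        fx, alpha = f(x), 1.0
        # Armijo condition along -g: f(x - a*g) <= f(x) - c1 * a * ||g||^2
        while f([xi - alpha * gi for xi, gi in zip(x, g)]) > fx - c1 * alpha * gnorm2:
            alpha *= 0.5
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

# Ill-conditioned quadratic f(x, y) = (x^2 + 10*y^2)/2; minimum at the origin.
quad = lambda v: 0.5 * (v[0] ** 2 + 10.0 * v[1] ** 2)
quad_grad = lambda v: [v[0], 10.0 * v[1]]
xmin = grad_descent_backtracking(quad, quad_grad, [5.0, 1.0])
```

On this κ = 10 problem the iterates zigzag along the narrow valley, illustrating the slow linear rate noted above.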
Newton's Method
Uses second-order (curvature) information:
x_{k+1} = xₖ - [∇²f(xₖ)]⁻¹ ∇f(xₖ)
Properties:
- Quadratic convergence near a minimum (if Hessian is positive definite)
- Affine-invariant: unaffected by linear changes of coordinates (unlike gradient descent)
- Each step requires computing and inverting the Hessian: O(n³)
- May diverge if started far from the minimum
- May converge to saddle points or maxima
Modified Newton: Replace H with H + μI if H is not positive definite (Levenberg-Marquardt-like damping).
In practice: Solve Hₖ dₖ = -∇f(xₖ) for the direction dₖ instead of inverting H.
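A minimal 2-D sketch of the damped variant, following the two practical notes above (solve for the direction, add μI when needed). The `newton_2d` name, the damping schedule, and the convex test function are illustrative assumptions, not a standard API.

```python
import math

def newton_2d(grad, hess, x0, tol=1e-10, max_iter=100):
    """Damped Newton in 2-D: solve (H + mu*I) d = -g, with mu > 0 only
    when H fails to be positive definite (Levenberg-style damping)."""
    x, y = x0
    for _ in range(max_iter):
        gx, gy = grad(x, y)
        if math.hypot(gx, gy) < tol:
            break
        a, b, c, d = hess(x, y)                # row-major [[a, b], [c, d]]
        mu = 0.0
        while a + mu <= 0 or (a + mu) * (d + mu) - b * c <= 0:
            mu = 1e-3 if mu == 0 else 10 * mu  # grow damping until PD
        a, d = a + mu, d + mu
        det = a * d - b * c
        # Solve the 2x2 system for the Newton direction (Cramer's rule).
        x += (-gx * d + gy * b) / det
        y += (-gy * a + gx * c) / det
    return x, y

# Strictly convex test function f(x, y) = exp(x + y) + x^2 + y^2.
def g(x, y):
    e = math.exp(x + y)
    return (e + 2 * x, e + 2 * y)

def h(x, y):
    e = math.exp(x + y)
    return (e + 2, e, e, e + 2)

xs, ys = newton_2d(g, h, (1.0, 1.0))
```

On this function the Hessian is always positive definite, so the damping loop never fires; it exists to cover the indefinite case mentioned above.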
Quasi-Newton Methods
Approximate the Hessian (or its inverse) using gradient information accumulated over iterations.
BFGS
The most popular quasi-Newton method. Maintains an approximation Bₖ ≈ ∇²f(xₖ):
sₖ = xₖ₊₁ - xₖ
yₖ = ∇f(xₖ₊₁) - ∇f(xₖ)
Bₖ₊₁ = Bₖ + yₖyₖᵀ/(yₖᵀsₖ) - (Bₖsₖ)(Bₖsₖ)ᵀ/(sₖᵀBₖsₖ)
Properties:
- Superlinear convergence
- O(n²) per iteration (vs O(n³) for Newton)
- Self-correcting: poor curvature estimates are washed out by later updates, so even the simple initialization B₀ = I works well
- Maintains positive definiteness of B
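The update formula above can be checked numerically. A pure-Python sketch (the `bfgs_update` helper is illustrative) verifying the secant condition Bₖ₊₁sₖ = yₖ that the update is built to enforce:

```python
def bfgs_update(B, s, y):
    """One BFGS update of the Hessian approximation B (n x n nested lists)."""
    n = len(s)
    Bs = [sum(B[i][j] * s[j] for j in range(n)) for i in range(n)]
    ys = sum(yi * si for yi, si in zip(y, s))     # y^T s, must be > 0
    sBs = sum(si * bi for si, bi in zip(s, Bs))   # s^T B s
    return [[B[i][j] + y[i] * y[j] / ys - Bs[i] * Bs[j] / sBs
             for j in range(n)] for i in range(n)]

# The update enforces the secant condition B_{k+1} s_k = y_k:
B1 = bfgs_update([[1.0, 0.0], [0.0, 1.0]], [1.0, 2.0], [3.0, 1.0])
B1s = [sum(B1[i][j] * [1.0, 2.0][j] for j in range(2)) for i in range(2)]
# B1s recovers y = [3.0, 1.0] (up to rounding)
```

Note the requirement yᵀs > 0: it is exactly what a Wolfe line search guarantees, and it is what keeps B positive definite.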
L-BFGS (Limited-memory BFGS)
Stores only the last m (typically 5-20) pairs (sₖ, yₖ) instead of the full n × n matrix. Computes H⁻¹∇f using a two-loop recursion.
Memory: O(mn) instead of O(n²). Per iteration: O(mn).
The method of choice for large-scale smooth unconstrained optimization. Used in many ML libraries.
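The two-loop recursion can be sketched directly in pure Python. Names are illustrative, and a production implementation would also guard against pairs with yᵀs ≤ 0.

```python
def lbfgs_direction(g, s_list, y_list):
    """Two-loop recursion: apply the implicit inverse-Hessian approximation
    built from the stored (s, y) pairs (oldest first) to the vector g."""
    rhos = [1.0 / sum(yi * si for yi, si in zip(y, s))
            for s, y in zip(s_list, y_list)]
    q = list(g)
    alphas = []
    # First loop: newest pair to oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * sum(si * qi for si, qi in zip(s, q))
        alphas.append(a)
        q = [qi - a * yi for qi, yi in zip(q, y)]
    # Initial scaling H0 = gamma*I with gamma = s^T y / y^T y (newest pair).
    s, y = s_list[-1], y_list[-1]
    gamma = sum(si * yi for si, yi in zip(s, y)) / sum(yi * yi for yi in y)
    r = [gamma * qi for qi in q]
    # Second loop: oldest pair to newest.
    for s, y, rho, a in zip(s_list, y_list, rhos, reversed(alphas)):
        b = rho * sum(yi * ri for yi, ri in zip(y, r))
        r = [ri + (a - b) * si for ri, si in zip(r, s)]
    return r   # approximates H^{-1} g; the search direction is -r

# Sanity check: with one stored pair, the secant condition gives H^{-1} y = s.
r = lbfgs_direction([3.0, 1.0], [[1.0, 2.0]], [[3.0, 1.0]])   # r == [1.0, 2.0]
```

Each loop costs O(mn), matching the complexity stated above.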
Conjugate Gradient (for Optimization)
Nonlinear CG extends the linear CG method to general functions:
dₖ₊₁ = -∇f(xₖ₊₁) + βₖ dₖ
Choices for β:
- Fletcher-Reeves: βₖ = ‖∇fₖ₊₁‖²/‖∇fₖ‖²
- Polak-Ribière: βₖ = ∇fₖ₊₁ᵀ(∇fₖ₊₁ - ∇fₖ)/‖∇fₖ‖²
Properties: O(n) memory. Terminates in at most n steps on quadratics (with exact line search); on general smooth functions the convergence is typically fast linear, approaching superlinear with periodic restarts.
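A Polak-Ribière sketch with a simple Armijo backtracking line search and the common "PR+" safeguard (clipping β at zero, restarting with steepest descent if the direction is not a descent direction). All names and constants here are illustrative choices.

```python
def nonlinear_cg(f, fgrad, x0, tol=1e-8, max_iter=1000):
    """Nonlinear CG with Polak-Ribiere beta, clipped at zero (PR+),
    and Armijo backtracking along each direction."""
    x = list(x0)
    g = fgrad(x)
    d = [-gi for gi in g]
    for _ in range(max_iter):
        gg = sum(gi * gi for gi in g)
        if gg ** 0.5 < tol:
            break
        slope = sum(gi * di for gi, di in zip(g, d))
        if slope >= 0:                        # safeguard: restart with -g
            d, slope = [-gi for gi in g], -gg
        fx, alpha = f(x), 1.0
        while f([xi + alpha * di for xi, di in zip(x, d)]) > fx + 1e-4 * alpha * slope:
            alpha *= 0.5
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g_new = fgrad(x)
        # Polak-Ribiere beta; max(0, .) acts as an automatic restart.
        beta = max(0.0, sum(gn * (gn - go) for gn, go in zip(g_new, g)) / gg)
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        g = g_new
    return x

quad = lambda v: 0.5 * (v[0] ** 2 + 10.0 * v[1] ** 2)
quad_grad = lambda v: [v[0], 10.0 * v[1]]
xmin = nonlinear_cg(quad, quad_grad, [5.0, 1.0])
```

In practice nonlinear CG is usually paired with a strong Wolfe line search (next section), which guarantees descent directions for Fletcher-Reeves; the backtracking here keeps the sketch short.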
Line Search Methods
Given direction dₖ, choose step size αₖ:
Armijo Condition (Sufficient Decrease)
f(xₖ + αdₖ) ≤ f(xₖ) + c₁ α ∇fₖᵀdₖ
Typical c₁ = 10⁻⁴. Ensures the step achieves at least a c₁ fraction of the decrease predicted by the linear model.
Wolfe Conditions
Armijo + curvature condition:
∇f(xₖ + αdₖ)ᵀdₖ ≥ c₂ ∇fₖᵀdₖ
Typical c₂ = 0.9 (for quasi-Newton) or c₂ = 0.1 (for CG). Prevents too-small steps.
Strong Wolfe: Replace the curvature condition with |∇f(xₖ + αdₖ)ᵀdₖ| ≤ c₂ |∇fₖᵀdₖ|, which additionally rules out points where the directional derivative is large and positive.
Trust Region Methods
Instead of choosing a direction then a step size, trust region methods solve:
minimize mₖ(d) = f(xₖ) + ∇fₖᵀd + ½ dᵀBₖd
subject to ‖d‖ ≤ Δₖ
where Δₖ is the trust region radius.
If the model predicts well → expand Δₖ. If poorly → shrink Δₖ.
Advantages over line search: More robust. Natural handling of negative curvature. No step size selection.
Solving the subproblem: Steihaug-CG (truncated CG), dogleg method, exact solution via secular equation.
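The accept/expand/shrink logic can be sketched with the cheapest of those subproblem solvers, the Cauchy point (the model minimizer along -∇f, clipped to the radius). The function names and the 0.25/0.75/0.1 thresholds are common but arbitrary choices, not canonical values.

```python
def trust_region_cauchy(f, grad, hess_vec, x0, delta0=1.0, tol=1e-8, max_iter=500):
    """Trust-region sketch: the subproblem is solved only approximately,
    via the Cauchy point (model minimizer along -g, clipped to the radius)."""
    x, delta = list(x0), delta0
    for _ in range(max_iter):
        g = grad(x)
        gnorm = sum(gi * gi for gi in g) ** 0.5
        if gnorm < tol:
            break
        Bg = hess_vec(x, g)
        gBg = sum(gi * bi for gi, bi in zip(g, Bg))
        # Cauchy step length along -g, clipped to the trust region.
        tau = delta / gnorm if gBg <= 0 else min(gnorm ** 2 / gBg, delta / gnorm)
        d = [-tau * gi for gi in g]
        pred = tau * gnorm ** 2 - 0.5 * tau * tau * gBg   # predicted decrease
        x_new = [xi + di for xi, di in zip(x, d)]
        rho = (f(x) - f(x_new)) / pred                    # actual / predicted
        if rho > 0.75:
            delta *= 2.0          # model predicts well: expand
        elif rho < 0.25:
            delta *= 0.25         # model predicts poorly: shrink
        if rho > 0.1:
            x = x_new             # accept only if real decrease was achieved
    return x

quad = lambda v: 0.5 * (v[0] ** 2 + 10.0 * v[1] ** 2)
quad_grad = lambda v: [v[0], 10.0 * v[1]]
quad_hv = lambda v, w: [w[0], 10.0 * w[1]]
xmin = trust_region_cauchy(quad, quad_grad, quad_hv, [5.0, 1.0])
```

Note that negative curvature (gᵀBg ≤ 0) is handled naturally: the step simply runs to the boundary of the region.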
Constrained Optimization
Penalty Methods
Replace constrained problem with unconstrained:
minimize f(x) + μ Σ max(0, gᵢ(x))² + μ Σ hⱼ(x)²
Increase μ → solution approaches feasible region. Problem: ill-conditioning as μ → ∞.
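A sketch on a toy equality-constrained problem (the problem, the gradient-descent inner solver, and the μ schedule are all illustrative assumptions):

```python
def penalized_min(mu, x, y, iters=200):
    """Gradient descent on Q(x, y) = x^2 + y^2 + mu*(x + y - 1)^2."""
    lr = 1.0 / (2.0 + 4.0 * mu)          # 1/L; L = largest Hessian eigenvalue
    for _ in range(iters):
        r = x + y - 1.0
        x, y = x - lr * (2 * x + 2 * mu * r), y - lr * (2 * y + 2 * mu * r)
    return x, y

# min x^2 + y^2  s.t.  x + y = 1; exact solution (0.5, 0.5).
x, y = 0.0, 0.0
for mu in (1.0, 10.0, 100.0, 1000.0):    # increase mu, warm-starting each solve
    x, y = penalized_min(mu, x, y)
```

The trade-off described above is visible here: the penalized Hessian has condition number 1 + 2μ, so larger μ gives better feasibility but a harder inner problem.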
Augmented Lagrangian
Combines penalty with Lagrange multiplier estimates:
L_A(x, λ, μ) = f(x) + Σ λⱼ hⱼ(x) + (μ/2) Σ hⱼ(x)²   (shown for equality constraints hⱼ(x) = 0; inequalities are handled via slack variables or a clipped update)
Alternately minimize over x (with fixed λ, μ) and update λ. Avoids ill-conditioning.
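On the same toy problem as the penalty method, the alternation looks like this (a sketch under illustrative choices of μ, round counts, and inner solver):

```python
def aug_lagrangian(x, y, lam=0.0, mu=10.0, rounds=6, inner_iters=100):
    """Augmented Lagrangian for: min x^2 + y^2  s.t.  h = x + y - 1 = 0.
    Inner loop: gradient descent on L_A; outer loop: lam <- lam + mu*h."""
    lr = 1.0 / (2.0 + 2.0 * mu)          # 1/L for the inner quadratic
    for _ in range(rounds):
        for _ in range(inner_iters):
            h = x + y - 1.0
            x, y = x - lr * (2 * x + lam + mu * h), y - lr * (2 * y + lam + mu * h)
        lam += mu * (x + y - 1.0)        # multiplier update
    return x, y, lam

x, y, lam = aug_lagrangian(0.0, 0.0)     # exact: (0.5, 0.5) with lambda* = -1
```

Unlike the pure penalty method, μ stays fixed at a moderate value; the multiplier estimate, not ill-conditioning, drives the iterates to feasibility.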
Sequential Quadratic Programming (SQP)
At each iteration, solve a quadratic programming subproblem:
minimize ∇fₖᵀd + ½ dᵀHₖd
subject to ∇gᵢ(xₖ)ᵀd + gᵢ(xₖ) ≤ 0
∇hⱼ(xₖ)ᵀd + hⱼ(xₖ) = 0
Properties: Superlinear convergence near the solution. The gold standard for smooth nonlinear constrained optimization.
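For equality constraints the QP subproblem reduces to one linear (KKT) solve per iteration. A sketch on a toy problem (the problem, the 3x3 solver, and all names are illustrative; real SQP codes add a line search or trust region plus a merit function):

```python
def solve3(M, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    A = [row[:] + [b] for row, b in zip(M, rhs)]
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(c + 1, 3):
            t = A[r][c] / A[c][c]
            A[r] = [ar - t * ac for ar, ac in zip(A[r], A[c])]
    out = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        out[r] = (A[r][3] - sum(A[r][k] * out[k] for k in range(r + 1, 3))) / A[r][r]
    return out

def sqp(x, y, lam=0.0, iters=20):
    """SQP for: min (x-2)^2 + (y-1)^2  s.t.  h = x^2 - y = 0.
    Each iteration solves the KKT system of the QP subproblem."""
    for _ in range(iters):
        gx, gy = 2 * (x - 2), 2 * (y - 1)      # gradient of f
        h, ax, ay = x * x - y, 2 * x, -1.0     # h and its gradient
        w = 2 + 2 * lam                        # Lagrangian Hessian = diag(w, 2)
        dx, dy, lam = solve3([[w, 0.0, ax],
                              [0.0, 2.0, ay],
                              [ax, ay, 0.0]], [-gx, -gy, -h])
        x, y = x + dx, y + dy
    return x, y, lam

xs, ys, lam = sqp(1.0, 1.0)
```

At the solution, eliminating λ from the KKT conditions gives 2x³ - x - 2 = 0, which the test below checks.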
Interior Point Methods
Approach the boundary of the feasible region gradually using barrier functions:
minimize f(x) - μ Σ ln(-gᵢ(x))
As μ → 0, the solution approaches the constrained optimum.
Modern interior point methods use Newton steps on the KKT system. Polynomial-time for linear and convex programs.
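A 1-D sketch of the barrier idea (toy problem, illustrative schedule for μ, damped Newton as the inner solver):

```python
def barrier_method(x=2.0, mu=1.0, tol=1e-8):
    """Log-barrier method for: min x^2  s.t.  x >= 1  (g(x) = 1 - x <= 0).
    Inner Newton minimizes phi(x) = x^2 - mu*ln(x - 1); then mu is shrunk."""
    while mu > 1e-10:
        for _ in range(50):                    # inner Newton iterations
            d1 = 2 * x - mu / (x - 1.0)        # phi'(x)
            d2 = 2 + mu / (x - 1.0) ** 2       # phi''(x) > 0 (convex)
            if abs(d1) < tol:
                break
            step = d1 / d2
            while x - step <= 1.0:             # keep iterate strictly feasible
                step *= 0.5
            x -= step
        mu *= 0.1                              # shrink the barrier parameter
    return x

x_opt = barrier_method()    # the constrained minimum is x = 1
```

The solution of each barrier subproblem (here x = (1 + √(1 + 2μ))/2) traces the central path, approaching the constrained optimum x = 1 as μ → 0.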
Global Optimization
Local methods find local minima. Global optimization seeks the global minimum.
Deterministic
- Branch and bound: Partition the search space, compute bounds, prune subregions.
- Lipschitz optimization: Use Lipschitz constant to bound the function.
Stochastic
- Simulated annealing: Accept uphill moves with decreasing probability. Converges to global optimum (in theory, with cooling schedule).
- Basin-hopping: Local optimization from random starting points. Keep the best.
- Genetic algorithms / evolutionary strategies: Population-based. CMA-ES for continuous optimization.
- Particle swarm optimization: Swarm-based search.
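A minimal 1-D simulated-annealing sketch; the proposal width, cooling rate, and double-well test function are arbitrary illustrative choices:

```python
import math, random

def simulated_annealing(f, x0, t0=1.0, cooling=0.995, steps=5000, seed=0):
    """1-D simulated annealing: Gaussian proposals, Metropolis acceptance
    with probability exp(-delta/T) for uphill moves, geometric cooling."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    t = t0
    for _ in range(steps):
        x_new = x + rng.gauss(0.0, 0.5)       # random neighbor
        delta = f(x_new) - fx
        if delta < 0 or rng.random() < math.exp(-delta / t):
            x, fx = x_new, fx + delta         # accept (possibly uphill) move
        if fx < fbest:
            best, fbest = x, fx               # track the best point seen
        t *= cooling                          # cooling schedule
    return best, fbest

# Double-well f(x) = (x^2 - 1)^2 with global minima at x = +/- 1.
dwell = lambda x: (x * x - 1.0) ** 2
best, fbest = simulated_annealing(dwell, 3.0)
```

Early on, high temperature lets the chain hop between wells; as T shrinks, it behaves like greedy local search around whichever minimum it occupies.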
Global optimization is generally NP-hard. No guarantee of finding the global optimum for non-convex problems.
Convergence Rate Summary
| Method | Rate | Cost per iteration |
|---|---|---|
| Gradient descent | Linear: (κ-1)/(κ+1) | O(n) |
| Conjugate gradient | Superlinear (exact in n steps on quadratics) | O(n) |
| L-BFGS | Superlinear | O(mn) |
| BFGS | Superlinear | O(n²) |
| Newton | Quadratic | O(n³) |
| Truncated Newton | Superlinear | O(n · CG_iters) |
Applications in CS
- Machine learning: SGD, Adam, L-BFGS for training neural networks. SQP for constrained ML. Bayesian optimization for hyperparameters.
- Computer vision: Bundle adjustment (Levenberg-Marquardt) for 3D reconstruction. Optical flow. Image registration.
- Robotics: Trajectory optimization (SQP, iLQR). Inverse kinematics.
- Operations research: Supply chain optimization, scheduling, resource allocation.
- Signal processing: Filter design (minimax optimization). Compressed sensing (L1 minimization).
- Finance: Portfolio optimization (Markowitz). Risk management (CVaR optimization).
- Engineering design: Shape optimization (aerodynamics, structures). Topology optimization.
- Compiler optimization: Register allocation, instruction scheduling (combinatorial optimization).