Nonlinear Programming
Problem Setting
minimize f(x)
subject to g_i(x) <= 0, i = 1,...,m
h_j(x) = 0, j = 1,...,p
where f, g_i, h_j are generally nonlinear and possibly nonconvex. Unlike convex optimization, local optima may not be global, and algorithms typically guarantee convergence to stationary points (KKT points).
Unconstrained Optimization
Gradient Descent
x_{k+1} = x_k - alpha_k * grad(f(x_k))
With a suitably small step size alpha_k (e.g., alpha_k = 1/L when grad(f) is L-Lipschitz), gradient descent converges to stationary points. For smooth nonconvex f, it achieves ||grad(f(x_k))|| <= epsilon in O(1/epsilon^2) iterations. See 03-convex-optimization.md for convex convergence rates.
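A minimal sketch of fixed-step gradient descent; the one-dimensional test function and step size below are illustrative choices, and which stationary point is reached depends entirely on the starting point:

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.01, tol=1e-8, max_iter=10000):
    """Fixed-step gradient descent; stops when the gradient norm is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        x = x - alpha * g
    return x

# Nonconvex example: f(x) = x^4 - 3x^2 + x has two local minima.
# Starting at x = 2, the iterates descend into the right-hand basin.
grad_f = lambda x: 4 * x**3 - 6 * x + 1
x_star = gradient_descent(grad_f, x0=np.array([2.0]))
```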
Newton's Method
Uses second-order (curvature) information for faster local convergence:
x_{k+1} = x_k - [H(x_k)]^{-1} * grad(f(x_k))
where H(x_k) is the Hessian. Each iteration solves the linear system H(x_k) d = -grad(f(x_k)).
Properties:
- Quadratic local convergence: ||x_{k+1} - x*|| <= C * ||x_k - x*||^2 near x* when H(x*) is positive definite.
- Affine invariance: the iterates transform consistently under any affine change of variables x -> Ax + b, so performance does not depend on coordinate scaling.
- Cost: O(n^3) per iteration for dense Hessian (or O(n^2) with structure).
- Not globally convergent without modification: may diverge or move to saddle points.
Modified Newton: ensure descent by modifying the Hessian to be positive definite:
- Add a multiple of the identity: H + tau*I with tau chosen so that H + tau*I > 0.
- Use the Cholesky factorization with added diagonal (Gill-Murray modification).
Damped Newton: combine with line search or trust region for global convergence.
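The pieces above can be sketched together: a diagonal shift found by retrying the Cholesky factorization (in the spirit of the Gill-Murray modification), followed by Armijo backtracking for global convergence. The shift schedule and line-search constants are illustrative choices:

```python
import numpy as np

def damped_newton(f, grad, hess, x0, tol=1e-10, max_iter=100):
    """Newton with Hessian modification (H + tau*I) and Armijo backtracking."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        H = hess(x)
        tau = 0.0
        while True:  # increase tau until H + tau*I is positive definite
            try:
                L = np.linalg.cholesky(H + tau * np.eye(len(x)))
                break
            except np.linalg.LinAlgError:
                tau = max(2.0 * tau, 1e-3)
        # Solve (H + tau*I) d = -g using the Cholesky factor L L^T.
        d = np.linalg.solve(L.T, np.linalg.solve(L, -g))
        t = 1.0
        while f(x + t * d) > f(x) + 1e-4 * t * (g @ d):  # Armijo condition
            t *= 0.5
        x = x + t * d
    return x

# Rosenbrock function: nonconvex, with a narrow curved valley.
f = lambda z: (1 - z[0])**2 + 100 * (z[1] - z[0]**2)**2
grad = lambda z: np.array([-2 * (1 - z[0]) - 400 * z[0] * (z[1] - z[0]**2),
                           200 * (z[1] - z[0]**2)])
hess = lambda z: np.array([[2 - 400 * z[1] + 1200 * z[0]**2, -400 * z[0]],
                           [-400 * z[0], 200.0]])
x_star = damped_newton(f, grad, hess, np.array([-1.2, 1.0]))
```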
Trust Region Methods
Instead of choosing a direction then a step size, trust region methods define a region where the quadratic model is trusted:
minimize grad(f(x_k))^T d + (1/2) d^T H_k d
subject to ||d|| <= Delta_k
The ratio rho_k = (actual reduction) / (predicted reduction) determines whether to accept the step and adjust the trust region radius Delta_k:
- rho_k > 0.75: expand Delta_k (model is good)
- rho_k < 0.25: shrink Delta_k (model is poor)
- rho_k < 0: reject the step
Trust regions handle negative curvature naturally and have strong global convergence guarantees.
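The accept/expand/shrink logic can be sketched with the Cauchy point (minimizing the model along -grad(f) within the region) standing in for a full subproblem solver; the thresholds follow the rules above, and the test problem is an illustrative convex quadratic:

```python
import numpy as np

def trust_region(f, grad, hess, x0, delta=1.0, tol=1e-8, max_iter=500):
    """Basic trust-region loop; the subproblem is solved only approximately
    via the Cauchy point, so progress is gradient-descent-like."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        gnorm = np.linalg.norm(g)
        if gnorm <= tol:
            break
        H = hess(x)
        gHg = g @ H @ g
        if gHg <= 0:
            t = delta / gnorm                # negative curvature: go to boundary
        else:
            t = min(gnorm**2 / gHg, delta / gnorm)
        d = -t * g
        pred = -(g @ d + 0.5 * d @ H @ d)    # predicted reduction (> 0)
        rho = (f(x) - f(x + d)) / pred
        if rho > 0.75 and np.isclose(np.linalg.norm(d), delta):
            delta *= 2.0                     # good model, step hit boundary: expand
        elif rho < 0.25:
            delta *= 0.25                    # poor model: shrink
        if rho > 0:                          # accept only if f actually decreased
            x = x + d
    return x

# Ill-conditioned convex quadratic: f(x) = 0.5 * x^T diag(1, 10) x.
A = np.diag([1.0, 10.0])
x_star = trust_region(lambda z: 0.5 * z @ A @ z, lambda z: A @ z, lambda z: A,
                      np.array([1.0, 1.0]))
```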
Quasi-Newton Methods
Approximate the Hessian (or its inverse) using only gradient information, avoiding O(n^3) Hessian computation.
BFGS (Broyden-Fletcher-Goldfarb-Shanno)
Maintain an approximation B_k to the Hessian, updated via rank-2 corrections:
s_k = x_{k+1} - x_k
y_k = grad(f(x_{k+1})) - grad(f(x_k))
B_{k+1} = B_k - (B_k s_k s_k^T B_k)/(s_k^T B_k s_k) + (y_k y_k^T)/(y_k^T s_k)
Or equivalently, maintain H_k = B_k^{-1} directly (inverse BFGS update). BFGS achieves superlinear convergence on smooth, strongly convex functions.
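A sketch of the inverse form: it applies the standard identity H_{k+1} = V_k H_k V_k^T + rho_k s_k s_k^T with V_k = I - rho_k s_k y_k^T and rho_k = 1/(y_k^T s_k), which is algebraically equivalent to inverting the B_k update above. The Armijo backtracking and the curvature safeguard are illustrative simplifications of what library code does:

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-8, max_iter=1000):
    """BFGS maintaining the inverse-Hessian approximation H_k; a sketch,
    where production code would use a Wolfe line search and initial scaling."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    H = np.eye(n)                            # inverse-Hessian approximation
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        d = -H @ g
        t = 1.0
        while f(x + t * d) > f(x) + 1e-4 * t * (g @ d):  # Armijo backtracking
            t *= 0.5
        x_new = x + t * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        sy = s @ y
        if sy > 1e-12:                       # curvature condition y^T s > 0
            rho = 1.0 / sy
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Rosenbrock: BFGS needs only gradients, no Hessian.
f = lambda z: (1 - z[0])**2 + 100 * (z[1] - z[0]**2)**2
grad = lambda z: np.array([-2 * (1 - z[0]) - 400 * z[0] * (z[1] - z[0]**2),
                           200 * (z[1] - z[0]**2)])
x_star = bfgs(f, grad, np.array([-1.2, 1.0]))
```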
L-BFGS (Limited-Memory BFGS)
Store only the last m pairs {s_k, y_k} (typically m = 5-20) instead of the full n x n matrix. Compute H_k * grad(f) via the two-loop recursion in O(mn) time.
L-BFGS is the default choice for large-scale smooth unconstrained/bound-constrained optimization. Memory: O(mn) vs O(n^2) for full BFGS.
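In practice one calls a library implementation such as SciPy's; the extended Rosenbrock function below is a standard test problem used here for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Extended Rosenbrock in n = 100 dimensions: only f and grad(f) are supplied;
# L-BFGS-B builds its curvature model from the stored (s, y) pairs.
def f(x):
    return np.sum(100.0 * (x[1:] - x[:-1]**2)**2 + (1.0 - x[:-1])**2)

def grad(x):
    g = np.zeros_like(x)
    g[:-1] = -400.0 * x[:-1] * (x[1:] - x[:-1]**2) - 2.0 * (1.0 - x[:-1])
    g[1:] += 200.0 * (x[1:] - x[:-1]**2)
    return g

res = minimize(f, np.zeros(100), jac=grad, method="L-BFGS-B",
               options={"maxcor": 10})   # maxcor = number of (s, y) pairs stored
```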
Other Quasi-Newton Methods
- SR1 (Symmetric Rank-1): can capture negative curvature, useful in trust region frameworks.
- Broyden family: interpolation between BFGS and DFP updates.
Constrained Optimization Methods
Penalty Methods
Replace the constrained problem with unconstrained subproblems:
Quadratic penalty:
minimize f(x) + (rho/2) * sum_i max(0, g_i(x))^2 + (rho/2) * sum_j h_j(x)^2
Increase rho iteratively. As rho -> infinity, solutions approach feasibility. Issue: ill-conditioning as rho grows, making subproblems hard to solve.
Exact penalty (L1):
minimize f(x) + rho * sum_i max(0, g_i(x)) + rho * sum_j |h_j(x)|
For sufficiently large rho, the exact penalty has the same minimizers as the original problem. Nonsmoothness requires specialized solvers.
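A sketch of the quadratic-penalty loop on a toy equality-constrained problem; the starting point (chosen in the correct basin, since penalty subproblems are themselves nonconvex here), rho schedule, and inner solver are all illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

# Quadratic penalty for: minimize x1 + x2  subject to  x1^2 + x2^2 = 1,
# whose solution is x = (-1/sqrt(2), -1/sqrt(2)).
def h(x):
    return x[0]**2 + x[1]**2 - 1.0

x, rho = np.array([-0.5, -0.5]), 1.0
for _ in range(6):
    penalized = lambda z, r=rho: z[0] + z[1] + 0.5 * r * h(z)**2
    x = minimize(penalized, x, method="BFGS").x   # warm-start each subproblem
    rho *= 10.0
# The constraint violation shrinks like O(1/rho) but never reaches zero.
```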
Augmented Lagrangian Method (ALM)
Combines Lagrangian with a quadratic penalty:
L_rho(x, lambda, nu) = f(x) + (rho/2) sum_i [ max(0, g_i(x) + lambda_i/rho)^2 - (lambda_i/rho)^2 ]
+ sum_j nu_j h_j(x) + (rho/2) sum_j h_j(x)^2
(The inequality term already contains the multiplier via the shift lambda_i/rho; for equality constraints the linear multiplier term and quadratic penalty appear separately.)
Algorithm:
- Minimize L_rho over x (approximately, using any unconstrained method).
- Update multipliers: lambda_i <- max(0, lambda_i + rho * g_i(x)), nu_j <- nu_j + rho * h_j(x).
- Optionally increase rho.
- Repeat until convergence.
Advantage over pure penalty: rho does not need to go to infinity; the multiplier estimates capture the constraint information. The subproblems remain well-conditioned.
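A sketch of ALM on the same toy equality-constrained problem used for the penalty method, showing rho held fixed while the multiplier update does the work; the starting point and rho value are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Augmented Lagrangian for: minimize x1 + x2  subject to  x1^2 + x2^2 = 1.
# The optimal multiplier is nu* = 1/sqrt(2).
def h(x):
    return x[0]**2 + x[1]**2 - 1.0

x, nu, rho = np.array([-0.5, -0.5]), 0.0, 10.0
for _ in range(10):
    L = lambda z, v=nu, r=rho: z[0] + z[1] + v * h(z) + 0.5 * r * h(z)**2
    x = minimize(L, x, method="BFGS").x     # approximate inner minimization
    nu += rho * h(x)                        # multiplier update; rho stays fixed
```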
Sequential Quadratic Programming (SQP)
The gold standard for smooth nonlinear constrained optimization. At each iteration, solve a QP subproblem:
minimize grad(f(x_k))^T d + (1/2) d^T H_k d
subject to g_i(x_k) + grad(g_i(x_k))^T d <= 0
h_j(x_k) + grad(h_j(x_k))^T d = 0
The QP solution gives a search direction d_k and multiplier estimates. H_k is the Hessian of the Lagrangian (or a quasi-Newton approximation).
Properties:
- Superlinear convergence with quasi-Newton Hessian, quadratic with exact Hessian.
- Handles nonlinear constraints directly.
- Requires a QP solver at each step.
- Globalization via merit function (L1 penalty) or filter methods.
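SciPy's SLSQP is a readily available SQP-type implementation; note it uses the sign convention g(x) >= 0 for inequality constraints, opposite to the g(x) <= 0 convention above. The tiny example problem is illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Example: minimize x1^2 + x2^2  subject to  x1 + x2 >= 1.
# Solution: x = (0.5, 0.5), the projection of the origin onto the half-plane.
res = minimize(lambda x: x[0]**2 + x[1]**2,
               x0=np.array([2.0, 0.0]),
               method="SLSQP",
               constraints=[{"type": "ineq",
                             "fun": lambda x: x[0] + x[1] - 1.0}])
```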
Interior Point Methods for NLP
Extend LP/conic interior point ideas to general NLP. Treat inequality constraints with log barriers:
minimize f(x) - mu * sum_i ln(-g_i(x))
subject to h_j(x) = 0
Solve the barrier KKT system with Newton's method. Modern NLP-IPM solvers (IPOPT) are highly effective for large-scale problems with many inequality constraints.
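A bare-bones barrier-path sketch on a toy problem. Real IPM solvers take Newton steps on the barrier KKT system; here, purely for illustration, each barrier subproblem is handed to Nelder-Mead because it tolerates the infinite values the barrier takes outside the feasible region:

```python
import numpy as np
from scipy.optimize import minimize

# Log-barrier for: minimize (x1-2)^2 + (x2-2)^2  subject to  x1 + x2 <= 1,
# i.e. g(x) = x1 + x2 - 1 <= 0. Solution: x = (0.5, 0.5).
def g(x):
    return x[0] + x[1] - 1.0

def barrier(z, m):
    if g(z) >= 0:
        return np.inf                   # outside the domain of the barrier
    return (z[0] - 2.0)**2 + (z[1] - 2.0)**2 - m * np.log(-g(z))

x, mu = np.array([0.0, 0.0]), 1.0       # strictly feasible starting point
for _ in range(8):
    x = minimize(barrier, x, args=(mu,), method="Nelder-Mead").x
    mu *= 0.5                           # drive the barrier parameter toward 0
```

Each iterate stays strictly feasible, tracing the central path as mu shrinks.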
Global Optimization
Local methods find stationary points. Global methods aim for the global optimum.
Deterministic Methods
- Branch and bound: partition the domain, compute lower bounds (via convex relaxations), prune subdomains.
- Spatial branch and bound: for factorable programs, construct convex relaxations automatically (McCormick envelopes, alpha-BB).
- Reformulation-linearization technique (RLT): multiply constraints to create tighter relaxations.
Stochastic / Heuristic Methods
- Multistart: run local solver from many random initial points.
- Simulated annealing: random perturbations with decreasing temperature.
- Genetic algorithms / evolutionary strategies: population-based search.
- Basin-hopping: local minimization + random perturbation.
General nonconvex optimization is NP-hard, so no method can guarantee the global optimum in polynomial time; in practice, however, these heuristics often find good solutions.
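Multistart is the simplest of these to sketch: run a local solver from random starting points and keep the best result. The multimodal toy objective, search box, and number of starts are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# f has several local minima; its global minimum value is near -0.97.
rng = np.random.default_rng(0)
f = lambda x: np.sin(3.0 * x[0]) + 0.1 * x[0]**2

# Launch 30 local solves from uniform random points and keep the best.
best = min((minimize(f, rng.uniform(-5.0, 5.0, size=1)) for _ in range(30)),
           key=lambda r: r.fun)
```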
Method Selection Guide
| Problem Type | Recommended Method |
|-------------|-------------------|
| Small dense unconstrained | Newton / trust region |
| Large sparse unconstrained | L-BFGS |
| Small constrained | SQP (SNOPT, fmincon) |
| Large constrained, many inequalities | Interior point (IPOPT) |
| Bound-constrained | L-BFGS-B |
| Global optimization | Multistart + local solver, or spatial B&B (BARON) |
Connections
- Convex specializations with guaranteed global optimality (→ 03-convex-optimization.md).
- KKT conditions as optimality criteria (→ 04-duality-and-kkt.md).
- Stochastic variants for noisy objectives (→ 08-stochastic-optimization.md).
- Solver implementations (→ 09-optimization-solvers.md).