Nonlinear Programming
Problem Setting
minimize f(x)
subject to g_i(x) <= 0, i = 1,...,m
h_j(x) = 0, j = 1,...,p
where f, g_i, h_j are generally nonlinear and possibly nonconvex. Unlike convex optimization, local optima may not be global, and algorithms typically guarantee convergence to stationary points (KKT points).
Unconstrained Optimization
Gradient Descent
x_{k+1} = x_k - alpha_k * grad(f(x_k))
With a suitably small step size alpha_k (e.g., alpha_k = 1/L when grad(f) is L-Lipschitz), gradient descent converges to stationary points. For smooth nonconvex f, it achieves ||grad(f(x_k))|| <= epsilon in O(1/epsilon^2) iterations. See 03-convex-optimization.md for convex convergence rates.
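A minimal sketch of fixed-step gradient descent; the one-dimensional test function and step size below are illustrative choices, and which stationary point is reached depends entirely on the starting point:

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.01, tol=1e-8, max_iter=10000):
    """Fixed-step gradient descent; stops when the gradient norm is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        x = x - alpha * g
    return x

# Nonconvex example: f(x) = x^4 - 3x^2 + x has two local minima.
# Starting at x = 2, the iterates descend into the right-hand basin.
grad_f = lambda x: 4 * x**3 - 6 * x + 1
x_star = gradient_descent(grad_f, x0=np.array([2.0]))
```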
Newton's Method
Uses second-order (curvature) information for faster local convergence:
x_{k+1} = x_k - [H(x_k)]^{-1} * grad(f(x_k))
where H(x_k) is the Hessian. Each iteration solves the linear system H(x_k) d = -grad(f(x_k)).
Properties:
- Quadratic local convergence: ||x_{k+1} - x*|| <= C * ||x_k - x*||^2 near x* when H(x*) is positive definite.
- Affine invariance: the iterates transform consistently under any affine change of variables x -> Ax + b, so performance does not depend on coordinate scaling.
- Cost: O(n^3) per iteration for dense Hessian (or O(n^2) with structure).
- Not globally convergent without modification: may diverge or move to saddle points.
Modified Newton: ensure descent by modifying the Hessian to be positive definite:
- Add a multiple of the identity: H + tau*I with tau chosen so that H + tau*I > 0.
- Use the Cholesky factorization with added diagonal (Gill-Murray modification).
Damped Newton: combine with line search or trust region for global convergence.
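The pieces above can be sketched together: a diagonal shift found by retrying the Cholesky factorization (in the spirit of the Gill-Murray modification), followed by Armijo backtracking for global convergence. The shift schedule and line-search constants are illustrative choices:

```python
import numpy as np

def damped_newton(f, grad, hess, x0, tol=1e-10, max_iter=100):
    """Newton with Hessian modification (H + tau*I) and Armijo backtracking."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        H = hess(x)
        tau = 0.0
        while True:  # increase tau until H + tau*I is positive definite
            try:
                L = np.linalg.cholesky(H + tau * np.eye(len(x)))
                break
            except np.linalg.LinAlgError:
                tau = max(2.0 * tau, 1e-3)
        # Solve (H + tau*I) d = -g using the Cholesky factor L L^T.
        d = np.linalg.solve(L.T, np.linalg.solve(L, -g))
        t = 1.0
        while f(x + t * d) > f(x) + 1e-4 * t * (g @ d):  # Armijo condition
            t *= 0.5
        x = x + t * d
    return x

# Rosenbrock function: nonconvex, with a narrow curved valley.
f = lambda z: (1 - z[0])**2 + 100 * (z[1] - z[0]**2)**2
grad = lambda z: np.array([-2 * (1 - z[0]) - 400 * z[0] * (z[1] - z[0]**2),
                           200 * (z[1] - z[0]**2)])
hess = lambda z: np.array([[2 - 400 * z[1] + 1200 * z[0]**2, -400 * z[0]],
                           [-400 * z[0], 200.0]])
x_star = damped_newton(f, grad, hess, np.array([-1.2, 1.0]))
```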
Trust Region Methods
Instead of choosing a direction then a step size, trust region methods define a region where the quadratic model is trusted:
minimize grad(f(x_k))^T d + (1/2) d^T H_k d
subject to ||d|| <= Delta_k
The ratio rho_k = (actual reduction) / (predicted reduction) determines whether to accept the step and adjust the trust region radius Delta_k:
- rho_k > 0.75: expand Delta_k (model is good)
- rho_k < 0.25: shrink Delta_k (model is poor)
- rho_k < 0: reject the step
Trust regions handle negative curvature naturally and have strong global convergence guarantees.
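The accept/expand/shrink logic can be sketched with the Cauchy point (minimizing the model along -grad(f) within the region) standing in for a full subproblem solver; the thresholds follow the rules above, and the test problem is an illustrative convex quadratic:

```python
import numpy as np

def trust_region(f, grad, hess, x0, delta=1.0, tol=1e-8, max_iter=500):
    """Basic trust-region loop; the subproblem is solved only approximately
    via the Cauchy point, so progress is gradient-descent-like."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        gnorm = np.linalg.norm(g)
        if gnorm <= tol:
            break
        H = hess(x)
        gHg = g @ H @ g
        if gHg <= 0:
            t = delta / gnorm                # negative curvature: go to boundary
        else:
            t = min(gnorm**2 / gHg, delta / gnorm)
        d = -t * g
        pred = -(g @ d + 0.5 * d @ H @ d)    # predicted reduction (> 0)
        rho = (f(x) - f(x + d)) / pred
        if rho > 0.75 and np.isclose(np.linalg.norm(d), delta):
            delta *= 2.0                     # good model, step hit boundary: expand
        elif rho < 0.25:
            delta *= 0.25                    # poor model: shrink
        if rho > 0:                          # accept only if f actually decreased
            x = x + d
    return x

# Ill-conditioned convex quadratic: f(x) = 0.5 * x^T diag(1, 10) x.
A = np.diag([1.0, 10.0])
x_star = trust_region(lambda z: 0.5 * z @ A @ z, lambda z: A @ z, lambda z: A,
                      np.array([1.0, 1.0]))
```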
Quasi-Newton Methods
Approximate the Hessian (or its inverse) using only gradient information, avoiding O(n^3) Hessian computation.
BFGS (Broyden-Fletcher-Goldfarb-Shanno)
Maintain an approximation B_k to the Hessian, updated via rank-2 corrections:
s_k = x_{k+1} - x_k
y_k = grad(f(x_{k+1})) - grad(f(x_k))
B_{k+1} = B_k - (B_k s_k s_k^T B_k)/(s_k^T B_k s_k) + (y_k y_k^T)/(y_k^T s_k)
Or equivalently, maintain H_k = B_k^{-1} directly (inverse BFGS update). BFGS achieves superlinear convergence on smooth, strongly convex functions.
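A sketch of the inverse form: it applies the standard identity H_{k+1} = V_k H_k V_k^T + rho_k s_k s_k^T with V_k = I - rho_k s_k y_k^T and rho_k = 1/(y_k^T s_k), which is algebraically equivalent to inverting the B_k update above. The Armijo backtracking and the curvature safeguard are illustrative simplifications of what library code does:

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-8, max_iter=1000):
    """BFGS maintaining the inverse-Hessian approximation H_k; a sketch,
    where production code would use a Wolfe line search and initial scaling."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    H = np.eye(n)                            # inverse-Hessian approximation
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        d = -H @ g
        t = 1.0
        while f(x + t * d) > f(x) + 1e-4 * t * (g @ d):  # Armijo backtracking
            t *= 0.5
        x_new = x + t * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        sy = s @ y
        if sy > 1e-12:                       # curvature condition y^T s > 0
            rho = 1.0 / sy
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Rosenbrock: BFGS needs only gradients, no Hessian.
f = lambda z: (1 - z[0])**2 + 100 * (z[1] - z[0]**2)**2
grad = lambda z: np.array([-2 * (1 - z[0]) - 400 * z[0] * (z[1] - z[0]**2),
                           200 * (z[1] - z[0]**2)])
x_star = bfgs(f, grad, np.array([-1.2, 1.0]))
```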
L-BFGS (Limited-Memory BFGS)
Store only the last m pairs {s_k, y_k} (typically m = 5-20) instead of the full n x n matrix. Compute H_k * grad(f) via the two-loop recursion in O(mn) time.
L-BFGS is the default choice for large-scale smooth unconstrained/bound-constrained optimization. Memory: O(mn) vs O(n^2) for full BFGS.
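In practice one calls a library implementation such as SciPy's; the extended Rosenbrock function below is a standard test problem used here for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Extended Rosenbrock in n = 100 dimensions: only f and grad(f) are supplied;
# L-BFGS-B builds its curvature model from the stored (s, y) pairs.
def f(x):
    return np.sum(100.0 * (x[1:] - x[:-1]**2)**2 + (1.0 - x[:-1])**2)

def grad(x):
    g = np.zeros_like(x)
    g[:-1] = -400.0 * x[:-1] * (x[1:] - x[:-1]**2) - 2.0 * (1.0 - x[:-1])
    g[1:] += 200.0 * (x[1:] - x[:-1]**2)
    return g

res = minimize(f, np.zeros(100), jac=grad, method="L-BFGS-B",
               options={"maxcor": 10})   # maxcor = number of (s, y) pairs stored
```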
Other Quasi-Newton Methods
- SR1 (Symmetric Rank-1): can capture negative curvature, useful in trust region frameworks.
- Broyden family: interpolation between BFGS and DFP updates.
Constrained Optimization Methods
Penalty Methods
Replace the constrained problem with unconstrained subproblems:
Quadratic penalty:
minimize f(x) + (rho/2) * sum_i max(0, g_i(x))^2 + (rho/2) * sum_j h_j(x)^2
Increase rho iteratively. As rho -> infinity, solutions approach feasibility. Issue: ill-conditioning as rho grows, making subproblems hard to solve.
Exact penalty (L1):
minimize f(x) + rho * sum_i max(0, g_i(x)) + rho * sum_j |h_j(x)|
For sufficiently large rho, the exact penalty has the same minimizers as the original problem. Nonsmoothness requires specialized solvers.
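A sketch of the quadratic-penalty loop on a toy equality-constrained problem; the starting point (chosen in the correct basin, since penalty subproblems are themselves nonconvex here), rho schedule, and inner solver are all illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

# Quadratic penalty for: minimize x1 + x2  subject to  x1^2 + x2^2 = 1,
# whose solution is x = (-1/sqrt(2), -1/sqrt(2)).
def h(x):
    return x[0]**2 + x[1]**2 - 1.0

x, rho = np.array([-0.5, -0.5]), 1.0
for _ in range(6):
    penalized = lambda z, r=rho: z[0] + z[1] + 0.5 * r * h(z)**2
    x = minimize(penalized, x, method="BFGS").x   # warm-start each subproblem
    rho *= 10.0
# The constraint violation shrinks like O(1/rho) but never reaches zero.
```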
Augmented Lagrangian Method (ALM)
Combines Lagrangian with a quadratic penalty:
L_rho(x, lambda, nu) = f(x) + (rho/2) sum_i [ max(0, g_i(x) + lambda_i/rho)^2 - (lambda_i/rho)^2 ]
+ sum_j nu_j h_j(x) + (rho/2) sum_j h_j(x)^2
(The inequality term already contains the multiplier via the shift lambda_i/rho; for equality constraints the linear multiplier term and quadratic penalty appear separately.)
Algorithm:
- Minimize L_rho over x (approximately, using any unconstrained method).
- Update multipliers: lambda_i <- max(0, lambda_i + rho * g_i(x)), nu_j <- nu_j + rho * h_j(x).
- Optionally increase rho.
- Repeat until convergence.
Advantage over pure penalty: rho does not need to go to infinity; the multiplier estimates capture the constraint information. The subproblems remain well-conditioned.
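A sketch of ALM on the same toy equality-constrained problem used for the penalty method, showing rho held fixed while the multiplier update does the work; the starting point and rho value are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Augmented Lagrangian for: minimize x1 + x2  subject to  x1^2 + x2^2 = 1.
# The optimal multiplier is nu* = 1/sqrt(2).
def h(x):
    return x[0]**2 + x[1]**2 - 1.0

x, nu, rho = np.array([-0.5, -0.5]), 0.0, 10.0
for _ in range(10):
    L = lambda z, v=nu, r=rho: z[0] + z[1] + v * h(z) + 0.5 * r * h(z)**2
    x = minimize(L, x, method="BFGS").x     # approximate inner minimization
    nu += rho * h(x)                        # multiplier update; rho stays fixed
```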
Sequential Quadratic Programming (SQP)
The gold standard for smooth nonlinear constrained optimization. At each iteration, solve a QP subproblem:
minimize grad(f(x_k))^T d + (1/2) d^T H_k d
subject to g_i(x_k) + grad(g_i(x_k))^T d <= 0
h_j(x_k) + grad(h_j(x_k))^T d = 0
The QP solution gives a search direction d_k and multiplier estimates. H_k is the Hessian of the Lagrangian (or a quasi-Newton approximation).
Properties:
- Superlinear convergence with quasi-Newton Hessian, quadratic with exact Hessian.
- Handles nonlinear constraints directly.
- Requires a QP solver at each step.
- Globalization via merit function (L1 penalty) or filter methods.
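SciPy's SLSQP is a readily available SQP-type implementation; note it uses the sign convention g(x) >= 0 for inequality constraints, opposite to the g(x) <= 0 convention above. The tiny example problem is illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Example: minimize x1^2 + x2^2  subject to  x1 + x2 >= 1.
# Solution: x = (0.5, 0.5), the projection of the origin onto the half-plane.
res = minimize(lambda x: x[0]**2 + x[1]**2,
               x0=np.array([2.0, 0.0]),
               method="SLSQP",
               constraints=[{"type": "ineq",
                             "fun": lambda x: x[0] + x[1] - 1.0}])
```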
Interior Point Methods for NLP
Extend LP/conic interior point ideas to general NLP. Treat inequality constraints with log barriers:
minimize f(x) - mu * sum_i ln(-g_i(x))
subject to h_j(x) = 0
Solve the barrier KKT system with Newton's method. Modern NLP-IPM solvers (IPOPT) are highly effective for large-scale problems with many inequality constraints.
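A bare-bones barrier-path sketch on a toy problem. Real IPM solvers take Newton steps on the barrier KKT system; here, purely for illustration, each barrier subproblem is handed to Nelder-Mead because it tolerates the infinite values the barrier takes outside the feasible region:

```python
import numpy as np
from scipy.optimize import minimize

# Log-barrier for: minimize (x1-2)^2 + (x2-2)^2  subject to  x1 + x2 <= 1,
# i.e. g(x) = x1 + x2 - 1 <= 0. Solution: x = (0.5, 0.5).
def g(x):
    return x[0] + x[1] - 1.0

def barrier(z, m):
    if g(z) >= 0:
        return np.inf                   # outside the domain of the barrier
    return (z[0] - 2.0)**2 + (z[1] - 2.0)**2 - m * np.log(-g(z))

x, mu = np.array([0.0, 0.0]), 1.0       # strictly feasible starting point
for _ in range(8):
    x = minimize(barrier, x, args=(mu,), method="Nelder-Mead").x
    mu *= 0.5                           # drive the barrier parameter toward 0
```

Each iterate stays strictly feasible, tracing the central path as mu shrinks.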
Global Optimization
Local methods find stationary points. Global methods aim for the global optimum.
Deterministic Methods
- Branch and bound: partition the domain, compute lower bounds (via convex relaxations), prune subdomains.
- Spatial branch and bound: for factorable programs, construct convex relaxations automatically (McCormick envelopes, alpha-BB).
- Reformulation-linearization technique (RLT): multiply constraints to create tighter relaxations.
Stochastic / Heuristic Methods
- Multistart: run local solver from many random initial points.
- Simulated annealing: random perturbations with decreasing temperature.
- Genetic algorithms / evolutionary strategies: population-based search.
- Basin-hopping: local minimization + random perturbation.
General nonconvex optimization is NP-hard, so no method can guarantee the global optimum in polynomial time; in practice, however, these heuristics often find good solutions.
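Multistart is the simplest of these to sketch: run a local solver from random starting points and keep the best result. The multimodal toy objective, search box, and number of starts are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# f has several local minima; its global minimum value is near -0.97.
rng = np.random.default_rng(0)
f = lambda x: np.sin(3.0 * x[0]) + 0.1 * x[0]**2

# Launch 30 local solves from uniform random points and keep the best.
best = min((minimize(f, rng.uniform(-5.0, 5.0, size=1)) for _ in range(30)),
           key=lambda r: r.fun)
```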
Method Selection Guide
| Problem Type | Recommended Method |
|-------------|-------------------|
| Small dense unconstrained | Newton / trust region |
| Large sparse unconstrained | L-BFGS |
| Small constrained | SQP (SNOPT, fmincon) |
| Large constrained, many inequalities | Interior point (IPOPT) |
| Bound-constrained | L-BFGS-B |
| Global optimization | Multistart + local solver, or spatial B&B (BARON) |
Connections
- Convex specializations with guaranteed global optimality (→ 03-convex-optimization.md).
- KKT conditions as optimality criteria (→ 04-duality-and-kkt.md).
- Stochastic variants for noisy objectives (→ 08-stochastic-optimization.md).
- Solver implementations (→ 09-optimization-solvers.md).