Convex Optimization

Optimization problem hierarchy

Convex Sets and Functions

A set C is convex if for all x, y in C and theta in [0, 1]: theta*x + (1-theta)*y in C. Key examples: hyperplanes, halfspaces, polyhedra, ellipsoids, norm balls, positive semidefinite cone.

A function f: R^n -> R is convex if dom(f) is convex and for all x, y in dom(f):

f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y)

Equivalent characterizations:

First-order: f(y) >= f(x) + grad(f(x))^T (y - x) for all x, y (the function lies above its tangent hyperplanes).
Second-order: the Hessian H(x) is positive semidefinite everywhere.

Strongly convex (parameter mu > 0): f(y) >= f(x) + grad(f(x))^T (y-x) + (mu/2) ||y-x||^2. This guarantees a unique minimizer and faster convergence.

L-smooth: ||grad(f(x)) - grad(f(y))|| <= L ||x-y|| for all x, y. Equivalently, f(y) <= f(x) + grad(f(x))^T (y-x) + (L/2) ||y-x||^2.

The condition number kappa = L/mu governs convergence rates.

Standard Convex Optimization Problem

minimize    f(x)
subject to  g_i(x) <= 0,   i = 1,...,m   (convex)
            h_j(x) = 0,    j = 1,...,p   (affine)

Any local minimum is a global minimum. This is the fundamental property that makes convex optimization tractable.

Gradient Descent

The workhorse of unconstrained convex optimization.

x_{k+1} = x_k - alpha_k * grad(f(x_k))

Convergence Rates

Setting	Step size	Rate	Iterations to epsilon
Convex, L-smooth	1/L	O(1/k)	O(L/epsilon)
mu-strongly convex, L-smooth	2/(L+mu)	O((1 - mu/L)^k)	O(kappa * ln(1/epsilon))

Constant step size alpha = 1/L is safe when L is known.
Line search (Armijo/backtracking): find alpha satisfying f(x - alpha*grad) <= f(x) - c*alpha*||grad||^2. Adaptive and practical.
Exact line search: alpha = argmin f(x - alpha*grad). Rarely used except for quadratics.

Nesterov's Accelerated Gradient

Achieves the optimal rate for first-order methods on smooth convex problems.

y_{k+1} = x_k - (1/L) * grad(f(x_k))
x_{k+1} = y_{k+1} + (k-1)/(k+2) * (y_{k+1} - y_k)

Convergence Rates

Setting	Rate	Lower bound match?
Convex, L-smooth	O(1/k^2)	Yes (Nesterov 1983)
Strongly convex, L-smooth	O((1 - sqrt(mu/L))^k)	Yes

The momentum term (k-1)/(k+2) provides acceleration. For strongly convex problems, use a constant momentum (sqrt(kappa) - 1)/(sqrt(kappa) + 1).

Proximal Gradient Methods

For composite problems: minimize f(x) + g(x) where f is smooth and g is convex but possibly nonsmooth (e.g., L1 norm).

The proximal operator of g:

prox_{alpha*g}(v) = argmin_x { g(x) + (1/(2*alpha)) ||x - v||^2 }

Proximal gradient step:

x_{k+1} = prox_{alpha/L, g}( x_k - (1/L) * grad(f(x_k)) )

Key proximal operators:

g(x) = lambda * ||x||_1: soft thresholding (ISTA for LASSO)
g(x) = indicator of convex set C: projection onto C
g(x) = lambda * ||X||_*: singular value soft thresholding

FISTA: accelerated proximal gradient, achieves O(1/k^2) rate.

Projected Gradient Descent

Special case of proximal gradient where g is the indicator function of a convex set C.

x_{k+1} = proj_C( x_k - alpha_k * grad(f(x_k)) )

Convergence matches standard gradient descent rates. The bottleneck is the projection step, which must be efficient. Closed-form projections exist for boxes, simplices, L2 balls, halfspaces, and affine subspaces.

Mirror Descent

Generalizes projected gradient by replacing the Euclidean distance with a Bregman divergence D_phi:

x_{k+1} = argmin_x { <grad(f(x_k)), x> + (1/alpha_k) * D_phi(x, x_k) }

When phi(x) = (1/2)||x||^2, this recovers projected gradient. For the simplex with phi(x) = sum x_i ln(x_i) (negative entropy), mirror descent gives the exponentiated gradient update, which is dimension-free in its convergence bound.

Useful when the geometry of the constraint set is better captured by a non-Euclidean distance.

Frank-Wolfe (Conditional Gradient)

For minimizing a smooth function over a compact convex set C:

s_k = argmin_{s in C} <grad(f(x_k)), s>       (linear minimization oracle)
x_{k+1} = x_k + gamma_k * (s_k - x_k)         (gamma_k = 2/(k+2))

Advantages: projection-free (only requires a linear oracle over C), iterates remain in C, and solutions are sparse (convex combinations of vertices). Rate: O(1/k) for smooth convex f, which is tight for this oracle model.

Applications: matrix completion (nuclear norm ball), structured sparsity, traffic assignment.

ADMM (Alternating Direction Method of Multipliers)

Solves problems of the form: minimize f(x) + g(z) subject to Ax + Bz = c.

Augmented Lagrangian with penalty rho:

x_{k+1} = argmin_x { f(x) + (rho/2) ||Ax + Bz_k - c + u_k||^2 }
z_{k+1} = argmin_z { g(z) + (rho/2) ||Ax_{k+1} + Bz - c + u_k||^2 }
u_{k+1} = u_k + Ax_{k+1} + Bz_{k+1} - c

ADMM decomposes the problem into subproblems that are often easier to solve. Convergence rate is O(1/k) in general. Widely used in distributed optimization, signal processing, and machine learning (LASSO, consensus problems).

Practical Considerations

Penalty parameter rho: critical for convergence speed. Adaptive schemes (residual balancing) help.
Over-relaxation: replace Ax_{k+1} with alphaAx_{k+1} - (1-alpha)(Bz_k - c) with alpha in (1, 1.8) for speedup.
Stopping criteria: primal residual ||Ax + Bz - c|| and dual residual ||rho * B^T (z_{k+1} - z_k)|| both below tolerance.

Summary of Rates

Method	Smooth convex	Strongly convex
Gradient descent	O(1/k)	O(exp(-k/kappa))
Accelerated gradient	O(1/k^2)	O(exp(-k/sqrt(kappa)))
Proximal gradient	O(1/k)	O(exp(-k/kappa))
Frank-Wolfe	O(1/k)	O(1/k)
Mirror descent	O(1/sqrt(k))	O(exp(-k*mu))

Connections

Duality theory underlying these methods (→ 04-duality-and-kkt.md).
Conic optimization as structured convex optimization (→ 05-conic-programming.md).
Newton and quasi-Newton for faster local convergence (→ 06-nonlinear-programming.md).
Stochastic variants of gradient methods (→ 08-stochastic-optimization.md).