Stochastic Optimization
Overview
Stochastic optimization handles problems where data is uncertain or the objective is an expectation:
minimize E_xi [ f(x, xi) ]
where xi is a random variable (or vector) representing uncertainty. The field spans two paradigms: optimization under uncertainty (stochastic programming, robust optimization) and optimization with stochastic oracles (SGD, Bayesian optimization).
Stochastic Programming
Two-Stage Stochastic Programming
First stage: make decisions x before uncertainty is revealed. Second stage: after observing xi, make recourse decisions y(xi).
minimize c^T x + E_xi [ Q(x, xi) ]
subject to Ax = b, x >= 0
where Q(x, xi) = min { q^T y : Wy = h(xi) - T(xi)x, y >= 0 } is the recourse function.
Sample Average Approximation (SAA): replace the expectation with an average over N scenarios:
minimize c^T x + (1/N) sum_{k=1}^N Q(x, xi_k)
SAA converges to the true problem as N -> infinity. Practical issue: the deterministic equivalent has N copies of second-stage constraints, growing linearly with scenarios.
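As a sketch, here is SAA on a toy newsvendor problem (all data hypothetical: unit cost c, sale price p, demand xi ~ Uniform(50, 150)); the second stage is simply selling min(x, xi) units, and the first-stage order quantity is chosen by grid search over the sampled scenarios:

```python
# Sample Average Approximation (SAA) sketch for a toy newsvendor problem.
# Hypothetical data: order cost c, sale price p, demand xi ~ Uniform(50, 150).
import random

random.seed(0)
c, p = 1.0, 3.0
N = 10_000
demands = [random.uniform(50, 150) for _ in range(N)]  # N sampled scenarios xi_k

def saa_objective(x):
    # SAA objective: c*x - (1/N) * sum_k p * min(x, xi_k)  (cost minus revenue)
    revenue = sum(p * min(x, xi) for xi in demands) / N
    return c * x - revenue

# Grid search over first-stage decisions; the SAA minimizer approximates the
# true newsvendor solution at the (p - c)/p demand quantile, about 116.7 here.
best_x = min(range(0, 201), key=saa_objective)
print(best_x)
```

For this problem the true optimum is the (p - c)/p = 2/3 quantile of demand, so the SAA solution lands near 117; the gap shrinks at the usual O(1/sqrt(N)) Monte Carlo rate.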
Multi-Stage Stochastic Programming
Decisions are made sequentially as information unfolds over T stages. The decision at stage t depends on the history (information up to t). Modeled via scenario trees, but the tree size grows exponentially with stages.
Tractable approximations: decision rules (linear, piecewise linear), stochastic dual dynamic programming (SDDP) for problems with stage-wise independence.

Robust Optimization
Replace distributional assumptions with uncertainty sets. The decision must be feasible for all realizations within the set:
minimize c^T x
subject to a_i^T x <= b_i for all a_i in U_i
Uncertainty Set Types
| Set Type | Shape | Tractability |
|----------|-------|--------------|
| Box | \|\|u\|\|_inf <= 1 | LP |
| Ellipsoidal | \|\|u\|\|_2 <= 1 | SOCP |
| Polyhedral | Du <= d | LP (with more constraints) |
| Budget of uncertainty | \|\|u\|\|_1 <= Gamma | LP |
Advantages: no distributional assumptions needed, tractable reformulations, worst-case guarantees. Disadvantage: can be overly conservative.
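For box uncertainty the robust counterpart has a closed form: the worst case of (abar + u)^T x over ||u||_inf <= delta is abar^T x + delta * ||x||_1, which can be linearized. A small numeric check with hypothetical data (the worst case is attained at a corner of the box):

```python
# Robust constraint with box uncertainty: for a = abar + u, ||u||_inf <= delta,
# max_u (abar + u)^T x = abar^T x + delta * ||x||_1, so the robust counterpart
# of a^T x <= b is the linearizable constraint abar^T x + delta*||x||_1 <= b.
import itertools

abar = [1.0, -2.0, 0.5]   # hypothetical nominal coefficients
delta = 0.3               # box half-width
x = [2.0, 1.0, -4.0]      # a candidate decision

closed_form = sum(a * xi for a, xi in zip(abar, x)) + delta * sum(abs(xi) for xi in x)

# Enumerate the 2^n corners of the box; the maximum is attained at a corner.
worst = max(
    sum((a + delta * s) * xi for a, s, xi in zip(abar, signs, x))
    for signs in itertools.product([-1.0, 1.0], repeat=len(x))
)
print(closed_form, worst)  # the two agree
```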
Distributionally robust optimization (DRO): optimize over the worst-case distribution within an ambiguity set (moment constraints, Wasserstein ball). Bridges stochastic and robust approaches.
Chance Constraints
Require constraints to hold with high probability:
P[ g(x, xi) <= 0 ] >= 1 - epsilon
Individual chance constraints: each constraint holds with probability >= 1 - epsilon. For Gaussian uncertainty with linear constraints, this reduces to SOCP (via quantile reformulation).
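The quantile reformulation can be checked by Monte Carlo. A minimal sketch with hypothetical 2-D data and a diagonal covariance: for a ~ N(abar, Sigma), the constraint P[a^T x <= b] >= 1 - epsilon holds exactly when abar^T x + z_{1-epsilon} * sqrt(x^T Sigma x) <= b, where z is the standard normal quantile.

```python
# Quantile reformulation of an individual Gaussian chance constraint:
# with a ~ N(abar, Sigma), P[a^T x <= b] >= 1 - eps  iff
# abar^T x + z_{1-eps} * sqrt(x^T Sigma x) <= b.
import math
import random
from statistics import NormalDist

random.seed(1)
abar, sig = [1.0, 2.0], [0.5, 0.3]   # hypothetical means and std devs (diagonal Sigma)
x, eps = [1.0, 1.5], 0.05
z = NormalDist().inv_cdf(1 - eps)    # z_{0.95} ≈ 1.645

# Tightest feasible b according to the reformulation:
b = sum(m * xi for m, xi in zip(abar, x)) + z * math.sqrt(
    sum((s * xi) ** 2 for s, xi in zip(sig, x)))

# Empirical P[a^T x <= b] should be close to 1 - eps = 0.95.
hits = sum(
    sum(random.gauss(m, s) * xi for m, s, xi in zip(abar, sig, x)) <= b
    for _ in range(100_000))
print(hits / 100_000)
```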
Joint chance constraints: all constraints hold simultaneously with probability >= 1 - epsilon. Generally harder; approximated via:
- Bonferroni: split epsilon across individual constraints.
- CVaR approximation: conservative convex approximation.
- Scenario approach: sample constraints; with N = O(n/epsilon) scenarios, feasibility holds with high confidence (Campi-Garatti theory).
Stochastic Gradient Descent (SGD)
For problems of the form minimize E_xi [ f(x, xi) ], SGD uses a single (or mini-batch) stochastic gradient:
x_{k+1} = x_k - alpha_k * grad_x f(x_k, xi_k)
where xi_k is sampled independently at each step.
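A minimal sketch of the update on the stochastic objective F(x) = E_xi[(x - xi)^2] with xi ~ N(3, 1) (hypothetical data; the minimizer is x = 3, and grad_x f(x, xi) = 2(x - xi) is an unbiased estimate of grad F):

```python
# SGD on minimize E_xi[(x - xi)^2] with xi ~ N(3, 1); the minimizer is x = 3.
import random

random.seed(0)
x = 0.0
for k in range(1, 20_001):
    xi = random.gauss(3.0, 1.0)    # sample xi_k independently at each step
    alpha = 1.0 / (2.0 * k)        # alpha_k = 1/(mu*k) with mu = 2 here
    x -= alpha * 2.0 * (x - xi)    # stochastic gradient step
print(round(x, 2))
```

With this schedule the iterate is exactly the running mean of the samples, so it converges to 3 at the O(sigma^2/(mu*k)) rate from the table below.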
Convergence
Assume unbiased gradients: E[grad_x f(x, xi)] = grad F(x), where F(x) = E_xi[f(x, xi)], with variance bound E[||grad_x f(x,xi) - grad F(x)||^2] <= sigma^2.
| Setting | Step Size | Rate |
|---------|-----------|------|
| Convex, L-smooth | alpha_k = 1/(L + sigma*sqrt(k)) | O(sigma/sqrt(k) + 1/k) |
| Strongly convex | alpha_k = 1/(mu*k) | O(sigma^2/(mu*k)) |
| Nonconvex, L-smooth | alpha_k = 1/sqrt(K) | E[\|\|grad F\|\|^2] = O(sigma/sqrt(K)) |
The O(1/sqrt(K)) rate for the stochastic term is optimal for first-order stochastic methods.
Mini-Batch SGD
Use B samples per step: the gradient variance drops by a factor of 1/B, but the cost per step grows by B. This gives a near-linear speedup until B reaches a critical threshold: below it the noise term sigma^2/B dominates the convergence rate, so averaging helps; beyond it the deterministic term dominates and additional samples are wasted.
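The 1/B variance reduction is easy to verify empirically. A sketch with a toy gradient model (g = mu + noise, noise ~ N(0, sigma^2), all values hypothetical):

```python
# Check that mini-batch averaging shrinks gradient variance by 1/B.
import random
import statistics

random.seed(0)
sigma, B, trials = 2.0, 16, 20_000

def minibatch_grad():
    # average of B i.i.d. stochastic gradients (true gradient taken as 0)
    return sum(random.gauss(0.0, sigma) for _ in range(B)) / B

var_single = sigma ** 2                       # variance of one stochastic gradient
var_batch = statistics.variance(minibatch_grad() for _ in range(trials))
print(var_single / B, round(var_batch, 3))    # both near sigma^2 / B = 0.25
```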
Learning Rate Schedules
- Constant then decay: practical for deep learning (warmup + cosine/step decay).
- 1/sqrt(k): theoretically motivated for convex.
- Polyak step size: alpha_k = (f(x_k) - f*) / ||g_k||^2 when f* is known.
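The Polyak step has a clean closed-form behavior on a quadratic. A sketch on f(x) = (x - 3)^2 with f* = 0 known: the step alpha_k = (f(x_k) - f*)/||g_k||^2 works out to x_{k+1} = x_k - (x_k - 3)/2, halving the distance to the minimizer each iteration.

```python
# Polyak step size on f(x) = (x - 3)^2 with known f* = 0.
x, fstar = 10.0, 0.0
for _ in range(30):
    f = (x - 3.0) ** 2
    g = 2.0 * (x - 3.0)           # gradient
    if g == 0.0:
        break                     # already at the minimizer
    alpha = (f - fstar) / g ** 2  # Polyak step: alpha_k = (f(x_k) - f*) / ||g_k||^2
    x -= alpha * g
print(x)  # ≈ 3.0
```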
Variance Reduction
Reduce the variance of stochastic gradients to achieve faster convergence.
SVRG (Stochastic Variance Reduced Gradient)
Periodically compute a full gradient at a snapshot point, then use it to correct stochastic gradients:
g_k = grad f_{i_k}(x_k) - grad f_{i_k}(x_tilde) + grad F(x_tilde)
Achieves linear convergence with O((n + kappa) * ln(1/epsilon)) total component-gradient cost for strongly convex finite-sum problems, matching the linear rate of deterministic gradient descent at a far lower per-iteration cost.
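A minimal sketch of the correction on a 1-D finite sum F(x) = (1/n) sum_i c_i (x - a_i)^2 / 2 (hypothetical data); grad f_i(x) = c_i (x - a_i), and the correction term kills the gradient noise as x and the snapshot x_tilde approach the minimizer:

```python
# SVRG sketch on a 1-D strongly convex finite sum.
import random

random.seed(0)
n = 100
a = [random.uniform(0, 10) for _ in range(n)]      # component minimizers
c = [random.uniform(0.5, 2.0) for _ in range(n)]   # component curvatures
x_star = sum(ci * ai for ci, ai in zip(c, a)) / sum(c)  # minimizer of F

x, eta = 0.0, 0.1
for _ in range(20):                                # outer loops (snapshots)
    x_tilde = x
    full_grad = sum(ci * (x_tilde - ai) for ci, ai in zip(c, a)) / n
    for _ in range(100):                           # inner loop
        i = random.randrange(n)
        # variance-reduced gradient: unbiased, with variance -> 0 as
        # x and x_tilde approach x_star
        g = c[i] * (x - a[i]) - c[i] * (x_tilde - a[i]) + full_grad
        x -= eta * g
print(round(x, 3), round(x_star, 3))  # x has converged to x_star
```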
Other Methods
- SAG/SAGA: maintain a table of per-component gradients. SAGA is unbiased; SAG is biased but converges.
- SARAH/SPIDER: recursive variance reduction without full gradient snapshots.
- Adam, AdaGrad, RMSProp: adaptive per-coordinate step sizes. Widely used in deep learning; weaker theoretical guarantees but excellent practical performance.
Bayesian Optimization
For expensive black-box functions where each evaluation is costly (hyperparameter tuning, experimental design, materials discovery).
Gaussian Process (GP) Surrogate
Model f(x) as a Gaussian process: f ~ GP(mu, k) where k(x, x') is a kernel (RBF, Matern). After observing data D = {(x_i, y_i)}, the posterior is also a GP with:
mu_D(x) = k(x, X) [K + sigma^2 I]^{-1} y
sigma_D^2(x) = k(x,x) - k(x,X) [K + sigma^2 I]^{-1} k(X,x)
where K = k(X, X) is the kernel matrix on the observed inputs and sigma^2 is the observation noise variance.
The posterior mean provides a prediction; the posterior variance quantifies uncertainty.
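The two posterior formulas above translate directly into code. A sketch with an RBF kernel on hypothetical 1-D data, assuming NumPy is available:

```python
# GP posterior implementing mu_D and sigma_D^2 with an RBF kernel.
import numpy as np

def rbf(A, B, ell=1.0):
    # k(x, x') = exp(-(x - x')^2 / (2 * ell^2))
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ell ** 2))

X = np.array([-2.0, 0.0, 1.5])   # observed inputs (hypothetical)
y = np.sin(X)                    # observed values
noise = 1e-4                     # sigma^2 (observation noise)

Kinv = np.linalg.inv(rbf(X, X) + noise * np.eye(len(X)))

def posterior(xs):
    Ks = rbf(xs, X)                               # k(x, X)
    mu = Ks @ Kinv @ y                            # posterior mean
    var = rbf(xs, xs).diagonal() - (Ks @ Kinv @ Ks.T).diagonal()  # posterior variance
    return mu, var

mu, var = posterior(np.array([0.0, 3.0]))
print(mu, var)  # low variance at the observed point 0.0, high far from data at 3.0
```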
Acquisition Functions
Determine where to evaluate next by balancing exploration (high uncertainty) and exploitation (low predicted value).
| Acquisition | Formula | Properties |
|-------------|---------|------------|
| Expected Improvement (EI) | E[max(f_best - f(x), 0)] | Most popular, analytic for GPs |
| Upper Confidence Bound (UCB) | mu(x) - beta * sigma(x) (minimization form) | Tunable exploration via beta |
| Probability of Improvement (PI) | P(f(x) < f_best - xi) | Simple but greedy |
| Knowledge Gradient | Value of information at x | One-step Bayes-optimal, expensive |
| Thompson Sampling | Sample f from posterior, minimize it | Simple, parallelizable |
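For a GP posterior, EI has a closed form: with z = (f_best - mu(x)) / sigma(x), EI(x) = (f_best - mu(x)) * Phi(z) + sigma(x) * phi(z), where Phi and phi are the standard normal CDF and PDF. A sketch (the example values are hypothetical):

```python
# Closed-form Expected Improvement for minimization under a GP posterior.
import math
from statistics import NormalDist

def expected_improvement(mu, sigma, f_best):
    if sigma <= 0.0:
        return max(f_best - mu, 0.0)          # no uncertainty: plain improvement
    z = (f_best - mu) / sigma
    phi = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)  # standard normal pdf
    Phi = NormalDist().cdf(z)                            # standard normal cdf
    return (f_best - mu) * Phi + sigma * phi

# EI rewards both a low predicted mean and high posterior uncertainty:
ei_low = expected_improvement(mu=0.5, sigma=0.1, f_best=0.4)
ei_high = expected_improvement(mu=0.5, sigma=1.0, f_best=0.4)
print(ei_low, ei_high)  # same mean, but larger sigma gives much larger EI
```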
Practical Considerations
- Kernel choice: Matern-5/2 is a common default (less smooth than RBF, often more realistic).
- Scaling: standard GP is O(n^3) in observations. Sparse GPs, random features, or local models for scaling.
- High dimensions: Bayesian optimization is most effective in d <= 20. For higher dimensions, use random embeddings (REMBO) or additive structure.
Multi-Armed Bandits
Sequential decision-making under uncertainty: at each round, select an arm and observe a reward.
Regret Framework
Regret after T rounds: R_T = T * mu* - sum_{t=1}^T mu_{a_t}, where mu* is the best arm's mean.
Algorithms
UCB1 (Upper Confidence Bound):
a_t = argmax_a { mu_hat_a + sqrt(2 ln(t) / N_a(t)) }
where mu_hat_a is the empirical mean and N_a(t) is the pull count. Achieves O(sqrt(KT ln T)) regret for K arms.
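The index above is easy to simulate. A sketch on hypothetical Bernoulli arms: the exploration bonus shrinks as an arm accumulates pulls, so play concentrates on the best arm.

```python
# UCB1 on hypothetical Bernoulli arms.
import math
import random

random.seed(0)
means = [0.2, 0.5, 0.8]                    # hypothetical arm means (arm 2 is best)
K, T = len(means), 5_000
counts, sums = [0] * K, [0.0] * K

for t in range(1, T + 1):
    if t <= K:
        a = t - 1                          # pull each arm once to initialize
    else:
        a = max(range(K), key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]))  # UCB1 index
    r = float(random.random() < means[a])  # Bernoulli reward
    counts[a] += 1
    sums[a] += r

print(counts)  # the best arm (index 2) dominates the pull counts
```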
Thompson Sampling: maintain a posterior on each arm's reward distribution. Sample from each posterior, play the arm with the highest sample. Achieves O(sqrt(KT)) regret (optimal up to constants). Empirically often outperforms UCB.
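For Bernoulli rewards the posterior update is conjugate. A sketch with Beta(1, 1) priors on hypothetical arms: sample a mean from each Beta posterior, play the argmax, and increment the success/failure counts.

```python
# Thompson sampling for Bernoulli arms with Beta(1, 1) priors.
import random

random.seed(0)
means = [0.2, 0.5, 0.8]                  # hypothetical arm means (arm 2 is best)
K, T = len(means), 5_000
alpha, beta = [1] * K, [1] * K           # Beta posterior parameters per arm

pulls = [0] * K
for _ in range(T):
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(K)]
    a = samples.index(max(samples))      # play the arm with the highest sample
    r = random.random() < means[a]       # Bernoulli reward
    alpha[a] += r                        # success -> alpha
    beta[a] += 1 - r                     # failure -> beta
    pulls[a] += 1

print(pulls)  # play concentrates on the best arm (index 2)
```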
EXP3: for adversarial bandits (no stochastic assumption). Achieves O(sqrt(KT ln K)) regret.
Extensions
- Contextual bandits: arm rewards depend on observed context. LinUCB, contextual Thompson sampling.
- Combinatorial bandits: select subsets of arms (e.g., routes, assortments).
- Best-arm identification: minimize sample complexity to identify the best arm with confidence delta.
- Bayesian regret: integrate over a prior on arm parameters; Thompson sampling attains near-optimal O(sqrt(KT)) Bayesian regret.
Connections
- Convex optimization as the deterministic foundation (→ 03-convex-optimization.md).
- SGD applied to nonlinear problems (→ 06-nonlinear-programming.md).
- Combinatorial problems under uncertainty (→ 07-combinatorial-optimization.md).
- Solver support for stochastic programs (→ 09-optimization-solvers.md).