Stochastic Optimization
Overview
Stochastic optimization handles problems where data is uncertain or the objective is an expectation:
minimize E_xi [ f(x, xi) ]
where xi is a random variable (or vector) representing uncertainty. The field spans two paradigms: optimization under uncertainty (stochastic programming, robust optimization) and optimization with stochastic oracles (SGD, Bayesian optimization).
Stochastic Programming
Two-Stage Stochastic Programming
First stage: make decisions x before uncertainty is revealed. Second stage: after observing xi, make recourse decisions y(xi).
minimize c^T x + E_xi [ Q(x, xi) ]
subject to Ax = b, x >= 0
where Q(x, xi) = min { q^T y : Wy = h(xi) - T(xi)x, y >= 0 } is the recourse function.
Sample Average Approximation (SAA): replace the expectation with an average over N scenarios:
minimize c^T x + (1/N) sum_{k=1}^N Q(x, xi_k)
SAA converges to the true problem as N -> infinity. Practical issue: the deterministic equivalent has N copies of second-stage constraints, growing linearly with scenarios.
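As a sketch, here is SAA on a toy newsvendor problem (all data hypothetical: unit cost c, sale price p, demand xi ~ Uniform(50, 150)); the second stage is simply selling min(x, xi) units, and the first-stage order quantity is chosen by grid search over the sampled scenarios:

```python
# Sample Average Approximation (SAA) sketch for a toy newsvendor problem.
# Hypothetical data: order cost c, sale price p, demand xi ~ Uniform(50, 150).
import random

random.seed(0)
c, p = 1.0, 3.0
N = 10_000
demands = [random.uniform(50, 150) for _ in range(N)]  # N sampled scenarios xi_k

def saa_objective(x):
    # SAA objective: c*x - (1/N) * sum_k p * min(x, xi_k)  (cost minus revenue)
    revenue = sum(p * min(x, xi) for xi in demands) / N
    return c * x - revenue

# Grid search over first-stage decisions; the SAA minimizer approximates the
# true newsvendor solution at the (p - c)/p demand quantile, about 116.7 here.
best_x = min(range(0, 201), key=saa_objective)
print(best_x)
```

For this problem the true optimum is the (p - c)/p = 2/3 quantile of demand, so the SAA solution lands near 117; the gap shrinks at the usual O(1/sqrt(N)) Monte Carlo rate.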
Multi-Stage Stochastic Programming
Decisions are made sequentially as information unfolds over T stages. The decision at stage t depends on the history (information up to t). Modeled via scenario trees, but the tree size grows exponentially with stages.
Tractable approximations: decision rules (linear, piecewise linear), stochastic dual dynamic programming (SDDP) for problems with stage-wise independence.

Robust Optimization
Replace distributional assumptions with uncertainty sets. The decision must be feasible for all realizations within the set:
minimize c^T x
subject to a_i^T x <= b_i for all a_i in U_i
Uncertainty Set Types
| Set Type | Shape | Tractability |
|----------|-------|--------------|
| Box | \|\|u\|\|_inf <= 1 | LP |
| Ellipsoidal | \|\|u\|\|_2 <= 1 | SOCP |
| Polyhedral | Du <= d | LP (with more constraints) |
| Budget of uncertainty | \|\|u\|\|_1 <= Gamma | LP |
Advantages: no distributional assumptions needed, tractable reformulations, worst-case guarantees. Disadvantage: can be overly conservative.
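For box uncertainty the robust counterpart has a closed form: the worst case of (abar + u)^T x over ||u||_inf <= delta is abar^T x + delta * ||x||_1, which can be linearized. A small numeric check with hypothetical data (the worst case is attained at a corner of the box):

```python
# Robust constraint with box uncertainty: for a = abar + u, ||u||_inf <= delta,
# max_u (abar + u)^T x = abar^T x + delta * ||x||_1, so the robust counterpart
# of a^T x <= b is the linearizable constraint abar^T x + delta*||x||_1 <= b.
import itertools

abar = [1.0, -2.0, 0.5]   # hypothetical nominal coefficients
delta = 0.3               # box half-width
x = [2.0, 1.0, -4.0]      # a candidate decision

closed_form = sum(a * xi for a, xi in zip(abar, x)) + delta * sum(abs(xi) for xi in x)

# Enumerate the 2^n corners of the box; the maximum is attained at a corner.
worst = max(
    sum((a + delta * s) * xi for a, s, xi in zip(abar, signs, x))
    for signs in itertools.product([-1.0, 1.0], repeat=len(x))
)
print(closed_form, worst)  # the two agree
```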
Distributionally robust optimization (DRO): optimize over the worst-case distribution within an ambiguity set (moment constraints, Wasserstein ball). Bridges stochastic and robust approaches.
Chance Constraints
Require constraints to hold with high probability:
P[ g(x, xi) <= 0 ] >= 1 - epsilon
Individual chance constraints: each constraint holds with probability >= 1 - epsilon. For Gaussian uncertainty with linear constraints, this reduces to SOCP (via quantile reformulation).
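The quantile reformulation can be checked by Monte Carlo. A minimal sketch with hypothetical 2-D data and a diagonal covariance: for a ~ N(abar, Sigma), the constraint P[a^T x <= b] >= 1 - epsilon holds exactly when abar^T x + z_{1-epsilon} * sqrt(x^T Sigma x) <= b, where z is the standard normal quantile.

```python
# Quantile reformulation of an individual Gaussian chance constraint:
# with a ~ N(abar, Sigma), P[a^T x <= b] >= 1 - eps  iff
# abar^T x + z_{1-eps} * sqrt(x^T Sigma x) <= b.
import math
import random
from statistics import NormalDist

random.seed(1)
abar, sig = [1.0, 2.0], [0.5, 0.3]   # hypothetical means and std devs (diagonal Sigma)
x, eps = [1.0, 1.5], 0.05
z = NormalDist().inv_cdf(1 - eps)    # z_{0.95} ≈ 1.645

# Tightest feasible b according to the reformulation:
b = sum(m * xi for m, xi in zip(abar, x)) + z * math.sqrt(
    sum((s * xi) ** 2 for s, xi in zip(sig, x)))

# Empirical P[a^T x <= b] should be close to 1 - eps = 0.95.
hits = sum(
    sum(random.gauss(m, s) * xi for m, s, xi in zip(abar, sig, x)) <= b
    for _ in range(100_000))
print(hits / 100_000)
```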
Joint chance constraints: all constraints hold simultaneously with probability >= 1 - epsilon. Generally harder; approximated via:
- Bonferroni: split epsilon across individual constraints.
- CVaR approximation: conservative convex approximation.
- Scenario approach: sample constraints; with N = O(n/epsilon) scenarios, feasibility holds with high confidence (Campi-Garatti theory).
Stochastic Gradient Descent (SGD)
For problems of the form minimize E_xi [ f(x, xi) ], SGD uses a single (or mini-batch) stochastic gradient:
x_{k+1} = x_k - alpha_k * grad_x f(x_k, xi_k)
where xi_k is sampled independently at each step.
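A minimal sketch of the update on the stochastic objective F(x) = E_xi[(x - xi)^2] with xi ~ N(3, 1) (hypothetical data; the minimizer is x = 3, and grad_x f(x, xi) = 2(x - xi) is an unbiased estimate of grad F):

```python
# SGD on minimize E_xi[(x - xi)^2] with xi ~ N(3, 1); the minimizer is x = 3.
import random

random.seed(0)
x = 0.0
for k in range(1, 20_001):
    xi = random.gauss(3.0, 1.0)    # sample xi_k independently at each step
    alpha = 1.0 / (2.0 * k)        # alpha_k = 1/(mu*k) with mu = 2 here
    x -= alpha * 2.0 * (x - xi)    # stochastic gradient step
print(round(x, 2))
```

With this schedule the iterate is exactly the running mean of the samples, so it converges to 3 at the O(sigma^2/(mu*k)) rate from the table below.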
Convergence
Assume unbiased gradients: E[grad_x f(x, xi)] = grad F(x), where F(x) = E_xi[f(x, xi)], with variance bound E[||grad_x f(x,xi) - grad F(x)||^2] <= sigma^2.
| Setting | Step Size | Rate |
|---------|-----------|------|
| Convex, L-smooth | alpha_k = 1/(L + sigma*sqrt(k)) | O(sigma/sqrt(k) + 1/k) |
| Strongly convex | alpha_k = 1/(mu*k) | O(sigma^2/(mu*k)) |
| Nonconvex, L-smooth | alpha_k = 1/sqrt(K) | E[\|\|grad F\|\|^2] = O(sigma/sqrt(K)) |
The O(1/sqrt(K)) rate for the stochastic term is optimal for first-order stochastic methods.
Mini-Batch SGD
Use B samples per step: the gradient variance drops by a factor of 1/B, but the cost per step grows by B. This gives a near-linear speedup until B reaches a critical threshold: below it the noise term sigma^2/B dominates the convergence rate, so averaging helps; beyond it the deterministic term dominates and additional samples are wasted.
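The 1/B variance reduction is easy to verify empirically. A sketch with a toy gradient model (g = mu + noise, noise ~ N(0, sigma^2), all values hypothetical):

```python
# Check that mini-batch averaging shrinks gradient variance by 1/B.
import random
import statistics

random.seed(0)
sigma, B, trials = 2.0, 16, 20_000

def minibatch_grad():
    # average of B i.i.d. stochastic gradients (true gradient taken as 0)
    return sum(random.gauss(0.0, sigma) for _ in range(B)) / B

var_single = sigma ** 2                       # variance of one stochastic gradient
var_batch = statistics.variance(minibatch_grad() for _ in range(trials))
print(var_single / B, round(var_batch, 3))    # both near sigma^2 / B = 0.25
```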
Learning Rate Schedules
- Constant then decay: practical for deep learning (warmup + cosine/step decay).
- 1/sqrt(k): theoretically motivated for convex.
- Polyak step size: alpha_k = (f(x_k) - f*) / ||g_k||^2 when f* is known.
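The Polyak step has a clean closed-form behavior on a quadratic. A sketch on f(x) = (x - 3)^2 with f* = 0 known: the step alpha_k = (f(x_k) - f*)/||g_k||^2 works out to x_{k+1} = x_k - (x_k - 3)/2, halving the distance to the minimizer each iteration.

```python
# Polyak step size on f(x) = (x - 3)^2 with known f* = 0.
x, fstar = 10.0, 0.0
for _ in range(30):
    f = (x - 3.0) ** 2
    g = 2.0 * (x - 3.0)           # gradient
    if g == 0.0:
        break                     # already at the minimizer
    alpha = (f - fstar) / g ** 2  # Polyak step: alpha_k = (f(x_k) - f*) / ||g_k||^2
    x -= alpha * g
print(x)  # ≈ 3.0
```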
Variance Reduction
Reduce the variance of stochastic gradients to achieve faster convergence.
SVRG (Stochastic Variance Reduced Gradient)
Periodically compute a full gradient at a snapshot point, then use it to correct stochastic gradients:
g_k = grad f_{i_k}(x_k) - grad f_{i_k}(x_tilde) + grad F(x_tilde)
Achieves linear convergence with O((n + kappa) * ln(1/epsilon)) total component-gradient cost for strongly convex finite-sum problems, matching the linear rate of deterministic gradient descent at a far lower per-iteration cost.
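A minimal sketch of the correction on a 1-D finite sum F(x) = (1/n) sum_i c_i (x - a_i)^2 / 2 (hypothetical data); grad f_i(x) = c_i (x - a_i), and the correction term kills the gradient noise as x and the snapshot x_tilde approach the minimizer:

```python
# SVRG sketch on a 1-D strongly convex finite sum.
import random

random.seed(0)
n = 100
a = [random.uniform(0, 10) for _ in range(n)]      # component minimizers
c = [random.uniform(0.5, 2.0) for _ in range(n)]   # component curvatures
x_star = sum(ci * ai for ci, ai in zip(c, a)) / sum(c)  # minimizer of F

x, eta = 0.0, 0.1
for _ in range(20):                                # outer loops (snapshots)
    x_tilde = x
    full_grad = sum(ci * (x_tilde - ai) for ci, ai in zip(c, a)) / n
    for _ in range(100):                           # inner loop
        i = random.randrange(n)
        # variance-reduced gradient: unbiased, with variance -> 0 as
        # x and x_tilde approach x_star
        g = c[i] * (x - a[i]) - c[i] * (x_tilde - a[i]) + full_grad
        x -= eta * g
print(round(x, 3), round(x_star, 3))  # x has converged to x_star
```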
Other Methods
- SAG/SAGA: maintain a table of per-component gradients. SAGA is unbiased; SAG is biased but converges.
- SARAH/SPIDER: recursive variance reduction without full gradient snapshots.
- Adam, AdaGrad, RMSProp: adaptive per-coordinate step sizes. Widely used in deep learning; weaker theoretical guarantees but excellent practical performance.
Bayesian Optimization
For expensive black-box functions where each evaluation is costly (hyperparameter tuning, experimental design, materials discovery).
Gaussian Process (GP) Surrogate
Model f(x) as a Gaussian process: f ~ GP(mu, k) where k(x, x') is a kernel (RBF, Matern). After observing data D = {(x_i, y_i)}, the posterior is also a GP with:
mu_D(x) = k(x, X) [K + sigma^2 I]^{-1} y
sigma_D^2(x) = k(x,x) - k(x,X) [K + sigma^2 I]^{-1} k(X,x)
where K = k(X, X) is the kernel matrix on the observed inputs and sigma^2 is the observation noise variance.
The posterior mean provides a prediction; the posterior variance quantifies uncertainty.
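The two posterior formulas above translate directly into code. A sketch with an RBF kernel on hypothetical 1-D data, assuming NumPy is available:

```python
# GP posterior implementing mu_D and sigma_D^2 with an RBF kernel.
import numpy as np

def rbf(A, B, ell=1.0):
    # k(x, x') = exp(-(x - x')^2 / (2 * ell^2))
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ell ** 2))

X = np.array([-2.0, 0.0, 1.5])   # observed inputs (hypothetical)
y = np.sin(X)                    # observed values
noise = 1e-4                     # sigma^2 (observation noise)

Kinv = np.linalg.inv(rbf(X, X) + noise * np.eye(len(X)))

def posterior(xs):
    Ks = rbf(xs, X)                               # k(x, X)
    mu = Ks @ Kinv @ y                            # posterior mean
    var = rbf(xs, xs).diagonal() - (Ks @ Kinv @ Ks.T).diagonal()  # posterior variance
    return mu, var

mu, var = posterior(np.array([0.0, 3.0]))
print(mu, var)  # low variance at the observed point 0.0, high far from data at 3.0
```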
Acquisition Functions
Determine where to evaluate next by balancing exploration (high uncertainty) and exploitation (low predicted value).
| Acquisition | Formula | Properties |
|-------------|---------|------------|
| Expected Improvement (EI) | E[max(f_best - f(x), 0)] | Most popular, analytic for GPs |
| Upper Confidence Bound (UCB) | mu(x) - beta * sigma(x) (minimization form) | Tunable exploration via beta |
| Probability of Improvement (PI) | P(f(x) < f_best - xi) | Simple but greedy |
| Knowledge Gradient | Value of information at x | One-step Bayes-optimal, expensive |
| Thompson Sampling | Sample f from posterior, minimize it | Simple, parallelizable |
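For a GP posterior, EI has a closed form: with z = (f_best - mu(x)) / sigma(x), EI(x) = (f_best - mu(x)) * Phi(z) + sigma(x) * phi(z), where Phi and phi are the standard normal CDF and PDF. A sketch (the example values are hypothetical):

```python
# Closed-form Expected Improvement for minimization under a GP posterior.
import math
from statistics import NormalDist

def expected_improvement(mu, sigma, f_best):
    if sigma <= 0.0:
        return max(f_best - mu, 0.0)          # no uncertainty: plain improvement
    z = (f_best - mu) / sigma
    phi = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)  # standard normal pdf
    Phi = NormalDist().cdf(z)                            # standard normal cdf
    return (f_best - mu) * Phi + sigma * phi

# EI rewards both a low predicted mean and high posterior uncertainty:
ei_low = expected_improvement(mu=0.5, sigma=0.1, f_best=0.4)
ei_high = expected_improvement(mu=0.5, sigma=1.0, f_best=0.4)
print(ei_low, ei_high)  # same mean, but larger sigma gives much larger EI
```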
Practical Considerations
- Kernel choice: Matern-5/2 is a common default (less smooth than RBF, often more realistic).
- Scaling: standard GP is O(n^3) in observations. Sparse GPs, random features, or local models for scaling.
- High dimensions: Bayesian optimization is most effective in d <= 20. For higher dimensions, use random embeddings (REMBO) or additive structure.
Multi-Armed Bandits
Sequential decision-making under uncertainty: at each round, select an arm and observe a reward.
Regret Framework
Regret after T rounds: R_T = T * mu* - sum_{t=1}^T mu_{a_t}, where mu* is the best arm's mean.
Algorithms
UCB1 (Upper Confidence Bound):
a_t = argmax_a { mu_hat_a + sqrt(2 ln(t) / N_a(t)) }
where mu_hat_a is the empirical mean and N_a(t) is the pull count. Achieves O(sqrt(KT ln T)) regret for K arms.
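The index above is easy to simulate. A sketch on hypothetical Bernoulli arms: the exploration bonus shrinks as an arm accumulates pulls, so play concentrates on the best arm.

```python
# UCB1 on hypothetical Bernoulli arms.
import math
import random

random.seed(0)
means = [0.2, 0.5, 0.8]                    # hypothetical arm means (arm 2 is best)
K, T = len(means), 5_000
counts, sums = [0] * K, [0.0] * K

for t in range(1, T + 1):
    if t <= K:
        a = t - 1                          # pull each arm once to initialize
    else:
        a = max(range(K), key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]))  # UCB1 index
    r = float(random.random() < means[a])  # Bernoulli reward
    counts[a] += 1
    sums[a] += r

print(counts)  # the best arm (index 2) dominates the pull counts
```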
Thompson Sampling: maintain a posterior on each arm's reward distribution. Sample from each posterior, play the arm with the highest sample. Achieves O(sqrt(KT)) regret (optimal up to constants). Empirically often outperforms UCB.
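For Bernoulli rewards the posterior update is conjugate. A sketch with Beta(1, 1) priors on hypothetical arms: sample a mean from each Beta posterior, play the argmax, and increment the success/failure counts.

```python
# Thompson sampling for Bernoulli arms with Beta(1, 1) priors.
import random

random.seed(0)
means = [0.2, 0.5, 0.8]                  # hypothetical arm means (arm 2 is best)
K, T = len(means), 5_000
alpha, beta = [1] * K, [1] * K           # Beta posterior parameters per arm

pulls = [0] * K
for _ in range(T):
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(K)]
    a = samples.index(max(samples))      # play the arm with the highest sample
    r = random.random() < means[a]       # Bernoulli reward
    alpha[a] += r                        # success -> alpha
    beta[a] += 1 - r                     # failure -> beta
    pulls[a] += 1

print(pulls)  # play concentrates on the best arm (index 2)
```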
EXP3: for adversarial bandits (no stochastic assumption). Achieves O(sqrt(KT ln K)) regret.
Extensions
- Contextual bandits: arm rewards depend on observed context. LinUCB, contextual Thompson sampling.
- Combinatorial bandits: select subsets of arms (e.g., routes, assortments).
- Best-arm identification: minimize sample complexity to identify the best arm with confidence delta.
- Bayesian regret: integrate over a prior on arm parameters; Thompson sampling attains near-optimal O(sqrt(KT)) Bayesian regret.
Connections
- Convex optimization as the deterministic foundation (→ 03-convex-optimization.md).
- SGD applied to nonlinear problems (→ 06-nonlinear-programming.md).
- Combinatorial problems under uncertainty (→ 07-combinatorial-optimization.md).
- Solver support for stochastic programs (→ 09-optimization-solvers.md).