Multivariable Calculus
Multivariable calculus extends differentiation and integration to functions of multiple variables. It is essential for optimization (machine learning), physics simulations, and probability theory.
Partial Derivatives
For f(x₁, x₂, ..., xₙ), the partial derivative with respect to xᵢ:
∂f/∂xᵢ = lim_{h→0} (f(x₁,...,xᵢ+h,...,xₙ) - f(x₁,...,xᵢ,...,xₙ)) / h
Differentiate with respect to xᵢ while holding all other variables constant.
Example: f(x,y) = x²y + sin(y).
∂f/∂x = 2xy
∂f/∂y = x² + cos(y)
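The limit definition above can be checked numerically with central differences. A minimal sketch (the helper `partial` and step size `h` are illustrative choices, not part of the text):

```python
import math

def f(x, y):
    return x**2 * y + math.sin(y)

def partial(f, point, i, h=1e-6):
    """Central-difference estimate of ∂f/∂x_i at `point`."""
    p_hi = list(point); p_hi[i] += h
    p_lo = list(point); p_lo[i] -= h
    return (f(*p_hi) - f(*p_lo)) / (2 * h)

x, y = 1.5, 0.7
print(partial(f, (x, y), 0), 2 * x * y)           # ∂f/∂x ≈ 2xy
print(partial(f, (x, y), 1), x**2 + math.cos(y))  # ∂f/∂y ≈ x² + cos(y)
```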
Higher-order partial derivatives:
∂²f/∂x² = f_xx, ∂²f/∂y² = f_yy, ∂²f/∂x∂y = f_xy
Clairaut's theorem: If mixed partials are continuous, order doesn't matter: f_xy = f_yx.
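Clairaut's theorem can be verified numerically for the example above by nesting central differences in each order. A sketch (step size `h` and the evaluation point are illustrative):

```python
import math

def f(x, y):
    return x**2 * y + math.sin(y)

h = 1e-4

def f_xy(x, y):
    # ∂/∂y of ∂f/∂x, by nested central differences
    fx = lambda x, y: (f(x + h, y) - f(x - h, y)) / (2 * h)
    return (fx(x, y + h) - fx(x, y - h)) / (2 * h)

def f_yx(x, y):
    # ∂/∂x of ∂f/∂y
    fy = lambda x, y: (f(x, y + h) - f(x, y - h)) / (2 * h)
    return (fy(x + h, y) - fy(x - h, y)) / (2 * h)

print(f_xy(1.2, 0.5), f_yx(1.2, 0.5))  # both ≈ 2x = 2.4
```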
Gradient
The gradient of f: ℝⁿ → ℝ is the vector of all partial derivatives:
∇f = (∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ)
Key properties:
- ∇f points in the direction of steepest ascent.
- -∇f points in the direction of steepest descent (used in gradient descent).
- ‖∇f‖ gives the rate of change in the steepest direction.
- ∇f is perpendicular to level curves/surfaces of f.
- At an interior local minimum/maximum of a differentiable f: ∇f = 0 (critical point).
Example: f(x,y) = x² + y². ∇f = (2x, 2y). At (1, 1), the gradient points away from the origin — steepest ascent is radially outward.
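The radial-ascent claim can be made concrete: at (1, 1) the gradient (2, 2) is a positive multiple of the position vector. A sketch of that check:

```python
import math

def grad_f(x, y):
    # ∇f for f(x, y) = x² + y²
    return (2 * x, 2 * y)

gx, gy = grad_f(1.0, 1.0)
norm = math.hypot(gx, gy)
# unit direction of steepest ascent: proportional to (1, 1),
# i.e. radially outward from the origin
print((gx / norm, gy / norm))  # ≈ (0.707, 0.707)
```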
Directional Derivatives
The rate of change of f in the direction of unit vector u:
D_u f = ∇f · u = ‖∇f‖ cos θ
where θ is the angle between ∇f and u.
- Maximum when u is parallel to ∇f (steepest ascent).
- Zero when u is perpendicular to ∇f (along a level curve).
- Minimum when u is anti-parallel to ∇f (steepest descent).
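These three cases can be checked directly from the dot-product formula. A sketch reusing ∇f = (2, 2) of f = x² + y² at (1, 1) (the helper name is illustrative):

```python
import math

def directional_derivative(grad, u):
    # D_u f = ∇f · u for a unit vector u
    return sum(g * ui for g, ui in zip(grad, u))

grad = (2.0, 2.0)
norm = math.hypot(*grad)
u_up = (grad[0] / norm, grad[1] / norm)  # parallel to ∇f
u_perp = (-u_up[1], u_up[0])             # perpendicular to ∇f

print(directional_derivative(grad, u_up), norm)  # maximum = ‖∇f‖
print(directional_derivative(grad, u_perp))      # zero along the level curve
```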
Jacobian
For a vector-valued function f: ℝⁿ → ℝᵐ, the Jacobian matrix is the m × n matrix of all partial derivatives:
J = [∂f₁/∂x₁  ∂f₁/∂x₂  ...  ∂f₁/∂xₙ]
    [∂f₂/∂x₁  ∂f₂/∂x₂  ...  ∂f₂/∂xₙ]
    [   ⋮        ⋮      ⋱      ⋮   ]
    [∂fₘ/∂x₁  ∂fₘ/∂x₂  ...  ∂fₘ/∂xₙ]
Row i is the gradient of fᵢ. The Jacobian is the best linear approximation to f near a point:
f(x + Δx) ≈ f(x) + J · Δx
Jacobian determinant (for n = m): |det(J)| measures local volume distortion. Used in change of variables for integration.
In ML: The Jacobian appears in backpropagation through vector-valued layers.
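A Jacobian can be estimated one column at a time by perturbing each input. A minimal sketch using a toy map f(x, y) = (xy, x + y), whose analytic Jacobian is [[y, x], [1, 1]] (both the map and the helper name are illustrative):

```python
def jacobian(f, x, h=1e-6):
    """Estimate the m × n Jacobian of f: ℝⁿ → ℝᵐ by central differences."""
    m, n = len(f(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = f(xp), f(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

f = lambda v: [v[0] * v[1], v[0] + v[1]]
print(jacobian(f, [2.0, 3.0]))  # ≈ [[3, 2], [1, 1]]
```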
Hessian
For f: ℝⁿ → ℝ, the Hessian matrix is the n × n matrix of second partial derivatives:
H = [∂²f/∂x₁²    ∂²f/∂x₁∂x₂  ...  ∂²f/∂x₁∂xₙ]
    [∂²f/∂x₂∂x₁  ∂²f/∂x₂²    ...  ∂²f/∂x₂∂xₙ]
    [    ⋮            ⋮       ⋱        ⋮     ]
    [∂²f/∂xₙ∂x₁  ∂²f/∂xₙ∂x₂  ...  ∂²f/∂xₙ²  ]
H is symmetric (by Clairaut's theorem).
Second-order Taylor approximation:
f(x + Δx) ≈ f(x) + ∇f(x)ᵀΔx + ½ ΔxᵀH(x)Δx
Classification of critical points (where ∇f = 0):
- H positive definite (all eigenvalues > 0) → local minimum
- H negative definite (all eigenvalues < 0) → local maximum
- H indefinite (mixed signs) → saddle point
- H singular → inconclusive (degenerate critical point)
In optimization: Newton's method uses H⁻¹∇f for the step direction. Quasi-Newton methods (BFGS) approximate H.
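Newton's method can be sketched in two dimensions with a hand-rolled 2×2 solve of HΔx = ∇f. The quadratic below is a toy example (for quadratics, one Newton step lands exactly on the minimum, since the second-order Taylor expansion is the function itself):

```python
def newton_minimize(grad, hess, x0, iters=10):
    """Newton's method: x ← x − H⁻¹∇f, with an explicit 2×2 inverse."""
    x, y = x0
    for _ in range(iters):
        gx, gy = grad(x, y)
        a, b, c, d = hess(x, y)        # H = [[a, b], [c, d]]
        det = a * d - b * c
        dx = (d * gx - b * gy) / det   # components of H⁻¹ ∇f
        dy = (-c * gx + a * gy) / det
        x, y = x - dx, y - dy
    return x, y

# toy quadratic f(x, y) = (x − 1)² + 2(y + 2)², minimum at (1, −2)
grad = lambda x, y: (2 * (x - 1), 4 * (y + 2))
hess = lambda x, y: (2, 0, 0, 4)
print(newton_minimize(grad, hess, (5.0, 5.0)))  # → (1.0, -2.0)
```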
Chain Rule (Multivariable)
If y = f(g(x)) where g: ℝⁿ → ℝᵐ and f: ℝᵐ → ℝᵖ, the Jacobians compose by matrix multiplication, with each factor evaluated at the right point:
J_{f∘g}(x) = J_f(g(x)) · J_g(x)
For scalar f(g₁(t), g₂(t)):
df/dt = (∂f/∂g₁)(dg₁/dt) + (∂f/∂g₂)(dg₂/dt) = ∇f · g'(t)
Backpropagation in neural networks is the chain rule applied repeatedly through the computational graph.
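The scalar chain-rule formula can be verified against a direct numerical derivative. A sketch using the toy choices f(g₁, g₂) = g₁g₂ and g(t) = (t², sin t), so df/dt = sin(t)·2t + t²·cos(t):

```python
import math

def composite(t):
    # f(g₁(t), g₂(t)) = g₁ · g₂ along the path g(t) = (t², sin t)
    return t**2 * math.sin(t)

t = 1.3
# chain rule: df/dt = (∂f/∂g₁) g₁'(t) + (∂f/∂g₂) g₂'(t)
chain = math.sin(t) * 2 * t + t**2 * math.cos(t)
h = 1e-6
numeric = (composite(t + h) - composite(t - h)) / (2 * h)
print(chain, numeric)  # the two values agree
```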
Implicit Function Theorem
If F(x, y) = 0 defines y implicitly as a function of x, and ∂F/∂y ≠ 0:
dy/dx = -(∂F/∂x) / (∂F/∂y)
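A quick sanity check of the formula on the unit circle F(x, y) = x² + y² − 1 = 0, where the implicit result −x/y can be compared against differentiating the explicit upper branch y = √(1 − x²):

```python
import math

x = 0.6
y = math.sqrt(1 - x**2)               # upper branch of the circle
implicit = -(2 * x) / (2 * y)         # −(∂F/∂x)/(∂F/∂y) = −x/y
explicit = -x / math.sqrt(1 - x**2)   # d/dx of √(1 − x²)
print(implicit, explicit)             # both ≈ −0.75
```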
Generalization: If F: ℝⁿ⁺ᵐ → ℝᵐ with F(x, y) = 0, and the m × m matrix ∂F/∂y is invertible, then y can locally be expressed as a function of x.
Lagrange Multipliers
Optimize f(x) subject to the constraint g(x) = 0.
At the optimum, ∇f is parallel to ∇g:
∇f = λ∇g (where λ is the Lagrange multiplier)
g(x) = 0
Interpretation: λ represents the sensitivity of the optimal value to the constraint. If the constraint is relaxed by ε, the optimal value changes by approximately λε.
Multiple constraints: g₁(x) = 0, ..., gₖ(x) = 0:
∇f = λ₁∇g₁ + λ₂∇g₂ + ... + λₖ∇gₖ
KKT conditions (inequality constraints) generalize Lagrange multipliers and are fundamental to optimization theory.
Example: Maximize f(x,y) = xy subject to x + y = 10.
∇f = (y, x) = λ(1, 1) = λ∇g
→ y = λ, x = λ → x = y
→ x + y = 10 → x = y = 5
→ Maximum xy = 25
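The worked example can be confirmed by brute force: substitute the constraint y = 10 − x and scan x over a grid (the grid resolution is an arbitrary choice):

```python
# maximize xy on x + y = 10: substitute y = 10 − x, scan h(x) = x(10 − x)
candidates = [x / 100 for x in range(0, 1001)]  # x ∈ [0, 10], step 0.01
best_x = max(candidates, key=lambda x: x * (10 - x))
print(best_x, best_x * (10 - best_x))  # → 5.0 25.0
```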
Multiple Integrals
Double Integrals
∫∫_R f(x,y) dA
Computed as iterated integrals:
∫ₐᵇ ∫_{c(x)}^{d(x)} f(x,y) dy dx
Fubini's theorem: If f is continuous, the order of integration can be swapped.
Change of Variables
For transformation (x,y) = T(u,v):
∫∫_R f(x,y) dx dy = ∫∫_S f(T(u,v)) |det(J_T)| du dv
Polar coordinates: x = r cos θ, y = r sin θ, |J| = r.
∫∫ f(x,y) dx dy = ∫∫ f(r cos θ, r sin θ) r dr dθ
Spherical coordinates: x = ρ sin φ cos θ, y = ρ sin φ sin θ, z = ρ cos φ, |J| = ρ² sin φ.
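The polar Jacobian factor r can be seen at work in a numeric computation of the unit disk's area, which should come out to π. A sketch (the grid size is an arbitrary choice; the θ integral is done exactly since the integrand does not depend on θ):

```python
import math

# area of the unit disk in polar coordinates: ∫₀^{2π} ∫₀^1 r dr dθ = π
n = 400
dr = 1.0 / n
# integrand is the Jacobian r itself; midpoint rule in r
area = 2 * math.pi * sum((i + 0.5) * dr * dr for i in range(n))
print(area, math.pi)  # midpoint rule is exact for the linear integrand r
```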
Gaussian Integral
∫_{-∞}^{∞} e^{-x²} dx = √π
Proof sketch: squaring the integral gives the double integral ∫∫ e^{-(x²+y²)} dx dy over the plane; converting to polar coordinates turns it into ∫₀^{2π} ∫₀^∞ e^{-r²} r dr dθ = π, and taking the square root gives √π.
This integral is foundational to probability (normal distribution) and physics (partition functions).
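The value √π can be checked numerically by truncating the integral at a finite range (the cutoff and grid size are arbitrary; the tail beyond |x| = 6 contributes less than e^{−36}):

```python
import math

# midpoint-rule check of ∫_{-∞}^{∞} e^{−x²} dx ≈ √π, truncated at |x| ≤ 6
n, a = 4000, 6.0
h = 2 * a / n
total = sum(math.exp(-((-a + (i + 0.5) * h) ** 2)) * h for i in range(n))
print(total, math.sqrt(math.pi))  # the two values agree closely
```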
Applications in CS
- Gradient descent: The fundamental optimization algorithm. Update rule: x ← x - α∇f(x).
- Backpropagation: Chain rule through computational graphs. Jacobians propagate gradients layer by layer.
- Newton's method: Uses the Hessian for second-order optimization: x ← x - H⁻¹∇f.
- Probability: Joint PDFs integrate over regions. Marginal distributions require integration. Change of variables formula transforms distributions.
- Physics simulation: Multivariable calculus governs fluid dynamics, electromagnetism, and mechanics used in game engines and scientific computing.
- Constrained optimization: Lagrange multipliers solve constrained problems in ML (SVMs, regularized regression), economics, and engineering.
- Computer vision: Image gradients (∂I/∂x, ∂I/∂y) detect edges. Optical flow uses the image Jacobian.
- Robotics: Jacobians relate joint velocities to end-effector velocities. Inverse kinematics uses the Jacobian pseudoinverse.
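The gradient descent update rule from the list above can be sketched in a few lines (the quadratic objective, step size, and iteration count are illustrative choices):

```python
def gradient_descent(grad, x0, alpha=0.1, iters=100):
    """Repeated update x ← x − α∇f(x)."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

# minimize f(x, y) = x² + y²: gradient (2x, 2y), minimum at the origin
grad = lambda v: [2 * v[0], 2 * v[1]]
print(gradient_descent(grad, [3.0, -4.0]))  # ≈ [0.0, 0.0]
```

Each step scales the iterate by (1 − 2α), so with α = 0.1 the distance to the origin shrinks geometrically.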