
Multivariable Calculus

Multivariable calculus extends differentiation and integration to functions of multiple variables. It is essential for optimization (machine learning), physics simulations, and probability theory.

Partial Derivatives

For f(x₁, x₂, ..., xₙ), the partial derivative with respect to xᵢ:

∂f/∂xᵢ = lim_{h→0} (f(x₁,...,xᵢ+h,...,xₙ) - f(x₁,...,xᵢ,...,xₙ)) / h

Differentiate with respect to xᵢ while holding all other variables constant.

Example: f(x,y) = x²y + sin(y).

∂f/∂x = 2xy
∂f/∂y = x² + cos(y)
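A quick numeric sanity check of these partials, sketched with NumPy central differences (the function and point are just the example above):

```python
import numpy as np

# f(x, y) = x^2*y + sin(y), the example above
def f(x, y):
    return x**2 * y + np.sin(y)

def partial(f, point, i, h=1e-6):
    """Central-difference partial derivative with respect to argument i,
    holding the other argument fixed."""
    p_plus, p_minus = list(point), list(point)
    p_plus[i] += h
    p_minus[i] -= h
    return (f(*p_plus) - f(*p_minus)) / (2 * h)

x, y = 1.5, 0.7
fx = partial(f, (x, y), 0)   # should match 2xy
fy = partial(f, (x, y), 1)   # should match x^2 + cos(y)
assert abs(fx - 2 * x * y) < 1e-6
assert abs(fy - (x**2 + np.cos(y))) < 1e-6
```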

Higher-order partial derivatives:

∂²f/∂x² = f_xx,  ∂²f/∂y² = f_yy,  ∂²f/∂x∂y = f_xy

Clairaut's theorem: If mixed partials are continuous, order doesn't matter: f_xy = f_yx.
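Clairaut's theorem can be checked numerically with a second-order mixed difference; a sketch for the same example f(x,y) = x²y + sin(y), whose mixed partial is 2x:

```python
import numpy as np

def f(x, y):
    return x**2 * y + np.sin(y)

h = 1e-4
x, y = 1.3, 0.4
# Symmetric mixed difference: estimates f_xy (and equally f_yx)
f_xy = (f(x+h, y+h) - f(x+h, y-h)
        - f(x-h, y+h) + f(x-h, y-h)) / (4 * h**2)
# Analytic mixed partial: d/dy (2xy) = 2x
assert abs(f_xy - 2 * x) < 1e-4
```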

Gradient

The gradient of f: ℝⁿ → ℝ is the vector of all partial derivatives:

∇f = (∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ)

Key properties:

  • ∇f points in the direction of steepest ascent.
  • -∇f points in the direction of steepest descent (used in gradient descent).
  • ‖∇f‖ gives the rate of change in the steepest direction.
  • ∇f is perpendicular to level curves/surfaces of f.
  • At an interior local minimum or maximum of a differentiable f: ∇f = 0 (critical point).

Example: f(x,y) = x² + y². ∇f = (2x, 2y). At (1, 1), the gradient points away from the origin — steepest ascent is radially outward.
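A minimal check of this example in NumPy, confirming the gradient at (1, 1) lines up with the outward radial direction:

```python
import numpy as np

def grad_f(x, y):
    # Analytic gradient of f(x, y) = x^2 + y^2
    return np.array([2.0 * x, 2.0 * y])

g = grad_f(1.0, 1.0)                         # (2, 2)
radial = np.array([1.0, 1.0]) / np.sqrt(2)   # outward unit vector at (1, 1)
# Gradient direction coincides with the radial (steepest ascent) direction:
assert np.allclose(g / np.linalg.norm(g), radial)
```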

Directional Derivatives

The rate of change of f in the direction of unit vector u:

D_u f = ∇f · u = ‖∇f‖ cos θ

where θ is the angle between ∇f and u.

  • Maximum when u is parallel to ∇f (steepest ascent).
  • Zero when u is perpendicular to ∇f (along a level curve).
  • Minimum when u is anti-parallel to ∇f (steepest descent).
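The three cases above can be verified numerically; a sketch reusing ∇f = (2, 2) for f = x² + y² at (1, 1):

```python
import numpy as np

grad = np.array([2.0, 2.0])   # ∇f at (1, 1) for f = x^2 + y^2
u = np.array([1.0, 0.0])      # unit direction along x

D_u = grad @ u                # directional derivative = ∇f · u
theta = np.arccos((grad @ u) / np.linalg.norm(grad))
# D_u f = ||∇f|| cos(theta):
assert np.isclose(D_u, np.linalg.norm(grad) * np.cos(theta))
# Perpendicular direction (along the level curve) gives zero:
v = np.array([1.0, -1.0]) / np.sqrt(2)
assert np.isclose(grad @ v, 0.0)
```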

Jacobian

For a vector-valued function f: ℝⁿ → ℝᵐ, the Jacobian matrix is the m × n matrix of all partial derivatives:

J = [∂f₁/∂x₁  ∂f₁/∂x₂  ...  ∂f₁/∂xₙ]
    [∂f₂/∂x₁  ∂f₂/∂x₂  ...  ∂f₂/∂xₙ]
    [   ⋮         ⋮       ⋱      ⋮    ]
    [∂fₘ/∂x₁  ∂fₘ/∂x₂  ...  ∂fₘ/∂xₙ]

Row i is the gradient of fᵢ. The Jacobian is the best linear approximation to f near a point:

f(x + Δx) ≈ f(x) + J · Δx

Jacobian determinant (for n = m): |det(J)| measures local volume distortion. Used in change of variables for integration.

In ML: The Jacobian appears in backpropagation through vector-valued layers.
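A sketch of a finite-difference Jacobian and the linear-approximation property, using an illustrative f: ℝ² → ℝ² (the function is chosen here just for the check):

```python
import numpy as np

def f(v):
    # Illustrative vector-valued map: f(x, y) = (xy, x + sin(y))
    x, y = v
    return np.array([x * y, x + np.sin(y)])

def jacobian(f, v, h=1e-6):
    """Numeric m x n Jacobian by central differences (a sketch)."""
    v = np.asarray(v, dtype=float)
    m = len(f(v))
    J = np.zeros((m, len(v)))
    for j in range(len(v)):
        e = np.zeros_like(v)
        e[j] = h
        J[:, j] = (f(v + e) - f(v - e)) / (2 * h)
    return J

v = np.array([1.0, 2.0])
J = jacobian(f, v)
# Analytic Jacobian: [[y, x], [1, cos(y)]]
assert np.allclose(J, [[2.0, 1.0], [1.0, np.cos(2.0)]], atol=1e-6)
# Best linear approximation: f(v + dv) ≈ f(v) + J dv for small dv
dv = np.array([1e-4, -2e-4])
assert np.allclose(f(v + dv), f(v) + J @ dv, atol=1e-6)
```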

Hessian

For f: ℝⁿ → ℝ, the Hessian matrix is the n × n matrix of second partial derivatives:

H = [∂²f/∂x₁²      ∂²f/∂x₁∂x₂  ...  ∂²f/∂x₁∂xₙ]
    [∂²f/∂x₂∂x₁    ∂²f/∂x₂²    ...  ∂²f/∂x₂∂xₙ]
    [    ⋮              ⋮         ⋱       ⋮       ]
    [∂²f/∂xₙ∂x₁    ∂²f/∂xₙ∂x₂  ...  ∂²f/∂xₙ²   ]

H is symmetric (by Clairaut's theorem).

Second-order Taylor approximation:

f(x + Δx) ≈ f(x) + ∇f(x)ᵀΔx + ½ ΔxᵀH(x)Δx
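This approximation can be checked directly; a sketch for f(x, y) = x²y with its analytic gradient and Hessian (the function is illustrative):

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 * y

v = np.array([1.0, 2.0])
grad = np.array([2 * v[0] * v[1], v[0]**2])   # (2xy, x^2)
H = np.array([[2 * v[1], 2 * v[0]],           # [[2y, 2x],
              [2 * v[0], 0.0]])               #  [2x,  0]]
dv = np.array([1e-2, -2e-2])
taylor = f(v) + grad @ dv + 0.5 * dv @ H @ dv
# Second-order Taylor matches f at v + dv up to third-order error:
assert np.isclose(f(v + dv), taylor, atol=1e-5)
```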

Classification of critical points (where ∇f = 0):

  • H positive definite (all eigenvalues > 0) → local minimum
  • H negative definite (all eigenvalues < 0) → local maximum
  • H indefinite (mixed signs) → saddle point
  • H singular → inconclusive (degenerate critical point)

In optimization: Newton's method uses H⁻¹∇f for the step direction. Quasi-Newton methods (BFGS) approximate H.
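On a quadratic, one Newton step lands exactly on the minimizer; a sketch for f(x) = ½ xᵀAx − bᵀx, whose gradient is Ax − b and whose Hessian is the constant matrix A (A and b below are arbitrary choices):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])

def grad(x):
    return A @ x - b                     # ∇f for the quadratic

H = A                                    # constant Hessian
x = np.array([10.0, -7.0])               # arbitrary start
x_new = x - np.linalg.solve(H, grad(x))  # Newton step: x - H^{-1} ∇f

assert np.allclose(grad(x_new), 0.0)        # critical point in one step
assert np.all(np.linalg.eigvalsh(H) > 0)    # positive definite → minimum
```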

Chain Rule (Multivariable)

If y = f(g(x)) where g: ℝⁿ → ℝᵐ and f: ℝᵐ → ℝᵖ:

J_{f∘g}(x) = J_f(g(x)) · J_g(x)     (note J_f is evaluated at g(x), not at x)

For scalar f(g₁(t), g₂(t)):

df/dt = (∂f/∂g₁)(dg₁/dt) + (∂f/∂g₂)(dg₂/dt) = ∇f · g'(t)

Backpropagation in neural networks is the chain rule applied repeatedly through the computational graph.
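The scalar chain rule df/dt = ∇f · g′(t) can be verified numerically; a sketch with g(t) = (cos t, sin t) and f(g₁, g₂) = g₁²g₂ (both chosen just for the check):

```python
import numpy as np

def g(t):
    return np.array([np.cos(t), np.sin(t)])

def f(v):
    return v[0]**2 * v[1]          # f(g1, g2) = g1^2 * g2

t, h = 0.8, 1e-6
# Numeric derivative of the composition f(g(t)):
dfdt_numeric = (f(g(t + h)) - f(g(t - h))) / (2 * h)
# Chain rule: ∇f evaluated at g(t), dotted with g'(t)
grad_f = np.array([2 * g(t)[0] * g(t)[1], g(t)[0]**2])
g_prime = np.array([-np.sin(t), np.cos(t)])
assert np.isclose(dfdt_numeric, grad_f @ g_prime, atol=1e-6)
```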

Implicit Function Theorem

If F(x, y) = 0 defines y implicitly as a function of x, and ∂F/∂y ≠ 0:

dy/dx = -(∂F/∂x) / (∂F/∂y)

Generalization: If F: ℝⁿ⁺ᵐ → ℝᵐ with F(x, y) = 0, and the m × m matrix ∂F/∂y is invertible, then y can locally be expressed as a function of x.
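A small sketch of the formula on the unit circle F(x, y) = x² + y² − 1 = 0, where the upper branch gives y(x) = √(1 − x²) explicitly for comparison:

```python
import numpy as np

x = 0.6
y = np.sqrt(1 - x**2)          # upper branch of the circle, y = 0.8
Fx, Fy = 2 * x, 2 * y          # partials of F(x, y) = x^2 + y^2 - 1
dydx_implicit = -Fx / Fy       # implicit formula: -x/y

# Differentiating y(x) = sqrt(1 - x^2) directly gives the same slope:
dydx_explicit = -x / np.sqrt(1 - x**2)
assert np.isclose(dydx_implicit, dydx_explicit)
```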

Lagrange Multipliers

Optimize f(x) subject to the constraint g(x) = 0.

At the optimum, ∇f is parallel to ∇g:

∇f = λ∇g     (where λ is the Lagrange multiplier)
g(x) = 0

Interpretation: λ represents the sensitivity of the optimal value to the constraint. If the constraint is relaxed by ε, the optimal value changes by approximately λε.

Multiple constraints: g₁(x) = 0, ..., gₖ(x) = 0:

∇f = λ₁∇g₁ + λ₂∇g₂ + ... + λₖ∇gₖ

KKT conditions (inequality constraints) generalize Lagrange multipliers and are fundamental to optimization theory.

Example: Maximize f(x,y) = xy subject to x + y = 10.

∇f = (y, x) = λ(1, 1) = λ∇g
→ y = λ, x = λ → x = y
→ x + y = 10 → x = y = 5
→ Maximum xy = 25
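The worked example can be double-checked numerically: verify the stationarity condition at (5, 5) and scan the constraint directly.

```python
import numpy as np

# Maximize xy subject to x + y = 10; candidate from the derivation: (5, 5)
x, y, lam = 5.0, 5.0, 5.0
grad_f = np.array([y, x])       # ∇f = (y, x)
grad_g = np.array([1.0, 1.0])   # ∇g for g(x, y) = x + y - 10
assert np.allclose(grad_f, lam * grad_g)   # stationarity: ∇f = λ∇g
assert x + y == 10.0                        # feasibility

# On the constraint, parametrize x = t, y = 10 - t and scan:
t = np.linspace(0, 10, 1001)
vals = t * (10 - t)
assert np.isclose(t[np.argmax(vals)], 5.0)
assert np.isclose(vals.max(), 25.0)
```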

Multiple Integrals

Double Integrals

∫∫_R f(x,y) dA

Computed as iterated integrals:

∫ₐᵇ ∫_{c(x)}^{d(x)} f(x,y) dy dx

Fubini's theorem: If f is continuous, the order of integration can be swapped.

Change of Variables

For transformation (x,y) = T(u,v):

∫∫_R f(x,y) dx dy = ∫∫_S f(T(u,v)) |det(J_T)| du dv

Polar coordinates: x = r cos θ, y = r sin θ, |J| = r.

∫∫ f(x,y) dx dy = ∫∫ f(r cos θ, r sin θ) r dr dθ

Spherical coordinates: x = ρ sin φ cos θ, y = ρ sin φ sin θ, z = ρ cos φ, |J| = ρ² sin φ.
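A sketch of the polar change of variables in action: integrate f(x, y) = x² + y² over the unit disk, where the substitution gives ∫₀^{2π} ∫₀¹ r² · r dr dθ = π/2 (a coarse Riemann sum suffices to confirm):

```python
import numpy as np

r = np.linspace(0, 1, 1001)
theta = np.linspace(0, 2 * np.pi, 1001)
R, T = np.meshgrid(r, theta)
integrand = R**2 * R           # f(r cosθ, r sinθ) · |J| = r^2 · r
dr = r[1] - r[0]
dth = theta[1] - theta[0]
integral = integrand.sum() * dr * dth
assert np.isclose(integral, np.pi / 2, atol=1e-2)
```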

Gaussian Integral

∫_{-∞}^{∞} e^{-x²} dx = √π

Proved by squaring: the square of the integral equals ∫∫ e^{-(x²+y²)} dx dy, which in polar coordinates becomes ∫₀^{2π} ∫₀^∞ e^{-r²} r dr dθ = π, so the original integral is √π.

This integral is foundational to probability (normal distribution) and physics (partition functions).
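The value √π is easy to confirm numerically; a sketch truncating to [−10, 10], where the integrand is already negligible:

```python
import numpy as np

x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]
# Riemann/trapezoid-style sum of e^{-x^2}; tails beyond |x| = 10 are ~e^{-100}
integral = np.sum(np.exp(-x**2)) * dx
assert np.isclose(integral, np.sqrt(np.pi), atol=1e-6)
```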

Applications in CS

  • Gradient descent: The fundamental optimization algorithm. Update rule: x ← x - α∇f(x).
  • Backpropagation: Chain rule through computational graphs. Jacobians propagate gradients layer by layer.
  • Newton's method: Uses the Hessian for second-order optimization: x ← x - H⁻¹∇f.
  • Probability: Joint PDFs integrate over regions. Marginal distributions require integration. Change of variables formula transforms distributions.
  • Physics simulation: Multivariable calculus governs fluid dynamics, electromagnetism, and mechanics used in game engines and scientific computing.
  • Constrained optimization: Lagrange multipliers solve constrained problems in ML (SVMs, regularized regression), economics, and engineering.
  • Computer vision: Image gradients (∂I/∂x, ∂I/∂y) detect edges. Optical flow uses the image Jacobian.
  • Robotics: Jacobians relate joint velocities to end-effector velocities. Inverse kinematics uses the Jacobian pseudoinverse.
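The gradient descent update rule from the first bullet, sketched on the convex function f(x) = ‖x‖² (step size and iteration count are illustrative):

```python
import numpy as np

def grad(x):
    return 2 * x               # ∇f for f(x) = x·x

x = np.array([3.0, -4.0])      # arbitrary start
alpha = 0.1                    # learning rate
for _ in range(100):
    x = x - alpha * grad(x)    # update rule: x ← x − α∇f(x)

# Each step scales x by (1 - 2α) = 0.8, so iterates converge to the
# minimizer at the origin:
assert np.linalg.norm(x) < 1e-6
```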