Multivariable Calculus
Multivariable calculus extends differentiation and integration to functions of multiple variables. It is essential for optimization (machine learning), physics simulations, and probability theory.
Partial Derivatives
For f(x₁, x₂, ..., xₙ), the partial derivative with respect to xᵢ:
∂f/∂xᵢ = lim_{h→0} (f(x₁,...,xᵢ+h,...,xₙ) - f(x₁,...,xᵢ,...,xₙ)) / h
Differentiate with respect to xᵢ while holding all other variables constant.
Example: f(x,y) = x²y + sin(y).
∂f/∂x = 2xy
∂f/∂y = x² + cos(y)
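The limit definition above can be checked numerically with central differences. A minimal sketch (the helper `partial` and step size `h` are illustrative choices, not part of the text):

```python
import math

def f(x, y):
    return x**2 * y + math.sin(y)

def partial(f, point, i, h=1e-6):
    """Central-difference estimate of ∂f/∂x_i at `point`."""
    p_hi = list(point); p_hi[i] += h
    p_lo = list(point); p_lo[i] -= h
    return (f(*p_hi) - f(*p_lo)) / (2 * h)

x, y = 1.5, 0.7
print(partial(f, (x, y), 0), 2 * x * y)           # ∂f/∂x ≈ 2xy
print(partial(f, (x, y), 1), x**2 + math.cos(y))  # ∂f/∂y ≈ x² + cos(y)
```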
Higher-order partial derivatives:
∂²f/∂x² = f_xx, ∂²f/∂y² = f_yy, ∂²f/∂x∂y = f_xy
Clairaut's theorem: If mixed partials are continuous, order doesn't matter: f_xy = f_yx.
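Clairaut's theorem can be verified numerically for the example above by nesting central differences in each order. A sketch (step size `h` and the evaluation point are illustrative):

```python
import math

def f(x, y):
    return x**2 * y + math.sin(y)

h = 1e-4

def f_xy(x, y):
    # ∂/∂y of ∂f/∂x, by nested central differences
    fx = lambda x, y: (f(x + h, y) - f(x - h, y)) / (2 * h)
    return (fx(x, y + h) - fx(x, y - h)) / (2 * h)

def f_yx(x, y):
    # ∂/∂x of ∂f/∂y
    fy = lambda x, y: (f(x, y + h) - f(x, y - h)) / (2 * h)
    return (fy(x + h, y) - fy(x - h, y)) / (2 * h)

print(f_xy(1.2, 0.5), f_yx(1.2, 0.5))  # both ≈ 2x = 2.4
```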
Gradient
The gradient of f: ℝⁿ → ℝ is the vector of all partial derivatives:
∇f = (∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ)
Key properties:
- ∇f points in the direction of steepest ascent.
- -∇f points in the direction of steepest descent (used in gradient descent).
- ‖∇f‖ gives the rate of change in the steepest direction.
- ∇f is perpendicular to level curves/surfaces of f.
- At an interior local minimum/maximum of a differentiable f: ∇f = 0 (critical point).
Example: f(x,y) = x² + y². ∇f = (2x, 2y). At (1, 1), the gradient points away from the origin — steepest ascent is radially outward.
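The radial-ascent claim can be made concrete: at (1, 1) the gradient (2, 2) is a positive multiple of the position vector. A sketch of that check:

```python
import math

def grad_f(x, y):
    # ∇f for f(x, y) = x² + y²
    return (2 * x, 2 * y)

gx, gy = grad_f(1.0, 1.0)
norm = math.hypot(gx, gy)
# unit direction of steepest ascent: proportional to (1, 1),
# i.e. radially outward from the origin
print((gx / norm, gy / norm))  # ≈ (0.707, 0.707)
```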
Directional Derivatives
The rate of change of f in the direction of unit vector u:
D_u f = ∇f · u = ‖∇f‖ cos θ
where θ is the angle between ∇f and u.
- Maximum when u is parallel to ∇f (steepest ascent).
- Zero when u is perpendicular to ∇f (along a level curve).
- Minimum when u is anti-parallel to ∇f (steepest descent).
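These three cases can be checked directly from the dot-product formula. A sketch reusing ∇f = (2, 2) of f = x² + y² at (1, 1) (the helper name is illustrative):

```python
import math

def directional_derivative(grad, u):
    # D_u f = ∇f · u for a unit vector u
    return sum(g * ui for g, ui in zip(grad, u))

grad = (2.0, 2.0)
norm = math.hypot(*grad)
u_up = (grad[0] / norm, grad[1] / norm)  # parallel to ∇f
u_perp = (-u_up[1], u_up[0])             # perpendicular to ∇f

print(directional_derivative(grad, u_up), norm)  # maximum = ‖∇f‖
print(directional_derivative(grad, u_perp))      # zero along the level curve
```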
Jacobian
For a vector-valued function f: ℝⁿ → ℝᵐ, the Jacobian matrix is the m × n matrix of all partial derivatives:
J = [∂f₁/∂x₁  ∂f₁/∂x₂  ...  ∂f₁/∂xₙ]
    [∂f₂/∂x₁  ∂f₂/∂x₂  ...  ∂f₂/∂xₙ]
    [   ⋮        ⋮      ⋱      ⋮   ]
    [∂fₘ/∂x₁  ∂fₘ/∂x₂  ...  ∂fₘ/∂xₙ]
Row i is the gradient of fᵢ. The Jacobian is the best linear approximation to f near a point:
f(x + Δx) ≈ f(x) + J · Δx
Jacobian determinant (for n = m): |det(J)| measures local volume distortion. Used in change of variables for integration.
In ML: The Jacobian appears in backpropagation through vector-valued layers.
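A Jacobian can be estimated one column at a time by perturbing each input. A minimal sketch using a toy map f(x, y) = (xy, x + y), whose analytic Jacobian is [[y, x], [1, 1]] (both the map and the helper name are illustrative):

```python
def jacobian(f, x, h=1e-6):
    """Estimate the m × n Jacobian of f: ℝⁿ → ℝᵐ by central differences."""
    m, n = len(f(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = f(xp), f(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

f = lambda v: [v[0] * v[1], v[0] + v[1]]
print(jacobian(f, [2.0, 3.0]))  # ≈ [[3, 2], [1, 1]]
```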
Hessian
For f: ℝⁿ → ℝ, the Hessian matrix is the n × n matrix of second partial derivatives:
H = [∂²f/∂x₁²    ∂²f/∂x₁∂x₂  ...  ∂²f/∂x₁∂xₙ]
    [∂²f/∂x₂∂x₁  ∂²f/∂x₂²    ...  ∂²f/∂x₂∂xₙ]
    [    ⋮            ⋮       ⋱        ⋮     ]
    [∂²f/∂xₙ∂x₁  ∂²f/∂xₙ∂x₂  ...  ∂²f/∂xₙ²  ]
H is symmetric (by Clairaut's theorem).
Second-order Taylor approximation:
f(x + Δx) ≈ f(x) + ∇f(x)ᵀΔx + ½ ΔxᵀH(x)Δx
Classification of critical points (where ∇f = 0):
- H positive definite (all eigenvalues > 0) → local minimum
- H negative definite (all eigenvalues < 0) → local maximum
- H indefinite (mixed signs) → saddle point
- H singular → inconclusive (degenerate critical point)
In optimization: Newton's method uses H⁻¹∇f for the step direction. Quasi-Newton methods (BFGS) approximate H.
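Newton's method can be sketched in two dimensions with a hand-rolled 2×2 solve of HΔx = ∇f. The quadratic below is a toy example (for quadratics, one Newton step lands exactly on the minimum, since the second-order Taylor expansion is the function itself):

```python
def newton_minimize(grad, hess, x0, iters=10):
    """Newton's method: x ← x − H⁻¹∇f, with an explicit 2×2 inverse."""
    x, y = x0
    for _ in range(iters):
        gx, gy = grad(x, y)
        a, b, c, d = hess(x, y)        # H = [[a, b], [c, d]]
        det = a * d - b * c
        dx = (d * gx - b * gy) / det   # components of H⁻¹ ∇f
        dy = (-c * gx + a * gy) / det
        x, y = x - dx, y - dy
    return x, y

# toy quadratic f(x, y) = (x − 1)² + 2(y + 2)², minimum at (1, −2)
grad = lambda x, y: (2 * (x - 1), 4 * (y + 2))
hess = lambda x, y: (2, 0, 0, 4)
print(newton_minimize(grad, hess, (5.0, 5.0)))  # → (1.0, -2.0)
```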
Chain Rule (Multivariable)
If y = f(g(x)) where g: ℝⁿ → ℝᵐ and f: ℝᵐ → ℝᵖ, the Jacobians compose by matrix multiplication, with each factor evaluated at the right point:
J_{f∘g}(x) = J_f(g(x)) · J_g(x)
For scalar f(g₁(t), g₂(t)):
df/dt = (∂f/∂g₁)(dg₁/dt) + (∂f/∂g₂)(dg₂/dt) = ∇f · g'(t)
Backpropagation in neural networks is the chain rule applied repeatedly through the computational graph.
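The scalar chain-rule formula can be verified against a direct numerical derivative. A sketch using the toy choices f(g₁, g₂) = g₁g₂ and g(t) = (t², sin t), so df/dt = sin(t)·2t + t²·cos(t):

```python
import math

def composite(t):
    # f(g₁(t), g₂(t)) = g₁ · g₂ along the path g(t) = (t², sin t)
    return t**2 * math.sin(t)

t = 1.3
# chain rule: df/dt = (∂f/∂g₁) g₁'(t) + (∂f/∂g₂) g₂'(t)
chain = math.sin(t) * 2 * t + t**2 * math.cos(t)
h = 1e-6
numeric = (composite(t + h) - composite(t - h)) / (2 * h)
print(chain, numeric)  # the two values agree
```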
Implicit Function Theorem
If F(x, y) = 0 defines y implicitly as a function of x, and ∂F/∂y ≠ 0:
dy/dx = -(∂F/∂x) / (∂F/∂y)
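A quick sanity check of the formula on the unit circle F(x, y) = x² + y² − 1 = 0, where the implicit result −x/y can be compared against differentiating the explicit upper branch y = √(1 − x²):

```python
import math

x = 0.6
y = math.sqrt(1 - x**2)               # upper branch of the circle
implicit = -(2 * x) / (2 * y)         # −(∂F/∂x)/(∂F/∂y) = −x/y
explicit = -x / math.sqrt(1 - x**2)   # d/dx of √(1 − x²)
print(implicit, explicit)             # both ≈ −0.75
```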
Generalization: If F: ℝⁿ⁺ᵐ → ℝᵐ with F(x, y) = 0, and the m × m matrix ∂F/∂y is invertible, then y can locally be expressed as a function of x.
Lagrange Multipliers
Optimize f(x) subject to the constraint g(x) = 0.
At the optimum, ∇f is parallel to ∇g:
∇f = λ∇g (where λ is the Lagrange multiplier)
g(x) = 0
Interpretation: λ represents the sensitivity of the optimal value to the constraint. If the constraint is relaxed by ε, the optimal value changes by approximately λε.
Multiple constraints: g₁(x) = 0, ..., gₖ(x) = 0:
∇f = λ₁∇g₁ + λ₂∇g₂ + ... + λₖ∇gₖ
KKT conditions (inequality constraints) generalize Lagrange multipliers and are fundamental to optimization theory.
Example: Maximize f(x,y) = xy subject to x + y = 10.
∇f = (y, x) = λ(1, 1) = λ∇g
→ y = λ, x = λ → x = y
→ x + y = 10 → x = y = 5
→ Maximum xy = 25
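The worked example can be confirmed by brute force: substitute the constraint y = 10 − x and scan x over a grid (the grid resolution is an arbitrary choice):

```python
# maximize xy on x + y = 10: substitute y = 10 − x, scan h(x) = x(10 − x)
candidates = [x / 100 for x in range(0, 1001)]  # x ∈ [0, 10], step 0.01
best_x = max(candidates, key=lambda x: x * (10 - x))
print(best_x, best_x * (10 - best_x))  # → 5.0 25.0
```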
Multiple Integrals
Double Integrals
∫∫_R f(x,y) dA
Computed as iterated integrals:
∫ₐᵇ ∫_{c(x)}^{d(x)} f(x,y) dy dx
Fubini's theorem: If f is continuous, the order of integration can be swapped.
Change of Variables
For transformation (x,y) = T(u,v):
∫∫_R f(x,y) dx dy = ∫∫_S f(T(u,v)) |det(J_T)| du dv
Polar coordinates: x = r cos θ, y = r sin θ, |J| = r.
∫∫ f(x,y) dx dy = ∫∫ f(r cos θ, r sin θ) r dr dθ
Spherical coordinates: x = ρ sin φ cos θ, y = ρ sin φ sin θ, z = ρ cos φ, |J| = ρ² sin φ.
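The polar Jacobian factor r can be seen at work in a numeric computation of the unit disk's area, which should come out to π. A sketch (the grid size is an arbitrary choice; the θ integral is done exactly since the integrand does not depend on θ):

```python
import math

# area of the unit disk in polar coordinates: ∫₀^{2π} ∫₀^1 r dr dθ = π
n = 400
dr = 1.0 / n
# integrand is the Jacobian r itself; midpoint rule in r
area = 2 * math.pi * sum((i + 0.5) * dr * dr for i in range(n))
print(area, math.pi)  # midpoint rule is exact for the linear integrand r
```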
Gaussian Integral
∫_{-∞}^{∞} e^{-x²} dx = √π
Proof sketch: squaring the integral gives the double integral ∫∫ e^{-(x²+y²)} dx dy over the plane; converting to polar coordinates turns it into ∫₀^{2π} ∫₀^∞ e^{-r²} r dr dθ = π, and taking the square root gives √π.
This integral is foundational to probability (normal distribution) and physics (partition functions).
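The value √π can be checked numerically by truncating the integral at a finite range (the cutoff and grid size are arbitrary; the tail beyond |x| = 6 contributes less than e^{−36}):

```python
import math

# midpoint-rule check of ∫_{-∞}^{∞} e^{−x²} dx ≈ √π, truncated at |x| ≤ 6
n, a = 4000, 6.0
h = 2 * a / n
total = sum(math.exp(-((-a + (i + 0.5) * h) ** 2)) * h for i in range(n))
print(total, math.sqrt(math.pi))  # the two values agree closely
```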
Applications in CS
- Gradient descent: The fundamental optimization algorithm. Update rule: x ← x - α∇f(x).
- Backpropagation: Chain rule through computational graphs. Jacobians propagate gradients layer by layer.
- Newton's method: Uses the Hessian for second-order optimization: x ← x - H⁻¹∇f.
- Probability: Joint PDFs integrate over regions. Marginal distributions require integration. Change of variables formula transforms distributions.
- Physics simulation: Multivariable calculus governs fluid dynamics, electromagnetism, and mechanics used in game engines and scientific computing.
- Constrained optimization: Lagrange multipliers solve constrained problems in ML (SVMs, regularized regression), economics, and engineering.
- Computer vision: Image gradients (∂I/∂x, ∂I/∂y) detect edges. Optical flow uses the image Jacobian.
- Robotics: Jacobians relate joint velocities to end-effector velocities. Inverse kinematics uses the Jacobian pseudoinverse.
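The gradient descent update rule from the list above can be sketched in a few lines (the quadratic objective, step size, and iteration count are illustrative choices):

```python
def gradient_descent(grad, x0, alpha=0.1, iters=100):
    """Repeated update x ← x − α∇f(x)."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

# minimize f(x, y) = x² + y²: gradient (2x, 2y), minimum at the origin
grad = lambda v: [2 * v[0], 2 * v[1]]
print(gradient_descent(grad, [3.0, -4.0]))  # ≈ [0.0, 0.0]
```

Each step scales the iterate by (1 − 2α), so with α = 0.1 the distance to the origin shrinks geometrically.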