Numerical Linear Algebra

Numerical linear algebra solves linear systems, eigenvalue problems, and least squares — the computational core of scientific computing, ML, and data analysis.

Direct Methods for Ax = b

LU with Pivoting

LU factorization itself is covered in the linear algebra topic; the key numerical points:

Partial pivoting: At each step, swap the row with the largest absolute value in the current column to the pivot position. Prevents division by small numbers.

Growth factor: With partial pivoting, the growth factor g ≤ 2^(n-1) in theory, but in practice g = O(1). Without pivoting, g can be enormous → catastrophic instability.

Complete pivoting: Swap both rows and columns. Growth factor ≤ n^(1/2) · Πₖ₌₂ⁿ k^(1/(2(k-1))). Rarely needed in practice.

Cost: ≈ 2n³/3 flops.
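A minimal sketch of an LU solve with partial pivoting, using SciPy's factor/solve pair (the matrix and right-hand side are illustrative; the tiny pivot in the first row is exactly the case where pivoting matters):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# A tiny leading entry would wreck unpivoted LU; lu_factor pivots rows.
A = np.array([[1e-10, 1.0, 0.0],
              [1.0,   2.0, 1.0],
              [0.0,   1.0, 3.0]])
b = np.array([1.0, 4.0, 4.0])

lu, piv = lu_factor(A)       # PA = LU; piv records the row swaps
x = lu_solve((lu, piv), b)   # forward + back substitution

assert np.allclose(A @ x, b)
```

Factoring once and reusing `(lu, piv)` amortizes the O(n³) cost over many right-hand sides, each solved in O(n²).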

Cholesky for SPD Systems

For symmetric positive definite A:

A = LLᵀ

Cost: ≈ n³/3 flops — half of LU. No pivoting needed (stable for any SPD matrix).

When to use: Covariance matrices, kernel matrices, normal equations (AᵀA), any SPD system.

QR for Least Squares

A = QR (Householder). Solve Rx = Qᵀb by back-substitution.

More stable than normal equations (AᵀA can square the condition number: κ(AᵀA) = κ(A)²).

Cost: ≈ 2mn² − 2n³/3 flops for an m × n matrix (m ≥ n).
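The QR route to least squares, sketched with NumPy's reduced QR on a random overdetermined system (dimensions are arbitrary for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))   # m = 20, n = 3, full column rank
b = rng.standard_normal(20)

Q, R = np.linalg.qr(A)             # reduced QR: Q is 20x3, R is 3x3
x = np.linalg.solve(R, Q.T @ b)    # solve Rx = Q^T b

# Agrees with the library least-squares solver
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
```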

Iterative Methods for Ax = b

For large sparse systems, direct methods are too expensive (O(n³)). Iterative methods exploit sparsity.

Jacobi Method

Split A = D + (L + U), where D is the diagonal and L, U are the strictly lower and upper triangular parts:

x^(k+1) = D⁻¹(b - (L + U)x^(k))

Component-wise: xᵢ^(k+1) = (bᵢ - Σⱼ≠ᵢ aᵢⱼxⱼ^(k)) / aᵢᵢ.

Convergence: Converges if A is strictly diagonally dominant (|aᵢᵢ| > Σⱼ≠ᵢ |aᵢⱼ|).
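The component-wise update above translates directly into code; a minimal sketch on a strictly diagonally dominant test matrix (values chosen only to satisfy the convergence condition):

```python
import numpy as np

def jacobi(A, b, tol=1e-10, max_iter=500):
    """Jacobi iteration: x_{k+1} = D^{-1}(b - (L+U) x_k)."""
    d = np.diag(A)
    R = A - np.diag(d)               # off-diagonal part L + U
    x = np.zeros_like(b)
    for _ in range(max_iter):
        x_new = (b - R @ x) / d
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Strictly diagonally dominant, so Jacobi converges
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 5.0, 2.0],
              [0.0, 2.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
x = jacobi(A, b)
assert np.allclose(A @ x, b, atol=1e-8)
```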

Gauss-Seidel Method

Like Jacobi, but uses updated values immediately:

xᵢ^(k+1) = (bᵢ - Σⱼ<ᵢ aᵢⱼxⱼ^(k+1) - Σⱼ>ᵢ aᵢⱼxⱼ^(k)) / aᵢᵢ

Typically converges faster than Jacobi. Converges for SPD matrices.
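The same test problem with Gauss-Seidel; the only change from Jacobi is that each component reads the freshly updated entries of x (a sketch, not a production solver):

```python
import numpy as np

def gauss_seidel(A, b, tol=1e-10, max_iter=500):
    """Gauss-Seidel: use updated components as soon as they are available."""
    n = len(b)
    x = np.zeros_like(b)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            s = A[i, :i] @ x[:i] + A[i, i+1:] @ x[i+1:]
            x[i] = (b[i] - s) / A[i, i]
        if np.linalg.norm(x - x_old) < tol:
            break
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 5.0, 2.0],
              [0.0, 2.0, 6.0]])   # SPD, so Gauss-Seidel converges
b = np.array([1.0, 2.0, 3.0])
x = gauss_seidel(A, b)
assert np.allclose(A @ x, b, atol=1e-8)
```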

Successive Over-Relaxation (SOR)

Accelerated Gauss-Seidel with relaxation parameter ω:

xᵢ^(k+1) = (1-ω)xᵢ^(k) + ω · GS_update

Optimal ω is problem-dependent (1 < ω < 2 for over-relaxation). Can dramatically speed convergence.
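SOR is a one-line change to Gauss-Seidel: blend the old component with the Gauss-Seidel update. A sketch with an arbitrarily chosen ω = 1.2 (the optimal ω for this matrix is not computed here):

```python
import numpy as np

def sor(A, b, omega=1.5, tol=1e-10, max_iter=500):
    """SOR: x_i <- (1-omega) x_i + omega * (Gauss-Seidel update)."""
    n = len(b)
    x = np.zeros_like(b)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            s = A[i, :i] @ x[:i] + A[i, i+1:] @ x[i+1:]
            gs = (b[i] - s) / A[i, i]
            x[i] = (1 - omega) * x[i] + omega * gs
        if np.linalg.norm(x - x_old) < tol:
            break
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 5.0, 2.0],
              [0.0, 2.0, 6.0]])   # SPD: SOR converges for 0 < omega < 2
b = np.array([1.0, 2.0, 3.0])
x = sor(A, b, omega=1.2)
assert np.allclose(A @ x, b, atol=1e-8)
```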

Conjugate Gradient (CG)

The method of choice for large sparse SPD systems.

Idea: Minimize f(x) = ½xᵀAx - bᵀx (whose gradient is Ax - b). Search along A-conjugate directions.

Algorithm:

r₀ = b - Ax₀,  p₀ = r₀
For k = 0, 1, 2, ...:
    αₖ = rₖᵀrₖ / (pₖᵀApₖ)
    xₖ₊₁ = xₖ + αₖpₖ
    rₖ₊₁ = rₖ - αₖApₖ
    if ‖rₖ₊₁‖ < tol: stop
    βₖ = rₖ₊₁ᵀrₖ₊₁ / (rₖᵀrₖ)
    pₖ₊₁ = rₖ₊₁ + βₖpₖ

Properties:

  • Converges in at most n iterations (exact arithmetic)
  • Each iteration: one matrix-vector multiply + O(n) vector operations
  • Convergence rate: ‖eₖ‖_A ≤ 2((√κ-1)/(√κ+1))^k ‖e₀‖_A
  • Fast when κ(A) is small (well-conditioned)

Cost per iteration: O(nnz) where nnz = number of non-zeros in A.
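The algorithm above, transcribed line for line (the dense SPD test matrix is synthetic; in practice A would be sparse and `A @ p` would be an O(nnz) SpMV):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """CG for SPD A, following the residual/direction recurrences above."""
    n = len(b)
    max_iter = max_iter or n
    x = np.zeros(n)
    r = b - A @ x                    # r0 = b - A x0
    p = r.copy()                     # p0 = r0
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:    # ||r_{k+1}|| < tol
            break
        p = r + (rs_new / rs) * p    # beta_k = rs_new / rs
        rs = rs_new
    return x

rng = np.random.default_rng(2)
G = rng.standard_normal((50, 50))
A = G @ G.T + 50 * np.eye(50)        # well-conditioned SPD matrix
b = rng.standard_normal(50)
x = conjugate_gradient(A, b)
assert np.allclose(A @ x, b, atol=1e-6)
```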

GMRES

Generalized Minimum Residual — the standard for non-symmetric sparse systems.

Minimizes ‖Ax - b‖ over x in x₀ + K_k, where K_k = span{r₀, Ar₀, A²r₀, ..., A^(k-1)r₀} is the Krylov subspace.

Properties: Converges in at most n iterations (in exact arithmetic). Memory and work grow with each iteration, so it is typically restarted (GMRES(m)) to bound memory.
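In practice one calls a library implementation; a sketch with SciPy's `gmres` on a non-symmetric sparse tridiagonal system (the stencil values are illustrative, chosen to be strictly diagonally dominant):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import gmres

# Non-symmetric sparse tridiagonal system
n = 100
A = diags([-1.0, 3.0, -0.5], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

x, info = gmres(A, b)    # info == 0 means converged
assert info == 0
assert np.linalg.norm(A @ x - b) < 1e-4 * np.linalg.norm(b)
```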

BiCGSTAB

Bi-Conjugate Gradient Stabilized. For non-symmetric systems. More stable than plain BiCG.

Preconditioning

Transform Ax = b into M⁻¹Ax = M⁻¹b where M⁻¹A has a smaller condition number.

A good preconditioner has M ≈ A while M⁻¹ remains cheap to apply.

Common preconditioners:

  • Jacobi (diagonal): M = diag(A). Simplest.
  • SSOR: Symmetric SOR as preconditioner.
  • Incomplete LU (ILU): LU factorization keeping only the sparsity pattern of A.
  • Incomplete Cholesky (IC): For SPD systems.
  • Algebraic multigrid (AMG): Automatic multilevel preconditioner. Very effective.

Preconditioning can reduce iterations from thousands to tens.
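A sketch of ILU preconditioning with SciPy: `spilu` computes the incomplete factorization and a `LinearOperator` wraps its solve as the action of M⁻¹ (the test matrix is an arbitrary diagonally dominant tridiagonal):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import gmres, spilu, LinearOperator

n = 200
A = diags([-1.0, 2.4, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

ilu = spilu(A)                             # incomplete LU of A
M = LinearOperator(A.shape, ilu.solve)     # apply M^{-1} via the ILU solve

x, info = gmres(A, b, M=M)                 # preconditioned GMRES
assert info == 0
assert np.linalg.norm(A @ x - b) < 1e-4 * np.linalg.norm(b)
```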

Sparse Matrix Storage

| Format | Description | Best for |
|---|---|---|
| COO | (row, col, val) triples | Construction, conversion |
| CSR | Row pointers + col indices + values | Row access, SpMV |
| CSC | Col pointers + row indices + values | Column access, direct solvers |
| Diagonal | Store diagonals only | Banded matrices |
| BSR | Block sparse row | Block-structured matrices |

SpMV (Sparse Matrix-Vector multiply): O(nnz), the dominant cost in iterative methods.
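The usual workflow: build in COO (easy to assemble), convert to CSR for fast SpMV. A sketch with SciPy on a toy 3×3 matrix:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Assemble from (row, col, val) triples, then convert for fast row access
rows = np.array([0, 0, 1, 2, 2])
cols = np.array([0, 2, 1, 0, 2])
vals = np.array([4.0, 1.0, 5.0, 2.0, 3.0])
A = coo_matrix((vals, (rows, cols)), shape=(3, 3)).tocsr()

x = np.array([1.0, 1.0, 1.0])
y = A @ x                          # SpMV: O(nnz) work

assert np.allclose(y, [5.0, 5.0, 5.0])
```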

Eigenvalue Computation

Power Iteration

Find the dominant eigenvalue (largest |λ|) and its eigenvector:

v^(k+1) = Av^(k) / ‖Av^(k)‖
λ ≈ (v^(k))ᵀ Av^(k) (Rayleigh quotient)

Convergence: Linear, rate |λ₂/λ₁|. Slow if the gap is small.
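A minimal sketch of power iteration; the diagonal test matrix makes the answer known in advance (dominant eigenvalue 5):

```python
import numpy as np

def power_iteration(A, num_iter=200, seed=0):
    """Dominant eigenvalue (largest |lambda|) and its eigenvector."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    for _ in range(num_iter):
        w = A @ v
        v = w / np.linalg.norm(w)    # normalize each step
    lam = v @ A @ v                  # Rayleigh quotient (v is unit-norm)
    return lam, v

A = np.diag([5.0, 2.0, 1.0])         # eigenvalues known by construction
lam, v = power_iteration(A)
assert abs(lam - 5.0) < 1e-8
```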

Inverse Iteration

Find eigenvalue closest to a shift σ:

(A - σI) v^(k+1) = v^(k)
normalize v^(k+1)

Convergence: Linear, rate |λ_closest - σ| / |λ_second_closest - σ|. Very fast when σ ≈ λ.
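Inverse iteration in code: factor A − σI once, then reuse the factorization every step (the shift σ = 1.9 is chosen to target the known eigenvalue 2 of the toy matrix):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def inverse_iteration(A, sigma, num_iter=50, seed=0):
    """Eigenvalue of A closest to sigma, via (A - sigma I) v_new = v."""
    n = A.shape[0]
    lu, piv = lu_factor(A - sigma * np.eye(n))   # factor once, reuse each step
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    for _ in range(num_iter):
        v = lu_solve((lu, piv), v)
        v /= np.linalg.norm(v)
    return v @ A @ v, v                          # Rayleigh quotient estimate

A = np.diag([5.0, 2.0, 1.0])
lam, v = inverse_iteration(A, sigma=1.9)         # targets the eigenvalue 2.0
assert abs(lam - 2.0) < 1e-8
```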

QR Algorithm

The standard method for computing all eigenvalues.

A₀ = A
For k = 0, 1, 2, ...:
    Qₖ Rₖ = Aₖ   (QR factorization)
    Aₖ₊₁ = Rₖ Qₖ  (reverse multiply)

Aₖ converges to upper triangular (Schur form). Diagonal entries → eigenvalues.

With shifts (Wilkinson shift): Cubic convergence; each eigenvalue is deflated after O(1) iterations, so computing the whole spectrum takes only O(n) QR steps instead of the many required by the unshifted iteration.

Cost: O(n³) total (after initial reduction to Hessenberg form in O(n³)).
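The unshifted iteration above is a few lines in NumPy; this sketch omits the shifts and Hessenberg reduction that a real implementation needs, and uses a small symmetric test matrix with distinct eigenvalues:

```python
import numpy as np

def qr_algorithm(A, num_iter=500):
    """Unshifted QR iteration; diag(A_k) approaches the eigenvalues."""
    Ak = A.copy()
    for _ in range(num_iter):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q                # similarity transform: A_{k+1} = Q^T A_k Q
    return np.diag(Ak)

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])   # symmetric, distinct eigenvalues
eigs = np.sort(qr_algorithm(A))
assert np.allclose(eigs, np.sort(np.linalg.eigvalsh(A)), atol=1e-6)
```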

Lanczos Algorithm

For large sparse symmetric matrices. Finds extreme eigenvalues.

Builds an orthonormal basis for the Krylov subspace. The eigenvalues of the tridiagonal projection approximate the extreme eigenvalues of A.

Cost: O(k · nnz) for k eigenvalues. Much cheaper than QR for sparse matrices.

Numerical issue: Loss of orthogonality. Requires reorthogonalization.
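In practice one uses ARPACK through SciPy's `eigsh`, which handles reorthogonalization internally. A sketch on the 1-D Laplacian stencil, whose spectrum 2 − 2cos(jπ/(n+1)) is known in closed form:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import eigsh

# Large sparse symmetric matrix: 1-D Laplacian stencil
n = 500
A = diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")

# Lanczos-based ARPACK solver: the 3 largest-algebraic eigenvalues
vals, vecs = eigsh(A, k=3, which="LA")

# Known spectrum: 2 - 2 cos(j*pi/(n+1)), j = 1..n; largest three are j = n, n-1, n-2
exact = 2 - 2 * np.cos(np.arange(n, n - 3, -1) * np.pi / (n + 1))
assert np.allclose(np.sort(vals), np.sort(exact), atol=1e-8)
```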

Arnoldi Algorithm

Lanczos generalized to non-symmetric matrices. Builds an upper Hessenberg projection.

SVD Computation

  1. Reduce to bidiagonal form via Householder: O(mn²).
  2. Apply iterative bidiagonal SVD (Golub-Kahan, divide-and-conquer): O(n²).

For large sparse: randomized SVD — project to low rank via random matrix, then compute SVD of the small matrix. O(mn log k) for rank-k approximation.
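A minimal sketch of randomized SVD as just described: random projection, orthonormalize, SVD of the small projected matrix. The test matrix has exact rank 5, so rank-5 recovery is exact up to roundoff (the oversampling parameter is a standard heuristic):

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Rank-k SVD sketch: random projection, then SVD of a small matrix."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(A @ Omega)        # orthonormal basis for range(A @ Omega)
    B = Q.T @ A                           # small (k + p) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub                            # lift back to the original space
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(3)
A = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 80))  # exact rank 5
U, s, Vt = randomized_svd(A, k=5)
assert np.allclose((U * s) @ Vt, A, atol=1e-8)
```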

Summary: Which Method to Use

| Problem | Small/Dense | Large/Sparse |
|---|---|---|
| Ax = b (general) | LU with pivoting | GMRES + preconditioner |
| Ax = b (SPD) | Cholesky | Conjugate gradient + preconditioner |
| Ax ≈ b (least squares) | QR or SVD | LSQR, CGLS |
| All eigenvalues | QR algorithm | N/A (use subset methods) |
| Few eigenvalues (symmetric) | N/A | Lanczos |
| Few eigenvalues (non-symmetric) | N/A | Arnoldi (ARPACK) |
| SVD | Golub-Kahan | Randomized SVD |

Applications in CS

  • Machine learning: Solving linear systems in regression, kernel methods. SVD for PCA, recommendations. Sparse solvers for large-scale problems.
  • Graph algorithms: Spectral clustering uses Lanczos. PageRank uses power iteration.
  • Physics simulation: Finite element methods produce large sparse SPD systems → CG with AMG preconditioner.
  • Computer graphics: Solving Poisson equations for seamless cloning, mesh smoothing.
  • Optimization: Newton's method requires solving Hx = -g at each step. CG can solve approximately (truncated Newton).
  • Signal processing: Large Toeplitz systems from convolution → fast solvers (FFT-based).
  • Network analysis: Eigenvalues of adjacency/Laplacian matrices reveal graph properties.