Geometric Vision

Camera Calibration

Intrinsic Parameters

The camera intrinsic matrix K encodes internal geometry:

K = [f_x  s   c_x]
    [ 0  f_y  c_y]
    [ 0   0    1 ]

f_x, f_y: focal lengths in pixel units
(c_x, c_y): principal point (usually near image center)
s: skew coefficient (usually 0 for modern cameras)

Plus distortion coefficients: (k1, k2, k3, p1, p2) for radial and tangential distortion.

Zhang's Method

Calibration using a planar checkerboard pattern observed in multiple (>= 3) orientations.

Key insight: for a planar target (Z=0), the projection simplifies to a homography:

s * [u, v, 1]^T = K * [r1 | r2 | t] * [X, Y, 1]^T = H * [X, Y, 1]^T

Algorithm:

Detect checkerboard corners in N images (sub-pixel refinement)
Estimate homography H_i for each image via DLT
From H = K[r1|r2|t], derive constraints on K: h1^T * K^{-T} * K^{-1} * h2 = 0 and h1^T * K^{-T} * K^{-1} * h1 = h2^T * K^{-T} * K^{-1} * h2
Each image gives 2 constraints on B = K^{-T} * K^{-1} (symmetric, 6 unknowns)
With >= 3 images, solve for B via linear system, then extract K via Cholesky decomposition
Recover extrinsics [R_i | t_i] for each image
Refine all parameters jointly via Levenberg-Marquardt minimizing reprojection error

Reprojection Error

The standard metric for calibration quality:

E = sum_{i,j} ||p_{ij} - pi(K, R_i, t_i, d, P_j)||^2

where pi is the projection function including distortion. Good calibration: < 0.5 pixel RMS error.

Homography

A homography H is a 3x3 projective transformation mapping points between two planes (or two views of a planar scene):

s * [u', v', 1]^T = H * [u, v, 1]^T

8 degrees of freedom (3x3 matrix up to scale).

Direct Linear Transform (DLT)

Given N >= 4 point correspondences (x_i, x_i'), each pair provides 2 equations:

[-x_i^T  0^T    u_i'*x_i^T] [h1]
[0^T    -x_i^T  v_i'*x_i^T] [h2] = 0
                              [h3]

Stack all equations into A * h = 0, solve via SVD (h = last column of V).

Normalization (Hartley): translate points to origin, scale to average distance sqrt(2). Critical for numerical stability.

RANSAC (Random Sample Consensus)

Robust estimation in the presence of outliers:

Randomly sample minimum subset (4 points for homography)
Fit model (DLT)
Count inliers (points with reprojection error < threshold)
Repeat for N iterations, keep best model
Re-estimate using all inliers

Number of iterations: N = log(1 - p) / log(1 - w^s) where p is success probability, w is inlier ratio, s is sample size.

Variants: PROSAC (progressive sampling by match quality), MAGSAC (marginalizing over threshold), LO-RANSAC (local optimization of best model).

Applications of Homography

Image stitching: warp images to a common plane
Augmented reality: overlay virtual content on planar surfaces
Document scanning: rectify perspective distortion
Sport analytics: map field coordinates to overhead view

Epipolar Geometry

Relates two views of a non-planar 3D scene without knowing the scene structure.

Epipolar Constraint

For corresponding points x and x' in two views:

x'^T * F * x = 0

where F is the fundamental matrix (3x3, rank 2, 7 DoF).

Geometric Interpretation

Epipole: projection of one camera center into the other image
Epipolar line: l' = F * x -- the line in image 2 where the match of point x must lie
Epipolar plane: plane through the two camera centers and the 3D point

This constraint reduces matching from a 2D search to a 1D search along epipolar lines.

Fundamental Matrix

Relates pixel coordinates (no calibration needed):

x'^T * F * x = 0,  rank(F) = 2

Essential Matrix

For calibrated cameras (x_hat = K^{-1} * x):

x_hat'^T * E * x_hat' = 0

E = [t]_x * R where [t]_x is the skew-symmetric matrix of translation. Has 5 DoF (3 rotation + 2 translation direction). Relation: F = K'^{-T} * E * K^{-1}.

Eight-Point Algorithm

Estimate F from >= 8 correspondences:

Normalize points (Hartley normalization)
Set up linear system: each correspondence gives one equation x'^T F x = 0, stack into A * f = 0
Solve via SVD: f = V[:, -1], reshape to 3x3
Enforce rank-2 constraint: SVD of F, set smallest singular value to 0
Denormalize: F = T'^T * F_norm * T

Seven-point algorithm: minimum solver, gives up to 3 solutions (uses det(F) = 0 cubic constraint). Used inside RANSAC.

Five-point algorithm (Nister): for calibrated cameras estimating E. Minimum solver for RANSAC; yields up to 10 solutions.

Stereo Vision

Rectification

Transform two views so that epipolar lines are horizontal scan lines. After rectification, correspondence search is 1D along each row.

Stereo Matching

Estimate disparity d = x_L - x_R for each pixel:

Z = f * B / d

where B is the baseline (distance between cameras) and f is focal length.

Methods:

Block matching: compare patches along scan line (fast, noisy)
Semi-global matching (SGM): optimize cost along multiple directions with smoothness penalty
Graph cuts / belief propagation: global optimization (slow, accurate)
Learned stereo: RAFT-Stereo, AANet -- iterative refinement with correlation volumes

Disparity Map Quality

Occlusions: left-right consistency check
Textureless regions: aggregation over larger windows or global methods
Sub-pixel refinement: parabola fitting around cost minimum

Structure from Motion (SfM)

Recover 3D structure and camera poses from unordered images.

Incremental SfM Pipeline

Feature extraction: SIFT/SuperPoint on all images
Feature matching: pairwise matching with geometric verification
Initialization: select two-view pair, estimate E, triangulate initial points
Iterative expansion: a. Find next image with most 2D-3D correspondences b. Estimate pose via PnP + RANSAC c. Triangulate new 3D points d. Bundle adjustment after each addition
Final global bundle adjustment

Triangulation

Given two camera matrices P, P' and corresponding points x, x', find 3D point X:

x cross (P * X) = 0  and  x' cross (P' * X) = 0

Linear triangulation: stack into A * X = 0, solve via SVD. Refine with midpoint or optimal method.

Key Systems

COLMAP: state-of-the-art open-source SfM + MVS pipeline
OpenSfM: lightweight Python-based alternative
Visual SfM: GUI-based with GPU acceleration

Bundle Adjustment

Joint nonlinear optimization of all camera parameters and 3D points minimizing total reprojection error:

min_{R_i, t_i, X_j} sum_{i,j} rho(||x_{ij} - pi(K_i, R_i, t_i, X_j)||^2)

where rho is a robust loss function (Huber, Cauchy) to handle outliers.

Structure: the Jacobian is sparse because each observation depends on only one camera and one point. The Schur complement trick exploits this:

Partition variables into cameras c and points p
Eliminate points analytically: (J_c^T J_c - J_c^T J_p (J_p^T J_p)^{-1} J_p^T J_c) * delta_c = ...
J_p^T J_p is block-diagonal (each point independent) -- trivially invertible
Solve reduced camera system, back-substitute for points

Solvers: Ceres (Google), g2o, GTSAM. Use Levenberg-Marquardt with sparse Cholesky factorization.

Gauge Ambiguity

SfM recovers structure up to a similarity transformation (7 DoF: 3 rotation + 3 translation + 1 scale). Absolute scale requires known distances or inertial measurements.

Practical Considerations

Always normalize coordinates before computing F or H (Hartley normalization)
RANSAC threshold: typically 1-3 pixels for homography, 0.5-1 pixel for F
Five-point algorithm preferred over eight-point inside RANSAC (fewer outliers needed)
Bundle adjustment is the single most important step for accuracy
Degenerate configurations: planar scenes cause F estimation to fail (use H instead)
Lens distortion must be corrected before geometric computations

Key Takeaways

Concept	Core Idea
Zhang's calibration	Planar pattern + homographies to recover K and distortion
Homography	8-DoF projective warp between planes; estimated via DLT + RANSAC
Fundamental matrix	Encodes epipolar geometry; `x'^T F x = 0`
Essential matrix	Calibrated version of F; decomposes into R and t
Stereo vision	Depth from disparity: `Z = fB/d`
Bundle adjustment	Joint optimization of cameras + points; exploits Jacobian sparsity