6 min read
On this page

Geometric Vision

Camera Calibration

Intrinsic Parameters

The camera intrinsic matrix K encodes internal geometry:

K = [f_x  s   c_x]
    [ 0  f_y  c_y]
    [ 0   0    1 ]
  • f_x, f_y: focal lengths in pixel units
  • (c_x, c_y): principal point (usually near image center)
  • s: skew coefficient (usually 0 for modern cameras)

Plus distortion coefficients: (k1, k2, k3, p1, p2) for radial and tangential distortion.

Zhang's Method

Calibration using a planar checkerboard pattern observed in multiple (>= 3) orientations.

Key insight: for a planar target (Z=0), the projection simplifies to a homography:

s * [u, v, 1]^T = K * [r1 | r2 | t] * [X, Y, 1]^T = H * [X, Y, 1]^T

Algorithm:

  1. Detect checkerboard corners in N images (sub-pixel refinement)
  2. Estimate homography H_i for each image via DLT
  3. From H = K[r1|r2|t], derive constraints on K: h1^T * K^{-T} * K^{-1} * h2 = 0 and h1^T * K^{-T} * K^{-1} * h1 = h2^T * K^{-T} * K^{-1} * h2
  4. Each image gives 2 constraints on B = K^{-T} * K^{-1} (symmetric, 6 unknowns)
  5. With >= 3 images, solve for B via linear system, then extract K via Cholesky decomposition
  6. Recover extrinsics [R_i | t_i] for each image
  7. Refine all parameters jointly via Levenberg-Marquardt minimizing reprojection error

Reprojection Error

The standard metric for calibration quality:

E = sum_{i,j} ||p_{ij} - pi(K, R_i, t_i, d, P_j)||^2

where pi is the projection function including distortion. Good calibration: < 0.5 pixel RMS error.


Homography

A homography H is a 3x3 projective transformation mapping points between two planes (or two views of a planar scene):

s * [u', v', 1]^T = H * [u, v, 1]^T

8 degrees of freedom (3x3 matrix up to scale).

Direct Linear Transform (DLT)

Given N >= 4 point correspondences (x_i, x_i'), each pair provides 2 equations:

[-x_i^T  0^T    u_i'*x_i^T] [h1]
[0^T    -x_i^T  v_i'*x_i^T] [h2] = 0
                              [h3]

Stack all equations into A * h = 0, solve via SVD (h = last column of V).

Normalization (Hartley): translate points to origin, scale to average distance sqrt(2). Critical for numerical stability.

RANSAC (Random Sample Consensus)

Robust estimation in the presence of outliers:

  1. Randomly sample minimum subset (4 points for homography)
  2. Fit model (DLT)
  3. Count inliers (points with reprojection error < threshold)
  4. Repeat for N iterations, keep best model
  5. Re-estimate using all inliers

Number of iterations: N = log(1 - p) / log(1 - w^s) where p is success probability, w is inlier ratio, s is sample size.

Variants: PROSAC (progressive sampling by match quality), MAGSAC (marginalizing over threshold), LO-RANSAC (local optimization of best model).

Applications of Homography

  • Image stitching: warp images to a common plane
  • Augmented reality: overlay virtual content on planar surfaces
  • Document scanning: rectify perspective distortion
  • Sport analytics: map field coordinates to overhead view

Epipolar Geometry

Relates two views of a non-planar 3D scene without knowing the scene structure.

Epipolar Constraint

For corresponding points x and x' in two views:

x'^T * F * x = 0

where F is the fundamental matrix (3x3, rank 2, 7 DoF).

Geometric Interpretation

  • Epipole: projection of one camera center into the other image
  • Epipolar line: l' = F * x -- the line in image 2 where the match of point x must lie
  • Epipolar plane: plane through the two camera centers and the 3D point

This constraint reduces matching from a 2D search to a 1D search along epipolar lines.

Fundamental Matrix

Relates pixel coordinates (no calibration needed):

x'^T * F * x = 0,  rank(F) = 2

Essential Matrix

For calibrated cameras (x_hat = K^{-1} * x):

x_hat'^T * E * x_hat' = 0

E = [t]_x * R where [t]_x is the skew-symmetric matrix of translation. Has 5 DoF (3 rotation + 2 translation direction). Relation: F = K'^{-T} * E * K^{-1}.

Eight-Point Algorithm

Estimate F from >= 8 correspondences:

  1. Normalize points (Hartley normalization)
  2. Set up linear system: each correspondence gives one equation x'^T F x = 0, stack into A * f = 0
  3. Solve via SVD: f = V[:, -1], reshape to 3x3
  4. Enforce rank-2 constraint: SVD of F, set smallest singular value to 0
  5. Denormalize: F = T'^T * F_norm * T

Seven-point algorithm: minimum solver, gives up to 3 solutions (uses det(F) = 0 cubic constraint). Used inside RANSAC.

Five-point algorithm (Nister): for calibrated cameras estimating E. Minimum solver for RANSAC; yields up to 10 solutions.


Stereo Vision

Rectification

Transform two views so that epipolar lines are horizontal scan lines. After rectification, correspondence search is 1D along each row.

Stereo Matching

Estimate disparity d = x_L - x_R for each pixel:

Z = f * B / d

where B is the baseline (distance between cameras) and f is focal length.

Methods:

  • Block matching: compare patches along scan line (fast, noisy)
  • Semi-global matching (SGM): optimize cost along multiple directions with smoothness penalty
  • Graph cuts / belief propagation: global optimization (slow, accurate)
  • Learned stereo: RAFT-Stereo, AANet -- iterative refinement with correlation volumes

Disparity Map Quality

  • Occlusions: left-right consistency check
  • Textureless regions: aggregation over larger windows or global methods
  • Sub-pixel refinement: parabola fitting around cost minimum

Structure from Motion (SfM)

Recover 3D structure and camera poses from unordered images.

Incremental SfM Pipeline

  1. Feature extraction: SIFT/SuperPoint on all images
  2. Feature matching: pairwise matching with geometric verification
  3. Initialization: select two-view pair, estimate E, triangulate initial points
  4. Iterative expansion: a. Find next image with most 2D-3D correspondences b. Estimate pose via PnP + RANSAC c. Triangulate new 3D points d. Bundle adjustment after each addition
  5. Final global bundle adjustment

Triangulation

Given two camera matrices P, P' and corresponding points x, x', find 3D point X:

x cross (P * X) = 0  and  x' cross (P' * X) = 0

Linear triangulation: stack into A * X = 0, solve via SVD. Refine with midpoint or optimal method.

Key Systems

  • COLMAP: state-of-the-art open-source SfM + MVS pipeline
  • OpenSfM: lightweight Python-based alternative
  • Visual SfM: GUI-based with GPU acceleration

Bundle Adjustment

Joint nonlinear optimization of all camera parameters and 3D points minimizing total reprojection error:

min_{R_i, t_i, X_j} sum_{i,j} rho(||x_{ij} - pi(K_i, R_i, t_i, X_j)||^2)

where rho is a robust loss function (Huber, Cauchy) to handle outliers.

Structure: the Jacobian is sparse because each observation depends on only one camera and one point. The Schur complement trick exploits this:

  1. Partition variables into cameras c and points p
  2. Eliminate points analytically: (J_c^T J_c - J_c^T J_p (J_p^T J_p)^{-1} J_p^T J_c) * delta_c = ...
  3. J_p^T J_p is block-diagonal (each point independent) -- trivially invertible
  4. Solve reduced camera system, back-substitute for points

Solvers: Ceres (Google), g2o, GTSAM. Use Levenberg-Marquardt with sparse Cholesky factorization.

Gauge Ambiguity

SfM recovers structure up to a similarity transformation (7 DoF: 3 rotation + 3 translation + 1 scale). Absolute scale requires known distances or inertial measurements.


Practical Considerations

  • Always normalize coordinates before computing F or H (Hartley normalization)
  • RANSAC threshold: typically 1-3 pixels for homography, 0.5-1 pixel for F
  • Five-point algorithm preferred over eight-point inside RANSAC (fewer outliers needed)
  • Bundle adjustment is the single most important step for accuracy
  • Degenerate configurations: planar scenes cause F estimation to fail (use H instead)
  • Lens distortion must be corrected before geometric computations

Key Takeaways

| Concept | Core Idea | |---------|-----------| | Zhang's calibration | Planar pattern + homographies to recover K and distortion | | Homography | 8-DoF projective warp between planes; estimated via DLT + RANSAC | | Fundamental matrix | Encodes epipolar geometry; x'^T F x = 0 | | Essential matrix | Calibrated version of F; decomposes into R and t | | Stereo vision | Depth from disparity: Z = fB/d | | Bundle adjustment | Joint optimization of cameras + points; exploits Jacobian sparsity |