
Feature Detection and Description

Overview

Features are distinctive, repeatable image locations used for matching, recognition, and 3D reconstruction. A complete feature pipeline has three stages: detection (where), description (what), and matching (correspondence).

Desirable properties: repeatability (same feature found under viewpoint/lighting changes), distinctiveness (descriptors are discriminative), and efficiency (real-time capable).


Corner Detection

Harris Corner Detector

Corners are points where shifting a window in any direction produces a large intensity change.

Auto-correlation matrix (structure tensor) at pixel (x, y):

M = sum_{(u,v) in W} w(u,v) [I_x^2    I_x*I_y]
                              [I_x*I_y  I_y^2  ]

where I_x, I_y are image gradients and w is a Gaussian weighting window.

Corner response function:

R = det(M) - k * trace(M)^2 = lambda_1 * lambda_2 - k * (lambda_1 + lambda_2)^2

where k ~ 0.04-0.06. Classification by eigenvalues of M:

  • Corner: both lambda_1, lambda_2 large (R >> 0)
  • Edge: one large, one small (R << 0)
  • Flat: both small (|R| ~ 0)

Steps: compute gradients, form M with Gaussian weighting, compute R, threshold, apply non-maximum suppression (NMS).
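The steps above can be sketched in a few lines of numpy (function names are illustrative; a real implementation would use OpenCV's cornerHarris):

```python
import numpy as np

def smooth(a, g):
    # Separable convolution with a 1-D Gaussian kernel g
    a = np.apply_along_axis(lambda v: np.convolve(v, g, mode="same"), 0, a)
    return np.apply_along_axis(lambda v: np.convolve(v, g, mode="same"), 1, a)

def harris_response(img, k=0.05, sigma=1.0):
    """Harris response R = det(M) - k * trace(M)^2 at every pixel."""
    Iy, Ix = np.gradient(img.astype(float))      # image gradients
    t = np.arange(-int(3 * sigma), int(3 * sigma) + 1)
    g = np.exp(-t**2 / (2 * sigma**2)); g /= g.sum()
    # Gaussian-weighted entries of the structure tensor M
    Sxx = smooth(Ix * Ix, g)
    Syy = smooth(Iy * Iy, g)
    Sxy = smooth(Ix * Iy, g)
    return (Sxx * Syy - Sxy**2) - k * (Sxx + Syy) ** 2

# A bright square on black: the four corners should score highest
img = np.zeros((32, 32)); img[8:24, 8:24] = 1.0
R = harris_response(img)
y, x = np.unravel_index(np.argmax(R), R.shape)   # strongest corner
```

On this synthetic image the response peaks at the square's corners, is negative along its edges, and near zero in flat regions, matching the eigenvalue classification above. (NMS and thresholding are omitted for brevity.)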

Properties: rotation invariant, NOT scale invariant, NOT affine invariant.

Shi-Tomasi variant: uses R = min(lambda_1, lambda_2) -- simpler and often preferred (used in OpenCV's goodFeaturesToTrack).

FAST (Features from Accelerated Segment Test)

Tests a circle of 16 pixels around a candidate point. A corner exists if N contiguous pixels are all brighter or all darker than the center by threshold t (N=12 in the original segment test; the widely used FAST-9 variant takes N=9 together with a learned decision tree).

  • Extremely fast: uses decision tree for early rejection
  • No gradient computation required
  • Not scale invariant; typically used with image pyramids
  • Used as the detector in ORB
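A minimal numpy sketch of the segment test (brute-force scan of the ring; the decision-tree early rejection is omitted):

```python
import numpy as np

# Bresenham circle of radius 3: the 16 (dy, dx) offsets used by FAST
CIRCLE = [(-3,0),(-3,1),(-2,2),(-1,3),(0,3),(1,3),(2,2),(3,1),
          (3,0),(3,-1),(2,-2),(1,-3),(0,-3),(-1,-3),(-2,-2),(-3,-1)]

def is_fast_corner(img, y, x, t=0.2, n=9):
    """Segment test: n contiguous circle pixels all brighter or all
    darker than the center by threshold t (n=9 as in FAST-9)."""
    c = img[y, x]
    ring = np.array([img[y + dy, x + dx] for dy, dx in CIRCLE])
    for mask in (ring > c + t, ring < c - t):
        # Duplicate the ring so a contiguous run can wrap around
        run = best = 0
        for v in np.concatenate([mask, mask]):
            run = run + 1 if v else 0
            best = max(best, run)
        if best >= n:
            return True
    return False

img = np.zeros((16, 16)); img[4:12, 4:12] = 1.0   # bright square
```

Probing this image: the square's corner at (4, 4) passes the test, while a point on its edge at (4, 8) and an interior point at (8, 8) do not.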

Blob Detection

Laplacian of Gaussian (LoG)

Detects blob-like structures at characteristic scales. The scale-normalized LoG:

sigma^2 * nabla^2 G(x, y, sigma) = sigma^2 * ((x^2 + y^2 - 2*sigma^2) / sigma^4) * G(x, y, sigma)

Scale selection: compute LoG response across scales, find maxima in (x, y, sigma) space. The scale of the extremum indicates blob size: r = sigma * sqrt(2).
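The r = sigma * sqrt(2) relation can be checked numerically for a binary disk by evaluating the center response of the scale-normalized LoG filter across scales (a sketch on a pixel grid, so the recovered scale is only approximate):

```python
import numpy as np

def norm_log(x, y, s):
    """Scale-normalized LoG: s^2 * ((x^2 + y^2 - 2 s^2) / s^4) * G."""
    g = np.exp(-(x**2 + y**2) / (2 * s**2)) / (2 * np.pi * s**2)
    return s**2 * (x**2 + y**2 - 2 * s**2) / s**4 * g

r_blob = 6.0
yy, xx = np.mgrid[-25:26, -25:26].astype(float)
disk = (xx**2 + yy**2) <= r_blob**2        # binary blob of radius 6

# Filter response at the blob center across a range of scales
sigmas = np.linspace(1.0, 10.0, 181)
resp = [abs((norm_log(xx, yy, s) * disk).sum()) for s in sigmas]
best = sigmas[int(np.argmax(resp))]        # characteristic scale
```

The response magnitude peaks near sigma = r_blob / sqrt(2) ≈ 4.24, so the blob's radius is recovered as r = sigma * sqrt(2).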

Difference of Gaussians (DoG)

Efficient approximation to LoG:

DoG(x, y, sigma) = G(x, y, k*sigma) - G(x, y, sigma) ~ (k - 1) * sigma^2 * nabla^2 G

Used in SIFT. Constructed by subtracting adjacent levels in a Gaussian pyramid. Extrema detected in 3x3x3 neighborhoods across position and scale.
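The quality of the DoG ~ (k - 1) * sigma^2 * nabla^2 G approximation can be checked directly on sampled kernels (the approximation tightens as k approaches 1; SIFT uses k = 2^(1/s) for s scales per octave):

```python
import numpy as np

def gauss(x, y, s):
    return np.exp(-(x**2 + y**2) / (2 * s**2)) / (2 * np.pi * s**2)

yy, xx = np.mgrid[-20:21, -20:21].astype(float)
s, k = 3.0, 1.1

dog = gauss(xx, yy, k * s) - gauss(xx, yy, s)
lap = (xx**2 + yy**2 - 2 * s**2) / s**4 * gauss(xx, yy, s)  # nabla^2 G
approx = (k - 1) * s**2 * lap

# Shape agreement (cosine similarity) and worst-case amplitude error
cos = float((dog * approx).sum() /
            (np.linalg.norm(dog) * np.linalg.norm(approx)))
err = float(np.abs(dog - approx).max() / np.abs(dog).max())
```

The two kernels are nearly parallel as vectors (cosine similarity above 0.98 here), which is why DoG extrema coincide with LoG extrema even though the amplitudes differ by a modest factor.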


Scale-Space Theory

A multi-scale representation L(x, y, t) must satisfy the diffusion equation:

dL/dt = nabla^2 L

where t = sigma^2 / 2. The only kernel satisfying this with the causality axiom (no new extrema as scale increases) is the Gaussian. Scale space enables:

  • Detecting features at their natural scale
  • Coarse-to-fine analysis
  • Building image pyramids (octave = factor-of-2 downsampling)

Gaussian pyramid: repeatedly smooth and downsample. Each octave halves resolution. Laplacian pyramid: differences between Gaussian levels; encodes band-pass information.
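Both pyramids can be sketched with a separable binomial filter standing in for the Gaussian (a common approximation; function names are illustrative):

```python
import numpy as np

def blur(img, kernel=(1, 4, 6, 4, 1)):
    """Separable binomial blur approximating a Gaussian."""
    k = np.array(kernel, float); k /= k.sum()
    tmp = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, img)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, tmp)

def gaussian_pyramid(img, levels=4):
    """Repeatedly smooth, then downsample by 2 (one image per octave)."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(blur(pyr[-1])[::2, ::2])
    return pyr

def laplacian_pyramid(gauss_pyr):
    """Band-pass levels: each level minus the upsampled coarser level."""
    lap = []
    for fine, coarse in zip(gauss_pyr, gauss_pyr[1:]):
        up = np.repeat(np.repeat(coarse, 2, axis=0), 2, axis=1)
        lap.append(fine - blur(up[:fine.shape[0], :fine.shape[1]]))
    lap.append(gauss_pyr[-1])   # low-pass residual
    return lap

img = np.random.default_rng(0).random((64, 64))
gp = gaussian_pyramid(img)     # 64 -> 32 -> 16 -> 8
lp = laplacian_pyramid(gp)     # three band-pass levels + residual
```

Each Gaussian level halves the resolution (one octave); the Laplacian levels carry the band-pass detail discarded between octaves.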


Feature Descriptors

SIFT (Scale-Invariant Feature Transform)

Complete pipeline: detection + description.

Detection (DoG keypoints):

  1. Build Gaussian scale space with multiple octaves
  2. Compute DoG images by subtracting adjacent scales
  3. Find extrema in 3x3x3 neighborhoods
  4. Refine location via quadratic interpolation (sub-pixel accuracy)
  5. Remove low-contrast points and edge responses (ratio of principal curvatures)
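Step 4, the quadratic refinement, solves offset = -H^{-1} g from finite-difference gradient and Hessian on the 3x3x3 neighborhood. A sketch on a synthetic quadratic peak (index order here is scale, y, x; a real implementation also iterates when the offset exceeds 0.5):

```python
import numpy as np

def refine_extremum(c):
    """Sub-pixel/sub-scale offset of an extremum at the center voxel
    (1,1,1) of a 3x3x3 DoG cube: offset = -H^{-1} g."""
    # Central-difference gradient (dx, dy, dsigma)
    g = 0.5 * np.array([c[1,1,2] - c[1,1,0],
                        c[1,2,1] - c[1,0,1],
                        c[2,1,1] - c[0,1,1]])
    # Finite-difference Hessian
    dxx = c[1,1,2] - 2*c[1,1,1] + c[1,1,0]
    dyy = c[1,2,1] - 2*c[1,1,1] + c[1,0,1]
    dss = c[2,1,1] - 2*c[1,1,1] + c[0,1,1]
    dxy = 0.25 * (c[1,2,2] - c[1,2,0] - c[1,0,2] + c[1,0,0])
    dxs = 0.25 * (c[2,1,2] - c[2,1,0] - c[0,1,2] + c[0,1,0])
    dys = 0.25 * (c[2,2,1] - c[2,0,1] - c[0,2,1] + c[0,0,1])
    H = np.array([[dxx, dxy, dxs], [dxy, dyy, dys], [dxs, dys, dss]])
    return -np.linalg.solve(H, g)   # (dx, dy, dsigma)

# Synthetic quadratic peak offset by (0.3, -0.2, 0.1) from the center
s, y, x = np.mgrid[-1:2, -1:2, -1:2].astype(float)
cube = -((x - 0.3)**2 + (y + 0.2)**2 + (s - 0.1)**2)
off = refine_extremum(cube)
```

For an exact quadratic the finite differences are exact, so the recovered offset is (0.3, -0.2, 0.1) to machine precision.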

Orientation assignment:

  • Compute gradient histogram in a region around the keypoint (36 bins, 10 degrees each)
  • Dominant orientation = peak of histogram
  • Multiple orientations create multiple descriptors (handles ambiguity)

Descriptor (128-D vector):

  1. Rotate patch to canonical orientation
  2. Divide 16x16 region into 4x4 grid of cells
  3. In each cell, compute 8-bin gradient orientation histogram
  4. Concatenate: 4x4x8 = 128 dimensions
  5. Normalize to unit length, clip values at 0.2, re-normalize
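Step 5 in isolation, as a sketch: clipping at 0.2 limits the influence of any single large gradient (caused by non-linear illumination effects such as highlights), and the second normalization restores unit length:

```python
import numpy as np

def normalize_sift(desc, clip=0.2):
    """Unit-normalize, clip large entries, re-normalize."""
    d = desc / (np.linalg.norm(desc) + 1e-12)
    d = np.minimum(d, clip)                    # cap dominant bins
    return d / (np.linalg.norm(d) + 1e-12)

rng = np.random.default_rng(1)
desc = rng.random(128)     # stand-in for the 4x4x8 gradient histogram
desc[0] = 50.0             # one bin dominates (e.g. a specular edge)
d = normalize_sift(desc)
```

Before clipping the dominant bin holds nearly all of the descriptor's energy; after clip-and-renormalize its share drops sharply while the vector stays unit length. (Entries may slightly exceed 0.2 after the second normalization; that is expected.)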

Properties: invariant to scale, rotation, and partially to affine/illumination changes. Patented until 2020.

SURF (Speeded-Up Robust Features)

Faster SIFT approximation:

  • Detection: Hessian matrix approximated with box filters + integral images
  • Descriptor: Haar wavelet responses in 4x4 subregions (64-D or 128-D)
  • 3-6x faster than SIFT with comparable performance

ORB (Oriented FAST and Rotated BRIEF)

Designed as a fast, free alternative to SIFT/SURF:

Detection: FAST keypoints + Harris corner score for ranking + image pyramid for multi-scale.

Orientation: intensity centroid method:

theta = atan2(m_01, m_10)

where m_pq are image moments of the patch.
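The intensity-centroid computation in numpy (a sketch on synthetic patches; ORB additionally restricts the moments to a circular region):

```python
import numpy as np

def orientation(patch):
    """Intensity-centroid orientation theta = atan2(m01, m10), with
    moments m_pq = sum x^p y^q I(x, y) taken about the patch center."""
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    xx -= (w - 1) / 2
    yy -= (h - 1) / 2
    return np.arctan2((yy * patch).sum(), (xx * patch).sum())

# Mass right of center -> theta = 0; mass below center -> theta = pi/2
right = np.zeros((15, 15)); right[:, 10:] = 1.0
below = np.zeros((15, 15)); below[10:, :] = 1.0
```

The angle points from the patch center toward the intensity centroid, so rotating the patch rotates theta by the same amount, which is what makes the BRIEF descriptor rotation-aware.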

Descriptor: BRIEF (Binary Robust Independent Elementary Features) rotated by keypoint orientation.

  • 256-bit binary string from pairwise intensity comparisons
  • Matching via Hamming distance (XOR + popcount) -- extremely fast
  • Learned comparison pairs for low correlation between bits
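Hamming matching in a few lines, storing each 256-bit descriptor as 32 bytes (numpy's unpackbits plays the role of popcount here):

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between binary descriptors held as uint8
    arrays (32 bytes = 256 bits): XOR, then count set bits."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

rng = np.random.default_rng(0)
d1 = rng.integers(0, 256, size=32, dtype=np.uint8)
d2 = d1.copy()
d2[0] ^= 0b10110000   # flip three bits in the first byte
```

Identical descriptors have distance 0, the three-bit perturbation gives distance 3, and a full complement gives the maximum 256. In native code this reduces to one XOR and one popcount per machine word, which is why binary matching is so fast.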

Comparison:

| Feature | Descriptor Size | Speed | Scale Inv. | Rotation Inv. |
|---------|-----------------|-------|------------|---------------|
| SIFT | 128 floats | Slow | Yes | Yes |
| SURF | 64 floats | Medium | Yes | Yes |
| ORB | 256 bits | Fast | Via pyramid | Yes |


Feature Matching

Brute-Force Matching

Compare every descriptor in image 1 against every descriptor in image 2. Distance metrics:

  • L2 norm: for float descriptors (SIFT, SURF)
  • Hamming distance: for binary descriptors (ORB, BRIEF)

FLANN (Fast Library for Approximate Nearest Neighbors)

Uses randomized KD-trees or hierarchical k-means for approximate nearest neighbor search. Orders of magnitude faster than brute force for large descriptor sets. Automatically selects algorithm and parameters.

Lowe's Ratio Test

Reject ambiguous matches by comparing distances to the best and second-best match:

accept if: d(best) / d(second_best) < threshold  (typically 0.7-0.8)

Rationale: correct matches have the best match significantly closer than the second best. Ambiguous features (e.g., repeated patterns) have similar first and second distances.

Cross-Check

Require that if descriptor A matches B, then B also matches A. Eliminates many false matches.

Matching Pipeline (Practical)

  1. Detect keypoints in both images
  2. Compute descriptors
  3. Find nearest neighbors (brute-force or FLANN)
  4. Apply ratio test (discard ratio > 0.75)
  5. Apply cross-check or symmetry test
  6. Geometric verification with RANSAC (fit homography or fundamental matrix, reject outliers)
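Steps 3-5 of the pipeline in a numpy sketch on synthetic descriptors (brute-force L2, ratio test, cross-check; RANSAC verification would follow and is omitted here):

```python
import numpy as np

def match(desc1, desc2, ratio=0.75):
    """Brute-force matching with Lowe's ratio test and cross-check."""
    # Pairwise squared L2 distances, shape (len(desc1), len(desc2))
    d = ((desc1[:, None, :] - desc2[None, :, :]) ** 2).sum(-1)
    matches = []
    for i in range(len(desc1)):
        order = np.argsort(d[i])
        best, second = order[0], order[1]
        # Ratio test on squared distances, hence ratio**2
        if d[i, best] >= (ratio ** 2) * d[i, second]:
            continue
        # Cross-check: best must also prefer i among all of desc1
        if np.argmin(d[:, best]) == i:
            matches.append((i, int(best)))
    return matches

rng = np.random.default_rng(0)
desc1 = rng.normal(size=(20, 32))
perm = rng.permutation(20)
desc2 = desc1[perm] + 0.01 * rng.normal(size=(20, 32))  # noisy shuffle
matches = match(desc1, desc2)
```

Because desc2 is a noisy permutation of desc1, every match (i, j) should satisfy perm[j] == i; the ratio test passes easily here since the true match is far closer than any impostor.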

Learned Features

SuperPoint

CNN-based detector and descriptor trained with self-supervised homographic adaptation:

  • Shared encoder, two decoder heads (keypoint + descriptor)
  • 256-D dense descriptors at 1/8 resolution
  • Superior repeatability under large viewpoint changes

SuperGlue

Graph neural network for feature matching:

  • Attentional aggregation of visual and positional information
  • Jointly reasons about all matches (not independent pairwise)
  • Handles partial visibility and repeated structures

LightGlue

Efficient successor to SuperGlue with adaptive computation -- early exits when confidence is high. Achieves similar accuracy at significantly lower latency.


Applications

| Application | Features Used | Key Requirement |
|-------------|---------------|-----------------|
| Image stitching (panoramas) | SIFT, ORB | Repeatability across viewpoints |
| Visual SLAM | ORB, SuperPoint | Real-time speed |
| Object recognition | SIFT, learned | Distinctiveness |
| Image retrieval | Aggregated descriptors (VLAD, BoW) | Compact representation |
| 3D reconstruction | SIFT, SuperPoint+SuperGlue | Dense, accurate matches |


Key Takeaways

| Concept | Core Idea |
|---------|-----------|
| Harris corners | Eigenvalues of structure tensor classify local geometry |
| Scale space | Gaussian pyramid enables multi-scale feature detection |
| SIFT | DoG detection + 128-D gradient histogram descriptor; gold standard |
| ORB | FAST + rotated BRIEF; binary descriptor, fast matching |
| Ratio test | Reject ambiguous matches by comparing 1st/2nd nearest neighbor |
| Learned features | SuperPoint/SuperGlue outperform classical methods on difficult scenes |