
Feature Detection and Description

Overview

Features are distinctive, repeatable image locations used for matching, recognition, and 3D reconstruction. A complete feature pipeline has three stages: detection (where), description (what), and matching (correspondence).

Desirable properties: repeatability (same feature found under viewpoint/lighting changes), distinctiveness (descriptors are discriminative), and efficiency (real-time capable).


Corner Detection

Harris Corner Detector

Corners are points where shifting a window in any direction produces a large intensity change.

Auto-correlation matrix (structure tensor) at pixel (x, y):

M = sum_{(u,v) in W} w(u,v) [I_x^2    I_x*I_y]
                              [I_x*I_y  I_y^2  ]

where I_x, I_y are image gradients and w is a Gaussian weighting window.

Corner response function:

R = det(M) - k * trace(M)^2 = lambda_1 * lambda_2 - k * (lambda_1 + lambda_2)^2

where k ~ 0.04-0.06. Classification by eigenvalues of M:

  • Corner: both lambda_1, lambda_2 large (R >> 0)
  • Edge: one large, one small (R << 0)
  • Flat: both small (|R| ~ 0)

Steps: compute gradients, form M with Gaussian weighting, compute R, threshold, apply non-maximum suppression (NMS).
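The steps above can be sketched in a few lines of numpy (function names are illustrative; a real implementation would use OpenCV's cornerHarris):

```python
import numpy as np

def smooth(a, g):
    # Separable convolution with a 1-D Gaussian kernel g
    a = np.apply_along_axis(lambda v: np.convolve(v, g, mode="same"), 0, a)
    return np.apply_along_axis(lambda v: np.convolve(v, g, mode="same"), 1, a)

def harris_response(img, k=0.05, sigma=1.0):
    """Harris response R = det(M) - k * trace(M)^2 at every pixel."""
    Iy, Ix = np.gradient(img.astype(float))      # image gradients
    t = np.arange(-int(3 * sigma), int(3 * sigma) + 1)
    g = np.exp(-t**2 / (2 * sigma**2)); g /= g.sum()
    # Gaussian-weighted entries of the structure tensor M
    Sxx = smooth(Ix * Ix, g)
    Syy = smooth(Iy * Iy, g)
    Sxy = smooth(Ix * Iy, g)
    return (Sxx * Syy - Sxy**2) - k * (Sxx + Syy) ** 2

# A bright square on black: the four corners should score highest
img = np.zeros((32, 32)); img[8:24, 8:24] = 1.0
R = harris_response(img)
y, x = np.unravel_index(np.argmax(R), R.shape)   # strongest corner
```

On this synthetic image the response peaks at the square's corners, is negative along its edges, and near zero in flat regions, matching the eigenvalue classification above. (NMS and thresholding are omitted for brevity.)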

Properties: rotation invariant, NOT scale invariant, NOT affine invariant.

Shi-Tomasi variant: uses R = min(lambda_1, lambda_2) -- simpler and often preferred (used in OpenCV's goodFeaturesToTrack).

FAST (Features from Accelerated Segment Test)

Tests a circle of 16 pixels around a candidate point. A corner exists if N contiguous pixels are all brighter or all darker than the center by threshold t (N=12 in the original segment test; the widely used FAST-9 variant takes N=9 together with a learned decision tree).

  • Extremely fast: uses decision tree for early rejection
  • No gradient computation required
  • Not scale invariant; typically used with image pyramids
  • Used as the detector in ORB
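A minimal numpy sketch of the segment test (brute-force scan of the ring; the decision-tree early rejection is omitted):

```python
import numpy as np

# Bresenham circle of radius 3: the 16 (dy, dx) offsets used by FAST
CIRCLE = [(-3,0),(-3,1),(-2,2),(-1,3),(0,3),(1,3),(2,2),(3,1),
          (3,0),(3,-1),(2,-2),(1,-3),(0,-3),(-1,-3),(-2,-2),(-3,-1)]

def is_fast_corner(img, y, x, t=0.2, n=9):
    """Segment test: n contiguous circle pixels all brighter or all
    darker than the center by threshold t (n=9 as in FAST-9)."""
    c = img[y, x]
    ring = np.array([img[y + dy, x + dx] for dy, dx in CIRCLE])
    for mask in (ring > c + t, ring < c - t):
        # Duplicate the ring so a contiguous run can wrap around
        run = best = 0
        for v in np.concatenate([mask, mask]):
            run = run + 1 if v else 0
            best = max(best, run)
        if best >= n:
            return True
    return False

img = np.zeros((16, 16)); img[4:12, 4:12] = 1.0   # bright square
```

Probing this image: the square's corner at (4, 4) passes the test, while a point on its edge at (4, 8) and an interior point at (8, 8) do not.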

Blob Detection

Laplacian of Gaussian (LoG)

Detects blob-like structures at characteristic scales. The scale-normalized LoG:

sigma^2 * nabla^2 G(x, y, sigma) = sigma^2 * ((x^2 + y^2 - 2*sigma^2) / sigma^4) * G(x, y, sigma)

Scale selection: compute LoG response across scales, find maxima in (x, y, sigma) space. The scale of the extremum indicates blob size: r = sigma * sqrt(2).
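The r = sigma * sqrt(2) relation can be checked numerically for a binary disk by evaluating the center response of the scale-normalized LoG filter across scales (a sketch on a pixel grid, so the recovered scale is only approximate):

```python
import numpy as np

def norm_log(x, y, s):
    """Scale-normalized LoG: s^2 * ((x^2 + y^2 - 2 s^2) / s^4) * G."""
    g = np.exp(-(x**2 + y**2) / (2 * s**2)) / (2 * np.pi * s**2)
    return s**2 * (x**2 + y**2 - 2 * s**2) / s**4 * g

r_blob = 6.0
yy, xx = np.mgrid[-25:26, -25:26].astype(float)
disk = (xx**2 + yy**2) <= r_blob**2        # binary blob of radius 6

# Filter response at the blob center across a range of scales
sigmas = np.linspace(1.0, 10.0, 181)
resp = [abs((norm_log(xx, yy, s) * disk).sum()) for s in sigmas]
best = sigmas[int(np.argmax(resp))]        # characteristic scale
```

The response magnitude peaks near sigma = r_blob / sqrt(2) ≈ 4.24, so the blob's radius is recovered as r = sigma * sqrt(2).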

Difference of Gaussians (DoG)

Efficient approximation to LoG:

DoG(x, y, sigma) = G(x, y, k*sigma) - G(x, y, sigma) ~ (k - 1) * sigma^2 * nabla^2 G

Used in SIFT. Constructed by subtracting adjacent levels in a Gaussian pyramid. Extrema detected in 3x3x3 neighborhoods across position and scale.
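The quality of the DoG ~ (k - 1) * sigma^2 * nabla^2 G approximation can be checked directly on sampled kernels (the approximation tightens as k approaches 1; SIFT uses k = 2^(1/s) for s scales per octave):

```python
import numpy as np

def gauss(x, y, s):
    return np.exp(-(x**2 + y**2) / (2 * s**2)) / (2 * np.pi * s**2)

yy, xx = np.mgrid[-20:21, -20:21].astype(float)
s, k = 3.0, 1.1

dog = gauss(xx, yy, k * s) - gauss(xx, yy, s)
lap = (xx**2 + yy**2 - 2 * s**2) / s**4 * gauss(xx, yy, s)  # nabla^2 G
approx = (k - 1) * s**2 * lap

# Shape agreement (cosine similarity) and worst-case amplitude error
cos = float((dog * approx).sum() /
            (np.linalg.norm(dog) * np.linalg.norm(approx)))
err = float(np.abs(dog - approx).max() / np.abs(dog).max())
```

The two kernels are nearly parallel as vectors (cosine similarity above 0.98 here), which is why DoG extrema coincide with LoG extrema even though the amplitudes differ by a modest factor.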


Scale-Space Theory

A multi-scale representation L(x, y, t) must satisfy the diffusion equation:

dL/dt = nabla^2 L

where t = sigma^2 / 2. The only kernel satisfying this with the causality axiom (no new extrema as scale increases) is the Gaussian. Scale space enables:

  • Detecting features at their natural scale
  • Coarse-to-fine analysis
  • Building image pyramids (octave = factor-of-2 downsampling)

Gaussian pyramid: repeatedly smooth and downsample. Each octave halves resolution. Laplacian pyramid: differences between Gaussian levels; encodes band-pass information.
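Both pyramids can be sketched with a separable binomial filter standing in for the Gaussian (a common approximation; function names are illustrative):

```python
import numpy as np

def blur(img, kernel=(1, 4, 6, 4, 1)):
    """Separable binomial blur approximating a Gaussian."""
    k = np.array(kernel, float); k /= k.sum()
    tmp = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, img)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, tmp)

def gaussian_pyramid(img, levels=4):
    """Repeatedly smooth, then downsample by 2 (one image per octave)."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(blur(pyr[-1])[::2, ::2])
    return pyr

def laplacian_pyramid(gauss_pyr):
    """Band-pass levels: each level minus the upsampled coarser level."""
    lap = []
    for fine, coarse in zip(gauss_pyr, gauss_pyr[1:]):
        up = np.repeat(np.repeat(coarse, 2, axis=0), 2, axis=1)
        lap.append(fine - blur(up[:fine.shape[0], :fine.shape[1]]))
    lap.append(gauss_pyr[-1])   # low-pass residual
    return lap

img = np.random.default_rng(0).random((64, 64))
gp = gaussian_pyramid(img)     # 64 -> 32 -> 16 -> 8
lp = laplacian_pyramid(gp)     # three band-pass levels + residual
```

Each Gaussian level halves the resolution (one octave); the Laplacian levels carry the band-pass detail discarded between octaves.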


Feature Descriptors

SIFT (Scale-Invariant Feature Transform)

Complete pipeline: detection + description.

Detection (DoG keypoints):

  1. Build Gaussian scale space with multiple octaves
  2. Compute DoG images by subtracting adjacent scales
  3. Find extrema in 3x3x3 neighborhoods
  4. Refine location via quadratic interpolation (sub-pixel accuracy)
  5. Remove low-contrast points and edge responses (ratio of principal curvatures)
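Step 4, the quadratic refinement, solves offset = -H^{-1} g from finite-difference gradient and Hessian on the 3x3x3 neighborhood. A sketch on a synthetic quadratic peak (index order here is scale, y, x; a real implementation also iterates when the offset exceeds 0.5):

```python
import numpy as np

def refine_extremum(c):
    """Sub-pixel/sub-scale offset of an extremum at the center voxel
    (1,1,1) of a 3x3x3 DoG cube: offset = -H^{-1} g."""
    # Central-difference gradient (dx, dy, dsigma)
    g = 0.5 * np.array([c[1,1,2] - c[1,1,0],
                        c[1,2,1] - c[1,0,1],
                        c[2,1,1] - c[0,1,1]])
    # Finite-difference Hessian
    dxx = c[1,1,2] - 2*c[1,1,1] + c[1,1,0]
    dyy = c[1,2,1] - 2*c[1,1,1] + c[1,0,1]
    dss = c[2,1,1] - 2*c[1,1,1] + c[0,1,1]
    dxy = 0.25 * (c[1,2,2] - c[1,2,0] - c[1,0,2] + c[1,0,0])
    dxs = 0.25 * (c[2,1,2] - c[2,1,0] - c[0,1,2] + c[0,1,0])
    dys = 0.25 * (c[2,2,1] - c[2,0,1] - c[0,2,1] + c[0,0,1])
    H = np.array([[dxx, dxy, dxs], [dxy, dyy, dys], [dxs, dys, dss]])
    return -np.linalg.solve(H, g)   # (dx, dy, dsigma)

# Synthetic quadratic peak offset by (0.3, -0.2, 0.1) from the center
s, y, x = np.mgrid[-1:2, -1:2, -1:2].astype(float)
cube = -((x - 0.3)**2 + (y + 0.2)**2 + (s - 0.1)**2)
off = refine_extremum(cube)
```

For an exact quadratic the finite differences are exact, so the recovered offset is (0.3, -0.2, 0.1) to machine precision.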

Orientation assignment:

  • Compute gradient histogram in a region around the keypoint (36 bins, 10 degrees each)
  • Dominant orientation = peak of histogram
  • Multiple orientations create multiple descriptors (handles ambiguity)

Descriptor (128-D vector):

  1. Rotate patch to canonical orientation
  2. Divide 16x16 region into 4x4 grid of cells
  3. In each cell, compute 8-bin gradient orientation histogram
  4. Concatenate: 4x4x8 = 128 dimensions
  5. Normalize to unit length, clip values at 0.2, re-normalize
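Step 5 in isolation, as a sketch: clipping at 0.2 limits the influence of any single large gradient (caused by non-linear illumination effects such as highlights), and the second normalization restores unit length:

```python
import numpy as np

def normalize_sift(desc, clip=0.2):
    """Unit-normalize, clip large entries, re-normalize."""
    d = desc / (np.linalg.norm(desc) + 1e-12)
    d = np.minimum(d, clip)                    # cap dominant bins
    return d / (np.linalg.norm(d) + 1e-12)

rng = np.random.default_rng(1)
desc = rng.random(128)     # stand-in for the 4x4x8 gradient histogram
desc[0] = 50.0             # one bin dominates (e.g. a specular edge)
d = normalize_sift(desc)
```

Before clipping the dominant bin holds nearly all of the descriptor's energy; after clip-and-renormalize its share drops sharply while the vector stays unit length. (Entries may slightly exceed 0.2 after the second normalization; that is expected.)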

Properties: invariant to scale, rotation, and partially to affine/illumination changes. Patented until 2020.

SURF (Speeded-Up Robust Features)

Faster SIFT approximation:

  • Detection: Hessian matrix approximated with box filters + integral images
  • Descriptor: Haar wavelet responses in 4x4 subregions (64-D or 128-D)
  • 3-6x faster than SIFT with comparable performance

ORB (Oriented FAST and Rotated BRIEF)

Designed as a fast, free alternative to SIFT/SURF:

Detection: FAST keypoints + Harris corner score for ranking + image pyramid for multi-scale.

Orientation: intensity centroid method:

theta = atan2(m_01, m_10)

where m_pq are image moments of the patch.
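The intensity-centroid computation in numpy (a sketch on synthetic patches; ORB additionally restricts the moments to a circular region):

```python
import numpy as np

def orientation(patch):
    """Intensity-centroid orientation theta = atan2(m01, m10), with
    moments m_pq = sum x^p y^q I(x, y) taken about the patch center."""
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    xx -= (w - 1) / 2
    yy -= (h - 1) / 2
    return np.arctan2((yy * patch).sum(), (xx * patch).sum())

# Mass right of center -> theta = 0; mass below center -> theta = pi/2
right = np.zeros((15, 15)); right[:, 10:] = 1.0
below = np.zeros((15, 15)); below[10:, :] = 1.0
```

The angle points from the patch center toward the intensity centroid, so rotating the patch rotates theta by the same amount, which is what makes the BRIEF descriptor rotation-aware.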

Descriptor: BRIEF (Binary Robust Independent Elementary Features) rotated by keypoint orientation.

  • 256-bit binary string from pairwise intensity comparisons
  • Matching via Hamming distance (XOR + popcount) -- extremely fast
  • Learned comparison pairs for low correlation between bits
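Hamming matching in a few lines, storing each 256-bit descriptor as 32 bytes (numpy's unpackbits plays the role of popcount here):

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between binary descriptors held as uint8
    arrays (32 bytes = 256 bits): XOR, then count set bits."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

rng = np.random.default_rng(0)
d1 = rng.integers(0, 256, size=32, dtype=np.uint8)
d2 = d1.copy()
d2[0] ^= 0b10110000   # flip three bits in the first byte
```

Identical descriptors have distance 0, the three-bit perturbation gives distance 3, and a full complement gives the maximum 256. In native code this reduces to one XOR and one popcount per machine word, which is why binary matching is so fast.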

Comparison:

| Feature | Descriptor Size | Speed | Scale Inv. | Rotation Inv. |
|---------|-----------------|-------|------------|---------------|
| SIFT | 128 floats | Slow | Yes | Yes |
| SURF | 64 floats | Medium | Yes | Yes |
| ORB | 256 bits | Fast | Via pyramid | Yes |


Feature Matching

Brute-Force Matching

Compare every descriptor in image 1 against every descriptor in image 2. Distance metrics:

  • L2 norm: for float descriptors (SIFT, SURF)
  • Hamming distance: for binary descriptors (ORB, BRIEF)

FLANN (Fast Library for Approximate Nearest Neighbors)

Uses randomized KD-trees or hierarchical k-means for approximate nearest neighbor search. Orders of magnitude faster than brute force for large descriptor sets. Automatically selects algorithm and parameters.

Lowe's Ratio Test

Reject ambiguous matches by comparing distances to the best and second-best match:

accept if: d(best) / d(second_best) < threshold  (typically 0.7-0.8)

Rationale: correct matches have the best match significantly closer than the second best. Ambiguous features (e.g., repeated patterns) have similar first and second distances.

Cross-Check

Require that if descriptor A matches B, then B also matches A. Eliminates many false matches.

Matching Pipeline (Practical)

  1. Detect keypoints in both images
  2. Compute descriptors
  3. Find nearest neighbors (brute-force or FLANN)
  4. Apply ratio test (discard ratio > 0.75)
  5. Apply cross-check or symmetry test
  6. Geometric verification with RANSAC (fit homography or fundamental matrix, reject outliers)
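Steps 3-5 of the pipeline in a numpy sketch on synthetic descriptors (brute-force L2, ratio test, cross-check; RANSAC verification would follow and is omitted here):

```python
import numpy as np

def match(desc1, desc2, ratio=0.75):
    """Brute-force matching with Lowe's ratio test and cross-check."""
    # Pairwise squared L2 distances, shape (len(desc1), len(desc2))
    d = ((desc1[:, None, :] - desc2[None, :, :]) ** 2).sum(-1)
    matches = []
    for i in range(len(desc1)):
        order = np.argsort(d[i])
        best, second = order[0], order[1]
        # Ratio test on squared distances, hence ratio**2
        if d[i, best] >= (ratio ** 2) * d[i, second]:
            continue
        # Cross-check: best must also prefer i among all of desc1
        if np.argmin(d[:, best]) == i:
            matches.append((i, int(best)))
    return matches

rng = np.random.default_rng(0)
desc1 = rng.normal(size=(20, 32))
perm = rng.permutation(20)
desc2 = desc1[perm] + 0.01 * rng.normal(size=(20, 32))  # noisy shuffle
matches = match(desc1, desc2)
```

Because desc2 is a noisy permutation of desc1, every match (i, j) should satisfy perm[j] == i; the ratio test passes easily here since the true match is far closer than any impostor.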

Learned Features

SuperPoint

CNN-based detector and descriptor trained with self-supervised homographic adaptation:

  • Shared encoder, two decoder heads (keypoint + descriptor)
  • 256-D dense descriptors at 1/8 resolution
  • Superior repeatability under large viewpoint changes

SuperGlue

Graph neural network for feature matching:

  • Attentional aggregation of visual and positional information
  • Jointly reasons about all matches (not independent pairwise)
  • Handles partial visibility and repeated structures

LightGlue

Efficient successor to SuperGlue with adaptive computation -- early exits when confidence is high. Achieves similar accuracy at significantly lower latency.


Applications

| Application | Features Used | Key Requirement |
|-------------|---------------|-----------------|
| Image stitching (panoramas) | SIFT, ORB | Repeatability across viewpoints |
| Visual SLAM | ORB, SuperPoint | Real-time speed |
| Object recognition | SIFT, learned | Distinctiveness |
| Image retrieval | Aggregated descriptors (VLAD, BoW) | Compact representation |
| 3D reconstruction | SIFT, SuperPoint+SuperGlue | Dense, accurate matches |


Key Takeaways

| Concept | Core Idea |
|---------|-----------|
| Harris corners | Eigenvalues of structure tensor classify local geometry |
| Scale space | Gaussian pyramid enables multi-scale feature detection |
| SIFT | DoG detection + 128-D gradient histogram descriptor; gold standard |
| ORB | FAST + rotated BRIEF; binary descriptor, fast matching |
| Ratio test | Reject ambiguous matches by comparing 1st/2nd nearest neighbor |
| Learned features | SuperPoint/SuperGlue outperform classical methods on difficult scenes |