Object Detection and Recognition

Figure: Object detection evolution from R-CNN to YOLO

Overview

Object detection localizes and classifies objects in images, producing bounding boxes with class labels and confidence scores. The field has evolved from handcrafted features to deep learning, and from two-stage to single-shot and transformer-based architectures.


Classical Detection

HOG + SVM

Histogram of Oriented Gradients (Dalal & Triggs, 2005):

  1. Compute gradients: magnitude and direction at each pixel
  2. Cell histograms: divide image into cells (8x8 pixels), compute 9-bin orientation histogram per cell
  3. Block normalization: group cells into overlapping blocks (2x2 cells), L2-normalize each block's concatenated histograms
  4. Classification: concatenate all block descriptors into a feature vector, train linear SVM
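To make the size of the final feature vector concrete, here is a minimal sketch that counts descriptor entries for one detection window. The 64x128 window and the one-cell block stride are assumptions taken from the common pedestrian-detection setup, not from the steps above; `hog_descriptor_length` is a hypothetical helper name.

```python
def hog_descriptor_length(win_w, win_h, cell=8, block=2, bins=9, block_stride=1):
    """Length of the HOG feature vector for one detection window.

    cell: cell side in pixels; block: block side in cells;
    block_stride: block step in cells (1 gives 50% overlap for 2x2 blocks).
    """
    cells_x, cells_y = win_w // cell, win_h // cell
    blocks_x = (cells_x - block) // block_stride + 1
    blocks_y = (cells_y - block) // block_stride + 1
    per_block = block * block * bins  # concatenated cell histograms in one block
    return blocks_x * blocks_y * per_block


# Assumed 64x128 window: 8x16 cells, 7x15 overlapping blocks, 36 values per block.
print(hog_descriptor_length(64, 128))  # 3780
```

Because blocks overlap, each interior cell contributes to several blocks, which is why the descriptor is much longer than cells x bins.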

Sliding window: scan image at multiple scales and positions, classify each window.

Deformable Parts Model (DPM): extends HOG with a root filter and deformable part filters, trained with latent SVM. Dominated detection benchmarks pre-deep learning.


Two-Stage Detectors

R-CNN (Regions with CNN features)

  1. Generate ~2000 region proposals (Selective Search)
  2. Warp each proposal to fixed size, extract CNN features (AlexNet/VGG)
  3. Classify with per-class SVMs
  4. Refine boxes with bounding box regression

Slow: CNN runs independently on each proposal.

Fast R-CNN

  • Run CNN once on entire image to get feature map
  • RoI Pooling: project proposals onto feature map, max-pool to fixed size
  • Multi-task loss: classification (softmax) + box regression (smooth L1) in single network
  • Roughly 9x faster than R-CNN at training and about 150x faster at inference (VGG16, without truncated SVD), as reported in the original paper
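The RoI Pooling step can be sketched for a single-channel feature map as follows. This is an illustrative pure-Python version, assuming the proposal has already been projected to integer feature-map coordinates; `roi_max_pool` is a hypothetical name, and real implementations operate on batched multi-channel tensors.

```python
import math


def roi_max_pool(feature, roi, out_h=2, out_w=2):
    """Max-pool one RoI of a single-channel feature map to out_h x out_w.

    feature: 2D list of numbers.
    roi: (x0, y0, x1, y1) in feature-map cells, with x1 and y1 exclusive.
    Assumes the RoI spans at least one cell in each direction.
    """
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Bin edges: floor for the start, ceil for the end, so no bin is empty.
        ys = y0 + (i * h) // out_h
        ye = y0 + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            xs = x0 + (j * w) // out_w
            xe = x0 + math.ceil((j + 1) * w / out_w)
            row.append(max(feature[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


feature = [[r * 4 + c for c in range(4)] for r in range(4)]  # values 0..15
print(roi_max_pool(feature, (0, 0, 4, 4)))  # [[5, 7], [13, 15]]
```

Whatever the proposal's size, the output grid is fixed, which is what lets a single fully connected head process every proposal.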

Faster R-CNN

Replaces Selective Search with a Region Proposal Network (RPN):

  • Shares convolutional features with detection network
  • At each spatial location, predicts K anchor boxes (multiple scales and aspect ratios)
  • Outputs objectness score + box refinement for each anchor
  • NMS on proposals, top-N fed to detection head

Anchor design: typically 3 scales x 3 aspect ratios = 9 anchors per location.
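A minimal sketch of generating the 9 anchor shapes for one location. The specific scales (128, 256, 512 pixels) and ratios (1:2, 1:1, 2:1) are assumptions following the commonly cited Faster R-CNN defaults; `make_anchors` is a hypothetical helper.

```python
import math


def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (w, h) for every scale/ratio pair; ratio is h / w.

    Each anchor keeps area scale**2 while its aspect ratio varies.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((w, h))
    return anchors


anchors = make_anchors()
print(len(anchors))  # 9
```

The same set of shapes is centered at every feature-map location, so a W x H feature map yields 9 * W * H candidate boxes.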

Two-stage process: RPN proposes, detection head classifies and refines.

Feature Pyramid Network (FPN)

Multi-scale feature extraction via top-down pathway with lateral connections:

P5 = Conv1x1(C5)
P4 = Conv1x1(C4) + Upsample(P5)
P3 = Conv1x1(C3) + Upsample(P4)
P2 = Conv1x1(C2) + Upsample(P3)

Each level detects objects at different scales. Standard backbone for modern detectors.
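The top-down recurrence above can be sketched on toy single-channel maps. This is a simplified illustration only: the 1x1 lateral convolutions are assumed to have been applied already, upsampling is nearest-neighbour by a factor of 2, and all function names are hypothetical.

```python
def upsample2x(fm):
    """Nearest-neighbour 2x upsampling of a 2D list."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add(a, b):
    """Element-wise sum of two equally sized 2D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down(laterals):
    """laterals: [L2, L3, L4, L5], finest first, i.e. Conv1x1(C2..C5)."""
    pyramid = [laterals[-1]]  # P5 = Conv1x1(C5)
    for lat in reversed(laterals[:-1]):
        # P_k = Conv1x1(C_k) + Upsample(P_{k+1})
        pyramid.insert(0, add(lat, upsample2x(pyramid[0])))
    return pyramid  # [P2, P3, P4, P5]


def zeros(n):
    return [[0.0] * n for _ in range(n)]


# A single activation at the coarsest level reaches every finer level.
p2, p3, p4, p5 = top_down([zeros(8), zeros(4), zeros(2), [[1.0]]])
print(len(p2), p2[0][0])  # 8 1.0
```

The toy run shows the point of the design: coarse, semantically strong features are propagated into the high-resolution levels.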


Single-Shot Detectors

SSD (Single Shot MultiBox Detector)

  • Multi-scale detection from multiple feature map layers (VGG-based)
  • Default boxes (anchors) at each location on each feature map
  • Directly predicts class scores + box offsets for each default box
  • Hard negative mining: 3:1 negative-to-positive ratio
  • Faster than Faster R-CNN but less accurate on small objects

YOLO Family

YOLOv1 (You Only Look Once):

  • Divide image into SxS grid
  • Each cell predicts B bounding boxes + confidence + C class probabilities
  • Single regression problem: image pixels to boxes + classes
  • Extremely fast but coarse localization
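As a rough illustration of the grid formulation, the sketch below computes the output tensor shape and decodes one cell-relative box. The values S=7, B=2, C=20 are assumptions matching the usual PASCAL VOC configuration, and both function names are hypothetical.

```python
def yolo_v1_output_shape(S=7, B=2, C=20):
    """Each cell predicts B boxes (x, y, w, h, confidence) plus C class probabilities."""
    return (S, S, B * 5 + C)


def decode_box(row, col, x, y, w, h, S=7):
    """Convert a cell-relative prediction to image-normalized (cx, cy, w, h).

    x, y are offsets inside grid cell (row, col); w, h are fractions of the image.
    """
    cx = (col + x) / S
    cy = (row + y) / S
    return (cx, cy, w, h)


print(yolo_v1_output_shape())                   # (7, 7, 30)
print(decode_box(3, 3, 0.5, 0.5, 0.2, 0.4))     # (0.5, 0.5, 0.2, 0.4)
```

Since each cell predicts only B boxes and one class distribution, nearby small objects compete for the same cell, which is one source of the coarse localization noted above.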

YOLOv2/v3: multi-scale detection, anchor boxes, Darknet backbone, feature pyramid.

YOLOv4: bag of freebies (data augmentation, label smoothing) + bag of specials (SPP, PAN, Mish activation). CSPDarknet backbone.

YOLOv5-v8 (v5 and v8 from Ultralytics; v6 and v7 from other groups): engineering improvements, anchor-free (v8), decoupled head, task-specific variants (detection, segmentation, pose).

YOLOv9: Programmable Gradient Information (PGI) + Generalized ELAN architecture.

YOLOv10: NMS-free training with consistent dual assignments. YOLOv11 (Ultralytics): further architectural refinements.

RetinaNet and Focal Loss

Key problem: extreme foreground-background class imbalance in single-shot detectors. Most anchors are easy negatives that dominate the loss.

Focal Loss:

FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)

where p_t = p if y=1, else 1-p.

  • gamma = 0: standard cross-entropy
  • gamma = 2 (typical): down-weights easy examples by factor (1-p_t)^2
  • alpha_t: class balancing weight

Example: with gamma = 2 and p_t = 0.9 (an easy example), the modulating factor is (1 - 0.9)^2 = 0.01, so the focal loss is 100x smaller than CE (before applying alpha_t).
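The formula can be checked numerically with a small scalar sketch. The default alpha = 0.25 is an assumption (a commonly used setting), and passing `alpha=None` is a convenience of this sketch for switching off class weighting.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for foreground probability p and label y in {0, 1}.

    alpha=None disables the class-balancing weight (alpha_t = 1).
    """
    p_t = p if y == 1 else 1.0 - p
    if alpha is None:
        alpha_t = 1.0
    else:
        alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p, y):
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


# Easy positive (p_t = 0.9): the loss shrinks by 1 / (1 - 0.9)^2.
ratio = cross_entropy(0.9, 1) / focal_loss(0.9, 1, gamma=2.0, alpha=None)
print(round(ratio))  # 100

# Hard positive (p_t = 0.1): the loss is reduced far less.
print(round(cross_entropy(0.1, 1) / focal_loss(0.1, 1, gamma=2.0, alpha=None), 2))  # 1.23
```

The contrast between the two ratios is the whole mechanism: easy examples are suppressed by orders of magnitude while hard ones keep most of their gradient.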

Architecture: ResNet + FPN backbone, two subnetworks (classification + regression) applied to each FPN level. Matched two-stage detector accuracy with single-shot speed.


Anchor-Free Detectors

Eliminate the need for predefined anchor boxes.

CenterNet (Objects as Points)

  • Predict object center as a heatmap peak (one channel per class)
  • At each center, regress width, height, and optional offset
  • Keypoint estimation formulation: object centers are the only keypoints, so no corner pooling (as in CornerNet) or keypoint grouping is needed
  • No NMS needed (peaks are already local maxima, extracted via max-pooling)
  • Simple, fast, and effective
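Peak extraction from a class heatmap can be sketched as a 3x3 local-maximum test, which is what a max-pooling comparison amounts to. The score threshold of 0.3 and the name `heatmap_peaks` are assumptions for illustration.

```python
def heatmap_peaks(heat, threshold=0.3):
    """Return (row, col, score) for cells that are the max of their 3x3 neighbourhood."""
    h, w = len(heat), len(heat[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            v = heat[r][c]
            if v < threshold:
                continue
            # Neighbourhood is clipped at the borders (equivalent to padding with -inf).
            neigh = [heat[rr][cc]
                     for rr in range(max(0, r - 1), min(h, r + 2))
                     for cc in range(max(0, c - 1), min(w, c + 2))]
            if v >= max(neigh):
                peaks.append((r, c, v))
    return peaks


heat = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.0],
    [0.1, 0.3, 0.2, 0.6],
    [0.0, 0.0, 0.1, 0.4],
]
print(heatmap_peaks(heat))  # [(1, 1, 0.9), (2, 3, 0.6)]
```

Each surviving peak becomes one detection; its width and height are then read from the regression maps at the same location.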

FCOS (Fully Convolutional One-Stage)

  • Per-pixel prediction: at each location, predict distances to four box sides (l, t, r, b)
  • Centerness branch: suppresses low-quality predictions far from object center
    centerness = sqrt(min(l,r)/max(l,r) * min(t,b)/max(t,b))
    
  • Multi-level prediction with FPN (assign objects to levels by size)
  • No anchors, no hyperparameter tuning for anchor shapes
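A small sketch of the per-location regression targets and the centerness score defined above. Boxes are assumed to be (x0, y0, x1, y1) and the location is assumed to lie strictly inside the box; `ltrb_targets` is a hypothetical helper name.

```python
import math


def ltrb_targets(x, y, box):
    """Distances from location (x, y) to the left, top, right, bottom box sides."""
    x0, y0, x1, y1 = box
    return (x - x0, y - y0, x1 - x, y1 - y)


def centerness(l, t, r, b):
    """1.0 at the box center, decaying toward 0 near the box edges."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


box = (0, 0, 4, 4)
print(centerness(*ltrb_targets(2, 2, box)))            # 1.0
print(round(centerness(*ltrb_targets(1, 2, box)), 3))  # 0.577
```

At inference the centerness is multiplied into the classification score, so off-center predictions rank lower before NMS.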

Transformer-Based Detection

DETR (Detection Transformer)

End-to-end detection without anchors, NMS, or hand-designed components:

Architecture:

  1. CNN backbone extracts features
  2. Transformer encoder processes flattened feature map with positional encoding
  3. Transformer decoder takes N learned object queries and attends to encoder output
  4. FFN heads predict class + box for each query

Bipartite matching loss: Hungarian algorithm finds optimal 1-to-1 assignment between predictions and ground truth:

L_match = lambda_cls * L_cls + lambda_L1 * L_box + lambda_giou * L_GIoU
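To illustrate what the 1-to-1 assignment computes, the sketch below finds the minimum-cost matching by exhaustive search over a tiny cost matrix. This brute-force search is a stand-in for the Hungarian algorithm, used here only to stay dependency-free; it returns the same optimum on small inputs but scales factorially. In practice a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` is the usual choice. The cost values are made up for illustration.

```python
from itertools import permutations


def best_assignment(cost):
    """cost[i][j]: matching cost between prediction i and ground truth j.

    Assumes at least as many predictions as ground-truth boxes.
    Returns (matches, total) where matches[j] is the prediction index for ground truth j.
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    for preds in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(preds))
        if total < best_total:
            best, best_total = preds, total
    return list(best), best_total


# 3 predictions (object queries), 2 ground-truth boxes; entries play the role of L_match.
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
matches, total = best_assignment(cost)
print(matches)  # [1, 0]
```

Prediction 2 is left unmatched, so it would be trained toward the "no object" class. That is how unused queries learn to stay silent.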

Properties:

  • No NMS post-processing
  • Learns to suppress duplicates through self-attention
  • Slow convergence (500 epochs vs 36 for Faster R-CNN)
  • Struggles with small objects

Deformable DETR

Addresses DETR's limitations:

  • Deformable attention: attends to a small set of sampling points around a reference (not all spatial locations)
  • Multi-scale deformable attention across backbone feature levels (no FPN top-down pathway required)
  • 10x fewer training epochs, better small-object performance

RT-DETR

Real-time DETR variant with hybrid encoder and efficient decoder.


Non-Maximum Suppression (NMS)

Post-processing to remove duplicate detections:

  1. Sort detections by confidence score
  2. Select the highest-scoring detection and keep it
  3. Remove every remaining detection whose IoU with the selected detection exceeds a threshold (typically 0.5)
  4. Repeat until no detections remain
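The four steps translate almost directly into code. This is a minimal pure-Python sketch for clarity rather than speed; boxes are assumed to be (x1, y1, x2, y2) tuples and the function names are illustrative.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    # Step 1: sort by confidence.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:                      # Step 4: repeat until nothing remains.
        best = order.pop(0)           # Step 2: take the top-scoring box.
        keep.append(best)
        # Step 3: drop boxes that overlap it too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with IoU 81/119, about 0.68, so it is suppressed; the third is disjoint and survives.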

Variants:

  • Soft-NMS: decay scores instead of hard removal: s_i = s_i * exp(-IoU^2 / sigma)
  • Class-agnostic NMS: apply across all classes
  • Batched NMS: offset boxes by class ID to prevent inter-class suppression
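Two of these variants are small enough to sketch directly. The Gaussian decay follows the Soft-NMS formula above, with sigma = 0.5 as an assumed default; the class-offset helper is one illustrative way to realize batched NMS, and both function names are hypothetical.

```python
import math


def soft_nms_decay(score, iou_with_selected, sigma=0.5):
    """Gaussian Soft-NMS: shrink the score instead of discarding the box."""
    return score * math.exp(-(iou_with_selected ** 2) / sigma)


def offset_boxes_by_class(boxes, class_ids):
    """Shift each box by class_id * (max coordinate + 1) so classes never overlap.

    Running ordinary NMS on the shifted boxes then suppresses only within a class.
    """
    max_coord = max(max(b) for b in boxes)
    return [tuple(v + cid * (max_coord + 1) for v in b)
            for b, cid in zip(boxes, class_ids)]


print(soft_nms_decay(0.8, 0.0))            # 0.8 (no overlap, score unchanged)
print(round(soft_nms_decay(0.8, 0.7), 3))  # 0.3 (heavy overlap, strongly decayed)

# Identical boxes of different classes no longer intersect after the shift.
print(offset_boxes_by_class([(0, 0, 10, 10), (0, 0, 10, 10)], [0, 1]))
# [(0, 0, 10, 10), (11, 11, 21, 21)]
```

Soft-NMS helps in crowded scenes where two real objects overlap heavily, since the second box keeps a reduced score rather than being deleted.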

Evaluation Metrics

Intersection over Union (IoU)

IoU = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| - |A ∩ B|)

A prediction is a true positive if IoU with a ground truth box exceeds a threshold (typically 0.5).
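A sketch of how predictions are labelled as true or false positives under this rule. It assumes the common benchmark convention that each ground-truth box can be matched at most once, so duplicate detections of the same object count as false positives; `match_detections` is a hypothetical name.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(preds, gts, iou_threshold=0.5):
    """preds: list of (score, box); gts: list of boxes.

    Returns TP flags in descending-score order.
    """
    used = set()
    flags = []
    for _score, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_j, best_iou = None, iou_threshold
        for j, gt in enumerate(gts):
            if j in used:
                continue
            v = iou(box, gt)
            if v > best_iou:  # must exceed the threshold and beat earlier candidates
                best_j, best_iou = j, v
        if best_j is None:
            flags.append(False)
        else:
            used.add(best_j)
            flags.append(True)
    return flags


gts = [(0, 0, 10, 10)]
preds = [(0.9, (0, 0, 10, 9)), (0.8, (0, 0, 10, 10)), (0.7, (20, 20, 30, 30))]
print(match_detections(preds, gts))  # [True, False, False]
```

The second prediction is a perfect box but arrives after the object is already claimed, so it is a false positive. This is why duplicate suppression matters for the metrics.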

Precision-Recall and AP

  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)
  • AP: area under the precision-recall curve (COCO interpolates precision at 101 recall levels; PASCAL VOC originally used 11)
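A minimal sketch of interpolated AP for one class, assuming the detections have already been sorted by descending confidence and labelled as TP or FP, and that there is at least one ground-truth object. `average_precision` is an illustrative name, not a library function.

```python
def average_precision(tp_flags, num_gt, num_points=101):
    """Interpolated AP from TP/FP flags sorted by descending confidence."""
    precisions, recalls = [], []
    tp = 0
    for k, flag in enumerate(tp_flags, start=1):
        tp += int(flag)
        precisions.append(tp / k)       # TP / (TP + FP) among the top-k detections
        recalls.append(tp / num_gt)     # TP / (TP + FN)
    # Interpolation: precision at recall r is the best precision at any recall >= r.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    total = 0.0
    for n in range(num_points):
        r = n / (num_points - 1)
        # First operating point reaching recall r; 0 if that recall is never reached.
        total += next((p for p, rc in zip(precisions, recalls) if rc >= r), 0.0)
    return total / num_points


print(average_precision([True, True], num_gt=2))             # 1.0
print(round(average_precision([True, False], num_gt=2), 3))  # 0.505
```

In the second case only half the objects are found, so precision is counted as zero for every recall level above 0.5.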

Mean Average Precision (mAP)

mAP = (1/C) * sum_{c=1}^{C} AP_c

COCO mAP (standard):

  • mAP@0.5: AP at IoU threshold 0.5 (PASCAL VOC metric)
  • mAP@0.75: stricter localization
  • mAP@[0.5:0.95]: averaged over IoU thresholds 0.5, 0.55, ..., 0.95 (primary COCO metric)
  • mAP_S, mAP_M, mAP_L: by object area in pixels (small < 32^2, medium between 32^2 and 96^2, large > 96^2)

GIoU, DIoU, CIoU

Improved IoU losses for box regression:

  • GIoU: GIoU = IoU - |C \ (A ∪ B)| / |C| where C is smallest enclosing box
  • DIoU: adds penalty for center distance
  • CIoU: adds aspect ratio consistency term
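The GIoU definition can be sketched for axis-aligned boxes as follows, assuming (x1, y1, x2, y2) tuples with positive area. The example shows the property that motivates it: disjoint boxes all have IoU 0, while GIoU still reflects how far apart they are.

```python
def giou(a, b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # C: smallest axis-aligned box enclosing both a and b.
    c_area = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (c_area - union) / c_area


print(giou((0, 0, 1, 1), (0, 0, 1, 1)))            # 1.0
print(round(giou((0, 0, 1, 1), (2, 0, 3, 1)), 3))  # -0.333
print(round(giou((0, 0, 1, 1), (5, 0, 6, 1)), 3))  # -0.667
```

As a regression loss, 1 - GIoU therefore still provides a useful signal when the predicted box does not overlap the target at all.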

Current Directions

  • Open-vocabulary detection: detect novel categories using language embeddings (OWL-ViT, Grounding DINO)
  • Foundation models: pre-train on large-scale data, fine-tune or zero-shot (DINO, Florence)
  • Efficient architectures: mobile-friendly detectors (EfficientDet, NanoDet, YOLO-NAS)
  • Rotated detection: oriented bounding boxes for aerial images and text detection

Key Takeaways

| Concept | Core Idea |
|---------|-----------|
| Faster R-CNN | RPN + RoI pooling; dominant two-stage paradigm |
| YOLO | Grid-based single-shot regression; speed-optimized |
| Focal loss | Down-weight easy negatives to solve class imbalance |
| Anchor-free | Predict box properties per-pixel (FCOS) or per-center (CenterNet) |
| DETR | Transformer + Hungarian matching; end-to-end, no NMS |
| mAP@[0.5:0.95] | Standard evaluation across IoU thresholds |