
Image Segmentation

Overview

Segmentation partitions an image into meaningful regions. Three main tasks:

| Task | Output | Granularity |
|------|--------|-------------|
| Semantic segmentation | Per-pixel class label | No instance distinction |
| Instance segmentation | Per-pixel class + instance ID | Distinguishes individual objects |
| Panoptic segmentation | Semantic + instance combined | Stuff (sky, road) + things (car, person) |


Classical Methods

Thresholding

Simplest segmentation: separate foreground/background by intensity threshold. Otsu's method selects the optimal threshold automatically by maximizing inter-class variance (see Image Fundamentals). Multi-level Otsu extends to K classes.
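A minimal sketch of Otsu's method, assuming an 8-bit grayscale image as a numpy array (the toy image below is illustrative):

```python
import numpy as np

def otsu_threshold(image):
    """Pick the threshold that maximizes inter-class variance."""
    hist = np.bincount(image.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()   # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0        # class means
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2          # inter-class variance
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

# Bimodal toy image: dark background around 50, bright foreground around 200
img = np.concatenate([np.full(500, 50), np.full(500, 200)]).astype(np.uint8)
t = otsu_threshold(img)   # lands between the two modes
```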

Region Growing

  1. Select seed pixels
  2. Iteratively add neighboring pixels that satisfy a similarity criterion (intensity difference < threshold)
  3. Stop when no more pixels can be added

Sensitive to seed selection and noise. Often used in medical imaging with user-provided seeds.
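The three steps above can be sketched as a breadth-first flood from the seed; the tolerance parameter and 4-connectivity here are illustrative choices:

```python
import numpy as np
from collections import deque

def region_grow(image, seed, tol=10):
    """Grow a region from `seed`, adding 4-neighbors whose intensity
    is within `tol` of the seed value."""
    h, w = image.shape
    seed_val = int(image[seed])
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    q = deque([seed])
    while q:                                  # stop when no pixel can be added
        y, x = q.popleft()
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w and not mask[ny, nx]
                    and abs(int(image[ny, nx]) - seed_val) < tol):
                mask[ny, nx] = True
                q.append((ny, nx))
    return mask

img = np.zeros((8, 8), dtype=np.uint8)
img[2:6, 2:6] = 100                 # bright 4x4 square on dark background
grown = region_grow(img, (3, 3))    # seed inside the square
```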

Watershed Transform

Treats the gradient magnitude image as a topographic surface:

  1. Markers: identify local minima (or user-provided markers)
  2. Flooding: simulate water rising from each minimum
  3. Dams: build barriers where waters from different basins meet

Over-segmentation is common. Marker-controlled watershed mitigates this by using pre-computed markers (e.g., from distance transform on binary image).

Mean Shift

Non-parametric clustering in spatial-color feature space:

  1. For each pixel, initialize a window at (x, y, L, a, b)
  2. Shift window to the mean of points within the kernel bandwidth
  3. Converge to mode (density peak)
  4. Merge pixels converging to the same mode

Parameters: spatial bandwidth h_s and range bandwidth h_r. Produces variable-sized, irregularly shaped segments.
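The shift-to-the-mean update (steps 2-3) is easiest to see in one dimension; this sketch uses a flat kernel and made-up points with two density modes:

```python
import numpy as np

def mean_shift_mode(points, start, bandwidth, iters=50):
    """Repeatedly shift a window to the mean of the points inside it
    until it settles on a density mode."""
    x = float(start)
    for _ in range(iters):
        in_window = points[np.abs(points - x) <= bandwidth]
        new_x = in_window.mean()
        if abs(new_x - x) < 1e-6:     # converged to a mode
            break
        x = new_x
    return x

pts = np.array([0.9, 1.0, 1.1, 4.9, 5.0, 5.1])   # modes near 1 and 5
m1 = mean_shift_mode(pts, start=0.5, bandwidth=1.0)
m2 = mean_shift_mode(pts, start=5.5, bandwidth=1.0)
```

Starts in different basins converge to different modes; in the image setting, pixels whose trajectories end at the same mode are merged into one segment.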


Graph-Based Methods

Graph Cuts

Model image as a graph: pixels are nodes, edges connect neighbors with weights based on similarity.

Min-cut formulation: minimize energy:

E(L) = sum_i D_i(l_i) + lambda * sum_{(i,j)} V_{ij}(l_i, l_j)
  • D_i: data term (how well label l_i fits pixel i)
  • V_{ij}: smoothness term (penalty for different labels at neighbors)
  • Solved via max-flow/min-cut (alpha-expansion for multi-label)
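Evaluating the energy for a toy labeling makes the two terms concrete; this sketch uses a Potts smoothness term V = [l_i != l_j] and invented data costs:

```python
def energy(labels, data_cost, neighbors, lam=1.0):
    """E(L) = sum_i D_i(l_i) + lambda * sum_(i,j) V_ij(l_i, l_j)."""
    d = sum(data_cost[i][l] for i, l in enumerate(labels))   # data term
    v = sum(labels[i] != labels[j] for i, j in neighbors)    # Potts smoothness
    return d + lam * v

# Four pixels in a row; label 0 = background, 1 = foreground
data_cost = [
    {0: 0.1, 1: 2.0},   # pixel 0 clearly background
    {0: 0.2, 1: 1.5},
    {0: 1.8, 1: 0.3},
    {0: 2.0, 1: 0.1},   # pixel 3 clearly foreground
]
neighbors = [(0, 1), (1, 2), (2, 3)]
smooth = energy([0, 0, 1, 1], data_cost, neighbors)   # one label boundary
noisy = energy([0, 1, 0, 1], data_cost, neighbors)    # three boundaries
```

Min-cut finds the labeling minimizing this energy; the smooth labeling scores lower here, as expected.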

Normalized Cuts (Ncut)

Minimize the cut cost normalized by total edge connections:

Ncut(A, B) = cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V)

Avoids bias toward cutting small isolated segments. Solved as a generalized eigenvalue problem:

(D - W) * y = lambda * D * y

where W is the affinity matrix and D is the degree matrix. The second-smallest eigenvector (Fiedler vector) gives the bipartition.
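A small numeric sketch of the spectral step: the generalized problem is solved via the equivalent symmetric form D^{-1/2}(D - W)D^{-1/2}, and the sign of the Fiedler vector gives the bipartition. The 4-node affinity matrix is invented (two tight pairs, weakly linked):

```python
import numpy as np

W = np.array([
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0, 0.0],
])
D = np.diag(W.sum(axis=1))                         # degree matrix
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_sym = D_inv_sqrt @ (D - W) @ D_inv_sqrt          # symmetric normalized Laplacian
vals, vecs = np.linalg.eigh(L_sym)                 # eigenvalues ascending
fiedler = D_inv_sqrt @ vecs[:, 1]                  # second-smallest eigenvector
partition = fiedler > 0                            # sign gives the bipartition
```

The cut separates the two strongly connected pairs {0, 1} and {2, 3} rather than isolating a single node.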

GrabCut

Interactive segmentation combining graph cuts with GMM color models:

  1. User draws bounding box around object
  2. Initialize foreground/background GMMs (typically K=5 components each)
  3. Iteratively: assign pixels to GMM components, update GMMs, run graph cut
  4. User can refine with scribbles

Superpixels

Over-segment the image into perceptually uniform regions that respect edges. Reduce the number of primitives from millions of pixels to hundreds of superpixels.

SLIC (Simple Linear Iterative Clustering)

Adapted k-means in 5D space (x, y, L, a, b):

  1. Initialize K cluster centers on a regular grid with spacing S = sqrt(N/K)
  2. For each center, assign pixels within a 2S x 2S neighborhood using distance:
    D = sqrt(d_c^2 + (d_s / S)^2 * m^2)
    
    where d_c is color distance (LAB), d_s is spatial distance, m controls compactness
  3. Update cluster centers to mean of assigned pixels
  4. Iterate until convergence (typically 5-10 iterations)

O(N) complexity -- independent of K. Produces compact, roughly uniform superpixels.
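The combined distance in step 2 trades off color similarity against spatial proximity; a quick sketch with illustrative numbers shows how the compactness parameter m shifts that balance:

```python
import numpy as np

def slic_distance(d_c, d_s, S, m):
    """D = sqrt(d_c^2 + (d_s/S)^2 * m^2): larger m weighs spatial
    proximity more heavily, yielding more compact superpixels."""
    return np.sqrt(d_c ** 2 + (d_s / S) ** 2 * m ** 2)

# Same color and spatial offsets, two compactness settings
loose = slic_distance(d_c=10.0, d_s=15.0, S=20, m=5)
tight = slic_distance(d_c=10.0, d_s=15.0, S=20, m=40)
```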


Deep Semantic Segmentation

Fully Convolutional Networks (FCN)

First end-to-end deep segmentation model (Long et al., 2015):

  • Replace FC layers in classification CNN with 1x1 convolutions
  • Transposed convolutions (deconvolution) for upsampling
  • Skip connections: fuse high-resolution shallow features with deep semantic features
  • Variants: FCN-32s, FCN-16s, FCN-8s (number = upsampling stride)

U-Net

Encoder-decoder architecture with skip connections:

Encoder (contracting path):
  conv-conv-pool x4 (halve resolution, double channels)

Decoder (expanding path):
  upconv-concat(skip)-conv-conv x4 (double resolution, halve channels)
  • Skip connections concatenate encoder features to decoder at matching resolution
  • Excellent with limited training data (designed for biomedical imaging)
  • Variants: U-Net++, Attention U-Net, nnU-Net (self-configuring)
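The halve/double bookkeeping above can be traced without any framework; this shape-walk sketch (assuming a clean power-of-two input and a 64-channel first stage) checks that each decoder stage matches the encoder stage it concatenates with:

```python
def unet_shapes(h=256, w=256, c=64, depth=4):
    """Trace (height, width, channels) through the contracting and
    expanding paths of a U-Net-style encoder-decoder."""
    enc = []
    for _ in range(depth):
        enc.append((h, w, c))               # saved for the skip connection
        h, w, c = h // 2, w // 2, c * 2     # pool halves res, conv doubles channels
    # bottleneck is now (h, w, c)
    dec = []
    for sh, sw, sc in reversed(enc):
        h, w, c = h * 2, w * 2, c // 2      # upconv doubles res, halves channels
        assert (h, w, c) == (sh, sw, sc)    # matches the saved encoder stage
        dec.append((h, w, c))
    return enc, dec

enc, dec = unet_shapes()
```

With these defaults the bottleneck is reached at (16, 16, 1024), and the decoder returns to the encoder's (256, 256, 64) resolution stage by stage.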

DeepLab Family

Key innovations across versions:

| Version | Key Contribution |
|---------|------------------|
| v1 | Atrous (dilated) convolution for enlarged receptive field without losing resolution |
| v2 | Atrous Spatial Pyramid Pooling (ASPP): parallel atrous convolutions at multiple rates |
| v3 | Improved ASPP with image-level features (global average pooling branch) |
| v3+ | Encoder-decoder with ASPP; Xception backbone |

Atrous convolution with rate r:

y[i] = sum_k x[i + r*k] * w[k]

Effective receptive field = k + (k-1)(r-1) without increasing parameters.
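A 1-D sketch of the definition above, computing only the valid output region; note the rate widens the span covered by the same three taps:

```python
import numpy as np

def atrous_conv1d(x, w, r):
    """y[i] = sum_k x[i + r*k] * w[k] -- 1-D atrous convolution."""
    k = len(w)
    span = (k - 1) * r + 1            # effective receptive field k + (k-1)(r-1)
    out_len = len(x) - span + 1
    return np.array([sum(x[i + r * j] * w[j] for j in range(k))
                     for i in range(out_len)])

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])
y1 = atrous_conv1d(x, w, r=1)   # span 3: ordinary convolution
y2 = atrous_conv1d(x, w, r=2)   # span 5: same parameter count, wider context
```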

CRF post-processing (v1-v2): fully connected CRF refines boundaries using color and position pairwise potentials.

SegFormer

Transformer-based segmentation:

  • Hierarchical vision transformer encoder (Mix Transformer, MiT)
  • Overlapping patch embeddings (avoids positional interpolation issues)
  • Efficient self-attention with reduction ratio
  • Lightweight MLP decoder (no complex upsampling)
  • Scales from B0 (small) to B5 (large)

Instance Segmentation

Mask R-CNN

Extends Faster R-CNN with a parallel mask prediction branch:

  1. Backbone: ResNet/FPN extracts multi-scale features
  2. Region Proposal Network (RPN): proposes candidate bounding boxes
  3. RoI Align: bilinear interpolation for pixel-aligned feature extraction (fixes RoI Pooling's quantization)
  4. Three heads (parallel):
    • Classification head: object category
    • Box regression head: refined bounding box
    • Mask head: K binary masks (one per class) at 28x28 resolution

Key insight: decoupling mask and class prediction (per-class binary masks) avoids competition between classes.

Loss: L = L_cls + L_box + L_mask
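The bilinear interpolation at the heart of RoI Align (step 3 above) can be sketched as sampling a feature map at a fractional coordinate, which RoI Pooling's integer quantization cannot do:

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a 2-D feature map at fractional (y, x) by blending the
    four surrounding values with bilinear weights."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) +
            feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) +
            feat[y1, x1] * dy * dx)

feat = np.array([[0.0, 1.0],
                 [2.0, 3.0]])
v = bilinear_sample(feat, 0.5, 0.5)   # equal blend of all four corners
```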

Cascade Mask R-CNN

Multiple detection stages with increasing IoU thresholds for progressive refinement.


Segment Anything Model (SAM)

Foundation model for promptable segmentation (Meta, 2023):

Architecture

  • Image encoder: ViT-H (MAE pre-trained), runs once per image
  • Prompt encoder: encodes points, boxes, masks, or text as embeddings
  • Mask decoder: lightweight transformer decoder, outputs multiple valid masks with confidence scores

Training

  • SA-1B dataset: 11M images, 1.1B masks (semi-automatically generated)
  • Three-stage data engine: assisted-manual, semi-automatic, fully automatic
  • Trained with focal loss + dice loss

Capabilities

  • Zero-shot transfer: segment anything without task-specific training
  • Ambiguity-aware: outputs multiple masks for ambiguous prompts (e.g., a point on a person's shirt could mean shirt, person, or group)
  • SAM 2: extends to video with memory-based tracking across frames

Loss Functions for Segmentation

Cross-Entropy Loss

L_CE = -(1/N) * sum_i sum_c y_{ic} * log(p_{ic})

Standard per-pixel classification loss. Weighted version handles class imbalance.
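A direct transcription of the formula for one-hot targets, with a small epsilon for numerical safety (the probabilities below are made up):

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    """L_CE = -(1/N) * sum_i sum_c y_ic * log(p_ic)."""
    return -(y * np.log(p + eps)).sum() / p.shape[0]

# Two pixels, three classes; rows are predicted class probabilities
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
loss = cross_entropy(p, y)   # averages -log of the true-class probabilities
```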

Dice Loss

Based on the Dice coefficient (F1 score):

L_dice = 1 - (2 * sum(p * y) + epsilon) / (sum(p) + sum(y) + epsilon)

Directly optimizes overlap. Better for imbalanced datasets (small foreground).
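A minimal soft Dice loss over flattened prediction and target arrays (the toy masks are illustrative):

```python
import numpy as np

def dice_loss(p, y, eps=1e-6):
    """L_dice = 1 - (2*sum(p*y) + eps) / (sum(p) + sum(y) + eps)."""
    return 1.0 - (2.0 * (p * y).sum() + eps) / (p.sum() + y.sum() + eps)

y = np.array([0.0, 0.0, 1.0, 1.0])
perfect = dice_loss(y, y)                             # full overlap -> near 0
miss = dice_loss(np.array([1.0, 1.0, 0.0, 0.0]), y)   # no overlap -> near 1
```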

Focal Loss

Down-weights easy examples to focus on hard ones:

L_focal = -alpha * (1 - p_t)^gamma * log(p_t)

With gamma = 2, well-classified pixels (p_t ~ 1) contribute negligibly to loss.
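The down-weighting is visible with two made-up confidence values, using the common defaults alpha = 0.25, gamma = 2:

```python
import numpy as np

def focal_loss(p_t, alpha=0.25, gamma=2.0):
    """L_focal = -alpha * (1 - p_t)^gamma * log(p_t)."""
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)

easy = focal_loss(0.95)   # well-classified: (1 - p_t)^2 nearly zeroes it out
hard = focal_loss(0.1)    # misclassified: dominates the loss
```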


Evaluation Metrics

| Metric | Formula | Notes |
|--------|---------|-------|
| Pixel accuracy | correct / total | Misleading with class imbalance |
| mIoU | mean over classes of TP / (TP + FP + FN) | Standard metric |
| Boundary F1 | F1 score on boundary pixels | Measures edge quality |
| PQ (Panoptic Quality) | SQ * RQ (segmentation quality * recognition quality) | For panoptic segmentation |


Practical Notes

  • U-Net remains the go-to for medical/scientific imaging with limited data
  • DeepLab v3+ with ResNet-101 is a strong baseline for outdoor scenes
  • For interactive segmentation, SAM has largely superseded GrabCut
  • Multi-scale inference (test-time augmentation) improves mIoU by 1-2%
  • CRF post-processing is mostly obsoleted by modern architectures
  • Superpixels are useful as preprocessing to reduce computation for classical methods

Key Takeaways

| Concept | Core Idea |
|---------|-----------|
| Watershed | Topographic flooding from markers; over-segments |
| Normalized cuts | Spectral clustering via Fiedler vector of graph Laplacian |
| U-Net | Encoder-decoder with skip connections; excels with limited data |
| DeepLab | Atrous convolutions + ASPP for multi-scale context |
| Mask R-CNN | Faster R-CNN + per-instance binary mask branch |
| SAM | Foundation model for promptable zero-shot segmentation |