
Image Segmentation

Overview

Segmentation partitions an image into meaningful regions. Three main tasks:

| Task | Output | Granularity |
|------|--------|-------------|
| Semantic segmentation | Per-pixel class label | No instance distinction |
| Instance segmentation | Per-pixel class + instance ID | Distinguishes individual objects |
| Panoptic segmentation | Semantic + instance combined | Stuff (sky, road) + things (car, person) |


Classical Methods

Thresholding

Simplest segmentation: separate foreground/background by intensity threshold. Otsu's method selects the optimal threshold automatically by maximizing inter-class variance (see Image Fundamentals). Multi-level Otsu extends to K classes.
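A minimal sketch of Otsu's method, assuming an 8-bit grayscale image as a numpy array (the toy image below is illustrative):

```python
import numpy as np

def otsu_threshold(image):
    """Pick the threshold that maximizes inter-class variance."""
    hist = np.bincount(image.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()   # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0        # class means
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2          # inter-class variance
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

# Bimodal toy image: dark background around 50, bright foreground around 200
img = np.concatenate([np.full(500, 50), np.full(500, 200)]).astype(np.uint8)
t = otsu_threshold(img)   # lands between the two modes
```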

Region Growing

  1. Select seed pixels
  2. Iteratively add neighboring pixels that satisfy a similarity criterion (intensity difference < threshold)
  3. Stop when no more pixels can be added

Sensitive to seed selection and noise. Often used in medical imaging with user-provided seeds.
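The three steps above can be sketched as a breadth-first flood from the seed; the tolerance parameter and 4-connectivity here are illustrative choices:

```python
import numpy as np
from collections import deque

def region_grow(image, seed, tol=10):
    """Grow a region from `seed`, adding 4-neighbors whose intensity
    is within `tol` of the seed value."""
    h, w = image.shape
    seed_val = int(image[seed])
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    q = deque([seed])
    while q:                                  # stop when no pixel can be added
        y, x = q.popleft()
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w and not mask[ny, nx]
                    and abs(int(image[ny, nx]) - seed_val) < tol):
                mask[ny, nx] = True
                q.append((ny, nx))
    return mask

img = np.zeros((8, 8), dtype=np.uint8)
img[2:6, 2:6] = 100                 # bright 4x4 square on dark background
grown = region_grow(img, (3, 3))    # seed inside the square
```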

Watershed Transform

Treats the gradient magnitude image as a topographic surface:

  1. Markers: identify local minima (or user-provided markers)
  2. Flooding: simulate water rising from each minimum
  3. Dams: build barriers where waters from different basins meet

Over-segmentation is common. Marker-controlled watershed mitigates this by using pre-computed markers (e.g., from distance transform on binary image).

Mean Shift

Non-parametric clustering in spatial-color feature space:

  1. For each pixel, initialize a window at (x, y, L, a, b)
  2. Shift window to the mean of points within the kernel bandwidth
  3. Converge to mode (density peak)
  4. Merge pixels converging to the same mode

Parameters: spatial bandwidth h_s and range bandwidth h_r. Produces variable-sized, irregularly shaped segments.
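The shift-to-the-mean update (steps 2-3) is easiest to see in one dimension; this sketch uses a flat kernel and made-up points with two density modes:

```python
import numpy as np

def mean_shift_mode(points, start, bandwidth, iters=50):
    """Repeatedly shift a window to the mean of the points inside it
    until it settles on a density mode."""
    x = float(start)
    for _ in range(iters):
        in_window = points[np.abs(points - x) <= bandwidth]
        new_x = in_window.mean()
        if abs(new_x - x) < 1e-6:     # converged to a mode
            break
        x = new_x
    return x

pts = np.array([0.9, 1.0, 1.1, 4.9, 5.0, 5.1])   # modes near 1 and 5
m1 = mean_shift_mode(pts, start=0.5, bandwidth=1.0)
m2 = mean_shift_mode(pts, start=5.5, bandwidth=1.0)
```

Starts in different basins converge to different modes; in the image setting, pixels whose trajectories end at the same mode are merged into one segment.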


Graph-Based Methods

Graph Cuts

Model image as a graph: pixels are nodes, edges connect neighbors with weights based on similarity.

Min-cut formulation: minimize energy:

E(L) = sum_i D_i(l_i) + lambda * sum_{(i,j)} V_{ij}(l_i, l_j)
  • D_i: data term (how well label l_i fits pixel i)
  • V_{ij}: smoothness term (penalty for different labels at neighbors)
  • Solved via max-flow/min-cut (alpha-expansion for multi-label)
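Evaluating the energy for a toy labeling makes the two terms concrete; this sketch uses a Potts smoothness term V = [l_i != l_j] and invented data costs:

```python
def energy(labels, data_cost, neighbors, lam=1.0):
    """E(L) = sum_i D_i(l_i) + lambda * sum_(i,j) V_ij(l_i, l_j)."""
    d = sum(data_cost[i][l] for i, l in enumerate(labels))   # data term
    v = sum(labels[i] != labels[j] for i, j in neighbors)    # Potts smoothness
    return d + lam * v

# Four pixels in a row; label 0 = background, 1 = foreground
data_cost = [
    {0: 0.1, 1: 2.0},   # pixel 0 clearly background
    {0: 0.2, 1: 1.5},
    {0: 1.8, 1: 0.3},
    {0: 2.0, 1: 0.1},   # pixel 3 clearly foreground
]
neighbors = [(0, 1), (1, 2), (2, 3)]
smooth = energy([0, 0, 1, 1], data_cost, neighbors)   # one label boundary
noisy = energy([0, 1, 0, 1], data_cost, neighbors)    # three boundaries
```

Min-cut finds the labeling minimizing this energy; the smooth labeling scores lower here, as expected.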

Normalized Cuts (Ncut)

Minimize the cut cost normalized by total edge connections:

Ncut(A, B) = cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V)

Avoids bias toward cutting small isolated segments. Solved as a generalized eigenvalue problem:

(D - W) * y = lambda * D * y

where W is the affinity matrix and D is the degree matrix. The second-smallest eigenvector (Fiedler vector) gives the bipartition.
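A small numeric sketch of the spectral step: the generalized problem is solved via the equivalent symmetric form D^{-1/2}(D - W)D^{-1/2}, and the sign of the Fiedler vector gives the bipartition. The 4-node affinity matrix is invented (two tight pairs, weakly linked):

```python
import numpy as np

W = np.array([
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0, 0.0],
])
D = np.diag(W.sum(axis=1))                         # degree matrix
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_sym = D_inv_sqrt @ (D - W) @ D_inv_sqrt          # symmetric normalized Laplacian
vals, vecs = np.linalg.eigh(L_sym)                 # eigenvalues ascending
fiedler = D_inv_sqrt @ vecs[:, 1]                  # second-smallest eigenvector
partition = fiedler > 0                            # sign gives the bipartition
```

The cut separates the two strongly connected pairs {0, 1} and {2, 3} rather than isolating a single node.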

GrabCut

Interactive segmentation combining graph cuts with GMM color models:

  1. User draws bounding box around object
  2. Initialize foreground/background GMMs (typically K=5 components each)
  3. Iteratively: assign pixels to GMM components, update GMMs, run graph cut
  4. User can refine with scribbles

Superpixels

Over-segment the image into perceptually uniform regions that respect edges. Reduce the number of primitives from millions of pixels to hundreds of superpixels.

SLIC (Simple Linear Iterative Clustering)

Adapted k-means in 5D space (x, y, L, a, b):

  1. Initialize K cluster centers on a regular grid with spacing S = sqrt(N/K)
  2. For each center, assign pixels within a 2S x 2S neighborhood using distance:
    D = sqrt(d_c^2 + (d_s / S)^2 * m^2)
    
    where d_c is color distance (LAB), d_s is spatial distance, m controls compactness
  3. Update cluster centers to mean of assigned pixels
  4. Iterate until convergence (typically 5-10 iterations)

O(N) complexity -- independent of K. Produces compact, roughly uniform superpixels.
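The combined distance in step 2 trades off color similarity against spatial proximity; a quick sketch with illustrative numbers shows how the compactness parameter m shifts that balance:

```python
import numpy as np

def slic_distance(d_c, d_s, S, m):
    """D = sqrt(d_c^2 + (d_s/S)^2 * m^2): larger m weighs spatial
    proximity more heavily, yielding more compact superpixels."""
    return np.sqrt(d_c ** 2 + (d_s / S) ** 2 * m ** 2)

# Same color and spatial offsets, two compactness settings
loose = slic_distance(d_c=10.0, d_s=15.0, S=20, m=5)
tight = slic_distance(d_c=10.0, d_s=15.0, S=20, m=40)
```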


Deep Semantic Segmentation

Fully Convolutional Networks (FCN)

First end-to-end deep segmentation model (Long et al., 2015):

  • Replace FC layers in classification CNN with 1x1 convolutions
  • Transposed convolutions (deconvolution) for upsampling
  • Skip connections: fuse high-resolution shallow features with deep semantic features
  • Variants: FCN-32s, FCN-16s, FCN-8s (number = upsampling stride)

U-Net

Encoder-decoder architecture with skip connections:

Encoder (contracting path):
  conv-conv-pool x4 (halve resolution, double channels)

Decoder (expanding path):
  upconv-concat(skip)-conv-conv x4 (double resolution, halve channels)
  • Skip connections concatenate encoder features to decoder at matching resolution
  • Excellent with limited training data (designed for biomedical imaging)
  • Variants: U-Net++, Attention U-Net, nnU-Net (self-configuring)
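The halve/double bookkeeping above can be traced without any framework; this shape-walk sketch (assuming a clean power-of-two input and a 64-channel first stage) checks that each decoder stage matches the encoder stage it concatenates with:

```python
def unet_shapes(h=256, w=256, c=64, depth=4):
    """Trace (height, width, channels) through the contracting and
    expanding paths of a U-Net-style encoder-decoder."""
    enc = []
    for _ in range(depth):
        enc.append((h, w, c))               # saved for the skip connection
        h, w, c = h // 2, w // 2, c * 2     # pool halves res, conv doubles channels
    # bottleneck is now (h, w, c)
    dec = []
    for sh, sw, sc in reversed(enc):
        h, w, c = h * 2, w * 2, c // 2      # upconv doubles res, halves channels
        assert (h, w, c) == (sh, sw, sc)    # matches the saved encoder stage
        dec.append((h, w, c))
    return enc, dec

enc, dec = unet_shapes()
```

With these defaults the bottleneck is reached at (16, 16, 1024), and the decoder returns to the encoder's (256, 256, 64) resolution stage by stage.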

DeepLab Family

Key innovations across versions:

| Version | Key Contribution |
|---------|------------------|
| v1 | Atrous (dilated) convolution for enlarged receptive field without losing resolution |
| v2 | Atrous Spatial Pyramid Pooling (ASPP): parallel atrous convolutions at multiple rates |
| v3 | Improved ASPP with image-level features (global average pooling branch) |
| v3+ | Encoder-decoder with ASPP; Xception backbone |

Atrous convolution with rate r:

y[i] = sum_k x[i + r*k] * w[k]

Effective receptive field = k + (k-1)(r-1) without increasing parameters.
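A 1-D sketch of the definition above, computing only the valid output region; note the rate widens the span covered by the same three taps:

```python
import numpy as np

def atrous_conv1d(x, w, r):
    """y[i] = sum_k x[i + r*k] * w[k] -- 1-D atrous convolution."""
    k = len(w)
    span = (k - 1) * r + 1            # effective receptive field k + (k-1)(r-1)
    out_len = len(x) - span + 1
    return np.array([sum(x[i + r * j] * w[j] for j in range(k))
                     for i in range(out_len)])

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])
y1 = atrous_conv1d(x, w, r=1)   # span 3: ordinary convolution
y2 = atrous_conv1d(x, w, r=2)   # span 5: same parameter count, wider context
```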

CRF post-processing (v1-v2): fully connected CRF refines boundaries using color and position pairwise potentials.

SegFormer

Transformer-based segmentation:

  • Hierarchical vision transformer encoder (Mix Transformer, MiT)
  • Overlapping patch embeddings (avoids positional interpolation issues)
  • Efficient self-attention with reduction ratio
  • Lightweight MLP decoder (no complex upsampling)
  • Scales from B0 (small) to B5 (large)

Instance Segmentation

Mask R-CNN

Extends Faster R-CNN with a parallel mask prediction branch:

  1. Backbone: ResNet/FPN extracts multi-scale features
  2. Region Proposal Network (RPN): proposes candidate bounding boxes
  3. RoI Align: bilinear interpolation for pixel-aligned feature extraction (fixes RoI Pooling's quantization)
  4. Three heads (parallel):
    • Classification head: object category
    • Box regression head: refined bounding box
    • Mask head: K binary masks (one per class) at 28x28 resolution

Key insight: decoupling mask and class prediction (per-class binary masks) avoids competition between classes.

Loss: L = L_cls + L_box + L_mask
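The bilinear interpolation at the heart of RoI Align (step 3 above) can be sketched as sampling a feature map at a fractional coordinate, which RoI Pooling's integer quantization cannot do:

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a 2-D feature map at fractional (y, x) by blending the
    four surrounding values with bilinear weights."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) +
            feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) +
            feat[y1, x1] * dy * dx)

feat = np.array([[0.0, 1.0],
                 [2.0, 3.0]])
v = bilinear_sample(feat, 0.5, 0.5)   # equal blend of all four corners
```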

Cascade Mask R-CNN

Multiple detection stages with increasing IoU thresholds for progressive refinement.


Segment Anything Model (SAM)

Foundation model for promptable segmentation (Meta, 2023):

Architecture

  • Image encoder: ViT-H (MAE pre-trained), runs once per image
  • Prompt encoder: encodes points, boxes, masks, or text as embeddings
  • Mask decoder: lightweight transformer decoder, outputs multiple valid masks with confidence scores

Training

  • SA-1B dataset: 11M images, 1.1B masks (semi-automatically generated)
  • Three-stage data engine: assisted-manual, semi-automatic, fully automatic
  • Trained with focal loss + dice loss

Capabilities

  • Zero-shot transfer: segment anything without task-specific training
  • Ambiguity-aware: outputs multiple masks for ambiguous prompts (e.g., a point on a person's shirt could mean shirt, person, or group)
  • SAM 2: extends to video with memory-based tracking across frames

Loss Functions for Segmentation

Cross-Entropy Loss

L_CE = -(1/N) * sum_i sum_c y_{ic} * log(p_{ic})

Standard per-pixel classification loss. Weighted version handles class imbalance.
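A direct transcription of the formula for one-hot targets, with a small epsilon for numerical safety (the probabilities below are made up):

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    """L_CE = -(1/N) * sum_i sum_c y_ic * log(p_ic)."""
    return -(y * np.log(p + eps)).sum() / p.shape[0]

# Two pixels, three classes; rows are predicted class probabilities
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
loss = cross_entropy(p, y)   # averages -log of the true-class probabilities
```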

Dice Loss

Based on the Dice coefficient (F1 score):

L_dice = 1 - (2 * sum(p * y) + epsilon) / (sum(p) + sum(y) + epsilon)

Directly optimizes overlap. Better for imbalanced datasets (small foreground).
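A minimal soft Dice loss over flattened prediction and target arrays (the toy masks are illustrative):

```python
import numpy as np

def dice_loss(p, y, eps=1e-6):
    """L_dice = 1 - (2*sum(p*y) + eps) / (sum(p) + sum(y) + eps)."""
    return 1.0 - (2.0 * (p * y).sum() + eps) / (p.sum() + y.sum() + eps)

y = np.array([0.0, 0.0, 1.0, 1.0])
perfect = dice_loss(y, y)                             # full overlap -> near 0
miss = dice_loss(np.array([1.0, 1.0, 0.0, 0.0]), y)   # no overlap -> near 1
```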

Focal Loss

Down-weights easy examples to focus on hard ones:

L_focal = -alpha * (1 - p_t)^gamma * log(p_t)

With gamma = 2, well-classified pixels (p_t ~ 1) contribute negligibly to loss.
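The down-weighting is visible with two made-up confidence values, using the common defaults alpha = 0.25, gamma = 2:

```python
import numpy as np

def focal_loss(p_t, alpha=0.25, gamma=2.0):
    """L_focal = -alpha * (1 - p_t)^gamma * log(p_t)."""
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)

easy = focal_loss(0.95)   # well-classified: (1 - p_t)^2 nearly zeroes it out
hard = focal_loss(0.1)    # misclassified: dominates the loss
```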


Evaluation Metrics

| Metric | Formula | Notes |
|--------|---------|-------|
| Pixel accuracy | correct / total | Misleading with class imbalance |
| mIoU | mean over classes of TP / (TP + FP + FN) | Standard metric |
| Boundary F1 | F1 score on boundary pixels | Measures edge quality |
| PQ (Panoptic Quality) | SQ * RQ (segmentation quality * recognition quality) | For panoptic segmentation |


Practical Notes

  • U-Net remains the go-to for medical/scientific imaging with limited data
  • DeepLab v3+ with ResNet-101 is a strong baseline for outdoor scenes
  • For interactive segmentation, SAM has largely superseded GrabCut
  • Multi-scale inference (test-time augmentation) improves mIoU by 1-2%
  • CRF post-processing is mostly obsoleted by modern architectures
  • Superpixels are useful as preprocessing to reduce computation for classical methods

Key Takeaways

| Concept | Core Idea |
|---------|-----------|
| Watershed | Topographic flooding from markers; over-segments |
| Normalized cuts | Spectral clustering via Fiedler vector of graph Laplacian |
| U-Net | Encoder-decoder with skip connections; excels with limited data |
| DeepLab | Atrous convolutions + ASPP for multi-scale context |
| Mask R-CNN | Faster R-CNN + per-instance binary mask branch |
| SAM | Foundation model for promptable zero-shot segmentation |