Image Segmentation
Overview
Segmentation partitions an image into meaningful regions. Three main tasks:
| Task | Output | Granularity |
|------|--------|-------------|
| Semantic segmentation | Per-pixel class label | No instance distinction |
| Instance segmentation | Per-pixel class + instance ID | Distinguishes individual objects |
| Panoptic segmentation | Semantic + instance combined | Stuff (sky, road) + things (car, person) |
Classical Methods
Thresholding
Simplest segmentation: separate foreground/background by intensity threshold. Otsu's method selects the optimal threshold automatically by maximizing inter-class variance (see Image Fundamentals). Multi-level Otsu extends to K classes.
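Otsu's criterion can be sketched in a few lines of NumPy. This toy version (the `otsu_threshold` name and the synthetic bimodal image are illustrative) assumes an 8-bit grayscale input:

```python
import numpy as np

def otsu_threshold(img: np.ndarray) -> int:
    """Return the threshold maximizing inter-class variance (Otsu)."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                   # class-0 probability up to t
    mu = np.cumsum(prob * np.arange(256))     # cumulative mean
    mu_total = mu[-1]
    # Inter-class variance: sigma_b^2(t) = (mu_T*omega - mu)^2 / (omega*(1-omega))
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1 - omega))
    sigma_b = np.nan_to_num(sigma_b)
    return int(np.argmax(sigma_b))

# Bimodal test image: dark background ~50, bright foreground ~200
rng = np.random.default_rng(0)
img = np.clip(np.concatenate([
    rng.normal(50, 10, 5000), rng.normal(200, 10, 5000)
]), 0, 255).astype(np.uint8)
t = otsu_threshold(img)   # threshold falls between the two modes
```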
Region Growing
- Select seed pixels
- Iteratively add neighboring pixels that satisfy a similarity criterion (intensity difference < threshold)
- Stop when no more pixels can be added
Sensitive to seed selection and noise. Often used in medical imaging with user-provided seeds.
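The steps above can be sketched in NumPy with a BFS queue. This version compares candidates against the seed intensity (one common choice; comparing against the running region mean is another):

```python
import numpy as np
from collections import deque

def region_grow(img, seed, tol=10):
    """Grow a region from `seed`, adding 4-neighbors whose intensity
    differs from the seed value by less than `tol`."""
    h, w = img.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    q = deque([seed])
    seed_val = float(img[seed])
    while q:
        y, x = q.popleft()
        for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx] \
               and abs(float(img[ny, nx]) - seed_val) < tol:
                mask[ny, nx] = True
                q.append((ny, nx))
    return mask

img = np.full((8, 8), 200, dtype=np.uint8)
img[2:6, 2:6] = 50                     # dark square "object"
mask = region_grow(img, seed=(3, 3))   # grows to cover the square only
```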
Watershed Transform
Treats the gradient magnitude image as a topographic surface:
- Markers: identify local minima (or user-provided markers)
- Flooding: simulate water rising from each minimum
- Dams: build barriers where waters from different basins meet
Over-segmentation is common. Marker-controlled watershed mitigates this by using pre-computed markers (e.g., from distance transform on binary image).
Mean Shift
Non-parametric clustering in spatial-color feature space:
- For each pixel, initialize a window at (x, y, L, a, b)
- Shift window to the mean of points within the kernel bandwidth
- Converge to mode (density peak)
- Merge pixels converging to the same mode
Parameters: spatial bandwidth h_s and range bandwidth h_r. Produces variable-sized, irregularly shaped segments.
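A toy NumPy sketch of the mode-seeking loop with a flat kernel, run on a 2-D stand-in for the (x, y, L, a, b) feature space (function name and data are illustrative):

```python
import numpy as np

def mean_shift_modes(points, bandwidth=1.0, n_iter=50):
    """Shift every point to the mean of the original points within
    `bandwidth` (flat kernel) until it settles on a density mode."""
    shifted = points.astype(float).copy()
    for _ in range(n_iter):
        for i, p in enumerate(shifted):
            dist = np.linalg.norm(points - p, axis=1)
            shifted[i] = points[dist < bandwidth].mean(axis=0)
    return shifted

# Two clusters in a toy 2-D feature space
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
modes = mean_shift_modes(pts, bandwidth=1.0)
# Points converging to the same mode would be merged into one segment
```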
Graph-Based Methods
Graph Cuts
Model image as a graph: pixels are nodes, edges connect neighbors with weights based on similarity.
Min-cut formulation: minimize energy:
E(L) = sum_i D_i(l_i) + lambda * sum_{(i,j)} V_{ij}(l_i, l_j)
- D_i: data term (how well label l_i fits pixel i)
- V_{ij}: smoothness term (penalty for different labels at neighbors)
- Solved via max-flow/min-cut (alpha-expansion for multi-label)
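The energy E(L) can be evaluated directly. A toy NumPy sketch with a Potts smoothness term on a 4-neighbor grid (helper name and 2x2 example are illustrative; a real solver would minimize this via max-flow rather than just score labelings):

```python
import numpy as np

def energy(labels, data_cost, lam=1.0):
    """E(L) = sum_i D_i(l_i) + lambda * sum_{(i,j)} V_ij(l_i, l_j),
    with Potts smoothness V_ij = [l_i != l_j] over 4-neighbors."""
    h, w = labels.shape
    # Data term: each pixel's cost under its assigned label
    d = data_cost[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    # Smoothness term: count label discontinuities between 4-neighbors
    v = (labels[:, 1:] != labels[:, :-1]).sum() \
      + (labels[1:, :] != labels[:-1, :]).sum()
    return d + lam * v

# 2x2 image, 2 labels; data_cost[y, x, l] = cost of giving pixel (y,x) label l
data_cost = np.array([[[0, 5], [0, 5]],
                      [[5, 0], [5, 0]]], dtype=float)
smooth_labeling = np.array([[0, 0], [1, 1]])   # matches the data, one clean edge
noisy_labeling  = np.array([[0, 1], [1, 0]])   # fights the data, many edges
e1 = energy(smooth_labeling, data_cost)
e2 = energy(noisy_labeling, data_cost)
```

The min-cut solution is the labeling with the lowest such energy; here the smooth labeling wins.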
Normalized Cuts (Ncut)
Minimize the cut cost normalized by total edge connections:
Ncut(A, B) = cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V)
Avoids bias toward cutting small isolated segments. Solved as a generalized eigenvalue problem:
(D - W) * y = lambda * D * y
where W is the affinity matrix and D is the degree matrix. The second-smallest eigenvector (Fiedler vector) gives the bipartition.
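On a small graph the generalized eigenproblem can be solved directly with `scipy.linalg.eigh`. The 4-node affinity matrix below (two tight pairs, weakly connected across) is a toy example:

```python
import numpy as np
from scipy.linalg import eigh

# Affinity matrix: nodes (0,1) and (2,3) are tight pairs, weak links across
W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.1, 1.0, 0.0]])
D = np.diag(W.sum(axis=1))        # degree matrix

# Generalized eigenproblem (D - W) y = lambda * D y
vals, vecs = eigh(D - W, D)
fiedler = vecs[:, 1]              # second-smallest eigenvector
partition = fiedler > 0           # its sign pattern gives the bipartition
```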
GrabCut
Interactive segmentation combining graph cuts with GMM color models:
- User draws bounding box around object
- Initialize foreground/background GMMs (typically K=5 components each)
- Iteratively: assign pixels to GMM components, update GMMs, run graph cut
- User can refine with scribbles
Superpixels
Over-segment the image into perceptually uniform regions that respect edges, reducing the number of primitives from millions of pixels to hundreds of superpixels.
SLIC (Simple Linear Iterative Clustering)
Adapted k-means in 5D space (x, y, L, a, b):
- Initialize K cluster centers on a regular grid with spacing S = sqrt(N/K)
- For each center, assign pixels within a 2S x 2S neighborhood using distance D = sqrt(d_c^2 + (d_s / S)^2 * m^2), where d_c is color distance (LAB), d_s is spatial distance, and m controls compactness
- Update cluster centers to the mean of assigned pixels
- Iterate until convergence (typically 5-10 iterations)
O(N) complexity -- independent of K. Produces compact, roughly uniform superpixels.
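The loop above can be sketched in NumPy, assuming a float (H, W, 3) LAB-like input (the `slic` helper and the toy quadrant image are illustrative; real implementations add a connectivity-enforcement post-pass):

```python
import numpy as np

def slic(image, K=4, m=10.0, n_iter=5):
    """Minimal SLIC on an (H, W, 3) LAB-like image. Returns a label map."""
    H, W, _ = image.shape
    S = int(np.sqrt(H * W / K))                  # grid spacing S = sqrt(N/K)
    ys, xs = np.arange(S // 2, H, S), np.arange(S // 2, W, S)
    centers = np.array([[y, x, *image[y, x]] for y in ys for x in xs], float)

    yy, xx = np.mgrid[0:H, 0:W]
    for _ in range(n_iter):
        dist = np.full((H, W), np.inf)
        labels = np.zeros((H, W), dtype=int)
        for k, (cy, cx, *cc) in enumerate(centers):
            # Search only a 2S x 2S window around each center -> O(N) total
            y0, y1 = max(int(cy) - S, 0), min(int(cy) + S + 1, H)
            x0, x1 = max(int(cx) - S, 0), min(int(cx) + S + 1, W)
            d_c = np.linalg.norm(image[y0:y1, x0:x1] - np.array(cc), axis=-1)
            d_s = np.hypot(yy[y0:y1, x0:x1] - cy, xx[y0:y1, x0:x1] - cx)
            d = np.sqrt(d_c**2 + (d_s / S)**2 * m**2)
            closer = d < dist[y0:y1, x0:x1]
            dist[y0:y1, x0:x1][closer] = d[closer]
            labels[y0:y1, x0:x1][closer] = k
        for k in range(len(centers)):            # update centers to cluster means
            mask = labels == k
            if mask.any():
                centers[k] = [yy[mask].mean(), xx[mask].mean(),
                              *image[mask].mean(axis=0)]
    return labels

# Toy image with four flat color quadrants: superpixels should align with them
img = np.zeros((20, 20, 3))
img[:10, 10:] = [50, 0, 0]
img[10:, :10] = [0, 50, 0]
img[10:, 10:] = [0, 0, 50]
labels = slic(img, K=4)
```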
Deep Semantic Segmentation
Fully Convolutional Networks (FCN)
First end-to-end deep segmentation model (Long et al., 2015):
- Replace FC layers in classification CNN with 1x1 convolutions
- Transposed convolutions (deconvolution) for upsampling
- Skip connections: fuse high-resolution shallow features with deep semantic features
- Variants: FCN-32s, FCN-16s, FCN-8s (number = upsampling stride)
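The upsampling step can be illustrated with a 1-D transposed convolution in NumPy (a sketch of the operation, not FCN's actual layers); with a fixed triangle kernel it reduces to linear interpolation, whereas FCN learns the kernel:

```python
import numpy as np

def transposed_conv1d(x, w, stride=2):
    """1-D transposed convolution: each input value 'stamps' a scaled
    copy of the kernel into the output at stride-spaced positions."""
    k = len(w)
    out = np.zeros(stride * (len(x) - 1) + k)
    for i, v in enumerate(x):
        out[i * stride : i * stride + k] += v * w
    return out

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 1.0, 0.5])      # triangle kernel -> linear interpolation
y = transposed_conv1d(x, w)        # output length 2*(3-1)+3 = 7
```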
U-Net
Encoder-decoder architecture with skip connections:
Encoder (contracting path):
conv-conv-pool x4 (halve resolution, double channels)
Decoder (expanding path):
upconv-concat(skip)-conv-conv x4 (double resolution, halve channels)
- Skip connections concatenate encoder features to decoder at matching resolution
- Excellent with limited training data (designed for biomedical imaging)
- Variants: U-Net++, Attention U-Net, nnU-Net (self-configuring)
DeepLab Family
Key innovations across versions:
| Version | Key Contribution |
|---------|-----------------|
| v1 | Atrous (dilated) convolution for enlarged receptive field without losing resolution |
| v2 | Atrous Spatial Pyramid Pooling (ASPP): parallel atrous convolutions at multiple rates |
| v3 | Improved ASPP with image-level features (global average pooling branch) |
| v3+ | Encoder-decoder with ASPP; Xception backbone |
Atrous convolution with rate r:
y[i] = sum_k x[i + r*k] * w[k]
Effective receptive field = k + (k-1)(r-1) without increasing parameters.
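The formula translates directly to NumPy. A 1-D sketch with valid padding (helper name is illustrative); note the output with rate 2 is shorter because the same 3 weights now span 5 input samples:

```python
import numpy as np

def atrous_conv1d(x, w, r=1):
    """y[i] = sum_k x[i + r*k] * w[k], valid positions only."""
    k = len(w)
    span = (k - 1) * r + 1                 # effective receptive field
    return np.array([np.dot(x[i:i + span:r], w)
                     for i in range(len(x) - span + 1)])

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])
y1 = atrous_conv1d(x, w, r=1)   # ordinary convolution, receptive field 3
y2 = atrous_conv1d(x, w, r=2)   # same 3 weights, receptive field 5
```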
CRF post-processing (v1-v2): fully connected CRF refines boundaries using color and position pairwise potentials.
SegFormer
Transformer-based segmentation:
- Hierarchical vision transformer encoder (Mix Transformer, MiT)
- Overlapping patch embeddings (avoids positional interpolation issues)
- Efficient self-attention with reduction ratio
- Lightweight MLP decoder (no complex upsampling)
- Scales from B0 (small) to B5 (large)
Instance Segmentation
Mask R-CNN
Extends Faster R-CNN with a parallel mask prediction branch:
- Backbone: ResNet/FPN extracts multi-scale features
- Region Proposal Network (RPN): proposes candidate bounding boxes
- RoI Align: bilinear interpolation for pixel-aligned feature extraction (fixes RoI Pooling's quantization)
- Three heads (parallel):
- Classification head: object category
- Box regression head: refined bounding box
- Mask head: K binary masks (one per class) at 28x28 resolution
Key insight: decoupling mask and class prediction (per-class binary masks) avoids competition between classes.
Loss: L = L_cls + L_box + L_mask
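The core of RoI Align is bilinear sampling at continuous coordinates, which avoids RoI Pooling's integer quantization. A single-channel NumPy sketch (helper name is illustrative; real RoI Align averages several such samples per output bin):

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample feature map `feat` at continuous (y, x) by blending the
    four surrounding grid values, weighted by proximity."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)
    x1 = min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) +
            feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) +
            feat[y1, x1] * dy * dx)

feat = np.array([[0.0, 1.0],
                 [2.0, 3.0]])
v = bilinear_sample(feat, 0.5, 0.5)   # exact center -> average of all 4 values
```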
Cascade Mask R-CNN
Multiple detection stages with increasing IoU thresholds for progressive refinement.
Segment Anything Model (SAM)
Foundation model for promptable segmentation (Meta, 2023):
Architecture
- Image encoder: ViT-H (MAE pre-trained), runs once per image
- Prompt encoder: encodes points, boxes, masks, or text as embeddings
- Mask decoder: lightweight transformer decoder, outputs multiple valid masks with confidence scores
Training
- SA-1B dataset: 11M images, 1.1B masks (semi-automatically generated)
- Three-stage data engine: assisted-manual, semi-automatic, fully automatic
- Trained with focal loss + dice loss
Capabilities
- Zero-shot transfer: segment anything without task-specific training
- Ambiguity-aware: outputs multiple masks for ambiguous prompts (e.g., a point on a person's shirt could mean shirt, person, or group)
- SAM 2: extends to video with memory-based tracking across frames
Loss Functions for Segmentation
Cross-Entropy Loss
L_CE = -(1/N) * sum_i sum_c y_{ic} * log(p_{ic})
Standard per-pixel classification loss. Weighted version handles class imbalance.
Dice Loss
Based on the Dice coefficient (F1 score):
L_dice = 1 - (2 * sum(p * y) + epsilon) / (sum(p) + sum(y) + epsilon)
Directly optimizes overlap. Better for imbalanced datasets (small foreground).
Focal Loss
Down-weights easy examples to focus on hard ones:
L_focal = -alpha * (1 - p_t)^gamma * log(p_t)
With gamma = 2, well-classified pixels (p_t ~ 1) contribute negligibly to loss.
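The three losses translate directly to NumPy. A toy sketch on binary predictions (function names and epsilon defaults are illustrative choices, not a library API):

```python
import numpy as np

def cross_entropy(p, y, eps=1e-7):
    """Mean per-pixel CE; p, y are (N, C) predicted probs and one-hot targets."""
    return -np.mean(np.sum(y * np.log(p + eps), axis=1))

def dice_loss(p, y, eps=1.0):
    """1 - Dice coefficient for binary predictions p and targets y."""
    return 1 - (2 * np.sum(p * y) + eps) / (np.sum(p) + np.sum(y) + eps)

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss: (1 - p_t)^gamma down-weights easy pixels."""
    p_t = np.where(y == 1, p, 1 - p)
    return -np.mean(alpha * (1 - p_t) ** gamma * np.log(p_t + eps))

y    = np.array([1.0, 1.0, 0.0, 0.0])
good = np.array([0.9, 0.8, 0.1, 0.2])   # confident, mostly correct
bad  = np.array([0.4, 0.3, 0.6, 0.7])   # mostly wrong
# dice_loss and focal_loss are both lower for `good` than for `bad`

probs  = np.array([[0.9, 0.1], [0.2, 0.8]])
onehot = np.array([[1.0, 0.0], [0.0, 1.0]])
ce = cross_entropy(probs, onehot)
```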
Evaluation Metrics
| Metric | Formula | Notes |
|--------|---------|-------|
| Pixel accuracy | correct / total | Misleading with class imbalance |
| mIoU | mean over classes of TP / (TP + FP + FN) | Standard metric |
| Boundary F1 | F1 score on boundary pixels | Measures edge quality |
| PQ (Panoptic Quality) | SQ * RQ (segmentation quality * recognition quality) | For panoptic segmentation |
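mIoU as defined in the table can be sketched in NumPy (helper name is illustrative; classes absent from both prediction and ground truth are skipped, one common convention):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """mIoU = mean over classes of TP / (TP + FP + FN)."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        if tp + fp + fn > 0:               # skip classes absent everywhere
            ious.append(tp / (tp + fp + fn))
    return float(np.mean(ious))

gt   = np.array([0, 0, 1, 1])
pred = np.array([0, 1, 1, 1])
miou = mean_iou(pred, gt, 2)   # class 0: 1/2, class 1: 2/3 -> 7/12
```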
Practical Notes
- U-Net remains the go-to for medical/scientific imaging with limited data
- DeepLab v3+ with ResNet-101 is a strong baseline for outdoor scenes
- For interactive segmentation, SAM has largely superseded GrabCut
- Multi-scale inference (test-time augmentation) improves mIoU by 1-2%
- CRF post-processing is mostly obsoleted by modern architectures
- Superpixels are useful as preprocessing to reduce computation for classical methods
Key Takeaways
| Concept | Core Idea |
|---------|-----------|
| Watershed | Topographic flooding from markers; over-segments |
| Normalized cuts | Spectral clustering via Fiedler vector of graph Laplacian |
| U-Net | Encoder-decoder with skip connections; excels with limited data |
| DeepLab | Atrous convolutions + ASPP for multi-scale context |
| Mask R-CNN | Faster R-CNN + per-instance binary mask branch |
| SAM | Foundation model for promptable zero-shot segmentation |