3D Vision

Depth Estimation

Stereo Depth

Recover depth from two rectified cameras using the disparity relationship:

Z = f * B / d

where f is focal length, B is baseline, d is disparity (pixel shift between left and right views). See Geometric Vision for stereo matching methods.
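The relationship is a one-liner over a whole disparity map; a minimal numpy sketch (the KITTI-like f and B values in the comment are illustrative, not from this page):

```python
import numpy as np

def depth_from_disparity(disparity, focal_px, baseline_m, eps=1e-6):
    """Z = f * B / d, elementwise over a disparity map (disparity in pixels)."""
    disparity = np.asarray(disparity, dtype=np.float64)
    # Zero/negative disparity means the point is at infinity (or invalid).
    return np.where(disparity > eps,
                    focal_px * baseline_m / np.maximum(disparity, eps),
                    np.inf)

# e.g. a KITTI-like setup: f ~ 721 px, B ~ 0.54 m, 64 px disparity -> ~6.1 m
```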

Monocular Depth Estimation

Predict depth from a single image -- inherently ill-posed, requires learned priors.

Supervised methods:

  • Train on images with ground-truth depth (LiDAR, structured light)
  • Loss: scale-invariant log depth error
    L = (1/n) * sum(d_i^2) - (lambda/n^2) * (sum d_i)^2,  d_i = log(y_i) - log(y_i*)
    
  • Architectures: encoder-decoder (ResNet/ViT + multi-scale decoder)
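The scale-invariant loss above is short enough to write directly in numpy (`lam` is the lambda weighting term; with lam = 1 the loss is fully invariant to a global scaling of the prediction):

```python
import numpy as np

def silog_loss(pred, gt, lam=0.5):
    """Scale-invariant log depth error (Eigen et al.), d_i = log y_i - log y_i*."""
    d = np.log(pred) - np.log(gt)
    n = d.size
    return (d ** 2).sum() / n - lam * d.sum() ** 2 / n ** 2
```

With lam = 1, predicting 2x the true depth everywhere gives zero loss, which is why a lam < 1 is used when some notion of absolute scale should still be penalized.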

Self-supervised methods (Monodepth2):

  • Train on monocular video: predict depth + ego-motion, synthesize adjacent frames via warping
  • Photometric loss: L = alpha * (1 - SSIM(I, I')) / 2 + (1 - alpha) * |I - I'|, with alpha = 0.85
  • No ground-truth depth needed; scale ambiguity resolved at test time
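The photometric loss can be sketched as follows, with one labeled simplification: SSIM is computed here from global image statistics rather than the usual 3x3 local windows:

```python
import numpy as np

def photometric_loss(img, warped, alpha=0.85):
    """alpha * (1 - SSIM) / 2 + (1 - alpha) * L1 between a frame and its
    warped reconstruction. Simplified: SSIM uses whole-image statistics."""
    l1 = np.abs(img - warped).mean()
    mu1, mu2 = img.mean(), warped.mean()
    var1, var2 = img.var(), warped.var()
    cov = ((img - mu1) * (warped - mu2)).mean()
    c1, c2 = 0.01 ** 2, 0.03 ** 2  # standard SSIM stabilizers
    ssim = ((2 * mu1 * mu2 + c1) * (2 * cov + c2)) / \
           ((mu1 ** 2 + mu2 ** 2 + c1) * (var1 + var2 + c2))
    return alpha * (1.0 - ssim) / 2.0 + (1.0 - alpha) * l1
```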

Foundation models:

  • MiDaS: robust relative depth from diverse training data
  • Depth Anything: DINOv2 encoder + DPT decoder, trained on 62M images
  • Metric3D: absolute metric depth prediction
  • Marigold: diffusion-based depth estimation fine-tuned from Stable Diffusion

Point Clouds

Representation

A point cloud is an unordered set of 3D points {(x_i, y_i, z_i)}, optionally with color, normals, or other attributes. Sources: LiDAR, stereo, SfM, RGB-D sensors.

PointNet

First deep network to directly process raw point clouds (Qi et al., 2017):

Architecture:

  1. Per-point MLP: shared across all points (point-wise features)
  2. Symmetric aggregation: max pooling over all points to get global feature
  3. Classification/segmentation heads on global (+ per-point) features

Key insight: max pooling is a symmetric function, so the network is permutation invariant (order of points does not matter).
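A toy numpy sketch of this invariance: the same weights are applied to every point, and max pooling makes the global feature independent of point order (layer sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 64))    # shared per-point MLP weights (toy sizes)
W2 = rng.normal(size=(64, 128))

def global_feature(points):
    """Shared MLP on each point, then max pool -> order-independent feature."""
    h = np.maximum(points @ W1, 0)   # (N, 64), ReLU
    h = np.maximum(h @ W2, 0)        # (N, 128)
    return h.max(axis=0)             # (128,), symmetric aggregation

pts = rng.normal(size=(100, 3))
shuffled = pts[rng.permutation(100)]
assert np.allclose(global_feature(pts), global_feature(shuffled))
```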

T-Net: small sub-network that predicts an input/feature transformation matrix for alignment (learned spatial transformer).

PointNet++: hierarchical version with set abstraction layers:

  1. Sampling: farthest point sampling selects representative points
  2. Grouping: ball query finds neighbors within radius
  3. PointNet: applied to each local group
  4. Multi-scale grouping handles varying point density
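The sampling step above can be sketched as greedy farthest point sampling, which costs O(kN) for k samples:

```python
import numpy as np

def farthest_point_sampling(points, k, seed_idx=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    chosen = [seed_idx]
    dist = np.linalg.norm(points - points[seed_idx], axis=1)
    for _ in range(k - 1):
        idx = int(dist.argmax())               # farthest remaining point
        chosen.append(idx)
        # Each point keeps its distance to the *nearest* chosen point.
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)
```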

Other Point Cloud Networks

  • DGCNN: dynamic graph CNN; constructs k-NN graph in feature space and applies edge convolutions
  • Point Transformer: self-attention on point cloud neighborhoods
  • MinkowskiNet: sparse 3D convolutions on voxelized point clouds

3D Object Detection

From Point Clouds

VoxelNet: voxelize space, apply 3D convolutions.

PointPillars: organize points into vertical pillars (2D grid), encode each pillar with PointNet, apply 2D detection head. Fast and effective for autonomous driving.

CenterPoint: detect object centers in bird's-eye view, then refine 3D boxes. Anchor-free; achieved state-of-the-art results on nuScenes at release.

Multi-Modal Fusion

Combine LiDAR and camera for richer 3D detection:

  • BEVFusion: project both modalities to bird's-eye view, concatenate features
  • TransFusion: transformer-based fusion with query-based detection
  • Camera-only BEV: BEVFormer, BEVDet -- project image features to 3D via learned depth

3D Reconstruction

Multi-View Stereo (MVS)

Dense reconstruction from calibrated images:

  1. Compute depth maps for each view (plane-sweep stereo, PatchMatch)
  2. Fuse depth maps into consistent 3D model (TSDF fusion or point cloud merging)
  3. Surface extraction (Marching Cubes)

TSDF (Truncated Signed Distance Function)

Volumetric representation where each voxel stores the (truncated) signed distance to the nearest surface:

TSDF(x) = clamp(d(x) / delta, -1, 1)
  • Positive: in front of surface; negative: behind surface; zero: on surface
  • Fusion: running weighted average of observations from multiple views
  • Surface extraction: Marching Cubes algorithm at zero-crossing isosurface
  • Used in KinectFusion for real-time RGB-D reconstruction
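A toy 1D sketch of TSDF fusion along a single camera ray, using the running weighted average above (grid resolution and truncation distance are arbitrary):

```python
import numpy as np

def integrate(tsdf, weight, voxel_z, surface_z, trunc):
    """Fuse one depth observation into a 1D TSDF along a ray
    via the running weighted average used in KinectFusion."""
    sdf = np.clip((surface_z - voxel_z) / trunc, -1.0, 1.0)  # + in front, - behind
    new_w = weight + 1.0
    tsdf = (tsdf * weight + sdf) / new_w
    return tsdf, new_w

voxel_z = np.linspace(0.0, 2.0, 41)           # voxel centers along the ray (m)
tsdf = np.zeros_like(voxel_z)
w = np.zeros_like(voxel_z)
for obs in (1.02, 0.98, 1.00):                # noisy depths of a surface at ~1 m
    tsdf, w = integrate(tsdf, w, voxel_z, obs, trunc=0.1)
# the zero-crossing of the fused TSDF sits at z = 1.0
```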

Neural Radiance Fields (NeRF)

Represent a scene as a continuous volumetric function learned by a neural network (Mildenhall et al., 2020):

Representation: MLP maps 5D input to color and density:

F: (x, y, z, theta, phi) -> (r, g, b, sigma)
  • Position (x, y, z) determines density sigma
  • Viewing direction (theta, phi) modulates color (view-dependent effects like specularity)

Volume rendering: accumulate color along each camera ray:

C(r) = integral_{t_n}^{t_f} T(t) * sigma(t) * c(t) dt
T(t) = exp(-integral_{t_n}^{t} sigma(s) ds)

where T(t) is accumulated transmittance (probability of ray reaching depth t).
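In practice the integral is approximated by quadrature over discrete samples along the ray, C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i, with T_i the product of the previous samples' transparencies. A numpy sketch:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """NeRF quadrature for one ray.
    sigmas: (N,) densities; colors: (N, 3); deltas: (N,) sample spacings."""
    alpha = 1.0 - np.exp(-sigmas * deltas)                         # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]  # T_i
    weights = trans * alpha
    return (weights[:, None] * colors).sum(axis=0)
```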

Positional encoding: map coordinates to higher dimensions using sinusoidal functions to capture high-frequency detail:

gamma(p) = [sin(2^0 pi p), cos(2^0 pi p), ..., sin(2^{L-1} pi p), cos(2^{L-1} pi p)]
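A minimal implementation of gamma (NeRF uses L = 10 for positions and L = 4 for viewing directions):

```python
import numpy as np

def positional_encoding(p, L=10):
    """gamma(p): sin/cos pairs at frequencies 2^0 .. 2^{L-1} (times pi),
    applied to each coordinate. Output has 2 * L * len(p) entries."""
    p = np.atleast_1d(p)
    freqs = (2.0 ** np.arange(L)) * np.pi            # (L,)
    angles = p[:, None] * freqs[None, :]             # (D, L)
    return np.stack([np.sin(angles), np.cos(angles)], axis=-1).reshape(-1)
```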

Training: minimize photometric loss between rendered and observed pixels across multiple views.

Limitations: slow training (hours), slow rendering (seconds per frame), requires accurate camera poses.

Key extensions:

  • Instant-NGP: multiresolution hash grid encoding that cuts training from hours to minutes
  • Mip-NeRF: anti-aliased cone tracing instead of ray tracing
  • Nerfacto (Nerfstudio): combined best practices

3D Gaussian Splatting

Explicit scene representation using 3D Gaussians (Kerbl et al., 2023):

Representation: set of 3D Gaussians, each parameterized by:

  • Position (mean): mu in R^3
  • Covariance: Sigma = R * S * S^T * R^T (rotation R + scale S)
  • Opacity: alpha
  • Color: spherical harmonics coefficients (view-dependent appearance)
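The covariance factorization can be sketched from a unit quaternion and per-axis scales (a toy helper, not the reference implementation):

```python
import numpy as np

def covariance(quat, scale):
    """Sigma = R S S^T R^T from a unit quaternion (w, x, y, z) and 3 scales.
    This parameterization keeps Sigma symmetric positive semi-definite."""
    w, x, y, z = quat / np.linalg.norm(quat)
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    S = np.diag(scale)
    return R @ S @ S.T @ R.T
```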

Rendering (differentiable rasterization):

  1. Project 3D Gaussians to 2D screen-space Gaussians
  2. Sort by depth
  3. Alpha-blend front-to-back:
    C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j)
    

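The blending formula above can be sketched for a single pixel, given depth-sorted Gaussians (a toy helper):

```python
import numpy as np

def alpha_blend(colors, alphas):
    """Front-to-back compositing: C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j).
    colors: (N, 3) sorted front to back; alphas: (N,) opacities in [0, 1]."""
    alphas = np.asarray(alphas, dtype=np.float64)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]  # prod_{j<i}
    weights = trans * alphas
    return (weights[:, None] * np.asarray(colors)).sum(axis=0)
```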
Advantages over NeRF:

  • Real-time rendering (100+ fps) via rasterization
  • Fast training (minutes)
  • Explicit representation enables easy editing and manipulation

Training: initialize from SfM point cloud, optimize Gaussian parameters with photometric loss, periodically densify (split/clone) and prune Gaussians based on gradient magnitude and opacity.


SLAM (Simultaneous Localization and Mapping)

Problem

Jointly estimate camera trajectory and build a map of the environment in real-time.

ORB-SLAM

Feature-based visual SLAM system (Mur-Artal et al.):

Three parallel threads:

  1. Tracking: extract ORB features, match to map, estimate pose via PnP + motion-only BA
  2. Local mapping: triangulate new map points, local bundle adjustment
  3. Loop closing: detect revisited places via bag-of-words (DBoW2), correct drift with pose graph optimization

Key components:

  • Keyframe selection: insert keyframes when tracking quality degrades
  • Covisibility graph: connects keyframes sharing map point observations
  • Essential graph: sparse subset for efficient global optimization
  • Relocalization: recover from tracking failure using place recognition

ORB-SLAM3: extends to visual-inertial (IMU fusion), multi-map, and multi-session SLAM.

Dense SLAM

  • DTAM: dense tracking and mapping with photometric optimization
  • ElasticFusion: surfel-based dense reconstruction with non-rigid deformation for loop closure
  • NICE-SLAM / iMAP: neural implicit representations for SLAM (encode map as neural network)
  • Gaussian Splatting SLAM: use 3D Gaussians as map representation

3D Scene Understanding

Indoor Scene Understanding

  • 3D semantic segmentation: per-point labels on point clouds (ScanNet benchmark)
  • 3D instance segmentation: separate individual object instances
  • Scene graph: objects + relationships in 3D space
  • 3D visual grounding: localize objects from natural language descriptions

Outdoor / Autonomous Driving

  • Occupancy prediction: predict 3D occupancy grid from camera/LiDAR (TPVFormer, SurroundOcc)
  • HD map construction: online construction of lane markings, road boundaries from onboard sensors
  • 3D lane detection: predict 3D lane curves from monocular or multi-view cameras

Open-Vocabulary 3D Understanding

  • LERF: language-embedded radiance fields for open-vocabulary 3D queries
  • OpenScene: per-point CLIP features on 3D point clouds
  • Enables text-based search in 3D scenes without predefined categories

Practical Considerations

  • Monocular depth is scale-ambiguous; metric depth requires known camera parameters or learned scale
  • PointNet++'s farthest point sampling costs O(kN) for k samples (quadratic when k scales with N); random sampling is a practical alternative for large clouds
  • NeRF typically requires 50-200 images with accurate poses (from COLMAP); Gaussian splatting uses the same poses plus the SfM point cloud for initialization
  • SLAM systems must handle initialization, relocalization, and drift -- bundle adjustment is critical
  • For autonomous driving, BEV representations unify LiDAR and camera features in a common coordinate frame
  • 3D Gaussian splatting has rapidly become the preferred method for novel view synthesis due to its speed advantage

Key Takeaways

| Concept | Core Idea |
|---------|-----------|
| Monocular depth | Learned priors compensate for geometric ambiguity; foundation models generalize broadly |
| PointNet | Permutation-invariant processing via shared MLP + max pool |
| TSDF fusion | Running average of signed distance for volumetric reconstruction |
| NeRF | MLP maps (x,y,z,theta,phi) to (color, density); trained via volume rendering |
| 3D Gaussian splatting | Explicit Gaussians with differentiable rasterization; real-time rendering |
| ORB-SLAM | Feature-based SLAM with tracking, mapping, and loop closure threads |