3D Vision

Depth Estimation

Stereo Depth

Recover depth from two rectified cameras using the disparity relationship:

Z = f * B / d

where f is focal length, B is baseline, d is disparity (pixel shift between left and right views). See Geometric Vision for stereo matching methods.
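The relationship is a one-liner over a whole disparity map; a minimal numpy sketch (the KITTI-like f and B values in the comment are illustrative, not from this page):

```python
import numpy as np

def depth_from_disparity(disparity, focal_px, baseline_m, eps=1e-6):
    """Z = f * B / d, elementwise over a disparity map (disparity in pixels)."""
    disparity = np.asarray(disparity, dtype=np.float64)
    # Zero/negative disparity means the point is at infinity (or invalid).
    return np.where(disparity > eps,
                    focal_px * baseline_m / np.maximum(disparity, eps),
                    np.inf)

# e.g. a KITTI-like setup: f ~ 721 px, B ~ 0.54 m, 64 px disparity -> ~6.1 m
```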

Monocular Depth Estimation

Predict depth from a single image -- inherently ill-posed, requires learned priors.

Supervised methods:

  • Train on images with ground-truth depth (LiDAR, structured light)
  • Loss: scale-invariant log depth error
    L = (1/n) * sum(d_i^2) - (lambda/n^2) * (sum d_i)^2,  d_i = log(y_i) - log(y_i*)
    
  • Architectures: encoder-decoder (ResNet/ViT + multi-scale decoder)
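The scale-invariant loss above is short enough to write directly in numpy (`lam` is the lambda weighting term; with lam = 1 the loss is fully invariant to a global scaling of the prediction):

```python
import numpy as np

def silog_loss(pred, gt, lam=0.5):
    """Scale-invariant log depth error (Eigen et al.), d_i = log y_i - log y_i*."""
    d = np.log(pred) - np.log(gt)
    n = d.size
    return (d ** 2).sum() / n - lam * d.sum() ** 2 / n ** 2
```

With lam = 1, predicting 2x the true depth everywhere gives zero loss, which is why a lam < 1 is used when some notion of absolute scale should still be penalized.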

Self-supervised methods (Monodepth2):

  • Train on monocular video: predict depth + ego-motion, synthesize adjacent frames via warping
  • Photometric loss: L = alpha * (1 - SSIM(I, I')) / 2 + (1 - alpha) * |I - I'|, with alpha = 0.85
  • No ground-truth depth needed; scale ambiguity resolved at test time
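The photometric loss can be sketched as follows, with one labeled simplification: SSIM is computed here from global image statistics rather than the usual 3x3 local windows:

```python
import numpy as np

def photometric_loss(img, warped, alpha=0.85):
    """alpha * (1 - SSIM) / 2 + (1 - alpha) * L1 between a frame and its
    warped reconstruction. Simplified: SSIM uses whole-image statistics."""
    l1 = np.abs(img - warped).mean()
    mu1, mu2 = img.mean(), warped.mean()
    var1, var2 = img.var(), warped.var()
    cov = ((img - mu1) * (warped - mu2)).mean()
    c1, c2 = 0.01 ** 2, 0.03 ** 2  # standard SSIM stabilizers
    ssim = ((2 * mu1 * mu2 + c1) * (2 * cov + c2)) / \
           ((mu1 ** 2 + mu2 ** 2 + c1) * (var1 + var2 + c2))
    return alpha * (1.0 - ssim) / 2.0 + (1.0 - alpha) * l1
```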

Foundation models:

  • MiDaS: robust relative depth from diverse training data
  • Depth Anything: DINOv2 encoder + DPT decoder, trained on 62M images
  • Metric3D: absolute metric depth prediction
  • Marigold: diffusion-based depth estimation fine-tuned from Stable Diffusion

Point Clouds

Representation

A point cloud is an unordered set of 3D points {(x_i, y_i, z_i)}, optionally with color, normals, or other attributes. Sources: LiDAR, stereo, SfM, RGB-D sensors.

PointNet

First deep network to directly process raw point clouds (Qi et al., 2017):

Architecture:

  1. Per-point MLP: shared across all points (point-wise features)
  2. Symmetric aggregation: max pooling over all points to get global feature
  3. Classification/segmentation heads on global (+ per-point) features

Key insight: max pooling is a symmetric function, so the network is permutation invariant (order of points does not matter).
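A toy numpy sketch of this invariance: the same weights are applied to every point, and max pooling makes the global feature independent of point order (layer sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 64))    # shared per-point MLP weights (toy sizes)
W2 = rng.normal(size=(64, 128))

def global_feature(points):
    """Shared MLP on each point, then max pool -> order-independent feature."""
    h = np.maximum(points @ W1, 0)   # (N, 64), ReLU
    h = np.maximum(h @ W2, 0)        # (N, 128)
    return h.max(axis=0)             # (128,), symmetric aggregation

pts = rng.normal(size=(100, 3))
shuffled = pts[rng.permutation(100)]
assert np.allclose(global_feature(pts), global_feature(shuffled))
```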

T-Net: small sub-network that predicts an input/feature transformation matrix for alignment (learned spatial transformer).

PointNet++: hierarchical version with set abstraction layers:

  1. Sampling: farthest point sampling selects representative points
  2. Grouping: ball query finds neighbors within radius
  3. PointNet: applied to each local group
  4. Multi-scale grouping handles varying point density
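The sampling step above can be sketched as greedy farthest point sampling, which costs O(kN) for k samples:

```python
import numpy as np

def farthest_point_sampling(points, k, seed_idx=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    chosen = [seed_idx]
    dist = np.linalg.norm(points - points[seed_idx], axis=1)
    for _ in range(k - 1):
        idx = int(dist.argmax())               # farthest remaining point
        chosen.append(idx)
        # Each point keeps its distance to the *nearest* chosen point.
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)
```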

Other Point Cloud Networks

  • DGCNN: dynamic graph CNN; constructs k-NN graph in feature space and applies edge convolutions
  • Point Transformer: self-attention on point cloud neighborhoods
  • MinkowskiNet: sparse 3D convolutions on voxelized point clouds

3D Object Detection

From Point Clouds

VoxelNet: voxelize space, apply 3D convolutions.

PointPillars: organize points into vertical pillars (2D grid), encode each pillar with PointNet, apply 2D detection head. Fast and effective for autonomous driving.

CenterPoint: detect object centers in bird's-eye view, then refine 3D boxes. Anchor-free; achieved state-of-the-art results on nuScenes at release.

Multi-Modal Fusion

Combine LiDAR and camera for richer 3D detection:

  • BEVFusion: project both modalities to bird's-eye view, concatenate features
  • TransFusion: transformer-based fusion with query-based detection
  • Camera-only BEV: BEVFormer, BEVDet -- project image features to 3D via learned depth

3D Reconstruction

Multi-View Stereo (MVS)

Dense reconstruction from calibrated images:

  1. Compute depth maps for each view (plane-sweep stereo, PatchMatch)
  2. Fuse depth maps into consistent 3D model (TSDF fusion or point cloud merging)
  3. Surface extraction (Marching Cubes)

TSDF (Truncated Signed Distance Function)

Volumetric representation where each voxel stores the (truncated) signed distance to the nearest surface:

TSDF(x) = clamp(d(x) / delta, -1, 1)
  • Positive: in front of surface; negative: behind surface; zero: on surface
  • Fusion: running weighted average of observations from multiple views
  • Surface extraction: Marching Cubes algorithm at zero-crossing isosurface
  • Used in KinectFusion for real-time RGB-D reconstruction
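A toy 1D sketch of TSDF fusion along a single camera ray, using the running weighted average above (grid resolution and truncation distance are arbitrary):

```python
import numpy as np

def integrate(tsdf, weight, voxel_z, surface_z, trunc):
    """Fuse one depth observation into a 1D TSDF along a ray
    via the running weighted average used in KinectFusion."""
    sdf = np.clip((surface_z - voxel_z) / trunc, -1.0, 1.0)  # + in front, - behind
    new_w = weight + 1.0
    tsdf = (tsdf * weight + sdf) / new_w
    return tsdf, new_w

voxel_z = np.linspace(0.0, 2.0, 41)           # voxel centers along the ray (m)
tsdf = np.zeros_like(voxel_z)
w = np.zeros_like(voxel_z)
for obs in (1.02, 0.98, 1.00):                # noisy depths of a surface at ~1 m
    tsdf, w = integrate(tsdf, w, voxel_z, obs, trunc=0.1)
# the zero-crossing of the fused TSDF sits at z = 1.0
```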

Neural Radiance Fields (NeRF)

Represent a scene as a continuous volumetric function learned by a neural network (Mildenhall et al., 2020):

Representation: MLP maps 5D input to color and density:

F: (x, y, z, theta, phi) -> (r, g, b, sigma)
  • Position (x, y, z) determines density sigma
  • Viewing direction (theta, phi) modulates color (view-dependent effects like specularity)

Volume rendering: accumulate color along each camera ray:

C(r) = integral_{t_n}^{t_f} T(t) * sigma(t) * c(t) dt
T(t) = exp(-integral_{t_n}^{t} sigma(s) ds)

where T(t) is accumulated transmittance (probability of ray reaching depth t).
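In practice the integral is approximated by quadrature over discrete samples along the ray, C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i, with T_i the product of the previous samples' transparencies. A numpy sketch:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """NeRF quadrature for one ray.
    sigmas: (N,) densities; colors: (N, 3); deltas: (N,) sample spacings."""
    alpha = 1.0 - np.exp(-sigmas * deltas)                         # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]  # T_i
    weights = trans * alpha
    return (weights[:, None] * colors).sum(axis=0)
```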

Positional encoding: map coordinates to higher dimensions using sinusoidal functions to capture high-frequency detail:

gamma(p) = [sin(2^0 pi p), cos(2^0 pi p), ..., sin(2^{L-1} pi p), cos(2^{L-1} pi p)]
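A minimal implementation of gamma (NeRF uses L = 10 for positions and L = 4 for viewing directions):

```python
import numpy as np

def positional_encoding(p, L=10):
    """gamma(p): sin/cos pairs at frequencies 2^0 .. 2^{L-1} (times pi),
    applied to each coordinate. Output has 2 * L * len(p) entries."""
    p = np.atleast_1d(p)
    freqs = (2.0 ** np.arange(L)) * np.pi            # (L,)
    angles = p[:, None] * freqs[None, :]             # (D, L)
    return np.stack([np.sin(angles), np.cos(angles)], axis=-1).reshape(-1)
```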

Training: minimize photometric loss between rendered and observed pixels across multiple views.

Limitations: slow training (hours), slow rendering (seconds per frame), requires accurate camera poses.

Key extensions:

  • Instant-NGP: multiresolution hash grid encoding that cuts training from hours to minutes
  • Mip-NeRF: anti-aliased cone tracing instead of ray tracing
  • Nerfacto (Nerfstudio): combined best practices

3D Gaussian Splatting

Explicit scene representation using 3D Gaussians (Kerbl et al., 2023):

Representation: set of 3D Gaussians, each parameterized by:

  • Position (mean): mu in R^3
  • Covariance: Sigma = R * S * S^T * R^T (rotation R + scale S)
  • Opacity: alpha
  • Color: spherical harmonics coefficients (view-dependent appearance)
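The covariance factorization can be sketched from a unit quaternion and per-axis scales (a toy helper, not the reference implementation):

```python
import numpy as np

def covariance(quat, scale):
    """Sigma = R S S^T R^T from a unit quaternion (w, x, y, z) and 3 scales.
    This parameterization keeps Sigma symmetric positive semi-definite."""
    w, x, y, z = quat / np.linalg.norm(quat)
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    S = np.diag(scale)
    return R @ S @ S.T @ R.T
```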

Rendering (differentiable rasterization):

  1. Project 3D Gaussians to 2D screen-space Gaussians
  2. Sort by depth
  3. Alpha-blend front-to-back:
    C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j)
    

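The blending formula above can be sketched for a single pixel, given depth-sorted Gaussians (a toy helper):

```python
import numpy as np

def alpha_blend(colors, alphas):
    """Front-to-back compositing: C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j).
    colors: (N, 3) sorted front to back; alphas: (N,) opacities in [0, 1]."""
    alphas = np.asarray(alphas, dtype=np.float64)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]  # prod_{j<i}
    weights = trans * alphas
    return (weights[:, None] * np.asarray(colors)).sum(axis=0)
```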
Advantages over NeRF:

  • Real-time rendering (100+ fps) via rasterization
  • Fast training (minutes)
  • Explicit representation enables easy editing and manipulation

Training: initialize from SfM point cloud, optimize Gaussian parameters with photometric loss, periodically densify (split/clone) and prune Gaussians based on gradient magnitude and opacity.


SLAM (Simultaneous Localization and Mapping)

Problem

Jointly estimate camera trajectory and build a map of the environment in real-time.

ORB-SLAM

Feature-based visual SLAM system (Mur-Artal et al.):

Three parallel threads:

  1. Tracking: extract ORB features, match to map, estimate pose via PnP + motion-only BA
  2. Local mapping: triangulate new map points, local bundle adjustment
  3. Loop closing: detect revisited places via bag-of-words (DBoW2), correct drift with pose graph optimization

Key components:

  • Keyframe selection: insert keyframes when tracking quality degrades
  • Covisibility graph: connects keyframes sharing map point observations
  • Essential graph: sparse subset for efficient global optimization
  • Relocalization: recover from tracking failure using place recognition

ORB-SLAM3: extends to visual-inertial (IMU fusion), multi-map, and multi-session SLAM.

Dense SLAM

  • DTAM: dense tracking and mapping with photometric optimization
  • ElasticFusion: surfel-based dense reconstruction with non-rigid deformation for loop closure
  • NICE-SLAM / iMAP: neural implicit representations for SLAM (encode map as neural network)
  • Gaussian Splatting SLAM: use 3D Gaussians as map representation

3D Scene Understanding

Indoor Scene Understanding

  • 3D semantic segmentation: per-point labels on point clouds (ScanNet benchmark)
  • 3D instance segmentation: separate individual object instances
  • Scene graph: objects + relationships in 3D space
  • 3D visual grounding: localize objects from natural language descriptions

Outdoor / Autonomous Driving

  • Occupancy prediction: predict 3D occupancy grid from camera/LiDAR (TPVFormer, SurroundOcc)
  • HD map construction: online construction of lane markings, road boundaries from onboard sensors
  • 3D lane detection: predict 3D lane curves from monocular or multi-view cameras

Open-Vocabulary 3D Understanding

  • LERF: language-embedded radiance fields for open-vocabulary 3D queries
  • OpenScene: per-point CLIP features on 3D point clouds
  • Enables text-based search in 3D scenes without predefined categories

Practical Considerations

  • Monocular depth is scale-ambiguous; metric depth requires known camera parameters or learned scale
  • PointNet++'s farthest point sampling costs O(kN) for k samples (quadratic when k scales with N); random sampling is a practical alternative for large clouds
  • NeRF typically requires 50-200 images with accurate poses (from COLMAP); Gaussian splatting uses the same poses plus the SfM point cloud for initialization
  • SLAM systems must handle initialization, relocalization, and drift -- bundle adjustment is critical
  • For autonomous driving, BEV representations unify LiDAR and camera features in a common coordinate frame
  • 3D Gaussian splatting has rapidly become the preferred method for novel view synthesis due to its speed advantage

Key Takeaways

| Concept | Core Idea |
|---------|-----------|
| Monocular depth | Learned priors compensate for geometric ambiguity; foundation models generalize broadly |
| PointNet | Permutation-invariant processing via shared MLP + max pool |
| TSDF fusion | Running average of signed distance for volumetric reconstruction |
| NeRF | MLP maps (x,y,z,theta,phi) to (color, density); trained via volume rendering |
| 3D Gaussian splatting | Explicit Gaussians with differentiable rasterization; real-time rendering |
| ORB-SLAM | Feature-based SLAM with tracking, mapping, and loop closure threads |