3D Vision
Depth Estimation
Stereo Depth
Recover depth from two rectified cameras using the disparity relationship:
Z = f * B / d
where f is focal length, B is baseline, d is disparity (pixel shift between left and right views). See Geometric Vision for stereo matching methods.
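The relationship above can be sketched directly; the focal length, baseline, and disparity values below are illustrative, not from a real rig:

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert a disparity map (pixels) to metric depth (meters): Z = f * B / d."""
    d = np.asarray(disparity, dtype=np.float64)
    depth = focal_px * baseline_m / np.maximum(d, eps)  # guard against divide-by-zero
    depth[d <= 0] = np.inf                              # zero/invalid disparity -> no depth
    return depth

depth = disparity_to_depth([[64.0, 32.0]], focal_px=720.0, baseline_m=0.12)
# 720 * 0.12 / 64 = 1.35 m ; 720 * 0.12 / 32 = 2.70 m
```

Note the inverse relationship: halving the disparity doubles the depth, which is why stereo depth error grows quadratically with distance.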
Monocular Depth Estimation
Predict depth from a single image -- an inherently ill-posed problem that requires learned priors.
Supervised methods:
- Train on images with ground-truth depth (LiDAR, structured light)
- Loss: scale-invariant log depth error
L = (1/n) * sum(d_i^2) - (lambda/n^2) * (sum d_i)^2, where d_i = log(y_i) - log(y_i*)
- Architectures: encoder-decoder (ResNet/ViT + multi-scale decoder)
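The scale-invariant log loss (Eigen et al.) can be written as a small NumPy function. With lambda = 1 it is fully scale-invariant: multiplying all predictions by a constant shifts every d_i by the same amount, and the second term cancels that shift.

```python
import numpy as np

def silog_loss(pred, target, lam=0.5, eps=1e-8):
    """Scale-invariant log depth loss: mean(d^2) - lam * (sum d)^2 / n^2."""
    d = np.log(np.asarray(pred) + eps) - np.log(np.asarray(target) + eps)
    n = d.size
    return (d ** 2).sum() / n - lam * d.sum() ** 2 / n ** 2

silog_loss([2.0, 4.0], [1.0, 2.0], lam=1.0)  # ~0: uniform scaling leaves the loss unchanged
```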
Self-supervised methods (Monodepth2):
- Train on monocular video: predict depth + ego-motion, synthesize adjacent frames via warping
- Photometric loss:
L = alpha * (1 - SSIM(I, I')) / 2 + (1 - alpha) * |I - I'|, with alpha = 0.85 in Monodepth2
- No ground-truth depth needed; scale ambiguity is typically resolved at test time (e.g. by median scaling)
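A toy version of this loss, using a global-window SSIM for brevity (real implementations compute SSIM over local windows, typically 3x3 in Monodepth2). The SSIM term enters as (1 - SSIM)/2 so identical images give zero loss:

```python
import numpy as np

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM over the whole image (not per-pixel windows)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def photometric_loss(img, warped, alpha=0.85):
    """Monodepth2-style mix of SSIM and L1 terms (alpha = 0.85 in the paper)."""
    return alpha * (1 - ssim(img, warped)) / 2 + (1 - alpha) * np.abs(img - warped).mean()
```

Here `warped` is the adjacent frame re-synthesized into the current view using the predicted depth and ego-motion; if both predictions are correct, the warp reproduces the current image and the loss goes to zero.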
Foundation models:
- MiDaS: robust relative depth from diverse training data
- Depth Anything: DINOv2 encoder + DPT decoder, trained on 62M images
- Metric3D: absolute metric depth prediction
- Marigold: diffusion-based depth estimation fine-tuned from Stable Diffusion
Point Clouds
Representation
A point cloud is an unordered set of 3D points {(x_i, y_i, z_i)}, optionally with color, normals, or other attributes. Sources: LiDAR, stereo, SfM, RGB-D sensors.
PointNet
First deep network to directly process raw point clouds (Qi et al., 2017):
Architecture:
- Per-point MLP: shared across all points (point-wise features)
- Symmetric aggregation: max pooling over all points to get global feature
- Classification/segmentation heads on global (+ per-point) features
Key insight: max pooling is a symmetric function, so the network is permutation invariant (order of points does not matter).
T-Net: small sub-network that predicts an input/feature transformation matrix for alignment (learned spatial transformer).
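The permutation-invariance argument can be verified with a toy PointNet (random weights, no training, no T-Net): a shared MLP applied per point, followed by max pooling, produces the same global feature for any ordering of the points.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 16))   # shared weights, applied to every point
W2 = rng.standard_normal((16, 64))

def pointnet_global_feature(points):
    """Toy PointNet: shared per-point MLP, then symmetric max pooling."""
    h = np.maximum(points @ W1, 0)  # per-point features (ReLU)
    h = np.maximum(h @ W2, 0)
    return h.max(axis=0)            # order-independent aggregation

pts = rng.standard_normal((100, 3))
f1 = pointnet_global_feature(pts)
f2 = pointnet_global_feature(pts[rng.permutation(100)])
# f1 == f2: shuffling the points does not change the global feature
```

Any symmetric function (max, sum, mean) would give invariance; max pooling works well in practice because each feature dimension latches onto a "critical point" in the cloud.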
PointNet++: hierarchical version with set abstraction layers:
- Sampling: farthest point sampling selects representative points
- Grouping: ball query finds neighbors within radius
- PointNet: applied to each local group
- Multi-scale grouping handles varying point density
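The sampling step above is greedy farthest point sampling; a minimal sketch with the standard running-minimum distance cache (O(kN) for k samples):

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Greedy FPS: each step picks the point farthest from the chosen set."""
    chosen = [0]                                          # arbitrary seed point
    dist = np.linalg.norm(points - points[0], axis=1)     # distance to chosen set
    for _ in range(k - 1):
        idx = int(dist.argmax())                          # farthest remaining point
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)
```

Because each iteration maximizes the minimum distance to already-chosen points, FPS spreads samples evenly over the cloud, unlike uniform random sampling, which oversamples dense regions.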
Other Point Cloud Networks
- DGCNN: dynamic graph CNN; constructs k-NN graph in feature space and applies edge convolutions
- Point Transformer: self-attention on point cloud neighborhoods
- MinkowskiNet: sparse 3D convolutions on voxelized point clouds
3D Object Detection
From Point Clouds
VoxelNet: voxelize space, apply 3D convolutions.
PointPillars: organize points into vertical pillars (2D grid), encode each pillar with PointNet, apply 2D detection head. Fast and effective for autonomous driving.
CenterPoint: detect object centers in bird's-eye view, then refine 3D boxes. Anchor-free, state-of-the-art on nuScenes.
Multi-Modal Fusion
Combine LiDAR and camera for richer 3D detection:
- BEVFusion: project both modalities to bird's-eye view, concatenate features
- TransFusion: transformer-based fusion with query-based detection
- Camera-only BEV: BEVFormer, BEVDet -- project image features to 3D via learned depth
3D Reconstruction
Multi-View Stereo (MVS)
Dense reconstruction from calibrated images:
- Compute depth maps for each view (plane-sweep stereo, PatchMatch)
- Fuse depth maps into consistent 3D model (TSDF fusion or point cloud merging)
- Surface extraction (Marching Cubes)
TSDF (Truncated Signed Distance Function)
Volumetric representation where each voxel stores the (truncated) signed distance to the nearest surface:
TSDF(x) = clamp(d(x) / delta, -1, 1)
- Positive: in front of surface; negative: behind surface; zero: on surface
- Fusion: running weighted average of observations from multiple views
- Surface extraction: Marching Cubes algorithm at zero-crossing isosurface
- Used in KinectFusion for real-time RGB-D reconstruction
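The fusion rule is just a per-voxel running weighted average; a minimal sketch for one update (the truncation distance `trunc` is an illustrative value):

```python
import numpy as np

def tsdf_update(tsdf, weight, sdf_obs, w_obs=1.0, trunc=0.05):
    """Fuse one signed-distance observation into a TSDF voxel (running weighted average)."""
    d = np.clip(sdf_obs / trunc, -1.0, 1.0)          # truncate to [-1, 1]
    new_weight = weight + w_obs
    new_tsdf = (tsdf * weight + d * w_obs) / new_weight
    return new_tsdf, new_weight

# Two observations of the same voxel from different views average out:
t, w = tsdf_update(0.0, 0.0, 0.025)    # first view sees surface 2.5 cm behind voxel
t, w = tsdf_update(t, w, -0.025)       # second view sees it 2.5 cm in front
# t is now 0: the fused estimate places the surface at this voxel
```

This incremental form is what makes KinectFusion real-time: each new depth frame only touches the voxels it observes, with no global re-solve.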
Neural Radiance Fields (NeRF)
Represent a scene as a continuous volumetric function learned by a neural network (Mildenhall et al., 2020):
Representation: MLP maps 5D input to color and density:
F: (x, y, z, theta, phi) -> (r, g, b, sigma)
- Position (x, y, z) determines density sigma
- Viewing direction (theta, phi) modulates color (view-dependent effects like specularity)
Volume rendering: accumulate color along each camera ray:
C(r) = integral_{t_n}^{t_f} T(t) * sigma(t) * c(t) dt
T(t) = exp(-integral_{t_n}^{t} sigma(s) ds)
where T(t) is accumulated transmittance (probability of ray reaching depth t).
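In practice the integral is evaluated by quadrature over discrete samples along the ray, with per-sample opacity alpha_i = 1 - exp(-sigma_i * delta_i) and transmittance T_i accumulated as a running product; a minimal sketch:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Quadrature of the volume rendering integral along one ray.

    sigmas: (n,) densities; colors: (n, 3) RGB; deltas: (n,) sample spacings.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # per-sample opacity
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])   # T_i before sample i
    weights = trans * alphas                                         # contribution of sample i
    return weights @ colors, weights
```

The weights sum to at most 1; the shortfall is the light that passes through the whole ray (the background). The same weights also yield an expected depth, `weights @ ts`, for free.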
Positional encoding: map coordinates to higher dimensions using sinusoidal functions to capture high-frequency detail:
gamma(p) = [sin(2^0 pi p), cos(2^0 pi p), ..., sin(2^{L-1} pi p), cos(2^{L-1} pi p)]
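A direct transcription of gamma(p), applied independently to each coordinate (NeRF uses L = 10 for positions and L = 4 for directions):

```python
import numpy as np

def positional_encoding(p, L=10):
    """gamma(p): sin/cos at frequencies 2^0 * pi ... 2^(L-1) * pi, per coordinate."""
    p = np.asarray(p, dtype=np.float64)
    out = []
    for l in range(L):
        out.append(np.sin(2.0 ** l * np.pi * p))
        out.append(np.cos(2.0 ** l * np.pi * p))
    return np.concatenate([np.atleast_1d(o) for o in out])

positional_encoding(np.zeros(3), L=10).shape  # (60,): 3 coords * 2 functions * 10 frequencies
```

Without this lifting, the MLP's spectral bias toward smooth functions blurs fine geometry; the high-frequency sinusoids let nearby positions map to very different inputs.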
Training: minimize photometric loss between rendered and observed pixels across multiple views.
Limitations: slow training (hours), slow rendering (seconds per frame), requires accurate camera poses.
Key extensions:
- Instant-NGP: multiresolution hash grid encoding; trains in minutes rather than hours
- Mip-NeRF: anti-aliased cone tracing instead of ray tracing
- Nerfacto (Nerfstudio): combined best practices
3D Gaussian Splatting
Explicit scene representation using 3D Gaussians (Kerbl et al., 2023):
Representation: set of 3D Gaussians, each parameterized by:
- Position (mean): mu in R^3
- Covariance: Sigma = R * S * S^T * R^T (rotation R, scale S)
- Opacity: alpha
- Color: spherical harmonics coefficients (view-dependent appearance)
Rendering (differentiable rasterization):
- Project 3D Gaussians to 2D screen-space Gaussians
- Sort by depth
- Alpha-blend front-to-back:
C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j)
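The blending formula is ordinary front-to-back alpha compositing over the depth-sorted Gaussians (here alphas are the opacities already evaluated at the pixel); a minimal sketch:

```python
import numpy as np

def composite_front_to_back(colors, alphas):
    """C = sum_i c_i * a_i * prod_{j<i} (1 - a_j), in depth order (front first)."""
    transmittance = 1.0
    out = np.zeros(3)
    for c, a in zip(colors, alphas):
        out += transmittance * a * np.asarray(c)  # this Gaussian's contribution
        transmittance *= (1.0 - a)                # light remaining for those behind
    return out

composite_front_to_back([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [1.0, 0.5])
# -> [1, 0, 0]: a fully opaque front Gaussian occludes everything behind it
```

This is the same compositing rule NeRF's quadrature produces, which is why the two methods optimize comparable photometric objectives despite the very different scene representations.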
Advantages over NeRF:
- Real-time rendering (100+ fps) via rasterization
- Fast training (minutes)
- Explicit representation enables easy editing and manipulation
Training: initialize from SfM point cloud, optimize Gaussian parameters with photometric loss, periodically densify (split/clone) and prune Gaussians based on gradient magnitude and opacity.
SLAM (Simultaneous Localization and Mapping)
Problem
Jointly estimate camera trajectory and build a map of the environment in real-time.
ORB-SLAM
Feature-based visual SLAM system (Mur-Artal et al.):
Three parallel threads:
- Tracking: extract ORB features, match to map, estimate pose via PnP + motion-only BA
- Local mapping: triangulate new map points, local bundle adjustment
- Loop closing: detect revisited places via bag-of-words (DBoW2), correct drift with pose graph optimization
Key components:
- Keyframe selection: insert keyframes when tracking quality degrades
- Covisibility graph: connects keyframes sharing map point observations
- Essential graph: sparse subset for efficient global optimization
- Relocalization: recover from tracking failure using place recognition
ORB-SLAM3: extends to visual-inertial (IMU fusion), multi-map, and multi-session SLAM.
Dense SLAM
- DTAM: dense tracking and mapping with photometric optimization
- ElasticFusion: surfel-based dense reconstruction with non-rigid deformation for loop closure
- NICE-SLAM / iMAP: neural implicit representations for SLAM (encode map as neural network)
- Gaussian Splatting SLAM: use 3D Gaussians as map representation
3D Scene Understanding
Indoor Scene Understanding
- 3D semantic segmentation: per-point labels on point clouds (ScanNet benchmark)
- 3D instance segmentation: separate individual object instances
- Scene graph: objects + relationships in 3D space
- 3D visual grounding: localize objects from natural language descriptions
Outdoor / Autonomous Driving
- Occupancy prediction: predict 3D occupancy grid from camera/LiDAR (TPVFormer, SurroundOcc)
- HD map construction: online construction of lane markings, road boundaries from onboard sensors
- 3D lane detection: predict 3D lane curves from monocular or multi-view cameras
Open-Vocabulary 3D Understanding
- LERF: language-embedded radiance fields for open-vocabulary 3D queries
- OpenScene: per-point CLIP features on 3D point clouds
- Enables text-based search in 3D scenes without predefined categories
Practical Considerations
- Monocular depth is scale-ambiguous; metric depth requires known camera parameters or learned scale
- PointNet++ with farthest point sampling is O(N^2); random sampling is a practical alternative for large clouds
- NeRF typically requires 50-200 images with accurate poses (e.g. from COLMAP); Gaussian splatting uses the same pose pipeline and additionally initializes from the SfM point cloud
- SLAM systems must handle initialization, relocalization, and drift -- bundle adjustment is critical
- For autonomous driving, BEV representations unify LiDAR and camera features in a common coordinate frame
- 3D Gaussian splatting has rapidly become a leading choice for novel view synthesis due to its speed advantage
Key Takeaways
| Concept | Core Idea |
|---------|-----------|
| Monocular depth | Learned priors compensate for geometric ambiguity; foundation models generalize broadly |
| PointNet | Permutation-invariant processing via shared MLP + max pool |
| TSDF fusion | Running average of signed distance for volumetric reconstruction |
| NeRF | MLP maps (x, y, z, theta, phi) to (color, density); trained via volume rendering |
| 3D Gaussian splatting | Explicit Gaussians with differentiable rasterization; real-time rendering |
| ORB-SLAM | Feature-based SLAM with tracking, mapping, and loop closure threads |