Video Analysis
Optical Flow
Optical flow estimates the apparent motion of each pixel between consecutive frames as a 2D displacement field (u, v) at each pixel.
Brightness Constancy Assumption
The fundamental assumption: a pixel's intensity does not change between frames.
I(x, y, t) = I(x + u, y + v, t + 1)
First-order Taylor expansion yields the optical flow constraint equation:
I_x * u + I_y * v + I_t = 0
One equation, two unknowns -- the aperture problem. Additional constraints needed.
Lucas-Kanade Method
Assumes flow is locally constant within a small window (e.g., 5x5):
[ sum I_x^2     sum I_x*I_y ] [u]   [ -sum I_x*I_t ]
[ sum I_x*I_y   sum I_y^2   ] [v] = [ -sum I_y*I_t ]
Solved as a least-squares system: A^T A * d = A^T b.
Properties:
- Sparse: only reliable where A^T A is well-conditioned (textured regions)
- Pyramidal LK: coarse-to-fine for large displacements
- Used in KLT tracker (Kanade-Lucas-Tomasi)
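As a minimal sketch, the per-window normal equations above can be solved directly in NumPy. The gradients below are synthetic, constructed to be exactly consistent with a known flow, so the solver should recover it:

```python
import numpy as np

def lucas_kanade_window(Ix, Iy, It):
    """Solve the 2x2 normal equations A^T A d = A^T b for one window.

    Ix, Iy, It: image gradients within a small window.
    Returns (u, v), the constant flow assumed for the window.
    """
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)  # (N, 2)
    b = -It.ravel()                                 # (N,)
    ATA = A.T @ A
    # Only trust windows where A^T A is well-conditioned (textured regions)
    if np.linalg.cond(ATA) > 1e6:
        return None
    u, v = np.linalg.solve(ATA, A.T @ b)
    return u, v

# Synthetic 5x5 window whose gradients satisfy I_x*u + I_y*v + I_t = 0
# for a known flow (u, v) = (1.0, -0.5)
rng = np.random.default_rng(0)
Ix = rng.normal(size=(5, 5))
Iy = rng.normal(size=(5, 5))
It = -(Ix * 1.0 + Iy * (-0.5))
u, v = lucas_kanade_window(Ix, Iy, It)
```

Since the constructed data exactly satisfies the constraint equation, the least-squares solution is exact here; on real images the residual measures how well the local-constancy assumption holds.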
Horn-Schunck Method
Adds a global smoothness regularization:
E = integral (I_x*u + I_y*v + I_t)^2 + alpha^2 * (|grad u|^2 + |grad v|^2) dx dy
- alpha: smoothness weight (larger = smoother flow)
- Solved iteratively via Euler-Lagrange equations
- Produces dense flow fields
- Over-smooths motion boundaries
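A simplified sketch of the classic iterative scheme: alternate a neighbor-averaging step with the pointwise update derived from the Euler-Lagrange equations. The demo gradients are synthetic, consistent with a constant flow:

```python
import numpy as np

def horn_schunck(Ix, Iy, It, alpha=1.0, n_iter=200):
    """Iterative Horn-Schunck update (simplified sketch)."""
    u = np.zeros_like(Ix)
    v = np.zeros_like(Ix)

    def neighbor_avg(f):
        # 4-neighbor average with replicated borders
        p = np.pad(f, 1, mode="edge")
        return 0.25 * (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:])

    denom = alpha**2 + Ix**2 + Iy**2
    for _ in range(n_iter):
        u_bar, v_bar = neighbor_avg(u), neighbor_avg(v)
        t = (Ix * u_bar + Iy * v_bar + It) / denom
        u = u_bar - Ix * t   # pull flow toward the constraint line,
        v = v_bar - Iy * t   # weighted against the smoothness prior
    return u, v

# Synthetic gradients consistent with a constant flow (1.0, -0.5)
rng = np.random.default_rng(0)
Ix = rng.normal(size=(16, 16))
Iy = rng.normal(size=(16, 16))
It = -(Ix * 1.0 + Iy * (-0.5))
u, v = horn_schunck(Ix, Iy, It)
```

The output is dense: every pixel receives a flow vector, with the smoothness term filling in untextured regions from their neighbors.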
RAFT (Recurrent All-Pairs Field Transforms)
Deep learning-based optical flow (Teed & Deng, 2020):
- Feature extraction: shared CNN encodes both frames into feature maps
- Correlation volume: compute all-pairs dot products between frame 1 and frame 2 features (4D correlation volume)
- Correlation lookup: index the volume at current flow estimate to get local correlation features
- Iterative update: GRU-based recurrent unit refines flow estimate using correlation features, context features, and current flow
- Typically 12-32 iterations at inference
Key innovations:
- All-pairs correlation at 1/8 resolution + multi-scale lookup
- Iterative refinement converges smoothly
- State-of-the-art accuracy; generalizes well across datasets
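The all-pairs correlation volume at the heart of RAFT is just a batched dot product between the two feature maps. A toy NumPy version with illustrative shapes (no learned features):

```python
import numpy as np

# Feature maps for two frames: height H, width W, channel depth D
H, W, D = 4, 5, 8
f1 = np.random.default_rng(1).normal(size=(H, W, D))
f2 = np.random.default_rng(2).normal(size=(H, W, D))

# 4D correlation volume: every pixel of frame 1 against every pixel of frame 2
corr = np.einsum("ijd,kld->ijkl", f1, f2)   # shape (H, W, H, W)
```

In RAFT this is built once at 1/8 resolution; the iterative updates then only *look up* local slices of it around the current flow estimate, which keeps each refinement step cheap.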
Flow Evaluation
- EPE (End-Point Error): average L2 distance between predicted and ground-truth flow
- Fl-all: percentage of pixels with EPE > 3px and > 5% of ground truth magnitude
- Datasets: Sintel (synthetic), KITTI (driving), FlyingChairs/Things (pre-training)
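Both metrics follow directly from their definitions; a short sketch on a toy flow field:

```python
import numpy as np

def epe(flow_pred, flow_gt):
    """End-point error: mean L2 distance between predicted and GT flow."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1).mean()

def fl_all(flow_pred, flow_gt):
    """Fraction of pixels with EPE > 3 px AND > 5% of the GT magnitude."""
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    mag = np.linalg.norm(flow_gt, axis=-1)
    return np.mean((err > 3.0) & (err > 0.05 * mag))

# Toy 2x2 flow field; one pixel is off by 4 px
gt = np.full((2, 2, 2), 10.0)
pred = gt.copy()
pred[0, 0, 0] += 4.0
```

On this example one of four pixels has a 4 px error, so EPE = 1.0 and Fl-all = 0.25.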
Object Tracking
Problem Formulation
Given an object's location in frame 1, determine its location in all subsequent frames.
- Single Object Tracking (SOT): track one target initialized in the first frame.
- Multi-Object Tracking (MOT): track all objects of a category across frames, maintaining consistent IDs.
Tracking by Detection (MOT)
Dominant paradigm: run a detector per frame, then associate detections across frames.
SORT (Simple Online and Realtime Tracking)
- Detection: run object detector on each frame
- State estimation: Kalman filter predicts each track's next position
- State: [x, y, s, r, dx, dy, ds] (center, scale, aspect ratio, and their velocities)
- Constant-velocity motion model
- Association: Hungarian algorithm matches predictions to detections using IoU
- Track management: create new tracks for unmatched detections, delete tracks unmatched for T frames
Simple and fast (260 Hz) but fragile to occlusions and missed detections.
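A toy version of the association step. Greedy IoU matching is used here as a simple stand-in for the Hungarian algorithm, and the Kalman filter is assumed to have already produced `preds`:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(predictions, detections, iou_thresh=0.3):
    """Match predicted track boxes to detections by IoU (greedy sketch)."""
    ious = np.array([[iou(p, d) for d in detections] for p in predictions])
    matches, used_t, used_d = [], set(), set()
    # consider (track, detection) pairs in order of decreasing IoU
    for idx in np.argsort(-ious, axis=None):
        t, d = np.unravel_index(idx, ious.shape)
        if ious[t, d] < iou_thresh:
            break
        if t not in used_t and d not in used_d:
            matches.append((int(t), int(d)))
            used_t.add(t)
            used_d.add(d)
    unmatched_dets = [d for d in range(len(detections)) if d not in used_d]
    return matches, unmatched_dets

preds = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(1, 1, 11, 11), (100, 100, 110, 110)]
matches, unmatched = associate(preds, dets)
```

Unmatched detections would spawn new tracks; the unmatched track here would be deleted after T frames without a match.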
DeepSORT
Extends SORT with appearance features:
- Deep association metric: combine motion distance (Mahalanobis from Kalman) with appearance distance (cosine distance of ReID features)
- Gallery: maintain feature buffer per track for robust matching
- Cascade matching: prioritize recently seen tracks
- Significantly reduces ID switches compared to SORT
ByteTrack
Key insight: use low-confidence detections that other trackers discard.
- First association: match high-confidence detections to tracks (IoU-based)
- Second association: match remaining tracks to low-confidence detections
- This recovers partially occluded objects that produce weak detections
State-of-the-art MOT performance with any detector. No appearance model needed.
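The two-round association can be sketched as follows. `greedy_match` is a hypothetical nearest-center matcher standing in for ByteTrack's IoU + Hungarian step, and tracks/detections are reduced to (x, y) centers for brevity:

```python
import numpy as np

def greedy_match(tracks, dets, max_dist=50.0):
    """Greedy nearest-center matching (stand-in for IoU + Hungarian)."""
    matches, leftover, used = [], [], set()
    for t in tracks:
        dists = [np.hypot(t[0] - d[0], t[1] - d[1]) if i not in used else np.inf
                 for i, d in enumerate(dets)]
        if dists and min(dists) <= max_dist:
            j = int(np.argmin(dists))
            matches.append((t, dets[j]))
            used.add(j)
        else:
            leftover.append(t)
    return matches, leftover

def byte_associate(tracks, dets, scores, high=0.6):
    """ByteTrack's key idea: high-score detections first, then low-score."""
    high_d = [d for d, s in zip(dets, scores) if s >= high]
    low_d = [d for d, s in zip(dets, scores) if s < high]
    m1, leftover = greedy_match(tracks, high_d)        # round 1
    m2, still_lost = greedy_match(leftover, low_d)     # round 2
    return m1, m2, still_lost

tracks = [(0.0, 0.0), (100.0, 100.0)]
dets = [(2.0, 1.0), (98.0, 103.0)]
scores = [0.9, 0.3]   # second detection is weak (e.g., partially occluded)
m1, m2, lost = byte_associate(tracks, dets, scores)
```

The weak detection would be discarded by a confidence threshold in SORT-style trackers; here it rescues the second track in round 2.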
Modern Trackers
- OC-SORT: observation-centric SORT with virtual trajectory and online momentum
- BoT-SORT: combines motion, appearance, and camera motion compensation
- Tracking transformers: TrackFormer, MOTR -- end-to-end with track queries
MOT Evaluation Metrics
| Metric | Description |
|--------|-------------|
| MOTA | Multi-Object Tracking Accuracy: 1 - (FN + FP + IDsw) / GT |
| IDF1 | F1 score of ID-correct detections |
| HOTA | Higher Order Tracking Accuracy: geometric mean of detection and association |
| IDsw | Number of identity switches |
| MT/ML | Mostly Tracked / Mostly Lost track ratios |
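MOTA follows directly from the formula in the table; note it can go negative when errors exceed the number of ground-truth objects:

```python
def mota(fn, fp, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDsw) / total ground-truth objects."""
    return 1.0 - (fn + fp + id_switches) / num_gt

# e.g. 100 GT objects, 5 misses, 3 false positives, 2 ID switches
score = mota(5, 3, 2, 100)
```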
Action Recognition
Problem Variants
- Action classification: single label per video clip
- Temporal action detection: localize action start/end times + classify
- Spatial-temporal detection: localize actor in space and time (AVA dataset)
Two-Stream Networks
- Spatial stream: single RGB frame through 2D CNN (appearance)
- Temporal stream: stacked optical flow frames through 2D CNN (motion)
- Fuse predictions (late fusion by averaging, or early/mid fusion)
I3D (Inflated 3D ConvNet)
Inflate 2D convolutional filters to 3D to process video volumes:
- 2D filter k x k becomes 3D filter t x k x k
- Initialize by repeating 2D weights along the temporal dimension (divided by t)
- Can inflate any pre-trained 2D architecture (Inception, ResNet)
- Two-stream I3D: RGB + optical flow streams with 3D convolutions
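The inflation trick is a one-liner; a sketch using NumPy arrays with a (k, k, c_in, c_out) filter layout, which is an assumption here (frameworks differ):

```python
import numpy as np

def inflate_2d_to_3d(w2d, t):
    """Inflate a 2D conv filter (k, k, c_in, c_out) to 3D (t, k, k, c_in, c_out).

    Repeating the weights t times and dividing by t means a temporally
    constant ("boring") video produces the same activations as the
    original 2D network on a single frame.
    """
    return np.repeat(w2d[None], t, axis=0) / t

w2d = np.random.default_rng(0).normal(size=(3, 3, 16, 32))
w3d = inflate_2d_to_3d(w2d, t=5)
```

Summing the inflated filter over time recovers the original 2D filter, which is exactly the boring-video equivalence.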
SlowFast Networks
Dual-pathway architecture:
- Slow pathway: low frame rate (e.g., 4 fps), large channel capacity -- captures spatial semantics
- Fast pathway: high frame rate (e.g., 32 fps), lightweight (1/8 channels) -- captures temporal dynamics
- Lateral connections: fuse fast pathway into slow pathway
Rationale: spatial structure changes slowly but motion requires high temporal resolution.
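Frame sampling for the two pathways can be sketched with index arrays; the stride and alpha values below are illustrative, chosen to reproduce the 8x rate ratio above:

```python
import numpy as np

def sample_pathways(num_frames, alpha=8, fast_stride=2):
    """Return frame indices for the slow and fast pathways.

    The fast pathway samples every `fast_stride` frames; the slow
    pathway keeps alpha times fewer frames than the fast one.
    """
    fast_idx = np.arange(0, num_frames, fast_stride)
    slow_idx = fast_idx[::alpha]
    return slow_idx, fast_idx

slow, fast = sample_pathways(64)   # 64-frame clip -> 4 slow, 32 fast frames
```

The channel asymmetry goes the other way: the slow pathway sees few frames but with full channel capacity, the fast pathway many frames with ~1/8 the channels.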
VideoMAE
Masked autoencoder for video self-supervised pre-training:
- Mask 90-95% of video patches (higher ratio than images due to temporal redundancy)
- ViT encoder processes visible patches only
- Lightweight decoder reconstructs masked patches
- Pre-trained model fine-tuned for action recognition
- Achieves strong performance with limited labeled data
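A sketch of the masking step. VideoMAE actually uses *tube* masking (the same spatial mask shared across all frames); plain random patch masking is shown here for simplicity:

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio=0.9, seed=0):
    """Randomly split patch indices into (visible, masked) sets.

    Only the visible patches are fed to the ViT encoder; the decoder
    reconstructs the masked ones.
    """
    rng = np.random.default_rng(seed)
    n_keep = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)
    return np.sort(perm[:n_keep]), np.sort(perm[n_keep:])

# e.g. a 16-frame clip tokenized into 8 x 14 x 14 = 1568 spacetime patches
visible, masked = random_patch_mask(1568, mask_ratio=0.9)
```

With 90% masking the encoder processes only ~10% of the tokens, which is a large part of why video MAE pre-training is affordable.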
Temporal Action Detection
Locate and classify actions in untrimmed videos.
Two-Stage Methods
- Proposal generation: produce candidate temporal segments
- ActionFormer: transformer-based with multi-scale features, predicts action boundaries at each temporal location
- BMN: Boundary-Matching Network with confidence map
- Classification: classify and refine each proposal
One-Stage Methods
- AFSD: anchor-free with coarse-to-fine refinement
- Predict action boundaries and class at each temporal location directly
Evaluation
- mAP at temporal IoU thresholds (0.3, 0.5, 0.7): analogous to object detection AP but in temporal domain
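Temporal IoU between a predicted and a ground-truth segment is the 1D analogue of box IoU:

```python
def temporal_iou(seg_a, seg_b):
    """IoU of two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union

tiou = temporal_iou((0.0, 10.0), (5.0, 15.0))   # 5s overlap, 15s union
```

A prediction counts as a true positive at threshold 0.5 only if its temporal IoU with a ground-truth segment of the same class exceeds 0.5, mirroring box-IoU matching in detection AP.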
Video Understanding Beyond Recognition
Video Captioning
Generate natural language descriptions of video content:
- Encoder-decoder: video features (3D CNN or ViT) + language model (LSTM or transformer)
- Dense captioning: localize and describe multiple events
Video Question Answering (VideoQA)
Answer questions about video content requiring temporal reasoning.
Video-Language Models
- VideoCLIP: contrastive learning between video and text
- InternVideo: foundation model for video understanding
- Video-LLaVA: extend multimodal LLMs to video input
Video Generation
Diffusion-Based Video Generation
Extend image diffusion models to the temporal dimension:
- Video Diffusion Models: 3D U-Net with temporal attention layers
- Stable Video Diffusion: latent diffusion in video space, fine-tuned from image model
- Sora (OpenAI): DiT-based model generating high-fidelity, long-duration videos with coherent physics
- Kling, Runway Gen-3: commercial video generation systems
Key challenges:
- Temporal consistency (flickering, morphing)
- Long-duration coherence
- Physics plausibility
- Computational cost (3D attention is expensive)
Autoregressive Video Generation
- Generate frames sequentially conditioned on previous frames
- VideoGPT, MAGVIT: discrete video tokens with transformer
Practical Considerations
- Optical flow computation is expensive; RAFT requires GPU but runs at ~10 fps
- For MOT, detector quality matters more than tracker sophistication
- ByteTrack's use of low-confidence detections is a simple but crucial insight
- Action recognition benefits greatly from pre-training (Kinetics, HowTo100M)
- Video models are memory-intensive: use temporal sampling (uniform or random)
- Video generation quality has improved dramatically since 2023 but still struggles with fine-grained temporal coherence
Key Takeaways
| Concept | Core Idea |
|---------|-----------|
| Lucas-Kanade | Local constant flow; sparse, fast, pyramidal extension |
| RAFT | All-pairs correlation + iterative GRU refinement; SOTA flow |
| SORT/DeepSORT | Kalman filter + Hungarian matching; add ReID features for robustness |
| ByteTrack | Use low-confidence detections in second association round |
| SlowFast | Dual pathways for spatial semantics (slow) and motion (fast) |
| Video diffusion | Extend image diffusion to temporal dimension for generation |