Video Analysis
Optical Flow
Optical flow estimates the apparent motion of each pixel between consecutive frames as a 2D displacement field (u, v) at each pixel.
Brightness Constancy Assumption
The fundamental assumption: a pixel's intensity does not change between frames.
I(x, y, t) = I(x + u, y + v, t + 1)
First-order Taylor expansion yields the optical flow constraint equation:
I_x * u + I_y * v + I_t = 0
One equation, two unknowns -- the aperture problem. Additional constraints needed.
Lucas-Kanade Method
Assumes flow is locally constant within a small window (e.g., 5x5):
[ sum I_x^2     sum I_x*I_y ] [u]   [ -sum I_x*I_t ]
[ sum I_x*I_y   sum I_y^2   ] [v] = [ -sum I_y*I_t ]
Solved as a least-squares system: A^T A * d = A^T b.
Properties:
- Sparse: only reliable where A^T A is well-conditioned (textured regions)
- Pyramidal LK: coarse-to-fine for large displacements
- Used in KLT tracker (Kanade-Lucas-Tomasi)
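As a minimal sketch, the per-window normal equations above can be solved directly in NumPy. The gradients below are synthetic, constructed to be exactly consistent with a known flow, so the solver should recover it:

```python
import numpy as np

def lucas_kanade_window(Ix, Iy, It):
    """Solve the 2x2 normal equations A^T A d = A^T b for one window.

    Ix, Iy, It: image gradients within a small window.
    Returns (u, v), the constant flow assumed for the window.
    """
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)  # (N, 2)
    b = -It.ravel()                                 # (N,)
    ATA = A.T @ A
    # Only trust windows where A^T A is well-conditioned (textured regions)
    if np.linalg.cond(ATA) > 1e6:
        return None
    u, v = np.linalg.solve(ATA, A.T @ b)
    return u, v

# Synthetic 5x5 window whose gradients satisfy I_x*u + I_y*v + I_t = 0
# for a known flow (u, v) = (1.0, -0.5)
rng = np.random.default_rng(0)
Ix = rng.normal(size=(5, 5))
Iy = rng.normal(size=(5, 5))
It = -(Ix * 1.0 + Iy * (-0.5))
u, v = lucas_kanade_window(Ix, Iy, It)
```

Since the constructed data exactly satisfies the constraint equation, the least-squares solution is exact here; on real images the residual measures how well the local-constancy assumption holds.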
Horn-Schunck Method
Adds a global smoothness regularization:
E = integral (I_x*u + I_y*v + I_t)^2 + alpha^2 * (|grad u|^2 + |grad v|^2) dx dy
- alpha: smoothness weight (larger = smoother flow)
- Solved iteratively via Euler-Lagrange equations
- Produces dense flow fields
- Over-smooths motion boundaries
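A simplified sketch of the classic iterative scheme: alternate a neighbor-averaging step with the pointwise update derived from the Euler-Lagrange equations. The demo gradients are synthetic, consistent with a constant flow:

```python
import numpy as np

def horn_schunck(Ix, Iy, It, alpha=1.0, n_iter=200):
    """Iterative Horn-Schunck update (simplified sketch)."""
    u = np.zeros_like(Ix)
    v = np.zeros_like(Ix)

    def neighbor_avg(f):
        # 4-neighbor average with replicated borders
        p = np.pad(f, 1, mode="edge")
        return 0.25 * (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:])

    denom = alpha**2 + Ix**2 + Iy**2
    for _ in range(n_iter):
        u_bar, v_bar = neighbor_avg(u), neighbor_avg(v)
        t = (Ix * u_bar + Iy * v_bar + It) / denom
        u = u_bar - Ix * t   # pull flow toward the constraint line,
        v = v_bar - Iy * t   # weighted against the smoothness prior
    return u, v

# Synthetic gradients consistent with a constant flow (1.0, -0.5)
rng = np.random.default_rng(0)
Ix = rng.normal(size=(16, 16))
Iy = rng.normal(size=(16, 16))
It = -(Ix * 1.0 + Iy * (-0.5))
u, v = horn_schunck(Ix, Iy, It)
```

The output is dense: every pixel receives a flow vector, with the smoothness term filling in untextured regions from their neighbors.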
RAFT (Recurrent All-Pairs Field Transforms)
Deep learning-based optical flow (Teed & Deng, 2020):
- Feature extraction: shared CNN encodes both frames into feature maps
- Correlation volume: compute all-pairs dot products between frame 1 and frame 2 features (4D correlation volume)
- Correlation lookup: index the volume at current flow estimate to get local correlation features
- Iterative update: GRU-based recurrent unit refines flow estimate using correlation features, context features, and current flow
- Typically 12-32 iterations at inference
Key innovations:
- All-pairs correlation at 1/8 resolution + multi-scale lookup
- Iterative refinement converges smoothly
- State-of-the-art accuracy; generalizes well across datasets
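The all-pairs correlation volume at the heart of RAFT is just a batched dot product between the two feature maps. A toy NumPy version with illustrative shapes (no learned features):

```python
import numpy as np

# Feature maps for two frames: height H, width W, channel depth D
H, W, D = 4, 5, 8
f1 = np.random.default_rng(1).normal(size=(H, W, D))
f2 = np.random.default_rng(2).normal(size=(H, W, D))

# 4D correlation volume: every pixel of frame 1 against every pixel of frame 2
corr = np.einsum("ijd,kld->ijkl", f1, f2)   # shape (H, W, H, W)
```

In RAFT this is built once at 1/8 resolution; the iterative updates then only *look up* local slices of it around the current flow estimate, which keeps each refinement step cheap.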
Flow Evaluation
- EPE (End-Point Error): average L2 distance between predicted and ground-truth flow
- Fl-all: percentage of pixels with EPE > 3px and > 5% of ground truth magnitude
- Datasets: Sintel (synthetic), KITTI (driving), FlyingChairs/Things (pre-training)
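Both metrics follow directly from their definitions; a short sketch on a toy flow field:

```python
import numpy as np

def epe(flow_pred, flow_gt):
    """End-point error: mean L2 distance between predicted and GT flow."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1).mean()

def fl_all(flow_pred, flow_gt):
    """Fraction of pixels with EPE > 3 px AND > 5% of the GT magnitude."""
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    mag = np.linalg.norm(flow_gt, axis=-1)
    return np.mean((err > 3.0) & (err > 0.05 * mag))

# Toy 2x2 flow field; one pixel is off by 4 px
gt = np.full((2, 2, 2), 10.0)
pred = gt.copy()
pred[0, 0, 0] += 4.0
```

On this example one of four pixels has a 4 px error, so EPE = 1.0 and Fl-all = 0.25.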
Object Tracking
Problem Formulation
Given an object's location in frame 1, determine its location in all subsequent frames.
- Single Object Tracking (SOT): track one target initialized in the first frame.
- Multi-Object Tracking (MOT): track all objects of a category across frames, maintaining consistent IDs.
Tracking by Detection (MOT)
Dominant paradigm: run a detector per frame, then associate detections across frames.
SORT (Simple Online and Realtime Tracking)
- Detection: run object detector on each frame
- State estimation: Kalman filter predicts each track's next position
- State: [x, y, s, r, dx, dy, ds] (center, scale, aspect ratio, and their velocities)
- Constant-velocity motion model
- Association: Hungarian algorithm matches predictions to detections using IoU
- Track management: create new tracks for unmatched detections, delete tracks unmatched for T frames
Simple and fast (260 Hz) but fragile to occlusions and missed detections.
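A toy version of the association step. Greedy IoU matching is used here as a simple stand-in for the Hungarian algorithm, and the Kalman filter is assumed to have already produced `preds`:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(predictions, detections, iou_thresh=0.3):
    """Match predicted track boxes to detections by IoU (greedy sketch)."""
    ious = np.array([[iou(p, d) for d in detections] for p in predictions])
    matches, used_t, used_d = [], set(), set()
    # consider (track, detection) pairs in order of decreasing IoU
    for idx in np.argsort(-ious, axis=None):
        t, d = np.unravel_index(idx, ious.shape)
        if ious[t, d] < iou_thresh:
            break
        if t not in used_t and d not in used_d:
            matches.append((int(t), int(d)))
            used_t.add(t)
            used_d.add(d)
    unmatched_dets = [d for d in range(len(detections)) if d not in used_d]
    return matches, unmatched_dets

preds = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(1, 1, 11, 11), (100, 100, 110, 110)]
matches, unmatched = associate(preds, dets)
```

Unmatched detections would spawn new tracks; the unmatched track here would be deleted after T frames without a match.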
DeepSORT
Extends SORT with appearance features:
- Deep association metric: combine motion distance (Mahalanobis from Kalman) with appearance distance (cosine distance of ReID features)
- Gallery: maintain feature buffer per track for robust matching
- Cascade matching: prioritize recently seen tracks
- Significantly reduces ID switches compared to SORT
ByteTrack
Key insight: use low-confidence detections that other trackers discard.
- First association: match high-confidence detections to tracks (IoU-based)
- Second association: match remaining tracks to low-confidence detections
- This recovers partially occluded objects that produce weak detections
State-of-the-art MOT performance with any detector. No appearance model needed.
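The two-round association can be sketched as follows. `greedy_match` is a hypothetical nearest-center matcher standing in for ByteTrack's IoU + Hungarian step, and tracks/detections are reduced to (x, y) centers for brevity:

```python
import numpy as np

def greedy_match(tracks, dets, max_dist=50.0):
    """Greedy nearest-center matching (stand-in for IoU + Hungarian)."""
    matches, leftover, used = [], [], set()
    for t in tracks:
        dists = [np.hypot(t[0] - d[0], t[1] - d[1]) if i not in used else np.inf
                 for i, d in enumerate(dets)]
        if dists and min(dists) <= max_dist:
            j = int(np.argmin(dists))
            matches.append((t, dets[j]))
            used.add(j)
        else:
            leftover.append(t)
    return matches, leftover

def byte_associate(tracks, dets, scores, high=0.6):
    """ByteTrack's key idea: high-score detections first, then low-score."""
    high_d = [d for d, s in zip(dets, scores) if s >= high]
    low_d = [d for d, s in zip(dets, scores) if s < high]
    m1, leftover = greedy_match(tracks, high_d)        # round 1
    m2, still_lost = greedy_match(leftover, low_d)     # round 2
    return m1, m2, still_lost

tracks = [(0.0, 0.0), (100.0, 100.0)]
dets = [(2.0, 1.0), (98.0, 103.0)]
scores = [0.9, 0.3]   # second detection is weak (e.g., partially occluded)
m1, m2, lost = byte_associate(tracks, dets, scores)
```

The weak detection would be discarded by a confidence threshold in SORT-style trackers; here it rescues the second track in round 2.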
Modern Trackers
- OC-SORT: observation-centric SORT with virtual trajectory and online momentum
- BoT-SORT: combines motion, appearance, and camera motion compensation
- Tracking transformers: TrackFormer, MOTR -- end-to-end with track queries
MOT Evaluation Metrics
| Metric | Description |
|--------|-------------|
| MOTA | Multi-Object Tracking Accuracy: 1 - (FN + FP + IDsw) / GT |
| IDF1 | F1 score of ID-correct detections |
| HOTA | Higher Order Tracking Accuracy: geometric mean of detection and association |
| IDsw | Number of identity switches |
| MT/ML | Mostly Tracked / Mostly Lost track ratios |
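MOTA follows directly from the formula in the table; note it can go negative when errors exceed the number of ground-truth objects:

```python
def mota(fn, fp, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDsw) / total ground-truth objects."""
    return 1.0 - (fn + fp + id_switches) / num_gt

# e.g. 100 GT objects, 5 misses, 3 false positives, 2 ID switches
score = mota(5, 3, 2, 100)
```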
Action Recognition
Problem Variants
- Action classification: single label per video clip
- Temporal action detection: localize action start/end times + classify
- Spatial-temporal detection: localize actor in space and time (AVA dataset)
Two-Stream Networks
- Spatial stream: single RGB frame through 2D CNN (appearance)
- Temporal stream: stacked optical flow frames through 2D CNN (motion)
- Fuse predictions (late fusion by averaging, or early/mid fusion)
I3D (Inflated 3D ConvNet)
Inflate 2D convolutional filters to 3D to process video volumes:
- 2D filter k x k becomes 3D filter t x k x k
- Initialize by repeating 2D weights along the temporal dimension (divided by t)
- Can inflate any pre-trained 2D architecture (Inception, ResNet)
- Two-stream I3D: RGB + optical flow streams with 3D convolutions
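The inflation trick is a one-liner; a sketch using NumPy arrays with a (k, k, c_in, c_out) filter layout, which is an assumption here (frameworks differ):

```python
import numpy as np

def inflate_2d_to_3d(w2d, t):
    """Inflate a 2D conv filter (k, k, c_in, c_out) to 3D (t, k, k, c_in, c_out).

    Repeating the weights t times and dividing by t means a temporally
    constant ("boring") video produces the same activations as the
    original 2D network on a single frame.
    """
    return np.repeat(w2d[None], t, axis=0) / t

w2d = np.random.default_rng(0).normal(size=(3, 3, 16, 32))
w3d = inflate_2d_to_3d(w2d, t=5)
```

Summing the inflated filter over time recovers the original 2D filter, which is exactly the boring-video equivalence.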
SlowFast Networks
Dual-pathway architecture:
- Slow pathway: low frame rate (e.g., 4 fps), large channel capacity -- captures spatial semantics
- Fast pathway: high frame rate (e.g., 32 fps), lightweight (1/8 channels) -- captures temporal dynamics
- Lateral connections: fuse fast pathway into slow pathway
Rationale: spatial structure changes slowly but motion requires high temporal resolution.
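Frame sampling for the two pathways can be sketched with index arrays; the stride and alpha values below are illustrative, chosen to reproduce the 8x rate ratio above:

```python
import numpy as np

def sample_pathways(num_frames, alpha=8, fast_stride=2):
    """Return frame indices for the slow and fast pathways.

    The fast pathway samples every `fast_stride` frames; the slow
    pathway keeps alpha times fewer frames than the fast one.
    """
    fast_idx = np.arange(0, num_frames, fast_stride)
    slow_idx = fast_idx[::alpha]
    return slow_idx, fast_idx

slow, fast = sample_pathways(64)   # 64-frame clip -> 4 slow, 32 fast frames
```

The channel asymmetry goes the other way: the slow pathway sees few frames but with full channel capacity, the fast pathway many frames with ~1/8 the channels.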
VideoMAE
Masked autoencoder for video self-supervised pre-training:
- Mask 90-95% of video patches (higher ratio than images due to temporal redundancy)
- ViT encoder processes visible patches only
- Lightweight decoder reconstructs masked patches
- Pre-trained model fine-tuned for action recognition
- Achieves strong performance with limited labeled data
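A sketch of the masking step. VideoMAE actually uses *tube* masking (the same spatial mask shared across all frames); plain random patch masking is shown here for simplicity:

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio=0.9, seed=0):
    """Randomly split patch indices into (visible, masked) sets.

    Only the visible patches are fed to the ViT encoder; the decoder
    reconstructs the masked ones.
    """
    rng = np.random.default_rng(seed)
    n_keep = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)
    return np.sort(perm[:n_keep]), np.sort(perm[n_keep:])

# e.g. a 16-frame clip tokenized into 8 x 14 x 14 = 1568 spacetime patches
visible, masked = random_patch_mask(1568, mask_ratio=0.9)
```

With 90% masking the encoder processes only ~10% of the tokens, which is a large part of why video MAE pre-training is affordable.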
Temporal Action Detection
Locate and classify actions in untrimmed videos.
Two-Stage Methods
- Proposal generation: produce candidate temporal segments
- ActionFormer: transformer-based with multi-scale features, predicts action boundaries at each temporal location
- BMN: Boundary-Matching Network with confidence map
- Classification: classify and refine each proposal
One-Stage Methods
- AFSD: anchor-free with coarse-to-fine refinement
- Predict action boundaries and class at each temporal location directly
Evaluation
- mAP at temporal IoU thresholds (0.3, 0.5, 0.7): analogous to object detection AP but in temporal domain
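Temporal IoU between a predicted and a ground-truth segment is the 1D analogue of box IoU:

```python
def temporal_iou(seg_a, seg_b):
    """IoU of two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union

tiou = temporal_iou((0.0, 10.0), (5.0, 15.0))   # 5s overlap, 15s union
```

A prediction counts as a true positive at threshold 0.5 only if its temporal IoU with a ground-truth segment of the same class exceeds 0.5, mirroring box-IoU matching in detection AP.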
Video Understanding Beyond Recognition
Video Captioning
Generate natural language descriptions of video content:
- Encoder-decoder: video features (3D CNN or ViT) + language model (LSTM or transformer)
- Dense captioning: localize and describe multiple events
Video Question Answering (VideoQA)
Answer questions about video content requiring temporal reasoning.
Video-Language Models
- VideoCLIP: contrastive learning between video and text
- InternVideo: foundation model for video understanding
- Video-LLaVA: extend multimodal LLMs to video input
Video Generation
Diffusion-Based Video Generation
Extend image diffusion models to the temporal dimension:
- Video Diffusion Models: 3D U-Net with temporal attention layers
- Stable Video Diffusion: latent diffusion in video space, fine-tuned from image model
- Sora (OpenAI): DiT-based model generating high-fidelity, long-duration videos with coherent physics
- Kling, Runway Gen-3: commercial video generation systems
Key challenges:
- Temporal consistency (flickering, morphing)
- Long-duration coherence
- Physics plausibility
- Computational cost (3D attention is expensive)
Autoregressive Video Generation
- Generate frames sequentially conditioned on previous frames
- VideoGPT, MAGVIT: discrete video tokens with transformer
Practical Considerations
- Optical flow computation is expensive; RAFT requires GPU but runs at ~10 fps
- For MOT, detector quality matters more than tracker sophistication
- ByteTrack's use of low-confidence detections is a simple but crucial insight
- Action recognition benefits greatly from pre-training (Kinetics, HowTo100M)
- Video models are memory-intensive: use temporal sampling (uniform or random)
- Video generation quality has improved dramatically since 2023 but still struggles with fine-grained temporal coherence
Key Takeaways
| Concept | Core Idea |
|---------|-----------|
| Lucas-Kanade | Local constant flow; sparse, fast, pyramidal extension |
| RAFT | All-pairs correlation + iterative GRU refinement; SOTA flow |
| SORT/DeepSORT | Kalman filter + Hungarian matching; add ReID features for robustness |
| ByteTrack | Use low-confidence detections in second association round |
| SlowFast | Dual pathways for spatial semantics (slow) and motion (fast) |
| Video diffusion | Extend image diffusion to temporal dimension for generation |