Video Analysis

Optical Flow

Optical flow estimates the apparent motion of each pixel between consecutive frames as a 2D displacement field (u, v) at each pixel.

Brightness Constancy Assumption

The fundamental assumption: a pixel's intensity does not change between frames.

I(x, y, t) = I(x + u, y + v, t + 1)

First-order Taylor expansion yields the optical flow constraint equation:

I_x * u + I_y * v + I_t = 0

One equation, two unknowns -- the aperture problem. Additional constraints needed.

Lucas-Kanade Method

Assumes flow is locally constant within a small window (e.g., 5x5):

[sum I_x^2     sum I_x*I_y] [u]   [-sum I_x*I_t]
[sum I_x*I_y   sum I_y^2  ] [v] = [-sum I_y*I_t]

Solved as a least-squares system: A^T A * d = A^T b.

Properties:

  • Sparse: only reliable where A^T A is well-conditioned (textured regions)
  • Pyramidal LK: coarse-to-fine for large displacements
  • Used in KLT tracker (Kanade-Lucas-Tomasi)
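
The normal equations above can be sketched for a single window. This is a minimal NumPy illustration assuming precomputed spatial and temporal derivatives (`Ix`, `Iy`, `It`); the function name and conditioning threshold are illustrative choices, not a reference implementation:

```python
import numpy as np

def lucas_kanade_window(Ix, Iy, It):
    """Solve the 2x2 Lucas-Kanade normal equations for one window.

    Ix, Iy, It: derivative values inside the window, flattened to 1-D.
    Returns (u, v), or None if the structure tensor is ill-conditioned.
    """
    A = np.stack([Ix, Iy], axis=1)   # N x 2 matrix of spatial gradients
    b = -It                          # right-hand side
    ATA = A.T @ A                    # [[sum Ix^2, sum IxIy], [sum IxIy, sum Iy^2]]
    ATb = A.T @ b                    # [-sum IxIt, -sum IyIt]
    # Only solve where A^T A is well-conditioned (textured regions);
    # the 1e6 cutoff is an arbitrary example threshold
    if np.linalg.cond(ATA) > 1e6:
        return None
    u, v = np.linalg.solve(ATA, ATb)
    return u, v
```

Note how the sparsity property falls out directly: in flat regions the gradients are near-zero, A^T A is nearly singular, and the window is rejected.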

Horn-Schunck Method

Adds a global smoothness regularization:

E = integral (I_x*u + I_y*v + I_t)^2 + alpha^2 * (|grad u|^2 + |grad v|^2) dx dy

  • alpha: smoothness weight (larger = smoother flow)
  • Solved iteratively via Euler-Lagrange equations
  • Produces dense flow fields
  • Over-smooths motion boundaries
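
The iterative Euler-Lagrange solution can be sketched in a few lines of NumPy, again assuming precomputed derivatives. The 4-neighbour average used for the smoothness term (with wrap-around borders via `np.roll`) is a simplification of the weighted average in the original paper:

```python
import numpy as np

def horn_schunck(Ix, Iy, It, alpha=1.0, n_iters=100):
    """Minimal Horn-Schunck solver: Jacobi iterations on the
    Euler-Lagrange equations. Returns dense flow fields (u, v)."""
    u = np.zeros_like(Ix, dtype=float)
    v = np.zeros_like(Ix, dtype=float)

    def local_avg(f):
        # 4-neighbour average standing in for the smoothness term
        return 0.25 * (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
                       np.roll(f, 1, 1) + np.roll(f, -1, 1))

    denom = alpha**2 + Ix**2 + Iy**2
    for _ in range(n_iters):
        u_bar, v_bar = local_avg(u), local_avg(v)
        # Shared update term from the Euler-Lagrange equations
        t = (Ix * u_bar + Iy * v_bar + It) / denom
        u = u_bar - Ix * t
        v = v_bar - Iy * t
    return u, v
```

A larger `alpha` weights the neighbour average more heavily relative to the data term, giving smoother (and at motion boundaries, over-smoothed) flow.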

RAFT (Recurrent All-Pairs Field Transforms)

Deep learning-based optical flow (Teed & Deng, 2020):

  1. Feature extraction: shared CNN encodes both frames into feature maps
  2. Correlation volume: compute all-pairs dot products between frame 1 and frame 2 features (4D correlation volume)
  3. Correlation lookup: index the volume at current flow estimate to get local correlation features
  4. Iterative update: GRU-based recurrent unit refines flow estimate using correlation features, context features, and current flow
  5. Typically 12-32 iterations at inference

Key innovations:

  • All-pairs correlation at 1/8 resolution + multi-scale lookup
  • Iterative refinement converges smoothly
  • State-of-the-art accuracy; generalizes well across datasets
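
The all-pairs correlation volume of step 2 is just a pairwise dot product between the two feature maps. A NumPy sketch (RAFT builds this at 1/8 resolution and also scales by sqrt of the feature dimension; the function name is illustrative):

```python
import numpy as np

def all_pairs_correlation(f1, f2):
    """4-D correlation volume C[i, j, k, l] = <f1[i, j], f2[k, l]>.

    f1, f2: (H, W, D) feature maps from the shared encoder.
    Returns a (H, W, H, W) volume, normalized by sqrt(D).
    """
    H, W, D = f1.shape
    return np.einsum('ijd,kld->ijkl', f1, f2) / np.sqrt(D)
```

The GRU then never recomputes this volume: each refinement iteration only indexes it in a local neighbourhood around the current flow estimate, which is what keeps the iterations cheap.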

Flow Evaluation

  • EPE (End-Point Error): average L2 distance between predicted and ground-truth flow
  • Fl-all: percentage of outlier pixels, i.e. those whose EPE exceeds both 3 px and 5% of the ground-truth flow magnitude
  • Datasets: Sintel (synthetic), KITTI (driving), FlyingChairs/Things (pre-training)
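
Both metrics are straightforward to compute; a sketch for flow fields stored as (H, W, 2) arrays:

```python
import numpy as np

def flow_metrics(pred, gt):
    """EPE and Fl-all for flow fields of shape (H, W, 2).

    EPE: mean L2 distance between predicted and ground-truth flow.
    Fl-all: fraction of pixels with EPE > 3 px AND > 5% of gt magnitude.
    """
    epe = np.linalg.norm(pred - gt, axis=-1)        # per-pixel end-point error
    gt_mag = np.linalg.norm(gt, axis=-1)
    outlier = (epe > 3.0) & (epe > 0.05 * gt_mag)   # both conditions must hold
    return epe.mean(), outlier.mean()
```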

Object Tracking

Problem Formulation

Given an object's location in frame 1, determine its location in all subsequent frames.

Single Object Tracking (SOT): track one target initialized in the first frame. Multi-Object Tracking (MOT): track all objects of a category across frames, maintaining consistent IDs.

Tracking by Detection (MOT)

Dominant paradigm: run a detector per frame, then associate detections across frames.

SORT (Simple Online and Realtime Tracking)

  1. Detection: run object detector on each frame
  2. State estimation: Kalman filter predicts each track's next position
    • State: [x, y, s, r, dx, dy, ds] (center, scale, aspect ratio, velocities)
    • Constant-velocity motion model
  3. Association: Hungarian algorithm matches predictions to detections using IoU
  4. Track management: create new tracks for unmatched detections, delete tracks unmatched for T frames

Simple and fast (260 Hz) but fragile to occlusions and missed detections.
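
The association step (3) can be sketched as follows. SORT uses the Hungarian algorithm for the assignment; here a greedy match on the IoU matrix stands in for it to keep the example dependency-free, and the 0.3 threshold mirrors a common default:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted_boxes, detections, iou_thresh=0.3):
    """Match Kalman-predicted track boxes to detections by IoU.

    Greedy stand-in for the Hungarian algorithm: take pairs in order
    of decreasing IoU, skipping anything already matched.
    """
    M = np.array([[iou(p, d) for d in detections] for p in predicted_boxes])
    matches, used_t, used_d = [], set(), set()
    for t, d in sorted(np.ndindex(M.shape), key=lambda td: -M[td]):
        if M[t, d] < iou_thresh or t in used_t or d in used_d:
            continue
        matches.append((t, d))
        used_t.add(t)
        used_d.add(d)
    unmatched_tracks = [t for t in range(len(predicted_boxes)) if t not in used_t]
    unmatched_dets = [d for d in range(len(detections)) if d not in used_d]
    return matches, unmatched_tracks, unmatched_dets
```

Unmatched detections then spawn new tracks, and tracks unmatched for T frames are deleted (step 4).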

DeepSORT

Extends SORT with appearance features:

  • Deep association metric: combine motion distance (Mahalanobis from Kalman) with appearance distance (cosine distance of ReID features)
  • Gallery: maintain feature buffer per track for robust matching
  • Cascade matching: prioritize recently seen tracks
  • Significantly reduces ID switches compared to SORT
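
The combined cost can be sketched as a weighted blend of the two distances. The blending weight `lam` here is a hypothetical parameter for illustration (in the paper, the Mahalanobis distance is mainly used as a gate on admissible matches while appearance drives the cost):

```python
import numpy as np

def combined_distance(motion_dist, track_gallery, det_feature, lam=0.5):
    """DeepSORT-style association cost for one track/detection pair.

    motion_dist: Mahalanobis distance from the Kalman prediction.
    track_gallery: (N, D) buffer of L2-normalized ReID features.
    det_feature: (D,) L2-normalized ReID feature of the detection.
    """
    # Minimum cosine distance between the detection and the gallery
    appearance_dist = (1.0 - track_gallery @ det_feature).min()
    return lam * motion_dist + (1.0 - lam) * appearance_dist
```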

ByteTrack

Key insight: use low-confidence detections that other trackers discard.

  1. First association: match high-confidence detections to tracks (IoU-based)
  2. Second association: match remaining tracks to low-confidence detections
  3. This recovers partially occluded objects that produce weak detections

State-of-the-art MOT performance with any detector. No appearance model needed.
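
The two-round structure can be sketched as follows, again with greedy IoU matching standing in for the Hungarian assignment; the 0.6 confidence split and 0.3 IoU threshold are illustrative defaults:

```python
import numpy as np

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    ua = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / (ua + 1e-9)

def greedy_match(tracks, dets, thresh):
    """Greedy IoU matching (stand-in for Hungarian assignment)."""
    pairs = sorted(((iou(tracks[t], dets[d]), t, d)
                    for t in range(len(tracks)) for d in range(len(dets))),
                   reverse=True)
    matches, ut, ud = [], set(range(len(tracks))), set(range(len(dets)))
    for score, t, d in pairs:
        if score >= thresh and t in ut and d in ud:
            matches.append((t, d))
            ut.discard(t)
            ud.discard(d)
    return matches, ut

def byte_associate(track_boxes, det_boxes, det_scores, high=0.6, thresh=0.3):
    """ByteTrack's two-round association over one frame's detections."""
    hi = [i for i, s in enumerate(det_scores) if s >= high]
    lo = [i for i, s in enumerate(det_scores) if s < high]
    # Round 1: all tracks vs. high-confidence detections
    m1, leftover = greedy_match(track_boxes, [det_boxes[i] for i in hi], thresh)
    matches = [(t, hi[d]) for t, d in m1]
    # Round 2: still-unmatched tracks vs. low-confidence detections,
    # recovering occluded objects that produced weak detections
    rest = sorted(leftover)
    m2, _ = greedy_match([track_boxes[t] for t in rest],
                         [det_boxes[i] for i in lo], thresh)
    matches += [(rest[t], lo[d]) for t, d in m2]
    return matches
```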

Modern Trackers

  • OC-SORT: observation-centric SORT with virtual trajectory and online momentum
  • BoT-SORT: combines motion, appearance, and camera motion compensation
  • Tracking transformers: TrackFormer, MOTR -- end-to-end with track queries

MOT Evaluation Metrics

| Metric | Description |
|--------|-------------|
| MOTA | Multi-Object Tracking Accuracy: 1 - (FN + FP + IDsw) / GT |
| IDF1 | F1 score of ID-correct detections |
| HOTA | Higher Order Tracking Accuracy: geometric mean of detection and association accuracy |
| IDsw | Number of identity switches |
| MT/ML | Mostly Tracked / Mostly Lost track ratios |
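
The MOTA formula is worth writing out, since it behaves unlike most accuracy scores:

```python
def mota(fn, fp, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDsw) / GT.

    Note: MOTA can go negative when the total error count exceeds
    the number of ground-truth boxes.
    """
    return 1.0 - (fn + fp + id_switches) / num_gt
```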


Action Recognition

Problem Variants

  • Action classification: single label per video clip
  • Temporal action detection: localize action start/end times + classify
  • Spatial-temporal detection: localize actor in space and time (AVA dataset)

Two-Stream Networks

  • Spatial stream: single RGB frame through 2D CNN (appearance)
  • Temporal stream: stacked optical flow frames through 2D CNN (motion)
  • Fuse predictions (late fusion by averaging, or early/mid fusion)
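
Late fusion is the simplest of these options; a sketch that averages the two streams' softmax outputs:

```python
import numpy as np

def late_fusion(spatial_logits, temporal_logits):
    """Average the spatial and temporal streams' class probabilities."""
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    return 0.5 * (softmax(spatial_logits) + softmax(temporal_logits))
```

Because each stream is trained independently, this averaging is all that couples appearance and motion, which is one motivation for the 3D-convolutional models that follow.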

I3D (Inflated 3D ConvNet)

Inflate 2D convolutional filters to 3D to process video volumes:

  • 2D filter k x k becomes 3D filter t x k x k
  • Initialize by repeating 2D weights along temporal dimension (divided by t)
  • Can inflate any pre-trained 2D architecture (Inception, ResNet)
  • Two-stream I3D: RGB + optical flow streams with 3D convolutions
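
The inflation rule is a one-liner: repeat the 2D weights t times along a new temporal axis and divide by t, so that on a boring (static, repeated-frame) video the 3D network reproduces the 2D network's responses. A NumPy sketch with an assumed (k, k, c_in, c_out) weight layout:

```python
import numpy as np

def inflate_2d_filter(w2d, t):
    """Inflate a (k, k, c_in, c_out) 2-D filter to (t, k, k, c_in, c_out).

    Repeating along time and dividing by t preserves the response on a
    video made of t identical frames.
    """
    return np.repeat(w2d[None] / t, t, axis=0)
```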

SlowFast Networks

Dual-pathway architecture:

  • Slow pathway: low frame rate (e.g., 4 fps), large channel capacity -- captures spatial semantics
  • Fast pathway: high frame rate (e.g., 32 fps), lightweight (1/8 channels) -- captures temporal dynamics
  • Lateral connections: fuse fast pathway into slow pathway

Rationale: spatial structure changes slowly but motion requires high temporal resolution.

VideoMAE

Masked autoencoder for video self-supervised pre-training:

  • Mask 90-95% of video patches (higher ratio than images due to temporal redundancy)
  • ViT encoder processes visible patches only
  • Lightweight decoder reconstructs masked patches
  • Pre-trained model fine-tuned for action recognition
  • Achieves strong performance with limited labeled data
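
VideoMAE masks "tubes": one spatial mask is sampled and repeated across time, so a patch cannot be trivially reconstructed by copying it from a neighbouring frame. A sketch (the function name and seeding are illustrative):

```python
import numpy as np

def tube_mask(n_temporal_tokens, n_spatial_tokens, mask_ratio=0.9, seed=0):
    """Sample one spatial mask and tile it across time.

    Returns a boolean (T, N) array where True = masked; the encoder
    sees only the False positions.
    """
    rng = np.random.default_rng(seed)
    n_masked = int(round(mask_ratio * n_spatial_tokens))
    spatial = np.zeros(n_spatial_tokens, dtype=bool)
    spatial[rng.choice(n_spatial_tokens, size=n_masked, replace=False)] = True
    return np.tile(spatial, (n_temporal_tokens, 1))
```

At a 90% ratio the encoder processes only ~10% of the tokens, which is also why pre-training is comparatively cheap.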

Temporal Action Detection

Locate and classify actions in untrimmed videos.

Two-Stage Methods

  1. Proposal generation: produce candidate temporal segments
    • ActionFormer: transformer-based with multi-scale features, predicts action boundaries at each temporal location
    • BMN: Boundary-Matching Network with confidence map
  2. Classification: classify and refine each proposal

One-Stage Methods

  • AFSD: anchor-free with coarse-to-fine refinement
  • Predict action boundaries and class at each temporal location directly

Evaluation

  • mAP at temporal IoU thresholds (0.3, 0.5, 0.7): analogous to object detection AP but in temporal domain
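
The only change from the detection setting is that IoU is computed over 1-D (start, end) segments:

```python
def temporal_iou(seg_a, seg_b):
    """IoU between two temporal segments given as (start, end)."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0
```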

Video Understanding Beyond Recognition

Video Captioning

Generate natural language descriptions of video content:

  • Encoder-decoder: video features (3D CNN or ViT) + language model (LSTM or transformer)
  • Dense captioning: localize and describe multiple events

Video Question Answering (VideoQA)

Answer questions about video content requiring temporal reasoning.

Video-Language Models

  • VideoCLIP: contrastive learning between video and text
  • InternVideo: foundation model for video understanding
  • Video-LLaVA: extend multimodal LLMs to video input

Video Generation

Diffusion-Based Video Generation

Extend image diffusion models to the temporal dimension:

  • Video Diffusion Models: 3D U-Net with temporal attention layers
  • Stable Video Diffusion: latent diffusion in video space, fine-tuned from image model
  • Sora (OpenAI): DiT-based model generating high-fidelity, long-duration videos with coherent physics
  • Kling, Runway Gen-3: commercial video generation systems

Key challenges:

  • Temporal consistency (flickering, morphing)
  • Long-duration coherence
  • Physics plausibility
  • Computational cost (3D attention is expensive)

Autoregressive Video Generation

  • Generate frames sequentially conditioned on previous frames
  • VideoGPT, MAGVIT: discrete video tokens with transformer

Practical Considerations

  • Optical flow computation is expensive; RAFT requires GPU but runs at ~10 fps
  • For MOT, detector quality matters more than tracker sophistication
  • ByteTrack's use of low-confidence detections is a simple but crucial insight
  • Action recognition benefits greatly from pre-training (Kinetics, HowTo100M)
  • Video models are memory-intensive: use temporal sampling (uniform or random)
  • Video generation quality has improved dramatically since 2023 but still struggles with fine-grained temporal coherence

Key Takeaways

| Concept | Core Idea |
|---------|-----------|
| Lucas-Kanade | Local constant flow; sparse, fast, pyramidal extension |
| RAFT | All-pairs correlation + iterative GRU refinement; SOTA flow |
| SORT/DeepSORT | Kalman filter + Hungarian matching; add ReID features for robustness |
| ByteTrack | Use low-confidence detections in second association round |
| SlowFast | Dual pathways for spatial semantics (slow) and motion (fast) |
| Video diffusion | Extend image diffusion to temporal dimension for generation |