
Multimodal Learning

Foundations

Multimodal learning integrates information from multiple modalities — vision, language, audio, video, touch, and others — into unified representations and models. Humans naturally process multimodal input; enabling machines to do so requires solving alignment (matching concepts across modalities), fusion (combining information), and generation (producing one modality from another).

Core Challenges

  • Representation: How to encode heterogeneous modalities into a shared or compatible representation space.
  • Alignment: How to identify correspondences between elements across modalities (words ↔ image regions, audio ↔ video frames).
  • Fusion: How to combine modality-specific information — early fusion (raw features), late fusion (decision-level), or intermediate fusion (cross-attention).
  • Translation/Generation: How to generate one modality conditioned on another (text → image, image → text).
  • Missing modalities: Robustness when some modalities are absent at inference.
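To make the fusion options above concrete, here is a minimal pure-Python sketch contrasting early (feature-level) and late (decision-level) fusion; all feature values and weights are invented for illustration:

```python
# Toy illustration of early vs. late fusion (all names and values are made up).

def early_fusion(image_feats, audio_feats, weights):
    """Concatenate raw modality features, then apply one linear scorer."""
    joint = image_feats + audio_feats            # feature-level concatenation
    return sum(w * x for w, x in zip(weights, joint))

def late_fusion(image_score, audio_score, alpha=0.5):
    """Combine per-modality decisions instead of features."""
    return alpha * image_score + (1 - alpha) * audio_score

image_feats = [0.2, 0.8]
audio_feats = [0.5, 0.1]
print(early_fusion(image_feats, audio_feats, weights=[1.0, 0.5, 0.5, 2.0]))
print(late_fusion(image_score=0.9, audio_score=0.3))
```

Intermediate fusion (cross-attention) sits between these extremes: modality features interact inside the model rather than at the input or the output.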

Vision-Language Models

CLIP (Contrastive Language-Image Pre-training)

Radford et al. (2021) learn aligned vision-language representations via contrastive learning on 400M image-text pairs from the internet.

Architecture: Separate image encoder (ViT or ResNet) and text encoder (Transformer). Each produces a fixed-dimensional embedding. Training minimizes the InfoNCE loss over a batch of image-text pairs: matching pairs should have high cosine similarity, non-matching pairs low similarity.
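The symmetric InfoNCE objective can be sketched in plain Python as follows (real implementations operate on GPU tensors; the embeddings below are toy values):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def infonce_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch: image i matches text i."""
    I = [normalize(v) for v in img_embs]
    T = [normalize(v) for v in txt_embs]
    # cosine-similarity logits, scaled by temperature
    logits = [[sum(a * b for a, b in zip(iv, tv)) / temperature for tv in T]
              for iv in I]

    def xent(rows):
        # cross-entropy where row k's target class is k
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[k]
        return total / len(rows)

    cols = [list(c) for c in zip(*logits)]      # text -> image direction
    return 0.5 * (xent(logits) + xent(cols))

imgs = [[1.0, 0.0], [0.0, 1.0]]
txts = [[0.9, 0.1], [0.1, 0.9]]
print(infonce_loss(imgs, txts))  # low loss: matched pairs are most similar
```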

Zero-shot classification: Encode class names as text prompts ("a photo of a {class}"), compute text embeddings, classify images by nearest text embedding. No task-specific training is required: zero-shot CLIP matches the accuracy of a fully supervised ResNet-50 on ImageNet.
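The zero-shot recipe reduces to a nearest-neighbor lookup in embedding space. A sketch, where `fake_embs` stands in for a real text encoder and is purely illustrative:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def zero_shot_classify(image_emb, class_names, text_encoder):
    """Classify by the nearest text-prompt embedding; no task-specific training."""
    prompts = [f"a photo of a {c}" for c in class_names]
    text_embs = [text_encoder(p) for p in prompts]
    sims = [cosine(image_emb, t) for t in text_embs]
    return class_names[sims.index(max(sims))]

# toy stand-in for a real text encoder (illustration only)
fake_embs = {"a photo of a cat": [1.0, 0.1], "a photo of a dog": [0.1, 1.0]}
label = zero_shot_classify([0.9, 0.2], ["cat", "dog"], lambda p: fake_embs[p])
print(label)  # → cat
```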

Impact: CLIP embeddings serve as a universal vision-language interface. Used as the text encoder for Stable Diffusion, as a feature extractor for zero-shot detection, and as a reward model for text-to-image generation.

Limitations: Bag-of-words behavior (insensitive to word order and compositional structure). "A red car on a blue road" vs. "A blue car on a red road" may receive similar scores. Limited understanding of spatial relationships, counting, and negation.

SigLIP

Zhai et al. (2023): Replace CLIP's softmax-based contrastive loss with a sigmoid loss applied independently to each image-text pair:

L = -Σ_{i,j} log σ(y_{ij} · (t · sim(z_i^I, z_j^T) + b))

where y_{ij} = +1 for matching pairs and -1 for non-matching pairs, t is a learnable inverse temperature, and b is a learnable bias (initialized negative, since non-matching pairs vastly outnumber matching ones). There is no softmax normalization across the batch — each pair is classified independently as matching/non-matching. This enables larger batch sizes (no all-pairs softmax), a simpler implementation, and comparable or superior performance to CLIP.
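The pairwise sigmoid objective can be sketched as follows; `t` and `b` stand in for SigLIP's learnable scale and bias, and the values used here are arbitrary:

```python
import math

def sigmoid_pair_loss(sims, t=10.0, b=-10.0):
    """Sigmoid loss over an N x N similarity matrix: each entry (i, j)
    is an independent binary match/non-match classification, with no
    batch-wide softmax. t and b stand in for SigLIP's learnable
    scale and bias (toy values here)."""
    n = len(sims)
    total = 0.0
    for i in range(n):
        for j in range(n):
            y = 1.0 if i == j else -1.0        # +1 on the diagonal (matching pair)
            logit = t * sims[i][j] + b
            total += math.log1p(math.exp(-y * logit))   # -log sigma(y * logit)
    return total / n

sims = [[0.95, 0.10], [0.05, 0.90]]  # diagonal entries are the matching pairs
print(sigmoid_pair_loss(sims))
```

Because each pair is scored independently, the loss decomposes over (i, j) and no cross-device softmax synchronization is needed.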

ALIGN and BASIC

ALIGN (Jia et al., 2021): Scale CLIP-style training to 1.8B noisy image-text pairs with minimal curation. EfficientNet image encoder + BERT text encoder. Demonstrates that scale compensates for noise.

BASIC (Pham et al., 2023): Scale contrastive training further — a larger image encoder, larger batches, and ~6.6B noisy image-text pairs. A complementary recipe is LiT (Zhai et al., 2022, Locked-image Text tuning): freeze a pretrained image encoder and train only the text encoder to align with the frozen visual features.

Visual Question Answering (VQA)

Given an image and a natural language question, produce a natural language answer.

Approaches

Classification-based: Treat VQA as multi-class classification over a fixed answer vocabulary (top 3000 answers). Encode image (CNN features or region proposals) and question (LSTM/Transformer), fuse representations (attention, bilinear pooling), classify. Models: BUTD (bottom-up top-down attention using Faster R-CNN region features).
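A classification-based VQA pipeline can be sketched with elementwise-product fusion, one of the classic VQA fusion schemes; the features, answer vocabulary, and weights below are toy values:

```python
def vqa_classify(img_feats, q_feats, class_weights):
    """Classification-style VQA sketch: fuse image and question
    features by elementwise product (a classic VQA fusion), then
    score a fixed answer vocabulary with a linear layer."""
    fused = [i * q for i, q in zip(img_feats, q_feats)]
    scores = {ans: sum(w * f for w, f in zip(ws, fused))
              for ans, ws in class_weights.items()}
    return max(scores, key=scores.get)

# toy weights over a 2-answer "vocabulary" (illustration only)
weights = {"yes": [1.0, -1.0], "no": [-1.0, 1.0]}
ans = vqa_classify([0.8, 0.1], [0.9, 0.5], weights)
print(ans)  # → yes
```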

Generative: Generate free-form answers token by token. More flexible but harder to evaluate. Modern multimodal LLMs (GPT-4V, Gemini) take this approach.

Reasoning-focused: VQA requiring multi-step reasoning over image content. Neural Module Networks: Compose specialized modules (attention, comparison, counting) based on question parsing. GQA, CLEVR benchmarks test compositional reasoning.

Image Captioning

Generate a natural language description of an image.

Encoder-decoder: CNN/ViT encodes the image; a language model decoder generates the caption autoregressively, attending to image features. Show and Tell (Vinyals et al., 2015): LSTM decoder with CNN encoder. Show, Attend and Tell: Added spatial attention over image regions.
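The autoregressive decoding loop shared by these captioners can be sketched independently of the model; `toy_step` below stands in for a trained decoder (which would also condition on image features) and is purely illustrative:

```python
def greedy_decode(step_fn, max_len=10, bos="<s>", eos="</s>"):
    """Autoregressive greedy decoding: feed the tokens generated so far,
    take the argmax next token, stop at EOS. step_fn is any model that
    maps a prefix (and, implicitly, image features) to token scores."""
    tokens = [bos]
    for _ in range(max_len):
        scores = step_fn(tokens)           # dict: token -> score
        nxt = max(scores, key=scores.get)
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens[1:]

# toy "captioner" that deterministically emits a fixed caption (illustration only)
CAPTION = ["a", "dog", "on", "grass", "</s>"]
def toy_step(prefix):
    nxt = CAPTION[len(prefix) - 1]
    return {t: (1.0 if t == nxt else 0.0) for t in set(CAPTION)}

print(greedy_decode(toy_step))  # → ['a', 'dog', 'on', 'grass']
```

Beam search replaces the single argmax with the top-k prefixes at each step; the loop structure is otherwise the same.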

Modern approaches: Pretrained vision-language models (CoCa, BLIP-2, PaLI) generate captions as part of broader vision-language capabilities. Pretraining on web-scale image-text data produces rich captioning abilities.

CIDEr, SPICE metrics: Evaluate caption quality beyond BLEU/METEOR by measuring consensus with reference captions (CIDEr) or semantic propositional content (SPICE).

Text-to-Image Generation

DALL-E and DALL-E 2

DALL-E (Ramesh et al., 2021): Autoregressive transformer generates image tokens (from a discrete VAE) conditioned on text tokens. 12B parameter model trained on 250M image-text pairs.

DALL-E 2 (Ramesh et al., 2022): Two-stage approach — a prior maps CLIP text embeddings to CLIP image embeddings; a decoder (diffusion model) generates images conditioned on the CLIP image embedding. Leverages CLIP's aligned embedding space for text-image correspondence.

Stable Diffusion / Latent Diffusion

Rombach et al. (2022): Diffusion in the latent space of a pretrained autoencoder (the paper explores both KL-regularized and VQ-regularized variants). Text conditioning via cross-attention with CLIP text encoder embeddings. Open-source, enabling widespread adoption and fine-tuning (LoRA, DreamBooth, textual inversion for personalization). Stable Diffusion XL scales to 6.6B parameters across its base and refiner models, with dual text encoders (CLIP and OpenCLIP).
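The cross-attention conditioning mechanism can be sketched in a few lines; the learned projection matrices (W_q, W_k, W_v) of a real LDM block are omitted, and the vectors are toy values:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(latents, text_embs):
    """Each image-latent vector (query) attends over text-token
    embeddings (keys/values). Learned projections omitted for brevity;
    real LDM blocks apply W_q, W_k, W_v first."""
    d = len(text_embs[0])
    out = []
    for q in latents:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in text_embs]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, text_embs))
                    for j in range(d)])
    return out

latents = [[1.0, 0.0]]            # one image-latent "pixel"
text = [[2.0, 0.0], [0.0, 2.0]]   # two text-token embeddings
print(cross_attention(latents, text))
```

Each spatial position of the latent thus pulls in a different mixture of text tokens, which is how prompts steer the denoising process locally.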

Imagen

Saharia et al. (2022): Text-to-image diffusion in pixel space with a frozen T5-XXL text encoder. Key insight: scaling the language model improves image quality more than scaling the image model. Cascaded diffusion: 64×64 → 256×256 → 1024×1024 with separate super-resolution diffusion models.

DALL-E 3

Betker et al. (2023): Improved text-image alignment by training on highly detailed, synthetic captions (generated by a captioning model from images). The recaptioning approach addresses CLIP's compositional weakness — detailed captions provide richer conditioning signals. Integrated into ChatGPT for iterative refinement.

Video Understanding and Generation

Video Understanding

TimeSformer (Bertasius et al., 2021): Apply self-attention to space-time patches. Divided space-time attention: temporal attention (across frames at same spatial position) followed by spatial attention (within each frame). More efficient than full space-time attention.
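The efficiency argument is easy to see from the token groupings. A sketch of which tokens attend to each other under divided space-time attention, with the pair counts compared against full joint attention (frame/patch sizes are toy values):

```python
def divided_attention_groups(num_frames, num_patches):
    """Token index groups for TimeSformer-style divided attention.
    Token id = frame * num_patches + patch."""
    temporal = [[f * num_patches + p for f in range(num_frames)]
                for p in range(num_patches)]          # same patch, across frames
    spatial = [[f * num_patches + p for p in range(num_patches)]
               for f in range(num_frames)]            # same frame, across patches
    return temporal, spatial

T, S = 4, 9                      # 4 frames, 3x3 patches per frame
temporal, spatial = divided_attention_groups(T, S)
full_cost = (T * S) ** 2                     # joint space-time attention pairs
divided_cost = S * T**2 + T * S**2           # temporal pass + spatial pass
print(full_cost, divided_cost)  # divided attention touches far fewer pairs
```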

VideoMAE (Tong et al., 2022): Masked autoencoder for video. Masks 90-95% of space-time tubes, exploiting high temporal redundancy. Learns strong video representations with limited labeled data.
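The tube-masking idea — mask the same spatial positions in every frame so the model cannot copy a patch from a neighboring frame — can be sketched as follows (patch counts are toy values):

```python
import random

def tube_mask(num_patches_per_frame, num_frames, mask_ratio=0.9, seed=0):
    """VideoMAE-style tube masking: sample spatial positions once and
    mask them in EVERY frame, defeating temporal redundancy."""
    rng = random.Random(seed)
    n_mask = int(num_patches_per_frame * mask_ratio)
    masked_positions = set(rng.sample(range(num_patches_per_frame), n_mask))
    # a token is (frame, patch); the mask is shared across frames
    return [(f, p) for f in range(num_frames) for p in masked_positions]

masked = tube_mask(num_patches_per_frame=196, num_frames=8, mask_ratio=0.9)
visible = 196 * 8 - len(masked)
print(len(masked), visible)   # only ~10% of tokens reach the encoder
```

Because the encoder only processes the visible ~10% of tokens, pretraining is also much cheaper than with dense inputs.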

Video-language models: VideoCLIP, InternVideo — extend CLIP/image-language models to video by adding temporal modeling (temporal attention, frame sampling, video-text contrastive learning).

Video Generation

Sora (OpenAI, 2024): Diffusion transformer operating on space-time patches of video. Generates high-quality, temporally consistent videos up to one minute. Trained on large-scale video data. Demonstrates emergent understanding of 3D consistency, object permanence, and physical dynamics.

Runway Gen-2/Gen-3: Commercial video generation models. Text-to-video and image-to-video.

SVD (Stable Video Diffusion): Extends latent diffusion to video with temporal attention layers. Fine-tuned from Stable Diffusion image model.

Challenges: Temporal consistency (avoiding flickering, maintaining object identity across frames), long-duration generation, physics plausibility, computational cost (video is 3D data — time × height × width).

Audio-Visual Learning

Audio-Visual Correspondence

Learning from natural co-occurrence of audio and visual signals in video.

Audio-visual source separation: Given a video with multiple sound sources, separate each source's audio using visual cues (e.g., seeing a guitar identifies which frequency components belong to it). Sound of Pixels (Zhao et al., 2018).

Audio-visual speech recognition: Lip reading combined with audio. Visual information improves ASR in noisy environments. AV-HuBERT extends HuBERT to audio-visual pretraining.

Audio Generation

AudioLDM: Latent diffusion for audio generation conditioned on text descriptions. MusicLM (Google): Music generation from text descriptions, using hierarchical token-based generation with MuLan embeddings.

Multimodal Large Language Models

Architecture Patterns

Visual encoder + LLM: A pretrained visual encoder (CLIP ViT, SigLIP) extracts image features. A projection layer (linear, MLP, or cross-attention) maps visual features into the LLM's token embedding space. The LLM processes interleaved text and visual tokens.
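The bridging step is just a learned projection plus sequence concatenation. A minimal sketch (the features, projection matrix, and dimensions are toy values; real models use far larger matrices and interleave image tokens at arbitrary positions):

```python
def matmul(A, B):
    """Plain-Python matrix multiply: (n x k) @ (k x m) -> (n x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def build_multimodal_sequence(patch_feats, W_proj, text_token_embs):
    """LLaVA-style bridging: linearly project visual-encoder patch
    features into the LLM embedding dimension, then prepend them to
    the text token embeddings as ordinary 'tokens'."""
    visual_tokens = matmul(patch_feats, W_proj)      # (n_patches, d_llm)
    return visual_tokens + text_token_embs           # interleaving simplified to prepend

patch_feats = [[1.0, 2.0], [3.0, 4.0]]               # 2 patches, d_vision = 2
W_proj = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]          # projects d_vision=2 -> d_llm=3
text_embs = [[0.1, 0.2, 0.3]]                        # 1 text token
seq = build_multimodal_sequence(patch_feats, W_proj, text_embs)
print(len(seq), len(seq[0]))  # 3 tokens, each of LLM dimension 3
```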

LLaVA (Liu et al., 2023): Simple linear projection from CLIP ViT features to Vicuna LLM's input space. Two-stage training: (1) pretrain projection on image-caption pairs, (2) instruction-tune on multimodal instruction data. Surprisingly effective despite architectural simplicity.

BLIP-2 (Li et al., 2023): Q-Former (lightweight Transformer with learnable queries) bridges a frozen image encoder and frozen LLM. The Q-Former is trained in two stages: vision-language representation learning (contrastive + matching + generation), then vision-to-language generative learning.

Proprietary Multimodal LLMs

GPT-4V/GPT-4o (OpenAI): Native multimodal understanding. Accepts interleaved text and images. Strong performance on visual reasoning, OCR, chart understanding, spatial reasoning. GPT-4o extends to audio input/output, enabling real-time voice conversation with visual understanding.

Gemini (Google DeepMind): Natively multimodal from pretraining (not retrofitted). Trained on interleaved text, images, audio, and video. Gemini Ultra achieves state-of-the-art on multimodal benchmarks. Long-context variants (Gemini 1.5 Pro) process up to 1M tokens including video.

Claude (Anthropic): Vision capabilities for image understanding, document analysis, and visual reasoning. Processes images alongside text in the conversation context.

Open Multimodal LLMs

LLaVA-NeXT/LLaVA-OneVision: Improved resolution handling (dynamic high-resolution), video understanding, stronger base LLMs.

InternVL: Scales the vision encoder alongside the LLM. InternVL2 achieves competitive results with proprietary models.

Qwen-VL: Alibaba's multimodal LLM series with strong multilingual and document understanding capabilities.

Embodied AI

Multimodal models controlling agents in physical or simulated environments.

Vision-Language-Action Models

RT-2 (Brohan et al., 2023): Fine-tune a vision-language model (PaLI-X) to output robot actions as text tokens. The model takes camera images and language instructions and outputs tokenized action commands (discretized end-effector displacements, rotation, and gripper state). Emergent capabilities: generalization to unseen objects and instructions.

PaLM-E: Embodied multimodal LLM that processes sensor data (images, state estimates) alongside text. Integrates robot planning and visual reasoning in a single model.

Simulation and Real-World Transfer

Habitat, AI2-THOR, Isaac Sim: Simulated environments for training embodied agents. Navigation, manipulation, and instruction following tasks. Sim-to-real transfer remains challenging due to the visual and physical gap.

Benchmarks and Evaluation

| Benchmark | Modalities | Task |
|-----------|-----------|------|
| MMLU | Text | Knowledge/reasoning (text-only baseline) |
| MMMU | Image + Text | Multi-discipline multimodal understanding |
| MMBench | Image + Text | Multi-ability visual reasoning |
| TextVQA | Image + Text | Reading text in images |
| DocVQA | Document + Text | Document understanding |
| Video-MME | Video + Text | Video understanding |
| AudioCaps | Audio + Text | Audio captioning |

Evaluation Challenges

  • Contamination: Web-scraped pretraining data may contain benchmark data.
  • Hallucination: Multimodal LLMs generate plausible but factually incorrect descriptions (describing objects not present in the image). POPE, CHAIR metrics measure hallucination rates.
  • Compositional understanding: Models struggle with spatial relationships ("left of"), counting, negation, and attribute binding ("the red cube on the blue table"). Winoground, ARO benchmarks test this.
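A simplified sketch of a CHAIR-style object hallucination rate — the fraction of objects mentioned in a generated caption that are absent from the image (the real metric also aggregates per sentence and uses a curated object vocabulary; the object lists here are toy examples):

```python
def chair_score(mentioned_objects, ground_truth_objects):
    """Simplified CHAIR-style hallucination rate: fraction of objects
    mentioned in a generated caption that are not in the image."""
    mentioned = set(mentioned_objects)
    truth = set(ground_truth_objects)
    hallucinated = mentioned - truth
    return len(hallucinated) / len(mentioned) if mentioned else 0.0

rate = chair_score(["dog", "frisbee", "car"], ["dog", "frisbee", "grass"])
print(rate)  # "car" is hallucinated: 1 of 3 mentions
```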