Multimodal Learning
Foundations
Multimodal learning integrates information from multiple modalities — vision, language, audio, video, touch, and others — into unified representations and models. Humans naturally process multimodal input; enabling machines to do so requires solving alignment (matching concepts across modalities), fusion (combining information), and generation (producing one modality from another).
Core Challenges
- Representation: How to encode heterogeneous modalities into a shared or compatible representation space.
- Alignment: How to identify correspondences between elements across modalities (words ↔ image regions, audio ↔ video frames).
- Fusion: How to combine modality-specific information — early fusion (raw features), late fusion (decision-level), or intermediate fusion (cross-attention).
- Translation/Generation: How to generate one modality conditioned on another (text → image, image → text).
- Missing modalities: Robustness when some modalities are absent at inference.
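The fusion strategies above differ mainly in *where* the modalities meet. A minimal NumPy sketch of the early/late distinction (toy features and random weights, purely illustrative — not any particular model's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
img_feat = rng.normal(size=8)        # toy image features
txt_feat = rng.normal(size=8)        # toy text features
w_joint = rng.normal(size=16)        # toy classifier over concatenated features
w_img = rng.normal(size=8)           # toy per-modality classifiers
w_txt = rng.normal(size=8)

def early_fusion(img, txt):
    # Early fusion: concatenate raw features; one model sees both modalities.
    return float(np.concatenate([img, txt]) @ w_joint)

def late_fusion(img, txt):
    # Late fusion: score each modality independently, then combine decisions.
    return float(0.5 * (img @ w_img + txt @ w_txt))
```

Intermediate fusion (cross-attention) sits between these two: modality-specific encoders run first, then interact at feature level rather than decision level.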
Vision-Language Models
CLIP (Contrastive Language-Image Pre-training)
Radford et al. (2021) learn aligned vision-language representations via contrastive learning on 400M image-text pairs from the internet.
Architecture: Separate image encoder (ViT or ResNet) and text encoder (Transformer). Each produces a fixed-dimensional embedding. Training minimizes the InfoNCE loss over a batch of image-text pairs: matching pairs should have high cosine similarity, non-matching pairs low similarity.
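The symmetric InfoNCE objective over a batch can be sketched in NumPy (an illustrative reimplementation with a fixed temperature; `clip_infonce` is a hypothetical name, not CLIP's actual code):

```python
import numpy as np

def clip_infonce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch where row i of each array is a
    matching (image, text) pair."""
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix

    def xent(l):
        # Cross-entropy with the diagonal (matching pairs) as targets.
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned, mutually orthogonal embeddings the loss is near zero; mismatched pairings drive it up.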
Zero-shot classification: Encode class names as text prompts ("a photo of a {class}"), compute text embeddings, classify images by nearest text embedding. No task-specific training. CLIP matches supervised ResNet-50 on ImageNet zero-shot.
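Zero-shot classification reduces to a nearest-neighbor lookup in the shared embedding space. A sketch (embeddings are assumed precomputed by the two encoders; the function name is illustrative):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class whose prompt embedding ("a photo of a
    {class}") has the highest cosine similarity with the image embedding.

    image_emb: (D,); class_text_embs: (C, D), one row per class prompt.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1,
                                           keepdims=True)
    return int(np.argmax(txt @ img))
```

Because the class prompts are just text, the label set can be swapped at inference time with no retraining.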
Impact: CLIP embeddings serve as a universal vision-language interface. Used as the text encoder for Stable Diffusion, as a feature extractor for zero-shot detection, and as a reward model for text-to-image generation.
Limitations: Bag-of-words behavior (insensitive to word order and compositional structure). "A red car on a blue road" vs. "A blue car on a red road" may receive similar scores. Limited understanding of spatial relationships, counting, and negation.
SigLIP
Zhai et al. (2023): Replace CLIP's softmax-based contrastive loss with a sigmoid loss applied independently to each image-text pair:
L = -(1/|B|) Σ_{i,j} log σ(y_{ij} · (t · sim(z_i^I, z_j^T) + b))
where y_{ij} = +1 for matching pairs and -1 for non-matching, t is a learnable inverse temperature, and b is a learnable bias (initialized negative to offset the imbalance toward negative pairs). No softmax normalization across the batch — each pair is classified independently as matching/non-matching. This enables larger batch sizes (no all-pairs softmax to compute), simpler implementation, and comparable or superior performance to CLIP.
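The sigmoid loss is straightforward to sketch in NumPy. Here t and b are fixed at the paper's reported initialization values (t = 10, b = -10) rather than learned, and `siglip_loss` is an illustrative name:

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: every (i, j) pair is an independent binary
    classification -- no softmax across the batch. Row i of each input is
    a matching pair."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img @ txt.T                          # (N, N) cosine similarities
    y = 2 * np.eye(len(img)) - 1               # +1 on diagonal, -1 off it
    x = y * (t * sim + b)
    # -log sigma(x) computed stably as log(1 + exp(-x)).
    return float(np.mean(np.logaddexp(0.0, -x)))
```

Since each pair contributes independently, negatives can be sharded across devices without synchronizing a global softmax — the practical reason the loss scales to very large batches.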
ALIGN and BASIC
ALIGN (Jia et al., 2021): Scale CLIP-style training to 1.8B noisy image-text pairs with minimal curation. EfficientNet image encoder + BERT text encoder. Demonstrates that scale compensates for noise.
BASIC (Pham et al., 2023): Scale model, data, and batch size further (a CoAtNet image encoder trained on ~6.6B noisy pairs). Related is LiT (Locked-image Text tuning; Zhai et al., 2022): freeze a pretrained image encoder and train only the text encoder to align with the frozen visual features.
Visual Question Answering (VQA)
Given an image and a natural language question, produce a natural language answer.
Approaches
Classification-based: Treat VQA as multi-class classification over a fixed answer vocabulary (typically the few thousand most frequent training answers). Encode the image (CNN features or region proposals) and question (LSTM/Transformer), fuse the representations (attention, bilinear pooling), and classify. Representative model: BUTD (bottom-up top-down attention over Faster R-CNN region features).
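A toy NumPy sketch of the classification-based pipeline — project both modalities into a joint space, fuse by elementwise product (a common simple baseline), and score a fixed answer vocabulary. All shapes and weights are illustrative, not from any published model:

```python
import numpy as np

rng = np.random.default_rng(0)

def vqa_classify(img_feat, q_feat, w_img, w_q, w_out):
    """Fuse image and question features, then classify over answers."""
    joint = np.tanh(img_feat @ w_img) * np.tanh(q_feat @ w_q)  # fusion
    logits = joint @ w_out                      # one score per answer
    return int(np.argmax(logits))

# Toy shapes: 512-d image features, 256-d question features, 3000 answers.
w_img = rng.normal(size=(512, 128))
w_q = rng.normal(size=(256, 128))
w_out = rng.normal(size=(128, 3000))
answer = vqa_classify(rng.normal(size=512), rng.normal(size=256),
                      w_img, w_q, w_out)
```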
Generative: Generate free-form answers token by token. More flexible but harder to evaluate. Modern multimodal LLMs (GPT-4V, Gemini) take this approach.
Reasoning-focused: VQA requiring multi-step reasoning over image content. Neural Module Networks: Compose specialized modules (attention, comparison, counting) based on question parsing. GQA, CLEVR benchmarks test compositional reasoning.
Image Captioning
Generate a natural language description of an image.
Encoder-decoder: CNN/ViT encodes the image; a language model decoder generates the caption autoregressively, attending to image features. Show and Tell (Vinyals et al., 2015): LSTM decoder with CNN encoder. Show, Attend and Tell: Added spatial attention over image regions.
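The spatial attention step of Show, Attend and Tell can be sketched as a single soft-attention lookup: score each image region against the current decoder state, softmax into weights, and return a weighted context vector (simplified — the real model learns a scoring MLP rather than using a raw dot product):

```python
import numpy as np

def attend(regions, query):
    """Soft attention over image regions given a decoder-state query.

    regions: (R, D) region features; query: (D,) decoder state.
    Returns the (D,) context vector fed to the next decoding step.
    """
    scores = regions @ query                    # one score per region
    weights = np.exp(scores - scores.max())     # stable softmax
    weights /= weights.sum()
    return weights @ regions

regions = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # 3 regions, 2-d
context = attend(regions, np.array([5.0, 0.0]))  # query favors region 0
```

At each decoding step the query changes, so the decoder shifts its gaze across regions as it emits caption tokens.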
Modern approaches: Pretrained vision-language models (CoCa, BLIP-2, PaLI) generate captions as part of broader vision-language capabilities. Pretraining on web-scale image-text data produces rich captioning abilities.
CIDEr, SPICE metrics: Evaluate caption quality beyond BLEU/METEOR by measuring consensus with reference captions (CIDEr) or semantic propositional content (SPICE).
Text-to-Image Generation
DALL-E and DALL-E 2
DALL-E (Ramesh et al., 2021): Autoregressive transformer generates image tokens (from a discrete VAE) conditioned on text tokens. 12B parameter model trained on 250M image-text pairs.
DALL-E 2 (Ramesh et al., 2022): Two-stage approach — a prior maps CLIP text embeddings to CLIP image embeddings; a decoder (diffusion model) generates images conditioned on the CLIP image embedding. Leverages CLIP's aligned embedding space for text-image correspondence.
Stable Diffusion / Latent Diffusion
Rombach et al. (2022): Diffusion in the compressed latent space of a pretrained autoencoder (KL- or VQ-regularized; released Stable Diffusion models use the KL variant). Text conditioning via cross-attention with CLIP text encoder embeddings. Open-source, enabling widespread adoption and fine-tuning (LoRA, DreamBooth, textual inversion for personalization). Stable Diffusion XL scales to ~3.5B parameters (base model; ~6.6B including the refiner pipeline) with dual text encoders (CLIP and OpenCLIP).
Imagen
Saharia et al. (2022): Text-to-image diffusion in pixel space with a frozen T5-XXL text encoder. Key insight: scaling the language model improves image quality more than scaling the image model. Cascaded diffusion: 64×64 → 256×256 → 1024×1024 with separate super-resolution diffusion models.
DALL-E 3
Betker et al. (2023): Improved text-image alignment by training on highly detailed, synthetic captions (generated by a captioning model from images). The recaptioning approach addresses CLIP's compositional weakness — detailed captions provide richer conditioning signals. Integrated into ChatGPT for iterative refinement.
Video Understanding and Generation
Video Understanding
TimeSformer (Bertasius et al., 2021): Apply self-attention to space-time patches. Divided space-time attention: temporal attention (across frames at same spatial position) followed by spatial attention (within each frame). More efficient than full space-time attention.
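The divided-attention pattern is mostly a matter of reshaping: temporal attention treats each spatial position as its own sequence across frames, then spatial attention runs within each frame. A single-head sketch with no learned projections, just the attention pattern (illustrative, not TimeSformer's actual code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def divided_space_time_attention(x):
    """x: (T, S, D) patch embeddings (T frames, S patches/frame, D dims)."""
    T, S, D = x.shape
    # Temporal attention: each spatial position attends across frames.
    xt = x.transpose(1, 0, 2)                                # (S, T, D)
    att = softmax(xt @ xt.transpose(0, 2, 1) / np.sqrt(D))   # (S, T, T)
    x = (att @ xt).transpose(1, 0, 2)                        # (T, S, D)
    # Spatial attention: each frame attends within its own patches.
    att = softmax(x @ x.transpose(0, 2, 1) / np.sqrt(D))     # (T, S, S)
    return att @ x                                           # (T, S, D)
```

The efficiency gain comes from attention sizes T×T and S×S instead of one (T·S)×(T·S) matrix for full space-time attention.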
VideoMAE (Tong et al., 2022): Masked autoencoder for video. Masks 90-95% of space-time tubes, exploiting high temporal redundancy. Learns strong video representations with limited labeled data.
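Tube masking itself is simple: sample one spatial mask and reuse it for every frame, so masked patches form space-time tubes that can't be trivially reconstructed from adjacent frames. A sketch (function name and defaults are illustrative):

```python
import numpy as np

def tube_mask(num_tubes, ratio=0.9, rng=None):
    """VideoMAE-style tube masking: the same spatial patches are masked in
    every frame (forming space-time 'tubes'), hiding ~90-95% of them.
    Returns a boolean array where True = masked."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_mask = int(round(num_tubes * ratio))
    mask = np.zeros(num_tubes, dtype=bool)
    mask[rng.choice(num_tubes, size=n_mask, replace=False)] = True
    return mask

mask = tube_mask(196, ratio=0.9)   # 14x14 patch grid -> 196 tubes
```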
Video-language models: VideoCLIP, InternVideo — extend CLIP/image-language models to video by adding temporal modeling (temporal attention, frame sampling, video-text contrastive learning).
Video Generation
Sora (OpenAI, 2024): Diffusion transformer operating on space-time patches of video. Generates high-quality, temporally consistent videos up to one minute. Trained on large-scale video data. Demonstrates emergent understanding of 3D consistency, object permanence, and physical dynamics.
Runway Gen-2/Gen-3: Commercial video generation models. Text-to-video and image-to-video.
SVD (Stable Video Diffusion): Extends latent diffusion to video with temporal attention layers. Fine-tuned from Stable Diffusion image model.
Challenges: Temporal consistency (avoiding flickering, maintaining object identity across frames), long-duration generation, physics plausibility, computational cost (video is 3D data — time × height × width).
Audio-Visual Learning
Audio-Visual Correspondence
Learning from natural co-occurrence of audio and visual signals in video.
Audio-visual source separation: Given a video with multiple sound sources, separate each source's audio using visual cues (e.g., seeing a guitar identifies which frequency components belong to it). Sound of Pixels (Zhao et al., 2018).
Audio-visual speech recognition: Lip reading combined with audio. Visual information improves ASR in noisy environments. AV-HuBERT extends HuBERT to audio-visual pretraining.
Audio Generation
AudioLDM: Latent diffusion for audio generation conditioned on text descriptions. MusicLM (Google): Music generation from text descriptions, using hierarchical token-based generation with MuLan embeddings.
Multimodal Large Language Models
Architecture Patterns
Visual encoder + LLM: A pretrained visual encoder (CLIP ViT, SigLIP) extracts image features. A projection layer (linear, MLP, or cross-attention) maps visual features into the LLM's token embedding space. The LLM processes interleaved text and visual tokens.
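In the simplest (linear-projection) variant, the bridge is just one matrix: project patch features into the LLM's embedding dimension and concatenate them with the text token embeddings. A sketch with assumed toy dimensions (576 patches from a 24×24 grid, 1024-d vision features, 4096-d LLM embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)

d_vision, d_model = 1024, 4096       # e.g. ViT feature dim -> LLM dim
w_proj = rng.normal(size=(d_vision, d_model)) * 0.01  # trainable projection

def to_llm_tokens(patch_feats, text_embs):
    """Project visual patch features into the LLM embedding space and
    prepend them to the text token embeddings (LLaVA-style)."""
    visual_tokens = patch_feats @ w_proj        # (num_patches, d_model)
    return np.concatenate([visual_tokens, text_embs], axis=0)

seq = to_llm_tokens(rng.normal(size=(576, d_vision)),   # 24x24 patches
                    rng.normal(size=(32, d_model)))     # 32 text tokens
```

The LLM then processes this concatenated sequence with ordinary causal attention; nothing in the transformer needs to know which rows came from pixels.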
LLaVA (Liu et al., 2023): Simple linear projection from CLIP ViT features to Vicuna LLM's input space. Two-stage training: (1) pretrain projection on image-caption pairs, (2) instruction-tune on multimodal instruction data. Surprisingly effective despite architectural simplicity.
BLIP-2 (Li et al., 2023): Q-Former (lightweight Transformer with learnable queries) bridges a frozen image encoder and frozen LLM. The Q-Former is trained in two stages: vision-language representation learning (contrastive + matching + generation), then vision-to-language generative learning.
Proprietary Multimodal LLMs
GPT-4V/GPT-4o (OpenAI): Native multimodal understanding. Accepts interleaved text and images. Strong performance on visual reasoning, OCR, chart understanding, spatial reasoning. GPT-4o extends to audio input/output, enabling real-time voice conversation with visual understanding.
Gemini (Google DeepMind): Natively multimodal from pretraining (not retrofitted). Trained on interleaved text, images, audio, and video. Gemini Ultra achieves state-of-the-art on multimodal benchmarks. Long-context variants (Gemini 1.5 Pro) process up to 1M tokens including video.
Claude (Anthropic): Vision capabilities for image understanding, document analysis, and visual reasoning. Processes images alongside text in the conversation context.
Open Multimodal LLMs
LLaVA-NeXT/LLaVA-OneVision: Improved resolution handling (dynamic high-resolution), video understanding, stronger base LLMs.
InternVL: Scales the vision encoder alongside the LLM. InternVL2 achieves competitive results with proprietary models.
Qwen-VL: Alibaba's multimodal LLM series with strong multilingual and document understanding capabilities.
Embodied AI
Multimodal models controlling agents in physical or simulated environments.
Vision-Language-Action Models
RT-2 (Brohan et al., 2023): Fine-tune a vision-language model (PaLI-X) to output robot actions as text tokens. The model takes camera images and language instructions and outputs tokenized action commands (end-effector translation and rotation, gripper state). Emergent capabilities: generalization to unseen objects and instructions.
PaLM-E: Embodied multimodal LLM that processes sensor data (images, state estimates) alongside text. Integrates robot planning and visual reasoning in a single model.
Simulation and Real-World Transfer
Habitat, AI2-THOR, Isaac Sim: Simulated environments for training embodied agents. Navigation, manipulation, and instruction following tasks. Sim-to-real transfer remains challenging due to the visual and physical gap.
Benchmarks and Evaluation
| Benchmark | Modalities | Task |
|-----------|------------|------|
| MMLU | Text | Knowledge/reasoning (text-only baseline) |
| MMMU | Image + Text | Multi-discipline multimodal understanding |
| MMBench | Image + Text | Multi-ability visual reasoning |
| TextVQA | Image + Text | Reading text in images |
| DocVQA | Document + Text | Document understanding |
| Video-MME | Video + Text | Video understanding |
| AudioCaps | Audio + Text | Audio captioning |
Evaluation Challenges
- Contamination: Web-scraped pretraining data may contain benchmark data.
- Hallucination: Multimodal LLMs generate plausible but factually incorrect descriptions (describing objects not present in the image). POPE, CHAIR metrics measure hallucination rates.
- Compositional understanding: Models struggle with spatial relationships ("left of"), counting, negation, and attribute binding ("the red cube on the blue table"). Winoground, ARO benchmarks test this.
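A CHAIR-style instance rate reduces to set membership once object mentions have been extracted from the caption and ground-truth annotations are available (the extraction step is assumed here; `chair_i` is an illustrative simplification of the metric):

```python
def chair_i(caption_objects, image_objects):
    """Fraction of object mentions in a generated caption that do not
    appear in the image's ground-truth object set (hallucinations)."""
    if not caption_objects:
        return 0.0
    hallucinated = [o for o in caption_objects if o not in image_objects]
    return len(hallucinated) / len(caption_objects)

# "car" is hallucinated: 1 of 3 mentions is not in the image.
rate = chair_i(["dog", "frisbee", "car"], {"dog", "frisbee"})
```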