Image Generation
Generative Adversarial Networks (GANs)
Framework
Two networks trained adversarially (Goodfellow et al., 2014):
- Generator G(z): maps random noise z ~ N(0, I) to an image
- Discriminator D(x): classifies images as real or generated
Minimax objective:
min_G max_D E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]
At Nash equilibrium, G produces samples indistinguishable from real data and D outputs 0.5 everywhere.
Training instability: mode collapse (generator produces limited diversity), vanishing gradients, oscillation. Addressed by architectural and loss modifications.
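As a sanity check of the equilibrium claim above, a minimal numpy sketch that evaluates the minimax objective when D outputs 0.5 on every input (function names are illustrative):

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Monte-Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# At the claimed Nash equilibrium, D(x) = 0.5 for every input, real or fake.
d_half = np.full(1000, 0.5)
v = gan_value(d_half, d_half)
print(v)  # -2 * log 2 ~= -1.3863, the well-known equilibrium value
```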
DCGAN (Deep Convolutional GAN)
Architectural guidelines for stable GAN training:
- Replace pooling with strided convolutions (discriminator) and transposed convolutions (generator)
- Batch normalization in both networks (except D input and G output)
- Remove fully connected layers (fully convolutional)
- ReLU in generator (except output: Tanh), LeakyReLU in discriminator
StyleGAN Family
Progressive high-resolution image synthesis (Karras et al., NVIDIA):
StyleGAN (v1):
- Mapping network: 8-layer MLP maps z to intermediate latent w (disentangled)
- Synthesis network: generates the image from a learned constant input, with AdaIN applied at each layer
- Adaptive Instance Normalization (AdaIN): y = gamma(w) * (x - mu) / sigma + beta(w) -- the style vector modulates feature statistics
- Stochastic variation: per-pixel noise inputs for fine details (hair, freckles)
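The AdaIN operation is simple enough to sketch directly in numpy for a single (C, H, W) feature map; gamma and beta here stand in for the style-derived scale and shift:

```python
import numpy as np

def adain(x, gamma, beta, eps=1e-5):
    """AdaIN over a (C, H, W) feature map: normalize each channel to zero
    mean / unit std, then scale and shift by style-derived gamma/beta (shape (C,))."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    x_norm = (x - mu) / (sigma + eps)
    return gamma[:, None, None] * x_norm + beta[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))
gamma, beta = np.array([2.0, 1.0, 0.5]), np.array([1.0, 0.0, -1.0])
y = adain(x, gamma, beta)
# Each output channel now has (approximately) mean beta_c and std gamma_c:
# the style fully determines the feature statistics, as the formula states.
```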
StyleGAN2: removes characteristic artifacts by replacing AdaIN with weight demodulation, adds path length regularization, and replaces progressive growing with a skip/residual architecture.
StyleGAN3: alias-free generation using continuous signal interpretation; equivariant to sub-pixel translations and rotations.
Latent space properties:
- W space is more disentangled than Z space
- W+ space: different w per layer enables fine-grained editing
- Linear directions in W correspond to semantic attributes (age, smile, glasses)
GAN Evaluation Metrics
- FID (Frechet Inception Distance): distance between real and generated feature distributions. Lower is better; computed using Inception-v3 features.
  FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2*(Sigma_r * Sigma_g)^{1/2})
- IS (Inception Score): measures quality and diversity; less reliable than FID
- KID (Kernel Inception Distance): unbiased alternative to FID for small sample sizes
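The FID formula can be implemented with plain numpy once the Inception features are reduced to means and covariances; this sketch uses the identity Tr((Sr Sg)^{1/2}) = Tr((Sr^{1/2} Sg Sr^{1/2})^{1/2}) so only symmetric matrix square roots are needed (function names are illustrative):

```python
import numpy as np

def sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0, None))) @ v.T

def fid(mu_r, sigma_r, mu_g, sigma_g):
    """Frechet distance between Gaussians fitted to Inception features."""
    s = sqrtm_psd(sigma_r)
    covmean = sqrtm_psd(s @ sigma_g @ s)
    diff = mu_r - mu_g
    return diff @ diff + np.trace(sigma_r + sigma_g - 2 * covmean)

# Identical distributions give FID 0; a pure mean shift contributes ||diff||^2.
mu, sigma = np.zeros(4), np.eye(4)
print(fid(mu, sigma, mu, sigma))            # 0.0
print(fid(mu, sigma, np.ones(4), sigma))    # 4.0
```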
Diffusion Models
DDPM (Denoising Diffusion Probabilistic Models)
Forward process gradually adds Gaussian noise over T steps; reverse process learns to denoise.
Forward process (fixed):
q(x_t | x_{t-1}) = N(x_t; sqrt(1-beta_t) * x_{t-1}, beta_t * I)
Closed-form sampling at any step t:
x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon, epsilon ~ N(0, I)
where alpha_bar_t = prod_{s=1}^{t} (1 - beta_s).
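The closed-form sampling equation is a one-liner once the schedule is precomputed; this sketch uses the linear beta schedule from the DDPM paper:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
beta = np.linspace(1e-4, 0.02, T)       # linear schedule (DDPM paper)
alpha_bar = np.cumprod(1.0 - beta)      # alpha_bar_t = prod_s (1 - beta_s)

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) in closed form, no step-by-step chain needed."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.normal(size=(16, 16))
eps = rng.normal(size=x0.shape)
x_mid = q_sample(x0, T // 2, eps)
# By the final step almost all signal is gone (alpha_bar[-1] ~ 4e-5),
# so x_T is essentially pure Gaussian noise.
print(alpha_bar[-1])
```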
Reverse process (learned):
p_theta(x_{t-1} | x_t) = N(x_{t-1}; mu_theta(x_t, t), sigma_t^2 * I)
Training objective (simplified):
L = E_{t, x_0, epsilon} [||epsilon - epsilon_theta(x_t, t)||^2]
The network epsilon_theta predicts the noise added at step t. This is equivalent to score matching: epsilon_theta(x_t, t) ~ -sqrt(1 - alpha_bar_t) * nabla_{x_t} log p(x_t).
Architecture
U-Net with modifications:
- ResNet blocks with time embedding (sinusoidal positional encoding)
- Self-attention layers at lower resolutions
- Group normalization
- Skip connections between encoder and decoder
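The time embedding mentioned above is the Transformer-style sinusoidal encoding applied to the scalar timestep; a minimal numpy sketch (dimension and max period are the commonly used defaults, not prescribed by the notes):

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a diffusion timestep t: a bank of sinusoids at
    geometrically spaced frequencies, concatenated into a dim-sized vector."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

emb = timestep_embedding(t=500, dim=128)
print(emb.shape)  # (128,)
```

This vector is typically pushed through a small MLP and added inside each ResNet block.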
Sampling
DDPM sampling (T=1000 steps): slow but high quality.
DDIM (Denoising Diffusion Implicit Models): deterministic sampling with fewer steps by reformulating as non-Markovian process. Same trained model, 10-50x fewer steps.
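One deterministic DDIM update can be written directly from the closed-form forward equation: predict x_0 from the current sample and the epsilon estimate, then re-noise to the earlier step (a sketch; eta = 0 throughout):

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """Deterministic DDIM update from step t to an earlier step,
    reusing the same epsilon-prediction network as DDPM."""
    x0_pred = (x_t - np.sqrt(1 - abar_t) * eps_pred) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1 - abar_prev) * eps_pred

# Sanity check: with the *true* noise, the step lands exactly on the point
# of the same trajectory at the earlier noise level.
rng = np.random.default_rng(0)
x0, eps = rng.normal(size=8), rng.normal(size=8)
abar_t, abar_prev = 0.5, 0.9
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1 - abar_t) * eps
x_prev = ddim_step(x_t, eps, abar_t, abar_prev)
```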
Classifier-free guidance (CFG): interpolate between conditional and unconditional predictions:
epsilon_guided = epsilon_uncond + s * (epsilon_cond - epsilon_uncond)
Guidance scale s > 1 improves quality and prompt adherence at the cost of diversity.
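The guidance formula is a pure extrapolation in prediction space, which a few lines of numpy make concrete:

```python
import numpy as np

def cfg(eps_uncond, eps_cond, s):
    """Classifier-free guidance: extrapolate past the conditional prediction."""
    return eps_uncond + s * (eps_cond - eps_uncond)

eps_u, eps_c = np.array([0.0, 0.0]), np.array([1.0, -1.0])
print(cfg(eps_u, eps_c, 0.0))  # s=0 recovers the unconditional prediction
print(cfg(eps_u, eps_c, 1.0))  # s=1 recovers the conditional prediction
print(cfg(eps_u, eps_c, 7.5))  # s>1 pushes further along the conditional direction
```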
Stable Diffusion (Latent Diffusion Models)
Diffusion in a compressed latent space for efficiency (Rombach et al., 2022):
Architecture:
- VAE encoder E: compresses image x to latent z = E(x) (8x spatial downsampling, e.g. 512x512x3 -> 64x64x4)
- Diffusion U-Net: operates in latent space with cross-attention for conditioning
- VAE decoder D: reconstructs image x' = D(z)
Conditioning: text encoded by CLIP text encoder, injected via cross-attention:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d)) * V
Q = W_Q * phi(z_t), K = W_K * tau(y), V = W_V * tau(y)
where phi(z_t) is a U-Net intermediate feature and tau(y) is the text embedding.
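A single-head numpy sketch of this cross-attention: queries come from the U-Net features, keys and values from the text embeddings. The dimensions below (77 text tokens, 320 feature channels) are illustrative of Stable Diffusion's first level, not required by the formula:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(phi, tau, W_q, W_k, W_v):
    """phi: (N, d_model) U-Net features; tau: (M, d_text) text embeddings.
    Each spatial position attends over all text tokens."""
    Q, K, V = phi @ W_q, tau @ W_k, tau @ W_v
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))   # (N, M): pixel-to-token weights
    return attn @ V

rng = np.random.default_rng(0)
N, M, d_model, d_text, d = 64, 77, 320, 768, 40
phi = rng.normal(size=(N, d_model))
tau = rng.normal(size=(M, d_text))
W_q = rng.normal(size=(d_model, d))
W_k = rng.normal(size=(d_text, d))
W_v = rng.normal(size=(d_text, d))
out = cross_attention(phi, tau, W_q, W_k, W_v)
print(out.shape)  # (64, 40)
```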
SDXL: larger U-Net, dual text encoders (CLIP-ViT-L + OpenCLIP-ViT-G), refinement model.
Stable Diffusion 3: replaces U-Net with MM-DiT (multimodal diffusion transformer), uses flow matching objective, T5 text encoder for better text rendering.
Consistency Models
Generate high-quality images in one or few steps (Song et al., 2023):
Key idea: learn a function f_theta(x_t, t) that maps any noisy sample x_t on the same trajectory to the clean sample x_0:
f_theta(x_t, t) = f_theta(x_t', t') for all t, t' on the same trajectory
Training: consistency distillation from a pre-trained diffusion model, or consistency training from scratch.
Advantage: 1-2 step generation with quality approaching multi-step diffusion.
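In practice f_theta is parameterized so the boundary condition f(x, eps) = x holds by construction; the coefficient schedules below follow the skip/out form used in the consistency models paper, with sigma_data and eps values chosen for illustration:

```python
import numpy as np

EPS = 0.002        # smallest time on the trajectory; f must be identity there
SIGMA_DATA = 0.5   # assumed data standard deviation

def c_skip(t):
    return SIGMA_DATA**2 / ((t - EPS)**2 + SIGMA_DATA**2)

def c_out(t):
    return SIGMA_DATA * (t - EPS) / np.sqrt(t**2 + SIGMA_DATA**2)

def consistency_fn(F_theta, x, t):
    """f_theta(x, t) = c_skip(t) * x + c_out(t) * F_theta(x, t).
    At t = EPS: c_skip = 1 and c_out = 0, so f(x, EPS) = x regardless of F_theta."""
    return c_skip(t) * x + c_out(t) * F_theta(x, t)

x = np.array([1.0, -2.0, 3.0])
dummy_net = lambda x, t: np.full_like(x, 99.0)  # stand-in for the trained network
print(consistency_fn(dummy_net, x, EPS))        # [ 1. -2.  3.] -- identity at t = EPS
```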
Latent Consistency Models (LCM)
Apply consistency distillation in latent space. Combined with LoRA, enables fast fine-tuned generation.
Flow Matching
Alternative to DDPM that learns a vector field transporting noise to data:
v_theta(x_t, t) ~ dx_t/dt
Optimal transport paths are straight lines, enabling fewer sampling steps. Used in Stable Diffusion 3 and Flux.
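Under one common convention (t = 0 at noise, t = 1 at data), the straight-line claim is easy to verify in numpy: along a linear path the target velocity is constant, so a single Euler step of the true field reaches the data point exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=4)                # noise sample
x1 = np.array([1.0, 2.0, 3.0, 4.0])   # toy "data" sample

def x_t(t):
    """Straight (optimal-transport) interpolation path between noise and data."""
    return (1 - t) * x0 + t * x1

v_target = x1 - x0   # dx_t/dt is constant: this is the regression target for v_theta

# Because the true field is constant along the path, one Euler step integrates
# it exactly -- the reason straight paths permit very few sampling steps.
x_gen = x0 + 1.0 * v_target
print(np.allclose(x_gen, x1))  # True
```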
Text-to-Image Models
DALL-E Family
DALL-E (OpenAI, 2021): discrete VAE encodes images as tokens, autoregressive transformer generates image tokens from text.
DALL-E 2: CLIP text embedding -> diffusion prior -> image embedding -> diffusion decoder. Uses CLIP's joint embedding space to connect text and images.
DALL-E 3: improved prompt following via re-captioning training data with detailed synthetic captions. Integrated with ChatGPT for prompt refinement.
Imagen
Google's text-to-image diffusion model:
- T5-XXL text encoder (frozen, 4.6B params) -- large language model provides rich text understanding
- Base model: 64x64 diffusion
- Two super-resolution diffusion models: 64->256->1024
- Finding: text encoder size matters more than image model size
Flux
Black Forest Labs' state-of-the-art open model:
- MM-DiT architecture (from SD3 team)
- Flow matching training objective
- Strong prompt adherence and image quality
- Variants: Flux.1 [pro], [dev], [schnell] (1-4 step distilled)
Image Editing
Inpainting
Fill in missing or masked image regions:
- Traditional: PatchMatch -- iteratively search for best matching patches from known regions
- Deep learning: encode partial image + mask, decode complete image
- Diffusion-based: condition on unmasked regions during sampling; repaint (resample unmasked regions at each step)
Super-Resolution
Upscale low-resolution images:
- SRCNN: first CNN-based method (learn LR->HR mapping)
- ESRGAN: GAN-based with perceptual loss for realistic textures
- Real-ESRGAN: handles real-world degradations (blur, noise, compression)
- Diffusion-based SR: StableSR, SUPIR -- use diffusion priors for photorealistic upscaling
Loss functions:
- Pixel loss (L1/L2): blurry results
- Perceptual loss: L2 in VGG feature space -- preserves structure
- Adversarial loss: GAN discriminator promotes realistic textures
- Best results combine all three
Style Transfer
Neural style transfer (Gatys et al., 2016):
Optimize image to match:
- Content: feature activations at deeper layers of VGG
- Style: Gram matrices of feature activations (texture statistics)
L = alpha * L_content + beta * L_style
L_content = sum_l ||F_l(I) - F_l(I_content)||^2
L_style = sum_l ||G_l(I) - G_l(I_style)||^2
where G_l = F_l^T * F_l is the Gram matrix at layer l.
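A single-layer numpy sketch of the style term: build Gram matrices from (C, H*W) feature maps and compare them (the per-layer normalization by H*W is a common choice, not fixed by the formula above):

```python
import numpy as np

def gram(F):
    """Gram matrix of a (C, H*W) feature map: channel-wise correlations,
    which discard spatial layout and keep texture statistics."""
    return F @ F.T / F.shape[1]

def style_loss(F_img, F_style):
    G_i, G_s = gram(F_img), gram(F_style)
    return np.sum((G_i - G_s) ** 2)

rng = np.random.default_rng(0)
F_style = rng.normal(size=(8, 100))
F_other = rng.normal(size=(8, 100))
print(style_loss(F_style, F_style))  # 0.0 -- identical texture statistics
print(style_loss(F_other, F_style) > 0)  # True
```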
Fast style transfer: train a feed-forward network for a single style (Johnson et al.). Arbitrary style transfer: AdaIN -- match content feature statistics to style feature statistics.
Diffusion-Based Editing
- SDEdit: add noise to input image, then denoise with text guidance
- InstructPix2Pix: trained on instruction-following pairs to edit images from text instructions
- ControlNet: add spatial conditioning (edges, depth, pose) to pre-trained diffusion models via trainable copy of encoder blocks
- IP-Adapter: image prompt conditioning via decoupled cross-attention
Evaluation
| Metric | Measures | Notes |
|--------|----------|-------|
| FID | Distribution similarity | Standard for unconditional generation |
| CLIP Score | Text-image alignment | cos(CLIP_image, CLIP_text) |
| Aesthetic Score | Visual quality | Learned predictor on human ratings |
| Human Evaluation | Overall preference | Gold standard but expensive |
| GenEval / T2I-CompBench | Compositional accuracy | Tests attributes, relations, counting |
Training Techniques
Fine-Tuning Diffusion Models
- DreamBooth: fine-tune entire model on 3-5 images of a subject with unique identifier token
- Textual Inversion: learn new token embedding for a concept (frozen model weights)
- LoRA: low-rank adaptation of attention weights -- parameter-efficient fine-tuning
W' = W + (alpha / r) * B * A, B in R^{d x r}, A in R^{r x k}, r << min(d, k)
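A numpy sketch of the LoRA update: B is initialized to zero so training starts from the pre-trained behavior, and the low-rank product is applied without ever materializing W' (shapes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 768, 768, 8, 16

W = rng.normal(size=(d, k))          # frozen pre-trained weight
B = np.zeros((d, r))                 # B = 0 at init, so W' starts equal to W
A = rng.normal(size=(r, k)) * 0.01   # only A and B are trained

def lora_forward(x):
    """x @ W' with W' = W + (alpha / r) * B @ A, computed low-rank:
    (x @ B) @ A costs O(n*r*(d+k)) instead of forming the d x k update."""
    return x @ W + (alpha / r) * (x @ B) @ A

x = rng.normal(size=(2, d))
print(np.allclose(lora_forward(x), x @ W))  # True: zero update at init
print(B.size + A.size, W.size)              # 12288 trainable vs 589824 frozen
```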
Reward-Based Training
- RLHF for diffusion: fine-tune with human preference reward models
- DPO for diffusion: direct preference optimization without explicit reward model
- ReFL: reward feedback learning with differentiable reward functions
Practical Considerations
- GANs are still preferred for real-time generation (face synthesis, video)
- Diffusion models dominate image quality benchmarks but are slower
- Classifier-free guidance scale is the most impactful inference hyperparameter (typically 7-10)
- LoRA fine-tuning requires only 10-50 images and consumer GPU
- Generated images increasingly require watermarking and detection systems
- Copyright and ethical implications of training on internet-scraped data remain active legal questions
Key Takeaways
| Concept | Core Idea |
|---------|-----------|
| GANs | Generator vs discriminator adversarial training; fast sampling |
| StyleGAN | Mapping network + AdaIN for disentangled, high-quality face synthesis |
| DDPM | Learn to reverse a gradual noising process; train by predicting noise |
| Stable Diffusion | Latent space diffusion + cross-attention conditioning; efficient and flexible |
| Consistency models | Map any point on diffusion trajectory to x_0; few-step generation |
| ControlNet | Add spatial control to pre-trained diffusion via trainable encoder copy |