
Image Generation

Generative Adversarial Networks (GANs)

Framework

Two networks trained adversarially (Goodfellow et al., 2014):

  • Generator G(z): maps random noise z ~ N(0, I) to an image
  • Discriminator D(x): classifies images as real or generated

Minimax objective:

min_G max_D  E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]

At Nash equilibrium, G produces samples indistinguishable from real data and D outputs 0.5 everywhere.

Training is notoriously unstable: mode collapse (the generator covers only a few modes of the data distribution), vanishing gradients when the discriminator becomes too strong, and oscillation without convergence. These failure modes are addressed by architectural and loss modifications.
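
The equilibrium claim above can be checked numerically. A minimal NumPy sketch (the `gan_value` helper is illustrative, not from any GAN library) evaluates the minimax value function when D outputs 0.5 on every input:

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Monte-Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# At equilibrium D(x) = 0.5 everywhere, so the value function
# equals log(0.5) + log(0.5) = -log 4 ~= -1.386.
d_half = np.full(1000, 0.5)
print(gan_value(d_half, d_half))  # -1.3862...
```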

DCGAN (Deep Convolutional GAN)

Architectural guidelines for stable GAN training:

  • Replace pooling with strided convolutions (discriminator) and transposed convolutions (generator)
  • Batch normalization in both networks (except D input and G output)
  • Remove fully connected layers (fully convolutional)
  • ReLU in generator (except output: Tanh), LeakyReLU in discriminator
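
These kernel/stride choices pin down the resolution at every layer. A small sketch (the helper names are hypothetical) computes output sizes from the standard convolution formulas:

```python
def conv_out(size, kernel, stride, pad):
    # Strided convolution (discriminator downsampling).
    return (size + 2 * pad - kernel) // stride + 1

def tconv_out(size, kernel, stride, pad):
    # Transposed convolution (generator upsampling).
    return (size - 1) * stride - 2 * pad + kernel

# The common DCGAN choice kernel=4, stride=2, pad=1 exactly doubles
# (or halves) resolution at each layer: 4 -> 8 -> 16 -> 32 -> 64.
size = 4
for _ in range(4):
    size = tconv_out(size, kernel=4, stride=2, pad=1)
print(size)                    # 64
print(conv_out(64, 4, 2, 1))   # 32
```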

StyleGAN Family

Progressive high-resolution image synthesis (Karras et al., NVIDIA):

StyleGAN (v1):

  • Mapping network: 8-layer MLP maps z to intermediate latent w (disentangled)
  • Synthesis network: generates image progressively via constant input + AdaIN at each layer
  • Adaptive Instance Normalization (AdaIN): y = gamma(w) * (x - mu) / sigma + beta(w) -- style vector modulates feature statistics
  • Stochastic variation: per-pixel noise inputs for fine details (hair, freckles)
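
The AdaIN equation above can be sketched directly in NumPy (a toy stand-in for the synthesis network: gamma and beta are passed in here, rather than produced by an affine projection of w as in StyleGAN):

```python
import numpy as np

def adain(x, gamma, beta, eps=1e-5):
    """Normalize each channel of x (C, H, W), then rescale by style params."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    return gamma[:, None, None] * (x - mu) / (sigma + eps) + beta[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16, 16))
y = adain(x, gamma=np.full(8, 2.0), beta=np.full(8, 0.5))
# The style vector now controls the per-channel feature statistics:
print(y.mean(axis=(1, 2)).round(3))  # all ~0.5
print(y.std(axis=(1, 2)).round(3))   # all ~2.0
```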

StyleGAN2: removes characteristic blob artifacts by replacing AdaIN with weight demodulation, adds path length regularization, and replaces progressive growing with a skip/residual architecture.

StyleGAN3: alias-free generation using continuous signal interpretation; equivariant to sub-pixel translations and rotations.

Latent space properties:

  • W space is more disentangled than Z space
  • W+ space: different w per layer enables fine-grained editing
  • Linear directions in W correspond to semantic attributes (age, smile, glasses)

GAN Evaluation Metrics

  • FID (Frechet Inception Distance): distance between real and generated feature distributions
    FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2*(Sigma_r * Sigma_g)^{1/2})
    
    Lower is better. Computed using Inception-v3 features.
  • IS (Inception Score): measures quality and diversity; less reliable than FID
  • KID (Kernel Inception Distance): unbiased alternative to FID for small sample sizes
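
For two Gaussians the FID formula above can be evaluated exactly. The sketch below is illustrative (real FID tooling such as `pytorch-fid` first extracts Inception-v3 features); it forms the matrix square root via eigendecomposition:

```python
import numpy as np

def fid_gaussian(mu_r, sigma_r, mu_g, sigma_g):
    """Frechet distance between N(mu_r, Sigma_r) and N(mu_g, Sigma_g)."""
    diff = mu_r - mu_g
    # The product of two SPD covariances has real positive eigenvalues,
    # so its square root can be formed from the eigendecomposition.
    vals, vecs = np.linalg.eig(sigma_r @ sigma_g)
    sqrt_prod = (vecs * np.sqrt(vals.real)) @ np.linalg.inv(vecs)
    return diff @ diff + np.trace(sigma_r + sigma_g - 2 * sqrt_prod).real

mu, cov = np.zeros(4), np.eye(4)
print(fid_gaussian(mu, cov, mu, cov))        # 0.0: identical distributions
print(fid_gaussian(mu, cov, mu + 1.0, cov))  # 4.0: squared mean shift
```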

Diffusion Models

DDPM (Denoising Diffusion Probabilistic Models)

Forward process gradually adds Gaussian noise over T steps; reverse process learns to denoise.

Forward process (fixed):

q(x_t | x_{t-1}) = N(x_t; sqrt(1-beta_t) * x_{t-1}, beta_t * I)

Closed-form sampling at any step t:

x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon,  epsilon ~ N(0, I)

where alpha_bar_t = prod_{s=1}^{t} (1 - beta_s).
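
The closed form makes training cheap: any x_t can be sampled in one shot without simulating the chain. A NumPy sketch using the linear beta schedule from the DDPM paper:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear schedule (Ho et al., 2020)
alpha_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    """Draw x_t ~ q(x_t | x_0) via the closed form above."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=10_000)
x_T = q_sample(x0, T - 1, rng.normal(size=10_000))
# By the last step nearly all signal is destroyed:
print(alpha_bar[-1])                             # ~4e-5
print(x_T.mean().round(2), x_T.std().round(2))   # ~0.0, ~1.0
```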

Reverse process (learned):

p_theta(x_{t-1} | x_t) = N(x_{t-1}; mu_theta(x_t, t), sigma_t^2 * I)

Training objective (simplified):

L = E_{t, x_0, epsilon} [||epsilon - epsilon_theta(x_t, t)||^2]

The network epsilon_theta(x_t, t) predicts the noise added at step t. This is equivalent to denoising score matching: epsilon_theta(x_t, t) ~ -sqrt(1 - alpha_bar_t) * nabla_{x_t} log p(x_t).
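
For a single known x_0, the marginal q(x_t | x_0) is Gaussian, so its score has a closed form and the noise/score identity can be verified exactly:

```python
import numpy as np

alpha_bar_t = 0.3
x0 = np.array([1.0, -2.0, 0.5])
eps = np.array([0.2, -0.7, 1.1])
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps

# Score of N(sqrt(abar_t) * x0, (1 - abar_t) * I) evaluated at x_t:
score = -(x_t - np.sqrt(alpha_bar_t) * x0) / (1 - alpha_bar_t)

# Matches the relation in the text: eps = -sqrt(1 - abar_t) * score.
print(np.allclose(eps, -np.sqrt(1 - alpha_bar_t) * score))  # True
```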

Architecture

U-Net with modifications:

  • ResNet blocks with time embedding (sinusoidal positional encoding)
  • Self-attention layers at lower resolutions
  • Group normalization
  • Skip connections between encoder and decoder

Sampling

DDPM sampling (T=1000 steps): slow but high quality.

DDIM (Denoising Diffusion Implicit Models): deterministic sampling with fewer steps by reformulating as non-Markovian process. Same trained model, 10-50x fewer steps.

Classifier-free guidance (CFG): interpolate between conditional and unconditional predictions:

epsilon_guided = epsilon_uncond + s * (epsilon_cond - epsilon_uncond)

Guidance scale s > 1 improves quality and prompt adherence at the cost of diversity.
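
The guidance equation is a one-line extrapolation; s = 1 recovers the conditional model and s = 0 the unconditional one:

```python
import numpy as np

def cfg(eps_uncond, eps_cond, s):
    """Classifier-free guidance: extrapolate past the conditional prediction."""
    return eps_uncond + s * (eps_cond - eps_uncond)

e_u = np.array([0.0, 0.0])
e_c = np.array([1.0, -1.0])
print(cfg(e_u, e_c, 1.0))  # [ 1. -1.]   pure conditional prediction
print(cfg(e_u, e_c, 7.5))  # [ 7.5 -7.5] pushed harder toward the prompt
```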

Stable Diffusion (Latent Diffusion Models)

Diffusion in a compressed latent space for efficiency (Rombach et al., 2022):

Architecture:

  1. VAE encoder E: compress image x to latent z = E(x) (8x spatial downsampling, e.g., 512x512x3 -> 64x64x4)
  2. Diffusion U-Net: operates in latent space with cross-attention for conditioning
  3. VAE decoder D: reconstruct image x' = D(z)

Conditioning: text encoded by CLIP text encoder, injected via cross-attention:

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d)) * V
Q = W_Q * phi(z_t),  K = W_K * tau(y),  V = W_V * tau(y)

where phi(z_t) is a U-Net intermediate feature and tau(y) is the text embedding.
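
A single-head NumPy sketch of this cross-attention (toy dimensions; the real U-Net uses multi-head attention at several resolutions):

```python
import numpy as np

def cross_attention(phi_z, tau_y, W_Q, W_K, W_V):
    """phi_z: (n, d_model) image features; tau_y: (m, d_text) text tokens."""
    Q, K, V = phi_z @ W_Q, tau_y @ W_K, tau_y @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # softmax over the m text tokens
    return w @ V                         # each image token mixes text values

rng = np.random.default_rng(0)
n, m, d_model, d_text, d = 16, 8, 32, 24, 32
out = cross_attention(rng.normal(size=(n, d_model)),
                      rng.normal(size=(m, d_text)),
                      rng.normal(size=(d_model, d)),
                      rng.normal(size=(d_text, d)),
                      rng.normal(size=(d_text, d)))
print(out.shape)  # (16, 32): one updated feature per image token
```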

SDXL: larger U-Net, dual text encoders (CLIP-ViT-L + OpenCLIP-ViT-G), refinement model.

Stable Diffusion 3: replaces U-Net with MM-DiT (multimodal diffusion transformer), uses flow matching objective, T5 text encoder for better text rendering.

Consistency Models

Generate high-quality images in one or few steps (Song et al., 2023):

Key idea: learn a function f_theta(x_t, t) that maps any noisy sample x_t on the same trajectory to the clean sample x_0:

f_theta(x_t, t) = f_theta(x_t', t')  for all t, t' on the same trajectory

Training: consistency distillation from a pre-trained diffusion model, or consistency training from scratch.

Advantage: 1-2 step generation with quality approaching multi-step diffusion.
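
The boundary condition f_theta(x, t) = x at the smallest time step is usually enforced by construction through a skip parameterization. A sketch using the coefficient form from the consistency models paper (the sigma_data and eps values here are illustrative defaults, and the network is a stand-in):

```python
import numpy as np

def consistency_f(x, t, F, sigma_data=0.5, eps=0.002):
    """Skip parameterization: c_skip(eps) = 1 and c_out(eps) = 0,
    so f(x, eps) = x holds exactly regardless of the network F."""
    c_skip = sigma_data**2 / ((t - eps) ** 2 + sigma_data**2)
    c_out = sigma_data * (t - eps) / np.sqrt(sigma_data**2 + t**2)
    return c_skip * x + c_out * F(x, t)

F = lambda x, t: np.tanh(x)  # stand-in for the learned network
x = np.array([0.3, -1.2, 2.0])
print(np.allclose(consistency_f(x, 0.002, F), x))  # True: identity at t = eps
```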

Latent Consistency Models (LCM)

Apply consistency distillation in latent space. Combined with LoRA, enables fast fine-tuned generation.

Flow Matching

Alternative to DDPM that learns a vector field transporting noise to data:

v_theta(x_t, t) ~ dx_t/dt

Optimal transport paths are straight lines, enabling fewer sampling steps. Used in Stable Diffusion 3 and Flux.
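
The straight-path property is easy to see numerically: along x_t = (1 - t) * x0 + t * x1 the velocity is constant, so a single Euler step of size 1 maps noise exactly onto data:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=5)                    # noise sample
x1 = np.array([1.0, 2.0, -1.0, 0.5, 0.0])  # "data" sample

v = x1 - x0            # dx_t/dt along the straight interpolation path
x_gen = x0 + 1.0 * v   # one Euler step from t = 0 to t = 1
print(np.allclose(x_gen, x1))  # True
```

In practice v_theta only approximates this field averaged over noise-data pairings, so a few steps are still needed, but straighter learned paths mean fewer of them.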


Text-to-Image Models

DALL-E Family

DALL-E (OpenAI, 2021): discrete VAE encodes images as tokens, autoregressive transformer generates image tokens from text.

DALL-E 2: CLIP text embedding -> diffusion prior -> image embedding -> diffusion decoder. Uses CLIP's joint embedding space to connect text and images.

DALL-E 3: improved prompt following via re-captioning training data with detailed synthetic captions. Integrated with ChatGPT for prompt refinement.

Imagen

Google's text-to-image diffusion model:

  • T5-XXL text encoder (frozen, 4.6B params) -- large language model provides rich text understanding
  • Base model: 64x64 diffusion
  • Two super-resolution diffusion models: 64->256->1024
  • Finding: text encoder size matters more than image model size

Flux

Black Forest Labs' state-of-the-art open model:

  • MM-DiT architecture (from SD3 team)
  • Flow matching training objective
  • Strong prompt adherence and image quality
  • Variants: Flux.1 [pro], [dev], [schnell] (1-4 step distilled)

Image Editing

Inpainting

Fill in missing or masked image regions:

  • Traditional: PatchMatch -- iteratively search for best matching patches from known regions
  • Deep learning: encode partial image + mask, decode complete image
  • Diffusion-based: condition on unmasked regions during sampling; repaint (resample unmasked regions at each step)

Super-Resolution

Upscale low-resolution images:

  • SRCNN: first CNN-based method (learn LR->HR mapping)
  • ESRGAN: GAN-based with perceptual loss for realistic textures
  • Real-ESRGAN: handles real-world degradations (blur, noise, compression)
  • Diffusion-based SR: StableSR, SUPIR -- use diffusion priors for photorealistic upscaling

Loss functions:

  • Pixel loss (L1/L2): blurry results
  • Perceptual loss: L2 in VGG feature space -- preserves structure
  • Adversarial loss: GAN discriminator promotes realistic textures
  • Best results combine all three

Style Transfer

Neural style transfer (Gatys et al., 2016):

Optimize image to match:

  • Content: feature activations at deeper layers of VGG
  • Style: Gram matrices of feature activations (texture statistics)
L = alpha * L_content + beta * L_style
L_content = sum_l ||F_l(I) - F_l(I_content)||^2
L_style = sum_l ||G_l(I) - G_l(I_style)||^2

where G_l = F_l^T * F_l is the Gram matrix at layer l.
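
The reason Gram matrices capture style is that they discard spatial layout. A NumPy check (features flattened to (C, H*W); normalized here by the number of spatial positions):

```python
import numpy as np

def gram(F):
    """Channel-channel correlation matrix of a (C, H*W) feature map."""
    return F @ F.T / F.shape[1]

rng = np.random.default_rng(0)
F_style = rng.normal(size=(16, 64))   # 16 channels, 64 spatial positions

# Permuting spatial positions leaves the Gram matrix unchanged:
# it encodes texture statistics, not where things are.
perm = rng.permutation(64)
print(np.allclose(gram(F_style), gram(F_style[:, perm])))  # True
```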

Fast style transfer: train a feed-forward network for a single style (Johnson et al.). Arbitrary style transfer: AdaIN -- match content feature statistics to style feature statistics.

Diffusion-Based Editing

  • SDEdit: add noise to input image, then denoise with text guidance
  • InstructPix2Pix: trained on instruction-following pairs to edit images from text instructions
  • ControlNet: add spatial conditioning (edges, depth, pose) to pre-trained diffusion models via trainable copy of encoder blocks
  • IP-Adapter: image prompt conditioning via decoupled cross-attention

Evaluation

| Metric | Measures | Notes |
|--------|----------|-------|
| FID | Distribution similarity | Standard for unconditional generation |
| CLIP Score | Text-image alignment | cos(CLIP_image, CLIP_text) |
| Aesthetic Score | Visual quality | Learned predictor on human ratings |
| Human Evaluation | Overall preference | Gold standard but expensive |
| GenEval / T2I-CompBench | Compositional accuracy | Tests attributes, relations, counting |


Training Techniques

Fine-Tuning Diffusion Models

  • DreamBooth: fine-tune entire model on 3-5 images of a subject with unique identifier token
  • Textual Inversion: learn new token embedding for a concept (frozen model weights)
  • LoRA: low-rank adaptation of attention weights -- parameter-efficient fine-tuning
    W' = W + alpha * B * A,  B in R^{d x r}, A in R^{r x k},  r << min(d,k)
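
A NumPy sketch of the update and its parameter savings (dimensions chosen to resemble a typical attention projection; with B initialized to zero the adapted model starts identical to the base model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 768, 768, 8, 1.0

W = rng.normal(size=(d, k))      # frozen pre-trained weight
B = np.zeros((d, r))             # zero init: W' = W before training
A = rng.normal(size=(r, k)) * 0.01

W_prime = W + alpha * (B @ A)
print(np.allclose(W_prime, W))   # True at initialization

full_params = d * k              # 589,824 entries in W
lora_params = r * (d + k)        # 12,288 trainable, ~2% of the full matrix
print(lora_params / full_params) # ~0.021
```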
    

Reward-Based Training

  • RLHF for diffusion: fine-tune with human preference reward models
  • DPO for diffusion: direct preference optimization without explicit reward model
  • ReFL: reward feedback learning with differentiable reward functions

Practical Considerations

  • GANs are still preferred for real-time generation (face synthesis, video)
  • Diffusion models dominate image quality benchmarks but are slower
  • Classifier-free guidance scale is the most impactful inference hyperparameter (typically 7-10)
  • LoRA fine-tuning requires only 10-50 images and consumer GPU
  • Generated images increasingly require watermarking and detection systems
  • Copyright and ethical implications of training on internet-scraped data remain active legal questions

Key Takeaways

| Concept | Core Idea |
|---------|-----------|
| GANs | Generator vs discriminator adversarial training; fast sampling |
| StyleGAN | Mapping network + AdaIN for disentangled, high-quality face synthesis |
| DDPM | Learn to reverse a gradual noising process; train by predicting noise |
| Stable Diffusion | Latent space diffusion + cross-attention conditioning; efficient and flexible |
| Consistency models | Map any point on diffusion trajectory to x_0; few-step generation |
| ControlNet | Add spatial control to pre-trained diffusion via trainable encoder copy |