
Image Generation

Generative Adversarial Networks (GANs)

Framework

Two networks trained adversarially (Goodfellow et al., 2014):

  • Generator G(z): maps random noise z ~ N(0, I) to an image
  • Discriminator D(x): classifies images as real or generated

Minimax objective:

min_G max_D  E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]

At Nash equilibrium, G produces samples indistinguishable from real data and D outputs 0.5 everywhere.

Training is notoriously unstable: mode collapse (the generator covers only a few modes of the data distribution), vanishing gradients when the discriminator becomes too strong, and oscillation without convergence. These failure modes are addressed by architectural and loss modifications.
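
The equilibrium claim above can be checked numerically. A minimal NumPy sketch (the `gan_value` helper is illustrative, not from any GAN library) evaluates the minimax value function when D outputs 0.5 on every input:

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Monte-Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# At equilibrium D(x) = 0.5 everywhere, so the value function
# equals log(0.5) + log(0.5) = -log 4 ~= -1.386.
d_half = np.full(1000, 0.5)
print(gan_value(d_half, d_half))  # -1.3862...
```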

DCGAN (Deep Convolutional GAN)

Architectural guidelines for stable GAN training:

  • Replace pooling with strided convolutions (discriminator) and transposed convolutions (generator)
  • Batch normalization in both networks (except D input and G output)
  • Remove fully connected layers (fully convolutional)
  • ReLU in generator (except output: Tanh), LeakyReLU in discriminator
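
These kernel/stride choices pin down the resolution at every layer. A small sketch (the helper names are hypothetical) computes output sizes from the standard convolution formulas:

```python
def conv_out(size, kernel, stride, pad):
    # Strided convolution (discriminator downsampling).
    return (size + 2 * pad - kernel) // stride + 1

def tconv_out(size, kernel, stride, pad):
    # Transposed convolution (generator upsampling).
    return (size - 1) * stride - 2 * pad + kernel

# The common DCGAN choice kernel=4, stride=2, pad=1 exactly doubles
# (or halves) resolution at each layer: 4 -> 8 -> 16 -> 32 -> 64.
size = 4
for _ in range(4):
    size = tconv_out(size, kernel=4, stride=2, pad=1)
print(size)                    # 64
print(conv_out(64, 4, 2, 1))   # 32
```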

StyleGAN Family

Progressive high-resolution image synthesis (Karras et al., NVIDIA):

StyleGAN (v1):

  • Mapping network: 8-layer MLP maps z to intermediate latent w (disentangled)
  • Synthesis network: generates image progressively via constant input + AdaIN at each layer
  • Adaptive Instance Normalization (AdaIN): y = gamma(w) * (x - mu) / sigma + beta(w) -- style vector modulates feature statistics
  • Stochastic variation: per-pixel noise inputs for fine details (hair, freckles)
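
The AdaIN equation above can be sketched directly in NumPy (a toy stand-in for the synthesis network: gamma and beta are passed in here, rather than produced by an affine projection of w as in StyleGAN):

```python
import numpy as np

def adain(x, gamma, beta, eps=1e-5):
    """Normalize each channel of x (C, H, W), then rescale by style params."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    return gamma[:, None, None] * (x - mu) / (sigma + eps) + beta[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16, 16))
y = adain(x, gamma=np.full(8, 2.0), beta=np.full(8, 0.5))
# The style vector now controls the per-channel feature statistics:
print(y.mean(axis=(1, 2)).round(3))  # all ~0.5
print(y.std(axis=(1, 2)).round(3))   # all ~2.0
```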

StyleGAN2: removes characteristic blob artifacts by replacing AdaIN with weight demodulation, adds path length regularization, and replaces progressive growing with a skip/residual architecture.

StyleGAN3: alias-free generation using continuous signal interpretation; equivariant to sub-pixel translations and rotations.

Latent space properties:

  • W space is more disentangled than Z space
  • W+ space: different w per layer enables fine-grained editing
  • Linear directions in W correspond to semantic attributes (age, smile, glasses)

GAN Evaluation Metrics

  • FID (Frechet Inception Distance): distance between real and generated feature distributions
    FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2*(Sigma_r * Sigma_g)^{1/2})
    
    Lower is better. Computed using Inception-v3 features.
  • IS (Inception Score): measures quality and diversity; less reliable than FID
  • KID (Kernel Inception Distance): unbiased alternative to FID for small sample sizes
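
For two Gaussians the FID formula above can be evaluated exactly. The sketch below is illustrative (real FID tooling such as `pytorch-fid` first extracts Inception-v3 features); it forms the matrix square root via eigendecomposition:

```python
import numpy as np

def fid_gaussian(mu_r, sigma_r, mu_g, sigma_g):
    """Frechet distance between N(mu_r, Sigma_r) and N(mu_g, Sigma_g)."""
    diff = mu_r - mu_g
    # The product of two SPD covariances has real positive eigenvalues,
    # so its square root can be formed from the eigendecomposition.
    vals, vecs = np.linalg.eig(sigma_r @ sigma_g)
    sqrt_prod = (vecs * np.sqrt(vals.real)) @ np.linalg.inv(vecs)
    return diff @ diff + np.trace(sigma_r + sigma_g - 2 * sqrt_prod).real

mu, cov = np.zeros(4), np.eye(4)
print(fid_gaussian(mu, cov, mu, cov))        # 0.0: identical distributions
print(fid_gaussian(mu, cov, mu + 1.0, cov))  # 4.0: squared mean shift
```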

Diffusion Models

DDPM (Denoising Diffusion Probabilistic Models)

Forward process gradually adds Gaussian noise over T steps; reverse process learns to denoise.

Forward process (fixed):

q(x_t | x_{t-1}) = N(x_t; sqrt(1-beta_t) * x_{t-1}, beta_t * I)

Closed-form sampling at any step t:

x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon,  epsilon ~ N(0, I)

where alpha_bar_t = prod_{s=1}^{t} (1 - beta_s).
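
The closed form makes training cheap: any x_t can be sampled in one shot without simulating the chain. A NumPy sketch using the linear beta schedule from the DDPM paper:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear schedule (Ho et al., 2020)
alpha_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    """Draw x_t ~ q(x_t | x_0) via the closed form above."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=10_000)
x_T = q_sample(x0, T - 1, rng.normal(size=10_000))
# By the last step nearly all signal is destroyed:
print(alpha_bar[-1])                             # ~4e-5
print(x_T.mean().round(2), x_T.std().round(2))   # ~0.0, ~1.0
```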

Reverse process (learned):

p_theta(x_{t-1} | x_t) = N(x_{t-1}; mu_theta(x_t, t), sigma_t^2 * I)

Training objective (simplified):

L = E_{t, x_0, epsilon} [||epsilon - epsilon_theta(x_t, t)||^2]

The network epsilon_theta(x_t, t) predicts the noise added at step t. This is equivalent to denoising score matching: epsilon_theta(x_t, t) ~ -sqrt(1 - alpha_bar_t) * nabla_{x_t} log p(x_t).
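
For a single known x_0, the marginal q(x_t | x_0) is Gaussian, so its score has a closed form and the noise/score identity can be verified exactly:

```python
import numpy as np

alpha_bar_t = 0.3
x0 = np.array([1.0, -2.0, 0.5])
eps = np.array([0.2, -0.7, 1.1])
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps

# Score of N(sqrt(abar_t) * x0, (1 - abar_t) * I) evaluated at x_t:
score = -(x_t - np.sqrt(alpha_bar_t) * x0) / (1 - alpha_bar_t)

# Matches the relation in the text: eps = -sqrt(1 - abar_t) * score.
print(np.allclose(eps, -np.sqrt(1 - alpha_bar_t) * score))  # True
```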

Architecture

U-Net with modifications:

  • ResNet blocks with time embedding (sinusoidal positional encoding)
  • Self-attention layers at lower resolutions
  • Group normalization
  • Skip connections between encoder and decoder

Sampling

DDPM sampling (T=1000 steps): slow but high quality.

DDIM (Denoising Diffusion Implicit Models): deterministic sampling with fewer steps by reformulating as non-Markovian process. Same trained model, 10-50x fewer steps.

Classifier-free guidance (CFG): interpolate between conditional and unconditional predictions:

epsilon_guided = epsilon_uncond + s * (epsilon_cond - epsilon_uncond)

Guidance scale s > 1 improves quality and prompt adherence at the cost of diversity.
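
The guidance equation is a one-line extrapolation; s = 1 recovers the conditional model and s = 0 the unconditional one:

```python
import numpy as np

def cfg(eps_uncond, eps_cond, s):
    """Classifier-free guidance: extrapolate past the conditional prediction."""
    return eps_uncond + s * (eps_cond - eps_uncond)

e_u = np.array([0.0, 0.0])
e_c = np.array([1.0, -1.0])
print(cfg(e_u, e_c, 1.0))  # [ 1. -1.]   pure conditional prediction
print(cfg(e_u, e_c, 7.5))  # [ 7.5 -7.5] pushed harder toward the prompt
```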

Stable Diffusion (Latent Diffusion Models)

Diffusion in a compressed latent space for efficiency (Rombach et al., 2022):

Architecture:

  1. VAE encoder E: compress image x to latent z = E(x) (8x spatial downsampling, e.g., 512x512x3 -> 64x64x4)
  2. Diffusion U-Net: operates in latent space with cross-attention for conditioning
  3. VAE decoder D: reconstruct image x' = D(z)

Conditioning: text encoded by CLIP text encoder, injected via cross-attention:

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d)) * V
Q = W_Q * phi(z_t),  K = W_K * tau(y),  V = W_V * tau(y)

where phi(z_t) is a U-Net intermediate feature and tau(y) is the text embedding.
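
A single-head NumPy sketch of this cross-attention (toy dimensions; the real U-Net uses multi-head attention at several resolutions):

```python
import numpy as np

def cross_attention(phi_z, tau_y, W_Q, W_K, W_V):
    """phi_z: (n, d_model) image features; tau_y: (m, d_text) text tokens."""
    Q, K, V = phi_z @ W_Q, tau_y @ W_K, tau_y @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # softmax over the m text tokens
    return w @ V                         # each image token mixes text values

rng = np.random.default_rng(0)
n, m, d_model, d_text, d = 16, 8, 32, 24, 32
out = cross_attention(rng.normal(size=(n, d_model)),
                      rng.normal(size=(m, d_text)),
                      rng.normal(size=(d_model, d)),
                      rng.normal(size=(d_text, d)),
                      rng.normal(size=(d_text, d)))
print(out.shape)  # (16, 32): one updated feature per image token
```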

SDXL: larger U-Net, dual text encoders (CLIP-ViT-L + OpenCLIP-ViT-G), refinement model.

Stable Diffusion 3: replaces U-Net with MM-DiT (multimodal diffusion transformer), uses flow matching objective, T5 text encoder for better text rendering.

Consistency Models

Generate high-quality images in one or few steps (Song et al., 2023):

Key idea: learn a function f_theta(x_t, t) that maps any noisy sample x_t on the same trajectory to the clean sample x_0:

f_theta(x_t, t) = f_theta(x_t', t')  for all t, t' on the same trajectory

Training: consistency distillation from a pre-trained diffusion model, or consistency training from scratch.

Advantage: 1-2 step generation with quality approaching multi-step diffusion.
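
The boundary condition f_theta(x, t) = x at the smallest time step is usually enforced by construction through a skip parameterization. A sketch using the coefficient form from the consistency models paper (the sigma_data and eps values here are illustrative defaults, and the network is a stand-in):

```python
import numpy as np

def consistency_f(x, t, F, sigma_data=0.5, eps=0.002):
    """Skip parameterization: c_skip(eps) = 1 and c_out(eps) = 0,
    so f(x, eps) = x holds exactly regardless of the network F."""
    c_skip = sigma_data**2 / ((t - eps) ** 2 + sigma_data**2)
    c_out = sigma_data * (t - eps) / np.sqrt(sigma_data**2 + t**2)
    return c_skip * x + c_out * F(x, t)

F = lambda x, t: np.tanh(x)  # stand-in for the learned network
x = np.array([0.3, -1.2, 2.0])
print(np.allclose(consistency_f(x, 0.002, F), x))  # True: identity at t = eps
```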

Latent Consistency Models (LCM)

Apply consistency distillation in latent space. Combined with LoRA, enables fast fine-tuned generation.

Flow Matching

Alternative to DDPM that learns a vector field transporting noise to data:

v_theta(x_t, t) ~ dx_t/dt

Optimal transport paths are straight lines, enabling fewer sampling steps. Used in Stable Diffusion 3 and Flux.
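
The straight-path property is easy to see numerically: along x_t = (1 - t) * x0 + t * x1 the velocity is constant, so a single Euler step of size 1 maps noise exactly onto data:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=5)                    # noise sample
x1 = np.array([1.0, 2.0, -1.0, 0.5, 0.0])  # "data" sample

v = x1 - x0            # dx_t/dt along the straight interpolation path
x_gen = x0 + 1.0 * v   # one Euler step from t = 0 to t = 1
print(np.allclose(x_gen, x1))  # True
```

In practice v_theta only approximates this field averaged over noise-data pairings, so a few steps are still needed, but straighter learned paths mean fewer of them.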


Text-to-Image Models

DALL-E Family

DALL-E (OpenAI, 2021): discrete VAE encodes images as tokens, autoregressive transformer generates image tokens from text.

DALL-E 2: CLIP text embedding -> diffusion prior -> image embedding -> diffusion decoder. Uses CLIP's joint embedding space to connect text and images.

DALL-E 3: improved prompt following via re-captioning training data with detailed synthetic captions. Integrated with ChatGPT for prompt refinement.

Imagen

Google's text-to-image diffusion model:

  • T5-XXL text encoder (frozen, 4.6B params) -- large language model provides rich text understanding
  • Base model: 64x64 diffusion
  • Two super-resolution diffusion models: 64->256->1024
  • Finding: text encoder size matters more than image model size

Flux

Black Forest Labs' state-of-the-art open model:

  • MM-DiT architecture (from SD3 team)
  • Flow matching training objective
  • Strong prompt adherence and image quality
  • Variants: Flux.1 [pro], [dev], [schnell] (1-4 step distilled)

Image Editing

Inpainting

Fill in missing or masked image regions:

  • Traditional: PatchMatch -- iteratively search for best matching patches from known regions
  • Deep learning: encode partial image + mask, decode complete image
  • Diffusion-based: condition on unmasked regions during sampling; repaint (resample unmasked regions at each step)

Super-Resolution

Upscale low-resolution images:

  • SRCNN: first CNN-based method (learn LR->HR mapping)
  • ESRGAN: GAN-based with perceptual loss for realistic textures
  • Real-ESRGAN: handles real-world degradations (blur, noise, compression)
  • Diffusion-based SR: StableSR, SUPIR -- use diffusion priors for photorealistic upscaling

Loss functions:

  • Pixel loss (L1/L2): blurry results
  • Perceptual loss: L2 in VGG feature space -- preserves structure
  • Adversarial loss: GAN discriminator promotes realistic textures
  • Best results combine all three

Style Transfer

Neural style transfer (Gatys et al., 2016):

Optimize image to match:

  • Content: feature activations at deeper layers of VGG
  • Style: Gram matrices of feature activations (texture statistics)
L = alpha * L_content + beta * L_style
L_content = sum_l ||F_l(I) - F_l(I_content)||^2
L_style = sum_l ||G_l(I) - G_l(I_style)||^2

where G_l = F_l^T * F_l is the Gram matrix at layer l.
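
The reason Gram matrices capture style is that they discard spatial layout. A NumPy check (features flattened to (C, H*W); normalized here by the number of spatial positions):

```python
import numpy as np

def gram(F):
    """Channel-channel correlation matrix of a (C, H*W) feature map."""
    return F @ F.T / F.shape[1]

rng = np.random.default_rng(0)
F_style = rng.normal(size=(16, 64))   # 16 channels, 64 spatial positions

# Permuting spatial positions leaves the Gram matrix unchanged:
# it encodes texture statistics, not where things are.
perm = rng.permutation(64)
print(np.allclose(gram(F_style), gram(F_style[:, perm])))  # True
```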

Fast style transfer: train a feed-forward network for a single style (Johnson et al.). Arbitrary style transfer: AdaIN -- match content feature statistics to style feature statistics.

Diffusion-Based Editing

  • SDEdit: add noise to input image, then denoise with text guidance
  • InstructPix2Pix: trained on instruction-following pairs to edit images from text instructions
  • ControlNet: add spatial conditioning (edges, depth, pose) to pre-trained diffusion models via trainable copy of encoder blocks
  • IP-Adapter: image prompt conditioning via decoupled cross-attention

Evaluation

| Metric | Measures | Notes |
|--------|----------|-------|
| FID | Distribution similarity | Standard for unconditional generation |
| CLIP Score | Text-image alignment | cos(CLIP_image, CLIP_text) |
| Aesthetic Score | Visual quality | Learned predictor on human ratings |
| Human Evaluation | Overall preference | Gold standard but expensive |
| GenEval / T2I-CompBench | Compositional accuracy | Tests attributes, relations, counting |


Training Techniques

Fine-Tuning Diffusion Models

  • DreamBooth: fine-tune entire model on 3-5 images of a subject with unique identifier token
  • Textual Inversion: learn new token embedding for a concept (frozen model weights)
  • LoRA: low-rank adaptation of attention weights -- parameter-efficient fine-tuning
    W' = W + alpha * B * A,  B in R^{d x r}, A in R^{r x k},  r << min(d,k)
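
A NumPy sketch of the update and its parameter savings (dimensions chosen to resemble a typical attention projection; with B initialized to zero the adapted model starts identical to the base model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 768, 768, 8, 1.0

W = rng.normal(size=(d, k))      # frozen pre-trained weight
B = np.zeros((d, r))             # zero init: W' = W before training
A = rng.normal(size=(r, k)) * 0.01

W_prime = W + alpha * (B @ A)
print(np.allclose(W_prime, W))   # True at initialization

full_params = d * k              # 589,824 entries in W
lora_params = r * (d + k)        # 12,288 trainable, ~2% of the full matrix
print(lora_params / full_params) # ~0.021
```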
    

Reward-Based Training

  • RLHF for diffusion: fine-tune with human preference reward models
  • DPO for diffusion: direct preference optimization without explicit reward model
  • ReFL: reward feedback learning with differentiable reward functions

Practical Considerations

  • GANs are still preferred for real-time generation (face synthesis, video)
  • Diffusion models dominate image quality benchmarks but are slower
  • Classifier-free guidance scale is the most impactful inference hyperparameter (typically 7-10)
  • LoRA fine-tuning requires only 10-50 images and consumer GPU
  • Generated images increasingly require watermarking and detection systems
  • Copyright and ethical implications of training on internet-scraped data remain active legal questions

Key Takeaways

| Concept | Core Idea |
|---------|-----------|
| GANs | Generator vs discriminator adversarial training; fast sampling |
| StyleGAN | Mapping network + AdaIN for disentangled, high-quality face synthesis |
| DDPM | Learn to reverse a gradual noising process; train by predicting noise |
| Stable Diffusion | Latent space diffusion + cross-attention conditioning; efficient and flexible |
| Consistency models | Map any point on diffusion trajectory to x_0; few-step generation |
| ControlNet | Add spatial control to pre-trained diffusion via trainable encoder copy |