Text Generation

Overview

Text generation produces coherent natural language output for tasks including summarization, question answering, dialogue, and open-ended writing. The quality of generated text depends critically on decoding strategy, training approach, and task-specific design. This document covers decoding methods, major generation tasks, and their architectures.

Decoding Strategies

Given a language model that produces a probability distribution over the vocabulary at each step, the decoding strategy determines how tokens are selected.

Greedy Decoding

Select the highest-probability token at each step.

y_t = argmax_y P(y | y_1, ..., y_{t-1}, x), where x is the input (source text or prompt)

Fast (single forward pass per token)
Often produces repetitive, generic text
Misses high-quality sequences that start with lower-probability tokens
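A minimal sketch of greedy decoding, assuming a hypothetical `next_token_probs` function stands in for the language model. The toy transition table is invented for illustration; it is built so that the greedy path ("the cat", probability 0.30) is less probable overall than a path starting with the lower-probability token "a" ("a dog", probability 0.36).

```python
from typing import Callable, Dict, List

# Hypothetical toy model: the next-token distribution depends only on the last token.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def next_token_probs(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for one forward pass of a language model."""
    return TOY_MODEL[prefix[-1]]


def greedy_decode(
    step: Callable[[List[str]], Dict[str, float]],
    max_len: int = 10,
    bos: str = "<s>",
    eos: str = "</s>",
) -> List[str]:
    seq = [bos]
    for _ in range(max_len):
        probs = step(seq)
        token = max(probs, key=probs.get)  # argmax over the vocabulary
        seq.append(token)
        if token == eos:
            break
    return seq


print(greedy_decode(next_token_probs))  # ['<s>', 'the', 'cat', '</s>']
```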

Beam Search

Track the top-b highest-scoring partial sequences at each step.

b (beam width) typically 4-10
Finds approximately the most probable sequence
Produces higher-quality output than greedy for MT and summarization
Tends to generate text that is too short, generic, or repetitive for open-ended generation
Length normalization and repetition penalties help
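A simplified sketch of beam search with optional length normalization. The toy model and the `length_alpha` parameter name are assumptions made for illustration; production implementations differ in how they handle finished hypotheses and normalization.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical toy model: the next-token distribution depends only on the last token.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def next_token_probs(prefix: List[str]) -> Dict[str, float]:
    return TOY_MODEL[prefix[-1]]


def beam_search(
    step: Callable[[List[str]], Dict[str, float]],
    beam_width: int = 2,
    max_len: int = 10,
    length_alpha: float = 0.0,  # 0 disables length normalization
    bos: str = "<s>",
    eos: str = "</s>",
) -> Tuple[List[str], float]:
    beams: List[Tuple[List[str], float]] = [([bos], 0.0)]  # (tokens, log-prob)
    finished: List[Tuple[List[str], float]] = []
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            for tok, p in step(tokens).items():
                if p > 0.0:
                    candidates.append((tokens + [tok], logp + math.log(p)))
        # Keep only the top-b partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, logp in candidates[:beam_width]:
            (finished if tokens[-1] == eos else beams).append((tokens, logp))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len

    def normalized(hyp: Tuple[List[str], float]) -> float:
        tokens, logp = hyp
        length = max(len(tokens) - 1, 1)  # generated tokens, excluding bos
        return logp / (length ** length_alpha)

    return max(finished, key=normalized)


tokens, logp = beam_search(next_token_probs, beam_width=2)
print(tokens, round(math.exp(logp), 2))  # ['<s>', 'a', 'dog', '</s>'] 0.36
```

With `beam_width=1` the same function reduces to greedy decoding and returns the less probable "the cat" path.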

Top-k Sampling

Sample from the k most probable tokens, redistributing probability mass.

1. Compute logits for all vocabulary tokens
2. Keep only the top-k tokens
3. Renormalize probabilities over these k tokens
4. Sample from the resulting distribution

k=1 is greedy; larger k increases diversity
Fixed k is suboptimal: when the distribution is peaked, k=50 includes unlikely tokens; when flat, k=50 may exclude reasonable options
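The four steps above can be sketched in plain Python. Function names are illustrative, and token ids are simply list indices. Applying softmax to the k surviving logits is equivalent to renormalizing the full distribution over those k tokens.

```python
import math
import random
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(logits: List[float], k: int) -> Dict[int, float]:
    """Return {token_id: renormalized probability} for the k highest logits."""
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top_ids])
    return dict(zip(top_ids, probs))


def sample_top_k(logits: List[float], k: int, rng=random) -> int:
    dist = top_k_filter(logits, k)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]
print(top_k_filter(logits, k=2))  # only token ids 0 and 1 survive
print(sample_top_k(logits, k=1))  # k=1 always returns the argmax: 0
```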

Nucleus Sampling (Top-p)

Sample from the smallest set of tokens whose cumulative probability reaches or exceeds threshold p.

1. Sort tokens by descending probability
2. Accumulate probabilities until sum >= p
3. Sample from this dynamic set

p=0.9-0.95 is common
Adapts vocabulary size to the model's confidence
Generally preferred over top-k for open-ended generation
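A minimal sketch of the nucleus filter, assuming the input is already a normalized probability list indexed by token id (function names are illustrative). The two example distributions show how the kept set shrinks when the model is confident and grows when it is not.

```python
import random
from typing import Dict, List


def nucleus_filter(probs: List[float], p: float) -> Dict[int, float]:
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return {i: probs[i] / cumulative for i in kept}


def sample_top_p(probs: List[float], p: float, rng=random) -> int:
    dist = nucleus_filter(probs, p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


peaked = [0.92, 0.05, 0.02, 0.01]
flat = [0.30, 0.25, 0.25, 0.20]
print(len(nucleus_filter(peaked, 0.9)))  # 1 token kept
print(len(nucleus_filter(flat, 0.9)))    # 4 tokens kept
```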

Temperature Scaling

Adjust the sharpness of the probability distribution before sampling.

P(y_i) = exp(logit_i / T) / sum_j exp(logit_j / T)

Temperature	Effect
T < 1	Sharper distribution, more deterministic
T = 1	Original distribution
T > 1	Flatter distribution, more random
T -> 0	Approaches greedy decoding

Temperature is typically combined with top-k or top-p sampling.
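The formula above translates directly into code. This sketch uses illustrative logits and shows the top token's probability rising as T falls.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], T: float) -> List[float]:
    if T <= 0:
        raise ValueError("T must be positive; use argmax for the T -> 0 limit")
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```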

Repetition Control

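Repetition penalty: Scale down the logits of previously generated tokens by a penalty factor greater than 1 (divide positive logits, multiply negative ones)
No-repeat n-gram: Block any token that would complete an n-gram already present in the output
Frequency/presence penalty: Subtract from a token's logit in proportion to how often it has appeared (frequency) or by a fixed amount once it has appeared at all (presence)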
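A sketch of the three controls operating on a list of logits indexed by token id. Function names and default values are illustrative. One assumption goes beyond a plain division: dividing a negative logit by a factor above 1 would make the token more likely, so the sketch multiplies negative logits instead. The frequency/presence form shown is one common additive formulation.

```python
from collections import Counter
from typing import List, Set


def apply_repetition_penalty(logits: List[float], generated: List[int],
                             penalty: float = 1.2) -> List[float]:
    out = list(logits)
    for tok in set(generated):
        # Push the logit toward "less likely" regardless of its sign.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def banned_by_no_repeat_ngram(generated: List[int], n: int) -> Set[int]:
    """Token ids that would complete an n-gram already present in `generated`."""
    if n < 1 or len(generated) < n:
        return set()
    prefix = tuple(generated[len(generated) - (n - 1):])  # last n-1 tokens
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned


def apply_frequency_presence_penalty(logits: List[float], generated: List[int],
                                     frequency_penalty: float = 0.5,
                                     presence_penalty: float = 0.5) -> List[float]:
    out = list(logits)
    for tok, count in Counter(generated).items():
        out[tok] -= frequency_penalty * count + presence_penalty
    return out


print(apply_repetition_penalty([2.0, -2.0, 1.0], [0, 1], penalty=2.0))  # [1.0, -4.0, 1.0]
print(banned_by_no_repeat_ngram([5, 6, 7, 5, 6], n=3))  # {7}
```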

Choosing a Strategy

Task	Recommended	Rationale
Machine translation	Beam search (b=4-6)	Need most probable, faithful output
Summarization	Beam search with length penalty	Faithful, concise output
Creative writing	Nucleus sampling (p=0.9, T=0.8-1.0)	Diversity with coherence
Code generation	Low temperature (T=0.2-0.4)	Correctness over diversity
Dialogue	Nucleus sampling (p=0.9)	Natural, varied responses

Summarization

Condensing a document into a shorter version that preserves key information.

Extractive Summarization

Selects sentences from the source document.

Approaches:

TextRank: Graph-based (sentences as nodes, similarity as edges); PageRank-style ranking
BERT extractive: Encode each sentence, classify as include/exclude
BertSumExt: Insert a [CLS] token before each sentence, then apply inter-sentence transformer layers to the [CLS] representations

Advantages: Grammatical output (original sentences), faithful to source.
Limitations: Cannot paraphrase or combine information; may lack coherence.
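A small sketch of TextRank-style extractive summarization. It assumes Jaccard word overlap as a stand-in for the sentence similarity function and a fixed number of power iterations; both are simplifications chosen for brevity, not the reference algorithm's exact choices.

```python
import re
from typing import List


def textrank(sentences: List[str], damping: float = 0.85,
             iterations: int = 50) -> List[float]:
    n = len(sentences)
    if n == 0:
        return []
    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    # Edge weights: Jaccard overlap between sentence word sets (assumption).
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and words[i] and words[j]:
                w[i][j] = len(words[i] & words[j]) / len(words[i] | words[j])
    out_sum = [sum(row) for row in w]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # PageRank-style update; sentences with no edges pass on no score.
        scores = [
            (1 - damping) / n
            + damping * sum(w[j][i] / out_sum[j] * scores[j]
                            for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def extractive_summary(sentences: List[str], k: int = 2) -> List[str]:
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # keep source order


doc = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "Stock prices fell sharply today.",
]
print(extractive_summary(doc, k=2))
```

The third sentence shares no words with the others, so it receives only the baseline score and is left out of the summary.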

Abstractive Summarization

Generates novel text, potentially using words not in the source.

Architectures:

Encoder-decoder with attention and copy mechanism (Pointer-Generator)
Pretrained seq2seq models: BART, T5, Pegasus
Pegasus pretraining: mask important sentences (gap sentences) in the input and train the model to generate them, an objective that closely resembles summarization

Challenges:

Faithfulness/hallucination: Generated summaries may contain facts not in the source
Factual consistency: Active research area; verification models (FactCC, SummaC, AlignScore)
Long documents: Self-attention cost grows quadratically with input length, and many pretrained models accept only 512-1024 tokens; solutions include LED (Longformer Encoder-Decoder), chunking, hierarchical models

Evaluation

Metric	Type	Measures
ROUGE-1/2/L	N-gram overlap	Recall-oriented overlap with the reference: unigrams, bigrams, longest common subsequence
BERTScore	Embedding similarity	Semantic overlap
FactCC, SummaC	Trained classifier	Factual consistency
Human evaluation	Manual	Fluency, informativeness, faithfulness
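To make the ROUGE row concrete, here is a simplified recall computation. It assumes lowercased whitespace tokenization and omits the stemming and F-measure options that standard ROUGE tooling provides.

```python
from collections import Counter
from typing import List


def _ngrams(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    cand = _ngrams(candidate.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return overlap / total


def lcs_length(a: List[str], b: List[str]) -> int:
    # Classic dynamic program for longest common subsequence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l_recall(candidate: str, reference: str) -> float:
    ref = reference.lower().split()
    if not ref:
        return 0.0
    return lcs_length(candidate.lower().split(), ref) / len(ref)


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, 1))  # 5 of 6 reference unigrams matched
print(rouge_n_recall(candidate, reference, 2))  # 3 of 5 reference bigrams matched
print(rouge_l_recall(candidate, reference))     # LCS of length 5 over 6 reference tokens
```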

Question Answering (QA)

Extractive QA

Select a span from a context passage as the answer.

Context: "Paris is the capital of France."
Question: "What is the capital of France?"
Answer: "Paris" (token span [0, 0] in the context)

BERT-style architecture for SQuAD:

Encode [CLS] question [SEP] context [SEP] with BERT
Predict start and end token positions with linear layers
Answer span = text between start and end positions
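The span-selection step can be sketched as a search over start/end pairs. The logits below are invented for illustration, and the constraints (start <= end, a maximum answer length) are common practice rather than something specified above.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return the (start, end) token indices with the highest combined score."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):  # enforce start <= end
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


context = ["Paris", "is", "the", "capital", "of", "France", "."]
# Hypothetical model outputs, one logit per context token.
start_logits = [5.0, 0.1, 0.2, 0.0, 0.1, 1.0, 0.0]
end_logits = [4.5, 0.0, 0.1, 0.3, 0.0, 2.0, 0.1]

start, end = best_span(start_logits, end_logits)
print((start, end), " ".join(context[start:end + 1]))  # (0, 0) Paris
```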

Datasets: SQuAD 1.1/2.0, Natural Questions, TriviaQA

Generative QA

Generate the answer as free text rather than extracting a span.

More flexible: can synthesize information from multiple passages
Models: T5, GPT, UnifiedQA
Handles questions requiring reasoning, multi-hop inference, or aggregation

Retrieval-Augmented Generation (RAG)

Combines retrieval with generation for knowledge-intensive QA.

Question -> Retriever -> Top-k passages -> Generator (conditioned on passages) -> Answer

Components:

Retriever: Dense (DPR) or sparse (BM25) search over a document corpus
Generator: Seq2seq model (BART, T5) or LLM that conditions on retrieved passages
Advantages: Updatable knowledge (update corpus, not model), attributable answers, reduced hallucination
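A toy sketch of the retrieve-then-generate flow. The retriever here scores passages by simple term overlap as a stand-in for BM25 or DPR, and `generate` is a hypothetical callable representing the generator model; the prompt wording is also an assumption.

```python
import re
from collections import Counter
from typing import Callable, List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def retrieve(question: str, corpus: List[str], k: int = 2) -> List[str]:
    """Rank passages by term overlap with the question (stand-in for BM25/DPR)."""
    q = Counter(tokenize(question))
    scored = []
    for doc in corpus:
        d = Counter(tokenize(doc))
        scored.append((sum(min(q[w], d[w]) for w in q), doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(question: str, passages: List[str]) -> str:
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the passages below.\n{context}\nQuestion: {question}\nAnswer:"


def rag_answer(question: str, corpus: List[str],
               generate: Callable[[str], str], k: int = 2) -> Tuple[str, List[str]]:
    passages = retrieve(question, corpus, k)
    # Returning the passages alongside the answer supports attribution.
    return generate(build_prompt(question, passages)), passages


corpus = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Bananas are yellow.",
]
print(retrieve("What is the capital of France?", corpus, k=1))
```

Updating the system's knowledge in this design means editing `corpus`; the generator itself is untouched.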

Variants:

RAG-Token: retrieve once, marginalize over the retrieved passages at each generated token
RAG-Sequence: retrieve once, generate independently per passage, marginalize
Iterative RAG: retrieve multiple times during generation
Self-RAG: model decides when to retrieve
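The difference between the first two variants comes down to where the sum over passages sits. The sketch below computes both, assuming the retriever probabilities p(z | x) and the per-token generator probabilities p(y_i | x, z, y_<i) are already available; the numbers are invented for illustration.

```python
import math
from typing import List


def rag_sequence_prob(doc_probs: List[float], token_probs: List[List[float]]) -> float:
    """sum_z p(z|x) * prod_i p(y_i | x, z, y_<i): one passage per whole sequence."""
    return sum(pz * math.prod(tp) for pz, tp in zip(doc_probs, token_probs))


def rag_token_prob(doc_probs: List[float], token_probs: List[List[float]]) -> float:
    """prod_i sum_z p(z|x) * p(y_i | x, z, y_<i): passages mixed at every token."""
    n_tokens = len(token_probs[0])
    total = 1.0
    for i in range(n_tokens):
        total *= sum(pz * tp[i] for pz, tp in zip(doc_probs, token_probs))
    return total


# Two passages; each one supports only one of the two target tokens.
doc_probs = [0.5, 0.5]
token_probs = [
    [1.0, 0.0],  # passage 1: certain about token 1, rules out token 2
    [0.0, 1.0],  # passage 2: the reverse
]
print(rag_sequence_prob(doc_probs, token_probs))  # 0.0
print(rag_token_prob(doc_probs, token_probs))     # 0.25
```

In this contrived case no single passage supports the full answer, so the sequence-level score is zero, while the token-level score can draw on a different passage for each token.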

Dialogue Systems

Task-Oriented Dialogue

Helps users accomplish specific goals (booking, information lookup).

Pipeline architecture:

NLU: Intent classification + slot filling ("Book a flight to Paris" -> intent: book_flight, slots: {dest: "Paris"})
Dialogue state tracking: Maintain belief state over slots across turns
Policy: Decide next system action (ask for departure city, confirm booking)
NLG: Generate natural language response from action
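The four pipeline stages can be sketched with rule-based stand-ins. The regular expressions, slot names, and response templates are hypothetical; real systems would use trained models for NLU and often for policy and NLG as well.

```python
import re
from typing import Dict, Optional, Tuple

REQUIRED_SLOTS = ("origin", "dest")  # hypothetical slot schema for flight booking


def nlu(utterance: str) -> Dict:
    """Intent classification + slot filling via simple patterns."""
    intent = "book_flight" if re.search(r"\b(book|flight)\b", utterance, re.I) else "unknown"
    slots = {}
    for name, pattern in (("dest", r"\bto ([A-Z][a-z]+)"), ("origin", r"\bfrom ([A-Z][a-z]+)")):
        match = re.search(pattern, utterance)
        if match:
            slots[name] = match.group(1)
    return {"intent": intent, "slots": slots}


def update_state(state: Dict, nlu_result: Dict) -> Dict:
    """Dialogue state tracking: carry slots and intent across turns."""
    slots = dict(state.get("slots", {}))
    slots.update(nlu_result["slots"])
    intent = nlu_result["intent"]
    if intent == "unknown":
        intent = state.get("intent", "unknown")
    return {"intent": intent, "slots": slots}


def policy(state: Dict) -> Tuple[str, Optional[str]]:
    for slot in REQUIRED_SLOTS:
        if slot not in state["slots"]:
            return ("request", slot)
    return ("confirm", None)


def nlg(action: Tuple[str, Optional[str]], state: Dict) -> str:
    act, slot = action
    if act == "request":
        prompts = {"origin": "Which city are you departing from?",
                   "dest": "Where would you like to fly to?"}
        return prompts[slot]
    s = state["slots"]
    return f"Booking a flight from {s['origin']} to {s['dest']}. Shall I confirm?"


state: Dict = {}
for user_turn in ("Book a flight to Paris", "I am leaving from Boston"):
    state = update_state(state, nlu(user_turn))
    print(nlg(policy(state), state))
```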

End-to-end: Train a single model that maps dialogue history directly to a response. Examples include SimpleTOD, SOLOIST, and LLM-based systems.

Open-Domain Dialogue

Conversational agents without a specific task.

Approaches:

Retrieval-based: Select best response from a candidate set (bi-encoder scoring)
Generative: Generate responses with a seq2seq or autoregressive model
Hybrid: Retrieve candidates, then rerank or use as context for generation
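A sketch of retrieval-based response selection with bi-encoder-style scoring. The key property is that the context and each candidate are encoded independently, so candidate vectors can be precomputed. The bag-of-words `encode` function is a deliberately crude stand-in for a neural encoder.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def encode(text: str) -> Counter:
    """Stand-in encoder: bag-of-words counts instead of a learned embedding."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(count * v[word] for word, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def select_response(context: str, candidates: List[str]) -> Tuple[str, float]:
    ctx_vec = encode(context)
    cand_vecs = [encode(c) for c in candidates]  # could be cached offline
    scores = [cosine(ctx_vec, v) for v in cand_vecs]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


candidates = ["I love pizza, do you?", "The weather is nice.", "My car is red."]
print(select_response("Do you like pizza?", candidates)[0])  # I love pizza, do you?
```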

Key models:

Model	Approach	Key Innovation
DialoGPT	Fine-tuned GPT-2	Large-scale training on Reddit conversations
BlenderBot	Encoder-decoder	Multi-skill (knowledge, empathy, personality)
LaMDA	Decoder-only	Safety and groundedness filters
ChatGPT	GPT + RLHF	Alignment for helpful, harmless dialogue

Challenges in Dialogue

Consistency: Maintaining a coherent persona across turns
Grounding: Using external knowledge accurately
Safety: Avoiding harmful, biased, or misleading content
Evaluation: Automatic metrics (perplexity, BLEU) correlate poorly with human judgment; human evaluation remains essential

Controllable Generation

Steering generated text toward desired attributes (style, topic, sentiment, length).

Method	Mechanism
Conditional training	Prepend control tokens (e.g., a sentiment or style tag) to the input during training
CTRL	Large LM trained with control codes
PPLM	Plug-and-play: gradient-based steering at inference
Prefix tuning	Learn task-specific continuous prefixes
Instruction tuning	Train on diverse instructions to follow natural language control
Constitutional AI	Self-critique and revision guided by principles

Key Takeaways

Decoding strategy profoundly affects generation quality; nucleus sampling suits open-ended tasks, beam search suits constrained tasks
Extractive summarization is faithful but rigid; abstractive summarization is flexible but prone to hallucination
RAG combines retrieval with generation for knowledge-intensive tasks, reducing hallucination and enabling updatable knowledge
Dialogue systems range from task-oriented pipelines to open-domain generative agents
Controllable generation remains an active area; instruction tuning and RLHF are currently the dominant approaches
Evaluation of generated text is fundamentally difficult; human evaluation remains the gold standard