Text Generation
Overview
Text generation produces coherent natural language output for tasks including summarization, question answering, dialogue, and open-ended writing. The quality of generated text depends critically on decoding strategy, training approach, and task-specific design. This document covers decoding methods, major generation tasks, and their architectures.
Decoding Strategies
Given a language model that produces a probability distribution over the vocabulary at each step, the decoding strategy determines how tokens are selected.
Greedy Decoding
Select the highest-probability token at each step.
y_t = argmax_y P(y | y_1, ..., y_{t-1}, x)
- Fast (single forward pass per token)
- Often produces repetitive, generic text
- Misses high-quality sequences that start with lower-probability tokens
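A minimal sketch of greedy decoding over a toy model. The `toy_model` function and its distributions are invented for illustration; a real model produces a distribution over a large vocabulary at each step.

```python
# Greedy decoding over a toy next-token model (distributions are made up).
VOCAB = ["<eos>", "the", "cat", "sat"]

def toy_model(prefix):
    # Hypothetical next-token distributions keyed on the last token.
    table = {
        None:  [0.05, 0.80, 0.10, 0.05],
        "the": [0.05, 0.05, 0.80, 0.10],
        "cat": [0.05, 0.05, 0.10, 0.80],
        "sat": [0.80, 0.10, 0.05, 0.05],
    }
    last = prefix[-1] if prefix else None
    return table.get(last, table[None])

def greedy_decode(model, max_len=10):
    out = []
    for _ in range(max_len):
        probs = model(out)
        # Always take the single highest-probability token.
        token = VOCAB[max(range(len(VOCAB)), key=lambda i: probs[i])]
        if token == "<eos>":
            break
        out.append(token)
    return out

print(greedy_decode(toy_model))  # ['the', 'cat', 'sat']
```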
Beam Search
Track the top-b highest-scoring partial sequences at each step.
- b (beam width) typically 4-10
- Finds approximately the most probable sequence
- Produces higher-quality output than greedy for MT and summarization
- Tends to generate text that is too short, generic, or repetitive for open-ended generation
- Length normalization and repetition penalties help
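A compact beam-search sketch with length normalization, using the same kind of toy model; the distributions and hyperparameter values are illustrative, not canonical.

```python
import math

VOCAB = ["<eos>", "the", "cat", "sat"]

def toy_model(prefix):
    # Hypothetical next-token distributions keyed on the last token.
    table = {
        None:  [0.05, 0.80, 0.10, 0.05],
        "the": [0.05, 0.05, 0.80, 0.10],
        "cat": [0.05, 0.05, 0.10, 0.80],
        "sat": [0.80, 0.10, 0.05, 0.05],
    }
    last = prefix[-1] if prefix else None
    return table.get(last, table[None])

def beam_search(model, vocab, beam_width=2, max_len=6, length_alpha=0.7):
    def norm(cand):
        seq, score, _ = cand
        # Length normalization counters beam search's bias toward short output.
        return score / (max(len(seq), 1) ** length_alpha)

    beams = [([], 0.0, False)]  # (tokens, summed log-prob, finished)
    for _ in range(max_len):
        candidates = []
        for seq, score, done in beams:
            if done:  # finished hypotheses carry over unchanged
                candidates.append((seq, score, True))
                continue
            for i, p in enumerate(model(seq)):
                if p > 0:
                    tok = vocab[i]
                    candidates.append(
                        (seq + [tok], score + math.log(p), tok == "<eos>"))
        beams = sorted(candidates, key=norm, reverse=True)[:beam_width]
        if all(done for _, _, done in beams):
            break
    best_seq = max(beams, key=norm)[0]
    return [t for t in best_seq if t != "<eos>"]
```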
Top-k Sampling
Sample from the k most probable tokens, redistributing probability mass.
1. Compute logits for all vocabulary tokens
2. Keep only the top-k tokens
3. Renormalize probabilities over these k tokens
4. Sample from the resulting distribution
- k=1 is greedy; larger k increases diversity
- Fixed k is suboptimal: when the distribution is peaked, k=50 includes unlikely tokens; when flat, k=50 may exclude reasonable options
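The four steps above can be sketched in a few lines; the logits passed in are placeholders for real model outputs.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    # Steps 1-2: keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Step 3: softmax restricted to those k tokens (renormalization).
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    z = sum(exps)
    # Step 4: sample from the truncated distribution.
    r, acc = rng.random(), 0.0
    for idx, e in zip(top, exps):
        acc += e / z
        if r <= acc:
            return idx
    return top[-1]  # guard against floating-point rounding
```

With k=1 this reduces to greedy decoding, since the single kept token always receives all the probability mass.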
Nucleus Sampling (Top-p)
Sample from the smallest set of tokens whose cumulative probability exceeds threshold p.
1. Sort tokens by descending probability
2. Accumulate probabilities until sum >= p
3. Sample from this dynamic set
- p=0.9-0.95 is common
- Adapts vocabulary size to the model's confidence
- Generally preferred over top-k for open-ended generation
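A sketch of nucleus sampling following the steps above, assuming raw logits as input.

```python
import math
import random

def nucleus_sample(logits, p=0.9, rng=random):
    # Softmax over the full vocabulary.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Steps 1-2: sort by descending probability and take the smallest
    # prefix whose cumulative mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # Step 3: renormalize over the nucleus and sample.
    r, acc = rng.random(), 0.0
    for i in nucleus:
        acc += probs[i] / mass
        if r <= acc:
            return i
    return nucleus[-1]
```

When the model is confident (one token dominating the mass), the nucleus shrinks to a single token; when the distribution is flat, it widens, which is exactly the adaptivity fixed k lacks.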
Temperature Scaling
Adjust the sharpness of the probability distribution before sampling.
P(y_i) = exp(logit_i / T) / sum_j exp(logit_j / T)
| Temperature | Effect |
|---|---|
| T < 1 | Sharper distribution, more deterministic |
| T = 1 | Original distribution |
| T > 1 | Flatter distribution, more random |
| T -> 0 | Approaches greedy decoding |
Temperature is typically combined with top-k or top-p sampling.
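The formula above as a small function; the example logits are arbitrary.

```python
import math

def apply_temperature(logits, T):
    # Divide logits by T before the softmax: T < 1 sharpens the
    # distribution, T > 1 flattens it, T -> 0 approaches argmax.
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```

For logits [2.0, 1.0, 0.0], T=0.5 puts noticeably more mass on the top token than T=1.0, and T=2.0 noticeably less.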
Repetition Control
- Repetition penalty: Divide logits of previously generated tokens by a penalty factor
- No-repeat n-gram: Block n-grams that have already appeared
- Frequency/presence penalty: Penalize tokens based on how often they have appeared
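Sketches of two of these controls: a repetition penalty in the style used by common generation libraries (divide positive logits, multiply negative ones) and a no-repeat n-gram filter. Exact details vary across implementations.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    # Positive logits of already-generated tokens are divided by `penalty`
    # and negative ones multiplied, making repeats less likely either way.
    # penalty=1.0 is a no-op.
    out = list(logits)
    for i in set(generated_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

def banned_next_tokens(generated_ids, n=3):
    # No-repeat n-gram: if the last n-1 tokens occurred earlier, ban each
    # token that followed that (n-1)-gram, so no n-gram ever repeats.
    if len(generated_ids) < n - 1:
        return set()
    prefix = tuple(generated_ids[-(n - 1):])
    banned = set()
    for j in range(len(generated_ids) - n + 1):
        if tuple(generated_ids[j:j + n - 1]) == prefix:
            banned.add(generated_ids[j + n - 1])
    return banned
```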
Choosing a Strategy
| Task | Recommended | Rationale |
|---|---|---|
| Machine translation | Beam search (b=4-6) | Need most probable, faithful output |
| Summarization | Beam search with length penalty | Faithful, concise output |
| Creative writing | Nucleus sampling (p=0.9, T=0.8-1.0) | Diversity with coherence |
| Code generation | Low temperature (T=0.2-0.4) | Correctness over diversity |
| Dialogue | Nucleus sampling (p=0.9) | Natural, varied responses |
Summarization
Condensing a document into a shorter version that preserves key information.
Extractive Summarization
Selects sentences from the source document.
Approaches:
- TextRank: Graph-based (sentences as nodes, similarity as edges); PageRank-style ranking
- BERT extractive: Encode each sentence, classify as include/exclude
- BertSumExt: Insert a [CLS] token before each sentence; stack inter-sentence transformer layers over the [CLS] representations to classify each sentence
Advantages: Grammatical output (original sentences), faithful to source. Limitations: Cannot paraphrase or combine information; may lack coherence.
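The TextRank idea above can be sketched with word-overlap similarity and power iteration. The similarity measure and damping setup here are simplifications of the original formulation.

```python
# Word-overlap similarity stands in for the original TextRank measure.
def similarity(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / (len(wa) + len(wb))

def textrank(sentences, damping=0.85, iters=50):
    n = len(sentences)
    sim = [[similarity(s, t) if i != j else 0.0
            for j, t in enumerate(sentences)]
           for i, s in enumerate(sentences)]
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            # PageRank-style update: each sentence j passes its score to
            # its neighbors in proportion to edge weight.
            incoming = 0.0
            for j in range(n):
                out_weight = sum(sim[j])
                if out_weight > 0:
                    incoming += sim[j][i] / out_weight * scores[j]
            new.append((1 - damping) / n + damping * incoming)
        scores = new
    return scores

docs = [
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
    "Stock prices fell sharply today.",
]
scores = textrank(docs)
# The two mutually similar sentences outrank the off-topic one.
```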
Abstractive Summarization
Generates novel text, potentially using words not in the source.
Architectures:
- Encoder-decoder with attention and copy mechanism (Pointer-Generator)
- Pretrained seq2seq models: BART, T5, Pegasus
- Pegasus pretraining (gap-sentence generation): mask whole salient sentences and train the model to generate them, a deliberately summarization-like objective
Challenges:
- Faithfulness/hallucination: Generated summaries may contain facts not in the source
- Factual consistency: Active research area; verification models (FactCC, SummaC, AlignScore)
- Long documents: Standard transformers limited to 512-1024 tokens; solutions include LED (Longformer Encoder-Decoder), chunking, hierarchical models
Evaluation
| Metric | Type | Measures |
|---|---|---|
| ROUGE-1/2/L | N-gram overlap | Recall of unigrams, bigrams, longest common subsequence |
| BERTScore | Embedding similarity | Semantic overlap |
| FactCC, SummaC | Trained classifier | Factual consistency |
| Human evaluation | Manual | Fluency, informativeness, faithfulness |
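A minimal sketch of ROUGE-1 recall from the table above; real ROUGE implementations additionally apply stemming and report precision and F1.

```python
from collections import Counter

def rouge_1_recall(reference, candidate):
    # Fraction of reference unigrams (with multiplicity) that also
    # appear in the candidate summary.
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(c, cand[w]) for w, c in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```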
Question Answering (QA)
Extractive QA
Select a span from a context passage as the answer.
Context: "Paris is the capital of France."
Question: "What is the capital of France?"
Answer: "Paris" (span [0, 0])
SQuAD model architecture:
- Encode [CLS] question [SEP] context [SEP] with BERT
- Predict start and end token positions with linear layers
- Answer span = text between start and end positions
Datasets: SQuAD 1.1/2.0, Natural Questions, TriviaQA
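Span selection from start/end logits, as in the SQuAD-style head above, can be sketched as follows. The logits are fabricated for illustration; a real model produces them with two linear layers over the encoder outputs.

```python
def best_span(start_logits, end_logits, max_answer_len=10):
    # Pick (i, j) maximizing start_logits[i] + end_logits[j],
    # subject to i <= j and a maximum answer length.
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

tokens = ["paris", "is", "the", "capital", "of", "france"]
start_logits = [5.0, 0.1, 0.0, 0.2, 0.0, 1.0]  # fabricated values
end_logits   = [4.0, 0.0, 0.1, 0.3, 0.0, 2.0]
i, j = best_span(start_logits, end_logits)
answer = " ".join(tokens[i:j + 1])  # "paris"
```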
Generative QA
Generate the answer as free text rather than extracting a span.
- More flexible: can synthesize information from multiple passages
- Models: T5, GPT, UnifiedQA
- Handles questions requiring reasoning, multi-hop inference, or aggregation
Retrieval-Augmented Generation (RAG)
Combines retrieval with generation for knowledge-intensive QA.
Question -> Retriever -> Top-k passages -> Generator (conditioned on passages) -> Answer
Components:
- Retriever: Dense (DPR) or sparse (BM25) search over a document corpus
- Generator: Seq2seq model (BART, T5) or LLM that conditions on retrieved passages
- Advantages: Updatable knowledge (update corpus, not model), attributable answers, reduced hallucination
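A toy version of this pipeline, using term overlap as a crude stand-in for BM25/DPR and leaving the generator abstract. Everything here is illustrative, not a real retriever or model API.

```python
# Toy retrieve-then-generate pipeline.
CORPUS = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Eiffel Tower is in Paris.",
]

def retrieve(question, corpus, k=2):
    # Score each passage by terms shared with the question
    # (a crude stand-in for BM25 or dense retrieval).
    q = set(question.lower().replace("?", "").split())
    scored = sorted(
        corpus,
        key=lambda d: len(q & set(d.lower().rstrip(".").split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question, passages):
    # The generator (BART/T5/LLM) would condition on this prompt.
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

question = "What is the capital of France?"
passages = retrieve(question, CORPUS)
prompt = build_prompt(question, passages)
```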
Variants:
- RAG-Token: marginalize over retrieved passages at each generation step, so different tokens can draw on different passages
- RAG-Sequence: generate a complete answer conditioned on each retrieved passage separately, then marginalize at the sequence level
- Iterative RAG: retrieve multiple times during generation
- Self-RAG: model decides when to retrieve
Dialogue Systems
Task-Oriented Dialogue
Helps users accomplish specific goals (booking, information lookup).
Pipeline architecture:
- NLU: Intent classification + slot filling ("Book a flight" -> intent: book_flight, slots: {dest: "Paris"})
- Dialogue state tracking: Maintain belief state over slots across turns
- Policy: Decide next system action (ask for departure city, confirm booking)
- NLG: Generate natural language response from action
End-to-end: Train a single model from dialogue history to response. SimpleTOD, SOLOIST, and LLM-based systems.
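A toy rule-based version of the NLU step from the pipeline above. Real systems use trained intent classifiers and sequence taggers; the patterns and slot names here are illustrative.

```python
import re

# Hypothetical intent patterns; a real NLU module uses a trained classifier.
INTENT_PATTERNS = {
    "book_flight": re.compile(r"\bbook\b.*\bflight\b"),
    "check_weather": re.compile(r"\bweather\b"),
}

def nlu(utterance):
    text = utterance.lower()
    # Intent classification: first matching pattern wins.
    intent = next(
        (name for name, pat in INTENT_PATTERNS.items() if pat.search(text)),
        "unknown",
    )
    # Slot filling: pull a destination out of "... to <city>".
    slots = {}
    m = re.search(r"\bto (\w+)", text)
    if m:
        slots["dest"] = m.group(1).capitalize()
    return {"intent": intent, "slots": slots}

result = nlu("Book a flight to Paris")
# {'intent': 'book_flight', 'slots': {'dest': 'Paris'}}
```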
Open-Domain Dialogue
Conversational agents without a specific task.
Approaches:
- Retrieval-based: Select best response from a candidate set (bi-encoder scoring)
- Generative: Generate responses with a seq2seq or autoregressive model
- Hybrid: Retrieve candidates, then rerank or use as context for generation
Key models:

| Model | Approach | Key Innovation |
|---|---|---|
| DialoGPT | Fine-tuned GPT-2 | Reddit conversations |
| BlenderBot | Encoder-decoder | Multi-skill (knowledge, empathy, personality) |
| LaMDA | Decoder-only | Safety and groundedness filters |
| ChatGPT | GPT + RLHF | Alignment for helpful, harmless dialogue |
Challenges in Dialogue
- Consistency: Maintaining a coherent persona across turns
- Grounding: Using external knowledge accurately
- Safety: Avoiding harmful, biased, or misleading content
- Evaluation: Automatic metrics (perplexity, BLEU) correlate poorly with human judgment; human evaluation remains essential
Controllable Generation
Steering generated text toward desired attributes (style, topic, sentiment, length).
| Method | Mechanism |
|---|---|
| Conditional training | Prepend control tokens during training |
| Instruction tuning | Fine-tune on examples paired with natural-language descriptions of the desired attribute |
| RLHF | Optimize outputs against a learned human-preference reward model |
Key Takeaways
- Decoding strategy profoundly affects generation quality; nucleus sampling suits open-ended tasks, beam search suits constrained tasks
- Extractive summarization is faithful but rigid; abstractive summarization is flexible but prone to hallucination
- RAG combines retrieval with generation for knowledge-intensive tasks, reducing hallucination and enabling updatable knowledge
- Dialogue systems range from task-oriented pipelines to open-domain generative agents
- Controllable generation remains an active area; instruction tuning and RLHF are the current dominant approaches
- Evaluation of generated text is fundamentally difficult; human evaluation remains the gold standard