
Text Generation

Overview

Text generation produces coherent natural language output for tasks including summarization, question answering, dialogue, and open-ended writing. The quality of generated text depends critically on decoding strategy, training approach, and task-specific design. This document covers decoding methods, major generation tasks, and their architectures.


Decoding Strategies

Given a language model that produces a probability distribution over the vocabulary at each step, the decoding strategy determines how tokens are selected.

Greedy Decoding

Select the highest-probability token at each step.

y_t = argmax P(y | y_1, ..., y_{t-1}, x)
  • Fast (single forward pass per token)
  • Often produces repetitive, generic text
  • Misses high-quality sequences that start with lower-probability tokens
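
The last point can be seen on a toy example. The sketch below is a minimal illustration, not a production decoder: `toy_model` is a hypothetical lookup table standing in for a real language model, and the token names are invented for the example.

```python
from typing import Dict, List, Tuple

def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a language model: prefix -> next-token distribution."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 1.0},
    }
    # Any prefix not in the table ends the sequence.
    return table.get(tuple(prefix), {"<eos>": 1.0})

def greedy_decode(model, max_len: int = 10, eos: str = "<eos>") -> Tuple[List[str], float]:
    tokens: List[str] = []
    prob = 1.0
    for _ in range(max_len):
        dist = model(tokens)
        token = max(dist, key=dist.get)  # argmax over the vocabulary
        prob *= dist[token]
        if token == eos:
            break
        tokens.append(token)
    return tokens, prob

tokens, prob = greedy_decode(toy_model)
```

Here greedy decoding commits to "a" (0.6) and ends with a sequence of probability 0.3, while the sequence starting with the less likely "b" has probability 0.4.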

Beam Search

Track the top-b highest-scoring partial sequences at each step.

  • b (beam width) typically 4-10
  • Finds approximately the most probable sequence
  • Produces higher-quality output than greedy for MT and summarization
  • Tends to generate text that is too short, generic, or repetitive for open-ended generation
  • Length normalization and repetition penalties help
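
A minimal sketch of beam search with optional length normalization follows. It assumes a hypothetical `toy_model` lookup table in place of a real model, and the `alpha` exponent is one common form of length normalization, chosen here for illustration.

```python
import math
from typing import Dict, List, Tuple

def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a language model: prefix -> next-token distribution."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 1.0},
    }
    return table.get(tuple(prefix), {"<eos>": 1.0})

def beam_search(model, beam_width: int = 2, max_len: int = 10,
                eos: str = "<eos>", alpha: float = 0.0) -> Tuple[List[str], float]:
    beams = [([], 0.0)]  # (tokens, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, p in model(tokens).items():
                if p <= 0.0:
                    continue
                cand = (tokens + [tok], score + math.log(p))
                # Hypotheses that emit <eos> leave the beam and are kept aside.
                (finished if tok == eos else candidates).append(cand)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # keep the top-b partial sequences

    def normalized(c):
        # alpha = 0 disables length normalization; larger alpha favors longer outputs.
        return c[1] / (max(len(c[0]), 1) ** alpha)

    best = max(finished or beams, key=normalized)
    return [t for t in best[0] if t != eos], best[1]

tokens, log_prob = beam_search(toy_model, beam_width=2)
```

With a beam width of 2 the search keeps the "b" hypothesis alive and returns the sequence of probability 0.4; with a beam width of 1 it reduces to greedy decoding.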

Top-k Sampling

Sample from the k most probable tokens, redistributing probability mass.

1. Compute logits for all vocabulary tokens
2. Keep only the top-k tokens
3. Renormalize probabilities over these k tokens
4. Sample from the resulting distribution
  • k=1 is greedy; larger k increases diversity
  • Fixed k is suboptimal: when the distribution is peaked, k=50 includes unlikely tokens; when flat, k=50 may exclude reasonable options
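
The four steps above can be sketched in plain Python. The token names and logit values are made up for illustration, and logits are passed as a dictionary only to keep the example readable.

```python
import math
import random
from typing import Dict

def top_k_filter(logits: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k highest-logit tokens and renormalize with a softmax over them."""
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    m = max(v for _, v in top)  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in top}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def sample(probs: Dict[str, float], rng: random.Random) -> str:
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

logits = {"cat": 2.0, "dog": 1.0, "car": 0.0, "zebra": -3.0}
probs = top_k_filter(logits, k=2)
next_token = sample(probs, random.Random(0))
```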

Nucleus Sampling (Top-p)

Sample from the smallest set of tokens whose cumulative probability reaches a threshold p.

1. Sort tokens by descending probability
2. Accumulate probabilities until sum >= p
3. Sample from this dynamic set
  • p=0.9-0.95 is common
  • Adapts vocabulary size to the model's confidence
  • Generally preferred over top-k for open-ended generation
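
A minimal sketch of the nucleus filter, assuming the input is already a normalized probability distribution. The two example distributions are invented to show how the kept set shrinks when the model is confident and grows when it is not.

```python
from typing import Dict

def top_p_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Return the smallest high-probability set with cumulative mass >= p, renormalized."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, total = [], 0.0
    for tok, prob in ranked:
        nucleus.append((tok, prob))
        total += prob
        if total >= p:
            break
    return {tok: prob / total for tok, prob in nucleus}

peaked = {"a": 0.92, "b": 0.05, "c": 0.03}
flat = {"a": 0.3, "b": 0.3, "c": 0.2, "d": 0.2}
small_set = top_p_filter(peaked, p=0.9)  # a single token suffices
large_set = top_p_filter(flat, p=0.9)    # all four tokens are needed
```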

Temperature Scaling

Adjust the sharpness of the probability distribution before sampling.

P(y_i) = exp(logit_i / T) / sum_j exp(logit_j / T)

Temperature | Effect
T < 1       | Sharper distribution, more deterministic
T = 1       | Original distribution
T > 1       | Flatter distribution, more random
T -> 0      | Approaches greedy decoding

Temperature is typically combined with top-k or top-p sampling.
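
The formula can be checked numerically with a short sketch. The logit values below are arbitrary examples.

```python
import math
from typing import List

def softmax_with_temperature(logits: List[float], T: float) -> List[float]:
    """Compute P(y_i) = exp(logit_i / T) / sum_j exp(logit_j / T)."""
    if T <= 0:
        raise ValueError("T must be positive; use argmax for the T -> 0 limit")
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, T=0.5)
original = softmax_with_temperature(logits, T=1.0)
flat = softmax_with_temperature(logits, T=2.0)
```

The probability of the top token falls as T rises, and at very small T nearly all mass sits on the argmax.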

Repetition Control

  • Repetition penalty: Lower the logits of previously generated tokens using a penalty factor greater than 1 (divide positive logits by it, multiply negative logits by it)
  • No-repeat n-gram: Block n-grams that have already appeared
  • Frequency/presence penalty: Penalize tokens based on how often they have appeared
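
The first two controls can be sketched as follows. This is an illustrative version under simple assumptions (logits as a dictionary, tokens as strings); the function names are hypothetical and not taken from any particular library.

```python
from typing import Dict, List, Set

def apply_repetition_penalty(logits: Dict[str, float], generated: List[str],
                             penalty: float = 1.2) -> Dict[str, float]:
    """Make already-generated tokens less likely by shrinking their logits."""
    out = dict(logits)
    for tok in set(generated):
        if tok in out:
            # Dividing a negative logit would raise it, so multiply instead.
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def banned_next_tokens(generated: List[str], n: int) -> Set[str]:
    """Tokens that would complete an n-gram already present in `generated`."""
    if n < 1 or len(generated) < n - 1:
        return set()
    prefix = tuple(generated[len(generated) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned

history = ["the", "cat", "sat", "on", "the"]
blocked = banned_next_tokens(history, n=2)  # "the cat" has already appeared
```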

Choosing a Strategy

Task                | Recommended                          | Rationale
Machine translation | Beam search (b=4-6)                  | Need most probable, faithful output
Summarization       | Beam search with length penalty      | Faithful, concise output
Creative writing    | Nucleus sampling (p=0.9, T=0.8-1.0)  | Diversity with coherence
Code generation     | Low temperature (T=0.2-0.4)          | Correctness over diversity
Dialogue            | Nucleus sampling (p=0.9)             | Natural, varied responses

Summarization

Condensing a document into a shorter version that preserves key information.

Extractive Summarization

Selects sentences from the source document.

Approaches:

  • TextRank: Graph-based (sentences as nodes, similarity as edges); PageRank-style ranking
  • BERT extractive: Encode each sentence, classify as include/exclude
  • BertSumExt: Insert a [CLS] token before each sentence, then apply inter-sentence transformer layers over the resulting sentence representations

Advantages: Grammatical output (original sentences), faithful to source.

Limitations: Cannot paraphrase or combine information; may lack coherence.
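
A graph-based extractive summarizer in the spirit of TextRank can be sketched in a few lines. This is a simplified illustration: it uses Jaccard word overlap as the edge weight, which is an assumption made for brevity rather than the similarity function of the original method, and the example sentences are invented.

```python
import re
from typing import List

def textrank(sentences: List[str], damping: float = 0.85, iterations: int = 50) -> List[float]:
    """Score sentences with PageRank-style iteration over a similarity graph."""
    bags = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and (bags[i] | bags[j]):
                sim[i][j] = len(bags[i] & bags[j]) / len(bags[i] | bags[j])
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(sim[j][i] / out_weight[j] * scores[j]
                       for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores

def summarize(sentences: List[str], k: int = 1) -> List[str]:
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # keep original document order

doc = [
    "The model generates text token by token.",
    "Decoding picks the next token from the model.",
    "Bananas are yellow.",
]
summary = summarize(doc, k=2)
```

The unrelated third sentence has no edges in the graph, so it receives only the baseline score and is left out of the summary.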

Abstractive Summarization

Generates novel text, potentially using words not in the source.

Architectures:

  • Encoder-decoder with attention and copy mechanism (Pointer-Generator)
  • Pretrained seq2seq models: BART, T5, Pegasus
  • Pegasus pretraining: mask important sentences (gap sentences) and train the model to generate them, an objective that closely resembles summarization

Challenges:

  • Faithfulness/hallucination: Generated summaries may contain facts not in the source
  • Factual consistency: Active research area; verification models (FactCC, SummaC, AlignScore)
  • Long documents: Many pretrained transformers accept only 512-1024 input tokens (for example, 512 for BERT and 1024 for BART); solutions include LED (Longformer Encoder-Decoder), chunking, and hierarchical models

Evaluation

Metric           | Type                 | Measures
ROUGE-1/2/L      | N-gram overlap       | Recall of unigrams, bigrams, longest common subsequence
BERTScore        | Embedding similarity | Semantic overlap
FactCC, SummaC   | Trained classifier   | Factual consistency
Human evaluation | Manual               | Fluency, informativeness, faithfulness
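
To make the ROUGE row concrete, here is a minimal sketch of the recall variants using whitespace tokenization. Standard ROUGE implementations also report precision and F1 and may apply stemming, which this sketch leaves out; the example sentences are invented.

```python
from collections import Counter
from typing import List

def rouge_n_recall(reference: str, candidate: str, n: int = 1) -> float:
    """Fraction of reference n-grams that also appear in the candidate (clipped counts)."""
    def ngrams(tokens: List[str]) -> Counter:
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, cand = ngrams(reference.split()), ngrams(candidate.split())
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_recall(reference: str, candidate: str) -> float:
    ref = reference.split()
    return lcs_length(ref, candidate.split()) / len(ref) if ref else 0.0

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
r1 = rouge_n_recall(reference, candidate, n=1)  # 5 of 6 unigrams matched
r2 = rouge_n_recall(reference, candidate, n=2)  # 3 of 5 bigrams matched
rl = rouge_l_recall(reference, candidate)       # LCS of length 5 over 6 tokens
```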

Question Answering (QA)

Extractive QA

Select a span from a context passage as the answer.

Context: "Paris is the capital of France."
Question: "What is the capital of France?"
Answer: "Paris" (span [0, 0])

Typical BERT-based architecture for SQuAD-style tasks:

  1. Encode [CLS] question [SEP] context [SEP] with BERT
  2. Predict start and end token positions with linear layers
  3. Answer span = text between start and end positions
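
Steps 2 and 3 are usually combined by searching for the span with the highest start score plus end score, subject to the end not preceding the start. The sketch below assumes the logits have already been produced by the model; the numbers are invented for the running example, and `max_answer_len` is an illustrative parameter.

```python
from typing import List, Tuple

def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 30) -> Tuple[Tuple[int, int], float]:
    """Return the (start, end) token indices maximizing start + end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score

tokens = ["Paris", "is", "the", "capital", "of", "France", "."]
start_logits = [5.0, 0.1, 0.2, 0.3, 0.0, 1.0, -1.0]  # made-up model outputs
end_logits = [4.5, 0.0, 0.1, 0.5, 0.0, 2.0, -1.0]
(start, end), score = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
```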

Datasets: SQuAD 1.1/2.0, Natural Questions, TriviaQA

Generative QA

Generate the answer as free text rather than extracting a span.

  • More flexible: can synthesize information from multiple passages
  • Models: T5, GPT, UnifiedQA
  • Handles questions requiring reasoning, multi-hop inference, or aggregation

Retrieval-Augmented Generation (RAG)

Combines retrieval with generation for knowledge-intensive QA.

Question -> Retriever -> Top-k passages -> Generator (conditioned on passages) -> Answer

Components:

  • Retriever: Dense (DPR) or sparse (BM25) search over a document corpus
  • Generator: Seq2seq model (BART, T5) or LLM that conditions on retrieved passages
  • Advantages: Updatable knowledge (update corpus, not model), attributable answers, reduced hallucination
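
The pipeline can be sketched end to end with toy components. Everything here is a stand-in: the retriever ranks passages by simple word overlap (not BM25 or DPR), the prompt wording is invented, and `generator` is a hypothetical callable representing whatever seq2seq model or LLM is used.

```python
import re
from typing import Callable, List

def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())

def retrieve(question: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy sparse retriever: rank passages by the number of shared words with the question."""
    q = set(tokenize(question))
    ranked = sorted(corpus, key=lambda doc: len(q & set(tokenize(doc))), reverse=True)
    return ranked[:k]

def build_prompt(question: str, passages: List[str]) -> str:
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (f"Answer using only the passages below.\n{context}\n"
            f"Question: {question}\nAnswer:")

def rag_answer(question: str, corpus: List[str],
               generator: Callable[[str], str], k: int = 2) -> str:
    passages = retrieve(question, corpus, k)
    return generator(build_prompt(question, passages))

corpus = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Seine flows through Paris.",
]
top_passages = retrieve("What is the capital of France?", corpus, k=1)
```

Because knowledge lives in `corpus`, updating an answer means editing a passage rather than retraining the generator.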

Variants:

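  • RAG-Token: retrieve once, then marginalize over the retrieved passages at every generation step, so different tokens can draw on different passages
  • RAG-Sequence: retrieve once, score the full output sequence conditioned on each passage separately, then marginalize over passages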
  • Iterative RAG: retrieve multiple times during generation
  • Self-RAG: model decides when to retrieve
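
The difference between RAG-Sequence and RAG-Token comes down to where the sum over passages sits relative to the product over tokens. The sketch below computes both for a fixed two-token output; the retrieval probabilities and per-token probabilities are invented numbers, not model outputs.

```python
import math
from typing import List

def rag_sequence_prob(doc_probs: List[float], token_probs: List[List[float]]) -> float:
    """sum_z p(z|x) * prod_i p(y_i | x, z, y_<i): one passage explains the whole output."""
    return sum(pz * math.prod(per_token) for pz, per_token in zip(doc_probs, token_probs))

def rag_token_prob(doc_probs: List[float], token_probs: List[List[float]]) -> float:
    """prod_i sum_z p(z|x) * p(y_i | x, z, y_<i): passages are mixed at every step."""
    steps = len(token_probs[0])
    return math.prod(
        sum(pz * per_token[i] for pz, per_token in zip(doc_probs, token_probs))
        for i in range(steps)
    )

doc_probs = [0.5, 0.5]      # p(z | x) for two retrieved passages
token_probs = [[1.0, 0.0],  # passage 1 supports only the first output token
               [0.0, 1.0]]  # passage 2 supports only the second output token
seq_p = rag_sequence_prob(doc_probs, token_probs)
tok_p = rag_token_prob(doc_probs, token_probs)
```

In this contrived case no single passage supports the whole answer, so the sequence-level score is 0 while the token-level score is 0.25.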

Dialogue Systems

Task-Oriented Dialogue

Helps users accomplish specific goals (booking, information lookup).

Pipeline architecture:

  1. NLU: Intent classification + slot filling ("Book a flight to Paris" -> intent: book_flight, slots: {dest: "Paris"})
  2. Dialogue state tracking: Maintain belief state over slots across turns
  3. Policy: Decide next system action (ask for departure city, confirm booking)
  4. NLG: Generate natural language response from action
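
The four stages can be sketched as four small functions. This is a rule-based toy for illustration: real systems use trained models for each stage, and the regular expressions, slot names (`origin`, `dest`), and response templates are assumptions made for the example.

```python
import re
from typing import Dict, Optional, Tuple

REQUIRED_SLOTS = ["origin", "dest"]

def nlu(utterance: str) -> Tuple[str, Dict[str, str]]:
    """Toy NLU: keyword intent detection and regex slot filling."""
    slots = {}
    m = re.search(r"\bto (\w+)", utterance, re.IGNORECASE)
    if m:
        slots["dest"] = m.group(1)
    m = re.search(r"\bfrom (\w+)", utterance, re.IGNORECASE)
    if m:
        slots["origin"] = m.group(1)
    is_booking = re.search(r"\b(book|flight)\b", utterance, re.IGNORECASE)
    return ("book_flight" if is_booking else "inform"), slots

def update_state(state: dict, intent: str, slots: Dict[str, str]) -> dict:
    """Dialogue state tracking: carry slots and the task intent across turns."""
    new_state = {"intent": state.get("intent"), "slots": dict(state.get("slots", {}))}
    if intent == "book_flight":
        new_state["intent"] = intent
    new_state["slots"].update(slots)
    return new_state

def policy(state: dict) -> Tuple[str, Optional[str]]:
    """Request the first missing slot, otherwise confirm."""
    for slot in REQUIRED_SLOTS:
        if slot not in state["slots"]:
            return ("request", slot)
    return ("confirm", None)

def nlg(action: Tuple[str, Optional[str]], state: dict) -> str:
    kind, slot = action
    if kind == "request":
        prompts = {"origin": "Where are you flying from?",
                   "dest": "Where would you like to go?"}
        return prompts[slot]
    s = state["slots"]
    return f"Booking a flight from {s['origin']} to {s['dest']}. Confirm?"

state = {"intent": None, "slots": {}}
for user_turn in ["Book a flight to Paris", "from London"]:
    intent, slots = nlu(user_turn)
    state = update_state(state, intent, slots)
    reply = nlg(policy(state), state)
```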

End-to-end: Train a single model that maps dialogue history directly to a response. Examples include SimpleTOD, SOLOIST, and LLM-based systems.

Open-Domain Dialogue

Conversational agents without a specific task.

Approaches:

  • Retrieval-based: Select best response from a candidate set (bi-encoder scoring)
  • Generative: Generate responses with a seq2seq or autoregressive model
  • Hybrid: Retrieve candidates, then rerank or use as context for generation
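
Retrieval-based selection with a bi-encoder scores each candidate by the similarity between an independently computed context vector and response vector. The sketch below shows only the scoring structure: `embed` is a hypothetical bag-of-words stand-in for a learned encoder, so it rewards word overlap, which a trained bi-encoder would not be limited to.

```python
import math
import re
from collections import Counter
from typing import List

def embed(text: str) -> Counter:
    """Stand-in encoder: sparse bag-of-words counts instead of a learned dense vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_response(context: str, candidates: List[str]) -> str:
    """Encode context and candidates separately, then pick the highest-scoring candidate."""
    context_vec = embed(context)
    return max(candidates, key=lambda r: cosine(context_vec, embed(r)))

candidates = [
    "I love hiking in the mountains every summer.",
    "The stock market closed higher today.",
]
reply = select_response("Do you like hiking in the mountains?", candidates)
```

Because candidates are encoded independently of the context, their vectors can be precomputed and indexed, which is the main practical appeal of the bi-encoder design.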

Key models:

Model      | Approach         | Key Innovation
DialoGPT   | Fine-tuned GPT-2 | Reddit conversations
BlenderBot | Encoder-decoder  | Multi-skill (knowledge, empathy, personality)
LaMDA      | Decoder-only     | Safety and groundedness filters
ChatGPT    | GPT + RLHF       | Alignment for helpful, harmless dialogue

Challenges in Dialogue

  • Consistency: Maintaining a coherent persona across turns
  • Grounding: Using external knowledge accurately
  • Safety: Avoiding harmful, biased, or misleading content
  • Evaluation: Automatic metrics (perplexity, BLEU) correlate poorly with human judgment; human evaluation remains essential

Controllable Generation

Steering generated text toward desired attributes (style, topic, sentiment, length).

Method               | Mechanism
Conditional training | Prepend control tokens during training (e.g., a sentiment or style tag)
CTRL                 | Large LM trained with control codes
PPLM                 | Plug-and-play: gradient-based steering at inference
Prefix tuning        | Learn task-specific continuous prefixes
Instruction tuning   | Train on diverse instructions to follow natural language control
Constitutional AI    | Self-critique and revision guided by principles

Key Takeaways

  • Decoding strategy profoundly affects generation quality; nucleus sampling suits open-ended tasks, beam search suits constrained tasks
  • Extractive summarization is faithful but rigid; abstractive summarization is flexible but prone to hallucination
  • RAG combines retrieval with generation for knowledge-intensive tasks, reducing hallucination and enabling updatable knowledge
  • Dialogue systems range from task-oriented pipelines to open-domain generative agents
  • Controllable generation remains an active area; instruction tuning and RLHF are currently the dominant approaches
  • Evaluation of generated text is fundamentally difficult; human evaluation remains the gold standard