
Text Generation

Overview

Text generation produces coherent natural language output for tasks including summarization, question answering, dialogue, and open-ended writing. The quality of generated text depends critically on decoding strategy, training approach, and task-specific design. This document covers decoding methods, major generation tasks, and their architectures.


Decoding Strategies

Given a language model that produces a probability distribution over the vocabulary at each step, the decoding strategy determines how tokens are selected.

Greedy Decoding

Select the highest-probability token at each step.

y_t = argmax_y P(y | y_1, ..., y_{t-1}, x)
  • Fast (single forward pass per token)
  • Often produces repetitive, generic text
  • Misses high-quality sequences that start with lower-probability tokens
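The loop above can be sketched in a few lines. The `toy_model` below is an illustrative stand-in for a real language model: any function mapping a prefix of token ids to a probability distribution over a small vocabulary.

```python
# Greedy-decoding sketch over a toy "model". The model, vocabulary, and
# end-of-sequence id are invented for illustration.

def toy_model(prefix):
    # Deterministic toy distribution over a 4-token vocabulary;
    # token 3 acts as the end-of-sequence marker.
    if len(prefix) < 3:
        return [0.1, 0.6, 0.2, 0.1]   # token 1 is always most likely
    return [0.0, 0.0, 0.0, 1.0]       # then force end-of-sequence

def greedy_decode(model, eos_id=3, max_len=10):
    out = []
    for _ in range(max_len):
        probs = model(out)
        # argmax: take the single highest-probability token at each step
        next_id = max(range(len(probs)), key=lambda i: probs[i])
        if next_id == eos_id:
            break
        out.append(next_id)
    return out

print(greedy_decode(toy_model))  # [1, 1, 1]
```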

Beam Search

Track the top-b highest-scoring partial sequences at each step.

  • b (beam width) typically 4-10
  • Finds approximately the most probable sequence
  • Produces higher-quality output than greedy for MT and summarization
  • Tends to generate text that is too short, generic, or repetitive for open-ended generation
  • Length normalization and repetition penalties help
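A minimal beam-search sketch over the same kind of toy step function (prefix of token ids in, probability distribution out). The model and scores are illustrative; real implementations also apply length normalization.

```python
import math

# Beam search with beam width b: expand every beam by every token, keep the
# b highest-scoring (by log-probability) unfinished sequences, and collect
# finished ones. Purely illustrative toy model below.

def toy_model(prefix):
    if len(prefix) < 2:
        return [0.1, 0.6, 0.2, 0.1]   # token 3 is end-of-sequence
    return [0.0, 0.0, 0.0, 1.0]

def beam_search(model, b=2, eos_id=3, max_len=5):
    beams = [([], 0.0)]               # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for toks, score in beams:
            for tok, p in enumerate(model(toks)):
                if p > 0.0:
                    candidates.append((toks + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for toks, score in candidates:
            if toks[-1] == eos_id:
                finished.append((toks, score))   # sequence is complete
            else:
                beams.append((toks, score))
            if len(beams) == b:                  # keep only the top b
                break
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]

print(beam_search(toy_model))  # [1, 1, 3]
```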

Top-k Sampling

Sample from the k most probable tokens, redistributing probability mass.

1. Compute logits for all vocabulary tokens
2. Keep only the top-k tokens
3. Renormalize probabilities over these k tokens
4. Sample from the resulting distribution
  • k=1 is greedy; larger k increases diversity
  • Fixed k is suboptimal: when the distribution is peaked, k=50 includes unlikely tokens; when flat, k=50 may exclude reasonable options
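The four numbered steps map directly onto a short function. The probabilities below are illustrative.

```python
import random

# Top-k sampling sketch: keep the k most probable tokens, renormalize the
# probability mass over them, then sample.

def top_k_sample(probs, k, rng=random):
    # Steps 1-2: rank tokens by probability and keep the top k
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Step 3: renormalize over the surviving tokens
    total = sum(probs[i] for i in top)
    weights = [probs[i] / total for i in top]
    # Step 4: sample from the truncated distribution
    return rng.choices(top, weights=weights)[0]

probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_sample(probs, k=2))  # always token 0 or 1
```

With k=1 this reduces to greedy decoding, as noted above.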

Nucleus Sampling (Top-p)

Sample from the smallest set of tokens whose cumulative probability exceeds threshold p.

1. Sort tokens by descending probability
2. Accumulate probabilities until sum >= p
3. Sample from this dynamic set
  • p=0.9-0.95 is common
  • Adapts vocabulary size to the model's confidence
  • Generally preferred over top-k for open-ended generation
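A sketch of the dynamic truncation: with a peaked distribution the nucleus collapses to one token, which is exactly the adaptivity fixed k lacks. Values are illustrative.

```python
import random

# Nucleus (top-p) sampling sketch: sample from the smallest prefix of the
# probability-sorted vocabulary whose cumulative mass reaches p.

def nucleus_sample(probs, p=0.9, rng=random):
    # Step 1: sort tokens by descending probability
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Step 2: accumulate until the running sum reaches p
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= p:
            break
    # Step 3: renormalize and sample from the dynamic set
    weights = [probs[i] / cum for i in nucleus]
    return rng.choices(nucleus, weights=weights)[0]

# Peaked distribution: the nucleus is a single token, so sampling is greedy.
print(nucleus_sample([0.95, 0.03, 0.01, 0.01], p=0.9))  # 0
```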

Temperature Scaling

Adjust the sharpness of the probability distribution before sampling.

P(y_i) = exp(logit_i / T) / sum_j exp(logit_j / T)

| Temperature | Effect |
|---|---|
| T < 1 | Sharper distribution, more deterministic |
| T = 1 | Original distribution |
| T > 1 | Flatter distribution, more random |
| T -> 0 | Approaches greedy decoding |

Temperature is typically combined with top-k or top-p sampling.
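The softmax formula above, written out with a numerical-stability trick; the logits are illustrative.

```python
import math

# Temperature scaling sketch: divide logits by T before the softmax.
# T < 1 sharpens the distribution, T > 1 flattens it.

def softmax_with_temperature(logits, T=1.0):
    scaled = [l / T for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, T=0.5)
flat = softmax_with_temperature(logits, T=2.0)
# The top token's probability grows as T shrinks:
print(sharp[0], ">", flat[0])
```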

Repetition Control

  • Repetition penalty: Divide logits of previously generated tokens by a penalty factor
  • No-repeat n-gram: Block n-grams that have already appeared
  • Frequency/presence penalty: Penalize tokens based on how often they have appeared
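The first bullet can be sketched directly on logits. Note one subtlety: dividing only reduces probability for positive logits, so a common variant multiplies negative logits by the factor instead, as done here. The values are illustrative.

```python
# Repetition-penalty sketch: penalize the logits of previously generated
# tokens by a factor > 1 before sampling.

def apply_repetition_penalty(logits, generated, penalty=1.2):
    out = list(logits)
    for tok in set(generated):
        if out[tok] > 0:
            out[tok] /= penalty      # positive logit: divide to shrink it
        else:
            out[tok] *= penalty      # negative logit: multiply to shrink it
    return out

print(apply_repetition_penalty([3.0, 1.0, -0.5], generated=[0, 2]))
# tokens 0 and 2 become less likely; token 1 is untouched
```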

Choosing a Strategy

| Task | Recommended | Rationale |
|---|---|---|
| Machine translation | Beam search (b=4-6) | Need most probable, faithful output |
| Summarization | Beam search with length penalty | Faithful, concise output |
| Creative writing | Nucleus sampling (p=0.9, T=0.8-1.0) | Diversity with coherence |
| Code generation | Low temperature (T=0.2-0.4) | Correctness over diversity |
| Dialogue | Nucleus sampling (p=0.9) | Natural, varied responses |


Summarization

Condensing a document into a shorter version that preserves key information.

Extractive Summarization

Selects sentences from the source document.

Approaches:

  • TextRank: Graph-based (sentences as nodes, similarity as edges); PageRank-style ranking
  • BERT extractive: Encode each sentence, classify as include/exclude
  • BertSumExt: Add [CLS] tokens between sentences, use inter-sentence transformer layers

Advantages: Grammatical output (original sentences), faithful to source. Limitations: Cannot paraphrase or combine information; may lack coherence.
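The TextRank approach above can be sketched end to end: sentences as nodes, word-overlap similarity as edge weights, and a PageRank-style power iteration. The similarity measure and damping factor follow common defaults; the sentences are illustrative.

```python
# TextRank sketch for extractive summarization.

def similarity(a, b):
    # normalized word overlap between two sentences
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / (len(wa) + len(wb))

def textrank(sentences, damping=0.85, iters=50):
    n = len(sentences)
    sim = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iters):                # PageRank-style power iteration
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out_weight = sum(sim[j])  # total edge weight leaving node j
                if sim[j][i] > 0 and out_weight > 0:
                    rank += sim[j][i] / out_weight * scores[j]
            new.append((1 - damping) / n + damping * rank)
        scores = new
    # return sentences sorted by descending TextRank score
    return [s for _, s in sorted(zip(scores, sentences), reverse=True)]

sents = ["the cat sat on the mat", "the cat ate fish", "dogs bark loudly"]
print(textrank(sents))  # the isolated sentence ranks last
```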

Abstractive Summarization

Generates novel text, potentially using words not in the source.

Architectures:

  • Encoder-decoder with attention and copy mechanism (Pointer-Generator)
  • Pretrained seq2seq models: BART, T5, Pegasus
  • Pegasus pretraining: mask important sentences (gap-sentence generation), an objective tailored to summarization

Challenges:

  • Faithfulness/hallucination: Generated summaries may contain facts not in the source
  • Factual consistency: Active research area; verification models (FactCC, SummaC, AlignScore)
  • Long documents: Standard transformers limited to 512-1024 tokens; solutions include LED (Longformer Encoder-Decoder), chunking, hierarchical models

Evaluation

| Metric | Type | Measures |
|---|---|---|
| ROUGE-1/2/L | N-gram overlap | Recall of unigrams, bigrams, longest common subsequence |
| BERTScore | Embedding similarity | Semantic overlap |
| FactCC, SummaC | Trained classifier | Factual consistency |
| Human evaluation | Manual | Fluency, informativeness, faithfulness |


Question Answering (QA)

Extractive QA

Select a span from a context passage as the answer.

Context: "Paris is the capital of France."
Question: "What is the capital of France?"
Answer: "Paris" (span [0, 0])

SQuAD model architecture:

  1. Encode [CLS] question [SEP] context [SEP] with BERT
  2. Predict start and end token positions with linear layers
  3. Answer span = text between start and end positions
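Step 2 above leaves one practical detail: turning per-token start and end scores into a valid span. A common decoding rule picks the pair with the highest combined score subject to start <= end and a length cap. The scores below are made up for illustration.

```python
# Span-selection sketch for extractive QA, given the per-token start/end
# scores a BERT-style QA head would produce.

def best_span(start_scores, end_scores, max_answer_len=10):
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # only consider ends at or after the start, within the length cap
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]   # spans are scored additively
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Context tokens: ["Paris", "is", "the", "capital", "of", "France"]
start = [5.0, 0.1, 0.0, 1.2, 0.0, 0.3]
end   = [4.8, 0.2, 0.0, 0.9, 0.1, 1.0]
print(best_span(start, end))  # (0, 0) -> "Paris"
```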

Datasets: SQuAD 1.1/2.0, Natural Questions, TriviaQA

Generative QA

Generate the answer as free text rather than extracting a span.

  • More flexible: can synthesize information from multiple passages
  • Models: T5, GPT, UnifiedQA
  • Handles questions requiring reasoning, multi-hop inference, or aggregation

Retrieval-Augmented Generation (RAG)

Combines retrieval with generation for knowledge-intensive QA.

Question -> Retriever -> Top-k passages -> Generator (conditioned on passages) -> Answer

Components:

  • Retriever: Dense (DPR) or sparse (BM25) search over a document corpus
  • Generator: Seq2seq model (BART, T5) or LLM that conditions on retrieved passages
  • Advantages: Updatable knowledge (update corpus, not model), attributable answers, reduced hallucination
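The retriever-to-generator flow can be sketched with stand-ins: a keyword-overlap "retriever" and a prompt builder in place of the generator. The corpus, scoring function, and prompt template are all invented for illustration; a real system would use BM25 or a dense retriever (DPR) plus a seq2seq model or LLM.

```python
# Minimal RAG-style pipeline sketch: retrieve top-k passages, pack them into
# a prompt for the generator.

CORPUS = [
    "Paris is the capital of France.",
    "The Eiffel Tower was completed in 1889.",
    "Mount Everest is the highest mountain on Earth.",
]

def retrieve(question, corpus, k=2):
    # toy retriever: score passages by word overlap with the question
    q = set(question.lower().rstrip("?").split())
    scored = sorted(corpus,
                    key=lambda p: len(q & set(p.lower().rstrip(".").split())),
                    reverse=True)
    return scored[:k]

def build_prompt(question, passages):
    # condition the generator on the retrieved passages
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using the passages below.\n{context}\nQ: {question}\nA:"

passages = retrieve("What is the capital of France?", CORPUS)
print(build_prompt("What is the capital of France?", passages))
```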

Variants:

  • RAG-Token: retrieve once, marginalize over passages at each generated token
  • RAG-Sequence: retrieve once, generate a full answer per passage, marginalize over passages
  • Iterative RAG: retrieve multiple times during generation
  • Self-RAG: model decides when to retrieve

Dialogue Systems

Task-Oriented Dialogue

Helps users accomplish specific goals (booking, information lookup).

Pipeline architecture:

  1. NLU: Intent classification + slot filling ("Book a flight" -> intent: book_flight, slots: {dest: "Paris"})
  2. Dialogue state tracking: Maintain belief state over slots across turns
  3. Policy: Decide next system action (ask for departure city, confirm booking)
  4. NLG: Generate natural language response from action
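Step 2 of the pipeline can be sketched as a simple belief-state merge: each turn's NLU output overwrites or extends the accumulated slots, and slots not mentioned this turn persist. The intent and slot names are illustrative.

```python
# Dialogue-state-tracking sketch: fold each turn's NLU output into a
# persistent belief state.

def update_state(state, nlu_output):
    return {
        # keep the previous intent unless this turn supplies a new one
        "intent": nlu_output.get("intent", state.get("intent")),
        # later turns overwrite earlier slot values; unmentioned slots persist
        "slots": {**state.get("slots", {}), **nlu_output.get("slots", {})},
    }

state = update_state({}, {"intent": "book_flight", "slots": {"dest": "Paris"}})
state = update_state(state, {"slots": {"date": "2024-06-01"}})
print(state)
# {'intent': 'book_flight', 'slots': {'dest': 'Paris', 'date': '2024-06-01'}}
```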

End-to-end: Train a single model from dialogue history to response. SimpleTOD, SOLOIST, and LLM-based systems.

Open-Domain Dialogue

Conversational agents without a specific task.

Approaches:

  • Retrieval-based: Select best response from a candidate set (bi-encoder scoring)
  • Generative: Generate responses with a seq2seq or autoregressive model
  • Hybrid: Retrieve candidates, then rerank or use as context for generation

Key models:

| Model | Approach | Key Innovation |
|---|---|---|
| DialoGPT | Fine-tuned GPT-2 | Large-scale training on Reddit conversations |
| BlenderBot | Encoder-decoder | Multi-skill (knowledge, empathy, personality) |
| LaMDA | Decoder-only | Safety and groundedness filters |
| ChatGPT | GPT + RLHF | Alignment for helpful, harmless dialogue |

Challenges in Dialogue

  • Consistency: Maintaining a coherent persona across turns
  • Grounding: Using external knowledge accurately
  • Safety: Avoiding harmful, biased, or misleading content
  • Evaluation: Automatic metrics (perplexity, BLEU) correlate poorly with human judgment; human evaluation remains essential

Controllable Generation

Steering generated text toward desired attributes (style, topic, sentiment, length).

| Method | Mechanism |
|---|---|
| Conditional training | Prepend control tokens (e.g., style or sentiment markers) during training |
| CTRL | Large LM trained with control codes |
| PPLM | Plug-and-play: gradient-based steering at inference |
| Prefix tuning | Learn task-specific continuous prefixes |
| Instruction tuning | Train on diverse instructions to follow natural language control |
| Constitutional AI | Self-critique and revision guided by principles |
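The conditional-training row amounts to a data-preparation step: prepend an attribute token to each training example so the model learns to associate the token with the attribute. The token format and examples below are invented for illustration.

```python
# Conditional-training sketch: tag training examples with control tokens.

def add_control_token(text, attribute):
    # hypothetical convention: attribute name in angle brackets
    return f"<{attribute}> {text}"

examples = [("great movie, loved it", "positive"),
            ("boring and too long", "negative")]
prepared = [add_control_token(t, a) for t, a in examples]
print(prepared[0])  # '<positive> great movie, loved it'
```

At inference time, generation is steered by prepending the desired control token to the prompt.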


Key Takeaways

  • Decoding strategy profoundly affects generation quality; nucleus sampling suits open-ended tasks, beam search suits constrained tasks
  • Extractive summarization is faithful but rigid; abstractive summarization is flexible but prone to hallucination
  • RAG combines retrieval with generation for knowledge-intensive tasks, reducing hallucination and enabling updatable knowledge
  • Dialogue systems range from task-oriented pipelines to open-domain generative agents
  • Controllable generation remains an active area; instruction tuning and RLHF are the current dominant approaches
  • Evaluation of generated text is fundamentally difficult; human evaluation remains the gold standard