Structured Output & Chains

LLMs produce text. Your application needs data. A JSON object with specific fields, a classification label from a fixed set, a list of extracted entities. Getting reliable, parseable output from a model that fundamentally produces a stream of tokens is one of the most practical challenges in LLM application development.

The Problem with Free-Form Output

Ask an LLM to "extract the person's name and email from this text" and you might get:

Response 1: "The person's name is John Smith and their email is john@example.com"
Response 2: "Name: John Smith\nEmail: john@example.com"
Response 3: '{"name": "John Smith", "email": "john@example.com"}'
Response 4: "I found the following information:\n- Name: John Smith\n- Email: john@example.com"

All four are correct answers, but only Response 3 parses as JSON, and nothing guarantees you get that format on the next call. Your downstream system needs one format, every time.
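To make the point concrete, here is a minimal sketch that feeds the four responses above to a JSON parser. The `responses` list and the `try_parse` helper are illustrative names, not part of any library.

```python
import json

# The four example responses from above, as the model might return them.
responses = [
    "The person's name is John Smith and their email is john@example.com",
    "Name: John Smith\nEmail: john@example.com",
    '{"name": "John Smith", "email": "john@example.com"}',
    "I found the following information:\n- Name: John Smith\n- Email: john@example.com",
]

def try_parse(text):
    """Return the parsed dict, or None if the text is not a JSON object."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

results = [try_parse(r) for r in responses]
parseable = [r for r in results if r is not None]
print(f"{len(parseable)} of {len(responses)} responses parsed as JSON")
```

Only the third response survives, and a parser written for any one of the other formats would fail on the rest.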

JSON Mode

The simplest approach: tell the model to output JSON.

import json

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "You extract contact information from text. Always respond in JSON with 'name' and 'email' fields."},
        {"role": "user", "content": "Please reach out to John Smith at john@example.com about the proposal."}
    ]
)

result = json.loads(response.choices[0].message.content)
# e.g. {"name": "John Smith", "email": "john@example.com"}

JSON mode guarantees the output is syntactically valid JSON (provided the response is not cut off by the token limit), but it does not guarantee the JSON has the fields you expect. The model might return {"contact": "John Smith <john@example.com>"} instead of separate name and email fields.
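Because JSON mode only covers syntax, the field check has to happen in your own code. The following is a minimal standard-library sketch; `REQUIRED_FIELDS` and `validate_contact` are hypothetical names chosen for this example.

```python
import json

# Hypothetical expected shape for the contact example: field name -> required type.
REQUIRED_FIELDS = {"name": str, "email": str}

def validate_contact(raw_json):
    """Parse JSON-mode output and check that the expected fields are present."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"field {field!r} should be {expected_type.__name__}")
    return data

# Valid JSON with the expected fields passes.
good = validate_contact('{"name": "John Smith", "email": "john@example.com"}')

# Valid JSON with the wrong shape is rejected.
try:
    validate_contact('{"contact": "John Smith <john@example.com>"}')
    error = None
except ValueError as exc:
    error = str(exc)
```

A check like this is what schema enforcement replaces, which is why JSON mode alone is rarely the end state.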

Function Calling (Structured Outputs)

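Structured outputs give you schema enforcement. You define the exact structure you want, and the model's output conforms to it. The same mechanism backs strict function calling; the example here passes a Pydantic model as the response format.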

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class ContactInfo(BaseModel):
    name: str
    email: str
    company: str | None = None
    role: str | None = None

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract contact information from the text."},
        {"role": "user", "content": "John Smith, Senior Engineer at Acme Corp (john@acme.com) sent the proposal."}
    ],
    response_format=ContactInfo,
)

contact = response.choices[0].message.parsed
# ContactInfo(name='John Smith', email='john@acme.com', company='Acme Corp', role='Senior Engineer')

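This is the right default for structured extraction. The model is constrained to output valid instances of your schema, which removes parsing errors, missing fields, and format guessing. Two caveats remain: the schema constrains the shape of the output, not its accuracy, and a request can still end in a refusal or be truncated by the token limit, so check for those before using the parsed result.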
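Schema enforcement covers the shape of a successful answer, but the model can still decline to answer. The sketch below assumes the message object exposes `parsed` and `refusal` attributes, as the parsed messages in the OpenAI Python SDK do; `unwrap_parsed` is a hypothetical helper, and `SimpleNamespace` objects stand in for real responses so the example runs offline.

```python
from types import SimpleNamespace

def unwrap_parsed(message):
    """Return message.parsed, or raise if the model refused or nothing was parsed."""
    refusal = getattr(message, "refusal", None)
    if refusal:
        raise RuntimeError(f"model refused: {refusal}")
    if message.parsed is None:
        raise RuntimeError("no parsed output on the message")
    return message.parsed

# Stand-ins for response.choices[0].message (assumed attribute names).
ok_message = SimpleNamespace(parsed={"name": "John Smith"}, refusal=None)
refused_message = SimpleNamespace(parsed=None, refusal="I can't help with that.")

contact = unwrap_parsed(ok_message)

try:
    unwrap_parsed(refused_message)
    refusal_error = None
except RuntimeError as exc:
    refusal_error = str(exc)
```

Routing every parsed response through one helper like this keeps the refusal check out of the business logic.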

Complex Schemas

Pydantic models compose naturally for complex structures:

from enum import Enum

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class Severity(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    critical = "critical"

class ActionItem(BaseModel):
    description: str
    assignee: str | None = None
    due_date: str | None = None
    severity: Severity

class MeetingNotes(BaseModel):
    title: str
    date: str
    attendees: list[str]
    summary: str
    action_items: list[ActionItem]
    decisions: list[str]
    open_questions: list[str]

transcript_text = "..."  # placeholder: supply the raw meeting transcript here

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Parse the following meeting transcript into structured notes."},
        {"role": "user", "content": transcript_text}
    ],
    response_format=MeetingNotes,
)

Chains: Multi-Step LLM Workflows

A chain is a sequence of LLM calls where the output of one step feeds into the next. Chains decompose complex tasks into simpler, more reliable subtasks.

When to Chain vs When to Do It in One Prompt

Use a single prompt when:
  - The task is straightforward (classify, extract, summarize)
  - The input fits in the context window
  - Quality is good enough without decomposition

Use a chain when:
  - The task has distinct phases (understand -> analyze -> generate)
  - Different steps need different models or temperatures
  - You need intermediate validation between steps
  - The full task in one prompt produces unreliable results
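The chain criteria above can be sketched as a small runner that passes each step's output to the next and validates in between. This is an illustration only: `run_chain` is a hypothetical helper, and the three stub functions stand in for LLM calls.

```python
def run_chain(steps, initial_input):
    """Run (name, step_fn, validate_fn) triples in order, failing fast on bad output."""
    value = initial_input
    for name, step_fn, validate_fn in steps:
        value = step_fn(value)
        if not validate_fn(value):
            raise ValueError(f"step {name!r} produced invalid output: {value!r}")
    return value

# Stub steps standing in for LLM calls (hypothetical, for illustration only).
def understand(text):
    return {"text": text, "topic": "billing" if "invoice" in text.lower() else "other"}

def analyze(state):
    return {**state, "urgent": "overdue" in state["text"].lower()}

def generate(state):
    prefix = "URGENT" if state["urgent"] else "FYI"
    return f"{prefix}: {state['topic']} message received"

steps = [
    ("understand", understand, lambda s: s["topic"] in {"billing", "other"}),
    ("analyze", analyze, lambda s: isinstance(s["urgent"], bool)),
    ("generate", generate, lambda s: isinstance(s, str) and len(s) > 0),
]

result = run_chain(steps, "Invoice #123 is overdue, please pay.")
print(result)
```

Each step has one job and one validator, so a failure names the step that caused it instead of surfacing as a vague bad answer at the end.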

Example: Document Processing Chain

from typing import Literal

from pydantic import BaseModel

# InvoiceData, ContractData, ReportData, and GenericData are Pydantic models
# assumed to be defined elsewhere, one per document type.

class DocumentProcessor:
    def __init__(self, client):
        self.client = client

    def process(self, document_text):
        # Step 1: Classify the document type
        doc_type = self._classify(document_text)

        # Step 2: Extract structured data based on type
        extracted = self._extract(document_text, doc_type)

        # Step 3: Generate a summary
        summary = self._summarize(document_text, doc_type, extracted)

        return {
            "type": doc_type,
            "extracted_data": extracted,
            "summary": summary,
        }

    def _classify(self, text):
        class DocType(BaseModel):
            # Literal restricts the label to a fixed set at the schema level
            document_type: Literal["invoice", "contract", "report", "email"]
            confidence: float
        
        response = self.client.beta.chat.completions.parse(
            model="gpt-4o-mini",  # cheap model for classification
            messages=[
                {"role": "system", "content": "Classify the document type."},
                {"role": "user", "content": text[:2000]}  # first 2000 chars is enough
            ],
            response_format=DocType,
        )
        return response.choices[0].message.parsed.document_type
    
    def _extract(self, text, doc_type):
        # Different extraction schema per document type
        schemas = {
            "invoice": InvoiceData,
            "contract": ContractData,
            "report": ReportData,
        }
        schema = schemas.get(doc_type, GenericData)
        
        response = self.client.beta.chat.completions.parse(
            model="gpt-4o",  # better model for extraction
            messages=[
                {"role": "system", "content": f"Extract structured data from this {doc_type}."},
                {"role": "user", "content": text}
            ],
            response_format=schema,
        )
        return response.choices[0].message.parsed
    
    def _summarize(self, text, doc_type, extracted):
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": f"Write a 2-3 sentence summary of this {doc_type}."},
                {"role": "user", "content": text}
            ],
        )
        return response.choices[0].message.content

Chain with Validation

The real power of chains is inserting validation between steps:

# A method on DocumentProcessor, shown on its own for brevity
def process_with_validation(self, document_text):
    # Step 1: Classify
    doc_type = self._classify(document_text)
    
    # Validation: reject unknown types early
    if doc_type not in ["invoice", "contract", "report", "email"]:
        return {"error": f"Unknown document type: {doc_type}"}
    
    # Step 2: Extract
    extracted = self._extract(document_text, doc_type)
    
    # Validation: check required fields (assumes total_amount is optional in the invoice schema)
    if doc_type == "invoice" and extracted is not None and extracted.total_amount is None:
        # Retry extraction with a more explicit prompt
        extracted = self._extract_with_hints(
            document_text, doc_type,
            hint="Make sure to extract the total amount. Look for 'Total', 'Amount Due', etc."
        )
    
    # Step 3: Summarize (only if extraction succeeded)
    if extracted is not None:
        summary = self._summarize(document_text, doc_type, extracted)
    else:
        summary = "Extraction failed. Manual review required."
    
    return {"type": doc_type, "extracted_data": extracted, "summary": summary}
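The `_extract_with_hints` method is called above but never defined. One plausible shape, assumed here rather than taken from the original, is the same request as `_extract` with the hint appended to the system prompt. The sketch below shows only the message construction, so it runs without an API call; `build_hinted_messages` is a hypothetical helper name.

```python
def build_hinted_messages(text, doc_type, hint):
    """Build the message list for a retry that adds an explicit hint to the system prompt."""
    system_prompt = f"Extract structured data from this {doc_type}. {hint}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": text},
    ]

messages = build_hinted_messages(
    "Amount Due: $1,200.00",
    "invoice",
    hint="Make sure to extract the total amount. Look for 'Total', 'Amount Due', etc.",
)
```

The resulting list would be passed as `messages` to the same parse call that `_extract` uses, with the same schema.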

Error Handling in Chains

LLM calls fail. They time out, return malformed output, or produce nonsensical results. Chains need error handling at every step.

import time
from openai import APITimeoutError, RateLimitError

def llm_call_with_retry(func, max_retries=3, backoff_base=2):
    """Retry LLM calls with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return func()
        except (APITimeoutError, RateLimitError) as exc:
            # Out of attempts: re-raise instead of silently returning None
            if attempt == max_retries - 1:
                raise
            # Rate limits wait one backoff step longer than timeouts
            extra = 1 if isinstance(exc, RateLimitError) else 0
            time.sleep(backoff_base ** (attempt + extra))
        # Any other exception propagates immediately without a retry

_NO_FALLBACK = object()  # sentinel, so that None can be used as a fallback value

def safe_chain_step(func, fallback=_NO_FALLBACK):
    """Execute a chain step with fallback on failure."""
    try:
        return llm_call_with_retry(func)
    except Exception:
        if fallback is not _NO_FALLBACK:
            return fallback
        raise

In batch processing, handle partial failures gracefully: mark failed steps as null or "unknown," continue processing remaining documents, and surface errors for review rather than crashing the entire pipeline.
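A minimal sketch of that batch pattern follows. `process_batch` and `fake_process` are hypothetical names; in practice `process_fn` would be something like a document-processing chain.

```python
def process_batch(documents, process_fn):
    """Process each document independently; collect failures instead of raising."""
    results, errors = [], []
    for index, doc in enumerate(documents):
        try:
            results.append(process_fn(doc))
        except Exception as exc:  # broad on purpose: one bad document must not stop the batch
            results.append(None)  # keep positions aligned with the input list
            errors.append({"index": index, "error": repr(exc)})
    return results, errors

# Hypothetical stand-in for a real chain.
def fake_process(doc):
    if not doc.strip():
        raise ValueError("empty document")
    return {"type": "unknown", "length": len(doc)}

results, errors = process_batch(["first doc", "   ", "third"], fake_process)
print(f"{len(errors)} of {len(results)} documents failed")
```

Keeping `results` aligned with the input makes it easy to retry or hand-review exactly the documents listed in `errors`.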

Common Pitfalls

  • Parsing free-form LLM output with regex. This breaks constantly. Use function calling or structured outputs. If you are writing regex to parse LLM output, you are solving the wrong problem.
  • Over-chaining. Each step in a chain adds latency, cost, and a failure point. If failures are independent, a 7-step chain with 95% reliability per step has only about 70% end-to-end reliability (0.95^7 ≈ 0.70). Keep chains short.
  • No validation between steps. If step 1 produces garbage, steps 2-5 amplify the garbage. Validate intermediate outputs and fail fast.
  • Same model for every step. Use cheap, fast models (GPT-4o-mini, Haiku) for simple steps like classification and routing. Reserve expensive models for steps that need high quality.
  • Ignoring latency. A 3-step chain where each step takes 2 seconds means 6 seconds of user-visible latency when the steps run sequentially. Run independent steps in parallel where possible, and consider whether the chain is really necessary.
  • Not handling partial extraction gracefully. If the model cannot extract a field, returning null is better than hallucinating a value. Design your schema to allow optional fields and handle missing data downstream.
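The reliability and latency figures in the pitfalls above follow from simple arithmetic. This sketch assumes that step failures are independent and that steps run strictly one after another; both function names are illustrative.

```python
def end_to_end_reliability(per_step, steps):
    """Probability that every step succeeds, assuming independent failures."""
    return per_step ** steps

def sequential_latency(step_seconds):
    """Total wall-clock time when steps run one after another."""
    return sum(step_seconds)

for n in (1, 3, 5, 7):
    print(f"{n} steps at 95% each -> {end_to_end_reliability(0.95, n):.1%} end to end")

print(f"3 steps at 2s each -> {sequential_latency([2, 2, 2])}s")
```

Reliability falls with every added step, which is the quantitative case for keeping chains short.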

Key Takeaways

  • Use structured outputs (function calling, Pydantic models) as the default for any task that needs parseable output. Stop writing regex to parse LLM text.
  • Chains decompose complex tasks into reliable subtasks. Each step should do one thing, and validation should happen between steps.
  • Error handling is not optional. LLM calls fail, produce bad output, and time out. Handle every step.
  • Use cheaper models for simple chain steps. Not every step needs the most capable model.
  • Keep chains short. Each step adds latency and compounds the failure rate. If you can do it in one well-crafted prompt, do that instead.
  • Few-shot examples in the prompt remain one of the most effective ways to get consistent output format, even when not using formal schema enforcement.
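As a sketch of the few-shot point, the helper below builds a chat message list in which example input/output pairs appear as earlier user and assistant turns. `build_few_shot_messages` and the two example contacts are made up for illustration.

```python
def build_few_shot_messages(instruction, examples, user_input):
    """Interleave (input, output) example pairs as user/assistant turns before the real input."""
    messages = [{"role": "system", "content": instruction}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages

# Illustrative examples; each output shows the exact format we want back.
examples = [
    ("Email Ana Lopez at ana@example.com.", '{"name": "Ana Lopez", "email": "ana@example.com"}'),
    ("Ping Raj Patel (raj@example.com) today.", '{"name": "Raj Patel", "email": "raj@example.com"}'),
]

messages = build_few_shot_messages(
    "Extract the contact as JSON with 'name' and 'email' fields.",
    examples,
    "Please reach out to John Smith at john@example.com about the proposal.",
)
```

The list can be passed as `messages` to a chat completion call, with or without schema enforcement on top.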