Routing & Fallbacks

Not every query needs your most expensive model. A user asking "what are your business hours?" does not require GPT-4o. A user asking "analyze this 50-page contract and identify all liability clauses" does. Routing directs each query to the right model based on complexity, and fallbacks ensure that if the first choice fails, a more capable model catches it. Together, these patterns can cut LLM costs by 50-80% in workloads dominated by simple queries, often without measurable quality loss.

The Cost Spectrum

Model                       Cost per 1M input tokens   Relative Speed   Quality
GPT-4o                      $2.50                      Medium           High
GPT-4o-mini                 $0.15                      Fast             Good
Claude Sonnet               $3.00                      Medium           High
Claude Haiku                $0.25                      Fast             Good
Llama 3 70B (self-hosted)   ~$0.50 (compute)           Medium           Good
Llama 3 8B (self-hosted)    ~$0.05 (compute)           Fast             Moderate

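Depending on which pair you compare, the price gap in this table runs from about 12x (Claude Haiku vs. Claude Sonnet) to 60x (self-hosted Llama 3 8B vs. Claude Sonnet). If 70% of your queries can be handled by the cheap model, you avoid 70% of that gap: routing 70% of traffic from GPT-4o to GPT-4o-mini lowers the average input price from $2.50 to about $0.86 per 1M tokens.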
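
To make the arithmetic concrete, here is a small sketch that computes the blended input price when a share of traffic moves to a cheaper model. The function name `blended_cost` and the 70/30 split are illustrative assumptions; the prices come from the table above, and the calculation ignores output tokens and any difference in token counts between models.

```python
def blended_cost(cheap_price, expensive_price, cheap_share):
    """Average input price per 1M tokens when cheap_share of traffic uses the cheap model."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return cheap_share * cheap_price + (1.0 - cheap_share) * expensive_price


# Prices per 1M input tokens, taken from the table above.
GPT_4O = 2.50
GPT_4O_MINI = 0.15

blended = blended_cost(GPT_4O_MINI, GPT_4O, cheap_share=0.70)
savings = 1.0 - blended / GPT_4O

print(f"Blended price: ${blended:.3f} per 1M input tokens")
print(f"Savings vs. always using GPT-4o: {savings:.0%}")
```

Changing `cheap_share` shows how sensitive the savings are to the fraction of traffic the cheap model can actually handle.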

Intent Classification for Routing

The most common routing pattern: classify the user's intent, then route to the appropriate model.

from openai import OpenAI
from pydantic import BaseModel
from enum import Enum

client = OpenAI()

class QueryComplexity(str, Enum):
    simple = "simple"       # FAQ, greetings, simple lookups
    moderate = "moderate"   # summarization, standard analysis
    complex = "complex"     # multi-step reasoning, long document analysis

class RouteDecision(BaseModel):
    complexity: QueryComplexity
    reasoning: str

def classify_query(user_message):
    """Classify query complexity to determine routing."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # use the cheap model for routing itself
        messages=[
            {"role": "system", "content": """Classify the complexity of this user query.

simple: Greetings, FAQs, simple factual questions, yes/no questions.
moderate: Summarization, standard analysis, comparisons, explanations.
complex: Multi-step reasoning, long document analysis, code generation, 
         creative writing, nuanced judgment calls."""},
            {"role": "user", "content": user_message}
        ],
        response_format=RouteDecision,
    )
    return response.choices[0].message.parsed

MODEL_MAP = {
    QueryComplexity.simple: "gpt-4o-mini",
    QueryComplexity.moderate: "gpt-4o-mini",
    QueryComplexity.complex: "gpt-4o",
}

def route_and_respond(user_message, system_prompt):
    route = classify_query(user_message)
    model = MODEL_MAP[route.complexity]
    
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ],
    )
    return response.choices[0].message.content, model
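
The router is itself an API call that can time out, hit a rate limit, or return something unexpected. One conservative option, sketched below, is to default to the stronger model whenever routing fails. The names `pick_model`, `DEFAULT_MODEL`, and `MODEL_BY_COMPLEXITY` are hypothetical, and `classify` is an injected callable (for example, a thin wrapper around `classify_query` that returns `route.complexity.value`) so the sketch runs without an API key.

```python
DEFAULT_MODEL = "gpt-4o"  # assumption: prefer quality when the router is unavailable

MODEL_BY_COMPLEXITY = {
    "simple": "gpt-4o-mini",
    "moderate": "gpt-4o-mini",
    "complex": "gpt-4o",
}


def pick_model(user_message, classify):
    """Return a model name, defaulting to the stronger model if routing fails.

    `classify` is any callable that takes the message and returns one of
    "simple", "moderate", or "complex".
    """
    try:
        label = classify(user_message)
    except Exception:
        # The router call failed (timeout, rate limit, parse error).
        return DEFAULT_MODEL
    # Unknown labels also fall through to the default.
    return MODEL_BY_COMPLEXITY.get(label, DEFAULT_MODEL)


# Stand-in classifiers so the sketch runs offline.
def always_simple(message):
    return "simple"


def broken_router(message):
    raise TimeoutError("router unavailable")


print(pick_model("What are your hours?", always_simple))   # gpt-4o-mini
print(pick_model("What are your hours?", broken_router))   # gpt-4o
```

Defaulting to the expensive model trades cost for safety; if router outages are frequent, defaulting to the cheap model plus a fallback chain may be the better choice.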

The Router Tax

The classification call itself costs money and adds latency. For this to be worthwhile:

Router cost < (fraction routed to cheap model) * (expensive model cost - cheap model cost)

Example:
  Router call (gpt-4o-mini): ~$0.001 per query
  Cheap model response:       ~$0.003 per query
  Expensive model response:   ~$0.05 per query

  If 70% of queries route to the cheap model:
    With routing:    0.70 * ($0.001 + $0.003) + 0.30 * ($0.001 + $0.05) = $0.018
    Without routing: 1.00 * $0.05 = $0.05
    Savings: 64%

The math works as long as the router is cheap and a meaningful fraction of queries route to the cheaper model. In many applications a majority of queries are simple enough for a small model, but measure your own traffic before assuming so.
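
The example can be turned into a small calculator. This is a sketch under the same simplifying assumptions as the figures above (fixed per-query costs, every query pays the router); the function names are illustrative. Routing pays off when the router cost is less than the cheap share times the price gap, which gives a break-even share of router / (expensive - cheap).

```python
def cost_per_query(router, cheap, expensive, cheap_share):
    """Expected cost per query when every query pays the router first."""
    return router + cheap_share * cheap + (1.0 - cheap_share) * expensive


def break_even_share(router, cheap, expensive):
    """Smallest cheap-model share at which routing stops losing money."""
    if expensive <= cheap:
        raise ValueError("expensive must cost more than cheap")
    return router / (expensive - cheap)


ROUTER, CHEAP, EXPENSIVE = 0.001, 0.003, 0.05  # per-query figures from the example

routed = cost_per_query(ROUTER, CHEAP, EXPENSIVE, cheap_share=0.70)
threshold = break_even_share(ROUTER, CHEAP, EXPENSIVE)

print(f"Cost with routing: ${routed:.4f} per query")
print(f"Savings vs. no routing: {1 - routed / EXPENSIVE:.0%}")
print(f"Break-even cheap share: {threshold:.1%}")
```

With these numbers the break-even share is only about 2%, so cost is rarely the binding constraint; the added latency of the router call, which this sketch ignores, usually matters more.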

Embedding-Based Routing

Instead of an LLM classifier, use embeddings to route queries. This is faster and cheaper than an LLM call.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Pre-compute embeddings for example queries of each complexity level
SIMPLE_EXAMPLES = [
    "What are your hours?",
    "How do I reset my password?",
    "What is the return policy?",
    "Hi there",
    "Thanks for your help",
]

COMPLEX_EXAMPLES = [
    "Analyze this contract and identify all liability clauses with their implications",
    "Compare these three architectural approaches and recommend one with tradeoffs",
    "Debug this code and explain why the race condition occurs",
    "Write a detailed technical specification for a distributed caching system",
]

simple_centroid = np.mean(model.encode(SIMPLE_EXAMPLES), axis=0)
complex_centroid = np.mean(model.encode(COMPLEX_EXAMPLES), axis=0)

def route_by_embedding(query):
    query_emb = model.encode(query)
    
    simple_sim = np.dot(query_emb, simple_centroid) / (
        np.linalg.norm(query_emb) * np.linalg.norm(simple_centroid)
    )
    complex_sim = np.dot(query_emb, complex_centroid) / (
        np.linalg.norm(query_emb) * np.linalg.norm(complex_centroid)
    )
    
    if simple_sim > complex_sim + 0.1:  # require a clear margin; ambiguous queries go to the stronger model
        return "gpt-4o-mini"
    else:
        return "gpt-4o"

Embedding-based routing runs locally with no API call. With a small model such as all-MiniLM-L6-v2, encoding a short query typically takes under 5ms on modern hardware, so the router adds negligible latency and cost.
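
The 0.1 margin in `route_by_embedding` is a tunable, not a constant. One way to choose it, assuming you have a small set of queries labeled by hand, is to sweep candidate margins and look at how much traffic goes to the cheap model versus how many complex queries get misrouted. The sketch below is illustrative: `sweep_margin` is a hypothetical helper, and the similarity values are invented toy data, not measurements.

```python
def sweep_margin(samples, margins):
    """For each margin, report how much traffic goes cheap and how many
    complex queries are misrouted to the cheap model.

    samples: list of (simple_sim, complex_sim, is_complex) tuples.
    """
    total_complex = sum(1 for _, _, is_complex in samples if is_complex)
    rows = []
    for margin in margins:
        # Same rule as route_by_embedding: cheap only if simple wins by the margin.
        routed_cheap = [s for s in samples if s[0] > s[1] + margin]
        misrouted = sum(1 for _, _, is_complex in routed_cheap if is_complex)
        rows.append({
            "margin": margin,
            "cheap_share": len(routed_cheap) / len(samples),
            "complex_misrouted": misrouted / total_complex if total_complex else 0.0,
        })
    return rows


# Toy values invented for illustration; replace with similarities
# computed on your own labeled queries.
SAMPLES = [
    (0.62, 0.20, False),
    (0.55, 0.30, False),
    (0.41, 0.35, False),
    (0.40, 0.33, True),
    (0.25, 0.58, True),
    (0.18, 0.66, True),
]

rows = sweep_margin(SAMPLES, margins=[0.0, 0.1, 0.2])
for row in rows:
    print(row)
```

A larger margin sends less traffic to the cheap model but misroutes fewer hard queries; the right point depends on how costly a bad answer is in your application.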

Heuristic Routing

Sometimes the simplest router is the best one:

def route_by_heuristics(user_message, context=None):
    """Route based on simple heuristics. No LLM needed."""
    # Messages with attached documents need more capability,
    # even when the message text itself is short ("summarize this")
    if context and context.get("has_attachment"):
        return "gpt-4o"
    
    # Very short messages are almost always simple
    if len(user_message) < 50:
        return "gpt-4o-mini"
    
    # Messages asking for analysis or comparison
    complex_keywords = ["analyze", "compare", "evaluate", "recommend",
                        "debug", "explain why", "tradeoff", "architecture"]
    lowered = user_message.lower()
    if any(kw in lowered for kw in complex_keywords):
        return "gpt-4o"
    
    # Default to cheap
    return "gpt-4o-mini"

Heuristic routing is fast, free, deterministic, and surprisingly effective. It will not catch every case, but combined with a fallback strategy it does not need to.

Fallback Chains

The most practical pattern: try the cheap model first, fall back to the expensive model when confidence is low or the response is bad.

class FallbackChain:
    def __init__(self, client):
        self.client = client
        self.models = [
            {"name": "gpt-4o-mini", "max_tokens": 1024},
            {"name": "gpt-4o", "max_tokens": 2048},
        ]
    
    def respond(self, messages, quality_check=None):
        """Try models in order, fall back on failure or low quality."""
        last_index = len(self.models) - 1
        for i, model_config in enumerate(self.models):
            try:
                response = self.client.chat.completions.create(
                    model=model_config["name"],
                    messages=messages,
                    max_tokens=model_config["max_tokens"],
                )
            except Exception:
                if i == last_index:
                    raise  # last model failed, propagate error
                continue  # try the next model
            
            # content can be None (e.g. on a refusal); treat that as empty
            result = response.choices[0].message.content or ""
            
            # If a quality check is provided, verify the response.
            # The last model's answer is returned even if the check fails,
            # since there is nothing better to fall back to.
            if quality_check and i < last_index and not quality_check(result):
                continue  # try the next model
            
            return result, model_config["name"]
        
        raise RuntimeError("No models configured")

# Quality check: is the response long enough? Does it contain expected structure?
def check_response_quality(response):
    if len(response) < 20:
        return False  # suspiciously short
    if "I don't know" in response or "I cannot" in response:
        return False  # model punted
    return True

messages = [{"role": "user", "content": "How do I reset my password?"}]
chain = FallbackChain(client)
answer, model_used = chain.respond(messages, quality_check=check_response_quality)

Confidence-Based Fallback

Some tasks produce a natural confidence signal. For classification, ask the model to report a confidence score alongside its label and escalate when the score is low:

def classify_with_fallback(text, client):
    """Classify with cheap model, fall back to expensive if uncertain."""
    
    class Classification(BaseModel):
        category: str
        confidence: float  # 0.0 to 1.0
    
    # Try cheap model first
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify this support ticket. Include your confidence from 0.0 to 1.0."},
            {"role": "user", "content": text}
        ],
        response_format=Classification,
    )
    result = response.choices[0].message.parsed
    
    # Fall back to expensive model if confidence is low
    if result.confidence < 0.8:
        response = client.beta.chat.completions.parse(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Classify this support ticket. Include your confidence from 0.0 to 1.0."},
                {"role": "user", "content": text}
            ],
            response_format=Classification,
        )
        result = response.choices[0].message.parsed
    
    return result

Note: LLM self-reported confidence is not well-calibrated. A model that says it is 90% confident is not necessarily right 90% of the time. But it is a useful signal for routing — low confidence correlates with harder queries, even if the absolute numbers are not reliable.
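
One way to see how far self-reported confidence can be trusted, and to pick a threshold with evidence, is to compare reported confidence against actual accuracy on a labeled sample. The sketch below is a minimal illustration: `calibration_table` is a hypothetical helper, and the records are invented toy data rather than real model output.

```python
from collections import defaultdict


def calibration_table(records, n_buckets=10):
    """Group predictions by reported confidence and compute accuracy per bucket.

    records: iterable of (confidence, was_correct) pairs from a labeled sample.
    Returns a list of (bucket_start, count, accuracy), sorted by bucket_start.
    """
    buckets = defaultdict(list)
    for confidence, was_correct in records:
        # The small epsilon guards against float error at bucket boundaries;
        # the clamp puts a confidence of exactly 1.0 in the top bucket.
        index = min(int(confidence * n_buckets + 1e-9), n_buckets - 1)
        buckets[index].append(bool(was_correct))
    return [
        (index / n_buckets, len(hits), sum(hits) / len(hits))
        for index, hits in sorted(buckets.items())
    ]


# Toy records invented for illustration: (reported confidence, was the label right?)
RECORDS = [
    (0.95, True), (0.92, True), (0.90, False),
    (0.85, True), (0.82, False),
    (0.60, False), (0.55, True),
]

table = calibration_table(RECORDS)
for bucket_start, count, accuracy in table:
    print(f"confidence >= {bucket_start:.1f}: n={count}, accuracy={accuracy:.2f}")
```

If accuracy in the 0.8 bucket is much lower than in the 0.9 bucket on real data, that suggests raising the fallback threshold; if the buckets look similar, the confidence score is carrying little information.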

Multi-Provider Fallback

Do not depend on a single LLM provider. APIs go down. Rate limits hit. Build a provider abstraction that tries your primary provider first and falls back to a secondary. Maintain a list of provider configurations (client, model name, API format), iterate through them on failure, and raise only when all providers are exhausted. This is production hygiene, not premature optimization.
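
The provider loop described above can be sketched as follows. This is a minimal illustration, not a production client: `complete_with_fallback` and `AllProvidersFailed` are hypothetical names, and each provider is represented by a plain callable that hides its own client, model name, and request format. The stand-in providers let the sketch run offline.

```python
class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider has failed."""


def complete_with_fallback(providers, messages):
    """Try each provider in order and return (text, provider_name).

    providers: list of (name, call) pairs, where call(messages) returns the
    response text or raises on failure.
    """
    errors = []
    for name, call in providers:
        try:
            return call(messages), name
        except Exception as exc:  # in production, catch the SDK's specific errors
            errors.append(f"{name}: {exc!r}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-in providers. Real ones would wrap the respective SDK calls.
def primary(messages):
    raise ConnectionError("primary provider is down")


def secondary(messages):
    return "Hello from the secondary provider"


text, used = complete_with_fallback(
    [("primary", primary), ("secondary", secondary)],
    [{"role": "user", "content": "Hi"}],
)
print(used, "->", text)
```

Collecting the per-provider errors before raising makes it easier to tell a full outage from a misconfiguration when everything fails.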

Common Pitfalls

Routing overhead exceeds savings. If your router uses the same expensive model you are trying to avoid, you have gained nothing. Use the cheapest possible classifier or heuristic for routing.
Over-routing to the cheap model. Aggressive cost optimization leads to quality degradation on queries that needed the better model. Monitor user satisfaction by route, not just overall.
No fallback strategy. Routing without fallbacks means misrouted queries get bad answers with no recovery. Always have a path from the cheap model to the expensive one.
Hardcoded thresholds. A confidence threshold of 0.8 might be right today and wrong next month as your query distribution changes. Monitor and adjust regularly.
Single provider dependency. If 100% of your LLM traffic goes through one provider and they have an outage, your product is down. Multi-provider fallback is production hygiene, not over-engineering.
Not A/B testing the router. Your routing logic is itself a decision system that can be wrong. A/B test routed traffic against always-expensive-model traffic to verify that routing does not hurt quality.
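
Several of these pitfalls come down to tracking cost and quality per route rather than in aggregate. A minimal in-memory sketch of that bookkeeping is shown below; `RouteStats` is a hypothetical helper, the thumbs-up signal stands in for whatever satisfaction metric you collect, and a real system would write to a metrics store instead.

```python
from collections import defaultdict


class RouteStats:
    """Accumulate cost and quality signals per model route."""

    def __init__(self):
        self._rows = defaultdict(
            lambda: {"queries": 0, "cost": 0.0, "positive": 0, "rated": 0}
        )

    def record(self, model, cost, thumbs_up=None):
        row = self._rows[model]
        row["queries"] += 1
        row["cost"] += cost
        if thumbs_up is not None:  # not every query gets user feedback
            row["rated"] += 1
            row["positive"] += int(thumbs_up)

    def summary(self):
        out = {}
        for model, row in self._rows.items():
            out[model] = {
                "queries": row["queries"],
                "avg_cost": row["cost"] / row["queries"],
                # None means no feedback was collected for this route.
                "satisfaction": row["positive"] / row["rated"] if row["rated"] else None,
            }
        return out


stats = RouteStats()
stats.record("gpt-4o-mini", cost=0.004, thumbs_up=True)
stats.record("gpt-4o-mini", cost=0.004, thumbs_up=False)
stats.record("gpt-4o", cost=0.051)
print(stats.summary())
```

Comparing satisfaction between routes over time is what reveals over-routing to the cheap model or a threshold that has drifted out of date.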

Key Takeaways

Most LLM queries do not need the most expensive model. Route simple queries to cheap models and complex queries to capable ones. Savings of 50-80% are achievable when most traffic is simple.
Three routing approaches: LLM classifier (most accurate, adds latency), embedding-based (fast, no API call), heuristic (simplest, surprisingly effective). Start with heuristics, upgrade if needed.
Always implement fallbacks. Try the cheap model, fall back to the expensive one on low confidence or failure. This is the highest-leverage pattern for cost optimization with quality preservation.
Build multi-provider fallback as production infrastructure. Single-provider dependency is a reliability risk.
Monitor routing decisions, costs, and quality per model tier. You cannot optimize routing without data on how each tier performs.
The router itself must be cheap. An expensive routing step defeats the purpose.