Routing & Fallbacks
Not every query needs your most expensive model. A user asking "what are your business hours?" does not require GPT-4o. A user asking "analyze this 50-page contract and identify all liability clauses" does. Routing sends each query to the cheapest model that can handle it, and fallbacks escalate to a stronger model when the first choice errors out or returns a poor answer. Together, these patterns can cut LLM costs by 50-80%, with little or no quality loss when the router and fallbacks are tuned well.
The Cost Spectrum
| Model | Cost per 1M input tokens | Relative speed | Quality |
|---|---|---|---|
| GPT-4o | $2.50 | Medium | High |
| GPT-4o-mini | $0.15 | Fast | Good |
| Claude Sonnet | $3.00 | Medium | High |
| Claude Haiku | $0.25 | Fast | Good |
| Llama 3 70B (self-hosted) | ~$0.50 (compute) | Medium | Good |
| Llama 3 8B (self-hosted) | ~$0.05 (compute) | Fast | Moderate |
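The input-token price gap between small and large models is roughly 12-17x within a provider family (Claude Sonnet vs. Claude Haiku, GPT-4o vs. GPT-4o-mini) and up to 60x across the whole table. If 70% of your queries can be handled by the cheap model, you recover roughly 70% of that gap on your blended per-query cost.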
Intent Classification for Routing
The most common routing pattern: classify the user's intent, then route to the appropriate model.
from enum import Enum

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


class QueryComplexity(str, Enum):
    simple = "simple"      # FAQ, greetings, simple lookups
    moderate = "moderate"  # summarization, standard analysis
    complex = "complex"    # multi-step reasoning, long document analysis


class RouteDecision(BaseModel):
    complexity: QueryComplexity
    reasoning: str


def classify_query(user_message):
    """Classify query complexity to determine routing."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # use the cheap model for routing itself
        messages=[
            {"role": "system", "content": """Classify the complexity of this user query.
simple: Greetings, FAQs, simple factual questions, yes/no questions.
moderate: Summarization, standard analysis, comparisons, explanations.
complex: Multi-step reasoning, long document analysis, code generation,
creative writing, nuanced judgment calls."""},
            {"role": "user", "content": user_message},
        ],
        response_format=RouteDecision,
    )
    return response.choices[0].message.parsed


MODEL_MAP = {
    QueryComplexity.simple: "gpt-4o-mini",
    QueryComplexity.moderate: "gpt-4o-mini",
    QueryComplexity.complex: "gpt-4o",
}


def route_and_respond(user_message, system_prompt):
    route = classify_query(user_message)
    model = MODEL_MAP[route.complexity]
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content, model
The Router Tax
The classification call itself costs money and adds latency. For this to be worthwhile:
Average cost per query with routing (router call + whichever model answers) < cost per query of always using the expensive model
Example:
Router call (gpt-4o-mini): ~$0.001 per query
Cheap model response: ~$0.003 per query
Expensive model response: ~$0.05 per query
If 70% of queries route to the cheap model:
With routing: 0.70 * ($0.001 + $0.003) + 0.30 * ($0.001 + $0.05) = $0.018
Without routing: 1.00 * $0.05 = $0.05
Savings: 64%
The math works as long as the router is cheap and a meaningful fraction of queries route to the cheaper model. In many applications a majority of queries are simple enough for a small model, but measure your own traffic before assuming so.
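The same arithmetic can be wrapped in a small helper to try other traffic mixes. This is a minimal sketch: the function names `blended_cost` and `break_even_cheap_fraction` are hypothetical, and the per-query costs are the example figures above, not measured prices.

```python
def blended_cost(cheap_fraction, router_cost, cheap_cost, expensive_cost):
    """Average cost per query when every query pays the router fee."""
    return (
        cheap_fraction * (router_cost + cheap_cost)
        + (1 - cheap_fraction) * (router_cost + expensive_cost)
    )


def break_even_cheap_fraction(router_cost, cheap_cost, expensive_cost):
    """Share of cheap-routed queries at which routing pays for itself.

    Solves router + f * cheap + (1 - f) * expensive = expensive for f.
    """
    return router_cost / (expensive_cost - cheap_cost)


# Example figures from the text (assumed, not measured)
routed = blended_cost(0.70, 0.001, 0.003, 0.05)
savings = 1 - routed / 0.05
break_even = break_even_cheap_fraction(0.001, 0.003, 0.05)

print(f"routed cost per query: ${routed:.4f}")
print(f"savings vs always-expensive: {savings:.0%}")
print(f"break-even cheap fraction: {break_even:.1%}")
```

With these figures the break-even point is low: routing pays for itself once about 2% of queries go to the cheap model. A more expensive router raises that bar proportionally.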
Embedding-Based Routing
Instead of an LLM classifier, use embeddings to route queries. This is faster and cheaper than an LLM call.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Pre-compute embeddings for example queries of each complexity level
SIMPLE_EXAMPLES = [
    "What are your hours?",
    "How do I reset my password?",
    "What is the return policy?",
    "Hi there",
    "Thanks for your help",
]
COMPLEX_EXAMPLES = [
    "Analyze this contract and identify all liability clauses with their implications",
    "Compare these three architectural approaches and recommend one with tradeoffs",
    "Debug this code and explain why the race condition occurs",
    "Write a detailed technical specification for a distributed caching system",
]

simple_centroid = np.mean(embedder.encode(SIMPLE_EXAMPLES), axis=0)
complex_centroid = np.mean(embedder.encode(COMPLEX_EXAMPLES), axis=0)


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def route_by_embedding(query):
    query_emb = embedder.encode(query)
    simple_sim = cosine_similarity(query_emb, simple_centroid)
    complex_sim = cosine_similarity(query_emb, complex_centroid)
    # Require a 0.1 margin before choosing the cheap model, so borderline
    # queries go to the stronger (safer) model
    if simple_sim > complex_sim + 0.1:
        return "gpt-4o-mini"
    return "gpt-4o"
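Embedding-based routing runs locally with no API call. With a small model such as all-MiniLM-L6-v2, encoding a short query typically takes milliseconds (the exact figure depends on your hardware), so the added latency and cost are negligible next to the LLM call itself.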
Heuristic Routing
Sometimes the simplest router is the best one:
def route_by_heuristics(user_message, context=None):
    """Route based on simple heuristics. No LLM needed."""
    # Messages with attached documents need more capability, however short
    # the accompanying text is, so check this first
    if context and context.get("has_attachment"):
        return "gpt-4o"

    # Very short messages are almost always simple
    if len(user_message) < 50:
        return "gpt-4o-mini"

    # Messages asking for analysis or comparison
    complex_keywords = ["analyze", "compare", "evaluate", "recommend",
                        "debug", "explain why", "tradeoff", "architecture"]
    lowered = user_message.lower()
    if any(kw in lowered for kw in complex_keywords):
        return "gpt-4o"

    # Default to cheap
    return "gpt-4o-mini"
Heuristic routing is fast, free, deterministic, and surprisingly effective. It will not catch every case, but combined with a fallback strategy it does not need to.
Fallback Chains
The most practical pattern: try the cheap model first, fall back to the expensive model when confidence is low or the response is bad.
class FallbackChain:
    def __init__(self, client):
        self.client = client
        # Ordered from cheapest to most capable
        self.models = [
            {"name": "gpt-4o-mini", "max_tokens": 1024},
            {"name": "gpt-4o", "max_tokens": 2048},
        ]

    def respond(self, messages, quality_check=None):
        """Try models in order, fall back on failure or low quality."""
        for i, model_config in enumerate(self.models):
            is_last = i == len(self.models) - 1
            try:
                response = self.client.chat.completions.create(
                    model=model_config["name"],
                    messages=messages,
                    max_tokens=model_config["max_tokens"],
                )
            except Exception:
                if is_last:
                    raise  # last model failed, propagate error
                continue  # try the next model

            result = response.choices[0].message.content
            # If a quality check is provided, verify the response
            if quality_check and not quality_check(result):
                continue  # try the next model
            return result, model_config["name"]

        # Reached when the last model's response failed the quality check
        raise RuntimeError("No model produced a response that passed the quality check")


# Quality check: is the response long enough? Did the model decline to answer?
def check_response_quality(response):
    if not response or len(response) < 20:
        return False  # empty or suspiciously short
    if "I don't know" in response or "I cannot" in response:
        return False  # model punted
    return True


chain = FallbackChain(client)
# Example input; any chat-format message list works here
messages = [{"role": "user", "content": "Summarize our refund policy in two sentences."}]
answer, model_used = chain.respond(messages, quality_check=check_response_quality)
Confidence-Based Fallback
Some tasks produce a natural confidence signal. For classification, ask the model to report a confidence score alongside its answer:
def classify_with_fallback(text, client):
    """Classify with cheap model, fall back to expensive if uncertain."""

    class Classification(BaseModel):
        category: str
        confidence: float  # 0.0 to 1.0

    messages = [
        {"role": "system", "content": "Classify this support ticket. Include your confidence from 0.0 to 1.0."},
        {"role": "user", "content": text},
    ]

    def classify(model):
        response = client.beta.chat.completions.parse(
            model=model,
            messages=messages,
            response_format=Classification,
        )
        return response.choices[0].message.parsed

    # Try cheap model first
    result = classify("gpt-4o-mini")
    # Fall back to expensive model if confidence is low
    if result.confidence < 0.8:
        result = classify("gpt-4o")
    return result
Note: LLM self-reported confidence is not well-calibrated. A model that says it is 90% confident is not necessarily right 90% of the time. It is still a useful routing signal: low confidence correlates with harder queries, even if the absolute numbers are not reliable.
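One way to see how far reported confidence drifts from reality is to bucket a labeled evaluation set by reported confidence and compare each bucket's accuracy. The sketch below assumes you have collected (confidence, was_correct) pairs from your own evaluation runs; the function name `accuracy_by_confidence_bucket` and the sample records are hypothetical.

```python
from collections import defaultdict


def accuracy_by_confidence_bucket(records, n_buckets=5):
    """Group (confidence, was_correct) pairs into equal-width buckets.

    Returns {bucket_lower_bound: (accuracy, count)}, lowest bucket first.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket index -> [correct, count]
    for confidence, was_correct in records:
        # Clamp so that confidence == 1.0 lands in the top bucket
        idx = min(int(confidence * n_buckets), n_buckets - 1)
        totals[idx][0] += int(was_correct)
        totals[idx][1] += 1
    return {
        idx / n_buckets: (correct / count, count)
        for idx, (correct, count) in sorted(totals.items())
    }


# Made-up evaluation records, for illustration only
records = [
    (0.95, True), (0.90, True), (0.85, False),
    (0.55, True), (0.50, False),
    (0.30, False),
]
for lower_bound, (accuracy, count) in accuracy_by_confidence_bucket(records).items():
    print(f"confidence >= {lower_bound:.1f}: accuracy {accuracy:.0%} over {count} samples")
```

If a bucket's observed accuracy sits well below its reported confidence, the model is overconfident in that range, and the fallback threshold is better chosen from observed accuracy than from the raw number.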
Multi-Provider Fallback
Do not depend on a single LLM provider. APIs go down. Rate limits hit. Build a provider abstraction that tries your primary provider first and falls back to a secondary. Maintain a list of provider configurations (client, model name, API format), iterate through them on failure, and raise only when all providers are exhausted. This is production hygiene, not premature optimization.
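The provider loop described above might look like the following. This is a sketch under stated assumptions, not a definitive implementation: `Provider`, `AllProvidersFailed`, and `complete_with_fallback` are hypothetical names, and each provider is represented by a plain callable that you would replace with a thin adapter around the real SDK client. The stand-in adapters here let the example run without network access.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Provider:
    name: str
    # Hypothetical adapter: takes chat messages, returns the reply text
    complete: Callable[[list], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider has errored."""


def complete_with_fallback(providers, messages):
    """Try each provider in order; return (reply, provider_name) on first success."""
    errors = []
    for provider in providers:
        try:
            return provider.complete(messages), provider.name
        except Exception as exc:  # in practice, catch the SDK's specific error types
            errors.append(f"{provider.name}: {exc!r}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-in adapters that simulate an outage and a healthy provider
def flaky_primary(messages):
    raise TimeoutError("primary provider timed out")


def stable_secondary(messages):
    return "reply from secondary"


providers = [
    Provider("primary", flaky_primary),
    Provider("secondary", stable_secondary),
]
reply, used = complete_with_fallback(providers, [{"role": "user", "content": "Hi"}])
print(used, "->", reply)
```

Collecting each provider's error before raising keeps the final exception useful for debugging, since it shows why every option failed and not only the last one.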
Common Pitfalls
- Routing overhead exceeds savings. If your router uses the same expensive model you are trying to avoid, you have gained nothing. Use the cheapest possible classifier or heuristic for routing.
- Over-routing to the cheap model. Aggressive cost optimization leads to quality degradation on queries that needed the better model. Monitor user satisfaction by route, not just overall.
- No fallback strategy. Routing without fallbacks means misrouted queries get bad answers with no recovery. Always have a path from the cheap model to the expensive one.
- Hardcoded thresholds. A confidence threshold of 0.8 might be right today and wrong next month as your query distribution changes. Monitor and adjust regularly.
- Single provider dependency. If 100% of your LLM traffic goes through one provider and they have an outage, your product is down. Multi-provider fallback is production hygiene, not over-engineering.
- Not A/B testing the router. Your routing logic is itself a model decision. A/B test routed traffic vs always-expensive-model traffic to verify that routing does not hurt quality.
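Two of the pitfalls above, monitoring by route and A/B testing the router, need only a little bookkeeping. The sketch below is illustrative: `assign_arm` and `RouteMetrics` are hypothetical names, the 10% control share is an arbitrary default, and a thumbs-up flag stands in for whatever satisfaction signal your product collects.

```python
import hashlib
from collections import defaultdict


def assign_arm(user_id, control_share=0.1):
    """Deterministically place a user in 'control' (always expensive) or 'routed'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "control" if bucket < control_share else "routed"


class RouteMetrics:
    """Accumulate cost and satisfaction per (arm, model) pair."""

    def __init__(self):
        self._rows = defaultdict(lambda: {"count": 0, "cost": 0.0, "thumbs_up": 0})

    def record(self, arm, model, cost, thumbs_up):
        row = self._rows[(arm, model)]
        row["count"] += 1
        row["cost"] += cost
        row["thumbs_up"] += int(thumbs_up)

    def summary(self):
        return {
            key: {
                "count": row["count"],
                "avg_cost": row["cost"] / row["count"],
                "satisfaction": row["thumbs_up"] / row["count"],
            }
            for key, row in self._rows.items()
        }


metrics = RouteMetrics()
metrics.record("routed", "gpt-4o-mini", cost=0.003, thumbs_up=True)
metrics.record("routed", "gpt-4o-mini", cost=0.005, thumbs_up=False)
metrics.record("control", "gpt-4o", cost=0.05, thumbs_up=True)
for key, stats in metrics.summary().items():
    print(key, stats)
```

Hashing the user ID keeps each user in the same arm across sessions, so satisfaction differences between arms are not muddied by users switching back and forth.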
Key Takeaways
- Most LLM queries do not need the most expensive model. Route simple queries to cheap models and complex queries to capable ones. Typical savings are 50-80%.
- Three routing approaches: LLM classifier (most accurate, adds latency), embedding-based (fast, no API call), heuristic (simplest, surprisingly effective). Start with heuristics, upgrade if needed.
- Always implement fallbacks. Try the cheap model, fall back to the expensive one on low confidence or failure. This is the highest-leverage pattern for cost optimization with quality preservation.
- Build multi-provider fallback as production infrastructure. Single-provider dependency is a reliability risk.
- Monitor routing decisions, costs, and quality per model tier. You cannot optimize routing without data on how each tier performs.
- The router itself must be cheap. An expensive routing step defeats the purpose.