Model Serving
A model in a Jupyter notebook is a prototype. A model behind an API serving thousands of requests per second is a product. The gap between these two states is where most ML projects die. Model serving is the discipline of getting a trained model into production where it can receive inputs, produce predictions, and do so reliably, cheaply, and fast enough for your use case.
The Serving Spectrum
Not every model needs the same serving strategy. The first question is latency: does the user need a response in milliseconds, or is it fine to process results overnight?
Real-time serving: <100ms response time. User is waiting.
    Examples: search ranking, fraud detection, autocomplete.
Near real-time: 100ms - 5s. User is waiting but tolerant.
    Examples: content moderation, recommendation on page load.
Batch: Minutes to hours. No user waiting.
    Examples: nightly report generation, bulk classification, email campaign personalization.
Batch serving is simpler. Run inference on a dataset, write results to a database, serve the precomputed results. Real-time serving requires an always-on service that can handle concurrent requests with consistent latency.
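The batch pattern can be sketched in a few lines. This is a minimal illustration, not a production job: `toy_model` is a hypothetical stand-in for a real model, and the SQLite table name and schema are assumptions chosen for the example.

```python
import sqlite3


def toy_model(text: str) -> tuple[str, float]:
    """Hypothetical stand-in for a real model: labels text by keyword."""
    positive = "good" in text.lower()
    return ("positive", 0.9) if positive else ("negative", 0.6)


def run_batch_job(conn: sqlite3.Connection, rows: list[tuple[int, str]]) -> None:
    """Score every row offline and store the predictions for later lookup."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions "
        "(item_id INTEGER PRIMARY KEY, label TEXT, confidence REAL)"
    )
    scored = [(item_id, *toy_model(text)) for item_id, text in rows]
    conn.executemany("INSERT OR REPLACE INTO predictions VALUES (?, ?, ?)", scored)
    conn.commit()


def lookup(conn: sqlite3.Connection, item_id: int):
    """Serve a precomputed prediction; no model call happens at request time."""
    return conn.execute(
        "SELECT label, confidence FROM predictions WHERE item_id = ?", (item_id,)
    ).fetchone()


conn = sqlite3.connect(":memory:")
run_batch_job(conn, [(1, "Good service"), (2, "Slow and buggy")])
print(lookup(conn, 1))
```

The request path only reads from the table, so its latency does not depend on the model at all. The cost is staleness: predictions are only as fresh as the last job run.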
REST API Serving
The most common pattern. Wrap your model in an HTTP endpoint.
# FastAPI model server - the practical default
from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

app = FastAPI()

# Load model once at startup, not per request
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
model.eval()

class PredictionRequest(BaseModel):
    text: str

class PredictionResponse(BaseModel):
    label: str
    confidence: float

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    inputs = tokenizer(request.text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    confidence, predicted = torch.max(probs, dim=-1)
    label = "positive" if predicted.item() == 1 else "negative"
    return PredictionResponse(label=label, confidence=confidence.item())
This works for most teams. FastAPI gives you automatic OpenAPI docs, request validation, and async support. For higher throughput, run multiple workers behind a load balancer.
gRPC Serving
When REST is too slow, gRPC is the usual next step. It uses Protocol Buffers (binary serialization) and HTTP/2 (multiplexed connections), which reduce latency and bandwidth. It is common in internal microservice architectures where the model server talks to other services, not directly to users.
REST (JSON): ~1-5ms serialization overhead per request
gRPC (protobuf): ~0.1-0.5ms serialization overhead per request
The difference matters when you are doing thousands of requests per second or when the payload is large (images, long text). For most applications serving <100 requests/second, REST is fine.
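To get a feel for why payload size matters, the sketch below encodes the same feature vector as JSON text and as fixed-width binary. It uses only the standard library: `struct` is a stand-in for a binary wire format and is not actual protobuf, and the 512-value payload is a hypothetical example.

```python
import json
import struct

# Hypothetical payload: a 512-dimensional float feature vector.
features = [i / 512 for i in range(512)]

# Text encoding, as a JSON REST body would carry it.
json_bytes = json.dumps({"features": features}).encode("utf-8")

# Fixed-width binary encoding: 4 bytes per float32, little-endian.
# This only illustrates binary serialization; it is not the protobuf format.
binary_bytes = struct.pack(f"<{len(features)}f", *features)

print(f"JSON:   {len(json_bytes)} bytes")
print(f"binary: {len(binary_bytes)} bytes")
```

The binary form is always 4 bytes per value, while the JSON size depends on how many digits each number prints with. The gap grows with payload size, which is why large inputs such as images or long embeddings benefit most.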
GPU vs CPU Inference
GPUs are fast for inference but expensive. CPUs are slower but cheaper and easier to manage.
Model Type                     CPU Latency  GPU Latency  Recommendation
Small classifier (DistilBERT)  10-50ms      2-10ms       CPU is fine
Medium model (BERT-large)      50-200ms     5-20ms       GPU if latency matters
Large model (Llama 7B)         2-10s        100-500ms    GPU required
Very large model (Llama 70B)   Minutes      1-5s         Multiple GPUs required
Embedding model                5-30ms       1-5ms        CPU for low volume, GPU for batch
The math usually works out like this: a single GPU instance (g5.xlarge on AWS, ~$1/hr) replaces 5-10 CPU instances for inference on medium models. But GPU instances have less flexible scaling — you cannot scale to 0 easily, and cold starts take longer.
When to Use CPU
- Small models (decision trees, linear models, small transformers)
- Low request volume (<10 requests/second)
- Cost-sensitive applications where latency is not critical
- Serverless deployments (Lambda, Cloud Functions)
When to Use GPU
- Large transformer models
- High throughput requirements (>50 requests/second)
- Batch inference on large datasets
- Multi-modal models (vision + text)
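The two checklists above can be folded into a small decision helper. This is only a sketch of the rules of thumb: the function name and the parameter-count cutoffs (100M for "small", 1,000M for "large") are illustrative assumptions, while the 50 requests/second threshold comes from the list above.

```python
def suggest_device(params_millions: float, requests_per_second: float,
                   latency_critical: bool, serverless: bool = False) -> str:
    """Return "cpu" or "gpu" using the rules of thumb above.

    The parameter-count cutoffs are illustrative assumptions,
    not measured thresholds.
    """
    if params_millions >= 1_000:      # large transformer: GPU required
        return "gpu"
    if serverless:                    # Lambda / Cloud Functions: CPU only
        return "cpu"
    if requests_per_second > 50:      # high throughput requirement
        return "gpu"
    if params_millions <= 100:        # small model, modest traffic
        return "cpu"
    # Medium model: a GPU pays off only when latency matters
    return "gpu" if latency_critical else "cpu"


print(suggest_device(66, 5, latency_critical=False))     # DistilBERT-sized, low traffic
print(suggest_device(7_000, 1, latency_critical=False))  # 7B-parameter LLM
```

Treat the output as a starting point for a benchmark, not a verdict; measured latency on your own hardware should override any heuristic.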
Quantization
Quantization reduces model weights from 32-bit floats to lower precision, making inference faster and cheaper, usually at a small cost in quality.
# ONNX Runtime quantization - one of the most practical approaches
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export to ONNX and quantize
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    export=True
)
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False)  # dynamic quantization
quantizer.quantize(save_dir="model_quantized/", quantization_config=qconfig)
Precision  Size Reduction  Speed Improvement  Quality Loss
FP32       Baseline        Baseline           None
FP16       2x smaller      1.5-2x faster      Negligible
INT8       4x smaller      2-4x faster        Small (<1% accuracy)
INT4       8x smaller      3-5x faster        Moderate (1-3% accuracy)
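To make the precision trade-off concrete, here is a toy version of INT8 quantization in pure Python. It assumes the simplest scheme (symmetric, one scale per tensor, round to nearest); real toolchains such as ONNX Runtime use more elaborate variants, so treat this as an illustration of the idea and not of any library's implementation. The function names and weights are made up for the example.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.003, 0.88, -0.15]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each weight is stored in one byte instead of four, and the round-trip error is bounded by half the scale. The smallest weight (0.003) rounds to zero, which shows where the quality loss in the table comes from.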
For LLMs specifically, quantization is practically mandatory for self-hosting. Running a 70B parameter model at FP16 requires roughly 140 GB of GPU memory for the weights alone (2 bytes per parameter). At INT4, the weights fit in ~35 GB.
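The memory figures follow directly from parameter count times bytes per weight. The helper below is a rough sketch that counts weights only; it ignores the KV cache, activations, and framework overhead, so real usage will be higher. The function name is an assumption for illustration.

```python
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9


for precision in BITS_PER_WEIGHT:
    print(f"70B at {precision}: {weight_memory_gb(70e9, precision):.0f} GB")
```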
# LLM quantization with llama.cpp (via Python bindings)
# GGUF format with Q4_K_M quantization is the practical default
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-8b-instruct-Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers to GPU
)
response = llm("What is the capital of France?", max_tokens=100)
# The result is a completion dict; the generated text is in the first choice
print(response["choices"][0]["text"])
Self-Hosted vs API Providers
This is less a technical decision and more a business decision.
Factor              Self-Hosted                API Provider
Upfront cost        High (infra, engineering)  Zero
Per-request cost    Low at scale               High at scale
Latency control     Full                       Limited
Data privacy        Full control               Data leaves your network
Scaling             You manage it              Automatic
Model selection     Any model                  Provider's catalog
Ops burden          Significant                Near zero
Time to production  Weeks to months            Hours
The Cost Equation
# Rough cost comparison: self-hosted vs API for a text classification task

# API approach: OpenAI GPT-3.5-turbo for classification
api_cost_per_request = 0.002  # rough figure for ~500 input + 50 output tokens; check current pricing
daily_requests = 100_000
api_monthly_cost = api_cost_per_request * daily_requests * 30  # $6,000/month

# Self-hosted approach: fine-tuned DistilBERT on a g5.xlarge
gpu_instance_hourly = 1.006  # g5.xlarge on-demand
instances_needed = 2  # for redundancy and load
self_hosted_monthly = gpu_instance_hourly * 24 * 30 * instances_needed  # ~$1,449/month

# Self-hosted engineering cost: 1 ML engineer part-time = ~$5,000/month equivalent
engineering_monthly = 5_000
self_hosted_total = self_hosted_monthly + engineering_monthly  # ~$6,449/month

# Break-even: at 100K requests/day, costs are similar ($6,000 vs ~$6,449)
# At 500K requests/day, self-hosted wins by roughly 4x
#   (~$30,000 vs ~$6,449, assuming two instances still handle the load)
# At 10K requests/day, API wins ($600 vs ~$6,449, mostly engineering overhead)
The crossover point depends on volume, model complexity, and engineering capability. Most teams should start with APIs and self-host when cost forces the decision.
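The crossover can be computed directly from the figures above. This is a back-of-the-envelope sketch: it reuses the per-request API cost, instance price, and engineering cost from the comparison, and it assumes the instance count stays fixed as volume grows, which will not hold at very high traffic. The function names are illustrative.

```python
def api_monthly(daily_requests: float, cost_per_request: float = 0.002) -> float:
    """Monthly API bill for a given daily request volume (30-day month)."""
    return cost_per_request * daily_requests * 30


def self_hosted_monthly(instances: int = 2, hourly: float = 1.006,
                        engineering: float = 5_000.0) -> float:
    """Monthly self-hosting cost: always-on instances plus engineering time."""
    return hourly * 24 * 30 * instances + engineering


def break_even_daily_requests(cost_per_request: float = 0.002) -> float:
    """Daily volume at which the API bill equals the self-hosted cost."""
    return self_hosted_monthly() / (cost_per_request * 30)


print(f"Self-hosted: ${self_hosted_monthly():,.0f}/month")
print(f"Break-even:  {break_even_daily_requests():,.0f} requests/day")
for volume in (10_000, 100_000, 500_000):
    print(f"{volume:>7,} req/day -> API ${api_monthly(volume):,.0f}/month")
```

With these inputs the break-even lands just above 100K requests per day. The result is very sensitive to the engineering cost, which is also the hardest input to estimate.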
Serving Frameworks
Several frameworks specialize in model serving:
Framework   Language    Notes
TorchServe  Python      PyTorch native. Good for PyTorch models.
TF Serving  C++         TensorFlow native. Very fast, battle-tested.
Triton      C++/Python  NVIDIA. Multi-framework, GPU-optimized.
vLLM        Python      LLM-specific. PagedAttention for high throughput.
Ollama      Go          Local LLM serving. Simple setup.
BentoML     Python      Framework-agnostic. Good developer experience.
Ray Serve   Python      Distributed serving. Good for complex pipelines.
For LLM serving specifically, vLLM is the current standard. It implements PagedAttention, which improves throughput by managing the attention key-value cache in fixed-size blocks, so less GPU memory is wasted and more requests fit in a batch:
# vLLM serving - the practical choice for LLM inference
from vllm import LLM, SamplingParams

# quantization="awq" only works with an AWQ-quantized checkpoint,
# so it is omitted here for the full-precision weights.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# vLLM handles batching automatically
prompts = ["Explain quantum computing", "Write a haiku about coffee"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
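vLLM batches requests internally. For models served through a plain framework, the same idea can be sketched by hand: group inputs into fixed-size chunks and call the model once per chunk. The names below are hypothetical, and `fake_predict_batch` stands in for a real batched forward pass.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(texts: Sequence[str],
                       predict_batch: Callable[[Sequence[str]], list],
                       batch_size: int = 32) -> list:
    """Call the model once per batch and return results in input order."""
    results: list = []
    for batch in batched(texts, batch_size):
        results.extend(predict_batch(batch))
    return results


calls = []  # records the size of each model call, for illustration only


def fake_predict_batch(batch: Sequence[str]) -> list:
    """Hypothetical model: returns the length of each input text."""
    calls.append(len(batch))
    return [len(text) for text in batch]


lengths = predict_in_batches(["a", "bb", "ccc", "dddd", "eeeee"],
                             fake_predict_batch, batch_size=2)
print(lengths, calls)
```

Five inputs with a batch size of two produce three model calls in place of five. For online traffic the equivalent is a short queue that collects requests for a few milliseconds before running them together, which trades a little latency for throughput.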
Serverless Model Serving
For low-to-medium volume, serverless removes most of the operational burden. The tradeoff is cold starts and limited GPU support.
Platform                GPU Support  Cold Start  Max Timeout
AWS Lambda              No           1-10s       15 min
Google Cloud Functions  No           1-5s        60 min
Modal                   Yes          5-30s       24 hrs
Replicate               Yes          5-60s       Varies
Banana/Baseten          Yes          10-60s      Varies
Modal is particularly practical for ML workloads — it supports GPU instances, persistent volumes for model weights, and scales to zero:
# Modal serverless GPU inference
import modal

app = modal.App("classifier")
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.cls(gpu="T4", image=image)
class Classifier:
    @modal.enter()
    def load_model(self):
        from transformers import pipeline
        self.pipe = pipeline("sentiment-analysis")

    @modal.method()
    def predict(self, text: str):
        return self.pipe(text)[0]
Common Pitfalls
- Loading the model on every request. Load once at startup, predict on every request. This is the most common performance bug in model serving code.
- No health checks or readiness probes. Your model server needs a /health endpoint that verifies the model is loaded and ready. Without it, load balancers send traffic to servers that are still loading models.
- Ignoring cold start latency. If your GPU model takes 30 seconds to load into memory, the first request after scaling up will time out. Pre-warm instances or use persistent serving.
- Not batching inference. Processing one input at a time wastes GPU parallelism. Batch requests together for 3-10x throughput improvement on GPU.
- Premature self-hosting. Teams spend months building serving infrastructure when an API call would have shipped the feature in a day. Start with APIs, optimize later.
- No fallback for model server failures. Your model server will go down. Have a degraded experience (cached results, rule-based fallback, or a simpler model) ready.
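The fallback pitfall is the easiest to guard against in code. The sketch below wraps a model call with a timeout and drops to a rule-based answer on any failure. All names are hypothetical: `healthy_model` and `broken_model` stand in for a real client, the keyword rule is a placeholder, and the half-second timeout is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

_executor = ThreadPoolExecutor(max_workers=4)


def rule_based_fallback(text: str) -> dict:
    """Degraded answer used when the model is unavailable (placeholder rule)."""
    label = "negative" if "refund" in text.lower() else "positive"
    return {"label": label, "source": "fallback"}


def predict_with_fallback(text: str, model_call: Callable[[str], str],
                          timeout_s: float = 0.5) -> dict:
    """Try the model; on timeout or any error, return the fallback answer."""
    future = _executor.submit(model_call, text)
    try:
        return {"label": future.result(timeout=timeout_s), "source": "model"}
    except Exception:
        # Covers both a slow model (timeout) and a failing one (any error).
        return rule_based_fallback(text)


def healthy_model(text: str) -> str:
    return "positive"


def broken_model(text: str) -> str:
    raise ConnectionError("model server down")


print(predict_with_fallback("great product", healthy_model))
print(predict_with_fallback("I want a refund", broken_model))
```

Tagging each response with its source makes it possible to monitor how often the fallback is being served, which is usually the first signal that the model server is unhealthy.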
Key Takeaways
- Match serving strategy to latency requirements. Batch for offline, REST API for most real-time, gRPC for high-throughput internal services.
- Start with API providers (OpenAI, Anthropic, Cohere). Self-host when volume makes the cost equation favor it — typically above 100K-500K requests per day.
- Quantization (INT8, INT4) is not optional for large models. It cuts costs 2-4x with minimal quality loss.
- vLLM is the current standard for self-hosted LLM inference. FastAPI + a transformer model is the standard for everything else.
- Load models once at startup. Implement health checks. Batch inference on GPU. These three things solve 80% of serving performance problems.
- Always have a fallback. Model servers fail. Plan for degraded service, not total outage.