Model Serving

A model in a Jupyter notebook is a prototype. A model behind an API serving thousands of requests per second is a product. The gap between these two states is where most ML projects die. Model serving is the discipline of getting a trained model into production where it can receive inputs, produce predictions, and do so reliably, cheaply, and fast enough for your use case.

The Serving Spectrum

Not every model needs the same serving strategy. The first question is latency: does the user need a response in milliseconds, or is it fine to process results overnight?

Real-time serving:    <100ms response time. User is waiting.
                      Examples: search ranking, fraud detection, autocomplete.

Near real-time:       100ms - 5s. User is waiting but tolerant.
                      Examples: content moderation, recommendation on page load.

Batch:                Minutes to hours. No user waiting.
                      Examples: nightly report generation, bulk classification,
                      email campaign personalization.
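The tiers above can be written down as a small helper. This is only a sketch: the function name is hypothetical, and treating everything slower than 5 seconds as batch is an assumption, since the tiers leave the range between 5 seconds and "minutes" undefined.

```python
def choose_serving_mode(latency_budget_ms: float) -> str:
    """Map a latency budget (in milliseconds) to one of the three tiers above."""
    if latency_budget_ms < 100:
        return "real-time"
    if latency_budget_ms <= 5_000:
        return "near real-time"
    # Assumption: anything slower than 5s is handled as a batch workload.
    return "batch"


print(choose_serving_mode(50))         # search ranking style budget
print(choose_serving_mode(2_000))      # page-load recommendation style budget
print(choose_serving_mode(3_600_000))  # one hour, e.g. a nightly job
```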

Batch serving is simpler. Run inference on a dataset, write results to a database, serve the precomputed results. Real-time serving requires an always-on service that can handle concurrent requests with consistent latency.
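A minimal sketch of that batch pattern, using only the standard library. The score function is a hypothetical stand-in for a real model, and the table and column names are invented for illustration. The point is that the serving path is a key lookup and never calls the model.

```python
import sqlite3
from typing import Optional


def score(text: str) -> float:
    """Hypothetical stand-in for a real model's predict call."""
    return min(1.0, len(text) / 100)


def run_batch_job(conn: sqlite3.Connection, rows: list) -> None:
    """Offline step: score every (item_id, text) row and store the result."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions (item_id TEXT PRIMARY KEY, score REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO predictions VALUES (?, ?)",
        [(item_id, score(text)) for item_id, text in rows],
    )
    conn.commit()


def lookup(conn: sqlite3.Connection, item_id: str) -> Optional[float]:
    """Serving step: read the precomputed score, or None if it was never scored."""
    row = conn.execute(
        "SELECT score FROM predictions WHERE item_id = ?", (item_id,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
run_batch_job(conn, [("a1", "short text"), ("a2", "x" * 250)])
print(lookup(conn, "a1"), lookup(conn, "a2"), lookup(conn, "missing"))
```

In a real deployment the job would run on a schedule and write to a shared database, and the serving layer would need a policy for items that have not been scored yet.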

REST API Serving

The most common pattern. Wrap your model in an HTTP endpoint.

# FastAPI model server - the practical default
from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

app = FastAPI()

# Load model once at startup, not per request
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
model.eval()

class PredictionRequest(BaseModel):
    text: str

class PredictionResponse(BaseModel):
    label: str
    confidence: float

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    inputs = tokenizer(request.text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    confidence, predicted = torch.max(probs, dim=-1)
    label = "positive" if predicted.item() == 1 else "negative"
    return PredictionResponse(label=label, confidence=confidence.item())

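This works for most teams. FastAPI gives you automatic OpenAPI docs, request validation, and async support. Because predict is declared with plain def, FastAPI runs it in a thread pool, so a slow prediction does not block the event loop. For higher throughput, run multiple workers behind a load balancer. Each worker process loads its own copy of the model, so memory use grows with the worker count.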

gRPC Serving

gRPC is the usual next step when REST becomes a bottleneck. It uses Protocol Buffers (binary serialization) and HTTP/2 (multiplexed connections), which reduce latency and bandwidth. It is common in internal microservice architectures where the model server talks to other services, not directly to users.

REST (JSON):       ~1-5ms serialization overhead per request
gRPC (protobuf):   ~0.1-0.5ms serialization overhead per request

The difference matters when you are doing thousands of requests per second or when the payload is large (images, long text). For most applications serving <100 requests/second, REST is fine.
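The payload-size side of this can be illustrated without gRPC installed. The sketch below encodes the same embedding vector as JSON and as fixed-width binary floats. The binary encoding is a stand-in for a protobuf packed float field, not protobuf itself, and the 768-dimension vector is an assumed example payload.

```python
import json
import struct

# Assumed example payload: a 768-dimensional embedding.
embedding = [i / 1000 for i in range(768)]

# Text encoding, as a JSON REST body would carry it.
json_bytes = json.dumps({"embedding": embedding}).encode("utf-8")

# Fixed-width binary encoding: 4 bytes per float32, little-endian.
# This approximates a packed protobuf float field; real protobuf adds a small header.
binary_bytes = struct.pack(f"<{len(embedding)}f", *embedding)

print("JSON bytes:  ", len(json_bytes))
print("binary bytes:", len(binary_bytes))

# Decoding the binary form is a single unpack call, with no text parsing.
decoded = struct.unpack(f"<{len(embedding)}f", binary_bytes)
```

The binary form is smaller and cheaper to parse, at the cost of float32 precision and a schema both sides must agree on.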

GPU vs CPU Inference

GPUs are fast for inference but expensive. CPUs are slower but cheaper and easier to manage.

Model Type                      CPU Latency    GPU Latency    Recommendation
Small classifier (DistilBERT)   10-50ms        2-10ms         CPU is fine
Medium model (BERT-large)       50-200ms       5-20ms         GPU if latency matters
Large model (Llama 7B)          2-10s          100-500ms      GPU required
Very large model (Llama 70B)    Minutes        1-5s           Multiple GPUs required
Embedding model                 5-30ms         1-5ms          CPU for low volume, GPU for batch

The math usually works out like this: a single GPU instance (g5.xlarge on AWS, ~$1/hr) replaces 5-10 CPU instances for inference on medium models. But GPU instances scale less flexibly: you cannot easily scale to zero, and cold starts take longer.
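That comparison can be made explicit. In the sketch below, the GPU price is the g5.xlarge figure quoted in this article, while the $0.17/hr CPU instance price is a hypothetical value chosen only for illustration. The function names are also illustrative.

```python
GPU_HOURLY = 1.006  # g5.xlarge on-demand price quoted in this article


def cheaper_option(cpu_hourly: float, cpu_instances_replaced: int) -> str:
    """Compare one GPU instance against the CPU fleet it would replace."""
    cpu_fleet_hourly = cpu_hourly * cpu_instances_replaced
    return "gpu" if GPU_HOURLY < cpu_fleet_hourly else "cpu"


def break_even_cpu_hourly(cpu_instances_replaced: int) -> float:
    """CPU instance price above which the single GPU instance is cheaper."""
    return GPU_HOURLY / cpu_instances_replaced


# Hypothetical CPU instance price, for illustration only.
CPU_HOURLY = 0.17
for replaced in (5, 10):
    print(
        f"replaces {replaced} CPU instances:",
        cheaper_option(CPU_HOURLY, replaced),
        f"(break-even CPU price ${break_even_cpu_hourly(replaced):.3f}/hr)",
    )
```

Under these assumed prices the answer flips between the low and high end of the 5-10 range, which is why it is worth measuring the replacement ratio on your own model before committing to either.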

When to Use CPU

Small models (decision trees, linear models, small transformers)
Low request volume (<10 requests/second)
Cost-sensitive applications where latency is not critical
Serverless deployments (Lambda, Cloud Functions)

When to Use GPU

Large transformer models
High throughput requirements (>50 requests/second)
Batch inference on large datasets
Multi-modal models (vision + text)
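Much of the GPU throughput advantage depends on batching requests together. A minimal sketch of a micro-batching step follows: wait for one request, then collect more until the batch is full or a short deadline passes. The function names, batch size, and wait time are assumptions for illustration, and predict_batch is a hypothetical stand-in for a batched forward pass.

```python
import queue
import time


def next_batch(requests: queue.Queue, max_batch_size: int, max_wait_s: float) -> list:
    """Block for the first request, then collect more until the batch is
    full or max_wait_s has elapsed since the first request arrived."""
    batch = [requests.get()]  # wait for at least one request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # deadline passed with no new request
    return batch


def predict_batch(texts: list) -> list:
    """Hypothetical stand-in for one batched forward pass."""
    return [len(t) for t in texts]


pending = queue.Queue()
for text in ["a", "bb", "ccc", "dddd", "eeeee"]:
    pending.put(text)

first = next_batch(pending, max_batch_size=4, max_wait_s=0.1)
second = next_batch(pending, max_batch_size=4, max_wait_s=0.1)
print(predict_batch(first), predict_batch(second))
```

A production server would run this loop in a background worker and route each result back to the request that produced it. The wait time is a direct trade between per-request latency and batch size.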

Quantization

Quantization reduces model weights from 32-bit floats to lower precision, making inference faster and cheaper with minimal quality loss.

# ONNX Runtime quantization - one of the most practical approaches
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export to ONNX and quantize
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    export=True
)
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False)  # dynamic quantization
quantizer.quantize(save_dir="model_quantized/", quantization_config=qconfig)

Precision    Size Reduction    Speed Improvement    Quality Loss
FP32         Baseline          Baseline             None
FP16         2x smaller        1.5-2x faster        Negligible
INT8         4x smaller        2-4x faster          Small (<1% accuracy)
INT4         8x smaller        3-5x faster          Moderate (1-3% accuracy)
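To see where the size reduction and the small quality loss come from, here is a sketch of symmetric per-tensor INT8 quantization in plain Python. It is a generic illustration under simplifying assumptions, not what ONNX Runtime does internally, and the weight values are made up.

```python
def quantize_int8(weights: list) -> tuple:
    """Symmetric per-tensor quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list, scale: float) -> list:
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.003, 0.9]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each weight is stored in one byte plus a shared scale, which gives the 4x reduction relative to FP32. The round-trip error is bounded by half the scale, so small weights next to one large outlier lose the most precision.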

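For LLMs specifically, quantization is practically mandatory for self-hosting. Running a 70B parameter model at FP16 requires about 140 GB of GPU memory for the weights alone (70 billion parameters x 2 bytes each). At INT4 (0.5 bytes per parameter), the weights fit in ~35 GB.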
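The arithmetic behind those figures is parameters times bits per weight, divided by 8. The helper below is a sketch that counts weights only. The KV cache, activations, and per-block quantization metadata add to the real footprint, so treat the results as lower bounds.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory for weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"70B at {name}: {weight_memory_gb(70e9, bits):.0f} GB")
```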

# LLM quantization with llama.cpp (via Python bindings)
# GGUF format with Q4_K_M quantization is the practical default
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-8b-instruct-Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers to GPU
)
response = llm("What is the capital of France?", max_tokens=100)
print(response["choices"][0]["text"])  # completion text from the OpenAI-style response dict

Self-Hosted vs API Providers

This is less a technical decision and more a business decision.

Factor              Self-Hosted                   API Provider
Upfront cost        High (infra, engineering)      Zero
Per-request cost    Low at scale                   High at scale
Latency control     Full                           Limited
Data privacy        Full control                   Data leaves your network
Scaling             You manage it                  Automatic
Model selection     Any model                      Provider's catalog
Ops burden          Significant                    Near zero
Time to production  Weeks to months                Hours

The Cost Equation

# Rough cost comparison: self-hosted vs API for a text classification task

# API approach: OpenAI GPT-3.5-turbo for classification
api_cost_per_request = 0.002  # illustrative figure for ~500 input + 50 output tokens; check current pricing
daily_requests = 100_000
api_monthly_cost = api_cost_per_request * daily_requests * 30  # $6,000/month

# Self-hosted approach: fine-tuned DistilBERT on a g5.xlarge
gpu_instance_hourly = 1.006  # g5.xlarge on-demand
instances_needed = 2  # for redundancy and load
self_hosted_monthly = gpu_instance_hourly * 24 * 30 * instances_needed  # ~$1,449/month

# Self-hosted engineering cost: 1 ML engineer part-time = ~$5,000/month equivalent
# Total self-hosted: ~$6,449/month

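# Break-even: at 100K requests/day, costs are similar (~$6,000 vs ~$6,449/month)
# At 500K requests/day, the API costs ~$30,000/month; self-hosted wins by 4-5x
#   (assuming the same two instances still handle the load)
# At 10K requests/day, the API costs ~$600/month; API wins (no engineering overhead)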

The crossover point depends on volume, model complexity, and engineering capability. Most teams should start with APIs and self-host when cost forces the decision.
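The crossover volume can be computed directly from the numbers in the cost comparison. This sketch assumes the self-hosted cost is fixed regardless of volume, meaning the same two instances handle the load, which stops being true at high enough traffic. The function name is illustrative.

```python
def break_even_requests_per_day(
    api_cost_per_request: float,
    self_hosted_monthly: float,
    days_per_month: int = 30,
) -> float:
    """Daily request volume at which API spend equals the fixed self-hosted cost."""
    return self_hosted_monthly / (api_cost_per_request * days_per_month)


infra = 1.006 * 24 * 30 * 2  # two g5.xlarge instances, as in the comparison
engineering = 5_000          # part-time engineer estimate, as in the comparison
volume = break_even_requests_per_day(0.002, infra + engineering)
print(f"break-even at about {volume:,.0f} requests/day")
```

With these inputs the break-even lands a little above 100K requests per day. Changing the per-request price or the engineering estimate moves it proportionally, so it is worth rerunning with your own numbers.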

Serving Frameworks

Several frameworks specialize in model serving:

Framework        Language    Notes
TorchServe       Python      PyTorch native. Good for PyTorch models.
TF Serving       C++         TensorFlow native. Very fast, battle-tested.
Triton           C++/Python  NVIDIA. Multi-framework, GPU-optimized.
vLLM             Python      LLM-specific. PagedAttention for high throughput.
Ollama           Go          Local LLM serving. Simple setup.
BentoML          Python      Framework-agnostic. Good developer experience.
Ray Serve        Python      Distributed serving. Good for complex pipelines.

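For LLM serving specifically, vLLM is the current standard. It implements PagedAttention, which stores the attention key-value cache in fixed-size blocks instead of one contiguous region per request. This reduces wasted GPU memory, so more requests fit in each batch and throughput rises: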

# vLLM serving - the practical choice for LLM inference
from vllm import LLM, SamplingParams

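# quantization="awq" only works with a checkpoint that was already quantized with AWQ.
# The stock Llama 3 weights are not, so the argument is omitted here.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")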
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# vLLM handles batching automatically
prompts = ["Explain quantum computing", "Write a haiku about coffee"]
outputs = llm.generate(prompts, sampling_params)

Serverless Model Serving

For low-to-medium volume, serverless removes all operational burden. The tradeoff is cold starts and limited GPU support.

Platform                  GPU Support    Cold Start    Max Timeout
AWS Lambda                No             1-10s         15 min
Google Cloud Functions    No             1-5s          60 min
Modal                     Yes            5-30s         24 hrs
Replicate                 Yes            5-60s         Varies
Banana/Baseten            Yes            10-60s        Varies

Modal is particularly practical for ML workloads — it supports GPU instances, persistent volumes for model weights, and scales to zero:

# Modal serverless GPU inference
import modal

app = modal.App("classifier")
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.cls(gpu="T4", image=image)
class Classifier:
    @modal.enter()
    def load_model(self):
        from transformers import pipeline
        self.pipe = pipeline("sentiment-analysis")
    
    @modal.method()
    def predict(self, text: str):
        return self.pipe(text)[0]

Common Pitfalls

Loading the model on every request. Load once at startup, predict on every request. This is the most common performance bug in model serving code.
No health checks or readiness probes. Your model server needs a /health endpoint that verifies the model is loaded and ready. Without it, load balancers send traffic to servers that are still loading models.
Ignoring cold start latency. If your GPU model takes 30 seconds to load into memory, the first request after scaling up will time out. Pre-warm instances or use persistent serving.
Not batching inference. Processing one input at a time wastes GPU parallelism. Batch requests together for 3-10x throughput improvement on GPU.
Premature self-hosting. Teams spend months building serving infrastructure when an API call would have shipped the feature in a day. Start with APIs, optimize later.
No fallback for model server failures. Your model server will go down. Have a degraded experience (cached results, rule-based fallback, or a simpler model) ready.
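The fallback chain from the last pitfall can be sketched as a thin wrapper: try the model, then a cache of earlier answers, then a rule. Everything named here is hypothetical, including the keyword rule, which is only a placeholder for whatever degraded behavior suits your product.

```python
from typing import Callable


def rule_based_sentiment(text: str) -> str:
    """Hypothetical keyword rule used only when the model is unavailable."""
    negative_words = {"bad", "terrible", "awful", "refund"}
    return "negative" if set(text.lower().split()) & negative_words else "positive"


def predict_with_fallback(text: str, model_call: Callable, cache: dict) -> tuple:
    """Return (label, source). Try the model, then the cache, then the rule."""
    try:
        label = model_call(text)
        cache[text] = label  # remember good answers for later outages
        return label, "model"
    except Exception:
        # Sketch only: production code should catch specific errors and log them.
        if text in cache:
            return cache[text], "cache"
        return rule_based_sentiment(text), "rule"


def broken_model(text: str) -> str:
    """Simulates an unreachable model server."""
    raise ConnectionError("model server unreachable")


cache = {}
print(predict_with_fallback("great product", lambda t: "positive", cache))
print(predict_with_fallback("great product", broken_model, cache))
print(predict_with_fallback("terrible service", broken_model, cache))
```

Returning the source alongside the label lets callers and dashboards see how often the degraded path is being used.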

Key Takeaways

Match serving strategy to latency requirements. Batch for offline, REST API for most real-time, gRPC for high-throughput internal services.
Start with API providers (OpenAI, Anthropic, Cohere). Self-host when volume makes the cost equation favor it — typically above 100K-500K requests per day.
Quantization (INT8, INT4) is not optional for large models. It cuts costs 2-4x with minimal quality loss.
vLLM is the current standard for self-hosted LLM inference. FastAPI + a transformer model is the standard for everything else.
Load models once at startup. Implement health checks. Batch inference on GPU. These three practices prevent most of the common serving performance and reliability problems.
Always have a fallback. Model servers fail. Plan for degraded service, not total outage.