Deployment Strategies
Why Deployment Strategy Matters
Deployment strategy determines how much risk you take each time you ship code to production. The wrong strategy turns every deploy into a potential outage. The right one makes deployments boring -- and boring is exactly what you want.
Blue-Green Deployment
Maintain two identical production environments. Deploy to the inactive one, verify it works, then switch traffic.
Before: Traffic → [Blue (v1)] active; [Green (idle)] receives no traffic
Deploy: Traffic → [Blue (v1)] active; [Green (v2)] deploying, receives no traffic
Switch: Traffic → [Green (v2)] active; [Blue (v1)] idle standby, still running
How it works
- Blue is running v1, serving all production traffic.
- Deploy v2 to Green (which receives no traffic).
- Run smoke tests against Green.
- Switch the load balancer / DNS to point at Green.
- Green is now live. Blue becomes the idle standby.
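The steps above can be sketched as a small in-memory model. This is only an illustration, not a real load balancer integration: `BlueGreenRouter`, its methods, and the `smoke_test` callback are hypothetical names introduced here, and the "switch" is modeled as flipping a single pointer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class BlueGreenRouter:
    """Toy model of a load balancer fronting two environments (hypothetical)."""
    versions: Dict[str, str] = field(
        default_factory=lambda: {"blue": "v1", "green": "idle"}
    )
    active: str = "blue"

    @property
    def idle(self) -> str:
        """Name of the environment that currently receives no traffic."""
        return "green" if self.active == "blue" else "blue"

    def deploy(self, version: str, smoke_test: Callable[[str], bool]) -> bool:
        """Deploy to the idle environment; switch only if smoke tests pass."""
        target = self.idle
        self.versions[target] = version   # idle side receives no traffic yet
        if not smoke_test(version):
            return False                  # active side was never touched
        self.active = target              # the switch: one pointer flip
        return True

    def rollback(self) -> None:
        """Point traffic back at the previous environment, still running."""
        self.active = self.idle

    def serving(self) -> str:
        """Version that production traffic currently reaches."""
        return self.versions[self.active]


router = BlueGreenRouter()
router.deploy("v2", smoke_test=lambda version: True)
print(router.active, router.serving())   # green v2
router.rollback()
print(router.active, router.serving())   # blue v1
```

Note that a failed smoke test leaves the active environment untouched, which is the property that makes this strategy attractive for critical services.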
Rollback
Instant -- switch traffic back to Blue. The old version is still running, untouched.
When to use
- Services where instant rollback is critical (payment processing, authentication).
- Stateless services. Stateful services are harder because database schema changes may not be backward-compatible.
- When you can afford double infrastructure cost (or use cloud auto-scaling to minimize it).
Trade-offs
| Pros | Cons |
|------|------|
| Zero downtime | Double infrastructure cost during deployment |
| Instant rollback (seconds) | Database migrations need careful handling |
| Full environment validation before switch | DNS propagation can delay the switch (if DNS-based) |
Canary Deployment
Route a small percentage of traffic to the new version. Monitor for errors. Gradually increase if healthy. Roll back automatically if not.
Step 1: 5% traffic → [v2] 95% traffic → [v1]
Step 2: 25% traffic → [v2] 75% traffic → [v1]
Step 3: 50% traffic → [v2] 50% traffic → [v1]
Step 4: 100% traffic → [v2] 0% traffic → [v1]
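One common way to split traffic by percentage is to hash a stable request attribute into a bucket. The sketch below assumes routing by user ID so that a given user consistently sees the same version; `pick_version` and the `user-N` IDs are hypothetical, and real routers (service meshes, load balancers) typically do this through weighted routing configuration instead of application code.

```python
import hashlib


def pick_version(user_id: str, canary_percent: int) -> str:
    """Return "v2" for roughly canary_percent% of users, "v1" for the rest."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket in 0..99
    return "v2" if bucket < canary_percent else "v1"


# Hypothetical user population, used only to show the observed split.
users = [f"user-{i}" for i in range(10_000)]
for percent in (5, 25, 50, 100):
    share = sum(pick_version(u, percent) == "v2" for u in users) / len(users)
    print(f"target {percent}% -> observed {share:.1%}")
```

Because the bucket depends only on the user ID, raising the percentage keeps every user who was already on v2 there, so nobody flips back and forth between versions as the rollout widens.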
How it works
- Deploy v2 alongside v1.
- Route 5% of traffic to v2.
- Monitor error rates, latency, and business metrics for a defined period (e.g., 10 minutes).
- If metrics are healthy, increase to 25%, then 50%, then 100%.
- If any step shows degradation, automatically route 100% back to v1.
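The staged promotion above can be sketched as a simple loop. The `run_canary` function and its `is_healthy` callback are hypothetical: in practice the callback would stand for an observation window (for example, 10 minutes of metrics), and appending to `history` would be a call to the traffic router.

```python
from typing import Callable, List, Sequence


def run_canary(
    stages: Sequence[int],
    is_healthy: Callable[[int], bool],
) -> List[int]:
    """Walk through traffic stages; return the history of canary percentages.

    is_healthy(percent) stands in for watching metrics during the
    observation window at that stage.
    """
    history: List[int] = []
    for percent in stages:
        history.append(percent)      # route `percent` of traffic to v2
        if not is_healthy(percent):
            history.append(0)        # degradation: send 100% back to v1
            return history
    return history


print(run_canary([5, 25, 50, 100], is_healthy=lambda p: True))
# [5, 25, 50, 100]
print(run_canary([5, 25, 50, 100], is_healthy=lambda p: p < 50))
# [5, 25, 50, 0]
```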
Automated canary analysis
The key to effective canary deployments is automated metric comparison:
Canary metrics during window:
- Error rate: canary 0.12% vs baseline 0.10% → PASS (within threshold)
- p99 latency: canary 180ms vs baseline 165ms → PASS (within threshold)
- CPU usage: canary 45% vs baseline 40% → PASS (within threshold)
Verdict: PROMOTE to next stage
Without automated analysis, a canary deployment is just a deployment with extra steps. You need metrics and automated comparison to detect regressions.
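A minimal sketch of that comparison, using the sample numbers from the report above. The `Check` class, the `verdict` function, and the relative thresholds (20% to 25% above baseline) are assumptions chosen for illustration; real thresholds depend on the service and on how noisy each metric is.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Check:
    """One canary-vs-baseline comparison (hypothetical structure)."""
    name: str
    canary: float
    baseline: float
    max_relative_increase: float   # 0.25 allows the canary up to 25% above baseline

    def passed(self) -> bool:
        return self.canary <= self.baseline * (1 + self.max_relative_increase)


def verdict(checks: Iterable[Check]) -> str:
    """Promote only if every check is within its threshold."""
    return "PROMOTE" if all(c.passed() for c in checks) else "ROLLBACK"


checks = [
    Check("error rate (%)", canary=0.12, baseline=0.10, max_relative_increase=0.25),
    Check("p99 latency (ms)", canary=180, baseline=165, max_relative_increase=0.20),
    Check("CPU usage (%)", canary=45, baseline=40, max_relative_increase=0.25),
]
for check in checks:
    print(f"{check.name}: {'PASS' if check.passed() else 'FAIL'}")
print("Verdict:", verdict(checks))   # Verdict: PROMOTE
```

Comparing against a baseline measured over the same window, instead of a fixed absolute limit, keeps the check meaningful when overall traffic or load shifts during the rollout.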
When to use
- User-facing services where regressions directly impact revenue.
- Services with good observability (metrics, tracing, alerting).
- When you want real production traffic validation, not just staging tests.
Trade-offs
| Pros | Cons |
|------|------|
| Limited blast radius (only 5% of users affected initially) | More complex routing infrastructure |
| Real production validation | Requires good metrics and automated analysis |
| Gradual confidence building | Slower rollout than blue-green |
Rolling Deployment
Update instances one at a time. Each instance is taken out of the load balancer, updated, health-checked, and returned to service.
[v1] [v1] [v1] [v1] ← Start
[v2] [v1] [v1] [v1] ← First instance updated
[v2] [v2] [v1] [v1] ← Second instance updated
[v2] [v2] [v2] [v1] ← Third instance updated
[v2] [v2] [v2] [v2] ← Complete
How it works
- Remove one instance from the load balancer.
- Deploy v2 to that instance.
- Run health checks against it.
- If healthy, add it back to the load balancer and proceed to the next instance.
- If unhealthy, stop the rollout and roll back the updated instances.
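The loop above can be sketched as follows. The fleet is modeled as a plain list of version strings, and `rolling_update` and `health_check` are hypothetical names; removing an instance from the load balancer is represented only by a comment.

```python
from typing import Callable, List


def rolling_update(
    fleet: List[str],
    new_version: str,
    health_check: Callable[[int, str], bool],
) -> bool:
    """Update instances in place, one at a time. Returns True on success."""
    previous = list(fleet)               # remember versions for rollback
    for i in range(len(fleet)):
        # Instance i is out of the load balancer while it is being updated.
        fleet[i] = new_version
        if not health_check(i, new_version):
            # Stop the rollout and restore every instance touched so far.
            for j in range(i + 1):
                fleet[j] = previous[j]
            return False
        print(" ".join(f"[{v}]" for v in fleet))
    return True


fleet = ["v1"] * 4
rolling_update(fleet, "v2", health_check=lambda index, version: True)
```

Between the first and last iteration the fleet serves both versions at once, which is why v1 and v2 need compatible APIs.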
When to use
- Internal services where brief version mixing is acceptable.
- When you cannot afford double infrastructure (blue-green cost).
- Kubernetes uses rolling deployments by default (strategy: RollingUpdate).
Trade-offs
| Pros | Cons |
|------|------|
| No extra infrastructure needed | Mixed versions during deployment |
| Gradual rollout | Slower rollback (must re-deploy all instances) |
| Built into Kubernetes natively | API compatibility required between v1 and v2 |
Automated Rollback
Regardless of strategy, every deployment needs an automated rollback mechanism. Manual rollbacks under pressure at 2 AM are where mistakes compound.
Rollback triggers
- Error rate exceeds threshold. If the 5xx error rate jumps from 0.1% to 2%, roll back.
- Latency spike. If p99 latency doubles compared to the pre-deploy baseline, roll back.
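- Health check failure. If the new version fails readiness probes, Kubernetes keeps traffic away from the failing pods and stalls the rollout, but it does not revert the Deployment on its own. Roll back with kubectl rollout undo, or use a controller that automates it.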
- Business metric anomaly. If checkout conversion drops 10%, roll back (even if technical metrics look fine).
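These triggers can be sketched as a single evaluation function. The `Snapshot` fields and `rollback_reasons` are hypothetical names; the thresholds mirror the examples in the list (2% error rate, doubled p99 latency, a 10% drop in conversion), with the conversion drop interpreted as relative to the baseline, which is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Snapshot:
    """Metrics for one observation window (hypothetical structure)."""
    error_rate_pct: float            # 5xx responses as a percentage of requests
    p99_latency_ms: float
    readiness_ok: bool
    checkout_conversion_pct: float


def rollback_reasons(baseline: Snapshot, current: Snapshot) -> List[str]:
    """Return every trigger that fired; an empty list means keep going."""
    reasons: List[str] = []
    if current.error_rate_pct >= 2.0:
        reasons.append("error rate")
    if current.p99_latency_ms >= 2 * baseline.p99_latency_ms:
        reasons.append("latency")
    if not current.readiness_ok:
        reasons.append("health check")
    if current.checkout_conversion_pct <= 0.9 * baseline.checkout_conversion_pct:
        reasons.append("business metric")
    return reasons


baseline = Snapshot(0.1, 165, True, 3.0)
after_deploy = Snapshot(0.1, 170, True, 2.5)   # technical metrics look fine
print(rollback_reasons(baseline, after_deploy))   # ['business metric']
```

Returning the list of reasons, not just a boolean, makes the rollback decision easy to log and review afterwards.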
Rollback speed by strategy
| Strategy | Rollback speed | Mechanism |
|----------|---------------|-----------|
| Blue-green | Seconds | Switch traffic back to old environment |
| Canary | Seconds | Route 100% back to v1 |
| Rolling | Minutes | Re-deploy v1 across all instances |
Real-World Examples
GitHub: ChatOps Deployment
GitHub deploys to production dozens of times per day using a ChatOps system:
- An engineer types /deploy in a chat channel.
- The system triggers a canary deployment.
- Automated monitoring watches error rates and latency for 10 minutes.
- If healthy, it proceeds to full rollout.
- If not, it automatically rolls back.
- The entire process takes ~15 minutes from command to full deployment.
The ChatOps approach gives visibility: everyone in the channel can see who is deploying what, and the bot reports status updates in real time.
Amazon: A Deploy Every 11.7 Seconds
Amazon deploys code every 11.7 seconds on average (across all services). This velocity is enabled by:
- Fully automated CI/CD pipelines with no manual gates.
- Automated canary analysis comparing metrics against baselines.
- One-box deployment: deploy to a single instance first, validate, then proceed.
- Automatic rollback triggered by metric anomalies.
Their deployment philosophy: small changes, deployed frequently, are safer than large changes deployed rarely. Each deployment touches a small amount of code, so when something breaks, the cause is obvious.
Etsy: Deploying 50+ Times Per Day
Etsy pioneered continuous deployment in 2010 with ~200 engineers deploying 50+ times per day:
- Feature flags decouple deployment from release (code ships but is not active).
- Comprehensive monitoring dashboards visible to the entire company.
- Deploying was normal -- not a special event requiring a change advisory board.
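The idea of decoupling deployment from release can be illustrated with a minimal feature-flag sketch. This is not Etsy's implementation: the `FLAGS` dictionary, `is_enabled`, and `checkout` are hypothetical, and a real system would read flags from a configuration service so they can change without a restart.

```python
from typing import Dict

# Hypothetical in-process flag store. The new code path is already deployed
# to production, but switched off.
FLAGS: Dict[str, bool] = {"new_checkout": False}


def is_enabled(name: str) -> bool:
    """Unknown flags default to off, so a typo never activates new code."""
    return FLAGS.get(name, False)


def checkout(cart_total: float) -> str:
    if is_enabled("new_checkout"):
        return f"new flow: {cart_total:.2f}"
    return f"old flow: {cart_total:.2f}"


print(checkout(19.99))            # old flow: 19.99
FLAGS["new_checkout"] = True      # the release: a config change, not a deploy
print(checkout(19.99))            # new flow: 19.99
```

Turning the flag back off acts as a rollback of the feature without redeploying anything.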
Choosing a Strategy
| Scenario | Recommended strategy |
|----------|---------------------|
| Payment / auth service, instant rollback critical | Blue-green |
| User-facing API with good observability | Canary |
| Internal service, cost-sensitive | Rolling |
| Database migration (schema change) | Blue-green with backward-compatible migrations |
| First deployment of a new service | Rolling (simplest) |
Common Mistakes
- Canary without automated metrics. If nobody is watching the canary, it is just a deployment with extra steps.
- No rollback testing. A rollback plan that has never been tested is not a plan. Test rollbacks regularly.
- Deploying on Friday. Unless your monitoring and on-call are excellent, avoid deploying before weekends.
- Manual deployments. Any manual step is a step that can be forgotten, done incorrectly, or skipped under pressure. Automate everything.
- Large, infrequent deploys. Deploying 10,000 lines of changes at once is riskier than deploying 100 lines a hundred times. Small batches reduce blast radius.
Key Takeaways
- Deployment strategy should match risk tolerance. Canary for user-facing, rolling for internal, blue-green for critical.
- Automated rollback is not optional. Define triggers (error rate, latency, health checks) and test them.
- Small, frequent deployments are safer than large, infrequent ones. This is counter-intuitive but consistently supported by DORA research.
- Always have a rollback plan. Test it before you need it.