Capacity Planning & Chaos Engineering
Two questions define operational maturity. Capacity planning asks: "Do we have enough infrastructure to handle what is coming?" Chaos engineering asks: "Will our systems survive when things go wrong?" Both require you to think about failure before it happens instead of reacting after the fact.
Capacity Planning
Capacity planning is the process of determining how much infrastructure you need to serve your current and future load. Get it wrong in one direction and you waste money on idle resources. Get it wrong in the other and your service falls over during a traffic spike.
Start with Current Load
Before you can plan for the future, you need to understand the present.
Key metrics to collect:
- Requests per second (RPS) at peak and average
- CPU utilization at peak and average
- Memory usage at peak and average
- Database connections in use vs available
- Storage growth rate (GB per month)
- Network bandwidth at peak
Example baseline:
Service: checkout-api
Average RPS: 500
Peak RPS: 2,000 (during sales events)
Average CPU: 35%
Peak CPU: 75%
Database connections: 80 of 200 at peak
Storage growth: 50 GB/month
Extrapolate from Growth
Once you know the current load, project forward based on growth trends.
Growth estimation:
Current peak RPS: 2,000
Monthly growth rate: 15%
In 3 months: 2,000 * 1.15^3 ≈ 3,042 RPS
In 6 months: 2,000 * 1.15^6 ≈ 4,626 RPS
In 12 months: 2,000 * 1.15^12 ≈ 10,700 RPS
Current infrastructure handles 3,000 RPS comfortably.
We need to scale before month 3.
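The projection above is plain compound growth, so it is easy to script. The following is a minimal sketch, assuming a constant monthly growth rate; the function names `projected_peak` and `months_until_exceeded` are illustrative and do not come from any library.

```python
from typing import Optional


def projected_peak(current_rps: float, monthly_growth: float, months: int) -> float:
    """Compound the current peak forward at a fixed monthly growth rate."""
    return current_rps * (1 + monthly_growth) ** months


def months_until_exceeded(
    current_rps: float,
    monthly_growth: float,
    capacity_rps: float,
    horizon_months: int = 120,
) -> Optional[int]:
    """Return the first whole month in which projected peak exceeds capacity.

    Returns None if capacity is not exceeded within the horizon.
    """
    for month in range(horizon_months + 1):
        if projected_peak(current_rps, monthly_growth, month) > capacity_rps:
            return month
    return None


if __name__ == "__main__":
    for months in (3, 6, 12):
        print(months, round(projected_peak(2000, 0.15, months)))
    print("capacity exceeded in month:", months_until_exceeded(2000, 0.15, 3000))
```

With the numbers from the example (2,000 RPS peak, 15% monthly growth, 3,000 RPS capacity), the projected peak first exceeds capacity in month 3, which is why scaling has to happen before then. Real growth is rarely this smooth, so treat the output as a planning estimate and rerun it whenever the baseline changes.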
Buffer for Spikes
Traffic is bursty, and real spikes can exceed even your observed peak. Size for 2-3x the observed peak, not the daily average, so an unexpected surge does not cause degradation.
Capacity planning formula:
Required capacity = Peak load * Safety buffer * Growth factor
Example:
Peak RPS: 2,000
Safety buffer: 2x (handle unexpected spikes)
Growth factor: 1.5 (about a 3-month runway at 15% monthly growth, since 1.15^3 ≈ 1.52)
Required capacity: 2,000 * 2 * 1.5 = 6,000 RPS
Current capacity: 3,000 RPS
Gap: 3,000 RPS → need to double infrastructure
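The formula can be captured in a few lines so the same calculation is repeated consistently at every review. This is a minimal sketch; `required_capacity` and `capacity_gap` are hypothetical helper names, and it assumes capacity scales roughly linearly with added infrastructure.

```python
def required_capacity(peak_rps: float, safety_buffer: float, growth_factor: float) -> float:
    """Required capacity = peak load * safety buffer * growth factor."""
    return peak_rps * safety_buffer * growth_factor


def capacity_gap(required_rps: float, current_rps: float) -> float:
    """Capacity still to be added; zero when current capacity already suffices."""
    return max(0.0, required_rps - current_rps)


if __name__ == "__main__":
    required = required_capacity(peak_rps=2000, safety_buffer=2, growth_factor=1.5)
    gap = capacity_gap(required, current_rps=3000)
    print(f"required: {required:.0f} RPS, gap: {gap:.0f} RPS")
```

Keeping the buffer and growth factor as explicit parameters makes the assumptions visible and easy to challenge during a review.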
Common Traffic Patterns
Different services have different patterns. Know yours.
E-commerce:
- Black Friday / Cyber Monday: 10-50x normal traffic
- Holiday season: 3-5x normal traffic
- Daily pattern: peak at lunchtime and evening
SaaS B2B:
- Weekdays 9-5: high traffic
- Weekends: minimal traffic
- Month-end: spike from billing and reporting
Social media:
- Viral events: unpredictable 100x spikes
- Daily pattern: peak in evening hours
- Gradual growth with sudden jumps
Gaming:
- Launch day: 10-100x normal traffic
- Event weekends: 5-10x
- Daily pattern: peak in evening and weekends
Load Testing Before Launches
Never launch a major feature or marketing campaign without load testing first.
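# Load testing with k6
# This script simulates a gradual ramp-up to peak traffic.
# Note: each stage ramps linearly to `target`, which is a number of
# virtual users (VUs), not RPS. To drive a fixed request rate, use an
# arrival-rate executor instead.
# k6-load-test.js would contain:
# import http from 'k6/http';
# export const options = {
#   stages: [
#     { duration: '2m', target: 100 },  // ramp up to 100 VUs
#     { duration: '5m', target: 500 },  // ramp to 500 VUs
#     { duration: '2m', target: 2000 }, // ramp to 2,000 VUs
#     { duration: '5m', target: 2000 }, // sustain 2,000 VUs
#     { duration: '2m', target: 0 },    // ramp down
#   ],
# };
# export default function () {
#   http.get('https://staging.example.com/checkout'); // hypothetical URL
# }
# Run the load test against staging
k6 run --out influxdb=http://localhost:8086/k6 k6-load-test.js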
Load test results:
At 500 RPS: p99 latency = 120ms, error rate = 0% (OK)
At 1000 RPS: p99 latency = 250ms, error rate = 0% (OK)
At 1500 RPS: p99 latency = 800ms, error rate = 0.1% (Warning)
At 2000 RPS: p99 latency = 2.5s, error rate = 5% (Critical)
Bottleneck: Database connection pool saturates at ~1400 RPS
Action: Increase pool size, add read replicas, implement caching
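Results like these are easier to act on when they are checked against explicit thresholds. The sketch below finds the highest tested load that stays within limits. The 500 ms p99 limit and 0.1% error limit are assumptions chosen for illustration, not thresholds stated above, and `LoadStep` and `max_safe_rps` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class LoadStep:
    rps: int
    p99_ms: float
    error_rate: float  # fraction of failed requests, e.g. 0.001 == 0.1%


def max_safe_rps(
    steps: Iterable[LoadStep], p99_limit_ms: float, error_limit: float
) -> Optional[int]:
    """Highest tested RPS at which every lower step also met both limits."""
    safe = None
    for step in sorted(steps, key=lambda s: s.rps):
        if step.p99_ms > p99_limit_ms or step.error_rate > error_limit:
            break  # first violation: anything above this is unproven
        safe = step.rps
    return safe


# The results from the load test above.
RESULTS = [
    LoadStep(500, 120, 0.0),
    LoadStep(1000, 250, 0.0),
    LoadStep(1500, 800, 0.001),
    LoadStep(2000, 2500, 0.05),
]

if __name__ == "__main__":
    print(max_safe_rps(RESULTS, p99_limit_ms=500, error_limit=0.001))
```

Under those assumed limits the proven safe load is 1,000 RPS, which is consistent with the connection pool saturating somewhere between the 1,000 and 1,500 RPS steps.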
Capacity Planning as a Regular Practice
Capacity planning is not a one-time exercise. Review quarterly.
Quarterly capacity review:
1. Update baseline metrics (current peak load)
2. Recalculate growth projections
3. Identify bottlenecks approaching their limits
4. Plan scaling actions for next quarter
5. Review cost of current infrastructure
6. Document decisions and assumptions
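Step 3 of the review can be partly automated by comparing peak usage against each resource's limit. This is a minimal sketch using the baseline figures from earlier; the 70% warning threshold and the function name `approaching_limit` are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def approaching_limit(
    usage: Dict[str, Tuple[float, float]], threshold: float = 0.7
) -> List[str]:
    """Return resources whose peak usage / limit is at or above the threshold.

    `usage` maps a resource name to (peak usage, limit). Results are sorted
    with the most heavily utilized resource first.
    """
    ratios = {name: used / limit for name, (used, limit) in usage.items()}
    flagged = [name for name, ratio in ratios.items() if ratio >= threshold]
    return sorted(flagged, key=lambda name: ratios[name], reverse=True)


# Peak figures from the checkout-api baseline and the capacity example.
PEAKS = {
    "cpu_percent": (75, 100),
    "db_connections": (80, 200),
    "rps": (2000, 3000),
}

if __name__ == "__main__":
    print(approaching_limit(PEAKS))
```

A utilization ratio is only a first filter. As the load test showed, a resource such as the connection pool can saturate under load even when its idle-time utilization looks comfortable.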
Chaos Engineering
Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. You break things intentionally, in a controlled way, to find weaknesses before they cause real outages.
The Principles
1. Define "steady state": What does normal look like? (RPS, latency, error rate at normal levels)
2. Hypothesize: "If we kill one database replica, the service will fail over within 30 seconds with no errors."
3. Run the experiment: Actually kill the replica.
4. Observe: Did the failover work? Were there errors? How long did it take?
5. Learn: Fix whatever broke. Update runbooks. Improve the system.
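The five steps map naturally onto a small experiment harness. The following is a sketch under stated assumptions: the four callables are placeholders you would implement against your own monitoring and tooling, and `run_experiment` is a hypothetical name, not part of any chaos engineering tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    ran: bool
    hypothesis_held: bool
    note: str


def run_experiment(
    steady_state_ok: Callable[[], bool],
    inject_failure: Callable[[], None],
    rollback: Callable[[], None],
    hypothesis_holds: Callable[[], bool],
) -> ExperimentResult:
    """Run one chaos experiment: verify steady state, inject, observe, roll back."""
    # Steps 1-2: refuse to start unless the system is already healthy,
    # otherwise the observation cannot be attributed to the injected failure.
    if not steady_state_ok():
        return ExperimentResult(False, False, "aborted: system not in steady state")
    try:
        inject_failure()           # Step 3: run the experiment
        held = hypothesis_holds()  # Step 4: observe
    finally:
        rollback()                 # Always undo the injection, even on error
    # Step 5: the note is the starting point for follow-up work.
    note = "hypothesis held" if held else "hypothesis failed: fix and update runbooks"
    return ExperimentResult(True, held, note)
```

Putting the rollback in a `finally` block means the injected failure is removed even if the observation step itself raises, which keeps a failed experiment from turning into a lingering outage.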
Netflix Chaos Monkey
Netflix pioneered chaos engineering with Chaos Monkey, a tool that randomly terminates production instances during business hours. The idea is simple: if your system cannot survive a single instance dying, it is not resilient enough.
Netflix's Simian Army tools:
Chaos Monkey: Kill random instances
Latency Monkey: Inject artificial network delays
Chaos Gorilla: Simulate availability zone failure
Chaos Kong: Simulate entire region failure
The key insight is that Netflix runs these in production, not just staging. Production is the only environment that accurately represents production.
Chaos Engineering Tools
Gremlin:
- Commercial platform for chaos engineering
- Attack types: CPU, memory, disk, network, process
- Targeted: choose specific hosts, containers, or services
- Safety: automatic rollback, halt conditions
Litmus Chaos:
- Open-source, Kubernetes-native
- ChaosEngine CRDs for defining experiments
- Community hub with pre-built experiments
Chaos Mesh:
- Open-source, Kubernetes-native
- Network chaos: partition, delay, loss, corruption
- Pod chaos: kill, failure, container kill
- Dashboard for experiment management
AWS Fault Injection Service (formerly Fault Injection Simulator):
- AWS-native chaos engineering
- Integrates with EC2, ECS, EKS, RDS
- Safety: stop conditions, rollback
Designing Chaos Experiments
Start small and controlled. Do not inject chaos randomly into production on day one.
Progression of chaos experiments:
Level 1: Known recovery paths
- Kill a single pod (Kubernetes should restart it)
- Kill a single instance behind a load balancer
- Hypothesis: the system recovers automatically
Level 2: Dependency failures
- Block network to the database for 30 seconds
- Simulate a downstream API returning errors
- Hypothesis: circuit breakers activate, graceful degradation
Level 3: Infrastructure failures
- Simulate an availability zone failure
- Kill the primary database (failover to replica)
- Hypothesis: multi-AZ setup handles the failure
Level 4: Regional failures
- Simulate an entire region going down
- Test DNS failover to another region
- Hypothesis: multi-region setup handles the failure
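The Level 2 hypothesis depends on circuit breakers, so it helps to be clear about what behavior the experiment should confirm. Below is a minimal, single-threaded sketch of the pattern. The class, its parameters, and the default values are illustrative assumptions; a production service would normally use an established resilience library rather than this code.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Fail fast after repeated errors, then retry after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                return fallback()  # open: skip the dependency entirely
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self._failures += 1
            was_open = self._opened_at is not None
            if was_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

During a dependency-failure experiment, the observable signs that the breaker works are that calls to the failing dependency stop after the threshold is reached, users receive the fallback response, and traffic resumes after the dependency recovers.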
Running a Chaos Experiment
# Example Litmus Chaos experiment: kill a random pod
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-pod-kill
spec:
  appinfo:
    appns: production
    applabel: app=payment-service
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"
Game Days
A game day is a planned event where the team practices responding to failures. It combines chaos engineering with incident response practice.
Game day structure:
Preparation (1 week before):
- Define the scenario (e.g., "Primary database fails")
- Set up monitoring dashboards
- Brief the team on the exercise
- Establish safety controls (how to abort)
Execution (2-4 hours):
- Inject the failure
- Team responds as if it were a real incident
- Facilitator observes and takes notes
- Time key milestones (detection, diagnosis, recovery)
Debrief (1 hour):
- What went well?
- What surprised us?
- What would we do differently?
- Action items for improvement
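The milestones timed during execution turn into the numbers discussed in the debrief. This is a small sketch for computing them from the facilitator's timestamps; the function name and the choice to measure recovery from the moment of injection are assumptions.

```python
from datetime import datetime, timedelta
from typing import Dict


def milestone_durations(
    injected: datetime, detected: datetime, diagnosed: datetime, recovered: datetime
) -> Dict[str, timedelta]:
    """Compute detection, diagnosis, and recovery times for one game day."""
    if not (injected <= detected <= diagnosed <= recovered):
        raise ValueError("milestones must be in chronological order")
    return {
        "time_to_detect": detected - injected,
        "time_to_diagnose": diagnosed - detected,
        "time_to_recover": recovered - injected,  # measured from injection
    }


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    start = datetime(2024, 1, 15, 14, 0)
    durations = milestone_durations(
        injected=start,
        detected=start + timedelta(minutes=4),
        diagnosed=start + timedelta(minutes=19),
        recovered=start + timedelta(minutes=42),
    )
    for name, value in durations.items():
        print(name, value)
```

Recording the same three durations at every game day gives the team a trend to track, which is more useful than any single result.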
Game Day Scenarios
Scenario 1: "The database is down"
- Kill the primary database
- Does the application fail over to the replica?
- Can the team access the runbook?
- How long until recovery?
Scenario 2: "A dependency is returning errors"
- Simulate a third-party API returning 500s
- Does the circuit breaker trip?
- Does the service degrade gracefully?
- Are users shown a meaningful error?
Scenario 3: "We got hacked"
- Simulate a compromised service account
- Can the team revoke credentials quickly?
- Is there an incident response playbook?
- How long until containment?
Scenario 4: "Traffic spike"
- Simulate 10x normal traffic
- Does auto-scaling work?
- What is the first bottleneck?
- How long until the system stabilizes?
Real-World Example
An e-commerce company was preparing for their biggest sale of the year. Load testing showed their checkout flow could handle 3x normal traffic, but they expected 5-8x. They had three weeks.
The capacity planning response:
- Added read replicas for the database
- Increased cache capacity by 2x
- Pre-scaled Kubernetes pods to handle 8x
- Set up auto-scaling with aggressive thresholds
The chaos engineering response:
- Ran a game day simulating database failover during peak load
- Discovered the connection pool was too small for the failover scenario
- Found that the cache invalidation during failover caused a thundering herd
- Fixed both issues before the sale
The sale hit 6x normal traffic. The system handled it with no downtime. Without load testing, they would have discovered the connection pool issue during the sale. Without the game day, they would have discovered the thundering herd problem during a real database failover.
Common Pitfalls
- Capacity planning based on averages -- Averages hide peaks; plan for p99 traffic, not average traffic
- No buffer for the unknown -- A 10% buffer is not enough; unexpected viral moments, bot traffic, and DDoS attacks can multiply traffic overnight; use 2-3x buffers
- Load testing only in staging -- Staging rarely matches production topology; test in production with feature flags and canary traffic
- Chaos engineering without a hypothesis -- Randomly breaking things teaches you nothing; start with a hypothesis and measure the result
- Skipping the safety controls -- Every chaos experiment needs a kill switch; without one, a failed experiment becomes a real outage
- Running chaos experiments without buy-in -- If leadership does not understand why you are intentionally causing failures, the first experiment that goes wrong will be the last
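The first pitfall is easy to demonstrate with numbers. The sketch below compares the mean with the 99th percentile for a set of hypothetical per-minute RPS samples (98 quiet minutes and 2 minutes of spike). The data is invented for illustration, and `percentile` uses the simple nearest-rank method.

```python
import math
import statistics
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical data: 98 minutes at 500 RPS, 2 minutes at 2,000 RPS.
SAMPLES = [500] * 98 + [2000] * 2

if __name__ == "__main__":
    print("mean:", statistics.mean(SAMPLES))
    print("p99: ", percentile(SAMPLES, 99))
```

Here the mean is 530 RPS while the p99 is 2,000 RPS. Infrastructure sized from the mean would be overwhelmed nearly fourfold during the spike.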
Key Takeaways
- Capacity planning starts with understanding current load, extrapolating growth, and adding a 2-3x buffer for spikes
- Load test before major launches; never assume the system can handle more than you have proven it can handle
- Chaos engineering intentionally breaks systems to find weaknesses before real failures do
- Start chaos experiments small (kill a pod) and progress to larger failures (zone outage, region failover)
- Game days combine chaos engineering with incident response practice in a structured event
- Both capacity planning and chaos engineering are ongoing practices, not one-time activities