Capacity Planning & Chaos Engineering

Two questions define operational maturity. Capacity planning asks: "Do we have enough infrastructure to handle what is coming?" Chaos engineering asks: "Will our systems survive when things go wrong?" Both require you to think about failure before it happens instead of reacting after the fact.

Capacity Planning

Capacity planning is the process of determining how much infrastructure you need to serve your current and future load. Overprovision and you waste money on idle resources. Underprovision and your service falls over during a traffic spike.

Start with Current Load

Before you can plan for the future, you need to understand the present.

Key metrics to collect:
  - Requests per second (RPS) at peak and average
  - CPU utilization at peak and average
  - Memory usage at peak and average
  - Database connections in use vs available
  - Storage growth rate (GB per month)
  - Network bandwidth at peak

Example baseline:
  Service: checkout-api
  Average RPS: 500
  Peak RPS: 2,000 (during sales events)
  Average CPU: 35%
  Peak CPU: 75%
  Database connections: 80 of 200 at peak
  Storage growth: 50 GB/month
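
A baseline like this can be derived from raw monitoring samples. The following is a minimal sketch, assuming per-minute RPS samples have already been exported from your metrics system. The function names `summarize` and `pool_utilization` and the sample values are hypothetical. It reports the average, the peak, and the peak-to-average ratio, plus how much of a fixed-size resource is in use.

```python
from statistics import mean


def summarize(samples):
    """Return average, peak, and peak-to-average ratio for a list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    avg = mean(samples)
    peak = max(samples)
    return {"average": avg, "peak": peak, "peak_to_average": peak / avg}


def pool_utilization(in_use, available):
    """Fraction of a fixed-size resource (e.g. DB connections) in use."""
    return in_use / available


# Hypothetical per-minute RPS samples, not real data.
rps_samples = [400, 450, 500, 650, 2000, 500, 300, 200]
print(summarize(rps_samples))

# Connection figures taken from the checkout-api baseline above.
print(f"{pool_utilization(80, 200):.0%} of DB connections used at peak")
```

A high peak-to-average ratio is a signal that sizing from the average alone would leave the service short during peaks.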

Extrapolate from Growth

Once you know the current load, project forward based on growth trends.

Growth estimation:
  Current peak RPS: 2,000
  Monthly growth rate: 15%
  
  In 3 months: 2,000 * 1.15^3 = 3,042 RPS
  In 6 months: 2,000 * 1.15^6 = 4,626 RPS
  In 12 months: 2,000 * 1.15^12 = 10,701 RPS
  
  Current infrastructure handles 3,000 RPS comfortably.
  We need to scale before month 3.
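
The projection above is plain compound growth, so it is easy to script. This is a minimal sketch using the figures from the example. The function names are hypothetical, and it assumes a constant, positive monthly growth rate, which real traffic rarely follows exactly.

```python
import math


def project(current, monthly_growth, months):
    """Compound growth: current * (1 + rate) ** months."""
    return current * (1 + monthly_growth) ** months


def months_until_exceeded(current, monthly_growth, capacity):
    """Months of compound growth until load reaches capacity (fractional).

    Assumes monthly_growth > 0.
    """
    if current >= capacity:
        return 0.0
    return math.log(capacity / current) / math.log(1 + monthly_growth)


for m in (3, 6, 12):
    print(f"month {m}: {project(2000, 0.15, m):,.0f} RPS")

runway = months_until_exceeded(2000, 0.15, 3000)
print(f"3,000 RPS reached after about {runway:.1f} months")
```

Solving for the month in which load crosses capacity gives a deadline for the scaling work instead of a set of spot checks.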

Buffer for Spikes

Even ordinary traffic spikes well above the daily average, and unexpected events can exceed the highest peak you have measured. Keep a 2-3x safety buffer on top of measured peak load so that spikes do not degrade the service.

Capacity planning formula:
  Required capacity = Peak load * Safety buffer * Growth factor

  Example:
  Peak RPS: 2,000
  Safety buffer: 2x (handle unexpected spikes)
  Growth factor: 1.5 (6-month runway, assuming roughly 7% monthly growth)
  
  Required capacity: 2,000 * 2 * 1.5 = 6,000 RPS

  Current capacity: 3,000 RPS
  Gap: 3,000 RPS → need to double infrastructure
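
The formula translates directly into code. A minimal sketch with hypothetical function names, using the numbers from the example:

```python
def required_capacity(peak_load, safety_buffer, growth_factor):
    """Required capacity = peak load * safety buffer * growth factor."""
    return peak_load * safety_buffer * growth_factor


def capacity_gap(required, current):
    """Additional capacity needed; zero if current capacity already suffices."""
    return max(0, required - current)


required = required_capacity(peak_load=2000, safety_buffer=2, growth_factor=1.5)
gap = capacity_gap(required, current=3000)
print(f"required={required:,.0f} RPS, gap={gap:,.0f} RPS")
```

Keeping the three inputs as separate parameters makes it clear which assumption changed when the result moves between reviews.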

Common Traffic Patterns

Different services have different patterns. Know yours.

E-commerce:
  - Black Friday / Cyber Monday: 10-50x normal traffic
  - Holiday season: 3-5x normal traffic
  - Daily pattern: peak at lunchtime and evening

SaaS B2B:
  - Weekdays 9-5: high traffic
  - Weekends: minimal traffic
  - Month-end: spike from billing and reporting

Social media:
  - Viral events: unpredictable 100x spikes
  - Daily pattern: peak in evening hours
  - Gradual growth with sudden jumps

Gaming:
  - Launch day: 10-100x normal traffic
  - Event weekends: 5-10x
  - Daily pattern: peak in evening and weekends

Load Testing Before Launches

Never launch a major feature or marketing campaign without load testing first.

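# Load testing with k6
# This script ramps virtual users (VUs) up in stages, holds, then ramps down.
# Note: `target` is a VU count, not RPS; achieved RPS depends on response
# times and on what each VU iteration does.

# k6-load-test.js would contain:
# import http from 'k6/http';
#
# export const options = {
#   stages: [
#     { duration: '2m', target: 100 },   // warm up
#     { duration: '5m', target: 500 },   // ramp to 500 VUs
#     { duration: '2m', target: 2000 },  // push to 2,000 VUs
#     { duration: '5m', target: 2000 },  // sustain
#     { duration: '2m', target: 0 },     // ramp down
#   ],
# };
#
# // Placeholder URL (an assumption): point this at your staging endpoint.
# export default function () {
#   http.get('https://staging.example.com/checkout');
# }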

# Run the load test against staging
k6 run --out influxdb=http://localhost:8086/k6 k6-load-test.js

Load test results:
  At 500 RPS:  p99 latency = 120ms, error rate = 0%    (OK)
  At 1000 RPS: p99 latency = 250ms, error rate = 0%    (OK)
  At 1500 RPS: p99 latency = 800ms, error rate = 0.1%  (Warning)
  At 2000 RPS: p99 latency = 2.5s,  error rate = 5%    (Critical)

  Bottleneck: Database connection pool saturates at ~1400 RPS
  Action: Increase pool size, add read replicas, implement caching
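
The OK / Warning / Critical labels above can be applied automatically once you fix thresholds. This is a minimal sketch. The threshold values are hypothetical, chosen only so that they reproduce the labels in the table, and should be replaced with values from your own SLOs.

```python
from dataclasses import dataclass


@dataclass
class Step:
    rps: int
    p99_ms: float
    error_rate: float  # fraction, e.g. 0.001 == 0.1%


# Hypothetical thresholds; choose values that match your own SLOs.
WARN_P99_MS, CRIT_P99_MS = 500, 1000
WARN_ERRORS, CRIT_ERRORS = 0.0005, 0.01


def classify(step):
    """Label one load step by its p99 latency and error rate."""
    if step.p99_ms > CRIT_P99_MS or step.error_rate > CRIT_ERRORS:
        return "Critical"
    if step.p99_ms > WARN_P99_MS or step.error_rate > WARN_ERRORS:
        return "Warning"
    return "OK"


def max_safe_rps(steps):
    """Highest tested load for which that step and every lower step is OK."""
    safe = 0
    for step in sorted(steps, key=lambda s: s.rps):
        if classify(step) != "OK":
            break
        safe = step.rps
    return safe


# Results from the load test above.
results = [
    Step(500, 120, 0.0),
    Step(1000, 250, 0.0),
    Step(1500, 800, 0.001),
    Step(2000, 2500, 0.05),
]
for step in results:
    print(step.rps, classify(step))
print("max safe RPS:", max_safe_rps(results))
```

The proven-safe load is the number to feed back into capacity planning as current capacity, rather than the load at which the system finally failed.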

Capacity Planning as a Regular Practice

Capacity planning is not a one-time exercise. Review quarterly.

Quarterly capacity review:
  1. Update baseline metrics (current peak load)
  2. Recalculate growth projections
  3. Identify bottlenecks approaching their limits
  4. Plan scaling actions for next quarter
  5. Review cost of current infrastructure
  6. Document decisions and assumptions

Chaos Engineering

Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. You break things intentionally, in a controlled way, to find weaknesses before they cause real outages.

The Principles

1. Define "steady state": What does normal look like?
   (RPS, latency, error rate at normal levels)

2. Hypothesize: "If we kill one database replica, the
   service will fail over within 30 seconds with no errors."

3. Run the experiment: Actually kill the replica.

4. Observe: Did the failover work? Were there errors?
   How long did it take?

5. Learn: Fix whatever broke. Update runbooks.
   Improve the system.
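
The five steps can be sketched as a small harness. This is an illustration under stated assumptions, not the API of any chaos tool: `run_experiment` and its parameters are hypothetical, and the steady-state check, fault injection, and rollback are callables you supply.

```python
import time


def run_experiment(is_steady, inject, rollback, observe_seconds=30,
                   poll_seconds=1, sleep=time.sleep):
    """Run one chaos experiment and report whether the hypothesis held.

    is_steady: callable returning True while metrics are within normal bounds
    inject:    callable that introduces the failure
    rollback:  callable that removes the failure (the kill switch)
    """
    if not is_steady():
        return "skipped: system not in steady state before injection"
    inject()
    try:
        elapsed = 0
        while elapsed < observe_seconds:
            if not is_steady():
                return "hypothesis rejected: steady state violated"
            sleep(poll_seconds)
            elapsed += poll_seconds
        return "hypothesis held"
    finally:
        rollback()  # always undo the fault, even on failure or exception


# Dry run with stand-in callables instead of a real system.
events = []
outcome = run_experiment(
    is_steady=lambda: True,
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    observe_seconds=3,
    sleep=lambda seconds: None,  # skip real waiting in the dry run
)
print(outcome, events)
```

Putting the rollback in a `finally` block means the fault is removed whether the hypothesis holds, fails, or the harness itself raises.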

Netflix Chaos Monkey

Netflix pioneered chaos engineering with Chaos Monkey, a tool that randomly terminates production instances during business hours. The idea is simple: if your system cannot survive a single instance dying, it is not resilient enough.

Netflix's Simian Army (related tools, from smallest to largest failure):
  Chaos Monkey:    Kill random instances
  Latency Monkey:  Inject artificial network delays
  Chaos Gorilla:   Simulate availability zone failure
  Chaos Kong:      Simulate entire region failure

The key insight is that Netflix runs these in production, not just staging. Production is the only environment that accurately represents production.

Chaos Engineering Tools

Gremlin:
  - Commercial platform for chaos engineering
  - Attack types: CPU, memory, disk, network, process
  - Targeted: choose specific hosts, containers, or services
  - Safety: automatic rollback, halt conditions

Litmus Chaos:
  - Open-source, Kubernetes-native
  - ChaosEngine CRDs for defining experiments
  - Community hub with pre-built experiments

Chaos Mesh:
  - Open-source, Kubernetes-native
  - Network chaos: partition, delay, loss, corruption
  - Pod chaos: kill, failure, container kill
  - Dashboard for experiment management

AWS Fault Injection Service (FIS, formerly Fault Injection Simulator):
  - AWS-native chaos engineering
  - Integrates with EC2, ECS, EKS, RDS
  - Safety: stop conditions, rollback

Designing Chaos Experiments

Start small and controlled. Do not inject chaos randomly into production on day one.

Progression of chaos experiments:

Level 1: Known recovery paths
  - Kill a single pod (Kubernetes should restart it)
  - Kill a single instance behind a load balancer
  - Hypothesis: the system recovers automatically

Level 2: Dependency failures
  - Block network to the database for 30 seconds
  - Simulate a downstream API returning errors
  - Hypothesis: circuit breakers activate, graceful degradation

Level 3: Infrastructure failures
  - Simulate an availability zone failure
  - Kill the primary database (failover to replica)
  - Hypothesis: multi-AZ setup handles the failure

Level 4: Regional failures
  - Simulate an entire region going down
  - Test DNS failover to another region
  - Hypothesis: multi-region setup handles the failure

Running a Chaos Experiment

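# Example Litmus Chaos experiment: delete pods of the payment service
# Assumes the pod-delete experiment and the litmus-admin service account
# are already installed in the target namespace.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-pod-kill
  namespace: production
spec:
  engineState: active              # set to "stop" to abort the experiment
  appinfo:
    appns: production
    applabel: app=payment-service
    appkind: deployment            # assumes the app runs as a Deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # seconds
              value: "60"
            - name: CHAOS_INTERVAL         # seconds between pod deletions
              value: "10"
            - name: FORCE                  # "false" = graceful termination
              value: "false"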

Game Days

A game day is a planned event where the team practices responding to failures. It combines chaos engineering with incident response practice.

Game day structure:

Preparation (1 week before):
  - Define the scenario (e.g., "Primary database fails")
  - Set up monitoring dashboards
  - Brief the team on the exercise
  - Establish safety controls (how to abort)

Execution (2-4 hours):
  - Inject the failure
  - Team responds as if it were a real incident
  - Facilitator observes and takes notes
  - Time key milestones (detection, diagnosis, recovery)

Debrief (1 hour):
  - What went well?
  - What surprised us?
  - What would we do differently?
  - Action items for improvement
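
Timing the key milestones is easier to compare across game days if the facilitator's timestamps are turned into durations. This is a minimal sketch. The function names and the timestamps are hypothetical, and time-to-recover is measured here from the moment of injection.

```python
from datetime import datetime


def milestone_durations(injected, detected, diagnosed, recovered):
    """Return detection, diagnosis, and recovery durations as timedeltas."""
    if not injected <= detected <= diagnosed <= recovered:
        raise ValueError("milestones must be in chronological order")
    return {
        "time_to_detect": detected - injected,
        "time_to_diagnose": diagnosed - detected,
        "time_to_recover": recovered - injected,
    }


def at(clock_time):
    """Build a datetime on a fixed, made-up game day date."""
    return datetime.fromisoformat(f"2024-01-15T{clock_time}")


# Hypothetical timestamps from a facilitator's notes.
durations = milestone_durations(
    injected=at("10:00:00"),
    detected=at("10:04:00"),
    diagnosed=at("10:19:00"),
    recovered=at("10:31:00"),
)
for name, delta in durations.items():
    print(f"{name}: {delta}")
```

Tracking the same three numbers at every game day shows whether detection, diagnosis, or recovery is the step that needs work.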

Game Day Scenarios

Scenario 1: "The database is down"
  - Kill the primary database
  - Does the application fail over to the replica?
  - Can the team access the runbook?
  - How long until recovery?

Scenario 2: "A dependency is returning errors"
  - Simulate a third-party API returning 500s
  - Does the circuit breaker trip?
  - Does the service degrade gracefully?
  - Are users shown a meaningful error?

Scenario 3: "We got hacked"
  - Simulate a compromised service account
  - Can the team revoke credentials quickly?
  - Is there an incident response playbook?
  - How long until containment?

Scenario 4: "Traffic spike"
  - Simulate 10x normal traffic
  - Does auto-scaling work?
  - What is the first bottleneck?
  - How long until the system stabilizes?
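
Scenario 2 asks whether the circuit breaker trips. For readers who have not built one, this is a minimal single-threaded sketch of the pattern. The class, its parameters, and the `flaky` function are hypothetical; a production service would more likely use an established library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                raise CircuitOpenError("dependency disabled, failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky():
    """Stand-in for a third-party API that is returning 500s."""
    raise RuntimeError("HTTP 500 from third-party API")


breaker = CircuitBreaker(failure_threshold=3, reset_seconds=30.0)
for attempt in range(5):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        print(attempt, "rejected by open breaker; serve a fallback")
    except RuntimeError:
        print(attempt, "dependency error")
```

During the game day, the thing to verify is the transition: after the threshold is reached, calls should be rejected immediately instead of waiting on the failing dependency.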

Real-World Example

An e-commerce company was preparing for their biggest sale of the year. Load testing showed their checkout flow could handle 3x normal traffic, but they expected 5-8x. They had three weeks.

The capacity planning response:

  - Added read replicas for the database
  - Increased cache capacity by 2x
  - Pre-scaled Kubernetes pods to handle 8x
  - Set up auto-scaling with aggressive thresholds

The chaos engineering response:

  - Ran a game day simulating database failover during peak load
  - Discovered the connection pool was too small for the failover scenario
  - Found that the cache invalidation during failover caused a thundering herd
  - Fixed both issues before the sale

The sale hit 6x normal traffic. The system handled it with no downtime. Without load testing, they would have discovered the connection pool issue during the sale. Without the game day, they would have discovered the thundering herd problem during a real database failover.
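
A thundering herd happens when many requests miss the cache at the same moment and all go to the database. The example does not say how the company fixed it. One common mitigation, shown here as an assumption rather than their actual fix, is to let only one caller load a missing key while the others wait for its result. The class and function names are hypothetical, and this sketch covers a single process only.

```python
import threading
import time

_MISSING = object()


class SingleFlightCache:
    """Cache where concurrent misses for one key trigger a single load."""

    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()  # protects _key_locks

    def get(self, key):
        value = self._values.get(key, _MISSING)
        if value is not _MISSING:
            return value
        with self._guard:
            lock = self._key_locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have loaded it while we waited.
            value = self._values.get(key, _MISSING)
            if value is _MISSING:
                value = self._loader(key)
                self._values[key] = value
            return value

    def invalidate(self, key):
        self._values.pop(key, None)


calls = []


def slow_db_read(key):
    calls.append(key)   # record each trip to the "database"
    time.sleep(0.05)    # stand-in for a slow query
    return f"value-for-{key}"


cache = SingleFlightCache(slow_db_read)
threads = [threading.Thread(target=cache.get, args=("product:42",))
           for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls), "database read(s) for 20 concurrent requests")
```

Without the per-key lock, each of the 20 threads could issue its own database read after an invalidation, which is the load pattern that exhausts a connection pool during failover.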

Common Pitfalls

  - Capacity planning based on averages: averages hide peaks, so plan for peak (for example, p99) traffic rather than average traffic.
  - No buffer for the unknown: a 10% buffer is not enough. Viral moments, bot traffic, and DDoS attacks can multiply traffic overnight, so use a 2-3x buffer.
  - Load testing only in staging: staging rarely matches production topology. Also test in production, using feature flags and canary traffic to limit exposure.
  - Chaos engineering without a hypothesis: randomly breaking things teaches you nothing. Start with a hypothesis and measure the result.
  - Skipping the safety controls: every chaos experiment needs a kill switch. Without one, a failed experiment becomes a real outage.
  - Running chaos experiments without buy-in: if leadership does not understand why you are intentionally causing failures, the first experiment that goes wrong will be the last.

Key Takeaways

  - Capacity planning starts with understanding current load, extrapolating growth, and adding a 2-3x buffer for spikes
  - Load test before major launches; never assume the system can handle more than you have proven it can handle
  - Chaos engineering intentionally breaks systems to find weaknesses before real failures do
  - Start chaos experiments small (kill a pod) and progress to larger failures (zone outage, region failover)
  - Game days combine chaos engineering with incident response practice in a structured event
  - Both capacity planning and chaos engineering are ongoing practices, not one-time activities