Cost Management

Cloud costs grow silently. What starts as a $200 monthly bill becomes $20,000 before anyone notices. The pay-as-you-go model that makes cloud flexible also makes it easy to waste money. Nobody reviews the bill because everybody assumes someone else is watching it. FinOps -- treating cloud spend as an engineering problem -- is the discipline that fixes this.

How Cloud Costs Grow

Month 1:   $200     Two EC2 instances, one RDS database
Month 3:   $800     Added staging environment, more instances
Month 6:   $3,000   New services, bigger databases, S3 growing
Month 12:  $8,000   Load balancers, NAT gateways, data transfer
Month 18:  $15,000  "Wait, why is our bill this high?"
Month 24:  $22,000  Engineer investigates, finds $7,000 in waste

The problem is not that cloud is expensive. The problem is that cloud costs are invisible. Nobody writes a check for a server. Instances are created with a click. Databases are upgraded "just in case." Staging environments run 24/7 even though nobody uses them on weekends.

Pricing Models

Understanding pricing models is the first step to controlling costs.

On-Demand

Pay by the hour or second. No commitment. The most expensive option per hour but the most flexible.

On-demand pricing (example, us-east-1):
  t3.medium:   $0.0416/hour  = $30/month
  m5.xlarge:   $0.192/hour   = $138/month
  r5.2xlarge:  $0.504/hour   = $363/month
  
Use on-demand for:
  - Variable workloads
  - Short-lived resources
  - Development and testing
  - New services where you do not know the baseline yet
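
The monthly figures above follow directly from the hourly rates. A minimal sketch of the conversion, assuming a 720-hour (30-day) month, which is the assumption that reproduces the rounded numbers in the list; the function name is illustrative:

```python
HOURS_PER_MONTH = 720  # assumption: 30 days x 24 hours


def monthly_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Return the cost of running one instance for `hours` at `hourly_rate`."""
    return hourly_rate * hours


# Hourly rates copied from the on-demand example above.
rates = {"t3.medium": 0.0416, "m5.xlarge": 0.192, "r5.2xlarge": 0.504}
for name, rate in rates.items():
    print(f"{name}: ${monthly_cost(rate):.0f}/month")
```

Actual bills vary with the number of days in the month, so treat these as estimates rather than invoice amounts.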

Reserved Instances / Savings Plans

Commit to 1 or 3 years of usage in exchange for a discount (30-60% off on-demand).

Reserved Instance pricing (1-year, partial upfront):
  m5.xlarge on-demand:  $138/month
  m5.xlarge reserved:   $87/month  (37% savings)
  
Savings Plans (AWS):
  Compute Savings Plan: commit to $/hour of compute
  Flexible across instance types, regions, and services
  
  Example: commit to $100/hour for 1 year
  Save 30-40% vs on-demand on all matching compute

When to reserve:
  - Production workloads running 24/7
  - Databases that will exist for 1+ years
  - Baseline compute that does not change
  
When NOT to reserve:
  - New services (you do not know the baseline yet)
  - Workloads that might be decommissioned
  - Rapidly changing architectures
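
Whether a reservation pays off depends on how much of the month the instance would otherwise run on-demand. A rough sketch of that break-even check, using the m5.xlarge figures above and folding any upfront payment into the monthly reserved price (a simplification); the function names are illustrative:

```python
def break_even_utilization(on_demand_monthly: float, reserved_monthly: float) -> float:
    """Fraction of the month an instance must run for a reservation to pay off."""
    return reserved_monthly / on_demand_monthly


def cheaper_option(utilization: float, on_demand_monthly: float, reserved_monthly: float) -> str:
    """Compare on-demand cost at `utilization` (0.0-1.0) with the flat reserved fee."""
    on_demand_cost = on_demand_monthly * utilization
    return "reserved" if reserved_monthly < on_demand_cost else "on-demand"


print(f"break-even: {break_even_utilization(138, 87):.0%} of the month")
print(cheaper_option(1.0, 138, 87))  # runs 24/7
print(cheaper_option(0.4, 138, 87))  # runs 40% of the time
```

An instance that runs less than roughly two thirds of the time is cheaper on-demand at these prices, which is why reservations suit 24/7 baseline workloads.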

Spot / Preemptible Instances

Excess cloud capacity sold at a 60-90% discount. The catch: the cloud provider can reclaim the instance on short notice (two minutes on AWS, as little as 30 seconds on some other providers).

Spot pricing:
  m5.xlarge on-demand:  $0.192/hour
  m5.xlarge spot:       $0.058/hour (70% savings)
  
Use spot for:
  - Batch processing
  - CI/CD build agents
  - Data processing pipelines
  - Stateless workers that can restart
  - Load testing
  
Do NOT use spot for:
  - Databases
  - Stateful services
  - Anything that cannot tolerate interruption
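
Interruptions mean some work is repeated, so the headline discount overstates the real saving. A minimal sketch of the adjustment, assuming a hypothetical `rework_fraction` (the share of compute hours lost to interrupted work) that you would measure for your own workload:

```python
def effective_spot_rate(spot_rate: float, rework_fraction: float) -> float:
    """Cost per hour of useful work when `rework_fraction` of hours is redone."""
    if not 0 <= rework_fraction < 1:
        raise ValueError("rework_fraction must be in [0, 1)")
    return spot_rate / (1 - rework_fraction)


def spot_is_cheaper(on_demand_rate: float, spot_rate: float, rework_fraction: float) -> bool:
    """True if spot still beats on-demand after accounting for repeated work."""
    return effective_spot_rate(spot_rate, rework_fraction) < on_demand_rate


# m5.xlarge rates from the example above; 20% rework is an illustrative guess.
print(round(effective_spot_rate(0.058, 0.20), 4))
print(spot_is_cheaper(0.192, 0.058, 0.20))
```

At these prices spot stays cheaper until roughly 70% of the work has to be redone, which is why checkpointed batch jobs are such a good fit.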

Right-Sizing

Most cloud instances are over-provisioned. Engineers choose "large" because it feels safe. Nobody goes back to check if "medium" would suffice.

Typical over-provisioning:
  Instance: m5.2xlarge (8 vCPU, 32 GB RAM)
  Actual CPU usage: average 15%, peak 40%
  Actual memory usage: average 8 GB, peak 14 GB
  
  Right-sized: m5.xlarge (4 vCPU, 16 GB RAM)
  Savings: 50% on this instance ($276/month → $138/month)

How to Right-Size

# AWS: use Compute Optimizer recommendations
aws compute-optimizer get-ec2-instance-recommendations \
  --filters name=Finding,values=Overprovisioned

# Check actual utilization with CloudWatch
# Look at: CPUUtilization, plus memory (mem_used_percent, which requires the CloudWatch agent)
# Timeframe: at least 2 weeks including peak periods

Right-sizing process:
  1. Collect 2-4 weeks of utilization data
  2. Identify instances where peak usage < 50% of capacity
  3. Recommend one size down
  4. Test in staging first
  5. Apply in production during a maintenance window
  6. Monitor for 1 week after the change
  7. Repeat quarterly
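
Steps 2 and 3 of the process above can be sketched as a small filter. This is an illustration under stated assumptions, not a replacement for Compute Optimizer: the size ladder is simplified, and the `Utilization` record and `recommend` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: a simplified size ladder within a single instance family.
SIZE_LADDER = ["large", "xlarge", "2xlarge", "4xlarge"]


@dataclass
class Utilization:
    instance_type: str   # e.g. "m5.2xlarge"
    peak_cpu_pct: float  # peak CPU over the observation window
    peak_mem_pct: float  # peak memory over the same window


def recommend(u: Utilization, threshold_pct: float = 50.0) -> Optional[str]:
    """Return the next size down if both peaks are under the threshold, else None."""
    family, size = u.instance_type.split(".", 1)
    if size not in SIZE_LADDER or SIZE_LADDER.index(size) == 0:
        return None  # unknown size, or already the smallest on the ladder
    if max(u.peak_cpu_pct, u.peak_mem_pct) >= threshold_pct:
        return None  # peak usage too high to downsize safely
    return f"{family}.{SIZE_LADDER[SIZE_LADDER.index(size) - 1]}"


# The over-provisioned example above: 40% peak CPU, 14 GB of 32 GB peak memory.
print(recommend(Utilization("m5.2xlarge", 40.0, 14 / 32 * 100)))
```

The recommendation is only a candidate; the staging test and the monitoring week in steps 4 to 6 still apply.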

Database Right-Sizing

Databases are often the biggest over-provisioning offenders.

Common pattern:
  RDS db.r5.4xlarge chosen "because we might need it"
  Actual connections: 30 of 1,000 max
  Actual CPU: 10% average
  Actual memory: 20 GB of 128 GB used
  
  Cost: $2,700/month
  Right-sized to db.r5.xlarge: $675/month
  Savings: $2,025/month ($24,300/year)

Tagging Everything

Without tags, you cannot answer "who is spending this money?" or "what is this resource for?"

Required tags for every resource:
  Team:        Which team owns this resource?
  Environment: production, staging, development, sandbox
  Service:     Which service does this support?
  CostCenter:  Which budget does this charge to?
  ManagedBy:   terraform, manual, pulumi

# Terraform example: apply the required tags
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"

  tags = {
    Team        = "payments"
    Environment = "production"
    Service     = "payment-api"
    CostCenter  = "eng-payments"
    ManagedBy   = "terraform"
  }
}

Enforce tagging with policies:
  AWS: Service Control Policies (SCPs) that deny resource creation
       without required tags
  GCP: Organization policies
  Azure: Azure Policy
  
  If a resource has no tags, it is invisible in cost reports.
  Untagged resources are the dark matter of cloud bills.
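
Policy engines do the enforcement, but the check itself is simple. A minimal sketch of a tag validator that could run in CI or in a periodic audit script, using the required tags and environment values listed above; the function name and message format are illustrative:

```python
from typing import Dict, List

REQUIRED_TAGS = {"Team", "Environment", "Service", "CostCenter", "ManagedBy"}
ALLOWED_ENVIRONMENTS = {"production", "staging", "development", "sandbox"}


def tag_violations(tags: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems with a resource's tags."""
    problems = [f"missing tag: {key}" for key in sorted(REQUIRED_TAGS - set(tags))]
    env = tags.get("Environment")
    if env is not None and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment: {env}")
    return problems


print(tag_violations({"Team": "payments", "Environment": "prod"}))
```

An empty list means the resource passes; anything else can fail the pipeline or feed a weekly report of untagged resources.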

Cost Alerts

Set up alerts before you need them, not after you get a surprise bill.

# AWS Budget alert (Terraform)
resource "aws_budgets_budget" "monthly" {
  name         = "monthly-budget"
  budget_type  = "COST"
  limit_amount = "10000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Warn team leads once actual spend passes 80% of the budget.
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["team-leads@company.com"]
  }

  # Escalate when the forecast for the month exceeds the full budget.
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["engineering-vp@company.com"]
  }
}

Alert thresholds:
  50% of budget:  Informational (to the team)
  80% of budget:  Warning (to team leads)
  100% forecast:  Critical (to engineering leadership)
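
The threshold ladder above maps onto a few comparisons. A sketch of that mapping, assuming you already have month-to-date spend and a month-end forecast from your billing data; the function name and return values are illustrative:

```python
def alert_level(actual: float, forecast: float, budget: float) -> str:
    """Map month-to-date spend and the month-end forecast to an alert severity."""
    if forecast > budget:
        return "critical"       # forecast exceeds 100% of the budget
    if actual > 0.8 * budget:
        return "warning"        # actual spend above 80%
    if actual > 0.5 * budget:
        return "informational"  # actual spend above 50%
    return "ok"


print(alert_level(actual=8500, forecast=9900, budget=10000))
```

Checking the forecast first means leadership hears about a projected overrun even while actual spend is still low in the month.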
  
Anomaly detection:
  AWS Cost Anomaly Detection identifies unusual spending
  GCP billing alerts support anomaly detection
  Third-party tools: Vantage, CloudHealth, Spot.io
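
To make the idea concrete, here is a deliberately simple anomaly check on daily spend. It is not how the managed services above work internally (they use their own models); it only illustrates the principle of comparing today's spend with recent history. The threshold and the sample data are assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], today: float, z_threshold: float = 3.0) -> bool:
    """Flag `today` if it is more than `z_threshold` standard deviations from the mean."""
    if len(history) < 2:
        raise ValueError("need at least two days of history")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today != mu  # perfectly flat history: any change is unusual
    return abs(today - mu) / sigma > z_threshold


last_week = [300, 310, 295, 305, 290, 300, 315]  # illustrative daily spend in USD
print(is_anomalous(last_week, 450))
print(is_anomalous(last_week, 308))
```

A real detector would also account for weekly seasonality and month-end charges, which a plain z-score ignores.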

FinOps: Cloud Spend as an Engineering Problem

FinOps is the practice of bringing financial accountability to cloud spending. It treats cost as a first-class engineering metric, like performance or reliability.

FinOps principles:
  1. Teams must own their cloud costs
  2. Cost data must be accessible and understandable
  3. Decisions are driven by business value, not just cost
  4. Everyone takes advantage of the cloud's variable cost model
  5. A centralized FinOps team enables but does not dictate

The FinOps Team

FinOps team responsibilities:
  - Maintain cost visibility dashboards
  - Identify optimization opportunities
  - Negotiate reserved instances and savings plans
  - Set cost policies and budgets
  - Train engineering teams on cost awareness
  - Report cost trends to leadership

FinOps is NOT:
  - A team that approves every cloud resource
  - A cost-cutting squad that cancels resources
  - Finance people telling engineers what to do
  
FinOps IS:
  - Engineers and finance collaborating on cloud economics
  - Data-driven decisions about infrastructure investment
  - Continuous optimization, not one-time cost cuts

The Biggest Cost Drivers

1. Idle resources
   Staging environments running 24/7 when only used 9-5 Mon-Fri
   Development instances left running over weekends
   Forgotten load testing infrastructure
   Old environments from projects that ended months ago
   
   Fix: Schedule non-production environments to shut down
        after hours. Tag everything. Review monthly.

2. Over-provisioned databases
   RDS instances sized for peak + 10x safety margin
   DynamoDB with provisioned capacity set too high
   Redis clusters sized for a load that never materialized
   
   Fix: Right-size based on actual utilization. Use auto-scaling
        for DynamoDB. Review database sizing quarterly.

3. Data transfer (egress)
   Moving data out of the cloud is expensive
   Cross-region replication
   API responses to users (especially large payloads)
   Logs shipped to external observability platforms
   
   Fix: Use CDN for static content. Compress API responses.
        Keep data processing within the same region. Consider
        in-cloud observability tools.

4. Unattached storage
   EBS volumes from terminated instances (still charged)
   Unused snapshots piling up
   S3 buckets with data nobody accesses
   
   Fix: Automated cleanup of unattached EBS volumes.
        Lifecycle policies for S3 and snapshots.

5. NAT Gateway costs
   NAT Gateways charge per GB processed
   High-traffic services in private subnets can generate
   surprising NAT Gateway bills
   
   Fix: Use VPC endpoints for AWS service traffic.
        Review NAT Gateway data processing costs.
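
The fix for idle resources (item 1) is easy to quantify. A back-of-the-envelope sketch, assuming the environment's cost scales linearly with running hours, which ignores storage and other charges that continue while instances are stopped:

```python
HOURS_PER_WEEK = 168


def scheduled_savings(monthly_cost: float, hours_on_per_week: float) -> float:
    """Estimated monthly savings from running only `hours_on_per_week` instead of 24/7."""
    off_fraction = 1 - hours_on_per_week / HOURS_PER_WEEK
    return monthly_cost * off_fraction


# 9-5, Monday to Friday is 40 hours a week; $1,000/month is an illustrative cost.
print(round(scheduled_savings(1000, 40), 2))
```

A 9-5 weekday schedule leaves the environment off about three quarters of the time, so the upper bound on savings is about 76% of its compute cost.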

Real-World Example

A Series C startup was spending $45,000 per month on AWS. The CTO asked the platform team to investigate. They found:

- **$8,000 in idle staging environments.** Three staging environments ran 24/7. Teams used them 8 hours per day, 5 days per week. They scheduled them to shut down outside business hours. Savings: $5,500/month.
- **$6,000 in over-provisioned RDS instances.** The production database was an r5.4xlarge with 10% average CPU. They right-sized it to r5.xlarge with auto-scaling storage. Savings: $4,200/month.
- **$4,000 in data transfer.** Application logs were shipped to an external observability platform. They compressed logs and filtered out debug-level messages before shipping. Savings: $2,800/month.
- **$3,000 in unattached EBS volumes.** Engineers terminated instances but forgot to delete the attached volumes. An automated cleanup Lambda function reclaimed them. Savings: $3,000/month.

Total savings: $15,500/month ($186,000/year) without reducing any capability. They reinvested the savings into reserved instances for their baseline compute, saving another $4,000/month.
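
The totals can be checked in a few lines. This sketch only re-adds the figures from the example; the dictionary keys are shorthand labels:

```python
MONTHLY_BILL = 45_000  # starting AWS bill from the example

monthly_savings = {
    "idle staging": 5_500,
    "over-provisioned RDS": 4_200,
    "data transfer": 2_800,
    "unattached EBS": 3_000,
}

monthly_total = sum(monthly_savings.values())
annual_total = monthly_total * 12
share_of_bill = monthly_total / MONTHLY_BILL

print(f"${monthly_total:,}/month, ${annual_total:,}/year, {share_of_bill:.0%} of the bill")
```

Roughly a third of the bill was waste, before counting the additional $4,000/month from reservations.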

Common Pitfalls

Nobody owns costs -- If nobody is responsible for the cloud bill, nobody optimizes it; assign cost ownership to teams
Over-reserving -- Buying 3-year reserved instances for services that might be decommissioned next year; start with 1-year commitments
Ignoring data transfer -- Egress charges are hidden in the bill but can be significant; review data transfer costs monthly
Cost cutting that hurts reliability -- Removing redundancy to save money is false economy; an outage costs more than the infrastructure it would have prevented
One-time optimization -- Running a cost optimization exercise once and never again; costs drift back up within months; make it a continuous practice
No developer visibility -- If developers cannot see the cost impact of their decisions, they cannot optimize; give teams access to cost dashboards

Key Takeaways

Cloud costs grow silently; set up alerts and review the bill monthly before it surprises you
Reserved instances save 30-60% on predictable workloads; spot instances save 60-90% on interruptible workloads
Most instances are over-provisioned; right-size based on actual utilization, not assumptions
Tag every resource with team, environment, service, and cost center; untagged resources are invisible in cost reports
FinOps treats cloud spend as an engineering problem with data-driven decisions and team-level cost ownership
The biggest cost drivers are idle resources, over-provisioned databases, data transfer, unattached storage, and NAT Gateway traffic