Cost Management
Cloud costs grow silently. What starts as a small monthly bill becomes $20,000 before anyone notices. The pay-as-you-go model that makes cloud flexible also makes it easy to waste money. Nobody reviews the bill because everybody assumes someone else is watching it. FinOps -- treating cloud spend as an engineering problem -- is the discipline that fixes this.
How Cloud Costs Grow
Month 1: $200 Two EC2 instances, one RDS database
Month 3: $800 Added staging environment, more instances
Month 6: $3,000 New services, bigger databases, S3 growing
Month 12: $8,000 Load balancers, NAT gateways, data transfer
Month 18: $15,000 "Wait, why is our bill this high?"
Month 24: $22,000 Engineer investigates, finds $7,000 in waste
The problem is not that cloud is expensive. The problem is that cloud costs are invisible. Nobody writes a check for a server. Instances are created with a click. Databases are upgraded "just in case." Staging environments run 24/7 even though nobody uses them on weekends.
Pricing Models
Understanding pricing models is the first step to controlling costs.
On-Demand
Pay by the hour or second. No commitment. The most expensive option per hour but the most flexible.
On-demand pricing (example, us-east-1):
t3.medium: $0.0416/hour = $30/month
m5.xlarge: $0.192/hour = $138/month
r5.2xlarge: $0.504/hour = $363/month
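The monthly figures above follow from the hourly rates. A minimal sketch of the conversion, assuming a 720-hour month (30 days x 24 hours), which is the convention that reproduces the rounded numbers in the table; the function and constant names are illustrative only:

```python
# Assumption: a "month" is 30 days x 24 hours = 720 hours.
HOURS_PER_MONTH = 720

# Hourly on-demand rates copied from the table above (us-east-1 examples).
ON_DEMAND_HOURLY = {
    "t3.medium": 0.0416,
    "m5.xlarge": 0.192,
    "r5.2xlarge": 0.504,
}


def monthly_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost in USD of running one instance for `hours` at `hourly_rate` USD/hour."""
    return hourly_rate * hours


if __name__ == "__main__":
    for name, rate in ON_DEMAND_HOURLY.items():
        print(f"{name}: ${monthly_cost(rate):.0f}/month")
```

Passing a smaller `hours` value gives the cost of an instance that runs only part of the month, which is where on-demand pricing pays off.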
Use on-demand for:
- Variable workloads
- Short-lived resources
- Development and testing
- New services where you do not know the baseline yet
Reserved Instances / Savings Plans
Commit to 1 or 3 years of usage in exchange for a discount (30-60% off on-demand).
Reserved Instance pricing (1-year, partial upfront):
m5.xlarge on-demand: $138/month
m5.xlarge reserved: $87/month (37% savings)
Savings Plans (AWS):
Compute Savings Plan: commit to $/hour of compute
Flexible across instance types, regions, and services
Example: commit to $100/hour for 1 year
Save 30-40% vs on-demand on all matching compute
When to reserve:
- Production workloads running 24/7
- Databases that will exist for 1+ years
- Baseline compute that does not change
When NOT to reserve:
- New services (you do not know the baseline yet)
- Workloads that might be decommissioned
- Rapidly changing architectures
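The reserve-or-not decision comes down to a break-even utilization: a reservation is billed whether or not the instance runs, so it only wins if the instance runs for a large enough share of the month. A rough sketch using the m5.xlarge figures above; the function names are hypothetical and upfront payments are treated as already folded into the monthly reserved figure:

```python
def break_even_utilization(on_demand_monthly: float, reserved_monthly: float) -> float:
    """Fraction of the month an instance must run before reserving is cheaper."""
    if on_demand_monthly <= 0:
        raise ValueError("on_demand_monthly must be positive")
    return reserved_monthly / on_demand_monthly


def cheaper_option(utilization: float, on_demand_monthly: float, reserved_monthly: float) -> str:
    """Compare a full-month reservation with on-demand billed only for hours used."""
    on_demand_cost = utilization * on_demand_monthly
    return "reserved" if reserved_monthly < on_demand_cost else "on-demand"


if __name__ == "__main__":
    threshold = break_even_utilization(138, 87)
    print(f"Reserve only if the instance runs more than {threshold:.0%} of the time")
    print(cheaper_option(1.0, 138, 87))  # always-on production workload
    print(cheaper_option(0.5, 138, 87))  # workload that runs half the month
```

With these numbers the break-even sits at roughly 63% utilization, which is why 24/7 production workloads qualify and part-time or uncertain workloads do not.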
Spot / Preemptible Instances
Excess cloud capacity sold at a 60-90% discount. The catch: the cloud provider can reclaim the instance on short notice (two minutes on AWS; GCP and Azure give roughly 30 seconds).
Spot pricing:
m5.xlarge on-demand: $0.192/hour
m5.xlarge spot: $0.058/hour (70% savings)
Use spot for:
- Batch processing
- CI/CD build agents
- Data processing pipelines
- Stateless workers that can restart
- Load testing
Do NOT use spot for:
- Databases
- Stateful services
- Anything that cannot tolerate interruption
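Interruptions are not free: work in progress on a reclaimed instance has to be redone. One way to sanity-check whether spot still pays off is to inflate the spot price by the share of compute time lost to interruptions. This is a simplified model, not a provider formula, and `rework_fraction` is a hypothetical input you would have to measure for your own workload:

```python
def effective_spot_hourly(spot_hourly: float, rework_fraction: float) -> float:
    """Spot price per hour of useful work when some compute time must be redone.

    rework_fraction is the share of paid compute time lost to interruptions
    (0.0 = nothing lost, 0.1 = 10% of the work is repeated).
    """
    if not 0 <= rework_fraction < 1:
        raise ValueError("rework_fraction must be in [0, 1)")
    return spot_hourly / (1 - rework_fraction)


def savings_vs_on_demand(on_demand_hourly: float, effective_hourly: float) -> float:
    """Fractional savings relative to the on-demand rate."""
    return 1 - effective_hourly / on_demand_hourly


if __name__ == "__main__":
    for rework in (0.0, 0.10, 0.25):
        eff = effective_spot_hourly(0.058, rework)
        print(f"rework {rework:.0%}: savings {savings_vs_on_demand(0.192, eff):.0%}")
```

Under this model the m5.xlarge example still saves well over half the on-demand price even with a quarter of the work repeated, which is why restartable batch and CI workloads are a good fit.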
Right-Sizing
Most cloud instances are over-provisioned. Engineers choose "large" because it feels safe. Nobody goes back to check if "medium" would suffice.
Typical over-provisioning:
Instance: m5.2xlarge (8 vCPU, 32 GB RAM)
Actual CPU usage: average 15%, peak 40%
Actual memory usage: average 8 GB, peak 14 GB
Right-sized: m5.xlarge (4 vCPU, 16 GB RAM)
Savings: 50% on this instance ($276/month → $138/month)
How to Right-Size
# AWS: use Compute Optimizer recommendations
# (filter keys are lowercase; Finding values are Overprovisioned, Underprovisioned, Optimized)
aws compute-optimizer get-ec2-instance-recommendations \
    --filters name=Finding,values=Overprovisioned

# Check actual utilization with CloudWatch
# Look at: CPUUtilization, MemoryUtilization (requires the CloudWatch agent)
# Timeframe: at least 2 weeks including peak periods
Right-sizing process:
1. Collect 2-4 weeks of utilization data
2. Identify instances where peak usage < 50% of capacity
3. Recommend one size down
4. Test in staging first
5. Apply in production during a maintenance window
6. Monitor for 1 week after the change
7. Repeat quarterly
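Steps 2 and 3 of the process can be sketched as a small rule: if both CPU and memory peaks stay under 50%, suggest the next size down in the same family. The size ladder, the `Utilization` record, and the `recommend` function are illustrative assumptions, not the output format of Compute Optimizer:

```python
from dataclasses import dataclass
from typing import Optional

# Assumed size ladder within one instance family; each step doubles vCPU and memory.
SIZE_LADDER = ["large", "xlarge", "2xlarge", "4xlarge"]


@dataclass
class Utilization:
    instance_type: str   # e.g. "m5.2xlarge"
    peak_cpu_pct: float  # highest CPU utilization seen in the observation window
    peak_mem_pct: float  # highest memory utilization seen in the observation window


def recommend(u: Utilization, threshold_pct: float = 50.0) -> Optional[str]:
    """Return the next size down if both peaks are under the threshold, else None."""
    family, _, size = u.instance_type.partition(".")
    if size not in SIZE_LADDER:
        return None  # unknown size: leave it for manual review
    idx = SIZE_LADDER.index(size)
    if idx == 0:
        return None  # already the smallest size in the ladder
    if u.peak_cpu_pct < threshold_pct and u.peak_mem_pct < threshold_pct:
        return f"{family}.{SIZE_LADDER[idx - 1]}"
    return None


if __name__ == "__main__":
    # The example above: peak CPU 40%, peak memory 14 GB of 32 GB.
    example = Utilization("m5.2xlarge", peak_cpu_pct=40.0, peak_mem_pct=14 / 32 * 100)
    print(recommend(example))
```

Because halving the instance roughly doubles the utilization percentages, a peak just under 50% will run close to capacity afterwards; that is the reason for the staging test and the week of monitoring in the process above.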
Database Right-Sizing
Databases are often the biggest over-provisioning offenders.
Common pattern:
RDS db.r5.4xlarge chosen "because we might need it"
Actual connections: 30 of 1,000 max
Actual CPU: 10% average
Actual memory: 20 GB of 128 GB used
Cost: $2,700/month
Right-sized to db.r5.xlarge: $675/month
Savings: $2,025/month ($24,300/year)
Tagging Everything
Without tags, you cannot answer "who is spending this money?" or "what is this resource for?"
Required tags for every resource:
Team: Which team owns this resource?
Environment: production, staging, development, sandbox
Service: Which service does this support?
CostCenter: Which budget does this charge to?
ManagedBy: terraform, manual, pulumi
# Terraform example: enforce tagging
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"

  tags = {
    Team        = "payments"
    Environment = "production"
    Service     = "payment-api"
    CostCenter  = "eng-payments"
    ManagedBy   = "terraform"
  }
}
Enforce tagging with policies:
AWS: Service Control Policies (SCPs) that deny resource creation
without required tags
GCP: Organization policies
Azure: Azure Policy
If a resource has no tags, it is invisible in cost reports.
Untagged resources are the dark matter of cloud bills.
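A tagging policy is easier to keep honest with a periodic audit. The sketch below checks an inventory against the required tags listed above; the inventory shape (resource id mapped to a tag dictionary) and the function names are assumptions, and in practice the data would come from your cloud provider's resource or tagging API:

```python
from typing import Dict, List

REQUIRED_TAGS = ("Team", "Environment", "Service", "CostCenter", "ManagedBy")


def missing_tags(tags: Dict[str, str]) -> List[str]:
    """Return the required tag keys that are absent or blank, in policy order."""
    return [key for key in REQUIRED_TAGS if not tags.get(key, "").strip()]


def audit(resources: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each non-compliant resource id to the tags it is missing."""
    report = {}
    for resource_id, tags in resources.items():
        missing = missing_tags(tags)
        if missing:
            report[resource_id] = missing
    return report


if __name__ == "__main__":
    # Hypothetical inventory for illustration.
    inventory = {
        "i-web": {
            "Team": "payments",
            "Environment": "production",
            "Service": "payment-api",
            "CostCenter": "eng-payments",
            "ManagedBy": "terraform",
        },
        "i-mystery": {"Team": "payments", "Environment": ""},
    }
    for resource_id, missing in audit(inventory).items():
        print(f"{resource_id}: missing {', '.join(missing)}")
```

Treating blank values as missing matters: a resource with `Environment = ""` passes a naive key check but still cannot be attributed in a cost report.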
Cost Alerts
Set up alerts before you need them, not after you get a surprise bill.
# AWS Budget alert (Terraform)
resource "aws_budgets_budget" "monthly" {
  name         = "monthly-budget"
  budget_type  = "COST"
  limit_amount = "10000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Warn team leads when actual spend passes 80% of the budget
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["team-leads@company.com"]
  }

  # Escalate when the forecast for the month exceeds the budget
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["engineering-vp@company.com"]
  }
}
Alert thresholds:
50% of budget: Informational (to the team)
80% of budget: Warning (to team leads)
100% forecast: Critical (to engineering leadership)
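The three thresholds can be read as a small decision rule. A minimal sketch, assuming strict greater-than comparisons and that a forecast overrun outranks any actual-spend level; the function name and the level strings are illustrative:

```python
from typing import Optional


def alert_level(actual_spend: float, forecast_spend: float, budget: float) -> Optional[str]:
    """Return "critical", "warning", "info", or None for the current budget state."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    if forecast_spend > budget:
        return "critical"  # forecast above 100% of budget: engineering leadership
    ratio = actual_spend / budget
    if ratio > 0.8:
        return "warning"   # actual above 80%: team leads
    if ratio > 0.5:
        return "info"      # actual above 50%: the team
    return None


if __name__ == "__main__":
    budget = 10_000
    print(alert_level(5_500, 9_000, budget))
    print(alert_level(8_500, 9_900, budget))
    print(alert_level(6_000, 10_500, budget))
```

Checking the forecast first means a team learns about a projected overrun while actual spend is still low, which is when there is still time to act.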
Anomaly detection:
AWS Cost Anomaly Detection identifies unusual spending
GCP billing alerts support anomaly detection
Third-party tools: Vantage, CloudHealth, Spot.io
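To make the idea of anomaly detection concrete, here is a deliberately simple stand-in: flag a day whose cost exceeds the recent mean by more than a few standard deviations. This is not how AWS Cost Anomaly Detection or the third-party tools work internally; the function name and the default `k` are assumptions for illustration:

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], today: float, k: float = 3.0) -> bool:
    """True if today's cost is more than k sample standard deviations above the mean."""
    if len(history) < 2:
        raise ValueError("need at least two days of history")
    baseline = mean(history)
    spread = stdev(history)
    return today > baseline + k * spread


if __name__ == "__main__":
    # Hypothetical daily costs in USD for the previous week.
    last_week = [300, 310, 295, 305, 290, 300, 315]
    print(is_anomalous(last_week, 450))  # a sudden jump
    print(is_anomalous(last_week, 320))  # within normal variation
```

A fixed-threshold check like this ignores weekly seasonality and gradual growth, which is the gap the managed services are meant to fill.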
FinOps: Cloud Spend as an Engineering Problem
FinOps is the practice of bringing financial accountability to cloud spending. It treats cost as a first-class engineering metric, like performance or reliability.
FinOps principles:
1. Teams must own their cloud costs
2. Cost data must be accessible and understandable
3. Decisions are driven by business value, not just cost
4. Everyone takes advantage of the cloud's variable cost model
5. A centralized FinOps team enables but does not dictate
The FinOps Team
FinOps team responsibilities:
- Maintain cost visibility dashboards
- Identify optimization opportunities
- Negotiate reserved instances and savings plans
- Set cost policies and budgets
- Train engineering teams on cost awareness
- Report cost trends to leadership
FinOps is NOT:
- A team that approves every cloud resource
- A cost-cutting squad that cancels resources
- Finance people telling engineers what to do
FinOps IS:
- Engineers and finance collaborating on cloud economics
- Data-driven decisions about infrastructure investment
- Continuous optimization, not one-time cost cuts
The Biggest Cost Drivers
1. Idle resources
Staging environments running 24/7 when only used 9-5 Mon-Fri
Development instances left running over weekends
Forgotten load testing infrastructure
Old environments from projects that ended months ago
Fix: Schedule non-production environments to shut down
after hours. Tag everything. Review monthly.
2. Over-provisioned databases
RDS instances sized for peak + 10x safety margin
DynamoDB with provisioned capacity set too high
Redis clusters sized for a load that never materialized
Fix: Right-size based on actual utilization. Use auto-scaling
for DynamoDB. Review database sizing quarterly.
3. Data transfer (egress)
Moving data out of the cloud is expensive
Cross-region replication
API responses to users (especially large payloads)
Logs shipped to external observability platforms
Fix: Use CDN for static content. Compress API responses.
Keep data processing within the same region. Consider
in-cloud observability tools.
4. Unattached storage
EBS volumes from terminated instances (still charged)
Unused snapshots piling up
S3 buckets with data nobody accesses
Fix: Automated cleanup of unattached EBS volumes.
Lifecycle policies for S3 and snapshots.
5. NAT Gateway costs
NAT Gateways charge per GB processed
High-traffic services in private subnets can generate
surprising NAT Gateway bills
Fix: Use VPC endpoints for AWS service traffic.
Review NAT Gateway data processing costs.
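Two of the fixes above lend themselves to quick checks: how much a shutdown schedule saves, and which volumes are no longer attached to anything. A sketch under stated assumptions: the schedule math covers compute hours only (storage on stopped instances is still billed), and the volume records are plain dictionaries shaped loosely like EC2 `DescribeVolumes` output rather than a live API call:

```python
from typing import Dict, List

HOURS_PER_WEEK = 7 * 24  # 168


def schedule_savings_fraction(hours_per_day: float, days_per_week: float) -> float:
    """Share of an always-on compute bill avoided by running only on a schedule."""
    running_hours = hours_per_day * days_per_week
    if not 0 <= running_hours <= HOURS_PER_WEEK:
        raise ValueError("schedule must fit within one week")
    return 1 - running_hours / HOURS_PER_WEEK


def unattached_volumes(volumes: List[Dict]) -> List[str]:
    """Return the ids of volumes that have no attachments."""
    return [v["VolumeId"] for v in volumes if not v.get("Attachments")]


if __name__ == "__main__":
    # Staging used 9-5, Monday to Friday: 8 hours x 5 days.
    print(f"Compute savings: {schedule_savings_fraction(8, 5):.0%}")

    # Hypothetical volume records for illustration.
    sample = [
        {"VolumeId": "vol-a", "Attachments": [{"InstanceId": "i-1"}]},
        {"VolumeId": "vol-b", "Attachments": []},
    ]
    print(unattached_volumes(sample))
```

A 9-5 weekday schedule uses 40 of 168 hours, so about three quarters of the always-on compute cost disappears without anyone losing access during working hours.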
Real-World Example
A Series C startup was spending $45,000 per month on AWS. The CTO asked the platform team to investigate. They found:
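- [finding missing from source]: $5,500/month
- [finding missing from source]: $4,200/month
- [finding missing from source]: $2,800/month
- [finding missing from source]: $3,000/month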
Total savings: $15,500/month ($186,000/year) without reducing any capability. They reinvested the savings into reserved instances for their baseline compute, saving another $4,000/month.
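The totals follow directly from the four monthly amounts listed above. A short check of the arithmetic, using only figures from the example (the variable names are illustrative):

```python
# Monthly savings found by the platform team, in USD.
monthly_savings = [5_500, 4_200, 2_800, 3_000]
monthly_bill = 45_000

total_monthly = sum(monthly_savings)           # combined monthly savings
total_annual = total_monthly * 12              # annualized
share_of_bill = total_monthly / monthly_bill   # fraction of the original bill

if __name__ == "__main__":
    print(f"${total_monthly:,}/month, ${total_annual:,}/year, "
          f"{share_of_bill:.0%} of the original bill")
```

The four findings together account for roughly a third of the original bill, before the additional reserved-instance savings.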
Common Pitfalls
- Nobody owns costs -- If nobody is responsible for the cloud bill, nobody optimizes it; assign cost ownership to teams
- Over-reserving -- Buying 3-year reserved instances for services that might be decommissioned next year; start with 1-year commitments
- Ignoring data transfer -- Egress charges are hidden in the bill but can be significant; review data transfer costs monthly
- Cost cutting that hurts reliability -- Removing redundancy to save money is a false economy; an outage costs more than the infrastructure that would have prevented it
- One-time optimization -- Running a cost optimization exercise once and never again; costs drift back up within months; make it a continuous practice
- No developer visibility -- If developers cannot see the cost impact of their decisions, they cannot optimize; give teams access to cost dashboards
Key Takeaways
- Cloud costs grow silently; set up alerts and review the bill monthly before it surprises you
- Reserved instances save 30-60% on predictable workloads; spot instances save 60-90% on interruptible workloads
- Most instances are over-provisioned; right-size based on actual utilization, not assumptions
- Tag every resource with team, environment, service, and cost center; untagged resources are invisible in cost reports
- FinOps treats cloud spend as an engineering problem with data-driven decisions and team-level cost ownership
- The biggest cost drivers are idle resources, over-provisioned databases, data transfer, unattached storage, and NAT Gateway data processing