Cost Monitoring
Cloud costs grow silently. There is no alarm that goes off when your AWS bill doubles. There is no popup when someone leaves a test cluster running over the weekend. There is no warning when your storage costs quietly climb to $100 per month because nobody configured log rotation. By the time you notice, you have already overpaid by hundreds or thousands of dollars.
Set billing alerts on day one. Not day two. Day one.
The Silent Growth Problem
How cloud costs typically grow at a startup:
Month 1: $15 — just the basics
Month 2: $18 — added a staging environment
Month 3: $25 — started storing more data
Month 4: $35 — someone spun up a test database
Month 5: $55 — the test database is still running
Month 6: $80 — added a new service, forgot to set resource limits
Month 7: $120 — storage is growing, nobody's cleaning old data
Month 8: $180 — added monitoring, the monitoring is now expensive
Month 9: $250 — autoscaling kicked in during a traffic spike, never scaled back
Month 10: $350 — "wait, why is our bill $350?"
This is a real pattern. The $350 month is not the result of a single decision — it is the accumulation of small, unmonitored changes. Each one seemed reasonable at the time. Nobody was watching the total.
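The arithmetic behind that timeline is worth seeing once. The sketch below takes the ten monthly figures listed above and computes the month-over-month change; the function name is illustrative. No single month grows by more than about 57%, yet month 10 is more than 23 times month 1.

```python
# Monthly bills from the timeline above, in dollars.
monthly_bills = [15, 18, 25, 35, 55, 80, 120, 180, 250, 350]


def month_over_month_growth(bills):
    """Return the percentage change between each pair of consecutive months."""
    return [
        (current - previous) / previous * 100
        for previous, current in zip(bills, bills[1:])
    ]


growth = month_over_month_growth(monthly_bills)
total_multiple = monthly_bills[-1] / monthly_bills[0]

for month, pct in enumerate(growth, start=2):
    print(f"Month {month}: {pct:+.0f}% vs previous month")
print(f"Month 10 is {total_multiple:.1f}x month 1")
```

Each individual step looks tolerable on its own, which is why a running total needs to be watched as well.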
Billing Alerts: The First Thing You Set Up
Every major cloud provider supports billing alerts. Set them up before you deploy anything.
Alert thresholds for an early-stage startup:
Alert at $25 — "We've hit our baseline. Things are normal."
Alert at $50 — "Something new is costing money. Check what."
Alert at $100 — "This needs investigation today."
Alert at $200 — "Stop and figure out what's happening."
Where to set alerts:
AWS: Billing → Budgets → Create budget
GCP: Billing → Budgets & alerts → Create budget
Vercel: Settings → Billing → Spend management
Railway: Settings → Usage limits
Cloudflare: Not needed at free tier
How long this takes: 5 minutes per provider.
The alerts do not prevent overspending. They make sure you know about it. The action you take after receiving an alert is what saves money.
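The thresholds above are simple multiples of a $25 baseline (1x, 2x, 4x, 8x). As a minimal sketch, assuming you track month-to-date spend as a plain number, the logic looks like this. The function names are hypothetical and this is not any provider's API; the provider's own budget feature does the real alerting.

```python
def alert_thresholds(baseline, multiples=(1, 2, 4, 8)):
    """Derive alert thresholds from a baseline monthly cost."""
    return [baseline * m for m in multiples]


def triggered_alerts(current_spend, thresholds):
    """Return every threshold that the current spend has reached, lowest first."""
    return [t for t in sorted(thresholds) if current_spend >= t]


thresholds = alert_thresholds(25)
print(thresholds)
print(triggered_alerts(120, thresholds))
```

Deriving the thresholds from the baseline means they can be recomputed when the baseline legitimately changes, instead of being left at stale dollar amounts.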
Tag Everything
Tagging resources is the difference between "our AWS bill is $300" and "our AWS bill is $300, of which $220 is production, $60 is staging, and $20 is everything else."
Minimum tagging strategy:
Tag: environment
Values: production, staging, development, test
Tag: service
Values: api, web, worker, database, cache
Tag: owner
Values: engineer name or team name
Example:
Production Postgres: environment=production, service=database, owner=platform
Test Redis: environment=test, service=cache, owner=alice
Dev API server: environment=development, service=api, owner=bob
When the bill spikes, you can filter by tag to find the culprit in minutes instead of hours.
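Tags only help if they are applied consistently. One way to enforce the minimum strategy is a small check that runs in CI or in a periodic script. The sketch below assumes tags arrive as a plain dictionary; the names `REQUIRED_TAGS` and `tag_problems` are illustrative, not part of any cloud SDK.

```python
# Required tag keys and their allowed values, from the minimum tagging strategy.
# None means any non-empty value is accepted (owner is a person or team name).
REQUIRED_TAGS = {
    "environment": {"production", "staging", "development", "test"},
    "service": {"api", "web", "worker", "database", "cache"},
    "owner": None,
}


def tag_problems(tags):
    """Return a list of human-readable problems with a resource's tags."""
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        value = tags.get(key)
        if not value:
            problems.append(f"missing tag: {key}")
        elif allowed is not None and value not in allowed:
            problems.append(f"unexpected value for {key}: {value}")
    return problems


print(tag_problems({"environment": "production", "service": "database", "owner": "platform"}))
print(tag_problems({"environment": "prod", "service": "cache"}))
```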
Without tags:
"Our EC2 bill is $150. I see 8 instances. Which ones are needed?"
→ 2 hours of investigation
With tags:
"Our EC2 bill is $150. Production is $80, staging is $30, and there
are two test instances tagged 'owner=alice' costing $40 total."
→ 5 minutes to identify the problem
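The "with tags" breakdown is a group-by on a tag key. The sketch below uses a made-up list of eight instances whose costs match the example totals; in practice the data would come from a billing export. The record layout and the function name are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical instance list: eight instances adding up to the $150 EC2 bill.
instances = [
    {"monthly_cost": 20, "tags": {"environment": "production", "owner": "platform"}},
    {"monthly_cost": 20, "tags": {"environment": "production", "owner": "platform"}},
    {"monthly_cost": 20, "tags": {"environment": "production", "owner": "platform"}},
    {"monthly_cost": 20, "tags": {"environment": "production", "owner": "platform"}},
    {"monthly_cost": 15, "tags": {"environment": "staging", "owner": "platform"}},
    {"monthly_cost": 15, "tags": {"environment": "staging", "owner": "platform"}},
    {"monthly_cost": 20, "tags": {"environment": "test", "owner": "alice"}},
    {"monthly_cost": 20, "tags": {"environment": "test", "owner": "alice"}},
]


def cost_by_tag(resources, tag_key):
    """Sum monthly cost per value of the given tag; untagged resources are grouped together."""
    totals = defaultdict(float)
    for resource in resources:
        totals[resource["tags"].get(tag_key, "untagged")] += resource["monthly_cost"]
    return dict(totals)


print(cost_by_tag(instances, "environment"))
print(cost_by_tag(instances, "owner"))
```

The "untagged" bucket is deliberate: if it ever shows a non-trivial total, that is the first thing to investigate.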
Review Bills Monthly
Set a recurring calendar event: first Monday of each month, review the cloud bill. This takes 15 minutes and saves hundreds of dollars over time.
Monthly bill review checklist:
1. What is the total bill this month? ___
2. How does it compare to last month? ___
3. What are the top 3 line items? ___
4. Are there any new line items that were not there last month? ___
5. Are there any resources tagged as "test" or "development" that
should have been shut down? ___
6. Is storage growing? If so, do we need all that data? ___
7. Are there any services we are paying for but not using? ___
This is not glamorous work. It is the kind of work that keeps the bill from creeping up to $500/month without anyone noticing.
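The first four checklist questions can be answered mechanically if you keep each month's line items. A minimal sketch, assuming each bill is a dictionary of line item to dollar amount; the sample figures are invented for illustration.

```python
def review_bill(last_month, this_month, top_n=3):
    """Answer the first four checklist questions from two months of line items."""
    total_now = sum(this_month.values())
    total_before = sum(last_month.values())
    top_items = sorted(this_month.items(), key=lambda item: item[1], reverse=True)[:top_n]
    new_items = sorted(set(this_month) - set(last_month))
    return {
        "total": total_now,
        "change": total_now - total_before,
        "top_items": top_items,
        "new_items": new_items,
    }


# Invented example data, not real pricing.
last_month = {"EC2": 80, "RDS": 40, "S3": 20}
this_month = {"EC2": 85, "RDS": 40, "S3": 35, "NAT Gateway": 30}

for question, answer in review_bill(last_month, this_month).items():
    print(f"{question}: {answer}")
```

The remaining questions (idle test resources, unneeded data, unused services) still need a human to look at the account.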
The Biggest Line Items
At most early-stage startups, the largest cloud costs are not what you expect:
What people think costs the most:
Compute (servers, functions)
What actually costs the most:
1. Storage — data accumulates and never gets deleted
2. Egress — data transfer out of cloud providers
3. Forgotten resources — test instances, old environments, unused databases
4. Compute — but often because of over-provisioning, not usage
Storage
Storage is cheap per gigabyte but grows indefinitely if nobody manages it.
Common storage cost traps:
- Application logs that are never rotated or archived
- Database backups retained for 90 days when 7 would suffice
- User uploads that are never cleaned up after account deletion
- Old deployment artifacts that accumulate
- Development database snapshots from months ago
Fixes:
- Set log retention policies (7-14 days for most logs)
- Configure backup retention to match your actual needs
- Implement lifecycle policies for object storage:
  - Move to cold storage after 30 days
  - Delete after 90 days (unless compliance requires longer)
- Clean up old snapshots quarterly
S3 lifecycle policies are free to configure and can save significant money:
S3 lifecycle policy example:
Rule: Move objects older than 30 days to Glacier Instant Retrieval
Rule: Delete objects older than 90 days from the temp/ prefix
Rule: Delete incomplete multipart uploads after 7 days
Setup time: 10 minutes
Annual savings: Often 30-50% of storage costs
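For reference, the three rules above can be written as an S3 lifecycle configuration document. The sketch below only builds and prints the structure; the rule IDs are made up, and the first and third rules are assumed to apply to the whole bucket. A dictionary of this shape is what boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument, but check it against the current S3 documentation before applying it to a real bucket.

```python
import json

# Lifecycle configuration mirroring the three example rules.
# Rule IDs are arbitrary labels; an empty prefix matches every object.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "move-to-glacier-ir-after-30-days",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER_IR"}],
        },
        {
            "ID": "delete-temp-after-90-days",
            "Status": "Enabled",
            "Filter": {"Prefix": "temp/"},
            "Expiration": {"Days": 90},
        },
        {
            "ID": "abort-incomplete-multipart-uploads",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        },
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

Keeping the policy in a file under version control makes retention decisions reviewable, the same as any other infrastructure change.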
Egress
Data transfer out of AWS, GCP, and Azure is expensive. This is by design — cloud providers make it cheap to put data in and expensive to take it out.
Egress costs comparison:
AWS: $0.09/GB for first 10TB
GCP: $0.12/GB for first 1TB
Azure: $0.087/GB after a small free monthly allowance (roughly $0.05-0.12 depending on volume and region)
Cloudflare: $0.00/GB (free egress from R2, Workers, Pages)
Hetzner: 20TB included, then $1.30/TB
At 100GB/month outbound:
AWS: $9/month
Cloudflare: $0/month
At 1TB/month outbound:
AWS: $90/month
Cloudflare: $0/month
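The monthly figures above are flat-rate multiplication, treating 1TB as 1,000GB. A small sketch that reproduces them, with an optional included allowance for providers that bundle traffic. The rates are the ones quoted in this section and should be treated as approximate; real bills have tiers and free allowances that this ignores.

```python
def egress_cost(gb_out, price_per_gb, included_gb=0):
    """Estimate monthly egress cost in dollars at a flat per-GB rate."""
    billable_gb = max(gb_out - included_gb, 0)
    return round(billable_gb * price_per_gb, 2)


# Rates as quoted above (approximate, flat-rate simplification).
print(egress_cost(100, 0.09))    # AWS, 100GB/month
print(egress_cost(1000, 0.09))   # AWS, 1TB/month
print(egress_cost(1000, 0.0))    # Cloudflare, 1TB/month
print(egress_cost(1000, 0.0013, included_gb=20000))  # Hetzner, within the included 20TB
```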
If egress is a significant cost, consider:
Egress reduction strategies:
- Serve static assets from Cloudflare (free CDN, free egress)
- Use Cloudflare R2 instead of S3 for public assets
- Compress responses (gzip/brotli typically reduces the size of text-based responses such as JSON, HTML, and CSS by 60-80%)
- Cache aggressively at the CDN layer
- Keep data processing within the same cloud region
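To get a feel for what compression saves on your own responses, you can measure it directly. The sketch below uses Python's standard `gzip` module on an invented, fairly repetitive JSON payload, so the ratio it prints will be more favorable than for varied real-world data. Already-compressed content such as images and video will barely shrink.

```python
import gzip
import json


def compression_savings(payload: bytes) -> float:
    """Return the fraction of bytes saved by gzip-compressing the payload."""
    compressed = gzip.compress(payload)
    return 1 - len(compressed) / len(payload)


# Invented API response: 500 small records with repeated keys and values.
records = [{"id": i, "status": "active", "plan": "free"} for i in range(500)]
payload = json.dumps(records).encode("utf-8")

savings = compression_savings(payload)
print(f"Original: {len(payload)} bytes, saved by gzip: {savings:.0%}")
```

Most web servers and CDNs can apply this compression for you through configuration; the measurement is only to estimate how much egress it would remove.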
Forgotten Resources
The $50 test database that nobody remembered to shut down. The staging environment that runs 24/7 but is used 2 hours per week. The load balancer with no backends. These are pure waste.
Common forgotten resources:
- Test databases or cache instances
- Staging environments running during nights and weekends
- Old load balancers or API gateways
- Unattached EBS volumes or persistent disks
- Elastic IPs or static IPs not attached to anything
- Old container images in registries
- Lambda functions triggered by events that no longer occur
Finding them:
AWS: Cost Explorer → filter by usage type → sort by cost
AWS: Trusted Advisor → cost optimization checks (free)
GCP: Recommender → idle resource recommendations
Manual: Review all resources tagged "test" or "development" monthly
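The manual review can be partly scripted once you have an inventory export. The sketch below assumes a simplified, hypothetical record format (`type`, `attached_to`, `backend_count`, `tags`), not the output of any real cloud API, and flags three of the patterns listed above.

```python
def find_forgotten(resources):
    """Flag resources that match common 'forgotten resource' patterns."""
    flagged = []
    for resource in resources:
        reasons = []
        if resource["type"] in {"volume", "static_ip"} and not resource.get("attached_to"):
            reasons.append("not attached to anything")
        if resource["type"] == "load_balancer" and resource.get("backend_count", 0) == 0:
            reasons.append("no backends")
        if resource.get("tags", {}).get("environment") in {"test", "development"}:
            reasons.append("test/development resource, confirm it is still needed")
        if reasons:
            flagged.append((resource["name"], reasons))
    return flagged


# Hypothetical inventory for illustration.
inventory = [
    {"name": "vol-old-data", "type": "volume", "attached_to": None, "tags": {"environment": "production"}},
    {"name": "ip-unused", "type": "static_ip", "attached_to": None, "tags": {}},
    {"name": "lb-legacy", "type": "load_balancer", "backend_count": 0, "tags": {"environment": "production"}},
    {"name": "redis-test", "type": "cache", "tags": {"environment": "test", "owner": "alice"}},
    {"name": "api-prod", "type": "instance", "tags": {"environment": "production"}},
]

for name, reasons in find_forgotten(inventory):
    print(f"{name}: {'; '.join(reasons)}")
```

A flag is a prompt to ask the owner, not a signal to delete automatically.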
Over-Provisioned Compute
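Most startup applications are over-provisioned. A t3.large that sits mostly idle can usually be replaced by a much smaller instance, such as a t3.small at roughly $15/month.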
Right-sizing checklist:
1. Check average CPU utilization over the last 30 days
If <20%: downsize
2. Check average memory utilization over the last 30 days
If <40%: downsize
3. Check if you are paying for compute during off-hours
If usage drops to near-zero at night: schedule shutdown or
use auto-scaling with scale-to-zero
Tools:
AWS: Compute Optimizer (free)
AWS: Cost Explorer → right-sizing recommendations
GCP: VM instance recommendations
Manual: Check top/htop on your servers, check cloud metrics
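The checklist thresholds translate directly into a small rule function. This is a sketch of the checklist as written, with 30-day averages passed in as percentages; the function name and the `idle_off_hours` flag are illustrative. It does not look at peaks, so treat its output as a prompt to investigate before resizing.

```python
def rightsizing_advice(avg_cpu_pct, avg_mem_pct, idle_off_hours=False):
    """Apply the right-sizing checklist to 30-day average utilization figures."""
    advice = []
    if avg_cpu_pct < 20:
        advice.append("average CPU under 20%: consider downsizing")
    if avg_mem_pct < 40:
        advice.append("average memory under 40%: consider downsizing")
    if idle_off_hours:
        advice.append("near-zero usage off-hours: schedule shutdown or scale to zero")
    return advice


print(rightsizing_advice(avg_cpu_pct=5, avg_mem_pct=25, idle_off_hours=True))
print(rightsizing_advice(avg_cpu_pct=55, avg_mem_pct=70))
```

If only one of CPU or memory is low, a different instance family may fit better than a smaller size in the same family.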
Cost Monitoring Tools
Free tools:
AWS Cost Explorer: Built-in, shows cost breakdown by service
AWS Budgets: Built-in, sends alerts at thresholds
GCP Billing Reports: Built-in, similar to AWS Cost Explorer
Infracost: Open source, estimates cost of Terraform changes
Paid tools (usually not needed at early stage):
Vantage: $0/month for up to $2,500 in cloud spend
CloudZero: Starts around $1,000/month
Kubecost: For Kubernetes cost allocation
For most startups spending under $1,000/month on cloud, the built-in tools are sufficient. Set up billing alerts, review the cost dashboard monthly, and tag resources. You do not need a cost management platform until your cloud bill is consistently in the thousands.
SaaS Subscription Creep
Cloud infrastructure is not the only cost that grows silently. SaaS subscriptions accumulate too:
SaaS audit checklist:
Run quarterly:
1. List every SaaS subscription with monthly cost
2. For each: how many team members actively use it?
3. For each: what would we use instead if we cancelled?
4. Flag any subscription >$50/month with <50% team adoption
Common finds:
- Paid Slack plan when free tier is sufficient for <10 people
- Multiple overlapping tools (Notion + Confluence + Google Docs)
- Monitoring tools nobody checks
- Design tools with per-seat pricing for people who never open them
- Annual subscriptions that auto-renewed for a tool you stopped using
Cancel ruthlessly. You can always re-subscribe.
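Step 4 of the audit is a simple filter once the list exists. A minimal sketch, assuming each subscription is recorded with its monthly cost and number of active users; the subscription names and figures are invented.

```python
def flag_subscriptions(subscriptions, team_size, cost_limit=50, adoption_limit=0.5):
    """Return names of subscriptions costing more than cost_limit with adoption below adoption_limit."""
    flagged = []
    for subscription in subscriptions:
        adoption = subscription["active_users"] / team_size
        if subscription["monthly_cost"] > cost_limit and adoption < adoption_limit:
            flagged.append(subscription["name"])
    return flagged


# Invented example for a team of 8.
subscriptions = [
    {"name": "design-tool", "monthly_cost": 120, "active_users": 2},
    {"name": "chat", "monthly_cost": 70, "active_users": 8},
    {"name": "error-tracker", "monthly_cost": 30, "active_users": 1},
]

print(flag_subscriptions(subscriptions, team_size=8))
```

The cheap, rarely used tool is not flagged by this rule, so the "what would we use instead?" question still deserves a manual pass.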
Building a Cost-Conscious Culture
Practices that keep costs low:
- Include estimated cost in infrastructure PRs
"This adds a t3.medium ($30/month) for the worker service"
- Require cost justification for new services
"Why do we need this $50/month service? What are we replacing?"
- Share the cloud bill with the team monthly
Transparency prevents waste better than policies do
- Celebrate cost reductions
"We reduced our cloud bill from $300 to $180 by right-sizing"
- Default to the smallest viable resource size
You can always scale up. Scaling down requires remembering to.
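Putting a number in a PR is easier with a tiny helper. The sketch below converts an hourly rate to an approximate monthly cost using 730 hours per month. The hourly rate in the example is an assumed on-demand price chosen to match the t3.medium figure quoted above; look up the current rate for your region before using it.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8,760 / 12)


def monthly_cost(hourly_rate, count=1):
    """Approximate monthly cost of always-on resources billed by the hour."""
    return hourly_rate * HOURS_PER_MONTH * count


def pr_cost_note(resource, hourly_rate, purpose, count=1):
    """Format a one-line cost note for an infrastructure pull request."""
    cost = monthly_cost(hourly_rate, count)
    return f"This adds a {resource} (~${cost:.0f}/month) for {purpose}"


# 0.0416 is an assumed hourly rate, not an authoritative price.
print(pr_cost_note("t3.medium", 0.0416, "the worker service"))
```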
Common Pitfalls
- Not setting billing alerts on day one. This is the single highest-ROI infrastructure task you can do. Five minutes of setup prevents surprise bills every month.
- Not tagging resources. Without tags, a bill spike triggers a multi-hour investigation. With tags, you know the culprit in minutes.
- Assuming auto-scaling scales down. Most auto-scaling configurations are optimized for scaling up. They often do not scale back down aggressively, especially if minimum instance counts are set too high.
- Ignoring storage costs. Storage is cheap per-unit but grows linearly forever. Set retention policies on logs, backups, and object storage from the start.
- Paying for egress when alternatives exist. Cloudflare offers free egress on R2 and free CDN. Serving static assets from S3 instead of Cloudflare is paying for something that is free elsewhere.
- Annual SaaS subscriptions that auto-renew. Set calendar reminders 30 days before every annual renewal. Evaluate whether you still need the service. Cancel or downgrade proactively.
- Over-provisioning "just in case." Running a 4-CPU, 16GB server for an app that uses 0.5 CPU and 1GB of memory costs 8x more than it should. Start small. Monitor. Scale when the metrics tell you to.
Key Takeaways
- Set billing alerts on every cloud provider before you deploy anything. Alert at 2x, 4x, and 8x your baseline cost.
- Tag all resources with environment, service, and owner. This turns a "why is the bill high?" investigation from hours to minutes.
- Review cloud bills monthly. The biggest costs are usually storage and egress, not compute. Set retention policies on logs and backups. Use Cloudflare to eliminate egress fees.
- Audit forgotten resources quarterly: test databases, unused staging environments, unattached storage volumes, idle load balancers.
- Right-size compute resources. If average CPU utilization is under 20%, downsize. Start with the smallest viable instance and scale up based on data, not assumptions.