Cost Monitoring

Cloud costs grow silently. There is no alarm that goes off when your AWS bill doubles. There is no popup when someone leaves a test cluster running over the weekend. There is no warning when your storage costs quietly climb from $10 to $100 per month because nobody configured log rotation. By the time you notice, you have already overpaid by hundreds or thousands of dollars.

Set billing alerts on day one. Not day two. Day one.

The Silent Growth Problem

How cloud costs typically grow at a startup:
  Month 1:  $15 — just the basics
  Month 2:  $18 — added a staging environment
  Month 3:  $25 — started storing more data
  Month 4:  $35 — someone spun up a test database
  Month 5:  $55 — the test database is still running
  Month 6:  $80 — added a new service, forgot to set resource limits
  Month 7:  $120 — storage is growing, nobody's cleaning old data
  Month 8:  $180 — added monitoring, the monitoring is now expensive
  Month 9:  $250 — autoscaling kicked in during a traffic spike, never scaled back
  Month 10: $350 — "wait, why is our bill $350?"

This is a real pattern. The $350 month is not the result of a single decision — it is the accumulation of small, unmonitored changes. Each one seemed reasonable at the time. Nobody was watching the total.
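
The arithmetic behind that pattern can be checked with a short sketch. It uses the monthly figures listed above; the function name is only illustrative. No single month grows by more than about 57%, yet the bill ends up roughly 23 times its starting value.

```python
# Monthly bills from the example above, in dollars (months 1 through 10).
bills = [15, 18, 25, 35, 55, 80, 120, 180, 250, 350]


def month_over_month(values):
    """Return the growth ratio between each pair of consecutive months."""
    return [later / earlier for earlier, later in zip(values, values[1:])]


ratios = month_over_month(bills)
total_growth = bills[-1] / bills[0]

# Print the change for months 2..10, then the compounded total.
for month, ratio in enumerate(ratios, start=2):
    print(f"Month {month}: {ratio - 1:+.0%} vs previous month")
print(f"Month 1 to month 10: {total_growth:.1f}x")
```

Each step looks tolerable on its own, which is why only the running total reveals the problem.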

Billing Alerts: The First Thing You Set Up

Every major cloud provider supports billing alerts. Set them up before you deploy anything.

Alert thresholds for an early-stage startup:
  Alert at $25  — "We've hit our baseline. Things are normal."
  Alert at $50  — "Something new is costing money. Check what."
  Alert at $100 — "This needs investigation today."
  Alert at $200 — "Stop and figure out what's happening."

Where to set alerts:
  AWS:        Billing → Budgets → Create budget
  GCP:        Billing → Budgets & alerts → Create budget
  Vercel:     Settings → Billing → Spend management
  Railway:    Settings → Usage limits
  Cloudflare: Not needed at free tier

How long this takes: 5 minutes per provider.

The alerts do not prevent overspending. They make sure you know about it. The action you take after receiving an alert is what saves money.
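
As a provider-agnostic sketch, the thresholds above can be derived from a baseline and mapped to the action each one calls for. The function names and the 1x/2x/4x/8x multipliers are assumptions chosen to reproduce the $25/$50/$100/$200 levels; actual alerts are still created in each provider's billing console.

```python
# Alert levels from the list above: (threshold in dollars, message).
ALERTS = [
    (25, "We've hit our baseline. Things are normal."),
    (50, "Something new is costing money. Check what."),
    (100, "This needs investigation today."),
    (200, "Stop and figure out what's happening."),
]


def thresholds_from_baseline(baseline, multipliers=(1, 2, 4, 8)):
    """Return alert amounts as multiples of the baseline monthly cost."""
    return [baseline * m for m in multipliers]


def highest_alert(spend, alerts=ALERTS):
    """Return the (threshold, message) of the highest level crossed, or None."""
    crossed = [alert for alert in alerts if spend >= alert[0]]
    return crossed[-1] if crossed else None


print(thresholds_from_baseline(25))
print(highest_alert(120))
```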

Tag Everything

Tagging resources is the difference between "our AWS bill is $300" and "our AWS bill is $300, of which $180 is the production database, $60 is staging, $40 is that test cluster from last month, and $20 is everything else."

Minimum tagging strategy:
  Tag: environment
  Values: production, staging, development, test

  Tag: service
  Values: api, web, worker, database, cache

  Tag: owner
  Values: engineer name or team name

Example:
  Production Postgres:  environment=production, service=database, owner=platform
  Test Redis:           environment=test, service=cache, owner=alice
  Dev API server:       environment=development, service=api, owner=bob

When the bill spikes, you can filter by tag to find the culprit in minutes instead of hours.

Without tags:
  "Our EC2 bill is $150. I see 8 instances. Which ones are needed?"
  → 2 hours of investigation

With tags:
  "Our EC2 bill is $150. Production is $80, staging is $30, and there
   are two test instances tagged 'owner=alice' costing $40 total."
  → 5 minutes to identify the problem
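
A minimal sketch of what tags make possible, using a hypothetical inventory that mirrors the EC2 example above (the resource names and per-instance costs are made up for illustration). In practice the same grouping comes from the provider's cost dashboard once the tags are enabled for billing.

```python
from collections import defaultdict

REQUIRED_TAGS = ("environment", "service", "owner")

# Hypothetical inventory: monthly cost in dollars plus the tags on each resource.
resources = [
    {"name": "api-1", "cost": 40,
     "tags": {"environment": "production", "service": "api", "owner": "platform"}},
    {"name": "api-2", "cost": 40,
     "tags": {"environment": "production", "service": "api", "owner": "platform"}},
    {"name": "staging-api", "cost": 30,
     "tags": {"environment": "staging", "service": "api", "owner": "platform"}},
    {"name": "test-1", "cost": 20,
     "tags": {"environment": "test", "service": "api", "owner": "alice"}},
    {"name": "test-2", "cost": 20,
     "tags": {"environment": "test", "service": "api", "owner": "alice"}},
]


def cost_by_tag(resources, tag_key):
    """Sum monthly cost per value of one tag; untagged resources are grouped together."""
    totals = defaultdict(float)
    for resource in resources:
        totals[resource["tags"].get(tag_key, "untagged")] += resource["cost"]
    return dict(totals)


def missing_tags(resources, required=REQUIRED_TAGS):
    """Return names of resources that lack any of the required tags."""
    return [r["name"] for r in resources
            if any(key not in r["tags"] for key in required)]


print(cost_by_tag(resources, "environment"))
print(cost_by_tag(resources, "owner"))
print(missing_tags(resources))
```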

Review Bills Monthly

Set a recurring calendar event: first Monday of each month, review the cloud bill. This takes 15 minutes and saves hundreds of dollars over time.

Monthly bill review checklist:
  1. What is the total bill this month? ___
  2. How does it compare to last month? ___
  3. What are the top 3 line items? ___
  4. Are there any new line items that were not there last month? ___
  5. Are there any resources tagged as "test" or "development" that
     should have been shut down? ___
  6. Is storage growing? If so, do we need all that data? ___
  7. Are there any services we are paying for but not using? ___

This is not glamorous work. It is the kind of work that prevents a $50/month bill from becoming a $500/month bill without anyone noticing.
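
The first four checklist questions are mechanical and can be answered by a small script over two months of line items. This is a sketch under assumptions: the line-item names and amounts are hypothetical, and a real version would read them from the provider's billing export.

```python
def review_bill(last_month, this_month, top_n=3):
    """Answer checklist questions 1-4 from two {line_item: dollars} mappings."""
    total_now = sum(this_month.values())
    total_before = sum(last_month.values())
    top_items = sorted(this_month.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    new_items = sorted(set(this_month) - set(last_month))
    return {
        "total": total_now,
        "change": total_now - total_before,
        "top_items": top_items,
        "new_items": new_items,
    }


# Hypothetical line items, in dollars per month.
last_month = {"compute": 60, "database": 50, "storage": 20}
this_month = {"compute": 60, "database": 50, "storage": 35, "test-cluster": 40}

report = review_bill(last_month, this_month)
for key, value in report.items():
    print(f"{key}: {value}")
```

Questions 5 through 7 still need a human: the script can show that a line item is new, but not whether it is needed.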

The Biggest Line Items

At most early-stage startups, the largest cloud costs are not what you expect:

What people think costs the most:
  Compute (servers, functions)

What actually costs the most:
  1. Storage — data accumulates and never gets deleted
  2. Egress — data transfer out of cloud providers
  3. Forgotten resources — test instances, old environments, unused databases
  4. Compute — but often because of over-provisioning, not usage

Storage

Storage is cheap per gigabyte but grows indefinitely if nobody manages it.

Common storage cost traps:
  - Application logs that are never rotated or archived
  - Database backups retained for 90 days when 7 would suffice
  - User uploads that are never cleaned up after account deletion
  - Old deployment artifacts that accumulate
  - Development database snapshots from months ago

Fixes:
  - Set log retention policies (7-14 days for most logs)
  - Configure backup retention to match your actual needs
  - Implement lifecycle policies for object storage:
    - Move to cold storage after 30 days
    - Delete after 90 days (unless compliance requires longer)
  - Clean up old snapshots quarterly

S3 lifecycle policies are free to configure and can save significant money:

S3 lifecycle policy example:
  Rule: Move objects older than 30 days to Glacier Instant Retrieval
  Rule: Delete objects older than 90 days from the temp/ prefix
  Rule: Delete incomplete multipart uploads after 7 days

  Setup time: 10 minutes
  Annual savings: Often 30-50% of storage costs
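
The three rules above can be written as a lifecycle configuration in the shape the S3 API expects. This is a sketch, not a drop-in policy: the rule IDs are invented, the empty prefix applies the first and third rules to the whole bucket, and the bucket name in the comment is a placeholder.

```python
def lifecycle_configuration():
    """Build an S3 lifecycle configuration matching the three example rules."""
    return {
        "Rules": [
            {
                # Move objects older than 30 days to Glacier Instant Retrieval.
                "ID": "archive-after-30-days",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER_IR"}],
            },
            {
                # Delete objects older than 90 days under the temp/ prefix.
                "ID": "expire-temp-after-90-days",
                "Status": "Enabled",
                "Filter": {"Prefix": "temp/"},
                "Expiration": {"Days": 90},
            },
            {
                # Clean up multipart uploads that never completed.
                "ID": "abort-stale-multipart-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            },
        ]
    }


config = lifecycle_configuration()
print([rule["ID"] for rule in config["Rules"]])

# Applying it with boto3 (requires AWS credentials; "my-bucket" is a placeholder):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-bucket", LifecycleConfiguration=config)
```

Note that putting a lifecycle configuration replaces whatever configuration the bucket already has, so existing rules need to be included in the same call.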

Egress

Data transfer out of AWS, GCP, and Azure is expensive. This is by design — cloud providers make it cheap to put data in and expensive to take it out.

Egress costs comparison:
  AWS:         $0.09/GB for first 10TB
  GCP:         $0.12/GB for first 1TB
  Azure:       $0.087/GB after the free monthly allowance (then $0.05-0.12 by tier and region)
  Cloudflare:  $0.00/GB (free egress from R2, Workers, Pages)
  Hetzner:     20TB included, then $1.30/TB

At 100GB/month outbound:
  AWS:         $9/month
  Cloudflare:  $0/month

At 1TB/month outbound:
  AWS:         $90/month
  Cloudflare:  $0/month
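
The comparison above is a single multiplication, sketched here for other traffic volumes. The rates are the first-tier figures from the table; the sketch deliberately ignores free allowances, volume tiers, and regional differences, so treat the results as rough estimates.

```python
# First-tier list rates from the comparison above, in dollars per GB.
EGRESS_RATE_PER_GB = {"AWS": 0.09, "GCP": 0.12, "Cloudflare": 0.00}


def monthly_egress_cost(provider, gb_out):
    """Estimate monthly egress cost at a flat per-GB rate (no tiers or free allowances)."""
    return round(EGRESS_RATE_PER_GB[provider] * gb_out, 2)


# 1 TB is treated as 1,000 GB here, matching the figures above.
for gb in (100, 1000):
    for provider in EGRESS_RATE_PER_GB:
        print(f"{provider:<11} {gb:>5} GB: ${monthly_egress_cost(provider, gb):.2f}/month")
```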

If egress is a significant cost, consider:

Egress reduction strategies:
  - Serve static assets from Cloudflare (free CDN, free egress)
  - Use Cloudflare R2 instead of S3 for public assets
  - Compress responses (gzip/brotli reduces transfer size 60-80%)
  - Cache aggressively at the CDN layer
  - Keep data processing within the same cloud region

Forgotten Resources

The $50 test database that nobody remembered to shut down. The staging environment that runs 24/7 but is used 2 hours per week. The load balancer with no backends. These are pure waste.

Common forgotten resources:
  - Test databases or cache instances
  - Staging environments running during nights and weekends
  - Old load balancers or API gateways
  - Unattached EBS volumes or persistent disks
  - Elastic IPs or static IPs not attached to anything
  - Old container images in registries
  - Lambda functions triggered by events that no longer occur

Finding them:
  AWS: Cost Explorer → filter by usage type → sort by cost
  AWS: Trusted Advisor → cost optimization checks (free)
  GCP: Recommender → idle resource recommendations
  Manual: Review all resources tagged "test" or "development" monthly
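
Part of the manual review can be scripted. The sketch below filters a list of volume records for two of the cases above: unattached volumes, and anything tagged test or development (or not tagged at all). The sample data is hypothetical and only imitates the shape of entries returned by boto3's `describe_volumes()`; real responses carry more fields.

```python
# Hypothetical records shaped like entries of ec2.describe_volumes()["Volumes"].
volumes = [
    {"VolumeId": "vol-aaa", "State": "in-use", "Size": 100,
     "Tags": [{"Key": "environment", "Value": "production"}]},
    {"VolumeId": "vol-bbb", "State": "available", "Size": 50,
     "Tags": [{"Key": "environment", "Value": "test"}]},
    {"VolumeId": "vol-ccc", "State": "available", "Size": 20},
]


def tag_value(resource, key):
    """Return the value of one tag, or None if the resource does not have it."""
    for tag in resource.get("Tags", []):
        if tag["Key"] == key:
            return tag["Value"]
    return None


def unattached_volumes(volumes):
    """IDs of volumes in the 'available' state, meaning attached to no instance."""
    return [v["VolumeId"] for v in volumes if v["State"] == "available"]


def non_production(resources, id_key="VolumeId"):
    """IDs of resources tagged test or development, plus those with no environment tag."""
    return [r[id_key] for r in resources
            if tag_value(r, "environment") in ("test", "development", None)]


print("Unattached:", unattached_volumes(volumes))
print("Review these:", non_production(volumes))
```

The script only produces a list to review. Deleting a volume is irreversible, so that step is better left to a person.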

Over-Provisioned Compute

Most startup applications are over-provisioned. A t3.large ($60/month) running at 5% CPU utilization should be a t3.small ($15/month).

Right-sizing checklist:
  1. Check average CPU utilization over the last 30 days
     If <20%: downsize
  2. Check average memory utilization over the last 30 days
     If <40%: downsize
  3. Check if you are paying for compute during off-hours
     If usage drops to near-zero at night: schedule shutdown or
     use auto-scaling with scale-to-zero

Tools:
  AWS: Compute Optimizer (free)
  AWS: Cost Explorer → right-sizing recommendations
  GCP: VM instance recommendations
  Manual: Check top/htop on your servers, check cloud metrics
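
The right-sizing checklist translates directly into a few comparisons. This is a sketch: the 20% and 40% cutoffs come from the checklist, while the 1% cutoff for "near-zero" off-hours usage and the function name are assumptions. The inputs are 30-day averages taken from whatever metrics source is in use.

```python
def rightsizing_advice(avg_cpu_pct, avg_mem_pct, off_hours_cpu_pct=None):
    """Return a list of findings based on 30-day average utilization percentages."""
    advice = []
    if avg_cpu_pct < 20:
        advice.append("downsize: average CPU is under 20%")
    if avg_mem_pct < 40:
        advice.append("downsize: average memory is under 40%")
    # Assumption: under 1% CPU during off-hours counts as near-zero usage.
    if off_hours_cpu_pct is not None and off_hours_cpu_pct < 1:
        advice.append("schedule shutdown or scale to zero during off-hours")
    return advice


# The t3.large from the example above: 5% CPU, with a hypothetical 30% memory use.
print(rightsizing_advice(avg_cpu_pct=5, avg_mem_pct=30))
print(rightsizing_advice(avg_cpu_pct=55, avg_mem_pct=70, off_hours_cpu_pct=0.5))
```

The findings are reported separately on purpose: an instance with low CPU but high memory use may need a different instance family rather than a smaller size.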

Cost Monitoring Tools

Free tools:
  AWS Cost Explorer:     Built-in, shows cost breakdown by service
  AWS Budgets:           Built-in, sends alerts at thresholds
  GCP Billing Reports:   Built-in, similar to AWS Cost Explorer
  Infracost:             Open source, estimates cost of Terraform changes

Paid tools (usually not needed at early stage):
  Vantage:               $0/month for up to $2,500 in cloud spend
  CloudZero:             Starts around $1,000/month
  Kubecost:              For Kubernetes cost allocation

For most startups spending under $1,000/month on cloud, the built-in tools are sufficient. Set up billing alerts, review the cost dashboard monthly, and tag resources. You do not need a cost management platform until your cloud bill is consistently in the thousands.

SaaS Subscription Creep

Cloud infrastructure is not the only cost that grows silently. SaaS subscriptions accumulate too:

SaaS audit checklist:
  Run quarterly:
  1. List every SaaS subscription with monthly cost
  2. For each: how many team members actively use it?
  3. For each: what would we use instead if we cancelled?
  4. Flag any subscription >$50/month with <50% team adoption
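
Step 4 of the checklist is easy to automate once the list from steps 1 and 2 exists. A sketch, with hypothetical subscription names, costs, and usage counts:

```python
def flag_subscriptions(subscriptions, team_size, cost_limit=50, adoption_limit=0.5):
    """Return names of subscriptions costing more than cost_limit per month
    that fewer than adoption_limit of the team actively use."""
    flagged = []
    for sub in subscriptions:
        adoption = sub["active_users"] / team_size
        if sub["monthly_cost"] > cost_limit and adoption < adoption_limit:
            flagged.append(sub["name"])
    return flagged


# Hypothetical audit data for a team of eight.
subscriptions = [
    {"name": "design-tool", "monthly_cost": 120, "active_users": 2},
    {"name": "chat", "monthly_cost": 70, "active_users": 8},
    {"name": "wiki", "monthly_cost": 40, "active_users": 1},
]

print(flag_subscriptions(subscriptions, team_size=8))
```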

Common finds:
  - Paid Slack plan when free tier is sufficient for <10 people
  - Multiple overlapping tools (Notion + Confluence + Google Docs)
  - Monitoring tools nobody checks
  - Design tools with per-seat pricing for people who never open them
  - Annual subscriptions that auto-renewed for a tool you stopped using

Cancel ruthlessly. You can always re-subscribe.

Building a Cost-Conscious Culture

Practices that keep costs low:
  - Include estimated cost in infrastructure PRs
    "This adds a t3.medium ($30/month) for the worker service"
  - Require cost justification for new services
    "Why do we need this $50/month service? What are we replacing?"
  - Share the cloud bill with the team monthly
    Transparency prevents waste better than policies do
  - Celebrate cost reductions
    "We reduced our cloud bill from $300 to $180 by right-sizing"
  - Default to the smallest viable resource size
    You can always scale up. Scaling down requires remembering to.

Common Pitfalls

Not setting billing alerts on day one. This is the single highest-ROI infrastructure task you can do. Five minutes of setup prevents surprise bills every month.
Not tagging resources. Without tags, a bill spike triggers a multi-hour investigation. With tags, you know the culprit in minutes.
Assuming auto-scaling scales down. Most auto-scaling configurations are optimized for scaling up. They often do not scale back down aggressively, especially if minimum instance counts are set too high.
Ignoring storage costs. Storage is cheap per-unit but grows linearly forever. Set retention policies on logs, backups, and object storage from the start.
Paying for egress when alternatives exist. Cloudflare offers free egress on R2 and free CDN. Serving static assets from S3 instead of Cloudflare is paying for something that is free elsewhere.
Annual SaaS subscriptions that auto-renew. Set calendar reminders 30 days before every annual renewal. Evaluate whether you still need the service. Cancel or downgrade proactively.
Over-provisioning "just in case." Running a 4-CPU, 16GB server for an app that uses 0.5 CPU and 1GB of memory costs 8x more than it should. Start small. Monitor. Scale when the metrics tell you to.

Key Takeaways

Set billing alerts on every cloud provider before you deploy anything. Alert at 2x, 4x, and 8x your baseline cost.
Tag all resources with environment, service, and owner. This turns a "why is the bill high?" investigation from hours to minutes.
Review cloud bills monthly. The biggest costs are usually storage and egress, not compute. Set retention policies on logs and backups. Use Cloudflare to eliminate egress fees.
Audit forgotten resources quarterly: test databases, unused staging environments, unattached storage volumes, idle load balancers.
Right-size compute resources. If average CPU utilization is under 20%, downsize. Start with the smallest viable instance and scale up based on data, not assumptions.