Monitoring Before Users Complain

The worst way to learn about a production problem is from a customer email. The second worst way is from a tweet. The best way is from an alert that fires before any user notices.

Monitoring is not a nice-to-have. It is the difference between proactive engineering and reactive firefighting. Without monitoring, you are flying blind. Every deployment is a guess. Every "seems fine" is wishful thinking. Every quiet period could be an outage you have not noticed yet.

Set up monitoring before launch. Not after the first outage. Not when you have time. Before launch.

The Monitoring Hierarchy

Not all monitoring is equally important. Start at the top of the hierarchy and work down. Each level catches a different class of problems.

Monitoring hierarchy (most to least critical):

1. Is the site up? (uptime monitoring)
   Catches: complete outages, DNS failures, SSL expiry
   Tool: UptimeRobot, BetterStack, Checkly
   Cost: free

2. Are errors happening? (error tracking)
   Catches: unhandled exceptions, API failures, crashed processes
   Tool: Sentry, Bugsnag
   Cost: free tier

3. Is it fast enough? (performance monitoring)
   Catches: slow queries, degraded response times, memory leaks
   Tool: application metrics, APM tools
   Cost: free to moderate

4. Are users doing what we expect? (business metrics)
   Catches: broken signup flows, payment failures, feature issues
   Tool: analytics, custom dashboards
   Cost: varies

Most startups never get past levels 1 and 2, which is fine. Those two levels catch the majority of production problems. Add levels 3 and 4 as you grow.

Uptime Monitoring

The simplest and most essential form of monitoring. An external service pings your site every few minutes and alerts you if it is down.

Uptime monitoring setup:
1. Create a free UptimeRobot account
2. Add a monitor for your production URL
3. Set the check interval to 5 minutes (the shortest interval on the free tier)
4. Add your email and phone as alert contacts
5. Optionally add a Slack or Discord webhook

What to monitor:
- Your main domain (https://yourapp.com)
- Your API endpoint (https://api.yourapp.com/health)
- Your status page if you have one

Check types:
- HTTP check: expects 200 response
- Keyword check: expects specific text in the response
- Port check: verifies a port is open
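
The three check types are easy to picture in code. Here is a minimal Python sketch using only the standard library. The function names are made up for illustration, and this is not how UptimeRobot or any other service implements its checks.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=10.0):
    """GET a URL and return (status_code, body_text); (None, "") if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx or 5xx status.
        return exc.code, ""
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, TLS error, or timeout.
        return None, ""


def http_check_passes(status):
    """HTTP check: the response status must be 200."""
    return status == 200


def keyword_check_passes(status, body, keyword):
    """Keyword check: status 200 and the expected text appears in the body."""
    return status == 200 and keyword in body


def port_check_passes(host, port, timeout=5.0):
    """Port check: a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A check like `keyword_check_passes(*fetch("https://api.yourapp.com/health"), "ok")` only tells you something if it runs from outside your own infrastructure. The keyword "ok" is an assumption about what your health endpoint returns.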

UptimeRobot's free tier gives you 50 monitors with 5-minute intervals. That is more than enough for most startups. BetterStack and Checkly offer more advanced features on their free tiers.

The alert should go to your phone. Not just email. Not just Slack. Your phone. Outages happen at 2am on Saturday. You need to wake up.

Alert routing:
- Email: always (serves as a log)
- Phone call or SMS: for complete outages
- Slack or Discord: for team visibility
- Do not rely on a single channel
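
If you ever route alerts yourself instead of relying on your monitoring tool, the rules above can be sketched like this. The channel names and the `senders` mapping are hypothetical; in practice each sender would wrap your email, SMS, or chat integration.

```python
def channels_for(is_outage):
    """Email always, chat for team visibility, SMS only for complete outages."""
    channels = ["email", "chat"]
    if is_outage:
        channels.append("sms")
    return channels


def dispatch(message, is_outage, senders):
    """Send the message on every routed channel; return the channels that worked.

    senders maps a channel name to a callable that takes the message text.
    """
    delivered = []
    for name in channels_for(is_outage):
        send = senders.get(name)
        if send is None:
            continue
        try:
            send(message)
            delivered.append(name)
        except Exception:
            # One failing channel must not block the others.
            continue
    return delivered
```

The try/except around each sender is the point: if Slack is down during your outage, the SMS still goes out.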

Error Tracking

Error tracking captures every unhandled exception in your application, with full context: the stack trace, the request that caused it, the user who triggered it, and the environment details.

Sentry is the standard tool. Their free tier (5,000 events per month) covers most early-stage startups.

Sentry setup:
1. Create a free Sentry account
2. Install the SDK for your framework
   - Next.js: @sentry/nextjs
   - Express: @sentry/node
   - Django: sentry-sdk
   - Rails: sentry-ruby
3. Configure with your DSN (project key)
4. Deploy
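
For a Python application, step 3 can look roughly like the sketch below. It assumes the sentry-sdk package is installed. The environment variable names `APP_ENV` and `APP_RELEASE` are illustrative choices, not something Sentry requires.

```python
import os

import sentry_sdk

# Call this once at startup, before the application begins handling requests.
sentry_sdk.init(
    # The DSN is the project key from your Sentry project settings.
    # With no DSN set, the SDK stays disabled and sends nothing.
    dsn=os.environ.get("SENTRY_DSN"),
    # Hypothetical variable names; set them in your deploy configuration.
    environment=os.environ.get("APP_ENV", "production"),
    release=os.environ.get("APP_RELEASE"),
)
```

Keeping the DSN in an environment variable means local development stays quiet while production reports errors.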

What Sentry captures automatically:
- Unhandled exceptions with stack traces
- Request URL, method, and headers
- User information (if configured)
- Browser and OS details (frontend)
- Release version
- Environment (production, staging)

The power of error tracking is not just knowing that errors happen. It is knowing which errors are new, how often they occur, and which users are affected.

Error tracking workflow:
1. New error appears in Sentry
2. Alert fires (email, Slack, or PagerDuty)
3. Review the stack trace and context
4. Determine severity (affects all users or one?)
5. Fix, deploy, verify the error stops
6. Mark as resolved in Sentry
7. If it recurs, Sentry notifies you

Without error tracking, errors happen silently. Users experience problems and either complain (the minority) or leave (the majority). You never know what went wrong.

Linear, the project management tool, used Sentry from day one. When they launched publicly, they could see every error in real time and fix issues within minutes of each error being reported. Their reputation for reliability was built on this foundation.

Basic Metrics

Beyond uptime and errors, three metrics give you a clear picture of your application's health: request count, error rate, and response time.

The three essential metrics:

Request count (traffic):
- How many requests per minute or hour
- Spot anomalies: a sudden drop might mean an outage
- A sudden spike might mean a viral moment or a DDoS attack
- Baseline: know what "normal" looks like

Error rate (reliability):
- Percentage of requests returning 5xx errors
- Healthy: under 0.1%
- Concerning: 0.1-1%
- Critical: above 1%
- Alert when error rate exceeds your threshold

Response time (performance):
- P50 (median): the typical request; half of all requests are faster
- P95: the slow tail; 5% of requests are slower than this
- P99: near worst case; 1% of requests are slower than this
- Healthy P50: under 200ms
- Concerning P50: 200-500ms
- Critical P50: over 500ms
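
If your platform does not compute these for you, all three metrics can be derived from a list of requests. The sketch below assumes each request is recorded as a (status code, duration in milliseconds) pair and uses the nearest-rank method for percentiles. Other tools may interpolate and report slightly different numbers.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(requests):
    """Summarize one time window of (status_code, duration_ms) tuples."""
    if not requests:
        return {"count": 0, "error_rate_pct": 0.0,
                "p50_ms": None, "p95_ms": None, "p99_ms": None}
    durations = sorted(duration for _, duration in requests)
    # Only 5xx responses count toward the error rate.
    server_errors = sum(1 for status, _ in requests if 500 <= status <= 599)
    return {
        "count": len(requests),
        "error_rate_pct": 100 * server_errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```

Compute the summary per minute or per hour, then compare each window against your baseline.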

Most hosting platforms provide these metrics out of the box. Vercel, Railway, Render, and Fly.io all have built-in dashboards showing request count, error rate, and response time.

If you need more detailed metrics, Grafana Cloud has a generous free tier. Datadog and New Relic have free tiers suitable for startups.

What "Good" Looks Like

Knowing your metrics is useless without knowing what they should be. Here are reasonable targets for an early-stage web application.

Healthy production metrics:

Uptime: 99.9% (about 8.8 hours of downtime per year, or roughly 43 minutes per month)
- Realistic for a startup
- 99.99% requires significant investment
- Track monthly, not daily
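
The downtime budget behind an uptime percentage is simple arithmetic. A small sketch, assuming a 365-day year and a 30-day month:

```python
HOURS_PER_YEAR = 365 * 24          # 8,760
HOURS_PER_30_DAY_MONTH = 30 * 24   # 720


def downtime_budget_hours(uptime_pct, period_hours):
    """Hours of downtime allowed in a period at a given uptime percentage."""
    return (100 - uptime_pct) / 100 * period_hours


if __name__ == "__main__":
    for target in (99.9, 99.99):
        per_year = downtime_budget_hours(target, HOURS_PER_YEAR)
        per_month_min = downtime_budget_hours(target, HOURS_PER_30_DAY_MONTH) * 60
        # 99.9%  -> 8.76 h/year, 43.2 min/month
        # 99.99% -> 0.88 h/year, 4.3 min/month
        print(f"{target}%: {per_year:.2f} h/year, {per_month_min:.1f} min/month")
```

Going from 99.9% to 99.99% shrinks the monthly budget from about 43 minutes to about 4. That is less time than it takes most people to wake up and open a laptop, which is why the extra nine is expensive.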

Error rate: under 0.1% of requests
- Some client errors (4xx) are normal: bad input, expired sessions
- Server errors (5xx) should be rare
- Alert on sustained increase, not individual errors

Response time:
- P50: under 200ms
- P95: under 1 second
- P99: under 3 seconds
- These are for API responses, not full page loads

Database:
- Connection count: well under your maximum
- Slow queries: no query that regularly takes over 1 second
- Disk usage: under 80%

These are not aspirational targets. They are reasonable baselines for a production application running on modern infrastructure. If you are significantly worse, something needs attention.

Setting Up Alerts

Monitoring without alerting is a dashboard you never look at. Alerts are what turn monitoring into action.

Essential alerts:

Critical (wake you up):
- Site is down (uptime check fails for 2+ consecutive checks)
- Error rate exceeds 5% for 5+ minutes
- Database is unreachable
- SSL certificate expires in less than 7 days

Warning (check during business hours):
- Error rate exceeds 1% for 15+ minutes
- Response time P95 exceeds 3 seconds
- Database disk usage exceeds 80%
- Background job queue depth exceeds normal by 10x

Informational (review weekly):
- New error types appearing
- Traffic patterns changing
- Database slow query log

The key is avoiding alert fatigue. If you alert on everything, you start ignoring alerts. If you only alert on critical issues, every alert gets your attention.

Alert fatigue prevention:
- Start with fewer alerts and add as needed
- Every alert should require action (if not, remove it)
- Group related alerts (one alert for "database issues", not ten)
- Use different channels for different severities
- Review and prune alerts monthly

Monitoring Your Dependencies

Your application depends on external services: payment processors, email providers, auth services, APIs. When they go down, your application might go down too.

Dependencies to monitor:
- Payment processor (Stripe, etc.) - check their status page
- Auth provider (Auth0, Clerk, etc.) - check their status page
- Email service (SendGrid, Postmark, etc.)
- Database provider (if managed)
- DNS provider
- CDN (if separate from hosting)

How to monitor dependencies:
- Subscribe to their status page notifications
- Monitor your own integration endpoints
- Set up alerts for increased error rates on external calls
- Have fallback behavior when dependencies are degraded
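
Tracking error rates on external calls and falling back when a dependency is degraded can be sketched in a few lines. `DependencyMonitor` is a hypothetical helper written for this example, not part of any library.

```python
from collections import deque


class DependencyMonitor:
    """Tracks the most recent outcomes of calls to one external service."""

    def __init__(self, name, window=100):
        self.name = name
        # True = success, False = failure; old outcomes drop off automatically.
        self.outcomes = deque(maxlen=window)

    def call(self, func, fallback):
        """Run func(); on any exception, record a failure and return fallback()."""
        try:
            result = func()
        except Exception:
            self.outcomes.append(False)
            return fallback()
        self.outcomes.append(True)
        return result

    def error_rate_pct(self):
        """Percentage of failed calls in the current window."""
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return 100 * failures / len(self.outcomes)
```

Wrap each external call in `call`, give it a sensible fallback (queue the email, show a "try again shortly" message), and alert when `error_rate_pct()` climbs above what is normal for that dependency.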

When Stripe has an outage, you want to know immediately, not when a customer cannot check out and sends you an angry email.

The Status Page

Even if you are a one-person startup, a status page communicates professionalism and builds trust. When something goes wrong, customers want to check a status page before sending a support email.

Status page options:
- Instatus: free tier available
- BetterStack: included with monitoring
- Atlassian Statuspage: free for small teams
- GitHub Pages: manual updates, free
- Simple approach: a page on your site that you update manually

What to show:
- Current status of core services (operational, degraded, outage)
- Recent incidents with updates
- Uptime history (if your monitoring tool provides it)

You do not need to automate the status page initially. Manually updating it during incidents is fine. The point is having a place customers can check.

Logging as Monitoring

Structured logs are a form of monitoring. When something goes wrong, logs are how you debug it.

Logging strategy for monitoring:
- Log every request with timing information
- Log every error with full context
- Log every external API call with response status
- Use structured format (JSON) for machine parsing
- Include request IDs for tracing requests across services
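
Here is one way to get JSON logs with request IDs using only Python's standard logging module. The field names (request_id, method, path, status, duration_ms) and the logger name are assumptions for this sketch; use whatever names fit your application, as long as you use them consistently.

```python
import json
import logging
import sys
import uuid


class JsonFormatter(logging.Formatter):
    """Formats each log record as a single line of JSON."""

    FIELDS = ("request_id", "method", "path", "status", "duration_ms")

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed via extra= become attributes on the record.
        for key in self.FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(stream=None):
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)  # stderr when stream is None
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("app.requests")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger(sys.stdout)
    log.info("request completed", extra={
        "request_id": str(uuid.uuid4()),  # generate once per incoming request
        "method": "GET",
        "path": "/api/checkout",
        "status": 200,
        "duration_ms": 87,
    })
```

Generate the request ID once when a request arrives and pass it along to every service the request touches. That one field is what makes tracing possible later.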

Useful log patterns:
- Search logs for 500 errors in the last hour
- Find all requests from a specific user
- Track response time trends
- Identify slow database queries
- Trace a single request through the system
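
Once logs are JSON lines, these searches are a few lines of code each. The sketch below assumes entries carry "status" and "request_id" fields; those names are assumptions, so adjust them to match your own logs.

```python
import json


def parse_lines(lines):
    """Yield dict entries from JSON log lines, skipping anything that is not JSON."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict):
            yield entry


def server_errors(entries):
    """Entries whose status is a 5xx code."""
    return [
        e for e in entries
        if isinstance(e.get("status"), int) and 500 <= e["status"] <= 599
    ]


def trace(entries, request_id):
    """Every entry that belongs to a single request, in log order."""
    return [e for e in entries if e.get("request_id") == request_id]
```

Your platform's log search does the same thing through a dashboard. The code just shows why the structured format matters: every one of these queries is a simple filter on a named field.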

The combination of error tracking (Sentry) and structured logging gives you two complementary views: Sentry shows you individual errors with context, and logs show you the broader patterns.

The Free Monitoring Stack

You can monitor a production application for $0 using free tiers. Here is the complete setup.

Free monitoring stack:

Uptime: UptimeRobot
- 50 monitors, 5-minute intervals
- Email, SMS, webhook alerts
- Setup time: 10 minutes

Errors: Sentry
- 5,000 events per month
- Full stack traces with context
- Release tracking
- Setup time: 30 minutes

Metrics: hosting platform built-in
- Request count, error rate, response time
- Varies by platform but usually included
- Setup time: 0 minutes (already there)

Logging: platform stdout capture
- Railway, Render, Fly.io capture stdout
- Searchable through platform dashboard
- Setup time: 0 minutes (already there)

Status page: Instatus or BetterStack
- Basic status page with incident tracking
- Setup time: 15 minutes

Total cost: $0
Total setup time: under 1 hour

There is no excuse for launching without monitoring. The entire stack is free and takes less time to set up than a single feature.

Common Pitfalls

Monitoring only uptime. Your site can be "up" while returning errors to every user. Uptime monitoring is necessary but not sufficient. Add error tracking at minimum.

Alert fatigue. Too many alerts means you ignore them all. Start with a few critical alerts and add more only when needed. Every alert should demand action.

Not monitoring from outside your network. Internal health checks can pass while external users cannot reach your site. Use an external monitoring service, not just internal checks.

Monitoring in staging but not production. Staging monitoring is nice. Production monitoring is essential. If you can only do one, do production.

Not including context in alerts. "Error rate is high" is not actionable. "Error rate is 5.2% (normal is 0.1%), most errors are 500s on /api/checkout, started 10 minutes ago" is actionable.
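
Building that context into the alert text is a small amount of code. A sketch, with a hypothetical function name and parameters:

```python
def format_alert(metric, current_pct, baseline_pct, top_status, top_path, minutes_ago):
    """Build an alert message that says what is wrong, where, and since when."""
    return (
        f"{metric} is {current_pct:.1f}% (normal is {baseline_pct:.1f}%), "
        f"most errors are {top_status}s on {top_path}, "
        f"started {minutes_ago} minutes ago"
    )
```

Calling `format_alert("Error rate", 5.2, 0.1, 500, "/api/checkout", 10)` produces the actionable message quoted above.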

Waiting until you have time. You will never have time. Monitoring takes an hour to set up. Do it before launch. Your future self at 2am during an outage will thank you.

Key Takeaways

- Set up monitoring before launch, not after the first outage. The entire free stack takes under an hour.
- The monitoring hierarchy is uptime, errors, performance, business metrics. Start at the top and work down.
- UptimeRobot (free) plus Sentry (free) covers the majority of production monitoring needs for early-stage startups.
- Every alert should require action. If an alert does not change your behavior, remove it.
- You should know about problems before your users do. That is the entire point of monitoring.
- A status page, even a simple one, builds trust and reduces support burden during incidents.