The First Incident

Your first production incident will happen. This is not pessimism. It is certainty. Every production system fails eventually. Servers crash. Databases fill up. Deploys break things. External services go down. Someone pushes bad code on a Friday afternoon.

The question is not whether you will have an incident. The question is whether you will handle it well. And handling it well means being prepared before it happens, not figuring it out in the middle of a crisis.

How you respond to your first incident shapes your engineering culture for years to come. Blame someone, and people start hiding mistakes. Handle it calmly and learn from it, and people start reporting issues earlier and taking smart risks.

This is about being ready.

Before the Incident: Preparation

You do not need a 50-page incident response plan. You need the answers to five questions written down somewhere your team can find them.

Minimum incident preparation:

1. Who gets called when something breaks?
   - Name, phone number, and backup contact
   - Even if it is just you, write it down
   - If you are a team, define an on-call rotation

2. How do we know something is broken?
   - Monitoring alerts (uptime, errors, metrics)
   - Customer reports (support channel)
   - Automated health checks

3. How do we communicate with customers?
   - Status page URL
   - Support email
   - In-app banner template
   - Social media account (if applicable)

4. How do we fix common problems?
   - How to roll back a deploy
   - How to restart the application
   - How to connect to the database
   - How to check logs
   - How to scale up if traffic spikes

5. What do we do after it is fixed?
   - Post-mortem template
   - Action items tracking
   - Communication follow-up with customers

Write these down in a document that everyone on the team can access. A Google Doc is fine. A page in your wiki is fine. What matters is that it exists and people know where it is.
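If you would rather keep the answers next to your code, the five questions can also be sketched as a small data structure with a completeness check. This is only an illustration: the class name, field names, and placeholder values are all assumptions, not a standard format.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentPlan:
    """The five questions as fields. All field names are illustrative."""
    contacts: list = field(default_factory=list)           # 1. who gets called
    detection: list = field(default_factory=list)          # 2. how we know it is broken
    customer_channels: list = field(default_factory=list)  # 3. how we tell customers
    runbooks: dict = field(default_factory=dict)           # 4. how we fix common problems
    follow_up: list = field(default_factory=list)          # 5. what happens afterwards


def missing_sections(plan: IncidentPlan) -> list:
    """Return the names of the sections that are still empty."""
    return [name for name, value in vars(plan).items() if not value]


# Placeholder values; replace them with your real contacts and links.
plan = IncidentPlan(
    contacts=["Primary: <name, phone>", "Backup: <name, phone>"],
    runbooks={"rollback": "<link to rollback steps>"},
)
print(missing_sections(plan))  # the sections you still need to write
```

A check like this could run in CI so that an empty section is noticed before the first incident rather than during it.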

PagerDuty, a company built around incident management, advocates for simple, accessible incident response documentation over comprehensive plans that nobody reads.

The Anatomy of an Incident

Most startup incidents follow a predictable pattern. Understanding the pattern helps you respond faster.

Typical incident timeline:

00:00 - Something breaks
  Trigger: bad deploy, external service failure, traffic spike,
  database issue, or infrastructure problem

00:00 to 00:05 - Detection
  Best case: monitoring alert fires automatically
  Worst case: customer reports the problem

00:05 to 00:15 - Assessment
  What is broken? How many users are affected?
  Is it getting worse or stable?
  What changed recently? (deploys, config changes, traffic)

00:15 to 00:45 - Mitigation
  Can we roll back the last deploy?
  Can we restart the service?
  Can we fail over to a backup?
  Can we disable the broken feature?

00:45 to 02:00 - Resolution
  Fix the root cause
  Verify the fix in production
  Monitor for recurrence

After resolution - Recovery
  Update status page
  Communicate with affected customers
  Schedule post-mortem
  Track action items

The most important phase is mitigation. Your first goal is not to fix the root cause. It is to stop the bleeding. If rolling back a deploy stops the incident, roll it back. You can figure out what went wrong later.
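To see how your own incidents compare with the typical timeline, you can compute how long each phase took from a handful of timestamps. The sketch below assumes you recorded four events; the event names and the example times are hypothetical.

```python
from datetime import datetime


def minutes_since_start(events: dict) -> dict:
    """Minutes from the 'started' timestamp to every other recorded event."""
    start = events["started"]
    return {
        name: (when - start).total_seconds() / 60
        for name, when in events.items()
        if name != "started"
    }


# Hypothetical timestamps that match the typical timeline above.
events = {
    "started": datetime(2024, 3, 1, 14, 0),
    "detected": datetime(2024, 3, 1, 14, 5),
    "mitigated": datetime(2024, 3, 1, 14, 45),
    "resolved": datetime(2024, 3, 1, 16, 0),
}
for name, minutes in minutes_since_start(events).items():
    print(f"{name}: {minutes:.0f} min after start")
```

Time to mitigation is the number to watch first, since that is when customers stop being affected.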

During the Incident: Stay Calm

The first incident feels like an emergency. Your adrenaline spikes. You want to fix it immediately. This is exactly when people make mistakes that make things worse.

Incident response rules:
1. Breathe. A few extra seconds of thinking saves minutes of wrong actions.
2. Assess before acting. What is actually broken? What is the blast radius?
3. Communicate early. Tell your team and customers that you are aware and working on it.
4. Try the simplest fix first. Rollback. Restart. Disable the feature.
5. Do not make changes you cannot reverse. Avoid destructive actions under pressure.
6. Keep a timeline. Write down what you did and when. You will need this later.
7. Ask for help. If you are stuck for more than 15 minutes, call someone.
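Rule 6 is the easiest one to skip under pressure. A shared document or chat channel works fine for the timeline, but if you prefer something scriptable, a minimal sketch could look like this. The class and method names are made up for illustration.

```python
from datetime import datetime, timezone


class IncidentLog:
    """Append-only list of (timestamp, note) pairs for one incident."""

    def __init__(self):
        self.entries = []

    def note(self, text, when=None):
        # Default to the current UTC time so entries from different people line up.
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, text))

    def render(self):
        # One line per entry, ready to paste into a post-mortem timeline.
        return "\n".join(f"- {when:%H:%M} UTC: {text}" for when, text in self.entries)


log = IncidentLog()
log.note("Error rate alert fired")
log.note("Rolled back the last deploy")
print(log.render())
```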

The biggest mistake during a first incident is making panic-driven changes. An engineer sees a database error, panics, restarts the database, and loses in-flight transactions. Another engineer sees a deploy failure, force-pushes a fix without testing, and introduces a second bug.

Slow is smooth. Smooth is fast.

Customer Communication

How you communicate during an incident matters more than how fast you fix it. Customers can tolerate downtime. They cannot tolerate silence.

Customer communication template:

When the incident starts:
"We are experiencing issues with [specific feature or service].
Our team is investigating. We will provide updates as we learn more."

During the incident:
"We have identified the cause of the issue affecting [feature].
We are working on a fix. Estimated resolution: [time or unknown]."

When resolved:
"The issue affecting [feature] has been resolved.
All services are operating normally. We apologize for the inconvenience
and will share more details in a follow-up."

Follow-up (next day):
"Yesterday, [feature] was unavailable for [duration] due to [brief cause].
We have [taken specific action] to prevent this from happening again.
We apologize for the disruption."

Post these updates on your status page, send them to affected customers, and share them on whatever channels your customers use.
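Having the templates in code means nobody has to write customer-facing copy from scratch mid-incident. Here is one possible sketch that fills in the first three messages above; the stage names, the function name, and the "checkout" example are assumptions for illustration.

```python
TEMPLATES = {
    "started": (
        "We are experiencing issues with {feature}. "
        "Our team is investigating. We will provide updates as we learn more."
    ),
    "identified": (
        "We have identified the cause of the issue affecting {feature}. "
        "We are working on a fix. Estimated resolution: {eta}."
    ),
    "resolved": (
        "The issue affecting {feature} has been resolved. "
        "All services are operating normally. We apologize for the inconvenience "
        "and will share more details in a follow-up."
    ),
}


def render_update(stage: str, feature: str, eta: str = "unknown") -> str:
    """Fill in one template; `eta` only appears in the 'identified' message."""
    return TEMPLATES[stage].format(feature=feature, eta=eta)


print(render_update("started", "checkout"))
```

The next-day follow-up is left out on purpose: it depends on the cause and the specific action taken, so it is better written by hand.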

The tone matters. Be factual, not dramatic. Be specific, not vague. Acknowledge the impact without over-apologizing. Customers respect honesty and transparency.

Heroku and GitHub have had their share of incidents. The ones that are remembered positively are the ones with clear, honest, frequent communication. The ones remembered negatively are the ones with silence or corporate non-statements.

The Blameless Post-Mortem

After the incident is resolved and everyone has slept, conduct a post-mortem. This is the most important part of the process because it is how you prevent the next incident.

The post-mortem must be blameless. Not "who screwed up?" but "how did our systems allow this to happen?"

Blameless post-mortem template:

Title: [Brief description of incident]
Date: [Date of incident]
Duration: [Start time to resolution time]
Severity: [How many users affected, revenue impact]
Author: [Who is writing this]

Timeline:
- [Time]: [What happened]
- [Time]: [What action was taken]
- [Time]: [What was the result]
(Be specific. Include who did what.)

Root cause:
[What actually caused the incident? Go deep.
Not "bad deploy" but "database migration added a column
with a NOT NULL constraint without a default value,
causing INSERT failures for all new records."]

Contributing factors:
[What made the incident worse or delayed detection?]
- No alert on database error rate increase
- Migration was not tested against production data
- Deploy happened at 5pm on Friday

What went well:
- [What worked during the response]
- [Quick detection, fast rollback, good communication, etc.]

What went poorly:
- [What did not work during the response]
- [Slow detection, no rollback plan, unclear ownership, etc.]

Action items:
- [ ] [Specific action] - Owner: [name] - Due: [date]
- [ ] [Specific action] - Owner: [name] - Due: [date]
- [ ] [Specific action] - Owner: [name] - Due: [date]
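Action items only help if someone notices when they slip. If you track them outside a ticketing tool, a rough sketch of an overdue check might look like this; the `ActionItem` fields mirror the template above, and the rest is assumed for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str
    due: date
    done: bool = False


def overdue(items, today):
    """Open action items whose due date has passed, oldest first."""
    late = [item for item in items if not item.done and item.due < today]
    return sorted(late, key=lambda item: item.due)


# Hypothetical items from the migration example in the template.
items = [
    ActionItem("Add alert on database error rate", "<name>", date(2024, 3, 8)),
    ActionItem("Test migrations against production-like data", "<name>", date(2024, 3, 15)),
    ActionItem("Write rollback runbook", "<name>", date(2024, 3, 5), done=True),
]
for item in overdue(items, today=date(2024, 3, 10)):
    print(f"OVERDUE: {item.description} (owner: {item.owner}, due {item.due})")
```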

The key principle of blameless post-mortems: people make mistakes. If a mistake caused an incident, the question is not "who made the mistake?" but "why did our system allow this mistake to cause an incident?"

Blame-oriented thinking:
"John deployed bad code that broke the site."
Result: John is afraid to deploy. Other engineers are afraid to deploy.
Nothing changes about the system.

Blameless thinking:
"A deploy with a database migration broke the site because we had no
migration testing step, no automatic rollback, and no alert on elevated
error rates."
Result: we add migration testing, automatic rollback, and error alerts.
John deploys confidently next time.

Etsy popularized the blameless post-mortem approach in tech and published extensively about it. Their reasoning: if people are punished for mistakes, they hide mistakes. Hidden mistakes become bigger incidents. Blameless culture encourages transparency, which enables prevention.

Google's SRE book dedicates an entire chapter to blameless post-mortems. They consider it one of the most important practices in maintaining reliable systems. If it matters at Google's scale, it matters at yours.

Common First Incidents

Certain types of incidents are disproportionately common for early-stage startups. Knowing what to expect helps you prepare.

Most common first incidents:

1. Bad deploy
   Cause: code that worked locally but breaks in production
   Fix: roll back the deploy
   Prevention: staging environment, smoke tests after deploy

2. Database migration failure
   Cause: migration that works on empty DB but fails on production data
   Fix: roll back migration, fix and redeploy
   Prevention: test migrations against production-like data

3. External service outage
   Cause: a provider such as Stripe, Auth0, or SendGrid goes down
   Fix: wait, or implement degraded mode
   Prevention: monitoring external dependencies, fallback behavior

4. Traffic spike
   Cause: Hacker News, Product Hunt, viral social post
   Fix: scale up, add caching, enable rate limiting
   Prevention: basic caching, CDN for static assets

5. Disk space exhaustion
   Cause: logs growing without rotation, uploads without limits
   Fix: clear space, restart services
   Prevention: log rotation, upload size limits, disk space alerts

6. SSL certificate expiry
   Cause: certificate not set to auto-renew
   Fix: manually renew or re-provision
   Prevention: auto-renewing certificates, expiry monitoring

7. Secret exposure
   Cause: API key committed to public repo, env var misconfigured
   Fix: rotate all exposed secrets immediately
   Prevention: git-secrets, environment variable management
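Two of these, disk space exhaustion and certificate expiry, are easy to check with a small script if you do not have monitoring for them yet. The sketch below uses only the Python standard library. The thresholds (80% disk, 14 days) are arbitrary defaults, and the function names are made up for illustration.

```python
import shutil
import ssl
import time


def disk_used_percent(path="/"):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total


def days_until_expiry(not_after, now=None):
    """Days left on a certificate.

    `not_after` uses the format of the "notAfter" field returned by
    ssl.SSLSocket.getpeercert(), for example "Jan  5 09:34:43 2018 GMT".
    """
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def collect_warnings(disk_percent, cert_days, disk_limit=80, cert_limit_days=14):
    """Return human-readable warnings; both thresholds are arbitrary defaults."""
    found = []
    if disk_percent >= disk_limit:
        found.append(f"disk {disk_percent:.0f}% full")
    if cert_days <= cert_limit_days:
        found.append(f"certificate expires in {cert_days:.0f} days")
    return found
```

Run from a scheduled job, `collect_warnings(disk_used_percent("/"), days_until_expiry(not_after))` would give you a list to send to whatever alerting channel you already use. Fetching `not_after` from a live server is left out here.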

Building the Incident Muscle

Incident response is a skill. Like all skills, it improves with practice. Your first incident will be messy. Your fifth will be smoother. Your tenth will be routine.

Incident response improvement cycle:
1. Incident happens
2. Respond (probably messily the first time)
3. Conduct post-mortem
4. Implement action items
5. Next incident is slightly less messy
6. Repeat

After 5-10 incidents, you will have:
- A library of runbooks for common problems
- Refined alerting that catches issues faster
- Practiced communication patterns
- Confidence that you can handle the next one

The goal is not to eliminate incidents. The goal is to detect them faster, resolve them faster, and learn from them consistently.

On-Call for Small Teams

If you are a solo founder or a two-person team, on-call is simple: you are always on call. But even at this stage, having a structure helps.

Solo or two-person on-call:
- Set your phone to allow alerts from monitoring tools
- Have a laptop accessible (not always at your desk, but reachable)
- Set expectations: respond within 30 minutes, not 30 seconds
- Take turns if there are two of you (week on, week off)
- Use a tool like PagerDuty or Opsgenie even at small scale
  (free tiers available, builds good habits)

As the team grows to 3-5:
- Weekly on-call rotation
- Primary and secondary on-call
- Handoff at a consistent time (for example, Monday morning)
- On-call person handles incidents and non-urgent ops tasks
- Compensate on-call with time off after busy rotations

The key is sustainability. Being on-call 24/7 for months leads to burnout. Even a simple rotation between two people makes a massive difference in quality of life.
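A weekly rotation with a primary and a secondary is simple enough to compute rather than maintain by hand. The sketch below assumes handoff happens on the weekday of `start` and that the secondary is whoever is primary next week. The names and dates are hypothetical.

```python
from datetime import date


def on_call(team, day, start):
    """Return (primary, secondary) for the week containing `day`.

    Weeks are counted from `start`, which should fall on the handoff day
    (for example, a Monday). The secondary is next week's primary.
    """
    week = (day - start).days // 7
    primary = team[week % len(team)]
    secondary = team[(week + 1) % len(team)]
    return primary, secondary


team = ["ana", "ben", "chi"]       # hypothetical team members
rotation_start = date(2024, 1, 1)  # a Monday

print(on_call(team, date(2024, 1, 10), rotation_start))
```

With a one-person team the function returns the same name twice, which is an accurate description of solo on-call.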

Common Pitfalls

No preparation at all. "We'll figure it out when it happens" means you will figure it out slowly, poorly, and under maximum stress. Spend an hour writing down the basics before your first incident.

Blaming individuals. Blame kills transparency. If an engineer is blamed for an incident, every engineer learns to hide mistakes. This makes future incidents worse, not better.

Not doing a post-mortem. The incident is over. Everyone wants to move on. But without a post-mortem, you will have the same incident again. And again.

All action items, no follow-through. A post-mortem with 10 action items that nobody completes is worse than no post-mortem. It creates cynicism. Assign owners, set dates, and track completion.

Over-communicating internally, under-communicating externally. Your team knows everything about the incident in real time. Your customers know nothing. Flip this. Communicate with customers early and often.

Trying to fix the root cause during the incident. Mitigation first, root cause later. If rolling back a deploy fixes the problem, roll it back. Understand why the deploy was bad during the post-mortem, not during the outage.

Key Takeaways

  • Your first incident will happen. Being prepared means having answers to five questions: who to call, how you detect, how you communicate, how you fix, and how you learn.
  • During an incident, mitigate before you chase the root cause. Stop the bleeding first.
  • Communicate with customers early and often. Silence is worse than bad news.
  • Blameless post-mortems are non-negotiable. Blame prevents learning. Learning prevents future incidents.
  • Every incident makes your team better, provided you do the post-mortem and follow through on action items.
  • How you handle your first incident shapes your engineering culture. Handle it with calm, transparency, and curiosity.