Incident Management & Postmortems

Things will break. That's not a pessimistic take — it's an engineering reality. Bugs will ship, servers will go down, third-party integrations will fail at the worst possible moment. Your job as a team leader isn't to prevent every incident from ever happening (that's impossible and anyone who promises otherwise is selling something). Your job is to make sure your team responds well when things go sideways and actually learns from what happened.

The teams that handle incidents well aren't the ones that never have them. They're the ones that recover fast, communicate clearly, and come out the other side stronger. This guide will help you build that kind of team.


Incident Response Basics

Every incident, no matter how big or small, follows the same general flow:

Detect — Something is wrong. Maybe monitoring caught it, maybe a customer reported it, maybe someone on the team noticed weird behavior. The faster you detect, the less damage accumulates. Invest in good alerting. If your customers are finding your bugs before your monitoring does, that's a problem worth fixing.

Triage — How bad is it? Who is affected? Is this a "drop everything" situation or a "we'll fix it in the next hour" situation? Not every alert is a five-alarm fire. Learn to assess severity quickly. A good rule of thumb: if customers are actively impacted and can't work around it, it's urgent. If it's degraded but functional, you have a bit more breathing room.

Mitigate — Stop the bleeding. This isn't about finding the perfect fix — it's about reducing impact right now. Roll back a deploy, flip a feature flag, scale up servers, redirect traffic. The goal is to get customers back to a working state as fast as possible. The elegant fix comes later.

Resolve — Now find and fix the actual root cause. This might happen immediately after mitigation or it might be a follow-up task over the next few days. It depends on the severity and complexity.

Communicate — This runs parallel to everything above. Your stakeholders, your customers, your team — they all need to know what's happening. We'll dig into this more below.

Who does what: In practice, you want someone technical working the problem, someone coordinating communication, and someone making decisions about severity and customer impact. On a small team, that might be two people. On a larger team, you might have a formal incident commander role. The key is that everyone knows their role before the incident happens — figuring out who does what in the middle of an outage is a recipe for chaos.


Your Role During an Incident

Here's the thing most new team leaders get wrong: your job during an incident is probably not to fix the code.

Your job is to coordinate, communicate, and unblock.

Be the calm presence. If you're panicking, your team will panic. If you're steady and focused, they'll follow your lead. This is one of those moments where emotional regulation isn't a soft skill — it's a critical technical leadership skill.

Don't panic-code. The instinct to jump in and start typing is strong, especially if you're a strong engineer. Resist it. Unless you're the only person who can fix the problem, your time is better spent making sure the people who are fixing it have everything they need. Are they blocked on access? Get it. Do they need someone from another team? Pull them in. Does the CEO keep pinging Slack asking for updates? Handle that so your engineers can focus.

Keep stakeholders updated. Your manager, the product team, customer support, maybe executives — they all want to know what's happening. Give them regular updates even when the update is "still investigating, no change." Silence during an incident makes people nervous, and nervous people start making bad decisions or interrupting your team.

Shield your team from chaos. During a major incident, everyone suddenly has opinions and questions. Your job is to be the buffer. Let your engineers focus on the problem. You handle the noise.

Make decisions. Sometimes the team will be debating between two approaches. "Should we roll back or push a hotfix forward?" That's your call to make. Gather input quickly, decide, and move. A good decision made quickly beats a perfect decision made slowly during an incident.


MTTR Over MTBF

There are two metrics people talk about with incidents:

  • MTBF — Mean Time Between Failures. How often things break.
  • MTTR — Mean Time to Recovery. How fast you fix things when they break.

Here's the uncomfortable truth: in complex systems, you can't prevent all failures. You can reduce them, sure. But the marginal cost of preventing that next failure keeps going up while the system keeps getting more complex. At some point, you hit diminishing returns.

What you can do is get really, really good at recovering.

A team with an MTTR of 15 minutes and an MTBF of two weeks is in better shape than a team with an MTBF of two months but an MTTR of six hours. The first team has more incidents, but customers barely notice because they're resolved so fast. The second team has fewer incidents, but when they happen, it's a disaster.
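
To make that comparison concrete, here is a rough back-of-the-envelope sketch. It assumes a 365-day year, treats "two months" as 60 days, and counts every incident as full downtime; the function name is illustrative rather than taken from any tool.

```python
def yearly_downtime_hours(mtbf_days: float, mttr_hours: float) -> float:
    """Approximate downtime per year: incidents per year times hours per incident."""
    incidents_per_year = 365 / mtbf_days
    return incidents_per_year * mttr_hours


# Team A: fails every two weeks, recovers in 15 minutes.
team_a = yearly_downtime_hours(mtbf_days=14, mttr_hours=0.25)
# Team B: fails every two months (assumed to be 60 days), recovers in six hours.
team_b = yearly_downtime_hours(mtbf_days=60, mttr_hours=6)

print(f"Team A: {team_a:.1f} h/year, Team B: {team_b:.1f} h/year")
# Team A: 6.5 h/year, Team B: 36.5 h/year
```

Under these assumptions the team with four times as many incidents still has less than a fifth of the total downtime.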

What drives fast recovery:

  • Good monitoring and alerting (detect fast)
  • Runbooks for common failure modes (don't reinvent the wheel every time)
  • Easy rollback mechanisms (feature flags, blue-green deploys, canary releases)
  • On-call engineers who have context on the system (not someone who's never seen the codebase)
  • Practice. Seriously. If your team has never dealt with a production incident, the first real one will be ugly

Optimize for fast recovery. It's a better investment than chasing zero incidents.


Communication During Incidents

Communication is where most incident responses fall apart. The technical work is usually fine — engineers are good at fixing things. But the communication around it? That's where trust is won or lost.

Internal Communication

  • Update at least every 30 minutes during an active incident. Even if nothing has changed, say so. "Still investigating, current theory is X, next update in 30 min." People can handle bad news. They can't handle silence.
  • Use a dedicated channel. Slack channel, incident bridge call, whatever your tool is. Keep all incident discussion in one place. Side conversations in DMs create information gaps.
  • Post facts, not speculation. "Error rate spiked at 14:23 UTC after deploy v2.47" is useful. "I think maybe the database is slow?" is not.
  • Tag the right people. If you need someone from infrastructure, page them. Don't hope they'll notice.

External Communication (Customers)

  • Be honest. Customers can handle "we broke something." They can't handle being gaslit about whether there's actually a problem.
  • Be brief and actionable. "We're experiencing issues with payment processing. Our team is actively working on a fix. We'll update within the hour. No action needed on your side." That's it. That's the whole message.
  • Give an ETA when you can, but don't lie. "We expect this to be resolved within 2 hours" is fine if you believe it. Don't say 30 minutes if you have no idea.
  • Never blame. Not your team, not a vendor, not a specific person. "We identified an issue with our deployment process" is fine. "Dave pushed a bad commit" is absolutely not fine, even if Dave did push a bad commit.
  • Follow up when resolved. "The issue has been resolved. Service is fully restored. We'll be conducting a review to prevent recurrence." Customers notice when you close the loop.

Blameless Postmortems

This is the most important concept in this entire guide, so pay attention.

The goal of a postmortem is learning, not punishment.

When something goes wrong, humans naturally want to find someone to blame. "Who did this?" feels like progress. It isn't. It's counterproductive. Here's what happens in a blame culture:

  • People hide mistakes instead of reporting them
  • People don't take risks or ship quickly because they're afraid of being blamed
  • People cover their tracks instead of providing honest timelines
  • The actual systemic issues that caused the incident never get addressed because "Dave got a talking-to" feels like a resolution

A blameless postmortem asks different questions. Not "who screwed up?" but "what in our system allowed this to happen?" Not "why didn't you catch this?" but "why didn't our process catch this?"

The person who pushed the bad deploy isn't the root cause. The root cause is that a bad deploy could reach production without being caught. Fix the system, not the person.

This doesn't mean no one is accountable. People still own their work and their growth. But during the postmortem, the focus is on systemic causes and systemic fixes. If someone genuinely needs coaching or feedback, that's a private conversation, not a public postmortem topic.

If people fear blame, they'll hide problems. Hidden problems become bigger problems. Bigger problems become the kind of outages that make the news.


Running a Postmortem

Hold the postmortem within 48 hours of the incident while it's still fresh. Here's a template structure that works:

1. Summary

Two to three sentences: what happened, when, how long it lasted, who was affected, and what the business impact was.

2. Timeline

Reconstruct the timeline in detail. Use timestamps. Be specific.

  • 14:02 UTC — Deploy v2.47 pushed to production
  • 14:08 UTC — Error rate alert fires in #alerts
  • 14:12 UTC — On-call engineer acknowledges and begins investigation
  • 14:25 UTC — Root cause identified: database migration had a locking query
  • 14:31 UTC — Rollback initiated
  • 14:34 UTC — Service restored

This timeline is the foundation of the whole postmortem. Get it right.
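
Once the timeline is written down, the headline durations fall out of simple subtraction. The sketch below uses the example timestamps above; the dictionary keys and the helper name are illustrative, and it assumes every event happened on the same day.

```python
from datetime import datetime

# Timestamps from the example timeline (all UTC, same day).
timeline = {
    "deployed": "14:02",
    "alert_fired": "14:08",
    "acknowledged": "14:12",
    "cause_identified": "14:25",
    "rollback_started": "14:31",
    "restored": "14:34",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between(timeline["deployed"], timeline["alert_fired"])
time_to_acknowledge = minutes_between(timeline["alert_fired"], timeline["acknowledged"])
time_to_restore = minutes_between(timeline["deployed"], timeline["restored"])

print(time_to_detect, time_to_acknowledge, time_to_restore)  # 6 4 32
```

Numbers like these make it easier to see which phase of the response (detection, acknowledgement, or recovery) deserves attention in the action items.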

3. Root Cause Analysis

Use the 5 Whys technique:

  1. Why did the site go down? — A database migration locked a critical table.
  2. Why did the migration lock the table? — It used an ALTER TABLE on a large table without a concurrent migration strategy.
  3. Why wasn't this caught in review? — We don't have a checklist for migration safety.
  4. Why don't we have a migration checklist? — We've never formalized our migration review process.
  5. Why haven't we formalized it? — We haven't had a major migration incident before, so it wasn't prioritized.

Now you have something actionable: create a migration safety checklist and integrate it into the code review process.
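
One way to wire a checklist like this into code review is a small script that flags migrations for a closer human look. The following is only a sketch: the patterns and messages are hypothetical, and which statements actually lock a table depends on your database and version.

```python
import re

# Hypothetical patterns a team might want a reviewer to double-check.
RISKY_PATTERNS = {
    r"\bALTER\s+TABLE\b": "ALTER TABLE may lock the table; confirm the plan for large tables",
    r"\bDROP\s+(TABLE|COLUMN)\b": "destructive change; confirm nothing still reads this",
}


def review_flags(migration_sql: str) -> list:
    """Return the checklist reminders triggered by a migration's SQL text."""
    flags = []
    for pattern, reason in RISKY_PATTERNS.items():
        if re.search(pattern, migration_sql, flags=re.IGNORECASE):
            flags.append(reason)
    return flags


for reminder in review_flags("ALTER TABLE users DROP COLUMN legacy_flag;"):
    print("CHECK:", reminder)
```

A script like this does not decide whether a migration is safe. It only makes sure the checklist questions get asked before the merge.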

4. Contributing Factors

What else made this worse? Maybe the on-call person was in a meeting and didn't see the alert for 10 minutes. Maybe the runbook was outdated. Maybe rollback took longer than expected because the process wasn't automated. List them all.

5. What Went Well

This matters. Call out what worked. Fast detection? Good communication? Quick decision-making? Recognizing what went right reinforces good behavior.

6. Action Items

This is the most important section. Every action item needs:

  • A clear description of what needs to be done
  • An owner — a specific person, not "the team"
  • A deadline — "soon" is not a deadline
  • A priority — is this "do it this week" or "add to next quarter's roadmap"?
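
As a sketch of what "every action item needs these four things" could look like if you wanted to enforce it, here is a small record type with a sanity check. The class, the priority labels, and the list of vague owner names are all assumptions for illustration, not part of any tracker's API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ActionItem:
    description: str
    owner: str
    deadline: date  # a real date, so "soon" cannot be entered
    priority: str   # hypothetical labels: "this-week", "this-sprint", "next-quarter"

    def problems(self) -> List[str]:
        """Return reasons this item is not actionable yet (empty list means OK)."""
        issues = []
        if not self.description.strip():
            issues.append("missing description")
        if self.owner.strip().lower() in {"", "the team", "team", "everyone"}:
            issues.append("owner must be a specific person")
        if self.priority not in {"this-week", "this-sprint", "next-quarter"}:
            issues.append("unknown priority")
        return issues


# The year below is an assumption made only so the example runs.
good = ActionItem("Add latency alerts for the payments service", "Sarah",
                  date(2025, 4, 3), "this-week")
vague = ActionItem("Improve monitoring", "the team", date(2025, 4, 3), "this-week")

print(good.problems())   # []
print(vague.problems())  # ['owner must be a specific person']
```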

Action Item Follow-Through

Here's where most teams fail. They run a great postmortem, write up thoughtful action items, and then... nothing happens. Everyone goes back to feature work and the action items rot in a document no one reads.

A postmortem without follow-through is theater. It's worse than no postmortem at all, because it teaches your team that postmortems are performative — something you do to check a box, not something that actually changes anything.

How to avoid this:

  • Track action items in your actual project tracker. Not in the postmortem document. In Jira, Linear, Asana, whatever your team uses for real work. If it's not in the backlog, it doesn't exist.
  • Assign real owners. "The team will improve monitoring" means no one will improve monitoring. "Sarah will add latency alerts for the payments service by April 3" means it might actually happen.
  • Set deadlines. Reasonable ones. But deadlines.
  • Review in 2 weeks. Set a calendar reminder. In your next team meeting or 1:1s, ask about the status. "Hey, how's that migration checklist coming along?" This simple act of following up signals that you actually care about the outcome.
  • Close the loop publicly. When action items are completed, mention it. "Remember that outage last month? We shipped the automated rollback feature that came out of that postmortem. Here's how it works." This reinforces that postmortems lead to real change.

Building a Learning Culture

The best engineering organizations treat incidents as learning opportunities, not embarrassments. Here's how to build that culture on your team:

Share postmortems widely. Don't hide them. Post them in a shared channel or wiki. Other teams will learn from your incidents and maybe prevent their own. Transparency builds trust across the organization.

Celebrate good incident response. When your team handles an incident well — fast detection, clear communication, smooth recovery — call it out. "Hey everyone, great job on yesterday's incident. We detected it in 3 minutes and had it resolved in 20. That's the kind of response that keeps our customers trusting us." This matters more than you think.

Talk about near misses. A near miss is an incident that almost happened but didn't. Maybe someone caught a dangerous config change in code review. Maybe a test caught a bug that would have taken down production. These are goldmines. You get all the learning without any of the pain. Encourage your team to share them. "Hey, I almost deployed a migration that would have locked the users table for 10 minutes. Here's what I caught and how." Treat near misses with the same seriousness as actual incidents.

Make it safe to report problems. If someone breaks something and tells you immediately, thank them. Genuinely. "Thanks for flagging this fast — that probably saved us an hour of customer impact." The first time someone reports a mistake and gets thanked instead of yelled at, you've changed the culture. The first time someone reports a mistake and gets blamed, you've killed it.

Rotate on-call fairly. Nothing breeds resentment like one person always being the hero. Spread the knowledge and the burden.


Business Value

If someone asks you why you're spending time on incident management process instead of shipping features, here's your answer:

Customer trust. Customers don't expect zero downtime. They expect honest communication and fast recovery. A well-handled incident actually builds trust. "That company had an issue, but they were transparent about it and fixed it in 20 minutes" is a reputation builder. A poorly handled incident (denial, silence, slow recovery) destroys trust that took years to build.

Revenue protected by fast recovery. Do the math. If your service generates $10,000 per hour in revenue, reducing your average recovery time from 2 hours to 30 minutes saves $15,000 per incident. If you have one significant incident per month, that's $180,000 a year. The investment in better incident response tooling and process pays for itself quickly.
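
The arithmetic behind that estimate, as a small sketch using the figures in this section ($10,000 per hour, recovery time cut from 2 hours to 30 minutes, one incident per month). The function and parameter names are illustrative, and real revenue loss is rarely this linear, so treat the result as a rough order of magnitude.

```python
def revenue_protected(revenue_per_hour: float, old_mttr_hours: float,
                      new_mttr_hours: float, incidents_per_year: int) -> tuple:
    """Return (savings per incident, savings per year) from faster recovery."""
    per_incident = revenue_per_hour * (old_mttr_hours - new_mttr_hours)
    return per_incident, per_incident * incidents_per_year


per_incident, per_year = revenue_protected(
    revenue_per_hour=10_000,
    old_mttr_hours=2.0,
    new_mttr_hours=0.5,
    incidents_per_year=12,
)
print(per_incident, per_year)  # 15000.0 180000.0
```

Plug in your own revenue and incident numbers to get a figure you can put in front of stakeholders.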

Cost of downtime. Beyond direct revenue loss, there are support costs (ticket volume spikes), engineering costs (unplanned work disrupts planned work), reputation costs (social media complaints, negative reviews), and contractual costs (SLA violations, credits). A single major outage can cost orders of magnitude more than the investment in preventing or shortening it.

Reduced repeat incidents. This is the big one. When you actually follow through on postmortem action items, you stop having the same incident twice. Every repeat incident is a failure of your learning process, not your engineering. Teams that invest in postmortem follow-through see their incident rate drop over time because they're systematically eliminating classes of failure.


Real-World Scenarios

Scenario 1: The Well-Handled Incident That Built Trust

A payment processing integration goes down at 2 PM on a Tuesday. The team detects it within 4 minutes through automated monitoring. The team leader immediately sets up a coordination channel and assigns one engineer to investigate and another to draft customer communications. Within 15 minutes, they've identified the issue (a third-party API changed its response format without notice) and have a workaround in progress. The team leader posts updates to the status page every 20 minutes. By 3 PM, a fix is deployed.

The next day, the team leader sends a brief note to affected customers: "Here's what happened, here's what we did, here's what we're doing to prevent it (adding response format validation and a fallback payment processor)." Three customers reply thanking them for the transparency. One customer who was evaluating a competitor decides to stay, specifically citing the incident response as a reason they trust the team.

What went right: Fast detection, clear roles, proactive communication, honest follow-up, actionable prevention steps.

Scenario 2: The Poorly Handled Incident That Lost a Customer

A data export feature starts silently producing corrupted files after a Friday afternoon deploy. No one notices until Monday morning when a major customer emails support saying their weekly reports are wrong. Support escalates, but the team leader is in back-to-back meetings and doesn't see the message until after lunch. When they do, they minimize it: "It's probably a one-off thing, let's see if it happens again."

It happens again. The customer escalates to their account manager. Engineering finally investigates on Tuesday and finds the bug. There's no rollback mechanism, so a hotfix takes until Wednesday. No postmortem is held. The same class of bug ships again six weeks later.

The customer churns at renewal, citing "reliability concerns and slow response to issues." In the exit interview, they don't mention the bug itself — they mention that no one seemed to care when they reported it and that it happened twice.

What went wrong: No monitoring for data integrity, slow response to customer report, minimizing the issue, no postmortem, no follow-through, repeat incident.

Scenario 3: The Postmortem That Prevented a Major Future Outage

A minor incident occurs: a background job queue backs up for 45 minutes, causing delayed notifications. Impact is low — users get their notifications late but nothing breaks. Some teams would shrug this off. This team runs a postmortem anyway.

During the 5 Whys analysis, they discover that the job queue has no backpressure mechanism. If it backs up, it just keeps accepting jobs until it runs out of memory. The 45-minute delay happened because the queue was at 80% memory capacity. If it had hit 100%, the entire job processing system would have crashed, taking down email notifications, webhook deliveries, and billing processes.

The action item: implement backpressure and circuit breakers on the job queue. An engineer implements it over the next sprint. Two months later, a traffic spike five times larger than normal hits the system. The circuit breaker kicks in, the queue gracefully degrades (delays some low-priority jobs), and customers never notice. Without the postmortem and follow-through, that traffic spike would have been a major outage affecting billing for thousands of customers.
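
For readers who have not seen backpressure before, here is a toy in-memory sketch of the idea. It is not the scenario team's implementation, and it shows only backpressure and load shedding, not a full circuit breaker. The class name, the 80% shedding threshold, and the return values are assumptions for illustration.

```python
from collections import deque


class BoundedJobQueue:
    """Toy queue that defers or rejects work instead of growing without limit."""

    def __init__(self, capacity: int, shed_threshold: float = 0.8):
        self.capacity = capacity
        self.shed_threshold = shed_threshold
        self.jobs = deque()
        self.deferred = []  # low-priority jobs postponed while under load

    def submit(self, job: str, high_priority: bool = False) -> str:
        if len(self.jobs) >= self.capacity:
            return "rejected"  # backpressure: the caller must retry later
        if not high_priority and len(self.jobs) >= self.capacity * self.shed_threshold:
            self.deferred.append(job)  # graceful degradation for low-priority work
            return "deferred"
        self.jobs.append(job)
        return "accepted"


queue = BoundedJobQueue(capacity=5)
results = [queue.submit(f"notify-{i}") for i in range(6)]
billing = queue.submit("billing-run", high_priority=True)
overflow = queue.submit("billing-run-2", high_priority=True)

print(results)            # four 'accepted', then two 'deferred'
print(billing, overflow)  # accepted rejected
```

The key property is that the queue has a hard upper bound: under a spike it slows down or sheds low-priority work rather than running out of memory and taking everything down with it.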

What went right: Running a postmortem even for a minor incident, thorough root cause analysis that looked beyond the immediate symptom, follow-through on action items, systemic fix that prevented a much larger future incident.


Common Mistakes

Blaming individuals. "This happened because Alex didn't test properly." No. This happened because your testing process didn't catch the issue. Maybe you need better test coverage requirements, staging environments, or canary deploys. Fix the system. Talk to Alex privately if needed, but the postmortem isn't the place.

No postmortem at all. The incident happens, you fix it, everyone moves on. Three months later, the same thing happens again. And again. You're stuck in a loop because you never stopped to learn. If it affected customers or took more than 30 minutes to resolve, run a postmortem. Period.

Postmortem but no follow-up. You wrote a beautiful postmortem document. It has action items. It's in a wiki somewhere. And nobody ever looked at it again. This is arguably worse than no postmortem because it teaches your team that the process is meaningless. Follow through or don't bother.

Hiding incidents. Some teams try to quietly fix things and hope nobody noticed. This destroys trust internally (your team knows what happened even if leadership doesn't) and externally (customers always find out eventually, and the cover-up is worse than the crime). Be transparent. It's always the right call.

Over-reacting to every minor issue. Not every bug is a crisis. If a non-critical internal tool has a UI glitch, you don't need a war room. Learn to calibrate your response to the severity. Over-reacting to minor issues creates alert fatigue, and a team that is used to false alarms will be slower to take a real incident seriously. Save the adrenaline for when it matters.

No severity classification. Related to the above — if you don't have clear definitions of what constitutes a Sev 1 vs. a Sev 3, every incident becomes a debate about how seriously to take it. Define your severity levels upfront. Something like: Sev 1 is customer-facing, service down. Sev 2 is customer-facing, degraded. Sev 3 is internal impact only. Adjust to your context, but have the conversation before you need it.
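
Writing the definitions down as code (or as a table in your runbook) removes the debate in the moment. This sketch encodes the three example levels above; the enum and the two yes/no inputs are a simplification you would adapt to your own context.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "customer-facing, service down"
    SEV2 = "customer-facing, degraded"
    SEV3 = "internal impact only"


def classify(customer_facing: bool, service_down: bool) -> Severity:
    """Map two triage questions onto the example severity levels."""
    if not customer_facing:
        return Severity.SEV3
    return Severity.SEV1 if service_down else Severity.SEV2


print(classify(customer_facing=True, service_down=True).name)    # SEV1
print(classify(customer_facing=True, service_down=False).name)   # SEV2
print(classify(customer_facing=False, service_down=True).name)   # SEV3
```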

Not practicing. Your incident response process is only as good as your team's ability to execute it under pressure. If the first time someone runs an incident is during a real outage, it's going to be messy. Consider game days, tabletop exercises, or even just walking through "what would we do if X happened?" in a team meeting. Practice doesn't make perfect, but it makes competent.


Pulling It All Together

Incident management isn't glamorous work. Nobody gets promoted for writing a great postmortem (though maybe they should). But it's one of the highest-leverage things you can do as a team leader. A team that handles incidents well is a team that ships with confidence, recovers quickly, earns customer trust, and gets better over time.

Your checklist as a new team leader:

  1. Make sure your team has basic monitoring and alerting in place
  2. Define severity levels and response expectations
  3. Establish who does what during an incident (even informally)
  4. Run blameless postmortems for every significant incident
  5. Follow through on action items — this is where the magic happens
  6. Celebrate good responses and near-miss catches
  7. Build the culture where reporting problems early is rewarded, not punished

The goal isn't zero incidents. The goal is a team that handles incidents so well that your customers barely notice, your stakeholders stay calm, and your engineers aren't afraid to ship.

That's the team you're building.


Common Pitfalls

  • Blaming individuals in postmortems. Saying "this happened because Alex didn't test properly" shuts down honesty and prevents the team from finding systemic fixes. The person who pushed the bad deploy is not the root cause; the system that allowed a bad deploy to reach production is.
  • Skipping postmortems entirely. Fixing the incident and moving on without reflection guarantees you will have the same class of failure again. If it affected customers or took more than 30 minutes to resolve, run a postmortem.
  • Writing postmortem action items but never following through. A postmortem without follow-through is worse than no postmortem because it teaches the team the process is meaningless. Track actions in your real project tracker with owners, deadlines, and regular review.
  • Hiding incidents instead of being transparent. Quietly fixing problems and hoping nobody noticed destroys trust internally and externally. Customers always find out eventually, and the cover-up is worse than the original issue.
  • Over-reacting to every minor issue. Not every bug is a crisis. If you treat a UI glitch in an internal tool with war-room urgency, you create alert fatigue and your team will not take real incidents seriously.
  • Not having severity levels defined before an incident occurs. Without clear definitions of Sev 1 through Sev 3, every incident becomes a debate about how seriously to take it instead of a focused response.

Key Takeaways

  • Things will break. Your job is not to prevent every incident but to ensure your team responds well, recovers fast, and learns from what happened.
  • During an incident, your role as team leader is to coordinate, communicate, and unblock rather than to fix the code yourself. Be the calm presence, shield your team from noise, and make decisions quickly.
  • Optimize for MTTR (Mean Time to Recovery) over MTBF (Mean Time Between Failures). Fast recovery has more impact than chasing zero incidents.
  • Communication during incidents should be honest, frequent (at least every 30 minutes), factual, and free of blame. Follow up when resolved.
  • Blameless postmortems ask "what in our system allowed this?" not "who screwed up?" This is what enables teams to fix root causes rather than punish individuals.
  • Every postmortem needs a summary, a timeline, a 5 Whys root cause analysis, contributing factors, what went well, and concrete action items with owners and deadlines.
  • Build a learning culture by sharing postmortems widely, celebrating good incident response, discussing near misses seriously, and making it safe to report problems by thanking people who flag issues quickly.