Incident Postmortem Template

Blameless incident analysis with timeline, root cause, impact quantification, and tracked action items.

What's inside

| Field | Details |
| --- | --- |
| Incident ID | INC-XXXX |
| Severity | SEV-1 |
| Status | Draft |
| Date of Incident | YYYY-MM-DD |
| Duration | X hours Y minutes (from trigger to resolution) |
| Time to Detect | X minutes (from trigger to first alert) |
| Time to Resolve | X hours Y minutes (from detection to resolution) |
| Incident Commander | Name |
| Authors | Names of people who wrote this postmortem |
| Review Date | YYYY-MM-DD |
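
These three durations should reconcile: Time to Detect plus Time to Resolve equals the total Duration. A minimal Python sketch of the arithmetic, using made-up timestamps:

```python
from datetime import datetime

# Hypothetical UTC timestamps, lifted from the timeline section below.
triggered = datetime(2024, 3, 14, 9, 2)    # triggering event
detected = datetime(2024, 3, 14, 9, 17)    # first alert fires
resolved = datetime(2024, 3, 14, 11, 45)   # service restored and verified

time_to_detect = detected - triggered    # trigger -> first alert
time_to_resolve = resolved - detected    # detection -> resolution
duration = resolved - triggered          # trigger -> resolution

# The three numbers in the header table must be consistent.
assert duration == time_to_detect + time_to_resolve
print(f"TTD: {time_to_detect}, TTR: {time_to_resolve}, Duration: {duration}")
```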

Executive Summary

Write 3-5 sentences that tell the full story: what happened, who was affected, how long it lasted, what the root cause was, and whether it is fully resolved. A VP should be able to read this section alone and brief their team.

Impact

Quantify the damage. Vague impact statements lead to vague prioritization of fixes. Be as specific as the data allows.

| Dimension | Impact |
| --- | --- |
| Users affected | Number or percentage of users who experienced degraded or lost service |
| Requests affected | Error rate, failed requests, or dropped transactions during the incident |
| Revenue impact | Estimated revenue loss, failed payments, or SLA credit exposure |
| Data impact | Any data loss, corruption, or inconsistency? If none, state "No data loss" |
| SLA impact | Did this breach any SLA/SLO? Which ones? What is the credit/penalty exposure? |
| Downstream impact | Were other teams, services, or partners affected? |
| Customer communication | Were customers notified? How? (Status page, email, proactive support outreach) |
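
To make "as specific as the data allows" concrete, here is a minimal Python sketch that turns raw counts into the percentages this table asks for. All numbers are invented for illustration:

```python
# Invented counts for the incident window, e.g. pulled from request logs.
total_requests = 1_250_000
failed_requests = 87_500
active_users = 42_000
users_with_errors = 6_300

error_rate = failed_requests / total_requests        # fraction of requests that failed
affected_share = users_with_errors / active_users    # fraction of users affected

print(f"Requests affected: {failed_requests:,} ({error_rate:.1%} error rate)")
print(f"Users affected: {users_with_errors:,} ({affected_share:.1%} of active users)")
```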

Timeline

Reconstruct the incident chronologically. Include what happened, who did it, and what information was available at the time. The timeline should be detailed enough that someone who was not on-call can understand the sequence of decisions.

| Time (UTC) | Event | Actor |
| --- | --- | --- |
| HH:MM | Triggering event: the change, failure, or condition that started the incident | System / person |
| HH:MM | First alert fires (or customer reports the issue) | Monitoring / customer |
| HH:MM | On-call acknowledges and begins investigation | Name |
| HH:MM | Root cause identified (or first hypothesis formed) | Name |
| HH:MM | Mitigation applied (the action that stopped the bleeding) | Name |
| HH:MM | Service restored and verified | Name |
| HH:MM | All-clear communicated to stakeholders | Name |

Root Cause Analysis

Go beyond the surface. The trigger is what started the incident; the root cause is why the system was vulnerable to that trigger in the first place. Most serious incidents have multiple contributing factors.

Trigger

What specific event initiated the incident? (e.g., a deploy, a config change, a traffic spike, a dependency failure, a data migration)

Root Cause

Why did the trigger cause an outage instead of being handled gracefully? Dig into the contributing factors:

  • Technical factor: What system weakness allowed the trigger to cause user-facing impact?

  • Process factor: What gap in the deployment, review, or testing process allowed this to reach production?

  • Organizational factor: Was there missing knowledge, unclear ownership, or insufficient investment in this area?

Detection

Evaluate how the incident was found and how fast the team responded. Detection is often the biggest opportunity for improvement.

| Question | Answer |
| --- | --- |
| How was the incident detected? | Alert / customer report / internal discovery / partner notification |
| How long between trigger and detection? | X minutes (is this acceptable?) |
| Did the right alert fire? | Yes / No (if no, what alert should exist?) |
| Did the alert reach the right person? | Yes / No (was escalation needed?) |
| Were there earlier signals that were missed? | Log warnings, error-rate trends, or anomalies that could have been caught sooner |
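
When the answer to "what alert should exist?" is an error-rate threshold, it helps to write down the exact condition the new alert should encode. A minimal Python sketch of one plausible rule; the threshold and window are assumptions, and in practice this would become a rule in your monitoring system (Prometheus, Datadog, or similar):

```python
ERROR_RATE_THRESHOLD = 0.01  # page when more than 1% of requests fail
WINDOW_MINUTES = 5           # counts below are for this evaluation window

def should_page(errors: int, total: int) -> bool:
    """The condition the missing error-rate alert would encode,
    applied to request counts for the last WINDOW_MINUTES."""
    if total == 0:
        return False  # no traffic: cover that case with a separate "no data" alert
    return errors / total > ERROR_RATE_THRESHOLD

# Example: 150 failures out of 10,000 requests in the window -> 1.5% -> page.
print(should_page(errors=150, total=10_000))  # True
```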

What Went Well

Acknowledge what worked. This reinforces good practices and keeps the postmortem from being purely negative.

  • Thing that worked well during the incident and should be preserved

  • Process or tool that helped reduce time to resolution

  • Communication or coordination that was effective

What Went Poorly

Be candid about what did not work. These are the inputs to your action items.

  • Thing that slowed down detection or resolution

  • Missing runbook, tool, or automation that would have helped

  • Communication gap or confusion during the incident

Where We Got Lucky

Things that could have made this incident much worse but didn't — by chance, not by design. These reveal hidden risks that should be addressed before luck runs out.

  • Factor that limited the blast radius this time but won't next time

  • Coincidence that helped (e.g., low traffic period, right person happened to be online)

Action Items

Every action item must have an owner, a priority, and a deadline. An action item without a deadline is a wish. Review these in your next team meeting and track them to completion.

| Action | Type | Priority | Owner | Deadline | Status |
| --- | --- | --- | --- | --- | --- |
| Action that prevents this specific failure from recurring | Prevent | P0 | Name | YYYY-MM-DD | Not Started |
| Action that reduces blast radius or time to recovery next time | Mitigate | P1 | Name | YYYY-MM-DD | Not Started |
| Action that improves detection speed (alert, dashboard, health check) | Detect | P1 | Name | YYYY-MM-DD | Not Started |
| Process improvement (runbook, review checklist, training) | Process | P2 | Name | YYYY-MM-DD | Not Started |
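
If action items live in a tracker or a structured file, the "owner, priority, deadline" rule can be enforced mechanically. A minimal Python sketch with an invented schema; the field names are illustrative, not prescribed by this template:

```python
# Invented schema: one dict per action item.
action_items = [
    {"action": "Add circuit breaker around the payments client",
     "type": "Prevent", "priority": "P0", "owner": "alice",
     "deadline": "2024-04-01", "status": "Not Started"},
    {"action": "Dashboard for queue depth",
     "type": "Detect", "priority": "P1", "owner": None,
     "deadline": None, "status": "Not Started"},
]

REQUIRED_FIELDS = ("owner", "priority", "deadline")

for item in action_items:
    missing = [f for f in REQUIRED_FIELDS if not item.get(f)]
    if missing:
        print(f"Incomplete ({', '.join(missing)} missing): {item['action']}")
```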

Lessons Learned

Step back from the specifics. What has this incident taught the team about how the system works, how the team operates, or what assumptions were wrong? These are the insights that should inform future architecture and process decisions.

  1. Lesson that changes how we think about this part of the system

  2. Lesson about our incident response process or team coordination

  3. Assumption that this incident proved wrong

Supporting Information

Link to everything someone might need to dig deeper. This section turns the postmortem into the definitive reference for this incident.

  • Monitoring dashboard link (with time range covering the incident)

  • Incident Slack channel or chat log archive

  • Relevant deploy or change log entries

  • Customer communication sent (status page update, email, support messages)

  • Related incidents or postmortems
