Incident Postmortem Template
Blameless incident analysis with timeline, root cause, impact quantification, and tracked action items.
What's inside
| Field | Details |
|---|---|
| Incident ID | INC-XXXX |
| Severity | SEV-1 |
| Status | Draft |
| Date of Incident | |
| Duration | X hours Y minutes (from trigger to full resolution) |
| Time to Detect | X minutes (from trigger to first alert) |
| Time to Resolve | X hours Y minutes (from detection to resolution) |
| Incident Commander | Name |
| Authors | Names of people who wrote this postmortem |
| Review Date | |
Executive Summary
Write 3-5 sentences that tell the full story: what happened, who was affected, how long it lasted, what the root cause was, and whether it is fully resolved. A VP should be able to read this section alone and brief their team.
Impact
Quantify the damage. Vague impact statements lead to vague prioritization of fixes. Be as specific as the data allows.
| Dimension | Impact |
|---|---|
| Users affected | Number or percentage of users who experienced degraded or lost service |
| Requests affected | Error rate, failed requests, or dropped transactions during the incident |
| Revenue impact | Estimated revenue loss, failed payments, or SLA credit exposure |
| Data impact | Any data loss, corruption, or inconsistency? If none, state "No data loss" |
| SLA impact | Did this breach any SLA/SLO? Which ones? What is the credit/penalty exposure? |
| Downstream impact | Were other teams, services, or partners affected? |
| Customer communication | Were customers notified? How? (Status page, email, proactive support outreach) |
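If you have aggregate counters from monitoring, show the arithmetic rather than just the result. A minimal sketch, assuming you can pull request and active-user totals for the incident window (every number below is a placeholder, not real data):

```python
# Back-of-the-envelope impact math from aggregate monitoring counters.
# All values are placeholders; substitute real numbers from your dashboards.
total_requests = 1_842_000   # requests served during the incident window
failed_requests = 96_500     # 5xx responses (or dropped transactions) in that window
active_users = 410_000       # users active during the window
affected_users = 28_700      # users who saw at least one failure

error_rate = failed_requests / total_requests
user_impact = affected_users / active_users

print(f"Error rate during incident: {error_rate:.2%}")           # ~5.24%
print(f"Users affected: {affected_users:,} ({user_impact:.2%})")  # 7.00%
```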
Timeline
Reconstruct the incident chronologically. Include what happened, who did it, and what information was available at the time. The timeline should be detailed enough that someone who was not on-call can understand the sequence of decisions.
| Time (UTC) | Event | Actor |
|---|---|---|
| HH:MM | Triggering event — the change, failure, or condition that started the incident | System / person |
| HH:MM | First alert fires (or customer reports the issue) | Monitoring / customer |
| HH:MM | On-call acknowledges and begins investigation | Name |
| HH:MM | Root cause identified (or first hypothesis formed) | Name |
| HH:MM | Mitigation applied (the action that stopped the bleeding) | Name |
| HH:MM | Service restored and verified | Name |
| HH:MM | All-clear communicated to stakeholders | Name |
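The Duration, Time to Detect, and Time to Resolve fields in the header table should fall out of this timeline mechanically rather than being estimated. A quick sketch of the arithmetic, with made-up timestamps:

```python
from datetime import datetime, timezone

# Illustrative timestamps; in a real postmortem these come from the timeline above.
trigger  = datetime(2024, 3, 14, 9, 2, tzinfo=timezone.utc)    # triggering deploy lands
detected = datetime(2024, 3, 14, 9, 19, tzinfo=timezone.utc)   # first alert fires
resolved = datetime(2024, 3, 14, 11, 47, tzinfo=timezone.utc)  # service verified healthy

print(f"Time to Detect:  {detected - trigger}")   # 0:17:00
print(f"Time to Resolve: {resolved - detected}")  # 2:28:00
print(f"Duration:        {resolved - trigger}")   # 2:45:00
```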
Root Cause Analysis
Go beyond the surface. The trigger is what started the incident; the root cause is why the system was vulnerable to that trigger in the first place. Most serious incidents have multiple contributing factors.
Trigger
What specific event initiated the incident? (e.g., a deploy, a config change, a traffic spike, a dependency failure, a data migration)
Root Cause
Why did the trigger cause an outage instead of being handled gracefully? Dig into the contributing factors:
- Technical factor: What system weakness allowed the trigger to cause user-facing impact?
- Process factor: What gap in the deployment, review, or testing process allowed this to reach production?
- Organizational factor: Was there missing knowledge, unclear ownership, or insufficient investment in this area?
Detection
Evaluate how the incident was found and how fast the team responded. Detection is often the biggest opportunity for improvement.
| Question | Answer |
|---|---|
| How was the incident detected? | Alert / customer report / internal discovery / partner notification |
| How long between trigger and detection? | X minutes — is this acceptable? |
| Did the right alert fire? | Yes / No — if no, what alert should exist? |
| Did the alert reach the right person? | Yes / No — was escalation needed? |
| Were there earlier signals that were missed? | Log warnings, error rate trends, or anomalies that could have been caught sooner |
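Answering the "earlier signals" row from memory is unreliable; scanning the exported metric is better. A toy sketch, assuming you can export the error-rate series as (time, rate) pairs; the threshold and data points here are invented:

```python
# Look for points where the error rate crossed the paging threshold
# before the first alert actually fired. All data here is invented.
ALERT_THRESHOLD = 0.02   # the 2% error rate the pager is (or should be) tuned to
FIRST_ALERT = "09:19"    # when the real alert fired (UTC, zero-padded HH:MM)

series = [
    ("09:02", 0.004), ("09:05", 0.011), ("09:08", 0.024),
    ("09:11", 0.031), ("09:14", 0.049), ("09:19", 0.052),
]

# Zero-padded HH:MM strings compare correctly as plain strings.
missed = [(t, r) for t, r in series if r >= ALERT_THRESHOLD and t < FIRST_ALERT]
for t, rate in missed:
    print(f"{t} UTC: error rate {rate:.1%} crossed the threshold before the alert fired")
```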
What Went Well
Acknowledge what worked. This reinforces good practices and keeps the postmortem from being purely negative.
- Thing that worked well during the incident and should be preserved
- Process or tool that helped reduce time to resolution
- Communication or coordination that was effective
What Went Poorly
Be candid about what did not work. These are the inputs to your action items.
- Thing that slowed down detection or resolution
- Missing runbook, tool, or automation that would have helped
- Communication gap or confusion during the incident
Where We Got Lucky
Things that could have made this incident much worse but didn't — by chance, not by design. These reveal hidden risks that should be addressed before luck runs out.
- Factor that limited the blast radius this time but won't next time
- Coincidence that helped (e.g., low traffic period, right person happened to be online)
Action Items
Every action item must have an owner, a priority, and a deadline. An action item without a deadline is a wish. Review these in your next team meeting and track them to completion.
| Action | Type | Priority | Owner | Deadline | Status |
|---|---|---|---|---|---|
| Action that prevents this specific failure from recurring | Prevent | P0 | Name | YYYY-MM-DD | Not Started |
| Action that reduces blast radius or time to recovery next time | Mitigate | P1 | Name | YYYY-MM-DD | Not Started |
| Action that improves detection speed (alert, dashboard, health check) | Detect | P1 | Name | YYYY-MM-DD | Not Started |
| Process improvement (runbook, review checklist, training) | Process | P2 | Name | YYYY-MM-DD | Not Started |
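The "no owner, no deadline" rule is easy to enforce mechanically if action items live in any structured form. A toy validator; the record layout is invented for illustration, not a real tracker schema:

```python
# Flag action items that are missing an owner or a deadline.
# The dict layout is invented for illustration; adapt to your tracker's export.
action_items = [
    {"action": "Add canary stage to the deploy pipeline", "type": "Prevent",
     "priority": "P0", "owner": "alice", "deadline": "2024-04-01"},
    {"action": "Write a rollback runbook", "type": "Process",
     "priority": "P2", "owner": None, "deadline": None},
]

for item in action_items:
    missing = [field for field in ("owner", "deadline") if not item.get(field)]
    if missing:
        print(f"Wish, not an action item (missing {', '.join(missing)}): {item['action']}")
```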
Lessons Learned
Step back from the specifics. What has this incident taught the team about how the system works, how the team operates, or what assumptions were wrong? These are the insights that should inform future architecture and process decisions.
- Lesson that changes how we think about this part of the system
- Lesson about our incident response process or team coordination
- Assumption that this incident proved wrong
Supporting Information
Link to everything someone might need to dig deeper. This section turns the postmortem into the definitive reference for this incident.
- Monitoring dashboard link (with time range covering the incident)
- Incident Slack channel or chat log archive
- Relevant deploy or change log entries
- Customer communication sent (status page update, email, support messages)
- Related incidents or postmortems