Incident Response Playbook Template
Step-by-step incident response: severity classification, roles, detection through resolution, and communication templates.
What's inside
| Field | Details |
|---|---|
| Scope | All production services / specific product area |
| Owner | On-call team or SRE lead |
| Last Updated | |
Severity Classification
| Severity | Definition | Response Time | Examples |
|---|---|---|---|
| SEV-1 | Complete outage or data loss affecting all users | Immediate (all hands) | Database down, auth broken, data corruption |
| SEV-2 | Major feature broken, significant user impact | 15 minutes | Payments failing, search down, API errors > 10% |
| SEV-3 | Degraded experience, workaround exists | 1 hour | Slow performance, minor feature broken, UI glitch |
| SEV-4 | Cosmetic or low-impact issue | Next business day | Typo, non-critical alert firing |
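If severity is assigned by tooling rather than by hand, the response times above can be encoded once and reused. A minimal Python sketch, with a hypothetical `SEVERITIES` mapping that mirrors the table; the "next business day" row is approximated as 24 hours:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass(frozen=True)
class Severity:
    name: str
    response_time: Optional[timedelta]  # None means "immediate, all hands"

# Mirrors the classification table above; SEV-4's "next business day"
# is approximated here as 24 hours.
SEVERITIES = {
    "SEV-1": Severity("SEV-1", None),
    "SEV-2": Severity("SEV-2", timedelta(minutes=15)),
    "SEV-3": Severity("SEV-3", timedelta(hours=1)),
    "SEV-4": Severity("SEV-4", timedelta(days=1)),
}

def ack_deadline(sev: str, detected_at: datetime) -> datetime:
    """Return the latest acceptable acknowledgement time for an incident."""
    rt = SEVERITIES[sev].response_time
    return detected_at + rt if rt else detected_at
```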
Roles
| Role | Responsibility |
|---|---|
| Incident Commander (IC) | Owns the incident. Makes decisions, delegates, communicates status. Usually the on-call engineer for SEV-3/4, escalated for SEV-1/2. |
| Operations Lead | Hands on keyboard. Investigates, applies fixes, runs commands. |
| Communications Lead | Updates status page, Slack, stakeholders. Shields the Ops Lead from interruptions. |
| Scribe | Records timeline, actions taken, decisions made. Critical for the postmortem. |
Phase 1: Detect & Triage
- Alert fires or a user reports the issue; acknowledge within the response-time SLA
- Assess severity using the classification table above
- Open an incident channel (Slack: #inc-YYYY-MM-DD-short-description); see the sketch after this list
- Assign roles: IC, Ops Lead, Comms Lead
- Post an initial assessment: what's broken, who's affected, what we know so far
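A rough sketch of scripting the channel-creation step, assuming the slack_sdk package and a bot token with permission to create channels; the token, naming helper, and initial message below are illustrative only:

```python
from datetime import date
from slack_sdk import WebClient

client = WebClient(token="xoxb-...")  # hypothetical bot token

def open_incident_channel(short_description: str) -> str:
    # Slack channel names must be lowercase, with no spaces.
    name = f"inc-{date.today():%Y-%m-%d}-{short_description.lower().replace(' ', '-')}"
    resp = client.conversations_create(name=name)
    channel_id = resp["channel"]["id"]
    # Seed the channel with the initial assessment prompt from the checklist.
    client.chat_postMessage(
        channel=channel_id,
        text="Initial assessment: what's broken, who's affected, what we know so far.",
    )
    return channel_id
```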
Phase 2: Investigate & Contain
- Check dashboards: error rates, latency, resource utilization (an error-rate query sketch follows this list)
- Check recent changes: deploys, config changes, feature flags, infra changes
- Check dependencies: are upstream or downstream services healthy?
- Contain the blast radius: disable the feature flag, scale up, redirect traffic, block the bad actor
- Update the status page if there is user-facing impact
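One way to script the dashboard check is to query the metrics backend directly. A sketch against the Prometheus HTTP API, assuming a hypothetical server address and an `http_requests_total` metric with a `status` label:

```python
import requests

PROM_URL = "http://prometheus.internal:9090"  # hypothetical address
QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total[5m]))'
)

def current_error_rate() -> float:
    """Return the 5xx error rate over the last 5 minutes as a fraction."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    print(f"5xx error rate (last 5 min): {current_error_rate():.2%}")
```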
Phase 3: Fix & Verify
- Apply the fix: deploy patch, rollback, config change, data fix
- Verify the fix: confirm metrics recover, test affected flows, check logs
- Monitor for 15-30 minutes to ensure stability (a polling sketch follows this list)
- Update the status page: issue resolved
- Post the all-clear in the incident channel
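The 15-30 minute monitoring step can be a simple polling loop. A sketch that accepts any error-rate reader (for example, the Prometheus helper from the Phase 2 sketch); the threshold and window values are placeholders:

```python
import time
from typing import Callable

def monitor_recovery(read_error_rate: Callable[[], float],
                     threshold: float = 0.01,
                     minutes: int = 20,
                     interval_s: int = 60) -> bool:
    """Poll an error-rate reading for `minutes`; return False if it ever breaches `threshold`."""
    for _ in range(minutes * 60 // interval_s):
        if read_error_rate() > threshold:
            return False   # regression: keep the incident open
        time.sleep(interval_s)
    return True            # stable for the whole window: safe to post the all-clear
```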
Phase 4: Follow Up
- Schedule a postmortem within 48 hours (use the Incident Postmortem template)
- Create tickets for follow-up action items (a ticket-creation sketch follows this list)
- Send an internal summary to stakeholders
- Update runbooks if the incident revealed a gap
- Thank the team; incident response is stressful work
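If follow-up items live in Jira, ticket creation can be scripted against its REST API. A sketch assuming a hypothetical Jira Cloud site, service account, and OPS project key:

```python
import requests

JIRA_URL = "https://example.atlassian.net"
AUTH = ("oncall-bot@example.com", "api-token")  # hypothetical service account credentials

def create_followup_ticket(summary: str, incident_channel: str) -> str:
    """Create a follow-up task and return its issue key."""
    payload = {
        "fields": {
            "project": {"key": "OPS"},
            "issuetype": {"name": "Task"},
            "summary": summary,
            "description": f"Follow-up action item from incident {incident_channel}.",
        }
    }
    resp = requests.post(f"{JIRA_URL}/rest/api/2/issue", json=payload, auth=AUTH, timeout=10)
    resp.raise_for_status()
    return resp.json()["key"]  # e.g. "OPS-123"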
Communication Templates
Status Page Update
Investigating: We are aware of [impact description] and are actively investigating. Updates will follow every [15/30] minutes.
Identified: The issue has been identified as [brief cause]. We are working on a fix. [X%] of users are affected.
Resolved: The issue has been resolved. [Brief explanation]. We apologize for the disruption.
Internal Escalation
[SEV-X] [Service] — [Impact description]. IC: [name]. Channel: #inc-YYYY-MM-DD-description. Current status: investigating/contained/fixing.
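These templates are easier to keep consistent if the Comms Lead (or a bot) fills them in programmatically. A minimal sketch; the field names are placeholders, not a required schema:

```python
# The wording mirrors the communication templates above.
STATUS_PAGE = {
    "investigating": ("We are aware of {impact} and are actively investigating. "
                      "Updates will follow every {cadence} minutes."),
    "identified": ("The issue has been identified as {cause}. We are working on a fix. "
                   "{affected_pct}% of users are affected."),
    "resolved": "The issue has been resolved. {explanation}. We apologize for the disruption.",
}

ESCALATION = ("[{severity}] [{service}] - {impact}. IC: {ic}. "
              "Channel: {channel}. Current status: {status}.")

print(STATUS_PAGE["investigating"].format(impact="elevated API errors", cadence=15))
print(ESCALATION.format(severity="SEV-2", service="Payments", impact="checkout failures",
                        ic="Jane Doe", channel="#inc-2024-05-01-payments",
                        status="investigating"))
```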
Escalation Contacts
| Escalation Level | Who | Contact | When |
|---|---|---|---|
| On-call engineer | Name | PagerDuty / phone | First responder for all alerts |
| Engineering lead | Name | Phone / Slack | SEV-1/2, or on-call needs help |
| VP Engineering / CTO | Name | Phone | SEV-1 with business impact, data breach, extended outage |
| External: cloud provider | Support | Support ticket / phone | Infrastructure issue outside our control |
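The ladder can also be encoded so a paging script knows who to notify for a given severity. A sketch with placeholder levels and contact methods:

```python
# Each entry lists the least severe incident (highest SEV number) that should
# still reach that level; names and contact methods are placeholders.
ESCALATION_LADDER = [
    {"level": "on-call engineer",    "contact": "PagerDuty / phone", "min_severity": 4},
    {"level": "engineering lead",    "contact": "phone / Slack",     "min_severity": 2},
    {"level": "VP Engineering / CTO", "contact": "phone",            "min_severity": 1},
]

def contacts_for(severity: int) -> list[str]:
    """Return everyone who should be notified for a SEV-<severity> incident."""
    return [e["level"] for e in ESCALATION_LADDER if severity <= e["min_severity"]]

print(contacts_for(1))  # ['on-call engineer', 'engineering lead', 'VP Engineering / CTO']
print(contacts_for(3))  # ['on-call engineer']
```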
Other Ops templates
- Capacity Planning: capacity assessment covering current utilization, growth projections, bottlenecks, and scaling recommendations with cost impact.
- Change Management Record: change request with scope, risk assessment, step-by-step implementation, rollback plan, and approvals.
- Disaster Recovery Plan: DR plan covering recovery tiers, system inventory, activation criteria, recovery procedures, and testing schedule.