Incident Response Playbook Template
Step-by-step incident response: severity classification, roles, detection through resolution, and communication templates.
What's inside
| Field | Details |
|---|---|
| Scope | All production services / specific product area |
| Owner | On-call team or SRE lead |
| Last Updated | |
Severity Classification
| Severity | Definition | Response Time | Examples |
|---|---|---|---|
| SEV-1 | Complete outage or data loss affecting all users | Immediate (all hands) | Database down, auth broken, data corruption |
| SEV-2 | Major feature broken, significant user impact | 15 minutes | Payments failing, search down, API errors > 10% |
| SEV-3 | Degraded experience, workaround exists | 1 hour | Slow performance, minor feature broken, UI glitch |
| SEV-4 | Cosmetic or low-impact issue | Next business day | Typo, non-critical alert firing |
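If severity is assigned by tooling rather than by hand, the response times above can be encoded once and reused. A minimal Python sketch, with a hypothetical `SEVERITIES` mapping that mirrors the table; the "next business day" row is approximated as 24 hours:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass(frozen=True)
class Severity:
    name: str
    response_time: Optional[timedelta]  # None means "immediate, all hands"

# Mirrors the classification table above; SEV-4's "next business day"
# is approximated here as 24 hours.
SEVERITIES = {
    "SEV-1": Severity("SEV-1", None),
    "SEV-2": Severity("SEV-2", timedelta(minutes=15)),
    "SEV-3": Severity("SEV-3", timedelta(hours=1)),
    "SEV-4": Severity("SEV-4", timedelta(days=1)),
}

def ack_deadline(sev: str, detected_at: datetime) -> datetime:
    """Return the latest acceptable acknowledgement time for an incident."""
    rt = SEVERITIES[sev].response_time
    return detected_at + rt if rt else detected_at
```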
Roles
| Role | Responsibility |
|---|---|
| Incident Commander (IC) | Owns the incident. Makes decisions, delegates, communicates status. Usually the on-call engineer for SEV-3/4, escalated for SEV-1/2. |
| Operations Lead | Hands on keyboard. Investigates, applies fixes, runs commands. |
| Communications Lead | Updates status page, Slack, stakeholders. Shields the Ops Lead from interruptions. |
| Scribe | Records timeline, actions taken, decisions made. Critical for the postmortem. |
Phase 1: Detect & Triage
- Alert fires or a user reports the issue; acknowledge within the response-time SLA
- Assess severity using the classification table above
- Open an incident channel (Slack: #inc-YYYY-MM-DD-short-description); see the sketch after this list
- Assign roles: IC, Ops Lead, Comms Lead
- Post an initial assessment: what's broken, who's affected, what we know so far
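A rough sketch of scripting the channel-creation step, assuming the slack_sdk package and a bot token with permission to create channels; the token, naming helper, and initial message below are illustrative only:

```python
from datetime import date
from slack_sdk import WebClient

client = WebClient(token="xoxb-...")  # hypothetical bot token

def open_incident_channel(short_description: str) -> str:
    # Slack channel names must be lowercase, with no spaces.
    name = f"inc-{date.today():%Y-%m-%d}-{short_description.lower().replace(' ', '-')}"
    resp = client.conversations_create(name=name)
    channel_id = resp["channel"]["id"]
    # Seed the channel with the initial assessment prompt from the checklist.
    client.chat_postMessage(
        channel=channel_id,
        text="Initial assessment: what's broken, who's affected, what we know so far.",
    )
    return channel_id
```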
Phase 2: Investigate & Contain
- Check dashboards: error rates, latency, resource utilization (an error-rate query sketch follows this list)
- Check recent changes: deploys, config changes, feature flags, infra changes
- Check dependencies: are upstream or downstream services healthy?
- Contain the blast radius: disable the feature flag, scale up, redirect traffic, block the bad actor
- Update the status page if there is user-facing impact
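One way to script the dashboard check is to query the metrics backend directly. A sketch against the Prometheus HTTP API, assuming a hypothetical server address and an `http_requests_total` metric with a `status` label:

```python
import requests

PROM_URL = "http://prometheus.internal:9090"  # hypothetical address
QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total[5m]))'
)

def current_error_rate() -> float:
    """Return the 5xx error rate over the last 5 minutes as a fraction."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    print(f"5xx error rate (last 5 min): {current_error_rate():.2%}")
```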
Phase 3: Fix & Verify
- Apply the fix: deploy patch, rollback, config change, data fix
- Verify the fix: confirm metrics recover, test affected flows, check logs
- Monitor for 15-30 minutes to ensure stability (a polling sketch follows this list)
- Update the status page: issue resolved
- Post the all-clear in the incident channel
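The 15-30 minute monitoring step can be a simple polling loop. A sketch that accepts any error-rate reader (for example, the Prometheus helper from the Phase 2 sketch); the threshold and window values are placeholders:

```python
import time
from typing import Callable

def monitor_recovery(read_error_rate: Callable[[], float],
                     threshold: float = 0.01,
                     minutes: int = 20,
                     interval_s: int = 60) -> bool:
    """Poll an error-rate reading for `minutes`; return False if it ever breaches `threshold`."""
    for _ in range(minutes * 60 // interval_s):
        if read_error_rate() > threshold:
            return False   # regression: keep the incident open
        time.sleep(interval_s)
    return True            # stable for the whole window: safe to post the all-clear
```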
Phase 4: Follow Up
- Schedule a postmortem within 48 hours (use the Incident Postmortem template)
- Create tickets for follow-up action items (a ticket-creation sketch follows this list)
- Send an internal summary to stakeholders
- Update runbooks if the incident revealed a gap
- Thank the team; incident response is stressful work
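If follow-up items live in Jira, ticket creation can be scripted against its REST API. A sketch assuming a hypothetical Jira Cloud site, service account, and OPS project key:

```python
import requests

JIRA_URL = "https://example.atlassian.net"
AUTH = ("oncall-bot@example.com", "api-token")  # hypothetical service account credentials

def create_followup_ticket(summary: str, incident_channel: str) -> str:
    """Create a follow-up task and return its issue key."""
    payload = {
        "fields": {
            "project": {"key": "OPS"},
            "issuetype": {"name": "Task"},
            "summary": summary,
            "description": f"Follow-up action item from incident {incident_channel}.",
        }
    }
    resp = requests.post(f"{JIRA_URL}/rest/api/2/issue", json=payload, auth=AUTH, timeout=10)
    resp.raise_for_status()
    return resp.json()["key"]  # e.g. "OPS-123"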
Communication Templates
Status Page Update
Investigating: We are aware of [impact description] and are actively investigating. Updates will follow every [15/30] minutes.
Identified: The issue has been identified as [brief cause]. We are working on a fix. [X%] of users are affected.
Resolved: The issue has been resolved. [Brief explanation]. We apologize for the disruption.
Internal Escalation
[SEV-X] [Service] — [Impact description]. IC: [name]. Channel: #inc-YYYY-MM-DD-description. Current status: investigating/contained/fixing.
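These templates are easier to keep consistent if the Comms Lead (or a bot) fills them in programmatically. A minimal sketch; the field names are placeholders, not a required schema:

```python
# The wording mirrors the communication templates above.
STATUS_PAGE = {
    "investigating": ("We are aware of {impact} and are actively investigating. "
                      "Updates will follow every {cadence} minutes."),
    "identified": ("The issue has been identified as {cause}. We are working on a fix. "
                   "{affected_pct}% of users are affected."),
    "resolved": "The issue has been resolved. {explanation}. We apologize for the disruption.",
}

ESCALATION = ("[{severity}] [{service}] - {impact}. IC: {ic}. "
              "Channel: {channel}. Current status: {status}.")

print(STATUS_PAGE["investigating"].format(impact="elevated API errors", cadence=15))
print(ESCALATION.format(severity="SEV-2", service="Payments", impact="checkout failures",
                        ic="Jane Doe", channel="#inc-2024-05-01-payments",
                        status="investigating"))
```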
Escalation Contacts
| Escalation Level | Who | Contact | When |
|---|---|---|---|
| On-call engineer | Name | PagerDuty / phone | First responder for all alerts |
| Engineering lead | Name | Phone / Slack | SEV-1/2, or on-call needs help |
| VP Engineering / CTO | Name | Phone | SEV-1 with business impact, data breach, extended outage |
| External: cloud provider | Support | Support ticket / phone | Infrastructure issue outside our control |
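The ladder can also be encoded so a paging script knows who to notify for a given severity. A sketch with placeholder levels and contact methods:

```python
# Each entry lists the least severe incident (highest SEV number) that should
# still reach that level; names and contact methods are placeholders.
ESCALATION_LADDER = [
    {"level": "on-call engineer",    "contact": "PagerDuty / phone", "min_severity": 4},
    {"level": "engineering lead",    "contact": "phone / Slack",     "min_severity": 2},
    {"level": "VP Engineering / CTO", "contact": "phone",            "min_severity": 1},
]

def contacts_for(severity: int) -> list[str]:
    """Return everyone who should be notified for a SEV-<severity> incident."""
    return [e["level"] for e in ESCALATION_LADDER if severity <= e["min_severity"]]

print(contacts_for(1))  # ['on-call engineer', 'engineering lead', 'VP Engineering / CTO']
print(contacts_for(3))  # ['on-call engineer']
```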
Other Ops templates
- Capacity Planning: capacity assessment covering current utilization, growth projections, bottlenecks, and scaling recommendations with cost impact.
- Change Management Record: change request with scope, risk assessment, step-by-step implementation, rollback plan, and approvals.
- Disaster Recovery Plan: DR plan covering recovery tiers, system inventory, activation criteria, recovery procedures, and testing schedule.