
OPERATIONS / DEVOPS / SRE TEMPLATE

Incident Response Playbook Template

Step-by-step incident response: severity classification, roles, detection through resolution, and communication templates.

What's inside

| Field | Details |
| --- | --- |
| Scope | All production services / specific product area |
| Owner | On-call team or SRE lead |
| Last Updated | [date] |

Severity Classification

| Severity | Definition | Response Time | Examples |
| --- | --- | --- | --- |
| SEV-1 | Complete outage or data loss affecting all users | Immediate — all hands | Database down, auth broken, data corruption |
| SEV-2 | Major feature broken, significant user impact | 15 minutes | Payments failing, search down, API errors > 10% |
| SEV-3 | Degraded experience, workaround exists | 1 hour | Slow performance, minor feature broken, UI glitch |
| SEV-4 | Cosmetic or low-impact issue | Next business day | Typo, non-critical alert firing |
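
Teams often encode this table as data so paging scripts can check the acknowledgement SLA automatically. A minimal Python sketch of that idea; the 24-hour stand-in for "next business day" is an assumption:

```python
# Severity table as data, for use by paging/alerting scripts.
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class Severity:
    definition: str
    ack_sla: timedelta  # maximum time to acknowledge; zero = immediate

SEVERITIES = {
    "SEV-1": Severity("Complete outage or data loss affecting all users", timedelta(0)),
    "SEV-2": Severity("Major feature broken, significant user impact", timedelta(minutes=15)),
    "SEV-3": Severity("Degraded experience, workaround exists", timedelta(hours=1)),
    # "Next business day" approximated as 24 hours for simplicity.
    "SEV-4": Severity("Cosmetic or low-impact issue", timedelta(hours=24)),
}
```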

Roles

| Role | Responsibility |
| --- | --- |
| Incident Commander (IC) | Owns the incident. Makes decisions, delegates, communicates status. Usually the on-call engineer for SEV-3/4, escalated for SEV-1/2. |
| Operations Lead | Hands on keyboard. Investigates, applies fixes, runs commands. |
| Communications Lead | Updates status page, Slack, stakeholders. Shields the Ops Lead from interruptions. |
| Scribe | Records timeline, actions taken, decisions made. Critical for the postmortem. |

Phase 1: Detect & Triage

  • Alert fires or user reports issue — acknowledge within response time SLA

  • Assess severity using the classification table above

  • Open an incident channel (Slack: #inc-YYYY-MM-DD-short-description); a scripted version is sketched after this list

  • Assign roles: IC, Ops Lead, Comms Lead

  • Post initial assessment: what's broken, who's affected, what we know so far
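
If your team scripts channel creation, a minimal sketch using the slack_sdk WebClient; the bot token, scopes, and message text are assumptions, and Slack requires lowercase channel names with no spaces:

```python
# Sketch: open the incident channel and post the initial assessment.
# Assumes slack_sdk and a bot token with channel-management scopes.
from datetime import date
from slack_sdk import WebClient

def open_incident_channel(token: str, short_desc: str, assessment: str) -> str:
    client = WebClient(token=token)
    # Slack channel names must be lowercase, <= 80 chars, no spaces.
    name = f"inc-{date.today():%Y-%m-%d}-{short_desc.lower().replace(' ', '-')}"
    channel_id = client.conversations_create(name=name)["channel"]["id"]
    client.chat_postMessage(channel=channel_id, text=assessment)
    return channel_id
```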

Phase 2: Investigate & Contain

  • Check dashboards: error rates, latency, resource utilization (a query sketch follows this list)

  • Check recent changes: deploys, config changes, feature flags, infra changes

  • Check dependencies: are upstream or downstream services healthy?

  • Contain the blast radius: disable feature flag, scale up, redirect traffic, block bad actor

  • Update status page if user-facing impact
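
Where dashboards are backed by Prometheus, the SEV-2 error-rate check can be scripted. A sketch against the Prometheus HTTP API; the server address and the standard http_requests_total metric are assumptions:

```python
# Sketch: compute the 5xx error ratio over the last 5 minutes.
import requests

PROM = "http://prometheus:9090"  # assumed address
QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)

def error_rate() -> float:
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if error_rate() > 0.10:  # the SEV-2 "API errors > 10%" threshold from the table
    print("API error rate above 10% -- treat as SEV-2")
```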

Phase 3: Fix & Verify

  • Apply the fix: deploy patch, rollback, config change, data fix

  • Verify the fix: confirm metrics recover, test affected flows, check logs

  • Monitor for 15-30 minutes to ensure stability (a polling sketch follows this list)

  • Update status page: issue resolved

  • Post all-clear in incident channel
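
The stability watch can be automated rather than eyeballed. A rough sketch that polls a health endpoint for the monitoring window; the URL and cadence are assumptions:

```python
# Sketch: poll a health endpoint and flag any regression during the window.
import time
import requests

def monitor_recovery(url: str, minutes: int = 20, interval_s: int = 60) -> bool:
    deadline = time.monotonic() + minutes * 60
    while time.monotonic() < deadline:
        try:
            ok = requests.get(url, timeout=5).status_code == 200
        except requests.RequestException:
            ok = False
        if not ok:
            print("Regression detected -- reopen the incident")
            return False
        time.sleep(interval_s)
    print("Stable for the full window -- safe to post the all-clear")
    return True
```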

Phase 4: Follow Up

  • Schedule postmortem within 48 hours (use Incident Postmortem template)

  • Create tickets for follow-up action items (see the API sketch after this list)

  • Send internal summary to stakeholders

  • Update runbooks if the incident revealed a gap

  • Thank the team — incident response is stressful work
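
If follow-ups live in Jira, tickets can be filed straight from incident tooling. A sketch against the Jira REST API; the base URL, OPS project key, and Task issue type are assumptions for your instance:

```python
# Sketch: file a follow-up action item in Jira Cloud. The project key and
# issue type below are assumptions -- adjust to your setup.
import requests

def create_followup(base_url: str, auth: tuple[str, str], summary: str) -> str:
    resp = requests.post(
        f"{base_url}/rest/api/2/issue",
        auth=auth,  # (email, API token) for Jira Cloud basic auth
        json={"fields": {
            "project": {"key": "OPS"},      # assumed project
            "summary": summary,
            "issuetype": {"name": "Task"},  # assumed issue type
        }},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["key"]  # e.g. "OPS-123"
```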

Communication Templates

Status Page Update

Investigating: We are aware of [impact description] and are actively investigating. Updates will follow every [15/30] minutes.

Identified: The issue has been identified as [brief cause]. We are working on a fix. [X%] of users are affected.

Resolved: The issue has been resolved. [Brief explanation]. We apologize for the disruption.

Internal Escalation

[SEV-X] [Service] — [Impact description]. IC: [name]. Channel: #inc-YYYY-MM-DD-description. Current status: investigating/contained/fixing.
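
If escalation pings are generated by tooling, a trivial formatter keeps them consistent with the template above; every value in the example call is invented:

```python
# Sketch: fill in the internal escalation template. All values in the
# example call below are invented placeholders.
def escalation_message(sev: int, service: str, impact: str,
                       ic: str, channel: str, status: str) -> str:
    return (f"[SEV-{sev}] [{service}] — {impact}. IC: {ic}. "
            f"Channel: {channel}. Current status: {status}.")

print(escalation_message(2, "Payments", "checkout failing for ~10% of requests",
                         "on-call IC", "#inc-2024-05-01-payments", "investigating"))
```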

Escalation Contacts

| Escalation Level | Who | Contact | When |
| --- | --- | --- | --- |
| On-call engineer | [Name] | PagerDuty / phone | First responder for all alerts |
| Engineering lead | [Name] | Phone / Slack | SEV-1/2, or when on-call needs help |
| VP Engineering / CTO | [Name] | Phone | SEV-1 with business impact, data breach, extended outage |
| External: cloud provider | [Support] | Support ticket / phone | Infrastructure issue outside our control |
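
Where PagerDuty is the paging path, a page can be triggered programmatically through its Events API v2. A minimal sketch; the routing key and payload values are placeholders:

```python
# Sketch: trigger a page via the PagerDuty Events API v2. The routing key
# comes from the service's Events API integration settings.
import requests

def page_oncall(routing_key: str, summary: str, source: str) -> None:
    resp = requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": routing_key,
            "event_action": "trigger",
            "payload": {"summary": summary, "source": source, "severity": "critical"},
        },
        timeout=10,
    )
    resp.raise_for_status()

# Example: page_oncall(ROUTING_KEY, "[SEV-1] Database down", "incident-tooling")
```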
