Incident response playbook in one page

Incident response playbook explained: what it is, the six phases that matter, where it has to live to be useful at 04:00, and when not to bother writing one.

The Editorial Raccoon
A row of server racks lit by red emergency lighting, suggesting an active incident

TL;DR. An incident response playbook is a one-page set of instructions for the most common ways your service goes sideways: who pages whom, what to check first, how to contain it, and how to write it up afterwards. Six phases — prepare, detect, contain, eradicate, recover, learn — repeated for each incident type that’s plausible enough to bother. The playbook fails the moment the on-call can’t find it; everything else is implementation detail.

The first time you’ll wish you had an incident response playbook is at 03:47 on a Tuesday with a phone in one hand and a half-loaded dashboard in the other. (Patches insists this is “wildlife rescue with a spreadsheet,” and we are not going to argue with Patches.) The good news is that a playbook is not a sixty-page document; it’s a one-page set of instructions, ideally one per incident type, written when nothing is on fire. The bad news is that most teams write theirs in a panic on the day they discover they needed one. The rest of this post is the working shape: what goes in it, where it has to live, and when a small team is right to skip the formality.

What an incident response playbook actually is

A playbook is a pre-decided response to a specific kind of incident — the kind your team has had before or expects to have. “What do we do if the database fills up?” gets a playbook. “What do we do if a rogue raccoon enters the loading dock?” gets a different playbook (and yes, we have one).

Two things make a playbook a playbook rather than a wishlist:

  • It names a triggering condition. “Latency p99 exceeds 1 second for 5 minutes,” not “things feel slow.” The playbook is keyed by the alert that fires.
  • It names who does what, in what order. Specific roles (incident commander, scribe, comms), not “the team.” Specific actions, not “figure it out.”
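
The triggering condition is concrete enough to check mechanically. A minimal sketch in Python (the window sizes and threshold are illustrative assumptions, not from any particular alerting tool):

```python
from statistics import quantiles

def p99(samples):
    """99th-percentile latency of one window of samples, in seconds."""
    return quantiles(samples, n=100)[98]

def playbook_fires(latency_windows, threshold_s=1.0, sustained_windows=5):
    """True when p99 latency exceeds the threshold in every one of the
    last `sustained_windows` one-minute windows -- i.e. "p99 exceeds
    1 second for 5 minutes", not "things feel slow"."""
    recent = latency_windows[-sustained_windows:]
    return (len(recent) == sustained_windows
            and all(p99(w) > threshold_s for w in recent))
```

The point is not the arithmetic but the shape: the playbook is keyed by a condition a machine can evaluate, so “things feel slow” never pages anyone.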

A playbook is not a runbook (more on that below) and not a strategy. It’s a recipe with names attached, tested in advance, kept short enough that someone reading it for the first time at 03:47 can find the step they need.

Policy, plan, playbook, runbook in one paragraph each

These four words get used interchangeably in conference talks. They aren’t the same thing.

  • Policy is the high-level rule. “All severity-1 incidents are paged immediately and reviewed by leadership within 24 hours.” One paragraph; rarely changes.
  • Plan is the org-wide framework. Roles, severities, on-call rotations, escalation paths, communication channels. One short doc; reviewed yearly. The Upwind glossary draws this distinction well.
  • Playbook is the response to a specific type of incident. “Database disk full” gets one. “Phishing click” gets one. One page each; one playbook per realistic scenario.
  • Runbook is the operational how-to for a single procedure inside a playbook. “How to fail over the primary Postgres replica” is a runbook. The playbook links to it; the runbook is a separate doc with the command-by-command detail.

If your playbook contains shell commands, those commands belong in a runbook the playbook links to. The playbook is the air-traffic-control voice; the runbook is the cockpit checklist.

The six phases every playbook has to cover

The widely cited shape traces back to NIST SP 800-61 revision 2, which formally defines four phases; split its combined containment/eradication/recovery phase into three and you get the six below. Every credible playbook walks through them, in order, every time.

  1. Prepare. Who’s on call. Which roles get filled. What channels (the chat room, the war room, the status page). What pre-conditions are required to even start. This is the section the on-call reads first when a page lands.
  2. Detect. What signals fire when this incident starts. The alert wording. The dashboard you open. The five-second “is it really this incident, or am I about to spend the next four hours on the wrong one?” check.
  3. Contain. What to do first — the action that stops the bleeding before you understand the cause. Killing a bad deploy. Failing over to a secondary. Putting the service into a maintenance mode. Containment buys you thinking time.
  4. Eradicate. What to do once contained — the actual fix. This is where the playbook links to runbooks. Restore from backup, roll forward a fix, terminate the compromised credentials.
  5. Recover. Bring services back, validate they’re healthy, monitor for the recurrence the on-call dreads. “Recovered” and “pre-incident” are not the same state; the playbook says how to know.
  6. Learn. Postmortem. Root cause. Action items. A blameless write-up that lives in the wiki forever, linked from this playbook so the next time someone reads it the prior incident is one click away. Use the Incident Postmortem template and let it carry the shape.

The phases overlap in time on a real incident; the document still walks through them in order. The playbook is a checklist, not a Gantt chart.
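
One way to see why the order matters: a playbook reduces to a checklist the on-call walks top to bottom. A hypothetical sketch (the `Playbook` structure and its field names are ours, not from any incident tooling):

```python
from dataclasses import dataclass, field

PHASES = ["prepare", "detect", "contain", "eradicate", "recover", "learn"]

@dataclass
class Playbook:
    """One page per incident type: steps grouped by phase, read in order."""
    incident_type: str   # e.g. "Postgres -- disk full"
    trigger: str         # the alert that keys this playbook
    steps: dict = field(default_factory=dict)  # phase -> list of actions

    def checklist(self):
        """Flatten the phases into one ordered checklist for the on-call."""
        for phase in PHASES:
            for action in self.steps.get(phase, []):
                yield phase, action
```

Note that `checklist` iterates `PHASES`, not the dict: even if the steps were written down out of order, the on-call reads them prepare-to-learn.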

Examples of playbooks worth writing first

The cybersecurity SERP universe — Microsoft, CISA, AWS — publishes per-attack playbooks for phishing, ransomware, DDoS, compromised credentials, account takeover. Those are the right starting points if your playbook universe is security-shaped. If yours is operational/SRE-shaped, the list looks different. Start with the top three to five most likely incident types, not the most exotic ones. Most teams ship in this order:

  • Service down. The catch-all “the thing isn’t responding.” Most-paged. Worth the most attention.
  • Database full / connection pool saturated. The classic, the recurring, the one the on-call has seen before and will see again.
  • Bad deploy. Roll back. There is one playbook in the world that says “don’t roll back, debug forward,” and it’s wrong for ninety-nine percent of teams.
  • Cache failure. Especially relevant if your service depends on a Redis or memcached layer. There’s a canon postmortem in our wiki — the Cache Catastrophe — about a fatigue-plus-crumbs Redis incident. The irony is that the postmortem itself, on the wiki we used to use, took 14 seconds to load. Rocket asked us to note this once and never mention it again. We are noting it once.
  • Third-party outage. AWS region down. Your DNS provider. Stripe. The playbook is “don’t panic, communicate, wait, document.”
  • Security: unauthorised access. When the situation gets security-shaped, the playbook hands off to the cybersecurity-incident playbook universe — see CISA’s federal playbooks for the canonical version.

A common SRE story illustrates how playbooks earn their keep. Operation Midnight Snack II — what canon calls the doughnut incident — was a runbook test that went sideways when a tray of doughnuts arrived at the staging area unannounced. The runbook said “two-meter snack-free zone around the server room.” The doughnuts were 1.4 meters from the server room. Patches caught the recovery with a butterfly net, first try; the playbook update was a single sentence about catered food deliveries and where they go. The lesson: playbooks anticipate not what should happen, but what people actually do — including bringing snacks.

Where the playbook lives at four in the morning

This is the part the long-form SERP guides skip. The top three results for “incident response playbook” explain phases and components at length and never say where the document sits when the page actually fires. Here’s the lesson, after watching teams chase a playbook through three SharePoint folders during an actual incident: a playbook the on-call can’t find in five seconds is not a playbook they have.

Concrete moves that work:

  • One short page per playbook in your wiki. Title it the service plus the incident type. “Postgres — disk full.” “API — bad deploy.” The on-call types two of those words into the search bar and the page is the first hit.
  • Linked from the on-call rotation page. Every playbook is reachable from the “who’s on call” doc, which is the doc the new on-call reads first.
  • Pages have to load fast. On Raccoon Page they load in 50–150ms depending on your network; if your playbook takes a second longer than the alert pop-up, the on-call will improvise instead, and improvising at 04:00 is the leading cause of “how did we make this worse?” postmortems.
  • Linked from every alert. Where the alerting tool supports a “runbook URL” field, put the playbook URL there. “Sub-second loads, keyboard-first” — same practical bar as everything else; if the alert text doesn’t take you to the playbook in one click, the playbook is one click too far.
  • Tested by chaos days, not just by real incidents. Once a quarter, run a fake page; time how long it takes the on-call to find the right playbook. If the median is over thirty seconds, the playbook isn’t where it needs to be.
  • Updated after every real incident. Every postmortem ends with “playbook deltas” — what the on-call wished the playbook had said. That section is the actual product of the postmortem.
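
The thirty-second bar is easy to make mechanical. A small sketch, assuming you log each fake page’s find time in seconds during the chaos day:

```python
from statistics import median

def chaos_day_report(find_times_s, budget_s=30.0):
    """Median time the on-call took to find the right playbook during
    a quarterly chaos day, checked against the thirty-second budget."""
    m = median(find_times_s)
    return {"median_s": m, "ok": m <= budget_s}
```

If `ok` comes back false, the fix is a placement problem (titles, links, search), not a writing problem.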

The playbook is part of the same operating-discipline shape as a team charter, an SLO, and a quarterly business review — the doc lives in the wiki, the wiki is reachable in one keystroke, and the team’s running operating system uses it or it isn’t a system. The Incident Response Playbook template and the Incident Postmortem template are the importable shapes; the Service Runbook template is what the playbook links to for the command-by-command details.

When you don’t need a formal playbook yet

Three signs writing a formal playbook this quarter is the wrong move:

  • You’re a team of two on a side project. Two people on a side project sharing a Slack DM and a “what do we do if it breaks?” habit are the playbook. The doc earns its keep when the team is large enough that the on-call is not always the same person.
  • You don’t have a service that pages anyone. If nothing wakes a human up, you don’t yet have an incident to respond to. Write the playbook once the alerts are real.
  • You haven’t had the same kind of incident twice. Premature formalisation is its own failure mode. If you’ve had one ransomware false alarm and zero real ones, “call the security lead and panic together” is fine for now. Write the playbook after the second incident, not the first.

Above that bar, the playbook earns its keep. Below it, you’re writing playbook theatre, and the on-call will improvise anyway. “Tell people when not to use Raccoon Page” is the honest version of this advice for our own product: the Free tier — three users, one space, a hundred pages, no card — is the right home for the first playbook a small team writes; if and when your team grows into needing real-time co-editing during the incident itself, the Team tier at $8/user/month is the honest math. Pick the cheapest plan that fits the job.

Things people actually ask

What is an incident response playbook, in one sentence? A one-page pre-decided response to a specific type of incident, walking the on-call through six phases — prepare, detect, contain, eradicate, recover, learn — with names, roles, and links to runbooks attached.

What’s the difference between a playbook and a runbook? A playbook is the air-traffic-control voice — the high-level response to an incident. A runbook is the cockpit checklist — the command-by-command how-to for a single procedure. The playbook says “fail over to the secondary”; the runbook says “run this command, wait for this response.”

How many playbooks should we have? Start with the top three to five most likely incident types for your team and grow from there. A typical small SRE team ends up with eight to twelve playbooks; a security team can have thirty or more. More than that and the “finding the right one” problem starts to dominate the “executing it” problem.

Who writes the playbook? The on-call rotation, with help from the engineers who own the service. Not in isolation; the playbook is only useful if every on-call has read it before they need it. Reviewed quarterly, materially edited after each real incident.

What goes in the prepare phase? Roles (incident commander, scribe, comms), channels (chat room, war room, status page), pre-conditions (access to the runbook URLs, the production console, the alerting tool), and the severity definition that says when this playbook fires versus a different one.

What’s the difference between containment and eradication? Containment stops the bleeding before you understand the cause. Eradication is the real fix once you do. “Kill the bad deploy” is containment. “Fix the bug that the bad deploy introduced and re-deploy” is eradication. They are different actions and the playbook orders them deliberately.

Should playbooks be automated? The detection and the first containment step often can be. The judgement calls — when to escalate, when to roll back, when to call the customer — should not be. Don’t automate the part where the human decides whether the human should panic.
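
That boundary is easy to encode: the machine takes detection and the first containment step, and a human is paged for everything after. A hedged sketch (the alert shape and the `rollback` and `page_human` hooks are hypothetical stand-ins for your deploy tool and pager):

```python
def handle_alert(alert, rollback, page_human):
    """Automate detection and the first containment step; leave the
    judgement calls to a human. `rollback` and `page_human` are
    injected callables standing in for the deploy tool and the pager."""
    actions = []
    if alert.get("type") == "bad_deploy":
        rollback(alert["deploy_id"])  # containment: stop the bleeding
        actions.append("rolled back")
    # Everything past containment -- escalation, eradication, customer
    # comms -- is a human decision, so the human is always paged.
    page_human(alert)
    actions.append("paged on-call")
    return actions
```

The invariant worth keeping is the last two lines: no alert path ends without a human in the loop.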

How often should we update playbooks? After every real incident, plus a quarterly review. The quarterly review catches drift in roles and tooling; the post-incident edit catches the gap between what the playbook said and what the on-call wished it had said.


If your incident response playbooks are currently a SharePoint folder you can’t find in five seconds and a ten-year-old Confluence page nobody owns, the upgrade isn’t a different folder — it’s a one-page wiki page per playbook, linked from the alert. Try the Free tier on your service-down playbook and your bad-deploy playbook; those are the two that pay back the cost of writing them fastest. If the next 04:00 page is still a search through three folders, write to us; we want to know which folder won.

Written by The Editorial Raccoon — house style for Raccoon Page. Numbers and claims pulled from product reality; jokes pulled from the Raccoon Corp canon. No raccoons were quoted in real life.