How to write a runbook your on-call will actually use
How to write a runbook: the seven-section shape, runbook vs playbook, writing for 04:00, where the doc lives, and how to notice it's gone stale before it fails.
TL;DR. A runbook is one short, ordered procedure for one recurring situation: the symptom, the checks, the steps in order, the rollback, the escalation, and the link out. Written for someone tired at 04:00, not for an architect at a whiteboard. Two hundred to six hundred words; same shape every time; reviewed when it’s used and culled when nobody’s used it in two quarters. The doc has to live in a wiki the next on-call can find in one keystroke, or it’s a runbook in name only.
Most runbooks read like furniture instructions written by someone who has never owned the furniture. (Patches, our senior SRE, calls the bad ones “butterfly net manuals — long, theoretical, and useless once the moth is already in the server room.” The good ones are short.) The good news about how to write a runbook is that the shape has been the same since teams started keeping them — one symptom, an ordered list, a rollback, a way out. The bad news is that most teams write them once, file them in a folder nobody opens, and discover at 04:00 that the runbook is six months out of date and the API it calls retired in March. The rest of this post is the working shape, the writing voice that survives contact with an actual incident, and the part the long-form SERP guides skip: where the runbook lives between incidents and how you know it’s wrong before it gets you.
A runbook is one short procedure for one recurring situation
A runbook is a written operational procedure — not a wiki page about a system, not an architecture diagram, not a postmortem. It answers the operational question “this thing is happening; what do I do?” in steps an awake-but-tired engineer can execute without re-reading the whole document.
A runbook succeeds when:
- A new on-call can read it cold at 04:00, follow the steps, and resolve the situation without paging anyone they don’t have to page.
- It tells the on-call what to do, not how the system works — that’s a different doc, and it lives elsewhere.
- It names a rollback and an escalation explicitly. Neither is implicit; under stress, implicit means forgotten.
- It’s short enough to read in two minutes. If it’s longer than two minutes, the runbook is two runbooks pretending to be one.
Anything shorter is a Slack thread; anything longer is a manual. The right length is between 200 and 600 words for most procedures. A 3,000-word runbook is one the on-call will skim, miss a step in, and then call someone else about.
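The length rule is easy to check mechanically. A minimal sketch, assuming the runbook is available as plain text; the function name `length_verdict` and the whitespace-based word count are illustrative choices, not a standard:

```python
def length_verdict(text: str, low: int = 200, high: int = 600) -> str:
    """Classify a runbook by word count using the 200-600 word guideline."""
    words = len(text.split())  # crude count: whitespace-separated tokens
    if words < low:
        return "too short"
    if words > high:
        return "too long: consider splitting"
    return "ok"
```

A check like this could run whenever a runbook page is saved, so a doc that has grown past the guideline gets split before the next incident rather than during it.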
Runbook vs playbook — scope is the difference
The two words get used interchangeably and shouldn’t. The split that survives contact with a real on-call rotation:
| Artifact | Scope | Trigger | Length |
|---|---|---|---|
| Runbook | One procedure for one recurring symptom or task | A specific alert / symptom / scheduled job | 200–600 words |
| Playbook | The shape of a response to a class of incident | A declared incident of a given severity | 600–1,500 words |
A runbook is “the queue depth alarm has fired; here are the six steps to drain it.” A playbook is “a customer-facing outage of severity SEV-1 has been declared; here is who’s incident commander, who’s scribe, who talks to support, when status pages update, and when the postmortem starts.” The runbook is a tool you reach for during the incident; the playbook is the operating shape of the incident itself.
If you’re writing one and reaching for the other’s content, you’re writing the wrong doc. Two artifacts. Separate links. Cross-reference where the response calls the procedure.
The seven sections every runbook needs
A working runbook is a single short doc with seven sections, in this order. Same headings every time; the on-call doesn’t have to learn a new format at 04:00.
| Section | What it answers | Length |
|---|---|---|
| Title + owner | What this is for; who keeps it current | One line |
| Symptom | How you know you’re in this situation | 1–3 sentences |
| Pre-flight | What to check before you start (auth, env, blast radius) | 3–6 lines |
| Procedure | The ordered steps, with commands | Half the doc |
| Verification | How you know it worked | 2–4 lines |
| Rollback | How you undo it if it didn’t | 3–8 lines |
| Escalation + links | Who to page, what dashboards, what docs | A short list |
The Symptom row is the runbook’s title in long form — the unmistakable signal that this runbook is the one. “P99 search latency > 2s for more than five minutes, paging the search on-call.” That’s a symptom. “Search is slow” is a vibe.
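Because the headings are the same every time, the shape itself can be checked. A rough sketch, assuming the runbook's headings have already been extracted into a list of strings; `check_sections` is a hypothetical helper, and the heading names are copied from the table above:

```python
REQUIRED_SECTIONS = [
    "Title + owner",
    "Symptom",
    "Pre-flight",
    "Procedure",
    "Verification",
    "Rollback",
    "Escalation + links",
]


def check_sections(headings: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the shape is right."""
    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in headings]
    # Compare the order of the known headings actually present with the required order.
    present = [h for h in headings if h in REQUIRED_SECTIONS]
    expected = [s for s in REQUIRED_SECTIONS if s in headings]
    if present != expected:
        problems.append("sections out of order or duplicated")
    return problems
```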
The Pre-flight row is the cheapest debugging move in a runbook and the most-skipped. “Confirm the alert is real (`SELECT COUNT(*) FROM …` against the metric store), confirm the deploy log says nothing shipped in the last 30 minutes, confirm `kubectl get pods -n search` shows the pods healthy.” Three checks. Two minutes. Avoids the rest of the runbook running on a false premise.
The Procedure row is the runbook itself. Numbered. With exact commands in code blocks where appropriate, and expected output underneath each one. Not “check the queue depth” but “run `redis-cli LLEN search:index:queue`; expect a number under 5,000. If higher, proceed to step 3.” The on-call shouldn’t have to interpret the step; the runbook should interpret the result for them.
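One way to picture "the runbook interprets the result" is a step that maps raw output straight to the next action. A minimal sketch built on the queue-depth example; the 5,000 threshold comes from that example, while the function name, the return strings, and the stop-and-escalate behaviour on non-numeric output are assumptions for illustration:

```python
QUEUE_DEPTH_LIMIT = 5_000  # threshold taken from the example step above


def interpret_queue_depth(raw_output: str) -> str:
    """Turn the raw LLEN output into the next action for the on-call."""
    # redis-cli prints "(integer) N" in an interactive terminal and a bare "N" when piped.
    cleaned = raw_output.strip().removeprefix("(integer)").strip()
    try:
        depth = int(cleaned)
    except ValueError:
        # Assumption: anything non-numeric is treated as a reason to stop.
        return "unexpected output: stop and escalate"
    if depth < QUEUE_DEPTH_LIMIT:
        return "queue depth is within the expected range"
    return "queue depth is high: proceed to step 3"
```

The point is not the automation; it is that every possible output has a named next move, which is what the written step should also provide.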
The Rollback row is the runbook’s load-bearing safety net. Two on-calls in three forget to write one; the third one needs it within a month. “If the manual reindex hangs for more than two minutes, Ctrl+C, then `redis-cli DEL search:index`, then `kubectl rollout restart deployment/indexer -n search`.” Rollback is not a later doc; it’s the same doc.
The Google SRE chapter on the on-call position and the PagerDuty incident response docs are the canonical primary sources for the operating discipline; both are worth re-reading once a year the way you’d re-read a contract you signed.
Write for someone tired and scared at 04:00
The single biggest move that separates a useful runbook from a counter-productive one is voice and audience. Most runbook templates encourage the writer to demonstrate their understanding of the system. The actual reader will be a different person, six months later, woken by a page, two coffees behind, with seventeen unread Slack messages and a partner asleep beside them. The runbook is for that person, not the writer.
Concrete rules for the runbook voice:
- Second person, imperative. “Run the query.” Not “the engineer should run the query” or “queries can be run.” The reader is busy.
- One sentence per step. A step that needs two sentences is two steps.
- Name every command in `code`. No paraphrasing. Copy-paste is the load-bearing UX of a runbook.
- Name expected output literally. “Expect: `OK`. Anything else: skip to rollback.” Not “the response should be successful.”
- No theory. Why the queue grows is a separate doc and a link at the bottom of the runbook. What to do when it grows is the runbook itself. Don’t explain on the inside of the procedure.
- Name the failure mode. “If you see `WRONGPASS`, the vault token has rotated; rotate it (link below) and start again.” The runbook is the place where you turn known failure modes into known recoveries.
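Two of the voice rules (one sentence per step, imperative mood) can be roughly linted before the cold-read test. A crude sketch under stated assumptions: `lint_step` is a hypothetical helper, sentence splitting is a regex heuristic that abbreviations will fool, and the list of passive or third-person markers is illustrative rather than complete:

```python
import re


def lint_step(step: str) -> list[str]:
    """Flag a procedure step that breaks the voice rules (crude heuristics)."""
    warnings = []
    # Split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", step.strip()) if s]
    if len(sentences) > 1:
        warnings.append("more than one sentence: split into two steps")
    # Words that usually signal third-person or passive phrasing.
    if re.search(r"\b(should|can be|could be)\b", step, re.IGNORECASE):
        warnings.append("not imperative: address the reader directly")
    return warnings
```

A lint like this only catches the mechanical misses; the cold-read test is still the real check.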
The test for runbook voice is the cold-read test: hand the runbook to an engineer on the team who has never executed this procedure, hand them the access they’d have at 04:00, and watch them work through it. Every place they hesitate, ask a clarifying question, or open a second tab is a rewrite. The on-call will hesitate too, but they won’t have anyone to ask.
Where the runbook lives between incidents
This is the part the long-form SERP guides skip. The runbook example pieces explain what to write and the runbook-automation pieces explain how to chain steps to a button; almost none of them name where the finished runbook actually lives between the day it’s written and the day a pager fires at 03:47 and someone needs to find it.
Here’s the lesson, after watching teams keep runbooks in Google Docs / private GitHub gists / Confluence pages nobody can find: a runbook in a Google Doc is a runbook you don’t have at 04:00. The wiki is the right home for it, and the wiki has to be reachable in one keystroke.
Concrete moves that work:
- One short page per runbook in your wiki. Title it Runbook — <service> — <symptom>. Searchable by service name, by alert name, by symptom. The Service Runbook template is the importable shape; copy it once and live with it.
- Linked from the alert. The alert message itself contains a link to the runbook. Not the dashboard the alert came from: the runbook. PagerDuty events can carry links, and the same idea works with any alerting tool that lets you attach a URL to an alert template.
- Pages have to load fast. On Raccoon Page, pages load in 50–150ms depending on your network; if the runbook takes a second longer than your terminal prompt, the on-call will alt-tab to the terminal and stop reading. Sub-second loads and keyboard-first navigation are the same operational bar as everywhere else, and missing that bar costs the most on a doc that exists to be read under stress.
- Indexed in a single register. A page called Runbooks that lists every one with the symptom visible. The list is the actual artifact of your operational discipline; the individual docs are the receipts.
- Linked from the incident response playbook and from any matching postmortem. The playbook routes the response; the runbook executes the procedure; the postmortem captures what happened. Three artifacts, one operating loop.
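The "linked from the alert" move can be sketched as an alert event that carries the runbook URL. The shape below follows the PagerDuty Events API v2 event format, which accepts a `links` array; the function name, the routing key, the wiki URL, and the fixed `critical` severity are placeholders and assumptions, not values from any real setup:

```python
def build_alert_event(routing_key: str, summary: str, source: str, runbook_url: str) -> dict:
    """Build a trigger event whose link is the runbook, not the dashboard."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": "critical",  # assumption: this alert always pages
        },
        "links": [{"href": runbook_url, "text": "Runbook"}],
    }


# Hypothetical values for illustration only.
event = build_alert_event(
    routing_key="REPLACE_WITH_INTEGRATION_KEY",
    summary="P99 search latency > 2s for more than five minutes",
    source="search",
    runbook_url="https://wiki.example.com/runbooks/search-p99-latency",
)
```

Sending the event is left out on purpose; the part that matters here is that the runbook URL travels with the page.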
The runbook is part of the same operating-discipline shape as a team charter, an SLO, and the incident postmortem — the doc lives in the wiki, the wiki is reachable in one keystroke, and the team’s running operating system uses it or it isn’t a system.
Notice the runbook is wrong before it fails
Every runbook decays. The API gets renamed; the dashboard URL changes; the failure mode you wrote up two years ago retired when the team replaced the cache. The runbook says “check redis-cli”; the cache is now Memcached. The on-call follows the runbook, the steps fail, and they spend the first ten minutes of an outage debugging the runbook instead of the incident.
The defence against runbook decay is a small habit, not a heroic audit:
- Use it on every incident, even when you don’t need to. If the runbook applies and you could run the procedure from memory, follow the doc anyway; the run surfaces stale steps within sixty seconds. Use means audit.
- Date the last execution. “Last run: 2026-04-22 by on-call@.” If the date is older than two quarters and the service is in active production, the runbook is presumed stale.
- Owner per runbook. A named owner — not a team alias. Owners get pinged when their runbooks haven’t been touched.
- Quarterly cull. Once a quarter, walk the runbook index; delete the ones for services that no longer exist; mark the ones nobody has executed in two quarters as probable candidates for deletion. A runbook nobody uses is a runbook nobody trusts.
- Tabletop the scary ones. Once a quarter, pick a runbook for a Sev-1 scenario and read through it with the on-call rotation. Not run it — read it. Half the stale steps surface in the reading.
The cheapest move on this list is the first one. The team that uses its runbooks knows when they’re broken. The team that files runbooks and never reads them owns a folder of fiction.
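The "date the last execution" rule reduces to a small date comparison. A minimal sketch, assuming each runbook records its last-run date; the name `is_presumed_stale` and the approximation of two quarters as 182 days are assumptions:

```python
from datetime import date, timedelta

TWO_QUARTERS = timedelta(days=182)  # assumption: a quarter is roughly 91 days


def is_presumed_stale(last_run: date, today: date, in_production: bool = True) -> bool:
    """Last run older than two quarters on a live service means presumed stale."""
    return in_production and (today - last_run) > TWO_QUARTERS
```

Run over the runbook index during the quarterly cull, a check like this produces the list of owners to ping.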
When you don’t need a formal runbook yet
Three signs the formal runbook is the wrong artifact for your situation:
- The procedure runs once a year. A yearly key-rotation isn’t a runbook; it’s a calendar reminder with a checklist embedded. Write the checklist; don’t pretend it’s an operational procedure.
- There’s only ever one on-call. A single engineer on a side project doesn’t need a runbook; they need a `README` in the repo. The runbook earns its keep when the on-call rotates and the writer isn’t the runner.
- The procedure is one step long. “Restart the service” is a runbook nobody needs; the alert message can contain the command. The runbook earns its keep when the procedure has branches, pre-flight checks, or rollbacks that aren’t obvious.
Above that bar, the formal runbook earns its keep. Below it, you’re writing runbook theatre. Pick the cheapest plan that fits the job — same logic, by the way, applies to wikis: our Free tier — three users, one space, a hundred pages, no card — is the right home for the first runbook a small team writes; if and when the rotation grows to a real schedule, the Team tier at $8/user/month is the honest math.
A note on Raccoon Page itself: we’re a wiki, not an incident platform. For the alert routing, the pager, the bridge call, the status page, you’ll need other tools. The wiki is where the runbook lives — searchable, linkable, keyboard-reachable, the load profile a 04:00 on-call can survive.
Things people actually ask
What is a runbook, in one sentence? A short, ordered, written procedure for one recurring operational situation — symptom, checks, steps in order, verification, rollback, escalation — written so a tired on-call can execute it without re-reading the rest of the doc.
What’s the difference between a runbook and a playbook? Scope and trigger. A runbook is one procedure for one recurring symptom or task; it fires from an alert. A playbook is the shape of a response to a class of incident; it fires from an incident declaration. Most teams use the words interchangeably; the split is useful enough that two artifacts is the right number.
How long should a runbook be? Two hundred to six hundred words. If it’s longer, it’s two runbooks. The length check is the cold-read test: can a teammate who has never run the procedure finish reading it in two minutes and then run the correct commands?
Who should write the runbook? The engineer who owns the system, with review from the on-call rotation. The on-call review is load-bearing: the writer can’t test runbook voice on themselves; the rotation can.
When do you write the runbook? Right after the second time you ran the procedure manually. The first time is exploratory; the second time is the moment you realise you’ll do this again, and the runbook locks in what you just figured out before it slips.
How do runbooks get reviewed? Use them. Date the last execution. Quarterly cull. Tabletop the Sev-1-shaped ones. A named owner per runbook gets the ping when nobody’s touched theirs.
Should runbooks be public? Inside your company — yes; visibility is half of the reliability dividend. Outside your company — almost never; runbooks name internal systems, credentials, and rollback levers, and a public runbook is a free hand to anyone curious about your stack.
Should runbook automation replace the runbook? No. Automate the steps inside the runbook where it’s safe (the verification row, the pre-flight row), but keep the runbook itself as the human-readable artifact. A button that runs the procedure is great until the button stops working; the runbook is the fallback the team understands.
If your runbooks currently live in a folder called runbooks/ inside a private Confluence space three clicks deep from the on-call homepage, the upgrade isn’t a new template — it’s a wiki that loads before your terminal does and a single index of procedures the team can keyboard their way through. Try the Free tier on the runbook for your next recurring alert; even one well-shaped runbook is more on-call rest than a binder of stale ones. If the next 04:00 page can’t find it, write to us; we want to know which folder it ended up in.
Written by The Editorial Raccoon — house style for Raccoon Page. Numbers and claims pulled from product reality; jokes pulled from the Raccoon Corp canon. No raccoons were quoted in real life.