OPERATIONS / DEVOPS / SRE TEMPLATE

Disaster Recovery Plan Template

A disaster recovery (DR) plan template covering recovery tiers, system inventory, activation criteria, recovery procedures, and a testing schedule.

| Field | Details |
| --- | --- |
| Scope | What systems and environments this plan covers |
| Owner | Name / team responsible for DR |
| Last Tested | |
| Next Test | |
| Last Updated | |

Recovery Tiers

Not everything is equally critical. Classify systems so the team knows what to recover first.

| Tier | RTO | RPO | Definition | Examples |
| --- | --- | --- | --- | --- |
| Tier 1: Critical | < 1 hour | < 5 min | Business cannot function without these | Auth, database, payment processing |
| Tier 2: Important | < 4 hours | < 1 hour | Significant impact but workarounds exist | Search, notifications, analytics |
| Tier 3: Standard | < 24 hours | < 24 hours | Can tolerate a day of downtime | Internal tools, batch jobs, reporting |
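
To make the tier targets checkable after a drill, they can be encoded as data. The following is a minimal sketch, assuming the RTO and RPO values from the table above; the names `RecoveryTier`, `TIERS`, and `meets_targets` are hypothetical and not part of the template.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTier:
    name: str
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable window of data loss


# Targets copied from the tier table.
TIERS = {
    1: RecoveryTier("Critical", rto=timedelta(hours=1), rpo=timedelta(minutes=5)),
    2: RecoveryTier("Important", rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    3: RecoveryTier("Standard", rto=timedelta(hours=24), rpo=timedelta(hours=24)),
}


def meets_targets(tier: int, restore_time: timedelta, data_loss: timedelta) -> bool:
    """Return True if a measured recovery stayed strictly under the tier's RTO and RPO."""
    target = TIERS[tier]
    return restore_time < target.rto and data_loss < target.rpo
```

For example, `meets_targets(1, timedelta(minutes=45), timedelta(minutes=2))` reports whether a 45-minute restore that lost 2 minutes of data stays within the Tier 1 targets.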

System Inventory

| System | Tier | Backup Method | Backup Frequency | Restore Tested? | Restore Time |
| --- | --- | --- | --- | --- | --- |
| Primary database | Tier 1 | WAL streaming + daily snapshot | Continuous + daily | Yes | X minutes |
| Object storage | Tier 1 | Cross-region replication | Real-time | Yes | N/A (automatic) |
| Application config | Tier 2 | Git + infra-as-code | On every change | Yes | X minutes |
| Search index | Tier 2 | Rebuild from primary DB | N/A | No | X hours |
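
An inventory like this is easier to keep honest if it can be audited automatically. The sketch below is one possible approach, using the example rows from the table; `InventoryEntry` and `untested_restores` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class InventoryEntry:
    system: str
    tier: int
    backup_method: str
    restore_tested: bool


# Example rows taken from the system inventory table.
INVENTORY = [
    InventoryEntry("Primary database", 1, "WAL streaming + daily snapshot", True),
    InventoryEntry("Object storage", 1, "Cross-region replication", True),
    InventoryEntry("Application config", 2, "Git + infra-as-code", True),
    InventoryEntry("Search index", 2, "Rebuild from primary DB", False),
]


def untested_restores(inventory):
    """Return entries with no tested restore, most critical tier first."""
    gaps = [entry for entry in inventory if not entry.restore_tested]
    return sorted(gaps, key=lambda entry: entry.tier)
```

With the example data, the only gap reported is the search index, whose rebuild from the primary database has not been tested.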

Activation Criteria

When do we activate the DR plan? Be specific so there's no debate during a crisis.

  • Primary region is unreachable for > X minutes

  • Primary database is unrecoverable and failover is needed

  • Data corruption detected that requires point-in-time recovery

  • Cloud provider declares a regional outage
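
Writing the criteria as explicit checks is one way to remove debate during an incident. This is a sketch only: the 15-minute threshold stands in for the unspecified "X minutes", and `IncidentSignals` and `activation_reasons` are hypothetical names. How each signal is detected is left to your monitoring.

```python
from dataclasses import dataclass

# Assumed placeholder for the "X minutes" in the first criterion; pick your own value.
REGION_UNREACHABLE_THRESHOLD_MIN = 15


@dataclass
class IncidentSignals:
    region_unreachable_minutes: float = 0.0
    database_unrecoverable: bool = False
    corruption_needs_pitr: bool = False
    provider_declared_regional_outage: bool = False


def activation_reasons(s: IncidentSignals) -> list[str]:
    """Return the criteria currently met; an empty list means do not activate."""
    reasons = []
    if s.region_unreachable_minutes > REGION_UNREACHABLE_THRESHOLD_MIN:
        reasons.append("primary region unreachable beyond threshold")
    if s.database_unrecoverable:
        reasons.append("primary database unrecoverable, failover needed")
    if s.corruption_needs_pitr:
        reasons.append("data corruption requires point-in-time recovery")
    if s.provider_declared_regional_outage:
        reasons.append("cloud provider declared a regional outage")
    return reasons
```

Returning the list of matched criteria, rather than a bare boolean, gives the incident commander a ready-made justification to record when the plan is activated.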

Recovery Procedures

Database Recovery

  • Assess: determine the type of failure (hardware, corruption, human error, region down)

  • Choose recovery method: failover to replica / restore from snapshot / point-in-time recovery

  • Execute recovery (include exact commands or link to runbook)

  • Validate: run data integrity checks, compare row counts, verify critical records

  • Reconnect: update application connection strings, restart services

  • Verify: confirm application is healthy, run smoke tests
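
The "compare row counts" part of the validation step can be sketched as a simple diff of per-table counts. This assumes the counts have already been collected into dictionaries (for instance, from a pre-incident report and from the restored database); `compare_row_counts` is a hypothetical helper, and small differences within the RPO window may be expected rather than a sign of failure.

```python
def compare_row_counts(expected: dict[str, int], actual: dict[str, int]) -> dict[str, tuple]:
    """Return {table: (expected, actual)} for every table whose counts differ.

    A table missing from either side is reported with None for that side.
    """
    mismatches = {}
    for table in sorted(expected.keys() | actual.keys()):
        exp, act = expected.get(table), actual.get(table)
        if exp != act:
            mismatches[table] = (exp, act)
    return mismatches
```

An empty result means every table matched; anything else is a list of tables to inspect before reconnecting the application.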

Application Recovery

  • Deploy application to DR environment (or activate standby)

  • Update DNS / load balancer to route traffic to DR

  • Verify: health checks, smoke tests, monitor error rates

  • Notify: update status page, inform stakeholders
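
The verification step can be gated on several consecutive passing health checks before stakeholders are told recovery succeeded. The sketch below takes the probe as a callable, so it makes no assumption about the health endpoint; `wait_until_healthy` and its default values are illustrative, not prescribed by this plan.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    required_successes: int = 3,
    max_attempts: int = 20,
    delay_seconds: float = 5.0,
) -> bool:
    """Poll probe until it succeeds required_successes times in a row."""
    streak = 0
    for attempt in range(max_attempts):
        try:
            ok = probe()
        except Exception:
            ok = False  # a probe that raises counts as a failed check
        streak = streak + 1 if ok else 0
        if streak >= required_successes:
            return True
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return False
```

Requiring a streak rather than a single success guards against declaring recovery on a check that happened to pass while the service was still flapping.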

Returning to Normal (Failback)

  • Confirm primary environment is healthy and has capacity

  • Sync data from DR back to primary (if needed)

  • Gradually shift traffic back (canary → 50/50 → 100%)

  • Deactivate DR environment or return to standby

  • Post-incident review: what worked, what didn't, update this plan
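
The gradual traffic shift can be sketched as a staged rollout with a health gate between stages. The 5% canary size is an assumption (the plan only says "canary"), and `set_primary_weight` and `is_healthy` are hypothetical callables standing in for your load balancer and monitoring.

```python
from typing import Callable

# Share of traffic sent to the primary at each stage: canary, 50/50, 100%.
# The 5% canary size is an assumption.
STAGES = [5, 50, 100]


def shift_traffic_back(
    set_primary_weight: Callable[[int], None],
    is_healthy: Callable[[], bool],
) -> bool:
    """Move traffic to the primary stage by stage; roll back to DR on failure."""
    for weight in STAGES:
        set_primary_weight(weight)
        if not is_healthy():
            set_primary_weight(0)  # send all traffic to DR again
            return False
    return True
```

In practice each stage would also hold for a soak period before the health check; that wait is omitted here to keep the sketch short.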

Communication

| Audience | Channel | Who Sends | Template |
| --- | --- | --- | --- |
| Engineering team | Slack #incidents | IC | DR activated for [system]. Recovery in progress. ETA: X hours. |
| Leadership | Email / Slack DM | Eng lead | DR activated. Impact: [X]. Recovery timeline: [X]. Next update: [time]. |
| Customers | Status page | Comms lead | We are experiencing a service disruption. We are working on recovery. |
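
Pre-filling these messages in code reduces the chance of sending one with a placeholder still in it. The following is a sketch that restates the three templates with named fields; the field names (`system`, `eta`, `impact`, `timeline`, `next_update`) and the `render` helper are assumptions made for illustration.

```python
MESSAGES = {
    "engineering": "DR activated for {system}. Recovery in progress. ETA: {eta}.",
    "leadership": (
        "DR activated. Impact: {impact}. Recovery timeline: {timeline}. "
        "Next update: {next_update}."
    ),
    "customers": "We are experiencing a service disruption. We are working on recovery.",
}


def render(audience: str, **fields: str) -> str:
    """Fill in a message template; raises KeyError if a required field is missing."""
    return MESSAGES[audience].format(**fields)
```

Because `str.format` raises `KeyError` on a missing field, an incomplete message fails loudly instead of reaching its audience half-filled.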

Testing Schedule

A DR plan that isn't tested is fiction. Schedule regular tests and document results.

| Test Type | Frequency | Last Tested | Result | Next Test |
| --- | --- | --- | --- | --- |
| Backup restore (Tier 1 systems) | Monthly | | Pass | |
| Failover drill (full DR activation) | Quarterly | | Pass | |
| Tabletop exercise (walk through plan) | Bi-annually | | Pass | |
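
The "Next Test" column can be derived from "Last Tested" and the frequency instead of being maintained by hand. This sketch reads "Bi-annually" as twice a year (every six months), which is an assumption; `next_test_date` and `is_overdue` are hypothetical helpers.

```python
import calendar
from datetime import date

# Months between tests. "Bi-annually" is read here as twice a year (assumption).
INTERVAL_MONTHS = {"Monthly": 1, "Quarterly": 3, "Bi-annually": 6}


def next_test_date(last_tested: date, frequency: str) -> date:
    """Add the frequency's interval to last_tested, clamping to the month's last day."""
    total = last_tested.month - 1 + INTERVAL_MONTHS[frequency]
    year = last_tested.year + total // 12
    month = total % 12 + 1
    day = min(last_tested.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def is_overdue(last_tested: date, frequency: str, today: date) -> bool:
    """Return True if the next scheduled test date has already passed."""
    return today > next_test_date(last_tested, frequency)
```

A scheduled job could call `is_overdue` for each row and post to the incident channel when a drill has slipped.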
