OPERATIONS / DEVOPS / SRE TEMPLATE
Disaster Recovery Plan Template
DR plan: recovery tiers, system inventory, activation criteria, recovery procedures, and testing schedule.
What's inside
| Field | Details |
|---|---|
| Scope | What systems and environments this plan covers |
| Owner | Name / team responsible for DR |
| Last Tested | |
| Next Test | |
| Last Updated | |
Recovery Tiers
Not everything is equally critical. Classify systems so the team knows what to recover first.
| Tier | RTO | RPO | Definition | Examples |
|---|---|---|---|---|
| Tier 1: Critical | < 1 hour | < 5 min | Business cannot function without these | Auth, database, payment processing |
| Tier 2: Important | < 4 hours | < 1 hour | Significant impact but workarounds exist | Search, notifications, analytics |
| Tier 3: Standard | < 24 hours | < 24 hours | Can tolerate a day of downtime | Internal tools, batch jobs, reporting |
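The tier list is most useful when it directly drives recovery order. A minimal sketch of that idea in Python; the system names and tier assignments below are illustrative placeholders, not part of this plan.

```python
# Illustrative only: placeholder systems and tier assignments.
# The point is that tier data, kept as data, can produce the recovery order.
SYSTEM_TIERS = {
    "auth-service": 1,
    "primary-database": 1,
    "payment-processing": 1,
    "search": 2,
    "notifications": 2,
    "internal-tools": 3,
}

def recovery_order(tiers: dict[str, int]) -> list[str]:
    """Sort systems so Tier 1 is recovered first, then Tier 2, then Tier 3."""
    return sorted(tiers, key=tiers.get)

print(recovery_order(SYSTEM_TIERS))
```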
System Inventory
| System | Tier | Backup Method | Backup Frequency | Restore Tested? | Restore Time |
|---|---|---|---|---|---|
| Primary database | Tier 1 | WAL streaming + daily snapshot | Continuous + daily | Yes | X minutes |
| Object storage | Tier 1 | Cross-region replication | Real-time | Yes | N/A — automatic |
| Application config | Tier 2 | Git + infra-as-code | On every change | Yes | X minutes |
| Search index | Tier 2 | Rebuild from primary DB | N/A | No | X hours |
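One way to keep the Backup Frequency column honest is a scheduled freshness check: the newest completed backup for each system must be younger than that system's RPO. A hedged Python sketch, where the system names are placeholders and `latest_backup_time` is a stub standing in for whatever your backup tooling exposes.

```python
# Hypothetical RPO check driven by the inventory above.
from datetime import datetime, timedelta, timezone

RPO_TARGETS = {
    # system: maximum tolerable data loss (placeholder values)
    "primary-database": timedelta(minutes=5),
    "application-config": timedelta(hours=1),
}

def latest_backup_time(system: str) -> datetime:
    # Stub: replace with a query against your backup system.
    return datetime.now(timezone.utc) - timedelta(minutes=2)

def rpo_violations() -> list[str]:
    """Return the systems whose most recent backup is older than their RPO."""
    now = datetime.now(timezone.utc)
    return [
        system
        for system, rpo in RPO_TARGETS.items()
        if now - latest_backup_time(system) > rpo
    ]

print(rpo_violations())
```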
Activation Criteria
When do we activate the DR plan? Be specific so there's no debate during a crisis.
- Primary region is unreachable for > X minutes (measured as sketched below)
- Primary database is unrecoverable and failover is needed
- Data corruption detected that requires point-in-time recovery
- Cloud provider declares a regional outage
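The first criterion only works if "unreachable for > X minutes" is measured the same way every time. A rough Python probe illustrating one interpretation; the health URL, window, and probe interval are placeholders to adapt to your monitoring.

```python
# Hypothetical reachability probe behind the "unreachable for > X minutes" criterion.
import time
import requests

PRIMARY_HEALTH_URL = "https://primary.example.internal/healthz"  # placeholder
UNREACHABLE_MINUTES = 15  # the "X" from the criterion above

def primary_unreachable_for(minutes: int, interval: int = 30) -> bool:
    """Return True only if every probe fails for the whole window."""
    deadline = time.time() + minutes * 60
    while time.time() < deadline:
        try:
            if requests.get(PRIMARY_HEALTH_URL, timeout=5).status_code == 200:
                return False  # primary answered; criterion not met
        except requests.RequestException:
            pass  # treat connection errors and timeouts as failed probes
        time.sleep(interval)
    return True
```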
Recovery Procedures
Database Recovery
- Assess: determine the type of failure (hardware, corruption, human error, region down)
- Choose recovery method: failover to replica / restore from snapshot / point-in-time recovery
- Execute recovery (include exact commands or link to runbook)
- Validate: run data integrity checks, compare row counts, verify critical records (a validation sketch follows this list)
- Reconnect: update application connection strings, restart services
- Verify: confirm application is healthy, run smoke tests
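For the validation step, a scripted comparison is faster and less error-prone than eyeballing tables during an incident. A minimal sketch assuming a PostgreSQL database reached via psycopg2; the table names, baseline counts, and tolerance are placeholders.

```python
# Hypothetical post-restore validation: compare row counts on the restored
# database against the last known-good counts captured before the incident.
import psycopg2

CRITICAL_TABLES = ["users", "orders", "payments"]  # placeholder table names

def row_counts(dsn: str) -> dict[str, int]:
    """Return row counts for the critical tables on one database."""
    counts = {}
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for table in CRITICAL_TABLES:  # fixed allowlist, safe to interpolate
            cur.execute(f"SELECT count(*) FROM {table}")
            counts[table] = cur.fetchone()[0]
    return counts

def validate_restore(restored_dsn: str, expected: dict[str, int], tolerance: float = 0.01) -> bool:
    """Flag any table whose restored count deviates more than `tolerance` from the baseline."""
    ok = True
    for table, count in row_counts(restored_dsn).items():
        baseline = expected.get(table, 0)
        if baseline and abs(count - baseline) / baseline > tolerance:
            print(f"WARN {table}: restored={count}, expected~{baseline}")
            ok = False
    return ok
```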
Application Recovery
- Deploy application to DR environment (or activate standby)
- Update DNS / load balancer to route traffic to DR
- Verify: health checks, smoke tests, monitor error rates (a smoke-test sketch follows this list)
- Notify: update status page, inform stakeholders
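For the verification step, a small scripted smoke test gives a repeatable pass/fail answer instead of ad-hoc checks. A sketch assuming HTTP services probed with the requests library; the base URL and endpoint paths are placeholders.

```python
# Hypothetical smoke test run after traffic is pointed at the DR environment.
import sys
import requests

DR_BASE_URL = "https://dr.example.internal"    # placeholder
SMOKE_CHECKS = ["/healthz", "/api/v1/status"]  # placeholder endpoints

def smoke_test(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if every smoke endpoint answers HTTP 200 within the timeout."""
    failures = []
    for path in SMOKE_CHECKS:
        try:
            resp = requests.get(base_url + path, timeout=timeout)
            if resp.status_code != 200:
                failures.append(f"{path} -> HTTP {resp.status_code}")
        except requests.RequestException as exc:
            failures.append(f"{path} -> {exc}")
    for failure in failures:
        print("FAIL", failure)
    return not failures

if __name__ == "__main__":
    sys.exit(0 if smoke_test(DR_BASE_URL) else 1)
```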
Returning to Normal (Failback)
- Confirm primary environment is healthy and has capacity
- Sync data from DR back to primary (if needed)
- Gradually shift traffic back (canary → 50/50 → 100%; see the ramp sketch after this list)
- Deactivate DR environment or return to standby
- Post-incident review: what worked, what didn't, update this plan
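The gradual traffic shift is easier to execute consistently if the stages and abort condition are written down as code rather than decided on the fly. A hedged sketch where `set_primary_weight` and `error_rate` are stubs for your load balancer and metrics integrations.

```python
# Hypothetical failback ramp: shift traffic back to the primary in stages,
# pausing at each stage to watch error rates before continuing.
import time

STAGES = [5, 50, 100]  # percent of traffic on the primary at each stage (canary -> 50/50 -> 100%)
SOAK_SECONDS = 600     # how long to observe each stage
ERROR_BUDGET = 0.01    # abort failback if error rate exceeds 1%

def set_primary_weight(percent: int) -> None:
    """Placeholder: update weighted DNS / load balancer configuration here."""
    print(f"Routing {percent}% of traffic to primary")

def error_rate() -> float:
    """Placeholder: query your metrics system for the current error rate."""
    return 0.0

def failback() -> bool:
    for percent in STAGES:
        set_primary_weight(percent)
        time.sleep(SOAK_SECONDS)
        if error_rate() > ERROR_BUDGET:
            print(f"Error rate too high at {percent}%, rolling back to DR")
            set_primary_weight(0)
            return False
    return True
```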
Communication
| Audience | Channel | Who Sends | Template |
|---|---|---|---|
| Engineering team | Slack #incidents | IC | DR activated for [system]. Recovery in progress. ETA: X hours. |
| Leadership | Email / Slack DM | Eng lead | DR activated. Impact: [X]. Recovery timeline: [X]. Next update: [time]. |
| Customers | Status page | Comms lead | We are experiencing a service disruption. We are working on recovery. |
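To avoid wording messages mid-incident, the templates above can also live as strings that get filled in programmatically. A minimal sketch; delivery to Slack, email, or the status page is left to your existing integrations.

```python
# Keep the notification wording fixed; only the incident-specific fields vary.
NOTIFICATIONS = {
    "engineering": "DR activated for {system}. Recovery in progress. ETA: {eta}.",
    "leadership": "DR activated. Impact: {impact}. Recovery timeline: {eta}. Next update: {next_update}.",
    "customers": "We are experiencing a service disruption. We are working on recovery.",
}

def render(audience: str, **fields: str) -> str:
    """Fill the template for one audience with incident-specific values."""
    return NOTIFICATIONS[audience].format(**fields)

# Example:
# render("engineering", system="primary database", eta="2 hours")
```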
Testing Schedule
A DR plan that isn't tested is fiction. Schedule regular tests and document results.
| Test Type | Frequency | Last Tested | Result | Next Test |
|---|---|---|---|---|
| Backup restore (Tier 1 systems) | Monthly | | Pass | |
| Failover drill (full DR activation) | Quarterly | | Pass | |
| Tabletop exercise (walk through plan) | Bi-annually | | Pass | |