SLA / SLO Definition Template
SLO targets, error budgets, alerting thresholds, SLA commitments, and escalation policies.
| Field | Details |
|---|---|
| Service | Service or product name |
| Owner | Team / person responsible |
| Last Reviewed | |
| Next Review | |
SLO Definitions
What are we promising to ourselves (SLO) and to customers (SLA)? SLOs should be tighter than SLAs — the SLO is your internal standard, the SLA is the contractual minimum.
| Metric | SLO (Internal) | SLA (External) | Measurement | Window |
|---|---|---|---|---|
| Availability | 99.95% | 99.9% | Synthetic monitoring + real user requests | Rolling 30 days |
| Latency (p50) | < 100ms | N/A | API gateway metrics | Rolling 30 days |
| Latency (p99) | < 500ms | < 1000ms | API gateway metrics | Rolling 30 days |
| Error rate | < 0.1% | < 1% | 5xx responses / total requests | Rolling 30 days |
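The rule that every SLO should be tighter than its SLA can be checked mechanically. The sketch below is one possible way to do that, not part of the template itself: the `Target` class, its field names, and the metric identifiers are all hypothetical, and the values are the example numbers from the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Target:
    """One row of the SLO table. Names and fields are illustrative only."""
    metric: str
    slo: float
    sla: Optional[float]      # None when there is no external commitment (N/A)
    higher_is_better: bool    # True for availability, False for latency and error rate


def slo_is_tighter(t: Target) -> bool:
    """Return True when the internal SLO is strictly stricter than the SLA."""
    if t.sla is None:
        return True  # nothing external to compare against
    return t.slo > t.sla if t.higher_is_better else t.slo < t.sla


# Example values copied from the table above.
TARGETS = [
    Target("availability_pct", 99.95, 99.9, higher_is_better=True),
    Target("latency_p50_ms", 100, None, higher_is_better=False),
    Target("latency_p99_ms", 500, 1000, higher_is_better=False),
    Target("error_rate_pct", 0.1, 1.0, higher_is_better=False),
]

# Metrics whose SLO is not stricter than the SLA; empty for the example values.
violations = [t.metric for t in TARGETS if not slo_is_tighter(t)]
```

A check like this could run whenever the targets change, so an SLO is never loosened past its contractual minimum unnoticed.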
Error Budget
The error budget is the gap between 100% and your SLO. It's the amount of unreliability you can "spend" on shipping fast. When the budget runs out, you slow down and focus on reliability.
| SLO | Budget (30 days) | Budget Remaining | Status |
|---|---|---|---|
| 99.95% availability | 21.6 minutes downtime | X minutes | Healthy |
| < 0.1% error rate | 30,000 errors per 30M requests | X errors | Healthy |
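The budget figures follow directly from the SLO and the window: 30 days is 43,200 minutes, and 0.05% of that is 21.6 minutes; 0.1% of 30M requests is 30,000 errors. A minimal sketch of the arithmetic, assuming a 30-day window and using hypothetical function names:

```python
def downtime_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return window_minutes * (100.0 - slo_percent) / 100.0


def error_budget_requests(max_error_rate_percent: float, total_requests: int) -> float:
    """Allowed failed requests for an error-rate SLO at a given request volume."""
    return total_requests * max_error_rate_percent / 100.0


def budget_remaining(budget: float, consumed: float) -> float:
    """Budget left after consumption, floored at zero."""
    return max(0.0, budget - consumed)


# Example values from the table above.
availability_budget = downtime_budget_minutes(99.95)         # about 21.6 minutes
request_budget = error_budget_requests(0.1, 30_000_000)      # about 30,000 errors
```

The request volume (30M) is the template's example figure; substitute the service's real traffic for the window.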
Alerting
| Alert | Condition (Burn Rate) | Time to Exhaustion | Action |
|---|---|---|---|
| Slow burn | Budget consumption 2x normal rate | Budget exhausted in ~15 days | Investigate: ticket in current sprint |
| Fast burn | Budget consumption 10x normal rate | Budget exhausted in ~3 days | Page on-call: treat as incident |
| Budget exhausted | Error budget = 0 | N/A | Deploy freeze until budget recovers or SLO is adjusted |
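A burn rate of 1x consumes the budget in exactly one window, so the time to exhaustion is the window length divided by the burn rate: 30 / 2 = 15 days and 30 / 10 = 3 days. The sketch below shows that relationship and a simple classification using the 2x and 10x thresholds from the table. The function names and return labels are hypothetical, and it assumes a full budget at the start of the 30-day window.

```python
def burn_rate(observed_error_rate: float, slo_error_rate: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return observed_error_rate / slo_error_rate


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until a full budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")  # nothing is being consumed
    return window_days / rate


def classify_burn(rate: float) -> str:
    """Map a burn rate to the alert tiers in the table above."""
    if rate >= 10:
        return "fast burn"   # page on-call
    if rate >= 2:
        return "slow burn"   # ticket in current sprint
    return "ok"


# Example: 0.2% observed errors against a 0.1% SLO is a 2x burn.
example_rate = burn_rate(0.002, 0.001)
example_days = days_to_exhaustion(example_rate)
```

Production alerting usually evaluates the burn rate over more than one lookback window to limit noise; this sketch deliberately leaves that out.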
SLA Commitments
If you have contractual SLAs with customers, document them here. SLAs without consequences are just marketing.
| Tier / Plan | SLA | Credit / Penalty | Measurement |
|---|---|---|---|
| Enterprise | 99.9% availability | 10% credit per 0.1% below SLA | Monthly, measured by provider monitoring |
| Business | 99.5% | N/A | |
| Free | No SLA | N/A | Best effort |
Delete this section if you don't have contractual SLAs.
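If you keep the Enterprise example, the credit rule "10% credit per 0.1% below SLA" can be computed as below. This is a sketch under two assumptions the template does not state: each full or partial 0.1 percentage point of shortfall counts as one step, and the total credit is capped at 100%. Check your actual contract wording before relying on either. The function and parameter names are hypothetical.

```python
import math


def sla_credit_percent(
    measured_availability: float,
    sla_availability: float = 99.9,
    step_points: float = 0.1,        # shortfall per credit step, in percentage points
    credit_per_step: float = 10.0,   # credit granted per step, in percent
    cap_percent: float = 100.0,      # assumed cap; not stated in the template
) -> float:
    """Service credit owed for a month, as a percentage of the monthly fee."""
    shortfall = sla_availability - measured_availability
    if shortfall <= 0:
        return 0.0  # SLA met, no credit
    # Round before ceil so floating-point noise (e.g. 2.00000000000003)
    # does not add a spurious extra step.
    steps = math.ceil(round(shortfall / step_points, 6))
    return min(steps * credit_per_step, cap_percent)
```

For example, a month measured at 99.7% against a 99.9% SLA is two steps short, so the sketch yields a 20% credit.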
Escalation
| Condition | Action | Who |
|---|---|---|
| SLO at risk (burn rate alert) | Investigate root cause, consider deploy freeze | On-call engineer |
| SLO breached | Incident declared, postmortem required | Eng lead |
| SLA breached | Customer communication, credit processing | Eng lead + account manager |