Service Runbook Template
Operational runbook: health checks, common procedures, rollback, dependencies, and escalation.
| Field | Details |
|---|---|
| Service | Service name |
| Status | Healthy |
| Owner | Team / person on-call |
| Repo | Link to repository |
| Dashboard | Link to primary monitoring dashboard |
| Last Updated | |
What This Service Does
One paragraph: what the service does, who depends on it, and what breaks if it goes down.
Health & Monitoring
# Health check
curl -s https://service.example.com/health | jq .
# Key metrics to watch
# - Request rate, error rate, latency (RED)
# - CPU, memory, disk, connections (USE)

| Alert | Severity | Meaning | What To Do |
|---|---|---|---|
| HighErrorRate | P1 | Error rate > X% for 5 min | Check logs, recent deploys, downstream deps |
| HighLatency | P2 | p99 > X ms for 10 min | Check DB slow queries, connection pool, cache hit rate |
| DiskSpaceLow | P2 | Disk usage > 85% | Clean logs/tmp, expand volume if recurring |
| | P3 | | |
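
When DiskSpaceLow fires, the threshold in the table can be confirmed by hand on the affected host. The following is a minimal sketch, assuming a POSIX shell with `df` and `awk` available; `disk_usage_pct` and `disk_space_low` are hypothetical helper names, and the default of 85 simply mirrors the table row.

```shell
# Hypothetical helpers for illustration; not part of any standard tool.

# disk_usage_pct PATH: print the used percentage (digits only) of the
# filesystem that holds PATH, read from POSIX `df -P` output.
disk_usage_pct() {
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# disk_space_low PCT [THRESHOLD]: exit 0 when PCT is strictly greater than
# THRESHOLD (default 85, matching the DiskSpaceLow alert), otherwise exit 1.
disk_space_low() {
  [ "$1" -gt "${2:-85}" ]
}

# Example call (the path is a placeholder):
#   disk_space_low "$(disk_usage_pct /var/log)" && echo "clean logs/tmp"
```

Keeping the comparison separate from the `df` parsing makes it possible to reuse the same check with a different threshold, for example `disk_space_low 72 70`.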
Common Procedures
Restart
# Graceful restart (zero-downtime if behind load balancer)
# Adapt to your setup: systemctl, kubectl, docker, ECS
kubectl rollout restart deployment/service-name -n namespace
Scale
# Scale up
kubectl scale deployment/service-name --replicas=N -n namespace
# Verify
kubectl get pods -n namespace -l app=service-name
Check Logs
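# Recent errors (--tail=-1 is needed because kubectl returns only the last 10 lines per pod when a label selector is used)
kubectl logs -l app=service-name -n namespace --since=1h --tail=-1 | grep -i error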
# Or via centralized logging
# Link to Kibana/Loki/CloudWatch query: [paste link]
Database
-- Check connections
SELECT count(*) FROM pg_stat_activity WHERE datname = 'dbname';
-- Slow queries (requires the pg_stat_statements extension; the column is mean_exec_time in PostgreSQL 13+, mean_time in earlier versions)
SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;
Cache
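# Clear cache (destructive: FLUSHDB deletes every key in the selected database, so run it only if the service can repopulate the cache)
redis-cli -h cache.example.com FLUSHDB
# Check memory
redis-cli -h cache.example.com INFO memory
Rollback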
# Roll back to previous version
# Adapt: kubectl, helm, ECS, Vercel, etc.
kubectl rollout undo deployment/service-name -n namespace
# Verify
kubectl rollout status deployment/service-name -n namespace
Dependencies
| Dependency | What For | Health Check | If It's Down |
|---|---|---|---|
| Database | Primary data store | pg_isready / connection test | Service cannot function; page immediately |
| Cache | Session / hot data | redis-cli ping | Service degrades but works, with higher latency |
| Upstream API | External data | GET /health | Feature X unavailable; circuit breaker should trip |
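
During triage it can help to run every health check from the table in one pass and get a single status line per dependency. The following is a minimal sketch, assuming a POSIX shell; `check_dep` is a hypothetical helper, and the hostnames in the example calls are placeholders to replace with real endpoints.

```shell
# check_dep NAME CMD [ARGS...]: run a probe command quietly and print one
# status line. Hypothetical helper for illustration only.
# Returns 0 when the probe succeeds and 1 when it fails.
check_dep() {
  name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "$name: ok"
  else
    echo "$name: DOWN"
    return 1
  fi
}

# Example calls (hostnames are placeholders):
#   check_dep database pg_isready -h db.example.com
#   check_dep cache redis-cli -h cache.example.com ping
#   check_dep upstream-api curl -fsS https://api.example.com/health
```

The helper relies only on each probe's exit status, so any command that exits non-zero on failure can be plugged in without changing the function.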
Access & Secrets
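Link to where the credentials, connection strings, and API keys needed for emergency access are stored (for example, a secrets manager); avoid pasting secret values directly into this runbook.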
Access procedure: who can grant emergency access, and how (Vault, AWS IAM, manual).
Escalation
| Severity | Who | How | When |
|---|---|---|---|
| P1 (service down) | On-call → Eng lead → VP Eng | Page immediately | 0 min |
| P2 (degraded) | On-call → Eng lead | Slack + page if no ack in 15 min | 15 min |
| P3 (minor) | On-call | Slack | Next business day |
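
The rules in the table can also be encoded so that alert tooling or a chat bot applies them consistently. The following is a minimal sketch, assuming a POSIX shell; `escalation_action` is a hypothetical helper that only prints which channel to use (`page` or `slack`) and does not send anything, and it treats "no ack in 15 min" as 15 or more minutes without acknowledgement.

```shell
# escalation_action SEVERITY [MINUTES_WITHOUT_ACK]
# Hypothetical helper: prints the notification channel implied by the
# escalation table. Returns 2 for an unrecognized severity.
escalation_action() {
  severity=$1
  unacked_min=${2:-0}
  case "$severity" in
    P1)
      # Service down: page immediately.
      echo "page"
      ;;
    P2)
      # Degraded: Slack first, page once 15 minutes pass without an ack.
      if [ "$unacked_min" -ge 15 ]; then
        echo "page"
      else
        echo "slack"
      fi
      ;;
    P3)
      # Minor: Slack only, handled the next business day.
      echo "slack"
      ;;
    *)
      echo "unknown severity: $severity" >&2
      return 2
      ;;
  esac
}
```

For example, `escalation_action P2 20` prints `page`, while `escalation_action P2 5` prints `slack`.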