OPERATIONS / DEVOPS / SRE TEMPLATE

Service Runbook Template

Operational runbook: health checks, common procedures, rollback, dependencies, and escalation.

What's inside

| Field        | Details                              |
|--------------|--------------------------------------|
| Service      | Service name                         |
| Status       | Healthy                              |
| Owner        | Team / person on-call                |
| Repo         | Link to repository                   |
| Dashboard    | Link to primary monitoring dashboard |
| Last Updated |                                      |

What This Service Does

One paragraph: what the service does, who depends on it, and what breaks if it goes down.

Health & Monitoring

# Health check
curl -s https://service.example.com/health | jq .

# Key metrics to watch
# - Request rate, error rate, latency (RED)
# - CPU, memory, disk, connections (USE)
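
A health check is easier to script when it yields a verdict rather than JSON to read by eye. The sketch below is one way to wrap it, assuming the health endpoint answers with a 2xx status when the service is healthy. The function names and the 5-second timeout are illustrative choices, not part of the template.

```shell
# Hypothetical helpers: function names and the 5s timeout are assumptions.

# Map an HTTP status code to a health verdict.
classify_status() {
  case "$1" in
    2??) echo "healthy" ;;
    000) echo "unreachable" ;;   # curl reports 000 when it received no response
    *)   echo "unhealthy" ;;
  esac
}

# Fetch only the status code of a health endpoint; give up after 5 seconds.
check_health() {
  local url="$1" code
  code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$url") || code="000"
  classify_status "$code"
}
```

Example: `check_health https://service.example.com/health` prints one of `healthy`, `unhealthy`, or `unreachable`.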

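| Alert         | Severity | Meaning                  | What To Do                                            |
|---------------|----------|--------------------------|-------------------------------------------------------|
| HighErrorRate | P1       | Error rate > X% for 5min | Check logs, recent deploys, downstream deps           |
| HighLatency   | P2       | p99 > Xms for 10min      | Check DB slow queries, connection pool, cache hit rate |
| DiskSpaceLow  | P2       | Disk > 85%               | Clean logs/tmp, expand volume if recurring            |
|               | P3       |                          |                                                       |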
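
When responding to DiskSpaceLow it can help to confirm the number by hand before cleaning anything. The following is a minimal sketch, assuming a POSIX `df` and `awk` on the host. The function names are hypothetical, and the 85 threshold simply mirrors the alert row above.

```shell
# Print the used-space percentage (integer, no % sign) for a mount point.
# `df -P` forces the portable one-line-per-filesystem layout; column 5 is capacity.
disk_used_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) only when the value is strictly above the threshold.
over_threshold() {
  [ "$1" -gt "$2" ]
}
```

Example: `over_threshold "$(disk_used_pct /var/log)" 85 && echo "disk alert would fire"`.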

Common Procedures

Restart

# Graceful restart (zero-downtime if behind load balancer)
# Adapt to your setup: systemctl, kubectl, docker, ECS

kubectl rollout restart deployment/service-name -n namespace
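
`kubectl rollout restart` returns as soon as the restart is requested, not when it has finished. A possible wrapper that also waits for completion is sketched below. The function name and the 120s default timeout are assumptions, so adjust both to your deployment.

```shell
# Restart a deployment, then block until the rollout completes or times out.
# Arguments: deployment name, namespace, optional timeout (default 120s, an assumed value).
restart_and_wait() {
  local deploy="$1" ns="$2" timeout="${3:-120s}"
  kubectl rollout restart "deployment/$deploy" -n "$ns" &&
    kubectl rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout"
}
```

Example: `restart_and_wait service-name namespace 300s`. A non-zero exit means either the restart request failed or the rollout did not finish within the timeout.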

Scale

# Scale up
kubectl scale deployment/service-name --replicas=N -n namespace

# Verify
kubectl get pods -n namespace -l app=service-name

Check Logs

# Recent errors
kubectl logs -l app=service-name -n namespace --since=1h | grep -i error

# Or via centralized logging
# Link to Kibana/Loki/CloudWatch query: [paste link]
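
A raw `grep -i error` can produce thousands of lines during an incident. One option is to group identical messages and show the most frequent first, as in the sketch below. It assumes each log line starts with an ISO-8601-style timestamp, which may not match your log format. `top_errors` is a hypothetical helper name.

```shell
# Read log lines on stdin and print the most frequent error messages first.
# The sed expression strips a leading timestamp such as 2024-01-01T00:00:00Z
# (an assumed format) so that identical messages group together.
# Optional argument: how many distinct messages to show (default 10).
top_errors() {
  grep -i 'error' |
    sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}[T ][0-9:.]+Z?[[:space:]]*//' |
    sort | uniq -c | sort -rn | head -n "${1:-10}"
}
```

Example: `kubectl logs -l app=service-name -n namespace --since=1h | top_errors 5`.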

Database

-- Check connections
SELECT count(*) FROM pg_stat_activity WHERE datname = 'dbname';

-- Slow queries (requires the pg_stat_statements extension; times are in milliseconds)
SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;
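
A raw connection count is more useful next to the configured limit. The sketch below shows one way to compute that ratio from the shell. It assumes `psql` can reach the database through the usual `PG*` environment variables or a `.pgpass` file, and both function names are hypothetical.

```shell
# Integer percentage of a limit that is in use, rounded down.
pct_of() {
  echo $(( $1 * 100 / $2 ))
}

# Connections to the given database as a percentage of max_connections.
# -A (unaligned) and -t (tuples only) make psql print just the bare value.
conn_usage_pct() {
  local db="$1" used max
  used=$(psql -At -d "$db" -c "SELECT count(*) FROM pg_stat_activity WHERE datname = current_database();") || return 1
  max=$(psql -At -d "$db" -c "SHOW max_connections;") || return 1
  pct_of "$used" "$max"
}
```

Example: `conn_usage_pct dbname`. A value that stays near 100 usually points at a leaking or undersized connection pool.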

Cache

# Clear cache (only if safe: FLUSHDB deletes every key in the selected database)
redis-cli -h cache.example.com FLUSHDB

# Check memory
redis-cli -h cache.example.com INFO memory
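
Two small helpers may make the cache steps less blunt: reading a single field out of `INFO`, and deleting only the keys that match a pattern rather than the whole database. This is a sketch under assumptions: the function names are hypothetical, `UNLINK` needs Redis 4.0 or later, `xargs -r` is assumed to be available, and keys are assumed to contain no whitespace or quotes.

```shell
# Extract one field from `redis-cli INFO` output read on stdin.
# INFO lines look like "key:value" and end in CRLF, so the \r is stripped first.
info_field() {
  tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2 }'
}

# Delete only keys matching a glob pattern.
# --scan iterates with SCAN (non-blocking); UNLINK frees memory in the background.
# Assumes key names contain no whitespace or quote characters.
delete_by_pattern() {
  local host="$1" pattern="$2"
  redis-cli -h "$host" --scan --pattern "$pattern" |
    xargs -r -n 100 redis-cli -h "$host" UNLINK
}
```

Example: `redis-cli -h cache.example.com INFO memory | info_field used_memory_human` prints just the human-readable memory figure.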

Rollback

# Roll back to previous version
# Adapt: kubectl, helm, ECS, Vercel, etc.

kubectl rollout undo deployment/service-name -n namespace

# Verify
kubectl rollout status deployment/service-name -n namespace
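
`kubectl rollout undo` without arguments returns to the immediately preceding revision, which is not always the last known-good one. The sketch below optionally targets a specific revision and then waits for the rollout to settle. The function name and the 120s timeout are assumptions.

```shell
# Roll back to a given revision if one is supplied, else to the previous revision,
# then wait for the rollout to finish.
# List available revisions first with:
#   kubectl rollout history deployment/service-name -n namespace
rollback() {
  local deploy="$1" ns="$2" rev="${3:-}"
  if [ -n "$rev" ]; then
    kubectl rollout undo "deployment/$deploy" -n "$ns" --to-revision="$rev"
  else
    kubectl rollout undo "deployment/$deploy" -n "$ns"
  fi &&
    kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=120s
}
```

Example: `rollback service-name namespace 3` rolls back to revision 3, and `rollback service-name namespace` rolls back one step.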

Dependencies

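| Dependency   | What For           | Health Check                 | If It's Down                                       |
|--------------|--------------------|------------------------------|----------------------------------------------------|
| Database     | Primary data store | pg_isready / connection test | Service cannot function; page immediately          |
| Cache        | Session / hot data | redis-cli ping               | Service degrades but works, with higher latency    |
| Upstream API | External data      | GET /health                  | Feature X unavailable; circuit breaker should trip |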
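
The health checks in the dependency table can be run in one pass during triage. The sketch below is one possible shape. The hostnames `db.example.com` and `upstream.example.com` are placeholders that do not appear in the template, and the function names and timeouts are assumptions.

```shell
# Run one dependency check and print a one-line verdict.
# Usage: check_dep <name> <command...>
check_dep() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $name"
  else
    echo "DOWN $name"
    return 1
  fi
}

# Check every dependency; exit status is 1 if any of them is down.
# Hostnames are placeholders, replace them with your own.
check_all() {
  local failed=0
  check_dep "database"     pg_isready -h db.example.com -t 3 || failed=1
  check_dep "cache"        sh -c 'redis-cli -h cache.example.com ping | grep -q PONG' || failed=1
  check_dep "upstream-api" curl -sf --max-time 5 https://upstream.example.com/health || failed=1
  return "$failed"
}
```

Running `check_all` prints one `OK` or `DOWN` line per dependency, which maps directly onto the "If It's Down" column.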

Access & Secrets

Link to the credentials, connection strings, and API keys needed for emergency access. Prefer a link to your secrets manager over pasting secret values into the runbook itself.

Access procedure: who can grant emergency access, and how (Vault, AWS IAM, manual).

Escalation

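| Severity         | Who                         | How                             | When              |
|------------------|-----------------------------|---------------------------------|-------------------|
| P1: service down | On-call → Eng lead → VP Eng | Page immediately                | 0 min             |
| P2: degraded     | On-call → Eng lead          | Slack + page if no ack in 15min | 15 min            |
| P3: minor        | On-call                     | Slack                           | Next business day |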
