Service Runbook — Template

Field	Details
Service	Service name
Status	Healthy
Owner	Team / person on-call
Repo	Link to repository
Dashboard	Link to primary monitoring dashboard
Last Updated	YYYY-MM-DD

What This Service Does

One paragraph: what the service does, who depends on it, and what breaks if it goes down.

Health & Monitoring

# Health check
curl -s https://service.example.com/health | jq .

# Key metrics to watch
# - Request rate, error rate, latency (RED)
# - CPU, memory, disk, connections (USE)
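Right after a restart or deploy, a single probe of the health endpoint can fail simply because the service is still starting. The sketch below wraps any check in a bounded retry loop. `retry` is a hypothetical helper name, not part of any tool, and the example assumes the health endpoint returns a non-2xx status while unhealthy.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final one
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (placeholder URL): curl -f turns HTTP error statuses into a non-zero exit
# retry 5 3 curl -fsS -o /dev/null https://service.example.com/health
```

Passing the check as a command keeps the helper reusable for other probes, such as a database or cache check.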

Alert	Severity	Meaning	What To Do
HighErrorRate	P1	Error rate > X% for 5min	Check logs, recent deploys, downstream deps
HighLatency	P2	p99 > Xms for 10min	Check DB slow queries, connection pool, cache hit rate
DiskSpaceLow	P2	Disk usage > 85%	Clean logs/tmp, expand volume if recurring
	P3

Common Procedures

Restart

# Graceful restart (zero-downtime if behind load balancer)
# Adapt to your setup: systemctl, kubectl, docker, ECS

kubectl rollout restart deployment/service-name -n namespace

Scale

# Scale to N replicas (up or down)
kubectl scale deployment/service-name --replicas=N -n namespace

# Verify
kubectl get pods -n namespace -l app=service-name
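Reading the pod list by eye works for a handful of replicas. For scripted verification, one option is to compare the deployment's ready and desired counts. This is a minimal sketch: `replicas_ready` is a hypothetical helper, and the deployment and namespace names are the placeholders used above.

```shell
# replicas_ready "READY/DESIRED"
# Succeeds only when READY equals DESIRED and DESIRED is greater than zero.
replicas_ready() {
  local ready="${1%%/*}" desired="${1##*/}"
  # readyReplicas is omitted from the status when no pod is ready
  ready="${ready:-0}"
  [ -n "$desired" ] && [ "$desired" -gt 0 ] && [ "$ready" -eq "$desired" ]
}

# Example (placeholder names):
# counts=$(kubectl get deployment/service-name -n namespace \
#   -o jsonpath='{.status.readyReplicas}/{.spec.replicas}')
# replicas_ready "$counts" && echo "all replicas ready"
```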

Check Logs

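# Recent errors (--tail=-1 because kubectl shows only the last 10 lines per pod when a label selector is used)
kubectl logs -l app=service-name -n namespace --since=1h --tail=-1 | grep -i error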

# Or via centralized logging
# Link to Kibana/Loki/CloudWatch query: [paste link]
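During triage it can help to reduce the log stream to a single number that can be compared before and after a deploy. The following is a rough sketch: `count_errors` is a hypothetical helper, and matching on the substring "error" is a crude filter that will also count lines such as `errors=0`.

```shell
# count_errors
# Reads log lines on stdin and prints how many contain "error" (case-insensitive).
# grep -c exits with status 1 when there are zero matches, so mask that status.
count_errors() {
  grep -ci 'error' || true
}

# Example (placeholder names):
# kubectl logs -l app=service-name -n namespace --since=1h --tail=-1 | count_errors
```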

Database

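-- Check connections
SELECT count(*) FROM pg_stat_activity WHERE datname = 'dbname';

-- Slow queries (requires the pg_stat_statements extension; mean_exec_time is named mean_time before PostgreSQL 13)
SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;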

Cache

# Clear cache (FLUSHDB deletes every key in the selected database; run only if the cache can be rebuilt safely)
redis-cli -h cache.example.com FLUSHDB

# Check memory
redis-cli -h cache.example.com INFO memory
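`INFO memory` prints many `name:value` lines, and in an incident usually only one or two matter. A small sketch for pulling out a single field: `redis_info_field` is a hypothetical helper, and `used_memory_human` is one of the standard fields in the memory section.

```shell
# redis_info_field FIELD
# Reads `redis-cli INFO` output on stdin and prints the value of FIELD.
# INFO lines look like "name:value" and end in CRLF, so strip the \r first.
redis_info_field() {
  tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2 }'
}

# Example (placeholder host):
# redis-cli -h cache.example.com INFO memory | redis_info_field used_memory_human
```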

Rollback

# Roll back to previous version
# Adapt: kubectl, helm, ECS, Vercel, etc.

kubectl rollout undo deployment/service-name -n namespace

# Verify
kubectl rollout status deployment/service-name -n namespace

Dependencies

Dependency	What For	Health Check	If It's Down
Database	Primary data store	pg_isready / connection test	Service cannot function — page immediately
Cache	Session / hot data	redis-cli ping	Service degrades but works — higher latency
Upstream API	External data	GET /health	Feature X unavailable — circuit breaker should trip
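The health checks in the table can be run in one pass at the start of an incident. The sketch below assumes each check is a command whose exit status reflects health. `check_dep` is a hypothetical helper, and the hostnames in the example are placeholders that do not come from this template.

```shell
# check_dep NAME COMMAND...
# Runs COMMAND quietly and prints "NAME: up" or "NAME: down".
# Returns non-zero when the dependency is down.
check_dep() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "$name: up"
  else
    echo "$name: down"
    return 1
  fi
}

# Example (placeholder hosts):
# check_dep database pg_isready -h db.example.com
# check_dep upstream curl -fsS https://upstream.example.com/health
```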

Access & Secrets

Link to where the credentials, connection strings, and API keys needed for emergency access are stored (e.g. a secrets manager). Avoid pasting secret values directly into the runbook.

Access procedure: who can grant emergency access, and how (Vault, AWS IAM, manual).

Escalation

Severity	Who	How	When
P1 — service down	On-call → Eng lead → VP Eng	Page immediately	0 min
P2 — degraded	On-call → Eng lead	Slack + page if no ack in 15min	15 min
P3 — minor	On-call	Slack	Next business day
