Operational Excellence
SLO Starter Pack
Pick 2–3 signals that reflect real customer pain. Use these to start small, stay observable, and iterate.
Sample SLOs
- Availability: 99.5% per 30 days for core API/checkout.
- Latency: p95 < 600 ms at peak traffic for /checkout and /order.
- Error rate: at least 99% of write-path requests succeed; alert when errors stay at or above 1%.
Availability math
Allowed downtime (minutes) = (1 - SLO) * days * 24 * 60, with SLO written as a fraction (99.5% = 0.995).
Example: 99.5% over 30 days = 0.005 * 30 * 24 * 60 = 216 minutes of allowed downtime.
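The formula above can be sketched in a few lines of Python. The function name allowed_downtime_minutes is made up for illustration; it assumes the SLO is passed as a fraction (0.995, not 99.5).

```python
def allowed_downtime_minutes(slo: float, days: int) -> float:
    """Downtime budget in minutes for an availability SLO given as a fraction."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1], e.g. 0.995")
    # (1 - SLO) is the share of the window that may be unavailable.
    return (1.0 - slo) * days * 24 * 60


budget = allowed_downtime_minutes(0.995, 30)
print(f"{budget:.0f} minutes")  # 216 minutes
```

Swapping in 0.999 for the same 30-day window shrinks the budget to about 43 minutes, which is a quick way to see what a tighter target costs.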
Quick setup
- Dashboards for each SLO: target vs. actual, plus error-budget burn.
- Alert on SLO symptoms, not infra noise.
- Review weekly: burn, top incidents, upcoming risky changes.
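One common way to put a number on "burn" is the burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below assumes that definition; the function name and the sample counts are hypothetical, not taken from any particular monitoring tool.

```python
def burn_rate(failed: int, total: int, slo: float) -> float:
    """How fast the error budget is being spent (1.0 = exactly on budget)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    observed_error_rate = failed / total
    allowed_error_rate = 1.0 - slo
    return observed_error_rate / allowed_error_rate


# Hypothetical window: 10,000 requests, 100 failures, 99.5% target.
rate = burn_rate(failed=100, total=10_000, slo=0.995)
print(f"burn rate: {rate:.1f}x")
```

A value near 1 means the budget lasts about the full SLO window; a sustained value near 2 means it would run out in roughly half the window.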
Latency SLO tips
- Use p95 or p99; avoid averages.
- Split read vs. write paths if they differ.
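To see why averages mislead, here is a small sketch using a nearest-rank percentile over raw samples. The latency values are invented for illustration, and a production system would more likely compute percentiles from histogram buckets.

```python
import math
from statistics import mean


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Invented sample: 90 fast requests (100 ms) and 10 slow ones (3000 ms).
latencies_ms = [100.0] * 90 + [3000.0] * 10

print(f"mean: {mean(latencies_ms):.0f} ms")
print(f"p95:  {percentile(latencies_ms, 95):.0f} ms")
```

In this sample the mean is 390 ms and looks fine against a 600 ms target, while p95 is 3000 ms: one request in ten is painfully slow and the average hides it.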
Error SLO tips
- Exclude client-canceled and rate-limited requests when they do not affect customers.
- Focus on critical paths (checkout, payments, auth).
Minimal data you need
- Request count, success/fail, latency histogram, timestamps.
- Exclude health checks if they inflate request counts.