Operational Excellence

SLO Starter Pack

Pick 2–3 signals that reflect real customer pain. Use these to start small, stay observable, and iterate.

Sample SLOs

Availability: 99.5% over 30 days for the core API and checkout.
Latency: p95 < 600 ms at peak traffic for /checkout and /order.
Error rate: at least 99% success on write paths; alert when errors stay at or above 1%.

Availability math

Allowed downtime (minutes) = (1 - SLO) * days * 24 * 60, with the SLO written as a fraction (99.5% = 0.995).

99.5% over 30 days = 0.005 * 43,200 minutes = 216 minutes of allowed downtime.
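
As a quick check of the arithmetic above, here is a minimal sketch in Python. The function name allowed_downtime_minutes is hypothetical, and it assumes the SLO is passed as a fraction rather than a percentage.

```python
def allowed_downtime_minutes(slo: float, days: int) -> float:
    """Error budget in minutes for an availability SLO.

    slo is a fraction (0.995 means 99.5%); days is the window length.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * days * 24 * 60


budget = allowed_downtime_minutes(0.995, 30)
print(round(budget))  # 216
```

The same function shows how quickly the budget shrinks as the target tightens: 99.9% over 30 days leaves about 43 minutes.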

Quick setup

Build a dashboard for each SLO: target vs. actual, plus error-budget burn.
Alert on SLO symptoms, not infra noise.
Review weekly: burn, top incidents, upcoming risky changes.
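
The "burn" figure on a dashboard can be computed in several ways. One common convention, assumed here and not prescribed by this page, is the burn rate: the observed error ratio divided by the error ratio the SLO allows. The function name and the sample counts below are illustrative only.

```python
def burn_rate(failed: int, total: int, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    1.0 means the budget is being spent at exactly the pace that
    uses it up by the end of the window; 2.0 means twice that pace.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic, nothing burned
    allowed_error_ratio = 1.0 - slo
    return (failed / total) / allowed_error_ratio


# Hypothetical counts: 10 failures out of 1,000 requests against a 99.5% SLO.
rate = burn_rate(failed=10, total=1000, slo=0.995)
print(f"{rate:.1f}")  # 2.0
```

A sustained rate above 1.0 is the kind of SLO symptom worth alerting on, since it means the budget will run out before the window ends.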

Latency SLO tips

Use p95 or p99; avoid averages.
Split read vs. write paths if they differ.
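
To see why averages mislead, consider a made-up sample where 10% of requests are very slow. This sketch uses a nearest-rank percentile written by hand; the helper name and the latency values are assumptions for illustration.

```python
def percentile(samples, pct: int):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be an integer from 1 to 100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[rank - 1]


# 90 fast requests and 10 slow ones (milliseconds).
latencies_ms = [100] * 90 + [2000] * 10

mean = sum(latencies_ms) / len(latencies_ms)
p95 = percentile(latencies_ms, 95)

print(mean)  # 290.0
print(p95)   # 2000
```

The mean sits comfortably under a 600 ms target while one request in ten takes two seconds; p95 exposes that.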

Error SLO tips

Exclude client-canceled and rate-limited requests if they do not affect customers.
Focus on critical paths (checkout, payments, auth).
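
A minimal sketch of the exclusion rule, assuming HTTP status codes are available per request. The choice of 429 for rate-limited and 499 for client-canceled (an nginx convention), and counting only 5xx as failures, are assumptions to adjust for your stack.

```python
# Assumption: 429 = rate-limited, 499 = client closed the request (nginx convention).
EXCLUDED_STATUSES = {429, 499}


def success_ratio(status_counts: dict) -> float:
    """Fraction of counted requests that did not fail.

    status_counts maps HTTP status code -> number of requests.
    Excluded statuses are dropped from both numerator and denominator;
    only 5xx responses count as failures (an assumption).
    """
    counted = {s: n for s, n in status_counts.items() if s not in EXCLUDED_STATUSES}
    total = sum(counted.values())
    if total == 0:
        return 1.0  # no counted traffic, nothing failed
    failed = sum(n for s, n in counted.items() if s >= 500)
    return 1 - failed / total


# Hypothetical counts for a write path over one window.
sample = {200: 980, 429: 15, 499: 5, 500: 10, 503: 10}
print(f"{success_ratio(sample):.2%}")  # 98.00%
```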

Minimal data you need

Request count, success/fail, latency histogram, timestamps.
Exclude health checks if they inflate request counts.
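
One possible shape for that data, sketched as a per-window record. The class name, field names, and bucket boundaries are all hypothetical; most metrics systems provide an equivalent histogram type.

```python
from bisect import bisect_left
from dataclasses import dataclass, field

# Hypothetical upper bounds (ms) for latency buckets; the last bucket is overflow.
BUCKET_BOUNDS_MS = [50, 100, 200, 400, 600, 1000, 2000]


@dataclass
class WindowStats:
    start_ts: float  # window start, seconds since the epoch
    requests: int = 0
    failures: int = 0
    latency_buckets: list = field(
        default_factory=lambda: [0] * (len(BUCKET_BOUNDS_MS) + 1)
    )

    def record(self, latency_ms: float, ok: bool, is_health_check: bool = False) -> None:
        """Count one request; health checks are skipped entirely."""
        if is_health_check:
            return
        self.requests += 1
        if not ok:
            self.failures += 1
        # Bucket i holds latencies at or below BUCKET_BOUNDS_MS[i].
        self.latency_buckets[bisect_left(BUCKET_BOUNDS_MS, latency_ms)] += 1


stats = WindowStats(start_ts=0.0)
stats.record(120, ok=True)
stats.record(750, ok=False)
stats.record(5, ok=True, is_health_check=True)  # ignored
print(stats.requests, stats.failures)  # 2 1
```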
