Operational Excellence
Alert Design Cheatsheet
Design alerts that surface real customer impact and reduce on-call fatigue. Alert on symptoms, not infrastructure noise.
🎯 Core Principles
👥 Alert on User Symptoms
Focus on what customers experience: errors, slow pages, failed transactions. Avoid alerting on CPU, memory, or disk alone.
📊 Use SLO-Based Thresholds
Tie alerts to Service Level Objectives. Fire when error rate, latency, or availability breach defined targets over a time window.
🔗 Always Link Runbooks
Every alert must include a runbook URL with clear diagnosis steps, mitigation actions, and escalation paths.
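The SLO-based-thresholds principle is easiest to reason about in terms of an error budget: the fraction of requests (or minutes) the SLO allows you to fail. A minimal sketch, assuming a 99.9% availability target over a 30-day window (the function name and values are illustrative):

```python
# Sketch: convert an SLO availability target into a monthly error budget.
# The 99.9% target and 30-day window are illustrative assumptions.

def error_budget_minutes(slo_target: float, days: int = 30) -> float:
    """Minutes of full-downtime equivalent the SLO allows over the period."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo_target)

budget = error_budget_minutes(0.999)  # 99.9% over 30 days
print(f"Monthly error budget: {budget:.1f} minutes")  # → 43.2 minutes
```

Alert thresholds then fall out of the budget: a condition should page only when it is consuming a meaningful share of that 43-minute allowance.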
⚡ Example Alerts with Thresholds
Current Error Rate: 4.2%
Target: < 1% | Duration: 8 minutes | Impact: Payment failures
Alert Rule:
IF error_rate(checkout_api) > 1%
FOR 5 minutes
THEN page on-call
RUNBOOK: /runbooks/checkout-high-errors
Why this threshold? The SLO target is 99.9% availability, i.e. a 0.1% error budget. A 1% error rate burns that budget at 10× the sustainable rate; sustained for 5+ minutes, it indicates a systemic issue, not a transient blip.
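The rationale above can be checked with a quick burn-rate calculation (observed error rate divided by the rate the SLO allows). A sketch using the example's numbers; the function name is my own:

```python
# Sketch: how fast a 1% error rate burns a 99.9% SLO's error budget.
# Burn rate = observed error rate / error rate the SLO allows.

def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    allowed = 1 - slo_target           # 0.1% for a 99.9% SLO
    return observed_error_rate / allowed

rate = burn_rate(0.01, 0.999)          # 1% errors vs. a 0.1% budget
print(f"Burn rate: {rate:.0f}x")       # → 10x: monthly budget gone in ~3 days
```

At 10× burn, a 30-day budget lasts about 3 days, which is why a few minutes of this condition justifies a page.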
Current p95 Latency: 820ms
Target: < 600ms | Duration: 12 minutes | Impact: Slow page loads
Alert Rule:
IF p95_latency(api/orders) > 600ms
FOR 10 minutes
DURING peak_hours (9am-6pm)
THEN notify on-call (non-page)
RUNBOOK: /runbooks/api-latency-spike
Why this threshold? SLO is p95 < 600ms during business hours. Sustained degradation affects user experience and may cascade to other services.
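For reference, p95 is the latency below which 95% of requests in the window fall. A minimal nearest-rank computation over raw samples (a sketch with made-up sample values; production systems typically use histograms or quantile sketches instead of sorting raw samples):

```python
# Sketch: p95 latency from a window of samples (nearest-rank method).
import math

def p95(latencies_ms: list[float]) -> float:
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank, 1-based
    return ordered[rank - 1]

# Illustrative 10-sample window:
samples = [120, 180, 250, 400, 610, 820, 95, 300, 150, 210]
print(p95(samples))  # → 820
```

With this window the alert condition `p95 > 600ms` would hold, and if it persisted for 10 minutes during peak hours, on-call would be notified (non-page).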
Current Queue Depth: 3,200 messages
Target: < 1,000 | Duration: 15 minutes | Impact: Delayed order processing
Alert Rule:
IF queue_depth(order_processing) > 1000
FOR 15 minutes
AND age_of_oldest_message > 5 minutes
THEN notify on-call
RUNBOOK: /runbooks/queue-backlog
Why this threshold? Queue depth above 1,000 combined with aging messages indicates processing lag rather than a momentary burst, and correlates with delayed customer notifications and order confirmations.
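The compound condition above (depth AND age, sustained for the full window) can be sketched as a simple evaluator. The thresholds follow the example; the class and sample data are illustrative:

```python
# Sketch: a compound queue alert fires only when BOTH conditions hold
# for every sample in the evaluation window, filtering transient spikes.
from dataclasses import dataclass

@dataclass
class QueueSample:
    depth: int              # messages currently in the queue
    oldest_age_min: float   # age of the oldest message, in minutes

def should_notify(window: list[QueueSample],
                  max_depth: int = 1000,
                  max_age_min: float = 5.0) -> bool:
    """True only if every sample in the window breaches both thresholds."""
    return all(s.depth > max_depth and s.oldest_age_min > max_age_min
               for s in window)

# 15-minute window sampled once a minute, every sample breaching:
window = [QueueSample(depth=3200, oldest_age_min=8.0)] * 15
print(should_notify(window))   # → True

# A single healthy sample resets the condition:
window[7] = QueueSample(depth=900, oldest_age_min=1.0)
print(should_notify(window))   # → False
```

Requiring the age condition as well as depth is what keeps a brief publish burst (deep queue, fresh messages) from paging anyone.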
🛡️ Alert Guardrails
- Quiet Hours: Batch non-critical alerts (8pm-8am). Page only for user-impacting SLO breaches.
- Deduplication: Group similar errors by root cause. Cap pages at 2 per hour to avoid alert fatigue.
- Escalation Path: Primary on-call → Secondary (15 min) → Incident Commander (30 min) if unresolved.
- Auto-Resolution: Close alerts automatically when metrics return to normal for 5+ minutes.
📘 Example Runbook: Checkout API High Error Rate
⚠️ Quick Context: This alert fires when checkout API error rate exceeds 1% for 5+ minutes, indicating payment failures affecting customers.
Check the monitoring dashboard to confirm the error rate and affected endpoints. Identify whether all regions are impacted or the issue is isolated to specific zones.
Dashboard: https://monitoring.example.com/checkout-api
Look for: Error rate trend, affected endpoints, HTTP status codes (500, 503, 504)
Verify health of payment gateway, database, and authentication service. Common causes: payment provider outage, database connection pool exhaustion, auth service timeout.
kubectl get pods -n payment-gateway
Check DB connections: SELECT count(*) FROM pg_stat_activity;
Payment gateway status: https://status.stripe.com
Check if any deployments occurred in the last 2 hours. Look for configuration changes, feature flags, or infrastructure updates.
Recent deploys: kubectl rollout history deployment/checkout-api
Feature flags: curl https://flags.internal/checkout
Last change: git log --since="2 hours ago" --oneline
If recent deployment: Rollback to last known good version immediately.
If dependency issue: Enable circuit breaker or failover to backup region.
If database saturation: Scale connection pool or add read replicas.
Rollback: kubectl rollout undo deployment/checkout-api
Circuit breaker: kubectl set env deployment/checkout-api CIRCUIT_BREAKER=true
Scale: kubectl scale deployment/checkout-api --replicas=10
Post an incident alert in the #incidents Slack channel. Update the status page if customer-facing impact persists for more than 5 minutes. Monitor the error rate for 10 minutes to confirm resolution.
Slack: Post to #incidents with INC-ID, impact, and ETA
Status page: https://status.example.com/admin
Monitor: Watch dashboard for error rate < 0.5% sustained
If unresolved after 15 minutes: Escalate to Secondary On-Call (@secondary-oncall)
If unresolved after 30 minutes: Page Incident Commander and notify Head of Engineering
If payment gateway outage: Contact vendor support and enable maintenance mode
📝 Post-Incident: Create postmortem ticket within 24 hours. Schedule review meeting. Document any new failure modes discovered.