
Alert Design Cheatsheet

Design alerts that surface real customer impact and reduce on-call fatigue. Alert on symptoms, not infrastructure noise.

🎯 Core Principles
👥 Alert on User Symptoms
Focus on what customers experience: errors, slow pages, failed transactions. Avoid alerting on CPU, memory, or disk alone.
📊 Use SLO-Based Thresholds
Tie alerts to Service Level Objectives. Fire when error rate, latency, or availability breaches its defined target over a time window.
🔗 Always Link Runbooks
Every alert must include a runbook URL with clear diagnosis steps, mitigation actions, and escalation paths.
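To make the SLO principle concrete: a 99.9% availability target over a 30-day month leaves a 0.1% error budget, roughly 43 minutes of total downtime. A minimal sketch of the arithmetic (illustrative figures, not tied to any specific service):

```python
# Error-budget math for a 99.9% monthly availability SLO (illustrative figures).
SLO_TARGET = 0.999
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

# Total budget in "error-minutes": minutes of 100% failure the SLO tolerates.
error_budget_minutes = (1 - SLO_TARGET) * MINUTES_PER_MONTH  # ~43.2

def burn_rate(error_rate: float, slo_target: float = SLO_TARGET) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    return error_rate / (1 - slo_target)

# At a 1% error rate the budget burns ~10x faster than the SLO allows,
# which is why even a few minutes at 1% is worth paging on.
```

Thresholds derived this way are defensible in review: the number comes from the SLO, not from a gut feeling.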
⚡ Example Alerts with Thresholds
🔴 Availability: Checkout API Error Rate [CRITICAL]
Current Error Rate: 4.2% | Target: < 1% | Duration: 8 minutes | Impact: Payment failures
Alert Rule:
IF error_rate(checkout_api) > 1%
FOR 5 minutes
THEN page on-call
RUNBOOK: /runbooks/checkout-high-errors
Why this threshold? The SLO target is 99.9% availability, leaving a 0.1% error budget. A 1% error rate burns that budget at roughly 10x the sustainable pace, so 5+ minutes of it indicates a systemic issue, not a transient blip.
🟡 Latency: API Response Time (p95) [WARNING]
Current p95 Latency: 820ms | Target: < 600ms | Duration: 12 minutes | Impact: Slow page loads
Alert Rule:
IF p95_latency(api/orders) > 600ms
FOR 10 minutes
DURING peak_hours (9am-6pm)
THEN notify on-call (non-page)
RUNBOOK: /runbooks/api-latency-spike
Why this threshold? SLO is p95 < 600ms during business hours. Sustained degradation affects user experience and may cascade to other services.
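p95 means 95% of requests complete faster than the reported value, so a handful of slow outliers barely moves it, while broad degradation pushes it over the target. A quick sketch of the nearest-rank percentile calculation (sample values are made up):

```python
import math

def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank p95: the value below which 95% of sorted samples fall."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# A few slow outliers among fast requests leave p95 under the 600ms target,
# but widespread slowness pushes it over.
healthy = [120.0] * 95 + [900.0] * 5
degraded = [700.0] * 50 + [120.0] * 50
assert p95(healthy) <= 600
assert p95(degraded) > 600
```

This is why p95 is a better paging signal than the maximum or the mean: it tracks the experience of most users, not the worst single request.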
🟠 Saturation: Message Queue Backlog [WARNING]
Current Queue Depth: 3,200 messages | Target: < 1,000 | Duration: 15 minutes | Impact: Delayed order processing
Alert Rule:
IF queue_depth(order_processing) > 1000
FOR 15 minutes
AND age_of_oldest_message > 5 minutes
THEN notify on-call
RUNBOOK: /runbooks/queue-backlog
Why this threshold? A queue depth above 1,000 combined with aging messages indicates processing lag, which shows up to customers as delayed notifications and order confirmations.
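Note the AND on message age in the rule above: depth alone can be a red herring, since a burst into a queue that consumers are draining quickly is healthy. A minimal sketch of the compound check (parameter names and defaults mirror this rule but are otherwise illustrative):

```python
def queue_backlog_alert(depth: int, oldest_age_s: float,
                        max_depth: int = 1000, max_age_s: float = 300) -> bool:
    """Alert only when the queue is both deep AND draining too slowly."""
    return depth > max_depth and oldest_age_s > max_age_s

# Deep but fresh: a burst the consumers are keeping up with -> no alert.
assert queue_backlog_alert(depth=3200, oldest_age_s=30) is False
# Deep and stale: genuine processing lag -> alert.
assert queue_backlog_alert(depth=3200, oldest_age_s=420) is True
```

Requiring both signals is a cheap way to cut false pages from bursty but healthy traffic.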
📘 Example Runbook: Checkout API High Error Rate
Service: Checkout API | Severity: Critical | SLO Impact: 99.9% Availability | On-Call Team: Platform SRE
⚠️ Quick Context: This alert fires when checkout API error rate exceeds 1% for 5+ minutes, indicating payment failures affecting customers.
1. Verify Impact & Scope
Check the monitoring dashboard to confirm error rate and affected endpoints. Identify if all regions are impacted or isolated to specific zones.
Dashboard: https://monitoring.example.com/checkout-api
Look for: Error rate trend, affected endpoints, HTTP status codes (500, 503, 504)
2. Check Upstream Dependencies
Verify health of payment gateway, database, and authentication service. Common causes: payment provider outage, database connection pool exhaustion, auth service timeout.
kubectl get pods -n payment-gateway
Check DB connections: SELECT count(*) FROM pg_stat_activity;
Payment gateway status: https://status.stripe.com
3. Review Recent Changes
Check if any deployments occurred in the last 2 hours. Look for configuration changes, feature flags, or infrastructure updates.
Recent deploys: kubectl rollout history deployment/checkout-api
Feature flags: curl https://flags.internal/checkout
Last change: git log --since="2 hours ago" --oneline
4. Immediate Mitigation
If recent deployment: Rollback to last known good version immediately.
If dependency issue: Enable circuit breaker or failover to backup region.
If database saturation: Scale connection pool or add read replicas.
Rollback: kubectl rollout undo deployment/checkout-api
Circuit breaker: kubectl set env deployment/checkout-api CIRCUIT_BREAKER=true
Scale: kubectl scale deployment/checkout-api --replicas=10
5. Communicate & Monitor
Post incident alert in #incidents Slack channel. Update status page if customer-facing impact persists > 5 minutes. Monitor error rate for 10 minutes to confirm resolution.
Slack: Post to #incidents with INC-ID, impact, and ETA
Status page: https://status.example.com/admin
Monitor: Watch dashboard for error rate < 0.5% sustained
6. Escalation Path
If unresolved after 15 minutes: Escalate to Secondary On-Call (@secondary-oncall)
If unresolved after 30 minutes: Page Incident Commander and notify Head of Engineering
If payment gateway outage: Contact vendor support and enable maintenance mode
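The time-based ladder above can be sketched as a simple lookup (targets taken from this runbook; wire it into your paging tool rather than hand-rolling timers in production):

```python
def escalation_target(minutes_unresolved: int) -> str:
    """Map time-since-alert to the next escalation step from the runbook."""
    if minutes_unresolved >= 30:
        return "Incident Commander + Head of Engineering"
    if minutes_unresolved >= 15:
        return "Secondary On-Call (@secondary-oncall)"
    return "Primary On-Call"

assert escalation_target(10) == "Primary On-Call"
assert escalation_target(20) == "Secondary On-Call (@secondary-oncall)"
assert escalation_target(45) == "Incident Commander + Head of Engineering"
```

Encoding the ladder explicitly (in a paging policy or as config) avoids ambiguity about who gets paged when an incident drags on.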
📝 Post-Incident: Create postmortem ticket within 24 hours. Schedule review meeting. Document any new failure modes discovered.