Operational Excellence
Alert Design Cheatsheet
Design alerts that surface real customer impact and reduce on-call fatigue. Alert on symptoms, not infrastructure noise.
🎯 Core Principles
👥 Alert on User Symptoms
Focus on what customers experience: errors, slow pages, failed transactions. Avoid alerting on CPU, memory, or disk alone.
📊 Use SLO-Based Thresholds
Tie alerts to Service Level Objectives. Fire when error rate, latency, or availability breach defined targets over a time window.
🔗 Always Link Runbooks
Every alert must include a runbook URL with clear diagnosis steps, mitigation actions, and escalation paths.
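The SLO-based-thresholds principle is easiest to reason about in terms of an error budget: the fraction of requests (or minutes) the SLO allows you to fail. A minimal sketch, assuming a 99.9% availability target over a 30-day window (the function name and values are illustrative):

```python
# Sketch: convert an SLO availability target into a monthly error budget.
# The 99.9% target and 30-day window are illustrative assumptions.

def error_budget_minutes(slo_target: float, days: int = 30) -> float:
    """Minutes of full-downtime equivalent the SLO allows over the period."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo_target)

budget = error_budget_minutes(0.999)  # 99.9% over 30 days
print(f"Monthly error budget: {budget:.1f} minutes")  # → 43.2 minutes
```

Alert thresholds then fall out of the budget: a condition should page only when it is consuming a meaningful share of that 43-minute allowance.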
⚡ Example Alerts with Thresholds
Current Error Rate: 4.2%
Target: < 1% | Duration: 8 minutes | Impact: Payment failures
Alert Rule:
IF error_rate(checkout_api) > 1%
FOR 5 minutes
THEN page on-call
RUNBOOK: /runbooks/checkout-high-errors
Why this threshold? The SLO target is 99.9% availability, i.e. a 0.1% error budget. A 1% error rate burns that budget at 10× the sustainable rate; sustained for 5+ minutes, it indicates a systemic issue, not a transient blip.
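The rationale above can be checked with a quick burn-rate calculation (observed error rate divided by the rate the SLO allows). A sketch using the example's numbers; the function name is my own:

```python
# Sketch: how fast a 1% error rate burns a 99.9% SLO's error budget.
# Burn rate = observed error rate / error rate the SLO allows.

def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    allowed = 1 - slo_target           # 0.1% for a 99.9% SLO
    return observed_error_rate / allowed

rate = burn_rate(0.01, 0.999)          # 1% errors vs. a 0.1% budget
print(f"Burn rate: {rate:.0f}x")       # → 10x: monthly budget gone in ~3 days
```

At 10× burn, a 30-day budget lasts about 3 days, which is why a few minutes of this condition justifies a page.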
Current p95 Latency: 820ms
Target: < 600ms | Duration: 12 minutes | Impact: Slow page loads
Alert Rule:
IF p95_latency(api/orders) > 600ms
FOR 10 minutes
DURING peak_hours (9am-6pm)
THEN notify on-call (non-page)
RUNBOOK: /runbooks/api-latency-spike
Why this threshold? SLO is p95 < 600ms during business hours. Sustained degradation affects user experience and may cascade to other services.
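For reference, p95 is the latency below which 95% of requests in the window fall. A minimal nearest-rank computation over raw samples (a sketch with made-up sample values; production systems typically use histograms or quantile sketches instead of sorting raw samples):

```python
# Sketch: p95 latency from a window of samples (nearest-rank method).
import math

def p95(latencies_ms: list[float]) -> float:
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank, 1-based
    return ordered[rank - 1]

# Illustrative 10-sample window:
samples = [120, 180, 250, 400, 610, 820, 95, 300, 150, 210]
print(p95(samples))  # → 820
```

With this window the alert condition `p95 > 600ms` would hold, and if it persisted for 10 minutes during peak hours, on-call would be notified (non-page).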
Current Queue Depth: 3,200 messages
Target: < 1,000 | Duration: 15 minutes | Impact: Delayed order processing
Alert Rule:
IF queue_depth(order_processing) > 1000
FOR 15 minutes
AND age_of_oldest_message > 5 minutes
THEN notify on-call
RUNBOOK: /runbooks/queue-backlog
Why this threshold? Queue depth above 1,000 combined with aging messages indicates processing lag rather than a momentary burst, and correlates with delayed customer notifications and order confirmations.
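The compound condition above (depth AND age, sustained for the full window) can be sketched as a simple evaluator. The thresholds follow the example; the class and sample data are illustrative:

```python
# Sketch: a compound queue alert fires only when BOTH conditions hold
# for every sample in the evaluation window, filtering transient spikes.
from dataclasses import dataclass

@dataclass
class QueueSample:
    depth: int              # messages currently in the queue
    oldest_age_min: float   # age of the oldest message, in minutes

def should_notify(window: list[QueueSample],
                  max_depth: int = 1000,
                  max_age_min: float = 5.0) -> bool:
    """True only if every sample in the window breaches both thresholds."""
    return all(s.depth > max_depth and s.oldest_age_min > max_age_min
               for s in window)

# 15-minute window sampled once a minute, every sample breaching:
window = [QueueSample(depth=3200, oldest_age_min=8.0)] * 15
print(should_notify(window))   # → True

# A single healthy sample resets the condition:
window[7] = QueueSample(depth=900, oldest_age_min=1.0)
print(should_notify(window))   # → False
```

Requiring the age condition as well as depth is what keeps a brief publish burst (deep queue, fresh messages) from paging anyone.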
🛡️ Alert Guardrails
- Quiet Hours: Batch non-critical alerts (8pm-8am). Page only for user-impacting SLO breaches.
- Deduplication: Group similar errors by root cause. Cap pages at 2 per hour to avoid alert fatigue.
- Escalation Path: Primary on-call → Secondary (15 min) → Incident Commander (30 min) if unresolved.
- Auto-Resolution: Close alerts automatically when metrics return to normal for 5+ minutes.
📘 Example Runbook: Checkout API High Error Rate
⚠️ Quick Context: This alert fires when checkout API error rate exceeds 1% for 5+ minutes, indicating payment failures affecting customers.
Check the monitoring dashboard to confirm the error rate and affected endpoints. Identify whether all regions are impacted or the issue is isolated to specific zones.
Dashboard: https://monitoring.example.com/checkout-api
Look for: Error rate trend, affected endpoints, HTTP status codes (500, 503, 504)
Verify health of payment gateway, database, and authentication service. Common causes: payment provider outage, database connection pool exhaustion, auth service timeout.
kubectl get pods -n payment-gateway
Check DB connections: SELECT count(*) FROM pg_stat_activity;
Payment gateway status: https://status.stripe.com
Check if any deployments occurred in the last 2 hours. Look for configuration changes, feature flags, or infrastructure updates.
Recent deploys: kubectl rollout history deployment/checkout-api
Feature flags: curl https://flags.internal/checkout
Last change: git log --since="2 hours ago" --oneline
If recent deployment: Rollback to last known good version immediately.
If dependency issue: Enable circuit breaker or failover to backup region.
If database saturation: Scale connection pool or add read replicas.
Rollback: kubectl rollout undo deployment/checkout-api
Circuit breaker: kubectl set env deployment/checkout-api CIRCUIT_BREAKER=true
Scale: kubectl scale deployment/checkout-api --replicas=10
Post an incident alert in the #incidents Slack channel. Update the status page if customer-facing impact persists for more than 5 minutes. Monitor the error rate for 10 minutes to confirm resolution.
Slack: Post to #incidents with INC-ID, impact, and ETA
Status page: https://status.example.com/admin
Monitor: Watch dashboard for error rate < 0.5% sustained
If unresolved after 15 minutes: Escalate to Secondary On-Call (@secondary-oncall)
If unresolved after 30 minutes: Page Incident Commander and notify Head of Engineering
If payment gateway outage: Contact vendor support and enable maintenance mode
📝 Post-Incident: Create postmortem ticket within 24 hours. Schedule review meeting. Document any new failure modes discovered.