Leadership Intelligence
Define clear Service Level Agreements (SLAs) to set expectations with product teams and stakeholders about platform availability, support, and feature delivery.
Target: 99.95% uptime (at most ~22 minutes of downtime per 30-day month)
Scope: Core platform services including CI/CD pipelines, Kubernetes clusters, service mesh, secrets management, and observability infrastructure.
Measurement: Calculated as the percentage of time during which all critical platform services are operational and accessible.
Exclusions:
Target: 99.9% uptime during business hours (8am-8pm ET, Mon-Fri)
Scope: CI/CD pipelines, artifact registries, deployment automation tools.
Consequence: If the SLA is missed, the platform team provides a root cause analysis within 48 hours and a remediation plan within 5 business days.
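The two uptime targets above translate into concrete downtime budgets. The following is a minimal sketch of that arithmetic, not a prescribed measurement method. The function names are hypothetical, the all-hours target is evaluated over a 30-day month, and the business-hours window (8am-8pm, Mon-Fri) is counted without excluding holidays, which is an assumption.

```python
import calendar
from datetime import date


def downtime_budget_minutes(target_pct: float, window_minutes: float) -> float:
    """Minutes of downtime allowed in a window at the given uptime target."""
    return window_minutes * (1 - target_pct / 100)


def business_minutes_in_month(year: int, month: int,
                              start_hour: int = 8, end_hour: int = 20) -> int:
    """Minutes inside the Mon-Fri business-hours window for one calendar month.

    Assumption: holidays are not excluded.
    """
    days_in_month = calendar.monthrange(year, month)[1]
    weekdays = sum(
        1 for day in range(1, days_in_month + 1)
        if date(year, month, day).weekday() < 5  # 0-4 are Mon-Fri
    )
    return weekdays * (end_hour - start_hour) * 60


# 99.95% over a 30-day month (43,200 minutes)
print(round(downtime_budget_minutes(99.95, 30 * 24 * 60), 1))  # 21.6

# 99.9% over the business-hours window of June 2025 (21 weekdays)
june_window = business_minutes_in_month(2025, 6)
print(june_window)                                        # 15120
print(round(downtime_budget_minutes(99.9, june_window), 2))  # 15.12
```

Because the business-hours window is much smaller than the full month, the 99.9% target allows fewer absolute minutes of downtime (about 15) than the 99.95% all-hours target (about 22).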
| Severity | Definition | Response Time | Resolution Target |
|---|---|---|---|
| P0 Critical | Complete platform outage blocking all production deployments | 15 minutes | 2 hours |
| P1 High | Partial platform degradation affecting multiple teams or critical workflows | 1 hour | 4 hours |
| P2 Medium | Single team blocked or non-critical feature unavailable | 4 business hours | 2 business days |
| P3 Low | Questions, feature requests, documentation improvements | 1 business day | 5 business days |
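The response times in the table mix wall-clock targets (P0, P1) with business-hour targets (P2). A minimal sketch of how a response deadline could be computed is shown below. It assumes naive datetimes already expressed in ET and a Monday-Friday, 8am-6pm support window with no holiday calendar. All names are hypothetical. P3 and the resolution targets are left out because "business day" would need a precise definition first.

```python
from datetime import datetime, time, timedelta

# Assumed support window (Mon-Fri, 8am-6pm ET); holidays are not modeled.
SUPPORT_OPEN, SUPPORT_CLOSE = time(8), time(18)

# severity -> (response target, counted in business hours?)
RESPONSE_TARGETS = {
    "P0": (timedelta(minutes=15), False),
    "P1": (timedelta(hours=1), False),
    "P2": (timedelta(hours=4), True),
}


def add_business_time(start: datetime, amount: timedelta) -> datetime:
    """Advance `start` by `amount`, counting only time inside the support window."""
    current, remaining = start, amount
    while True:
        # Weekend or after closing: jump to the next day's opening time.
        if current.weekday() >= 5 or current.time() >= SUPPORT_CLOSE:
            current = datetime.combine(current.date() + timedelta(days=1), SUPPORT_OPEN)
            continue
        # Before opening on a weekday: wait until the window opens.
        if current.time() < SUPPORT_OPEN:
            current = datetime.combine(current.date(), SUPPORT_OPEN)
        end_of_day = datetime.combine(current.date(), SUPPORT_CLOSE)
        available = end_of_day - current
        if remaining <= available:
            return current + remaining
        remaining -= available
        current = end_of_day


def response_deadline(severity: str, opened_at: datetime) -> datetime:
    """Latest time by which a first response is due for the given severity."""
    target, business_time = RESPONSE_TARGETS[severity]
    if business_time:
        return add_business_time(opened_at, target)
    return opened_at + target


# P0 opened Saturday 03:00: the clock runs continuously.
print(response_deadline("P0", datetime(2025, 6, 7, 3, 0)))   # 2025-06-07 03:15:00
# P2 opened Friday 16:00: 2 hours on Friday, 2 more on Monday.
print(response_deadline("P2", datetime(2025, 6, 6, 16, 0)))  # 2025-06-09 10:00:00
```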
On-call coverage: 24/7 for P0/P1 incidents
Standard support: Monday-Friday, 8am-6pm ET (P2/P3 requests)
Office hours: Every Wednesday, 2pm-4pm ET for drop-in questions and pairing sessions
Initial review: All feature requests triaged within 3 business days
Prioritization framework:
Commitment: Platform roadmap published quarterly with feature estimates
Status updates: All in-progress features receive biweekly status updates via Slack or email
Delivery estimates: Platform team provides delivery estimate within 5 business days of prioritization
Launch announcements: All major features announced via #platform-updates Slack channel and monthly all-hands
| Activity | Timeline | Owner |
|---|---|---|
| Initial platform orientation session | Within 2 business days of request | Platform team |
| Development environment setup assistance | Within 1 business day of request | Platform team (pairing session) |
| First production deployment (with support) | Within 5 business days of onboarding start | Platform + Product team |
| Follow-up check-in | 1 week after first deployment | Platform team |
Notification: Minimum 5 business days advance notice for planned downtime
Window: Maintenance performed during low-traffic periods (typically Saturday 2am-6am ET)
Communication: Detailed change plan shared via email and Slack #platform-updates
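The 5-business-day notice rule can be checked mechanically when scheduling a maintenance window. The sketch below shows one possible interpretation: count back five weekdays from the maintenance date, ignoring holidays. The function names are hypothetical.

```python
from datetime import date, timedelta


def subtract_business_days(day: date, n: int) -> date:
    """Step back `n` weekdays (Mon-Fri) from `day`; holidays are not modeled."""
    current, remaining = day, n
    while remaining > 0:
        current -= timedelta(days=1)
        if current.weekday() < 5:
            remaining -= 1
    return current


def latest_notice_date(maintenance_day: date, notice_business_days: int = 5) -> date:
    """Last date on which a notice still gives the required lead time."""
    return subtract_business_days(maintenance_day, notice_business_days)


# Maintenance on Saturday, 14 June 2025 must be announced by Monday, 9 June.
print(latest_notice_date(date(2025, 6, 14)))  # 2025-06-09
```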
Deprecation notice: Minimum 90 days before removing deprecated features
Migration support: Platform team provides migration guides and pairing sessions
Backward compatibility: The platform maintains backward compatibility for a minimum of 6 months, except where a security-critical change requires otherwise
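As a small illustration of the 90-day deprecation notice, the sketch below computes the earliest date a deprecated feature could be removed and checks a proposed removal date against it. The names are hypothetical, and the 90 days are counted as calendar days from the announcement, which is an assumption.

```python
from datetime import date, timedelta

DEPRECATION_NOTICE = timedelta(days=90)  # calendar days, per the policy above


def earliest_removal_date(announced_on: date) -> date:
    """First date on which removal satisfies the minimum notice period."""
    return announced_on + DEPRECATION_NOTICE


def removal_respects_notice(announced_on: date, removal_on: date) -> bool:
    """True if the proposed removal date gives at least the minimum notice."""
    return removal_on >= earliest_removal_date(announced_on)


print(earliest_removal_date(date(2025, 1, 15)))                      # 2025-04-15
print(removal_respects_notice(date(2025, 1, 15), date(2025, 4, 1)))  # False
```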
Monthly SLA reports: Published first Monday of each month, including:
Quarterly business reviews: Presented to VP Eng and CTO, covering:
If SLA missed:
Continuous improvement: SLAs reviewed and adjusted quarterly based on platform maturity and organizational needs.
Learn from actual scenarios showing how platform teams handle SLA violations, communicate with stakeholders, and implement lasting improvements: