SLO/SLI Explained: A Practical Guide for Engineering Teams
2025-12-10 · 5 min read
The problem with "we target 99.9% uptime"
Most teams say they target high availability, but few can say what that means day to day: which requests count, over what time window, and how much failure is acceptable. SLOs fix this by making reliability a concrete, measurable goal.
Key concepts
**SLI (Service Level Indicator)** — a metric that measures reliability
Example: % of requests that succeed (status < 500)
**SLO (Service Level Objective)** — your target for an SLI
Example: 99.5% of requests succeed over a 30-day window
**Error budget** — how much failure you're allowed
Example: 0.5% of requests can fail, which is equivalent to ~3.6 hours of full downtime in a 30-day window
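To make the error budget arithmetic concrete, here is a minimal Python sketch that converts an SLO target and window into allowed downtime. The function name `error_budget_hours` is illustrative, and the conversion assumes every failure shows up as full downtime, which is a simplification for request-based SLIs.

```python
def error_budget_hours(slo_target: float, window_days: int = 30) -> float:
    """Return the hours of full downtime an SLO allows over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_hours = window_days * 24
    # The budget is whatever share of the window the SLO does not cover.
    return (1.0 - slo_target) * window_hours


if __name__ == "__main__":
    for target in (0.995, 0.999):
        hours = error_budget_hours(target)
        print(f"{target:.1%} over 30 days allows {hours:.2f} hours")
```

For a 99.5% target this gives 3.6 hours over 30 days. Tightening the target to 99.9% leaves 0.72 hours, roughly 43 minutes.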
Choose the right SLIs
Not every metric is a good SLI. Good SLIs measure what users actually experience:
| Service type | Good SLI |
|---|---|
| API | Request success rate (non-5xx / total) |
| API | P99 latency < 500ms |
| Batch job | Job completion rate |
| Data pipeline | Freshness (last update < 1 hour ago) |
| Storage | Read success rate |
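Most of the SLIs in the table reduce to the same shape: good events divided by valid events. The sketch below shows that shape in Python for a latency SLI and a freshness check. All function names are hypothetical, and treating "no traffic" as fully compliant is an assumption you may want to change.

```python
from datetime import datetime, timedelta


def ratio_sli(good_events: int, valid_events: int) -> float:
    """Fraction of valid events that were good."""
    if valid_events == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good_events / valid_events


def latency_sli(latencies_ms: list[float], threshold_ms: float = 500.0) -> float:
    """Share of requests served faster than the threshold."""
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return ratio_sli(fast, len(latencies_ms))


def is_fresh(last_update: datetime, now: datetime,
             max_age: timedelta = timedelta(hours=1)) -> bool:
    """Freshness check for a pipeline: was the last update recent enough?"""
    return now - last_update < max_age


if __name__ == "__main__":
    print(latency_sli([120.0, 480.0, 650.0, 90.0]))
```

Expressing every SLI as a ratio keeps the error budget math identical across services, whether the "event" is a request, a batch job, or a freshness probe.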
PromQL for a basic SLI
# Success rate over 5min window
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
SLO-based alerting (the right way)
Instead of alerting on the SLI directly (too noisy), alert on **error budget burn rate**:
# Alert if burning error budget 14x faster than sustainable
# (Will exhaust 30-day budget in ~50 hours at this rate)
- alert: HighErrorBudgetBurnRate  # example name; pick your own
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > 14 * (1 - 0.995)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High error budget burn rate"
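The numbers in that rule are easier to trust once you see where they come from. The following Python sketch works through the burn rate arithmetic for a 99.5% SLO over a 30-day window. The helper names are illustrative and not part of any library.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def alert_threshold(factor: float, slo_target: float) -> float:
    """Error ratio at which the burn rate reaches `factor`."""
    return factor * (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # nothing is burning, so the budget never runs out
    return window_days * 24 / rate


if __name__ == "__main__":
    threshold = alert_threshold(14, 0.995)
    print(f"alert fires above an error ratio of {threshold:.3f}")
    print(f"budget lasts {hours_to_exhaustion(14):.1f} hours at 14x")
```

With these inputs the alert threshold works out to an error ratio of 0.07, so the rule fires when about 7% of requests over the past hour return 5xx. At a constant 14x burn, a full 30-day budget lasts 720 / 14 ≈ 51.4 hours, in line with the estimate in the rule's comment.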
Error budget in practice
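When your error budget is healthy (plenty remaining), keep shipping features at your normal pace; that is what the budget is for.
When the error budget is low or exhausted, slow down: pause feature releases and prioritize reliability work until the budget recovers.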
This creates a natural feedback loop: teams that deploy recklessly burn their error budget and lose the ability to ship new features. Teams that are too conservative have budget to spend on velocity.
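One way to make this feedback loop explicit is a small policy check that maps remaining budget to a release decision. The sketch below is a hypothetical example: the 25% low-water mark and the three outcome labels are assumptions for illustration, not a standard.

```python
def budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of the window's error budget still unspent (can go negative)."""
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total
    actual_failures = total - good
    return 1.0 - actual_failures / allowed_failures


def release_policy(remaining: float, low_water_mark: float = 0.25) -> str:
    """Map remaining budget to a decision; the thresholds are illustrative."""
    if remaining <= 0.0:
        return "freeze"   # budget exhausted: reliability work only
    if remaining < low_water_mark:
        return "caution"  # budget low: slow down risky changes
    return "ship"         # budget healthy: normal release pace


if __name__ == "__main__":
    remaining = budget_remaining(good=998_000, total=1_000_000, slo_target=0.995)
    print(f"{remaining:.0%} of budget left -> {release_policy(remaining)}")
```

Here 2,000 failures out of 1,000,000 requests against a 99.5% SLO spends 40% of the budget, so the policy still says "ship". Agreeing on the thresholds before an incident matters more than the exact values.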
Start simple
You don't need a perfect SLO framework from day one:
1. Pick 1–2 critical user journeys
2. Define one SLI per journey
3. Set a realistic SLO (look at the last 90 days of actual data)
4. Build a Grafana dashboard showing current SLO compliance
5. Add one burn rate alert
Start there. Expand as your team gets comfortable with the process.
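Step 3 is the one teams most often guess at, so here is a rough Python sketch of deriving a starting SLO from historical counts. It assumes you can export daily good and total request counts for the last 90 days. The function name and the candidate targets are hypothetical choices.

```python
from typing import Optional, Sequence


def suggest_slo(daily_good: Sequence[int], daily_total: Sequence[int],
                candidates: Sequence[float] = (0.99, 0.995, 0.999, 0.9995),
                ) -> Optional[float]:
    """Pick the strictest candidate target the service already met."""
    total = sum(daily_total)
    if total == 0:
        raise ValueError("no traffic in the observation period")
    observed = sum(daily_good) / total
    # Only consider targets the historical data shows you can already hit.
    met = [c for c in candidates if c <= observed]
    return max(met) if met else None


if __name__ == "__main__":
    good = [9_970] * 90    # made-up data: 99.7% success every day
    total = [10_000] * 90
    print(suggest_slo(good, total))
```

With the made-up data above, the observed success ratio is 99.7%, so the sketch suggests 99.5% rather than 99.9%. Starting from a target you already meet leaves room to tighten it later instead of beginning with a permanently exhausted budget.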