
SLO/SLI Explained: A Practical Guide for Engineering Teams

2025-12-10 · 5 min read

The problem with "we target 99.9% uptime"

Most teams say they target high availability, but few can say what that target means for day-to-day decisions. SLOs fix this by turning reliability into a concrete, measurable goal.

Key concepts

**SLI (Service Level Indicator)** — a metric that measures reliability

Example: % of requests that succeed (status < 500)

**SLO (Service Level Objective)** — your target for an SLI

Example: 99.5% of requests succeed over a 30-day window

**Error budget** — how much failure you're allowed

Example: 0.5% of requests can fail, which at steady traffic equals ~3.6 hours of full downtime per 30-day month
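The 3.6-hour figure follows directly from the SLO and the window length. Here is a minimal sketch of that arithmetic; the function name `error_budget` is hypothetical, and the conversion to downtime assumes traffic is spread evenly over the window:

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Return the full-outage time an availability SLO tolerates over a window.

    Assumes evenly distributed traffic, so a share of failed requests
    maps directly onto a share of the window.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.995")
    return window * (1.0 - slo)


budget = error_budget(0.995, timedelta(days=30))
print(round(budget.total_seconds() / 3600, 2))  # 3.6
```

Tightening the target to 99.9% over the same window shrinks the budget to about 43 minutes, which is why each extra "nine" is expensive.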

Choose the right SLIs

Not every metric is a good SLI. Good SLIs measure what users actually experience:

| Service type | Good SLI |
|---|---|
| API | Request success rate (non-5xx / total) |
| API | P99 latency < 500ms |
| Batch job | Job completion rate |
| Data pipeline | Freshness (last update < 1 hour ago) |
| Storage | Read success rate |
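Most of the SLIs in the table reduce to the same shape: good events divided by total events. The sketch below illustrates this for the two API rows. The `Request` record and both function names are hypothetical, and the latency SLI is expressed as the share of requests under the 500 ms threshold rather than as a percentile:

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time to serve the request


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that did not return a 5xx status."""
    if not requests:
        raise ValueError("need at least one request")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float = 500.0) -> float:
    """Fraction of requests served faster than the threshold."""
    if not requests:
        raise ValueError("need at least one request")
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 480.0),
    Request(503, 30.0),   # fast, but a server error
    Request(404, 650.0),  # not a 5xx, but slow
]
print(availability_sli(sample))  # 0.75
print(latency_sli(sample))       # 0.75
```

Note that the two SLIs disagree about which requests were bad, which is one reason to track them as separate objectives.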

PromQL for a basic SLI

    # Success rate over a 5-minute window: non-5xx requests divided by all requests
    sum(rate(http_requests_total{status!~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))

SLO-based alerting (the right way)

Instead of alerting on the SLI directly (too noisy), alert on **error budget burn rate**:

    # Alert if burning error budget 14x faster than sustainable
    # (Will exhaust 30-day budget in ~50 hours at this rate)
    alert: HighErrorBudgetBurn
    expr: |
      (
        sum(rate(http_requests_total{status=~"5.."}[1h]))
        /
        sum(rate(http_requests_total[1h]))
      ) > 14 * (1 - 0.995)
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "High error budget burn rate"
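The numbers in that rule can be checked by hand. Burn rate is the observed error ratio divided by the ratio the SLO allows, and the time to exhaustion is the window divided by the burn rate. Here is a small sketch of that arithmetic, with hypothetical helper names and the assumption that the burn rate stays constant and the budget starts full:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_hours: float = 30 * 24) -> float:
    """Hours until a full budget is gone, assuming a constant burn rate."""
    if rate <= 0:
        raise ValueError("burn rate must be positive")
    return window_hours / rate


SLO = 0.995
threshold = 14 * (1 - SLO)  # error ratio at which the alert fires

print(round(threshold, 4))                 # 0.07
print(round(burn_rate(0.07, SLO), 1))      # 14.0
print(round(hours_to_exhaustion(14), 1))   # 51.4
```

So the alert fires once 7% of requests fail over the last hour, and at that pace a 30-day budget would last a little over two days.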

Error budget in practice

When your error budget is healthy (plenty remaining):

- Shipping velocity > reliability work
- Experiments and risky deploys are fine

When error budget is low or exhausted:

- Freeze non-critical deploys
- Focus on reliability improvements
- Investigate incident root causes

This creates a natural feedback loop: teams that deploy recklessly burn their error budget and lose the ability to ship new features. Teams that are too conservative end up with unspent budget, which is a signal that they can afford to ship faster.
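The policy above can be written down as a simple rule so that it is applied consistently. This is an illustrative sketch only: the function names, the policy labels, and the 25% "low budget" cutoff are assumptions, not values from any standard:

```python
def budget_remaining(good: int, total: int, slo: float) -> float:
    """Fraction of the error budget left in the window (negative if overspent)."""
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_bad = total * (1.0 - slo)
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def deploy_policy(remaining: float, low_water_mark: float = 0.25) -> str:
    """Map remaining budget to a policy; the 25% cutoff is an arbitrary example."""
    if remaining < low_water_mark:
        return "freeze-non-critical"
    return "ship-freely"


# 1,000 failures against an allowance of 5,000: 80% of the budget is left.
print(deploy_policy(budget_remaining(999_000, 1_000_000, 0.995)))  # ship-freely
# 4,000 failures against the same allowance: only 20% is left.
print(deploy_policy(budget_remaining(996_000, 1_000_000, 0.995)))  # freeze-non-critical
```

Agreeing on the cutoff in advance matters more than its exact value, since it removes the debate from the middle of an incident.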

Start simple

You don't need a perfect SLO framework from day one:

1. Pick 1–2 critical user journeys

2. Define one SLI per journey

3. Set a realistic SLO (look at the last 90 days of actual data)

4. Build a Grafana dashboard showing current SLO compliance

5. Add one burn rate alert
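For step 3, one simple way to ground the target in history is to compute what the service actually achieved and pick the strictest round target it already meets. The sketch below assumes you can export daily good and total request counts; the function names and the candidate list are hypothetical:

```python
def achieved_sli(daily_good: list[int], daily_total: list[int]) -> float:
    """Overall success ratio across all days in the sample."""
    total = sum(daily_total)
    if total == 0:
        raise ValueError("no traffic in the sample")
    return sum(daily_good) / total


def suggest_slo(
    achieved: float,
    candidates: tuple[float, ...] = (0.9, 0.95, 0.99, 0.995, 0.999, 0.9999),
) -> float:
    """Strictest candidate target that past performance already satisfies."""
    met = [c for c in candidates if c <= achieved]
    if not met:
        raise ValueError("history is below every candidate target")
    return max(met)


# 90 days of made-up data: 100,000 requests per day, 300 failures per day.
good = [99_700] * 90
total = [100_000] * 90

history = achieved_sli(good, total)
print(round(history, 4))     # 0.997
print(suggest_slo(history))  # 0.995
```

Starting from a target the service already meets keeps the first error budget non-empty, so the team can practice the process before tightening the number.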

Start there. Expand as your team gets comfortable with the process.

