DevOpsCraft

SLO/SLI Explained: A Practical Guide for Engineering Teams

2025-12-10 · 5 min read

The problem with "we target 99.9% uptime"

Most teams say they target high availability, but nobody knows what that actually means day-to-day. SLOs fix this by making reliability a concrete, measurable goal.

Key concepts

**SLI (Service Level Indicator)** — a metric that measures reliability

Example: % of requests that succeed (status < 500)

**SLO (Service Level Objective)** — your target for an SLI

Example: 99.5% of requests succeed over a 30-day window

**Error budget** — how much failure you're allowed

Example: 0.5% of requests can fail, which is equivalent to about 3.6 hours of full downtime in a 30-day window
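To make the arithmetic behind that example concrete, here is a minimal Python sketch. The function name `error_budget` and the traffic figure of 10 million requests are illustrative assumptions, not part of any SLO tooling.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta, total_requests: int) -> tuple[timedelta, int]:
    """Return the budget as (equivalent full downtime, allowed failed requests)."""
    budget_fraction = 1.0 - slo  # e.g. 0.005 for a 99.5% SLO
    return window * budget_fraction, round(total_requests * budget_fraction)


# Hypothetical traffic volume: 10 million requests in the 30-day window.
downtime, failures = error_budget(0.995, timedelta(days=30), total_requests=10_000_000)
print(downtime)  # 3:36:00
print(failures)  # 50000
```

The same budget can be read either way: as minutes of total outage or as a count of failed requests spread across the window.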

Choose the right SLIs

Not every metric is a good SLI. Good SLIs measure what users actually experience:

| Service type | Good SLI |
|---|---|
| API | Request success rate (non-5xx / total) |
| API | P99 latency < 500ms |
| Batch job | Job completion rate |
| Data pipeline | Freshness (last update < 1 hour ago) |
| Storage | Read success rate |
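Two of the rows above, sketched in Python under stated assumptions. The latency row is expressed here as the share of requests faster than a threshold, which is a common alternative to tracking a percentile directly. The function names and sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def latency_sli(latencies_ms: list[float], threshold_ms: float = 500.0) -> float:
    """Fraction of requests served faster than the threshold."""
    if not latencies_ms:
        return 1.0  # assumption: no traffic counts as meeting the objective
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


def is_fresh(last_update: datetime, now: datetime,
             max_age: timedelta = timedelta(hours=1)) -> bool:
    """Freshness check: was the data updated less than max_age ago?"""
    return now - last_update < max_age


# Made-up latency samples in milliseconds; two of ten exceed 500 ms.
sample = [120.0, 90.0, 480.0, 510.0, 200.0, 75.0, 1300.0, 60.0, 95.0, 110.0]
print(latency_sli(sample))  # 0.8

now = datetime(2025, 12, 10, 12, 0, tzinfo=timezone.utc)
print(is_fresh(datetime(2025, 12, 10, 11, 30, tzinfo=timezone.utc), now))  # True
```

Both functions return something a user would notice: slow responses and stale data, rather than CPU or memory numbers.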

PromQL for a basic SLI

# Success rate over a 5-minute window
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
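If the query feels opaque, this Python sketch shows roughly what it computes: the increase in non-5xx requests divided by the increase in all requests. It is a simplification. The snapshot dictionaries are invented, and unlike Prometheus's `rate()` it does not handle counter resets or extrapolation.

```python
def success_rate(start: dict[str, int], end: dict[str, int]) -> float:
    """Non-5xx increase divided by total increase between two counter snapshots.

    Snapshots map an HTTP status code (as a string) to a cumulative request count.
    """
    deltas = {code: end[code] - start.get(code, 0) for code in end}
    total = sum(deltas.values())
    if total == 0:
        return 1.0  # assumption: no traffic means nothing failed
    good = sum(n for code, n in deltas.items() if not code.startswith("5"))
    return good / total


# Hypothetical counter values at the start and end of a 5-minute window.
start = {"200": 10_000, "404": 120, "500": 30, "503": 5}
end = {"200": 10_950, "404": 160, "500": 38, "503": 7}
print(success_rate(start, end))  # 0.99
```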

SLO-based alerting (the right way)

Instead of alerting on the SLI directly (too noisy), alert on **error budget burn rate**:

# Alert if burning error budget 14x faster than sustainable
# (Will exhaust 30-day budget in ~50 hours at this rate)
- alert: HighErrorBudgetBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > 14 * (1 - 0.995)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High error budget burn rate"
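The numbers in that rule can be checked with a few lines of Python. This is a sketch of the arithmetic only; the helper names are made up for illustration.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole window's budget is gone at a constant burn rate."""
    return window_days * 24 / rate


# The alert threshold 14 * (1 - 0.995) is an error ratio of 7%.
print(round(burn_rate(0.07, 0.995), 2))   # 14.0
print(round(hours_to_exhaustion(14), 1))  # 51.4
```

A burn rate of 1 means the budget runs out exactly at the end of the 30-day window. At 14 it is gone in a little over two days, which is why this level pages someone.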

Error budget in practice

When your error budget is healthy (plenty remaining):

  • Prioritize shipping velocity over reliability work
  • Experiments and risky deploys are fine

When your error budget is low or exhausted:

  • Freeze non-critical deploys
  • Focus on reliability improvements
  • Investigate incident root causes

This creates a natural feedback loop: teams that deploy recklessly burn their error budget and lose the ability to ship new features. Teams that are too conservative end up with unspent budget, which is a signal that they can afford to ship faster.
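One way to turn that policy into something a deploy pipeline could check, sketched in Python. The 25% "low budget" threshold and the function names are assumptions for illustration; pick a cutoff that fits your team.

```python
def budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed = total * (1.0 - slo)
    if allowed == 0:
        return 0.0  # a 100% SLO or zero traffic leaves nothing to spend
    return 1.0 - failed / allowed


def deploy_policy(remaining: float, low_threshold: float = 0.25) -> str:
    """Map remaining budget to a policy. low_threshold is a hypothetical cutoff."""
    if remaining < low_threshold:
        return "freeze non-critical deploys"
    return "ship"


# Hypothetical month so far: 2 million requests against a 99.5% SLO.
print(deploy_policy(budget_remaining(0.995, 2_000_000, 4_000)))  # ship
print(deploy_policy(budget_remaining(0.995, 2_000_000, 9_000)))  # freeze non-critical deploys
```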

Start simple

You don't need a perfect SLO framework from day one:

1. Pick 1–2 critical user journeys
2. Define one SLI per journey
3. Set a realistic SLO (look at last 90 days of actual data)
4. Build a Grafana dashboard showing current SLO compliance
5. Add one burn rate alert

Start there. Expand as your team gets comfortable with the process.
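For step 3, a rough Python sketch of "look at the last 90 days": compute what you actually achieved, then choose the strictest target you already met. The candidate ladder and the sample history are hypothetical, and this heuristic is only one reasonable starting point.

```python
from typing import Optional

# Hypothetical ladder of targets to choose from, strictest first.
CANDIDATE_TARGETS = (0.9999, 0.999, 0.995, 0.99, 0.95)


def achieved_ratio(daily_counts: list[tuple[int, int]]) -> float:
    """Success ratio over a list of (total_requests, failed_requests) per day."""
    total = sum(t for t, _ in daily_counts)
    failed = sum(f for _, f in daily_counts)
    return 1.0 if total == 0 else 1.0 - failed / total


def suggest_slo(achieved: float) -> Optional[float]:
    """Strictest candidate target that the historical data already meets."""
    for target in CANDIDATE_TARGETS:
        if achieved >= target:
            return target
    return None  # history is below every candidate; fix reliability first


# Made-up 90 days: 85 quiet days and 5 days with an incident.
history = [(100_000, 200)] * 85 + [(100_000, 3_000)] * 5
print(suggest_slo(achieved_ratio(history)))  # 0.995
```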
