15 Jan 2026
SLOs, SLIs, SLAs and Error Budgets - The Practical Playbook Most Teams Get Wrong
The Core Problem With How Teams Monitor Reliability

Most engineering teams set reliability targets based on what their infrastructure can do. That's backwards.
A 92% CPU spike tells you nothing about whether users are affected. CPU is an infrastructure metric. User happiness is a product metric. Most monitoring setups confuse the two — and the confusion only surfaces when something breaks and nobody can answer a simple question: *are users actually hurting right now?*
Metrics that can't drive decisions during an incident are useless. They're decoration.
The one principle that changes everything: your SLO should reflect user happiness, not infrastructure capability.
The Vocabulary - Get This Right First
These four terms get confused constantly. Here's the distinction that actually matters:
| Metric | What it actually means | Real-world example |
|---|---|---|
| SLI | The metric you measure | "What % of checkout requests succeeded?" |
| SLO | The target for that metric | "99.9% of checkouts succeed over 30 days." |
| SLA | The legally binding contract, backed by financial penalties | "We guarantee 99.5% uptime, or we pay you." |
| Error Budget | Your allowed failure margin | "We can afford 43 minutes of downtime this month." |
The relationship is simple: the SLI is the measurement, the SLO is the goal, the SLA is the contract, and the error budget is the math that connects them to engineering decisions.
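That math is short enough to write down. A minimal sketch of the error-budget calculation, using the 99.9%/30-day figures from the table (the function name and defaults are illustrative, not a standard API):

```python
# Sketch: deriving the allowed downtime from an SLO target.
# The 30-day window and SLO values are the article's examples.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a given SLO over a rolling window."""
    total_minutes = window_days * 24 * 60          # 43,200 for 30 days
    return total_minutes * (1 - slo)               # the failure margin

print(error_budget_minutes(0.999))   # 99.9% SLO -> 43.2 minutes
print(error_budget_minutes(0.995))   # 99.5% SLA -> 216.0 minutes
```

Note the gap: a 99.9% internal SLO leaves ~43 minutes, while a 99.5% external SLA leaves 216 - that difference is the buffer Rule 6 below depends on.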
Adoption: The Three-Gear Model
Trying to map every Critical User Journey and set perfect SLOs on day one fails. Every time. The correct approach is incremental:
| Gear | The goal | What to actually do |
|---|---|---|
| First Gear: Visibility | Measure one thing correctly | Pick your highest-impact user journey (e.g., checkout). Measure the SLI at the load balancer: Good Events / Total Valid Events. Don't set alerts yet - just watch the data for a few weeks and establish a baseline. |
| Second Gear: Context | Set a target and track the budget | Set a realistic SLO slightly below your current baseline. Build one dashboard showing error budget remaining. Set one alert: page only if you burn 10% of the budget in an hour. |
| Third Gear: Action | Use the budget to make decisions | Error budget drives go/no-go deployment calls. Internal SLOs stay strictly separated from external SLAs. The budget becomes the rulebook, not opinions. |
Start in First Gear. Most teams skip straight to Third and end up with targets nobody trusts.
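The Second Gear alert ("page only if you burn 10% of the budget in an hour") is a burn-rate check, and it can be sketched in a few lines. The event counts below are hypothetical inputs; the 10%/30-day thresholds are the ones from the table:

```python
# Sketch of the Second Gear burn-rate alert: page only when the last
# hour's failures consumed more than 10% of the 30-day error budget.

def should_page(bad_events: int, total_events: int,
                slo: float = 0.999, budget_fraction: float = 0.10) -> bool:
    """Burn rate = observed error rate / allowed error rate.

    A 30-day window has 720 hours, so burning 10% of the budget in one
    hour means a burn rate above 0.10 * 720 = 72x the allowed rate.
    """
    if total_events == 0:
        return False                       # no traffic, nothing to page on
    allowed_error_rate = 1 - slo           # 0.1% for a 99.9% SLO
    observed_error_rate = bad_events / total_events
    burn_rate = observed_error_rate / allowed_error_rate
    return burn_rate > budget_fraction * 720

print(should_page(bad_events=80, total_events=1000))  # 80x burn -> True
print(should_page(bad_events=5, total_events=1000))   # 5x burn -> False
```

The design choice worth noticing: the alert fires on budget consumption rate, not on a raw error-rate threshold, so a brief blip that barely dents the budget stays quiet.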
Six Rules for Pragmatic Reliability
1. Start With Users, Not Infrastructure
Don't pick SLIs based on what your monitoring tool measures by default. Pick them based on what users actually do.
Map your 3–5 Critical User Journeys and rank them by business impact. A failing checkout costs money immediately. A slow "forgot password" flow is an annoyance. These are not the same priority, and your SLOs should reflect that.
The common mistake: setting an SLO on CPU utilization. CPU doesn't tell you whether users are happy. Request success rate does.
2. Pick SLIs That Correlate With Pain
An SLI is only useful if it drops when users are having a bad time. The universal formula:
SLI = (Good Events / Total Valid Events) × 100
Measure as close to the user as your engineering capacity allows. Load balancer metrics are a pragmatic starting point - they catch most issues without requiring deep application-level instrumentation.
The diagnostic test: look at your SLI on a day you know users were hurting (via support tickets or social media). If the SLI stayed flat during a known outage, you're measuring the wrong thing. Go find a better signal.
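The formula above is one filter and one division. A minimal sketch over load-balancer request logs - the log shape is hypothetical, and treating client-side 4xx as "invalid" (excluded from the denominator) is one common convention, not the only one:

```python
# Sketch: SLI = (Good Events / Total Valid Events) x 100, computed from
# load-balancer logs. Each log entry is assumed to carry an HTTP status.

def checkout_sli(requests: list[dict]) -> float:
    """Success-rate SLI as a percentage.

    Client errors (4xx) are excluded from the denominator so that user
    mistakes don't count against the service; 5xx responses count as bad.
    """
    valid = [r for r in requests if r["status"] < 400 or r["status"] >= 500]
    good = [r for r in valid if r["status"] < 500]
    return 100.0 * len(good) / len(valid) if valid else 100.0

logs = [{"status": 200}] * 997 + [{"status": 404}] * 10 + [{"status": 500}] * 3
print(round(checkout_sli(logs), 2))  # 997 good / 1000 valid -> 99.7
```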
3. Set SLOs Lower Than You Think
A common mistake: setting your SLO at your current performance level. If you're hitting 99.99% and users can't distinguish 99.9% from 99.99%, the tighter target just burns engineering hours for zero user value.
Your SLO shouldn't reflect what your infrastructure can achieve. It should reflect what your users need. Those are almost never the same number. Defend the number that matters - not the one that looks impressive in a slide deck.
4. The Multi-Service Reality: Who Burned Our Budget?
In any microservice architecture, a downstream dependency will eventually burn your error budget. This is not a question of if - it's when.
Your SLI should measure your boundary, not your dependency's health. And your service needs a degradation path. If a dependency fails, fail open with partial functionality instead of returning a 500. Serve what you can, flag what's missing, escalate to a human.
If a downstream service is a hard dependency, establish an internal OLA (Operational Level Agreement) with that team. Without one, you're accountable for reliability you don't control - and that's a structural problem, not a people problem.
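"Fail open with partial functionality" looks like this in practice. The sketch below uses a hypothetical recommendations service as the soft dependency; the function and client names are illustrative:

```python
# Sketch of a degradation path: if a non-critical dependency fails,
# serve the core response, flag what's missing, and let monitoring
# escalate - instead of returning a 500 for the whole page.

def render_checkout_page(cart, recommendations_client):
    page = {"cart": cart, "recommendations": [], "degraded": False}
    try:
        page["recommendations"] = recommendations_client.fetch(cart)
    except Exception:
        # Dependency is down: checkout itself still works.
        page["degraded"] = True   # flag for dashboards / human escalation
    return page

class DownClient:
    """Stand-in for a dependency that is currently failing."""
    def fetch(self, cart):
        raise TimeoutError("recommendations service unavailable")

result = render_checkout_page({"items": 2}, DownClient())
print(result)  # {'cart': {'items': 2}, 'recommendations': [], 'degraded': True}
```

The key property: the dependency's outage never reaches your SLI at the boundary, because the checkout request still succeeds.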
5. Treat Error Budgets as Spendable Currency
Error budget = 100% minus your SLO target. With a 99.9% SLO over a 30-day window, that's roughly 43 minutes of allowed downtime (43,200 minutes × 0.001 = 43.2).
Error budgets turn "ship fast vs. stay stable" from an opinion fight into a math problem:
| Status | Budget remaining | The rule for engineering & product |
|---|---|---|
| Healthy | > 50% | Ship aggressively. Merge PRs, run experiments, do migrations. You can afford the risk. |
| Shrinking | 10% – 50% | Slow down. Hold risky deployments. Prioritize stability and minor fixes. |
| Exhausted | < 10% | Stop feature work. Reliability is the #1 priority until the 30-day window resets. |
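The table above is deliberately mechanical, which means it can be encoded directly. A sketch with the table's thresholds (the label strings are illustrative):

```python
# The status table as a decision function: thresholds come straight
# from the table; the returned labels are illustrative.

def budget_policy(budget_remaining: float) -> str:
    """Map the fraction of error budget left (0.0-1.0) to a shipping rule."""
    if budget_remaining > 0.50:
        return "healthy: ship aggressively"
    if budget_remaining >= 0.10:
        return "shrinking: hold risky deployments"
    return "exhausted: stop feature work"

print(budget_policy(0.80))  # healthy: ship aggressively
print(budget_policy(0.30))  # shrinking: hold risky deployments
print(budget_policy(0.05))  # exhausted: stop feature work
```

The point of writing it as code: a go/no-go deployment check can call this in CI, so the number - not a meeting - decides.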
The number decides. Not the loudest voice in the room.
6. Keep SLAs Strictly Separate
An SLO is an internal alarm. An SLA is a financial penalty. If they're the same number, you'll be writing refund checks before your team has time to fix anything.
The correct setup: maintain a buffer. A 99.9% internal SLO with a 99.5% external SLA gives your team room to detect and resolve problems before they become contractual breaches. That gap isn't slack - it's survival margin.
The Point of All This
SLOs aren't about modeling perfect math. They're about decision speed.
One dashboard. One number. One clear answer: is this a wake-the-team emergency or a file-a-ticket-tomorrow annoyance? Every minute spent interpreting ambiguous metrics during an incident is a minute users are waiting.
Get the SLI right. Set the SLO at a level users actually care about. Spend the error budget deliberately. Keep the SLA safely below your internal target.
That's the entire system. Everything else is refinement.

Written by Yash Aggarwal
