15 Jan 2026

SLOs, SLIs, SLAs and Error Budgets - The Practical Playbook Most Teams Get Wrong

The Core Problem With How Teams Monitor Reliability

Most engineering teams set reliability targets based on what their infrastructure can do. That's backwards.

A 92% CPU spike tells you nothing about whether users are affected. CPU is an infrastructure metric. User happiness is a product metric. Most monitoring setups confuse the two - and the confusion only surfaces when something breaks and nobody can answer a simple question: *are users actually hurting right now?*

Metrics that can't drive decisions during an incident are useless. They're decoration.

The one principle that changes everything: your SLO should reflect user happiness, not infrastructure capability.


The Vocabulary - Get This Right First

These four terms get confused constantly. Here's the distinction that actually matters:


| Term | What it actually means | Real-world example |
| --- | --- | --- |
| SLI | The metric you measure | "What % of checkout requests succeeded?" |
| SLO | The target for that metric | "99.9% of checkouts succeed over 30 days." |
| SLA | The legally binding contract that guarantees a minimum level of reliability, backed by financial penalties | "We guarantee 99.5% uptime, or we pay you." |
| Error Budget | Your allowed failure margin | "We can afford 43 minutes of downtime this month." |


The relationship is simple: the SLI is the measurement, the SLO is the goal, the SLA is the contract, and the error budget is the math that connects them to engineering decisions.


Adoption: The Three-Gear Model

Trying to map every Critical User Journey and set perfect SLOs on day one fails. Every time. The correct approach is incremental:


| Gear | The Goal | What We Actually Did |
| --- | --- | --- |
| First Gear: Visibility | Measure one thing correctly | Pick your highest-impact user journey (e.g., Checkout). Measure the SLI at the load balancer: Good Events / Total Valid Events. Don't set alerts yet - just watch the data for a few weeks and establish a baseline. |
| Second Gear: Context | Set a target and track the budget | Set a realistic SLO slightly below your current baseline. Build one dashboard showing error budget remaining. Set one alert: page only if you burn 10% of the budget in an hour. |
| Third Gear: Action | Use the budget to make decisions | Error budget drives go/no-go deployment calls. Internal SLOs stay strictly separated from external SLAs. The budget becomes the rulebook, not opinions. |


Start in First Gear. Most teams skip straight to Third and end up with targets nobody trusts.
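The Second Gear alert ("page only if you burn 10% of the budget in an hour") can be sketched roughly as follows. This is a minimal sketch, not a production alerting rule: it assumes you already have a count of failed requests for the trailing hour and an estimate of total traffic over the 30-day window. The function and parameter names are hypothetical.

```python
def budget_fraction_burned(bad_events: int, expected_window_events: int, slo: float) -> float:
    """Fraction of the window's error budget consumed by `bad_events`.

    `expected_window_events` is an assumed traffic estimate for the whole
    SLO window (e.g. 30 days); `slo` is a ratio such as 0.999.
    """
    allowed_bad = expected_window_events * (1.0 - slo)
    if allowed_bad <= 0:
        raise ValueError("an SLO of 100% leaves no error budget to burn")
    return bad_events / allowed_bad


def should_page(bad_last_hour: int, expected_window_events: int,
                slo: float = 0.999, threshold: float = 0.10) -> bool:
    """Page only when the last hour alone burned at least `threshold` of the budget."""
    return budget_fraction_burned(bad_last_hour, expected_window_events, slo) >= threshold


if __name__ == "__main__":
    # With 10M expected requests and a 99.9% SLO, about 10,000 failures are allowed.
    print(should_page(bad_last_hour=1500, expected_window_events=10_000_000))
    print(should_page(bad_last_hour=500, expected_window_events=10_000_000))
```

A single threshold like this keeps the pager quiet during slow, low-grade burn, which is the point of starting with one alert rather than many.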


Six Rules for Pragmatic Reliability


1. Start With Users, Not Infrastructure

Don't pick SLIs based on what your monitoring tool measures by default. Pick them based on what users actually do.

Map your 3–5 Critical User Journeys and rank them by business impact. A failing checkout costs money immediately. A slow "forgot password" flow is an annoyance. These are not the same priority, and your SLOs should reflect that.

The common mistake: setting an SLO on CPU utilization. CPU doesn't tell you whether users are happy. Request success rate does.



2. Pick SLIs That Correlate With Pain

An SLI is only useful if it drops when users are having a bad time. The universal formula:

SLI = (Good Events / Total Valid Events) × 100

Measure as close to the user as your engineering capacity allows. Load balancer metrics are a pragmatic starting point - they catch most issues without requiring deep application-level instrumentation.
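As a rough illustration of the formula, here is a small sketch that computes the SLI from HTTP status codes seen at the load balancer. What counts as a "valid" or "good" event is a choice each team has to make; this sketch assumes 4xx responses are client mistakes and excludes them from valid events, and treats any remaining non-5xx response as good. The function name is hypothetical.

```python
from typing import Iterable, Optional


def sli_percent(statuses: Iterable[int]) -> Optional[float]:
    """SLI = (good events / total valid events) x 100.

    Assumptions (adjust to your own definition):
    - 4xx responses are excluded from valid events,
    - a valid event is good when its status is below 500.
    Returns None when there were no valid events to measure.
    """
    valid = [s for s in statuses if not 400 <= s < 500]
    if not valid:
        return None
    good = sum(1 for s in valid if s < 500)
    return good / len(valid) * 100


if __name__ == "__main__":
    # 997 successes, 3 server errors, 50 client errors (ignored as invalid)
    sample = [200] * 997 + [500] * 3 + [404] * 50
    print(sli_percent(sample))
```

Returning `None` for an empty window avoids reporting a misleading 0% or 100% when there was simply no traffic.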

The diagnostic test: look at your SLI on a day you know users were hurting (via support tickets or social media). If the SLI stayed flat during a known outage, you're measuring the wrong thing. Go find a better signal.
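One way to run that diagnostic test is to compare the SLI during the known-bad hours against the rest of the day. The sketch below assumes you have hourly SLI values and a set of hours you know were bad from support tickets. The `min_drop` threshold of 0.5 percentage points is an arbitrary illustrative value, and all names are hypothetical.

```python
from statistics import mean
from typing import Dict, Set


def sli_reflects_incident(hourly_sli: Dict[int, float],
                          incident_hours: Set[int],
                          min_drop: float = 0.5) -> bool:
    """True if the SLI was visibly lower during the known incident.

    `hourly_sli` maps hour-of-day to SLI percent. `min_drop` is the
    smallest drop (in percentage points) treated as a real signal;
    0.5 is an assumed default, not a recommendation.
    """
    during = [v for h, v in hourly_sli.items() if h in incident_hours]
    normal = [v for h, v in hourly_sli.items() if h not in incident_hours]
    if not during or not normal:
        raise ValueError("need samples both inside and outside the incident")
    return mean(normal) - mean(during) >= min_drop


if __name__ == "__main__":
    dipped = {0: 99.9, 1: 99.9, 2: 97.0, 3: 99.9}
    flat = {0: 99.9, 1: 99.9, 2: 99.9, 3: 99.9}
    print(sli_reflects_incident(dipped, {2}))  # SLI moved with user pain
    print(sli_reflects_incident(flat, {2}))    # SLI stayed flat: wrong signal
```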



3. Set SLOs Lower Than You Think

A common mistake: setting your SLO at your current performance level. If you're hitting 99.99% and users can't distinguish 99.9% from 99.99%, the tighter target just burns engineering hours for zero user value.

Your SLO shouldn't reflect what your infrastructure can achieve. It should reflect what your users need. Those are almost never the same number. Defend the number that matters - not the one that looks impressive in a slide deck.



4. The Multi-Service Reality: Who Burned Our Budget?

In any microservice architecture, a downstream dependency will eventually burn your error budget. This is not a question of if - it's when.

Your SLI should measure your boundary, not your dependency's health. And your service needs a degradation path. If a dependency fails, degrade gracefully: return partial functionality instead of a 500. Serve what you can, flag what's missing, escalate to a human.
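A degradation path might look something like the sketch below. It assumes a checkout page with one hard dependency (the cart) and one optional dependency (recommendations). Both fetchers are passed in as plain callables, and every name here is hypothetical.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger("checkout")


def build_checkout_page(fetch_cart: Callable[[], Dict[str, Any]],
                        fetch_recommendations: Callable[[], List[str]]) -> Dict[str, Any]:
    """Serve what we can, flag what is missing, and leave a trail for a human."""
    # Hard dependency: without the cart there is nothing to serve,
    # so a failure here is allowed to propagate.
    cart = fetch_cart()

    degraded: List[str] = []
    try:
        recommendations = fetch_recommendations()
    except Exception:
        # Optional dependency: fall back to an empty list rather than a 500.
        logger.warning("recommendations unavailable, serving degraded page", exc_info=True)
        recommendations = []
        degraded.append("recommendations")

    return {"cart": cart, "recommendations": recommendations, "degraded": degraded}


if __name__ == "__main__":
    def broken_recommendations() -> List[str]:
        raise TimeoutError("downstream timed out")

    print(build_checkout_page(lambda: {"items": 2}, broken_recommendations))
```

The `degraded` field lets the caller, and your SLI, distinguish a full response from a partial one instead of hiding the failure.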

If a downstream service is a hard dependency, establish an internal OLA (Operational Level Agreement) with that team. Without one, you're accountable for reliability you don't control - and that's a structural problem, not a people problem.



5. Treat Error Budgets as Spendable Currency

Error budget = 100% minus your SLO target. With a 99.9% SLO over 30 days, that's about 43 minutes of allowed downtime (0.1% of 43,200 minutes is 43.2).
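The arithmetic behind that figure, as a small sketch. It assumes a time-based budget, where the SLO is read as a percentage of minutes in the window. The function name is hypothetical.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Error budget expressed as minutes of full downtime in the window."""
    window_minutes = window_days * 24 * 60          # 30 days -> 43,200 minutes
    budget_fraction = (100.0 - slo_percent) / 100.0  # 99.9% SLO -> 0.1% budget
    return window_minutes * budget_fraction


if __name__ == "__main__":
    for target in (99.0, 99.5, 99.9, 99.99):
        print(f"{target}% over 30 days: {allowed_downtime_minutes(target):.1f} minutes")
```

Each extra nine cuts the budget by a factor of ten, which is why the choice of target matters more than it looks on a slide.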

Error budgets turn "ship fast vs. stay stable" from an opinion fight into a math problem:


| Status | Budget Remaining | The Rule for Engineering & Product |
| --- | --- | --- |
| Healthy | > 50% | Ship aggressively. Merge PRs, run experiments, do migrations. You can afford the risk. |
| Shrinking | 10% – 50% | Slow down. Hold risky deployments. Prioritize stability and minor fixes. |
| Exhausted | < 10% | Stop feature work. Reliability is the #1 priority until the 30-day window resets. |


The number decides. Not the loudest voice in the room.
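The policy is simple enough to encode directly, for example as a gate in a deployment pipeline. This is a sketch only: the thresholds come from the table above, while the function names and the idea of wiring it into a pipeline are assumptions.

```python
def budget_remaining_percent(downtime_used_minutes: float, budget_minutes: float) -> float:
    """Share of the error budget still unspent, clamped at 0%."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    remaining = (1.0 - downtime_used_minutes / budget_minutes) * 100.0
    return max(0.0, remaining)


def deployment_policy(remaining_percent: float) -> str:
    """Map remaining budget to the status used in the policy table."""
    if remaining_percent > 50:
        return "healthy"      # ship aggressively
    if remaining_percent >= 10:
        return "shrinking"    # hold risky deployments
    return "exhausted"        # stop feature work until the window resets


if __name__ == "__main__":
    remaining = budget_remaining_percent(downtime_used_minutes=30.0, budget_minutes=43.2)
    print(f"{remaining:.1f}% left -> {deployment_policy(remaining)}")
```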


6. Keep SLAs Strictly Separate

An SLO is an internal alarm. An SLA is a financial penalty. If they're the same number, you'll be writing refund checks before your team has time to fix anything.

The correct setup: maintain a buffer. A 99.9% internal SLO with a 99.5% external SLA gives your team room to detect and resolve problems before they become contractual breaches. That gap isn't slack - it's survival margin.
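To put a number on that margin, here is a sketch under two assumptions: both targets are measured over the same 30-day window, and both are expressed as time-based availability. The function name is hypothetical.

```python
def sla_buffer_minutes(slo_percent: float, sla_percent: float, window_days: int = 30) -> float:
    """Extra downtime the team can absorb after breaching the SLO
    but before breaching the SLA, in minutes."""
    if sla_percent > slo_percent:
        raise ValueError("the external SLA should not be stricter than the internal SLO")
    window_minutes = window_days * 24 * 60
    return window_minutes * (slo_percent - sla_percent) / 100.0


if __name__ == "__main__":
    buffer = sla_buffer_minutes(slo_percent=99.9, sla_percent=99.5)
    print(f"Buffer between SLO and SLA: {buffer:.1f} minutes")
```

With these targets the internal alarm fires after roughly 43 minutes of downtime, leaving a bit under three more hours before the contractual line is crossed.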


The Point of All This

SLOs aren't about modeling perfect math. They're about decision speed.

One dashboard. One number. One clear answer: is this a wake-the-team emergency or a file-a-ticket-tomorrow annoyance? Every minute spent interpreting ambiguous metrics during an incident is a minute users are waiting.

Get the SLI right. Set the SLO at a level users actually care about. Spend the error budget deliberately. Keep the SLA safely below your internal target.

That's the entire system. Everything else is refinement.

Written by

Yash Aggarwal

Try Steadwing now! Your Autonomous On-Call Engineer

Reducing MTTR so your team can stay focused on building.

Stop firefighting.
Start shipping.

Free to start. No credit card.
Connect your stack in 5 minutes.

501 Folsom Street, San Francisco, CA 94105

Steadwing © 2026. All rights reserved.
