SLO vs SLA: a plain-English comparison for teams

TL;DR. An SLA is a contractual promise to a customer with money attached. An SLO is your internal target, usually tighter than the SLA, that you steer the system by. An SLI is the actual measurement under both. Pick numbers you can hit ninety-nine times out of a hundred and put them somewhere your team can find on a Tuesday afternoon.

Most teams write the SLA first, the SLO never, and the SLI in the form of a Slack message that says “is the dashboard broken or is the site?” The honest answer is that SLO vs SLA is where reliability discipline starts, and you should know which one you’re talking about before anyone signs anything. (Yes, we measured it. Yes, we have opinions.) The rest of this post is the difference, in plain English, with the math you need and the runbook you don’t yet have.

SLA, SLO, SLI in one breath

Three letters, three different audiences. Get the audiences right and the rest follows.

SLA — Service Level Agreement. External. Contractual. Lives between you and your customer, drafted by people whose email signatures include a job title with the word counsel in it. “99.5% monthly availability or you get a 10% credit” is an SLA sentence. It has consequences in dollars.
SLO — Service Level Objective. Internal. A target your engineering team picks and steers by. “99.9% of API responses under 300ms over a 28-day rolling window” is an SLO sentence. The consequence of missing it is internal — a freeze on shipping risky changes, a postmortem, a conversation about what to roll back.
SLI — Service Level Indicator. The measurement under both. “The fraction of requests in the last 28 days that returned 2xx within 300ms” is an SLI definition. It’s a number a query against your telemetry can produce; everything else is built on top.

The order to think about them is SLI first, SLO second, SLA third. Pick what you can measure. Decide what you want it to look like. Then promise the customer something less ambitious than that. (Most teams do these in reverse order, which is how you end up promising 99.99% of an event you can’t even count.)
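To make "pick what you can measure" concrete, here is a minimal sketch of the SLI defined above: the fraction of requests that returned 2xx in under 300ms. The `Request` record and its field names are assumptions for illustration; in practice the same count would come from a query against your own telemetry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One request record; the field names are hypothetical."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to respond, in milliseconds


def good_fraction(requests, max_latency_ms=300.0):
    """Return the fraction of requests that returned 2xx in under the latency limit."""
    if not requests:
        return None  # no traffic in the window: the SLI is undefined, not 100%
    good = sum(
        1 for r in requests
        if 200 <= r.status < 300 and r.latency_ms < max_latency_ms
    )
    return good / len(requests)


# A tiny made-up window of four requests.
window = [
    Request(200, 120.0),
    Request(200, 450.0),  # succeeded, but too slow to count as good
    Request(500, 80.0),   # fast, but failed
    Request(204, 299.0),
]
sli = good_fraction(window)
print(f"SLI: {sli:.2%}")
```

Returning `None` for an empty window is a design choice in this sketch: a window with no traffic says nothing about reliability, so it should not quietly read as a perfect score.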

Your SLO should run tighter than your SLA

The textbook answer for why the SLO sits above the SLA is “because the SLO is the thing your team optimises for, and the SLA is what’s safe to promise after a bad week.” The practical answer is that the gap is your buffer: the room to miss the SLO without owing anyone money.

A common shape, lifted straight from the Google SRE book’s chapter on SLOs:

SLA: 99.5% monthly availability.
SLO: 99.9% monthly availability.
Buffer: 0.4 percentage points per month, which in a 30-day month works out to roughly 173 minutes (just under three hours) of extra unavailability between missing the SLO and breaching the SLA.

Why the asymmetry? Because hitting your SLO every month gives you headroom; hitting your SLA every month means you’re one unlucky deploy away from a credit. Headroom is the “keyboard-first” of operations — same idea, different domain. You don’t want to be the team whose SLA equals their SLO equals the actual performance, because that team has no slack and no joy.

A good rule of thumb: if your SLO and SLA are the same number, one of them is wrong.

What a real SLA and SLO pair looks like

Here’s an example pair for a hypothetical SaaS API. The numbers are illustrative — the shape is the lesson.

Metric	SLI (definition)	SLO (internal target)	SLA (customer promise)
Availability	Fraction of requests returning 2xx in 28d	99.9%	99.5%
Latency	Fraction of 2xx requests under 300ms in 28d	99%	95%
Time to first byte	p95 over 28d	< 200ms	not promised
Support response	First human reply within business hours	< 1h	< 4h

Three things to notice:

The SLI is a fraction, not a duration. “99% of requests under 300ms” is measurable. “The site is fast” is a feeling. Reliability discipline is the long, slow conversion of feelings into fractions.
The time-to-first-byte SLO has no SLA partner. You can steer by an internal target without making it a customer promise. Most internal SLOs should have no SLA counterpart.
The support-response SLA is a different shape from the technical ones. It still belongs in the same family — a measurable promise with consequences for missing it.
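The table can also be kept as data, which makes the "SLO tighter than SLA" rule checkable. The sketch below is one possible shape, not a standard: the `ServiceLevel` type, the `check_pair` helper, and the "Checkout" row are all made up for illustration, and every target is expressed as a percentage of good events.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServiceLevel:
    """One row of the table; targets are percentages of good events."""
    metric: str
    slo: float                   # internal target, e.g. 99.9
    sla: Optional[float] = None  # customer promise, or None if not promised


def check_pair(level):
    """Return a warning string if the SLO leaves no room above the SLA, else None."""
    if level.sla is None:
        return None  # internal-only SLO: nothing to compare against
    if level.slo <= level.sla:
        return f"{level.metric}: SLO {level.slo}% is not tighter than SLA {level.sla}%"
    return None


levels = [
    ServiceLevel("Availability", slo=99.9, sla=99.5),
    ServiceLevel("Latency", slo=99.0, sla=95.0),
    ServiceLevel("Time to first byte", slo=95.0),  # p95 target, no SLA partner
    ServiceLevel("Checkout", slo=99.5, sla=99.5),  # hypothetical row that should be flagged
]
warnings = [w for w in map(check_pair, levels) if w is not None]
for w in warnings:
    print(w)
```

Only the made-up "Checkout" row is flagged; the row with no SLA passes because there is no promise to compare against.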

The error budget is the bridge between them

The error budget is the difference between your SLO and 100%. For a 99.9% availability SLO over a 30-day month, you have 0.1% of 43,200 minutes — roughly 43 minutes — of unavailability before you’ve burned the budget. Below the SLO and above the SLA, you’re in the red zone: technically still keeping the contract, but mortgaging next month’s stability.
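The arithmetic is short enough to write down. This is a minimal sketch assuming a time-based availability SLO and a fixed window length; the function names are illustrative, not from any library.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of unavailability a given availability SLO allows over the window."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    return (100 - slo_percent) / 100 * window_minutes


def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent; negative means the SLO is blown."""
    budget = error_budget_minutes(slo_percent, window_days)
    if budget == 0:
        raise ValueError("a 100% SLO has no error budget")
    return 1 - downtime_minutes / budget


# 99.9% over a 30-day month: 0.1% of 43,200 minutes.
print(round(error_budget_minutes(99.9), 1))
# After 10.8 minutes of downtime, three quarters of the budget is left.
print(round(budget_remaining(99.9, 10.8), 2))
```

The values are rounded for display because floating-point subtraction leaves a tiny error in the last digits; the underlying numbers are 43.2 minutes and 0.75.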

What teams do with the error budget separates aspirational reliability from operational reliability:

Spend it on velocity. If the budget is healthy, ship the risky migration this sprint. The point of a budget is to spend it.
Freeze when it’s gone. If the budget is exhausted, no new feature flags, no schema migrations, no Friday deploys. The team’s job is to earn the budget back before resuming velocity.
Adjust when it’s permanently underspent. If you’re burning only 5% of your budget every quarter, your SLO is looser than what you actually deliver, and you’re either over-investing in reliability or leaving velocity on the table. Tighten the target or spend the slack.
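The three moves above can be sketched as a single decision function. The thresholds are assumptions: the 5% cut-off is borrowed from the example in the text, and a real policy would look at several review periods before calling a budget permanently underspent.

```python
def budget_policy(spent_fraction, underspent_below=0.05):
    """Map the share of error budget spent in a review period to one of the three moves."""
    if spent_fraction < 0:
        raise ValueError("spent_fraction cannot be negative")
    if spent_fraction >= 1.0:
        return "freeze"         # budget gone: stop risky changes until it recovers
    if spent_fraction <= underspent_below:
        return "review-target"  # barely touched: the SLO may be set too loose
    return "ship"               # healthy budget: spend it on velocity


for spent in (0.4, 1.2, 0.03):
    print(spent, budget_policy(spent))
```

The labels `freeze`, `review-target`, and `ship` are placeholders; the point is that the policy is agreed in advance and applied mechanically, rather than argued during an incident.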

Error budgets are the part of SRE practice most teams skip, because they require everyone (engineering, product, leadership) to agree what enough reliability looks like before the customer is upset. That’s the actual hard part. The math is easy. The Google SRE workbook chapter on implementing SLOs walks through the exact mechanics if you want a longer treatment than fits here.

Where SLOs go to rot

The two things every wiki gets wrong are waiting and clicking. Both apply, with depressing accuracy, to where most SLO documents end up living. The page is a Confluence doc nobody can find on the first try. By the time someone has clicked through Spaces → Engineering → SRE → Reliability → 2024 → Q3, the on-call is back to triaging the actual incident and the SLO is decorative.

The fix isn’t a different doc shape. The fix is that the SLO has to be reachable in the same number of keystrokes as the on-call’s terminal. “Sub-second loads, keyboard-first” is not an SRE talking point, but it’s the same shape: the wiki is either part of the response or it isn’t, and the second a wiki takes to load is the second the on-call doesn’t have.

Concrete moves that work:

One short page per service, named after the service. Not a hierarchy. Not nested folders. “Pages load in 50–150ms depending on your network” is the receipt; what it buys you is that nobody dreads opening the SLO doc.
The SLA, the SLO, and the current burn rate at the top. Every other detail below the fold.
A single owner per SLO. “The team” doesn’t update the number. “Maple” does.
Last-reviewed and next-review dates. SLOs that aren’t reviewed quarterly become folklore.
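The owner and review-date rules are easy to check mechanically. Here is a minimal sketch, assuming each SLO page carries a named owner and a last-reviewed date; the `SloPage` type, the `billing-api` service name, and the 90-day stand-in for "quarterly" are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class SloPage:
    service: str
    owner: str           # a named person, not "the team"
    last_reviewed: date


def review_overdue(page, today, max_age_days=90):
    """True if the page has gone longer than roughly a quarter without a review."""
    return today - page.last_reviewed > timedelta(days=max_age_days)


page = SloPage(service="billing-api", owner="Maple", last_reviewed=date(2024, 1, 15))
print(review_overdue(page, today=date(2024, 3, 1)))  # 46 days since review: False
print(review_overdue(page, today=date(2024, 6, 1)))  # 138 days since review: True
```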

If you’re picking a wiki specifically for this kind of operating-discipline work, the knowledge-management best practices we lean on are the same discipline applied to the rest of the team’s documentation. And the importable SLA / SLO Definition template is the same shape we use ourselves — with a Service Runbook and an Incident Postmortem wired in for the moments the SLO actually starts to burn.

When you don’t need an SLO

A solo developer running a side project does not need an SLO. A two-person consulting team whose biggest customer is each other does not need an SLO. A startup pre-revenue with no production users does not need an SLO; they need users.

You start needing an SLO when:

A real customer cares about your uptime in a way that affects whether they pay you.
Your team has more than one on-call rotation, and the “what counts as down” question now has more than one answer.
You’ve had two incidents this quarter that ended with “well, that’s fine, isn’t it?” and nobody could say.

Below that bar, an SLO is a checklist item. Above it, an SLO is the only honest way to talk about reliability without arguing about feelings. Pick the cheapest plan that fits the job — same logic applies to wikis, by the way; our Free tier is real: three users, one space, a hundred pages, no card, no trial timer.

Things people actually ask

What’s the difference between SLO and SLA in one sentence? An SLA is a promise to a customer with money attached; an SLO is an internal target that’s usually tighter than the SLA, so your team has room to miss without breaching the contract.

What is an SLI, and where does it fit? An SLI is the actual measurement — “fraction of requests under 300ms in the last 28 days” — that both the SLO and the SLA are evaluated against. Pick SLIs first, then build the SLO and SLA on top.

Are SLOs legally binding? No. SLOs are internal targets your team manages. Only SLAs are legally binding, and only because they live inside a customer contract negotiated by your legal team.

Why is the SLO usually tighter than the SLA? The gap between them is your buffer: the room to have a bad week without owing service credits. If your SLO equals your SLA, you’ve built no buffer, and one small incident eats both.

Who owns SLOs? A named human, not “the team.” SLOs are reviewed quarterly, the burn rate is watched continuously, and a change to the target is a change to the document with a revision date on it.

What’s an error budget? The portion of bad performance an SLO permits — for a 99.9% SLO over a month, that’s 0.1% of available time, or about 43 minutes. Spend it on velocity when it’s healthy; freeze new work when it’s gone.

How is an SLO different from a KPI? A KPI is a business metric (revenue, NPS, retention). An SLO is a reliability metric tied to a specific user-visible behaviour of a system. They live in different docs and serve different audiences.

Can a small team skip the SLA and only have an SLO? Yes, and many do. SLAs are for customer contracts; if you don’t have a contract that mentions uptime, you don’t need an SLA. You probably still want an SLO so the team can argue about the right thing.

If your SLOs live in a wiki the on-call has to click four folders to reach, SLO vs SLA isn’t the problem you have; finding the doc is. Try the SLA / SLO Definition template on the Free tier and see how short the round-trip from “what’s our SLO again?” to “there it is” gets. If it takes longer than a Tuesday afternoon, write to us; we want to know.